5/21/2014
DATA PREPARATION AND PROFILING:
STRATEGIES, CHALLENGES, AND EXPERIENCESTIM NORRIS AND MARK LUNDGREN
5/21/2014
TODAYS AGENDA
• Introductions•Date Profiling and Readiness•Lessons Learned•Future Direction
5/21/2014
ABOUT THE P20W DATA WAREHOUSE
• Statewide longitudinal data system • De-identified data about people's early childhood, Kindergarten through 12th
grade, higher education and workforce experiences and performances
• Collected and linked from existing state agency data systems.
• It includes data about the kinds of services they receive, programs in which they participate, and their academic performance and program or degree completion.
• It also includes a variety of demographic data so we are able to look at a variety of different groups of people.
• Personally identifiable information, such as names, social security numbers, addresses, and other data which can identify a person as an individual, are not part of the research database.
4
ECEAP students
K-12 students
K-12 teachers
CTC students
Baccalaureate students
National Student Clearinghouse
Workforce
IPEDS Financial
Data Sources
data
Data Management, Governance
Standards, confidentiality, security
Critical questions
Data dictionary, matching,
longitudinal linking, cross-sector derived
elements
P-20/W datasets
ERDC
Research
Data to partner agencies
PCHEES
Collaborative research
Ad-hoc requests (data and research) for
partners and legislature
LEAP
External requests for data
Feedback reports (behalf of agencies)
Output
OFM
5/21/2014
DATA FLOW PROCESS
•Chart of data flow goes here
5/21/2014
DATA SOURCE CHARACTERISTICS
•Over 20 source data feeds•Data systems being developed in
parallel•Some migrated historic data,
some didn’t
5/21/2014
DATA PREPARATION: DATA PROFILING
•Do it early, do it often•Verification of data dictionary•Descriptive statistics•Distinct counts and percentages• Zero, blanks and nulls•Minimum and maximum values• Patterns of data
5/21/2014
DATA PREPARATION: DATA PROFILING
• Dataset validation checks• Counts of records by time, institution
• Values and codes over time• Systematic changes (0,1 to Y,N)• Values defined in data dictionary• Quality of data• Names and identifiers• Data elements
5/21/2014
DATA PREPARATION: DATA PROFILING
• Toolset varied by analyst• SAS• Informatica Data Analyst• Excel
• Goal of understanding the data• Constraints• Completeness, patterns over time• Values of each data element
5/21/2014
DATA PREPARATION: DATA READINESS
•Document and expand results of profiling process•Generate the “goto” resource for
follow-up question•Resource to begin data loading• Content that feeds the data
dictionary
5/21/2014
DATA PREPARATION: DATA READINESS
• Information about:•Data provider•Data file•Data elements
5/21/2014
READINESS CONTENT ITEMSDataset elements Data element
Number of records Name and description
Years Provided Acceptable values
Primary key Data format/length
Business owner and steward Business rules
Update frequency Identity matching flag
Extract process Field/record level data rules
Known issues Security category
Dataset level rules Notes
5/21/2014
DATA READINESS TEMPLATE
• s
5/21/2014
WHAT WE’VE LEARNED
• Customers need to be involved•Dictionaries don’t match data• Educate our analyst on the data,
the customer on the vision of the database•Avoid custom extracts•More time required up front
5/21/2014
TOWARD THE FUTURE
• Empower the provider by offering guidance and tools for profiling•Develop feedback process of data
quality and edits back to customer•Open and transparent
5/21/2014
QUESTIONS?