5/21/2014 d ata p reparation and p rofiling : s trategies, challenges, and experiences t im n orris...

Post on 05-Jan-2016

213 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

5/21/2014

DATA PREPARATION AND PROFILING:

STRATEGIES, CHALLENGES, AND EXPERIENCESTIM NORRIS AND MARK LUNDGREN

5/21/2014

TODAYS AGENDA

• Introductions•Date Profiling and Readiness•Lessons Learned•Future Direction

5/21/2014

ABOUT THE P20W DATA WAREHOUSE

• Statewide longitudinal data system • De-identified data about people's early childhood, Kindergarten through 12th

grade, higher education and workforce experiences and performances

• Collected and linked from existing state agency data systems.

• It includes data about the kinds of services they receive, programs in which they participate, and their academic performance and program or degree completion.

• It also includes a variety of demographic data so we are able to look at a variety of different groups of people.

• Personally identifiable information, such as names, social security numbers, addresses, and other data which can identify a person as an individual, are not part of the research database.

4

ECEAP students

K-12 students

K-12 teachers

CTC students

Baccalaureate students

National Student Clearinghouse

Workforce

IPEDS Financial

Data Sources

data

Data Management, Governance

Standards, confidentiality, security

Critical questions

Data dictionary, matching,

longitudinal linking, cross-sector derived

elements

P-20/W datasets

ERDC

Research

Data to partner agencies

PCHEES

Collaborative research

Ad-hoc requests (data and research) for

partners and legislature

LEAP

External requests for data

Feedback reports (behalf of agencies)

Output

OFM

5/21/2014

DATA FLOW PROCESS

•Chart of data flow goes here

5/21/2014

DATA SOURCE CHARACTERISTICS

•Over 20 source data feeds•Data systems being developed in

parallel•Some migrated historic data,

some didn’t

5/21/2014

DATA PREPARATION: DATA PROFILING

•Do it early, do it often•Verification of data dictionary•Descriptive statistics•Distinct counts and percentages• Zero, blanks and nulls•Minimum and maximum values• Patterns of data

5/21/2014

DATA PREPARATION: DATA PROFILING

• Dataset validation checks• Counts of records by time, institution

• Values and codes over time• Systematic changes (0,1 to Y,N)• Values defined in data dictionary• Quality of data• Names and identifiers• Data elements

5/21/2014

DATA PREPARATION: DATA PROFILING

• Toolset varied by analyst• SAS• Informatica Data Analyst• Excel

• Goal of understanding the data• Constraints• Completeness, patterns over time• Values of each data element

5/21/2014

DATA PREPARATION: DATA READINESS

•Document and expand results of profiling process•Generate the “goto” resource for

follow-up question•Resource to begin data loading• Content that feeds the data

dictionary

5/21/2014

DATA PREPARATION: DATA READINESS

• Information about:•Data provider•Data file•Data elements

5/21/2014

READINESS CONTENT ITEMSDataset elements Data element

Number of records Name and description

Years Provided Acceptable values

Primary key Data format/length

Business owner and steward Business rules

Update frequency Identity matching flag

Extract process Field/record level data rules

Known issues Security category

Dataset level rules Notes

5/21/2014

DATA READINESS TEMPLATE

• s

5/21/2014

WHAT WE’VE LEARNED

• Customers need to be involved•Dictionaries don’t match data• Educate our analyst on the data,

the customer on the vision of the database•Avoid custom extracts•More time required up front

5/21/2014

TOWARD THE FUTURE

• Empower the provider by offering guidance and tools for profiling•Develop feedback process of data

quality and edits back to customer•Open and transparent

5/21/2014

QUESTIONS?

top related