
Page 1:

A Framework for Exploring Data Quality in a Large Data System

Willard Hom

Institute on Research & Statistics

April 8, 2004

Page 2:

Problem: How can we plan to explore data quality in a large data system?

Basic Response: Match the needs for data quality exploration with your resources/contexts for doing so.

Page 3:

Note on Topic Coverage

• Focus on traditional MIS data (numeric and string variables).

• Not covered are other data forms (audio, visual, GIS, and narrative text/reports).

• MIS staff and expertise are obviously critical, so this talk focuses mostly on contributions researchers can make.

Page 4:

Reasons to Explore Data Quality

• Meet professional duty as researchers.

• Help us to judge the types of analysis that we can do with a data system.

• Prevent others from misusing data.

• Facilitate the improvement of data.

• Counter frequent myths about your data.

• Help justify the agency’s mission & funding (esp. when data is necessary for funding).

Page 5:

Why Exploring Data Quality Is Hard for Organizations

• Inexperience with the topic (and lack of expertise).

• Concrete added expense but no clear value.

• Sunk costs (incl. pressure to preserve a time series).

• Finding errors will not mean perfect data will result.

• Political & administrative sensitivity.

• Not usually mandated.

Page 6:

Some Dimensions of Data Quality

• Accuracy.*

• Completeness.*

• Consistency.

• Currency.

• Accessibility (Usability).

* areas where researchers or statisticians can contribute most effectively

Page 7:

Accuracy

• Closeness to “true” value of a variable.

• Unbiased (absence of systematic error).

• Level of precision (or “coarseness” of the recorded values).

Page 8:

Completeness

• The degree of coverage (in terms of “cases” and of “variables”) for the analysis of a “target population.”*

• No, or very few, missing values where true values exist.

*also issues relating to longitudinal studies and explanatory modeling.
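To make the missing-values side of completeness operational, here is a minimal profiling sketch with pandas; the DataFrame `mis` and its column names are hypothetical stand-ins for a real MIS extract.

```python
# A first, cheap completeness measure: the share of missing values per
# variable. The data below are invented for illustration.
import pandas as pd

mis = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "goal_code":  ["A", None, "B", None, "C"],
    "units":      [12.0, 9.0, None, 6.0, 15.0],
})

missing_rate = mis.isna().mean().sort_values(ascending=False)
print(missing_rate)   # goal_code 0.4, units 0.2, student_id 0.0
```

Note that this catches only missing values; coverage of the target population (missing cases) requires comparison against an external source, a pitfall a later slide raises.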

Page 9:

Consistency

• Equivalence of “instrumentation” and formatting across time, space, and other “batches” of the data collection/management environment.

Page 10:

Currency

• Minimal lag time between occurrence of new phenomenon and the availability of data values in the system to represent the new phenomenon.

• Minimal lag time between discovery of errors and the correction of those errors in historical data.

Page 11:

Accessibility/Usability

• Ease of manipulation by target users (file format, record format, field format, system compatibility).

• Clarity of metadata for proper data analysis.

• Breadth of access (authority for use).

Page 12:

Some Factors in Choosing Which Variables to Explore

• Risk from errors in a variable.

• Ease of error detection for a variable.

• Ease of error correction for a variable.

• Cost of error detection for a variable.

• Cost of error correction for a variable.
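One way to combine these five factors (my illustration, not from the presentation) is a rough weighted priority score per variable; the variables, 1-5 ratings, and weights below are all invented.

```python
# Hypothetical weighted scoring: higher risk and ease raise priority,
# higher costs lower it. Ratings are on an invented 1-5 scale.
factors = {  # variable: (risk, ease_detect, ease_fix, cost_detect, cost_fix)
    "bachelor_flag":   (5, 4, 3, 2, 3),
    "student_goal":    (3, 2, 2, 4, 4),
    "first_time_flag": (4, 5, 4, 2, 2),
}
weights = (2.0, 1.0, 1.0, -1.0, -1.0)

scores = {var: sum(w * r for w, r in zip(weights, ratings))
          for var, ratings in factors.items()}
for var, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{var}: {score:+.1f}")
```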

Page 13:

Two Basic Data Exploration Tracks

• Data editing and testing.

• Process analysis.

• Both are important to use.

Page 14:

Data Editing/Testing

• Screen for allowable range of values.

• Screen for outliers (univariate or multivariate).

• Statistical quality control methods.

• Screen within a record or across records.

• These can prevent some error and can detect some error, but they rarely find root causes. (A minimal sketch of the first two screens follows this list.)
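A minimal sketch of a range edit plus a univariate outlier screen, the latter using the modified z-score of Iglewicz & Hoaglin (listed in the bibliography). The `units` variable and its 0-30 allowable range are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"units": [12.0, 9.0, 15.0, 13.5, 88.0, -3.0, 11.0]})

# Range edit: flag values outside the allowable 0-30 range.
df["range_fail"] = ~df["units"].between(0, 30)

# Modified z-score: built on the median and MAD, so it stays robust to
# the very outliers it is trying to find.
med = df["units"].median()
mad = (df["units"] - med).abs().median()
df["mod_z"] = 0.6745 * (df["units"] - med) / mad
df["outlier"] = df["mod_z"].abs() > 3.5   # conventional cutoff

print(df)   # 88.0 fails both screens; -3.0 fails only the range edit
```

As the next slide cautions, a flagged value is only a candidate error: some outliers are true values.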

Page 15:

Some Caveats for Editing/Testing

• Some outliers are true values while some inliers are not. (Error detection is complicated.)

• Testing depends upon the analyst’s ability to use some “gold standard” in a comparison.

• Variables with a restricted range of measurement, or on a categorical scale, present a different challenge than outlier detection does for variables on an interval scale (with no range restriction or truncation).

Page 16:

Process Analysis

• Analyze each step in the data’s history, from the initial data-generating step through all ensuing data processing, right down to the user of a final report or analysis.

Page 17:

Some Caveats for Process Analysis

• This is a multi-disciplinary concept (needing at least MIS staff, social scientists, and subject-matter experts).

• This can cost far more (in time and resources) than the editing/testing track.

• The metadata factor is important here.

• It is critical for finding the root cause of data error.

Page 18:

Administrative Issues in Data Error

• Publish data so that data originators can correct errors in the system (a feedback loop)---another benefit of data “usability.”

• Consider the incentives that data creators or intermediaries have to bias data on purpose (so alter the incentives or monitor closely).

• Consider factors that can motivate the production of more accurate and complete data (show that data get used and how the costs of error will hit them)---especially when lack of effort is the cause.

Page 19:

How Researchers Can Add Value

• Models in social psychology and economics to understand a data generation/processing system.

• Field observation (and interviews) of data collection process.

• Verbal protocol methods for process actors.

• Experimentation to develop improvements.

• Statistical tools for sampling (incl. audit sampling), outliers, odd data patterns, control charts, and validity/reliability studies. (A control-chart sketch follows this list.)
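As one concrete example of the control-chart tool, a p-chart can monitor the error rate of incoming data batches; the batch sizes and error counts below are invented.

```python
# p-chart sketch: flag batches whose error proportion falls outside
# 3-sigma control limits around the pooled error rate.
import numpy as np

n      = np.array([400, 380, 420, 410, 395, 405])  # records per batch
errors = np.array([ 12,   9,  14,  40,  11,  10])  # flagged errors per batch

p = errors / n
p_bar = errors.sum() / n.sum()                 # pooled error proportion
sigma = np.sqrt(p_bar * (1 - p_bar) / n)       # per-batch standard error
ucl = p_bar + 3 * sigma
lcl = np.clip(p_bar - 3 * sigma, 0, None)

for i, (pi, u, l) in enumerate(zip(p, ucl, lcl), start=1):
    status = "OUT OF CONTROL" if (pi > u or pi < l) else "ok"
    print(f"batch {i}: p={pi:.3f}  limits=({l:.3f}, {u:.3f})  {status}")
```

A batch flagged here (batch 4 in this made-up data) points to a process problem worth the deeper process-analysis track.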

Page 20:

Some Examples of Exploration

• Determining the number of CC students in an academic year with a bachelor’s degree.*

• Validating the students’ self-reported goals for CC enrollment (at time of initial registration).*

• Checking the accuracy of a flag for first-time CC student.**

* hypothetical example
** actual historical example

Page 21:

Determining the number of CC students in an academic year with a bachelor’s degree.

• A differential fee for CC students who have a BA/BS could motivate students to misreport the prior attainment of a BA/BS.

• A partial test for potential reporting bias could use databases of higher ed. enrollment to check for degree status in a random sample of CC students.

• We could use the sample proportion (in lieu of the “population” proportion in the MIS) if the MIS proportion lies outside the sample’s 95% confidence interval.
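A sketch of that check under hypothetical numbers, using the usual normal-approximation 95% interval for a sample proportion:

```python
import math

n, hits = 500, 60                 # sampled students; confirmed BA/BS holders
p_hat = hits / n                  # sample proportion = 0.12
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

p_mis = 0.07                      # proportion recorded in the MIS (invented)
if lo <= p_mis <= hi:
    print("MIS proportion is consistent with the verified sample.")
else:
    print(f"MIS proportion {p_mis:.3f} lies outside the 95% CI "
          f"({lo:.3f}, {hi:.3f}); prefer the sample estimate {p_hat:.3f}.")
```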

Page 22:

Validating the students’ self-reported goals for CC enrollment (at time of initial registration).

• Student-reported goals may lack validity if students give the question no cognitive effort.

• A sample of CC students could be re-interviewed (phone or face-to-face) to check the reliability of the initial response (a case of test-retest reliability; see the sketch after this list).

• A qualitative evaluation could use field observation and/or post-survey de-briefing.

• Researchers could use the verbal protocol method to understand the ways that students interpret the question.
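A sketch of how the re-interview comparison might be scored, using raw agreement plus Cohen's kappa as a chance-corrected measure; the goal responses below are invented.

```python
from collections import Counter

initial     = ["transfer", "degree", "transfer", "job", "degree", "transfer"]
reinterview = ["transfer", "job",    "transfer", "job", "degree", "degree"]

n = len(initial)
agree = sum(a == b for a, b in zip(initial, reinterview)) / n

# Cohen's kappa: agreement corrected for what chance alone would produce.
c1, c2 = Counter(initial), Counter(reinterview)
p_chance = sum((c1[g] / n) * (c2[g] / n) for g in set(c1) | set(c2))
kappa = (agree - p_chance) / (1 - p_chance)
print(f"raw agreement = {agree:.2f}, kappa = {kappa:.2f}")  # 0.67, 0.50
```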

Page 23:

Checking the accuracy of a flag for first-time CC student.

• CC students mark their status as “first-time CC students” or some other category, but field staff noted apparent reporting errors.

• The state-wide MIS has records of CC enrollment by individual student over a span of years.

• Programmers checked each new cohort of CC students for any prior CC enrollment.
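A minimal sketch of that programmatic check with pandas; the table and column names are hypothetical.

```python
import pandas as pd

# Any prior CC enrollment, from the state-wide MIS history.
prior = pd.DataFrame({"student_id": [101, 102, 103]})

# The incoming cohort with its self-reported first-time flag.
cohort = pd.DataFrame({
    "student_id":      [102, 104, 105],
    "first_time_flag": [True, True, False],
})

# Students flagged first-time who nonetheless appear in prior enrollment
# are likely reporting errors.
flagged = cohort[cohort["first_time_flag"]]
suspect = flagged[flagged["student_id"].isin(prior["student_id"])]
print(suspect)   # student 102 in this made-up data
```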

Page 24:

Some Pitfalls in Data Exploration

• If cases lack unique identifiers, you can’t use another data source to cross-check for data agreement.

• Even if cross-referencing indicates disagreement in data values, we may not know which source, if either of them, has the correct values.

• To find coverage errors (target population errors), you need alternate data and a statistical analysis of population profiles, because total N’s may agree. (A sketch follows this list.)

• Survey data demand special methods such as re-interviewing and instrument validation.
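For the coverage-error pitfall, one option (my suggestion, not the presentation's) is a chi-square test of homogeneity on the two sources' category profiles; the counts below are invented, with deliberately equal totals.

```python
from scipy.stats import chi2_contingency

#            age<25  25-39  40+
counts_mis = [ 5200,  3100, 1700]   # profile from the MIS (N = 10,000)
counts_alt = [ 4800,  3500, 1700]   # profile from an alternate source (N = 10,000)

chi2, p, dof, _ = chi2_contingency([counts_mis, counts_alt])
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.2g}")
# A small p-value suggests the two sources cover different populations,
# even though the total N's agree.
```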

Page 25:

Rule of Thumb 1:

Do data exploration as near to the data generating step as possible; this will help in achieving correct data.

Page 26:

Reasons for RoT 1:

• As time passes and distance from the data source increases, the chances of getting correct data decrease.

• As more time passes and bad data advance deeper into a data system, the risk of their dispersion grows, making amelioration more difficult.

Page 27:

Rule of Thumb 2:

It’s impossible to achieve perfect data; seek to find the levels of quality that are critical.

Page 28:

Reasons for RoT 2:

• In large data systems, some loss of quality is inevitable.

• Usually, we cannot afford to achieve perfect data.

• Usually, we can achieve analytical goals with less-than-perfect data.

Page 29:

Rule of Thumb 3:

With limited resources, we will need to trade off breadth for depth in data exploration.

Page 30:

Reasons for RoT 3:

• In-depth data exploration takes time and expertise—which agencies usually have in limited supply.

Page 31:

Rule of Thumb 4:

Expertise in statistical analysis and research in the relevant subject area are indispensable to effective data quality exploration (in concert with MIS staff).

Page 32:

Reasons for RoT 4:

• Staff who only have MIS expertise (with no expertise in statistical analysis or the relevant research topic):

1. cannot fully understand the level of data quality (accuracy and completeness) needed, and
2. cannot use the various statistical methods to detect potentially erroneous data.

Page 33:

How Critical Is Data Quality for Your System?

• Are the fates of clients dependent on data quality?

• Is program funding or program evaluation directly linked to your data?

• Does your job depend upon data quality? (If your data are poor, will the work be outsourced?)

Page 34:

Can You Document Your Current Data Quality?

• Do you have a system in place that prevents data errors?

• Do you have a system in place that measures data quality (and detects error)?

• How rigorous are your steps to prevent error and measures of data quality?

Page 35:

Is There A Credibility Gap?

• Do analysts/decision-makers downplay the reports or conclusions that are based on your data system?

• Do analysts/decision-makers prefer alternate data sources when they draw their conclusions?

Page 36:

What Are Your Capacities?

• Does your agency have close control over the data system (that is, “cradle-to-grave”)?

• Does your agency have researchers with the skill/education to explore system data quality?

• Does your agency have MIS staff with skill/education to explore system data quality?

• Does your agency have time, staff availability, and funds to undertake data quality exploration?

• Is the data system a stable, long-term operation?

Page 37:

Is There Management Support?

• Does management want a short-term solution---basically a “defensive” agenda---just find ways to rebut criticisms of your data’s quality?

• Does management want a long-term solution to data quality issues---a comprehensive strategy to prevent error and to raise quality?

Page 38:

A Suggested Framework (assuming a bottom-up mode)

1. Assess importance of DQ.

2. Can you document current DQ?

3. Is there a credibility gap?

4. What are your capacities?

5. Is there management support?

Page 39:

P.S. Some Factors in Initial Data Quality Problems

• Researchers may not have an active role in the design of data systems (an administrative and political issue)---if the agency has qualified researchers at all.

• Researchers may not have enough time, tools, or special training to help plan valuable outputs that a proposed data system could deliver.

• Analytical needs change but systems often do not adapt well to emerging needs or environmental changes.

Page 40:

“Nutshell” Bibliography

• Dasu, T. & T. Johnson. (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons: New York.

• Iglewicz, B. & D.C. Hoaglin. (1993). How to Detect and Handle Outliers. American Society for Quality Control: Milwaukee, Wisconsin.

Page 41:

“Nutshell” Bibliography (cont.)

• Naus, J.I. (1975). Data Quality Control and Editing. Marcel Dekker: New York.

• Olson, J.E. (2003). Data Quality: The Accuracy Dimension. Morgan Kaufmann: San Francisco.

• Redman, T.C. (1992). Data Quality: Management and Technology. Bantam: New York.

Page 42:

Willard Hom

• Director of Research & Planning Unit, Chancellor’s Office, California Community Colleges, 1102 Q Street, Sacramento, CA 95814-6511

• E-mail: [email protected]

• Phone: (916) 327-5887