process for data quality assurance · pdf file“good decisions require good data”...
TRANSCRIPT
“Good decisions require good data”
Process for
Data Quality Assurance
at Manitoba Centre for Health Policy (MCHP)
Mahmoud Azimaee
Data Analyst at ICES
Literature and Resources
• CIHI Data Quality Framework, (2009
edition)
• UK’s NHS Data Quality Reports
• Handbook on Data Quality
Assessment Methods and Tools, (European Commission)
• Handbook on Improving Quality by
Analysis of Process Variables , (European Commission)
• Data fitness
(Australian National Statistical Service)
Data Quality at MCHP
1. Data Quality Indicators
2. Rating System
– CIHI Data Quality Framework, (2009 edition)
3. Data Quality Report
– UK’s NHS Data Quality Reports
4. Practical Approach
5. Automation
– Cody’s Data Cleaning
Techniques Using SAS, (by Ron Cody)
Data Quality Indicators and Rating
System • Example:
– Completeness: Rate of missing values for all data
elements.
• Consistency : Agreement with registry database.
Level of Agreement With Other Databases
Completeness
Measurement Error
Level of Bias
Degree of Problems with Consistency
Accuracy Reliability
Availability and Quality of:
Documents , Policies and Procedures, Formats Libraries, Metadata, Data Model Diagrams
Time to Acquisition
Time to Release
Currency of Data
Identifying Units of Analysis (Persons, Places, Things, ...)
Level of Agreement With the Literature and available reports
Internal Consistency
Stability Across Time
Linkability
Completeness
(Missing Values)
Correctness
(Invalid codes, Invalid Dates, Out of Range, Outliers and Extreme Observations)
Accuracy Internal Validity
External Validity Timeliness Interpretability
Database Level
(In Data Management)
Research Level
(In a Specific Research Projects)
Data Quality Assurance
MCHP Data Quality Framework:
Data
Management
Process at
MCHP
6. Release Data to Analyst(s) and Researcher(s)
Meet with programmer(s) and researcher(s) to present data structure and content
5. Document Data
Including original documents, data model diagram, SPDS data dictionary, history, file variations and structural changes, revisions and common problems and data quality report, where available
4. Evaluate Data Quality
Test the installed data using standardized protocol
Identify solutions to address deficiencies in data quality
Prepare data quality report for addition to standard documentation
3. Apply SAS Programs
Apply Normalization or De-normalization as required
Normalization can be defined as the practice of optimizing table structures by eliminating redundancy and inconsistent dependency
Apply data field and SAS format
standards
Install on SPD server
(This includes indexing, sorting and clustering)
Create Metadata
If there is a problem, liaise with the source
agency
2. Become Familiar with Data Structure and Content
Review provided documentation If required, create a data model for the
original data If receiving test data, test it and send
feedback to the source agency
1. Formulate the Request and Receive the Data
Check the data sharing
agreements
Liaise with the source agency to acquire available data, data model diagram, data dictionary, documentation about historical changes in data content,
format, and structure, data quality reports
Prepare the data request
letter
Receive the data and associated documentation
How to Present Data Quality
Results?
• CIHI Data Quality Report
• UK’s NHS Data Quality Report
– VODIM Test Analysis Methodology
• Valid
• Other
• Default
• Invalid
• Missing
• Valid
• Invalid
• Missing
• Outlier
VIMO!
VIMO Table
(1) I just discovered that the
data system we have been
working on for the last five
years has major data quality
problems.
(2) That is why I treat data
systems the same way I do
sausage – I do not want to know
what is inside either one.
(3) Ouch!! That is why I am a
vegetarian!
Conversation from: Data Quality and Record Linkage Techniques, Thomas N. Herzog, et al. 2007, Springer
• Example 1: Identifying Outliers/Extreme
Observations: 1. Standard Deviation (Mean +/- 2*SD)
2. Trimmed Standard Deviation
(MeanTrimmed10% +/- 2*1.49*SDTrimmed10%)
3. Interquartile Range
(Q1 – k*IQR , Q3 + k*IQR), k=2.5
– Ordered statistics for calculating quartiles is very
memory intensive
=>> P² method to approximate the quartiles
(Using QMETHOD=P2 in PROC MEANS)
[piecewise-parabolic (P²) algorithm invented by
Jain and Chlamtac (1985)]
Operational Approaches
Operational Approaches
Example 2: Stability Across Time
Based on CIHI guideline:
– Trend analysis is used to examine changes in
core data elements over time
– No change across years may also be an
indication of a problem if the data is expected to
naturally trend upward or downward
– Changes in methodology or inclusion/exclusion
criteria should be taken into account to determine
whether the observed changes were real or not.
Example 2: Stability Across Time (Continued)
• Identify unusual changes
– Outlier analysis
• Outlier analysis requires a model
– How to choose an appropriate model in an
automated fashion?
• Fit a series of common models:
– Simple Linear: Y=β0 + β 1X
– Quadratic: Y= β 0 + β 1X2
– Exponential: Y= β 0 + β 1exp(X)
– Logarithmic: Y= β 0 + β 1log(X)
– SQRT: Y= β 0 + β 1 𝑥
– Inverse: Y= β 0 + β 1 1
𝑥
– Negative Exponential: Y= β 0 + + β 1Exp(-X)
Example 2: Stability Across Time (Continued)
• Choose the best model with the minimum MSE
• Re-fit the chosen model on the data
• Do an outlier analysis
– Estimate Studentized residuals for each observation (with the
current observation deleted)
• Flag significant observations as potential outliers
• Flag observation with no changes over time
• How about Small Cell Size Policy? (0<Frequency<6)
– Use the actual values in modeling but flag and then
force them to 3 in the report
• MCHP’s data repository includes over 65
health and other administrative databases,
(linkable using a common encrypted
individual identifier).
• Annual updates for most of the databases
in its repository.
• Designing an automated process became a
must!
Automation
• A SAS Macro based application package
was developed (16 Macros)
– Pre Data Quality Macro (1)
– Main Macros (6)
– Intermediate Macros (9)
Automation
Automation (Continued)
Documentation System
VIMO Macro
GETNOBS Macro
INVALID Macro
POSTMUN Macro
OUTLIER Macro
GETFORMAT Macro
GETVARLIST Macro
METADATA MACRO
Special Features:
• Can handle standalone and Clustered tables
• Can Validate Postal and Municipal codes
LINK Macro
GETNOBS Macro
Automation (Continued)
TREND Macro
GETNOBS Macro
FISCAL Macro
MONTHLY Macro
Automation (Continued)
CONTENT MACRO
AGREEMENT MACRO
PHINCHECKMACRO
• Checks 3rd and 5th positions of PHINs which must be 0 and 9
• Compares the distribution of the first position with the corresponding
PHINs from registry files
Automation (Continued)
Non-Automated Indicators
• Internal Consistency
• Timeliness
Level of Agreement With Other Databases
Completeness
Measurement Error
Level of Bias
Degree of Problems with Consistency
Accuracy Reliability
TREND Macro LINK Macro PHINCHECK Macro AGREEMENT Macro
VOMO Macro
Availability and Quality of:
Documents , Policies and Procedures, Formats Libraries, Metadata, Data Model Diagrams
Time to Acquisition
Time to Release
Identifying Units of Analysis (Persons, Places, Things, ...)
Level of Agreement With the Literature and available reports
Internal Consistency
Stability Across Time
Linkability
Completeness
(Missing Values)
Correctness
(Invalid codes, Invalid Dates, Out of Range, Outliers and Extreme Observations)
Accuracy Internal Validity
External Validity Timeliness Interpretability
Database Level
(In Data Management)
Research Level
(In a Specific Research Projects)
Data Quality Assurance
• Data Quality Website
• Central Format Library
• Metadata Database
• Standardization
– Bad standards are better than no standards at all!
Missing Links!
• Data Quality Algebra
• Data Quality Axioms
Data Quality As A Science
Acknowledgment
• Mr. Mark Smith (MCHP Associate Director, Repository)
• Dr. Lisa Lix (Associate Professor at University of Saskatchewan)
CONTACT INFORMATION
Mahmoud Azimaee
Institute for Clinical Evaluative Sciences
Work Phone: (647) 480-4055 (Ex. 3618)
E-mail: [email protected]
Web: www.dastneveshteha.com