Data Quality Case Study, prepared by ORC Macro
TRANSCRIPT
2
• Background
  – Data Correction
    • Tracking system
    • SAS AF query application
    • Guidelines
  – Profile Analysis
    • SSNs
    • Names
Data Correction
3
Profile Analysis—SSNs
• Persons_t (PPRFs): n=346,381
  – Valid-looking SSNs: n=234,311 (68%)
    • Shared SSNs: n=7,100
    • Repeated SSNs: n=3,406
  – Invalid: n=15,973 (5%)
  – Missing: n=96,097 (27%)
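The slide does not state how an SSN was judged "valid-looking". The sketch below assumes the usual structural rules for U.S. SSNs (nine digits; area number not 000, 666, or 900-999; group not 00; serial not 0000). The function name `classify_ssn` and the sample values are hypothetical, not taken from the case study.

```python
import re
from collections import Counter

def classify_ssn(raw):
    """Return 'missing', 'invalid', or 'valid-looking' for one SSN field.

    Assumption: only the structure is checked; nothing here confirms
    that the number was actually issued to the person.
    """
    if raw is None or not raw.strip():
        return "missing"
    digits = re.sub(r"[\s-]", "", raw)  # tolerate 123-45-6789 style input
    if not re.fullmatch(r"[0-9]{9}", digits):
        return "invalid"
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return "invalid"
    if group == "00" or serial == "0000":
        return "invalid"
    return "valid-looking"

# Hypothetical sample values, for illustration only.
samples = ["123-45-6789", "", None, "000-12-3456", "12345", "987-65-4321"]
counts = Counter(classify_ssn(s) for s in samples)
print(counts)
```

Run over a full person table, a tally like `counts` would yield a three-way split of the kind shown on the slide.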
4
Profile Analysis—SSNs
Shared SSNs (n=7,100)
• Different Names (27%): Candidates for Correction
• Same or Similar Names (73%): Candidates for Collapse
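The split above can be sketched as a simple triage: group profiles by SSN, then compare the names within each group. This is a minimal illustration, not the method used in the study. The similarity measure (`difflib.SequenceMatcher`), the 0.85 threshold, and the sample profiles are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """True when two names are the same or nearly so (threshold is assumed)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def triage_shared_ssns(profiles):
    """Map each SSN held by 2+ profiles to 'collapse' or 'correction'."""
    names_by_ssn = defaultdict(list)
    for _person_id, name, ssn in profiles:
        names_by_ssn[ssn].append(name)
    result = {}
    for ssn, names in names_by_ssn.items():
        if len(names) < 2:
            continue  # SSN is not shared
        all_similar = all(similar(names[0], other) for other in names[1:])
        result[ssn] = "collapse" if all_similar else "correction"
    return result

# Hypothetical profiles: (person_id, name, ssn)
profiles = [
    (1, "Smith, John A", "123-45-6789"),
    (2, "Smith, John A.", "123-45-6789"),
    (3, "Garcia, Maria", "234-56-7890"),
    (4, "Lee, Robert", "234-56-7890"),
    (5, "Jones, Ann", "345-67-8901"),
]
triage = triage_shared_ssns(profiles)
print(triage)
```

Same-name groups are likely one person entered twice (collapse); different-name groups suggest that at least one SSN was recorded against the wrong person (correction).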
5
Profile Analysis—Names
• Possible Duplicates: 23% (n=79,300)
• Unique Persons: 77% (n=267,081)
• Persons_t: n=346,381
  – Repeated Names: n=114,209
    • Name Groups: n=30,447
    • Individual Persons: n=34,909
    • Possible Duplicates: n=79,300
  – Unique Names: n=232,172
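The figures on this slide reconcile arithmetically. The check below assumes, as the numbers suggest but the slide does not state, that "Individual Persons" counts the distinct people among the repeated-name profiles, so the remaining repeated-name profiles are the possible duplicates.

```python
# Figures copied from the slide.
total_profiles = 346_381
repeated_names = 114_209
unique_names = 232_172
individual_persons = 34_909

# Inferred relationships (an assumption about how the slide was built).
possible_duplicates = repeated_names - individual_persons
unique_persons = unique_names + individual_persons

assert repeated_names + unique_names == total_profiles
assert possible_duplicates == 79_300
assert unique_persons == 267_081

dup_pct = round(100 * possible_duplicates / total_profiles)
unique_pct = round(100 * unique_persons / total_profiles)
print(dup_pct, unique_pct)
```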
6
Profile Analysis—Names
• Name Groups: n=30,447 (114,209 Profiles)
  – Sequential/Multiple Profiles: n=20,375
    • Contracts: n=18,650 (91%)
    • Other: n=1,725 (9%)
  – Invalid/Missing SSNs: n=83,521
  – Apparent Valid SSNs: n=30,668
    • Shared SSNs: n=2,092
    • Typo/Data Entry: n=3,622
    • Unique SSNs: n=24,954
8
• Identifying the extent of the problem
• Investigating based on type of error
• Validating the investigation
• Implementing the change
• Tracking the identification, investigation, validation, and implementation
Data Correction
9
PERSON_ID=3070908—PPRF record
• Identification of problem
  – Two different middle initials found
• Investigation of problem
  – TA module
  – Scripts run
• Validation of information
  – Name, SSN, degree(s), grant(s)
  – Sources
Data Correction—An Example
10
PERSON_ID=3070908—PPRF record
• Implementation of correction
  – Grants report submitted to NIH OD
• Tracking of correction
  – Internal tracking system
• Post-correction
  – Loss of control of data
Data Correction—An Example
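The slides do not describe how the internal tracking system is built. As a rough sketch only, a tracking record could enforce the order of the four steps (identification, investigation, validation, implementation) and keep a dated note for each. The class name `CorrectionRecord`, its fields, and the stage labels are hypothetical. The notes reuse the details of the PERSON_ID=3070908 example.

```python
from dataclasses import dataclass, field
from datetime import date

# Stage order taken from the slides; the labels themselves are assumed.
STAGES = ["identified", "investigated", "validated", "implemented"]

@dataclass
class CorrectionRecord:
    person_id: int
    problem: str
    history: list = field(default_factory=list)  # (stage, date, note) tuples

    @property
    def stage(self):
        """Most recent completed stage, or None if nothing is logged yet."""
        return self.history[-1][0] if self.history else None

    def advance(self, stage, note, when=None):
        """Log the next stage; refuse stages that are skipped or repeated."""
        done = len(self.history)
        expected = STAGES[done] if done < len(STAGES) else None
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.history.append((stage, when or date.today(), note))

rec = CorrectionRecord(3070908, "Two different middle initials found")
rec.advance("identified", "Flagged in PPRF record")
rec.advance("investigated", "TA module checked; scripts run")
rec.advance("validated", "Name, SSN, degree(s), grant(s) checked against sources")
rec.advance("implemented", "Grants report submitted to NIH OD")
print(rec.stage, len(rec.history))
```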
12
Focus of Our Activities
Examination of the Database, Procedures, and Interface
Development of Modified Use Cases
Unified Modeling Language
Identification and Extraction of Business Rules
Identification of Business Model
13
Data Quality Issues
• Type-over of information
• Generation of duplicate persons
• Collapsing
• Changes in degree and address data
• Generation of orphans
14
Type-Over Practices
• Intentions:
  – Assign a new principal investigator (PI) to a grant
  – Change the name of a PI on a grant
  – Correct a misspelled name
• Consequences:
  – Inclusion of incorrect information in a person profile
  – Absence of linkages between PIs and grant applications
  – Creation of false linkages between PIs and grant applications
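A toy model shows why type-over produces these consequences. The two-table layout below (persons keyed by ID, grants pointing to a PI by ID) is an assumption for illustration and is not the IMPAC II schema. All names and IDs are made up.

```python
import copy

# Hypothetical data: one person is PI on two grants.
persons = {101: {"name": "Smith, John A"}}
grants = {"G-1": {"pi_id": 101}, "G-2": {"pi_id": 101}}

def pi_name(persons, grants, grant_id):
    """Look up the PI name for a grant through the person link."""
    return persons[grants[grant_id]["pi_id"]]["name"]

# Intent in both scenarios: make "Lee, Robert" the PI of G-2 only.

# Scenario A: type-over. The name on profile 101 is overwritten.
p_a, g_a = copy.deepcopy(persons), copy.deepcopy(grants)
p_a[101]["name"] = "Lee, Robert"
typed_over = {g: pi_name(p_a, g_a, g) for g in g_a}

# Scenario B: add a new person record and relink only G-2.
p_b, g_b = copy.deepcopy(persons), copy.deepcopy(grants)
new_id = max(p_b) + 1
p_b[new_id] = {"name": "Lee, Robert"}
g_b["G-2"]["pi_id"] = new_id
relinked = {g: pi_name(p_b, g_b, g) for g in g_b}

print(typed_over)
print(relinked)
```

In scenario A, G-1 now appears to belong to Lee (a false linkage) and Smith no longer appears on any grant (a lost linkage). In scenario B, G-1 stays with Smith.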
15
Factors Affecting Quality
• Relatively easy access to person-related data elements
• Lack of self-validation routines
• Interface issues
16
Solutions
• Restricted access
• Quality control validation
• Interface simplification
• Self-validation algorithm
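The slides name a self-validation algorithm without specifying it. One possible form, shown here as an assumption only, is a guard on name edits: a small change is accepted as a spelling correction, while a large change is routed to review because it may really be a different person. The function name and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def validate_name_update(old_name, new_name, threshold=0.8):
    """Return 'accept' for near-identical names, else 'review'.

    Assumption: string similarity is a stand-in for whatever rule a
    production system would apply; the threshold is illustrative.
    """
    ratio = SequenceMatcher(None, old_name.lower(), new_name.lower()).ratio()
    return "accept" if ratio >= threshold else "review"

print(validate_name_update("Smtih, John", "Smith, John"))
print(validate_name_update("Smith, John", "Lee, Robert"))
```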
17
Data Quality Validation
• Who does it?
  – ICs
  – A Quality Assurance group
  – Other
• How is it done?
  – Staging areas
  – Manual and intelligent filtering
  – Architecture
21
Higher-Level Analysis
The following are being examined relative to their effect on quality:
• Commons interface with IMPAC II
• Database redundancy
• Business rules in the database
• Master person file
• Front-end design
• Human factors
• Ownership
23
• Evaluate the different identification algorithms currently in use for IMPAC II
• Develop identification algorithm(s) and procedures
• Serve as consultant and guarantor of efficacy of algorithm implementation
Major Goals
Quality improvements plan for personal identifiers
24
• Understanding the technical infrastructure
• Identification of specific areas of concern
• Development/proposal of data quality expectations
• Development/proposal of appropriate, acceptable solutions
Moving Forward
25
Outline
• Definition
• Rules
• Risks and Costs
• NIH Expectations
• Process
• Measurements/Metrics
• Testing
• Continuous Improvements
• Conclusions
Data Quality White Paper
Knowledge assets are very real and carry tremendous value.
26
Development/Proposal of Data Quality Expectations
Development/Proposal of Appropriate, Acceptable Solutions
Identification of Specific Areas of Concern
Understanding the Technical Infrastructure
Examination of the Database, Procedures, and Interface
Development of Modified Use Cases
Unified Modeling Language
Identification and Extraction of Business Rules
Identification of Business Model
Conclusion