
Data Quality Case Study

Prepared by ORC Macro

2

• Background
  – Data Correction
    • Tracking system
    • SAS AF query application
    • Guidelines
  – Profile Analysis
    • SSNs
    • Names

Data Correction

3

Profile Analysis—SSNs

Persons_t: n=346,381

PPRFs

Valid-looking SSNs: n=234,311 (68%)

Shared SSNs: n=7,100

Repeated SSNs: n=3,406

Invalid: n=15,973 (5%)

Missing: n=96,097 (27%)
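A breakdown like the one above can be produced by bucketing each SSN value. The following is a minimal Python sketch, not the analysis ORC Macro actually ran. The function names and the screening rules for "valid-looking" are assumptions for illustration: nine digits, area number not 000, 666, or 900-999, group number not 00, serial number not 0000.

```python
import re
from collections import Counter


def classify_ssn(raw):
    """Return 'missing', 'invalid', or 'valid-looking' for one SSN value."""
    if raw is None or not str(raw).strip():
        return "missing"
    digits = re.sub(r"[\s-]", "", str(raw))
    if not re.fullmatch(r"\d{9}", digits):
        return "invalid"
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    # Assumed screening rules: ranges that are never issued as SSNs.
    if area in ("000", "666") or area.startswith("9"):
        return "invalid"
    if group == "00" or serial == "0000":
        return "invalid"
    return "valid-looking"


def profile_ssns(ssns):
    """Map each category to (count, rounded percent of all records)."""
    counts = Counter(classify_ssn(s) for s in ssns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: (n, round(100 * n / total)) for k, n in counts.items()}
```

A "valid-looking" result only means the value passes format checks. It says nothing about whether the SSN belongs to the person on the profile, which is why shared and repeated SSNs are counted separately.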

4

Profile Analysis—SSNs

Shared SSNs (n=7,100)

Different Names: 27% (Candidates for Correction)

Same or Similar Names: 73% (Candidates for Collapse)
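The split of shared SSNs into correction and collapse candidates could be sketched as follows. This is an illustration only: the record layout `(person_id, ssn, name)`, the function names, and the 0.85 similarity threshold are hypothetical, and `difflib.SequenceMatcher` stands in for whatever name comparison was actually used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Upper-case the name and drop commas, periods, and extra spaces."""
    return " ".join(name.upper().replace(",", " ").replace(".", " ").split())


def names_similar(a, b, threshold=0.85):
    """True when two normalized names are close enough to be one person."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def triage_shared_ssns(profiles, threshold=0.85):
    """profiles: iterable of (person_id, ssn, name) tuples.

    Returns {ssn: 'collapse' | 'correction'} for SSNs held by 2+ profiles.
    """
    by_ssn = defaultdict(list)
    for person_id, ssn, name in profiles:
        if ssn:  # skip missing SSNs
            by_ssn[ssn].append((person_id, name))
    result = {}
    for ssn, members in by_ssn.items():
        if len(members) < 2:
            continue
        first_name = members[0][1]
        all_similar = all(
            names_similar(first_name, name, threshold) for _, name in members[1:]
        )
        # Same or similar names suggest duplicate profiles of one person;
        # different names suggest that at least one SSN was entered wrongly.
        result[ssn] = "collapse" if all_similar else "correction"
    return result
```

Either label is only a candidate flag. The records still need the investigation and validation steps before anything is changed.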

5

Profile Analysis—Names

Possible Duplicates: 23% (n=79,300)

Unique Persons: 77% (n=267,081)

Persons_t: n=346,381

Repeated Names: n=114,209

Name Groups: n=30,447

Individual Persons: n=34,909

Possible Duplicates: n=79,300

Unique Names: n=232,172
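Counts such as "Unique Names", "Repeated Names", and "Name Groups" can be derived by grouping profiles on a normalized name key. The sketch below assumes a hypothetical key of last name, first name, and middle initial. The slides do not state how names were matched, so treat the key and the function names as placeholders.

```python
from collections import defaultdict


def name_key(last, first, middle=""):
    """Hypothetical grouping key: last name, first name, middle initial."""
    initial = middle.strip()[:1]
    return (last.strip().upper(), first.strip().upper(), initial.upper())


def profile_names(persons):
    """persons: iterable of (person_id, last, first, middle) tuples."""
    groups = defaultdict(list)
    for person_id, last, first, middle in persons:
        groups[name_key(last, first, middle)].append(person_id)
    # A name group is any key carried by more than one profile.
    repeated = {k: ids for k, ids in groups.items() if len(ids) > 1}
    return {
        "persons": sum(len(ids) for ids in groups.values()),
        "unique_names": sum(1 for ids in groups.values() if len(ids) == 1),
        "name_groups": len(repeated),
        "repeated_name_profiles": sum(len(ids) for ids in repeated.values()),
    }
```

With this definition, unique names plus repeated-name profiles always add up to the total number of persons, which matches the slide: 232,172 + 114,209 = 346,381.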

6

Profile Analysis—Names

Name Groups: n=30,447

114,209 Profiles

Sequential/Multiple Profiles: n=20,375

Invalid/Missing SSNs: n=83,521

Shared SSNs: n=2,092

Apparent Valid SSNs: n=30,668

Typo/Data Entry: n=3,622

Unique SSNs: n=24,954

Contracts: n=18,650 (91%)

Other: n=1,725 (9%)

7

• Definition

• Statistics

• Status

OLTP—Commons Cases

8

• Identifying the extent of the problem

• Investigating based on type of error

• Validating the investigation

• Implementing the change

• Tracking the identification, investigation, validation, and implementation

Data Correction
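The tracking step can be pictured as a small record that moves through the four stages in order and keeps a dated note for each. This is a sketch under assumptions: the class, field, and stage names are hypothetical and do not describe the internal tracking system mentioned in these slides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Stage names are assumed; they mirror the four steps listed on the slide.
STAGES = ("identified", "investigated", "validated", "implemented")


@dataclass
class CorrectionCase:
    person_id: int
    problem: str
    stage: str = STAGES[0]
    history: list = field(default_factory=list)

    def __post_init__(self):
        self._log("case opened: " + self.problem)

    def _log(self, note):
        # Each entry records when the stage was reached and why.
        self.history.append((datetime.now(timezone.utc), self.stage, note))

    def advance(self, note):
        """Move to the next stage; stages cannot be skipped or repeated."""
        idx = STAGES.index(self.stage)
        if idx == len(STAGES) - 1:
            raise ValueError("case is already implemented")
        self.stage = STAGES[idx + 1]
        self._log(note)
```

Forcing the stages to run in order means a change cannot be recorded as implemented before the investigation has been validated.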

9

PERSON_ID=3070908—PPRF record

• Identification of problem
  – Two different middle initials found

• Investigation of problem
  – TA module
  – Scripts run

• Validation of information
  – Name, SSN, degree(s), grant(s)
  – Sources

Data Correction—An Example

10

PERSON_ID=3070908—PPRF record

• Implementation of correction
  – Grants report submitted to NIH OD

• Tracking of correction
  – Internal tracking system

• Post-correction
  – Loss of control of data

Data Correction—An Example

Developing a Data Quality Business Plan

12

Focus of Our Activities

Examination of the Database, Procedures, and Interface

Development of Modified Use Cases

Unified Modeling Language

Identification and Extraction of Business Rules

Identification of Business Model

13

Data Quality Issues

• Type-over of information

• Generation of duplicate persons

• Collapsing

• Changes in degree and address data

• Generation of orphans

14

Type-Over Practices

• Intentions:
  – Assign a new principal investigator (PI) to a grant
  – Change the name of a PI on a grant
  – Correct a misspelled name

• Consequences:
  – Inclusion of incorrect information in a person profile
  – Absence of linkages between PIs and grant applications
  – Creation of false linkages between PIs and grant applications

15

Factors Affecting Quality

• Relatively easy access to person-related data elements

• Lack of self-validation routines

• Interface issues

16

Solutions

• Restricted access

• Quality control validation

• Interface simplification

• Self-validation algorithm

17

Data Quality Validation

• Who does it?
  – ICs
  – A Quality Assurance group
  – Other

• How is it done?
  – Staging areas
  – Manual and intelligent filtering
  – Architecture

18

GM Module Screen GM1040

19

GM Module Screen COM1100

20

Self Validation

• Name-matching algorithm

• Consistency checking
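One way to picture a self-validation routine is a guard on name edits: a small change is accepted as a spelling correction, while a large change is held for review because it may be a type-over that replaces one person with another. The sketch below is illustrative only. The function names and the two-edit limit are assumptions, and plain edit distance stands in for whatever name-matching algorithm is eventually chosen.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(name):
    """Upper-case the name and drop commas, periods, and extra spaces."""
    return " ".join(name.upper().replace(",", " ").replace(".", " ").split())


def check_name_edit(old_name, new_name, max_edits=2):
    """Return ('accept' | 'review', distance) for a proposed name change.

    max_edits is an assumed tolerance for spelling corrections.
    """
    distance = edit_distance(normalize(old_name), normalize(new_name))
    return ("accept" if distance <= max_edits else "review", distance)
```

For example, changing "Smyth, Jon" to "Smith, John" needs two edits and would pass, while changing "Smith, John" to "Garcia, Luis" would be routed to review instead of silently overwriting the profile.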

21

Higher-Level Analysis

The following are being examined relative to their effect on quality:

• Commons interface with IMPAC II

• Database redundancy

• Business rules in the database

• Master person file

• Front-end design

• Human factors

• Ownership

Development of a Data Quality Model

23

• Evaluate the different identification algorithms currently in use for IMPAC II

• Develop identification algorithm(s) and procedures

• Serve as consultant and guarantor of efficacy of algorithm implementation

Major Goals

Quality improvements plan for personal identifiers

24

• Understanding the technical infrastructure

• Identification of specific areas of concern

• Development/proposal of data quality expectations

• Development/proposal of appropriate, acceptable solutions

Moving Forward

25

Outline

• Definition

• Rules

• Risks and Costs

• NIH Expectations

• Process

• Measurements/Metrics

• Testing

• Continuous Improvements

• Conclusions

Data Quality White Paper

Knowledge assets are very real and carry tremendous value.

26

Development/Proposal of Data Quality Expectations

Development/Proposal of Appropriate, Acceptable Solutions

Identification of Specific Areas of Concern

Understanding the Technical Infrastructure

Examination of the Database, Procedures, and Interface

Development of Modified Use Cases

Unified Modeling Language

Identification and Extraction of Business Rules

Identification of Business Model

Conclusion