Research Methods Group
http://worldagroforestry.org/research-methods/
Fundamentals of data quality control
Antony Karanja [email protected]
Science Week 2013
Outline
• Why quality/consistent data?
• Where do we start?
• Levels of data quality checks
• Standard data cleaning procedure
• Storage, sharing and archiving
Introduction
• Why Quality Data?
• Data quality is crucial for quality research results:
– Factual and represent the real world
– Accuracy, reliability and replication
– Integrity, credibility and reputation
– Avoid your paper being rejected
• A study can only be as good as the data
Introduction
• Where does Quality Data start?
– Research design stage: questionnaire design / data collection tool / research methods used
– Data collection
– Data entry management
– Data cleaning
Data Quality Check Stages
Field Data Collection
• Data collection tool design
• Personnel training and research objectivity
• Pretest/piloting
• Back check (field audit): going back to the same sample surveyed
• Identify the source of error (respondent, enumerator, you)
• Take the necessary action
Field Data Collection
• Enumerator/data collection clerk training
o Research objective/target
o Survey tool contents mastered
o “Survey interviewing is a story that needs to be followed in a directed manner and with an objective”
o What and how to ask for exact data
o Survey questions shouldn’t be altered (even on translation into the local language)
o Flow of sections and the content of each mastered
o Follow survey instructions well
o “Given 6 hours to cut down a tree”: 4 hours sharpening the tools (axe), 2 hours cutting
Field Data Collection
• Back checks/field audit
• Proposed protocol: 5%–10% of interviews, selected at random across the team; a sketch of one way to draw such a sample follows below
• Every team and every surveyor is back checked as soon as possible
• Compare results and act accordingly
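A minimal sketch of drawing such a back-check sample in Stata, assuming one row per completed interview and hypothetical variable and file names (hhid, team, enumerator):

    use "completed_interviews.dta", clear
    set seed 20130916                          // fix the seed so the draw is reproducible
    generate double u = runiform()             // one random number per interview
    bysort team (u): generate byte backcheck = _n <= ceil(0.10 * _N)   // flag roughly 10% per field team
    keep if backcheck
    sort team enumerator hhid
    export excel using "backcheck_list.xlsx", firstrow(variables) replace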
Back Check protocol
• How to do a back check?
– Develop a plan before you start surveying
– Select your back check questions
– Select your back check team
– Executing the back check
– Dealing with the results
– Back checks in the context of electronic surveying
• See the Back Check manual for specific details on each of these steps.
Data Quality Check Stages
• Research design
• Data collection tool
• Rigorous training and pretest
• Quality data collection
• Field audit/back check
• Sit-ins/spot checks
• Physical editing
• Structuring data collection protocols
Data Entry Level
• Field data collection stage done!
• We collect tons of data through surveys.
• How do we convert them into a form which we can analyze?
• Simple answer: create data sets
• This applies to both paper-based surveys and Digital Data Collection (DDC)/Computer-Assisted Interviewing (CAI)
DISASTER is just waiting to happen
1. Unorganized surveys. Misplaced an entire village. Lost data.
2. Sent data to another project site. Truck crashed. Lost data.
3. Server crashed. No backup. Lost data.
4. No one checked data quality. Turns out, there’s no ID variable. Lost data.
5. No one monitored data entry contractor. Turns out, they copy + pasted data and changed the IDs. Lost data.
Rules for Data Entry
• Double blind entry
• Enter PII separately & encrypt
• Two unique identifiers
• Data cleaning
Double data entry
• The gold standard for professional data entry: what is collected is what is coded/entered.
• The two data sets are compared, differences are examined and corrections are made.
• “Garbage in, garbage out.” Don’t enter garbage data. If you want any analysis of your data to be valid, your data itself must be valid.
• Use a program designed for data entry (CsPro, MS Access/MySQL, Excel, SPSS, Epi Info, EpiData, etc.) and ensure double (blind) data entry is done.
Double Data entry Flow
[Flow diagram: questionnaire → 1st entry and 2nd entry → discrepancies → reconciliation → final dataset]
In Stata the comparison can be done with cfout, readreplace and cf; similar checks can be set up in CsPro and Access.
A data audit (3rd entry) is done after this; the normally accepted error rate is 0.5%.
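A minimal sketch of the comparison step using Stata's cf command, assuming the two entry files share a hypothetical unique identifier hhid (file names are illustrative only):

    use "entry_2.dta", clear
    sort hhid
    save "entry_2_sorted.dta", replace            // cf compares row by row, so both files must be in the same order

    use "entry_1.dta", clear
    isid hhid                                     // the identifier must be unique in each entry file
    sort hhid
    cf _all using "entry_2_sorted.dta", verbose   // lists every value that differs between the two entries

Any differences listed are then checked against the paper questionnaire and reconciled before the final dataset is saved.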
Data Entry Level
o Double entered and verified… what do you do next?
o Data cleaning
o There is no one standard cleaning process, but it is very common to do the following tasks on every dataset:
Standard Data Cleaning Procedures
a) Labeling variables and labeling variable values (scale responses or pre-coded responses)
b) Unique identifiers, skip pattern checks (data logical tests). Maintain the code book! A sketch of these two tasks follows below.
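A minimal sketch of tasks a) and b) in Stata, using hypothetical variable names (hhid, q12_sex):

    label variable q12_sex "Sex of respondent"
    label define sexlbl 1 "Male" 2 "Female"
    label values q12_sex sexlbl              // attach the pre-coded response labels

    isid hhid                                // stops with an error if hhid does not uniquely identify records
    duplicates report hhid                   // summarises any duplicated identifiers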
Standard Data Cleaning Procedures
b) Unique identifiers, skip pattern checks (data logical tests): advanced logical tests (see the sketch below)
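A minimal sketch of one such logical (skip-pattern) test, assuming a hypothetical filter question q20_livestock (1 = yes, 0 = no) after which q21_herd_size should be skipped when the answer is no:

    assert missing(q21_herd_size) if q20_livestock == 0    // fails if the skip pattern was violated
    * list the offending records so they can be checked against the questionnaire
    list hhid q20_livestock q21_herd_size if q20_livestock == 0 & !missing(q21_herd_size)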
Standard Data Cleaning Procedures
c) Unique identifiers, skip pattern checks (data logical tests): advanced, splitting (one reading is sketched below)
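If “splitting” here means saving module-level files out of one cleaned survey, a minimal sketch could look like this (the section prefixes a_* and b_* and the file names are hypothetical):

    use "survey_clean.dta", clear
    preserve
        keep hhid a_*                        // identifier plus section A variables
        save "section_a.dta", replace
    restore
    keep hhid b_*                            // identifier plus section B variables
    save "section_b.dta", replace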
Standard Data Cleaning Procedures
d) “Massaging” data, used for data cleaning and analysis (extracting datasets); see the sketch after this list:
o Reshaping
o Collapsing
o Merging or
o Appending datasets
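A minimal sketch of the four operations, with hypothetical file and variable names (hhid, plot_id, area1, area2):

    * Reshaping: one column per season (area1, area2) -> one row per plot and season
    use "plots.dta", clear
    reshape long area, i(plot_id) j(season)

    * Collapsing: plot-season rows -> total area per household and season
    collapse (sum) area, by(hhid season)

    * Merging: attach household characteristics to the household-season totals
    merge m:1 hhid using "household.dta", nogenerate

    * Appending: stack the equivalent file from another survey round underneath
    append using "household_area_2012.dta"
    save "household_area_all_rounds.dta", replace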
Standard Data Cleaning Procedures
[Workflow diagram: data cleaning scripts; database/data on server; indicators extraction and analysis; data inconsistencies/errors in the data to be corrected]
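A minimal sketch of a cleaning script within that workflow; the paths, variable names and validity range are hypothetical:

    use "server/raw/survey_raw.dta", clear

    * flag inconsistencies/errors so they can be queried and corrected at source
    generate byte flag_age = !inrange(age, 0, 120)
    export excel hhid age using "queries/age_queries.xlsx" if flag_age, firstrow(variables) replace

    * apply the agreed corrections, kept in their own documented do-file
    do "corrections.do"

    * save the cleaned data that feeds indicators extraction and analysis
    save "server/clean/survey_clean.dta", replace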