Fellows Training FINCA Client Assessment Tool (FCAT): Data Cleaning Slides Incorporate Important Information from ORC Macro. 2006. Demographic and Health Survey Interviewer’s Manual.

Fellows Training

FINCA Client Assessment Tool (FCAT): Data Cleaning

Slides Incorporate Important Information from ORC Macro. 2006. Demographic and Health Survey Interviewer’s Manual.

FCAT 2009 Data Cleaning

1. Data Integrity

2. Data Formats in FCAT

3. Data Challenges in FCAT

Data Integrity

If Data is acceptable to use for statistical analysis, that means it has:

INTEGRITY

Test: Will researchers question the results of a study simply based upon the data set that was used?

Data Integrity (continued)

Data has integrity if it is valid and reliable

Internal validity
• The concept you are trying to capture should be accurately measured

External validity (also known as “generalizability”)
• What populations do your findings apply to?
• Does your sample represent the population?

Statistical validity
• Will statistical models yield valid results?

Reliability
• Can the results be replicated or repeated?

Good Data

Importance of good data:

• Accuracy in findings

• Helps direct policy and operations

• Contributes to development of products and services

Examples of Integrity: Recall

High Validity, Low Reliability (Measurement Error)

Example: Expenditure recall over long periods

Solution: Shorten periods, verify responses, reframe questions (health is better or worse than average?)

Examples of Integrity: Self-Reporting

Low Validity, High Reliability (Systematic Bias)

Example: 85% of motorists self-report that they are above-average drivers

Solution: Ask their friends to rate them

FCAT 2009 Data Cleaning

1. Data Integrity

2. Data Formats in FCAT

3. Data Challenges in FCAT

Data Formats in FCAT

Data is recorded in 5 different formats:

Format 1: Categorical
Non-overlapping, exclusive, and finite
Ex. Home Ownership
1. Owned
2. Leased
3. Privately rented
4. Government rented
5. Rent free
6. Squatted
7. Other, please write-in

Format 2: Ordinal/Scaled
Rated according to a given scale
Ex. Rate the loan application process:
1. Very difficult
2. Difficult
3. Easy
4. Very easy

Format 3: Binary
Yes or No

Data Formats in FCAT (Continued)

Data is recorded in 5 different formats:

Format 4: Write-ins
Text write-ins
Ex. Other, please write in response: ______
*Be aware of the type of response expected to avoid inconsistencies and outliers.

Format 5: Open-Ended
Number write-in
Ex. Food expenditures for the week: __ (in local currency)
Time to gather water: __ (minutes)
*Note: Always record units of measure

FCAT 2009 Data Cleaning

1. Data Integrity

2. Data Formats in FCAT

3. Data Challenges in FCAT

Data Challenges in FCAT

Inconsistent values

Outliers

Missing values

Calculated values

Others

Cleaning data

Data Cleaning

Data is only useable if it is properly cleaned

As the interviewer and the one familiar with the data, it is your job to ensure that the data is correct

Inconsistent Values

1. Definition: When a second response is made invalid (either impossible or simply inaccurate) by an earlier given answer

2. Examples (shaded cells show inconsistencies):

Continue w/ FINCA? (1=Yes, 2=No) | Who made the decision to leave? | Why did FINCA or Village Bank ask you to leave? | Do you plan to return in the future? (1=Yes, 2=No)
2 | Village Bank | Client defaulted | 1
2 | Client | N/A | N/A
1 | N/A | Client defaulted | N/A

3. Treatment:
a. Filter
b. Annotate
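The filter-and-annotate treatment for inconsistent values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`continue`, `who_decided`, `why_asked_to_leave`, `plans_to_return`) are hypothetical, and the single rule checked here (a client who continues with FINCA should have skipped every exit question) is an assumed reading of the survey's skip logic, not the official FCAT rule set.

```python
# Column names are hypothetical; the slide shows only the question wording.
EXIT_QUESTIONS = ("who_decided", "why_asked_to_leave", "plans_to_return")
SKIP = "N/A"  # marker for a legitimately skipped question


def find_inconsistencies(row):
    """Return notes for exit answers that contradict 'continue' (1=Yes, 2=No)."""
    notes = []
    if row["continue"] == 1:
        # Assumed rule: a continuing client skips every exit question.
        for column in EXIT_QUESTIONS:
            if row[column] != SKIP:
                notes.append(f"{column} answered although client continues")
    return notes


rows = [
    {"continue": 2, "who_decided": "Client",
     "why_asked_to_leave": SKIP, "plans_to_return": 2},
    {"continue": 1, "who_decided": SKIP,
     "why_asked_to_leave": "Client defaulted", "plans_to_return": SKIP},
]

# Filter: keep only the rows that need attention, together with their notes.
flagged = []
for index, row in enumerate(rows):
    notes = find_inconsistencies(row)
    if notes:
        flagged.append((index, notes))

for index, notes in flagged:
    print(index, notes)
```

The notes double as the annotation: they can be copied into the tracking document next to the row number.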

Outliers

1. Definition: Response outside the range of values

Outliers (continued)

2. Examples:

1) In general how is your health at this time?
1. Excellent
2. Good
3. Poor
4. Very Poor
• Answer: 7
Outlier: Response is out of answer range

2) How much does your household spend per week for food?
• Answer in Ecuador: $10,000
Outlier: Response amount is very unlikely

3. Treatment:
a. Filter
b. Annotate
c. Correct value, if possible (e.g. mean of positive values)

Special mention: Inliers. These occur when a question calls for integers and the recorded answer is a decimal, e.g. recording a child’s age as .5 if he is yet to complete a year.
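The two kinds of outlier, and the inlier case, can each be flagged with a simple check. The sketch below is a minimal illustration: the health codes 1 to 4 come from the example question, but the plausible range for weekly food spending (0 to 500) is an invented placeholder that each country team would need to set for itself.

```python
HEALTH_CODES = {1, 2, 3, 4}  # Excellent, Good, Poor, Very Poor


def is_code_outlier(answer, valid_codes):
    """True when a coded answer is not one of the allowed codes."""
    return answer not in valid_codes


def is_amount_outlier(amount, low, high):
    """True when a numeric answer falls outside a plausible range."""
    return not (low <= amount <= high)


def is_inlier(answer):
    """True when a whole number was expected but a fraction was recorded."""
    return not float(answer).is_integer()


# Hypothetical plausible range for weekly food spending, in USD.
FOOD_LOW, FOOD_HIGH = 0, 500

print(is_code_outlier(7, HEALTH_CODES))                  # health answer of 7
print(is_amount_outlier(10_000, FOOD_LOW, FOOD_HIGH))    # $10,000 per week
print(is_inlier(0.5))                                    # child's age of .5
```

Flagged answers still need a decision from someone who knows the interview: filter them out, annotate them, or correct them if the true value can be established.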

Missing Values

1. Definition:
a. Stated information not recorded, not legitimate skips
b. _____

2. Examples (in shaded cells):

Columns: Continue w/ FINCA? (1=Yes, 2=No) | Who made the decision to leave? | Why did FINCA or Village Bank ask you to leave? | Do you plan to return in the future? (1=Yes, 2=No)

2 Village Bank Group dissolved 1

2 Client defaulted 2

1 N/A N/A

3. Treatment:
a. Filter
b. Annotate
c. Correct value, if possible

Ex. If you can distinguish between missing values and legitimate skips, replace missing values with the mean over a defined sample (e.g. branch or region).
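Replacing missing values with a group mean can be sketched as follows. The example assumes that a missing value is stored as `None` and a legitimate skip as the text `N/A`; both markers, and the column names `branch` and `food_spend`, are hypothetical. The function works on copies so the raw data stays untouched, and it annotates every value it fills in.

```python
from statistics import mean

MISSING = None  # assumed marker for information that was not recorded
SKIP = "N/A"    # assumed marker for a legitimate skip


def impute_by_group(rows, group_key, value_key):
    """Return copies of rows with missing values replaced by the group mean."""
    observed = {}
    for row in rows:
        value = row[value_key]
        if value is not MISSING and value != SKIP:
            observed.setdefault(row[group_key], []).append(value)

    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the raw record is never modified
        group_values = observed.get(row[group_key])
        if row[value_key] is MISSING and group_values:
            new_row[value_key] = mean(group_values)
            new_row["note"] = f"{value_key} imputed with {group_key} mean"
        cleaned.append(new_row)
    return cleaned


raw = [
    {"branch": "A", "food_spend": 10},
    {"branch": "A", "food_spend": 20},
    {"branch": "A", "food_spend": MISSING},
    {"branch": "B", "food_spend": SKIP},
]
cleaned = impute_by_group(raw, "branch", "food_spend")
print(cleaned[2])
```

Legitimate skips are left alone, and a missing value in a group with no observed answers stays missing rather than being guessed.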

Calculated Values and Other Challenges

Calculated Values

1. Definition: Data derived from sub-aggregated variables

2. Examples: DPCE, PPP converted from local currency unit

3. Treatment:
• Record units of measure
• Check formulas

Others

Text is text; numbers are numbers. Do not write in text responses for columns that accept only numbers. Please use the “Other” or “Notes” columns for this purpose.
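One way to enforce "text is text; numbers are numbers" is to separate each raw entry into a numeric value and a note. The helper below is a hypothetical sketch, not part of any FCAT template: anything that does not parse as a finite number is returned as text so it can be moved to the “Notes” column.

```python
import math


def split_numeric(raw):
    """Return (number, note): a finite number, or None plus the original text."""
    text = str(raw).strip()
    try:
        value = float(text)
    except ValueError:
        return None, text
    if not math.isfinite(value):
        # Entries such as "nan" or "inf" parse as floats but are not usable.
        return None, text
    return value, ""


for entry in ["45", "about 45", "45 minutes", " 12.5 "]:
    print(repr(entry), split_numeric(entry))
```

An entry such as "45 minutes" is deliberately not converted to 45: the unit belongs in the column definition, and the interviewer should confirm the number before it enters a numeric column.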

Cleaning Data – Do’s

Frequent and periodic
• End of the day
• Much easier to clean 20 interviews than 80 or 320!

Smaller samples are easier to manage
• Avoids locality effects on false identification
• Avoids contamination of derived variables (e.g. DPCE)

Keep two files:
• Raw data
• Cleaned data
• Always keep a back-up as well

Record and annotate all data issues in a log or tracking document

Techniques:
• Filtering
• Histograms
• Pivot tables

In other words, do not let data problems snowball
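The habits above (keep raw and cleaned data separate, log every issue, use counts to spot problems) can be combined in a short script. This is a sketch under assumptions: the `health` column name and the log fields are hypothetical, and a frequency count stands in for the histogram or pivot table a fellow would build in a spreadsheet.

```python
import copy
from collections import Counter

VALID_HEALTH_CODES = {1, 2, 3, 4}


def tally(rows, column):
    """Count how often each answer occurs, like a one-column pivot table."""
    return Counter(row[column] for row in rows)


def clean_with_log(raw_rows):
    """Return (cleaned copy, issue log); the raw rows are never modified."""
    cleaned = copy.deepcopy(raw_rows)
    log = []
    for index, row in enumerate(cleaned):
        if row["health"] not in VALID_HEALTH_CODES:
            log.append({
                "row": index,
                "column": "health",
                "original": row["health"],
                "action": "set to None for follow-up",
            })
            row["health"] = None
    return cleaned, log


raw = [{"health": 2}, {"health": 7}, {"health": 2}]
counts = tally(raw, "health")   # the stray code 7 shows up in the counts
cleaned, log = clean_with_log(raw)
print(counts)
print(log)
```

Running this at the end of each interview day keeps the log short enough to review while the interviews are still fresh.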

Client ID Collection

Please collect Client ID information from EACH client interviewed.

It is not a violation of privacy. You can assure clients that their personal information will not harm them in any way, and that their responses will be used to help make decisions that improve loan products and services.

SurveyID

For Entry into the Data Warehouse, we need to create a PRIMARY KEY for the Main Form to link to the cleaned Subform. The code appears like this when finished:

DC20083101 (2-letter country code, the year collected, and an overall interview number from one fellow)

Fellows should give each other a number (1, 2, or 3), and then should add a column in BOTH the main form AND the Household Subform.

SurveyID (cont’d)

Fellow 1 should take his/her overall individual interview number and add 1000 to it, fellow 2 should add 2000, and fellow 3 should add 3000.

Ecuador=EC
Zambia=ZM

Therefore, the 14th interview performed by Mexico Fellow #2 would be MX20092014. The same SurveyID must appear in the main form AND the HHSubform. Please maintain this convention throughout the Fellowship.
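The SurveyID convention is easy to get wrong by hand, so a small helper can build it. The function name and its checks are hypothetical; in particular, the limit of 999 interviews per fellow is an assumption made here so that the leading digit of the last block always identifies the fellow.

```python
def make_survey_id(country_code, year, fellow_number, interview_number):
    """Build a SurveyID: country code + year + (fellow * 1000 + interview)."""
    if len(country_code) != 2 or not country_code.isalpha():
        raise ValueError("country code must be two letters, e.g. EC or ZM")
    if fellow_number not in (1, 2, 3):
        raise ValueError("fellow number must be 1, 2, or 3")
    # Assumed limit: above 999 the fellow prefix digit would change.
    if not 1 <= interview_number <= 999:
        raise ValueError("interview number must be between 1 and 999")
    return f"{country_code.upper()}{year}{fellow_number * 1000 + interview_number}"


# The 14th interview performed by Mexico Fellow #2 in 2009.
print(make_survey_id("MX", 2009, 2, 14))   # MX20092014
```

Generating the ID once and pasting it into both the main form and the Household Subform avoids mismatched keys when the two are linked.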

Clean Data

Data is “clean” if:

• All categorical codes match those in the survey design sheet
* Ex.: Match drinking water sources with codes 1-15

• All ordinal data are represented as whole numbers
* Ex.: Do not have 3.4 years of education

• Outliers have been justified

• Missing data have been correctly annotated

Questions?