The Use of Administrative Sources for Statistical Purposes: Matching and Integrating Data from Different Sources

Upload: didier

Post on 07-Jan-2016

Page 1: The Use of Administrative Sources for Statistical Purposes

The Use of Administrative Sources for Statistical Purposes

Matching and Integrating Data from Different Sources

Page 2: The Use of Administrative Sources for Statistical Purposes

What is Matching?

• Linking data from different sources

• Exact Matching - linking records from two or more sources, often using common identifiers

• Probabilistic Matching - determining the probability that records from different sources should match, using a combination of variables
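
As an illustration of exact matching, the sketch below links records from two sources that share a common identifier. The field names (`ref`, `name`, `turnover`) and the sample records are hypothetical and are not taken from the presentation.

```python
def exact_match(source_a, source_b, key):
    """Link records from two sources that share the same value of `key`."""
    # Index the second source by the identifier so each lookup is cheap.
    index_b = {}
    for record in source_b:
        index_b.setdefault(record[key], []).append(record)

    links = []
    for record in source_a:
        for candidate in index_b.get(record[key], []):
            links.append((record, candidate))
    return links


# Hypothetical records: an administrative source and a survey source.
tax_records = [
    {"ref": "1001", "name": "Acme Ltd"},
    {"ref": "1002", "name": "Vale & Co"},
]
survey_records = [
    {"ref": "1002", "turnover": 250},
    {"ref": "1003", "turnover": 90},
]

links = exact_match(tax_records, survey_records, "ref")
for admin, survey in links:
    print(admin["name"], survey["turnover"])
```

Only records whose identifier appears in both sources are linked. Probabilistic matching is needed when no such identifier exists or when it is unreliable.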

Page 3: The Use of Administrative Sources for Statistical Purposes

Why Match?

• Combining data sets can give more information than is available from individual data sets

• Reduce response burden

• Build efficient sampling frames

• Impute missing data

• To allow data integration

Page 4: The Use of Administrative Sources for Statistical Purposes

Models for Data Integration

• Statistical registers

• Statistics from mixed source models

– Split population model

– Split data approach

– Pre-filled questionnaires

– Using administrative data for non-responders

– Using administrative data for estimation

• Register-based statistical systems

Page 5: The Use of Administrative Sources for Statistical Purposes

Statistical Registers

[Diagram: a Statistical Register connected to Administrative Sources, Survey data, Geographic information systems, Other Statistical Registers and Satellite registers]

Page 6: The Use of Administrative Sources for Statistical Purposes

Mixed Source Models

• Traditionally one statistical output was based on one statistical survey

• Very little integration or coherence

• Now there is a move towards more integrated statistical systems

• Outputs are based on several sources

Page 7: The Use of Administrative Sources for Statistical Purposes

Split Population Model

• One source of data for each unit

• Different sources for different parts of the population

Page 8: The Use of Administrative Sources for Statistical Purposes

Split Population Model

[Diagram elements: Population of Statistical Units; Administrative Data; Statistical Survey; Estimation; Statistics]

Page 9: The Use of Administrative Sources for Statistical Purposes

Split Data Approach

• Several sources of data for each unit

[Diagram elements: Unit 1, Unit 2, Unit 3 ... Unit n; Administrative Data; Statistical Survey; Estimation; Statistics]

Page 10: The Use of Administrative Sources for Statistical Purposes

Pre-filled Questionnaires

• Survey questionnaires are pre-filled with data from other sources where possible

• Respondents check that the information is correct, rather than completing a blank questionnaire

• This reduces response burden ... but may introduce a bias!

Page 11: The Use of Administrative Sources for Statistical Purposes

Example

Manufacture of wooden furniture

Page 12: The Use of Administrative Sources for Statistical Purposes

Using Administrative Data for Non-responders

• Administrative data are used directly to supply variables for units that do not respond to a statistical survey

• Often used for less important units, so that response-chasing resources can be focused on key units

Page 13: The Use of Administrative Sources for Statistical Purposes

Using Administrative Data for Estimation

• Administrative data are used as auxiliary variables to improve the accuracy of statistical estimation

• Often used to estimate for small sub-populations or small geographic areas
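
One common way to use an administrative variable as an auxiliary variable is a ratio estimator. The sketch below is only an illustration of that general technique, assuming a simple random sample; the presentation does not say which estimator is used, and all figures are hypothetical.

```python
def ratio_estimate(sample, register_total_x):
    """Estimate the population total of y using auxiliary variable x.

    `sample` is a list of (x, y) pairs: x is the administrative value
    (known for every unit), y is the value collected by the survey.
    `register_total_x` is the total of x over the whole population.
    """
    sum_x = sum(x for x, _ in sample)
    sum_y = sum(y for _, y in sample)
    if sum_x == 0:
        raise ValueError("auxiliary variable sums to zero in the sample")
    # Scale the known administrative total by the sample ratio y/x.
    return register_total_x * sum_y / sum_x


# Hypothetical sample: (administrative turnover, surveyed turnover).
sample = [(10, 12), (20, 22), (30, 36)]
estimate = ratio_estimate(sample, register_total_x=600)
print(estimate)  # 700.0
```

The same calculation can be run separately for a small area or sub-population, provided the administrative total is available at that level.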

Page 14: The Use of Administrative Sources for Statistical Purposes

Register-based Statistical Systems

[Diagram elements: Administrative Sources; Statistical Surveys; Statistical Registers (Population Register, Business Register, Real Estate Register, Jobs and Other Activities); Statistical Outputs]

Page 15: The Use of Administrative Sources for Statistical Purposes

Matching Terminology

Page 16: The Use of Administrative Sources for Statistical Purposes

Matching Keys

• Data fields used for matching, e.g.

– Reference Number

– Name

– Address

– Postcode/Zip Code/Area Code

– Birth/Death Date

– Classification (e.g. ISIC, ISCO)

– Other variables (age, occupation, etc.)

Page 17: The Use of Administrative Sources for Statistical Purposes

Distinguishing Power 1

• This relates to the uniqueness of the matching key

• Some keys or values have higher distinguishing powers than others

• High - reference number, full name, full address

• Low - sex, age, city, nationality

Page 18: The Use of Administrative Sources for Statistical Purposes

Distinguishing Power 2

• Can depend on level of detail

– Born 1960, Paris

– Born 23 June 1960, rue de l’Eglise, Montmartre, Paris

• Choose variables, or combinations of variables, with the highest distinguishing power
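
One rough way to compare the distinguishing power of candidate keys is to measure how many records each key identifies uniquely. The sketch below assumes that simple measure, which is not defined in the presentation; the records and field names are hypothetical.

```python
from collections import Counter


def distinguishing_power(records, fields):
    """Share of records whose combination of `fields` is unique in the file."""
    keys = [tuple(record[field] for field in fields) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical person records.
people = [
    {"sex": "F", "birth_year": 1960, "birth_date": "1960-06-23", "city": "Paris"},
    {"sex": "F", "birth_year": 1960, "birth_date": "1960-11-02", "city": "Paris"},
    {"sex": "M", "birth_year": 1960, "birth_date": "1960-06-23", "city": "Paris"},
    {"sex": "M", "birth_year": 1971, "birth_date": "1971-01-15", "city": "Lyon"},
]

low_detail = distinguishing_power(people, ["birth_year", "city"])
high_detail = distinguishing_power(people, ["sex", "birth_date", "city"])
print(low_detail, high_detail)  # 0.25 1.0
```

In this toy file, year and city of birth identify only one record in four, while the more detailed combination identifies every record.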

Page 19: The Use of Administrative Sources for Statistical Purposes

Match

• A pair that represents the same entity in reality

A A

Page 20: The Use of Administrative Sources for Statistical Purposes

Non-match

• A pair that represents two different entities in reality

A B

Page 21: The Use of Administrative Sources for Statistical Purposes

Possible Match

• A pair for which there is not enough information to determine whether it is a match or a non-match

A a

Page 22: The Use of Administrative Sources for Statistical Purposes

False Match

• A pair wrongly designated as a match in the matching process (false positive)

A = B

Page 23: The Use of Administrative Sources for Statistical Purposes

False Non-match

• A pair which is a match in reality, but is designated as a non-match in the matching process (false negative)

A A

Page 24: The Use of Administrative Sources for Statistical Purposes

Matching Techniques

Page 25: The Use of Administrative Sources for Statistical Purposes

Clerical Matching

• Requires clerical resources

- Expensive

- Inconsistent

- Slow

- Intelligent

Page 26: The Use of Administrative Sources for Statistical Purposes

Automatic Matching

• Minimises human intervention

- Cheap

- Consistent

- Quick

- Limited intelligence

Page 27: The Use of Administrative Sources for Statistical Purposes

The Solution

• Use an automatic matching tool to find obvious matches and non-matches

• Refer possible matches to specialist staff

• Maximise automatic matching rates and minimise clerical intervention

Page 28: The Use of Administrative Sources for Statistical Purposes

How Automatic Matching Works

Page 29: The Use of Administrative Sources for Statistical Purposes

Standardisation

• Generally used for text variables

• Abbreviations and common terms are replaced with standard text

• Common variations of names are standardised

• Postal codes, dates of birth etc. are given a common format
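
The standardisation steps above can be sketched as follows. The abbreviation and name-variant tables are small hypothetical examples (a real tool would use much larger, locally tuned lists), and the postcode rule assumes the UK format, in which the last three characters form the second part of the code.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"ltd": "limited", "rd": "road", "ave": "avenue", "co": "company"}
NAME_VARIANTS = {"steve": "stephen", "steven": "stephen", "mike": "michael"}


def standardise_text(text):
    """Lower-case, strip punctuation, then expand abbreviations and name variants."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [ABBREVIATIONS.get(word, word) for word in words]
    words = [NAME_VARIANTS.get(word, word) for word in words]
    return " ".join(words)


def standardise_postcode(postcode):
    """Give UK-style postcodes a common format: upper case, one space before the last 3 characters."""
    compact = re.sub(r"[^A-Za-z0-9]", "", postcode).upper()
    return compact[:-3] + " " + compact[-3:] if len(compact) > 3 else compact


print(standardise_text("Steven Vale Ltd."))   # stephen vale limited
print(standardise_text("12 Acacia Ave."))     # 12 acacia avenue
print(standardise_postcode("np10 8xg"))       # NP10 8XG
```

Both files must be standardised with the same rules before they are compared, otherwise the standardisation itself introduces differences.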

Page 30: The Use of Administrative Sources for Statistical Purposes

Blocking

• If the file to be matched against is very large, it may be necessary to break it down into smaller blocks to save processing time

– e.g. if the record to be matched is in a certain town, only match against other records from that town, rather than all records for the whole country

Page 31: The Use of Administrative Sources for Statistical Purposes

Blocking

• Blocking must be used carefully, or good matches will be missed

• Experiment with different blocking criteria on a small test data set

• Possible to have two or more passes with different blocking criteria to maximise matches
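
A minimal sketch of blocking with more than one pass is shown below. Records are compared only when they agree on the blocking variable, and a second pass with a different variable recovers pairs that the first pass missed. The field names (`id`, `town`, `postcode`) and the records are hypothetical.

```python
from collections import defaultdict


def candidate_pairs(file_a, file_b, block_key):
    """Yield only the pairs of records that fall in the same block."""
    blocks = defaultdict(list)
    for record in file_b:
        blocks[record[block_key]].append(record)
    for record in file_a:
        for candidate in blocks.get(record[block_key], []):
            yield record, candidate


def multi_pass_pairs(file_a, file_b, block_keys):
    """Run one pass per blocking variable and keep each pair once."""
    seen = set()
    pairs = []
    for block_key in block_keys:
        for a, b in candidate_pairs(file_a, file_b, block_key):
            pair_id = (a["id"], b["id"])
            if pair_id not in seen:
                seen.add(pair_id)
                pairs.append((a, b))
    return pairs


# Hypothetical data: record b2 has the wrong town but the right postcode.
file_a = [{"id": "a1", "town": "Newport", "postcode": "NP10 8XG"}]
file_b = [
    {"id": "b1", "town": "Newport", "postcode": "NP10 8XG"},
    {"id": "b2", "town": "Cardiff", "postcode": "NP10 8XG"},
    {"id": "b3", "town": "Cardiff", "postcode": "CF10 1AA"},
]

one_pass = list(candidate_pairs(file_a, file_b, "town"))
two_pass = multi_pass_pairs(file_a, file_b, ["town", "postcode"])
print(len(one_pass), len(two_pass))  # 1 2
```

Blocking on town alone would never compare a1 with b2. The second pass on postcode brings that pair back as a candidate without comparing every record against every other.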

Page 32: The Use of Administrative Sources for Statistical Purposes

Parsing

• Names and words are broken down into matching keys, e.g.

– Steven Vale = stafan val

– Stephen Vael = stafan val

• Improves success rates by allowing matching where variables are not identical

Page 33: The Use of Administrative Sources for Statistical Purposes

Scoring

• Matched pairs are given a score based on how closely the matching variables agree

• Scores determine matches, possible matches and non-matches

Page 34: The Use of Administrative Sources for Statistical Purposes

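Score

[Diagram: a score scale running from 100 at the top to 0 at the bottom, with two thresholds, x (upper) and y (lower). Scores above x are Matches, scores between y and x are Possible Matches, and scores below y are Non-matches]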

Page 35: The Use of Administrative Sources for Statistical Purposes

How to Determine X and Y

• Mathematical methods, e.g. Fellegi / Sunter method

• Trial and Error

• Data contents and quality may change over time so periodic reviews are necessary
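
In the Fellegi / Sunter approach as it is usually presented, each matching variable gets an agreement weight and a disagreement weight derived from two probabilities: m, the probability that the variable agrees for a true match, and u, the probability that it agrees for a non-match. The sketch below computes those weights; the m and u values are hypothetical and would normally be estimated from the data.

```python
import math


def agreement_weight(m, u):
    """Weight added when a variable agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when a variable disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def pair_weight(agreements, m_u):
    """Sum the weights over all compared variables for one pair of records."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_u[field]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total


# Hypothetical probabilities: surname rarely agrees by chance, sex often does.
m_u = {"surname": (0.95, 0.01), "sex": (0.98, 0.5)}

both_agree = pair_weight({"surname": True, "sex": True}, m_u)
both_differ = pair_weight({"surname": False, "sex": False}, m_u)
print(round(both_agree, 2), round(both_differ, 2))
```

A variable with high distinguishing power (small u) earns a large agreement weight. The thresholds X and Y are then set on the total weight, according to the error rates that can be tolerated.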

Page 36: The Use of Administrative Sources for Statistical Purposes

Enhancements

• Re-matching files at a later date reduces false non-matches (if at least one file is updated)

• Link to data cleaning software, e.g. address standardisation

Page 37: The Use of Administrative Sources for Statistical Purposes

Matching Software

• Commercial products e.g. SSAName3, Trillium, Automatch

• In-house products e.g. ACTR (Statistics Canada)

• Open-source products e.g. FEBRL

• No “off the shelf” products - all require tuning to specific needs

Page 38: The Use of Administrative Sources for Statistical Purposes

Internet Applications

• Google (and other search engines)

– www.google.com

• Cascot – an automatic coding tool based on text matching

– http://www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/choose_classificatio/

• Address finders e.g. Postes Canada

– http://www.postescanada.ca/tools/pcl/bin/advanced-f.asp

Page 39: The Use of Administrative Sources for Statistical Purposes

Software Applications

• Trigram method applied in SAS code (freeware) for matching in the Eurostat business demography project

• Similar approach in UNECE “Data Locator” search tool

• Works by comparing groups of 3 letters, and counting matching groups

Page 40: The Use of Administrative Sources for Statistical Purposes

Trigram Method

• Match “Steven Vale”

– Ste/tev/eve/ven/en /n V/ Va/Val/ale

• To “Stephen Vale”

– Ste/tep/eph/phe/hen/en /n V/ Va/Val/ale

– 6 matching trigrams

• And “Stephen Vael”

– Ste/tep/eph/phe/hen/en /n V/ Va/Vae/ael

– 4 matching trigrams

• Parsing would improve these scores
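
The trigram counts in this example can be reproduced with a few lines of code. This is a minimal Python sketch, not the SAS code mentioned earlier; it treats each name as a set of distinct three-character groups (spaces included, case preserved), which is an assumption about how repeated trigrams are handled.

```python
def trigrams(text):
    """Return the set of all groups of 3 consecutive characters in `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def matching_trigrams(a, b):
    """Count the trigrams that two strings have in common."""
    return len(trigrams(a) & trigrams(b))


print(matching_trigrams("Steven Vale", "Stephen Vale"))  # 6
print(matching_trigrams("Steven Vale", "Stephen Vael"))  # 4
```

A raw count favours long strings, so in practice it is often divided by the number of trigrams in one or both strings to give a comparable score.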

Page 41: The Use of Administrative Sources for Statistical Purposes

Matching in Practice

Page 42: The Use of Administrative Sources for Statistical Purposes

Matching Records Without a Common Identifier

The UK Experience

by

Steven Vale (Eurostat / ONS)

and Mike Villars (ONS)

Page 43: The Use of Administrative Sources for Statistical Purposes

The Challenge

• The UK statistical business register relies on several administrative sources

• It needs to match records from these different sources to avoid duplication

• There is no system of common business identification numbers in the UK

Page 44: The Use of Administrative Sources for Statistical Purposes

The Solution

• Records are matched using business name, address and post code

• The matching software used is Identity Systems / SSA-NAME3

• Matching is mainly automatic via batch processing, but a user interface also allows the possibility of clerical matching

Page 45: The Use of Administrative Sources for Statistical Purposes

Batch Processing 1

• Name is compressed to form a namekey; the last word of the name is the major key

• Major keys are checked against those of existing records at decreasing levels of accuracy until possible matches are found

• The name, address and post codes of possible matches are compared, and a score out of 100 is calculated

Page 46: The Use of Administrative Sources for Statistical Purposes

Batch Processing 2

• If the score is >79 it is considered to be a definite match

• If the score is between 60 and 79 it is considered a possible match, and is reported for clerical checking

• If the score is <60 it is considered a non-match
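
The three score bands can be written as a small classification function. The sketch assumes whole-number scores and treats "between 60 and 79" as inclusive at both ends; the function and constant names are illustrative.

```python
DEFINITE_MATCH_ABOVE = 79   # scores above this are definite matches
POSSIBLE_MATCH_FROM = 60    # scores from this value up to 79 are possible matches


def classify(score):
    """Assign a matching score out of 100 to one of the three bands."""
    if score > DEFINITE_MATCH_ABOVE:
        return "definite match"
    if score >= POSSIBLE_MATCH_FROM:
        return "possible match"
    return "non-match"


for score in (95, 79, 60, 59):
    print(score, classify(score))
```

Only the middle band generates clerical work, so moving either threshold changes the balance between checking effort and the risk of false matches or false non-matches.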

Page 47: The Use of Administrative Sources for Statistical Purposes

Clerical Processing

• Possible matches are checked and linked where appropriate using an on-line system

• Non-matches with employment of more than 9 are checked - if no link is found they are sent a Business Register Survey questionnaire

• Samples of definite matches and smaller non-matches are checked periodically

Page 48: The Use of Administrative Sources for Statistical Purposes

Problems Encountered 1

• “Trading as” or “T/A” in the name, e.g. Mike Villars T/A Mike’s Coffee Bar. Bar would be the major key, but would give too many matches as there are thousands of bars in the UK.

• Solution - split the name so that the last word prior to “T/A”, e.g. Villars, is the major key, improving the quality of matches.
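
The rule described above can be sketched as follows: take the last word of the name as the major key, but cut the name at "T/A" or "trading as" first. This is only an illustration of the rule as stated on the slide, not the logic of the matching software itself; the function name and the regular expressions are assumptions.

```python
import re

# Matches "T/A" or "trading as" surrounded by spaces, in any letter case.
TRADING_AS = re.compile(r"\s+(?:t/a|trading as)\s+", re.IGNORECASE)


def major_key(name):
    """Return the last word of the name, ignoring anything after 'T/A'."""
    legal_name = TRADING_AS.split(name, maxsplit=1)[0]
    words = re.findall(r"[A-Za-z0-9']+", legal_name)
    return words[-1].upper() if words else ""


print(major_key("Mike's Coffee Bar"))                   # BAR
print(major_key("Mike Villars T/A Mike's Coffee Bar"))  # VILLARS
```

Without the split, both names above would share the very common key BAR. With it, the second name is searched under the far rarer key VILLARS.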

Page 49: The Use of Administrative Sources for Statistical Purposes

Problems Encountered 2

• The number of small non-matched units grows over time leading to increasing duplication

• Checking these units is labour intensive

• Solutions

– Fine tune matching parameters

– Re-run batch processes

– Use extra information e.g. legal form / company number where available

Page 50: The Use of Administrative Sources for Statistical Purposes

Future Developments

• Clean and correct addresses prior to matching using “QuickAddress” and the Post Office Address File

• Links to geographical referencing

• Business Index - plans to link registers of businesses across UK government departments

• Unique identifiers?

Page 51: The Use of Administrative Sources for Statistical Purposes

One Number Census Matching

by

Ben Humberstone (ONS)

Page 52: The Use of Administrative Sources for Statistical Purposes

One Number Census

• Aim: To estimate and adjust for underenumeration in the 2001 Census

• Census Coverage Survey (CCS) - 1% sample stratified by hard-to-count area

– 320,000 households

– 500,000 people

• 101 Estimation Areas in England and Wales

Page 53: The Use of Administrative Sources for Statistical Purposes

ONC Process

[Flow diagram elements: Census; CCS; Matching; Quality Assurance; Imputation; Dual System Estimation; Adjusted Census DB]

Page 54: The Use of Administrative Sources for Statistical Purposes

ONC Matching Process

[Flow diagram elements: CCS; Census; Exact Matching; Probability Matching; Clerical Review; Clerical Matching; Quality Assurance; Matched Records]

Key: Green = CCS; Blue = Census; Red = Matched pair; Italics = Automated

Page 55: The Use of Administrative Sources for Statistical Purposes

Data Preparation

• Names

– Soundex used to bring together different spellings of the same name

  • Anderson, Andersen = A536

  • Smith, Smyth = S530

• Addresses

– Converted to a numeric/alpha string

  • 12a Acacia Avenue = 12AA

  • Top Flat, 12 Acacia Ave. = 12AA
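
The Soundex codes quoted for the names can be reproduced with the standard (American) Soundex rules. The sketch below is a plain implementation of those rules, not the code used in the One Number Census. The address conversion is not sketched because the slide does not give its exact rule.

```python
# Standard Soundex letter groups; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code: first letter plus three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the run, so a repeat is coded again
    return (first + "".join(digits) + "000")[:4]


for surname in ("Anderson", "Andersen", "Smith", "Smyth"):
    print(surname, soundex(surname))
# Anderson A536, Andersen A536, Smith S530, Smyth S530
```

Names with the same code are brought together for comparison even though their spellings differ.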

Page 56: The Use of Administrative Sources for Statistical Purposes

Exact Matching

• Data “blocked” at postcode level

• Households matched on key variables

– surname, address name/number, accommodation type, number of people

• Individuals from within matched households matched

– forename, surname, day of birth, month of birth, marital status, relationship to head of household

Page 57: The Use of Administrative Sources for Statistical Purposes

Probability Matching

• Block by postcode

• Compare CCS with all Census households in postcode + neighbouring postcodes using key variables

• Create matrix according to match weight

• Repeat for people within matched households

CCS             Census           Cum. Weight
1 Acacia Ave    1 Acacia Ave     1450
1 Acacia Ave    1a Acacia Ave    740
1 Acacia Ave    11 Acacia Ave    220
1 Acacia Ave    12 Acacia Ave    112

Page 58: The Use of Administrative Sources for Statistical Purposes

Probability Matching

• Matching weights

• Apply threshold to cumulative weights

• 2 thresholds

– High probability matches

– Low probability matches

Probability weights for accommodation type (rows: CCS, columns: Census)

CCS \ Census     Detached   Semi-detached   Terrace
Detached           +10          +1            -5
Semi-detached       -1          +7            -3
Terrace            -10          +5            +6
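
The sketch below shows how a weight table of this kind could feed a cumulative weight and the two thresholds. It reads the accommodation-type weights with the CCS value as the row and the Census value as the column, which is an interpretation of the slide. The threshold values and the weight from the other variables are hypothetical.

```python
# Accommodation-type weights, keyed by (CCS value, Census value).
ACCOMMODATION_WEIGHTS = {
    ("Detached", "Detached"): 10,
    ("Detached", "Semi-detached"): 1,
    ("Detached", "Terrace"): -5,
    ("Semi-detached", "Detached"): -1,
    ("Semi-detached", "Semi-detached"): 7,
    ("Semi-detached", "Terrace"): -3,
    ("Terrace", "Detached"): -10,
    ("Terrace", "Semi-detached"): 5,
    ("Terrace", "Terrace"): 6,
}

# Hypothetical thresholds on the cumulative weight.
HIGH_THRESHOLD = 20
LOW_THRESHOLD = 5


def cumulative_weight(ccs, census, other_weights=0):
    """Add the accommodation-type weight to the weights from other variables."""
    key = (ccs["accommodation"], census["accommodation"])
    return ACCOMMODATION_WEIGHTS[key] + other_weights


def classify_weight(weight, high=HIGH_THRESHOLD, low=LOW_THRESHOLD):
    """Apply the two thresholds to a cumulative weight."""
    if weight >= high:
        return "high probability match"
    if weight >= low:
        return "low probability match"
    return "unmatched"


ccs_household = {"accommodation": "Semi-detached"}
census_household = {"accommodation": "Semi-detached"}
weight = cumulative_weight(ccs_household, census_household, other_weights=30)
print(weight, classify_weight(weight))  # 37 high probability match
```

Pairs falling between the two thresholds are the low probability matches, which go to clerical staff for review.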

Page 59: The Use of Administrative Sources for Statistical Purposes

Automatic Match Review

• Clerical role

• Matchers presented with all low probability matches

– Household matches

– Matched individuals within matched households

• Access to form images to check scanning

• Basic yes/no operation

Page 60: The Use of Administrative Sources for Statistical Purposes

Clerical Matching

• Clerical matching of all unmatched records

• Matchers - perform basic searches, match or defer

• Experts - carry out detailed searches on deferred records and review matches

• Quality assurance staff - review experts' work, including all unmatchable records, using estimation-area-wide searches

Page 61: The Use of Administrative Sources for Statistical Purposes

Quality Assurance

• Experts and Quality Assurance staff

• Double Matching

– Estimation area matched twice, independently

– Outputs compared, discrepancies checked

• Matching protocol

– Based on best practice

Page 62: The Use of Administrative Sources for Statistical Purposes

Resources

• 8 - 10 Matchers

• 4 - 5 Expert Matchers

• 2 - 3 Quality Assurance staff

• 3 Research Officers/Supervisors

• 1 Senior Research Officer

• Computer Assisted Matching System

Page 63: The Use of Administrative Sources for Statistical Purposes

Quality Assurance

• False negative rate: < 0.1%

• 1 Estimation area matched per day

England & Wales          Household   Person
Automatically Matched      58.8%      51.1%
Clerically Resolved        13.7%      11.4%
Clerically Matched         22.3%      30.7%
Unmatched CCS               5.0%       6.4%
Excluded CCS                0.2%       0.3%
Unmatched Census           12.8%      11.7%
Excluded Census             0.0%       0.0%

Page 64: The Use of Administrative Sources for Statistical Purposes

Group Discussion

Practical experiences of data matching