Rachel Shipsey, Operational Researcher | Methodology
Census Linkage
Census linkage: Advances in techniques
October 2019
Record linkage
Determining if two records belong to the same entity
Within a dataset or between multiple datasets
Entity can be a person, an event, a business etc.
Census matching 2021
People and households in Census and Census Coverage Survey (CCS)
Underpins the census estimation process
• Very high accuracy targets
• Shorter deadline: more automatic matching, less clerical
Refined matchkeys and improved probabilistic
Improve person matching
Person matching: Matchkeys
Variables on which a record pair must agree to be declared a match.
String comparators can be incorporated
e.g.
New variables e.g. Alphaname for online transposition errors
Jaro-Winkler as an alternative string comparator
Researched optimal matchkey combinations to maximize match rate and minimize false positives
Example matchkey: name Levenshtein similarity > 0.8; DoB matches exactly; postcode and house number match exactly
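A matchkey of this kind can be sketched in a few lines. This is only an illustration, assuming a normalised Levenshtein similarity (1 minus edit distance over the longer length); the `Person` fields, the function names and the normalisation are assumptions, not the ONS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    dob: str            # ISO date string, e.g. "1982-01-20"
    postcode: str
    house_number: str


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def matchkey(x: Person, y: Person, threshold: float = 0.8) -> bool:
    """Hypothetical matchkey: fuzzy name, exact DoB, postcode and house number."""
    return (similarity(x.name.upper(), y.name.upper()) > threshold
            and x.dob == y.dob
            and x.postcode == y.postcode
            and x.house_number == y.house_number)
```

A record pair is declared a match only if every condition in the key holds, so loosening one variable (here the name) is usually paired with exact agreement on the others.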
Person matching: Probabilistic
Utilize our distributed computer system for better blocking
Research into optimal method for acquiring parameters for Fellegi-Sunter
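For readers unfamiliar with Fellegi-Sunter, the sketch below shows how the parameters turn field agreements into a score. The m- and u-probabilities are made-up values for illustration only; how ONS acquires its parameters is exactly the research question above and is not shown here.

```python
import math

# Hypothetical parameters, for illustration only:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
PARAMS = {
    "forename": (0.95, 0.01),
    "surname": (0.96, 0.005),
    "dob": (0.98, 0.001),
    "postcode": (0.90, 0.0005),
}


def fs_score(agreement: dict[str, bool]) -> float:
    """Sum of log2 likelihood ratios over the compared fields."""
    score = 0.0
    for field, agrees in agreement.items():
        m, u = PARAMS[field]
        if agrees:
            score += math.log2(m / u)               # agreement weight (positive)
        else:
            score += math.log2((1 - m) / (1 - u))   # disagreement weight (negative)
    return score
```

Pairs scoring above an upper threshold are linked automatically, pairs below a lower threshold are rejected, and pairs in between go to clerical review.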
Person matching
Results:
In 2011 automatic person matching recall was 70%
Stage          Record pairs  Record pair types                Recall
Deterministic  549,000       Person matches                   ~85%
Deterministic  4,000         Non-unique matches for clerical
Probabilistic  24,000        Person matches                   ~89%
Probabilistic  28,689        Candidate pairs for clerical
Innovative ‘set’ variables
Household matching
Automatic household matching
In 2011 there were issues with address matching and inconsistencies in the ‘head-of-household’ variable.
Quality of address matching has been improved (elastic search) so UPRN is more reliable.
We have replaced ‘head of household’ with household ‘sets’ e.g.
Matchkeys run on the sets and household characteristics.
Census household forenames CCS household forenames
[ NICK, EMMA, FRED, LOUISE ] [ NICHOLAS, EMMA, FREDERICK, LOUISE ]
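One way to compare two household sets is sketched below. It assumes each forename is reduced to a three-letter prefix and the sets are compared with a Jaccard overlap; both choices are assumptions made for illustration, since the slides do not say how the sets are compared.

```python
def name_stems(forenames: list[str], length: int = 3) -> frozenset[str]:
    """Reduce each forename to an upper-case prefix; order is discarded."""
    return frozenset(name.strip().upper()[:length] for name in forenames)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Share of distinct stems that the two households have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


census = name_stems(["NICK", "EMMA", "FRED", "LOUISE"])
ccs = name_stems(["NICHOLAS", "EMMA", "FREDERICK", "LOUISE"])
overlap = jaccard(census, ccs)
```

Because a set has no ordering and no designated first member, the comparison does not depend on who was recorded as head of household.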
Automatic household matching
Results:
In 2011 60% of household matches were automatic
250,000 Household matches
3,000 Non-unique matches (clerical)
95% Recall
99.99% Precision
Matching unmatched people and households through association
Associative matching
Matching via association
Matched households might contain unmatched people
Matched people might be in an unmatched household
Associate these relationships to create candidate pairs
• People: scored and automatically matched or sent to clerical
• Households: sent to clerical along with any unmatched people they contain
Unmatched people: Automatic
Block unmatched people on matched household and calculate scores (FS):
In this example, an automatic match is made between Bob S and Robert Smith.
Three people are still unmatched…
Matched Census Household         Matched CCS Household          Status
Mrs Cheryl Smith  07/Mar/1989    Cheryl Smith  07/Mar/1989      Matched
Mr D Smith  01/Jan/1900          David Smith  08/Oct/1980       Not matched
Bob S  20/Jan/1982               Robert Smith  20/Jan/1982      Not matched
Baby S  04/Nov/2010              Nicola Smith  05/Nov/2010      Not matched
Tom Smith  07/Jul/2008           T S  Age 3                     Not matched
Unmatched people: Clerical
Clerically reviewing the cartesian product of 3 x 3 unmatched records is inefficient
People in Census Household People in CCS Household
Mr D Smith 01/ Jan/ 1900 David Smith 08 /Oct/1980
Mr D Smith 01/ Jan/ 1900 Nicola Smith 05/ Nov/ 2010
Mr D Smith 01/ Jan/ 1900 T S Age 3
Baby S 04/ Nov/ 2010 David Smith 08 /Oct/1980
Baby S 04/ Nov/ 2010 Nicola Smith 05/ Nov/ 2010
Baby S 04/ Nov/ 2010 T S Age 3
Tom Smith 07/ Jul/ 2008 David Smith 08 /Oct/1980
Tom Smith 07/ Jul/ 2008 Nicola Smith 05/ Nov/ 2010
Tom Smith 07/ Jul/ 2008 T S Age 3
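The nine rows above are simply every census record paired with every CCS record. A minimal sketch of how that list arises, using the names from the example (the variable names are illustrative):

```python
from itertools import product

census_unmatched = ["Mr D Smith", "Baby S", "Tom Smith"]
ccs_unmatched = ["David Smith", "Nicola Smith", "T S"]

# Every census record paired with every CCS record: n x m candidate pairs.
candidate_pairs = list(product(census_unmatched, ccs_unmatched))
```

The number of pairs grows as the product of the two list lengths, whereas a household view shows each record only once.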
Unmatched people: Clerical
Instead the reviewer will be presented with a household view which is easier to interpret and match.
Matched Census Household         Matched CCS Household
Mrs Cheryl Smith  07/Mar/1989    Cheryl Smith  07/Mar/1989
Bob S  20/Jan/1982               Robert Smith  20/Jan/1982
Mr D Smith  01/Jan/1900          Nicola Smith  05/Nov/2010
Baby S  04/Nov/2010              David Smith  08/Oct/1980
Tom Smith  07/Jul/2008           T S  Age 3
Unmatched households clerical
Households:
Unmatched Census HH Unmatched CCS HH
Claire Shepherd 30 / Mar / 1993 Claire Shepherd 30 / Mar / 1993
Maggie S X Margret Shepherd 08 / Oct / 1980
Katherine X 01/ Oct / 1980 Chris Shepherd 20 / April / 1989
X Shepherd 01 / Jan / 1989 Cath Shepherd 20 / Jan / 1982
Matching via association
Results:
Stage                                                       Pairs    Pair types
Associative: people matches made automatically              11,000   Person matches
Associative: people candidate matches sent to clerical      83,000   Person candidate pairs (in 10,000 households)
Associative: household candidate matches sent to clerical   9,000    Household candidate pairs
Census to CCS improvements to automatic matching since 2011
Informing clerical via machine learning algorithms
Pre-search with machine learning
Finding the hard to match
✓ Lots of automatic
✓ Also produced lots of pairs for clerical matching
But haven’t yet captured the last 1% of matches
Which could be any of the residuals
Pre-search with machine learning
Could speed up searching by offering possible candidates
Researching machine learning algorithms to generate possible candidates and order them
Our goal is to make our pre-search algorithm good enough that if the match is not presented, say in the first 20 candidates, then we can say with confidence that no match exists
Optimized deterministic and introduced probabilistic
Improved RMR
Resolving multiple responses
RMR identifies and resolves duplicate census responses from the same household
Researched 17 matchkeys which find 288,000 duplicates (230,000 were found in 2011)
Probabilistic using Fellegi-Sunter:
• Resilient if 2021 data is very different to 2011
• Double check deterministic
Advanced algorithm for identifying duplicate census returns
Automated checking algorithm
Census to Census matching
Identifies duplicate person responses in different households
Run on person candidates from very strict blocking
In 2011 every candidate pair was clerically reviewed
Results inform the estimation of overcount
Checking steps
Automates decisions made by clerical staff in 2011
Very complex!
Produces lists of:
• automatically accepted duplicates
• rejected pairs
• pairs for clerical resolution
Confirmation steps
Testing:
Out of nearly 800,000 candidate pairs:
In 2011 100% would have been clerically resolved
No. records  Resolution category                         % of candidate pairs
455,000      Automatically identified as duplicates      57%
213,000      Candidate duplicate pairs sent to clerical  27%
125,000      Automatically rejected pairs                16%
Matching encrypted data with the divide and conquer method
Use of distributed computing
Encrypted data
No string comparators or ‘contained within’ tools are available on hashed values, so we use lots of derived variables
Lots of derived variables ⇒ lots of matchkeys
Input string  Hashed output
CHRISTINE     A123V76B893F3897GH267389T567
CHRISTINA     Y678N79FT632H7530B8A3D568U76
18 forename variables × 9 middlename variables × 15 surname variables = 2,430 matchkeys
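The table shows why fuzzy comparison fails after hashing: a one-letter difference gives a completely different digest. A common workaround is to derive variables before hashing and compare the hashed derivations exactly. The sketch below assumes HMAC-SHA256 with a demo key, and treats Alphaname as the letters of the name sorted alphabetically; the key, the variable names and the choice of derivations are illustrative assumptions.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-real-use"  # hypothetical key, for illustration


def keyed_hash(value: str) -> str:
    """HMAC-SHA256 of a value; equal inputs always give equal digests."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def derived_forename_variables(forename: str) -> dict[str, str]:
    """Derive variables in the clear, then hash each one separately."""
    name = forename.strip().upper()
    derived = {
        "full": name,
        "first3": name[:3],
        "alphaname": "".join(sorted(name)),  # assumed: letters sorted alphabetically
    }
    return {key: keyed_hash(val) for key, val in derived.items()}


a = derived_forename_variables("CHRISTINE")
b = derived_forename_variables("CHRISTINA")
```

Here `a["full"]` and `b["full"]` differ, but `a["first3"]` and `b["first3"]` agree, so a matchkey on the hashed prefix can still bring the pair together. Each extra derivation multiplies the number of possible matchkeys, which is where the 2,430 figure comes from.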
Encrypted data
Huge processing requirement
Traditional matching on 122 matchkeys took over 24 hours
Divide & conquer:
• Equivalent to thousands of matchkeys
• Currently uses ~57 variables
• Parallelized so runs in ~8 hours
Divide and conquer
Create 13 blocks which are matched in two stages:
• Derived agreement – exact, fuzzy, or no agreement?
• Run matchkeys at the derived agreement level e.g.
Example matchkey: Loose fuzzy forename; Strict fuzzy surname; Exact date of birth; Exact gender; Exact postcode
Agreement levels:
• Exact agreement: full name agrees
• Strict fuzzy agreement: e.g. Alphaname, nickname
• Loose fuzzy agreement: e.g. Soundex, Double Metaphone
• No agreement
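The two stages can be sketched as follows. The record layout (one hashed value per agreement level, under the hypothetical keys "exact", "strict" and "loose") and the rule that a stronger agreement also satisfies a weaker requirement are assumptions made for illustration.

```python
from enum import IntEnum


class Agreement(IntEnum):
    NONE = 0
    LOOSE_FUZZY = 1
    STRICT_FUZZY = 2
    EXACT = 3


def derive_agreement(a: dict[str, str], b: dict[str, str]) -> Agreement:
    """Stage 1: find the strongest level at which two hashed names agree."""
    if a["exact"] == b["exact"]:
        return Agreement.EXACT
    if a["strict"] == b["strict"]:
        return Agreement.STRICT_FUZZY
    if a["loose"] == b["loose"]:
        return Agreement.LOOSE_FUZZY
    return Agreement.NONE


# Stage 2: a matchkey states the minimum agreement level per field.
MATCHKEY = {
    "forename": Agreement.LOOSE_FUZZY,
    "surname": Agreement.STRICT_FUZZY,
    "dob": Agreement.EXACT,
    "gender": Agreement.EXACT,
    "postcode": Agreement.EXACT,
}


def passes(levels: dict[str, Agreement], matchkey: dict[str, Agreement]) -> bool:
    """True if every field agrees at least as strongly as the key demands."""
    return all(levels[field] >= minimum for field, minimum in matchkey.items())
```

One key written at the agreement level stands in for every combination of underlying derived variables that produces that level, which is one reading of "equivalent to thousands of matchkeys".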
Parallel processing
Quality assurance in data linkage
ONS Data Linking Symposium
Oct 2019
James Doidge, ICNARC
Katie Harron, UCL
Quality assurance vs quality assessment
Quality assurance:
• “The systematic measurement, comparison with a standard, monitoring of processes and an associated feedback loop that confers error prevention”
But in data linkage:
• The accuracy of most links is unknowable; there are no standards to measure by. Only indirect or partial assessments are possible.
• There are two dimensions of quality and a trade-off between them. The relative value of each depends on the application.
               Match status (true relationship)
Link status    Match          Non-match
Link           True link      False link
Non-link       Missed link    True non-link
Two types of error
Recall: proportion of matches that are linked (sensitivity)
Precision: proportion of links that are true (positive predictive value)
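Both measures follow directly from the cells of the two-by-two table. A minimal sketch, with function and argument names chosen for illustration:

```python
def recall(true_links: int, missed_links: int) -> float:
    """Proportion of true matches that were linked (sensitivity)."""
    return true_links / (true_links + missed_links)


def precision(true_links: int, false_links: int) -> float:
    """Proportion of links that are true matches (positive predictive value)."""
    return true_links / (true_links + false_links)
```

True non-links appear in neither formula, which matters in practice because non-matching pairs vastly outnumber matching ones.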
[Figure: overlapping distributions of matches and non-matches along an agreement axis (low to high), divided into definite non-matches, possible matches and definite matches. The overlap depends on the quality of the matching variables. One panel shows high precision with low recall (missed links); another shows low precision with high recall (false links).]
Impacts of false links on analysis
Potential misclassification and measurement error
Mixed up values
But, if equivalent, then no impact!
Erroneous inclusion/exclusion in an analysis
Selection bias, e.g. exclusion of records falsely linked to death register
‘Merging’ of multiple people’s records into one
Misclassification/measurement error and undercounting
Impacts of missed links on analysis
Missing data
Misclassification and measurement error
When links are ‘meaningfully interpreted’, e.g. mortality from linked death records
Erroneous inclusion/exclusion in an analysis
Selection bias, e.g. excluding unlinked records
‘Splitting’ of one person’s records into many
Misclassification/measurement error and double-counting
Example: ‘Splitting’ in Hospital Episode Statistics
[Figure: number of HESIDs (patients) in HES by year, 1997 to 2001, in millions (axis 7.0 to 7.4); series: 2-step algorithm, 3-step algorithm]
What can analysts do about linkage error?
Sensitivity analysis
• Requires uncertain links and link-level information about match quality
• Not guaranteed to capture ‘true’ value
Probabilistic analysis (imputation/weighting)
• Requires uncertain links and link-level information about match quality
Bias analysis
• Requires dataset-level measures of linkage accuracy
• Rates of missed links and false links
• Distribution of each with respect to variables of interest
Example: Estimating prevalence of Down’s Syndrome in HES vs cytogenetic register
[Figure: cases per 10,000 births (axis 0 to 15) by year of birth (1998 to 2013); series: NDSCR only, Whole of HES, each with a linear trend]
Restricting HES to a birth cohort to mitigate splitting
[Figure: same axes; series: HES birth cohort only, NDSCR only, Whole of HES]
Using linked data
[Figure: same axes; series: Linked data (base case), HES birth cohort only, NDSCR only, Whole of HES]
Using linked data with quantitative bias analysis
[Figure: same axes; series: Linked data (base case) with linear trend and plausible range (upper - lower), HES birth cohort only, NDSCR only, Whole of HES]
Assessing linkage quality with identifiers
Technique False links Missed links
% ∆ % ∆
Clerical review (usually of a sample of candidate links): ~ ~
Apply algorithm to training data or ‘gold standard’ (often a subset): ✓ ✓ ✓ ✓
Apply algorithm to ‘negative controls’ (records that should not link): ✓ ✓
%: rate; ∆: distribution; ~ partially/depends
Assessing linkage quality without identifiers
Technique False links Missed links
% ∆ % ∆
Comparison of linked and unlinked records ~ ✓
Analysis of ‘positive controls’ (subset that should link) ~ ✓
Comparison of linkable and unlinkable records (or high/low quality matching data) ~ ✓
Comparison of plausible and implausible links ~ ✓
Comparison of observed to plausible number of links ✓
Comparison of linked data to reference statistics ~ ~ ~ ~
%: rate; ∆: distribution; ~ partially/depends
Comparing linked vs unlinked records (or positive controls)
Ford JB, Roberts CL, Taylor LK (2006) Characteristics of unmatched maternal and baby records in linked birth
records and hospital discharge data. Paediatr Perinat Ep 20 (4):329-337
Mother-baby links Unlinked babies Unlinked mothers
Comparing plausible and implausible links
Hagger-Johnson et al (2014) Identifying possible false matches in anonymized hospital administrative data without
patient identifiers. Health Serv Res DOI:10.1111/1475-6773.12272
Advice for data linkers
■ Be realistic. Accept that uncertainty and error exist.
■ Understand your data. How might errors and inconsistencies have been introduced into matching variables? Which records should or should not link? How many links are expected for each record?
– Engage with data collectors
– Explore your inputs
– Explore your outputs
■ Understand the application. What will be the impacts of linkage error?
– Engage with users
■ Find your balance
[Diagram: finding a balance in linkage outputs between data quality, the need for high recall, human & computing resources, and the need for high precision]
Suggested minimum outputs
1. Include detailed information about the linkage algorithm
• Including approach to data cleaning
2. Include record-level information about matching variable quality
• Indicators of missing/invalid for each matching variable
3. Include link-level information about match quality
• Pattern of agreement
• Match ranks / match rules / match weights, etc.
4. Include uncertain links and unlinked records
• (when possible)
5. Include information about identified errors
• If possible, include ‘quality assured’ links in extract
• If not possible, include aggregate characteristics for these
A final comment
Ultimately, the best way to quality assure data linkage is to ensure collection of high-quality matching variables, with:
– Unique identifiers, validated at the point of collection
– Back-up options: Additional, unique combinations of variables
■ Minimise processing of personal data (only use for linkage)
■ Minimise access to personal data (only by data linkers)
■ Don’t minimise collection of personal data to the point that linkage is impeded and the value of the dataset is diminished
Acknowledgements & resources
Acknowledgements
■ ESRC
– Administrative Data Research Network (defunct)
– National Centre for Research Methods
■ Wellcome Trust (KH)
■ Prof Ruth Gilbert and Prof Harvey Goldstein
Further reading
■ Doidge, J. C., & Harron, K. (2019). Reflections on modern methods: linkage error bias.
International Journal of Epidemiology (in press)
■ Harron K. L., Doidge J. C. , Knight H. E. , et al. A guide to evaluating linkage quality for the
analysis of linked data. International Journal of Epidemiology 2017; 46: 1699-710.
■ Doidge, J. C., & Harron, K. (2018). Demystifying probabilistic linkage: Common myths and
misconceptions. International Journal of Population Data Science, 3(1).
http://dx.doi.org/10.23889/ijpds.v3i1.410
■ Harron, K., Goldstein, H., & Dibben, C. (Eds.). (2016). Methodological developments in data
linkage. Chichester, UK: John Wiley & Sons, Ltd.
Understanding the educational background of offenders
Data sharing between the Ministry of Justice and
Department for Education
October 2019
Background
• Proof of concept share – 2015:
• Police National Computer and magistrates’ courts data linked to the
National Pupil Database
• 70% match rate
• Offences, sentences, educational outcomes and characteristics
• Range of published outputs
• Increased interest in the power of data sharing and data linking
• Focus on Serious Violence and ‘What Works’
Key challenges
Project outline
Discussions with stakeholders
DSA drafting
DPIA
Design of technical solution
Compliance review
Sign-off
Cohort build
Exchange of activity data
Matching
QA
Analysis
Matching the data - process
• Matching to take place within DfE to minimise the personal information shared
• The offender cohort that can potentially be matched represents fewer individuals than the cohort that can potentially be matched from the DfE
Dataset A: Offender cohort
Dataset B: Offender cohort identified in DfE data
Dataset C: DfE attach education data to individuals in Dataset B
Dataset D: MoJ attach justice data to individuals in Dataset B
Matching challenges
[Diagram: the MoJ offender cohort and the DfE education cohort overlap; the overlap is the linked cohort, Dataset B]
Matching methodology
• Inclusion of alias versions of identifiers in the matching, to improve chances of successful matches.
• Iterate through multiple matching rules, with each match accompanied by a variable indicating the match quality / strength
• Decisions will be taken jointly by the MoJ and DfE following matching as to the quality of match that will be accepted and brought into the linked dataset.
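The approach in the bullets above can be sketched as a cascade of rules, strictest first, where the rank of the first rule satisfied is kept as the match-quality variable. The rules, field names and alias format below are hypothetical; the actual MoJ/DfE rules are not given in the slides.

```python
from typing import Callable, Optional

Record = dict[str, str]
Rule = Callable[[Record, Record], bool]


def exact_on(*fields: str) -> Rule:
    """Build a rule requiring exact agreement on every named field."""
    return lambda a, b: all(a[f] == b[f] for f in fields)


def surname_dob_initial(a: Record, b: Record) -> bool:
    """Weaker rule: surname and DoB agree, forename agrees on initial only."""
    return (a["surname"] == b["surname"] and a["dob"] == b["dob"]
            and a["forename"][:1] == b["forename"][:1])


# Hypothetical rules, strictest first; rank 1 is the strongest match.
RULES: list[tuple[int, Rule]] = [
    (1, exact_on("forename", "surname", "dob", "postcode")),
    (2, exact_on("forename", "surname", "dob")),
    (3, surname_dob_initial),
]


def variants(record: Record) -> list[Record]:
    """One copy of the record per known forename alias ('|'-separated)."""
    aliases = [n for n in record.get("aliases", "").split("|") if n]
    return [record] + [{**record, "forename": alias} for alias in aliases]


def match_rank(a: Record, b: Record) -> Optional[int]:
    """Rank of the first rule satisfied by any alias variant, else None."""
    for rank, rule in RULES:
        if any(rule(x, y) for x in variants(a) for y in variants(b)):
            return rank
    return None
```

Keeping the rank on every link lets the two departments decide afterwards which ranks are good enough to enter the linked dataset.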
Maximising the value
• Permitted uses
• Range of access routes:
• Internal settings
• ONS Secure Research Service (DfE)
• Justice MicroData Lab (MoJ)
• Engagement with users and allies
• OGDs
• Academia
• ADR-UK
• Strategic Framework
The Future
This iteration
• Analysis across government – e.g.:
• Educational background of young offenders
• County lines
• Evaluations – e.g.:
• Youth endowment fund
• Long-term foster care
Further iterations
• Extensions to data
• Further partnerships
Annex – contacts and publications
Contacts
• DfE (to learn more) – Gary Connell ([email protected])
• DfE (access to data) – Data Sharing Team ([email protected])
• MoJ (to learn more) – David Dawson ([email protected])
• MoJ (access to data) – Data Access Group ([email protected])
Publications
• Understanding the educational background of young offenders - https://www.gov.uk/government/statistics/understanding-the-educational-background-of-young-offenders-full-report
• Examining the educational background of young knife possession offenders - https://www.gov.uk/government/statistics/knife-and-offensive-weapon-sentencing-january-to-march-2018
• Examining the educational background of prolific offenders - https://www.gov.uk/government/statistics/criminal-justice-system-statistics-quarterly-december-2018
Probabilistic and Deterministic Data Linkage
@ SAIL Databank / UK Secure e-Research Platform (UKSeRP)
Simon Thompson – Chief Technical Officer,
Swansea University Medical School
The Story
Context
What is SAIL Databank ?
1. A “safe”, legal and publicly acceptable response to the need for “open” person-based linked data for research and intelligence.
2. Citizen-centric, individual-level data at scale, linking together health and social data from across the public sector in Wales
3. Carefully curated individual-level data, rendered anonymous in use by robust and transparent socio-technical systems.
4. Access available to any legitimate person for any legitimate, public benefit purpose.
5. Strong continuing engagement with the public through panels and on-going consultations
6. Internationally-recognised best practice system, increasingly implemented across the world
SAIL Databank majors in datasets relating to the Welsh population, but not exclusively
In the past it was very health focused; now it is person focused and covers any data relating to people
Five “safes”:
• Split file approach with Trusted Third Party
• Reliable, automated probabilistic matching / data linkage process (in a TTP)
• De-identification via multiple (automated) encryption
• Secure data transportation (data inwards)
• Data risk reduction
• Independent scrutiny of data utilisation proposals
• Remote data access only (no data leaves)
• Disclosure control (safe outputs)
• High security – defence in depth, multi-layered firewalls + regular penetration testing
• External verification of compliance with Information Governance (ISO 27001, audit)
• Automation, Automation, Automation (the three ‘A’s!)
SAIL Databank
• Over 32 billion records for >5 million people
• Most data goes back 10-20 years
• All pre-linked data
• 300+ approved SAIL projects, with 152 active today
• 120 staff in Swansea working on Health Informatics related projects
41 Core datasets
162 Project Specific datasets
• Governance Model and Privacy Protection
  • Research to data not data to researcher
  • Rich collaborative virtual space
• Large Data Collection
  • Lots of health data but others too
  • Not exclusively Welsh data but has all Wales datasets and holdings
• UKSeRP as infrastructure
  • Performance & secure remote access
• Multi Modality
  • adding Omics.*, Imaging, NLP, GIS
• SAIL uses NHS Wales Informatics Service (NWIS) as our trusted third party (TTP)
with no shared roles or staff.
Split file principle
[Diagram: split file submission between Supplier, TTP (demographics) and SAIL; no shared roles, no shared access]
Tools and components to enable this...
SAIL Split File Principle
Additional Project level encryption of ALF_E → PALF_E
Based on ISD algorithm
Residential / Geo-Spatial Linkage
• RALF – Residential ALF
• Ability to identify “family” groupings / co-inhabitancy
• Ability to compute vectors of social influences – distance to nearest off licence / hospital, air pollution
• New RALF switch from PAF to Address Base
SAIL Databank : Repository
ALF_E(Linkage key)
All Datasets are Linkable Projects get linked data cuts / views
The Story
“Combine and share your data and stay in complete control”
Programmes using UKSeRP today..
• UK Secure e-Research Platform (UKSeRP)
• SAIL - Health related person data
• ADRC - Non-health person-level data
• DPUK - 35 dementia cohorts + Imaging + Genomics
• ALSPAC - From-birth cohort, deeply phenotyped
• UK Biobank (outcomes) – Routine data and SNIPS
• UK MS Register - UK register of people with MS – EHR & PROMS
• MRC Pathfinder - Mental health platform(s)
• CLIMB - Microbial Genomics
• UKCRIS - Mental health unstructured data
• ELGH - East London Genomics and Health Programme
• DSB - Collection of smaller projects
• GOV - Welsh government use
• HWW - Welsh PROMS
SeRP Coverage / Deployments
Australia: Monash - been running for a while; Curtin - operational by end of year
Canada: British Columbia – install Jan 2020, operational by Apr 2020
UK Contracts
• Spine / No Spine / pseudo Spine
• Multi Data Modality – Routine data, Project data, Imaging, Genomics, NLP, Unstructured
• Vast variation in data and curation quality
• Fully automated / Full separation
• Tenancy specific linkage models / levels of acceptability
• Project defined quality thresholds
• Tuning – migration affecting linkage approach
• Maintain backwards compatibility – supporting 10 years’ worth of research
New Linkage needed for a new era
New ALF – clearly more complicated
At the heart of it is a new linkage engine
• Create matching pools.
• All datasets are de-duplicated (matching within a dataset)
• All datasets fed into a pool will be linked to all other datasets in pool where possible.
• A matching pool can be linked to a “core pool”: removes dependence on the spine while keeping the spine.
• Linkage expressed as a graph of nodes with confidence weighting between nodes
• Clusters identified and numbered – Encrypted id is new ALF
• Current ALF maintained for backward compatibility.
• Pools retain data to enable linkage to any future data added; NRDA-Linkage has an option to remove the pool after linkage.
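The graph-and-cluster idea can be sketched with a union-find over weighted edges. The record labels, the confidence threshold and the function name are assumptions for illustration; the slides do not describe the engine's actual clustering method, and the encryption of the cluster number into the new ALF is not shown.

```python
def cluster_records(edges: list[tuple[str, str, float]],
                    threshold: float = 0.9) -> dict[str, int]:
    """Assign a cluster number to every record using union-find.

    `edges` holds (record_a, record_b, confidence); only edges at or above
    `threshold` merge clusters. The threshold value is hypothetical.
    """
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, confidence in edges:
        root_a, root_b = find(a), find(b)  # also registers both nodes
        if confidence >= threshold and root_a != root_b:
            parent[root_b] = root_a

    # Number clusters in order of first appearance.
    ids: dict[str, int] = {}
    return {node: ids.setdefault(find(node), len(ids) + 1) for node in list(parent)}
```

Because clusters are derived from the stored edges, adding a dataset to the pool only adds nodes and edges; the clusters can then be recomputed without discarding earlier links.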
1000 foot view of new linkage capability
Graph at time T
Linkage project with datasets A, B and C; project linkage strategy defined:
• Linkage algorithm 1: rows in B; rows in C; B and C
• Linkage algorithm 2: output of linkage algorithm 1 to A
Graph at time T+1
Dataset D is added to the linkage project:
• Linkage algorithm 1: rows in B; rows in C; rows in D; B and C and D
• Linkage algorithm 2: output of linkage algorithm 1 to A
Stored Graph has temporal aspect
Linkage can change over time as extra/new data is added
What was the state at time = x?
• Assessment of all pairs to decide if they belong to the same person
• Identify all pairs of records for each individual
• Combine ‘true positive’ pairs together into Groups
• Group output provides the linkage map
NRDA brings world leading linkage
Privacy preserving linkage https://computation.curtin.edu.au/wp-content/uploads/sites/25/2017/10/Schnell_2017_Curtin_CS_2.pdf
Bloom filtered one-way hashing of source files.
Still able to do deterministic and probabilistic data linkage
with only a marginal drop in accuracy.
Computationally more expensive
Linkage strategy vs disclosure risk vs utility (field/row enc.)
Encryption key distribution and reuse
Ideal for one off linkage scenarios
However, as a weapon of last resort, it can be argued that it renders the source files non-PII, allowing for submission?
Question: Professor Rainer Schnell speaking next ☺
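A minimal sketch of Bloom-filter encoding for privacy-preserving linkage follows, to show why approximate matching survives one-way hashing. The filter size, number of hash functions, key and use of SHA-256 are illustrative assumptions, not the parameters used by NRDA.

```python
import hashlib


def bigrams(value: str) -> set[str]:
    """Character bigrams of the padded, upper-cased string."""
    padded = f"_{value.strip().upper()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, size: int = 1024, hashes: int = 4,
                 key: bytes = b"demo-key") -> set[int]:
    """Return the set of bit positions set by the value's bigrams."""
    bits: set[int] = set()
    for gram in bigrams(value):
        for seed in range(hashes):
            digest = hashlib.sha256(key + bytes([seed]) + gram.encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit sets; 1.0 means identical filters."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Similar names share most bigrams and therefore most bit positions, so `dice(bloom_encode("CHRISTINE"), bloom_encode("CHRISTINA"))` is high even though neither party sees the other's clear text. The extra hashing per bigram is one source of the higher computational cost noted above.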
ALF v2
LinkProjID ProjectID LinkProjectType Name Owner Created Source ALF ALF2 RALF Bloom Destination
11 Persistent PEDW core dataset NWIS 01/01/2018 FILE X X X X UKSERP SAIL
12 123 Temp NRDA 99 - P0123 NRDA99 02/02/2018 NRDA Sharing X X UKSERP ADRC
Submission LinkProjID Filename size Datestamp Start/End ALF ALF2 RALF Encrypt XX
123 11 File1-123.csv 1gb 02/02/2018 02/02/2018 11:24 yes NA not yet NA
[Diagram: NRDA components: Data IN controller; monitor and performance metrics; control components for data in, ALF, RALF and encrypted linkage, each with an in queue and an out queue; Data Out; data transport switching service (FTPS) to UKSeRP / SAIL; remote dashboard; system admin]
New Product (NRDA-Linkage) – Deployable Infrastructure for TTP role
UKSeRP – every tenancy has its own linkage engine
[Diagram: UKSeRP tenancy components: web front end; FTP / ETL; database loader (file splitter); local data catalogue; linkage & matching; sharing & IG; data quality and metrics; publishing; security, configuration & capability model. Each DATASET has access control, data storage, documentation, schema editor, ER diagram, metrics and validation, and artefacts / files. Databases: MS SQL, PostgreSQL; external: MS SQL, IBM DB2, HADOOP, PostgreSQL. Also shown: trusted third party linkage & matching, other appliance, regional / global data catalogue.]
Not a TTP configuration, but available if the organisation decides on a “Chinese wall” approach
Also part of the NRDA (product) so can be deployed to pre-link dataset before submission
The Future is about Federation of data silos
Taking into account
- Governance
- Local design constraints
- Cyber Security
- Diversity is fine (supports innovation)
- Operational and onward costs
Fed-Discovery
Fed-NLP
Fed-Linkage
Fed-Analysis
Federation – Progress so far
Fed-Discovery
Fed-NLP
Fed-Linkage
Fed-Analysis
V1 / MVP – Ready start testing
To be achieved by start 2020
Mature design – use case needs defining
V1 done – New integration – Expand NLP engines
Done / To do / Design
Next Phase
• Federated or Networked: Deterministic / Probabilistic Data Linkage
Strong use-case from Australia to test and validate approach
• Linkage engine agnostic = Linkage Sites
• Network effect = Virtual overlay of interconnected nodes (contributing / using)
• Linkage graph = Virtual linkage applied to graph / project specific views
• Modular rule based approach, onward sharing, scope of dissemination, … lots of options
Example: Inter cohort linkage – Connect to local major linkage site
I have been Simon Thompson, Swansea University
You have been great