census linkage: advances in techniques · record linkage determining if two records belong to the...

Rachel Shipsey, Operational Researcher | Methodology

Census Linkage

Census linkage: Advances in techniques

October 2019

Record linkage

Determining if two records belong to the same entity

Within a dataset or between multiple datasets

Entity can be a person, an event, a business etc.


Census matching 2021

People and households in Census and Census Coverage Survey (CCS)

Underpins the census estimation process

• Very high accuracy targets

• Shorter deadline: more automatic matching, less clerical


Refined matchkeys and improved probabilistic

Improve person matching


Person matching: Matchkeys

Variables on which a record pair must agree to be declared a match.

String comparators can be incorporated

e.g. Name Levenshtein > 0.8; DoB matches exactly; Postcode and house-number match exactly

New variables e.g. Alphaname for online transposition errors

Jaro-Winkler as an alternative string comparator

Researched optimal matchkey combinations to maximize match rate and minimize false positives
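The example matchkey above can be sketched in a few lines. This is a minimal illustration, not the production census code: it assumes the Levenshtein score is normalised to the range 0 to 1 as one minus the edit distance divided by the longer string's length, and the field names (`name`, `dob`, `postcode`, `house_number`) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def matchkey(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Fuzzy agreement on name, exact agreement on everything else."""
    return (
        similarity(rec_a["name"], rec_b["name"]) > threshold
        and rec_a["dob"] == rec_b["dob"]
        and rec_a["postcode"] == rec_b["postcode"]
        and rec_a["house_number"] == rec_b["house_number"]
    )


# Illustrative records only
census = {"name": "CHRISTINE SMITH", "dob": "1989-03-07", "postcode": "AB1 2CD", "house_number": "4"}
ccs = {"name": "CHRISTINA SMITH", "dob": "1989-03-07", "postcode": "AB1 2CD", "house_number": "4"}
print(matchkey(census, ccs))
```

A record pair is declared a match only if every condition of the matchkey holds, so loosening one variable (here the name) is usually balanced by keeping the others exact.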

Person matching: Probabilistic

Utilize our distributed computer system for better blocking

Research into optimal method for acquiring parameters for Fellegi-Sunter
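For readers unfamiliar with Fellegi-Sunter, the parameters in question are the m-probability (chance that a field agrees given the pair is a true match) and the u-probability (chance that it agrees given a non-match). A minimal sketch of how a pair score is built from them follows; the m and u values are invented for illustration and are not ONS parameters.

```python
import math


def fs_weight(agreements: dict, m: dict, u: dict) -> float:
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


# Illustrative parameters only
m = {"surname": 0.90, "dob": 0.95}
u = {"surname": 0.01, "dob": 0.001}

both_agree = fs_weight({"surname": True, "dob": True}, m, u)
dob_differs = fs_weight({"surname": True, "dob": False}, m, u)
print(both_agree > dob_differs)
```

Pairs scoring above an upper threshold would be linked automatically, and pairs between two thresholds would go to clerical review; where those thresholds sit is a separate choice not shown here.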


Person matching

Results:

In 2011 automatic person matching recall was 70%

Stage          Record pairs  Record pair types                Recall
Deterministic  549,000       Person matches                   ~85%
Deterministic  4,000         Non-unique matches for clerical
Probabilistic  24,000        Person matches                   ~89%
Probabilistic  28,689        Candidate pairs for clerical

Innovative ‘set’ variables

Household matching

Automatic household matching

In 2011 there were issues with address matching and inconsistencies in the ‘head-of-household’ variable.

Quality of address matching has been improved (elastic search) so UPRN is more reliable.

We have replaced ‘head of household’ with household ‘sets’ e.g.

Matchkeys run on the sets and household characteristics.

Census household forenames: [ NICK, EMMA, FRED, LOUISE ]

CCS household forenames: [ NICHOLAS, EMMA, FREDERICK, LOUISE ]
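One way to compare two household 'sets' like these is to standardise each forename and then measure set overlap. The sketch below is an assumption about how such a comparison could work, not the method used in the census: the nickname lookup is a hypothetical two-entry table and Jaccard overlap is a swapped-in, generic set similarity.

```python
# Hypothetical nickname lookup; a real one would be far larger.
NICKNAMES = {"NICK": "NICHOLAS", "FRED": "FREDERICK"}


def standardise(forenames):
    """Upper-case each forename and map known nicknames to a full form."""
    return frozenset(NICKNAMES.get(n.upper(), n.upper()) for n in forenames)


def set_agreement(census_forenames, ccs_forenames) -> float:
    """Jaccard overlap of the two standardised forename sets (0.0 to 1.0)."""
    a, b = standardise(census_forenames), standardise(ccs_forenames)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


census_household = ["NICK", "EMMA", "FRED", "LOUISE"]
ccs_household = ["NICHOLAS", "EMMA", "FREDERICK", "LOUISE"]
print(set_agreement(census_household, ccs_household))  # 1.0
```

Because the comparison is on sets, it does not depend on who is listed first, which is the weakness of a 'head of household' variable.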

Automatic household matching

Results:

In 2011 60% of household matches were automatic


250,000 Household matches

3,000 Non-unique matches (clerical)

95% Recall

99.99% Precision

Matching unmatched people and households through association

Associative matching


Matching via association

Matched households might contain unmatched people

Matched people might be in an unmatched household

Associate these relationships to create candidate pairs

• People: scored and automatically matched or sent to clerical

• Households: sent to clerical along with any unmatched people they contain


Unmatched people: Automatic

Block unmatched people on matched household and calculate scores (FS):

In this example, an automatic match is made between Bob S and Robert Smith.

Three people are still unmatched…

Matched Census Household         Matched CCS Household        Status
Mrs Cheryl Smith  07/Mar/1989    Cheryl Smith  07/Mar/1989    Matched
Mr D Smith        01/Jan/1900    David Smith   08/Oct/1980    Not matched
Bob S             20/Jan/1982    Robert Smith  20/Jan/1982    Not matched
Baby S            04/Nov/2010    Nicola Smith  05/Nov/2010    Not matched
Tom Smith         07/Jul/2008    T S           Age 3          Not matched

Unmatched people: Clerical

Clerically reviewing the cartesian product of 3 x 3 unmatched records is inefficient


People in Census Household People in CCS Household

Mr D Smith 01/ Jan/ 1900 David Smith 08 /Oct/1980

Mr D Smith 01/ Jan/ 1900 Nicola Smith 05/ Nov/ 2010

Mr D Smith 01/ Jan/ 1900 T S Age 3

Baby S 04/ Nov/ 2010 David Smith 08 /Oct/1980

Baby S 04/ Nov/ 2010 Nicola Smith 05/ Nov/ 2010

Baby S 04/ Nov/ 2010 T S Age 3

Tom Smith 07/ Jul/ 2008 David Smith 08 /Oct/1980

Tom Smith 07/ Jul/ 2008 Nicola Smith 05/ Nov/ 2010

Tom Smith 07/ Jul/ 2008 T S Age 3
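The nine rows above are simply the cartesian product of the two lists of unmatched people. A short sketch makes the growth explicit; the names are taken from the example and the helper name is made up for illustration.

```python
from itertools import product

census_unmatched = ["Mr D Smith", "Baby S", "Tom Smith"]
ccs_unmatched = ["David Smith", "Nicola Smith", "T S"]


def candidate_pairs(census_people, ccs_people):
    """Pair every unmatched census person with every unmatched CCS person."""
    return list(product(census_people, ccs_people))


pairs = candidate_pairs(census_unmatched, ccs_unmatched)
print(len(pairs))  # 9
```

The number of pairs is the product of the two list lengths, so a reviewer working pair by pair sees n x m separate decisions for what is really a single household.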

Unmatched people: Clerical

Instead the reviewer will be presented with a household view which is easier to interpret and match.

Matched Census Household         Matched CCS Household
Mrs Cheryl Smith  07/Mar/1989    Cheryl Smith  07/Mar/1989
Bob S             20/Jan/1982    Robert Smith  20/Jan/1982
Mr D Smith        01/Jan/1900    Nicola Smith  05/Nov/2010
Baby S            04/Nov/2010    David Smith   08/Oct/1980
Tom Smith         07/Jul/2008    T S           Age 3

Unmatched households clerical

Households:


Unmatched Census HH Unmatched CCS HH

Claire Shepherd 30 / Mar / 1993 Claire Shepherd 30 / Mar / 1993

Maggie S X Margret Shepherd 08 / Oct / 1980

Katherine X 01/ Oct / 1980 Chris Shepherd 20 / April / 1989

X Shepherd 01 / Jan / 1989 Cath Shepherd 20 / Jan / 1982

Matching via association

Results:

Stage                                                       Pairs   Pair types
Associative: people matches made automatically              11,000  Person matches
Associative: people candidate matches sent to clerical      83,000  Person candidate pairs (in 10,000 households)
Associative: households candidate matches sent to clerical  9,000   Household candidate pairs

Census to CCS improvements to automatic matching since 2011


Informing clerical via machine learning algorithms

Pre-search with machine learning


Finding the hard to match


✓ Lots of automatic

✓ Also produced lots of pairs for clerical matching


But haven’t yet captured the last 1% of matches

Which could be any of the residuals

Pre-search with machine learning

Could speed up searching by offering possible candidates

Researching machine learning algorithms to generate possible candidates and order them

Our goal is to make our pre-search algorithm good enough that if the match is not presented, say in the first 20 candidates, then we can say with confidence that no match exists


Optimized deterministic and introduced probabilistic

Improved RMR

Resolving multiple responses

RMR identifies and resolves duplicate census responses from the same household

Researched 17 matchkeys which find 288,000 duplicates (230,000 were found in 2011)

Probabilistic using Fellegi-Sunter:

• Resilient if 2021 data is very different to 2011

• Double check deterministic


Advanced algorithm for identifying duplicate census returns

Automated checking algorithm


Census to Census matching

Identifies duplicate person responses in different households

Run on person candidates from very strict blocking

In 2011 every candidate pair was clerically reviewed

Results inform the estimation of overcount


Checking steps


Automates decisions made by clerical staff in 2011

Very complex!

Produces lists of:

• automatically accepted duplicates

• rejected pairs

• pairs for clerical resolution

Confirmation steps

Testing:

Out of nearly 800,000 candidate pairs:

In 2011 100% would have been clerically resolved

No. records  Resolution category                         % of candidate pairs
455,000      Automatically identified as duplicates      57%
213,000      Candidate duplicate pairs sent to clerical  27%
125,000      Automatically rejected pairs                16%

Matching encrypted data with the divide and conquer method

Use of distributed computing

Encrypted data

String comparators are either unavailable or contained within tools, so we use lots of derived variables

Lots of derived variables ⇒ lots of matchkeys


Input string Hashed output

CHRISTINE A123V76B893F3897GH267389T567

CHRISTINA Y678N79FT632H7530B8A3D568U76

18 forename variables × 9 middlename variables × 15 surname variables = 2,430 matchkeys
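The hashed example shows why fuzzy comparison is lost: two similar names give unrelated hashes. A minimal sketch of the workaround is below, assuming salted SHA-256 as the hash and two hypothetical derived variables (first letter, first four letters); the source does not say which hash or which derived variables are actually used.

```python
import hashlib


def hashed(value: str, salt: str = "illustrative-salt") -> str:
    """One-way hash of a value; only exact equality survives hashing."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def derived_variables(forename: str) -> dict:
    """Hypothetical derived variables, each computed before hashing."""
    return {
        "full": hashed(forename),
        "first_letter": hashed(forename[:1]),
        "first_four": hashed(forename[:4]),
    }


a = derived_variables("CHRISTINE")
b = derived_variables("CHRISTINA")
agreeing = [name for name in a if a[name] == b[name]]
print(agreeing)  # ['first_letter', 'first_four']

# One matchkey per combination of forename, middlename and surname variable
matchkey_count = 18 * 9 * 15
print(matchkey_count)  # 2430
```

Each derived variable stands in for one kind of fuzziness, which is why the number of matchkeys grows multiplicatively with the number of derived variables per name part.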

Encrypted data

Huge processing requirement

Traditional matching on 122 matchkeys took over 24 hours

Divide & conquer:

• Equivalent to thousands of matchkeys

• Currently uses ~57 variables

• Parallelized so runs in ~8 hours


Divide and conquer

Create 13 blocks which are matched in two stages:

• Derived agreement – exact, fuzzy, or no agreement?

• Run matchkeys at the derived agreement level e.g.

Loose fuzzy forename + Strict fuzzy surname + Exact date of birth + Exact gender + Exact postcode

Agreement levels:

• Exact agreement: full name agrees
• Strict fuzzy agreement: e.g. Alphaname, nickname
• Loose fuzzy agreement: e.g. Soundex, Double Metaphone
• No agreement
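The two stages can be sketched as follows: first reduce each field comparison to a single agreement level, then express matchkeys as a minimum level per field. This is an illustration under assumptions, not the ONS implementation: the dictionary keys (`full`, `strict`, `loose`) and the example values are placeholders, and in practice the values compared would be hashed.

```python
from enum import IntEnum


class Agreement(IntEnum):
    NONE = 0
    LOOSE_FUZZY = 1
    STRICT_FUZZY = 2
    EXACT = 3


def derive_agreement(a: dict, b: dict) -> Agreement:
    """Stage 1: strongest level at which two sets of derived values agree."""
    if a["full"] == b["full"]:
        return Agreement.EXACT
    if a["strict"] == b["strict"]:
        return Agreement.STRICT_FUZZY
    if a["loose"] == b["loose"]:
        return Agreement.LOOSE_FUZZY
    return Agreement.NONE


# Stage 2: the example matchkey, written as a minimum level per field
MATCHKEY = {
    "forename": Agreement.LOOSE_FUZZY,
    "surname": Agreement.STRICT_FUZZY,
    "dob": Agreement.EXACT,
    "gender": Agreement.EXACT,
    "postcode": Agreement.EXACT,
}


def passes(levels: dict, matchkey: dict = MATCHKEY) -> bool:
    """True if every field reaches at least the level the matchkey requires."""
    return all(levels[field] >= required for field, required in matchkey.items())


# Placeholder derived values for two forenames
christine = {"full": "CHRISTINE", "strict": "CEHIINRST", "loose": "C623"}
christina = {"full": "CHRISTINA", "strict": "ACHIINRST", "loose": "C623"}
print(derive_agreement(christine, christina).name)  # LOOSE_FUZZY
```

Working at the agreement level means one matchkey covers every combination of derived variables that reaches that level, which is how a small number of rules can be equivalent to thousands of traditional matchkeys.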

Parallel processing


Quality assurance in data linkage

ONS Data Linking Symposium

Oct 2019

James Doidge, ICNARC

Katie Harron, UCL

Quality assurance vs quality assessment

Quality assurance:

• “The systematic measurement, comparison with a standard, monitoring of processes and an associated feedback loop that confers error prevention”

But in data linkage:

• The accuracy of most links is unknowable; there are no standards to measure by. Only indirect or partial assessments are possible.

• There are two dimensions of quality and a trade-off between them. The relative value of each depends on the application.

                Match status (true relationship)
Link status     Match          Non-match
Link            True link      False link
Non-link        Missed link    True non-link

Two types of error

Recall: proportion of matches that are linked (sensitivity)

Precision: proportion of links that are true (positive predictive value)
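In terms of the table above, both measures are simple ratios of the cell counts. A small sketch, with counts invented purely for illustration:

```python
def recall(true_links: int, missed_links: int) -> float:
    """Proportion of true matches that were linked (sensitivity)."""
    return true_links / (true_links + missed_links)


def precision(true_links: int, false_links: int) -> float:
    """Proportion of links that are true matches (positive predictive value)."""
    return true_links / (true_links + false_links)


# Illustrative counts only
true_links, false_links, missed_links = 950, 10, 50
print(recall(true_links, missed_links))    # 0.95
print(round(precision(true_links, false_links), 4))
```

Note that true non-links do not appear in either formula, which is convenient because in linkage they usually vastly outnumber every other cell.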

[Figure: distributions of matches and non-matches along an axis from low agreement to high agreement, with regions labelled definite non-matches, possible matches and definite matches. Depends on quality of matching variables.]

[Figure: high precision, low recall; missed links highlighted]

[Figure: low precision, high recall; false links highlighted]


Impacts of false links on analysis

Potential misclassification and measurement error

Mixed up values

But, if equivalent, then no impact!

Erroneous inclusion/exclusion in an analysis

Selection bias, e.g. exclusion of records falsely linked to death register

‘Merging’ of multiple people’s records into one

Misclassification/measurement error and undercounting

Impacts of missed links on analysis

Missing data

Misclassification and measurement error

When links are ‘meaningfully interpreted’, e.g. mortality from linked death records

Erroneous inclusion/exclusion in an analysis

Selection bias, e.g. excluding unlinked records

‘Splitting’ of one person’s records into many

Misclassification/measurement error and double-counting

Example: ‘Splitting’ in Hospital Episode Statistics

[Figure: Number of HESIDs (patients) in HES, by year, 1997 to 2001; y-axis in millions (7.0 to 7.4); series: 2-step algorithm, 3-step algorithm]

What can analysts do about linkage error?

Sensitivity analysis

• Requires uncertain links and link-level information about match quality

• Not guaranteed to capture ‘true’ value

Probabilistic analysis (imputation/weighting)

• Requires uncertain links and link-level information about match quality

Bias analysis

• Requires dataset-level measures of linkage accuracy

• Rates of missed links and false links

• Distribution of each with respect to variables of interest

Example: Estimating prevalence of Down’s Syndrome in HES vs cytogenetic register

[Figure: cases per 10,000 births (0 to 15) by year of birth, 1998 to 2013; series: NDSCR only, Whole of HES, with a linear trend for each]

Restricting HES to a birth cohort to mitigate splitting

[Figure: same axes; series: HES birth cohort only, NDSCR only, Whole of HES]

Using linked data

[Figure: same axes; series: Linked data (base case), HES birth cohort only, NDSCR only, Whole of HES]

Using linked data with quantitative bias analysis

[Figure: same axes; series: HES birth cohort only, NDSCR only, Whole of HES, linear trend for Linked data (base case)]

Using linked data with quantitative bias analysis

[Figure: same axes; series: Plausible range (upper - lower), HES birth cohort only, NDSCR only, Whole of HES]

Assessing linkage quality with identifiers

Technique False links Missed links

% ∆ % ∆

Clerical review

(usually of a sample of candidate links)

~ ~

Apply algorithm to training data or ‘gold standard’

(often a subset)

✓ ✓ ✓ ✓

Apply algorithm to ‘negative controls’

(records that should not link)

✓ ✓

%: rate; ∆: distribution; ~ partially/depends

Assessing linkage quality without identifiers

Technique False links Missed links

% ∆ % ∆

Comparison of linked and unlinked records ~ ✓

Analysis of ‘positive controls’ (subset that should link) ~ ✓

Comparison of linkable and unlinkable records

(or high/low quality matching data)

~ ✓

Comparison of plausible and implausible links ~ ✓

Comparison of observed to plausible number of links ✓

Comparison of linked data to reference statistics ~ ~ ~ ~

%: rate; ∆: distribution; ~ partially/depends

Comparing linked vs unlinked records (or positive controls)

Ford JB, Roberts CL, Taylor LK (2006) Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatr Perinat Ep 20 (4):329-337

Mother-baby links Unlinked babies Unlinked mothers

Comparing plausible and implausible links

Hagger-Johnson et al (2014) Identifying possible false matches in anonymized hospital administrative data without patient identifiers. Health Serv Res DOI:10.1111/1475-6773.12272

Advice for data linkers

■ Be realistic. Accept that uncertainty and error exist.

■ Understand your data. How might errors and inconsistencies have been introduced into matching variables? Which records should or should not link? How many links are expected for each record?

– Engage with data collectors

– Explore your inputs

– Explore your outputs

■ Understand the application. What will be the impacts of linkage error?

– Engage with users

■ Find your balance

Linkage outputs

Data quality

Need for high recall

Human & computing resources

Need for high precision

Suggested minimum outputs

1. Include detailed information about the linkage algorithm

• Including approach to data cleaning

2. Include record-level information about matching variable quality

• Indicators of missing/invalid for each matching variable

3. Include link-level information about match quality

• Pattern of agreement

• Match ranks / match rules / match weights, etc.

4. Include uncertain links and unlinked records

• (when possible)

5. Include information about identified errors

• If possible, include ‘quality assured’ links in extract

• If not possible, include aggregate characteristics for these

A final comment

Ultimately, the best way to quality assure data linkage is to ensure collection of high-quality matching variables, with:

– Unique identifiers, validated at the point of collection

– Back-up options: Additional, unique combinations of variables

■ Minimise processing of personal data (only use for linkage)

■ Minimise access to personal data (only by data linkers)

■ Don’t minimise collection of personal data to the point that linkage is impeded and the value of the dataset is diminished

Acknowledgements & resources

Acknowledgements

■ ESRC

– Administrative Data Research Network (defunct)

– National Centre for Research Methods

■ Wellcome Trust (KH)

■ Prof Ruth Gilbert and Prof Harvey Goldstein

Further reading

■ Doidge, J. C., & Harron, K. (2019). Reflections on modern methods: linkage error bias. International Journal of Epidemiology (in press)

■ Harron K. L., Doidge J. C., Knight H. E., et al. A guide to evaluating linkage quality for the analysis of linked data. International Journal of Epidemiology 2017; 46: 1699-710.

■ Doidge, J. C., & Harron, K. (2018). Demystifying probabilistic linkage: Common myths and misconceptions. International Journal of Population Data Science, 3(1). http://dx.doi.org/10.23889/ijpds.v3i1.410

■ Harron, K., Goldstein, H., & Dibben, C. (Eds.). (2016). Methodological developments in data linkage. Chichester, UK: John Wiley & Sons, Ltd.

Understanding the educational background of offenders

Data sharing between the Ministry of Justice and

Department for Education

October 2019

Background

• Proof of concept share – 2015:

• Police National Computer and magistrates’ courts data linked to the

National Pupil Database

• 70% match rate

• Offences, sentences, educational outcomes and characteristics

• Range of published outputs

• Increased interest in the power of data sharing and data linking

• Focus on Serious Violence and ‘What Works’


Key challenges


Project outline

• Discussions with stakeholders

• DSA drafting

• DPIA

• Design of technical solution

• Compliance review

• Sign-off

• Cohort build

• Exchange of activity data

• Matching

• QA

• Analysis

Matching the data - process

• Matching to take place within DfE to minimise the personal information shared

• The offender cohort that can potentially be matched represents fewer individuals than can potentially be matched from the DfE

Offender cohort [Dataset A]

Offender cohort identified in DfE data [Dataset B]

DfE attach education data to individuals in Dataset B [Dataset C]

MoJ attach justice data to individuals in Dataset B [Dataset D]

Matching challenges

[Diagram: the MoJ offender cohort and the DfE education cohort, with the linked cohort (Dataset B) between them]

Matching methodology

• Inclusion of alias versions of identifiers in the matching, to improve chances of successful matches.

• Iterate through multiple matching rules with each match accompanied by a variable indicating the match quality / strength

• Decisions will be taken jointly by the MoJ and DfE following matching as to the quality of match that will be accepted and brought into the linked dataset.
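The methodology above (alias versions, a sequence of rules, and a match-quality variable) can be sketched as a simple rule cascade. Everything specific here is hypothetical: the rules, the field names and the records are invented for illustration and are not the MoJ/DfE rules.

```python
# Hypothetical rules, strictest first: (match quality, fields that must agree)
RULES = [
    (1, ("forename", "surname", "dob", "postcode")),
    (2, ("forename", "surname", "dob")),
    (3, ("surname", "dob", "postcode")),
]


def identifier_versions(offender: dict):
    """Yield the main identifiers, then each alias version of them."""
    main = {k: v for k, v in offender.items() if k != "aliases"}
    yield main
    for alias in offender.get("aliases", []):
        yield {**main, **alias}


def agrees(a: dict, b: dict, fields) -> bool:
    """All listed fields are present and equal in both records."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields)


def match_offender(offender: dict, pupils: list):
    """Return (pupil_id, match quality) from the first rule that fires."""
    for quality, fields in RULES:
        for version in identifier_versions(offender):
            for pupil in pupils:
                if agrees(version, pupil, fields):
                    return pupil["pupil_id"], quality
    return None, None


offender = {"forename": "JON", "surname": "DOE", "dob": "2000-01-01",
            "postcode": None, "aliases": [{"forename": "JOHN"}]}
pupils = [{"pupil_id": "P1", "forename": "JOHN", "surname": "DOE",
           "dob": "2000-01-01", "postcode": "AB1 2CD"}]
print(match_offender(offender, pupils))  # ('P1', 2)
```

Keeping the quality value alongside each match is what allows the acceptance threshold to be decided jointly after matching rather than being fixed in the code.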

Maximising the value

• Permitted uses

• Range of access routes:

• Internal settings

• ONS Secure Research Service (DfE)

• Justice MicroData Lab (MoJ)

• Engagement with users and allies

• OGDs

• Academia

• ADR-UK

• Strategic Framework


The Future

This iteration

• Analysis across government – e.g.:

• Educational background of young offenders

• County lines

• Evaluations – e.g.:

• Youth endowment fund

• Long-term foster care

Further iterations

• Extensions to data

• Further partnerships


Annex – contacts and publications

Contacts

• DfE (to learn more) – Gary Connell ([email protected])

• DfE (access to data) – Data Sharing Team ([email protected])

• MoJ (to learn more) – David Dawson ([email protected])

• MoJ (access to data) – Data Access Group ([email protected])

Publications

• Understanding the educational background of young offenders - https://www.gov.uk/government/statistics/understanding-the-educational-background-of-young-offenders-full-report

• Examining the educational background of young knife possession offenders - https://www.gov.uk/government/statistics/knife-and-offensive-weapon-sentencing-january-to-march-2018

• Examining the educational background of prolific offenders - https://www.gov.uk/government/statistics/criminal-justice-system-statistics-quarterly-december-2018


Probabilistic and Deterministic Data Linkage

@ SAIL Databank / UK Secure e-Research Platform (UKSeRP)

Simon Thompson – Chief Technical Officer,

Swansea University Medical School

The Story

Context

What is SAIL Databank ?

1. A “safe”, legal and publicly acceptable response to the need for “open” person-based linked data for research and intelligence.

2. Citizen-centric, individual-level data, at scale, linking together health and social data from across the public sector in Wales

3. Carefully curated individual-level data, rendered anonymous in use by robust and transparent socio-technical systems.

4. Access available to any legitimate person for any legitimate, public benefit purpose.

5. Strong continuing engagement with the public through panels and on-going consultations

6. Internationally-recognised best practice system, increasingly implemented across the world

SAIL Databank majors in datasets relating to the Welsh population, but not exclusively

In the past very health focused; now person focused, covering any data relating to people

Five “safes”:

• Split file approach with Trusted Third Party

• Reliable, automated probabilistic matching / data linkage process (in a TTP)

• De-identification via multiple (automated) encryption

• Secure data transportation (data inwards)

• Data risk reduction

• Independent scrutiny of data utilisation proposals

• Remote data access only (no data leaves)

• Disclosure control (safe outputs)

• High security – defence in depth, multi-layered firewalls + regular penetration testing

• External verification of compliance with Information Governance (ISO 27001, audit)

• Automation, Automation, Automation (the three ‘A’s!)

SAIL Databank

• Over 32 billion records for >5 million people

• Most data goes back 10-20 years

• All pre-linked data

• 300+ approved SAIL projects, with 152 active today

• 120 staff in Swansea working on Health Informatics related projects

41 core datasets

162 Project Specific datasets

• Governance Model and Privacy Protection

• Research to data not data to researcher

• Rich collaborative virtual space

• Large Data Collection

• Lots of health data but others too

• Not exclusively Welsh data but has all Wales datasets and holdings

• UKSeRP as infrastructure

• Performance & secure remote access

• Multi Modality

• Adding Omics.*, Imaging, NLP, GIS

Reach (New)

• SAIL uses NHS Wales Informatics Service (NWIS) as our trusted third party (TTP) with no shared roles or staff.

Split file principle

[Diagram: Supplier, TTP and SAIL; demographics; no shared roles; no shared access]

Split File Submission

Tools and components to enable this...

SAIL Split File Principle

Additional Project level encryption of ALF_E → PALF_E

Based on ISD algorithm

Residential / Geo-Spatial Linkage

• RALF – Residential ALF

• Ability to identify “family” groupings / co-inhabitancy

• Ability to compute vectors of social influences – distance to nearest off licence / hospital, air pollution

• New RALF switch from PAF to Address Base

SAIL Databank : Repository

ALF_E(Linkage key)

All Datasets are Linkable Projects get linked data cuts / views

The Story

“Combine and share your data and stay in complete control”

Programmes using UKSeRP today..

• UK Secure e-Research Platform (UKSeRP)

• SAIL - Health related person data

• ADRC - Non-health person-level data

• DPUK - 35 dementia cohorts + Imaging + Genomics

• ALSPAC - From birth cohort, deep phenotyped

• UK Biobank (outcomes) – Routine data and SNIPS

• UK MS Register - UK register of people with MS – EHR & PROMS

• MRC Pathfinder - Mental health platform(s)

• CLIMB - Microbial Genomics

• UKCRIS - Mental health unstructured data

• ELGH - East London Genomics and Health Programme

• DSB - Collection of smaller projects

• GOV - Welsh government use

• HWW - Welsh PROMS

SeRP Coverage / Deployments

Australia: Monash - been running for a while, Curtin - operational by end of year

Canada: British Columbia – install Jan 2020, operational by Apr 2020

UK Contracts

• Spine / No Spine / pseudo Spine

• Multi Data Modality – Routine data, Project data, Imaging, Genomics, NLP, Unstructured

• Vast variation in data and curation quality

• Fully automated / Full separation

• Tenancy specific linkage models / levels of acceptability

• Project defined quality thresholds

• Tuning – migration affecting linkage approach

• Maintain backwards compatibility – supporting 10 years’ worth of research

New Linkage needed for a new era

New ALF – clearly more complicated

At the heart of it is a new linkage engine

• Create matching pools.

• All datasets are de-duplicated (matching within a dataset)

• All datasets fed into a pool will be linked to all other datasets in pool where possible.

• A matching pool can be linked to a “core pool”: remove dependence on spine while keeping spine.

• Linkage expressed as a graph of nodes with confidence weighting between nodes

• Clusters identified and numbered – Encrypted id is new ALF

• Current ALF maintained for backward compatibility.

• Pools retain data to enable linkage to any future data added; NRDA-Linkage has option to remove pool after linkage.
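The graph idea in the bullets above (records as nodes, confidence-weighted links, clusters numbered as identifiers) can be sketched with a union-find structure. This is an illustration under assumptions, not the NRDA engine: the threshold, the record labels and the weights are invented, and the encryption step that turns a cluster number into the new ALF is not shown.

```python
def cluster(edges, threshold):
    """Group records connected by links whose confidence meets the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, weight in edges:
        root_a, root_b = find(a), find(b)  # registers both nodes
        if weight >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Illustrative links: (record, record, confidence)
edges = [("A1", "B7", 0.95), ("B7", "C3", 0.90), ("C3", "D2", 0.40)]
clusters = cluster(edges, threshold=0.8)
cluster_ids = {record: number for number, group in enumerate(clusters, start=1) for record in group}
print(clusters)  # [['A1', 'B7', 'C3'], ['D2']]
```

Because the weights are stored on the links rather than baked into the identifier, a project could in principle re-cluster the same graph at a different threshold, which fits the idea of project-defined quality thresholds.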

1000 foot view of new linkage capability

Graph at Time: T

Linkage Project with datasets A, B and C; project linkage strategy defined:

Linkage algorithm 1

• Rows in B

• Rows in C

• B and C

Linkage algorithm 2

• Output of Linkage Algorithm 1 to A

Graph at Time: T+1

Linkage Project with datasets A, B, C and D; project linkage strategy defined:

Linkage algorithm 1

• Rows in B

• Rows in C

• Rows in D

• B and C and D

Linkage algorithm 2

• Output of Linkage Algorithm 1 to A

Stored Graph has temporal aspect

Linkage can change over time as extra/new data is added

What was the state at time=x

• Assessment of all pairs to decide if they belong to the same person

• Identify all pairs of records for each individual

• Combine ‘true positive’ pairs together into Groups

• Group output provides the linkage map

NRDA brings world leading linkage

Privacy preserving linkage https://computation.curtin.edu.au/wp-content/uploads/sites/25/2017/10/Schnell_2017_Curtin_CS_2.pdf

Bloom filtered one-way hashing of source files.

Still able to do deterministic and probabilistic data linkage with only a marginal drop in accuracy.

Computationally more expensive

Linkage strategy vs disclosure risk vs utility (field/row enc.)

Encryption key distribution and reuse

Ideal for one-off linkage scenarios

However, as a weapon of last resort, it can be argued that it renders the source files as non-PII, allowing for submission?
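To show why Bloom-filtered hashing still permits fuzzy comparison, here is a minimal sketch of the general technique: split a name into bigrams, hash each bigram into a bit array with a keyed hash, and compare two encodings with the Dice coefficient. The filter size, number of hash functions, key and padding scheme are all assumptions for illustration, not SAIL's configuration.

```python
import hashlib
import hmac


def bigrams(value: str) -> set:
    """Overlapping two-character substrings of the padded, upper-cased value."""
    padded = f"_{value.upper()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, size: int = 1000, num_hashes: int = 4,
                 key: bytes = b"shared-secret") -> set:
    """Return the set of bit positions switched on for this value."""
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hmac.new(key, f"{k}|{gram}".encode("utf-8"), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a: set, b: set) -> float:
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


similar = dice(bloom_encode("CHRISTINE"), bloom_encode("CHRISTINA"))
different = dice(bloom_encode("CHRISTINE"), bloom_encode("ROBERT"))
print(similar > different)
```

Names that share most of their bigrams share most of their bit positions, so similarity survives the encoding; both parties must use the same key and parameters, which is the key distribution issue noted above.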

Question: Professor Rainer Schnell speaking next ☺


ALF v2

LinkProjID ProjectID LinkProjectType Name Owner Created Source ALF ALF2 RALF Bloom Destination

11 Persistent PEDW core dataset NWIS 01/01/2018 FILE X X X X UKSERP SAIL

12 123 Temp NRDA 99 - P0123 NRDA99 02/02/2018 NRDA Sharing X X UKSERP ADRC

Submission LinkProjID Filename size Datestamp Start/End ALF ALF2 RALF Encrypt XX

123 11 File1-123.csv 1gb 02/02/2018 02/02/2018 11:24 yes NA not yet NA

[Diagram: NRDA components. Data IN; Controller; Monitor + performance metrics; four queue-based stages, each with Control, In Queue and Out Queue, including ALF, RALF and Encrypted Linkage; Data Out; Data Transport; Switching Service; FTPS; UKSeRP / SAIL; Remote Dashboard; System Admin]

New Product (NRDA-Linkage) – Deployable Infrastructure for TTP role

UKSeRP – every tenancy has its own linkage engine

[Diagram: UKSeRP tenancy architecture. DATASET: Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, Artefacts / Files. Components: Web Front End; FTP / ETL; Security, Configuration & Capability Model; Publishing; Local Data Catalogue; Linkage & Matching; Database Loader (File Splitter); Sharing & IG; Data Quality and Metrics. Databases: MS SQL, PostgreSQL; External: MS SQL, IBM DB2, HADOOP, PostgreSQL. Trusted Third Party Linkage & Matching; Other Appliance; Regional / Global Data Catalogue]

Not TTP configuration, but if an organisation decides on a “Chinese wall” approach then it is available

Also part of the NRDA (product), so can be deployed to pre-link datasets before submission

“Combine and share your data and stay in complete control”


The Future is about Federation of data silos

Taking into account

- Governance

- Local design constraints

- Cyber Security

- Diversity is fine (supports innovation)

- Operational and onward costs

Fed-Discovery

Fed-NLP

Fed-Linkage

Fed-Analysis


Federation – Progress so far

Fed-Discovery

Fed-NLP

Fed-Linkage

Fed-Analysis

V1 / MVP – Ready start testing

To be achieved by start 2020

Mature design – use case needs defining

V1 done – New integration– Expand NLP engines

Done / To do / Design

Next Phase

• Federated or Networked: Deterministic / Probabilistic Data Linkage

Strong use-case from Australia to test and validate approach

• Linkage engine agnostic = Linkage Sites

• Network effect = Virtual overlay of interconnected nodes (contributing / using)

• Linkage graph = Virtual linkage applied to graph / project specific views

• Modular rule based approach, onward sharing, scope of dissemination, … lots of options

Example: Inter cohort linkage – Connect to local major linkage site

I have been Simon Thompson, Swansea University

You have been great
