
Upload: alasdair-gray

Post on 22-Jan-2018


Page 1: An Identifier Scheme for the Digitising Scotland Project

An Identifier Scheme for the Digitising Scotland Project

Alasdair J G Gray
Department of Computer Science, Heriot-Watt University, Edinburgh
@gray_alasdair
www.macs.hw.ac.uk/~ajg33

Özgür Akgün, Uni. of St Andrews
Ahamd Alsadeeqi, Heriot-Watt Uni.
Peter Christen, Australian National Uni.
Tom Dalton, Uni. of St Andrews
Alan Dearle, Uni. of St Andrews
Chris Dibben, Uni. of Edinburgh
Eilidh Garret, Uni. of Essex
Graham Kirby, Uni. of St Andrews
Alice Reid, Uni. of Cambridge
Lee Williamson, Uni. of Edinburgh

Page 2: An Identifier Scheme for the Digitising Scotland Project

Digitising Scotland Project

Large scale family reconstruction studies and Pedigrees
• Transcription of data
• Linking of data

Performed at scale
• Whole nation
• Large timeframe

1 June 2017 ADRN Conference 2

Page 3: An Identifier Scheme for the Digitising Scotland Project

Project Team

Backgrounds

• Demographers • Historians • Computer Scientists

Distributed team

Locations: St Andrews • Cambridge • Edinburgh • Edinburgh • Australia

Page 4: An Identifier Scheme for the Digitising Scotland Project

Transcribing Scotland’s Vital Records: 1855 – 1974

• 24M records
  • Birth
  • Marriage
  • Death
• 18M individuals

Page 5: An Identifier Scheme for the Digitising Scotland Project

Data Linkage Challenges

• Low quality data
• Probabilistic matches
• Scalability
• Skewed name distributions

[Figure: example records to be linked. Labels in the diagram: John Grant, Fisherman; Fiona Sinclair; Ian Grant, Smithy, Born: 1861; Stuart Adam, Wheelwright; Morag Scott; Flora Adam, Seamstress, Born: 1866; Married: 1884; John Grant, Farmer; Fiona Sinclaire; Iain Grant, Born: 1860]
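
The name variants in the example records (Ian and Iain, Sinclair and Sinclaire) show why exact string comparison is not enough and matches have to be scored. The following is a minimal sketch of such a score using Python's standard `difflib`; the function names and the 0.8 threshold are illustrative assumptions, not the comparison method used by the project.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def probably_same_name(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two names as a candidate match when the score reaches the threshold.

    The default threshold is an assumption chosen for illustration only.
    """
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [("Ian", "Iain"), ("Sinclair", "Sinclaire"), ("Grant", "Adam")]
    for left, right in pairs:
        print(left, right, probably_same_name(left, right))
```

A real linkage pipeline would combine several such scores (names, dates, occupations) rather than rely on a single name comparison.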

Page 6: An Identifier Scheme for the Digitising Scotland Project

Linking Skye Data

Page 7: An Identifier Scheme for the Digitising Scotland Project

Discussing records

Eilidh, I’m having problems with the Skye record B-BABY-8293.

Peter, which transcribed certificate is that?

It is the record for Chris Dibben, born 18 March 1893.

That is the child on record 5457. It should link to the death on record 5754, 4 December 1959.

Thanks, found it now. It is record D-DEATH-2182.

Page 8: An Identifier Scheme for the Digitising Scotland Project

Existing Identifier Schemes

Historians: Example: 5457

• Incremental integer
• Easily confused with other record types
• Identifies certificate not actors
• Based on order of transcription
• Not derived from data
• Unique for a file
• Excel spreadsheet

Record Linkage: Example: B-BABY-8293

• Encode type of certificate and actor on certificate
• Four digits generated by linkage process
• Different from those used by the historians
• Different for each run of linkage pre-processing

Page 9: An Identifier Scheme for the Digitising Scotland Project

Desiderata for Identifiers

1. Identifier for each actor on a certificate
2. Exchangeable between researchers
3. Unique generation process from the data
4. Immutable to data changes, e.g. typo discovered in data
5. Human derivable from data records
6. Human interpretable
7. Compact to enable efficient computation
8. Susceptible to blocking
9. Globally unique
10. Consistent approach for all record types
11. Compatible with pre-existing NRS approach
12. Compatibility with Open Data Standards
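
Desideratum 8 refers to blocking, the record linkage technique of comparing only records that share a key instead of every possible pair. The sketch below assumes identifiers whose text before the first underscore can serve as that key; the identifiers and function names are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(identifier: str) -> str:
    """Use the text before the first underscore as the block key (an assumption)."""
    return identifier.split("_", 1)[0]


def candidate_pairs(identifiers):
    """Yield only those pairs of identifiers that fall into the same block."""
    blocks = defaultdict(list)
    for identifier in identifiers:
        blocks[blocking_key(identifier)].append(identifier)
    for members in blocks.values():
        # Sorting makes the order of the generated pairs deterministic.
        yield from combinations(sorted(members), 2)


if __name__ == "__main__":
    # Hypothetical identifiers; the suffixes a..d stand in for the remaining parts.
    ids = ["B1861_a", "B1861_b", "B1860_c", "D1959_d"]
    print(list(candidate_pairs(ids)))
```

With four identifiers there are six possible pairs; grouping by key first leaves only the pairs inside each block to be compared.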

Page 10: An Identifier Scheme for the Digitising Scotland Project

Identifier Scheme

B1903_164_00_baby

typeYear_district_subdistrict_entryNumber_role
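
The pattern above can be sketched as a small parser and formatter. This is a hedged illustration, not the project's code: the class and function names are hypothetical, the type letters M and D are assumed by analogy with the B on the slide, the entry number is assumed to be numeric, and the sample identifier in the usage section uses a made-up entry number. Note that the example as transcribed on the slide shows four components, while the pattern names five, so the sketch follows the pattern.

```python
import re
from dataclasses import dataclass

# Assumed shape of the first component: one type letter plus a four-digit year.
# "B" appears on the slide; "M" and "D" are assumptions for marriage and death.
_TYPE_YEAR = re.compile(r"^([BMD])(\d{4})$")


@dataclass(frozen=True)
class ActorId:
    cert_type: str     # "B", "M" or "D" (assumed letters)
    year: int
    district: str      # kept as text so leading zeros survive
    subdistrict: str
    entry_number: str  # assumed to consist of digits only
    role: str          # may contain underscores, e.g. "groom_father"

    def __str__(self) -> str:
        type_year = f"{self.cert_type}{self.year:04d}"
        return "_".join(
            [type_year, self.district, self.subdistrict, self.entry_number, self.role]
        )


def parse_actor_id(text: str) -> ActorId:
    """Split an identifier into its five components, or raise ValueError."""
    # maxsplit=4 keeps underscores inside the role (fifth component) intact.
    parts = text.split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}: {text!r}")
    type_year, district, subdistrict, entry_number, role = parts
    match = _TYPE_YEAR.match(type_year)
    if match is None:
        raise ValueError(f"unrecognised typeYear component: {type_year!r}")
    if not entry_number.isdigit():
        raise ValueError(f"entry number is not numeric: {entry_number!r}")
    return ActorId(
        match.group(1), int(match.group(2)), district, subdistrict, entry_number, role
    )


if __name__ == "__main__":
    # Hypothetical identifier: the entry number 12 is invented for illustration.
    print(parse_actor_id("B1903_164_00_12_baby"))
```

Keeping district, subdistrict and entry number as text preserves values such as `00` exactly, so formatting a parsed identifier gives back the original string.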

Page 11: An Identifier Scheme for the Digitising Scotland Project

Certificate Roles

Birth
• baby • mother • father • registrar • informant

Marriage
• groom • groom_father • groom_mother • bride • bride_father • bride_mother • witness1 • witness2 • officiant • registrar

Death
• deceased • mother • father • spouse1…spousen • informant • doctor • registrar
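
The role lists above lend themselves to a simple lookup that checks whether a role is allowed on a given certificate type. The following sketch takes the role names from the slide; the dictionary keys (`birth`, `marriage`, `death`), the function name, and the reading of `spouse1…spousen` as `spouse` followed by a positive integer are assumptions made for illustration.

```python
import re

# Role names as listed on the slide; the certificate keys are hypothetical labels.
ROLES = {
    "birth": {"baby", "mother", "father", "registrar", "informant"},
    "marriage": {
        "groom", "groom_father", "groom_mother",
        "bride", "bride_father", "bride_mother",
        "witness1", "witness2", "officiant", "registrar",
    },
    "death": {"deceased", "mother", "father", "informant", "doctor", "registrar"},
}

# Assumed reading of "spouse1…spousen": "spouse" plus a positive integer.
_SPOUSE = re.compile(r"^spouse[1-9]\d*$")


def is_valid_role(certificate: str, role: str) -> bool:
    """Return True when the role may appear on the given certificate type."""
    if certificate == "death" and _SPOUSE.match(role):
        return True
    return role in ROLES.get(certificate, set())


if __name__ == "__main__":
    print(is_valid_role("birth", "baby"))      # role listed under Birth
    print(is_valid_role("death", "spouse2"))   # numbered spouse on a death record
    print(is_valid_role("birth", "groom"))     # role not listed under Birth
```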

Page 12: An Identifier Scheme for the Digitising Scotland Project

Conclusions

• Agreed identifier scheme

typeYear_district_subdistrict_entryNumber_role

• Meets desiderata

• Reliant on “clean” parts of certificate

• Compatible with NRS

• Improved team communications

Alasdair Gray
www.macs.hw.ac.uk/~ajg33/

@gray_alasdair

Acknowledgements:
• Julia Jennings
• Christine Jones
• Diego Ramiro-Farinas
