Matching names in parallel

T. Hickey

Access 2006, October 2006

Virtual International Authority File

Link national authority records

Build on their authority work

Move towards universal bibliographic control

• Allow national or regional variations in authorized forms to co-exist

• Support needs for variations in preferred language, script, and spelling

• 10 million WorldCat records in non-English metadata

Joint VIAF Project

Matching Variations

In the LCNAF and PND authority files:

• Same name, same person
• Same name, different people
• Different names, same person
• Missing person in one file

Two Different People – One Name

Adams, Mike

• PND: a golfer
• LCNAF: author of a Beatles collector's guide

Same name, different people

One Person – Two Names

• LCNAF: Morel, Pierre
• PND: Morellus, Petrus

Same person, different names

Enhancing the Authorities

[Diagram: Bibliographic Record, Derived Authority, Authority Record, Enhanced Authority]

Strong Matching Attributes

• A work (title) in common
• Common control numbers (ISBN, ISSN, or LCCN)
• Exact birth and death year
• Joint authors
• Name as subject

Weaker Attributes

• Only one of birth/death date(s) (allows some variation)
• Subject area of works (two levels)
• Format (books, films, musical scores, etc.)
• Language
• Publisher
• Partial title match
• Date of publication
• Country
• Role (author, illustrator, composer, etc.)
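As a rough illustration of how strong and weaker attributes might be combined into a single score for a candidate pair, here is a minimal sketch. The slides name the attributes but not how they are weighted, so the weights, the record field names (`titles`, `control_numbers`, and so on), and the `score_pair` function are all assumptions made for illustration.

```python
# Hypothetical weights: the slides list the attributes but give no values.
STRONG = {"titles": 10, "control_numbers": 10, "joint_authors": 8}
WEAK = {"languages": 1, "publishers": 1, "formats": 1, "countries": 1}


def score_pair(a, b):
    """Sum the weights of every attribute on which two records share a value."""
    score = 0
    for field, weight in {**STRONG, **WEAK}.items():
        if set(a.get(field, ())) & set(b.get(field, ())):
            score += weight
    # Exact birth and death year is strong; a single matching date is weaker.
    dates_a = (a.get("birth"), a.get("death"))
    dates_b = (b.get("birth"), b.get("death"))
    if None not in dates_a and dates_a == dates_b:
        score += 10
    elif any(x is not None and x == y for x, y in zip(dates_a, dates_b)):
        score += 2
    return score
```

A real matcher would also need a threshold for accepting a pair and some handling of conflicting evidence, neither of which is attempted here.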

Computing it

Standard approach

• Generate keys and data
• Load information into a database
• Index it
• Extract fields needed

Map/Reduce approach

• Split the database up
• Run parallel jobs
• Bring information together via map/reduce
• Assemble information in stages

Map/Reduce

Two stages

• Map
  • Read in source file (e.g. MARC-21)
  • Write out key + data
• Reduce
  • Read in array of data for each unique key
  • Write out key + data
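The two stages can be sketched in a few lines of single-process Python. This is only an illustration of the pattern, not the implementation described in these slides: `map_reduce`, `name_mapper`, and `id_list_reducer` are hypothetical names, and the `(record id, name heading)` input layout is an assumption standing in for a parsed MARC-21 record.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map/reduce pass in a single process (no parallelism)."""
    grouped = defaultdict(list)
    # Map: each source record yields zero or more (key, data) pairs.
    for record in records:
        for key, data in mapper(record):
            grouped[key].append(data)
    # Reduce: each unique key is handed the array of data collected for it.
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


def name_mapper(record):
    # Hypothetical record layout: (record id, name heading).
    record_id, name = record
    yield name.lower(), record_id


def id_list_reducer(key, values):
    # Write out key + data: here, the sorted ids that share a name.
    yield key, sorted(values)
```

Because each mapper call sees one record and each reducer call sees one key, both loops can in principle be distributed across machines, which is what makes the pattern attractive for parallel work.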

Overview of MapReduce

Source: Dean & Ghemawat (Google)

Our Implementation

• Written in Python
• Uses ssh and XML-RPC for control and communication
• Map/Reduce seems to add ~10% overhead
• Ran an earlier implementation on a 48-CPU cluster
• Current VIAF cluster is a 12-CPU cluster on 4 nodes
• Running Linux and 64-bit Python

VIAF Matching Code

• 17 modules
• 1,100 lines of code
• Plus
  • 600 lines configuration
  • 2,755 lines of tables embedded in code

VIAF Data Flow

[Diagram: VIAF data flow]

Sources: LC Authority, LC Catalog, PND Authority, PND Catalog (each passes through Extract Data)

Processing steps, with the key:data labels attached to them in the diagram:

• build name:id map (name:id)
• map authorities (authority id: bib id)
• get changed Ids (changed authority ids)
• build buckets (surname:forename,date)
• eliminate forename, date conflicts from buckets (potential pairs)
• identify compare data (pair id:[bib/auth]id)
• build compare data (id:tag, data)
• select compare data (pair id: compare data)
• compare (pair id: scores)
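The "build buckets" and "eliminate forename, date conflicts from buckets" labels in the data flow suggest a blocking step that narrows the candidates before any detailed comparison. The sketch below shows one way such a step could look. The record layout `(id, surname, forename, birth year)` and the conflict rule (different first initials, or two known years that differ) are assumptions for illustration; the slides do not spell out the actual rules.

```python
from collections import defaultdict
from itertools import combinations


def build_buckets(records):
    """Group records by surname; each bucket holds (id, forename, year)."""
    buckets = defaultdict(list)
    for rec_id, surname, forename, year in records:
        buckets[surname.lower()].append((rec_id, forename, year))
    return buckets


def conflicts(a, b):
    """Assumed rule: differing initials or differing known years conflict."""
    _, forename_a, year_a = a
    _, forename_b, year_b = b
    if forename_a and forename_b and forename_a[0].lower() != forename_b[0].lower():
        return True
    return year_a is not None and year_b is not None and year_a != year_b


def potential_pairs(buckets):
    """Yield id pairs within each bucket that survive the conflict check."""
    for members in buckets.values():
        for a, b in combinations(members, 2):
            if not conflicts(a, b):
                yield a[0], b[0]
```

Bucketing by surname keeps the number of comparisons far below the all-pairs count, and the cheap conflict test removes pairs that the later scoring step could never accept.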

WorldCat Identities

Bring together all of WorldCat’s information about people

• Name(s)
• Works by and about
• Subjects
• Dates
• Fiction/non-fiction
• Roles
• Co-authors

Add links

• Wikipedia
• Authority files

Sample Identity

Statistics

• Nearly 19 million different ‘identities’ in WorldCat
• 80 million (nominally) controlled headings

The WorldCat Identity code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)

Identities Data Flow

[Diagram: Stage 1, Stage 2, Stage 3, Stage 4; data labels: WorldCat, Authorities, NameInfo, Citations, Cover Art, FRBR, Audience, Wikipedia, Identities]

Identities Stage 1: Extract Data From WorldCat

Input: WorldCat (MARC-21)

Map output:

• NameKey <nameInfo>
• WorkID <citation>

Reduce output:

• WorkID <best citation>
• NameKey <cumulative nameInfo>
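Stage 1 emits two kinds of key from the same input record, and the reducer treats each kind differently. A minimal sketch of that shape follows. The record fields (`work_id`, `title`, `holdings`, `names`), the tuple keys, and the choice of "most holdings" as the meaning of "best citation" are all assumptions for illustration, not details taken from the slides.

```python
from collections import defaultdict


def stage1_map(record):
    """Emit one citation keyed by work and one nameInfo per name."""
    citation = {"title": record["title"], "holdings": record["holdings"]}
    yield ("work", record["work_id"]), citation
    for name in record["names"]:
        yield ("name", name.lower()), {"records": 1, "holdings": record["holdings"]}


def stage1_reduce(key, values):
    """Keep the best citation per work; accumulate nameInfo per name."""
    if key[0] == "work":
        # Assumption: "best" means the citation with the most holdings.
        return max(values, key=lambda c: c["holdings"])
    return {
        "records": sum(v["records"] for v in values),
        "holdings": sum(v["holdings"] for v in values),
    }


def run_stage1(records):
    """Single-process driver: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in stage1_map(record):
            grouped[key].append(value)
    return {key: stage1_reduce(key, values) for key, values in grouped.items()}
```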

Identities Stage 2: Extract Data From Authorities

Input: NACO Authorities file (MARC-21)

Map output

• NameKey <authorityInfo>
• XTos
• XFroms

Reduce output

• NameKey <authorityInfo, symmetric xrefs>
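One way to read the XTos and XFroms outputs is that the mapper emits each cross-reference twice, once under the referring name and once under the referenced name, so that the reducer for either name sees the link. The sketch below follows that reading. It is an interpretation rather than the documented design, and the `heading` and `see_also` fields, the tagged tuples, and the function names are hypothetical.

```python
from collections import defaultdict


def stage2_map(authority):
    """Emit the record's own info plus a cross-reference in each direction."""
    key = authority["heading"].lower()
    yield key, ("info", authority["heading"])
    for target in authority.get("see_also", []):
        yield key, ("xto", target.lower())      # this name points to target
        yield target.lower(), ("xfrom", key)    # target is pointed to by this name


def stage2_reduce(values):
    """Merge info with the union of outgoing and incoming references."""
    info = [v for kind, v in values if kind == "info"]
    xrefs = sorted({v for kind, v in values if kind in ("xto", "xfrom")})
    return {"info": info[0] if info else None, "xrefs": xrefs}


def run_stage2(authorities):
    """Single-process driver: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for authority in authorities:
        for key, value in stage2_map(authority):
            grouped[key].append(value)
    return {key: stage2_reduce(values) for key, values in grouped.items()}
```

Emitting the reverse link in the map step means no lookup across records is needed: symmetry falls out of grouping by key.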

Identities Stage 3: Connect Citations with Names

Input

• Stage 1 output
  • WorkID <by/about citation>’s
  • NameKey <nameInfo>

Map output

• NameKey <nameInfo>
• NameKey <topCitations>

Identities Stage 4: Create Identities

Input

• Authority info from stage 2
• Merged name info from stage 3
• Merged citations from stage 3

Map output

• Pass through

Reduce output

• Pnkey <Identity Record>

Schedules

Identities

• Up this year?

VIAF

• Reload, rematch this year
• Public service up early 2007

Conclusions

• Our merged files (e.g. WorldCat) are really quite large
• More processing power opens up new ways of manipulating and looking at our data
• Parallel processing is the only way to obtain the cycles needed
• Map-Reduce is an attractive way to do parallel processing
  • Forces decomposition
  • Scales well
  • Opens up new possibilities

Thank you

T. Hickey

http://viaf.org/

http://errol.oclc.org/laf/n82-54463.html
