Matching names in parallel
T. Hickey
Access 2006, October 2006
Virtual International Authority File
Link national authority records
Build on their authority work
Move towards universal bibliographic control
• Allow national or regional variations in authorized forms to co-exist
• Support needs for variations in preferred language, script, and spelling
• 10 million WorldCat records in non-English metadata
Joint VIAF Project
Matching Variations
In the LCNAF and PND authority files:
• Same name, same person
• Same name, different people
• Different names, same person
• Missing person in one file
Two Different People – One Name
Adams, Mike
PND: a golfer
LCNAF: author of a Beatles collector's guide
Same name, different people
One Person – Two Names
LCNAF: Morel, Pierre
PND: Morellus, Petrus
Same person, different names
Enhancing the Authorities
[Diagram labels: Bibliographic Record, Derived Authority, Authority Record, Enhanced Authority]
Strong Matching Attributes
• A work (title) in common
• Common control numbers (ISBN, ISSN, or LCCN)
• Exact birth and death year
• Joint authors
• Name as subject
Weaker Attributes
• Only one of birth/death date(s) (allows some variation)
• Subject area of works (two levels)
• Format (books, films, musical scores, etc.)
• Language
• Publisher
• Partial title match
• Date of publication
• Country
• Role (author, illustrator, composer, etc.)
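One way to read the two attribute lists is as evidence of different weight for a candidate pair of records. The sketch below is illustrative only: the field names, the two weights, and the additive scoring are assumptions made for the example, not the VIAF matching code.

```python
# Hypothetical scoring of a candidate pair; weights and field names are assumed.
STRONG = 1.0   # assumed weight for each strong attribute in common
WEAK = 0.25    # assumed weight for each weaker attribute in common


def shared(a, b, field):
    """Return True when both records share at least one value for field."""
    return bool(set(a.get(field, ())) & set(b.get(field, ())))


def score_pair(a, b):
    """Add up assumed weights for the attributes two records have in common."""
    score = 0.0
    for field in ("titles", "control_numbers", "coauthors"):
        if shared(a, b, field):
            score += STRONG
    # Exact birth and death year is strong; only one matching date is weaker.
    dates_a, dates_b = a.get("dates"), b.get("dates")
    if dates_a and dates_b:
        birth = dates_a[0] is not None and dates_a[0] == dates_b[0]
        death = dates_a[1] is not None and dates_a[1] == dates_b[1]
        if birth and death:
            score += STRONG
        elif birth or death:
            score += WEAK
    for field in ("languages", "publishers", "formats"):
        if shared(a, b, field):
            score += WEAK
    return score


# Invented records: (birth, death) years and titles are placeholders.
rec_a = {"titles": {"title a", "title b"}, "dates": (1900, 1980), "languages": {"lat"}}
rec_b = {"titles": {"title a"}, "dates": (1900, 1980), "languages": {"lat", "fre"}}
pair_score = score_pair(rec_a, rec_b)
```

Here the pair shares a title, both dates, and a language, so it collects two strong weights and one weak one. A real matcher would also need a decision threshold, which the slides do not give.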
Computing it
Standard approach
• Generate keys and data
• Load information into a database
• Index it
• Extract fields needed
Map/Reduce approach
• Split the database up
• Run parallel jobs
• Bring information together via map/reduce
• Assemble information in stages
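The "split, run in parallel, bring together" steps can be sketched in a few lines. This is a minimal illustration, not the OCLC code: threads stand in for separate cluster nodes, and the partitioning function, the sample pairs, and the counting job are all assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition(pairs, n_parts):
    """Assign each (key, data) pair to a part by a stable hash of its key."""
    parts = [[] for _ in range(n_parts)]
    for key, data in pairs:
        parts[zlib.crc32(key.encode("utf-8")) % n_parts].append((key, data))
    return parts


def count_keys(part):
    """Example job run on one part: count how often each key occurs."""
    counts = {}
    for key, _ in part:
        counts[key] = counts.get(key, 0) + 1
    return counts


pairs = [("adams", 1), ("morel", 2), ("adams", 3), ("hickey", 4)]
parts = partition(pairs, 3)

# Threads are a stand-in for jobs running on separate nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(count_keys, parts))

# Bringing results together is a plain merge: each key lives in exactly one part.
merged = {}
for counts in partial:
    merged.update(counts)
```

Because every occurrence of a key hashes to the same part, the per-part results never overlap and can be combined without further coordination.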
Map/Reduce
Two stages
• Map
  • Read in source file (e.g. MARC-21)
  • Write out key + data
• Reduce
  • Read in array of data for each unique key
  • Write out key + data
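The two stages above can be sketched as a single in-memory pass. This is a minimal sketch under stated assumptions: `run_mapreduce`, `name_mapper`, and `id_list_reducer` are hypothetical names, and simple (id, heading) tuples stand in for parsed MARC-21 records.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map/reduce stage in memory and return a list of (key, data) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, data in mapper(record):      # map: write out key + data
            groups[key].append(data)
    output = []
    for key in sorted(groups):                # reduce: one call per unique key
        output.extend(reducer(key, groups[key]))
    return output


def name_mapper(record):
    """Emit the normalized heading as key and the record id as data."""
    record_id, heading = record
    yield heading.lower(), record_id


def id_list_reducer(key, values):
    """Receive the array of data for one key and write out key + data."""
    yield key, sorted(values)


records = [("r1", "Adams, Mike"), ("r2", "Morel, Pierre"), ("r3", "adams, mike")]
result = run_mapreduce(records, name_mapper, id_list_reducer)
```

The output of one stage has the same key + data shape as its input, which is what lets stages be chained.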
Overview of MapReduce
Source: Dean & Ghemawat (Google)
Our Implementation
• Written in Python
• Uses ssh and XML-RPC for control and communication
• Map/Reduce seems to add ~10% overhead
• Ran an earlier implementation on a 48-CPU cluster
• Current VIAF cluster is a 12-CPU cluster on 4 nodes
• Running Linux and 64-bit Python
VIAF Matching Code
17 modules
1,100 lines of code
Plus
• 600 lines of configuration
• 2,755 lines of tables embedded in code
VIAF Data Flow
[Diagram. Inputs: LC Authority, LC Catalog, PND Authority, PND Catalog, each passed through Extract Data. The order and connections of the steps are not recoverable from the transcript. Steps shown, with their key:data output where the diagram gives one:]
• get changed ids
• build name:id map (name:id)
• map authorities (authority id: bib id)
• build compare data (id:tag, data)
• build buckets (surname:forename,date)
• eliminate forename, date conflicts from buckets
• identify compare data (pair id:[bib/auth]id)
• select compare data (pair id: compare data)
• compare
[Other data labels in the diagram: changed authority ids, potential pairs, pair id: scores]
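The bucket steps in the data flow (build buckets keyed by surname, then eliminate forename and date conflicts) can be sketched as follows. The conflict rules, field names, and birth years here are invented for illustration; the slides do not specify how VIAF decides that two forenames or dates conflict.

```python
from collections import defaultdict
from itertools import combinations


def build_buckets(records):
    """Group records by normalized surname; each entry keeps forename and date."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["surname"].lower()].append(rec)
    return buckets


def forenames_conflict(a, b):
    """Assumed rule: no conflict when one forename is a prefix of the other."""
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    return not (a.startswith(b) or b.startswith(a))


def dates_conflict(a, b):
    """Assumed rule: dates conflict only when both are known and differ."""
    return a is not None and b is not None and a != b


def potential_pairs(buckets):
    """Yield cross-file id pairs that survive the forename and date checks."""
    for surname in sorted(buckets):
        for x, y in combinations(buckets[surname], 2):
            if x["source"] == y["source"]:
                continue  # only pair records from different files
            if forenames_conflict(x["forename"], y["forename"]):
                continue
            if dates_conflict(x["birth"], y["birth"]):
                continue
            yield x["id"], y["id"]


# Invented sample data; the birth years are placeholders, not real dates.
records = [
    {"id": "lc1", "source": "LCNAF", "surname": "Adams", "forename": "Mike", "birth": 1950},
    {"id": "pnd1", "source": "PND", "surname": "Adams", "forename": "Mike", "birth": 1962},
    {"id": "pnd2", "source": "PND", "surname": "Adams", "forename": "M.", "birth": None},
]
pairs = list(potential_pairs(build_buckets(records)))
```

Only the pairs that survive this cheap filter need the more expensive attribute comparison, which keeps the number of full comparisons well below all-against-all.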
WorldCat Identities
Bring together all of WorldCat’s information about people
• Name(s)
• Works by and about
• Subjects
• Dates
• Fiction/non-fiction
• Roles
• Co-authors
Add links
• Wikipedia
• Authority files
Sample Identity
Statistics
Nearly 19 million different ‘identities’ in WorldCat
80 million (nominally) controlled headings
The WorldCat Identity code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)
Identities Data Flow
[Diagram labels: Stage 1, Stage 2, Stage 3, Stage 4; NameInfo; Citations; Cover Art; WorldCat; FRBR; Audience; Authorities; Wikipedia; Identities]
Identities Stage 1: Extract Data From WorldCat
Input: WorldCat (MARC-21)
Map output:
• NameKey <nameInfo>
• WorkID <citation>
Reduce output:
• WorkID <best citation>
• NameKey <cumulative nameInfo>
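A sketch of what Stage 1 might look like in code follows. The record layout, the `("name", ...)` / `("work", ...)` key tagging, and the rule that the "best" citation is the one with the most holdings are all assumptions; the slides do not say how the best citation is chosen.

```python
from collections import defaultdict


def stage1_map(record):
    """Emit one NameKey pair per name and one WorkID pair per record."""
    for name in record["names"]:
        yield ("name", name.lower()), {"forms": {name}, "count": 1}
    yield ("work", record["work_id"]), {
        "title": record["title"],
        "holdings": record["holdings"],
    }


def stage1_reduce(key, values):
    """Pick the best citation per WorkID; accumulate nameInfo per NameKey."""
    kind, _ = key
    if kind == "work":
        # Assumption: the best citation is the one with the most holdings.
        return max(values, key=lambda c: c["holdings"])
    merged = {"forms": set(), "count": 0}
    for info in values:
        merged["forms"] |= info["forms"]
        merged["count"] += info["count"]
    return merged


def run_stage(records, mapper, reducer):
    """Group mapper output by key and reduce each group (in memory)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Invented records standing in for parsed WorldCat MARC-21 data.
records = [
    {"work_id": "w1", "title": "Example title", "holdings": 12, "names": ["Adams, Mike"]},
    {"work_id": "w1", "title": "Example title, 2nd ed.", "holdings": 40, "names": ["ADAMS, MIKE"]},
]
out = run_stage(records, stage1_map, stage1_reduce)
```

Both records collapse onto one WorkID and one NameKey: the work keeps a single citation, while the name accumulates every form and a running count.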
Identities Stage 2: Extract Data From Authorities
Input: NACO Authorities file (MARC-21)
Map output
• NameKey <authorityInfo>
• XTos
• XFroms
Reduce output
• NameKey <authorityInfo, symmetric xrefs>
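The sketch below shows one plausible reading of how cross-references become symmetric in Stage 2: the mapper emits a "to" reference under the source key and a "from" reference under the target key, and the reducer folds both into one set. The interpretation of XTos and XFroms, the field names, and the sample headings are assumptions.

```python
from collections import defaultdict


def stage2_map(authority):
    """Emit authority info plus a reference in both directions per see-also."""
    key = authority["name_key"]
    yield key, ("info", authority["heading"])
    for target in authority.get("see_also", ()):
        yield key, ("xto", target)      # this record points to target
        yield target, ("xfrom", key)    # target is pointed to from this record


def stage2_reduce(key, values):
    """Combine the heading with references seen in either direction."""
    record = {"heading": None, "xrefs": set()}
    for tag, data in values:
        if tag == "info":
            record["heading"] = data
        else:
            record["xrefs"].add(data)   # xto and xfrom fold into one symmetric set
    return record


def run_stage(records, mapper, reducer):
    """Group mapper output by key and reduce each group (in memory)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Invented authority records; only the first carries a cross-reference.
authorities = [
    {"name_key": "name a", "heading": "Name, A.", "see_also": ["name b"]},
    {"name_key": "name b", "heading": "Name, B."},
]
result = run_stage(authorities, stage2_map, stage2_reduce)
```

Although only one record declares the reference, both end up linked to each other after the reduce step.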
Identities Stage 3: Connect Citations with Names
Input
• Stage 1 output
  • WorkID <by/about citations>
  • NameKey <nameInfo>
Map output
• NameKey <nameInfo>
• NameKey <topCitations>
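Stage 3 re-keys citations from WorkID to NameKey so that each name can carry its top citations. The following is a hedged sketch: the citation layout, the ranking by holdings, and the `limit` parameter are assumptions, since the slides do not define what makes a citation "top".

```python
import heapq
from collections import defaultdict


def stage3_map(work_id, citation):
    """Re-key a citation under every name it is by or about."""
    for relation in ("by", "about"):
        for name_key in citation.get(relation, ()):
            yield name_key, (citation["holdings"], work_id, relation)


def top_citations(citations, limit=2):
    """Assumption: 'top' means most widely held; limit is illustrative."""
    return heapq.nlargest(limit, citations)


# Invented Stage 1 output: WorkID mapped to a by/about citation.
works = {
    "w1": {"holdings": 40, "by": ["name a"]},
    "w2": {"holdings": 5, "by": ["name a"]},
    "w3": {"holdings": 12, "about": ["name a"], "by": ["name b"]},
}

by_name = defaultdict(list)
for work_id, citation in works.items():
    for name_key, entry in stage3_map(work_id, citation):
        by_name[name_key].append(entry)

top = {name: top_citations(entries) for name, entries in by_name.items()}
```

Work w3 appears under two names, once as "about" and once as "by", which is the re-keying that connects citations with names.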
Identities Stage 4: Create Identities
Input
• Authority info from stage 2
• Merged name info from stage 3
• Merged citations from stage 3
Map output
• Pass through
Reduce output
• Pnkey <Identity Record>
Schedules
Identities
• Up this year?
VIAF
• Reload, rematch this year
• Public service up early 2007
Conclusions
Our merged files (e.g. WorldCat) are really quite large
More processing power opens up new ways of manipulating and looking at our data
Parallel processing is the only way to obtain the cycles needed
Map-Reduce is an attractive way to do parallel processing
• Forces decomposition
• Scales well
• Opens up new possibilities
Thank you
T. Hickey
VIAF.org
http://errol.oclc.org/laf/n82-54463.html