Download - Cni2012
2012-11-04 - SLIDE
12/11/12
Building an Archival Identity Management Network: Transforming
Archival Practice and Historical Research
Daniel Pitti* and Brian Tingle*** Institute for Advance Technology in the Humanities
** California Digital Library
Thanks to Ray R. Larson of the University of California, Berkeley, School of Information for many of the slides here
2012-11-04 - SLIDE
12/11/12
Funding and People• Funding and Timeline
– National Endowment for the Humanities– May 2010-April 2012– Andrew W. Mellon Foundation– May 2012-April 2014
• People– Daniel Pitti (PI) and Worthy Martin (Institute for Advanced
Technology in the Humanities, University of Virginia)– Adrian Turner and Brian Tingle (California Digital Library,
University of California)– Ray Larson (School of Information, University of California,
Berkeley)
2012-11-04 - SLIDE
12/11/12
The Source Data• EAD-encoded finding aids (guides to archival
records)– 150K– Primarily from U.S. sources, but also U.K. and
France• Archival authority records (360K)
– National Archives and Records Administration– State Archive of New York– Smithsonian Institution– British Library– National Archives (France) & BnF
• WorldCat Archival Descriptions: 2M
2012-11-04 - SLIDE
12/11/12
Library and Museum Authority Records• Getty Vocabulary Program: Union List of
Artist Names (293K personal and corporate names)
• Virtual International Authority File (16M+ cluster records)– Contributed from around the world by national
libraries and others
2012-11-04 - SLIDE
12/11/12
2012-11-04 - SLIDE
12/11/12
Methods and Processing• Extract EAC-CPF records from existing EAD-
encoded archival descriptions– Extracting both creators and referenced CPF
names• Match EAC-CPF records against one another
and against existing authority records (ULAN, VIAF, LCNAF)– Enhance EAC-CPF by normalizing entries,
adding alternative entries, titles (VIAF), and historical data (ULAN)
• Create a prototype historical resource and access system– Historical data and social-professional networks– Links to archive, library, and museum resources
(by and about)
2012-11-04 - SLIDE
12/11/12
Example EAD Record (Hub)<EAD> <EADHEADER LANGENCODING = "ISO 639"> <EADID>GB 0133 TAB </EADID> <FILEDESC> <TITLESTMT> <TITLEPROPER>Tabley Muniments </TITLEPROPER> </TITLESTMT> <PUBLICATIONSTMT> <PUBLISHER>John Rylands University Library of Manchester </PUBLISHER> <ADDRESS> <ADDRESSLINE>150 Deansgate </ADDRESSLINE> <ADDRESSLINE>Manchester </ADDRESSLINE> <ADDRESSLINE>... (Parts removed )… </FRONTMATTER>
<ARCHDESC LEVEL = "FONDS" LANGMATERIAL = "English"> <DID> <REPOSITORY>University of Manchester, John Rylands University Library of Manchester </REPOSITORY> <UNITID ENCODINGANALOG = "ISADG3.1.1." COUNTRYCODE = "GB" REPOSITORYCODE = "0133">GB 0133 TAB </UNITID> <UNITTITLE LABEL = "Title" ENCODINGANALOG = "ISADG3.1.2.">Tabley Muniments </UNITTITLE> <UNITDATE LABEL = "Dates of Creation" ENCODINGANALOG = "ISADG3.1.3.">19th century </UNITDATE> <PHYSDESC LABEL = "Extent" ENCODINGANALOG = "ISADG3.1.5."> <EXTENT>1.24 cu.m </EXTENT> </PHYSDESC> <ORIGINATION LABEL = "Creator" ENCODINGANALOG = "ISADG3.2.1."> <FAMNAME SOURCE = "NCARULES">Warren, family, of Tabley, Cheshire </FAMNAME> <PERSNAME SOURCE = "NCARULES">Warren, John Byrne Leicester, 1835-1895, 3rd Baron de Tabley, poet </PERSNAME> </ORIGINATION> </DID>
2012-11-04 - SLIDE
12/11/12
Example EAD Record (Hub)<BIOGHIST ENCODINGANALOG = "ISADG3.2.2."> <HEAD>Administrative/Biographical History </HEAD> <P>The poet John Byrne Leicester Warren, later 3rd and last Baron de Tabley, of Tabley near Knutsford, Cheshire, was born in 1835, the son of the 2nd Baron de Tabley (1811-1887), and his wife, Catherina. His mother was Italian, the daughter of the count de Soglio, and Warren spent much of his early childhood with her in Italy and Greece. He was educated at Eton and Christ Church, Oxford. At Oxford he published a volume of poetry. Originally he published under the pseudonyms George F. Preston (1859-1862) and William Lancaster (1863-1868), but latterly under his own name. </P> <P>His early verse included <TITLE>Praeterita </TITLE> (1863), <TITLE>Eclogues and Monodramas </TITLE> (1864), <TITLE>Studies in Verse </TITLE> (1865), <TITLE>Philocletes </TITLE> (1866), and <TITLE>Orestes </TITLE> (1868). His early work was Tennysonian in style, but he was later to be influenced by both Browning and Swinburne. In 1873 he produced …. (some data removed)…
2012-11-04 - SLIDE
12/11/12
Example EAD Record (Hub)<SCOPECONTENT ENCODINGANALOG = "ISADG3.3.1."> <HEAD>Scope and Content </HEAD> <P>The collection consists mainly of the personal papers of the 3rd Baron de Tabley. The papers reflect his interests in literature, politics, botany and numismatics and include correspondence with numerous prominent later Victorian figures. Attention should also be drawn to de Tabley’s extensive and important collection of armorial bookplates. </P> <P>Correspondents include Sir Mountstuart Grant Duff, Edmund Gosse, Lord Houghton, A.C.Benson, and Robert Bridges. There are volumes of Tabley's essays and verse, as well as a considerable number of notebooks and loose manuscripts of verse and other writings. There are various bundles and boxes relating to "Coins", "Botany", "Poetry", "Literary", "Financial" and bookplates. </P> </SCOPECONTENT> <ADD> <OTHERFINDAID ENCODINGANALOG = "ISADG3.4.6."> <P>Preliminary survey list. </P> </OTHERFINDAID> <RELATEDMATERIAL ENCODINGANALOG = "ISADG3.5.3."> <P>There is correspondence with the 3rd Baron de Tabley among the Edward Freeman Papers, held at JRULM. The Library also has custody of the important Tabley Book Collection. </P> </RELATEDMATERIAL> <SEPARATEDMATERIAL> <P>The family and estate papers of the Leicester-Warren Family of Tabley are held by Cheshire Record Office. Some of these papers were originally in the custody of the John Rylands University Library of Manchester. </P> </SEPARATEDMATERIAL> </ADD>
2012-11-04 - SLIDE
12/11/12
Example EAD Record (Hub)<CONTROLACCESS> <HEAD>Index terms </HEAD> <GEOGNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "a">Tabley Inferior</EMPH> <EMPH ALTRENDER = "a-">Cheshire SJ7378</EMPH> </GEOGNAME> <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Benson</EMPH> <EMPH ALTRENDER = "forename">Arthur Christopher</EMPH> <EMPH ALTRENDER = "dates">1862-1923</EMPH> </PERSNAME> <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Bridges</EMPH> <EMPH ALTRENDER = "forename">Robert Seymour</EMPH> <EMPH ALTRENDER = "dates">1844-1930</EMPH> </PERSNAME> <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Duff</EMPH> <EMPH ALTRENDER = "title">Sir</EMPH><EMPH ALTRENDER = "forename">Mountstuart Elphinstone Grant</EMPH> <EMPH ALTRENDER = "dates">1829-1906</EMPH> <EMPH ALTRENDER = "epithet">Knight</EMPH> </PERSNAME> <PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Gosse</EMPH><EMPH ALTRENDER = "title">Sir</EMPH><EMPH ALTRENDER = "forename">Edmund William</EMPH> <EMPH ALTRENDER = "dates">1849-1928</EMPH> <EMPH ALTRENDER = "epithet">Knight</EMPH> </PERSNAME>
<PERSNAME SOURCE = "NCARULES"><EMPH ALTRENDER = "surname">Milnes</EMPH> <EMPH ALTRENDER = "forename">Richard Monckton</EMPH> <EMPH ALTRENDER = "dates">1809-1885</EMPH> <EMPH ALTRENDER = "epithet">1st Baron Houghton</EMPH> </PERSNAME> <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a">Bookplates</EMPH> </SUBJECT> <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a">Botany</EMPH> </SUBJECT> <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a">Numismatics</EMPH> </SUBJECT> <SUBJECT SOURCE = "LCSH"><EMPH ALTRENDER = "a-">Poetry</EMPH> <EMPH ALTRENDER = "a">Modern</EMPH> <EMPH ALTRENDER = "y">19th century</EMPH> </SUBJECT> </CONTROLACCESS> </ARCHDESC></EAD>
2012-11-04 - SLIDE
12/11/12
2010-2012 Extraction Results• Source data: 30,000 finding aids • EAC-CPF records extracted
– LoC: 43,702 from 1,159 finding aids– OAC: 91,811 from ~15,400 – NWDA: 22,609 from 5,160– VH: 15,175 from 8,390– Total 173,297
2012-11-04 - SLIDE
12/11/12
Phase II preliminary results• unmerged SIA Henry Correspondence• 32,988 Names
• unmerged WorldCat MARC• 4,548,270 Names
2012-11-04 - SLIDE
12/11/12
Methods and Processing• Extract EAC-CPF records from existing EAD-
encoded archival descriptions– Extracting both creators and referenced CPF
names• Match EAC-CPF records against one another
and against existing authority records (ULAN, VIAF, LCNAF)– Enhance EAC-CPF by normalizing entries, adding
alternative entries, titles (VIAF), and historical data (ULAN)
• Create a prototype historical resource and access system– Historical data and social-professional networks– Links to archive, library, and museum resources
(by and about)
2012-11-04 - SLIDE
12/11/12
The Problem• Proliferation of the forms of names
– Different names for the same person– Different people with the same names
• Examples – from Books in Print (semi-controlled but not
consistent)– ERIC author index (not controlled)
2012-11-04 - SLIDE
12/11/12
Goethe
…etc…
2012-11-04 - SLIDE
12/11/12
John Muir
2012-11-04 - SLIDE
12/11/12
Library and Archive Authority Control• Library (or bibliographic) authority control is almost
exclusively about the control of names• Archival identity control involves biographical-
historical description of the CPF entity– Descriptions based on controlled vocabularies, for
example, occupations, place of birth and death– But also biographical-historical description
• Prose• Chronological list
• Archival authority control provides context for understanding records, the context of their creation, the provenance
2012-11-04 - SLIDE
12/11/12
Repository of merged EAC
RecordsEAC Repository
VIAF Repository
Connect exactly
matching records
Connect records using
name authority
information
Repository of connected EAC
Records(MongoDB)
Merge
Cheshire Search
Merging EAC-CPF RecordsLCNAF Repository ULAN Repository
2012-11-04 - SLIDE
12/11/12
Repository of merged EAC
RecordsEAC Repository
VIAF Repository
Connect exactly
matching records
Connect records using
name authority
information
Repository of connected EAC
Records(MongoDB)
Merge
Cheshire Search
Merging EAC-CPF Records
2012-11-04 - SLIDE
12/11/12
Connect Exact Matches• The EAC-CPF records provide the names
without having to parse texts, etc.• Allows us to use some simple methods like
exact matching– Assume identical name entries means the
same person/corporate body/family– Enter the full names and record IDs into a
database and flag IDs with same names for merging
2012-11-04 - SLIDE
12/11/12
But…• Exact merging assumes that archives are
following LC cataloging practice in their EAD records– There are some problems with this assumption
2012-11-04 - SLIDE
12/11/12
Some failures for merging…• Different abbreviations:
– A. & G. Carisch & C.– A. & G. Carisch & Co.
• And spacing issues:– A. C. Peters & Bro.– A. C. Peters & Brother.– A. C. Peters. (??)– A. C.Peters & Bro.
• Completeness and alternate rules– Tabb, John B. (John Banister), 1845-1909.– Tabb, John Banister, 1845-1909.
• Also differing transliterations for non-Latin scripts
2012-11-04 - SLIDE
12/11/12
More…• Variant romanizations (and spacing):
– M. P. Belaieff.– M. P. Belaïeff.– M. P. Bieliaev.– M.P. Belaïeff.– M.P.Belaïeff.
• Initials vs. names:– Zabolotskii, N.A.– Zabolotskii, Nikolai Alekseevich, 1903-1958.– Zabolotskii.
2012-11-04 - SLIDE
12/11/12
More…• Inverted order vs. uninverted
– Taylor, Zachary, 1784-1850.– Zachary Taylor.
• Various combinations:– Tchaikovsky, Peter I.– Tchaikovsky, Pëtr Il.– Tchaikovsky, Piotr Ilyich.– Tchaikovsky, Pyotr Il.– Tchaikovsky, Pyotr Ilyich.
2012-11-04 - SLIDE
12/11/12
Repository of merged EAC
RecordsEAC Repository
VIAF Repository
Connect exactly
matching records
Connect records using
name authority
information
Repository of connected EAC
Records(MongoDB)
Merge
Cheshire Search
Merging EAC-CPF Records
2012-11-04 - SLIDE
12/11/12
Search Authority Files• For each name, formulate a search of the
VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching)– Search both the “authoritative” and “non-
authoritative” forms– Consider any name matching a non-
authoritative form to be a candidate match for the authoritative form
– Flag EAC records that match the same authority record as potential matches
2012-11-04 - SLIDE
12/11/12
Shingle Language Model for names
Name: Einstein Albert
Shingle sequence: ein, ins, nst, ste, tei, ein … , ert
Probability that the sequence (ins, nst, ste) follows ein is very high for the name einstein
Krishna Janakiraman and Sean Marimpietri - Biograph
NGRAM or Shingle Matching
2012-11-04 - SLIDE
12/11/12
Name 1 : Einstein AlbertName 2 : Ainshtain AlbertName 3 : Albert Einstein
ein
ins
nstste
ein In n a
alb
ert
al
rteteiein
Ain
ins
nshsht
hta tai ain
alb
ert
al
rteteiein
ein
ins
nstste
ein In n a
alb
ert
al
rteteiein
lbelbe lbe
Shingle Language Model for names
Krishna Janakiraman and Sean Marimpietri - Biograph
2012-11-04 - SLIDE
12/11/12
Repository of merged EAC
RecordsEAC Repository
VIAF Repository
Connect exactly
matching records
Connect records using
name authority
information
Repository of connected EAC
Records(MongoDB)
Merge
Cheshire Search
Merging EAC-CPF Records
2012-11-04 - SLIDE
12/11/12
Merge Flagged Records• For all of the exact matches and authority
matches– Use the Authoritative form of the name– Combine data from each match into a single
EAC-CPF record– Retain all source record IDs and information
• Finally, output the merged EAC-CPF records
2012-11-04 - SLIDE
12/11/12
Inputs to SNAC merging• LoC: 43,702 EAC-CPF records derived from 1159
finding aids
• OAC: 91,814 EAC-CPF records derived from ~15,400 finding aids
• NWDA: 24952 EAC-CPF records derived from 5,568 finding aids
• VH: 15,175 EAC-CPF records
• Total: 175,688 Input EAC records for merging
• Result: 128,781 “unique” names
2012-11-04 - SLIDE
12/11/12
Another view of the numbers…• 95624 Person names merged from 125555
Person records• 31287 Institutions merged from 47189
Institution records• 1980 Families merged from 2899 Family
records
2012-11-04 - SLIDE
12/11/12
Merging Conclusions• There will not be a single merging method,
but a staged set of approaches that will allow us to go from the simplest exact matches, to (we hope) reliably identifying various variant forms of a name, etc. when corroborated by contextual (date, etc.) information
2012-11-04 - SLIDE
12/11/12
Next• Developing an updateable database of
merged EAC data (dumping Mongo for PostgreSQL)– Will permit incremental addition of new data
and support editing and “forced” merges• Process the 2M WorldCat archival
descriptions• Process the 150,000 finding aids• Convert several hundred thousand archival
authority records into EAC-CPF and match/merge process
2012-11-04 - SLIDE
12/11/12
Methods and Processing• Extract EAC-CPF records from existing EAD-
encoded archival descriptions– Extracting both creators and referenced CPF
names• Match EAC-CPF records against one another and
against existing authority records (ULAN, VIAF, LCNAF)– Enhance EAC-CPF by normalizing entries, adding
alternative entries, titles (VIAF), and historical data (ULAN)
• Create a prototype historical resource and access system– Historical data and social-professional networks– Links to archive, library, and museum resources
(by and about)
12/11/12
Outline
• User Persona
• Search and Display
• Network graph visualization
• Linked Data / RDF
• Future Plans
12/11/12
Meet the target users
• Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers.
• Connie: Works at an institution that contributed records to the project. Is going to be asking themselves how this site would be useful to their users. Wants to understand how their records were used and what the added value is.
• Quincy: Library School Student working to QA record matching.
• Adele: Person doing authority work during collection processing.
• Lenny: Lenny likes linked data, and wants to be able to mine the links that have been established programatically.
Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or
product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)
12/11/12
Outline
• User Persona
• Search and Display
• Network graph visualization
• Linked Data / RDF
• Future Plans
12/11/12
12/11/12
12/11/12
Advanced limits match EAC sections
12/11/12
Outline
• User Persona
• Search and Display
• Network graph visualization
• Context widget (needs new name)
• Linked Data / RDF
• Future Plans
12/11/12
Tinkerpop graph database stack
• Simple "property graph" model
• "JDBC for graph databases" [SNAC is using Neo4J for the graphDB]
• XPath like "gremlin" for graph query
• REST interfaces with "Rexster"
• For me, this was 10 to 100 times easier than using RDF
12/11/12
Outline
• User Persona
• Search and Display
• Network graph visualization
• Linked Data / RDF
• Future Plans
12/11/12
What is Linked Open Data?
• w3c Semantic Web Technology Stack
• Web of atomized Data, not a web of documents
• RDF; OWL ontologies; SPARQL queries; triple/quad/quint stores
• httpRange14; content negotiation; CURIE
• No restrictions on data use; free and easy license
• Lenny wants it, but does Randy?
12/11/12
What is Linked Open Data?
• Getting to the good stuff
• Blue underlined text
• Pulling in data from multiple sources, in an intelligent way, into a "document"
• Understand and discover relationships
• Open access for research, education, private study and other fair use
RDFa owl:sameAs
HTML 5 microdata in chron list
http://templates.xdams.net/IBC/ontology/eac-cpf.rdf
Silvia Mazziniregesta.exe srl
12/11/12
&mode=xml2owl [experimental]
12/11/12
My opinion on the use cases for w3c RDF tech
• Good for publishing data
• Good for controlled vocabularies
• Data models?
• Most people with open source RDF-store type systems do the real stuff with solr
• Consider a graph database
12/11/12
Outline
• User Persona
• Search and Display
• Linked Data / RDF
• Network graph visualization
• Future Plans
12/11/12
Future Plans
• Conduct assessment activities involving members of target audiences to establish mental model of users for design work
• Scale interface to millions of names
• Visualizations useful and integrated (network and geospatial)
• Stable URLs between batches for linked data
• Social and personalization features (gateway to crowdsourcing)
• Integration with local systems (such as with the context widget)
12/11/12
• Photo attribution http://www.flickr.com/photos/dsevilla/139656712/in/photostream/
• http://xtf.cdlib.org/
• http://code.google.com/p/eac-graph-load/source/browse/README.txt
• http://tinkerpop.com/
• http://thejit.org/
• https://github.com/tingletech/snac-related-widget