building a linked open data set
Implementing a Linked Open Data set
Joel Richard
Smithsonian Libraries
SLA Annual Conference, July 2012
Who are the Smithsonian Libraries?
• 20 Libraries in the U.S. and Panama
• Supports research of staff and the public
• Strong effort to digitize pre-1923 texts
• Taxonomic Literature II is one of these texts
Summary of Agenda
• Our data set and process
• Conversion to Linked Data
• Storing Linked Data
• Examples and More Info
• Summary
• … and Best brew pubs in Chicago
What is Linked Data?
HTTP URIs identify things to humans and computers
Identifiers are related to other identifiers (or values) via predicates in a “triple”:
Charles Darwin // Creator // On the Origin of Species
See also:
http://linkeddata.org/
http://en.wikipedia.org/wiki/Linked_Data
http://richard.cyganiak.de/2007/10/lod/
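The triple on this slide can be sketched as a small data structure. This is only an illustration: the `Triple` type and the `describe` helper are hypothetical names, and the plain-text labels stand in for the HTTP URIs a real data set would use in each position. The order of the three parts follows the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: two identifiers (or a value) joined by a predicate."""
    subject: str
    predicate: str
    object: str


# The example from the slide, kept in the slide's order.
example = Triple("Charles Darwin", "Creator", "On the Origin of Species")


def describe(t: Triple) -> str:
    """Render a triple in the slide's 'A // B // C' notation."""
    return f"{t.subject} // {t.predicate} // {t.object}"


print(describe(example))
```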
Taxonomic Literature II
Essential Reference Tool for Botanists
Authors and their Publications from 1753 to 1940
It is a “database in book form.”
Our process
Scanned the pages
Hired contractor for OCR and correction (99.97% accuracy)
Received XML dataset from contractor
Verified and Imported to SQL Server
Built a website to search the data
Great! Let’s make some linked data!
First...what does 99.97% accuracy mean?
~12,000 Errors
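The arithmetic behind that figure can be checked with a short sketch. It assumes the accuracy is measured per character, which the slides do not state, and the total character count it prints is inferred from the two numbers given, not a figure from the project.

```python
# 99.97% accuracy means 0.03% of characters are wrong: 3 in every 10,000.
ERRORS_PER_10K = 3          # derived from the stated 99.97% accuracy
reported_errors = 12_000    # approximate error count from the slide

# Inferred total size of the text (an assumption, not a stated figure).
implied_characters = reported_errors * 10_000 // ERRORS_PER_10K


def expected_errors(characters: int) -> int:
    """Expected number of wrong characters at 99.97% accuracy."""
    return characters * ERRORS_PER_10K // 10_000


print(implied_characters)
print(expected_errors(implied_characters))
```

Under these assumptions the text would run to roughly 40 million characters, so even a very high accuracy rate leaves thousands of errors to find.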
Great! Let’s make some linked data!
Select Identifiers for your data
http://library.si.edu/tl-2/author/darwin
http://library.si.edu/tl-2/title/origin_of_species
http://library.si.edu/tl-2/title/1313
Choose vocabularies for predicates (harder than it sounds)
OWL, FOAF, DublinCore, OpenGraph, SIOC, SKOS, BIBO, etc.
Mondeca Labs
Linked Open Vocabularies (LOV)
Vocabulary of a Friend (VOAF)
A vocabulary for describing other vocabularies
http://labs.mondeca.com/dataset/lov
http://library.si.edu/tl2/author/darwin
  tl2:creator http://library.si.edu/tl2/title/1313
  owl:sameAs http://viaf.org/viaf/27063124
http://library.si.edu/tl2/title/origin…
  dc:creator http://library.si.edu/tl2/author/darwin
  owl:sameAs http://www.archive.org/details/originofspecies00darwuoft
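The relationships on this slide can be written out as N-Triples, one statement per line. This is a minimal sketch, not the project's export code. Several things are assumptions: the `tl2` namespace URI is hypothetical (the slides only show the prefix), `dc` is mapped to the Dublin Core terms namespace, and the title node is taken to be `title/1313` because the slide's own title URI is truncated.

```python
OWL = "http://www.w3.org/2002/07/owl#"
DC = "http://purl.org/dc/terms/"            # assumed binding for "dc"
TL2 = "http://library.si.edu/tl2/vocab#"    # hypothetical namespace for "tl2"

AUTHOR = "http://library.si.edu/tl2/author/darwin"
TITLE = "http://library.si.edu/tl2/title/1313"  # assumed to be the title node

# (subject, predicate, object) tuples; every object here is a URI.
triples = [
    (AUTHOR, TL2 + "creator", TITLE),
    (AUTHOR, OWL + "sameAs", "http://viaf.org/viaf/27063124"),
    (TITLE, DC + "creator", AUTHOR),
    (TITLE, OWL + "sameAs",
     "http://www.archive.org/details/originofspecies00darwuoft"),
]


def to_ntriples(rows):
    """Serialize URI-only triples as N-Triples text."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in rows)


print(to_ntriples(triples))
```

The `owl:sameAs` statements are what connect the local identifiers to VIAF and the Internet Archive.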
http://library.si.edu/tl2/author/darwin
RDF Type = foaf:Person
foaf:lastName, foaf:familyName
foaf:firstName, foaf:givenName
foaf:name, skos:prefLabel
tl2:birthYear
tl2:deathYear
skos:definition
tl2:personAbbreviation
http://library.si.edu/tl2/title/origin…
RDF Type = bibo:Book
tl2:titleNumber
dc:title
event:place
dc:publisher
dc:created
tl2:titleAbbreviation
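Each prefixed name on this slide is shorthand for a full URI. The sketch below expands a few of them. The `expand` helper is hypothetical, only a subset of the prefixes is listed, and the `tl2` namespace URI is invented for illustration since the slides do not give one. `dc` is mapped to the Dublin Core terms namespace here because `created` is defined there.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dc": "http://purl.org/dc/terms/",           # assumed binding for "dc"
    "bibo": "http://purl.org/ontology/bibo/",
    "tl2": "http://library.si.edu/tl2/vocab#",   # hypothetical namespace
}


def expand(name: str) -> str:
    """Turn a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, _, local = name.partition(":")
    if prefix not in PREFIXES or not local:
        raise KeyError(f"unknown or malformed prefixed name: {name!r}")
    return PREFIXES[prefix] + local


for name in ("foaf:Person", "bibo:Book", "dc:title", "tl2:birthYear"):
    print(name, "->", expand(name))
```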
Great! Let’s make some linked data!
How are we going to store all this?
We’re using Drupal. RDFa is built in; RDF Extensions is an add-on module.
Probably not a good idea for very large datasets.
TL-2: 10,000 authors + 37,000 titles becomes about 400,000 triples.
Storage considerations
Performance of Drupal Import:
Feeds Import: 7 Hours for 35k Records
Other options? Still searching…
Our linked data set will grow to at least 600-700k Drupal nodes.
Is Drupal the best way to do this?
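A rough projection shows why the question matters. The sketch below extrapolates the measured import rate to the expected node count. It assumes import time scales linearly and that one Drupal node costs about as much as one imported record, neither of which the slides confirm.

```python
records_imported = 35_000   # from the Feeds Import measurement
hours_taken = 7

records_per_hour = records_imported / hours_taken


def estimated_hours(node_count: int) -> float:
    """Projected import time, assuming the measured rate holds."""
    return node_count / records_per_hour


for nodes in (600_000, 700_000):
    hours = estimated_hours(nodes)
    print(f"{nodes:,} nodes: {hours:.0f} hours ({hours / 24:.1f} days)")
```

At the measured rate, 600-700k nodes would take on the order of five to six days of continuous importing.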
Storage considerations
2000 US Census
19 million households received “long form”
Joshua Tauberer: converted to 1 billion triples
http://www.rdfabout.com/demo/census/
Carefully consider your storage options!
Storage
ARC2 used by Drupal 7
RDBMS via D2RQ
RDBMS via Triplify
OpenLink Virtuoso
See Also:
http://www.w3.org/2001/sw/rdb2rdf/use-cases/
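To make the relational option concrete, here is a minimal sketch of triples kept in a single three-column table. It does not use ARC2, D2RQ, Triplify, or Virtuoso. It uses Python's built-in `sqlite3` module as a stand-in, and the table layout, the `objects_of` helper, and the use of prefixed names as stored values are all assumptions made for illustration.

```python
import sqlite3

AUTHOR = "http://library.si.edu/tl2/author/darwin"
TITLE = "http://library.si.edu/tl2/title/1313"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples ("
    "subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL)"
)
# An index on subject keeps 'everything about X' lookups fast as rows grow.
conn.execute("CREATE INDEX idx_triples_subject ON triples (subject)")

conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (AUTHOR, "rdf:type", "foaf:Person"),
        (AUTHOR, "owl:sameAs", "http://viaf.org/viaf/27063124"),
        (TITLE, "dc:creator", AUTHOR),
    ],
)
conn.commit()


def objects_of(subject: str, predicate: str) -> list:
    """Return every object stored for the given subject and predicate."""
    cur = conn.execute(
        "SELECT object FROM triples "
        "WHERE subject = ? AND predicate = ? ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in cur]


print(objects_of(AUTHOR, "owl:sameAs"))
```

A dedicated triple store adds SPARQL querying and inference on top of this basic idea, which is one reason to weigh the options before committing.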
Linked Data. What’s the point?
Disambiguation
Connecting Relevant Information
More visible via search
Enrichment of your data
Easier reuse of data
http://en.openei.org/apps/mashathon2010/
http://data.nytimes.com/schools/schools.html
http://data.nytimes.com/N38444093941437235523
http://www.worldcat.org/oclc/7619054
Other Examples and Info
Library of Congress: Linked Data Services
http://id.loc.gov/
Schema.org
http://www.schema.org
Data.gov / Semantic
http://www.data.gov/semantic
Linked Data.org
http://linkeddata.org/
Stephen Dale: Linked Data in Action
http://www.slideshare.net/stephendale/linked-data-in-action-4487244
Thank you!
?
http://slideshare.net/joelrichard