Finding and consuming (Linked) Open Data
Christophe Guéret (@cgueret)
March 8, 2012
http://latc-project.eu
http://www.vu.nl
http://ehumanities.nl
The next two hours
Open Data
  What is it? Why open data?
  How to find Open Data
  How to consume it
  Hands-on session
Linked Data & Linked Open Data
  What is it? Relation with Open Data?
  How to get Linked Data
  Ways to consume it
  Hands-on session
http://www.flickr.com/photos/-jvl-/4983920242
Open Data
A piece of content or data is open if anyone is free to use, reuse, and redistribute it subject only, at most, to the requirement to attribute and share-alike.
http://opendefinition.org/
Why open data?
Data has more value than applications
Data gets used more when it is easier to use
Credit: Dorothea Salo, http://www.slideshare.net/cavlec/rdf-rda-and-other-tlas
Open Data for Public institutions
Improve transparency
  Active citizenship and data journalism
Create new opportunities
  Develop need-focused applications almost for free
  See all the AppsforX challenges (Amsterdam, Nederland, ...)
  http://opendatachallenge.org/
  Let businesses sell services around the data
Improve efficiency
  Help share data within institutions
Open Data for Researchers
Consider data as an asset
  Like papers, data sets can be referenced
  Like papers, open access increases usage
Better science
  Reproducibility of experiments
  Cross usage of data sets in different studies
  Improve transparency (and decrease fraud?)
Data workflow
Search for the relevant data sets
Do data integration and clean up
Visualise and/or analyse the data
Re-publish integrated and curated data
Three ways to search for data
Generic search engine with specific target
  Use keywords or keywords + file type
Browse data archives
  Focused around particular topic(s)
  Explored by facets and keywords
Use data portals
  Yellow pages for data archives, faceted search
  Hubs for both data and applications
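The first option (keywords + file type) can be sketched as a small helper that builds the query string. This is only an illustration: the `filetype:` operator is understood by several general-purpose search engines but support varies, and the function name and example keywords are assumptions, not part of the talk.

```python
def build_data_query(keywords, file_type=None):
    """Build a search-engine query string, optionally restricted to a file type."""
    # Drop empty keywords and surrounding whitespace.
    terms = [k.strip() for k in keywords if k.strip()]
    if file_type:
        # "filetype:" is a common search operator; support varies by engine.
        terms.append("filetype:" + file_type.lstrip("."))
    return " ".join(terms)


query = build_data_query(["literacy rate", "Tanzania"], file_type="csv")
print(query)
```

The resulting string can be pasted into the search box of an engine that supports the operator.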
Using a search engine
Data archive: Dryad
Data archive: Easy
Data portal: Overheid.nl
Data portal: Publicdata.eu
Data portal: Kasabi
Data catalogs
Data integration
Unify the different data in a single format
  XLS + PDF + CSV => CSV
Integrate the data
  Connect the bits and pieces
Curate the data
  Fix errors in the data
Process the data in preparation for its usage
  Stemming, removal of stop words, ...
  Normalisation of values
Use Linked Data to save time there!
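A minimal sketch of the "unify" and "process" steps, assuming the sources have already been converted to CSV text with the same header (extracting tables from XLS or PDF is not shown). The stop-word list, the helper names and the sample rows are illustrative assumptions.

```python
import csv
import io

STOP_WORDS = {"the", "of", "in"}  # illustrative, far from complete


def normalise(value):
    """Lower-case, trim, and drop stop words from a text value."""
    words = value.strip().lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def unify(sources):
    """Merge several CSV texts sharing a header into one list of clean rows."""
    rows = []
    for text in sources:
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({key: normalise(val) for key, val in row.items()})
    return rows


# Two hypothetical sources with the same columns.
source_a = "name,city\nChristophe, Amsterdam \n"
source_b = "name,city\nPeter,The City of Barcelona\n"
rows = unify([source_a, source_b])
print(rows)
```

Every source needs its own conversion and its own cleaning rules, which is where most of the time goes.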
Visualise data: DataMarket
Visualise data: Google explorer
Visualise data: Microsoft explorer
Visualise data: WolframAlpha
Publish processed data
How?
  Send to data archive
  Publish on web sites
Why?
  Re-usability
  Community process (if I do it, others will do it too)
  Scientific process
Hands-on session
In 2001, what were the council election results in the county of Warwickshire (UK)?
What is the evolution of the literacy rate in Tanzania since 1988?
Can you make this plot of unemployment rates using the Google Public Data Explorer?
Linked Data
http://www.flickr.com/photos/erikcharlton/3337465138
Linked Data & Linked Open Data
What is the problem?
Frank and Christophe publish some open data
Roi wants to combine and enrich it
Marvel icons: mermer, DeviantArt
Frank's table:
Kennissen    Stad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs

Christophe's table:
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
What is the problem?
Data integration issue
  Kennissen, Stad, Ville, Pays?
  Paris = Parijs?
  Amsterdam = Amsterdam?
A lot of work for the data consumer
Kennissen    Stad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs
+
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
= ?
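To make the issue concrete, here is a small sketch of what a consumer without extra knowledge might try: joining the two tables above on the exact city string. The function name is an assumption; the rows are the ones from the slides.

```python
# Frank's table (Kennissen, Stad) and Christophe's table (Ville, Pays).
frank = [("Christophe", "Amsterdam"), ("Peter", "Barcelona"), ("David", "Parijs")]
christophe = [("Barcelone", "Espagne"), ("Paris", "France"), ("Amsterdam", "Pays-Bas")]


def naive_join(people, cities):
    """Join on the exact city string; unmatched cities get None as country."""
    country_of = dict(cities)
    return [(name, city, country_of.get(city)) for name, city in people]


joined = naive_join(frank, christophe)
for row in joined:
    print(row)
```

Only Amsterdam matches. Barcelona/Barcelone and Parijs/Paris are spelled differently in the two languages, so those rows end up without a country, and even the Amsterdam match is only a guess that both tables mean the same city.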
Why is this so problematic?
Uneven balance of information
Christophe and Frank have more of it than Roi
Solution: share more information
Amsterdam = Amsterdam?
  Replace "Amsterdam" by "Amsterdam, Netherlands"
Kennissen, Stad, Ville, Pays?
  Provide a description for the meaning of the columns as a separate document
Paris = Parijs?
  Use English names instead of local ones
But is that enough?
There could still be several "Amsterdam, Netherlands"
  Keep adding detail until 100% certain of uniqueness
Documentation of columns is one more thing to consume to use the data
It's hard to enforce the usage of a single language to name things
Linked Data idea
Data integration at the data level
  Define things in the data set
  Use unambiguous identifiers for the things
  Associate descriptions to the identifiers
  Connect things together
Thing 1: Name is Christophe, ...
Thing 2: Name (fr) is Paris, Name (nl) is Parijs, ...
Thing 1 --Works in--> Thing 2
Linked Data and the Web
Proposal: use the Web as a platform
  Identifiers = URIs
  Descriptions = de-referenced documents
ex:Christophe --ex:worksIn--> dbpedia:Amsterdam
Use of compact URIs
  dbpedia = http://dbpedia.org/resource/
  ex = http://example.org/
The statement as a whole is a triple; each node in it is a resource
What is at dbpedia:Amsterdam?
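As a rough illustration of compact URIs, the sketch below expands the two prefixes from the slide into full URIs. The `expand` helper is a hypothetical name, not part of any RDF library.

```python
# Prefix table taken from the slide.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "ex": "http://example.org/",
}


def expand(curie):
    """Turn a compact URI such as 'dbpedia:Amsterdam' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError("unknown prefix: " + prefix)
    return PREFIXES[prefix] + local


# The triple from the slide: subject, predicate, object.
triple = ("ex:Christophe", "ex:worksIn", "dbpedia:Amsterdam")
full_triple = tuple(expand(term) for term in triple)
print(full_triple)
```

The expanded object, http://dbpedia.org/resource/Amsterdam, is an address that can be opened on the Web to get the description of the resource.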
Benefits of Linked Data
Data model of triples and resources:
  Everything is defined as described things and relations
  Copes easily with heterogeneous descriptions
Easy to cross-reference things between data sets
The network contains both the data and its description
Use the Web and other open standards (RDF, SPARQL, ...)
Frank publishes his data
Kennissen    Stad
Christophe   Amsterdam
Peter        Barcelona
David        Parijs
The same data as a graph:
ex:Christophe  rdf:type    ex:Acquaintance
ex:Peter       rdf:type    ex:Acquaintance
ex:David       rdf:type    ex:Acquaintance
ex:Christophe  ex:worksIn  dbpedia:Amsterdam
ex:Peter       ex:worksIn  dbpedia:Barcelona
ex:David       ex:worksIn  dbpedia:Paris
Christophe re-uses part of Frank's data to publish his own data
Ville        Pays
Barcelone    Espagne
Paris        France
Amsterdam    Pays-Bas
Added to the graph:
dbpedia:Amsterdam  ex:isIn  dbpedia:Netherlands
dbpedia:Barcelona  ex:isIn  dbpedia:Spain
dbpedia:Paris      ex:isIn  dbpedia:France
Roi adds some more information
Added to the graph:
dbpedia:Netherlands  ex:isIn     dbpedia:Europe
dbpedia:Spain        ex:isIn     dbpedia:Europe
dbpedia:France       ex:isIn     dbpedia:Europe
ex:Acquaintance      rdfs:label  "Conocido"@es
Reasoning with Semantics
Bonus!
dbpedia:Amsterdam    ex:isIn   dbpedia:Netherlands
dbpedia:Netherlands  ex:isIn   dbpedia:Europe
+
ex:isIn              rdf:type  owl:TransitiveProperty
=
dbpedia:Amsterdam    ex:isIn   dbpedia:Europe
Example usage
  Materialize implicit information
  Check for consistency
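Materialising the implicit statement can be sketched as a naive fixed-point loop over the triples. This is only an illustration of what declaring a property as owl:TransitiveProperty implies; a real reasoner does much more, and the function name is an assumption.

```python
def materialise_transitive(triples, predicate):
    """Return the triples plus every statement implied by transitivity."""
    result = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current edges for the transitive predicate.
        edges = [(s, o) for s, p, o in result if p == predicate]
        for s1, o1 in edges:
            for s2, o2 in edges:
                # s1 -> o1 and o1 -> o2 implies s1 -> o2.
                if o1 == s2 and (s1, predicate, o2) not in result:
                    result.add((s1, predicate, o2))
                    changed = True
    return result


triples = {
    ("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Netherlands"),
    ("dbpedia:Netherlands", "ex:isIn", "dbpedia:Europe"),
}
inferred = materialise_transitive(triples, "ex:isIn")
print(("dbpedia:Amsterdam", "ex:isIn", "dbpedia:Europe") in inferred)
```

The loop stops once a full pass adds nothing new, so longer chains (city, country, continent, ...) are handled as well.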
Linked Data vs Linked Open Data
Linked Data doesn't imply Open Data!
  It is possible to apply Linked Data principles to closed data
Open Data doesn't imply Linked Data
  Much open data is not yet published as Linked Data
Linked Data + Open Data = Linked Open Data
  Global, web-scale data space of open data
Rough estimate of size
295 data sets, 31B facts in LOD Cloud
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Everyone can enrich the cloud
Get Linked Open Data
Linked Data is a graph database on the Web
It can be consumed in two ways
As documents on the Web
  Open the resources and ask for RDF content to get a graph
As a database
  Query the data with SPARQL (equivalent of SQL)
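Both ways can be sketched with the Python standard library. The example only prepares the HTTP requests and sends nothing; the endpoint address `http://example.org/sparql` is a placeholder, and the exact RDF formats a server offers are an assumption that differs from site to site.

```python
from urllib.parse import urlencode
from urllib.request import Request


def rdf_request(resource_uri):
    """Prepare a request that asks for RDF instead of an HTML page."""
    accept = "text/turtle, application/rdf+xml;q=0.9"
    return Request(resource_uri, headers={"Accept": accept})


def sparql_url(endpoint, query):
    """Build a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Way 1: the resource as a document, via content negotiation.
request = rdf_request("http://dbpedia.org/resource/Amsterdam")
print(request.get_header("Accept"))

# Way 2: a query against a (placeholder) SPARQL endpoint.
query = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
url = sparql_url("http://example.org/sparql", query)
print(url)
```

Passing `request` or `url` to `urllib.request.urlopen` would perform the actual call.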
Search for RDF documents
Look for the RDF export
Sindice Web data inspector
Hands-on session
Get the RDF of a BestBuy product
Get RDF out of rottentomatoes
Use-case: building a social network of musicians
Goal
Make a network
  Nodes = artists
  Edges = play(ed) in the same band
Use Freebase as data source
Getting the data
First option:
  Get all the pages for all the artists as RDF
  Merge them
  Filter the data to keep only the desired relations
Second option:
  Extract a sub-graph out of the data graph of Freebase
SPARQL query
PREFIX fb:
SELECT DISTINCT ?name1 ?name2 WHERE {
  ?g1 fb:music.group_membership.group ?group .
  ?g1 fb:music.group_membership.member ?member1 .
  ?member1 fb:type.object.name ?name1 .

  ?g2 fb:music.group_membership.group ?group .
  ?g2 fb:music.group_membership.member ?member2 .
  ?member2 fb:type.object.name ?name2 .

  FILTER ((?g1 != ?g2) && (?member1 != ?member2))
  FILTER ((lang(?name1) = "en") && (lang(?name2) = "en"))
  FILTER (str(?name1) < str(?name2))
}
Result
Use factforge.net
  Contains a copy of the data from Freebase
  Understands SPARQL queries
Results: http://bit.ly/music_sn
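Once the query has returned its (?name1, ?name2) pairs, turning them into a network is straightforward. The sketch below assumes the results were already downloaded and parsed into tuples; the artist names are placeholders, not actual query results.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected graph: artist -> set of artists sharing a band."""
    network = defaultdict(set)
    for name1, name2 in pairs:
        network[name1].add(name2)
        network[name2].add(name1)
    return network


# Placeholder rows standing in for the query results.
rows = [
    ("Artist A", "Artist B"),
    ("Artist A", "Artist C"),
    ("Artist B", "Artist C"),
    ("Artist C", "Artist D"),
]
network = build_network(rows)
print(sorted(network["Artist C"]))
```

The last filter of the query keeps each pair only once, so every edge is added in both directions here to make the network symmetric.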
Hot line for Linked (Open) Data
Christophe Guéret
http://www.few.vu.nl/~cgueret
@cgueret
Rinke Hoekstra
http://www.rinkehoekstra.nl/
@rinkehoekstra