consuming linked data 4/5 semtech2011

Consuming Linked Data Juan F. Sequeda Semantic Technology Conference June 2011

Upload: juan-sequeda

Post on 11-May-2015


Page 1: Consuming Linked Data 4/5 Semtech2011

Consuming Linked Data

Juan F. Sequeda

Semantic Technology Conference

June 2011


Now what can we do with this data?


Linked Data Applications

• Software system that makes use of data on the web from multiple datasets and that benefits from links between the datasets


Characteristics of Linked Data Applications

• Consume data that is published on the web following the Linked Data principles: an application should be able to request, retrieve and process the accessed data

• Discover further information by following the links between different data sources: the fourth principle enables this.

• Combine the consumed linked data with data from other sources (not necessarily Linked Data)

• Expose the combined data back to the web following the Linked Data principles

• Offer value to end-users


Generic Applications


Linked Data Browsers

• Not actually separate browsers. Run inside of HTML browsers

• View the data that is returned after looking up a URI in tabular form

• User can navigate between data sources by following RDF Links

• (IMO) No usability


Linked Data Browsers

• http://browse.semanticweb.org/
• Tabulator
• OpenLink Dataexplorer
• Zitgist
• Marbles
• Explorator
• Disco
• LinkSailor


Linked Data (Semantic Web) Search Engines

• Just like conventional search engines (Google, Bing, Yahoo), these crawl RDF documents and follow RDF links.
  – Current conventional search engines don’t crawl data, unless it’s RDFa
• Human-focused search
  – Falcons: keyword
  – SWSE: keyword
  – VisiNav: complex queries
• Machine-focused search
  – Sindice: data instances
  – Swoogle: ontologies
  – Watson: ontologies
  – Uberblic: curated integrated data instances


(Semantic) SEO ++

• Mark up your HTML with RDFa
• Use standard vocabularies (ontologies)
  – Google Vocabulary
  – Good Relations
  – Dublin Core
• Google and Yahoo will crawl this data and use it for better rendering


On-the-fly Mashups


http://sig.ma


Domain Specific Applications

• Government
  – Data.gov
  – Data.gov.uk
  – http://data-gov.tw.rpi.edu/wiki/Demos
• Music
  – Seevl.net
• Dbpedia Mobile
• Life Science
  – LinkedLifeData
• Sports
  – BBC World Cup


Faceted Browsers


http://dbpedia.neofonie.de/browse/


http://dev.semsol.com/2010/semtech/


Query your data


Find all the locations of all the original paintings of Modigliani


Select all proteins that are linked to a curated interaction from the literature and to inflammatory response

http://linkedlifedata.com/


SPARQL Endpoints

• Linked Data sources usually provide a SPARQL endpoint for their dataset(s)

• SPARQL endpoint: SPARQL query processing service that supports the SPARQL protocol*

• Send your SPARQL query, receive the result

* http://www.w3.org/TR/rdf-sparql-protocol/


Where can I find SPARQL Endpoints?

• Dbpedia: http://dbpedia.org/sparql

• Musicbrainz: http://dbtune.org/musicbrainz/sparql

• U.S. Census: http://www.rdfabout.com/sparql

• http://esw.w3.org/topic/SparqlEndpoints


Accessing a SPARQL Endpoint

• SPARQL endpoints: RESTful Web services
• Issuing SPARQL queries to a remote SPARQL endpoint is basically an HTTP GET request to the SPARQL endpoint with parameter query

    GET /sparql?query=PREFIX+rd... HTTP/1.1
    Host: dbpedia.org
    User-agent: my-sparql-client/0.1

The value of query is a URL-encoded string with the SPARQL query
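The URL-encoding step can be sketched with the JDK alone. This is a minimal illustration, not part of the original slides: the class name SparqlGetUrl and the helper build are made up for this example, and the endpoint address is the DBpedia one listed earlier.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the URL of a SPARQL protocol GET request.
class SparqlGetUrl {

    // endpoint + "?query=" + the URL-encoded text of the SPARQL query
    static String build(String endpoint, String sparql) {
        return endpoint + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Spaces become "+", and "?", "{", "}" are percent-encoded.
        System.out.println(build("http://dbpedia.org/sparql", "ASK { ?s ?p ?o }"));
    }
}
```

The resulting string is what goes on the request line of the GET shown above; no request is sent by this sketch.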


Query Results Formats

• SPARQL endpoints usually support different result formats:
  – XML, JSON, plain text (for ASK and SELECT queries)
  – RDF/XML, NTriples, Turtle, N3 (for DESCRIBE and CONSTRUCT queries)


Query Results Formats

PREFIX dbp: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?name ?bday WHERE {
    ?p dbp:birthplace <http://dbpedia.org/resource/Berlin> .
    ?p dbpprop:dateOfBirth ?bday .
    ?p dbpprop:name ?name .
}
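To give a feel for what a SELECT result looks like on the client side, here is a rough sketch that reads the XML result format (namespace http://www.w3.org/2005/sparql-results#) with the JDK's DOM parser. The class name SparqlXmlResults is made up, and SAMPLE is a hand-written placeholder document, not actual output from DBpedia.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: turns a SPARQL XML result into one map per solution.
class SparqlXmlResults {
    static final String NS = "http://www.w3.org/2005/sparql-results#";

    // Hand-written placeholder data for illustration only.
    static final String SAMPLE =
          "<sparql xmlns='http://www.w3.org/2005/sparql-results#'>"
        + "<head><variable name='name'/><variable name='bday'/></head>"
        + "<results><result>"
        + "<binding name='name'><literal>Example Person</literal></binding>"
        + "<binding name='bday'><literal>1900-01-01</literal></binding>"
        + "</result></results></sparql>";

    // Returns, for each <result>, a map from variable name to the bound value's text.
    static List<Map<String, String>> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> rows = new ArrayList<>();
            NodeList results = doc.getElementsByTagNameNS(NS, "result");
            for (int i = 0; i < results.getLength(); i++) {
                Map<String, String> row = new LinkedHashMap<>();
                NodeList bindings =
                        ((Element) results.item(i)).getElementsByTagNameNS(NS, "binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element b = (Element) bindings.item(j);
                    row.put(b.getAttribute("name"), b.getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception ex) {
            throw new IllegalArgumentException("not a readable SPARQL XML result", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(SAMPLE));
    }
}
```

This sketch drops the distinction between URIs, literals and blank nodes; a real client would usually leave that to one of the libraries listed later in the deck.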


Query Result Formats

• Use the Accept header to request the preferred result format:

    GET /sparql?query=PREFIX+rd... HTTP/1.1
    Host: dbpedia.org
    User-agent: my-sparql-client/0.1
    Accept: application/sparql-results+json
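The same request can be assembled with the java.net.http API (Java 11 and later). This is only a sketch under assumptions: the class name SparqlJsonRequest is made up, and the method expects a request URL whose query parameter is already URL-encoded. Nothing is sent until the request is handed to an HttpClient.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: a GET request that asks for JSON-formatted SPARQL results.
class SparqlJsonRequest {

    // requestUrl: endpoint address plus the already URL-encoded query parameter
    static HttpRequest build(String requestUrl) {
        return HttpRequest.newBuilder(URI.create(requestUrl))
                .header("Accept", "application/sparql-results+json") // preferred result format
                .header("User-Agent", "my-sparql-client/0.1")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://dbpedia.org/sparql?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D");
        // Only inspects the request; no network traffic happens here.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually run the query, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and read the body of the response.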


Query Result Formats

• As an alternative, some SPARQL endpoint implementations (e.g. Joseki) provide an additional parameter out

    GET /sparql?out=json&query=... HTTP/1.1
    Host: dbpedia.org
    User-agent: my-sparql-client/0.1


Accessing a SPARQL Endpoint

• More convenient: use a library
• SPARQL JavaScript Library
  – http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
• ARC for PHP
  – http://arc.semsol.org/
• RAP – RDF API for PHP
  – http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/index.html


Accessing a SPARQL Endpoint

• Jena / ARQ (Java)
  – http://jena.sourceforge.net/
• Sesame (Java)
  – http://www.openrdf.org/
• SPARQL Wrapper (Python)
  – http://sparql-wrapper.sourceforge.net/
• PySPARQL (Python)
  – http://code.google.com/p/pysparql/


Accessing a SPARQL Endpoint

Example with Jena/ARQ

    import com.hp.hpl.jena.query.*;

    String service = "...";        // address of the SPARQL endpoint
    String query = "SELECT ...";   // your SPARQL query
    QueryExecution e = QueryExecutionFactory.sparqlService(service, query);
    try {
        ResultSet results = e.execSelect();
        while ( results.hasNext() ) {
            QuerySolution s = results.nextSolution();
            // ...
        }
    } finally {
        e.close();   // release the connection even if the query fails
    }


Querying a single dataset is quite boring compared to issuing queries over multiple datasets


Creating a Linked Data Application


Linked Data Architectures

• Follow-up queries
• Querying Local Cache
• Crawling
• Federated Query Processing
• On-the-fly Dereferencing


Follow-up Queries

• Idea: issue follow-up queries over other datasets based on results from previous queries

• Substituting placeholders in query templates


    String service1 = "http://cb.semsol.org/sparql";
    String service2 = "http://dbpedia.org/sparql";

    // Template for the follow-up query; %s is replaced by a URI from the first result
    String qTmpl = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
                 + "SELECT ?c WHERE { <%s> rdfs:comment ?c }";
    // q1: find a list of companies, filtered by some criteria, and return Dbpedia URIs for them
    String q1 = "SELECT ?s WHERE { ...";

    QueryExecution e1 = QueryExecutionFactory.sparqlService(service1, q1);
    ResultSet results1 = e1.execSelect();
    while ( results1.hasNext() ) {
        QuerySolution sol = results1.nextSolution();
        // substitute the URI bound to ?s into the template
        String q2 = String.format( qTmpl, sol.getResource("s").getURI() );
        QueryExecution e2 = QueryExecutionFactory.sparqlService(service2, q2);
        ResultSet results2 = e2.execSelect();
        while ( results2.hasNext() ) {
            // ...
        }
        e2.close();
    }
    e1.close();


Follow-up Queries

• Advantage
  – Queried data is up-to-date
• Drawbacks
  – Requires the existence of a SPARQL endpoint for each dataset
  – Requires program logic
  – Very inefficient


Querying Local Cache

• Idea: Use an existing SPARQL endpoint that provides access to a set of copies of relevant datasets
• Use RDF dumps of each dataset
• SPARQL endpoint over a majority of datasets from the LOD cloud at:
  – http://lod.openlinksw.com/sparql
  – http://uberblic.org


Querying a Collection of Datasets

• Advantage:
  – No need for specific program logic
  – Includes the datasets that you want
  – Complex queries and high performance
  – Even reasoning
• Drawbacks:
  – Depends on existence of RDF dump
  – Requires effort to set up and to operate the store
  – How to keep the copies in sync with the originals?
  – Queried data might be out of date


Crawling

• Crawl RDF in advance by following RDF links
• Integrate, clean and store in your own triplestore
• Same way we crawl HTML today
• LDSpider


Crawling

• Advantages:
  – No need for specific program logic
  – Independent of the existence, availability, and efficiency of SPARQL endpoints
  – Complex queries with high performance
  – Can even reason about the data
• Drawbacks:
  – Requires effort to set up and to operate the store
  – How to keep the copies in sync with the originals?
  – Queried data might be out of date


Federated Query Processing

• Idea: Querying a mediator which distributes sub-queries to relevant sources and integrates the results


Federated Query Processing

• Instance-based federation
  – Each thing described by only one data source
  – Untypical for the Web of Data
• Triple-based federation
  – No restrictions
  – Requires more distributed joins
• Statistics about datasets required (both cases)
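The mediator idea can be illustrated with a toy, in-memory model of triple-based federation. Everything here is an assumption made for the example: sources are plain lists of {subject, predicate, object} string arrays instead of remote SPARQL endpoints, the ex: names are invented, and no statistics are used for source selection, so every sub-query goes to every source.

```java
import java.util.ArrayList;
import java.util.List;

// Toy mediator: sends sub-queries to all sources and joins the answers itself.
class ToyMediator {

    // Sub-query against one source: triples with predicate p and, if o is not null, object o.
    static List<String[]> match(List<String[]> source, String p, String o) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : source) {
            if (t[1].equals(p) && (o == null || t[2].equals(o))) out.add(t);
        }
        return out;
    }

    // Distributes the same sub-query to every source and unions the results.
    static List<String[]> askAll(List<List<String[]>> sources, String p, String o) {
        List<String[]> out = new ArrayList<>();
        for (List<String[]> source : sources) out.addAll(match(source, p, o));
        return out;
    }

    // Evaluates "?x p1 o1 . ?x p2 ?y" and returns "x -> y" strings.
    static List<String> query(List<List<String[]>> sources, String p1, String o1, String p2) {
        List<String[]> left = askAll(sources, p1, o1);
        List<String[]> right = askAll(sources, p2, null);
        List<String> answers = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : right) {
                // distributed join on the shared variable ?x, done at the mediator
                if (l[0].equals(r[0])) answers.add(l[0] + " -> " + r[2]);
            }
        }
        return answers;
    }

    // Two invented sources: one knows locations, the other knows labels.
    static List<List<String[]>> demoSources() {
        List<String[]> a = List.of(
                new String[] {"ex:acme", "ex:locatedIn", "ex:Berlin"},
                new String[] {"ex:globex", "ex:locatedIn", "ex:Paris"});
        List<String[]> b = List.of(
                new String[] {"ex:acme", "ex:label", "Acme"},
                new String[] {"ex:globex", "ex:label", "Globex"});
        return List.of(a, b);
    }

    public static void main(String[] args) {
        System.out.println(query(demoSources(), "ex:locatedIn", "ex:Berlin", "ex:label"));
    }
}
```

Neither source can answer the whole query alone, so the join has to happen at the mediator. With dataset statistics, the mediator could skip sources that cannot contribute to a given sub-query.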


Federated Query Processing

• DARQ (Distributed ARQ)
  – http://darq.sourceforge.net/
  – Query engine for federated SPARQL queries
  – Extension of ARQ (query engine for Jena)
  – Last update: June 2006
• Semantic Web Integrator and Query Engine (SemWIQ)
  – http://semwiq.sourceforge.net/
  – Last update: March 2010
• Commercial
  – …


Federated Query Processing

• Advantages:
  – No need for specific program logic
  – Queried data is up to date
• Drawbacks:
  – Requires the existence of a SPARQL endpoint for each dataset
  – Requires effort to set up and configure the mediator

In any case:

• You have to know the relevant data sources
  – When developing the app using follow-up queries
  – When selecting an existing SPARQL endpoint over a collection of dataset copies
  – When setting up your own store with a collection of dataset copies
  – When configuring your query federation system
• You restrict yourself to the selected sources

There is an alternative:

Remember, URIs link to data


On-the-fly Dereferencing

• Idea: Discover further data by looking up relevant URIs in your application on the fly

• Can be combined with the previous approaches

• Linked Data Browsers


Link Traversal Based Query Execution

• Applies the idea of automated link traversal to the execution of SPARQL queries
• Idea:
  – Intertwine query evaluation with traversal of RDF links
  – Discover data that might contribute to query results during query execution
• Alternately:
  – Evaluate parts of the query
  – Look up URIs in intermediate solutions
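The alternation between evaluating part of the query and looking up URIs can be sketched with a simulated Web. This is a toy under stated assumptions, not how any of the real implementations work internally: the "Web" is a map from URI to the triples served at that URI, the ex: names are invented, and the query shape is fixed to two triple patterns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy link traversal: the queried data grows while the query is being evaluated.
class ToyLinkTraversal {
    final Map<String, List<String[]>> web;          // simulated Web: URI -> triples served there
    final List<String[]> local = new ArrayList<>(); // data discovered so far
    final Set<String> fetched = new HashSet<>();    // URIs already looked up

    ToyLinkTraversal(Map<String, List<String[]>> web) { this.web = web; }

    // "Dereferences" a URI once; a URI that is not retrievable contributes nothing.
    void lookUp(String uri) {
        if (fetched.add(uri)) local.addAll(web.getOrDefault(uri, List.of()));
    }

    // Evaluates "<start> p1 ?x . ?x p2 ?y" and returns the values bound to ?y.
    List<String> run(String start, String p1, String p2) {
        lookUp(start);                        // seed: the URI mentioned in the query
        List<String> xs = new ArrayList<>();
        for (String[] t : local) {            // evaluate the first pattern
            if (t[0].equals(start) && t[1].equals(p1)) xs.add(t[2]);
        }
        for (String x : xs) lookUp(x);        // look up URIs in the intermediate solutions
        List<String> ys = new ArrayList<>();
        for (String x : xs) {
            for (String[] t : local) {        // evaluate the second pattern on the grown data
                if (t[0].equals(x) && t[1].equals(p2)) ys.add(t[2]);
            }
        }
        return ys;
    }

    // Invented data; ex:carol is deliberately not retrievable.
    static Map<String, List<String[]>> demoWeb() {
        Map<String, List<String[]>> web = new HashMap<>();
        web.put("ex:alice", List.of(
                new String[] {"ex:alice", "ex:knows", "ex:bob"},
                new String[] {"ex:alice", "ex:knows", "ex:carol"}));
        web.put("ex:bob", List.of(
                new String[] {"ex:bob", "ex:name", "Bob"},
                new String[] {"ex:bob", "ex:knows", "ex:alice"}));
        return web;
    }

    public static void main(String[] args) {
        System.out.println(new ToyLinkTraversal(demoWeb()).run("ex:alice", "ex:knows", "ex:name"));
    }
}
```

Because ex:carol cannot be looked up in the toy data, her name never reaches the result, which is one way results can end up incomplete.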


Link Traversal Based Query Execution

• Advantages:
  – No need to know all data sources in advance
  – No need for specific programming logic
  – Queried data is up to date
  – Does not depend on the existence of SPARQL endpoints provided by the data sources
• Drawbacks:
  – Not as fast as a centralized collection of copies
  – Unsuitable for some queries
  – Results might be incomplete (do we care?)


Implementations

• Semantic Web Client library (SWClLib) for Java
  http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/
• SWIC for Prolog
  http://moustaki.org/swic/


Implementations

• SQUIN http://squin.org
  – Provides SWClLib functionality as a Web service
  – Accessible like a SPARQL endpoint
  – Install package: unzip and start
    • Less than 5 mins!
  – Convenient access with SQUIN PHP tools:

    $s = 'http:// ...';   // address of the SQUIN service
    $q = new SparqlQuerySock( $s, '... SELECT ...' );
    $res = $q->getJsonResult();   // or getXmlResult()


Real World Example


What else?

• Vocabulary Mapping
  – foaf:name vs foo:name
• Identity Resolution
  – ex:Juan owl:sameAs foo:Juan
• Provenance
• Data Quality
• License
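For identity resolution, one common way to handle owl:sameAs links is to group URIs into equivalence classes and pick one representative per class. The following is a small union-find sketch of that idea, not something the slides prescribe; the class name SameAsResolver is made up, and bar:Juan in the usage below is an invented third alias.

```java
import java.util.HashMap;
import java.util.Map;

// Union-find over URIs: every owl:sameAs statement merges two equivalence classes.
class SameAsResolver {
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative URI of the class that uri belongs to.
    String canonical(String uri) {
        parent.putIfAbsent(uri, uri);
        String p = parent.get(uri);
        if (p.equals(uri)) return uri;
        String root = canonical(p);
        parent.put(uri, root); // path compression keeps later lookups short
        return root;
    }

    // Records "a owl:sameAs b" by linking the two representatives.
    void sameAs(String a, String b) {
        parent.put(canonical(a), canonical(b));
    }

    public static void main(String[] args) {
        SameAsResolver resolver = new SameAsResolver();
        resolver.sameAs("ex:Juan", "foo:Juan");
        resolver.sameAs("foo:Juan", "bar:Juan");
        // Both lines print the same representative URI.
        System.out.println(resolver.canonical("ex:Juan"));
        System.out.println(resolver.canonical("bar:Juan"));
    }
}
```

Rewriting every URI to its representative before merging data lets statements about ex:Juan and foo:Juan end up on the same resource. Whether a given sameAs link should be trusted is a data quality and provenance question.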


Getting Started

• Finding URIs
  – Use search engines

• Finding SPARQL Endpoints