introduction to linked data and semantic web technology · introduction to linked data and semantic...
TRANSCRIPT
1
Introduction to linked data and Semantic Web technology
Dave Raggett, W3C
6 October 2009 With thanks to Ivan Herman for use of some of his slides
2
The Unfinished Revolution
● Today's Web is designed for people to interpret○ Using your eyes and your mind
● Each website only covers part of your needs○ You have to do integrate information across websites○ This is time consuming and a waste of effort
● We should put computers to work on our behalf○ We need to find ways for software to query, combine and
interpret data accessible over the Web– Michael Dertouzos: “The Unfinished Revolution, How to Make
Technology Work for Us--Instead of the Other Way Around”
3
So what is the Semantic Web?
4
It is, essentially, the Web of Data and the technologies to realize that
5
Is it that simple...
● Of course, the devil is in the details○ a common model has to be provided for machines
to describe and query the data and its connections○ the “classification” of the terms can become very
complex for specific knowledge areas: this is where ontologies, thesauri, etc, enter the game…
6
Linked Data
7
Data Integration withthe Semantic Web
● Map each data source into binary relations● Merge the relations● Start making queries
○ Uniform representation of relations as RDF Triples
subject objectVerb
All three are named with URIs
8
A simplified book store example
ID Author Title Publisher YearISBN0-00-651409-X The Glass Palace 2000id_xyz id_qpr
ID Name Home Page
ID CityHarper Collins London
id_xyz Ghosh, Amitav http://www.amitavghosh.com
Publ. Nameid_qpr
SQL database:
9
Export data as relations
10
Another book store example
A B D E
1 ID Titre Original
2
ISBN0 2020386682 A13 ISBN-0-00-651409-X
3
6 ID Auteur7 ISBN-0-00-651409-X A12
11
12
13
TraducteurLe Palais des miroirs
NomGhosh, AmitavBesse, Christianne
Spreadsheet
11
Export it as relations
12
Merge the relations
13
Merging continued...
14
Merging identical nodes
15
Add some missing knowledge
● We “feel” that a:author and f:auteur should be the same
● But an automatic merge doesn't know that without help
● We will add some extra information to the merged data:○ a:author same as f:auteur
○ both identify a “Person”
○ a term that a community may have already defined:
○ a “Person” is uniquely identified by his/her name and, say, homepage
○ it can be used as a “category” for certain type of resources
16
The merged relations
17
Start making queries
● You can now ask for the home page of the original author of a translated book
● This information is made available byreasoning over the merged datasets
18
What did we do?
19
Web of Data
● We should publish data on servers○ In standard ways rather than ad hoc approaches
● Set RDF links among the data items from different data sets○ URIs as globally unique names○ URIs for downloadable datasets○ URIs for Web APIs
● Encourage people to innovate○ More data○ More applications
● Watch the network effect work its magic!
20
Linked Open Data Cloud,March 2008
21
Linked Open Data Cloud,March 2009
22
All this sounds nice, but isn't that just a dream?
23
2007 Gartner Predictions
● During the next 10 years, Web-based technologies will improve the ability to embed semantic structures [… it] will occur in multiple evolutionary steps…
● By 2017, we expect the vision of the Semantic Web […] to coalesce […] and the majority of Web pages are decorated with some form of semantic hypertext.
● By 2012, 80% of public Web sites will use some level of semantic hypertext to create SW documents […] 15% of public Web sites will use more extensive Semantic Web-based ontologies to create semantic databases
24
Corporate adoption
● Major companies offer (or will offer) Semantic Web tools or systems using Semantic Web: Adobe, Oracle, IBM, HP, Software AG, GE, Northrop Gruman, Altova, Microsoft, Dow Jones, …
● Others are using it (or consider using it) as part of their own operations: Novartis, Pfizer, Telefónica, …
● Some of the names of active participants in W3C SW related groups: ILOG, HP, Agfa, SRI International, Fair Isaac Corp., Oracle, Boeing, IBM, Chevron, Siemens, Nokia, Pfizer, Sun, Eli Lilly, …
25
Query languages
26
Querying RDF with SPARQL
● A query language for RDF data● Similar in syntax and spirit to SQL
SELECT ?p WHERE { ?L1 arcrole:parent-child ?b1 . ?b1 xl:type xl:link . ?b1 xl:from ?p OPTIONAL { ?L2 arcrole:parent-child ?b2 . ?b2 xl:type xl:link . ?b2 xl:to ?p } FILTER (!BOUND(?b2)) }
27
Defining shared vocabularies
28
Data Types
● RDFS defines some predicates for common datatypes, e.g.
○ Booleans
○ Numbers
○ Strings
As XML or as natural language, e.g. Spanish
○ Dates
○ Classes
● Resources can belong to several classes
29
OWL for Ontologies
● RDFS is useful, but complex applications may want more
● OWL adds lots of possibilities○ Characterization of properties○ Disjointness or equivalence of classes○ In RDFS, you can subclass existing classes○ In OWL, you can construct classes from existing ones
– Through set intersection, union, complement, etc.
● But this comes at a cost...
30
OWL Profiles
● Trade off between rich semantics for expressibility and ease of making inferences○ Simpler inference engines are possible with restrictions
on which terms can be used and under what circumstances
● OWL full○ Very expressive, but not computable in general
● OWL DL○ Popular computable subset of OWL full
● OWL 2 defines further profiles
31
Rules
32
Rule Languages
● May be more convenient than ontologies● Example
○ A cheap book is a novel with over 500 pages and costing less than $8
● W3C Rule Interchange Format (RIF)○ Family of languages for rule interchange
– For different kinds of rule language
○ Uses include– Negotiating eBusiness contracts across platforms
– Access to business rules of supply chain partners
– Managing inter-organizational business policies
33
XBRL and the Semantic Web
34
Why translate XBRLto another format?
● It is very expensive to process 10-50MB of XML on each query○ Memory and CPU intensive: about one second
of CPU time per 10MB of XML source
● Better to pre-process filings into a persistent format designed to match needs of queries○ Current tools use proprietary solutions
● RDF and OWL as natural choices○ Mature standards
○ Facilitate mashing financial data with other kinds of information available over the Web
○ Web APIs and standards would enable an ecosystem of value adding players
35
XBRL as RDF/Turtle
Part of US GAAP taxonomy
@prefix usfr-pte: <http://www.xbrl.org/us/fr/common/pte/2005-02-28>.
usfr-pte:ChangeOtherCurrentAssets rdf:type xbrli:monetaryItemType; xbrli:periodType "duration".usfr-pte:ChangeOtherCurrentLiabilities rdf:type xbrli:monetaryItemType; xbrli:periodType "duration".
_:link155 arcrole:parent-child [ xl:type xl:link; xl:role role1:StatementFinancialPosition; xl:use "prohibited"; xl:priority "1"^^xsd:integer; xl:order "1.0"^^xsd:decimal; xl:from usfr-pte:IntangibleAssetsNetAbstract; xl:to usfr-pte:IntangibleAssetsGoodwill; ].
36
XBRL as RDF/Turtle
Sample of an XBRL Instance file
_:context_FY07Q3 xl:type xbrli:context; xbrli:entity [ xbrli:identifier "0000789019"; xbrli:scheme <http://sec.gov/CIK>; ]; xbrli:period ( [ xbrli:startDate "2007-01-01"^^xsd:date; xbrli:endDate "2007-03-31"^^xsd:date; ] ).
_:unit_usd xbrli:measure iso4217:USD.
_:fact209 xl:type xbrli:fact; xl:provenance _:provenance1; rdf:type us-gaap:PaymentsToAcquireProductiveAssets; rdf:value "461000000"^^xsd:integer; xbrli:decimals "-6"^^xsd:integer; xbrli:unit _:unit_USD; xbrli:context _:context_FY07Q3.
37
XBRL and OWL
● XBRL Taxonomy loosely equates to OWL ontology
○ But note XBRL's taxonomy overrides
● Automated mapping is mostly feasible
○ As demonstrated by Rhizomik XSD2OWL
● XBRL's formal semantics are weak
● XBRL versioning standard will describe differences between different versions of the same taxonomy, e.g. US GAAP 2008, 2009
○ Unaware of work on mapping this into OWL
○ Is it a good match to real world needs?
– e.g. rules of thumb for computing analytic ratios
● Reasoning across different taxonomies remains a major challenge
○ e.g. US GAAP vs IFRS
38
Web-based ecosystemfor financial data
● Publishers of raw data○ Investor relation websites
○ Government agencies
○ News agencies
● Data aggregators○ Republish data as linkable triples, Sparql queries
○ Higher level APIs for common queries
– Results as charts or tables
○ Web of scripts that add value
– Custom analytics across filings
● Smart search engines
● Communities○ Share reviews, comments, analyses, mashups, ...
39
Smart Search Engines
● Imagine search engines that provide selected financial highlights for each company that matches the search criteria you just entered○ With salient numbers and charts
● The search results tailor the data provided according to your interests○ Based upon analysis of the search criteria and other
information gleaned from previous searches○ Subject to your privacy preferences, of course! **
● Interactive data you can drill down on
** My other job is on privacy and identity management for an EU FP7 project
40
Thank you for listening