Learning Content Patterns from Linked Data
Emir Muñoz
Fujitsu (Ireland) Limited
National University of Ireland Galway
LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014
http://bit.ly/1xYTR6Z
(@emir_munoz)
<subject, predicate, object>
Domain(predicate) ??
Range(predicate) ??
Let’s run the following SPARQL query over the endpoint…
select distinct ?obj where
{?sub <http://dbpedia.org/property/isbn> ?obj}
The endpoint response is a table with the values for the isbn property:
0
71090
6176526
2
2.7073
140043853
1107020697
2940013968264
0978-02-02+02:00
http://dbpedia.org/resource/N/a
"?"@en
"ISBN 0-312-85182-0"@en
"See text"@en
"various"@en
"ISBN 978-0-465-02656-2, ISBN 0-14-017997-6"@en
"ISBN 0-553-07875-5 & ISBN 0-553-56166-9"@en
"The Claiming of Sleeping Beauty: ISBN 0-452-26656-4"@en
"-2.0"^^<http://dbpedia.org/datatype/second>
"TBA"@en
"not available"@en
"[[#Bibliography"@en
And some more ...
So, what is the correct range for the isbn property?
LOV statistics (as of July 7th, 2014):
446 vocabularies
10 classes and 20 properties per vocabulary on average
The range of isbn is
http://schema.org/Text
…but still, is it what I’m looking for? What is the syntax?
Property: apoapsis
Etymology: apo- + apsis
Noun: apoapsis (plural apoapsides)
(astronomy) The point of a body's elliptical orbit about the system's centre of mass where the distance between the body and the centre of mass is at its maximum.
[http://en.wiktionary.org/wiki/apoapsis]
[Figure labels: Earth, Satellite]
dbr:17049_Miron dbo:apoapsis 4.01288e+11
https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/ontology/OntologyDatatypes.scala
<subject, predicate, object>
1488-07-28+02:00 "September 2012"@en "--08-26+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
1982-05-23+02:00 "August 2012"@en "--01-24+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
2007-04-11+02:00 "July 2009"@en "--06-11+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
Lerman et al. (JAIR 2003)
First column: [NUM-NUM-NUM+NUM:NUM] (plain literal)
Second column: [ALPHA<space>NUM] (plain literal + lang)
Third column: [--NUM-NUM+NUM:NUM] (typed literal)
<http://dbpedia.org/property/date>
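The three column patterns above can be reproduced with a very small tokenizer. The following is only a sketch, assuming the input is the lexical form of the value (language tag and datatype already stripped); the function name `coarse_pattern` and the tokenization regex are illustrative choices, not the tooling used in the talk.

```python
import re

# Digit runs, letter runs and whitespace runs become one token each;
# any other character (punctuation) is its own token.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|.")


def coarse_pattern(value: str) -> str:
    """Map a lexical form to a bracketed pattern such as [ALPHA<space>NUM]."""
    parts = []
    for token in TOKEN_RE.findall(value):
        if token.isdigit():
            parts.append("NUM")
        elif token.isalpha():
            parts.append("ALPHA")
        elif token.isspace():
            parts.append("<space>")
        else:
            parts.append(token)  # punctuation is kept verbatim
    return "[" + "".join(parts) + "]"


first = coarse_pattern("1488-07-28+02:00")
second = coarse_pattern("September 2012")
third = coarse_pattern("--08-26+02:00")
```

Applied to the first row of the table, this yields the same three bracketed patterns listed for the three columns.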
Let be the set of
content patterns.
Lerman et al. (JAIR 2003)
More specific categories
For the input set:
That generates the following patterns:
Values are decomposed into tokens, and each token is represented by a syntactic class.
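A token can be described at several levels, from a general class down to the token itself. The sketch below illustrates the idea using class names that appear in the patterns of this talk. The ordering of the levels loosely follows the token hierarchy of Lerman et al., and the digit-count thresholds for SMALL/MEDIUM/LARGE are hypothetical; neither is confirmed by the slides.

```python
from typing import List


def syntactic_classes(token: str) -> List[str]:
    """Return the classes of one token, from most general to most specific."""
    if token.isdigit():
        digits = len(token)
        # Hypothetical thresholds: up to 2 digits small, up to 4 medium.
        if digits <= 2:
            size = "SMALL_NUMBER"
        elif digits <= 4:
            size = "MEDIUM_NUMBER"
        else:
            size = "LARGE/FLOAT_NUMBER"
        return ["ALPHANUMERIC", "NUMBER", size, token]
    if token.isalpha():
        levels = ["ALPHANUMERIC", "ALPHA"]
        if token.islower():
            levels.append("ALL_LOWERCASE")
        elif token.isupper():
            levels.append("ALL_UPPERCASE")
        return levels + [token]
    if token.isalnum():
        return ["ALPHANUMERIC", token]  # mixed letters and digits
    return ["PUNCTUATION", token]


year_levels = syntactic_classes("2012")
```

Choosing a level per token trades generality for precision: "ALPHANUMERIC MEDIUM_NUMBER" covers any month-and-year value, while "ALPHANUMERIC 2012" covers only values from one year.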
DBpedia version 3.9:
2.4 billion RDF triples
53,230 properties
Split
Method
19.25% plain literals
18.02% typed literals
62.73% without lang or datatype (xsd:string)
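A split like this can be computed by looking at the suffix of each object term. The sketch below assumes terms are serialized as in the query results shown earlier (quoted literal, optional `@lang` or `^^<datatype>`). The bucket names and the regular expression are illustrative, and how these buckets map to the three percentages above is an assumption.

```python
import re
from collections import Counter
from typing import Dict, List

# Quoted lexical form, optionally followed by @lang or ^^<datatype IRI>.
LITERAL_RE = re.compile(
    r'"(?:[^"\\]|\\.)*"'
    r'(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)|\^\^<(?P<datatype>[^>]+)>)?'
)


def literal_kind(term: str) -> str:
    """Classify one serialized object term by its lang/datatype suffix."""
    match = LITERAL_RE.fullmatch(term)
    if match is None:
        return "not a literal"
    if match.group("lang"):
        return "language-tagged"
    if match.group("datatype"):
        return "typed"
    return "no language tag or datatype"


def split_counts(terms: List[str]) -> Dict[str, int]:
    """Count how many terms fall into each bucket."""
    return dict(Counter(literal_kind(t) for t in terms))


counts = split_counts([
    '"TBA"@en',
    '"-2.0"^^<http://dbpedia.org/datatype/second>',
    '"680"',
    "http://dbpedia.org/resource/N/a",
])
```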
For the apoapsis example, we extracted one pattern:
http://dbpedia.org/ontology/apoapsis LARGE/FLOAT_NUMBER 1.0
And we also found some other related properties:
http://dbpedia.org/ontology/Planet/apoapsis LARGE/FLOAT_NUMBER 1.0
http://dbpedia.org/ontology/Spacecraft/apoapsis LARGE/FLOAT_NUMBER 1.0
http://dbpedia.org/property/apoapsis NUMBER 0.9230769230769231
http://dbpedia.org/property/apoapsis LARGE/FLOAT_NUMBER 0.75213675
For the date example, we extracted 7 patterns:
http://dbpedia.org/property/date -- SMALL_NUMBER - SMALL_NUMBER 0.2
http://dbpedia.org/property/date ALPHANUMERIC MEDIUM_NUMBER 0.166
http://dbpedia.org/property/date ALPHANUMERIC 2012 0.032
http://dbpedia.org/property/date ALPHANUMERIC.ALPHANUMERIC 0.012
And more …
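One plausible reading of the number next to each pattern is the share of a property's values that the pattern covers. The slides do not define the score, so this interpretation is an assumption. A minimal sketch under that assumption, with a single coarse class level (NUMBER, ALPHA, punctuation verbatim) and hypothetical function names:

```python
import re
from collections import Counter
from typing import Dict, List

# Digit runs, letter runs and punctuation runs; whitespace is skipped.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]+|_+")


def pattern_of(value: str) -> str:
    """Space-separated token classes for one lexical form."""
    classes = []
    for token in TOKEN_RE.findall(value):
        if token.isdigit():
            classes.append("NUMBER")
        elif token.isalpha():
            classes.append("ALPHA")
        else:
            classes.append(token)
    return " ".join(classes)


def pattern_scores(values: List[str]) -> Dict[str, float]:
    """Share of the values covered by each pattern."""
    counts = Counter(pattern_of(v) for v in values)
    return {pattern: n / len(values) for pattern, n in counts.items()}


# Four dbp:date values taken from the table shown earlier in the talk.
scores = pattern_scores(
    ["September 2012", "August 2012", "July 2009", "--08-26+02:00"]
)
```

With one class level the shares sum to 1. The scores listed above do not (0.92 and 0.75 for the same property), which suggests patterns at different levels of generality are scored separately.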
The user has this value: “2014-10-20”.
Which property can they use? dbp:dateCreated, dbp:dateOfProduction, dbp:dateOpened,
dbp:dateSigned, dbp:dateOfPremiere, dbp:date, among others.
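Property search can be sketched as a lookup of the value against stored (property, pattern, score) rows. The rows below are copied from patterns shown elsewhere in this talk, so this toy database only knows vcard:bday and dbp:date; a full database would presumably also return the other dbp:date* properties listed above. The matching rule (a pattern token matches if it equals the value token or one of its classes) and the digit-count thresholds are assumptions.

```python
import re
from typing import List, Set, Tuple

# Digit runs, letter runs and punctuation runs; whitespace is skipped.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]+|_+")

# Rows copied from the slides: (property, pattern, score).
PATTERN_DB = [
    ("vcard:bday", "NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
    ("vcard:bday", "MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
    ("dbp:date", "-- SMALL_NUMBER - SMALL_NUMBER", 0.2),
    ("dbp:date", "ALPHANUMERIC MEDIUM_NUMBER", 0.166),
]


def classes(token: str) -> Set[str]:
    """The token itself plus every class assumed to describe it."""
    if token.isdigit():
        # Hypothetical thresholds: up to 2 digits small, up to 4 medium.
        size = ("SMALL_NUMBER" if len(token) <= 2
                else "MEDIUM_NUMBER" if len(token) <= 4
                else "LARGE/FLOAT_NUMBER")
        return {token, size, "NUMBER", "ALPHANUMERIC"}
    if token.isalpha():
        return {token, "ALPHA", "ALPHANUMERIC"}
    return {token, "PUNCTUATION"}


def matches(pattern: str, value: str) -> bool:
    """True if every pattern token describes the value token at that position."""
    wanted = pattern.split()
    tokens = TOKEN_RE.findall(value)
    return len(wanted) == len(tokens) and all(
        w in classes(t) for w, t in zip(wanted, tokens)
    )


def candidate_properties(value: str) -> List[Tuple[str, float]]:
    """Properties with a matching pattern, best score first."""
    best = {}
    for prop, pattern, score in PATTERN_DB:
        if matches(pattern, value):
            best[prop] = max(score, best.get(prop, 0.0))
    return sorted(best.items(), key=lambda item: -item[1])


suggestions = candidate_properties("2014-10-20")
```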
What is the property dbp:admCtrOf used for?
"town of republic significance of Meleuz"@en (http://dbpedia.org/resource/Meleuz)
"town of oblast significance of Oktyabrsk"@en (http://dbpedia.org/resource/Oktyabrsk)
"town of republic significance of Sortavala"@en (http://dbpedia.org/resource/Sortavala)
It is used to declare Administrative Control Of.
Check for atypical values (outliers)
Close look into the most (in)frequent patterns
Possible errors during automatic extraction
For the dbp:isbn property we can find the following values:
"summer or autumn 380"@en
"Late November"@en
"Fall 1040"@en
680
"December, 67 BC"@en
"April-July 1799"@en
http://dbpedia.org/resource/New_Year's_Day
http://dbpedia.org/resource/Second_Intermediate_Period_of_Egypt
"New moon day of Kartika, celebrations begin two days prior and end two days after that date"@en
Are they or values?
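Checking for outliers can be sketched as flagging the values whose pattern is rare for the property. This is an illustration only: the function names and the 25% threshold are hypothetical, and the sample values are ISBN-like strings from the isbn query results earlier in the talk (with the "ISBN " prefix dropped) plus two of the odd values found there.

```python
import re
from collections import Counter
from typing import List

# Digit runs, letter runs and punctuation runs; whitespace is skipped.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]+|_+")


def pattern_of(value: str) -> str:
    """Space-separated coarse token classes for one lexical form."""
    classes = []
    for token in TOKEN_RE.findall(value):
        if token.isdigit():
            classes.append("NUMBER")
        elif token.isalpha():
            classes.append("ALPHA")
        else:
            classes.append(token)
    return " ".join(classes)


def atypical_values(values: List[str], min_share: float = 0.25) -> List[str]:
    """Values whose pattern covers less than min_share of all values."""
    patterns = [pattern_of(v) for v in values]
    counts = Counter(patterns)
    return [
        value
        for value, pattern in zip(values, patterns)
        if counts[pattern] / len(values) < min_share
    ]


sample = [
    "0-312-85182-0", "0-553-07875-5", "0-553-56166-9", "0-452-26656-4",
    "TBA", "See text",
]
flagged = atypical_values(sample)
```

A flagged value is a candidate for review, not necessarily an error: a rare pattern can also be a legitimate but unusual way of writing the value.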
E-mail: [email protected]
Given name: John
Surname: Snow
Birthday: 1986-02-14
A vCard may be annotated with the hCard microformat
LD4IE Challenge 2014
We can use our database to extract and validate the email:
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE 0.82
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com 0.69
vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE 0.54
vcard:email mailto : ALPHA @ ALPHANUMERIC . com 0.46
vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE 0.36
…and also the birthday:
vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5
vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5
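Extraction with a pattern can be sketched as sliding the pattern over the tokens of a text and returning the spans where every token fits. The code below is an illustration, not the challenge system: the function names and the digit-count thresholds are assumptions, and the email address `john@example.com` is a hypothetical stand-in because the address on the slide is redacted.

```python
import re
from typing import List, Set

# Digit runs, letter runs and punctuation runs; whitespace is skipped.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]+|_+")


def classes(token: str) -> Set[str]:
    """The token itself plus every class assumed to describe it."""
    if token.isdigit():
        # Hypothetical thresholds: up to 2 digits small, up to 4 medium.
        size = ("SMALL_NUMBER" if len(token) <= 2
                else "MEDIUM_NUMBER" if len(token) <= 4
                else "LARGE/FLOAT_NUMBER")
        return {token, size, "NUMBER", "ALPHANUMERIC"}
    if token.isalpha():
        found = {token, "ALPHA", "ALPHANUMERIC"}
        if token.islower():
            found.add("ALL_LOWERCASE")
        elif token.isupper():
            found.add("ALL_UPPERCASE")
        return found
    return {token, "PUNCTUATION"}  # everything else is treated as punctuation


def extract(pattern: str, text: str) -> List[str]:
    """Return every substring of text whose tokens fit the pattern."""
    wanted = pattern.split()
    if not wanted:
        return []
    tokens = list(TOKEN_RE.finditer(text))
    hits = []
    for start in range(len(tokens) - len(wanted) + 1):
        window = tokens[start:start + len(wanted)]
        if all(w in classes(m.group()) for w, m in zip(wanted, window)):
            hits.append(text[window[0].start():window[-1].end()])
    return hits


birthdays = extract("MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER",
                    "Birthday: 1986-02-14")
# Hypothetical address; the one on the slide is not recoverable.
emails = extract("mailto : ALPHA @ ALPHANUMERIC . com",
                 "E-mail: mailto:john@example.com")
```

Because whitespace is skipped during tokenization, a match may span tokens separated by spaces; a stricter extractor would also compare the gaps between tokens.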
Extraction of lexico-syntactic patterns from LD datasets
Different use cases:
Search for properties
Validation of values
Information extraction based on patterns
Future work:
Study of consistency analysis of knowledge bases
Extension of patterns to cover other knowledge bases
Among others
500,000 content patterns