exploring our world with freebase

Post on 20-Jan-2015

DESCRIPTION

I gave this talk on Oct 2 at the Semantic Technology and Business conference. In it I discuss how I process Freebase data with the open source Infovore framework, which processes Freebase and other RDF data quickly by using Hadoop, Map/Reduce, and Amazon Web Services.

TRANSCRIPT

Exploring Our World With Freebase

Paul Houle

paul@ontology2.com

Google Knowledge Graph

MQL

{
  "status": "200 OK",
  "code": "/api/status/ok",
  "result": {
    "type": "/music/artist",
    "name": "The Police",
    "album": [
      "Outlandos d'Amour",
      "Reggatta de Blanc",
      "Zenyatta Mondatta",
      "Ghost in the Machine",
      "Synchronicity"
    ]
  }
}

My path to the semantic web

Spring 2012

Fall 2012

Quad Dump

Official RDF Dump

Infovore 1.0 released as open source under Apache License

13+ million Invalid Facts

Image cc-by from arj03

Infovore 1.0

Quad Dump -> RDF

Infovore 1.1

General RDF Cleanup & Filtering

Millipede framework – Map/Reduce on a single computer

Infovore 2

What does Freebase cover?

Is it a bibliographic database?

Ahead of their time?

Reading Room, Library of Congress

MARC… in electronic form since 1969!

First standard data format with variable length fields & I18N.

Now everybody has a bibliographic database…

Or, do documents annotate the world?

Social Semantic Systems

Linked Data User-Generated Content

The dominant paradigm

Triple store

How to break your triple store

http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/

The RDF data warehouse

ETL

warehouse

operations

development

science

The RDF data warehouse II

warehouse

Operations tools

Science Tools

Latency: low is not low enough

operations

development

science

Tools Popular With :BaseKB Users

(Bar chart, horizontal axis 0 to 60. Tools listed: Freebase, DBpedia, any relational database, machine learning, Jena, Amazon Web Services, PHP, map/reduce frameworks (ex. Hadoop), MongoDB, Sesame, Virtuoso OpenLink, other NoSQL database, Solid State Drives (SSD), other cloud computing service, Neo4J, Ruby, Drupal, alternative JVM languages (ex. Scala or Clojure), other triple store, any key/value store (ex. JDBM or Berkeley DB), OWLIM, Allegrograph, 4store, Factual, dotNetRDF, Stardog, Kasabi/Talis Platform, Oracle Spatial RDF.)

Map/Reduce

Inputs

Mappers

Shuffle

Sort

Reducers

Output

RDF: Reduction on Subject

:Goat :Bear :Alligator :Iguana :Dog :Elephant :Cat :Horse :Fox

:Alligator :Dog :Goat

:Bear :Elephant :Horse

:Cat :Fox :Iguana
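
The reduction on subject pictured above can be sketched in a few lines of Python. This is a single-process illustration, not Infovore or Hadoop code: the function names, the CRC32 partitioner, and the `:eats` predicate are assumptions made up for the example. It routes every triple to a reducer chosen by its subject, then sorts the subjects inside each reducer.

```python
from collections import defaultdict
from zlib import crc32


def partition(subject, n_reducers):
    # Stable hash, so the same subject always reaches the same reducer.
    return crc32(subject.encode("utf-8")) % n_reducers


def reduce_on_subject(triples, n_reducers=3):
    # Shuffle: route each triple to the reducer chosen by its subject.
    reducers = [defaultdict(list) for _ in range(n_reducers)]
    for s, p, o in triples:
        reducers[partition(s, n_reducers)][s].append((p, o))
    # Sort: each reducer sees its subjects in sorted order.
    return [sorted(r.items()) for r in reducers]


# Illustrative triples; the predicates are hypothetical.
triples = [
    (":Goat", ":eats", ":Grass"),
    (":Bear", ":eats", ":Fish"),
    (":Goat", ":a", ":Mammal"),
    (":Cat", ":a", ":Mammal"),
]
grouped = reduce_on_subject(triples)
```

Because all facts about one subject land in the same reducer, that reducer can load them into a small in-memory model and work on one topic at a time.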

Jena Framework

SDB

Relational db-based Triple store

TDB

Native disk-based triple store

Model

In-memory triple store

“We use Jena Models like PHP programmers use hashtables”

-- Kendall Clark, Clark and Parsia

Hadoop Physical Architecture

Namenode

Jobtracker

Datanodes & Tasktrackers

HDFS

My development cluster – Namenode/JobTracker

Hadoop tolerates hardware failures

My other computer is

“It’s harder to make up names for things than to invent them”

- Tom Swift, Fictional American Inventor

Infovore modules

bakemono

haruhi

centipede

chopper

Bakemono Super JAR

Contains applications like

freebaseRDFPrefilter pse3 ranSample sieve3

Named after Japanese word for “monsters”

“Haruhi”

(1) Japanese religious word for “Full of Spirit” ; (2) a very dominant person

Unpacking the Freebase RDF Dump

photograph Copyright 2010 Ian Munroe CC-BY SA

Inputs

Mappers

freebaseRDFPrefilter removes…

Wasteful Facts
• 120M+ copies of the “a” predicate
• 60M+ access control predicates

Violent and Dangerous facts

ns:common.topic ns:type.type.instance ?o .

This pattern is repeated 30M times, and if you group on ?s and keep each group in memory…

… uneven bin distribution …

(Figure: bins numbered 330 through 335 …)
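
The skew problem and its fix can be illustrated with a small sketch. The data here is synthetic (the mids are generated, not real Freebase identifiers), and treating `ns:type.type.instance` as the only predicate to drop is an assumption for the example; the real freebaseRDFPrefilter removes more than this. The sketch measures the largest subject group before and after filtering.

```python
from collections import Counter

# Predicate named on the slide; the set itself is an illustrative assumption.
VIOLENT_PREDICATES = {"ns:type.type.instance"}


def prefilter(triples, drop=VIOLENT_PREDICATES):
    # Drop triples whose predicate is known to cause skew.
    return [t for t in triples if t[1] not in drop]


def largest_group(triples):
    # Size of the biggest subject group, i.e. the worst-case reducer load.
    counts = Counter(s for s, _, _ in triples)
    return max(counts.values()) if counts else 0


# Synthetic data: one subject with 1000 facts, and 1000 subjects with one fact each.
triples = [("ns:common.topic", "ns:type.type.instance", f"ns:m.{i:04d}") for i in range(1000)]
triples += [(f"ns:m.{i:04d}", "rdf:type", "ns:common.topic") for i in range(1000)]

before = largest_group(triples)
after = largest_group(prefilter(triples))
```

In this toy data a single subject owns half of all facts before filtering, so one reducer would do half the work; after filtering every group has one fact.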

Parallel Super Eyeball

“triples”

valid triples junk

Currently, 250,000 or so triples in Freebase are rejected by PSE3
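
The split into valid triples and junk can be sketched as follows. The talk does not show how PSE3 decides validity, so the regular expression below is only a stand-in and is far looser than a real RDF parser; the sample lines and the `example.org` predicates are made up for illustration.

```python
import re

# Simplified patterns: an assumption for this sketch, not a full N-Triples grammar.
IRI = r'<[^<>"\s]+>'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?'
TRIPLE = re.compile(rf"^\s*{IRI}\s+{IRI}\s+(?:{IRI}|{LITERAL})\s*\.\s*$")


def split_valid(lines):
    # Accepted lines go to one output, rejected lines are kept for inspection.
    valid, junk = [], []
    for line in lines:
        (valid if TRIPLE.match(line) else junk).append(line)
    return valid, junk


lines = [
    '<http://rdf.basekb.com/ns/m.010bs8> <http://www.w3.org/2000/01/rdf-schema#label> "El Campo"@en .',
    '<http://rdf.basekb.com/ns/m.010bs8> <http://example.org/broken> "unterminated literal .',
    '<http://rdf.basekb.com/ns/m.010bs8> <http://example.org/has space> <http://example.org/o> .',
]
valid, junk = split_valid(lines)
```

Keeping the rejected lines rather than discarding them is what makes it possible to count and inspect them later.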

Parallel Super Eyeball 3

Horizontal Decomposition of Freebase

(Three pie charts, combined here into one table.)

Subset        % of gz compressed size   % of facts   % of uncompressed size
a                        5                  16                15
description             18                   1                 7
key                     11                   9                 8
keyNs                   13                  11                 9
label                    6                   6                 4
name                     6                   6                 4
notability               1                   2                 2
nfp                      0                   2                 1
text                     8                   1                 3
web                      6                   5                 6
links                   20                  32                30
other                    7                  11                11
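
A horizontal split like this amounts to routing each triple to a named subset by looking at its predicate. The sketch below shows the idea only: the subset names come from the charts, but every matching rule is a guess made for illustration and should not be read as the rules Infovore actually applies. Subsets whose meaning is not explained in the talk (nfp, text, web) are left out and fall through to "other".

```python
def partition_for(s, p, o):
    # First matching rule wins. All rules are illustrative assumptions.
    if p in ("rdf:type", "a"):
        return "a"
    if p == "ns:common.topic.description":
        return "description"
    if p == "ns:type.object.key":
        return "key"
    if p.startswith("key:"):
        return "keyNs"
    if p == "rdfs:label":
        return "label"
    if p == "ns:type.object.name":
        return "name"
    if p.startswith("ns:common.topic.notable"):
        return "notability"
    if o.startswith("ns:"):
        return "links"  # object is another node rather than a literal
    return "other"


def decompose(triples):
    # Collect triples into one list per subset.
    parts = {}
    for t in triples:
        parts.setdefault(partition_for(*t), []).append(t)
    return parts


triples = [
    ("ns:m.02qvftw", "rdf:type", "ns:business.employer"),
    ("ns:m.010005", "key:wikipedia.pt", '"Corinth_$0028Texas$0029"'),
    ("ns:m.0100007", "ns:common.topic.notable_types", "ns:m.0kpv11"),
]
parts = decompose(triples)
```

A job that only needs types can then read the "a" subset and skip the rest of the dump.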

rdf:type aka “a”

16% of facts, 15% of bytes, 5% of compressed bytes

ns:m.02qvftw rdf:type ns:business.employer .

RDFS Inference

:a :Actor ?

RDFS Inference

Jesse Plemons

Todd

:a :Actor .

Jesse Plemons

Todd

implies

Descriptions

1%

facts

18%

bytes

7%

compressed

Descriptions

ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt .

ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .

This does not compute!

Labels and Names

ns:american_football.football_division rdfs:label "American football division"@en .

ns:american_football.football_conference rdfs:label "Grupper inom amerikansk fotboll"@sv .

ns:american_football.football_player ns:type.object.name "Football-Spieler"@de .

ns:american_football.football_team ns:type.object.name "American football-team"@nl .

Freebase Labels Are Not Unique

DBpedia Labels Are Unique

https://github.com/paulhoule/infovore/wiki

https://groups.google.com/forum/#!forum/infovore-basekb

Keys in the Freebase dump

• Most objects represented by mid identifiers
• Schema objects have friendly identifiers

Examples…

ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .

ns:american_football.football_division rdfs:label "American football division"@en .

Freebase always uses the same key in the ?s, ?p, and ?o fields, but...

It wasn’t always this way

… the old quad dump used mids in the subject field, but others in the destination field …

Turtle0

Turtle1

Turtle2

Turtle3

Extract namespace graph

Convert all identifiers to mids

Extract type information from schema

Convert to RDF types

:BaseKB 2012

Freebase Knows Many Keys

ns:g.11vk55hmr ns:type.object.key "/base/dspl/us_census/population/place" .
ns:m.010004m ns:type.object.key "/authority/musicbrainz/339a2897-9ba4-4820-a2a8-f234c22608a4" .
ns:Lm.01003_ ns:type.object.key "/wikipedia/de/Krum_$0028Texas$0029" .
ns:m.01010d ns:type.object.key "/wikipedia/en_id/135860" .
ns:m.0100_b ns:type.object.key "/authority/gnis/1352653" .
ns:m.0100l2 ns:type.object.key "/authority/hud/countyplace/4814101390" .
ns:m.01031l ns:type.object.key "/en/chandler_texas" .
ns:m.015g9m ns:type.object.key "/en/aliens_from_space" .
ns:m.015gdl ns:type.object.key "/en/self-publishing" .
ns:m.015gjr ns:type.object.key "/authority/nndb/231$002F000085973" .

… and type.object.key spells them out …

A directed acyclic graph

/m/01 root

/m/019s wikipedia

/m/047w32v authority

/m/0gt9 en

/m/05x_rjr Geoff_Simmons

/wikipedia/en/Geoff_Simmons = /authority/wikipedia/en/Geoff_Simmons

key: namespace encodes the graph

ns:m.010005 key:wikipedia.pt "Corinth_$0028Texas$0029" .
ns:m.010005h key:authority.musicbrainz "ab0b82ce-d1be-4641-b0d1-838896a25887" .
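
The keys above can be unpacked with a short sketch. It assumes that a `$` followed by four hex digits stands for the character at that code point (so `$0028` is an opening parenthesis), and that the `key:` predicate is the namespace path with slashes turned into dots, as the two examples suggest. The function names are made up for this illustration.

```python
import re


def decode_key(key):
    # Replace each $XXXX escape with the character at that hex code point.
    return re.sub(r"\$([0-9A-Fa-f]{4})", lambda m: chr(int(m.group(1), 16)), key)


def key_predicate(namespace_path):
    # "/wikipedia/pt" -> "key:wikipedia.pt", following the pattern on the slide.
    return "key:" + namespace_path.strip("/").replace("/", ".")


def split_object_key(path):
    # Split a type.object.key path into (key predicate, decoded final segment).
    namespace, _, leaf = path.rpartition("/")
    return key_predicate(namespace), decode_key(leaf)


decoded = decode_key("Corinth_$0028Texas$0029")
pair = split_object_key("/wikipedia/de/Krum_$0028Texas$0029")
```

With this split, a lookup by external identifier becomes a match on a single predicate and a single literal.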

Useful external keys

Music

http://www.freebase.com/authority/musicbrainz/e217a1e9-9ec8-4e88-aebc-7d6b720384c1

Musical Composition

Recording

“Recording appears on Album as track #”

Functional Requirements For Bibliographic Records (FRBR)

Nick Hexum Rap Rock

311

Omaha, NE Los Angeles, CA

Unique data in DBpedia

Wikipedia Categories

Wikipedia Page Links

“Smushing”

dbpedia:Striated_Heron :linksTo dbpedia:Heron .
dbpedia:Striated_Heron owl:sameAs ns:m.01v7dp .
dbpedia:Heron owl:sameAs ns:m.01jgnh .

ns:m.01v7dp :linksTo ns:m.01jgnh .
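
Smushing as shown in the heron example can be sketched like this. It is a minimal in-memory version, assuming each DBpedia node has at most one `owl:sameAs` link to a Freebase mid; the function name is made up for the illustration.

```python
def smush(triples, same_as="owl:sameAs"):
    # Map each aliased node to its canonical (Freebase) node.
    canonical = {s: o for s, p, o in triples if p == same_as}
    out = []
    for s, p, o in triples:
        if p == same_as:
            continue  # the sameAs links themselves are consumed
        out.append((canonical.get(s, s), p, canonical.get(o, o)))
    return out


triples = [
    ("dbpedia:Striated_Heron", ":linksTo", "dbpedia:Heron"),
    ("dbpedia:Striated_Heron", "owl:sameAs", "ns:m.01v7dp"),
    ("dbpedia:Heron", "owl:sameAs", "ns:m.01jgnh"),
]
smushed = smush(triples)
```

At full scale the `canonical` map would not fit in memory, which is why this kind of join is a natural fit for Map/Reduce.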

Duck Types

• ?a performed on music track ?b
  - ?a is a musician
• ?a employed ?b
  - ?a is an employer
• Book ?a was written about ?b
  - ?b is a book subject
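
Duck typing of this kind can be written as a table of rules keyed by predicate. In the sketch below the predicate names, type names, and sample nodes are all hypothetical placeholders; only the three rules themselves come from the slides.

```python
# predicate -> (type implied for the subject, type implied for the object)
# None means the rule says nothing about that position.
DUCK_RULES = {
    ":performed_on": (":Musician", None),
    ":employed": (":Employer", None),
    ":written_about": (None, ":BookSubject"),
}


def infer_types(triples, rules=DUCK_RULES):
    # Emit an rdf:type fact for every node that "quacks" like a given type.
    inferred = set()
    for s, p, o in triples:
        subject_type, object_type = rules.get(p, (None, None))
        if subject_type:
            inferred.add((s, "rdf:type", subject_type))
        if object_type:
            inferred.add((o, "rdf:type", object_type))
    return inferred


# Placeholder nodes, not real Freebase topics.
triples = [
    (":artist1", ":performed_on", ":track1"),
    (":company1", ":employed", ":person1"),
    (":book1", ":written_about", ":topic1"),
]
types = infer_types(triples)
```

The first two rules type the subject and the third types the object, which mirrors the difference between a domain and a range declaration in RDFS.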

The Problem of Notability

ns:m.0100007 ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000_r ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000dh ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.01000pp ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.01000px ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000w ns:common.topic.notable_types ns:m.01m9 .
ns:m.01000yk ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.010012t ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.010014_ ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.010019c ns:common.topic.notable_types ns:m.09jd9nh .

Analysis with Chopper and Pig

Why APIs suck (Including SPARQL endpoints)

• Provider can afford maximum $/query

• If you need a more complex query you’ve got no option!

Cluster creation made easy

:BaseKB Now

Pig Script – count common types

$ pig
grunt> run chopper/src/main/pig/lib/chopper.pig
grunt> a = LOAD '/freebase/20130915/a/' USING com.ontology2.chopper.io.PrimitiveTripleInput();
grunt> oNodes = FOREACH a GENERATE o;
grunt> groupNodes = GROUP oNodes BY o;
grunt> countedNodes = FOREACH groupNodes GENERATE group AS uri:chararray, COUNT(oNodes) AS cnt:long;
grunt> sortedNodes = ORDER countedNodes BY cnt DESC;
grunt> top100 = LIMIT sortedNodes 100;
grunt> DUMP top100;

Most frequent types

(<http://rdf.basekb.com/ns/common.topic>,39030195)
(<http://rdf.basekb.com/ns/common.notable_for>,18747254)
(<http://rdf.basekb.com/ns/music.release_track>,13304261)
(<http://rdf.basekb.com/ns/music.recording>,8902041)
(<http://rdf.basekb.com/ns/music.single>,6297869)
(<http://rdf.basekb.com/ns/common.document>,5580077)
(<http://rdf.basekb.com/ns/media_common.cataloged_instance>,3030634)
(<http://rdf.basekb.com/ns/book.book_edition>,2771323)
(<http://rdf.basekb.com/ns/people.person>,2742157)
(<http://rdf.basekb.com/ns/type.namespace>,2689781)
(<http://rdf.basekb.com/ns/book.isbn>,2601099)
(<http://rdf.basekb.com/ns/type.content>,2499648)
(<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)
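
For readers who do not use Pig, the same group, count, and sort can be sketched in plain Python on a small sample. This is only an in-memory illustration of what the query computes, not a replacement for running it on a cluster; the function name and the sample subjects are made up.

```python
from collections import Counter


def count_types(type_triples, limit=100):
    # GROUP BY object, COUNT, ORDER BY count DESC, keep the first `limit` rows.
    counts = Counter(o for _, _, o in type_triples)
    return counts.most_common(limit)


# Tiny made-up sample of rdf:type facts.
sample = [
    ("ns:m.1", "rdf:type", "ns:common.topic"),
    ("ns:m.2", "rdf:type", "ns:common.topic"),
    ("ns:m.3", "rdf:type", "ns:common.topic"),
    ("ns:m.1", "rdf:type", "ns:people.person"),
    ("ns:m.2", "rdf:type", "ns:people.person"),
    ("ns:m.3", "rdf:type", "ns:music.recording"),
]
top = count_types(sample, limit=2)
```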

Compound Value Types and our 4D world

The 13th most prevalent type

(<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)

:Las_Vegas

945

1910

:US_Census_Bureau

population

number

date

source

Population   Year
        25   1900
       945   1910
     2,304   1920
     5,165   1930
     8,422   1940
    24,624   1950
    64,405   1960
   125,787   1970
   164,674   1980
   260,561   1990
   284,931   1991
   297,326   1992
   312,634   1993
   336,380   1994
   354,559   1995
   372,849   1996
   391,074   1997
   405,245   1998
   418,658   1999
   484,487   2000
   498,638   2001
   507,219   2002
   516,723   2003
   534,168   2004
   544,806   2005
   552,855   2006
   559,892   2007
   562,849   2008
   567,641   2009
   584,539   2010
   589,317   2011
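
A compound value type such as the dated integer behind these population figures can be modeled as a small record. The class and function names below are made up for the sketch, and attaching :US_Census_Bureau as the source of every row is an assumption; the slide shows that source only for the 1910 figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedInteger:
    # One compound value: a number qualified by when it held and who says so.
    number: int
    year: int
    source: str


def population_in(values, year):
    # Most recent value at or before the requested year, or None if there is none.
    known = [v for v in values if v.year <= year]
    return max(known, key=lambda v: v.year) if known else None


# First three rows of the Las Vegas table; the source is assumed for all rows.
las_vegas = [
    DatedInteger(25, 1900, ":US_Census_Bureau"),
    DatedInteger(945, 1910, ":US_Census_Bureau"),
    DatedInteger(2304, 1920, ":US_Census_Bureau"),
]
estimate = population_in(las_vegas, 1915)
```

The point of the extra node is that "population" is not a single fact about a city but a series of facts, each tied to a date and a source.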

Population of Las Vegas, NV

(Chart: population by year, 1900 to 2011; vertical axis 0 to 700,000.)

Vertical Divisions of Freebase

• Wikipedia Topics
• Movies and Television
• Travel and Lodging

:BaseKB Lite

Separating Blank Nodes

:BaseKB Now

• Created weekly by an automated process
• Delivered to Amazon S3
• Accepted facts are 100% valid RDF
• Rejected facts collected for inspection
• “Violent” predicates removed to fight skew
• Horizontally divided for fast processing

http://basekb.com/

Infovore Software

http://github.com/paulhoule/infovore/wiki
