Exploring Our World With Freebase Paul Houle [email protected]

Upload: paul-houle

Post on 20-Jan-2015


DESCRIPTION

I gave this talk on Oct 2 at the Semantic Technology and Business conference. In it I discuss how I process Freebase data with the open source Infovore framework, which handles Freebase and other RDF data quickly by using Hadoop, Map/Reduce, and Amazon Web Services.

TRANSCRIPT

Page 1: Exploring our world with freebase

Exploring Our World With Freebase

Paul Houle [email protected]

Page 5: Exploring our world with freebase

Google Knowledge Graph

Page 8: Exploring our world with freebase

MQL

{
  "status": "200 OK",
  "code": "/api/status/ok",
  "result": {
    "type": "/music/artist",
    "name": "The Police",
    "album": [
      "Outlandos d'Amour",
      "Reggatta de Blanc",
      "Zenyatta Mondatta",
      "Ghost in the Machine",
      "Synchronicity"
    ]
  }
}
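A minimal sketch, in Python, of how a client might read the response above. The query template is not on the slide and is an assumption: MQL queries are JSON templates, and an empty list asks the service to fill in every value. No network call is made; the response text is copied from the slide.

```python
import json

# Assumed query template (not shown on the slide): the empty "album" list
# is the blank that the service fills in.
query = {"type": "/music/artist", "name": "The Police", "album": []}

# The response envelope as it appears on the slide.
response_text = """
{ "status": "200 OK", "code": "/api/status/ok",
  "result": { "type": "/music/artist", "name": "The Police",
              "album": [ "Outlandos d'Amour", "Reggatta de Blanc",
                         "Zenyatta Mondatta", "Ghost in the Machine",
                         "Synchronicity" ] } }
"""

def extract_albums(text):
    """Return the album list from an MQL response, or raise on an error code."""
    envelope = json.loads(text)
    if envelope.get("code") != "/api/status/ok":
        raise ValueError("MQL request failed: %s" % envelope.get("code"))
    return envelope["result"]["album"]

albums = extract_albums(response_text)
print(albums)
```

The result mirrors the shape of the query, which is why the reply can be read with the same keys that were sent.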

Page 9: Exploring our world with freebase

My path to the semantic web

Page 10: Exploring our world with freebase

My path to the semantic web

Page 11: Exploring our world with freebase

My path to the semantic web

Page 13: Exploring our world with freebase

Spring 2012

Page 14: Exploring our world with freebase

Fall 2012

Quad Dump

Official RDF Dump

Infovore 1.0 released as open source under Apache License

Page 15: Exploring our world with freebase

13+ million Invalid Facts

Image cc-by from arj03

Page 16: Exploring our world with freebase

Infovore 1.0

Quad Dump -> RDF

Infovore 1.1

General RDF Cleanup & Filtering

Millipede framework – Map/Reduce on a single computer

Page 17: Exploring our world with freebase

Infovore 2

Page 18: Exploring our world with freebase

What does Freebase cover?

Page 19: Exploring our world with freebase

Is it a bibliographic database?

Page 20: Exploring our world with freebase

Ahead of their time?

Reading Room, Library of Congress

Page 21: Exploring our world with freebase
Page 22: Exploring our world with freebase
Page 23: Exploring our world with freebase
Page 24: Exploring our world with freebase

MARC… in electronic form since 1969!

First standard data format with variable length fields & I18N.

Page 25: Exploring our world with freebase
Page 26: Exploring our world with freebase

Now everybody has a bibliographic database…

Page 27: Exploring our world with freebase

Or, do documents annotate the world?

Page 28: Exploring our world with freebase

Social Semantic Systems

Linked Data

User-Generated Content

Page 29: Exploring our world with freebase
Page 30: Exploring our world with freebase
Page 31: Exploring our world with freebase
Page 32: Exploring our world with freebase
Page 33: Exploring our world with freebase

The dominant paradigm

Triple store

Page 34: Exploring our world with freebase

How to break your triple store

http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/

Page 35: Exploring our world with freebase

The RDF data warehouse

ETL

warehouse

operations

development

science

Page 36: Exploring our world with freebase

The RDF data warehouse II

warehouse

Operations tools

Science Tools

Page 37: Exploring our world with freebase

Latency: low is not low enough

Page 38: Exploring our world with freebase

operations

development

science

Page 39: Exploring our world with freebase

Tools Popular With :BaseKB Users

(Bar chart; the horizontal axis runs from 0 to 60. Categories shown:)

Freebase
DBpedia
any relational database
machine learning
Jena
Amazon Web Services
PHP
map/reduce frameworks (ex. Hadoop)
MongoDB
Sesame
Virtuoso OpenLink
other NoSQL database
Solid State Drives (SSD)
other cloud computing service
Neo4J
Ruby
Drupal
alternative JVM languages (ex. Scala or Clojure)
other triple store
any key/value store (ex. JDBM or Berkeley DB)
OWLIM
Allegrograph
4store
Factual
dotNetRDF
Stardog
Kasabi/Talis Platform
Oracle Spatial RDF

Page 40: Exploring our world with freebase
Page 41: Exploring our world with freebase

Map/Reduce

Inputs

Mappers

Shuffle

Sort

Reducers

Output

Page 42: Exploring our world with freebase

RDF: Reduction on Subject

:Goat :Bear :Alligator :Iguana :Dog :Elephant :Cat :Horse :Fox

:Alligator :Dog :Goat

:Bear :Elephant :Horse

:Cat :Fox :Iguana
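The idea on this slide can be sketched in a few lines of Python: key every triple on its subject, route it to a reducer, and let each reducer see its subjects in sorted order with all facts about one subject together. This is only an in-memory illustration, not Hadoop code. The hash-based routing, the predicate names, and the objects are assumptions made for the example.

```python
from collections import defaultdict
import zlib

def partition(subject, n_reducers):
    # Stable hash, so the same subject always lands on the same reducer.
    return zlib.crc32(subject.encode("utf-8")) % n_reducers

def map_reduce_by_subject(triples, n_reducers=3):
    """Group (s, p, o) triples by subject, the way a shuffle/sort would."""
    # Map + shuffle: key every triple on its subject and route it by hash.
    bins = [defaultdict(list) for _ in range(n_reducers)]
    for s, p, o in triples:
        bins[partition(s, n_reducers)][s].append((p, o))
    # Sort + reduce: each reducer gets its subjects in sorted order,
    # with every fact about one subject delivered together.
    return [sorted(b.items()) for b in bins]

# Hypothetical facts about the animals named on the slide.
triples = [
    (":Goat", ":eats", ":Grass"),
    (":Bear", ":eats", ":Fish"),
    (":Goat", ":a", ":Mammal"),
    (":Alligator", ":a", ":Reptile"),
]
for reducer_id, groups in enumerate(map_reduce_by_subject(triples)):
    for subject, facts in groups:
        print(reducer_id, subject, facts)
```

Because all facts about a subject arrive at one reducer, the reducer can build a small in-memory graph per subject and discard it before moving on.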

Page 43: Exploring our world with freebase

Jena Framework

SDB

Relational db-based Triple store

TDB

Native disk-based triple store

Model

In-memory triple store

“We use Jena Models like PHP programmers use hashtables”

-- Kendall Clark, Clark and Parsia

Page 44: Exploring our world with freebase

Hadoop Physical Architecture

Namenode

Jobtracker

Datanodes & Tasktrackers

HDFS

Page 45: Exploring our world with freebase

My development cluster – Namenode/JobTracker

Page 46: Exploring our world with freebase
Page 47: Exploring our world with freebase
Page 48: Exploring our world with freebase

Hadoop tolerates hardware failures

Page 49: Exploring our world with freebase

My other computer is

Page 51: Exploring our world with freebase

“It’s harder to make up names for things than to invent them”

- Tom Swift, Fictional American Inventor

Page 52: Exploring our world with freebase

Infovore modules

bakemono

haruhi

centipede

chopper

Page 53: Exploring our world with freebase

Bakemono Super JAR

Page 54: Exploring our world with freebase

Bakemono Super JAR

Contains applications like

freebaseRDFPrefilter pse3 ranSample sieve3

Named after Japanese word for “monsters”

Page 55: Exploring our world with freebase

“Haruhi”

(1) Japanese religious word for “Full of Spirit” ; (2) a very dominant person

Page 56: Exploring our world with freebase

Unpacking the Freebase RDF Dump

photograph Copyright 2010 Ian Munroe CC-BY SA

Page 59: Exploring our world with freebase

Inputs

Mappers

Page 60: Exploring our world with freebase

freebaseRDFPrefilter removes…

Wasteful Facts

• 120M+ copies of the “a” predicate
• 60M+ access control predicates

Violent and Dangerous facts

ns:common.topic ns:type.type.instance ?o .

Is repeated 30M times, and if you group on ?s and keep them in memory…
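A minimal Python sketch of the kind of filtering described here, assuming one whitespace-separated triple per line. It is not the actual freebaseRDFPrefilter code. The only blocked predicate listed is the one named on the slide, and the second sample fact is made up for illustration.

```python
# Predicate taken from the slide; a real blocklist would hold more entries.
VIOLENT_PREDICATES = {"ns:type.type.instance"}

def prefilter(lines, blocked=VIOLENT_PREDICATES):
    """Yield only the triple lines whose predicate is not blocked."""
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest of line
        if len(parts) == 3 and parts[1] in blocked:
            continue  # dropped in the mapper, so no reducer ever sees it
        yield line

sample = [
    "ns:common.topic ns:type.type.instance ns:m.010bs8 .",
    "ns:m.010bs8 rdf:type ns:common.topic .",  # illustrative fact
]
kept = list(prefilter(sample))
print(kept)
```

Dropping these facts before the shuffle matters because a single subject with tens of millions of facts would otherwise land on one reducer.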

Page 61: Exploring our world with freebase

… uneven bin distribution …

… 330 331 332 333 334 335 …

Page 63: Exploring our world with freebase

Parallel Super Eyeball

“triples”

valid triples junk

Currently, 250,000 or so triples in Freebase are rejected by PSE3
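The accept/reject split can be sketched as follows. This is not how PSE3 itself validates triples; the regular expression is a deliberately loose stand-in for a real RDF parser, and the sample lines are shortened or invented for illustration.

```python
import re

# Loose stand-in for a real parser: a term is an <IRI> or a prefixed name,
# and the object may also be a quoted literal with a language tag or datatype.
TERM = r'(?:<[^<>\s]+>|[A-Za-z][\w.-]*:[^\s"]+)'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+|\^\^' + TERM + r')?'
TRIPLE = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+(?:' + TERM + r'|' + LITERAL + r')\s*\.\s*$'
)

def split_valid(lines):
    """Sort lines into (accepted, rejected) without stopping on bad input."""
    accepted, rejected = [], []
    for line in lines:
        (accepted if TRIPLE.match(line) else rejected).append(line)
    return accepted, rejected

sample = [
    'ns:m.02qvftw rdf:type ns:business.employer .',
    'ns:m.010bs8 ns:common.topic.description "El Campo is a city."@en .',
    'ns:m.010bs8 ns:common.topic.description "unterminated literal@en .',
    'ns:m.010bs8 ns:common.topic.description .',
]
accepted, rejected = split_valid(sample)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Keeping the rejected lines in their own output, rather than failing the job, is what lets the junk be inspected later.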

Page 64: Exploring our world with freebase

Parallel Super Eyeball 3

Page 66: Exploring our world with freebase

Horizontal Decomposition of Freebase

Page 67: Exploring our world with freebase

Percentage of gz compressed size:

a 5%
description 18%
key 11%
keyNs 13%
label 6%
name 6%
notability 1%
nfp 0%
text 8%
web 6%
links 20%
other 7%

Page 68: Exploring our world with freebase

Percentage of facts:

a 16%
description 1%
key 9%
keyNs 11%
label 6%
name 6%
notability 2%
nfp 2%
text 1%
web 5%
links 32%
other 11%

Page 69: Exploring our world with freebase

Percentage of uncompressed size:

a 15%
description 7%
key 8%
keyNs 9%
label 4%
name 4%
notability 2%
nfp 1%
text 3%
web 6%
links 30%
other 11%

Page 70: Exploring our world with freebase

rdf:type aka “a”

16% of facts, 15% of bytes, 5% of compressed bytes

ns:m.02qvftw rdf:type ns:business.employer .

Page 71: Exploring our world with freebase

RDFS Inference

:a :Actor ?

Page 72: Exploring our world with freebase

RDFS Inference

Jesse Plemons

Todd

Page 73: Exploring our world with freebase

:a :Actor .

Jesse Plemons

Todd

implies
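The inference on these slides is the rdfs:domain rule: if a predicate's domain is a class, anything that appears as the subject of that predicate is an instance of the class. A small Python sketch follows; the predicate `:playedRole` and the class `:Actor` are hypothetical names chosen for the example, and only the people come from the slide.

```python
RDF_TYPE = "rdf:type"

def infer_types(triples, domains):
    """Apply the rdfs:domain rule: (s p o) and (p rdfs:domain C) => (s rdf:type C)."""
    inferred = set()
    for s, p, o in triples:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))
    # Report only facts that were not already asserted.
    return inferred - set(triples)

# Hypothetical predicate and class names.
domains = {":playedRole": ":Actor"}
facts = [(":Jesse_Plemons", ":playedRole", ":Todd")]
print(infer_types(facts, domains))
```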

Page 74: Exploring our world with freebase

Descriptions

1% of facts, 7% of bytes, 18% of compressed bytes

Page 75: Exploring our world with freebase

Descriptions

ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt .

ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .

Page 76: Exploring our world with freebase

Descriptions

ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt .

ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .

This does not compute!

Page 77: Exploring our world with freebase


Page 78: Exploring our world with freebase

Labels and Names

ns:american_football.football_division rdfs:label "American football division"@en .

ns:american_football.football_conference rdfs:label "Grupper inom amerikansk fotboll"@sv .

ns:american_football.football_player ns:type.object.name "Football-Spieler"@de .

ns:american_football.football_team ns:type.object.name "American football-team"@nl .

Page 79: Exploring our world with freebase

Freebase Labels Are Not Unique

Page 80: Exploring our world with freebase

DBpedia Labels Are Unique

Page 81: Exploring our world with freebase

https://github.com/paulhoule/infovore/wiki

https://groups.google.com/forum/#!forum/infovore-basekb

Page 82: Exploring our world with freebase

Keys in the Freebase dump

• Most objects represented by mid identifiers

Page 83: Exploring our world with freebase

Keys in the Freebase dump

• Schema objects have friendly identifiers

Page 84: Exploring our world with freebase

Keys in the Freebase dump

Page 85: Exploring our world with freebase

Examples…

ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .

ns:american_football.football_division rdfs:label "American football division"@en .

Freebase always uses the same key in the ?s, ?p, and ?o fields, but...

Page 86: Exploring our world with freebase

It wasn’t always this way

… the old quad dump used mids in the subject field, but others in the destination field …

Page 87: Exploring our world with freebase

Turtle0

Turtle1

Turtle2

Turtle3

Extract namespace graph

Convert all identifiers to mids

Extract type information from schema

Convert to RDF types

:BaseKB 2012

Page 88: Exploring our world with freebase

Freebase Knows Many Keys

ns:g.11vk55hmr ns:type.object.key "/base/dspl/us_census/population/place" .
ns:m.010004m ns:type.object.key "/authority/musicbrainz/339a2897-9ba4-4820-a2a8-f234c22608a4" .
ns:m.01003_ ns:type.object.key "/wikipedia/de/Krum_$0028Texas$0029" .
ns:m.01010d ns:type.object.key "/wikipedia/en_id/135860" .
ns:m.0100_b ns:type.object.key "/authority/gnis/1352653" .
ns:m.0100l2 ns:type.object.key "/authority/hud/countyplace/4814101390" .
ns:m.01031l ns:type.object.key "/en/chandler_texas" .
ns:m.015g9m ns:type.object.key "/en/aliens_from_space" .
ns:m.015gdl ns:type.object.key "/en/self-publishing" .
ns:m.015gjr ns:type.object.key "/authority/nndb/231$002F000085973" .

… and type.object.key spells them out …
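Keys such as `Krum_$0028Texas$0029` escape characters that cannot appear in a key as `$` followed by four hexadecimal digits of the code point. A short Python sketch of decoding them, using two keys from the slide as input (the function name is hypothetical):

```python
import re

# "$" followed by exactly four hex digits marks an escaped character.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")

def decode_key(key):
    """Replace each $XXXX escape with the character at that hex code point."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)

print(decode_key("Krum_$0028Texas$0029"))
print(decode_key("231$002F000085973"))
```

Note that `/` itself is escaped as `$002F` inside a key, so a slash in the full path always separates namespace segments.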

Page 89: Exploring our world with freebase

A directed acyclic graph

/m/01 root

/m/019s wikipedia

/m/047w32v authority

/m/0gt9 en

/m/05x_rjr Geoff_Simmons

/wikipedia/en/Geoff_Simmons = /authority/wikipedia/en/Geoff_Simmons

Page 90: Exploring our world with freebase

key: namespace encodes the graph

ns:m.010005 key:wikipedia.pt "Corinth_$0028Texas$0029" .
ns:m.010005h key:authority.musicbrainz "ab0b82ce-d1be-4641-b0d1-838896a25887" .
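Judging only from the examples on these slides, the `key:` predicate appears to be the namespace part of a `type.object.key` path with slashes turned into dots, and the object is the last path segment. A hedged Python sketch of that mapping; the function name is hypothetical and the rule is inferred from the slide examples, not taken from the Infovore source.

```python
def key_to_predicate(path):
    """Split '/ns1/ns2/leaf' into ('key:ns1.ns2', 'leaf')."""
    # The leaf never contains a raw "/" (it would be escaped as $002F),
    # so splitting on the last slash is safe.
    namespace, _, leaf = path.strip("/").rpartition("/")
    return "key:" + namespace.replace("/", "."), leaf

print(key_to_predicate("/wikipedia/de/Krum_$0028Texas$0029"))
print(key_to_predicate("/en/chandler_texas"))
```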

Page 91: Exploring our world with freebase

Useful external keys

Page 92: Exploring our world with freebase

Music

Page 93: Exploring our world with freebase
Page 94: Exploring our world with freebase

http://www.freebase.com/authority/musicbrainz/e217a1e9-9ec8-4e88-aebc-7d6b720384c1

Page 95: Exploring our world with freebase

Musical Composition

Recording

“Recording appears on Album as track #”

Page 96: Exploring our world with freebase

Functional Requirements For Bibliographic Records (FRBR)

Page 97: Exploring our world with freebase

Nick Hexum Rap Rock

311

Omaha, NE Los Angeles, CA

Page 98: Exploring our world with freebase

Unique data in DBpedia

Page 99: Exploring our world with freebase

Wikipedia Categories

Page 100: Exploring our world with freebase

Wikipedia Page Links

Page 101: Exploring our world with freebase

“Smushing”

dbpedia:Striated_Heron :linksTo dbpedia:Heron .
dbpedia:Striated_Heron owl:sameAs ns:m.01v7dp .
dbpedia:Heron owl:sameAs ns:m.01jgnh .

ns:m.01v7dp :linksTo ns:m.01jgnh .
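Smushing amounts to rewriting every term through an owl:sameAs lookup table. A minimal Python sketch using the herons from the slide; the function name is hypothetical, and a real job would do this as a join rather than with an in-memory dictionary.

```python
def smush(triples, same_as):
    """Rewrite every term that has an owl:sameAs target; leave others alone."""
    def canon(term):
        return same_as.get(term, term)
    return [(canon(s), p, canon(o)) for s, p, o in triples]

# owl:sameAs statements from the slide, as a DBpedia -> Freebase lookup.
same_as = {
    "dbpedia:Striated_Heron": "ns:m.01v7dp",
    "dbpedia:Heron": "ns:m.01jgnh",
}
links = [("dbpedia:Striated_Heron", ":linksTo", "dbpedia:Heron")]
print(smush(links, same_as))
```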

Page 102: Exploring our world with freebase

Duck Types

• ?a performed on music track ?b
  - ?a is a musician

Page 103: Exploring our world with freebase

Duck Types

• ?a employed ?b
  - ?a is an employer

Page 104: Exploring our world with freebase

Duck Types

• Book ?a was written about ?b
  - ?b is a book subject

Page 105: Exploring our world with freebase

The Problem of Notability

Page 106: Exploring our world with freebase

ns:m.0100007 ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000_r ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000dh ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.01000pp ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.01000px ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000w ns:common.topic.notable_types ns:m.01m9 .
ns:m.01000yk ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.010012t ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.010014_ ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.010019c ns:common.topic.notable_types ns:m.09jd9nh .

Page 107: Exploring our world with freebase

Analysis with Chopper and Pig

Page 108: Exploring our world with freebase
Page 109: Exploring our world with freebase

Why APIs suck

(Including SPARQL endpoints)

• Provider can afford maximum $/query

• If you need a more complex query you’ve got no option!

Page 111: Exploring our world with freebase

Cluster creation made easy

:BaseKB Now

Page 112: Exploring our world with freebase

Pig Script – count common types

$ pig
grunt> run chopper/src/main/pig/lib/chopper.pig
grunt> a = LOAD '/freebase/20130915/a/' USING com.ontology2.chopper.io.PrimitiveTripleInput();
grunt> oNodes = FOREACH a GENERATE o;
grunt> groupNodes = GROUP oNodes BY o;
grunt> countedNodes = FOREACH groupNodes GENERATE group AS uri:chararray, COUNT(oNodes) AS cnt:long;
grunt> sortedNodes = ORDER countedNodes BY cnt DESC;
grunt> top100 = LIMIT sortedNodes 100;
grunt> DUMP top100;
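For readers who do not know Pig, the same aggregation (group on the object of each rdf:type fact, count, sort descending) can be sketched in plain Python on a tiny in-memory sample. The subjects are hypothetical mids, and this version does not scale the way the Pig job does.

```python
from collections import Counter

def count_types(triples):
    """Count how often each object appears, most frequent first."""
    return Counter(o for _s, _p, o in triples).most_common()

# Tiny illustrative sample; the subjects are hypothetical mids.
sample = [
    ("ns:m.0001", "rdf:type", "ns:common.topic"),
    ("ns:m.0001", "rdf:type", "ns:people.person"),
    ("ns:m.0002", "rdf:type", "ns:common.topic"),
]
print(count_types(sample))
```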

Page 113: Exploring our world with freebase

Most frequent types

(<http://rdf.basekb.com/ns/common.topic>,39030195)
(<http://rdf.basekb.com/ns/common.notable_for>,18747254)
(<http://rdf.basekb.com/ns/music.release_track>,13304261)
(<http://rdf.basekb.com/ns/music.recording>,8902041)
(<http://rdf.basekb.com/ns/music.single>,6297869)
(<http://rdf.basekb.com/ns/common.document>,5580077)
(<http://rdf.basekb.com/ns/media_common.cataloged_instance>,3030634)
(<http://rdf.basekb.com/ns/book.book_edition>,2771323)
(<http://rdf.basekb.com/ns/people.person>,2742157)
(<http://rdf.basekb.com/ns/type.namespace>,2689781)
(<http://rdf.basekb.com/ns/book.isbn>,2601099)
(<http://rdf.basekb.com/ns/type.content>,2499648)
(<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)

Page 114: Exploring our world with freebase

Compound Value Types

and our 4D world

Page 115: Exploring our world with freebase

The 13th most prevalent type

(<http://rdf.basekb.com/ns/common.topic>,39030195)
(<http://rdf.basekb.com/ns/common.notable_for>,18747254)
(<http://rdf.basekb.com/ns/music.release_track>,13304261)
(<http://rdf.basekb.com/ns/music.recording>,8902041)
(<http://rdf.basekb.com/ns/music.single>,6297869)
(<http://rdf.basekb.com/ns/common.document>,5580077)
(<http://rdf.basekb.com/ns/media_common.cataloged_instance>,3030634)
(<http://rdf.basekb.com/ns/book.book_edition>,2771323)
(<http://rdf.basekb.com/ns/people.person>,2742157)
(<http://rdf.basekb.com/ns/type.namespace>,2689781)
(<http://rdf.basekb.com/ns/book.isbn>,2601099)
(<http://rdf.basekb.com/ns/type.content>,2499648)
(<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)

Page 116: Exploring our world with freebase
Page 117: Exploring our world with freebase

(Diagram: :Las_Vegas has a population node that carries a number (945), a date (1910), and a source (:US_Census_Bureau).)
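A compound value type can be sketched as an intermediate node that ties a number to its date and source. The Python below builds the four triples for one reading from the diagram. The predicate spellings and the blank-node naming are assumptions for the example, not Freebase's actual property names.

```python
import itertools

_ids = itertools.count(1)

def population_fact(place, number, year, source):
    """Return the triples for one dated population reading of a place."""
    node = "_:cvt%d" % next(_ids)  # blank node standing in for the CVT
    return [
        (place, ":population", node),
        (node, ":number", number),
        (node, ":date", year),
        (node, ":source", source),
    ]

triples = population_fact(":Las_Vegas", 945, 1910, ":US_Census_Bureau")
for t in triples:
    print(t)
```

Without the intermediate node there would be no way to tell which population figure belongs to which year.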

Page 118: Exploring our world with freebase

Population of Las Vegas, NV

Year  Population
1900  25
1910  945
1920  2,304
1930  5,165
1940  8,422
1950  24,624
1960  64,405
1970  125,787
1980  164,674
1990  260,561
1991  284,931
1992  297,326
1993  312,634
1994  336,380
1995  354,559
1996  372,849
1997  391,074
1998  405,245
1999  418,658
2000  484,487
2001  498,638
2002  507,219
2003  516,723
2004  534,168
2005  544,806
2006  552,855
2007  559,892
2008  562,849
2009  567,641
2010  584,539
2011  589,317

(Chart of the same series: years 1900 to 2011 on the horizontal axis, population from 0 to 700,000 on the vertical axis.)

Page 119: Exploring our world with freebase

Vertical Divisions of Freebase

Wikipedia Topics

Movies and Television

Travel and Lodging

:BaseKB Lite

Page 120: Exploring our world with freebase

Separating Blank Nodes

Page 121: Exploring our world with freebase

Separating Blank Nodes

Page 122: Exploring our world with freebase

Separating Blank Nodes

Page 123: Exploring our world with freebase

Separating Blank Nodes

Page 124: Exploring our world with freebase

:BaseKB Now

• Created Weekly by automated process
• Delivered to AMZN S3
• Accepted facts are 100% Valid RDF
• Rejected facts collected for inspection
• “Violent” predicates removed to fight skew
• Horizontally divided for fast processing

http://basekb.com/

Page 125: Exploring our world with freebase

Infovore Software

http://github.com/paulhoule/infovore/wiki