Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/
From Information to Knowledge: Harvesting Entities, Relationships, and Temporal Facts from Web Sources
Acknowledgements
Goal: Turn Web into Knowledge Base
comprehensive DB of human knowledge
• everything that Wikipedia knows
• everything machine-readable
• capturing entities, classes, relationships
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
Approach: Harvesting Facts from Web

Politician | Political Party
Angela Merkel | CDU
Karl-Theodor zu Guttenberg | CDU
Christoph Hartmann | FDP
…

Company | CEO
Google | Eric Schmidt
…

Movie | ReportedRevenue
Avatar | $ 2,718,444,933
The Reader | $ 108,709,522
…

PoliticalParty | Spokesperson
CDU | Philipp Wachholz
Die Grünen | Claudia Roth
…

Actor | Award
Christoph Waltz | Oscar
Sandra Bullock | Oscar
Sandra Bullock | Golden Raspberry
…

Politician | Position
Angela Merkel | Chancellor Germany
Karl-Theodor zu Guttenberg | Minister of Defense Germany
Christoph Hartmann | Minister of Economy Saarland
…

Company | AcquiredCompany
Google | YouTube
Yahoo | Overture
Facebook | FriendFeed
Software AG | IDS Scheer
…
YAGO-NAGA, IWP, Cyc, TextRunner, ReadTheWeb, WikiTax2WordNet, SUMO
Knowledge for Intelligence
• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps (e.g. deep QA)
• semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)
FIFA 2010 finalists who played in a Champions League final?
Politicians who are also scientists?
Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure?...
German football coach when Bastian Schweinsteiger was born?
Relationships between Manfred Pinkal, Edsger Dijkstra, Michael Dell, and Renee Zellweger?
Outline
...
Automatic KB Construction
Growing & Maintaining the KB
Temporal Knowledge
What and Why
Wrap-up
What is Knowledge (in a KB)?
• facts / assertions: bornIn (BastianSchweinsteiger, Kolbermoor), hasWon (BastianSchweinsteiger, BronzeFIFAWorldCup2010), playedInFinal (BastianSchweinsteiger, ChampionsLeague2010), …
• taxonomic: instanceOf (BastianSchweinsteiger, footballPlayer), subclassOf (footballPlayer, athlete), …
• lexical / terminology: means ("Big Apple", NewYorkCity), means ("Apple", AppleComputerCorporation), means ("MS", Microsoft), means ("MS", MultipleSclerosis), …
• common-sense properties: apples are green, red, juicy, sweet, sour … - but not fast, smart …; balls are round, smooth, slippery … - but not square, funny …
• common-sense axioms:
  ∀x: human(x) ⇒ male(x) ∨ female(x)
  ∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))
  ∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x)))
  …
• procedural: how to fix/install/prepare/remove …
• epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …
Tapping on Wikipedia Categories
http://www.mpi-inf.mpg.de/yago-naga/
KB's: Example YAGO (Suchanek et al.: WWW'07)
[Figure: excerpt of the YAGO graph. Classes: Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, State, Country, Organization, linked by subclass edges. Entities: Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Max_Planck Society, linked by instanceOf, bornOn, diedOn, bornIn, fatherOf, hasWon, citizenOf, and locatedIn edges. Names map to entities via means, e.g. "Max Planck" means Max_Planck (0.9) or Max_Planck Society (0.1).]
Accuracy 95%
2 million entities, 200 000 classes, 40 million RDF triples (facts) (entity1-relation-entity2, subject-predicate-object)
KB‘s: Example DBpedia (Auer, Bizer, et al.: ISWC‘07)
• 3 million entities
• 1 billion facts (RDF triples)
• 1.5 million entities mapped to hand-crafted taxonomy of 259 classes with 1200 properties
http://www.dbpedia.org
Outline
...
Automatic KB Construction
Growing & Maintaining the KB
Temporal Knowledge
What and Why
Wrap-up
French Marriage Problem
facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)
new facts or fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning
Pattern-Based Harvesting
Facts: (Hillary, Bill), (Carla, Nicolas)
Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
…
Facts & Fact Candidates: (Angelina, Brad), (Hillary, Bill), (Victoria, David), (Carla, Nicolas), (Yoko, John), (Carla, Benjamin), (Larry, Google), (Kate, Pete)
• good for recall
• noisy, drifting
• not robust enough for high precision
(Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
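The loop from seed facts to patterns to new fact candidates can be sketched as follows. This is a minimal illustration, assuming a toy corpus and plain string infixes as patterns; the sentences and helper names are hypothetical and do not come from the cited systems.

```python
import re

# Toy corpus (hypothetical sentences, for illustration only).
corpus = [
    "Hillary and her husband Bill attended the gala.",
    "Carla and her husband Nicolas visited Berlin.",
    "Angelina and her husband Brad arrived late.",
    "Larry loves Google.",
    "Kate loves Pete.",
    "Carla loves Nicolas.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

def induce_patterns(corpus, seeds):
    """Collect the text between the two entities of every seed occurrence."""
    patterns = set()
    for sentence in corpus:
        for x, y in seeds:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Match 'X <infix> Y' for every learned infix to get fact candidates."""
    candidates = set()
    for sentence in corpus:
        for infix in patterns:
            m = re.search(r"(\w+)" + re.escape(infix) + r"(\w+)", sentence)
            if m:
                candidates.add((m.group(1), m.group(2)))
    return candidates

patterns = induce_patterns(corpus, seeds)
candidates = apply_patterns(corpus, patterns) - seeds
print(patterns)
print(candidates)
```

The noisy pattern "X loves Y" is learned from a single seed occurrence and then yields (Larry, Google), which illustrates why harvesting alone is not robust enough for high precision.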
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates
FOL rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,x) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
ground atoms:
spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
f(Hillary), f(Carla), f(Cecilia), f(Sofie)
m(Bill), m(Nicolas), m(Ben), m(Mick)
Rules reveal inconsistencies
Find consistent subset(s) of atoms ("possible world(s)", "the truth")
Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
→ uncertain / probabilistic data
→ compute prob. distr. of subset of atoms being the truth
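A minimal sketch of pruning by hard constraints, assuming the typed variant of the rules (first argument female, second argument male) plus functionality in both directions. The brute-force search over subsets is an illustrative choice that only works for a handful of atoms.

```python
from itertools import combinations

candidates = [("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
              ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie")]
female = {"Hillary", "Carla", "Cecilia", "Sofie"}
male = {"Bill", "Nicolas", "Ben", "Mick"}

def consistent(facts):
    """Check the hard constraints on a collection of spouse(x, y) atoms."""
    for x, y in facts:
        if x not in female or y not in male:   # spouse(x,y) => f(x), m(y)
            return False
    for (x1, y1), (x2, y2) in combinations(facts, 2):
        if x1 == x2 or y1 == y2:               # at most one spouse each way
            return False
    return True

def largest_consistent_subsets(cands):
    """All consistent subsets of maximum size (exhaustive search)."""
    for size in range(len(cands), 0, -1):
        found = [set(c) for c in combinations(cands, size) if consistent(c)]
        if found:
            return found
    return [set()]

worlds = largest_consistent_subsets(candidates)
for world in worlds:
    print(sorted(world))
```

Several equally large consistent subsets survive here, which is one motivation for weighting rules and candidates instead of treating everything as hard constraints.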
Markov Logic Networks (MLN's) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
Rules:
s(x,y) ⇒ m(y)
s(x,y) ⇒ f(x)
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)
Fact candidates:
s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
Grounding:
¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
¬s(Ca,Nic) ∨ ¬s(Ca,Ben)
¬s(Ca,Nic) ∨ ¬s(Ca,So)
¬s(Ca,Ben) ∨ ¬s(Ca,So)
¬s(Ca,Nic) ∨ m(Nic)
¬s(Ce,Nic) ∨ m(Nic)
¬s(Ca,Ben) ∨ m(Ben)
¬s(Ca,So) ∨ m(So)
Literal → Boolean variable
Literal → binary RV
[Figure: the resulting MRF, with one node per ground atom: m(Ben), m(Nic), m(So), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So)]
RVs coupled by MRF edge if they appear in same clause
MRF assumption: P[Xi | X1..Xn] = P[Xi | N(Xi)]
Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
joint distribution has product form over all cliques
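The product form can be made concrete on a tiny example: each world gets probability proportional to exp(sum of weights of the ground clauses it satisfies). The sketch below enumerates all worlds exactly, which is only feasible for a few atoms; the clause weights are made-up values for illustration, and real systems would use the inference algorithms listed above.

```python
from itertools import product
from math import exp

atoms = ["s(Ca,Nic)", "s(Ce,Nic)", "s(Ca,Ben)"]

# Weighted ground clauses: (weight, [(atom, required truth value), ...]).
# A clause is satisfied if at least one literal holds. Weights are invented.
clauses = [
    (2.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # Nic has one wife
    (2.0, [("s(Ca,Nic)", False), ("s(Ca,Ben)", False)]),  # Ca has one husband
    (1.5, [("s(Ca,Nic)", True)]),                         # extraction evidence
    (0.5, [("s(Ce,Nic)", True)]),
    (0.5, [("s(Ca,Ben)", True)]),
]

def score(world):
    """Sum of weights of the clauses satisfied by this truth assignment."""
    return sum(w for w, lits in clauses
               if any(world[a] == v for a, v in lits))

worlds = [dict(zip(atoms, values))
          for values in product([False, True], repeat=len(atoms))]
z = sum(exp(score(w)) for w in worlds)            # partition function
probs = [exp(score(w)) / z for w in worlds]       # P(world), product form

def marginal(atom):
    """Probability that the atom is true, summed over all worlds."""
    return sum(p for w, p in zip(worlds, probs) if w[atom])

map_world = max(zip(probs, range(len(worlds))))[1]
print(worlds[map_world])
print({a: round(marginal(a), 3) for a in atoms})
```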
Related Alternative Probabilistic Models
Constrained Conditional Models [D. Roth et al. 2007]:
log-linear classifiers with constraint-violation penalty, mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]:
• RV's share "factors" (joint feature functions)
• generalizes MRF, BN, CRF, …
• inference via advanced MCMC
• flexible coupling & constraining of RV's
software tools: alchemy.cs.washington.edu, code.google.com/p/factorie/, research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route
facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)
new fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Carla, Sofie)
married (Larry, Google)
patterns:
X and her husband Y
X and Y and their children
X has been dating with Y
X loves Y
Direct approach:
1. facts are true; fact candidates & patterns → hypotheses; grounded constraints → clauses with hypotheses as vars
2. type signatures of relations greatly reduce #clauses
3. cast into Weighted Max-Sat with weights from pattern stats
customized approximation algorithm
unifies: fact cand consistency, pattern goodness, entity disambig.
(F. Suchanek et al.: WWW'09)
www.mpi-inf.mpg.de/yago-naga/sofie/
Facts & Patterns Consistency with SOFIE
constraints to connect facts, fact candidates, patterns
(F. Suchanek et al.: WWW'09, N. Nakashole et al.: WebDB'10)
www.mpi-inf.mpg.de/yago-naga/sofie/
functional dependencies:
spouse(X,Y): X → Y, Y → X
relation properties:
asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
spouse ⊆ Person × Person
capitalOfCountry ⊆ cityOfCountry
domain-specific constraints:
bornInYear(x) + 10years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ expresses(p,R)
name(-in-context)-to-entity mapping:
¬ (means(n,e1) ∧ means(n,e2) ∧ …)
Entity Disambiguation Revisited
entity level:
occurs ("divorced from", Madonna, Guy Ritchie) ∧ expresses ("divorced from", wasMarriedTo) ⇒ wasMarriedTo (Madonna, Guy Ritchie)
actually is (word/phrase level):
occurs ("divorced from", "Madonna", "Guy Ritchie") ∧ means ("Madonna", Madonna Louise Ciccone) ∧ expresses ("divorced from", wasMarriedTo) ⇒ wasMarriedTo (Madonna Louise Ciccone, Guy Ritchie) [0.7]
occurs ("divorced from", "Madonna", "Guy Ritchie") ∧ means ("Madonna", Madonna (Edvard Munch)) ∧ expresses ("divorced from", wasMarriedTo) ⇒ wasMarriedTo (Madonna (Edvard Munch), Guy Ritchie) [0.3]
• use context-similarity as disambiguation prior
• set clause weights accordingly
→ reduced to normal case
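One simple way to turn context similarity into clause weights is to normalize the similarity scores of the candidate entities for a mention. This is a sketch under that assumption; the raw scores below are hypothetical and merely chosen to reproduce the 0.7 / 0.3 weights shown above, and the slides do not state how the prior is actually computed.

```python
def disambiguation_prior(similarities):
    """Normalize per-entity context-similarity scores into a prior."""
    total = sum(similarities.values())
    return {entity: score / total for entity, score in similarities.items()}

# Hypothetical similarity of the mention's context to each candidate entity.
scores = {"Madonna Louise Ciccone": 7.0, "Madonna (Edvard Munch)": 3.0}
prior = disambiguation_prior(scores)

# Each grounded clause for the mention "Madonna" gets the prior as its weight.
clause_weights = {
    ("wasMarriedTo", entity, "Guy Ritchie"): weight
    for entity, weight in prior.items()
}
print(clause_weights)
```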
Experimental Results
SOFIE (F. Suchanek et al.: WWW'09)
• input: biographies of 400 US senators, 3500 HTML files
• output: birth/death date&place, politicianOf (state)
• run-time: 7 h parsing, 6 h hypotheses, 2 h Max-Sat
• precision: 90-95 % (except for death place)
• recall: ca. 750 extracted facts (300 politicianOf facts)
PROSPERA (N. Nakashole et al.: WebDB'10):
• input: 87 000 Wikipedia articles and Web homepages of scientists
• output: hasAdvisor, graduatedAt, hasCollaborator, facultyAt, wonAward
• run-time: 1 h total (largely parallelized)
• precision: 85-95 %
• recall: ca. 4000 extracted facts (400 hasAdvisor facts)
Now running experiments on ClueWeb'09 corpus (500 million English Web pages) with Hadoop cluster of 10x16 cores and 10x48 GB
Outline
...
Automatic KB Construction
Growing & Maintaining the KB
Temporal Knowledge
What and Why
Wrap-up
Temporal Knowledge
Which facts for given relations hold at what time point or during which time intervals?
marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ]
capitalOf (Berlin, Germany) [ 1990, now ]
capitalOf (Bonn, Germany) [ 1949, 1989 ]
hasWonPrize (JimGray, TuringAward) [ 1998 ]
graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]
How can we query & reason on entity-relationship facts in a "time-travel" manner - with uncertain/incomplete KB?
US president when Barack Obama was born?
students of Hector Garcia-Molina while he was at Princeton?
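A time-travel lookup over facts with time scopes can be sketched as a simple interval filter. The sketch assumes year resolution and a sentinel value for "now" / "forever", both of which are simplifications introduced here; `TFact` and `holds_at` are hypothetical names.

```python
from dataclasses import dataclass

NOW = 9999  # sentinel year standing in for "now" / "forever" (assumption)

@dataclass(frozen=True)
class TFact:
    relation: str
    subject: str
    obj: str
    begin: int  # first year of validity, inclusive
    end: int    # last year of validity, inclusive

kb = [
    TFact("capitalOf", "Berlin", "Germany", 1990, NOW),
    TFact("capitalOf", "Bonn", "Germany", 1949, 1989),
    TFact("marriedTo", "Madonna", "Guy", 2000, 2008),
    TFact("hasWonPrize", "JimGray", "TuringAward", 1998, 1998),
]

def holds_at(kb, relation, obj, year):
    """Subjects s such that relation(s, obj) is valid in the given year."""
    return [f.subject for f in kb
            if f.relation == relation and f.obj == obj
            and f.begin <= year <= f.end]

print(holds_at(kb, "capitalOf", "Germany", 1985))
print(holds_at(kb, "capitalOf", "Germany", 2010))
```

Questions such as "US president when Barack Obama was born?" chain two such lookups: first retrieve the time point of one fact, then use it to filter another relation.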
French Marriage Problem
facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)
validFrom (2, 2008)
new fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)
validFrom (4, 1996), validUntil (4, 2007)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)
Challenge: Temporal Knowledge
for all people in Wikipedia (300 000) gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night
consistency constraints are potentially helpful:
• functional dependencies: husband, time → wife
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce
1) recall: gather temporal scopes for base facts
2) precision: reason on mutual consistency
Difficult Dating
(Even More Difficult) Implicit Dating
explicit dates vs. implicit dates relative to other dates
(Even More Difficult) Relative Dating
vague dates, relative dates
narrative text, relative order
Framework for T-Fact Extraction
(Theobald et al.: MUD'10, Wang et al.: EDBT'10; Zhang et al.: WebDB'08)
1) represent temporal scopes of facts in the presence of incompleteness and uncertainty
2) gather & filter candidates for t-facts: extract base facts R(e1, e2) first; then focus on sentences with e1, e2 and date or temporal phrase
3) aggregate & reconcile evidence from observations
4) reason on joint constraints about facts and time scopes
1) Representing T-Fact Evidence
different resolutions, later refinement
uncertain & inconsistent evidence → confidence distribution
"After 4 years of happy marriage, Madonna and Sean got divorced in September 1989."
1: married(Madonna, Sean),
earliestSince (1, 1-Jan-1985), latestSince (1, 31-Dec-1985),
earliestUntil (1, 1-Sep-1989), latestUntil (1, 30-Sep-1989)
event-style and state-style facts
meta-facts to capture temporal scopes
1: married(Madonna, Sean), 2: married(Madonna, Guy),
validSince (1, 16-Aug-1985), validUntil (1, 14-Sep-1989),
validSince (2, 22-Dec-2000), validUntil (2, 15-Dec-2008)
3: wonAward(Sean, AcademyAwardForBestActor), validOn (3, 29-Feb-2004)
[Figure: two confidence distributions over a time axis from 1984 to 1990: a bell curve with µ = 1987, σ² = 1, and a histogram with confidence values 0.7, 0.4, 0.1]
2) Gather & Filter T-Fact Candidates
Choice of sources:
news-style | biography-style
date in header | many dates in text
relative temp expr's | explicit dates, narrative
simple language | elaborated language
many pronouns | pronouns for main entity
Naive approach: use deep NLP (dependency parser) on every sentence, then use classifier (or structured-output learner) to detect t-facts → too expensive
Bruni met recently divorced president Sarkozy in November 2007 at a dinner party.
She has said she is easily "bored with monogamy" …
A romance is said to have started a few weeks ago between her and Biolay.
2) Gather & Filter: Multi-Stage Approach
stage 1: sentences with e1 and e2 from R
• match noun phrases against YAGO means relation
• use disambiguation prior for entity mentions
stage 2: sentences that contain a temporal expression
• use TARSQI tool to extract relative t-expressions and map them to absolute dates or durations
stage 3: sentences where the t-expression refers to R(e1,e2)
• run dependency parser: check shortest path connecting e1, e2, verb, t-expr
• alternatively, consider only sentences with two noun groups & short surface distances of e1, e2, t-expr
Jim married Sue, but later left her and began an affair with Jane in 2005.
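The three stages can be sketched as successive cheap filters. Everything below is a stand-in: substring matching replaces the lookup in the YAGO means relation, a year regex replaces a temporal tagger such as TARSQI, and stage 3 uses the surface-distance alternative instead of a dependency parser. The `max_gap` threshold and the extra sentences are hypothetical.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")  # crude stand-in for a temporal tagger

def stage1(sentences, e1, e2):
    """Keep sentences that mention both entities of the base fact."""
    return [s for s in sentences if e1 in s and e2 in s]

def stage2(sentences):
    """Keep sentences that contain a temporal expression (here: a year)."""
    return [s for s in sentences if YEAR.search(s)]

def stage3(sentences, e1, e2, max_gap=30):
    """Keep sentences where e1, e2 and the date lie close together."""
    kept = []
    for s in sentences:
        positions = [s.index(e1), s.index(e2), YEAR.search(s).start()]
        if max(positions) - min(positions) <= max_gap:
            kept.append(s)
    return kept

sentences = [
    "Jim married Sue, but later left her and began an affair with Jane in 2005.",
    "Jim married Sue in 1998.",
    "Jim and Sue live in Boston.",
    "Jane moved in 2005.",
]
step1 = stage1(sentences, "Jim", "Sue")
step2 = stage2(step1)
step3 = stage3(step2, "Jim", "Sue")
print(step3)
```

The first sentence passes stages 1 and 2 but is dropped in stage 3, since its date sits far from the marriage mention and in fact refers to a different event.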
3) Aggregate & Reconcile T-Fact Evidence
Ideal input:
Madonna and Sean were married from 16-Aug-85 until 12-Sep-89.
Madonna and Sean married on August 16, 1985.
Madonna and Sean got divorced in September 1989.
Imprecise input:
Madonna and Sean were married from 1985 through 1989.
Madonna and Sean were married four years in the late nineties.
Madonna and Sean got divorced in fall 1989.
Noisy input:
Madonna and Sean plan their wedding in summer 1985.
Madonna and Sean just returned from their honeymoon (in Jan 1986).
Madonna and Sean will be divorced by the end of the year (1989).
The marriage of Madonna and Sean will not survive this year (1987).
3) Aggregate & Reconcile T-Fact Evidence
Real input:
…
Madonna and Sean were chased during their honeymoon … (Jan 19, 1986)
Madonna and her husband Sean opened the exhibition … (March 7, 1986)
Madonna and her husband Sean were seen at … (April 1, 1986)
Madonna and Sean met other couples at … (June 22, 1986)
Madonna and Sean plan to have children … (July 4, 1986)
Madonna and Sean would consider adopting a child … (July 14, 1986)
Sean and his wife Madonna purchase another castle in … (November 5, 1986)
...
Madonna and Sean think about getting divorced … (April 21, 1989)
The marriage of Madonna and Sean is in deep crisis … (May 11, 1989)
…
3) Aggregate & Reconcile: Solution
[Figure: evidence over time, shown as an event histogram (begin), a state histogram (during), and an event histogram (end)]
• Classifier for t-fact observations: begin vs. during vs. end
• Build separate histogram for each class (and each t-fact)
• Combine histograms & derive high-confidence time scope
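The histogram step can be sketched as follows, assuming the classifier output is already available as (class, year) pairs. The observations are invented, and the combination rule (most frequent begin year, most frequent end year, share of "during" evidence inside the scope as a confidence value) is one simple assumption among many possible ones.

```python
from collections import Counter

# (class, year) observations for one t-fact; hypothetical classifier output.
observations = [
    ("begin", 1985), ("begin", 1985), ("begin", 1986),
    ("during", 1986), ("during", 1986), ("during", 1987), ("during", 1988),
    ("end", 1989), ("end", 1989), ("end", 1987),
]

def build_histograms(observations):
    """One year histogram per observation class."""
    hists = {"begin": Counter(), "during": Counter(), "end": Counter()}
    for label, year in observations:
        hists[label][year] += 1
    return hists

def derive_scope(hists):
    """Return (begin, end, support); needs at least one begin and one end."""
    begin = hists["begin"].most_common(1)[0][0]
    end = hists["end"].most_common(1)[0][0]
    during_total = sum(hists["during"].values())
    inside = sum(n for year, n in hists["during"].items()
                 if begin <= year <= end)
    support = inside / during_total if during_total else 0.0
    return begin, end, support

hists = build_histograms(observations)
scope = derive_scope(hists)
print(scope)
```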
4) Joint Reasoning on Facts and T-Facts
∀ X, Y, Z, T1, T2: m(X,Y) ∧ m(X,Z) ∧ validTime(m(X,Y),T1) ∧ validTime(m(X,Z),T2) ⇒ ¬overlaps(T1, T2)
constraint: marriedTo (m) is an injective function at any given point in time
Combine & reconcile t-scopes across different facts
after grounding:
m(Carla, Nicolas) ∧ m(Cecilia, Nicolas) ⇒ ¬overlaps ([2008,2010], [1996,2007])
m(Carla, Nicolas) ∧ m(Carla, Benjamin) ⇒ ¬overlaps ([2008,2010], [2009,2011])
which evaluates to:
m(Ca,Nic) ∧ m(Ce,Nic) ⇒ ¬false
m(Ca,Nic) ∧ m(Ca,Ben) ⇒ ¬true
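Grounding this constraint amounts to an interval-overlap test on every pair of marriage facts that share a person. A minimal sketch, assuming closed year intervals and treating "shares a person" symmetrically for both arguments:

```python
from itertools import combinations

def overlaps(t1, t2):
    """Closed intervals [a, b] and [c, d] share at least one time point."""
    return t1[0] <= t2[1] and t2[0] <= t1[1]

t_facts = {
    ("Carla", "Nicolas"): (2008, 2010),
    ("Cecilia", "Nicolas"): (1996, 2007),
    ("Carla", "Benjamin"): (2009, 2011),
}

def violations(t_facts):
    """Pairs of marriage facts that share a spouse and overlap in time."""
    bad = []
    for (f1, t1), (f2, t2) in combinations(t_facts.items(), 2):
        shares_spouse = bool(set(f1) & set(f2))
        if shares_spouse and overlaps(t1, t2):
            bad.append((f1, f2))
    return bad

conflicts = violations(t_facts)
print(conflicts)
```

Only the Carla/Nicolas and Carla/Benjamin scopes clash, so Cecilia/Nicolas can coexist with Carla/Nicolas once time is taken into account.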
4) Joint Reasoning on Facts and T-Facts
[Figure: timeline with the time scopes of m(Ca, Ben), m(Ca, Nic), m(Ce, Nic), m(Ca, Mi), m(Ce, Mi)]
Conflict graph nodes:
m(Ca, Ben) [2009,2011]
m(Ca, Nic) [2008,2010]
m(Ce, Nic) [1996,2007]
m(Ca, Mi) [2004,2008]
m(Ce, Mi) [1998,2005]
Find maximal (maximum-weight) independent set: subset of nodes w/o adjacent pairs, with (evidence-)weighted nodes
[Figure: the same conflict graph with (evidence) node weights 100, 20, 80, 30, 10]
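The weighted independent-set step can be sketched with an exhaustive search over the five nodes. The assignment of the weights 100, 20, 80, 30, 10 to specific facts below is an assumption made for illustration, since the transcript lists the weights without their nodes; edges are derived from shared persons and overlapping closed intervals.

```python
from itertools import combinations

# t-fact -> (time scope, evidence weight). The weight-to-node pairing is
# hypothetical; only the scopes come from the slide.
nodes = {
    ("Ca", "Ben"): ((2009, 2011), 20),
    ("Ca", "Nic"): ((2008, 2010), 100),
    ("Ce", "Nic"): ((1996, 2007), 80),
    ("Ca", "Mi"): ((2004, 2008), 30),
    ("Ce", "Mi"): ((1998, 2005), 10),
}

def in_conflict(a, b):
    """Two marriage facts conflict if they share a person and overlap."""
    (a_lo, a_hi), _ = nodes[a]
    (b_lo, b_hi), _ = nodes[b]
    return bool(set(a) & set(b)) and a_lo <= b_hi and b_lo <= a_hi

edges = {frozenset(p) for p in combinations(nodes, 2) if in_conflict(*p)}

def best_independent_set(nodes, edges):
    """Exhaustive search; fine for a few nodes, exponential in general."""
    best, best_weight = set(), 0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if any(frozenset(p) in edges for p in combinations(subset, 2)):
                continue  # subset contains two adjacent (conflicting) nodes
            weight = sum(nodes[n][1] for n in subset)
            if weight > best_weight:
                best, best_weight = set(subset), weight
    return best, best_weight

chosen, total = best_independent_set(nodes, edges)
print(chosen, total)
```

For realistic graph sizes the exhaustive search would have to be replaced by an approximation algorithm, since the problem is NP-hard in general.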
4) Joint Reasoning on Facts and T-Facts
alternative approach: split t-scopes and reason on consistency of t-fact partitions
Preliminary Results
playsForTeam(X,Z)@T1 ∧ playsForTeam(Y,Z)@T2 ∧ overlaps (T1,T2) ⇒ teammates(X,Y)
• automatic extraction of t-facts about football/soccer from Wikipedia and news articles
• query answering by reasoning on t-facts
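The teammates rule can be evaluated directly over playsForTeam t-facts. A minimal sketch with placeholder players, teams, and years (all hypothetical, not extracted data):

```python
from itertools import combinations

# Hypothetical playsForTeam t-facts: (player, team, first year, last year).
plays_for = [
    ("PlayerA", "TeamX", 2002, 2008),
    ("PlayerB", "TeamX", 2007, 2012),
    ("PlayerC", "TeamX", 2010, 2014),
    ("PlayerD", "TeamY", 2003, 2009),
]

def teammates(plays_for):
    """Pairs of players on the same team with overlapping time scopes."""
    pairs = set()
    for (p1, team1, b1, e1), (p2, team2, b2, e2) in combinations(plays_for, 2):
        if team1 == team2 and p1 != p2 and b1 <= e2 and b2 <= e1:
            pairs.add(frozenset((p1, p2)))
    return pairs

result = teammates(plays_for)
print(sorted(sorted(pair) for pair in result))
```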
Outline
...
Automatic KB Construction
Growing & Maintaining the KB
Temporal Knowledge
What and Why
Wrap-up
KB Building: Where Do We Stand?
Knowledge Bases on Entities & Classes
strong success story, some problems left:
• large taxonomies of classes with individual entities
• long tail calls for new methods
• entity disambiguation remains grand challenge
Relationships
good progress, but many challenges left:
• recall & precision by patterns & reasoning
• efficiency & scalability
• soft rules, hard constraints, richer logics, …
• open-domain discovery of new relation types
Temporal Knowledge
widely open (fertile) research ground:
• uncertain / incomplete temporal scopes of facts
• joint reasoning on base-facts and time-scopes
Overall Take-Home
Historic opportunity: revive Cyc vision, make it real & large-scale!
KB as enabler of macroscopic "machine reading"
challenging & risky, but high pay-off
Explore & exploit synergies between semantic, statistical, & social Web methods:
statistical evidence + logical consistency!
Many interesting research topics for CS (+ CoLi):
• efficiency & scalability
• constraints & reasoning on uncertain data
• NLP for temporal statements
• statistical ranking for semantic search
• knowledge-base life-cycle: growth & maintenance
Thank You !