Entity Extraction for Query Interpretation
Patrick Pantel ǂ
Query Representation and Understanding SIGIR
July 23, 2010
Collaborators: Alpa Jain, Ana-Maria Popescu, Arkady Borkovsky, Eric Crestan, Hadar Shemtov, Marco Pennacchiotti, Nicolas Torzec, Vishnu Vyas
ǂ Now at Microsoft Research
- 2 -Entity Extraction July 2010
The Ensemble Effect
[Chart: precision (0.4 to 1) vs. rank (1 to 495); series: pattern-based, distributional, combined, ml-combined, ensemble]
(chart refers to class Athlete)
+19% average precision over state of the art
Ensemble Semantics
[Diagram components: sources S1, S2, ..., SK; KB]
(Pennacchiotti and Pantel; EMNLP 2009)
Ensemble Semantics
[Diagram components: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; FEATURE GENERATORS FG1, FG2, ..., FGm; KB]
(Pennacchiotti and Pantel; EMNLP 2009)
Ensemble Semantics
[Diagram components: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; KB]
(Pennacchiotti and Pantel; EMNLP 2009)
Ensemble Semantics
[Diagram components: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; RANKER; MODELER; DECODER; KB]
(Pennacchiotti and Pantel; EMNLP 2009)
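The components named in the diagram can be wired together in a few lines. The sketch below is only an illustration: it assumes the flow runs from knowledge extractors through an aggregator and feature generators to a ranker, and every function name, the toy extractors, the `HAS_FILMOGRAPHY_QUERIES` set, and the hand-set weights (standing in for a learned model) are hypothetical rather than taken from the slides.

```python
from typing import Callable, Dict, List, Set

Extractor = Callable[[List[str]], Set[str]]
FeatureGenerator = Callable[[str], float]


def aggregate(extractors: List[Extractor], seeds: List[str]) -> Dict[str, int]:
    """Aggregator: pool candidates and count how many extractors proposed each."""
    votes: Dict[str, int] = {}
    for extract in extractors:
        for candidate in extract(seeds):
            if candidate not in seeds:
                votes[candidate] = votes.get(candidate, 0) + 1
    return votes


def generate_features(candidate: str, votes: Dict[str, int], n_extractors: int,
                      generators: Dict[str, FeatureGenerator]) -> Dict[str, float]:
    """Feature generators: one source-level feature plus one per generator."""
    features = {"extractor_agreement": votes[candidate] / n_extractors}
    for name, generator in generators.items():
        features[name] = generator(candidate)
    return features


def rank(votes: Dict[str, int], n_extractors: int,
         generators: Dict[str, FeatureGenerator],
         weights: Dict[str, float]) -> List[str]:
    """Ranker: a hand-weighted linear score (assumption; stands in for a trained model)."""
    def score(candidate: str) -> float:
        feats = generate_features(candidate, votes, n_extractors, generators)
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return sorted(votes, key=lambda c: (-score(c), c))


# Toy knowledge extractors with hard-coded outputs (hypothetical).
def pattern_ke(seeds: List[str]) -> Set[str]:
    return {"Johnny Depp", "Texas"}


def distributional_ke(seeds: List[str]) -> Set[str]:
    return {"Johnny Depp", "Brad Pitt"}


# Hypothetical query-log signal: entities queried together with "filmography".
HAS_FILMOGRAPHY_QUERIES = {"Johnny Depp", "Brad Pitt"}
generators = {"query_log": lambda c: 1.0 if c in HAS_FILMOGRAPHY_QUERIES else 0.0}
weights = {"extractor_agreement": 1.0, "query_log": 1.0}

extractors = [pattern_ke, distributional_ke]
votes = aggregate(extractors, ["Tom Hanks"])
ranked = rank(votes, len(extractors), generators, weights)
print(ranked)
```

In this toy run the candidate proposed by both extractors and supported by the query-log feature ranks first, while the candidate backed by a single extractor and no extra signal ranks last.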
Knowledge Extractors
Pattern-based Extractor
1) DRIVING IDEA
How are Jubil Sarch and Kirov Chob related?
• Jubil Sarch starring Kirov Chob
• Jubil Sarch featuring the beautiful Kirov Chob
• Steven Soderbergh-directed Jubil Sarch which earned star Kirov Chob an Oscar
2) ALGORITHM
• State-of-the-art pattern-based algorithm for relation extraction [Pasca et al., 2006]
• Instantiate typical relations of the class at hand (e.g. act-in(Actor,Movie))
3) IN ACTION
Seed tuples: act-in(Tom Hanks, The Terminal), act-in(Nicole Kidman, Eyes Wide Shut)
Loop over the Web: learn patterns, find new tuples, append reliable tuples
Extracted: Johnny Depp, Denis Lawson, Brad Pitt, Morgan Freeman, …
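The loop on this slide (seed tuples, learn patterns, find new tuples, append reliable tuples) can be illustrated with a toy bootstrapper. This is a simplified sketch, not the algorithm of [Pasca et al., 2006]: it assumes each sentence has the exact shape "movie, connecting text, actor", treats the connecting text as the pattern, and calls a tuple reliable when at least `min_support` distinct patterns produce it. The corpus, the function names, and the reliability rule are all illustrative assumptions.

```python
import re
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (actor, movie), i.e. act-in(actor, movie)


def learn_patterns(corpus: List[str], tuples: Set[Pair]) -> Set[str]:
    """Collect the text found between a known movie and its known actor."""
    patterns = set()
    for actor, movie in tuples:
        regex = re.compile(re.escape(movie) + r"(.+)" + re.escape(actor))
        for sentence in corpus:
            match = regex.fullmatch(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def find_tuples(corpus: List[str], patterns: Set[str]) -> Dict[Pair, Set[str]]:
    """Apply each pattern to the corpus and record which patterns support each tuple."""
    support: Dict[Pair, Set[str]] = {}
    for infix in patterns:
        regex = re.compile(r"(.+?)" + re.escape(infix) + r"(.+)")
        for sentence in corpus:
            match = regex.fullmatch(sentence)
            if match:
                pair = (match.group(2), match.group(1))
                support.setdefault(pair, set()).add(infix)
    return support


def bootstrap(corpus: List[str], seeds: Set[Pair],
              iterations: int = 2, min_support: int = 2) -> Set[Pair]:
    """Grow the tuple set, keeping only tuples backed by enough distinct patterns."""
    known = set(seeds)
    for _ in range(iterations):
        patterns = learn_patterns(corpus, known)
        for pair, infixes in find_tuples(corpus, patterns).items():
            if len(infixes) >= min_support:
                known.add(pair)
    return known


# Toy corpus (hypothetical); the last sentence is deliberate noise.
CORPUS = [
    "The Terminal starring Tom Hanks",
    "Eyes Wide Shut starring Nicole Kidman",
    "Eyes Wide Shut featuring the beautiful Nicole Kidman",
    "Jubil Sarch starring Kirov Chob",
    "Jubil Sarch featuring the beautiful Kirov Chob",
    "Popcorn starring Butter",
]
SEEDS = {("Tom Hanks", "The Terminal"), ("Nicole Kidman", "Eyes Wide Shut")}

result = bootstrap(CORPUS, SEEDS)
print(sorted(result))
```

Here the unknown pair (Kirov Chob, Jubil Sarch) is accepted because two learned patterns agree on it, while the noisy pair from the last sentence has only one supporting pattern and is left out.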
Distributional Extractor
1) DRIVING IDEA
What is tezgüno?
– A bottle of tezgüno is on the table.
– Everyone likes tezgüno.
– Tezgüno makes you drunk.
– We make tezgüno out of grapes.
2) ALGORITHM
• State-of-the-art distributional entity extractor [Pantel et al., EMNLP 2009]
• Given a small set of seeds for a given class, find distributionally similar candidate instances
3) IN ACTION
Seeds: Nicole Kidman, Al Pacino, Tom Hanks
Source: Web
Extracted: anna gunn, olivier gueritee, tomas von bromssen, harry jones, judy matheson, robert keith, mariah o'brien, starring dennis quaid, noah beery jr, federico castelluccio, adienne shelly, betty moran, george takai, jo anne worley, ruth hampton, rex hagon, alex fong, gene burke, miguel hermoso arnao, eiko ando, charles mccaughan, yukijiro hotaru, alec christie, dame wendy hiller, john wayne, arthur lake, sir herbert beerbohm tree, tonya wright, lori saunders
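The tezgüno example rests on the idea that terms appearing in similar contexts tend to belong to the same class. A minimal sketch of that idea follows, assuming single-word terms, raw bag-of-words context counts, and cosine similarity to the summed seed vector. The extractor cited on the slide works at Web scale and is not reproduced here; the corpus and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def context_vector(term: str, corpus: List[str]) -> Counter:
    """Count the words that co-occur with `term` in the same sentence."""
    vector = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        if term in tokens:
            vector.update(t for t in tokens if t != term)
    return vector


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def rank_candidates(seeds: List[str], candidates: List[str],
                    corpus: List[str]) -> List[Tuple[str, float]]:
    """Score each candidate against the summed context vector of the seeds."""
    centroid = Counter()
    for seed in seeds:
        centroid.update(context_vector(seed, corpus))
    scored = [(c, cosine(context_vector(c, corpus), centroid)) for c in candidates]
    return sorted(scored, key=lambda item: -item[1])


# Toy corpus (hypothetical), lowercased and without punctuation.
CORPUS = [
    "a bottle of tezgüno is on the table",
    "we make tezgüno out of grapes",
    "a bottle of wine is on the table",
    "we make wine out of grapes",
    "the car is in the garage",
]

ranking = rank_candidates(["wine"], ["car", "tezgüno"], CORPUS)
print(ranking)
```

With "wine" as the only seed, "tezgüno" shares all of its contexts and scores far above "car", which shares only function words.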
Feature Generators
Feature sets
• 4 feature families
• 5 feature types
• 402 features
Web: 600M pages web crawl
Query log: 1 year of queries (top 1M)
Web Table: from 600M pages web crawl
Wikipedia: 2M articles, 2008 dump
Query Log Features
Jodie Foster, Humphrey Bogart, Robert Duvall
starring, filmography, movies, Oscar
top-PMI attributes:
• Johnny Depp: pictures, starring, Donnie Brasco, filmography, Tim Burton, Quotes, movies
• Barack Obama: biography, birth certificate, speech, quotes, pictures, inauguration, wiki
• Jodie Foster: starring, Cydney Bernard, filmography, The Accused, movies, film, Oscar
• Robert Duvall: westerns, filmography, Tango, films, movies, starring, Apocalypse Now
• Humphrey Bogart: filmography, quotes, movies, Oscar, Lauren Bacall, starring, Imdb
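For readers unfamiliar with the term, pointwise mutual information (PMI) between an entity e and an attribute a is log(P(e, a) / (P(e) P(a))), so attributes that co-occur with an entity more often than chance score high. The sketch below estimates PMI from entity and attribute co-occurrence counts; the counts are invented for illustration, and how the slides' system actually segments queries into entity and attribute is not specified, so the `(entity, attribute)` input format is an assumption.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (entity, attribute) as seen together in a query


def pmi_table(pair_counts: Dict[Pair, int]) -> Dict[Pair, float]:
    """PMI(e, a) = log( P(e, a) / (P(e) * P(a)) ), estimated from raw counts."""
    total = sum(pair_counts.values())
    entity_counts: Counter = Counter()
    attribute_counts: Counter = Counter()
    for (entity, attribute), n in pair_counts.items():
        entity_counts[entity] += n
        attribute_counts[attribute] += n
    return {
        (e, a): math.log((n * total) / (entity_counts[e] * attribute_counts[a]))
        for (e, a), n in pair_counts.items()
    }


def top_attributes(entity: str, pmi: Dict[Pair, float], k: int = 3) -> List[str]:
    """Return the k attributes with the highest PMI for `entity` (ties by name)."""
    scored = [(a, s) for (e, a), s in pmi.items() if e == entity]
    return [a for a, _ in sorted(scored, key=lambda item: (-item[1], item[0]))[:k]]


# Invented query-log counts (hypothetical).
COUNTS = {
    ("johnny depp", "filmography"): 30,
    ("johnny depp", "pictures"): 50,
    ("johnny depp", "tim burton"): 20,
    ("barack obama", "pictures"): 50,
    ("barack obama", "speech"): 40,
    ("barack obama", "birth certificate"): 10,
}

pmi = pmi_table(COUNTS)
top = top_attributes("johnny depp", pmi, k=2)
print(top)
```

In this toy data "pictures" is the most frequent attribute for the actor but is equally common for the politician, so its PMI is zero, and the class-specific attributes come out on top.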
Web-table Features
Seeds: Harrison Ford, Burt Lancaster, Ian Hart, Nicholas Jones
Candidate: Denis Lawson
Ranker
EXPERIMENTS
Experimental Setup
• Task: Entity extraction
• Tested classes: Actors, Athletes, Musicians
• Gold Standard:
  - 500 manually annotated instances per class
  - 10 annotators, Kappa=0.88
• Evaluation:
  - Metric: average precision
  - 10-fold cross validation
• Comparisons:
  (B1) pattern-based extractor
  (B2) distributional extractor
  (E1) combined extractor
  (E2) ML-combined extractor [Mirkin et al., 2006]
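The evaluation metric, average precision, rewards rankings that place correct instances near the top. The sketch below uses one common definition, the mean of precision-at-rank over the ranks where a correct instance appears; normalizing by the number of correct instances in the ranked list (rather than in the whole gold standard) is an assumption, as are the function name and the toy data.

```python
from typing import List, Set


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Mean of precision@i over every rank i that holds a correct instance."""
    hits = 0
    precision_sum = 0.0
    for i, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / i  # precision at rank i
    return precision_sum / hits if hits else 0.0


# Toy ranking (hypothetical): one error at rank 2.
ranked_actors = ["Tom Hanks", "Texas", "Brad Pitt"]
gold_actors = {"Tom Hanks", "Brad Pitt"}

ap = average_precision(ranked_actors, gold_actors)
print(round(ap, 4))  # (1/1 + 2/3) / 2
```

Moving the error from rank 2 to rank 3 would raise the score to 1.0, which is why the metric is sensitive to where in the ranking the mistakes fall.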
Experimental Results
FEATURES
s = Source features
w = Webcrawl (600M docs)
q = Query Logs (1 year)
t = Web tables (600M docs)
k = Wikipedia
Standalone Extractors gain: Actors
[Chart: Distributional, precision (0 to 1) vs. rank analysis; series: distributional, ensemble]
[Chart: Pattern-based, precision (0 to 1) vs. rank analysis; series: pattern-based, ensemble]
Standalone Extractors gain: Athletes
[Chart: Distributional, precision (0 to 1) vs. rank analysis; series: distributional, ensemble]
[Chart: Pattern-based, precision (0 to 1) vs. rank analysis; series: pattern-based, ensemble]
Standalone Extractors gain: Musicians
[Chart: Distributional, precision (0 to 1) vs. rank analysis; series: distributional, ensemble]
[Chart: Pattern-based, precision (0 to 1) vs. rank analysis; series: pattern-based, ensemble]
Ensemble Effect: Athletes
[Chart: Precision vs. Rank Analysis; precision axis 0.00 to 1.00; rank ticks at 1,153, 28,835, and 56,518; series: upper_bound, baseline, t-gbdt-t300]
100 Most Frequent Entities in QL
free, google, you tube, girls, people, disney, baby, texas, air, spears, new york, …
<10% Precision
Takeaway: Harvesting knowledge requires different metrics than using the knowledge
• Editorial cleaning of entities covering 80% of queries
• Automatic cleaning of torso/tail
• Impact on query and document interpretation?
Conclusions
• Ensemble Semantics:
  – Draw from many knowledge sources
  – Apply different extraction biases
  – Leverage many signals of knowledge
• Significant gains in both coverage and precision on entity extraction
• BUT, is the knowledge useful??
  – A huge list of actors at 90% precision still results in many, many query interpretation errors!
    • Frequent terms tend to be ambiguous
    • Some highly accurate databases contain very bad errors (e.g., Texas and Baby are both actors in Y! Movies)
  – Further processing is necessary to make use of the knowledge…