entity extraction for query interpretation patrick pantel ǂ

Entity Extraction for Query Interpretation Patrick Pantel ǂ Query Representation and Understanding SIGIR July 23, 2010 Collaborators: Alpa Jain, Ana-Maria Popescu, Arkady Borkovsky, Eric Crestan, Hadar Shemtov, Marco Pennacchiotti, Nicolas Torzec, Vishnu Vyas ǂ Now at Microsoft Research


TRANSCRIPT

Page 1: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Entity Extraction for Query Interpretation

Patrick Pantel ǂ

Query Representation and Understanding SIGIR

July 23, 2010

Collaborators: Alpa Jain, Ana-Maria Popescu, Arkady Borkovsky, Eric Crestan, Hadar Shemtov, Marco Pennacchiotti, Nicolas Torzec, Vishnu Vyas

ǂ Now at Microsoft Research

Page 2: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

- 2 -Entity Extraction July 2010

The Ensemble Effect

[Figure: precision (y-axis, 0.4 to 1) vs. rank (x-axis, 1 to 495) for five systems: pattern-based, distributional, combined, ml-combined, ensemble]

(chart refers to class Athlete)

+19% average precision over state of the art

Page 3: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Ensemble Semantics

[Diagram components: sources S1, S2, ..., SK; KB]

(Pennacchiotti and Pantel; EMNLP 2009)

Page 4: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Ensemble Semantics

[Diagram components: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; FEATURE GENERATORS FG1, FG2, ..., FGm; KB]

(Pennacchiotti and Pantel; EMNLP 2009)

Page 5: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Ensemble Semantics

[Diagram components: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; KB]

(Pennacchiotti and Pantel; EMNLP 2009)

Page 6: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Ensemble Semantics

[Diagram components: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; KB; RANKER; MODELER; DECODER]

(Pennacchiotti and Pantel; EMNLP 2009)
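One way to read the component labels on these slides is as a pipeline: knowledge extractors propose candidates from the sources, an aggregator merges them, feature generators attach signals, and a ranker orders the result. The sketch below is a minimal Python illustration of that reading only. Every function name, the toy sources, and the scoring rule are hypothetical and are not taken from the slides or the cited paper.

```python
from typing import Callable, Dict, Iterable, List, Set

Source = List[str]                         # toy assumption: a source is a list of sentences
Extractor = Callable[[Source], Set[str]]   # knowledge extractor: source -> candidate instances
FeatureGen = Callable[[str], Dict[str, float]]


def aggregate(candidate_sets: Iterable[Set[str]]) -> Set[str]:
    """Aggregator: union of the candidates proposed by every extractor."""
    merged: Set[str] = set()
    for cands in candidate_sets:
        merged |= cands
    return merged


def rank(candidates: Set[str], generators: List[FeatureGen],
         score: Callable[[Dict[str, float]], float]) -> List[str]:
    """Ranker: build one feature dict per candidate, then sort by score."""
    scored = []
    for cand in candidates:
        features: Dict[str, float] = {}
        for gen in generators:
            features.update(gen(cand))
        scored.append((score(features), cand))
    # Descending score; ties broken alphabetically so the output is deterministic.
    return [c for _, c in sorted(scored, key=lambda sc: (-sc[0], sc[1]))]


def run_pipeline(sources, extractors, generators, score) -> List[str]:
    candidate_sets = [ke(src) for ke in extractors for src in sources]
    return rank(aggregate(candidate_sets), generators, score)


# Hypothetical extractors and feature generator for the toy run.
def ke_starring(src: Source) -> Set[str]:
    return {s.split(" starring ")[1] for s in src if " starring " in s}


def ke_actor_suffix(src: Source) -> Set[str]:
    suffix = " is an actor"
    return {s[: -len(suffix)] for s in src if s.endswith(suffix)}


def fg_length(cand: str) -> Dict[str, float]:
    return {"n_tokens": float(len(cand.split()))}


sources = [["Movie A starring Tom Hanks", "Texas is an actor"],
           ["Movie B starring Brad Pitt"]]
ranking = run_pipeline(sources, [ke_starring, ke_actor_suffix], [fg_length],
                       score=lambda f: f["n_tokens"])
print(ranking)
```

In a real system the hand-written score function would be replaced by a learned model; the toy run only shows how the pieces connect.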

Page 7: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Knowledge Extractors

[Architecture diagram repeated: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; KB; RANKER; MODELER; DECODER]

Page 8: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Pattern-based Extractor

1) DRIVING IDEA

How are Jubil Sarch and Kirov Chob related?

• Jubil Sarch starring Kirov Chob
• Jubil Sarch featuring the beautiful Kirov Chob
• Steven Soderbergh-directed Jubil Sarch which earned star Kirov Chob an Oscar

2) ALGORITHM

• State-of-the-art pattern-based algorithm for relation extraction [Pasca et al., 2006]
• Instantiate typical relations of the class at hand (e.g. act-in(Actor,Movie))

[Diagram: loop over the Web with the steps Seed tuples, Learn patterns, Find new tuples, Append reliable tuples]

3) IN ACTION

act-in(Tom Hanks, The Terminal)
act-in(Nicole Kidman, Eyes Wide Shut)

Johnny Depp
Denis Lawson
Brad Pitt
Morgan Freeman
…
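The loop named on this slide (seed tuples, learn patterns, find new tuples, append reliable tuples) can be sketched in a few lines of Python. This is a toy illustration, not the cited algorithm: patterns are whole-sentence templates, "reliable" is simplified to "supported by at least min_support patterns", and the corpus, function names, and parameters are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (actor, movie), i.e. one act-in tuple


def learn_patterns(corpus: List[str], known: Set[Pair]) -> Set[str]:
    """Turn every sentence that mentions a known pair into a template."""
    patterns = set()
    for sent in corpus:
        for actor, movie in known:
            if actor in sent and movie in sent:
                patterns.add(sent.replace(actor, "{A}").replace(movie, "{M}"))
    return patterns


def find_tuples(corpus: List[str], pattern: str) -> Set[Pair]:
    """Match one template against the corpus to propose (actor, movie) pairs."""
    regex = (re.escape(pattern)
             .replace(re.escape("{A}"), "(?P<A>.+)", 1)
             .replace(re.escape("{M}"), "(?P<M>.+)", 1))
    found = set()
    for sent in corpus:
        m = re.fullmatch(regex, sent)
        if m:
            found.add((m.group("A"), m.group("M")))
    return found


def bootstrap(corpus: List[str], seeds: Set[Pair],
              iterations: int = 2, min_support: int = 1) -> Set[Pair]:
    known = set(seeds)
    for _ in range(iterations):
        support = Counter()
        for pattern in learn_patterns(corpus, known):
            for pair in find_tuples(corpus, pattern):
                support[pair] += 1
        # Append only tuples proposed by enough distinct patterns.
        known |= {pair for pair, n in support.items() if n >= min_support}
    return known


# Toy corpus (made up for illustration).
corpus = [
    "The Terminal starring Tom Hanks",
    "Eyes Wide Shut starring Nicole Kidman",
    "Donnie Brasco starring Johnny Depp",
    "Johnny Depp plays the lead in Donnie Brasco",
    "Brad Pitt plays the lead in Fight Club",
]
seeds = {("Tom Hanks", "The Terminal"), ("Nicole Kidman", "Eyes Wide Shut")}
result = bootstrap(corpus, seeds)
print(sorted(actor for actor, _ in result))
```

The first iteration learns the "starring" template from the seeds and finds Johnny Depp; the second iteration learns a new template from that tuple and reaches Brad Pitt.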

Page 9: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Distributional Extractor

1) DRIVING IDEA

What is tezgüno?
– A bottle of tezgüno is on the table.
– Everyone likes tezgüno.
– Tezgüno makes you drunk.
– We make tezgüno out of grapes.

2) ALGORITHM

• State-of-the-art distributional entity extractor [Pantel et al., EMNLP 2009]
• Given a small set of seeds for a given class, find distributionally similar candidate instances

3) IN ACTION

Nicole Kidman, Al Pacino, Tom Hanks

Web

anna gunn, olivier gueritee, tomas von bromssen, harry jones, judy matheson, robert keith, mariah o'brien, starring dennis quaid, noah beery jr, federico castelluccio, adienne shelly, betty moran, george takai, jo anne worley, ruth hampton, rex hagon, alex fong, gene burke, miguel hermoso arnao, eiko ando, charles mccaughan, yukijiro hotaru, alec christie, dame wendy hiller, john wayne, arthur lake, sir herbert beerbohm tree, tonya wright, lori saunders
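The tezgüno example rests on the idea that terms appearing in similar contexts are similar. A minimal Python sketch of that idea follows: build a bag-of-context-words vector per term and rank candidates by cosine similarity to the seeds. The window size, the tokenizer, the toy sentences, and the function names are assumptions; the cited extractor operates at web scale and is not reproduced here.

```python
import math
import re
from collections import Counter
from typing import List


def context_vector(term: str, sentences: List[str], window: int = 2) -> Counter:
    """Count the words seen within `window` tokens of each occurrence of `term`."""
    vec: Counter = Counter()
    for sent in sentences:
        tokens = re.findall(r"\w+", sent.lower())
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vec


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_candidates(seeds: List[str], candidates: List[str],
                    sentences: List[str]) -> List[str]:
    """Order candidates by mean cosine similarity to the seed terms."""
    seed_vecs = [context_vector(s, sentences) for s in seeds]

    def score(cand: str) -> float:
        vec = context_vector(cand, sentences)
        return sum(cosine(vec, sv) for sv in seed_vecs) / len(seed_vecs)

    return sorted(candidates, key=score, reverse=True)


# Toy corpus: the slide's tezgüno sentences plus made-up sentences for comparison.
sentences = [
    "A bottle of tezgüno is on the table.",
    "Everyone likes tezgüno.",
    "Tezgüno makes you drunk.",
    "We make tezgüno out of grapes.",
    "A bottle of wine is on the shelf.",
    "Wine makes you drunk.",
    "We make wine out of grapes.",
    "The chair is next to the table.",
    "She sat on the chair.",
]
ranked = rank_candidates(seeds=["wine"], candidates=["chair", "tezgüno"],
                         sentences=sentences)
print(ranked)
```

With "wine" as the only seed, tezgüno shares almost all of its contexts with the seed while "chair" shares very few, so tezgüno ranks first.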

Page 10: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Feature Generators

[Architecture diagram repeated: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; KB; RANKER; MODELER; DECODER]

Page 11: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Feature sets
• 4 feature families
• 5 feature types
• 402 features

• Web: 600M pages web crawl
• Query log: 1 year of queries (top 1M)
• Web Table: from 600M pages web crawl
• Wikipedia: 2M articles, 2008 dump

Page 12: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Query Log Features

Jodie Foster, Humphrey Bogart, Robert Duvall

top-PMI attributes:
• starring, filmography, movies, Oscar
• Johnny Depp: pictures, starring, Donnie Brasco, filmography, Tim Burton, Quotes, movies
• Barack Obama: biography, birth certificate, speech, quotes, pictures, inauguration, wiki
• starring, Cydney Bernard, filmography, The Accused, movies, film, Oscar
• westerns, filmography, Tango, films, movies, starring, Apocalypse Now
• filmography, quotes, movies, Oscar, Lauren Bacall, starring, Imdb
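"Top-PMI attributes" refers to ranking the words that co-occur with an entity in queries by pointwise mutual information, PMI(e, a) = log(P(e, a) / (P(e) P(a))). The Python sketch below computes this from (entity, attribute) co-occurrence counts. The counts are invented for illustration, and the function name and data layout are assumptions rather than the system behind the slide.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def top_pmi_attributes(pair_counts: Dict[Tuple[str, str], int],
                       entity: str, k: int = 3) -> List[str]:
    """Return the k attributes with the highest PMI for `entity`."""
    total = sum(pair_counts.values())
    entity_counts: Counter = Counter()
    attr_counts: Counter = Counter()
    for (e, a), n in pair_counts.items():
        entity_counts[e] += n
        attr_counts[a] += n

    scores = {}
    for (e, a), n in pair_counts.items():
        if e == entity:
            # PMI = log( P(e,a) / (P(e) * P(a)) ), with probabilities as count ratios.
            scores[a] = math.log((n * total) / (entity_counts[e] * attr_counts[a]))
    return sorted(scores, key=lambda a: (-scores[a], a))[:k]


# Made-up query-log counts: how often each attribute word appears with each entity.
pair_counts = {
    ("johnny depp", "filmography"): 30,
    ("johnny depp", "movies"): 40,
    ("johnny depp", "pictures"): 50,
    ("barack obama", "pictures"): 60,
    ("barack obama", "biography"): 40,
    ("barack obama", "speech"): 30,
}
top2 = top_pmi_attributes(pair_counts, "johnny depp", k=2)
print(top2)
```

Although "pictures" is the most frequent word next to "johnny depp" in the toy counts, it is also frequent with the other entity, so PMI ranks the class-specific attributes "filmography" and "movies" above it.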

Page 13: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Web-table Features

Seeds: Harrison Ford, Burt Lancaster, Ian Hart, Nicholas Jones
Candidate: Denis Lawson
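The slide does not spell out which web-table features are computed. As a purely hypothetical illustration in Python, the sketch below derives a few co-occurrence features for a candidate from table columns it shares with seeds. The feature names, the column representation, and the toy tables are all assumptions.

```python
from typing import Dict, List, Set


def table_features(columns: List[List[str]], seeds: Set[str],
                   candidate: str) -> Dict[str, int]:
    """Count how often the candidate shares a table column with seed instances."""
    n_columns = 0            # columns that contain the candidate
    n_with_seed = 0          # ... of which at least one cell is a seed
    max_seeds = 0            # largest number of seeds seen alongside the candidate
    for column in columns:
        cells = set(column)
        if candidate not in cells:
            continue
        n_columns += 1
        shared = len(cells & seeds)
        if shared:
            n_with_seed += 1
        max_seeds = max(max_seeds, shared)
    return {"n_columns": n_columns,
            "n_columns_with_seed": n_with_seed,
            "max_seeds_in_column": max_seeds}


# Toy columns (made up); names are taken from the slide.
seeds = {"Harrison Ford", "Burt Lancaster", "Ian Hart", "Nicholas Jones"}
columns = [
    ["Harrison Ford", "Ian Hart", "Denis Lawson"],
    ["Denis Lawson", "Texas"],
    ["Burt Lancaster", "Nicholas Jones"],
]
features = table_features(columns, seeds, "Denis Lawson")
print(features)
```

The intuition is that a candidate listed in the same column as several known actors is more likely to be an actor than one that only appears next to unrelated strings.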

Page 14: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Ranker

[Architecture diagram repeated: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; KB; RANKER; MODELER; DECODER]

Page 15: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

EXPERIMENTS

[Architecture diagram repeated: sources S1, S2, ..., SK; KNOWLEDGE EXTRACTORS KE1, KE2, ..., KEn; AGGREGATOR; FEATURE GENERATORS FG1, FG2, ..., FGm; KB; RANKER; MODELER; DECODER]

Page 16: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Experimental Setup
• Task: Entity extraction
• Tested classes: Actors, Athletes, Musicians
• Gold Standard:
  - 500 manually annotated instances per class
  - 10 annotators, Kappa=0.88
• Evaluation:
  - Metric: average precision
  - 10-fold cross validation
• Comparisons:
  (B1) pattern-based extractor
  (B2) distributional extractor
  (E1) combined extractor
  (E2) ML-combined extractor [Mirkin et al., 2006]
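The evaluation metric named above, average precision, rewards rankings that place correct instances early. The Python sketch below uses one common definition (mean of precision-at-rank over the ranks that hold a correct instance); the slides do not state which normalization was used, so treat that choice, the function name, and the toy ranking as assumptions.

```python
from typing import List, Set


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Mean of precision@r over every rank r that holds a correct instance."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / hits if hits else 0.0


# Toy ranking: correct instances at ranks 1 and 3.
ranked = ["Tom Hanks", "Texas", "Brad Pitt", "Baby"]
gold = {"Tom Hanks", "Brad Pitt"}
ap = average_precision(ranked, gold)
print(round(ap, 4))
```

Here precision is 1/1 at rank 1 and 2/3 at rank 3, so the average precision is (1 + 2/3) / 2 = 5/6.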

Page 17: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Experimental Results

FEATURES
s = Source features
w = Webcrawl (600M docs)
q = Query Logs (1 year)
t = Web tables (600M docs)
k = Wikipedia

Page 18: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Standalone Extractors gain: Actors

[Figure, two panels, precision (y-axis, 0 to 1) vs. rank (x-axis):
 Distributional: Precision vs. Rank analysis (distributional vs. ensemble)
 Pattern-based: Precision vs. Rank analysis (pattern-based vs. ensemble)]

Page 19: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Standalone Extractors gain: Athletes

[Figure, two panels, precision (y-axis, 0 to 1) vs. rank (x-axis):
 Distributional: Precision vs. Rank analysis (distributional vs. ensemble)
 Pattern-based: Precision vs. Rank analysis (pattern-based vs. ensemble)]

Page 20: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Standalone Extractors gain: Musicians

[Figure, two panels, precision (y-axis, 0 to 1) vs. rank (x-axis):
 Distributional: Precision vs. Rank analysis (distributional vs. ensemble)
 Pattern-based: Precision vs. Rank analysis (pattern-based vs. ensemble)]

Page 21: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Ensemble Effect: Athletes

[Figure: Precision vs. Rank Analysis; precision (y-axis, 0.00 to 1.00) vs. rank (x-axis ticks 1,153; 28,835; 56,518) for upper_bound, baseline, t-gbdt-t300]

Page 22: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

100 Most Frequent Entities in QL

free, google, you tube, girls, people, disney, baby, texas, air, spears, new york, …

<10% Precision

Takeaway: Harvesting knowledge requires different metrics than using the knowledge

• Editorial cleaning of entities covering 80% of queries
• Automatic cleaning of torso/tail
• Impact on query and document interpretation?
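The split between an editorially cleaned head and an automatically cleaned torso/tail depends on how few entities account for most query traffic. The Python sketch below finds the smallest set of top-frequency entities that reaches a target share of query volume. The frequencies are invented, the one-entity-per-query simplification is an assumption, and the function name is hypothetical.

```python
from typing import Dict


def head_size_for_coverage(freqs: Dict[str, int], target: float = 0.8) -> int:
    """Number of most-frequent entities needed to cover `target` of query volume."""
    total = sum(freqs.values())
    covered = 0
    ordered = sorted(freqs.items(), key=lambda kv: -kv[1])
    for n, (_, count) in enumerate(ordered, start=1):
        covered += count
        if covered >= target * total:
            return n
    return len(freqs)


# Made-up query counts per entity string.
freqs = {"free": 500, "google": 350, "you tube": 80,
         "texas": 40, "spears": 20, "air": 10}
head = head_size_for_coverage(freqs, target=0.8)
print(head)
```

In the toy data two of six entities already cover 85% of the volume, which is the kind of skew that makes manual cleaning of the head feasible.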

Page 23: Entity Extraction for Query Interpretation Patrick  Pantel ǂ

Conclusions
• Ensemble Semantics:
  – Draw from many knowledge sources
  – Apply different extraction biases
  – Leverage many signals of knowledge
• Significant gains in both coverage and precision on entity extraction
• BUT, is the knowledge useful?
  – A huge list of actors at 90% precision still results in many, many query interpretation errors!
    • Frequent terms tend to be ambiguous
    • Some highly accurate databases contain very bad errors (e.g., Texas and Baby are both actors in Y! Movies)
  – Further processing is necessary to make use of the knowledge…
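The gap between list-level precision and usefulness on queries can be made concrete with a small calculation in Python: weight each list entry by how often it appears in queries. All numbers below are invented for illustration, and the two function names are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[str, bool, int]  # (entity string, is it really an actor?, query frequency)


def list_precision(entries: List[Entry]) -> float:
    """Fraction of list entries that are correct, each entry counted once."""
    return sum(1 for _, ok, _ in entries if ok) / len(entries)


def query_weighted_precision(entries: List[Entry]) -> float:
    """Fraction of query occurrences whose matched entry is correct."""
    total = sum(freq for _, _, freq in entries)
    return sum(freq for _, ok, freq in entries if ok) / total


# Nine correct but rare actor names and one wrong but very frequent term.
entries: List[Entry] = [(f"actor {i}", True, 10) for i in range(9)]
entries.append(("texas", False, 910))

print(list_precision(entries))            # precision of the list itself
print(query_weighted_precision(entries))  # precision as experienced on queries
```

In this toy case the list is 90% precise, yet only 9% of the query occurrences it matches are interpreted correctly, because the single error is an ambiguous, frequent term.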