Beyond Bag of Words CS6200 Information Retrieval

Upload: lythuy

Post on 17-Sep-2018

Bags of Words

• Most efficient (and still very effective) retrieval models treat words/terms as independent

• Generalized models based on (log-)linear combination of features of both query and document

$S_W(D;Q) = \sum_j w_j \cdot f_j(D,Q)$
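As a rough illustration of the linear scoring rule on this slide, the Python sketch below computes the score as a weighted sum of feature functions. The two feature functions (`term_match`, `length_penalty`) and the weights are hypothetical choices made for the example; the slide does not say which features are used.

```python
import math
from collections import Counter


def term_match(doc_tokens, query_tokens):
    """Hypothetical feature: sum of log(1 + tf) over the query terms."""
    tf = Counter(doc_tokens)
    return sum(math.log(1 + tf[q]) for q in query_tokens)


def length_penalty(doc_tokens, query_tokens):
    """Hypothetical feature: negative log of the document length."""
    return -math.log(1 + len(doc_tokens))


def score(doc_tokens, query_tokens, weights, features):
    """S_W(D;Q) = sum_j w_j * f_j(D,Q) for paired weights and feature functions."""
    return sum(w * f(doc_tokens, query_tokens) for w, f in zip(weights, features))


features = [term_match, length_penalty]
weights = [1.0, 0.1]  # assumed values, normally learned from training data
doc_a = "president lincoln spoke".split()
doc_b = "the weather was mild".split()
query = ["lincoln"]
ranking = sorted([doc_a, doc_b],
                 key=lambda d: score(d, query, weights, features),
                 reverse=True)
```

Because each feature sees both the query and the document, richer evidence such as phrases or proximity can be added as further `f_j` without changing the scoring function.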

Term Dependence Models

[Figure: four graphical models of query-term dependence, labeled full independence, sequential dependence, full dependence, and general dependence]

Term Dependence Models

Sequential dependence

$S_W(D;Q) = \sum_j w_j \cdot f_j(D,Q)$

#weight(0.8 #combine(president abraham lincoln)
        0.1 #combine(#od:1(president abraham) #od:1(abraham lincoln))
        0.1 #combine(#uw:8(president abraham) #uw:8(abraham lincoln)))
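The query above follows a regular pattern, so it can be generated from the query terms. The sketch below is one way to do that in Python; the function name `sdm_query` and its parameters are hypothetical, and the default weights and window size are simply the values shown on the slide.

```python
def sdm_query(terms, w_unigram=0.8, w_ordered=0.1, w_unordered=0.1, window=8):
    """Build a sequential dependence query string in the slide's operator syntax.

    Adjacent query terms form bigrams. Each bigram is matched once as an exact
    ordered phrase (#od:1) and once in an unordered window (#uw:<window>).
    """
    unigrams = "#combine(" + " ".join(terms) + ")"
    bigrams = list(zip(terms, terms[1:]))
    if not bigrams:
        # A one-term query has no dependencies to model.
        return unigrams
    ordered = " ".join(f"#od:1({a} {b})" for a, b in bigrams)
    unordered = " ".join(f"#uw:{window}({a} {b})" for a, b in bigrams)
    return (f"#weight({w_unigram} {unigrams} "
            f"{w_ordered} #combine({ordered}) "
            f"{w_unordered} #combine({unordered}))")


query = sdm_query(["president", "abraham", "lincoln"])
```

Only three weights appear no matter how long the query is: one per class of feature (unigrams, ordered bigrams, unordered bigrams).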

Weights of different bigrams are tied

Therefore, estimate weights for classes of features, not for each individual bigram

Term Dependence Models

Full dependence

Features (with tied weights) for bigrams, trigrams, etc.

#weight(0.8 #combine(president abraham lincoln)
        0.1 #combine(#od:1(president abraham) #od:1(abraham lincoln)
                     #od:1(president abraham lincoln))
        0.1 #combine(#uw:8(president abraham) #uw:8(abraham lincoln)
                     #uw:8(president lincoln) #uw:12(president abraham lincoln)))
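The full dependence query can be generated in the same spirit. The Python sketch below rests on two assumptions read off the example rather than stated on the slide: ordered windows are built only for contiguous runs of query terms, and the unordered window width is 4 times the number of terms in the subset (8 for pairs, 12 for the triple). The name `fdm_query` and its parameters are hypothetical.

```python
from itertools import combinations


def fdm_query(terms, w_unigram=0.8, w_ordered=0.1, w_unordered=0.1,
              width_per_term=4):
    """Build a full dependence query string in the slide's operator syntax."""
    n = len(terms)
    unigrams = "#combine(" + " ".join(terms) + ")"
    if n < 2:
        return unigrams
    # Ordered features: every contiguous run of two or more query terms.
    ordered = [terms[i:i + k] for k in range(2, n + 1) for i in range(n - k + 1)]
    # Unordered features: every subset of two or more terms, in query order.
    unordered = [list(c) for k in range(2, n + 1) for c in combinations(terms, k)]
    od = " ".join("#od:1(" + " ".join(s) + ")" for s in ordered)
    uw = " ".join(f"#uw:{width_per_term * len(s)}(" + " ".join(s) + ")"
                  for s in unordered)
    return (f"#weight({w_unigram} {unigrams} "
            f"{w_ordered} #combine({od}) "
            f"{w_unordered} #combine({uw}))")


query = fdm_query(["president", "abraham", "lincoln"])
```

The sketch emits the same set of features as the slide's example, though the unordered pairs may appear in a different order. The number of unordered subsets grows exponentially with query length, which is one practical reason to tie weights by feature class.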

Term Dependence Models

Latent concept expansion

Unigram relevance model

Syntactic Dependencies

I did not unfortunately receive an answer to this question

Maximum directed spanning tree

For this 10-word sentence, $n^{n-1} = 10^9$ = 1 billion possible trees!
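The count comes from Cayley's formula for labeled rooted trees. The Python sketch below evaluates the closed form and, as a sanity check, enumerates head assignments by brute force for very small sentences. The function names are hypothetical, and the enumeration assumes exactly one root word per sentence.

```python
from itertools import product


def count_rooted_trees(n):
    """Closed form: number of labeled rooted trees on n nodes is n^(n-1)."""
    return n ** (n - 1)


def reaches_root(node, parent, root, n):
    """Follow head links from node; True if the root is reached without a cycle."""
    steps = 0
    while node != root:
        node = parent[node]
        steps += 1
        if steps > n:
            return False
    return True


def brute_force_count(n):
    """Try every head assignment with one root and count those that form a tree."""
    count = 0
    for root in range(n):
        others = [i for i in range(n) if i != root]
        # Each non-root word picks some word of the sentence as its head.
        for heads in product(range(n), repeat=len(others)):
            parent = dict(zip(others, heads))
            if any(child == head for child, head in parent.items()):
                continue  # a word cannot be its own head
            if all(reaches_root(node, parent, root, n) for node in others):
                count += 1
    return count


trees_for_sentence = count_rooted_trees(10)
```

Brute force is hopeless at n = 10, which is why parsers search for the maximum directed spanning tree with a dedicated algorithm instead of enumerating candidates.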

Cross-Language Syntax

Parser projection

Auf diese Frage habe ich leider keine Antwort bekommen

I did not unfortunately receive an answer to this question

Source language

Target language

Cross-Language Syntax

Tschernobyl könnte dann etwas später an die Reihe kommen

Then we could deal with Chernobyl some time later

Structures not isomorphic: use quasi-synchronous alignment features

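Alignment Configurations

• Monotonic (parent-child): German "sehe", "ich"; English "see", "I"

• Head swapping: German "schwimmt", "gern"; English "likes", "swimming"

• Two-to-one: German "Völkerrecht"; English "international", "law"

• Siblings: German "habe", "ich", "gekauft"; English "bought", "I"

• Grandparent-grandchild: German "Wahlkampf", "von", "2003"; English "campaign", "2003"

• Also C-command, descendant, “none of the above”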
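One way to read these configurations: take a child-parent pair in the English tree, look up the German words they align to, and ask how those two German words relate in the German tree. The Python sketch below does this for the five named configurations. The German tree structures and word alignments in the usage lines are assumptions reconstructed from the word labels, since the original drawings are not available, and `classify` is a hypothetical helper name.

```python
def classify(src_heads, a_child, a_parent):
    """Name the source-tree relation between the words aligned to a target
    child (a_child) and to its target parent (a_parent).

    src_heads maps each source word to its head word, or None for the root.
    """
    if a_child == a_parent:
        return "two-to-one"
    if src_heads.get(a_child) == a_parent:
        return "monotonic (parent-child)"
    if src_heads.get(a_parent) == a_child:
        return "head swapping"
    head_of_child = src_heads.get(a_child)
    if head_of_child is not None and head_of_child == src_heads.get(a_parent):
        return "siblings"
    if head_of_child is not None and src_heads.get(head_of_child) == a_parent:
        return "grandparent-grandchild"
    # C-command and descendant are not distinguished in this sketch.
    return "other"


# Assumed German trees (word -> head) for the slide's examples.
sehe = {"sehe": None, "ich": "sehe"}
schwimmt = {"schwimmt": None, "gern": "schwimmt"}
voelkerrecht = {"Völkerrecht": None}
habe = {"habe": None, "ich": "habe", "gekauft": "habe"}
wahlkampf = {"Wahlkampf": None, "von": "Wahlkampf", "2003": "von"}

labels = [
    classify(sehe, "ich", "sehe"),                      # I <- see
    classify(schwimmt, "schwimmt", "gern"),             # swimming <- likes
    classify(voelkerrecht, "Völkerrecht", "Völkerrecht"),  # international <- law
    classify(habe, "ich", "gekauft"),                   # I <- bought
    classify(wahlkampf, "2003", "Wahlkampf"),           # 2003 <- campaign
]
```

Each label can then serve as an indicator feature, so the model learns how often each kind of structural divergence occurs instead of requiring the two trees to be isomorphic.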

Quasi-Synchronous Dependence

Query: shih tzu health problems

Documents: ... Find out all the serious health problems that face the Shih Tzu. ...

• Syntactic models of document mismatch

  ✤ Developed (by me) for MT (SMT 2006; EMNLP 2009)

[Figure: bar chart of P@10, with the axis running from 0.39 to 0.47, for four systems: 1-gram LM, SDM, LM+QG, SDM+QG]

Entity Search

Joint distribution over queries and entities

Factorized distribution

Probability by proximity

Gaussian proximity kernel
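To make "probability by proximity" concrete, the Python sketch below weights each word by a Gaussian kernel on its distance to the nearest entity mentions and normalizes the weights into a distribution over terms. The function names, the normalization, and the value of `sigma` are assumptions for illustration; the slide's actual equations are not available in this transcript.

```python
import math


def gaussian_kernel(distance, sigma):
    """Weight that decays smoothly with token distance from an entity mention."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))


def proximity_term_distribution(tokens, entity_positions, sigma=5.0):
    """Estimate P(term | entity) in one document from kernel-weighted
    co-occurrence: words near a mention of the entity get more mass."""
    weights = {}
    for i, token in enumerate(tokens):
        if i in entity_positions:
            continue  # skip the entity mention itself
        w = sum(gaussian_kernel(i - j, sigma) for j in entity_positions)
        weights[token] = weights.get(token, 0.0) + w
    total = sum(weights.values())
    if total == 0:
        return {}
    return {term: w / total for term, w in weights.items()}


tokens = "lincoln delivered the gettysburg address in pennsylvania".split()
dist = proximity_term_distribution(tokens, entity_positions={0}, sigma=2.0)
```

A small `sigma` makes the model trust only words adjacent to the entity, while a large `sigma` approaches a plain document-level bag of words.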

Question Answering

• Human vs. machine: Compare search engine queries with, e.g., StackOverflow

• Current QA systems perform simple “factoid” retrieval

• Who invented the paper clip?

• Where is the Valley of the Kings?

• When was the last major eruption of Mt. St. Helens?

Question Answering

Example: Indri system used as first stage of IBM Watson

What kind of answer is the user expecting?
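For factoid questions such as "Who invented the paper clip?", a first guess at the expected answer type can come from the question word alone. The Python sketch below is a deliberately minimal rule-based version. The type labels and the `ANSWER_TYPES` table are assumptions for illustration, not the categories used by Indri or Watson.

```python
# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPES = {
    "who": "PERSON",
    "where": "LOCATION",
    "when": "DATE",
}


def expected_answer_type(question):
    """Guess the answer type from the first word of the question."""
    words = question.strip().lower().split()
    first = words[0] if words else ""
    return ANSWER_TYPES.get(first, "OTHER")


questions = [
    "Who invented the paper clip?",
    "Where is the Valley of the Kings?",
    "When was the last major eruption of Mt. St. Helens?",
]
types = [expected_answer_type(q) for q in questions]
```

Real systems need much more than the first word (for example, "What city ..." expects a location), so a table like this is only a starting point for a trained classifier.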

Answer Categorization

OCR Transcripts

Optical character recognition

ASR Transcripts

Automatic speech recognition

Image Tagging

Image Annotation

• Offline annotation

• Online pseudo-relevance feedback

  ✤ Retrieve images by text annotation

  ✤ Estimate text/image relevance model

  ✤ Rerank more images
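The pseudo-relevance feedback steps can be sketched as follows in Python. This is a simplified stand-in under stated assumptions: each image is a dictionary with hypothetical `id`, `tags`, and `features` fields, and the "relevance model" is reduced to the centroid of the visual feature vectors of the top text matches, which is much cruder than a joint text/image relevance model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prf_rerank(query_terms, images, k=2):
    """Rerank all images, tagged or not, using feedback from text matches."""
    query = set(query_terms)
    # Step 1: retrieve images whose text annotations match the query.
    matches = [img for img in images if query & img["tags"]]
    seed = sorted(matches, key=lambda img: len(query & img["tags"]),
                  reverse=True)[:k]
    if not seed:
        return []
    # Step 2: estimate a crude model, the centroid of the seeds' visual features.
    dim = len(seed[0]["features"])
    centroid = [sum(img["features"][d] for img in seed) / len(seed)
                for d in range(dim)]
    # Step 3: rerank every image by visual similarity to that model.
    ranked = sorted(images, key=lambda img: cosine(img["features"], centroid),
                    reverse=True)
    return [img["id"] for img in ranked]


images = [
    {"id": "a", "tags": {"dog"}, "features": [1.0, 0.0]},
    {"id": "b", "tags": {"dog", "puppy"}, "features": [0.9, 0.1]},
    {"id": "c", "tags": set(), "features": [0.95, 0.05]},
    {"id": "d", "tags": {"car"}, "features": [0.0, 1.0]},
]
ranking = prf_rerank(["dog"], images)
```

In this toy data the untagged image `c` can be retrieved for the query even though it has no annotation, which is the point of reranking "more images" than the text match alone returns.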
