2012 ICASSP Semantic Query Expansion and Context-based Discriminative Term Modeling for Spoken Document Retrieval Tsung-wei Tu, Hung-yi Lee, Yu-yu Chou, Lin-shan Lee Chen, Yi-wen Dept. of Computer Science & Information Engineering National Taiwan Normal University 2012/4/17


Page 1: Chen, Yi-wen Dept. of Computer Science & Information Engineering

2012 ICASSP

Semantic Query Expansion and Context-based Discriminative Term Modeling for Spoken Document Retrieval

Tsung-wei Tu, Hung-yi Lee, Yu-yu Chou, Lin-shan Lee

Chen, Yi-wen

Dept. of Computer Science & Information Engineering

National Taiwan Normal University

✩2012/4/17

Page 2:

Spoken Document Retrieval

$$S(Q,d)=\prod_{q\in Q} P(q\mid\theta_d)^{P(q\mid\theta_Q)} \tag{1}$$
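As a minimal sketch of how (1) might be evaluated, the product can be computed in the log domain, where it becomes a sum of log document probabilities weighted by the query model. The dictionaries, the function name `log_score`, and the `floor` value used to avoid log(0) are illustrative assumptions, not part of the slides.

```python
import math

def log_score(query_model, doc_model, floor=1e-12):
    """log S(Q,d) = sum over q of P(q|theta_Q) * log P(q|theta_d).

    query_model: dict term -> P(q|theta_Q)
    doc_model:   dict term -> P(q|theta_d)
    floor:       hypothetical lower bound that stands in for smoothing
    """
    total = 0.0
    for q, weight in query_model.items():
        total += weight * math.log(max(doc_model.get(q, 0.0), floor))
    return total

# Toy models (made-up numbers, for illustration only).
query_model = {"speech": 0.5, "retrieval": 0.5}
doc_a = {"speech": 0.2, "retrieval": 0.1, "lattice": 0.7}
doc_b = {"speech": 0.01, "retrieval": 0.01, "music": 0.98}

score_a = log_score(query_model, doc_a)
score_b = log_score(query_model, doc_b)
```

Ranking by the log score gives the same order as ranking by S(Q,d), since the logarithm is monotonic.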

Text Information Retrieval

Speech Information Retrieval

High recognition errors!!

One way to handle this problem is to estimate the probability from lattices to include many recognition hypotheses.

Page 3:

Document Model

SD

$$P(w\mid X)=\sum_{u\in W(X)} \frac{N(w,u)}{|u|}\,P(u\mid X) \tag{2}$$

A spoken document is first divided into spoken segments, and then each spoken segment is transcribed into a lattice.

$u$ is a word sequence in the lattice. $W(X)$ is the set of all possible word sequences in the lattice for $X$. $P(u\mid X)$ is the posterior probability of the word sequence $u$, derived from the acoustic and language models. $|u|$ is the number of word arcs in $u$. $N(w,u)$ is the occurrence count of the term $w$ in $u$.

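The probability that the acoustic and language models select $u$.

The average probability of occurrence per arc.

Within this segment, the sum over all possible word sequences of the average probability that a word arc occurs.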
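The following sketch illustrates (2) on a toy example. It assumes the lattice has already been expanded into an explicit list of word sequences with their posteriors, which is a simplification: a real lattice would normally be processed without enumerating every path. The function name and the data are hypothetical.

```python
def term_probability(term, hypotheses):
    """P(w|X) = sum over u of N(w,u)/|u| * P(u|X).

    hypotheses: list of (words, posterior) pairs, one per word sequence u
    in W(X); the posteriors are assumed to sum to 1.
    """
    total = 0.0
    for words, posterior in hypotheses:
        if words:  # skip empty sequences to avoid division by zero
            total += words.count(term) / len(words) * posterior
    return total

# Two competing recognition hypotheses for one spoken segment (made up).
hypotheses = [
    (["spoken", "document", "retrieval"], 0.6),
    (["spoken", "document", "retrieve", "all"], 0.4),
]

p_retrieval = term_probability("retrieval", hypotheses)  # 1/3 * 0.6
p_spoken = term_probability("spoken", hypotheses)        # 1/3 * 0.6 + 1/4 * 0.4
```

Because each hypothesis contributes its posterior split evenly across its arcs, the values of P(w|X) over all terms sum to 1 whenever the posteriors do.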

Page 4:

Document Model

$$L_X=\sum_{u\in W(X)} |u|\,P(u\mid X) \tag{3}$$

Expected length of the segment.

The sum, over the word sequences $u$ in $X$, of the average number of word arcs each one contains.

$$P(w\mid\theta_d)=\frac{\sum_{n=1}^{N} L_{X_n}\,P(w\mid X_n)}{\sum_{n=1}^{N} L_{X_n}} \tag{4}$$

The probability that $w$ occurs in the document, at the lattice level.
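A small sketch of (3) and (4) may help: each segment's term probability is weighted by the segment's expected length, and the result is normalized by the total expected length. As before, each segment is assumed to be given as an explicit list of (word sequence, posterior) pairs; all names and numbers are illustrative.

```python
def expected_length(hypotheses):
    """L_X = sum over u of |u| * P(u|X), as in Eq. (3)."""
    return sum(len(words) * posterior for words, posterior in hypotheses)

def term_probability(term, hypotheses):
    """P(w|X) as in Eq. (2)."""
    return sum(words.count(term) / len(words) * posterior
               for words, posterior in hypotheses if words)

def document_probability(term, segments):
    """P(w|theta_d) as in Eq. (4): length-weighted average over segments."""
    lengths = [expected_length(h) for h in segments]
    weighted = sum(length * term_probability(term, h)
                   for length, h in zip(lengths, segments))
    return weighted / sum(lengths)

# A toy document with two spoken segments (made-up hypotheses).
segment_1 = [
    (["spoken", "document", "retrieval"], 0.6),
    (["spoken", "document", "retrieve", "all"], 0.4),
]
segment_2 = [(["document"], 1.0)]
segments = [segment_1, segment_2]

# L_X1 = 3*0.6 + 4*0.4 = 3.4, L_X2 = 1.0
# P(document|X1) = 0.3,        P(document|X2) = 1.0
p_document = document_probability("document", segments)  # 2.02 / 4.4
```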

$$S(Q,d)=\prod_{q\in Q} P(q\mid\theta_d)^{P(q\mid\theta_Q)} \tag{1}$$

Here we borrow the query-regularized mixture model originally proposed for text information retrieval for query expansion.

$$P(d\mid\alpha_d,\theta_Q')=\prod_{w\in d}\bigl(\alpha_d\,P(w\mid\theta_Q')+(1-\alpha_d)\,P(w\mid\theta_B)\bigr)^{P(w\mid\theta_d)} \tag{5}$$

$\alpha_d$: the weight for each of the top M documents.

$\theta_Q'$: the expanded query model.
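To make (5) concrete, the sketch below evaluates the log-likelihood of one feedback document under the mixture of the expanded query model and the background model. It assumes the background model assigns nonzero probability to every word in the document; the function name and toy numbers are hypothetical.

```python
import math

def log_doc_likelihood(doc_model, alpha, expanded_query, background):
    """log P(d | alpha_d, theta_Q') following Eq. (5).

    doc_model:      dict word -> P(w|theta_d), used as the exponent
    alpha:          mixture weight alpha_d for this document
    expanded_query: dict word -> P(w|theta_Q')
    background:     dict word -> P(w|theta_B), assumed > 0 for words in d
    """
    total = 0.0
    for w, p_wd in doc_model.items():
        mix = (alpha * expanded_query.get(w, 0.0)
               + (1.0 - alpha) * background.get(w, 0.0))
        total += p_wd * math.log(mix)
    return total

# Toy models (made-up numbers).
doc_model = {"speech": 0.5, "the": 0.5}
expanded_query = {"speech": 0.8, "lattice": 0.2}
background = {"speech": 0.1, "the": 0.8, "lattice": 0.1}

ll_half = log_doc_likelihood(doc_model, 0.5, expanded_query, background)
ll_zero = log_doc_likelihood(doc_model, 0.0, expanded_query, background)
```

With `alpha = 0` the document is explained by the background model alone, which gives a baseline against which other values of the weight can be compared.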

Query Model

Page 5:

Document Model

Note that here the positive and negative examples of each term are selected automatically in an unsupervised way, similar to the scenario of pseudo-relevance feedback.

$$\hat{P}(w\mid X)=\begin{cases} P(w\mid X)\left(\dfrac{1}{1+\exp(-d_w(X))}\right)^{\alpha}, & d_w(X)<0\\[6pt] P(w\mid X), & \text{otherwise} \end{cases} \tag{11}$$
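The rescoring rule in (11) can be sketched as follows. The transcript does not define $d_w(X)$; the sketch treats it as a real-valued score from the term's discriminative model, where a negative value suggests the term is unlikely to be present. That reading, the function name, and the default `alpha` are assumptions.

```python
import math

def discriminative_probability(p_wx, d_wx, alpha=1.0):
    """Rescore P(w|X) following Eq. (11).

    p_wx:  lattice-based probability P(w|X)
    d_wx:  discriminative score d_w(X), assumed here to be a real number
    alpha: exponent controlling how strongly negative scores are penalized
    """
    if d_wx < 0:
        # exp(d) / (1 + exp(d)) equals 1 / (1 + exp(-d)) and does not
        # overflow for very negative d.
        sigmoid = math.exp(d_wx) / (1.0 + math.exp(d_wx))
        return p_wx * sigmoid ** alpha
    return p_wx

kept = discriminative_probability(0.4, 0.5)         # unchanged
penalized = discriminative_probability(0.4, -2.0)   # scaled down
harsher = discriminative_probability(0.4, -2.0, alpha=2.0)
```

Only terms with a negative score are scaled down, so the rule can lower the probability of likely misrecognized terms but never raises any probability.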

Page 6:

Semantic Query Expansion

$$P(\theta_Q')=\prod_{w} P(w\mid\theta_Q')^{P(w\mid\theta_Q)} \tag{6}$$

$\theta_Q'$: the expanded query model.

$$F(\alpha_d,\theta_Q')=P(\theta_Q')\prod_{d\in D} P(d\mid\alpha_d,\theta_Q') \tag{7}$$
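As a rough illustration, the objective in (7) can be evaluated in the log domain as the log prior from (6) plus the sum of the per-document log-likelihoods from (5). The sketch only evaluates the objective for candidate parameters; the optimization procedure itself is not shown in the transcript and is not attempted here. All names and numbers are hypothetical.

```python
import math

def log_objective(expanded_query, alphas, doc_models, original_query, background):
    """log F(alpha_d, theta_Q') = log P(theta_Q') + sum_d log P(d|alpha_d, theta_Q')."""
    # Log prior, Eq. (6). Words with P(w|theta_Q) = 0 contribute a factor of 1,
    # so only the original query words need to be visited.
    log_prior = sum(p * math.log(expanded_query[w])
                    for w, p in original_query.items())
    # Log-likelihood of each feedback document, Eq. (5).
    log_lik = 0.0
    for alpha, doc in zip(alphas, doc_models):
        for w, p_wd in doc.items():
            mix = alpha * expanded_query.get(w, 0.0) + (1.0 - alpha) * background[w]
            log_lik += p_wd * math.log(mix)
    return log_prior + log_lik

# Toy setup with a single feedback document (made-up numbers).
original_query = {"speech": 1.0}
background = {"speech": 0.1, "lattice": 0.1, "the": 0.8}
doc_models = [{"speech": 0.4, "lattice": 0.4, "the": 0.2}]
alphas = [0.5]

candidate_a = {"speech": 0.5, "lattice": 0.5}  # expands toward a content word
candidate_b = {"speech": 0.5, "the": 0.5}      # expands toward a common word

f_a = log_objective(candidate_a, alphas, doc_models, original_query, background)
f_b = log_objective(candidate_b, alphas, doc_models, original_query, background)
```

In this toy case the candidate that adds "lattice" scores higher than the one that adds "the", because the background model already accounts for the common word.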

The parameters $\alpha_d$ and $\theta_Q'$ are then estimated by maximizing the objective function (7).

Instead of estimating a query-dependent language model $\theta_Q'$ for the word distribution, we now seek to estimate a query-dependent language model $\theta_Q^T$ for the distribution of latent topics $\{T_k\}$. We assume the probabilities $P(w\mid T_k)$ of observing all words given each latent topic are available; they are obtained from Probabilistic Latent Semantic Analysis (PLSA)… $P(w\mid\theta_Q')$ in (5) and (6) is replaced by $P(w\mid\theta_Q^T)$. The parameters $\alpha_d$ and $\theta_Q^T$ are similarly estimated by maximizing an objective function that parallels (7).

, (8)

Solve for $\theta_Q^T$!

Page 7:

Semantic Query Expansion

The prob. to be used in (1) is then

$$P(w\mid\theta_Q^T)=\sum_{k=1}^{K} P(w\mid T_k)\,P(T_k\mid\theta_Q^T) \tag{9}$$

$$S(Q,d)=\prod_{q\in Q} P(q\mid\theta_d)^{P(q\mid\theta_Q)} \tag{1}$$

This probability can be further interpolated with the probability obtained by maximizing (7).
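The topic-based query probability in (9) and the interpolation mentioned above can be sketched as follows. The slides do not state the form of the interpolation; linear interpolation with a weight `lam` is assumed here purely for illustration, and the topic distributions are made-up numbers rather than PLSA output.

```python
def topic_query_probability(word, word_given_topic, topic_given_query):
    """P(w|theta_Q^T) = sum over k of P(w|T_k) * P(T_k|theta_Q^T), Eq. (9).

    word_given_topic:  list of dicts, one per topic, word -> P(w|T_k)
    topic_given_query: list of P(T_k|theta_Q^T), same order as the topics
    """
    return sum(word_given_topic[k].get(word, 0.0) * p_k
               for k, p_k in enumerate(topic_given_query))

def interpolate(p_topic, p_word, lam=0.5):
    """Hypothetical linear interpolation of the topic-based and word-based
    query probabilities; lam is an assumed tuning weight."""
    return lam * p_topic + (1.0 - lam) * p_word

# Two toy topics (illustrative values only).
word_given_topic = [
    {"speech": 0.6, "audio": 0.4},
    {"music": 0.7, "audio": 0.3},
]
topic_given_query = [0.9, 0.1]

p_audio = topic_query_probability("audio", word_given_topic, topic_given_query)
p_mixed = interpolate(p_audio, 0.10, lam=0.5)
```

Because each topic distribution sums to 1 and the topic weights sum to 1, the resulting word distribution also sums to 1, so it can be used directly as the query model in (1).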

Page 8:

Experimental Results

Page 9:

Experimental Results

Page 10:

References

1) Tsung-wei Tu, Hung-yi Lee, Yu-yu Chou and Lin-shan Lee, “Semantic Query Expansion and Context-based Discriminative Term Modeling for Spoken Document Retrieval”, in ICASSP, 2012.

2) Tao Tao and ChengXiang Zhai, “Regularized estimation of mixture models for robust pseudo-relevance feedback”, in SIGIR, 2006.