Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Page 1: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Hang Cui, Min-Yen Kan and Tat-Seng Chua

Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

1/28

Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Hang Cui, Min-Yen Kan, Tat-Seng Chua

{cuihang, kanmy, chuats}@comp.nus.edu.sg

School of Computing, NUS, Singapore

Page 2: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Problem

• To answer “Who is Bob Woodward” and “What is SARS” questions.
  – A large portion of queries in search logs (Voorhees 2001).

• Where to get definitions
  – Dictionaries, encyclopedias, online glossaries ...
  – Online news: “new terms” (e.g. Sasser)

• In this paper, we
  – deal with recently popular terms and people.
  – identify definition sentences from online news.
  – distill search engine results to definitions.

Page 3: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

“In the News” from Google (Apr 23, 2004)

In the News

Bob Woodward, SARS, Vietnam War, Yasser Arafat, George W. Bush, Marine Corps, Gaza Strip, Kofi Annan, Mitsubishi Motors, Alan Greenspan, First Quarter, Maurice Clarett

Page 4: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

“In the News” from Google (Apr 23, 2004)

A list of relevant documents rather than a direct answer

Page 5: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Our Solution – DefSearch

Bob Woodward

Woodward, an Office of Naval Intelligence (ONI) asset, interviewed over 75 Bush Cabal insiders. (CNN)

Woodward, who had previously endeared himself to the Bush Administration with his pandering portrait of the President in "Bush at War", has launched a blistering assault on White House credibility with his new book, "Plan of Attack". (NY Times)

People close to Mr. Powell said Sunday that they had no doubt he would weather any criticism from within over his apparent cooperation with Mr. Woodward, an assistant managing editor at The Washington Post. (CNN)

The book, called Plan of Attack, is written by Bob Woodward, the respected journalist who helped break open the Watergate scandal. The book is based on interviews with 75 people, including Bush, and is due for release Tuesday. (REUTERS)

Bob Woodward, the famous Watergate reporter has interviewed President Bush and other Whitehouse "insiders". As a result of the interview, Woodward might have done more damage to the Presidents re-election cause than anyone since Richard Clarkes interview on the same program and the recent events in Spain might be an indication as to how the world is beginning to view President Bush. (ABC News)

Page 6: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Behind DefSearch

[Figure: DefSearch architecture diagram. Component labels: User, Query, DefSearch, IR Engine, Pattern Matching, Sentence Selection, Best Sentences, Definition Sentences, Result, Definition Patterns, Training, Pattern Instances.]

Page 7: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Outline

• How Do Current Systems Identify Definitions?
• What are Soft Patterns?
• Matching Soft Patterns
• Unsupervised Learning of Soft Patterns
• Evaluations
• Conclusion and Future Work

Page 8: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

How Do Current Systems Identify Definitions?

• Most current systems use hand-crafted patterns
  – Appositives
    • e.g. Gunter Blobel, a cellular and molecular biologist, ...
  – Copulas
    • e.g. Battery is a kind of electronic device ...
  – Predicates (relations)
    • e.g. TB is usually caused by ...

• Current work on definition sentence identification
  – Domain-specific definition generation systems
    • e.g. topic-specific definitions on the Web and biographies.
  – Definitional QA Task at TREC 2003

Page 9: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Weaknesses of Current Pattern Matching Methods

• Lack of flexibility: hard matching
  – Pattern: <SCH_TERM> , also known as
      TB , also known as Tuberculosis , ...
      TB ( also known as Tuberculosis ) ...   (mismatch)
  – Variations make hard matching fail
  – Introduce Soft Patterns with greater flexibility

• Manual labor
  – Introduce unsupervised learning by Group Pseudo-Relevance Feedback (GPRF).
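
The hard-matching failure on this slide can be reproduced with a small sketch. The regular expression below is only an assumed rendering of the pattern "<SCH_TERM> , also known as"; the function name `hard_match` is hypothetical and not part of DefSearch.

```python
import re

def hard_match(search_term, sentence):
    """Return True if the sentence matches '<SCH_TERM> , also known as' literally."""
    # Assumed regex form of the hand-crafted pattern shown on the slide.
    pattern = re.escape(search_term) + r"\s*,\s*also known as\b"
    return re.search(pattern, sentence) is not None

# The comma variant matches; the parenthesised variant does not.
print(hard_match("TB", "TB , also known as Tuberculosis , ..."))
print(hard_match("TB", "TB ( also known as Tuberculosis ) ..."))
```

A hand-crafted rule set would need one extra rule per surface variation, which is the cost that soft patterns are meant to avoid.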

Page 10: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Outline

• How Do Current Systems Identify Definitions?
• What are Soft Patterns?
• Matching Soft Patterns
• Unsupervised Learning of Soft Patterns
• Evaluations
• Conclusion and Future Work

Page 11: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

What are Soft Patterns?

• Soft patterns allow partial matching
    TB ( also known as Tuberculosis ) ...
    P( ( |Slot1) = 0.001, P(also|Slot2) = 0.21, P(known|Slot3) = 0.33, P(as|Slot4) = 0.13
    P(Matching) = 0.23 : still better than non-definition sentences.

• How does it work?
  – Training: accumulating pattern instances in a vector.
    • Derive pattern instances from labeled definition sentences.
  – Matching with a probabilistic model, not regular expressions.
    • Using statistical information from all pattern instances, not generalized rules.
    • Instance-based learning.

Page 12: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Preparing Pattern Instances

The channel Iqra is owned by the Arab Radio and Television company and is the brainchild of the Saudi millionaire, Saleh Kamel.

Step 1: POS tagging and noun phrase chunking.

The_DT channel_NN Iqra_NNP is_VBZ owned_VBN by_IN NNP company_NN and_CC is_VBZ the_DT brainchild_NN of_IN NNP.

Step 2: Selective substitution. Replace specific words with more general tags. Other tokens remain unchanged.

DT$ NN <SEARCH_TERM> BE$ owned by DT$ NNP and BE$ DT$ NN of NNP.

Page 13: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Preparing Pattern Instances – Cont’d

Step 3: Crop a text window around the tag “<SCH_TERM>” (window size = 3 for each side).

Pattern instance: DT$ NN <SCH_TERM> BE$ owned by
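
Steps 2 and 3 can be sketched as follows, starting from tokens that are already POS-tagged and chunked (Step 1 is not implemented here). The substitution rules, the `BE_FORMS` set and the function names are illustrative assumptions; the slides do not list the exact rules used.

```python
SEARCH_TAG = "<SCH_TERM>"
# Assumed list of copula forms that are generalised to BE$.
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def substitute(tagged_tokens, search_term):
    """Step 2 (assumed rules): replace specific words with general tags."""
    out = []
    for word, tag in tagged_tokens:
        if word == search_term:
            out.append(SEARCH_TAG)
        elif word.lower() in BE_FORMS:
            out.append("BE$")
        elif tag == "DT":
            out.append("DT$")
        elif tag in ("NN", "NNP"):
            out.append(tag)       # nouns and noun phrases become their tag
        else:
            out.append(word)      # other tokens remain unchanged
    return out

def crop_window(tokens, w=3):
    """Step 3: keep at most w tokens on each side of the search term tag."""
    center = tokens.index(SEARCH_TAG)
    return tokens[max(0, center - w): center + w + 1]

# Hypothetical pre-tagged input for the example sentence on the slide.
tagged = [("The", "DT"), ("channel", "NN"), ("Iqra", "NNP"), ("is", "VBZ"),
          ("owned", "VBN"), ("by", "IN"), ("the", "DT"),
          ("Arab Radio and Television company", "NNP")]
instance = crop_window(substitute(tagged, "Iqra"), w=3)
print(" ".join(instance))
```

Only two tokens precede the search term in this sentence, so the left side of the window is shorter than w.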

Page 14: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Illustration of Soft Pattern Generation

Example sentences:
  ... The channel Iqra is owned by the ...
  ... severance packages, known as golden parachutes, included ...
  A battery is a cell which can provide electricity.

Pattern instances:
  DT$ NN <Search_Term> BE$ owned by
  known as <Search_Term> , VB
  <Search_Term> BE$ DT$

Soft pattern (token probabilities per slot):

  ... <Slot-2>     <Slot-1>   <Search_Term>   <Slot1>    <Slot2> ...
      NN 0.12      NN 0.11                    , 0.40     DT$ 0.2
      known 0.09   as 0.20                    BE$ 0.2    VB 0.1
      DT$ 0.04                                           owned 0.09

<Slot-w, ..., Slot-2, Slot-1, SEARCH_TERM, Slot1, Slot2, ..., Slotw : Pa>
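
A minimal sketch of how pattern instances might be accumulated into the soft pattern vector Pa. It estimates each slot's token probabilities by plain relative frequency, which is an assumption (the slides do not say whether smoothing is applied), and `build_soft_pattern` is a hypothetical name.

```python
from collections import Counter, defaultdict

SEARCH_TAG = "<SCH_TERM>"

def build_soft_pattern(instances):
    """Map slot offset (-w..-1, 1..w) to a token probability distribution."""
    counts = defaultdict(Counter)
    for tokens in instances:
        center = tokens.index(SEARCH_TAG)
        for i, token in enumerate(tokens):
            if i != center:
                counts[i - center][token] += 1
    # Relative frequency per slot (assumed estimator, no smoothing).
    return {
        slot: {tok: n / sum(c.values()) for tok, n in c.items()}
        for slot, c in counts.items()
    }

instances = [
    ["DT$", "NN", SEARCH_TAG, "BE$", "owned", "by"],
    ["known", "as", SEARCH_TAG, ",", "VB"],
    [SEARCH_TAG, "BE$", "DT$"],
]
pa = build_soft_pattern(instances)
print(pa[1])   # distribution over tokens seen directly after the search term
```

With only the three instances from this slide the numbers differ from the table, which is built from a larger set of instances.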

Page 15: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Outline

• How Do Current Systems Identify Definitions?
• What are Soft Patterns?
• Matching Soft Patterns – Addressing Flexibility
• Unsupervised Learning of Soft Patterns
• Evaluations
• Conclusion and Future Work

Page 16: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Matching Soft Patterns

• Test sentences are reduced to a vector S using the same strategy.
    <token-w, ..., token-1, SEARCH_TERM, token1, ..., tokenw : S>

• Matching soft patterns: similarity between the pattern vector Pa and the test vector S.
  – Independent slot content similarity.
  – Slot sequence fidelity.

Page 17: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Probabilistic Matching Degree

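• Individual slot similarity – independent assumption

$$\mathrm{Pa\_weight}_{Slots} = \Pr(S \mid Pa) = \prod_{i=-w}^{w} \Pr(token_i \mid Slot_i)$$

• Sequence fidelity – bigram model

$$\Pr(right\_seq) = \Pr(token_1, token_2, \ldots, token_w \mid Pa) = P(token_1)\,P(token_2 \mid token_1) \cdots P(token_w \mid token_{w-1})$$

$$\mathrm{Pa\_weight}_{Seq} = (1-\alpha)\,\Pr(left\_seq \mid Pa) + \alpha\,\Pr(right\_seq \mid Pa)$$

• Combined to get the matching degree

$$\mathrm{Pattern\_weight} = \frac{\mathrm{Pa\_weight}_{Slots} \times \mathrm{Pa\_weight}_{Seq}}{fragment\_length}$$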
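
The matching degree can be sketched in code under stated assumptions: the slot weight is a product of per-slot token probabilities, each side's sequence probability is a bigram chain, and the two sides are interpolated by a weight `alpha`. The value of `alpha`, the `floor` probability for unseen tokens, and all function names are assumptions for illustration only.

```python
def slot_weight(pattern, tokens_by_slot, floor=1e-3):
    """Product of Pr(token_i | Slot_i) over the slots of the test vector S."""
    weight = 1.0
    for slot, token in tokens_by_slot.items():
        weight *= pattern.get(slot, {}).get(token, floor)
    return weight

def seq_prob(seq, unigram, bigram, floor=1e-3):
    """Bigram chain: P(t1) * P(t2 | t1) * ... * P(tw | tw-1)."""
    if not seq:
        return 1.0
    prob = unigram.get(seq[0], floor)
    for prev, cur in zip(seq, seq[1:]):
        prob *= bigram.get((prev, cur), floor)
    return prob

def pattern_weight(slots_w, left_p, right_p, fragment_length, alpha=0.5):
    """Combine slot similarity and sequence fidelity, normalised by length."""
    seq_w = (1 - alpha) * left_p + alpha * right_p
    return slots_w * seq_w / fragment_length

# Toy probabilities, not taken from the slides.
pattern = {1: {"BE$": 0.5}, 2: {"DT$": 0.2}}
s_w = slot_weight(pattern, {1: "BE$", 2: "DT$"})
r_p = seq_prob(["BE$", "DT$"], {"BE$": 0.5}, {("BE$", "DT$"): 0.4})
print(pattern_weight(s_w, left_p=1.0, right_p=r_p, fragment_length=2))
```

The floor keeps a single unseen token from zeroing the whole score, which is one simple way to get the partial matching that soft patterns aim for.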

Page 18: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Outline

• How Do Current Systems Identify Definitions?
• What are Soft Patterns?
• Matching Soft Patterns – Flexibility
• Unsupervised Learning of Soft Patterns – Addressing Manual Labor
• Evaluations
• Conclusion and Future Work

Page 19: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Unsupervised Labeling of Definition Sentences using GPRF

• Pattern instances are obtained from labeled definition sentences.
  – Manual labeling is too expensive.

• Pseudo-relevance feedback in document retrieval
  – Take the top n ranked documents as relevant.

• We employ Group pseudo-relevance feedback (GPRF)
  – Statistical ranking: centroid based method.
  – Perform PRF over a group of questions (top 10 sentences for each question).
  – Generate soft patterns from all auto-labeled sentences for all questions.
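
The group labeling step can be sketched as below: each question's sentences are ranked by some statistical score, the top 10 are treated as definition sentences, and the results are pooled across questions. The `score` callable stands in for the centroid based ranking and is a placeholder assumption, as is the function name `gprf_label`.

```python
def gprf_label(sentences_by_question, score, top_n=10):
    """Pool the top_n ranked sentences of every question as pseudo-labeled data."""
    pooled = []
    for question, sentences in sentences_by_question.items():
        ranked = sorted(sentences, key=lambda s: score(question, s), reverse=True)
        pooled.extend((question, s) for s in ranked[:top_n])
    return pooled

# Toy scorer (assumption): count occurrences of the search term.
def toy_score(question, sentence):
    return sentence.split().count(question)

data = {
    "SARS": ["SARS is a respiratory disease", "weather today", "SARS SARS outbreak"],
}
labeled = gprf_label(data, toy_score, top_n=2)
print(labeled)
```

Soft patterns would then be generated from every sentence in the pooled list, regardless of which question it came from.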

Page 20: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Analysis of GPRF

• Assumption 1: some definition sentences can be ranked high using a statistical method.
  – Word co-occurrence metrics can model descriptive sentences well.
    • Over 33% of top ranked sentences are definitional.
  – Noise introduced in each question’s top list can be mitigated by the group strategy.

• Assumption 2: definition patterns are general and can be used across questions.

Page 21: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Outline

• How Do Current Systems Identify Definitions?
• What are Soft Patterns?
• Matching Soft Patterns – Flexibility
• Unsupervised Learning of Soft Patterns
• Evaluations
• Conclusion and Future Work

Page 22: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Evaluation Setup

• Two experiments
  – To evaluate the effectiveness of our method on a community-standard corpus.
    • TREC QA corpus: about 1M news articles.
    • 50 definitional questions with answer nuggets.
  – To assess the adaptability of the system to actual online news and recent questions.
    • 26 questions from Lycos.
    • Up to 200 news articles from each of eight news sites (e.g. CNN and BBC) for each question.

• Comparison systems
  – Baseline system: centroid based ranking (IR).
  – A top ranked definitional question answering system at TREC 2003: HCR
    • Hand-crafted definition patterns (a man-month of time to construct).

Page 23: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Evaluation Metrics

• Based on given answer nuggets.
  – The most essential information about the target.
  – Judged by human assessors.

• Nugget Precision (NP)
  – Penalty to longer answers.

• Nugget Recall (NR)
  – Proportion of returned nuggets to vital nuggets.

• F5-measure (weighting NR 5 times as much as NP)
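
The slide only states that recall is weighted five times as much as precision. A sketch assuming the standard F-beta form with beta = 5, as used for TREC definitional questions, looks like this; `f_measure` is a hypothetical helper name.

```python
def f_measure(nugget_precision, nugget_recall, beta=5.0):
    """Standard F-beta: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if nugget_precision == 0 and nugget_recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * nugget_precision * nugget_recall / (b2 * nugget_precision + nugget_recall)

# With beta = 5, a drop in recall costs far more than the same drop in precision.
print(f_measure(0.9, 0.5))
print(f_measure(0.5, 0.9))
```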

Page 24: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Evaluations on TREC Corpus

• Pattern matching has significant impact on definition sentence identification.

• Soft patterns are more effective for news text.

System                F5 measure   % improvement (over baseline)   % improvement (over HCR)
Centroid (Baseline)   0.423
HCR                   0.472        11.52%
SP+GPRF (w = 1)       0.507        19.65%                          7.29%
SP+GPRF (w = 2)       0.539        27.20%                          14.06%
SP+GPRF (w = 3)       0.531        25.37%                          12.42%
SP+GPRF (w = 4)       0.495        16.97%                          4.88%
SP+GPRF (w = 5)       0.484        14.35%                          2.54%

Page 25: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Evaluations on the Web Corpus

• Using two sets of soft patterns.
  – More pattern instances lead to better performance (683 from TREC vs. 375 from Lycos).

• Soft patterns are general enough to be applied to other corpora.
  – Makes offline training possible.

System                     F5 measure   % improvement (over baseline)
Centroid (baseline)        0.492
HCR                        0.555        12.82%
SP+GPRF (Lycos patterns)   0.611        24.04%
SP+GPRF (TREC patterns)    0.642        30.33%

Page 26: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Outline

• How Do Current Systems Identify Definitions?
• What are Soft Patterns?
• Matching Soft Patterns – Flexibility
• Unsupervised Learning of Soft Patterns
• Evaluations
• Conclusion and Future Work

Page 27: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Conclusions and Future Work

• Current definition pattern matching has weaknesses
  – Lack of flexibility
  – Manual labor

• We address them by
  – Soft patterns
  – Unsupervised learning by Group PRF

• Soft patterns prove to be effective in Web-based definition generation systems.

• Future work
  – Soft patterns in information extraction and factoid question answering.

Page 28: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Q & A

Thanks!

Try our online demo at http://www-appn.comp.nus.edu.sg/~cuihang/DefSearch/DefSearch.htm

Page 29: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Statistical Ranking – Centroid Word Weighting

• Words are weighted by their co-occurrences with the search target.

• Words with centrality weights beyond a predefined threshold form a centroid vector.

• Cosine similarity with the centroid vector is used to rank the sentences.

• Sentences ranked highest by the centroid vector are deemed definition sentence candidates.

$$Centrality_{sch\_term}(w) = \frac{\log(Co(w, sch\_term) + 1)}{\log(sf(w) + 1) + \log(sf(sch\_term) + 1)} \times idf(w)$$
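
A sketch of the centroid weighting and ranking, assuming that sf counts the sentences containing a word, Co counts the sentences in which the word and the search term co-occur, and idf comes from an external table. These readings of the symbols, the default idf of 1.0 and all function names are assumptions, not details given on the slide.

```python
import math

def centrality(word, sentences, search_term, idf):
    """log(Co + 1) / (log(sf(w) + 1) + log(sf(term) + 1)) * idf(w)."""
    sf_w = sum(1 for s in sentences if word in s)
    sf_t = sum(1 for s in sentences if search_term in s)
    co = sum(1 for s in sentences if word in s and search_term in s)
    denom = math.log(sf_w + 1) + math.log(sf_t + 1)
    if denom == 0:
        return 0.0
    return math.log(co + 1) / denom * idf.get(word, 1.0)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Each sentence is a set of tokens (toy data).
sentences = [{"sars", "virus"}, {"sars", "virus"}, {"weather"}]
vocab = set().union(*sentences) - {"sars"}
centroid = {w: centrality(w, sentences, "sars", idf={}) for w in vocab}
centroid = {w: c for w, c in centroid.items() if c > 0.1}   # assumed threshold
ranked = sorted(sentences, key=lambda s: cosine({t: 1.0 for t in s}, centroid), reverse=True)
print(centroid, ranked[0])
```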

Page 30: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Sentence Selection

• We adopt a variation of Maximal Marginal Relevance (MMR) to summarize the definition sentences.
  – To ensure relevance and to avoid redundancy.

• Examine only the top ranked sentences and stop when the length of the definition is reached.
  – Different from MMR, which examines all sentences.
  – Due to the noisy input sentences.
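
A minimal sketch of an MMR-style greedy selection restricted to the top ranked sentences and stopped by a length budget. The trade-off weight `lam`, the Jaccard word-overlap similarity, the word-count budget and the `top_k` cutoff are all assumptions; the slide does not specify how the variation is parameterised.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (assumed redundancy measure)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_sentences(ranked, max_words=50, top_k=20, lam=0.7):
    """Greedy MMR over the top_k (sentence, score) pairs until max_words is reached."""
    candidates = list(ranked[:top_k])
    chosen = []
    words = 0
    while candidates and words < max_words:
        def mmr(item):
            sentence, score = item
            redundancy = max((jaccard(sentence, c) for c in chosen), default=0.0)
            return lam * score - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        candidates.remove(best)
        chosen.append(best[0])
        words += len(best[0].split())
    return chosen

ranked = [("a b c", 1.0), ("a b c", 0.9), ("d e f", 0.8)]
print(select_sentences(ranked, max_words=6))
```

In the toy run the duplicate of the first sentence is skipped in favour of a lower scored but novel one, and selection stops once six words have been collected.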

Page 31: Unsupervised Learning of Soft Patterns for Generating Definitions from Online News

Compared to HMM

• Both address individual slot content and sequence fidelity.

• Soft patterns perform instance-based learning and can deal with
  – Small training sets
  – Noisy data from group pseudo-relevance feedback
  – Online training

• HMM needs
  – More training data and time
  – Explicit transition paths between states