BBN Technologies
Statistical Models of Text: From Bags of Words to Structure
Ralph Weischedel
17 April 2000
Extraction Vision: Multi-dimensional Meta-data Extraction
EMPLOYEE / EMPLOYER Relationships:
• Jan Clesius works for Clesius Enterprises
• Bill Young works for InterMedia Inc.
COMPANY / LOCATION Relationships:
• Clesius Enterprises is in New York, NY
• InterMedia Inc. is in Boston, MA
Meta-Data
India Bombing NY Times Andhra Bhoomi Dinamani Dainik Jagran
Topic Discovery
Concept Indexing
Thread Creation
Term Translation
Document Translation
Story Segmentation
Entity Extraction
Fact Extraction
Outline
Statistical models that support feature extraction
Bags of words
• Topic extraction
Sequences (HMMs)
• Name extraction and classification
Lexicalized probabilistic context-free grammars
• Parses
• Facts/relationships
TBD
• Propositions
Topic Extraction via Bag of Words
Topics
• Clinton, Bill
• Mexico
• Money
• Economic assistance, American
Example text: “President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn’t need congressional approval.”
[Diagram: Speech -> Speech Recognition -> Text -> Classifier -> Topics; a Training Program builds the Models from training sentences and answers]
Generative Model of Story and Topics
First, choose a Set of topics, T0 ... TM
For each word in story:
• Choose a topic according to P ( Tj | Set )
• Choose a word according to output distribution P ( Wn | Tj )
• Loop
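[Diagram: HMM running from story start to story end. The topic set is chosen with P( Set ); the states are T0 (General Language), T1, T2, ..., TM; each state is entered with P( Tj | Set ) and emits word n with P ( Wn | Tj ), then loops.]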
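Sketch: one way to score a story against a candidate topic set under the generative story on this slide. This is a minimal illustration, not the system described here. The function name, the `floor` smoothing value, and all probabilities are invented for the example, and the P( Set ) prior is left out.

```python
import math

def story_log_likelihood(words, topic_given_set, word_given_topic, floor=1e-9):
    """Log-probability of a story's words under one candidate topic set.

    Each word is generated by choosing a topic Tj with P(Tj | Set) and then
    emitting the word with P(Wn | Tj), so the per-word probability is the
    sum over topics of P(Tj | Set) * P(Wn | Tj).
    `floor` is an assumed stand-in for proper smoothing of unseen words.
    """
    total = 0.0
    for w in words:
        p_word = sum(
            p_topic * word_given_topic[topic].get(w, floor)
            for topic, p_topic in topic_given_set.items()
        )
        total += math.log(p_word)
    return total

# Toy distributions, invented purely for illustration.
word_given_topic = {
    "T0_general": {"the": 0.5, "plan": 0.1, "today": 0.1},
    "Mexico": {"mexican": 0.4, "bailout": 0.2, "plan": 0.1},
    "Sports": {"game": 0.4, "score": 0.3, "plan": 0.05},
}
story = ["the", "mexican", "bailout", "plan"]
set_a = {"T0_general": 0.6, "Mexico": 0.4}   # P(Tj | Set) for candidate set A
set_b = {"T0_general": 0.6, "Sports": 0.4}   # P(Tj | Set) for candidate set B

ll_a = story_log_likelihood(story, set_a, word_given_topic)
ll_b = story_log_likelihood(story, set_b, word_given_topic)
```

With these toy numbers the set containing "Mexico" scores higher than the set containing "Sports", because it explains "mexican" and "bailout" without falling back to the floor value.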
Topic Classification on Broadcast News
Trained on 1 year of stories from July '95 to June '96 (42,502 stories)
Tested on 989 stories from July '96
Allowed 4,627 topics that occur at least twice
OOT (out-of-topic) rate was 2.45%
Results:
• 75.8% of the first choice topics are among the annotated labels
• 63.6% for a simple likelihood-based method
• 45% for the traditional tfidf measure used in IR
On cursory examination of errors, often the recognized topic was correct and the annotator failed to include it.
Name Extraction via HMMs
[Diagram: Speech -> Speech Recognition -> Text -> Extractor -> Entities (Locations, Persons, Organizations); a Training Program builds the NE Models from training sentences and answers]
Example: The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
Prior to 1997 - no learning approach competitive with hand-built rule systems
Since 1997 - Statistical approaches (BBN, NYU, MITRE) achieve state-of-the-art performance
A Hidden Markov Model
[Diagram: states START OF SENTENCE, PERSON, ORGANIZATION, (if MUC, then five other entity types), NOT-A-NAME, END OF SENTENCE]
Structure of Model
• One language model for each category, plus one for other (not-a-name)
• The number of categories is learned from training
• Bi-gram transition probabilities
First word W1 of a name class NC: Pr(NC | NC-1, W-1) · Pr(W1 | NC, NC-1)
Later words within the class (W1 W2 ... +end+): Pr(W | W-1, NC)
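Sketch: how the three probabilities on this slide could combine to score one labelling of a sentence. This is a simplified illustration under stated assumptions, not IdentiFinder itself. The table layout, the `floor` value, the "+start+" token, and every number below are hypothetical, and the sketch cannot separate two adjacent names of the same class.

```python
import math

def sequence_log_prob(tagged, trans, first, bigram, floor=1e-6):
    """Log-probability of a (word, name_class) sequence.

    trans[(nc, prev_nc, prev_word)]  ~ Pr(NC | NC-1, W-1)
    first[(word, nc, prev_nc)]       ~ Pr(W1 | NC, NC-1)
    bigram[(word, prev_word, nc)]    ~ Pr(W | W-1, NC)
    Missing entries fall back to `floor` (an assumed stand-in for smoothing).
    """
    logp = 0.0
    prev_word, prev_nc = "+start+", "START"
    for word, nc in tagged:
        if nc != prev_nc:
            if prev_nc != "START":
                # Close the previous name class with the +end+ token.
                logp += math.log(bigram.get(("+end+", prev_word, prev_nc), floor))
            # Enter the new class, then generate its first word.
            logp += math.log(trans.get((nc, prev_nc, prev_word), floor))
            logp += math.log(first.get((word, nc, prev_nc), floor))
        else:
            # Continue inside the same class with a word bigram.
            logp += math.log(bigram.get((word, prev_word, nc), floor))
        prev_word, prev_nc = word, nc
    return logp

# Toy probability tables, invented for illustration only.
trans = {
    ("PERSON", "START", "+start+"): 0.3,
    ("NOT-A-NAME", "START", "+start+"): 0.7,
    ("NOT-A-NAME", "PERSON", "rose"): 0.8,
}
first = {
    ("michael", "PERSON", "START"): 0.2,
    ("went", "NOT-A-NAME", "PERSON"): 0.1,
}
bigram = {
    ("rose", "michael", "PERSON"): 0.1,
    ("+end+", "rose", "PERSON"): 0.5,
}

as_person = [("michael", "PERSON"), ("rose", "PERSON"), ("went", "NOT-A-NAME")]
as_other = [(w, "NOT-A-NAME") for w, _ in as_person]
lp_person = sequence_log_prob(as_person, trans, first, bigram)
lp_other = sequence_log_prob(as_other, trans, first, bigram)
```

A decoder would search over all labellings (for example with Viterbi) rather than compare two by hand; the comparison here only shows how a single path is scored.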
Effect of Speech Recognition Error
[Figure: F-measure (y-axis, 70 to 92) versus word error rate (x-axis, 0 to 30), Hub4 Eval 98]
BBN and NIST found IdentiFinder performance degrades 0.7 points of F per 1% WER
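Sketch: the linear rule of thumb above as arithmetic. The slope of 0.7 comes from this slide; the clean-text F-measure passed in is a hypothetical input, and treating the relationship as linear over the whole range is an assumption.

```python
def estimated_f(f_clean, wer_percent, slope=0.7):
    """Estimate F-measure at a given word error rate.

    f_clean     : F-measure on error-free text (hypothetical input)
    wer_percent : word error rate in percent
    slope       : F points lost per 1% WER (0.7, from the slide)
    """
    return f_clean - slope * wer_percent

# Hypothetical example: a system scoring 90 F on clean text, run at 20% WER.
f_at_20 = estimated_f(90.0, 20.0)
```

Under these assumed inputs the estimate is about 14 points below the clean-text score.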
Parsing via Lexicalized Probabilistic CFGs
[Diagram: Speech -> Speech Recognition -> Text -> Parser -> Trees; a Training Program builds the NE Models from training sentences and answers]
[Figure: parse tree of the example sentence, with node labels S, NP, VP, PP, SBAR, WH]
Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General.
Prior to 1990 - accuracy for non-statistical parsers around 65%
Since 1995 - Statistical parsers (IBM, UPenn, Brown and BBN) achieve 85-90% accuracy
Example of Generating a Parse Tree
[Figure: the parse tree for "Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General." generated step by step, with the head words "was" and "ousted" attached to the S and VP nodes]
Extracting Facts via LPCFG
“Nance, who is also a paid consultant to ABC News, said ...”
PositionHolder
• Person: Nance
• Post: a paid consultant
• Org: ABC News
[Diagram: Speech -> Speech Recognition -> Text -> Extractor -> Relationships/Events; a Training Program builds the Models from training sentences and answers]
1998 - First state-of-the-art trainable system (70% accuracy)
Type of Annotation Required
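[Diagram: annotation of "Nance, who is also a paid consultant to ABC News, said ...". "Nance" is marked person, "a paid consultant" is marked person-descriptor, and "ABC News" is marked organization; a Coreference link and an Employee relation are drawn between the marked spans.]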
Training data consists ONLY of
• Named entities (as in NE)
• Descriptor phrases (for TE)
• Descriptor references (for TE)
• Relation/events to be extracted (for TR)
The Sentential Model
• Search Criterion: find M such that p(M | W) is maximized
$$p(M \mid W) = \frac{p(M, W)}{p(W)} = \frac{\sum_{T} p(M, T, W)}{p(W)} \approx \frac{\max_{T} p(M, T, W)}{p(W)}$$
• Since p(W) is constant, search for:
$$\max_{M,T} \; p(M, T, W)$$
• Model the probability as the product of the probabilities of generating each element in the tree
$$p(M, T, W) = \prod_{e \in \text{tree}} p(e \mid h_e)$$
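Sketch: scoring a tree as a product of per-element probabilities, computed in log space, and picking the best of several candidate trees. This is a toy illustration of the criterion above, not the actual search procedure. The representation of a tree as (element, history) pairs, the `floor` value, and all probabilities are assumptions made for the example.

```python
import math

def tree_log_prob(elements, cond_prob, floor=1e-9):
    """Sum of log p(e | h_e) over the (element, history) pairs of one tree.

    Summing logs is equivalent to taking the product of the probabilities
    but avoids numeric underflow on large trees.
    """
    return sum(math.log(cond_prob.get((e, h), floor)) for e, h in elements)

def best_candidate(candidates, cond_prob):
    """Return the name of the candidate tree with the highest probability."""
    return max(candidates, key=lambda name: tree_log_prob(candidates[name], cond_prob))

# Toy conditional probabilities p(e | h_e), invented for illustration.
cond_prob = {
    ("per/np", "s"): 0.4,
    ("org/np", "s"): 0.1,
    ("Nance", "per/np"): 0.2,
    ("Nance", "org/np"): 0.01,
}
# Two hypothetical readings of the same word, each as (element, history) pairs.
candidates = {
    "person_reading": [("per/np", "s"), ("Nance", "per/np")],
    "org_reading": [("org/np", "s"), ("Nance", "org/np")],
}
best = best_candidate(candidates, cond_prob)
```

A real system would not enumerate candidates; it would search the space of trees with a chart parser or similar dynamic program.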
Augmented Semantic Tree
[Figure: augmented parse tree for "Nance, who is also a paid consultant to ABC News, said ...". Each node carries a semantic label and a syntax label, written semantic/syntax. Labels shown include per/np, per-r/np, per/nnp, per-desc/np, per-desc-r/np, per-desc/nn, per-desc-ptr/sbar, per-desc-ptr/vp, per-desc-of/sbar-lnk, org/nnp, org-c/nnp, org-r/np, org-ptr/pp, and emp-of/pp-lnk; purely syntactic nodes include s, vp, whnp, and advp.]
Propositions via TBD
Within the past two months, a bomb exploded in the offices of the El Espectador in Bogata, destroying a major part of its installations and equipment.
[Diagram: Speech -> Speech Recognition -> Text -> Extractor -> Propositions; a Training Program builds the Models from training sentences and answers]
Predicate    Arg      Value
exploded     L-OBJ    a bomb
destroying   L-SUBJ   a bomb
destroying   L-OBJ    a major part of its installations and equipment
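Sketch: the Predicate / Arg / Value rows above held as a small data structure. The class and function names are hypothetical and chosen for illustration; only the three rows themselves come from the slide.

```python
from typing import NamedTuple

class Proposition(NamedTuple):
    predicate: str  # the verb or event word
    arg: str        # logical role, e.g. L-SUBJ or L-OBJ
    value: str      # the phrase filling that role

# The three propositions listed on the slide for the example sentence.
props = [
    Proposition("exploded", "L-OBJ", "a bomb"),
    Proposition("destroying", "L-SUBJ", "a bomb"),
    Proposition("destroying", "L-OBJ", "a major part of its installations and equipment"),
]

def arguments_of(predicate, propositions):
    """Collect the role -> phrase mapping for one predicate."""
    return {p.arg: p.value for p in propositions if p.predicate == predicate}

destroying_args = arguments_of("destroying", props)
```

Grouping by predicate in this way gives one frame per event, which is the shape a downstream fact database would likely want.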
Towards a Proposition Bank
[Figure: parse tree of "Nawaz Sharif, who led Pakistan, was ousted October 12 by Pervez Musharraf, Pakistani Army General." with node labels S, NP, VP, PP, SBAR, WH]
Event: ousted-1
Logical Object:
Logical Subject:
Time:
Location: --
Event: led-3
Logical Object:
Logical Subject:
Time: --
Location: --
• Add Predicate/Argument Markings
• Add Co-reference
• Add Verb Sense Markings
Statistical Speech/Language Modeling
[Diagram: a Trainer builds a Model from Language Input and Answers; a Decoder uses the Model to map Language Input to Answers]
Technology                Input         Answers
Speech recognition        audio         transcription
OCR                       image         characters
Speech understanding      audio         response
Topic classification      document      topics
Topic detection           text/speech   clusters
Topic tracking            text/speech   relevant stories
Story segmentation        speech        stories
Information retrieval     query         text/speech
Named entity extraction   text/speech   names & types
Advantages
• Mathematically rigorous approach
• State-of-the-art performance
• Highly robust in the face of degraded input
• Language independent, requiring only annotated training data
• Affordable annotation
  • Only domain knowledge is needed
  • Can be performed by students/interns