![Page 1: IE with Dictionaries Cohen & Sarawagi. Announcements Current statistics: –days with unscheduled student talks: 2 –students with unscheduled student talks:](https://reader035.vdocuments.us/reader035/viewer/2022070411/56649f305503460f94c4a8d2/html5/thumbnails/1.jpg)
IE with Dictionaries
Cohen & Sarawagi
Announcements
• Current statistics:
– days with unscheduled student talks: 2
– students with unscheduled student talks: 0
– Projects are due: 4/28 (last day of class)
– Additional requirement: draft (for comments) no later than 4/21
Finding names you know about
• Problem: given a dictionary of names, find them in email text
– Important task beyond email (biology, link analysis, ...)
– Exact match is unlikely to work perfectly, due to nicknames (Will Cohen), abbreviations (William C), misspellings (Willaim Chen), polysemous words (June, Bill), etc.
– In informal text it sometimes works very poorly
– Problem is similar to the record linkage (aka data cleaning, de-duping, merge-purge, ...) problem of finding duplicate database records in heterogeneous databases.
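A minimal sketch of why exact lookup misses the misspelled name while a soft match finds it. The similarity function here is Python's `difflib` ratio, used only as a stand-in for the record-linkage metrics the slide refers to; the `find_names` helper, the threshold of 0.8, and the example sentence are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (difflib ratio, a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_names(tokens, dictionary, threshold=0.8, max_len=3):
    """Return (start, end, span, entry, score) for spans close to a dictionary entry.

    `start` is inclusive and `end` is exclusive, both 0-based token indexes.
    """
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            span = " ".join(tokens[start:end])
            # Best-matching dictionary entry for this candidate span.
            entry, score = max(
                ((d, similarity(span, d)) for d in dictionary),
                key=lambda pair: pair[1],
            )
            if score >= threshold:
                matches.append((start, end, span, entry, score))
    return matches

tokens = "please forward this to Willaim Chen today".split()
dictionary = ["William Cohen", "Sunita Sarawagi"]

# Exact substring lookup finds nothing because of the misspelling.
exact = [name for name in dictionary if name in " ".join(tokens)]
# Soft matching recovers the intended entry.
fuzzy = find_names(tokens, dictionary)
for start, end, span, entry, score in fuzzy:
    print(f"{span!r} ~ {entry!r} (similarity {score:.2f})")
```

Scanning every span against every entry is quadratic and only practical for small dictionaries; it is shown here just to make the matching problem concrete.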
Finding names you know about
• Technical problem:
– Hard to combine state of the art similarity metrics (as used in record linkage) with state of the art NER systems due to representational mismatch:
• Opening up the box, modern NER systems don’t really know anything about names....
IE as Sequential Word Classification
A trained IE system models the relative probability of labeled sequences of words.
States: person name, location name, background
To classify, find the most likely state sequence for the given words:
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Pedro Domingos
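The classification step above can be sketched as a small Viterbi decoder over the three states, followed by reading off the words assigned to the “person name” state. The start, transition, and emission scores below are toy values invented for this example, not parameters from any trained system.

```python
import math

STATES = ["person", "location", "background"]

def viterbi(words, start, trans, emit):
    """Most likely state sequence under log-scores start[s], trans[p][s], emit(s, w)."""
    best = [{s: start[s] + emit(s, words[0]) for s in STATES}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for s in STATES:
            # Best previous state for reaching state s at this word.
            prev = max(STATES, key=lambda p: best[-1][p] + trans[p][s])
            scores[s] = best[-1][prev] + trans[prev][s] + emit(s, w)
            ptrs[s] = prev
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

def extract(words, path, label):
    """Join maximal runs of words tagged with `label` into extracted strings."""
    found, current = [], []
    for w, s in zip(words, path):
        if s == label:
            current.append(w)
        elif current:
            found.append(" ".join(current))
            current = []
    if current:
        found.append(" ".join(current))
    return found

# Toy (hypothetical) parameters: states tend to persist, and two words are known names.
NAME_WORDS = {"Pedro", "Domingos"}
start = {s: math.log(1 / 3) for s in STATES}
trans = {p: {s: math.log(0.6 if p == s else 0.2) for s in STATES} for p in STATES}

def emit(state, word):
    if state == "person":
        return math.log(0.4 if word in NAME_WORDS else 0.01)
    if state == "location":
        return math.log(0.01)
    return math.log(0.01 if word in NAME_WORDS else 0.3)

words = "Yesterday Pedro Domingos spoke this example sentence".split()
path = viterbi(words, start, trans, emit)
print(extract(words, path, "person"))
```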
IE as Sequential Word Classification
Modern IE systems use a rich representation for words, and clever probabilistic models of how labels interact in a sequence, but do not explicitly represent the names extracted.
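[Figure: chain of nodes w_{t-1}, w_t, w_{t+1} linked to O_{t-1}, O_t, O_{t+1}; one node is annotated with the example features: is “Wisniewski”, ends in “-ski”, part of noun phrase]
Example features of a word:
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are “and Associates”
• …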
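A few of the word features listed above could be computed as in the following sketch. The function name, feature names, and the city list are hypothetical; features that need extra resources (WordNet, page layout, a noun-phrase chunker) are left out.

```python
def word_features(words, t, city_names=frozenset()):
    """Return the active (non-zero) features of the word at position t."""
    w = words[t]
    feats = {
        f"identity={w.lower()}": 1.0,
        "ends_in_ski": float(w.lower().endswith("ski")),
        "is_capitalized": float(w[:1].isupper()),
        "in_city_list": float(w in city_names),
        # Looks at the two following words, so it depends on context, not just w.
        "next_two_are_and_Associates": float(words[t + 1:t + 3] == ["and", "Associates"]),
    }
    return {name: value for name, value in feats.items() if value}

words = ["Wisniewski", "and", "Associates", "opened", "in", "Pittsburgh"]
print(word_features(words, 0, city_names={"Pittsburgh"}))
print(word_features(words, 5, city_names={"Pittsburgh"}))
```

Note that every feature here describes a single word and its neighbours; none of them describes the extracted name as a whole, which is the limitation the slide points out.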
Semi-Markov models for IE
• Train on sequences of labeled segments, not labeled words: S = (start, end, label)
• Build probability model of segment sequences, not word sequences
• Define features f of segments:
f(S) = words x_t...x_u, length, previous words, case information, ..., distance to known name
• (Approximately) optimize feature weights on training data:
maximize: $\sum_{i=1}^{m} \log \Pr(S_i \mid x_i)$
with Sunita Sarawagi, IIT Bombay
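A minimal sketch of segment-level features f(S), including a "distance to known name" feature that only makes sense for a whole segment. The `Segment` type, the feature names, and the use of `difflib` as the similarity metric are assumptions for illustration; indexes are 0-based here.

```python
from difflib import SequenceMatcher
from typing import NamedTuple

class Segment(NamedTuple):
    start: int   # index of the first word (inclusive, 0-based)
    end: int     # index of the last word (inclusive)
    label: str

def segment_features(words, seg, dictionary):
    """Features of a whole segment, not of its individual words."""
    seg_words = words[seg.start:seg.end + 1]
    text = " ".join(seg_words).lower()
    # Similarity to the closest dictionary entry (0.0 if the dictionary is empty).
    best_sim = max(
        (SequenceMatcher(None, text, d.lower()).ratio() for d in dictionary),
        default=0.0,
    )
    prev_word = words[seg.start - 1].lower() if seg.start > 0 else "<s>"
    return {
        "length": float(len(seg_words)),
        "all_capitalized": float(all(w[:1].isupper() for w in seg_words)),
        f"prev_word={prev_word}": 1.0,
        "distance_to_known_name": 1.0 - best_sim,
    }

words = "I met Will Cohen yesterday".split()
feats = segment_features(words, Segment(2, 3, "Person"), ["William Cohen"])
print(feats)
```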
Details: Semi-Markov model
Segments vs tagging
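Tagging: one label y_t per word x_t, with features f(x_t, y_t)
t: 1 2 3 4 5 6 7 8
x: Fred please stop by my office this afternoon
y: Person other other other Loc Loc other Time
Segments: one label y_j per segment (t_j, u_j), with features f(x_j, y_j)
t,u: t1=u1=1; t2=2, u2=4; t3=5, u3=6; t4=u4=7; t5=u5=8
x: Fred | please stop by | my office | this | afternoon
y: Person | other | Loc | other | Time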
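The relation between the two representations in this example can be sketched by collapsing runs of identical tags into (t, u, label) segments, using the slide's 1-based positions. Collapsing runs is an assumption that matches this example; it would merge two adjacent entities of the same type, which a real segment annotation keeps apart.

```python
from itertools import groupby

def tags_to_segments(tags):
    """Collapse runs of equal tags into (t, u, label) with 1-based inclusive bounds."""
    segments, pos = [], 1
    for label, run in groupby(tags):
        length = len(list(run))
        segments.append((pos, pos + length - 1, label))
        pos += length
    return segments

def segments_to_tags(segments):
    """Expand (t, u, label) segments back into one tag per word."""
    tags = []
    for t, u, label in segments:
        tags.extend([label] * (u - t + 1))
    return tags

tags = ["Person", "other", "other", "other", "Loc", "Loc", "other", "Time"]
segments = tags_to_segments(tags)
print(segments)
```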
Details: Semi-Markov model
Conditional Semi-Markov models
CMM:
CSMM:
A training algorithm for CSMM’s (1)
Review: Collins’ perceptron training algorithm
Correct tags
Viterbi tags
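As a reminder of how the perceptron update uses the correct tags and the Viterbi tags, here is a minimal sketch of the basic (unaveraged, unvoted) structured perceptron for tagging. The feature templates, tag set, and two training sentences are hypothetical; the slide's own algorithm box is not reproduced here.

```python
from collections import defaultdict

def features(words, t, prev_tag, tag):
    """Hypothetical local features of giving word t the label `tag` after `prev_tag`."""
    w = words[t]
    return [f"word={w.lower()}|{tag}", f"cap={w[:1].isupper()}|{tag}", f"prev={prev_tag}|{tag}"]

def score(weights, words, t, prev_tag, tag):
    return sum(weights[f] for f in features(words, t, prev_tag, tag))

def viterbi(weights, words, tags):
    """Highest-scoring tag sequence under the current weights."""
    best = {tag: (score(weights, words, 0, "<s>", tag), [tag]) for tag in tags}
    for t in range(1, len(words)):
        new = {}
        for tag in tags:
            new[tag] = max(
                ((s + score(weights, words, t, prev, tag), path + [tag])
                 for prev, (s, path) in best.items()),
                key=lambda cand: cand[0],
            )
        best = new
    return max(best.values(), key=lambda cand: cand[0])[1]

def global_features(words, tag_seq):
    """Sum of local feature counts over the whole sequence."""
    counts = defaultdict(float)
    prev = "<s>"
    for t, tag in enumerate(tag_seq):
        for f in features(words, t, prev, tag):
            counts[f] += 1.0
        prev = tag
    return counts

def train(examples, tags, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, correct in examples:
            predicted = viterbi(weights, words, tags)
            if predicted != correct:
                # Move weights toward the correct tags and away from the Viterbi tags.
                for f, v in global_features(words, correct).items():
                    weights[f] += v
                for f, v in global_features(words, predicted).items():
                    weights[f] -= v
    return weights

TAGS = ["Person", "other"]
examples = [
    ("Fred please stop by".split(), ["Person", "other", "other", "other"]),
    ("please call Pedro".split(), ["other", "other", "Person"]),
]
weights = train(examples, TAGS)
for words, correct in examples:
    print(words, viterbi(weights, words, TAGS))
```

The update only fires when the Viterbi output differs from the correct tags, so once the training sentences are decoded correctly the weights stop changing.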
A training algorithm for CSMM’s (2)
Variant of Collins’ perceptron training algorithm:
voted perceptron learner for TTRANS
like Viterbi
A training algorithm for CSMM’s (3)
Variant of Collins’ perceptron training algorithm:
voted perceptron learner for TSEGTRANS
like Viterbi
Viterbi for HMMs
Viterbi for SMM
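A minimal sketch of the standard semi-Markov extension of Viterbi: instead of choosing a label per word, the recursion at each position also chooses how long the last segment is, up to a maximum length. The `segment_score` interface and the toy scorer with its tiny dictionary are assumptions for illustration, not the scoring function from the slides.

```python
def semi_markov_viterbi(words, labels, segment_score, max_len=4):
    """Best segmentation as a list of (start, end, label), 0-based inclusive bounds."""
    n = len(words)
    # best[i][y] = (score, (segment start, previous label)) for the best
    # segmentation of words[:i] whose last segment has label y.
    best = [dict() for _ in range(n + 1)]
    best[0] = {None: (0.0, None)}
    for i in range(1, n + 1):
        for y in labels:
            candidates = []
            for d in range(1, min(max_len, i) + 1):   # length of the last segment
                start = i - d
                for prev_y, (prev_score, _) in best[start].items():
                    s = prev_score + segment_score(words, start, i - 1, y, prev_y)
                    candidates.append((s, (start, prev_y)))
            best[i][y] = max(candidates, key=lambda cand: cand[0])
    # Trace back from the best label at the end of the sentence.
    y = max(best[n], key=lambda lab: best[n][lab][0])
    i, segments = n, []
    while i > 0:
        _, (start, prev_y) = best[i][y]
        segments.append((start, i - 1, y))
        i, y = start, prev_y
    return segments[::-1]

KNOWN = {"william cohen"}   # toy dictionary (hypothetical)

def toy_score(words, start, end, label, prev_label):
    """Reward Person segments that exactly match a known name; prefer one-word 'other'."""
    text = " ".join(words[start:end + 1]).lower()
    if label == "Person":
        return 3.0 if text in KNOWN else -2.0
    return 0.5 if start == end else -2.0

words = "email William Cohen today".split()
print(semi_markov_viterbi(words, ["Person", "other"], toy_score))
```

Because the score is a function of the whole candidate segment, a feature such as "segment matches a dictionary entry" fits naturally, at the cost of an extra factor of `max_len` in running time compared with word-level Viterbi.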
Sample CSMM features
Experimental results
• Baseline algorithms:
– HMM-VP/1: tags are “in entity”, “other”
– HMM-VP/4: tags are “begin entity”, “end entity”, “continue entity”, “unique”, “other”
– SMM-VP: all features f(w) have versions for “f(w) true for some w in segment that is first (last, any) word of segment”
– dictionaries: like Borthwick
• HMM-VP/1: fD(w)=“word w is in D”
• HMM-VP/4: fD,begin(w)=“word w begins entity in D”, etc, etc
• Dictionary lookup
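The two word-level tag schemes and their dictionary features might look like the sketch below. The function names and the exact definitions of the "continue", "end", and "unique" features are one plausible reading of “etc, etc”, not a specification taken from the slides.

```python
def encode_tags(n_words, entity_spans, scheme):
    """Tag each word for HMM-VP/1 (scheme=1) or HMM-VP/4 (scheme=4).

    entity_spans holds (start, end) pairs with 0-based inclusive bounds.
    """
    tags = ["other"] * n_words
    for start, end in entity_spans:
        for t in range(start, end + 1):
            if scheme == 1:
                tags[t] = "in entity"
            elif start == end:
                tags[t] = "unique"
            elif t == start:
                tags[t] = "begin entity"
            elif t == end:
                tags[t] = "end entity"
            else:
                tags[t] = "continue entity"
    return tags

def dictionary_features(words, t, dictionary):
    """Word-level dictionary features in the style of fD(w) and fD,begin(w)."""
    w = words[t].lower()
    entries = [entry.lower().split() for entry in dictionary]
    return {
        "in_D": any(w in e for e in entries),
        "begins_entity_in_D": any(len(e) > 1 and e[0] == w for e in entries),
        "continues_entity_in_D": any(w in e[1:-1] for e in entries),
        "ends_entity_in_D": any(len(e) > 1 and e[-1] == w for e in entries),
        "unique_entity_in_D": any(e == [w] for e in entries),
    }

words = ["William", "Cohen", "visited", "Pittsburgh"]
D = ["William Cohen", "Pittsburgh"]
print(encode_tags(len(words), [(0, 1), (3, 3)], scheme=1))
print(encode_tags(len(words), [(0, 1), (3, 3)], scheme=4))
print(dictionary_features(words, 0, D))
```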
Datasets used
Used small training sets (10% of available) in experiments.
Results
Results: varying history
Results: changing the dictionary
Results: vs CRF
Results: vs CRF