Machine Learning for Information Extraction: An Overview
Kamal Nigam, Google Pittsburgh
With input, slides and suggestions from William Cohen, Andrew McCallum and Ion Muslea
Example: A Problem
Genomics job
Mt. Baker, the school district
Baker Hostetler, the company
Baker, a job opening
Example: A Solution
Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.
Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy…
Contact: [email protected]
Category: Travel/Hospitality
Function: Food Services
Potential Enabler of Faceted Search
Lots of Structured Information in Text
IE from Research Papers
What is Information Extraction?

• Recovering structured data from formatted text
  – Identifying fields (e.g. named entity recognition)
  – Understanding relations between fields (e.g. record association)
  – Normalization and deduplication

• Today, focus mostly on field identification and a little on record association
IE Posed as a Machine Learning Task
• Training data: documents marked up with ground truth
• In contrast to text classification, local features are crucial. Features of:
  – Contents
  – Text just before the item
  – Text just after the item
  – Begin/end boundaries
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
    (prefix: "00 : pm Place :"   contents: "Wean Hall Rm 5409"   suffix: "Speaker : Sebastian Thrun")
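The prefix/contents/suffix decomposition above can be sketched as a feature extractor. This is a minimal illustration rather than the talk's actual system; the function name and context-window size are invented.

```python
# Sketch: turn a candidate span into features of its contents and of the
# text just before (prefix) and just after (suffix) the item.
def featurize(tokens, start, end, window=2):
    """Features for the candidate span tokens[start:end] plus local context."""
    feats = set()
    for tok in tokens[start:end]:
        feats.add("contents=" + tok.lower())
    for tok in tokens[max(0, start - window):start]:
        feats.add("prefix=" + tok.lower())
    for tok in tokens[end:end + window]:
        feats.add("suffix=" + tok.lower())
    feats.add("length=%d" % (end - start))
    return feats

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
feats = featurize(tokens, 5, 9)  # candidate: "Wean Hall Rm 5409"
```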
Good Features for Information Extraction
Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
contains-question-word
ends-with-question-mark
first-alpha-is-capitalized
indented
indented-1-to-4
indented-5-to-10
more-than-one-third-space
only-punctuation
prev-is-blank
prev-begins-with-ordinal
shorter-than-30
Creativity and Domain Knowledge Required!
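Several of the line features listed above are simple binary predicates. As a hedged sketch (these are my own one-line implementations, not the original system's), a few might look like:

```python
import re

# Hypothetical implementations of a handful of the binary line features
# named above; the real feature set was hand-engineered per domain.
def begins_with_number(line):
    return bool(re.match(r"\s*\d", line))

def contains_question_mark(line):
    return "?" in line

def ends_with_question_mark(line):
    return line.rstrip().endswith("?")

def first_alpha_is_capitalized(line):
    m = re.search(r"[A-Za-z]", line)
    return bool(m) and m.group().isupper()

def shorter_than_30(line):
    return len(line) < 30
```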
Is Capitalized
Is Mixed Caps
Is All Caps
Initial Cap
Contains Digit
All lowercase
Is Initial
Punctuation
Period
Comma
Apostrophe
Dash
Preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list(the, of, their, etc)
In honorific list(Mr, Mrs, Dr, Sen, etc)
In person suffix list(Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list;segmented by P(name)
In Census firstname list;segmented by P(name)
In locations lists(states, cities, countries)
In company name list(“J. C. Penny”)
In list of company suffixes(Inc, & Associates, Foundation)
Word Features
– lists of job titles
– lists of prefixes
– lists of suffixes
– 350 informative phrases

HTML/Formatting Features
– {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line
IE History

Pre-Web
• Mostly news articles
  – DeJong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
  – Message Understanding Conference (MUC) DARPA ['87-'95], TIPSTER ['92-'96]
• Most early work dominated by hand-built models
  – E.g. SRI's FASTUS, hand-built FSMs
  – But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98]

Web
• AAAI '94 Spring Symposium on "Software Agents"
  – Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
• Tom Mitchell's WebKB, '96
  – Build KBs from the Web
• Wrapper Induction
  – Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …
Landscape of ML Techniques for IE:
Any of these models can be used to capture words, formatting or both.
• Classify Candidates: run a classifier ("which class?") over candidate phrases, e.g. in "Abraham Lincoln was born in Kentucky."
• Sliding Window: classify each window of text, trying alternate window sizes.
• Boundary Models: classify token positions as BEGIN or END of a field, then pair them.
• Finite State Machines: find the most likely state sequence for the whole sequence.
• Wrapper Induction: learn and apply a pattern for a website, e.g. "<b><i>Abraham Lincoln</i></b> was born in Kentucky." with <b><i> marking a PersonName.
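The wrapper-induction idea can be sketched as a toy LR wrapper: from labeled pages of one site, learn the common left and right delimiters around the target field, then apply them to a new page. This is an illustrative sketch in the spirit of LR wrapper induction, not any specific published system; the pages and helper names are invented.

```python
import os

def learn_wrapper(examples):
    """examples: list of (page_html, field_value) pairs from one website.
    Returns (left, right): the shared delimiters around the field."""
    lefts, rights = [], []
    for page, value in examples:
        i = page.index(value)
        lefts.append(page[:i])
        rights.append(page[i + len(value):])
    # Common suffix of the left contexts = common prefix of their reversals.
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    right = os.path.commonprefix(rights)
    return left, right

def apply_wrapper(page, wrapper):
    """Extract the field between the learned delimiters."""
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

pages = [
    ("<b><i>Abraham Lincoln</i></b> was born in Kentucky.", "Abraham Lincoln"),
    ("<b><i>Ada Lovelace</i></b> was born in London.", "Ada Lovelace"),
]
wrapper = learn_wrapper(pages)
```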
Sliding Windows & Boundary Detection
Information Extraction by Sliding Windows
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for seminar location
Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
    w_{t-m} … w_{t-1}   [ w_t … w_{t+n} ]   w_{t+n+1} … w_{t+n+m}
    (prefix)              (contents)          (suffix)
• Standard supervised learning setting
  – Positive instances: candidates with a real label
  – Negative instances: all other candidates
  – Features based on candidate, prefix and suffix
• Special-purpose rule learning systems work well:

  courseNumber(X) :-
      tokenLength(X, =, 2),
      every(X, inTitle, false),
      some(X, A, <previousToken>, inTitle, true),
      some(X, B, <>, tripleton, true)
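The sliding-window setup itself is simple to sketch: enumerate every short window as a candidate and keep those a classifier scores above a threshold. The scorer below is a hand-set stand-in, not a learned rule system; all names and thresholds are invented.

```python
# Sketch of sliding-window extraction: candidate generation + scoring.
def windows(tokens, max_len=4):
    """Yield (start, end) for every window of up to max_len tokens."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                yield start, start + length

def extract(tokens, score, threshold=0.5):
    """Return the token windows the classifier scores above threshold."""
    hits = []
    for start, end in windows(tokens):
        if score(tokens, start, end) > threshold:
            hits.append(" ".join(tokens[start:end]))
    return hits

# Stand-in scorer: a 3-token "location" window that follows a ":" token.
def toy_score(tokens, start, end):
    return 1.0 if start >= 1 and tokens[start - 1] == ":" and end - start == 3 else 0.0

tokens = "Place : Wean Hall 5409".split()
```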
Rule-learning approaches to sliding-window classification: Summary
• Representations for classifiers allow restricting the relationships between tokens, etc.
• Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
• Use of these “heavyweight” representations is complicated, but seems to pay off in results
IE by Boundary Detection
(Same CMU UseNet seminar announcement as above; e.g. looking for the seminar location.)
BWI: Learning to detect boundaries
• Another formulation: learn three probabilistic classifiers:
  – START(i) = Prob(position i starts a field)
  – END(j) = Prob(position j ends a field)
  – LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i,j) bySTART(i) * END(j) * LEN(j-i)
• LEN(k) is estimated from a histogram
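The scoring rule above is easy to sketch numerically. All probability tables below are invented illustrations, not BWI's learned detectors; only the combination rule START(i) * END(j) * LEN(j-i), with LEN estimated from a histogram, comes from the slide.

```python
from collections import Counter

def len_histogram(field_lengths):
    """Estimate LEN(k) as the empirical histogram of training field lengths."""
    counts = Counter(field_lengths)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def score(i, j, start_prob, end_prob, len_prob):
    """BWI-style score of extracting the span (i, j)."""
    return start_prob.get(i, 0.0) * end_prob.get(j, 0.0) * len_prob.get(j - i, 0.0)

LEN = len_histogram([2, 2, 2, 3])   # toy training field lengths
START = {4: 0.9, 7: 0.1}            # hypothetical START classifier outputs
END = {6: 0.8, 9: 0.2}              # hypothetical END classifier outputs

best = max(((i, j) for i in START for j in END if j > i),
           key=lambda ij: score(ij[0], ij[1], START, END, LEN))
```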
[Freitag & Kushmerick, AAAI 2000]
BWI: Learning to detect boundaries
• BWI uses boosting to find “detectors” for START and END
• Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
• Each “pattern” is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
• Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns
BWI: Learning to detect boundaries
Field          F1
Person Name    30%
Location       61%
Start Time     98%
Problems with Sliding Windows and Boundary Finders
• Decisions in neighboring parts of the input are made independently from each other.
– Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”.
– It is possible for two overlapping windows to both be above threshold.
– In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
Finite State Machines
Hidden Markov Models
[Figure: an HMM shown both as a finite state model and as a graphical model, with state sequence s_{t-1}, s_t, s_{t+1} emitting observations o_{t-1}, o_t, o_{t+1}]
Parameters, for all states S = {s_1, s_2, …}:
– Start state probabilities: P(s_1)
– Transition probabilities: P(s_t | s_{t-1})
– Observation (emission) probabilities: P(o_t | s_t)

Training: maximize probability of training observations (with prior).

P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) · P(o_t | s_t)
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
The model generates a state sequence (via transitions) and an observation sequence o_1, o_2, …, o_8 (via emissions); each emission distribution is usually a multinomial over an atomic, fixed alphabet.
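The factorization P(s, o) = ∏_t P(s_t | s_{t-1}) · P(o_t | s_t) can be checked numerically on a toy model. The two-state parameters below ("name" and "other") are invented for illustration, not trained on anything.

```python
# Toy numerical check of the HMM joint-probability factorization.
start = {"other": 0.8, "name": 0.2}
trans = {("other", "other"): 0.7, ("other", "name"): 0.3,
         ("name", "other"): 0.6, ("name", "name"): 0.4}
emit = {("other", "spoke"): 0.5, ("other", "Lawrence"): 0.1,
        ("name", "Lawrence"): 0.7, ("name", "spoke"): 0.05}

def joint(states, obs):
    """P(s, o): start and emission at t=0, then transition * emission per step."""
    p = start[states[0]] * emit[(states[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(states[t - 1], states[t])] * emit[(states[t], obs[t])]
    return p

p = joint(["name", "other"], ["Lawrence", "spoke"])  # 0.2*0.7 * 0.6*0.5
```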
IE with Hidden Markov Models
Given a sequence of observations:

Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM, find the most likely state sequence (Viterbi):

s* = argmax_s P(s, o)

Any words said to be generated by the designated "person name" state are extracted as a person name:

Person name: Lawrence Saul
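The Viterbi step can be sketched as a short dynamic program: given trained HMM tables, recover the most likely state sequence and extract the words emitted by the "name" state. All probabilities below are invented for illustration.

```python
# Minimal Viterbi decoder over a two-state HMM (sketch, toy parameters).
def viterbi(obs, states, start, trans, emit):
    V = [{s: start[s] * emit[s].get(obs[0], 1e-6) for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            scores[s] = V[-1][prev] * trans[prev][s] * emit[s].get(o, 1e-6)
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

states = ["other", "name"]
start = {"other": 0.9, "name": 0.1}
trans = {"other": {"other": 0.8, "name": 0.2}, "name": {"other": 0.4, "name": 0.6}}
emit = {"other": {"Yesterday": 0.3, "spoke": 0.3},
        "name": {"Lawrence": 0.4, "Saul": 0.4}}
obs = ["Yesterday", "Lawrence", "Saul", "spoke"]
path = viterbi(obs, states, start, trans, emit)
names = [w for w, s in zip(obs, path) if s == "name"]
```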
Generative Extraction with HMMs
• Parameters: {P(s_t | s_{t-1}), P(o_t | s_t)} for all states s_t, words o_t
• Parameters define generative model:
[McCallum, Nigam, Seymore & Rennie ‘00]
P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) · P(o_t | s_t)
HMM Example: “Nymble”
Task: Named Entity Extraction [Bikel et al. '97]

Train on 450k words of news wire text.

[Figure: HMM with name-class states Person, Org, Other (plus five other name classes), connected to start-of-sentence and end-of-sentence states]

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).

Results:

Case    Language    F1
Mixed   English     93%
Upper   English     91%
Mixed   Spanish     90%

Other examples of HMMs in IE: [Leek '97; Freitag & McCallum '99; Seymore et al. '99]
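The back-off idea can be sketched with a simple interpolation: prefer the richer context-conditioned estimate, and fall back to coarser distributions when the context was never seen. This is a hedged illustration of back-off in general, not Nymble's actual weighting scheme; all counts and the mixing weight are invented.

```python
# Sketch: transition estimate with back-off from bigram to unigram counts.
def p_transition(s, prev, bigram, unigram, total, lam=0.8):
    """P(s | prev): interpolate the bigram estimate with the unigram
    estimate; use the unigram alone when the context is unseen."""
    uni = unigram.get(s, 0) / total
    ctx = sum(v for (p, _), v in bigram.items() if p == prev)
    if ctx == 0:
        return uni
    return lam * bigram.get((prev, s), 0) / ctx + (1 - lam) * uni

bigram = {("Person", "Other"): 3, ("Person", "Person"): 1}  # toy counts
unigram = {"Person": 4, "Other": 6}
total = 10
p = p_transition("Other", "Person", bigram, unigram, total)
```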
Regrets from Atomic View of Tokens
Would like richer representation of text: multiple overlapping features, whole chunks of text.
Line, sentence, or paragraph features:
– length
– is centered in page
– percent of non-alphabetics
– white-space aligns with next line
– containing sentence has two verbs
– grammatically contains a question
– contains links to "authoritative" pages
– emissions that are uncountable
– features at multiple levels of granularity
Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
Problems with Richer Representation and a Generative Model
• These arbitrary features are not independent:
  – Overlapping and long-distance dependencies
  – Multiple levels of granularity (words, characters)
  – Multiple modalities (words, formatting, layout)
  – Observations from past and future
• HMMs are generative models of the text:
• Generative models do not easily handle these non-independent features. Two choices:
  – Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
  – Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
P(s, o)
Conditional Sequence Models
• We would prefer a conditional model, P(s | o) instead of P(s, o):
  – Can examine features, but not responsible for generating them.
  – Don't have to explicitly model their dependencies.
  – Don't "waste modeling effort" trying to generate what we are given at test time anyway.
• If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.
Conditional Markov Models
Generative (traditional HMM):

[Figure: directed chain of states s_{t-1}, s_t, s_{t+1}; transitions between states, each state emitting an observation]

P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) · P(o_t | s_t)
Conditional:

[Figure: chain of states s_{t-1}, s_t, s_{t+1}; each transition conditioned on the current observation]

P(s | o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)
Standard belief propagation: forward-backward procedure.Viterbi and Baum-Welch follow naturally.
Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]MaxEnt POS Tagger [Ratnaparkhi, 1996]
SNoW-based Markov Model [Punyakanok & Roth, 2000]
Exponential Form for “Next State” Function
P_{s_{t-1}}(s_t | o_t) = P(s_t | s_{t-1}, o_t) = (1 / Z(o_t, s_{t-1})) · exp( Σ_k λ_k f_k(o_t, s_t) )
    (λ_k: weight, f_k: feature)

Capture the dependency on s_{t-1} with |S| independent functions, P_{s_{t-1}}(s_t | o_t). Each state contains a "next-state classifier" that, given the next observation, produces a probability of the next state, P_{s_{t-1}}(s_t | o_t).

Recipe:
– Labeled data is assigned to transitions.
– Train each state's exponential model by maximum entropy.
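One state's "next-state classifier" is just a locally normalized exponential (maxent) model over the possible next states. The sketch below uses hand-set weights rather than weights trained by maximum entropy; the feature and state names are invented.

```python
import math

# Sketch: P_{s_prev}(s | o) proportional to exp(sum_k lambda_k * f_k(o, s)),
# normalized over the candidate next states.
def next_state_probs(features, weights, states):
    scores = {}
    for s in states:
        scores[s] = math.exp(sum(weights.get((f, s), 0.0) for f in features))
    z = sum(scores.values())  # local normalizer Z
    return {s: v / z for s, v in scores.items()}

states = ["name", "other"]
weights = {("is_capitalized", "name"): 2.0, ("is_capitalized", "other"): -1.0}
probs = next_state_probs({"is_capitalized"}, weights, states)
```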
Label Bias Problem

• Consider this MEMM, and enough training data to perfectly model it. Path 0-1-2-3 should read "rib" and path 0-4-5-3 should read "rob":

Pr(0123|rib) = 1
Pr(0453|rob) = 1

• But per-state normalization forces every state with a single outgoing transition to give that transition probability 1, regardless of the observation:

Pr(0123|rob) = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1
Pr(0453|rib) = Pr(4|0,r)/Z1' · Pr(5|4,i)/Z2' · Pr(3|5,b)/Z3' = 0.5 · 1 · 1

• After the first step the observations cannot sway the path, so the model cannot prefer the correct path for "rib" over the one for "rob".
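The effect can be checked numerically. The sketch below hard-codes the per-state-normalized transition tables implied by the example (a toy reconstruction, not code from the talk):

```python
# Label-bias demo: after the shared first transition on 'r', states 1, 2,
# 4 and 5 each have exactly one outgoing arc, so per-state normalization
# gives that arc probability 1 no matter which character is observed.

# next_state[(state, char)] -> {next_state: prob}; each row sums to 1
next_state = {
    (0, "r"): {1: 0.5, 4: 0.5},              # the only choice point
    (1, "i"): {2: 1.0}, (1, "o"): {2: 1.0},  # single arc: forced to prob 1
    (4, "i"): {5: 1.0}, (4, "o"): {5: 1.0},
    (2, "b"): {3: 1.0}, (5, "b"): {3: 1.0},
}

def path_prob(path, word):
    """Product of per-state transition probabilities along a path."""
    p = 1.0
    for prev, nxt, ch in zip(path, path[1:], word):
        p *= next_state[(prev, ch)].get(nxt, 0.0)
    return p

print(path_prob([0, 1, 2, 3], "rib"))  # 0.5
print(path_prob([0, 1, 2, 3], "rob"))  # 0.5, identical despite the wrong word
print(path_prob([0, 4, 5, 3], "rob"))  # 0.5
```

The CRF's single global normalizer Z_o removes this bias: a transition can then receive arbitrarily low unnormalized score for a mismatching observation.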
![Page 47: Machine Learning for Information Extraction: An Overview Kamal Nigam Google Pittsburgh With input, slides and suggestions from William Cohen, Andrew McCallum](https://reader030.vdocuments.us/reader030/viewer/2022013115/56649f115503460f94c23b2f/html5/thumbnails/47.jpg)
From HMMs to MEMMs to CRFs

s = s_1 ... s_n,  o = o_1 ... o_n

[Figure: three linear-chain graphical models (HMM, MEMM, CRF) over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}]

HMM (a special case of MEMMs and CRFs):

P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) · P(o_t | s_t)

MEMM:

P(s | o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)
         = ∏_{t=1}^{|o|} (1 / Z(o_t, s_{t-1})) · exp( Σ_j λ_j f_j(s_t, s_{t-1}) + Σ_k μ_k g_k(s_t, o_t) )

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]:

P(s | o) = (1 / Z_o) · ∏_{t=1}^{|o|} exp( Σ_j λ_j f_j(s_t, s_{t-1}) + Σ_k μ_k g_k(s_t, o_t) )
![Page 48: Machine Learning for Information Extraction: An Overview Kamal Nigam Google Pittsburgh With input, slides and suggestions from William Cohen, Andrew McCallum](https://reader030.vdocuments.us/reader030/viewer/2022013115/56649f115503460f94c23b2f/html5/thumbnails/48.jpg)
Conditional Random Fields (CRFs)
[Figure: linear-chain CRF over states S_t ... S_{t+4}, each conditioned on the full observation sequence O = O_t, O_t+1, O_t+2, O_t+3, O_t+4]

Markov on s, conditional dependency on o.

P(s | o) = (1 / Z_o) · exp( Σ_{t=1}^{|o|} Σ_k λ_k f_k(s_t, s_{t-1}, o, t) )
Hammersley-Clifford-Besag theorem stipulates that the CRFhas this form—an exponential function of the cliques in the graph.
Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o|·|S|²), just like HMMs.
[Lafferty, McCallum, Pereira ‘2001]
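The O(|o|·|S|²) dynamic program mentioned above can be sketched as a forward pass over log-potentials. The uniform-potential check at the end is a hypothetical sanity test: with all potentials equal to 1, the partition function simply counts label sequences:

```python
import math

# Forward-pass computation of the CRF partition function Z(o) from
# potentials psi_t(s', s) = exp(sum_k lambda_k f_k(s, s', o, t)).
# Runs in O(|o| * |S|^2). Potentials come from a caller-supplied
# function rather than learned weights.

def log_partition(log_psi, n_states, length):
    """log Z(o) for a chain of `length` steps.

    log_psi(t, s_prev, s) returns log psi_t; s_prev is None at t = 0.
    """
    # alpha[s] = log of the total score of all prefixes ending in state s
    alpha = [log_psi(0, None, s) for s in range(n_states)]
    for t in range(1, length):
        alpha = [
            math.log(sum(math.exp(alpha[sp] + log_psi(t, sp, s))
                         for sp in range(n_states)))
            for s in range(n_states)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

# With all potentials equal to 1 (log-potential 0), Z counts the
# 3**4 = 81 possible label sequences.
uniform = lambda t, sp, s: 0.0
print(math.exp(log_partition(uniform, n_states=3, length=4)))
```

A production implementation would work with log-sum-exp throughout for numerical stability on long sequences; the structure of the recursion is the same.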
![Page 49: Machine Learning for Information Extraction: An Overview Kamal Nigam Google Pittsburgh With input, slides and suggestions from William Cohen, Andrew McCallum](https://reader030.vdocuments.us/reader030/viewer/2022013115/56649f115503460f94c23b2f/html5/thumbnails/49.jpg)
Training CRFs
Maximize log-likelihood of parameters Λ = {λ_k} given training data D = {⟨o, s⟩^(i)}.

Log-likelihood gradient:

∂L/∂λ_k = Σ_i C_k(s^(i), o^(i)) − Σ_i Σ_s P_Λ(s | o^(i)) · C_k(s, o^(i)) − λ_k / σ²

where C_k(s, o) = Σ_t f_k(s_t, s_{t-1}, o, t)

i.e., (feature count using correct labels) − (feature count using labels assigned by current parameters) − (smoothing penalty)
Methods:
• iterative scaling (quite slow)
• conjugate gradient (much faster)
• conjugate gradient with preconditioning (super fast)
• limited-memory quasi-Newton methods (also super fast)
Complexity comparable to standard Baum-Welch
[Sha & Pereira 2002], [Malouf 2002]
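The gradient's count-difference structure can be illustrated on a toy chain, computing the model expectation by brute-force enumeration rather than forward-backward (fine at toy sizes; the two features and the example are invented):

```python
import math
from itertools import product

# Toy CRF log-likelihood gradient:
# dL/dlambda_k = C_k(gold) - E_{P(s|o)}[C_k(s)] - lambda_k / sigma^2.
# Expectations are computed by enumerating every label sequence.

STATES = ["B", "I"]

def feature_counts(states, obs):
    """C_k(s, o) = sum_t f_k(s_t, s_{t-1}, o, t) for two invented features."""
    c = {"starts_B": float(states[0] == "B"), "I_after_B": 0.0}
    for prev, cur in zip(states, states[1:]):
        c["I_after_B"] += float(prev == "B" and cur == "I")
    return c

def gradient(weights, obs, gold, sigma2=10.0):
    seqs = list(product(STATES, repeat=len(obs)))
    # Unnormalized score of each sequence: exp(lambda . C(s, o))
    scores = [math.exp(sum(weights[k] * v
                           for k, v in feature_counts(s, obs).items()))
              for s in seqs]
    z = sum(scores)
    # Expected feature counts under the current model
    expected = {k: 0.0 for k in weights}
    for s, sc in zip(seqs, scores):
        for k, v in feature_counts(s, obs).items():
            expected[k] += (sc / z) * v
    gold_c = feature_counts(gold, obs)
    return {k: gold_c[k] - expected[k] - weights[k] / sigma2 for k in weights}

w = {"starts_B": 0.0, "I_after_B": 0.0}
print(gradient(w, obs=["ice", "cream"], gold=("B", "I")))
```

At zero weights the model is uniform over the four label sequences, so the gradient is just (gold count) minus (average count), pushing both weights upward, exactly the behavior the slide's count-difference formula describes.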
![Page 50: Machine Learning for Information Extraction: An Overview Kamal Nigam Google Pittsburgh With input, slides and suggestions from William Cohen, Andrew McCallum](https://reader030.vdocuments.us/reader030/viewer/2022013115/56649f115503460f94c23b2f/html5/thumbnails/50.jpg)
Sample IE Applications of CRFs
• Noun phrase segmentation [Sha & Pereira 03]
• Named entity recognition [McCallum & Li 03]
• Protein names in bio abstracts [Settles 05]
• Addresses in web pages [Culotta et al. 05]
• Semantic roles in text [Roth & Yih 05]
• RNA structural alignment [Sato & Sakakibara 05]
![Page 51: Machine Learning for Information Extraction: An Overview Kamal Nigam Google Pittsburgh With input, slides and suggestions from William Cohen, Andrew McCallum](https://reader030.vdocuments.us/reader030/viewer/2022013115/56649f115503460f94c23b2f/html5/thumbnails/51.jpg)
Examples of Recent CRF Research
• Semi-Markov CRFs [Sarawagi & Cohen 05]
– Token-level decisions are awkward for multi-token segments
– A segment-sequence model alleviates this
– Two-level model: sequences of segments, which are themselves sequences of tokens
• Stochastic Meta-Descent [Vishwanathan 06]
– Stochastic gradient optimization for training
– Takes gradient steps on small batches of examples
– An order of magnitude faster than L-BFGS
– Same resulting accuracies for extraction
![Page 52: Machine Learning for Information Extraction: An Overview Kamal Nigam Google Pittsburgh With input, slides and suggestions from William Cohen, Andrew McCallum](https://reader030.vdocuments.us/reader030/viewer/2022013115/56649f115503460f94c23b2f/html5/thumbnails/52.jpg)
Further Reading about CRFs
Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006.
http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf