Information Extraction and Integration: an Overview
William W. Cohen, Carnegie Mellon University
April 26, 2004

Page 1:

Information Extraction and Integration: an Overview

William W. Cohen, Carnegie Mellon University

April 26, 2004

Page 2:

Example: The Problem

Martin Baker, a person

Genomics job

Employer's job posting form

Page 3:

Example: A Solution

Page 4:

Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Page 5:

Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.

Page 6:

IE from Research Papers

Page 7:

What is “Information Extraction”

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Page 8:

What is “Information Extraction”

As a task: filling slots in a database from sub-segments of text.

(same news excerpt as above)

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

IE

Page 9:

What is “Information Extraction”

As a family of techniques: Information Extraction = segmentation + classification + clustering + association

(same news excerpt as above)

Extracted segments:
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

aka “named entity extraction”

Page 10:

What is “Information Extraction”

As a family of techniques: Information Extraction = segmentation + classification + association + clustering

(same news excerpt and extracted segments as above)

Page 11:

What is “Information Extraction”

As a family of techniques: Information Extraction = segmentation + classification + association + clustering

(same news excerpt and extracted segments as above)

Page 12:

What is “Information Extraction”

As a family of techniques: Information Extraction = segmentation + classification + association + clustering

(same news excerpt and extracted segments as above)

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

Page 13:

Tutorial Outline

• IE History
• Landscape of problems and solutions
• Models for named entity recognition:
  – Sliding window
  – Boundary finding
  – Finite state machines
• Overview of related problems and solutions:
  – Association, Clustering
  – Integration with Data Mining

Page 14:

IE History

Pre-Web
• Mostly news articles
  – De Jong's FRUMP [1982]: hand-built system to fill Schank-style “scripts” from news wire
  – Message Understanding Conference (MUC): DARPA ['87-'95], TIPSTER ['92-'96]
• Early work dominated by hand-built models
  – E.g. SRI's FASTUS, hand-built FSMs.
  – But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]

Web
• AAAI '94 Spring Symposium on “Software Agents”
  – Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni.
• Tom Mitchell's WebKB, '96
  – Build KBs from the Web.
• Wrapper induction
  – Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …
  – Citeseer; Cora; FlipDog; contEd courses, corpInfo, …

Page 15:

IE History

Biology
• Gene/protein entity extraction
• Protein/protein interaction facts
• Automated curation/integration of databases
  – At CMU: SLIF (Murphy et al: subcellular information from images + text in journal articles)

Email
• EPCA, PAL, RADAR, CALO: intelligent office assistants that “understand” some part of email
  – At CMU: web site update requests, office-space requests, calendar scheduling requests, social network analysis of email.

Page 16:

IE is different in different domains! Example: on the web there is less grammar, but more formatting & linking.

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Newswire example:

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

Web example:
www.apple.com/retail
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html

Page 17:

Landscape of IE Tasks (1/4): Degree of Formatting

Text paragraphs without formatting
Grammatical sentences and some formatting & links
Non-grammatical snippets, rich formatting & links
Tables

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Page 18:

Landscape of IE Tasks (2/4): Intended Breadth of Coverage

Breadth              Example                Relies on
Web site specific    Amazon.com book pages  Formatting
Genre specific       Resumes                Layout
Wide, non-specific   University names       Language

Page 19:

Landscape of IE Tasks (3/4): Complexity (e.g. of word patterns)

Closed set (e.g. U.S. states):
  "He was born in Alabama…"
  "The big Wyoming sky…"

Regular set (e.g. U.S. phone numbers):
  "Phone: (413) 545-1323"
  "The CALD main office can be reached at 412-268-1299"

Complex pattern (e.g. U.S. postal addresses):
  "University of Arkansas, P.O. Box 140, Hope, AR 71802"
  "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"

Ambiguous patterns, needing context and many sources of evidence (e.g. person names):
  "…was among the six houses sold by Hope Feldman that year."
  "Pawel Opalinski, Software Engineer at WhizBang Labs."

Page 20:

Landscape of IE Tasks (4/4): Single Field/Record

Example text: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."

Single entity (“named entity” extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship:
  Relation: Person-Title; Person: Jack Welch; Title: CEO
  Relation: Company-Location; Company: General Electric; Location: Connecticut

N-ary record:
  Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Page 21:

Landscape of IE Techniques (1/1): Models

Any of these models can be used to capture words, formatting, or both. Each is illustrated below on the sentence "Abraham Lincoln was born in Kentucky."

Lexicons: member of a list? (Alabama, Alaska, …, Wisconsin, Wyoming)

Classify pre-segmented candidates: a classifier asks "which class?" for each candidate segment.

Sliding window: a classifier asks "which class?" for each window, trying alternate window sizes.

Boundary models: classifiers mark BEGIN and END positions.

Context-free grammars: most likely parse? (e.g. tags NNP NNP V V P NNP, grouped into NP, PP, VP, S)

Finite state machines: most likely state sequence?

Page 22:

Sliding Windows

Page 23:

Extraction by Sliding Window (e.g. looking for the seminar location)

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

Page 24:

Extraction by Sliding Window (continued: the window advances across the same announcement)

Page 25:

Extraction by Sliding Window (continued)

Page 26:

Extraction by Sliding Window (continued)

Page 27:

A “Naïve Bayes” Sliding Window Model [Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
    w_{t-m} … w_{t-1} [ w_t … w_{t+n} ] w_{t+n+1} … w_{t+n+m}
        prefix             contents           suffix

• Estimate Pr(LOCATION | window) using Bayes rule.
• Try all “reasonable” windows (vary length, position).
• Assume independence for length, prefix words, suffix words, content words.
• Estimate from data quantities like: Pr("Place" in prefix | LOCATION).

If Pr("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.
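To make this concrete, here is a minimal sketch of such a scorer in Python. The names and probability tables are illustrative assumptions, not Freitag's code; a real system estimates and smooths the tables from labeled windows.

```python
import math

# Toy conditional probability tables; a real system estimates these
# from labeled windows (with smoothing) rather than hard-coding them.
PRIOR = 0.01                                  # Pr(LOCATION) for a random window
P_PREFIX = {"Place": 0.30, ":": 0.20}         # Pr(word in prefix | LOCATION)
P_CONTENT = {"Wean": 0.10, "Hall": 0.12, "Rm": 0.08, "5409": 0.01}
P_SUFFIX = {"Speaker": 0.25, ":": 0.15}
P_LEN = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.3, 5: 0.1}  # Pr(window length | LOCATION)
DEFAULT = 1e-4                                # fallback for unseen words

def log_score(tokens, i, j, m=2):
    """Naive-Bayes log-score that tokens[i:j] is a LOCATION, using m words of
    prefix and suffix context and assuming independence of length, prefix
    words, content words, and suffix words."""
    s = math.log(PRIOR) + math.log(P_LEN.get(j - i, DEFAULT))
    for w in tokens[max(0, i - m):i]:
        s += math.log(P_PREFIX.get(w, DEFAULT))
    for w in tokens[i:j]:
        s += math.log(P_CONTENT.get(w, DEFAULT))
    for w in tokens[j:j + m]:
        s += math.log(P_SUFFIX.get(w, DEFAULT))
    return s

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
# Try all "reasonable" windows (here: lengths 1..5) and keep the best.
best = max(((i, j) for i in range(len(tokens))
            for j in range(i + 1, min(i + 6, len(tokens) + 1))),
           key=lambda ij: log_score(tokens, *ij))
print(best, tokens[best[0]:best[1]])
# With these toy numbers: (5, 9) ['Wean', 'Hall', 'Rm', '5409']
```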

Page 28:

“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements (same announcement as above)

Field         F1
Person Name   30%
Location      61%
Start Time    98%

Page 29:

SRV: a realistic sliding-window-classifier IE system [Freitag AAAI '98]

• What windows to consider?
  – All windows containing as many tokens as the shortest example, but no more tokens than the longest example.
• How to represent a classifier? It might:
  – Restrict the length of the window;
  – Restrict the vocabulary or formatting used before/after/inside the window;
  – Restrict the relative order of tokens;
  – Use inductive logic programming techniques to express all these…

<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>

Page 30:

SRV: a rule-learner for sliding-window classification

• Primitive predicates used by SRV:
  – token(X,W), allLowerCase(W), numerical(W), …
  – nextToken(W,U), previousToken(W,V)
• HTML-specific predicates:
  – inTitleTag(W), inH1Tag(W), inEmTag(W), …
  – emphasized(W) = “inEmTag(W) or inBTag(W) or …”
  – tableNextCol(W,U) = “U is some token in the column after the column W is in”
  – tablePreviousCol(W,V), tableRowHeader(W,T), …

Page 31:

SRV: a rule-learner for sliding-window classification

• Non-primitive “conditions” used by SRV:
  – every(+X, f, c) = for all W in X: f(W) = c
  – some(+X, W, <f1, …, fk>, g, c) = exists W: g(fk(…f1(W)…)) = c
  – tokenLength(+X, relop, c)
  – position(+W, direction, relop, c)
    • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)

Example rule:
courseNumber(X) :- tokenLength(X,=,2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true)

Non-primitive conditions make greedy search easier.
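For illustration, here is a sketch of how a rule of this shape could be evaluated over a candidate window, in Python. The predicate names mirror the slide, but the token representation and the reading of "tripleton" (taken here to mean a three-character token) are assumptions for illustration, not SRV's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str
    in_title: bool                  # token appears inside <title>…</title>
    prev: Optional["Token"] = None  # previousToken(W, V)

def tripleton(tok: Token) -> bool:
    # Assumption: "tripleton" = a three-character token.
    return len(tok.text) == 3

def course_number(window: list[Token]) -> bool:
    """courseNumber(X) :- tokenLength(X,=,2), every(X, inTitle, false),
       some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true)."""
    return (len(window) == 2                                   # tokenLength(X,=,2)
            and all(not w.in_title for w in window)            # every(X, inTitle, false)
            and any(w.prev is not None and w.prev.in_title     # some token whose
                    for w in window)                           #   previousToken is in a title
            and any(tripleton(w) for w in window))             # some token is a tripleton

# Hypothetical window "CS 213", preceded by a title token "CS213":
cs = Token("CS", in_title=False)
num = Token("213", in_title=False, prev=cs)
cs.prev = Token("CS213", in_title=True)
print(course_number([cs, num]))  # True: length 2, not in title, tripleton "213"
```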

Page 32:

Rapier – results vs. SRV

Page 33:

Rule-learning approaches to sliding-window classification: Summary

• SRV, Rapier, and WHISK [Soderland KDD '97]
  – Representations for classifiers allow restriction of the relationships between tokens, etc.
  – Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog).
  – Use of these “heavyweight” representations is complicated, but seems to pay off in results.
• Some questions to consider:
  – Can simpler, propositional representations for classifiers work? (see Roth and Yih)
  – What learning methods to consider? (NB, ILP, boosting, semi-supervised; see Collins & Singer)
  – When do we want to use this method vs. fancier ones?

Page 34:

BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000]

• Another formulation: learn three probabilistic classifiers:
  – START(i) = Prob(position i starts a field)
  – END(j) = Prob(position j ends a field)
  – LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i).
• LEN(k) is estimated from a histogram.

Page 35:

BWI: Learning to detect boundaries

Field         F1
Person Name   30%
Location      61%
Start Time    98%

Page 36:

Problems with Sliding Windows and Boundary Finders

• Decisions in neighboring parts of the input are made independently from each other.
  – Expensive for long entity names.
  – A sliding window may predict a “seminar end time” before the “seminar start time”.
  – It is possible for two overlapping windows to both be above threshold.
  – In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Page 37:

Finite State Machines

Page 38:

IE with Hidden Markov Models

Given a sequence of observations:
  "Yesterday Pedro Domingos spoke this example sentence."

and a trained HMM (states: person name, location name, background), find the most likely state sequence (Viterbi):

  s* = argmax_s Pr(s, o)

Any words said to be generated by the designated “person name” state are extracted as a person name:

  Person name: Pedro Domingos
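A compact Viterbi decoder for a toy HMM of this shape, as a sketch; the states and hand-set probabilities are illustrative, not a trained model:

```python
import math

states = ["background", "person"]
# Hand-set toy parameters; a real HMM would be trained on labeled text.
start = {"background": 0.9, "person": 0.1}
trans = {"background": {"background": 0.8, "person": 0.2},
         "person":     {"background": 0.4, "person": 0.6}}
def emit(state, word):
    known_names = {"Pedro", "Domingos"}
    if state == "person":
        return 0.4 if word in known_names else 0.01
    return 0.001 if word in known_names else 0.1

def viterbi(words):
    """Most likely state sequence: argmax_s Pr(s, o) for the toy HMM."""
    V = [{s: math.log(start[s]) + math.log(emit(s, words[0])) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit(s, w))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    seq = [max(states, key=lambda s: V[-1][s])]   # best final state
    for ptr in reversed(back):                    # follow back-pointers
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

words = "Yesterday Pedro Domingos spoke this example sentence .".split()
tags = viterbi(words)
print([w for w, t in zip(words, tags) if t == "person"])  # ['Pedro', 'Domingos']
```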

Page 39:

HMM for Segmentation

• Simplest Model: One state per entity type

Page 40:

What is a “symbol”???

Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?

A taxonomy of abstraction levels:

All
• Numbers
  – 3-digits (000..999)
  – 5-digits (00000..99999)
  – Others (0..99, 0000..9999, 000000..)
• Words
  – Chars (A..z)
  – Multi-letter (aa..)
• Delimiters (. , / - + ? #)

Datamold: choose the best abstraction level using a holdout set.
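A small sketch of the kind of word-shape abstraction under discussion; the specific shape rules are illustrative choices, not Datamold's actual hierarchy:

```python
import re

def abstractions(token: str) -> list[str]:
    """Return the token plus progressively coarser symbol abstractions."""
    out = [token]
    if token.isdigit():
        out += ["9" * len(token),                         # 4601 -> "9999"
                f"{len(token)}-digits" if len(token) in (3, 5) else "other-number",
                "number"]
    elif token.isalpha():
        shape = "".join("X" if c.isupper() else "x" for c in token)
        out += [token.lower(), shape,                     # Cohen -> "cohen", "Xxxxx"
                re.sub(r"(.)\1+", r"\1", shape)]          # collapse runs: "Xxxxx" -> "Xx"
    else:
        out += ["delimiter"]
    return out

print(abstractions("Cohen"))  # ['Cohen', 'cohen', 'Xxxxx', 'Xx']
print(abstractions("4601"))   # ['4601', '9999', 'other-number', 'number']
```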

Page 41:

HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction. Train on ~500k words of news wire text.

States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).

Results:
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]

Page 42:

What is a symbol?

Bikel et al mix symbols from two abstraction levels

Page 43:

What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of words, e.g. for the word “Wisniewski” (part of a noun phrase, ends in “-ski”):

  identity of word
  ends in “-ski”
  is capitalized
  is part of a noun phrase
  is in a list of city names
  is under node X in WordNet
  is in bold font
  is indented
  is in hyperlink anchor
  …

(figure: a linear chain of states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1})

Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

Page 44:

What is a symbol?

(same figure and feature list as above)

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:

  Pr(s_t | x_t, …)

Page 45:

What is a symbol?

(same figure and feature list as above)

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:

  Pr(s_t | x_t, s_{t-1}, …)

Page 46:

What is a symbol?

(same figure and feature list as above)

Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history:

  Pr(s_t | x_t, s_{t-1}, s_{t-2}, …)

Page 47:

Ratnaparkhi's MXPOST

• Sequential learning problem: predict POS tags of words.
• Uses the MaxEnt model described above.
• Rich feature set.
• To smooth, discard features occurring < 10 times.

Page 48:

Conditional Markov Models (CMMs), aka MEMMs, aka Maxent Taggers, vs. HMMs

HMM (generative; states emit observations):

  Pr(s, o) = Π_i Pr(s_i | s_{i-1}) Pr(o_i | s_i)

CMM (conditional; each state depends on the previous state and the observation):

  Pr(s | o) = Π_i Pr(s_i | s_{i-1}, o_i)
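To see the two factorizations side by side, a toy computation (all probability tables are made up for illustration):

```python
import math

words  = ["Pedro", "Domingos", "spoke"]
states = ["person", "person", "background"]

# Made-up tables; an HMM estimates the first two by counting,
# a MEMM fits the conditional table with a maxent classifier.
p_trans = {("START", "person"): 0.2, ("person", "person"): 0.6,
           ("person", "background"): 0.4}
p_emit  = {("person", "Pedro"): 0.4, ("person", "Domingos"): 0.4,
           ("background", "spoke"): 0.1}
p_cond  = {("START", "Pedro", "person"): 0.9,
           ("person", "Domingos", "person"): 0.95,
           ("person", "spoke", "background"): 0.8}

# HMM: Pr(s, o) = prod_i Pr(s_i | s_{i-1}) * Pr(o_i | s_i)
hmm = math.prod(p_trans[(prev, s)] * p_emit[(s, w)]
                for prev, s, w in zip(["START"] + states, states, words))

# CMM/MEMM: Pr(s | o) = prod_i Pr(s_i | s_{i-1}, o_i)
cmm = math.prod(p_cond[(prev, w, s)]
                for prev, s, w in zip(["START"] + states, states, words))

print(f"HMM joint Pr(s,o) = {hmm:.6f}; CMM conditional Pr(s|o) = {cmm:.3f}")
```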

Page 49:

Label Bias Problem

• Consider this MEMM (two branches: states 0-1-2-3 reading "rob", states 0-4-5-3 reading "rib"), and enough training data to perfectly model it:

Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1

Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1

But per-state normalization also gives Pr(0123|rib) = Pr(0453|rob) = 0.5: once a branch is entered, each state has a single successor, so the middle observation cannot favor one path over the other.
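The effect is easy to reproduce numerically; in this sketch of the slide's MEMM, per-state normalization makes the path probabilities independent of the middle observation:

```python
# The MEMM from the slide: from state 0, observing 'r' can lead to state 1
# (the "rob" branch) or state 4 (the "rib" branch); every later state has
# exactly one successor. Per-state normalization then forces probability 1
# on that successor, no matter what character is observed.
successors = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}

def pr_path(path, word):
    # `word` is deliberately unused past the branch point: a locally
    # normalized distribution over a single successor is 1 regardless of
    # the observation, and at state 0 both branches consume the same 'r'.
    p = 1.0
    for state in path[:-1]:
        p *= 1.0 / len(successors[state])
    return p

for word in ["rob", "rib"]:
    print(word,
          "path 0-1-2-3:", pr_path((0, 1, 2, 3), word),
          "path 0-4-5-3:", pr_path((0, 4, 5, 3), word))
# Both paths get 0.5 for *both* words: the middle observation is ignored.
```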

Page 50:

Another view of label bias [Sha & Pereira]

So what’s the alternative?

Page 51:

CMMs to CRFs

CMM (locally normalized, one Z per state):

  Pr(y_1 … y_n | x_1 … x_n) = Π_i Pr(y_i | y_{i-1}, x) = Π_i [ exp(Σ_j λ_j f_j(x, y_i, y_{i-1})) / Z(y_{i-1}, x) ]

New model (globally normalized, one Z per input):

  Pr(y | x) = exp(Σ_i λ_i F_i(x, y)) / Z(x), where F_i(x, y) = Σ_j f_i(x, y_j, y_{j-1})

Page 52:

What's the new model look like?

  Pr(y | x) = exp(Σ_i λ_i F_i(x, y)) / Z(x) = exp(Σ_i Σ_j λ_i f_i(x, y_j, y_{j-1})) / Z(x)

(figure: a linear chain y1 — y2 — y3, with each y_j also connected to the inputs x1 x2 x3)

What's independent?

Page 53:

Graphical comparison among HMMs, MEMMs and CRFs

(figure: the graph structures of the three models; HMM and MEMM are directed chains, the CRF is an undirected chain)

Page 54:

Sha & Pereira results

CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

Page 55:

Sha & Pereira results

(training times in minutes, 375k examples)

Page 56:

Broader Issues in IE

Page 57:

Broader View

Up to now we have been focused on segmentation and classification. The full pipeline:

Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Load DB; Query, Search; Data mine

Supporting steps: Create ontology; Filter by relevance; Label training data; Train extraction models.

Page 58:

Broader View

(same pipeline as above, with Tokenize added to the IE box and five of the steps numbered 1-5 for discussion)

Now touch on some other issues.

Page 59:

(1) Association as Binary Classification [Zelenko et al, 2002]

"Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair."
  (entities: Person, Person, Role)

Person-Role(Christos Faloutsos, KDD 2003 General Chair) → NO
Person-Role(Ted Senator, KDD 2003 General Chair) → YES

Do this with SVMs and tree kernels over parse trees.

Page 60:

(1) Association with Finite State Machines [Ray & Craven, 2001]

"… This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. …"

Tagged: DET this / N enzyme / N ubc6 / V localizes / PREP to / ART the / ADJ endoplasmic / N reticulum / PREP with / ART the / ADJ catalytic / N domain / V facing / ART the / N cytosol

Extracted: Subcellular-localization(UBC6, endoplasmic reticulum)

Page 61:

(1) Association with Graphical Models [Roth & Yih 2002]

Capture arbitrary-distance dependencies among predictions.

• Local language models contribute evidence to entity classification.
• Local language models contribute evidence to relation classification.
• Random variable over the class of entity #2, e.g. over {person, location, …}.
• Random variable over the class of the relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}.
• Dependencies between classes of entities and relations!

Inference with loopy belief propagation.

Page 62:

(1) Association with Graphical Models [Roth & Yih 2002]

(same network as above; example assignment: entity #1 = person, entity #2 = location, relation = lives-in)

Page 63:

Broader View

(same pipeline figure as above)

When do two extracted strings refer to the same object?

Page 64:

(2) Information Integration

Goal might be to merge the results of two IE systems:

System A's records:
  Name: Introduction to Computer Science; Number: CS 101; Teacher: M. A. Kludge; Time: 9-11am
  Name: Data Structures in Java; Room: 5032 Wean Hall

System B's record:
  Title: Intro. to Comp. Sci.; Num: 101; Dept: Computer Science; Teacher: Dr. Klüdge; TA: John Smith; Topic: Java Programming; Start time: 9:10 AM

[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001], [Richardson & Domingos 2003]

Page 65:

(2) Other Information Integration Issues

• Distance metrics for text: which work well? [Cohen, Ravikumar, Fienberg, 2003]
• Finessing integration by soft database operations based on similarity. [Cohen, 2000]
• Integration of complex structured databases: capture dependencies among multiple merges. [Cohen, MacAllister, Kautz KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]
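As a toy illustration of soft matching, the sketch below scores the two course records by token Jaccard similarity over hypothesized field pairs; the pairing, threshold, and similarity choice are illustrative stand-ins for the distance metrics compared in [Cohen, Ravikumar, Fienberg, 2003]:

```python
import re

def tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

record_a = {"Name": "Introduction to Computer Science",
            "Number": "CS 101", "Teacher": "M. A. Kludge"}
record_b = {"Title": "Intro. to Comp. Sci.",
            "Num": "101", "Teacher": "Dr. Kludge"}

# Hypothesized field correspondences; in practice, schema matching is
# itself a learning problem [Doan, Domingos, Halevy 2001].
pairs = [("Name", "Title"), ("Number", "Num"), ("Teacher", "Teacher")]
score = sum(jaccard(record_a[f], record_b[g]) for f, g in pairs) / len(pairs)
# Plain token overlap is weak on abbreviations ("Intro." vs "Introduction"),
# which is exactly what motivates better string distances.
print(round(score, 2), "-> same course?", score > 0.2)
```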

Page 66:

IE Resources

• Data
  – RISE, http://www.isi.edu/~muslea/RISE/index.html
  – Linguistic Data Consortium (LDC): Penn Treebank, Named Entities, Relations, etc.
  – http://www.biostat.wisc.edu/~craven/ie
• Papers, tutorials, lectures, code
  – http://www.cs.cmu.edu/~wcohen/10-707

Page 67:

References

• [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R.: Nymble: a high-performance learning name-finder. In Proceedings of ANLP'97, p194-201.
• [Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of The Eleventh International World Wide Web Conference (WWW-2002).
• [Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
• [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD-98.
• [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language. ACM Transactions on Information Systems, 18(3).
• [Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web. In Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
• [Collins & Singer 1999] Collins, M.; and Singer, Y.: Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
• [De Jong 1982] De Jong, G.: An Overview of the FRUMP System. In Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
• [Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
• [Freitag, 1999] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
• [Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Machine Learning 39(2/3): 99-101 (2000).
• [Freitag & Kushmerick, 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
• [Freitag & McCallum 1999] Freitag, D.; and McCallum, A.: Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
• [Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness. Artificial Intelligence, 118 (pp 15-68).
• [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML-2001.
• [Leek 1997] Leek, T. R.: Information extraction using hidden Markov models. Master's thesis, UC San Diego.
• [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000.
• [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226-233.

Page 68:

References

• [Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. In Proceedings of Autonomous Agents-99.
• [Muslea et al, 2000] Muslea, I.; Minton, S.; and Knoblock, C.: Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems.
• [Nahm & Mooney, 2000] Nahm, Y.; and Mooney, R.: A mutually beneficial integration of data mining and information extraction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), pages 627-632, Austin, TX.
• [Punyakanok & Roth 2001] Punyakanok, V.; and Roth, D.: The use of classifiers in sequential inference. In Advances in Neural Information Processing Systems 13.
• [Ratnaparkhi 1996] Ratnaparkhi, A.: A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, p133-141.
• [Ray & Craven 2001] Ray, S.; and Craven, M.: Representing Sentence Structure in Hidden Markov Models for Information Extraction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001), Seattle, WA. Morgan Kaufmann.
• [Soderland 1997] Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97).
• [Soderland 1999] Soderland, S.: Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1/3):233-277.