
Page 1:

Information Extraction

October 11, 2006

Page 2:

Plan for IE

Introduction to the IE problem
Wrappers
Wrapper Induction
Traditional NLP-based IE
Probabilistic/machine learning methods for information extraction

Page 3:

What is Information Extraction?

First note this semantic slippage: Information Retrieval doesn't retrieve information. You have an information need, but what you get back isn't information but documents, which you hope have the information.

Information extraction is one approach to going further for a special case: there's some relation you're interested in, and your query is for elements of that relation. It is a limited form of natural language understanding.

But this is a common scenario…

Page 4:

What is "Information Extraction"? Filling slots in a database from sub-segments of text. As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Page 5:

Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Page 6:

Extracting Corporate Information

E.g., information need: Who is the CEO of MarketSoft?

Source web page; color highlights indicate the type of information (e.g., red = name). Data automatically extracted from marketsoft.com.

Source: Whizbang! Labs/Andrew McCallum

Page 7:

IE from Research Papers

Page 8:

Page 9:

Commercial information…

(Screenshot callouts: "Need this price"; "Title"; "A book, not a toy".)

Page 10:

Canonicalization: Product information

Page 11:

Product information

Page 12:

Product information

Page 13:

Product info

CNET markets this information. How do they get most of it? Sometimes data feeds; otherwise phone calls and typing.

Page 14:

It's difficult because of textual inconsistency. Descriptions of digital camera sensors:

Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
Image Capture Device Total Pixels Approx. 3.34 million, Effective Pixels Approx. 3.24 million
Image sensor Total Pixels: Approx. 2.11 million-pixel
Imaging sensor Total Pixels: Approx. 2.11 million, 1,688 (H) x 1,248 (V)
CCD Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V])
Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V])
Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])

These all came off the same manufacturer's website!! And this is a very technical domain. Try sofa beds.

Page 15:

Classified Advertisements (Real Estate)

Background: advertisements are plain text, the lowest common denominator: the only thing that 70+ newspapers with 20+ publishing systems can all handle.

<ADNUM>2067206v1</ADNUM><DATE>March 02, 1998</DATE><ADTITLE>MADDINGTON $89,000</ADTITLE><ADTEXT>OPEN 1.00 - 1.45<BR>U 11 / 10 BERTRAM ST<BR>NEW TO MARKET Beautiful<BR>3 brm freestanding<BR>villa, close to shops & bus<BR>Owner moved to Melbourne<BR>ideally suit 1st home buyer,<BR>investor & 55 and over.<BR>Brian Hazelden 0418 958 996<BR>R WHITE LEEMING 9332 3477</ADTEXT>

Page 16:

Page 17:

www.apple.com/retail

IE is different in different domains! Example: on the web there is less grammar, but more formatting & linking.

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

www.apple.com/retail/soho

www.apple.com/retail/soho/theatre.html

(Contrast: newswire text vs. web pages.)

Page 18:

Landscape of IE Tasks (2/4): Intended Breadth of Coverage

Web site specific (formatting), e.g. Amazon.com book pages
Genre specific (layout), e.g. resumes
Wide, non-specific (language), e.g. university names

Page 19:

Landscape of IE Tasks (3/4): Complexity

E.g. word patterns:

Closed set (U.S. states):
He was born in Alabama…
The big Wyoming sky…

Regular set (U.S. phone numbers):
Phone: (413) 545-1323
The CALD main office can be reached at 412-268-1299

Complex pattern (U.S. postal addresses):
University of Arkansas, P.O. Box 140, Hope, AR 71802
Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (person names):
…was among the six houses sold by Hope Feldman that year.
Pawel Opalinski, Software Engineer at WhizBang Labs.

Page 20:

Landscape of IE Tasks (4/4): Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity ("named entity" extraction):
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut

Binary relationship:
Relation: Person-Title; Person: Jack Welch; Title: CEO
Relation: Company-Location; Company: General Electric; Location: Connecticut

N-ary record:
Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Page 21:

Evaluation of Single Entity Extraction

TRUTH (4 true segments): Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.

PRED (6 predicted segments): Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.

(The highlighting that marks the true vs. predicted segments is lost in this transcript.)

Precision = # correctly predicted segments / # predicted segments = 2/6

Recall = # correctly predicted segments / # true segments = 2/4

F1 = harmonic mean of precision & recall = $\frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right)/2}$
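These metrics are easy to compute mechanically. A minimal sketch in Python, treating segments as (start, end) token spans; the span values below are hypothetical, chosen so that 2 of 6 predictions match 4 true segments, as on the slide:

```python
def prf1(true_segments, pred_segments):
    """Segment-level scoring: a prediction counts as correct only if it
    exactly matches a true segment's boundaries."""
    true, pred = set(true_segments), set(pred_segments)
    correct = len(true & pred)
    precision = correct / len(pred)
    recall = correct / len(true)
    f1 = 1 / (((1 / precision) + (1 / recall)) / 2)  # harmonic mean of P and R
    return precision, recall, f1

truth = {(0, 2), (3, 5), (10, 13), (14, 16)}                 # 4 true segments
pred = {(0, 2), (1, 2), (3, 4), (4, 5), (10, 13), (11, 13)}  # 6 predictions, 2 correct
print(prf1(truth, pred))  # (0.333..., 0.5, 0.4)
```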

Page 22:

Why doesn't text search (IR) work?

What you search for in real estate advertisements:

Towns. You might think this is easy, but:
Real estate agents' names contain place names: Coldwell Banker, Mosman
Phrases: "Only 45 minutes from Parramatta"
Multiple property ads have different towns

Money: you want a range, not a textual match:
Multiple amounts: was $155K, now $145K
Variations: offers in the high 700s [but not rents for $270]

Bedrooms: similar issues (br, bdr, beds, B/R)

Page 23:

Task: Information Extraction

Information extraction systems:
Find and understand the limited relevant parts of texts
Target clear, factual information (who did what to whom, when?)
Produce a structured representation of the relevant information: relations (in the DB sense)
Combine knowledge about language and a domain
Automatically extract the desired information

E.g.:
Gathering earnings, profits, board members, etc. from company reports
Learning drug-gene product interactions from medical research literature
"Smart Tags" (Microsoft) inside documents

Page 24:

Landscape of IE Techniques (1/1): Models

Any of these models can be used to capture words, formatting, or both. Running example: "Abraham Lincoln was born in Kentucky."

Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?

Classify pre-segmented candidates: a classifier assigns a class to each candidate segment (which class?).

Sliding window: a classifier labels each window of text (which class?), trying alternate window sizes.

Boundary models: classifiers predict BEGIN and END boundaries, which are then paired up.

Context-free grammars: which is the most likely parse? [Figure: parse tree over the example sentence, with POS tags and NP/PP/VP/S constituents.]

Finite state machines: which is the most likely state sequence?
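As a small illustration of the sliding-window idea, here is a sketch in Python; the "classifier" is just a lexicon lookup standing in for a learned model:

```python
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming", "Kentucky"}  # truncated lexicon

def sliding_window_extract(tokens, classify, max_width=3):
    """Slide windows of width 1..max_width over the tokens and keep
    every window the classifier accepts."""
    hits = []
    for width in range(1, max_width + 1):
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if classify(window):
                hits.append(" ".join(window))
    return hits

tokens = "Abraham Lincoln was born in Kentucky .".split()
print(sliding_window_extract(tokens, lambda w: " ".join(w) in US_STATES))  # ['Kentucky']
```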

Page 25:

Aside: What about XML?

Don't XML, RDF, OIL, SHOE, DAML, XSchema, … obviate the need for information extraction?!??!

Yes:
IE is sometimes used to "reverse engineer" HTML database interfaces; extraction would be much simpler if XML were exported instead of HTML.
Ontology-aware editors will make it easier to enrich content with metadata.

No:
Terabytes of legacy HTML.
Data consumers are forced to accept the ontological decisions of data providers (e.g., <NAME>John Smith</NAME> vs. <NAME first="John" last="Smith"/>).
A lot of these pages are PR aimed at humans.
Will you annotate every email you send? Every memo you write? Every photograph you scan?

Page 26:

"Wrappers"

If we think of things from the database point of view:
We want to be able to ask database-style queries
But we have data in some horrid textual form/content management system that doesn't allow such querying
We need to "wrap" the data in a component that understands database-style querying
Hence the term "wrappers"

Many people have "wrapped" many web sites:
Commonly something like a Perl script
Often easy to do as a one-off

But handcoding wrappers in Perl isn't very viable:
Sites are numerous, and their surface structure mutates rapidly (around 10% failures each month)

Page 27:

Amazon Book Description

….</td></tr></table><b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br><font face=verdana,arial,helvetica size=-1>by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a><br></font><br><a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a><font face=verdana,arial,helvetica size=-1><span class="small"><span class="small"><b>List Price:</b> <span class=listprice>$14.95</span><br><b>Our Price: <font color=#990000>$11.96</font></b><br><b>You Save:</b> <font color=#990000><b>$2.99 </b>(20%)</font><br></span><p> <br>…

Page 28:

Extracted Book Template

Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
…

Page 29:

Template Types

Slots in a template are typically filled by a substring from the document.

Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself:
Terrorist act: threatened, attempted, accomplished
Job type: clerical, service, custodial, etc.
Company type: SEC code

Some slots may allow multiple fillers:
Programming language

Some domains may allow multiple extracted templates per document:
Multiple apartment listings in one ad

Page 30:

Wrapper tool-kits

Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand.

Ugh! The links to examples I used in 2003 are all dead now… here's one I found: http://www.cc.gatech.edu/projects/disl/XWRAPElite/elite-home.html

Aging examples:
World Wide Web Wrapper Factory (W4F)
Java Extraction & Dissemination of Information (JEDI)
Junglee Corporation

Survey: http://www.netobjectdays.org/pdf/02/papers/node/0188.pdf

Page 31:

Task: Wrapper Induction

Learning wrappers is "wrapper induction". Sometimes the relations are structural:
Web pages generated by a database
Tables, lists, etc.

Can't computers automatically learn the patterns a human wrapper-writer would use?

Wrapper induction usually targets regular relations which can be expressed by the structure of the document, e.g. "the item in bold in the 3rd column of the table is the price".

Wrapper induction techniques can also learn rules like: if there is a page about a research project X and there is a link near the word "people" to a page that is about a person Y, then Y is a member of the project X. [e.g., Tom Mitchell's Web->KB project]

Page 32:

Wrappers: Simple Extraction Patterns

Specify an item to extract for a slot using a regular expression pattern:
Price pattern: "\b\$\d+(\.\d{2})?\b"

May require a preceding (pre-filler) pattern to identify the proper context. Amazon list price:
Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
Filler pattern: "\$\d+(\.\d{2})?\b"

May require a succeeding (post-filler) pattern to identify the end of the filler. Amazon list price:
Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
Filler pattern: ".+"
Post-filler pattern: "</span>"
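A minimal sketch of the pre-filler/filler/post-filler idea in Python, run against a fragment of the Amazon page shown earlier:

```python
import re

html = ('<b>List Price:</b> <span class=listprice>$14.95</span><br>'
        '<b>Our Price: <font color=#990000>$11.96</font></b>')

pattern = re.compile(r'<b>List Price:</b> <span class=listprice>'  # pre-filler
                     r'(\$\d+(?:\.\d{2})?)\b'                      # filler
                     r'</span>')                                   # post-filler
m = pattern.search(html)
if m:
    print(m.group(1))  # $14.95
```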

Page 33:

Simple Template Extraction

Extract slots in order, starting the search for the filler of the (n+1)th slot where the filler for the nth slot ended. This assumes the slots always occur in a fixed order: Title, Author, List price, …

Alternatively, make the patterns specific enough that each filler can be identified starting from the beginning of the document.
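A minimal sketch of the first, in-order strategy; the slot regexes here are hypothetical, and each search resumes where the previous filler ended:

```python
import re

def extract_in_order(doc, slot_patterns):
    """slot_patterns: ordered (slot_name, regex-with-one-group) pairs."""
    record, pos = {}, 0
    for name, pat in slot_patterns:
        m = re.compile(pat).search(doc, pos)
        if m is None:
            break          # slots are assumed to be in fixed order
        record[name] = m.group(1)
        pos = m.end(1)     # next search starts where this filler ended
    return record

doc = 'Title: The Age of Spiritual Machines Author: Ray Kurzweil List Price: $14.95'
print(extract_in_order(doc, [("title", r'Title: (.+?) Author'),
                             ("author", r'Author: (.+?) List'),
                             ("list_price", r'List Price: (\$\d+\.\d{2})')]))
```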

Page 34:

Pre-Specified Filler Extraction

If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot:
Job category
Company type

Treat each possible value of the slot as a category, and classify the entire document to determine the correct filler.

Page 35:

Wrapper induction

Highly regular source documents → relatively simple extraction patterns → efficient learning algorithms.

Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.

The alternative is to use machine learning:
Build a training set of documents paired with human-produced filled extraction templates.
Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Page 36:

Wrapper induction: Delimiter-based extraction

Use <B>, </B>, <I>, </I> for extraction:

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

Page 37:

Learning LR wrappers

An LR wrapper is a vector of delimiter strings ⟨l1, r1, …, lK, rK⟩: one left and one right delimiter per extracted attribute.

Example: find 4 strings ⟨<B>, </B>, <I>, </I>⟩ = ⟨l1, r1, l2, r2⟩

labeled pages → wrapper:

<HTML><HEAD>Some Country Codes</HEAD><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

(The slide repeats this labeled page four times, once per delimiter being located.)

Page 38:

LR: Finding r1

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r1 can be any prefix of the text following each country name, e.g. </B>

Page 39:

LR: Finding l1, l2 and r2

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r2 can be any prefix of the text following each code, e.g. </I>
l2 can be any suffix of the text preceding each code, e.g. <I>
l1 can be any suffix of the text preceding each country name, e.g. <B>

Page 40:

A problem with LR wrappers

Distracting text in the head and tail:

<HTML><TITLE>Some Country Codes</TITLE><BODY><B>Some Country Codes</B><P><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR><HR><B>End</B></BODY></HTML>

(Here l1 = <B> also matches the page heading "Some Country Codes" and the trailing "End".)

Page 41:

One (of many) solutions: HLRT (Head-Left-Right-Tail) wrappers

Ignore the page's head and tail: besides the left/right delimiters, learn an end-of-head delimiter and a start-of-tail delimiter, and extract only from the body in between.

<HTML><TITLE>Some Country Codes</TITLE><BODY><B>Some Country Codes</B><P><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR><HR><B>End</B></BODY></HTML>

(head = everything up to the end-of-head delimiter; tail = everything from the start-of-tail delimiter on; the records sit in the body between them.)
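A minimal sketch of HLRT extraction over the country-codes page above; the delimiter choices here (<P> as end-of-head, <HR> as start-of-tail) are one consistent possibility:

```python
def hlrt_extract(page, head_end, tail_start, pairs):
    """pairs: one (l_k, r_k) delimiter pair per attribute."""
    body = page.split(head_end, 1)[1].split(tail_start, 1)[0]  # drop head and tail
    tuples, pos = [], 0
    while True:
        row = []
        for l, r in pairs:
            start = body.find(l, pos)
            if start == -1:
                return tuples           # no more records
            start += len(l)
            end = body.find(r, start)
            row.append(body[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))

page = ('<HTML><TITLE>Some Country Codes</TITLE><BODY><B>Some Country Codes</B><P>'
        '<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>'
        '<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>'
        '<HR><B>End</B></BODY></HTML>')
print(hlrt_extract(page, '<P>', '<HR>', [('<B>', '</B>'), ('<I>', '</I>')]))
# [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]
```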

Page 42:

More sophisticated wrappers

LR and HLRT wrappers are extremely simple, though applicable to many tabular patterns.

Recent wrapper induction research has explored more expressive wrapper classes [Muslea et al, Agents-98; Hsu et al, JIS-98; Kushmerick, AAAI-1999; Cohen, AAAI-1999; Minton et al, AAAI-2000]:
Disjunctive delimiters
Multiple attribute orderings
Missing attributes
Multiple-valued attributes
Hierarchically nested data
Wrapper verification and maintenance

Page 43:

Boosted wrapper induction

Wrapper induction is only ideal for rigidly-structured machine-generated HTML…

… or is it?! Can we use simple patterns to extract from natural language documents?

… Name: Dr. Jeffrey D. Hermes …
… Who: Professor Manfred Paul …
… will be given by Dr. R. J. Pangborn …
… Ms. Scott will be speaking …
… Karen Shriver, Dept. of …
… Maria Klawe, University of …

Page 44:

BWI: The basic idea

Learn "wrapper-like" patterns for texts: pattern = exact token sequence.

Learn many such "weak" patterns and combine them with boosting to build a "strong" ensemble pattern. (Boosting is a popular machine learning method in which many weak learners are combined.)

Demo: http://www.smi.ucd.ie/bwi

Not all natural text is sufficiently regular for exact string matching to work well!!

Page 45:

Natural Language Processing-based Information Extraction

If extracting from automatically generated web pages, simple regex patterns usually work.

If extracting from more natural, unstructured, human-written text, some NLP may help:
Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc.
Syntactic parsing: identify phrases (NP, VP, PP)
Semantic word categories (e.g. from WordNet): KILL = {kill, murder, assassinate, strangle, suffocate}

Extraction patterns can then use POS or phrase tags, e.g. for a crime victim:
Pre-filler: [POS: V, Hypernym: KILL]
Filler: [Phrase: NP]
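A minimal sketch of such a pattern with NLTK (assuming its tokenizer and tagger models are downloaded); the WordNet hypernym test is approximated here by the KILL word list from the slide, and the NP chunk grammar is a crude stand-in for real parsing:

```python
import nltk  # assumes punkt and averaged_perceptron_tagger data are installed

KILL = {"kill", "murder", "assassinate", "strangle", "suffocate",
        "killed", "murdered", "assassinated", "strangled", "suffocated"}
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NNP?S?>+}")  # crude NP grammar

def crime_victims(sentence):
    """Pre-filler [POS: V, member of KILL], Filler [Phrase: NP]."""
    tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    victims, prev_was_kill = [], False
    for node in tree:
        if isinstance(node, nltk.Tree) and node.label() == "NP" and prev_was_kill:
            victims.append(" ".join(word for word, tag in node.leaves()))
        prev_was_kill = (not isinstance(node, nltk.Tree)
                         and node[1].startswith("VB") and node[0].lower() in KILL)
    return victims

print(crime_victims("The rebels killed John Smith near the border."))  # ['John Smith']
```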

Page 46:

MUC: the NLP genesis of IE

DARPA funded significant efforts in IE in the early-to-mid 1990s.

The Message Understanding Conference (MUC) was an annual event/competition where results were presented.

It focused on extracting information from news articles:
Terrorist events
Industrial joint ventures
Company management changes

Information extraction is of particular interest to the intelligence community.

Page 47:

Example of IE from FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.

TIE-UP-1
Relationship: TIE-UP
Entities: "Bridgestone Sports Co.", "a local concern", "a Japanese trading house"
Joint Venture Company: "Bridgestone Sports Taiwan Co."
Activity: ACTIVITY-1
Amount: NT$200000000

ACTIVITY-1
Activity: PRODUCTION
Company: "Bridgestone Sports Taiwan Co."
Product: "iron and 'metal wood' clubs"
Start Date: DURING: January 1990

Page 48:

Example of IE: FASTUS (1993)

(Same text and templates as the previous slide.)

Page 49:

FASTUS

Based on finite state automata (FSA) transductions, applied in a cascade:

1. Complex words: recognition of multi-words and proper names (e.g. "set up", "new Taiwan dollars")

2. Basic phrases: simple noun groups, verb groups and particles (e.g. "a Japanese trading house", "had set up")

3. Complex phrases: complex noun groups and verb groups (e.g. "production of 20,000 iron and metal wood clubs")

4. Domain events: patterns for events of interest to the application, from which basic templates are built (e.g. [company] [set up] [Joint-Venture] with [company])

5. Merging structures: templates from different parts of the texts are merged if they provide information about the same entity or event.

Page 50:

Grep++ = cascaded grepping

[Figure: a finite automaton for noun groups, with states 0-4 and transitions labeled PN, 's, Art, ADJ, N, P; it accepts e.g. "John's interesting book with a nice cover".]

Page 51:

Rule-based Extraction Examples

Determining which person holds what office in what organization:
[person], [office] of [org]
Vuk Draskovic, leader of the Serbian Renewal Movement
[org] (named, appointed, etc.) [person] P [office]
NATO appointed Wesley Clark as Commander in Chief

Determining where an organization is located:
[org] in [loc]
NATO headquarters in Brussels
[org] [loc] (division, branch, headquarters, etc.)
KFOR Kosovo headquarters

Page 52:

Finite State Machines

Page 53:

Hidden Markov Models

[Graphical model: a chain of states … S_{t-1}, S_t, S_{t+1} …, each emitting an observation … O_{t-1}, O_t, O_{t+1} …]

Finite state model / graphical model.

Parameters, for all states S = {s_1, s_2, …}:
Start state probabilities: P(s_1)
Transition probabilities: P(s_t | s_{t-1})
Observation (emission) probabilities: P(o_t | s_t)

Training: maximize the probability of the training observations (w/ prior).

$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$

Generates a state sequence (via transitions) and an observation sequence o_1 o_2 … o_8; each emission is usually a multinomial over an atomic, fixed alphabet.

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

Page 54:

IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM (with, e.g., person name, location name, and background states), find the most likely state sequence (Viterbi):

$\vec{s}^{\,*} = \arg\max_{\vec{s}} P(\vec{s}, \vec{o})$

Any words said to be generated by the designated "person name" state are extracted as a person name:

Person name: Pedro Domingos
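A minimal Viterbi sketch for this setup; the three states follow the slide, but every probability below is made up purely for illustration:

```python
import math

STATES = ["background", "person name", "location name"]
START = {"background": 0.8, "person name": 0.1, "location name": 0.1}
TRANS = {"background":    {"background": 0.8, "person name": 0.1, "location name": 0.1},
         "person name":   {"background": 0.6, "person name": 0.4, "location name": 0.0},
         "location name": {"background": 0.7, "person name": 0.0, "location name": 0.3}}

def emit(state, word):
    """Toy emission model: name states strongly prefer capitalized words."""
    cap = word[0].isupper()
    if state == "person name":
        return 0.3 if cap else 0.01
    if state == "location name":
        return 0.1 if cap else 0.01
    return 0.05 if cap else 0.2

def viterbi(words):
    # V[s] = (log-probability of the best path ending in state s, that path)
    V = {s: (math.log(START[s] * emit(s, words[0])), [s]) for s in STATES}
    for w in words[1:]:
        V = {s: max(((lp + math.log(max(TRANS[p][s], 1e-12) * emit(s, w)), path + [s])
                     for p, (lp, path) in V.items()), key=lambda x: x[0])
             for s in STATES}
    return max(V.values(), key=lambda x: x[0])[1]

words = "Yesterday Pedro Domingos spoke this example sentence .".split()
best = viterbi(words)
print([w for w, s in zip(words, best) if s == "person name"])  # ['Pedro', 'Domingos']
```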

Page 55:

HMM Example: "Nymble" [Bikel et al 1998], [BBN "IdentiFinder"]

Task: named entity extraction. Train on ~500k words of news wire text.

States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).

Results:

Case   Language   F1
Mixed  English    93%
Upper  English    91%
Mixed  Spanish    90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]

Page 56:

We Want More than an Atomic View of Words

Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g. for the token "Wisniewski" (part of a noun phrase, ends in "-ski"):

identity of word
ends in "-ski"
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in hyperlink anchor
last person name was female
next two words are "and Associates"

Page 57:

Problems with Richer Representation and a Generative Model

These arbitrary features are not independent:
Multiple levels of granularity (chars, words, phrases)
Multiple dependent modalities (words, formatting, layout)
Past & future

Two choices:

Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!

Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Page 58:

Conditional Sequence Models

We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).

Such a model can examine features without being responsible for generating them, doesn't have to explicitly model their dependencies, and doesn't "waste modeling effort" trying to generate what we are given at test time anyway.

Page 59:

Conditional Markov Models (CMMs) vs HMMs

HMM (joint):

$P(\vec{s}, \vec{o}) = \prod_i P(s_i \mid s_{i-1}) \, P(o_i \mid s_i)$

CMM (conditional):

$P(\vec{s} \mid \vec{o}) = \prod_i P(s_i \mid s_{i-1}, o_i)$

Lots of ML ways to estimate Pr(y | x).

Page 60:

From HMMs to CRFs [Lafferty, McCallum, Pereira 2001]

$\vec{s} = s_1, s_2, \ldots, s_n$; $\vec{o} = o_1, o_2, \ldots, o_n$

Joint:

$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$

Conditional:

$P(\vec{s} \mid \vec{o}) = \frac{P(\vec{s}, \vec{o})}{P(\vec{o})} = \frac{1}{P(\vec{o})} \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$

Replacing the conditional probabilities with general potential functions gives a conditional finite state sequence model [McCallum, Freitag & Pereira, 2000], a super-special case of Conditional Random Fields:

$P(\vec{s} \mid \vec{o}) = \frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t)$

where, e.g., $\Phi_o(t) = \exp\big(\sum_k \lambda_k f_k(o_t, t)\big)$, with arbitrary features of s, o, and t.

Page 61:

Feature Functions

Example feature function $f_k(s_t, s_{t-1}, \vec{o}, t)$:

$f_{\mathrm{Capitalized}, s_i, s_j}(s_t, s_{t-1}, \vec{o}, t) = \begin{cases} 1 & \text{if } \mathrm{Capitalized}(o_t) \wedge s_{t-1} = s_i \wedge s_t = s_j \\ 0 & \text{otherwise} \end{cases}$

$\vec{o} = o_1 o_2 \ldots o_7$ = "Yesterday Pedro Domingos spoke this example sentence." with states $s_1, \ldots, s_4$.

For example, at $t = 2$ ("Pedro" is capitalized): $f_{\mathrm{Capitalized}, s_1, s_3}(s_3, s_1, \vec{o}, 2) = 1$.
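The same feature as a minimal Python sketch (0-indexed tokens; the state names are arbitrary placeholders):

```python
def f_capitalized(s_t, s_prev, o, t, s_i="s1", s_j="s3"):
    """1 if o[t] is capitalized and the transition is s_i -> s_j, else 0."""
    return int(o[t][0].isupper() and s_prev == s_i and s_t == s_j)

o = "Yesterday Pedro Domingos spoke this example sentence .".split()
print(f_capitalized("s3", "s1", o, 1))  # 1: "Pedro" (t = 1, 0-indexed) is capitalized
```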

Page 62:

Learning Parameters of CRFs

Maximize the log-likelihood of the parameters $\Lambda = \{\lambda_k\}$ given training data $D$:

$L_\Lambda = \sum_{\langle \vec{o}, \vec{s} \rangle \in D} \log \Bigg( \frac{1}{Z(\vec{o})} \exp \Big( \sum_{t=1}^{|\vec{o}|} \sum_k \lambda_k f_k(s_t, s_{t-1}, \vec{o}, t) \Big) \Bigg) - \sum_k \frac{\lambda_k^2}{2\sigma^2}$

Log-likelihood gradient (empirical feature counts minus expected feature counts, minus the prior term):

$\frac{\partial L}{\partial \lambda_k} = \sum_{\langle \vec{o}, \vec{s} \rangle \in D} \#_k(\vec{s}, \vec{o}) \; - \sum_{\langle \vec{o}, \cdot \rangle \in D} \sum_{\vec{s}'} P_\Lambda(\vec{s}' \mid \vec{o}) \, \#_k(\vec{s}', \vec{o}) \; - \; \frac{\lambda_k}{\sigma^2}$

where $\#_k(\vec{s}, \vec{o}) = \sum_t f_k(s_t, s_{t-1}, \vec{o}, t)$.

Methods:
iterative scaling (quite slow: ~2000 iterations from a good start)
gradient, conjugate gradient (faster)
limited-memory quasi-Newton methods ("super fast") [Sha & Pereira 2002] & [Malouf 2002]

Page 63:

Voted Perceptron Sequence Models [Collins 2001; also Hofmann 2003, Taskar et al 2003]

Given training data $D = \{\langle \vec{o}^{(i)}, \vec{s}^{(i)} \rangle\}$:

Initialize parameters to zero: $\lambda_k = 0$.

Iterate to convergence: for all training instances $i$,

$\vec{s}^{\,\mathrm{Viterbi}} = \arg\max_{\vec{s}} \exp \Big( \sum_t \sum_k \lambda_k f_k(s_t, s_{t-1}, \vec{o}^{(i)}, t) \Big)$

$\lambda_k \leftarrow \lambda_k + C_k(\vec{s}^{(i)}, \vec{o}^{(i)}) - C_k(\vec{s}^{\,\mathrm{Viterbi}}, \vec{o}^{(i)})$

where $C_k(\vec{s}, \vec{o}) = \sum_t f_k(s_t, s_{t-1}, \vec{o}, t)$ as before. (The update is analogous to the gradient for this one training instance.)

Avoids the tricky math; very fast; uses "pseudo-negative" examples of sequences; approximates a margin classifier for "good" vs "bad" sequences.
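A minimal sketch of the core (unaveraged) perceptron update, with hypothetical binary features; exhaustive search over the tiny state space stands in for Viterbi decoding. The voted/averaged refinement in [Collins 2001] additionally averages the weights across updates, which this sketch omits:

```python
from itertools import product

STATES = ["bg", "person"]

def features(s, o):
    """Global feature counts C_k(s, o) = sum_t f_k(s_t, s_{t-1}, o, t)."""
    counts = {}
    for t, (word, state) in enumerate(zip(o, s)):
        prev = s[t - 1] if t > 0 else "START"
        for k in [("cap", state, word[0].isupper()),
                  ("word", state, word.lower()),
                  ("trans", prev, state)]:
            counts[k] = counts.get(k, 0) + 1
    return counts

def score(w, s, o):
    return sum(w.get(k, 0.0) * v for k, v in features(s, o).items())

def train(data, epochs=10):
    w = {}
    for _ in range(epochs):
        for o, s_true in data:
            s_pred = max(product(STATES, repeat=len(o)), key=lambda s: score(w, s, o))
            if list(s_pred) != s_true:
                for k, v in features(s_true, o).items():  # promote the true sequence
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(s_pred, o).items():  # demote the predicted sequence
                    w[k] = w.get(k, 0.0) - v
    return w

data = [("Yesterday Pedro Domingos spoke".split(), ["bg", "person", "person", "bg"])]
w = train(data)
print(max(product(STATES, repeat=4), key=lambda s: score(w, s, data[0][0])))
# ('bg', 'person', 'person', 'bg')
```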

Page 64:

Broader Issues in IE

Page 65:

Broader View

Up to now we have been focused on segmentation and classification.

[Pipeline: Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query/Search, Data mine; supporting steps: create ontology, label training data, train extraction models.]

Page 66:

Broader View

Now touch on some other issues.

[The same pipeline, with IE expanded to Tokenize → Segment → Classify → Associate → Cluster; the numbers 1-5 mark the issues discussed below, e.g. (1) association, (2) deduplication/integration, (5) working with IE data.]

Page 67:

(1) Association as Binary Classification [Zelenko et al, 2002]

Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair.
(Person, Person, Role)

Person-Role (Christos Faloutsos, KDD 2003 General Chair) → NO
Person-Role (Ted Senator, KDD 2003 General Chair) → YES

Do this with SVMs and tree kernels over parse trees.

Page 68:

(1) Association with Finite State Machines [Ray & Craven, 2001]

… This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. …

this/DET enzyme/N ubc6/N localizes/V to/PREP the/ART endoplasmic/ADJ reticulum/N with/PREP the/ART catalytic/ADJ domain/N facing/V the/ART cytosol/N

→ Subcellular-localization (UBC6, endoplasmic reticulum)

Page 69:

(1) Association with Graphical Models [Roth & Yih 2002]

Capture arbitrary-distance dependencies among predictions.

[Figure: local language models contribute evidence to entity classification (a random variable over the class of each entity, e.g. {person, location, …}) and to relation classification (a random variable over the class of the relation between entity #2 and entity #1, e.g. {lives-in, is-boss-of, …}); there are dependencies between the classes of entities and relations. Inference with loopy belief propagation.]

Page 70:

(1) Association with Graphical Models [Roth & Yih 2002]

(Same model as the previous slide; in the figure one entity is labeled "person", the relation is "lives-in", and the other entity is still unresolved: "person?".)

Page 71:

(1) Association with Graphical Models [Roth & Yih 2002]

(Same model; the remaining entity resolves to "location": person lives-in location.)

Page 72:

Broader View

[The same pipeline as before; next, issue (2).] When do two extracted strings refer to the same object?

Page 73:

(2) Learning a Distance Metric Between Records [Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003]

Learn Pr({duplicate, not-duplicate} | record1, record2) with a maximum entropy classifier.

Do greedy agglomerative clustering using this probability as a distance metric.

Page 74:

(2) String Edit Distance

distance("William Cohen", "Willliam Cohon")

alignment:

s:    W I L L - I A M _ C O H E N
t:    W I L L L I A M _ C O H O N
op:   C C C C I C C C C C C C S C
cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2

(ops: C = copy, I = insert, S = substitute; the cost row is cumulative.)

Page 75:

(2) Computing String Edit Distance

D(i,j) = min of:
D(i-1,j-1) + d(s_i, t_j)   // subst/copy
D(i-1,j) + 1               // insert
D(i,j-1) + 1               // delete

      C  O  H  E  N
M     1  2  3  4  5
C     1  2  3  4  5
C     2  3  3  4  5
O     3  2  3  4  5
H     4  3  2  3  4
N     5  4  3  3  3

A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (there may be more than one).

Learned edit distance: learn these parameters (the operation costs).
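A minimal sketch of this dynamic program with unit costs; in learned edit distance, these costs would be the tuned parameters:

```python
def edit_distance(s, t, sub_cost=1, ins_cost=1, del_cost=1):
    """D[i][j] = cheapest way to turn s[:i] into t[:j]."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i * del_cost
    for j in range(1, len(t) + 1):
        D[0][j] = j * ins_cost
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(D[i-1][j-1] + (0 if s[i-1] == t[j-1] else sub_cost),  # subst/copy
                          D[i-1][j] + del_cost,
                          D[i][j-1] + ins_cost)
    return D[len(s)][len(t)]

print(edit_distance("William Cohen", "Willliam Cohon"))  # 2: one insert, one substitution
```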

Page 76:

(2) String Edit Distance Learning

[Figure: precision/recall curves for duplicate detection on the MAILING dataset.] [Bilenko & Mooney, 2002, 2003]

Page 77:

(2) Information Integration

Goal might be to merge the results of two IE systems:

Name: Introduction to Computer Science
Number: CS 101
Teacher: M. A. Kludge
Time: 9-11am

Name: Data Structures in Java
Room: 5032 Wean Hall

Title: Intro. to Comp. Sci.
Num: 101
Dept: Computer Science
Teacher: Dr. Klüdge
TA: John Smith
Topic: Java Programming
Start time: 9:10 AM

[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001], [Richardson & Domingos 2003]

Page 78:

(2) Other Information Integration Issues

Distance metrics for text: which work well? [Cohen, Ravikumar, Fienberg, 2003]

Finessing integration by soft database operations based on similarity [Cohen, 2000]

Integration of complex structured databases: capture dependencies among multiple merges [Cohen, McAllester, Kautz KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]

Page 79:

Broader View

[The same pipeline as before; next, issue (5).]

Page 80:

(5) Working with IE Data

Some special properties of IE data:
It is based on extracted text
It is "dirty": missing and extraneous facts, improperly normalized entity names, etc.
It may need cleaning before use

What operations can be done on dirty, unnormalized databases?
Data mine it directly.
Query it directly with a language that has "soft joins" across similar, but not identical, keys [Cohen 1998]
Use it to construct features for learners [Cohen 2000]
Infer a "best" underlying clean database [Cohen, Kautz, McAllester, KDD 2000]

Page 81:

Evaluating IE Accuracy

Always evaluate performance on independent, manually-annotated test data not used during system development.

Template-level measure, for each test document:
Total number of correct extractions in the solution template: N
Total number of slot/value pairs extracted by the system: E
Number of extracted slot/value pairs that are correct (i.e. in the solution template): C

Compute the average value of metrics adapted from IR:
Recall = C/N
Precision = C/E
F-measure = harmonic mean of recall and precision

(Note the subtle difference from the segment-level metrics earlier: here the counts are over slot/value pairs rather than text segments.)

Page 82:

MUC Information Extraction: State of the Art c. 1997

[Score chart not reproduced in this transcript.] Legend:
NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production

Page 83:

Summary and prelude

We've looked at the "fragment extraction" task. Future?
Top-down semantic constraints (as well as syntax)?
Unified framework for extraction from regular & natural text? (BWI is one tiny step; Webfoot [Soderland 1999] is another.)

Beyond fragment extraction:
Anaphora resolution, discourse processing, …
Fragment extraction is good enough for many Web information services!

Next time: learning methods for information extraction

Page 84:

Three generations of IE systems

Hand-built systems – knowledge engineering [1980s– ]
Rules written by hand
Require experts who understand both the systems and the domain
Iterative guess-test-tweak-repeat cycle

Automatic, trainable rule-extraction systems [1990s– ]
Rules discovered automatically using predefined templates, using methods like ILP
Require huge, labeled corpora (the effort is just moved!)

Machine learning (sequence) models [1997– ]
One decodes a statistical model that classifies the words of the text, using HMMs, random fields or statistical parsers
Learning is usually supervised; may be partially unsupervised

Page 85:

Basic IE References

Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/

Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. 1997. Wrapper Induction for Information Extraction. IJCAI 1997. http://www.cs.ucd.ie/staff/nick/

Stephen Soderland. 1999. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272.