
1

Information Extraction

(Several slides based on those by Ray Mooney and Cohen/McCallum, via Dan Weld's class)

Make-up Class: Tomorrow (Wed) 10:30—11:45AM BY 210 (next to the advising office)

2

Intended Use of the Semantic Web?

• Pages should be annotated with RDF triples, with links to an RDF-S (or OWL) background ontology.
• E.g. see Jim Hendler's page…

3

Database vs. Semantic Web Inference (and the Magellan Story)

• Also templated extraction as undoing the XML-to-HTML conversion. Templated extraction works by DOM-tree patterns; unstructured extraction works (roughly) by grammar parse-tree patterns. Grammar learning is mostly from positive examples.

To be added (Rinku Patel)

4

Who will annotate the data?

• The Semantic Web works if users annotate their pages using some existing ontology (or their own ontology, but with mappings to other ontologies)
  – But users typically do not conform to standards..
  – and are not patient enough for delayed gratification…
• Three solutions:
  – 1. Intercede in the way pages are created (act as if you are helping them write web pages)
    • What if we change MS FrontPage/Claris Home Page so that they (slyly) add annotations?
    • E.g. the Mangrove project at U. Washington
      – Help users tag their data (allow graphical editing)
      – Provide instant gratification by running services that use the tags.
  – 2. Collaborative tagging!
    • "Folksonomies" (see the Wikipedia article)
      – Flickr, Technorati, del.icio.us, etc.
      – CBioC, ESP game, etc.
    • Need to incentivize users to do the annotations..
  – 3. Automated information extraction (next topic)

5

Folksonomies—The good

• Bottom-up approach to taxonomies/ontologies
  – [In systems like] Furl, Flickr and Del.icio.us, people classify their pictures/bookmarks/web pages with tags (e.g. "wedding"), and then the most popular tags float to the top (e.g. Flickr's tags or Del.icio.us on the right)...
  – [F]olksonomies can work well for certain kinds of information because they offer a small reward for using one of the popular categories (such as your photo appearing on a popular page). People who enjoy the social aspects of the system will gravitate to popular categories while still having the freedom to keep their own lists of tags.

6

Works best when many people tag the same info…

7

Folksonomies… the bad

• On the other hand, it's not hard to see a few reasons why a folksonomy would be less than ideal in a lot of cases:
  – None of the current implementations have synonym control (e.g. "selfportrait" and "me" are distinct Flickr tags, as are "mac" and "macintosh" on Del.icio.us).
  – Also, there's a certain lack of precision involved in using simple one-word tags, like which Lance are we talking about?
  – And, of course, there's no hierarchy, and the content types (bookmarks, photos) are fairly simple.
• For indexing and library people, folksonomies are about as appealing as Wikipedia is to encyclopedia editors.
  – But... there's some interesting stuff happening around them.

8

Mass Collaboration (& Mice Running the Earth)

• The quality of the tags generated through folksonomies is notoriously hard to control
  – So, design mechanisms that ensure correctness of tags..
    • The ESP game makes it fun to tag
    • CBioC and Google Co-op restrict annotation privileges to trusted users..
• It is hard to get people to tag things in which they don't have personal interest..
  – Find incentive structures..
    • ESP makes it a "game" with points
    • CBioC and Google Co-op promise delayed gratification in terms of improved search later..

9

Who will annotate the data?

• The Semantic Web works if users annotate their pages using some existing ontology (or their own ontology, but with mappings to other ontologies)
  – But users typically do not conform to standards..
  – and are not patient enough for delayed gratification…
• Three solutions:
  – 1. Intercede in the way pages are created (act as if you are helping them write web pages)
    • What if we change MS FrontPage/Claris Home Page so that they (slyly) add annotations?
    • E.g. the Mangrove project at U. Washington
      – Help users tag their data (allow graphical editing)
      – Provide instant gratification by running services that use the tags.
  – 2. Collaborative tagging!
    • "Folksonomies" (see the Wikipedia article)
      – Flickr, Technorati, del.icio.us, etc.
      – CBioC, ESP game, etc.
    • Need to incentivize users to do the annotations..
  – 3. Automated information extraction (the next topic)

10

Information Extraction (IE)

• Identify specific pieces of information (data) in an unstructured or semi-structured textual document.

• Transform unstructured information in a corpus of documents or web pages into a structured database.

• Applied to different types of text:
  – Newspaper articles
  – Web pages
  – Scientific articles
  – Newsgroup messages
  – Classified ads
  – Medical notes
  – Wikipedia (infoboxes)..

11

Information Extraction vs. NLP?

• Information extraction attempts to find some of the structure and meaning in (hopefully template-driven) web pages.

• As IE becomes more ambitious and the text becomes more free-form, IE ultimately becomes equal to NLP.

• The Web does give one particular boost to NLP:
  – Massive corpora..

12

MUC

• DARPA funded significant efforts in IE in the early to mid 1990’s.

• Message Understanding Conference (MUC) was an annual event/competition where results were presented.

• Focused on extracting information from news articles:
  – Terrorist events
  – Industrial joint ventures
  – Company management changes

• Information extraction of particular interest to the intelligence community (CIA, NSA).

13

Other Applications

• Job postings:
  – Newsgroups: Rapier from austin.jobs
  – Web pages: Flipdog

• Job resumes:
  – BurningGlass
  – Mohomine

• Seminar announcements
• Company information from the web
• Continuing education course info from the web
• University information from the web
• Apartment rental ads
• Molecular biology information from MEDLINE

14

Wikipedia Infoboxes..

• Wikipedia has both unstructured text and structured info boxes..

Infobox

15

Subject: US-TN-SOFTWARE PROGRAMMERDate: 17 Nov 1996 17:37:29 GMTOrganization: Reference.Com Posting ServiceMessage-ID: <[email protected]>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:Kim AndersonAdNET(901) 458-2888 [email protected]


Sample Job Posting

16

Extracted Job Template

computer_science_job
  id: [email protected]
  title: SOFTWARE PROGRAMMER
  salary:
  company:
  recruiter:
  state: TN
  city:
  country: US
  language: C
  platform: PC \ DOS \ OS-2 \ UNIX
  application:
  area: Voice Mail
  req_years_experience: 2
  desired_years_experience: 5
  req_degree:
  desired_degree:
  post_date: 17 Nov 1996

17

Amazon Book Description

….</td></tr></table><b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br><font face=verdana,arial,helvetica size=-1>by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a><br></font><br><a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a><font face=verdana,arial,helvetica size=-1><span class="small"><span class="small"><b>List Price:</b> <span class=listprice>$14.95</span><br><b>Our Price: <font color=#990000>$11.96</font></b><br><b>You Save:</b> <font color=#990000><b>$2.99 </b>(20%)</font><br></span><p> <br>


18

Extracted Book Template

Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
…

19

Extraction from Templated Text

• Many web pages are generated automatically from an underlying database.

• Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).

• However, output is intended for human consumption, not machine interpretation.

• An IE system for such generated pages allows the web site to be viewed as a structured database.

• An extractor for a semi-structured web site is sometimes referred to as a wrapper.

• Process of extracting from such pages is sometimes referred to as screen scraping.

20

Templated Extraction using DOM Trees

• Web extraction may be aided by first parsing web pages into DOM trees.

• Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.

• May still need regex patterns to identify proper portion of the final CharacterData node.
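A minimal sketch of this kind of DOM-path extraction, assuming lxml is available; the markup is a pared-down version of the Amazon snippet shown earlier, and the XPath expressions are illustrative rather than the paths of any particular production wrapper.

```python
import re
from lxml import html

page = ('<html><body><b class="sans">The Age of Spiritual Machines : '
        'When Computers Exceed Human Intelligence</b>'
        '<font><span class="listprice">$14.95</span></font></body></html>')

tree = html.fromstring(page)

# Extraction pattern = path from the root of the DOM tree to the text node.
title = tree.xpath('//b[@class="sans"]/text()')[0]

# A regex is still needed to pick the price out of the final CharacterData node.
price_node = tree.xpath('//span[@class="listprice"]/text()')[0]
price = re.search(r'\$\d+(\.\d{2})?', price_node).group(0)

print(title)   # The Age of Spiritual Machines : ...
print(price)   # $14.95
```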

21

Sample DOM Tree Extraction

[DOM tree figure: HTML → BODY, which branches to B → CharacterData "Age of Spiritual Machines" and to FONT → "by" + A → CharacterData "Ray Kurzweil"; the legend distinguishes Element nodes from CharacterData nodes.]

Title extraction path:  HTML → BODY → B → CharacterData
Author extraction path: HTML → BODY → FONT → A → CharacterData

22

Template Types

• Slots in template typically filled by a substring from the document.

• Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
  – Terrorist act: threatened, attempted, accomplished.
  – Job type: clerical, service, custodial, etc.
  – Company type: SEC code

• Some slots may allow multiple fillers.
  – Programming languages

• Some domains may allow multiple extracted templates per document.
  – Multiple apartment listings in one ad

23

Simple Extraction Patterns

• Specify an item to extract for a slot using a regular expression pattern.
  – Price pattern: "\b\$\d+(\.\d{2})?\b"

• May require a preceding (pre-filler) pattern to identify the proper context.
  – Amazon list price:
    • Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
    • Filler pattern: "\$\d+(\.\d{2})?\b"

• May require a succeeding (post-filler) pattern to identify the end of the filler.
  – Amazon list price:
    • Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
    • Filler pattern: ".+"
    • Post-filler pattern: "</span>"
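A minimal sketch of the pre-filler/filler/post-filler idea with Python's re module, using the Amazon-style snippet above; the pattern names mirror the slide, everything else is illustrative.

```python
import re

page = '<b>List Price:</b> <span class=listprice>$14.95</span><br>'

# The pre-filler pattern anchors the context, the filler captures the value,
# and the post-filler pattern marks where the filler ends.
pre_filler  = r'<b>List Price:</b>\s*<span class=listprice>'
filler      = r'(?P<list_price>\$\d+(?:\.\d{2})?)'
post_filler = r'</span>'

match = re.search(pre_filler + filler + post_filler, page)
if match:
    print(match.group('list_price'))   # $14.95
```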

24

Simple Template Extraction

• Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended (see the sketch below). Assumes slots always appear in a fixed order:
  – Title
  – Author
  – List price
  – …

• Alternatively, make patterns specific enough to identify each filler while always starting from the beginning of the document.
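A minimal sketch of the in-order strategy, with hypothetical slot patterns; each search resumes where the previous filler ended.

```python
import re

# Hypothetical slot patterns, listed in the fixed order they appear on the page.
slot_patterns = [
    ("title",      re.compile(r'<b class="sans">(.+?)</b>')),
    ("author",     re.compile(r'field-author=[^>]*>(.+?)</a>')),
    ("list_price", re.compile(r'<span class=listprice>(\$[\d.]+)</span>')),
]

def extract_in_order(page: str) -> dict:
    """Search for slot n+1 starting where the filler for slot n ended."""
    record, pos = {}, 0
    for slot, pattern in slot_patterns:
        match = pattern.search(page, pos)
        if match:
            record[slot] = match.group(1)
            pos = match.end()
    return record
```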

25

Pre-Specified Filler Extraction

• If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
  – Job category
  – Company type

• Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
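A minimal sketch of filling a pre-specified slot by whole-document text categorization, using scikit-learn with a made-up three-document training set.

```python
# Treat each allowed filler as a class and classify the whole document.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "Senior C programmer needed for voice mail systems",
    "Receptionist wanted, filing and phone duties",
    "Custodian position, evening shift, light maintenance",
]
train_labels = ["software", "clerical", "custodial"]   # the pre-specified fillers

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["Looking for a programmer with DOS and UNIX experience"])[0])
```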

26

Learning for IE

• Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.

• The alternative is to use machine learning:
  – Build a training set of documents paired with human-produced filled extraction templates.
  – Learn extraction patterns for each slot using an appropriate machine learning algorithm.

37

Information Extraction from unstructured text

39

Information Extraction from Unstructured Text:

• The Semantic Web needs:
  – Tagged data
  – Background knowledge

• (Blue sky approaches to) automate both:
  – Knowledge extraction
    • Extract base level knowledge ("facts") directly from the web
  – Automated tagging
    • Start with a background ontology and tag other web pages
      – SemTag/Seeker

40

Fielded IE Systems: CiteSeer, Google Scholar, Libra. How do they do it? Why do they fail?

What is “Information Extraction”

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Slides from Cohen & McCallum

What is “Information Extraction”

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

IE

Slides from Cohen & McCallum

What is “Information Extraction”

As a family of techniques:
Information Extraction = segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation

Slides from Cohen & McCallum

What is “Information Extraction”

As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation

Slides from Cohen & McCallum


What is “Information Extraction”

As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation | CEO | Bill Gates | Microsoft | Gates | Microsoft | Bill Veghte | Microsoft | VP | Richard Stallman | founder | Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Slides from Cohen & McCallum

IE in Context

[Pipeline figure: spider a document collection → filter by relevance → IE (segment, classify, associate, cluster) → load DB → database → query/search and data mining. Supporting steps: create ontology, label training data, train extraction models.]

Slides from Cohen & McCallum

IE History

Pre-Web
• Mostly news articles
  – De Jong's FRUMP [1982]
    • Hand-built system to fill Schank-style "scripts" from news wire
  – Message Understanding Conference (MUC) DARPA ['87-'95], TIPSTER ['92-'96]
• Most early work dominated by hand-built models
  – E.g. SRI's FASTUS, hand-built FSMs.
  – But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]

Web
• AAAI '94 Spring Symposium on "Software Agents"
  – Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni.
• Tom Mitchell's WebKB, '96
  – Build KBs from the Web.
• Wrapper induction
  – First by hand, then ML: [Doorenbos '96], [Soderland '96], [Kushmerick '97], …

Slides from Cohen & McCallum

www.apple.com/retail

What makes IE from the Web Different?
Less grammar, but more formatting & linking

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Apple to Open Its First Retail Storein New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

www.apple.com/retail/soho

www.apple.com/retail/soho/theatre.html

Newswire Web

Slides from Cohen & McCallum

Landscape of IE Tasks (1/4): Pattern Feature Domain

Text paragraphs without formatting → grammatical sentences with some formatting & links → non-grammatical snippets with rich formatting & links → tables

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Slides from Cohen & McCallum

Landscape of IE Tasks (2/4): Pattern Scope

Web site specific (formatting), e.g. Amazon.com book pages → genre specific (layout), e.g. resumes → wide, non-specific (language), e.g. university names

Slides from Cohen & McCallum

Landscape of IE Tasks (3/4): Pattern Complexity

E.g. word patterns:
• Closed set (e.g. U.S. states): "He was born in Alabama…" / "The big Wyoming sky…"
• Regular set (e.g. U.S. phone numbers): "Phone: (413) 545-1323" / "The CALD main office can be reached at 412-268-1299"
• Complex pattern (e.g. U.S. postal addresses): "University of Arkansas, P.O. Box 140, Hope, AR 71802" / "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
• Ambiguous patterns, needing context + many sources of evidence (e.g. person names): "…was among the six houses sold by Hope Feldman that year." / "Pawel Opalinski, Software Engineer at WhizBang Labs."

Slides from Cohen & McCallum

Landscape of IE Tasks (4/4): Pattern Combinations

"Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."

• Single entity ("named entity" extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

• Binary relationship:
  Relation: Person-Title (Person: Jack Welch, Title: CEO)
  Relation: Company-Location (Company: General Electric, Location: Connecticut)

• N-ary record:
  Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)

Slides from Cohen & McCallum

Evaluation of Single Entity Extraction

TRUTH: Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke. (4 true segments)

PRED:  Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke. (6 predicted segments, 2 of them correct)

Precision = # correctly predicted segments / # predicted segments = 2/6

Recall = # correctly predicted segments / # true segments = 2/4

F1 = harmonic mean of precision & recall = 1 / (((1/P) + (1/R)) / 2) = 2PR / (P + R)

Slides from Cohen & McCallum
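A minimal sketch of segment-level precision/recall/F1; the (start, end) spans are made up so the counts match the slide's 2/6 and 2/4.

```python
# Segments as (start, end) token offsets; values chosen to reproduce
# the slide's counts (6 predicted, 4 true, 2 exactly correct).
true_segments = {(0, 2), (3, 5), (12, 15), (16, 18)}
pred_segments = {(0, 2), (1, 2), (3, 4), (4, 5), (12, 15), (13, 15)}

correct = len(true_segments & pred_segments)     # exact-match segments
precision = correct / len(pred_segments)         # 2 / 6
recall = correct / len(true_segments)            # 2 / 4
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))
```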

State of the Art Performance

• Named entity recognition
  – Person, Location, Organization, …
  – F1 in the high 80's or low- to mid-90's

• Binary relation extraction
  – Contained-in (Location1, Location2), Member-of (Person1, Organization1)
  – F1 in the 60's, 70's, or 80's

• Wrapper induction
  – Extremely accurate performance obtainable
  – Human effort (~30 min) required on each site

Slides from Cohen & McCallum

Landscape of IE Techniques (1/1): Models

Any of these models can be used to capture words, formatting, or both. (Running example: "Abraham Lincoln was born in Kentucky.")

• Lexicons: is a candidate (e.g. "Kentucky") a member of a list such as Alabama, Alaska, …, Wisconsin, Wyoming?
• Classify pre-segmented candidates: a classifier decides which class each candidate segment belongs to.
• Sliding window: run a classifier over each window of text, trying alternate window sizes.
• Boundary models: classifiers detect BEGIN and END positions, which are then paired.
• Context-free grammars: find the most likely parse (NNP NNP V V P NNP grouped into NP, PP, VP, S).
• Finite state machines: find the most likely state sequence.
…and beyond

Slides from Cohen & McCallum

Sliding Windows

Slides from Cohen & McCallum

Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g. looking for the seminar location

Slides from Cohen & McCallum


A "Naïve Bayes" Sliding Window Model [Freitag 1997]

"… 00 : pm  Place : Wean Hall Rm 5409  Speaker : Sebastian Thrun …"
  w_{t-m} … w_{t-1} [ w_t … w_{t+n} ] w_{t+n+1} … w_{t+n+m}
     prefix              contents               suffix

If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.

Other examples of sliding windows: [Baluja et al 2000] (decision tree over individual words & their context)

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

Slides from Cohen & McCallum
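A minimal sketch of the window-scoring step, assuming the prefix/contents/suffix independence described above. The log-probability tables are placeholders that would be estimated from labeled announcements, and the sketch scores the class-conditional likelihood (the numerator of the Bayes-rule estimate) rather than the full posterior.

```python
# Placeholder log-probability tables; in practice these are estimated from
# labeled seminar announcements, e.g. Pr("Place" in prefix | LOCATION).
log_p_prefix   = {"place": -0.5, ":": -1.0}
log_p_contents = {"wean": -1.2, "hall": -1.0, "rm": -1.5, "5409": -2.0}
log_p_suffix   = {"speaker": -0.7, ":": -1.0}
log_p_length   = {3: -1.5, 4: -0.9, 5: -1.6}     # Pr(window length = k | LOCATION)
UNSEEN = -8.0                                     # smoothed log-prob for unseen words

def log_score(prefix, contents, suffix):
    """log Pr(window | LOCATION), assuming the parts are independent."""
    s = log_p_length.get(len(contents), UNSEEN)
    s += sum(log_p_prefix.get(w.lower(), UNSEEN) for w in prefix)
    s += sum(log_p_contents.get(w.lower(), UNSEEN) for w in contents)
    s += sum(log_p_suffix.get(w.lower(), UNSEEN) for w in suffix)
    return s

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian".split()

# Try all "reasonable" windows (here: lengths 1-6) and keep the best-scoring one;
# a real extractor would keep every window scoring above a threshold instead.
windows = [(i, j) for i in range(len(tokens))
           for j in range(i + 1, min(i + 6, len(tokens)) + 1)]
best = max(windows, key=lambda w: log_score(
    tokens[max(0, w[0] - 2):w[0]], tokens[w[0]:w[1]], tokens[w[1]:w[1] + 2]))
print(tokens[best[0]:best[1]])   # ['Wean', 'Hall', 'Rm', '5409']
```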

“Naïve Bayes” Sliding Window Results

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Domain: CMU UseNet Seminar Announcements

Field        F1
Person Name  30%
Location     61%
Start Time   98%

Slides from Cohen & McCallum

Realistic sliding-window-classifier IE

• What windows to consider?
  – All windows containing at least as many tokens as the shortest example, but no more tokens than the longest example

• How to represent a classifier? It might:
  – Restrict the length of the window;
  – Restrict the vocabulary or formatting used before/after/inside the window;
  – Restrict the relative order of tokens, etc.

• Learning methods
  – SRV: top-down rule learning [Freitag AAAI '98]
  – Rapier: bottom-up [Califf & Mooney, AAAI '99]

Slides from Cohen & McCallum

Rapier: results – precision/recall

Slides from Cohen & McCallum

Rule-learning approaches to sliding-window classification: Summary

• SRV, Rapier, and WHISK [Soderland KDD '97]
  – Representations for classifiers allow restriction of the relationships between tokens, etc.
  – Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
  – Use of these "heavyweight" representations is complicated, but seems to pay off in results

• Can simpler representations for classifiers work?

Slides from Cohen & McCallum

BWI: Learning to detect boundaries

• Another formulation: learn three probabilistic classifiers:
  – START(i) = Prob(position i starts a field)
  – END(j) = Prob(position j ends a field)
  – LEN(k) = Prob(an extracted field has length k)

• Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j - i)

• LEN(k) is estimated from a histogram

[Freitag & Kushmerick, AAAI 2000]

Slides from Cohen & McCallum
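A minimal sketch of the START * END * LEN scoring, with placeholder boundary and length models; the boosted detectors that BWI actually learns are not reproduced here.

```python
import itertools

# Placeholder boundary and length models (BWI learns START/END by boosting).
START = {2: 0.8, 5: 0.1}                    # Prob(position i starts a field)
END   = {5: 0.7, 6: 0.6}                    # Prob(position j ends a field)
LEN   = {1: 0.05, 2: 0.2, 3: 0.5, 4: 0.2}   # histogram of field lengths

def score(i, j):
    """Score an extraction spanning positions i..j (exclusive end)."""
    return START.get(i, 0.0) * END.get(j, 0.0) * LEN.get(j - i, 0.0)

candidates = [(i, j) for i, j in itertools.product(START, END) if j > i]
best = max(candidates, key=lambda span: score(*span))
print(best, score(*best))    # (2, 5) with 0.8 * 0.7 * 0.5 = 0.28
```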

BWI: Learning to detect boundaries

• BWI uses boosting to find “detectors” for START and END

• Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).

• Each “pattern” is a sequence of – tokens and/or – wildcards like: anyAlphabeticToken, anyNumber, …

• Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns

Slides from Cohen & McCallum

BWI: Learning to detect boundaries

Field        F1
Person Name  30%
Location     61%
Start Time   98%

Slides from Cohen & McCallum

Problems with Sliding Windows and Boundary Finders

• Decisions in neighboring parts of the input are made independently from each other.

– Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”.

– It is possible for two overlapping windows to both be above threshold.

– In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Slides from Cohen & McCallum

Solution? Joint inference…

82

More Ambitious (Blue Sky) Approaches

• The Semantic Web needs:
  – Tagged data
  – Background knowledge

• (Blue sky approaches to) automate both:
  – Knowledge extraction
    • Extract base level knowledge ("facts") directly from the web
  – Automated tagging
    • Start with a background ontology and tag other web pages
      – SemTag/Seeker

• The information extraction tasks in fielded applications like CiteSeer/Libra are narrowly focused
  – We assume that we are learning specific relations (e.g. author/title etc.)
  – We assume that the extracted relations will be put in a database for DB-style look-up

Let's look at the state of the feasible art before going to blue sky..

83

Extraction from Free Text involves Natural Language Processing

• If extracting from automatically generated web pages, simple regex patterns usually work.

• If extracting from more natural, unstructured, human-written text, some NLP may help.
  – Part-of-speech (POS) tagging
    • Mark each word as a noun, verb, preposition, etc.
  – Syntactic parsing
    • Identify phrases: NP, VP, PP
  – Semantic word categories (e.g. from WordNet)
    • KILL: kill, murder, assassinate, strangle, suffocate

• Off-the-shelf software is available to do this!
  – The "Brill" tagger

• Extraction patterns can use POS or phrase tags.
  (Analogy to regex patterns on DOM trees for structured text)

84

I. Generate-and-Test Architecture

Generic extraction patterns (Hearst '92):
• "…Cities such as Boston, Los Angeles, and Seattle…"
  – ("C such as NP1, NP2, and NP3") => IS-A(each(head(NPi)), C), …
• "Detailed information for several countries such as maps, …"
  – so require ProperNoun(head(NP)) to filter such false extractions
• "I listen to pretty much all music but prefer country such as Garth Brooks"

Template-driven extraction (where the template is stated in terms of the syntax tree); see the regex sketch below.
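A minimal sketch of the "C such as NP1, NP2, and NP3" generator, using a crude capitalized-word regex in place of a real NP chunker or parse tree; the class list is illustrative.

```python
import re

# A crude stand-in for an NP chunker: treat runs of capitalized words after
# "such as" as candidate noun phrases.
NP = r"[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*"
PATTERN = re.compile(
    rf"\b((?i:cities|countries|companies))\s+such\s+as\s+"
    rf"({NP}(?:\s*,\s*{NP})*(?:,?\s+and\s+{NP})?)")

def generate_isa(text):
    """Return candidate IS-A(instance, class) facts; the Test step comes later."""
    facts = []
    for m in PATTERN.finditer(text):
        cls = m.group(1)
        for np in re.split(r",|\band\b", m.group(2)):
            np = np.strip()
            if np:
                facts.append((np, cls))
    return facts

text = "Cities such as Boston, Los Angeles, and Seattle attract new residents."
print(generate_isa(text))
# [('Boston', 'Cities'), ('Los Angeles', 'Cities'), ('Seattle', 'Cities')]
```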

85

Test

Assess candidate extractions using Pointwise Mutual Information (PMI-IR) (Turney '01):

  PMI(Seattle, City) = |Hits(Seattle ∧ City)| / |Hits(Seattle)|

Many variations are possible…

86

..but many things indicate "city"-ness

  PMI(I, D) = |Hits(I ∧ D)| / |Hits(I)|

• PMI = frequency of co-occurrence of instance I and discriminator D
• 5-50 discriminators D_i
• Each PMI for D_i is a feature f_i
• Naïve Bayes evidence combination:

  P(φ | f_1, …, f_n) = P(φ) ∏_i P(f_i | φ) / [ P(φ) ∏_i P(f_i | φ) + P(¬φ) ∏_i P(f_i | ¬φ) ]

• Discriminator phrases f_i: "x is a city", "x has a population of", "x is the capital of y", "x's baseball team…"
• PMI is used for feature selection; NBC is used for learning; hit counts are used for assessing PMI as well as the conditional probabilities.
• Keep the probabilities with the extracted facts.
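A minimal sketch of the assess step, with a hypothetical hit_count() standing in for real search-engine hit counts; the thresholds and conditional probabilities are placeholders.

```python
import math

def hit_count(query: str) -> int:
    """Stand-in for a search-engine hit count; the numbers echo the slides."""
    counts = {'"Yakima"': 1_340_000, '"Yakima city"': 2_760,
              '"Avocado"': 1_000_000, '"Avocado city"': 10}
    return counts.get(query, 1)

def pmi(instance: str, discriminator: str) -> float:
    # PMI(I, D) = |Hits(I & D)| / |Hits(I)|
    return hit_count(f'"{instance} {discriminator}"') / hit_count(f'"{instance}"')

def assess(instance, discriminators, p_fi_true, p_fi_false, prior=0.5, thresh=1e-4):
    """Naive Bayes combination of binary features (is each PMI above threshold?)."""
    features = [pmi(instance, d) > thresh for d in discriminators]
    log_true, log_false = math.log(prior), math.log(1 - prior)
    for f, pt, pf in zip(features, p_fi_true, p_fi_false):
        log_true  += math.log(pt if f else 1 - pt)
        log_false += math.log(pf if f else 1 - pf)
    return math.exp(log_true) / (math.exp(log_true) + math.exp(log_false))

# Placeholder conditional probabilities P(f_i | city) and P(f_i | not city).
print(assess("Yakima",  ["city"], [0.8], [0.1]))   # high probability
print(assess("Avocado", ["city"], [0.8], [0.1]))   # low probability
```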

87

Assessment In Action

1. I = "Yakima" (1,340,000 hits)
2. D = <class name>
3. I+D = "Yakima city" (2,760 hits)
4. PMI = 2,760 / 1,340,000 ≈ 0.002

• I = "Avocado" (1,000,000 hits)
• I+D = "Avocado city" (10 hits)
• PMI = 10 / 1,000,000 = 0.00001 << 0.002

88

Some Sources of ambiguity

• Time: “Clinton is the president” (in 1996).

• Context: “common misconceptions..”

• Opinion: Elvis…

• Multiple word senses: Amazon, Chicago, Chevy Chase, etc.– Dominant senses can mask recessive ones!

– Approach: unmasking. ‘Chicago –City’

89

Chicago: City sense vs. Movie sense

  PMI(I, D, C) = |Hits(I ∧ D ∧ ¬C)| / |Hits(I ∧ ¬C)|

(Unmasking: C is the dominant sense subtracted from the query, e.g. "Chicago -city".)

90

Chicago Unmasked

City sense vs. movie sense:

  PMI = |Hits(Chicago ∧ Movie ∧ ¬City)| / |Hits(Chicago ∧ ¬City)|

91

Impact of Unmasking on PMI

Name         Recessive  Original  Unmask  Boost
Washington   city       0.50      0.99    96%
Casablanca   city       0.41      0.93    127%
Chevy Chase  actor      0.09      0.58    512%
Chicago      movie      0.02      0.21    972%

92

CBioC: Collaborative Bio-Curation

Motivation:
• To help extract information nuggets from articles and abstracts and store them in a database.
• The challenge: the number of articles is huge and keeps growing, and extraction requires processing natural language.
• The two existing approaches are human curation and automatic information extraction systems.
• Neither meets the challenge: the first is expensive, while the second is error-prone.

93

CBioC (cont’d)

Approach: We propose a solution that is inexpensive and that scales up.
• Our approach takes advantage of automatic information extraction methods as a starting point,
• and is based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles.
• We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information.
• We refer to our approach as "Collaborative Curation".

94

Using the C-BioCurator System (cont’d)

[Architecture figure: download agents pull articles from sources such as PubMed, Nature, Science, DIP, and Reactome into a text DB and existing DBs; extractor systems and a data-format exchange system (BioPAX) feed the CBioC database; the CBioC interface supports browsing facts, voting on facts, adding/modifying facts, adding new schemas, invoking the extractor, and user management.]

What is the main difference between KnowItAll and CBioC?

• Assessment: KnowItAll assesses facts by search-engine hit counts (PMI); CBioC does it by user voting.

96

Annotation

“The Chicago Bulls announced yesterday that Michael Jordan will. . . ”

The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...

97

Semantic Annotation

Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt

The simplest task of metadata extraction on natural language is to establish a "type" relation between entities in the NL resources and concepts in ontologies.

Named Entity Identification

98

Semantics

• Semantic annotation: the content of the annotation consists of rich semantic information
  – Targeted not only at human readers of the resources but also at software agents
• Formal: metadata following structural standards; informal: personal notes written in the margin while reading an article
• Explicit: carries sufficient information for interpretation; tacit: many personal annotations (telegraphic and incomplete)

http://www-scf.usc.edu/~csci586/slides/6

99

Uses of Annotation

http://www-scf.usc.edu/~csci586/slides/8

100

Objectives of Annotation

• Generate metadata for existing information
  – e.g., author-tag in HTML
  – RDF descriptions to HTML
  – Content descriptions to multimedia files

• Employ metadata for
  – Improved search
  – Navigation
  – Presentation
  – Summarization of contents

http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf

101

Annotation

Current practice of annotation for knowledge identification and extraction:
• is time consuming
• needs annotation by experts
• is complex

Reduce the burden of text annotation for Knowledge Management.

www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt

SemTag & Seeker (WWW-03 Best Paper Prize)
• Seeded with the TAP ontology (72k concepts) and ~700 human judgments
• Crawled 264 million web pages
• Extracted 434 million semantic tags
• Automatically disambiguated

104

SemTag

• Uses broad, shallow knowledge base

• TAP: lexical and taxonomic information about popular objects
  – Music
  – Movies
  – Sports
  – Etc.

105

SemTag

• Problem:
  – No write access to the original document, so how do you annotate?

• Solution:
  – Store annotations in a web-available database

106

SemTag

• Semantic Label Bureau
  – Separate store of semantic annotation information
  – HTTP server that can be queried for annotation information
  – Examples:
    • Find all semantic tags for a given document
    • Find all semantic tags for a particular object

107

SemTag

• Methodology

108

SemTag

• Three phases (a spotting sketch follows this list):
  1. Spotting pass:
     – Tokenize the document
     – Record all label instances plus a 20-word window
  2. Learning pass:
     – Find the corpus-wide distribution of terms at each internal node of the taxonomy
     – Based on a representative sample
  3. Tagging pass:
     – Scan the windows to disambiguate each reference
     – Each spot is finally determined to be (or not to be) a TAP object
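A minimal sketch of the spotting pass, assuming a tiny illustrative label set; SemTag's real label set and tokenization are far larger and more careful.

```python
import re

TAP_LABELS = {"jaguar", "chicago", "michael jordan"}   # illustrative labels
WINDOW = 20                                            # words of context each side

def spot(document: str):
    """Return (label, context) spots: every label occurrence plus a 20-word window."""
    tokens = re.findall(r"\w+", document.lower())
    spots = []
    for i in range(len(tokens)):
        for label in TAP_LABELS:
            parts = label.split()
            if tokens[i:i + len(parts)] == parts:
                context = tokens[max(0, i - WINDOW): i + len(parts) + WINDOW]
                spots.append((label, context))
    return spots

doc = "The Chicago Bulls announced yesterday that Michael Jordan will return."
for label, ctx in spot(doc):
    print(label, ctx)
```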

110

SemTag

• Solution: Taxonomy Based Disambiguation (TBD)

• TBD expectation:
  – Human-tuned parameters are used in small, critical sections
  – Automated approaches deal with the bulk of the information

111

SemTag

• TBD methodology:
  – Each node in the taxonomy is associated with a set of labels
    • Cats, Football, Cars all contain "jaguar"
  – Each label in the text is stored with a window of 20 words: the context
  – Each node has an associated similarity function mapping a context to a similarity
    • Higher similarity means the context is more likely to contain a reference to the node

112

SemTag

• Similarity:
  – Built a 200,000-word lexicon (the 200,100 most common words minus the 100 most common)
  – 200,000-dimensional vector space
  – Training: spots (label, context) paired with the correct node
  – Estimated the distribution of terms for each node
  – Standard cosine similarity
  – TFIDF vectors (context vs. node)
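A minimal sketch of the context-vs-node comparison using scikit-learn's TfidfVectorizer and cosine similarity; the node "documents" are made-up term samples, not TAP data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up term samples for two taxonomy nodes, plus one spot context.
node_docs = {
    "Cars":    "jaguar coupe engine sedan horsepower dealership",
    "Animals": "jaguar habitat prey rainforest feline predator",
}
context = "the jaguar accelerated past the dealership onto the highway"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(node_docs.values()) + [context])
n = len(node_docs)
node_vecs, ctx_vec = matrix[:n], matrix[n:]

for name, sim in zip(node_docs, cosine_similarity(ctx_vec, node_vecs)[0]):
    print(name, round(float(sim), 3))   # higher similarity -> more likely on topic
```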

114

SemTag

• Some internal nodes are very popular:
  – Associate a measurement of how accurate Sim is likely to be at a node
  – Also, how ambiguous the node is overall (consistency of human judgment)

• TBD algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v

• 82% accuracy on 434 million spots

116

Summary

• Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web.

• Extraction complexity depends on whether the text is "templated" or "free-form":
  – Extraction from templated text can be done with regular expressions
  – Extraction from free-form text requires NLP
    • Can be done in terms of part-of-speech tagging

• "Annotation" involves connecting terms in free-form text to items in the background knowledge
  – It too can be automated