Text Tango: A New Text Data Mining Project
Marti A. Hearst, GUIR Meeting, Sept 17, 1998

Page 1:

Text Tango: A New Text Data Mining Project

Marti A. Hearst

GUIR Meeting, Sept 17, 1998

Page 2:

Marti A. Hearst, UC Berkeley SIMS, 1998

Talk Outline

What is Data Mining?
What isn’t Text Data Mining?
What is Text Data Mining?
  Examples
A proposal for a system for Text Data Mining

Page 3:

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)

Fitting models to or determining patterns from very large datasets.
A “regime” which enables people to interact effectively with massive data stores.
Deriving new information from data.
  finding patterns across large datasets
  discovering heretofore unknown information

Page 4:

What is Data Mining?

Potential point of confusion:
  The extracting ore from rock metaphor does not really apply to the practice of data mining
  If it did, then standard database queries would fit under the rubric of data mining
    Find all employee records in which employee earns $300/month less than their managers
In practice, DM refers to:
  finding patterns across large datasets
  discovering heretofore unknown information
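To make the contrast concrete, here is a minimal sketch of the employee query as an ordinary database lookup, using Python's built-in sqlite3 module. The table, column names, and rows are invented for illustration, and "earns $300/month less" is read here as "at least $300 less", which is an assumption. The query returns records that are already in the database and matches a condition stated in advance; it does not surface a new pattern.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, "
    "monthly_salary INTEGER, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Avery", 5000, None),  # top-level manager, has no manager
        (2, "Blake", 4800, 1),     # earns 200 less than Avery
        (3, "Casey", 4500, 1),     # earns 500 less than Avery
        (4, "Devon", 4100, 3),     # earns 400 less than Casey
    ],
)

# Self-join: pair each employee with their manager and keep the pairs whose
# salary gap is at least 300 per month (one reading of the slide's query).
rows = conn.execute(
    """
    SELECT e.name, m.name, m.monthly_salary - e.monthly_salary AS gap
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.id
    """
).fetchall()

print(rows)
```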

Page 5:

DM Touchstone Applications (CACM 39 (11) Special Issue)

Finding patterns across data sets:
  Reports on changes in retail sales
    to improve sales
  Patterns of sizes of TV audiences
    for marketing
  Patterns in NBA play
    to alter, and so improve, performance
  Deviations in standard phone calling behavior
    to detect fraud
    for marketing

Page 6:

What is Text Data Mining?

People’s first thought:
  Make it easier to find things on the Web.
  This is information retrieval!
The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
But does not reflect notions of DM in practice:
  finding patterns across large collections
  discovering heretofore unknown information

Page 7:

Text DM != IR

Data Mining:
  Patterns, Nuggets, Exploratory Analysis
Information Retrieval:
  Finding and ranking documents that match users’ information need
    ad hoc query
    filtering/standing query

Page 8:

Real Text DM

What would finding a pattern across a large text collection really look like?

Page 9:

From: “The Internet Diary of the man who cracked the Bible Code”, Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)

Bill Gates + MS-DOS in the Bible!

Page 10:

From: “The Internet Diary of the man who cracked the Bible Code”, Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

Page 11:

Real Text DM

The point:
  Discovering heretofore unknown information is not what we usually do with text.
  (If it weren’t known, it could not have been written by someone.)
However:
  There are some interesting problems of this type!

Page 12:

Combining Data Types for Novel Tasks

Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)

Page 13:

Ore-Filled Text Collections

Congressional Voting Records
  Answer questions like:
    Who are the most hypocritical congresspeople?
Medical Articles
  Create hypotheses about causes of rare diseases
  Create hypotheses about gene function
Patent Law
  Answer questions like:
    Is government funding of research worthwhile?

Page 16:

How to find Hypocritical Congresspersons?

This must have taken a lot of work
  Hand cutting and pasting
  Lots of picky details
    Some people voted on one but not the other bill
    Some people share the same name
      Check for different county/state
      Still messed up on “Bono”
  Taking stats at the end on various attributes
    Which state
    Which party

Page 17:

How to find functions of genes?

Important problem in molecular biology
  Have the genetic sequence
  Don’t know what it does
  But …
    Know which genes it coexpresses with
    Some of these have known function
  So … Infer function based on function of co-expressed genes
This is new work by Michael Walker and others at Incyte Pharmaceuticals
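The inference step on this slide can be sketched as a simple vote: each co-expressed gene with a known function casts one vote for that function. This is only an illustration of the idea, not the method used at Incyte; the gene names, function labels, and the `infer_function` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical annotations, invented for illustration; not from the talk.
known_function = {
    "geneA": "protease",
    "geneB": "protease",
    "geneC": "transport",
}

# Hypothetical co-expression partners of an uncharacterized gene.
coexpressed_with = {
    "mystery": ["geneA", "geneB", "geneC", "geneD"],  # geneD is unannotated
}

def infer_function(gene):
    """Rank candidate functions by how many annotated partners carry them."""
    votes = Counter(
        known_function[partner]
        for partner in coexpressed_with.get(gene, [])
        if partner in known_function
    )
    return votes.most_common()

print(infer_function("mystery"))
```

The ranked list is a set of candidate hypotheses to check, not an answer; unannotated partners such as `geneD` simply contribute no vote.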

Page 18:

Gene Co-expression: Role in the genetic pathway

[Diagram; surviving node labels: g?, h?, PSA, Kall., PAP]

Other possibilities as well

Page 19:

Make use of the literature

Look up what is known about the other genes.
  Different articles in different collections
Look for commonalities
  Similar topics indicated by Subject Descriptors
  Similar words in titles and abstracts
    adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
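One way to look for "similar words in titles and abstracts" is to intersect the vocabularies of the articles retrieved for each known gene. The sketch below assumes that approach; the titles are invented for illustration, and the stopword list and `vocabulary` helper are hypothetical.

```python
import re

# Hypothetical titles, invented for illustration; real input would be the
# titles and abstracts retrieved for each known gene.
titles_by_gene = {
    "PSA": ["Serum PSA as a tumor marker in prostate adenocarcinoma"],
    "Kallikrein": ["Kallikrein expression in prostate tissue and adenocarcinoma"],
    "PAP": ["PAP antibodies in prostate adenocarcinoma"],
}

STOPWORDS = {"a", "as", "and", "in", "of", "the"}

def vocabulary(texts):
    """Collect the lowercase content words that appear in a list of texts."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words - STOPWORDS

vocab = {gene: vocabulary(texts) for gene, texts in titles_by_gene.items()}

# Words that occur in the literature of every known gene.
shared = set.intersection(*vocab.values())
print(sorted(shared))
```

Subject Descriptors could be handled the same way by intersecting sets of category labels instead of words.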

Page 20:

Developing Strategies

Different strategies seem needed for different situations
  First: see what is known about Kallikrein.
    7341 documents. Too many
  AND the result with “disease” category
    If result is non-empty, this might be an interesting gene
    Now get 803 documents
  AND the result with PSA
    Get 11 documents. Better!
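The narrowing sequence above (7341, then 803, then 11 documents) amounts to ANDing result sets one constraint at a time. A minimal sketch, assuming each search returns a set of document ids; the ids below are toy values, not the real collection, and the `narrow` helper is hypothetical.

```python
# Hypothetical document-id sets standing in for search results.
results = {
    "kallikrein": {1, 2, 3, 4, 5, 6, 7, 8},
    "disease": {2, 3, 4, 6, 9, 10},
    "psa": {3, 6, 11},
}

def narrow(terms):
    """AND the result sets together one term at a time, recording each size."""
    current = None
    trace = []
    for term in terms:
        hits = results[term]
        current = set(hits) if current is None else current & hits
        trace.append((term, len(current)))
    return current, trace

docs, trace = narrow(["kallikrein", "disease", "psa"])
print(trace)         # size of the result set after each AND
print(sorted(docs))  # documents that survive every constraint
```

Keeping the trace makes it easy to see which constraint did the most narrowing, and to back up a step if the result becomes empty.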

Page 21:

Developing Strategies

Look for commonalities among these documents
  Manual scan through ~100 category labels
  Would have been better if
    Automatically organized
    Intersections of “important” categories scanned for first

Page 22:

Try a new tack

Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
New tack: intersect search on all three known genes
  Hope they all talk about diagnostics and prostate cancer
  Fortunately, 7 documents returned
  Bingo! A relation to regulation of this cancer

Page 23:

Formulate a Hypothesis

Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
New tack: do some lab tests
  See if mystery gene is similar in molecular structure to the others
  If so, it might do some of the same things they do

Page 24:

Strategies again

In hindsight, combining all three genes was a good strategy.
  Store this for later
Might not have worked
  Need a suite of strategies
  Build them up via experience and a good UI

Page 25:

The System

Doing the same query with slightly different values each time is time-consuming and tedious
Same goes for cutting and pasting results
  IR systems don’t support varying queries like this very well.
  Each situation is a bit different
Some automatic processing is needed in the background to eliminate/suggest hypotheses
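As a rough illustration of what the system could take over, the sketch below generates every AND-combination of a set of terms so the variants need not be typed by hand. The gene names come from the slides; the query syntax, the optional extra term, and the `query_variants` helper are hypothetical and do not describe the planned system.

```python
from itertools import combinations

# Gene names taken from the slides; the query syntax below is hypothetical.
known_genes = ["PSA", "Kallikrein", "PAP"]

def query_variants(terms, extra=None):
    """Yield every AND-combination of the terms, largest combinations first."""
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            parts = list(combo) + ([extra] if extra else [])
            yield " AND ".join(parts)

variants = list(query_variants(known_genes, extra="disease"))
for query in variants:
    print(query)
```

Each generated query could then be run in the background and its result count used to suggest or eliminate candidate strategies.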

Page 26:

The System

Three main parts
  UI for building/using strategies
  Backend for interfacing with various databases and translating different formats
  Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

Page 27:

The UI part

Need support for building strategies
Lots of info lying around, so a nice option is ...
  Two-handed interface
  Big table display
Mixed-initiative system
  Trade off between user-initiated hypotheses exploration and system-initiated suggestions
Information visualization
  Another way to show lots of choices

Page 28:

Candidate Associations

Current Retrieval Results

Suggested Strategies

Page 29:

Other applications

Patent example
Political example
The truth’s out there!

Page 30:

Text Tango

Just starting up now.
Let me know if you’d like to work on it!