Text Tango: A New Text Data Mining Project
Marti A. Hearst
GUIR Meeting, Sept 17, 1998
Marti A. Hearst, UC Berkeley SIMS, 1998
Talk Outline
  What is Data Mining?
  What isn’t Text Data Mining?
  What is Text Data Mining?
    Examples
  A proposal for a system for Text Data Mining
What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
  Fitting models to or determining patterns from very large datasets.
  A “regime” which enables people to interact effectively with massive data stores.
  Deriving new information from data.
    finding patterns across large datasets
    discovering heretofore unknown information
What is Data Mining?
  Potential point of confusion:
    The “extracting ore from rock” metaphor does not really apply to the practice of data mining
    If it did, then standard database queries would fit under the rubric of data mining
      Find all employee records in which the employee earns $300/month less than their manager
  In practice, DM refers to:
    finding patterns across large datasets
    discovering heretofore unknown information
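To make the contrast concrete, the employee/manager example can be written as an ordinary database query. The sketch below is only an illustration: the table layout, the names, and the salaries are invented, and "$300/month less" is read as "at least $300 less". The query returns records that were already stored; it discovers no new pattern, which is why it does not count as data mining.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    monthly_salary INTEGER,
    manager_id INTEGER
);
INSERT INTO employee VALUES
    (1, 'Avery', 5000, NULL),
    (2, 'Blake', 4700, 1),
    (3, 'Casey', 4900, 1),
    (4, 'Drew',  4400, 2);
""")

# Self-join: pair each employee with their manager and compare salaries.
# Assumption: "$300/month less" means a gap of at least 300.
rows = conn.execute("""
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.monthly_salary - e.monthly_salary >= 300
    ORDER BY e.name
""").fetchall()
conn.close()

underpaid = [name for (name,) in rows]
print(underpaid)
```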
DM Touchstone Applications (CACM 39 (11) Special Issue)
  Finding patterns across data sets:
    Reports on changes in retail sales
      to improve sales
    Patterns of sizes of TV audiences
      for marketing
    Patterns in NBA play
      to alter, and so improve, performance
    Deviations in standard phone calling behavior
      to detect fraud
      for marketing
What is Text Data Mining?
  People’s first thought:
    Make it easier to find things on the Web.
    This is information retrieval!
  The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
  But it does not reflect notions of DM in practice:
    finding patterns across large collections
    discovering heretofore unknown information
Text DM != IR
  Data Mining:
    Patterns, Nuggets, Exploratory Analysis
  Information Retrieval:
    Finding and ranking documents that match users’ information need
      ad hoc query
      filtering/standing query
Real Text DM
  What would finding a pattern across a large text collection really look like?
From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)
Bill Gates + MS-DOS in the Bible!
From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
Real Text DM
  The point:
    Discovering heretofore unknown information is not what we usually do with text.
      (If it weren’t known, it could not have been written by someone.)
  However:
    There are some interesting problems of this type!
Combining Data Types for Novel Tasks
  Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
  Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
Ore-Filled Text Collections
  Congressional Voting Records
    Answer questions like:
      Who are the most hypocritical congresspeople?
  Medical Articles
    Create hypotheses about causes of rare diseases
    Create hypotheses about gene function
  Patent Law
    Answer questions like:
      Is government funding of research worthwhile?
How to find Hypocritical Congresspersons?
  This must have taken a lot of work
    Hand cutting and pasting
    Lots of picky details
      Some people voted on one but not the other bill
      Some people share the same name
        Check for different county/state
        Still messed up on “Bono”
    Taking stats at the end on various attributes
      Which state
      Which party
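The manual steps listed above (matching voters across two bills, telling apart people who share a name, and tallying by state and party) can be sketched in a few lines. Everything in this sketch is hypothetical: the two bills, the names, and the votes are invented, and "hypocritical" is simplified to "voted differently on the two bills". Keying on (name, state) is the disambiguation step from the slide; as the “Bono” remark suggests, it would still fail for two members with the same name from the same state.

```python
from collections import Counter

# Hypothetical roll-call records: (name, state, party, vote).
bill_a = [
    ("Smith", "CA", "R", "yes"),
    ("Smith", "TX", "D", "yes"),
    ("Jones", "NY", "D", "no"),
    ("Lee", "WA", "R", "yes"),   # voted on bill A only
]
bill_b = [
    ("Smith", "CA", "R", "no"),
    ("Smith", "TX", "D", "yes"),
    ("Jones", "NY", "D", "yes"),
]

def index_votes(records):
    """Key each record on (name, state) so same-name members stay distinct."""
    return {(name, state): (party, vote) for name, state, party, vote in records}

votes_a = index_votes(bill_a)
votes_b = index_votes(bill_b)

# Keep only members who voted on both bills.
voted_on_both = sorted(votes_a.keys() & votes_b.keys())

# Flag members whose two votes differ.
mismatched = [key for key in voted_on_both if votes_a[key][1] != votes_b[key][1]]

# Summary statistics at the end, by party and by state.
by_party = Counter(votes_a[key][0] for key in mismatched)
by_state = Counter(state for _, state in mismatched)

print(mismatched, dict(by_party), dict(by_state))
```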
How to find functions of genes?
  Important problem in molecular biology
    Have the genetic sequence
    Don’t know what it does
    But …
      Know which genes it coexpresses with
      Some of these have known function
    So … Infer function based on function of co-expressed genes
  This is new work by Michael Walker and others at Incyte Pharmaceuticals
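One simple way to picture "infer function based on function of co-expressed genes" is a majority vote over the co-expressed genes whose function is already known. This is only a sketch of the general idea, not the method used at Incyte; the gene names and function labels below are placeholders.

```python
from collections import Counter

# Placeholder annotations: gene -> known function (gene_d is unannotated).
known_function = {
    "gene_a": "function_x",
    "gene_b": "function_x",
    "gene_c": "function_y",
}

# Placeholder co-expression data: gene -> genes it is co-expressed with.
coexpressed_with = {
    "mystery_gene": ["gene_a", "gene_b", "gene_c", "gene_d"],
}

def guess_function(gene):
    """Return (function, votes, annotated_neighbours), or None if nothing is known."""
    labels = [
        known_function[other]
        for other in coexpressed_with.get(gene, [])
        if other in known_function
    ]
    if not labels:
        return None
    function, votes = Counter(labels).most_common(1)[0]
    return function, votes, len(labels)

guess = guess_function("mystery_gene")
print(guess)
```

The vote count is returned alongside the guess so that a weakly supported hypothesis can be set aside rather than trusted.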
Gene Co-expression: Role in the genetic pathway
  [Figure: pathway diagrams placing unknown genes (g?, h?) relative to PSA, Kall., and PAP]
  Other possibilities as well
Make use of the literature
  Look up what is known about the other genes.
    Different articles in different collections
  Look for commonalities
    Similar topics indicated by Subject Descriptors
    Similar words in titles and abstracts
      adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
Developing Strategies
  Different strategies seem needed for different situations
    First: see what is known about Kallikrein.
      7341 documents. Too many
    AND the result with “disease” category
      If result is non-empty, this might be an interesting gene
    Now get 803 documents
    AND the result with PSA
      Get 11 documents. Better!
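The narrowing strategy above is a chain of Boolean ANDs, each one shrinking the previous result set. A minimal sketch, assuming a toy inverted index (the document ids below are invented and far smaller than the 7341, 803, and 11 documents reported on the slide):

```python
# Hypothetical inverted index: term -> set of document ids.
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 5, 9},   # stands in for the "disease" category
    "psa": {3, 7, 9},
}

def narrow(index, terms):
    """AND the terms in order, recording the result-set size after each step."""
    result = None
    sizes = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        sizes.append((term, len(result)))
    return result, sizes

docs, sizes = narrow(index, ["kallikrein", "disease", "psa"])
print(docs, sizes)
```

Recording the size after each step mirrors how the researcher judged each stage ("too many", "better") before deciding what to AND in next.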
Developing Strategies
  Look for commonalities among these documents
    Manual scan through ~100 category labels
    Would have been better if
      Automatically organized
      Intersections of “important” categories scanned for first
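The manual scan through category labels could be organized automatically by counting how many retrieved documents carry each label, and each pair of labels, then showing the most widely shared ones first. In this sketch the label strings come from the descriptor list earlier in the talk, but their assignment to documents is invented, and "important" is approximated by plain frequency.

```python
from collections import Counter
from itertools import combinations

# Hypothetical retrieved documents with their category labels.
doc_categories = {
    "doc1": {"prostatic neoplasms", "tumor markers", "antibodies"},
    "doc2": {"prostatic neoplasms", "tumor markers"},
    "doc3": {"adenocarcinoma", "prostatic neoplasms"},
}

# How many documents carry each label.
label_counts = Counter(
    label for labels in doc_categories.values() for label in labels
)

# How many documents carry each pair of labels (the category intersections).
pair_counts = Counter(
    pair
    for labels in doc_categories.values()
    for pair in combinations(sorted(labels), 2)
)

top_label = label_counts.most_common(1)[0]
top_pair = pair_counts.most_common(1)[0]
print(top_label, top_pair)
```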
Try a new tack
  Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
  New tack: intersect search on all three known genes
    Hope they all talk about diagnostics and prostate cancer
    Fortunately, 7 documents returned
      Bingo! A relation to regulation of this cancer
Formulate a Hypothesis
  Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
  New tack: do some lab tests
    See if mystery gene is similar in molecular structure to the others
    If so, it might do some of the same things they do
Strategies again
  In hindsight, combining all three genes was a good strategy.
    Store this for later
  Might not have worked
    Need a suite of strategies
    Build them up via experience and a good UI
The System
  Doing the same query with slightly different values each time is time-consuming and tedious
    Same goes for cutting and pasting results
  IR systems don’t support varying queries like this very well.
    Each situation is a bit different
  Some automatic processing is needed in the background to eliminate/suggest hypotheses
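The tedium described above, re-running one query with slightly different values and pasting the results together, is the kind of work a small loop can take over. The sketch below is a hypothetical illustration rather than the Text Tango design: `search` is a stand-in for whatever retrieval backend is used, and the postings are invented. It runs every combination of the three known genes and collects the results in one table.

```python
from itertools import combinations

# Hypothetical postings: gene name -> ids of documents mentioning it.
index = {
    "PSA": {1, 2, 3, 4},
    "Kallikrein": {2, 3, 5},
    "PAP": {3, 4, 5},
}

def search(terms):
    """Stand-in for a retrieval backend: documents mentioning all the terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def run_variations(terms, min_size=1):
    """Run the same AND query over every combination of the given terms."""
    rows = []
    for size in range(min_size, len(terms) + 1):
        for combo in combinations(terms, size):
            rows.append((combo, sorted(search(combo))))
    return rows

rows = run_variations(["PSA", "Kallikrein", "PAP"], min_size=2)
for combo, doc_ids in rows:
    print(" AND ".join(combo), "->", doc_ids)
```

Collecting every variation in one table is what lets a later step rank or discard combinations automatically instead of leaving that to cut and paste.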
The System
  Three main parts
    UI for building/using strategies
    Backend for interfacing with various databases and translating different formats
    Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones
The UI part
  Need support for building strategies
  Lots of info lying around, so a nice option is ...
    Two-handed interface
    Big table display
  Mixed-initiative system
    Trade off between user-initiated hypotheses exploration and system-initiated suggestions
  Information visualization
    Another way to show lots of choices
[Figure: UI mockup with panels labeled Candidate Associations, Current Retrieval Results, and Suggested Strategies]
Other applications
  Patent example
  Political example
  The truth’s out there!
Text Tango
  Just starting up now.
  Let me know if you’d like to work on it!