Introduction to Text Mining
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Graduate School of Library & Information Science
Statistics, and Institute for Genomic Biology
University of Illinois, Urbana-Champaign
Outline
- Overview of Text Mining
- IR-Style Text Mining Techniques
- NLP-Style Text Mining Techniques
- ML-Style Text Mining Techniques
Two Definitions of “Mining”
• Goal-oriented (effectiveness driven, NLP, AI)
  – Any process that generates useful results that are non-obvious is called “mining”.
  – Keywords: “useful” + “non-obvious”
  – Data isn’t necessarily massive
• Method-oriented (efficiency driven, DB, IR)
  – Any process that involves extracting information from massive data is called “mining”
  – Keywords: “massive” + “pattern”
  – Patterns aren’t necessarily useful
What is Text Mining?
• Data Mining View: Explore patterns in textual data
  – Find latent topics
  – Find topical trends
  – Find outliers and other hidden patterns
• Natural Language Processing View: Make inferences based on partial understanding of natural language text
  – Information extraction
  – Question answering
Applications of Text Mining
• Direct applications
  – Discovery-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer the questions?
  – Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?
• Indirect applications
  – Assist information access (e.g., discover latent topics to better summarize search results)
  – Assist information organization (e.g., discover hidden structures)
Text Mining Methods
• Data Mining Style: View text as high dimensional data
  – Frequent pattern finding
  – Association analysis
  – Outlier detection
• Information Retrieval Style: Fine granularity topical analysis
  – Topic extraction
  – Exploit term weighting and text similarity measures
  – Question answering
• Natural Language Processing Style: Information Extraction
  – Entity extraction
  – Relation extraction
  – Sentiment analysis
• Machine Learning Style: Unsupervised or semi-supervised learning
  – Generative models
  – Dimension reduction
  – Classification & prediction
Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine, KL-div)
• Relevance/pseudo feedback (e.g., Rocchio)
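As a rough illustration of how several of these techniques fit together, the sketch below removes stop words, weights terms with one common TF-IDF variant, and compares documents by cosine similarity. The stop-word list, the function names, and the exact IDF formula are assumptions made for this example and do not come from the slides; stemming, KL-divergence, and Rocchio feedback are left out.

```python
import math
import re
from collections import Counter

# Hypothetical toy stop-word list; real systems use much larger lists.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed IDF variant: log((N + 1) / df), which is always positive.
        vectors.append(
            {t: c * math.log((n_docs + 1) / doc_freq[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["text mining of text", "mining text data", "football game"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two documents share weighted terms and score above zero, while the third shares none and scores exactly zero.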
Generality of Basic Techniques
[Figure: raw text is turned into tokenized text (stemming & stop words) and then, by term weighting, into an m × n matrix of weights w11 … wmn over documents d1 … dm and terms t1 … tn. Term similarity, doc similarity, vector centroids, and sentence selection on this matrix support CLUSTERING, CATEGORIZATION, META-DATA/ANNOTATION, and SUMMARIZATION.]
Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization
Information Filtering
• Stable & long term interest, dynamic info source
• System must make a delivery decision immediately as a document “arrives”
• Two Methods: Content-based vs. Collaborative
[Figure: documents arrive at a Filtering System that holds “my interest”.]
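A content-based filter of this kind could be sketched as follows: the stable interest is stored as a term vector, and each arriving document is accepted or rejected on the spot by comparing it to that vector. The class name `ContentFilter`, the raw term-count representation, and the threshold value are assumptions chosen for illustration; collaborative filtering is not shown.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent text as raw term counts (no stemming or weighting)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(c * v[t] for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class ContentFilter:
    """Hypothetical content-based filter with a fixed interest profile."""

    def __init__(self, interest_text, threshold=0.3):
        self.profile = bag_of_words(interest_text)
        self.threshold = threshold  # assumed cut-off, normally tuned on data

    def accept(self, document):
        """Make the delivery decision as soon as a document arrives."""
        return cosine(self.profile, bag_of_words(document)) >= self.threshold


news_filter = ContentFilter("text mining information retrieval")
stream = [
    "new text mining and information retrieval methods",
    "football scores from last night",
]
for doc in stream:
    print(news_filter.accept(doc), doc)
```

In practice the threshold is the hard part: it trades missed documents against unwanted deliveries, and a long-running system would typically adjust it from user feedback.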
Examples of Information Filtering
• News filtering
• Email filtering
• Recommender Systems
• Literature alert
• And many others
Text Categorization
• Pre-given categories and labeled document examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
[Figure: a Categorization System sorts incoming documents into categories such as Sports, Business, Education, Science, …]
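One simple way to treat categorization as supervised learning is a nearest-centroid classifier: sum the term vectors of the labeled examples in each category, then assign a new document to the category whose centroid it most resembles. This is only one possible method, picked here because it reuses the vector and similarity ideas from the IR techniques; the function names and training texts are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent text as raw term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(c * v[t] for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labeled_docs):
    """Sum the term vectors per category.

    The sum points in the same direction as the mean, and cosine
    similarity ignores vector length, so no division is needed.
    """
    centroids = defaultdict(Counter)
    for text, category in labeled_docs:
        centroids[category].update(bag_of_words(text))
    return dict(centroids)


def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# Made-up labeled examples.
training = [
    ("the team won the football match", "Sports"),
    ("the striker scored a goal in the match", "Sports"),
    ("stocks and markets fell as profits dropped", "Business"),
    ("the company reported quarterly profits", "Business"),
]
centroids = train_centroids(training)
print(classify("a late goal decided the match", centroids))
print(classify("company profits and stocks", centroids))
```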
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic Email sorting
• Web page classification
The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages
• Example
Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining
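To make "group similar objects together" concrete, here is a minimal single-pass clustering sketch: each document joins the most similar existing cluster if the similarity reaches a threshold, and otherwise starts a new cluster. The algorithm choice, the threshold, and the sample texts are assumptions for illustration; the slides do not prescribe a particular clustering method, and the result of a single-pass scheme depends on document order.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent text as raw term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(c * v[t] for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering; returns lists of document indices."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(texts):
        vec = bag_of_words(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            # Nothing is similar enough: start a new cluster.
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            # Join the best cluster and fold the document into its centroid.
            best["centroid"].update(vec)
            best["members"].append(i)
    return [c["members"] for c in clusters]


texts = [
    "stock market prices fall",
    "football match tonight",
    "stock prices rise in market",
    "football match result",
]
print(cluster(texts))
```

The same function works for terms or passages as long as each object can be turned into a vector.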
“Retrieval-based” Summarization
• Observation: term vector summary?
• Basic approach
  – Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
  – Based on term weights
  – Based on position of sentences
  – Based on the similarity of sentence and document vector
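The last ranking method can be sketched in a few lines: score each sentence by its cosine similarity to the whole-document vector, keep the top N, and print them in their original order. The sentence splitter, the use of raw term counts, and the sample document are simplifying assumptions for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent text as raw term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(c * v[t] for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summarize(document, n=2):
    """Select the n sentences most similar to the document vector."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    doc_vec = bag_of_words(document)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(bag_of_words(sentences[i]), doc_vec),
        reverse=True,
    )
    # Restore original order so the summary reads naturally.
    return [sentences[i] for i in sorted(ranked[:n])]


document = (
    "Text mining finds patterns in text. "
    "The weather was nice. "
    "Mining text data reveals useful patterns."
)
print(summarize(document, n=2))
```

Position-based or term-weight-based ranking would only change the `key` function; the select-top-N step stays the same.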
Examples of Summarization
• News summary
• Summarize retrieval results
  – Single doc summary
  – Multi-doc summary
• Summarize a cluster of documents (automatic label creation for clusters)
NLP-Style Text Mining Techniques
Most of the following slides are from William Cohen’s IE tutorial
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

| NAME | TITLE | ORGANIZATION |
|---|---|---|
| Bill Gates | CEO | Microsoft |
| Bill Veghte | VP | Microsoft |
| Richard Stallman | founder | Free Soft.. |
Landscape of IE Tasks: Complexity
E.g. word patterns:
• Closed set (U.S. states)
  – He was born in Alabama…
  – The big Wyoming sky…
• Regular set (U.S. phone numbers)
  – Phone: (413) 545-1323
  – The CALD main office can be reached at 412-268-1299
• Complex pattern (U.S. postal addresses)
  – University of Arkansas, P.O. Box 140, Hope, AR 71802
  – Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
• Ambiguous patterns, needing context and many sources of evidence (Person names)
  – …was among the six houses sold by Hope Feldman that year.
  – Pawel Opalinski, Software Engineer at WhizBang Labs.
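The two easiest cases can be handled with very little machinery, which the sketch below tries to show: a closed set needs only a lookup list, and a regular set needs only a regular expression. The four-entry state list and the phone pattern are illustrative assumptions (the pattern covers just the two formats in the examples), and neither approach would cope with addresses or person names.

```python
import re

# Hypothetical, deliberately tiny lexicon of single-word state names.
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Assumed pattern: optional parentheses around the area code,
# then a space or hyphen, then a 3-4 digit local number.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-\s]\d{3}-\d{4}")


def find_states(text):
    """Closed set: keep the words that are members of the lexicon."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w in US_STATES]


def find_phone_numbers(text):
    """Regular set: return every substring matching the phone pattern."""
    return PHONE_RE.findall(text)


print(find_states("He was born in Alabama"))
print(find_states("The big Wyoming sky"))
print(find_phone_numbers("Phone: (413) 545-1323"))
print(find_phone_numbers("The CALD main office can be reached at 412-268-1299"))
```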
Landscape of IE Techniques
Any of these models can be used to capture words, formatting or both. Each is illustrated on the sentence “Abraham Lincoln was born in Kentucky.”
• Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
• Boundary Models: a classifier decides which class each position belongs to (BEGIN or END)
• Context Free Grammars: most likely parse? (tree over NNP, V, P, NP, PP, VP, S)
• Finite State Machines: most likely state sequence?
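The sliding-window idea can be sketched as a loop over every window of every allowed size, asking a classifier for a label each time. A trained classifier would normally fill that role; here a lexicon lookup stands in for it so the example stays self-contained. The function names, the lexicons, and the maximum window size are all assumptions for illustration.

```python
# Stand-in "classifier": hypothetical lexicons keyed by token tuples.
PERSON_LEXICON = {("Abraham", "Lincoln")}
STATE_LEXICON = {("Kentucky",), ("Alabama",)}


def lexicon_classifier(span):
    """Return a class label for a token span, or None if it is rejected."""
    key = tuple(span)
    if key in PERSON_LEXICON:
        return "PERSON"
    if key in STATE_LEXICON:
        return "STATE"
    return None


def sliding_window_extract(tokens, classify, max_window=3):
    """Return (text, label) for every window the classifier accepts."""
    found = []
    # Try alternate window sizes, from one token up to max_window tokens.
    for size in range(1, max_window + 1):
        for start in range(len(tokens) - size + 1):
            span = tokens[start:start + size]
            label = classify(span)
            if label is not None:
                found.append((" ".join(span), label))
    return found


tokens = "Abraham Lincoln was born in Kentucky".split()
print(sliding_window_extract(tokens, lexicon_classifier))
```

Because `classify` is passed in as a parameter, the same loop would work with a learned model that looks at the window's words and formatting.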
Many Techniques are Available
• Supervised learning
  – Classification
  – Regression
• Unsupervised learning
  – Topic models
  – Dimension reduction
• Most relevant methods
  – Generative models
  – Matrix decomposition
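As one concrete instance of a generative model used for classification, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing. The slide does not name a specific model, so this choice, along with the function names and the toy training data, is an assumption made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_docs = Counter()              # documents per class
    term_counts = defaultdict(Counter)  # term occurrences per class
    vocab = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def predict(text, model):
    """Return the class with the highest log posterior score."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                p = (term_counts[label][tok] + 1) / (total_terms + len(vocab))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data.
model = train_naive_bayes([
    ("great wonderful movie", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible boring movie", "neg"),
    ("boring plot terrible acting", "neg"),
])
print(predict("great wonderful plot", model))
print(predict("boring terrible movie", model))
```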
Topics for Discussion
• Social Science research questions:
  – Mining bias: selection bias, framing bias
• Text Mining techniques
  – Sentiment analysis
  – Topic discovery and evolution graph
  – Joint text-image analysis