Page 1: Introduction to Text Mining - Indian Statistical …acmsc/TMW2014/M_mitra.pdfIntroduction to Text Mining Mandar Mitra Indian Statistical Institute M. Mitra (ISI) Text Mining 1 / 29

Introduction to Text Mining

Mandar Mitra

Indian Statistical Institute

M. Mitra (ISI) Text Mining 1 / 29

Outline

1 Preliminaries

2 Preprocessing

3 Mining word associations

4 Opinion mining

What is Text Mining?

Strict definition: the nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data

OR

Loose definition: the science of extracting useful information from large [textual] datasets

Old wine in a new bottle? Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition)

Why is it interesting?

Growth of Web / electronic information sources

Multidisciplinary nature

E-commerce potential

    “Electronic commerce is emerging as the killer domain for data-mining technology” (Ronny Kohavi)

Data sources

World Wide Web
    unstructured and semi-structured text
    “deep” web: pages that do not exist until they are created dynamically as the result of a specific search
    social networks

Intranet
    internal correspondence, memos, presentations
    white papers, technical reports
    customer email, customer forums, product reviews

News wires

...

No structure / general schema / tabular form that fits text

Indexing

Any text item (“document”) is represented as a list of terms and associated weights

$D = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$

Term = keyword or content descriptor

Weight = measure of the importance of a term in representing the information contained in the document

Indexing

Tokenization: identify individual words

Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.

Sachin | Tendulkar | made | a | tearful | but | ...

Indexing

Stopword removal: eliminate common words, e.g. and, of, the, etc.

Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
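The two steps so far (tokenization, then stopword removal) can be sketched in a few lines of Python. This is only an illustration: the regular expression, the lowercasing, and the very short `STOPWORDS` set are assumptions made for this example, not something the slides specify; real stopword lists are much longer.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "as", "at", "but", "his", "of", "on", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

sentence = ("Sachin Tendulkar made a tearful but self-effacing farewell as his "
            "glittering 24-year career came to an end on Saturday at his home "
            "ground of Wankhede Stadium.")
tokens = tokenize(sentence)
content_tokens = remove_stopwords(tokens)
```

Keeping hyphenated forms such as "self-effacing" and "24-year" as single tokens is a design choice; splitting on the hyphen is equally common.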

Indexing

Stemming: reduce words to a common root
    e.g. resignation, resigned, resigns → resign
    analysis, analyze, analyzing → analy

use standard algorithms (Porter)

Indexing

Thesaurus: find synonyms for words in the document

Phrases: find multi-word terms, e.g. computer science, data mining
    use syntax/linguistic methods or “statistical” methods

Named entities: identify names of people, organizations, places; dates; monetary or other amounts, etc.

Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.

Indexing: Term Weights

Term frequency (tf): repeated words are strongly related to content

Inverse document frequency (idf): an uncommon term is more important

Normalization by document length
    long docs. contain many distinct words
    long docs. contain the same word many times
    term weights for long documents should be reduced
    use # bytes, # distinct words, Euclidean length, etc.

Weight = tf × idf / normalization
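As a rough illustration of "Weight = tf × idf / normalization", the sketch below uses raw term counts for tf, log(N / df) for idf, and Euclidean length for normalization. These are just one simple combination chosen for the example (the slide lists several normalization options and does not fix the tf or idf variant); the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # raw tf x log(N / df): one simple choice among many tf and idf variants
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Euclidean length normalization (fall back to 1.0 for an all-zero vector)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
]
vectors = tfidf_vectors(docs)
```

In the second document, "data" (which occurs in one document) ends up with a larger weight than "mining" (which occurs in two), which is the idf effect described above.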

Commonly used weighting schemes

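Pivoted normalization [Singhal et al., SIGIR 96]

$$\frac{\dfrac{1 + \log(tf)}{1 + \log(\text{average } tf)} \times \log\left(\dfrac{N}{df}\right)}{(1.0 - slope) \times pivot + slope \times \#\,\text{unique terms}}$$

BM25 (probabilistic model) [Robertson and Zaragoza, FTIR 2009]

$$\frac{tf \times \log\left(\dfrac{N - df + 0.5}{df + 0.5}\right)}{k_1\left((1 - b) + b\,\dfrac{dl}{avdl}\right) + tf}$$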
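Both formulas translate directly into code. In the sketch below the symbols are read in the usual way (tf = term frequency in the document, df = number of documents containing the term, N = collection size, dl = document length, avdl = average document length); the slide does not define them, so this reading is an assumption. The default values `k1=1.2`, `b=0.75` and `slope=0.2` are commonly quoted settings used here only for illustration.

```python
import math

def bm25_weight(tf, df, n_docs, dl, avdl, k1=1.2, b=0.75):
    """BM25 term weight as written on the slide (tf must be > 0 to be meaningful)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf * idf / (k1 * ((1 - b) + b * dl / avdl) + tf)

def pivoted_weight(tf, avg_tf, df, n_docs, n_unique, pivot, slope=0.2):
    """Pivoted unique-term normalization weight; tf and avg_tf must be >= 1."""
    numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf)) * math.log(n_docs / df)
    return numerator / ((1.0 - slope) * pivot + slope * n_unique)

# A term occurring twice, in a document of average length vs. one twice as long.
w_average = bm25_weight(tf=2, df=10, n_docs=1000, dl=100, avdl=100)
w_long = bm25_weight(tf=2, df=10, n_docs=1000, dl=200, avdl=100)
```

The longer document receives the smaller weight for the same tf, which is the length-normalization effect both schemes aim for.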

Searching

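Measure vocabulary overlap between user query and documents.

Over the terms $t_1, \ldots, t_n$: $Q = (q_1, \ldots, q_n)$, $D = (d_1, \ldots, d_n)$

$$Sim(Q, D) = \vec{Q} \cdot \vec{D} = \sum_i q_i \times d_i$$

Use inverted list (index).

$$Term_i \rightarrow (D_{i_1}, w_{i_1}), \ldots, (D_{i_k}, w_{i_k})$$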
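A minimal sketch of how the inverted list supports the dot-product similarity: only the postings of the query terms are visited, so documents sharing no term with the query are never touched. The document identifiers and weights below are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """doc_vectors: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vector):
    """Accumulate sum_i q_i * d_i per document, ranked by decreasing score."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

doc_vectors = {
    "D1": {"text": 0.8, "mining": 0.6},
    "D2": {"data": 0.9, "mining": 0.4},
    "D3": {"cricket": 1.0},
}
index = build_inverted_index(doc_vectors)
ranking = search(index, {"text": 1.0, "mining": 0.5})
```

Here D1 ranks above D2, and D3 does not appear in the ranking at all because it has no term in common with the query.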

Stemming

YASS [Majumder et al., ACM TOIS 25(4), 2007]

Stemming ≡ grouping morphologically related words together
    e.g. { analysis, analyze, analyzing }

Try clustering
    distance measure: edit distance, or

    $$D(X, Y) = \frac{n - m + 1}{m} \times \sum_{i=m}^{n} \frac{1}{2^{i-m}} \quad \text{if } m > 0, \qquad \infty \text{ otherwise}$$

    clustering algorithm: hierarchical agglomerative (single link / complete link / average link)

Stemming

0  1  2  3  4  5  6  7  8  9  10 11 12 13
a  s  t  r  o  n  o  m  i  c  a  l  l  y
a  s  t  r  o  n  o  m  e  r  x  x  x  x

Edit distance = 6

$$D = \frac{6}{8} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{13-8}}\right) = 1.4766$$

0  1  2  3  4  5  6  7  8  9
a  s  t  o  n  i  s  h  x  x
a  s  t  r  o  n  o  m  e  r

Edit distance = 5

$$D = \frac{7}{3} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{9-3}}\right) = 4.6302$$
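The two worked values can be reproduced with a short function. Reading the examples, m is the 0-based position of the first mismatch and n is the last position after padding the shorter word (the "x" cells above); that reading is inferred from the numbers on the slide. Returning 0 for identical words and padding with a NUL character are assumptions of this sketch.

```python
def yass_distance(x, y):
    """D(X, Y) from the slide; m = 0-based position of the first mismatch."""
    n = max(len(x), len(y)) - 1  # index of the last position after padding
    # Pad the shorter word with NUL, a character assumed absent from both words,
    # so that padded positions always count as mismatches.
    x, y = x.ljust(n + 1, "\0"), y.ljust(n + 1, "\0")
    m = next((i for i in range(n + 1) if x[i] != y[i]), None)
    if m is None:
        return 0.0  # identical words (assumption: distance 0)
    if m == 0:
        return float("inf")  # no common prefix at all
    return (n - m + 1) / m * sum(1 / 2 ** (i - m) for i in range(m, n + 1))

d_related = yass_distance("astronomically", "astronomer")
d_unrelated = yass_distance("astonish", "astronomer")
```

Note how the measure separates the two pairs much more sharply (1.4766 vs. 4.6302) than edit distance does (6 vs. 5, which even orders them the wrong way round), because a long shared prefix is rewarded.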

Stemming

Clustering:

[Courtesy: http://espin086.files.wordpress.com/2011/02/2-variable-clustering.png]
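To make the clustering step concrete, here is a naive single-link agglomerative sketch over a handful of words, using the prefix-based distance D(X, Y) from the earlier Stemming slide. It is not the YASS implementation: the quadratic search for the closest pair, the stopping threshold of 2.0, and the word list are all assumptions chosen to keep the example small.

```python
def yass_distance(x, y):
    """Prefix-based distance D(X, Y); inf when the words share no prefix."""
    n = max(len(x), len(y)) - 1
    x, y = x.ljust(n + 1, "\0"), y.ljust(n + 1, "\0")
    mismatches = [i for i in range(n + 1) if x[i] != y[i]]
    if not mismatches:
        return 0.0
    m = mismatches[0]
    if m == 0:
        return float("inf")
    return (n - m + 1) / m * sum(1 / 2 ** (i - m) for i in range(m, n + 1))

def single_link_clusters(words, distance, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        # single link: cluster distance = smallest distance between any two members
        d, i, j = min(
            (min(distance(a, b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

words = ["astronomer", "astronomically", "astonish", "astonished"]
clusters = single_link_clusters(words, yass_distance, threshold=2.0)
```

Each resulting cluster plays the role of one stem class. Complete link or average link would only change the inner `min` over member pairs to `max` or a mean.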

Word Relations

Motivation: manual thesauri are
    general purpose (Roget’s Thesaurus, WordNet) – difficult to use for document retrieval
    retrieval-oriented (INSPEC, MeSH) – expensive to build and maintain

Construct an automatic thesaurus (based on information about co-occurrence of words in a collection)

Word Relations

Association: if two terms co-occur within the same paragraph, they constitute an association

$\langle term_1, term_2, \text{assoc. frequency} \rangle$

Gather data about term associations over a large amount of text

Refine associations:
    Discard associations with frequency 1
    Discard terms that are associated with too many other terms (people, state, company, etc.)
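The gathering and refinement steps above might look as follows. This is a sketch under assumptions: paragraphs are taken to be already tokenized, the cut-off `max_partners` for "too many other terms" is a hypothetical parameter, and the partner count is computed after the frequency-1 associations are dropped (the slide does not say in which order the two refinements are applied).

```python
from collections import Counter
from itertools import combinations

def mine_associations(paragraphs, max_partners):
    """paragraphs: list of token lists. Returns {(term1, term2): frequency}."""
    freq = Counter()
    for tokens in paragraphs:
        # each unordered pair of distinct terms counts once per paragraph
        for pair in combinations(sorted(set(tokens)), 2):
            freq[pair] += 1
    # refinement 1: discard associations seen only once
    freq = {pair: f for pair, f in freq.items() if f > 1}
    # refinement 2: discard terms associated with too many other terms
    partners = Counter(term for pair in freq for term in pair)
    too_general = {t for t, count in partners.items() if count > max_partners}
    return {pair: f for pair, f in freq.items() if not too_general & set(pair)}

paragraphs = [
    ["immigration", "amnesty", "people"],
    ["immigration", "amnesty", "people"],
    ["company", "profit", "people"],
    ["company", "profit", "people"],
    ["election", "vote", "people"],
    ["election", "vote", "people"],
]
associations = mine_associations(paragraphs, max_partners=2)
```

In this toy input "people" co-occurs with six other terms and is dropped as too general, leaving only the three topical pairs.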

Word Relations

Each term is represented by a vector of associated terms

$T = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$

⇒ term = pseudo document

Compare query to the term vectors (instead of document vectors)

$Sim(Q, T) = \sum_i wt(q_i) \times wt(t_i)$

Most “similar” terms are added to the query

Example: 1986 US Immigration Law
    similar terms: illegal immigration, amnesty program, simpson-mazzoli
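Query expansion with term vectors can be sketched as below. The association weights, the candidate terms other than those named on the slide, and the cut-off `k` are invented for illustration; only the similarity formula comes from the slide.

```python
def expand_query(query, term_vectors, k):
    """query: {term: weight}; term_vectors: {term: {associated term: weight}}.

    Returns the original query terms followed by the k most similar new terms.
    """
    def sim(vector):
        # Sim(Q, T) = sum_i wt(q_i) * wt(t_i)
        return sum(w * vector.get(t, 0.0) for t, w in query.items())

    candidates = [(sim(vec), term) for term, vec in term_vectors.items()
                  if term not in query]
    candidates = [c for c in candidates if c[0] > 0]
    candidates.sort(key=lambda c: (-c[0], c[1]))  # best first, ties by name
    return list(query) + [term for _, term in candidates[:k]]

# Hypothetical association weights for a few pseudo documents.
term_vectors = {
    "amnesty": {"immigration": 0.9, "law": 0.4},
    "simpson-mazzoli": {"immigration": 0.7, "law": 0.5},
    "visa": {"immigration": 0.5},
    "cricket": {"stadium": 0.8},
}
expanded = expand_query({"immigration": 1.0, "law": 1.0}, term_vectors, k=2)
```

With these made-up weights the query is expanded with "amnesty" and "simpson-mazzoli"; "cricket" has zero similarity and is never considered.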

Word Relations

Experimental results:
    Data: 500,000 documents (news, computer abstracts, govt. documents); 50 queries

Baseline average precision: 37%

Improves by 6–30% when the thesaurus is used

2 weeks to generate association data!

Processing time can be reduced without major loss in performance by using a subset of the document collection

Challenges

Does a document contain an opinion? In which portion?
    sites with a review component: easy (e.g. CNET, Amazon, Epinions)
    blogs: harder

Sentiment classification
    overall (polarity) / specific
    free form / grades or stars
    quotations

Presentation
    highlighting
    aggregation
    community identification
    estimating reliability

Query classification: is the user looking for an opinion?

Opinion Mining

Feature-based opinion summarization
    Identify the features of the product that customers have expressed opinions on (called opinion features)
    For each feature, identify how many customer reviews are positive / negative

Examples:
    The pictures are very clear.
    Overall a fantastic, very compact, camera.
    While light, it will not easily fit in pockets. (HARD!)

Opinion Mining

Feature identification
    1. POS tagging + chunking: identify nouns, verbs, adjectives, simple noun groups, verb groups
    2. Transaction creation for each sentence: item ≡ normalized nouns / noun phrases
    3. Association rule mining: all itemsets with > 1% support are candidate frequent features
    4. Feature pruning:
        keep features that have some compact occurrences
        keep singleton itemsets only if they occur enough times in isolation, e.g. manual vs. manual mode, manual setting
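Step 3 can be illustrated with a brute-force support count. The sketch assumes steps 1 and 2 have already produced one transaction (a list of normalized nouns) per sentence, since POS tagging needs an external tagger. It enumerates itemsets of size 1 and 2 directly rather than running a proper association-rule miner such as Apriori, and the toy call raises the threshold from 1% to 30% because six sentences are far too few for a 1% cut-off to be meaningful.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.01, max_size=2):
    """Return {itemset (sorted tuple): support} for itemsets above min_support."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n > min_support}

# Hypothetical transactions: the normalized nouns found in six review sentences.
transactions = [
    ["picture", "quality"], ["picture", "quality"], ["picture"],
    ["battery", "life"], ["battery", "life"], ["strap"],
]
candidates = frequent_itemsets(transactions, min_support=0.3)
```

The pruning in step 4 (compactness and isolated occurrences) needs the original sentences and word positions, so it is not covered by this sketch.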

Opinion Mining

Sentiment / orientation identification
    1. Examine each sentence in the review database
    2. If it contains a frequent feature, extract all the adjective words as opinion words
    3. For each feature in the sentence, the nearby adjective is recorded as its effective opinion
    4. Look up adjective in a list of adjectives with known orientation, or consult WordNet (discard unknowns)
        adjectives arranged in bipolar structures
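A small sketch of steps 2 to 4 for a single sentence follows. It assumes the sentence arrives already POS-tagged with Penn Treebank tags (so `JJ` marks an adjective), interprets "nearby" as the adjective closest in token distance, and uses a tiny hand-made orientation lexicon in place of a seed list or WordNet lookup; all three are assumptions made for illustration.

```python
# Hypothetical seed lexicon: +1 = positive, -1 = negative.
ORIENTATION = {"clear": 1, "fantastic": 1, "compact": 1, "blurry": -1, "heavy": -1}

def feature_opinions(tagged_sentence, features, lexicon=ORIENTATION):
    """tagged_sentence: list of (word, tag) pairs, tag 'JJ' marking adjectives.

    Returns {feature: (adjective, orientation)} for features found in the sentence.
    """
    adjectives = [(i, w.lower()) for i, (w, tag) in enumerate(tagged_sentence)
                  if tag == "JJ"]
    opinions = {}
    for i, (word, _) in enumerate(tagged_sentence):
        feature = word.lower()
        if feature not in features or not adjectives:
            continue
        # the adjective nearest in token distance is taken as the effective opinion
        _, adjective = min(adjectives, key=lambda a: abs(a[0] - i))
        if adjective in lexicon:  # adjectives of unknown orientation are discarded
            opinions[feature] = (adjective, lexicon[adjective])
    return opinions

sentence = [("The", "DT"), ("pictures", "NNS"), ("are", "VBP"),
            ("very", "RB"), ("clear", "JJ"), (".", ".")]
result = feature_opinions(sentence, features={"pictures"})
```

A sentence like "While light, it will not easily fit in pockets" shows the limits of this approach: the feature (size) is implicit and the negation is carried by the verb phrase, not by an adjective.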

Datasets

Blog06 (25GB): University of Glasgow
    http://ir.dcs.gla.ac.uk/test_collections/access_to_data.htm

Congressional floor-debate transcripts
    http://www.cs.cornell.edu/home/llee/data/convote.html

Cornell movie-review datasets
    http://www.cs.cornell.edu/people/pabo/movie-review-data/

References

Untangling Text Data Mining. M. Hearst. Proceedings of ACL’99.
    www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html

An Introduction to Information Retrieval. Manning, Raghavan, Schütze.
    www-csli.stanford.edu/~schuetze/information-retrieval-book.html

Tutorial on Web Content Mining. Bing Liu. WWW 2005.
    www.cs.uic.edu/~liub

Web Data Mining. Bing Liu. Springer, 2006.

Opinion Mining and Sentiment Analysis. B. Pang and L. Lee. Foundations and Trends in Information Retrieval, 2(1-2), 2008.

Sentiment Analysis and Opinion Mining. Bing Liu. Morgan Claypool, 2012.
    www.morganclaypool.com/doi/abs/10.2200/S00416ED1V01Y201204HLT016?journalCode=hlt