UIC - CS 594 Bing Liu 1 Chapter 7: Text mining



Text mining

Text mining refers to data mining that uses text documents as its data.
There are many special techniques for pre-processing text documents to make them suitable for mining.
Most of these techniques come from the field of "Information Retrieval".

Information Retrieval (IR)

Conceptually, information retrieval (IR) is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
Historically, information retrieval is about document retrieval, with the document as the basic unit.
Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
IR has become a central focus in the Web era.

Information Retrieval

[Diagram: the user translates information needs into queries; the system matches the queries against stored information and returns search results; query result evaluation asks whether the information found matches the user's information needs.]

Text Processing

  Word (token) extraction
  Stop words
  Stemming
  Frequency counts

Stop words

Many of the most frequently used words in English are worthless for IR and text mining; these words are called stop words.
  e.g., the, of, and, to, ...
  Typically about 400 to 500 such words.
  For an application, an additional domain-specific stop word list may be constructed.
Why do we need to remove stop words?
  Reduce indexing (or data) file size: stop words account for 20-30% of total word counts.
  Improve efficiency: stop words are not useful for searching or text mining, and they always have a large number of hits.
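The removal step above can be sketched as follows. This is a minimal sketch: the stop word list here is a tiny illustrative sample, not a full 400-500 word list.

```python
# Minimal stop word removal sketch. The stop word list is a small
# illustrative sample, not a full 400-500 word list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the study of finding needed information".split()
print(remove_stop_words(tokens))  # ['study', 'finding', 'needed', 'information']
```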

Stemming

Techniques used to find the root/stem of a word. E.g.,
  user, users, used, using -> stem: use
  engineering, engineered, engineer -> stem: engineer
Usefulness:
  improving the effectiveness of IR and text mining by matching similar words.
  reducing indexing size: combining words with the same root may reduce indexing size by as much as 40-50%.

Basic stemming methods

Remove endings:
  if a word ends with a consonant other than s, followed by an s, then delete the s.
  if a word ends in es, drop the s.
  if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of th.
  if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  ...
Transform words:
  if a word ends with "ies" but not "eies" or "aies", then replace "ies" with "y".
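The rules above can be sketched directly as code. This is only an illustration of these particular rules, not a full stemmer (real systems typically use the Porter stemmer, which has many more rules), so some outputs remain crude.

```python
# Sketch of the rule-based stemming rules listed above. A real stemmer
# (e.g., the Porter stemmer) has many more rules and conditions.
def stem(word):
    # Transform rule: "ies" -> "y", unless the word ends "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Remove "ing" unless only one letter or "th" would remain.
    if word.endswith("ing"):
        rest = word[:-3]
        if len(rest) > 1 and rest != "th":
            return rest
    # Remove "ed" if preceded by a consonant, unless a single letter remains.
    if word.endswith("ed") and len(word) > 3 and word[-3] not in "aeiou":
        return word[:-2]
    # Ends in "es": drop the "s".
    if word.endswith("es"):
        return word[:-1]
    # Ends with a consonant (other than s) followed by "s": delete the "s".
    if word.endswith("s") and len(word) > 1 and word[-2] not in "aeious":
        return word[:-1]
    return word

for w in ["users", "engineered", "engineering", "flies"]:
    print(w, "->", stem(w))
```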

Frequency counts

Count the number of times a word occurs in a document.
Count the number of documents in a collection that contain a word.
Use occurrence frequencies to indicate the relative importance of a word in a document:
  if a word appears often in a document, the document likely "deals with" subjects related to the word.
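The two counts above can be computed with a few lines (a sketch over a toy two-document collection):

```python
from collections import Counter

# Term frequency within each document, and document frequency across the
# collection (how many documents contain each word).
docs = [
    ["text", "mining", "uses", "text", "documents"],
    ["information", "retrieval", "finds", "documents"],
]

term_freqs = [Counter(d) for d in docs]              # per-document word counts
doc_freq = Counter(w for d in docs for w in set(d))  # documents containing w

print(term_freqs[0]["text"])   # 2
print(doc_freq["documents"])   # 2
```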

Vector Space Representation

A document is represented as a vector: (W1, W2, ..., Wn).
Binary:
  Wi = 1 if the corresponding term i (often a word) is in the document.
  Wi = 0 if the term i is not in the document.
TF (Term Frequency):
  Wi = tfi, where tfi is the number of times term i occurred in the document.
TF*IDF (Term Frequency * Inverse Document Frequency):
  Wi = tfi * idfi = tfi * log(N/dfi), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
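The TF*IDF weighting just defined can be sketched as follows (a minimal version; production vectorizers add normalization and smoothing options):

```python
import math
from collections import Counter

# TF*IDF weighting as defined above: W_i = tf_i * log(N / df_i).
def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequencies
    vocab = sorted(df)
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

docs = [["text", "mining", "text"], ["information", "retrieval"], ["text", "retrieval"]]
vocab, vecs = tfidf_vectors(docs)
```

Note that a term appearing in every document gets weight log(N/N) = 0, which matches the intuition that such a term does not discriminate between documents.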

Vector Space and Document Similarity

Each indexing term is a dimension. An indexing term is normally a word.
Each document is a vector:
  Di = (ti1, ti2, ti3, ..., tin)
  Dj = (tj1, tj2, tj3, ..., tjn)
Document similarity is defined as the cosine of the angle between the vectors:

  Similarity(Di, Dj) = (Σk tik * tjk) / sqrt((Σk tik^2) * (Σk tjk^2)), with each sum over k = 1 to n.
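The cosine similarity formula translates directly to code (a self-contained sketch):

```python
import math

# Cosine similarity between two document vectors:
# dot product divided by the product of the vector lengths.
def cosine(di, dj):
    dot = sum(a * b for a, b in zip(di, dj))
    norm = math.sqrt(sum(a * a for a in di)) * math.sqrt(sum(b * b for b in dj))
    return dot / norm if norm else 0.0

print(cosine([1, 1, 0], [1, 0, 0]))  # ~0.707, i.e., 1/sqrt(2)
```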

Query formats

A query is a representation of the user's information needs, normally a list of words.
Query as a simple question in natural language:
  the system translates the question into executable queries.
Query as a document:
  "Find similar documents like this one"; the system defines what the similarity is.

An Example

A document space is defined by three terms: hardware, software, users.
A set of documents is defined as:
  A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
  A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
  A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
If the query is "hardware and software", what documents should be retrieved?

An Example (cont.)

In Boolean query matching:
  "AND": documents A4 and A7 will be retrieved.
  "OR": A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved.
In similarity matching (cosine), with q=(1, 1, 0):
  S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
  S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
  S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
Retrieved document set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}

Relevance judgment for IR

A measurement of the outcome of a search or retrieval.
The judgment on what should or should not be retrieved.
There is no simple answer to what is relevant and what is not: we need human users.
  difficult to define
  subjective, depending on knowledge, needs, time, etc.
Relevance is the central concept of information retrieval.

Precision and Recall

Given a query:
  Are all retrieved documents relevant?
  Have all the relevant documents been retrieved?
Measures for system performance:
  The first question is about the precision of the search.
  The second is about the completeness (recall) of the search.

Precision and Recall (cont.)

                 Relevant   Not relevant
  Retrieved         a            b
  Not retrieved     c            d

  P = a / (a + b)
  R = a / (a + c)

Precision and Recall (cont.)

Precision measures how precise a search is: the higher the precision, the fewer unwanted documents.
Recall measures how complete a search is: the higher the recall, the fewer missing documents.

  Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

  Recall = (number of relevant documents retrieved) / (number of all the relevant documents in the database)
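The two definitions above, computed over sets of document IDs (the IDs here are illustrative placeholders):

```python
# Precision and recall from the retrieved set and the relevant set,
# following the definitions above.
def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant overall, 2 of them retrieved.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r)  # 0.5 and ~0.667
```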

Relationship of R and P

Theoretically, R and P do not depend on each other.
Practically,
  high recall is achieved at the expense of precision;
  high precision is achieved at the expense of recall.
When will P = 0? Only when none of the retrieved documents is relevant.
When will P = 1? Only when every retrieved document is relevant.
Depending on the application, you may want a higher precision or a higher recall.

P-R diagram

[Figure: precision-recall curves for System A, System B, and System C, with R on the horizontal axis and P on the vertical axis, each scaled 0.1 to 1.0.]

Alternative measures

Combining recall and precision, the F score:

  F = 2PR / (P + R)

Breakeven point: when P = R.
These two measures are commonly used in text mining: classification and clustering.
Accuracy is not normally used in the text domain because the set of relevant documents is almost always very small compared to the set of irrelevant documents.
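The F score is the harmonic mean of precision and recall, so it is high only when both are high:

```python
# F score as defined above: the harmonic mean of precision P and recall R.
def f_score(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(f_score(0.5, 2 / 3))  # ~0.571; pulled toward the lower of the two
```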

Web Search as a huge IR system

A Web crawler (robot) crawls the Web to collect all the pages.
Servers build a huge inverted index database and other indexing databases.
At query (search) time, search engines conduct different types of vector query matching.

Different search engines

The real differences among search engines are:
  their indexing weight schemes
  their query processing methods
  their ranking algorithms
None of these are published by any of the search engine firms.


Vector Space Based Document Classification

Vector Space Representation

Each doc j is a vector, with one component for each term (= word).
We have a vector space:
  terms are attributes;
  n docs live in this space;
  even with stop word removal and stemming, we may have 10,000+ dimensions, or even 1,000,000+.

Classification in Vector Space

Each training doc is a point (vector) labeled by its topic (= class).
Hypothesis: docs of the same topic form a contiguous region of space.
Define surfaces to delineate topics in space.

[Figure: regions labeled Government, Science, and Arts in the vector space.]

Test doc = Government

[Figure: a test document falling inside the Government region of the space, among the Government, Science, and Arts regions.]

Rocchio Classification Method

Given the training documents, compute a prototype vector for each class.
Given a test doc, assign it to the topic whose prototype (centroid) is nearest using cosine similarity.

Rocchio Classification

Construct the document vectors into a prototype vector cj for each class Cj:

  cj = (α/|Cj|) Σ_{d in Cj} d/||d||  -  (β/|D - Cj|) Σ_{d in D - Cj} d/||d||

α and β are parameters that adjust the relative impact of relevant and irrelevant training examples. Normally, α = 16 and β = 4.
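The prototype construction can be sketched as follows, with α = 16 and β = 4 as on the slide. The positive examples pull the prototype toward the class; the negative examples push it away. The tiny 2-D vectors are illustrative only.

```python
import math

# Sketch of Rocchio prototype construction:
# c_j = (alpha/|Cj|) * sum of normalized positive docs
#     - (beta/|D-Cj|) * sum of normalized negative docs
def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def rocchio_prototype(pos, neg, alpha=16.0, beta=4.0):
    dim = len(pos[0])
    proto = [0.0] * dim
    for d in pos:
        for i, x in enumerate(normalize(d)):
            proto[i] += alpha / len(pos) * x
    for d in neg:
        for i, x in enumerate(normalize(d)):
            proto[i] -= beta / len(neg) * x
    return proto

proto = rocchio_prototype(pos=[[1, 0], [1, 1]], neg=[[0, 1]])
```

At classification time, a test document would be compared to each class prototype by cosine similarity and assigned to the nearest one.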

Naïve Bayesian Classifier

Given a set of training documents D, each document is considered an ordered list of words.
w_{di,k} denotes the word in position k of document di, where each word is from the vocabulary V = <w1, w2, ..., w|V|>.
Let C = {c1, c2, ..., c|C|} be the set of pre-defined classes.
There are two naïve Bayesian models:
  one based on the multi-variate Bernoulli model (a word occurs or does not occur in a document);
  one based on the multinomial model (the number of word occurrences is considered).

Naïve Bayesian Classifier (multinomial model)

  P(cj) = (Σ_{i=1}^{|D|} P(cj|di)) / |D|                                                          (1)

  P(wt|cj) = (1 + Σ_{i=1}^{|D|} N(wt, di) P(cj|di)) / (|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} N(ws, di) P(cj|di))   (2)

  P(cj|di) = (P(cj) Π_{k=1}^{|di|} P(w_{di,k}|cj)) / (Σ_{r=1}^{|C|} P(cr) Π_{k=1}^{|di|} P(w_{di,k}|cr))       (3)

N(wt, di) is the number of times the word wt occurs in document di. For labeled training documents, P(cj|di) is in {0, 1}.
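A sketch of equations (1)-(3) with hard labels (P(c|d) in {0, 1}), so the priors reduce to class proportions and the word counts sum over documents of each class. The two-document training set is illustrative only, and classification is done in log space for numerical stability (the normalizer in equation (3) cancels when taking the argmax):

```python
import math
from collections import Counter

# Multinomial naive Bayes sketch with hard training labels.
def train(docs, labels):
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    # Eq. (1) with P(c|d) in {0,1}: class proportion.
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        counts = Counter(w for d, l in zip(docs, labels) if l == c for w in d)
        total = sum(counts.values())
        # Eq. (2): Laplace-smoothed word probabilities.
        cond[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return prior, cond

def classify(doc, prior, cond):
    # Log-space version of eq. (3); the denominator cancels for argmax.
    def score(c):
        return math.log(prior[c]) + sum(math.log(cond[c][w])
                                        for w in doc if w in cond[c])
    return max(prior, key=score)

prior, cond = train([["ball", "game"], ["vote", "law"]], ["sports", "politics"])
print(classify(["ball"], prior, cond))  # sports
```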

k Nearest Neighbor Classification

To classify document d into class c:
  Define the k-neighborhood N as the k nearest neighbors of d.
  Count the number of documents n in N that belong to c.
  Estimate P(c|d) as n/k.
No training is needed (?). Classification time is linear in the training set size.
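The procedure above can be sketched over document vectors, using cosine similarity to define "nearest". The training vectors and labels are illustrative placeholders:

```python
import math
from collections import Counter

# kNN over document vectors: find the k most similar training documents
# and take the majority class among them.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(d, training, k=3):
    """training: list of (vector, label) pairs."""
    neighbors = sorted(training, key=lambda t: cosine(d, t[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [([1, 0, 0], "gov"), ([1, 1, 0], "gov"), ([0, 0, 1], "arts"),
            ([0, 1, 1], "arts"), ([1, 0, 1], "gov")]
print(knn_classify([1, 1, 0], training, k=3))  # gov
```

Note that although no training phase is needed, every classification scans the whole training set, which is why classification time is linear in the training set size.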

Example

[Figure: training documents plotted in the space, grouped into Government, Science, and Arts regions.]

UIC - CS 594 Bing Liu 34

Example: k=6 (6NN)

Government

Science

Arts

P(science| )?

Linear classifiers: Binary Classification

Consider 2-class problems.
Assume linear separability for now:
  in 2 dimensions, we can separate the classes by a line;
  in higher dimensions, we need hyperplanes.
We can find a separating hyperplane by linear programming (or, e.g., the perceptron); in 2 dimensions the separator can be expressed as ax + by = c.

Linear programming / Perceptron

[Figure: two classes of points, red and green, in the plane.]
Find a, b, c such that
  ax + by ≥ c for red points,
  ax + by < c for green points.
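A minimal perceptron sketch for the separable 2-D case just described: it learns (a, b, c) such that a*x + b*y >= c on one class and < c on the other, updating only on mistakes. The four training points are illustrative.

```python
# Perceptron sketch for 2-D points with labels +1 / -1.
def train_perceptron(points, labels, epochs=100, lr=0.1):
    """Returns (a, b, c) such that a*x + b*y - c separates the classes
    (assuming the data are linearly separable)."""
    a = b = c = 0.0
    for _ in range(epochs):
        for (x, y), t in zip(points, labels):
            pred = 1 if a * x + b * y - c >= 0 else -1
            if pred != t:  # update weights only on a misclassification
                a += lr * t * x
                b += lr * t * y
                c -= lr * t
    return a, b, c

pts = [(2, 2), (3, 3), (0, 0), (-1, 0)]
lbl = [1, 1, -1, -1]
a, b, c = train_perceptron(pts, lbl)
```

The perceptron convergence theorem guarantees this loop stops making mistakes on linearly separable data, but it returns an arbitrary separating line, not the best one; that gap motivates the SVM discussion that follows.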

Linear Classifiers (cont.)

Many common text classifiers are linear classifiers.
Despite this similarity, there are large performance differences among them.
For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
What to do for non-separable problems?

Which hyperplane?

In general, there are lots of possible solutions for a, b, c.
The Support Vector Machine (SVM) finds an optimal solution.

Support Vector Machine (SVM)

[Figure: a separating hyperplane with the support vectors on the margin boundaries; maximize the margin.]

SVMs maximize the margin around the separating hyperplane.
The decision function is fully specified by a subset of the training samples, the support vectors.
Finding the hyperplane is a quadratic programming problem.
SVM: very good for text classification.

Optimal hyperplane

Let the training examples be (xi, yi), i = 1, 2, ..., n, where xi is an n-dimensional vector and yi is its class, -1 or 1.
The class represented by the subset with yi = -1 and the class represented by the subset with yi = +1 are linearly separable if there exists (w, b) such that
  wTxi + b ≥ 0 for yi = +1
  wTxi + b < 0 for yi = -1
The margin of separation m is the separation between the hyperplane wTx + b = 0 and the closest data points (the support vectors).
The goal of an SVM is to find the optimal hyperplane with the maximum margin of separation.

A Geometrical Interpretation

The decision boundary should be as far away from the data of both classes as possible: we maximize the margin m.

[Figure: Class 1 and Class 2 separated by a hyperplane, with the margin m between the two classes.]

SVM formulation: separable case

Thus, support vector machines (SVMs) are linear functions of the form f(x) = wTx + b, where w is the weight vector and x is the input vector.
To find the linear function:
  Minimize:   (1/2) wTw
  Subject to: yi(wTxi + b) ≥ 1,  i = 1, 2, ..., n
This is a quadratic programming problem.

Non-separable case: Soft margin SVM

To deal with cases where there may be no separating hyperplane, due to noisy labels among both the positive and negative training examples, the soft margin SVM is proposed:
  Minimize:   (1/2) wTw + C Σ_{i=1}^{n} ξi
  Subject to: yi(wTxi + b) ≥ 1 - ξi,  i = 1, 2, ..., n
              ξi ≥ 0,  i = 1, ..., n
where C ≥ 0 is a parameter that controls the amount of training error allowed.

Illustration: Non-separable case

Support vectors:
  1. margin support vectors: ξi = 0, correctly classified.
  2. non-margin support vectors: ξi < 1, correctly classified (inside the margin).
  3. non-margin support vectors: ξi > 1, misclassified (error).

[Figure: points labeled 1, 2, and 3 at these positions relative to the margin.]

Extension to Non-linear Decision Surface

In general, complex real-world applications may not be expressible with linear functions.
Key idea: transform xi into a higher dimensional space to "make life easier".
  Input space: the space the xi are in.
  Feature space: the space of φ(xi) after the transformation.

[Figure: points mapped by φ(.) from the input space to the feature space, where they become linearly separable.]

Kernel Trick

The mapping function φ(.) is used to project data into a higher dimensional feature space:
  x = (x1, ..., xn)  →  φ(x) = (φ1(x), ..., φN(x))
In a higher dimensional space, the data are more likely to be linearly separable.
In an SVM, the projection can be done implicitly rather than explicitly, because the optimization does not actually need the explicit projection.
It only needs a way to compute inner products between pairs of training examples (e.g., x, z).
  Kernel: K(x, z) = <φ(x), φ(z)>
If you know how to compute K, you do not need to know φ.
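The trick can be illustrated with a concrete check. For the quadratic kernel K(x, z) = (x · z)^2 in two dimensions, the corresponding explicit feature map is φ(x) = (x1^2, sqrt(2) x1 x2, x2^2), and K(x, z) equals the inner product of φ(x) and φ(z), computed without ever forming φ:

```python
import math

# Quadratic kernel in 2-D: K(x, z) = (x . z)^2.
def kernel(x, z):
    return sum(a * b for a, b in zip(x, z)) ** 2

# The explicit feature map that this kernel corresponds to.
def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = kernel(x, z)                                   # implicit computation
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))     # explicit computation
print(lhs, rhs)  # both 1.0
```

The kernel evaluates in the 2-D input space, yet gives exactly the inner product in the 3-D feature space, which is why the SVM never needs φ explicitly.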

Comments on SVM

SVMs are seen as the best-performing method by many, though the statistical significance of most results is not clear.
Kernels are an elegant and efficient way to map data into a better representation.
SVMs can be expensive to train (quadratic programming).
For text classification, a linear kernel is common and often sufficient.

Document clustering

We can still use the normal clustering techniques, e.g., partitioning and hierarchical methods.
Documents can be represented using the vector space model.
For the distance function, the cosine similarity measure is commonly used.

Summary

Text mining applies and adapts data mining techniques to the text domain.
A significant amount of pre-processing is needed before mining, using information retrieval techniques.