Text Mining with Machine Learning Techniques




Page 1: Text Mining with Machine Learning Techniques

Text Mining with Machine Learning Techniques

Ping-Tsun Chang

Intelligent Systems Laboratory

Computer Science and Information Engineering

National Taiwan University

Page 2: Text Mining with Machine Learning Techniques

Text Analysis

• Language Identification

• Classification

• Clustering

• Summarization

• Feature Selection

Page 3: Text Mining with Machine Learning Techniques

Text Mining

• Text mining is about looking for patterns in natural language text.
– Natural Language Processing

• It may be defined as the process of analyzing text to extract information from it for particular purposes.
– Information Extraction
– Information Retrieval

Page 4: Text Mining with Machine Learning Techniques

Text Mining and Knowledge Management

• A recent study indicated that 80% of a company's information is contained in text documents.
– emails, memos, customer correspondence, and reports

• The ability to distil this untapped source of information provides substantial competitive advantages for a company to succeed in the era of a knowledge-based economy.

Page 5: Text Mining with Machine Learning Techniques

Text Mining Applications

• Customer profile analysis
– mining incoming emails for customers' complaints and feedback.

• Patent analysis
– analyzing patent databases for major technology players, trends, and opportunities.

• Information dissemination
– organizing and summarizing trade news and reports for personalized information services.

• Company resource planning
– mining a company's reports and correspondence for activities, status, and problems reported.

Page 6: Text Mining with Machine Learning Techniques

Text Categorization: Problem Definition

• Text categorization is the problem of automatically assigning predefined categories to free-text documents.
– Document classification
– Web page classification
– News classification

Page 7: Text Mining with Machine Learning Techniques

Information Retrieval

• Full text is hard to process, but it is a complete representation of a document

• Logical view of documents

• Models
– Boolean Model
– Vector Model
– Probabilistic Model

• Think of text as patterns?

Page 8: Text Mining with Machine Learning Techniques

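Evaluation

[Figure: Venn diagram of the Retrieved and Relevant document sets with regions a, b, c, and d; a is the overlap of the two sets.]

$$
\text{Precision} = \frac{a}{a + b}, \qquad \text{Recall} = \frac{a}{a + d}
$$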
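The two ratios on this slide can be computed directly from sets of document ids. A minimal sketch, assuming a is the number of documents that are both retrieved and relevant, b the retrieved-only count, and d the relevant-only count; the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # retrieved and relevant
    b = len(retrieved - relevant)   # retrieved but not relevant
    d = len(relevant - retrieved)   # relevant but not retrieved
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + d) if a + d else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```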

Page 9: Text Mining with Machine Learning Techniques

Pattern Recognition

Sensing → Segmentation → Feature Extraction → Classification → Post-Processing → Decision

Page 10: Text Mining with Machine Learning Techniques

Pattern Classification

[Figure: classes C1 and C2 in a feature space with axes f1 and f2.]

Page 11: Text Mining with Machine Learning Techniques

Machine Learning

• Using computers to help us perform induction from large amounts of complex pattern data

• Bayesian Learning

• Instance-Based Learning
– K-Nearest Neighbors

• Neural Networks

• Support Vector Machine

Page 12: Text Mining with Machine Learning Techniques

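Feature Selection (I)

• Information Gain

$$
\begin{aligned}
IG(C, K_i) &= E(C) - E(C \mid K_i) \\
&= -\sum_{k=1}^{|c|} P(C_k) \log P(C_k) \\
&\quad + P(K_i = 1) \sum_{k=1}^{|c|} P(C_k \mid t = 1) \log P(C_k \mid t = 1) \\
&\quad + P(K_i = 0) \sum_{k=1}^{|c|} P(C_k \mid t = 0) \log P(C_k \mid t = 0)
\end{aligned}
$$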
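Information gain, the class entropy minus the class entropy conditioned on a term being present or absent, can be sketched as follows. This assumes each document is a set of terms with one class label, and uses base-2 logarithms (the slide does not fix the base); the toy corpus is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy -sum p log2 p, skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(docs, term):
    """docs: list of (set_of_terms, class_label) pairs."""
    classes = sorted({label for _, label in docs})

    def class_dist(subset):
        # Empirical P(C_k) over the given subset of documents.
        if not subset:
            return []
        return [sum(1 for _, c in subset if c == k) / len(subset) for k in classes]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    p1 = len(with_term) / len(docs)   # P(K_i = 1)
    p0 = 1.0 - p1                     # P(K_i = 0)
    return (entropy(class_dist(docs))
            - p1 * entropy(class_dist(with_term))
            - p0 * entropy(class_dist(without_term)))


# Hypothetical four-document corpus.
docs = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "party"}, "politics"),
    ({"vote", "team"}, "politics"),
]
print(information_gain(docs, "goal"), information_gain(docs, "team"))
```

Here "goal" separates the two classes perfectly while "team" tells us nothing about the class, so a feature selector that ranks by information gain would keep the former and drop the latter.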

Page 13: Text Mining with Machine Learning Techniques

Feature Selection (II)

• Mutual Information

$$
MI(k_t, C) = \log\frac{1}{P(C)} - \log\frac{1}{P(C \mid k_t)} = \log\frac{P(k_t, C)}{P(k_t)\,P(C)}
$$

• CHI-Square

$$
\chi^2(k_t, C) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}
$$
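Both scores on this slide can be evaluated from a 2x2 term/class contingency table. The sketch below assumes the usual reading of the counts, which the slide does not spell out: A documents in the class that contain the term, B documents outside the class that contain it, C documents in the class without it, D documents with neither, and N = A + B + C + D. Probabilities are estimated by relative frequencies.

```python
import math


def mutual_information(A, B, C, D):
    """log( P(k, C) / (P(k) P(C)) ) estimated from contingency counts."""
    N = A + B + C + D
    if A == 0:
        return float("-inf")   # term never occurs in the class
    return math.log(A * N / ((A + B) * (A + C)))


def chi_square(A, B, C, D):
    """N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D))."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0


# Hypothetical counts: the term is over-represented in the class.
print(mutual_information(30, 10, 20, 40), chi_square(30, 10, 20, 40))
# Term and class independent: both scores are zero.
print(mutual_information(10, 10, 10, 10), chi_square(10, 10, 10, 10))
```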

Page 14: Text Mining with Machine Learning Techniques

Weighting Scheme: TF‧IDF

$$
w(k_i, d) = \frac{\bigl(1 + \log tf(k_i, d)\bigr)\,\log(N / n_i)}{\|d\|}
$$

$$
\|d\| = \sqrt{\sum_{k_i \in d} w(k_i, d)^2}
$$
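A sketch of this weighting scheme, under two assumptions that are not stated on the slide: the document-frequency count for a term is the number of documents containing it, and the norm ||d|| is taken over the weights before normalization. Natural logarithms and the toy token lists are also arbitrary choices.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {k: (1 + math.log(tf[k])) * math.log(N / df[k]) for k in tf}
        # Assumption: ||d|| is the Euclidean norm of the unnormalized weights.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({k: (w / norm if norm else 0.0) for k, w in raw.items()})
    return vectors


# Hypothetical tokenized documents.
vectors = tfidf_vectors([["svm", "text", "text"], ["svm", "kernel"], ["text", "mining"]])
print(vectors[0])
```

A term that occurs in every document gets weight zero, because log(N / n) vanishes when n equals N.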

Page 15: Text Mining with Machine Learning Techniques

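Similarity Evaluation

• Cosine-like scheme

$$
sim(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|\,\|d_j\|} = \frac{\sum_{l=1}^{T} w_{li}\,w_{lj}}{\sqrt{\sum_{l=1}^{T} w_{li}^2}\,\sqrt{\sum_{l=1}^{T} w_{lj}^2}}
$$

[Figure: the document vectors d_i and d_j.]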
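The cosine measure between two weighted document vectors can be sketched as below. Documents are represented as sparse {term: weight} dicts, which is an implementation choice made here for illustration; terms missing from a dict count as weight 0.

```python
import math


def cosine_similarity(di, dj):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * dj.get(term, 0.0) for term, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0   # convention chosen here for empty documents
    return dot / (norm_i * norm_j)


# Hypothetical weights.
print(cosine_similarity({"a": 3.0, "b": 4.0}, {"a": 3.0}))
```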

Page 16: Text Mining with Machine Learning Techniques

Machine Learning Approaches: Bayesian Classifier

$$
\arg\max_{C} \Bigl( P(C) \prod_{i=1}^{n} P(K_i \mid C) \Bigr)
$$

$$
P(K_i \mid C) = \frac{1 + N(K_i \mid C)}{N(C) + T}
$$
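A naive Bayes text classifier with add-one smoothing, as on this slide, can be sketched as follows. The slide does not define N(C) and T; this sketch assumes N(C) is the total number of term occurrences in class C and T is the vocabulary size, and it sums log-probabilities instead of multiplying probabilities to avoid underflow. The training documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (token_list, class_label). Returns the model statistics."""
    class_docs = Counter(label for _, label in docs)   # documents per class
    term_counts = defaultdict(Counter)                 # N(K_i | C)
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    return class_docs, term_counts, vocab, len(docs)


def classify_nb(model, tokens):
    """Return the class maximizing log P(C) + sum_i log P(K_i | C)."""
    class_docs, term_counts, vocab, n_docs = model
    T = len(vocab)                                     # assumed: vocabulary size
    best, best_score = None, float("-inf")
    for c in sorted(class_docs):
        n_c = sum(term_counts[c].values())             # assumed: N(C)
        score = math.log(class_docs[c] / n_docs)       # log P(C)
        for t in tokens:
            if t in vocab:                             # skip terms never seen in training
                score += math.log((1 + term_counts[c][t]) / (n_c + T))
        if score > best_score:
            best, best_score = c, score
    return best


# Hypothetical training set.
model = train_nb([
    (["goal", "match", "team"], "sport"),
    (["goal", "win"], "sport"),
    (["vote", "party"], "politics"),
    (["election", "vote", "win"], "politics"),
])
print(classify_nb(model, ["goal", "team"]), classify_nb(model, ["vote"]))
```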

Page 17: Text Mining with Machine Learning Techniques

Machine Learning Approaches: kNN Classifier

$$
C(d, c_j) = \sum_{d_i \in kNN} sim(d, d_i)\, C(d_i, c_j)
$$

[Figure: an unlabeled document d, marked "?", among labeled training documents.]
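The kNN scoring rule can be sketched as a similarity-weighted vote: each of the k nearest training documents adds its similarity to the score of its own category. The sketch assumes cosine similarity, sparse {term: weight} vectors, and a 0/1 membership indicator C(d_i, c_j); the training vectors are made up.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_scores(d, training, k=3):
    """training: list of (vector, label). Returns {label: summed similarity}."""
    neighbours = sorted(training, key=lambda item: cosine(d, item[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbours:
        scores[label] += cosine(d, vec)   # sim(d, d_i) * C(d_i, c_j), with C in {0, 1}
    return dict(scores)


def knn_classify(d, training, k=3):
    """Pick the category with the highest score (ties broken alphabetically)."""
    scores = knn_scores(d, training, k)
    return max(sorted(scores), key=scores.get)


# Hypothetical labeled documents.
training = [
    ({"goal": 1.0, "team": 1.0}, "sport"),
    ({"goal": 1.0}, "sport"),
    ({"vote": 1.0}, "politics"),
    ({"vote": 1.0, "party": 1.0}, "politics"),
]
print(knn_classify({"goal": 1.0, "vote": 0.2}, training, k=3))
```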

Page 18: Text Mining with Machine Learning Techniques

Machine Learning Approaches: Support Vector Machine

• Basic hypotheses: consistent hypotheses of the version space

• Project the original training data in space X to a higher-dimensional feature space F via a Mercer operator K

$$
f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)
$$
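To make the role of the kernel K concrete, the sketch below uses a homogeneous quadratic kernel on 2-D inputs and checks that it equals an inner product in an explicit feature space, then evaluates a kernel expansion of the form on this slide. The kernel choice, the support points, and the coefficients are all assumptions for illustration; a real SVM would obtain the coefficients from training.

```python
import math


def poly_kernel(u, v):
    """Homogeneous quadratic kernel K(u, v) = (u . v)^2."""
    return sum(a * b for a, b in zip(u, v)) ** 2


def phi(x):
    """Explicit feature map matching poly_kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def decision_function(x, points, alphas, kernel=poly_kernel):
    """f(x) = sum_i alpha_i K(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, points))


u, v = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))
print(poly_kernel(u, v), explicit)   # equal up to floating-point rounding

# Hypothetical expansion with hand-picked coefficients.
print(decision_function((1.0, 0.0), [(1.0, 0.0), (0.0, 1.0)], [0.5, -0.5]))
```

The kernel never builds the three-dimensional vector phi(x); it gets the same inner product from the original two coordinates, which is the point of projecting "via" K.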

Page 19: Text Mining with Machine Learning Techniques

Compare: SVM and Traditional Learners

• Traditional learner

• SVM accesses the hypothesis space!

[Figure: distributions over hypotheses: P(h), P(h|D1), and P(h|D1^D2).]

Page 20: Text Mining with Machine Learning Techniques

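SVM Learning in Feature Spaces

$$
x = (x_1, \ldots, x_n) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_d(x))
$$

Example:

$$
x = (p^1_x, p^1_y, p^1_z, p^2_x, p^2_y, p^2_z, m_1, m_2) \mapsto \phi(x) = \Bigl( \sum_{i \in \{x, y, z\}} (p^1_i - p^2_i)^2,\; m_1,\; m_2 \Bigr)
$$

[Figure: mapping from the input space X to the feature space F.]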

Page 21: Text Mining with Machine Learning Techniques

Support Vector Machine (cont’d)

• Nonlinear
– Example: XOR Problem

• Natural language is nonlinear!

[Figure: the XOR problem on axes f1 and f2, and in a transformed feature space.]
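The XOR example can be sketched in a few lines. The feature map used here, which appends the product f1 * f2, is an assumption for illustration and may differ from the transformation drawn on the slide; the weight vector is picked by hand. The grid search over input-space classifiers is only a small illustration, not a proof of non-separability.

```python
# XOR on inputs coded as -1/+1: the label is +1 when the two features differ.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]


def phi(f1, f2):
    """Assumed feature map: keep f1 and f2, and add the product f1 * f2."""
    return (f1, f2, f1 * f2)


def linear_classifier(z, w=(0.0, 0.0, -1.0)):
    """sign(w . z) in the mapped space; w was chosen by hand for this toy case."""
    score = sum(wi * zi for wi, zi in zip(w, z))
    return 1 if score > 0 else -1


def separable_in_input_space(grid=range(-3, 4)):
    """Try every integer (w1, w2, b) on a small grid in the original space."""
    return any(
        all(lab * (w1 * f1 + w2 * f2 + b) > 0 for (f1, f2), lab in zip(points, labels))
        for w1 in grid for w2 in grid for b in grid
    )


predictions = [linear_classifier(phi(f1, f2)) for f1, f2 in points]
print(predictions == labels)          # True
print(separable_in_input_space())     # False
```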

Page 22: Text Mining with Machine Learning Techniques

Support Vector Machine (cont’d)

• Consistent hypotheses

• Maximum margin

• Support vector

Page 23: Text Mining with Machine Learning Techniques

Statistical Learning Theory

[Figure: a Generator emits x according to P(x); a Supervisor returns y according to P(y|x); the Learner F(x) receives x and outputs y*.]

Page 24: Text Mining with Machine Learning Techniques

Support Vector Machine: Linear Discriminant Functions

• Linear discriminant space

$$
g(y) = a^t y
$$

• Hyperplane

$$
z_k\, g(y_k) \ge 1, \quad k = 1, \ldots, n
$$

[Figure: the (y1, y2) plane with regions labeled g(y) > 1 and g(y) < 1.]

Page 25: Text Mining with Machine Learning Techniques

Learning of Support Vector Machine

• Maximize margin

$$
\frac{z_k\, g(y_k)}{\|a\|} \ge b, \quad k = 1, \ldots, n
$$

• Minimize ||a||

$$
L(a, \alpha) = \frac{1}{2}\|a\|^2 - \sum_{k=1}^{n} \alpha_k \bigl[ z_k a^t y_k - 1 \bigr]
$$

[Figure: the optimal hyperplane.]
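A short worked step connects the two bullets on this slide. With the constraints normalized so that z_k g(y_k) is at least 1, the closest samples lie at distance 1/||a|| from the hyperplane, so minimizing ||a|| maximizes the margin. Setting the gradient of L(a, α) with respect to a to zero, and assuming g(y) = a^t y with no separate bias term, gives the weight vector as a combination of the training samples and a dual objective in the multipliers alone:

```latex
\begin{aligned}
\frac{\partial L}{\partial a} &= a - \sum_{k=1}^{n} \alpha_k z_k y_k = 0
  \;\Longrightarrow\; a = \sum_{k=1}^{n} \alpha_k z_k y_k, \\
L(\alpha) &= \sum_{k=1}^{n} \alpha_k
  - \frac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} \alpha_k \alpha_j z_k z_j \, y_j^t y_k,
  \qquad \alpha_k \ge 0 .
\end{aligned}
```

Only the samples with nonzero α_k contribute to a; these are the support vectors.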

Page 26: Text Mining with Machine Learning Techniques

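Version Space

• Hypothesis Space H

$$
H = \Bigl\{ f \;\Big|\; f(x) = \frac{w \cdot \Phi(x)}{\|w\|},\; w \in W \Bigr\}
$$

• Version Space V

$$
V = \{ f \in H \mid \forall i \in \{1, \ldots, n\}:\; y_i f(x_i) > 0 \}
$$

[Figure: the version space V as a subset of the hypothesis space H.]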
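Membership in the version space can be checked directly: a weight vector w belongs to V when it classifies every labeled instance correctly. The sketch assumes the identity feature map and 2-D inputs, and approximates the parameter space W by 360 unit vectors; all data points are hypothetical.

```python
import math


def f(w, x, phi=lambda x: x):
    """f(x) = (w . phi(x)) / ||w||; phi defaults to the identity (an assumption)."""
    z = phi(x)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return sum(wi * zi for wi, zi in zip(w, z)) / norm


def in_version_space(w, labelled):
    """True if w is consistent with every labelled pair (x_i, y_i)."""
    return all(y * f(w, x) > 0 for x, y in labelled)


# Hypothetical data and a coarse sample of unit-length weight vectors from W.
labelled = [((1.0, 2.0), 1), ((2.0, 0.5), -1)]
candidates = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in range(360)]
version_space = [w for w in candidates if in_version_space(w, labelled)]

# One more labeled instance can only shrink the version space.
extended = labelled + [((0.0, 1.0), 1)]
smaller = [w for w in version_space if in_version_space(w, extended)]
print(len(candidates), len(version_space), len(smaller))
```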

Page 27: Text Mining with Machine Learning Techniques

Support Vector Machine Active Learning

• Why Support Vector Machine?
– Text categorization involves a large amount of data
– Traditional learning causes over-fitting
– Language is complex and nonlinear

• Why Active Learning?
– Labeling instances is time-consuming and costly
– Reduce the need for labeled training instances

Page 28: Text Mining with Machine Learning Techniques

Active Learning: History

Text Classification [Rocchio, 71] [Dumais, 98]

Support Vector Machine [Vapnik, 82]

Text Classification, Support Vector Machine [Joachims, 98] [Dumais, 98]

Pool-Based Active Learning [Lewis, Gale '94] [McCallum, Nigam '98]

The Nature of Statistical Learning Theory [Vapnik, 95]

Automated Text Categorization Using Support Vector Machine [Kwok, 98]

Page 29: Text Mining with Machine Learning Techniques

Active Learning

• Pool-based active learning has a pool U of unlabeled instances

• An active learner l has three components (f, q, X)
– f: classifier x -> {-1, 1}
– q: querying function q(X); given a labeled training set X, it decides which instance in U to query next
– X: labeled training data
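The three components (f, q, X) fit together in a simple loop, sketched below on a one-dimensional toy problem. Everything concrete here is an assumption for illustration and not taken from the slides: the threshold learner, the rule of querying the instance nearest the current decision boundary, and the hidden boundary at 0.37.

```python
def active_learn(pool, oracle, train, query, rounds):
    """Pool-based loop over the three components (f, q, X)."""
    unlabeled = list(pool)        # U: pool of unlabeled instances
    labelled = []                 # X: labeled training data
    f = train(labelled)           # f: current classifier
    for _ in range(rounds):
        if not unlabeled:
            break
        x = query(f, unlabeled)   # q: choose the next instance to label
        unlabeled.remove(x)
        labelled.append((x, oracle(x)))
        f = train(labelled)
    return f, labelled


def train(labelled):
    """Toy learner: a threshold halfway between the closest -1 and +1 examples."""
    lo = max((x for x, y in labelled if y == -1), default=0.0)
    hi = min((x for x, y in labelled if y == 1), default=1.0)
    threshold = (lo + hi) / 2
    return lambda x: x - threshold   # the sign of f(x) is the predicted class


def query(f, unlabeled):
    """Assumed query rule: ask about the instance nearest the decision boundary."""
    return min(unlabeled, key=lambda x: abs(f(x)))


def oracle(x):
    """Hypothetical labeler; the true boundary 0.37 is hidden from the learner."""
    return 1 if x >= 0.37 else -1


pool = [i / 100 for i in range(101)]
f, labelled = active_learn(pool, oracle, train, query, rounds=10)
errors = sum((1 if f(x) > 0 else -1) != oracle(x) for x in pool)
print(len(labelled), errors)
```

On this toy problem the learner ends up classifying the whole pool of 101 instances after asking for only 10 labels, because each query roughly halves the interval of thresholds still consistent with the data.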

Page 30: Text Mining with Machine Learning Techniques

Active Learning (cont’d)

• Main difference from a passive learner: the querying component q.

• How to choose the next unlabeled instance to query?

• Resulting version space

$$
V_i^- = V_i \cap \{ w \in W \mid -(w \cdot \Phi(x_{i+1})) > 0 \}
$$

$$
V_i^+ = V_i \cap \{ w \in W \mid +(w \cdot \Phi(x_{i+1})) > 0 \}
$$

Page 31: Text Mining with Machine Learning Techniques

Active Learner

• Active learner l* always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space

$$
\forall i \in \mathbb{N}: \quad \sup_{P} E_P[Area(V_i^*)] \le \sup_{P} E_P[Area(V_i)]
$$

Page 32: Text Mining with Machine Learning Techniques

Experiments: Bayesian Classifier

Page 33: Text Mining with Machine Learning Techniques

Comparison of Learning Methods

[Figure: precision (y-axis, 0.2 to 1) versus training data size (x-axis, 0 to 60) for SVM, kNN, NB, and NNet.]

Page 34: Text Mining with Machine Learning Techniques

Conclusions

• Text mining extracts knowledge from text.

• Support Vector Machine is almost the best statistics-based machine learning method

• Natural language understanding is still an open problem