anlyse_mine_unstructereddata@socialmedia

Upload: venkata-nelluri-pmp

Post on 26-Feb-2018


TRANSCRIPT

  • 7/25/2019 Anlyse_Mine_UnstructeredData@SocialMedia (80 slides). All Rights Reserved, Indian

    Decision Trees

    Introduction to Decision Trees.

    Decision Tree approach for classification.

    Chi-Square Automatic Interaction Detection (CHAID)

    Classification and Regression Tree (CART)

    CHAID (Chi-square Automatic Interaction Detection)

    Introduction to CHAID

    CHAID is a decision tree algorithm used in classification problems.

    Initial models of CHAID used the chi-square test of independence for
    splitting.

    CHAID was first presented in an article titled "An Exploratory Technique
    for Investigating Large Quantities of Categorical Data" by G. V. Kass,
    published in Applied Statistics (1980).

    CHAID

    CHAID partitions the data into mutually exclusive, exhaustive segments that
    best describe the dependent categorical variable.

    CHAID is an iterative procedure that examines the predictors (the
    classification variables) and uses them in the order of their statistical
    significance.

    CHAID Splitting Rule

    The splitting in CHAID is based on the type of the dependent variable.

    For a continuous dependent variable, the F test is used.

    For a categorical dependent variable, the chi-square test of independence
    is used.

    CHAID Example

    German Credit Rating Data Set.

    Predictor Variable: Checking Account Balance.

    Chi-Square Test of Independence

    The chi-square test of independence starts with the assumption that there
    is no relationship between two variables.

    For example, we assume that there is no relationship between checking
    account balance and default.

    Chi-Square Test of Independence

    German Credit Case

    H0: There is no relationship between checking account balance and default.

    HA: There is a relationship between checking account balance and default.

    Contingency Table

    Checking account balance   Default = 1   Default = 0   Total
    0 DM                               135           139     274
    Other than 0 DM                    165           561     726
    Total                              300           700    1000

    Chi-Square Test of Independence in German Credit Data

    H0: Checking account balance and default are independent.

    HA: Checking account balance and default are dependent.

    Chi-Square Test of Independence: Test Statistic

    Chi-square statistic = sum over all cells of (O - E)^2 / E

    Expected frequency E = (row sum x column sum) / total

    where O is the observed frequency and E is the expected frequency of a cell.

    Chi-Square Test of Independence

    Cell       Observed frequency   Expected frequency   (O - E)^2 / E
    0DM, 1                    135                 82.2            33.9
    0DM, 0                    139                191.8            14.5
    N0DM, 1                   165                217.8            12.8
    N0DM, 0                   561                508.2             5.5
    Total                    1000               1000.0            66.7

    P-value = 3.10 x 10^-16

    Since the p-value is less than 0.05, we reject the null hypothesis.
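The test above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; for one degree of freedom the chi-square survival function reduces to erfc(sqrt(x/2)), so no statistics package is needed:

```python
import math

def chi_square_independence(observed):
    """Chi-square test of independence for a 2x2 contingency table.

    observed: list of rows, e.g. [[135, 139], [165, 561]].
    Returns (chi-square statistic, p-value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand  # expected frequency
            chi2 += (o - e) ** 2 / e
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# German credit contingency table: rows = {0 DM, other}, cols = {default, no default}
chi2, p = chi_square_independence([[135, 139], [165, 561]])
print(round(chi2, 1), p)
```

This reproduces the chi-square statistic of 66.7 and a p-value of about 3.1 x 10^-16 shown on the slide.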

    CHAID Procedure

    Step 1: Examine each predictor variable for its statistical significance
    with the dependent variable, using the F test (for a continuous dependent
    variable) or the chi-square test (for a categorical dependent variable).

    Step 2: Determine the most significant among the predictors (the predictor
    with the smallest p-value after Bonferroni correction).

    Step 3: Divide the data by the levels of the most significant predictor.
    Each of the resulting sub-groups is examined individually further.

    Step 4: For each sub-group, determine the most significant variable from
    the remaining predictors and divide the data again.

    Step 5: Repeat step 4 till a stopping criterion is reached.
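Steps 1 and 2 can be sketched in Python. This is an illustrative fragment, not a full CHAID implementation: it assumes each categorical predictor has been cross-tabulated against default into a 2x2 table, and the predictor names and second table are hypothetical:

```python
import math

def chi2_p_value(table):
    """p-value of the chi-square test of independence for a 2x2 table (1 dof)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    return math.erfc(math.sqrt(stat / 2))

def most_significant_predictor(tables, alpha=0.05):
    """Steps 1-2 of CHAID: test every predictor, apply a Bonferroni
    correction for the number of tests, and return the predictor with the
    smallest adjusted p-value (or None if nothing is significant)."""
    m = len(tables)  # number of predictors tested
    adjusted = {name: min(1.0, chi2_p_value(t) * m) for name, t in tables.items()}
    best = min(adjusted, key=adjusted.get)
    return best if adjusted[best] < alpha else None

# Hypothetical (predictor level x default) tables for two candidate predictors
tables = {
    "checking_balance": [[135, 139], [165, 561]],
    "other_predictor":  [[80, 180], [220, 520]],
}
print(most_significant_predictor(tables))
```

With these numbers, checking account balance is by far the most significant predictor, so it would be used for the first split.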

    Merging in CHAID

    CHAID uses both splitting and merging steps.

    In merging, the least significantly different groups are merged together.

    CHAID Stopping Criteria

    Maximum tree depth is reached (which is pre-defined).

    Minimum number of cases required to be a parent node is reached.

    Minimum number of cases required to be a child node is reached.


    CHAID Input

    Significance level for partitioning a variable.

    Significance level for merging.

    Minimum number of records for the cells.


    CHAID Example: German Credit Data

    First Split: 0DM

    Observed

    Predictor     Y = 1   Y = 0   Total
    0DM = 1          99     110     209
    0DM = 0         140     451     591
    Total           239     561     800

    Expected

    Predictor       Y = 1    Y = 0   Total
    0DM = 1       62.4388   146.56     209
    0DM = 0      176.5613   414.44     591
    Total             239      561     800

    CHAID Tree

    P-value = 1.28 x 10^-10

    First Split: 0DM

    CHAID Tree

    Business Rules

    Business rules are always derived for leaf nodes (nodes without any
    branch).

    Node 1: If 0 DM = 0, then classify the observation as Y = 0 (good credit),
    which will be accurate 76.3% of the time (451 of 591 cases).

    Node 7: 90% accuracy

    Node 5: 87.8% accuracy

    Business Rule

    Node 7: If the checking account balance is not less than 200 DM and there
    are no other installments, then classify the observation as good credit.

    Tree Pruning

    Tree pruning is a process of reducing the size of the tree using different
    strategies.

    Pruning is important since a large tree can result in overfitting.

    Pruning can be based either on the size of the tree or on purity.


    Classification and Regression Tree (CART)

    Classification and Regression Trees (CART)

    Splits are chosen based either on the Gini Index or on the twoing
    criterion.

    CART is a binary tree, whereas CHAID can split a node into more than 2
    branches.

    Gini Index (Classification Impurity Measure)

    The Gini Index is used to measure the impurity at a node (in a
    classification problem) and is given by:

    Gini(k) = sum over j = 1..J of P(j|k) (1 - P(j|k))

    where P(j|k) is the proportion of category j in node k.

    A smaller Gini Index implies less impurity.

    Entropy (Impurity Measure)

    Entropy is another impurity measure that is frequently used. Entropy at
    node k is given by:

    Entropy(k) = - sum over j = 1..c of P(j|k) log2 P(j|k)

    Gini Index Calculation

    Number of classes: 2 (say 0 and 1).

    Consider a node k with 10 1s and 90 0s.

    Gini(k) = sum over j = 1..2 of P(j|k) (1 - P(j|k)) = 2 x 0.1 x 0.9 = 0.18

    A smaller number implies less impurity!

    Entropy Calculation

    Number of classes: 2 (say 0 and 1).

    Consider a node k with 10 1s and 90 0s.

    Entropy(k) = -0.1 log2(0.1) - 0.9 log2(0.9) = 0.4689

    Higher than the Gini coefficient.
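Both impurity measures are easy to verify in Python. A minimal sketch that reproduces the two slide calculations from the raw class counts:

```python
import math

def gini(counts):
    """Gini impurity at a node: sum over classes of p * (1 - p)."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def entropy(counts):
    """Entropy at a node: -sum over classes of p * log2(p)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Node k with 10 ones and 90 zeros, as on the slides
print(round(gini([10, 90]), 2))     # 2 x 0.1 x 0.9 = 0.18
print(round(entropy([10, 90]), 4))  # 0.4690, higher than the Gini value
```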

    Classification Tree Logic

    Node t is split into a left node tL and a right node tR.

    PL = proportion of observations in the left node
    PR = proportion of observations in the right node
    i(.) = impurity at node (.)

    The best split maximizes the reduction in impurity:

    Max [ i(t) - PL i(tL) - PR i(tR) ]

    CART Example: German Credit Data

    First Split: 0DM

    CART Tree

    Impurity at Node 0 = 2 x 0.299 x 0.701 = 0.4191
    Impurity at Node 1 = 2 x 0.237 x 0.763 x 0.73875 = 0.2672
    Impurity at Node 2 = 2 x 0.474 x 0.526 x 0.26125 = 0.1303
    Change in impurity = 0.4191 - 0.2672 - 0.1303 = 0.0217

    First Split: Credit Amount

    CART Tree

    Impurity at Node 0 = 2 x 0.299 x 0.701 = 0.4191
    Impurity at Node 1 = 2 x 0.271 x 0.729 x 0.8775 = 0.3467
    Impurity at Node 2 = 2 x 0.5 x 0.5 x 0.1225 = 0.0613
    Change in impurity = 0.4191 - 0.3467 - 0.0613 = 0.0112

    The 0DM split yields the larger reduction in impurity, so it is chosen
    first.
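The comparison of the two candidate splits can be sketched as a short computation, using the proportions from the slides (2p(1-p) is the binary Gini impurity, and the child impurities are weighted by the fraction of observations each child receives):

```python
def gini2(p):
    """Gini impurity for a binary node with positive-class proportion p."""
    return 2 * p * (1 - p)

def impurity_reduction(p_parent, p_left, w_left, p_right, w_right):
    """Reduction in impurity for a candidate binary split.

    p_* are positive-class proportions; w_left and w_right are the fractions
    of observations sent to each child node.
    """
    return gini2(p_parent) - w_left * gini2(p_left) - w_right * gini2(p_right)

# Values from the slides (German credit data, root node p = 0.299)
split_0dm    = impurity_reduction(0.299, 0.237, 0.73875, 0.474, 0.26125)
split_amount = impurity_reduction(0.299, 0.271, 0.8775, 0.5, 0.1225)
print(round(split_0dm, 4), round(split_amount, 4))  # 0DM gives the larger reduction
```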

    Node 5: If 0DM = 1 and d..., then classify the observation as ... (87.8%
    accuracy).

    Ensemble Methods

    Ensemble methods use majority voting: several classifiers are developed
    using various techniques such as logistic regression analysis, CHAID,
    CART, etc.

    For each observation, the classification is done based on the majority
    vote.

    Ensemble methods in general improve the accuracy of prediction.

    Random Forest

    Random forest is an ensemble method in which several trees are grown (thus
    the name forest) by sampling the predictor variables.

    The sampling can also be done on the training data set.

    Analysis of unstructured data: Text Mining / Sentiment Analysis / Social
    Media Analytics

    Why Text Mining and Sentiment Analysis?

    Science of Social Influence:

    70% of the sales decisions were made before engaging a salesperson (HP).

    The sales force is becoming irrelevant since customers are engaged through
    social media.

    Importance of Social Listening

    A study by McKinsey revealed that electronic word-of-mouth generates twice
    the sales as compared to paid advertising.

    Data Sources for Unstructured Data and Social Media Analytics

    Facebook

    Blogs and Microsites

    YouTube, Twitter, Instagram

    Text Classification and Sentiment Analysis

    Text Classification Approaches:

    Rule based classification

    Naïve Bayes Algorithm

    Classification Methods: Rules

    Rules are based on combinations of words present in the document.

    Spam e-mail classification: (Hello Dear) OR (You Won a Lottery)

    Accuracy can be low.

    Naïve Bayes Algorithm

    Naïve Bayes Algorithm

    The Naïve Bayes algorithm (or Naïve Bayes classifier) is a classification
    technique that uses Bayes' theorem.

    It is one of the popular tools for sentiment analysis.

    Comments on the Bollywood Movie Chennai Express

    Some films are hard to make sense of. Others are nonsense. Chennai
    Express, directed by Rohit Shetty, ticks both boxes.

    The film is filled with gigantic men whose size functions as a punch line.
    Yes, some of it is funny. The locations are beautiful.

    Classification Methods: Naïve Bayes Algorithm

    Input: a document set D = {d1, d2, ...}

    a fixed set of classes C = {c1, c2, ..., cn}

    a training set of k documents (d1, c1), ..., (dk, ck)

    Output: a learned classifier Y: dnew -> c

    C1 = Positive, C2 = Negative, C3 = Neutral

    Naïve Bayes Classification for Text Mining

    Documents are treated as a bag of words.

    Consider a document D, whose class is given by C. We classify D into the
    class which has the highest posterior probability P(C|D).

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Naïve Bayes Classification for Text Mining

    A document is represented as a bag of words.

    We create a vocabulary V containing N word types (features).

    Documents are represented using feature vectors.

    Vocabulary

    V = {Awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    N = 10 (number of words in the vocabulary)

    Document Model

    Bernoulli document model: A document is represented by a feature vector
    with binary elements taking value 1 if the corresponding word occurs in
    the document and 0 otherwise.

    Multinomial document model: A document is represented by a feature vector
    with integer elements whose value is the frequency of the word in the
    document.

    Bernoulli Document Model

    V = {awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    "Some films are hard to make sense of. Others are just nonsense. Chennai
    Express, directed by Rohit Shetty, ticks both boxes. More than a quarter
    of the film is in Tamil, and hence incomprehensible if you're unfamiliar
    with the language. The rest is a stew of puerile humor, lazy stereotypes,
    and overacting from a cast that appears to be trying too hard."

    D = {0, 1, 0, 1, 0, 1, 0, 0, 1, 0}
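The conversion of a review into its binary feature vector can be sketched as follows. The review text is abridged from the slide, and plain substring matching is used here so that, for example, "stereotypes" matches "stereotype"; a real system would tokenize and stem first:

```python
# Vocabulary from the slide (10 word types)
VOCAB = ["awesome", "humor", "action", "nonsense", "romance",
         "stereotype", "mindless", "boring", "overacting", "trash"]

def bernoulli_vector(text, vocab=VOCAB):
    """Binary feature vector: 1 if the vocabulary word occurs in the text."""
    lowered = text.lower()
    return [1 if word in lowered else 0 for word in vocab]

review = ("Some films are hard to make sense of. Others are just nonsense. "
          "Chennai Express, directed by Rohit Shetty, ticks both boxes. "
          "The rest is a stew of puerile humor, lazy stereotypes, and "
          "overacting from a cast that appears to be trying too hard.")
print(bernoulli_vector(review))  # [0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
```

This reproduces the vector D = {0, 1, 0, 1, 0, 1, 0, 0, 1, 0} from the slide.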

    Bernoulli Document Model

    V = {awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    "Chennai Express plays neither to Rohit's strengths nor to Shahrukh's. It
    is a strangely sloppy mishmash of cheesy humour, half-hearted romance,
    emotion and head-banging action. The film is filled with gigantic men
    whose size functions as a punch line. Yes, some of it is funny. The
    locations are beautiful."

    D = {0, 1, 0, 0, 1, 0, 0, 0, 0, 0}

    Naïve Bayes Algorithm

    Training Data Set (Chennai Express reviews from Bollywood Hungama)


    what a awseome movie....start fr the DDLJ train scene..till thefunny...conversation with singing a song..haha mind blowing... Mina wash

    mina...mina..ting tong...halirous man...SRK you proof once again that you aracting and bollwood...

    i like the movie v.much. its train secquence srk acting in train, converstion bdepica by singing hindi songs it is amaizing.its drama,its music and its climax is

    proud of such a lover rahol..Ce has good entertainment story love to watch it ag

    wow mind blowing...superb.. truly enjoy to watch this. CE rocksssssssssssssssss


    Training Data Set (Chennai Express reviews from Bollywood Hungama)

    trash in one word.

    Shame on the film industry. An ugly blot on the blockbuster list. SRK how do u manage twith films such as this. U used to be an actor of quality with movies such as Daar, swadesto back you. The makers of the film have made a mockery out of Indians. And what is wodemean ourselves by going to watch such movies.

    Quiet shocking to see actor of Shahrukh s caliber and seniority doing such an ordinary ro

    much more expected from Shahrukh .In fact his sarcasm was felt very repetitive and far frgood thing about movie is Deepika s Perfomance .She has done extraordinary job and shocredit for the success of movie.

    Ref: Hiroshi Shimodaira, "Text Classification using Naïve Bayes", Learning and Data Note, University of Edinburgh.

    Naïve Bayes Model

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Bernoulli Document Model: Probability Estimation

    If there are N documents in total in the training set, then the prior
    probability of class C = k may be estimated as the relative frequency of
    documents of class k:

    P(C = k) = Nk / N

    Bernoulli Document Model: Probability Estimation

    Number of documents = 500

    Number of positive documents = 200

    P(C = k) = Nk / N = 200 / 500 = 0.4

    Naïve Bayes Model

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Documents are converted into feature vectors, e.g.:

    bi = {0, 1, 0, 0, 1, 0, 0, 0, 0, 0}

    Bernoulli Document Model

    Let bi be the feature vector for the ith document Di, where

    bit = 1 if word wt is present in the document, and 0 otherwise.

    Bernoulli Document Model

    P(Di|C) in terms of the individual word likelihoods P(wt|C) is given by:

    P(Di|C) = P(bi|C) = product over t = 1..N of
              [ bit P(wt|C) + (1 - bit)(1 - P(wt|C)) ]

    P(wt|C) is the probability of finding wt given class C; 1 - P(wt|C) is the
    probability of not finding wt given class C.
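The likelihood formula above can be written directly as a short function. The three-word example at the bottom is purely illustrative (not from the slides); note how the absent word contributes 1 - P(wt|C) rather than being ignored:

```python
def document_likelihood(b, word_probs):
    """P(D|C) under the Bernoulli document model.

    b: binary feature vector of the document.
    word_probs: P(w_t|C) for each vocabulary word.
    Each word contributes P(w_t|C) if present, 1 - P(w_t|C) if absent.
    """
    likelihood = 1.0
    for b_t, p_t in zip(b, word_probs):
        likelihood *= b_t * p_t + (1 - b_t) * (1 - p_t)
    return likelihood

# Illustrative three-word vocabulary: word 2 is absent, so it contributes 1 - 0.5
print(document_likelihood([1, 0, 1], [0.25, 0.5, 0.8]))  # 0.25 * 0.5 * 0.8 = 0.1
```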

    Bernoulli Document Model: Probability Estimation

    Let nk(wt) be the number of documents of class C = k in which word wt is
    observed, and let Nk be the total number of documents of class k.

    Then the estimate of P(wt|C = k) is given by:

    P(wt|C = k) = nk(wt) / Nk

    Bernoulli Document Model: Probability Estimation

    Class k = Positive

    Nk = 200 (total number of positive documents in the training data)

    nk(wt) = 50 (that is, word wt appears in 50 training documents)

    P(wt|C = k) = nk(wt) / Nk = 50 / 200 = 0.25

    Naïve Bayes Algorithm

    V = {awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    TP = binary feature vectors of the 6 documents classified as positive
    sentiment in the training data.

    TN = binary feature vectors of the 6 documents classified as negative
    sentiment in the training data.

    Naïve Bayes Model

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Naïve Bayes Algorithm

    P(P) = 6/12 = 0.5    P(N) = 6/12 = 0.5

    P(C|D) = P(D|C) P(C) / P(D)

    P(Di|C) = product over t = 1..N of
              [ bit P(wt|C) + (1 - bit)(1 - P(wt|C)) ]

  • 7/25/2019 Anlyse_Mine_UnstructeredData@SocialMedia

    75/80

    All Rights Reserved, Indian

    Np(w1) = 4 Np(w2) = 1 Np(w3) = 2 Np(w4) = 3 Np(w5) = 3

    Np(w6) = 4 Np(w7) = 4 Np(w8) = 4 Np(w9) = 5 Np(w10) = 4

    Np(wt) = Number of times word wt has occurred in positive documents in

    P(w1|P) = 4/6 P(w2|P) = 1/6 P(w3|P) = 2/6 P(w4|P) =

    P(w5|P) = 3/6 P(w6|P) = 4/6 P(w7|P) = 4/6 P(w8|P) =

    P(w9|P) = 5/6 P(w10|P) = 4/6

    P(wt|P) = Conditional probability of observing word wt given it is a posit

    sentiment.

    Nave Bayes Algorithm

  • 7/25/2019 Anlyse_Mine_UnstructeredData@SocialMedia

    76/80

    All Rights Reserved, Indian

    NN(w1) = 2 NN(w2) = 3 NN(w3) = 3 NN(w4) = 1 NN(w5) = 1

    NN(w6) =1 NN(w7) = 3 NN(w8) = 1 NN(w9) =2 NN(w10) = 1

    NN(wt) = Number of times word wt has occurred in negative documents in

    P(w1|N) = 2/6 P(w2|N) = 3/6 P(w3|N) = 3/6 P(w4|N

    P(w5|N) = 1/6 P(w6|N) =1/6 P (w7|N) = 3/6 P(w8|N

    P (w7|N) = 2/6 P(w8|N) = 1/6

    P(wt|N) = Conditional probability of observing word wt given it is a negativ

    Nave Bayes Algorithm

    Naïve Bayes Algorithm

    Classify the following document into positive or negative sentiment:

    D1 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

    P(D1|P) P(P) = (1/2) x (4/6)(1/6)(2/6)(3/6)(3/6)(4/6)(4/6)(4/6)(5/6)(4/6)
                 = 7.62 x 10^-4

    P(D1|N) P(N) = (1/2) x (2/6)(3/6)(3/6)(1/6)(1/6)(1/6)(3/6)(1/6)(2/6)(1/6)
                 = 8.93 x 10^-7

    (Since every element of D1 equals 1, only the P(wt|C) terms appear in the
    product.)

    The document will be classified as P since it has the higher probability.
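The whole classification can be sketched end to end in Python, using the word likelihoods and priors estimated from the training data on the preceding slides:

```python
# Word likelihoods P(wt|P) and P(wt|N) from the training data on the slides
P_W_POS = [4/6, 1/6, 2/6, 3/6, 3/6, 4/6, 4/6, 4/6, 5/6, 4/6]
P_W_NEG = [2/6, 3/6, 3/6, 1/6, 1/6, 1/6, 3/6, 1/6, 2/6, 1/6]
PRIORS = {"P": 0.5, "N": 0.5}

def score(b, word_probs, prior):
    """P(D|C) * P(C) under the Bernoulli model (proportional to P(C|D))."""
    s = prior
    for b_t, p_t in zip(b, word_probs):
        s *= b_t * p_t + (1 - b_t) * (1 - p_t)
    return s

def classify(b):
    """Return the class with the higher score, plus both scores."""
    scores = {"P": score(b, P_W_POS, PRIORS["P"]),
              "N": score(b, P_W_NEG, PRIORS["N"])}
    return max(scores, key=scores.get), scores

# D1 = (1, 1, ..., 1): every vocabulary word is present
label, scores = classify([1] * 10)
print(label)   # P
print(scores)
```

This reproduces the slide's result: the positive score (about 7.62 x 10^-4) dominates the negative one (about 8.93 x 10^-7), so D1 is classified as P.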

    Challenges

    Difficult to classify sarcastic comments:

    "All my life I thought air was free until I bought a bag of chips."

    "You have been so incredibly helpful and thanks (for nothing)."

    Text Pre-Processing


    Feature Extraction

    A feature extractor is used to convert each comment to a feature set.

    It can be used for creating the vocabulary.

    Stemming

    Different forms of the same word map to a common stem. Stemming is a
    process of transforming a word into its stem.

    N-Gram Analysis

    An N-Gram is a sequence of n consecutive words (e.g. "machine learning" is
    a bigram).
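Both pre-processing steps can be sketched in a few lines. The stemmer below is a deliberately naive suffix-stripper for illustration only (a real system would use something like the Porter stemmer); the n-gram function is the standard sliding window:

```python
def simple_stem(word):
    """A toy suffix-stripping stemmer (real systems use e.g. the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ngrams(tokens, n):
    """All sequences of n consecutive words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Stem each token, then extract bigrams (n = 2)
tokens = [simple_stem(w) for w in "machine learning improves text mining".split()]
print(tokens)
print(ngrams(tokens, 2))
```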