anlyse_mine_unstructereddata@socialmedia

Upload: venkata-nelluri-pmp

Post on 26-Feb-2018


TRANSCRIPT

  • 7/25/2019 Anlyse_Mine_UnstructeredData@SocialMedia (80 slides). All Rights Reserved, Indian

    Decision Trees

    Introduction to Decision Trees.

    Decision Tree approach for classification.

    Chi-Square Automatic Interaction Detection (CHAID)

    Classification and Regression Tree (CART)

    CHAID (Chi-square Automatic Interaction Detection)

    Introduction to CHAID

    CHAID is a decision tree algorithm used in classification problems.

    Initial models of CHAID used the chi-square test of independence for
    splitting.

    CHAID was first presented in an article titled "An Exploratory Technique
    for Investigating Large Quantities of Categorical Data" by G. V. Kass,
    published in Applied Statistics (1980).

    CHAID

    CHAID partitions the data into mutually exclusive, exhaustive segments that
    best describe the dependent categorical variable.

    CHAID is an iterative procedure that examines the predictors (the
    classification variables) and uses them in the order of their statistical
    significance.

    CHAID Splitting Rule

    The splitting in CHAID is based on the type of the dependent variable.

    For a continuous dependent variable, the F test is used.

    For a categorical dependent variable, the chi-square test of independence
    is used.

    CHAID Example

    German Credit Rating Data Set.

    Predictor Variable: Checking Account Balance.

    Chi-Square Test of Independence

    The chi-square test of independence starts with the assumption that there
    is no relationship between two variables.

    For example, we assume that there is no relationship between checking
    account balance and default.

    Chi-Square Test of Independence

    German Credit Case

    H0: There is no relationship between checking account balance and default.

    HA: There is a relationship between checking account balance and default.

    Contingency Table

    Checking account balance   Default = 1   Default = 0   Total
    0 DM                               135           139     274
    Other than 0 DM                    165           561     726
    Total                              300           700    1000

    Chi-Square Test of Independence in German Credit Data

    H0: Checking account balance and default are independent.

    HA: Checking account balance and default are dependent.

    Chi-Square Test of Independence: Test Statistic

    Chi-square statistic = sum over all cells of (O - E)^2 / E

    Expected frequency E = (row sum x column sum) / total

    where O is the observed frequency and E is the expected frequency of a cell.

    Chi-Square Test of Independence

    Cell       Observed frequency   Expected frequency   (O - E)^2 / E
    0DM, 1                    135                 82.2            33.9
    0DM, 0                    139                191.8            14.5
    N0DM, 1                   165                217.8            12.8
    N0DM, 0                   561                508.2             5.5
    Total                    1000               1000.0            66.7

    P-value = 3.10 x 10^-16

    Since the p-value is less than 0.05, we reject the null hypothesis.
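The test above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; for one degree of freedom the chi-square survival function reduces to erfc(sqrt(x/2)), so no statistics package is needed:

```python
import math

def chi_square_independence(observed):
    """Chi-square test of independence for a 2x2 contingency table.

    observed: list of rows, e.g. [[135, 139], [165, 561]].
    Returns (chi-square statistic, p-value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand  # expected frequency
            chi2 += (o - e) ** 2 / e
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# German credit contingency table: rows = {0 DM, other}, cols = {default, no default}
chi2, p = chi_square_independence([[135, 139], [165, 561]])
print(round(chi2, 1), p)
```

This reproduces the chi-square statistic of 66.7 and a p-value of about 3.1 x 10^-16 shown on the slide.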

    CHAID Procedure

    Step 1: Examine each predictor variable for its statistical significance
    with the dependent variable, using the F test (for a continuous dependent
    variable) or the chi-square test (for a categorical dependent variable).

    Step 2: Determine the most significant among the predictors (the predictor
    with the smallest p-value after Bonferroni correction).

    Step 3: Divide the data by the levels of the most significant predictor.
    Each of the resulting sub-groups is examined individually further.

    Step 4: For each sub-group, determine the most significant variable from
    the remaining predictors and divide the data again.

    Step 5: Repeat step 4 till a stopping criterion is reached.
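Steps 1 and 2 can be sketched in Python. This is an illustrative fragment, not a full CHAID implementation: it assumes each categorical predictor has been cross-tabulated against default into a 2x2 table, and the predictor names and second table are hypothetical:

```python
import math

def chi2_p_value(table):
    """p-value of the chi-square test of independence for a 2x2 table (1 dof)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    return math.erfc(math.sqrt(stat / 2))

def most_significant_predictor(tables, alpha=0.05):
    """Steps 1-2 of CHAID: test every predictor, apply a Bonferroni
    correction for the number of tests, and return the predictor with the
    smallest adjusted p-value (or None if nothing is significant)."""
    m = len(tables)  # number of predictors tested
    adjusted = {name: min(1.0, chi2_p_value(t) * m) for name, t in tables.items()}
    best = min(adjusted, key=adjusted.get)
    return best if adjusted[best] < alpha else None

# Hypothetical (predictor level x default) tables for two candidate predictors
tables = {
    "checking_balance": [[135, 139], [165, 561]],
    "other_predictor":  [[80, 180], [220, 520]],
}
print(most_significant_predictor(tables))
```

With these numbers, checking account balance is by far the most significant predictor, so it would be used for the first split.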

    Merging in CHAID

    CHAID uses both splitting and merging steps.

    In merging, the least significantly different groups are merged together.

    CHAID Stopping Criteria

    Maximum tree depth is reached (which is pre-defined).

    Minimum number of cases required to be a parent node is reached.

    Minimum number of cases required to be a child node is reached.


    CHAID Input

    Significance level for partitioning a variable.

    Significance level for merging.

    Minimum number of records for the cells.


    CHAID Example: German Credit Data

    First Split: 0DM

    Observed

    Predictor     Y = 1   Y = 0   Total
    0DM = 1          99     110     209
    0DM = 0         140     451     591
    Total           239     561     800

    Expected

    Predictor       Y = 1    Y = 0   Total
    0DM = 1       62.4388   146.56     209
    0DM = 0      176.5613   414.44     591
    Total             239      561     800

    CHAID Tree

    P-value = 1.28 x 10^-10

    First Split: 0DM

    CHAID Tree

    Business Rules

    Business rules are always derived for leaf nodes (nodes without any
    branch).

    Node 1: If 0 DM = 0, then classify the observation as Y = 0 (good credit),
    which will be accurate 76.3% of the time (451 of 591 cases).

    Node 7: 90% accuracy

    Node 5: 87.8% accuracy

    Business Rule

    Node 7: If the checking account balance is not less than 200 DM and there
    are no other installments, then classify the observation as good credit.

    Tree Pruning

    Tree pruning is a process of reducing the size of the tree using different
    strategies.

    Pruning is important since a large tree can result in overfitting.

    Pruning can be based either on the size of the tree or on purity.


    Classification and Regression Tree (CART)

    Classification and Regression Trees (CART)

    Splits are chosen based either on the Gini Index or on the twoing
    criterion.

    CART is a binary tree, whereas CHAID can split a node into more than 2
    branches.

    Gini Index (Classification Impurity Measure)

    The Gini Index is used to measure the impurity at a node (in a
    classification problem) and is given by:

    Gini(k) = sum over j = 1..J of P(j|k) (1 - P(j|k))

    where P(j|k) is the proportion of category j in node k.

    A smaller Gini Index implies less impurity.

    Entropy (Impurity Measure)

    Entropy is another impurity measure that is frequently used. Entropy at
    node k is given by:

    Entropy(k) = - sum over j = 1..c of P(j|k) log2 P(j|k)

    Gini Index Calculation

    Number of classes: 2 (say 0 and 1).

    Consider a node k with 10 1s and 90 0s.

    Gini(k) = sum over j = 1..2 of P(j|k) (1 - P(j|k)) = 2 x 0.1 x 0.9 = 0.18

    A smaller number implies less impurity!

    Entropy Calculation

    Number of classes: 2 (say 0 and 1).

    Consider a node k with 10 1s and 90 0s.

    Entropy(k) = -0.1 log2(0.1) - 0.9 log2(0.9) = 0.4689

    Higher than the Gini coefficient.
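Both impurity measures are easy to verify in Python. A minimal sketch that reproduces the two slide calculations from the raw class counts:

```python
import math

def gini(counts):
    """Gini impurity at a node: sum over classes of p * (1 - p)."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def entropy(counts):
    """Entropy at a node: -sum over classes of p * log2(p)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Node k with 10 ones and 90 zeros, as on the slides
print(round(gini([10, 90]), 2))     # 2 x 0.1 x 0.9 = 0.18
print(round(entropy([10, 90]), 4))  # 0.4690, higher than the Gini value
```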

    Classification Tree Logic

    Node t is split into a left node tL and a right node tR.

    PL = proportion of observations in the left node
    PR = proportion of observations in the right node
    i(.) = impurity at node (.)

    The best split maximizes the reduction in impurity:

    Max [ i(t) - PL i(tL) - PR i(tR) ]

    CART Example: German Credit Data

    First Split: 0DM

    CART Tree

    Impurity at Node 0 = 2 x 0.299 x 0.701 = 0.4191
    Impurity at Node 1 = 2 x 0.237 x 0.763 x 0.73875 = 0.2672
    Impurity at Node 2 = 2 x 0.474 x 0.526 x 0.26125 = 0.1303
    Change in impurity = 0.4191 - 0.2672 - 0.1303 = 0.0217

    First Split: Credit Amount

    CART Tree

    Impurity at Node 0 = 2 x 0.299 x 0.701 = 0.4191
    Impurity at Node 1 = 2 x 0.271 x 0.729 x 0.8775 = 0.3467
    Impurity at Node 2 = 2 x 0.5 x 0.5 x 0.1225 = 0.0613
    Change in impurity = 0.4191 - 0.3467 - 0.0613 = 0.0112

    The 0DM split yields the larger reduction in impurity, so it is chosen
    first.
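The comparison of the two candidate splits can be sketched as a short computation, using the proportions from the slides (2p(1-p) is the binary Gini impurity, and the child impurities are weighted by the fraction of observations each child receives):

```python
def gini2(p):
    """Gini impurity for a binary node with positive-class proportion p."""
    return 2 * p * (1 - p)

def impurity_reduction(p_parent, p_left, w_left, p_right, w_right):
    """Reduction in impurity for a candidate binary split.

    p_* are positive-class proportions; w_left and w_right are the fractions
    of observations sent to each child node.
    """
    return gini2(p_parent) - w_left * gini2(p_left) - w_right * gini2(p_right)

# Values from the slides (German credit data, root node p = 0.299)
split_0dm    = impurity_reduction(0.299, 0.237, 0.73875, 0.474, 0.26125)
split_amount = impurity_reduction(0.299, 0.271, 0.8775, 0.5, 0.1225)
print(round(split_0dm, 4), round(split_amount, 4))  # 0DM gives the larger reduction
```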

    Node 5: If 0DM = 1 and d..., then classify the observation as ... (87.8%
    accuracy).

    Ensemble Methods

    Ensemble methods use majority voting: several classifiers are developed
    using various techniques such as logistic regression analysis, CHAID,
    CART, etc.

    For each observation, the classification is done based on the majority
    vote.

    Ensemble methods in general improve the accuracy of prediction.

    Random Forest

    Random forest is an ensemble method in which several trees are grown (thus
    the name forest) by sampling the predictor variables.

    The sampling can also be done on the training data set.

    Analysis of unstructured data: Text Mining / Sentiment Analysis / Social
    Media Analytics

    Why Text Mining and Sentiment Analysis?

    Science of Social Influence:

    70% of the sales decisions were made before engaging a salesperson (HP).

    The sales force is becoming irrelevant since customers are engaged through
    social media.

    Importance of Social Listening

    A study by McKinsey revealed that electronic word-of-mouth generates twice
    the sales as compared to paid advertising.

    Data Sources for Unstructured Data and Social Media Analytics

    Facebook

    Blogs and Microsites

    YouTube, Twitter, Instagram

    Text Classification and Sentiment Analysis

    Text Classification Approaches:

    Rule based classification

    Naïve Bayes Algorithm

    Classification Methods: Rules

    Rules are based on combinations of words present in the document.

    Spam e-mail classification: (Hello Dear) OR (You Won a Lottery)

    Accuracy can be low.

    Naïve Bayes Algorithm

    Naïve Bayes Algorithm

    The Naïve Bayes algorithm (or Naïve Bayes classifier) is a classification
    technique that uses Bayes' theorem.

    It is one of the popular tools for sentiment analysis.

    Comments on the Bollywood Movie Chennai Express

    Some films are hard to make sense of. Others are nonsense. Chennai
    Express, directed by Rohit Shetty, ticks both boxes.

    The film is filled with gigantic men whose size functions as a punch line.
    Yes, some of it is funny. The locations are beautiful.

    Classification Methods: Naïve Bayes Algorithm

    Input: a document set D = {d1, d2, ...}

    a fixed set of classes C = {c1, c2, ..., cn}

    a training set of k documents (d1, c1), ..., (dk, ck)

    Output: a learned classifier Y: dnew -> c

    C1 = Positive, C2 = Negative, C3 = Neutral

    Naïve Bayes Classification for Text Mining

    Documents are treated as a bag of words.

    Consider a document D, whose class is given by C. We classify D into the
    class which has the highest posterior probability P(C|D).

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Naïve Bayes Classification for Text Mining

    A document is represented as a bag of words.

    We create a vocabulary V containing N word types (features).

    Documents are represented using feature vectors.

    Vocabulary

    V = {Awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    N = 10 (number of words in the vocabulary)

    Document Model

    Bernoulli document model: A document is represented by a feature vector
    with binary elements taking value 1 if the corresponding word occurs in
    the document and 0 otherwise.

    Multinomial document model: A document is represented by a feature vector
    with integer elements whose value is the frequency of the word in the
    document.

    Bernoulli Document Model

    V = {awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    "Some films are hard to make sense of. Others are just nonsense. Chennai
    Express, directed by Rohit Shetty, ticks both boxes. More than a quarter
    of the film is in Tamil, and hence incomprehensible if you're unfamiliar
    with the language. The rest is a stew of puerile humor, lazy stereotypes,
    and overacting from a cast that appears to be trying too hard."

    D = {0, 1, 0, 1, 0, 1, 0, 0, 1, 0}
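The conversion of a review into its binary feature vector can be sketched as follows. The review text is abridged from the slide, and plain substring matching is used here so that, for example, "stereotypes" matches "stereotype"; a real system would tokenize and stem first:

```python
# Vocabulary from the slide (10 word types)
VOCAB = ["awesome", "humor", "action", "nonsense", "romance",
         "stereotype", "mindless", "boring", "overacting", "trash"]

def bernoulli_vector(text, vocab=VOCAB):
    """Binary feature vector: 1 if the vocabulary word occurs in the text."""
    lowered = text.lower()
    return [1 if word in lowered else 0 for word in vocab]

review = ("Some films are hard to make sense of. Others are just nonsense. "
          "Chennai Express, directed by Rohit Shetty, ticks both boxes. "
          "The rest is a stew of puerile humor, lazy stereotypes, and "
          "overacting from a cast that appears to be trying too hard.")
print(bernoulli_vector(review))  # [0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
```

This reproduces the vector D = {0, 1, 0, 1, 0, 1, 0, 0, 1, 0} from the slide.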

    Bernoulli Document Model

    V = {awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    "Chennai Express plays neither to Rohit's strengths nor to Shahrukh's. It
    is a strangely sloppy mishmash of cheesy humour, half-hearted romance,
    emotion and head-banging action. The film is filled with gigantic men
    whose size functions as a punch line. Yes, some of it is funny. The
    locations are beautiful."

    D = {0, 1, 0, 0, 1, 0, 0, 0, 0, 0}

    Naïve Bayes Algorithm

    Training Data Set (Chennai Express reviews from Bollywood Hungama)


    what a awseome movie....start fr the DDLJ train scene..till thefunny...conversation with singing a song..haha mind blowing... Mina wash

    mina...mina..ting tong...halirous man...SRK you proof once again that you aracting and bollwood...

    i like the movie v.much. its train secquence srk acting in train, converstion bdepica by singing hindi songs it is amaizing.its drama,its music and its climax is

    proud of such a lover rahol..Ce has good entertainment story love to watch it ag

    wow mind blowing...superb.. truly enjoy to watch this. CE rocksssssssssssssssss


    Training Data Set (Chennai Express reviews from Bollywood Hungama)

    trash in one word.

    Shame on the film industry. An ugly blot on the blockbuster list. SRK how do u manage twith films such as this. U used to be an actor of quality with movies such as Daar, swadesto back you. The makers of the film have made a mockery out of Indians. And what is wodemean ourselves by going to watch such movies.

    Quiet shocking to see actor of Shahrukh s caliber and seniority doing such an ordinary ro

    much more expected from Shahrukh .In fact his sarcasm was felt very repetitive and far frgood thing about movie is Deepika s Perfomance .She has done extraordinary job and shocredit for the success of movie.

    Ref: Hiroshi Shimodaira, "Text Classification using Naïve Bayes", Learning and Data Note, University of Edinburgh.

    Naïve Bayes Model

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Bernoulli Document Model: Probability Estimation

    If there are N documents in total in the training set, then the prior
    probability of class C = k may be estimated as the relative frequency of
    documents of class k:

    P(C = k) = Nk / N

    Bernoulli Document Model: Probability Estimation

    Number of documents = 500

    Number of positive documents = 200

    P(C = k) = Nk / N = 200 / 500 = 0.4

    Naïve Bayes Model

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Documents are converted into feature vectors, e.g.:

    bi = {0, 1, 0, 0, 1, 0, 0, 0, 0, 0}

    Bernoulli Document Model

    Let bi be the feature vector for the ith document Di, where

    bit = 1 if word wt is present in the document, and 0 otherwise.

    Bernoulli Document Model

    P(Di|C) in terms of the individual word likelihoods P(wt|C) is given by:

    P(Di|C) = P(bi|C) = product over t = 1..N of
              [ bit P(wt|C) + (1 - bit)(1 - P(wt|C)) ]

    P(wt|C) is the probability of finding wt given class C; 1 - P(wt|C) is the
    probability of not finding wt given class C.
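The likelihood formula above can be written directly as a short function. The three-word example at the bottom is purely illustrative (not from the slides); note how the absent word contributes 1 - P(wt|C) rather than being ignored:

```python
def document_likelihood(b, word_probs):
    """P(D|C) under the Bernoulli document model.

    b: binary feature vector of the document.
    word_probs: P(w_t|C) for each vocabulary word.
    Each word contributes P(w_t|C) if present, 1 - P(w_t|C) if absent.
    """
    likelihood = 1.0
    for b_t, p_t in zip(b, word_probs):
        likelihood *= b_t * p_t + (1 - b_t) * (1 - p_t)
    return likelihood

# Illustrative three-word vocabulary: word 2 is absent, so it contributes 1 - 0.5
print(document_likelihood([1, 0, 1], [0.25, 0.5, 0.8]))  # 0.25 * 0.5 * 0.8 = 0.1
```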

    Bernoulli Document Model: Probability Estimation

    Let nk(wt) be the number of documents of class C = k in which word wt is
    observed, and let Nk be the total number of documents of class k.

    Then the estimate of P(wt|C = k) is given by:

    P(wt|C = k) = nk(wt) / Nk

    Bernoulli Document Model: Probability Estimation

    Class k = Positive

    Nk = 200 (total number of positive documents in the training data)

    nk(wt) = 50 (that is, word wt appears in 50 training documents)

    P(wt|C = k) = nk(wt) / Nk = 50 / 200 = 0.25

    Naïve Bayes Algorithm

    V = {awesome, humor, action, nonsense, romance, stereotype, Mindless,
    Boring, Overacting, trash}

    TP = binary feature vectors of the 6 documents classified as positive
    sentiment in the training data.

    TN = binary feature vectors of the 6 documents classified as negative
    sentiment in the training data.

    Naïve Bayes Model

    P(C|D) can be expressed using Bayes' theorem:

    P(C|D) = P(D|C) P(C) / P(D)

    Naïve Bayes Algorithm

    P(P) = 6/12 = 0.5    P(N) = 6/12 = 0.5

    P(C|D) = P(D|C) P(C) / P(D)

    P(Di|C) = product over t = 1..N of
              [ bit P(wt|C) + (1 - bit)(1 - P(wt|C)) ]

  • 7/25/2019 Anlyse_Mine_UnstructeredData@SocialMedia

    75/80

    All Rights Reserved, Indian

    Np(w1) = 4 Np(w2) = 1 Np(w3) = 2 Np(w4) = 3 Np(w5) = 3

    Np(w6) = 4 Np(w7) = 4 Np(w8) = 4 Np(w9) = 5 Np(w10) = 4

    Np(wt) = Number of times word wt has occurred in positive documents in

    P(w1|P) = 4/6 P(w2|P) = 1/6 P(w3|P) = 2/6 P(w4|P) =

    P(w5|P) = 3/6 P(w6|P) = 4/6 P(w7|P) = 4/6 P(w8|P) =

    P(w9|P) = 5/6 P(w10|P) = 4/6

    P(wt|P) = Conditional probability of observing word wt given it is a posit

    sentiment.

    Nave Bayes Algorithm

  • 7/25/2019 Anlyse_Mine_UnstructeredData@SocialMedia

    76/80

    All Rights Reserved, Indian

    NN(w1) = 2 NN(w2) = 3 NN(w3) = 3 NN(w4) = 1 NN(w5) = 1

    NN(w6) =1 NN(w7) = 3 NN(w8) = 1 NN(w9) =2 NN(w10) = 1

    NN(wt) = Number of times word wt has occurred in negative documents in

    P(w1|N) = 2/6 P(w2|N) = 3/6 P(w3|N) = 3/6 P(w4|N

    P(w5|N) = 1/6 P(w6|N) =1/6 P (w7|N) = 3/6 P(w8|N

    P (w7|N) = 2/6 P(w8|N) = 1/6

    P(wt|N) = Conditional probability of observing word wt given it is a negativ

    Nave Bayes Algorithm

    Naïve Bayes Algorithm

    Classify the following document into positive or negative sentiment:

    D1 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

    P(D1|P) P(P) = (1/2) x (4/6)(1/6)(2/6)(3/6)(3/6)(4/6)(4/6)(4/6)(5/6)(4/6)
                 = 7.62 x 10^-4

    P(D1|N) P(N) = (1/2) x (2/6)(3/6)(3/6)(1/6)(1/6)(1/6)(3/6)(1/6)(2/6)(1/6)
                 = 8.93 x 10^-7

    (Since every element of D1 equals 1, only the P(wt|C) terms appear in the
    product.)

    The document will be classified as P since it has the higher probability.
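The whole classification can be sketched end to end in Python, using the word likelihoods and priors estimated from the training data on the preceding slides:

```python
# Word likelihoods P(wt|P) and P(wt|N) from the training data on the slides
P_W_POS = [4/6, 1/6, 2/6, 3/6, 3/6, 4/6, 4/6, 4/6, 5/6, 4/6]
P_W_NEG = [2/6, 3/6, 3/6, 1/6, 1/6, 1/6, 3/6, 1/6, 2/6, 1/6]
PRIORS = {"P": 0.5, "N": 0.5}

def score(b, word_probs, prior):
    """P(D|C) * P(C) under the Bernoulli model (proportional to P(C|D))."""
    s = prior
    for b_t, p_t in zip(b, word_probs):
        s *= b_t * p_t + (1 - b_t) * (1 - p_t)
    return s

def classify(b):
    """Return the class with the higher score, plus both scores."""
    scores = {"P": score(b, P_W_POS, PRIORS["P"]),
              "N": score(b, P_W_NEG, PRIORS["N"])}
    return max(scores, key=scores.get), scores

# D1 = (1, 1, ..., 1): every vocabulary word is present
label, scores = classify([1] * 10)
print(label)   # P
print(scores)
```

This reproduces the slide's result: the positive score (about 7.62 x 10^-4) dominates the negative one (about 8.93 x 10^-7), so D1 is classified as P.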

    Challenges

    Difficult to classify sarcastic comments:

    "All my life I thought air was free until I bought a bag of chips."

    "You have been so incredibly helpful and thanks (for nothing)."

    Text Pre-Processing


    Feature Extraction

    A feature extractor is used to convert each comment to a feature set.

    It can be used for creating the vocabulary.

    Stemming

    Different forms of the same word map to a common stem. Stemming is a
    process of transforming a word into its stem.

    N-Gram Analysis

    An N-Gram is a sequence of n consecutive words (e.g. "machine learning" is
    a bigram).
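Both pre-processing steps can be sketched in a few lines. The stemmer below is a deliberately naive suffix-stripper for illustration only (a real system would use something like the Porter stemmer); the n-gram function is the standard sliding window:

```python
def simple_stem(word):
    """A toy suffix-stripping stemmer (real systems use e.g. the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ngrams(tokens, n):
    """All sequences of n consecutive words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Stem each token, then extract bigrams (n = 2)
tokens = [simple_stem(w) for w in "machine learning improves text mining".split()]
print(tokens)
print(ngrams(tokens, 2))
```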