anlyse_mine_unstructereddata@socialmedia
TRANSCRIPT
7/25/2019
All Rights Reserved, Indian
Decision Trees
Introduction to Decision Trees
Decision Tree approach for classification
Chi-Square Automatic Interaction Detection (CHAID)
Classification and Regression Tree (CART)
CHAID (Chi-square Automatic Interaction Detection)
Introduction to CHAID
CHAID is a decision tree algorithm used in classification problems.
Initial models of CHAID used the chi-square test of independence for splitting.
CHAID was first presented in an article titled "An Exploratory Technique for Investigating Large Quantities of Categorical Data", published in Applied Statistics (1980).
CHAID
CHAID partitions the data into mutually exclusive, exhaustive subgroups that best describe the dependent categorical variable.
CHAID is an iterative procedure that examines the predictors (classification variables) and uses them in the order of their statistical significance.
CHAID Splitting Rule
The splitting in CHAID is based on the type of the dependent variable.
For a continuous dependent variable, the F test is used.
For a categorical dependent variable, the chi-square test of independence is used.
CHAID Example
German Credit Rating Data Set.
Predictor Variable: Checking Account Balance.
Chi-Square Test of Independence
The chi-square test of independence starts with an assumption that there is no relationship between two variables.
For example, we assume that there is no relationship between checking account balance and default.
Chi-Square Test of Independence
German Credit Case
H0: There is no relationship between checking account balance and default.
HA: There is a relationship between checking account balance and default.
Contingency Table

Checking account balance | Default = 1 | Default = 0 | Total
0 DM                     | 135         | 139         | 274
Other than 0 DM          | 165         | 561         | 726
Total                    | 300         | 700         | 1000
Chi-Square Test of Independence in German Credit Data
H0: Checking account balance and default are independent.
HA: Checking account balance and default are dependent.
Chi-Square Test of Independence: Test Statistic

chi-square statistic = sum over all cells of (O - E)^2 / E

where O is the observed frequency of a cell and the expected frequency E = (row sum x column sum) / total sum.
Chi-Square Test of Independence: Calculation

Cell (group-default) | Observed (O) | Expected (E) | (O - E)^2 / E
0DM-1                | 135          | 82.2         | 33.9
0DM-0                | 139          | 191.8        | 14.5
N0DM-1               | 165          | 217.8        | 12.8
N0DM-0               | 561          | 508.2        | 5.5
Total                | 1000         | 1000         | 66.7

P-value = 3.10 x 10^-16.
Since the p-value is less than 0.05, we reject the null hypothesis.
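The expected-frequency and chi-square arithmetic above can be reproduced from first principles in a few lines of Python (no statistics package assumed); the contingency-table values are those from the slides:

```python
# Chi-square test of independence for the German Credit contingency table.
observed = [[135, 139],   # checking balance = 0 DM
            [165, 561]]   # checking balance = other than 0 DM

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected frequency for each cell = (row sum x column sum) / grand total
expected = [[r * c / total for c in col_sums] for r in row_sums]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
print(round(chi2, 1))  # -> 66.7
```

With 1 degree of freedom, a statistic of 66.7 corresponds to the p-value of about 3.1 x 10^-16 shown on the slide.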
CHAID Procedure
Step 1: Examine each predictor variable for its statistical significance with the dependent variable, using the F test (for a continuous dependent variable) or the chi-square test (for a categorical dependent variable).
Step 2: Determine the most significant among the predictors (the predictor with the smallest p-value after Bonferroni correction).
Step 3: Divide the data by the levels of the most significant predictor. Each of the resulting sub-groups is examined individually further.
Step 4: For each sub-group, determine the most significant variable from the remaining predictors and divide the data again.
Step 5: Repeat Step 4 till a stopping criterion is reached.
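Steps 1 and 2 can be sketched as follows for binary predictors. The helper names (`chi2_stat`, `best_split`) and the perfectly balanced `weak_predictor` table are illustrative assumptions; the df = 1 p-value uses the identity p = erfc(sqrt(x/2)):

```python
import math

def chi2_stat(table):
    # Pearson chi-square statistic for a 2x2 contingency table
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

def p_value_df1(x):
    # Upper-tail probability of a chi-square(1) variable: erfc(sqrt(x/2))
    return math.erfc(math.sqrt(x / 2))

def best_split(tables):
    # Steps 1-2 of CHAID: score every predictor, then pick the one with the
    # smallest Bonferroni-adjusted p-value (adjusted by the number of tests)
    m = len(tables)
    scored = {name: min(1.0, m * p_value_df1(chi2_stat(t)))
              for name, t in tables.items()}
    return min(scored, key=scored.get), scored

tables = {
    "checking_balance": [[135, 139], [165, 561]],  # from the German Credit example
    "weak_predictor":   [[150, 350], [150, 350]],  # hypothetical, no association
}
name, scores = best_split(tables)
print(name)  # -> checking_balance
```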
Merging in CHAID
CHAID uses both splitting and merging steps.
In merging, the least significantly different groups are merged to form a single group.
CHAID Stopping Criteria
Maximum tree depth is reached (which is pre-defined).
Minimum number of cases to be a parent node is reached (also pre-defined).
Minimum number of cases to be a child node is reached.
CHAID Input
Significance level for partitioning a variable.
Significance level for merging.
Minimum number of records for the cells.
CHAID Example: German Credit Data
First Split: 0DM (CHAID Tree)

Observed
Predictor | Y = 1 | Y = 0 | Total
0DM = 1   | 99    | 110   | 209
0DM = 0   | 140   | 451   | 591
Total     | 239   | 561   | 800

Expected
Predictor | Y = 1  | Y = 0  | Total
0DM = 1   | 62.44  | 146.56 | 209
0DM = 0   | 176.56 | 414.44 | 591
Total     | 239    | 561    | 800

P-value = 1.28 x 10^-10
First Split: 0DM (CHAID Tree)
Business Rules
Business rules are always derived for leaf nodes (nodes without any branch).
Node 1: If 0DM = 0, then classify the observation as Y = 0 (good credit), which will be accurate 76% of the time.
Node 7: 90% accuracy
Node 5: 87.8% accuracy
Business Rule
Node 7: If the checking account balance is not less than 200 DM and there are no other installments, then classify the observation as good credit.
Tree Pruning
Tree pruning is a process of reducing the size of the tree using different strategies.
Pruning is important since a large tree can result in over-fitting.
Pruning can be based either on the size of the tree or on purity.
Classification and Regression Tree (CART)
Classification and Regression Trees (CART)
Splits are chosen based either on the Gini index or the twoing criterion.
CART is a binary tree, whereas CHAID can split the data into more than 2 branches.
Gini Index (Classification Impurity Measure)

The Gini index is used to measure the impurity at a node (in a classification problem) and is given by:

Gini(k) = sum over j = 1..J of P(j|k) (1 - P(j|k))

where P(j|k) is the proportion of category j in node k.

A smaller Gini index implies less impurity.
Entropy (Impurity Measure)

Entropy is another impurity measure that is frequently used. Entropy at node k is given by:

Entropy(k) = - sum over j = 1..c of P(j|k) log2 P(j|k)
Gini Index Calculation

Number of classes = 2 (say 0 and 1).
Consider node k with 10 1s and 90 0s.

Gini(k) = sum over j = 1..2 of P(j|k)(1 - P(j|k)) = 2 x 0.1 x 0.9 = 0.18

A smaller number implies less impurity!
Entropy Calculation

Number of classes = 2 (say 0 and 1).
Consider node k with 10 1s and 90 0s.

Entropy(k) = - sum over j = 1..2 of P(j|k) log2 P(j|k) = -0.1 log2(0.1) - 0.9 log2(0.9) = 0.4690

Higher than the Gini index for the same node.
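Both impurity measures can be checked with a short sketch; the node with 10 ones and 90 zeros from the slides gives Gini = 0.18 and entropy of about 0.469:

```python
import math

def gini(probs):
    # Gini impurity: sum of p * (1 - p) over the class proportions
    return sum(p * (1 - p) for p in probs)

def entropy(probs):
    # Entropy in bits: -sum of p * log2(p); the p > 0 filter handles 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Node with 10 ones and 90 zeros
probs = [0.1, 0.9]
print(round(gini(probs), 2))     # -> 0.18
print(round(entropy(probs), 4))  # -> 0.469
```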
Classification Tree Logic

A node t is split into a left node t_L and a right node t_R. The split is chosen to maximize the reduction in impurity:

Max  delta-i(t) = i(t) - P_L i(t_L) - P_R i(t_R)

where
i(.) = impurity at node (.)
P_L = proportion of observations in the left node
P_R = proportion of observations in the right node
CART Example: German Credit Data
First Split: 0DM (CART Tree)

Impurity at Node 0 = 2 x 0.299 x 0.701 = 0.4191
Impurity at Node 1 = 2 x 0.237 x 0.763 x 0.73875 = 0.2671
Impurity at Node 2 = 2 x 0.474 x 0.526 x 0.26125 = 0.1303
Change in impurity = 0.4191 - 0.2671 - 0.1303 = 0.0217
First Split: Credit Amount (CART Tree)

Impurity at Node 0 = 2 x 0.299 x 0.701 = 0.4191
Impurity at Node 1 = 2 x 0.271 x 0.729 x 0.8775 = 0.3467
Impurity at Node 2 = 2 x 0.5 x 0.5 x 0.1225 = 0.0613
Change in impurity = 0.4191 - 0.3467 - 0.0613 = 0.0111

Since the 0DM split gives a larger reduction in impurity, it is preferred over the Credit Amount split.
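The impurity-reduction rule can be sketched for a binary split on binary outcomes. Computed at full precision from the raw counts, the 0DM split gives a reduction of about 0.0216; the slide arithmetic differs slightly in the last digit because the intermediate proportions are rounded:

```python
def gini(p1):
    # Gini impurity of a binary node with class-1 proportion p1
    return 2 * p1 * (1 - p1)

def impurity_reduction(n1_left, n0_left, n1_right, n0_right):
    # Reduction in Gini impurity from splitting a parent node into
    # left and right children (counts of class 1 and class 0 in each)
    n_left, n_right = n1_left + n0_left, n1_right + n0_right
    n = n_left + n_right
    parent = gini((n1_left + n1_right) / n)
    left = gini(n1_left / n_left)
    right = gini(n1_right / n_right)
    return parent - (n_left / n) * left - (n_right / n) * right

# 0DM split from the German Credit example:
# left (0DM = 0): 140 bad / 451 good; right (0DM = 1): 99 bad / 110 good
print(round(impurity_reduction(140, 451, 99, 110), 4))  # about 0.0216
```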
Node 5: If 0DM = 1 and d..., then classify the observation as ... (87.8% accuracy).
Ensemble Methods
Ensemble methods use majority voting: several models are developed using various techniques such as logistic regression analysis, CHAID, CART, etc.
For each observation, the classification is done based on majority voting across the models.
Ensemble methods in general improve the accuracy of prediction.
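Majority voting itself is simple to sketch; here three hypothetical model outputs for a single observation are combined:

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one predicted label per model for one observation;
    # the most common label wins
    return Counter(predictions).most_common(1)[0][0]

# Three models (e.g. logistic regression, CHAID, CART) disagree:
print(majority_vote(["good", "bad", "good"]))  # -> good
```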
Random Forest
Random forest is an ensemble method in which several trees are constructed (thus the name forest) by sampling the predictor variables.
The sampling can also be on the training data set.
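As a sketch of how such a forest is typically fitted in practice, here is a scikit-learn example; the synthetic data from `make_classification` is an illustrative stand-in, since the German Credit data set is not reproduced in the transcript:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy binary-classification data standing in for a credit data set
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Each tree sees a bootstrap sample of the rows and a random subset of
# predictors at every split; classification is by majority vote of the trees
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             random_state=42)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # holdout accuracy
```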
Analysis of Unstructured Data: Text Mining / Sentiment Analysis / Social Media Analytics
Why Text Mining and Sentiment Analysis?
Science of Social Influence:
70% of the sales decisions were made before engaging a sales representative (HP).
The sales force is becoming irrelevant since customers are engaged through social media.
Importance of Social Listening
A study by McKinsey revealed that electronic word-of-mouth generates twice the sales as compared to paid advertising.
Data Sources for Unstructured Data and Social Media Analytics
Facebook
Blogs and Microsites
YouTube, Twitter, Instagram
Text Classification and Sentiment Analysis
Text Classification Approaches:
Rule-based classification
Naïve Bayes algorithm
Classification Methods: Rules
Rules based on combinations of words present in the document.
Spam e-mail classification example: (Hello Dear) OR (You Won a Lottery)
Accuracy can be low.
Naïve Bayes Algorithm
Naïve Bayes Algorithm
The naïve Bayes algorithm (or naïve Bayes classifier) is one of the classification techniques that uses Bayes' theorem.
It is one of the popular tools for sentiment analysis.
Comments on the Bollywood Movie Chennai Express
"Some films are hard to make sense of. Others are just nonsense. Chennai Express, directed by Rohit Shetty, ticks both boxes."
"The film is filled with gigantic men whose size functions as a punch line. Yes, some of it is funny. The locations are beautiful."
Classification Methods: Naïve Bayes Algorithm
Input:
a document set D = {d1, d2, ...}
a fixed set of classes C = {c1, c2, ..., cn}
a training set of k documents (d1, c1), ..., (dk, ck)
Output:
a learned classifier Y: d_new -> c
Example classes: C1 = Positive, C2 = Negative, C3 = Neutral
Naïve Bayes Classification for Text Mining
Documents are treated as a bag of words.
Consider a document D whose class is given by C. We classify D into the class which has the highest posterior probability P(C|D).
P(C|D) can be expressed using Bayes' theorem:

P(C|D) = P(D|C) P(C) / P(D)
Naïve Bayes Classification for Text Mining
A document is represented as a bag of words.
We create a vocabulary V containing N word types (features).
Documents are represented using feature vectors.
Vocabulary
V = {awesome, humor, action, nonsense, romance, stereotype, mindless, boring, overacting, trash}
N = 10 (number of words in the vocabulary)
Document Model
Bernoulli document model: a document is represented by a feature vector with binary elements, taking value 1 if the corresponding word is present in the document and 0 otherwise.
Multinomial document model: a document is represented by a feature vector with integer elements whose value is the frequency of that word in the document.
Bernoulli Document Model
V = {awesome, humor, action, nonsense, romance, stereotype, mindless, boring, overacting, trash}
"Some films are hard to make sense of. Others are just nonsense. Chennai Express, directed by Rohit Shetty, ticks both boxes. More than a quarter of the film is in Tamil, and hence incomprehensible if you're unfamiliar with the language. The rest is a stew of puerile humor, lazy stereotypes, and overacting from a star who appears to be trying too hard."
D = {0, 1, 0, 1, 0, 1, 0, 0, 1, 0}
Bernoulli Document Model
V = {awesome, humor, action, nonsense, romance, stereotype, mindless, boring, overacting, trash}
"Chennai Express plays neither to Rohit's strengths nor to Shah Rukh's. It is a strangely sloppy mishmash of cheesy humour, half-hearted romance, forced emotion and head-banging action. The film is filled with gigantic men whose size functions as a punch line. Yes, some of it is funny. The locations are beautiful."
D = {0, 1, 0, 0, 1, 0, 0, 0, 0, 0}
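Building a Bernoulli feature vector from a comment can be sketched in a few lines. The sample sentence is an illustrative assumption, and it deliberately uses the exact vocabulary forms; a real pipeline would first apply the stemming discussed at the end of this transcript so that inflected forms like "stereotypes" also match:

```python
def bernoulli_features(document, vocabulary):
    # 1 if the vocabulary word appears in the document, else 0
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

V = ["awesome", "humor", "action", "nonsense", "romance",
     "stereotype", "mindless", "boring", "overacting", "trash"]

doc = "the humor is nonsense and the overacting is trash"
print(bernoulli_features(doc, V))  # -> [0, 1, 0, 1, 0, 0, 0, 0, 1, 1]
```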
Naïve Bayes Algorithm
Training Data Set (Chennai Express reviews from Bollywood Hu...)
Positive comments:
what a awseome movie....start fr the DDLJ train scene..till thefunny...conversation with singing a song..haha mind blowing... Mina wash
mina...mina..ting tong...halirous man...SRK you proof once again that you aracting and bollwood...
i like the movie v.much. its train secquence srk acting in train, converstion bdepica by singing hindi songs it is amaizing.its drama,its music and its climax is
proud of such a lover rahol..Ce has good entertainment story love to watch it ag
wow mind blowing...superb.. truly enjoy to watch this. CE rocksssssssssssssssss
Negative comments:
trash in one word.
Shame on the film industry. An ugly blot on the blockbuster list. SRK how do u manage twith films such as this. U used to be an actor of quality with movies such as Daar, swadesto back you. The makers of the film have made a mockery out of Indians. And what is wodemean ourselves by going to watch such movies.
Quiet shocking to see actor of Shahrukh s caliber and seniority doing such an ordinary ro
much more expected from Shahrukh .In fact his sarcasm was felt very repetitive and far frgood thing about movie is Deepika s Perfomance .She has done extraordinary job and shocredit for the success of movie.
Ref: Hiroshi Shimodaira, Text Classification using Naïve Bayes, Learning and Data Note, University of Edinburgh.
Naïve Bayes Model
P(C|D) can be expressed using Bayes' theorem:

P(C|D) = P(D|C) P(C) / P(D)
Bernoulli Document Model: Probability Estimation
If there are N documents in total in the training set, then the prior probability of class C = k may be estimated as the relative frequency of documents of class k:

P(C = k) = N_k / N
Bernoulli Document Model: Probability Estimation
Number of documents N = 500
Number of positive documents N_k = 200

P(C = k) = N_k / N = 200 / 500 = 0.4
Naïve Bayes Model
P(C|D) can be expressed using Bayes' theorem:

P(C|D) = P(D|C) P(C) / P(D)

Documents are converted into feature vectors, e.g.
b_i = {0, 1, 0, 0, 1, 0, 0, 0, 0, 0}
Bernoulli Document Model
Let b_i be the feature vector for the i-th document D_i, where

b_it = 1 if word w_t is present in the document, and 0 otherwise.
P(D_i | C), in terms of the individual word likelihoods P(w_t | C), is given by:

P(D_i | C) = P(b_i | C) = product over t = 1..N of [ b_it P(w_t | C) + (1 - b_it)(1 - P(w_t | C)) ]

P(w_t | C): probability of finding w_t given class C
1 - P(w_t | C): probability of not finding w_t given class C
Bernoulli Document Model: Probability Estimation
Let n_k(w_t) be the number of documents of class C = k in which word w_t is observed, and let N_k be the total number of documents of class k.
The estimate of P(w_t | C = k) is then given by:

P(w_t | C = k) = n_k(w_t) / N_k
Bernoulli Document Model: Probability Estimation
Class k = Positive
N_k = 200 (total number of positive documents in the training data)
n_k(w_t) = 50 (that is, word w_t appears in 50 positive training documents)

P(w_t | C = k) = n_k(w_t) / N_k = 50 / 200 = 0.25
Naïve Bayes Algorithm
V = {awesome, humor, action, nonsense, romance, stereotype, mindless, boring, overacting, trash}

[Table: TP = documents labelled positive sentiment in the training data; TN = documents labelled negative sentiment. Each of the 12 training documents (6 per class) is represented as a 10-element binary feature vector over V.]
Naïve Bayes Model
P(C|D) can be expressed using Bayes' theorem:

P(C|D) = P(D|C) P(C) / P(D)
Naïve Bayes Algorithm
Priors: P(P) = 6/12 = 0.5, P(N) = 6/12 = 0.5

P(C|D) = P(D|C) P(C) / P(D)

P(D_i | C) = P(b_i | C) = product over t = 1..N of [ b_it P(w_t | C) + (1 - b_it)(1 - P(w_t | C)) ]
Naïve Bayes Algorithm
Np(w1) = 4  Np(w2) = 1  Np(w3) = 2  Np(w4) = 3  Np(w5) = 3
Np(w6) = 4  Np(w7) = 4  Np(w8) = 4  Np(w9) = 5  Np(w10) = 4
Np(wt) = number of positive documents in the training data in which word wt occurs.

P(w1|P) = 4/6  P(w2|P) = 1/6  P(w3|P) = 2/6  P(w4|P) = 3/6
P(w5|P) = 3/6  P(w6|P) = 4/6  P(w7|P) = 4/6  P(w8|P) = 4/6
P(w9|P) = 5/6  P(w10|P) = 4/6
P(wt|P) = conditional probability of observing word wt, given that the document has positive sentiment.
NN(w1) = 2  NN(w2) = 3  NN(w3) = 3  NN(w4) = 1  NN(w5) = 1
NN(w6) = 1  NN(w7) = 3  NN(w8) = 1  NN(w9) = 2  NN(w10) = 1
NN(wt) = number of negative documents in the training data in which word wt occurs.

P(w1|N) = 2/6  P(w2|N) = 3/6  P(w3|N) = 3/6  P(w4|N) = 1/6
P(w5|N) = 1/6  P(w6|N) = 1/6  P(w7|N) = 3/6  P(w8|N) = 1/6
P(w9|N) = 2/6  P(w10|N) = 1/6
P(wt|N) = conditional probability of observing word wt, given that the document has negative sentiment.
Naïve Bayes Algorithm
Classify the following document into positive or negative sentiment:
D1 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Naïve Bayes Algorithm: Calculation

Since every element of D1 equals 1, each factor b_t P(w_t|C) + (1 - b_t)(1 - P(w_t|C)) reduces to P(w_t|C):

P(P|D1) proportional to P(P) x product of P(w_t|P)
  = (6/12) x (4/6)(1/6)(2/6)(3/6)(3/6)(4/6)(4/6)(4/6)(5/6)(4/6) = approx. 7.6 x 10^-4

P(N|D1) proportional to P(N) x product of P(w_t|N)
  = (6/12) x (2/6)(3/6)(3/6)(1/6)(1/6)(1/6)(3/6)(1/6)(2/6)(1/6) = approx. 8.9 x 10^-7

The document will be classified as P (positive) since it has the higher posterior probability.
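The whole classification can be verified with a short sketch that keeps the probabilities as exact fractions; the word counts are those from the training slides:

```python
from math import prod
from fractions import Fraction as F

# Word-presence probabilities P(w_t|P) and P(w_t|N) from the training slides
# (6 documents per class)
p_word_pos = [F(n, 6) for n in (4, 1, 2, 3, 3, 4, 4, 4, 5, 4)]
p_word_neg = [F(n, 6) for n in (2, 3, 3, 1, 1, 1, 3, 1, 2, 1)]
prior = F(1, 2)  # P(P) = P(N) = 6/12

def bernoulli_likelihood(doc, p_word):
    # Product over words of [b * P(w|C) + (1 - b) * (1 - P(w|C))]
    return prod(b * p + (1 - b) * (1 - p) for b, p in zip(doc, p_word))

d1 = [1] * 10
score_pos = prior * bernoulli_likelihood(d1, p_word_pos)  # about 7.6e-4
score_neg = prior * bernoulli_likelihood(d1, p_word_neg)  # about 8.9e-7
print("P" if score_pos > score_neg else "N")  # -> P
```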
Challenges
It is difficult to classify sarcastic comments, for example:
"All my life I thought air was free, until I bought a bag of chips."
"You have been so incredibly helpful, and thanks (for nothing)."
Text Pre-Processing
Feature Extraction
A feature extractor is used to convert each comment to a feature set.
It can also be used for creating the vocabulary.
Stemming
Different forms of the same word occur. Stemming is a process of transforming a word into its stem.
N-Gram Analysis
An N-gram is a sequence of n consecutive words (e.g., "machine learning" is a 2-gram).
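The three pre-processing steps can be sketched together. The regular-expression tokenizer and the toy suffix-stripping stemmer are illustrative assumptions; a real system would use, e.g., a Porter stemmer:

```python
import re

def preprocess(text):
    # Lowercase and keep word tokens only
    return re.findall(r"[a-z']+", text.lower())

def stem(word):
    # Toy suffix stripper: removes a common inflectional ending
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ngrams(tokens, n):
    # All sequences of n consecutive tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess("Stemming maps stemmed words to stems")
print([stem(t) for t in tokens])
print(ngrams(tokens, 2))
```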