slides sentiment 2013 10-3
TRANSCRIPT
Brief Introduction to Sentiment Analysis
Joachim De Beule 4 May 2013
What is sentiment?
Expression of:
- an emotion (I am happy)
- an evaluation (Great idea!)
- a stance (I support the bill)
Involves a perspective, a target (named entities) and a sentiment value.
Kermit was thrilled about the idea!
Sentiment analysis is difficult!!
Student 1:
Sentiment  Precision  Recall
Negative   71%        90%
Neutral    96%        87%
Positive   77%        92%
Student 2:
Sentiment  Precision  Recall
Negative   79%        91%
Neutral    96%        90%
Positive   80%        92%
Student 3:
Sentiment  Precision  Recall
Negative   88%        66%
Neutral    86%        97%
Positive   91%        65%
71% of the mentions labeled "Negative" by student 1 were also labeled "Negative" by student 2 or 3 (or both).
29% of the mentions labeled "Negative" by student 1 were labeled neutral (or positive) by both the other students.
66% of the mentions labeled "Negative" by student 1 or 2 (or both) were also labeled "Negative" by student 3.
34% of the mentions labeled "Negative" by student 1 and 2 were not labeled "Negative" by student 3.
Neutral is "easy" because 70% of all mentions are neutral.
Thus, always saying "Neutral" will be correct 70% of the time and lets you recall 100% of the neutral messages.
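The "always Neutral" argument can be checked with a few lines of Python. This is a minimal sketch: the helper `precision_recall` and the 70/15/15 label split are assumptions made for illustration (only the 70% neutral share comes from the slide), not the annotated data itself.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall of `predicted` for one label, measured against `gold`."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    n_predicted = sum(1 for p in predicted if p == label)
    n_gold = sum(1 for g in gold if g == label)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall

# Hypothetical reference labels: 70% neutral as on the slide; the split of the
# remaining 30% into negative and positive is made up.
gold = ["Neutral"] * 70 + ["Negative"] * 15 + ["Positive"] * 15

# A "classifier" that ignores its input and always answers Neutral.
always_neutral = ["Neutral"] * len(gold)

neutral_p, neutral_r = precision_recall(gold, always_neutral, "Neutral")
negative_p, negative_r = precision_recall(gold, always_neutral, "Negative")
print(f"Neutral:  precision={neutral_p:.0%} recall={neutral_r:.0%}")
print(f"Negative: precision={negative_p:.0%} recall={negative_r:.0%}")
```

The baseline looks respectable on Neutral while finding none of the negative mentions, which is why the Neutral rows in the tables say little about how hard the task is.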
Sentiment analysis is difficult!!
#tvvv neeeeee :( domien is out ;o ik blijf vanje houden domien!
(English: "#tvvv nooooo :( Domien is out ;o I will keep loving you, Domien!")
ບ"ມ$ຕ&ນໄມ)ຖ+ກອອກoຂ)າພະເຈ&າຍ5ງຮ5ກທ9ານເປ5ນຕ&ນໄມ)!
Eindelijk verlost van @belgacom! Surfen gaat een pak vlotter met @telenet :-)
(English: "Finally rid of @belgacom! Surfing is a lot smoother with @telenet :-)")
ສ<ດທ)າຍຈາກຕ&ນໄມ)ເກມບ>ນແມ9ນ@າຍຂAນໄວທCມ$ປ9າໄມ)
Automatic Sentiment Analysis: Basic strategy
(1) Training phase:
Mention → Tokenization, POS tagging, … → Features (unigrams)
Features (unigrams) + Human annotation (Label/Action/prediction) → Learning → Classifier Model: feature weights per class ("count table")
(2) Operational phase:
Mention → Tokenization, POS tagging, … → Features (unigrams)
Features (unigrams) + Classifier Model: feature weights per class ("count table") → Classification → Label/Action/prediction
Automatic Sentiment Analysis
Training set:
neeeeee :( domien is out = Negative
ik blijf vanje houden domien! = Positive
eindelijk verlost van @belgacom! = Negative
surfen gaat een pak vlotter met @telenet :-) = Positive
… = …
"Bag of Words":
"neeeeee :( domien is out" = Negative
{"domien", "is", "neeeeee", "out", ":("} = Negative
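Turning a mention into a bag of words could look roughly like the sketch below. The token pattern `TOKEN_RE` is a hypothetical one chosen for this example (it keeps @mentions, #hashtags and a handful of emoticon shapes together); the slides do not say which tokenizer was used.

```python
import re

# Hypothetical token pattern: @mentions and #hashtags, a few emoticon shapes,
# then plain words. Punctuation such as "!" is dropped.
TOKEN_RE = re.compile(r"[@#]\w+|[:;]-?[()oOpPD]|\w+")

def bag_of_words(mention):
    """Return the set of lowercased unigrams in a mention; word order is discarded."""
    return {token.lower() for token in TOKEN_RE.findall(mention)}

bag = bag_of_words("neeeeee :( domien is out")
print(sorted(bag))
```

Because the result is a set, "domien is out" and "out is domien" produce the same features, which is exactly the information a bag-of-words model gives up.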
Train set → table of unigram counts:
unigram    #Negative  #Neutral  #Positive
…          …          …         …
"Ik"       3132       6245      3700
…          …          …         …
":("       365        122       58
…          …          …         …
"Domien"   22         13        14
"neeeeee"  4          1         0
…          …          …         …
⇒ P[Negative | "ik"] = 3132 / (3132 + 6245 + 3700) = 24%
⇒ P[Negative | "ik ben blij"] = ?
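The single-word estimate is just a row-wise normalization of the count table. A small sketch, using only the rows visible on the slide (the names `COUNTS` and `label_given_unigram` are made up for this example):

```python
# Rows copied from the unigram count table on the slide (keys lowercased).
COUNTS = {
    "ik":      {"Negative": 3132, "Neutral": 6245, "Positive": 3700},
    ":(":      {"Negative": 365,  "Neutral": 122,  "Positive": 58},
    "domien":  {"Negative": 22,   "Neutral": 13,   "Positive": 14},
    "neeeeee": {"Negative": 4,    "Neutral": 1,    "Positive": 0},
}

def label_given_unigram(unigram, label):
    """Estimate P[label | unigram] as the label's share of the unigram's row."""
    row = COUNTS[unigram]
    return row[label] / sum(row.values())

p_ik = label_given_unigram("ik", "Negative")
p_sad = label_given_unigram(":(", "Negative")
print(f'P[Negative | "ik"] = {p_ik:.0%}')
print(f'P[Negative | ":("] = {p_sad:.0%}')
```

A frequent word like "ik" stays close to the overall label distribution, while ":(" leans heavily negative. For a whole phrase there is no row to look up, which is what motivates the Bayes decomposition.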
Bayes' rule of conditional probabilities:
P[Negative | "ik ben blij"] = P[Negative] x P["ik ben blij" | Negative] / P["ik ben blij"]
- P[Negative]: prior (over all mentions)
- P["ik ben blij" | Negative]: likelihood
- P["ik ben blij"]: evidence (same for all sentiments)
Chain rule:
P["ik ben blij" | Neg.] = P["ik" | Neg.] (unigram) x P["ben" | Neg., "ik"] (bigram) x P["blij" | Neg., "ik ben"] (trigram)
Naïve Bayes approximation (the evidence is dropped because it is the same for all sentiments; every factor comes from the unigram counts table):
P[Neg. | "ik ben blij"] ∝ P[Neg.] x P["ik" | Neg.] x P["ben" | Neg.] x P["blij" | Neg.]
P[Pos. | "ik ben blij"] ∝ P[Pos.] x P["ik" | Pos.] x P["ben" | Pos.] x P["blij" | Pos.]
Classification algorithm:
"Positive" if P[Pos. | "ik ben blij"] > P[Neg. | "ik ben blij"]
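The training and classification steps above can be sketched in a few dozen lines of Python. Two things here are assumptions rather than part of the slides: the toy training set is invented, and add-one smoothing plus log-probabilities are added so that unseen words do not zero out a score.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count mentions per label and unigram occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocabulary

def log_score(tokens, label, label_counts, word_counts, vocabulary):
    """log(P[label] x product of P[token | label]), with add-one smoothing (assumption)."""
    total_mentions = sum(label_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / total_mentions)
    for token in tokens:
        smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
        score += math.log(smoothed)
    return score

def classify(tokens, model):
    """Pick the label with the highest naive Bayes score."""
    label_counts = model[0]
    return max(label_counts, key=lambda label: log_score(tokens, label, *model))

# Hypothetical toy training set, not the data behind the slides.
examples = [
    (["ik", "ben", "blij"], "Positive"),
    (["wat", "een", "goed", "idee"], "Positive"),
    (["ik", "ben", "boos"], "Negative"),
    (["neeeeee", ":("], "Negative"),
]
model = train(examples)
prediction = classify(["ik", "ben", "blij"], model)
print(prediction)
```

The evidence term never appears in the code: since it is identical for every label, comparing the unnormalized scores is enough to pick a winner.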
Improvements over Naïve Bayes
- Better features:
  - Bigrams, trigrams
  - Parts of speech
  - Tf/idf weighting
  - Grammatical dependencies (e.g. negation marking)
  - Named entities
- Alternative strategies to calculate feature weights from counts:
  - Transformed Normalized Weighted Naïve Bayes
  - Mutual information
  - Maximum entropy
- Other approaches:
  - Sentiment lexicons (cf. current classifier)
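Two of the feature ideas listed above, bigrams and negation marking, are easy to illustrate. The sketch below is deliberately crude: the `NEGATIONS` word list is hypothetical and incomplete, and negation scope is assumed to run to the end of the mention instead of following real grammatical dependencies.

```python
# Hypothetical, far from complete list of negation words (Dutch and English).
NEGATIONS = {"niet", "geen", "nooit", "not", "no", "never"}

def bigrams(tokens):
    """Adjacent token pairs, joined with a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def mark_negation(tokens):
    """Prefix every token after a negation word with NOT_.

    Crude stand-in for dependency-based marking: the scope is assumed
    to extend to the end of the mention.
    """
    marked, negated = [], False
    for token in tokens:
        if token in NEGATIONS:
            negated = True
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked

tokens = ["ik", "ben", "niet", "blij"]
features = mark_negation(tokens) + bigrams(tokens)
print(features)
```

With plain unigrams, "ik ben niet blij" still contains the positive cue "blij"; the `NOT_blij` feature and the "niet blij" bigram give the classifier a way to count it separately.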
Evaluation
- In terms of precision, recall, F1, accuracy, …
- Very good on "simple" tasks (comparable to humans)
  - e.g. spam detection
  - In general, tasks for which grammar and context are not important (negation, source/target/perspective roles, …)
- But rather bad on "difficult" tasks, including sentiment analysis (worse than humans)
Student 1:
Sentiment  Precision  Recall
Negative   71%        90%
Neutral    96%        87%
Positive   77%        92%
Maxent 2-grams:
Sentiment  Precision  Recall
Negative   79%        72%
Neutral    76%        76%
Positive   77%        73%
Current classifier:
Sentiment  Precision  Recall
Negative   42%        43%
Neutral    83%        60%
Positive   38%        76%
(Results maxent/current for balanced English student dataset)
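The evaluation slide mentions F1 but the tables only report precision and recall. F1 is the harmonic mean of the two, so it can be derived from any row. A short sketch (the function name `f1_score` is chosen for this example), applied to two precision/recall pairs that appear in the tables above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs taken from rows of the tables above.
f1_high = f1_score(0.71, 0.90)
f1_low = f1_score(0.42, 0.43)
print(f"F1 for 71% / 90%: {f1_high:.2f}")
print(f"F1 for 42% / 43%: {f1_low:.2f}")
```

The harmonic mean is pulled toward the weaker of the two numbers, so a system cannot earn a high F1 by trading all of its recall for precision or the other way around.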
Many unresolved issues…
- Other languages (unsupervised learning/bootstrapping)
- Source/target resolution
- Classifiers trained on one dataset/topic do not perform well on other datasets/topics
- …
…and opportunities
Many information extraction problems can be cast as classification problems:
- Assigning tags to mentions
- Predicting the number of likes/retweets/… of mentions
- Deciding whom to send/assign a message
- …
- In general, any problem where things must be "labeled", "decided" or "predicted", with a limited number of alternatives, and for which training data is available (can be user feedback!)
- And our users generate massive amounts of data!!
→ Don't hesitate to discuss ideas with me! ←
Part 2: Clojure
- Dynamic programming language targeting the JVM (and JavaScript)
- Combines the interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming
- Lisp dialect:
  - (almost) no syntax
    (+ 1 2) => 3
    (list '+ 1 2) => (+ 1 2)
  - Code as data
    (eval (list '+ 1 2)) => 3
Part 2: Clojure
- Project management through "leiningen"
  - bash$ lein new test-project
  - Add dependencies to project.clj, add code to src/test-project
  - bash$ lein uberjar => test-project.jar
  - bash$ java -jar test-project.jar
- Online demo…