slides sentiment 2013 10-3

Brief Introduction to Sentiment Analysis

Joachim De Beule 4 May 2013


What is sentiment?

Expression of:

- an emotion (I am happy)
- an evaluation (Great idea!)
- a stance (I support the bill)

Involves a perspective, a target (named entities) and a sentiment value.

Example: Kermit was thrilled about the idea!

Sentiment analysis is difficult!

Student 1:
Sentiment   Precision   Recall
Negative    71%         90%
Neutral     96%         87%
Positive    77%         92%

Student 2:
Sentiment   Precision   Recall
Negative    79%         91%
Neutral     96%         90%
Positive    80%         92%

Student 3:
Sentiment   Precision   Recall
Negative    88%         66%
Neutral     86%         97%
Positive    91%         65%

71% of the mentions labeled “Negative” by student 1 were also labeled “Negative” by student 2 or 3 (or both): this is student 1's precision on “Negative”.

29% of the mentions labeled “Negative” by student 1 were labeled neutral (or positive) by both the other students.




66% of the mentions labeled “Negative” by student 1 or 2 (or both) were also labeled “Negative” by student 3: this is student 3's recall on “Negative”.

34% of the mentions labeled “Negative” by student 1 or 2 (or both) were not labeled “Negative” by student 3.
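The way precision and recall are used here (one annotator scored against the other two) can be sketched as follows. This is a minimal illustration: the function name `agreement_scores` and the toy labels are assumptions for the example, not the student dataset behind the tables.

```python
def agreement_scores(labels, annotator, target):
    """Precision and recall of one annotator, scored against the others.

    labels maps an annotator name to a list of labels, one per mention,
    with all lists in the same mention order.
    """
    own = labels[annotator]
    others = [labs for name, labs in labels.items() if name != annotator]
    # Mentions this annotator labeled with the target class.
    own_hits = {i for i, lab in enumerate(own) if lab == target}
    # Mentions that at least one other annotator labeled with the target class.
    other_hits = {i for i in range(len(own))
                  if any(labs[i] == target for labs in others)}
    agreed = own_hits & other_hits
    precision = len(agreed) / len(own_hits) if own_hits else 0.0
    recall = len(agreed) / len(other_hits) if other_hits else 0.0
    return precision, recall


# Hypothetical toy labels, for illustration only.
labels = {
    "student 1": ["Negative", "Negative", "Neutral", "Positive", "Neutral", "Negative"],
    "student 2": ["Negative", "Neutral", "Neutral", "Positive", "Neutral", "Neutral"],
    "student 3": ["Negative", "Neutral", "Neutral", "Neutral", "Negative", "Neutral"],
}
precision, recall = agreement_scores(labels, "student 1", "Negative")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means the annotator says “Negative” where nobody else does; low recall means the annotator misses mentions that others consider “Negative”.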




Neutral is “easy” because 70% of all mentions are neutral.

Thus, always saying “Neutral” will be correct 70% of the time and lets you recall 100% of the neutral messages.
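The “always say Neutral” baseline can be checked with a few lines of code. The sketch below assumes a hypothetical set of ten gold labels with the same 70% neutral share; the name `majority_baseline` is made up for the example.

```python
from collections import Counter


def majority_baseline(gold):
    """Always predict the most frequent label and report how that scores."""
    majority, count = Counter(gold).most_common(1)[0]
    accuracy = count / len(gold)  # only the majority-class mentions are correct
    precision = accuracy          # every prediction is the majority label
    recall = 1.0                  # no majority-class mention is missed
    return majority, accuracy, precision, recall


# Hypothetical gold labels: 70% neutral, as on the slide.
gold = ["Neutral"] * 7 + ["Negative"] * 2 + ["Positive"]
print(majority_baseline(gold))
```

Any classifier worth using has to beat this baseline, which is why high scores on the neutral class alone say little.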




#tvvv neeeeee :( domien is out ;o ik blijf vanje houden domien!
(English: “#tvvv nooooo :( Domien is out ;o I'll keep loving you, Domien!”)

ບ"ມ$ຕ&ນໄມ)ຖ+ກອອກoຂ)າພະເຈ&າຍ5ງຮ5ກທ9ານເປ5ນຕ&ນໄມ)!

Eindelijk verlost van @belgacom! Surfen gaat een pak vlotter met @telenet :-)
(English: “Finally rid of @belgacom! Surfing is a lot smoother with @telenet :-)”)

ສ<ດທ)າຍຈາກຕ&ນໄມ)ເກມບ>ນແມ9ນ@າຍຂAນໄວທCມ$ປ9າໄມ)

Automatic Sentiment Analysis: Basic strategy

(1) Training phase

Mention → Tokenization, POS tagging, … → Features (unigrams)
Features (unigrams) + Human annotation (Label/Action/prediction) → Learning → Classifier Model: Feature-weights per class (“count table”)

(2) Operational phase

Mention → Tokenization, POS tagging, … → Features (unigrams)
Features (unigrams) + Classifier Model: Feature-weights per class (“count table”) → Classification → Label/Action/prediction

Automatic Sentiment Analysis

Training Set:
neeeeee :( domien is out = Negative
ik blijf vanje houden domien! = Positive
eindelijk verlost van @belgacom! = Negative
surfen gaat een pak vlotter met @telenet :-) = Positive
… = …

“Bag of Words”:
“neeeeee :( domien is out” = Negative
{“domien”, “is”, “neeeeee”, “out”, “:(”} = Negative
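Turning a mention into a bag of words could look like the sketch below. The regular expression is an assumption made for this example (it keeps words, @mentions, #hashtags and a few simple emoticons); a real tokenizer would need to handle far more cases.

```python
import re

# Hypothetical token pattern: words with an optional @ or # prefix,
# or a simple emoticon such as :( ;) :-)
TOKEN_RE = re.compile(r"[@#]?\w+|[:;]-?[()]")


def bag_of_words(mention):
    """Return the set of lowercased unigrams; word order and repeats are dropped."""
    return {token.lower() for token in TOKEN_RE.findall(mention)}


print(sorted(bag_of_words("neeeeee :( domien is out")))
```

Because the result is a set, “domien is out” and “out is domien” become indistinguishable, which is exactly the information a bag-of-words model gives up.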

Train set → Table of unigram counts:

unigram      #Negative   #Neutral   #Positive
…            …           …          …
“Ik”         3132        6245       3700
…            …           …          …
“:(”         365         122        58
…            …           …          …
“Domien”     22          13         14
“neeeeee”    4           1          0
…            …           …          …
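A count table of this kind can be built in one pass over the annotated mentions. The sketch below is a simplified assumption of how that might be done: it splits on whitespace and strips trailing punctuation, and it uses only the four example mentions from the training set, so its counts are not those of the table.

```python
from collections import Counter, defaultdict


def train_count_table(training_set):
    """Map each unigram to a Counter of how often it occurs per label."""
    table = defaultdict(Counter)
    for mention, label in training_set:
        for token in mention.lower().split():
            unigram = token.strip("!?.,")  # crude punctuation stripping (assumption)
            if unigram:
                table[unigram][label] += 1
    return table


training_set = [
    ("neeeeee :( domien is out", "Negative"),
    ("ik blijf vanje houden domien!", "Positive"),
    ("eindelijk verlost van @belgacom!", "Negative"),
    ("surfen gaat een pak vlotter met @telenet :-)", "Positive"),
]
table = train_count_table(training_set)
print(dict(table["domien"]))
```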

⇒ P[Negative | “ik”] = 3132 / (3132 + 6245 + 3700) = 24%

⇒ P[Negative | “ik ben blij”] = ?   (“ik ben blij” is Dutch for “I am happy”)
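The single-unigram case is just a row-wise normalization of the count table. The sketch below reuses the four rows shown in the table; the dictionary layout and the function name are assumptions for the example.

```python
# Rows copied from the unigram count table.
counts = {
    "ik": {"Negative": 3132, "Neutral": 6245, "Positive": 3700},
    ":(": {"Negative": 365, "Neutral": 122, "Positive": 58},
    "domien": {"Negative": 22, "Neutral": 13, "Positive": 14},
    "neeeeee": {"Negative": 4, "Neutral": 1, "Positive": 0},
}


def p_label_given_unigram(counts, unigram, label):
    """Fraction of a unigram's occurrences that fall under the given label."""
    row = counts[unigram]
    return row[label] / sum(row.values())


print(round(p_label_given_unigram(counts, "ik", "Negative"), 2))  # 0.24
print(round(p_label_given_unigram(counts, ":(", "Negative"), 2))  # 0.67
```

A frequent function word like “ik” stays close to the overall class distribution, while “:(” is a much stronger signal for Negative.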

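Bayes' rule of conditional probabilities:

P[Negative | “ik ben blij”] = P[Negative] x P[“ik ben blij” | Negative] / P[“ik ben blij”]

- P[Negative]: prior (over all mentions)
- P[“ik ben blij” | Negative]: likelihood
- P[“ik ben blij”]: evidence (same for all sentiments)

Chain rule:

P[“ik ben blij” | Neg.] = P[“ik” | Neg.] (unigram)
                        x P[“ben” | Neg., “ik”] (bigram)
                        x P[“blij” | Neg., “ik ben”] (trigram)

Naïve Bayes keeps only the unigram factors: P[“ik ben blij” | Neg.] ≈ P[“ik” | Neg.] x P[“ben” | Neg.] x P[“blij” | Neg.]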
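A Naïve Bayes classifier drops the conditioning on preceding words in the chain rule and multiplies the prior by one unigram likelihood per word. The sketch below makes two assumptions that are not on the slides: add-one smoothing for unseen unigrams, and log probabilities to avoid underflow. The four training mentions are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(training_set):
    """Count mentions per label and unigrams per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)  # label -> unigram -> count
    vocabulary = set()
    for tokens, label in training_set:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the label with the highest prior x likelihood (in log space)."""
    label_counts, word_counts, vocabulary = model
    total_mentions = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_mentions)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing (assumption): unseen unigrams get a small weight.
            p = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        # The evidence term is the same for every label, so it is left out.
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training_set = [
    ("ik ben blij".split(), "Positive"),
    ("ik ben boos".split(), "Negative"),
    ("wat een goed idee".split(), "Positive"),
    ("wat een slecht idee".split(), "Negative"),
]
model = train(training_set)
print(classify("ik ben blij".split(), model))
```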

Improvements over Naïve Bayes

- Better features:
  - Bigrams, trigrams
  - Parts of speech
  - Tf/idf weighting
  - Grammatical dependencies (e.g. negation marking)
  - Named entities
- Alternative strategies to calculate feature weights from counts:
  - Transformed Normalized Weighted Naïve Bayes
  - Mutual Information
  - Maximum entropy
- Other approaches:
  - Sentiment lexicons (cf. current classifier)
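Two of the feature ideas above, bigrams and negation marking, can be sketched in a few lines. Note the simplification: the negation marker below uses a fixed window after a negation word rather than real grammatical dependencies, and the word list `NEGATIONS` is a hypothetical English-only example.

```python
NEGATIONS = {"not", "no", "never"}  # hypothetical, English-only


def mark_negation(tokens, scope=2):
    """Prefix up to `scope` tokens following a negation word with NOT_."""
    marked, remaining = [], 0
    for token in tokens:
        if token in NEGATIONS:
            marked.append(token)
            remaining = scope
        elif remaining > 0:
            marked.append("NOT_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return marked


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this is not a good idea".split()
features = mark_negation(tokens) + ngrams(tokens, 2)
print(features)
```

With these features, “good” and “NOT_good” get separate rows in the count table, so a unigram model can learn opposite weights for them.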

Evaluation

- In terms of Precision, Recall, F1, Accuracy, …
- Very good on “simple” tasks (comparable to humans)
  - e.g. spam detection
  - In general, tasks for which grammar and context are not important (negation, source/target/perspective roles, …)
- But rather bad on “difficult” tasks, including sentiment analysis (worse than humans)
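The evaluation measures named above can be computed per class from gold and predicted labels. The sketch below uses hypothetical labels; the function name `evaluate` and the returned dictionary layout are assumptions for the example.

```python
def evaluate(gold, predicted, label):
    """Precision, recall and F1 for one label, plus overall accuracy."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)  # true positives
    fp = sum(g != label and p == label for g, p in pairs)  # false positives
    fn = sum(g == label and p != label for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical gold and predicted labels.
gold = ["Negative", "Negative", "Neutral", "Positive", "Neutral", "Positive"]
predicted = ["Negative", "Neutral", "Neutral", "Positive", "Positive", "Positive"]
print(evaluate(gold, predicted, "Positive"))
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high.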


(Results of maxent and the current classifier on the balanced English student dataset)

Student 1:
Sentiment   Precision   Recall
Negative    71%         90%
Neutral     96%         87%
Positive    77%         92%

Maxent 2-grams:
Sentiment   Precision   Recall
Negative    79%         72%
Neutral     76%         76%
Positive    77%         73%

Current classifier:
Sentiment   Precision   Recall
Negative    42%         43%
Neutral     83%         60%
Positive    38%         76%

Many unresolved issues…

- Other languages (unsupervised learning/bootstrapping)
- Source/target resolution
- Classifiers trained on one dataset/topic do not perform well on other datasets/topics
- …

…and opportunities

Many information extraction problems can be cast as classification problems:

- Assigning tags to mentions
- Predicting the number of likes/retweets/… of mentions
- Deciding to whom to send or assign a message
- …
- In general, any problem where things must be “labeled”, “decided” or “predicted”, with a limited number of alternatives, and for which training data is available (can be user feedback!)
- And our users generate massive amounts of data!!

→ don’t hesitate to discuss ideas with me! ←

Part 2: Clojure

- Dynamic programming language targeting the JVM (and JavaScript)
- Combines the interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming
- Lisp dialect:
  - (almost) no syntax
    (+ 1 2) => 3
    (list '+ 1 2) => (+ 1 2)
  - Code as data
    (eval (list '+ 1 2)) => 3

Part 2: Clojure

- Project management through “leiningen”
  - bash$ lein new test-project
  - Add dependencies to project.clj, add code to src/test_project
  - bash$ lein uberjar => test-project.jar
  - bash$ java -jar test-project.jar
  - Online demo…
