TRANSCRIPT
Data Mining Methods in Trading Strategies
An analysis based on news sentiment
Wendi Zhu (wendizhu1991@gmail.com)
The Age of Big Data
Twitter: 8,000,000,000,000 bytes (8 terabytes)
Take Twitter SPY in 2010 as a simple example
Question: can mining news data from social media enhance trading? Yes.
Claims
1. A Wall Street news analytics company: sentiment data is a determinant of market moves after Federal Open Market Committee (FOMC) rate announcements, with a 75% accuracy rate in 2014.
2. A hedge fund report: "We captured a burst of negative sentiment on ResMed at 11:14 AM, October 9, 2014. Despite the serious allegations and the seeming validity of the report, it took the market over 60 minutes to react."
3. An institutional investor: a news-sentiment Open-to-Close (OTC) strategy on SPY returned 29.76% (before cost) over 2014, with a Sharpe ratio of 3.1.

Outline
1. Preview: first look at social media data
2. Implementation: parsing twitter news sentiment
3. Improvement: a brief summary of advanced methods
4. Trade the news: tentative trading practices
News Mining Step 1
What is a typical social media news item like?
A typical twitter user interface
Take Twitter SPY in 2010 as a simple example
What does a financial twitter news item look like?
• 2010-01-19T15:14:52Z "$SPY looks strong riding 5EMA - large gap from SMA50 - concern about 113 level gone for now" [Positive]
• 2010-12-09T13:28:49Z "$SPY managed to reclaim the 1227 support level, which should bode well for further price appreciation" [Positive]
• 2010-12-10T15:59:50Z "$SPY long" [Positive]
• 2010-01-21T20:57:21Z "$SPY closing the lows" [Negative]
• 2010-09-07T00:10:55Z "Last Sunday strength in patterns showing a bearish market move" [Negative]
• 2010-12-08T16:25:17Z "$SPY has now failed a breakout. Could recover, but for now this is a perfect picture of a failed breakout" [Negative]
• 2010-12-16T15:20:07Z "this week's patterns $SPY see here" [Neutral]
News Mining Step 2
How can a machine interpret news sentiment?
Parsing news sentiment using NLTK and Naive Bayes (supervised learning) for classification
An introduction to NLTK: NLTK is a platform for building Python programs to work with human language. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
Twitter text database description
Source: http://stocktwits.com
Format: JSON
Size: over 15 million items
Data entries: id, body, created_at, user name, followers, following, …

Sample entry (abridged; avatar fields omitted):
{
  "id": 918510,
  "body": "Options Trade in Nordstrom Today $JWN - http://www.cnbc.com/id/34644850…",
  "created_at": "2010-01-01T00:09:02Z",
  "user": {
    "id": 6328, "username": "OptionsHawk", "name": "Joe Kunkle",
    "official": false, "identity": "User", "classification": [],
    "join_date": "2009-11-01", "followers": 7072, "following": 31,
    "ideas": 18866, "following_stocks": 0, "location": "Boston",
    "bio": "Active Options Trader - OptionsHawk.com Founder",
    "website_url": "http://www.OptionsHawk.com",
    "trading_strategy": {
      "assets_frequently_traded": ["Equities", "Options", "Forex", "Futures"],
      "approach": "Technical", "holding_period": "Swing Trader",
      "experience": "Professional"
    }
  },
  "source": {"id": 1, "title": "StockTwits", "url": "http://stocktwits.com"},
  "symbols": [{"id": 6039, "symbol": "JWN", "title": "Nordstrom Inc",
               "exchange": "NYSE", "sector": "Services",
               "industry": "Apparel Stores", "trending": false}],
  "entities": {"sentiment": null}
}
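A record of this shape can be loaded with Python's standard json module. The sketch below uses a shortened sample entry with field names taken from the record above; it simply extracts the fields the sentiment pipeline needs.

```python
import json

# Abridged StockTwits-style record (field names as in the sample entry).
raw = '''{"id": 918510,
          "body": "Options Trade in Nordstrom Today $JWN",
          "created_at": "2010-01-01T00:09:02Z",
          "user": {"username": "OptionsHawk", "followers": 7072},
          "entities": {"sentiment": null}}'''

tweet = json.loads(raw)
# Keep only the fields the sentiment pipeline uses: timestamp, text, label.
record = (tweet["created_at"], tweet["body"], tweet["entities"]["sentiment"])
```

Note that `entities.sentiment` is null in the raw feed, which is why the labels for the training set had to be assigned manually.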
• Starting set: 10,000 manually labeled twitter news items
• Distribution of sentiment:

Initial training set and sample test

SENTIMENT   TRAINING SET   TESTING   TOTAL
POSITIVE    2,379          807       3,186
NEUTRAL     3,849          1,248     5,097
NEGATIVE    1,214          428       1,642
SUBTOTAL    7,442          2,483     9,925
NULL        58             17        75
TOTAL       7,500          2,500     10,000
1) Prepare the training set: data cleaning (removing nulls, web links, etc.) and import
• pos_tweets = ("$SPY looks strong riding 5EMA - large gap from…", "positive"), …
• neg_tweets = ("$SPY closing the lows", "negative"), …
2) Split each text sentence into word features
• "spy", "looks", "strong", "riding", "5ema", "large", "gap", "from", "sma50", "concern", "about", "level", "gone", "for", "now", …
3) Build a dictionary: a collection of all the recognized word features in the training set
4) Map each text onto word features
• contains(spy): True
• contains(support): False
• contains(strong): True …
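The four steps above can be sketched in plain Python (a minimal stdlib-only illustration; the helper names `tokenize`, `build_dictionary`, and `extract_features` are mine, not from the original code):

```python
def tokenize(text):
    # Step 2: split a tweet into lowercase word features, stripping
    # cashtag symbols and trailing punctuation.
    return [w.strip("$.,!?-").lower() for w in text.split() if w.strip("$.,!?-")]

def build_dictionary(labeled_tweets):
    # Step 3: collect every recognized word feature in the training set.
    vocab = set()
    for text, _label in labeled_tweets:
        vocab.update(tokenize(text))
    return vocab

def extract_features(text, vocab):
    # Step 4: map a tweet onto contains(word) indicators.
    words = set(tokenize(text))
    return {f"contains({w})": (w in words) for w in vocab}

# Step 1: a tiny cleaned, labeled training set.
pos_tweets = [("$SPY looks strong riding 5EMA", "positive")]
neg_tweets = [("$SPY closing the lows", "negative")]
train = pos_tweets + neg_tweets

vocab = build_dictionary(train)
feats = extract_features("$SPY looks strong today", vocab)
```

The `feats` dictionary has exactly the contains(...) form shown above, ready to feed a classifier.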
Parsing news sentiment: dictionary mapping
5) Apply this mapping to all the news texts to obtain the following form:

            'spy'   'support'   'gone'   …
Twitter_1   True    False       True     …
Twitter_2   True    False       False    …
Twitter_3   …       …           …        …

Encoded numerically:

            Sentiment   WordFeature_1   WordFeature_2   WordFeature_3   …
Twitter_1   1 (Pos)     1 (True)        0 (False)       1 (True)        …
Twitter_2   -1 (Neg)    1 (True)        0 (False)       0 (False)       …
Twitter_3   0 (Neu)     …               …               …               …

A typical classification problem.
6) Classification: Naive Bayes classifier
6) A simple description of Naive Bayes

Bayes' formula:
p(C_k | x) = p(C_k) p(x | C_k) / p(x)

The "naive" assumption: each feature is conditionally independent of every other feature given the class. In this twitter example, it means the word features independently affect the sentiment of the text, so p(x | C_k) factors into a product over word features.

Each conditional probability is estimated from training counts:
p(x_i | C_k) = (number of news items with word feature x_i and class C_k) / (number of news items of class C_k)
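This count-based estimate can be sketched from scratch (a stdlib-only illustration, not the NLTK implementation; the add-one smoothing term `alpha` is my addition, to avoid zero probabilities for unseen words):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_tweets):
    # Count, per class, how many news items contain each word feature,
    # estimating p(x_i | C_k) as in the formula above.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_tweets:
        class_counts[label] += 1
        for w in set(text.lower().split()):
            word_counts[label][w] += 1
    return class_counts, word_counts

def classify(text, class_counts, word_counts, alpha=1.0):
    # Naive assumption: features are conditionally independent given the class,
    # so log-probabilities simply add up.
    total = sum(class_counts.values())
    words = set(text.lower().split())
    scores = {}
    for c, n_c in class_counts.items():
        logp = math.log(n_c / total)  # class prior p(C_k)
        for w in words:
            # Smoothed count ratio for p(x_i | C_k).
            logp += math.log((word_counts[c][w] + alpha) / (n_c + 2 * alpha))
        scores[c] = logp
    return max(scores, key=scores.get)

train = [("spy looks strong", "positive"),
         ("spy reclaimed support strong", "positive"),
         ("spy closing the lows", "negative"),
         ("spy failed breakout", "negative")]
cc, wc = train_nb(train)
label = classify("spy strong breakout", cc, wc)
```

Even with one "negative" word (breakout), the two strongly positive features dominate, which is exactly the independence assumption at work.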
7) Model trained: we obtained 14,356 word features. The most informative features are listed in the results table below.
8) In-sample test and out-of-sample test
• Tweet: "$SPY has now failed a breakout. We could recover but for now this is a perfect picture of a failed breakout"
  → Negative, Prob(negative) = 0.85525
• Tweet: "SPY UP I like that"
  → Positive, Prob(positive) = 0.4123, Prob(negative) = 0.1936
• Total in-sample accuracy: 79.2%
• Total out-of-sample accuracy: 36.3%
• With a large enough training set, the accuracy rate would get much higher.
Result: most informative features

NEWS ITEM CONTAINS   RATIO
'widely'             positive : negative = 219.8 : 1.0
'held'               positive : negative = 172.4 : 1.0
'most'               positive : negative = 45.7 : 1.0
'fall'               negative : positive = 45.4 : 1.0
'might'              negative : neutral  = 30.6 : 1.0
A simple summary: NLTK and the Naive Bayes method
Pros:
1. A basic approach to sentiment analysis; easy to use.
2. Effective if the training set is large enough.
3. Ability to learn: as the training set gets larger, the results get more and more accurate.
Cons:
1. Fails to grasp connections between words.
2. Doesn't consider the sequence of words.
3. Includes non-relevant word features.
Possible improvements:
1. Larger training set.
2. PCA, to address the problem of too many features.
3. Filtering: remove spam and meaningless tweets.
4. Detecting short sequences of words.
Currently working on them…
News Mining Step 3
Other advanced methods for measuring news sentiment
Other advanced approaches
Stanford NLP: http://nlp.stanford.edu
Paper: Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
Core idea: a maximum entropy classifier, otherwise known as multiclass logistic regression. Maximum entropy does not assume that the features are conditionally independent of each other.
vivekn: http://github.com/vivekn/sentiment
Paper: Fast and Accurate Sentiment Classification Using an Enhanced Naive Bayes Model. Intelligent Data Engineering and Automated Learning (IDEAL 2013), Lecture Notes in Computer Science, Volume 8206, 2013, pp. 194-201.
Core idea: the tool examines individual words and short sequences of words (n-grams); "not bad" will be classified as positive despite containing two individual words with negative sentiment.
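The n-gram idea can be sketched as a feature extractor (my own minimal illustration, not vivekn's code): keeping "not bad" as a single bigram feature lets a classifier learn a weight for the pair that differs from the weights of its parts.

```python
def ngram_features(text, n=2):
    # Unigram word features plus adjacent n-grams, so negations like
    # "not bad" survive as one unit instead of two separate words.
    tokens = text.lower().split()
    feats = set(tokens)
    for i in range(len(tokens) - n + 1):
        feats.add(" ".join(tokens[i:i + n]))
    return feats

feats = ngram_features("this is not bad at all")
```

A training set in which "not bad" co-occurs with positive labels then pushes the bigram's learned sentiment positive, even though "bad" alone stays negative.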
More advanced approaches
• Other ones I am currently working on:
• vaderSentiment: https://github.com/cjhutto/vaderSentiment
  Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI, June 2014.
• Indico: https://indico.io
[Figure: daily averaged news sentiment from the indico and vader engines plotted against SPY cumulative return (spy_cum_return), December 2009 to February 2011.]
A plot of sentiment engines based on 2010 SPY 580000 twitter news
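The daily averaged sentiment series underlying such a plot can be computed by grouping per-tweet scores by date (a sketch with made-up scores, not the actual 580,000-tweet data):

```python
from collections import defaultdict

# (date, sentiment score) pairs as an engine like vader might emit,
# one per scored tweet; the values here are invented for illustration.
scored = [("2010-01-19", 0.8), ("2010-01-19", -0.2), ("2010-01-21", -0.5)]

by_day = defaultdict(list)
for day, score in scored:
    by_day[day].append(score)

# Daily average: the quantity plotted on the sentiment axis.
daily_avg = {day: sum(s) / len(s) for day, s in by_day.items()}
```

The resulting series can then be aligned with SPY's daily cumulative return for comparison, as in the figure.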
More advanced approaches
• Thomson Reuters News Analytics: http://thomsonreuters.com/en.html
• GATE (+ANNIE): http://gate.ac.uk
• LingPipe: http://alias-i.com/lingpipe
• WEKA NLP: http://www.cs.waikato.ac.nz/ml/weka/
• OpenNLP: http://incubator.apache.org/opennlp/
• JULIE Lab: http://www.julielab.de
• Research still ongoing…
• Visit my personal site: https://public.tableausoftware.com/views/SPYVadernewssentiment2010SPY?showVizHome=no1
Thank you
bull 2010-12-09T132849Z $SPY managed to reclaim the 1227 support level which should bode well for further price appreciation
bull 2010-12-10T155950Z $SPY long
bull 2010-01-21T205721Z $SPY closing the lows
bull 2010-09-07T001055Z Last Sunday strength in patterns showing a bearish market move
bull 2010-12-08T162517Z $SPY has now failed a breakout Could recover but for now this is a perfect picture of a failed breakout
bull 2010-12-16T152007Z this weeks patterns $SPY see here
Positive
Positive
Positive
Negative
Negative
Negative
What does a financial twitter news look like
Neutral
News Mining Step 2
How can we interpret the news sentiment by machine
Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification
An introduction to NLTK
NLTK is a platform for building Python programs to work with human language. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
Twitter text database description
Source: http://stocktwits.com
Format: JSON
Size: over 15 million records
Data entries: id, body, created_at, user (name, followers, following), symbols, entities, ...
Sample record (abbreviated):
{"id": 918510, "body": "Options Trade in Nordstrom Today $JWN - http://www.cnbc.com/id/34644850?site=14081545&for=cnbc", "created_at": "2010-01-01T00:09:02Z", "user": {"id": 6328, "username": "OptionsHawk", "name": "Joe Kunkle", "official": false, "identity": "User", "join_date": "2009-11-01", "followers": 7072, "following": 31, "ideas": 18866, "location": "Boston", "bio": "Active Options Trader - OptionsHawk.com Founder", "website_url": "http://www.OptionsHawk.com", "assets_frequently_traded": ["Equities", "Options", "Forex", "Futures"], "approach": "Technical", "holding_period": "Swing Trader", "experience": "Professional"}, "source": {"id": 1, "title": "StockTwits", "url": "http://stocktwits.com"}, "symbols": [{"id": 6039, "symbol": "JWN", "title": "Nordstrom Inc", "exchange": "NYSE", "sector": "Services", "industry": "Apparel Stores", "trending": false}], "entities": {"sentiment": null}}
• Starting set: 10,000 manually labeled Twitter news items
• Distribution of sentiment
Initial training set / sample test

SENTIMENT   TRAINING SET   TESTING   TOTAL
POSITIVE        2379          807     3186
NEUTRAL         3849         1248     5097
NEGATIVE        1214          428     1642
SUBTOTAL        7442         2483     9925
NULL              58           17       75
TOTAL           7500         2500    10000
1) Prepare training set - data cleaning (removing nulls, web links, etc.) & import
• pos_tweets = ('$SPY looks strong riding 5EMA - large gap from...', 'positive'), ...
• neg_tweets = ('$SPY closing the lows', 'negative'), ...
2) Split the text sentence into word features
• 'spy', 'looks', 'strong', 'riding', '5ema', 'large', 'gap', 'from', 'sma50', 'concern', 'about', 'level', 'gone', 'for', 'now', ...
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
• contains(spy): True
• contains(support): False
• contains(strong): True, ...
Parsing News Sentiment: Dictionary Mapping
5) Apply this mapping to all the news texts to get the following form:
Parsing News Sentiment

            'spy'   'support'   'gone'   ...
Twitter_1   True    False       True     ...
Twitter_2   True    False       False    ...
Twitter_3   ...     ...         ...      ...

            Sentiment   WordFeature_1   WordFeature_2   WordFeature_3   ...
Twitter_1    1 (Pos)     1 (True)        0 (False)       1 (True)       ...
Twitter_2   -1 (Neg)     1 (True)        0 (False)       0 (False)      ...
Twitter_3    0 (Neu)     ...             ...             ...            ...
A typical classification problem
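Steps 2) through 5) above can be sketched in a few lines of plain Python. This is an illustration only: the names `word_features`, `dictionary`, and `extract` are mine, and the crude tokenizer stands in for NLTK's real one.

```python
# Minimal sketch of steps 2)-5): split tweets into word features,
# build a dictionary, and map each tweet onto contains(word) features.

def word_features(text):
    # Step 2: strip tickers/punctuation, lowercase, drop very short tokens
    return [w.strip("$.,!?-").lower() for w in text.split() if len(w) > 2]

labeled = [
    ("$SPY looks strong riding 5EMA large gap from SMA50", "positive"),
    ("$SPY closing the lows", "negative"),
]

# Step 3: dictionary of all recognized word features in the training set
dictionary = sorted({w for text, _ in labeled for w in word_features(text)})

# Steps 4-5: map a text onto {contains(word): True/False} for every word
def extract(text):
    present = set(word_features(text))
    return {f"contains({w})": (w in present) for w in dictionary}

features = extract("$SPY looks strong today")
```

Each tweet then becomes one row of the boolean feature table shown above.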
6) Classification
Naive Bayes classifier
A simple description of Naive Bayes
Bayes formula:
p(C_k | x_1, ..., x_n) ∝ p(C_k) · p(x_1 | C_k) · ... · p(x_n | C_k)
The naive assumption comes into play: assume that each feature is conditionally independent of every other feature.
In this Twitter example, it means the word features independently affect the sentiment of the text.
Parsing News Sentiment
The word-feature likelihoods are estimated by counting:
p(x_i | C_k) = (number of news items with word feature x_i and class C_k) / (number of news items of class C_k)
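As a hedged sketch of what this classification step does with those counts (pure Python for illustration; the deck itself uses NLTK's classifier, and the tiny training set below is invented):

```python
# Toy Naive Bayes: estimate p(C_k) and p(x_i | C_k) by counting over
# labeled tweets, then score a new tweet. Illustrative sketch only.
from collections import Counter, defaultdict

train = [
    ({"strong", "gap"}, "positive"),
    ({"strong", "support"}, "positive"),
    ({"closing", "lows"}, "negative"),
    ({"failed", "breakout"}, "negative"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)          # word_counts[label][word]
for words, label in train:
    for w in words:
        word_counts[label][w] += 1

def classify(words):
    scores = {}
    for label, n in class_counts.items():
        # p(C_k) * prod_i p(x_i | C_k), with add-one smoothing to avoid zeros
        p = n / len(train)
        for w in words:
            p *= (word_counts[label][w] + 1) / (n + 2)
        scores[label] = p
    total = sum(scores.values())            # normalize into probabilities
    return {label: p / total for label, p in scores.items()}

probs = classify({"strong", "breakout"})
```

The returned dictionary plays the role of the Prob(positive)/Prob(negative) values reported on the next slide.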
7) Model trained: we got 14,356 word features. Most informative features include:
8) In-sample test and out-of-sample test
• Tweet = '$SPY has now failed a breakout. We could recover, but for now this is a perfect picture of a failed breakout.'
• Negative. Prob(negative) = 0.85525
• Tweet = 'SPY UP I like that'
• Positive. Prob(positive) = 0.4123, Prob(negative) = 0.1936
• Total in-sample accuracy: 79.2%
• Total out-of-sample accuracy: 36.3%
• With a larger training set, the accuracy rate should improve.
Result
NEWS ITEM CONTAINS      RATIO
'widely'    positive : negative = 219.8 : 1.0
'held'      positive : negative = 172.4 : 1.0
'most'      positive : negative =  45.7 : 1.0
'fall'      negative : positive =  45.4 : 1.0
'might'     negative : neutral  =  30.6 : 1.0
Pros: 1. A basic approach to sentiment analysis; easy to use. 2. Effective if the training set is large enough. 3. Ability to learn: as the training set gets larger, the results get more and more accurate (intelligence).
Cons: 1. Fails to grasp the connection between words. 2. Doesn't consider the sequence of words. 3. Includes non-relevant word features.
A simple summary: the NLTK and Naive Bayes method
Possible improvements: 1. Larger training set. 2. PCA, to address the problem of too many features. 3. Filtering: remove spam and meaningless tweets. 4. Detecting short sequences of words.
Currently working on them...
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP - http://nlp.stanford.edu
Paper: Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
Core idea: a maximum entropy classifier, otherwise known as multiclass logistic regression. Maximum entropy does not assume that the features are conditionally independent of each other.
Vivekn - http://github.com/vivekn/sentiment
Paper: Fast and Accurate Sentiment Classification Using an Enhanced Naive Bayes Model. Intelligent Data Engineering and Automated Learning (IDEAL) 2013, Lecture Notes in Computer Science, Vol. 8206, 2013, pp. 194-201.
Core idea: the tool works by examining individual words and short sequences of words (n-grams); 'not bad' will be classified as positive despite containing two individually negative words.
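The n-gram idea can be illustrated with a minimal feature extractor (my own sketch, not vivekn's actual code): emitting bigrams alongside unigrams gives the classifier a single "not bad" feature to learn from, instead of two independently negative words.

```python
# Sketch of n-gram features: unigrams plus adjacent word pairs (bigrams).

def ngram_features(text):
    words = text.lower().split()
    unigrams = words
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return unigrams + bigrams

feats = ngram_features("spy is not bad today")
# feats contains "not" and "bad", but also the joint feature "not bad",
# which a Naive Bayes model can learn to associate with positive sentiment
```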
More advanced approaches
• Other ones I am currently working on:
• VADER sentiment - https://github.com/cjhutto/vaderSentiment
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI, June 2014.
• Indico - https://indico.io
[Figure: daily averaged news sentiment (left axis, -0.10 to 0.20) from the Indico and VADER engines, plotted against SPY cumulative return (right axis, -0.4 to 1.0), December 2009 through February 2011. Series: indico, vader, spy_cum_return.]
Daily averaged news sentiment
A plot of sentiment engines based on 580,000 SPY Twitter news items from 2010
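The "daily averaged news sentiment" series behind such a plot can be sketched as grouping per-tweet scores by calendar day. The timestamps and scores below are invented examples, not the actual StockTwits data or either engine's output.

```python
# Aggregate per-tweet sentiment scores into a daily average series.
from collections import defaultdict
from datetime import datetime

scored = [
    ("2010-01-19T15:14:52Z", 0.6),   # hypothetical positive tweet
    ("2010-01-19T20:57:21Z", -0.4),  # hypothetical negative tweet
    ("2010-01-20T13:28:49Z", 0.2),
]

daily = defaultdict(list)
for ts, score in scored:
    day = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").date()
    daily[day].append(score)

# one averaged sentiment value per trading day
daily_avg = {day: sum(s) / len(s) for day, s in daily.items()}
```

The resulting series is what gets overlaid against SPY's cumulative return.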
More advanced approaches
• Thomson Reuters News Analytics - http://thomsonreuters.com
• GATE (+ANNIE) - http://gate.ac.uk
• LingPipe - http://alias-i.com/lingpipe
• WEKA NLP - http://www.cs.waikato.ac.nz/ml/weka
• OpenNLP - http://incubator.apache.org/opennlp
• JULIE - http://www.julielab.de
• Research still ongoing...
• Visit my personal site: https://public.tableausoftware.com/views/SPYVadernewssentiment2010/SPY?showVizHome=no1
Thank you
1 A Wall Street news analytics company Sentiment
data is a determinant of market moves after Federal Open Market Committee (FOMC) rate announcements
with a 75 accuracy rate in 2014
2 A Hedge Fund report We capture a burst of
negative sentiment of ResMed at 1114AM October 9 2014 Despite the serious allegations and the seeming
validity of the report it took the market over 60minutes to react
3 An Institutional Investor News sentiment Open-
to-Close (OTC) strategy on SPY returned 2976(before cost) over 2014 with a Sharpe Ratio of 31
Claims
1
2
3
4
PreviewFirst look at social media data
ImplemetationParsing twitter news sentiment
ImprovementA brief summary of Advanced methods
Trade the newsTentative trading practices
News Mining Step 1
Whatis a typical Social Media news like
A typical twitter user interface
Take Twitter SPY in 2010 as a simple example
bull 2010-01-19T151452Z $SPY looks strong riding 5EMA - large gap from SMA50 - concern about 113 level gone for now
bull 2010-12-09T132849Z $SPY managed to reclaim the 1227 support level which should bode well for further price appreciation
bull 2010-12-10T155950Z $SPY long
bull 2010-01-21T205721Z $SPY closing the lows
bull 2010-09-07T001055Z Last Sunday strength in patterns showing a bearish market move
bull 2010-12-08T162517Z $SPY has now failed a breakout Could recover but for now this is a perfect picture of a failed breakout
bull 2010-12-16T152007Z this weeks patterns $SPY see here
Positive
Positive
Positive
Negative
Negative
Negative
What does a financial twitter news look like
Neutral
News Mining Step 2
How can we interpret the news sentiment by machine
Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification
An introduction to NLTKNLTK is a platform for building Python programs to work with human
language It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries for classification tokenization stemming tagging parsing and semantic reasoning
Twitter text Database descriptionSource httpstocktwitscomFormat JSONSize over 15 millionData entries Id body create at user name followers following hellipid918510bodyOptions Trade in Nordstrom Today $JWN - httpwwwcnbccomid34644850site14081545forcnbccreated_at2010-01-01T000902Z userid6328usernameOptionsHawknameJoe Kunkleavatar_urlhttpavatarsstocktwitsnetproduction6328thumb-1290207489png avatar_url_sslhttpss3amazonawscomst-avatarsproduction6328thumb-1290207489png officialfalse identityUserclassification[] ldquojoin_date2009-11- 01followers7072 following31ideas18866 following_stocks0locationBoston bioActive Options Trader - OptionsHawkcom Founder website_url httpwwwOptionsHawkcom trading_strategy assets_frequently_tradedldquo [Equities OptionsForexFutures]approachTechnicalholding_periodSwing TraderexperienceProfessionalsourceid1titleStockTwitsurlhttpstocktwitscomsymbols[id6039symbolJWNtitleNordstrom IncexchangeNYSEsectorServicesindustryApparel Storestrendingfalse]entitiessentimentnull
bull Starting Set10000 manually labeled twitter news items
bull Distribution of sentiment
Initial training set sample test
SENTIMENT TRAININGSET TESTING TOTAL
POSITIVE 2379 807 3186
NEUTRAL 3849 1248 5097
NEGATIVE 1214 428 1642
SUBTOTAL 7442 2483 9925
NULL 58 17 75
TOTAL 7500 2500 10000
1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import
bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip
bull neg_tweets = $SPY closing the lows negativehellip
2) Split the text sentence into word features
bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
bull contains(spy) True
bull contains(support) False
bull contains(strong) True helliphellip
Parsing News Sentiment Dictionary Mapping
5) Apply this mapping into all the news texts and get the following form
Parsing News Sentiment
lsquospyrsquo supportrsquo lsquogonersquo hellip
Twitter_1 True False True hellip
Twitter_2 True False False hellip
Twitter_3 hellip hellip hellip hellip
SentimentWordFeature
_1WordFeature
_2WordFeature
_3hellip
Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip
Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip
Twitter_3 0 (Neu) hellip hellip hellip hellip
A typical classification problem
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes Formula
The naive assumptions come into play assume that each feature is conditionally independent
of every other feature
In this twitter example it means the word features independently affect the sentiment of the text
Parsing News Sentiment
k
ki
C class with items news of
C class and x feature word with items news of )C|x(p ki
7) Model trained We got 14356 word features Most Informative Features include
8) In sample test and out-of-sample test
bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout
bull Negative Prob(lsquonegative)= 085525
bull Tweet= lsquoSPY UP I like that
bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936
bull TOTAL in-sample accuracy 792
bull TOTAL out-of-sample accuracy 363
bull With a large enough training set the accuracy rate would get very high
Result
NEWS ITEM CONTAINS RATIO
lsquowidelyrsquo positi negati = 2198 10
lsquoheldrsquo positi negati = 1724 10
lsquomostrsquo positi negati = 457 10
lsquofallrsquo negati positi = 454 10
lsquomightrsquo negati neutra = 306 10
Pros 1 A basic approach in sentiment analysis Easy to use 2 Effective if the training set is large enough3 Ability to learn as the training set gets larger the results get
more and more accurate(intelligence)Cons 1 Failure in grasping the connection between words 2 Doesnrsquot consider the sequence of words3 Non-relevant word features
A simple summary Nltk and Naive Bayes method
Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words
Currently working on themhellip
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other
Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment
More advanced approaches
bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014
bullIndico-httpsindicoio
-01
-005
0
005
01
015
02
-04
-02
0
02
04
06
08
1
1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011
indico vader spy_cum_return
Daily averaged news sentiment
A plot of sentiment engines based on 2010 SPY 580000 twitter news
More advanced approaches
bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde
bullResearch still on goinghellipbullVisit my personal sitehttps publictableausoftwarecom viewsSPYVadernewssentiment2010SPYshowVizHome=no1
Thank you
1
2
3
4
PreviewFirst look at social media data
ImplemetationParsing twitter news sentiment
ImprovementA brief summary of Advanced methods
Trade the newsTentative trading practices
News Mining Step 1
Whatis a typical Social Media news like
A typical twitter user interface
Take Twitter SPY in 2010 as a simple example
bull 2010-01-19T151452Z $SPY looks strong riding 5EMA - large gap from SMA50 - concern about 113 level gone for now
bull 2010-12-09T132849Z $SPY managed to reclaim the 1227 support level which should bode well for further price appreciation
bull 2010-12-10T155950Z $SPY long
bull 2010-01-21T205721Z $SPY closing the lows
bull 2010-09-07T001055Z Last Sunday strength in patterns showing a bearish market move
bull 2010-12-08T162517Z $SPY has now failed a breakout Could recover but for now this is a perfect picture of a failed breakout
bull 2010-12-16T152007Z this weeks patterns $SPY see here
Positive
Positive
Positive
Negative
Negative
Negative
What does a financial twitter news look like
Neutral
News Mining Step 2
How can we interpret the news sentiment by machine
Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification
An introduction to NLTKNLTK is a platform for building Python programs to work with human
language It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries for classification tokenization stemming tagging parsing and semantic reasoning
Twitter text Database descriptionSource httpstocktwitscomFormat JSONSize over 15 millionData entries Id body create at user name followers following hellipid918510bodyOptions Trade in Nordstrom Today $JWN - httpwwwcnbccomid34644850site14081545forcnbccreated_at2010-01-01T000902Z userid6328usernameOptionsHawknameJoe Kunkleavatar_urlhttpavatarsstocktwitsnetproduction6328thumb-1290207489png avatar_url_sslhttpss3amazonawscomst-avatarsproduction6328thumb-1290207489png officialfalse identityUserclassification[] ldquojoin_date2009-11- 01followers7072 following31ideas18866 following_stocks0locationBoston bioActive Options Trader - OptionsHawkcom Founder website_url httpwwwOptionsHawkcom trading_strategy assets_frequently_tradedldquo [Equities OptionsForexFutures]approachTechnicalholding_periodSwing TraderexperienceProfessionalsourceid1titleStockTwitsurlhttpstocktwitscomsymbols[id6039symbolJWNtitleNordstrom IncexchangeNYSEsectorServicesindustryApparel Storestrendingfalse]entitiessentimentnull
bull Starting Set10000 manually labeled twitter news items
bull Distribution of sentiment
Initial training set sample test
SENTIMENT TRAININGSET TESTING TOTAL
POSITIVE 2379 807 3186
NEUTRAL 3849 1248 5097
NEGATIVE 1214 428 1642
SUBTOTAL 7442 2483 9925
NULL 58 17 75
TOTAL 7500 2500 10000
1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import
bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip
bull neg_tweets = $SPY closing the lows negativehellip
2) Split the text sentence into word features
bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
bull contains(spy) True
bull contains(support) False
bull contains(strong) True helliphellip
Parsing News Sentiment Dictionary Mapping
5) Apply this mapping into all the news texts and get the following form
Parsing News Sentiment
lsquospyrsquo supportrsquo lsquogonersquo hellip
Twitter_1 True False True hellip
Twitter_2 True False False hellip
Twitter_3 hellip hellip hellip hellip
SentimentWordFeature
_1WordFeature
_2WordFeature
_3hellip
Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip
Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip
Twitter_3 0 (Neu) hellip hellip hellip hellip
A typical classification problem
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes Formula
The naive assumptions come into play assume that each feature is conditionally independent
of every other feature
In this twitter example it means the word features independently affect the sentiment of the text
Parsing News Sentiment
k
ki
C class with items news of
C class and x feature word with items news of )C|x(p ki
7) Model trained We got 14356 word features Most Informative Features include
8) In sample test and out-of-sample test
bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout
bull Negative Prob(lsquonegative)= 085525
bull Tweet= lsquoSPY UP I like that
bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936
bull TOTAL in-sample accuracy 792
bull TOTAL out-of-sample accuracy 363
bull With a large enough training set the accuracy rate would get very high
Result
NEWS ITEM CONTAINS RATIO
lsquowidelyrsquo positi negati = 2198 10
lsquoheldrsquo positi negati = 1724 10
lsquomostrsquo positi negati = 457 10
lsquofallrsquo negati positi = 454 10
lsquomightrsquo negati neutra = 306 10
Pros 1 A basic approach in sentiment analysis Easy to use 2 Effective if the training set is large enough3 Ability to learn as the training set gets larger the results get
more and more accurate(intelligence)Cons 1 Failure in grasping the connection between words 2 Doesnrsquot consider the sequence of words3 Non-relevant word features
A simple summary Nltk and Naive Bayes method
Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words
Currently working on themhellip
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other
Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment
More advanced approaches
bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014
bullIndico-httpsindicoio
-01
-005
0
005
01
015
02
-04
-02
0
02
04
06
08
1
1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011
indico vader spy_cum_return
Daily averaged news sentiment
A plot of sentiment engines based on 2010 SPY 580000 twitter news
More advanced approaches
bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde
bullResearch still on goinghellipbullVisit my personal sitehttps publictableausoftwarecom viewsSPYVadernewssentiment2010SPYshowVizHome=no1
Thank you
News Mining Step 1
Whatis a typical Social Media news like
A typical twitter user interface
Take Twitter SPY in 2010 as a simple example
bull 2010-01-19T151452Z $SPY looks strong riding 5EMA - large gap from SMA50 - concern about 113 level gone for now
bull 2010-12-09T132849Z $SPY managed to reclaim the 1227 support level which should bode well for further price appreciation
bull 2010-12-10T155950Z $SPY long
bull 2010-01-21T205721Z $SPY closing the lows
bull 2010-09-07T001055Z Last Sunday strength in patterns showing a bearish market move
bull 2010-12-08T162517Z $SPY has now failed a breakout Could recover but for now this is a perfect picture of a failed breakout
bull 2010-12-16T152007Z this weeks patterns $SPY see here
Positive
Positive
Positive
Negative
Negative
Negative
What does a financial twitter news look like
Neutral
News Mining Step 2
How can we interpret the news sentiment by machine
Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification
An introduction to NLTKNLTK is a platform for building Python programs to work with human
language It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries for classification tokenization stemming tagging parsing and semantic reasoning
Twitter text Database descriptionSource httpstocktwitscomFormat JSONSize over 15 millionData entries Id body create at user name followers following hellipid918510bodyOptions Trade in Nordstrom Today $JWN - httpwwwcnbccomid34644850site14081545forcnbccreated_at2010-01-01T000902Z userid6328usernameOptionsHawknameJoe Kunkleavatar_urlhttpavatarsstocktwitsnetproduction6328thumb-1290207489png avatar_url_sslhttpss3amazonawscomst-avatarsproduction6328thumb-1290207489png officialfalse identityUserclassification[] ldquojoin_date2009-11- 01followers7072 following31ideas18866 following_stocks0locationBoston bioActive Options Trader - OptionsHawkcom Founder website_url httpwwwOptionsHawkcom trading_strategy assets_frequently_tradedldquo [Equities OptionsForexFutures]approachTechnicalholding_periodSwing TraderexperienceProfessionalsourceid1titleStockTwitsurlhttpstocktwitscomsymbols[id6039symbolJWNtitleNordstrom IncexchangeNYSEsectorServicesindustryApparel Storestrendingfalse]entitiessentimentnull
bull Starting Set10000 manually labeled twitter news items
bull Distribution of sentiment
Initial training set sample test
SENTIMENT TRAININGSET TESTING TOTAL
POSITIVE 2379 807 3186
NEUTRAL 3849 1248 5097
NEGATIVE 1214 428 1642
SUBTOTAL 7442 2483 9925
NULL 58 17 75
TOTAL 7500 2500 10000
1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import
bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip
bull neg_tweets = $SPY closing the lows negativehellip
2) Split the text sentence into word features
bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
bull contains(spy) True
bull contains(support) False
bull contains(strong) True helliphellip
Parsing News Sentiment Dictionary Mapping
5) Apply this mapping into all the news texts and get the following form
Parsing News Sentiment
lsquospyrsquo supportrsquo lsquogonersquo hellip
Twitter_1 True False True hellip
Twitter_2 True False False hellip
Twitter_3 hellip hellip hellip hellip
SentimentWordFeature
_1WordFeature
_2WordFeature
_3hellip
Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip
Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip
Twitter_3 0 (Neu) hellip hellip hellip hellip
A typical classification problem
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes Formula
The naive assumptions come into play assume that each feature is conditionally independent
of every other feature
In this twitter example it means the word features independently affect the sentiment of the text
Parsing News Sentiment
k
ki
C class with items news of
C class and x feature word with items news of )C|x(p ki
7) Model trained We got 14356 word features Most Informative Features include
8) In sample test and out-of-sample test
bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout
bull Negative Prob(lsquonegative)= 085525
bull Tweet= lsquoSPY UP I like that
bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936
bull TOTAL in-sample accuracy 792
bull TOTAL out-of-sample accuracy 363
bull With a large enough training set the accuracy rate would get very high
Result
NEWS ITEM CONTAINS RATIO
lsquowidelyrsquo positi negati = 2198 10
lsquoheldrsquo positi negati = 1724 10
lsquomostrsquo positi negati = 457 10
lsquofallrsquo negati positi = 454 10
lsquomightrsquo negati neutra = 306 10
Pros 1 A basic approach in sentiment analysis Easy to use 2 Effective if the training set is large enough3 Ability to learn as the training set gets larger the results get
more and more accurate(intelligence)Cons 1 Failure in grasping the connection between words 2 Doesnrsquot consider the sequence of words3 Non-relevant word features
A simple summary Nltk and Naive Bayes method
Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words
Currently working on themhellip
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other
Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment
More advanced approaches
bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014
bullIndico-httpsindicoio
-01
-005
0
005
01
015
02
-04
-02
0
02
04
06
08
1
1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011
indico vader spy_cum_return
Daily averaged news sentiment
A plot of sentiment engines based on 2010 SPY 580000 twitter news
More advanced approaches
bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde
• Research still ongoing…
• Visit my personal site: https://public.tableausoftware.com/views/SPYVadernewssentiment2010/SPY?showVizHome=no
Thank you
A simple summary Nltk and Naive Bayes method
Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words
Currently working on themhellip
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other
Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment
More advanced approaches
bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014
bullIndico-httpsindicoio
-01
-005
0
005
01
015
02
-04
-02
0
02
04
06
08
1
1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011
indico vader spy_cum_return
Daily averaged news sentiment
A plot of sentiment engines based on 2010 SPY 580000 twitter news
More advanced approaches
bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde
bullResearch still on goinghellipbullVisit my personal sitehttps publictableausoftwarecom viewsSPYVadernewssentiment2010SPYshowVizHome=no1
Thank you
Take Twitter SPY in 2010 as a simple example
bull 2010-01-19T151452Z $SPY looks strong riding 5EMA - large gap from SMA50 - concern about 113 level gone for now
bull 2010-12-09T132849Z $SPY managed to reclaim the 1227 support level which should bode well for further price appreciation
bull 2010-12-10T155950Z $SPY long
bull 2010-01-21T205721Z $SPY closing the lows
bull 2010-09-07T001055Z Last Sunday strength in patterns showing a bearish market move
bull 2010-12-08T162517Z $SPY has now failed a breakout Could recover but for now this is a perfect picture of a failed breakout
bull 2010-12-16T152007Z this weeks patterns $SPY see here
Positive
Positive
Positive
Negative
Negative
Negative
What does a financial twitter news look like
Neutral
News Mining Step 2
How can we interpret the news sentiment by machine
Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification
An introduction to NLTKNLTK is a platform for building Python programs to work with human
language It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries for classification tokenization stemming tagging parsing and semantic reasoning
Twitter text Database descriptionSource httpstocktwitscomFormat JSONSize over 15 millionData entries Id body create at user name followers following hellipid918510bodyOptions Trade in Nordstrom Today $JWN - httpwwwcnbccomid34644850site14081545forcnbccreated_at2010-01-01T000902Z userid6328usernameOptionsHawknameJoe Kunkleavatar_urlhttpavatarsstocktwitsnetproduction6328thumb-1290207489png avatar_url_sslhttpss3amazonawscomst-avatarsproduction6328thumb-1290207489png officialfalse identityUserclassification[] ldquojoin_date2009-11- 01followers7072 following31ideas18866 following_stocks0locationBoston bioActive Options Trader - OptionsHawkcom Founder website_url httpwwwOptionsHawkcom trading_strategy assets_frequently_tradedldquo [Equities OptionsForexFutures]approachTechnicalholding_periodSwing TraderexperienceProfessionalsourceid1titleStockTwitsurlhttpstocktwitscomsymbols[id6039symbolJWNtitleNordstrom IncexchangeNYSEsectorServicesindustryApparel Storestrendingfalse]entitiessentimentnull
bull Starting Set10000 manually labeled twitter news items
bull Distribution of sentiment
Initial training set sample test
SENTIMENT TRAININGSET TESTING TOTAL
POSITIVE 2379 807 3186
NEUTRAL 3849 1248 5097
NEGATIVE 1214 428 1642
SUBTOTAL 7442 2483 9925
NULL 58 17 75
TOTAL 7500 2500 10000
1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import
bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip
bull neg_tweets = $SPY closing the lows negativehellip
2) Split the text sentence into word features
bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
bull contains(spy) True
bull contains(support) False
bull contains(strong) True helliphellip
Parsing News Sentiment Dictionary Mapping
5) Apply this mapping into all the news texts and get the following form
Parsing News Sentiment
lsquospyrsquo supportrsquo lsquogonersquo hellip
Twitter_1 True False True hellip
Twitter_2 True False False hellip
Twitter_3 hellip hellip hellip hellip
SentimentWordFeature
_1WordFeature
_2WordFeature
_3hellip
Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip
Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip
Twitter_3 0 (Neu) hellip hellip hellip hellip
A typical classification problem
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes Formula
The naive assumptions come into play assume that each feature is conditionally independent
of every other feature
In this twitter example it means the word features independently affect the sentiment of the text
Parsing News Sentiment
k
ki
C class with items news of
C class and x feature word with items news of )C|x(p ki
7) Model trained We got 14356 word features Most Informative Features include
8) In sample test and out-of-sample test
bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout
bull Negative Prob(lsquonegative)= 085525
bull Tweet= lsquoSPY UP I like that
bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936
bull TOTAL in-sample accuracy 792
bull TOTAL out-of-sample accuracy 363
bull With a large enough training set the accuracy rate would get very high
Result
NEWS ITEM CONTAINS RATIO
lsquowidelyrsquo positi negati = 2198 10
lsquoheldrsquo positi negati = 1724 10
lsquomostrsquo positi negati = 457 10
lsquofallrsquo negati positi = 454 10
lsquomightrsquo negati neutra = 306 10
Pros 1 A basic approach in sentiment analysis Easy to use 2 Effective if the training set is large enough3 Ability to learn as the training set gets larger the results get
more and more accurate(intelligence)Cons 1 Failure in grasping the connection between words 2 Doesnrsquot consider the sequence of words3 Non-relevant word features
A simple summary Nltk and Naive Bayes method
Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words
Currently working on themhellip
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other
Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment
More advanced approaches
bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014
bullIndico-httpsindicoio
-01
-005
0
005
01
015
02
-04
-02
0
02
04
06
08
1
1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011
indico vader spy_cum_return
Daily averaged news sentiment
A plot of sentiment engines based on 2010 SPY 580000 twitter news
More advanced approaches
bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde
bullResearch still on goinghellipbullVisit my personal sitehttps publictableausoftwarecom viewsSPYVadernewssentiment2010SPYshowVizHome=no1
Thank you
News Mining Step 2
How can we interpret the news sentiment by machine
Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification
An introduction to NLTKNLTK is a platform for building Python programs to work with human
language It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries for classification tokenization stemming tagging parsing and semantic reasoning
Twitter text Database descriptionSource httpstocktwitscomFormat JSONSize over 15 millionData entries Id body create at user name followers following hellipid918510bodyOptions Trade in Nordstrom Today $JWN - httpwwwcnbccomid34644850site14081545forcnbccreated_at2010-01-01T000902Z userid6328usernameOptionsHawknameJoe Kunkleavatar_urlhttpavatarsstocktwitsnetproduction6328thumb-1290207489png avatar_url_sslhttpss3amazonawscomst-avatarsproduction6328thumb-1290207489png officialfalse identityUserclassification[] ldquojoin_date2009-11- 01followers7072 following31ideas18866 following_stocks0locationBoston bioActive Options Trader - OptionsHawkcom Founder website_url httpwwwOptionsHawkcom trading_strategy assets_frequently_tradedldquo [Equities OptionsForexFutures]approachTechnicalholding_periodSwing TraderexperienceProfessionalsourceid1titleStockTwitsurlhttpstocktwitscomsymbols[id6039symbolJWNtitleNordstrom IncexchangeNYSEsectorServicesindustryApparel Storestrendingfalse]entitiessentimentnull
bull Starting Set10000 manually labeled twitter news items
bull Distribution of sentiment
Initial training set sample test
SENTIMENT TRAININGSET TESTING TOTAL
POSITIVE 2379 807 3186
NEUTRAL 3849 1248 5097
NEGATIVE 1214 428 1642
SUBTOTAL 7442 2483 9925
NULL 58 17 75
TOTAL 7500 2500 10000
1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import
bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip
bull neg_tweets = $SPY closing the lows negativehellip
2) Split the text sentence into word features
bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
bull contains(spy) True
bull contains(support) False
bull contains(strong) True helliphellip
Parsing News Sentiment Dictionary Mapping
5) Apply this mapping into all the news texts and get the following form
Parsing News Sentiment
lsquospyrsquo supportrsquo lsquogonersquo hellip
Twitter_1 True False True hellip
Twitter_2 True False False hellip
Twitter_3 hellip hellip hellip hellip
SentimentWordFeature
_1WordFeature
_2WordFeature
_3hellip
Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip
Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip
Twitter_3 0 (Neu) hellip hellip hellip hellip
A typical classification problem
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes Formula
The naive assumptions come into play assume that each feature is conditionally independent
of every other feature
In this twitter example it means the word features independently affect the sentiment of the text
Parsing News Sentiment
k
ki
C class with items news of
C class and x feature word with items news of )C|x(p ki
7) Model trained We got 14356 word features Most Informative Features include
8) In sample test and out-of-sample test
bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout
bull Negative Prob(lsquonegative)= 085525
bull Tweet= lsquoSPY UP I like that
bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936
bull TOTAL in-sample accuracy 792
bull TOTAL out-of-sample accuracy 363
bull With a large enough training set the accuracy rate would get very high
Result
NEWS ITEM CONTAINS RATIO
lsquowidelyrsquo positi negati = 2198 10
lsquoheldrsquo positi negati = 1724 10
lsquomostrsquo positi negati = 457 10
lsquofallrsquo negati positi = 454 10
lsquomightrsquo negati neutra = 306 10
Pros 1 A basic approach in sentiment analysis Easy to use 2 Effective if the training set is large enough3 Ability to learn as the training set gets larger the results get
more and more accurate(intelligence)Cons 1 Failure in grasping the connection between words 2 Doesnrsquot consider the sequence of words3 Non-relevant word features
A simple summary Nltk and Naive Bayes method
Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words
Currently working on themhellip
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other
Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment
More advanced approaches
bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014
bullIndico-httpsindicoio
-01
-005
0
005
01
015
02
-04
-02
0
02
04
06
08
1
1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011
indico vader spy_cum_return
Daily averaged news sentiment
A plot of sentiment engines based on 2010 SPY 580000 twitter news
More advanced approaches
bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde
bullResearch still on goinghellipbullVisit my personal sitehttps publictableausoftwarecom viewsSPYVadernewssentiment2010SPYshowVizHome=no1
Thank you
Parsing News Sentiment using NLTK and Naive Bayes(supervised learning) for classification
An introduction to NLTKNLTK is a platform for building Python programs to work with human
language It provides easy-to-use interfaces to over 50 corpora and lexical resources along with a suite of text processing libraries for classification tokenization stemming tagging parsing and semantic reasoning
Twitter text Database descriptionSource httpstocktwitscomFormat JSONSize over 15 millionData entries Id body create at user name followers following hellipid918510bodyOptions Trade in Nordstrom Today $JWN - httpwwwcnbccomid34644850site14081545forcnbccreated_at2010-01-01T000902Z userid6328usernameOptionsHawknameJoe Kunkleavatar_urlhttpavatarsstocktwitsnetproduction6328thumb-1290207489png avatar_url_sslhttpss3amazonawscomst-avatarsproduction6328thumb-1290207489png officialfalse identityUserclassification[] ldquojoin_date2009-11- 01followers7072 following31ideas18866 following_stocks0locationBoston bioActive Options Trader - OptionsHawkcom Founder website_url httpwwwOptionsHawkcom trading_strategy assets_frequently_tradedldquo [Equities OptionsForexFutures]approachTechnicalholding_periodSwing TraderexperienceProfessionalsourceid1titleStockTwitsurlhttpstocktwitscomsymbols[id6039symbolJWNtitleNordstrom IncexchangeNYSEsectorServicesindustryApparel Storestrendingfalse]entitiessentimentnull
bull Starting Set10000 manually labeled twitter news items
bull Distribution of sentiment
Initial training set sample test
SENTIMENT TRAININGSET TESTING TOTAL
POSITIVE 2379 807 3186
NEUTRAL 3849 1248 5097
NEGATIVE 1214 428 1642
SUBTOTAL 7442 2483 9925
NULL 58 17 75
TOTAL 7500 2500 10000
1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import
bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip
bull neg_tweets = $SPY closing the lows negativehellip
2) Split the text sentence into word features
bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
bull contains(spy) True
bull contains(support) False
bull contains(strong) True helliphellip
Parsing News Sentiment Dictionary Mapping
5) Apply this mapping into all the news texts and get the following form
Parsing News Sentiment
lsquospyrsquo supportrsquo lsquogonersquo hellip
Twitter_1 True False True hellip
Twitter_2 True False False hellip
Twitter_3 hellip hellip hellip hellip
SentimentWordFeature
_1WordFeature
_2WordFeature
_3hellip
Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip
Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip
Twitter_3 0 (Neu) hellip hellip hellip hellip
A typical classification problem
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes Formula
The naive assumptions come into play assume that each feature is conditionally independent
of every other feature
In this twitter example it means the word features independently affect the sentiment of the text
Parsing News Sentiment
k
ki
C class with items news of
C class and x feature word with items news of )C|x(p ki
7) Model trained We got 14356 word features Most Informative Features include
8) In sample test and out-of-sample test
bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout
bull Negative Prob(lsquonegative)= 085525
bull Tweet= lsquoSPY UP I like that
bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936
bull TOTAL in-sample accuracy 792
bull TOTAL out-of-sample accuracy 363
bull With a large enough training set the accuracy rate would get very high
Result
NEWS ITEM CONTAINS RATIO
lsquowidelyrsquo positi negati = 2198 10
lsquoheldrsquo positi negati = 1724 10
lsquomostrsquo positi negati = 457 10
lsquofallrsquo negati positi = 454 10
lsquomightrsquo negati neutra = 306 10
Pros 1 A basic approach in sentiment analysis Easy to use 2 Effective if the training set is large enough3 Ability to learn as the training set gets larger the results get
more and more accurate(intelligence)Cons 1 Failure in grasping the connection between words 2 Doesnrsquot consider the sequence of words3 Non-relevant word features
A simple summary Nltk and Naive Bayes method
Possible Improvements 1 Larger training set2 PCA addressing the problem of too many features3 Filtering remove spam and meaningless tweets4 Detecting short sequence of words
Currently working on themhellip
News Mining Step 3
Other Advanced Methods in measuring news sentiment
Other advanced approaches
Stanford NLP httpnlpstanfordeduPaper Christopher Manning and Dan Klein 2003 Optimization Maxent Models and Conditional Estimation without Magic Tutorial at HLT-NAACL 2003 and ACL 2003Core ideaMaximum entropy classifier Otherwise known as multiclass logistic regression The Max Entropy does not assume that the features are conditionally independent of each other
Vivekn httpgithubcomviveknsentimentPaper Fast and accurate sentiment classification using an enhanced Naive Bayesmodel Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture Notes in Computer Science Volume 8206 2013 pp 194-201Core ideaThis tool works by examining individual words and short sequences of words (n-grams) not bad will be classified as positive despite having two individual words with a negative sentiment
More advanced approaches
bullOther ones I am currently working onbullVadersentiment- httpsgithubcomcjhuttovaderSentimentbullHutto CJ amp Gilbert EE (2014) VADER A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Eighth International Conference on Weblogs and Social Media (ICWSM-14) Ann Arbor MI June 2014
bullIndico-httpsindicoio
-01
-005
0
005
01
015
02
-04
-02
0
02
04
06
08
1
1232009 1222010 3132010 522010 6212010 8102010 9292010 11182010 172011 2262011
indico vader spy_cum_return
Daily averaged news sentiment
A plot of sentiment engines based on 2010 SPY 580000 twitter news
More advanced approaches
bullThomson Reuters news analytics httpthomsonreuterscomenhtmlbullGate (+Annie) - httpgateacukbullLingPipe - httpalias-icomlingpipebullWEKA NLP- httpwwwcswaikatoacnzmlwbullOpenNLP - httpincubatorapacheorgopenbullJULIE - httpwwwjulielabde
bullResearch still on goinghellipbullVisit my personal sitehttps publictableausoftwarecom viewsSPYVadernewssentiment2010SPYshowVizHome=no1
Thank you
bull Starting Set10000 manually labeled twitter news items
bull Distribution of sentiment
Initial training set sample test
SENTIMENT TRAININGSET TESTING TOTAL
POSITIVE 2379 807 3186
NEUTRAL 3849 1248 5097
NEGATIVE 1214 428 1642
SUBTOTAL 7442 2483 9925
NULL 58 17 75
TOTAL 7500 2500 10000
1) Prepare training set ndash Data cleaning(removing nulls web links etc) amp import
bull pos_tweets = $SPY looks strong riding 5EMA - large gap fromhellippositivehellip
bull neg_tweets = $SPY closing the lows negativehellip
2) Split the text sentence into word features
bull spy looks strong riding 5ema large gap from sma50 concern about level gone for nowhellip
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features
bull contains(spy) True
bull contains(support) False
bull contains(strong) True helliphellip
Parsing News Sentiment Dictionary Mapping
5) Apply this mapping into all the news texts and get the following form
Parsing News Sentiment
lsquospyrsquo supportrsquo lsquogonersquo hellip
Twitter_1 True False True hellip
Twitter_2 True False False hellip
Twitter_3 hellip hellip hellip hellip
SentimentWordFeature
_1WordFeature
_2WordFeature
_3hellip
Twitter_1 1 (Pos) 1 (True) 0 (False) 1 (True) hellip
Twitter_2 -1 (Neg) 1 (True) 0 (False) 0 (False) hellip
Twitter_3 0 (Neu) hellip hellip hellip hellip
A typical classification problem
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes Formula
The naive assumptions come into play assume that each feature is conditionally independent
of every other feature
In this twitter example it means the word features independently affect the sentiment of the text
Parsing News Sentiment
k
ki
C class with items news of
C class and x feature word with items news of )C|x(p ki
7) Model trained We got 14356 word features Most Informative Features include
8) In sample test and out-of-sample test
bull Tweet= $SPY has now failed a breakout We could recover but for now this is a perfect picture of a failed breakout
bull Negative Prob(lsquonegative)= 085525
bull Tweet= lsquoSPY UP I like that
bull Positive Prob(positive)= 04123 Prob(lsquonegative)=01936
bull TOTAL in-sample accuracy 792
bull TOTAL out-of-sample accuracy 363
bull With a large enough training set the accuracy rate would get very high
Result
NEWS ITEM CONTAINS RATIO
1) Prepare the training set: data cleaning (removing nulls, web links, etc.) & import
• pos_tweets = [('$SPY looks strong riding 5EMA - large gap from…', 'positive'), …]
• neg_tweets = [('$SPY closing the lows', 'negative'), …]
2) Split the text sentence into word features
• 'spy', 'looks', 'strong', 'riding', '5ema', 'large', 'gap', 'from', 'sma50', 'concern', 'about', 'level', 'gone', 'for', 'now', …
3) Build a dictionary: a collection of all the recognized word features in the training set
4) Map text onto word features
• contains(spy): True
• contains(support): False
• contains(strong): True …
Parsing News Sentiment: Dictionary Mapping
5) Apply this mapping to all the news texts to get the following form:
Parsing News Sentiment
            'spy'   'support'   'gone'   …
Twitter_1   True    False       True     …
Twitter_2   True    False       False    …
Twitter_3   …       …           …        …
            Sentiment   WordFeature_1   WordFeature_2   WordFeature_3   …
Twitter_1   1 (Pos)     1 (True)        0 (False)       1 (True)        …
Twitter_2   -1 (Neg)    1 (True)        0 (False)       0 (False)       …
Twitter_3   0 (Neu)     …               …               …               …
A typical classification problem
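Steps 2-4 above (splitting text into word features, building the dictionary, and mapping each text onto contains(word) features) can be sketched in plain Python. The two tweets and their labels here are hypothetical, and the real pipeline uses NLTK's utilities rather than this from-scratch code:

```python
# Minimal from-scratch sketch of steps 2-4; tweets and labels are hypothetical.
pos_tweets = [("$SPY looks strong riding 5EMA", "positive")]
neg_tweets = [("$SPY closing the lows", "negative")]

def words(text):
    # Step 2: split a sentence into lowercase word features
    return [w.lower().strip("$") for w in text.split()]

# Step 3: the dictionary -- every recognized word feature in the training set
dictionary = sorted({w for text, _ in pos_tweets + neg_tweets for w in words(text)})

def extract_features(text):
    # Step 4: map a text onto contains(word) -> True/False features
    present = set(words(text))
    return {"contains(%s)" % w: (w in present) for w in dictionary}

features = extract_features("$SPY looks strong today")
```

Every tweet, mapped this way, becomes one fixed-length row of True/False features, which is exactly the table form shown above.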
6) Classification
Naive Bayes classifier
6) A simple description of Naive Bayes
Bayes formula: p(C_k | x_1, …, x_n) = p(C_k) · p(x_1, …, x_n | C_k) / p(x_1, …, x_n)
The naive assumption comes into play: assume that each feature is conditionally independent of every other feature, so p(C_k | x_1, …, x_n) ∝ p(C_k) · ∏_i p(x_i | C_k).
In this Twitter example, it means the word features independently affect the sentiment of the text.
Parsing News Sentiment
p(x_i | C_k) = (number of news items with word feature x_i and class C_k) / (number of news items with class C_k)
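The count-based estimate of p(x_i | C_k) can be illustrated with a toy, hand-labeled set of feature dictionaries (all five labeled items here are hypothetical):

```python
# Toy count-based estimate of p(word feature x | class C); all labels hypothetical.
labeled = [
    ({"strong": True, "fall": False}, "positive"),
    ({"strong": True, "fall": False}, "positive"),
    ({"strong": False, "fall": True}, "negative"),
    ({"strong": False, "fall": True}, "negative"),
    ({"strong": True, "fall": True}, "negative"),
]

def p_feature_given_class(word, cls):
    in_class = [f for f, label in labeled if label == cls]
    with_word = [f for f in in_class if f[word]]
    # (# news items with word feature x and class C) / (# news items with class C)
    return len(with_word) / len(in_class)
```

With these counts, p('strong' | positive) = 2/2 and p('strong' | negative) = 1/3, which is the kind of evidence the classifier multiplies together under the independence assumption.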
7) Model trained. We got 14,356 word features. The most informative features include:
8) In-sample test and out-of-sample test
• Tweet = '$SPY has now failed a breakout. We could recover, but for now this is a perfect picture of a failed breakout.'
• Negative. Prob('negative') = 0.85525
• Tweet = 'SPY UP I like that'
• Positive. Prob('positive') = 0.4123, Prob('negative') = 0.1936
• TOTAL in-sample accuracy: 79.2%
• TOTAL out-of-sample accuracy: 36.3%
• With a large enough training set, the accuracy rate would get very high
Result
NEWS ITEM CONTAINS   RATIO
'widely'             positive : negative = 219.8 : 1.0
'held'               positive : negative = 172.4 : 1.0
'most'               positive : negative = 45.7 : 1.0
'fall'               negative : positive = 45.4 : 1.0
'might'              negative : neutral  = 30.6 : 1.0
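These ratios are the kind NLTK's show_most_informative_features reports: the likelihood of a feature under one class divided by its likelihood under another. A toy sketch with hypothetical counts:

```python
# How a "most informative feature" ratio is computed; the counts are hypothetical.
counts = {
    # (tweets in the class containing the word, tweets in the class)
    ("widely", "positive"): (40, 200),
    ("widely", "negative"): (1, 1000),
}

def ratio(word, cls_a, cls_b):
    wa, na = counts[(word, cls_a)]
    wb, nb = counts[(word, cls_b)]
    # p(contains(word) | cls_a) / p(contains(word) | cls_b)
    return (wa / na) / (wb / nb)

r = ratio("widely", "positive", "negative")  # ratio ≈ 200 : 1
```

A large ratio simply means the word shows up far more often in one class than the other, which is why these features dominate the classifier's decisions.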
A simple summary: NLTK and the Naive Bayes method
Pros:
1. A basic approach to sentiment analysis; easy to use.
2. Effective if the training set is large enough.
3. Ability to learn: as the training set gets larger, the results get more and more accurate ("intelligence").
Cons:
1. Fails to grasp the connections between words.
2. Doesn't consider the sequence of words.
3. Non-relevant word features.
Possible improvements:
1. Larger training set.
2. PCA, addressing the problem of too many features.
3. Filtering: remove spam and meaningless tweets.
4. Detecting short sequences of words.
Currently working on them…
News Mining Step 3
Other advanced methods for measuring news sentiment
Other advanced approaches
Stanford NLP: http://nlp.stanford.edu
Paper: Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
Core idea: a maximum entropy classifier, otherwise known as multiclass logistic regression. MaxEnt does not assume that the features are conditionally independent of each other.
Vivekn: https://github.com/vivekn/sentiment
Paper: Fast and Accurate Sentiment Classification Using an Enhanced Naive Bayes Model. Intelligent Data Engineering and Automated Learning, IDEAL 2013, Lecture Notes in Computer Science, Volume 8206, 2013, pp. 194-201.
Core idea: the tool works by examining individual words and short sequences of words (n-grams): 'not bad' will be classified as positive despite containing two individually negative words.
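The n-gram idea can be illustrated in a few lines: alongside individual words, short word sequences are kept as features, so a negated phrase like 'not bad' survives as a single unit. A minimal sketch (not vivekn's actual code):

```python
# Sketch of n-gram feature extraction: unigrams plus bigrams, so "not bad"
# becomes one feature instead of two separate negative-looking words.
def ngram_features(text):
    tokens = text.lower().split()
    unigrams = tokens
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return unigrams + bigrams

feats = ngram_features("not bad at all")  # includes the bigram "not bad"
```

A classifier trained on such features can learn that the bigram 'not bad' is positive even while the unigram 'bad' remains negative.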
More advanced approaches
• Other ones I am currently working on:
• VADER sentiment - https://github.com/cjhutto/vaderSentiment
  Paper: Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
• Indico - https://indico.io
[Figure: daily averaged news sentiment from the indico and vader engines plotted against the SPY cumulative return, December 2009 to February 2011; sentiment axis from -0.10 to 0.20, cumulative-return axis from -0.4 to 1.0.]
A plot of sentiment engines based on 2010 SPY: 580,000 Twitter news items
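The "daily averaged news sentiment" series behind such a plot can be computed by grouping per-tweet scores by calendar day; a minimal sketch with hypothetical timestamps and scores:

```python
# Average per-tweet sentiment scores by calendar day (hypothetical scores).
from collections import defaultdict
from datetime import datetime

scored = [
    ("2010-01-19T15:14:52Z", 0.6),
    ("2010-01-19T20:57:21Z", -0.2),
    ("2010-12-08T16:25:17Z", -0.4),
]

by_day = defaultdict(list)
for stamp, score in scored:
    day = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").date()
    by_day[day].append(score)

daily_avg = {day: sum(s) / len(s) for day, s in by_day.items()}
```

Each sentiment engine produces one such daily series, which can then be overlaid on the SPY cumulative return for comparison.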
More advanced approaches
• Thomson Reuters News Analytics - http://thomsonreuters.com/en.html
• GATE (+ANNIE) - http://gate.ac.uk
• LingPipe - http://alias-i.com/lingpipe
• WEKA - http://www.cs.waikato.ac.nz/ml/weka
• OpenNLP - http://incubator.apache.org/opennlp
• JULIE - http://www.julielab.de
• Research still ongoing…
• Visit my personal site: https://public.tableausoftware.com/views/SPYVadernewssentiment2010SPY?showVizHome=no#1
Thank you