topic-dependent sentiment analysis of financial blogs

Topic-Dependent Sentiment Analysis of Financial Blogs

Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Paraic Sheridan, Cathal Gurrin, Alan F. Smeaton

Date: 2010/04/19

Speaker: Yu-Cheng Hsieh

Outline

• Introduction

• Glossary

• Issues

• Development of corpus

• Analysis of corpus

• Topic-based analysis

• Experiment & Result

• Conclusion

Introduction

• No existing work has used blogs as a source; most work has used news.

• News articles are more likely to report a stock’s past performance.

• Blogs are more likely to express opinions and to make predictions about the performance of stocks.

Introduction (Cont.)

• The aim is to…

1. Automatically extract the subjective opinions uniquely found on blogs.

2. Track the changing sentiment from the blogosphere towards individual stocks and the market in general.

• The approach is supervised.

Glossary

• Document: a blog article.

• Topic: name of a stock.

• Unique document: a document that contains only one topic.

• Topic shift: an issue that arises in a document covering multiple topics.

Glossary (Cont.)

• Doc-topic pair: a topic in a non-unique document (also a sub-document of that document).

• Inter-annotator agreement: the degree to which different annotators assign the same label to the same item.

Issues

• Topic shift: how should the text about each topic be extracted from a document?

• At what level should the analysis be done? Document level? Paragraph level? Sentence level? Word level?

• How many labels should be used for annotation?

Extract sub-document

• Using the proximity approach

- Steps

1. Find the topic word: T

2. Set a window size: N

3. Starting from T, expand N words to both the left and the right of T.
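The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `extract_window`, the whitespace tokenisation, the single-token topic, the merging of overlapping windows, and the sample sentence with the made-up topic "ACME" are all assumptions.

```python
def extract_window(tokens, topic, n):
    """Return the tokens lying within n words of any occurrence of topic."""
    keep = set()
    for i, tok in enumerate(tokens):
        if tok == topic:
            # Expand n words to the left and to the right of the topic word.
            keep.update(range(max(0, i - n), min(len(tokens), i + n + 1)))
    # Overlapping windows are merged; original word order is preserved.
    return [tokens[i] for i in sorted(keep)]


# Hypothetical sentence, for illustration only.
text = "Markets fell today but ACME shares rose sharply after strong earnings"
tokens = text.split()
window = extract_window(tokens, "ACME", 2)
print(" ".join(window))
```

With N = 2 the extracted sub-document is "today but ACME shares rose"; the words about the wider market fall outside the window.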

Development of corpus

• The corpus is made up of financial blog articles from “blogged.com”

• 232 financial blogs were identified

• Articles were collected in two crawls, separated by date

- Crawl 1: 3 weeks in February 2009

- Crawl 2: 5 weeks from May to June 2009

Development of corpus (Cont.)

• Noise removal

- Using the DiffPost algorithm

- Concept: noise tends to be repeated across multiple articles.

- Steps

1. Break each article into HTML segments

2. Compare those segments

3. Remove the repeated segments; only unique segments are kept.
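The idea behind these steps can be illustrated with a simplified sketch. It is not the DiffPost algorithm itself: splitting at every HTML tag, the exact-match comparison, and the function names are assumptions, and the two sample articles are invented.

```python
import re
from collections import Counter


def split_segments(html):
    """Break an article into text segments at HTML tag boundaries."""
    parts = re.split(r"<[^>]+>", html)
    return [p.strip() for p in parts if p.strip()]


def remove_repeated_segments(articles):
    """Keep only the segments that occur in exactly one article."""
    segmented = [split_segments(a) for a in articles]
    # Count the number of articles each segment appears in (once per article).
    doc_freq = Counter(seg for segs in segmented for seg in set(segs))
    return [[s for s in segs if doc_freq[s] == 1] for segs in segmented]


# Invented sample articles sharing a navigation bar and a footer.
articles = [
    "<div>Home | About</div><p>ACME beat estimates.</p><div>Subscribe</div>",
    "<div>Home | About</div><p>Oil prices slipped.</p><div>Subscribe</div>",
]
cleaned = remove_repeated_segments(articles)
print(cleaned)
```

The navigation bar and footer appear in both articles and are dropped, so only the article bodies remain.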


• Labels

- Very Negative / Very Positive

- Neutral

- Negative / Positive

- Mixed

- Not relevant

- IDK (I Don’t Know)


• Topics and retrieval

- 500 stocks from the “S&P 500” were chosen as topics.

- Relevant articles must contain the whole company name in upper case.

- Unique annotations are identified by the combination of document and topic (the doc-topic pair).


• Topics and retrieval

- A number of documents were also annotated with respect to their sentiment towards stocks in general.

=> 1526 unique doc-topic pairs

- 167 of which were annotated for stocks in general.

- 164 of which were annotated by two annotators to facilitate inter-annotator agreement analysis.

Analysis of corpus

• Annotation statistics

Analysis of corpus (Cont.)

• Inter-Annotator Agreement

Cohen’s Kappa

• Example

- Probability of consistent agreement: P(a) = (20 + 15) / 50 = 0.7

- A said YES 30 times => 30 / 50 = 0.6

- B said YES 25 times => 25 / 50 = 0.5

- Probability that both said YES = 0.6 * 0.5 = 0.3; that both said NO = 0.4 * 0.5 = 0.2

=> Probability of random agreement: P(e) = 0.3 + 0.2 = 0.5

- Kappa = (P(a) - P(e)) / (1 - P(e)) = (0.7 - 0.5) / (1 - 0.5) = 0.4
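The same arithmetic can be checked with a short Python function for the two-annotator, two-label case. The function name and argument layout are illustrative choices. The four cell counts are derived from the example: 20 joint YES and 15 joint NO answers, which with A's 30 and B's 25 YES answers leaves 10 items where only A said YES and 5 where only B did.

```python
def cohens_kappa(both_yes, a_only_yes, b_only_yes, both_no):
    """Cohen's kappa for two annotators and two labels (YES / NO)."""
    total = both_yes + a_only_yes + b_only_yes + both_no
    # Observed agreement: items on which the annotators gave the same label.
    p_a = (both_yes + both_no) / total
    # Each annotator's individual rate of saying YES.
    a_yes = (both_yes + a_only_yes) / total
    b_yes = (both_yes + b_only_yes) / total
    # Chance agreement: both say YES or both say NO independently.
    p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_a - p_e) / (1 - p_e)


kappa = cohens_kappa(both_yes=20, a_only_yes=10, b_only_yes=5, both_no=15)
print(round(kappa, 2))
```

A kappa of 1 means perfect agreement, and 0 means agreement no better than chance.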

Analysis of corpus (Cont.)

• Topic Relevance

Topic-based sentiment analysis

• Topic-based text extraction

- Blog articles often contain multiple topics.

- Topic-based extraction enables sentiment analysis at the sub-document level, which should alleviate the topic-shift problem.

Topic-based sentiment analysis (Cont.)

• Topic-based text extraction

- Three approaches to extract a sub-document

1. N-word extraction

2. N-sentence extraction

3. N-paragraph extraction
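As an illustration of the sentence-level variant, the sketch below keeps every sentence that mentions the topic plus N sentences on either side. The slides do not define the variant precisely, so this reading is an assumption, as are the naive punctuation-based sentence splitter, the function name, and the invented sample text.

```python
import re


def extract_sentences(text, topic, n):
    """Keep sentences mentioning topic, plus n sentences either side."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    keep = set()
    for i, sentence in enumerate(sentences):
        if topic in sentence:
            keep.update(range(max(0, i - n), min(len(sentences), i + n + 1)))
    return [sentences[i] for i in sorted(keep)]


# Invented sample text with the made-up topic "ACME".
text = (
    "Markets were flat. ACME rose on earnings. Analysts expect more gains. "
    "Oil fell. Gold was steady."
)
sub_document = extract_sentences(text, "ACME", 1)
print(sub_document)
```

The paragraph-level variant would follow the same pattern with paragraphs in place of sentences.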

Topic-based sentiment analysis (Cont.)

• Sentiment classification

- The classification task attempts to model a function $f: X \rightarrow Y$, where $X$ is a doc-topic pair.

1. For binary classification: $Y = \{\text{positive}, \text{negative}\}$

2. For 3-point classification: $Y = \{\text{positive}, \text{negative}, \text{neutral}\}$

Experiment

• Discarded doc-topic pairs in the corpus that had no label or were labelled inconsistently by more than one annotator.

• 687 labelled documents for binary classification

• 917 labelled documents for 3-point classification

• Compared three classifiers

1. Multinomial Naïve Bayes (MNB)

2. SVM

3. Trivial classifier as baseline

• 10-fold cross-validation

• Performance metric: classification accuracy

• Sub-documents were used to train the classifiers
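The evaluation setup can be sketched in plain Python. This is a toy illustration under several assumptions, not the experiment itself: the trivial baseline is taken to be a majority-class predictor, Naïve Bayes uses add-one smoothing, folds are assigned by interleaving, the SVM is omitted, and the ten tiny "documents" are invented.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels):
    """Collect class and per-class word counts for multinomial Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_mnb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                count = word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def train_majority(docs, labels):
    """Trivial baseline (assumed): remember the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, tokens):
    return model


def cross_val_accuracy(docs, labels, train, predict, k=10):
    """k-fold cross-validation accuracy with interleaved folds."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        tr_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        tr_labels = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train(tr_docs, tr_labels)
        correct += sum(predict(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


# Invented toy data: 7 positive and 3 negative sub-documents.
docs = [["buy", "strong", "growth"]] * 7 + [["sell", "weak", "loss"]] * 3
labels = ["positive"] * 7 + ["negative"] * 3

mnb_acc = cross_val_accuracy(docs, labels, train_mnb, predict_mnb)
baseline_acc = cross_val_accuracy(docs, labels, train_majority, predict_majority)
print(mnb_acc, baseline_acc)
```

The baseline shows why accuracy has to be read against the class distribution: on this toy set it scores 0.7 by always predicting the majority class.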

Results

• Document level only

Results (Cont.)

Results (Cont.)

• Binary classification using MNB at N=30

Conclusion

• Explored the use of blog sources for sentiment analysis in the financial domain

• Developed a corpus of over 1,500 document-level annotations

• Analysis of the annotation effort suggests that humans have particular difficulty annotating for degree of polarity

• Proposed a text-extraction approach to address the topic-shift problem

• Plan to explore the use of linguistic features and domain-independent experiments

Thank you for listening
