Topic-Dependent Sentiment Analysis of Financial Blogs. Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Paraic Sheridan, Cathal Gurrin, Alan F. Smeaton. Date: 2010/04/19. Speaker: Yu-Cheng Hsieh

Upload: claire

Post on 05-Jan-2016


Page 1: Topic-Dependent Sentiment Analysis of Financial Blogs

Topic-Dependent Sentiment Analysis of Financial Blogs

Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Paraic Sheridan, Cathal Gurrin, Alan F. Smeaton

Date: 2010/04/19

Speaker: Yu-Cheng Hsieh

Page 2: Topic-Dependent Sentiment Analysis of Financial Blogs

Outline

• Introduction
• Glossary
• Issues
• Development of corpus
• Analysis of corpus
• Topic-based analysis
• Experiment & Result
• Conclusion

Page 3: Topic-Dependent Sentiment Analysis of Financial Blogs

Introduction

• No existing work uses blogs as a source; most work uses news.

• News articles are more likely to report a stock’s past performance.

• Blogs are more likely to express opinions and to make predictions about the performance of stocks.

Page 4: Topic-Dependent Sentiment Analysis of Financial Blogs

Introduction (Cont.)

• The aim is to…
1. Automatically extract the subjective opinions uniquely found on blogs.
2. Track the changing sentiment from the blogosphere towards individual stocks and the market in general.

• Approach: supervised learning

Page 5: Topic-Dependent Sentiment Analysis of Financial Blogs

Glossary

• Document: a blog article.

• Topic: name of a stock.

• Unique document: a document that contains only one topic.

• Topic shift: a problem in multi-topic documents, where the text moves from one topic to another.

Page 6: Topic-Dependent Sentiment Analysis of Financial Blogs

Glossary (Cont.)

• Doc-Topic pair: the combination of a document and one topic it mentions. In a non-unique document, it corresponds to a sub-document.

• Inter-annotator agreement: the degree to which different annotators assign the same label to the same object.

Page 7: Topic-Dependent Sentiment Analysis of Financial Blogs

Issues

• Topic shift: how can the individual topics in a document be extracted?

• At what level should the analysis be done: document, paragraph, sentence, or word?

• How many labels should be used to annotate?

Page 8: Topic-Dependent Sentiment Analysis of Financial Blogs

Extract sub-document

• Using a proximity approach
- Steps
1. Find the topic word: T
2. Set a window size: N
3. Starting from T, expand N words to both the left and the right of T.
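The three steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions that are not in the slides: the topic is a single word, tokens are split on whitespace, and the function name `extract_window` is hypothetical.

```python
import re


def extract_window(text, topic, n):
    """Keep the words within n positions of every occurrence of the topic word.

    Assumptions (not from the slides): single-word topic, whitespace
    tokenisation, case-insensitive matching after stripping punctuation.
    """
    words = text.split()
    keep = set()
    for i, word in enumerate(words):
        # Step 1: find the topic word T.
        if re.sub(r"\W+", "", word).lower() == topic.lower():
            # Steps 2-3: expand n words to the left and right of T.
            lo = max(0, i - n)
            hi = min(len(words), i + n + 1)
            keep.update(range(lo, hi))
    return " ".join(words[i] for i in sorted(keep))


text = "Shares of Apple rose sharply today while Microsoft fell after weak earnings"
print(extract_window(text, "Apple", 2))
```

Overlapping windows around repeated mentions are merged, so each word appears at most once in the extracted sub-document.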

Page 9: Topic-Dependent Sentiment Analysis of Financial Blogs

Development of corpus

• The corpus is made up of financial blog articles from “blogged.com”

• 232 financial blogs are identified

• Articles from these blogs were collected in two crawls, separated by date:
- Crawl 1: 3 weeks in Feb. 2009
- Crawl 2: 5 weeks from May to Jun. 2009

Page 10: Topic-Dependent Sentiment Analysis of Financial Blogs

Development of corpus (Cont.)

• Noise Removal
- Using the DiffPost algorithm
- Concept: noise tends to be repeated across multiple articles.
- Steps
1. Break each article into HTML segments
2. Compare the segments across articles
3. Remove the repeated segments; only unique segments are kept.
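The idea behind these steps can be illustrated with a simplified sketch. It is not the actual DiffPost implementation: it assumes each article has already been broken into a list of HTML segment strings, and it treats a segment as noise when the identical string occurs in more than one article. The function name is hypothetical.

```python
from collections import Counter


def remove_repeated_segments(articles):
    """Drop segments that occur in more than one article.

    `articles` is a list of articles, each given as a list of segment
    strings (the segmentation step is assumed to have been done already).
    """
    # Count in how many different articles each segment appears.
    doc_freq = Counter()
    for segments in articles:
        doc_freq.update(set(segments))
    # Keep only the segments that are unique to a single article.
    return [[s for s in segments if doc_freq[s] == 1] for segments in articles]


articles = [
    ["<nav>Home | About</nav>", "<p>Stock A rises.</p>"],
    ["<nav>Home | About</nav>", "<p>Stock B falls.</p>"],
]
cleaned = remove_repeated_segments(articles)
print(cleaned)
```

In this toy input the shared navigation bar is removed from both articles, and only the article-specific paragraphs remain.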

Page 11: Topic-Dependent Sentiment Analysis of Financial Blogs

Development of corpus (Cont.)

• Labels
- Very Negative / Very Positive
- Negative / Positive
- Neutral
- Mixed
- Not relevant
- IDK (I Don’t Know)

Page 12: Topic-Dependent Sentiment Analysis of Financial Blogs

Development of corpus (Cont.)

• Topics and retrieval
- 500 stocks were chosen to be topics from the “S&P 500”.
- Relevant articles must contain the whole company name in upper case.
- Unique annotations are identified by the combination of document and topic, i.e. a doc-topic pair.

Page 13: Topic-Dependent Sentiment Analysis of Financial Blogs

Development of corpus (Cont.)

• Topics and retrieval
- A number of documents were also annotated with respect to their sentiment towards stocks in general.
=>
~ 1526 unique doc-topic pairs.
~ 167 of which were annotated for stocks in general.
~ 164 of which were annotated by two annotators to facilitate inter-annotator agreement analysis.

Page 14: Topic-Dependent Sentiment Analysis of Financial Blogs

Analysis of corpus

• Annotation statistics

Page 15: Topic-Dependent Sentiment Analysis of Financial Blogs

Analysis of corpus (Cont.)

• Inter-Annotator Agreement

Page 16: Topic-Dependent Sentiment Analysis of Financial Blogs

Cohen’s Kappa

• Example
- Probability of observed agreement: P(a) = (20+15)/50 = 0.7
- A said YES 30 times => 30/50 = 0.6; B said YES 25 times => 25/50 = 0.5
- Probability that both said YES = 0.6*0.5 = 0.3; that both said NO = 0.4*0.5 = 0.2
  => Probability of random agreement: P(e) = 0.3+0.2 = 0.5
- Kappa = (P(a) - P(e)) / (1 - P(e)) = (0.7-0.5)/(1-0.5) = 0.4
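The same calculation can be checked with a short Python sketch. The 2x2 table below is derived from the numbers in the example (20 joint YES, 15 joint NO, A said YES 30 times, B said YES 25 times, 50 items in total); the off-diagonal cells 10 and 5 follow from those totals. The function name `cohens_kappa` is illustrative.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] is the number of items that annotator A labelled i
    and annotator B labelled j.
    """
    total = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of items on the diagonal.
    p_a = sum(table[i][i] for i in range(k)) / total
    # Chance agreement: product of the two annotators' marginal proportions.
    p_e = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(k)
    )
    return (p_a - p_e) / (1 - p_e)


# Rows: A said YES / NO. Columns: B said YES / NO.
table = [[20, 10],
         [5, 15]]
kappa = cohens_kappa(table)
print(round(kappa, 2))  # 0.4
```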

Page 17: Topic-Dependent Sentiment Analysis of Financial Blogs

Analysis of corpus (Cont.)

• Topic Relevance

Page 18: Topic-Dependent Sentiment Analysis of Financial Blogs

Topic-based sentiment analysis

• Topic-based text extraction

- Blog articles often contain multiple topics.

- Topic-based extraction enables sentiment analysis at the sub-document level, which should alleviate the topic-shift problem.

Page 19: Topic-Dependent Sentiment Analysis of Financial Blogs

Topic-based sentiment analysis (Cont.)

• Topic-based text extraction
- Three approaches to extract a sub-document
1. N-word extraction
2. N-sentence extraction
3. N-paragraph extraction
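As an illustration of the second approach, the sketch below keeps every sentence that mentions the topic plus N sentences on either side. The details are assumptions, not taken from the slides: sentences are split naively after ".", "!" or "?", the topic is matched as a case-insensitive substring, and the function name is hypothetical.

```python
import re


def extract_sentences(text, topic, n):
    """Keep sentences mentioning the topic, plus n sentences on each side."""
    # Naive sentence splitting: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    keep = set()
    for i, sentence in enumerate(sentences):
        if topic.lower() in sentence.lower():
            keep.update(range(max(0, i - n), min(len(sentences), i + n + 1)))
    return " ".join(sentences[i] for i in sorted(keep))


post = "Markets were flat. Apple beat estimates. Analysts were pleased. Oil fell."
print(extract_sentences(post, "Apple", 1))
```

N-paragraph extraction would follow the same pattern with paragraphs in place of sentences.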

Page 20: Topic-Dependent Sentiment Analysis of Financial Blogs

Topic-based sentiment analysis (Cont.)

• Sentiment classification
- The classification task attempts to model a function $f : X \to Y$, where the input $X$ is a doc-topic pair.
1. For binary classification: $Y = \{\text{positive}, \text{negative}\}$
2. For 3-point classification: $Y = \{\text{positive}, \text{negative}, \text{neutral}\}$
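Any supervised classifier can play the role of such a function. As a rough illustration, the sketch below implements a very small multinomial Naive Bayes classifier over bag-of-words counts with add-one smoothing. It is not the setup used in the experiments; the class name, the whitespace tokenisation, and the toy training texts are all invented for the example.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.class_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in text.lower().split():
                if w in self.vocab:
                    score += math.log((self.word_counts[c][w] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best


# Toy sub-documents and labels, invented for illustration only.
train_texts = [
    "strong growth and record profit",
    "shares surge on upbeat outlook",
    "weak sales and heavy losses",
    "shares plunge on profit warning",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("record growth"))
```

For 3-point classification the same class works unchanged once training examples labelled "neutral" are included.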

Page 21: Topic-Dependent Sentiment Analysis of Financial Blogs

Experiment

• Discarded doc-topic pairs in the corpus that had no label or were labelled inconsistently by more than one annotator.
• 687 labelled documents for binary classification
• 917 labelled documents for 3-point classification
• Compared three classifiers
1. Multinomial Naïve Bayes
2. SVM
3. Trivial classifier as baseline
• 10-fold cross-validation
• Performance metric: classification accuracy
• Sub-documents were used to train the classifiers
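To make the evaluation protocol concrete, the sketch below computes 10-fold cross-validated accuracy for a trivial baseline. It assumes that "trivial classifier" means always predicting the most frequent training label, which the slides do not spell out. The label distribution is invented and the function name is hypothetical.

```python
from collections import Counter


def majority_baseline_cv(labels, k=10):
    """Mean accuracy over k folds of always predicting the majority label.

    Assumes len(labels) >= k so that no fold is empty.
    """
    # Assign every k-th item to the same fold.
    folds = [labels[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        # "Train" the baseline: remember the most frequent training label.
        majority = Counter(train).most_common(1)[0][0]
        accuracies.append(sum(y == majority for y in test) / len(test))
    return sum(accuracies) / k


# Invented label distribution, for illustration only.
labels = ["positive"] * 60 + ["negative"] * 40
baseline_accuracy = majority_baseline_cv(labels, k=10)
print(round(baseline_accuracy, 2))
```

A real classifier would be evaluated on the same folds, so that its accuracy can be compared directly against this baseline.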

Page 22: Topic-Dependent Sentiment Analysis of Financial Blogs

Results

• Document level only

Page 23: Topic-Dependent Sentiment Analysis of Financial Blogs

Results (Cont.)

Page 24: Topic-Dependent Sentiment Analysis of Financial Blogs

Results (Cont.)

Page 25: Topic-Dependent Sentiment Analysis of Financial Blogs

Results (Cont.)

• Binary classification using MNB at N=30

Page 26: Topic-Dependent Sentiment Analysis of Financial Blogs

Conclusion

• Explored the use of blog sources for sentiment analysis in the financial domain

• Developed a corpus of over 1,500 document-level annotations

• Analysis of the annotation effort suggests that humans have particular difficulty annotating for degree of polarity

• Proposed a text-extraction approach to address the topic-shift problem.

• Plan to explore the use of linguistic features and domain-independent experiments

Page 27: Topic-Dependent Sentiment Analysis of Financial Blogs

Thanks for listening