Predictive Analytics through Sentiment Analysis

© 2015, HCL Technologies. Reproduction Prohibited. This document is protected under Copyright by the Author, all rights reserved.

Table of Contents

Abstract

Introduction

Solution

Implementation

Named Entity Extraction

Sentiment Analysis

Model

Conclusion

Reference

Author Info


Abstract

Media in the 21st century has become very diverse, and with the social media revolution of the last decade it has grown in stature too. The layers of media content, coupled with what people express on social media, have become a cauldron of emotions, thought processes and sentiments. This data has become the focus of many groups and companies because a lot can be foreseen through it. The availability of data and improvements in technology allow us to find a great deal of value that would have been deemed out of reach a few years ago. The question we attempted to answer is whether we can mine value out of common day-to-day data and use it for predictive analysis. This case study showcases the use of daily financial news articles to predict stock market movements, and highlights the ML experiments performed and the results achieved.

Introduction

Stock market predictions have generally been based on stochastic models. The techniques used include Exponential Moving Average and Head & Shoulders; Artificial Neural Networks and Genetic Algorithms are also used heavily. Many analysts use more traditional measures such as the P/E ratio too. All these techniques use stock prices, traded volumes and dividends paid to model the predictions. Our attempt, however, was to show how market and public sentiment can be harnessed from financial news articles and used to predict stock movements without those regularly used features.

Predictive analytics with ‘unstructured data’ has become one of the cornerstones of Big Data analytics. This use case showcases a predictive analytics workflow/pipeline for unstructured data and how simply it can be put together using open source tools.

According to Gartner, the key challenges of Big Data are Information Strategy, Data Analytics and Enterprise Information Management. Gartner also claims that “Through 2015, 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage”. We believe that factors like ‘use of right techniques’, ‘extraction of right parameters’ and ‘discovering unknown values’ contribute to this. The endeavor was to investigate these factors in our case study.

The idea was to tap into the textual content of financial news articles/blogs at regular intervals, process it, store the extracted features in a database, build the model, and finally predict stock movement for a defined period ahead. As we present the predicted stock movements to the users, we also ingest the current ‘stock market feed’ to verify our predictions. This feedback is used by the system to fine-tune the model. The end-to-end workflow of the system can be seen below.

Figure 1. System Workflow (diagram components: News Feed, Stock Market Feed, ETL, Database, Application, Users; labelled flows: Prediction, Prediction Feedback)

Solution

Let us take a step back and understand the intuition. Consider two sentences: “Robert is a good student” and “Monica is the best student in the class”. What information can we extract from the first sentence? The principal entity of the sentence is “Robert”, the gender of the entity is “Male”, and the sentence says something positive about the entity. A question remains, though: is it measurable? From the second sentence we can easily interpret that “Monica” is a female student, and this sentence too says something positive about the entity. As they stand, there are no quantifiable attributes associated with these sentences. If we quantify the positive or negative mood/sentiment with weights, however, it can be measured. So let us consider ‘good’ = 4 and ‘best’ = 5. We can then label the two sentences as in the table below.

Main Element    Gender    Sentiment Index
Robert          Male      4
Monica          Female    5

What did we just do here? We transformed unstructured data into structured data. It looks simple because humans are trained to identify and distinguish the parts and tone of a sentence and the interpretation of language. The question is whether a machine can do the same. Yes, it can, with the help of text mining tools/technologies; the techniques used are Entity Extraction and Sentiment Analysis.

Predictive analytics can be applied to find answers to the unknown; here, the unknown we are looking for is the future movement of stocks. Stock news contains a lot of unstructured information about stock symbols, companies, profit/loss, and the overall sentiment about the companies and the market. Statistical algorithms use the structured information extracted from the financial news articles to build a model. This model is in turn used to predict the stock movement of the company searched for by the user.
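The labelling of the two example sentences can be sketched in a few lines of Python. This is only an illustration of the idea of turning unstructured text into structured rows, not the pipeline used in the case study. The weights come from the example (‘good’ = 4, ‘best’ = 5); the function name to_record, the field names, and the rule that the first word is the main element are assumptions made for this sketch. Gender is left out because it cannot be derived without a further lookup list.

```python
# Weights taken from the worked example: 'good' = 4 and 'best' = 5.
SENTIMENT_WEIGHTS = {"good": 4, "best": 5}


def to_record(sentence):
    """Turn one sentence into a structured row (main element, sentiment index)."""
    raw_words = sentence.split()
    words = [w.strip(".,!?").lower() for w in raw_words]
    # Assumption for this sketch: the first word is the main element.
    main_element = raw_words[0].strip(".,!?")
    # Sum the weights of every known sentiment keyword in the sentence.
    sentiment_index = sum(SENTIMENT_WEIGHTS.get(w, 0) for w in words)
    return {"main_element": main_element, "sentiment_index": sentiment_index}


records = [
    to_record(sentence)
    for sentence in (
        "Robert is a good student",
        "Monica is the best student in the class",
    )
]
for row in records:
    print(row)
```

A real system would replace the first-word rule with a named entity extraction step and the two-entry dictionary with a full keyword list.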



Implementation

News articles related to the NASDAQ stock exchange were used, and we tapped into them through an RSS news feed. The first step was to harvest the RSS feed for the news content. After cleansing, the articles are parsed to separate the core text. Each article is then passed through a data processing pipeline that consists of multiple steps and sub-steps. The major steps are Named Entity Extraction, Sentiment Analysis and our own algorithm. For this pipeline we used an external library, GATE, and referenced research work by Finn Arup Nielsen from the Technical University of Denmark.

Named Entity Extraction

GATE is used to extract, or identify, the entities in each article. GATE has a pre-designed workflow known as ANNIE. ANNIE is an Information Extraction system comprising a series of steps that can be reordered, removed or added: Tokenizer, Gazetteer, Sentence Splitter, POS Tagger, NE Transducer and Orthomatcher. The Tokenizer and Sentence Splitter split the text into words and sentences. The role of the Gazetteer is to identify entities in the text based on pre-defined lists. The POS Tagger tags each word in the text with its part of speech. The Orthomatcher module adds identity relations between named entities found by the semantic tagger, in order to perform co-reference. We trained GATE to identify new stock codes/symbols by learning from the existing codes that were available to us.

Sentiment Analysis

Two different approaches were used to calculate the sentiments, which are the basic features of the model to be built.

The primary features are English Keywords Sentiment, Stock Keywords Sentiment and Contextual Sentiment. To calculate the sentiments, we decided to write the algorithm along the lines of research done by Finn Arup Nielsen from the Technical University of Denmark. He uses a list of English keywords, known as AFINN, rated from -5 to +5.

Here is an example for illustration:

“Rohit is a good student he always gets more than 90% score”. With the help of Named Entity Extraction we get the following entities.

<Name>Rohit</Name> is a good student he always gets more than <percentage>90%</percentage> score.

The keywords matched from the AFINN list would be ‘good’ with a scale of 5 and ‘more’ with a scale of 4. The distance of each entity in this sentence from these two keywords is calculated. The further a keyword is from an entity, the more its effect on that entity is diminished, and vice versa. On this basis we get the English sentiment score, which may be positive, neutral or negative.

We further researched and scaled a list of about 3500 keywords related to the stock markets; these were used to calculate the stock sentiment score in accordance with the stock market mood. Some keywords cannot be attributed to either of the classes above and are contextual in nature; we calculated a contextual sentiment score for such stock keywords too.
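The distance-based scoring described for the example sentence can be sketched as follows. The case study does not state the exact formula, so the 1/distance decay used here is an assumption chosen only to show the principle that nearer keywords count more. The function name entity_sentiment and the dictionary KEYWORD_SCALE are hypothetical, and the entity positions are supplied by hand where a named entity extraction step would normally provide them.

```python
# Keyword scales from the example sentence ('good' = 5, 'more' = 4).
KEYWORD_SCALE = {"good": 5, "more": 4}


def entity_sentiment(tokens, entity_positions, scale=KEYWORD_SCALE):
    """Score each entity as the sum of keyword scale / token distance.

    The 1/distance decay is an assumption made for illustration only.
    """
    scores = {}
    for entity, entity_pos in entity_positions.items():
        total = 0.0
        for keyword_pos, token in enumerate(tokens):
            weight = scale.get(token.lower())
            if weight is None or keyword_pos == entity_pos:
                continue
            # Nearer keywords contribute more, distant ones less.
            total += weight / abs(keyword_pos - entity_pos)
        scores[entity] = total
    return scores


tokens = "Rohit is a good student he always gets more than 90% score".split()
# Positions that a named entity extraction step would normally supply.
entities = {"Rohit": tokens.index("Rohit"), "90%": tokens.index("90%")}
scores = entity_sentiment(tokens, entities)
print(scores)
```

In this sketch ‘good’ dominates the score for the name because it sits three tokens away, while ‘more’ dominates the score for the percentage because it sits two tokens away. The same routine could be run with a stock keyword list in place of KEYWORD_SCALE to obtain a stock sentiment score.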


Model

Once the features and their values were extracted, we built a couple of models with the statistical computing language R. The classification models used were ‘Random Forest’ and ‘Naive Bayes’. Once a model was available, we used it to predict the movement (UP/DOWN) of a stock symbol. Simultaneously we ingested the ‘Stock Market Feed’ for that particular symbol and used that feedback to recalibrate the weightage of the keywords and hence refine the model.
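To make the classification step concrete, here is a minimal Gaussian Naive Bayes sketch in plain Python. The case study built its models in R and does not say which Naive Bayes variant was used, so the Gaussian form, the function names fit_gaussian_nb and predict, and the training rows are all assumptions; the numbers are made-up toy values, not data from the study. Each row holds the three sentiment features (English, stock, contextual) and each label is UP or DOWN.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # A small constant keeps the variance strictly positive.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-6
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class label with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            # Log of the normal density for this feature value.
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy rows: (english, stock, contextual) sentiment scores.
train_rows = [(2.1, 1.5, 0.4), (1.8, 2.0, 0.9), (-1.2, -2.2, -0.3), (-2.0, -1.1, -0.8)]
train_labels = ["UP", "UP", "DOWN", "DOWN"]
model = fit_gaussian_nb(train_rows, train_labels)
prediction = predict(model, (1.5, 1.0, 0.2))
print(prediction)
```

Comparing such predictions with the movement reported by the stock market feed gives the error rate, and it is this comparison that the feedback loop uses to adjust the keyword weights.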

A few rows from Table 1 were used for fine-tuning the model. The predictions were about 80% accurate, as can be seen in the pie-charts in Figure 2.

Results Achieved

Table 1. Prediction table

Figure 2. Pie-charts: Actual v/s Predicted, for Naive Bayes (error 19.3%) and Random Forest (error 21.8%)


Conclusion

The aim of this case study was not to predict the stock market with 100% accuracy but to bring out the fact that there is business value hidden in the data which we see, touch and experience every day, and that techniques are available which can be put together to mine that value. As far as improving the prediction is concerned, there are several areas which can be attended to, such as including real-time social network feeds, refining the algorithms used and attuning them to stock markets, and introducing market segment features for the stock symbols.

Reference

http://www.gartner.com/technology/test/big-data.jsp

http://fnielsen.posterous.com/simplest-sentiment-analysis-in-python-with-af

http://r-project.org

http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

http://nlp.standford.edu/software

Author Info

Kinnar Kumar Sen, HCL Engineering and R&D Services

Hello, I’m from HCL’s Engineering and R&D Services. We enable technology led organizations to go to market with innovative products and solutions. We partner with our customers in building world class products and creating associated solution delivery ecosystems to help bring market leadership. We develop engineering products, solutions and platforms across Aerospace and Defense, Automotive, Consumer Electronics, Software, Online, Industrial Manufacturing, Medical Devices, Networking and Telecom, Office Automation, Semiconductor and Servers & Storage for our customers.

This whitepaper is published by HCL Engineering and R&D Services.

The views and opinions in this article are for informational purposes only and should not be considered as a substitute for professional business advice. The use herein of any trademarks is not an assertion of ownership of such trademarks by HCL nor intended to imply any association between HCL and lawful owners of such trademarks.

For more information about HCL Engineering and R&D Services, please visit http://www.hcltech.com/engineering-rd-services

For more details contact: [email protected]
Follow us on Twitter: http://twitter.com/hclers and our blog: http://www.hcltech.com/blogs/engineering-and-rd-services
Visit our website: http://www.hcltech.com/engineering-rd-services

