sentiment analysis on big data

SPAN White Paper

Sentiment Analysis on Big Data: Machine Learning Approach

Several sources on the web provide deep insight into people’s opinions on the products and services of various companies. Social networking sites such as Facebook, Twitter, Google+, LinkedIn and YouTube, along with blogs and discussion forums, carry loud messages from users who voice their opinions openly. The data captured from these sites is usually unstructured and huge in volume, and analyzing such massive content manually is a tedious task. This is where SPAN gains a competitive edge through its solution, in which data from all the above sources, going back multiple years, can be collected and processed to derive concise results. Our real-time search technology enables us to extract a complete body of expressions from many users across these sources, simultaneously, on any given subject. This white paper describes techniques by which the sentiments of users on multiple social forums can be analyzed to gain meaningful and actionable insight.


SPAN Prediction Engine

SPAN’s Prediction Engine on big data retrieves posts, comments or tweets about a company or a product to obtain predictive insights into the consumer thought process. To analyze the data and quantify the moods of individuals on social forums, the tool uses an algorithm developed specifically to analyze sentiments from social media conversations on a massive scale. Our Prediction Engine examines the data from social sources, scores the posts, comments, tweets, etc., on the sentiment scale and classifies each text by sentiment as ‘positive’, ‘negative’, ‘campaign’, ‘reply’ or ‘query’.

The same applies to other sentiment categories such as ‘reply’ and ‘query’. When each post expresses an adjective that belongs to one of the above categories, it becomes possible to compute a statistical model that captures and quantifies how people feel about something, as expressed in these social forums. A variety of sentiment analysis methods exist for analyzing all types of content from general news sources and other public data sources. The SPAN Prediction Engine achieves a relatively high accuracy rate: 78 percent agreement with manually reviewed content. For comparison, even humans typically agree with each other only about 80 percent of the time. Our tool processes sentiments for every single post on social forums, allowing the application to separate the mood around a particular product from changes in the overall mood of the moment. If the sentiment for “Product X” is low on a Monday morning, is it because people are unhappy with the product, or because the sentiment for all terms is more negative on that Monday morning? Using the SPAN Prediction Engine, you can analyze the general mood patterns of individuals to determine the true sentiment for any specific term.
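One way to picture the “Product X on a Monday morning” question is to subtract the day’s overall mood from the product’s own mood. The sketch below is illustrative only: the rows, the score range of -1 to +1, and the function name `relative_sentiment` are assumptions, not details of the SPAN Prediction Engine.

```python
from statistics import mean

# Hypothetical (day, term, score) rows; scores run from -1 (negative) to +1 (positive).
scored_posts = [
    ("Mon", "Product X", -0.6),
    ("Mon", "Product X", -0.4),
    ("Mon", "other", -0.5),
    ("Mon", "other", -0.3),
    ("Tue", "Product X", -0.5),
    ("Tue", "other", 0.3),
    ("Tue", "other", 0.1),
]


def relative_sentiment(rows, term, day):
    """Mean score for `term` on `day` minus the mean score of all posts that day."""
    day_scores = [score for d, _, score in rows if d == day]
    term_scores = [score for d, t, score in rows if d == day and t == term]
    return mean(term_scores) - mean(day_scores)


# Monday: the product is low, but so is everything else, so the gap is small.
monday_gap = relative_sentiment(scored_posts, "Product X", "Mon")
# Tuesday: the general mood is mildly positive, so the same low score stands out.
tuesday_gap = relative_sentiment(scored_posts, "Product X", "Tue")
```

A gap near zero suggests the product is simply following the mood of the moment, while a clearly negative gap points at the product itself.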

Instances:

The following post expresses a negative sentiment and is classified as negative: “What a horrible company with a horrible customer service and horrible attitudes.”

On the other hand, this tweet is classified as positive: “Nina was incredibly helpful and definitely made me a lot happier with your service.”

A post is classified as a campaign if the company posts it as an advertisement: “Try out our new 4G data plan for this month http://sampleurl.com/xjkk”

In response to a question on customer satisfaction, a typical reply would be: “I found the service to be good, and prompt.”

A consumer may want to inquire about a service center, with a query like: “Where is the service center nearest to my location?”

[Figure: Social Media Sources feed SPAN’s Predictive Engine, which outputs Positive, Negative, Reply, Campaign and Query.]


As depicted in the image above, the number of negative sentiments expressed on social forums on a daily basis exceeded the number of positive sentiments for that month.

Machine Learning

Machine learning deals with the construction and study of intelligent systems that identify changes in the data at hand and adapt their algorithms to accommodate new findings. For example, a machine learning system could be made to adapt constantly (based on buyer opinion) to rate health, life or automotive insurance policies with respect to coverage, duration, premium, benefits, popularity, etc. For an insurance service provider, this provides a high degree of success in selling its products. Ratings based on buyer sentiment can appropriately be used to recommend a policy that meets the expectations of a buyer.

When we gather large volumes of direct or indirect opinions, views, interests and perspectives, we need to apply learning algorithms to generalize or establish new points of interest. Machine learning poses many scientific and engineering challenges. The statistics of the collected data shift rapidly in real time, and so do the features of interest and the views expressed. Hence, the machine learning algorithms need to be continuously adaptive. For increased reliability, the statistical models need to be applied across multiple algorithms to obtain consolidated results.

The machine learning algorithms used to perform the sentiment analysis described in this paper are supervised learning algorithms. As the learning engine progresses with the continuous arrival of inputs (training data), the prediction accuracy of the engine increases. The learning engine is generic in nature and can be used for a variety of applications and across multiple domains.
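To make the idea of supervised learning concrete, here is a minimal sketch of a text classifier trained on manually labeled posts. The paper does not name the algorithm behind the SPAN Prediction Engine; multinomial naive Bayes with add-one smoothing is swapped in here purely as a well-known stand-in, and the function names are hypothetical. The training posts are the five example instances quoted in this paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(labeled_posts):
    """Count words per label; returns (word_counts, class_counts, vocabulary)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_posts:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, class_counts, vocabulary = model
    total_posts = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_posts)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Training data: the example posts from this paper with their stated labels.
training_posts = [
    ("What a horrible company with a horrible customer service and horrible attitudes.", "negative"),
    ("Nina was incredibly helpful and definitely made me a lot happier with your service.", "positive"),
    ("Try out our new 4G data plan for this month http://sampleurl.com/xjkk", "campaign"),
    ("I found the service to be good, and prompt.", "reply"),
    ("Where is the service center nearest to my location?", "query"),
]
model = train(training_posts)
```

Calling `predict(model, "horrible attitudes")` or `predict(model, "Where is the nearest center?")` then assigns a label to unseen text. With only five training posts this is a toy; the point is that retraining on a larger labeled set is the mechanism by which accuracy can improve as more training data arrives.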

Sentiment Analysis

For analysis purposes, the SPAN Prediction Engine was applied over extended time periods across all the social media data, isolating only those conversations referencing a telecom service provider company. This enabled us to comprehend how people actually felt when the company released a product or raised its tariff for existing customers. We compared the SPAN Prediction Engine’s output and stabilized these posts on different social media, and also quantified the volume of keywords related to the company or its products.

The graph represents the trend of negative comments posted in a particular month when a service by the telecom company was released.


Implementation Model

Basic Building Blocks in Sentiment Analysis

Input Data Source (Big Data): data source in HDFS; posts & tweets are fetched for the product/company.

Training Sets: fetched from HDFS; posts & tweets labeled manually based on the nature of sentiments.

Lexicons and Linguistic Resources: libraries to carry out NLP.

Data Pre-Processing: training sets & input sources with NLP.

Sentiment Analysis: model built as per training set; predicts sentiments for posts & comments from input source.

Sentiment Scores: the prediction output from the previous step is shown in reports.
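The pre-processing block above is not described in detail, so the following is only a guess at what such a step commonly involves: lowercasing, removing URLs, tokenizing, and dropping very common words. The stop-word list and the function name `preprocess` are assumptions made for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# A deliberately tiny, hypothetical stop-word list; real lexicons are far larger.
STOP_WORDS = {"a", "an", "and", "the", "to", "with", "for", "is", "be"}


def preprocess(post):
    """Lowercase a post, drop URLs, split into tokens and remove stop words."""
    text = URL_PATTERN.sub(" ", post.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    return [token for token in tokens if token not in STOP_WORDS]


campaign_tokens = preprocess(
    "Try out our new 4G data plan for this month http://sampleurl.com/xjkk"
)
```

The same function would be applied to both the training sets and the input posts, so that the model sees text in one consistent form.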

Statistical Model

A statistical model was built from thousands of training sets, which were tagged manually with precision. This model was then applied to the next set of social feeds about the telecom company from different social sources, which enabled us to determine the sentiments with an accuracy rate of 78 percent. Percentages above 60 are acceptable in predictive analytics, since most sentiment analysis models tag sentiments in three categories: positive, negative and neutral. We have further divided neutral sentiments into ‘reply’, ‘query’ and ‘campaign’ sections.
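An agreement rate such as the 78 percent quoted above is the share of posts on which the model’s label matches the manual tag. The sketch below shows that calculation, along with folding the ‘reply’, ‘query’ and ‘campaign’ labels back into a single neutral class; the label lists are made-up examples, not SPAN’s data.

```python
NEUTRAL_CATEGORIES = {"reply", "query", "campaign"}


def to_three_way(label):
    """Collapse the five categories into positive / negative / neutral."""
    return "neutral" if label in NEUTRAL_CATEGORIES else label


def agreement_rate(predicted, manual):
    """Fraction of posts where the predicted label equals the manual tag."""
    if not manual or len(predicted) != len(manual):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)


# Hypothetical labels for five posts.
manual_tags = ["negative", "positive", "query", "reply", "campaign"]
model_tags = ["negative", "positive", "query", "query", "campaign"]

five_way = agreement_rate(model_tags, manual_tags)
three_way = agreement_rate(
    [to_three_way(label) for label in model_tags],
    [to_three_way(label) for label in manual_tags],
)
```

In this toy case the only disagreement is between two neutral sub-categories, so the three-way rate is higher than the five-way rate. That is one reason figures from models with different category sets are not directly comparable.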

[Figure: architecture diagram. Labels shown: Unstructured Data, Structured Data, Using Hadoop, Learning Engine, Reporting Engine, Visualization, Report, Product, Services, Web Portal, User.]


The graphical representation depicts the correlation between negative comments and queries, while replies sit at the lower end. It suggests that an increase in the percentage of queries asked pushes up the ‘negative’ graph. The graph was validated when the company’s social media page was checked for user responses. This depiction provides a number of insights that help a company determine the ideal time to post a campaign about its new offerings, so as to obtain more positives than negatives or neutrals.

As depicted in the image above, the daily volumes of negative posts and queries expressed on these social forums were found to be correlated.
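The paper does not say how this correlation was measured. One common choice is the Pearson correlation coefficient over the daily counts, sketched here with invented numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical daily counts of posts classified as 'negative' and as 'query'.
daily_negative = [12, 30, 18, 45, 22]
daily_query = [8, 21, 11, 33, 15]

correlation = pearson(daily_negative, daily_query)
```

A value close to 1 means the two daily series rise and fall together. It does not by itself show that queries cause negative posts, which is why checking the company’s social media page was a useful validation step.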

The image above shows the time of day when most customers are active, which is mostly late at night.

There is a spike at 8 PM that keeps rising until midnight, which indicates that a company should post a campaign or an ad about its new product between 8 PM and midnight. Having considered the ideal time to post a campaign or an ad, you would also want to know the top influencers and the words people use most in their conversations, to understand what users of different age groups expect from a product / service.
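The time-of-day finding comes down to counting posts per hour. A minimal sketch, assuming timestamps in a `YYYY-MM-DD HH:MM` format (the format and the sample values are invented for illustration):

```python
from collections import Counter
from datetime import datetime

# Hypothetical post timestamps.
post_times = [
    "2014-03-03 09:15",
    "2014-03-03 20:05",
    "2014-03-03 21:40",
    "2014-03-03 22:10",
    "2014-03-03 23:55",
    "2014-03-04 13:30",
    "2014-03-04 20:45",
    "2014-03-04 21:20",
    "2014-03-04 21:50",
]


def posts_per_hour(timestamps):
    """Count posts by hour of day (0-23), ignoring the date."""
    hours = (datetime.strptime(ts, "%Y-%m-%d %H:%M").hour for ts in timestamps)
    return Counter(hours)


hourly = posts_per_hour(post_times)
busiest_hour = hourly.most_common(1)[0][0]
# Share of all posts that fall between 8 PM and midnight.
evening_share = sum(hourly[hour] for hour in range(20, 24)) / len(post_times)
```

A high `evening_share` is the kind of signal that would support scheduling a campaign in the 8 PM to midnight window.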

The image shows top influencers and words used in such conversations.
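A simple way to arrive at such lists is frequency counting. In the sketch below, “influencer” is approximated by the number of posts per author, which is an assumption; the paper does not define how influence is measured. The posts, user names and stop-word list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical (author, text) pairs.
posts = [
    ("user_a", "The new data plan is too expensive"),
    ("user_b", "Data speed on the new plan is great"),
    ("user_a", "Customer service never answers"),
    ("user_c", "Is the data plan available in my city"),
    ("user_a", "Still waiting on customer service"),
]
STOP_WORDS = {"the", "is", "on", "in", "my", "too", "a"}


def top_words(author_posts, n):
    """Return the n most frequent non-stop-words across all posts."""
    words = Counter()
    for _, text in author_posts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        words.update(token for token in tokens if token not in STOP_WORDS)
    return words.most_common(n)


def top_authors(author_posts, n):
    """Return the n authors with the most posts."""
    return Counter(author for author, _ in author_posts).most_common(n)


frequent_words = top_words(posts, 2)
most_active = top_authors(posts, 1)
```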


For more information on our entire range of solutions and related offerings, get in touch with: [email protected]

About SPAN:

SPAN is an established Software Services Company offering comprehensive IT services since 1994. Our clients include Fortune 1000 companies, Independent Software Vendors and start-ups. SPAN’s Offshore Development Center in India is CMMI Level 5, PCMM Level 3, ISO 9001:2008 and ISO 27001:2005 certified. SPAN has a global footprint with offices in the U.S., India and group offices in Europe. There are multiple offshore development centers in Bangalore and Chandigarh, India. SPAN is ranked #7 Best IT Employers in India by a leading IT publication. SPAN’s Relationship Management (RM) Model is a well-defined, yet flexible framework, which provides ongoing business value to both the client and SPAN. SPAN is wholly owned by USD 2.3 Billion Norwegian IT services major EVRY (www.evry.com).

USA Headquarters

SPAN Systems Corporation
30 Knightsbridge Road, Suite 525, 2nd Floor, Piscataway, NJ 08854
Phone: 732-384-3361 / 1-800-SPAN-SYS
Fax: 732-384-3365

India Headquarters

SPAN Infotech (India) Pvt. Ltd.
18/2, Vani Vilas Road, Basavanagudi, Bangalore 560 004, India
Phone: +91-80-40219600
Fax: +91-80-40219632
www.spansystems.com

Copyright © 2014 by SPAN. All rights reserved. The contents of this document are protected by copyright law and international treaties. SPAN acknowledges the proprietary rights of the trademarks and product names of other companies mentioned in this document. The reproduction or distribution of the document or any portion of it thereof, in any form or by any means without the prior written permission of SPAN is prohibited.

Conclusion

With millions of conversations occurring on social media each day, the science of extracting relevant data and using statistics to quantify how people are expressing themselves has become a rapidly evolving discipline. There are significant advantages to identifying correlations between social sentiments and product marketing when you are able to apply search techniques to social data, extracting only those conversations related to your company or product. When sentiment analysis is applied to such a focused set of conversations over longer durations, it gives precise outcomes that open up prospective avenues for a company to enhance the value of its product / service portfolio. SPAN’s analytical solution provides additional results as they become available and allows for deeper R&D, thereby improving an organization’s overall capabilities.
