

Analytics: Key to go from generating big data to deriving business value

Deepali Arora 1, Piyush Malik 2

1 Dept. of Electrical and Computer Engineering, University of Victoria, P.O. Box 3055 STN CSC, Victoria, B.C.
{darora}@ece.uvic.ca

2 Business Analytics and Strategy, IBM Global Business Services, 4400 N 1st Street, San Jose, CA
{Piyush.Malik}@us.ibm.com

Abstract: The potential to extract actionable insights from big data has attracted increasing attention from researchers in academia as well as in several industrial sectors. The field has become more interesting, and its problems more exciting to solve, ever since organizations began trying to tame large volumes of complex and fast-arriving big data streams through newer computing paradigms. However, extracting meaningful and actionable information from big data is a challenging and daunting task. The ability to generate value from large volumes of data is an art which, combined with analytical skills, needs to be mastered in order to gain competitive advantage in business. The ability of organizations to leverage emerging technologies and integrate big data into their enterprise architectures effectively depends on the maturity level of the technology and business teams, the capabilities they develop and the strategies they adopt. In this paper, through selected use cases, we demonstrate how statistical analyses, machine learning algorithms, optimization and text mining algorithms can be applied to extract meaningful insights from the data available through social media, online commerce, the telecommunication industry and smart utility meters, and used for a variety of business benefits, including improving security. The nature of the applied analytical techniques largely depends on the underlying nature of the problem, so a one-size-fits-all solution hardly exists. Deriving information from big data is also subject to challenges associated with data security and privacy. These and other challenges are discussed in the context of the selected problems to illustrate the potential of big data analytics.

2015 IEEE First International Conference on Big Data Computing Service and Applications

978-1-4799-8128-1/15 $31.00 © 2015 IEEE

DOI 10.1109/BigDataService.2015.62

I. INTRODUCTION

The analysis of big data and the associated potential to extract actionable information has gained the attention of researchers in both academia and industry [1], [2], [3], [4]. Researchers in both communities have emphasized developing new tools and techniques for better storing, managing, and analyzing big data [5]. The business community, however, is looking for ways to improve profits by leveraging information hidden in big data through analytics [6]. Massive amounts of data are generated on a daily basis from various sources including (but not limited to) online shopping transactions, gas and electric meters, electronic health records, social networking interactions, weather and satellite data, embedded sensors in industrial machinery as well as in automobiles and aircraft, data center computing equipment and telecommunication industry equipment. According to International Data Corporation (IDC), cumulative digital data is predicted to grow from 4.4 zettabytes (ZB) in 2013 to 44 ZB by the year 2020 [7]. Data is now considered the “new oil” of the economy, defined mainly by four prominent characteristics: volume, velocity, variety and veracity [8]. While a better understanding of the knowledge hidden within the large datasets generated from various sources can potentially help businesses, deriving useful information from these data is a big challenge.

Before any kind of actionable insight can be derived from data using advanced analysis techniques, several pre-processing steps are involved. These steps include data collection, data preparation and cleansing, data storage, and management [9]. The analysis of data can be broadly classified into three categories based on the depth of analysis: 1) descriptive analytics, which exploits historical trends to extract useful information from the data, 2) predictive analytics, which focuses on predicting the future probability of occurrence of patterns or trends, and 3) prescriptive analytics, which focuses on decision making by gaining insights into the system behavior [9]. Regardless of the depth of the analysis, extracting information from data requires a solid understanding of techniques comprising statistical analysis, optimization, machine learning, text-mining algorithms, etc.

A number of studies have highlighted the tools/algorithms that can be used to derive solutions for various problems associated with big data [1], [2], [3], [4], [10]. For example, [1] and [2] presented a brief review of the challenges and issues surrounding big data. Some of the popular tools, frameworks and technologies that can be used to aggregate, manage and analyze big data include Hadoop and its ecosystem of techniques and tools (such as Pig, Hive, HBase, Spark and High Performance Cluster Computing (HPCC)), in-memory computing engines, NoSQL databases and cloud-based data service engines; these are still nascent and continually evolving under the open source software movement. A brief overview of how big data can be used to derive value for various organizations, including government, educational institutions and industries, is presented in [11]. Possibilities and challenges in implementing big data related technologies in organizations, including storage of the data, lack of skilled people and the time involved in processing huge datasets, are discussed in [12]. [3] and [4] presented a general overview of how big data can be used to generate value for businesses. An in-depth tutorial on big data analytics is presented by Hu et al. [13], who assessed different techniques that can be used for data acquisition and pre-processing, data storage and management, and different analytics techniques to derive information.

While these studies provide a good overview of the big data opportunity, the issues and challenges involved, the value to businesses, and how various techniques can be used for data storage, processing or analytics in general, none of them has discussed the application of different algorithms to derive value for specific applications; this is the main focus of this paper. Using five different use cases, we illustrate how big data analytics has been used to obtain meaningful information. The use cases considered in this paper are sentiment analysis for social media, preventing customer churn in the telecommunication sector, enhancing customers’ online shopping experience, generating value from smart utility meters and improving data security. The objective of this paper is not to present new algorithms for any of the selected industry use cases but rather to provide a brief, illustrative overview of the existing algorithms and methods that can be applied to derive value from big data. A detailed discussion of how different innovative algorithms can be applied to realize value for each of these use cases, and of the challenges associated with them, is beyond the scope of the current paper and could be presented in an extended version of this paper in the future.

This paper is organized as follows: the application of data analytics to different domains is discussed in Section II, Section III highlights some of the challenges around big data analytics and, finally, conclusions are presented in Section IV.

II. APPLICATION OF BIG DATA ANALYTICS

A. Sentiment analysis in social networks

The explosion of data in the form of blogs, online forums and social media channels such as Facebook, Twitter, LinkedIn, Instagram, Pinterest, YouTube, etc. has given consumers a new way of expressing their opinions about any product or service, which in turn may influence other potential buyers. Investigation of users’ opinions or sentiments about any product or service, expressed in textual form on these websites/blogs, is referred to as sentiment analysis [14]. Sentiment analysis combines natural language processing with artificial intelligence capability and text analytics to evaluate statements found across various social platforms and determine whether they are positive or negative with respect to a particular brand, product or service [15]. Sentiment analysis thus provides business intelligence which can be used to make impactful decisions. In addition, consumers routinely look for online reviews before buying any product or service. Developing techniques that can better automate the process of analyzing user-generated web content about a given product or service is now the focus of research in both academia and industry. Several companies are also involved in designing algorithms/tools that can perform sentiment analysis online, either for free or at nominal cost. One such example is IBM Watson’s user modeling service, which uses linguistic analytics to generate psychographic profiles and extract cognitive and social characteristics based on users’ emails, text messages, tweets, forum posts, etc. [16]. Other examples of sentiment analysis tools include Google Analytics, Tweetstats, Social Mention, and Twendz [17].

There are three main classification levels in sentiment analysis: document-level, sentence-level, and aspect-level sentiment analysis [18]. The methodologies that can be used to detect them are broadly classified into three main categories, i.e., lexicon-based techniques, machine learning techniques and hybrid approaches [19]. The lexicon-based approach relies on a collection of known and pre-compiled sentiment terms, machine learning approaches are based on the application of different algorithms that can be trained, and hybrid approaches combine the two [20]. A number of studies have used lexicon-based approaches [21], supervised [22] or unsupervised [23], [24] machine learning approaches, and combined machine learning and lexicon-based approaches [25], [26] to classify sentiments into positive or negative categories.
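As a minimal sketch of the lexicon-based approach, the following Python example sums pre-compiled term scores over a piece of text and flips the sign of a term that follows a negation. The lexicon, the negation list and the function names are hypothetical and chosen only for illustration; practical lexicons contain thousands of scored terms and handle negation scope far more carefully.

```python
import re

# Hypothetical miniature lexicon; real lexicons hold thousands of scored terms.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum lexicon scores over the tokens, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        if value:
            score += -value if negate else value
            negate = False  # the negation is consumed by the first sentiment term
    return score


def sentiment_label(text):
    """Map the numeric score to a positive, negative or neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Under these assumptions, sentiment_label("This is not good") yields "negative" because the negation flips the score of "good", while text containing no lexicon terms is labelled "neutral".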

Sentiment analysis has been used by researchers to find people’s opinions, expressed on social media sites including Twitter, about products/services launched by a company [27]. It has also been used in a real-world industrial application (based on the second author’s experience) in which one of IBM’s clients leveraged sentiments from social media to identify influencers of a public policy. The general methodology in both these use cases involved four main steps: gathering data, generating features, designing a classifier that can differentiate between different sentiments, i.e., positive, negative or neutral, and finally deriving a sentiment score.

However, deriving information from user-created web content remains a daunting task, as sentiments may carry varying meanings in different disciplines and cultures. Thus, to derive meaningful results, data features such as individual keywords and their frequency of occurrence; parts of speech such as adjectives and adverbs; opinion words and phrases including good or bad, likes, dislikes; and negations [28], [29] should be carefully derived following feature selection techniques [18]. Supervised machine learning approaches such as classification algorithms can then be designed by converting the sentiment analysis problem into a simple text classification problem. For a standard text classification problem, a subset of the data is used to form a training record set in which each record carries a class label. These classes are related to the underlying feature values. The resulting classification model can then be used to predict the class label for any new instance. Several classification models are discussed in the literature [18]. Some of the commonly used classifiers include the Naive Bayes classifier, support vector machines (SVM), maximum entropy based classifiers, decision trees, and neural networks [18]. Similarly, unsupervised techniques can also be used to derive users’ sentiments about products/services [23], [24]. The power of integrating sentiments and intelligence trends from social media was recently hailed as the reason for IBM and Twitter to forge an alliance to incorporate Twitter analytics into their consulting business [30].
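The text classification formulation can be illustrated with a small multinomial Naive Bayes classifier written in plain Python. This is only a sketch under stated assumptions: the four training records are invented, word counts are the only features, and Laplace (add-one) smoothing is one common choice rather than a method taken from the cited studies.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_counts):
        logp = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            logp += math.log((count + 1) / (total_words + len(vocabulary)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical training record set; real systems use thousands of labeled posts.
TRAINING = [
    ("great phone love the camera", "positive"),
    ("excellent service very happy", "positive"),
    ("terrible battery hate the screen", "negative"),
    ("poor service very disappointed", "negative"),
]
MODEL = train_naive_bayes(TRAINING)
```

With this toy training set, predict(MODEL, "love the great camera") returns "positive" and predict(MODEL, "hate the poor battery") returns "negative". The other classifiers named above (SVM, maximum entropy, decision trees, neural networks) would replace only the training and prediction functions; the feature generation step stays the same.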


B. Preventing customer churn in telecommunication sector

The strong competition amongst telecommunication service providers has compelled them to offer packages that could potentially attract more customers or at least help them retain their existing ones. Since the cost of acquiring a new customer is relatively high compared to retaining existing customers [31], companies are developing new and competitive ways to retain their customers and maintain long-term relationships with them to avoid customer churn. Churners are customers who leave their existing telecommunication service provider and switch to a new one for different reasons [32]. Customers generally switch services for lower prices or better services. Predicting customer churn is important for companies as it directly affects their revenues. It can also help companies take action by offering better service or attractive packages to prevent their existing customers from switching to a different service provider. The literature reveals [33] that on average telecommunication companies face around 2.2% customer churn each month. Designing algorithms that can predict and in turn prevent customer churn is important to the telecommunication industry.

The problem of predicting churn and non-churn customers has been addressed in a number of studies [31], [32]. However, with increasing competition, companies are now turning towards machine learning algorithms to gain early insights into their customers’ behavior so that timely actions can be taken to prevent customer churn. One simple approach to predict whether a user is a churn or non-churn customer is to formulate it as a two-class classification problem that uses the underlying feature values to predict the outcome. Some of the possible features that can be used to define churn and non-churn classes include the duration of customers’ calls, services subscribed, usage pattern, and demographics [31]. A comprehensive review of the approaches that can be followed to predict churning customers is presented in [31], [32].

Telecommunication service providers can also use information about customers’ usage patterns, services subscribed and demographics to design and offer customized packages to their users [34]. One possible approach is to use clustering algorithms for customer segmentation based on the services they use [35], [36], where clustering refers to the partitioning of data points into a small number of clusters with some similarity. This allows companies to identify customers for future product promotions, to retain their customers and to attract new customers by offering customized packages to targeted audiences based on their usage behaviors.

A real-world example is Celcom, a telecommunication service provider in Asia that is using predictive personalized analytics to predict the churn probability of its customers. It is also offering personalized incentives and geolocation-based cross-brand promotional offers and coupons, thereby increasing engagement and loyalty within its client base [37].
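The two-class formulation of churn prediction discussed in this subsection can be sketched with logistic regression, one of many possible classifiers and not necessarily the one used in the cited studies or by any provider. The customer records below are invented, and the two features (normalized call usage and normalized number of subscribed services) are assumptions made only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights (bias first) with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        gradient = [0.0] * (n_features + 1)
        for row, label in zip(rows, labels):
            inputs = [1.0] + list(row)
            error = sigmoid(sum(w * x for w, x in zip(weights, inputs))) - label
            for j, x in enumerate(inputs):
                gradient[j] += error * x
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


def churn_probability(weights, row):
    """Estimated probability that a customer with these features churns."""
    inputs = [1.0] + list(row)
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


# Hypothetical customers: (normalized call usage, normalized services subscribed).
CUSTOMERS = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.85, 0.7),
             (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.15, 0.15)]
CHURNED = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = churner, 0 = non-churner
WEIGHTS = train_logistic_regression(CUSTOMERS, CHURNED)
```

In this toy data low usage goes with churn, so churn_probability(WEIGHTS, (0.1, 0.1)) comes out above 0.5 and churn_probability(WEIGHTS, (0.9, 0.9)) below it. A probability output, rather than a hard label, is convenient when retention offers are to be targeted at the customers most at risk.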

C. Enhancing customers’ online shopping experience

With advancements in technology and the introduction of smartphones and tablets, online shopping has become convenient, ubiquitous and so popular that it is predicted to grow to $370 billion in 2017 [38]. Businesses are now using advanced analytics to predict customer behaviors and to carry out customer segmentation based on the characteristics of the customer groups [39]. While data from online clicks on stores’ inventory does yield information about what a user is looking for, it still does not give companies complete information about their consumers, as many of them still go to retail malls to buy a product [40]. Retailers need to merge both offline and online data to design algorithms for better understanding of their customers’ behaviors and for designing product recommendation engines for different audiences [41].

One of the approaches followed to predict customer behavior is the use of transactional data. For example, [42] developed a model using hierarchical clustering and a hidden Markov model (HMM) to predict customer behavior based on transactional data. [43] also used a Markov model to predict the probability of click-to-conversion based on the time spent by the customer on the site. [44] compared the performance of aggregate (developing one model for all customers), segmented (developing models for different segments of customers) and 1-to-1 (developing models for individual users) marketing approaches across a broad range of experimental settings including multiple segmentation levels, real-world marketing datasets, dependent variables, different types of classifiers, segmentation/clustering techniques, and different predictive measures. Their results showed that both the 1-to-1 and segmentation approaches significantly outperform the aggregate modelling approach. However, when little transactional data is available, the segmentation models outperformed both the 1-to-1 and aggregate modelling approaches.

Once a retailer knows the underlying behavior of a consumer, it can design recommender systems that, based on the products the customer selected in the past, assist them in selecting similar products [45]. The underlying assumption is that consumers follow patterns similar to their past spending habits and are likely to repeat them in the future. Using different machine learning techniques such as classification, genetic algorithms, clustering or K-nearest neighbor algorithms [45], retailers can potentially identify different customer segments and predict customers’ preferences and spending abilities. This can help retailers better advertise their products to the right audiences.

Data mining techniques can also be used to market products to consumers based on their demographic information combined with their online activities. By combining information about the geographic location of a user, the time of day/week they visit a store and the products they buy, and mapping those attributes against the actual sales data, it is possible to highlight hidden interactions between the online and offline sales activity of a consumer. However, combining online and offline information is a real challenge for retailers [46].

While online retailers like Amazon and eBay are already using sophisticated data analytic techniques to enhance customers’ online shopping experience, traditional brick-and-mortar retailers are also now realizing the benefits of analytics for increased profits. The acquisition of Kosmix labs by Walmart in 2011 is one such example [47]. Recently, a mid-scale retailer, Macy’s, has also leveraged big data analytics for better inventory management based on customers’ segmentation characteristics. It developed a unique omnichannel strategy in which customers can order via different channels and pick up their order in a store of their choice, through a central online fulfillment center. In-store customer localization capabilities using either WiFi or beacons as underlying technologies are also emerging that would further assist in enhancing consumers’ shopping experiences in future [48], [49].
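The recommender idea discussed earlier in this subsection, namely suggesting products based on what similar customers bought in the past, can be sketched with a nearest-neighbor approach in plain Python. The purchase histories, the choice of Jaccard similarity and the function names are assumptions for illustration only; they are not taken from [45] or from any retailer's system.

```python
# Hypothetical purchase history: customer -> set of purchased products.
PURCHASES = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"novel", "bookmark"},
    "dave": {"laptop", "keyboard", "monitor"},
}


def jaccard(a, b):
    """Similarity of two customers: shared products over all products."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(customer, purchases, k=2):
    """Recommend unseen products bought by the k most similar customers."""
    own = purchases[customer]
    neighbors = sorted(
        (name for name in purchases if name != customer),
        key=lambda name: (-jaccard(own, purchases[name]), name),
    )[:k]
    votes = {}
    for name in neighbors:
        weight = jaccard(own, purchases[name])
        if weight == 0:
            continue  # customers with nothing in common carry no signal
        for product in purchases[name] - own:
            votes[product] = votes.get(product, 0.0) + weight
    # Highest-voted products first; ties are broken alphabetically.
    return sorted(votes, key=lambda p: (-votes[p], p))
```

Here recommend("alice", PURCHASES) returns ["monitor"], since the two customers most similar to alice both bought a monitor, while recommend("carol", PURCHASES) returns an empty list because no other customer shares any purchase with carol. The latter case illustrates why segmentation or aggregate models can be preferable when little transactional data is available for an individual.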

D. Generating value from smart utility meters

With the rapid deployment of smart electricity and gas meters, especially in developed countries, utility companies are also leaning towards extracting and utilizing the information generated from smart meter data for increased profits, improved customer satisfaction and better resource management [50]. A meter is called smart or intelligent due to its ability to measure electricity usage in real time at much smaller time intervals than traditional meters (which keep a record of cumulative electricity consumption) [50]. Smart meters also allow electricity consumption to be controlled remotely and supply to be switched off when needed. To convert the data into actionable insights, utility companies need to adopt techniques for accurate and timely collection, transfer, storage, processing and analysis of data. Many established companies including IBM, SAP and Oracle, as well as startups like AutoGrid, are currently assisting utility companies in designing solutions for better understanding the hidden potential of the data generated from smart meters [51], [52].

Several machine learning algorithms have been proposed in the literature for better management and control of data for utility companies [50]. [53] suggested that by grouping customers based on usage readings using clustering techniques, utility companies can identify consumers for targeted services. Knowledge of customer usage patterns can also assist utility companies in designing better demand response tariff plans. For example, utility companies can encourage consumers with flexible consumption patterns to minimize their usage during peak hours by offering incentives [54]. Likewise, consumers with high energy usage patterns can be penalized if they are unable to curtail their consumption by limiting the use of household appliances during peak energy usage hours. Machine learning algorithms such as independent component analysis [55] and clustering techniques [56] have also been used to identify the type of demand faced by different consumer groups during the day [55]. Multiple linear regression models have also been used to predict the usage of power in households [56]. Support vector machine classifiers have been used to distinguish user groups based on their usage patterns [57]. Customers’ load profiles can potentially assist in identifying and detecting irregularities or abnormalities caused either by faulty metering or by human intervention and fraud [51]. Finally, machine learning techniques can also be used to predict congestion or instability conditions within a network. This information can be used by utility companies to identify overloaded or ageing components and carry out timely preventive maintenance to avoid power losses and lost revenues [58].
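Grouping customers by their usage readings can be illustrated with k-means clustering, used here as a generic stand-in for the clustering techniques in the cited studies. In this sketch each hypothetical load profile holds four average readings in kWh (morning, midday, evening, night), and the initial centroids are picked by hand so that the result is deterministic; both choices are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(values) / len(points) for values in zip(*points)]


def k_means(points, initial_centroids, iterations=20):
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = mean_point(members)
    return assignments, centroids


# Hypothetical load profiles in kWh: [morning, midday, evening, night].
LOAD_PROFILES = [
    [0.4, 0.3, 2.1, 0.5],  # evening-peaking households
    [0.5, 0.2, 1.9, 0.6],
    [0.3, 0.4, 2.3, 0.4],
    [1.8, 2.0, 0.6, 0.2],  # daytime-peaking households
    [1.6, 2.2, 0.5, 0.3],
    [1.9, 1.8, 0.7, 0.2],
]
ASSIGNMENTS, CENTROIDS = k_means(LOAD_PROFILES, [LOAD_PROFILES[0], LOAD_PROFILES[3]])
```

With these inputs ASSIGNMENTS is [0, 0, 0, 1, 1, 1], separating evening-peaking from daytime-peaking households. The resulting centroids act as representative load profiles, which is the kind of summary a utility could use when designing demand response tariff plans for each group.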

Real-world examples that illustrate the use of data analysis for utility companies include EnerNOC and Comverge, which assist utility companies by designing tools such as demand response programs that encourage customers to reduce load demand during peak times, such as late afternoon during a heat wave, when the air conditioning load stresses the grid’s capacity. In exchange for lowering power consumption, consumers are offered rebates. Leveraging big data technologies, the AutoGrid software service also analyzes grid usage patterns to predict power demand a day ahead, thus encouraging both utilities and consumers to participate in load-shedding programs to prevent outages [59].
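Demand prediction of this kind can be illustrated, in a much simplified form, with a least-squares regression. The sketch below fits a single-variable line relating forecast temperature to peak demand; it is a deliberate simplification of the multiple linear regression models mentioned earlier, it does not describe how any of the named vendors forecast demand, and the temperature and demand figures are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_demand(model, temperature):
    """Predicted peak demand for a forecast temperature."""
    intercept, slope = model
    return intercept + slope * temperature


# Hypothetical history: afternoon temperature (deg C) and peak demand (MW).
TEMPERATURES = [20, 25, 30, 35]
PEAK_DEMAND = [50, 60, 70, 80]
MODEL = fit_line(TEMPERATURES, PEAK_DEMAND)
```

For this invented history the fitted line has an intercept of 10 MW and a slope of 2 MW per degree, so predict_demand(MODEL, 40) gives 90 MW for a forecast heat-wave afternoon. A multiple regression would add further predictors, such as hour of day or day of week, in the same way.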

E. Improving Security

Cybercrime costs $118 billion annually and this figureis expected to grow significantly [60]. With easy access toinformation available online, sophisticated cybercrimes areoccurring at an alarming rate due to which traditional securitysolutions are no longer sufficient to defend against these esca-lating threats. Incidents of hacking, identity theft and stealingcredit card data from retailers and banks are in the news quiteregularly but recent sophisticated and organized breaches atSony involving an unreleased movie have shaken the world.While a lot still needs to be done to prevent cyberterrorism,Big data analytics in security now offers promising solutionstowards efficient detection of suspicious activities over thenetwork. It is expected that big data analytics will impactvarious aspects of information security such as network mon-itoring, user authentication and control, authorization, identitymanagement, fraud detection, data loss prevention and control[61]. Using big data analytics to detect threats and designsecurity solutions, the enterprises are now able to prevent theirsystems from future threats.A number of data mining techniques to detect cyber crimes

are proposed in the literature [61]. For example, classificationmodels such as Naive Bayes, support vector machines, neuralnetworks, decision trees have long been used to detect spamemails [62], (spamming implies sending unsolicited emails).Support vector machine techniques have also been used toprevent Denial of Service (DoS) attacks, where DoS attackrefers to the process of making system inaccessible to otherusers [63], [64]. While [63] used Enhanced Multi ClassSupport Vector Machines (EMCSVM) to predict various kindsof DoS attacks, [64] proposed radial-basis function neuralnetwork (RBFNN) and support vector machines (SVM), tosolve the DoS problem with an ability to detect or predict newattacks based on the patterns similar to the attack patterns thatappeared in the past. Classification models have also been usedto detect Malware [65] and phishing URLs [66] and emails[67].Data mining techniques have also been used for anomaly

detection to search for unusual patterns and network behaviors[68]. While feature selection approaches are used to prioritizefeatures that can assist in differentiating normal behavior fromthe one affected by the presence of anomalies, classifiers areused to differentiate between patterns [69]. These anomaliescould be present either due to internal system failure or due

449

Page 5: Full Paper: Analytics: Key to go from generating big data to deriving business value

to external attacks. In the case of external attacks, identifying the intruders who carry out these malicious activities and identifying the types of attacks are other major issues. Machine learning approaches can now be used both for intruder detection [70] and for identifying the types of attacks [71].

Finally, as more companies turn towards cloud computing
for storage and processing of big data, the security of the cloud becomes essential. Cloud computing is vulnerable to security threats including insecure application programming interfaces, malicious insiders, shared technology vulnerabilities, data leakage and account hacking [72].

A number of companies are also working on designing
solutions to protect users from cybercrime. For instance, IBM’s QRadar security intelligence platform is designed to deliver the benefits of next-generation security information and event management technology to various companies [73]. Enterprises use QRadar solutions to collect and correlate billions of events and network flows per day in deployments that span multiple locations. By analyzing structured, enriched security data alongside unstructured data from across the enterprise, QRadar solutions can potentially detect malicious activities hidden deep in the masses of an organization’s data.
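As an illustration of the classification models mentioned in this section for spam detection, the following minimal Python sketch trains a multinomial Naive Bayes filter on a handful of made-up messages. It is not the method evaluated in [62]; the class name `NaiveBayesSpamFilter`, the whitespace tokenizer, the add-one smoothing and the toy training data are all assumptions introduced here purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes classifier with add-one smoothing."""

    def fit(self, messages, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all words seen during training.
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for a message."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / total_docs)  # log prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[c][word] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch only.
train_messages = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch on monday with the team",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

spam_filter = NaiveBayesSpamFilter().fit(train_messages, train_labels)
print(spam_filter.predict("claim your free cash prize"))
print(spam_filter.predict("report for the monday meeting"))
```

In practice such a filter would be trained on a large labeled corpus with richer features; the toy data above only shows the mechanics of learning word frequencies per class and comparing the resulting scores.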

III. CHALLENGES IN BIG DATA ANALYTICS

While big data analyses provide value to businesses, there are general issues surrounding them that must be carefully dealt with to exploit their full potential [31], [1]. One of the primary concerns around big data is security and privacy. Access to large data implies the potential to identify individuals and also to profile them on the basis of their behavior, likes, dislikes, daily routine, etc. Thus companies must take extra precautions to protect the confidentiality of users’ sensitive information. Another major challenge is data access and storage. With huge volumes of data being generated, it is not feasible to store it on a single machine, compelling companies to rely on the cloud for storage. Cloud computing can be used to manage and store these large datasets, but privacy in the cloud remains an open research problem. The risks of storing sensitive information on the cloud without sufficient security measures have unfortunately been illustrated in a number of instances. Eliminating a single point of failure by creating multiple copies of data and storing them on different nodes is also a challenge, as these nodes have to be synchronized to retrieve data efficiently. Since data is available in different formats, extracting it and combining it in a format that can be easily imported for analysis is another challenge. Finally, the skill set (a combination of advanced statistical techniques, data optimization methods, machine learning algorithms and a thorough understanding of business value) required to extract meaningful information from big data is seldom available.

While these challenges are applicable in general to all
industrial domains, there are also challenges specific to each of the applications considered in this study, which are briefly discussed below.

• Sentiment analysis: Sentiment analysis classifies text into three main classes, i.e., positive, negative and neutral, but given the subjectivity of text classification, in reality text can be classified into many categories [74]. Therefore, instead of simple two-class classifiers, multi-class classifiers should be used for better results. Designing a classifier for sentiment analysis is quite challenging when only a limited amount of training data is available [14]. Moreover, the training data should be selected carefully, as the same word may have different meanings in different domains depending on the context [75]. Sarcastic or ironic sentences often lead to wrong classification. Using only words rather than sentences can also lead to erroneous classification. Finally, drawing general conclusions about any product or service based on the limited number of tweets or posts available on the web can yield misleading results, so the results must be checked for statistical significance.

• Predicting customer churn: Cost constraints dictate that telecommunication companies focus more on retaining existing customers than on acquiring new ones, and they therefore offer promotions to existing customers who are likely to churn. However, finding the real cause of customer churn is not always easy, because identifying the underlying variables that best describe a customer’s behavioral profile is a challenging task and may not always reveal users’ true intentions, thus leading to wrong predictions. Moreover, integrating data from miscellaneous sources such as the customer base, call center inbound and outbound calls, billing, etc., to gather information about a customer is not always straightforward. In a highly competitive market, companies are now offering service plans suitable for different customer segments, but designing algorithms to group customers with similar preferences based on partial information alone may not yield feasible solutions.

• Enhancing online shopping experience: Despite its popularity, online shopping still has to overcome certain challenges to encourage customers. One of the main challenges in predicting customers’ behavior is merging online data with offline transaction data, as these datasets may not be managed by a single entity. Customers’ security and privacy concerns around the use of their transactional data for predicting their spending behavior also need to be addressed satisfactorily. Analyzing data to predict customers’ product preferences, in order to promote similar products or relevant coupons to targeted audiences, is a challenging issue that only gets harder with time as users’ shopping preferences change.

• Smart utility meters: One of the major challenges faced by utility companies is merging data that resides in disparate databases among their various departments. Credibility of data is another major challenge that could have a devastating effect on a firm’s reputation. Since the data generated by smart meters may contain abnormalities due to faulty behavior caused either by natural conditions or by human interference, making decisions based on faulty data can potentially impact utility companies’ revenues. Lack of infrastructure to support processing and analysis of the data generated from smart meters is another major challenge faced by utility companies. Predicting customers’ profile patterns, including the number of people living in a household, the appliances they use and the time of usage of different appliances, based on their electricity usage bills for promotional offers could also raise privacy concerns for users.

• Security: Although the application of big data analytics in improving security looks promising, it has its own challenges [76]. One of the major challenges faced by organizations is data leakage caused by third party intervention. Data is even more vulnerable to loss if it is housed in the cloud. Ownership of information hosted on the cloud is another major issue faced by organizations, and trust boundaries need to be established carefully between the data owners and the data storage owners. With large datasets stored on the cloud, proper security measures must be taken to prevent re-identification of users based on the information available through different datasets.
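One common way to quantify the re-identification risk mentioned in the Security challenge is k-anonymity, a measure that this paper does not itself name and that is used here only as an illustrative assumption. The Python sketch below counts how many records share each combination of quasi-identifiers (attributes that could be linked to other datasets) and flags records that fall in groups smaller than k. The function names, field names and sample records are hypothetical.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group, i.e. the k the dataset satisfies."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_records(records, quasi_identifiers, k):
    """Return records whose quasi-identifier group has fewer than k members."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] < k
    ]


# Hypothetical customer records; "plan" is the sensitive attribute.
records = [
    {"zip": "95134", "age_band": "30-39", "gender": "F", "plan": "A"},
    {"zip": "95134", "age_band": "30-39", "gender": "F", "plan": "B"},
    {"zip": "95134", "age_band": "40-49", "gender": "M", "plan": "A"},
    {"zip": "95134", "age_band": "40-49", "gender": "M", "plan": "C"},
    {"zip": "95050", "age_band": "20-29", "gender": "F", "plan": "B"},
]
quasi_ids = ["zip", "age_band", "gender"]

print(k_anonymity(records, quasi_ids))
for record in risky_records(records, quasi_ids, k=2):
    print("needs generalization or suppression:", record)
```

A record that is unique on its quasi-identifiers could be matched against another dataset carrying the same attributes, which is the re-identification scenario described above. Coarsening the attributes (for example, wider age bands) or suppressing such records are typical responses, at the cost of analytical detail.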

IV. CONCLUSIONS

The unprecedented growth in data in almost every sector provides businesses a unique opportunity to use analytics to decipher hidden insights that can be used for making better decisions. In this paper, through five different use cases, we have illustrated how analytics can be applied to derive value from big data for various industrial applications. The examples considered in this study include sentiment analysis for social media, preventing churn of telecommunication customers, enhancing customers’ online shopping experience, generating value from smart utility meters and improving security. While a number of different techniques have been proposed in the existing literature to derive value for these use cases, classification and clustering models have been the most widely used. The continuing growth of studies that attempt to derive value from big data suggests that big data analytics can provide useful insights for businesses, potentially also leading to increased revenues and business advantages over the competition. However, big data analytics also faces challenges that need to be addressed together in order to exploit the full potential of the hidden insights within these large datasets.

REFERENCES

[1] A. Katal, M. Wazid, and R. Goudar, “Big data: Issues, challenges, tools and good practices,” in Contemporary Computing (IC3), Sixth International Conference on, Aug 2013, pp. 404–409.

[2] S. Sagiroglu and D. Sinanc, “Big data: A review,” in Collaboration Technologies and Systems (CTS), International Conference on, May 2013, pp. 42–47.

[3] F. Muhtaroglu, S. Demir, M. Obali, and C. Girgin, “Business model canvas perspective on big data applications,” in Big Data, IEEE International Conference on, Oct 2013, pp. 32–37.

[4] A. Rajpurohit, “Big data for business managers; bridging the gap between potential and value,” in Big Data, IEEE International Conference on, Oct 2013, pp. 29–31.

[5] Z. Liu, P. Yang, and L. Zhang, “A sketch of big data technologies,” in Internet Computing for Engineering and Science, Seventh International Conference on, Sept 2013, pp. 26–29.

[6] S. Dhar and S. Mazumdar, “Challenges and best practices for enterprise adoption of big data technologies,” in Technology Management Conference (ITMC), 2014 IEEE International, June 2014, pp. 1–4.

[7] The digital universe of opportunities: Rich data and the increasing value of the internet of things. [Online]. Available: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

[8] P. Malik, “Governing big data: Principles and practices,” IBM Journal of Research and Development, vol. 57, no. 3/4, pp. 1:1–1:13, May 2013.

[9] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward scalable systems for big data analytics: A technology tutorial,” Access, IEEE, vol. 2, pp. 652–687, 2014.

[10] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud’10, 2010, pp. 10–15.

[11] N. Y. Xin and L. Y. Ling, “How we could realize big data value,” in Instrumentation and Measurement, Sensor Network and Automation (IMSNA), 2013 2nd International Symposium on, Dec 2013, pp. 425–427.

[12] J. Wielki, “Implementation of the big data concept in organizations - possibilities, impediments and challenges,” in Computer Science and Information Systems (FedCSIS), 2013 Federated Conference on, Sept 2013, pp. 985–989.

[13] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward scalable systems for big data analytics: A technology tutorial,” Access, IEEE, vol. 2, pp. 652–687, 2014.

[14] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, “Lexicon-based methods for sentiment analysis,” Comput. Linguist., vol. 37, no. 2, pp. 267–307, 2011.

[15] M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168–177.

[16] User modeling improves understanding of people’s preferences to help engage users on their own terms. [Online]. Available: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/user-modeling.html

[17] Five sentiment analysis tools that won’t cost you a cent. [Online]. Available: http://www.fieldassignment.com/2011/04/free-sentiment-analysis-tools.html

[18] W. Medhat, A. Hassan, and H. Korashy, “Sentiment analysis algorithms and applications: A survey,” Ain Shams Engineering Journal, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2090447914000550

[19] E. Boiy, P. Hens, K. Deschacht, and M.-F. Moens, “Automatic sentiment analysis in on-line text,” in Proceedings of the 11th International Conference on Electronic Publishing, 2007, pp. 349–360.

[20] D. Maynard and A. Funk, “Automatic detection of political opinions in tweets,” in The Semantic Web: ESWC 2011 Workshops, vol. 7117, 2012, pp. 88–99.

[21] B. Liu, Sentiment Analysis and Opinion Mining. Morgan and Claypool Publishers, 2012.

[22] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: Sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 2002, pp. 79–86.

[23] M. Usha and M. Indra Devi, “Analysis of sentiments using unsupervised learning techniques,” in Information Communication and Embedded Systems, International Conference on, Feb 2013, pp. 241–245.

[24] G. Li and F. Liu, “A clustering-based approach on sentiment analysis,” in Intelligent Systems and Knowledge Engineering, International Conference on, Nov 2010, pp. 331–337.

[25] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, and B. Liu. (2011) Combining lexicon-based and learning-based methods for Twitter sentiment analysis. [Online]. Available: http://www.hpl.hp.com/techreports/2011/HPL-2011-89.html

[26] P. P. Balage Filho, L. V. Avanco, M. d. G. V. Nunes, and T. A. S. Pardo, “NILC USP: An improved hybrid system for sentiment analysis in Twitter messages,” in Proceedings of the 8th International Workshop on Semantic Evaluation. Association for Computational Linguistics and Dublin City University, 2014, pp. 428–432.

[27] M. Neethu and R. Rajasree, “Sentiment analysis in Twitter using machine learning techniques,” in Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on, July 2013, pp. 1–5.

[28] C. C. Aggarwal and C. Zhai, Mining Text Data. Springer, 2012.

[29] Y. Mejova and P. Srinivasan, “Exploring feature definition and selection for sentiment classifiers,” in ICWSM’11, 2011, pp. 1–6.

[30] Twitter, IBM announce a new data analytics partnership. [Online]. Available: http://fortune.com/2014/10/29/twitter-ibm-data-analytics-partnership/

[31] N. Kamalraj and A. Malathi, “A survey on churn prediction techniques in communication sector,” International Journal of Computer Applications, vol. 64, no. 5, pp. 39–42, February 2013.

[32] W. Bandara, A. Perera, and D. Alahakoon, “Churn prediction methodologies in the telecommunications sector: A survey,” in Advances in ICT for Emerging Regions, International Conference on, Dec 2013, pp. 172–176.

[33] C.-P. Wei and I.-T. Chiu, “Turning telecommunications call details to churn prediction: a data mining approach,” Expert Systems with Applications, vol. 23, no. 2, pp. 103–112, 2002.

[34] C. Zhao, Y. Wu, and H. Gao, “Study on knowledge acquisition of the telecom customers’ consuming behaviour based on data mining,” in Wireless Communications, Networking and Mobile Computing, 4th International Conference on, Oct 2008, pp. 1–5.

[35] J. Zhao, W. Zhang, and Y. Liu, “Improved k-means cluster algorithm in telecommunications enterprises customer segmentation,” in Information Theory and Information Security, IEEE International Conference on, Dec 2010, pp. 167–169.

[36] L. Ye, C. Qiu-ru, X. Hai-xu, L. Yi-jun, and Y. Zhi-min, “Telecom customer segmentation with k-means clustering,” in Computer Science Education, 7th International Conference on, July 2012, pp. 648–651.

[37] Celcom loyalty deals. [Online]. Available: http://www2.nst.com.my/nation/celcom-loyalty-deals-1.558917

[38] J. Li. (2013) Study: Online shopping behavior in the digital era. [Online]. Available: http://www.iacquire.com/blog/study-online-shopping-behavior-in-the-digital-era

[39] P. Yang, Q. lun Zheng, H. Peng, and Q. Tan, “A stepwise learning approach to automatic discovery of interest data blocks,” in Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on, vol. 3, Aug. 2004, pp. 1441–1446.

[40] (2014) Making online shopping smarter with advanced analytics. [Online]. Available: www.cognizant.com/.../Making-Online-Shopping-Smarter-with-Advanced-analytics.pdf

[41] R. Dewan, M. Freimer, and Y. Jiang, “Using online competitor’s inventory information for pricing,” in System Sciences, 40th Annual Hawaii International Conference on, Jan 2007, pp. 210a–210a.

[42] M. Mestre and P. Vitoria, “Tracking of consumer behaviour in e-commerce,” in Information Fusion, 16th International Conference on, July 2013, pp. 1214–1221.

[43] M. Gupta, H. Mittal, P. Singla, and A. Bagchi, “Characterizing comparison shopping behavior: A case study,” in Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on, March 2014, pp. 115–122.

[44] T. Jiang and A. Tuzhilin, “Segmenting customers from population to individuals: Does 1-to-1 keep your customers forever?” Knowledge and Data Engineering, IEEE Transactions on, vol. 18, no. 10, pp. 1297–1311, Oct 2006.

[45] H.-W. Yang, Z. geng Pan, X.-Z. Wang, and B. Xu, “A personalized products selection assistance based on e-commerce machine learning,” in Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on, vol. 4, Aug. 2004, pp. 2629–2633.

[46] P. Henry and H. Luo, “WiFi: what’s next?” Communications Magazine, IEEE, vol. 40, no. 12, pp. 66–72, Dec 2002.

[47] Wal-Mart paid 300 million-plus for Kosmix. [Online]. Available: http://allthingsd.com/20110418/exclusive-wal-mart-paid-300-million-plus-for-kosmix/

[48] Beacons, beacons, everywhere beacons. [Online]. Available: http://www.mediapost.com/publications/article/231059/beacons-beacons-everywhere-beacons.html

[49] Stores sniff out smartphones to follow shoppers. [Online]. Available: http://www.technologyreview.com/news/520811/stores-sniff-out-smartphones-to-follow-shoppers/

[50] D. Alahakoon and X. Yu, “Advanced analytics for harnessing the power of smart meter big data,” in Intelligent Energy Systems, IEEE International Workshop on, Nov 2013, pp. 40–45.

[51] Generating big value from big data in energy and utilities. [Online]. Available: http://www-01.ibm.com/software/data/bigdata/industry-energy.html3

[52] Utilities and big data: Using analytics for increased customer satisfaction. [Online]. Available: http://www.oracle.com/us/industries/utilities/big-data-analytics-customer-wp-2075868.pdf

[53] S. Valero, M. Ortiz, C. Senabre, C. Alvarez, F. Franco, and A. Gabaldon, “Methods for customer and demand response policies selection in new electricity markets,” Generation, Transmission Distribution, IET, vol. 1, no. 1, pp. 104–110, January 2007.

[54] A. Albert and R. Rajagopal, “Smart meter driven segmentation: What your consumption says about you,” Power Systems, IEEE Transactions on, vol. 28, no. 4, pp. 4019–4030, Nov 2013.

[55] H. Liao and D. Niebur, “Load profile estimation in electric transmission networks using independent component analysis,” Power Systems, IEEE Transactions on, vol. 18, no. 2, pp. 707–715, May 2003.

[56] C. Beckel, L. Sadamori, T. Staake, and S. Santini, “Revealing household characteristics from smart meter data,” Energy, 2014.

[57] S. K. T. J. Nagi, K. S. Yap and S. K. Ahmed, “Power load forecasting using hybrid self-organizing maps and support vector machines,” in 2nd International Power Engineering and Optimization Conference, June 2008.

[58] F. Zhao, G. Wang, C. Deng, and Y. Zhao, “A real-time intelligent abnormity diagnosis platform in electric power system,” in Advanced Communication Technology (ICACT), 2014 16th International Conference on, Feb 2014, pp. 83–87.

[59] M. LaMonica. Bringing big data to smart meters. [Online]. Available: http://www.technologyreview.com/view/506476/bringing-big-data-to-smart-meters/

[60] Cyber security analytics. [Online]. Available: http://www.teradata.com/Cyber-Security-Analytics/

[61] T. Mahmood and U. Afzal, “Security analytics: Big data analytics for cybersecurity: A review of trends, techniques and tools,” in Information Assurance (NCIA), 2013 2nd National Conference on, Dec 2013, pp. 129–134.

[62] P. Panigrahi, “A comparative study of supervised machine learning techniques for spam e-mail filtering,” in Computational Intelligence and Communication Networks, Fourth International Conference on, Nov 2012, pp. 506–512.

[63] T. Subbulakshmi, S. Shalinie, V. GanapathiSubramanian, K. BalaKrishnan, D. AnandKumar, and K. Kannathal, “Detection of DDoS attacks using enhanced support vector machines with real time generated dataset,” in Advanced Computing (ICoAC), 2011 Third International Conference on, Dec 2011, pp. 17–22.

[64] G. Tsang, P. Chan, D. Yeung, and E. Tsang, “Denial of service detection by support vector machines and radial-basis function neural network,” in Machine Learning and Cybernetics, Proceedings of 2004 International Conference on, vol. 7, Aug 2004, pp. 4263–4268.

[65] M. Mas’ud, S. Sahib, M. Abdollah, S. Selamat, and R. Yusof, “Analysis of features selection and machine learning classifier in Android malware detection,” in Information Science and Applications, International Conference on, May 2014, pp. 1–5.

[66] J. James, L. Sandhya, and C. Thomas, “Detection of phishing URLs using machine learning techniques,” in Control Communication and Computing, International Conference on, Dec 2013, pp. 304–309.

[67] A. Almomani, B. Gupta, S. Atawneh, A. Meulenberg, and E. Almomani, “A survey of phishing email filtering techniques,” Communications Surveys Tutorials, IEEE, vol. 15, no. 4, pp. 2070–2090, Fourth Quarter 2013.

[68] B. Thuraisingham, “Data mining for security applications,” in Machine Learning and Applications, 2004. Proceedings. 2004 International Conference on, Dec 2004, pp. 3–4.

[69] A. Aziz, A. Hassanien, S.-O. Hanaf, and M. Tolba, “Multi-layer hybrid machine learning techniques for anomalies detection and classification approach,” in Hybrid Intelligent Systems (HIS), 2013 13th International Conference on, Dec 2013, pp. 215–220.

[70] L. Khan, M. Awad, and B. Thuraisingham, “A new intrusion detection system using support vector machines and hierarchical clustering,” The VLDB Journal, vol. 16, no. 4, pp. 507–521, Oct. 2007.

[71] T. Subbulakshmi, S. Shalinie, V. GanapathiSubramanian, K. BalaKrishnan, D. AnandKumar, and K. Kannathal, “Detection of DDoS attacks using enhanced support vector machines with real time generated dataset,” in Advanced Computing, Third International Conference on, Dec 2011, pp. 17–22.

[72] M. Khorshed, A. Ali, and S. Wasimi, “Trust issues that create threats for cyber attacks in cloud computing,” in Parallel and Distributed Systems, IEEE 17th International Conference on, Dec 2011, pp. 900–905.

[73] IBM security intelligence with big data. [Online]. Available: http://www-03.ibm.com/security/solution/intelligence-big-data/

[74] J. T. Mr. Saifee Vohra, “Applications and challenges for sentiment analysis: A survey,” International Journal of Engineering Research and Technology, vol. 2, 2013.

[75] H. R. P, “Opinion mining and sentiment analysis - challenges and applications,” International Journal of Application or Innovation in Engineering and Management (IJAIEM), vol. 3, 2014.

[76] A. A. Cardenas, P. K. Manadhata, and S. P. Rajan, “Big data analytics for security,” IEEE Security and Privacy, vol. 11, no. 6, pp. 74–76, 2013.
