the what, why and how of big data
December 2nd, 2014
@NasoLuca
The What, Why and How of Big Data
Definitions, Examples, Suggestions, Howtos and much more
Agenda
✤ What is Big Data?
✤ Big Data Examples
✤ How to Tackle a Big Data Problem
✤ Sentiment Analysis
✤ Big Data tools
Part I Part II
How relevant is it?
Big Data
Social Media
Digital Marketing
Machine Learning
Computer Vision
Which is more relevant to people?
Let’s ask Google!
Google Trends
From 2007 to end 2014
What is Big Data? How relevant is it?
Big Data Market
In 2012 it was $28B; for 2013 it was expected to reach $37B, scattered across a number of IT landscapes. 45% for new social network analysis and content analytics tools[1]
Jobs to support Big Data
4.4 million IT jobs globally by 2015, 1.9 million in the US[1]
By 2018, the US alone could face a shortage of 200k people with deep analytical skills as well as 1.5 million managers and analysts[2]
Definition
Big Data according to Oxford Dictionary[3]:
big data n. Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.
Big Data according to Gartner[4]:
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
This is where the 3 Vs come from: Volume, Velocity, Variety
VOLUME
About: amount of data. Unit: bytes
What is Big Data? Definition
Information about the general population, education, health, medicine, travel, geographic locations, shopping, financial transactions, jobs, scientific experiments, emails, sensors, texts, photos, videos, activity on social networks …
2.5 Exabytes of data are created each day worldwide[5]
Facebook (2012): 200 PB of data each year
In 3 years CERN collected 75 PB of data (with LHC)
Most US companies have 100 TB[5]
1 ZB = 1000² PB = 1000³ TB = 1000⁴ GB
How much is Big Data? > 5 TB (as of 2014)
VELOCITY
About: moving data. Unit: bytes per second
This really has two interpretations: Data Generation Rate or Data Processing Rate
Every minute (2014)[6]:
• 200M emails
• 4M Google searches
• 277k more tweets
• 216k pictures on Instagram
What’s the limit to be considered big data?
As of 2014
Generation: time to reach 5 TB < Project Life Time
Processing: > 1 MB/s = 5 TB / 2 months
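The two thresholds are simple arithmetic. A minimal sketch to check them, assuming decimal (SI) units and a 30-day month (both are assumptions, the slides do not say); the function names are made up for illustration:

```python
# Decimal (SI) units; this is an assumption, the slides do not specify.
MB = 1000 ** 2
TB = 1000 ** 4

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a "month" is 30 days


def rate_mb_per_s(volume_bytes, months):
    """Average rate needed to move `volume_bytes` in `months` months."""
    seconds = months * DAYS_PER_MONTH * SECONDS_PER_DAY
    return volume_bytes / seconds / MB


def days_to_reach(volume_bytes, rate_bytes_per_s):
    """Days needed to accumulate `volume_bytes` at a constant generation rate."""
    return volume_bytes / rate_bytes_per_s / SECONDS_PER_DAY


processing_threshold = rate_mb_per_s(5 * TB, months=2)
print(f"5 TB in 2 months is about {processing_threshold:.2f} MB/s")
print(f"At 1 MB/s, 5 TB is reached in about {days_to_reach(5 * TB, 1 * MB):.0f} days")
```

Under these assumptions 5 TB over two months comes out just under 1 MB/s, which matches the rounded figure on the slide.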
VARIETY
About: form of the data.
3 types: structured, semi-structured, unstructured
1. Structured = Data in a fixed field within a record (spreadsheets, Relational Database)
2. Semi-Structured = XML, JSON, CSV (Text with columns, with a separator)
3. Unstructured = Data stored without any model, or that does not have any organisation
All of them can be Big Data
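To make the three types concrete, here is a small sketch using only the Python standard library. All records and values are made up for illustration.

```python
import csv
import io
import json

# 1. Structured: fixed fields within a record (here an in-memory table row).
structured = [{"id": 1, "city": "Catania", "temp_c": 18.5}]

# 2. Semi-structured: text that carries its own field names or separators.
json_text = '{"id": 2, "city": "Rome", "tags": ["rain", "wind"]}'
csv_text = "id,city,temp_c\n3,Milan,12.0\n"

json_record = json.loads(json_text)
csv_records = list(csv.DictReader(io.StringIO(csv_text)))

# 3. Unstructured: no model at all, e.g. free text.
unstructured = "Lovely sunny day in Palermo, about 20 degrees I think."

print(structured[0]["city"], json_record["city"], csv_records[0]["city"])
print(len(unstructured.split()), "words of free text")
```

Note that the CSV reader returns every field as text: turning "12.0" into a number is already a (small) cleaning step.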
This concludes the classical 3 Vs of Big Data. To better describe Big Data we can add a couple more Vs.
VERACITY
Lack of accuracy
Data itself is often imprecise or incomplete (typos, empty fields, errors, source changes, …)
The time of small and tidy samples is over
VALUE
About: the actionable insights one can get
People do not need data, they need insights which are hidden in the data: Value is a concentrated data-juice.
Obtaining correct, but irrelevant, information is a waste of time, effort and resources.
Close interactions between an analytics team and business managers can help you address the right questions.
“Datafication” is the movement behind Big Data[7]
What is Big Data? Implications
Big Data implicitly requires 3 paradigm shifts:
1. from “some” to “all”
2. from “clean” to “messy”
3. from “causation” to “correlation”
Big Data Examples
General Application Fields
Not only business: Big Data has implications far beyond marketing and consumer goods
It will profoundly change how governments work and alter the nature of politics and our daily life too (smart cities).
When it comes to generating economic growth, providing public services, or fighting wars, those who can harness big data effectively will have a significant edge over others.
Forbes thinks that it will influence us in 5 ways[8]:
1. how we spend
2. how we vote
3. how we study
4. how we stay healthy
5. how we keep/lose privacy
Big Data Examples - General Application Fields
1. Fire-prevention @ New York City[7]
Big Data Examples - Real Life Applications
Problem
Imbalance between needs and resources
Too many complaints (25,000 per year), too few inspectors (200).
You want your inspectors to tackle the most relevant cases only/first.
How to prioritise the complaints?
Solution
a. Database with information about buildings (crime rates, ambulance visits, utility usage, missed payments, …)
b. Compare database to records of building fires, looking for correlations
c. Estimate the probability of fire for each of the complaints
Result
The efficiency of the inspectors rose from 13% to 70%
Among the predictors of a fire were:
• the type of building and the year it was built
• permits for exterior brickwork correlated with lower risks
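Steps b and c of the solution can be sketched in a few lines. This is only a toy illustration, not the method New York City actually used: the records, the two features and the function names are all made up, and the "probability" is just the observed fire frequency for each feature combination.

```python
from collections import defaultdict

# Hypothetical historical records: (building_type, had_brickwork_permit, had_fire)
history = [
    ("warehouse", False, True),
    ("warehouse", False, True),
    ("warehouse", True, False),
    ("residential", False, True),
    ("residential", False, False),
    ("residential", True, False),
    ("residential", True, False),
    ("office", False, False),
]


def fire_rates(records):
    """Step b: observed fire frequency for each (type, permit) combination."""
    counts = defaultdict(lambda: [0, 0])  # key -> [fires, total]
    for building_type, permit, fire in records:
        counts[(building_type, permit)][0] += int(fire)
        counts[(building_type, permit)][1] += 1
    return {key: fires / total for key, (fires, total) in counts.items()}


def prioritise(complaints, rates, default=0.0):
    """Step c: sort complaints by estimated fire probability, highest first."""
    return sorted(
        complaints,
        key=lambda c: rates.get((c["type"], c["permit"]), default),
        reverse=True,
    )


rates = fire_rates(history)
complaints = [
    {"id": "C1", "type": "residential", "permit": True},
    {"id": "C2", "type": "warehouse", "permit": False},
    {"id": "C3", "type": "residential", "permit": False},
]
ranked = prioritise(complaints, rates)
print([c["id"] for c in ranked])
```

With this toy data the inspectors would visit C2 first, then C3, then C1. A real system would use far more features and a proper statistical model.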
2. Improve Formula 1 car performance[9]
Why is this Big Data?
Volume = average 10+ TB of data at each GP per team
Velocity = teams take decisions in <~ 30 seconds
Main goals
1. get real time alarms on brakes, tires, fuel and other factors that affect car performance during a race
2. find ways to improve car performance in the long term
a. Collect data: 130-160 sensors on a car during race, plus weather conditions, track conditions …
b. Compare data with records of success/failures
c. Look for correlations to get (1) real-time alarms and (2) long term insights
$1B: cost of saving 0.1s from a single lap
$60M: money spent by a team on a supercomputer
3. Predict Flu Outbreak in Real-Time
Flu can spread very fast, with catastrophic consequences, and traditional methods can be too slow.
Each day, millions of users around the world search for health information online. As you might expect, there are more flu-related searches during flu season.
Of course, not every person who searches for "flu" is actually sick, but a pattern emerges when all the flu-related search queries are added together.
a. Collect data: keywords searched on the web; data collected by national medical authorities (US Centers for Disease Control and Prevention - CDC)
b. Compare the trends of the search queries (top 50M) with the real recorded data
c. Find the keywords that correlate with the actual trends, to make predictions based on current searches.
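Steps b and c can be sketched as a plain correlation filter. This is not the actual Google Flu Trends model: the weekly numbers, the keywords and the 0.9 threshold are all made up for illustration, and Pearson correlation is used here simply as one possible way to "compare trends".

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no trend to compare
    return cov / sqrt(var_x * var_y)


# Made-up weekly figures: official case counts and search volumes per keyword.
official_cases = [10, 20, 40, 80, 40, 20]
searches = {
    "flu symptoms": [100, 210, 390, 820, 410, 190],
    "fever remedy": [50, 90, 210, 400, 220, 110],
    "football scores": [300, 310, 290, 305, 300, 295],
}


def select_keywords(searches, target, threshold=0.9):
    """Keep the keywords whose search trend follows the official trend."""
    return sorted(
        keyword
        for keyword, series in searches.items()
        if pearson(series, target) >= threshold
    )


selected = select_keywords(searches, official_cases)
print(selected)
```

Only the two flu-related keywords pass the filter here; their current search volume could then feed a prediction before the official figures arrive.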
There are 45 keywords that correlate well with the historical data
The predictions from this system can improve the CDC data by up to 50% [Royal Society Open Science, 2014]
[Figure: US real data (orange) vs. predictions based on keywords (blue)]
Google Flu Trends (GFT) project: www.google.org/flutrends/
Published in Nature in 2009[10]
An example of both the power and the failure of Big Data.
4. Reduce injuries in sports[11]
Injuries are probably the largest market inefficiency in pro sports
In 2013, teams in Major League Baseball spent $665 million on the salaries of injured players and replacements
Goal
Anticipate when an athlete will get hurt before it actually happens, so as to avoid it
a. Collect data: data about how players actually move (accelerations, elevations, jumping ranges, …) and at what intensity.
b. Compare with records of injuries; let doctors analyse the data
c. Predict the chance of an injury and intervene before it happens, during both workouts and matches
Founded in 2006, Catapult has seen its sales increase ~70% for six consecutive years and is on track to gross $20 million in 2013.
5. Running massive multiplayer games
“Infinity Challenge”, a massive 5-week online battle.
Two needs: handle a massive amount of data in almost real time to update leaderboards and detect cheaters.
The development team was taking these insights and updating the game almost weekly, using direct player feedback to tweak the game.
Behind the scenes there was the Microsoft Big Data cloud platform - HDInsight on Azure.
6. Transparency of Governments
Improving politics for all
In 2009 the US government started www.data.gov
Today there are 133k datasets in different fields: Agriculture, Climate, Education, Energy, Finance, Geospatial, Global Development, Health, Jobs & Skills, Public Safety, Science & Research, Weather
Many countries have followed, including Italy (from 2011): ~9k datasets from 80 PA
Code4italy @Montecitorio
The Dark Side
There is one massive downside to this: Privacy concerns
Do we really want all our data to be logged and stored? Data that can say where we are every day, which products we buy, which movies we watch, how fast (or slow) we drive our car, where we park it, which roads we usually take, where we go with our bike, how much exercise we do (or don’t), what we eat, how much we spend, which drugs we take, …
Security issues: track my position, steal my identity
Not all applications are customer-centric: insurance companies (use data to increase costs)
Governments need to protect citizens against unhealthy market dominance: data antitrust
Also, they need to better regulate the ways companies ask for and get the data (just asking for permission with Terms of Use is not enough!)
Big Data Examples - The Dark Side
At present the control of information is being taken away from citizens
The danger is that individuals will not be able to control the ways they are monitored or what happens to the information
How to Tackle a Big Data Problem
Preliminary Steps
First things first: check if it really is a Big Data problem
From the examples we have seen that the 3 common steps are:
1. collect data
2. find correlations (compare with historical records)
3. make predictions
Do not follow these steps!
These are relevant phases to execute a Big Data project, once everything is in place.
Preliminary steps:
1. Goals and timescale
what you want to achieve and by when
2. Data
which data you have or need to get
3. Team
which skills you need (can change with data)
4. Silo breaking
connections you need to create (CRM, IT, marketing)
5. Budget
how much money you can invest overall (business stakeholders)
How to Tackle a Big Data Problem - Preliminary Steps
How to Tackle a Big Data Problem - Four Universal Steps
1. Collect & store data (source, privacy, real-time)
2. Clean data (NA, errors)
3. Analyse data (correlations)
4. Visualise data (KPI)
It is very unlikely that you will get everything right (or everything you need) at the first attempt. Be prepared to iterate.
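The four steps map naturally onto a small pipeline of functions. The sketch below is purely illustrative: the data is a made-up in-memory list, and each function stands in for what would be a real system (an API client, a database, a dashboard).

```python
# Made-up raw values, including a missing marker and a program error.
RAW_ROWS = ["120", "NA", "95", "abc", "130"]


def collect():
    """Step 1: collect & store (here just a copy of an in-memory list)."""
    return list(RAW_ROWS)


def clean(rows):
    """Step 2: drop NA markers and anything that is not a whole number."""
    return [int(row) for row in rows if row.strip().isdigit()]


def analyse(values):
    """Step 3: reduce the clean values to a couple of summary numbers."""
    if not values:
        return {"count": 0, "mean": None}
    return {"count": len(values), "mean": sum(values) / len(values)}


def visualise(summary):
    """Step 4: turn the numbers into something a person can read at a glance."""
    if summary["mean"] is None:
        return "no usable data"
    bar = "#" * int(summary["mean"] // 10)
    return f"mean {summary['mean']:.1f} {bar} (n={summary['count']})"


summary = analyse(clean(collect()))
print(visualise(summary))
```

Keeping each step as its own function makes iterating cheap: when the first attempt turns out to be wrong, only the affected step has to change.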
Agenda
✤ What is Big Data?
✤ Big Data Examples
✤ How to Tackle a Big Data Problem
Part I Part II
✤ Sentiment Analysis
✤ Big Data tools
Sentiment Analysis
What is Sentiment Analysis?
Sentiment Analysis according to Oxford[14]:
The process of computationally identifying and categorising
opinions expressed in a piece of text, especially in order to
determine whether the writer’s attitude towards a particular
topic, product, etc. is positive, negative, or neutral.
Operative definition in steps:
Trying to understand what people think about a subject,
from what they write,
automatically,
producing a measure of what they think.
Sentiment Analysis - What is Sentiment Analysis?
The challenge:
Hundreds (if not more) of scientific papers have been published on this topic.
None of the problems is solved, yet applications are flourishing (plenty of space for new ideas)
What humans readily grasp from context is very difficult for computers to detect.
Abbreviations, bad spelling and grammar, sarcasm, irony, slang, idiom and personality
Show me the data! Where is the sentiment expressed?
• Activity on social networks
• Surveys
• CRM notes
• Reviews (movies, restaurants, events, …)
• Blogs
• News
Why is it important?
Today people are different, they are:
1. more digital/technological
2. more connected
3. less loyal to brands
Communication is bidirectional and people’s reach is large
The People, not the Companies, have the power …
… and they are not afraid to use it.
Sentiment Analysis - Why is it important?
Nestlé censors a Greenpeace video criticising the company
Domino’s Pizza employees post a video showing health code violations
United Airlines broke a guitar and did not reimburse
Some reasons to do sentiment analysis:
Gather feedback from customers (automatic, reliable)
• Gives you the chance to react in real time
Sentiment as a proxy of sales: opinions have a lot of influence
• To make predictions
Gather information from/about competitors (so start “listening”!)
• Find ways to get new customers
Sentiment Analysis - Techniques[13]
One technique consists in (mainly) looking for: lexical choice, negator, intensifier, modal operators
Here is an (old) opinion:
I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive.
Lexical choice (words):
positive: nice, boost, benefit, brave
negative: terrible, conspire, catastrophe, cowardly
Negator: can flip the valence,
not, never
Intensifier: gives the strength of the sentiment,
really, very, most
Modal operators: distinguish hypothetical from real situations and weaken intensity,
might, could, should
A text can contain multiple sentiments, which will usually be connected to each other, maybe through a comparison (as for products)
Analyse the whole text, sentence by sentence
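The four ingredients can be combined into a toy scorer. The word lists are the ones given above; everything else is an assumption made for illustration: the weights (intensifier doubles, modal halves, negator flips the sign) and the rule that a modifier applies to the next sentiment word in the same sentence.

```python
import re

# Word lists taken from the slide.
POSITIVE = {"nice", "boost", "benefit", "brave"}
NEGATIVE = {"terrible", "conspire", "catastrophe", "cowardly"}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"really", "very", "most"}
MODALS = {"might", "could", "should"}


def score_sentence(sentence):
    """Sum the sentiment words, applying any modifier seen just before them."""
    score = 0.0
    weight = 1.0  # pending strength from intensifiers and modals (assumed 2x / 0.5x)
    flip = False  # pending negation
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in NEGATORS:
            flip = not flip
        elif word in INTENSIFIERS:
            weight *= 2.0
        elif word in MODALS:
            weight *= 0.5
        elif word in POSITIVE or word in NEGATIVE:
            value = 1.0 if word in POSITIVE else -1.0
            score += (-value if flip else value) * weight
            weight, flip = 1.0, False  # modifiers are used up by this word
    return score


def score_text(text):
    """Score each sentence separately, since a text can hold several sentiments."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [(s, score_sentence(s)) for s in sentences]


for sentence, score in score_text("It is such a nice phone. My old one was a really terrible phone."):
    print(f"{score:+.1f}  {sentence}")
```

Sarcasm, irony and slang defeat a scorer like this immediately, which is exactly why the problem is still open.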
There is a market of fake opinions!
Every opinion is a quintuple: entity, feature, sentiment value, holder, time
Mike87 on 23-06-2009 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive”
(iPhone, GENERAL, +, Mike87, 23-06-2009)
(iPhone, touch_screen, +, Mike87, 23-06-2009)
…
We are turning unstructured data into structured data
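Once opinions are quintuples they can be handled like any other structured record. A minimal sketch, with the two quintuples above as data; the `Opinion` type and the `positive_share` helper are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class Opinion(NamedTuple):
    entity: str
    feature: str
    sentiment: str  # "+" or "-"
    holder: str
    time: str


opinions = [
    Opinion("iPhone", "GENERAL", "+", "Mike87", "23-06-2009"),
    Opinion("iPhone", "touch_screen", "+", "Mike87", "23-06-2009"),
]


def positive_share(opinions, entity):
    """Fraction of positive opinions about an entity, or None if there are none."""
    relevant = [o for o in opinions if o.entity == entity]
    if not relevant:
        return None
    return sum(o.sentiment == "+" for o in relevant) / len(relevant)


print(positive_share(opinions, "iPhone"))
```

The same records could just as well be rows in a database table, which is the point of the structuring step.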
An Operative Plan
Preliminary:
What’s your goal?
e.g. Reaction to my new product launch (1 month tail)
How can you obtain it?
e.g. Twitter, Facebook and related-field blogs (want to use Google Alerts?)
How can I measure it? Which KPI? Which test?
e.g. KPI: # of mentions/comments/posts, % of positive over total; choose threshold values for the goal to be met (for each KPI)
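The two example KPIs and their thresholds can be computed in a few lines. The posts, the sentiment scores and the threshold values below are all made up; in practice the scores would come from the analysis step.

```python
def kpis(posts):
    """Number of mentions and percentage of positive posts over the total."""
    total = len(posts)
    positive = sum(1 for post in posts if post["sentiment"] > 0)
    return {
        "mentions": total,
        "positive_pct": 100 * positive / total if total else 0.0,
    }


def ratio_to_threshold(values, thresholds):
    """How many times each KPI exceeds (or misses) its threshold; 1.0 means just met."""
    return {name: values[name] / thresholds[name] for name in thresholds}


# Hypothetical collected posts with an already computed sentiment score.
posts = [
    {"source": "twitter", "sentiment": 1.0},
    {"source": "twitter", "sentiment": 1.0},
    {"source": "facebook", "sentiment": -1.0},
    {"source": "blog", "sentiment": 0.5},
]
thresholds = {"mentions": 2, "positive_pct": 50.0}  # chosen before the launch

values = kpis(posts)
ratios = ratio_to_threshold(values, thresholds)
print(values)
print(ratios)
```

Expressing each KPI as a multiple of its threshold makes it easy to say whether the goal was met and by how much.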
Universal step 1: Collect and Store The Data
Identify the data
tweets that mention the product (or the company?), comments on your Facebook page posts, select the specific blogs to follow
Set up a system that can get the data
create/buy some tool to get the data automatically and programmatically
Store the data somewhere useful for the project and for your company (you don’t want to create new silos!)
Sentiment Analysis - An Operative Plan
Universal step 2: Clean The Data
Act on the data
deal with writer mistakes: replace, modify text
deal with program errors: remove records
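Both kinds of cleaning can be sketched as follows. The replacement table and the record layout are hypothetical; a real table would be built from the mistakes actually found in the collected data.

```python
import re

# Hypothetical replacement table for common writer mistakes and abbreviations.
REPLACEMENTS = {"gr8": "great", "teh": "the", "u": "you"}


def fix_text(text):
    """Writer mistakes: normalise whitespace and replace known bad words."""
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    return " ".join(REPLACEMENTS.get(word.lower(), word) for word in words)


def clean(records):
    """Program errors: drop records whose text is missing or empty, fix the rest."""
    cleaned = []
    for record in records:
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            continue  # nothing to analyse, remove the record
        cleaned.append({**record, "text": fix_text(text)})
    return cleaned


raw = [
    {"id": 1, "text": "teh  phone is gr8"},
    {"id": 2, "text": None},
    {"id": 3},
]
print(clean(raw))
```

This simple version only matches whole words, so "gr8!" with punctuation attached would slip through; handling that is part of iterating on the cleaning step.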
Universal step 3: Analyse The Data
Analyse the data, extract the sentiment
Build the KPI
Universal step 4: Visualise The Data
Learn from the numbers: you need to come up with a story
e.g. Reaction was massive on Twitter and Facebook (2x threshold), initially very positive (1.5x), then reduced but still good (1.3x); for blog posts the positive test was just passed (1x)
Visualise the story:
create a dashboard to follow the evolution in real time
create a static infographic to describe what happened
Big Data Tools
What is Hadoop?
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware
Created in 2005 by Doug Cutting and Mike Cafarella
Named after a toy elephant (belonging to Cutting's son). Originally developed to support the Nutch search engine project
The base Apache Hadoop framework is composed of the following modules:
1. Hadoop Common – libraries and utilities for other modules
2. Hadoop Distributed File System (HDFS) – a distributed file-system that splits files into large blocks and distributes them among the machines
3. Hadoop MapReduce – a programming model for large scale data processing. MapReduce ships code (.jar files) to the nodes that have the required data, and the nodes then process the data in parallel.
4. Hadoop YARN - resource-management platform
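The MapReduce programming model itself is easy to show on a single machine. The sketch below is a plain Python simulation of the classic word count, not the Hadoop API: in a real cluster the map and reduce functions would be shipped to the nodes holding the data blocks, and the framework would do the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "big tools"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can run in parallel on different machines, which is what makes the model scale.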
Big Data Tools - What is Hadoop?
The Hadoop Ecosystem
Since 2012, "Hadoop" often refers not to just the base modules but rather to the Hadoop Ecosystem,
which includes all of the additional packages that can be installed on top of or alongside Hadoop.
Let us meet some of the “Hadoop tools”:
Hive
Pig
Sqoop
Oozie
Big Data Tools - The Hadoop Ecosystem
Both Hive and Pig allow you to run MapReduce jobs using simple query languages
Hive: provides a SQL-like interface to the data, allows you to impose a schema on it, and is best suited for structured and semi-structured data
Pig: translates the Pig Latin language so that scripts can run on Hadoop. Best suited for data flow jobs, for semi-structured and unstructured data
Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores.
Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
Oozie is integrated with the rest of the Hadoop Ecosystem, supporting several types of Hadoop jobs out of the box (including Pig, Hive and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
Big Data with Microsoft
Hadoop can be deployed on premises as well as in the cloud. The cloud allows organisations to deploy Hadoop without acquiring hardware or needing specific setup expertise.
Vendors who currently have an offer for the cloud include Microsoft, Amazon and Google.
Let us focus on Microsoft
The key product is: HDInsight for Microsoft Azure
Big Data Tools - Big Data with Microsoft
Azure is Microsoft's cloud platform, which offers several services
Azure HDInsight: deploys and provisions Apache Hadoop clusters in the cloud; it is compatible with: Ambari, Avro, HBase, HDFS, Hive, Mahout, MapReduce and YARN, Oozie, Pig, Sqoop, Storm, Zookeeper.
Azure PowerShell: a scripting environment to control and automate the deployment and management of your workloads in Azure
Windows Azure Blob Storage (WASB)
Blob Storage is a general-purpose Hadoop-compatible Azure storage solution that integrates with HDInsight.
Store data in Azure (blob) instead of in the cluster (HDFS)
(Positive) consequences:
• Data are still there after you finish MapReduce jobs and shut the cluster down
• Easier to share data with other applications
Excel on steroids, thanks to some powerful add-ins
Power Query: simplifies data discovery and access.
• You can connect to data across a wide variety of sources, including relational databases, Web and Hadoop
• You can combine and refine the data
• You can save queries and refresh the data
Power Pivot: allows non-specialised users to do some Business Intelligence on different data sources and create interactive reports, sharable as web applications
Power View: a very interactive data exploration, visualisation and presentation tool
Power Map: a data visualisation tool that lets you plot geographic and temporal data on a 3D map, show it over time, and create visual tours
Contacts
it.linkedin.com/in/lucanaso/
@NasoLuca
www.edisonweb.com
References
Big Data & Digital Marketing
Most of the original material
has been posted on: