introduction to big data: making sense of the world around us

Srinath Perera

Big Data Analysis: Deciphering the Haystack

Image cedit, CC licence, http://ansem315.deviantart.com/art/Asimov-Foundation-395188263

• Predict crime before it happens?• Which is hard!

• Asimov’s “Foundation” talks about mathematical models to predict the future.

• We are entering that Era where above are not just science fictions

e.g. Targeted Marketing

• Assume mass emails to 1M people, reaction rate of 1%, 2$ cost per email.– Then cost 2M$ and reach of 10k

people.• Lets say that looking at

demographics (e.g. where they live), you can find 250K people with reaction rate of 6%, then (e.g. by using decision trees)• Then cost 500K$ and reach of

15k people.

A day in your life Think about a day in your life?– What is the best road to take?– Would there be any bad weather?– How to invest my money?– How is my health?

There are many decisions that you can do better if only you can access the data and process them.

http://www.flickr.com/photos/kcolwell/5512461652/ CC licence

Internet of Things• Currently physical world and

software worlds are detached • Internet of things promises to

bridge this– It is about sensors and

actuators everywhere – In your fridge, in your blanket,

in your chair, in your carpet.. Yes even in your socks

– Umbrella that light up when there is rain and medicine cups

What can we do with Big Data?• Optimize (World is inefficient)

– 30% food wasted farm to plate

– GE 1% initiative (http://goo.gl/eYC0QE )• 1% saving in trains can save 2B/ year

• 1% in US healthcare is 20B/ year

• In contrast, Sri Lanka total exports 9B/ year.

• Save lives – Weather, Disease identification, Personalized

treatment

• Technology advancement– Most high tech research are done via simulations

Big Data Architecture

Why Big Data is hard?• How store? Assuming 1TB bytes it

takes 1000 computers to store a 1PB

• How to move? Assuming 10Gb network, it takes 2 hours to copy 1TB, or 83 days to copy a 1PB

• How to search? Assuming each record is 1KB and one machine can process 1000 records per sec, it needs 277CPU days to process a 1TB and 785 CPU years to process a 1 PB

• Big data needs distributed systems http://www.susanica.com/photo/9

Tools for Processing Data

Big data Processing Technologies Landscape

MapReduce/ Hadoop• First introduced by Google, and

used as the programming model for their systems

• Implemented by opensource projects like Apache Hadoop and Spark

• Users writes two functions: map and reduce

• The framework handles the details like distributed processing, fault tolerance, load balancing etc.

• Widely used, and the one of the catalyst of Big data

void map(ctx, k, line){(player, speed) =

split(line, ‘,’);

ctx.emit(player,speed)}

void reduce(ctx, player, speeds[]){

ctx.emit(k,avg(speeds));}

MapReduce (Contd.)

Apache Spark• New programming

model built on functional programming concepts

• Can be much faster for recursive usecases

• Performance: Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. The previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. (e.g. 30X speedup)

Calculating Avg Speed with Spark

pairs = file.map(fnSplit2Pair);tot = pairs.reduceByKey(a,b => a + b);count = pairs.reduceByKey(a, b => 2);avgSpeed = tot / count;

• Map data to a virtual variable, which does not load the data

• Then apply lambda functions

file = spark.textFile("hdfs://… speed.data”)

What if you can freeze time!!• Most solutions are overnight • Think how you would buy

something! Research a bit and buy, often overnight is too late.

• But not all trends takes time – People change their mind – Trends move fast – React to what customer is doing (do

not let him move away)• At CEP speed 400k/sec, if each

event takes a second, it takes 4 days to pass a second in real world!!

Real-time Analytics• Idea is to process data as they are

received in streaming fashion • Used when we need

– Very fast output – Lots of events (few 100k to millions)– Processing without storing (e.g. too

much data)• Two main technologies

– Stream Processing (e.g. Strom, http://storm-project.net/ )

– Complex Event Processing (CEP)http://wso2.com/products/complex-event-processor/

Complex Event Processing (CEP)• Sees inputs as Event streams and queried with

SQL like language • Supports Filters, Windows, Join, Patterns and

Sequences

define partition “playerPartition” as PlayerDataStream.pid; from PlayerDataStream#win.time(1m)

select pid, avg(speed) as avgSpeedinsert into AvgSpeedStream using partition playerPartition;

DEBS Grand Challenge• Event Processing

challenge • Real football game,

sensors in player shoes + ball

• Events in 15k Hz • Event format – Sensor ID, TS, x, y, z, v, a

• Queries– Running Stats– Ball Possession– Heat Map of Activity – Shots at Goal

Example: Detect Ball Possession • Possession is time a player

hit the ball until someone else hits it or it goes out of the ground

• See demo, http://goo.gl/VW6xQN

from Ball#window.length(1) as b joinPlayers#window.length(1) as p

unidirectional on debs: getDistance(b.x,b.y,b.z, p.x, p.y, p.z) < 1000 and b.a > 55select ...insert into hitStream

from old = hitStream ,b = hitStream [old. pid !=

pid ],n= hitStream[b.pid == pid]*,

( e1 = hitStream[b.pid != pid ]or e2=

ballLeavingHitStream)select ...insert into BallPossessionStream

http://www.flickr.com/photos/glennharper/146164820/

Lambda Architecture

Machine Learning Tools• R – programming language for statistical computing

(most widely used)• Weka – java machine learning library (single node)• Scikit-learn – very easy to use python library • Scalable – Mahout : MapReduce implementation of Machine

learning algorithms – MLBase (based on Spark)– Others: GraphLab, VW, 0xData

• PMML (Predictive model markup language)– Let you port models between languages

Solving the Problem

Curious Case of Missing Data

http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/

• WW II, Returned Aircrafts and data on where they were hit?

• How would you add Armour?

Challenges due to Nature of Bigdata

• Lack of Control Experiment – Often countered with A/B testing in the field – Hard to prove causality

• Does it coming from a representative sample?• Privacy – Security – Randomized techniques (see http://goo.gl/sLfKIb )

Big data lifecycle

• Get the data, clean up

Making Sense of Data• Hindsight (to know what

happened)• Basic analytics + visualizations (min,

max, average, histogram, distributions … )

• Oversight (to know what is happening and fixing it)– Realtime analytics

• Insight – Pattern mining, Clustering,

• Foresight – Neural networks, Classification,

Recommendation

Hindsight (What happened?)• Analytics Implemented with

MapReduce or Queries – Min, Max, average, correlation,

histograms – Might join or group data in many

ways – Heatmaps, temporal trends

• Key Performance indicators (KPIs)– Avg time for a ticket for customer

service – Profit per square feet for retail

• Data is often presented with some visualizations http://www.flickr.com/photos/isriya/

2967310333/

Drill Down• Idea is to let users drill down

and explore a view of the data– E.g. find customers, region,

time of year etc., that responsible for most revenue

• With OLAP, Users go and define 3 dimensions (or more), and tool let users explore the cube and only look at subset of data. – E.g. tool: Mondrian, Apache

http://pixabay.com, by Steven Iodice

Usecase: Planning

• Urban Planning – People distribution – Mobility – Waste Management– E.g. see http://goo.gl/

jPujmM• Market Research – Buying Patterns – Sentiments

Oversight (What happening?)

• Realtime analytics• Realtime visualizations • Alarms (find problems) and

action recommendations– Classification – Anomaly detection

• Drill down and look at historical data as before.

Oversight : Usecases• Preprocessing: Correlations, filtering, transformations • Tracking - follow some related entity’s state (such as in

space, time or process status). – e.g. location of airline baggage, vehicle, tracking wild life

• Respond to emergencies – E.g. plan maintenance before aircraft lands

• Detect trends - event sequences, missing events, thresholds, Outliers, Complex trends triple bottom etc., – (e.g. algorithmic trading, SLA, system management)

• Building Profiles – extract info, relationships (e.g. targeted marketing)– Marketing, Recommendations

Insight (Understanding Why ?)• Pattern Mining – find frequent

associations (e.g. Market Basket), frequent sequences

• Clustering• Graph Analysis• Knowledge Discovery• Correlations between features and Finding principal

components • Simulations, Complex System modeling, matching a

statistical distribution

Usecase 1: Clustering• Clustering => group

introduction to big data: making sense of the world around us

Technology

make sense of your big data - pilato

getting your head around big data

big dollars little sense report

around a few big buttons

making sense of big data in health research: towards an eu

data needs in oncology: “making sense of the big data...

create a sense of urgency around your survey data |...

big rivers electric corporation’s response ......

making sense of the big society

connections: making sense of the world around us by

making sense of big data

making sense of linn energy llc's big trade

making sense of big data from ride-sharing services

making sense out of your big data

big internet companies around europe

remote sense big data analytics

rockin’ around big bend!

social media analytics - making sense of big data

michellemcqueen.weebly.com€¦ · web view ·...

a journey in making sense of big data (part 1: big bang)