introduction to big data: making sense of the world around us

44
Srinath Perera Big Data Analysis: Deciphering the Haystack

Upload: srinath-perera

Post on 02-Dec-2014

702 views

Category:

Technology


2 download

DESCRIPTION

World we live in generates a massive amounts of data. These data comes from many sources like people's activities in the internet, mobile activities, IoT devices and sensors, real world systems like power grid or factories etc. Big data is collection of technologies used to make sense of that data. A primary outcome of Bigdata is to derive useful and actionable insights from large or challenges data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This includes calculating simple analytics like Mean, Max, and Median, to derive overall understanding about data by building models, and finally to derive predictions from data. Some cases we can afford to wait to collect and processes them, while in other cases we need to know the outputs right away. MapReduce has been the defacto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. There are other technologies like Apache Spark and Apache Drill graining ground, and also realtime processing technologies like Stream Processing and Complex Event Processing. Finally there are lot of work on porting decision technologies like Machine learning into big data landscape. This talk discusses big data processing in general and look at each of those different technologies comparing and contrasting them.

TRANSCRIPT

Page 1: Introduction to Big Data: making sense of the World around Us

Srinath Perera

Big Data Analysis: Deciphering the Haystack

Page 2: Introduction to Big Data: making sense of the World around Us

Image cedit, CC licence, http://ansem315.deviantart.com/art/Asimov-Foundation-395188263

• Predict crime before it happens?• Which is hard!

• Asimov’s “Foundation” talks about mathematical models to predict the future.

• We are entering that Era where above are not just science fictions

Page 3: Introduction to Big Data: making sense of the World around Us

e.g. Targeted Marketing

• Assume mass emails to 1M people, reaction rate of 1%, 2$ cost per email.– Then cost 2M$ and reach of 10k

people.• Lets say that looking at

demographics (e.g. where they live), you can find 250K people with reaction rate of 6%, then (e.g. by using decision trees)• Then cost 500K$ and reach of

15k people.

Page 4: Introduction to Big Data: making sense of the World around Us

A day in your life Think about a day in your life?– What is the best road to take?– Would there be any bad weather?– How to invest my money?– How is my health?

There are many decisions that you can do better if only you can access the data and process them.

http://www.flickr.com/photos/kcolwell/5512461652/ CC licence

Page 5: Introduction to Big Data: making sense of the World around Us
Page 6: Introduction to Big Data: making sense of the World around Us

Internet of Things• Currently physical world and

software worlds are detached • Internet of things promises to

bridge this– It is about sensors and

actuators everywhere – In your fridge, in your blanket,

in your chair, in your carpet.. Yes even in your socks

– Umbrella that light up when there is rain and medicine cups

Page 7: Introduction to Big Data: making sense of the World around Us

What can we do with Big Data?• Optimize (World is inefficient)

– 30% food wasted farm to plate

– GE 1% initiative (http://goo.gl/eYC0QE )• 1% saving in trains can save 2B/ year

• 1% in US healthcare is 20B/ year

• In contrast, Sri Lanka total exports 9B/ year.

• Save lives – Weather, Disease identification, Personalized

treatment

• Technology advancement– Most high tech research are done via simulations

Page 8: Introduction to Big Data: making sense of the World around Us

Big Data Architecture

Page 9: Introduction to Big Data: making sense of the World around Us

Why Big Data is hard?• How store? Assuming 1TB bytes it

takes 1000 computers to store a 1PB

• How to move? Assuming 10Gb network, it takes 2 hours to copy 1TB, or 83 days to copy a 1PB

• How to search? Assuming each record is 1KB and one machine can process 1000 records per sec, it needs 277CPU days to process a 1TB and 785 CPU years to process a 1 PB

• Big data needs distributed systems http://www.susanica.com/photo/9

Page 10: Introduction to Big Data: making sense of the World around Us

Tools for Processing Data

Page 11: Introduction to Big Data: making sense of the World around Us

Big data Processing Technologies Landscape

Page 12: Introduction to Big Data: making sense of the World around Us

MapReduce/ Hadoop• First introduced by Google, and

used as the programming model for their systems

• Implemented by opensource projects like Apache Hadoop and Spark

• Users writes two functions: map and reduce

• The framework handles the details like distributed processing, fault tolerance, load balancing etc.

• Widely used, and the one of the catalyst of Big data

void map(ctx, k, line){(player, speed) =

split(line, ‘,’);

ctx.emit(player,speed)}

void reduce(ctx, player, speeds[]){

ctx.emit(k,avg(speeds));}

Page 13: Introduction to Big Data: making sense of the World around Us

MapReduce (Contd.)

Page 14: Introduction to Big Data: making sense of the World around Us

Apache Spark• New programming

model built on functional programming concepts

• Can be much faster for recursive usecases

• Performance: Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. The previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. (e.g. 30X speedup)

Page 15: Introduction to Big Data: making sense of the World around Us

Calculating Avg Speed with Spark

pairs = file.map(fnSplit2Pair);tot = pairs.reduceByKey(a,b => a + b);count = pairs.reduceByKey(a, b => 2);avgSpeed = tot / count;

• Map data to a virtual variable, which does not load the data

• Then apply lambda functions

file = spark.textFile("hdfs://… speed.data”)

Page 16: Introduction to Big Data: making sense of the World around Us

What if you can freeze time!!• Most solutions are overnight • Think how you would buy

something! Research a bit and buy, often overnight is too late.

• But not all trends takes time – People change their mind – Trends move fast – React to what customer is doing (do

not let him move away)• At CEP speed 400k/sec, if each

event takes a second, it takes 4 days to pass a second in real world!!

Page 17: Introduction to Big Data: making sense of the World around Us

Real-time Analytics• Idea is to process data as they are

received in streaming fashion • Used when we need

– Very fast output – Lots of events (few 100k to millions)– Processing without storing (e.g. too

much data)• Two main technologies

– Stream Processing (e.g. Strom, http://storm-project.net/ )

– Complex Event Processing (CEP)http://wso2.com/products/complex-event-processor/

Page 18: Introduction to Big Data: making sense of the World around Us

Complex Event Processing (CEP)• Sees inputs as Event streams and queried with

SQL like language • Supports Filters, Windows, Join, Patterns and

Sequences

define partition “playerPartition” as PlayerDataStream.pid; from PlayerDataStream#win.time(1m)

select pid, avg(speed) as avgSpeedinsert into AvgSpeedStream using partition playerPartition;

Page 19: Introduction to Big Data: making sense of the World around Us

DEBS Grand Challenge• Event Processing

challenge • Real football game,

sensors in player shoes + ball

• Events in 15k Hz • Event format – Sensor ID, TS, x, y, z, v, a

• Queries– Running Stats– Ball Possession– Heat Map of Activity – Shots at Goal

Page 20: Introduction to Big Data: making sense of the World around Us

Example: Detect Ball Possession • Possession is time a player

hit the ball until someone else hits it or it goes out of the ground

• See demo, http://goo.gl/VW6xQN

from Ball#window.length(1) as b joinPlayers#window.length(1) as p

unidirectional on debs: getDistance(b.x,b.y,b.z, p.x, p.y, p.z) < 1000 and b.a > 55select ...insert into hitStream

from old = hitStream ,b = hitStream [old. pid !=

pid ],n= hitStream[b.pid == pid]*,

( e1 = hitStream[b.pid != pid ]or e2=

ballLeavingHitStream)select ...insert into BallPossessionStream

http://www.flickr.com/photos/glennharper/146164820/

Page 21: Introduction to Big Data: making sense of the World around Us

Lambda Architecture

Page 22: Introduction to Big Data: making sense of the World around Us

Machine Learning Tools• R – programming language for statistical computing

(most widely used)• Weka – java machine learning library (single node)• Scikit-learn – very easy to use python library • Scalable – Mahout : MapReduce implementation of Machine

learning algorithms – MLBase (based on Spark)– Others: GraphLab, VW, 0xData

• PMML (Predictive model markup language)– Let you port models between languages

Page 23: Introduction to Big Data: making sense of the World around Us

Solving the Problem

Page 24: Introduction to Big Data: making sense of the World around Us

Curious Case of Missing Data

http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/

• WW II, Returned Aircrafts and data on where they were hit?

• How would you add Armour?

Page 25: Introduction to Big Data: making sense of the World around Us

Challenges due to Nature of Bigdata

• Lack of Control Experiment – Often countered with A/B testing in the field – Hard to prove causality

• Does it coming from a representative sample?• Privacy – Security – Randomized techniques (see http://goo.gl/sLfKIb )

Page 26: Introduction to Big Data: making sense of the World around Us

Big data lifecycle

• Get the data, clean up

Page 27: Introduction to Big Data: making sense of the World around Us

Making Sense of Data• Hindsight (to know what

happened)• Basic analytics + visualizations (min,

max, average, histogram, distributions … )

• Oversight (to know what is happening and fixing it)– Realtime analytics

• Insight – Pattern mining, Clustering,

• Foresight – Neural networks, Classification,

Recommendation

Page 28: Introduction to Big Data: making sense of the World around Us

Hindsight (What happened?)• Analytics Implemented with

MapReduce or Queries – Min, Max, average, correlation,

histograms – Might join or group data in many

ways – Heatmaps, temporal trends

• Key Performance indicators (KPIs)– Avg time for a ticket for customer

service – Profit per square feet for retail

• Data is often presented with some visualizations http://www.flickr.com/photos/isriya/

2967310333/

Page 29: Introduction to Big Data: making sense of the World around Us

Drill Down• Idea is to let users drill down

and explore a view of the data– E.g. find customers, region,

time of year etc., that responsible for most revenue

• With OLAP, Users go and define 3 dimensions (or more), and tool let users explore the cube and only look at subset of data. – E.g. tool: Mondrian, Apache

Drill

http://pixabay.com, by Steven Iodice

Page 30: Introduction to Big Data: making sense of the World around Us

Usecase: Planning

• Urban Planning – People distribution – Mobility – Waste Management– E.g. see http://goo.gl/

jPujmM• Market Research – Buying Patterns – Sentiments

Page 31: Introduction to Big Data: making sense of the World around Us

Oversight (What happening?)

• Realtime analytics• Realtime visualizations • Alarms (find problems) and

action recommendations– Classification – Anomaly detection

• Drill down and look at historical data as before.

Page 32: Introduction to Big Data: making sense of the World around Us

Oversight : Usecases• Preprocessing: Correlations, filtering, transformations • Tracking - follow some related entity’s state (such as in

space, time or process status). – e.g. location of airline baggage, vehicle, tracking wild life

• Respond to emergencies – E.g. plan maintenance before aircraft lands

• Detect trends - event sequences, missing events, thresholds, Outliers, Complex trends triple bottom etc., – (e.g. algorithmic trading, SLA, system management)

• Building Profiles – extract info, relationships (e.g. targeted marketing)– Marketing, Recommendations

Page 33: Introduction to Big Data: making sense of the World around Us

Insight (Understanding Why ?)• Pattern Mining – find frequent

associations (e.g. Market Basket), frequent sequences

• Clustering• Graph Analysis• Knowledge Discovery• Correlations between features and Finding principal

components • Simulations, Complex System modeling, matching a

statistical distribution

Page 34: Introduction to Big Data: making sense of the World around Us

Usecase 1: Clustering• Clustering => group

similar items together. (e.g. KMeans)

• Applications – Similar documents,

Genes, Medical imaging, similar people, customers

– Crime Analysis – Compare chemical

compounds – Social network analysis

Page 35: Introduction to Big Data: making sense of the World around Us

Usecase 2: Graph Analytics• Types of Graphs: Social,

communication, Biological networks, Maps, Web, Sematic Web/Ontologies

• Problems– Counting triangles (influence)– Find hubs and authorities (key people, pages)– Finding shortest paths and minimum spanning tree

(Routing Internet traffic and UPS trucks) – Modularity - strength of community / Centrality– Graph, clique, sub graph detection

Page 36: Introduction to Big Data: making sense of the World around Us

Usecase 3: Modeling Solar System

• PDEs not solvable• Simulations • Other examples: Weather

Page 37: Introduction to Big Data: making sense of the World around Us

Foresight (Predict)• Build a Model – Weather, Economic models

• Predict the future values– Electricity load, traffic, demand,

sales• Classification– Spam detection, Group users,

Sentiment analysis • Find anomalies– Fraud, Predictive maintenance

• Recommendations – Targeted advertising, product

recommendations

Page 38: Introduction to Big Data: making sense of the World around Us

Prediction Technologies• Trying to build a model for the

data • Predict Next values in a

sequence – Regression, Neural networks,

Markov and Hidden Makov Models

• Classification– Decision Trees, SVMs, Graphical

Models

• Finding Anomalies– Markov Chains, outliers in a

distribution

• Recommendations

http://misterbijou.blogspot.com/2010_09_01_archive.html

Page 39: Introduction to Big Data: making sense of the World around Us

Usecase 1: Electricity Demand Forecast

• Find trends and cycles (e.g. ARIMA)

• Use regression to build a model using earlier data

• Predict based on the model

Page 40: Introduction to Big Data: making sense of the World around Us

Usecase 2: Predictive Maintenance• Idea is to fix the problem

before it broke, avoiding expensive downtimes– Airplanes, turbines, windmills – Construction Equipment– Car, Golf carts

• How– Anomaly detection (deviate

from normal operation)– Match against known error

patterns

Page 41: Introduction to Big Data: making sense of the World around Us

Usecase 3: Targeted Marketing

Page 42: Introduction to Big Data: making sense of the World around Us

Outline

Page 43: Introduction to Big Data: making sense of the World around Us

Big Data Projects are• Access to data is the main assert

– Data owners set the terms • Involve many Organizations

– Data owners rarely have expertise to make use of data• Multi-Domain

– Retain and teach cross domain people • Complicated and built on lot of opensource tools

– Do not reinvent the wheel and let go NIH• Mathematical

– Math, advance algorithms, Statistical methods, machine learning. Brush up your math!!

• Distributed – Learn bit of distributed systems

Page 44: Introduction to Big Data: making sense of the World around Us

Questions?