Introduction to Big Data: Making Sense of the World Around Us


DESCRIPTION

The world we live in generates massive amounts of data. This data comes from many sources: people's activity on the internet, mobile usage, IoT devices and sensors, and real-world systems like power grids and factories. Big data is the collection of technologies used to make sense of that data. A primary outcome of big data is deriving useful, actionable insights from large or otherwise challenging data collections. The goal is to run the transformation from data, to information, to knowledge, and finally to insights. This ranges from calculating simple analytics like mean, max, and median, to deriving an overall understanding of the data by building models, and finally to making predictions from the data. In some cases we can afford to wait while data is collected and processed, while in other cases we need the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. Other technologies like Apache Spark and Apache Drill are gaining ground, and so are real-time processing technologies like stream processing and complex event processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these technologies, comparing and contrasting them.

TRANSCRIPT

Srinath Perera

Big Data Analysis: Deciphering the Haystack

Image credit, CC licence, http://ansem315.deviantart.com/art/Asimov-Foundation-395188263

• Predict crime before it happens? Which is hard!
• Asimov's "Foundation" talks about mathematical models that predict the future.
• We are entering an era where these are no longer just science fiction.

e.g. Targeted Marketing
• Assume mass emails to 1M people, a reaction rate of 1%, and a $2 cost per email.
  – That costs $2M and reaches 10k people.
• Suppose instead that by looking at demographics (e.g. where they live, say with decision trees), you can find 250K people with a reaction rate of 6%.
  – That costs $500K and reaches 15k people. A quick check of these numbers is sketched below.
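A minimal Python sketch checking those campaign numbers (the figures come straight from the slide):

def campaign(audience, reaction_rate, cost_per_email=2.0):
    # returns (total cost in $, people reached)
    return audience * cost_per_email, int(audience * reaction_rate)

print(campaign(1_000_000, 0.01))  # mass mailing: (2000000.0, 10000)
print(campaign(250_000, 0.06))    # targeted:     (500000.0, 15000)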

A Day in Your Life
• Think about a day in your life:
  – What is the best road to take?
  – Will there be any bad weather?
  – How should I invest my money?
  – How is my health?
• There are many decisions you could make better, if only you could access the data and process it.

http://www.flickr.com/photos/kcolwell/5512461652/ CC licence

Internet of Things
• Currently the physical world and the software world are detached.
• The Internet of Things promises to bridge this.
  – It is about sensors and actuators everywhere: in your fridge, in your blanket, in your chair, in your carpet… yes, even in your socks.
  – Umbrellas that light up when rain is expected, and medicine cups.

What can we do with Big Data?
• Optimize (the world is inefficient)
  – 30% of food is wasted from farm to plate
  – GE's 1% initiative (http://goo.gl/eYC0QE)
    • A 1% saving in trains can save $2B per year
    • 1% of US healthcare is $20B per year
    • In contrast, Sri Lanka's total exports are $9B per year.
• Save lives
  – Weather, disease identification, personalized treatment
• Advance technology
  – Most high-tech research is done via simulations

Big Data Architecture

Why is Big Data hard?
• How to store it? Assuming 1TB per node, it takes 1000 computers to store 1PB.
• How to move it? Assuming a 10Gb network, it takes 2 hours to copy 1TB, or 83 days to copy 1PB.
• How to search it? Assuming each record is 1KB and one machine can process 1000 records per second, it takes 277 CPU days to process 1TB and 785 CPU years to process 1PB.
• Big data needs distributed systems. (Back-of-envelope math sketched below.) http://www.susanica.com/photo/9
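A rough back-of-envelope calculator for figures like these, in Python (idealized rates; real end-to-end systems deliver far less than raw link and CPU speed, which is why the slide's estimates are more conservative):

TB, PB = 10**12, 10**15

# Storage: nodes needed at 1 TB per node
print(PB // TB, "nodes to store 1 PB")                       # 1000

# Network: time to push 1 TB over an ideal 10 Gb/s link
print(TB * 8 / 10e9 / 60, "minutes to copy 1 TB (ideal)")    # ~13 min

# Scan: 1 KB records, 1000 records/sec per machine
print((TB // 1000) / 1000 / 86400, "CPU days to scan 1 TB")  # ~11.6 days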

Tools for Processing Data

Big Data Processing Technologies Landscape

MapReduce / Hadoop
• First introduced by Google, and used as the programming model for their systems
• Implemented by open-source projects like Apache Hadoop and Spark
• Users write two functions: map and reduce
• The framework handles details like distributed processing, fault tolerance, load balancing, etc.
• Widely used, and one of the catalysts of big data

void map(ctx, k, line) {
    (player, speed) = split(line, ",");
    ctx.emit(player, speed);
}

void reduce(ctx, player, speeds[]) {
    ctx.emit(player, avg(speeds));
}

MapReduce (Contd.)

Apache Spark
• A newer programming model built on functional programming concepts
• Can be much faster for iterative use cases
• Performance: Spark on 206 EC2 machines sorted 100 TB of on-disk data in 23 minutes; the previous world record, set by Hadoop MapReduce, used 2100 machines and took 72 minutes (roughly a 30x improvement in per-machine throughput).

Calculating Avg Speed with Spark

• Map the data to a virtual variable, which does not load the data
• Then apply lambda functions

file = spark.textFile("hdfs://… speed.data")
pairs = file.map(fnSplit2Pair)                                // (player, speed) pairs
tot = pairs.reduceByKey((a, b) => a + b)                      // total speed per player
count = pairs.mapValues(v => 1).reduceByKey((a, b) => a + b)  // samples per player
avgSpeed = tot.join(count).mapValues(t => t._1 / t._2)        // per-player average

What if you could freeze time!
• Most solutions run overnight.
• Think about how you buy something: you research a bit and then buy; often overnight is too late.
• But not all trends take time:
  – People change their minds
  – Trends move fast
  – React to what the customer is doing (do not let them move away)
• At a CEP speed of 400k events/sec, if each event took a second to handle, it would take more than 4 days to get through one second of the real world!

Real-time Analytics
• The idea is to process data as it is received, in a streaming fashion.
• Used when we need:
  – Very fast output
  – Lots of events (a few 100k to millions per second)
  – Processing without storing (e.g. too much data)
• Two main technologies:
  – Stream Processing (e.g. Storm, http://storm-project.net/)
  – Complex Event Processing (CEP), http://wso2.com/products/complex-event-processor/

Complex Event Processing (CEP)
• Sees inputs as event streams, queried with an SQL-like language
• Supports filters, windows, joins, patterns, and sequences

define partition "playerPartition" as PlayerDataStream.pid;

from PlayerDataStream#window.time(1 min)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;

DEBS Grand Challenge
• An event processing challenge
• A real football game, with sensors in the players' shoes and in the ball
• Events arrive at 15kHz
• Event format: Sensor ID, TS, x, y, z, v, a
• Queries:
  – Running stats
  – Ball possession
  – Heat map of activity
  – Shots at goal

Example: Detect Ball Possession
• Possession is the time from when a player hits the ball until someone else hits it or it goes out of the ground.
• See the demo at http://goo.gl/VW6xQN

from Ball#window.length(1) as b
  join Players#window.length(1) as p unidirectional
  on debs:getDistance(b.x, b.y, b.z, p.x, p.y, p.z) < 1000 and b.a > 55
select ...
insert into hitStream;

from old = hitStream,
     b = hitStream[old.pid != pid],
     n = hitStream[b.pid == pid]*,
     (e1 = hitStream[b.pid != pid] or e2 = ballLeavingHitStream)
select ...
insert into BallPossessionStream;

http://www.flickr.com/photos/glennharper/146164820/

Lambda Architecture

Machine Learning Tools
• R – a programming language for statistical computing (the most widely used)
• Weka – a Java machine learning library (single node)
• Scikit-learn – a very easy to use Python library
• Scalable tools
  – Mahout: MapReduce implementations of machine learning algorithms
  – MLbase (based on Spark)
  – Others: GraphLab, VW, 0xData
• PMML (Predictive Model Markup Language)
  – Lets you port models between tools
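As a taste of how compact these libraries can be, a minimal scikit-learn sketch (the data and the choice of a decision tree are illustrative assumptions):

from sklearn.tree import DecisionTreeClassifier

# toy training data: [age, income] -> responded to a campaign (1) or not (0)
X = [[25, 30000], [40, 80000], [35, 60000], [50, 120000]]
y = [0, 1, 0, 1]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[45, 90000]]))  # classify a new person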

Solving the Problem

Curious Case of Missing Data

http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/

• WWII: returned aircraft, and data on where they were hit.
• How would you add armour?

Challenges due to the Nature of Big Data
• Lack of controlled experiments
  – Often countered with A/B testing in the field
  – Hard to prove causality
• Does the data come from a representative sample?
• Privacy and security
  – Randomized techniques (see http://goo.gl/sLfKIb)

Big Data Lifecycle

• Get the data, clean it up

Making Sense of Data
• Hindsight (to know what happened)
  – Basic analytics + visualizations (min, max, average, histograms, distributions…)
• Oversight (to know what is happening, and to fix it)
  – Real-time analytics
• Insight (to understand why)
  – Pattern mining, clustering
• Foresight (to predict)
  – Neural networks, classification, recommendation

Hindsight (What happened?)
• Analytics implemented with MapReduce or queries
  – Min, max, average, correlation, histograms (sketched below)
  – Might join or group data in many ways
  – Heat maps, temporal trends
• Key Performance Indicators (KPIs)
  – Average time per ticket for customer service
  – Profit per square foot for retail
• Data is often presented with visualizations
http://www.flickr.com/photos/isriya/2967310333/
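A minimal sketch of such basic analytics in Python with NumPy (the ticket-time data is made up):

import numpy as np

# hypothetical customer-service ticket resolution times, in hours
tickets = np.array([2.5, 4.0, 1.2, 8.8, 3.3, 2.1, 6.7])

print("min/max:", tickets.min(), tickets.max())
print("average:", tickets.mean())
print("median:", np.median(tickets))
counts, edges = np.histogram(tickets, bins=4)  # a simple distribution
print("histogram:", counts, edges)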

Drill Down
• The idea is to let users drill down and explore views of the data
  – E.g. find the customers, regions, or times of year responsible for most revenue
• With OLAP, users define 3 (or more) dimensions, and the tool lets them explore the cube and look at subsets of the data (illustrated below).
  – Example tools: Mondrian, Apache Drill

http://pixabay.com, by Steven Iodice
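A rough illustration of that cube-style aggregation with pandas (the frame and the column names are made up; real OLAP tools add interactive drill-down on top of aggregations like this):

import pandas as pd

sales = pd.DataFrame({
    "region":  ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 250, 80, 300],
})

# aggregate over two dimensions, then "drill down" into one slice
cube = sales.pivot_table(values="revenue", index="region",
                         columns="quarter", aggfunc="sum")
print(cube)
print(sales[sales.region == "east"])  # drill into one region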

Usecase: Planning

• Urban planning
  – People distribution
  – Mobility
  – Waste management
  – E.g. see http://goo.gl/jPujmM
• Market research
  – Buying patterns
  – Sentiments

Oversight (What is happening?)
• Real-time analytics
• Real-time visualizations
• Alarms (find problems) and action recommendations
  – Classification
  – Anomaly detection
• Drill down and look at historical data, as before.

Oversight: Usecases
• Preprocessing: correlations, filtering, transformations
• Tracking – follow a related entity's state (such as in space, time, or process status)
  – E.g. the location of airline baggage or vehicles, tracking wildlife
• Responding to emergencies
  – E.g. plan maintenance before the aircraft lands
• Detecting trends – event sequences, missing events, thresholds, outliers, complex trends like triple bottoms, etc.
  – E.g. algorithmic trading, SLAs, system management
• Building profiles – extract info and relationships (e.g. targeted marketing)
  – Marketing, recommendations

Insight (Understanding Why?)
• Pattern mining – find frequent associations (e.g. market baskets), frequent sequences
• Clustering
• Graph analysis
• Knowledge discovery
• Correlations between features, and finding principal components
• Simulations, complex-system modeling, matching a statistical distribution

Usecase 1: Clustering
• Clustering groups similar items together, e.g. with KMeans (a sketch follows below).
• Applications
  – Similar documents, genes, medical images; similar people and customers
  – Crime analysis
  – Comparing chemical compounds
  – Social network analysis
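A minimal KMeans sketch with scikit-learn (the 2-D points are made up; imagine customer (age, spend) pairs):

from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10).fit(points)
print(km.labels_)           # which cluster each point fell into
print(km.cluster_centers_)  # the two cluster centers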

Usecase 2: Graph Analytics
• Types of graphs: social, communication, and biological networks; maps; the web; the semantic web / ontologies
• Problems (a small sketch follows below)
  – Counting triangles (influence)
  – Finding hubs and authorities (key people, pages)
  – Finding shortest paths and minimum spanning trees (routing internet traffic and UPS trucks)
  – Modularity – the strength of a community / centrality
  – Graph, clique, and subgraph detection
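A few of these computed with NetworkX (an assumed library choice; the graph is a toy example):

import networkx as nx

g = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

print(nx.triangles(g))                          # triangles through each node
print(nx.shortest_path(g, "a", "d"))            # ['a', 'c', 'd']
print(list(nx.minimum_spanning_tree(g).edges))  # a spanning tree
print(nx.degree_centrality(g))                  # a simple centrality measure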

Usecase 3: Modeling the Solar System

• The governing PDEs are not solvable analytically
• So we run simulations (a crude sketch below)
• Other examples: weather
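A deliberately crude flavor of simulation in Python: one body orbiting a fixed sun, stepped with Euler integration (real solar-system models use far more careful integrators):

# position and velocity for a roughly circular orbit, arbitrary units
GM = 1.0                 # gravitational parameter of the "sun"
x, y = 1.0, 0.0
vx, vy = 0.0, 1.0
dt = 0.001

for _ in range(1000):
    r3 = (x * x + y * y) ** 1.5
    ax, ay = -GM * x / r3, -GM * y / r3  # Newtonian gravity toward origin
    vx, vy = vx + ax * dt, vy + ay * dt
    x, y = x + vx * dt, y + vy * dt

print(x, y)  # position after 1000 steps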

Foresight (Predict)
• Build a model
  – Weather, economic models
• Predict future values
  – Electricity load, traffic, demand, sales
• Classification
  – Spam detection, grouping users, sentiment analysis
• Find anomalies
  – Fraud, predictive maintenance
• Recommendations
  – Targeted advertising, product recommendations

Prediction Technologies
• Trying to build a model for the data
• Predicting the next values in a sequence
  – Regression, neural networks, Markov and hidden Markov models
• Classification
  – Decision trees, SVMs, graphical models
• Finding anomalies
  – Markov chains, outliers in a distribution
• Recommendations

http://misterbijou.blogspot.com/2010_09_01_archive.html

Usecase 1: Electricity Demand Forecast

• Find trends and cycles (e.g. with ARIMA)

• Use regression to build a model from earlier data

• Predict based on the model (a sketch follows below)
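A minimal sketch of that flow with statsmodels (an assumed library choice; the hourly load series is synthetic):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# synthetic hourly load: a daily cycle plus noise
t = np.arange(24 * 14)
load = 100 + 20 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 2, t.size)

model = ARIMA(load, order=(2, 0, 1)).fit()  # fit a model on the history
print(model.forecast(steps=24))             # predict the next 24 hours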

Usecase 2: Predictive Maintenance
• The idea is to fix the problem before it breaks, avoiding expensive downtime
  – Airplanes, turbines, windmills
  – Construction equipment
  – Cars, golf carts
• How (an anomaly-detection sketch follows below)
  – Anomaly detection (deviation from normal operation)
  – Matching against known error patterns
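One simple form of the anomaly-detection step, flagging readings that deviate from normal operation (a toy z-score sketch; real systems model equipment behavior far more carefully):

import numpy as np

# hypothetical vibration readings from a turbine sensor
readings = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 3.2, 1.1])

z = np.abs(readings - readings.mean()) / readings.std()
print(np.where(z > 2)[0])  # indices of anomalous readings (the 3.2 spike)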

Usecase 3: Targeted Marketing

Outline

Big Data Projects are…
• About data: access to data is the main asset
  – Data owners set the terms
• Involve many organizations
  – Data owners rarely have the expertise to make use of the data
• Multi-domain
  – Retain and teach cross-domain people
• Complicated, and built on lots of open-source tools
  – Do not reinvent the wheel; let go of NIH (not-invented-here)
• Mathematical
  – Math, advanced algorithms, statistical methods, machine learning. Brush up on your math!
• Distributed
  – Learn a bit about distributed systems

Questions?
