Introduction to Big Data: Making Sense of the World Around Us
Post on 21-Apr-2017
TRANSCRIPT
Srinath Perera
Big Data Analysis: Deciphering the Haystack
Image credit, CC licence, http://ansem315.deviantart.com/art/Asimov-Foundation-395188263
• Predict crime before it happens? Which is hard!
• Asimov’s “Foundation” talks about mathematical models to predict the future.
• We are entering an era where the above is no longer just science fiction.
e.g. Targeted Marketing
• Assume mass emails to 1M people, with a reaction rate of 1% and a cost of $2 per email.
  – Then the cost is $2M, for a reach of 10k people.
• Suppose instead that by looking at demographics (e.g. where they live, say by using decision trees) you can find 250K people with a reaction rate of 6%.
  – Then the cost is $500K, for a reach of 15k people.
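The arithmetic above can be checked with a short sketch; the `campaign` function and the sample figures are simply the slide's assumptions:

```python
def campaign(population, reaction_rate, cost_per_email):
    # total cost of emailing everyone, and the expected number of reactions
    return population * cost_per_email, round(population * reaction_rate)

print(campaign(1_000_000, 0.01, 2))  # (2000000, 10000): $2M for 10k people
print(campaign(250_000, 0.06, 2))    # (500000, 15000): $500K for 15k people
```

The targeted campaign costs 4x less while reaching 1.5x more people, which is the whole pitch of targeted marketing.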
A Day in Your Life
Think about a day in your life:
– What is the best road to take?
– Would there be any bad weather?
– How should I invest my money?
– How is my health?
There are many decisions you could make better, if only you could access the data and process it.
http://www.flickr.com/photos/kcolwell/5512461652/ CC licence
Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this gap
  – It is about sensors and actuators everywhere
  – In your fridge, in your blanket, in your chair, in your carpet… yes, even in your socks
  – Umbrellas that light up when rain is due, and smart medicine cups
What can we do with Big Data?
• Optimize (the world is inefficient)
  – 30% of food is wasted from farm to plate
  – GE 1% initiative (http://goo.gl/eYC0QE)
    • A 1% saving in trains can save $2B/year
    • 1% of US healthcare is $20B/year
    • In contrast, Sri Lanka’s total exports are $9B/year
• Save lives
  – Weather, disease identification, personalized treatment
• Advance technology
  – Most high-tech research is done via simulations
Big Data Architecture
Why is Big Data hard?
• How to store? Assuming 1TB per machine, it takes 1000 computers to store 1PB.
• How to move? Assuming a 10Gb network with roughly 1Gb/s of effective throughput, it takes about 2 hours to copy 1TB, and about 3 months to copy 1PB.
• How to search? Assuming each record is 1KB and one machine can process 1000 records per second, it takes about 12 CPU days to scan 1TB and about 32 CPU years to scan 1PB.
• Big data needs distributed systems. http://www.susanica.com/photo/9
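These back-of-envelope figures can be reproduced in a few lines; the ~1Gb/s effective throughput is an assumption read off the slide's transfer numbers, not a measured value:

```python
TB, PB = 10**12, 10**15                           # bytes
SECS_PER_DAY = 86400

# Storage: machines needed at 1TB per machine
print(PB // TB)                                   # 1000

# Transfer: ~1Gb/s effective throughput on a 10Gb network
eff_bps = 1e9
print(TB * 8 / eff_bps / 3600)                    # ~2.2 hours per TB
print(PB * 8 / eff_bps / SECS_PER_DAY)            # ~93 days per PB

# Search: 1KB records, one machine scanning 1000 records/sec
print(TB / 1000 / 1000 / SECS_PER_DAY)            # ~11.6 CPU days per TB
print(PB / 1000 / 1000 / SECS_PER_DAY / 365)      # ~31.7 CPU years per PB
```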
Tools for Processing Data
Big Data Processing Technologies Landscape
MapReduce / Hadoop
• First introduced by Google, and used as the programming model for their systems
• Implemented by open-source projects like Apache Hadoop and Spark
• Users write two functions: map and reduce
• The framework handles details like distributed processing, fault tolerance, load balancing, etc.
• Widely used, and one of the catalysts of Big Data
void map(ctx, k, line) {
    (player, speed) = split(line, ',');
    ctx.emit(player, speed);
}

void reduce(ctx, player, speeds[]) {
    ctx.emit(player, avg(speeds));
}
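Since the snippet above is pseudocode, here is a single-machine Python simulation of the same computation; `map_fn`, `reduce_fn` and `map_reduce` are illustrative names, not Hadoop's API. It shows the three phases the framework runs for you: map, shuffle (group by key), and reduce.

```python
from collections import defaultdict

def map_fn(line):
    # like map(): split a "player,speed" line into a (key, value) pair
    player, speed = line.split(",")
    return player, float(speed)

def reduce_fn(player, speeds):
    # like reduce(): average all speeds observed for one player
    return player, sum(speeds) / len(speeds)

def map_reduce(lines):
    groups = defaultdict(list)           # shuffle phase: group values by key
    for line in lines:
        key, value = map_fn(line)
        groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(map_reduce(["alice,10", "bob,20", "alice,30"]))
# {'alice': 20.0, 'bob': 20.0}
```

The real framework performs the same three phases, only with the map and reduce calls spread over many machines.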
MapReduce (Contd.)
Apache Spark
• A new programming model built on functional programming concepts
• Can be much faster for iterative use cases
• Performance: Spark on 206 EC2 machines sorted 100 TB of on-disk data in 23 minutes, while the previous world record, set by Hadoop MapReduce, used 2100 machines and took 72 minutes (roughly a 30X per-node speedup)
Calculating Avg Speed with Spark
• Map the data to a virtual variable (an RDD), which does not load the data yet
• Then apply lambda functions to it

file = spark.textFile("hdfs://… speed.data")
pairs = file.map(fnSplit2Pair)
tot = pairs.reduceByKey((a, b) => a + b)
count = pairs.mapValues(v => 1).reduceByKey((a, b) => a + b)
avgSpeed = tot.join(count).mapValues(t => t._1 / t._2)
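To make the reduceByKey idea concrete, here is a pure-Python stand-in (`reduce_by_key` is an illustrative helper, not Spark's actual API) computing the same average, with the sums and counts reduced separately and then joined:

```python
def reduce_by_key(pairs, fn):
    # minimal in-memory analogue of Spark's reduceByKey
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return out

lines = ["alice,10", "bob,20", "alice,30"]
pairs = [(p, float(s)) for p, s in (l.split(",") for l in lines)]
tot = reduce_by_key(pairs, lambda a, b: a + b)
count = reduce_by_key([(k, 1) for k, _ in pairs], lambda a, b: a + b)
avg_speed = {k: tot[k] / count[k] for k in tot}
print(avg_speed)  # {'alice': 20.0, 'bob': 20.0}
```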
What if you could freeze time!
• Most solutions run overnight
• Think about how you buy something: research a bit, then buy. Often overnight is too late.
• But not all trends take time
  – People change their minds
  – Trends move fast
  – React to what the customer is doing (do not let them move away)
• At a CEP speed of 400k events/sec, if you spent one second per event, it would take over 4 days to work through one second of the real world!
Real-time Analytics
• The idea is to process data as it is received, in a streaming fashion
• Used when we need
  – Very fast output
  – Lots of events (a few 100k to millions per second)
  – Processing without storing (e.g. too much data)
• Two main technologies
  – Stream Processing (e.g. Storm, http://storm-project.net/)
  – Complex Event Processing (CEP), http://wso2.com/products/complex-event-processor/
Complex Event Processing (CEP)
• Sees inputs as event streams, which are queried with an SQL-like language
• Supports filters, windows, joins, patterns and sequences

define partition “playerPartition” as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
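Conceptually, the 1-minute window in the query above keeps only the recent events per player and averages over them. A small Python sketch of that behaviour (`WindowedAvg` is an illustrative class, not the CEP engine's API):

```python
from collections import defaultdict, deque

class WindowedAvg:
    """Per-player average speed over a sliding time window."""
    def __init__(self, window_sec=60):
        self.window = window_sec
        self.events = defaultdict(deque)      # pid -> deque of (ts, speed)

    def on_event(self, ts, pid, speed):
        q = self.events[pid]
        q.append((ts, speed))
        while q and q[0][0] < ts - self.window:
            q.popleft()                       # expire events outside the window
        return sum(s for _, s in q) / len(q)

w = WindowedAvg(window_sec=60)
print(w.on_event(0, "p1", 10.0))    # 10.0
print(w.on_event(30, "p1", 20.0))   # 15.0 (both events inside the window)
print(w.on_event(90, "p1", 40.0))   # 30.0 (the event at t=0 has expired)
```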
DEBS Grand Challenge
• Event processing challenge
• A real football game, with sensors in the players’ shoes and in the ball
• Events arrive at 15kHz
• Event format: Sensor ID, TS, x, y, z, v, a
• Queries
  – Running stats
  – Ball possession
  – Heat map of activity
  – Shots at goal
Example: Detect Ball Possession
• Possession is the time from when a player hits the ball until someone else hits it or it leaves the ground
• See demo, http://goo.gl/VW6xQN

from Ball#window.length(1) as b
join Players#window.length(1) as p unidirectional
on debs:getDistance(b.x, b.y, b.z, p.x, p.y, p.z) < 1000 and b.a > 55
select ...
insert into hitStream

from old = hitStream,
     b = hitStream[old.pid != pid],
     n = hitStream[b.pid == pid]*,
     (e1 = hitStream[b.pid != pid] or e2 = ballLeavingHitStream)
select ...
insert into BallPossessionStream
http://www.flickr.com/photos/glennharper/146164820/
Lambda Architecture
Machine Learning Tools
• R – programming language for statistical computing (most widely used)
• Weka – Java machine learning library (single node)
• Scikit-learn – very easy to use Python library
• Scalable
  – Mahout: MapReduce implementations of machine learning algorithms
  – MLbase (based on Spark)
  – Others: GraphLab, VW, 0xData
• PMML (Predictive Model Markup Language)
  – Lets you port models between tools
Solving the Problem
Curious Case of Missing Data
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
• WW II: aircraft that returned, and data on where they were hit
• How would you add armour?
Challenges due to the Nature of Big Data
• Lack of control experiments
  – Often countered with A/B testing in the field
  – Hard to prove causality
• Does the data come from a representative sample?
• Privacy and security
  – Randomized techniques (see http://goo.gl/sLfKIb)
Big Data Lifecycle
• Get the data, clean it up
Making Sense of Data
• Hindsight (to know what happened)
  – Basic analytics + visualizations (min, max, average, histograms, distributions …)
• Oversight (to know what is happening, and to fix it)
  – Real-time analytics
• Insight (to understand why)
  – Pattern mining, clustering, …
• Foresight (to predict)
  – Neural networks, classification, recommendation
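For the hindsight level, the basic analytics listed above need nothing exotic; a sketch using Python's standard library (the speeds list is made-up sample data):

```python
import statistics
from collections import Counter

speeds = [4.2, 5.1, 4.8, 6.0, 5.1, 4.2, 7.3]   # sample readings

print(min(speeds), max(speeds))                 # 4.2 7.3
print(statistics.mean(speeds))                  # ~5.24
hist = Counter(round(s) for s in speeds)        # crude integer-bucket histogram
print(sorted(hist.items()))                     # [(4, 2), (5, 3), (6, 1), (7, 1)]
```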
Hindsight (What happened?)
• Analytics implemented with MapReduce or queries
  – Min, max, average, correlation, histograms
  – Might join or group data in many ways
  – Heatmaps, temporal trends
• Key Performance Indicators (KPIs)
  – Avg time per ticket for customer service
  – Profit per square foot for retail
• Data is often presented with visualizations
http://www.flickr.com/photos/isriya/2967310333/
Drill Down
• The idea is to let users drill down and explore a view of the data
  – E.g. find the customers, region, time of year, etc. responsible for most revenue
• With OLAP, users define 3 (or more) dimensions, and the tool lets them explore the cube and look at only a subset of the data
  – E.g. tools: Mondrian, Apache Drill
http://pixabay.com, by Steven Iodice
Usecase: Planning
• Urban Planning – People distribution – Mobility – Waste Management– E.g. see http://goo.gl/
jPujmM• Market Research – Buying Patterns – Sentiments
Oversight (What is happening?)
• Real-time analytics
• Real-time visualizations
• Alarms (find problems) and action recommendations
  – Classification
  – Anomaly detection
• Drill down and look at historical data, as before
Oversight: Usecases
• Preprocessing: correlations, filtering, transformations
• Tracking – follow some related entity’s state (such as in space, time or process status)
  – e.g. location of airline baggage or vehicles, tracking wildlife
• Respond to emergencies
  – e.g. plan maintenance before an aircraft lands
• Detect trends – event sequences, missing events, thresholds, outliers, complex trends (e.g. triple bottoms), etc.
  – e.g. algorithmic trading, SLAs, system management
• Building profiles – extract info and relationships (e.g. targeted marketing)
  – Marketing, recommendations
Insight (Understanding Why?)
• Pattern mining – find frequent associations (e.g. market baskets), frequent sequences
• Clustering
• Graph analysis
• Knowledge discovery
• Correlations between features, and finding principal components
• Simulations, complex system modeling, matching a statistical distribution
Usecase 1: Clustering
• Clustering => grouping similar items together (e.g. KMeans)
• Applications
  – Similar documents, genes, medical images, similar people or customers
  – Crime analysis
  – Comparing chemical compounds
  – Social network analysis
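As a concrete (toy) illustration of KMeans, here is a minimal one-dimensional version; `kmeans_1d` and the sample points are invented for the sketch:

```python
def kmeans_1d(points, centers, iters=10):
    """Assign each point to its nearest center, then move each
    center to the mean of its points; repeat a fixed number of times."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

# two obvious groups, around 1 and around 10
c = kmeans_1d([0.9, 1.1, 1.0, 9.8, 10.2, 10.0], [0.0, 5.0])
print([round(x, 6) for x in c])   # [1.0, 10.0]
```

Real implementations iterate until the centers stop moving and work in many dimensions, but the assign-then-recenter loop is the same.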
Usecase 2: Graph Analytics
• Types of graphs: social, communication and biological networks, maps, the Web, Semantic Web/ontologies
• Problems
  – Counting triangles (influence)
  – Finding hubs and authorities (key people, pages)
  – Finding shortest paths and minimum spanning trees (routing Internet traffic and UPS trucks)
  – Modularity – strength of community / centrality
  – Graph, clique and subgraph detection
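For instance, a shortest path in an unweighted graph is a plain breadth-first search; the graph `net` below is made up for illustration:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a shortest hop-count path
    in an unweighted graph, as used in simple routing problems."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None   # goal unreachable

net = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_path(net, "A", "E"))  # ['A', 'B', 'D', 'E']
```

At web or road-network scale the same idea runs on distributed graph engines, but the algorithmic core is unchanged.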
Usecase 3: Modeling the Solar System
• The PDEs are not solvable analytically
• Simulations
• Other examples: weather
Foresight (Predict)
• Build a model
  – Weather, economic models
• Predict future values
  – Electricity load, traffic, demand, sales
• Classification
  – Spam detection, grouping users, sentiment analysis
• Find anomalies
  – Fraud, predictive maintenance
• Recommendations
  – Targeted advertising, product recommendations
Prediction Technologies
• Trying to build a model for the data
• Predict the next values in a sequence
  – Regression, neural networks, Markov and Hidden Markov Models
• Classification
  – Decision trees, SVMs, graphical models
• Finding anomalies
  – Markov chains, outliers in a distribution
• Recommendations
http://misterbijou.blogspot.com/2010_09_01_archive.html
Usecase 1: Electricity Demand Forecast
• Find trends and cycles (e.g. ARIMA)
• Use regression to build a model from earlier data
• Predict based on the model
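The regression step can be sketched with ordinary least squares; `fit_line` and the sample loads are illustrative, not from the slide:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, a minimal stand-in
    for the regression step of the forecasting pipeline."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# hypothetical hourly loads trending upward
hours = [0, 1, 2, 3, 4]
load = [100, 102, 104, 106, 108]
a, b = fit_line(hours, load)
print(a * 5 + b)  # predicted load at hour 5: 110.0
```

A real forecast would first remove trend and seasonal cycles (the ARIMA step) before fitting, but the fit-then-extrapolate shape is the same.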
Usecase 2: Predictive Maintenance
• The idea is to fix the problem before it breaks, avoiding expensive downtime
  – Airplanes, turbines, windmills
  – Construction equipment
  – Cars, golf carts
• How
  – Anomaly detection (deviation from normal operation)
  – Matching against known error patterns
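The anomaly-detection idea can be sketched as a z-score check: flag readings far from the mean in standard-deviation units (`anomalies` and the vibration numbers are invented for the example):

```python
import statistics

def anomalies(readings, threshold=3.0):
    """Flag readings more than `threshold` standard deviations from
    the mean: a simple notion of 'deviating from normal operation'."""
    mu = statistics.mean(readings)
    sd = statistics.pstdev(readings)
    return [r for r in readings if abs(r - mu) > threshold * sd]

# hypothetical vibration readings; 9.0 is the faulty one
vib = [1.0, 1.1, 0.9, 1.05, 0.95, 9.0]
print(anomalies(vib, threshold=2.0))  # [9.0]
```

In practice the mean and deviation would be estimated from a window of known-healthy history rather than from the batch being tested.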
Usecase 3: Targeted Marketing
Outline
Big Data Projects
• Access to data is the main asset
  – Data owners set the terms
• Involve many organizations
  – Data owners rarely have the expertise to make use of the data
• Are multi-domain
  – Retain and teach cross-domain people
• Are complicated, and built on a lot of open-source tools
  – Do not reinvent the wheel; let go of NIH
• Are mathematical
  – Math, advanced algorithms, statistical methods, machine learning. Brush up your math!
• Are distributed
  – Learn a bit about distributed systems
Questions?