The Big Data Dead Valley Dilemma and Much More
![Page 2: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/2.jpg)
Unhidden Agenda
● Big Data Big Picture
● Big Data Dead Valley Dilemma
● Elastic MapReduce (EMR) numbers
● Scaling Learning (MPI & Hadoop)
![Page 3: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/3.jpg)
Big Data =
Lots of data (evident)
+
CPU bound (forgotten)
![Page 4: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/4.jpg)
Big Data =
Lots of data (evident)
-
IO bound (reality)
![Page 5: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/5.jpg)
IO bound
CPU < 100%
Data
● HD/bus speed
● Network
● File server
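A rough way to see why the CPU stays below 100% on an IO-bound job is to compare the time needed to read the data with the time needed to process it. The sketch below is illustrative only: the function name and every number in it (data size, disk throughput, CPU processing rate) are assumptions, not measurements from the talk.

```python
def cpu_utilization(data_gb, io_gb_per_s, cpu_gb_per_s):
    """Estimate CPU utilization when reading and processing overlap perfectly.

    The job lasts as long as its slowest stage; the CPU is busy only for
    the time it needs to process the data.
    """
    io_time = data_gb / io_gb_per_s
    cpu_time = data_gb / cpu_gb_per_s
    wall_time = max(io_time, cpu_time)
    return cpu_time / wall_time


# Hypothetical numbers: 100 GB of data, storage delivering 0.1 GB/s,
# and a CPU able to process 0.5 GB/s.
utilization = cpu_utilization(100, 0.1, 0.5)
print(f"CPU utilization: {utilization:.0%}")
```

With these made-up figures the CPU waits on the disk most of the time; speeding up the CPU would not shorten the job, while faster or more local storage would.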
![Page 6: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/6.jpg)
Big Data Scalability (e.g. Hadoop)
= Cluster
+
Locality + node failure (data moved close to the CPU)
![Page 7: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/7.jpg)
The Big Data Dilemma
![Page 8: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/8.jpg)
Big Data Dead Valley
[Figure: techno maturity / risk (vertical axis) against enterprise size (horizontal axis); labels: SMB, Enterprise, Start-ups]
![Page 9: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/9.jpg)
Big Data =
SMALL MARKET
(B2B vs B2C)
![Page 10: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/10.jpg)
Small market... hmm?
![Page 11: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/11.jpg)
Why?
Maturity
Data, process, QA, infra, talent, $, long-term vision
![Page 12: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/12.jpg)
Data -> Analytics -> BI -> Big Data -> Data Mining -> ML
![Page 13: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/13.jpg)
Data Access & Quality
User data privacy, IT outsourcing protection, data quality
![Page 14: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/14.jpg)
Enterprise Slowness
1. Boston CXO Forum, 24 October: Best Practice on Global Innovation (IBM, EMC, P&G, Intuit)
   Exploit vs explore - M&A
2. Brad Feld (Managing Director at Foundry Group)
   Hierarchy vs network
![Page 15: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/15.jpg)
Big Data Dead Valley
[Figure: techno maturity / risk (vertical axis) against enterprise maturity (horizontal axis); labels: SMB, Enterprise, Start-ups]
![Page 16: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/16.jpg)
QMarketing example
Leveraging Hadoop
● map = hits to sessions
● reduce = sessions to ROI
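The two steps above can be sketched in plain Python. This is a single-process illustration, not the QMarketing code: the record layout, the session key (visitor plus channel), the cost table, and the ROI formula (revenue minus cost, divided by cost) are all assumptions made for the example. In a real Hadoop job the map step would emit key/value pairs and the framework would do the grouping.

```python
from collections import defaultdict

# Hypothetical hit records: (visitor_id, channel, revenue).
HITS = [
    ("v1", "PPC", 0.0),
    ("v1", "PPC", 40.0),
    ("v2", "Email", 0.0),
    ("v2", "Email", 30.0),
    ("v3", "PPC", 0.0),
]

# Hypothetical spend per channel, needed to compute ROI.
COST = {"PPC": 20.0, "Email": 10.0}


def map_hits_to_sessions(hits):
    """Map step: group hits into sessions, keyed here by (visitor, channel)."""
    sessions = defaultdict(list)
    for visitor_id, channel, revenue in hits:
        sessions[(visitor_id, channel)].append(revenue)
    return sessions


def reduce_sessions_to_roi(sessions, cost):
    """Reduce step: sum session revenue per channel, then compute ROI."""
    revenue = defaultdict(float)
    for (_visitor_id, channel), revenues in sessions.items():
        revenue[channel] += sum(revenues)
    return {ch: (revenue[ch] - cost[ch]) / cost[ch] for ch in cost}


roi = reduce_sessions_to_roi(map_hits_to_sessions(HITS), COST)
print(roi)
```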
![Page 18: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/18.jpg)
Online Marketing Management
Channel          % budget   ROI
----------------------------------------------
PPC              50%        ?
Organic          20%        ?
Email Campaign   20%        ?
Social Media     10%        ?
![Page 19: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/19.jpg)
ROI Dashboard
![Page 20: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/20.jpg)
All abstractions leak
Abstract -> Procrastinate!
http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Leverage" )
![Page 21: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/21.jpg)
Minimize a Tower of Abstraction
Simplify & lower the layer of abstraction
Examples:
● Work on files, not a DB, if possible
● HD directly connected to the server
● Low-level Linux command lines (cut, grep, sed, etc.)
● High-level languages: Python
Abstraction = 20X benefits
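As an illustration of working directly on files, the sketch below streams a tab-separated log and counts the values in one column, roughly what `cut -f2 | sort | uniq -c` does on the command line. The function name, the column layout, and the sample data are assumptions made for the example.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column, delimiter="\t"):
    """Stream delimited lines and count the values found in one column."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > column:
            counts[row[column]] += 1
    return counts


# Hypothetical tab-separated log; in practice `lines` would be an open file.
sample = io.StringIO("v1\tPPC\nv2\tEmail\nv3\tPPC\n")
channel_counts = count_by_column(sample, column=1)
print(channel_counts.most_common())
```

Because the rows are read one at a time, memory use stays flat however large the file is, and no database layer sits between the data and the code.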
![Page 22: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/22.jpg)
EMR vs AWS & S3 1.0
(no data locality optimization + network & ~IO bound)
EMR = 45 min
AWS = 4 min
![Page 23: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/23.jpg)
EMR vs AWS & S3 2.0
EMR = 5+10 min*
AWS = ~4 min
*30 min preprocessing ;)
EMR = 5+4 min with big, compressed files
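The 2.0 numbers suggest that fewer, larger, compressed input files shorten the EMR run. One possible form of that preprocessing, sketched with the Python standard library only, is to concatenate many small files into a single gzip file. The function and file names are hypothetical, and the talk does not say how its preprocessing was done.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def merge_and_compress(input_paths, output_path):
    """Concatenate the input files, in order, into a single gzip file."""
    with gzip.open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo on throwaway files created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i in range(3):
        part = Path(tmp) / f"part-{i}.log"
        part.write_text(f"line from file {i}\n")
        parts.append(part)
    merged = Path(tmp) / "merged.log.gz"
    merge_and_compress(parts, merged)
    with gzip.open(merged, "rt") as f:
        merged_lines = f.read().splitlines()

print(merged_lines)
```

One caveat: Hadoop cannot split a gzip file, so each merged file is read by a single mapper and the target file size is a trade-off.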
![Page 24: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/24.jpg)
Scaling Machine Learning
● Scaling data preprocessing = Hadoop
● Small dataset = GPU
● Training with a big dataset = ??
Communication infrastructures = MPI & MapReduce (John Langford, http://hunch.net/?p=2094)
![Page 25: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/25.jpg)
MPI allreduce
![Page 26: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/26.jpg)
Hadoop vs MPI
MPI
● No fault tolerance by default
● Poor understanding of where the data is (manual split across nodes + bad communication & programming complexity)
● Scale limited to ~100 nodes in practice (sharing unavoidable)
● Shared cluster -> slow-node issues appear before disk/node failures
MapReduce
● Setup and teardown costs are significant (interaction with the scheduler & communicating the program to a large number of nodes)
● Worse: MapReduce waits for free nodes, and many MapReduce iterations are needed to reach high-quality predictions
● Flaw: requires refactoring code into map/reduce
![Page 30: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/30.jpg)
Hadoop-compatible AllReduce - Vowpal Wabbit (Hadoop + MPI)
● MPI: AllReduce (all nodes end in the same state)
● MapReduce: conceptual simplicity
● MPI: no need to refactor code
● MapReduce: data locality (map only)
● MPI: ability to use local storage (or RAM): temp files on local disk, which the OS can cache in RAM
● MapReduce: automatic cleanup of local resources (tmp files)
● MPI: fast optimization approaches remain within the conceptual scope: AllReduce = function call
● MapReduce: robustness (speculative execution to deal with slow nodes)
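To make "all nodes end in the same state" concrete, here is a single-process simulation of a sum AllReduce. It is a teaching sketch, not Vowpal Wabbit's implementation: the binary-tree layout and the function name are assumptions, and no real network communication takes place.

```python
def allreduce_sum(local_vectors):
    """Simulate AllReduce (sum) over nodes arranged as a binary tree.

    local_vectors[i] is the vector held by node i; the children of node i
    are nodes 2*i + 1 and 2*i + 2. Returns the vector each node holds
    after the call.
    """
    n = len(local_vectors)
    if n == 0:
        return []
    state = [list(v) for v in local_vectors]

    # Reduce phase: walk from the last node up to the root, adding each
    # node's partial sum into its parent.
    for node in range(n - 1, 0, -1):
        parent = (node - 1) // 2
        state[parent] = [a + b for a, b in zip(state[parent], state[node])]

    # Broadcast phase: the root now holds the global sum; copy it to all.
    total = state[0]
    return [list(total) for _ in range(n)]


# Hypothetical per-node gradients on a 4-node cluster.
gradients = [[1, 2], [3, 4], [5, 6], [7, 8]]
result = allreduce_sum(gradients)
print(result)
```

From the caller's side the whole exchange is one function call, which is why learning code written for a single machine needs little refactoring to use it.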
![Page 31: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/31.jpg)
Summary
● Big Data Big Picture
  ○ Big Data: cluster + IO bound (locality)
● Big Data Dead Valley Dilemma (MMID)
  ○ Small market / maturity / data: access, quality / slowness
● EMR (AWS) = slow
● Minimize the tower of abstraction
● Scaling MP: bottleneck = ML
  ○ MPI: no fault tolerance + where is the data?
  ○ Hadoop: slow setup & teardown + requires refactoring
  ○ Hadoop-compatible AllReduce
![Page 39: The big data dead valley dilemma and much more](https://reader034.vdocuments.us/reader034/viewer/2022052411/55626561d8b42aab1a8b4c53/html5/thumbnails/39.jpg)
Reference MPI & Hadoop
Blog:
http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
http://hunch.net/?p=2094
Video & slides: presentation by John Langford, Learning From Lots Of Data (full)
Speaker: John Langford, Senior Research Scientist, Microsoft Research
Slides: http://lisaweb.iro.umontrea...
Implementation: vowpal_wabbit