dataiku big data paris - the rise of the hadoop ecosystem
DESCRIPTION
Snapshot of the Hadoop ecosystem at the beginning of 2014, with the rise of real-time and in-memory distributed processing frameworks that complement and supplant the MapReduce paradigm
TRANSCRIPT
The Rise of the Hadoop Ecosystem
Florian Douetteau
CEO, Dataiku
DATAIKU
DATA PREPARATION
MODELING STATISTICS
VISUALIZATION
ALL-IN-ONE
DATA SCIENCE STUDIO
TOPICS FOR TODAY
DRIVERS FOR THE NEW “REAL-TIME” HADOOP ECOSYSTEM
KEY TOOLS AND FRAMEWORKS TO BE AWARE OF
DRIVER 1: BACK TO THE BASICS
RAM - CPU - DISK
        2000          2013
RAM     $1000 / GB    $6 / GB
DISK    $10 / GB      $0.06 / GB
Memory cost divided by 150
Disk cost divided by 250
MAPREDUCE times
HACKREDUCE times
A PERSISTENT MEMORY PROBLEM
DATA IS BIGGER
IS USEFUL DATA BIGGER ?
WHOLE DATA
REFINED DATA
GOLD
NEEDLE IN HAYSTACK ?
OIL
REFINE BEFORE USE
HOW BIG IS BIG DATA ?
Web site
– $1B revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day
10 TB RAW DATA
1 TB REFINED DATA
1 TERABYTE
FITS IN MEMORY
1TB
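As a rough back-of-the-envelope check of the "fits in memory" claim, the sketch below prices the RAM needed to hold the refined data, using the per-GB figures quoted on the RAM - CPU - DISK slide. It assumes the $1000 and $6 figures are the RAM prices for 2000 and 2013, takes 1 TB as 1000 GB, and ignores servers, power and replication, so the result is an order of magnitude at best.

```python
# Per-GB RAM prices as read from the "RAM - CPU - DISK" slide (assumption:
# the $1000 and $6 figures are the memory prices for 2000 and 2013).
RAM_PRICE_PER_GB = {2000: 1000.0, 2013: 6.0}

# Data sizes from the "How big is big data?" slide, in GB.
# Assumption: 1 TB is taken as 1000 GB for simplicity.
RAW_DATA_GB = 10 * 1000
REFINED_DATA_GB = 1 * 1000


def ram_cost(size_gb: float, year: int) -> float:
    """Cost in dollars of enough RAM to hold size_gb at that year's price."""
    return size_gb * RAM_PRICE_PER_GB[year]


for year in sorted(RAM_PRICE_PER_GB):
    print(year, "RAM for refined data:", ram_cost(REFINED_DATA_GB, year), "$")
```

With these figures, the refined terabyte moves from a seven-figure RAM bill in 2000 to a four-figure one in 2013, which is the point of the slide.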
DRIVER 2 : ECOSYSTEM GROWS
• 1st circle: OPEN SOURCE – YAHOO – IBM – LINKEDIN – FACEBOOK
• 2nd circle: STANFORD – BERKELEY – STARTUPS
STARTUPS
[Figure: amounts raised per startup, 2009 to 2013 (company labels lost in extraction): 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$]
$1B per year invested in Big Data TECH
223m$
301m$
HAVE YOU SEEN THE MOVIE ?
dooop
ALL-IN-ONE SOLUTION
HDFS
MAP REDUCE
1. Safe Large Storage (HDFS)
2. Distributed computation paradigm (Map Reduce)
3. Resilient long-running jobs
4. Disk/CPU locality-aware resource allocation
HADOOP =
LOVELY TANGLED TOGETHER
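To make the Map Reduce paradigm from the list above concrete, here is a minimal single-process sketch of a word count in three phases: map, shuffle, reduce. It is plain Python, not the Hadoop API; the function names are illustrative, and a real cluster would run the map and reduce phases on many machines, close to the HDFS blocks they read.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big memory"]))
```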
INTRODUCING YARN
HDFS
YARN
map reduce
provider1
Other cluster
provider…
THE NEW ECOSYSTEM
FASTER FASTER FASTER
REALLY FASTER ?
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
DEVELOPER CAN WAIT
BUSINESS WON’T WAIT
REAL-TIME QUERIES
Not all queries are born equal
RT QUERIES > IMPALA
MPP-database-like performance for Hadoop
- Created in 2012 by Cloudera
- 100x performance over Hive (for certain queries)
RT QUERIES > DRILL
Extensible architecture for SQL querying
• Started in 2013
• Apache incubated project
• Lucidworks
• MapR
• ElasticSearch
• …
• Alpha status
• Open architecture for supporting SQL-like queries to various data sources:
• Cassandra
• MongoDB
• HDFS
• HBase
Apache DRILL
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
REAL-TIME UPDATES
UPDATE > Recommender System
Update the model once per week using the whole history
Apply the model for each user using the very latest events
Real-Time Navigation
Real-Time Recommendation
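The split described on this slide, a slow batch update of the model and a fast per-user application of it, can be sketched as follows. The item co-occurrence model is an assumption chosen for brevity, not the method used in the talk, and the names `train_cooccurrence` and `recommend` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_cooccurrence(history):
    """Batch step (e.g. weekly): count how often two items share a session."""
    model = defaultdict(Counter)
    for session in history:
        for a, b in permutations(set(session), 2):
            model[a][b] += 1
    return model


def recommend(model, last_events, top_n=3):
    """Real-time step: score items against the user's most recent events."""
    scores = Counter()
    for item in last_events:
        scores.update(model.get(item, {}))
    for item in last_events:
        scores.pop(item, None)  # do not recommend what was just seen
    return [item for item, _ in scores.most_common(top_n)]


# Whole history, processed offline.
weekly_model = train_cooccurrence([["a", "b", "c"], ["a", "b"], ["b", "d"]])
# Very latest events of one user, processed at request time.
print(recommend(weekly_model, ["a"]))
```

The expensive pass over the whole history happens once per week; each request only touches the few model rows that match the user's latest events.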
UPDATE > STORM
STORM Reliable Distributed Real-Time Computations
- Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.)
- Runs computations in Java (native) or Python, Ruby, Perl …
- Guarantees that events are processed
- Distributes workload
UPDATES > SUMMINGBIRD
Write MapReduce-like programs and execute them in either
• Batch
• Real-Time
• Hybrid Batch / Real-Time
• Open-sourced by Twitter in 2013
• Built on top of Storm (and Cascading)
• Program in Scala
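The hybrid mode works when the aggregation is associative: the same counting logic can run over the archived history and over the live stream, and the two partial results can simply be added. The sketch below shows this with Python's `Counter`; it is a conceptual illustration only, not Summingbird code (which is written in Scala), and the function names are hypothetical.

```python
from collections import Counter


def count_words(events):
    """The single piece of business logic, usable in batch or in streaming."""
    counts = Counter()
    for text in events:
        counts.update(text.lower().split())
    return counts


def merge(batch_view, realtime_view):
    """Hybrid view: counts add up, whichever layer produced them."""
    return batch_view + realtime_view


batch_view = count_words(["spark storm", "storm drill"])  # archived history
realtime_view = count_words(["storm impala"])             # events since last batch
print(merge(batch_view, realtime_view))
```

Because addition of counts is associative, the merged view equals what one pass over all the events would have produced.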
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
FAST LEARNING DRIVE
GOOD PUPILS ITERATE
ITERATION FOR MACHINE LEARNING
……..
……..
Stochastic Gradient Descent : ITERATE
K-Means : ITERATE
PageRank : ITERATE
……..
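As an illustration of why these algorithms iterate, here is a small pure-Python PageRank by power iteration. Every pass rereads the whole link structure, which is why keeping the data in memory between passes matters. The damping factor of 0.85 and the fixed iteration count are conventional choices assumed here, not values from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):  # ITERATE: one full pass over the graph
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```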
LEARNING > GRAPHLAB
“Graph” Analytics in Memory
• Created at Carnegie-Mellon in 2009
• Generic Graph Traversal framework
• Packaged Machine Learning
- Recommender Systems
- Graph Analytics
- Clustering
• Easy Python Integration
LEARNING > H2O
In-Memory Distributed Prediction Engine
Machine Learning
- Classification
- Regression
- Clustering
- R/Python easy integration
ALL > SPARK
Real-Time Resilient Distributed Memory Framework
• Abstraction with any DAG operation on data:
- Filter
- Map
- Reduce
- Cache
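The four operations listed above can be mimicked with a toy lazy dataset: transformations only record what to do, a reduce triggers the computation, and cache keeps the computed elements in memory for the next action. This is a plain-Python sketch of the idea, not the Spark API; the `LazyDataset` class is hypothetical and runs in a single process.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy dataset: filter/map are recorded lazily and run only on reduce()."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator
        self._cache_enabled = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)  # served from memory
        if self._cache_enabled:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._iterate() if predicate(x)))

    def cache(self):
        self._cache_enabled = True  # materialise on first use, then reuse
        return self

    def reduce(self, fn):
        return fold(fn, self._iterate())


squares = (LazyDataset.from_list([1, 2, 3, 4])
           .filter(lambda x: x % 2 == 0)
           .map(lambda x: x * x)
           .cache())
print(squares.reduce(lambda a, b: a + b))  # first action computes and caches
print(squares.reduce(max))                 # second action reuses the cache
```

The second reduce does not rerun the filter and map steps, which is the benefit an in-memory framework offers to iterative workloads.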
SPARK AND ITS ECOSYSTEM
SHARK
MLBASE
STREAMING
Real-Time Queries
Real-Time Updates
In-Memory Learning
SPARK
THE WHOLE PICTURE
HDFS
YARN
map reduce
SPARK
GRAPHLAB
H2O
STREAMING
MLBASE
SHARK
PIG
HIVE
CASCADING
STORM
DRILL
other storage
IMPALA
THANK YOU !
dataiku.com
DATAIKU STAND A4
DEMO
DATA SCIENCE STUDIO
Questions now
or later