The Rise of the
Hadoop Ecosystem
Florian Douetteau, CEO, Dataiku
DATAIKU
DATA PREPARATION
MODELING
STATISTICS
VISUALIZATION
ALL-IN-ONE
DATA SCIENCE STUDIO
TOPICS FOR TODAY
DRIVERS FOR THE NEW “REAL-TIME” HADOOP ECOSYSTEM
KEY TOOLS AND FRAMEWORKS TO BE AWARE OF
DRIVER 1: BACK TO THE BASICS
RAM - CPU - DISK
          2000          2013
RAM       $1000 / GB    $6 / GB
DISK      $10 / GB      $0.06 / GB
memory cost divided by 150
disk cost divided by 250
MAPREDUCE
times
HACKREDUCE
times
A PERSISTENT MEMORY PROBLEM
DATA IS BIGGER
IS USEFUL DATA BIGGER ?
WHOLE DATA
REFINED DATA
GOLD
NEEDLE IN HAYSTACK ?
OIL
REFINE BEFORE USE
HOW BIG IS BIG DATA ?
Web Site
– $1B revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day
10 TB RAW DATA
1 TB REFINED DATA
1 TERABYTE
FITS IN MEMORY
1TB
DRIVER 2 : ECOSYSTEM GROWS
• 1st Circle: OPEN SOURCE – YAHOO – IBM – LINKEDIN – FACEBOOK
• 2nd Circle: STANFORD – BERKELEY – STARTUPS
STARTUPS
[Chart: startup funding rounds by year, 2009 to 2013; company names not recoverable. Amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$]
$1B per year invested
in Big Data TECH
223m$
301m$
HAVE YOU SEEN THE MOVIE ?
Hadoop
ALL-IN-ONE SOLUTION
HDFS
MAP REDUCE
1. Safe Large Storage (HDFS)
2. Distributed computation paradigm (Map Reduce)
3. Resilient long-running jobs
4. Disk-CPU locality-aware resource allocation
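The Map Reduce paradigm named in item 2 can be sketched on a single machine. This is only an illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel next to the data stored in HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical single-process driver)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["big data", "big memory"])` counts "big" twice and the other two words once.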
HADOOP =
LOVELY TANGLED TOGETHER
INTRODUCING YARN
HDFS
YARN
map reduce
provider1
Other cluster
provider…
THE NEW ECOSYSTEM
FASTER FASTER FASTER
REALLY FASTER ?
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
DEVELOPER CAN WAIT
BUSINESS WON’T WAIT
REAL-TIME QUERIES
Not all queries are born equal
RT QUERIES > IMPALA
MPP-database-like performance for Hadoop
- Created in 2012 by Cloudera
- Up to 100x faster than Hive (for certain queries)
RT QUERIES > DRILL
Extensible architecture for SQL querying
• Started in 2013
• Apache incubated project
  – Lucidworks
  – MapR
  – ElasticSearch
  – …
• Alpha status
• Open architecture for supporting SQL-like queries to various data sources:
  – Cassandra
  – MongoDB
  – HDFS
  – HBase
Apache DRILL
REAL-TIME UPDATES
UPDATE > RECOMMENDER SYSTEM
Update the model once per week using the whole history
Apply the model for each user using the very latest events
Real-Time Navigation
Real-Time Recommendation
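The split between a weekly model update and a per-user real-time application can be sketched as follows. This is a minimal sketch assuming a simple item co-occurrence model, which the slides do not specify; `train_weekly_model` and `recommend` are hypothetical names.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_weekly_model(sessions):
    """Batch step (assumed co-occurrence model): scan the whole history once
    and count how often two items appear in the same session."""
    model = defaultdict(Counter)
    for items in sessions:
        for a, b in permutations(set(items), 2):
            model[a][b] += 1
    return model


def recommend(model, last_events, k=3):
    """Real-time step: score candidates using only the user's latest items."""
    scores = Counter()
    for item in last_events:
        scores.update(model.get(item, {}))
    for item in last_events:
        scores.pop(item, None)  # do not recommend what the user just viewed
    return [item for item, _ in scores.most_common(k)]
```

The expensive part (`train_weekly_model`) reads the full history and can run in batch, while `recommend` only touches a few model rows and can run on every page view.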
UPDATE > STORM
STORM: Reliable Distributed Real-Time Computations
- Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.)
- Runs computations in Java (native) or Python, Ruby, Perl …
- Guarantees that events are processed
- Distributes the workload
UPDATES > SUMMINGBIRD
Write MapReduce-like programs and execute them in either
• Batch
• Real-Time
• Hybrid Batch / Real-Time
• Open sourced by Twitter in 2013
• Built on top of Storm (and Cascading)
• Programs written in Scala
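The idea of writing the logic once and running it in batch, real-time, or hybrid mode can be illustrated in a few lines. This is not Summingbird's API (which is Scala); it is a Python sketch with hypothetical names (`count_words`, `hybrid_view`), assuming the aggregation is an associative sum so that partial results can be merged.

```python
from collections import Counter


def count_words(events):
    """The job logic, written once: turn each event into word counts and sum them."""
    total = Counter()
    for text in events:
        total.update(text.split())
    return total


def hybrid_view(batch_events, realtime_events):
    """Hybrid mode: merge a batch result with a real-time result.

    Merging is valid here because adding counters is associative, so the
    split between the two inputs does not change the final answer.
    """
    batch_counts = count_words(batch_events)        # e.g. a periodic batch run
    realtime_counts = count_words(realtime_events)  # e.g. events since that run
    return batch_counts + realtime_counts
```

Running `count_words` over all events at once gives the same counts as `hybrid_view` over any split of those events.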
FAST LEARNING DRIVE
GOOD PUPILS ITERATE
ITERATION FOR MACHINE LEARNING
……..
……..
Stochastic Gradient Descent: ITERATE
K-Means: ITERATE
PageRank: ITERATE
……..
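To make the iteration pattern concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) on one-dimensional data. The function name and parameters are hypothetical; the point is that every iteration rescans the full dataset, which is why keeping the data in memory between passes matters.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data; each iteration is a full pass over points."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

With the points `[1.0, 2.0, 9.0, 10.0]` and starting centers `[0.0, 5.0]`, the centers settle on the means of the two groups after the first pass and stay there.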
LEARNING > GRAPHLAB
“Graph” Analytics in Memory
• Created at Carnegie Mellon in 2009
• Generic Graph Traversal framework
• Packaged Machine Learning
  – Recommender Systems
  – Graph Analytics
  – Clustering
• Easy Python Integration
LEARNING > H2O
In-Memory Distributed Prediction Engine
• Machine Learning
  – Classification
  – Regression
  – Clustering
• Easy R/Python integration
ALL > SPARK
Real-Time Resilient Distributed Memory Framework
• Abstraction with any DAG operation on data:
  – Filter
  – Map
  – Reduce
  – Cache
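The four operations listed above can be sketched with a small lazy pipeline. The `Dataset` class below is a hypothetical single-process stand-in, not Spark's API: transformations only record what to compute, `reduce` triggers the work, and `cache` keeps an intermediate result in memory so later operations do not recompute it.

```python
from functools import reduce as fold


class Dataset:
    """Hypothetical stand-in for a distributed dataset; not Spark's API."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function returning an iterable
        self._cached = None
        self._use_cache = False

    def _rows(self):
        """Produce the rows, from memory if this dataset was marked as cached."""
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        return self._compute()

    def filter(self, predicate):
        """Lazy: returns a new node in the DAG, nothing is computed yet."""
        return Dataset(lambda: (x for x in self._rows() if predicate(x)))

    def map(self, fn):
        """Lazy: returns a new node in the DAG, nothing is computed yet."""
        return Dataset(lambda: (fn(x) for x in self._rows()))

    def cache(self):
        """Mark this node so its rows are kept in memory after first use."""
        self._use_cache = True
        return self

    def reduce(self, fn):
        """Action: walks the DAG and combines all rows into one value."""
        return fold(fn, self._rows())
```

A typical use is to cache a filtered dataset once and then run several computations on it, for example `evens = Dataset(lambda: range(10)).filter(lambda x: x % 2 == 0).cache()` followed by `evens.map(lambda x: x * x).reduce(lambda a, b: a + b)`.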
SPARK AND ITS ECOSYSTEM
SHARK
MLBASE
STREAMING
Real-Time Queries
Real-Time Updates
In-Memory Learning
SPARK
THE WHOLE PICTURE
HDFS
YARN
map reduce
SPARK
GRAPHLAB
H2O
STREAMING
MLBASE
SHARK
PIG
HIVE
CASCADING
STORM
DRILL
other storage
IMPALA
THANK YOU !
dataiku.com
DATAIKU STAND A4
DEMO
DATA SCIENCE STUDIO
Questions now
or later