Transcript
Page 1: Dataiku   big data paris - the rise of the hadoop ecosystem

The Rise of the Hadoop Ecosystem

Page 2: Dataiku   big data paris - the rise of the hadoop ecosystem

Florian Douetteau, CEO, Dataiku

DATAIKU

DATA PREPARATION

MODELING STATISTICS

VISUALIZATION

ALL-IN-ONE

DATA SCIENCE STUDIO

Page 3: Dataiku   big data paris - the rise of the hadoop ecosystem

TOPICS FOR TODAY

DRIVERS FOR THE NEW “REAL-TIME” HADOOP ECOSYSTEM

KEY TOOLS AND FRAMEWORKS TO BE AWARE OF

Page 4: Dataiku   big data paris - the rise of the hadoop ecosystem

DRIVER 1: BACK TO THE BASICS

RAM - CPU - DISK

Page 5: Dataiku   big data paris - the rise of the hadoop ecosystem

          2000          2013
Memory    $1000 / GB    $6 / GB       (memory cost divided by 150)
Disk      $10 / GB      $0.06 / GB    (disk cost divided by 250)

MAPREDUCE times

HACKREDUCE times

A PERSISTENT MEMORY PROBLEM

Page 6: Dataiku   big data paris - the rise of the hadoop ecosystem

DATA IS BIGGER

Page 7: Dataiku   big data paris - the rise of the hadoop ecosystem

IS USEFUL DATA BIGGER ?

WHOLE DATA

REFINED DATA

Page 8: Dataiku   big data paris - the rise of the hadoop ecosystem

GOLD

NEEDLE IN HAYSTACK ?

Page 9: Dataiku   big data paris - the rise of the hadoop ecosystem

OIL

REFINE BEFORE USE

Page 10: Dataiku   big data paris - the rise of the hadoop ecosystem

HOW BIG IS BIG DATA ?

Web Site

- $1B revenue per year

- 10 million unique visitors per month

- 100 million orders / actions per day

10 TB RAW DATA

1 TB REFINED DATA

Page 11: Dataiku   big data paris - the rise of the hadoop ecosystem

1 TERABYTE

FITS IN MEMORY

1TB
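
As a rough check on this claim, the per-GB prices quoted on the earlier cost slide can be applied to 1 TB. This is only a back-of-the-envelope sketch: it assumes 1 TB = 1000 GB and that the $1000 / GB and $6 / GB figures are the memory prices for 2000 and 2013. The function name is illustrative.

```python
# Back-of-the-envelope cost of holding 1 TB in RAM, using the per-GB
# prices quoted on the cost slide (assumed here to be memory prices).
GB_PER_TB = 1000  # assumption: decimal terabyte

MEMORY_PRICE_PER_GB = {2000: 1000.0, 2013: 6.0}  # USD per GB, from the slide


def memory_cost_usd(terabytes: float, year: int) -> float:
    """Cost in USD of `terabytes` of RAM at the slide's price for `year`."""
    return terabytes * GB_PER_TB * MEMORY_PRICE_PER_GB[year]


if __name__ == "__main__":
    for year in (2000, 2013):
        print(year, f"${memory_cost_usd(1, year):,.0f}")
```

On these figures, 1 TB of RAM goes from about $1,000,000 in 2000 to about $6,000 in 2013, which is what makes keeping the refined data in memory practical.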

Page 12: Dataiku   big data paris - the rise of the hadoop ecosystem

DRIVER 2 : ECOSYSTEM GROWS

• GOOGLE

• 1st Circle: OPEN SOURCE - YAHOO - IBM - LINKEDIN - FACEBOOK

• 2nd Circle: STANFORD - BERKELEY - STARTUPS

Page 13: Dataiku   big data paris - the rise of the hadoop ecosystem

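STARTUPS

[Figure: funding amounts for Big Data startups, by year from 2009 to 2013. Company logos were not captured in the transcript. Amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$. Other figures shown: 223m$, 301m$.]

$1B per year invested in Big Data TECH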

Page 14: Dataiku   big data paris - the rise of the hadoop ecosystem

HAVE YOU SEEN THE MOVIE ?

dooop

Page 15: Dataiku   big data paris - the rise of the hadoop ecosystem

ALL-IN-ONE SOLUTION

HDFS

MAP REDUCE

HADOOP =

1. Safe Large Storage (HDFS)

2. Distributed computation paradigm (Map Reduce)

3. Resilient long-running jobs

4. Disk-CPU locality aware resource allocation
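
To make the "distributed computation paradigm" concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps on a word count. It is not Hadoop code: the function names (`run_job`, `word_mapper`, `sum_reducer`) are invented for illustration, and everything runs in one Python process rather than on a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def word_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key: str, values: List[int]) -> int:
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, then group by key (shuffle), then reduce, all in memory."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    print(run_job(["big data", "big memory"], word_mapper, sum_reducer))
```

In a real cluster the map and reduce calls would run on different machines, close to the disks holding the data, which is the locality point in item 4.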

Page 16: Dataiku   big data paris - the rise of the hadoop ecosystem

LOVELY TANGLED TOGETHER

Page 17: Dataiku   big data paris - the rise of the hadoop ecosystem

INTRODUCING YARN

Page 18: Dataiku   big data paris - the rise of the hadoop ecosystem

HDFS

YARN

map reduce

provider1

Other cluster

provider…

THE NEW ECOSYSTEM

Page 19: Dataiku   big data paris - the rise of the hadoop ecosystem

FASTER FASTER FASTER

REALLY FASTER ?

Page 20: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 21: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 22: Dataiku   big data paris - the rise of the hadoop ecosystem

DEVELOPER CAN WAIT


Page 23: Dataiku   big data paris - the rise of the hadoop ecosystem

BUSINESS WON’T WAIT

Page 24: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME QUERIES

Not All Queries Are Born Equal

Page 25: Dataiku   big data paris - the rise of the hadoop ecosystem

RT QUERIES > IMPALA

MPP-database-like performance for Hadoop

- Created in 2012 by Cloudera

- 100x performance over Hive (for certain queries)

Page 26: Dataiku   big data paris - the rise of the hadoop ecosystem

RT QUERIES > DRILL

Extensible architecture for SQL querying

• Started in 2013

• Apache Incubated Project
  - Lucidworks
  - MapR
  - ElasticSearch
  - …

• Alpha Status

• Open architecture for supporting SQL-like queries to various data sources:
  - Cassandra
  - MongoDB
  - HDFS
  - HBase

Apache DRILL

Page 27: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 28: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME UPDATES

Page 29: Dataiku   big data paris - the rise of the hadoop ecosystem

UPDATE > Recommender System

Update the model once per week using the whole history

Apply the model for each user using the very latest events

Real-Time Navigation

Real-Time Recommendation
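
The split between a slow batch update and a fast per-user application can be sketched as follows. This is an illustrative toy, not the method used in the talk: it assumes a simple item co-occurrence model, and the names `train_cooccurrence` and `recommend` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, List


def train_cooccurrence(history: List[List[str]]) -> Dict[str, Counter]:
    """Batch step (e.g. weekly): count how often two items share a session."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for session in history:
        for a, b in permutations(set(session), 2):
            model[a][b] += 1
    return model


def recommend(model: Dict[str, Counter], last_events: List[str], k: int = 3) -> List[str]:
    """Real-time step: score candidates from the user's latest events only."""
    seen = set(last_events)
    scores: Counter = Counter()
    for item in last_events:
        for other, count in model.get(item, {}).items():
            if other not in seen:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]


if __name__ == "__main__":
    history = [["a", "b"], ["a", "b"], ["a", "c"]]  # made-up sessions
    model = train_cooccurrence(history)             # slow, runs on all data
    print(recommend(model, ["a"], k=1))             # fast, runs per request
```

The expensive pass over the whole history happens only in `train_cooccurrence`; `recommend` touches just the few entries tied to the user's latest events, so it can run on every page view.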

Page 30: Dataiku   big data paris - the rise of the hadoop ecosystem

UPDATE > STORM

STORM: Reliable Distributed Real-Time Computations

- Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.)

- Runs computations in Java (native) or Python, Ruby, Perl …

- Guarantees that events are processed

- Distributes workload
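
The processing guarantee can be pictured as "replay anything that was not acknowledged". The sketch below is a single-process toy of that idea and is not the Storm API; Storm's own mechanism is more involved. The function name and the `max_attempts` parameter are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def process_at_least_once(
    events: Iterable[int],
    handler: Callable[[int], None],
    max_attempts: int = 3,
) -> List[int]:
    """Run `handler` on every event, replaying failures.

    A normal return counts as an acknowledgement. An exception puts the
    event back in the queue until `max_attempts` is reached. Returns the
    events that never succeeded.
    """
    pending = deque((event, 1) for event in events)
    failed: List[int] = []
    while pending:
        event, attempt = pending.popleft()
        try:
            handler(event)
        except Exception:
            if attempt < max_attempts:
                pending.append((event, attempt + 1))  # replay later
            else:
                failed.append(event)
    return failed
```

Because a failed event is replayed, a handler may see the same event more than once, so handlers in this style are usually written to tolerate duplicates.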

Page 31: Dataiku   big data paris - the rise of the hadoop ecosystem

UPDATES > SUMMINGBIRD

Write MapReduce-like programs and execute them in either:

• Batch

• Real-Time

• Hybrid Batch / Real-Time

• Open Sourced By Twitter in 2013

• Built on top of Storm (and Cascading)

• Programs are written in Scala
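
One way to see why the same program can run in batch, in real time, or both: if the aggregation is an associative merge, a batch result and a real-time result can simply be combined. The sketch below illustrates that idea in Python with a word count; it is not Summingbird code (which is Scala), and `count_words` and `merge` are hypothetical names.

```python
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """The job logic, written once: count words in any iterable of lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(batch_counts: Counter, realtime_counts: Counter) -> Counter:
    """Hybrid mode: combine the batch view with the recent real-time view."""
    return batch_counts + realtime_counts


if __name__ == "__main__":
    old_lines = ["spark storm", "storm"]   # already processed in batch
    new_lines = ["storm hive"]             # arrived since the last batch run
    hybrid = merge(count_words(old_lines), count_words(new_lines))
    print(hybrid == count_words(old_lines + new_lines))
```

The hybrid result equals what a single pass over all the data would give, because adding counts is associative and the order of merging does not matter.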

Page 32: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 33: Dataiku   big data paris - the rise of the hadoop ecosystem

FAST LEARNING DRIVE

GOOD PUPILS ITERATE

Page 34: Dataiku   big data paris - the rise of the hadoop ecosystem

ITERATION FOR MACHINE LEARNING

……..

……..

Stochastic Gradient Descent : ITERATE

K-Means : ITERATE

PageRank: ITERATE

……..
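
A small k-means example shows what "ITERATE" means in practice: every iteration makes a full pass over the data, so the cost of re-reading the data is paid again each time. This is a minimal one-dimensional sketch with made-up points; `kmeans_1d` is an illustrative name, not a library function.

```python
from typing import List, Sequence


def kmeans_1d(points: Sequence[float], centroids: Sequence[float], iterations: int = 10) -> List[float]:
    """Plain k-means on 1-D points, stopping early once centroids are stable."""
    current = list(centroids)
    for _ in range(iterations):
        clusters: List[List[float]] = [[] for _ in current]
        for p in points:  # full pass over the data, repeated every iteration
            nearest = min(range(len(current)), key=lambda i: abs(p - current[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        updated = [
            sum(c) / len(c) if c else current[i]
            for i, c in enumerate(clusters)
        ]
        if updated == current:
            break
        current = updated
    return current


if __name__ == "__main__":
    print(kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0]))
```

When each pass has to go back to disk, as in a chain of MapReduce jobs, the iterations dominate the run time; keeping the points in memory avoids that repeated read.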

Page 35: Dataiku   big data paris - the rise of the hadoop ecosystem

LEARNING > GRAPHLAB

“Graph” Analytics in Memory

• Created at Carnegie Mellon in 2009

• Generic Graph Traversal framework

• Packaged Machine Learning
  - Recommender Systems
  - Graph Analytics
  - Clustering

• Easy Python Integration

Page 36: Dataiku   big data paris - the rise of the hadoop ecosystem

LEARNING > H2O

In-Memory Distributed Prediction Engine

Machine Learning
- Classification
- Regression
- Clustering

- Easy R / Python integration

Page 37: Dataiku   big data paris - the rise of the hadoop ecosystem

ALL > SPARK

Real-Time Resilient Distributed Memory Framework

• Abstraction with any DAG operation on data:
  - Filter
  - Map
  - Reduce
  - Cache
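
The four operations can be illustrated with a toy, single-process dataset that records transformations lazily and only computes when a result is requested. This is not the Spark API: `LazyDataset` and its methods are hypothetical stand-ins that only mimic the idea of chaining filter, map and reduce steps and caching an intermediate result in memory.

```python
import functools
from typing import Any, Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy dataset: transformations are recorded, nothing runs until reduce()."""

    def __init__(self, compute: Callable[[], Iterable[Any]]) -> None:
        self._compute = compute              # zero-argument callable producing elements
        self._keep = False                   # set by cache()
        self._materialized: Optional[List[Any]] = None

    @classmethod
    def from_list(cls, items: List[Any]) -> "LazyDataset":
        return cls(lambda: iter(items))

    def _iterate(self) -> Iterator[Any]:
        if self._materialized is not None:
            return iter(self._materialized)  # served from memory
        if self._keep:
            self._materialized = list(self._compute())
            return iter(self._materialized)
        return iter(self._compute())

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def cache(self) -> "LazyDataset":
        """Keep this dataset's elements in memory after the first computation."""
        self._keep = True
        return self

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        """Action: trigger the computation and fold the elements with `fn`."""
        return functools.reduce(fn, self._iterate())


if __name__ == "__main__":
    evens_squared = (
        LazyDataset.from_list([1, 2, 3, 4])
        .filter(lambda x: x % 2 == 0)
        .map(lambda x: x * x)
        .cache()
    )
    print(evens_squared.reduce(lambda a, b: a + b))  # first use computes and caches
    print(evens_squared.reduce(max))                 # second use reads from memory
```

The second `reduce` reuses the cached elements instead of re-running the filter and map steps, which is the property that makes iterative workloads cheaper when the data fits in memory.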

Page 38: Dataiku   big data paris - the rise of the hadoop ecosystem

SPARK AND ITS ECOSYSTEM

SHARK

MLBASE

STREAMING

Real-Time Queries

Real-Time Updates

In-Memory Learning

SPARK

Page 39: Dataiku   big data paris - the rise of the hadoop ecosystem

THE WHOLE PICTURE

HDFS

YARN

map reduce

SPARK

GRAPHLAB

H2O

STREAMING

MLBASE

SHARK

PIG

HIVE

CASCADING

STORM

DRILL

other storage

IMPALA

Page 40: Dataiku   big data paris - the rise of the hadoop ecosystem

THANK YOU !

dataiku.com

DATAIKU STAND A4

DEMO

DATA SCIENCE STUDIO

Questions now

or later

[email protected]

