dataiku big data paris - the rise of the hadoop ecosystem

The Rise of the Hadoop Ecosystem

Upload: dataiku

Post on 10-May-2015


DESCRIPTION

Snapshot of the Hadoop ecosystem at the beginning of 2014, with the rise of real-time and in-memory distributed processing frameworks that complement and supplant the MapReduce paradigm.

TRANSCRIPT

Page 1: Dataiku   big data paris - the rise of the hadoop ecosystem

The Rise of the Hadoop Ecosystem

Page 2: Dataiku   big data paris - the rise of the hadoop ecosystem

Florian Douetteau, CEO, Dataiku

DATAIKU

DATA PREPARATION

MODELING

STATISTICS

VISUALIZATION

ALL-IN-ONE

DATA SCIENCE STUDIO

Page 3: Dataiku   big data paris - the rise of the hadoop ecosystem

TOPICS FOR TODAY

DRIVERS FOR THE NEW “REAL-TIME” HADOOP ECOSYSTEM

KEY TOOLS AND FRAMEWORKS TO BE AWARE OF

Page 4: Dataiku   big data paris - the rise of the hadoop ecosystem

DRIVER 1: BACK TO THE BASICS

RAM - CPU - DISK

Page 5: Dataiku   big data paris - the rise of the hadoop ecosystem

A PERSISTENT MEMORY PROBLEM

Cost per GB, 2000 vs 2013:

Memory: $1000 / GB in 2000, $6 / GB in 2013 (memory cost divided by 150)

Disk: $10 / GB in 2000, $0.06 / GB in 2013 (disk cost divided by 250)

Timeline labels: MAPREDUCE times, HACKREDUCE times

Page 6: Dataiku   big data paris - the rise of the hadoop ecosystem

DATA IS BIGGER

Page 7: Dataiku   big data paris - the rise of the hadoop ecosystem

IS USEFUL DATA BIGGER ?

WHOLE DATA

REFINED DATA

Page 8: Dataiku   big data paris - the rise of the hadoop ecosystem

GOLD

NEEDLE IN HAYSTACK ?

Page 9: Dataiku   big data paris - the rise of the hadoop ecosystem

OIL

REFINE BEFORE USE

Page 10: Dataiku   big data paris - the rise of the hadoop ecosystem

HOW BIG IS BIG DATA ?

Web Site

– $1B revenue per year

– 10 million unique visitors per month

– 100 million orders / actions per day

10 TB RAW DATA

1 TB REFINED DATA

Page 11: Dataiku   big data paris - the rise of the hadoop ecosystem

1 TERABYTE

FITS IN MEMORY

1TB
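As a rough check on the claim that 1 TB fits in memory, the sketch below counts how many machines would be needed to hold the refined data set entirely in RAM. The 256 GB per node and the 50% usable fraction are illustrative assumptions, not figures from the talk.

```python
import math

def nodes_needed(data_gb, ram_per_node_gb, usable_fraction):
    """Number of machines needed to hold data_gb entirely in RAM."""
    usable_gb = ram_per_node_gb * usable_fraction
    return math.ceil(data_gb / usable_gb)

REFINED_DATA_GB = 1024   # 1 TB of refined data (from the slide)
RAM_PER_NODE_GB = 256    # assumption: RAM installed in one server
USABLE_FRACTION = 0.5    # assumption: share of RAM left for the data itself

print(nodes_needed(REFINED_DATA_GB, RAM_PER_NODE_GB, USABLE_FRACTION))
```

Under these assumptions a small cluster is enough, which is the point of the slide: the refined data, not the raw data, is what has to fit.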

Page 12: Dataiku   big data paris - the rise of the hadoop ecosystem

DRIVER 2 : ECOSYSTEM GROWS

• GOOGLE

• 1st Circle: OPEN SOURCE

– YAHOO – IBM – LINKEDIN – FACEBOOK

• 2nd Circle

– STANFORD, BERKELEY

– STARTUPS

Page 13: Dataiku   big data paris - the rise of the hadoop ecosystem

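STARTUPS

[Figure: funding raised by big data startups on a 2009-2013 timeline. Amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$. Two further figures, 223m$ and 301m$, appear without a recoverable label. Company names are not recoverable from the transcript.]

$1B per year invested in Big Data TECH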

Page 14: Dataiku   big data paris - the rise of the hadoop ecosystem

HAVE YOU SEEN THE MOVIE ?

dooop

Page 15: Dataiku   big data paris - the rise of the hadoop ecosystem

ALL-IN-ONE SOLUTION

HDFS

MAP REDUCE

HADOOP =

1. Safe large storage (HDFS)

2. Distributed computation paradigm (Map Reduce)

3. Resilient long-running jobs

4. Disk-CPU locality-aware resource allocation
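To make the Map Reduce paradigm concrete, here is a minimal single-process sketch of a word count. It only illustrates the three logical steps (map, shuffle, reduce); the function names are hypothetical and this is not Hadoop's API, which distributes these steps across machines and spills intermediate results to disk.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```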

Page 16: Dataiku   big data paris - the rise of the hadoop ecosystem

LOVELY TANGLED TOGETHER

Page 17: Dataiku   big data paris - the rise of the hadoop ecosystem

INTRODUCING YARN

Page 18: Dataiku   big data paris - the rise of the hadoop ecosystem

HDFS

YARN

map reduce

provider1

Other cluster

provider…

THE NEW ECOSYSTEM

Page 19: Dataiku   big data paris - the rise of the hadoop ecosystem

FASTER FASTER FASTER

REALLY FASTER ?

Page 20: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 21: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 22: Dataiku   big data paris - the rise of the hadoop ecosystem

DEVELOPER CAN WAIT

Page 23: Dataiku   big data paris - the rise of the hadoop ecosystem

BUSINESS WON’T WAIT

Page 24: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME QUERIES

Not all queries are born equal

Page 25: Dataiku   big data paris - the rise of the hadoop ecosystem

RT QUERIES > IMPALA

MPP-database-like performance for Hadoop

- Created in 2012 by Cloudera

- 100x the performance of Hive (for certain queries)

Page 26: Dataiku   big data paris - the rise of the hadoop ecosystem

RT QUERIES > DRILL

Extensible architecture for SQL querying

• Started in 2013

• Apache Incubated Project

– Lucidworks

– MapR

– ElasticSearch

– …

• Alpha status

• Open architecture for supporting SQL-like queries to various data sources:

– Cassandra

– MongoDB

– HDFS

– HBase

Apache DRILL

Page 27: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 28: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME UPDATES

Page 29: Dataiku   big data paris - the rise of the hadoop ecosystem

UPDATE > Recommender System

Update the model once per week using the whole history

Apply the model for each user using the most recent events

Real-Time Navigation

Real-Time Recommendation
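The split between a slow batch update and a fast per-user application can be sketched as follows. This is a minimal illustration, assuming a simple item co-occurrence model; the function names and sample data are hypothetical and not taken from the talk.

```python
from collections import Counter, defaultdict
from itertools import permutations

def train_model(history):
    """Batch step (e.g. weekly): count item co-occurrences over all sessions."""
    model = defaultdict(Counter)
    for session in history:
        for a, b in permutations(set(session), 2):
            model[a][b] += 1
    return model

def recommend(model, last_events, k=2):
    """Online step: score candidates from the user's most recent events only."""
    scores = Counter()
    for item in last_events:
        scores.update(model.get(item, {}))
    for item in last_events:
        scores.pop(item, None)  # do not recommend what was just seen
    return [item for item, _ in scores.most_common(k)]

history = [["tent", "stove", "lamp"], ["tent", "stove"], ["lamp", "battery"]]
model = train_model(history)
print(recommend(model, ["tent"], k=1))
```

The expensive pass over the whole history runs rarely, while `recommend` only touches the handful of events from the current navigation, which is what makes a real-time answer feasible.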

Page 30: Dataiku   big data paris - the rise of the hadoop ecosystem

UPDATE > STORM

STORM: Reliable Distributed Real-Time Computations

- Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.)

- Runs computations in Java (native) or Python, Ruby, Perl, …

- Guarantees that every event is processed

- Distributes the workload
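The processing guarantee can be illustrated with a toy replay loop: an event that fails is put back in the queue and retried until it succeeds or a retry limit is reached. This is a single-process sketch of the idea only, not Storm's API; `run_topology`, `flaky_double`, and the retry limit are hypothetical names and values chosen for the example.

```python
from collections import deque

def run_topology(events, process, max_attempts=3):
    """Feed events to `process`; replay any event whose processing fails."""
    pending = deque((event, 1) for event in events)
    results, dropped = [], []
    while pending:
        event, attempt = pending.popleft()
        try:
            results.append(process(event))  # success counts as an acknowledgement
        except Exception:
            if attempt < max_attempts:
                pending.append((event, attempt + 1))  # replay later
            else:
                dropped.append(event)
    return results, dropped

failed_once = set()

def flaky_double(x):
    """Hypothetical processing step: fails the first time it sees an odd number."""
    if x % 2 and x not in failed_once:
        failed_once.add(x)
        raise RuntimeError("transient failure")
    return 2 * x

results, dropped = run_topology([1, 2, 3], flaky_double)
print(sorted(results), dropped)
```

Note that replaying means an event may be attempted more than once, so processing steps are usually written to tolerate repeats.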

Page 31: Dataiku   big data paris - the rise of the hadoop ecosystem

UPDATES > SUMMINGBIRD

Write MapReduce-like programs and execute them in:

• Batch

• Real-Time

• Hybrid Batch / Real-Time

• Open sourced by Twitter in 2013

• Built on top of Storm (and Cascading)

• Programs are written in Scala
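One way to picture the hybrid mode is that the same aggregation runs over the historical batch and over the events that arrived since, and the two partial results are merged. The sketch below shows this in Python for readability (Summingbird programs themselves are written in Scala); it assumes the aggregation is a sum, and all names and sample data are hypothetical.

```python
from collections import Counter

def count_hashtags(tweets):
    """The same aggregation logic, usable on a batch or on a live window."""
    return Counter(
        word for tweet in tweets for word in tweet.split() if word.startswith("#")
    )

# Batch view: computed over the stored history.
batch_view = count_hashtags(["#hadoop is big", "#spark and #hadoop"])
# Real-time view: computed over events received since the last batch run.
realtime_view = count_hashtags(["#spark is fast"])

# Hybrid view: counts are summed per key, so the partial results merge cleanly.
hybrid_view = batch_view + realtime_view
print(hybrid_view)
```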

Page 32: Dataiku   big data paris - the rise of the hadoop ecosystem

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

Page 33: Dataiku   big data paris - the rise of the hadoop ecosystem

FAST LEARNING DRIVE

GOOD PUPILS ITERATE

Page 34: Dataiku   big data paris - the rise of the hadoop ecosystem

ITERATION FOR MACHINE LEARNING

Stochastic Gradient Descent: ITERATE

K-Means: ITERATE

PageRank: ITERATE

…
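To show why these algorithms favour in-memory processing, here is a minimal PageRank by power iteration. Every pass re-reads the whole link graph, so a framework that reloads the data from disk on each pass pays that cost dozens of times. The sketch assumes every page has at least one outgoing link and every link target is itself a key of `links`; the damping factor, iteration count, and sample graph are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration: every pass goes over the whole link graph again."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Each page starts from the "random jump" share of the rank.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```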

Page 35: Dataiku   big data paris - the rise of the hadoop ecosystem

LEARNING > GRAPHLAB

“Graph” Analytics in Memory

• Created at Carnegie Mellon in 2009

• Generic graph traversal framework

• Packaged machine learning

- Recommender Systems

- Graph Analytics

- Clustering

• Easy Python Integration

Page 36: Dataiku   big data paris - the rise of the hadoop ecosystem

LEARNING > H2O

In-Memory Distributed Prediction Engine

Machine Learning

- Classification

- Regression

- Clustering

- Easy R/Python integration

Page 37: Dataiku   big data paris - the rise of the hadoop ecosystem

ALL > SPARK

Real-Time Resilient Distributed Memory Framework

• Abstraction with any DAG operation on data:

- Filter

- Map

- Reduce

- Cache
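The idea of chaining filter, map, reduce, and cache operations into a graph that is only evaluated on demand can be sketched with a toy class. This is a single-process illustration and not Spark's actual API; `LazyDataset` and its `evaluations` counter are hypothetical names introduced for the example.

```python
from functools import reduce as fold

class LazyDataset:
    """Toy node in a graph of operations; computed only when a result is needed."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function producing a list
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0         # how many times this node was computed

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._cache_enabled:
            self._cached = data      # keep the result in memory for reuse
        return data

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._cache_enabled = True
        return self

    def reduce(self, fn):
        return fold(fn, self.collect())

numbers = LazyDataset.from_list(range(1, 11))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
total = even_squares.reduce(lambda a, b: a + b)  # triggers the computation
largest = even_squares.reduce(max)               # served from the cached result
print(total, largest, even_squares.evaluations)
```

Because the intermediate result is cached, the second reduction does not recompute the filter and map steps, which is the property iterative algorithms benefit from.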

Page 38: Dataiku   big data paris - the rise of the hadoop ecosystem

SPARK AND ITS ECOSYSTEM

SHARK: Real-Time Queries

STREAMING: Real-Time Updates

MLBASE: In-Memory Learning

SPARK

Page 39: Dataiku   big data paris - the rise of the hadoop ecosystem

THE WHOLE PICTURE

[Diagram. Components shown: HDFS, YARN, other storage, map reduce, SPARK, GRAPHLAB, H2O, STREAMING, MLBASE, SHARK, PIG, HIVE, CASCADING, STORM, DRILL, IMPALA]

Page 40: Dataiku   big data paris - the rise of the hadoop ecosystem

THANK YOU !

dataiku.com

DATAIKU STAND A4

DEMO

DATA SCIENCE STUDIO

Questions now

or later
