Dataiku, Big Data Paris: The Rise of the Hadoop Ecosystem

Post on 10-May-2015

Category:

Technology

DESCRIPTION

Snapshot of the Hadoop ecosystem at the beginning of 2014, with the rise of real-time and in-memory distributed processing frameworks that complement and supplant the MapReduce paradigm.

TRANSCRIPT

The Rise of the Hadoop Ecosystem

Florian Douetteau, CEO, Dataiku

DATAIKU

DATA PREPARATION

MODELING

STATISTICS

VISUALIZATION

ALL-IN-ONE

DATA SCIENCE STUDIO

TOPICS FOR TODAY

DRIVERS FOR THE NEW “REAL-TIME” HADOOP ECOSYSTEM

KEY TOOLS AND FRAMEWORKS TO BE AWARE OF

DRIVER 1: BACK TO THE BASICS

RAM - CPU - DISK

Cost per GB, 2000 vs. 2013

Memory: $1000 / GB (2000), $6 / GB (2013): memory cost divided by 150

Disk: $10 / GB (2000), $0.06 / GB (2013): disk cost divided by 250

MAPREDUCE

times

HACKREDUCE

times

A PERSISTENT MEMORY PROBLEM

DATA IS BIGGER

IS USEFUL DATA BIGGER ?

WHOLE DATA

REFINED DATA

GOLD

NEEDLE IN HAYSTACK ?

OIL

REFINE BEFORE USE

HOW BIG IS BIG DATA?

Web site:
– $1B revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day

10 TB RAW DATA

1 TB REFINED DATA

1 TERABYTE

FITS IN MEMORY

1TB
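
As a rough back-of-the-envelope check of the "fits in memory" claim, the per-GB prices quoted on the RAM - CPU - DISK slide can be multiplied out. This is only a sketch: it assumes a decimal terabyte (1000 GB), takes the slide's prices at face value, and the function and variable names are illustrative.

```python
# Per-GB prices in USD, as quoted on the "RAM - CPU - DISK" slide.
RAM_PRICE_PER_GB = {2000: 1000.0, 2013: 6.0}
DISK_PRICE_PER_GB = {2000: 10.0, 2013: 0.06}

GB_PER_TB = 1000  # assumption: decimal terabyte


def cost_of(terabytes, price_per_gb):
    """Return the cost in USD of holding `terabytes` at `price_per_gb`."""
    return terabytes * GB_PER_TB * price_per_gb


# 1 TB of refined data held in RAM, at 2000 and 2013 prices.
ram_2000 = cost_of(1, RAM_PRICE_PER_GB[2000])
ram_2013 = cost_of(1, RAM_PRICE_PER_GB[2013])

# 10 TB of raw data kept on disk at 2013 prices.
disk_raw_2013 = cost_of(10, DISK_PRICE_PER_GB[2013])

print(f"1 TB in RAM, 2000: ${ram_2000:,.0f}")
print(f"1 TB in RAM, 2013: ${ram_2013:,.0f}")
print(f"10 TB on disk, 2013: ${disk_raw_2013:,.0f}")
```

Under these assumptions, holding the refined terabyte in RAM goes from a seven-figure cost in 2000 to a four-figure one in 2013.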

DRIVER 2 : ECOSYSTEM GROWS

• GOOGLE

• 1st circle, OPEN SOURCE: YAHOO, IBM, LINKEDIN, FACEBOOK

• 2nd circle: STANFORD, BERKELEY, STARTUPS

STARTUPS

[Figure: startup logos with funding raised; amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$]

$1B per year invested in Big Data tech

[Chart: investment by year, 2009 to 2013; values shown: 223m$, 301m$]

HAVE YOU SEEN THE MOVIE ?

dooop

ALL-IN-ONE SOLUTION

HDFS

MAP REDUCE

1. Safe Large Storage (HDFS)

2. Distributed computation paradigm (Map Reduce)

3. Resilient long job

4. Disk-CPU locality aware resource allocation

HADOOP =

LOVELY TANGLED TOGETHER
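
The "distributed computation paradigm" in the list above can be illustrated with a single-process word count. This is a minimal sketch of the map, shuffle, and reduce phases in plain Python, not Hadoop's API; all function names are illustrative, and a real cluster would run the map and reduce steps on many machines close to the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "spark processes data"])
print(counts)
```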

INTRODUCING YARN

HDFS

YARN

map reduce

provider1

Other cluster

provider…

THE NEW ECOSYSTEM

FASTER FASTER FASTER

REALLY FASTER ?

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

DEVELOPER CAN WAIT

BUSINESS WON’T WAIT

REAL-TIME QUERIES

Not all queries are born equal

RT QUERIES > IMPALA

MPP-database-like performance for Hadoop

- Created in 2012 by Cloudera

- Up to 100x the performance of Hive (for certain queries)

RT QUERIES > DRILL

Extensible architecture for SQL querying

• Started in 2013

• Apache incubated project
• LucidWorks
• MapR
• ElasticSearch
• …

• Alpha status

• Open architecture for supporting SQL-like queries to various data sources:
• Cassandra
• MongoDB
• HDFS
• HBase

Apache DRILL

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

REAL-TIME UPDATES

UPDATE > RECOMMENDER SYSTEM

Update the model once per week using the whole history

Apply the model for each user using the very latest events

Real-Time Navigation

Real-Time Recommendation
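
The split between a weekly model update and a per-user real-time application can be sketched as follows. This assumes a simple item co-occurrence model, which the slides do not specify; the function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_weekly(history):
    """Batch step: build item-to-item co-occurrence counts from all past sessions."""
    model = defaultdict(Counter)
    for session in history:
        for a, b in permutations(set(session), 2):
            model[a][b] += 1
    return model


def recommend(model, last_events, k=2):
    """Online step: score candidates using only the user's most recent events."""
    scores = Counter()
    for item in last_events:
        scores.update(model.get(item, {}))
    for item in last_events:
        scores.pop(item, None)  # do not recommend what was just viewed
    return [item for item, _ in scores.most_common(k)]


# Hypothetical browsing history used for the weekly rebuild.
history = [["tent", "stove", "lamp"], ["tent", "stove"], ["lamp", "battery"]]
model = train_weekly(history)

# Real-time application: the user has just looked at a tent.
suggestions = recommend(model, last_events=["tent"])
print(suggestions)  # ['stove', 'lamp']
```

The expensive pass over the whole history happens once per rebuild, while each recommendation only needs a few dictionary lookups.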

UPDATE > STORM

STORM: Reliable Distributed Real-Time Computation

- Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.)

- Runs computations in Java (native) or Python, Ruby, Perl …

- Guarantees that events are processed

- Distributes workload
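
The processing guarantee listed above can be pictured as an acknowledge-or-replay loop. The sketch below is a single-process illustration of that idea in plain Python, not Storm's API: the names `run_topology` and `flaky_double` and the `max_attempts` limit are assumptions made for the example.

```python
from collections import deque


def run_topology(events, process, max_attempts=3):
    """Process every event, replaying those whose processing raised an error."""
    pending = deque((event, 1) for event in events)
    done, failed = [], []
    while pending:
        event, attempt = pending.popleft()
        try:
            done.append(process(event))  # a normal return plays the role of an "ack"
        except Exception:
            if attempt < max_attempts:
                pending.append((event, attempt + 1))  # replay later
            else:
                failed.append(event)  # give up after max_attempts
    return done, failed


seen = set()


def flaky_double(x):
    """Hypothetical processing step that fails once for the value 2."""
    if x == 2 and x not in seen:
        seen.add(x)
        raise RuntimeError("transient failure")
    return x * 2


done, failed = run_topology([1, 2, 3], flaky_double)
print(done, failed)  # [2, 6, 4] []
```

Replayed events are processed again from the start, so an event can be handled out of order, and any side effect performed before a failure may be repeated.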

UPDATES > SUMMINGBIRD

Write MapReduce-like programs and execute them in either:

• Batch
• Real-Time
• Hybrid Batch / Real-Time

• Open-sourced by Twitter in 2013

• Built on top of Storm (and Cascading)

• Programs are written in Scala
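
The "write once, run in batch, real-time, or hybrid" idea can be sketched in plain Python. This is not Summingbird's Scala API; it only illustrates, under the assumption of an associative merge step, why the same logic can serve both modes and why their results can be combined. All names are illustrative.

```python
from collections import Counter


def extract(event):
    """Shared 'map' logic: turn one event into partial counts."""
    return Counter(event.split())


def merge(left, right):
    """Shared 'reduce' logic: an associative merge of partial counts."""
    return left + right


def run_batch(events):
    """Batch mode: fold the shared logic over a complete data set."""
    total = Counter()
    for event in events:
        total = merge(total, extract(event))
    return total


class RealTime:
    """Real-time mode: apply the same logic one event at a time."""

    def __init__(self):
        self.state = Counter()

    def on_event(self, event):
        self.state = merge(self.state, extract(event))


old_events = ["a b", "b c"]  # already archived, processed in batch
new_events = ["c c"]         # arriving now, processed as a stream

batch_result = run_batch(old_events)
stream = RealTime()
for event in new_events:
    stream.on_event(event)

# Hybrid mode: merge the batch result with the live state.
hybrid = merge(batch_result, stream.state)
print(hybrid)
```

Because the merge is associative, the hybrid result equals what a single batch run over all events would produce.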

REAL-TIME

REAL-TIME QUERIES

REAL-TIME UPDATES

FAST MACHINE LEARNING

FAST LEARNING DRIVE

GOOD PUPILS ITERATE

ITERATION FOR MACHINE LEARNING

……..

……..

Stochastic Gradient Descent: ITERATE

K-Means: ITERATE

PageRank: ITERATE
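
To make the "iterate" point concrete, here is a minimal PageRank by power iteration. It is a sketch under assumptions: the graph is a small in-memory dict whose link targets all appear as keys, and the damping factor and iteration count are conventional defaults rather than values from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict of {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):  # every pass re-reads the whole graph
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

Each of the 50 passes touches the full data set again, which is why re-reading from disk on every iteration is costly and keeping the data in memory helps.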

……..

LEARNING > GRAPHLAB

“Graph” Analytics in Memory

• Created at Carnegie Mellon in 2009

• Generic graph traversal framework

• Packaged machine learning:
- Recommender systems
- Graph analytics
- Clustering

• Easy Python Integration

LEARNING > H2O

In-Memory Distributed Prediction Engine

Machine learning:
- Classification
- Regression
- Clustering

- Easy R/Python integration

ALL > SPARK

Real-Time Resilient Distributed Memory Framework

• Abstraction supporting any DAG of operations on data:
- Filter
- Map
- Reduce
- Cache
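
The operations listed above can be illustrated with a toy, single-process model of a lazily evaluated chain of operations. This is not the Spark API: `LazyDataset` and its methods are hypothetical names chosen to mirror the slide, and the sketch only shows two ideas, that nothing runs until a reduce is requested and that a cached intermediate result is reused.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for a distributed dataset: records operations, runs them on demand."""

    def __init__(self, source, parent=None, op=None):
        self._source = source  # callable returning an iterable (root node only)
        self._parent = parent  # upstream node in the chain
        self._op = op          # function: iterable -> iterable
        self._cached = None
        self._cache_enabled = False

    def _derive(self, op):
        return LazyDataset(None, parent=self, op=op)

    def filter(self, predicate):
        return self._derive(lambda rows: (r for r in rows if predicate(r)))

    def map(self, func):
        return self._derive(lambda rows: (func(r) for r in rows))

    def cache(self):
        """Mark this node so its result is kept in memory after first use."""
        self._cache_enabled = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        if self._parent is None:
            rows = self._source()
        else:
            rows = self._op(self._parent._compute())
        if self._cache_enabled:
            rows = self._cached = list(rows)
        return rows

    def reduce(self, func):
        """Trigger the computation and fold the rows into one value."""
        return fold(func, self._compute())


reads = []


def load():
    reads.append(1)  # count how many times the source is actually read
    return [1, 2, 3, 4, 5, 6]


evens = LazyDataset(load).filter(lambda x: x % 2 == 0).cache()
total = evens.map(lambda x: x * 10).reduce(lambda a, b: a + b)  # 120
largest = evens.reduce(max)                                     # 6
print(total, largest, len(reads))  # 120 6 1
```

The source is read once even though two results are computed, because the second reduce reuses the cached filter output.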

SPARK AND ITS ECOSYSTEM

SHARK: Real-Time Queries

STREAMING: Real-Time Updates

MLBASE: In-Memory Learning

SPARK

THE WHOLE PICTURE

[Diagram: components of the ecosystem] HDFS, YARN, MAP REDUCE, SPARK, GRAPHLAB, H2O, STREAMING, MLBASE, SHARK, PIG, HIVE, CASCADING, STORM, DRILL, IMPALA, other storage

THANK YOU !

dataiku.com

DATAIKU STAND A4

DEMO

DATA SCIENCE STUDIO

Questions now

or later

florian.douetteau@dataiku.com
