hadoop2, spark - big data paris 2020 cedric carbone.pdf · hadoop2, spark big data, real time,...

Cédric Carbone Twitter : @carbone Hadoop2, Spark Big Data, real time, machine learning & use cases


Post on 28-May-2020


TRANSCRIPT

Page 1: Hadoop2, Spark - Big Data Paris 2020 Cedric CARBONE.pdf · Hadoop2, Spark Big Data, real time, machine learning & use cases . Agenda •Map Reduce ... engine for large-scale data

Cédric Carbone Twitter : @carbone

Hadoop2, Spark Big Data, real time, machine learning & use cases

Page 2

Agenda

• Map Reduce

• Hadoop v1 limits

• Hadoop v2 and YARN

• Apache Spark

• Streaming : Spark vs Storm

• Machine Learning : Recommender System

• Use Case : Next Product To Buy

• Q&A

Page 3

What’s Hadoop

• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

• Java framework for storing data and running data transformations on large clusters of commodity hardware

• Licensed under the Apache v2 license

• Based on ideas from Google's MapReduce, BigTable and Google File System (GFS) papers

Page 4

HDFS : Distributed Storage

A distributed, scalable, portable and reliable file system for the Hadoop framework.

Metadata / data separation:

• Name Nodes (metadata)

• Data Nodes (data)

Page 5

Map Reduce

• Map() : parses inputs and generates 0 to n <key, value> pairs

• Reduce() : combines all values of the same key (a sum, in the example below) and generates a <key, value> pair

WordCount Example

• Each map takes a line as input and breaks it into words
  – It emits a key/value pair of the word and 1

• Each reducer sums the counts for each word
  – It emits a key/value pair of the word and the sum
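
The WordCount steps above can be sketched in plain Scala, without any Hadoop dependency. This is only an illustration under stated assumptions: the names `WordCountSketch`, `mapper`, `reducer` and `wordCount` are hypothetical, and the in-memory `groupBy` stands in for the shuffle that the framework would perform between the map and reduce phases on a real cluster.

```scala
object WordCountSketch {
  // Map phase: one input line in, zero or more (word, 1) pairs out.
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce phase: all counts emitted for one word in, one (word, sum) pair out.
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Driver standing in for the framework (an assumption of this sketch):
  // run the maps, group the pairs by key (the "shuffle"), then run the reduces.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(mapper)
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => reducer(word, pairs.map(_._2)) }
}
```

Calling `WordCountSketch.wordCount(Seq("to be or not", "to be"))` counts "to" and "be" twice and "or" and "not" once.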

Page 6

Map Reduce

Data Node 1

Data Node 2

Page 7

Map Reduce

Page 8

Map Reduce

Page 9

Map Reduce

Page 10

Map Reduce

Page 11

Hadoop MapReduce v1

Page 12

Hadoop MapReduce v1

Page 13

Hadoop MapReduce v1

Page 14

Hadoop MapReduce v1

Not good for low-latency jobs on small datasets

Page 15

Hadoop MapReduce v1

Good for off-line batch jobs on massive data

Page 16

Hadoop 1

• Batch ONLY

– High latency jobs

Stack, from the storage layer up:

• HDFS (Redundant, Reliable Storage)

• MapReduce1 (Cluster Resource Management + Data Processing): BATCH

• Hive (Query), Pig (Scripting), Cascading (Accelerate Dev.)

Page 17

Hadoop2 : Big Data Operating System

• Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS

– Simultaneously & with predictable levels of service

– Data analysts and real-time applications

Stack, from the storage layer up:

• HDFS (Redundant, Reliable Storage)

• YARN (Cluster Resource Management)

• MapReduce1 (Data Processing, BATCH) alongside Other Data Processing

Page 18

Hadoop2 : Big Data Operating System

• Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS

– Simultaneously & with predictable levels of service

– Data analysts and real-time applications

Stack, from the storage layer up:

• HDFS (Redundant, Reliable Storage)

• YARN (Cluster Resource Management)

• Workloads running on YARN:
  – BATCH (MapReduce)
  – INTERACTIVE (Tez)
  – STREAMING (Storm, Samza, Spark Streaming)
  – GRAPH (Giraph, GraphX)
  – MACHINE LEARNING (Spark MLlib)
  – IN-MEMORY (Spark)
  – ONLINE (HBase, HOYA)
  – OTHER (ElasticSearch)

Page 19

Stinger.next

Page 20

Stinger.next

Page 21

https://spark.apache.org

Apache Spark™ is a fast and general engine for large-scale data processing.

Page 22

The most active project

[Two bar charts comparing MapReduce, Storm, Yarn and Spark: "Patches" (axis 0 to 250) and "Lines Added" (axis 0 to 45000)]

Page 23

Spark won the Daytona GraySort contest!

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Sort on disk 100TB of data 3x faster than Hadoop MapReduce using 10x fewer machines.

Page 24

RDD & Operation

Resilient Distributed Datasets (RDDs)

Operations

➜ Transformations (e.g. map, filter, groupBy)

➜ Actions (e.g. count, collect, save)
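
The split between transformations and actions matters because, in Spark, transformations are lazy: they only describe a new dataset, and nothing is computed until an action asks for a result. The following is a toy sketch of that idea in plain Scala. `MiniRDD` and `LazinessDemo` are hypothetical names invented for this illustration; this is not how Spark implements RDDs (there is no partitioning, distribution or fault tolerance here).

```scala
// Toy stand-in for an RDD: it only remembers how to produce its elements.
final class MiniRDD[T](compute: () => Iterator[T]) {
  // Transformations: return a new MiniRDD describing the work, without running it.
  def map[U](f: T => U): MiniRDD[U] = new MiniRDD(() => compute().map(f))
  def filter(p: T => Boolean): MiniRDD[T] = new MiniRDD(() => compute().filter(p))

  // Actions: run the recorded pipeline and return a value to the caller.
  def count(): Long = compute().foldLeft(0L)((n, _) => n + 1)
  def collect(): List[T] = compute().toList
}

object LazinessDemo {
  // Returns (predicate calls before the action, calls after it, result of the action).
  def run(): (Int, Int, Long) = {
    var calls = 0
    val data = Seq("# Apache Spark", "Hadoop", "Spark is fast")
    val lines = new MiniRDD(() => data.iterator)
    val withSpark = lines.filter { line => calls += 1; line.contains("Spark") }
    val before = calls          // filter has only recorded the predicate so far
    val n = withSpark.count()   // the action walks the data and applies the predicate
    (before, calls, n)
  }
}
```

In this sketch the predicate passed to `filter` is not invoked until `count()` runs, which mirrors why chaining many transformations is cheap until an action is called.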

Page 25

Spark

scala> val textFile = sc.textFile("README.md")
➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count()
➜ res0: Long = 126

scala> textFile.first()
➜ res1: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
➜ linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4

scala> textFile.filter(line => line.contains("Spark")).count()
➜ res3: Long = 15

Page 26

Streaming


Page 27

Storm

Page 28

Storm

Page 29

Storm vs Spark

                        Spark Streaming   Storm             Storm Trident
Processing model        Micro batches     Record-at-a-time  Micro batches
Throughput              ++++              ++                ++++
Latency                 Second            Sub-second        Second
Reliability models      Exactly once      At least once     Exactly once
Embedded Hadoop distro  HDP, CDH, MapR    HDP               HDP
Support                 Databricks        N/A               N/A
Community               ++++              ++                ++

                        Spark                             Storm
Scope                   Batch, Streaming, Graph, ML, SQL  Streaming only
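
To make the "Processing model" row concrete, here is a minimal sketch in plain Scala that contrasts the two models on the same stream of events. It uses neither the Storm nor the Spark Streaming API; `ProcessingModels`, `Event`, `recordAtATime` and `microBatches` are hypothetical names, and the batch interval is an assumed parameter.

```scala
object ProcessingModels {
  final case class Event(timestampMs: Long, word: String)

  // Record-at-a-time: each event is handled on its own as it arrives,
  // producing one output per event.
  def recordAtATime(events: Seq[Event]): Seq[(String, Int)] =
    events.map(e => (e.word, 1))

  // Micro-batch: events are first grouped into fixed time windows,
  // then each window is processed as a small batch job (here, a word count).
  def microBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Map[String, Int])] =
    events
      .groupBy(_.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batch, es) =>
        (batch, es.groupBy(_.word).map { case (w, ws) => (w, ws.size) })
      }
}
```

The sketch also hints at the latency row of the table: a micro-batch result cannot be emitted before its window closes, so latency is bounded below by the batch interval, while record-at-a-time processing can react to each event individually.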

Page 30

Machine Learning Library (MLlib)

Page 31

Collaborative Filtering

Page 32

Collaborative Filtering (learning)

Page 33

Collaborative Filtering (learning)

Page 34

Collaborative Filtering (learning)

Page 35

Collaborative Filtering : Let’s use the model

Page 36

Collaborative Filtering : similar behaviors

Page 37

Collaborative Filtering Prediction

Page 38

Netflix Prize (2009)

Netflix is a provider of on-demand Internet streaming media

Page 39

Input Data

UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
Etc…
2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620
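
A small sketch of how lines in this `UserID::MovieID::Rating::Timestamp` format could be parsed in plain Scala before being handed to a recommender. The names `RatingParser`, `Rating` and `parse` are hypothetical, and the field types (integer IDs, a numeric rating, a timestamp in seconds) are assumptions read off the sample rows above.

```scala
import scala.util.Try

object RatingParser {
  final case class Rating(userId: Int, movieId: Int, rating: Double, timestamp: Long)

  // Parses one "UserID::MovieID::Rating::Timestamp" line.
  // Returns None for anything that does not have four well-formed fields.
  def parse(line: String): Option[Rating] =
    line.trim.split("::") match {
      case Array(u, m, r, t) =>
        Try(Rating(u.toInt, m.toInt, r.toDouble, t.toLong)).toOption
      case _ => None
    }
}
```

Returning an `Option` rather than throwing lets a caller drop malformed lines with something like `lines.flatMap(RatingParser.parse)`.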

Page 40

Matrix Factorization
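
The slide gives only the name of the technique, so the following is a hedged sketch of the general idea rather than the talk's actual code: each user and each movie gets a short vector of latent factors, and a rating is predicted as the dot product of the two vectors. Spark MLlib fits such models with alternating least squares (ALS); this sketch swaps in plain stochastic gradient descent because it is shorter to write. All names (`MatrixFactorizationSketch`, `Model`, `init`, `train`) and all hyperparameters are assumptions made for the illustration.

```scala
import scala.util.Random

object MatrixFactorizationSketch {
  final case class Rating(user: Int, item: Int, value: Double)

  final case class Model(userFactors: Array[Array[Double]], itemFactors: Array[Array[Double]]) {
    // Predicted rating = dot product of the user's and the item's latent vectors.
    def predict(user: Int, item: Int): Double =
      userFactors(user).zip(itemFactors(item)).map { case (p, q) => p * q }.sum

    // Root-mean-square error of the predictions on a set of known ratings.
    def rmse(ratings: Seq[Rating]): Double =
      math.sqrt(ratings.map { r =>
        val e = r.value - predict(r.user, r.item)
        e * e
      }.sum / ratings.size)
  }

  // Small random starting factors; the seed makes runs reproducible.
  def init(numUsers: Int, numItems: Int, rank: Int, seed: Long): Model = {
    val rnd = new Random(seed)
    def factors(n: Int) = Array.fill(n, rank)(0.1 + 0.1 * rnd.nextDouble())
    Model(factors(numUsers), factors(numItems))
  }

  // SGD on squared error with L2 regularisation; updates the model's arrays in place.
  def train(model: Model, ratings: Seq[Rating], epochs: Int, lr: Double, reg: Double): Model = {
    val rank = model.userFactors.head.length
    for (_ <- 1 to epochs; r <- ratings) {
      val err = r.value - model.predict(r.user, r.item)
      val p = model.userFactors(r.user)
      val q = model.itemFactors(r.item)
      for (k <- 0 until rank) {
        val pk = p(k)
        p(k) += lr * (err * q(k) - reg * pk)
        q(k) += lr * (err * pk - reg * q(k))
      }
    }
    model
  }
}
```

Once trained, `predict` can be called for user and movie pairs that have no rating yet, which is what turns the factorization into a recommender.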

Page 41

The result

1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19
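
The listing appears to hold, for each user, the movies with the highest predicted ratings, sorted from best to worst. Assuming that reading is right, the selection step can be sketched in plain Scala as a group-by on the user followed by a sort and a cut-off. `TopN`, `Prediction` and `topN` are hypothetical names, and the scores in the usage note are made-up values.

```scala
object TopN {
  final case class Prediction(userId: Int, movieId: Int, score: Double)

  // For each user, keep the n predictions with the highest score, best first.
  def topN(predictions: Seq[Prediction], n: Int): Map[Int, Seq[Prediction]] =
    predictions
      .groupBy(_.userId)
      .map { case (user, ps) => (user, ps.sortWith(_.score > _.score).take(n)) }
}
```

For example, given predictions of 4.6, 4.5 and 4.7 for movies 858, 318 and 527 for user 1, `TopN.topN(predictions, 2)` keeps movies 527 and 858 for that user, in that order.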

Page 42

Real Time Big Data Use Case Next Gen Data Marketing Platform

Next Product To Buy

Page 43

Ready for Omni-channel? Traditional marketing

Current approach cannot keep up…

• 200m people on Do Not Call list

• 99.9% of online banners are never clicked.

• 44% of direct marketing is never opened.

• 86% of TV viewers skip commercials

• Buyers complete 60% of their research before reaching out to vendors.

Source: “2013 Definitive Guide to Social Marketing” - Marketo.

Page 44

Statement

[Timeline: 2000, 2010, 2013, 2015. Stages: Multi Channel, Cross Channel, Omni Channel, Consumer Graph]

Page 45

Next Product to Buy in Action

Open data

Premium data

1

Page 46

Next Product to Buy in Action

ERP

CRM Loyalty

Brand data

Open data

Premium data

1

Page 47

Next Product to Buy in Action

CRM Loyalty

ERP Brand data

Open data

Premium data

2

Page 48

Next Product to Buy in Action

CRM Loyalty

ERP Brand data

Open data

Premium data

3

Page 49

Next Product to Buy in Action

CRM Loyalty

ERP Brand data

Open data

Premium data

4

Page 50

Next Product to Buy in Action

CRM Loyalty

ERP Brand data

Open data

Premium data

4

Page 51

Next Product to Buy in Action

CRM Loyalty

ERP Brand data

Open data

Premium data

4

Page 52

Next Product to Buy in Action

CRM Loyalty

ERP Brand data

Open data

Premium data

5

Page 53

Brand Premium Open Social Influans

Sales

Social Interactions

Graph

Fine Tune

Engage

OnBoard

Suggest


Page 54

Real Time Big Data Use Case Next Gen Data Marketing Platform

Next Product To Buy

➜ Right Person

➜ Right Product

➜ Right Price

➜ Right Time

➜ Right Channel

Page 55

Cédric Carbone

[email protected]

@carbone

www.hugfrance.fr

[email protected]

@hugfrance

Questions?

We graph consumers