Source: files.meetup.com/18404940/Big Data Reference Architecture - Reza Rokni.pdf
Google Cloud Platform Reference Architecture (Streaming) Reza Rokni


Post on 21-May-2020


Introduction

Data ...

GB's

... can be Big

TB's

... really really big! ... but at least always batch?

Tuesday, Wednesday, Thursday

PB's

... well ... but at least it's on time ...

[Figure: timeline with hourly ticks from 1:00 to 14:00]

... it doesn't even have the courtesy to be on time!

[Figure: timeline with hourly ticks from 8:00 to 14:00, plus three items labeled 8:00]


Google confidential │ Do not distribute

Processes

Let’s process some data: Reference Architecture

1,000,000's sec

10 sec

Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
CloudML: Machine Learning
File
Cloud Storage: Object Store
Exports
Cloud Dataproc: Managed Spark, Hadoop

Let’s process some data: Reference Architecture

1,000,000's sec

100 sec

Cloud Pub/Sub: Async Messaging
Cloud Storage: Object Store


Capture

• Globally redundant
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
• 10 MB message size
• 7 days storage for unacknowledged messages

[Diagram: Cloud Pub/Sub. Publishers A, B, and C send Messages 1, 2, and 3 through the Cloud Pub/Sub API. Topics A, B, and C feed Subscriptions XA, XB, YC, and ZC, which deliver to Subscribers X, Y, and Z. Message 3 appears twice among the delivered messages.]
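The fan-out in the Cloud Pub/Sub diagram can be sketched in a few lines: every subscription attached to a topic receives its own copy of each message published to that topic. The sketch below is a plain in-memory model, not the Cloud Pub/Sub client library. The class `TinyPubSub` and its methods are hypothetical, pulling is treated as acknowledging, and attaching both YC and ZC to Topic C is an assumption read off the subscription names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of topic/subscription fan-out.
// This is NOT the Cloud Pub/Sub client API.
final class TinyPubSub {
    // topic name -> names of the subscriptions attached to it
    private final Map<String, List<String>> subscriptionsByTopic = new HashMap<>();
    // subscription name -> messages delivered to it but not yet pulled
    private final Map<String, Deque<String>> backlog = new HashMap<>();

    void createTopic(String topic) {
        subscriptionsByTopic.putIfAbsent(topic, new ArrayList<>());
    }

    void createSubscription(String topic, String subscription) {
        createTopic(topic);
        subscriptionsByTopic.get(topic).add(subscription);
        backlog.put(subscription, new ArrayDeque<>());
    }

    // Publishing copies the message into the backlog of every subscription on the topic.
    void publish(String topic, String message) {
        for (String subscription : subscriptionsByTopic.getOrDefault(topic, new ArrayList<>())) {
            backlog.get(subscription).addLast(message);
        }
    }

    // Pulling drains one subscription only; other subscriptions keep their copies.
    // Simplification: a pulled message counts as acknowledged.
    List<String> pull(String subscription) {
        Deque<String> queue = backlog.get(subscription);
        List<String> delivered = new ArrayList<>(queue);
        queue.clear();
        return delivered;
    }

    public static void main(String[] args) {
        TinyPubSub bus = new TinyPubSub();
        bus.createSubscription("Topic C", "Subscription YC");
        bus.createSubscription("Topic C", "Subscription ZC");
        bus.publish("Topic C", "Message 3");
        // Both subscriptions see Message 3, which is why it appears twice in the diagram.
        System.out.println(bus.pull("Subscription YC"));
        System.out.println(bus.pull("Subscription ZC"));
    }
}
```

Modeling the backlog per subscription rather than per topic is what lets Subscriber Y and Subscriber Z consume the same message independently.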

Let’s process some data: Reference Architecture

1,000,000's sec

10 sec

Cloud Pub/Sub: Async Messaging
Cloud Dataflow: Parallel data processing
File
Cloud Storage: Object Store

Google Cloud Dataflow (Apache Beam)

[Logos: Apache Beam (incubating), Google Cloud Dataflow]

Dataflow explained

Extra Reading: FlumeJava combined with MillWheel

FlumeJava - The What not the How

TextIO.Read(MarketData)

ParDo(enrichData(bidsize, ask, bid, trade))

ParDo(filterData(bidsize > x))

BigQueryIO.Write

Code shown is pseudocode only

MillWheel - Framework for low latency data processing

Optimizer fusion

[Diagram: two fusion cases, consumer-producer and sibling; in each, steps C and D are fused into a single step C+D]
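The two fusion shapes named on this slide can be illustrated without any SDK. In consumer-producer fusion, D consumes the output of C, so the two can run as one step with no intermediate collection. In sibling fusion, C and D read the same input, so a single pass can feed both. The sketch below is a plain-Java illustration under those assumptions; `FusionSketch` and its method names are hypothetical and this is not the Dataflow optimizer itself.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the two fusion shapes; not the Dataflow optimizer.
final class FusionSketch {

    // Unfused consumer-producer: C writes an intermediate collection that D then reads.
    static <A, B, R> List<R> runSeparately(List<A> input, Function<A, B> c, Function<B, R> d) {
        List<B> intermediate = new ArrayList<>();
        for (A element : input) {
            intermediate.add(c.apply(element));
        }
        List<R> output = new ArrayList<>();
        for (B element : intermediate) {
            output.add(d.apply(element));
        }
        return output;
    }

    // Consumer-producer fusion: C and D become one step "C+D", one pass, no intermediate list.
    static <A, B, R> List<R> runFused(List<A> input, Function<A, B> c, Function<B, R> d) {
        Function<A, R> cPlusD = c.andThen(d);
        List<R> output = new ArrayList<>();
        for (A element : input) {
            output.add(cPlusD.apply(element));
        }
        return output;
    }

    // Sibling fusion: C and D both read the same input, so one pass produces both outputs.
    static <A, X, Y> Map.Entry<List<X>, List<Y>> runSiblingsFused(
            List<A> input, Function<A, X> c, Function<A, Y> d) {
        List<X> cOutput = new ArrayList<>();
        List<Y> dOutput = new ArrayList<>();
        for (A element : input) {
            cOutput.add(c.apply(element));
            dOutput.add(d.apply(element));
        }
        return new SimpleImmutableEntry<>(cOutput, dOutput);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        Function<Integer, Integer> c = x -> x + 1;
        Function<Integer, Integer> d = x -> x * 2;
        // The fused and unfused runs produce the same result; only the number of passes differs.
        System.out.println(runSeparately(input, c, d));
        System.out.println(runFused(input, c, d));
        System.out.println(runSiblingsFused(input, c, d));
    }
}
```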

Dynamic Worker Optimization

100 mins. vs. 65 mins.

Building a clickstream processing pipeline

● In this example we will
  ○ Read data from Pub/Sub
  ○ Window and aggregate the data
  ○ Do something programmatically with the data

[Pipeline diagram stages: Stream, Parse Message, BigQuery, Window, Count, Detect Anomaly, BigQuery]

STEP 1 - Transport

[Pipeline diagram stages: Clickstream, Batch Read, Parse Message, BigQuery]

Pipeline p = Pipeline.create();

// Batch Read
PCollection<String> dataCollection = p.begin().apply(TextIO.Read.from("gs://…"));

// Parse Message; the slide shows two ParDo steps next to it:
//   ParDo.of(new TokenizesMessage())
//   ParDo.of(new CreateRecords())
dataCollection.apply(new ParseMessage())
    // BigQuery
    .apply(BigQueryIO.Write.to(...));

Code shown is pseudocode only

STEP 2 - Detect

[Pipeline diagram stages: Clickstream, Batch Read, Parse Message, BigQuery, Window, Count, Detect Anomaly, BigQuery]

Pipeline p = Pipeline.create();

p.begin()
    // ... read and parse as in STEP 1 ...
    // Window: 60-second fixed windows
    .apply(Window.<Record>into(FixedWindows.of(Duration.standardSeconds(60))))
    .apply(ParDo.of(new CreateEventKey()))
    // Count
    .apply(Count.perKey())
    // Detect Anomaly
    .apply(ParDo.of(new DetectAnomaly()));

Code shown is pseudocode only
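What STEP 2 computes can be sketched without the SDK: assign each event to a 60-second fixed window by its timestamp, count events per window and key, then flag counts that look unusual. The slides do not say how DetectAnomaly decides, so the fixed threshold below is purely an assumption, and `WindowedCountSketch`, `Event`, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java sketch of window -> key -> count -> detect.
final class WindowedCountSketch {
    static final long WINDOW_MILLIS = 60_000L;

    static final class Event {
        final String key;
        final long timestampMillis;

        Event(String key, long timestampMillis) {
            this.key = key;
            this.timestampMillis = timestampMillis;
        }
    }

    // Start of the fixed window containing the timestamp (non-negative timestamps assumed).
    static long windowStart(long timestampMillis) {
        return timestampMillis - (timestampMillis % WINDOW_MILLIS);
    }

    // Window start -> key -> number of events with that key in that window.
    static Map<Long, Map<String, Long>> countPerWindowAndKey(List<Event> events) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        for (Event event : events) {
            counts.computeIfAbsent(windowStart(event.timestampMillis), w -> new TreeMap<>())
                  .merge(event.key, 1L, Long::sum);
        }
        return counts;
    }

    // Assumed anomaly rule: a key is anomalous in a window if its count exceeds the threshold.
    static List<String> detectAnomalies(Map<Long, Map<String, Long>> counts, long threshold) {
        List<String> anomalies = new ArrayList<>();
        for (Map.Entry<Long, Map<String, Long>> window : counts.entrySet()) {
            for (Map.Entry<String, Long> perKey : window.getValue().entrySet()) {
                if (perKey.getValue() > threshold) {
                    anomalies.add(perKey.getKey() + "@" + window.getKey());
                }
            }
        }
        return anomalies;
    }

    public static void main(String[] args) {
        List<Event> clicks = Arrays.asList(
                new Event("/home", 1_000L),
                new Event("/home", 2_000L),
                new Event("/cart", 5_000L),
                new Event("/home", 59_999L),
                new Event("/home", 60_000L)); // falls into the next window
        Map<Long, Map<String, Long>> counts = countPerWindowAndKey(clicks);
        System.out.println(counts);
        System.out.println(detectAnomalies(counts, 2L));
    }
}
```

In a real pipeline the grouping by window and key is done by the runner; the sketch only shows which elements end up counted together.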

STEP 3 - Stream

[Pipeline diagram stages: Stream, Parse Message, BigQuery, Window, Count, Detect Anomaly, BigQuery]

Pipeline p = Pipeline.create();

p.begin()

// Batch source used in STEP 1:
.apply(TextIO.Read.from("gs://…"))

// Streaming source (the diagram's first stage is now Stream rather than Batch Read):
.apply(PubsubIO.Read.topic(...))

// Pub/Sub sink:
.apply(PubsubIO.Write.topic(...))

Code shown is pseudocode only

Data Processing Tradeoffs

Completeness: 1 + 1 = 2
Latency
Cost: $$$

Requirements: Billing Pipeline

[Chart: Completeness, Low Latency, and Low Cost, each rated on a scale from Not Important to Important]

Requirements: Live Cost Estimate Pipeline

[Chart: Completeness, Low Latency, and Low Cost, each rated on a scale from Not Important to Important]

Requirements: Abuse Detection Pipeline

[Chart: Completeness, Low Latency, and Low Cost, each rated on a scale from Not Important to Important]

Requirements: Abuse Detection Backfill Pipeline

[Chart: Completeness, Low Latency, and Low Cost, each rated on a scale from Not Important to Important]


Dataflow explained

Inherent issues when dealing with streams


Watermarks

Watermark triggers

PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

Code shown is pseudocode only

Approximate Triggers

PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

Code shown is pseudocode only
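To make the early, on-time, and late firings concrete, here is a small plain-Java simulation of a single window for a single key. It is a sketch under several assumptions that the slides do not state: panes accumulate (each firing reports the running sum), early firings happen on aligned processing-time periods and only when new data has arrived, the moment the watermark passes the end of the window is given as an input, and every element is taken to belong to the window. `TriggerSketch`, `Element`, and `panes` are hypothetical names, not SDK classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of early / on-time / late panes for one window and one key.
final class TriggerSketch {
    static final class Element {
        final int value;
        final long arrivalMillis; // processing time at which the element reaches the pipeline

        Element(int value, long arrivalMillis) {
            this.value = value;
            this.arrivalMillis = arrivalMillis;
        }
    }

    // watermarkPassMillis: processing time at which the watermark passes the end of the window.
    // earlyPeriodMillis: period of the early firings; must be positive.
    static List<String> panes(List<Element> elements, long watermarkPassMillis, long earlyPeriodMillis) {
        List<Element> byArrival = new ArrayList<>(elements);
        byArrival.sort(Comparator.comparingLong(e -> e.arrivalMillis));

        List<String> fired = new ArrayList<>();
        int sum = 0;
        boolean unfiredData = false;
        boolean onTimeFired = false;
        long nextEarly = earlyPeriodMillis;

        // The extra iteration at the end drains the timers still pending up to the watermark.
        for (int i = 0; i <= byArrival.size(); i++) {
            long now = i < byArrival.size() ? byArrival.get(i).arrivalMillis : watermarkPassMillis;
            // Fire every timer that is due at or before 'now'.
            while (!onTimeFired && Math.min(nextEarly, watermarkPassMillis) <= now) {
                if (nextEarly < watermarkPassMillis) {
                    if (unfiredData) {
                        fired.add("EARLY " + sum);
                    }
                    unfiredData = false;
                    nextEarly += earlyPeriodMillis;
                } else {
                    fired.add("ON_TIME " + sum);
                    onTimeFired = true;
                }
            }
            if (i == byArrival.size()) {
                break;
            }
            sum += byArrival.get(i).value;
            if (onTimeFired) {
                fired.add("LATE " + sum); // one late pane per late element, like AtCount(1)
            } else {
                unfiredData = true;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Element> scores = Arrays.asList(
                new Element(5, 10_000L),
                new Element(7, 70_000L),
                new Element(3, 140_000L),
                new Element(9, 200_000L)); // arrives after the watermark has passed
        System.out.println(panes(scores, 150_000L, 60_000L));
    }
}
```

Passing an early period longer than the watermark delay reduces the simulation to the plain watermark trigger: a single on-time pane, followed by late panes if any data arrives afterwards.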

Requirements: Live Cost Estimate Pipeline

[Chart: Completeness, Low Latency, and Low Cost, each rated on a scale from Not Important to Important]

[Diagram: GCP Managed Service with Job Manager and Work Manager; User Code & SDK; Monitoring UI; connections labeled "Deploy & Schedule" and "Progress & Logs"]

Let’s process some data: Reference Architecture

1,000,000's sec

10 sec

Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
File
Cloud Storage: Object Store
Exports

Pipeline Consumers

BigQuery or Bigtable ... or both?

Massive Scale NoSQL: NoSQL Database Service
BigQuery: Analytics Engine

Let’s process some data: Reference Architecture

1,000,000's sec

10 sec

Cloud Pub/Sub: Async Messaging
Massive Scale NoSQL: NoSQL Database Service
Cloud Dataflow: Parallel data processing
BigQuery: Analytics Engine
CloudML: Machine Learning
File
Cloud Storage: Object Store
Exports
Cloud Dataproc: Managed Spark, Hadoop

Machine Learning

CloudML - Data pre-processing stages

If machine learning is the new rocket ship...

Data is the fuel!

Let’s process some data: CloudML - APIs

Speech API
Vision API

It is well known that a vital ingredient of success is not knowing that what you're attempting can't be done

Terry Pratchett