Post on 20-May-2020

Beyond MapReduce, Beyond Lambda

Easy, unified, reliable processing for stream and batch

William Vambenepe

@vambenepe

Lead Product Manager for Big Data on Google Cloud Platform

http://research.google.com/archive/mapreduce.html

http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf

http://research.google.com/pubs/pub41378.html

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

The Lambda Architecture

[Timeline figure, 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Pregel, Flume, Colossus, Spanner, MillWheel, leading to Google Cloud Dataflow]

Event Time - When Events Happened

Stream Time - When Events Are Processed

Batch vs Streaming

MapReduce

Batch

[Figure: repeated MapReduce runs over hourly batches of input, with labels ranging from [10:00 - 11:00) to [23:00 - 0:00)]

Batch: Fixed Windows

[Figure: MapReduce output divided into fixed windows [10:00 - 11:00) and [11:00 - 12:00)]

Batch: User Sessions

[Figure: per-user session rows for Joan, Larry, Ingo, Amanda, Cheryl, and Arthur, spanning the [10:00 - 11:00) and [11:00 - 12:00) batches]

Streaming

[Figure: stream timeline with hourly ticks from 10:00 to 16:00]

Unordered

Unbounded

Of Varying Event Time Skew

Confounding characteristics of data streams

Event Time Skew

[Figure: Stream Time (vertical axis) plotted against Event Time (horizontal axis), with the Skew region marked]
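As a rough illustration of the skew on this slide, the sketch below assumes each record carries both an event timestamp and a stream (processing) timestamp. The function name `skew` and the sample records are hypothetical and only serve to show the subtraction.

```python
from datetime import datetime, timedelta

def skew(event_time: datetime, stream_time: datetime) -> timedelta:
    """Event time skew: how long after it happened an event was processed."""
    return stream_time - event_time

# Hypothetical records as (event_time, stream_time) pairs.
records = [
    (datetime(2015, 1, 1, 10, 5), datetime(2015, 1, 1, 10, 6)),
    (datetime(2015, 1, 1, 10, 1), datetime(2015, 1, 1, 10, 30)),  # arrived late
    (datetime(2015, 1, 1, 10, 20), datetime(2015, 1, 1, 10, 21)),
]

# The skew varies from record to record, which is what makes
# stream-time reasoning about event-time questions unreliable.
skews = [skew(event, stream) for event, stream in records]
```

The second record is processed after the third even though it happened first, so the stream is both unordered and of varying skew.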

Approaches

1. Time-Agnostic Processing

2. Approximation

3. Stream Time Windowing

4. Event Time Windowing

Approaches to reasoning about time

1. Time-Agnostic Processing - Filters
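A filter is time-agnostic because each element is kept or dropped on its own content, whatever its timestamp or arrival order. The sketch below is plain Python rather than the Dataflow SDK; `stream_filter` and the sample events are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def stream_filter(events: Iterable[T], keep: Callable[[T], bool]) -> Iterator[T]:
    """Yield each event that passes the predicate, one at a time.

    No state is kept between events, so the input may be unbounded
    and may arrive in any order.
    """
    for event in events:
        if keep(event):
            yield event

# Hypothetical events, deliberately out of event-time order.
events = [
    {"time": "12:00", "source": "web"},
    {"time": "10:00", "source": "mobile"},
    {"time": "11:00", "source": "web"},
]
web_only = list(stream_filter(events, lambda e: e["source"] == "web"))
```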

[Figure: Stream Time axis, 10:00 to 16:00]

1. Time-Agnostic Processing - Hash Join
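A streaming hash join can also ignore time: each arriving element probes the table built from the other input, then is stored in its own table. The following is a minimal sketch in plain Python, assuming an interleaved input of `(side, key, value)` tuples; that input shape and the function name are illustrative assumptions, not the Dataflow API. State grows without bound here, which a real system would have to address.

```python
from collections import defaultdict

def streaming_hash_join(events):
    """Join two interleaved streams on key, emitting matches as they appear.

    events: iterable of (side, key, value), with side "left" or "right".
    Yields (key, left_value, right_value).
    """
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, key, value in events:
        other = "right" if side == "left" else "left"
        # Probe the opposite side's table for earlier arrivals with this key.
        for match in tables[other].get(key, ()):
            yield (key, value, match) if side == "left" else (key, match, value)
        # Remember this element so later arrivals on the other side can match it.
        tables[side][key].append(value)

# Hypothetical interleaved input.
events = [
    ("left", "u1", "click"),
    ("right", "u2", "DE"),
    ("right", "u1", "US"),
    ("left", "u2", "view"),
]
joined = list(streaming_hash_join(events))
```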

[Figure: Stream Time axis, 10:00 to 16:00]

2. Approximation via Online Algorithms
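The slide does not name a specific online algorithm. As one example of the family, the sketch below uses the Misra-Gries frequent-items summary, which processes an unbounded stream in one pass with at most `k - 1` counters. Any item occurring more than `n / k` times in a stream of length `n` is guaranteed to survive, and its reported count is an underestimate.

```python
def misra_gries(stream, k):
    """Approximate frequent items using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# "a" occurs 5 times out of 8, which is more than n / k = 8 / 2.
stream = ["a", "b", "a", "c", "a", "d", "a", "a"]
heavy = misra_gries(stream, k=2)
```

The trade-off is the usual one for this approach: bounded memory and low latency in exchange for approximate counts.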

[Figure: two Stream Time axes, each 10:00 to 16:00]

3. Windowing by Stream Time
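Windowing by stream time groups elements by when they arrive, so the event timestamp plays no part. Here is a small sketch under that assumption, with records given as hypothetical `(stream_time, event_time)` pairs and helper names invented for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def window_start(ts, size):
    """Round a timestamp down to the start of its fixed window."""
    return ts - (ts - EPOCH) % size

def count_by_stream_time(records, size):
    """Count records per window of arrival time; event time is ignored."""
    counts = Counter()
    for stream_time, _event_time in records:
        counts[window_start(stream_time, size)] += 1
    return dict(counts)

day = datetime(2015, 1, 1)
records = [
    # Happened at 09:58 but arrived at 10:05, so it lands in the 10:00 window.
    (day.replace(hour=10, minute=5), day.replace(hour=9, minute=58)),
    (day.replace(hour=10, minute=59), day.replace(hour=10, minute=58)),
    (day.replace(hour=11, minute=0), day.replace(hour=10, minute=59)),
]
counts = count_by_stream_time(records, timedelta(hours=1))
```

This is simple and always complete when the window closes, but the first and third records end up in windows that do not match when they happened.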

[Figure: Event Time axis and Stream Time axis, each 10:00 to 16:00]

4. Windowing by Event Time - Fixed Windows
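With event-time windowing, each element is assigned to a window by when it happened, so arrival order stops mattering. The sketch below assumes records are hypothetical `(event_time, value)` pairs and sums values per fixed window; it is plain Python, not the Dataflow windowing API.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def event_time_window(event_time, size):
    """Return the [start, end) fixed window containing event_time."""
    start = event_time - (event_time - EPOCH) % size
    return (start, start + size)

def sum_by_event_time(records, size):
    """Sum values per event-time window, whatever order records arrive in."""
    totals = defaultdict(int)
    for event_time, value in records:
        totals[event_time_window(event_time, size)] += value
    return dict(totals)

day = datetime(2015, 1, 1)
hour = timedelta(hours=1)
in_order = [
    (day.replace(hour=10, minute=15), 1),
    (day.replace(hour=10, minute=45), 2),
    (day.replace(hour=11, minute=5), 4),
]
out_of_order = [in_order[2], in_order[0], in_order[1]]
totals = sum_by_event_time(out_of_order, hour)
```

The result is the same for both orderings. What this batch-style sketch leaves open is when a window can be considered complete on an unbounded stream.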

[Figure: Event Time axis and Stream Time axis, each 10:00 to 16:00]

4. Windowing by Event Time - Sessions
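Session windows are data-driven: a session is a run of one user's events with no gap longer than some timeout. The sketch below works on a bounded, unordered collection and assumes a 30-minute gap; the gap value, the function name, and the sample events are illustrative assumptions. A streaming implementation would instead merge sessions incrementally as late events arrive.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(events, gap):
    """Group (user, event_time) pairs into per-user sessions.

    Returns {user: [(session_start, session_end), ...]}.
    """
    by_user = defaultdict(list)
    for user, event_time in events:
        by_user[user].append(event_time)

    sessions = {}
    for user, times in by_user.items():
        times.sort()  # order by event time, not arrival order
        spans = [[times[0], times[0]]]
        for ts in times[1:]:
            if ts - spans[-1][1] <= gap:
                spans[-1][1] = ts      # extend the current session
            else:
                spans.append([ts, ts])  # gap exceeded: start a new session
        sessions[user] = [(start, end) for start, end in spans]
    return sessions

day = datetime(2015, 1, 1)
events = [
    ("Joan", day.replace(hour=10, minute=50)),
    ("Larry", day.replace(hour=10, minute=2)),
    ("Joan", day.replace(hour=10, minute=0)),
    ("Joan", day.replace(hour=10, minute=10)),
]
sessions = sessionize(events, gap=timedelta(minutes=30))
```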

Dataflow API

What are you computing?

Where in event time?

When in stream time?

What = Aggregation API

Where = Windowing API

When = Watermarks + Triggers API
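To make the three questions concrete, here is a toy model in plain Python. It is not the Dataflow SDK: the class `WindowedSum`, its methods, and the policy of dropping late data are all assumptions made for this sketch. Times are integer minutes since midnight to keep the arithmetic visible.

```python
class WindowedSum:
    """Toy model: sum (What) per fixed event-time window (Where),
    emitted once the watermark passes the window end (When)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.open_sums = {}   # window start -> running sum
        self.watermark = 0    # estimate of how far event time has progressed

    def add(self, event_time, value):
        start = event_time - event_time % self.window_size       # Where
        if start + self.window_size <= self.watermark:
            return False  # late: window already emitted, dropped in this sketch
        self.open_sums[start] = self.open_sums.get(start, 0) + value  # What
        return True

    def advance_watermark(self, watermark):
        """Move the watermark forward and emit every window it has passed."""
        self.watermark = max(self.watermark, watermark)           # When
        ready = sorted(s for s in self.open_sums
                       if s + self.window_size <= self.watermark)
        return [(s, self.open_sums.pop(s)) for s in ready]

agg = WindowedSum(window_size=60)
agg.add(605, 1)                      # 10:05
agg.add(665, 5)                      # 11:05, arrives before the next 10:xx event
agg.add(650, 2)                      # 10:50, out of order but still on time
first = agg.advance_watermark(660)   # watermark reaches 11:00
accepted = agg.add(610, 7)           # 10:10 arrives after its window closed
second = agg.advance_watermark(720)  # watermark reaches 12:00
```

A fuller model would let a trigger fire early for low-latency approximate results and fire again to refine them when late data arrives, rather than dropping it.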

Dataflow improvements over Lambda

Low-latency, approximate results

Complete, correct results as soon as possible

One system: less to manage, fewer resources, one set of bugs

Tools for explicit reasoning about time

= Power + Flexibility + Clarity

And those are just the programming model improvements…

What about the operational model improvements from marrying Dataflow with Cloud?

Cloud Dataflow as a No-ops Cloud service

Google Cloud Platform

Managed Service

User Code & SDK

Work Manager

Deploy & Schedule

Progress & Logs

Monitoring UI

Job Manager

Putting it all together

Stream

Batch

Cloud Pub/Sub

Cloud Logs

Google Analytics Premium

Google Cloud Storage

Google App Engine

Cloud Dataflow

BigQuery Storage (tables)

Cloud Storage (files)

Cloud Dataflow

BigQuery Analytics (SQL)

Bigtable (NoSQL)

Optimizing Time To Answer

[Figure: two charts comparing where time goes]

Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling Growing Scale, Utilization improvements

Data Processing with Cloud Dataflow: Programming, More time to dig into your data

For more info on Google Cloud services:

https://cloud.google.com/dataflow/

https://cloud.google.com/bigquery/

https://cloud.google.com/pubsub/

https://cloud.google.com/hadoop/

Contact me:

William Vambenepe

twitter: @vambenepe

email: vbp@google.com

Dataflow programming model is open-source:

SDK @ github /GoogleCloudPlatform/DataflowJavaSDK (Python SDK in progress)

Spark runner @ github /cloudera/spark-dataflow

Flink runner @ github /dataArtisans/flink-dataflow
