apache beam and google cloud dataflow - idg - final

Google Cloud Dataflow: the next generation of managed big data service, based on the Apache Beam programming model
Szabolcs Feczak, Cloud Solutions Engineer, Google
9th Cloud & Data Center World 2016 - IDG Korea


TRANSCRIPT

Page 1: Apache Beam and Google Cloud Dataflow - IDG - final

Google Cloud Dataflow: the next generation of managed big data service based on the Apache Beam programming model

Szabolcs Feczak, Cloud Solutions Engineer

Google

9th Cloud & Data Center World 2016 - IDG Korea

Page 2: Apache Beam and Google Cloud Dataflow - IDG - final

Goals

1. You leave here understanding the fundamentals of the Apache Beam model and the Google Cloud Dataflow managed service.
2. We have some fun.

Page 3: Apache Beam and Google Cloud Dataflow - IDG - final

Background and Historical overview

Page 4: Apache Beam and Google Cloud Dataflow - IDG - final

The trade-off quadrant of Big Data

Completeness

Speed

Cost Optimization

Complexity

Time to Answer

Page 5: Apache Beam and Google Cloud Dataflow - IDG - final

MapReduce

Hadoop

Flume

Storm

Spark

MillWheel

Flink

Apache Beam

Along the way these systems introduced: Batch, Streaming, Pipelines, Unified API, No Lambda, Iterative, Interactive, Exactly Once, State, Timers, Auto-Awesome, Watermarks, Windowing, High-level API, Managed Service, Triggers, Open Source, Unified Engine*

* Optimizer

Page 6: Apache Beam and Google Cloud Dataflow - IDG - final

Deep dive, probing familiarity with the subject

1M Devices

16.6K Events/sec

43B Events/month

518B Events/year

Page 7: Apache Beam and Google Cloud Dataflow - IDG - final

Before Apache Beam

Batch (Accuracy, Simplicity, Savings) OR Stream (Speed, Sophistication, Scalability)

Page 8: Apache Beam and Google Cloud Dataflow - IDG - final

After Apache Beam

Batch (Accuracy, Simplicity, Savings) AND Stream (Speed, Sophistication, Scalability)

Balancing correctness, latency and cost with a unified batch and streaming model

Page 11: Apache Beam and Google Cloud Dataflow - IDG - final

Apache Beam (incubating)

http://incubator.apache.org/projects/beam.html
The Dataflow submission to the Apache Incubator was accepted on February 1, 2016, and the resulting project is now called Apache Beam.

Software Development Kits:
• Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
• Python (ALPHA)
• Scala: /darkjh/scalaflow and /jhlch/scala-dataflow-dsl

Runners:
• Spark runner @ /cloudera/spark-dataflow
• Flink runner @ /dataArtisans/flink-dataflow

Page 12: Apache Beam and Google Cloud Dataflow - IDG - final

Where might you use Apache Beam? ETL, Analysis, Orchestration

• Movement
• Filtering
• Enrichment
• Shaping
• Reduction
• Batch computation
• Continuous computation
• Composition
• External orchestration
• Simulation

Page 13: Apache Beam and Google Cloud Dataflow - IDG - final

Why would you go with a managed service?

Page 14: Apache Beam and Google Cloud Dataflow - IDG - final

Cloud Dataflow Managed Service advantages (GA since August 2015)

[Diagram: User Code & SDK is deployed and scheduled onto GCP by the Managed Service's Work Manager and Job Manager, with Progress & Logs surfaced in the Monitoring UI.]

Page 15: Apache Beam and Google Cloud Dataflow - IDG - final

Cloud Dataflow Service: Worker Lifecycle Management (Deploy, Schedule & Monitor, Tear Down)

Page 16: Apache Beam and Google Cloud Dataflow - IDG - final

Challenge: cost optimization

❯ Time & life never stop
❯ Data rates & schemas are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and can create lag

Page 17: Apache Beam and Google Cloud Dataflow - IDG - final

Cloud Dataflow Service: Auto-scaling

[Chart: load at 800 QPS at 10:00, 1200 QPS at 11:00, 5000 QPS at 12:00, and 50 QPS at 13:00; workers are scaled to match.]

Page 18: Apache Beam and Google Cloud Dataflow - IDG - final

Cloud Dataflow Service: Dynamic Work Rebalancing

[Chart: the same job running in 100 mins. vs. 65 mins. with dynamic work rebalancing.]

Page 19: Apache Beam and Google Cloud Dataflow - IDG - final

Cloud Dataflow Service: Graph Optimization

● ParDo fusion
  ○ Producer-Consumer
  ○ Sibling
  ○ Intelligent fusion boundaries
● Combiner lifting, e.g. partial aggregations before reduction
● http://research.google.com/search.html?q=flume%20java

[Diagram: producer-consumer and sibling fusion merge adjacent ParallelDos C and D into a single C+D; combiner lifting rewrites A → GBK → CombineValues → B as A + partial combine → GBK → CombineValues → B. Legend: GBK = GroupByKey, + = CombineValues.]
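Combiner lifting can be illustrated with a plain-Python sketch (illustrative names, not the Dataflow optimizer itself): each worker pre-aggregates its own partition, so only per-key partial sums cross the GroupByKey shuffle instead of every raw value.

```python
from collections import defaultdict

def combiner_lifted_sum(partitions):
    """Combiner lifting in miniature: each partition is pre-summed
    locally, so only partial sums cross the GroupByKey shuffle."""
    partials = []
    for partition in partitions:           # each worker's local slice
        local = defaultdict(int)
        for key, value in partition:       # partial aggregation before reduction
            local[key] += value
        partials.append(local)
    merged = defaultdict(int)              # merge the partial sums after the shuffle
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] += subtotal
    return dict(merged)

# Two partial sums for "a" cross the shuffle instead of three raw values.
print(combiner_lifted_sum([[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]]))
# {'a': 6, 'b': 4}
```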

Page 20: Apache Beam and Google Cloud Dataflow - IDG - final

Deep dive into the programming model

Page 21: Apache Beam and Google Cloud Dataflow - IDG - final

The Apache Beam Logical Model

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

Page 22: Apache Beam and Google Cloud Dataflow - IDG - final

What are you computing?

● A Pipeline represents a graph
● Nodes are data processing transformations
● Edges are data sets flowing through the pipeline
● Optimized and executed as a unit for efficiency

Page 23: Apache Beam and Google Cloud Dataflow - IDG - final

What are you computing? PCollections

● A collection of homogeneous data of the same type
● May be bounded or unbounded in size
● Each element has an implicit timestamp
● Initially created from backing data stores

Page 24: Apache Beam and Google Cloud Dataflow - IDG - final

Challenge: completeness when processing continuous data

[Timeline: elements with event time 8:00 continue to arrive throughout processing time, from 9:00 until 14:00.]

Page 25: Apache Beam and Google Cloud Dataflow - IDG - final

What are you computing? PTransforms

PTransforms transform PCollections into other PCollections.

● Element-Wise (Map + Reduce = ParDo)
● Aggregating (Combine, Join, Group)
● Composite

Page 26: Apache Beam and Google Cloud Dataflow - IDG - final

Apache Beam SDK: Composite PTransforms

Count = Pair With Ones → GroupByKey → Sum Values

❯ Define new PTransforms by building up subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max, Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
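The Count composite (Pair With Ones, GroupByKey, Sum Values) can be sketched in plain Python; the function names are illustrative stand-ins, not the Beam SDK:

```python
from itertools import groupby

def pair_with_ones(elements):
    """Turn each element into a (element, 1) key-value pair."""
    return [(e, 1) for e in elements]

def group_by_key(pairs):
    """Collect all values sharing a key (groupby needs sorted input)."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=lambda kv: kv[0])}

def sum_values(grouped):
    """Reduce each key's values to their sum."""
    return {k: sum(vs) for k, vs in grouped.items()}

def count(elements):
    """The composite transform: a subgraph of the three primitives above."""
    return sum_values(group_by_key(pair_with_ones(elements)))

print(count(["cat", "dog", "cat"]))  # {'cat': 2, 'dog': 1}
```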

Page 27: Apache Beam and Google Cloud Dataflow - IDG - final

Example: Computing Integer Sums


Page 28: Apache Beam and Google Cloud Dataflow - IDG - final


Example: Computing Integer Sums

Page 29: Apache Beam and Google Cloud Dataflow - IDG - final

Where in Event Time?

● Windowing divides data into event-time-based finite chunks.
● Required when doing aggregations over unbounded data.

[Diagram: the same keyed elements (Key 1, Key 2, Key 3) divided three ways: into Fixed windows, Sliding windows, and Sessions.]
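Fixed and sliding window assignment can be sketched in a few lines of plain Python (timestamps in seconds; illustrative code, not the SDK's window functions). Sessions differ in that they are data-driven, merging elements separated by less than a gap.

```python
def fixed_windows(timestamp, size):
    """Assign a timestamp to the single fixed window containing it."""
    start = timestamp - (timestamp % size)
    return [(start, start + size)]

def sliding_windows(timestamp, size, period):
    """Assign a timestamp to every sliding window containing it."""
    windows = []
    start = timestamp - (timestamp % period)   # most recent window start
    while start > timestamp - size:            # walk back while still covering timestamp
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

print(fixed_windows(130, 120))        # [(120, 240)]  i.e. the 2-minute window 2:00-4:00
print(sliding_windows(130, 120, 60))  # [(60, 180), (120, 240)]
```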

Page 30: Apache Beam and Google Cloud Dataflow - IDG - final


Example: Fixed 2-minute Windows

Page 31: Apache Beam and Google Cloud Dataflow - IDG - final

When in Processing Time?

● Triggers control when results are emitted.
● Triggers are often relative to the watermark.

[Diagram: processing time plotted against event time, showing the watermark and the skew between the two.]
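A minimal plain-Python sketch of watermark-relative triggering (illustrative names, not the SDK): everything seen before the watermark passes the end of the window fires as one on-time pane, and each element arriving after that fires a late pane.

```python
def trigger_panes(arrivals, watermark_passes):
    """Sketch of triggering at the watermark for one window.
    arrivals: (processing_time, value) pairs belonging to the window.
    watermark_passes: the processing time at which the watermark
    passes the end of the window."""
    on_time = sum(v for t, v in arrivals if t <= watermark_passes)
    late = [("late", v) for t, v in arrivals if t > watermark_passes]
    return [("on-time", on_time)] + late

# Values 5 and 1 arrive before the watermark passes; 2 arrives late.
print(trigger_panes([(1, 5), (2, 1), (9, 2)], watermark_passes=4))
# [('on-time', 6), ('late', 2)]
```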

Page 32: Apache Beam and Google Cloud Dataflow - IDG - final


Example: Triggering at the Watermark

Page 33: Apache Beam and Google Cloud Dataflow - IDG - final


Example: Triggering for Speculative & Late Data

Page 34: Apache Beam and Google Cloud Dataflow - IDG - final

How do Refinements Relate?

● How should multiple outputs per window accumulate?
● Appropriate choice depends on consumer.

Firing         | Elements | Discarding | Accumulating | Acc. & Retracting
Speculative    | 3        | 3          | 3            | 3
Watermark      | 5, 1     | 6          | 9            | 9, -3
Late           | 2        | 2          | 11           | 11, -9
Total Observed | 11       | 11         | 23           | 11
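The three accumulation modes in the table can be reproduced with a short plain-Python sketch (illustrative names; the firings are the speculative, at-the-watermark, and late firings of a single window):

```python
def panes(firings, mode):
    """What each trigger firing emits for one window, per mode.
    firings: the raw elements in each successive firing."""
    emitted, seen, previous = [], [], None
    for batch in firings:
        seen = seen + batch
        if mode == "discarding":
            emitted.append([sum(batch)])          # only the new elements
        elif mode == "accumulating":
            emitted.append([sum(seen)])           # everything so far
        else:                                     # accumulating & retracting
            pane = [sum(seen)] + ([-previous] if previous is not None else [])
            emitted.append(pane)                  # new total plus retraction of old
            previous = sum(seen)
    return emitted

firings = [[3], [5, 1], [2]]           # speculative, watermark, late (from the table)
print(panes(firings, "discarding"))    # [[3], [6], [2]]
print(panes(firings, "accumulating"))  # [[3], [9], [11]]
print(panes(firings, "retracting"))    # [[3], [9, -3], [11, -9]]
```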

Page 35: Apache Beam and Google Cloud Dataflow - IDG - final


Example: Add Newest, Remove Previous

Page 36: Apache Beam and Google Cloud Dataflow - IDG - final

Customizing What Where When How

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions

Page 37: Apache Beam and Google Cloud Dataflow - IDG - final

The key takeaway

Page 38: Apache Beam and Google Cloud Dataflow - IDG - final

Optimizing Your Time To Answer

More time to dig into your data.

Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements

Data Processing with Cloud Dataflow: Programming

Page 40: Apache Beam and Google Cloud Dataflow - IDG - final

What do customers have to say aboutGoogle Cloud Dataflow

"We are utilizing Cloud Dataflow to overcome elasticity challenges with our current Hadoop cluster. Starting with some basic ETL workflow for BigQuery ingestion, we transitioned into full blown clickstream processing and analysis. This has helped us significantly improve performance of our overall system and reduce cost."

Sudhir Hasbe, Director of Software Engineering, Zulily.com

“The current iteration of Qubit’s real-time data supply chain was heavily inspired by the ground-breaking stream processing concepts described in Google’s MillWheel paper. Today we are happy to come full circle and build streaming pipelines on top of Cloud Dataflow - which has delivered on the promise of a highly-available and fault-tolerant data processing system with an incredibly powerful and expressive API.”

Jibran Saithi, Lead Architect, Qubit

"We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark"

Paul Clarke, Director of Technology, Ocado

“Boosting performance isn’t the only thing we want to get from the new system. Our bet is that by using cloud-managed products we will have a much lower operational overhead. That in turn means we will have much more time to make Spotify’s products better.”

Igor Maravić, Software Engineer working at Spotify

Page 41: Apache Beam and Google Cloud Dataflow - IDG - final

Demo Time!

Page 42: Apache Beam and Google Cloud Dataflow - IDG - final

Let’s build something - Demo!

1. Ingest the stream of Wikipedia edits (https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org), create a pipeline and run a Dataflow job to extract the top 10 active editors and top 10 pages edited, then inspect the result set in our data warehouse (BigQuery)

2. Extract words from a Shakespeare corpus, count the occurrences of each word, and write sharded results as blobs into a key value store (Cloud Storage)
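Demo 2 in miniature, as a plain-Python sketch (illustrative names; the real demo runs this as a Dataflow pipeline writing blobs to Cloud Storage):

```python
import re
import zlib
from collections import Counter

def count_words(text):
    """Extract words from a corpus and count occurrences of each."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def shard_counts(counts, num_shards):
    """Deterministically split the results into num_shards 'blobs',
    analogous to the sharded output files the demo writes."""
    shards = [{} for _ in range(num_shards)]
    for word, n in counts.items():
        shards[zlib.crc32(word.encode()) % num_shards][word] = n
    return shards

counts = count_words("To be, or not to be: that is the question")
print(counts["to"], counts["be"])  # 2 2
shards = shard_counts(counts, 3)   # every word lands in exactly one shard
```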

Page 43: Apache Beam and Google Cloud Dataflow - IDG - final

Thank You!

cloud.google.com/dataflow
cloud.google.com/blog/big-data/
cloud.google.com/solutions/articles#bigdata
cloud.google.com/newsletter
research.google.com