![Page 1: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/1.jpg)
Jelena Pjesivac-GrbovicStaff software engineerCloud Big Data
Data Processing with Apache Beam (incubating) and
Google Cloud Dataflow
XLDB’16 - May 2016
In collaboration with Frances Perry, Tayler Akidau, and Dataflow team
![Page 2: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/2.jpg)
Infinite, Out-of-Order Data Sets
What, Where, When, How
Apache Beam (incubating)
Agenda
Google Cloud Dataflow
2
4
1
3
![Page 3: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/3.jpg)
Infinite, Out-of-Order Data Sets1
![Page 4: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/4.jpg)
Data...
![Page 5: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/5.jpg)
...can be big...
![Page 6: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/6.jpg)
...really, really big...
TuesdayWednesday
Thursday
![Page 7: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/7.jpg)
… maybe infinitely big...
9:008:00 14:0013:0012:0011:0010:00
![Page 8: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/8.jpg)
… with unknown delays.
9:008:00 14:0013:0012:0011:0010:00
8:00
8:008:00
![Page 9: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/9.jpg)
Element-wise transformations
13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time
![Page 10: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/10.jpg)
Aggregating via Processing-Time Windows
13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time
![Page 11: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/11.jpg)
Aggregating via Event-Time Windows
Event Time
Processing Time 11:0010:00 15:0014:0013:0012:00
11:0010:00 15:0014:0013:0012:00
Input
Output
![Page 12: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/12.jpg)
Reality
Formalizing Event-Time SkewP
roce
ssin
g Ti
me
Event Time
Ideal
Skew
![Page 13: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/13.jpg)
Formalizing Event-Time Skew
Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen"
Pro
cess
ing
Tim
e
Event Time
~Watermark
Ideal
Skew
Often heuristic-based.
Too Slow? Results are delayed.Too Fast? Some data is late.
![Page 14: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/14.jpg)
What, Where, When, How2
![Page 15: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/15.jpg)
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
![Page 16: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/16.jpg)
What are you computing?
What Where When How
Element-Wise Aggregating Composite
![Page 17: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/17.jpg)
What: Computing Integer Sums
// Collection of raw log linesPCollection<String> raw = IO.read(...);
// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
raw.apply(ParDo.of(new ParseFn());
// Composite transformation containing an aggregationPCollection<KV<String, Integer>> scores =
input.apply(Sum.integersPerKey());
What Where When How
![Page 18: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/18.jpg)
What: Computing Integer Sums
What Where When How
![Page 19: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/19.jpg)
What: Computing Integer Sums
What Where When How
![Page 20: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/20.jpg)
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
Where in event time?
What Where When How
Fixed Sliding1 2 3
54
Sessions
2
431
Key 2
Key 1
Key 3
Time
1 2 3 4
![Page 21: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/21.jpg)
Where: Fixed 2-minute Windows
What Where When How
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Minutes(2)))
.apply(Sum.integersPerKey());
![Page 22: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/22.jpg)
Where: Fixed 2-minute Windows
What Where When How
![Page 23: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/23.jpg)
When in processing time?
What Where When How
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
Pro
cess
ing
Tim
e
Event Time
~Watermark
Ideal
Skew
![Page 24: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/24.jpg)
When: Triggering at the Watermark
What Where When How
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Minutes(2))
.triggering(AtWatermark()))
.apply(Sum.integersPerKey());
![Page 25: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/25.jpg)
When: Triggering at the Watermark
What Where When How
![Page 26: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/26.jpg)
When: Early and Late Firings
What Where When How
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Minutes(2))
.triggering(AtWatermark()
.withEarlyFirings(AtPeriod(Minutes(1)))
.withLateFirings(AtCount(1))))
.apply(Sum.integersPerKey());
![Page 27: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/27.jpg)
When: Early and Late Firings
What Where When How
![Page 28: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/28.jpg)
How do refinements relate?
What Where When How
• How should multiple outputs per window accumulate?• Appropriate choice depends on consumer.
Firing Elements
Speculative [3]
Watermark [5, 1]
Late [2]
Last Observed
Total Observed
Discarding
3
6
2
2
11
Accumulating
3
9
11
11
23
Acc. & Retracting
3
9, -3
11, -9
11
11
(Accumulating & Retracting not yet implemented.)
![Page 29: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/29.jpg)
How: Add Newest, Remove Previous
What Where When How
![Page 30: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/30.jpg)
1.Classic Batch 2. Batch with Fixed Windows
3. Streaming
5. Streaming With Retractions
4. Streaming with Speculative + Late Data
What Where When How
6. Sessions
What / Where / When / How
![Page 31: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/31.jpg)
3 Apache Beam (incubating)
![Page 32: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/32.jpg)
The Evolution of Beam
MapReduce
Google Cloud Dataflow
Apache Beam
BigTable DremelColossus
FlumeMegastoreSpanner
PubSub
Millwheel
![Page 33: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/33.jpg)
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- starting with Java
3. Runners for Existing Distributed Processing Backends• Apache Flink (thanks to data Artisans)• Apache Spark (thanks to Cloudera)• Google Cloud Dataflow (fully managed service)• Local (in-process) runner for testing
What is Part of Apache Beam?
![Page 34: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/34.jpg)
1. End users: who want to write pipelines or transform libraries in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines
Apache Beam Technical Vision
Beam Model: Fn Runners
Runner A Runner B
Beam Model: Pipeline Construction
OtherLanguagesBeam Java Beam
Python
Execution Execution
Cloud Dataflow
Execution
![Page 35: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/35.jpg)
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem
Growing the Beam Community
![Page 36: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/36.jpg)
Google Cloud Dataflow4
![Page 37: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/37.jpg)
• Fully managed service for running Beam pipelines• Dynamically provisioned, on-demand resources
• VMs, temporary storage• No tuning required
• Autoscaling + Dynamic Work Rebalancing• Built from the experience with Google
internal products
Google Cloud Dataflow
![Page 38: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/38.jpg)
Wor
kers
Time
With DWR
• Advanced straggler mitigation technique• Ensures all tasks finish at the same time
No Tuning Required: Dynamic Work Rebalancing
Wor
kers
Time
Without DWR
• For more info google: “No shard left behind: dynamic work rebalancing in Google Cloud Dataflow”
![Page 39: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/39.jpg)
• Dynamically adjust to the number of workers to match the load• Both for streaming and batch
No Tuning Required: Autoscaling
• For more info google: “Comparing Cloud Dataflow autoscaling to Spark and Hadoop”
Time Time
Wor
kers
![Page 40: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/40.jpg)
• Apache Beam connectors• Google Cloud
• Storage, BigQuery, BigTable, Datastore, Pub/Sub,
• External / Custom IO• Kafka, HDFS, many in flight
• Part of Google Cloud Platform• Monitoring UI• Cloud Logging• Cloud Debugger and Profiler• Stackdriver integration
Integrations
![Page 41: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/41.jpg)
• BigQuery• A fast, economical, and fully managed data warehouse solution
• Dataflow• Fully managed, real-time, data processing service for batch and
streaming• Dataproc
• Fast, easy to use managed Spark and Hadoop service• Datalab(beta)
• Interactive large scale data analysis, exploration and visualization• Pub/Sub
• Reliable, many-to-many, asynchronous messaging service• Genomics
• Empowers scientists to organize world’s genomics information
Big Data in Google Cloud Platform
![Page 42: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/42.jpg)
• Machine Learning Platform(alpha)
• Fast, large scale, easy to use Machine Learning service
• Vision API• Enables insights based on our powerful Vision APIs
• Speech API• Speech to text conversion powered by Machine Learning
• Translate API• Enables multilingual apps and programmatic translation
Machine Learning in Google Cloud Platform
![Page 43: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/43.jpg)
Learn More! Follow @GCPBigData + @ApacheBeam
Apache Beam (incubating)http://beam.incubator.apache.org
Google Cloud Dataflowhttp://cloud.google.com/dataflow
Google Cloud Platformhttp://cloud.google.com
![Page 44: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights](https://reader035.vdocuments.us/reader035/viewer/2022081517/5ec5da5b148dbc039436da9e/html5/thumbnails/44.jpg)
Thank you!