Sergio Fernández, Redlink GmbH
December 7, 2016 - DataCamp Salzburg
Introduction to Apache Beam (incubating)
THE decision
http://thenewstack.io/apache-streaming-projects-exploratory-guide/
https://twitter.com/ianhellstrom/status/710917506412716033
Gearpump
Google Dataflow
https://www.flickr.com/photos/somewhatfrank/7152104387/
streams
Apache Beam is a unified (batch + stream), runner-agnostic programming model designed to provide efficient and portable data processing pipelines.
Some bits of history...
Beam
http://bitmin.net/blog/what-is-google-cloud-dataflow/
Beam Programming Model: abstract stack
SDK
DSL
Beam Pipeline Construction
Runner
Beam Fn Runners
Execution
Beam Programming Model: concrete stack
SDKs: Java SDK, Python SDK, x SDK
DSL: scio
Beam Pipeline Construction
Runners: Apex Runner, Dataflow Runner, Direct Runner, Flink Runner, Spark Runner
Beam Fn Runners
Execution 1 to Execution N
Beam Capability Matrix
https://beam.incubator.apache.org/documentation/runners/capability-matrix/
Beam Model API in a nutshell
● Pipeline: a data processing job as a directed graph of steps
● PCollection: a parallel collection of timestamped elements that are in windows
● IO: produce/consume PCollections from/to outside the pipeline
● Transforms, for instance:
  ○ ParDo: flatmap over elements of a PCollection
  ○ (Co)GroupByKey: shuffle & group {{K: V}} → {K: [V]}
  ○ Side inputs: global view of a PCollection used for broadcast / joins
https://beam.apache.org/documentation/programming-guide/
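The ParDo and GroupByKey descriptions above can be illustrated without any Beam dependency. The following is a minimal sketch in plain Java streams, not Beam API: the class ModelSketch and its methods are hypothetical names chosen for illustration, and an in-memory List stands in for a PCollection (so there are no timestamps, windows, or parallelism here).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy only; none of these names come from the Beam SDK.
public class ModelSketch {

    // ParDo analogue: a flatmap, each input line yields zero or more words.
    public static List<String> extractWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // GroupByKey analogue: (K, V) pairs become K -> [V].
    // The key is the word length; TreeMap only keeps the printed order stable.
    public static Map<Integer, List<String>> groupByLength(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(String::length, TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> words = extractWords(Arrays.asList("to be or", "not to be"));
        System.out.println(words);
        System.out.println(groupByLength(words));
    }
}
```

In a real Beam pipeline the same two steps would be expressed as transforms applied to a PCollection and executed by the chosen runner, possibly distributed across workers.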
// Parse command-line arguments into the pipeline's own options interface.
Options options = PipelineOptionsFactory.fromArgs(args)
    .withValidation().as(Options.class);
Pipeline pipeline = Pipeline.create(options);
// Read lines, count words, format each count as text, and write the results.
pipeline.apply("ReadLines", TextIO.Read.from(options.getInput()))
    .apply(new CountWords())
    .apply(MapElements.via(new FormatAsTextFn()))
    .apply("WriteCounts", TextIO.Write.to(options.getOutput()));
pipeline.run();
Writing a basic Beam Pipeline
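The CountWords and FormatAsTextFn steps used in the pipeline are not defined on the slide. Assuming they behave like the same-named classes in Beam's WordCount example (split lines into words, count each word, format as "word: count"), the data flow can be sketched in plain Java. This is an illustration of the logic only, not Beam code; WordCountSketch, countWords, and formatAsText are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java stand-in for the pipeline's middle steps; not the Beam SDK.
public class WordCountSketch {

    // Assumed CountWords behavior: split lines into words, count each distinct word.
    // TreeMap is used only to make the output order deterministic in this sketch.
    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Assumed FormatAsTextFn behavior: one "word: count" line per entry.
    public static List<String> formatAsText(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        formatAsText(countWords(lines)).forEach(System.out::println);
    }
}
```

Unlike this sketch, a Beam runner gives no ordering guarantee for the written counts, and the output is typically split across several shard files.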
Run your Pipeline: Direct Runner
mvn compile exec:java \
-Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
-Dexec.args="--inputFile=../input.txt \
--output=target/direct/counts" \
-Pdirect-runner
http://beam.incubator.apache.org/get-started/quickstart/
Run your Pipeline: Spark
mvn compile exec:java \
-Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
-Dexec.args="--runner=SparkRunner \
--inputFile=input.txt --output=target/spark/counts" \
-Pspark-runner
http://beam.incubator.apache.org/get-started/quickstart/#runner-spark
Run your Pipeline: Flink
mvn package exec:java \
-Dexec.mainClass=io.redlink.datacamp.beam.WordCount \
-Dexec.args="--runner=FlinkRunner \
--inputFile=input.txt \
--output=target/flink/counts" \
-Pflink-runner
http://beam.incubator.apache.org/get-started/quickstart/#runner-flink
Thank you very much
Sergio Fernández
Software Engineer
https://www.wikier.org/
Redlink GmbH
http://redlink.co
Work partially funded by SSIX, a European Union Horizon 2020 project (grant agreement no. 645425)