Download - Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016

Sergio FernándezRedlink GmbH

December 7, 2016 - DataCamp Salzburg

(incubating)

Introduction to

http://www.meetup.com/Salzburg-Big-Data-Meetup/events/228004128/

https://www.fh-salzburg.ac.at/

http://redlink.co/

http://ssix-project.eu/

https://beam.incubator.apache.org/

https://www.apache.org/

THE decision

http://thenewstack.io/apache-streaming-projects-exploratory-guide/ https://twitter.com/ianhellstrom/status/710917506412716033

Gearpump

Google Dataflow

https://www.flickr.com/photos/somewhatfrank/7152104387/

streams

https://aws.amazon.com/kinesis/

http://thenewstack.io/apache-streaming-projects-exploratory-guide/

http://thenewstack.io/apache-streaming-projects-exploratory-guide/

https://twitter.com/ianhellstrom/status/710917506412716033

https://twitter.com/ianhellstrom/status/710917506412716033

https://spark.apache.org/

https://apex.apache.org/

https://storm.apache.org/

https://samza.apache.org/

https://gearpump.apache.org/overview.html

https://nifi.apache.org/

https://aws.amazon.com/kinesis/

https://cloud.google.com/dataflow/



http://docs.confluent.io/3.0.0/streams/

http://docs.confluent.io/3.0.0/streams/

https://www.confluent.io/

Apache Beam is a unified and agnostic

(batch+stream) programming model designed to provide

efficient and portable data processing pipelines

https://beam.incubator.apache.org/

Some bits of history...

Beam

http://bitmin.net/blog/what-is-google-cloud-dataflow/



BeamProgrammingModel:abstract stack

SDK

DSL

Beam Pipeline Construction

Runner

Beam Fn Runners

Execution

BeamProgrammingModel:concrete stack

Java SDK

scio

Beam Pipeline Construction

Flink Runner

Beam Fn Runners

Execution 1

Python SDK x SDK

Apex Runner

Dataflow Runner

Spark Runner

Direct Runner

Execution N

Beam Capability Matrix

https://beam.incubator.apache.org/documentation/runners/capability-matrix/



Beam Model API in a nutshell

● Pipeline: a data processing job as a directed graph of steps

● PCollection: a parallel collection of timestamped elements that are in windows

● IO: produce/consume PCollections from/to outside the pipeline

● Transforms, for instance:○ ParDo: flatmap over elements of a PCollection○ (Co)GroupByKey: shuffle & group {{K: V}} → {K: [V]}○ Side inputs: global view of a PCollection used for broadcast / joins

https://beam.apache.org/documentation/programming-guide/



Options options = PipelineOptionsFactory.fromArgs(args)

.withValidation().as(Options.class);

Pipeline pipeline = Pipeline.create(options);

pipeline.apply("ReadLines", TextIO.Read.from(options.getInput()))

.apply(new CountWords())

.apply(MapElements.via(new FormatAsTextFn()))

.apply("WriteCounts", TextIO.Write.to(options.getOutput()));

pipeline.run();

Writing a basic Beam Pipeline

Run your Pipeline: Direct Runner

mvn compile exec:java \

-Dexec.mainClass=io.redlink.datacamp.beam.WordCount \

-Dexec.args="--inputFile=../input.txt \

--output=target/direct/counts" \

-Pdirect-runner

http://beam.incubator.apache.org/get-started/quickstart/



Run your Pipeline: Spark

mvn compile exec:java \


-Dexec.args="--runner=SparkRunner \

--inputFile=input.txt --output=target/spark/counts" \

-Pspark-runner

http://beam.incubator.apache.org/get-started/quickstart/#runner-spark



Run your Pipeline: Flink

mvn package exec:java \


-Dexec.args="--runner=FlinkRunner \

--inputFile=input.txt \

--output=target/flink/counts" \

-Pflink-runner

http://beam.incubator.apache.org/get-started/quickstart/#runner-flink



Vielen Dank

Sergio FernándezSoftware Engineerhttps://www.wikier.org/

Redlink GmbHhttp://redlink.co

Work partially funded by SSIX, a European Union’s Horizon 2020 project (grant agreement no. 645425)

http://redlink.co/

http://redlink.co/

https://www.wikier.org/

https://www.wikier.org/

http://redlink.co

http://redlink.co

http://ssix-project.eu/

Download - Introduction to Apache Beam (incubating) - DataCamp Salzburg - 7 dec 2016

Top Related