![Page 1: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/1.jpg)
ProcessingDataofAnySizewithApacheBeam
1/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 2: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/2.jpg)
Mentoring,training,andhigh-levelconsultingcompanyfocusedonBigData,NoSQLandTheCloud
Foundedin2008WehelpmakecompaniessuccessfulwithBigDataprojects
OngoingteammentoringUsecaseevaluationManagementtrainingTechnicaltrainingArchitecturereviewsLiveandemailprogrammingsupport
Gotohttp://www.bigdatainstitute.ioformoreinformation
AboutBigDataInstitute
2/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 3: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/3.jpg)
Yourexperienceasadeveloper,analystoradministrator
Whichlanguagesyouuse
ExperiencewithHadoop,BigDataorNoSQL
Expectationsfromthisclass
AboutYou
3/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 4: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/4.jpg)
Chapter1
IntroducingApacheBeam
4/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 5: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/5.jpg)
WhatIsBeam?WhyUseBeam?UsingBeam
IntroducingApacheBeam
5/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 6: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/6.jpg)
ApacheBeamisaunifiedmodelforprocessingdata
WasoriginallycreatedatGoogleLaterdonatedtotheApacheFoundationasApacheBeamNowanApachetoplevelproject
BeamcodeiswrittentoitsAPICodeisexecutedondifferentrunnersNotdirectlytiedtoaframeworkorrunner
Allinteractionsaredonethroughpipelines
ApacheBeam
6/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 7: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/7.jpg)
Pipeline
The DoFNs take in the inputprocesses it, and emit the results
Source
The Source reads the input onerecord or row at a time
DoFN DoFN Sink
The Source saves the output of theDoFN to the targeted path
All work is encapsulated in aPipeline
BeamPipelinesDiagram
7/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 8: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/8.jpg)
Juan
Fatima
Mark
14:00 14:30 15:00 15:30 16:00
Data is broken intosessions based on acriteria for a timeoutbetween actions.
Data can be calculatedin fixed windows wherethe time doesn't change.
Data can be calculatedin sliding windows wherethe time is fixed butadvances.
BeamWindowing
8/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 9: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/9.jpg)
WhatIsBeam?WhyUseBeam?UsingBeam
IntroducingApacheBeam
9/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 10: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/10.jpg)
Learningframework-specificAPIseverytimeanewframeworkcomesoutorcompletelychangestheirexistingAPIdoesn’tcreate
value
TooManyAPIs
10/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 11: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/11.jpg)
Hadoop Cluster
Real-time data is published toKafka
Spark Streaming, Storm, orKafka Consumers process in
real-time
DataSource
DataSource
DataSource
RDBMS
Real-timeProcessingKafka Cluster
BI Analytics
Batch data is saved to HDFS
DataSource
DataSource
DataSource
MapReduce, Hive, Pig, Crunch,and Spark process data stored
in HDFS
Real-time data is archived toHDFS for analytics and offline
processing
GeneralArchitectureDiagram
11/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 12: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/12.jpg)
OneAPItorulethemallOneAPItolearnMovebetweenframeworks
ThemostunifiedbatchandstreamAPII’veused
UnifiedAPItotheecosystem
Riskmitigationofframeworks
Multiplelanguages
WhyI'mExcitedAboutBeam
12/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 13: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/13.jpg)
Beamisn'ttiedtoaspecificframework
ApacheSparkusesthespark-submit
ApacheFlinkcanbesubmittedwiththeMavenrunner
GoogleCloudDataflowcanbesubmittedwiththeMavenrunner
TheDirectRunnercanbestartedwiththeMavenrunner
RunningBeam
13/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 14: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/14.jpg)
BeamContributions
14/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 15: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/15.jpg)
WhatIsBeam?WhyUseBeam?UsingBeam
IntroducingApacheBeam
15/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 16: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/16.jpg)
IcannotteachhimTheboyhasnopatience
PCollection<String>etl=lines.apply(MapElements.via((Stringline)->line.toUpperCase()).withOutputType(TypeDescriptors.strings()));
ICANNOTTEACHHIMTHEBOYHASNOPATIENCE
MapElements
16/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 17: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/17.jpg)
Icannotteachhim.Theboyhasnopatience.Hewilllearnpatience.
PCollection<String>linecount=lines.apply(Regex.matches("I.*\\."));
Icannotteachhim.Theboyhasnopatience.
RegularexpressionscanbeusedtoparseKVs
Icannotteachhim.Theboyhasnopatience.Hewilllearnpatience.
PCollection<KV<String,String>>twoSentences=lines.apply(Regex.findKV("(.*)\\.(.*)",1,2));
<Icannotteachhim,Theboyhasnopatience>
RegexTransform
17/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 18: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/18.jpg)
Icannotteachhim.Theboyhasnopatience.Hewilllearnpatience.
PCollection<String>pats=lines.apply(ParDo.of(newPatLinesFN()));
staticclassPatLinesFNextendsDoFn<String,String>{@ProcessElementpublicvoidprocessElement(DoFn<String,String>.ProcessContextcontext)throwsException{String[]pieces=context.element().split("");
for(Stringpiece:pieces){if(piece.startsWith("pat")){context.output(piece);}}}}
patience.patience.
ExampleCustomDoFN
18/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 19: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/19.jpg)
importorg.apache.beam.sdk.Pipeline;importorg.apache.beam.sdk.io.TextIO;importorg.apache.beam.sdk.options.PipelineOptions;importorg.apache.beam.sdk.options.PipelineOptionsFactory;importorg.apache.beam.sdk.transforms.Count;importorg.apache.beam.sdk.transforms.Regex;importorg.apache.beam.sdk.transforms.ToString;
publicclassPicoWordCount{publicstaticvoidmain(String[]args){PipelineOptionsoptions=PipelineOptionsFactory.create();Pipelinep=Pipeline.create(options);
p.apply(TextIO.Read.from("playing_cards.tsv")).apply(Regex.split("\\W+")).apply(Count.perElement()).apply(ToString.elements()).apply(TextIO.Write.to("output/stringcounts"));
p.run();}}
PlayingCardAlgorithm
19/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 20: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/20.jpg)
WhatareotherpeopledoingwithBeam?http://tiny.jesse-anderson.com/beaminterview
WhereissomesampleBeamcode?http://tiny.jesse-anderson.com/beamtutorial
MainBeamsitehttps://beam.apache.org/
Convincingyourbosshttp://tiny.jesse-anderson.com/beam1http://tiny.jesse-anderson.com/beam2
NextSteps
20/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf
![Page 21: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec5da9f148dbc039436db71/html5/thumbnails/21.jpg)
Current:Instructor,ThoughtLeader,MonkeyTamer
Previously:CurriculumDeveloperandInstructor@ClouderaSeniorSoftwareEngineer@Intuit
Covered,ConferencesandPublishedIn:GigaOM,ArsTecnica,PragmaticProgrammers,Strata,OSCON,WallStreetJournal,CNN,BBC,NPR
SeeMeOn:http://www.jesse-anderson.com@jessetandersonhttp://tiny.bdi.io/linkedinhttp://tiny.bdi.io/youtube
AboutMe
21/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf