apache crunch simplifying big data with · 2017. 12. 14. · apache crunch compose processing into...
TRANSCRIPT
![Page 1: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/1.jpg)
Simplifying Big Data with Apache Crunch
Micah Whitacre@mkwhit
![Page 2: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/2.jpg)
![Page 3: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/3.jpg)
![Page 4: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/4.jpg)
![Page 5: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/5.jpg)
![Page 6: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/6.jpg)
![Page 7: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/7.jpg)
![Page 8: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/8.jpg)
![Page 9: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/9.jpg)
![Page 10: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/10.jpg)
![Page 11: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/11.jpg)
Semantic Chart Search
Cloud Based EMR
Medical Alerting System
Population Health Management
![Page 12: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/12.jpg)
Problem moves from scaling architecture...
![Page 13: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/13.jpg)
Problem moves from not only scaling architecture...
To how to scale the knowledge
![Page 14: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/14.jpg)
![Page 15: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/15.jpg)
Battling the 3 V’s
![Page 16: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/16.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
![Page 17: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/17.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
60+ different data formats
![Page 18: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/18.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
60+ different data formats
Constant streams for near real time
![Page 19: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/19.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
2+ TB of streaming data daily
60+ different data formats
Constant streams for near real time
![Page 20: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/20.jpg)
Normalize Data
Apply Algorithms
Load Data for
Displays
HBaseSolrVertica
AvroCSVVerticaHBase
Population Health
![Page 21: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/21.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 22: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/22.jpg)
Mapper
Reducer
![Page 23: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/23.jpg)
Struggle to fit into single MapReduce job
![Page 24: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/24.jpg)
Struggle to fit into single MapReduce job
Integration done through persistence
![Page 25: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/25.jpg)
Struggle to fit into single MapReduce job
Integration done through persistence
Custom impls of common patterns
![Page 26: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/26.jpg)
Struggle to fit into single MapReduce job
Integration done through persistence
Custom impls of common patterns
Evolving Requirements
![Page 27: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/27.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
AnonymizeData Avro
Prep for Bulk Load HBase
![Page 28: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/28.jpg)
Easy integration between teams
Focus on processing steps
Shallow learning curve
Ability to tune for performance
![Page 29: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/29.jpg)
Apache CrunchCompose processing into pipelines
Open Source FlumeJava impl
Utilizes POJOs (hides serialization)
Transformation through fns (not job)
![Page 30: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/30.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 31: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/31.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
Processing Pipeline
![Page 32: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/32.jpg)
PipelineProgrammatic description of DAG
Supports lazy execution
MapReduce, Spark, Memory
Implementations indicate runtime
![Page 33: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/33.jpg)
Pipeline pipeline = MemPipeline.getIntance();
Pipeline pipeline = new MRPipeline(Driver.class, conf);
Pipeline pipeline = new SparkPipeline(sparkContext, “app”);
![Page 34: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/34.jpg)
SourceReads various inputs
At least one required per pipeline
Custom implementations
Creates initial collections for processing
![Page 35: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/35.jpg)
SourceSequence Files
Avro ParquetHBaseJDBCHFilesTextCSV
StringsAvroRecords
ResultsPOJOs
ProtobufsThrift
Writables
![Page 36: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/36.jpg)
pipeline.read(From.textFile(path));
![Page 37: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/37.jpg)
pipeline.read(new TextFileSource(path,ptype));
![Page 38: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/38.jpg)
PType<String> ptype = …;pipeline.read(new TextFileSource(path,ptype));
![Page 39: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/39.jpg)
PTypeHides serialization
Exposes data in native Java forms
Avro, Thrift, and Protocol Buffers
Supports composing complex types
![Page 40: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/40.jpg)
Multiple Serialization Types
Serialization Type = PTypeFamily
Can’t mix families in single type
Avro & Writable available
Can easily convert between families
![Page 41: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/41.jpg)
PType<Integer> intTypes = Writables.ints();
PType<String> stringType = Avros.strings();
PType<Person> personType = Avros.records(Person.class);
![Page 42: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/42.jpg)
PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);
![Page 43: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/43.jpg)
PTableType<String, Person> tableType = Avros.tableOf(stringType,personType);
![Page 44: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/44.jpg)
PType<String> ptype = …;PCollection<String> strings = pipeline.read(
new TextFileSource(path, ptype));
![Page 45: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/45.jpg)
PCollectionImmutable
Not created only read or transformed
Unsorted
Represents potential data
![Page 46: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/46.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 47: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/47.jpg)
Process Reference
Data
PCollection<String> PCollection<RefData>
![Page 48: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/48.jpg)
DoFnSimple API to implement
Location for custom logic
Transforms PCollection between forms
Processes one element at a time
![Page 49: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/49.jpg)
For each item emits 0:M items
FilterFn - returns boolean
MapFn - emits 1:1
![Page 50: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/50.jpg)
DoFn API
class ExampleDoFn extends DoFn<String, RefData>{ ...} Type of Data In Type of Data Out
![Page 51: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/51.jpg)
public void process(String s, Emitter<RefData> emitter) { RefData data = …; emitter.emit(data);}
Type of Data In Type of Data Out
![Page 52: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/52.jpg)
PCollection<String> refStringsPCollection<RefData> refs = refStrings.parallelDo(fn, Avros.records(RefData.class));
![Page 53: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/53.jpg)
PCollection<String> dataStrs...PCollection<RefData> refs = dataStrs.parallelDo(diffFn, Avros.records(Data.class));
![Page 54: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/54.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 55: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/55.jpg)
Hmm now I need to join...
We need a PTable
But they don’t have a common key?
![Page 56: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/56.jpg)
PTable<K, V>Immutable & Unsorted
Variation PCollection<Pair<K, V>>
Multimap of Keys and Values
Joins, Cogroups, Group By Key
![Page 57: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/57.jpg)
class ExampleDoFn extends DoFn<String, RefData>{ ...}
![Page 58: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/58.jpg)
class ExampleDoFn extends DoFn<String, Pair<String, RefData>>{
...}
![Page 59: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/59.jpg)
PCollection<String> refStringsPTable<String, RefData> refs = refStrings.parallelDo(fn, Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));
![Page 60: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/60.jpg)
PTable<String, RefData> refs…;PTable<String, Data> data…;
![Page 61: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/61.jpg)
data.join(refs);
(inner join)
![Page 62: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/62.jpg)
PTable<String, Pair<Data, RefData>> joinedData = data.join(refs);
![Page 63: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/63.jpg)
Joinsright, left, inner, outer
Mapside, BloomFilter, Sharded
Eliminates custom impls
![Page 64: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/64.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 65: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/65.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 66: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/66.jpg)
FilterFn API
class MyFilterFn extends FilterFn<...>{ ...} Type of Data In
![Page 67: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/67.jpg)
public boolean accept(... value){ return value > 3;}
![Page 68: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/68.jpg)
PCollection<Model> values = …;PCollection<Model> filtered = values.filter(new MyFilterFn());
![Page 69: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/69.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 70: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/70.jpg)
PTable<String,Model> models = …;
Keyed By PersonId
![Page 71: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/71.jpg)
PTable<String,Model> models = …;PGroupedTable<String, Model> groupedModels = models.groupByKey();
![Page 72: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/72.jpg)
PGroupedTable<K, V>Immutable & Sorted
PCollection<Pair<K, Iterable<V>>>
![Page 73: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/73.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 74: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/74.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 75: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/75.jpg)
PCollection<Person> persons = …;
![Page 76: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/76.jpg)
PCollection<Person> persons = …;pipeline.write(persons, To.avroFile(path));
![Page 77: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/77.jpg)
PCollection<Person> persons = …;pipeline.write(persons, new AvroFileTarget(path));
![Page 78: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/78.jpg)
TargetPersists PCollection
At least one required per pipeline
Custom implementations
![Page 79: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/79.jpg)
TargetSequence Files
Avro ParquetHBaseJDBCHFilesTextCSV
StringsAvroRecords
ResultsPOJOs
ProtobufsThrift
Writables
![Page 80: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/80.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 81: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/81.jpg)
Pipeline pipeline = …; ...pipeline.write(...);PipelineResult result = pipeline.done();
Execution
![Page 82: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/82.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
Map Reduce Reduce
![Page 83: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/83.jpg)
Tuning
GroupingOptions/ParallelDoOptions
Scale factors
Tweak pipeline for performance
![Page 84: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/84.jpg)
Focus on the transformations
Smaller learning curve
Functionality first
Less fragility
![Page 85: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/85.jpg)
Extend pipeline for new features
Iterate with confidence
Integration through PCollections
![Page 86: Apache Crunch Simplifying Big Data with · 2017. 12. 14. · Apache Crunch Compose processing into pipelines Open Source FlumeJava impl Utilizes POJOs (hides serialization) Transformation](https://reader035.vdocuments.us/reader035/viewer/2022063012/5fc802c5f149645f980fcafc/html5/thumbnails/86.jpg)
Linkshttp://crunch.apache.org/
http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
http://dl.acm.org/citation.cfm?id=1806596.1806638