apache crunch simplifying big data with · 2017-12-14 · hbase solr vertica avro csv vertica hbase...

Simplifying Big Data with Apache Crunch Micah Whitacre @mkwhit


Post on 29-Jan-2020


TRANSCRIPT

Page 1: Apache Crunch Simplifying Big Data with · 2017-12-14 · HBase Solr Vertica Avro CSV Vertica HBase Population Health. Process Reference Data Process Raw Person Data ... Apache Crunch

Simplifying Big Data with Apache Crunch

Micah Whitacre @mkwhit

Page 11:

Semantic Chart Search

Cloud Based EMR

Medical Alerting System

Population Health Management

Page 12:

Problem moves from scaling architecture...

Page 13:

Problem moves from not only scaling architecture...

To how to scale the knowledge

Page 19:

Battling the 3 V’s

Daily, weekly, monthly uploads

2+ TB of streaming data daily

60+ different data formats

Constant streams for near real time

Page 20:

Population Health

Normalize Data

Apply Algorithms

Load Data for Displays

HBase, Solr, Vertica

Avro, CSV, Vertica, HBase

Page 21:

[Processing pipeline diagram]

Process Reference Data

Process Raw Person Data

Process Raw Data using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Format labels on the diagram: Avro, CSV, CSV

Page 22:

Mapper

Reducer

Page 26:

Struggle to fit into single MapReduce job

Integration done through persistence

Custom impls of common patterns

Evolving Requirements

Page 27:

[Processing pipeline diagram, repeated from Page 21, with two added steps]

Anonymize Data: Avro

Prep for Bulk Load: HBase

Page 28:

Easy integration between teams

Focus on processing steps

Shallow learning curve

Ability to tune for performance

Page 29:

Apache Crunch

Compose processing into pipelines

Open Source FlumeJava impl

Utilizes POJOs (hides serialization)

Transformation through fns (not job)

Page 30:

[Processing pipeline diagram, repeated from Page 21]

Page 31:

[Processing pipeline diagram, repeated from Page 21]

Processing Pipeline

Page 32:

Pipeline

Programmatic description of DAG

Supports lazy execution

MapReduce, Spark, Memory

Implementations indicate runtime
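
A minimal plain-Java sketch of the lazy-execution idea on this slide. It is not Crunch's planner or API: the class `LazyPipelineSketch` and its methods `then`, `done`, and `executedSteps` are hypothetical names used only for illustration. Steps are recorded when declared and run only when `done` is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustration only: records steps first, executes them later. */
final class LazyPipelineSketch {
    // The "description" of the work: an ordered list of steps.
    private final List<Function<List<String>, List<String>>> steps = new ArrayList<>();
    private int executedSteps = 0;

    /** Declares a step without running it. */
    LazyPipelineSketch then(Function<List<String>, List<String>> step) {
        steps.add(step);
        return this;
    }

    /** Number of steps that have actually run so far. */
    int executedSteps() {
        return executedSteps;
    }

    /** Runs every recorded step, in order, against the input. */
    List<String> done(List<String> input) {
        List<String> current = input;
        for (Function<List<String>, List<String>> step : steps) {
            current = step.apply(current);
            executedSteps++;
        }
        return current;
    }
}
```

Declaring steps with `then` leaves `executedSteps()` at zero; only `done` triggers the work, which is the property that lets a real planner inspect the whole graph before choosing how to run it.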

Page 33:

Pipeline pipeline = MemPipeline.getInstance();

Pipeline pipeline = new MRPipeline(Driver.class, conf);

Pipeline pipeline = new SparkPipeline(sparkContext, "app");

Page 34:

Source

Reads various inputs

At least one required per pipeline

Custom implementations

Creates initial collections for processing

Page 35:

Source

Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV

Strings, Avro Records, Results, POJOs, Protobufs, Thrift, Writables

Page 36:

pipeline.read(From.textFile(path));

Page 37:

pipeline.read(new TextFileSource<>(path, ptype));

Page 38:

PType<String> ptype = …;
pipeline.read(new TextFileSource<>(path, ptype));

Page 39:

PType

Hides serialization

Exposes data in native Java forms

Avro, Thrift, and Protocol Buffers

Supports composing complex types
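
To make "hides serialization" and "composing complex types" concrete, here is a plain-Java sketch. It does not use Crunch's PType classes: `TypeSketch`, `Codec`, `strings`, `ints`, and `pairs` are hypothetical names, and the byte layout is an assumption chosen for brevity. Callers work with native Java values while the codec handles the bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Illustration only: a tiny stand-in for a serialization-hiding type. */
final class TypeSketch {
    /** Converts between a native Java value and its serialized bytes. */
    interface Codec<T> {
        byte[] write(T value);
        T read(byte[] bytes);
    }

    static Codec<String> strings() {
        return new Codec<String>() {
            public byte[] write(String value) { return value.getBytes(StandardCharsets.UTF_8); }
            public String read(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8); }
        };
    }

    static Codec<Integer> ints() {
        return new Codec<Integer>() {
            public byte[] write(Integer value) { return ByteBuffer.allocate(4).putInt(value).array(); }
            public Integer read(byte[] bytes) { return ByteBuffer.wrap(bytes).getInt(); }
        };
    }

    /** Composes two codecs into one for pairs; the first element is length-prefixed. */
    static <A, B> Codec<Map.Entry<A, B>> pairs(Codec<A> first, Codec<B> second) {
        return new Codec<Map.Entry<A, B>>() {
            public byte[] write(Map.Entry<A, B> value) {
                byte[] a = first.write(value.getKey());
                byte[] b = second.write(value.getValue());
                return ByteBuffer.allocate(4 + a.length + b.length)
                        .putInt(a.length).put(a).put(b).array();
            }
            public Map.Entry<A, B> read(byte[] bytes) {
                ByteBuffer buf = ByteBuffer.wrap(bytes);
                byte[] a = new byte[buf.getInt()];
                buf.get(a);
                byte[] b = new byte[buf.remaining()];
                buf.get(b);
                return new SimpleEntry<>(first.read(a), second.read(b));
            }
        };
    }
}
```

A pair codec is built the same way a pair type is composed on the later slides: from the codecs of its two parts.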

Page 40:

Multiple Serialization Types

Serialization Type = PTypeFamily

Can’t mix families in single type

Avro & Writable available

Can easily convert between families

Page 41:

PType<Integer> intTypes = Writables.ints();

PType<String> stringType = Avros.strings();

PType<Person> personType = Avros.records(Person.class);

Page 42:

PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

Page 43:

PTableType<String, Person> tableType = Avros.tableOf(stringType, personType);

Page 44:

PType<String> ptype = …;
PCollection<String> strings = pipeline.read(
    new TextFileSource<>(path, ptype));

Page 45:

PCollection

Immutable

Not created directly; only read from a source or produced by a transformation

Unsorted

Represents potential data

Page 46:

[Processing pipeline diagram, repeated from Page 21]

Page 47:

Process Reference Data

PCollection<String> -> PCollection<RefData>

Page 48:

DoFn

Simple API to implement

Location for custom logic

Transforms PCollection between forms

Processes one element at a time

Page 49:

DoFn - for each item emits 0:M items

FilterFn - returns boolean

MapFn - emits 1:1

Page 50:

DoFn API

class ExampleDoFn extends DoFn<String, RefData> { ... }

Type of Data In: String

Type of Data Out: RefData

Page 51:

public void process(String s, Emitter<RefData> emitter) {
  RefData data = …;
  emitter.emit(data);
}

Type of Data In: String

Type of Data Out: RefData
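
The relationship between DoFn, MapFn, and FilterFn can be sketched in plain Java. This is an illustration under assumptions, not Crunch's classes: in Crunch these are abstract classes, while the sketch uses interfaces with the same names nested inside a hypothetical `DoFnSketch`, and `parallelDo` here is a simple in-memory loop over a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: MapFn and FilterFn as special cases of DoFn. */
final class DoFnSketch {
    /** Receives whatever a function emits. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** General form: each input may emit zero or more outputs. */
    interface DoFn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    /** 1:1 case: exactly one output per input. */
    interface MapFn<S, T> extends DoFn<S, T> {
        T map(S input);

        @Override
        default void process(S input, Emitter<T> emitter) {
            emitter.emit(map(input));
        }
    }

    /** Keep-or-drop case: emits the input only when accepted. */
    interface FilterFn<S> extends DoFn<S, S> {
        boolean accept(S input);

        @Override
        default void process(S input, Emitter<S> emitter) {
            if (accept(input)) {
                emitter.emit(input);
            }
        }
    }

    /** Applies fn to the elements one at a time and collects the output. */
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S item : input) {
            fn.process(item, out::add);
        }
        return out;
    }
}
```

Because each function only ever sees one element and an emitter, the same function can run unchanged whether the elements come from memory or from a distributed job.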

Page 52:

PCollection<String> refStrings = …;
PCollection<RefData> refs = refStrings.parallelDo(fn, Avros.records(RefData.class));

Page 53:

PCollection<String> dataStrs = …;
PCollection<Data> data = dataStrs.parallelDo(diffFn, Avros.records(Data.class));

Page 54:

[Processing pipeline diagram, repeated from Page 21]

Page 55:

Hmm now I need to join...

We need a PTable

But they don’t have a common key?

Page 56:

PTable<K, V>

Immutable & Unsorted

Variation of PCollection<Pair<K, V>>

Multimap of Keys and Values

Joins, Cogroups, Group By Key

Page 57:

class ExampleDoFn extends DoFn<String, RefData>{ ...}

Page 58:

class ExampleDoFn extends DoFn<String, Pair<String, RefData>>{

...}

Page 59:

PCollection<String> refStrings = …;
PTable<String, RefData> refs = refStrings.parallelDo(fn,
    Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

Page 60:

PTable<String, RefData> refs = …;
PTable<String, Data> data = …;

Page 61:

data.join(refs);

(inner join)

Page 62:

PTable<String, Pair<Data, RefData>> joinedData = data.join(refs);

Page 63:

Joins

right, left, inner, outer

Mapside, BloomFilter, Sharded

Eliminates custom impls
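
For readers new to keyed joins, this plain-Java sketch shows what an inner join over two key/value tables produces. It is not how Crunch implements joins: `JoinSketch` and `innerJoin` are hypothetical names, tables are modeled as lists of `Map.Entry`, and the approach is a simple in-memory hash join.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: in-memory inner join of two multimap-style tables. */
final class JoinSketch {
    /** Emits one (key, (left, right)) entry per matching pair of values. */
    static <K, V, U> List<Map.Entry<K, Map.Entry<V, U>>> innerJoin(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, U>> right) {
        // Index the right side by key; a key may carry several values.
        Map<K, List<U>> index = new HashMap<>();
        for (Map.Entry<K, U> r : right) {
            index.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        List<Map.Entry<K, Map.Entry<V, U>>> out = new ArrayList<>();
        for (Map.Entry<K, V> l : left) {
            List<U> matches = index.get(l.getKey());
            if (matches == null) {
                continue; // inner join: keys without a match are dropped
            }
            for (U match : matches) {
                Map.Entry<V, U> pair = new SimpleEntry<>(l.getValue(), match);
                out.add(new SimpleEntry<>(l.getKey(), pair));
            }
        }
        return out;
    }
}
```

The result has the same shape as the joined table on Page 62: each key maps to a pair holding the left value and the right value.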

Page 64:

[Processing pipeline diagram, repeated from Page 21]

Page 65:

[Processing pipeline diagram, repeated from Page 21]

Page 66:

FilterFn API

class MyFilterFn extends FilterFn<...> { ... }

Type of Data In: the FilterFn type parameter

Page 67:

public boolean accept(... value) {
  return value > 3;
}

Page 68:

PCollection<Model> values = …;
PCollection<Model> filtered = values.filter(new MyFilterFn());

Page 69:

[Processing pipeline diagram, repeated from Page 21]

Page 70:

PTable<String,Model> models = …;

Keyed By PersonId

Page 71:

PTable<String, Model> models = …;
PGroupedTable<String, Model> groupedModels = models.groupByKey();

Page 72:

PGroupedTable<K, V>

Immutable & Sorted

PCollection<Pair<K, Iterable<V>>>
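
A plain-Java sketch of the grouped shape described here: each key paired with all of its values. It is not Crunch's shuffle. `GroupByKeySketch` and `groupByKey` are hypothetical names, and the sketch assumes "Sorted" refers to key order, which it models with a `TreeMap`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustration only: collect every value under its key, keys kept in order. */
final class GroupByKeySketch {
    static <K extends Comparable<K>, V> SortedMap<K, List<V>> groupByKey(
            List<Map.Entry<K, V>> table) {
        SortedMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> entry : table) {
            // Start a new value list the first time a key is seen.
            grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                   .add(entry.getValue());
        }
        return grouped;
    }
}
```

In the person example, the key would be the person id and the value list would hold every model for that person, ready to be combined into one record.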

Page 73:

[Processing pipeline diagram, repeated from Page 21]

Page 74:

[Processing pipeline diagram, repeated from Page 21]

Page 75:

PCollection<Person> persons = …;

Page 76:

PCollection<Person> persons = …;
pipeline.write(persons, To.avroFile(path));

Page 77:

PCollection<Person> persons = …;
pipeline.write(persons, new AvroFileTarget(path));

Page 78:

Target

Persists PCollection

At least one required per pipeline

Custom implementations

Page 79:

Target

Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV

Strings, Avro Records, Results, POJOs, Protobufs, Thrift, Writables

Page 80:

[Processing pipeline diagram, repeated from Page 21]

Page 81:

Execution

Pipeline pipeline = …;
...
pipeline.write(...);
PipelineResult result = pipeline.done();

Page 82:

[Processing pipeline diagram, repeated from Page 21, annotated with execution phases]

Map, Reduce, Reduce

Page 83:

Tuning

GroupingOptions/ParallelDoOptions

Scale factors

Tweak pipeline for performance

Page 84:

Focus on the transformations

Smaller learning curve

Functionality first

Less fragility

Page 85:

Extend pipeline for new features

Iterate with confidence

Integration through PCollections