Building Continuously Curated Ingestion Pipelines: Recipes for Success

Arvind Prabhakar


TRANSCRIPT

Page 1: Building Continuously Curated Ingestion Pipelines

Building Continuously Curated Ingestion Pipelines

Recipes for Success

Arvind Prabhakar

Page 2: Building Continuously Curated Ingestion Pipelines

“Thru 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges”

@nheudecker tweet (Gartner, 26 Feb 2015)

Page 3: Building Continuously Curated Ingestion Pipelines

What is Data Ingestion?

Acquiring data as it is produced from data source(s)

Transforming it into a consumable form

Delivering the transformed data to the consuming system(s)

The challenge: doing this continuously, and at scale, across a wide variety of sources and consuming systems!

Page 4: Building Continuously Curated Ingestion Pipelines

Why is Data Ingestion Difficult? (Hint: Drift)

Modern data sources and consuming applications evolve rapidly (Infrastructure Drift)

The structure of the data produced changes without notice, independently of consuming applications (Structural Drift)

Data semantics change over time as the same data powers new use cases (Semantic Drift)

Page 5: Building Continuously Curated Ingestion Pipelines

Continuous data ingestion is critical to the success of modern data environments

Page 6: Building Continuously Curated Ingestion Pipelines

Plan Your Ingestion Infrastructure Carefully!

Plan ahead: Allocate time and resources specifically for building out your data ingestion infrastructure

Plan for future: Design ingestion infrastructure with sufficient extensibility to accommodate unknown future requirements

Plan for correction: Incorporate low-level instrumentation to help understand the effectiveness of your ingestion infrastructure, correcting it as your systems evolve

Page 7: Building Continuously Curated Ingestion Pipelines

The Benefits of Well-Designed Data Ingestion

Minimal effort needed to accommodate changes: Handle upgrades, onboard new data sources/consuming applications/analytics technologies, etc.

Increased confidence in data quality: Rest assured that your consuming applications are working with correct, consistent and trustworthy data

Reduced latency for consumption: Allow rapid consumption of data and remove any need for manual intervention to enable consuming applications

Page 8: Building Continuously Curated Ingestion Pipelines

Recipe #1: Create a decoupled ingest infrastructure

An Independent Infrastructure between Data Sources and Consumers

Page 9: Building Continuously Curated Ingestion Pipelines

Decoupled Ingest Infrastructure

• Decouple data format and packaging from source to destination. Example: read CSV files and write SequenceFiles.

• Break down input data into its smallest meaningful representation. Example: an individual log record, an individual tweet record, etc.

• Implement asynchronous data movement from source to destination. The data movement process is independent of the source and consumer processes (see the sketch below).

An Independent Infrastructure between Data Sources and Consumers
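As a rough illustration of that last bullet, the Java sketch below breaks a CSV file into individual records and moves them to an independent consumer thread through a bounded queue. The file name, queue capacity, and console "destination" are placeholders, not part of any of the tools discussed here.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.*;

    // Minimal sketch of decoupled, asynchronous data movement: the producer
    // breaks the input into its smallest meaningful unit (one CSV line) and the
    // consumer drains the queue independently of the producer's pace.
    public class DecoupledIngestSketch {
        private static final String END_OF_INPUT = "__END__";

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
            ExecutorService pool = Executors.newFixedThreadPool(2);

            pool.submit(() -> {                       // producer: read and split
                for (String line : Files.readAllLines(Paths.get("input.csv"))) {
                    queue.put(line);                  // blocks if the consumer lags
                }
                queue.put(END_OF_INPUT);
                return null;
            });

            pool.submit(() -> {                       // consumer: deliver records
                String record;
                while (!(record = queue.take()).equals(END_OF_INPUT)) {
                    System.out.println("delivering: " + record);
                }
                return null;
            });

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }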

Page 10: Building Continuously Curated Ingestion Pipelines

Decoupled Ingestion Using Apache Flume

Fan-in Log Aggregation

• Load data into client-tier Flume agents (see the client sketch below)

• Route data through intermediate tiers

• Deposit data via collector tiers

[Diagram: log files feed Tier-1 (client) agents, which route through Tier-2 (intermediate) agents to Tier-3 (collector) agents that deposit data into the Hive warehouse.]
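For the first of these steps, loading data into a client-tier agent, Flume ships a Java RPC client. A minimal sketch, assuming an agent with an Avro source listening on localhost:41414 (the host, port, and log line are placeholders):

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    // Sends one log line as a Flume event to a client-tier agent's Avro source.
    public class FlumeClientSketch {
        public static void main(String[] args) throws Exception {
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody(
                        "2015-02-26 12:00:00 INFO request served", StandardCharsets.UTF_8);
                client.append(event);   // blocks until the agent acknowledges delivery
            } finally {
                client.close();
            }
        }
    }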

Page 11: Building Continuously Curated Ingestion Pipelines

Decoupled Ingestion Using Apache Flume + StreamSets

Use StreamSets Data Collector to onboard data into Flume from a variety of sources.

[Diagram: an SDC tier replaces the Tier-1 client agents and feeds the Tier-2 and Tier-3 Flume agents, which deposit data into the Hive warehouse.]

Page 12: Building Continuously Curated Ingestion Pipelines

Decoupled Ingestion Using Apache Kafka

Pub-Sub Log Flow

• Producers load data into Kafka topics from various log files (see the sketch below)

• Consumers read data from the topics and deposit it to the destination

[Diagram: producers read log files and publish to Apache Kafka topics; consumers subscribe to the topics and deposit data into the Hive warehouse.]
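A minimal sketch of the producer side using the standard Kafka Java client; the broker address, topic name, and log file name are placeholders:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Publishes each line of a log file as one record on the "logs" topic.
    public class LogProducerSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (String line : Files.readAllLines(Paths.get("app.log"))) {
                    producer.send(new ProducerRecord<>("logs", line));
                }
            }
        }
    }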

Page 13: Building Continuously Curated Ingestion Pipelines

Decoupled Ingestion Using Apache Kafka + StreamSets

[Diagram: Standalone SDC → Apache Kafka → Cluster SDC → Hive Warehouse.]

Page 14: Building Continuously Curated Ingestion Pipelines

Decoupled Ingest Infrastructure

Decouple Formats
- Flume: Use built-in clients, sources, serializers. Extend if necessary.
- Kafka: Use third-party producers, consumers. Write your own if necessary.
- StreamSets: Built-in any-to-any format conversion support.

Smallest Representation
- Flume: Opaque payload (Event body)
- Kafka: Opaque payload
- StreamSets: Interpreted record format

Asynchronous Data Movement
- Flume: Yes
- Kafka: Yes
- StreamSets: Yes*

An Independent Infrastructure between Data Sources and Stores/Consumers

Page 15: Building Continuously Curated Ingestion Pipelines

Poll: Tools for Ingest

What tools do you use for Ingest?

o Kafka

o Flume

o Other

Page 16: Building Continuously Curated Ingestion Pipelines

Results: Tools for Ingest

Page 17: Building Continuously Curated Ingestion Pipelines

Data scientists spend 50 to 80 percent of their time on the mundane labor of collecting and preparing unruly digital data before it can be explored.

“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, The New York Times, Steve Lohr, Aug 17, 2014

Page 18: Building Continuously Curated Ingestion Pipelines

Recipe #2: Implement in-stream data sanitization

Prepare data using in-stream transformations to make it consumption-ready

Page 19: Building Continuously Curated Ingestion Pipelines

In-Stream Data Sanitization

• Convert and enforce data types where necessary. Example: turn strings into integers to enable type matching in consuming systems.

• Filter and route data depending upon downstream requirements. Example: build routing logic to filter out invalid or low-value data to reduce processing costs.

• Enrich and transform data where needed. Example: extract and enrich credit card type information before masking the card number (see the sketch below).

Prepare data using in-stream transformations to make it consumption-ready
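A rough sketch of these three operations on a single record, represented here as a plain Map; the field names and the prefix-to-card-type rule are made up for the example.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // In-stream sanitization of one record: type conversion, filtering, and
    // enrichment plus masking. Field names and rules are illustrative only.
    public class SanitizeSketch {
        static Optional<Map<String, Object>> sanitize(Map<String, String> raw) {
            Map<String, Object> out = new HashMap<>();

            // Convert and enforce types: "age" must parse as an integer.
            try {
                out.put("age", Integer.parseInt(raw.get("age")));
            } catch (NumberFormatException e) {
                return Optional.empty();            // filter out invalid records
            }

            // Enrich, then mask: derive the card type before hiding the number.
            String card = raw.getOrDefault("card_number", "");
            out.put("card_type", card.startsWith("4") ? "VISA" : "OTHER");
            out.put("card_number",
                    card.length() < 4 ? "****" : "****" + card.substring(card.length() - 4));
            return Optional.of(out);
        }

        public static void main(String[] args) {
            Map<String, String> raw = Map.of("age", "42", "card_number", "4111111111111111");
            System.out.println(sanitize(raw));      // typed, enriched, masked record
        }
    }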

Page 20: Building Continuously Curated Ingestion Pipelines

Data Sanitization using Apache Flume

• Use Interceptors to convert data types where necessary

• Use Channel Selectors for contextual routing in conjunction with Interceptors

• Use Interceptors to enrich or transform messages

[Diagram: inside a Flume Agent, the Source passes events through Interceptors and a Channel Selector into Channels 1..n, each drained by a corresponding Sink.]

Flume provides a few built-in Interceptors for static decoration and rudimentary data transformations
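Beyond the built-in ones, a custom Interceptor is a small Java class. A sketch of the shape of the interface, normalizing a hypothetical "source" header (the header name is a placeholder):

    import java.util.List;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Skeleton of a custom Flume Interceptor that lowercases a "source" header.
    public class NormalizeInterceptor implements Interceptor {
        @Override public void initialize() { }

        @Override
        public Event intercept(Event event) {
            String source = event.getHeaders().get("source");
            if (source != null) {
                event.getHeaders().put("source", source.toLowerCase());
            }
            return event;
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            events.replaceAll(this::intercept);     // apply per-event logic to a batch
            return events;
        }

        @Override public void close() { }

        // Flume instantiates interceptors through a nested Builder.
        public static class Builder implements Interceptor.Builder {
            @Override public Interceptor build() { return new NormalizeInterceptor(); }
            @Override public void configure(Context context) { }
        }
    }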

Page 21: Building Continuously Curated Ingestion Pipelines

Data Sanitization using Apache Kafka

• No built-in support for data types beyond a broad value type

• Use custom producers that choose topics, thus implementing routing and filtering

• Use custom producers and consumers to implement validation and transformations where needed

Producer-consumer chains can be applied between topics to achieve data sanitization (see the sketch below)

[Diagram: producers publish raw data to Topic A in Kafka; a sanitizing consumer-producer reads Topic A and republishes cleaned data to Topic B, which downstream consumers read.]
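A minimal sketch of such a sanitizing consumer-producer in Java, assuming topics named "raw" and "clean" and a local broker; the filtering rule is illustrative.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Reads the "raw" topic, drops blank records, and republishes the rest to "clean".
    public class SanitizingChainSketch {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "sanitizer");
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(List.of("raw"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        String value = record.value();
                        if (value != null && !value.isBlank()) {      // sanitize/filter
                            producer.send(new ProducerRecord<>("clean", value));
                        }
                    }
                }
            }
        }
    }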

Page 22: Building Continuously Curated Ingestion Pipelines

Data Sanitization using StreamSets

• Built-in routing, filtering, validation using minimal schema specification


• Support for JavaScript, Python, and Java EL, as well as Java extensions

• Built-in transformations including PII masking, anonymization, etc.

Page 23: Building Continuously Curated Ingestion Pipelines

In-Stream Data Sanitization

Prepare data using in-stream transformations to make it consumption-ready

Introspection/Validation
- Flume: Minimal support. Use custom Interceptors or morphlines.
- Kafka: None. Must be done within custom producers and consumers.
- StreamSets: Extensive built-in support for introspection and validation.

Filter/Routing
- Flume: Rudimentary support using built-in Interceptors. Will require custom interceptors.
- Kafka: None. Must be done within custom producers and consumers.
- StreamSets: Sophisticated built-in contextual routing based on data introspection.

Enrich/Transform
- Flume: Minimal support. Will require custom interceptors.
- Kafka: None. Must be done within custom producers and consumers.
- StreamSets: Extensive support via built-in functions as well as scripting processors for enrichment and transformations.

Page 24: Building Continuously Curated Ingestion Pipelines

Poll: Preparing Data for Consumption

For the most part, at what point do you prepare your data for consumption?

o Before Ingest

o During Ingest

o After Ingest

Page 25: Building Continuously Curated Ingestion Pipelines

Results: Preparing Data for Consumption

Page 26: Building Continuously Curated Ingestion Pipelines

Recipe #3: Implement drift detection and handling

Enable runtime checks for detection and handling of drift

Page 27: Building Continuously Curated Ingestion Pipelines

Drift Detection and Handling

• Specify and validate intent rather than schema to catch structural drift. Example: the data must contain a certain attribute, rather than an attribute at a particular ordinal position.

• Specify and validate semantic constraints to catch semantic drift. Example: a service operating within a particular city must have bounded geo-coordinates (see the sketch below).

• Isolate ingest logic from downstream systems to enable infrastructure drift handling. Example: use isolated class-loaders to enable writing to binary-incompatible systems.

Enable runtime checks for detection and handling of drift
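The first two checks can be expressed as simple runtime predicates on each record rather than as a fixed schema. A rough sketch, with made-up field names and an arbitrary bounding box standing in for the service area:

    import java.util.Map;

    // Runtime checks for intent ("the attribute exists by name") and semantics
    // ("coordinates fall inside the service area"); names and bounds are illustrative.
    public class DriftChecksSketch {
        // Structural intent: required attributes must be present, wherever they appear.
        static boolean hasRequiredAttributes(Map<String, Object> record) {
            return record.containsKey("pickup_lat") && record.containsKey("pickup_lon");
        }

        // Semantic constraint: geo-coordinates must stay within the city's bounds.
        static boolean withinServiceArea(Map<String, Object> record) {
            double lat = ((Number) record.get("pickup_lat")).doubleValue();
            double lon = ((Number) record.get("pickup_lon")).doubleValue();
            return lat >= 37.6 && lat <= 37.9 && lon >= -122.6 && lon <= -122.3;
        }

        public static void main(String[] args) {
            Map<String, Object> record = Map.of("pickup_lat", 37.77, "pickup_lon", -122.42);
            boolean ok = hasRequiredAttributes(record) && withinServiceArea(record);
            System.out.println(ok ? "record accepted" : "record routed to error handling");
        }
    }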

Page 28: Building Continuously Curated Ingestion Pipelines

Drift Handling in Apache Flume and Kafka

No support for structural drift: Flume and Kafka are low-level frameworks and work with opaque data. You need to handle this in your code.

No support for semantic drift: This is because Flume and Kafka work with opaque data. Again, you need to handle this in your code.

No support for infrastructure drift: These systems do not provide class-loader isolation or other mechanisms to easily handle infrastructure drift, and in most cases their extension mechanisms make it impossible to handle this in your code.

Page 29: Building Continuously Curated Ingestion Pipelines

Structural Drift Handling in StreamSets

• This stream selector uses a header name to identify an attribute

• If the data is delimited (e.g. CSV), the column position could vary from record to record

• If the attribute is absent, the record flows through the default stream

Page 30: Building Continuously Curated Ingestion Pipelines

Semantic Drift Handling in StreamSets

• This stage specifies a required field as well as preconditions for field value validation

• Records that meet this constraint flow through the pipeline

• Records that do not meet this constraint are siphoned off to an error pipeline

Page 31: Building Continuously Curated Ingestion Pipelines

Infrastructure Drift Handling in StreamSets

• This pipeline duplicates the data into two Kafka instances that may be incompatible

• No recompilation or dependency reconciliation required to work in such environments

• Ideal for handling upgrades, onboarding of new clusters, applications, etc.

Page 32: Building Continuously Curated Ingestion Pipelines

Drift Detection and Handling

Structural Drift
- Flume: No support
- Kafka: No support
- StreamSets: Support for loosely defined intent specification for structural validation.

Semantic Drift
- Flume: No support
- Kafka: No support
- StreamSets: Support for complex preconditions and routing for semantic data validation.

Infrastructure Drift
- Flume: No support
- Kafka: No support
- StreamSets: Support for class-loader isolation to ensure operation in binary-incompatible environments.

Enable runtime checks for detection and handling of drift

Page 33: Building Continuously Curated Ingestion Pipelines

Poll: Prevalence of Data Drift

How often do you encounter data drift as we’ve described it?

o Never

o Occasionally

o Frequently

Page 34: Building Continuously Curated Ingestion Pipelines

Results: Prevalence of Data Drift

Page 35: Building Continuously Curated Ingestion Pipelines

Recipe #4: Use the right tool for the right job

Page 36: Building Continuously Curated Ingestion Pipelines

Use the Right Tool for the Job!

Apache Flume: Best suited for long-range bulk data transfer with basic routing and filtering support

Apache Kafka: Best suited for data democratization

StreamSets: Best suited for any-to-any data movement, drift detection and handling, complex routing and filtering, etc.

Use these systems together to build your best-in-class ingestion infrastructure!

Page 37: Building Continuously Curated Ingestion Pipelines

Sustained data ingestion is critical to the success of modern data environments

Page 38: Building Continuously Curated Ingestion Pipelines

Thank You!

Learn More at www.streamsets.com

Get StreamSets: http://streamsets.com/opensource/

Documentation: http://streamsets.com/docs

Source Code: https://github.com/streamsets/

Issue Tracker: https://issues.streamsets.com/

Mailing List: https://goo.gl/wmFPLt

Other Resources: http://streamsets.com/resources/