Dealing with Drift: Building an Enterprise Data Lake

Upload: pat-patterson

Post on 29-Jan-2018

Speakers

Nathan Swetye, Sr. Manager of Platform Engineering, Cox Automotive

Michael Gay, Lead Technical Architect, Cox Automotive

Pat Patterson, Community Champion, StreamSets

Cox Automotive

• 25 (and growing) companies in the automotive space
• Spans the full vehicle ownership lifecycle
• Data perceived as the integration point for all companies

StreamSets Overview

Mission: empower enterprises to harness their data in motion.

• Enterprise Data DNA
• Commercial Customers Across Verticals
• Strong Partner Ecosystem
• Open Source Success
• 150,000 downloads
• 40 of the Fortune 100
• Doubling each quarter

StreamSets Data Collector™: Instrumented, open source UI and engine to build any-to-any dataflows.

StreamSets Dataflow Performance Manager (DPM™): Cloud service to map, measure and master dataflow operations.

Dataflow Lifecycle (StreamSets Enterprise)

• Develop: Developers, Scientists, Architects
• Operate: Operators, Stewards, Architects
• Evolve (Proactive)
• Remediate (Reactive)

StreamSets Data Collector

StreamSets Data Collector is a complete IDE for building and executing any-to-any ingest pipelines.

• Efficiency: Intent Driven Flows, Batch & Streaming Ingest, In-stream Sanitization
• Control: Fine-grained Stage & Flow Metrics, Drift Handling, Lineage and Impact Analysis Capture
• Agility: Flexible Deployment, Exception Handling, Seamless Evolution

StreamSets Dataflow Performance Manager (DPM)

StreamSets DPM provides a single pane of glass to map, measure and master your dataflow operations.

• Map: Dataflow Lineage, Live Data Architecture
• Measure: Any Path, Any Time
• Master: Availability & Accuracy, Proactive Remediation

Data Drift: Change is the New Normal

The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data

• Structure Drift
• Semantic Drift
• Infrastructure Drift
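
To make structure drift concrete, here is a minimal sketch that compares an incoming record with the fields a pipeline expects. The schema, field names and `detect_structure_drift` helper are hypothetical illustrations, not part of StreamSets or of the Cox Automotive pipelines.

```python
from typing import Any, Dict, List

# Hypothetical expected schema (assumption): field name -> Python type.
EXPECTED_SCHEMA: Dict[str, type] = {"vin": str, "price": float, "listed_at": str}


def detect_structure_drift(record: Dict[str, Any],
                           schema: Dict[str, type]) -> List[str]:
    """Describe how a record departs from the expected schema."""
    findings = []
    # Fields the upstream system added without notice.
    for name in sorted(record.keys() - schema.keys()):
        findings.append(f"new field: {name}")
    # Fields the upstream system dropped or renamed.
    for name in sorted(schema.keys() - record.keys()):
        findings.append(f"missing field: {name}")
    # Fields that are still present but changed type.
    for name in sorted(record.keys() & schema.keys()):
        if not isinstance(record[name], schema[name]):
            findings.append(
                f"type change: {name} is {type(record[name]).__name__}, "
                f"expected {schema[name].__name__}"
            )
    return findings


if __name__ == "__main__":
    drifted = {"vin": "1HGCM82633A004352", "price": "18,500", "dealer_id": 42}
    for finding in detect_structure_drift(drifted, EXPECTED_SCHEMA):
        print(finding)
```

A check like this only catches structure drift. Semantic drift (same field, different meaning) and infrastructure drift need other signals.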

SQL on Hadoop (Hive) Y/Y Click Through Rate

80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis

Example: Data Loss and Corrosion

Data Drift and Scale

At the micro level, data drift leads to breakage and errors

At the macro level, data drift brings your system to a grinding halt!

The Problem of Data Exchange at Scale

Everyone wants each other’s data, but it is often difficult to acquire

A tangled mess of data flow

A source of anguish and sorrow

The Problem of Data Exchange at Scale: Enter the Data Lake

The central store for valuable data

Mission: Data Lake, not Data Swamp

Data Lake

Great. A Data Lake. But how do you Populate it?

Problem: $$ Cost – a Question of Scale
• 25 Companies
• 9+ Source Types, mostly DBs
• 1-Many Schemas per Database
• Many Tables per Schema

Example:
• AutoTrader -> Oracle -> ATM1: ~1600 Tables

We’ve ingested about that much

Back to Square 0

Cox Automotive’s StreamSets Architecture

• Sources: Databases, Amazon S3, Files, FTP
• Acquisition: StreamSets instances
• Ingestion: StreamSets instances
• Targets: Hadoop Filesystem, Big Data SQL, Amazon S3

Data Pipelines

• Separates Acquisition from Ingestion
• Dynamic Error Handling
• Encrypted Data in Transit
• Ingestion farm scales with demand
• Auto-creates schemas en route
• Data standards applied automatically: Compression, File Formats, Partitioning Schemes, Row-level Watermarks, Time-stamping
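
As an illustration of the data standards listed above, the sketch below applies time-stamping, a row-level watermark, a partitioning scheme and compression to a batch of records. Everything here is an assumption made for the example: the field names `_ingested_at` and `_watermark`, the `year=/month=/day=` layout, gzip-compressed JSON lines as the file format, and the reading of "watermark" as a per-row hash of source and content. The slides do not say how Cox Automotive implements these standards.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Iterable


def stamp_record(record: Dict[str, Any], source: str,
                 now: datetime) -> Dict[str, Any]:
    """Return a copy of the record with an ingest timestamp and a watermark."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    stamped = dict(record)
    # Time-stamping: when the row entered the lake (hypothetical field name).
    stamped["_ingested_at"] = now.isoformat()
    # Row-level watermark (assumed meaning): hash of the source plus row content.
    stamped["_watermark"] = hashlib.sha256(
        source.encode("utf-8") + b"|" + payload
    ).hexdigest()
    return stamped


def partition_path(source: str, now: datetime) -> str:
    """Partitioning scheme (assumed): one directory per source and day."""
    return f"{source}/year={now:%Y}/month={now:%m}/day={now:%d}"


def encode_batch(records: Iterable[Dict[str, Any]]) -> bytes:
    """File format and compression (assumed): gzip-compressed JSON lines."""
    lines = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(lines.encode("utf-8"))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    row = stamp_record({"vin": "1HGCM82633A004352"}, "atm1", now)
    print(partition_path("atm1", now), len(encode_batch([row])), "bytes")
```

Applying the standards in one shared step, rather than per source, is what lets every new pipeline inherit them without extra work.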

• Data comes from a variety of sources
• Pipelines are established for each source
• Ingestion Back Pressure
• Scaling, secure, load-balanced
• Actual ingestion activities
• On-premises and Cloud Big Data Systems

Diagram labels: StreamSets RPC, Load Balancer
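
The back pressure idea between acquisition and ingestion can be sketched with a bounded buffer: when the ingestion side falls behind, the acquisition side is told to wait instead of overwhelming it. This is a generic illustration using Python's standard `queue` module, not the StreamSets RPC mechanism; the `offer` and `drain` helpers and the buffer size are hypothetical.

```python
import queue
from typing import List


def offer(buffer: "queue.Queue[dict]", record: dict) -> bool:
    """Acquisition side: try to hand a record over; False means 'slow down'."""
    try:
        buffer.put_nowait(record)
        return True
    except queue.Full:
        # Buffer is full: the ingestion side is behind, so apply back pressure.
        return False


def drain(buffer: "queue.Queue[dict]", max_records: int) -> List[dict]:
    """Ingestion side: take up to max_records from the buffer."""
    drained = []
    while len(drained) < max_records:
        try:
            drained.append(buffer.get_nowait())
        except queue.Empty:
            break
    return drained


if __name__ == "__main__":
    buffer: "queue.Queue[dict]" = queue.Queue(maxsize=2)
    accepted = [offer(buffer, {"row": i}) for i in range(3)]
    print(accepted)                     # the third offer is refused
    drain(buffer, 1)                    # ingestion catches up by one record
    print(offer(buffer, {"row": 2}))    # the retry now succeeds
```

A refused offer is a signal to retry later, so no record is dropped when the ingestion farm is saturated.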

Acquisition Deployment Model

• Actors: Ingestion Team Member, DevOps Team Member
• Components: Ingest Form, Virtual Host Deployment, StreamSets Pipeline Deployment, StreamSets Acquisition Pipeline
• Steps: submit form, start workflow, build virtual host, start workflow, deploy data pipeline
• Endpoints: Enterprise Data Sources, Enterprise Data Lake

Throughput!

Chart: Monthly Ingestion Requests, Jan to Sept (y-axis 0 to 400), annotated "StreamSets" and "7x"

Live Environment

Where do we go from Here?

• Amazon Web Services
• StreamSets Dataflow Performance Manager
• Acquire/Ingest decision point: Centralized, Federated, or Democratized?
• Quality
• Streamline access to sources
• Change data capture
• Integration with enterprise data catalogs
• Ingestion post-processing

Questions