Pachyderm: Data Storage and Processing with Docker



Post on 22-Jan-2018


TRANSCRIPT

Pachyderm: Data Storage and Processing with Docker

Joe Doliner, Founder & CEO

[email protected]

AdRoll’s Architecture

• Storage: Amazon S3

• Scheduler: Luigi

• Packaging: Docker

AdRoll’s Architecture for everyone else

• Storage: Pachyderm File System (PFS)

• Scheduler: Pachyderm Pipeline System (PPS)

• Packaging: Docker

• Open source

• Generalized for different use cases

• End-to-end solution

• Leverages Docker ecosystem

What is PFS?

A copy-on-write distributed file system

Core storage for Pachyderm


Copy-on-write is the paradigm that “powers” technologies like Docker and Spark
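To make the copy-on-write idea concrete, here is a minimal sketch in Python. It is not Pachyderm's implementation: `BlockStore` and `commit` are hypothetical names, and the sketch only assumes that file contents are stored once by content hash while each snapshot is a small table of paths pointing at those blocks.

```python
import hashlib


class BlockStore:
    """Hypothetical content-addressed store: identical bytes are kept once."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(key, data)  # no-op if the block already exists
        return key

    def get(self, key: str) -> bytes:
        return self.blocks[key]


def commit(store, parent, changes):
    """Return a new snapshot (path -> block key); the parent is never mutated."""
    snapshot = dict(parent)  # copies only the small path table, not the data
    for path, data in changes.items():
        snapshot[path] = store.put(data)
    return snapshot


store = BlockStore()
c0 = commit(store, {}, {"users.csv": b"alice\n", "events.csv": b"click\n"})
c1 = commit(store, c0, {"events.csv": b"click\nview\n"})
```

Only `events.csv` gets a new block in `c1`; `users.csv` is shared between both snapshots, and `c0` still reads back exactly as it was written.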

Why is this cool?

• View diffs of your data

• Instantly revert to previous state

• Immutability

• Reduce storage needs

• Branching

[Diagram: a chain of commits, Commit 0 through Commit 4]

Git for huge data sets
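The properties listed above (diffs, instant revert, immutability, branching) can be sketched with a toy commit chain. This is an illustration under assumptions, not the PFS API: `Repo`, `Commit`, and `diff` are hypothetical names, and file contents are plain strings for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    id: int
    files: dict  # path -> content; treated as read-only once committed
    parent: Optional["Commit"] = None


class Repo:
    """Hypothetical repo: branches are just names pointing at commits."""

    def __init__(self):
        self.commits = []
        self.branches = {"master": None}

    def commit(self, branch, changes):
        parent = self.branches[branch]
        files = dict(parent.files) if parent else {}
        files.update(changes)
        new = Commit(len(self.commits), files, parent)
        self.commits.append(new)
        self.branches[branch] = new
        return new

    def branch(self, name, source):
        # Branching copies a pointer, not data.
        self.branches[name] = self.branches[source]

    def revert(self, branch, commit):
        # Reverting moves the pointer back; old commits stay untouched.
        self.branches[branch] = commit


def diff(old, new):
    """Paths whose content differs between two commits."""
    paths = set(old.files) | set(new.files)
    return sorted(p for p in paths if old.files.get(p) != new.files.get(p))


repo = Repo()
c0 = repo.commit("master", {"a.txt": "1"})
c1 = repo.commit("master", {"b.txt": "2"})
repo.branch("experiment", "master")
c2 = repo.commit("experiment", {"a.txt": "9"})
repo.revert("master", c0)
```

After these calls `master` points at `c0` again, `experiment` continues from `c1`, and `diff(c0, c1)` reports only `b.txt`.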

What is PPS?

• Schedules dependency graph

• Manages containerized pipelines

• Understands copy-on-write storage

[Diagram: dependency graph of Task 1 through Task 6, feeding a Dashboard]
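Scheduling a dependency graph like this one amounts to a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The edges in `deps` are an assumption made for illustration, since the slide's arrows did not survive in the transcript, and `schedule` is a hypothetical helper, not part of PPS.

```python
from graphlib import TopologicalSorter

# Hypothetical edges: each task maps to the set of tasks it depends on.
deps = {
    "Task 1": set(),
    "Task 2": {"Task 1"},
    "Task 3": {"Task 1"},
    "Task 4": {"Task 2"},
    "Task 5": {"Task 3"},
    "Task 6": {"Task 4", "Task 5"},
    "Dashboard": {"Task 6"},
}


def schedule(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule(deps)
```

Each wave contains only tasks whose upstream tasks have already finished, which is the ordering guarantee a pipeline scheduler needs before it can hand work to containers.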

PPS + PFS is…

Efficient: incremental processing

[Diagram: commits 0 through 3 feed a Data Analysis graph of Task 1 through Task 6 and a Dashboard. When a commit adds 1% more data, only Task 4, Task 6, and the Dashboard appear in the rerun.]

Only process jobs that rely on the data that changed
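One way to picture incremental processing is as a walk downstream from whatever changed. In this sketch the graph edges and the input names `repo_a` and `repo_b` are hypothetical, chosen so that a change to `repo_b` touches the same three nodes the slide shows; `affected` is an illustrative helper, not a Pachyderm function.

```python
from collections import deque

# Hypothetical graph: each node maps to the inputs it reads from.
deps = {
    "Task 1": {"repo_a"},
    "Task 2": {"Task 1"},
    "Task 3": {"Task 1"},
    "Task 4": {"Task 2", "repo_b"},
    "Task 5": {"Task 3"},
    "Task 6": {"Task 4", "Task 5"},
    "Dashboard": {"Task 6"},
}


def affected(deps, changed):
    """Return every node downstream of the changed inputs."""
    downstream = {}
    for node, upstreams in deps.items():
        for up in upstreams:
            downstream.setdefault(up, set()).add(node)

    seen = set()
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for nxt in downstream.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


to_rerun = affected(deps, {"repo_b"})
```

Here `to_rerun` contains only Task 4, Task 6, and the Dashboard; Tasks 1, 2, 3, and 5 keep their previous outputs because nothing they read has changed.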

PPS + PFS is…

Flexible: both batched pipelines and streaming

Daily batched pipelines

[Diagram: commits 0, 1, and 2, taken at ∆Time = 1 day, feed the Data Analysis graph of Task 1 through Task 6 and a Dashboard]

Large batched DAG that processes all the new data each day


Streaming updates

[Diagram: commits 0 through 4, taken at ∆Time = 1 second, feed the Data Analysis graph of Task 1 through Task 6 and a Dashboard]

Micro-batches that update constantly as new data streams in

Commits are insanely cheap so you can take one every second
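Seen this way, batch and streaming differ only in how often a commit is taken. The sketch below groups timestamped records into one batch per interval; `micro_batches` and the sample records are hypothetical, and the sketch assumes the records arrive sorted by timestamp.

```python
from itertools import groupby


def micro_batches(records, interval=1.0):
    """Group (timestamp_seconds, payload) pairs into one batch per interval.

    Assumes records are sorted by timestamp; empty intervals yield no batch.
    """
    keyed = groupby(records, key=lambda r: int(r[0] // interval))
    return [[payload for _, payload in group] for _, group in keyed]


records = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (3.0, "d")]

per_second = micro_batches(records, interval=1.0)
per_day = micro_batches(records, interval=86400.0)
```

With a one-second interval the records fall into three small batches; with a one-day interval the same function produces a single large batch, which is the daily pipeline case.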

PPS + PFS is…

Resilient: seamless pipeline restarts

[Diagram: dependency graph of Task 1 through Task 6, feeding a Dashboard]

Monitoring

$ Task 2 failed

$ Task 4 and 6 waiting…

… Fixing code …

$ Task 2 resuming...

$ Task 2 complete!

$ Task 4 starting…
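A restart is seamless when finished work is remembered and only the failed task and its successors run again. The sketch below is a simplified, assumed model of that behavior: `run_pipeline`, the three-task order, and the `fixed` flag standing in for "fixing code" are all hypothetical.

```python
def run_pipeline(order, actions, completed):
    """Run tasks in order, skipping finished ones; stop at the first failure."""
    for name in order:
        if name in completed:
            continue  # output from an earlier run is still valid
        try:
            actions[name]()
        except Exception as exc:
            return f"{name} failed: {exc}"
        completed.add(name)
    return "ok"


calls = []
fixed = {"value": False}  # flips to True once the bug is "fixed"


def make_task(name):
    def run():
        calls.append(name)
    return run


def task_2():
    if not fixed["value"]:
        raise RuntimeError("bug")
    calls.append("Task 2")


order = ["Task 1", "Task 2", "Task 4"]
actions = {"Task 1": make_task("Task 1"), "Task 2": task_2, "Task 4": make_task("Task 4")}
completed = set()

first = run_pipeline(order, actions, completed)   # Task 2 fails, Task 4 waits
fixed["value"] = True                             # ... fixing code ...
second = run_pipeline(order, actions, completed)  # resumes at Task 2
```

Task 1 runs only once across both attempts, because its result was recorded in `completed` before Task 2 failed.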

PPS + PFS is…

Cost-effective: resource management through delayed execution

• PFS storage nodes: copy-on-write storage nodes

• PPS: elastically scaling computation nodes

[Diagram: d2.8xlarge instances alongside PPS nodes and Spot instances]

Elastically add spot instances when prices are cheap or needs are high

[Diagram: the PFS storage nodes and PPS computation nodes, with S3 added as slow/cheap storage]

Back up data to S3 for long-term storage or “cold” data

Summary

• Container ecosystem is powerful

• Copy-on-write data is really powerful

• Containers plus copy-on-write is insanely powerful

Thank You!

Questions?

Pachyderm.io

[email protected]