Pachyderm: data storage and processing with Docker
Storage Scheduler Packaging
Docker
• Open source
• Generalized for different use cases
• End-to-end solution
• Leverages Docker ecosystem
Pachyderm Pipeline System (PPS)
Pachyderm File System (PFS)
AdRoll’s Architecture for everyone else
Why is this cool?
• View diffs of your data
• Instantly revert to previous state
• Immutability
• Reduce storage needs
• Branching
[Diagram: Commit 0, Commit 1, Commit 2, Commit 3, Commit 4]
Git for huge data sets
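The commit model behind these bullets can be sketched in a few lines of Python. This is a toy illustration only, not the PFS API: the class name `TinyRepo`, its methods, and the file paths are all hypothetical. Each commit is an immutable map from path to content hash, so a new commit stores only the files that changed, and reverting amounts to reading an older commit.

```python
import hashlib


class TinyRepo:
    """Toy commit store (hypothetical, not the PFS API): commit = {path: content hash}."""

    def __init__(self):
        self.blobs = {}      # content hash -> bytes, written once and never mutated
        self.commits = [{}]  # commit 0 is empty; every later commit is a snapshot

    def commit(self, changes):
        """Create a commit on top of the latest one; only changed files add blobs."""
        snapshot = dict(self.commits[-1])  # copies hash references, not file data
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)
            snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def diff(self, a, b):
        """Paths whose content differs between commits a and b."""
        old, new = self.commits[a], self.commits[b]
        return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))

    def read(self, commit_id, path):
        """Read a file as of a given commit; old commits stay readable forever."""
        return self.blobs[self.commits[commit_id][path]]


repo = TinyRepo()
c1 = repo.commit({"logs/a": b"x", "logs/b": b"y"})
c2 = repo.commit({"logs/b": b"y2"})
changed = repo.diff(c1, c2)  # only logs/b differs between the two commits
```

Here `logs/a` is shared between both commits rather than copied, which is where the reduced storage needs come from.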
What is PPS?
• Schedules dependency graph
• Manages containerized pipelines
• Understands copy-on-write storage
[Diagram: Task 1, Task 2, Task 3, Task 4, Task 5, Task 6, Dashboard]
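Scheduling a dependency graph can be sketched with Python's standard `graphlib` module. The edges below are an assumption: the slide names the tasks but the transcript does not record which task depends on which. The function is illustrative and is not how PPS itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical edges (task -> tasks it depends on); the slide only names the tasks.
DEPENDS_ON = {
    "Task 1": set(),
    "Task 2": {"Task 1"},
    "Task 3": {"Task 1"},
    "Task 4": {"Task 2"},
    "Task 5": {"Task 3"},
    "Task 6": {"Task 4", "Task 5"},
    "Dashboard": {"Task 6"},
}


def schedule(depends_on):
    """Return waves of tasks; all tasks in one wave can run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule(DEPENDS_ON)
```

With these assumed edges, Task 2 and Task 3 land in the same wave, so a scheduler could run them in separate containers at the same time.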
PPS + PFS is…
Efficient: incremental processing
[Diagram: commits 0, 1, 2, 3. Data Analysis: Task 1, Task 2, Task 3, Task 4, Task 5, Task 6, Dashboard. 1% more data: Task 4, Task 6, Dashboard]
Only process jobs that rely on the data that changed
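A minimal sketch of that rule, assuming a hypothetical wiring between tasks and hypothetical input paths (`raw/`, `clicks/`), neither of which appears in the slides: start from the tasks that read the changed data and walk downstream, leaving everything else untouched.

```python
from collections import deque

# Hypothetical wiring: task -> tasks that consume its output.
DOWNSTREAM = {
    "Task 1": ["Task 2", "Task 3"],
    "Task 2": ["Task 4"],
    "Task 3": ["Task 5"],
    "Task 4": ["Task 6"],
    "Task 5": ["Task 6"],
    "Task 6": ["Dashboard"],
    "Dashboard": [],
}
# Hypothetical mapping from input data paths to the tasks that read them directly.
READERS = {"raw/": ["Task 1"], "clicks/": ["Task 4"]}


def tasks_to_rerun(changed_paths):
    """Breadth-first walk from the readers of changed paths to everything downstream."""
    queue = deque(t for p in changed_paths for t in READERS.get(p, []))
    seen = set(queue)
    while queue:
        task = queue.popleft()
        for nxt in DOWNSTREAM[task]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


small_change = tasks_to_rerun(["clicks/"])  # only the Task 4 branch is affected
full_change = tasks_to_rerun(["raw/"])      # a change at the root touches every task
```

A commit diff supplies `changed_paths`, so the amount of work tracks the amount of new data rather than the size of the whole data set.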
PPS + PFS is…
Flexible: both batched pipelines and streaming
Daily batched pipelines
[Diagram: commits 0, 1, 2 with ∆Time = 1 day. Data Analysis: Task 1, Task 2, Task 3, Task 4, Task 5, Task 6, Dashboard]
Large batched DAG that processes all the new data each day
PPS + PFS is…
Flexible: both batched pipelines and streaming
Streaming updates
[Diagram: commits 0, 1, 2, 3, 4 with ∆Time = 1 second. Data Analysis: Task 1, Task 2, Task 3, Task 4, Task 5, Task 6, Dashboard]
Micro-batches that update constantly as new data streams in
Commits are insanely cheap so you can take one every second
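One way to read the two "Flexible" slides is that batch and streaming differ only in the commit interval. The sketch below illustrates that idea under that assumption; the function name and the event data are hypothetical and nothing here is a Pachyderm API.

```python
from collections import defaultdict


def group_into_commits(records, interval_seconds):
    """Bucket (timestamp, payload) records into commits, one per time window."""
    commits = defaultdict(list)
    for timestamp, payload in records:
        commits[int(timestamp // interval_seconds)].append(payload)
    return [commits[key] for key in sorted(commits)]


# Hypothetical event stream: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (90000.0, "d")]

daily = group_into_commits(events, 86400)  # ∆Time = 1 day
per_second = group_into_commits(events, 1)  # ∆Time = 1 second
```

The same downstream pipeline can consume either list of commits; only the batch size changes.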
PPS + PFS is…
Resilient: seamless pipeline restarts
[Diagram: Task 1, Task 2, Task 3, Task 4, Task 5, Task 6, Dashboard]
Monitoring
$ Task 2 failed
$ Task 4 and 6 waiting…
… Fixing code …
$ Task 2 resuming...
$ Task 2 complete!
$ Task 4 starting…
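The restart sequence on this slide (Task 2 fails, Tasks 4 and 6 wait, the code is fixed, the pipeline resumes) can be sketched as a runner that remembers which tasks have finished. Everything here is hypothetical: the function `run_pipeline`, the task order, and the simulated bug are for illustration and do not reflect the PPS implementation.

```python
def run_pipeline(order, actions, done=None):
    """Run tasks in dependency order, skipping ones already complete.

    Returns (done, failed_task); failed_task is None when everything finished.
    """
    done = set() if done is None else set(done)
    for task in order:
        if task in done:
            continue           # output already committed, nothing to redo
        try:
            actions[task]()
        except Exception:
            return done, task  # later tasks wait until this one is fixed
        done.add(task)
    return done, None


# Hypothetical tasks; Task 2 has a bug on the first attempt.
order = ["Task 1", "Task 2", "Task 4", "Task 6"]
calls = []


def ok(name):
    return lambda: calls.append(name)


def broken():
    raise RuntimeError("bug in Task 2")


actions = {name: ok(name) for name in order}
actions["Task 2"] = broken
done, failed = run_pipeline(order, actions)  # stops at Task 2

actions["Task 2"] = ok("Task 2")             # "... Fixing code ..."
done, failed_again = run_pipeline(order, actions, done)  # resumes at Task 2
```

Task 1 runs only once across both attempts, because its result was recorded before the failure.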
PPS + PFS is…
Cost-effective: resource management through delayed execution
[Diagram labels: PFS storage nodes; PPS; Copy-on-write storage nodes; Elastically scaling computation nodes; d2.8xlarge; PPS, PPS, PPS; Spot, Spot, Spot]
Elastically add spot instances when prices are cheap or needs are high
PPS + PFS is…
Cost-effective: resource management through delayed execution
[Diagram labels: PFS storage nodes; PPS; Copy-on-write storage nodes; Elastically scaling computation nodes; d2.8xlarge; PPS, PPS, PPS; Spot, Spot, Spot; S3; Slow/cheap storage]
Back up data to S3 for long-term storage or “cold” data
Summary
• Container ecosystem is powerful
• Copy-on-write data is really powerful
• Containers plus copy-on-write is insanely powerful