Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes
TRANSCRIPT
Motivation
• Software development and data science have distinct lifecycles.
• Repeatability is fundamental to both.
• Bridging the data science lifecycle into the software development lifecycle presents challenges.
Multi-tenancy with Multiplicity
• No tool really does it all. (Sorry.)
• Data wrangling, ETL/ELT, different algorithms hosted in different compute frameworks, …
• A data pipeline or workflow ties it all together.
• Everyone wants something different, sometimes for good reasons.
Being able to run a large number of different workloads for a large number of different users is a win.
Containers
• Package apps with their libraries in a (relatively) clean manner — especially important for native code.
• Ensure traceability of code, presuming that there is a solid CI and repository solution in place.
Kubernetes is awesome.
For reasons you already know:
• Bin packing.
• Horizontal scale-out for the platform, auto-scaling for pods.
• Service discovery, load balancing.
• Self-healing.
• Batch execution.
And more reasons in the future:
• GPU affinity.
• Backplane for Spark.
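The batch-execution point, for instance, maps directly onto the Job resource. A minimal sketch, with a hypothetical image and command:

```yaml
# Hypothetical batch workload run as a Kubernetes Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: feature-extraction              # hypothetical job name
spec:
  completions: 1
  template:
    spec:
      containers:
      - name: extract
        image: example.com/feature-extract:1.0   # hypothetical image
        command: ["python", "extract.py"]
      restartPolicy: Never              # Jobs require Never or OnFailure
```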
A Simple Idea
What if we could package workloads in containers and then kubectl could be our fundamental devops primitive…?
Napkin Sketch:
1. Build a control plane that knows how to stamp out workloads via a Provisioning API.
2. Profit.
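Concretely, the control plane could render a manifest like the one below per tenant request and hand it to the API server. All names here are hypothetical; a tenant label sketches how workloads could carry tenant attribution:

```yaml
# Sketch of a workload the control plane might stamp out for one tenant.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-1                      # hypothetical workload name
  labels:
    tenant: alice                       # hypothetical tenant label
spec:
  replicas: 2
  selector:
    matchLabels: {app: workload-1}
  template:
    metadata:
      labels: {app: workload-1, tenant: alice}
    spec:
      containers:
      - name: main
        image: example.com/workload:1.0 # hypothetical image
```

Applying such manifests with kubectl is exactly the "fundamental devops primitive" the slide describes.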
[Diagram: a Control Plane exposes a Provisioning API and stamps out Workload 1, Workload 2, and Workload 3 onto Kubernetes.]
Challenges
• Typical workloads consist of multiple types of containers that need to collaborate.
• Containerization (often) isn’t that bad, depending on your taste.
• Many workloads, or components thereof (e.g., Spark), aren’t designed in a manner that permits the best use of Kubernetes facilities.
Surgery (or holding your nose) is frequently required, but sometimes (e.g., TensorFlow!) things work well from the start.
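The "multiple types of containers that need to collaborate" point is what the pod abstraction is for: co-scheduled containers sharing volumes and a network identity. A minimal sketch with hypothetical names and images:

```yaml
# Two collaborating containers in one pod, sharing an emptyDir volume.
apiVersion: v1
kind: Pod
metadata:
  name: worker-with-sidecar             # hypothetical pod name
spec:
  containers:
  - name: worker
    image: example.com/worker:1.0       # hypothetical image
    volumeMounts:
    - {name: scratch, mountPath: /data}
  - name: log-shipper                   # sidecar that reads the worker's output
    image: example.com/shipper:1.0      # hypothetical image
    volumeMounts:
    - {name: scratch, mountPath: /data, readOnly: true}
  volumes:
  - name: scratch
    emptyDir: {}
```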
Example
Problem:
• ZooKeeper.
• Nodes have distinct identities, and the client protocol is designed to defy load balancing.
Solution:
• Run a replication controller per node and call it a day.
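That works out to a single-replica replication controller plus a per-node service for each ZooKeeper node, so every node keeps a stable identity and a stable address. A sketch for node 1; the names, image, and environment variable are hypothetical:

```yaml
# One ZooKeeper node: a single-replica replication controller pins identity.
apiVersion: v1
kind: ReplicationController
metadata:
  name: zk-1
spec:
  replicas: 1                           # exactly one pod: this node's identity
  selector: {app: zk, server-id: "1"}
  template:
    metadata:
      labels: {app: zk, server-id: "1"}
    spec:
      containers:
      - name: zookeeper
        image: example.com/zookeeper:3.4   # hypothetical image
        env:
        - {name: ZK_SERVER_ID, value: "1"} # hypothetical env var
---
# A service per node gives clients a stable per-node address,
# sidestepping load balancing across the ensemble.
apiVersion: v1
kind: Service
metadata:
  name: zk-1
spec:
  selector: {app: zk, server-id: "1"}
  ports:
  - {name: client, port: 2181}
```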
Some Familiar Problems
Once you can stamp out workloads, you get down to familiar problems:
• Tenant-attributed logging (workload and user) and metrics.
• “Billing” and metering.
• Visibility and other flavors of operability.
• Security — from purposeful or accidental attackers.
• Workload isolation, e.g., for PII.
Fixing these problems frequently requires surgery, and none of them is unique to containerization or cluster scheduling of workloads, i.e., you have to solve them anyway.
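One building block that helps with several of these at once (metering, isolation, attribution) is a namespace per tenant with a resource quota. A sketch under the assumption of one tenant named alice; the quota values are placeholders:

```yaml
# Per-tenant namespace with a quota: a starting point for metering
# and workload isolation.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-alice                    # hypothetical tenant namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: tenant-alice
spec:
  hard:
    requests.cpu: "20"                  # cap aggregate CPU requests
    requests.memory: 64Gi
    pods: "50"
```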
Wrap Up
• Building a data processing platform on Kubernetes has some obvious starting points and some familiar challenges.
• More data scientists and middleware makers are starting with containers as a packaging scheme.