Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes
TRANSCRIPT
Motivation
• Software development and data science have distinct lifecycles.
• Repeatability is fundamental to both.
• Bridging the data science lifecycle into the software development lifecycle presents challenges.
Multi-tenancy with Multiplicity
• No tool really does it all. (Sorry.)
• Data wrangling, ETL/ELT, different algorithms hosted in different compute frameworks, …
• A data pipeline or workflow ties it all together.
• Everyone wants something different, sometimes for good reasons.
Being able to run a large number of different workloads for a large number of different users is a win.
Containers
• Package apps with their libraries in a (relatively) clean manner — especially important for native code.
• Ensure traceability of code, presuming that there is a solid CI and repository solution in place.
Kubernetes is awesome.
For reasons you already know:
• Bin packing.
• Horizontal scale-out for the platform, auto-scaling for pods.
• Service discovery, load balancing.
• Self-healing.
• Batch execution.
And more reasons in the future:
• GPU affinity.
• Backplane for Spark.
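The batch-execution point, for instance, maps directly onto the Job resource. A minimal sketch, with a hypothetical image and command:

```yaml
# Hypothetical batch workload run as a Kubernetes Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: feature-extraction              # hypothetical job name
spec:
  completions: 1
  template:
    spec:
      containers:
      - name: extract
        image: example.com/feature-extract:1.0   # hypothetical image
        command: ["python", "extract.py"]
      restartPolicy: Never              # Jobs require Never or OnFailure
```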
A Simple Idea
What if we could package workloads in containers and then kubectl could be our fundamental devops primitive…?
Napkin Sketch:
1. Build a control plane that knows how to stamp out workloads via a Provisioning API.
2. Profit.
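Concretely, the control plane could render a manifest like the one below per tenant request and hand it to the API server. All names here are hypothetical; a tenant label sketches how workloads could carry tenant attribution:

```yaml
# Sketch of a workload the control plane might stamp out for one tenant.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-1                      # hypothetical workload name
  labels:
    tenant: alice                       # hypothetical tenant label
spec:
  replicas: 2
  selector:
    matchLabels: {app: workload-1}
  template:
    metadata:
      labels: {app: workload-1, tenant: alice}
    spec:
      containers:
      - name: main
        image: example.com/workload:1.0 # hypothetical image
```

Applying such manifests with kubectl is exactly the "fundamental devops primitive" the slide describes.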
[Diagram: a Control Plane exposes a Provisioning API and stamps out Workload 1, Workload 2, and Workload 3 onto Kubernetes.]
Challenges
• Typical workloads consist of multiple types of containers that need to collaborate.
• Containerization (often) isn’t that bad, depending on your taste.
• Many workloads, or components thereof (e.g., Spark), aren’t designed in a manner that permits the best use of Kubernetes facilities.
Surgery (or holding your nose) is frequently required, but sometimes (e.g., TensorFlow!) things work well from the start.
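The "multiple types of containers that need to collaborate" point is what the pod abstraction is for: co-scheduled containers sharing volumes and a network identity. A minimal sketch with hypothetical names and images:

```yaml
# Two collaborating containers in one pod, sharing an emptyDir volume.
apiVersion: v1
kind: Pod
metadata:
  name: worker-with-sidecar             # hypothetical pod name
spec:
  containers:
  - name: worker
    image: example.com/worker:1.0       # hypothetical image
    volumeMounts:
    - {name: scratch, mountPath: /data}
  - name: log-shipper                   # sidecar that reads the worker's output
    image: example.com/shipper:1.0      # hypothetical image
    volumeMounts:
    - {name: scratch, mountPath: /data, readOnly: true}
  volumes:
  - name: scratch
    emptyDir: {}
```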
Example
Problem:
• ZooKeeper.
• Nodes have distinct identities, and the client protocol is designed to defy load balancing.
Solution:
• Run a replication controller per node and call it a day.
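That works out to a single-replica replication controller plus a per-node service for each ZooKeeper node, so every node keeps a stable identity and a stable address. A sketch for node 1; the names, image, and environment variable are hypothetical:

```yaml
# One ZooKeeper node: a single-replica replication controller pins identity.
apiVersion: v1
kind: ReplicationController
metadata:
  name: zk-1
spec:
  replicas: 1                           # exactly one pod: this node's identity
  selector: {app: zk, server-id: "1"}
  template:
    metadata:
      labels: {app: zk, server-id: "1"}
    spec:
      containers:
      - name: zookeeper
        image: example.com/zookeeper:3.4   # hypothetical image
        env:
        - {name: ZK_SERVER_ID, value: "1"} # hypothetical env var
---
# A service per node gives clients a stable per-node address,
# sidestepping load balancing across the ensemble.
apiVersion: v1
kind: Service
metadata:
  name: zk-1
spec:
  selector: {app: zk, server-id: "1"}
  ports:
  - {name: client, port: 2181}
```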
Some Familiar Problems
Once you can stamp out workloads, you get down to familiar problems:
• Tenant-attributed logging (workload and user) and metrics.
• “Billing” and metering.
• Visibility and other flavors of operability.
• Security — from purposeful or accidental attackers.
• Workload isolation, e.g., for PII.
Fixing these problems frequently requires surgery, and none of them is unique to containerization or cluster scheduling of workloads, i.e., you have to solve them anyway.
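One building block that helps with several of these at once (metering, isolation, attribution) is a namespace per tenant with a resource quota. A sketch under the assumption of one tenant named alice; the quota values are placeholders:

```yaml
# Per-tenant namespace with a quota: a starting point for metering
# and workload isolation.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-alice                    # hypothetical tenant namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: tenant-alice
spec:
  hard:
    requests.cpu: "20"                  # cap aggregate CPU requests
    requests.memory: 64Gi
    pods: "50"
```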
Wrap Up
• Building a data processing platform on Kubernetes has some obvious starting points and some familiar challenges.
• More data scientists and middleware makers are starting with containers as a packaging scheme.