Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Multitenant Data Architectures with Kubernetes Paul Brown [email protected]

Upload: coreos

Post on 08-Jan-2017


TRANSCRIPT

Page 1: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Multitenant Data Architectures with Kubernetes

Paul Brown

[email protected]

Page 2: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Motivation

• Software development and data science have distinct lifecycles.

• Repeatability is fundamental to both.

• Bridging the data science lifecycle into the software development lifecycle presents challenges.

Page 3: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Multi-tenancy with Multiplicity

• No tool really does it all. (Sorry.)

• Data wrangling, ETL/ELT, different algorithms hosted in different compute frameworks, …

• Data pipeline or workflow to tie it all together.

• Everyone wants something different, sometimes for good reasons.

Being able to run a large number of different workloads for a large number of different users is a win.

Page 4: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Containers

• Package apps with their libraries in a (relatively) clean manner — especially important for native code.

• Ensure traceability of code, presuming that there is a solid CI and repository solution in place.

Page 5: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Kubernetes is awesome.

For reasons you already know:

• Bin packing.

• Horizontal scale-out for the platform, auto-scaling for pods.

• Service discovery, load balancing.

• Self-healing.

• Batch execution.

And more reasons in the future:

• GPU affinity.

• Backplane for Spark.

Page 6: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

A Simple Idea

What if we could package workloads in containers and then kubectl could be our fundamental devops primitive…?

Napkin Sketch:

1. Build a control plane that knows how to stamp out workloads via a Provisioning API.

2. Profit.

[Diagram labels: Kubernetes, Control Plane, Workload 1, Workload 2, Workload 3, Provisioning API]
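
The napkin sketch above could look something like the following in Python. This is a minimal sketch under stated assumptions, not the speaker's implementation: the function names `stamp_workload` and `to_kubectl_input`, the one-namespace-per-tenant-workload layout, the `tenant` and `workload` label keys, and the image name are all hypothetical. It uses the current `apps/v1` Deployment API, which is newer than the talk. The output is a JSON `List` that could be piped to `kubectl apply -f -`.

```python
import json
from typing import Dict, List


def stamp_workload(tenant: str, workload: str, image: str, replicas: int = 1) -> List[Dict]:
    """Return Kubernetes manifests (plain dicts) for one tenant's workload.

    Hypothetical layout: one namespace per (tenant, workload) pair, with
    tenant and workload labels on every object.
    """
    namespace = f"{tenant}-{workload}"
    labels = {"tenant": tenant, "workload": workload}
    namespace_manifest = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": namespace, "labels": labels},
    }
    deployment_manifest = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": workload, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": workload, "image": image}]},
            },
        },
    }
    return [namespace_manifest, deployment_manifest]


def to_kubectl_input(manifests: List[Dict]) -> str:
    """Wrap manifests in a v1 List so a single kubectl call can apply them."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)


if __name__ == "__main__":
    # Placeholder image name; a real control plane would resolve it from a registry.
    print(to_kubectl_input(stamp_workload("acme", "trainer", "example.invalid/trainer:1.0")))
```

A Provisioning API handler would call `stamp_workload` once per request and hand the result to `kubectl` (or to the Kubernetes API directly), which keeps the control plane itself free of scheduling logic.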

Page 7: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Challenges

• Typical workloads consist of multiple types of containers that need to collaborate.

• Containerization (often) isn’t that bad, depending on your taste.

• Many workloads or components thereof (e.g., Spark) aren’t designed in a manner that permits the best use of Kubernetes facilities.

Surgery (or holding your nose) is frequently required, but sometimes (e.g., TensorFlow!) things work well from the start.

Page 8: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Example

Problem:

• ZooKeeper

• Nodes have distinct identity, and the client protocol is designed to defy load balancing.

Solution:

• Replication controller per node and call it a day.
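
One way to read "replication controller per node" is sketched below in Python. The slide only names the replication controller; the per-node Service, the `zk-id` label, the `ZK_ID` environment variable (and how an image would consume it), and the image name are assumptions made for illustration. Ports 2181, 2888, and 3888 are the conventional ZooKeeper client, peer, and election ports.

```python
from typing import Dict, List


def zookeeper_node_manifests(ensemble: str, node_id: int, image: str) -> List[Dict]:
    """One ReplicationController (replicas=1) plus one Service per ZooKeeper node."""
    name = f"{ensemble}-{node_id}"
    labels = {"app": ensemble, "zk-id": str(node_id)}
    controller = {
        "apiVersion": "v1",
        "kind": "ReplicationController",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,  # exactly one pod carries this node's identity
            "selector": labels,
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "zookeeper",
                            "image": image,
                            # Hypothetical: the image is assumed to derive its myid from ZK_ID.
                            "env": [{"name": "ZK_ID", "value": str(node_id)}],
                        }
                    ]
                },
            },
        },
    }
    # Assumption: a Service per node gives each node a stable DNS name,
    # so clients address nodes individually instead of through a load balancer.
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": labels,
            "ports": [
                {"name": "client", "port": 2181},
                {"name": "peer", "port": 2888},
                {"name": "election", "port": 3888},
            ],
        },
    }
    return [controller, service]


def zookeeper_ensemble(ensemble: str, size: int, image: str) -> List[Dict]:
    """Manifests for every node in the ensemble, numbered from 1."""
    manifests: List[Dict] = []
    for node_id in range(1, size + 1):
        manifests.extend(zookeeper_node_manifests(ensemble, node_id, image))
    return manifests


def connect_string(ensemble: str, size: int) -> str:
    """Client connection string that lists every node by its own Service name."""
    return ",".join(f"{ensemble}-{i}:2181" for i in range(1, size + 1))
```

Because each controller manages a single pod, self-healing still applies per node, while the explicit connection string lets the ZooKeeper client do its own node selection.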

Page 9: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Some Familiar Problems

Once you can stamp out workloads, you get down to familiar problems:

• Tenant-attributed logging (workload and user) and metrics.

• “Billing” and metering.

• Visibility and other flavors of operability.

• Security, from purposeful or accidental attackers.

• Workload isolation, e.g., for PII.

Fixing these problems frequently requires surgery, and none of them is unique to containerization or cluster scheduling of workloads, i.e., you have to solve them anyway.
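
As a small illustration of tenant-attributed metering, the Python sketch below rolls up usage by a tenant label. Everything in it is an assumption for illustration: the `UsageSample` type, the `tenant` label key, the `unattributed` bucket, and the idea that usage arrives as labeled CPU-second samples are not from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class UsageSample(NamedTuple):
    """Hypothetical usage record: the labels of the pod it came from, plus CPU seconds."""
    labels: Dict[str, str]
    cpu_seconds: float


def meter_by_tenant(samples: Iterable[UsageSample]) -> Dict[str, float]:
    """Sum CPU seconds per tenant label; unlabeled usage goes to 'unattributed'."""
    totals: Dict[str, float] = defaultdict(float)
    for sample in samples:
        tenant = sample.labels.get("tenant", "unattributed")
        totals[tenant] += sample.cpu_seconds
    return dict(totals)
```

The explicit `unattributed` bucket is a design choice: usage that cannot be tied to a tenant stays visible instead of silently disappearing from the bill.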

Page 10: Tectonic Summit 2016: Multitenant Data Architectures with Kubernetes

Wrap Up

• Building a data processing platform on Kubernetes has some obvious starting points and some familiar challenges.

• More data scientists and middleware makers are starting with containers as a packaging scheme.