running spark inside containers with haohai ma and khalid ahmed
TRANSCRIPT
Myself
• “How High”
• Software Architect
• IBM Spectrum Computing
• Toronto Canada
2IBM Spectrum Computing
Agenda
• Why container?
• Migrate spark workload to container
• Spark instance on Kubernetes
– Architecture
– Workflow
– Multi-tenancy
• Future work
3IBM Spectrum Computing
Why use containers?
• To enforce the CPU and memory bounds. – CPU shares are proportional to the allocated slots
– spark.driver.memory & spark.executor.memroy
• To completely isolate the file system – Solve the dependency conflicts
• To create and ship images– Develop once and run everywhere
4IBM Spectrum Computing
No prebuilt Spark image
• A running container needs an application image
– Independent to Spark versions
• Seamlessly migrate Spark workloads to a
container based environment
– Assume: Spark is distributed onto the host file system
5IBM Spectrum Computing
Host Filesystem
Spark
installation
Regular Spark workload
IBM Spectrum Computing
Spark Master
JVM
Spark Submit
Spark Driver
Spark Executor
Host Filesystem
Spark
installation
Running in containers
IBM Spectrum Computing
Spark Master
Spark Executor
container: ubuntu
JVM
Spark Submit
Spark Driver
container: ubuntu
container: image
Creating a container definition for an application image
IBM Spectrum Computing
Extra dependency from
host file system
Submitting workload with the container definition
9IBM Spectrum Computing
spark-submit --class<main-class> --master<master-url> --deploy-mode cluster \
--conf spark.ego.driver.docker.definition= MyAppDef \
--conf spark.ego.executor.docker.definition= MyAppDef \
<application-jar> \
[application-arguments]
Cluster Mode: Define container specifications for
the drivers and executors
Host Filesystem
Spark
installation
Running in containers
IBM Spectrum Computing
Spark Master
Spark Executor
Container: myappimage:v1Spark Submit
Spark Driver
container: myappimage:v1
Infobatch
lib
Spark Instance on Kubernetes
• Increase resource utilization
– Share nodes between Spark and surrounding ecosystem
• Isolation between tenants and apply resource enforcement
– Each tenant gets a dedicated Spark working instance
– Tenant price plan can directly map to its resource quota
• Simplify deployment and roll out
11IBM Spectrum Computing
Architecture
IBM Spectrum Computing
IBM Spectrum Conductor with Spark
Spark Instance
Group
Spark Instance
Group
Spark
Master
History
ServerNotebook
Shuffle
ServiceSpark
Master
History
ServerNotebook
Shuffle
Service
SparkaaS level
Tenant level
Spark instance group
Admin PortalSpark Information Hub Image and deployment
Architecture
IBM Spectrum Computing
IBM Spectrum Conductor with Spark
Spark Instance
Group
Spark Instance
Group
Spark
Master
History
ServerNotebook
Shuffle
ServiceSpark
Master
History
ServerNotebook
Shuffle
Service
Spark Service level
Tenant level
Spark Information Hub Image and deployment Spark instance group
Admin Portal
• A Spark instance group is an independent
deployment in Kubernetes.
• A docker image is built automatically based on the
Spark version, configuration, notebook edition, and
user application dependencies.
• Initially only one container for Spark services of the
Spark instance group.
• Dynamic scalability based on workload demand.
Architecture
IBM Spectrum Computing
IBM Spectrum Conductor with Spark
Spark Instance
Group
Spark Instance
Group
Spark
Master
History
ServerNotebook
Shuffle
ServiceSpark
Master
History
ServerNotebook
Shuffler
Service
SparkaaS level
Tenant level
Spark Information Hub Image and deployment Spark instance group
Admin Portal
• IBM Spectrum Conductor with Spark - End points
• Admin manages Spark instance group life cycle
• Tenant accesses Spark workloads and notebooks
• Deploy by a helm chart and expose as a service with
one single container: CWS master
• Multiple deployments in a Kubernetes cluster
• Cloud: One deployment for one Spaas
• On-Prem: One deployment for one BU
Creating a Spark instance groupKubernetes
Namespace: ns4bu1
container: CWS
spaas4bu1_cwsmaster
Registry
tenant1 Image
Deploying a Spark instance groupKubernetes
Namespace: ns4bu1
container: CWS
spaas4bu1_cwsmaster
container: tenant1
spaas4bu1_tenant1
Registry
tenant1 Image
Kubernetes
Namespace: ns4bu1
container: tenant1
container: tenant1 container: tenant1
Scaling the Spark instance group based on workload demands
container: tenant1
scheduler
API Server
K8s master
Spark Master
Spark Driver
Spark ExecutorSpark Executor
Multitenancy with Spark instance groupsKubernetes
Namespace: ns4bu1
container: CWS
spaas4bu1_cwsmaster
Registry
container: tenant1
spaas4bu1_tenant1container: tenant1
spaas4bu1_tenant1container: tenant1
spaas4bu1_tenant1container: tenant1
spaas4bu1_tenant1
container: tenant2
spaas4bu1_tenant2container: tenant2
spaas4bu1_tenant2container: tenant2
spaas4bu1_tenant2
tenant1 Image
tenant2 Image
Registry
tenant1 Image
tenant2 Image
tenant3 Image
tenant4 Image
Multi-SpaasKubernetes
Namespace: ns4bu1
container: CWS
spaas4bu1_cwsmaster
container: tenant1
spaas4bu1_tenant1container: tenant1
spaas4bu1_tenant1container: tenant1
spaas4bu1_tenant1container: tenant1
spaas4bu1_tenant1
container: tenant2
spaas4bu1_tenant2container: tenant2
spaas4bu1_tenant2
Namespace: ns4bu2
container: CWS
spaas4bu2_cwsmaster
container: tenant3
spaas4bu2_tenant3
container: tenant4
spaas4bu2_tenant4
Survey: Spark on Kubernetes
SPARK-18278 StandaloneIBM Spectrum Conductor
with Spark on
Kubernetes
Dynamic allocation on demand
Yes Static Yes
K8s interaction granularity Job level Instance level – static Instance level –dynamic
Deployment Automation• Simple deploy by helm
charts
No Yes Yes
Spark instance per tenant • Multi-job/workflow/user• Image with user
applications• Security
No limited Yes
Future work
• Integration with Kubernetes batch workload
scheduler
– Kube-arbitrator (https://github.com/kubernetes-
incubator/kube-arbitrator)
• Performance comparation with other Spark on
Kubernetes solutions
26IBM Spectrum Computing