Multi-tenant Deep Learning and Streaming as-a-Service with Hopsworks
Theofilos Kakantousis (@theofiloskak)
COO – Logical Clocks AB
Big Data Moscow 2018
©2018 Logical Clocks AB. All Rights Reserved
Deep Learning & Streaming-as-a-Service in Sweden
● Hopsworks
– Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service
– Built on Hops Hadoop (www.hops.io)
– hops.site, 600+ users as of September 2018
● RISE SICS ICE
– 250 kW Datacenter, ~1000 servers
https://www.sics.se/projects/sics-ice-data-center-in-lulea
Deep Learning needs Big Data
[…] the general consensus seems to be that everyone expects some gain in performance numbers if the dataset size is increased dramatically [...]
Sun et al. - Revisiting Unreasonable Effectiveness of Data in Deep Learning Era - 2017
Hestness et al. - Deep Learning Scaling is Predictable, Empirically - 2017
AI Hierarchy of Needs
[Figure: pyramid of needs; layers in slide order]
– DDL (Distributed Deep Learning)
– Deep Learning, RL
– Machine Learning (ML)
– Data Analytics
– Data Pipelines
– Big Data
Roles: Data Engineers, Data Scientists, Data Scientists?
Hardware: GPUs, Lots of GPUs
Full-stack Data Science
Hopsworks
REST API
Hopsworks
Develop, Train, Test, Deploy
[Figure: Hopsworks architecture]
– REST API
– Jupyter; Jobs, Kibana, Grafana
– Spark, Flink, TensorFlow
– HopsFS / YARN
– MySQL Cluster, Hive, InfluxDB, ElasticSearch, Kafka
– Projects, Datasets, Users
Big data needs scalable storage
HopsFS*
● HDFS derivative with distributed metadata
– 37x increased capacity
– 16x increased throughput
Scale-out all layers
[Figure labels: HDFS Clients, Namenode, Datanode, Metadata]
Scale Challenge Winner (2017)
* HopsFS - https://goo.gl/yFCsGc
HopsFS support for Small Files*
● Integrates NVMe
● Open Images Dataset:
– 9m images
– ~80% small files (<64 KB)
[Figure: Namenode and Datanode; file placement by size]
– Metadata layer (NDB), RAM: < 1 KB
– Metadata layer (NDB), NVMe disk: 1 KB to 64 KB
– Datanode: > 64 KB (configurable)
* Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al.
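The size-based placement on this slide can be sketched in a few lines of Python. This is only an illustration, not HopsFS code: the `Tier` enum, the `pick_tier` function and its parameter names are hypothetical, and the 1 KB and 64 KB cut-offs are read off the slide (the slide marks the upper one as configurable).

```python
from enum import Enum

KB = 1024


class Tier(Enum):
    """Hypothetical names for the three storage locations on the slide."""
    RAM = "metadata layer, in memory"
    NVME = "metadata layer, NVMe disk"
    DATANODE = "datanode"


def pick_tier(size_bytes, ram_limit=1 * KB, small_file_limit=64 * KB):
    """Return the tier a file of the given size would land in.

    Assumed rule: below ram_limit -> RAM, up to small_file_limit -> NVMe,
    anything larger -> datanode.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < ram_limit:
        return Tier.RAM
    if size_bytes <= small_file_limit:
        return Tier.NVME
    return Tier.DATANODE


if __name__ == "__main__":
    for size in (512, 10 * KB, 100 * KB):
        print(size, pick_tier(size).value)
```

Keeping the limits as parameters mirrors the "configurable" note on the slide.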
Multi-tenancy
Projects for Software-as-a-Service
[Figure: projects Proj-42, Proj-X, Proj-All; resources: Topic, Shared Topic, /Projs/My/Data, CompanyDB]
Manage Projects like Github
Share like in Dropbox
Share any Data Source/Sink: HDFS Datasets, Kafka Topics, etc
Project Authorization
● Data Owner Privileges
– Import/Export data
– Manage Membership
– Share DataSets, Topics
● Data Scientist Privileges
– Write and Run code
● Delegate Administration of privileges to users
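A role-to-privilege lookup is one simple way to picture the two roles above. The sketch below is not the Hopsworks implementation: the `PRIVILEGES` table, the action names and `is_allowed` are hypothetical, and it assumes that a Data Owner can also write and run code, which the slide does not state.

```python
# Hypothetical privilege table derived from the slide's two roles.
PRIVILEGES = {
    "Data Owner": {
        "import_export_data",
        "manage_membership",
        "share_datasets_topics",
        "write_and_run_code",  # assumption: owners can also run code
    },
    "Data Scientist": {
        "write_and_run_code",
    },
}


def is_allowed(role, action):
    """Return True if the given project role includes the action."""
    return action in PRIVILEGES.get(role, frozenset())


if __name__ == "__main__":
    print(is_allowed("Data Scientist", "write_and_run_code"))
    print(is_allowed("Data Scientist", "manage_membership"))
```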
Custom Python environments with Conda
Python libraries are usable by Spark/TensorFlow
TLS (not Kerberos) for security in Hops
● X.509 Certificates for authentication
● 1 Certificate for each project user
● New App certificate generated for each job
● Store an audit trail of the operations (read/write/create/etc.) users and apps perform on HopsFS
[Figure labels: Resource Manager, Node Manager, HopsFS, Generate App Cert, Auth w/ App Cert, Project_user cert]
TLS certificate generation
Users don’t see the certificates; they authenticate using:
• LDAP
• password
• 2-Factor Authentication
[Figure labels: Hopsworks, Project Mgr, Add/Del Users, Insert/Remove Certs, Distributed Database, Cert Signing Requests, Intermediate Certificate Authority, Root CA, HDFS, Spark, Kafka, YARN]
Streaming-as-a-Service
ETL Workloads
Pipelines transform raw data to structured data
[Figure labels: Hopsworks Jobs, trigger, HopsFS, Parquet, Hive, Elastic]
Streaming Analytics in Hopsworks
[Figure labels: Data Src, Kafka, Batch Analytics, Parquet / ORC, HopsFS, YARN, MySQL, Grafana/InfluxDB, Elastic/Kibana, Public Cloud or On-Premise]
Lifecycle of a Streaming Job
Developer
1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: Producer/Consumer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
Operations
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely
Facilitate dev+ops with hops-util https://github.com/logicalclocks/hops-util
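To make the "Create: Kafka Properties" step concrete, here is a minimal Python sketch of what such a properties set might contain when assembled by hand. It does not use the hops-util API. The helper name `build_kafka_properties`, the broker addresses and the file paths are hypothetical; the keys follow the librdkafka naming used by clients such as confluent-kafka.

```python
def build_kafka_properties(brokers, ca_path, cert_path, key_path, group_id=None):
    """Assemble TLS client settings for a Kafka producer or consumer.

    All argument values are supplied by the caller; nothing is discovered
    automatically in this sketch.
    """
    if not brokers:
        raise ValueError("at least one broker endpoint is required")
    props = {
        "bootstrap.servers": ",".join(brokers),
        "security.protocol": "SSL",
        "ssl.ca.location": ca_path,             # CA certificate used to verify brokers
        "ssl.certificate.location": cert_path,  # client X.509 certificate
        "ssl.key.location": key_path,           # client private key
    }
    if group_id is not None:
        props["group.id"] = group_id            # only needed by consumers
    return props


if __name__ == "__main__":
    # Hypothetical endpoints and paths, for illustration only.
    props = build_kafka_properties(
        ["broker1.example.com:9092", "broker2.example.com:9092"],
        ca_path="/tmp/certs/ca.pem",
        cert_path="/tmp/certs/client.pem",
        key_path="/tmp/certs/client.key",
        group_id="demo-consumers",
    )
    for key, value in sorted(props.items()):
        print(f"{key}={value}")
```

Doing this by hand for every job, plus distributing and cleaning up the certificates, is the overhead the slide says the utility library is meant to remove.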
Kafka Self-Service UI
Manage & Share
• Topics
• ACLs
• Avro Schemas
Realtime Logs
● YARN aggregates logs on job completion
– No good to us for Streaming
● Collect logs and make them searchable in real-time using Filebeat, Logstash, Elasticsearch, and Kibana
Resource Monitoring/Alerting
Jupyter Notebooks
Dela* – A Global Ecosystem for Datasets
Peer-to-Peer Search and Download for Huge DataSets (ImageNet, YouTube8M, MsCoCo, Reddit, etc.)
*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)
ML & Deep Learning-as-a-Service
HopsML Pipeline
HopsML Spark/TensorFlow Arch
[Figure labels: Driver, Executor/TF (x2), Conda Envs, HopsFS, TensorBoard, Model Serving]
Distributed Training
Deep Learning Hierarchy of Scale
[Figure: pyramid of training setups, in slide order]
– DDL AllReduce on GPU Servers
– DDL with GPU Servers and Parameter Servers
– Parallel Experiments on GPU Servers
– Single GPU
– Many GPUs on a Single GPU Server
Training Time for ImageNet (scale labels): Minutes, Hours, Days/Hours, Days, Weeks
“My Model’s Training.”
GPU Resource Requests in HopsYARN
Hops supports a Heterogeneous Mix of GPUs
[Figure: example requests to HopsYARN]
– 4 GPUs on any host
– 10 GPUs on 1 host
– 100 GPUs on 10 hosts with ‘Infiniband’
Experiments in Hopsworks
The boring part of the job
● Find good Hyperparameters for your model
● Test different configurations
● Automate this!
“I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters.”
[https://thomaswdinsmore.com/2018/01/30/predictions-for-2018/]
Experiments in TensorFlow/Hopsworks
● Run and evaluate multiple models in parallel on a subset of the dataset
[Figure: Experiment 1, Experiment 2, Experiment 3, Experiment 4]
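Running several configurations in parallel and keeping the best one can be sketched with the Python standard library alone. This is a toy illustration, not the Hopsworks experiments API: `train_and_score` is a hypothetical stand-in for real training on a data subset, and a thread pool stands in for the cluster executors that would run each experiment.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(learning_rate, dropout):
    """Toy objective standing in for 'train a model and return its metric'.

    It peaks at learning_rate=0.01 and dropout=0.3.
    """
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(dropout - 0.3)


def grid_search(grid, max_workers=4):
    """Evaluate every combination in the grid in parallel; return the best."""
    names = sorted(grid)
    combos = [dict(zip(names, values))
              for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda params: train_and_score(**params), combos))
    best_score, best_params = max(zip(scores, combos), key=lambda pair: pair[0])
    return best_params, best_score


# Two values per hyperparameter gives four experiments.
grid = {"learning_rate": [0.01, 0.1], "dropout": [0.3, 0.5]}
best_params, best_score = grid_search(grid)

if __name__ == "__main__":
    print(best_params, best_score)
```

Each combination is independent of the others, which is what makes this kind of search easy to spread over many GPUs.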
Reproducible Experiments
● Results tracking
● Hyperparameter tracking
● Jupyter notebook versioning
● Conda Env versioning
● WIP: Dataset versioning
Experiments Dashboard
TensorBoard (1)
TensorBoard (2)
HopsAPI*
● Python (also Java/Scala)
– Manage TensorBoard, load/save models in HDFS
– TensorFlow, Horovod, TensorFlowOnSpark
– Parallel experiments
  ● Gridsearch
  ● Model Architecture Search with Genetic Algorithms
– Secure Streaming Analytics with Kafka/Spark/Flink
– SSL/TLS certs, Avro Schema, Endpoints for Kafka/Zookeeper/etc.
* https://github.com/logicalclocks/hops-util-py
Model Serving
Standard serving infrastructure
Scale model serving with Kubernetes
Considered best practice by the community
Provides tools for:
● Fault tolerance
● Rolling release of new models
● Autoscaling
Model Monitoring
● Log model inference requests/results to Kafka
● Spark monitors model performance and input data
● When to retrain?
[Figure labels: Serving infrastructure, Model monitoring infrastructure, HopsFS, Re-train and deploy new model]
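One simple answer to "When to retrain?" is a sliding-window check on recent prediction quality. The sketch below is plain Python rather than the Kafka and Spark setup named on the slide, and it is only one possible rule. The `AccuracyMonitor` class, the window size and the threshold are all hypothetical choices, and it assumes true labels eventually arrive for logged predictions.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, window=100, threshold=0.9):
        self.outcomes = deque(maxlen=window)  # True where prediction matched label
        self.threshold = threshold

    def record(self, prediction, label):
        """Log one inference result together with its true label."""
        self.outcomes.append(prediction == label)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self):
        """Signal retraining only once the window is full and accuracy is low."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


if __name__ == "__main__":
    monitor = AccuracyMonitor(window=4, threshold=0.75)
    for prediction, label in [(1, 1), (1, 1), (0, 1), (1, 0)]:
        monitor.record(prediction, label)
    print(monitor.accuracy(), monitor.should_retrain())
```

Waiting for a full window avoids triggering a retrain from the first few requests. Monitoring the input data, which the slide also mentions, would need a separate check.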
Model Serving on Kubernetes
Orchestrating Hops workflows
[Figure: stages orchestrated by Airflow (Hopsworks Operator): Data Collection, Data Transformation & Verification, Feature Extraction, Experimentation, Training, Test, Serving]
Demo
Summary
● Build a single platform to cover the entire AI hierarchy of needs.
● Increase productivity of Data Scientists
– Manage all your data pipelines and workflows under a single roof
– Have first-class support for Python / Streaming / Deep Learning / ML / Data Governance / GPUs
Hopsworks → logicalclocks.com
GitHub → github.com/logicalclocks
Twitter → @logicalclocks