Cirrus: A Serverless Framework for End-to-end ML Workflows
Joao Carreira, Pedro Fonseca, Alexey Tumanov, Andrew Zhang, Randy Katz
Machine Learning
End-to-end ML workflows
● Modern end-to-end ML workflows are complex
● ML workflows consist of 3 heterogeneous stages: Dataset Preprocessing, Model Training, Hyperparameter Tuning
● ML workflows are interactive and iterative
Provisioning ML workflows
Provisioning ML workflows is challenging
● Complex infrastructure management detracts from ML work
● Resource waste due to overprovisioning of resources
Hard to accurately estimate resource demands of each stage
Data scientists have limited systems expertise
Serverless computing
[Diagram labels: Input, Code, Output, AWS S3]
Serverless computing benefits
Tight provisioning of resources
● Fine-grained resources
● Fine-grained billing
● High elasticity
Simplifying infrastructure management
● Automatic resource configuration / provisioning / maintenance
Challenges of serverless
● Small local memory and storage
● Short-lived and unpredictable launch times
● Low bandwidth and no P2P communication
● Lack of fast shared storage
● Limited lambda package size
Existing approaches
Short-lived and unpredictable launch times
Serverless Frameworks (PyWren)
● Download dependencies from S3 (Limit. pkg size)
● High-latency communication through S3 (No fast storage)
● Stragglers (Unpred. launch)
Machine Learning Frameworks
● Unable to launch runtimes in lambdas (Small mem.)
● No ring/tree reduces, no driver-to-worker comm., precludes MPI (Unpred. launch, No P2P comm.)
Cirrus: a framework for serverless end-to-end ML workflows
Cirrus: design principles
1) Addressing serverless challenges
● Low memory, limited package size: Ultra-lightweight runtime + data prefetching
● No fast storage, no P2P communication: High-perf. data store (parameter-server and KV)
● Short lifetimes and unpredictable launch: Robust handling of lambda termination
Cirrus: design principles
2) Achieving benefits for end-to-end ML
● Tight provisioning of resources: Per-stage fine-grained variable agile scalability
● Simplifying infrastructure management: High-level API supports end-to-end ML
Cirrus architecture (client side)
Client side (stateful), used by the data scientist
● Client frontend: Dashboard, Python API
● Frontend to backend: Create/Stop Task
● Client backend: Preproc., Training, Tuning; Task Scheduler; Lambda Manager
Cirrus Dashboard
Cirrus architecture (server side)
Server side (stateless)
● Cirrus runtime: Data Iterator API (Minibatch Buffer); Sparse LR, Mat. Fact., LDA; Data store client API
● Runtime to data store: put(gradient), get(model), put/get key
● Data store: PS API (Models; SGD, Adagrad, Momentum); Key-value API (Key-values)
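The put(gradient) / get(model) exchange between a worker and the data store can be sketched as follows. This is a minimal single-process illustration, not Cirrus's actual implementation: the class name `ParameterStore`, its method names, the dict-based sparse model, and the default hyperparameters are all assumptions made for the example. It applies the three update rules named on the slide (SGD, Adagrad, Momentum) on the store side whenever a sparse gradient is pushed.

```python
import math
from collections import defaultdict


class ParameterStore:
    """Toy in-memory parameter server (hypothetical; not the Cirrus API)."""

    def __init__(self, optimizer="sgd", lr=0.1, momentum=0.9, eps=1e-8):
        if optimizer not in ("sgd", "adagrad", "momentum"):
            raise ValueError(f"unknown optimizer: {optimizer}")
        self.optimizer = optimizer
        self.lr = lr
        self.momentum = momentum
        self.eps = eps
        self.weights = defaultdict(float)  # feature index -> weight
        self.state = defaultdict(float)    # per-feature optimizer state
        self.kv = {}                       # generic key-value side of the store

    def put_gradient(self, gradient):
        """Apply a sparse gradient {feature index: value} to the model."""
        for idx, g in gradient.items():
            if self.optimizer == "sgd":
                self.weights[idx] -= self.lr * g
            elif self.optimizer == "adagrad":
                # Accumulate squared gradients and scale the step per feature.
                self.state[idx] += g * g
                self.weights[idx] -= self.lr * g / (math.sqrt(self.state[idx]) + self.eps)
            else:  # momentum
                # Keep a decaying sum of past gradients as the velocity.
                self.state[idx] = self.momentum * self.state[idx] + g
                self.weights[idx] -= self.lr * self.state[idx]

    def get_model(self, indices=None):
        """Return the whole model, or only the requested sparse slice."""
        if indices is None:
            return dict(self.weights)
        return {i: self.weights.get(i, 0.0) for i in indices}

    def put(self, key, value):
        self.kv[key] = value

    def get(self, key, default=None):
        return self.kv.get(key, default)


store = ParameterStore(optimizer="sgd", lr=0.5)
store.put_gradient({3: 1.0, 7: -2.0})
model_slice = store.get_model([3, 7, 9])
```

Running the update on the store means a worker only sends the nonzero gradient entries and only requests the weights its next minibatch touches, which is one plausible reading of why the slide pairs a PS API with a plain key-value API.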
Cirrus evaluation
1. Cirrus provides benefits by specializing both for serverless and end-to-end ML
2. We show that Cirrus outperforms a state-of-the-art serverless system: PyWren
Evaluation setup
1. Deployment: AWS Lambdas (3GB of mem.)
2. Benchmark: async. distributed SGD Sparse Logistic Regression task
3. Dataset: Criteo Dataset (a dataset of display ads)
4. PyWren:
   a. Baseline: iterative synchronous SGD training using AWS S3 to store gradients and model
   b. + 3 incremental optimizations
5. Cirrus: 2 modes (with/without prefetching)
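For readers unfamiliar with the benchmark in item 2, the per-worker computation in sparse logistic regression can be sketched as below. This is an illustrative single-process sketch under stated assumptions, not the code used in the evaluation: the function names, the `(label, {feature: value})` example format, the learning rate, and the tiny made-up dataset are all hypothetical, and regularization is omitted. It shows why sparsity matters: each gradient only has entries for features present in the minibatch.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sparse_lr_gradient(model, minibatch):
    """Average log-loss gradient over a minibatch, touching only active features.

    model:     {feature index: weight}; missing features count as 0.0
    minibatch: list of (label, {feature index: value}) with label in {0, 1}
    """
    grad = {}
    for label, features in minibatch:
        z = sum(model.get(i, 0.0) * x for i, x in features.items())
        error = sigmoid(z) - label
        for i, x in features.items():
            grad[i] = grad.get(i, 0.0) + error * x / len(minibatch)
    return grad


def log_loss(model, data):
    """Mean logistic loss of the model on a list of labeled examples."""
    total = 0.0
    for label, features in data:
        z = sum(model.get(i, 0.0) * x for i, x in features.items())
        p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(data)


# Tiny made-up dataset: feature 0 signals the positive class, feature 1 the negative.
data = [(1, {0: 1.0}), (0, {1: 1.0}), (1, {0: 1.0, 2: 1.0}), (0, {1: 1.0, 2: 1.0})]
model = {}
loss_before = log_loss(model, data)
for _ in range(50):
    for i, g in sparse_lr_gradient(model, data).items():
        model[i] = model.get(i, 0.0) - 0.5 * g  # plain SGD step
loss_after = log_loss(model, data)
```

In a distributed setting, each worker would run `sparse_lr_gradient` on its own minibatches and send the resulting dictionary to a shared store instead of updating a local `model`.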
Cirrus outperforms vanilla serverless
Synchronous SGD training suffers from stragglers
[Plot: Test Loss]
Cirrus outperforms vanilla serverless
● Multiple SGD iterations on each lambda invocation
● Asynchronous SGD
[Plot: Test Loss]
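The two optimizations on this slide can be illustrated together. The sketch below is an assumption-laden toy, not the PyWren or Cirrus code: `SharedModel`, `lambda_handler`, the time and iteration budgets, and the one-parameter regression problem are all invented for the example, and threads stand in for lambdas. Each simulated invocation keeps iterating until its budget runs out instead of doing one step and exiting, and every gradient is applied as soon as it arrives, with no barrier that would make fast workers wait for stragglers.

```python
import threading
import time


class SharedModel:
    """Stand-in for a remote parameter store (hypothetical, in-process)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.w = 0.0
        self.updates = 0

    def get_model(self):
        with self._lock:
            return self.w

    def put_gradient(self, grad, lr=0.05):
        # Each gradient is applied as soon as it arrives: no barrier across workers.
        with self._lock:
            self.w -= lr * grad
            self.updates += 1


def lambda_handler(store, shard, time_budget_s, max_iters):
    """Run as many SGD iterations as fit in one invocation."""
    deadline = time.monotonic() + time_budget_s
    done = 0
    while done < max_iters and time.monotonic() < deadline:
        x, y = shard[done % len(shard)]
        w = store.get_model()         # possibly stale: other workers keep updating
        grad = 2.0 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
        store.put_gradient(grad)
        done += 1
    return done


# Made-up data generated from y = 3x, split into one shard per simulated lambda.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)], [(1.0, 3.0), (0.5, 1.5)]]
store = SharedModel()
workers = [
    threading.Thread(target=lambda_handler, args=(store, s, 30.0, 200)) for s in shards
]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

After the threads finish, `store.w` should be close to 3 even though workers may have computed some gradients from slightly outdated weights, which is the trade-off asynchronous SGD accepts in exchange for not waiting on the slowest worker.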
Cirrus outperforms vanilla serverless
Sparse gradients and training data prefetching
[Plot: Test Loss]
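Training data prefetching can be sketched with a bounded buffer filled by a background thread, so that computation on one minibatch overlaps with the download of the next. This is a minimal sketch under assumptions: the class name `PrefetchingIterator`, the `fetch` callable, and the buffer size are hypothetical, `fetch_minibatch` is a placeholder for a remote read, and error handling in the fetcher is omitted.

```python
import queue
import threading


class PrefetchingIterator:
    """Iterate over minibatches while a background thread fetches ahead."""

    _DONE = object()  # sentinel marking the end of the stream

    def __init__(self, fetch, num_batches, buffer_size=4):
        self._buffer = queue.Queue(maxsize=buffer_size)  # bounded: limits memory use
        self._thread = threading.Thread(
            target=self._fill, args=(fetch, num_batches), daemon=True
        )
        self._thread.start()

    def _fill(self, fetch, num_batches):
        for i in range(num_batches):
            self._buffer.put(fetch(i))  # blocks while the buffer is full
        self._buffer.put(self._DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._buffer.get()       # blocks only if the fetcher has fallen behind
        if item is self._DONE:
            self._buffer.put(self._DONE)  # keep later next() calls from blocking
            raise StopIteration
        return item


def fetch_minibatch(i):
    # Placeholder for a remote read (for example, one stored object per minibatch).
    return [(i, j) for j in range(3)]


batches = list(PrefetchingIterator(fetch_minibatch, num_batches=5, buffer_size=2))
```

The bounded queue is the design choice that matters on a memory-constrained lambda: the fetcher can run ahead by at most `buffer_size` minibatches, so prefetching hides latency without holding the whole dataset in memory.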
Cirrus outperforms vanilla serverless
Replace AWS S3 with high-performance store (Redis)
[Plot: Test Loss; annotation: +700x updates/sec]
Cirrus outperforms vanilla serverless
Cirrus without training data prefetching
[Plot: Test Loss; annotation: 10x faster]
Cirrus outperforms vanilla serverless
Cirrus with training data prefetching
[Plot: Test Loss; annotations: 10x faster, 10x faster]
Conclusion
1. End-to-end ML workflows:
   a. time-consuming infrastructure management
   b. resource overprovisioning
2. Cirrus -- serverless end-to-end ML framework:
   a. simplify deployment of ML workflows
   b. per-stage provisioning of resources
3. Cirrus outperforms existing serverless solutions by specializing for serverless and ML