airbnb’s end-to-end machine learning infrastructure
TRANSCRIPT
![Page 1: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/1.jpg)
BigheadAirbnb’s End-to-End
Machine Learning Infrastructure
Atul Kale and Xiaohan ZengML Infra @ Airbnb
Strata Data NYC 2018
![Page 2: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/2.jpg)
Background Design Goals ArchitectureDeep Dive
FutureWork
![Page 3: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/3.jpg)
Background
![Page 4: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/4.jpg)
Airbnb’s Mission:
“Create a world where anyone can belong anywhere”
![Page 5: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/5.jpg)
Airbnb’s Product
A global travel community that offers magical end-to-end trips, including where you stay, what you do and the people you meet.
![Page 6: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/6.jpg)
Airbnb is already driven by Machine Learning
Search Ranking Smart Pricing Fraud Detection
![Page 7: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/7.jpg)
But there are *many* more opportunities for ML
● Paid Growth - Hosts
● Classifying / Categorizing Listings
● Experience Ranking + Personalization
● Room Type Categorizations
● Customer Service Ticket Routing
● Airbnb Plus
● Listing Photo Quality
● Object Detection - Amenities
● ....
![Page 8: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/8.jpg)
Intrinsic Complexities with Machine Learning
● Understanding the business domain
● Selecting the appropriate Model
● Selecting the appropriate Features
● Fine tuning
![Page 9: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/9.jpg)
Incidental Complexities with Machine Learning
● Integrating with Airbnb’s Data Warehouse
● Scaling model training & serving
● Keeping consistency between: Prototyping vs Production, Training vs Inference
● Keeping track of multiple models, versions, experiments
● Supporting iteration on ML models
→ ML models take on average 8 to 12 weeks to build
→ ML workflows tended to be slow, fragmented, and brittle
![Page 10: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/10.jpg)
The ML Infrastructure Team addresses these challenges
Vision
Airbnb routinely ships ML-powered features
throughout the product.
Mission
Equip Airbnb with shared technology to build
production-ready ML applications with no
incidental complexity.
![Page 11: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/11.jpg)
Atul KaleEngineering Manager
Varant ZanoyanSoftware Engineer
Andrew HohProduct Manager
John ParkSoftware Engineer
Patrick YoonSoftware Engineer
Xiaohan ZengSoftware Engineer
Nikhil SimhaSoftware Engineer
Alfredo LuqueSoftware Engineer
Evgeny ShapiroSoftware Engineer
Conglei ShiSoftware Engineer
Andrew CheongSoftware Engineer
Aaron SiegelEngineering Lead, Data Platform
Machine Learning Infrastructure Team
![Page 12: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/12.jpg)
Supporting the Full ML Lifecycle
![Page 13: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/13.jpg)
Bighead: Design Goals
![Page 14: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/14.jpg)
Scalable
Seamless Versatile
Consistent
![Page 15: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/15.jpg)
Seamless
● Easy to prototype, easy to productionize
● Same workflow across different frameworks
![Page 16: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/16.jpg)
Versatile
● Supports all major ML frameworks
● Meets various requirements○ Online and Offline○ Data size○ SLA○ GPU training○ Scheduled and Ad hoc
![Page 17: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/17.jpg)
Consistent
● Consistent environment across the stack
● Consistent data transformation
○ Prototyping and Production
○ Online and Offline
![Page 18: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/18.jpg)
Scalable
● Horizontal
● Elastic
![Page 19: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/19.jpg)
Bighead: Architecture Deep Dive
![Page 20: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/20.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Feature Data Management: Zipline
Bighead Service / UI
Prototyping Lifecycle Management Production
Real Time Inference
Batch Training + Inference
Redspot
ML Automator
Airflow
Deep Thought
![Page 21: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/21.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Redspot
Feature Data Management: Zipline
Bighead Service / UI
Deep Thought
ML Automator
Prototyping Lifecycle Management Production
Airflow
Real Time Inference
Batch Training + Inference
![Page 22: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/22.jpg)
RedspotPrototyping with Jupyter Notebooks
![Page 23: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/23.jpg)
Jupyter Notebooks?What are those?
“Creators need an immediate connection to what they are creating.”
- Bret Victor
![Page 24: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/24.jpg)
● Interactivity and Feedback
● Access to Powerful Hardware
● Access to Data
The ideal Machine Learning development environment?
![Page 25: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/25.jpg)
● A fork of the JupyterHub project
● Integrated with our Data Warehouse
● Access to specialized hardware (e.g. GPUs)
● File sharing between users via AWS EFS
● Packaged in a familiar Jupyterhub UI
Redspota Supercharged Jupyter Notebook Service
![Page 26: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/26.jpg)
Redspot
![Page 27: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/27.jpg)
Versatile
● Customized Hardware: AWS EC2 Instance Types e.g. P3, X1
● Customized Dependencies: Docker Images e.g. Py2.7, Py3.6+Tensorflow
Redspota Supercharged Jupyter Notebook Service
Consistent
● Promotes prototyping in the exact environment that your model will use in production
Seamless
● Integrated with Bighead Service & Docker Image Service via APIs & UI widgets
![Page 28: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/28.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Redspot
Feature Data Management: Zipline
Bighead Service / UI
Deep Thought
ML Automator
Prototyping Lifecycle Management Production
Airflow
Real Time Inference
Batch Training + Inference
![Page 29: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/29.jpg)
Docker Image ServiceEnvironment Customization
![Page 30: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/30.jpg)
● ML Users have a diverse, heterogeneous set of dependencies
● Need an easy way to bootstrap their own runtime environments
● Need to be consistent with the rest of Airbnb’s infrastructure
Docker Image Service - Why
+
![Page 31: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/31.jpg)
● Our configuration management solution
● A composition layer on top of Docker
● Includes a customization service that
faces our users
● Promotes Consistency and Versatility
Docker Image Service - Dependency Customization
![Page 32: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/32.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Redspot
Feature Data Management: Zipline
Bighead Service / UI
Deep Thought
ML Automator
Prototyping Lifecycle Management Production
Airflow
Real Time Inference
Batch Training + Inference
![Page 33: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/33.jpg)
Bighead ServiceModel Lifecycle Management
![Page 34: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/34.jpg)
● Tracking ML model changes is just as
important as tracking code changes
● ML model work needs to be
reproducible to be sustainable
● Comparing experiments before you
launch models into production is critical
Model Lifecycle Management - why?
![Page 35: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/35.jpg)
![Page 36: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/36.jpg)
![Page 38: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/38.jpg)
Seamless
● Context-aware visualizations that carry over from the prototyping experience
Bighead Service
Consistent
● Central model management service
● Single source of truth about the state of a model, it’s dependencies, and what’s deployed
![Page 39: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/39.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Redspot
Feature Data Management: Zipline
Bighead Service / UI
Deep Thought
ML Automator
Prototyping Lifecycle Management Production
Airflow
Real Time Inference
Batch Training + Inference
![Page 40: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/40.jpg)
Bighead Library
![Page 41: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/41.jpg)
ML Models are highly heterogeneous in
Frameworks Training data
● Data quality
● Structured vs Unstructured (image, text)
Environment
● GPU vs CPU
● Dependencies
![Page 42: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/42.jpg)
ML Models are hard to keep consistent
● Data in production is different from data in training
● Offline pipeline is different from online pipeline
● Everyone does everything in a different way
![Page 43: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/43.jpg)
Versatile
● Pipeline on steroids - compute graph for preprocessing / inference / training / evaluation / visualization
● Composable, Reusable, Shareable
● Support popular frameworks
Bighead Library
Consistent
● Uniform API
● Serializable - same pipeline used in training, offline inference, online inference
● Fast primitives for preprocessing
● Metadata for trained models
![Page 44: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/44.jpg)
Bighead Library: ML Pipeline
![Page 45: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/45.jpg)
Visualization - Pipeline
![Page 46: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/46.jpg)
Easy to Serialize/Deserialize
![Page 47: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/47.jpg)
Visualization - Training Data
![Page 48: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/48.jpg)
Visualization - Transformer
![Page 49: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/49.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Redspot
Feature Data Management: Zipline
Bighead Service / UI
Deep Thought
ML Automator
Prototyping Lifecycle Management Production
Airflow
Real Time Inference
Batch Training + Inference
![Page 50: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/50.jpg)
Deep ThoughtOnline Inference
![Page 51: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/51.jpg)
Easy to do
● Data scientists can’t launch models without engineer team
● Engineers often need to rebuild models
Hard to make online model serving...
Scalable
● Resource requirements varies across models
● Throughput fluctuates across time
Consistent with training
● Different data
● Different pipeline
● Different dependencies
![Page 52: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/52.jpg)
Seamless
● Integration with event logging, dashboard
● Integration with Zipline
Deep Thought
Consistent
● Docker + Bighead Library: Same data source, pipeline, environment from training
Scalable
● Kubernetes: Model pods can easily scale
● Resource segregation across models
![Page 53: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/53.jpg)
![Page 54: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/54.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Redspot
Feature Data Management: Zipline
Bighead Service / UI
Deep Thought
ML Automator
Prototyping Lifecycle Management Production
Airflow
Real Time Inference
Batch Training + Inference
![Page 55: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/55.jpg)
ML AutomatorOffline Training and Batch Inference
![Page 56: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/56.jpg)
Automated training, inference, and evaluation are necessary
● Scheduling
● Resource allocation
● Saving results
● Dashboards and alerts
● Orchestration
ML Automator - Why
![Page 57: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/57.jpg)
Seamless
● Automate tasks via Airflow: Generate DAGs for training, inference, etc. with appropriate resources
● Integration with Zipline for training and scoring data
ML Automator
Consistent
● Docker + Bighead Library: Same data source, pipeline, environment across the stack
Scalable
● Spark: Distributed computing for large datasets
![Page 58: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/58.jpg)
ML Automator
![Page 59: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/59.jpg)
Execution Management: Bighead Library
Environment Management: Docker Image Service
Redspot
Feature Data Management: Zipline
Bighead Service / UI
Deep Thought
ML Automator
Prototyping Lifecycle Management Production
Airflow
Real Time Inference
Batch Training + Inference
![Page 60: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/60.jpg)
ZiplineML Data Management Framework
![Page 61: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/61.jpg)
Feature management is hard
● Inconsistent offline and online datasets
● Tricky to generate training sets that depend on time correctly
● Slow training sets backfill
● Inadequate data quality checks or monitoring
● Unclear feature ownership and sharing
![Page 62: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/62.jpg)
For more information on Zipline, please come to our other talk
Zipline: Airbnb's Data Management Platform for Machine Learning
2:55pm - 1A 21/22
![Page 63: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/63.jpg)
Big Summary
End-to-End platform to build and deploy ML models to production that is seamless, versatile, consistent, and scalable
● Model lifecycle management● Feature generation & management● Online & offline inference● Pipeline library supporting major frameworks● Docker image customization service● Multi-tenant training environment
Built on open source technology
● TensorFlow, PyTorch, Keras, MXNet, Scikit-learn, XGBoost● Spark, Jupyter, Kubernetes, Docker, Airflow
![Page 64: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/64.jpg)
Future Works
Monitoring
● Feature distribution
● Model performance
BigQueue
● On demand compute service with heterogeneous resources
![Page 65: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/65.jpg)
One more thing...
![Page 67: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/67.jpg)
Questions?
![Page 68: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/68.jpg)
Appendix
![Page 69: Airbnb’s End-to-End Machine Learning Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022022304/62155129743c3464cc5f7f3a/html5/thumbnails/69.jpg)