@Rent The Runway: Finding that dress at scale · on-demand.gputechconf.com/gtc/2018/presentation/s...
TRANSCRIPT
![Page 1: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/1.jpg)
Finding that dress at scale @Rent The Runway
Saurabh Bhatnagar
![Page 2: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/2.jpg)
Bio
17 years in ML/data
Prev: Responsible for personalization and ML at RTR
Prev: Founded Data Science at Barnes & Noble
Prev: Consulted at HP, Unilever, …
Now: Founder, Virevol AI
@analyticsaurabh
www.sanealytics.com
![Page 3: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/3.jpg)
Rent The Runway
- Democratize luxury fashion
- eCommerce rental model
- Closet in the cloud
- 8m registered users
- Optional Unlimited membership programs
- 1,500 dresses dry cleaned every hour
- Biggest dry cleaner in the world
1 Sr Data Scientist + 2 Jr Data Scientists Team!
![Page 4: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/4.jpg)
What you will learn
How to scale using
- Strategy
- How to bet on the right infra stack
- Software engineering and tests/checks for ML
- Maintain complex ML jungle
- Practical lessons you can take to work on Monday
![Page 5: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/5.jpg)
![Page 6: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/6.jpg)
![Page 7: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/7.jpg)
![Page 8: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/8.jpg)
![Page 9: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/9.jpg)
![Page 10: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/10.jpg)
Image search
![Page 11: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/11.jpg)
Instagram + Humans => Dress review
![Page 12: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/12.jpg)
Reverse Logistics
![Page 13: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/13.jpg)
Scale?
- Typical scope creep: You need a team of 60 data scientists + huge infra team
- RTR story: One Sr Data Scientist + 2 Jr over 5 years!
- Artisanal hand-rolled ML
![Page 14: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/14.jpg)
Fashion is unsolved
RTR != Netflix
High stakes
It is visual
Preferences change
Underlying reason for buying is poorly understood
Supply side challenges
![Page 15: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/15.jpg)
Teams: $\frac{N(N-1)}{2}$
Take home lesson: The complexity of communication grows quadratically with the number of teams involved
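To make the N(N-1)/2 count on this slide concrete, here is a minimal Python sketch. It assumes every pair of teams needs its own communication channel; the function name is hypothetical and not from the talk.

```python
def communication_channels(n_teams: int) -> int:
    """Number of distinct team pairs, i.e. N(N-1)/2."""
    if n_teams < 0:
        raise ValueError("n_teams must be non-negative")
    return n_teams * (n_teams - 1) // 2


# Doubling the number of teams roughly quadruples the number of channels.
for n in (2, 3, 5, 10, 20):
    print(f"{n} teams -> {communication_channels(n)} channels")
```

Under this assumption a three-person team shares 3 channels, while 10 teams already share 45.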
![Page 16: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/16.jpg)
Scale: You need a strategy
- KISS: An exercise to exonerate complexity is an exercise in simplicity
- Study opportunity, pitch, demo, get team, agree on deliverables
- Tie metrics to $$$$, obsess over product, set expectations
- Metric will be wrong over time
- Simple linear baseline first, deploy, then iterate (80/20)
- Research vs “Can’t fail” expectations, overcommunicate
- Need more backend/frontend engineers, UX design, project, etc. to drive algorithmic success than ML Engineers
- Mentor
KISS
![Page 17: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/17.jpg)
Good fences make good neighbours
- ML SaaS, i.e. SLAs
  - Latency SLA, uptime SLAs: Engineering caches last served + default user
- Separation of concerns/clear action on failure:
  - For Engineering: Just restart, it will go back to the previous version of the model on start
  - For ETL team: We can create jobs to put data on the pipe, but ownership stays with that team
- Reliability via tests and checks (ML flavor)
- Graceful fallbacks... and fallbacks to fallbacks (cold start)
- Recompute models daily (depends), continuous deployment
- Architect software so parts are switchable (even languages)
- Data analysis (R/shiny, SQL, Tableau, python) vs ML
- Reports tracking what ML can’t do, guardrails
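The "fallbacks to fallbacks" idea on this slide can be sketched as a chain of sources that are tried in order, ending in a default-user list that always exists. This is a minimal illustration, not RTR's serving code: `recommend_with_fallbacks`, `fresh_model`, `last_served` and the sample data are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# A source returns recommendations for a user, or None/empty if it has nothing.
Source = Callable[[str], Optional[List[str]]]


def recommend_with_fallbacks(
    user_id: str, sources: Sequence[Source], default: Sequence[str]
) -> List[str]:
    """Try each source in order; fall back to the default-user list."""
    for source in sources:
        try:
            recs = source(user_id)
        except Exception:
            continue  # a failing layer must not take the request down
        if recs:
            return list(recs)
    return list(default)  # cold start, or every layer failed


# Hypothetical layers: a model server that times out and a last-served cache.
LAST_SERVED = {"u1": ["dress_17", "dress_42"]}
DEFAULT_USER_RECS = ["bestseller_1", "bestseller_2"]


def fresh_model(user_id: str) -> List[str]:
    raise TimeoutError("model server missed the latency SLA")


def last_served(user_id: str) -> Optional[List[str]]:
    return LAST_SERVED.get(user_id)


print(recommend_with_fallbacks("u1", [fresh_model, last_served], DEFAULT_USER_RECS))
print(recommend_with_fallbacks("new_user", [fresh_model, last_served], DEFAULT_USER_RECS))
```

A known user gets the cached list when the model is down, and an unknown user gets the default-user list, so the caller never has to handle an error.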
KISS
![Page 18: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/18.jpg)
Data infra through time (not accurate)
- Look ma, I can store files… yay!
- OK, too many files, directories, organization (databases)
- Hadoop - Look ma, I can store files on multiple computers… yay!
- Spark - Need to organize for ML on multiple machines… needs a lot of infra, JVM
- GPU - Look ma, I can process a LOT of data really fast on one box
- Future? Streaming GPU databases, ML framework, ...
KISS
Lesson: It is hard to pick tech that lasts 5 years! Be switchable by design
![Page 19: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/19.jpg)
![Page 20: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/20.jpg)
Scaling: When not to use GPUs
- Those network costs add up. Keep data transfer at a minimum
- Example, sparse SVD/cf: Y = R ( U I )
  - Memory: $r \cdot u \cdot i + u \cdot k + i \cdot k \le g \cdot 8 \times 10^9 / 32$ (g GB of memory, 32-bit floats)
  - $u \le (g \cdot 8 \times 10^9 / 32 - i \cdot k) / (i \cdot r + k)$
  - With g=1: if r=1%, k=100, i=1e4, u ~1.2m. If i=1e6, u <= 14k
  - AWS c5.18xlarge = 72 CPUs, 144 GB => 180m vs 3.5m users per batch
  - 1080 Ti GPU = 3,584 cores, 11 GB => ~14m or 262k users per batch
- Spark cluster, if we’re talking petabytes (are we, though? See num of items, hashing tricks)
- Other reasons: To keep network costs low, where does data already live? Algo not parallelizable?
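The batch-size bound on this slide is easy to check numerically. The sketch below assumes the symbols mean what the examples suggest: r is the density of the ratings matrix, u the number of users, i the number of items, k the number of latent factors and g the memory in GB (taken as 8e9 bits). The function name and defaults are hypothetical.

```python
def max_users_per_batch(
    mem_gb: float,
    n_items: float,
    density: float = 0.01,
    rank: int = 100,
    bits_per_float: int = 32,
) -> int:
    """Largest u satisfying r*u*i + u*k + i*k <= g * 8e9 / bits_per_float."""
    budget = mem_gb * 8e9 / bits_per_float  # number of floats that fit
    users = (budget - n_items * rank) / (n_items * density + rank)
    return max(0, int(users))


# Memory sizes taken from the slide; "1 GB" is the unit case.
for label, mem_gb in (("1 GB", 1), ("c5.18xlarge, 144 GB", 144), ("GPU, 11 GB", 11)):
    small = max_users_per_batch(mem_gb, n_items=1e4)
    large = max_users_per_batch(mem_gb, n_items=1e6)
    print(f"{label}: {small:,} users at 1e4 items, {large:,} users at 1e6 items")
```

The item count dominates: going from 1e4 to 1e6 items shrinks the batch by roughly 50x on every machine, which is why the slide points at the number of items and hashing tricks before reaching for a cluster.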
![Page 21: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/21.jpg)
[Figure, labeled "1m items"]
![Page 22: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/22.jpg)
ML as a Service
Software at scale
![Page 23: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/23.jpg)
Train user style recommendations
Serve style recommendations (gRPC)
Train user event recommendations
Train review language model (spaCy)
Serve image search (Flask)
Dress allocation solver
DeepDress ML Lib
Train user fit recommendations
Data Bus
...
![Page 24: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/24.jpg)
XFL S3
Train Membership recos (GPU)
Engg S3
Serve Membership recos (CPU gRPC python)
JAVA cache server
Update recos server (CPU)
XFL Kafka
Engg Kafka
![Page 25: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/25.jpg)
Software at scale: Reuse and flexibility
- DeepDress library has shared data buses, embeddings, models. Library can load latest embeddings/models across the ecosystem (over S3)
- Versioning and default users/products are important for fallbacks
- Languages are irrelevant, the problem you’re solving is important
- However, for reliability, base in one language (Python), glue for others. We have R and C++ bindings via feather/arrow
- gRPC + Java to serve (RTR backend stack is only in Java)
- Data: Abstracted Bus. Can be Disk, S3 or Kafka or something else in the future
KISS
Lesson: People who forget relational databases are condemned to reinvent them
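The abstracted Bus on this slide can be sketched as one small interface with interchangeable backends. The code below is an assumption about how such an abstraction could look, not the DeepDress API: the class and function names are hypothetical, and an in-memory backend stands in for S3 or Kafka so the example stays self-contained.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Bus(ABC):
    """Minimal storage interface; callers never see the backend."""

    @abstractmethod
    def put(self, key: str, payload: dict) -> None: ...

    @abstractmethod
    def get(self, key: str) -> dict: ...


class DiskBus(Bus):
    """Stores each payload as a JSON file under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, payload: dict) -> None:
        (self.root / f"{key}.json").write_text(json.dumps(payload))

    def get(self, key: str) -> dict:
        return json.loads((self.root / f"{key}.json").read_text())


class MemoryBus(Bus):
    """In-memory stand-in for a remote backend such as S3 or Kafka."""

    def __init__(self) -> None:
        self._store = {}

    def put(self, key: str, payload: dict) -> None:
        self._store[key] = json.dumps(payload)

    def get(self, key: str) -> dict:
        return json.loads(self._store[key])


def publish_embeddings(bus: Bus, version: str, embeddings: dict) -> None:
    """Training code writes versioned embeddings without knowing the backend."""
    bus.put(f"embeddings-{version}", embeddings)


for bus in (MemoryBus(), DiskBus(tempfile.mkdtemp())):
    publish_embeddings(bus, "v1", {"dress_17": [0.1, 0.9]})
    print(type(bus).__name__, bus.get("embeddings-v1"))
```

Because `publish_embeddings` depends only on `Bus`, swapping the backend is a one-line change at the call site, which is the "switchable by design" property the talk argues for.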
![Page 26: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/26.jpg)
DeepDress AI library
- Bus
  - S3/Disk/Kafka/DoubleDecker
- DataLoader
  - Orders, Person, People, Reviews, Photos
- Model
  - CollaborativeFiltering, CarouselNet, DressNet, Dress2Vec, FitModel, RModel, FulfillmentSolver …
  - Load/Save models, embeddings, auto checkpoints to Bus
- Checks
  - HoldOutCheck, SelfDriftCheck, MetricDriftCheck, ...
- Tests
- Metrics
- Rcpp
- Utils
Built on top of PyTorch, NumPy, pandas and some R/C++; external libs like spaCy and PuLP are not required
![Page 27: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/27.jpg)
Simplify your workflow
- Understand and improve model, reuse, no ensembles!
- Work with product to figure out better ways to capture data and improve model
- This isn’t Kaggle, you have influence on data, UX and roadmap.
- North star ($)
KISS
![Page 28: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/28.jpg)
DressNet v3
Dress2Vec
DressReviews2Vec
User Embedding
Item Embedding
...
ReLU
...
Item Vector
BCE Loss
![Page 29: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/29.jpg)
Validation as integration test
ML code:
Inputs -> Black box ML (function + data) -> Output
Test: A change in data changes the assumptions. It could be an upstream ETL problem, but the model is blind to it.
Regular deterministic code:
Inputs -> Some known function -> Output
Test: Make sure output works for some expected inputs… unit tests, fuzz tests, random tests, integration tests
![Page 30: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/30.jpg)
Holdout Check
Train:
90% of users with full history + 10% of users with last k missing
Test:
For those 10%, check against those last k
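The split described on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the DeepDress `HoldOutCheck`: the function names, the hit-rate metric, the toy data and the toy recommender are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Histories = Dict[str, List[str]]


def holdout_split(
    histories: Histories, k: int = 2, holdout_frac: float = 0.10, seed: int = 0
) -> Tuple[Histories, Histories]:
    """Hide the last k items for ~10% of users; keep everyone else's full history."""
    rng = random.Random(seed)
    users = sorted(histories)
    rng.shuffle(users)
    held_users = set(users[: max(1, round(len(users) * holdout_frac))])
    train, held_out = {}, {}
    for user, items in histories.items():
        if user in held_users and len(items) > k:
            train[user] = items[:-k]      # history with the last k missing
            held_out[user] = items[-k:]   # what the model should recover
        else:
            train[user] = list(items)     # full history
    return train, held_out


def hit_rate(
    recommend: Callable[[str], List[str]], held_out: Histories, top_n: int = 10
) -> float:
    """Share of held-out users with at least one hidden item in the top N."""
    if not held_out:
        return 0.0
    hits = sum(
        1 for user, items in held_out.items() if set(recommend(user)[:top_n]) & set(items)
    )
    return hits / len(held_out)


# Toy data: user n rented dress_n .. dress_{n+4} in order.
histories = {f"user_{n}": [f"dress_{n + j}" for j in range(5)] for n in range(20)}
train, held_out = holdout_split(histories, k=2)


def recommend_next(user: str) -> List[str]:
    """Toy recommender that guesses the next dress id after the last seen one."""
    last_id = int(train[user][-1].split("_")[1])
    return [f"dress_{last_id + 1}"]


print(len(held_out), "held-out users, hit rate:", hit_rate(recommend_next, held_out))
```

Run after every retrain, a check like this behaves as an integration test: if an upstream data change breaks the model's assumptions, the hit rate on the hidden items drops even though no code changed.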
![Page 31: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/31.jpg)
Other useful checks
SelfDriftCheck
Did the prediction metrics change compared to the last n-day moving average?
MetricDriftCheck
Compare to another business metric ($$$)
Does this metric still track reality?
Tests/Checks are a way to encode our assumptions for building that model, choosing that metric and assuming those relationships in data
KISS
IntegrationCheck
Scrape website and see if that’s what we sent
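A SelfDriftCheck in the sense described on this slide can be sketched as a comparison against a moving average. The function below is a hypothetical illustration: the name, the 7-day window, the 10% relative tolerance and the sample metric values are assumptions, not values from the talk.

```python
from statistics import mean
from typing import Sequence


def self_drift_check(
    history: Sequence[float], today: float, n_days: int = 7, tolerance: float = 0.10
) -> bool:
    """Return True if today's metric is within tolerance of the n-day moving average."""
    window = list(history[-n_days:])
    if not window:
        return True  # nothing to compare against yet
    baseline = mean(window)
    if baseline == 0:
        return today == 0
    return abs(today - baseline) / abs(baseline) <= tolerance


# Hypothetical daily holdout hit rates for the past week.
daily_hit_rate = [0.31, 0.30, 0.32, 0.29, 0.30, 0.31, 0.30]

print(self_drift_check(daily_hit_rate, today=0.29))  # small wobble
print(self_drift_check(daily_hit_rate, today=0.18))  # large drop
```

A failed check would block the day's model from deploying, so serving keeps the previous version. A MetricDriftCheck could reuse the same comparison with a business metric as the baseline series.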
![Page 32: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/32.jpg)
Strategy: SLAs, $, tracking
Infra: GPU, glue
MLaaS: Embedding DB/API
![Page 33: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/33.jpg)
Virevol AI
Automating and augmenting retail
![Page 34: @Rent The Runway Finding that dress at scaleon-demand.gputechconf.com/gtc/2018/presentation/s... · Scaling: When not to use GPUs - Those network costs add up.. Keep data transfer](https://reader035.vdocuments.us/reader035/viewer/2022081404/5f04ad6d7e708231d40f2872/html5/thumbnails/34.jpg)
Keep in touch
@analyticsaurabh
www.virevol.com
www.sanealytics.com
www.RentTheRunway.com