![Page 1: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/1.jpg)
Data Science in the Cloud
Stefan Krawczyk @stefkrawczyklinkedin.com/in/skrawczykNovember 2016
![Page 2: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/2.jpg)
Who are Data Scientists?
![Page 3: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/3.jpg)
![Page 4: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/4.jpg)
![Page 5: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/5.jpg)
![Page 6: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/6.jpg)
Means: skills vary wildly
![Page 7: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/7.jpg)
But they’re in demand and expensive
![Page 8: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/8.jpg)
“The Sexiest Job of the 21st Century”
- HBR
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
![Page 9: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/9.jpg)
How many Data Scientists do you have?
![Page 10: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/10.jpg)
At Stitch Fix we have ~80
![Page 11: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/11.jpg)
~85% have not done formal CS
![Page 12: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/12.jpg)
But what do they do?
![Page 13: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/13.jpg)
What is Stitch Fix?
![Page 14: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/14.jpg)
![Page 15: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/15.jpg)
![Page 16: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/16.jpg)
![Page 17: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/17.jpg)
![Page 18: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/18.jpg)
![Page 19: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/19.jpg)
![Page 20: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/20.jpg)
![Page 21: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/21.jpg)
Two Data Scientist facts:
1. Has AWS console access*.
2. End to end, they’re responsible.
![Page 22: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/22.jpg)
How do we enable this without
?
![Page 23: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/23.jpg)
Make doing the right thing the easy thing.
![Page 24: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/24.jpg)
Fellow Collaborators
Horizontal team focused on Data Scientist Enablement
![Page 25: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/25.jpg)
1. Eng. Skills2. Important
3. What they work on
![Page 26: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/26.jpg)
Let’s Start
![Page 27: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/27.jpg)
Will Only Cover
1. Source of truth: S3 & Hive Metastore2. Docker Enabled DS @ Stitch Fix3. Scaling DS doing ML in the Cloud
![Page 28: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/28.jpg)
Source of truth:
S3 & Hive Metastore
![Page 29: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/29.jpg)
Want Everyone to Have Same View
A
B
![Page 30: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/30.jpg)
This is Usually Nothing to Worry About
● OS handles correct access
● DB has ACID properties
A
B
![Page 31: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/31.jpg)
This is Usually Nothing to Worry About
● OS handles correct access
● DB has ACID properties
● But it’s easy to outgrow these options with a big data/team.
A
B
![Page 32: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/32.jpg)
● Amazon’s Simple Storage Service
● Infinite* storage
● Can write, read, delete, BUT NOT append.
● Looks like a file system*:○ URIs: my.bucket/path/to/files/file.txt
● Scales well
S3
* For all intents and purposes
![Page 33: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/33.jpg)
● Hadoop service, that stores:○ Schema○ Partition information, e.g. date○ Data location for a partition
Hive Metastore
![Page 34: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/34.jpg)
● Hadoop service, that stores:○ Schema○ Partition information, e.g. date○ Data location for a partition
Hive Metastore:
Hive Metastore
Partition Location
20161001 s3://bucket/sold_items/20161001
...
20161031 s3://bucket/sold_items/20161031
sold_items
![Page 35: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/35.jpg)
Hive Metastore
![Page 36: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/36.jpg)
● Replacing data in a partition
But if we’re not careful
![Page 37: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/37.jpg)
● Replacing data in a partition
But if we’re not careful
![Page 38: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/38.jpg)
But if we’re not careful
B
A
![Page 39: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/39.jpg)
But if we’re not careful
● S3 is eventually consistent
● These bugs are hard to track down
A
B
![Page 40: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/40.jpg)
● Use Hive Metastore to control partition source of truth
● Principles:○ Never delete○ Always write to a new place each time a partition changes
● Stitch Fix solution: ○ Use an inner directory → called Batch ID
Hive Metastore to the Rescue
![Page 41: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/41.jpg)
Batch ID Pattern
![Page 42: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/42.jpg)
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/20161031/20161101002256/
sold_items
![Page 43: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/43.jpg)
● Overwriting a partition is just a matter of updating the location
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/20161031/20161101002256/s3://bucket/sold_items/20161031/20161102234252
sold_items
![Page 44: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/44.jpg)
● Overwriting a partition is just a matter of updating the location● To the user this is a hidden inner directory
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/20161031/20161101002256/s3://bucket/sold_items/20161031/20161102234252
sold_items
![Page 45: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/45.jpg)
Enforce via API
![Page 46: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/46.jpg)
Enforce via API
![Page 47: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/47.jpg)
Python:
store_dataframe(df, dest_db, dest_table, partitions=[‘2016’])df = load_dataframe(src_db, src_table, partitions=[‘2016’])
R:sf_writer(data = result,
namespace = dest_db, resource = dest_table, partitions = c(as.integer(opt$ETL_DATE)))
sf_reader(namespace = src_db, resource = src_table, partitions = c(as.integer(opt$ETL_DATE)))
API for Data Scientists
![Page 48: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/48.jpg)
● Full partition history○ Can rollback
■ Data Scientists are less afraid of mistakes
○ Can create audit trails more easily■ What data changed and when
○ Can anchor downstream consumers to a particular batch ID
Batch ID Pattern Benefits
![Page 49: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/49.jpg)
Docker Enabled DS @ Stitch Fix
![Page 50: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/50.jpg)
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
High High
Ad hoc Infra: In the Beginning...
![Page 51: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/51.jpg)
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
High High
Ad hoc Infra: Evolution I
![Page 52: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/52.jpg)
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
High High
Ad hoc Infra: Evolution II
![Page 53: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/53.jpg)
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
Low High
Ad hoc Infra: Evolution III
![Page 54: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/54.jpg)
● Control of environment○ Data Scientists don’t need to worry about env.
● Isolation○ can host many docker containers on a single machine.
● Better host management○ allowing central control of machine types.
Why Does Docker Lower Overhead?
![Page 55: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/55.jpg)
Flotilla UI
![Page 56: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/56.jpg)
● Has:○ Our internal API libraries○ Jupyter Notebook:
■ Pyspark■ IPython
○ Python libs: ■ scikit, numpy, scipy, pandas, etc.
○ RStudio○ R libs:
■ Dplyr, magrittr, ggplot2, lme4, BOOT, etc.
● Mounts User NFS
● User has terminal access to file system via Jupyter for git, pip, etc.
Our Docker Image
![Page 57: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/57.jpg)
Docker Deployment
![Page 58: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/58.jpg)
Docker Deployment
![Page 59: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/59.jpg)
Docker Deployment
![Page 60: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/60.jpg)
● Docker tightly integrates with the Linux Kernel.○ Hypothesis:
■ Anything that makes uninterruptable calls to the kernel can:● Break the ECS agent because the container doesn’t respond.● Break isolation between containers.
■ E.g. Mounting NFS
● Docker Hub:○ Switched to artifactory
Our Docker Problems So Far
![Page 61: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/61.jpg)
Scaling DS doing ML in the Cloud
![Page 62: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/62.jpg)
1. Data Latency2. To Batch
or Not To Batch3. What’s in a Model?
![Page 63: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/63.jpg)
Data Latency
How much time do you spend waiting for data?
![Page 64: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/64.jpg)
*This could be a laptop, a shared system, a batch process, etc.
![Page 65: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/65.jpg)
Use Compression
*This could be a laptop, a shared system, a batch process, etc.
![Page 66: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/66.jpg)
Use Compression - The Components
[ 1.3234543 0.23443434 … ]
[ 1 0 0 1 0 0 … 0 1 0 0 0 1 0 1 ... … 1 0 1 1 ]
[ 1 0 0 1 0 0 … 0 1 0 0 ]
[ 1.3234543 0.23443434 … ]
{ 100: 0.56, … ,110: 0.65, … , … , 999: 0.43 }
![Page 67: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/67.jpg)
Use Compression - Python Comparison
Pickle: 60MBZlib+Pickle: 129KBJSON: 15MBZlib+JSON: 55KB
Pickle: 3.1KBZlib+Pickle: 921BJSON: 2.8KBZlib+JSON: 681B
Pickle: 2.6MBZlib+Pickle: 600KBJSON: 769KBZlib+JSON: 139KB
[ 1.3234543 0.23443434 … ]
[ 1 0 0 1 0 0 … 0 1 0 0 0 1 0 1 ... … 1 0 1 1 ]
[ 1 0 0 1 0 0 … 0 1 0 0 ]
[ 1.3234543 0.23443434 … ]
{ 100: 0.56, … ,110: 0.65, … , … , 999: 0.43 }
![Page 68: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/68.jpg)
● Naïve scheme of JSON + Zlib works well:
Observations
import jsonimport zlib...# compresscompressed = zlib.compress(json.dumps(value))# decompressoriginal = json.loads(zlib.decompress(compressed))
![Page 69: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/69.jpg)
● Naïve scheme of JSON + Zlib works well:
● Double vs Float: do you really need to store that much precision?
Observations
import jsonimport zlib...# compresscompressed = zlib.compress(json.dumps(value))# decompressoriginal = json.loads(zlib.decompress(compressed))
![Page 70: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/70.jpg)
● Naïve scheme of JSON + Zlib works well:
● Double vs Float: do you really need to store that much precision?
● For more inspiration look to columnar DBs and how they compress columns
Observations
import jsonimport zlib...# compresscompressed = zlib.compress(json.dumps(value))# decompressoriginal = json.loads(zlib.decompress(compressed))
![Page 71: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/71.jpg)
To Batch or Not To Batch:
When is batch inefficient?
![Page 72: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/72.jpg)
● Online:○ Computation occurs synchronously when needed.
● Streamed:○ Computation is triggered by an event(s).
Online & Streamed Computation
![Page 73: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/73.jpg)
Online & Streamed Computation
Very likely you start with a batch system
![Page 74: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/74.jpg)
Online & Streamed Computation
● Do you need to recompute:○ features for all users?○ predicted results for all users?
Very likely you start with a batch system
![Page 75: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/75.jpg)
Online & Streamed Computation
● Do you need to recompute:○ features for all users?○ predicted results for all users?
● Are you heavily dependent on your
ETL running every night?
Very likely you start with a batch system
![Page 76: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/76.jpg)
Online & Streamed Computation
● Do you need to recompute:○ features for all users?○ predicted results for all users?
● Are you heavily dependent on your
ETL running every night?
● Online vs Streamed depends on in house factors:○ Number of models○ How often they change○ Cadence of output required○ In house eng. expertise○ etc.
Very likely you start with a batch system
![Page 77: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/77.jpg)
Online & Streamed Computation
● Do you need to recompute:○ features for all users?○ predicted results for all users?
● Are you heavily dependent on your
ETL running every night?
● Online vs Streamed depends on in house factors:○ Number of models○ How often they change○ Cadence of output required○ In house eng. expertise○ etc.
Very likely you start with a batch system
We use online system for recommendations
![Page 78: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/78.jpg)
Streamed Example
![Page 79: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/79.jpg)
Streamed Example
![Page 80: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/80.jpg)
Streamed Example
![Page 81: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/81.jpg)
Streamed Example
![Page 82: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/82.jpg)
● Dedicated infrastructure → More room on batch infrastructure○ Hopefully $$$ savings○ Hopefully less stressed Data Scientists
Online/Streaming Thoughts
![Page 83: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/83.jpg)
● Dedicated infrastructure → More room on batch infrastructure○ Hopefully $$$ savings○ Hopefully less stressed Data Scientists
● Requires better software engineering practices○ Code portability/reuse○ Designing APIs/Tools Data Scientists will use
Online/Streaming Thoughts
![Page 84: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/84.jpg)
● Dedicated infrastructure → More room on batch infrastructure○ Hopefully $$$ savings○ Hopefully less stressed Data Scientists
● Requires better software engineering practices○ Code portability/reuse○ Designing APIs/Tools Data Scientists will use
● Prototyping on AWS Lambda & Kinesis was surprisingly quick○ Need to compile C libs on an amazon linux instance
Online/Streaming Thoughts
![Page 85: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/85.jpg)
What’s in a Model?
Scaling model knowledge
![Page 86: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/86.jpg)
Ever:● Had someone leave and then nobody understands how they trained their
models?
![Page 87: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/87.jpg)
Ever:● Had someone leave and then nobody understands how they trained their
models?○ Or you didn’t remember yourself?
![Page 88: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/88.jpg)
Ever:● Had someone leave and then nobody understands how they trained their
models?○ Or you didn’t remember yourself?
● Had performance dip in models and you have trouble figuring out why?
![Page 89: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/89.jpg)
Ever:● Had someone leave and then nobody understands how they trained their
models?○ Or you didn’t remember yourself?
● Had performance dip in models and you have trouble figuring out why?○ Or not known what’s changed between model deployments?
![Page 90: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/90.jpg)
Ever:● Had someone leave and then nobody understands how they trained their
models?○ Or you didn’t remember yourself?
● Had performance dip in models and you have trouble figuring out why?○ Or not known what’s changed between model deployments?
● Wanted to compare model performance over time?
![Page 91: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/91.jpg)
Ever:● Had someone leave and then nobody understands how they trained their
models?○ Or you didn’t remember yourself?
● Had performance dip in models and you have trouble figuring out why?○ Or not known what’s changed between model deployments?
● Wanted to compare model performance over time?
● Wanted to train a model in R/Python/Spark and then deploy it a webserver?
![Page 92: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/92.jpg)
Produce Model Artifacts
![Page 93: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/93.jpg)
● Isn’t that just saving the coefficients/model values?
Produce Model Artifacts
![Page 94: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/94.jpg)
● Isn’t that just saving the coefficients/model values?○ NO!
Produce Model Artifacts
![Page 95: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/95.jpg)
● Isn’t that just saving the coefficients/model values?○ NO!
● Why?
Produce Model Artifacts
![Page 96: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/96.jpg)
● Isn’t that just saving the coefficients/model values?○ NO!
● Why?
Produce Model Artifacts
![Page 97: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/97.jpg)
● Isn’t that just saving the coefficients/model values?○ NO!
● Why?
How do you deal with organizational drift?
Produce Model Artifacts
![Page 98: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/98.jpg)
● Isn’t that just saving the coefficients/model values?○ NO!
● Why?
How do you deal with organizational drift?
Produce Model Artifacts
Makes it easy to keep an archive and track changes over time
![Page 99: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/99.jpg)
● Isn’t that just saving the coefficients/model values?○ NO!
● Why?
How do you deal with organizational drift?
Produce Model Artifacts
Helps a lot with model debugging & diagnosis!
Makes it easy to keep an archive and track changes over time
![Page 100: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/100.jpg)
● Isn’t that just saving the coefficients/model values?○ NO!
● Why?
How do you deal with organizational drift?
Produce Model Artifacts
Helps a lot with model debugging & diagnosis!
Makes it easy to keep an archive and track changes over time Can more easily use in
downstream processes
![Page 101: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/101.jpg)
● Analogous to software libraries
● Packaging:○ Zip/Jar file
Produce Model Artifacts
![Page 102: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/102.jpg)
But all the above seems complex?
![Page 103: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/103.jpg)
We’re building APIs.
![Page 104: in the Cloud Data ScienceDocker tightly integrates with the Linux Kernel. Hypothesis: Anything that makes uninterruptable calls to the kernel can: Break the ECS agent because the container](https://reader035.vdocuments.us/reader035/viewer/2022071300/608921f53f556e2e8c78eab8/html5/thumbnails/104.jpg)
Fin; Questions?
@stefkrawczyk