![Page 1: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/1.jpg)
SCALABLE & DEPLOYABLE DATA SCIENCE WITH THE ANACONDA PLATFORM
Kristopher Overholt, Product Manager
Continuum Analytics
![Page 2: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/2.jpg)
#OpenDataScienceMeans #AnacondaCON
OVERVIEW
• Collaborative Data Science Workflows
• Scaling Out with Anaconda
  • Spectrum of parallelization
  • Spark, Hadoop, Dask, and other parallel frameworks
  • Example distributed/parallel use cases
• Productionizing Data Science Projects
  • Enterprise deployment considerations
• Deploying Data Science Projects
  • Notebooks, dashboards, interactive applications, and models with APIs
![Page 3: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/3.jpg)
COLLABORATIVE DATA SCIENCE WORKFLOWS
![Page 4: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/4.jpg)
COLLABORATIVE DATA SCIENCE WORKFLOWS
Data science teams often use intermediate deployments and modular, layered development approaches for data ingest, data cleaning, computation, machine learning, visualization, etc.
![Page 5: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/5.jpg)
ANACONDA - SCALED OUT OPEN DATA SCIENCE
• Application and Visualization: Jupyter Notebook, Matplotlib, seaborn, Bokeh, etc.
• Analytics: pandas, NumPy, SciPy, Numba, scikit-learn, NLTK, scikit-image, PIL, and more
• Computation: PySpark, SparkR, Dask, Distributed
• Data and Resource Management: HDFS, NFS, YARN, SGE, SLURM
• Servers: Bare-metal or cloud-based cluster
Diagram side labels: Cluster, Anaconda
![Page 6: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/6.jpg)
SPECTRUM OF PARALLELIZATION
From explicit control (fast but low-level) to implicit control (restrictive but easy):
• Threads, Processes
• MPI, ZeroMQ
• Dask
• Hadoop, Spark
• SQL: Hive, Pig, Impala
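The difference between the two ends of this spectrum can be sketched with the Python standard library alone. The example below is not from the talk: `square`, `explicit_threads`, and `pooled_map` are hypothetical names chosen for illustration. The first function manages threads by hand (explicit control), while the second hands scheduling to an executor (a step toward implicit control).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Stand-in for any per-item computation."""
    return x * x


def explicit_threads(values, n_workers=4):
    """Explicit end: create threads, split the work, and collect results by hand."""
    results = [None] * len(values)

    def worker(start):
        # Each thread handles every n_workers-th item and writes only to its own slots.
        for i in range(start, len(values), n_workers):
            results[i] = square(values[i])

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def pooled_map(values, n_workers=4):
    """Higher-level: the executor decides which worker runs which item."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(square, values))


values = list(range(10))
explicit_result = explicit_threads(values)
pooled_result = pooled_map(values)
```

Both calls produce the same list; the trade-off is how much of the scheduling the caller has to write and maintain.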
![Page 7: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/7.jpg)
SCALING OUT WITH ANACONDA AND SPARK
Using Anaconda with Spark is:
• Extensible: Use libraries from Anaconda with PySpark and SparkR jobs
• Integrated: Use interactive notebooks with data in HDFS and on YARN clusters
• Secure: Works with Kerberized Hadoop clusters
• Scalable: Map pandas, NumPy, SciPy jobs on large clusters and data sets
• Seamless: Works with Cloudera CDH, Hortonworks HDP, and other enterprise Hadoop distributions
Anaconda dramatically simplifies the installation and management of popular Python and R packages and their dependencies.
![Page 8: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/8.jpg)
SCALING OUT WITH ANACONDA AND DASK
Dask is a Python parallel computing library that is:
• Familiar: Implements parallel NumPy and pandas objects
• Fast: Optimized for demanding numerical applications
• Flexible: Supports sophisticated and messy algorithms
• Scales up: Runs resiliently on clusters of 100s of machines
• Scales down: Pragmatic in a single process on a laptop
• Interactive: Responsive and fast for interactive data science
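The idea behind a "parallel pandas object" is that a large collection is split into partitions, each partition is processed independently, and the partial results are combined. The sketch below illustrates that partition-and-combine pattern using only the standard library; it does not use Dask itself, and the helper names (`split_into_partitions`, `partial_sum_and_count`, `partitioned_mean`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Split a list into at most n_partitions contiguous chunks."""
    if not values:
        raise ValueError("values must not be empty")
    size = -(-len(values) // n_partitions)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_and_count(partition):
    """Per-partition work: return the pieces needed to combine a mean later."""
    return sum(partition), len(partition)


def partitioned_mean(values, n_partitions=4):
    """Compute a mean by combining per-partition sums and counts."""
    partitions = split_into_partitions(values, n_partitions)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 101))
mean_value = partitioned_mean(data)
```

A library such as Dask applies the same pattern to real NumPy arrays and pandas dataframes and can spread the partitions across a cluster rather than local threads.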
![Page 9: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/9.jpg)
OTHER WAYS TO SCALE OUT WITH ANACONDA
Anaconda integrates with:
• Spark (PySpark, SparkR) and other Hadoop components, including YARN, HDFS, Hive, Impala, and more
• Dask, Distributed, knit, dask-ec2, hdfs3, fastparquet
• CSV, SQL, JSON, HDF5, Parquet, etc.
• Amazon Web Services, Microsoft Azure, Google Cloud Platform
• Streaming analytics: Streamparse for Apache Storm, Spark Streaming, Kafka, Python integration with ELK
Anaconda Technology Partners:
• Cloudera
• Hortonworks
• IBM
• H2O
• Docker
• … and more
![Page 10: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/10.jpg)
SCALING OUT WITH ANACONDA
Anaconda platform
Cluster
Biz Analysts, Data Scientists, Developers, Data Engineers, DevOps
![Page 11: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/11.jpg)
SCALING OUT WITH ANACONDA
Without Anaconda Scale
Head Node
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies
Compute Nodes
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies
With Anaconda Scale
Head Node and Compute Nodes
Easily install Anaconda with performance-optimized Python and R packages and manage environments across all nodes in a cluster
![Page 12: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/12.jpg)
SCALING OUT WITH ANACONDA – EXAMPLE USE CASES
Analyzing text, tabular, or array data using Dask
• Use pandas dataframes or NumPy arrays at scale
• Work with data in different formats and data stores
Distributed natural language processing with text data using PySpark
• Explore data using a distributed memory cluster
• Interactively query and analyze data using libraries from Anaconda
Distributed machine learning workflows with Dask, Spark, H2O, TensorFlow, and more
• Work interactively and collaboratively in notebooks
• Simplify installation and management of ML libraries and dependencies
Handle custom code and workflows using Dask
• Work with custom data formats
• Construct complex pipelines including ETL and flexible computations
![Page 13: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/13.jpg)
![Page 14: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/14.jpg)
![Page 15: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/15.jpg)
![Page 16: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/16.jpg)
PRODUCTIONIZING DATA SCIENCE PROJECTS
![Page 17: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/17.jpg)
PRODUCTIONIZING DATA SCIENCE PROJECTS
• Provisioning compute resources
• Managing dependencies and environments
• Ensuring availability, uptime, and monitoring status
• Engineering for scalability
• Sharing compute resources
• Securing data, network connectivity, and credentials
• Securing network communications and SSL
• Managing authentication and access control
![Page 18: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/18.jpg)
DEPLOYING WITH COLLABORATIVE DATA SCIENCE WORKFLOWS
1. Review: Assess and review requirements and data sources
2. Design: Conceptual design of interactive application or dashboard
3. Build: Build the dashboard or application with Anaconda
4. Validate: Test and validate dashboard or application
5. Deploy: Deploy dashboard or application at scale using best practices
![Page 19: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/19.jpg)
DEPLOYING DATA SCIENCE PROJECTS - NOTEBOOKS
![Page 20: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/20.jpg)
DEPLOYING DATA SCIENCE PROJECTS - DASHBOARDS
![Page 21: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/21.jpg)
DEPLOYING DATA SCIENCE PROJECTS – INTERACTIVE APPLICATIONS
![Page 22: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/22.jpg)
DEPLOYING DATA SCIENCE PROJECTS – MODELS WITH REST APIS
Machine Learning Pipeline: Load Data, Clean Data, Anomaly Detection, Regression, Clustering
Deployed Applications: Models with REST APIs, Dashboards, Reports, Interactive Applications
Developers and data scientists can build additional layers of visualizations, dashboards, or interactive applications that consume data from API endpoints.
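As a rough illustration of a model exposed through a REST API and consumed by another layer, the sketch below uses only the Python standard library. Everything in it is an assumption made for the example: the `/predict` path, the JSON payload shape, and the fixed linear `predict` function standing in for a trained model. It is not the Anaconda platform's deployment mechanism.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in model: a fixed linear regression."""
    weights = [0.5, 2.0]
    bias = 1.0
    return bias + sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the assumed /predict endpoint is served.
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging for the example


def call_api(port, features):
    """Client side: what a dashboard or application layer might do."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/predict",
            body=json.dumps({"features": features}).encode(),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["prediction"]
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = call_api(server.server_port, [2.0, 3.0])
server.shutdown()
server.server_close()
```

A production deployment would add the concerns listed earlier in the talk, such as authentication, SSL, monitoring, and scaling, which this sketch leaves out.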
![Page 23: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/23.jpg)
SCALABLE AND DEPLOYABLE DATA SCIENCE
… with Anaconda and Anaconda Enterprise, including:
• Scaled-up Analytics: Develop and deploy the same code/environments on your local machine and a cluster
• Environment management: Dynamically manage Python, R, dependencies and other conda packages and environments across a cluster
• Collaboration: Easily share versioned notebooks and projects across users and replicate analysts’ environments for different jobs/users/groups
• Hadoop integration: Support for Hadoop, Spark and other distributed workflows; compatible with enterprise Hadoop distributions
![Page 24: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/24.jpg)
ADDITIONAL RESOURCES FOR SCALABLE AND DEPLOYABLE DATA SCIENCE
• Anaconda Enterprise subscriptions: https://www.continuum.io/anaconda-subscriptions
• Anaconda Scale: https://docs.continuum.io/anaconda-scale
• Webinars on scaling out with Anaconda: https://www.continuum.io/webinars
• Blog posts on scaling out with Anaconda: https://www.continuum.io/blog/developer-blog
  Productionizing and Deploying Data Science Projects
![Page 25: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017](https://reader031.vdocuments.us/reader031/viewer/2022022413/58ce822a1a28ab210a8b5c5f/html5/thumbnails/25.jpg)
Thank You!
@ContinuumIO @koverholt