Open Data Science with R and Anaconda
TRANSCRIPT
![Page 1: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/1.jpg)
OPEN DATA SCIENCE WITH R
Make Life Easier & More Powerful with Anaconda
Christine Doig, Senior Data Scientist
![Page 2: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/2.jpg)
About me
Christine Doig, Senior Data Scientist, Continuum Analytics
Christine Doig is a Senior Data Scientist at Continuum Analytics, where she worked on MEMEX, a DARPA-funded project helping stop human trafficking. She has 5+ years of experience in analytics, operations research, and machine learning in a variety of industries, including energy, manufacturing, and banking. Christine holds an M.S. in Industrial Engineering from the Polytechnic University of Catalonia in Barcelona. She is an open source advocate and has spoken at many conferences, including PyData, EuroPython, SciPy and PyCon.
![Page 3: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/3.jpg)
Agenda - Open Data Science with R
• Introduction to Open Data Science
• Introduction to Anaconda, the leading Open Data Science platform
• Package and environment management for R
  – conda, R-Essentials and MRO
• Data Science Collaboration in R
  – Jupyter notebooks for R and Anaconda Enterprise Notebooks
• Scaling R
  – Anaconda for cluster management and SparkR
![Page 4: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/4.jpg)
INTRODUCTION TO OPEN DATA SCIENCE
![Page 5: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/5.jpg)
Data Science is …
“An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms” (Wikipedia)
© 2015 Continuum Analytics - Confidential & Proprietary
![Page 6: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/6.jpg)
Open Data Science is …
an inclusive movement that makes the open source tools of data science (data, analytics, and computation) work together easily as a connected ecosystem
![Page 7: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/7.jpg)
Open Source ecosystems for Data Science
• NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython
• dplyr, shiny, tidyr, ggplot
• Spark
![Page 8: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/8.jpg)
INTRODUCTION TO ANACONDA
![Page 9: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/9.jpg)
Anaconda is…
the leading Open Data Science platform, powered by Python, the fastest growing Open Data Science language
• Accelerate Time-to-Value
• Connect Data, Analytics, & Compute
• Empower Data Science Teams
![Page 10: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/10.jpg)
Why Anaconda?
• Easy to install on all platforms
• Trusted by industry leaders: e.g. Microsoft Azure ML
• Large user base: 3M+ downloads
• BSD license
• Extensible - easily build, share and install proprietary libraries with Anaconda Cloud
• Language agnostic - Python, R, Scala…
• Allows isolated custom sandboxes with different versions of packages
![Page 11: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/11.jpg)
Anaconda Glossary
[Diagram: Anaconda = conda + Python + NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages]
• Anaconda distribution: Python distribution that includes 150+ packages for data science
• conda: Cross-platform and language agnostic package and environment manager
• Miniconda: Lightweight version of Anaconda, with just Python and conda.
• Anaconda Cloud: Cloud service to host and share public and private packages, environments and notebooks
• conda environments: custom isolated sandboxes to easily reproduce and share data science projects
![Page 12: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/12.jpg)
PACKAGE AND ENVIRONMENT MANAGEMENT FOR R
![Page 13: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/13.jpg)
An R Reproducibility Problem
From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft
![Page 14: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/14.jpg)
Reproducibility
• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or Access to data
• Configuration of Services: DBs, keys…
• Your Analysis - Script, Notebook
![Page 15: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/15.jpg)
Reproducibility solutions
• Bare metal
• Virtual Machines
• Docker containers
• Conda environments
[Diagram: your laptop, server, or EC2 instance hosting Env 1, Env 2 and Env 3, each holding its own analysis or application (Analysis 1, Analysis 2, Analysis 3)]
![Page 16: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/16.jpg)
Conda Environments
• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or Access to data
• Configuration of Services: DBs, keys…
• Your Analysis - Script, Notebook
![Page 17: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/17.jpg)
Conda Environments
A lightweight isolated sandbox to manage your dependencies and allow reproducibility of your project
environment.yml
$ conda env create
$ source activate ENV_NAME
![Page 18: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/18.jpg)
Anaconda Cloud
Where packages, notebooks, and environments are shared. Powerful collaboration and package management for open source and private projects.
Public projects and notebooks are always free.
REGISTER TODAY! ANACONDA.ORG
![Page 19: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/19.jpg)
Anaconda for R
https://www.continuum.io/blog/developer/jupyter-and-conda-r
• R-Essentials: A conda metapackage with 80+ R packages for data science
$ conda config --add channels r
$ conda install r-essentials
• MRO: Microsoft R Open distribution with MKL
$ conda config --add channels mro
$ conda install r
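The R-Essentials commands on this slide install into whichever environment is currently active. A minimal sketch of keeping R in a dedicated environment instead might look like the following. The environment name `r-analysis` is a hypothetical choice, not something the slides prescribe, and the conda step is skipped when conda is not installed.

```shell
# Hypothetical environment name, chosen for illustration only.
ENV_NAME="r-analysis"

# Describe the environment in a file so teammates can recreate it.
cat > environment.yml <<EOF
name: ${ENV_NAME}
channels:
  - r
dependencies:
  - r-essentials
EOF

# Create the environment only when conda is available on this machine.
if command -v conda >/dev/null 2>&1; then
  conda env create -f environment.yml
  echo "Activate it with: source activate ${ENV_NAME}"
else
  echo "conda not found; wrote environment.yml without creating the environment"
fi
```

Keeping the definition in a file rather than typing `conda install` by hand means the same environment can be rebuilt on another machine.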
![Page 20: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/20.jpg)
Conda
• Package and environment manager
• Language agnostic (Python, R, Java…)
• Cross-platform (Windows, OS X, Linux)
$ conda install python=2.7
$ conda install pandas
$ conda install -c r r
$ conda install mongodb
![Page 21: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/21.jpg)
Conda environments flow example
environment.yml
name: myenv
channels:
  - chdoig
  - r
  - foo
dependencies:
  - python=2.7
  - r
  - r-ldavis
  - pandas
  - mongodb
  - spark=1.5
  - pip
  - pip:
    - flask-migrate
    - bar==1.4
Create and activate
$ conda env create
$ source activate myenv
Freeze versions
$ conda env export -n myenv > freeze.yml
Upload to anaconda.org
$ conda server upload my_foo_env.yml
$ conda env create chdoig/my_foo_env.yml
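Freezing matters because entries without a version, such as `r` or `pandas`, resolve to whatever is current when the environment is created. As a rough sketch, the check below lists the conda dependencies in an environment.yml that carry no version pin. The sample file is a trimmed copy of the slide's example, and the `awk` filter is a simple heuristic written for this illustration: it assumes one dependency per line and does not handle a nested `pip:` section.

```shell
# Trimmed sample based on the slide's environment.yml (no pip section).
cat > environment.yml <<'EOF'
name: myenv
channels:
  - r
dependencies:
  - python=2.7
  - r
  - pandas
  - spark=1.5
EOF

# Collect entries under "dependencies:" that contain no "=" version pin.
unpinned=$(awk '
  /^dependencies:/ { in_deps = 1; next }   # start of the dependency list
  /^[^ -]/         { in_deps = 0 }         # any other top-level key ends it
  in_deps && /^ *- / && !/=/ { sub(/^ *- */, ""); print }
' environment.yml)

echo "Unpinned packages:"
echo "$unpinned"
```

For this sample the check reports `r` and `pandas`, which are the entries that `conda env export` would replace with exact versions.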
![Page 22: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/22.jpg)
FAQ
• R-Essentials has too many / too few / not the packages I want, how can I create my own “R-Essentials”?
$ conda metapackage custom-r-bundle 0.1.0 --dependencies r-irkernel jupyter r-ggplot2 r-dplyr --summary "My custom R bundle"
• I need an R package that is not in R-Essentials or the R channel, but is available through CRAN, how do I get it?
$ conda skeleton cran ldavis
$ conda build r-ldavis/
$ conda server upload r-ldavis
$ conda install -c chdoig r-ldavis
![Page 23: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/23.jpg)
Anaconda: Navigator
A desktop graphical user interface included in Anaconda
• Launch applications and easily manage conda packages, environments and channels.
• No need to use the command line.
• Available for Windows, OS X and Linux.
• Anaconda Navigator has replaced Launcher.
• Integration with Anaconda Cloud.
![Page 24: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/24.jpg)
Anaconda Repository
• Centralized internal repository to share packages, environments and notebooks
• Control user or team access to packages, environments and notebooks
• Blacklist packages in your organization (e.g. GPL licenses)
• Internal mirror of Anaconda
• Build and easily share internally developed software
![Page 25: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/25.jpg)
DATA SCIENCE COLLABORATION WITH R
![Page 26: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/26.jpg)
Data Science Development Environments
• PyCharm
• Spyder
• RStudio
• Eclipse
• Text Editors: Sublime, vim, emacs…
![Page 27: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/27.jpg)
Jupyter
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
http://jupyter.org/
https://try.jupyter.org/
![Page 28: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/28.jpg)
Jupyter
[Diagram: IPython, IPython notebook, Jupyter, nbviewer, tmpnb, binder]
https://try.jupyter.org/
http://mybinder.org/
![Page 29: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/29.jpg)
Jupyter: IRkernel
https://www.continuum.io/blog/developer/jupyter-and-conda-r
$ conda config --add channels r
$ conda install r-essentials
$ jupyter notebook
Trivial to get started writing R notebooks the same way you write Python ones.
![Page 30: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/30.jpg)
Jupyter
To start Jupyter notebooks, simply run the following command:
$ jupyter notebook
http://nbviewer.ipython.org/github/chdoig/conda-jupyter-irkernel/blob/master/Jupyter%20and%20conda%20for%20R.ipynb
![Page 31: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/31.jpg)
Jupyter
![Page 32: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/32.jpg)
Jupyter
![Page 33: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/33.jpg)
Jupyter
$ jupyter nbconvert my_r_notebook.ipynb --to slides --post serve
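Besides slides, nbconvert can also render a notebook to a static HTML page that colleagues can open without Jupyter installed. The sketch below assumes nothing beyond the `jupyter nbconvert` command shown on this slide. The file `example_notebook.ipynb` is a hypothetical one-cell placeholder written only to keep the example self-contained, and the conversion is skipped when Jupyter is not installed.

```shell
# Write a hypothetical one-cell notebook (nbformat 4) as a stand-in.
cat > example_notebook.ipynb <<'EOF'
{
 "cells": [
  {"cell_type": "markdown", "metadata": {}, "source": ["# My R analysis"]}
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 0
}
EOF

# Convert to static HTML when Jupyter is installed; skip otherwise.
if command -v jupyter >/dev/null 2>&1; then
  jupyter nbconvert example_notebook.ipynb --to html
else
  echo "jupyter not found; skipping conversion"
fi
```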
![Page 34: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/34.jpg)
DEMO 1: ENVIRONMENTS & REPOSITORY
![Page 35: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/35.jpg)
Moving your team to collaborate with each other with Anaconda Enterprise Notebooks
[Diagram: several data scientists sharing interactive notebooks, models, and data apps & visualizations]
![Page 36: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/36.jpg)
Anaconda Enterprise Notebooks
• Collaborate with your team on the same project
• Notebooks enterprise extensions: diff, collaborative locking
• Manage collaborators and access to projects
• Search and tag notebooks
![Page 37: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/37.jpg)
DEMO 2: NOTEBOOKS AND AEN
![Page 38: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/38.jpg)
SCALING R
![Page 39: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/39.jpg)
Scalability
Data Scientists want:
• Easy cluster setup and provisioning -> Anaconda for cluster management
• Distributed framework to scale analysis -> SparkR
![Page 40: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/40.jpg)
Anaconda for cluster management
• Dynamically manage conda environments across a cluster
• Works with enterprise Hadoop distributions and HPC clusters
• Integrates with on-premises Anaconda repository
• Cluster management features are available with Anaconda subscriptions
[Diagram: Client Machine, Head Node, and three Compute Nodes]
![Page 41: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/41.jpg)
Anaconda for cluster management
Before Anaconda for cluster management
• Head Node
  1. Manually install Python, packages & dependencies
  2. Manually install R, packages & dependencies
• Compute Nodes
  1. Manually install Python, packages & dependencies
  2. Manually install R, packages & dependencies
After Anaconda for cluster management
• Head Node and Compute Nodes: Easily install conda environments and packages (including Python and R) across cluster nodes
• Empower IT with scalable and supported Anaconda deployments
• Fast, secure and scalable Python & R package management on tens or thousands of nodes
• Backed by an enterprise configuration management system
• Scalable Anaconda deployments tested in enterprise Hadoop and HPC environments
![Page 42: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/42.jpg)
SparkR
• Spark is a distributed framework for large-scale processing
• Provides an R interface through SparkR
![Page 43: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/43.jpg)
DEMO 3: ANACONDA FOR CLUSTER MANAGEMENT AND SPARKR
![Page 46: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/46.jpg)
Enterprise Product Solutions
• Need a centralized repository to publish and share notebooks, environments and packages (OSS and private)? Get Anaconda Repository! (Available in Anaconda Workgroups and Enterprise)
• Need a centralized server to help your data science team interactively collaborate on projects? Get Anaconda Enterprise Notebooks! (Available in Anaconda Enterprise)
• Need a “data scientist friendly” cluster manager? Get Anaconda for cluster management! (Available in Anaconda Workgroups and Enterprise)
![Page 47: Open Data Science with R and Anaconda](https://reader031.vdocuments.us/reader031/viewer/2022021815/586e72e11a28ab99598b5269/html5/thumbnails/47.jpg)
Contact Information and Additional Details
• Download Anaconda: https://www.continuum.io/downloads
• Sign up for Anaconda Cloud: https://anaconda.org
• Contact [email protected] for more information about Anaconda subscriptions, consulting, or training