2018 Anaconda State of Data Science Report
In April this year, Anaconda Inc. launched its first survey of the Anaconda community. We wanted to get a better understanding of what users do with Anaconda, what they think about it, and the data sources, visualization, and scale-out approaches they use. The survey ran from March 22-April 30, resulting in 4,218 responses with a 100% survey completion rate.
We at Anaconda are very grateful to everyone who responded, especially because so many took the time to provide detailed comments and feedback.
Executive Summary
The Anaconda State of Data Science is strong. With 2 to 2.5 million downloads per month during January-March 2018, Anaconda is easily the most popular Python distribution, with a growing R following. Below are key conclusions from the survey:
• The future is bright for Python and R. Students and academics are strong users of Anaconda, comprising 41% of the respondents. Tomorrow’s data scientists and machine learning experts are learning and using Python and R today.
• There is a Data Scientist/Software Developer crossover. Both Data Scientists and Software Developers write Python or R code, and the two job roles are not mutually exclusive: Data Scientists who write Python and R libraries can also be Software Developers. But we did not expect almost as many Software Developer users as Data Scientists (15% vs. 16%).
• Machine learning is a key application for Anaconda users. 14% of respondents are doing machine learning.
• Cloud-native data science and use of cloud services continue to rise, at the expense of traditional Hadoop-centric “big data” infrastructure. Responses concerning data sources and scale-out technologies indicate strong uptake of APIs, cloud data services, and container-based approaches to data science at the expense of traditional Hadoop deployments.
• Matplotlib continues to enjoy its first-mover advantage in visualization, sweeping the category. But it is still a highly crowded space with many strong competitors, both open source and commercial. Plotly, Tableau, Microsoft Power BI, and Tibco Spotfire are all strong commercial competitors to Matplotlib and other open source projects like ggplot, Bokeh, D3, and Altair.
• It matters a lot that Anaconda is free… but not so much that it is open source. Free was ranked the most important attribute, while the open source licensing was second to last.
Demographics
We wanted to understand the occupations of Anaconda users, so this was a single choice question where the results sum to 100%. The leading category is Student (26%), which is understandable given that Anaconda is very popular in the teaching realm due to its ease of use, as well as its ability to ensure that every student gets exactly the same Python and R environment and can reproduce the instructor’s results. It also suggests that the growth of Anaconda Python/R users will continue to accelerate as these students graduate and put their expertise to work.
Outside of students, Data Scientists (16%), Academics (15%), and Software Developers (15%) form the majority of users. Because Anaconda was born out of frustration with the difficulty of doing reproducible Python data science, we expected that Data Scientists and Academics would be popular user occupations, but we did not expect almost as many Software Developers as Data Scientists. There is an overlap of needs given both groups write code, but Software Developers also have some distinct priorities, and our product team has been reading all the comments carefully to ensure we understand those and can serve them better.
The most popular response in the “Other” category was Scientist (including specializations like Geoscientist, Chemist, etc.) followed by Engineer.
Python versus R
99% of respondents use Anaconda for Python, as we might expect for a project that came out of the Python community. However, 1% of respondents use Anaconda for R only, with 14% using both R and Python. We have recently boosted Anaconda’s R capabilities and made Microsoft MRO the default package set in the distribution, and we will continue to expand our R support.
Occupation of respondents (single choice):
Occupation                                      Share     Responses
Student (attending school full or part time)    25.53%    1,077
Data Scientist                                  16.38%    691
Academic (e.g. Researcher, Professor)           15.48%    653
Software Developer / Engineer / Programmer      14.65%    618
Analyst (e.g. Business Analyst)                 8.39%     354
Hobbyist                                        8.25%     348
Other (Please Specify)                          3.70%     156
IT (any role within an IT org)                  3.58%     151
Consultant                                      2.84%     120
Trainer or Educator                             1.19%     50
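The occupation shares can be cross-checked against the response counts. The sketch below is only an illustration: it copies the counts reported in this section, and the variable and function names (`occupation_counts`, `share`) are invented for the example rather than taken from the survey tooling.

```python
# Respondent counts per occupation, copied from the figures reported in this section.
occupation_counts = {
    "Student": 1077,
    "Data Scientist": 691,
    "Academic": 653,
    "Software Developer / Engineer / Programmer": 618,
    "Analyst": 354,
    "Hobbyist": 348,
    "Other": 156,
    "IT": 151,
    "Consultant": 120,
    "Trainer or Educator": 50,
}

# Single-choice question, so the counts should add up to the number of responses.
total_responses = sum(occupation_counts.values())


def share(count, total):
    """Percentage of respondents, rounded to two decimals as in the report."""
    return round(100 * count / total, 2)


shares = {name: share(n, total_responses) for name, n in occupation_counts.items()}

for name, pct in shares.items():
    print(f"{name:<45}{pct:6.2f}%")

# Students plus academics: the group the executive summary puts at roughly 41%.
students_and_academics = shares["Student"] + shares["Academic"]
print(f"Students and academics: {students_and_academics:.2f}%")
```

Because this was a single-choice question, the counts add up to the 4,218 total responses; the multiple-choice questions later in the report do not sum to 100%.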
How Anaconda is Being Used
We also asked respondents to provide a brief summary of what they are doing with Anaconda, as a free text field. We found that Anaconda usage truly spans a wide variety of fields in academia, industry, and government. Use cases include financial asset allocation, formulating cancer public policy, 4D oilfield seismic data processing, medical image machine learning, and molecular dynamic simulations.
Using word matching, we tagged responses with broader categories. For example, for machine learning we matched words and phrases like “ML,” “machine learning,” and “Neural network,” as well as the names of popular ML libraries such as TensorFlow and scikit-learn. While not an exact science, it does give an indication of the relative popularity of use cases. Responses could be given more than one category tag.
Perhaps unsurprisingly, data and numerical analysis (understanding data and what it tells us) was the largest category of usage at 20%. Python and R package management came in just behind at 18%, followed by machine learning at 14%. We did not try to differentiate between different forms of machine learning, given this was a free text field and many responses didn’t provide that level of detail.
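The word-matching approach described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the survey: the machine learning keywords are the ones named in the report, while the "package management" keywords and the names `CATEGORY_KEYWORDS`, `compile_patterns`, and `tag_response` are assumptions made for the example.

```python
import re

# Keyword lists per category. Only the machine learning terms come from the report;
# the package management terms are hypothetical placeholders.
CATEGORY_KEYWORDS = {
    "machine learning": ["ml", "machine learning", "neural network",
                         "tensorflow", "scikit-learn"],
    "package management": ["package management", "conda install", "environments"],
}


def compile_patterns(category_keywords):
    """Build one case-insensitive pattern per category."""
    patterns = {}
    for category, keywords in category_keywords.items():
        alternatives = "|".join(re.escape(k) for k in keywords)
        # Lookarounds require a non-word character (or the string edge) on both
        # sides, so "ml" does not match inside a word such as "html".
        patterns[category] = re.compile(
            rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE
        )
    return patterns


def tag_response(text, patterns):
    """Return every category whose keywords appear in a free-text response."""
    return [category for category, pattern in patterns.items()
            if pattern.search(text)]


patterns = compile_patterns(CATEGORY_KEYWORDS)
print(tag_response("Training a neural network with TensorFlow", patterns))
print(tag_response("ML models, plus conda install for package management", patterns))
```

A response can receive several tags, which matches the note that responses could be given more than one category. Simple matching like this misses synonyms and misspellings, which is one reason the results are indicative rather than exact.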
What’s Important To Anaconda Users
For this question, we asked respondents to stack-rank seven qualities of Anaconda from first to last in order to understand what was most important to them. The displayed order was randomized for each respondent to help minimize the impact of cognitive bias.
One of the challenges faced by any open source project is how to fund its development, and there are certainly plenty of opinions about the best way to do that within the open source community. Anaconda has always been free, and the survey responses validate that approach: being free is the most important characteristic of the product, scoring 5.16. Zero cost is much more important than its open source licensing, which scores 3.26.
Close behind Free is the ease of installing and managing Python and R packages (4.82). Behind that is the fact that Anaconda Inc. provides a trusted source of packages to the community through our free open repository (4.16). Ease of reproducing data science across systems and the use of conda environments are almost a tie, at 3.86 and 3.70 respectively.
The Navigator GUI comes last at 3.04. In reading the comments, we discovered two reasons for this: first, many users prefer the command line to a GUI, especially when automating tasks; and second, there is room for improvement in Anaconda Navigator, and the product team has already begun planning improvements there.
Importance of Anaconda qualities (ranking score):
Quality                                    Score
Free                                       5.16
Easy to install and manage packages        4.82
Trusted source of packages                 4.16
Easily reproduce same results elsewhere    3.86
conda environments                         3.70
BSD open source license                    3.26
Navigator GUI                              3.04
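The report does not say how the ranking scores were calculated. One common scheme for stack-rank questions, assumed here purely for illustration, gives an item n points for first place down to 1 point for last place among n items and then averages across respondents. The seven reported scores sum to 28.00, which equals 7+6+5+4+3+2+1 and is consistent with that scheme, though it does not prove it. The function name `rank_scores` and the toy rankings below are made up.

```python
from collections import defaultdict


def rank_scores(rankings):
    """Average weighted score per item.

    Each ranking is a list ordered from first choice to last. With n items,
    first place earns n points and last place earns 1 point (assumed scheme).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            totals[item] += n - position
            counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}


# Made-up rankings from three imaginary respondents (not survey data).
toy_rankings = [
    ["Free", "Easy install", "Navigator GUI"],
    ["Easy install", "Free", "Navigator GUI"],
    ["Free", "Navigator GUI", "Easy install"],
]
scores = rank_scores(toy_rankings)
for item, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item:<15}{score:.2f}")

# Consistency check on the scores reported in this section.
reported = [5.16, 4.82, 4.16, 3.86, 3.70, 3.26, 3.04]
print(round(sum(reported), 2), sum(range(1, 8)))
```

Under this scheme the average of all item scores is always (n+1)/2, so for seven items a score above 4.0 means an item was ranked in the top half more often than not.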
Data Sources
Files reign supreme when it comes to data sources, with 89% of respondents using them for data access, followed by classical SQL databases at 49%. In third place are REST APIs at 25%, demonstrating that getting data from other applications and websites is a key part of modern data science.
In fourth place, Google Cloud’s data services (16.95%) just edge out traditional big data stores like HDFS / Hadoop and Spark (16.88%) and Amazon Web Services’ data offerings (16.12%). AWS is the dominant player in the IaaS marketplace and pioneered modern large scale object storage that is a fraction of the cost of big data technologies like HDFS, as well as fast large-scale query-oriented databases like Redshift. The strength of Google Cloud’s showing is impressive given that AWS has 10x the IaaS revenue of Google (per Gartner: www.gartner.com/newsroom/id/3808563). Google Cloud has a strong data services play and has been focused on leveraging that into growth of customer base, and this result indicates that it’s paying off with the Anaconda community.
Data sources used (multiple responses allowed):
Data source                                                        Share     Responses
CSV or other files                                                 89.09%    3,758
SQL database (e.g. Oracle, SQL Server, MySQL, MariaDB, Postgres)   49.19%    2,075
REST API from another app (e.g. Twitter API)                       24.59%    1,037
Google Cloud (e.g. Cloud Storage, BigQuery, BigTable)              16.95%    715
HDFS / Hadoop / Spark                                              16.88%    712
AWS (e.g. S3, Redshift, Dynamo)                                    16.12%    680
NoSQL database (e.g. MongoDB, CouchDB, Cassandra)                  13.61%    574
Distributed SQL engines (e.g. Hive, Impala, Presto)                8.80%     371
Azure (e.g. Blob Storage, Cloud Database)                          7.14%     301
Other (please specify)                                             6.92%     292
Hadoop-style big data performed relatively weakly versus the other options given this is a data-centric audience. Hadoop has dominated on-premises (non-cloud) data infrastructure for the past 10 years and spawned two tech IPOs (Hortonworks and Cloudera). What was “big data” in 2005 when Hadoop began now easily fits into a single server’s memory, and there is a plethora of alternatives to building a Hadoop data lake.
NoSQL databases came in at 14%, right behind the cloud services, demonstrating their value for storing and processing semi-structured data. Microsoft Azure usage came in at 7.14%.
Scaling out data science and machine learning
We asked respondents what technologies they use for scaling out their data science. The majority (52%) don’t use scale-out technologies, and the next closest is deployment to a Linux server at 34%.
Docker makes a strong showing at 19%, beating out Hadoop/Spark with 15%, followed by Kubernetes at 5.8%. This result suggests modern cloud-native style architectures like Docker and Kubernetes are in the ascendancy, at the expense of traditional Hadoop “big data” and Apache Mesos (0.85%).
Dask, an open source technology for parallelizing single host algorithms and machine learning across multiple CPU cores or multiple servers, came in at 3.1% of responses.
The “Other” category at 1.5% included a variety of supercomputers and various AWS, Google Cloud, and Microsoft Azure services.
Scale-out technologies used (multiple responses allowed):
Technology                Share     Responses
None                      52.18%    2,201
Linux Servers             33.90%    1,430
Docker                    18.94%    799
Hadoop/Spark              15.48%    653
Kubernetes                5.76%     243
Dask                      3.06%     129
Other (Please Specify)    1.54%     65
Mesos                     0.85%     36
Visualization
For this question we used alphabetical ordering for the responses in order to make it easier for respondents to find and select the visualization tools they use.
Matplotlib in all its guises absolutely crushed this category, with 75% using it directly, 47% using it via Pandas Plotting, and 27% via Seaborn (multiple responses were allowed).
Plotly and ggplot (for R) came in almost neck-and-neck at 24.4% and 24.3% respectively. The usage of ggplot far exceeds the percentage of R users represented in the survey, indicating that Python users also like to use it. Tableau came in next at 20%, followed by Bokeh at 14% and D3 at 10%.
The “Other” category was substantial for this question. The most common write-in answers, from most to least popular, were:
1. Microsoft Power BI
2. Tibco Spotfire
3. Microsoft Excel
4. Qlik
5. Altair
Visualization tools used (multiple responses allowed):
Tool                      Share     Responses
Matplotlib                74.82%    3,156
Pandas plotting           46.80%    1,974
Seaborn                   27.17%    1,146
Plotly                    24.37%    1,028
ggplot (R)                24.30%    1,025
Tableau                   20.29%    856
Bokeh                     13.92%    587
d3                        10.10%    426
Other (please specify)    9.39%     396
Holoviews                 1.92%     81
Where users get help
Perhaps unsurprisingly, Anaconda users would rather hit Google Search (4.89) or Stack Overflow (4.42) to find an answer than read the docs (4.05)! With this question we randomized the order of responses for each respondent to help minimize the effects of cognitive bias. This result, along with comments from respondents, indicates room for improvement in the Anaconda documentation, which the team will be working on.
What would Anaconda users do differently?
We deliberately left this as a final, open-ended question so we could receive candid answers on anything we can do to improve. One common theme was better interoperability between the pip installer and conda, especially when pip-installing packages into conda environments. This is something that the conda team is actively working on, for release later this year. Better documentation was another theme.
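One way to see the pip and conda interplay in practice is to check whether the interpreter that would run pip belongs to the currently activated conda environment. The sketch below relies on the `CONDA_PREFIX` environment variable, which conda sets when an environment is activated. The function name `pip_targets_conda_env` is hypothetical, and this is only a user-side sanity check, not the interoperability work the conda team describes.

```python
import os
import sys


def pip_targets_conda_env(python_prefix, conda_prefix):
    """Return True when the given Python prefix is the active conda environment.

    python_prefix: installation prefix of the interpreter that would run pip.
    conda_prefix:  value of CONDA_PREFIX, or None if no environment is active.
    """
    if not conda_prefix:
        return False
    return os.path.realpath(python_prefix) == os.path.realpath(conda_prefix)


active_env = os.environ.get("CONDA_PREFIX")
if pip_targets_conda_env(sys.prefix, active_env):
    print(f"pip run by this interpreter installs into {active_env}")
else:
    print("This interpreter is not the active conda environment's Python")

# Invoking pip through the interpreter ties the install to that interpreter's
# environment, instead of whichever pip executable happens to be first on PATH.
print("Suggested command:", sys.executable, "-m pip install <package>")
```

Running pip as `python -m pip` is a general Python practice rather than anything specific to Anaconda, but it avoids the common mistake of installing into a different environment than the one that is active.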
Improved interoperability with Docker and container-building in general was also popular, with users looking to build smaller containers more quickly using Miniconda and conda environments.
Finally, it warmed our hearts that the word “Love” was one of the most frequent words encountered in these responses, as in “Love what you do already” and “Nothing, I love it!”
We’d like to thank all the respondents for taking the time to complete our survey. Let’s do it again next year!
About Anaconda, Inc.
With 6 million users, Anaconda is the world’s most popular Python data science platform. Anaconda, Inc. continues to lead open source projects like Anaconda, NumPy, and SciPy that form the foundation of modern data science. Anaconda’s flagship product, Anaconda Enterprise, allows organizations to secure, govern, scale, and extend Anaconda to deliver actionable insights that drive businesses and industries forward.
[Chart: Where users get help. Categories: Google for an answer; StackOverflow; Read the documentation; GitHub; Anaconda email list/Google Group; Social media (e.g. Twitter)]