PYTHON FOR DATA SCIENCE
Gabriel Moreira
Machine Learning Engineer
@gspmoreira
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Why so much buzz?
Big Data
WHERE IS DATA SCIENCE BEING USED?
http://www.kdnuggets.com/2014/12/where-analytics-data-mining-data-science-applied.html
RECOMMENDATIONS EVERYWHERE
WHAT IS A DATA SCIENTIST?
http://www.datasciencecentral.com/profiles/blogs/are-you-a-data-scientist
A Data Scientist is someone with deliberate dual personality who can first build a curious business case defined with a telescopic vision and can then dive deep with microscopic lens to sift through DATA to reach the goal while defining and executing all the intermittent tasks.
WHAT IS A DATA SCIENTIST?
Data scientists explore and transform data in novel ways to create and publish new features and combine data from diverse sources to create new value. Data scientists make visualizations with researchers, engineers, web developers, and designers to expose raw, intermediate, and refined data early and often.
Applied researchers solve the heavy problems that data scientists uncover and that stand in the way of delivering value. These problems take intense effort and require novel methods from statistics and machine learning.
[Agile Data Science, O’Reilly, 2014]
Data Science MetroMap Curriculum
http://nirvacana.com/thoughts/becoming-a-data-scientist/
IS DATA SCIENTIST THE NEW WEBMASTER?
[Doing Data Science, O’Reilly, 2014]
[Hilary Mason, Data Scientist]
DATA SCIENCE IS IOSEMN
Inquire, Obtain, Scrub, Explore, Model, iNterpret
What about Python?
PYTHON IS IOSEMN
ANALYSIS CASE: CORPORATE SOCIAL NETWORKS
INQUIRE
1. Which communities are most popular?
2. Is the engagement of users in corporate communities increasing?
3. How are post publishing times distributed over the day?
4. What is the percentage of interactions (likes and comments)?
5. How are likes distributed across users?
6. Is there a relationship between publishing hour and number of interactions?
7. Which communities are most engaging (highest average interactions per post)?
8. What are the most relevant words in the posts?
9. How can posts about similar subjects be grouped?
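As a minimal sketch of how questions 3 and 7 might be answered, the snippet below uses only the Python standard library on a handful of made-up posts. The field names (`community`, `published_at`, `likes`, `comments`) and the sample records are assumptions for illustration, not the schema used in the demo.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample posts; field names and values are illustrative only.
posts = [
    {"community": "python", "published_at": "2014-10-01 09:15", "likes": 4, "comments": 2},
    {"community": "python", "published_at": "2014-10-01 14:40", "likes": 1, "comments": 0},
    {"community": "agile", "published_at": "2014-10-02 09:05", "likes": 0, "comments": 1},
    {"community": "agile", "published_at": "2014-10-02 17:30", "likes": 2, "comments": 1},
]


def posts_per_hour(posts):
    """Question 3: count posts by hour of day (0-23)."""
    hours = (datetime.strptime(p["published_at"], "%Y-%m-%d %H:%M").hour for p in posts)
    return Counter(hours)


def avg_interactions_by_community(posts):
    """Question 7: average likes + comments per post, per community."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for p in posts:
        totals[p["community"]] += p["likes"] + p["comments"]
        counts[p["community"]] += 1
    return {name: totals[name] / counts[name] for name in totals}


hour_counts = posts_per_hour(posts)
engagement = avg_interactions_by_community(posts)

for hour in sorted(hour_counts):
    print(f"{hour:02d}h: {hour_counts[hour]} post(s)")
for name, avg in sorted(engagement.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {avg:.1f} interactions per post")
```

On real data the same grouping logic would more likely be written with a dataframe library, but the counting and averaging steps stay the same.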
OBTAIN
• Download data from another location (e.g., a web page or server)
• Query data from a database (e.g., MySQL or Oracle)
• Extract data from an API (e.g., Twitter, Facebook)
• Extract data from another file (e.g., an HTML file or spreadsheet)
• Generate data yourself (e.g., reading sensors or taking surveys)
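Two of the options above, extracting data from a file and querying a database, can be sketched with the standard library alone. The CSV content, table name, and column names below are made up for illustration, and SQLite stands in for MySQL or Oracle only because it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; in practice this would be a file on disk.
CSV_TEXT = "username,likes\nana,3\nbruno,5\n"


def rows_from_csv(text):
    """Extract data from a file-like source as a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_database(conn):
    """Query data from a database as a list of (username, likes) tuples."""
    query = "SELECT username, likes FROM user_likes ORDER BY username"
    return conn.execute(query).fetchall()


# In-memory SQLite database with an illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_likes (username TEXT, likes INTEGER)")
conn.executemany("INSERT INTO user_likes VALUES (?, ?)", [("ana", 3), ("bruno", 5)])

csv_rows = rows_from_csv(CSV_TEXT)
db_rows = rows_from_database(conn)
conn.close()

print(csv_rows)
print(db_rows)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting, which is part of the Scrub step.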
TWITTER PUBLIC STREAM API
Show me the code!
Data Analysis with IPython Notebook Demo bit.ly/python4ds_nb
INTERPRET
• Drawing conclusions from your data
• Evaluating what your results mean
• Communicating your results
DATA PRODUCTS
“If information has context and the context is interactive, insights are not predictable.”
[Agile Data Science, O’Reilly, 2014]
SENTIMENT ANALYSIS
bit.ly/eleicoes2014debatesbt
Analytical Dashboard
SENTIMENT ANALYSIS
Dashboard Online - JavaScript
NETWORK ANALYSIS
https://linkedjazz.org
What about Python for Big Data?
PYTHON IN HADOOP
• Hadoop Streaming - Allows MapReduce jobs to be written as any executable script, including Python!
Example using AWS Elastic MapReduce: http://workingsweng.com.br/2014/04/clusterizando-raios-com-hadoop-e-k-means-em-map-reduce/
• Other supporting options for Python in Hadoop: Hadoopy, Pig UDFs in Jython
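The Hadoop Streaming model mentioned above can be illustrated with a word count, the usual first MapReduce example. This is a sketch under assumptions: the function names and sample input are made up, and the map, sort, and reduce steps are simulated locally rather than run on a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" record per word."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reduce_records(records):
    """Reducer: sum the counts for each word.

    The input must be sorted by key, which is what Hadoop's shuffle
    phase provides between the map and reduce steps.
    """
    pairs = (record.split("\t") for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort (shuffle) -> reduce on made-up input.
# On a cluster, each function would live in its own script that reads
# sys.stdin and prints to stdout.
sample = ["big data big buzz", "data science"]
result = list(reduce_records(sorted(map_lines(sample))))
print("\n".join(result))
```

Because Streaming exchanges plain text lines over standard input and output, the same two functions can be tested locally like this before being wrapped in the mapper and reducer scripts submitted to the cluster.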
THE NEXT-GEN DATA SCIENTIST
The best minds of my generation are thinking about how to make people click ads... That sucks. [Jeff Hammerbacher]
Next-gen data scientists don’t try to impress with complicated algorithms and models that don’t work. They spend a lot more time trying to get data into shape than anyone cares to admit, maybe up to 90% of their time. Finally, they don’t find religion in tools, methods, or academic departments. They are versatile and interdisciplinary.
[Doing Data Science, O’Reilly, 2014]
DATA SCIENCE COURSES
• Introduction to Data Science (Univ. of Washington)
• Data Science specialization (Johns Hopkins)
• Intro to Hadoop and MapReduce (Cloudera)
• Machine Learning (Stanford)
• Statistical Learning (Stanford)
http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/
BOOKS
The road can be challenging
But it may be fun!