Data Scientist Toolbox
Andrei Savu - Axemblr.com
BigData.ro 2013
Me
• Founder of Axemblr.com
• Organizer of Bucharest JUG (bjug.ro)
• Passion for DevOps, Data Analysis
• Connect with me on LinkedIn
@ Axemblr
• Service Deployment Orchestration
• Infrastructure Automation (DevOps)
• Apache Hadoop On-Demand Appliance
• Axemblr Provisionr: https://github.com/axemblr/axemblr-provisionr
(Big)Data in a nutshell
• Business Intelligence / Research Evolved
• Significant change in Decision Making
• Enables new Products & Features
• Enables new Business Models
Data Scientist
• Has a Business / Research oriented perspective
• Knowledge of statistics & software engineering (AI, infrastructure)
• Ability to explore questions and formulate hypotheses to be tested
Data Science Project
• Focused on particular business goals
• Based on a set of important questions
• Result: answers that support business decisions
The Algorithm
• Find *Important* Questions
• Identify & Extract Data
• Store & Sample
• Analyse
• Visualization
• Create Pipelines
• Automate & Deploy
• Learn & Repeat!
Start w/ “Big” Questions... answer them with (Big)Data
How can we understand & improve the conversion rate? How can we increase customer satisfaction?
How can we find important mentions in social media?
Identify Data Sources
OR add more probes / sensors as needed
Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.
Extract Data... to a medium that allows you to run arbitrary queries
Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
Extract
• Database dump tool, replicas or backups
• External web services
• Apache Sqoop (SQL-to-Hadoop)
• Implement pipelines / real-time streams
• Write custom tools as needed
Curate
Unfortunately Data is Messy
Curate - Your Way
• Use or develop tools / scripts
• On large volumes there are no obvious choices
• Custom ways of filtering & aggregating large streams (e.g. twitter, sensors)
• Reuse existing software components for data curation / validation
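A custom curation script can be quite small. The sketch below is only an illustration: it assumes records arrive as dictionaries with hashable values, and the field names `user_id` and `timestamp` are hypothetical. It normalises keys and string values, drops rows with missing required fields, and skips exact duplicates.

```python
def curate(records, required=("user_id", "timestamp")):
    """Yield cleaned records, skipping incomplete rows and exact duplicates.

    The required field names are hypothetical; adapt them to your data.
    """
    seen = set()
    for raw in records:
        # Normalise keys (trim, lower-case) and trim string values.
        row = {
            key.strip().lower(): (value.strip() if isinstance(value, str) else value)
            for key, value in raw.items()
        }
        # Drop rows where a required field is missing or empty.
        if any(not row.get(field) for field in required):
            continue
        # Deduplicate on the full normalised content of the row.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        yield row
```

Because it is a generator, the same function works on a small sample held in a list or on a large stream read line by line.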
DataWrangler
Interactive System for Data Cleaning and Transformation
http://vis.stanford.edu/wrangler/
OpenRefine
Formerly Google Refine
https://github.com/OpenRefine/OpenRefine
Sample (time, etc.)
As needed to support interactive exploration
Why Sample?
• Interactive exploration to create and check assumptions, to create algorithms
• Be careful with “Statistical Significance”
• Sample Smart: By time, By location etc.
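One possible way to sample by time is to keep a fixed number of events from every hour, so quiet periods stay represented next to busy ones. The sketch below uses per-bucket reservoir sampling; the `(timestamp, payload)` event shape, the function name and the synthetic data are assumptions made for the example.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

def sample_by_hour(events, per_hour, seed=0):
    """Keep at most `per_hour` uniformly chosen events from each hour bucket.

    `events` is an iterable of (timestamp, payload) pairs; this shape is an
    assumption made for the sketch.
    """
    rng = random.Random(seed)          # fixed seed keeps the sample reproducible
    reservoirs = defaultdict(list)     # hour bucket -> kept events
    seen = defaultdict(int)            # hour bucket -> events seen so far
    for ts, payload in events:
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        seen[bucket] += 1
        reservoir = reservoirs[bucket]
        if len(reservoir) < per_hour:
            reservoir.append((ts, payload))
        else:
            # Reservoir sampling: the new event replaces a kept one
            # with probability per_hour / seen[bucket].
            j = rng.randrange(seen[bucket])
            if j < per_hour:
                reservoir[j] = (ts, payload)
    return dict(reservoirs)

# Usage: a busy hour (100 events) and a quiet hour (3 events), synthetic data.
start = datetime(2013, 1, 1, 10, 0)
events = [(start + timedelta(seconds=30 * i), i) for i in range(100)]
events += [(start + timedelta(hours=1, minutes=m), 100 + m) for m in range(3)]
sample = sample_by_hour(events, per_hour=5)
```

The busy hour is cut down to five events while the quiet hour keeps all three, and a single pass over the stream is enough.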
Analyse Sample
This is where the fun begins
Analyse Sample
• Create models
• Create algorithms
• Check hypotheses
• Faster feedback loops & Immediate Gratification
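As an illustration of checking a hypothesis on a sample, the sketch below compares two conversion rates with a pooled two-proportion z-test. This is one common choice and not a method prescribed by the talk; the function name and the numbers in the usage note are made up for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates.

    conv_* are conversion counts and n_* are visitor counts for variants A and B.
    Uses the normal approximation, so it is only reasonable for fairly large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data: both rates are 0 or both are 1")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% rate against a 15% rate. A small p-value on a sample is a hint worth re-checking on the full data set, not a final answer.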
Excel-like
Python
RStudio
Gephi.org
Analyse All
Apply your results to the entire data set
How to Analyse All?
• “Easy” on a single machine
• Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.
• Key: Leverage existing tools
• Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
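On a single machine, an analysis can already be written in the map / sort / reduce shape that a custom MapReduce job would use later. The sketch below is a plain-Python illustration, not Hadoop code, and it assumes a hypothetical log format of `<status> <path>` per line.

```python
from itertools import groupby

def map_phase(lines):
    """Emit (status, 1) for every well-formed line; malformed lines are skipped."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield parts[0], 1

def reduce_phase(pairs):
    """Sort the pairs by key (the 'shuffle' step), then sum the counts per key."""
    for key, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)

def count_by_status(lines):
    """Run both phases in-process and return a dict of status -> hits."""
    return dict(reduce_phase(map_phase(lines)))
```

Keeping the two phases as separate functions makes it easier to move the same logic to a distributed tool once the data no longer fits on one machine. The in-memory sort here is the part a distributed framework would replace.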
Visualization
Communicate meaning w/ Graphics
http://selection.datavisualization.ch/
Automate & Deploy
Make it part of your internal dashboard
Learn & Repeat
Answers most of the time generate new questions
Thanks! Questions?
Andrei Savu / [email protected]
@andreisavu