data scientist toolbox

Data Scientist Toolbox Andrei Savu - Axemblr.com BigData.ro 2013

Upload: andrei-savu

Post on 27-Jan-2015


DESCRIPTION

My presentation at http://www.bigdata.ro/ on how to get your job done as a data scientist!

TRANSCRIPT

Page 1: Data Scientist Toolbox

Data Scientist Toolbox

Andrei Savu - Axemblr.com

BigData.ro 2013

Page 2: Data Scientist Toolbox

Me

• Founder of Axemblr.com

• Organizer of Bucharest JUG (bjug.ro)

• Passion for DevOps, Data Analysis

• Connect with me on LinkedIn

Page 3: Data Scientist Toolbox

@ Axemblr

• Service Deployment Orchestration

• Infrastructure Automation (DevOps)

• Apache Hadoop On-Demand Appliance

• Axemblr Provisionr: https://github.com/axemblr/axemblr-provisionr

Page 4: Data Scientist Toolbox

(Big)Data in a nutshell

• Business Intelligence / Research Evolved

• Significant change in Decision Making

• Enables new Products & Features

• Enables new Business Models

Page 5: Data Scientist Toolbox

Data Scientist

• Has a Business / Research oriented perspective

• Knowledge of statistics & software engineering (AI, infrastructure)

• Ability to explore questions and formulate hypotheses to be tested

Page 6: Data Scientist Toolbox

Data Science Project

• Focused on particular business goals

• Based on a set of important questions

• Result > Answers that support business decisions

Page 7: Data Scientist Toolbox

The Algorithm

• Find *Important* Questions

• Identify & Extract Data

• Store & Sample

• Analyse

• Visualization

• Create Pipelines

• Automate & Deploy

• Learn & Repeat!

Page 8: Data Scientist Toolbox

Start w/ “Big” Questions... answer them with (Big)Data

How can we understand & improve the conversion rate?

How can we increase customer satisfaction?

How can we find important mentions in social media?

Page 9: Data Scientist Toolbox

Identify Data Sources

OR add more probes / sensors as needed

Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.

Page 10: Data Scientist Toolbox

Extract Data... to a medium that allows you to run arbitrary queries

Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig

Page 11: Data Scientist Toolbox

Extract

• Database dump tool, replicas or backups

• External web services

• Apache Sqoop (SQL-to-Hadoop)

• Implement pipelines / real-time streams

• Write custom tools as needed
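As an illustration of the "database dump" and "custom tools" bullets above, here is a minimal sketch in Python that copies one table into a CSV file so it can be queried or loaded elsewhere. It assumes a SQLite database purely for convenience; the function name `dump_table_to_csv` and the table name in the usage note are hypothetical and not part of the talk.

```python
import csv
import sqlite3


def dump_table_to_csv(conn, table, out_path):
    """Write every row of `table` to a CSV file with a header row.

    Returns the number of data rows written.
    """
    # Table names cannot be bound as SQL parameters, so accept only
    # simple identifiers before interpolating the name into the query.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")

    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]

    count = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        for row in cursor:  # rows are streamed, not loaded all at once
            writer.writerow(row)
            count += 1
    return count
```

A call such as `dump_table_to_csv(sqlite3.connect("shop.db"), "orders", "orders.csv")` would produce one file per table; for large production databases a replica or a dedicated tool such as Apache Sqoop is usually the safer route.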

Page 12: Data Scientist Toolbox

Curate

Unfortunately Data is Messy

Page 13: Data Scientist Toolbox

Curate - Your Way

• Use or develop tools / scripts

• On large volumes there are no obvious choices

• Custom ways of filtering & aggregating large streams (e.g. twitter, sensors)

• Reuse existing software components for data curation / validation
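A custom filter-and-aggregate step over a stream, as mentioned in the bullets above, could look roughly like the following Python sketch. The record layout (dictionaries with `user` and `text` keys) and both function names are assumptions made for the example.

```python
from collections import Counter


def clean_records(records):
    """Yield only well-formed records, with normalised fields."""
    for record in records:
        # Treat missing or None values as empty strings.
        user = str(record.get("user") or "").strip().lower()
        text = str(record.get("text") or "").strip()
        if not user or not text:
            continue  # drop records that lack a required field
        yield {"user": user, "text": text}


def mentions_per_user(records):
    """Aggregate the cleaned stream into a count of records per user."""
    counts = Counter()
    for record in clean_records(records):
        counts[record["user"]] += 1
    return counts
```

Because `clean_records` is a generator, records are processed one at a time, so the same code works on a list in memory or on an iterator reading from a file or socket.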

Page 14: Data Scientist Toolbox

DataWrangler

Interactive System for Data cleaning and transformation

http://vis.stanford.edu/wrangler/

Page 16: Data Scientist Toolbox

Sample (time, etc.)

As needed to support interactive exploration

Page 17: Data Scientist Toolbox

Why Sample?

• Interactive exploration to create and check assumptions, to create algorithms

• Be careful with “Statistical Significance”

• Sample Smart: By time, By location etc.
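One possible reading of "sample smart by time" is to draw the same number of events from every hour, so that quiet periods are not drowned out by busy ones. The sketch below assumes each event is a dictionary whose `timestamp` value is a `datetime`; the function name and the `per_bucket` parameter are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_hour(events, per_bucket, seed=0):
    """Return at most `per_bucket` randomly chosen events from each hour."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    # Group events by the hour in which they occurred.
    buckets = defaultdict(list)
    for event in events:
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(event)

    # Draw from each hour independently, in chronological order.
    sample = []
    for hour in sorted(buckets):
        group = buckets[hour]
        sample.extend(rng.sample(group, min(per_bucket, len(group))))
    return sample
```

The same pattern applies to sampling by location: replace the hour key with a region or city field. Note that equal-sized buckets change the proportions of the data, which matters when interpreting statistical significance.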

Page 18: Data Scientist Toolbox

Analyse Sample

This is where the fun begins

Page 19: Data Scientist Toolbox

Analyse Sample

• Create models

• Create algorithms

• Check hypotheses

• Faster feedback loops & Immediate Gratification
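To make "check hypotheses" concrete, here is a small sketch of a permutation test in Python, one of several ways to ask whether two samples (for example, conversions under two page designs) differ by more than chance would explain. The technique is an illustrative choice, not something the talk prescribes, and the function name and parameters are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, rounds=10_000, seed=0):
    """Estimate how often a random relabelling of the pooled values gives
    a difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # random relabelling of the two groups
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    return extreme / rounds
```

With values coded as 1 (converted) and 0 (did not convert), a small result suggests the observed gap is unlikely under random labelling. It runs in seconds on a sample, which is what makes the fast feedback loop possible.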

Page 20: Data Scientist Toolbox

Excel-like

Page 21: Data Scientist Toolbox

Python

Page 22: Data Scientist Toolbox

RStudio

Page 23: Data Scientist Toolbox

Gephi.org

Page 24: Data Scientist Toolbox

Analyse All

Apply your results to the entire data set

Page 25: Data Scientist Toolbox

How to Analyse All?

• “Easy” on a single machine

• Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.

• Key: Leverage existing tools

• Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
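The "Custom MR" idea can be tried on a single machine before going distributed. The sketch below shows only the map and reduce shape in plain Python; it does not use Hadoop, and the log format (a path followed by a status code) and all function names are assumptions for the example.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count status codes in one chunk of log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip malformed lines
            counts[fields[1]] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def count_statuses(chunks):
    """Run the map step on every chunk and fold the partial results."""
    return reduce(reduce_counts, map(map_chunk, chunks), Counter())
```

Because each chunk is processed independently and the merge is associative, the same two functions could later be handed to a process pool or ported to a distributed framework without changing the logic.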

Page 26: Data Scientist Toolbox

Visualization

Communicate meaning w/ Graphics

Page 28: Data Scientist Toolbox

Automate & Deploy

Make it part of your internal dashboard

Page 29: Data Scientist Toolbox

Learn & Repeat

Answers, most of the time, generate new questions

Page 30: Data Scientist Toolbox

Thanks! Questions?

Andrei Savu / [email protected]

@andreisavu