Building better analytics workflows (Strata-Hadoop World 2013)

Post on 12-Jun-2015

DESCRIPTION

Wes McKinney (twitter.com/wesmckinn, http://datapad.io) talk from Strata 2013 NYC

TRANSCRIPT

Strata-Hadoop World 2013

Building better analytics workflows

www.datapad.io

Wes McKinney

• Former quant @ AQR (a hedge fund)

• Creator of the pandas project for Python

• Author of Python for Data Analysis — O’Reilly

• Founder and CEO of DataPad

@wesmckinn

• > 20k copies since Oct 2012

• Bringing many new people to Python and data analysis with code

Why so many learning to program?

• Increasing data scale

• More and more data munging/integration

• Need for Statistics and Predictive Analytics

• Building complex data visualizations

• Inadequacy of Excel or other UI-driven data tools

The Analytics Workflow

Acquisition → Preparation → Visualization → Analysis → Sharing

What do we care about?

• Minimize time to answer

• Ask more questions

• Reduce friction between tools and processes

• Team productivity

Data Tools for Humans (TM?)

What can go wrong?

• Inefficient workflows lead to lower-quality analysis

• Results may not be actionable in a reasonable time frame

Three types of problems

• Tooling

• Workflow management

• Collaboration

Big Notable Data Trends

Data Preparation: an ongoing problem

For programmers, luckily it’s not 2005 anymore

• R: Hadley Wickham’s packages

• Python: pandas

• Hadoop: Pig

Data preparation with visual tools

• Google OpenRefine

• Google Fusion Tables

• Microsoft Excel

• Data Wrangler

Some new startups building data preparation tools

Business Intelligence: essential for doing business

BI macro-trends

• Self Service BI

• Visual Discovery

• SQL on Hadoop

It’s the heyday for BI startups

Predictive Analytics is getting easier

Some predictive analytics startups

Perils of “data science in a box”

Predictive analytics pitfalls

• Signal vs. Noise

• Identify the right patterns

• Uncertain ROI

Some analytics workflow problems still need work

Friction between tools

Friction between tools: a typical scenario

• Excel and SQL for data wrangling

• Tableau for visualization

• SPSS/R for modeling

Time series analytics

Large scale visualization

Data workflows as dependency graphs?

[Diagram: dependency graph with nodes A, B, C, D, E, F]

Data workflows as dependency graphs?

CHRONOS
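
One way to read the dependency-graph idea: each step of an analytics workflow declares which steps it depends on, and a valid run order is derived from those declarations. The sketch below is a minimal illustration in Python using the standard-library graphlib module (Python 3.9+). The edges between steps A to F are hypothetical, since the slide's diagram is not preserved in this transcript, and the helper name run_order is made up for the example. It is not a depiction of how Chronos works.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a step, each value is the set of
# steps that must finish before it can start. The real edges from the
# slide are unknown; these are assumed for illustration only.
dependencies = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
    "E": {"D"},
    "F": {"D"},
}


def run_order(deps):
    """Group steps into batches; steps within a batch have no
    unmet dependencies and could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unlocking dependents
    return batches


batches = run_order(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

With these assumed edges, changing step B only requires re-running B and the steps downstream of it (D, E, F), which is the practical appeal of describing a workflow as a graph rather than a linear script.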

Iterating on analysis

Versioning and provenance

Leveraging diverse skill sets

• Within teams, different competencies

• Work together on a data project - sharing code, data, tracking changes

The elusive GitHub for Data Analysis?

...Google Docs for Data Analysis?

Make an impact

• Getting results into the hands of people who need them

• Getting models "into production"

Some possible solutions

Build more integrated tool environments

Enhance collaboration

Accessible data science...with training wheels

One more thing

• http://datapad.io

• Founded in 2013, located in SF

• In private beta, join us!

• Hiring for engineering

Q&A time

Thank you!
