The Missing Manual for Data Science: Remix. Reuse. Reproduce, from Structure:Data 2013

Post on 27-Jan-2015

DESCRIPTION

Presentation from Matt Wood, Amazon Web Services #dataconf More at http://event.gigaom.com/structuredata/

TRANSCRIPT

THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE

SPEAKER: Matt Wood, Principal Data Scientist, Amazon Web Services

Monday, April 1, 13

The Missing Manual: Reproduce, Reuse, Remix

Dr. Matt Wood

matthew@amazon.com

@mza

Hello.

Data.

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Generation challenge.

Linus Bengtsson et al. PLoS Medicine, 2011

Amazing data generators: cell phones tracking cholera in Haiti

You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

Amazing data generators: social networks tracking influenza

500% return on ad spend

Amazing data generators: web app logs targeting advertising

Chromosome 11 : ACTN3 : rs1815739

Chromosome X : rs6625163

Chromosome 19 : FUT2 : rs601338

Chromosome 2 : rs10427255

TYPE II

Chromosome 10 : rs7903146

+0.25

Chromosome 15 : rs2472297

Generation challenge.

Generation challenge. [crossed out]

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Utility computing.

Remove constraints.

Analytics challenge.

Analytics challenge. [crossed out]

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Beautiful, unique.

Impossible to recreate.

Snowflake Data Science

Reproducibility.

Reproducibility scales data science.

Reproduce. Reuse. Remix.

Value++

How do we get from here to there?

5 Principles of Reproducibility

1. Data has Gravity

Increasingly large data collections.

Challenging to obtain and manage.

Expensive to experiment.

Large barrier to reproducibility.

Move data to the users.

Move data to the users. [crossed out]

Move tools to the data.

Place data where it can be consumed by tools.

Place tools where they can access data.

More data, more users, more uses, more locations

Cost

Force multiplier.

Cost and complexity kill reproducibility.

2. Ease of use is a prerequisite

http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

Help overcome the suck threshold.

Easy to embrace and extend.

Choose the right abstraction for the user.

$ ec2-run-instances

$ starcluster start

Package and automate.

Expert-as-a-service.

1000 Genomes Project

Cloud BioLinux

Illumina BaseSpace

1000 Genomes Project + your genomic data

Amazon S3

http://www.youtube.com/watch?v=oGcZ7WVx6EI

Legacy data warehousing

Cassandra, Aegisthus, Hadoop, Hive, Pig

MicroStrategy, Sting, R

3. Reuse is as important as reproduction

Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

Data scientists are hackers.

They have their own way of working.

Beware the Big Red Button.

Fire-and-forget reproduction is a good first step, but limits longer-term value.

Monolithic, one-stop-shop.

Work well for intended purpose.

Challenging to install, dependency heavy.

Difficult to grok.

Data scientists are hackers: embrace it.

Small things. Loosely coupled.

Easier to grok, reuse and integrate.

Lower barrier to entry.
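One way to picture "small things, loosely coupled" is a set of tiny functions that each do one job and can be recombined freely. The sketch below is illustrative only and is not from the talk: the step names (drop_missing, normalise, top) and the pipeline helper are hypothetical, chosen to show how small steps compose.

```python
from functools import reduce
from typing import Callable, List, Optional, Sequence

# Hypothetical step type: each step takes a list and returns a new list.
Step = Callable[[list], list]


def drop_missing(values: Sequence[Optional[float]]) -> List[float]:
    """Remove missing (None) observations."""
    return [v for v in values if v is not None]


def normalise(values: Sequence[float]) -> List[float]:
    """Rescale values to the range 0..1; constant or empty input maps to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def top(n: int) -> Step:
    """Build a step that keeps the n largest values, largest first."""
    def step(values: list) -> list:
        return sorted(values, reverse=True)[:n]
    return step


def pipeline(*steps: Step) -> Step:
    """Chain independent steps left to right into one callable."""
    def run(values: list) -> list:
        return reduce(lambda acc, step: step(acc), steps, values)
    return run


if __name__ == "__main__":
    analyse = pipeline(drop_missing, normalise, top(2))
    print(analyse([2.0, None, 4.0, 6.0]))  # prints [1.0, 0.5]
```

Because each step only agrees on "list in, list out", any one of them can be reused, swapped or tested on its own, which is the property the slides contrast with monolithic tools.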

4. Build for collaboration

Workflows are memes.

Reproduction is just the first step.

Bill of materials: code, data, configuration, infrastructure.

Full definition for reproduction.

Utility computing provides a playground for data science.

Code + AMI + custom datasets + public datasets + databases + compute + result data
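The bill of materials could be written down as a small, serialisable record so that a collaborator receives the full definition in one file. This is a minimal sketch under assumptions: the BillOfMaterials class, its field names and the placeholder values are hypothetical and do not come from the talk or from any AWS API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class BillOfMaterials:
    """Hypothetical record of everything needed to re-run an analysis."""
    code: str                      # e.g. a version-control revision identifier
    machine_image: str             # e.g. an AMI identifier
    datasets: List[str] = field(default_factory=list)       # custom and public data
    configuration: Dict[str, str] = field(default_factory=dict)
    compute: str = ""              # e.g. an instance type and count

    def to_json(self) -> str:
        """Serialise with sorted keys so identical records give identical text."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "BillOfMaterials":
        """Rebuild a record from its JSON form."""
        return cls(**json.loads(text))


if __name__ == "__main__":
    # All values below are placeholders, not real identifiers.
    bom = BillOfMaterials(
        code="example-revision",
        machine_image="example-image-id",
        datasets=["example-public-dataset", "example-custom-dataset"],
        configuration={"threads": "8"},
        compute="example-instance-type x 4",
    )
    print(bom.to_json())
```

Keeping the record as plain JSON means it can be versioned alongside the code it describes.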

5. Provenance is a first class object

Versioning becomes really important.

Especially in an active community.

Doubly so with loosely coupled tools.

Provenance metadata is a first class entity.

Distributed provenance.
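Treating provenance as a first class object might look like attaching a small record to every result: which tool and version produced it, and content hashes of what went in and what came out. The sketch below is an assumption-laden illustration, not the speaker's design; the function names and record fields are hypothetical, and SHA-256 is simply one reasonable choice of content hash.

```python
import hashlib
import json
from typing import Dict, Iterable


def digest(data: bytes) -> str:
    """Content hash used to identify data independently of its location."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(tool: str, version: str,
                      inputs: Iterable[bytes], output: bytes) -> Dict[str, object]:
    """Describe one processing step: tool, version, input hashes, output hash."""
    return {
        "tool": tool,
        "version": version,
        # Sorted so the record does not depend on the order inputs were listed.
        "inputs": sorted(digest(d) for d in inputs),
        "output": digest(output),
    }


def record_id(record: Dict[str, object]) -> str:
    """Stable identifier for a record, derived from its canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return digest(canonical)


if __name__ == "__main__":
    rec = provenance_record("example-aligner", "1.0", [b"reads", b"reference"], b"alignments")
    print(record_id(rec))
```

Because identifiers are derived from content rather than assigned by a central registry, independent groups can compute and compare them without coordination, which is one possible reading of "distributed provenance". A version bump in a loosely coupled tool changes the identifier even when the inputs are unchanged.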

5 Principles of Reproducibility

1. Data has gravity
2. Ease of use is a prerequisite
3. Reuse is as important as reproduction
4. Build for collaboration
5. Provenance is a first class object
