data science day new york: data science: a personal history

Data Science: A Personal History
Jeff Hammerbacher

Data Scientist

Data Applications Scientist

“I have only heard back from one person about that ‘Data Applications Scientist’ thing. I had anticipated more discussion” – me, February 29, 2008

“I guess I’m arguing for ‘Data’ to replace ‘Research’ in those titles (I am happy to drop the ‘Applications’) as the primary focus of our organization is not corporate research.” – me, March 1, 2008

Data Scientist

“I’d like to avoid specialization at this early stage and I expect every member of our group to have a mix of research, engineering, and analysis in their workload.” – me, March 1, 2008

Facebook Data Team

The Facebook Data Team built scalable platforms for the collection, management, and analysis of data.

We used these platforms to drive informed decisions in areas critical to the success of the company and to build data-intensive products and services.

Data Science

Introduction to Data Science

1. Data Preparation
2. Data Presentation
3. Experimentation
4. Observation
5. Data Products

Data Scientist-Computer Symbiosis

Philosophy

• Instrument everything
• Put all of your data in one place
• Data first, questions later
• Store first, structure later
• Keep raw data forever
• Let everyone party on the data
• Produce tools to support the whole research cycle
• Modular and composable infrastructure
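
The "store first, structure later" and "keep raw data forever" items can be illustrated with a minimal sketch. It assumes raw events are kept as append-only JSON lines and that structure is applied only when the data is read. The helper names (`append_raw`, `read_with_schema`) and the event fields are hypothetical and do not come from the talk; an in-memory buffer stands in for a real file system.

```python
import io
import json

def append_raw(log, event):
    """Append one event to the end of the raw log, unchanged, as a JSON line."""
    log.seek(0, io.SEEK_END)
    log.write(json.dumps(event) + "\n")

def read_with_schema(log, fields):
    """Apply a schema at read time by projecting each raw event onto `fields`.

    Fields missing from an event become None; the raw log is never modified.
    """
    log.seek(0)
    rows = []
    for line in log:
        record = json.loads(line)
        rows.append(tuple(record.get(name) for name in fields))
    return rows

# Hypothetical events: the second one lacks the "ms" field, and that is fine,
# because nothing was decided about structure when it was stored.
raw_log = io.StringIO()
append_raw(raw_log, {"user": "a", "action": "click", "ms": 12})
append_raw(raw_log, {"user": "b", "action": "view"})

# Two different questions, asked later, over the same raw data.
clicks_view = read_with_schema(raw_log, ["user", "action"])
latency_view = read_with_schema(raw_log, ["user", "ms"])
```

Because the raw log is only ever appended to, a question that nobody anticipated at collection time can still be answered by defining a new projection over the same events.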

CDH

• Storage
  • Append-only unstructured data
  • Append-only tabular data
  • Mutable tabular data

CDH

• Compute
  • Resource management
  • Parallel frameworks
  • High-level interfaces
  • Libraries

CDH

• Integration
  • File system API
  • Database API
  • Batch data import/export
  • Event data import
  • User interface

Cloudera Products

• Subscription
  • Proprietary software
  • Support
• Training and Certification
• Services

Cloudera Deployment

Cloudera Workloads (Batch)

• Active archive
• Data reservoir
• ETL/ELT offload

Cloudera Workloads (Interactive)

• Application data delivery

Cloudera Customer Survey

• 67% use Hive
• 54% use HBase
• 51% load data every 90 minutes or less
• 71% move data from Hadoop to RDBMS for interactive SQL
• 62% would like to consolidate into single platform

Cloudera Impala

• General-purpose SQL query engine
  • Should work for both analytic and transactional workloads
  • Will support queries that take from microseconds to hours

Cloudera Impala

• Runs directly within Hadoop
  • Reads widely used Hadoop file formats
  • Talks to widely used Hadoop storage managers
  • Runs on same nodes that run Hadoop processes

Cloudera Impala

• High performance
  • C++ instead of Java
  • Runtime code generation
  • Completely new execution engine, not MapReduce

Cloudera Impala

• Validated Beta Partners
  • MicroStrategy
  • QlikView
  • Tableau
  • Pentaho
  • Karmasphere
  • Capgemini

New Cloudera Workloads (Interactive)

• Operational reporting
• Ad hoc query

Cloudera Deployment

The Future

Potential Future Workloads

• Search
• MPI
• Stream processing
• Graph computations
• Linear algebra
• Optimization
• Simulation

The Last Mile

• Data libraries
• Language
• Libraries
• IDE for Data Scientists

• Mixed-initiative
• Memory
• Collaboration
• Model and analysis path selection

Doing Data Science

• More data sources
• More rows
• More columns (novel or derived)
• Better data quality
• Better outcomes
• Better loss functions
• Causal inference in observational studies
• Effect size estimates
• Meta-analysis
• Model lifecycle
