Transcript
Page 1: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Revolution Confidential

The Rise of Data Science in the Age of Big Data Analytics

Why Data Distillation and Machine Learning Aren’t Enough

David M Smith

VP Marketing and Community

Revolution Analytics

Page 2: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Today, we’ll discuss:

• What is Data Science?
• Why machine learning isn’t enough
• Why Data Science works
• The Data Scientist’s Toolkit
• The Future of Big Data Analytics
• Closing thoughts and resources

Page 3: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

© Dov Harrington, CC BY-2.0

http://www.flickr.com/photos/idovermani/4110546683/

Page 4: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Where is it safe to fish near San Francisco?

San Francisco Estuary Institute

http://www.sfei.org/tools/wqt

Page 5: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Hurricane Sandy

Bob Rudis

http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/

Page 6: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Hurricane Sandy

Ed Chen

http://blog.echen.me/hurricane-sandy-outages/

Page 7: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

When did Michael Jackson have his biggest hits?

New York Times, June 25 2009 (3 hours after Michael Jackson’s death)

http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html

Page 8: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Three Essential Skills of Data Scientists

Drew Conway

http://www.dataists.com/2010/09/the-data-science-venn-diagram/

[Diagram labels: Data Integration, Mashups, Applications, Models, Visualization, Predictions, Uncertainty, Problems, Data Sources, Credibility, Effective Data Applications]

Page 9: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Image © Abode of Chaos, CC BY 2.0

http://www.flickr.com/photos/home_of_chaos/6418989233/

Page 10: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Machine learning (ML) for predictions

[Diagram, three panels]

Building the Model: Features and Responses; ML; scoring rules

Validating the Model: Validation set; scoring rules; Predictions; Response; “Accuracy”

Scoring new data: New Data; scoring rules; Predictions (scores)
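The three stages on this slide (building, validating, scoring) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data, the single-feature threshold used as the "scoring rule", and the function names `build_model`, `score`, and `accuracy` do not come from the presentation.

```python
# Hypothetical toy example: a one-feature threshold stands in for the "scoring rule".

def build_model(train):
    """Building the model: choose the threshold that best separates the responses."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in train}):
        correct = sum((x >= threshold) == bool(y) for x, y in train)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def score(threshold, features):
    """Scoring: apply the rule to features and return predictions."""
    return [1 if x >= threshold else 0 for x in features]

def accuracy(threshold, validation):
    """Validating: compare predictions with the known responses."""
    predictions = score(threshold, [x for x, _ in validation])
    hits = sum(p == y for p, (_, y) in zip(predictions, validation))
    return hits / len(validation)

# (feature, response) pairs, invented for this sketch
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
validation = [(2.5, 0), (6.5, 1), (4.0, 0), (7.5, 1)]
new_data = [0.5, 9.0, 5.0]

rule = build_model(train)                        # building the model
validation_accuracy = accuracy(rule, validation) # validating the model
new_scores = score(rule, new_data)               # scoring new data
```

A single accuracy number on a validation set is all this workflow reports, which is the limitation the following slides go on to discuss.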

Page 11: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Problem: A lack of perspective

Image © 2010 David M Smith. Some rights reserved CC BY-2.0

Page 12: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Problem: Lack of credibility

Page 13: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Problem: Complexity

Page 14: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Data Science to the Rescue!

Page 15: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Answer Unasked Questions

Revolutions blog: “The Uncanny Valley of Big Data”

http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html

Page 16: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Fill in knowledge gaps

“More data beats better algorithms, every time” – Google

“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly

Google Research, “The Unreasonable Effectiveness of Data”: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html

Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd

TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html

Page 17: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Avoid ineffective reactions

Stupid Data Miner Tricks

http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf

[Chart label: S&P 500]

Page 18: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

© Henricks Photos CC-BY-ND 2.0

http://www.flickr.com/photos/hendricksphotos/3240667626/

Page 19: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

0. Data (Big & Messy)

Page 20: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

1. A language for programming with data

Download the White Paper

R is Hot

bit.ly/r-is-hot

Page 21: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Grant awards to homeless veterans FY09

Data: Data.gov

Analysis: Drew Conway

User-defined functions

Internet API interface

XML parsing

Custom graphics

Data import and pre-processing

Iterative data processing

Page 22: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

2. Speed. Lots and lots of speed.

Variable Transformation

Model Estimation

Model Refinement

Model Comparison / Benchmarking

Feature Selection

Sampling

Aggregation

Data Predictions

Page 23: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Use all available computing cycles

[Diagram: Multicore Processor (4, 8, 16+ cores) with Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), Core n (Thread n); Data, Data, Data; Disk; Shared Memory]

Page 24: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

3. Algorithms that don’t choke on Big Data

PEMAs: Parallel External-Memory Algorithms

[Diagram: BIG DATA split into four Data Partitions, each on a Compute Node, coordinated by a Master Node]
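As a rough illustration of the parallel external-memory idea, the sketch below computes a mean and variance by processing one block of data at a time and combining small per-block summaries. It is an assumption-laden toy, not the Revolution R Enterprise implementation: the in-memory list stands in for data on disk, and `read_chunks`, `chunk_stats`, and `combine` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def read_chunks(values, chunk_size):
    """Stand-in for reading a large file one block at a time (assumption)."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

def chunk_stats(chunk):
    """Per-chunk summary: count, sum, and sum of squares."""
    return len(chunk), sum(chunk), sum(v * v for v in chunk)

def combine(stats):
    """Merge the per-chunk summaries into a global mean and population variance."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    variance = total_sq / n - mean * mean
    return n, mean, variance

# Toy data: the values 1..100, processed in four blocks of 25
data = [float(v) for v in range(1, 101)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(chunk_stats, read_chunks(data, 25)))
n, mean, variance = combine(partial)
```

Only the three numbers per chunk ever need to be held together, so the full data set never has to fit in memory at once. The sum-of-squares formula is chosen for brevity; it can lose precision on large or poorly scaled data, where a more careful merging formula would be preferable.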

Page 25: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Drink less coffee!

Single-Threaded, Non-optimized algorithms

Optimized, Parallelized Algorithms

Page 26: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

4. Move code to data (not vice versa)

Map-Reduce

RHadoop: http://bit.ly/RHadoop
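The Map-Reduce pattern named on this slide can be sketched in a few lines. This is a single-process toy for illustration only: it does not use Hadoop or the RHadoop packages, and the word-count task, the sample partitions, and the function names are all assumptions.

```python
from collections import defaultdict

def map_phase(partition):
    """Map: run on each data partition where it lives, emitting (key, value) pairs."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Two hypothetical partitions, as if stored on two different nodes
partitions = [["big data", "data science"], ["science of data"]]
mapped = [pair for part in partitions for pair in map_phase(part)]
counts = reduce_phase(shuffle(mapped))
```

In a real cluster the map function is shipped to the nodes holding each partition, so only the small (key, value) pairs travel over the network rather than the raw data.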

Page 27: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Big Data Appliances

More info: http://bit.ly/R-Netezza

Page 28: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Play Nice with Others

Presentation Layer

• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets

Analytics Layer

• R

Data Layer

• Relational datastores
• Unstructured datastores

Page 29: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

What every data scientist needs

Open-Source R

Revolution R Enterprise

Interface with multiple data sources

Exploratory data analysis

Wide range of statistical methods

High-speed computation

Big Data support

Data/code locality (Hadoop, etc.)

Print-quality data visualization

Scheduled batch production

Works in a multi-tool ecosystem

Integration into Data Apps


Page 30: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Revolution R Enterprise: Big-Data R

Open-Source R

Revolution R Enterprise

Interface with multiple data sources

Exploratory data analysis

Wide range of statistical methods

High-speed computation

Big Data support

Data/code locality (Hadoop, etc.)

Print-quality data visualization

Scheduled batch production

Works in a multi-tool ecosystem

Integration into Data Apps

www.revolutionanalytics.com/products

Page 31: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Image © www.tinyplanetphotography.com

Page 32: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

And … the future?

Even more data

Cloud computing

Demand for Data Scientists

Diverging paradigms for data analytics

http://www.indeed.com/jobtrends

Page 33: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Diverging data paradigms

[Diagram labels: Hadoop, NoSQL, Files, Clusters, Data Appliances; Storage, Preprocessing, Exploration, Modeling, Production]

More data, better fault tolerance

Easier programming, better performance

Page 34: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Data Science in Production

Real-time Big Data Analytics: From Deployment to Production

Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time

www.revolutionanalytics.com/news-events/free-webinars/

Page 35: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Building Data Science Teams

DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI

Statistics and Data Science graduates

Kaggle and Chorus

Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/


Page 36: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Closing Thoughts

The Data Science process leads to more powerful and more useful models

Data Scientists need a technology platform to think about, explore, and model data

Revolution R Enterprise is R for Big Data


Page 37: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Resources

Revolution R Enterprise: R for Big Data www.revolutionanalytics.com/products

RHadoop: Connecting R and Hadoop bit.ly/r-hadoop

Contact David Smith

[email protected]

@revodavid

blog.revolutionanalytics.com

Page 38: The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

Thank you.

www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.

