Big Data Analytics with R Derek McCrae Norton, Senior Sales Engineer April 2, 2014

DESCRIPTION

Great Wide Open - Day 1 Derek Norton - Revolution Analytics 11:15 AM - Operations 2 (Big Data)

TRANSCRIPT

Page 1: Big Data Analytics with R

Big Data Analytics with R

Derek McCrae Norton, Senior Sales Engineer

April 2, 2014

Page 2: Big Data Analytics with R

Agenda

Introduction

Big Data

Analytics

R

Revolution R Enterprise

Synergy

Conclusion

© 2013 Revolution Analytics

Page 3: Big Data Analytics with R

Who are you anyway?

Statistician

– My degrees are all in statistics.

Consultant

– My experience has been mostly in Marketing Analytics, focusing on Predictive Analytics.

Sales Engineer

– Still consulting, just with a much heavier emphasis on client interaction.

Founder/Director, Atlanta R Users Group

– Shameless plug. Please join if interested.

– http://www.meetup.com/R-Users-Atlanta/

Husband, Father, Outdoorsman, Serial Hobbyist, …

Page 4: Big Data Analytics with R

Big Data


Page 5: Big Data Analytics with R

Big Data and Big Opportunities

“Big data is data that exceeds the processing capability of conventional database systems”

Edd Dumbill, O’Reilly Radar*, Jan 2012

[Chart: Worldwide data created and replicated, in zettabytes. Extracted data labels: “1 2” and “35”.]

* radar.oreilly.com/2012/01/what-is-big-data.html

Page 6: Big Data Analytics with R

What is Big Data?

Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard statistical software.

Snijders, Matzat, & Reips (2012)

Page 7: Big Data Analytics with R

Does Big Data Mean Hadoop?

The short answer is no.

The longer answer is maybe.

Hadoop adoption is turning that maybe into a probably.

Page 8: Big Data Analytics with R

Analytics


Page 9: Big Data Analytics with R

What is Analytics?

Analytics is the combination of mathematical, statistical, and heuristic techniques to glean useful insights from data and to implement actions derived from those insights.

Derek McCrae Norton

Page 10: Big Data Analytics with R

Analytics

The current buzzword is “Data Science,” but I don’t really agree with that nomenclature.

– What statistician, analyst, (data scientist) actually follows the scientific method?

That being said, the current definition of “Data Science” is a pretty good surrogate for what we are discussing.

Whatever descriptors you use, one thing is clear… You must use something to help you carry out the actual work.

– R, Python, SAS, etc.

– RDBMS, Hadoop, etc.

Page 11: Big Data Analytics with R

R

Page 12: Big Data Analytics with R

What is the R language?

A Platform…

– A Procedural Language for Stats, Math and Data Science

– A Complete Data Visualization Framework

– Provided as Open Source

A Community…

– 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects

– Active User Groups Across the World

An Ecosystem

– CRAN: 5000+ Freely Available Packages

– Applicable to Big Data if scaled

Page 13: Big Data Analytics with R

THE R USER COMMUNITY

Page 14: Big Data Analytics with R

A brief history of R

1993: Research project in Auckland, NZ

– Ross Ihaka and Robert Gentleman

1995: Released as open-source software

– Generally compatible with the “S” language

1997: R core group formed

2000: R 1.0.0 released

2004: First international user conference in Vienna

2013: R 3.0.0 released

Page 15: Big Data Analytics with R

R is Free

Open Source, licensed under GPL (like Linux!)

– Free as in beer

– Free as in freedom

Flexible

Open for integration

– Data (SAS, SPSS, Excel, SQL Server, Oracle, …)

– Systems (applications, webservers, …)

Broad user-base

– De-facto standard for data analysis teaching


Page 16: Big Data Analytics with R

R is exploding in popularity & function

Web Site Popularity: number of links to main web site

[Chart comparing R, SAS, SPSS, S-Plus, Stata]

Scholarly Activity: Google Scholar hits (’05-’09 CAGR)

R 46%, SAS -11%, SPSS -27%, S-Plus 0%, Stata 10%

Internet Discussion: mean monthly traffic on email discussion list

[Chart comparing R, SAS, Stata, SPSS, S-Plus]

Package Growth: number of R packages listed on CRAN

4,332 as of Feb 2013

Page 17: Big Data Analytics with R

So why isn’t everyone using R?

“The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.”

Bo Cowgill, Google (at SF R Meetup)

Page 18: Big Data Analytics with R

Otherwise R is Great! Right?

Who here has used R?

– Thoughts?

Who has never seen this?

Who here has more than 1 core/processor?

Who has ever used r-help?

– ‘They’ did write documentation that told you that Perl was needed, but ‘they’ can’t read it for you. - Brian D. Ripley, R-help (February 2001)

– This is all documented in TFM. Those who WTFM don’t want to have to WTFM again on the mailing list. RTFM. - Barry Rowlingson, R-help (October 2003)

Page 19: Big Data Analytics with R

What is Revolution R Enterprise?


Page 20: Big Data Analytics with R

Motivators

Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors

Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time

Enterprise Readiness | Community support | Commercial support | Delivers full service production support

Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R

Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source
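The "hybrid memory & disk" motivator in the Big Data row can be illustrated with a chunked (external-memory) computation: read a block of rows, reduce it to a small summary, merge that summary into a running result, and discard the block. The following is a minimal sketch in Python, not RRE code. The function names `read_chunks` and `chunked_mean_variance` and the column name `delay` are hypothetical, and an in-memory string stands in for a file on disk.

```python
import csv
import io
from itertools import islice


def read_chunks(handle, column, chunk_size):
    """Yield lists of at most `chunk_size` floats from one CSV column.

    Only one chunk is held in memory at a time.
    """
    reader = csv.DictReader(handle)
    while True:
        chunk = [float(row[column]) for row in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


def chunked_mean_variance(chunks):
    """Return (n, mean, sample variance) by merging per-chunk summaries.

    Each chunk is reduced to (count, mean, sum of squared deviations) and
    merged with the pairwise update of Chan, Golub and LeVeque, so the
    result does not depend on how the data was split.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        n_b = len(chunk)
        mean_b = sum(chunk) / n_b
        m2_b = sum((x - mean_b) ** 2 for x in chunk)
        delta = mean_b - mean
        total = n + n_b
        mean += delta * n_b / total
        m2 += m2_b + delta * delta * n * n_b / total
        n = total
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


if __name__ == "__main__":
    # Stand-in for a large file; in practice `handle` would come from open(path).
    demo = io.StringIO("delay\n1\n2\n3\n4\n5\n")
    print(chunked_mean_variance(read_chunks(demo, "delay", chunk_size=2)))
```

Because each chunk's summary is independent of the others, the same merge step also works when chunks are summarized on different cores or nodes, which is the link between the Big Data and Speed of Analysis rows.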

Page 21: Big Data Analytics with R

Introducing Revolution R Enterprise (RRE)

The Big Data Big Analytics Platform

[Diagram: DistributedR, DevelopR, DeployR, ScaleR, ConnectR]

Big Data Big Analytics Ready

– Enterprise readiness

– High performance analytics

– Multi-platform architecture

– Data source integration

– Development tools

– Deployment tools

Page 22: Big Data Analytics with R

The Platform Step by Step: R Capabilities

R+CRAN

• Open source R interpreter

• UPDATED R 3.0.2

• Freely-available R algorithms

• Algorithms callable by RevoR

• Embeddable in R scripts

• 100% Compatible with existing R scripts, functions and packages

RevoR

• Performance enhanced R interpreter

• Based on open source R

• Adds high-performance math

Available On:

• Platform™ LSF™ Linux®

• Microsoft® HPC Clusters

• Windows® & Linux Servers

• Windows & Linux Workstations

• IBM® Netezza®

• NEW Cloudera Hadoop®

• NEW Hortonworks Hadoop

• NEW Teradata® Database

• Intel® Hadoop

• IBM BigInsights™

Page 23: Big Data Analytics with R

The Platform Step by Step: Parallelization & Data Sourcing

ConnectR

• High-speed & direct connectors

Available for:

• High-performance XDF

• SAS, SPSS, delimited & fixed format text data files

• Hadoop HDFS (text & XDF)

• Teradata Database & Aster

• EDWs and ADWs

• ODBC

ScaleR

• Ready-to-Use high-performance big data big analytics

• Fully-parallelized analytics

• Data prep & data distillation

• Descriptive statistics & statistical tests

• Correlation & covariance matrices

• Predictive Models – linear, logistic, GLM

• Machine learning

• Monte Carlo simulation

• NEW Tools for distributing customized algorithms across nodes

DistributedR

• Distributed computing framework

• Delivers portability across platforms

Available on:

• Windows Servers

• Red Hat and NEW SuSE Linux Servers

• IBM Platform LSF Linux

• Microsoft HPC Clusters

• NEW Teradata Database

• NEW Cloudera Hadoop

• NEW Hortonworks Hadoop

A single package (RevoScaleR)

Page 24: Big Data Analytics with R

The Platform Step by Step: Tools & Deployment

DeployR

• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs

• Integrates R into application infrastructures

Capabilities:

• Invokes R scripts from web services calls

• RESTful interface for easy integration

• Works with web & mobile apps, leading BI & visualization tools and business rules engines

DevelopR

• Integrated development environment for R

• Visual ‘step-into’ debugger

Available on:

• Windows

Page 25: Big Data Analytics with R

Write Once. Deploy Anywhere.

DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE

[Diagram: DistributedR, ScaleR, ConnectR, DeployR]

In the Cloud: Amazon AWS

Workstations & Servers: Desktop, Server

Clustered Systems: IBM Platform LSF, Microsoft HPC

EDW: Teradata

Hadoop: Hortonworks, Cloudera

Page 26: Big Data Analytics with R

Synergy


Page 27: Big Data Analytics with R

Put it all together

Talent fresh out of school knows R.

RRE is R plus more.

RRE provides a unified way of carrying out analytics (small or big).

RRE code is portable…

Page 28: Big Data Analytics with R

Scale and Portability

Set “compute context” to define hardware (one line of code)

– Native job-scheduler handles distribution, monitoring, failover etc.

Same code runs on other supported architectures

– Just change compute context

42 seconds instead of 6 minutes on the local machine
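The compute-context idea can be sketched as follows: the analysis is written once against an abstract backend, and one line selects where it runs. This is a Python illustration only, not the RRE API, which these slides do not show. Every name here (`set_compute_context`, `LocalSequential`, `LocalParallel`, `distributed_mean`) is hypothetical, and a thread pool stands in for a real cluster scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalSequential:
    """Hypothetical context: run every task in the calling thread."""

    def map(self, fn, items):
        return [fn(item) for item in items]


class LocalParallel:
    """Hypothetical context: fan tasks out to a pool of worker threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def map(self, fn, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, items))


_context = LocalSequential()


def set_compute_context(context):
    """Select the backend used by all later analysis calls."""
    global _context
    _context = context


def partial_sum(chunk):
    """Per-chunk summary: (sum, count)."""
    return sum(chunk), len(chunk)


def distributed_mean(chunks):
    """Analysis code: identical regardless of the active context."""
    parts = _context.map(partial_sum, chunks)
    total = sum(s for s, _ in parts)
    count = sum(c for _, c in parts)
    return total / count


if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5], [6]]
    set_compute_context(LocalSequential())
    print(distributed_mean(chunks))
    set_compute_context(LocalParallel(workers=2))  # the only line that changes
    print(distributed_mean(chunks))
```

The sketch shows the structure only. Threads in CPython do not speed up CPU-bound work, so it says nothing about timings such as the 42 seconds quoted on the slide; in a real deployment the context would hand the chunks to a job scheduler on a cluster.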

Page 29: Big Data Analytics with R
Page 30: Big Data Analytics with R

References

1. Snijders, C., Matzat, U., & Reips, U.-D. (2012). ‘Big Data’: Big gaps of knowledge in the field of Internet science. International Journal of Internet Science, 7, 1-5. http://www.ijis.net/ijis7_1/ijis7_1_editorial.html

2. Conway, D. The Data Science Venn Diagram.