high performance predictive analytics in r and hadoop

High Performance Predictive

Analytics in R and Hadoop

Presented by:Mario E. Inchiosa, Ph.D.US Chief Scientist

Hadoop Summit 2013June 27, 2013 1

Innovate with R

2

Most widely used data analysis software Used by 2M+ data scientists, statisticians and analysts

Most powerful statistical programming language Flexible, extensible and comprehensive for productivity

Create beautiful and unique data visualizations As seen in New York Times, Twitter and Flowing Data

Thriving open-source community Leading edge of analytics research

Fills the talent gap New graduates prefer R

Download the White Paper

R is Hotbit.ly/r-is-hot

http://info.revolutionanalytics.com/R-is-Hot-Whitepaper.html

http://www.nytimes.com/interactive/2012/05/05/sports/baseball/mariano-rivera-and-his-peers.html?ref=baseball

R is open source and drives analytic innovation but has some limitations

Bigger data sizes

Speed of analysis

Production support

Memory Bound Big Data

Single Threaded Scale out, parallel processing, high speed

Community Support Commercial production support

Innovation and scale

Innovative – 5000+ packages, exponential growth

Combines with open source R packages where needed

4

Revolution R EnterpriseHigh Performance, Multi-Platform Analytics Platform

Revolution R Enterprise

DeployRWeb Services

Software Development Kit

DevelopRIntegrated Development

Environment

ConnectRHigh Speed & Direct Connectors

Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, OBDC

ScaleRHigh Performance Big Data Analytics

Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers

DistributedRDistributed Computing Framework

Platform LSF, MS HPC Server, MS Azure Burst

RevoRPerformance Enhanced Open Source R + CRAN packages

IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers

Open Source RPlus

Revolution Analytics

performance enhancements

Revolution Analytics

Value-Add ComponentsProviding Power

and Scale to Open Source R

Our objectives with respect to Hadoop Allow our customers to do predictive

analytics as easily in Hadoop as they can using R on their laptops today

Scalable and High Performance

Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop

5

Overview of Revolution and Hadoop

Revolution currently provides capabilities for using R and Hadoop together RHadoop packages: rmr2, rhdfs, rhbase

RevoScaleR package: Rapidly read and process data from HDFS

We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI-based clusters.

6

7

RHadoop: Map-Reduce with R

Unlock data in Hadoop using only the R language No need to learn Java, Pig, Python or any other language

8

RevoScaleR Overview

An R package that adds capabilities to R: Data Import/Clean/Explore/Transform Analytics – Descriptive and Predictive Parallel and distributed computing Visualization

Scales from small local data to huge distributed data

Scales from laptop to server to cluster to cloud

Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop

Key ways RevoScaleR enhances R

High Performance Computing (HPC) functions: Parallel/Distributed computing

High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing

XDF file format; rapidly store and extract data

Use results from HPA and HPC functions in other R packages

Revolution R Enterprise 9

10

High Performance Big Data Analytics with ScaleR

Statistical Tests

Machine Learning

Simulation

Descriptive Statistics

Data Visualization

R Data Step

Predictive Models

Sampling

Company Confidential – Do not distribute 11

ScaleR: High Performance ScalableParallel External Memory Algorithms

Data import – Delimited, Fixed, SAS, SPSS, OBDC

Variable creation & transformation

Recode variables Factor variables Missing value handling Sort Merge Split Aggregate by category

(means, sums)

Min / Max Mean Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product

matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data

(standard tables & long form) Marginal Summaries of Cross

Tabulations

Chi Square Test Kendall Rank Correlation Fisher’s Exact Test Student’s t-Test

Data Prep, Distillation & Descriptive Analytics

Subsample (observations & variables)

Random Sampling

R Data Step Statistical Tests

Sampling

Descriptive Statistics

Company Confidential – Do not distribute 12

ScaleR: High Performance ScalableParallel External Memory Algorithms

Sum of Squares (cross product matrix for set variables)

Multiple Linear Regression Generalized Linear Models (GLM)

- All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.

Covariance & Correlation Matrices Logistic Regression Classification & Regression Trees Predictions/scoring for models Residuals for all models

Histogram Line Plot Scatter Plot Lorenz Curve ROC Curves (actual data and

predicted values)

K-Means

Statistical Modeling

Decision Trees

Predictive Models Cluster AnalysisData Visualization

Classification

Machine Learning

Simulation

Monte Carlo

HPC Capabilities in RevoScaleR

Execute (essentially) any R function in parallel on nodes and cores

Results from all runs are returned in a list to the user’s laptop

Extensive control over parameters Extensive control over nodes, cores, and

times to run Ideal for simulations and for running R

functions on small amounts of data


14

Key ScaleR HPA features

Handles an arbitrarily large number of rows in a fixed amount of memory

Scales linearly with the number of rows

Scales linearly with the number of nodes

Scales well with the number of cores per node

Scales well with the number of parameters

Extremely high performance

Revolution R Enterprise

Regression comparison using in-memory data: lm() vs rxLinMod()


lm() in R

lm() in Revolution R

rxLinMod() in Revolution R

GLM comparison using in-memory data: glm() vs rxGlm()


SAS HPA Benchmarking comparison* Logistic Regression

Rows of data 1 billion 1 billion

Parameters “just a few” 7

Time 80 seconds 44 seconds

Data location In memory On disk

Nodes 32 5

Cores 384 20

RAM 1,536 GB 80 GB

19

Revolution R is faster on the same amount of data, despite using approximately a 20 th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.

*As published by SAS in HPC Wire, April 21, 2011

Double

45%

1/6th

5%

5%

Revolution R Enterprise Delivers Performance at 2% of the Cost

Allstate compares SAS, Hadoop and R for Big-Data Insurance ModelsApproach Platform Time to fit

SAS 16-core Sun Server 5 hours

rmr/MapReduce 10-node 80-coreHadoop Cluster

> 10 hours

R 250 GB Server Impossible (> 3 days)

Revolution R Enterprise 5-node 20-coreLSF cluster

5.7 minutes


Generalized linear model, 150 million observations, 70 degrees of freedomhttp://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html

What makes ScaleR so fast?

Lot’s of things! This kind of performance requires

Careful architecting that supports performance Constant, intense focus on all of the details A review of every line of code with an eye to

performance (in addition to giving the correct answers, of course)

Extensive profiling and continuous benchmarking to detect problems and improve code


Specific speed-related factors -- 1

Efficient computational algorithms Efficient memory management – minimize

data copying and data conversion Heavy use of C++ templates; optimal code Efficient data file format; fast access by row

and column


Specific speed-related factors -- 2 Models are pre-analyzed to detect and

remove duplicate computations and points of failure (singularities)

Handle categorical variables efficiently


Parallel External Memory Algorithms (PEMA’s) The HPA analytics algorithms are all built on

a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms

These Parallel External Memory Algorithms (PEMA’s) process data a chunk at a time in parallel across cores and nodes


Scalability and portability of Revo’s implementation of PEMA’s These PEMA algorithms can process an unlimited

number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability

They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions

They are independent of where the data is coming from, giving portability with respect to data sources


Simplified ScaleR Internal Architecture


Analytics EnginePEMA’s are implemented here

(Scalable, Parallelized, Threaded, Distributable)

Inter-process CommunicationMPI, RPC, Sockets, Files

Data SourcesHDFS, Teradata, ODBC, SAS, SPSS,

CSV, Fixed, XDF

ScaleR on Hadoop – Base Implementation Each iteration (pass through the data) is a separate

MapReduce job The mapper produces “intermediate result objects”

for each task that are combined using a combiner and then a reducer

A master process decides if another pass through the data is required

Allow data to be stored temporarily (or longer) in XDF binary format for increased speed especially on iterative algorithms


Performance enhancements

A single Map job that handles all iterations, to reduce overhead of starting multiple jobs

Tree structured communication and reduction to reduce time for those operations, especially on large clusters

Alternative communication schemes (e.g. sockets) instead of files


Thank You!

Mario Inchiosa

[email protected]


Backup Slides


Functionality in common across HPA algorithms Ability to do on-the-fly data transformations

in the R language In-formula: ArrDelay>15 ~ DayOfWeek + … As transform: Late = ArrDelay>15 As transform function: Specify an R function to

process an entire chunk of data at a time Row selection Both standard weights and frequency

weights


Portability of user code

The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop

The same analytics code can be used on a small data.frame in memory and on a huge distributed data file


Sample code for logit on laptop

# Specify local data sourceairData <- myLocalDataSource

# Specify model formula and parametersrxLogit(ArrDelay>15 ~ Origin + Year +

Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData)


Sample code for logit on Hadoop

# Change the “compute context”

rxSetComputeContext(myHadoopCluster)

# Change the data source if necessary

airData <- myHadoopDataSource

# Otherwise, the code is the same

rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData)


high performance predictive analytics in r and hadoop

Technology

process data

r language

flowing data

data scientists

open source r packages

hadoop revolution

data analysis software

small local data