High Performance Predictive Analytics in R and Hadoop

Presented by: Mario E. Inchiosa, Ph.D., US Chief Scientist. Hadoop Summit 2013, June 27, 2013

Upload: hadoop-summit

Post on 10-May-2015

DESCRIPTION

Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.

TRANSCRIPT

Page 1: High Performance Predictive Analytics in R and Hadoop

High Performance Predictive Analytics in R and Hadoop

Presented by: Mario E. Inchiosa, Ph.D., US Chief Scientist

Hadoop Summit 2013, June 27, 2013

Page 2: High Performance Predictive Analytics in R and Hadoop

Innovate with R

Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts

Most powerful statistical programming language: flexible, extensible and comprehensive for productivity

Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data

Thriving open-source community: leading edge of analytics research

Fills the talent gap: new graduates prefer R

Download the White Paper "R is Hot": bit.ly/r-is-hot

Page 3: High Performance Predictive Analytics in R and Hadoop

R is open source and drives analytic innovation but has some limitations

Bigger data sizes: Memory bound → Big data

Speed of analysis: Single threaded → Scale out, parallel processing, high speed

Production support: Community support → Commercial production support

Innovation and scale: Innovative – 5000+ packages, exponential growth → Combines with open source R packages where needed

Page 4: High Performance Predictive Analytics in R and Hadoop

Revolution R Enterprise: High Performance, Multi-Platform Analytics Platform

DeployR: Web Services Software Development Kit

DevelopR: Integrated Development Environment

ConnectR: High Speed & Direct Connectors

Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC

ScaleR: High Performance Big Data Analytics

Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers

DistributedR: Distributed Computing Framework

Platform LSF, MS HPC Server, MS Azure Burst

RevoR: Performance Enhanced Open Source R + CRAN packages

IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers

Open Source R Plus Revolution Analytics performance enhancements

Revolution Analytics Value-Add Components: Providing Power and Scale to Open Source R

Page 5: High Performance Predictive Analytics in R and Hadoop

Our objectives with respect to Hadoop

Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today

Scalable and High Performance

Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop

Page 6: High Performance Predictive Analytics in R and Hadoop

Overview of Revolution and Hadoop

Revolution currently provides capabilities for using R and Hadoop together:

RHadoop packages: rmr2, rhdfs, rhbase

RevoScaleR package: Rapidly read and process data from HDFS

We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI-based clusters.

Page 7: High Performance Predictive Analytics in R and Hadoop

RHadoop: Map-Reduce with R

Unlock data in Hadoop using only the R language

No need to learn Java, Pig, Python or any other language

Page 8: High Performance Predictive Analytics in R and Hadoop

RevoScaleR Overview

An R package that adds capabilities to R:

Data Import/Clean/Explore/Transform

Analytics – Descriptive and Predictive

Parallel and distributed computing

Visualization

Scales from small local data to huge distributed data

Scales from laptop to server to cluster to cloud

Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop

Page 9: High Performance Predictive Analytics in R and Hadoop

Key ways RevoScaleR enhances R

High Performance Computing (HPC) functions: Parallel/Distributed computing

High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing

XDF file format; rapidly store and extract data

Use results from HPA and HPC functions in other R packages

Revolution R Enterprise 9

Page 10: High Performance Predictive Analytics in R and Hadoop


High Performance Big Data Analytics with ScaleR

Statistical Tests

Machine Learning

Simulation

Descriptive Statistics

Data Visualization

R Data Step

Predictive Models

Sampling

Page 11: High Performance Predictive Analytics in R and Hadoop

Company Confidential – Do not distribute

ScaleR: High Performance Scalable Parallel External Memory Algorithms

Data Prep, Distillation & Descriptive Analytics

R Data Step: Data import – Delimited, Fixed, SAS, SPSS, ODBC; Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort; Merge; Split; Aggregate by category (means, sums)

Descriptive Statistics: Min / Max; Mean; Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations

Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test

Sampling: Subsample (observations & variables); Random Sampling

Page 12: High Performance Predictive Analytics in R and Hadoop

ScaleR: High Performance Scalable Parallel External Memory Algorithms

Statistical Modeling

Predictive Models: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM), covering all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions including cauchit, identity, log, logit and probit, and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Classification & Regression Trees; Predictions/scoring for models; Residuals for all models

Data Visualization: Histogram; Line Plot; Scatter Plot; Lorenz Curve; ROC Curves (actual data and predicted values)

Machine Learning

Cluster Analysis: K-Means

Classification: Decision Trees

Simulation

Monte Carlo

Page 13: High Performance Predictive Analytics in R and Hadoop

HPC Capabilities in RevoScaleR

Execute (essentially) any R function in parallel on nodes and cores

Results from all runs are returned in a list to the user’s laptop

Extensive control over parameters

Extensive control over nodes, cores, and times to run

Ideal for simulations and for running R functions on small amounts of data
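The pattern described on this slide (run one R function many times in parallel and collect the results in a list) can be sketched with base R's parallel package. This is only a stand-in for illustration, not the RevoScaleR mechanism itself; the function name simulate_mean and the core count are assumptions.

```r
library(parallel)

# Hypothetical simulation task: estimate the mean of n standard normal draws.
simulate_mean <- function(run_id, n = 1000) {
  set.seed(run_id)   # one seed per run keeps each result reproducible
  mean(rnorm(n))
}

# Forking is unavailable on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Run the function 8 times; results come back as a list, one element per run.
results <- mclapply(1:8, simulate_mean, mc.cores = n_cores)
estimates <- unlist(results)
```

Because each run sets its own seed, the parallel results match what a sequential loop over the same run IDs would produce.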

Page 14: High Performance Predictive Analytics in R and Hadoop


Key ScaleR HPA features

Handles an arbitrarily large number of rows in a fixed amount of memory

Scales linearly with the number of rows

Scales linearly with the number of nodes

Scales well with the number of cores per node

Scales well with the number of parameters

Extremely high performance


Page 15: High Performance Predictive Analytics in R and Hadoop
Page 16: High Performance Predictive Analytics in R and Hadoop
Page 17: High Performance Predictive Analytics in R and Hadoop

Regression comparison using in-memory data: lm() vs rxLinMod()

Chart series: lm() in R; lm() in Revolution R; rxLinMod() in Revolution R

Page 18: High Performance Predictive Analytics in R and Hadoop

GLM comparison using in-memory data: glm() vs rxGlm()


Page 19: High Performance Predictive Analytics in R and Hadoop

SAS HPA Benchmarking comparison*: Logistic Regression

                 SAS HPA          Revolution R Enterprise
Rows of data     1 billion        1 billion
Parameters       “just a few”     7
Time             80 seconds       44 seconds
Data location    In memory        On disk
Nodes            32               5
Cores            384              20
RAM              1,536 GB         80 GB

Slide callouts: Double, 45%, 1/6th, 5%, 5%

Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.

*As published by SAS in HPC Wire, April 21, 2011

Revolution R Enterprise Delivers Performance at 2% of the Cost
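The resource ratios quoted on this slide can be checked directly from the benchmark figures. The short calculation below uses only the numbers given on the slide; the variable names are illustrative.

```r
# Benchmark figures as given on the slide.
sas  <- c(nodes = 32, cores = 384, ram_gb = 1536, seconds = 80)
revo <- c(nodes = 5,  cores = 20,  ram_gb = 80,   seconds = 44)

# How many times more of each resource the SAS run used.
ratios <- sas / revo
print(round(ratios, 1))
```

The core and RAM ratios both come out at 19.2 (roughly a 20th) and the node ratio at 6.4 (roughly a 6th), which is consistent with the slide's wording.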

Page 20: High Performance Predictive Analytics in R and Hadoop

Allstate compares SAS, Hadoop and R for Big-Data Insurance Models

Approach                   Platform                           Time to fit
SAS                        16-core Sun Server                 5 hours
rmr/MapReduce              10-node 80-core Hadoop Cluster     > 10 hours
R                          250 GB Server                      Impossible (> 3 days)
Revolution R Enterprise    5-node 20-core LSF cluster         5.7 minutes

Generalized linear model, 150 million observations, 70 degrees of freedom

http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html

Page 21: High Performance Predictive Analytics in R and Hadoop

What makes ScaleR so fast?

Lots of things! This kind of performance requires:

Careful architecting that supports performance

Constant, intense focus on all of the details

A review of every line of code with an eye to performance (in addition to giving the correct answers, of course)

Extensive profiling and continuous benchmarking to detect problems and improve code

Page 22: High Performance Predictive Analytics in R and Hadoop

Specific speed-related factors -- 1

Efficient computational algorithms

Efficient memory management – minimize data copying and data conversion

Heavy use of C++ templates; optimal code

Efficient data file format; fast access by row and column

Page 23: High Performance Predictive Analytics in R and Hadoop

Specific speed-related factors -- 2

Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)

Handle categorical variables efficiently

Page 24: High Performance Predictive Analytics in R and Hadoop

Parallel External Memory Algorithms (PEMA’s)

The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms

These Parallel External Memory Algorithms (PEMA’s) process data a chunk at a time in parallel across cores and nodes
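To make the chunk-at-a-time idea concrete, here is a minimal base R sketch of a computation in this style: a mean and variance built from per-chunk sufficient statistics. The function names (process_chunk, combine, finalize) are hypothetical and are not taken from the ScaleR platform.

```r
# Process one chunk: return a small "intermediate result" (sufficient statistics).
process_chunk <- function(x) {
  c(n = length(x), sum = sum(x), sumsq = sum(x^2))
}

# Combine two intermediate results; addition is associative, so chunks
# can be processed in any order and on any core or node.
combine <- function(a, b) a + b

# Turn the combined statistics into the final answer.
finalize <- function(s) {
  m <- s[["sum"]] / s[["n"]]
  v <- (s[["sumsq"]] - s[["n"]] * m^2) / (s[["n"]] - 1)
  list(mean = m, var = v)
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))   # 10 chunks of 1,000 rows
partials <- lapply(chunks, process_chunk)          # this step could run in parallel
result <- finalize(Reduce(combine, partials))
```

Only the three-number summary of each chunk has to be kept or communicated, which is why memory use does not grow with the number of rows.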

Page 25: High Performance Predictive Analytics in R and Hadoop

Scalability and portability of Revo’s implementation of PEMA’s

These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability

They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions

They are independent of where the data is coming from, giving portability with respect to data sources
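Data-source independence can be illustrated with a small base R sketch in which the algorithm only ever asks for "the next chunk". The reader functions and chunked_lm below are hypothetical names, and the sketch assumes numeric predictors only; it is not the ScaleR implementation.

```r
# Reader over an in-memory data.frame: returns successive row blocks, then NULL.
df_reader <- function(df, chunk_rows) {
  pos <- 0
  function() {
    if (pos >= nrow(df)) return(NULL)
    idx <- (pos + 1):min(pos + chunk_rows, nrow(df))
    pos <<- pos + length(idx)
    df[idx, , drop = FALSE]
  }
}

# Reader over a CSV file: only one chunk is held in memory at a time.
csv_reader <- function(path, chunk_rows) {
  con <- file(path, open = "r")
  header <- strsplit(readLines(con, n = 1), ",")[[1]]
  function() {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) { close(con); return(NULL) }
    read.csv(text = lines, header = FALSE, col.names = header)
  }
}

# Least squares that accumulates X'X and X'y chunk by chunk.
chunked_lm <- function(next_chunk, formula) {
  xtx <- NULL
  xty <- NULL
  while (!is.null(chunk <- next_chunk())) {
    X <- model.matrix(formula, chunk)
    y <- model.response(model.frame(formula, chunk))
    if (is.null(xtx)) {
      xtx <- crossprod(X)
      xty <- crossprod(X, y)
    } else {
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, y)
    }
  }
  drop(solve(xtx, xty))
}

# Simulated data, written to a temporary CSV so both readers see the same rows.
set.seed(42)
df <- data.frame(x1 = rnorm(500), x2 = runif(500))
df$y <- 1 + 2 * df$x1 - 3 * df$x2 + rnorm(500, sd = 0.1)
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE, quote = FALSE)

fit_df  <- chunked_lm(df_reader(df, 100), y ~ x1 + x2)
fit_csv <- chunked_lm(csv_reader(path, 100), y ~ x1 + x2)
```

The fitting code is identical for both sources; only the reader changes. The accumulated matrices have a fixed size set by the number of coefficients, not by the number of rows.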

Page 26: High Performance Predictive Analytics in R and Hadoop

Simplified ScaleR Internal Architecture

Analytics Engine: PEMA’s are implemented here (Scalable, Parallelized, Threaded, Distributable)

Inter-process Communication: MPI, RPC, Sockets, Files

Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF

Page 27: High Performance Predictive Analytics in R and Hadoop

ScaleR on Hadoop – Base Implementation

Each iteration (pass through the data) is a separate MapReduce job

The mapper produces “intermediate result objects” for each task that are combined using a combiner and then a reducer

A master process decides if another pass through the data is required

Allow data to be stored temporarily (or longer) in XDF binary format for increased speed, especially on iterative algorithms
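The iteration scheme on this slide can be sketched in plain R, with no Hadoop involved, using logistic regression fitted by Newton-Raphson. Each loop iteration plays the role of one MapReduce job. All function names here are hypothetical, and the sketch assumes the intermediate result objects are a gradient and a Hessian, which the slide does not specify.

```r
# Mapper: per-chunk gradient and Hessian of the log-likelihood at the current beta.
map_chunk <- function(chunk, beta) {
  X <- chunk$X
  p <- 1 / (1 + exp(-drop(X %*% beta)))
  list(grad = crossprod(X, chunk$y - p),
       hess = crossprod(X, X * (p * (1 - p))))
}

# Combiner/reducer: intermediate results are merged by element-wise addition.
reduce_pair <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}

# Master: one pass over all chunks per iteration, then decide whether to continue.
fit_logit <- function(chunks, n_coef, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, n_coef)
  for (iter in seq_len(max_iter)) {
    total <- Reduce(reduce_pair, lapply(chunks, map_chunk, beta = beta))
    step <- drop(solve(total$hess, total$grad))   # Newton-Raphson update
    beta <- beta + step
    if (max(abs(step)) < tol) break               # converged: no further pass needed
  }
  list(coef = beta, passes = iter)
}

# Simulated data split into 4 chunks, standing in for 4 map tasks.
set.seed(7)
n <- 2000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, 1 / (1 + exp(-drop(X %*% c(-0.5, 1, -2)))))
rows <- split(seq_len(n), rep(1:4, length.out = n))
chunks <- lapply(rows, function(i) list(X = X[i, , drop = FALSE], y = y[i]))

fit <- fit_logit(chunks, n_coef = 3)
```

The value fit$passes counts how many passes through the data were needed, which in the base implementation described above corresponds to the number of MapReduce jobs launched.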

Page 28: High Performance Predictive Analytics in R and Hadoop

Performance enhancements

A single Map job that handles all iterations, to reduce overhead of starting multiple jobs

Tree structured communication and reduction to reduce time for those operations, especially on large clusters

Alternative communication schemes (e.g. sockets) instead of files
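Tree-structured reduction can be illustrated with a short base R sketch: partial results are merged pairwise, level by level, so the number of sequential merge rounds grows with the logarithm of the number of partial results rather than linearly. The function tree_reduce is a hypothetical name and the sketch runs on a single machine.

```r
# Combine a list of partial results pairwise, level by level.
tree_reduce <- function(parts, combine) {
  stopifnot(length(parts) >= 1)
  levels <- 0
  while (length(parts) > 1) {
    n <- length(parts)
    pairs <- seq(1, n - 1, by = 2)
    # Merges within one level are independent and could run concurrently.
    merged <- lapply(pairs, function(i) combine(parts[[i]], parts[[i + 1]]))
    if (n %% 2 == 1) merged <- c(merged, parts[n])   # odd one out moves up unchanged
    parts <- merged
    levels <- levels + 1
  }
  list(value = parts[[1]], levels = levels)
}

# 16 partial results are reduced in 4 levels instead of 15 sequential merges.
out <- tree_reduce(as.list(1:16), `+`)
```

Any associative combine function can be passed in, for example one that adds gradient and Hessian objects from separate tasks.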

Revolution R Enterprise 28

Page 29: High Performance Predictive Analytics in R and Hadoop

Thank You!

Mario Inchiosa

[email protected]


Page 30: High Performance Predictive Analytics in R and Hadoop

Backup Slides


Page 31: High Performance Predictive Analytics in R and Hadoop

Functionality in common across HPA algorithms

Ability to do on-the-fly data transformations in the R language

In-formula: ArrDelay>15 ~ DayOfWeek + …

As transform: Late = ArrDelay>15

As transform function: Specify an R function to process an entire chunk of data at a time

Row selection

Both standard weights and frequency weights
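As a rough illustration of a chunk-level transform function combined with row selection, the base R sketch below applies both to each chunk before aggregating. The helper names and the toy values are made up for this example; it does not use the RevoScaleR interface.

```r
# Transform function: receives a whole chunk and returns it with a new variable.
add_late_flag <- function(chunk) {
  chunk$Late <- chunk$ArrDelay > 15
  chunk
}

# Row selection: keep only the rows that satisfy a condition.
select_rows <- function(chunk) {
  chunk[!is.na(chunk$ArrDelay), , drop = FALSE]
}

# Toy data standing in for the airline data set (values are made up).
flights <- data.frame(
  ArrDelay  = c(5, 30, NA, 16, 0, 45, 15, 20),
  DayOfWeek = c("Mon", "Mon", "Tue", "Tue", "Wed", "Wed", "Thu", "Thu")
)
chunks <- split(flights, rep(1:2, each = 4))

# Apply selection and transformation chunk by chunk, then count late flights.
late_counts <- lapply(chunks, function(ch) sum(add_late_flag(select_rows(ch))$Late))
total_late <- Reduce(`+`, late_counts)
```

Because the transformation runs as each chunk is read, the derived variable never has to be stored in the data source.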

Page 32: High Performance Predictive Analytics in R and Hadoop

Portability of user code

The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop

The same analytics code can be used on a small data.frame in memory and on a huge distributed data file


Page 33: High Performance Predictive Analytics in R and Hadoop

Sample code for logit on laptop

# Specify local data source

airData <- myLocalDataSource

# Specify model formula and parameters

rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData)

Page 34: High Performance Predictive Analytics in R and Hadoop

Sample code for logit on Hadoop

# Change the “compute context”

rxSetComputeContext(myHadoopCluster)

# Change the data source if necessary

airData <- myHadoopDataSource

# Otherwise, the code is the same

rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData)
