© 2010 IBM Corporation

IBM Almaden Research Center

Ricardo: Integrating R and Hadoop

Sudipto Das (1), Yannis Sismanis (2), Kevin S. Beyer (2), Rainer Gemulla (2), Peter J. Haas (2), John McPherson (2)

(1) UC Santa Barbara, (2) IBM Almaden Research Center

Presented by: Luyuang Zhang, Yuguan Li


Outline

Motivation & Background

Architecture & Components

Trading with Ricardo
–Simple Trading
–Complex Trading

Evaluation

Conclusion



Deep Analytics on Big Data

Enterprises collect huge amounts of data
–Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
–User interaction data and history
–Click and transaction logs

Deep analysis critical for competitive edge
–Understanding/modeling data
–Recommendations to users
–Ad placement

Challenge: enable deep analysis and understanding over massive data volumes
–Exploiting data to its full potential



Motivating Examples

Data Exploration/Model Evaluation/Outlier Detection

Personalized Recommendations
– For each individual customer/product
– Many applications to Netflix, Amazon, eBay, iTunes, …

Difficulty: discerning particular customer preferences
– Sampling loses competitive advantage

Application Scenario: Movie Recommendations
– Millions of customers
– Hundreds of thousands of movies
– Billions of movie ratings



Analyst’s Workflow

Data Exploration
–Deal with raw data

Data Modeling
–Deal with processed data
–Use a chosen method to build a model that fits the data

Model Evaluation
–Deal with the built model
–Use data to test the accuracy of the model



Big Data and Deep Analytics – The Gap

R, SPSS, SAS – A Statistician’s toolbox
–Rich statistical, modeling, visualization functionality
–Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN
–Operate on small amounts of data entirely in memory on a single server
–Extensions for data handling are cumbersome

Hadoop – Scalable Data Management Systems
–Scalable, Fault-Tolerant, Elastic, …
–“Magnetic”: easy to store data
–Limited deep analytics: mostly descriptive analytics



Filling the Gap: Existing Approaches

Reducing data size by sampling
–Approximations might result in losing competitive advantage
–Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]

Scaling out R
–Efforts from the statistics community to build parallel and distributed variants [SNOW, Rmpi]
–Main-memory based in most cases
–Requires re-implementing DBMS and distributed processing functionality

Deep analysis within a DBMS
–Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
–Not sustainable: misses out on R’s community development and rich libraries



Ricardo: Bridging the Gap

David Ricardo, famous 19th-century economist
– “Comparative Advantage”

Deep analytics is decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06]
– Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
– Recommender Systems/Latent Factorization [our paper]
– A key requirement for Ricardo is that the amount of data that must be communicated between both systems be sufficiently small

Large part includes joins, group bys, distributive aggregations
– Hadoop + Jaql: excellent scalability for large-scale data management

Small part includes matrix/vector operations
– R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.

Ricardo: Establishes “trade” between R and Hadoop/Jaql
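The large-part/small-part split can be illustrated with simple linear regression, one of the methods the slide lists as decomposable. The sketch below is in Python rather than R/Jaql and is not Ricardo's implementation. The function names (`partial_sums`, `combine`, `solve_small_part`) and the toy partitions are assumptions made for this example. The large part computes distributive sums per partition, as a Hadoop job would. The small part solves a tiny system from those sums, as R would.

```python
from functools import reduce

def partial_sums(partition):
    """Large part (per partition): distributive aggregates over (x, y) pairs."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in partition:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Merge two tuples of partial sums (the role a reducer would play)."""
    return tuple(u + v for u, v in zip(a, b))

def solve_small_part(sums):
    """Small part: solve the 2x2 normal equations for slope and intercept."""
    n, sx, sy, sxx, sxy = sums
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Toy data on the line y = 2x + 1, split into two hypothetical partitions.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
sums = reduce(combine, map(partial_sums, partitions))
slope, intercept = solve_small_part(sums)
print(slope, intercept)
```

Only five numbers cross from the large part to the small part, however many rows the partitions hold. This is the property the slide calls a key requirement.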



Ricardo: Bridging the Gap

Trade
– R sends aggregation-processing queries (written in Jaql) to Hadoop
– Hadoop sends aggregated data to R for advanced statistical processing



R in a Nutshell

[Figure: plot of Rating (y-axis, 3.5 to 3.9) against Year of Release (x-axis, 1950 to 2010)]

R supports rich statistical functionality



Jaql in a Nutshell

JSON View of the data:

Jaql Example:

Scalable Descriptive Analysis using Hadoop

Jaql: a representative declarative interface



Ricardo: The Trading Architecture

Complexity of trade between R and Hadoop
– Simple Trading: Data Exploration
– Complex Trading: Data Modeling



Simple Trading: Exploratory Analytics

Gain insights about data

Example: top-k outliers for a model
–Identify data items on which the model performed most poorly
–Helpful for improving accuracy of model

The trade:
–Use complex statistical models using rich R functionality
–Parallelize processing over entire data using Hadoop/Jaql
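The top-k outlier example can be sketched as follows. This is an illustration in Python, not the R/Jaql code Ricardo uses. The constant `predict` function stands in for a real fitted model, and the record layout and partition contents are made up for the example. Each partition keeps only its k worst-predicted records, and the small candidate lists are then merged.

```python
import heapq

def squared_error(record, predict):
    """Squared difference between the model's prediction and the observed rating."""
    return (predict(record) - record["rating"]) ** 2

def local_top_k(partition, predict, k):
    """Per-partition step: keep only the k worst-predicted records."""
    return heapq.nlargest(k, partition, key=lambda r: squared_error(r, predict))

def top_k_outliers(partitions, predict, k):
    """Merge the per-partition candidates into a global top-k."""
    candidates = [r for part in partitions for r in local_top_k(part, predict, k)]
    return heapq.nlargest(k, candidates, key=lambda r: squared_error(r, predict))

def predict(record):
    """Hypothetical model: predicts 3.5 for every rating."""
    return 3.5

partitions = [
    [{"i": 1, "j": 1, "rating": 4.0}, {"i": 1, "j": 2, "rating": 1.0}],
    [{"i": 2, "j": 1, "rating": 3.0}, {"i": 2, "j": 2, "rating": 5.0}],
]
outliers = top_k_outliers(partitions, predict, k=2)
print([(r["i"], r["j"]) for r in outliers])
```

Merging local top-k lists gives the same answer as sorting all records, because any record in the global top-k must also be in the top-k of its own partition.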



Complex Trading: Latent Factors

SVD-like matrix factorization

Minimize squared error: $\sum_{i,j} (p_i q_j - r_{ij})^2$

[Figure: matrix factorization with factors p and q]

The trade:
– Use complex statistical models in R
– Parallelize aggregate computations using Hadoop/Jaql



Complex Trading: Latent Factors

However, in the real world...

A vector of factors for each customer and item!
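With a vector of factors per customer and item, the scalar product in the squared-error objective becomes an inner product. The following is a sketch of that generalization, assuming each $p_i$ and $q_j$ is a column vector of the same length (the slides do not state the dimension):

```latex
e = \sum_{i,j} \left( p_i^{\top} q_j - r_{ij} \right)^2,
\qquad
\frac{\partial e}{\partial p_i} = \sum_{j} 2\, q_j \left( p_i^{\top} q_j - r_{ij} \right),
\qquad
\frac{\partial e}{\partial q_j} = \sum_{i} 2\, p_i \left( p_i^{\top} q_j - r_{ij} \right)
```

The sums run over the observed ratings only. Setting the vector length to 1 gives back the scalar formula.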



Latent Factor Models with Ricardo

Goal
– Minimize squared error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
– Numerical methods needed (large, sparse matrix)

Pseudocode
1. Start with initial guess of parameters $p_i$ and $q_j$.
2. Compute error & gradient
– E.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
3. Update parameters.
– R implements many different optimization algorithms
4. Repeat steps 2 and 3 until convergence.

Data intensive, but parallelizable!
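The four pseudocode steps can be sketched as a small loop. The example is in Python and uses plain gradient descent with a fixed step size as a stand-in for the optimizers R provides, so it is not the method used in Ricardo. The names `error_and_gradient` and `fit`, the step size, the tolerance, and the toy ratings are all assumptions for illustration.

```python
def error_and_gradient(ratings, p, q):
    """Step 2: squared error e and the gradients de/dp_i and de/dq_j."""
    e = 0.0
    dp = {i: 0.0 for i in p}
    dq = {j: 0.0 for j in q}
    for i, j, r in ratings:
        diff = p[i] * q[j] - r
        e += diff * diff
        dp[i] += 2.0 * q[j] * diff
        dq[j] += 2.0 * p[i] * diff
    return e, dp, dq

def fit(ratings, p, q, step=0.01, tol=1e-10, max_iter=10000):
    """Steps 2-4: take gradient steps until the error stops changing."""
    e, dp, dq = error_and_gradient(ratings, p, q)
    for _ in range(max_iter):
        new_p = {i: p[i] - step * dp[i] for i in p}      # step 3: update p
        new_q = {j: q[j] - step * dq[j] for j in q}      # step 3: update q
        new_e, new_dp, new_dq = error_and_gradient(ratings, new_p, new_q)
        converged = abs(e - new_e) < tol
        p, q, e, dp, dq = new_p, new_q, new_e, new_dp, new_dq
        if converged:
            break
    return p, q, e

# Toy ratings (customer i, movie j, rating r_ij) that an exact rank-1 model can fit.
ratings = [(1, 1, 2.0), (1, 2, 3.0), (2, 1, 4.0), (2, 2, 6.0)]
p0 = {1: 1.0, 2: 1.0}   # step 1: initial guess for customer factors
q0 = {1: 1.0, 2: 1.0}   # step 1: initial guess for movie factors
p, q, e = fit(ratings, p0, q0)
print(e)
```

The loop inside `error_and_gradient` is the data-intensive part: it touches every rating on every iteration, but each rating contributes independently, which is what makes it parallelizable.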



The R Component

• Parameters
e: squared error
de: gradients
pq: concatenation of the latent factors for users and items

• R code
optim( c(p,q), fe, fde, method="L-BFGS-B" )

• Goal
Keeps updating pq until it reaches convergence
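The interface the optimizer expects, an error function and a gradient function over one concatenated parameter vector, can be sketched as below. This is a Python illustration, not the R code from the talk. The names `fe` and `fde` follow the slide. `make_objective`, the 0-based ids, and the toy ratings are assumptions. In Ricardo the sums would be computed by Jaql/Hadoop rather than in-process.

```python
def make_objective(ratings, n_customers, n_movies):
    """Build fe(pq) and fde(pq) over the concatenated vector pq = p followed by q."""
    def split(pq):
        return pq[:n_customers], pq[n_customers:]

    def fe(pq):
        """Squared error e for the current parameters."""
        p, q = split(pq)
        return sum((p[i] * q[j] - r) ** 2 for i, j, r in ratings)

    def fde(pq):
        """Gradient of e, laid out in the same order as pq."""
        p, q = split(pq)
        grad = [0.0] * (n_customers + n_movies)
        for i, j, r in ratings:
            diff = p[i] * q[j] - r
            grad[i] += 2.0 * q[j] * diff                 # de/dp_i
            grad[n_customers + j] += 2.0 * p[i] * diff   # de/dq_j
        return grad

    return fe, fde

# Toy ratings as (customer i, movie j, rating r_ij) with 0-based ids.
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0)]
fe, fde = make_objective(ratings, n_customers=2, n_movies=2)
pq = [1.0, 2.0, 3.0, 1.0]   # p = [1, 2], q = [3, 1]
print(fe(pq), fde(pq))
```

Keeping the gradient in the same layout as `pq` lets a generic optimizer treat the model as a single flat vector.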



The Hadoop and Jaql Component

• Dataset

• Goal



The Hadoop and Jaql Component

• Calculate the squared errors

• Calculate the gradients



Computing the Model

Movie Ratings: (i, j, $r_{ij}$)
Customer Parameters: (i, $p_i$)
Movie Parameters: (j, $q_j$)

3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate:

$e = \sum_{i,j} (p_i q_j - r_{ij})^2$

Similarly compute the gradients
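The 3-way join followed by aggregation can be sketched in Python as two hash lookups per rating. This is an in-memory illustration, not the Jaql/Hadoop job. The function names and the toy tables are assumptions made for the example.

```python
def three_way_join(ratings, cust_pars, movie_pars):
    """Match each rating r_ij with p_i and q_j via two hash lookups."""
    p_by_i = dict(cust_pars)     # customer id i -> p_i
    q_by_j = dict(movie_pars)    # movie id j -> q_j
    for i, j, r in ratings:
        if i in p_by_i and j in q_by_j:   # inner join: skip unmatched ratings
            yield i, j, r, p_by_i[i], q_by_j[j]

def squared_error(joined):
    """Aggregate the joined rows into e = sum of (p_i * q_j - r_ij)^2."""
    return sum((p * q - r) ** 2 for _, _, r, p, q in joined)

ratings = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 5.0)]   # (i, j, r_ij)
cust_pars = [(1, 1.0), (2, 2.0)]                    # (i, p_i)
movie_pars = [(1, 3.0), (2, 1.0)]                   # (j, q_j)
joined = list(three_way_join(ratings, cust_pars, movie_pars))
e = squared_error(joined)
print(e)
```

The parameter tables are small compared with the ratings, so holding them in hash maps and streaming the ratings past them is a natural fit for a hash join.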



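Aggregation In Jaql/Hadoop

# Join ratings with the movie and customer parameters, then emit per rating:
# the squared-error term (no key), the de/dp_i term (key i), and the de/dq_j term (key j).
# Grouping by (i, j) sums these into e (null, null), de/dp_i (i, null), and de/dq_j (null, j).
res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p*$.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g={ $.i, $.j } into { g.*, gradient: sum($[*].value) }
")

Result in R

i     j     gradient
----  ----  --------
null  null  325235
1     null  21
2     null  357
…
null  1     9
null  2     64
…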
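The expand and group-by steps, and how the grouped rows map back to the error and gradients, can be sketched in Python. This is an illustration rather than Jaql. `None` plays the role of `null`, the gradient terms follow the formula $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$ from the pseudocode, and the function names and toy rows are assumptions.

```python
from collections import defaultdict

def aggregate(joined):
    """Emulate expand + group by: sum values keyed by (i, j), with None for null."""
    totals = defaultdict(float)
    for i, j, r, p, q in joined:
        diff = r - p * q
        totals[(None, None)] += diff ** 2        # contribution to e
        totals[(i, None)] += -2.0 * diff * q     # contribution to de/dp_i
        totals[(None, j)] += -2.0 * diff * p     # contribution to de/dq_j
    return dict(totals)

def unpack(totals):
    """Split the grouped rows into the error and the two gradient maps."""
    e = totals[(None, None)]
    dp = {i: v for (i, j), v in totals.items() if i is not None}
    dq = {j: v for (i, j), v in totals.items() if j is not None}
    return e, dp, dq

# Joined rows (i, j, r_ij, p_i, q_j), as the two hash joins would produce them.
joined = [(1, 1, 4.0, 1.0, 3.0), (1, 2, 2.0, 1.0, 1.0), (2, 1, 5.0, 2.0, 3.0)]
e, dp, dq = unpack(aggregate(joined))
print(e, dp, dq)
```

One pass over the data yields the error and every gradient component, so a single job per iteration is enough to feed the optimizer.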



Integrating the Components

Remember...

We would be running optim( c(p,q), fe, fde, method="L-BFGS-B" ) in the R process.



Experimental Evaluation

50 nodes at EC2

Each node: 8 cores, 7GB Memory, 320GB Disk

Total: 400 cores, 320GB Memory, 70TB Disk Space

Number of Rating Tuples    Data Size in GB
500 Million                104.33
1 Billion                  208.68
3 Billion                  625.99
5 Billion                  1043.23



Leveraging Hadoop’s Scalability

[Figure: Time (in seconds) vs. Number of Ratings (in billions), comparing Hadoop (handtuned) and Jaql]



Leveraging R’s Rich Functionality

[Figure: Root Mean Square Error vs. Number of Iterations, comparing Conjugate Gradient and L-BFGS]

– optim( c(p,q), fe, fde, method="CG" )
– optim( c(p,q), fe, fde, method="L-BFGS-B" )



Conclusion

Scaled Latent Factor Models to terabytes of data

Provided a bridge through which other algorithms with Summation Form can be mapped and scaled
– Many algorithms have Summation Form
– Decompose into “large part” and “small part”
– [Chu et al. NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM

Future & Current Work
– Tighter language integration
– More algorithms
– Performance tuning
