© 2010 IBM Corporation
IBM Almaden Research Center
Ricardo: Integrating R and Hadoop
Sudipto Das¹, Yannis Sismanis², Kevin S. Beyer², Rainer Gemulla², Peter J. Haas², John McPherson²
¹ UC Santa Barbara, ² IBM Almaden Research Center
Presented by: Luyuang Zhang, Yuguan Li
Outline
Motivation & Background
Architecture & Components
Trading with Ricardo
– Simple Trading
– Complex Trading
Evaluation
Conclusion
Deep Analytics on Big Data
Enterprises collect huge amounts of data
– Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
– User interaction data and history
– Click and transaction logs
Deep analysis critical for competitive edge
– Understanding/modeling data
– Recommendations to users
– Ad placement
Challenge: Enable deep analysis and understanding over massive data volumes
– Exploiting data to its full potential
Motivating Examples
Data Exploration/Model Evaluation/Outlier Detection
Personalized Recommendations
– For each individual customer/product
– Many applications to Netflix, Amazon, eBay, iTunes, …
Difficulty: Discern particular customer preferences
– Sampling loses competitive advantage
Application Scenario: Movie Recommendations
– Millions of customers
– Hundreds of thousands of movies
– Billions of movie ratings
Analyst’s Workflow
Data Exploration
– Deal with raw data
Data Modeling
– Deal with processed data
– Use the chosen method to build a model that fits the data
Model Evaluation
– Deal with the built model
– Use data to test the accuracy of the model
Big Data and Deep Analytics – The Gap
R, SPSS, SAS – A Statistician’s toolbox
– Rich statistical, modeling, visualization functionality
– Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN
– Operate on small amounts of data entirely in memory on a single server
– Extensions for data handling are cumbersome
Hadoop – Scalable Data Management Systems
– Scalable, Fault-Tolerant, Elastic, …
– “Magnetic”: easy to store data
– Limited deep analytics: mostly descriptive analytics
Filling the Gap: Existing Approaches
Reducing Data Size by Sampling
– Approximations might result in losing competitive advantage
– Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
Scaling out R
– Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi]
– Main-memory based in most cases
– Re-implementing DBMS and distributed processing functionality
Deep Analysis within a DBMS
– Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
– Not sustainable: misses out on R’s community development and rich libraries
Ricardo: Bridging the Gap
David Ricardo, famous economist from the 19th century
– “Comparative Advantage”
Deep analytics is decomposable into a “large part” and a “small part” [Chu et al., NIPS ’06]
– Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
– Recommender Systems/Latent Factorization [our paper]
– A key requirement for Ricardo is that the amount of data that must be communicated between both systems be sufficiently small
Large part includes joins, group bys, distributive aggregations
– Hadoop + Jaql: excellent scalability to large-scale data management
Small part includes matrix/vector operations
– R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
Ricardo: Establishes “trade” between R and Hadoop/Jaql
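To make the large-part/small-part split concrete, here is a minimal Python sketch for simple linear regression (one of the algorithms listed above). It is an illustration only, not Ricardo's code: the partition layout and the function names `partial_sums`, `combine`, and `solve` are assumptions. The large part is a single pass that produces five additive sums per partition; the small part solves a 2x2 system on those sums.

```python
# Illustrative sketch: summation-form linear regression (y = slope*x + intercept).
# Partitions stand in for data blocks processed in parallel; names are hypothetical.

def partial_sums(partition):
    """Large part: one pass over a partition, returning additive statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in partition:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """The statistics are distributive, so partitions merge by addition."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Small part: solve the 2x2 normal equations on the merged sums."""
    n, sx, sy, sxx, sxy = stats
    det = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / det
    intercept = (sy * sxx - sx * sxy) / det
    return slope, intercept

partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
total = (0.0, 0.0, 0.0, 0.0, 0.0)
for part in partitions:
    total = combine(total, partial_sums(part))
slope, intercept = solve(total)
```

Only five numbers per partition cross the boundary between the two parts, which is the "sufficiently small" communication the slide asks for.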
Ricardo: Bridging the Gap
Trade
– R sends aggregation-processing queries (written in Jaql) to Hadoop
– Hadoop sends aggregated data to R for advanced statistical processing
R in a Nutshell
[Figure: Rating (3.5 to 3.9) vs. Year of Release (1950 to 2010)]
R supports Rich statistical functionality
Jaql in a Nutshell
JSON View of the data:
Jaql Example:
Scalable Descriptive Analysis using Hadoop
Jaql: a representative declarative interface
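The slide's own JSON and Jaql examples are not in this transcript. As a stand-in, the Python sketch below shows the kind of descriptive analysis meant here: records viewed as JSON objects, then a group-by aggregation. The field names (`customer`, `movie`, `rating`) and the data are invented for illustration, and this is plain Python, not Jaql syntax.

```python
import json
from collections import defaultdict

# Hypothetical JSON view of ratings data; field names are assumptions.
raw = """[
  {"customer": 1, "movie": 10, "rating": 4},
  {"customer": 2, "movie": 10, "rating": 2},
  {"customer": 1, "movie": 20, "rating": 5}
]"""
records = json.loads(raw)

# Descriptive analysis: average rating per movie (a group-by aggregation).
sums = defaultdict(lambda: [0.0, 0])   # movie -> [rating total, rating count]
for rec in records:
    acc = sums[rec["movie"]]
    acc[0] += rec["rating"]
    acc[1] += 1
avg_rating = {movie: total / count for movie, (total, count) in sums.items()}
```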
Ricardo: The Trading Architecture
Complexity of Trade between R and Hadoop
– Simple Trading: Data Exploration
– Complex Trading: Data Modeling
Simple Trading: Exploratory Analytics
Gain insights about data
Example: top-k outliers for a model
– Identify data items on which the model performed most poorly
– Helpful for improving accuracy of model
The trade:
– Use complex statistical models using rich R functionality
– Parallelize processing over entire data using Hadoop/Jaql
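The top-k outlier example can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: `model` is a stand-in for a model fitted in R, and the two lists stand in for data partitions that Hadoop would process in parallel. Each partition keeps only its k worst-fit items, so the merge step handles a small candidate set.

```python
import heapq

def model(x):
    """Stand-in for a fitted model; the linear form is an assumption."""
    return 2.0 * x + 1.0

def local_top_k(partition, k):
    """Per-partition work: score every item, keep the k largest errors."""
    scored = ((abs(model(x) - y), (x, y)) for x, y in partition)
    return heapq.nlargest(k, scored)

def global_top_k(partitions, k):
    """Merge the small per-partition candidate lists into the final top-k."""
    candidates = []
    for part in partitions:
        candidates.extend(local_top_k(part, k))
    return heapq.nlargest(k, candidates)

partitions = [
    [(0.0, 1.0), (1.0, 9.0), (2.0, 5.5)],
    [(3.0, 7.0), (4.0, 0.0), (5.0, 11.2)],
]
outliers = global_top_k(partitions, 2)   # list of (error, (x, y)) pairs
```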
Complex Trading: Latent Factors
SVD-like matrix factorization
Minimize Square Error: $\sum_{i,j} (p_i q_j - r_{ij})^2$
[Figure: diagram labeled with the factors p and q]
The trade:
– Use complex statistical models in R
– Parallelize aggregate computations using Hadoop/Jaql
Complex Trading: Latent Factors
However, in the real world...
A vector of factors for each customer and item!
Latent Factor Models with Ricardo
Goal
– Minimize Square Error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
– Numerical methods needed (large, sparse matrix)
Pseudocode
1. Start with initial guess of parameters $p_i$ and $q_j$.
2. Compute error & gradient
   – E.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
3. Update parameters.
   – R implements many different optimization algorithms
4. Repeat steps 2 and 3 until convergence.
Data intensive, but parallelizable!
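The four pseudocode steps can be traced in a small Python sketch. Note the substitution: step 3 here is plain fixed-step gradient descent, chosen only to keep the example short, whereas the slides rely on R's optimizers. The ratings, the step size, and the iteration cap are made-up values.

```python
# Illustrative sketch: scalar latent factors fitted by fixed-step gradient descent.
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0, (1, 1): 6.0}  # (i, j) -> r_ij

def error_and_gradient(p, q):
    """Step 2: one pass over the ratings yields e, de/dp and de/dq."""
    e = 0.0
    dp = [0.0] * len(p)
    dq = [0.0] * len(q)
    for (i, j), r in ratings.items():
        diff = p[i] * q[j] - r
        e += diff * diff
        dp[i] += 2.0 * q[j] * diff   # contribution to de/dp_i
        dq[j] += 2.0 * p[i] * diff   # contribution to de/dq_j
    return e, dp, dq

p, q = [1.0, 1.0], [1.0, 1.0]        # step 1: initial guess
initial_error, _, _ = error_and_gradient(p, q)
step = 0.01                          # assumed learning rate
for _ in range(5000):                # step 4: repeat until convergence
    e, dp, dq = error_and_gradient(p, q)
    if e < 1e-12:
        break
    p = [pi - step * g for pi, g in zip(p, dp)]   # step 3: update parameters
    q = [qj - step * g for qj, g in zip(q, dq)]
final_error, _, _ = error_and_gradient(p, q)
```

The loop body of `error_and_gradient` is a sum over independent rating tuples, which is why step 2 is the part that parallelizes.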
The R Component
• Parameters
  e: squared error
  de: gradients
  pq: concatenation of the latent factors for users and items
• R code
  optim( c(p,q), fe, fde, method="L-BFGS-B" )
• Goal
  Keep updating pq until it reaches convergence
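The optimizer interface above works on one flat parameter vector. The Python sketch below mirrors that shape: `pq` plays the role of `c(p,q)`, while `fe` and `fde` return the error and the gradient for a given `pq`. It is a local, in-memory illustration; the toy ratings, the sizes, and the `unpack` helper are assumptions, and no real optimizer is called.

```python
# Illustrative sketch of the objective/gradient pair handed to an optimizer.
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0}   # (user i, item j) -> r_ij
N_USERS, N_ITEMS = 2, 2                             # assumed sizes

def unpack(pq):
    """Split the flat vector back into user factors p and item factors q."""
    return pq[:N_USERS], pq[N_USERS:]

def fe(pq):
    """Squared error e for the parameter vector pq."""
    p, q = unpack(pq)
    return sum((p[i] * q[j] - r) ** 2 for (i, j), r in ratings.items())

def fde(pq):
    """Gradient de, laid out in the same order as pq."""
    p, q = unpack(pq)
    grad = [0.0] * len(pq)
    for (i, j), r in ratings.items():
        diff = p[i] * q[j] - r
        grad[i] += 2.0 * q[j] * diff             # de/dp_i
        grad[N_USERS + j] += 2.0 * p[i] * diff   # de/dq_j
    return grad

pq = [1.0, 2.0] + [0.5, 1.5]   # analogous to c(p, q) in R
e = fe(pq)
de = fde(pq)
```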
The Hadoop and Jaql Component
• Dataset
• Goal
The Hadoop and Jaql Component
• Calculate the squared errors
• Calculate the gradients
Computing the Model
– Movie Ratings: (i, j, $r_{ij}$)
– Customer Parameters: (i, $p_i$)
– Movie Parameters: (j, $q_j$)
3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate:
$e = \sum_{i,j} (p_i q_j - r_{ij})^2$
Similarly compute the gradients
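A minimal in-memory Python sketch of the join-then-aggregate step follows. Dictionary lookups stand in for the distributed join, and the table contents are made-up values; the names `cust_pars` and `movie_pars` are chosen for illustration.

```python
# Illustrative sketch: match each rating with its customer and movie parameter,
# then aggregate the squared error.
ratings = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 5.0)]   # rows of (i, j, r_ij)
cust_pars = {1: 2.0, 2: 2.5}                        # i -> p_i
movie_pars = {1: 2.0, 2: 1.5}                       # j -> q_j

def joined(ratings, cust_pars, movie_pars):
    """3-way join: yield (i, j, r_ij, p_i, q_j) for every rating."""
    for i, j, r in ratings:
        yield i, j, r, cust_pars[i], movie_pars[j]

squared_error = sum(
    (p * q - r) ** 2 for _, _, r, p, q in joined(ratings, cust_pars, movie_pars)
)
```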
Aggregation In Jaql/Hadoop
res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p*$.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g={ $.i, $.j } into { g.*, gradient: sum($[*].value) }
")

Result in R:

i     j     gradient
----  ----  --------
null  null  325235
1     null  21
2     null  357
…
null  1     9
null  2     64
…
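The expand and group-by stages can be emulated in plain Python to show how one pass yields both the error and all gradient components. This is a sketch with made-up data, not Jaql: `None` plays the role of `null` in the result table, and each per-rating contribution follows the gradient formula $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$ (and symmetrically for $q_j$).

```python
from collections import defaultdict

# Illustrative data; the table contents are assumptions.
ratings = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 5.0)]   # rows of (i, j, rating)
cust_pars = {1: 2.0, 2: 2.5}                        # i -> p
movie_pars = {1: 2.0, 2: 1.5}                       # j -> q

totals = defaultdict(float)                         # (i, j) group key -> sum
for i, j, rating in ratings:
    p, q = cust_pars[i], movie_pars[j]
    diff = rating - p * q
    # "expand": each joined record emits three values under different keys.
    totals[(None, None)] += diff ** 2               # squared-error row
    totals[(i, None)] += -2.0 * diff * q            # contribution to de/dp_i
    totals[(None, j)] += -2.0 * diff * p            # contribution to de/dq_j

# "group by": totals now holds one row per key, like the result table in R.
rows = [(i, j, value) for (i, j), value in totals.items()]
```

The output has one row for the error plus one row per customer and per movie, so its size depends on the number of parameters rather than the number of ratings.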
Integrating the Components
Remember: optim( c(p,q), fe, fde, method="L-BFGS-B" ) runs in the R process.
Experimental Evaluation
50 nodes at EC2
Each node: 8 cores, 7GB Memory, 320GB Disk
Total: 400 cores, 320GB Memory, 70TB Disk Space
Number of Rating Tuples    Data Size in GB
-----------------------    ---------------
500 Million                104.33
1 Billion                  208.68
3 Billion                  625.99
5 Billion                  1043.23
Leveraging Hadoop’s Scalability
[Figure: Time (in seconds) vs. Number of Ratings (in billions), comparing Hadoop (handtuned) and Jaql]
Leveraging R’s Rich Functionality
[Figure: Root Mean Square Error vs. Number of Iterations, comparing Conjugate Gradient and L-BFGS]
– optim( c(p,q), fe, fde, method="CG" )
– optim( c(p,q), fe, fde, method="L-BFGS-B" )
Conclusion
Scaled Latent Factor Models to Terabytes of data
Provided a bridge through which other algorithms with Summation Form can be mapped and scaled
– Many algorithms have Summation Form
– Decompose into “large part” and “small part”
– [Chu et al., NIPS ’06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM
Future & Current Work
– Tighter language integration
– More algorithms
– Performance tuning
Questions? Comments?