SDM Center: Parallel R (pR) For High Performance Statistical Computing. Nagiza F. Samatova (ORNL), Srikanth Yoginath (ORNL), Guruprasad Kora (ORNL), David Bauer (GT), Chongle Pan (UTK/ORNL). SDM AHM @ Salt Lake City, March 3-4, 2005. Contact: Nagiza Samatova, [email protected]

Page 1

Parallel R (pR) For High Performance Statistical Computing

• Nagiza F. Samatova (ORNL)

• Srikanth Yoginath (ORNL)

• Guruprasad Kora (ORNL)

• David Bauer (GT)

• Chongle Pan (UTK/ORNL)

SDM AHM @ Salt Lake City, March 3-4, 2005

Contact: Nagiza Samatova, [email protected]

Page 2

Outline

About Parallel R

• Motivation

• About R and its parallelization efforts

• Task and data parallelism with Parallel R (pR)

• Extensibility of Parallel R

• Performance Benchmarks

Parallel R across Different Applications

• GIS data analysis with GRASS and Parallel R

• Clustered Climate Regimes using Parallel R

• Fusion scenario challenges Parallel R

• Quantitative Proteomics in Biology using Parallel R

Summary and Future Work

Page 3

Tera-(Flop & Byte) Analyses Could Be Routine for Scientific Applications But…

Algorithmic Complexity:

• Calculate means: O(n)

• Calculate FFT: O(n log(n))

• Calculate PCA: O(r • c)

• Hierarchical clustering: O(n²)

Climate. Now: 20-40 TB per simulated year. 5 yrs: 100TB/yr 5-10PB/yr

Astrophysics. Now and 5 yrs: Can soak up anything!

Fusion. Now: 100 Mbytes/15 min. 5 yrs: 1000 Mbytes/2 min

1 Tflop/sec
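To make the O(n²) entry concrete, here is a rough back-of-envelope sketch of the memory a pairwise-distance matrix would need as the number of points grows. It assumes 8-byte doubles and lower-triangle storage (as R's `dist()` uses); the function name and the chosen point counts are illustrative only.

```r
# Memory needed for the pairwise distances that hierarchical clustering
# typically starts from. Assumptions: 8-byte doubles, lower triangle only.
dist_matrix_gb <- function(n) {
  pairs <- n * (n - 1) / 2   # number of distinct point pairs
  pairs * 8 / 1e9            # bytes -> gigabytes (decimal)
}

n <- c(1e3, 1e4, 1e5, 250000)
data.frame(points = n, dist_gb = dist_matrix_gb(n))
```

At 250,000 points the estimate is roughly 250 GB, which suggests why quadratic methods run out of room long before linear ones do.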

Page 4

Statistical Computing with R

About R (http://www.r-project.org/):

• R is an open-source (GPL) programming environment for statistical analysis and graphics, and the most widely used one; similar to S.

• Provides good support for both users and developers.

• Highly extensible via dynamically loadable add-on packages.

• Originally developed by Robert Gentleman and Ross Ihaka.

> …
> dyn.load("foo.so")
> .C("foobar")
> dyn.unload("foo.so")

> library(mva)
> pca <- prcomp(data)
> summary(pca)

Towards Enabling Parallel Computing in R:

• Rmpi (Hao Yu): R interface to LAM-MPI.

• rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.

• snow (Luke Tierney): general API on top of message passing routines to provide high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications.

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()

snow API
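As an illustration of what a high-level "parallel apply" looks like, here is a minimal sketch using R's bundled `parallel` package, which builds on snow's cluster interface. This is a present-day stand-in, not code from the talk; the bootstrap task, the worker count of 2, and the helper name `boot_mean` are assumptions.

```r
library(parallel)  # bundled with R; builds on the snow and multicore packages

# Hypothetical embarrassingly parallel task: bootstrap the mean of a sample.
set.seed(1)
x <- rnorm(100)
boot_mean <- function(seed, data) {
  set.seed(seed)                       # one reproducible stream per replicate
  mean(sample(data, replace = TRUE))
}

cl <- makeCluster(2)                   # two local worker processes (assumed)
boot <- parSapply(cl, 1:200, boot_mean, data = x)
stopCluster(cl)

sd(boot)                               # bootstrap estimate of the standard error
```

Because each replicate sets its own seed, replacing `parSapply(cl, ...)` with `sapply(...)` should give the same numbers, which makes the sequential run a convenient check.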

Page 5

Motivation behind Parallel R (pR)

Ideal Programming Requirements:

• Be able to use existing high-level (i.e., R) code

• Require minimal extra effort for parallelizing

• Have an identical or similar (presumably easy-to-use) interface to R's

• Be able to test codes in sequential settings

• Provide efficient and scalable (in terms of problem size and number of processors) performance

Page 6

Providing Task and Data Parallelism in pR

R:

…
fileList <- list.files(pattern="*.nc");
for (i in 1:length(fileList)) {
  matrix[i] <- readNcFile(fileList[i]);
  pca[i] <- prcomp(matrix[i])
}
…

pR:

…
fileList <- list.files(pattern="*.nc");
PE(
  for (i in 1:length(fileList)) {
    matrix[i] <- readNcFile(fileList[i]);
    pca[i] <- sla.prcomp(matrix[i])
  }
)
…

Task-parallel analyses:

• Likelihood Maximization.

• Re-sampling schemes: Bootstrap, Jackknife, etc.

• Animations

• Markov Chain Monte Carlo (MCMC).

  – Multiple chains.

  – Simulated Tempering: running parallel chains at different "temperature" to improve mixing.

Data-parallel analyses:

• k-means clustering

• Principal Component Analysis (PCA)

• Hierarchical (model-based) clustering

• Distance matrix, histogram, etc. computations

[Figure: Task & Data Parallelism with pR. Data parallelism: RScaLAPACK. Task parallelism: task-pR.]
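To illustrate the data-parallel idea for PCA, the sketch below splits the rows of one matrix into blocks, lets each worker reduce its block to small summary statistics, and combines them into a covariance matrix. This is a simplified stand-in and not how RScaLAPACK works internally (ScaLAPACK distributes the linear algebra itself); the data, block split, and the helper name `partial` are assumptions, and it uses R's bundled `parallel` package rather than pR.

```r
library(parallel)

set.seed(42)
X <- matrix(rnorm(1000 * 4), nrow = 1000)      # hypothetical data matrix
blocks <- list(X[1:500, ], X[501:1000, ])      # one row block per worker

# Each worker reduces its block to sufficient statistics for the covariance.
partial <- function(b) list(n = nrow(b), s = colSums(b), ss = crossprod(b))

cl <- makeCluster(2)
parts <- parLapply(cl, blocks, partial)
stopCluster(cl)

# Combine the partial results on the master.
n  <- sum(sapply(parts, `[[`, "n"))
s  <- Reduce(`+`, lapply(parts, `[[`, "s"))
ss <- Reduce(`+`, lapply(parts, `[[`, "ss"))
covX <- (ss - tcrossprod(s) / n) / (n - 1)     # sample covariance matrix
sdev <- sqrt(eigen(covX, symmetric = TRUE)$values)

all.equal(sdev, prcomp(X)$sdev)                # agrees with serial prcomp()
```

Only the small per-block summaries travel back to the master, so the approach scales with the number of rows as long as the number of variables stays modest.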

Page 7

Extensibility of Parallel R (pR)

[Figure: architecture diagram. Labels: R Environment; Parallel Agent; Third Party Parallel Codes (ScaLAPACK, Matrix, Parallel k-means, Alok's Data Mining); RScaLAPACK, pMatrix, pAlok; C/Fortran MPI; R object.]

• Define R function parameters & returns

• Map R functions to defined function interfaces

• Define the function interfaces

• Set parallel environment limits for your functions

• Define data distribution function (Optional)

• Convert your MPI/PVM routine(s) into a set of functions.

• Create a shared library of your functions.

• Place it in a predefined location.

Page 8

Scalability of Parallel R (pR)

Speedup for Parallel R’s sla.solve() over serial R’s solve().

Architecture: SGI Altix at CCS of ORNL with 256 Intel Itanium2 processors at 1.5 GHz; 8 GB of memory per processor (2 TB system memory); 64-bit Linux OS; 1.5 TFLOP/s theoretical total peak performance.

[Figure: speedup plot; legend: matrix size]

R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)

A and B are the input matrices; NPROWS and NPCOLS are process grid specs; MB is block size.
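The process-grid and block-size arguments describe how a matrix is spread over processes. As a rough illustration, the sketch below computes which process would own each element under the 2D block-cyclic layout that ScaLAPACK uses, assuming square MB x MB blocks and distribution starting at process (0, 0). The function name `block_cyclic_owner` and the row-major rank used for display are assumptions for this example, not part of pR.

```r
# Owner of element (i, j) in an NPROWS x NPCOLS process grid under a
# 2D block-cyclic layout with MB x MB blocks.
# i, j are 1-based; process coordinates are 0-based.
block_cyclic_owner <- function(i, j, NPROWS, NPCOLS, MB) {
  c(prow = ((i - 1) %/% MB) %% NPROWS,
    pcol = ((j - 1) %/% MB) %% NPCOLS)
}

# Owner map for an 8 x 8 matrix on a 2 x 2 grid with 2 x 2 blocks.
owner <- outer(1:8, 1:8, Vectorize(function(i, j) {
  o <- block_cyclic_owner(i, j, NPROWS = 2, NPCOLS = 2, MB = 2)
  o[["prow"]] * 2 + o[["pcol"]]        # row-major rank 0..3, for display only
}))
owner
```

Cycling the blocks over the grid gives every process pieces from all parts of the matrix, which helps keep the load balanced as a factorization works its way through the rows and columns.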

Page 9

Overhead due to R & Parallel Agent in pR

[Figure: overhead plot; legend: matrix size]

Page 10

Parallel R (pR) Distribution

http://www.ASPECT-SDM.org/Parallel-R

Release History:

• pR enables both data and task parallelism (includes task-pR and RScaLAPACK) (2004/Q4)

• RScaLAPACK provides R interface to ScaLAPACK with its scalability in terms of problem size and number of processors using data parallelism (2004/Q2)

• task-pR achieves parallelism by performing out-of-order execution of tasks. Its intelligent scheduling mechanism yields a significant reduction in execution time (2004/Q3)

• pMatrix provides a parallel platform for major matrix operations using ScaLAPACK and PBLAS Level II & III routines (2005/Q2)

Also: Available for download from R’s CRAN web site (www.R-Project.org) with 37 mirror sites in 20 countries

Page 11

Geo-statistical and Spatial Data Analysis with GRASS and Parallel R

With: George Fann, John Drake, and Bhaduri Budhendra

About GRASS (http://grass.itc.it/):

• GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.

• GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and sites data; process multispectral image data; and create, manage, and store spatial data.

• It is Free (Libre) Software/Open Source released under GNU GPL.

Parallel R (pR) extension for GRASS:

• Leverages the work by Markus Neteler (http://grass.itc.it/statsgrass/grass_geostats.html).

• Offers a richer set of statistical analysis capabilities, including basic statistics, exploratory data analysis, linear models, multivariate analysis, time series analysis, etc.

• Provides high performance and parallel computational platform for large datasets

[Figure: GRASS and pR]

$> grass5 <dataset>
$> pR

> library(GRASS)
> G <- gmeta()
> …

Page 12

Grass/Parallel-R Examples

Kernel Density Estimation:

$> grass5
$> pR
…
> library(MASS)
> data(volcano)
> plot(density(volcano, bw=2))
> lines(density(volcano, bw=4), col="green")
> lines(density(volcano, bw=8), col="red")
> lines(density(volcano, bw=12), col="cyan")

Trend Surface Fitting:

$> grass5
$> pR
…
> topo.meter.ls6 <- surf.ls(6, topo.meter)
> topo.meter.surface6 <- trmat(topo.meter.ls6, 0, 100, 0, 100, 50)
> image(topo.meter.surface6)
> contour(topo.meter.surface6, labcex = 0.8, add=T)
> points(topo.meter$x, topo.meter$y)

[Figure panels: Kernel Density Estimation; Trend Surface Fitting; Principal Component Analysis]

Page 13

Clustered Climate Regimes Analysis

With: W. Hargrove, F. Hoffman, and D. Erickson

Input: 5-yr BAU PCM 2000-2098 runs; 2.8°×2.8°; 18 levels (B05.12.nc, B06.12.nc, B09.12.nc)

1. Read nc files into a matrix of spatio-temporal points (Pts) by variables (V): 16.6M x 3

• 2,796 out of 8,192 total land grid cells

• V: Temperature, Precipitation, Soil Moisture

• Pts: (latitude, longitude, level, time)

2. Normalize (µ=0 & σ=1): geographic space to variable space

3. Cluster (k-means): k=32, time

4. Re-assemble; statistical analyses: variable space back to geographic space

[Figure: statistics plot, number of points per cluster number]
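The normalize-then-cluster steps can be sketched in plain serial R as follows. The data here are synthetic stand-ins (3,000 random points instead of 16.6 million, with made-up distributions); only the three variable names and k = 32 come from the slide, and base R's `kmeans()` is used in place of pR's parallel k-means.

```r
set.seed(7)
# Hypothetical stand-in for the points-by-variables matrix:
# 3,000 points x 3 variables on very different scales.
pts <- cbind(temperature   = rnorm(3000, mean = 285, sd = 15),
             precipitation = rexp(3000, rate = 1 / 3),
             soil_moisture = runif(3000, 0, 0.5))

z  <- scale(pts)                        # each column to mean 0, sd 1
km <- kmeans(z, centers = 32, iter.max = 50)

table(km$cluster)                       # number of points per cluster
```

Normalizing first keeps the variable with the largest numeric range from dominating the Euclidean distances that k-means relies on.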

Page 14

Scalability of pk-means() in pR

Speedup for pkmeans() in pR

Number of Processors:   2      4      8      16

Speedup:                1.98   3.24   5.09   6.14

16.6 million points; ~20 iterations
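One way to read these speedup figures is to convert them to parallel efficiency and to an estimated serial fraction. The sketch below does that arithmetic; the Karp-Flatt metric is a standard diagnostic brought in here for illustration and is not something the slide itself reports.

```r
procs   <- c(2, 4, 8, 16)
speedup <- c(1.98, 3.24, 5.09, 6.14)   # speedup values reported on this slide

efficiency <- speedup / procs          # fraction of ideal linear speedup

# Karp-Flatt metric: experimentally determined serial fraction.
serial_fraction <- (1 / speedup - 1 / procs) / (1 - 1 / procs)

round(data.frame(procs, speedup, efficiency, serial_fraction), 3)
```

Efficiency falls from 99% on 2 processors to about 38% on 16, so communication or serial work appears to grow in relative weight as processors are added.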

Page 15

Fusion Scenario Challenges Parallel R

With: George Ostrouchov and Don Batchelor

A toroidal slice of the electrostatic field of a tokamak fusion simulation (polar coord. as Cartesian)

• Mahalanobis Distance: easy

• Expectation Maximization (EM): easy

• Hierarchical Model-based Clustering (mclust): hard

250,000 points; 10% sampling for ~1 hr analysis
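A plausible reading of why Mahalanobis distance is in the "easy" group is that, once the mean and covariance are known, each point's distance is computed independently of the others. The sketch below shows the computation with base R's `mahalanobis()` and checks it against the formula written out; the 2-D data are synthetic and purely illustrative.

```r
set.seed(3)
# Hypothetical 2-D samples with correlated coordinates.
x <- matrix(rnorm(2000), ncol = 2) %*% matrix(c(1, 0.8, 0, 0.6), 2)

# Squared Mahalanobis distance of every row from the sample mean.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Same quantity written out per row: (x - mu)' S^{-1} (x - mu)
xc <- sweep(x, 2, colMeans(x))
d2_manual <- rowSums((xc %*% solve(cov(x))) * xc)

all.equal(d2, d2_manual)
```

Since each row is handled on its own, the rows could be split across processes with no communication beyond sharing the mean and covariance.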

Page 16

Quantitative Proteomics in Biology

With: Bob Hettich, Hays McDonald, and Greg Hurst

Experimental Step:

• Sample of ~2,000 labeled proteins (N15) in different ratios

• Liquid Chromatography-Mass Spectrometry (LC-MS): 24 hours of measurements

• ~3 GB raw data + ~50,000 MS, MS/MS files, ~1 KByte each

Sequence Id Step:

• DBDigger + SEQUEST: ~15-18 hours

Quantification Step:

• RelEx

• ~50,000 chromatogram files; ~1 KB each

• Ratio calculations

Page 17

Ratio Calculations for ~50,000 files

[CHROMATOGRAMS]
SCAN   TIME      SAMPLE    REFERENCE
1537   32.8275   4727570   4509290
1541   32.8978   1120668   4377465
1545   32.9718   4298401   4713328
1549   33.0477   2975233   9286918
…

1. Read chromatogram file

2. Select Peak Window

• Subtract background noise from data

• Generate Covariance Chromatogram (red)

• Apply Savitzky-Golay Smoother (blue)

• Calculate cut-off for search (cyan)

• Find Window with Max. SN ratio (green)

3. Calculate Ratio = Slope(Eigenvector)

Page 18

Ratio Estimation over ~50,000 files

[Figure: log(Signal/Noise) = log(λ1/λ2) plotted against log(Ratio) = log(Slope(Eigenvector1)); relative frequency of log(Ratio)]
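The two relations on this slide can be sketched with a small PCA of the (reference, sample) intensities inside a peak window. Everything below is a hypothetical reconstruction: the synthetic peak, the noise level, the true ratio of 2, and the choice of reference on the horizontal axis are assumptions, and the background-subtraction and smoothing steps from the previous slide are skipped.

```r
set.seed(11)
# Hypothetical chromatogram peak: sample = 2 * reference, plus noise.
peak      <- dnorm(seq(-3, 3, length.out = 60))
reference <- 1e6 * peak + rnorm(60, sd = 2e3)
samp      <- 2e6 * peak + rnorm(60, sd = 2e3)

# PCA of the (reference, sample) point cloud within the peak window.
e  <- eigen(cov(cbind(reference, samp)), symmetric = TRUE)
v1 <- e$vectors[, 1]                    # direction of largest variance

ratio <- v1[2] / v1[1]                  # slope of the first eigenvector
snr   <- e$values[1] / e$values[2]      # lambda1 / lambda2

c(log_ratio = log(ratio), log_snr = log(snr))
```

Taking the slope from the first eigenvector treats noise in both channels symmetrically, unlike an ordinary regression of sample on reference, and the eigenvalue ratio gives a natural measure of how cleanly the points fall on one line.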

Page 19

Ratio Calculations with Parallel R

Serial Version:

…
chroList <- list.files(pattern="*.chro");
cat("Chro", "samSN", "refSN", "PPCSN", "HR", "PCA", "PCASN", file="Pratio-Peptide.txt");

for (i in 1:length(chroList)) {
  currResult[i] <- Rratio(filename=chroList[i]);
}

for (i in 1:length(chroList)) {
  cat(chroList[i], currResult$samSN, currResult$refSN, currResult$PPCSN,
      currResult$HR, currResult$PCA, currResult$PCASN, file="Pratio-Peptide.txt");
}
…

Parallel Version:

…
chroList <- list.files(pattern="*.chro");
cat("Chro", "samSN", "refSN", "PPCSN", "HR", "PCA", "PCASN", file="Pratio-Peptide.txt");

PE(
  for (i in 1:length(chroList)) {
    currResult[i] <- Pratio(filename=chroList[i]);
  }
)

for (i in 1:length(chroList)) {
  cat(chroList[i], currResult$samSN, currResult$refSN, currResult$PPCSN,
      currResult$HR, currResult$PCA, currResult$PCASN, file="Pratio-Peptide.txt");
}
…

Page 20

Performance Results for Ratio Calculation

Page 21

Summary and Future Work

• Parallel R (pR) is an open-source, high-performance library for statistical computing in R

• It has been deployed in a number of applications, including climate, GIS, fusion, and biology

• Future improvements in a few major directions:

  – Demonstrate more application scenarios

  – Add more libraries like RScaLAPACK and pMatrix (e.g., pAlok, pclust, pnetCDF)

  – Improve the performance (reduce overhead, memory management) of Parallel Agent

  – Enhance features of Parallel Agent: support outside of the Master-Slave model; better memory management strategies (one-sided put(), get(), release(), etc.); support of parallel I/O over netCDF and HDF files