SDMCenter
Parallel R (pR) For High Performance Statistical Computing
• Nagiza F. Samatova (ORNL)
• Srikanth Yoginath (ORNL)
• Guruprasad Kora (ORNL)
• David Bauer (GT)
• Chongle Pan (UTK/ORNL)
SDM AHM @ Salt Lake City, March 3-4, 2005
Contact: Nagiza Samatova, [email protected]
Outline
About Parallel R
• Motivation
• About R and its parallelization efforts
• Task and data parallelism with Parallel R (pR)
• Extensibility of Parallel R
• Performance Benchmarks
Parallel R across Different Applications
• GIS data analysis with GRASS and Parallel R
• Clustered Climate Regimes using Parallel R
• Fusion scenario challenges Parallel R
• Quantitative Proteomics in Biology using Parallel R
Summary and Future Work
Tera-(Flop & Byte) Analyses Could Be Routine for Scientific Applications But…
Algorithmic Complexity:
Calculate means O(n)
Calculate FFT O(n log(n))
Calculate PCA O(r • c)
Hierarchical clust. O(n²)
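To make the quadratic entry concrete, here is a minimal sketch of the storage a pairwise distance matrix needs. The problem size of one million points and the 8 bytes per distance are assumptions chosen for illustration, not figures from the slides.

```r
# Hypothetical problem size; not a figure from the slides.
n <- 1e6

# Hierarchical clustering usually starts from all pairwise distances,
# i.e. the n * (n - 1) / 2 lower-triangle entries that dist() stores.
n_pairs <- n * (n - 1) / 2

# Memory at 8 bytes per double-precision distance, in terabytes (10^12 bytes).
tb_needed <- n_pairs * 8 / 1e12

# Sanity check of the pair count on a tiny case: 10 points give 45 distances.
small <- length(dist(matrix(seq_len(20), nrow = 10)))

cat(sprintf("pairs: %.0f, memory: %.1f TB\n", n_pairs, tb_needed))
```

Under these assumptions the distance matrix alone is in the terabyte range, which is why the quadratic analyses are the ones that most need parallel memory and compute.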
Climate. Now: 20-40TB per simulated year; 5 yrs: 100TB/yr, 5-10PB/yr
Astrophysics. Now and 5 yrs: Can soak up anything!
Fusion. Now: 100Mbytes/15min; 5 yrs: 1000Mbytes/2 min
1Tflop/sec
Statistical Computing with R
About R (http://www.r-project.org/):
• R is an Open Source (GPL) programming environment for statistical analysis and graphics, among the most widely used; it is similar to S.
• Provides good support for both users and developers.
• Highly extensible via dynamically loadable add-on packages.
• Originally developed by Robert Gentleman and Ross Ihaka.
Towards Enabling Parallel Computing in R:
> …
> dyn.load("foo.so")
> .C("foobar")
> dyn.unload("foo.so")
> library(mva)
> pca <- prcomp(data)
> summary(pca)
• Rmpi (Hao Yu): R interface to LAM-MPI.
• rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.
• snow (Luke Tierney): general API on top of message passing routines that provides high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications.
> library (rpvm)
> .PVM.start.pvmd ()
> .PVM.addhosts (...)
> .PVM.config ()
snow API
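As an illustration of the "parallel apply" style that snow provides, here is a minimal sketch using R's bundled `parallel` package, which includes a snow-derived cluster API. The two-worker cluster, the sample values, and the bootstrap task are assumptions for this example; the slides do not show snow code.

```r
library(parallel)  # ships with R; includes a snow-derived cluster API

# Start two local worker processes (a hypothetical size for this sketch).
cl <- makeCluster(2)

# Embarrassingly parallel task: bootstrap the mean of a fixed sample.
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9)
clusterSetRNGStream(cl, iseed = 42)  # reproducible random streams on the workers
boot_means <- parSapply(
  cl, 1:200,
  function(i, obs) mean(sample(obs, replace = TRUE)),
  obs = x
)

stopCluster(cl)

cat("bootstrap SE of the mean:", sd(boot_means), "\n")
```

Each replicate is independent of the others, so no communication between workers is needed beyond distributing the inputs and collecting the results.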
Motivation behind Parallel R (pR)
Ideal Programming Requirements:
• Be able to use existing high-level (i.e., R) code
• Require minimal extra effort for parallelizing
• Have an identical or similar (presumably easy-to-use) interface to R's
• Be able to test codes in sequential settings
• Provide efficient and scalable (in terms of problem size and number of processors) performance
Providing Task and Data Parallelism in pR
R:
:::::::
fileList <- list.files(pattern="*.nc");
for (i in 1:length(fileList)) {
  matrix[[i]] <- readNcFile (fileList[i]);
  pca[[i]] <- prcomp (matrix[[i]])
}
:::::::::::::
pR:
:::::::
fileList <- list.files(pattern="*.nc");
PE (
  for (i in 1:length(fileList)) {
    matrix[[i]] <- readNcFile (fileList[i]);
    pca[[i]] <- sla.prcomp (matrix[[i]])
  }
)
:::::::::::::
Task-parallel analyses:
• Likelihood Maximization.
• Re-sampling schemes: Bootstrap, Jackknife, etc.
• Animations
• Markov Chain Monte Carlo (MCMC).
  – Multiple chains.
  – Simulated Tempering: running parallel chains at different "temperatures" to improve mixing.
Data-parallel analyses:
• k-means clustering
• Principal Component Analysis (PCA)
• Hierarchical (model-based) clustering
• Distance matrix, histogram, etc. computations
[Figure: Task & Data Parallelism with pR. Data parallelism: RScaLAPACK. Task parallelism: task-pR.]
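The difference between the two kinds of parallelism can be sketched with R's bundled `parallel` package. This is not pR's `PE()` or RScaLAPACK; the random matrices stand in for data read from files, and the two-worker cluster is an assumption for the example.

```r
library(parallel)

set.seed(1)
# Stand-ins for the matrices read from the *.nc files (hypothetical data).
mats <- replicate(4, matrix(rnorm(200 * 3), ncol = 3), simplify = FALSE)

cl <- makeCluster(2)

# Task parallelism: each worker runs a complete, independent PCA.
pcas <- parLapply(cl, mats, prcomp)

# Data parallelism: one computation (column sums of a single matrix)
# split by row blocks, with the partial results combined at the end.
big <- do.call(rbind, mats)
blocks <- splitIndices(nrow(big), 2)
partial <- parLapply(
  cl, blocks,
  function(idx, m) colSums(m[idx, , drop = FALSE]),
  m = big
)
total <- Reduce(`+`, partial)

stopCluster(cl)
```

In the task-parallel call the unit of work is a whole analysis; in the data-parallel call the unit of work is a slice of one dataset, and a combining step is needed to recover the full answer.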
Extensibility of Parallel R (pR)
[Figure: R Environment, Parallel Agent, and Third Party Parallel Codes (C/Fortran MPI); diagram label: R object]
Third Party Parallel Codes: ScaLAPACK, Matrix, Parallel k-means, Alok's Data Mining
pR libraries: RScaLAPACK, pMatrix, pAlok
• Define R function parameters & returns
• Map R functions to defined function interfaces
• Define the function interfaces
• Set parallel environment limits for your functions
• Define data distribution function (Optional)
• Convert your MPI/PVM routine(s) into a set of functions.
• Create a shared library of your functions.
• Place it in a predefined location.
Scalability of Parallel R (pR)
Speedup for Parallel R’s sla.solve() over serial R’s solve().
Architecture: SGI Altix at CCS of ORNL with 256 Intel Itanium2 processors at 1.5 GHz; 8 GB of memory per processor (2 TB system memory); 64-bit Linux OS; 1.5 TeraFLOPs/s theoretical total peak performance.
Matrix size:
R> solve (A, B)
pR> sla.solve (A, B, NPROWS, NPCOLS, MB)
A and B are the input matrices; NPROWS and NPCOLS are process grid specs; MB is block size.
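The roles of the process grid and block size can be sketched in plain R. The first part uses base R's `solve()`. The `owner()` helper is hypothetical: it assumes `sla.solve()` follows ScaLAPACK's standard 2-D block-cyclic distribution with square MB x MB blocks, which the slide does not state explicitly.

```r
set.seed(7)
A <- matrix(rnorm(16), 4, 4) + 4 * diag(4)  # shifted diagonal keeps A comfortably nonsingular
B <- matrix(rnorm(8), 4, 2)

X <- solve(A, B)  # serial R: solves A %*% X = B

# Hypothetical helper: owner of global entry (i, j) in a 2-D block-cyclic
# layout, returned as 0-based process-grid coordinates.
owner <- function(i, j, nprows, npcols, mb) {
  c(prow = ((i - 1) %/% mb) %% nprows,
    pcol = ((j - 1) %/% mb) %% npcols)
}

# Row 5 lies in the third row block, which wraps around to process row 0.
owner(5, 2, nprows = 2, npcols = 2, mb = 2)
```

Small blocks spread the work more evenly across the grid, while large blocks reduce communication; the best MB is typically found by experiment on the target machine.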
Overhead due to R & Parallel Agent in pR
Matrix size:
Parallel R (pR) Distribution
http://www.ASPECT-SDM.org/Parallel-R
Release History:
• pR enables both data and task parallelism (includes task-pR and RScaLAPACK) (2004/Q4)
• RScaLAPACK provides R interface to ScaLAPACK with its scalability in terms of problem size and number of processors using data parallelism (2004/Q2)
• task-pR achieves parallelism by performing out-of-order execution of tasks. Its scheduling mechanism yields significant reductions in execution time (2004/Q3)
• pMatrix provides a parallel platform to perform major matrix operations in parallel using ScaLAPACK and PBLAS Level II & III routines (2005/Q2)
Also: Available for download from R’s CRAN web site (www.R-Project.org) with 37 mirror sites in 20 countries
Geo-statistical and Spatial Data Analysis with GRASS and Parallel R
With: George Fann, John Drake, and Budhendra Bhaduri
About GRASS (http://grass.itc.it/):
• GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.
• GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and sites data; process multispectral image data; create, manage, and store spatial data.
• It is Free (Libre) Software/Open Source released under GNU GPL.
Parallel R (pR) extension for GRASS:
• Leverages the work by Markus Neteler (http://grass.itc.it/statsgrass/grass_geostats.html).
• Offers a richer set of statistical analysis capabilities, including Basic Statistics, Exploratory Data Analysis, Linear Models, Multivariate Analysis, Time Series Analysis, etc.
• Provides a high-performance parallel computational platform for large datasets
$> grass5 <dataset>
$> pR
> library (GRASS)
> G <- gmeta()
> …
Grass/Parallel-R Examples
Kernel Density Estimation
$> grass5
$> pR
….
> library (MASS)
> data (volcano)
> plot (density (volcano, bw=2))
> lines (density (volcano, bw=4), col="green")
> lines (density (volcano, bw=8), col="red")
> lines (density (volcano, bw=12), col="cyan")
Trend Surface Fitting
$> grass5
$> pR
….
> topo.meter.ls6 <- surf.ls (6, topo.meter)
> topo.meter.surface6 <- trmat (topo.meter.ls6, 0, 100, 0, 100, 50)
> image (topo.meter.surface6)
> contour (topo.meter.surface6, labcex = 0.8, add=T)
> points (topo.meter$x, topo.meter$y)
[Figure panels: Kernel Density Estimation, Trend Surface Fitting, Principal Component Analysis]
Clustered Climate Regimes Analysis
With: W. Hargrove, F. Hoffman, and D. Erickson
Data: 5-yr BAU PCM 2000-2098 runs; 2.8°×2.8°; 18 levels (B05.12.nc, B06.12.nc, B09.12.nc)
• 2,796 out of 8,192 total land grid cells
• V: Temperature, Precipitation, Soil Moisture
• Pts: (latitude, longitude, level, time)
1. Read nc files into a matrix of Spatio-Temporal Pts by Variables (V): 16.6M x 3
2. Normalize to µ=0 & σ=1 (Geographic Space to Variable Space)
3. Cluster with k-means in Variable Space (k=32, time)
4. Re-assemble in Geographic Space; Stat. Analyses
[Figure: Statistics. No. of Pts versus Cluster Number]
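The normalize-then-cluster steps can be sketched in serial R with `scale()` and `kmeans()`. The data below are synthetic stand-ins (600 points rather than 16.6 million), the distributions are arbitrary, and k = 4 is used only to keep the toy example small; this is not the pR `pkmeans()` implementation.

```r
set.seed(2005)
# Hypothetical stand-in for the 16.6M x 3 matrix: 600 points, 3 variables.
pts <- cbind(temperature   = rnorm(600, mean = 15, sd = 8),
             precipitation = rexp(600, rate = 1 / 3),
             soil_moisture = runif(600))

# Normalize every variable to mean 0 and standard deviation 1.
z <- scale(pts)

# Cluster in variable space; k = 4 here only for the toy data
# (the slide uses k = 32).
fit <- kmeans(z, centers = 4, iter.max = 50, nstart = 5)

# Points per cluster: the quantity plotted as "No. of Pts" vs "Cluster Number".
sizes <- table(fit$cluster)
print(sizes)
```

Normalizing first matters because k-means uses Euclidean distance: without it, the variable with the largest numeric range would dominate the cluster assignment.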
Scalability of pk-means() in pR
Speedup for pkmeans() in pR
Number of Processors   2      4      8      16
Speedup                1.98   3.24   5.09   6.14
16.6 million points; ~20 iterations
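The reported speedups can be turned into parallel efficiency with a few lines of R. The speedup values are the ones given above; the Karp-Flatt serial fraction is an additional standard metric added here for illustration and does not appear on the slide.

```r
procs   <- c(2, 4, 8, 16)
speedup <- c(1.98, 3.24, 5.09, 6.14)  # speedups reported for pkmeans()

# Parallel efficiency: speedup divided by the number of processors.
efficiency <- speedup / procs
round(efficiency, 2)  # 0.99 0.81 0.64 0.38

# Karp-Flatt experimentally determined serial fraction:
# e = (1/S - 1/p) / (1 - 1/p)
serial_fraction <- (1 / speedup - 1 / procs) / (1 - 1 / procs)
round(serial_fraction, 3)
```

Efficiency falls from 0.99 on 2 processors to 0.38 on 16, so for this problem size the added processors are increasingly spent on overhead rather than on computation.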
Fusion Scenario Challenges Parallel R
With: George Ostrouchov and Don Batchelor
A toroidal slice of the electrostatic field of a tokamak fusion simulation (polar coord. as Cartesian)
• Mahalanobis Distance: easy
• Expectation Maximization (EM): easy
• Hierarchical Model-based Clustering (mclust): hard
250,000 points; 10% sampling for ~1 hr analysis
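The "easy" end of this list can be sketched in base R: draw a 10% sample and compute Mahalanobis distances with `stats::mahalanobis()`. The data are synthetic (5,000 two-variable points instead of the 250,000-point field), and the chi-squared outlier cutoff is an assumption added for illustration.

```r
set.seed(11)
# Hypothetical stand-in for the simulation output: 5,000 points, 2 variables.
field <- cbind(x = rnorm(5000), y = rnorm(5000, sd = 2))

# Draw a 10% random sample of the rows, as on the slide.
idx <- sample(nrow(field), size = nrow(field) %/% 10)
sub <- field[idx, ]

# Squared Mahalanobis distance of each sampled point from the sample mean.
d2 <- mahalanobis(sub, center = colMeans(sub), cov = cov(sub))

# Flag points beyond the 99% chi-squared quantile (2 degrees of freedom).
outliers <- which(d2 > qchisq(0.99, df = 2))
length(outliers)
```

The distance computation is linear in the number of points once the covariance is known, which is consistent with it being listed as easy, whereas model-based hierarchical clustering grows much faster with the sample size.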
Quantitative Proteomics in Biology
With: Bob Hettich, Hays McDonald, and Greg Hurst
1. Experimental Step: sample of ~2,000 labeled proteins (N15) in different ratios; Liquid Chromatography-Mass Spectrometry (LC-MS), 24 hours of measurements; ~3GB raw data + ~50,000 MS, MS/MS files, ~1KByte each
2. Sequence Id Step: DBDigger + SEQUEST; ~15-18 hours
3. Quantification Step: RelEx; ~50,000 Chromatogram Files, ~1KB each; Ratio Calculations
Ratio Calculations for ~50,000 files
[CHROMATOGRAMS]
SCAN   TIME      SAMPLE    REFERENCE
1537   32.8275   4727570   4509290
1541   32.8978   1120668   4377465
1545   32.9718   4298401   4713328
1549   33.0477   2975233   9286918
…..    ……..      ………..
1. Read chromatogram file
2. Select Peak Window
  • Subtract background noise from data
  • Generate Covariance Chromatogram (red)
  • Apply Savitzky-Golay Smoother (blue)
  • Calculate cut-off for search (cyan)
  • Find Window with Max. SN ratio (green)
3. Calculate Ratio = Slope(Eigenvector)
Ratio Estimation over ~50,000 files
[Figure: log(Signal/Noise) = log(λ1/λ2) plotted against log(Ratio) = log(Slope(Eigenvector1)); Relative Frequency of log(Ratio)]
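The eigenvector-based ratio and the eigenvalue-based signal-to-noise estimate can be sketched in base R. The chromatogram below is synthetic, with a true sample-to-reference ratio of 2; placing the reference channel on the x axis is an assumption, and the background subtraction, smoothing, and peak-window steps are omitted.

```r
set.seed(3)
# Synthetic chromatogram peak (hypothetical data): the sample channel is
# twice the reference channel, plus a little noise on both.
time      <- seq(-3, 3, length.out = 61)
peak      <- exp(-time^2)
reference <- 1000 * peak + rnorm(61, sd = 5)
sample_ch <- 2000 * peak + rnorm(61, sd = 5)

# Eigen-decomposition of the 2 x 2 covariance matrix (reference on the x axis).
e <- eigen(cov(cbind(reference, sample_ch)))

# Ratio = slope of the first eigenvector; the sign of the vector cancels out.
ratio <- e$vectors[2, 1] / e$vectors[1, 1]

# Signal-to-noise = ratio of the two eigenvalues.
sn <- e$values[1] / e$values[2]

cat(sprintf("ratio = %.3f, log(S/N) = %.2f\n", ratio, log(sn)))
```

Unlike an ordinary regression slope, the first eigenvector treats noise in both channels symmetrically, and the second eigenvalue gives a direct measure of the scatter around the fitted line.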
Ratio Calculations with Parallel R
Serial Version
:::::::
chroList <- list.files(pattern="*.chro");
cat ("Chro", "samSN", "refSN", "PPCSN", "HR", "PCA", "PCASN", "\n", file="Pratio-Peptide.txt");
for (i in 1:length(chroList)) {
  currResult[[i]] <- Rratio(filename=chroList[i]);
}
for (i in 1:length(chroList)) {
  # append=TRUE adds one row per file instead of overwriting the output
  cat (chroList[i], currResult[[i]]$samSN, currResult[[i]]$refSN, currResult[[i]]$PPCSN, currResult[[i]]$HR, currResult[[i]]$PCA, currResult[[i]]$PCASN, "\n", file="Pratio-Peptide.txt", append=TRUE);
}
:::::::::::::
Parallel Version
:::::::
chroList <- list.files(pattern="*.chro");
cat ("Chro", "samSN", "refSN", "PPCSN", "HR", "PCA", "PCASN", "\n", file="Pratio-Peptide.txt");
PE (
  for (i in 1:length(chroList)) {
    currResult[[i]] <- Pratio(filename=chroList[i]);
  }
)
for (i in 1:length(chroList)) {
  # append=TRUE adds one row per file instead of overwriting the output
  cat (chroList[i], currResult[[i]]$samSN, currResult[[i]]$refSN, currResult[[i]]$PPCSN, currResult[[i]]$HR, currResult[[i]]$PCA, currResult[[i]]$PCASN, "\n", file="Pratio-Peptide.txt", append=TRUE);
}
:::::::::::::
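One alternative way to organize the output step is to collect all per-file results first and write them in a single call. The sketch below uses a hypothetical `fake_ratio()` in place of `Rratio()`/`Pratio()`, made-up file names, and only a subset of the result fields, since the real functions are not shown in the slides.

```r
# Hypothetical stand-in for Rratio()/Pratio(): returns a named list per file.
fake_ratio <- function(filename) {
  list(Chro = filename, samSN = 10, refSN = 12, PCA = 1.5)
}

chroList <- c("a.chro", "b.chro", "c.chro")  # hypothetical file names

# Compute all results first; this corresponds to the loop wrapped in PE().
results <- lapply(chroList, fake_ratio)

# Bind the per-file lists into one table and write it in a single call.
tab <- do.call(rbind, lapply(results, as.data.frame))
out <- tempfile(fileext = ".txt")
write.table(tab, file = out, row.names = FALSE, quote = FALSE)
```

Separating computation from output keeps the parallel section free of file writes, so the order in which tasks finish does not affect the order of rows in the result file.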
Performance Results for Ratio Calculation
Summary and Future Work
• Parallel R (pR) is an Open Source high performance library for statistical computing in R
• It has been deployed in a number of applications including: climate, GIS, fusion, and biology
• Future improvements in a few major directions:
  – Demonstrate more application scenarios
  – Add more libraries like RScaLAPACK, pMatrix (e.g. pAlok, pclust, pnetCDF)
  – Improve the performance (reduce overhead, memory management) of Parallel Agent
  – Enhance features of Parallel Agent:
    · Support outside of Master-Slave model
    · Better memory management strategies (one-sided put(), get(), release(), etc.)
    · Support of parallel I/O over netCDF and HDF files