TRANSCRIPT
pbdR: A Sustainable Path for Scalable Statistical Computing
George Ostrouchov
Oak Ridge National Laboratory and University of Tennessee
Workshop on Distributed Computing in R, January 26-27, 2015, HP Labs, Palo Alto, CA
The pbdR Core Team
Wei-Chen Chen (1)
George Ostrouchov (2,3)
Pragneshkumar Patel (3)
Drew Schmidt (3)

Support: This work used resources of the National Institute for Computational Sciences at the University of Tennessee, Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation under Award No. ARRA-NSF-OCI-0906324 for the NICS-RDAV center. This work also used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

(1) FDA, Washington, DC, USA
(2) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge TN, USA
(3) Joint Institute for Computational Sciences, University of Tennessee, Knoxville TN, USA
Contents
1 Introduction to HPC and Its View from R
2 pbdR
Introduction to HPC and Its View from R
Contents
1 Introduction to HPC and Its View from R
   Batch and Interactive
   An R and pbdR View of Parallel Hardware and Software
   Beyond “embarrassingly parallel”
Introduction to HPC and Its View from R: Batch and Interactive
Data analysis is interactive!
Data reduction to knowledge
S (and R): the interactive “answer” to batch data analysis
Iterative process with the same data:
Exploration, model construction
Diagnostics of fit and quantification of uncertainty
Interpretation
Efficient use of expensive people
Supercomputing is batch!
Traditionally, data is generated by simulation
Data processed mostly by visualization, often single pass
Efficient use of expensive platforms
When in Rome, do as the Romans do! Batch!
Parallel visualization systems (VisIt and ParaView) are client-server (batch on server)
Current pbdR packages address the server side (batch)
Client under development:
Bridge laptop to login node to resource manager to cluster
Site configuration file
Manage relationship of big data (server side) to little data (client side)
Need intuitive parallel data input
Batch and Interactive. Is there a difference?
High-level functional language: a function is a “batch” script
R: “an interactive environment to use batch scripts”
Introduction to HPC and Its View from R: An R and pbdR View of Parallel Hardware and Software
Big Data and Little Data
[Diagram: an HPC cluster computer with compute nodes, I/O nodes, and login nodes, attached to a parallel file system with disk storage servers, plus your laptop; “Big Data” lives on the cluster and parallel file system, “Little Data” on the laptop]
Software components: reader or generator, analysis script, display/write results
Hadoop: File System or HPC Cluster?
[Diagram: a Hadoop cluster computer with combined compute/IO/storage nodes and the Hadoop distributed file system, plus name nodes, job trackers, and clients, and your laptop; “Big Data” on the cluster, “Little Data” on the laptop]
Software components: HDFS file system, YARN resource manager, map-reduce limitation
Adding NVRAM to New HPC Systems
[Diagram: the same HPC cluster as before, now including NVRAM, with compute nodes, I/O nodes, login nodes, the parallel file system with disk storage servers, and your laptop; “Big Data” on the cluster, “Little Data” on the laptop]
R Interfaces to Low-Level Native Tools
[Diagram: three levels of parallel hardware. Distributed Memory: processors, each with cache and its own memory, connected by an interconnection network. Shared Memory: cores, each with cache, sharing one memory. Co-Processor: a GPU or MIC attached to local memory (GPU: Graphical Processing Unit; MIC: Many Integrated Core).]
Distributed memory: default is parallel (SPMD). Where is my data and who has what I need?
Shared memory: default is serial. Which tasks can the compiler make parallel?
Co-processor: offload data and tasks. We are slow but many!
Low-level tools and their R interfaces by hardware level:
Distributed memory: sockets, MPI, Hadoop; R packages snow, Rmpi, pbdMPI, RHadoop
Shared memory: OpenMP, OpenACC, Pthreads, fork; R packages multicore and snow (snow + multicore = parallel)
Co-processor: CUDA, OpenCL, OpenACC, OpenMP
Foreign language interfaces: .C, .Call, Rcpp, OpenCL, inline, ...
(A minimal SPMD sketch with pbdMPI follows below.)
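To make the SPMD view (“default is parallel”) concrete, here is a minimal pbdMPI sketch. It is only a sketch: the script name and launch command are illustrative assumptions, and only basic pbdMPI calls (init, comm.rank, comm.size, allreduce, comm.print, finalize) are used.

# hello_spmd.r: every MPI rank runs this same script (SPMD)
library(pbdMPI)
init()                                   # start MPI

me <- comm.rank()                        # this rank's id (0, 1, ...)
np <- comm.size()                        # total number of ranks

x <- me + 1                              # each rank owns its own piece of "data"
total <- allreduce(x, op = "sum")        # collective: every rank gets the global sum

comm.print(c(ranks = np, sum = total))   # printed once (rank 0 by default)
finalize()

Launched, for example, as: mpirun -np 4 Rscript hello_spmd.r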
HPC Libraries: 30+ Years of Research and Development
[Diagram: the same three hardware levels, annotated with HPC libraries.]
Distributed memory: MPI, ScaLAPACK, PBLAS, BLACS, PETSc, Trilinos, CombBLAS, DPLASMA, NetCDF4, ADIOS; profiling: mpiP, fpmpi, Tau
Shared memory: PLASMA, MKL (Intel), LibSci (Cray), ACML (AMD); profiling: PAPI
Co-processor: MAGMA, cuBLAS (NVIDIA), cuSPARSE (NVIDIA)
R and pbdR Interfaces to HPC Libraries
[Diagram: the same hardware levels and HPC libraries, now overlaid with the R packages that interface to them.]
pbdR interfaces: pbdMPI (MPI), pbdSLAP, pbdBASE, and pbdDMAT (ScaLAPACK, PBLAS, BLACS), pbdNCDF4 (NetCDF4), pbdADIOS (ADIOS), pbdPAPI (PAPI), pbdPROF (mpiP, fpmpi)
Other R interfaces: magma, HiPLAR, HiPLARM (MAGMA)
Introduction to HPC and Its View from R: Beyond “embarrassingly parallel”
A brief personal history of parallel computing
Timeline, roughly 1980 to 2010:
Early distributed systems produced (iPSC, NCUBE)
Numerical linear algebra embraces parallel computing, some interest in statistics
MPI standard developed
First distributed libraries released (PBLAS, ScaLAPACK)
Simulation science is the driver of supercomputing
CPU clock speeds stagnate, multicore emerges
Everyone, including statistics and R, cares about parallel computing, mostly from a multicore perspective
Early Work in Statistics: Crossproduct (Row-Block layout)
Ostrouchov (1987). Parallel Computing on a Hypercube: An Overview of the Architecture and Some Applications. Proceedings of the 19th Symposium on the Interface of Computer Science and Statistics, pp. 27-32.
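The row-block crossproduct has a simple SPMD form: each rank holds a block of rows X_local, computes its local crossproduct, and an allreduce sums the partial results across ranks. A minimal sketch with pbdMPI follows; the sizes and the rnorm generator are illustrative assumptions, not the 1987 implementation.

# Each rank owns a block of rows of the global matrix X (row-block layout)
library(pbdMPI)
init()

n_local <- 1000; p <- 10                       # illustrative local sizes
X_local <- matrix(rnorm(n_local * p), n_local, p)

# t(X) %*% X = sum over ranks of t(X_local) %*% X_local
xtx <- allreduce(crossprod(X_local), op = "sum")
xtx <- matrix(xtx, p, p)                       # restore the p x p shape after the reduce

comm.print(dim(xtx))
finalize()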
Distributed Numerical Linear Algebra
BLACS Library Communication Patterns (ScaLAPACK)
J. Dongarra and R. C. Whaley (1995). A User's Guide to the BLACS v1.1. UT-CS-95-281, March 1995.
Complex communications for block-distributed numerical linear algebra computations.
Enables use of separate rowwise and columnwise communicators.
Optimizes collective operations of multiple communicators.
PaRSEC: Parallel Runtime Scheduling and Execution Controller (DPLASMA)
Graphic from icl.cs.utk.edu
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., Dongarra, J. "PaRSEC: Exploiting Heterogeneity to Enhance Scalability," IEEE Computing in Science and Engineering, Vol. 15, No. 6, 36-45, November 2013.
Master data flow controller runs distributed on all cores.
Dynamic generation of current level in flow graph
Effectively removes collective synchronizations
pbdR
Contents
2 pbdR
   The pbdR Project
   pbdMPI
   pbdDMAT
   Benchmarks
   Future Work
pbdR: The pbdR Project
pbdR Interfaces to Libraries: Sustainable Path
[Diagram: the same hardware levels and HPC libraries as above, overlaid with the pbdR interface packages (pbdMPI, pbdSLAP, pbdBASE, pbdDMAT, pbdNCDF4, pbdADIOS, pbdPAPI, pbdPROF) and the pbdDEMO package tying them together.]
Why use HPC libraries?
The HPC community is 30 years beyond “embarrassingly parallel.”
They’re tested. They’re fast. They’re scalable.
Many science communities are invested in their API.
You’re not going to beat Jack Dongarra’s lab at dense linear algebra!
pbdR: pbdMPI
pbdMPI: Simplified, Extensible, and Fast Communication Operations
S4 methods for collective communication: extensible to other R objects.
Default methods (like Robj in Rmpi) check for data type: safe for general users.
API is simplified: defaults in control objects.
Array and matrix methods without serialization: faster than Rmpi.
pbdMPI (S4)   Rmpi
allgather     mpi.allgather, mpi.allgatherv, mpi.allgather.Robj
allreduce     mpi.allreduce
bcast         mpi.bcast, mpi.bcast.Robj
gather        mpi.gather, mpi.gatherv, mpi.gather.Robj
recv          mpi.recv, mpi.recv.Robj
reduce        mpi.reduce
scatter       mpi.scatter, mpi.scatterv, mpi.scatter.Robj
send          mpi.send, mpi.send.Robj
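As a quick illustration of these collectives in SPMD style, a minimal sketch (the data values are illustrative):

library(pbdMPI)
init()

x <- comm.rank() + 1        # rank 0 holds 1, rank 1 holds 2, ...

s  <- allreduce(x)          # sum across all ranks; same result on every rank
xs <- allgather(x)          # every rank's x, concatenated, on every rank
y  <- bcast(x)              # rank 0's x, broadcast to all ranks

comm.print(list(sum = s, gathered = xs, broadcast = y))
finalize()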
pbdMPI: Does the Right Thing for You
Rmpi
# int
mpi.allreduce(x, type = 1)
# double
mpi.allreduce(x, type = 2)
pbdMPI
allreduce(x)
Types in R
> is.integer(1)
[1] FALSE
> is.integer(2)
[1] FALSE
> is.integer(1:2)
[1] TRUE
pbdR: pbdDMAT
Mapping a Matrix to Processors
Processor Grid Shapes
(a) 1 x 6:   0 1 2 3 4 5
(b) 2 x 3:   0 1 2
             3 4 5
(c) 3 x 2:   0 1
             2 3
             4 5
(d) 6 x 1:   0
             1
             2
             3
             4
             5
Table: Processor Grid Shapes with 6 Processors
(A grid initialization sketch follows below.)
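In pbdDMAT a grid shape is set up with init.grid(); a minimal sketch, assuming a run with exactly 6 MPI ranks so the 2 x 3 shape matches the examples that follow:

library(pbdDMAT)
init.grid(2, 3)      # 2 x 3 process grid (process rows, then process columns)

# ... distributed-matrix work on the ddmatrix class goes here ...

finalize()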
2×3 block-cyclic grid on 6 processors: Global view “ddmatrix” class
x =
x11 x12 x13 x14 x15 x16 x17 x18 x19
x21 x22 x23 x24 x25 x26 x27 x28 x29
x31 x32 x33 x34 x35 x36 x37 x38 x39
x41 x42 x43 x44 x45 x46 x47 x48 x49
x51 x52 x53 x54 x55 x56 x57 x58 x59
x61 x62 x63 x64 x65 x66 x67 x68 x69
x71 x72 x73 x74 x75 x76 x77 x78 x79
x81 x82 x83 x84 x85 x86 x87 x88 x89
x91 x92 x93 x94 x95 x96 x97 x98 x99
9×9
Processor grid =
| 0 1 2 |   =   | (0,0) (0,1) (0,2) |
| 3 4 5 |       | (1,0) (1,1) (1,2) |
2×3 block-cyclic grid on 6 processors: Local view “ddmatrix” class
Process 0 = (0,0), local 5 x 4:
x11 x12 x17 x18
x21 x22 x27 x28
x51 x52 x57 x58
x61 x62 x67 x68
x91 x92 x97 x98

Process 1 = (0,1), local 5 x 3:
x13 x14 x19
x23 x24 x29
x53 x54 x59
x63 x64 x69
x93 x94 x99

Process 2 = (0,2), local 5 x 2:
x15 x16
x25 x26
x55 x56
x65 x66
x95 x96

Process 3 = (1,0), local 4 x 4:
x31 x32 x37 x38
x41 x42 x47 x48
x71 x72 x77 x78
x81 x82 x87 x88

Process 4 = (1,1), local 4 x 3:
x33 x34 x39
x43 x44 x49
x73 x74 x79
x83 x84 x89

Process 5 = (1,2), local 4 x 2:
x35 x36
x45 x46
x75 x76
x85 x86

Processor grid =
| 0 1 2 |   =   | (0,0) (0,1) (0,2) |
| 3 4 5 |       | (1,0) (1,1) (1,2) |
(A small ownership sketch follows below.)
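To connect the global and local views, here is a small ownership sketch. It is not part of pbdDMAT; the helper name owner() is made up for illustration. It computes which process in a Pr x Pc grid owns global entry (i, j) under a b x b block-cyclic layout:

# Which process owns global entry (i, j)? Processes are numbered row-major,
# as in the 2 x 3 grid above, with block size b x b.
owner <- function(i, j, b = 2, Pr = 2, Pc = 3) {
  prow <- ((i - 1) %/% b) %% Pr      # process row holding row i
  pcol <- ((j - 1) %/% b) %% Pc      # process column holding column j
  c(rank = prow * Pc + pcol, prow = prow, pcol = pcol)
}

owner(5, 7)   # x[5,7] maps to process (0,0), i.e. rank 0, matching the local view above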
pbdR Example Syntax
x <- x[-1, 2:5]
x <- log(abs(x) + 1)
x.pca <- prcomp(x)
xtx <- t(x) %*% x
ans <- svd(solve(xtx))
The above (and over 100 other functions) runs on 1 core with R or on 10,000 cores with the pbdR ddmatrix class
1 > showClass("ddmatrix")
2 Class "ddmatrix" [package "pbdDMAT"]
3 Slots:
4 Name: Data dim ldim bldim ICTXT
5 Class: matrix numeric numeric numeric numeric
> x <- as.rowblock(x)
> x <- as.colblock(x)
> x <- redistribute(x, bldim = c(8, 8), ICTXT = 0)
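Putting the pieces together, a complete minimal SPMD script might look like the following. This is a hedged sketch: the matrix sizes and the rnorm generator are illustrative, and every rank runs the same file under mpirun.

library(pbdDMAT)
init.grid()                                   # default process grid

# Distributed random matrix: each rank generates only its own local blocks
x <- ddmatrix("rnorm", nrow = 1000, ncol = 50)

x   <- log(abs(x) + 1)                        # same syntax as serial R
xtx <- t(x) %*% x
pca <- prcomp(x)

comm.print(as.matrix(xtx[1:2, 1:2]))          # gather a small corner and print from rank 0
finalize()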
pbdR: Benchmarks
Default Choices Throughout
Only free software used (no MKL, ACML, etc.)
Generate random matrix
Global columns: 500, 1000, and 2000
Global rows: vary with core count (local matrix fixed at 43.4 MiB; see the sizing sketch after this list)
1 core = 1 MPI process
Measure wall clock time
“weak scaling” = global problem grows with core count
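For concreteness, the local row count implied by a fixed 43.4 MiB block of 8-byte doubles can be computed as below; this is a back-of-the-envelope sketch, since the exact sizing rule is not spelled out in the slides.

# Rows per core implied by a fixed 43.4 MiB local block of doubles
local_bytes <- 43.4 * 2^20
p <- c(500, 1000, 2000)                          # global column counts used in the benchmarks
local_rows <- floor(local_bytes / (8 * p))       # about 11377, 5688, 2844 rows per core
data.frame(predictors = p, local_rows = local_rows)
# Global rows then scale as local_rows * number of cores (weak scaling).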
Covariance Benchmark (pbdDMAT)
1 x <- ddmatrix("rnorm", nrow=n, ncol=p, mean=mean , sd=sd)
2 cov.x <- cov(x)
[Plot: "Calculating cov(x) With Fixed Local Size of ~43.4 MiB"; run time in seconds (roughly 7 to 11) versus cores (504 to 8064), with panels for 500, 1000, and 2000 predictors; total data size grows from about 21.34 to 341.41 GiB]
Linear Regression Benchmark (pbdDMAT)
beta.true <- ddmatrix("runif", nrow = p, ncol = 1)
y <- x %*% beta.true
beta.est <- lm.fit(x, y)$coefficients
[Plot: "Fitting y~x With Fixed Local Size of ~43.4 MiB"; run time in seconds (roughly 25 to 125) versus cores (504 to 24000), with panels for 500, 1000, and 2000 predictors; total data size grows from about 21.34 to 1016.09 GiB]
Data Generation (pbdDMAT)
[Plot: data generation rate in GiB per second (0 to 800) versus cores (1 to 48000)]
Matrix Exponentiation (pbdDMAT)
Fitting biogeography models requires many matrix exponentiations
Benchmark: matrix exponential of a 5000 x 5000 matrix
R 3.1.0, Matrix 1.1-2, rexpokit 0.25, pbdDMAT 0.3-0
Libraries: Cray LibSci, NETLIB ScaLAPACK; compilers: gnu 4.8.2
Configuration: 1 thread == 1 MPI rank == 1 physical core
Schmidt and Matzke (2014). Distributed matrix exponentiation. The R User Conference (UseR! 2014), Los Angeles, CA, August 2014.
pbdR: Future Work
Future Work
Started a 3-year NSF grant to:
Bring back interactivity via client/server
Simplify parallel data input
Begin DPLASMA integration
Outreach to the statistics community
DOE funding: In-situ or staging use with simulations
Where to learn more?
http://r-pbd.org/
pbdDEMO vignette
Google group: RBigDataProgramming