TRANSCRIPT
pbdR: A Sustainable Path for Scalable Statistical Computing
George Ostrouchov
Oak Ridge National Laboratory and University of Tennessee
Workshop on Distributed Computing in R, January 26-27, 2015, HP Labs, Palo Alto, CA
The pbdR Core Team
Wei-Chen Chen (1)
George Ostrouchov (2,3)
Pragneshkumar Patel (3)
Drew Schmidt (3)

Support: This work used resources of the National Institute for Computational Sciences at the University of Tennessee, Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation under Award No. ARRA-NSF-OCI-0906324 for the NICS-RDAV center. This work also used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

(1) FDA, Washington, DC, USA
(2) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge TN, USA
(3) Joint Institute for Computational Sciences, University of Tennessee, Knoxville TN, USA
Contents
1 Introduction to HPC and Its View from R
2 pbdR
Introduction to HPC and Its View from R
Contents
1 Introduction to HPC and Its View from R
   Batch and Interactive
   An R and pbdR View of Parallel Hardware and Software
   Beyond “embarrassingly parallel”
Introduction to HPC and Its View from R: Batch and Interactive
Data analysis is interactive!
Data reduction to knowledge
S (and R): the interactive “answer” to batch data analysis
Iterative process with the same data:
Exploration, model construction
Diagnostics of fit and quantification of uncertainty
Interpretation
Efficient use of expensive people
Supercomputing is batch!
Traditionally, data is generated by simulation
Data processed mostly by visualization, often single pass
Efficient use of expensive platforms
When in Rome, do as the Romans do! Batch!
Parallel visualization systems (VisIt and ParaView) are client-server (batch on server)
Current pbdR packages address the server side (batch)
Client under development:
Bridge laptop to login node to resource manager to cluster
Site configuration file
Manage relationship of big data (server side) to little data (client side)
Need intuitive parallel data input
Batch and Interactive. Is there a difference?
High-level functional language: a function is a “batch” script
R: “an interactive environment to use batch scripts”
Introduction to HPC and Its View from R: An R and pbdR View of Parallel Hardware and Software
Big Data and Little Data
[Diagram: an HPC cluster computer with compute nodes, I/O nodes, and login nodes, attached to a parallel file system with disk storage servers, plus your laptop; “Big Data” lives on the cluster and parallel file system, “Little Data” on the laptop]
Software components: reader or generator, analysis script, display/write results
Hadoop: File System or HPC Cluster?
[Diagram: a Hadoop cluster computer with combined compute/IO/storage nodes and the Hadoop distributed file system, plus name nodes, job trackers, and clients, and your laptop; “Big Data” on the cluster, “Little Data” on the laptop]
Software components: HDFS file system, YARN resource manager, map-reduce limitation
Adding NVRAM to New HPC Systems
[Diagram: the same HPC cluster as before, now including NVRAM, with compute nodes, I/O nodes, login nodes, the parallel file system with disk storage servers, and your laptop; “Big Data” on the cluster, “Little Data” on the laptop]
R Interfaces to Low-Level Native Tools
[Diagram: three levels of parallel hardware. Distributed Memory: processors, each with cache and its own memory, connected by an interconnection network. Shared Memory: cores, each with cache, sharing one memory. Co-Processor: a GPU or MIC attached to local memory (GPU: Graphical Processing Unit; MIC: Many Integrated Core).]
Distributed memory: default is parallel (SPMD). Where is my data and who has what I need?
Shared memory: default is serial. Which tasks can the compiler make parallel?
Co-processor: offload data and tasks. We are slow but many!
Low-level tools and their R interfaces by hardware level:
Distributed memory: sockets, MPI, Hadoop; R packages snow, Rmpi, pbdMPI, RHadoop
Shared memory: OpenMP, OpenACC, Pthreads, fork; R packages multicore and snow (snow + multicore = parallel)
Co-processor: CUDA, OpenCL, OpenACC, OpenMP
Foreign language interfaces: .C, .Call, Rcpp, OpenCL, inline, ...
(A minimal SPMD sketch with pbdMPI follows below.)
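To make the SPMD view (“default is parallel”) concrete, here is a minimal pbdMPI sketch. It is only a sketch: the script name and launch command are illustrative assumptions, and only basic pbdMPI calls (init, comm.rank, comm.size, allreduce, comm.print, finalize) are used.

# hello_spmd.r: every MPI rank runs this same script (SPMD)
library(pbdMPI)
init()                                   # start MPI

me <- comm.rank()                        # this rank's id (0, 1, ...)
np <- comm.size()                        # total number of ranks

x <- me + 1                              # each rank owns its own piece of "data"
total <- allreduce(x, op = "sum")        # collective: every rank gets the global sum

comm.print(c(ranks = np, sum = total))   # printed once (rank 0 by default)
finalize()

Launched, for example, as: mpirun -np 4 Rscript hello_spmd.r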
HPC Libraries: 30+ Years of Research and Development
[Diagram: the same three hardware levels, annotated with HPC libraries.]
Distributed memory: MPI, ScaLAPACK, PBLAS, BLACS, PETSc, Trilinos, CombBLAS, DPLASMA, NetCDF4, ADIOS; profiling: mpiP, fpmpi, Tau
Shared memory: PLASMA, MKL (Intel), LibSci (Cray), ACML (AMD); profiling: PAPI
Co-processor: MAGMA, cuBLAS (NVIDIA), cuSPARSE (NVIDIA)
R and pbdR Interfaces to HPC Libraries
[Diagram: the same hardware levels and HPC libraries, now overlaid with the R packages that interface to them.]
pbdR interfaces: pbdMPI (MPI), pbdSLAP, pbdBASE, and pbdDMAT (ScaLAPACK, PBLAS, BLACS), pbdNCDF4 (NetCDF4), pbdADIOS (ADIOS), pbdPAPI (PAPI), pbdPROF (mpiP, fpmpi)
Other R interfaces: magma, HiPLAR, HiPLARM (MAGMA)
Introduction to HPC and Its View from R: Beyond “embarrassingly parallel”
A brief personal history of parallel computing
Timeline, roughly 1980 to 2010:
Early distributed systems produced (iPSC, NCUBE)
Numerical linear algebra embraces parallel computing, some interest in statistics
MPI standard developed
First distributed libraries released (PBLAS, ScaLAPACK)
Simulation science is the driver of supercomputing
CPU clock speeds stagnate, multicore emerges
Everyone, including statistics and R, cares about parallel computing, mostly from a multicore perspective
Early Work in Statistics: Crossproduct (Row-Block layout)
Ostrouchov (1987). Parallel Computing on a Hypercube: An Overview of the Architecture and Some Applications. Proceedings of the 19th Symposium on the Interface of Computer Science and Statistics, pp. 27-32.
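The row-block crossproduct has a simple SPMD form: each rank holds a block of rows X_local, computes its local crossproduct, and an allreduce sums the partial results across ranks. A minimal sketch with pbdMPI follows; the sizes and the rnorm generator are illustrative assumptions, not the 1987 implementation.

# Each rank owns a block of rows of the global matrix X (row-block layout)
library(pbdMPI)
init()

n_local <- 1000; p <- 10                       # illustrative local sizes
X_local <- matrix(rnorm(n_local * p), n_local, p)

# t(X) %*% X = sum over ranks of t(X_local) %*% X_local
xtx <- allreduce(crossprod(X_local), op = "sum")
xtx <- matrix(xtx, p, p)                       # restore the p x p shape after the reduce

comm.print(dim(xtx))
finalize()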
Distributed Numerical Linear Algebra
BLACS Library Communication Patterns (ScaLAPACK)
J. Dongarra and R. C. Whaley (1995). A User's Guide to the BLACS v1.1. UT-CS-95-281, March 1995.
Complex communications for block-distributed numerical linear algebra computations.
Enables use of separate rowwise and columnwise communicators.
Optimizes collective operations of multiple communicators.
PaRSEC: Parallel Runtime Scheduling and Execution Controller (DPLASMA)
Graphic from icl.cs.utk.edu
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., Dongarra, J. "PaRSEC: Exploiting Heterogeneity to Enhance Scalability," IEEE Computing in Science and Engineering, Vol. 15, No. 6, 36-45, November 2013.
Master data flow controller runs distributed on all cores.
Dynamic generation of current level in flow graph
Effectively removes collective synchronizations
pbdR
Contents
2 pbdR
   The pbdR Project
   pbdMPI
   pbdDMAT
   Benchmarks
   Future Work
pbdR: The pbdR Project
pbdR Interfaces to Libraries: Sustainable Path
[Diagram: the same hardware levels and HPC libraries as above, overlaid with the pbdR interface packages (pbdMPI, pbdSLAP, pbdBASE, pbdDMAT, pbdNCDF4, pbdADIOS, pbdPAPI, pbdPROF) and the pbdDEMO package tying them together.]
Why use HPC libraries?
The HPC community is 30 years beyond “embarrassingly parallel.”
They’re tested. They’re fast. They’re scalable.
Many science communities are invested in their API.
You’re not going to beat Jack Dongarra’s lab at dense linear algebra!
pbdR: pbdMPI
pbdMPI: Simplified, Extensible, and Fast Communication Operations
S4 methods for collective communication: extensible to other R objects.
Default methods (like Robj in Rmpi) check for data type: safe for general users.
API is simplified: defaults in control objects.
Array and matrix methods without serialization: faster than Rmpi.
pbdMPI (S4)   Rmpi
allgather     mpi.allgather, mpi.allgatherv, mpi.allgather.Robj
allreduce     mpi.allreduce
bcast         mpi.bcast, mpi.bcast.Robj
gather        mpi.gather, mpi.gatherv, mpi.gather.Robj
recv          mpi.recv, mpi.recv.Robj
reduce        mpi.reduce
scatter       mpi.scatter, mpi.scatterv, mpi.scatter.Robj
send          mpi.send, mpi.send.Robj
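As a quick illustration of these collectives in SPMD style, a minimal sketch (the data values are illustrative):

library(pbdMPI)
init()

x <- comm.rank() + 1        # rank 0 holds 1, rank 1 holds 2, ...

s  <- allreduce(x)          # sum across all ranks; same result on every rank
xs <- allgather(x)          # every rank's x, concatenated, on every rank
y  <- bcast(x)              # rank 0's x, broadcast to all ranks

comm.print(list(sum = s, gathered = xs, broadcast = y))
finalize()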
pbdMPI: Does the Right Thing for You
Rmpi
# int
mpi.allreduce(x, type = 1)
# double
mpi.allreduce(x, type = 2)
pbdMPI
allreduce(x)
Types in R
> is.integer(1)
[1] FALSE
> is.integer(2)
[1] FALSE
> is.integer(1:2)
[1] TRUE
pbdR: pbdDMAT
Mapping a Matrix to Processors
Processor Grid Shapes
(a) 1 x 6:   0 1 2 3 4 5
(b) 2 x 3:   0 1 2
             3 4 5
(c) 3 x 2:   0 1
             2 3
             4 5
(d) 6 x 1:   0
             1
             2
             3
             4
             5
Table: Processor Grid Shapes with 6 Processors
(A grid initialization sketch follows below.)
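In pbdDMAT a grid shape is set up with init.grid(); a minimal sketch, assuming a run with exactly 6 MPI ranks so the 2 x 3 shape matches the examples that follow:

library(pbdDMAT)
init.grid(2, 3)      # 2 x 3 process grid (process rows, then process columns)

# ... distributed-matrix work on the ddmatrix class goes here ...

finalize()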
2×3 block-cyclic grid on 6 processors: Global view “ddmatrix” class
x =
x11 x12 x13 x14 x15 x16 x17 x18 x19
x21 x22 x23 x24 x25 x26 x27 x28 x29
x31 x32 x33 x34 x35 x36 x37 x38 x39
x41 x42 x43 x44 x45 x46 x47 x48 x49
x51 x52 x53 x54 x55 x56 x57 x58 x59
x61 x62 x63 x64 x65 x66 x67 x68 x69
x71 x72 x73 x74 x75 x76 x77 x78 x79
x81 x82 x83 x84 x85 x86 x87 x88 x89
x91 x92 x93 x94 x95 x96 x97 x98 x99
9×9
Processor grid =
| 0 1 2 |   =   | (0,0) (0,1) (0,2) |
| 3 4 5 |       | (1,0) (1,1) (1,2) |
2×3 block-cyclic grid on 6 processors: Local view “ddmatrix” class
Process 0 = (0,0), local 5 x 4:
x11 x12 x17 x18
x21 x22 x27 x28
x51 x52 x57 x58
x61 x62 x67 x68
x91 x92 x97 x98

Process 1 = (0,1), local 5 x 3:
x13 x14 x19
x23 x24 x29
x53 x54 x59
x63 x64 x69
x93 x94 x99

Process 2 = (0,2), local 5 x 2:
x15 x16
x25 x26
x55 x56
x65 x66
x95 x96

Process 3 = (1,0), local 4 x 4:
x31 x32 x37 x38
x41 x42 x47 x48
x71 x72 x77 x78
x81 x82 x87 x88

Process 4 = (1,1), local 4 x 3:
x33 x34 x39
x43 x44 x49
x73 x74 x79
x83 x84 x89

Process 5 = (1,2), local 4 x 2:
x35 x36
x45 x46
x75 x76
x85 x86

Processor grid =
| 0 1 2 |   =   | (0,0) (0,1) (0,2) |
| 3 4 5 |       | (1,0) (1,1) (1,2) |
(A small ownership sketch follows below.)
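To connect the global and local views, here is a small ownership sketch. It is not part of pbdDMAT; the helper name owner() is made up for illustration. It computes which process in a Pr x Pc grid owns global entry (i, j) under a b x b block-cyclic layout:

# Which process owns global entry (i, j)? Processes are numbered row-major,
# as in the 2 x 3 grid above, with block size b x b.
owner <- function(i, j, b = 2, Pr = 2, Pc = 3) {
  prow <- ((i - 1) %/% b) %% Pr      # process row holding row i
  pcol <- ((j - 1) %/% b) %% Pc      # process column holding column j
  c(rank = prow * Pc + pcol, prow = prow, pcol = pcol)
}

owner(5, 7)   # x[5,7] maps to process (0,0), i.e. rank 0, matching the local view above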
pbdR Example Syntax
x <- x[-1, 2:5]
x <- log(abs(x) + 1)
x.pca <- prcomp(x)
xtx <- t(x) %*% x
ans <- svd(solve(xtx))
The above (and over 100 other functions) runs on 1 core with R or on 10,000 cores with the pbdR ddmatrix class
1 > showClass("ddmatrix")
2 Class "ddmatrix" [package "pbdDMAT"]
3 Slots:
4 Name: Data dim ldim bldim ICTXT
5 Class: matrix numeric numeric numeric numeric
> x <- as.rowblock(x)
> x <- as.colblock(x)
> x <- redistribute(x, bldim = c(8, 8), ICTXT = 0)
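Putting the pieces together, a complete minimal SPMD script might look like the following. This is a hedged sketch: the matrix sizes and the rnorm generator are illustrative, and every rank runs the same file under mpirun.

library(pbdDMAT)
init.grid()                                   # default process grid

# Distributed random matrix: each rank generates only its own local blocks
x <- ddmatrix("rnorm", nrow = 1000, ncol = 50)

x   <- log(abs(x) + 1)                        # same syntax as serial R
xtx <- t(x) %*% x
pca <- prcomp(x)

comm.print(as.matrix(xtx[1:2, 1:2]))          # gather a small corner and print from rank 0
finalize()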
pbdR: Benchmarks
Default Choices Throughout
Only free software used (no MKL, ACML, etc.)
Generate random matrix
Global columns: 500, 1000, and 2000
Global rows: vary with core count (local matrix fixed at 43.4 MiB; see the sizing sketch after this list)
1 core = 1 MPI process
Measure wall clock time
“weak scaling” = global problem grows with core count
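For concreteness, the local row count implied by a fixed 43.4 MiB block of 8-byte doubles can be computed as below; this is a back-of-the-envelope sketch, since the exact sizing rule is not spelled out in the slides.

# Rows per core implied by a fixed 43.4 MiB local block of doubles
local_bytes <- 43.4 * 2^20
p <- c(500, 1000, 2000)                          # global column counts used in the benchmarks
local_rows <- floor(local_bytes / (8 * p))       # about 11377, 5688, 2844 rows per core
data.frame(predictors = p, local_rows = local_rows)
# Global rows then scale as local_rows * number of cores (weak scaling).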
Covariance Benchmark (pbdDMAT)
1 x <- ddmatrix("rnorm", nrow=n, ncol=p, mean=mean , sd=sd)
2 cov.x <- cov(x)
[Plot: "Calculating cov(x) With Fixed Local Size of ~43.4 MiB"; run time in seconds (roughly 7 to 11) versus cores (504 to 8064), with panels for 500, 1000, and 2000 predictors; total data size grows from about 21.34 to 341.41 GiB]
Linear Regression Benchmark (pbdDMAT)
beta.true <- ddmatrix("runif", nrow = p, ncol = 1)
y <- x %*% beta.true
beta.est <- lm.fit(x, y)$coefficients
[Plot: "Fitting y~x With Fixed Local Size of ~43.4 MiB"; run time in seconds (roughly 25 to 125) versus cores (504 to 24000), with panels for 500, 1000, and 2000 predictors; total data size grows from about 21.34 to 1016.09 GiB]
Data Generation (pbdDMAT)
[Plot: data generation rate in GiB per second (0 to 800) versus cores (1 to 48000)]
Matrix Exponentiation (pbdDMAT)
Fitting biogeography models requires many matrix exponentiations
Benchmark: matrix exponential of a 5000 x 5000 matrix
R 3.1.0, Matrix 1.1-2, rexpokit 0.25, pbdDMAT 0.3-0
Libraries: Cray LibSci, NETLIB ScaLAPACK; compilers: gnu 4.8.2
Configuration: 1 thread == 1 MPI rank == 1 physical core
Schmidt and Matzke (2014). Distributed matrix exponentiation. The R User Conference (UseR! 2014), Los Angeles, CA, August 2014.
pbdR: Future Work
Future Work
Started a 3-year NSF grant to:
Bring back interactivity via client/server
Simplify parallel data input
Begin DPLASMA integration
Outreach to the statistics community
DOE funding: In-situ or staging use with simulations
Where to learn more?
http://r-pbd.org/
pbdDEMO vignette
Google group: RBigDataProgramming