
Page 1:

Communication Frameworks for HPC and Big Data

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Spain Conference (2015)


Page 2:

High-End Computing (HEC): PetaFlop to ExaFlop

HPCAC Spain Conference (Sept '15) 2

100-200 PFlops in 2016-2018

1 EFlops in 2020-2024?

Page 3:

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

3HPCAC Spain Conference (Sept '15)

[Chart: number of clusters (0-500) and percentage of clusters (0-100%) in the Top500 over time; commodity clusters now account for 87% of Top500 systems]

Page 4:

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous

• InfiniBand very popular in HPC clusters

• Accelerators/Coprocessors becoming common in high-end systems

• Pushing the envelope for Exascale computing

Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip

High Performance Interconnects (InfiniBand): <1 usec latency, >100 Gbps bandwidth

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

4

Multi-core Processors

HPCAC Spain Conference (Sept '15)

Page 5:

• 259 IB Clusters (51%) in the June 2015 Top500 list

(http://www.top500.org)

• Installations in the Top 50 (24 systems):

Large-scale InfiniBand Installations

519,640 cores (Stampede) at TACC (8th)
185,344 cores (Pleiades) at NASA/Ames (11th)
72,800 cores Cray CS-Storm in US (13th)
72,800 cores Cray CS-Storm in US (14th)
265,440 cores SGI ICE at Tulip Trading Australia (15th)
124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (16th)
72,000 cores (HPC2) in Italy (17th)
115,668 cores (Thunder) at AFRL/USA (19th)
147,456 cores (SuperMUC) in Germany (20th)
86,016 cores (SuperMUC Phase 2) in Germany (21st)
76,032 cores (Tsubame 2.5) at Japan/GSIC (22nd)
194,616 cores (Cascade) at PNNL (25th)
76,032 cores (Makman-2) at Saudi Aramco (28th)
110,400 cores (Pangea) in France (29th)
37,120 cores (Lomonosov-2) at Russia/MSU (31st)
57,600 cores (SwiftLucy) in US (33rd)
50,544 cores (Occigen) at France/GENCI-CINES (36th)
76,896 cores (Salomon) SGI ICE in Czech Republic (40th)
73,584 cores (Spirit) at AFRL/USA (42nd)
and many more!

5HPCAC Spain Conference (Sept '15)

Page 6:

• Scientific Computing

– Message Passing Interface (MPI), including MPI + OpenMP, is the

Dominant Programming Model

– Many discussions towards Partitioned Global Address Space

(PGAS)

• UPC, OpenSHMEM, CAF, etc.

– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)

• Big Data/Enterprise/Commercial Computing

– Focuses on large data and data analysis

– Hadoop (HDFS, HBase, MapReduce)

– Spark is emerging for in-memory computing

– Memcached is also used for Web 2.0

6

Two Major Categories of Applications

HPCAC Spain Conference (Sept '15)

Page 7:

Towards Exascale System (Today and Target)

Systems: 2015 (Tianhe-2) | 2020-2024 | Difference (Today & Exascale)
System peak: 55 PFlop/s | 1 EFlop/s | ~20x
Power: 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1), ~15x
System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) | 32-64 PB | ~50x
Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) | 1.2 or 15 TF | O(1)
Node concurrency: 24-core CPU + 171-core CoP | O(1k) or O(10k) | ~5x - ~50x
Total node interconnect BW: 6.36 GB/s | 200-400 GB/s | ~40x - ~60x
System size (nodes): 16,000 | O(100,000) or O(1M) | ~6x - ~60x
Total concurrency: 3.12M (12.48M threads, 4/core) | O(billion) for latency hiding | ~100x
MTTI: Few/day | Many/day | O(?)

Courtesy: Prof. Jack Dongarra

7HPCAC Spain Conference (Sept '15)

Page 8:

HPCAC Spain Conference (Sept '15) 8

Parallel Programming Models Overview

[Diagram: three abstract machine models, each with processes P1-P3]

– Shared Memory Model (e.g., SHMEM, DSM): P1-P3 operate on a single shared memory
– Distributed Memory Model (e.g., MPI, the Message Passing Interface): each process has its own private memory
– Partitioned Global Address Space (PGAS) Model (e.g., Global Arrays, UPC, Chapel, X10, CAF, …): private memories combined with a logical shared memory

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

Page 9:

9

Partitioned Global Address Space (PGAS) Models

HPCAC Spain Conference (Sept '15)

• Key features

- Simple shared memory abstractions

- Light weight one-sided communication

- Easier to express irregular communication

• Different approaches to PGAS

- Languages

• Unified Parallel C (UPC)

• Co-Array Fortran (CAF)

• X10

• Chapel

- Libraries

• OpenSHMEM

• Global Arrays

Page 10:

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based

on communication characteristics

• Benefits:

– Best of Distributed Computing Model

– Best of Shared Memory Computing Model

• Exascale Roadmap*:

– “Hybrid Programming is a practical way to

program exascale systems”

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: an HPC application composed of Kernels 1..N; each kernel is written in MPI or PGAS according to its communication characteristics (e.g., Kernel 2 and Kernel N re-written in PGAS, the rest in MPI)]

HPCAC Spain Conference (Sept '15) 10
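As a concrete illustration (not from the original slides), a minimal hybrid MPI + OpenSHMEM kernel might look like the sketch below. It assumes a unified runtime such as MVAPICH2-X in which both models can be initialized in the same program; buffer names and sizes are illustrative only.

#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                                   /* OpenSHMEM 1.2-style initialization */

    int rank, npes = shmem_n_pes();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* PGAS-style kernel: one-sided put of my rank into the next PE's symmetric heap */
    long *sym = (long *) shmem_malloc(sizeof(long));
    *sym = -1;
    shmem_barrier_all();
    long val = rank;
    shmem_long_put(sym, &val, 1, (rank + 1) % npes);
    shmem_barrier_all();

    /* MPI-style kernel: reduce the received values to rank 0 */
    long sum = 0;
    MPI_Reduce(sym, &sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %ld\n", sum);

    shmem_free(sym);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}

With a unified runtime, the same communication library services both the MPI_Reduce and the shmem_long_put, which is what avoids duplicating connection and buffer resources across the two models.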

Page 11:

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design stack]
– Application Kernels/Applications
– Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
– Middleware: Communication Library or Runtime for Programming Models, providing Point-to-point Communication (two-sided and one-sided), Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, and Fault Tolerance
– Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC)
– Co-Design Opportunities and Challenges across Various Layers: Performance, Scalability, Fault-Resilience

11

HPCAC Spain Conference (Sept '15)

Page 12:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable Collective communication
– Offload

– Non-blocking

– Topology-aware

• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)

– Multiple end-points per node

• Support for efficient multi-threading

• Integrated Support for GPGPUs and Accelerators

• Fault-tolerance/resiliency

• QoS support for communication and I/O

• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …)

• Virtualization

• Energy-Awareness

Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

12HPCAC Spain Conference (Sept '15)

Page 13:

• Extremely Low Memory Footprint
– Memory per core continues to decrease

• D-L-A Framework

– Discover

• Overall network topology (fat-tree, 3D, …)

• Network topology for processes for a given job

• Node architecture

• Health of network and node

– Learn

• Impact on performance and scalability

• Potential for failure

– Adapt

• Internal protocols and algorithms

• Process mapping

• Fault-tolerance solutions

– Low overhead techniques while delivering performance, scalability and fault-tolerance

Additional Challenges for Designing Exascale Software Libraries

13HPCAC Spain Conference (Sept '15)

Page 14:

• High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over

Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Used by more than 2,450 organizations in 76 countries

– More than 286,000 downloads from the OSU site directly

– Empowering many TOP500 clusters (June ‘15 ranking)

• 8th ranked 519,640-core cluster (Stampede) at TACC

• 11th ranked 185,344-core cluster (Pleiades) at NASA

• 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->

– Stampede at TACC (8th in Jun '15, 519,640 cores, 5.168 PFlops)

MVAPICH2 Software

14HPCAC Spain Conference (Sept '15)

Page 15:

HPCAC Spain Conference (Sept '15) 15

MVAPICH2 Software Family

Requirements: MVAPICH2 Library to use
– MPI with IB, iWARP and RoCE: MVAPICH2
– Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE: MVAPICH2-X
– MPI with IB & GPU: MVAPICH2-GDR
– MPI with IB & MIC: MVAPICH2-MIC
– HPC Cloud with MPI & IB: MVAPICH2-Virt
– Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA

Page 16:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2 Project for Exascale

16HPCAC Spain Conference (Sept '15)

Page 17:

HPCAC Spain Conference (Sept '15) 17

Latency & Bandwidth: MPI over IB with MVAPICH2

[Charts: small-message latency and unidirectional bandwidth of MPI over IB with MVAPICH2]

Platforms: TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR, each on 2.8 GHz deca-core (IvyBridge) Intel nodes with PCI Gen3 (IB switch; ConnectX-4-EDR back-to-back)

Small-message latencies of 1.05-1.26 us (labeled values: 1.26, 1.19, 1.05, 1.15 us) and unidirectional bandwidths up to 12,465 MB/s (labeled values: 3,387, 6,356, 11,497, 12,465 MBytes/sec) across the four adapters
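For reference, latency numbers of this kind come from OSU-micro-benchmark-style ping-pong tests. The following is a minimal sketch of such a measurement (not the actual OSU micro-benchmark code); the message size and iteration count are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8        /* small message, in bytes (illustrative) */
#define ITERS    10000

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a', MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency = half the average round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}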

Page 18:

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.2a, Intel Ivy Bridge

[Charts: intra-node latency (0-1K bytes, intra-socket vs. inter-socket) and bandwidth, with intra-socket and inter-socket curves for the CMA, shared-memory, and LiMIC channels]

Small-message latency: 0.18 us (intra-socket), 0.45 us (inter-socket); peak bandwidths of 14,250 MB/s and 13,749 MB/s

18

HPCAC Spain Conference (Sept '15)

Page 19:

HPCAC Spain Conference (Sept '15) 19

Minimizing Memory Footprint further by DC Transport

[Figure: processes P0-P7 on Nodes 0-3 communicating over an IB network]

• Constant connection cost (One QP for any peer)

• Full Feature Set (RDMA, Atomics etc)

• Separate objects for send (DC Initiator) and receive (DC Target)

– DC Target identified by “DCT Number”

– Messages routed with (DCT Number, LID)

– Requires same “DC Key” to enable communication

• Available with MVAPICH2-X 2.2a

• DCT support available in Mellanox OFED

[Charts: (left) normalized execution time of NAMD (apoa1, large data set) at 160-620 processes for RC, DC-Pool, UD, and XRC transports; (right) connection memory (KB) for Alltoall at 80-640 processes: RC grows with process count (labeled 1,022 and 4,797 KB) while DC-Pool, UD, and XRC stay small (labeled values between 1 and 35 KB)]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected

Transport (DCT) of InfiniBand : Early Experiences. IEEE International Supercomputing Conference (ISC ’14).

Page 20:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

20HPCAC Spain Conference (Sept '15)

Page 21:

HPCAC Spain Conference (Sept '15) 21

Hardware Multicast-aware MPI_Bcast on Stampede

[Charts: MPI_Bcast latency (us) vs. message size for small messages (2-512 bytes) and large messages (2K-128K bytes) at 102,400 cores, plus latency vs. number of nodes for 16-byte and 32-KByte messages, comparing the Default and hardware-Multicast-based designs]

ConnectX-3-FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

Page 22:

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available with MVAPICH2-X 2.2a)

[Charts: P3DFFT application run-time (s) vs. data size (512-800), Pre-Conjugate Gradient (PCG) run-time (s) vs. number of processes (64-512) for PCG-Default vs. Modified-PCG-Offload, and normalized HPL performance for HPL-Offload, HPL-1ring, and HPL-Host vs. HPL problem size (N) as % of total memory]

22

Modified P3DFFT with Offload-Alltoall does up to 17% better than default version (128 Processes)

K. Kandalla, et. al.. High-Performance and Scalable Non-Blocking

All-to-All with Collective Offload on InfiniBand Clusters: A Study

with Parallel 3D FFT, ISC 2011

HPCAC Spain Conference (Sept '15)


Modified HPL with Offload-Bcast does up to 4.5% better than default version (512 Processes)

Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than default version

K. Kandalla, et. al, Designing Non-blocking Broadcast with Collective

Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et. al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12


Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS’ 12
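The application-level pattern behind these results is the MPI-3 non-blocking collective interface: start the collective, overlap it with independent computation (ideally progressed by offload hardware), then complete it. A minimal sketch (not from the original slides), with do_local_work() as a hypothetical compute routine:

#include <mpi.h>
#include <stdio.h>

#define N (1 << 14)

static void do_local_work(double *grid, int n)   /* hypothetical computation independent of the reduction */
{
    for (int i = 0; i < n; i++)
        grid[i] *= 0.5;
}

int main(int argc, char **argv)
{
    static double grid[N];
    double local_dot = 1.0, global_dot = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    for (int i = 0; i < N; i++)
        grid[i] = 1.0;

    /* Start the reduction, then overlap it with work that does not depend on its result */
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    do_local_work(grid, N);
    MPI_Wait(&req, MPI_STATUS_IGNORE);           /* global_dot is valid only after completion */

    printf("global dot = %f\n", global_dot);
    MPI_Finalize();
    return 0;
}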

Page 23:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• MPI-T Interface

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

23HPCAC Spain Conference (Sept '15)

Page 24:

[Figure: GPU and NIC attached to the CPU over PCIe; NIC connected to the switch]

At Sender:

cudaMemcpy(s_hostbuf, s_devbuf, . . .);

MPI_Send(s_hostbuf, size, . . .);

At Receiver:

MPI_Recv(r_hostbuf, size, . . .);

cudaMemcpy(r_devbuf, r_hostbuf, . . .);

• Data movement in applications with standard MPI and CUDA interfaces

High Productivity and Low Performance

24HPCAC Spain Conference (Sept '15)

MPI + CUDA - Naive

Page 25:

[Figure: GPU and NIC attached to the CPU over PCIe; NIC connected to the switch]

At Sender:

for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);

for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, …);
}
MPI_Waitall(…);

<<Similar at receiver>>

• Pipelining at user level with non-blocking MPI and CUDA interfaces

Low Productivity and High Performance

25HPCAC Spain Conference (Sept '15)

MPI + CUDA - Advanced

Page 26:

At Sender:

MPI_Send(s_devbuf, size, …);

At Receiver:

MPI_Recv(r_devbuf, size, …);

(All staging and pipelining handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement

• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)

• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity

26HPCAC Spain Conference (Sept '15)

GPU-Aware MPI Library: MVAPICH2-GPU
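A minimal sketch of what this looks like from the application side is given below (not from the original slides): device pointers are passed directly to MPI, and the library moves the data internally. It assumes a CUDA-aware build such as MVAPICH2-GPU/GDR, typically run with MV2_USE_CUDA=1; the buffer size and device selection are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                       /* 1M floats (illustrative) */
    int rank;
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(0);                            /* assumes one GPU visible per process */
    cudaMalloc((void **) &devbuf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(devbuf, 0, n * sizeof(float));
        /* Device pointer passed straight to MPI; no explicit host staging */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats directly into GPU memory\n", n);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}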

Page 27:

CUDA-Aware MPI: MVAPICH2 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory

• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)

• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node

• Optimized and tuned collectives for GPU device buffers

• MPI datatype support for point-to-point and collective communication from GPU device buffers

27HPCAC Spain Conference (Sept '15)

Page 28:

• OFED with support for GPUDirect RDMA is

developed by NVIDIA and Mellanox

• OSU has a design of MVAPICH2 using GPUDirect

RDMA

– Hybrid design using GPU-Direct RDMA

• GPUDirect RDMA and Host-based pipelining

• Alleviates P2P bandwidth bottlenecks on

SandyBridge and IvyBridge

– Support for communication using multi-rail

– Support for Mellanox Connect-IB and ConnectX

VPI adapters

– Support for RoCE with Mellanox ConnectX VPI

adapters

HPCAC Spain Conference (Sept '15) 28

GPU-Direct RDMA (GDR) with CUDA

[Figure: GPUDirect RDMA data path between GPU memory and the IB adapter through the chipset, alongside the host (CPU/system memory) path]
– SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
– IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s

Page 29:

29

Performance of MVAPICH2-GDR with GPU-Direct-RDMA

HPCAC Spain Conference (Sept '15)

MVAPICH2-GDR-2.1RC2Intel Ivy Bridge (E5-2680 v2) node - 20 cores

NVIDIA Tesla K40c GPUMellanox Connect-IB Dual-FDR HCA

CUDA 7Mellanox OFED 2.4 with GPU-Direct-RDMA

[Charts: GPU-GPU inter-node small-message latency, uni-directional bandwidth, and bi-directional bandwidth for MV2-GDR 2.1RC2, MV2-GDR 2.0b, and MV2 without GDR; MV2-GDR 2.1RC2 reaches 2.15 usec small-message latency, with annotated gains of 3.5X and 9.3X (latency) and 2X and 11X (both bandwidth charts) over the older release and the non-GDR design]

Page 30:

HPCAC Spain Conference (Sept '15) 30

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)

• HOOMD-blue Version 1.0.5
• GDRCOPY enabled, with runtime parameters: MV2_USE_CUDA=1, MV2_IBA_HCA=mlx5_0, MV2_IBA_EAGER_THRESHOLD=32768, MV2_VBUF_TOTAL_SIZE=32768, MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768, MV2_USE_GPUDIRECT_GDRCOPY=1, MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Page 31:

• MVAPICH2-GDR 2.1RC2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download

• System software requirements

• Mellanox OFED 2.1 or later

• NVIDIA Driver 331.20 or later

• NVIDIA CUDA Toolkit 6.0 or later

• Plugin for GPUDirect RDMA

– http://www.mellanox.com/page/products_dyn?product_family=116

– Strongly Recommended : use the new GDRCOPY module from NVIDIA

• https://github.com/NVIDIA/gdrcopy

• Has optimized designs for point-to-point communication using GDR

• Contact MVAPICH help list with any questions related to the package

[email protected]

Using MVAPICH2-GDR Version

31HPCAC Brazil Conference (Aug ‘15)

Page 32:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

32HPCAC Spain Conference (Sept '15)

Page 33:

MPI Applications on MIC Clusters

[Figure: MPI job placement modes across Xeon (multi-core centric) and Xeon Phi (many-core centric): Host-only, Offload (/reverse Offload) with offloaded computation, Symmetric, and Coprocessor-only]

• Flexibility in launching MPI jobs on clusters with Xeon Phi

33HPCAC Spain Conference (Sept '15)

Page 34:

34

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

• Offload Mode

• Intranode Communication

• Coprocessor-only and Symmetric Mode

• Internode Communication

• Coprocessors-only and Symmetric Mode

• Multi-MIC Node Configurations

• Running on three major systems

• Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

HPCAC Spain Conference (Sept '15)

Page 35:

MIC-Remote-MIC P2P Communication with Proxy-based Communication

35HPCAC Spain Conference (Sept '15)

[Charts: MIC-to-remote-MIC point-to-point latency (large messages, 8K-2M bytes) and bandwidth (1 byte-1M bytes) with proxy-based communication, for intra-socket and inter-socket P2P; peak bandwidths of 5,236 and 5,594 MB/sec]

Page 36:

36

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

HPCAC Spain Conference (Sept '15)

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014

[Charts: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M) for MV2-MIC vs. MV2-MIC-Opt, plus P3DFFT communication and computation time on 32 nodes (8H + 8M, size 2K x 2K x 1K); annotated improvements of 76%, 58%, and 55%]

Page 37:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

37HPCAC Spain Conference (Sept '15)

Page 38:

MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

HPCAC Spain Conference (Sept '15)

[Figure: MPI, OpenSHMEM, UPC, CAF or Hybrid (MPI + PGAS) applications issue OpenSHMEM, MPI, UPC, and CAF calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, or iWARP]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with

MVAPICH2-X 1.9 (2012) onwards!

– http://mvapich.cse.ohio-state.edu

• Feature Highlights

– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM,

MPI(+OpenMP) + UPC

– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard

compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)

– Scalable Inter-node and intra-node communication – point-to-point and collectives


38

Page 39:

Application Level Performance with Graph500 and Sort

Graph500 Execution Time

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,

International Supercomputing Conference (ISC’13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l

Conference on Parallel Processing (ICPP '12), September 2012

[Chart: Graph500 execution time (s) at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs]

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.

Sort Execution Time

[Chart: Sort execution time (s) for input/process combinations 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K, MPI vs. Hybrid; 51% improvement at the largest scale]

• Performance of Hybrid (MPI+OpenSHMEM) Sort Application
– 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min), Hybrid 1172 sec (0.36 TB/min), a 51% improvement over the MPI design

39HPCAC Spain Conference (Sept '15)

Page 40:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

40HPCAC Spain Conference (Sept '15)

Page 41:

• Virtualization has many benefits
– Job migration

– Compaction

• Not very popular in HPC due to overhead associated with Virtualization

• New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters

• Enhanced MVAPICH2 support for SR-IOV

• Publicly available as MVAPICH2-Virt library

Can HPC and Virtualization be Combined?

41HPCAC Spain Conference (Sept '15)

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV

based Virtualized InfiniBand Clusters? EuroPar'14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14

J. Zhang, X .Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient

Approach to build HPC Clouds, CCGrid’15

Page 42:

[Charts: Graph500 execution time for problem sizes (scale, edgefactor) from (20,10) to (22,20) and NAS Class C execution time (FT-64-C, EP-64-C, LU-64-C, BT-64-C), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native; annotated overheads of 4% and 9% (Graph500) and 1% and 9% (NAS)]

• Compared to Native, 1-9% overhead for NAS

• Compared to Native, 4-9% overhead for Graph500

HPCAC Spain Conference (Sept '15) 65

Application-Level Performance (8 VM * 8 Core/VM)

NAS Graph500

Page 43:

HPCAC Spain Conference (Sept '15)

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument

– Targeting Big Data, Big Compute, Big Instrument research

– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network

• Reconfigurable instrument

– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use

• Connected instrument

– Workload and Trace Archive

– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others

– Partnerships with users

• Complementary instrument

– Complementing GENI, Grid’5000, and other testbeds

• Sustainable instrument

– Industry connections http://www.chameleoncloud.org/

43

Page 44:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

44HPCAC Spain Conference (Sept '15)

Page 45:

• MVAPICH2-EA (Energy-Aware)

• A white-box approach

• New Energy-Efficient communication protocols for pt-pt and

collective operations

• Intelligently apply the appropriate Energy saving techniques

• Application oblivious energy saving

• OEMT

• A library utility to measure energy consumption for MPI applications

• Works with all MPI runtimes

• PRELOAD option for precompiled applications

• Does not require ROOT permission:

• A safe kernel module to read only a subset of MSRs

45

Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

HPCAC Spain Conference (Sept '15)

Page 46:

• An energy efficient runtime that provides energy savings without application knowledge

• Uses automatically and transparently the best energy lever

• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation

• Pessimistic MPI applies energy reduction lever to each MPI call

46

MV2-EA : Application Oblivious Energy-Aware-MPI (EAM)

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

HPCAC Spain Conference (Sept '15)


Page 47:

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based

– Offload and Non-blocking

• Integrated Support for GPGPUs

• Integrated Support for Intel MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

47HPCAC Spain Conference (Sept '15)

Page 48:

HPCAC Spain Conference (Sept '15) 48

OSU InfiniBand Network Analysis Monitoring Tool (INAM) – Network Level View

• Show network topology of large clusters

• Visualize traffic pattern on different links

• Quickly identify congested links/links in error state

• See the history unfold – play back historical state of the network

Full Network (152 nodes) Zoomed-in View of the Network

Page 49:

HPCAC Spain Conference (Sept '15) 49

Upcoming OSU INAM Tool – Job and Node Level Views

Visualizing a Job (5 Nodes) Finding Routes Between Nodes

• Job level view
– Show different network metrics (load, error, etc.) for any live job
– Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per process or per node
– CPU utilization for each rank/node
– Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
– Network metrics (e.g. XmitDiscard, RcvError) per rank/node

Page 50:

• Performance and Memory scalability toward 500K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …)
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features
– User Mode Memory Registration (UMR)

– On-demand Paging

• Enhanced Inter-node and Intra-node communication schemes for upcoming OmniPath enabled Knights Landing architectures

• Extended RMA support (as in MPI 3.0)

• Extended topology-aware collectives

• Energy-aware point-to-point (one-sided and two-sided) and collectives

• Extended Support for MPI Tools Interface (as in MPI 3.0)

• Extended Checkpoint-Restart and migration support with SCR

MVAPICH2 – Plans for Exascale

50HPCAC Spain Conference (Sept '15)

Page 51:

• Scientific Computing

– Message Passing Interface (MPI), including MPI + OpenMP, is the

Dominant Programming Model

– Many discussions towards Partitioned Global Address Space

(PGAS)

• UPC, OpenSHMEM, CAF, etc.

– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)

• Big Data/Enterprise/Commercial Computing

– Focuses on large data and data analysis

– Hadoop (HDFS, HBase, MapReduce)

– Spark is emerging for in-memory computing

– Memcached is also used for Web 2.0

51

Two Major Categories of Applications

HPCAC Spain Conference (Sept '15)

Page 52:

Can High-Performance Interconnects Benefit Big Data Computing?

• Most of the current Big Data systems use Ethernet infrastructure with Sockets

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw interest

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?

• Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop and Memcached?

HPCAC Spain Conference (Sept '15) 52

Page 53:

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

[Figure: layered design]
– Applications
– Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
– Programming Models (Sockets); Other Protocols? (RDMA Protocol); Upper-level changes?
– Communication and I/O Library: Point-to-Point Communication, QoS, Threaded Models and Synchronization, Fault-Tolerance, I/O and File Systems, Virtualization, Benchmarks
– Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators); Storage Technologies (HDD and SSD)

HPCAC Spain Conference (Sept '15)

53

Page 54:

Design Overview of HDFS with RDMA

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based HDFS with communication library written in native code

[Figure: applications issue HDFS operations; in the OSU design, writes go through the Java Native Interface (JNI) to a native verbs library over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while other operations use the Java socket interface over the 1/10 GigE or IPoIB network]

• Design Features

– RDMA-based HDFS write

– RDMA-based HDFS replication

– Parallel replication support

– On-demand connection setup

– InfiniBand/RoCE support

HPCAC Spain Conference (Sept '15) 54

Page 55:

Communication Times in HDFS

• Cluster with HDD DataNodes

– 30% improvement in communication time over IPoIB (QDR)

– 56% improvement in communication time over 10GigE

• Similar improvements are obtained for SSD DataNodes


HPCAC Spain Conference (Sept '15)

[Chart: HDFS communication time (s) for 2-10 GB files on HDD DataNodes, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR)]

55

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,

High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in

RDMA-Enhanced HDFS, HPDC '14, June 2014

Page 56:

Triple-H

Heterogeneous Storage

• Design Features

– Three modes

• Default (HHH)

• In-Memory (HHH-M)

• Lustre-Integrated (HHH-L)

– Policies to efficiently utilize the

heterogeneous storage devices

• RAM, SSD, HDD, Lustre

– Eviction/Promotion based on

data usage pattern

– Hybrid Replication

– Lustre-Integrated mode:

• Lustre-based fault-tolerance

Enhanced HDFS with In-memory and Heterogeneous Storage

[Figure: Triple-H data placement policies across RAM Disk, SSD, HDD, and Lustre, with hybrid replication and eviction/promotion]

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS

on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015


HPCAC Spain Conference (Sept '15) 56

Page 57:

Performance Improvement on TACC Stampede (HHH)

• For 160GB TestDFSIO in 32 nodes
– Write Throughput: 7x improvement over IPoIB (FDR)
– Read Throughput: 2x improvement over IPoIB (FDR)
• For 120GB RandomWriter in 32 nodes
– 3x improvement over IPoIB (QDR)

[Charts: TestDFSIO total throughput (MBps) for write and read, and RandomWriter execution time (s) for cluster:data sizes 8:30, 16:60, and 32:120, each comparing IPoIB and OSU-IB]

HPCAC Spain Conference (Sept '15)

57

Page 58:

• For 200GB TeraGen on 32 nodes

– Spark-TeraGen: HHH has 2.1x improvement over HDFS-IPoIB (QDR)

– Spark-TeraSort: HHH has 16% improvement over HDFS-IPoIB (QDR)

58

Evaluation with Spark on SDSC Gordon (HHH)

[Charts: TeraGen and TeraSort execution time (s) for cluster:data sizes 8:50, 16:100, and 32:200 (GB), comparing HDFS over IPoIB (QDR) and OSU-IB (QDR); TeraGen reduced by 2.1x, TeraSort by 16%]

HPCAC Spain Conference (Sept '15)

N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-

Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015

Page 59:

Design Overview of MapReduce with RDMA

[Figure: MapReduce (JobTracker, TaskTracker, Map, Reduce) applications over the OSU design: the Java Native Interface (JNI) bridges to a native verbs library on RDMA-capable networks (IB, 10GE/iWARP, RoCE), alongside the Java socket interface over the 1/10 GigE or IPoIB network]

HPCAC Spain Conference (Sept '15)

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Java based MapReduce with communication library written in native code

• Design Features

– RDMA-based shuffle

– Prefetching and caching map output

– Efficient Shuffle Algorithms

– In-memory merge

– On-demand Shuffle Adjustment

– Advanced overlapping

• map, shuffle, and merge

• shuffle, merge, and reduce

– On-demand connection setup

– InfiniBand/RoCE support

59

Page 60:

Performance Evaluation of Sort and TeraSort

• For 240GB Sort in 64 nodes (512 cores): 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For 320GB TeraSort in 64 nodes (1K cores): 38% improvement over IPoIB (FDR) with HDD used for HDFS

[Charts: Sort job execution time on the OSU cluster (60/120/240 GB on 16/32/64 nodes) and TeraSort on TACC Stampede (80/160/320 GB on 16/32/64 nodes), comparing IPoIB, UDA-IB, and OSU-IB]

HPCAC Spain Conference (Sept '15)

60

Page 61:


Design Overview of Shuffle Strategies for MapReduce over Lustre

HPCAC Spain Conference (Sept '15)

• Design Features

– Two shuffle approaches

• Lustre read based shuffle

• RDMA based shuffle

– Hybrid shuffle algorithm to take benefit from both shuffle approaches

– Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation

– In-memory merge and overlapping of different phases are kept similar to RDMA-enhanced MapReduce design

[Figure: Map tasks 1-3 write intermediate data to Lustre; Reduce tasks 1-2 fetch it via Lustre reads or RDMA-based shuffle, followed by in-memory merge/sort and reduce]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN

MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.


61

Page 62:

• For 500GB Sort in 64 nodes

– 44% improvement over IPoIB (FDR)

Performance Improvement of MapReduce over Lustre on TACC-Stampede

HPCAC Spain Conference (Sept '15)

• For 640GB Sort in 128 nodes

– 48% improvement over IPoIB (FDR)

[Charts: job execution time (sec) for 300-500 GB sorts on 64 nodes and for 20 GB-640 GB sorts on 4-128 nodes, comparing IPoIB (FDR) and OSU-IB (FDR)]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-

based Approach Benefit?, Euro-Par, August 2014.

• Local disk is used as the intermediate data directory


62

Page 63:

Design Overview of Spark with RDMA

• Design Features

– RDMA based shuffle

– SEDA-based plugins

– Dynamic connection management and sharing

– Non-blocking and out-of-order data transfer

– Off-JVM-heap buffer management

– InfiniBand/RoCE support

HPCAC Spain Conference (Sept '15)

• Enables high performance RDMA communication, while supporting traditional socket interface

• JNI Layer bridges Scala based Spark with communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data

Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014

63

Page 64:

Performance Evaluation on TACC Stampede - SortByTest

• Intel SandyBridge + FDR, 16 Worker Nodes, 256 Cores, (256M 256R)

• RDMA-based design for Spark 1.4.0

• RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and

RamDisk. For SortByKey Test:

– Shuffle time reduced by up to 77% over IPoIB (56Gbps)

– Total time reduced by up to 58% over IPoIB (56Gbps)

[Charts: SortByTest shuffle time and total time (sec) on 16 worker nodes for 16-64 GB data sizes, IPoIB vs. RDMA; shuffle time reduced by up to 77% and total time by up to 58%]

HPCAC Spain Conference (Sept '15) 64

Page 65:

Memcached-RDMA Design

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

– Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread

• Memcached Server can serve both socket and verbs clients simultaneously

– All other Memcached data structures are shared among RDMA and Sockets worker threads

• Memcached applications need not be modified; uses verbs interface if available

• High performance design of SSD-Assisted Hybrid Memory
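To make the "no application changes" point concrete, the sketch below shows an ordinary libmemcached client; with the OSU RDMA-Memcached libraries the same application code is intended to run over the verbs transport transparently. This is not OSU code, and the hostname and port are illustrative.

#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    /* "memcached-host" and 11211 are illustrative */
    memcached_server_list_st servers = memcached_server_list_append(NULL, "memcached-host", 11211, &rc);
    memcached_server_push(memc, servers);

    const char *key = "answer", *val = "42";
    rc = memcached_set(memc, key, strlen(key), val, strlen(val), (time_t) 0, (uint32_t) 0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len;
    uint32_t flags;
    char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (out != NULL) {
        printf("%s -> %.*s\n", key, (int) len, out);
        free(out);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}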

[Figure: Memcached-RDMA architecture: sockets and RDMA clients first contact the master thread (step 1) and are then bound to sockets or verbs worker threads (step 2); all worker threads share the common data structures (memory and SSD slabs/items)]

65HPCAC Spain Conference (Sept '15)

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda,

Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for

InfiniBand Clusters using Hybrid Transport, CCGrid’12

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads

with Memcached and MySQL, ISPASS’15

Page 66:

• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks

• Memcached-RDMA can

- improve query latency by up to 70% over IPoIB (32Gbps)

- improve throughput by up to 2X over IPoIB (32Gbps)

- No overhead in using hybrid mode with SSD when all data can fit in memory

Performance Benefits on SDSC-Gordon – OHB Latency & Throughput Micro-Benchmarks

[Charts: average latency (us) vs. message size and throughput (million transactions/sec) vs. number of clients (64-1024), comparing IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); up to 2X throughput over IPoIB]

66

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads, IEEE BigData ‘15.

HPCAC Spain Conference (Sept '15)

Page 67:

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)

• RDMA for Memcached (RDMA-Memcached)

• OSU HiBD-Benchmarks (OHB)

– HDFS and Memcached Micro-benchmarks

• http://hibd.cse.ohio-state.edu

• Users Base: 125 organizations from 20 countries

• More than 13,000 downloads from the project site

• RDMA for Apache HBase and Spark

The High-Performance Big Data (HiBD) Project

HPCAC Spain Conference (Sept '15) 67

Page 68:

• Upcoming Releases of RDMA-enhanced Packages will support

– Spark

– HBase

• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will

support

– MapReduce

– RPC

• Advanced designs with upper-level changes and optimizations

• E.g. MEM-HDFS

HPCAC Spain Conference (Sept '15)

Future Plans of OSU High Performance Big Data (HiBD) Project

68

Page 69:

• Exascale systems will be constrained by
– Power
– Memory per core
– Data movement cost
– Faults
• Programming Models and Runtimes for HPC and Big Data need to be designed for
– Scalability

– Performance

– Fault-resilience

– Energy-awareness

– Programmability

– Productivity

• Highlighted some of the issues and challenges

• Need continuous innovation on all these fronts

Looking into the Future ….

69HPCAC Spain Conference (Sept '15)

Page 70:

HPCAC Spain Conference (Sept '15)

Funding Acknowledgments

Funding Support by

Equipment Support by

70

Page 71:

Personnel AcknowledgmentsCurrent Students

– A. Augustine (M.S.)

– A. Awan (Ph.D.)

– A. Bhat (M.S.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist– S. Sur

Current Post-Doc

– J. Lin

– D. Shankar

Current Programmer

– J. Perkins

Past Post-Docs– H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– K. Kulkarni (M.S.)

– M. Li (Ph.D.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Senior Research Associates

– K. Hamidouche

– X. Lu

Past Programmers

– D. Bureddy

– H. Subramoni

Current Research Specialist

– M. Arnold

HPCAC Spain Conference (Sept '15) 71

Page 72:

[email protected]

HPCAC Spain Conference (Sept '15)

Thank You!

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/

72

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

Page 73:

73

Call For Participation

International Workshop on Extreme Scale

Programming Models and Middleware

(ESPM2 2015)

In conjunction with Supercomputing Conference

(SC 2015)

Austin, USA, November 15th, 2015

http://web.cse.ohio-state.edu/~hamidouc/ESPM2/

HPCAC Spain Conference (Sept '15)

Page 74:

• Looking for Bright and Enthusiastic Personnel to join as

– Post-Doctoral Researchers

– PhD Students

– Hadoop/Big Data Programmer/Software Engineer

– MPI Programmer/Software Engineer

• If interested, please contact me at this conference and/or send

an e-mail to [email protected]

74

Multiple Positions Available in My Group

HPCAC Spain Conference (Sept '15)