[slides] parallel and distributed computing on low latency clusters


Slides from the thesis defense at the University of Illinois at Chicago by Vittorio Giovara.


Parallel and Distributed Computing on Low Latency Clusters

Vittorio Giovara
M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009

Contents

• Motivation

• Strategy

• Technologies

• OpenMP

• MPI

• Infiniband

• Application

• Compiler Optimizations

• OpenMP and MPI over Infiniband

• Results

• Conclusions

Motivation

Motivation

• Scaling trend has to stop for CMOS technology:

✓ Direct-tunneling limit in SiO2 ~3 nm

✓ Distance between Si atoms ~0.3 nm

✓ Variability

• Fundamental reason: rising fabrication cost

Motivation

• Easy to build multi-core processors

• Requires human effort to modify and adapt software for concurrency

• New classification for computer architectures

Classification

[Diagram: Flynn's taxonomy (SISD, SIMD, MISD, MIMD), each class drawn as one or more CPUs attached to an instruction pool and a data pool]

Levels

[Diagram: parallelization levels by abstraction, from easier to harder to parallelize: algorithm, loop level, process management; issues at each level include data dependency, branching overhead, control flow, recursion, memory management and profiling; spans SMP, multiprogramming, and multithreading and scheduling]

Backfire

• Difficulty in fully exploiting the offered parallelism

• Automatic tools required to adapt software to parallelism

• Compiler support for manual or semi-automatic enhancement

Applications

• OpenMP and MPI are two popular tools that simplify the parallelization of both new and existing software

• Mathematics and Physics

• Computer Science

• Biomedicine

Specific Problem and Background

• Sally3D is a micromagnetism program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering)

• Computationally intensive (runs can take days of CPU time); a speedup is required

• Previous work does not fully cover the problem (no Infiniband or OpenMP+MPI solutions)

Strategy

Strategy

• Install a Linux kernel with an ad-hoc configuration for scientific computation

• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)

• Add the Infiniband link among the cluster nodes, with the proper drivers in kernel and user space

• Select an MPI implementation library

Strategy

• Verify the Infiniband network through some MPI test examples (a minimal test is sketched after this list)

• Install the target software

• Proceed to include OpenMP and MPI directives in the code

• Run test cases
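The slides do not include the verification code itself; the following is a minimal sketch of the kind of MPI check that fits this step (the file name and output format are illustrative assumptions, not from the original slides):

/* mpi_hello.c - hypothetical sanity check: each rank reports the node it runs on.
   If ranks on different hosts answer, the MPI stack and the interconnect are usable. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Built with the mpicc wrapper of the chosen MPI library and launched across both hosts, each rank should report a different node name if the fabric is configured correctly.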

OpenMP

• standard

• supported by most modern compilers

• requires little knowledge of the software

• very simple constructs

OpenMP - example

[Diagram: four independent parallel tasks]

OpenMP - example

[Diagram: fork-join model: the master thread forks into threads A and B, which execute parallel tasks 1 to 4 and then join back into the master thread]
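As a hedged illustration of the fork-join model sketched above (this code is not from the original slides), OpenMP sections let the master thread fork a team of threads that split independent tasks and implicitly join at the end of the parallel region:

/* Illustrative fork-join example: four independent tasks run on the thread team,
   which joins back into the master thread at the end of the parallel region. */
#include <stdio.h>
#include <omp.h>

static void task(int id) { printf("task %d on thread %d\n", id, omp_get_thread_num()); }

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        task(1);
        #pragma omp section
        task(2);
        #pragma omp section
        task(3);
        #pragma omp section
        task(4);
    }   /* implicit join: execution continues on the master thread */
    return 0;
}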

OpenMP Scheduler

• Which scheduler works best for the available hardware? (see the sketch after this list)

- Static

- Dynamic

- Guided
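A minimal sketch, assuming a simple array-scaling loop (the loop body and chunk size are illustrative, not from the slides), of how the scheduling policy is selected with OpenMP's schedule clause:

/* Illustrative use of the OpenMP schedule clause: the policy keyword
   (static, dynamic or guided) and the chunk size are the parameters
   compared in the charts that follow. */
#define N 100000

void scale(double *a, double s)
{
    /* swap in schedule(static, 100) or schedule(guided, 100) to compare policies */
    #pragma omp parallel for schedule(dynamic, 100)
    for (int i = 0; i < N; i++)
        a[i] *= s;
}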

OpenMP Scheduler

[Chart: OpenMP static scheduler, execution time in microseconds vs. number of threads (1 to 16), for chunk sizes 1, 10, 100, 1000 and 10000]

OpenMP Scheduler

[Chart: OpenMP dynamic scheduler, execution time in microseconds vs. number of threads (1 to 16), for chunk sizes 1, 10, 100, 1000 and 10000]

OpenMP Scheduler

[Chart: OpenMP guided scheduler, execution time in microseconds vs. number of threads (1 to 16), for chunk sizes 1, 10, 100, 1000 and 10000]

OpenMP Scheduler

[Chart: comparison of the static, dynamic and guided schedulers]

MPI

• standard

• widely used in cluster environments

• many transport links supported

• different implementations available

- OpenMPI

- MVAPICH

Infiniband

• standard

• widely used in cluster environments

• very low latency for small packets

• up to 16 Gb/s transfer speed

MPI over Infiniband

[Chart: message latency in microseconds (log scale) vs. message size from 1 kB to 16 GB, comparing OpenMPI and MVAPICH2]

MPI over Infiniband

[Chart: message latency in microseconds (log scale) vs. message size from 1 kB to 8 MB, comparing OpenMPI and MVAPICH2]
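Latency curves such as those above are typically produced with a ping-pong benchmark; the following simplified sketch (message size, repetition count and output format are illustrative assumptions, not the benchmark used for the slides) times a round trip between rank 0 and rank 1 with MPI_Wtime:

/* Simplified ping-pong sketch: rank 0 sends a buffer to rank 1 and waits for it back;
   half the averaged round-trip time approximates the one-way latency for that size. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int reps = 100;
    const int size = 1024;              /* message size under test, e.g. 1 kB */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size, dt / reps / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

The same source can be linked against either OpenMPI or MVAPICH2 through their respective mpicc wrappers, which is how the two implementations are compared.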

Optimizations

• Active at compile time

• Available only after porting the software to standard FORTRAN

• Consistent documentation available

• Unexpectedly positive results

Optimizations

• -march=native

• -O3

• -ffast-math

• -Wl,-O1

Target Software

Target Software

• Sally3D

• micromagnetic equation solver

• written in FORTRAN with some C libraries

• the program uses a linear formulation of the mathematical models

Implementation Scheme

[Diagram: in the standard programming model the loop is sequential; with OpenMP it becomes a parallel loop of OpenMP threads on one host; with MPI it becomes a distributed loop across Host 1 and Host 2, each running its own OpenMP threads]

Implementation Scheme

• Data Structure: not embarrassingly parallel

• Three dimensional matrix

• Several temporary arrays; synchronization objects required

➡ send() and recv() mechanism

➡ critical regions using OpenMP directives

➡ function merging

➡ matrix conversion
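A hedged sketch of the hybrid pattern listed above (array dimensions, function name and the placeholder computation are illustrative, not taken from Sally3D): MPI distributes slabs of the three-dimensional matrix across hosts, OpenMP parallelizes the local loop, and an OpenMP critical region protects a shared accumulator before the partial results are combined.

/* Illustrative hybrid loop: each MPI rank works on a slab of the 3D array,
   OpenMP threads share the slab's outer loop, and an OpenMP critical region
   guards the shared accumulator; the per-rank exchange is collapsed into a
   reduction here for brevity. */
#include <mpi.h>
#include <omp.h>

#define NY 64
#define NZ 64

double process_slab(double field[][NY][NZ], int nx_local)
{
    double local_sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < nx_local; i++)
        for (int j = 0; j < NY; j++)
            for (int k = 0; k < NZ; k++) {
                double v = field[i][j][k] * 0.5;   /* placeholder computation */
                #pragma omp critical
                local_sum += v;
            }

    /* combine the per-rank partial results across hosts */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global_sum;
}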

Results

Results

OMP   MPI   OPT   seconds
 *     *     *        133
 *     *     -        400
 *     -     *        186
 *     -     -        487
 -     *     *        200
 -     *     -        792
 -     -     *        246
 -     -     -       1062

Total Speed Increase: 87.52%

Actual Results

OMP   MPI   seconds
 *     *         59
 *     -        129
 -     *        174
 -     -        249

Function Name      Normal    OpenMP    MPI       OpenMP+MPI
calc_intmudua      24.5 s    4.7 s     14.4 s    2.8 s
calc_hdmg_tet      16.9 s    3.0 s     10.8 s    1.7 s
calc_mudua         12.1 s    1.9 s     7.0 s     1.1 s
campo_effettivo    17.7 s    4.5 s     9.9 s     2.3 s

Actual Results

Total Raw Speed Increment: 76%

• OpenMP – 6-8x

• MPI – 2x

• OpenMP + MPI – 14-16x

Conclusions

Conclusions and Future Work

• Computational time has been significantly decreased

• Speedup is consistent with expected results

• Submitted to COMPUMAG ‘09

• Continue inserting OpenMP and MPI directives

• Perform algorithm optimizations

• Increase cluster size
