
Page 1

Matthieu Dorier [email protected] KerData Team Inria Rennes, IRISA ENS Cachan

Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf

Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O

IEEE CLUSTER 2012 – Beijing September 24 - 28

September 25th 2012

Page 2

Context: HPC simulation on Blue Waters

•  INRIA – UIUC - ANL Joint Lab for Petascale Computing

•  Targeting large-scale simulation of unprecedented accuracy

•  One of our challenges: At very large scale, scalability of HPC simulations is mainly driven by I/O

Page 3

Writing from HPC simulations: File per process vs. Collective I/O

•  Two main approaches for I/O in HPC simulations
•  Implemented in MPI-I/O, available in HDF5, NetCDF, …

File per process
•  Too many files
•  High metadata overhead
•  Hard to read back

Collective I/O
•  Requires coordination
•  Data communications
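The trade-off between the two approaches can be illustrated with a minimal, single-process sketch. It is an emulation under stated assumptions: the "ranks" are plain loop iterations rather than MPI processes, no MPI-I/O or HDF5 call is made, and the helper names (`write_file_per_process`, `write_shared_file`, `demo`) are hypothetical. It only shows that file-per-process produces one file per rank, while a shared file needs every rank to agree on non-overlapping offsets.

```python
import os
import tempfile


def write_file_per_process(directory, blocks):
    """Each simulated rank writes its own file; returns the created paths."""
    paths = []
    for rank, block in enumerate(blocks):
        path = os.path.join(directory, f"rank_{rank:05d}.dat")
        with open(path, "wb") as f:
            f.write(block)
        paths.append(path)
    return paths


def write_shared_file(path, blocks):
    """All simulated ranks write into one file at agreed, non-overlapping offsets."""
    block_size = len(blocks[0])
    if any(len(block) != block_size for block in blocks):
        raise ValueError("this sketch assumes equally sized blocks")
    with open(path, "wb") as f:
        for rank, block in enumerate(blocks):
            f.seek(rank * block_size)  # the offset is the coordination step
            f.write(block)
    return path


def demo(n_ranks=8, block_size=16):
    """Return (number of per-process files, whether the shared file is contiguous)."""
    blocks = [bytes([rank]) * block_size for rank in range(n_ranks)]
    with tempfile.TemporaryDirectory() as tmp:
        fpp_dir = os.path.join(tmp, "fpp")
        os.mkdir(fpp_dir)
        per_process_paths = write_file_per_process(fpp_dir, blocks)
        shared_path = write_shared_file(os.path.join(tmp, "shared.dat"), blocks)
        with open(shared_path, "rb") as f:
            shared_bytes = f.read()
    return len(per_process_paths), shared_bytes == b"".join(blocks)
```

With thousands of ranks the first function creates thousands of files per snapshot, which is where the metadata overhead comes from. The second keeps a single file but only works if the offsets are coordinated.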

Page 4

Users attempt to overcome performance issues

[Figure panels: File-per-process, Collective, “Hybrid”]

Image credit: Phil Carns (ANL), “Understanding and Improving Computational Science Storage Access through Continuous Characterization”

Page 5

Periodic synchronous snapshots lead to I/O bursts and high variability (jitter)

[Figure: the “cardiogram” of a PVFS data server during a run of the CM1 simulation. Legend: Input, Output]

Page 6

Not only is performance poor, it is also unpredictable!

I/O variability, or “jitter”
•  Write time unpredictability
   - Between two processes
   - Between two I/O phases
   - Between simulations
•  Due to
   - Cross-application interference
   - Many processes writing at the same time
   - Different amounts of data
   - File system configuration
   - …
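One simple way to put numbers on this variability is to compare the minimum, mean and maximum write time within an I/O phase. The sketch below is an illustration only: `jitter_summary` is a hypothetical helper, the chosen statistics are an assumption rather than the metric used in the talk, and the write times are made up.

```python
import statistics


def jitter_summary(write_times):
    """Summarize the spread of per-process write times (in seconds) for one I/O phase."""
    if not write_times:
        raise ValueError("need at least one write time")
    fastest = min(write_times)
    slowest = max(write_times)
    mean = statistics.fmean(write_times)
    return {
        "min": fastest,
        "mean": mean,
        "max": slowest,
        "spread": slowest - fastest,      # gap between fastest and slowest writer
        "max_over_mean": slowest / mean,  # how far the slowest writer is from typical
    }


# Hypothetical write times for four processes in one phase (seconds).
phase = [2.0, 2.5, 3.0, 12.5]
summary = jitter_summary(phase)
```

When all processes synchronize after the write, the phase lasts as long as its slowest writer, so the maximum matters more than the mean.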

Page 7

Can we hide the I/O jitter?

Yes: Damaris

Page 8

DAMARIS “Dedicated Adaptable Middleware for Application Resources Inline Steering”

Page 9

The Damaris approach: dedicated I/O cores

•  Next-generation supercomputers have multicore SMP nodes
•  Network access contention at the level of a node
•  Possibility of efficient interactions through shared memory
•  One core dedicated to gathering data
•  This core only writes
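The idea of handing data to a dedicated writer can be sketched in a few lines. This is not how Damaris is implemented: threads stand in for cores, a `queue.Queue` stands in for the shared-memory segment, and appending to a list stands in for the actual file write. All names (`dedicated_writer`, `run_simulation`) are hypothetical.

```python
import queue
import threading


def dedicated_writer(inbox, sink):
    """Runs on the 'dedicated core': drains the inbox and performs every write."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel: the simulation has finished
            break
        iteration, rank, payload = item
        # Stand-in for a slow, jittery file system write.
        sink.append((iteration, rank, len(payload)))


def run_simulation(n_compute=3, n_iterations=4):
    """Compute threads hand data off and continue; only the writer touches storage."""
    inbox = queue.Queue()
    written = []
    writer = threading.Thread(target=dedicated_writer, args=(inbox, written))
    writer.start()

    def compute(rank):
        for iteration in range(n_iterations):
            payload = bytes([rank]) * 1024         # pretend result of one iteration
            inbox.put((iteration, rank, payload))  # hand off, then keep computing

    workers = [threading.Thread(target=compute, args=(rank,)) for rank in range(n_compute)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    inbox.put(None)  # tell the writer to stop once everything is drained
    writer.join()
    return written
```

The compute threads never wait for storage. Any delay in the write path is absorbed by the writer alone.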

Page 10

Moving jitter to a dedicated core

Leave a core, go faster!

Page 11

Damaris at a glance

Page 16

Application Programming Interface

•  Initializing, finalizing
   - DC_initialize("config.xml", rank), DC_finalize()
   - DC_mpi_init_and_start("config.xml", communicator)
•  Writing data
   - DC_write("group/variable", iteration, data)
   - DC_chunk_write(chunk, "group/variable", iteration, data)
•  Sending events
   - DC_signal("event name", iteration)
•  Direct access to shared memory
   - DC_alloc("group/variable", iteration) → void*
   - DC_commit("group/variable", iteration)
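A typical call order for the functions listed above might look like the following mock. It is only a sketch: the real API targets Fortran, C and C++, and the Python functions here are stand-ins that log their arguments. They do not bind to Damaris, and their return values and error handling are not modeled. The variable name `fields/temperature` and the event name `end_of_iteration` are hypothetical.

```python
calls = []  # log of (function name, arguments...), for illustration only


def DC_initialize(config, rank):
    calls.append(("DC_initialize", config, rank))


def DC_write(name, iteration, data):
    calls.append(("DC_write", name, iteration, len(data)))


def DC_signal(event, iteration):
    calls.append(("DC_signal", event, iteration))


def DC_finalize():
    calls.append(("DC_finalize",))


def simulate(rank, n_iterations):
    """Initialize once, write and signal at every iteration, finalize once."""
    calls.clear()
    DC_initialize("config.xml", rank)
    for iteration in range(n_iterations):
        temperature = [0.0] * 8  # hypothetical result of one iteration
        DC_write("fields/temperature", iteration, temperature)
        DC_signal("end_of_iteration", iteration)
    DC_finalize()
    return list(calls)
```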

Page 17

Damaris: current state of the software

•  Version 0.6 available at http://damaris.gforge.inria.fr/
   - Along with documentation, tutorials and examples
•  Written in C++, uses
   - Boost for IPC, Xerces-C and XSD for XML parsing
•  API for Fortran, C, C++
•  Tested on
   - Debian Linux, Kraken (Cray XT5 - NICS), JaguarPF (Cray XK6 - ORNL), JYC (Blue Waters testing system: Cray XE6 - NCSA), Intrepid (BlueGene/P - ANL)
•  Tested with
   - CM1 (climate), OLAM (climate), GTC (fusion), Nek5000 (CFD)

Page 18

Experimental evaluation

Page 19

Running the CM1 simulation on Kraken, G5K and BluePrint with Damaris

•  The CM1 simulation
   - Atmospheric simulation
   - One of the Blue Waters target applications
   - Uses HDF5 (file-per-process) and pHDF5 (for collective I/O)
•  Kraken
   - Cray XT5 at NICS
   - 12 cores/node
   - 16 GB/node
   - Lustre file system
•  Grid 5000
   - 24 cores/node
   - 48 GB/node
   - PVFS file system
•  BluePrint
   - Power5
   - 16 cores/node
   - 64 GB/node
   - GPFS file system

Page 20

Damaris achieves almost perfect scalability

[Figure: scalability factor vs. number of cores (0 to 10,000). Curves: perfect scaling, Damaris, file-per-process, collective I/O.]

Weak scalability factor: $S = N \cdot \dfrac{T_{\text{base}}}{T}$

$N$: number of cores
$T_{\text{base}}$: time of an iteration on one core without a write
$T$: time of an iteration plus a write
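As a worked example of the weak scalability factor, the sketch below computes S = N × Tbase / T. The function name `weak_scalability_factor` and all timings are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the talk.

```python
def weak_scalability_factor(n_cores, t_base, t):
    """S = N * T_base / T, with T_base and T in the same time unit."""
    if n_cores < 1 or t_base <= 0 or t <= 0:
        raise ValueError("n_cores must be >= 1 and times must be positive")
    return n_cores * t_base / t


# Hypothetical timings in seconds.
# If an iteration plus its write takes as long as the baseline, scaling is perfect: S = N.
perfect = weak_scalability_factor(9216, t_base=10.0, t=10.0)

# If the write makes each iteration four times longer, S drops to N / 4.
degraded = weak_scalability_factor(9216, t_base=10.0, t=40.0)
```

Perfect scaling corresponds to S = N, which is the diagonal reference line in a plot of S against the number of cores.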

[Figure: Kraken Cray XT5, application run time in seconds (50 iterations + 1 write phase) at 576, 2304 and 9216 cores.]

Page 21

Damaris hides the I/O jitter

[Figure: Kraken Cray XT5, average and maximum write time in seconds at 576, 2304 and 9216 cores, 28 MB per process. Series: collective I/O, file-per-process, Damaris.]

[Figure: BluePrint Power5, 1024 cores, average, maximum and minimum write time in seconds vs. total amount of data (0 to 30 GB).]

Page 22

Damaris increases effective throughput

[Figure: Kraken Cray XT5, average aggregate throughput from the writer processes in GB/s (logarithmic axis, 0.0625 to 16) vs. number of cores (0 to 10,000). Series: file-per-process, Damaris, collective I/O.]

Page 23

Damaris spares time for data management

[Figure: time spent by Damaris writing data (used time) and time spent waiting (spare time), in seconds. Kraken Cray XT5: 576, 2304 and 9216 cores. BluePrint Power5 (1024 cores): 0.05, 5.8, 15.1 and 24.7 GB of total data.]

Damaris spares time? Let’s use it!

Page 24


Use of the spare time: Adding compression and scheduling

[Figure: write time in seconds on Grid5000 (912 cores) and Kraken (2304 cores). Series: no compression, with compression, with scheduling.]

•  With compression enabled: up to 600% compression ratio!
   - (reduction to 16-bit float values and lossless gzip level 4)
•  Scheduling
   - each dedicated core waits for its time slot before writing
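Both uses of the spare time can be sketched with the Python standard library. The sketch rests on assumptions: the source values are taken to be 32-bit floats, the test field is a synthetic sine wave, and the slot rule (equal slots spread over one output period) is a guess at what "waits for its time slot" could mean, not the scheduling policy from the talk. `compress_snapshot` and `slot_start` are hypothetical names.

```python
import gzip
import math
import struct


def compress_snapshot(values):
    """Reduce values to 16-bit floats (lossy), then gzip at level 4 (lossless)."""
    n = len(values)
    raw32 = struct.pack(f"<{n}f", *values)   # assumed original: 32-bit floats
    half16 = struct.pack(f"<{n}e", *values)  # precision reduced to 16 bits
    packed = gzip.compress(half16, compresslevel=4)
    return raw32, half16, packed


def slot_start(writer_index, n_writers, period):
    """Start offset of a writer's time slot, assuming equal slots over one period."""
    return writer_index * period / n_writers


# Synthetic, smooth field standing in for simulation output.
field = [math.sin(i / 100.0) for i in range(10_000)]
raw32, half16, packed = compress_snapshot(field)
ratio = len(raw32) / len(packed)  # size before / size after
```

The 16-bit reduction alone halves the data and discards precision. Only the gzip stage is lossless, so decompressing returns the 16-bit values, not the original ones.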

Page 25

Conclusion and future work

•  The Damaris approach
   - Dedicated cores
   - Shared memory
   - Highly adaptable system thanks to plugins
•  Results
   - Fully hides the I/O jitter and I/O-related costs
   - 15x sustained write throughput (compared to collective I/O)
   - Almost perfect application scalability
   - Execution time divided by 3.5 compared to collective I/O
   - Overhead-free 600% compression ratio
•  Future work
   - Efficient coupling of simulation and analysis tools
   - Distributed I/O scheduling on dedicated cores

Page 26


Thank you 谢谢

Merci