
Page 1:

Recent developments on NetCDF I/O: towards hiding I/O computations on COSMO

Carlos Osuna (C2SM), Oliver Fuhrer (MeteoSwiss), Neil Stringfellow (CSCS)

Scalable I/O in climate models workshop – Hamburg, 27th Feb 2012

Page 2:

COSMO Model

● Regional weather and climate prediction model.

● Used for operational and research applications by two communities: weather prediction (http://www.cosmo-model.org/) and climate (http://clm-community.eu/).

● ~70 universities and research institutes.

● Operational at 7 national weather services.

Page 3:

COSMO CPU Usage at CSCS

● Research in the ETH climate community: > 15'000'000 CPUh per year (2010-2012) on a Cray XE6 (AMD Interlagos, 2x16-core CPUs, 47'874 cores).

● Operational at MeteoSwiss on 2 dedicated Cray XT4 systems (AMD Opteron, 1056 cores (buin) | 640 cores (dole)): ~15'000'000 CPUh per year.

Page 4:

COSMO Workflow

● Structured grid

● Every compute PE processes the same time-step flow.

● At some time steps, output data is requested.

● All steps in the flow are parallelized among the compute PEs except the I/O operations.

Page 5:

Domain Decomposition:

The full domain is divided into subdomains assigned to the compute PEs.

Every subdomain defines a halo to be exchanged with neighbours.

Page 6:

Time step computation on the compute PEs: physics, dynamics, relaxation, ...

Page 7:

Boundary exchange between neighbouring compute PEs.

Page 8:

At certain time steps, output data is requested:

1. Some compute PEs gather a full domain (one level of a given variable).

[Diagram: individual levels T (level=0...k) and P (level=0...k), each full-domain level gathered on a different compute PE]

Page 9:

At certain time steps, output data is requested:

1. Some compute PEs gather a full domain (one level of a given variable).

2. All the data for a netcdf file is gathered on PE 0.

3. PE 0 writes the netcdf files to storage (Lustre).
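A minimal runnable sketch of this sequential baseline (the 1D decomposition, sizes and names are illustrative assumptions, not the COSMO implementation; steps 2 and 3 above map to the MPI_GATHER and the rank-0 netcdf calls):

program sequential_output
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nx = 64, ny = 64     ! global sizes (illustrative)
  integer :: ierr, rank, nprocs, ny_loc
  integer :: ncid, dimids(2), varid
  real, allocatable :: local(:,:), global(:,:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! 1D decomposition in y for simplicity (nprocs must divide ny);
  ! COSMO uses a 2D decomposition.
  ny_loc = ny / nprocs
  allocate(local(nx, ny_loc))
  local = real(rank)                         ! stand-in for model data
  if (rank == 0) then
    allocate(global(nx, ny))
  else
    allocate(global(1, 1))                   ! dummy on non-root ranks
  end if

  ! Step 2: gather the full level onto PE 0 (contiguous y-slabs).
  call MPI_GATHER(local, nx*ny_loc, MPI_REAL, &
                  global, nx*ny_loc, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

  ! Step 3: PE 0 alone writes the netcdf file (sequential I/O).
  if (rank == 0) then
    call check( nf90_create('out_seq.nc', NF90_CLOBBER, ncid) )
    call check( nf90_def_dim(ncid, 'x', nx, dimids(1)) )
    call check( nf90_def_dim(ncid, 'y', ny, dimids(2)) )
    call check( nf90_def_var(ncid, 'T', NF90_FLOAT, dimids, varid) )
    call check( nf90_enddef(ncid) )
    call check( nf90_put_var(ncid, varid, global) )
    call check( nf90_close(ncid) )
  end if
  call MPI_FINALIZE(ierr)
contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine check
end program sequential_output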

Page 10:

An example

The 0.5 km resolution COSMO run is one of the most I/O-demanding IAC production runs at CSCS. Taking a 1-hour simulation on ~1850 processors as an example:

Section     Time (seconds)
Dynamics    897
Physics     189
Output      543
Others      273
Total       1902

We can see that writing output data takes ~29% of the total run time.

The current I/O is sequential; adding more processors will increase this ratio.

Page 11:

Scaling

In contrast with other sections of the model, the output does not scale with the number of processors.

[Plot: IPCC benchmark (24 hours) on XE6 — time per section vs. number of processors]

Page 12:

Designing a new I/O strategy

Goal: Improve the I/O performance for the netcdf format. Ideally, hide the I/O from the compute PEs. Keep it simple.

Strategies studied:

Level-to-process mapping:

● Some compute PEs will contain the data of a full domain level.

● Those PEs can directly write to the netcdf file.

● It requires the parallel API of netcdf with the NetCDF-4 format (on top of HDF5).

● It uses independent rather than collective operations.

[Diagram: levels mapped to compute PEs, which write directly to storage]

Page 13:

Goal: Improve the I/O performance for the netcdf format. Ideally, hide the I/O from the compute PEs for COSMO netcdf output. Keep it simple.

Strategies studied:

Asynchronous I/O: some PEs are reserved for I/O-specific tasks. Compute PEs can send their data to the asynchronous I/O PEs and continue processing the next time steps.

Use case for measurements & development:

IPCC benchmark, 15 km resolution (441x429x40), 24-hour simulation, total output size ~2.8 GB

[Timeline diagram: compute PEs vs. I/O PE; legend: computation, communication, I/O PE writing to disk, idle]

Page 14:

The Level to Process Mapping Strategy

nf90_create(path, cmode, ncid, initialsize, bufrsize, cache_size, &
            cache_nelems, cache_preemption, comm, info)

By passing an MPI communicator to nf90_create, we select the parallel NetCDF API.

A typical code using netcdf among several processes:

! calls from every compute PE:
nf90_create(...)
nf90_def_dim(...)
nf90_def_var(...)
nf90_enddef(...)

! independent call from the PE that contains the data:
nf90_put_var(ncid, varid, data, start=(/ i_start, j_start /), count=(/ i_size, j_size /))

! call from every compute PE:
nf90_close()
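A minimal runnable sketch putting these calls together (the file name, sizes and the one-level-per-rank mapping are illustrative assumptions, not the COSMO code; it requires netcdf-fortran built with parallel NetCDF-4/HDF5 support):

program level_to_process
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nx = 64, ny = 64
  integer :: ierr, rank, nprocs
  integer :: ncid, dim_x, dim_y, dim_z, varid
  real :: level(nx, ny)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  level = real(rank)        ! stand-in for a gathered full-domain level

  ! Collective calls, made by every PE in the communicator.
  call check( nf90_create('out_par.nc', ior(NF90_NETCDF4, NF90_MPIIO), ncid, &
                          comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
  call check( nf90_def_dim(ncid, 'x', nx, dim_x) )
  call check( nf90_def_dim(ncid, 'y', ny, dim_y) )
  call check( nf90_def_dim(ncid, 'level', nprocs, dim_z) )
  call check( nf90_def_var(ncid, 'T', NF90_FLOAT, (/dim_x, dim_y, dim_z/), varid) )
  call check( nf90_enddef(ncid) )

  ! Independent (non-collective) access: each PE writes only its own level.
  call check( nf90_var_par_access(ncid, varid, NF90_INDEPENDENT) )
  call check( nf90_put_var(ncid, varid, level, &
                           start=(/1, 1, rank+1/), count=(/nx, ny, 1/)) )

  call check( nf90_close(ncid) )
  call MPI_FINALIZE(ierr)
contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine check
end program level_to_process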


Page 15:

Measurements from a toy model: writing 152 MB, 5 3D variables, ~20 variable/dimension definitions in the netcdf header.

Page 16:

Level-to-process mapping summary:

● Global definitions of the netcdf file are expensive, of the same order as the actual writing of data.

● The nf90_create timing (i.e. opening the file) increases quasi-linearly with the number of processors for Lustre & GPFS (clients always request a new file descriptor from the server).

● The writing time stays constant with the number of processors.

● The bandwidth obtained is ~400 MB/s, while we expected about 1.8 GB/s from some ROMIO benchmarks on Lustre (80 OSTs).

● Results vary with the layout of data in the netcdf file (X, Y or Z partitioning), but we do not get significant improvements.

● The user may request several netcdf files per time step, adding large overheads for opening files, writing definitions, etc.

Page 17:

Level-to-process mapping summary (continued):

We tried to fine-tune netcdf and MPI-IO to improve performance:

● Alignment and header space allocation. Setting v_align to 1 MB (the same as the Lustre stripe unit) in the netcdf enddef subroutine. Reserving 100 KB of space in the netcdf header to avoid expensive relocation of data.

● Striping. As there can be hundreds of processors writing full levels to the same file, setting the striping to 80 OSTs (the maximum available).

● MPI_INFO. Passing to netcdf an MPI_INFO object optimized for Lustre at CSCS (see the sketch after this list):

CALL MPI_INFO_SET(info, "striping_factor", "80", ierr)
CALL MPI_INFO_SET(info, "striping_unit", "1048576", ierr)
CALL MPI_INFO_SET(info, "romio_lustre_coll_threshold", "6", ierr)

● Lustre version. Version 1.8 provides ~1.3x faster netcdf header writing than 1.6, but the same scaling behaviour.

● GPFS. The same pattern for writing timings, though 1.5-2x slower.

● Avoid Strides
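For context, a sketch of how such an MPI_INFO object would be created and handed down to the parallel nf90_create (variable names are assumed; this fragment would stand in for the create call in the sketch after page 14):

! Fragment: building the Lustre-tuned hints and passing them through
! netcdf down to MPI-IO / ROMIO.
integer :: info, ierr, ncid
call MPI_INFO_CREATE(info, ierr)
call MPI_INFO_SET(info, "striping_factor", "80", ierr)
call MPI_INFO_SET(info, "striping_unit", "1048576", ierr)
call MPI_INFO_SET(info, "romio_lustre_coll_threshold", "6", ierr)
call check( nf90_create('out_par.nc', ior(NF90_NETCDF4, NF90_MPIIO), ncid, &
                        comm=MPI_COMM_WORLD, info=info) )
call MPI_INFO_FREE(info, ierr)   ! safe once the file is open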

Page 18:

These did not significantly increase the obtained bandwidth. In the best case, the output will be faster, but it still does not scale.

Moreover, all this fine tuning is not suitable for the COSMO collaboration, which runs on a wide set of platforms and HPC systems: Cray, NEC, IBM, clusters, etc.

But...

Page 19:

From the parallel-netcdf paper at SC2003 (a netcdf-native solution, not built on HDF5): should we even expect good scaling for netcdf write operations?

http://www.mcs.anl.gov/~robl/pnetcdf/docs/pnetcdf-sc2003.pdf

Page 20:

The Asynchronous I/O Strategy

[Diagram: compute PEs sending data to dedicated I/O PEs]

Page 21:

Results for the IPCC benchmark:

Section             Asyn I/O (s)           Orig COSMO (s)
Dyn Comp            198.4                  197.7
Fast waves Comm     26.4                   27.09
Dyn Communication   31.89                  33.68
Phys. Comp.         74.9                   75.4
Phys. Comm.         10.36                  10.38
Add. Comp.          10.48                  10.5
Input               4.93                   5.25
Output              6.68                   132.1
Computations_O      93.99 [17.7 - 128.7]
Write data          34.27 [0 - 111.1]
Gather data         3.85
TOTAL               369.6                  498.59

~26% reduction in total time using the new asynchronous strategy on a 400-processor run.

Page 22:

Looks good! But we are not there yet... There might be bottlenecks if:

● an I/O PE takes more time to write a file than the compute PEs take to finish their computation, i.e. (computation time) < (data size) / bandwidth

● multiple files are being written for one time step

[Timeline diagram: compute PEs stalling while the I/O PE writes; legend: computation, communication, I/O PE writing to disk, idle]

Page 23:

While the I/O PEs assigned to a NetCDF file are busy writing it, all PEs in the communicator are blocked until every one of them calls nf90_close.

Add more resources to I/O??

But we can parallelize the writing over several netcdf files by assigning different I/O PEs to different files. By doing this we break the ~300 MB/s bandwidth limit of a single file.

Ideally, adding I/O PEs in this way will scale until we reach the filesystem limit.

nf90_create(...)
nf90_def_dim(...)
nf90_def_var(...)
nf90_enddef(...)
nf90_put_var(...)
nf90_close()

[Diagram: two I/O PE communicators (I/O PE communicator 1, I/O PE communicator 2), each executing this sequence on its own file]
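A minimal sketch of this communicator setup (the group size, file naming and round-robin assignment are illustrative assumptions, not the COSMO code):

program io_groups
  use mpi
  implicit none
  integer, parameter :: pes_per_group = 3
  integer :: ierr, rank, group, io_comm, io_rank
  character(len=32) :: fname

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Colour by group: ranks 0-2 -> group 0, ranks 3-5 -> group 1, ...
  group = rank / pes_per_group
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, group, rank, io_comm, ierr)
  call MPI_COMM_RANK(io_comm, io_rank, ierr)

  ! Each group would open its own file with the parallel netcdf API,
  ! passing io_comm instead of MPI_COMM_WORLD; files are assigned
  ! round-robin, e.g. file k goes to group mod(k, number_of_groups).
  write(fname, '(a,i0,a)') 'output_group', group, '.nc'
  print *, 'world rank', rank, 'is rank', io_rank, 'of group', group, &
           ' -> ', trim(fname)

  call MPI_COMM_FREE(io_comm, ierr)
  call MPI_FINALIZE(ierr)
end program io_groups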

Page 24:

Testing the new solution on a more demanding IPCC-based example: 24 simulated hours, 400 processors, 130 GB total data size, 148 files, 168771 vertical slices.

Results (using several I/O communicators):

Section          Asyn I/O: 3 I/O PE per comm, 6 comms (s)   Asyn I/O: 3 I/O PE per comm, 1 comm (s)   Orig COSMO (s)
Output           110                                         882                                        1103.71
Computations O   6.3                                         112.83 [49 - 141]
Write data       5.4 [4.3 - 8.1]                             404 [1 - 907]
Gather data      99 [96 - 100]                               584 [182 - 952]
Final Wait       19 [16.7 - 22]
Total            440.98                                      1207                                       1502

Dedicating 18 processors in 6 I/O groups, the total time was reduced by 70%.

Page 25:

But the time of the output section is still not 0.

(Table repeated from the previous page; note the Gather data row.)

This is due to the process of gathering a full record from the set of subdomains computed on all compute PEs.

Substituting the MPI_GATHER-based gathering by MPI_ALLTOALL communication, the total output time is reduced to negligible values (see the sketch below).
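A minimal sketch of the communication pattern under illustrative assumptions (one level per rank, fixed chunk size; not the COSMO code): instead of one MPI_GATHER per level, each with a single root as hotspot, one MPI_ALLTOALL ships every rank's chunk of every level in a single balanced step.

program alltoall_gather
  use mpi
  implicit none
  integer, parameter :: chunk = 1024   ! subdomain points per level
  integer :: ierr, rank, nprocs, j
  real, allocatable :: sendbuf(:,:), recvbuf(:,:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! sendbuf(:, j) = this rank's subdomain data for level j
  ! (assume one level per rank, i.e. nprocs levels).
  allocate(sendbuf(chunk, nprocs), recvbuf(chunk, nprocs))
  do j = 1, nprocs
    sendbuf(:, j) = real(rank)
  end do

  ! After this call, recvbuf on rank j holds every rank's chunk of
  ! level j, i.e. the full domain of that level.
  call MPI_ALLTOALL(sendbuf, chunk, MPI_REAL, &
                    recvbuf, chunk, MPI_REAL, MPI_COMM_WORLD, ierr)

  call MPI_FINALIZE(ierr)
end program alltoall_gather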

Page 26:

Keys of the netcdf asynchronous strategy:

● COSMO writes results from different time steps into different netcdf files. We can parallelize over multiple netcdf files by using several I/O communicators.

● Large buffering on the compute PEs is essential: it must cover at least the amount of data that one compute PE sends to the I/O PEs in one output time step (see the sketch after this list). With this, we do not require a large number of PEs in the I/O sector.

● Adding more PEs to an I/O communicator will not help if the buffering is sufficient (i.e. bandwidth does not increase with the number of PEs).

● Adding more I/O communicators will help if the writing of several netcdf files overlaps in time and no free I/O communicator is available.

We no longer care about the performance of a single netcdf I/O communicator: provided the user splits the data into several files, we can add I/O resources as new communicators.
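A minimal sketch of the compute-side buffering (ranks, sizes and tags are illustrative assumptions): the output data is copied into a dedicated buffer and handed off with a non-blocking send, so the compute PE keeps working while the message drains, and only waits if the buffer is still in flight.

program buffered_send
  use mpi
  implicit none
  ! Run with at least 2 MPI ranks.
  integer, parameter :: n = 1024, tag = 42
  integer :: ierr, rank, nprocs, req, step
  real :: field(n), buffer(n)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  req = MPI_REQUEST_NULL

  if (rank == 0) then                  ! compute PE
    do step = 1, 3
      field = real(step)               ! stand-in for one time step of work
      ! Wait only if the previous send still owns the buffer.
      call MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)
      buffer = field                   ! copy out, then keep computing
      call MPI_ISEND(buffer, n, MPI_REAL, nprocs-1, tag, &
                     MPI_COMM_WORLD, req, ierr)
    end do
    call MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)
  else if (rank == nprocs-1) then      ! I/O PE: receive (and here discard)
    do step = 1, 3
      call MPI_RECV(buffer, n, MPI_REAL, 0, tag, MPI_COMM_WORLD, &
                    MPI_STATUS_IGNORE, ierr)
    end do
  end if
  call MPI_FINALIZE(ierr)
end program buffered_send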

Page 27:

Speedup

speedup = (total time, sequential-I/O COSMO) / (total time, asynchronous strategy)

In general, the speedup depends on the rate at which output data is written. We use the bandwidth required as a measure of this rate:

bandwidth required = (data size) / (total run time - I/O time)

0.5 km example: 1-hour simulation, 52 GB of data, bandwidth required ~40 MB/s. The total run time is reduced by 25%.

A study of massive data-dumping examples (based on the IPCC benchmark) shows that the strategy scales well up to the filesystem limit (~1.8 GB/s).

[Plot: speedup vs. bandwidth required for 400 processors on rosa, with points for the 0.5 km, data-intensive and IPCC examples and the hardware limit marked]
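As a rough consistency check, assuming this is the same run as the page 10 example: that 1-hour 0.5 km run spends 1902 - 543 = 1359 s outside the output section, and 52 GB / 1359 s ≈ 38 MB/s, in line with the quoted ~40 MB/s.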

Page 28:

Speedup

The bandwidth required will depend on the number of processors configured (increasing the number of processors reduces the computation time of each time step and so increases the bandwidth required).

Therefore, the speedup will also depend on the number of processors.

[Plot: speedup vs. number of processors, IPCC benchmark on rosa]

Page 29:

Conclusions:

● The COSMO sequential netcdf output code does not scale: it is currently ~20-30% of total run time.

● We tried two main strategies to parallelize the output: level-to-process mapping and the asynchronous strategy.

● Level-to-process mapping did not yield satisfactory results due to limitations of parallel netcdf.

● The asynchronous strategy completely hides the I/O tasks from the total run time, thanks to COSMO splitting output into multiple netcdf files, the use of several I/O communicators and large buffering on the compute PEs.

● Tested with many different I/O patterns, it scales well until we reach the filesystem limit (~1.8 GB/s).

● The speedup depends on how much time your run spends in I/O (the asynchronous strategy removes the I/O time):

speedup = (total run time) / (total run time - I/O time)
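For example, with the 400-processor IPCC numbers from page 21: 498.59 / (498.59 - 132.1) ≈ 1.36, close to the measured 498.59 / 369.6 ≈ 1.35.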