TRANSCRIPT
Scalable I/O in climate models workshop – Hamburg - 27th Feb 2012 (C2SM)
Recent developments on NetCDF I/O: towards hiding I/O computations in COSMO
Carlos Osuna (C2SM), Oliver Fuhrer (MeteoSwiss), Neil Stringfellow (CSCS)
COSMO Model
● Regional weather and climate prediction model.
● Used for operational and research applications by two communities:
weather prediction: http://www.cosmo-model.org/
climate: http://clm-community.eu/
● ~70 universities and research institutes.
● Operational at 7 national weather services.
COSMO CPU Usage at CSCS
Research in the ETH climate community:
> 15'000'000 CPUh per year (2010-2012)
Cray XE6, AMD Interlagos 2x16-core CPUs, 47874 cores

Operational at MeteoSwiss with 2 dedicated Cray XT4 systems:
~ 15'000'000 CPUh per year
Cray XT4, AMD Opteron, 1056 (buin) / 640 (dole) cores
COSMO Workflow
● Structured grid
● Every compute PE processes the same time step flow.
● At some time steps, output data is requested.
● All steps in the flow are parallelized among the compute PEs except the I/O operations.
Domain Decomposition:
The full domain is divided into subdomains assigned to every compute PE.
Every subdomain defines a halo to be exchanged with neighbours.
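As a rough illustration, the decomposition and halo definition can be sketched in plain Python (a hypothetical helper, not the COSMO Fortran code; the block-split rule and halo width are assumptions for illustration, with the grid size borrowed from the IPCC benchmark used later):

```python
def decompose(nx, ny, px, py, halo=1):
    """Split an nx x ny grid into px x py subdomains. Returns, per PE, the
    owned index range and the halo-extended range clipped to the domain."""
    sub = []
    for j in range(py):
        for i in range(px):
            x0, x1 = i * nx // px, (i + 1) * nx // px
            y0, y1 = j * ny // py, (j + 1) * ny // py
            owned = (x0, x1, y0, y1)
            # halo cells overlap the neighbouring subdomains and are
            # refreshed by boundary exchange after every time step
            halo_ext = (max(0, x0 - halo), min(nx, x1 + halo),
                        max(0, y0 - halo), min(ny, y1 + halo))
            sub.append((owned, halo_ext))
    return sub

# a 441x429 horizontal grid distributed over a 4x4 grid of compute PEs
parts = decompose(441, 429, 4, 4)
```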
Time step computation on the compute PEs: physics, dynamics, relaxation, ...
Boundary exchange between neighbour compute PEs.
At certain time steps, output data is requested:
1. Some compute PE will gather a full domain (one level of a given variable).
[Diagram: levels T(level=0), T(level=1), ..., T(level=k) and P(level=0), ..., P(level=k) distributed over the compute PEs]
At certain time steps, output data is requested:
1. Some compute PE will gather a full domain (one level of a given variable).
2. All data (for a NetCDF file) is gathered to PE 0.
3. PE 0 writes the NetCDF files to storage (Lustre).
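The sequential output path above can be mimicked in a few lines of plain Python (no MPI, no NetCDF; the gather step is simulated by concatenating per-PE blocks in rank order):

```python
def sequential_output(blocks):
    """Toy stand-in for COSMO's sequential output: 'gather to PE 0' by
    concatenating the per-PE row blocks, then 'write' the full field
    (the real code would call nf90_put_var here)."""
    full_field = [row for block in blocks for row in block]
    return full_field

# 3 PEs, each owning 2 rows of a 4-column field
blocks = [[[r * 10 + c for c in range(4)] for r in range(p * 2, p * 2 + 2)]
          for p in range(3)]
field = sequential_output(blocks)
```

The key point is that only one process performs the final write, which is exactly why this stage does not scale.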
An example
The 0.5 km resolution COSMO is one of IAC's most I/O-demanding runs in production at CSCS. Taking a 1-hour simulation on ~1850 processors as an example:
Section Time (seconds)
Dynamics 897
Physics 189
Output 543
Others 273
Total 1902
We can see that writing output data is ~29% of the total run time.
Current I/O is sequential; adding more processors will increase this ratio.
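The ~29% figure and the last claim can be checked with a few lines of Python (the perfect scaling of the compute sections is an assumption for illustration):

```python
# timings from the 0.5 km example above (seconds)
dynamics, physics, output, others = 897, 189, 543, 273
total = dynamics + physics + output + others
frac = output / total            # output share of the run, ~0.29

# if the compute sections scaled perfectly while output stayed sequential,
# doubling the processors would roughly halve everything but the output:
compute = dynamics + physics + others
total_2x = compute / 2 + output
frac_2x = output / total_2x      # output share grows as we scale out
```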
In contrast with other sections of the model, the output does not scale with the number of processors.
[Plot: scaling of the IPCC benchmark (24 hours) on the XE6]
Designing a new I/O strategy
Goal: improve the I/O performance for the NetCDF format. Ideally, hide the I/O from the compute PEs. Keep it simple.
Strategies studied:
● Level to process mapping
Some compute PEs will contain the data of a full domain level. Those can directly write to the NetCDF file. It requires using the parallel API of NetCDF with the NetCDF-4 format (on top of HDF5), using independent rather than collective operations.
[Diagram: levels mapped to compute PEs, which write to storage]
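A minimal sketch of such a level-to-process mapping in plain Python (the round-robin assignment is an assumption for illustration, not the actual COSMO mapping):

```python
def level_to_pe(variables, nlevels, npe):
    """Assign each full horizontal level of each variable to one compute
    PE, which can then write that level to the NetCDF file independently
    (hypothetical round-robin policy)."""
    mapping = {}
    records = ((v, k) for v in variables for k in range(nlevels))
    for rec, (var, lev) in enumerate(records):
        mapping[(var, lev)] = rec % npe
    return mapping

# two variables with 40 levels each, distributed over 16 compute PEs
m = level_to_pe(["T", "P"], 40, 16)
```

Each PE then issues an independent nf90_put_var for the levels it owns, with start/count selecting that level's slab.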
Goal: improve the I/O performance for the NetCDF format. Ideally, hide the I/O from the compute PEs for COSMO NetCDF output. Keep it simple.
Strategies studied:
● Asynchronous I/O
Some PEs are reserved for I/O-specific tasks. Compute PEs can send data to the asynchronous I/O PEs and continue processing the next steps.
Use case for measurements & development:
IPCC benchmark, 15 km resolution (441x429x40), 24-hour simulation, total output size ~2.8 GB
[Timeline diagram legend: computation, communication, I/O PE writing to disk, idle]
The Level to Process Mapping Strategy
A typical code using NetCDF among several processes:

Calls from every compute PE:
nf90_create(path, cmode, ncid, initialsize, bufrsize, cache_size, &
            cache_nelems, cache_preemption, comm, info)
nf90_def_dim(...)
nf90_def_var(...)
nf90_enddef(...)

By passing an MPI communicator to the NetCDF API, we select the parallel NetCDF API.

Independent call from the PE that contains the data:
nf90_put_var(ncid, varid, data, start=(/ i_start, j_start /), count=(/ i_size, j_size /))

Call from every compute PE:
nf90_close(ncid)
Measurements from a toy model: writing 152 MB, 5 3D variables, ~20 variable/dimension definitions in the NetCDF header.
Level to process mapping Summary:
● Global definitions of the NetCDF file are expensive: the same order as the actual writing of data.
● nf90_create timing, i.e. opening files, increases quasi-linearly with the number of processors on Lustre & GPFS (clients always request a new file descriptor from the server).
● The writing time stays constant with the number of processors.
● The bandwidth obtained is ~400 MB/s, while we expected about 1.8 GB/s from some ROMIO benchmarks on Lustre (80 OSTs).
● Results vary with the layout of data in the NetCDF file (X, Y or Z partition), but do not yield significant improvements.
● The user may request several NetCDF files per time step, adding large overheads for opening files, writing definitions, etc.
Level to process mapping Summary:
We tried to fine-tune NetCDF and MPI-IO to improve the performance:
● Alignment, header space allocation
Setting v_align to 1 MB (same as the Lustre stripe unit) at the NetCDF enddef subroutine. Reserving 100 KB of space for the NetCDF header to avoid an expensive relocation of data.
● Striping
As there can be hundreds of processors writing a full level to disk (into the same file), we set the striping to 80 (the maximum available).
● MPI_INFO
Passing to NetCDF an MPI_INFO optimized for Lustre at CSCS:
CALL MPI_INFO_SET(info, "striping_factor", "80", ierr)
CALL MPI_INFO_SET(info, "striping_unit", "1048576", ierr)
CALL MPI_INFO_SET(info, "romio_lustre_coll_threshold", "6", ierr)
● Lustre version
Version 1.8 provides 1.3x faster NetCDF header writing than 1.6, but the same scaling behaviour.
● GPFS
Same pattern for the writing timings, though 1.5-2x slower.
● Avoid Strides
These tunings did not significantly increase the obtained bandwidth. In the best case, output will be faster, but it still does not scale.
Moreover, all this fine tuning is not suitable for the COSMO collaboration, which runs on a wide set of platforms and HPC systems: Cray, NEC, IBM, clusters, etc.
But...
Judging from the parallel-netcdf paper at SC2003 (not using HDF5 but the native NetCDF solution), should we perhaps not expect good scaling for NetCDF writing operations?
http://www.mcs.anl.gov/~robl/pnetcdf/docs/pnetcdf-sc2003.pdf
The Asynchronous I/O Strategy
[Diagram: compute PEs sending output data to dedicated I/O PEs]
Results for the IPCC benchmark:

Section              Asyn I/O (s)            Orig COSMO (s)
Dyn Comp             198.4                   197.7
Fast waves Comm      26.4                    27.09
Dyn Communication    31.89                   33.68
Phys. Comp.          74.9                    75.4
Phys. Comm.          10.36                   10.38
Add. Comp.           10.48                   10.5
Input                4.93                    5.25
Output               6.68                    132.1
Computations_O       93.99 [17.7 - 128.7]
Write data           34.27 [0 - 111.1]
Gather data          3.85
TOTAL                369.6                   498.59
~26% reduction in total time using the new asynchronous strategy on a 400-processor run.
Looks good! But we are not there yet...
There might be bottlenecks if:
● the I/O PE takes more time to write a file than the compute PEs take to finish their computation, i.e. (computation time) < (data size) / bandwidth
● multiple files are being written for one time step
[Timeline diagram legend: computation, communication, I/O PE writing to disk, idle]
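The first bottleneck condition can be written as a one-line check (plain Python; the 60 s compute interval and the ~300 MB/s single-file bandwidth used below are assumed values for illustration):

```python
def io_is_bottleneck(compute_time_s, data_size_mb, bandwidth_mb_s):
    """The condition from the slide: the I/O PE becomes a bottleneck when
    the compute PEs finish their steps before the write completes, i.e.
    (computation time) < (data size) / bandwidth."""
    return compute_time_s < data_size_mb / bandwidth_mb_s

# ~2.8 GB of output at ~300 MB/s needs ~9.6 s of writing; whether that
# hides behind computation depends on the time between output steps:
hidden = not io_is_bottleneck(60.0, 2867.2, 300.0)
```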
While some of the assigned I/O PEs are busy writing a NetCDF file, all PEs in that communicator are blocked until all of them call nf90_close.
Add more resources to I/O??
We can parallelize the writing over several NetCDF files by assigning different I/O PEs to different files. By doing this we break the ~300 MB/s bandwidth limit of a single file.
Ideally, adding I/O PEs in this way will scale until we reach the filesystem limit.
[Diagram: two I/O PE communicators, each running its own nf90_create / nf90_def_dim / nf90_def_var / nf90_enddef / nf90_put_var / nf90_close sequence on a different file]
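Distributing files over I/O communicators can be sketched as follows (hypothetical file names and round-robin policy, not the COSMO scheduler):

```python
def assign_files(files, n_comms):
    """Distribute independent NetCDF files round-robin over I/O
    communicators, so writes to different files proceed in parallel."""
    comms = [[] for _ in range(n_comms)]
    for i, f in enumerate(files):
        comms[i % n_comms].append(f)
    return comms

# one output file per simulated hour, spread over 6 I/O communicators
groups = assign_files([f"lff{h:02d}0000.nc" for h in range(24)], 6)
```

Each group of files is then written by a separate communicator, so the aggregate bandwidth grows with the number of communicators until the filesystem limit is reached.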
Results (using several I/O communicators)
Testing the new solution on a more demanding IPCC-based example: 24 simulated hours, 400 processors, 130 GB total data size, 148 files, 168771 vertical slices.

Section          Asyn I/O, 3 I/O PE per comm, 6 comms (s)   Asyn I/O, 3 I/O PE per comm, 1 comm (s)   Orig COSMO (s)
Output           110                                        882                                       1103.71
Computations O   6.3                                        112.83 [49 - 141]
Write data       5.4 [4.3 - 8.1]                            404 [1 - 907]
Gather data      99 [96 - 100]                              584 [182 - 952]
Final Wait       19 [16.7 - 22]
Total            440.98                                     1207                                      1502

Dedicating 18 processors in 6 I/O groups, the total time was reduced by ~70%.
But the time of the output section is still not 0.
This is due to the process of gathering a full record from the set of subdomains computed on all compute PEs.
Substituting the MPI_GATHER by an MPI_ALLTOALL communication for the gathering, the total output time is reduced to negligible values.
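The semantics of this substitution can be sketched in plain Python: with MPI_ALLTOALL every rank ends up holding a complete, different record, instead of one root rank assembling everything (illustrative sketch only, not the COSMO MPI code):

```python
def alltoall(send):
    """MPI_ALLTOALL semantics: send[i][j] is what rank i sends to rank j;
    result[j][i] is what rank j received from rank i. The assembly work
    is thereby spread over all ranks instead of funnelled to one."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

# 3 ranks, 3 records: rank i holds its own piece of every record j and
# sends it to rank j; afterwards rank j holds all pieces of record j.
send = [[f"rec{j}-part{i}" for j in range(3)] for i in range(3)]
recv = alltoall(send)
```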
Keys of the NetCDF asynchronous strategy:
● COSMO writes the results from different time steps into different NetCDF files. We can parallelize over multiple NetCDF files by using several I/O communicators.
● Large buffering on the compute PEs is essential, covering at least the size of the data that one compute PE will send to the I/O PEs in one output time step. With this, we do not require a large number of PEs in the I/O sector.
● Adding more PEs to an I/O communicator will not help if the buffering is sufficient (i.e. the bandwidth does not increase with the number of PEs).
● Adding more I/O communicators will help if the writing of several NetCDF files overlaps in time and no free I/O communicator is available.
We no longer care about the performance of a single NetCDF I/O communicator: provided the user splits the data into several files, we can add resources to I/O as new communicators.
Speedup
In general, the speedup will depend on the rate at which output data is written. We can use the bandwidth required as a measurement of this rate:
bandwidth required = (data size) / (total run time - I/O time)
Speedup = (total time, sequential I/O COSMO) / (total time, asynchronous strategy)
0.5 km example: 1 hour of simulation, 52 GB of data, bandwidth required ~40 MB/s. Total run time is reduced by 25%.
A study of massive data dumping examples (based on the IPCC benchmark) shows that the strategy scales well up to the filesystem limit (~1.8 GB/s).
[Plot: speedup vs bandwidth required for 400 processors on rosa, showing the 0.5 km data-intensive example, the IPCC benchmark and the hardware limit]
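Plugging in the timings of the 0.5 km example from the earlier slide (897 + 189 + 273 s of compute, 543 s of output, ~52 GB of data) roughly reproduces both numbers, assuming the asynchronous strategy removes the I/O time entirely:

```python
# timings from the 0.5 km example (seconds) and its output volume
total, io_time = 1902.0, 543.0
data_size_mb = 52 * 1024

# the two formulas from this slide
bandwidth_required = data_size_mb / (total - io_time)   # ~39 MB/s
speedup = total / (total - io_time)                     # ~1.4x if I/O fully hidden
```

The computed ~1.4x upper bound is consistent with the quoted ~25% reduction in total run time.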
Speedup
The bandwidth required depends on the number of processors configured: increasing the number of processors reduces the computation time of each time step and therefore increases the bandwidth required.
Therefore, the speedup will also depend on the number of processors.
[Plot: speedup of the IPCC benchmark on rosa vs number of processors]
Conclusions:
● The sequential COSMO NetCDF output code does not scale: it currently takes ~20-30% of the total time.
● We tried two main strategies to parallelize the output: level to process mapping and the asynchronous strategy.
● Level to process mapping did not yield satisfactory results due to limitations of the parallel NetCDF API.
● The asynchronous strategy completely hides the I/O tasks from the total run time, thanks to COSMO splitting the output into multiple NetCDF files, the use of several I/O communicators, and large buffering on the compute PEs.
● Tested with many different I/O patterns, it scales well until we reach the filesystem limit (~1.8 GB/s).
● The speedup depends on how much time your run spends in I/O (asynchronous strategy removes I/O time):
speedup = (total run time) / (total run time - I/O time)