TRANSCRIPT
Scalable I/O in climate models workshop – Hamburg - 27th Feb 2012 (C2SM)
Recent developments on NetCDF I/O: towards hiding I/O computations in COSMO
Carlos Osuna (C2SM), Oliver Fuhrer (MeteoSwiss), Neil Stringfellow (CSCS)
COSMO Model
● Regional weather and climate prediction model.
● Used for operational and research applications by two communities:
weather prediction: http://www.cosmo-model.org/
climate: http://clm-community.eu/
● ~70 universities and research institutes.
● Operational at 7 national weather services.
COSMO CPU Usage at CSCS
Research in the ETH climate community:
> 15'000'000 CPUh per year (2010-2012)
Cray XE6, AMD Interlagos 2x16-core CPUs, 47874 cores

Operational at MeteoSwiss with 2 dedicated Cray XT4 systems:
~ 15'000'000 CPUh per year
Cray XT4, AMD Opteron, 1056 (buin) / 640 (dole) cores
COSMO Workflow
● Structured grid
● Every compute PE processes the same time step flow.
● At some time steps, output data is requested.
● All steps in the flow are parallelized among the compute PEs except the I/O operations.
Domain Decomposition:
The full domain is divided into subdomains assigned to every compute PE.
Every subdomain defines a halo to be exchanged with neighbours.
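As a rough illustration, the decomposition and halo definition can be sketched in plain Python (a hypothetical helper, not the COSMO Fortran code; the block-split rule and halo width are assumptions for illustration, with the grid size borrowed from the IPCC benchmark used later):

```python
def decompose(nx, ny, px, py, halo=1):
    """Split an nx x ny grid into px x py subdomains. Returns, per PE, the
    owned index range and the halo-extended range clipped to the domain."""
    sub = []
    for j in range(py):
        for i in range(px):
            x0, x1 = i * nx // px, (i + 1) * nx // px
            y0, y1 = j * ny // py, (j + 1) * ny // py
            owned = (x0, x1, y0, y1)
            # halo cells overlap the neighbouring subdomains and are
            # refreshed by boundary exchange after every time step
            halo_ext = (max(0, x0 - halo), min(nx, x1 + halo),
                        max(0, y0 - halo), min(ny, y1 + halo))
            sub.append((owned, halo_ext))
    return sub

# a 441x429 horizontal grid distributed over a 4x4 grid of compute PEs
parts = decompose(441, 429, 4, 4)
```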
Time step computation on the compute PEs: physics, dynamics, relaxation, ...
Boundary exchange between neighbour compute PEs.
At certain time steps, output data is requested:
1. Some compute PE will gather a full domain (one level of a given variable).
[Diagram: levels T(level=0), T(level=1), ..., T(level=k) and P(level=0), ..., P(level=k) distributed over the compute PEs]
At certain time steps, output data is requested:
1. Some compute PE will gather a full domain (one level of a given variable).
2. All data (for a NetCDF file) is gathered to PE 0.
3. PE 0 writes the NetCDF files to storage (Lustre).
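The sequential output path above can be mimicked in a few lines of plain Python (no MPI, no NetCDF; the gather step is simulated by concatenating per-PE blocks in rank order):

```python
def sequential_output(blocks):
    """Toy stand-in for COSMO's sequential output: 'gather to PE 0' by
    concatenating the per-PE row blocks, then 'write' the full field
    (the real code would call nf90_put_var here)."""
    full_field = [row for block in blocks for row in block]
    return full_field

# 3 PEs, each owning 2 rows of a 4-column field
blocks = [[[r * 10 + c for c in range(4)] for r in range(p * 2, p * 2 + 2)]
          for p in range(3)]
field = sequential_output(blocks)
```

The key point is that only one process performs the final write, which is exactly why this stage does not scale.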
An example
The 0.5 km resolution COSMO is one of IAC's most I/O-demanding runs in production at CSCS. Taking a 1-hour simulation on ~1850 processors as an example:
Section Time (seconds)
Dynamics 897
Physics 189
Output 543
Others 273
Total 1902
We can see that writing output data is ~29% of the total run time.
Current I/O is sequential; adding more processors will increase this ratio.
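The ~29% figure and the last claim can be checked with a few lines of Python (the perfect scaling of the compute sections is an assumption for illustration):

```python
# timings from the 0.5 km example above (seconds)
dynamics, physics, output, others = 897, 189, 543, 273
total = dynamics + physics + output + others
frac = output / total            # output share of the run, ~0.29

# if the compute sections scaled perfectly while output stayed sequential,
# doubling the processors would roughly halve everything but the output:
compute = dynamics + physics + others
total_2x = compute / 2 + output
frac_2x = output / total_2x      # output share grows as we scale out
```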
In contrast with other sections of the model, the output does not scale with the number of processors.
[Plot: scaling of the IPCC benchmark (24 hours) on the XE6]
Designing a new I/O strategy
Goal: improve the I/O performance for the NetCDF format. Ideally, hide the I/O from the compute PEs. Keep it simple.
Strategies studied:
● Level to process mapping
Some compute PEs will contain the data of a full domain level. Those can directly write to the NetCDF file. It requires using the parallel API of NetCDF with the NetCDF-4 format (on top of HDF5), using independent rather than collective operations.
[Diagram: levels mapped to compute PEs, which write to storage]
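A minimal sketch of such a level-to-process mapping in plain Python (the round-robin assignment is an assumption for illustration, not the actual COSMO mapping):

```python
def level_to_pe(variables, nlevels, npe):
    """Assign each full horizontal level of each variable to one compute
    PE, which can then write that level to the NetCDF file independently
    (hypothetical round-robin policy)."""
    mapping = {}
    records = ((v, k) for v in variables for k in range(nlevels))
    for rec, (var, lev) in enumerate(records):
        mapping[(var, lev)] = rec % npe
    return mapping

# two variables with 40 levels each, distributed over 16 compute PEs
m = level_to_pe(["T", "P"], 40, 16)
```

Each PE then issues an independent nf90_put_var for the levels it owns, with start/count selecting that level's slab.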
Goal: improve the I/O performance for the NetCDF format. Ideally, hide the I/O from the compute PEs for COSMO NetCDF output. Keep it simple.
Strategies studied:
● Asynchronous I/O
Some PEs are reserved for I/O-specific tasks. Compute PEs can send data to the asynchronous I/O PEs and continue processing the next steps.
Use case for measurements & development:
IPCC benchmark, 15 km resolution (441x429x40), 24-hour simulation, total output size ~2.8 GB
[Timeline diagram legend: computation, communication, I/O PE writing to disk, idle]
The Level to Process Mapping Strategy
A typical code using NetCDF among several processes:

Calls from every compute PE:
nf90_create(path, cmode, ncid, initialsize, bufrsize, cache_size, &
            cache_nelems, cache_preemption, comm, info)
nf90_def_dim(...)
nf90_def_var(...)
nf90_enddef(...)

By passing an MPI communicator to the NetCDF API, we select the parallel NetCDF API.

Independent call from the PE that contains the data:
nf90_put_var(ncid, varid, data, start=(/ i_start, j_start /), count=(/ i_size, j_size /))

Call from every compute PE:
nf90_close(ncid)
Measurements from a toy model: writing 152 MB, 5 3D variables, ~20 variable/dimension definitions in the NetCDF header.
Level to process mapping Summary:
● Global definitions of the NetCDF file are expensive: the same order as the actual writing of data.
● nf90_create timing, i.e. opening files, increases quasi-linearly with the number of processors on Lustre & GPFS (clients always request a new file descriptor from the server).
● The writing time stays constant with the number of processors.
● The bandwidth obtained is ~400 MB/s, while we expected about 1.8 GB/s from some ROMIO benchmarks on Lustre (80 OSTs).
● Results vary with the layout of data in the NetCDF file (X, Y or Z partition), but do not yield significant improvements.
● The user may request several NetCDF files per time step, adding large overheads for opening files, writing definitions, etc.
Level to process mapping Summary:
We tried to fine-tune NetCDF and MPI-IO to improve the performance:
● Alignment, header space allocation
Setting v_align to 1 MB (same as the Lustre stripe unit) at the NetCDF enddef subroutine. Reserving 100 KB of space for the NetCDF header to avoid an expensive relocation of data.
● Striping
As there can be hundreds of processors writing a full level to disk (into the same file), we set the striping to 80 (the maximum available).
● MPI_INFO
Passing to NetCDF an MPI_INFO optimized for Lustre at CSCS:
CALL MPI_INFO_SET(info, "striping_factor", "80", ierr)
CALL MPI_INFO_SET(info, "striping_unit", "1048576", ierr)
CALL MPI_INFO_SET(info, "romio_lustre_coll_threshold", "6", ierr)
● Lustre version
Version 1.8 provides 1.3x faster NetCDF header writing than 1.6, but the same scaling behaviour.
● GPFS
Same pattern for the writing timings, though 1.5-2x slower.
● Avoid Strides
These tunings did not significantly increase the obtained bandwidth. In the best case, output will be faster, but it still does not scale.
Moreover, all this fine tuning is not suitable for the COSMO collaboration, which runs on a wide set of platforms and HPC systems: Cray, NEC, IBM, clusters, etc.
But...
Judging from the parallel-netcdf paper at SC2003 (not using HDF5 but the native NetCDF solution), should we perhaps not expect good scaling for NetCDF writing operations?
http://www.mcs.anl.gov/~robl/pnetcdf/docs/pnetcdf-sc2003.pdf
The Asynchronous I/O Strategy
[Diagram: compute PEs sending output data to dedicated I/O PEs]
Results for the IPCC benchmark:

Section              Asyn I/O (s)            Orig COSMO (s)
Dyn Comp             198.4                   197.7
Fast waves Comm      26.4                    27.09
Dyn Communication    31.89                   33.68
Phys. Comp.          74.9                    75.4
Phys. Comm.          10.36                   10.38
Add. Comp.           10.48                   10.5
Input                4.93                    5.25
Output               6.68                    132.1
Computations_O       93.99 [17.7 - 128.7]
Write data           34.27 [0 - 111.1]
Gather data          3.85
TOTAL                369.6                   498.59
~26% reduction in total time using the new asynchronous strategy on a 400-processor run.
Looks good! But we are not there yet...
There might be bottlenecks if:
● the I/O PE takes more time to write a file than the compute PEs take to finish their computation, i.e. (computation time) < (data size) / bandwidth
● multiple files are being written for one time step
[Timeline diagram legend: computation, communication, I/O PE writing to disk, idle]
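The first bottleneck condition can be written as a one-line check (plain Python; the 60 s compute interval and the ~300 MB/s single-file bandwidth used below are assumed values for illustration):

```python
def io_is_bottleneck(compute_time_s, data_size_mb, bandwidth_mb_s):
    """The condition from the slide: the I/O PE becomes a bottleneck when
    the compute PEs finish their steps before the write completes, i.e.
    (computation time) < (data size) / bandwidth."""
    return compute_time_s < data_size_mb / bandwidth_mb_s

# ~2.8 GB of output at ~300 MB/s needs ~9.6 s of writing; whether that
# hides behind computation depends on the time between output steps:
hidden = not io_is_bottleneck(60.0, 2867.2, 300.0)
```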
While some of the assigned I/O PEs are busy writing a NetCDF file, all PEs in that communicator are blocked until all of them call nf90_close.
Add more resources to I/O??
We can parallelize the writing over several NetCDF files by assigning different I/O PEs to different files. By doing this we break the ~300 MB/s bandwidth limit of a single file.
Ideally, adding I/O PEs in this way will scale until we reach the filesystem limit.
[Diagram: two I/O PE communicators, each running its own nf90_create / nf90_def_dim / nf90_def_var / nf90_enddef / nf90_put_var / nf90_close sequence on a different file]
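Distributing files over I/O communicators can be sketched as follows (hypothetical file names and round-robin policy, not the COSMO scheduler):

```python
def assign_files(files, n_comms):
    """Distribute independent NetCDF files round-robin over I/O
    communicators, so writes to different files proceed in parallel."""
    comms = [[] for _ in range(n_comms)]
    for i, f in enumerate(files):
        comms[i % n_comms].append(f)
    return comms

# one output file per simulated hour, spread over 6 I/O communicators
groups = assign_files([f"lff{h:02d}0000.nc" for h in range(24)], 6)
```

Each group of files is then written by a separate communicator, so the aggregate bandwidth grows with the number of communicators until the filesystem limit is reached.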
Results (using several I/O communicators)
Testing the new solution on a more demanding IPCC-based example: 24 simulated hours, 400 processors, 130 GB total data size, 148 files, 168771 vertical slices.

Section          Asyn I/O, 3 I/O PE per comm, 6 comms (s)   Asyn I/O, 3 I/O PE per comm, 1 comm (s)   Orig COSMO (s)
Output           110                                        882                                       1103.71
Computations O   6.3                                        112.83 [49 - 141]
Write data       5.4 [4.3 - 8.1]                            404 [1 - 907]
Gather data      99 [96 - 100]                              584 [182 - 952]
Final Wait       19 [16.7 - 22]
Total            440.98                                     1207                                      1502

Dedicating 18 processors in 6 I/O groups, the total time was reduced by ~70%.
But the time of the output section is still not 0.
This is due to the process of gathering a full record from the set of subdomains computed on all compute PEs.
Substituting the MPI_GATHER by an MPI_ALLTOALL communication for the gathering, the total output time is reduced to negligible values.
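The semantics of this substitution can be sketched in plain Python: with MPI_ALLTOALL every rank ends up holding a complete, different record, instead of one root rank assembling everything (illustrative sketch only, not the COSMO MPI code):

```python
def alltoall(send):
    """MPI_ALLTOALL semantics: send[i][j] is what rank i sends to rank j;
    result[j][i] is what rank j received from rank i. The assembly work
    is thereby spread over all ranks instead of funnelled to one."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

# 3 ranks, 3 records: rank i holds its own piece of every record j and
# sends it to rank j; afterwards rank j holds all pieces of record j.
send = [[f"rec{j}-part{i}" for j in range(3)] for i in range(3)]
recv = alltoall(send)
```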
Keys of the NetCDF asynchronous strategy:
● COSMO writes the results from different time steps into different NetCDF files. We can parallelize over multiple NetCDF files by using several I/O communicators.
● Large buffering on the compute PEs is essential, covering at least the size of the data that one compute PE will send to the I/O PEs in one output time step. With this, we do not require a large number of PEs in the I/O sector.
● Adding more PEs to an I/O communicator will not help if the buffering is sufficient (i.e. the bandwidth does not increase with the number of PEs).
● Adding more I/O communicators will help if the writing of several NetCDF files overlaps in time and no free I/O communicator is available.
We no longer care about the performance of a single NetCDF I/O communicator: provided the user splits the data into several files, we can add resources to I/O as new communicators.
Speedup
In general, the speedup will depend on the rate at which output data is written. We can use the bandwidth required as a measurement of this rate:
bandwidth required = (data size) / (total run time - I/O time)
Speedup = (total time, sequential I/O COSMO) / (total time, asynchronous strategy)
0.5 km example: 1 hour of simulation, 52 GB of data, bandwidth required ~40 MB/s. Total run time is reduced by 25%.
A study of massive data dumping examples (based on the IPCC benchmark) shows that the strategy scales well up to the filesystem limit (~1.8 GB/s).
[Plot: speedup vs bandwidth required for 400 processors on rosa, showing the 0.5 km data-intensive example, the IPCC benchmark and the hardware limit]
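Plugging in the timings of the 0.5 km example from the earlier slide (897 + 189 + 273 s of compute, 543 s of output, ~52 GB of data) roughly reproduces both numbers, assuming the asynchronous strategy removes the I/O time entirely:

```python
# timings from the 0.5 km example (seconds) and its output volume
total, io_time = 1902.0, 543.0
data_size_mb = 52 * 1024

# the two formulas from this slide
bandwidth_required = data_size_mb / (total - io_time)   # ~39 MB/s
speedup = total / (total - io_time)                     # ~1.4x if I/O fully hidden
```

The computed ~1.4x upper bound is consistent with the quoted ~25% reduction in total run time.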
Speedup
The bandwidth required depends on the number of processors configured: increasing the number of processors reduces the computation time of each time step and therefore increases the bandwidth required.
Therefore, the speedup will also depend on the number of processors.
[Plot: speedup of the IPCC benchmark on rosa vs number of processors]
Conclusions:
● The sequential COSMO NetCDF output code does not scale: it currently takes ~20-30% of the total time.
● We tried two main strategies to parallelize the output: level to process mapping and the asynchronous strategy.
● Level to process mapping did not yield satisfactory results due to limitations of the parallel NetCDF API.
● The asynchronous strategy completely hides the I/O tasks from the total run time, thanks to COSMO splitting the output into multiple NetCDF files, the use of several I/O communicators, and large buffering on the compute PEs.
● Tested with many different I/O patterns, it scales well until we reach the filesystem limit (~1.8 GB/s).
● The speedup depends on how much time your run spends in I/O (asynchronous strategy removes I/O time):
speedup = (total run time) / (total run time - I/O time)