
Page 1: Lossless Compression of Internal Files in Parallel Reservoir Simulation (ieee-hpec.org/2019/2019Program/program_htm_files/02-114...)

Lossless Compression of Internal Files in Parallel Reservoir Simulation

Suha Kayum

Marcin Rogowski

Florian Mannuss

9/26/2019

Page 2:


Outline

• I/O Challenges in Reservoir Simulation
• Evaluation of Compression Algorithms on Reservoir Simulation Data
• Real-world application
  - Constraints
  - Algorithm
  - Results
• Conclusions

Presenter Notes
I/O Challenges in Reservoir Simulation: why I/O is important in reservoir simulation (data size and speed). Evaluation of Compression Algorithms on Reservoir Simulation Data. Real-world application: Constraints (the data compresses well, but what are the constraints of fitting compression within an existing simulator with a large code base and rigid interfaces?), Algorithm (explanation of the compression algorithm developed), Results (results shared on real-world data). Conclusions.
Page 3:

1. Reservoir Simulation Challenge

Page 4:


Reservoir Simulation

• The largest fields in the world are represented as 50 million – 1 billion grid block models
• Each run takes hours on 500-5000 cores
• Calibrating the model requires 100s of runs and sophisticated methods
• A "history matched" model is only a beginning

Presenter Notes
Models are big: 50 million – 1 billion grid blocks, and each run takes hours on 500-5000 cores. History matching, the calibration of the model, takes hundreds of runs with varying parameters using methods such as Ensemble Kalman Filters. History matching is only the beginning: once the model is calibrated and generates valid results, it is used to evaluate different future (prediction) scenarios, which again means hundreds of runs.
Page 5:


Files in Reservoir Simulation

• Internal Files
  - Restart/Checkpoint Files
• Input / Output Files
  - Interact with pre- & post-processing tools

[Plot: file writes over date; stars mark Restart/Checkpoint file writes]

Presenter Notes
Pre-processing: we interface with external tools such as GoCad and Petrel, so we are constrained in the choice of format. Simulator: we are free to choose the format for files used internally by the simulator (written and read back by it). Restart files (stars on the plot on the right) are a form of checkpoint that dumps the data of every grid block (multiple int/double arrays) into a file and allows resuming the job from that point in time; this is used for history/prediction (once the history model is matched, prediction can be started from the end of the history period). Post-processing: we interface with visualization tools such as SimReservoir and TempestView and cannot change the format.
Page 6:


Reservoir Simulation in Saudi Aramco

• 100'000+ simulations annually
• The largest simulation to date: 10 billion cells
• Currently multiple machines in the TOP500
• Petabytes of storage required
• Resources are finite
• File compression is one solution

[Chart: 50x growth in the number of runs and 600x growth in maximum model size]

Presenter Notes
We run 100'000+ reservoir simulation runs annually; the number of runs grew 50x and the maximum model size grew 600x in the last 20 years. One 10 billion cell model requires 80 GB per array to write out the data (10 billion * 8 bytes for double-precision floating point = 80 gigabytes), and there are multiple arrays. All this generates a lot of data, and while storage is cheap and abundant, it is not infinite, and disk is slow. Compression is one solution: it reduces the footprint on disk, increases the likelihood the data will fit on a faster local disk or burst buffer, and reduces the time it takes to write the data to disk.
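The 80 GB per array figure in the notes is simple arithmetic, reproduced here for concreteness:

```python
cells = 10_000_000_000       # 10 billion grid blocks
bytes_per_value = 8          # one IEEE 754 double per block
array_bytes = cells * bytes_per_value
print(array_bytes / 10**9)   # prints 80.0 (gigabytes per array)
```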
Page 7:

2. Compression Algorithm Evaluation

Page 8:


Compression ratio

Tested a number of algorithms on a GRID restart file for two models:
- Model A: 77.3 million active grid blocks, 15.6 GB
- Model K: 8.7 million active grid blocks, 7.2 GB

The compression ratio ranges:
- from 2.27 for snappy (Model A)
- up to 3.5 for bzip2 -9 (Model K)

[Chart: compression ratio (0 to 4) for Model A and Model K with lz4, snappy, gzip -1, gzip -9, bzip2 -1, bzip2 -9]

Presenter Notes
Evaluation of different compression algorithms, using 1 process, to simply test the applicability and performance of the algorithms on the RESTART files specific to reservoir simulation; tested using 2 models. The compression ratio for all the algorithms is between 2.27 for snappy (Model A) and 3.5 for bzip2 -9. Gzip/bzip2 always seem to compress better than lz4/snappy; however, this comes at a cost. Note that snappy and lz4 use a "sliding window" approach and miss the "global view" of the file, which limits the compression ratio; this will be important to explain why reordering floating-point bytes might be better than compressing them as a character string.
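A single-process evaluation of this kind can be sketched with the standard library. Here zlib and bz2 stand in for gzip and bzip2; lz4 and snappy, used in the paper, need third-party packages and are omitted. The synthetic random-walk array is an assumption loosely mimicking a smooth field, not the paper's data, so the printed ratios will not match the slides:

```python
import bz2
import random
import struct
import zlib

# Synthetic stand-in for a restart array: a smooth random walk of doubles.
random.seed(0)
values, x = [], 250.0
for _ in range(100_000):
    x += random.uniform(-0.01, 0.01)
    values.append(x)
raw = struct.pack(f"<{len(values)}d", *values)

# gzip- and bzip2-style codecs at their fastest and strongest levels.
for name, compress in [
    ("zlib -1", lambda d: zlib.compress(d, 1)),
    ("zlib -9", lambda d: zlib.compress(d, 9)),
    ("bz2 -1",  lambda d: bz2.compress(d, 1)),
    ("bz2 -9",  lambda d: bz2.compress(d, 9)),
]:
    out = compress(raw)
    print(f"{name}: ratio {len(raw) / len(out):.2f}")
```

Even on this toy data the pattern from the slides shows up: the stronger settings buy a better ratio, and the next slide shows what that costs in time.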
Page 9:


Compression speed

• LZ4 and Snappy significantly outperformed the other algorithms in runtime overhead
• The compression ratio of LZ4 was 14-34% worse than the bzip2 variants
  - Compression overhead was 117-138 times smaller
    • 34s vs 4680s for 15.6 GB
  - Decompression overhead was 19-23 times smaller

[Charts: compression time and decompression time in seconds for Model A and Model K with lz4, snappy, gzip -1, gzip -9, bzip2 -1, bzip2 -9]

Presenter Notes
LZ4 and snappy are orders of magnitude faster than bzip2/gzip when compressing and decompressing the data. LZ4 was up to 42% faster than snappy, securing its position as the algorithm of choice for the results presented in this paper.
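The speed comparison can be reproduced in miniature. Timings are machine-dependent and the data below is synthetic (an assumption, not the paper's restart files), so only the relative ordering of fast versus thorough codecs is meaningful:

```python
import bz2
import time
import zlib

# ~10 MB of synthetic, fairly compressible sample data.
data = bytes(range(256)) * 40_000

def timed(fn, payload):
    """Time a single call; returns (seconds, result)."""
    start = time.perf_counter()
    result = fn(payload)
    return time.perf_counter() - start, result

for name, compress, decompress in [
    ("zlib -1 (fast)",     lambda d: zlib.compress(d, 1), zlib.decompress),
    ("bz2 -9 (thorough)",  lambda d: bz2.compress(d, 9),  bz2.decompress),
]:
    t_c, blob = timed(compress, data)
    t_d, restored = timed(decompress, blob)
    assert restored == data  # lossless round trip
    print(f"{name}: compress {t_c:.3f}s, decompress {t_d:.3f}s")
```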
Page 10:

3. Real-World Application

Page 11:


Constraints

• Fitting within the framework of a large, pre-existing application
• Limiting the number of files
• Parallel implementation
• Compression ratio vs. runtime overhead balance

[Diagram: balance between compression ratio and runtime]

Presenter Notes
LZ4 was shown to easily decrease the file size by 2x at the cost of a small overhead; how can this be used within a parallel application to make it more efficient, rather than writing the data and then compressing/uncompressing it in wrappers?
- Fitting within the framework of a large, pre-existing application: hundreds of thousands of lines of code, a set interface, and the challenge of replacing the I/O while keeping the external interface the same.
- Limiting the number of files: with hundreds of thousands of jobs, thousands of processors per job, and tens of restarts per job, we do not want to store each process's data in a separate file, so we merge everything into one "restart" file.
- Parallel implementation: for a version writing out uncompressed arrays, it is easy to calculate the location of each process's data in the output binary file; however, once the data is compressed, we do not know how well other processes' data compressed, hence the offsets are impossible to calculate without parallel coordination.
- Compression ratio vs. runtime overhead: we want to keep a balance between compression ratio and the cost of compression; bzip2 -9 compresses better, but we are not going to use it because it would make the entire simulation longer.
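The offset problem described in the notes has a standard shape: once every writer knows every earlier writer's compressed size, its own file offset is an exclusive prefix sum. A minimal serial sketch (in the simulator this size exchange would be a collective over MPI ranks; `MPI_Exscan` is the textbook fit, though the slides do not name the call used, and the sizes below are hypothetical):

```python
from itertools import accumulate

def chunk_offsets(compressed_sizes, header_bytes=0):
    """Exclusive prefix sum of per-writer compressed chunk sizes:
    the byte offset at which each writer places its chunk in the
    shared restart file, after an optional fixed header."""
    return [header_bytes + s
            for s in accumulate([0] + compressed_sizes[:-1])]

sizes = [120, 75, 240, 60]   # hypothetical compressed sizes per writer
print(chunk_offsets(sizes))  # prints [0, 120, 195, 435]
```

The same size list, gathered on the lowest-ranked writer, is exactly what the index file on the next slide records.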
Page 12:


Algorithm

Compression Method 1
• Double or integer arrays are interpreted directly as character arrays

Compression Method 2 (Default)
• Arrays are reordered: all first bytes of the double-precision values are stored in one array, the second bytes in another one, and so on
• This creates 8 arrays, each corresponding to the Nth byte of every floating-point value
• Reduces the entropy of sub-arrays that might be introduced due to the IEEE floating-point standard
• The 8 arrays are compressed individually

Indexing
• After compression, writer processes exchange compressed array sizes and write all compressed chunks to the file
• An index file is created by writers sending their chunk sizes to the writer with the lowest rank

[Figures: index structure pointing to the start of the individual compressed chunks in the restart file; illustration of array reordering with the aim of reducing entropy]

Presenter Notes
Compression Method 1: compress the entire array at once as a "char"/"byte" array. Compression Method 2: split into 8 arrays first; the 1st array holds the 1st byte of every double-precision floating-point value, the 2nd the 2nd byte, etc. This reduces entropy: all the exponents are stored consecutively, all the mantissas are stored consecutively. Indexing: whether Method 1 or Method 2 is used, we do not know the chunk sizes of the processes that had data to compress, so processes communicate and exchange compressed array sizes so that offsets in the file can be calculated for each writer; the index file is created by the writer with the lowest MPI rank so that the file can be read back by multiple processes. The file created comprises multiple LZ4 archives, and the index file points at the beginning and the end of each archive contained within the file (in some formats two concatenated archives are themselves a valid archive; we believe LZ4 is not one of those).
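Compression Method 2 can be sketched as follows. The shuffle and its inverse are minimal illustrations (the function names are ours), and zlib stands in for LZ4, which is not in the Python standard library; with each byte plane holding like bytes (exponent bytes together, low mantissa bytes together), the planes tend to compress better than the interleaved stream:

```python
import struct
import zlib

def shuffle_bytes(values):
    """Method-2-style reordering: collect the Nth byte of every
    little-endian double into its own plane, giving 8 planes."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return [raw[i::8] for i in range(8)]

def unshuffle_bytes(planes):
    """Invert the reordering and unpack the doubles exactly."""
    n = len(planes[0])
    raw = bytes(planes[i % 8][i // 8] for i in range(8 * n))
    return list(struct.unpack(f"<{n}d", raw))

# Smoothly varying sample values (an assumption, not simulator data).
values = [250.0 + 0.001 * i for i in range(10_000)]
planes = shuffle_bytes(values)

direct = len(zlib.compress(struct.pack(f"<{len(values)}d", *values), 1))
shuffled = sum(len(zlib.compress(p, 1)) for p in planes)
print(f"direct: {direct} bytes, shuffled planes: {shuffled} bytes")
```

Because the planes are compressed individually, the reorder costs nothing in fidelity: the round trip through `shuffle_bytes`/`unshuffle_bytes` is bit-exact.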
Page 13:


Real-world results

• Tested on 500+ synthetic and real-field models that encompass varying model features and sizes
• The median model compressed with a ratio of 3, with the mean at 3.6
• The lowest compression efficiency resulted in a ratio of 1.47; in the best-case scenario, the ratio exceeded 25
• The cumulative size of restart GRID files was reduced from 680 GB to 232 GB, a compression ratio of 2.93

Presenter Notes
We tested the method on 500+ synthetic and real-field models which are part of a test suite of the simulator. The total RESTART file size was reduced from 680 GB per run of the suite down to 232 GB (a 2.93 ratio). The median model compressed with a ratio of 3 and the mean is 3.6; for the best case the ratio was over 25, and the worst model compressed with a ratio of 1.47. We actually identified one array that was always empty but still written to disk by mistake, as it always compressed so well!
Page 14:


Conclusions

• Storage and performance challenges of recurrently writing massive restart files in a parallel reservoir simulator are described and addressed using compression

• Evaluated a number of lossless compression algorithms

• LZ4 and Snappy performed the best in terms of compression speed resulting in less runtime overhead (LZ4 chosen)

• Compression was implemented in parallel

• Reduced storage requirements and increased efficiency

- In some cases, the overall runtime was reduced, i.e. compressing and writing to disk is faster than writing raw data to disk

• Shifting the load from interconnect and storage to additional computation is in line with predicted trends in high-performance computing, where compute capabilities are expected to grow at a higher rate than any other supercomputer component

Presenter Notes
Storage and performance challenges of recurrently writing massive restart files in a parallel reservoir simulator are described and addressed using compression (roughly 100K models annually * 1K scenarios * 50 restarts). We evaluated a number of lossless compression algorithms: bzip2, gzip, snappy, lz4. LZ4 and Snappy performed the best in terms of compression speed, resulting in less runtime overhead; LZ4 was chosen, favoring balance over the ultimate compression ratio. Compression was implemented in parallel, reducing storage requirements and increasing efficiency. In some cases the overall runtime was reduced, i.e. compressing and writing to disk is faster than writing raw data to disk: the LZ4 compression overhead measured on a single process is small (34 seconds in the example of the 15.6 GB file for Model A) and becomes negligible once the array is distributed over N processes. With a compression ratio above 2, it can be estimated that the time taken to compress and write the reduced-size file to disk will most likely be less than the time it would have taken to write the "raw" file. Shifting the load from interconnect and storage to additional computation is in line with predicted trends in high-performance computing; all the exascale roadmaps expect compute to be the fastest-growing component, while disks and networks are already behind.
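The runtime argument in the notes can be put into a back-of-envelope model. The 15.6 GB size, 2.27 ratio, and 34 s single-process overhead come from the earlier slides; the file-system bandwidth and process count below are assumptions chosen only to illustrate the shape of the trade-off:

```python
def write_time(size_gb, bandwidth_gbs, ratio=1.0, compress_s=0.0):
    """Seconds to (optionally compress and) write one array:
    compression cost plus the time to write the reduced size."""
    return compress_s + (size_gb / ratio) / bandwidth_gbs

size_gb = 15.6     # Model A restart array (from the slides)
bandwidth = 1.0    # GB/s, assumed file-system bandwidth
nprocs = 256       # assumed process count; the 34 s overhead amortizes

raw = write_time(size_gb, bandwidth)
compressed = write_time(size_gb, bandwidth,
                        ratio=2.27, compress_s=34.0 / nprocs)
print(f"raw: {raw:.1f}s, compressed: {compressed:.1f}s")
```

With these numbers the compressed path wins comfortably, matching the slide's claim that compressing and writing the smaller file can beat writing raw data; the conclusion flips only if compression overhead outweighs the bandwidth saved.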