TRANSCRIPT
LLNL-PRES-811657
A file system for burst buffers
LLNL: Kathryn Mohror (PI), Adam Moody, Cameron Stanavige, Tony Hutter
ORNL: Sarp Oral (ORNL PI), Hyogi Sim, Feiyi Wang, Mike Brim, Swen Boehm
NCSA: Celso Mendes, Craig Steffen
HPC Knowledge Meeting (HPCKP'20), June 18, 2020
LLNL-PRES-811657 UnifyFS Tutorial https://github.com/LLNL/UnifyFS 2
What is UnifyFS?
• Simply put, it’s a file system for burst buffers
• Our goal is to make using burst buffers on exascale systems as easy as writing to the parallel file system and orders of magnitude faster
• Results on Summit show scalable write performance for UnifyFS with shared files on burst buffers
Writing data to the parallel file system is expensive
[Figure: clusters A, B, and C reach one parallel file system through shared I/O nodes and compute nodes]
• Network contention
• Contention from other clusters for the file system
• Contention for shared file system resources
• General-purpose file system semantics
HPC Storage is becoming more complex

Shared burst buffer model
• Good: easy for applications that write shared files; easy for ensemble applications
• Bad: not quite as fast as node-local; contention can be an issue

Local burst buffer model (the UnifyFS target)
• Good: fast, no contention problems; potential for excellent scaling performance
• Bad: what about shared files? What about producer/consumer applications?
UnifyFS targets local burst buffers because they are fast and scalable
• Atlas nodes: 300x faster than the parallel file system**
• Lassen nodes: 1000x faster than the parallel file system**

**Measurements of local storage performance taken with SCR (the Scalable Checkpoint/Restart library): https://github.com/llnl/scr
UnifyFS makes sharing files on node-local burst buffers easy and fast
• Sharing files on node-local burst buffers is not natively supported
• UnifyFS makes sharing files easy
  • UnifyFS presents a shared namespace across distributed storage
  • Used directly by applications or indirectly via higher-level libraries like VeloC, MPI-IO, HDF5, PnetCDF, ADIOS, etc.
• UnifyFS is fast
  • Tailored for specific HPC workloads, e.g., checkpoint/restart, visualization output
  • Each UnifyFS instance exists only within a single job, so there is no contention with other jobs on the system
UnifyFS is designed to work completely in user space for a single user’s job
[Figure: the UnifyFS software stack]
• HPC application (N-1 and N-N checkpointing, collective I/O with scientific I/O libraries)
• Checkpoint/restart libraries: SCR, VeloC
• Scientific I/O libraries: HDF5, MPI-IO, …
• Standard POSIX I/O, with user-level I/O interception
• UnifyFS: fast data writes to node-local storage; data visible globally after "lamination"; transfer API to move data to/from the parallel file system
• The batch job manager starts a dynamic file system for each job

Job script integration:

    unifyfs start --mount=/unifyfs
    # run applications and tasks...
    unifyfs terminate
UnifyFS optimizes on typical HPC application I/O behavior
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    read_input();

    for (int t = 0; t < TIMESTEPS; t++) {
        /* ... Do work ... */
        /* ... Synchronization ... */
        write_output();
    }

    MPI_Finalize();
    return 0;
}

• Reads and writes occur in distinct phases
• Reads and writes are performed to regular offsets in files (not random writes or reads)
• When reading, a process will most likely access data it wrote
• These behaviors are typical of common HPC workloads
  • Checkpoint/restart, periodic output, producer/consumer
UnifyFS files are globally visible after “lamination”
• Based on typical I/O behavior in HPC applications, we utilize "lamination semantics"
  • Before lamination, processes may not see updates written by processes on another node
  • Once processes are done modifying a file, they initiate lamination
  • Lamination renders the file read-only and synchronizes its data across the nodes in your job
• Now any process on any node can read the final state of the file
  • And you can transfer data to external storage
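The lamination idiom itself is just plain POSIX calls, so it can be sketched on any file system. In this illustrative sketch a local /tmp directory stands in for the /unifyfs mount point; the paths and file name are placeholders, not a real UnifyFS run:

```shell
# Sketch of the lamination idiom; /tmp/unifyfs_demo stands in for /unifyfs.
demo=/tmp/unifyfs_demo
mkdir -p "$demo"
echo "checkpoint data" > "$demo/out.ckpt"   # write phase
chmod 0444 "$demo/out.ckpt"                 # "laminate": mark the file read-only
cat "$demo/out.ckpt"                        # after lamination, readers see the final state
```

Under UnifyFS the chmod call is the signal that makes the file globally visible; on an ordinary file system it merely flips permission bits.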
Now for a tutorial…
• get and build UnifyFS
• build your application to use UnifyFS
• set up your application environment to use UnifyFS
• run your application with UnifyFS
• move your data between UnifyFS and the parallel file system
How do I get and build UnifyFS?
Got Spack?
https://github.com/spack/spack
$ spack install unifyfs
$ spack load unifyfs
How do I modify my MPI application for UnifyFS?
int main(int argc, char* argv[]) {
    FILE* fp;
    // program initialization
    // MPI setup

    // perform I/O
    fp = fopen("/lustre/dset.txt", "w");
    fprintf(fp, "Hello World! I'm rank %d", rank);
    fclose(fp);

    // clean up
    return 0;
}
How do I modify my MPI application for UnifyFS?
int main(int argc, char* argv[]) {
    FILE* fp;
    // program initialization
    // MPI setup

    // perform I/O
    fp = fopen("/unifyfs/dset.txt", "w");
    fprintf(fp, "Hello World! I'm rank %d", rank);
    fclose(fp);

    // clean up
    return 0;
}
How do I modify my serial application for UnifyFS?
int main(int argc, char* argv[]) {
    FILE* fp;
    // program initialization
    unifyfs_mount("/unifyfs");

    // perform I/O
    fp = fopen("/unifyfs/dset.txt", "w");
    fprintf(fp, "Hello World!");
    fclose(fp);

    // clean up
    unifyfs_unmount();
    return 0;
}
How does UnifyFS intercept I/O calls?
• Static linking:

    $ mpicc -o hello `<unifyfs>/bin/unifyfs-config --pre-ld-flags` \
          hello.c `<unifyfs>/bin/unifyfs-config --post-ld-flags`

• Dynamic linking (recommended method):
  • Using Spack makes this very easy (just "spack load unifyfs")
  • We use the Gotcha library for dynamic interception

    $ mpicc -o hello -L<unifyfs>/lib hello.c -lunifyfs_gotcha
How do I set up my code to run with UnifyFS?
• UnifyFS provides the following ways to set configuration settings:
  • A system-wide configuration file in /etc/unifyfs/unifyfs.conf
  • Environment variables
  • Command line options to 'unifyfs start' (only available for some config options)
• Detailed breakdown of all UnifyFS configuration options: https://unifyfs.readthedocs.io/en/dev/configuration.html
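As an illustrative sketch, environment variables follow a UNIFYFS_<SECTION>_<KEY> pattern mirroring the sections of unifyfs.conf. The specific variable names and values below are assumptions for illustration; verify each one against the configuration page before relying on it:

```shell
# Hypothetical environment-variable configuration (names assumed from the
# UNIFYFS_<SECTION>_<KEY> pattern; check the configuration docs).
export UNIFYFS_LOG_VERBOSITY=5            # more verbose server logging
export UNIFYFS_LOGIO_SPILL_DIR=/mnt/ssd   # where to spill write data
```

Set these in your batch script before running 'unifyfs start' so the servers pick them up.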
How do I run my code with UnifyFS?
• Easiest method: start & stop UnifyFS in your batch script
  • The command 'unifyfs start' launches UnifyFS for your job and sets up the file system
  • Run your command as usual and use the path /unifyfs to direct data to UnifyFS
  • The command 'unifyfs terminate' cleans up the UnifyFS file system and tears it down

### allocate nodes and options for resource manager
#BSUB -nnodes 8 ...

### shell command portion of batch script
unifyfs start --share-dir=/shared/file/system/path --mount=/unifyfs
jsrun -p 4096 ./hello
unifyfs terminate
How do I run my code with UnifyFS?
• Alternative method: resource manager integration
  • UnifyFS can be started by LSF-CSM (e.g., on Summit), but this requires administrator support to install
  • To start from within your batch script, add a BSUB directive; it launches UnifyFS at the start of your job and cleans it up at the end:

        ### allocate nodes and options for resource manager
        #BSUB -nnodes 8 ...
        #BSUB -alloc_flags UNIFYFS

  • Or, from the command line on a login node, execute:

        bsub -alloc_flags unifyfs <other bsub options> jobscript.lsf

  • Support for integration with SLURM is in development
How do I move my data into and out of UnifyFS?
• Three ways to move data between UnifyFS and the parallel file system:
• UnifyFS transfer program
• UnifyFS transfer API
• Stage-in and stage-out options of the 'unifyfs start' command
How do I move my data into and out of UnifyFS?
• Easiest method: use the UnifyFS command line program 'transfer'
  • transfer.exe <source> <destination>
  • Either the source or the destination needs to exist in UnifyFS; the other is an external storage location like the parallel file system
  • Make sure the files you transfer out are laminated with a call to chmod before exiting (shown on the next slide)

### allocate nodes and options for resource manager
#BSUB -nnodes 8 ...

### shell command portion of batch script
unifyfs start --share-dir=/shared/file/system/path --mount=/unifyfs
jsrun -p 4096 ./hello
jsrun -r1 transfer-static /unifyfs/file1 /scratch/mydir/file1
unifyfs terminate
How do I move my data into and out of UnifyFS?
• Method #2: the UnifyFS transfer API
  • int unifyfs_transfer_file_serial(source, dest)
  • int unifyfs_transfer_file_parallel(source, dest)

int main(int argc, char* argv[]) {
    FILE* fp;
    // program initialization & MPI setup

    fp = fopen("/unifyfs/out.txt", "w");
    fprintf(fp, "Hello World! I'm rank %d", rank);
    fclose(fp);

    // "laminate" the file to tell UnifyFS you are done modifying it
    chmod("/unifyfs/out.txt", 0444);

    // transfer data to the parallel file system with a UnifyFS API call
    unifyfs_transfer_file_parallel("/unifyfs/out.txt", "/scratch/out.txt");

    // program clean up & MPI finalization
    return 0;
}

• chmod is the currently supported method for lamination
• (Near-)future methods we plan to support:
  • Laminate on close()
  • Laminate on unmount() (either an explicit API call or automatic unmount)
  • Laminate with an API call, e.g., unifyfs_laminate
  • Expect to be able to specify which method you want when you start UnifyFS
How do I move my data into and out of UnifyFS?
• Method #3: the 'unifyfs start' command
  • Use the command line --stage-in and --stage-out options
  • Make sure your application laminates its files with chmod before exiting
  • This command takes all files from UnifyFS and transfers them to the stage-out path

$ unifyfs start --stage-in=/path/to/parallel/file/system/inputs --stage-out=/path/to/parallel/file/system/outputs
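Putting the stage options together with the batch-script pattern from the earlier slides, a complete job script might look like the sketch below. Every path is a placeholder, and combining --share-dir, --mount, and the stage options on one 'unifyfs start' line is an assumption to verify against the documentation:

```shell
### allocate nodes and options for resource manager
#BSUB -nnodes 8

### shell command portion of batch script
unifyfs start --share-dir=/shared/file/system/path --mount=/unifyfs \
        --stage-in=/path/to/parallel/file/system/inputs \
        --stage-out=/path/to/parallel/file/system/outputs
jsrun -p 4096 ./hello    # the application laminates its outputs with chmod
unifyfs terminate
```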
Can I use UnifyFS if I use an I/O library?
• Yes! UnifyFS works with HDF5 I/O as well as other I/O libraries
• We are partnered with HDF5 in ECP, so we test it the most
• Build and run your application with UnifyFS and change the path

Before (writing to the parallel file system):

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    for (int t = 0; t < TIMESTEPS; t++) {
        /* ... Do work ... */
        checkpoint(dset_data);
    }

    MPI_Finalize();
    return 0;
}

void checkpoint(dset_data) {
    int rank;
    char file[256];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sprintf(file, "/lustre/shared.ckpt");

    file_id = H5Fopen(file, ...);
    dset_id = H5Dopen2(file_id, "/dset", ...);
    H5Dwrite(dset_id, ..., dset_data);
    H5Dclose(dset_id);
    H5Fclose(file_id);
    return;
}

After (writing to UnifyFS; main is unchanged, checkpoint changes the path and laminates):

void checkpoint(dset_data) {
    int rank;
    char file[256];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sprintf(file, "/unifyfs/shared.ckpt");

    file_id = H5Fopen(file, ...);
    dset_id = H5Dopen2(file_id, "/dset", ...);
    H5Dwrite(dset_id, ..., dset_data);
    H5Dclose(dset_id);
    H5Fclose(file_id);
    chmod(file, 0444);  // laminate: done modifying this file
    return;
}
Wanna learn more?
• New features and improvements being added all the time
• Documentation and user support
  • User guide: http://unifyfs.readthedocs.io
  • At the bottom, where it says "Read the Docs", choose "v: latest" to get the latest information
• Example programs are available in the "examples" directory in the source code
We need you!
• Our goal is to provide easy, portable, and fast support for burst buffers for ECP applications
• We need early users
  • Tell us which features are most important to you
• Available on GitHub: https://github.com/LLNL/UnifyFS (MIT license)
• Documentation and user support
  • User guide: http://unifyfs.readthedocs.io
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative.
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.