CSC 7600 Lecture 20: Parallel File I/O 2, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
PARALLEL FILE I/O 2
Prof. Thomas Sterling, Prof. Hartmut Kaiser
Department of Computer Science, Louisiana State University
March 31st, 2011
IO Problem of the day
#include <stdio.h>
int main()
{
    int a = 0, b = 0;
    char buf[10];
    scanf("%d%d", a, b);
    sprintf(buf, "%d %d");
    puts("you entered: ");
    puts(buf);
}
If the user entered 3 and 17, what's the generated output?
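IO Problem of the day (corrected version)
• As posted, the program has undefined behavior: scanf() receives the values of a and b instead of their addresses, and sprintf() is given no arguments for its two %d conversions, so the output cannot be predicted
#include <stdio.h>
int main()
{
    int a = 0, b = 0;
    char buf[42];   // max. 20 digits in 64bit int
    scanf("%d%d", &a, &b);
    snprintf(buf, 42, "%d %d", a, b);
    puts("you entered: ");
    puts(buf);
}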
Topics
• Introduction
• POSIX I/O API
• Parallel I/O Libraries (MPI-IO)
• Scientific I/O Interface: netCDF
• Scientific Data Package: HDF5
• Summary – Materials for Test
Parallel I/O: Library Layers (Review)
• Lower level interfaces may be provided by the file system for higher-performance access
• Above the parallel file systems are the parallel I/O layers provided in the form of libraries such as MPI-IO
• The parallel I/O layer provides a low level interface and operations such as collective I/O
• Scientific applications work with structured data, for which a higher-level API written on top of MPI-IO, such as HDF5 or parallel netCDF, is used
• HDF5 and parallel netCDF allow the scientists to represent the data sets in terms closer to those used in their applications, and in a portable manner
[Figure: I/O software stack, from top to bottom: High-Level I/O Library, Parallel I/O (MPI I/O), Parallel File System, Storage Hardware]
POSIX File Access API
• Widespread standard
• Available on any UNIX-compliant platform
  – IBM AIX, HP HP-UX, SGI Irix, Sun Solaris, BSDi BSD/OS, Mac OS X, Linux, FreeBSD, OpenBSD, NetBSD, BeOS, and many others
  – Also: Windows NT, XP, Server 2003, Vista, Windows 7 (through C runtime libraries)
• Simple interface: six functions from POSIX.1 (core services) provide practically all necessary I/O functionality
  – File open
  – File close
  – File data read
  – File data write
  – Flush buffer to disk
  – Adjust file pointer (seek)
• Two interface variants provide roughly equivalent functionality
  – Low-level file interface (file handles are integer descriptors)
  – C stream interface (streams are represented by the FILE structure; function names prefixed with “f”)
• But: no parallel I/O support
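As a rough illustration of how the six low-level calls fit together, the sketch below writes a message, flushes it, rewinds, and reads it back. The function name posix_roundtrip, the 128-byte buffer, and the 0644 permissions are assumptions made for this example, not part of the lecture.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write msg to path, flush it to storage, rewind,
   and read it back. Returns 0 if the data read back matches msg,
   -1 on any failure. */
int posix_roundtrip(const char *path, const char *msg)
{
    char buf[128];
    size_t len = strlen(msg);
    int ok = -1;

    if (len > sizeof buf)
        return -1;

    /* 1. open: a mode argument is needed because O_CREAT is given */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* 2. write, 3. flush, 4. seek, 5. read */
    if (write(fd, msg, len) == (ssize_t)len &&
        fsync(fd) == 0 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, len) == (ssize_t)len &&
        memcmp(buf, msg, len) == 0)
        ok = 0;

    /* 6. close */
    if (close(fd) != 0)
        ok = -1;
    return ok;
}
```

Calling posix_roundtrip("/tmp/demo.dat", "hello") should return 0 on a writable file system. For brevity the sketch treats a short read or write as a failure rather than retrying.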
File Open
Function: open()
int open(const char *path, int flags);
int open(const char *path, int flags, mode_t mode);
Description: Opens the file identified by path, returning a non-negative descriptor on success and -1 on error. The flags argument must contain one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR; additional file creation flags may be bitwise or'd: O_CREAT, O_EXCL, and O_TRUNC. The mode argument specifies the access permissions of a newly created file; it must be supplied whenever O_CREAT is used.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
...
/* create empty writable file with permissions 0666 (as modified by
   the process umask), storing its descriptor in fd */
int fd = open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666);
if (fd < 0) {
    /* handle error here */
}
Function: fopen()
FILE *fopen(const char *path, const char *mode);
Description: Opens the file identified by path, associating a stream with it and returning a non-null pointer if successful. The mode string is one of: “r” (reading), “r+” (reading and writing), “w” (writing, creating the file or truncating an existing one), “w+” (reading and writing, with creation or truncation), “a” (appending: writing at the end of file), or “a+” (reading and appending, with creation if the file doesn’t exist).
#include <stdio.h>
...
/* replicate the open() example, storing the stream pointer in f */
FILE *f = fopen("test", "w");
if (f == NULL) {
    /* handle error here */
}
...
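The O_EXCL flag mentioned in the description is easiest to understand from its failure case. The following sketch, with the hypothetical helper names create_exclusive and exclusive_create_demo, creates a file exclusively and then checks that a second attempt fails with EEXIST. The owner-only permissions are an arbitrary choice for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create path only if it does not exist yet; returns a descriptor or -1. */
int create_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
}

/* Returns 0 if the first creation succeeds and the second one fails
   with EEXIST, -1 otherwise. */
int exclusive_create_demo(const char *path)
{
    unlink(path);                          /* start from a clean state */

    int fd = create_exclusive(path);
    if (fd < 0)
        return -1;
    close(fd);

    errno = 0;
    int again = create_exclusive(path);    /* the file exists now */
    int result = (again < 0 && errno == EEXIST) ? 0 : -1;
    if (again >= 0)
        close(again);

    unlink(path);                          /* clean up */
    return result;
}
```

This pattern is commonly used for lock files and for avoiding accidental overwrites of existing output.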
File Close
Function: close()
int close(int fd);
Description: Closes file descriptor fd, making it available for reuse, and returns zero on success. OS resources associated with the open file descriptor are freed. Note that a successful close does not guarantee that file data have been saved to the disk.
#include <unistd.h>
...
/* open a file */
int rc;
int fd = open(...);
...
/* file is accessed here */
...
rc = close(fd);
if (rc != 0) {
    /* handle error here */
}
Function: fclose()
int fclose(FILE *fp);
Description: Flushes the stream pointed to by fp and closes the underlying file descriptor, returning zero on success. Note that the buffer flush affects only data implicitly managed by the C library, not the kernel buffers.
#include <stdio.h>
...
/* open a file */
int rc;
FILE *f = fopen(...);
...
/* file is accessed here */
...
rc = fclose(f);
if (rc != 0) {
    /* handle error here */
}
File Read
Function: read()
ssize_t read(int fd, void *buf, size_t count);
Description: Attempts to read at most count sequential bytes from file descriptor fd into the buffer starting at buf. Returns the number of bytes read, which may be less than count; zero indicates end of file. On error, -1 is returned.
#include <unistd.h>
...
ssize_t bytes;
char buf[100];
/* open an existing file for reading */
int fd = open(...);
...
bytes = read(fd, buf, 100);
if (bytes < 100) {
    /* handle error, EOF, or short read here */
}
...
Function: fread()
size_t fread(void *ptr, size_t size, size_t n, FILE *stream);
Description: Reads n sequential elements of data, each size bytes long, from the stream identified by stream, storing them in the location pointed to by ptr. Returns the number of items (not bytes!) successfully read. On error, or if end of file is reached, the return value is less than n; feof() and ferror() tell the two cases apart.
#include <stdio.h>
...
size_t items;
char buf[100];
/* open an existing file for reading */
FILE *f = fopen(...);
...
items = fread(buf, 1, 100, f);
if (items < 100) {
    /* handle EOF or error here */
}
...
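Because read() may legitimately return fewer bytes than requested, callers that need a full buffer usually loop. A minimal sketch of such a loop follows; read_full and read_full_demo are hypothetical names, and the pipe in the demo is only a convenient way to produce a known amount of input followed by end of file.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to count bytes, retrying after short reads and EINTR.
   Returns the number of bytes read (less than count only at end of
   file), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n == 0)                 /* end of file */
            break;
        if (n < 0) {
            if (errno == EINTR)     /* interrupted by a signal: retry */
                continue;
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Demo: put 10 bytes into a pipe, close the write end, then ask for 100.
   Returns 0 if exactly the 10 bytes come back. */
int read_full_demo(void)
{
    int fds[2];
    char buf[100];

    if (pipe(fds) != 0)
        return -1;
    ssize_t w = write(fds[1], "0123456789", 10);
    close(fds[1]);                  /* reader sees EOF after the data */
    ssize_t n = (w == 10) ? read_full(fds[0], buf, sizeof buf) : -1;
    close(fds[0]);
    return (n == 10 && memcmp(buf, "0123456789", 10) == 0) ? 0 : -1;
}
```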
File Write
Function: write()
ssize_t write(int fd, const void *buf, size_t count);
Description: Writes sequentially at most count bytes from the buffer pointed to by buf to the file identified by descriptor fd. Returns the number of bytes written; a value less than count typically means that the underlying device ran out of space or that the call was interrupted by a signal. On error, -1 is returned.
#include <unistd.h>
...
ssize_t bytes;
char buf[100];
/* open a file for writing or appending */
int fd = open(...);
...
/* initialize buffer data */
...
bytes = write(fd, buf, 100);
if (bytes < 100) {
    /* handle error or short write */
}
...
Function: fwrite()
size_t fwrite(const void *ptr, size_t size, size_t n, FILE *stream);
Description: Writes sequentially n elements of data, each size bytes long, to the stream identified by stream from the location pointed to by ptr. Returns the number of items successfully written. On error, the return value is less than n.
#include <stdio.h>
...
size_t items;
char buf[100];
/* open a file for writing or appending */
FILE *f = fopen(...);
...
/* initialize buffer data */
...
items = fwrite(buf, 1, 100, f);
if (items < 100) {
    /* handle short write */
}
...
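The "handle short write" branch is often implemented as a retry loop. The sketch below shows one possible shape; write_all and write_all_demo are hypothetical names, and the demo simply checks the resulting file position after writing 100 bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly count bytes, continuing after short writes and EINTR.
   Returns count on success or -1 on error. */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = write(fd, p + done, count - done);
        if (n < 0) {
            if (errno == EINTR)     /* interrupted by a signal: retry */
                continue;
            return -1;              /* e.g. ENOSPC or EBADF */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Demo: write 100 bytes and verify the file offset afterwards.
   Returns 0 on success, -1 otherwise. */
int write_all_demo(const char *path)
{
    char buf[100];
    memset(buf, 'x', sizeof buf);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int ok = (write_all(fd, buf, sizeof buf) == (ssize_t)sizeof buf &&
              lseek(fd, 0, SEEK_CUR) == (off_t)sizeof buf) ? 0 : -1;
    if (close(fd) != 0)
        ok = -1;
    return ok;
}
```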
File Seek
Function: lseek()
off_t lseek(int fd, off_t offs, int whence);
Description: Adjusts the offset of the open file associated with the descriptor fd to the argument offs in accordance with whence, which may assume the following values: SEEK_SET (sets offset to offs bytes), SEEK_CUR (offset is set to the current location plus offs), or SEEK_END (sets offset to the size of the file plus offs). Returns the resultant offset value measured from the beginning of file, or (off_t) -1 on error.
#include <sys/types.h>
#include <unistd.h>
...
/* open file for read/write access */
int fd = open("/tmp/myfile", O_RDWR);
...
/* write some file data */
...
/* "rewind" to the beginning of file to check the written data */
lseek(fd, 0, SEEK_SET);
/* start reading... */
Function: fseek()
int fseek(FILE *stream, long offs, int whence);
Description: Sets the file position indicator for the stream identified by stream. The meaning of the offset and whence arguments is the same as for lseek(). Returns zero on success, or -1 on error; unlike lseek(), it does not return the new offset, which can be obtained with ftell().
#include <stdio.h>
...
/* open file for reading and writing */
FILE *f = fopen("/tmp/myfile", "r+");
...
/* to start appending data at the end of file: */
fseek(f, 0, SEEK_END);
fwrite(...);
...
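The difference in return values between the two seek functions shows up when determining a file's size. In the sketch below, size_by_lseek, size_by_fseek, and make_file are hypothetical helper names chosen for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Descriptor interface: lseek() itself returns the resulting offset. */
off_t size_by_lseek(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return (off_t)-1;
    off_t size = lseek(fd, 0, SEEK_END);
    close(fd);
    return size;
}

/* Stream interface: fseek() only reports success (0) or failure (-1),
   so ftell() is needed to obtain the position. */
long size_by_fseek(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1L;
    long size = -1L;
    if (fseek(f, 0, SEEK_END) == 0)
        size = ftell(f);
    fclose(f);
    return size;
}

/* Test helper: create a file containing n bytes. Returns 0 on success. */
int make_file(const char *path, size_t n)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (fputc('x', f) == EOF) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

After make_file(path, 1234), both size functions should report 1234.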
File Data Flushing
Function: fsync()
int fsync(int fd);
Description: Transfers all modified in-core data and metadata (such as file size) of the file referred to by descriptor fd to the permanent storage device. The call blocks until the transfer is complete. Returns zero on success, -1 on error.
#include <fcntl.h>
#include <unistd.h>
...
/* open file for writing; a mode is required with O_CREAT */
int fd = open("checkpt.dat", O_WRONLY|O_CREAT, 0644);
...
/* write checkpoint data */
...
/* make sure data are flushed to disk before starting the next iteration */
fsync(fd);
...
Function: fflush()
int fflush(FILE *stream);
Description: Forces a write of all user-space buffered data of the output stream identified by stream, or of all open output streams if stream is NULL. Returns zero on success, or EOF on error.
#include <stdio.h>
...
/* open file for appending */
FILE *f = fopen("/var/log/app.log", "a");
...
/* special event happened: output a message */
fprintf(f, "driver initialization failed");
/* make sure message reaches at least kernel buffers before application crashes */
fflush(f);
...
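The two flush calls act on different buffer layers, so a stream-based program that needs data on the storage device can combine them. The sketch below is one way to do that; append_durably is a hypothetical name, and fileno() is the POSIX call that exposes the descriptor behind a stream.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Append one line to path and push it through both buffer layers.
   Returns 0 on success, -1 on any failure. */
int append_durably(const char *path, const char *line)
{
    FILE *f = fopen(path, "a");
    if (f == NULL)
        return -1;

    int ok = 0;
    if (fprintf(f, "%s\n", line) < 0)
        ok = -1;
    if (fflush(f) != 0)               /* C library buffer -> kernel buffers */
        ok = -1;
    if (fsync(fileno(f)) != 0)        /* kernel buffers -> storage device */
        ok = -1;
    if (fclose(f) != 0)
        ok = -1;
    return ok;
}
```

Calling fsync() on every message can be expensive, so this is usually reserved for checkpoints or critical log entries.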
Problems with POSIX File I/O
• Too simplistic interface
  – Operates on anonymous sequences of bytes
  – No preservation of type or information structure
  – Cumbersome access to optimized/additional features (fcntl, ioctl)
  – Designed for sequential I/O (even regularly strided accesses require multiple calls and may suffer from poor performance)
• Portability issues
  – Must use specialized reader/writer created for a particular application
  – Compatibility checks dependent on application developers (possibility of undetected failures)
  – No generic utilities to parse and interpret the contents of saved files
  – Cross platform endianness and type representation problem if saving in binary mode
  – Significant waste of storage space if text mode is used (for portability or readability of transferred data)
• Permit access only to locally mounted storage, or remote storage via NFS (which has its share of problems)
• Parallel and concurrent access issues
  – Lack of synchronization when accessing shared files from multiple nodes
  – Atomic access to shared files may not be enforceable, has unclear semantics, or has to rely on the programmer for synchronization
  – Uncoordinated access of I/O devices shared by multiple nodes may result in poor performance (bottlenecks)
  – Additional performance loss due to suboptimal bulk data movement (e.g., no collective I/O)
  – On the other hand, without sharing, the management of individual files (i.e. with at least one data file per I/O node) is complicated and tedious
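To make the "regularly strided accesses require multiple calls" point concrete, the sketch below reads every fourth 4-byte block of a small file. It uses pread(), a POSIX call that reads at an explicit offset and is not among the six functions covered in this lecture. The names read_strided and read_strided_demo, as well as the 64-byte test pattern, are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read nblocks blocks of blocklen bytes each, spaced stride bytes apart,
   starting at byte offset start. One system call is issued per block.
   Returns the number of calls made, or -1 on error or short read. */
long read_strided(int fd, char *dst, off_t start, size_t blocklen,
                  off_t stride, size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++) {
        off_t pos = start + (off_t)b * stride;
        if (pread(fd, dst + b * blocklen, blocklen, pos) != (ssize_t)blocklen)
            return -1;
    }
    return (long)nblocks;
}

/* Demo: a 64-byte file holds 4-byte blocks labelled '0','1','2','3' in
   rotation; collect the four blocks labelled '2'.
   Returns the number of calls used (4), or -1 on failure. */
long read_strided_demo(const char *path)
{
    char data[64], out[16];
    for (int i = 0; i < 64; i++)
        data[i] = (char)('0' + (i / 4) % 4);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    long calls = -1;
    if (write(fd, data, sizeof data) == (ssize_t)sizeof data) {
        calls = read_strided(fd, out, 8, 4, 16, 4);
        for (int i = 0; calls > 0 && i < 16; i++)
            if (out[i] != '2')
                calls = -1;
    }
    close(fd);
    unlink(path);
    return calls;
}
```

Sixteen useful bytes cost four system calls here; with realistic array sizes the call count grows with the number of blocks, which is the overhead that higher-level parallel I/O layers try to avoid.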
MPI-IO Overview
• Initially developed as a research project at the IBM T. J. Watson Research Center in 1994
• Voted by the MPI Forum to be included in the MPI-2 standard (Chapter 9)
• Most widespread open-source implementation is ANL’s ROMIO, written by Rajeev Thakur (http://www-unix.mcs.anl.gov/romio/)
• Integrates file access with the message passing infrastructure, using similarities between send/receive and file write/read operations
• Allows MPI datatypes to describe data layouts in files meaningfully, instead of dealing with unorganized streams of bytes
• Provides potential for performance optimizations through the mechanism of “hints”, collective operations on file data, or relaxation of data access atomicity
• Enables better file portability by offering alternative data representations
MPI-IO Features (I)
• Basic file manipulation (open/close, delete, space preallocation, resize, storage synchronization, etc.)
• File views (define what part of a file each process can see and how it is interpreted)
  – Processes can view file data independently, with possible overlaps
  – The users may define patterns to describe data distributions both in file and in memory, including non-contiguous layouts
  – Permit skipping over fixed header blocks (“displacements”)
  – Views can be changed by tasks at any time
• Data access positioning
  – Explicitly specified offsets (suffix “_at”)
  – Independent data access by each task via individual file pointers (no suffix)
  – Coordinated access through shared file pointer (suffix “_shared”)
• Access synchronism
  – Blocking
  – Non-blocking (includes split-collective operations)
MPI-IO Features (II)
• Access coordination
  – Non-collective (no additional suffix)
  – Collective (suffix: “_all” for most blocking calls, “_begin” and “_end” for split-collective, or “_ordered” for equivalent of shared pointer access)
• File interoperability (ensures portability of data representation)
  – Native: for purely homogeneous environments
  – Internal: heterogeneous environments with implementation-defined data representation (subset of “external32”)
  – External32: heterogeneous environments using data representation defined by the MPI-IO standard
• Optimization hints (the “_info” interface)
  – Access style (e.g. read_once, write_once, sequential, random, etc.)
  – Collective buffering components (buffer and block sizes, number of target nodes)
  – Striping unit and factor
  – Chunked I/O specification
  – Preferred I/O devices
• C, C++ and Fortran bindings
MPI-IO Types
• Etype (elementary datatype): the unit of data access and positioning; all data accesses are performed in etype units and offsets are measured in etypes
• Filetype: basis for partitioning the file among processes: a template for accessing the file; may be identical to or derived from the etype
Source: http://www.mhpcc.edu/training/workshop2/mpi_io/MAIN.html
MPI-IO File Views
A view defines the current set of data visible and accessible from an open file as an ordered set of etypes
• Each process has its own view of the file, defined by: a displacement, an etype, and a filetype
• Displacement: an absolute byte position relative to the beginning of file; defines where a view begins
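The arithmetic behind a view can be sketched in plain C without calling MPI. The example below assumes one particular, common filetype: each process sees one block of blocklen etypes per tile, the blocks of all nprocs processes are laid out in rank order, and the tiling repeats from the displacement onward. The function name view_to_file_offset and this layout are assumptions for illustration; real MPI filetypes can describe far more general patterns.

```c
#include <stddef.h>

/* Map the k-th etype visible to process `rank` onto an absolute byte
   offset in the file, for a block-cyclic view:
     disp       : view displacement in bytes
     etype_size : size of one etype in bytes
     blocklen   : etypes per block owned by one process
     nprocs     : number of processes sharing the file */
long long view_to_file_offset(long long disp, size_t etype_size,
                              long long blocklen, int rank, int nprocs,
                              long long k)
{
    long long tile   = k / blocklen;   /* which repetition of the filetype */
    long long within = k % blocklen;   /* position inside this rank's block */
    long long etypes = tile * nprocs * blocklen
                     + (long long)rank * blocklen
                     + within;
    return disp + (long long)etype_size * etypes;
}
```

For example, with single-byte etypes, blocks of 256, four processes, and no displacement, the first element seen by rank 1 sits at byte 256 and its 257th element at byte 1280. With a 16-byte header and 8-byte etypes interleaved one per process, element 1 of rank 2 sits at byte 64.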
MPI-IO: File Open
Function: MPI_File_open()
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh);
Description: Opens the file identified by filename on all processes in the comm group, using the access mode specified in amode. The operation is collective; all participating processes must pass identical values for amode and use a filename referencing the same file. A successful call returns the open file handle in fh, which can be used to subsequently access the file.
It is possible to open a file independently of other processes by passing MPI_COMM_SELF in the comm argument.
#include <mpi.h>
...
MPI_File fh;
int err;
...
/* create a writable file with default parameters */
err = MPI_File_open(MPI_COMM_WORLD, "/mnt/piofs/testfile",
                    MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
if (err != MPI_SUCCESS) {
    /* handle error here */
}
...
MPI-IO: File Close
Function: MPI_File_close()
int MPI_File_close(MPI_File *fh);
Description: Synchronizes file state (equivalent to implicit invocation of MPI_File_sync), and then closes the file associated with handle fh. The user must ensure that all outstanding non-blocking requests and split-collective operations associated with handle fh have completed. If the file was opened with access mode MPI_MODE_DELETE_ON_CLOSE, it is deleted from the file system.
#include <mpi.h>
...
MPI_File fh;
int err;
...
/* open a file storing the handle in fh */
/* perform file access */
...
err = MPI_File_close(&fh);
if (err != MPI_SUCCESS) {
    /* handle error here */
}
...
MPI-IO: Set File View
Function: MPI_File_set_view()
int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info);
Description: Changes the process’ view of the data file, setting the start of the view to disp, the type of file data to etype, the distribution of file data to processes to filetype, and the data representation to datarep. Resets the individual and shared file pointers to zero. The call is collective, requiring the values for datarep and etype extents to be identical for all processes. The data representation must be one of: “native”, “internal” or “external32”.
#include <mpi.h>
...
MPI_File fh;
int err;
...
/* open file storing the handle in fh */
...
/* view the file as stream of integers with no header, using native data representation */
err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
if (err != MPI_SUCCESS) {
    /* handle error */
}
...
MPI-IO: Read File with Explicit Offset
Function: MPI_File_read_at()
int MPI_File_read_at(MPI_File fh, MPI_Offset offs, void *buf, int count, MPI_Datatype type, MPI_Status *status);
Description: Reads count elements of type type from the file represented by fh at offset offs, storing them in the buffer pointed to by buf. Offset offs is expressed in etype units relative to the current view associated with the file handle fh. A successful call returns the amount of data transferred in status.
#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int buf[3], err;
...
/* open file storing the handle in fh */
...
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
/* read the third triad of integers from file */
err = MPI_File_read_at(fh, 6, buf, 3, MPI_INT, &stat);
...
MPI-IO: Write to File with Explicit Offset
Function: MPI_File_write_at()
int MPI_File_write_at(MPI_File fh, MPI_Offset offs, void *buf, int count, MPI_Datatype type, MPI_Status *status);
Description: Writes count elements of type type from buffer buf to the file represented by fh at offset offs. Offset offs is expressed in etype units relative to the current view associated with the file handle fh. A successful call returns the amount of data transferred in status.
#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int err;
double dt = 0.0005;
...
/* open file storing the handle in fh */
...
MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
/* store timestep as the first item in file */
err = MPI_File_write_at(fh, 0, &dt, 1, MPI_DOUBLE, &stat);
...
MPI-IO: Read File Collectively with Individual File Pointers
Function: MPI_File_read_all()
int MPI_File_read_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status);
Description: All processes in the communicator group associated with the file handle fh read their respective count elements of type type from the file at the offsets determined by the current values of the file pointers cached on their file handles, storing them in the buffers pointed to by buf. A successful call returns the amount of data transferred in status.
#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int buf[20], err;
...
/* open file storing the handle in fh */
...
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
/* read 20 integers at current file offset in every process */
err = MPI_File_read_all(fh, buf, 20, MPI_INT, &stat);
...
MPI-IO: Write to File Collectively with Individual File Pointers
Function: MPI_File_write_all()
int MPI_File_write_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status);
Description: All processes in the communicator group associated with the file handle fh write their respective count elements of type type from buffers buf to the file at the offsets determined by the current values of the file pointers cached on their file handles. A successful call returns the amount of data transferred in status.
#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
double t;
int err, rank;
...
/* open file storing the handle in fh; compute t */
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* interleave time values t from each process at the beginning of file:
   each view starts rank*sizeof(t) bytes into the file */
MPI_File_set_view(fh, rank*sizeof(t), MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
err = MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
...
MPI-IO: File Seek
Function: MPI_File_seek()
int MPI_File_seek(MPI_File fh, MPI_Offset offs, int whence);
Description: Updates the value of the individual file pointer according to whence, which has the following possible values:
• MPI_SEEK_SET: the pointer is set to offs
• MPI_SEEK_CUR: the pointer is set to the current value plus offs
• MPI_SEEK_END: the pointer is set to the end of file plus offs
The offset offs is expressed in etype units relative to the current view.
#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
double t;
int rank;
...
/* open file storing the handle in fh; compute t */
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* interleave time values t from each process at the beginning of file */
MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
/* the offset comes before whence: move rank doubles past the view start */
MPI_File_seek(fh, rank, MPI_SEEK_SET);
MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
...
MPI-IO Data Access Classification
[Table: classification of MPI-IO data access routines by positioning, synchronism, and coordination; not reproduced in this transcript]
Source: http://www.mpi-forum.org/docs/mpi2-report.pdf
Example: Scatter to File
Example created by Jean-Pierre Prost from IBM Corp.
Scatter Example Source
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

static int buf_size = 1024;
static int blocklen = 256;
static char filename[] = "scatter.out";

int main(int argc, char **argv)
{
    char *buf;
    int myrank, commsize;
    MPI_Datatype filetype, buftype;
    int length[3];
    MPI_Aint disp[3];
    MPI_Datatype type[3];
    MPI_File fh;
    int mode, nbytes;
    MPI_Offset offset;
    MPI_Status status;

    /* initialize MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &commsize);

    /* initialize buffer: every byte holds the digit of the owning rank */
    buf = (char *) malloc(buf_size);
    memset((void *)buf, '0' + myrank, buf_size);

    /* create and commit buftype */
    MPI_Type_contiguous(buf_size, MPI_CHAR, &buftype);
    MPI_Type_commit(&buftype);

    /* create and commit filetype: one block of blocklen chars at byte
       blocklen * myrank inside an extent of blocklen * commsize bytes.
       Note: MPI_LB, MPI_UB and MPI_Type_struct were deprecated in MPI-2
       and removed in MPI-3; newer code uses MPI_Type_create_struct and
       MPI_Type_create_resized instead. */
    length[0] = 1;
    length[1] = blocklen;
    length[2] = 1;
    disp[0] = 0;
    disp[1] = blocklen * myrank;
    disp[2] = blocklen * commsize;
    type[0] = MPI_LB;
    type[1] = MPI_CHAR;
    type[2] = MPI_UB;
    MPI_Type_struct(3, length, disp, type, &filetype);
    MPI_Type_commit(&filetype);

    /* open file */
    mode = MPI_MODE_CREATE | MPI_MODE_WRONLY;
    MPI_File_open(MPI_COMM_WORLD, filename, mode, MPI_INFO_NULL, &fh);

    /* set file view */
    offset = 0;
    MPI_File_set_view(fh, offset, MPI_CHAR, filetype, "native", MPI_INFO_NULL);

    /* write buffer to file */
    MPI_File_write_at_all(fh, offset, (void *)buf, 1, buftype, &status);

    /* print out number of bytes written */
    MPI_Get_elements(&status, MPI_CHAR, &nbytes);
    printf("TASK %d ====== number of bytes written = %d ======\n", myrank, nbytes);

    /* close file */
    MPI_File_close(&fh);

    /* free datatypes */
    MPI_Type_free(&buftype);
    MPI_Type_free(&filetype);

    /* free buffer */
    free(buf);

    /* finalize MPI */
    MPI_Finalize();
    return 0;
}
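The file layout that the scatter example is expected to produce can be reasoned about without running MPI. The sketch below builds the file image in memory by applying each rank's block-cyclic placement. It is a plain C simulation, not the MPI program itself; simulate_scatter and scatter_byte_at are hypothetical names, and the values 1024 and 256 are taken from buf_size and blocklen in the listing. It assumes buf_size is a multiple of blocklen.

```c
#include <stdlib.h>
#include <string.h>

/* Fill image (commsize * buf_size bytes) with what each rank would write:
   byte k of a rank's buffer lands in that rank's block of tile k / blocklen. */
void simulate_scatter(char *image, int commsize, int buf_size, int blocklen)
{
    for (int rank = 0; rank < commsize; rank++) {
        for (int k = 0; k < buf_size; k++) {
            long tile = k / blocklen;
            long pos  = tile * commsize * blocklen
                      + (long)rank * blocklen
                      + k % blocklen;
            image[pos] = (char)('0' + rank);   /* same fill as the memset */
        }
    }
}

/* Return the simulated byte at file offset pos for commsize processes,
   or '\0' if the arguments are out of range. */
char scatter_byte_at(int commsize, long pos)
{
    const int buf_size = 1024, blocklen = 256;   /* values from the example */
    long size = (long)commsize * buf_size;

    if (commsize <= 0 || pos < 0 || pos >= size)
        return '\0';
    char *image = malloc((size_t)size);
    if (image == NULL)
        return '\0';
    memset(image, '?', (size_t)size);
    simulate_scatter(image, commsize, buf_size, blocklen);
    char c = image[pos];
    free(image);
    return c;
}
```

With four processes the simulated file is 4096 bytes long and consists of 256-byte runs of '0', '1', '2', '3' repeated four times, so byte 256 is '1' and byte 1024 is '0' again.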
Data Access Optimizations
[Figures: Data Sieving; 2-phase I/O (Collective Read Implementation in ROMIO)]
Source: http://www-unix.mcs.anl.gov/~thakur/papers/romio-coll.pdf
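The idea behind data sieving can be sketched with POSIX calls: instead of issuing one small read per noncontiguous block, a single large read covers the whole extent and the wanted pieces are copied out of a temporary buffer. The code below is only an illustration of that idea under simplifying assumptions (regular stride, everything fits in memory); it is not ROMIO's implementation, and read_sieved and read_sieved_demo are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fetch nblocks blocks of blocklen bytes, spaced stride bytes apart and
   starting at byte offset start, using ONE read of the covering extent.
   Returns the number of read calls issued (1), 0 if nothing was
   requested, or -1 on error. */
long read_sieved(int fd, char *dst, off_t start, size_t blocklen,
                 size_t stride, size_t nblocks)
{
    if (nblocks == 0)
        return 0;

    size_t span = stride * (nblocks - 1) + blocklen;  /* covering extent */
    char *tmp = malloc(span);
    if (tmp == NULL)
        return -1;

    long calls = -1;
    if (pread(fd, tmp, span, start) == (ssize_t)span) {
        /* sieve: keep the requested blocks, drop the holes between them */
        for (size_t b = 0; b < nblocks; b++)
            memcpy(dst + b * blocklen, tmp + b * stride, blocklen);
        calls = 1;
    }
    free(tmp);
    return calls;
}

/* Demo: a 64-byte file holds 4-byte blocks labelled '0','1','2','3' in
   rotation; collect the four blocks labelled '2' with a single read.
   Returns the number of read calls used (1), or -1 on failure. */
long read_sieved_demo(const char *path)
{
    char data[64], out[16];
    for (int i = 0; i < 64; i++)
        data[i] = (char)('0' + (i / 4) % 4);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    long calls = -1;
    if (write(fd, data, sizeof data) == (ssize_t)sizeof data) {
        calls = read_sieved(fd, out, 8, 4, 16, 4);
        for (int i = 0; calls > 0 && i < 16; i++)
            if (out[i] != '2')
                calls = -1;
    }
    close(fd);
    unlink(path);
    return calls;
}
```

The trade-off is that 52 bytes are transferred to obtain 16 useful ones; the approach pays off when the cost per call dominates the cost per byte.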
ROMIO Scaling Examples
• Bandwidths obtained for 512^3 arrays (astrophysics benchmark) on Argonne IBM SP

Write Operations
Processors   Independent I/O   Collective I/O
16           1.26 MB/s         64.8 MB/s
32           1.25 MB/s         69.5 MB/s
48           1.36 MB/s         70.6 MB/s

Read Operations
Processors   Independent I/O   Collective I/O
16           12.8 MB/s         68.5 MB/s
32           6.46 MB/s         82.6 MB/s
48           5.83 MB/s         88.4 MB/s

Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/astro.html
CSC 7600 Lecture 20 : Parallel File I/O 2Spring 2011
Independent vs. Collective Access
36
Collective I/O on IBM SP
Individual I/O on IBM SP
Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/upshot.html
Demos
• MPI-IO Demo
• POSIX I/O API Demo
37
38
Topics
• Introduction
• POSIX I/O API
• Parallel I/O Libraries (MPI-IO)
• Scientific I/O Interface: netCDF
• Scientific Data Package: HDF5
• Summary – Materials for Test
NetCDF: Introduction
• Stands for Network Common Data Form
• Portable format to represent scientific data
• Developed at the Unidata Program Center in Boulder, Colorado, with many contributions from the user community
• Project page hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR): http://www.unidata.ucar.edu/software/netcdf/
• Provides a set of interfaces for array-oriented data access and a collection of data access libraries for C, Fortran (77 and 90), C++, Java, Perl, Python, and other languages
• Available on UNIX and Windows platforms
• Features a simple programming interface
• Supports large data files (and 64-bit offsets)
• Open source, freely available
• Commonly used file extension is “.nc” (changed from “.cdf” to avoid confusion with other formats)
• Current stable release is version 4.0 (released on June 12, 2008)
• Used extensively by climate modeling, land and atmosphere, marine, naval data storage, satellite data processing, and theoretical physics centers, geological institutes, commercial analysis, universities, and other research institutions in over 30 countries
39
NetCDF Rationale
• To facilitate the use of common datasets by distinct applications
• Permit datasets to be transported between or shared by dissimilar computers transparently, i.e., without translation (automatic handling of different data types, endian-ness, etc.)
• Reduce the programming effort usually spent interpreting formats
• Reduce errors arising from misinterpreting data and ancillary data
• Facilitate using output from one application as input to another
• Establish an interface standard which simplifies the inclusion of new software into an already existing application set (originally: Unidata system)
• However: not another DBMS!
40
Key Properties of NetCDF Format
• Self-describing
  – A netCDF file includes information about the data it contains
• Portable
  – Files are accessible by computers that use different ways of representing and storing integers, floating-point numbers and characters
• Direct-access
  – Enables efficient access to small subsets of a large dataset without the need to read through all preceding data
• Appendable
  – Additional data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure
• Sharable
  – One writer and multiple readers may simultaneously access the same netCDF file
• Archivable
  – Access to all earlier forms of netCDF data will be supported by current and future versions of the software
41
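The "Portable" property rests on a fixed external representation that every reader decodes the same way, whatever its own byte order. The sketch below decodes and encodes a 32-bit integer in big-endian order, the byte order used by the classic netCDF file format. It illustrates the idea only and is not code from the netCDF library; both helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a 32-bit signed integer stored big-endian
 * (most significant byte first). Shifting byte values, rather than casting
 * the pointer, gives the same result on little- and big-endian hosts. */
int32_t decode_be_int32(const unsigned char *bytes)
{
    assert(bytes != NULL);
    uint32_t u = ((uint32_t)bytes[0] << 24) |
                 ((uint32_t)bytes[1] << 16) |
                 ((uint32_t)bytes[2] << 8)  |
                  (uint32_t)bytes[3];
    /* Map the unsigned bit pattern to its two's-complement value. */
    if (u & 0x80000000u)
        return (int32_t)(u - 0x80000000u) - INT32_MAX - 1;
    return (int32_t)u;
}

/* Hypothetical helper: encode `value` into four big-endian bytes. */
void encode_be_int32(int32_t value, unsigned char *bytes)
{
    assert(bytes != NULL);
    uint32_t u = (uint32_t)value;
    bytes[0] = (unsigned char)(u >> 24);
    bytes[1] = (unsigned char)(u >> 16);
    bytes[2] = (unsigned char)(u >> 8);
    bytes[3] = (unsigned char)u;
}
```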
NetCDF Dataset Building Blocks
• Data in netCDF are represented as n-dimensional arrays, with n being 0, 1, 2, … (scalars are 0-dimensional arrays)
• Array elements are of the same data type
• Three basic entities:
  – Dimension: has name and length; one dimension per array may be UNLIMITED for unbounded arrays
  – Variable: identifies array of values of the same type (byte, character, short, int, float, or double)
    • In addition, coordinate variables may be named identically to dimensions, and by convention define physical coordinate set corresponding to that dimension
  – Attribute: provides additional information about a variable, or global properties of a dataset
    • There are established conventions for attribute names, e.g., units, long_name, valid_range, etc.
    • Multiple attributes per dataset are allowed
• The only kind of data structures supported by netCDF classic are collections of named arrays with attached vector attributes
42
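The three entities can be mirrored in a few C structures to see how a variable's shape follows from its dimensions. This is a sketch of the data model only, not the netCDF library's internal representation; every type, field, and function name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DIMS 8   /* arbitrary limit chosen for this sketch */

/* Hypothetical mirror of a netCDF dimension: a name and a length. */
struct sketch_dim {
    const char *name;
    size_t      len;   /* for the unlimited dimension: records written so far */
};

/* Hypothetical mirror of a netCDF variable: a name, an element size, and a
 * shape expressed as indices into an array of dimensions. */
struct sketch_var {
    const char *name;
    size_t      elem_size;               /* bytes per element */
    int         ndims;                   /* 0 for a scalar */
    int         dimids[SKETCH_MAX_DIMS];
};

/* Number of elements in a variable: the product of its dimension lengths. */
size_t sketch_var_count(const struct sketch_var *v, const struct sketch_dim *dims)
{
    assert(v != NULL && v->ndims >= 0 && v->ndims <= SKETCH_MAX_DIMS);
    assert(v->ndims == 0 || dims != NULL);
    size_t n = 1;                        /* a 0-dimensional variable holds one value */
    for (int i = 0; i < v->ndims; i++)
        n *= dims[v->dimids[i]].len;
    return n;
}
```

For example, a 4-byte variable shaped (time, level, lat, lon) with lengths 1, 4, 5, and 10 holds 200 elements, or 800 bytes per record.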
Common Data Form Language (CDL)
• NetCDF uses CDL to provide a way to describe data model
• CDL represents the information stored in binary netCDF files in a human-readable form, e.g.:
43
netcdf example_1 {  // example of CDL notation for a netCDF dataset

dimensions:         // dimension names and lengths are declared first
    lat = 5, lon = 10, level = 4, time = unlimited;

variables:          // variable types, names, shapes, attributes
    float temp(time,level,lat,lon);
        temp:long_name = "temperature";
        temp:units = "celsius";
    int lat(lat), lon(lon), level(level);
        lat:units = "degrees_north";
        lon:units = "degrees_east";
        level:units = "millibars";
    short time(time);
        time:units = "hours since 1996-1-1";
    // global attributes
        :source = "Fictional Model Output";

data:               // optional data assignments
    level = 1000, 850, 700, 500;
    lat = 20, 30, 40, 50, 60;
    lon = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
    time = 12;
}
NetCDF Utilities
• ncgen
  – Takes input in CDL format and creates a netCDF file, or a C or Fortran program that creates a netCDF dataset
    ncgen [-b] [-o netcdf-file] [-c] [-f] [-k kind] [-x] [input-file]
• ncdump
  – Generates the CDL text representation of a netCDF dataset on standard output, optionally including some or all variable data
  – Output from ncdump is an acceptable input to ncgen
    ncdump [-c|-h] [-v var1,…] [-b lang] [-f lang] [-l len] [-p fdig[,ddig]] [-n name] [-k] [input-file]
44
45
NetCDF API: Create a Dataset
Function: nc_create()
int nc_create(const char *path, int cmode, int *id);
Description: Creates a new dataset, returning its id that can be used in subsequent calls. The file name for the dataset is specified in path. The cmode argument determines the creation mode, and may contain zero or more of the following flags OR'd together: NC_NOCLOBBER (to avoid overwriting existing files), NC_SHARE (limits buffering in scenarios where one or more other processes concurrently read the file being updated by a single writer process), NC_64BIT_OFFSET (create a file with 64-bit offsets). The default zero value is aliased to NC_CLOBBER, i.e. no overwrite protection for existing files. On success NC_NOERR is returned.
#include <netcdf.h>
 ...
int status;
int ncid;
 ...
status = nc_create("foo.nc", NC_NOCLOBBER, &ncid);
if (status != NC_NOERR) handle_error(status);
46
NetCDF API: Open a Dataset
Function: nc_open()
int nc_open(const char *path, int omode, int *id);
Description: Opens an existing dataset stored in a file identified by path, returning its id. The omode argument may contain zero or more of the following flags OR'd together: NC_WRITE (to open in read/write mode), NC_SHARE (same meaning as for nc_create). The default (zero) is aliased to NC_NOWRITE, which opens the file in read-only mode without sharing. On success NC_NOERR is returned.
#include <netcdf.h>
 ...
int status;
int ncid;
 ...
status = nc_open("foo.nc", 0, &ncid);
if (status != NC_NOERR) handle_error(status);
47
NetCDF API: Create a Dimension
Function: nc_def_dim()
int nc_def_dim(int id, const char *name, size_t len, int *dimid);
Description: Adds a new dimension to an open dataset identified by id. The dimension name is pointed to by name, and its length, a positive integer or the constant NC_UNLIMITED, is passed in len. On success NC_NOERR is returned and the dimension id is stored in *dimid.
#include <netcdf.h>
 ...
int status, ncid, latid, recid;
 ...
status = nc_create("foo.nc", NC_NOCLOBBER, &ncid);
if (status != NC_NOERR) handle_error(status);
 ...
status = nc_def_dim(ncid, "lat", 18L, &latid);
if (status != NC_NOERR) handle_error(status);
status = nc_def_dim(ncid, "rec", NC_UNLIMITED, &recid);
if (status != NC_NOERR) handle_error(status);
48
NetCDF API: Create a Variable
Function: nc_def_var()
int nc_def_var(int id, const char *name, nc_type xtype, int ndims,
               const int dimids[], int *varid);
Description: Adds a new variable with the name pointed to by name to an open dataset identified by id. The new variable id is stored in *varid. xtype defines the external data type, and must be one of: NC_BYTE, NC_CHAR, NC_SHORT, NC_INT, NC_FLOAT, or NC_DOUBLE. The arguments ndims and dimids specify respectively the number of dimensions and their ids. On success NC_NOERR is returned.
#include <netcdf.h>
int status, ncid;                /* error status and dataset ID */
int lat_dim, lon_dim, time_dim;  /* dimension IDs */
int rh_id, rh_dimids[3];         /* variable ID and shape */
 ...
status = nc_create("foo.nc", NC_NOCLOBBER, &ncid);
if (status != NC_NOERR) handle_error(status);
/* define dimensions */
status = nc_def_dim(ncid, "lat", 5L, &lat_dim);
if (status != NC_NOERR) handle_error(status);
status = nc_def_dim(ncid, "lon", 10L, &lon_dim);
if (status != NC_NOERR) handle_error(status);
status = nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim);
if (status != NC_NOERR) handle_error(status);
/* define variable */
rh_dimids[0] = time_dim;
rh_dimids[1] = lat_dim;
rh_dimids[2] = lon_dim;
status = nc_def_var(ncid, "rh", NC_DOUBLE, 3, rh_dimids, &rh_id);
if (status != NC_NOERR) handle_error(status);
49
NetCDF API: Leave Define Mode
Function: nc_enddef()
int nc_enddef(int id);
Description: Finalizes define mode and commits to disk the changes made to the dataset. Returns NC_NOERR on success.
#include <netcdf.h>
 ...
int status;
int ncid;
 ...
status = nc_create("foo.nc", NC_NOCLOBBER, &ncid);
if (status != NC_NOERR) handle_error(status);
 ...
/* create dimensions, variables, attributes */
 ...
status = nc_enddef(ncid);  /* leave define mode */
if (status != NC_NOERR) handle_error(status);
50
NetCDF API: Querying Variable Information
Function: nc_inq_varid(), nc_inq_var*()
int nc_inq_varid(int id, const char *name, int *varid);
int nc_inq_var(int id, int varid, char *name, nc_type *xtype, int *ndims, int dimids[], int *natts);
int nc_inq_varname(int id, int varid, char *name);
int nc_inq_vartype(int id, int varid, nc_type *xtype);
int nc_inq_varndims(int id, int varid, int *ndims);
int nc_inq_vardimid(int id, int varid, int dimids[]);
int nc_inq_varnatts(int id, int varid, int *natts);
Description: The first function stores in *varid the ID of the variable called name in dataset id.
The second function returns information about variable identified by varid, including its name (null terminated, in area pointed to by name), type (in *xtype), number of dimensions (in *ndims), dimension IDs (in dimids[]), and number of attributes (in *natts). The buffer to store variable name has to be allocated by user and should be at least NC_MAX_NAME+1 characters long if the name size is not known in advance. NC_NOERR is returned on success in both calls.
The remaining functions retrieve individual pieces of information, all of which nc_inq_var() returns in a single call.
51
NetCDF API: Variable Information (Example)
#include <netcdf.h>
 ...
int status;                      /* error status */
int ncid;                        /* netCDF ID */
int rh_id;                       /* variable ID */
nc_type rh_type;                 /* variable type */
int rh_ndims;                    /* number of dims */
int rh_dimids[NC_MAX_VAR_DIMS];  /* dimension ids */
int rh_natts;                    /* number of attributes */
 ...
status = nc_open("foo.nc", NC_NOWRITE, &ncid);
if (status != NC_NOERR) handle_error(status);
 ...
status = nc_inq_varid(ncid, "rh", &rh_id);
if (status != NC_NOERR) handle_error(status);
/* we don't need name, since we already know it */
status = nc_inq_var(ncid, rh_id, 0, &rh_type, &rh_ndims, rh_dimids, &rh_natts);
if (status != NC_NOERR) handle_error(status);
52
NetCDF API: Read a Variable
Function: nc_get_var_type()
int nc_get_var_text  (int id, int varid, char *ptr);
int nc_get_var_uchar (int id, int varid, unsigned char *ptr);
int nc_get_var_schar (int id, int varid, signed char *ptr);
int nc_get_var_short (int id, int varid, short *ptr);
int nc_get_var_int   (int id, int varid, int *ptr);
int nc_get_var_long  (int id, int varid, long *ptr);
int nc_get_var_float (int id, int varid, float *ptr);
int nc_get_var_double(int id, int varid, double *ptr);
Description: Reads all the values from a netCDF variable referred to by varid of an open dataset with handle id. The dataset must be in data mode. The values of multidimensional arrays are read into consecutive memory locations with the last dimension varying fastest, starting at the location pointed to by ptr (which the call writes to, so it is not const). Type conversion will occur if the type of data differs from the netCDF variable type. Returns NC_NOERR on success.
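"Last dimension varying fastest" is ordinary C row-major order, so a value read into a flat buffer can be located with simple index arithmetic. A minimal sketch for a variable shaped (time, lat, lon); the helper name rh_index is hypothetical and not part of the netCDF API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flat index of element (t, lat, lon) in a buffer
 * filled with the last dimension (lon) varying fastest. */
size_t rh_index(size_t t, size_t lat, size_t lon, size_t nlats, size_t nlons)
{
    assert(lat < nlats && lon < nlons);
    return (t * nlats + lat) * nlons + lon;
}
```

With 5 latitudes and 10 longitudes, stepping lon by one moves one element, stepping lat by one moves 10 elements, and stepping t by one moves 50 elements.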
53
NetCDF API: Read a Variable (Example)
#include <netcdf.h>
 ...
#define TIMES 3
#define LATS 5
#define LONS 10
int status;                       /* error status */
int ncid;                         /* netCDF ID */
int rh_id;                        /* variable ID */
double rh_vals[TIMES*LATS*LONS];  /* array to hold values */
 ...
status = nc_open("foo.nc", NC_NOWRITE, &ncid);
if (status != NC_NOERR) handle_error(status);
 ...
status = nc_inq_varid(ncid, "rh", &rh_id);
if (status != NC_NOERR) handle_error(status);
 ...
/* read values from netCDF variable */
status = nc_get_var_double(ncid, rh_id, rh_vals);
if (status != NC_NOERR) handle_error(status);
54
NetCDF API: Write a Variable
Function: nc_put_var_type()
int nc_put_var_text  (int id, int varid, const char *ptr);
int nc_put_var_uchar (int id, int varid, const unsigned char *ptr);
int nc_put_var_schar (int id, int varid, const signed char *ptr);
int nc_put_var_short (int id, int varid, const short *ptr);
int nc_put_var_int   (int id, int varid, const int *ptr);
int nc_put_var_long  (int id, int varid, const long *ptr);
int nc_put_var_float (int id, int varid, const float *ptr);
int nc_put_var_double(int id, int varid, const double *ptr);
Description: Writes all values of a possibly multidimensional variable referred to by varid to an open dataset with handle id. The location of the block of data values to be written is pointed to by ptr. The values may be implicitly converted to the external data type specified in the variable definition. Returns NC_NOERR on success.
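Implicit conversion to the external type can fail for individual values: a double such as 40000.0 has no representation in a variable declared short. The sketch below shows the kind of range check such a conversion needs. It is an illustration under stated assumptions, not the netCDF library's conversion code; the helper name and the choice of 0 as placeholder are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: convert `n` doubles to shorts, as happens when an
 * in-memory double array is stored in a variable whose external type is
 * short. Returns how many values did not fit. */
size_t doubles_to_shorts(const double *in, short *out, size_t n)
{
    assert(in != NULL && out != NULL);
    size_t out_of_range = 0;
    for (size_t i = 0; i < n; i++) {
        /* Written as a negated test so that NaN is also rejected. */
        if (!(in[i] >= (double)SHRT_MIN && in[i] <= (double)SHRT_MAX)) {
            out_of_range++;
            out[i] = 0;             /* placeholder chosen for this sketch */
        } else {
            out[i] = (short)in[i];  /* fraction is truncated toward zero */
        }
    }
    return out_of_range;
}
```

A caller would treat a nonzero return value as a range error for the whole write, even though the in-range values were converted.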
55
NetCDF API: Write a Variable (Example)
#include <netcdf.h>
 ...
#define TIMES 3
#define LATS 5
#define LONS 10
int status;                       /* error status */
int ncid;                         /* netCDF ID */
int rh_id;                        /* variable ID */
double rh_vals[TIMES*LATS*LONS];  /* array to hold values */
int i;
 ...
status = nc_open("foo.nc", NC_WRITE, &ncid);
if (status != NC_NOERR) handle_error(status);
 ...
status = nc_inq_varid(ncid, "rh", &rh_id);
if (status != NC_NOERR) handle_error(status);
 ...
for (i = 0; i < TIMES*LATS*LONS; i++)
    rh_vals[i] = 0.5;
/* write values into netCDF variable */
status = nc_put_var_double(ncid, rh_id, rh_vals);
if (status != NC_NOERR) handle_error(status);
56
NetCDF API: Close a Dataset
Function: nc_close()
int nc_close(int id);
Description: Closes an open dataset referred to by id. If the dataset is in define mode, nc_enddef() will be called implicitly. After close, the id value may be reassigned to another newly opened or created dataset. NC_NOERR is returned on success.
#include <netcdf.h>
 ...
int status;
int ncid;
 ...
status = nc_create("foo.nc", NC_NOCLOBBER, &ncid);
if (status != NC_NOERR) handle_error(status);
 ...
/* create dimensions, variables, attributes */
 ...
status = nc_close(ncid);  /* close netCDF dataset */
if (status != NC_NOERR) handle_error(status);
Parallel NetCDF
Possible usage scenarios on parallel computers
– Serial netCDF to access single files from a single process
– Multiple files accessed concurrently and independently through serial netCDF API
– Parallel netCDF API to access single files cooperatively or collectively
57
Source: http://www-unix.mcs.anl.gov/parallel-netcdf/pnetcdf-sc2003.pdf
PnetCDF Implementation
• Available from: http://trac.mcs.anl.gov/projects/parallel-netcdf
• Library layer between user space and file system space
• Processes parallel I/O requests from compute nodes, optimizes them, and passes them down to the MPI-IO library
• Advantages:
  – Optimized for the netCDF file format
  – Regular and predictable data patterns in netCDF compatible with MPI-IO interface
  – Low overhead of header I/O (local header copies viable)
  – Well defined metadata creation phase
  – No need for collective I/O when accessing individual objects
• Disadvantages:
  – No hierarchical data layout
  – Additions of data and header extensions are costly after file creation due to linear layout order
  – No support for combining of multiple files in memory (like HDF5 software mounting)
  – NetCDF source required for installation
58
Source: http://www-unix.mcs.anl.gov/parallel-netcdf/pnetcdf-sc2003.pdf
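When several processes access one file cooperatively, each one needs to know which part of a variable it owns, usually as a start index and an element count along one dimension. The sketch below computes an even block decomposition. It is plain C with no PnetCDF calls, and the helper name block_range is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split `n` elements among `nprocs` processes as
 * evenly as possible. The first (n % nprocs) ranks get one extra element.
 * On return, *start is the first index owned by `rank` and *count is the
 * number of elements it owns. */
void block_range(size_t n, int nprocs, int rank, size_t *start, size_t *count)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    assert(start != NULL && count != NULL);
    size_t base  = n / (size_t)nprocs;
    size_t extra = n % (size_t)nprocs;
    size_t r     = (size_t)rank;
    *count = base + (r < extra ? 1 : 0);
    *start = r * base + (r < extra ? r : extra);
}
```

For 10 elements on 4 processes this yields the ranges [0,3), [3,6), [6,8), and [8,10), which cover the dimension without gaps or overlap.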
PnetCDF Sample Calling Sequence
59
Source: http://www-unix.mcs.anl.gov/parallel-netcdf/sc03_present.pdf
60
Topics
• Introduction
• POSIX I/O API
• Parallel I/O Libraries (MPI-IO)
• Scientific I/O Interface: netCDF
• Scientific Data Package: HDF5
• Summary – Materials for Test
Introduction to HDF5
• Acronym for Hierarchical Data Format, a portable, freely distributable, and well supported library, file format, and set of utilities to manipulate it
• Explicitly designed for use with scientific data and applications
• Initial HDF version was created at NCSA/University of Illinois at Urbana-Champaign in 1988
• First revision in widespread use was HDF4
• Main HDF features include:
  – Versatility: supports different data models and associated metadata
  – Self-describing: allows an application to interpret the structure and contents of a file without any extraneous information
  – Flexibility: permits mixing and grouping various objects together in one file in a user-defined hierarchy
  – Extensibility: accommodates new data models, added both by the users and developers
  – Portability: can be shared across different platforms without preprocessing or modifications
• HDF5 is the most recent incarnation of the format, adding support for new type and data models, parallel I/O and streaming, and removing a number of existing restrictions (maximal file size, number of objects per file, flexibility of type use, storage management configurability, etc.), as well as improving the performance
61
HDF5 File Layout
• Major object classes: groups and datasets
• Namespace resembles file system directory hierarchy (groups ≡ directories, datasets ≡ files)
• Alias creation supported through links (both soft and hard)
• Mounting of sub-hierarchies is possible
62
User’s view
Low-level organization
HDF5 API & Tools
Library functionality grouped by function name prefix
• H5: general purpose functions
• H5A: attribute interface
• H5D: dataset manipulation
• H5E: error handling
• H5F: file interface
• H5G: group creation and access
• H5I: object identifiers
• H5P: property lists
• H5R: references
• H5S: dataspace definition
• H5T: datatype manipulation
• H5Z: inline data filters and compression
63
Command-line utilities
• h5cc, h5c++, h5fc: C, C++ and Fortran compiler wrappers
• h5redeploy: updates compiler tools after installation in new location
• h5ls, h5dump: lists hierarchy and contents of a HDF5 file
• h5diff: compares two HDF5 files
• h5repack, h5repart: rearranges or repartitions a file
• h5toh4, h4toh5: converts between HDF5 and HDF4 formats
• h5import: imports data into HDF5 file
• gif2h5, h52gif: converts image data between gif and HDF5 formats
Basic HDF5 Concepts
• Group
  – Structure containing zero or more HDF5 objects (possibly other groups)
  – Provides a mechanism for mapping a name (path) to an object
  – “Root” group is a logical container of all other objects in a file
• Dataset
  – A named array of data elements (possibly multi-dimensional)
  – Specifies the representation of the dataset the way it will be stored in HDF5 file through associated datatype and dataspace parameters
• Dataspace
  – Defines dimensionality of a dataset (rank and dimension sizes)
  – Determines the effective subset of data to be stored or retrieved in subsequent file operations (aka selection)
• Datatype
  – Describes atomically accessed element of a dataset
  – Permits construction of derived (compound) types, such as arrays, records, enumerations
  – Influences conversion of numeric values between different platforms or implementations
• Attribute
  – A small, user-defined structure attached to a group, dataset or named datatype, providing additional information
64
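A dataspace selection picks out the subset of a dataset that an operation actually transfers. The effect of a simple rectangular selection can be sketched in plain C as a copy from a sub-block of a row-major array into a dense buffer. This illustrates the concept only and uses no HDF5 calls; the helper name copy_selection is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy the rectangular selection that begins at
 * (start_row, start_col) and spans nrows x ncols out of the row-major
 * array `src` (src_rows x src_cols) into the dense array `dst`. */
void copy_selection(const int *src, size_t src_rows, size_t src_cols,
                    size_t start_row, size_t start_col,
                    size_t nrows, size_t ncols, int *dst)
{
    assert(src != NULL && dst != NULL);
    assert(start_row + nrows <= src_rows);
    assert(start_col + ncols <= src_cols);
    for (size_t r = 0; r < nrows; r++)
        for (size_t c = 0; c < ncols; c++)
            dst[r * ncols + c] =
                src[(start_row + r) * src_cols + (start_col + c)];
}
```

On a 4 x 6 array holding the values 1 to 24 in row-major order, selecting 2 x 3 elements starting at row 1, column 2 yields 9, 10, 11, 15, 16, 17.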
HDF5 Spatial Subset Examples
65
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf
HDF5 Virtual File Layer
66
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf
• Developed to cope with large number of available storage subsystem variations
• Permits custom file driver implementations and related optimizations
Overview of Data Storage Options
67
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf
Simultaneous Spatial and Type Transformation Example
68
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf
Simple HDF5 Code Example
69
/* Writing and reading an existing dataset. */
#include "hdf5.h"
#define FILE "dset.h5"

int main(void) {
    hid_t  file_id, dataset_id;  /* identifiers */
    herr_t status;
    int    i, j, dset_data[4][6];

    /* Initialize the dataset. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            dset_data[i][j] = i * 6 + j + 1;

    /* Open an existing file. */
    file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Open an existing dataset (three-argument form used since HDF5 1.8). */
    dataset_id = H5Dopen2(file_id, "/dset", H5P_DEFAULT);

    /* Write the dataset. */
    status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                      H5P_DEFAULT, dset_data);

    /* Read the dataset back. */
    status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                     H5P_DEFAULT, dset_data);

    /* Close the dataset. */
    status = H5Dclose(dataset_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    return (status < 0) ? 1 : 0;
}
Parallel HDF5
• Relies on MPI-IO as the file layer driver
• Uses MPI for internal communications
• Most of the functionality controlled through property lists (requires minimal HDF5 interface changes)
• Supports both individual and collective file access
• Three raw data storage layouts: contiguous, chunking and compact
• Enables additional optimizations through derived MPI datatypes (esp. for regular collective accesses)
• Limitations
  – Chunked storage with overlapping chunks (results non-deterministic)
  – Read-only compression
  – Writes with variable length datatypes not supported
70
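In the chunked layout a dataset is cut into equally sized tiles, so locating an element means finding its tile and its position inside that tile. The sketch below does this arithmetic for a 2-D dataset. The row-major chunk numbering is an assumption made for illustration (HDF5 keeps track of chunks through its own index), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: which chunk an element falls into and where. */
struct chunk_pos {
    size_t chunk_row;        /* chunk coordinate along rows */
    size_t chunk_col;        /* chunk coordinate along columns */
    size_t chunk_id;         /* row-major chunk number (sketch assumption) */
    size_t offset_in_chunk;  /* row-major element offset inside the chunk */
};

/* Hypothetical helper: locate element (row, col) of a dataset with `ncols`
 * columns that is cut into chunk_rows x chunk_cols tiles. */
struct chunk_pos locate_in_chunk(size_t row, size_t col, size_t ncols,
                                 size_t chunk_rows, size_t chunk_cols)
{
    assert(chunk_rows > 0 && chunk_cols > 0 && col < ncols);
    /* Number of chunks needed to cover one row of the dataset (ceiling). */
    size_t chunks_per_row = (ncols + chunk_cols - 1) / chunk_cols;
    struct chunk_pos p;
    p.chunk_row = row / chunk_rows;
    p.chunk_col = col / chunk_cols;
    p.chunk_id = p.chunk_row * chunks_per_row + p.chunk_col;
    p.offset_in_chunk = (row % chunk_rows) * chunk_cols + (col % chunk_cols);
    return p;
}
```

For a 4 x 6 dataset with 2 x 3 chunks, element (3, 4) falls in chunk (1, 1), which is chunk number 3, at offset 4 inside that chunk.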
Performance Comparison
71
[Figure: Flash I/O Benchmark (checkpoint files). Bandwidth (MB/s) vs. number of processors for PnetCDF, HDF5 collective, and HDF5 independent; two panels labeled "Bluesky: Power 4" and "Power 5".]
[Figure: WRF-ROMS. Bandwidth (MB/s) vs. number of processors for PNetCDF collective and NetCDF4 collective; two panels, output size 995 MB and output size 15.5 GB.]
Source: http://www.hdfgroup.uiuc.edu/HDF5/projects/archive/WRF-ROMS/IBM-SCICOMP-HDF5-perform-yangnew.ppt
Demo
• NetCDF Demo
• HDF5 Demo
72
73
Topics
• Introduction
• POSIX I/O API
• Parallel I/O Libraries (MPI-IO)
• Scientific I/O Interface: netCDF
• Scientific Data Package: HDF5
• Summary – Materials for Test
74
Summary – Material for the Test
• POSIX – 8, 15
• MPI-I/O – 17-21
• NetCDF – 39-44, 57, 58
• HDF5 – 61-64