
Page 1: Parallel and Grid I/O Infrastructure

W. Gropp, R. Ross, R. Thakur (Argonne National Lab)
A. Choudhary, W. Liao (Northwestern University)
G. Abdulla, T. Eliassi-Rad (Lawrence Livermore National Lab)

Page 2: Outline

• Introduction
• PVFS and ROMIO
• Parallel NetCDF
• Query Pattern Analysis

Please interrupt at any point for questions!

Page 3: What is this project doing?

• Extending existing infrastructure work
  – PVFS parallel file system
  – ROMIO MPI-IO implementation
• Helping match application I/O needs to underlying capabilities
  – Parallel NetCDF
  – Query Pattern Analysis
• Linking with Grid I/O resources
  – PVFS backend for GridFTP striped server
  – ROMIO on top of Grid I/O API

Page 4: What Are All These Names?

• MPI - Message Passing Interface standard
  – Also known as MPI-1
• MPI-2 - Extensions to the MPI standard
  – I/O, RDMA, dynamic processes
• MPI-IO - I/O part of the MPI-2 extensions
• ROMIO - Implementation of MPI-IO (see the sketch below)
  – Handles mapping MPI-IO calls into communication (MPI) and file I/O
• PVFS - Parallel Virtual File System
  – An implementation of a file system for Linux clusters
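
To make these relationships concrete, here is a minimal MPI-IO sketch (not taken from the slides): each process writes its own block of a shared file through the MPI-IO interface, and an implementation such as ROMIO maps the call onto the underlying file system (e.g. PVFS). The file name and block size are illustrative.

```c
/* Minimal MPI-IO sketch: each rank writes its own block of a shared file.
 * File name and block size are illustrative, not from the slides. */
#include <mpi.h>

#define BLOCK 1024

int main(int argc, char **argv)
{
    int rank;
    char buf[BLOCK];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Fill this rank's block with some data. */
    for (int i = 0; i < BLOCK; i++)
        buf[i] = (char)(rank & 0xff);

    /* Open one shared file across all processes. */
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the collective call lets the
     * MPI-IO layer (e.g. ROMIO) optimize the access. */
    offset = (MPI_Offset)rank * BLOCK;
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```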

Page 5: Fitting the Pieces Together

• Query Pattern Analysis (QPA) and Parallel NetCDF are both written in terms of MPI-IO calls
  – QPA tools pass information down through MPI-IO hints
  – Parallel NetCDF uses MPI-IO for data read/write (sketched below)
• The ROMIO implementation uses PVFS as the storage medium on Linux clusters, or could hook into Grid I/O resources

[Diagram: Query Pattern Analysis and Parallel NetCDF sit on top of any MPI-IO implementation; the ROMIO MPI-IO implementation sits on the PVFS parallel file system or on Grid I/O resources]
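
As a hedged illustration of the "Parallel NetCDF on top of MPI-IO" layer, the sketch below defines and writes one variable collectively with the PnetCDF C API; the file, dimension, and variable names and the sizes are invented for the example, and error checking is omitted.

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Sketch: define one 1-D variable and write it collectively.
 * Names and sizes are illustrative; error checking omitted. */
void write_field(MPI_Comm comm, int rank, int nprocs)
{
    int ncid, dimid, varid;
    MPI_Offset start[1], count[1];
    float local[100];

    for (int i = 0; i < 100; i++)
        local[i] = (float)rank;

    ncmpi_create(comm, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * 100, &dimid);
    ncmpi_def_var(ncid, "temperature", NC_FLOAT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank writes its own 100-element slab; the collective call
     * is carried out with MPI-IO underneath. */
    start[0] = (MPI_Offset)rank * 100;
    count[0] = 100;
    ncmpi_put_vara_float_all(ncid, varid, start, count, local);

    ncmpi_close(ncid);
}
```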

Page 6: PVFS and ROMIO

• Provide a little background on the two
  – What they are, an example to set context, status
• Motivate the work
• Discuss current research and development
  – I/O interfaces
  – MPI-IO hints
  – PVFS2

Our work on these two is closely tied together.

Page 7: Parallel Virtual File System

• Parallel file system for Linux clusters
  – Global name space
  – Distributed file data
  – Builds on TCP and local file systems
• Tuned for high-performance concurrent access
• Mountable like NFS file systems
• User-level interface library (used by ROMIO)
• 200+ users on the mailing list, 100+ downloads/month
  – Up from 160+ users in March
• Installations at OSC, Univ. of Utah, Phillips Petroleum, ANL, Clemson Univ., etc.

Page 8: PVFS Architecture

• Client-server architecture
• Two server types
  – Metadata server (mgr) - keeps track of file metadata (permissions, owner) and directory structure
  – I/O servers (iod) - orchestrate movement of data between clients and local I/O devices
• Clients access PVFS in one of two ways
  – MPI-IO (using the ROMIO implementation)
  – Mounting through the Linux kernel (loadable module)

Page 9: PVFS Performance

• Ohio Supercomputer Center cluster
• 16 I/O servers (IA32), 70+ clients (IA64), IDE disks
• Block-partitioned data, accessed through ROMIO

Page 10: ROMIO

• Implementation of the MPI-2 I/O specification
  – Operates on a wide variety of platforms
  – Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
  – Fortran and C bindings
• Successes
  – Adopted by industry (e.g. Compaq, HP, SGI)
  – Used at ASCI sites (e.g. LANL Blue Mountain)

[Diagram: MPI-IO interface on top of the ADIO interface, which sits on file-system-specific code (e.g. AD_PVFS, AD_NFS)]

Page 11: Example of Software Layers

• The FLASH astrophysics application stores checkpoints and visualization data using HDF5
• HDF5 in turn uses MPI-IO (ROMIO) to write out its data files (see the sketch below)
• The PVFS client library is used by ROMIO to write data to the PVFS file system
• The PVFS client library interacts with the PVFS servers over the network

[Diagram: software stack - FLASH Astrophysics Code, HDF5 I/O Library, ROMIO MPI-IO Library, PVFS Client Library, PVFS Servers]
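
A minimal sketch of how the HDF5 layer in this stack gets pointed at MPI-IO, assuming a parallel HDF5 build; the file name is illustrative and the dataset creation and write calls are elided.

```c
#include <mpi.h>
#include <hdf5.h>

/* Sketch of the layering on the slide: the application asks HDF5 to use
 * its MPI-IO driver, and HDF5's file I/O then flows through the MPI-IO
 * library (ROMIO here) and on to the file system underneath. */
void create_checkpoint(MPI_Comm comm)
{
    hid_t fapl, file;

    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* route I/O through MPI-IO */

    file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... H5Dcreate/H5Dwrite calls for the checkpoint data would go here ... */

    H5Fclose(file);
    H5Pclose(fapl);
}
```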

Page 12: Example of Software Layers (2)

• The FLASH astrophysics application stores checkpoints and visualization data using HDF5
• HDF5 in turn uses MPI-IO (IBM's implementation) to write out its data files
• The GPFS file system stores the data to disks

[Diagram: software stack - FLASH Astrophysics Code, HDF5 I/O Library, IBM MPI-IO Library, GPFS]

Page 13: Status of PVFS and ROMIO

• Both are freely available, widely distributed, documented, and supported products
• Current work focuses on:
  – Higher performance through richer file system interfaces
  – Hint mechanisms for optimizing the behavior of both ROMIO and PVFS
  – Scalability
  – Fault tolerance

Page 14: Why Does This Work Matter?

• Much of the I/O on big machines goes through MPI-IO
  – Direct use of MPI-IO (visualization)
  – Indirect use through HDF5 or NetCDF (fusion, climate, astrophysics)
  – Hopefully soon through Parallel NetCDF!
• On clusters, PVFS is currently the most widely deployed parallel file system
• Optimizations in these layers are of direct benefit to those users
• The work also provides guidance to vendors for possible future improvements

Page 15: I/O Interfaces

• Scientific applications keep structured data sets in memory and in files
• For highest performance, the description of that structure must be maintained through the software layers (example below)
  – Allow the scientist to describe the data layout in memory and in the file
  – Avoid packing into buffers in intermediate layers
  – Minimize the number of file system operations needed to perform I/O
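
A sketch of what "maintaining the description of the structure" looks like at the MPI-IO level: the process's block of a 2-D global array is described once as an MPI datatype and handed to the I/O layer as a file view, so no intermediate packing is needed and the access happens in one collective call. The array sizes and file name are the caller's, not from the slides.

```c
#include <mpi.h>

/* Sketch: describe a process's block of a 2-D global array once, as an MPI
 * datatype, and hand that description to the I/O layer instead of packing. */
void write_block(MPI_Comm comm, double *block,
                 int gsizes[2], int lsizes[2], int starts[2])
{
    MPI_Datatype filetype;
    MPI_File fh;

    /* The subarray type captures where this process's block lives
     * inside the global array stored in the file. */
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, "array.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* One collective call; the structure travels with it down the stack. */
    MPI_File_write_all(fh, block, lsizes[0] * lsizes[1], MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}
```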

Page 16: File System Interfaces

• MPI-IO is a great starting point
• Most underlying file systems only provide POSIX-like contiguous access
• The List I/O work was a first step in the right direction (a hypothetical sketch follows)
  – Proposed file system interface
  – Allows movement of lists of data regions in memory and in the file with one call

[Diagram: noncontiguous regions in memory mapped to noncontiguous regions in the file]
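
The slides do not give the proposed interface itself, so the prototype below is only a hypothetical illustration of the idea: one call that carries a list of memory regions and a list of file regions. The actual PVFS list I/O interface may differ in names and parameters.

```c
/* Hypothetical prototype, for illustration only (not the actual PVFS API):
 * one call moves a list of memory regions to/from a list of file regions,
 * instead of one contiguous POSIX read/write per region. */
int listio_read(int fd,
                void *mem_addrs[], int mem_lengths[], int mem_count,
                long long file_offsets[], int file_lengths[], int file_count);
```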

Page 17: List I/O

• Implemented in PVFS
• Transparent to the user through ROMIO
• Distributed in the latest releases

[Chart: tiled visualization reader bandwidth (MB/sec, 0-30 scale) comparing POSIX I/O, data sieving, two-phase collective I/O, and list I/O]

Page 18: List I/O Example

• A simple datatype is repeated (tiled) over the file
• We want to read the first 9 bytes of data
• This is converted into four [offset, length] pairs (see the sketch below)
  – File offsets: 0, 2, 6, 10
  – File lengths: 1, 3, 3, 2
• One can see how this process could result in a very large list of offsets and lengths

[Figure: "Flattening a File Datatype" - a byte-granularity datatype tiled three times over file bytes 0 through 11, with the first 9 bytes of data flattening to the offset/length pairs above]
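
For reference, the four [offset, length] pairs above correspond to the following MPI indexed datatype; flattening such a type yields exactly this offset/length list. This is an illustrative sketch, not code from the project.

```c
#include <mpi.h>

/* Sketch of the access pattern on this slide: reading the first 9 bytes of
 * the tiled datatype flattens to offsets {0, 2, 6, 10} with lengths
 * {1, 3, 3, 2}.  The same regions, described as one MPI indexed type: */
void build_flattened_type(MPI_Datatype *newtype)
{
    int blocklengths[4]  = { 1, 3, 3, 2 };   /* file lengths from the slide */
    int displacements[4] = { 0, 2, 6, 10 };  /* file offsets from the slide */

    MPI_Type_indexed(4, blocklengths, displacements, MPI_BYTE, newtype);
    MPI_Type_commit(newtype);
    /* ROMIO flattens a type like this back into the offset/length list
     * that a list I/O call would carry to the file system. */
}
```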

Page 19: Describing Regular Patterns

• List I/O cannot describe regular patterns (e.g. a column of a 2D matrix) in an efficient manner
• MPI datatypes can do this easily (see the sketch below)
• Datatype I/O is our solution to this problem
  – A concise set of datatype constructors used to describe types
  – An API for passing these descriptions to a file system
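
As an illustration of the column example: a single MPI vector type describes the entire column, whereas a list would need one [offset, length] pair per matrix row. The matrix layout (row-major doubles) and size are assumptions of this sketch.

```c
#include <mpi.h>

/* Sketch: one column of an n x n matrix of doubles, stored row-major in a
 * file.  List I/O would need n separate [offset, length] pairs; one MPI
 * vector type captures the whole pattern. */
void set_column_view(MPI_File fh, int n, int col)
{
    MPI_Datatype column;

    /* n blocks of 1 double each, separated by a stride of n doubles. */
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* The displacement skips to the chosen column's first element; a read
     * of n doubles through this view touches exactly that column. */
    MPI_File_set_view(fh, (MPI_Offset)col * sizeof(double),
                      MPI_DOUBLE, column, "native", MPI_INFO_NULL);
}
```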

Page 20: Datatype I/O

• Built using a generic datatype-processing component (also used in MPICH2)
  – Being optimized for performance
• Prototype for PVFS in progress
  – API and server support
• Prototype of support in ROMIO in progress
  – Maps MPI datatypes to PVFS datatypes
  – Passes them through the new API
• This same generic datatype component could be used in other projects as well

Page 21: Datatype I/O Example

• Same datatype as in the previous example
• The datatype is described with one construct (see the sketch below):
  – index {(0,1), (2,2)} describes a pattern of one short block and one longer one
  – automatically tiled (as with MPI types for files)
• The linear relationship between the number of contiguous pieces and the size of the request is removed

[Figure: the same byte-granularity datatype tiled three times over file bytes 0 through 11]
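
A sketch of the same construct expressed with standard MPI datatype calls: the two-entry indexed type tiles across the file when used as a file view, so the size of the description no longer grows with the size of the request. The function name is illustrative.

```c
#include <mpi.h>

/* Sketch of the slide's datatype: index {(0,1), (2,2)} is a 1-byte block at
 * offset 0 plus a 2-byte block at offset 2.  Its extent is 4 bytes, so as a
 * file view it tiles exactly as the figure shows. */
void set_tiled_view(MPI_File fh)
{
    int blocklengths[2]  = { 1, 2 };
    int displacements[2] = { 0, 2 };
    MPI_Datatype tile;

    MPI_Type_indexed(2, blocklengths, displacements, MPI_BYTE, &tile);
    MPI_Type_commit(&tile);

    MPI_File_set_view(fh, 0, MPI_BYTE, tile, "native", MPI_INFO_NULL);
    /* A read of 9 bytes through this view touches the same regions as the
     * previous slide (offsets 0, 2, 6, 10), but the request carries only
     * this two-entry description, however large it grows. */
}
```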

Page 22: MPI Hints for Performance

• ROMIO has a number of performance optimizations built in
• The optimizations are somewhat general, but there are tuning parameters that are very specific
  – buffer sizes
  – number and location of processes performing I/O
  – data sieving and two-phase techniques
• Hints may be used to tune ROMIO to match the system (example below)
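
A sketch of how such hints are passed in practice, through an MPI_Info object at file-open time. The hint names below are standard MPI-IO/ROMIO hint keys; the values shown are only illustrative placeholders, not recommended settings.

```c
#include <mpi.h>

/* Sketch: tuning hints handed to ROMIO via an MPI_Info object at open time.
 * Values are illustrative and would be chosen to match a given system. */
void open_with_hints(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "8388608"); /* two-phase buffer size */
    MPI_Info_set(info, "cb_nodes", "8");             /* number of I/O aggregators */
    MPI_Info_set(info, "romio_ds_read", "enable");   /* data sieving on reads */

    MPI_File_open(comm, "output.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  info, fh);
    MPI_Info_free(&info);
}
```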

Page 23: ROMIO Hints

• Currently all of ROMIO's optimizations may be controlled with hints
  – data sieving
  – two-phase I/O
  – list I/O
  – datatype I/O
• Additional hints are being considered to allow ROMIO to adapt to access patterns
  – collective-only I/O
  – sequential vs. random access
  – inter-file dependencies

Page 24: PVFS2

• PVFS (version 1.x.x) plays an important role as a fast scratch file system for use today
• PVFS2 will supersede this version, adding
  – More comprehensive system management
  – Fault tolerance through lazy redundancy
  – Distributed metadata
  – A component-based approach for supporting new storage and network resources
• Distributed metadata and fault tolerance will extend scalability to thousands or tens of thousands of clients and hundreds of servers
• The PVFS2 implementation is underway

Page 25: Summary

• ROMIO and PVFS are a mature foundation on which to make additional improvements
• New, rich I/O descriptions allow for higher-performance access
• The addition of new hints to ROMIO allows for fine-tuning its operation
• PVFS2 focuses on the next generation of clusters