
© 2013 IBM Corporation

1

Scalable Parallel File I/O with IBM GPFS

Klaus Gottschalk, HPC Architect, Germany

© 2013 IBM Corporation

2

Best Practices for Scalable I/O Are Not Always Straightforward

• Scaling of parallel applications sometimes reveals bad file access patterns

• Assumptions that hold for local disk file systems do not carry over to a parallel file system

• A straightforward approach works fine until the application is scaled to large node counts

• A collaboration with LRZ to develop best-practice recommendations has started
  – Co-workers needed

Agenda

• GPFS features and new GPFS 3.5 update
  – GPFS Storage Server (GSS) – Friday, Karsten Kutzer, IBM
  – Active File Management (AFM)
  – File Placement Optimizer (FPO)

• File access problems and best-practice examples

© 2013 IBM Corporation

3

The IBM General Parallel File System (GPFS) – Shipping since 1998

Extreme Scalability
• 2⁶³ files per file system
• Maximum file system size: 2⁹⁹ bytes
• 19 PB file systems in production
• Number of nodes: 1 to 8,192 (up to 16,384)

Proven Reliability
• No special nodes
• Add/remove nodes and storage on the fly
• Rolling upgrades
• Administer from any node
• Data replication
• Snapshots
• File system journaling

Manageability
• Integrated tiered storage
• Storage pools
• Quotas
• Policy-driven automation
• Clustered NFS
• SNMP monitoring
• TSM / HPSS (DMAPI)

© 2013 IBM Corporation

4

IBM General Parallel File System (GPFS™) – History and evolution

• 1998 – First called GPFS: general file serving for HPC; standards: Portable Operating System Interface (POSIX) semantics; large-block, directory and small-file performance; data management; Virtual Tape Server (VTS)

• 2002 – GPFS 2.1–2.3: Linux® clusters (multiple architectures); IBM AIX® loose clusters; 32-bit/64-bit; HPC, research, visualization, digital media, seismic, weather exploration, life sciences

• 2005 – Inter-op (IBM AIX & Linux); GPFS Multicluster; GPFS over wide area networks (WAN); large-scale clusters with thousands of nodes

• 2006 – GPFS 3.1–3.2: ease of administration; multiple networks / RDMA; distributed token management; Windows 2008; multiple NSD servers; NFS v4 support; small-file performance; information lifecycle management (ILM): storage pools, file sets, policy engine

• 2009 – GPFS 3.3: restricted admin functions; improved installation; new license model; improved snapshot and backup; improved ILM policy engine

• 2010 – GPFS 3.4: enhanced Windows cluster support (homogeneous Windows Server); performance and scaling improvements; enhanced migration and diagnostics support

• 2012 – GPFS 3.5: caching via Active File Management (AFM); GPFS Storage Server; GPFS File Placement Optimizer (FPO)

© 2013 IBM Corporation

5

Evolution of the global namespace: GPFS Active File Management (AFM)

• 1993 – GPFS introduced concurrent file system access from multiple nodes

• 2005 – Multi-cluster expands the global namespace by connecting multiple sites

• 2011 – AFM takes the global namespace truly global by automatically managing asynchronous replication of data

© 2013 IBM Corporation

6

AFM Use Cases

NAS as a key building block of cloud storage architecture (HPC – Distributed NAS – Storage Cloud):
• Enables edge caching in the cloud
• DR support within cloud data repositories
• Peer-to-peer data access among cloud edge sites
• Global wide-area file system spanning multiple sites in the cloud

• WAN caching: caching across the WAN between SoNAS clusters, or between SoNAS and another NAS vendor

• Data migration: online cross-vendor data migration

• Disaster recovery: multi-site fileset-level replication/failover

• Shared namespace: across SoNAS clusters

• Grid computing: allowing data to move transparently during grid workflows

• Facilitates content distribution for global enterprises and “follow-the-sun” engineering teams

© 2013 IBM Corporation

7

AFM Architecture

[Diagram: two cache cluster sites (GPFS + Panache, with gateway nodes in the SoNAS layer) connected to a home cluster site (any NAS box or SOFS) via pNFS/NFS over the WAN; data is pulled on cache miss and pushed on write.]

• A fileset on the home cluster is associated with a fileset on one or more cache clusters

• If data is in the cache
  – Cache hit at local disk speed
  – The client sees local GPFS performance if the file or directory is in the cache

• If data is not in the cache
  – Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
  – Uses NFS/pNFS for the WAN data transfer

• If data is modified at home
  – Revalidation is done at a configurable timeout
  – Close to NFS-style close-to-open consistency across sites
  – POSIX strong consistency within a cache site

• If data is modified at the cache
  – Writes see no WAN latency
  – Writes go to the cache (i.e. local GPFS) and are then asynchronously pushed home

• If the network is disconnected
  – Cached data can still be read, and writes to the cache are written back after reconnection

• There can be conflicts…

© 2013 IBM Corporation

8

AFM Example: Global Namespace

Clients at every site see the same global namespace and access:
/global/data1, /global/data2, /global/data3, /global/data4, /global/data5, /global/data6

• File system store1 – local filesets: /data3, /data4; cache filesets: /data1, /data2, /data5, /data6

• File system store2 – local filesets: /data1, /data2; cache filesets: /data3, /data4, /data5, /data6

• File system store3 – local filesets: /data5, /data6; cache filesets: /data1, /data2, /data3, /data4

• See all data from any cluster

• Cache as much data as required, or fetch data on demand

© 2013 IBM Corporation

9

Policy-based Pre-fetching and Expiration of Data

• Policy-based pre-population

• Periodically runs a parallel inode scan at home
  – Selects files/directories based on the policy criteria
  – Includes any user-defined metadata in extended attributes (xattrs) or other file attributes
  – SQL-like construct to select, e.g.:
    RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND MODIFICATION_TIME > CURRENT_TIME - 3600 AND USER_ATTR1 = 'sat-photo' OR USER_ATTR2 = 'classified'

• The cache then pre-fetches the selected objects
  – Runs asynchronously in the background
  – Parallel multi-node prefetch
  – Can call out when completed

• Staleness control
  – Defined based on the time since disconnection
  – Once a cache is expired, no access to the cache is allowed
  – Manual expire/unexpire option for the admin (mmafmctl –expire/unexpire; ctlcache in SoNAS)
  – Allowed only for read-only (RO) mode caches
  – Disabled for single-writer (SW) and local-update (LU) modes, as they are sources of data themselves

© 2013 IBM Corporation

10

File Placement Optimizer (FPO) – Architecture

• Uses disk local to each server

• All nodes are NSD servers and NSD clients

• Designed for MapReduce workloads

© 2013 IBM Corporation

11

MapReduce Environment Using GPFS-FPO (File Placement Optimizer)

[Diagram: users submit MapReduce jobs to a MapReduce cluster whose local disks form a GPFS-FPO file system; NFS filers attach to the cluster.]

• Uses disk local to each server

• Aggregates the local disk space into a single redundant shared file system

• Designed for MapReduce workloads

• Unlike HDFS, GPFS-FPO is POSIX compliant – so data maintenance is easy

• Intended as a drop-in replacement for open source HDFS (the IBM BigInsights product may be required)

© 2013 IBM Corporation

12

Another GPFS-FPO Use Case – Typical in HPC

[Diagram: users submit jobs through LSF to an HPC cluster; NFS filers hold Project 1 and Project 2 data; each compute node has a local scratch file system.]

• Local file systems are used for high-speed scratch space

• Scratch space is usually RAID 0 (for speed), so reliability can be an issue

• Disk capacity for scratch space can be a limiting factor

• An inaccessible scratch disk renders the compute node useless

• File systems used: ext2, ext3, XFS

© 2013 IBM Corporation

13

Another GPFS-FPO Use Case

[Diagram: users submit jobs through LSF to an HPC cluster; NFS filers hold Project 1 and Project 2 data; the node-local disks form a single shared GPFS-FPO scratch file system.]

• GPFS-FPO creates a single, shared scratch disk pool

• Reliability is higher because of the GPFS-FPO redundancy design

• Scratch disk capacity is much larger

• The likelihood of filling all the scratch disk is much lower

• Performance is preserved because GPFS-FPO exploits locality by design

© 2013 IBM Corporation

14

GPFS Features and Internal Structures

• Wide striping across all NSDs in a GPFS storage pool

• Token-based locking

• Byte-range locking

• Fine-granular directory locking

© 2013 IBM Corporation

15

GPFS Concept – Wide Striping

[Diagram: the blocks 1–6 of a file are striped round-robin across three disks behind GPFS storage server nodes; GPFS storage client nodes access all disks in parallel.]

• Each block of a file goes to a different disk in the pool, so a single job aggregates the bandwidth of all disks: 3 disks at 10 MB/s per disk deliver ~30 MB/s per job, as in the sketch below
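To make the striping arithmetic concrete, here is a minimal conceptual sketch (not GPFS code; the disk count, per-disk bandwidth, and block count are just the slide's example numbers) of how round-robin block placement lets one sequential stream keep every disk busy:

```c
#include <stdio.h>

/* Conceptual sketch of wide striping: successive file blocks land on
 * successive disks (NSDs), so one stream uses the whole pool. */
int main(void)
{
    const int num_disks = 3;            /* NSDs in the storage pool          */
    const double mbps_per_disk = 10.0;  /* sustained bandwidth of one disk   */
    const int num_blocks = 6;           /* file blocks 1..6 as in the figure */

    for (int block = 0; block < num_blocks; block++)
        printf("file block %d -> disk %d\n", block + 1, block % num_disks);

    printf("aggregate per-job bandwidth: %.0f MB/s\n",
           num_disks * mbps_per_disk);
    return 0;
}
```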

© 2013 IBM Corporation

16

GPFS Concepts – Locking and Tokens

[Diagram: nodes A, B, C and D hold byte-range locks on one file – A: 0–2 GB, B: 2–4 GB, C: 5–7 GB, D: 4–6 GB; the ranges of C and D overlap, so node D has to wait for node C to release its token.]

GPFS uses a token-based locking mechanism
• On a lock request, the token system manager grants access for the token validity time
• No further connection to the manager is required – reduced locking overhead
• After token expiration, no client is required for recovery

GPFS provides byte-range tokens for synchronizing parallel access to a file
• With a valid token, each task can independently write to its own file region (see the sketch below)
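As an illustration of the pattern the byte-range tokens enable, here is a minimal sketch assuming an MPI program; the file path and region size are assumptions, not from the slides. Each task writes its own disjoint region of one shared file, so no two tasks ever request overlapping byte-range tokens:

```c
/* Sketch: each MPI rank writes a disjoint byte range of one shared file.
 * Because the regions never overlap, each rank keeps its own byte-range
 * token and the writes proceed in parallel. */
#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define REGION_BYTES (4 * 1024 * 1024)    /* 4 MiB per task (assumed) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(REGION_BYTES);
    memset(buf, 'A' + (rank % 26), REGION_BYTES);

    /* All ranks open the same file; O_CREAT is harmless if it exists. */
    int fd = open("/gpfs/scratch/shared.dat", O_WRONLY | O_CREAT, 0644);

    /* Disjoint offset per rank: the byte-range locks never overlap. */
    off_t offset = (off_t)rank * REGION_BYTES;
    pwrite(fd, buf, REGION_BYTES, offset);

    close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}
```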

© 2013 IBM Corporation

17

Parallel File Access

• Parallel write requires byte-range locking of independent regions

• Overlapping regions cause serialization of write access
  – A task needs to wait until the token expires or is freed by the holding task
  – Potential race conditions for the overlapping region

• Parallel I/O libraries ease use and hide the complexity of the locking mechanism
  – They additionally provide block aggregation and caching
  – Examples: MPI-IO, MIO (see the sketch below)

• The data region per task should be large enough
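For illustration, a minimal MPI-IO sketch (file name and element counts are assumptions): each task writes a contiguous, non-overlapping region with a collective call, letting the library aggregate the per-task requests into large, well-aligned writes:

```c
/* Sketch: collective MPI-IO write where each rank owns a contiguous,
 * non-overlapping region, so the MPI-IO layer can aggregate requests. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                  /* 1 Mi doubles per rank */
    double *data = malloc((size_t)count * sizeof(double));
    for (int i = 0; i < count; i++)
        data[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/result.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the _all variant is collective,
     * so the library can combine and align the per-rank requests. */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, data, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}
```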

© 2013 IBM Corporation

18

Fine-Granular Directory Locking (FGDL)

A GPFS optimization to improve write sharing

• GPFS directories use extendible hashing to map file names to directory blocks
  – The last n bits of the hash value determine the directory block number
  – Example: hash("NewFile") = 10111101; with n = 3 the last bits are 101, so the directory entry goes in block 5

• File create now only locks the hash value of the file name (see the sketch below)
  – The lock ensures that multiple nodes cannot simultaneously create the same file
  – The actual update is shipped to the metanode
  – If the create requires a directory block split, the lock is upgraded to cover both the old and the new block
  – Parallel file-create performance no longer depends on whether or not the files are in the same directory

Traditional file systems implemented directories as a linear file, so a file create locked the entire directory
• Fine if files are in different directories
• Not the best if they are in the same directory
• Unfortunately, this is a natural way to organize files, and programmers are hard to re-educate
  – Typical cases: checkpoints, output files for a time step
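A minimal sketch of the block-selection idea only: the hash function (FNV-1a) and n = 3 are illustrative assumptions, not the actual GPFS internals. The point is that the directory block, and hence the lock, is determined by the low-order bits of the file-name hash:

```c
/* Sketch: pick a directory block from the low-order bits of a file-name
 * hash, as in extendible hashing. FNV-1a and n = 3 are illustrative
 * choices, not the GPFS implementation. */
#include <stdio.h>
#include <stdint.h>

static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    const int n = 3;                       /* use the last 3 hash bits */
    const char *names[] = { "NewFile", "checkpoint_0001", "rank42.out" };

    for (int i = 0; i < 3; i++) {
        uint32_t h = fnv1a(names[i]);
        unsigned block = h & ((1u << n) - 1);   /* last n bits -> block */
        printf("hash(\"%s\") = 0x%08x -> directory block %u\n",
               names[i], (unsigned)h, block);
    }
    return 0;
}
```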

© 2013 IBM Corporation

19

Many Files Created in a Single Directory

Examples
• A parallel application writes a checkpoint file from each task
• A serial application used to process a high number of data files is parallelized in trivial SPMD mode (e.g. rendering a film, compressing multiple files)

Problem
• Creation of files is serialized in traditional file systems
• With the GPFS FGDL feature, parallel creation of files in a single directory scales nicely, but only up to a certain number of tasks
  – The maximum number of parallel tasks depends on the file system configuration (>1000)
    • How small can a byte range of a directory object be?
    • How many tokens can be managed for a single object?

Solution
• Each parallel task creates a sub-directory per node (or task) and stores its files there (see the sketch below)
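A sketch of that solution, assuming an MPI application; the base path and file names are illustrative. Each rank creates its own sub-directory once and then creates all of its files inside it, so file creates from different ranks never target the same parent directory (the creation of the directories themselves is refined on the next slide):

```c
/* Sketch: one sub-directory per MPI rank, so parallel file creates from
 * different ranks never contend on the same directory. Paths are
 * illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank creates (only) its own directory. */
    char dir[256], path[512];
    snprintf(dir, sizeof(dir), "/gpfs/scratch/job0815/task_%06d", rank);
    mkdir(dir, 0755);                      /* EEXIST on restart is fine */

    /* All further files of this rank go into its private directory. */
    snprintf(path, sizeof(path), "%s/checkpoint_step0001.dat", dir);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const char payload[] = "checkpoint data\n";
    write(fd, payload, sizeof(payload) - 1);
    close(fd);

    MPI_Finalize();
    return 0;
}
```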

© 2013 IBM Corporation

20

Many Directories Created from a Parallel Task

A parallel application creates a directory per task to place its files in
• Follow-on problem to "many files in the same directory"

Problem
• The parent directory becomes a bottleneck
• Application startup time does not scale linearly
  – Startup time with 1000 tasks: ~10 min
• The directory write token "bounces" around between nodes

Solution
• The directories are created by a single task (task 0, or within the job start script), as in the sketch below
• Only one write token is required, held by task 0
  – Startup time with 1000 tasks: ~20 s
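A sketch of that pattern (paths are assumptions): only rank 0 creates the per-task directories, so the write token for the parent directory stays on one node; a barrier ensures the directories exist before the other ranks use them.

```c
/* Sketch: rank 0 creates all per-task directories, so only one node needs
 * the write token for the parent directory; everyone else waits at a
 * barrier and then works inside its own directory. Paths are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Single writer: the parent-directory token never has to move. */
        for (int r = 0; r < size; r++) {
            char dir[256];
            snprintf(dir, sizeof(dir), "/gpfs/scratch/job0815/task_%06d", r);
            mkdir(dir, 0755);
        }
    }

    /* Make sure every directory exists before any rank uses it. */
    MPI_Barrier(MPI_COMM_WORLD);

    char path[512];
    snprintf(path, sizeof(path),
             "/gpfs/scratch/job0815/task_%06d/output.dat", rank);
    FILE *fp = fopen(path, "w");
    fprintf(fp, "output of rank %d\n", rank);
    fclose(fp);

    MPI_Finalize();
    return 0;
}
```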

© 2013 IBM Corporation

21