
Page 1

FAST’10, Feb 25, 2010

Efficient Object Storage Journaling in a Distributed Parallel File System

Sarp Oral, Feiyi Wang, David Dillow, Galen Shipman, Ross Miller, and Oleg Drokin

Presented by Sarp Oral

Page 2

A Demanding Computational Environment

•  Jaguar XT5: 18,688 nodes, 224,256 cores, 300+ TB memory, 2.3 PFlops
•  Jaguar XT4: 7,832 nodes, 31,328 cores, 63 TB memory, 263 TFlops
•  Frost (SGI Ice): 128-node institutional cluster
•  Smoky: 80-node software development cluster
•  Lens: 30-node visualization and analysis cluster

Page 3

Spider

•  Demonstrated bandwidth of 240 GB/s on the center-wide file system
•  Fastest Lustre file system in the world
•  Demonstrated stability and concurrent mounts on major OLCF systems
  –  Jaguar XT5
  –  Jaguar XT4
  –  Opteron Dev Cluster (Smoky)
  –  Visualization Cluster (Lens)
•  Over 26,000 clients mounting the file system and performing I/O
•  General availability on Jaguar XT5, Lens, Smoky, and GridFTP servers
•  Largest-scale Lustre file system in the world
•  Cutting-edge resiliency at scale
•  Demonstrated resiliency features on Jaguar XT5
  –  DM Multipath
  –  Lustre router failover

Page 4

Designed to Support Peak Performance

[Figure: hourly read and write bandwidth (GB/s) over the January 2010 timeline; max data rates (hourly) on ½ of available storage controllers.]

Page 5

Motivations for a Center Wide File System

•  Building dedicated file systems for platforms does not scale
  –  Storage is often 10% or more of a new system's cost
  –  Storage is often not poised to grow independently of the attached machine
  –  Different technology curves for storage and compute
  –  Data needs to be moved between different compute islands (e.g., from the simulation platform to the visualization platform)
•  Dedicated storage is only accessible when its machine is available
•  Managing multiple file systems requires more manpower

[Diagram: data sharing path; Jaguar XT5, Jaguar XT4, Ewok, Lens, and Smoky connected through the SION network and the Spider system.]

Page 6

Spider: A System At Scale

•  Over 10.7 PB of RAID 6 formatted capacity

•  13,440 1 TB drives

•  192 Lustre I/O servers

•  Over 3 TB of memory (on Lustre I/O servers)

•  Available to many compute systems through the high-speed SION network
  –  Over 3,000 IB ports
  –  Over 3 miles (5 kilometers) of cables

•  Over 26,000 client mounts for I/O

•  Peak I/O performance is 240 GB/s

•  Current Status –  in production use on all major OLCF computing platforms

Page 7

Lustre File System

•  Developed and maintained by CFS, then Sun, now Oracle

•  POSIX compliant, open source parallel file system, driven by DOE Labs

•  Metadata server (MDS) manages namespace

•  Object storage server (OSS) manages Object storage targets (OST)

•  OST manages block devices
  –  ldiskfs on OSTs: v1.6 uses a superset of ext3; v1.8+ uses a superset of ext3 or ext4

•  High-performance
  –  Parallelism by object striping (see the sketch after this list)

•  Highly scalable

•  Tuned for parallel block I/O
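Since striping is what provides the parallelism, a tiny sketch may help. This is an illustrative round-robin mapping (not Lustre source code) from a file offset to a stripe index and object offset, under an assumed 1 MB stripe size and a stripe count of 4:

```python
# Illustrative sketch of round-robin object striping (assumed parameters,
# not Lustre internals): which object/OST serves a given file offset.

def stripe_location(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a file offset to (stripe index within the layout, offset inside that object)."""
    stripe_number = offset // stripe_size           # which stripe of the file overall
    stripe_index = stripe_number % stripe_count     # which object/OST in the layout
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return stripe_index, object_offset

if __name__ == "__main__":
    for off in (0, 1 << 20, 5 << 20):
        idx, obj_off = stripe_location(off)
        print(f"file offset {off:>8}: OST index {idx}, object offset {obj_off}")
```

With such a layout, consecutive 1 MB chunks land on different OSTs, so a large sequential write is serviced by several OSSes in parallel.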

[Diagram: Lustre clients connect over a high-performance interconnect to the Metadata Server (MDS), backed by a Metadata Target (MDT), and to Object Storage Servers (OSS), backed by Object Storage Targets (OST).]

Page 8

Spider - Overview

•  SION IB network
•  13,440 SATA-II disks
•  1,344 (8+2) RAID level 6 arrays (tiers)
•  192 Lustre I/O servers
•  96 DDN S2A9900 controllers (singlets)
•  SAS connections and 192 4x DDR IB connections

[Diagram: SION topology; the Jaguar XT5 and XT4 segments, Smoky, and Lens/Everest reach Spider through core and aggregation switches and 48 leaf switches, with DDR IB link counts shown on each stage.]

•  Currently providing high-performance scratch space to all major OLCF platforms

Page 9

Spider - Speeds and Feeds

•  Enterprise Storage: controllers and large racks of disks are connected via InfiniBand; 48 DataDirect S2A9900 controller pairs with 1 TB drives and 4 InfiniBand connections per pair
•  Storage Nodes: run parallel file system software and manage incoming FS traffic; 192 dual quad-core Xeon servers with 16 GB of RAM each
•  SION Network: provides connectivity between OLCF resources and primarily carries storage traffic; 3,000+ port 16 Gbit/sec InfiniBand switch complex
•  Lustre Router Nodes: run parallel file system client software and forward I/O operations from HPC clients; 192 (XT5) and 48 (XT4) dual-core Opteron nodes with 8 GB of RAM each

[Diagram: data path from Jaguar XT5, Jaguar XT4, and other systems (viz, clusters) through the routers and SION to the storage nodes and disks; link rates shown include the XT5 SeaStar2+ 3D torus at 9.6 GB/s, InfiniBand at 16 Gbit/sec (384 GB/s, 96 GB/s, and 384 GB/s aggregate stages), and Serial ATA at 3 Gbit/sec (366 GB/s aggregate).]

Page 10

Spider - Couplet and Scalable Cluster

•  280 1 TB disks in 5 disk trays
•  DDN S2A9900 couplet (2 controllers)
•  Lustre I/O servers (4 Dell nodes)
•  24-port IB Flextronics switch, with an uplink to the Cisco core switch
•  Together these form a Spider Scalable Cluster (SC)
•  16 SC units on the floor, 2 racks for each SC

Page 11

Spider - DDN S2A9900 Couplet

[Diagram: DDN S2A9900 couplet; two controllers, each with disk channels A-H, P, and S, connected through I/O modules and DEMs to five disk enclosures of 56 drives each (D1-D56), with redundant house and UPS power supplies.]

Page 12

Spider - DDN S2A9900 (cont’d)

[Diagram: RAID 6 (8+2) tier layout; each tier takes one drive from each of the channels A-H, P, and S, giving 8 data drives and 2 parity drives per tier, with tiers 1-14 behind disk controller 1 and tiers 15-28 behind disk controller 2.]

• RAID 6 (8+2)

Page 13

Spider - How Did We Get Here?

•  A 4-year project
•  We didn't just pick up the phone and order a center-wide file system
  –  No single vendor could deliver this system
  –  Trail blazing was required

•  Collaborative effort was key to success –  ORNL –  Cray –  DDN –  Cisco –  CFS, SUN, and now Oracle

Page 14

Spider – Solved Technical Challenges

•  Fault tolerance design –  Network –  I/O servers –  Storage arrays

•  InfiniBand support on XT SIO

[Figure: hard bounce of 7,844 nodes via 48 routers; combined read/write MB/s and IOPS plotted as percent of observed peak vs. elapsed time (seconds). The XT4 is bounced at 206 s, I/O returns at 435 s, and full I/O resumes at 524 s; annotations mark RDMA timeouts, bulk timeouts, OST evictions, and SeaStar torus congestion.]

•  Performance –  Asynchronous journaling –  Network congestion avoidance

•  Scalability –  26,000 file system clients

Page 15

ldiskfs Journaling Overhead

•  Even sequential writes exhibit random I/O behavior due to journaling
•  Observed 4-8 KB writes along with 1 MB sequential writes on the DDNs
•  DDN S2A9900s are not well tuned for small I/O accesses
•  For enhanced reliability, the write-back cache on the DDNs is turned off

•  A special file (contiguous block space) is reserved for journaling on ldiskfs
  –  Labeled as the journal device
  –  Placed at the beginning of the physical disk layout

•  Ordered mode
  –  After the file data portion is committed on disk, the journal metadata portion needs to be committed

•  An extra head seek is needed for every journal transaction commit!
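To see why these extra seeks hurt, a rough back-of-the-envelope model helps. The numbers below (streaming bandwidth, seek time) are assumptions for illustration only, not measurements from this work; the write sizes match the per-1 MB overheads described above:

```python
# Rough illustrative model of ordered-mode journaling overhead per 1 MB write.
# STREAM_BW and SEEK_TIME are assumed SATA figures, not measured values.

STREAM_BW = 70e6          # assumed sustained SATA streaming bandwidth, bytes/s
SEEK_TIME = 0.008         # assumed average seek + rotational delay, seconds
DATA = 1 << 20            # 1 MB of file data per bulk write
METADATA = 16 << 10       # ~16 KB of journaled metadata (bitmap, inode, block map, superblock)
COMMIT = 4 << 10          # 4 KB journal commit record
EXTRA_SEEKS = 2           # head moves to the journal area and back

def effective_bandwidth(extra_seeks=EXTRA_SEEKS):
    transfer_time = (DATA + METADATA + COMMIT) / STREAM_BW
    total_time = transfer_time + extra_seeks * SEEK_TIME
    return DATA / total_time

print(f"without extra seeks: {effective_bandwidth(0) / 1e6:5.1f} MB/s per disk")
print(f"with journal seeks:  {effective_bandwidth() / 1e6:5.1f} MB/s per disk")
```

Under these assumptions the two extra seeks roughly halve per-disk write bandwidth, which is consistent in spirit with the large drop observed at the file system level.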

Page 16

ldiskfs Journaling Overhead (Cont’d)

In write-back mode, no ordering is enforced between the two events (the file data write and the journal commit of the updated metadata). The write-back mode thus provides metadata consistency but does not provide any file data consistency. The ordered mode is the default journaling mode in the Linux ext3 file system. In this mode, though only updated metadata blocks are journaled, the file data is guaranteed to be written to its fixed locations on disk before the metadata updates are committed in the journal, thus providing an order between the metadata and file data commits on disk. The data journaling mode journals both the metadata and the file data.

5.3.1 Journaling Overhead

The ldiskfs file system by default performs journaling in ordered mode, first writing the data blocks to disk followed by the metadata blocks to the journal. The journal is then written to disk and marked as committed. In the worst case, such as appending to a file, this can result in one 16 KB write (on average, covering the bitmap, inode block map, inode, and superblock) and another 4 KB write for the journal commit record for every 1 MB write. These extra small writes cause at least two extra disk head seeks. Due to the poor IOPS performance of SATA disks, these additional head seeks and small writes can substantially degrade the aggregate block I/O performance.

A possible optimization (and perhaps the most obvious one) to improve journaling efficiency is to minimize the extra disk head seeks. This can be achieved by either a software or a hardware optimization (or both).

5.3.2 Baseline Performance at DDN level

In order to obtain this baseline on the DDN S2A9900, the XDD benchmark utility was used. XDD allows multiple clients to exercise a parallel write or read operation synchronously, and can be run in sequential or random read or write mode. Our baseline tests focused on aggregate performance for sequential read or write workloads. Performance results using XDD from 4 hosts connected to the DDN via DDR IB are summarized in Table 1. The results presented are a summary of our testing and show the performance of sequential read, sequential write, random read, and random write using 1 MB transfers. These tests were run using a single host for the single-LUN tests, and 4 hosts each with 7 LUNs for the 28-LUN test. The performance results presented are the best of 5 runs in each configuration.

Table 1: XDD baseline performance (MB/s): sequential and random read and write bandwidth for a single LUN and for 28 LUNs.

5.3.3 Baseline Performance at Lustre Level

After establishing a baseline of performance using XDD, we examined Lustre-level performance using the IOR benchmark. Testing was conducted using 4 OSSes, each with 7 OSTs, on the DDN S2A9900. Our initial results showed very poor write performance of only 1398.99 MB/sec using 28 clients, where each client was writing to a different OST. Lustre-level write performance was a mere 24.9% of our baseline performance.


•  Block-level benchmarking (writes) for 28 tiers: 5608.15 MB/s (baseline)
•  File-system-level benchmark (obdfilter) gives 1398.99 MB/s
  –  24.9% of the baseline bandwidth
  –  One couplet, 4 OSSes each with 7 OSTs
  –  28 clients, one-to-one mapping with OSTs
•  Analysis
  –  Large number of 4 KB writes in addition to the 1 MB writes
  –  Traced back to ldiskfs journal updates

Page 17

Minimizing extra disk head seeks

•  Hardware solutions
  –  External journal on an internal SAS tier
  –  External journal on a network-attached solid-state device
•  Software solution
  –  Asynchronous journal commits

Configuration                                   Bandwidth MB/s (single couplet)   % of baseline
Block level (28 tiers)                          5608.15                           100% (baseline)
Internal journals, SATA                         1398.99                           24.9%
External journals, internal SAS tier            1978.82                           35.2%
External journals, sync to RamSan solid state   3292.60                           58.7%
Internal, async journals, SATA                  5222.95                           93.1%
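As a quick sanity check, the last column follows directly from the measured bandwidths. This is a small illustrative calculation with the values copied from the table, printed to two decimals so that rounding differences from the slide remain visible:

```python
# Recompute "% of baseline" from the measured single-couplet bandwidths above.
baseline = 5608.15  # block-level (28 tiers) write bandwidth, MB/s

measured = {
    "Internal journals, SATA": 1398.99,
    "External journals, internal SAS tier": 1978.82,
    "External journals, sync to RamSan solid state": 3292.60,
    "Internal, async journals, SATA": 5222.95,
}

for config, bw in measured.items():
    print(f"{config:<46} {bw:8.2f} MB/s  {100 * bw / baseline:6.2f}% of baseline")
```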

Page 18

External journals on a solid state device

•  Texas Memory Systems' RamSan-400
  –  Loaned by Vion Corp.
  –  Non-volatile SSD
  –  3 GB/s block I/O
  –  400,000 IOPS
  –  4 IB DDR ports with SRP
•  28 LUNs
  –  One-to-one mapping with DDN LUNs
  –  Obtained 58.7% of baseline performance
  –  Network round-trip latency or inefficiency in the external journal code path might be the culprit

[Diagram: SION topology as on the Spider overview page, with the TMS RamSan-400 attached via 4 DDR IB links.]

Page 19

Synchronous Journal Commits

•  Running and closed transactions
  –  A running transaction accepts new threads to join and has all its data in memory
  –  A closed transaction starts flushing updated metadata to the journaling device; after the flush is complete, the transaction state is marked as committed
  –  The current running transaction can't be closed and committed until the closed transaction fully commits to the journaling device
•  Congestion points
  –  Slow disk
  –  Journal size (1/4 of the journal device)
  –  Extra disk head seek for every journal transaction commit
  –  Write I/O operations for new threads are blocked on the currently closed transaction that is committing

[Diagram: JBD transaction states RUNNING, CLOSED, COMMITTED. The running transaction is marked as CLOSED in memory by the Journaling Block Device (JBD) layer; file data is flushed from memory to disk and must be flushed prior to committing the transaction; updated metadata blocks are then written from memory to the journaling device.]
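To make the blocking behavior concrete, here is a minimal sketch (illustrative Python with assumed names, not JBD source code) of the two-transaction pipeline described above: one running transaction accepts new updates while at most one closed transaction commits, and the running transaction cannot close until that commit finishes, which is where incoming writers stall:

```python
# Minimal sketch of the JBD-style running/closed/committed transaction pipeline.
# Class and method names are illustrative assumptions, not JBD internals.

from enum import Enum

class State(Enum):
    RUNNING = "running"
    CLOSED = "closed"
    COMMITTED = "committed"

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.state = State.RUNNING
        self.updates = []                 # metadata updates buffered in memory

class Journal:
    def __init__(self):
        self.running = Transaction(0)     # accepts new writer threads
        self.committing = None            # the single CLOSED transaction in flight

    def join(self, update):
        """A writer thread adds a metadata update to the running transaction."""
        self.running.updates.append(update)

    def close_running(self):
        """Close the running transaction so it can commit; writers stall if the
        previously closed transaction has not finished committing yet."""
        if self.committing and self.committing.state is not State.COMMITTED:
            raise RuntimeError("writers blocked: previous transaction still committing")
        self.running.state = State.CLOSED
        self.committing = self.running
        self.running = Transaction(self.committing.tid + 1)

    def commit(self):
        """Flush the closed transaction's metadata to the journal device
        (the slow step: an extra head seek per commit on the shared disks)."""
        self.committing.state = State.COMMITTED
```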

Page 20

Asynchronous Journal Commits

•  Change how Lustre uses the journal, not the operation of the journal itself
•  Every server RPC reply has a special field (default, sync)
  –  The id of the last transaction on stable storage
  –  The client uses this to keep a list of completed but not yet committed operations (see the sketch below)
  –  In case of a server crash, these can be resent (replayed) to the server
•  Clients pin dirty and flushed pages in memory (default, sync)
  –  Released only when the server acks that they are committed to stable storage
•  Relax the commit sequence (async)
  –  Add an async flag to the RPC
  –  Reply to clients immediately after the file data portion of the RPC is committed to disk
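A minimal sketch of the client-side bookkeeping this implies (assumed names and structures, not Lustre internals): each reply carries the last committed transaction number, the client releases saved requests and their pinned pages only up to that point, and anything newer is kept for replay if the server crashes:

```python
# Minimal sketch of client-side replay bookkeeping for async journal commits.
# Names and structures are illustrative assumptions, not Lustre code.

class Client:
    def __init__(self):
        self.uncommitted = {}   # transno -> saved request (with its pinned pages)

    def on_reply(self, request, reply_transno, last_committed):
        """Handle a server reply: remember the request until its transaction is
        reported as committed, then release everything up to that point."""
        self.uncommitted[reply_transno] = request
        for transno in sorted(self.uncommitted):
            if transno <= last_committed:
                self.uncommitted.pop(transno)   # on stable storage: drop and unpin pages
            else:
                break

    def on_server_recovery(self, server):
        """After a server crash, resend (replay) completed-but-uncommitted operations in order."""
        for transno in sorted(self.uncommitted):
            server.replay(self.uncommitted[transno])
```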

Page 21

Asynchronous Journal Commits (cont’d)

1.  The server gets the destination object id and offset for the write operation
2.  The server allocates the necessary number of pages in memory and fetches data from the remote client into those pages
3.  The server opens a transaction on the back-end file system
4.  The server updates file metadata, allocates blocks, and extends the file size
5.  The server closes the transaction handle
6.  The server writes the pages with file data to disk synchronously
7.  If the async flag is set, the server completes the operation asynchronously (as sketched below)
  –  The server sends a reply to the client
  –  JBD flushes the updated metadata blocks to the journaling device and writes the commit record
8.  If the async flag is NOT set, the server completes the operation synchronously
  –  JBD flushes the updated metadata blocks to the journaling device and writes the commit record
  –  The server sends a reply to the client
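A condensed sketch of that write path (illustrative Python pseudocode; fs, jbd, and client are placeholders for the back-end file system, the journaling layer, and the RPC peer, not real Lustre APIs). The only difference between the two modes is whether the reply is sent before or after the journal commit:

```python
# Illustrative sketch of the OST write path for sync vs. async journal commits.
# All objects and methods here are hypothetical placeholders.

def handle_write_rpc(rpc, fs, jbd, client):
    pages = fs.allocate_pages(rpc.nbytes)                            # step 2: buffer bulk data
    client.fetch_bulk(rpc, pages)

    txn = jbd.open_transaction()                                     # step 3
    fs.update_metadata(rpc.object_id, rpc.offset, rpc.nbytes, txn)   # step 4
    jbd.close_handle(txn)                                            # step 5

    fs.write_pages_sync(rpc.object_id, rpc.offset, pages)            # step 6: file data to disk

    if rpc.async_journal:
        # step 7: reply first; metadata flush and commit record happen off the RPC path
        client.send_reply(rpc, last_committed=jbd.last_committed_transno())
        jbd.schedule_commit(txn)
    else:
        # step 8: commit the journal, then reply
        jbd.commit(txn)
        client.send_reply(rpc, last_committed=jbd.last_committed_transno())
```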

Page 22

Asynchronous Journal Commits (cont’d)

[Figure: aggregate block I/O (MB/s) vs. number of threads for hardware- and software-based journaling solutions: external journals on a tier of SAS disks, external journals on the RamSan-400 device, internal journals on SATA disks, and async internal journals on SATA disks.]

•  Async journaling achieves 5,223 MB/sec (at file system level) or 93% of baseline

•  Cost effective –  Requires only Lustre code change –  Easy to implement and maintain

•  Temporarily increases client memory consumption

–  Clients have to keep more data in memory until the server acks the commit

•  Does not change the failure semantics or reliability characteristics

–  The guarantees about file system consistency at the local OST remain unchanged

Page 23

Asynchronous Journal Commits Application Performance

•  Up to 50% reduction in runtime
  –  Might not be typical
  –  Depends on the application
•  Reduced the share of small I/O requests from 64% to 26.5%

!"!!#

$!"!!#

%!"!!#

&!"!!#

'!"!!#

(!"!!#

)!"!!#

*!"!!#

+!"!!#

,!"!!#

$!!"!!#

!#-#$%*# ($%#-#(%'# ,$%#-,%'# $!!+-$!&)# $(&)#-#$('+# %!'+#-#%!)!#

./0123#45#63782#9:;

#32</2=8=#>?

@#

9:;#32</2=8#=7A2=#>7B#CD@#

9:;#32</2=8#=7A2#E7=8371/F4B#543#GHI#3/B=#678J#$K&''#637823=#

LMBN#O4/3BPQ=# R=MBN#O4/3BPQ=#

!"!#"!$% !"!&"'&%

!"!'"()%

!"!(")(%

!"!!"!!%

!"!)"($%

!"!("*+%

!"!'")#%

!"!*"'$%

!"!,")(%

!"!&"+&%

!")!"!*%

$'%-./01%23%

4156-%.7%

$'%-./01%23%

4156-%.6%

)+''%-./01%23%

4156-%.7%

)+''%-./01%23%

4156-%.6%

896%:;<

0%=>/"<;6"10-?%

@:A%896%:;<01%

•  Gyrokinetic Toroidal Code (GTC)

Page 24

Conclusions

•  For a system at this scale, we can't just pick up the phone and order one

•  No problem is small when you scale it up

•  At the Lustre file system level we obtained only 24.9% of our baseline block-level performance
  –  Traced back to ldiskfs journal updates

•  Solutions
  –  External journals on an internal SAS tier; achieved 35.2% of baseline
  –  External journals on a network-attached SSD; achieved 58.7% of baseline
  –  Asynchronous journal commits; achieved 93.1% of baseline

•  Removed a bottleneck from the critical write path
•  Decreased the 4 KB I/O observed at the DDNs by 37.5%
•  Cost-effective, easy to deploy and maintain
•  Temporarily increases client memory consumption
•  Doesn't change failure characteristics or semantics

Page 25

Questions?

Contact info

Sarp Oral
oralhs at ornl dot gov
Technology Integration Group
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory