Lustre use cases in the TSUBAME2.0 supercomputer
Tokyo Institute of Technology
Dr. Hitoshi Sato
Posted 21-Dec-2015
Outline
• Introduction of the TSUBAME2.0 Supercomputer
• Overview of TSUBAME2.0 Storage Architecture
• Lustre FS use cases
TSUBAME2.0 (Nov. 2010, w/ NEC-HP)
• A green, cloud-based SC at Tokyo Tech (Tokyo, Japan)
• ~2.4 PFlops peak, 1.192 PFlops Linpack; 4th in the TOP500 (Nov. 2010)
• Next-gen multi-core x86 CPUs + GPUs: 1432 nodes with Intel Westmere/Nehalem-EX CPUs and 4244 NVIDIA Tesla (Fermi) M2050 GPUs
• ~95 TB of memory w/ 0.7 PB/s aggregate bandwidth
• Optical dual-rail QDR IB w/ full bisection bandwidth (fat tree)
• 1.2 MW power, PUE = 1.28; 2nd in the Green500 (greenest production SC)
• VM operation (KVM), Linux + Windows HPC
TSUBAME2.0 Overview

Computing nodes: 2.4 PFlops (CPU+GPU), 224.69 TFlops (CPU), ~100 TB memory, ~200 TB SSD
• Thin nodes: HP ProLiant SL390s G7 × 1408 (32 nodes × 44 racks); CPU: Intel Westmere-EP 2.93 GHz, 6 cores × 2 = 12 cores/node; GPU: NVIDIA M2050, 3 GPUs/node (PCI-E gen2 x16, 2 slots/node); Mem: 54 GB (96 GB); SSD: 60 GB × 2 = 120 GB (120 GB × 2 = 240 GB)
• Medium nodes: HP ProLiant DL580 G7 × 24; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node; Mem: 128 GB; SSD: 120 GB × 4 = 480 GB
• Fat nodes: HP ProLiant DL580 G7 × 10; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node; Mem: 256 GB (512 GB); SSD: 120 GB × 4 = 480 GB
• GSIC: NVIDIA Tesla S1070 GPU

Interconnect: full-bisection optical QDR InfiniBand network
• Core switches: Voltaire Grid Director 4700 × 12 (324 QDR ports each)
• Edge switches: Voltaire Grid Director 4036 × 179 (36 QDR ports each)
• Edge switches w/ 10GbE ports: Voltaire Grid Director 4036E × 6 (34 QDR ports + 2 × 10GbE each)

HDD-based storage systems: 7.13 PB total (parallel FS + home)
• Parallel FS (5.93 PB): MDS/OSS servers HP DL360 G6 × 30 (MDS × 10, OSS × 20); storage DDN SFA10000 × 5 (10 enclosures each)
• Home (1.2 PB): storage servers HP DL380 G6 × 4 (for NFS/CIFS) and BlueArc Mercury 100 × 2 (for NFS/CIFS/iSCSI); storage DDN SFA10000 × 1 (10 enclosures)
• Tape: StorageTek SL8500 tape library, ~4 PB
• Management servers and high-speed data-transfer servers
• External networks: Super Titanet, SuperSINET3
Usage of TSUBAME2.0 storage
• Simulation, traditional HPC
– Outputs for intermediate and final results: 200 dumps of 52 GB × 128 nodes of memory → 1.3 PB; computation in 4 (several) patterns → 5.2 PB
• Data-intensive computing / data analysis
– e.g. bioinformatics, text and video analyses; web text processing for acquiring knowledge from 1 bn pages (2 TB) of HTML → 20 TB of intermediate outputs → 60 TB of final results
• Processing with commodity tools: MapReduce (Hadoop), workflow systems, RDBs, FUSE, etc.
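The volume figures above follow from simple arithmetic; a quick sanity check, assuming decimal units:

```python
GB = 10**9                       # decimal gigabyte, in bytes
PB = 10**15

mem_per_node = 52 * GB           # memory image written per node
nodes = 128
dumps = 200                      # intermediate + final result outputs

per_pattern = dumps * nodes * mem_per_node
print(per_pattern / PB)          # ≈ 1.33 PB per pattern (the ~1.3 PB above)
print(4 * per_pattern / PB)      # ≈ 5.32 PB for 4 patterns (the ~5.2 PB above)
```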
Storage problems in cloud-based SCs
• Support for various I/O workloads
• Storage usage
• Usability
I/O workloads — various HPC apps run on TSUBAME2.0
• Concurrent parallel R/W I/O: MPI (MPI-IO), MPI with CUDA, OpenMP, etc.
• Fine-grained R/W I/O: checkpoints, temporary files; Gaussian, etc.
• Read-mostly I/O: data-intensive apps, parallel workflows, parameter surveys; array jobs, Hadoop, etc.
→ I/O concentration on the shared storage
Storage usage — data life-cycle management
• A few users occupy most of the storage volume: on TSUBAME1.0, only 0.02% of users used more than 1 TB
• Storage resource characteristics:
– HDD: ~150 MB/s, 0.16 $/GB, 10 W/disk
– SSD: 100–1000 MB/s, 4.0 $/GB, 0.2 W/disk
– Tape: ~100 MB/s, 1.0 $/GB, low power consumption
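The per-tier figures above imply a sharp cost/power trade-off. A rough estimate for holding 100 TB on each tier, using the slide's $/GB and W/device numbers plus device capacities (2 TB HDDs, 120 GB SSDs) taken from the hardware listed elsewhere in this deck; the 100 TB working-set size is an arbitrary example:

```python
capacity = 100_000  # GB (100 TB, decimal units), illustrative working set

tiers = {          # $/GB, W/device, GB/device
    "HDD": (0.16, 10.0, 2000),
    "SSD": (4.00, 0.2, 120),
}

report = {}
for name, (usd_per_gb, watts, gb_per_dev) in tiers.items():
    devices = -(-capacity // gb_per_dev)   # ceiling division
    report[name] = (usd_per_gb * capacity, devices, devices * watts)
    print(name, report[name])   # (cost in $, device count, idle watts)
```

SSD comes out ~25× more expensive per GB but draws far less power per device, which is why the deck pairs a small SSD scratch tier with bulk HDD and tape.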
Usability
• Seamless data access to SCs
– Federated storage between private PCs, lab clusters, and SCs
– A campus storage service, like cloud storage services
• How to deal with large data sets: transferring big data between SCs
– e.g. web data mining on TSUBAME1: NICT (Osaka) → Tokyo Tech (Tokyo), stage-in of 2 TB of initial data; Tokyo Tech → NICT, stage-out of 60 TB of results
– Transferring the results via the Internet took 8 days. FedEx?
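The 8-day figure implies the sustained transfer rate, which makes the FedEx quip concrete:

```python
# Implied sustained rate for the 60 TB stage-out that took 8 days.
TB = 10**12
total_bytes = 60 * TB
seconds = 8 * 24 * 3600
rate_MBps = total_bytes / seconds / 10**6
print(round(rate_MBps))   # ≈ 87 MB/s, i.e. well under 1 Gbit/s sustained
```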
TSUBAME2.0 Storage Overview — 11 PB total (7 PB HDD, 4 PB tape), reached over the InfiniBand QDR network used for LNET and other services

• Parallel file system volumes (Lustre, 3.6 PB): "Global Work Space" #1–#3 and "Scratch" (/work0, /work9, /work19, /gscr0) on DDN SFA10k #1–#5, attached via QDR IB (×4) × 20
• Home volumes (1.2 PB, on SFA10k #6): HOME, system application, and iSCSI via "NFS/CIFS/iSCSI by BlueArc" (10GbE × 2); HOME via "cNFS/Clustered Samba w/ GPFS" (QDR IB (×4) × 8)
• Grid storage (GPFS with HSM): GPFS #1–#4, 2.4 PB HDD + ~4 PB tape
• Node-local scratch: "Thin node SSD" and "Fat/Medium node SSD", 130–190 TB
The same storage areas, annotated with the workloads they serve:
• Home storage for computing nodes; cloud-based campus storage services
• Concurrent parallel I/O (e.g. MPI-IO)
• Fine-grained R/W I/O (checkpoints, temporary files)
• Data transfer service between SCs/CCs
• Read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
• Backup
Lustre Configuration

Three file systems, each served by MDS × 2 and OSS × 4:
• Lustre FS #1: 785.05 TB, 10 bn inodes — general purpose (MPI)
• Lustre FS #2: 785.05 TB, 10 bn inodes — scratch (Gaussian)
• Lustre FS #3: 785.05 TB, 10 bn inodes — reserved / backup
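Lustre stripes each file round-robin across a set of OSTs in fixed-size chunks, which is what lets one file system aggregate the bandwidth of its OSS/OSTs. A small sketch of the offset-to-OST mapping; the stripe size and count are illustrative values, not TSUBAME2.0's actual defaults:

```python
def ost_for_offset(offset, stripe_size, stripe_count, first_ost=0):
    """Index of the OST holding the byte at `offset` under round-robin
    (RAID-0 style) striping, as Lustre lays out a striped file."""
    stripe_index = offset // stripe_size
    return (first_ost + stripe_index) % stripe_count

# 1 MiB stripes across 4 OSTs: bytes 0..1MiB-1 land on OST 0,
# the next MiB on OST 1, and so on, wrapping back to OST 0.
MiB = 1 << 20
print(ost_for_offset(0, MiB, 4))              # 0
print(ost_for_offset(5 * MiB + 17, MiB, 4))   # 1
```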
Lustre FS Performance
• I/O throughput: ~11 GB/s per FS; ~33 GB/s with all 3 FSs
• Benchmark: sequential write/read with IOR 2.10.3 (checksum=1), 1–8 clients × 1–7 procs/client, against one Lustre FS (4 OSS, 56 OSTs)
• Measured: 10.7 GB/s and 11 GB/s (sequential write/read)
• Back end: DDN SFA10k, 600 slots, with 560 × 2 TB 7200 rpm SATA data disks
• Setup: clients #1–#8 → IB QDR network → Lustre FS (56 OSTs) served by OSS #1–#4
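Dividing the measured figure by the server and target counts above gives a feel for where the throughput comes from (a rough sketch that ignores RAID layout and caching):

```python
fs_bw = 11_000                  # MB/s, measured per file system
oss_count, ost_count = 4, 56

per_oss = fs_bw / oss_count
per_ost = fs_bw / ost_count
print(per_oss)                  # 2750 MB/s carried by each OSS
print(round(per_ost))           # ~196 MB/s per OST: more than one 150 MB/s
                                # disk, so each OST must span several disks
```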
Failure cases in production operation
• OSS hung up under heavy I/O load: a user issued many small read() operations from many Java processes; I/O slowed down on the OSS, which then went through repeated failovers and rejected connections from clients
• MDS hung up: a client held unnecessary RPC connections during eviction processing, so the MDS kept holding RPC locks from clients
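The OSS overload above is easy to reproduce in spirit: for a fixed data volume, the number of server round trips grows inversely with request size, so tiny read() calls from many processes become an RPC storm at the OSS. The figures below are illustrative; 1 MiB is a typical Lustre RPC size, not a measured TSUBAME2.0 value, and the sketch assumes the pathological case where client-side readahead is not merging requests:

```python
import math

def rpcs_needed(total_bytes, request_bytes, rpc_bytes=1 << 20):
    """Server round trips if every application request costs at least
    one RPC (i.e. no client-side merging of small reads)."""
    per_request = max(1, math.ceil(request_bytes / rpc_bytes))
    requests = math.ceil(total_bytes / request_bytes)
    return requests * per_request

GiB = 1 << 30
print(rpcs_needed(GiB, 4 * 1024))   # 262144 RPCs to read 1 GiB in 4 KiB calls
print(rpcs_needed(GiB, 1 << 20))    # 1024 RPCs with 1 MiB calls
```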
Related activities
• Lustre FS monitoring
• MapReduce
TSUBAME2.0 Lustre FS monitoring (w/ DDN Japan)
Purposes:
• Real-time visualization for TSUBAME2.0 users
• Analysis for production operation and FS research
Lustre Monitoring in detail

Monitoring target:
• 2 × MDS, 4 × OSS, with 56 OSTs (14 OSTs per OSS): mdt, ost00, ost01, ..., ost55
• MDT statistics: open, close, getattr, setattr, link, unlink, mkdir, rmdir, statfs, rename, getxattr, inode usage, etc.
• OST statistics: write/read bandwidth, OST usage, etc.

Monitoring system:
• Each MDS/OSS runs a Ganglia lustre module under gmond, plus an LMT server agent under cerebro
• The management node runs gmetad and the LMT server, which stores samples in MySQL
• A web server hosts a web application (real-time cluster visualization in a web browser) and the LMT GUI (for research and analysis)
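The counters above are exposed by Lustre as plain-text "stats" files that collectors like the Ganglia lustre module and LMT scrape. A minimal parser for that format; the sample text is made up, and the line layout ("name count samples [unit] min max sum") follows the usual Lustre convention but should be treated as an assumption, not a spec:

```python
def parse_stats(text):
    """Parse Lustre-style stats counters into {name: fields} dicts."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[2] == "samples":
            entry = {"count": int(parts[1]), "unit": parts[3].strip("[]")}
            if len(parts) >= 7:  # min/max/sum are present for byte counters
                entry.update(min=int(parts[4]), max=int(parts[5]),
                             sum=int(parts[6]))
            stats[parts[0]] = entry
    return stats

sample = """\
snapshot_time             1300000000.123456 secs.usecs
read_bytes                42 samples [bytes] 4096 1048576 123456789
write_bytes               17 samples [bytes] 4096 1048576 98765432
open                      1234 samples [reqs]
"""
print(parse_stats(sample)["read_bytes"]["sum"])   # 123456789
```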
MapReduce on the TSUBAME supercomputer
• MapReduce
– A programming model for large-scale data processing
– Hadoop is a common open-source implementation
• Supercomputers
– A candidate execution environment
– Needed by various application users: text processing, machine learning, bioinformatics, etc.

Problems:
• Cooperation with the existing batch-job scheduler system (PBS Pro)
– All jobs, including MapReduce tasks, must run under the scheduler's control
• TSUBAME2.0 offers various storage options for data-intensive computation
– Local SSD storage, parallel FSs (Lustre, GPFS)
• Cooperation with GPU accelerators
– Not supported in Hadoop
Hadoop on TSUBAME (Tsudoop)
• Script-based invocation
– Acquires computing nodes via PBS Pro
– Deploys a Hadoop environment on the fly (incl. HDFS)
– Executes the user's MapReduce jobs
• Various FS support
– HDFS by aggregating local SSDs
– Lustre, GPFS (to appear)
• Customized Hadoop for executing CUDA programs (experimental)
– Hybrid map-task scheduling: automatically detects map-task characteristics by monitoring, and schedules map tasks to minimize overall MapReduce job execution time
– Extension of Hadoop Pipes features
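The hybrid scheduling idea above can be sketched as a small assignment problem: given per-task runtimes observed on CPU and GPU, place each map task on whichever resource would finish it soonest. This greedy makespan heuristic is only an illustration of the approach, not Tsudoop's actual algorithm, and the task timings are invented:

```python
import heapq

def schedule(tasks, cpu_slots, gpu_slots):
    """tasks = [(cpu_time, gpu_time), ...]. Each task goes to the
    resource that would complete it earliest, approximating
    'minimize overall MapReduce job execution time'."""
    cpus = [0.0] * cpu_slots   # next-free time of each CPU slot
    gpus = [0.0] * gpu_slots   # next-free time of each GPU slot
    heapq.heapify(cpus)
    heapq.heapify(gpus)
    assignment = []
    for cpu_t, gpu_t in tasks:
        cpu_done = cpus[0] + cpu_t   # finish time on earliest-free CPU
        gpu_done = gpus[0] + gpu_t   # finish time on earliest-free GPU
        if gpu_done < cpu_done:
            heapq.heapreplace(gpus, gpu_done)
            assignment.append("gpu")
        else:
            heapq.heapreplace(cpus, cpu_done)
            assignment.append("cpu")
    makespan = max(max(cpus), max(gpus))
    return assignment, makespan

# Two compute-bound tasks that run 10x faster on GPU, one task with
# no GPU benefit: the GPU-friendly tasks go to the GPU, the other stays
# on CPU, and both resources finish at t=2.
tasks = [(10.0, 1.0), (10.0, 1.0), (2.0, 2.0)]
print(schedule(tasks, cpu_slots=1, gpu_slots=1))  # (['gpu', 'gpu', 'cpu'], 2.0)
```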
Thank you