Lustre use cases in the TSUBAME2.0 supercomputer
Tokyo Institute of Technology
Dr. Hitoshi Sato
Posted 21-Dec-2015
Outline
• Introduction of the TSUBAME2.0 Supercomputer
• Overview of TSUBAME2.0 Storage Architecture
• Lustre FS use cases
TSUBAME2.0 (Nov. 2010, w/ NEC-HP)
• A green, cloud-based SC at Tokyo Tech (Tokyo, Japan)
• ~2.4 PFlops peak, 1.192 PFlops Linpack; 4th in the TOP500 (Nov. 2010)
• Next-gen multi-core x86 CPUs + GPUs: 1432 nodes with Intel Westmere/Nehalem-EX CPUs and 4244 NVIDIA Tesla (Fermi) M2050 GPUs
• ~95 TB of memory w/ 0.7 PB/s aggregate bandwidth
• Optical dual-rail QDR IB w/ full bisection bandwidth (fat tree)
• 1.2 MW power, PUE = 1.28; 2nd in the Green500 (greenest production SC)
• VM operation (KVM), Linux + Windows HPC
TSUBAME2.0 Overview

Computing nodes: 2.4 PFlops (CPU+GPU), 224.69 TFlops (CPU), ~100 TB memory, ~200 TB SSD
• Thin nodes: HP ProLiant SL390s G7 × 1408 (32 nodes × 44 racks); CPU: Intel Westmere-EP 2.93 GHz, 6 cores × 2 = 12 cores/node; GPU: NVIDIA M2050, 3 GPUs/node (PCI-E gen2 x16, 2 slots/node); Mem: 54 GB (96 GB); SSD: 60 GB × 2 = 120 GB (120 GB × 2 = 240 GB)
• Medium nodes: HP ProLiant DL580 G7 × 24; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node; Mem: 128 GB; SSD: 120 GB × 4 = 480 GB
• Fat nodes: HP ProLiant DL580 G7 × 10; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node; Mem: 256 GB (512 GB); SSD: 120 GB × 4 = 480 GB
• GSIC: NVIDIA Tesla S1070 GPU

Interconnect: full-bisection optical QDR InfiniBand network
• Core switches: Voltaire Grid Director 4700 × 12 (324 QDR ports each)
• Edge switches: Voltaire Grid Director 4036 × 179 (36 QDR ports each)
• Edge switches w/ 10GbE ports: Voltaire Grid Director 4036E × 6 (34 QDR ports + 2 × 10GbE each)

HDD-based storage systems: 7.13 PB total (parallel FS + home)
• Parallel FS (5.93 PB): MDS/OSS servers HP DL360 G6 × 30 (MDS × 10, OSS × 20); storage DDN SFA10000 × 5 (10 enclosures each)
• Home (1.2 PB): storage servers HP DL380 G6 × 4 (for NFS/CIFS) and BlueArc Mercury 100 × 2 (for NFS/CIFS/iSCSI); storage DDN SFA10000 × 1 (10 enclosures)
• Tape: StorageTek SL8500 tape library, ~4 PB
• Management servers and high-speed data-transfer servers
• External networks: Super Titanet, SuperSINET3
Usage of TSUBAME2.0 storage
• Simulation, traditional HPC
– Outputs for intermediate and final results: 200 dumps of 52 GB × 128 nodes of memory → 1.3 PB; computation in 4 (several) patterns → 5.2 PB
• Data-intensive computing / data analysis
– e.g. bioinformatics, text and video analyses; web text processing for acquiring knowledge from 1 bn pages (2 TB) of HTML → 20 TB of intermediate outputs → 60 TB of final results
• Processing with commodity tools: MapReduce (Hadoop), workflow systems, RDBs, FUSE, etc.
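The volume figures above follow from simple arithmetic; a quick sanity check, assuming decimal units:

```python
GB = 10**9                       # decimal gigabyte, in bytes
PB = 10**15

mem_per_node = 52 * GB           # memory image written per node
nodes = 128
dumps = 200                      # intermediate + final result outputs

per_pattern = dumps * nodes * mem_per_node
print(per_pattern / PB)          # ≈ 1.33 PB per pattern (the ~1.3 PB above)
print(4 * per_pattern / PB)      # ≈ 5.32 PB for 4 patterns (the ~5.2 PB above)
```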
Storage problems in cloud-based SCs
• Support for various I/O workloads
• Storage usage
• Usability
I/O workloads — various HPC apps run on TSUBAME2.0
• Concurrent parallel R/W I/O: MPI (MPI-IO), MPI with CUDA, OpenMP, etc.
• Fine-grained R/W I/O: checkpoints, temporary files; Gaussian, etc.
• Read-mostly I/O: data-intensive apps, parallel workflows, parameter surveys; array jobs, Hadoop, etc.
→ I/O concentration on the shared storage
Storage usage — data life-cycle management
• A few users occupy most of the storage volume: on TSUBAME1.0, only 0.02% of users used more than 1 TB
• Storage resource characteristics:
– HDD: ~150 MB/s, 0.16 $/GB, 10 W/disk
– SSD: 100–1000 MB/s, 4.0 $/GB, 0.2 W/disk
– Tape: ~100 MB/s, 1.0 $/GB, low power consumption
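The per-tier figures above imply a sharp cost/power trade-off. A rough estimate for holding 100 TB on each tier, using the slide's $/GB and W/device numbers plus device capacities (2 TB HDDs, 120 GB SSDs) taken from the hardware listed elsewhere in this deck; the 100 TB working-set size is an arbitrary example:

```python
capacity = 100_000  # GB (100 TB, decimal units), illustrative working set

tiers = {          # $/GB, W/device, GB/device
    "HDD": (0.16, 10.0, 2000),
    "SSD": (4.00, 0.2, 120),
}

report = {}
for name, (usd_per_gb, watts, gb_per_dev) in tiers.items():
    devices = -(-capacity // gb_per_dev)   # ceiling division
    report[name] = (usd_per_gb * capacity, devices, devices * watts)
    print(name, report[name])   # (cost in $, device count, idle watts)
```

SSD comes out ~25× more expensive per GB but draws far less power per device, which is why the deck pairs a small SSD scratch tier with bulk HDD and tape.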
Usability
• Seamless data access to SCs
– Federated storage between private PCs, lab clusters, and SCs
– A campus storage service, like cloud storage services
• How to deal with large data sets: transferring big data between SCs
– e.g. web data mining on TSUBAME1: NICT (Osaka) → Tokyo Tech (Tokyo), stage-in of 2 TB of initial data; Tokyo Tech → NICT, stage-out of 60 TB of results
– Transferring the results via the Internet took 8 days. FedEx?
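The 8-day figure implies the sustained transfer rate, which makes the FedEx quip concrete:

```python
# Implied sustained rate for the 60 TB stage-out that took 8 days.
TB = 10**12
total_bytes = 60 * TB
seconds = 8 * 24 * 3600
rate_MBps = total_bytes / seconds / 10**6
print(round(rate_MBps))   # ≈ 87 MB/s, i.e. well under 1 Gbit/s sustained
```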
TSUBAME2.0 Storage Overview — 11 PB total (7 PB HDD, 4 PB tape), reached over the InfiniBand QDR network used for LNET and other services

• Parallel file system volumes (Lustre, 3.6 PB): "Global Work Space" #1–#3 and "Scratch" (/work0, /work9, /work19, /gscr0) on DDN SFA10k #1–#5, attached via QDR IB (×4) × 20
• Home volumes (1.2 PB, on SFA10k #6): HOME, system application, and iSCSI via "NFS/CIFS/iSCSI by BlueArc" (10GbE × 2); HOME via "cNFS/Clustered Samba w/ GPFS" (QDR IB (×4) × 8)
• Grid storage (GPFS with HSM): GPFS #1–#4, 2.4 PB HDD + ~4 PB tape
• Node-local scratch: "Thin node SSD" and "Fat/Medium node SSD", 130–190 TB
The same storage areas, annotated with the workloads they serve:
• Home storage for computing nodes; cloud-based campus storage services
• Concurrent parallel I/O (e.g. MPI-IO)
• Fine-grained R/W I/O (checkpoints, temporary files)
• Data transfer service between SCs/CCs
• Read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
• Backup
Lustre Configuration

Three file systems, each served by MDS × 2 and OSS × 4:
• Lustre FS #1: 785.05 TB, 10 bn inodes — general purpose (MPI)
• Lustre FS #2: 785.05 TB, 10 bn inodes — scratch (Gaussian)
• Lustre FS #3: 785.05 TB, 10 bn inodes — reserved / backup
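Lustre stripes each file round-robin across a set of OSTs in fixed-size chunks, which is what lets one file system aggregate the bandwidth of its OSS/OSTs. A small sketch of the offset-to-OST mapping; the stripe size and count are illustrative values, not TSUBAME2.0's actual defaults:

```python
def ost_for_offset(offset, stripe_size, stripe_count, first_ost=0):
    """Index of the OST holding the byte at `offset` under round-robin
    (RAID-0 style) striping, as Lustre lays out a striped file."""
    stripe_index = offset // stripe_size
    return (first_ost + stripe_index) % stripe_count

# 1 MiB stripes across 4 OSTs: bytes 0..1MiB-1 land on OST 0,
# the next MiB on OST 1, and so on, wrapping back to OST 0.
MiB = 1 << 20
print(ost_for_offset(0, MiB, 4))              # 0
print(ost_for_offset(5 * MiB + 17, MiB, 4))   # 1
```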
Lustre FS Performance
• I/O throughput: ~11 GB/s per FS; ~33 GB/s with all 3 FSs
• Benchmark: sequential write/read with IOR 2.10.3 (checksum=1), 1–8 clients × 1–7 procs/client, against one Lustre FS (4 OSS, 56 OSTs)
• Measured: 10.7 GB/s and 11 GB/s (sequential write/read)
• Back end: DDN SFA10k, 600 slots, with 560 × 2 TB 7200 rpm SATA data disks
• Setup: clients #1–#8 → IB QDR network → Lustre FS (56 OSTs) served by OSS #1–#4
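Dividing the measured figure by the server and target counts above gives a feel for where the throughput comes from (a rough sketch that ignores RAID layout and caching):

```python
fs_bw = 11_000                  # MB/s, measured per file system
oss_count, ost_count = 4, 56

per_oss = fs_bw / oss_count
per_ost = fs_bw / ost_count
print(per_oss)                  # 2750 MB/s carried by each OSS
print(round(per_ost))           # ~196 MB/s per OST: more than one 150 MB/s
                                # disk, so each OST must span several disks
```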
Failure cases in production operation
• OSS hung up under heavy I/O load: a user issued many small read() operations from many Java processes; I/O slowed down on the OSS, which then went through repeated failovers and rejected connections from clients
• MDS hung up: a client held unnecessary RPC connections during eviction processing, so the MDS kept holding RPC locks from clients
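The OSS overload above is easy to reproduce in spirit: for a fixed data volume, the number of server round trips grows inversely with request size, so tiny read() calls from many processes become an RPC storm at the OSS. The figures below are illustrative; 1 MiB is a typical Lustre RPC size, not a measured TSUBAME2.0 value, and the sketch assumes the pathological case where client-side readahead is not merging requests:

```python
import math

def rpcs_needed(total_bytes, request_bytes, rpc_bytes=1 << 20):
    """Server round trips if every application request costs at least
    one RPC (i.e. no client-side merging of small reads)."""
    per_request = max(1, math.ceil(request_bytes / rpc_bytes))
    requests = math.ceil(total_bytes / request_bytes)
    return requests * per_request

GiB = 1 << 30
print(rpcs_needed(GiB, 4 * 1024))   # 262144 RPCs to read 1 GiB in 4 KiB calls
print(rpcs_needed(GiB, 1 << 20))    # 1024 RPCs with 1 MiB calls
```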
Related activities
• Lustre FS monitoring
• MapReduce
TSUBAME2.0 Lustre FS monitoring (w/ DDN Japan)
Purposes:
• Real-time visualization for TSUBAME2.0 users
• Analysis for production operation and FS research
Lustre Monitoring in detail

Monitoring target:
• 2 × MDS, 4 × OSS, with 56 OSTs (14 OSTs per OSS): mdt, ost00, ost01, ..., ost55
• MDT statistics: open, close, getattr, setattr, link, unlink, mkdir, rmdir, statfs, rename, getxattr, inode usage, etc.
• OST statistics: write/read bandwidth, OST usage, etc.

Monitoring system:
• Each MDS/OSS runs a Ganglia lustre module under gmond, plus an LMT server agent under cerebro
• The management node runs gmetad and the LMT server, which stores samples in MySQL
• A web server hosts a web application (real-time cluster visualization in a web browser) and the LMT GUI (for research and analysis)
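The counters above are exposed by Lustre as plain-text "stats" files that collectors like the Ganglia lustre module and LMT scrape. A minimal parser for that format; the sample text is made up, and the line layout ("name count samples [unit] min max sum") follows the usual Lustre convention but should be treated as an assumption, not a spec:

```python
def parse_stats(text):
    """Parse Lustre-style stats counters into {name: fields} dicts."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[2] == "samples":
            entry = {"count": int(parts[1]), "unit": parts[3].strip("[]")}
            if len(parts) >= 7:  # min/max/sum are present for byte counters
                entry.update(min=int(parts[4]), max=int(parts[5]),
                             sum=int(parts[6]))
            stats[parts[0]] = entry
    return stats

sample = """\
snapshot_time             1300000000.123456 secs.usecs
read_bytes                42 samples [bytes] 4096 1048576 123456789
write_bytes               17 samples [bytes] 4096 1048576 98765432
open                      1234 samples [reqs]
"""
print(parse_stats(sample)["read_bytes"]["sum"])   # 123456789
```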
MapReduce on the TSUBAME supercomputer
• MapReduce
– A programming model for large-scale data processing
– Hadoop is a common open-source implementation
• Supercomputers
– A candidate execution environment
– Needed by various application users: text processing, machine learning, bioinformatics, etc.

Problems:
• Cooperation with the existing batch-job scheduler system (PBS Pro)
– All jobs, including MapReduce tasks, must run under the scheduler's control
• TSUBAME2.0 offers various storage options for data-intensive computation
– Local SSD storage, parallel FSs (Lustre, GPFS)
• Cooperation with GPU accelerators
– Not supported in Hadoop
Hadoop on TSUBAME (Tsudoop)
• Script-based invocation
– Acquires computing nodes via PBS Pro
– Deploys a Hadoop environment on the fly (incl. HDFS)
– Executes the user's MapReduce jobs
• Various FS support
– HDFS by aggregating local SSDs
– Lustre, GPFS (to appear)
• Customized Hadoop for executing CUDA programs (experimental)
– Hybrid map-task scheduling: automatically detects map-task characteristics by monitoring, and schedules map tasks to minimize overall MapReduce job execution time
– Extension of Hadoop Pipes features
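The hybrid scheduling idea above can be sketched as a small assignment problem: given per-task runtimes observed on CPU and GPU, place each map task on whichever resource would finish it soonest. This greedy makespan heuristic is only an illustration of the approach, not Tsudoop's actual algorithm, and the task timings are invented:

```python
import heapq

def schedule(tasks, cpu_slots, gpu_slots):
    """tasks = [(cpu_time, gpu_time), ...]. Each task goes to the
    resource that would complete it earliest, approximating
    'minimize overall MapReduce job execution time'."""
    cpus = [0.0] * cpu_slots   # next-free time of each CPU slot
    gpus = [0.0] * gpu_slots   # next-free time of each GPU slot
    heapq.heapify(cpus)
    heapq.heapify(gpus)
    assignment = []
    for cpu_t, gpu_t in tasks:
        cpu_done = cpus[0] + cpu_t   # finish time on earliest-free CPU
        gpu_done = gpus[0] + gpu_t   # finish time on earliest-free GPU
        if gpu_done < cpu_done:
            heapq.heapreplace(gpus, gpu_done)
            assignment.append("gpu")
        else:
            heapq.heapreplace(cpus, cpu_done)
            assignment.append("cpu")
    makespan = max(max(cpus), max(gpus))
    return assignment, makespan

# Two compute-bound tasks that run 10x faster on GPU, one task with
# no GPU benefit: the GPU-friendly tasks go to the GPU, the other stays
# on CPU, and both resources finish at t=2.
tasks = [(10.0, 1.0), (10.0, 1.0), (2.0, 2.0)]
print(schedule(tasks, cpu_slots=1, gpu_slots=1))  # (['gpu', 'gpu', 'cpu'], 2.0)
```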
Thank you