FP7: Cluster of Research Infrastructures
for Synergies in Physics
Global Filesystem Option for the SKA UV buffer
Tests and Development of Lustre
Bojan Nikolic
Cambridge HPCS: Paul Calleja, Wojciech Turek
Cambridge IT: Mark Calleja
SKA SDP Consortium: PROT.ISP Team, ARCH Team
Industry: Peter Braam, Calsoft Inc
(credit labels on the slide: "Blame", "Most of the work")
SDP UV Buffer in CRISP Context
[Diagram: SDP pipeline in the CRISP reference architecture. Real-time path: Acceptor, Signal Aggregator / Re-Sequencer, Pre-Processor (Pre-Processing Layer, with Additional RT Modules and RT Monitoring). Online Storage Layer: Storage Processors over Distributed/Unified Storage. Pseudo-real-time Online Computing Layer: processor pools with Additional PRT Modules. Offline Layer: Offline Storage, Offline Computing, Advanced processing, Data reduction, Archive and Data export. Legend distinguishes data-flow directed pathways from data-flow decoupled processes.]
Global UV Buffer option
[Diagram: the same layered pipeline as the previous slide, but with a single Lustre filesystem serving as the Unified Storage across the Online Storage Layer.]
Specifications (maximal case)
Throughput: write, maximum case 7 TByte/s; read, ~5 times higher
Capacity: maximum requirement 300 PByte
Data lifecycle: ~12 hours
Files: maximum 10^8 files (one file per ten frequency channels per snapshot)
IO computational intensity: >250 floating point operations per byte of IO to the UV buffer
Deployment date: start of commissioning 2019; science operations 2021
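These headline numbers hang together: the capacity requirement is roughly the write rate sustained over one data lifecycle, and the IO intensity sets a floor on the sustained compute rate. A quick check (plain arithmetic, not from the slides):

```python
# Back-of-envelope check of the maximal-case specifications.
write_rate = 7e12            # bytes/s: 7 TByte/s maximum write throughput
lifecycle_s = 12 * 3600      # seconds: ~12 hour data lifecycle
flops_per_byte = 250         # >250 floating point operations per byte of IO

capacity = write_rate * lifecycle_s
print(f"implied buffer capacity: {capacity / 1e15:.0f} PByte")    # ~302, matching the 300 PByte requirement

compute = write_rate * flops_per_byte
print(f"implied sustained compute: {compute / 1e15:.2f} PFLOP/s") # ~1.75 PFLOP/s lower bound
```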
Global Filesystem Motivation
Only one of the options being considered for the SKA Science Data Processor UV Buffer
Options/Alternatives
(listed from higher-level to lower-level interfaces)
• Global filesystem + higher-level API (e.g., Intel FF Lustre + HDF5)
• Global filesystem (POSIX/sub-POSIX) (e.g., HDFS)
• Hybrid (key+sequential)/value store
• Traditional key/value store (e.g., Cassandra)
• Local RAID (e.g., Linux kernel software RAID6)
• Local independent filesystems
• Local raw block devices (e.g., /dev/sda)
Global Filesystem is not Required
• But this is always true:
  • Computation is only ever done on words in the CPU
  • Large storage systems are always made of many smaller drives
  • A global filesystem is just a software sub-system
• Why a global filesystem, then?
  • Simple, familiar concept
  • Simple, well-known, very often used API and UI
  • Combines inter-node data communication, inter-node metadata distribution, persistence, and inter-node high-availability and failover mechanisms in one subsystem
  • Off-the-shelf vendor solutions and open-source solutions
Global Filesystem Features revisited
Write to file:
• Make data available to any node at any subsequent time
• Redundantly distribute data on multiple nodes and disks
Read from file:
• Retrieve data from any node from an arbitrary previous time
• Recover on the fly from lost nodes and disks
File open for write:
• Process signals that it is doing a piece of work
File open for read:
• Process signals that no more work on this can be done until close
Directory list:
• What are other nodes doing?
• What have other nodes done in the past?
File permissions:
• Enforce interfaces between nodes
• Enforce interfaces between processes
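To make the mapping concrete, here is a minimal sketch (assumed paths and an advisory-lock convention; not from the slides) of two nodes coordinating purely through these file operations:

```python
import fcntl
import glob

def publish(chunk_id: int, data: bytes) -> None:
    """Writer node: make a chunk available to any node at any later time."""
    path = f"/lustre/uvbuf/chunk-{chunk_id}.dat"   # hypothetical buffer path
    with open(path, "wb") as f:        # open-for-write: work in progress
        fcntl.flock(f, fcntl.LOCK_EX)  # advisory lock enforces the interface
        f.write(data)                  # the filesystem handles placement and redundancy
    # close(): the chunk is now complete and visible cluster-wide

def consume_all() -> dict:
    """Reader node: discover and fetch what other nodes have done."""
    chunks = {}
    for path in glob.glob("/lustre/uvbuf/chunk-*.dat"):  # directory list
        cid = int(path.rsplit("-", 1)[1].split(".")[0])
        with open(path, "rb") as f:    # open-for-read
            fcntl.flock(f, fcntl.LOCK_SH)
            chunks[cid] = f.read()     # retrieve from an arbitrary previous time
    return chunks
```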
Trade-off Issues
• Features / functions
• Proven alternatives
• Efficiency
• Maximum performance
• Scaling issues
• Capital expenditure
• Op-ex / power budget
• Software development cost
• Resilience
• Future-proofing
• Implementation risks
Approaching a Trade-Off Analysis
• What are the inter-node communication features we need?
• What do we know we can do efficiently without a global filesystem?
• What is the performance of current global filesystems in providing the features we need?
• What is the capital/op-ex cost of a global filesystem?
• What are the impacts on software development costs?
Current Lustre work toward the trade-off
• Performance & efficiency : benchmarks, analysis, hardware tests
• Scalability: Tests on very large integrated systems
• Cap-ex: Commodity Lustre
• High-Availability: Efficient Software RAID
• Implementation Risks: Accumulating and recording know-how
Performance Benchmarking
Aims: measure real-world performance, identify bottlenecks, quantify scaling
Lustre Benchmarking Approach
• Integrated benchmarking environment: JUBE
• Filesystem benchmarking suites (IOR, iozone)
• Low-level/raw device tests (dd, xdd)
• OProfile / Linux kernel perf-event based profiling
• Application tests: ASKAPSoft (in preparation)
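A small driver can sweep the IOR runs over APIs and client counts. This is a sketch, not the actual JUBE configuration; the mpirun/ior paths, the output file, and the 4 GiB / 32 KiB sizes are assumptions mirroring the 256-client JuRoPA2 parameters reported below:

```python
import subprocess

def run_ior(api: str, nprocs: int, file_per_proc: bool) -> str:
    cmd = [
        "mpirun", "-np", str(nprocs), "ior",
        "-a", api,       # backend: POSIX or MPIIO
        "-b", "4g",      # per-process block size
        "-t", "32k",     # transfer size per IO call
        "-w", "-r",      # write phase, then read phase
        "-o", "/lustre/test/ior.dat",
    ]
    if file_per_proc:
        cmd.append("-F")  # file-per-process ("separated") mode
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

for api in ("POSIX", "MPIIO"):
    print(run_ior(api, nprocs=256, file_per_proc=True))
```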
Cambridge HPCS Lustre Test Brick (Gen 1)
[Chart: 18-stripe interleaved (single file) performance; total throughput (MB/s, axis to 12,000) for read and write versus number of client nodes (1, 2, 4, 8, 16, 32; one client per node).]
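The 18-way interleaving is a property of the file's Lustre layout, set before the benchmark runs; a minimal sketch of the relevant lfs call (the directory path and the 1 MiB stripe size are assumptions):

```python
import subprocess

# Stripe a test directory across 18 OSTs so a single shared file is
# interleaved over 18 object storage targets, as in the chart above.
subprocess.run(
    ["lfs", "setstripe",
     "-c", "18",   # stripe count: number of object storage targets spanned
     "-S", "1m",   # stripe size: bytes per OST before cycling to the next
     "/lustre/test/striped18"],
    check=True,
)
```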
Cambridge HPCS Lustre Test Brick Gen 1
[Photo: the Gen 1 test brick hardware, with the metadata and object storage servers labelled.]
Lustre tests on JuRoPA2 (file-per-process)
[Two charts: total throughput (MB/s) versus number of clients (256, 512, 1024), file-per-process mode; left panel, POSIX API read and write (axis to 20,000 MB/s); right panel, MPI-IO API read and write (axis to 16,000 MB/s).]
Lustre tests on JuRoPA2 (separate files)

JobId      Tasks  Mode       API    MaxWrite (MB/s)  MaxRead (MB/s)  Block size  Transfer size
n256p1t1   256    separated  MPIIO  11678.41         15237.87        4 GiB       32 KiB
n256p1t1   256    separated  POSIX  10535.22         16856.16        4 GiB       32 KiB
n256p2t1   512    separated  MPIIO  9871.79          12365.02        2 GiB       32 KiB
n256p2t1   512    separated  POSIX  10538.96         16314.26        2 GiB       32 KiB
n256p4t1   1024   separated  MPIIO  6141.25          9124.83         1 GiB       32 KiB
n256p5t1   1024   separated  POSIX  1982.93          17590.65        1 GiB       32 KiB
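For reference, "separated" mode means every task writes its own file; through the MPI-IO API that looks roughly like the following mpi4py sketch (paths and sizes are assumptions, not the benchmark code; the real runs streamed GiB-scale blocks in 32 KiB transfers):

```python
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
transfer = bytearray(32 * 1024)       # 32 KiB transfer size, as in the table

fh = MPI.File.Open(MPI.COMM_SELF,     # COMM_SELF: an independent file per task
                   f"/lustre/test/ior-{rank}.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
for i in range(1024):                 # 1024 transfers = 32 MiB per task
    fh.Write_at(i * len(transfer), transfer)   # explicit-offset write
fh.Close()
```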
Lustre Kernel Profiling (OProfile CPU utilization, unpatched vs patched kernel)

xdd, QueueDepth=32 (avg.):
  Unpatched: 481 s, 8.9 MB/s (1.97% memcpy, 0.07% async_copy_data, 5.75% raid456 module)
  Patched: 419 s, 10.2 MB/s (1.5% memcpy, 0% async_copy_data, 6.8% raid456 module)

xdd, QueueDepth=1 (avg.):
  Unpatched: 167 s, 25.6 MB/s (3.8% memcpy, 0.31% async_copy_data, 12.15% raid456 module)
  Patched: 169 s, 25.2 MB/s (2.65% memcpy, 0% async_copy_data, 11.7% raid456 module)

dd (avg.):
  Unpatched: 173 s, 24.7 MB/s (4.3% memcpy, 0.37% async_copy_data, 13.7% raid456 module)
  Patched: 173 s, 24.7 MB/s (2.85% memcpy, 0% async_copy_data, 13.4% raid456 module)
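The per-symbol percentages above come from a sampling profiler running alongside the IO tool; a comparable measurement can be scripted with perf (a sketch, not the original OProfile invocation; the target path and sizes are assumptions):

```python
import subprocess

# Sample kernel CPU usage system-wide while a dd write runs, then report
# which symbols (e.g. memcpy, raid456 code) consumed the cycles.
subprocess.run(
    ["perf", "record", "-a", "-o", "perf.data", "--",
     "dd", "if=/dev/zero", "of=/lustre/test/ddtest", "bs=1M", "count=4096"],
    check=True,
)
report = subprocess.run(
    ["perf", "report", "-i", "perf.data", "--stdio", "--sort", "symbol"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)   # look for memcpy, async_copy_data and raid456 entries
```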
Lustre Development Hardware Platform at HPCS
Test Brick Gen 2
Object Storage Servers: Dell R720
40 units theoretically give 1 Tbyte/s
Object Storage Block
• 16 x C8000XD sleds
• Each sled: 12 x 4 TB NL-SAS 7.2k HDDs
• Total capacity: 768 TB
• Total throughput: 20 GB/s
• Theoretically 50 units to reach 1 TByte/s
• Theoretically ~130 units to reach 100 PByte of capacity
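The unit counts follow from the per-block capacity and throughput; a quick check (plain arithmetic, not from the slides):

```python
# Verify the object-storage-block scaling claims.
sleds, drives_per_sled, tb_per_drive = 16, 12, 4
capacity_tb = sleds * drives_per_sled * tb_per_drive
print(capacity_tb)                 # 768 TB per block, as stated

throughput_gb_s = 20               # GB/s per block
print(1000 / throughput_gb_s)      # 50.0 blocks for 1 TByte/s
print(100_000 / capacity_tb)       # ~130 blocks for 100 PByte
```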
Lustre Development at HPCS
High-Performance Lustre on Commodity Hardware
• Know-how in setting up and tuning Lustre systems for maximum performance
• Linux kernel development (in collaboration with Calsoft Inc and Peter Braam)
• High-performance (zero-copy) software RAID6 for Lustre
• Know-how on maintenance and high uptime for large Lustre implementations
Future SDP Plans
• Demonstration of very high-performance Lustre (>40 GB/s) on commodity hardware
• Creation of IOR scenarios that precisely mimic SDP UV buffer access patterns; tests on large computer clusters in the UK, Germany, Australia and South Africa
• Global filesystem -> accelerator IO tests (Cambridge has a 256-GPU K20x cluster)
• Report on scalability limitations of global filesystems
• Investigation of object/key and hybrid global storage systems, and comparison to filesystems
• Impact of data formatting (e.g., HDF5) on global filesystem performance