Global Filesystem Option for the SKA UV Buffer


FP7: Cluster of Research Infrastructures for Synergies in Physics (CRISP)

Global Filesystem Option for the SKA UV buffer

Tests and Development of Lustre

Bojan Nikolic

Cambridge HPCS: Paul Calleja, Wojciech Turek

Cambridge IT: Mark Calleja

SKA SDP Consortium: PROT.ISP Team, ARCH Team

Industry: Peter Braam, Calsoft Inc

Blame

Most of the work


SDP UV Buffer in CRISP Context

[Diagram: generic CRISP data-flow architecture. A real-time pre-processing layer (Acceptor; Signal Aggregator / Re-Sequencer; Pre-Processor) feeds an online storage layer of storage processors backed by distributed and unified storage; pseudo-real-time online computing draws on that storage; the offline layer provides offline storage, archive, offline computing and data export. Side modules: RT monitoring, advanced processing, data reduction, plus additional RT and PRT modules. The legend distinguishes directed data-flow pathways from decoupled processes. Layer labels: Pre-Processing Layer, Online Storage Layer, Online Computing Layer, Offline Layer.]


Global UV Buffer option

[Diagram: the same architecture, but with the storage processors and distributed/unified storage of the online storage layer replaced by a single Lustre unified storage system, shared by the pseudo-real-time online computing, offline storage, archive and data export stages.]


Specifications (maximal case)

Throughput: write, maximum case 7 Tbyte/s; read, maximum ~5 times higher

Capacity: maximum requirement 300 Pbyte

Data lifecycle: ~12 hours

Files: maximum 10^8 files (one file per ten frequency channels per snapshot)

IO computational intensity: >250 floating-point operations per byte of IO to the UV buffer

Deployment date: start of commissioning 2019; science operations 2021
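These figures hang together in a simple way: the maximal write rate sustained for one data lifecycle roughly fills the maximal capacity, and the capacity and file count set the typical file size. The short sketch below (not from the slides; decimal prefixes assumed) just re-derives those relationships from the numbers above.

```python
# Back-of-envelope check of the maximal-case UV buffer figures above.
# Decimal prefixes assumed: 1 Tbyte = 1e12 bytes, 1 Pbyte = 1e15 bytes.

write_rate = 7e12            # bytes/s, maximal write rate (7 Tbyte/s)
capacity = 300e15            # bytes, maximal capacity (300 Pbyte)
lifecycle = 12 * 3600        # seconds, ~12 hour data lifecycle
n_files = 1e8                # up to 10^8 files
flops_per_byte = 250         # >250 floating-point operations per byte of IO

# Data deposited in one lifecycle at the maximal write rate:
print(f"written in 12 h: {write_rate * lifecycle / 1e15:.0f} Pbyte "
      f"(capacity spec: {capacity / 1e15:.0f} Pbyte)")

# Typical file size if the full buffer is split into 10^8 files:
print(f"mean file size: {capacity / n_files / 1e9:.0f} Gbyte")

# Compute rate implied by the IO intensity at the maximal write rate alone:
print(f"implied compute: {flops_per_byte * write_rate / 1e15:.2f} Pflop/s")
```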


Global Filesystem Motivation

Only one of the options being considered for the SKA Science Data Processor UV Buffer


Options/Alternatives

(options range from higher-level to lower-level interfaces)

Global Filesystem + Higher-Level API (e.g. Intel FF Lustre + HDF5)

Global Filesystem (POSIX / sub-POSIX)

Hybrid (key+sequential)/value store

Traditional key/value store (e.g. Cassandra)

Local RAID (e.g. Linux kernel software RAID6)

Local independent filesystems

Local raw block devices (e.g. /dev/sda)

HDFS


Global Filesystem is not Required

• But this is always true:
  • Computation is only ever done on words in the CPU
  • Large storage systems are always made of many smaller drives
  • A global filesystem is just a software sub-system

• Why a global filesystem?
  • Simple, familiar concept
  • Simple, well-known, very often used API and UI
  • Combines inter-node data communication, inter-node metadata distribution, persistence, and inter-node high-availability and failover mechanisms in one subsystem
  • Off-the-shelf vendor solutions and open-source solutions


Global Filesystem Features revisited

Write to File
  • Make data available to any node at any subsequent time
  • Redundantly distribute data on multiple nodes and disks

Read from File
  • Retrieve data from any node from an arbitrary previous time
  • Recover on-the-fly from lost nodes and disks

File Open for Write
  • Process signals that it is doing a piece of work

File Open for Read
  • Process signals that no more work on this can be done until close

Directory List
  • What are other nodes doing?
  • What have other nodes done in the past?

File Permissions
  • Enforce interfaces between nodes
  • Enforce interfaces between processes
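To make the "operations map onto coordination features" point concrete, here is a minimal sketch (not from the slides; the mount point, file naming and helper functions are hypothetical) of a writer and reader coordinating through nothing but ordinary POSIX file operations on a shared filesystem: write/close publishes data to every node, a directory listing reveals what other nodes have done, and open-for-read retrieves data written at an arbitrary earlier time.

```python
import os
from pathlib import Path

# Hypothetical location of the UV buffer on a global (shared) filesystem.
BUFFER = Path("/lustre/uvbuffer")

def write_snapshot(snapshot_id: int, channel_group: int, data: bytes) -> None:
    """Writer node: persist one snapshot / channel-group chunk.

    Writing to a temporary name and renaming makes 'file exists' equivalent to
    'data is complete', so readers never observe partially written files.
    """
    final = BUFFER / f"snap{snapshot_id:06d}_ch{channel_group:04d}.dat"
    tmp = final.with_suffix(".tmp")
    with open(tmp, "wb") as f:
        f.write(data)            # "write to file": visible to any node afterwards
        f.flush()
        os.fsync(f.fileno())     # force persistence before publishing
    os.rename(tmp, final)        # atomic publish on POSIX filesystems

def pending_snapshots():
    """Reader node: a directory listing shows what other nodes have produced."""
    return sorted(BUFFER.glob("snap*_ch*.dat"))

def read_snapshot(path: Path) -> bytes:
    """Reader node: retrieve data written earlier, from any node."""
    with open(path, "rb") as f:  # "file open for read"
        return f.read()
```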


Trade-off Issues

• Features / functions
• Proven alternatives
• Efficiency
• Maximum performance
• Scaling issues
• Capital expenditure
• Op-ex / power budget
• Software development cost
• Resilience
• Future-proof?
• Implementation risks


Approaching a Trade-Off Analysis

• What are the inter-node communication features we need?

• What do we know we can do efficiently without a global filesystem?

• What is the performance of current global filesystems in providing the features we need?

• What is the capital/op-ex cost of a global filesystem?

• What are the impacts on software development costs?


Current Lustre work toward the trade-off

• Performance & efficiency : benchmarks, analysis, hardware tests

• Scalability: Tests on very large integrated systems

• Cap-ex: Commodity Lustre

• High-Availability: Efficient Software RAID

• Implementation Risks: Accumulating and recording know-how


Performance Benchmarking

Aims: measure real-world performance, identify bottlenecks, quantify scaling


Lustre Benchmarking Approach

Integrated benchmarking environment: JUBE

Filesystem benchmarking suites (IOR, IOzone)

Low-level / raw device tests (dd, xdd)

OProfile / Linux kernel perf-event based profiling

Application tests: ASKAPSoft (in preparation)
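As an illustration of how the IOR runs reported later might be driven (file-per-process mode, 4 GiB block per task, 32 KiB transfers, POSIX and MPI-IO APIs), here is a hedged sketch; the mpirun wrapper, output path and exact flag spellings depend on the installed IOR and MPI versions, so treat it as a template rather than the recipe actually used.

```python
import subprocess

# Illustrative IOR invocations: separate write (-w) and read (-r) phases,
# one file per process (-F), 4 GiB block per task, 32 KiB transfer size,
# three repetitions, output under a hypothetical Lustre mount point.
common = ["-b", "4g", "-t", "32k", "-F", "-w", "-r", "-i", "3",
          "-o", "/lustre/benchmarks/ior_testfile"]

for api in ("POSIX", "MPIIO"):
    cmd = ["mpirun", "-np", "256", "ior", "-a", api] + common
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```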


Cambridge HPCS Lustre Test Brick (Gen 1)

[Chart: 18-stripe interleaved (single shared file) performance. Total throughput (MB/s, scale 0 to 12,000) against number of client nodes (1, 2, 4, 8, 16, 32; one client per node), for read and write.]


Cambridge HPCS Lustre Test Brick Gen 1

[Figure: Gen 1 test brick hardware, with the metadata and object storage components labelled.]


Lustre tests on JuRoPA2 (file-per-process)

[Charts: total throughput (MB/s) against number of clients (256, 512, 1024), file-per-process. Left panel: POSIX API, read and write (scale 0 to 20,000 MB/s). Right panel: MPI-IO API, read and write (scale 0 to 16,000 MB/s).]


Lustre tests on JuRoPA2 (separate files)

JobId      Tasks  Mode       API    MaxWrite (MB/s)  MaxRead (MB/s)  Block size  Transfer size
n256p1t1    256   separated  MPIIO        11678.41        15237.87       4 GiB         32 KiB
n256p1t1    256   separated  POSIX        10535.22        16856.16       4 GiB         32 KiB
n256p2t1    512   separated  MPIIO         9871.79        12365.02       2 GiB         32 KiB
n256p2t1    512   separated  POSIX        10538.96        16314.26       2 GiB         32 KiB
n256p4t1   1024   separated  MPIIO         6141.25         9124.83       1 GiB         32 KiB
n256p5t1   1024   separated  POSIX         1982.93        17590.65       1 GiB         32 KiB


Lustre Kernel Profiling (unpatched vs. patched kernel; averaged figures, OProfile CPU-utilization details)

xdd, QueueDepth=32
  Unpatched: 481 s, 8.9 MB/s;  CPU: 1.97% memcpy, 0.07% async_copy_data, 5.75% raid456 module
  Patched:   419 s, 10.2 MB/s; CPU: 1.5% memcpy, 0% async_copy_data, 6.8% raid456 module

xdd, QueueDepth=1
  Unpatched: 167 s, 25.6 MB/s; CPU: 3.8% memcpy, 0.31% async_copy_data, 12.15% raid456 module
  Patched:   169 s, 25.2 MB/s; CPU: 2.65% memcpy, 0% async_copy_data, 11.7% raid456 module

dd
  Unpatched: 173 s, 24.7 MB/s; CPU: 4.3% memcpy, 0.37% async_copy_data, 13.7% raid456 module
  Patched:   173 s, 24.7 MB/s; CPU: 2.85% memcpy, 0% async_copy_data, 13.4% raid456 module
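For reference, profiles like the ones tabulated above can be gathered with OProfile or, on newer kernels, with perf; the sketch below shows the perf variant (device path, transfer sizes and the choice of dd as the load generator are illustrative, not necessarily what produced the table).

```python
import subprocess

# Sample CPU usage system-wide while dd streams writes through the md RAID6
# device, then summarise time per kernel symbol (memcpy, async_copy_data,
# the raid456 module, ...). All paths and sizes below are illustrative.
subprocess.run([
    "perf", "record", "-a", "-g", "-o", "raid6_write.perf",
    "--",
    "dd", "if=/dev/zero", "of=/dev/md0", "bs=1M", "count=4096", "oflag=direct",
], check=True)

# Per-symbol breakdown, comparable to the OProfile columns in the table above.
subprocess.run(["perf", "report", "-i", "raid6_write.perf",
                "--sort", "symbol", "--stdio"], check=True)
```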


Lustre Development Hardware Platform at HPCS

Test Brick Gen 2


Object Storage Servers: Dell R720

40 units theoretically give 1 Tbyte/s


Object Storage Block

• 16 x C8000XD sleds

• Each sled: 12 × 4 TB NL-SAS 7.2k HDDs

• Total capacity: 768 TB

• Total throughput: 20 GB/s

• Theoretically 50 units to reach 1 Tbyte/s

• Theoretically ~130 units to reach 100 Pbyte of capacity
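A quick arithmetic check of these per-block figures (a sketch using the slide's numbers, with decimal Tbyte/Pbyte as on the specification slide):

```python
# Per-block figures from the slide.
sleds_per_block = 16
disks_per_sled = 12
disk_tb = 4                   # 4 TB NL-SAS drives
block_gb_s = 20               # 20 GB/s throughput per object storage block

capacity_tb = sleds_per_block * disks_per_sled * disk_tb
print(f"capacity per block: {capacity_tb} TB")               # 768 TB

print(f"blocks for 1 Tbyte/s: {1000 / block_gb_s:.0f}")      # ~50
print(f"blocks for 100 Pbyte: {100_000 / capacity_tb:.0f}")  # ~130
```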


Lustre Development at HPCS

High Performance Lustre On Commodity Hardware

• Know-how in setting up and tuning Lustre systems for maximum performance

• Linux Kernel development (in collaboration with Calsoft Inc and Peter Braam)

• High-Performance (zero copy) Software RAID6 for Lustre

• Know-how on maintenance and high-uptime for large Lustre implementations
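The zero-copy software RAID6 work targets the standard Linux md/raid456 layer that backs each object storage target. Purely as an illustration (device names, disk count and chunk size are assumptions, not the test brick's actual configuration), assembling such an array looks like this:

```python
import subprocess

# Create a 12-disk RAID6 md array of the kind used as backing storage for a
# Lustre OST; writes to it exercise the raid456 kernel module profiled earlier.
# Every device name and parameter here is illustrative only.
disks = [f"/dev/sd{letter}" for letter in "bcdefghijklm"]   # 12 member disks

subprocess.run(["mdadm", "--create", "/dev/md0",
                "--level=6",
                f"--raid-devices={len(disks)}",
                "--chunk=128",               # 128 KiB chunk size (assumed)
                *disks], check=True)
# The resulting /dev/md0 would then be formatted and mounted as a Lustre OST
# with the usual Lustre tools (mkfs.lustre / mount).
```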


Future SDP Plans

• Demonstration of very high performance Lustre (> 40 GB/s) on commodity hardware

• Creation of IOR scenarios that precisely mimic SDP UV buffer access patterns; tests on large computer clusters in the UK, Germany, Australia and South Africa

• Global Filesystem -> Accelerator IO tests (Cambridge has a 256-GPU K20x cluster)

• Report on scalability limitations of global filesystems

• Investigation of object/key and hybrid global storage systems, with comparison to filesystems

• Impact of data formatting (e.g., HDF5) on global filesystem performance