TRANSCRIPT

Future farm technologies & architectures
John Baines
Introduction
• What will the HLT farm look like in 2020?
• When & how do we narrow the options?
  – Choice affects software design as well as farm infrastructure
• How do we evaluate costs/benefits?
• When & how do we make the final choices for farm purchases?
• How do we design software now to ensure we can fully exploit the capability of future farm hardware?
• What do we need in the way of demonstrators for specific technologies?
• We can't evaluate all options – what should be the priorities?
Timescales: Framework, Steering & New Technologies

[Timeline chart (draft version for discussion), 2014 Q3–Q4 through LS1 to Run 3, with three tracks:
• Framework: requirements capture complete → design of framework & HLT components complete → implement core functionality (incl. HLT components & new-tech support) → extend to full functionality → final software complete.
• New Tech.: evaluate → narrow h/w choices (e.g. use or not GPU) → fix PC architecture → implement infrastructure → exploit new tech. in algorithms.
• Algs & Menus: speed up code, thread-safety, investigate possibilities for internal parallelisation → implement algorithms in new framework (prototype with 1 or 2 chains → simple menu → full menu complete) → HLT software commissioning complete → commissioning run.]
Technologies
• CPU: increased core counts
  – currently 12 cores (24 threads), e.g. Xeon E5-2600 v2 series, ~0.5 TFLOPS
  – 18 cores (36 threads) coming soon (Xeon E5-2600 v3 series)
  – possible trend to many cores with lower memory => cannot continue to run one job per core
• GPU: much bigger core count, e.g. Nvidia K40: 15 SMX, 2880 cores, 12 GB memory, 4.3 (1.4) TFLOPS SP (DP)
• Coprocessor: e.g. Intel Xeon Phi, up to 61 cores, 244 threads, 1.2 TFLOPS
GPU: Towards a cost-benefit analysis
Will need to assess:
• Effort needed to port code to GPU, maintain it (bug fixing, new hardware…) and support MC simulation on the Grid
• Speed-up for individual components & the full chain
• What can be outsourced to the GPU and what done on the CPU
• Integration with Athena (APE)
• Balance of CPU cores to GPU, i.e. sharing of the GPU resource between several jobs
• Farm integration issues: packaging, power consumption…
• Financial cost: hardware, installation, commissioning, maintenance…

As an exercise, see what we can learn from studies to date, i.e. the cost-benefit if we were to purchase today.
Demonstrators
• ID (RAL, Edinburgh, Oxford):
  – Complete L2 ID chain ported to CUDA for NVIDIA GPU
  – ID data preparation (bytestream conversion, clustering, space-point formation) additionally ported to OpenCL
• Muon (CERN, Rome):
  – Muon calorimeter isolation implemented in CUDA
• Jet (Lisbon):
  – Just starting

See Twiki: TriggerSoftwareUpgrade

Effort needed to port code? Porting L2 ID tracking to CUDA took ~2 years @ 0.5 FTE => 1 staff-year (for a very experienced expert!)
GPUs

Example of the complete L2 ID chain implemented on GPU (Dmitry Emeliyanov).
Times per Tau RoI 0.6x0.6, tt events at 2x10^34:

Step             | C++ on 2.4 GHz CPU (ms) | CUDA on Tesla C2050 (ms) | Speed-up CPU/GPU
Data prep.       | 27                      | 3                        | 9
Seeding          | 8.3                     | 1.6                      | 5
Seed ext.        | 156                     | 7.8                      | 20
Triplet merging  | 7.4                     | 3.4                      | 2
Clone removal    | 70                      | 6.2                      | 11
CPU–GPU transfer | n/a                     | 0.1                      | n/a
Total            | 268                     | 22                       | 12

Max. speed-up: x26. Overall speed-up t(CPU)/t(GPU): 12.
[Plot: Data Prep. and L2 Tracking times; annotations x2.4 and x5]
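The overall factor of 12 can be reproduced from the per-step times rather than the per-step speed-up factors (a minimal sketch in Python; the times are those in the table above):

```python
# Per-step execution times (ms) for the L2 ID chain, from the table above.
# CPU: C++ on a 2.4 GHz core; GPU: CUDA on a Tesla C2050.
cpu_ms = {"data_prep": 27, "seeding": 8.3, "seed_ext": 156,
          "triplet_merging": 7.4, "clone_removal": 70}
gpu_ms = {"data_prep": 3, "seeding": 1.6, "seed_ext": 7.8,
          "triplet_merging": 3.4, "clone_removal": 6.2, "transfer": 0.1}

cpu_total = sum(cpu_ms.values())   # ~268 ms
gpu_total = sum(gpu_ms.values())   # ~22 ms, including the CPU<->GPU transfer

print(f"CPU total: {cpu_total:.0f} ms, GPU total: {gpu_total:.1f} ms")
print(f"Overall speed-up t(CPU)/t(GPU): {cpu_total / gpu_total:.0f}")
```

Note that the overall factor (12) is much smaller than the best single-step factor because the less parallelisable steps dominate the GPU-side total.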
Sharing of GPU resource
• With balanced load on the CPU/GPU, several CPU cores can share a GPU, e.g. a test of L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU.
[Plot: blue: tracking running on CPU; red: most tracking steps on GPU, final ambiguity solving on CPU; annotation x2.4]
Packaging
Examples:
• 1U: 2x E5-2600 or E5-2600 v2 + 3 GPU
• 2U: 2x E5-2600 or E5-2600 v2 + 4 GPU

CPU: Intel E5-2697 v2: 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20: 2496 cores, 13 SMX, 192 cores per SMX, 3.5 (1.1) TFLOPS SP (DP), ~2.4k CHF

Total for 1027 or 2027 with 2 K20 GPUs: ~15k CHF => 12 CPU cores/GPU
Total for 2027 with 4 K20 GPUs: ~20k CHF => 6 CPU cores/GPU
Power & Cooling
SDX racks:
• Max. power: 12 kW
• Usable space: 47 U
• Current power ~300 W per motherboard => max. 40 motherboards per rack
• Compare a 2U unit: a) 4 motherboards, 8 CPUs: 1.1 kW; b) 1 motherboard, 2 CPUs with 2 GPUs (750 W) or 4 GPUs (1.2 kW)

Based on max. power: K20 GPU: 225 W c.f. E5-2697 v2 CPU: 130 W (need to measure typical power).

Illustrative farm configuration, 50 racks total:

Configuration                               | Farm nodes | CPUs  | CPU cores (max threads) | GPUs (SMX)     | Required throughput per node (per CPU core)
40 nodes per rack, ~300 W/node              | 2,000      | 4,000 | 48,000 (96,000)         | 0              | 50 Hz (2.1 Hz)
10 nodes per rack, 4 GPU/node, ~1200 W/node | 500        | 1,000 | 12,000 (24,000)         | 2,000 (26,000) | 200 Hz (8.3 Hz)
16 nodes per rack, 2 GPU/node, ~750 W/node  | 800        | 1,600 | 19,200 (38,400)         | 1,600 (20,800) | 125 Hz (5.2 Hz)

(x4 and x2.5 fewer CPU cores, respectively, than the CPU-only configuration)
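The node counts in the table follow directly from the 12 kW cooling budget; a minimal sketch using the per-node power figures from this slide:

```python
# Nodes per rack limited by the 12 kW rack cooling budget
# (power figures as quoted on the slide).
rack_kw, racks = 12.0, 50
configs = {
    "CPU-only (~300 W/node)":       0.300,
    "2 CPU + 4 GPU (~1200 W/node)": 1.200,
    "2 CPU + 2 GPU (~750 W/node)":  0.750,
}
cores_per_node = 24  # 2x 12-core CPUs

for name, node_kw in configs.items():
    nodes = int(rack_kw // node_kw) * racks
    print(f"{name}: {nodes} nodes, {nodes * cores_per_node} CPU cores")
```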
Summary
• The current limiting factor is cooling: 12 kW/rack => adding GPUs means removing CPUs
• A factor 2.5–4 fewer CPUs requires a corresponding increase in CPU throughput
• Financial cost per motherboard (2U box with 8 CPUs versus 2 CPUs + 4 GPUs): CPU+GPU is a factor ~2 more expensive
• => win with the CPU+GPU solution when the throughput per CPU is increased by more than a factor 8 => 90% of the work (by CPU time) transferred to GPU
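The factor 8 and the ~90% figure are consistent under a simple model (an assumption, not spelled out on the slide): offloaded work runs in negligible time on the GPU, so the throughput gain per CPU is 1/(1-f) for an offloaded CPU-time fraction f:

```python
# Break-even sketch for the summary numbers: a CPU+GPU node costs ~2x a
# CPU-only node and the rack holds 4x fewer CPUs, so each remaining CPU
# must deliver ~8x the throughput. Assuming offloaded work takes
# negligible time on the GPU, gain per CPU = 1 / (1 - f).
cost_factor = 2.0      # CPU+GPU node ~2x more expensive
cpu_reduction = 4.0    # 4x fewer CPUs per rack

required_gain = cost_factor * cpu_reduction   # 8x
f = 1.0 - 1.0 / required_gain                 # 0.875, i.e. ~90%
print(f"Need {required_gain:.0f}x throughput per CPU "
      f"=> offload {100 * f:.1f}% of CPU time (~90%)")
```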
Discussion
• Benefits:
  – If we can manage to port the bulk of the time-consuming code to GPU, the benefit is potentially much better scaling with pile-up, i.e.
    • No combinatorial code left on the CPU => execution times will scale slowly with pile-up
    • Code on the GPU is parallel and will scale slowly with pile-up
• Costs:
  – Significant effort needed to port code
  – Need to support different GPU generations with rolling replacements
  – Potential divergence from offline
  – Need to support a CPU version of the code for simulation
  – Possibly more expensive than a CPU-only farm
=> A CPU+GPU solution is attractive IF a CPU-based farm cannot provide enough processing power.
=> However, it currently looks like a CPU-only farm is the least-cost solution.
=> Discuss!
CPU
• Coming: e.g. Xeon E5-2699 v3, 18 cores and 36 threads; 3,960 EUR / $5,392
GPU
Prices (US $):

Qty | K40    | K20X   | K20    | M2090 | C2050
1   | 4,435  | 3,200  | 2,695  | 1,825 | 1,100
2   | 8,870  | 9,600  | 8,085  | 5,475 | 2,200
4   | 17,740 | 12,800 | 10,780 | 7,300 | 4,400
Increase in throughput per CPU when a GPU is added

[Plot: throughput gain vs fraction of work moved to GPU (fraction defined in terms of execution time on CPU), for several speed-up factors t(CPU)/t(GPU); the CPU code is serial and waits for GPU completion.]

If the CPU count is reduced by a factor 4, we need a factor 4 increase in throughput to break even, i.e. 75% of the work moved to GPU.
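The curve behind this plot can be sketched with the stated model, where the CPU blocks while the GPU works; `throughput_gain` is an illustrative helper, not code from the project:

```python
def throughput_gain(f, s):
    """Throughput gain per CPU core when a fraction f of the CPU
    execution time is offloaded to a GPU with speed-up s, and the CPU
    waits for GPU completion: t -> (1 - f) * t + f * t / s."""
    return 1.0 / ((1.0 - f) + f / s)

# Break-even for a 4x CPU reduction needs a 4x gain: in the limit of a
# very large GPU speed-up this requires offloading 75% of the CPU time.
print(f"f=0.75, s->inf: gain ~{throughput_gain(0.75, 1e9):.2f}")

# With finite GPU speed-ups the gain at f=0.75 falls short of 4x:
for s in (5, 10, 20):
    print(f"s={s}: gain at f=0.75 is {throughput_gain(0.75, s):.2f}")
```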
Speed-up factors
HLT time budget: 60% tracking, 20% Calo, 10% Muon, 10% other
• Cost of GPU - cost of CPU
• Cost of effort for the online version
• Cost for simulation
…
Example: CPU #1: 12 CPU cores, 12/24 CPU threads; GPU #1: 15 SMX, 2880 cores; GPU #2: 15 SMX, 2880 cores.
• Tracking on CPU: 120 ms tracking + 240 ms other = 360 ms per event
• Tracking on GPU: 10 ms CPU-side + 240 ms other = 250 ms per event (69%)
• CPU time: x0.69 => throughput x1.44
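One reading of these numbers (an interpretation of the slide, not stated explicitly): per event, tracking takes 120 ms on the CPU but only 10 ms of CPU time once offloaded, while 240 ms of other work stays on the CPU:

```python
# Per-event CPU budget before and after offloading tracking to the GPU
# (120/240/10 ms figures as quoted on the slide).
tracking_cpu_ms, other_cpu_ms = 120.0, 240.0
tracking_residual_ms = 10.0   # CPU-side time once tracking runs on the GPU

t_before = tracking_cpu_ms + other_cpu_ms      # 360 ms
t_after = tracking_residual_ms + other_cpu_ms  # 250 ms

print(f"CPU time per event: x{t_after / t_before:.2f} (69%)")
print(f"Throughput per core: x{t_before / t_after:.2f}")
```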
6 jobs per GPU
Data Preparation Code
Power & Cooling
SDX racks:
• Upper level: 27? XPU racks
  – each 47 U usable; 9.5 kW
  – 1U nodes, 31, 32 or 40 per rack (=> max 300 W per 1U)
  – current power consumption 6–9 kW per rack
• Lower level: partially equipped with 10 racks (+6 pre-series racks)
  – each 47 U (could be 52 U with additional reinforcing); 15 kW
  – 2U 4-blade servers, 1100 W, 8 or 10 per rack (9–11 kW)

GPU: C2050 <238 W, K20 <225 W, K40 <235 W, c.f. CPU: 130 W (12 cores, 2.7 GHz) => GPU max. power consumption 80% higher than CPU.
=> Adding 1 GPU roughly doubles the power consumption of a node.

50 racks:

Configuration                               | Nodes (motherboards) | CPUs  | CPU cores (max threads) | GPUs (SMX)     | Throughput per node (per CPU core)
40 nodes per rack                           | 2,000                | 4,000 | 48,000 (96,000)         | 0              | 50 Hz (2.08 Hz)
10 nodes per rack, 4 GPU/node, ~1200 W/node | 500                  | 1,000 | 12,000 (24,000)         | 2,000 (30,000) | 200 Hz (8.33 Hz)
15 nodes per rack, 2 GPU/node, ~800 W/node  | 750                  | 1,500 | 18,000 (36,000)         | 1,500 (22,500) | 133 Hz (5.55 Hz)
Packaging
• 1U: 2x E5-2600 or E5-2600 v2 + 3 GPU, e.g. 2x12 = 24 CPU cores, 3 GPUs => 8 CPU cores/GPU (+ GPUs: 3 x $4,435)
• 2U: 2x E5-2600 or E5-2600 v2 + 4 GPU, e.g. 2x12 = 24 CPU cores, 4 GPUs => 6 CPU cores/GPU

GPU: e.g. K40: 2880 cores, 15 SMX, 192 cores per SMX, 4.3 (1.4) TFLOPS SP (DP): $4,400