TRANSCRIPT

Future farm technologies & architectures
John Baines
Introduction
• What will the HLT farm look like in 2020?
• When & how do we narrow the options?
  – Choice affects software design as well as farm infrastructure
• How do we evaluate costs/benefits?
• When & how do we make the final choices for farm purchases?
• How do we design software now to ensure we can fully exploit the capability of future farm hardware?
• What do we need in the way of demonstrators for specific technologies?
• We can't evaluate all options – what should be the priorities?
Timescales: Framework, Steering & New Technologies

[Timeline chart (draft version for discussion), 2014 Q3–Q4 through LS1 to Run 3, with three tracks:
• Framework: requirements capture complete → design of framework & HLT components complete → implement core functionality (incl. HLT components & new-tech support) → extend to full functionality → final software complete.
• New Tech.: evaluate → narrow h/w choices (e.g. use or not GPU) → fix PC architecture → implement infrastructure → exploit new tech. in algorithms.
• Algs & Menus: speed up code, thread-safety, investigate possibilities for internal parallelisation → implement algorithms in new framework (prototype with 1 or 2 chains → simple menu → full menu complete) → HLT software commissioning complete → commissioning run.]
Technologies
• CPU: increased core counts
  – currently 12 cores (24 threads), e.g. Xeon E5-2600 v2 series, ~0.5 TFLOPS
  – 18 cores (36 threads) coming soon (Xeon E5-2600 v3 series)
  – possible trend to many cores with lower memory => cannot continue to run one job per core
• GPU: much bigger core count, e.g. Nvidia K40: 15 SMX, 2880 cores, 12 GB memory, 4.3 (1.4) TFLOPS SP (DP)
• Coprocessor: e.g. Intel Xeon Phi, up to 61 cores, 244 threads, 1.2 TFLOPS
GPU: Towards a cost-benefit analysis
Will need to assess:
• Effort needed to port code to GPU, maintain it (bug fixing, new hardware…) and support MC simulation on the Grid
• Speed-up for individual components & the full chain
• What can be outsourced to the GPU and what done on the CPU
• Integration with Athena (APE)
• Balance of CPU cores to GPU, i.e. sharing of the GPU resource between several jobs
• Farm integration issues: packaging, power consumption…
• Financial cost: hardware, installation, commissioning, maintenance…

As an exercise, see what we can learn from studies to date, i.e. the cost-benefit if we were to purchase today.
Demonstrators
• ID (RAL, Edinburgh, Oxford):
  – Complete L2 ID chain ported to CUDA for NVIDIA GPU
  – ID data preparation (bytestream conversion, clustering, space-point formation) additionally ported to OpenCL
• Muon (CERN, Rome):
  – Muon calorimeter isolation implemented in CUDA
• Jet (Lisbon):
  – Just starting

See Twiki: TriggerSoftwareUpgrade

Effort needed to port code? Porting L2 ID tracking to CUDA took ~2 years @ 0.5 FTE => 1 staff-year (for a very experienced expert!)
GPUs

Example of the complete L2 ID chain implemented on GPU (Dmitry Emeliyanov).
Times per Tau RoI 0.6x0.6, tt events at 2x10^34:

Step             | C++ on 2.4 GHz CPU (ms) | CUDA on Tesla C2050 (ms) | Speed-up CPU/GPU
Data prep.       | 27                      | 3                        | 9
Seeding          | 8.3                     | 1.6                      | 5
Seed ext.        | 156                     | 7.8                      | 20
Triplet merging  | 7.4                     | 3.4                      | 2
Clone removal    | 70                      | 6.2                      | 11
CPU–GPU transfer | n/a                     | 0.1                      | n/a
Total            | 268                     | 22                       | 12

Max. speed-up: x26. Overall speed-up t(CPU)/t(GPU): 12.
[Plot: Data Prep. and L2 Tracking times; annotations x2.4 and x5]
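The overall factor of 12 can be reproduced from the per-step times rather than the per-step speed-up factors (a minimal sketch in Python; the times are those in the table above):

```python
# Per-step execution times (ms) for the L2 ID chain, from the table above.
# CPU: C++ on a 2.4 GHz core; GPU: CUDA on a Tesla C2050.
cpu_ms = {"data_prep": 27, "seeding": 8.3, "seed_ext": 156,
          "triplet_merging": 7.4, "clone_removal": 70}
gpu_ms = {"data_prep": 3, "seeding": 1.6, "seed_ext": 7.8,
          "triplet_merging": 3.4, "clone_removal": 6.2, "transfer": 0.1}

cpu_total = sum(cpu_ms.values())   # ~268 ms
gpu_total = sum(gpu_ms.values())   # ~22 ms, including the CPU<->GPU transfer

print(f"CPU total: {cpu_total:.0f} ms, GPU total: {gpu_total:.1f} ms")
print(f"Overall speed-up t(CPU)/t(GPU): {cpu_total / gpu_total:.0f}")
```

Note that the overall factor (12) is much smaller than the best single-step factor because the less parallelisable steps dominate the GPU-side total.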
Sharing of GPU resource
• With balanced load on the CPU/GPU, several CPU cores can share a GPU, e.g. a test of L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU.
[Plot: blue: tracking running on CPU; red: most tracking steps on GPU, final ambiguity solving on CPU; annotation x2.4]
Packaging
Examples:
• 1U: 2x E5-2600 or E5-2600 v2 + 3 GPU
• 2U: 2x E5-2600 or E5-2600 v2 + 4 GPU

CPU: Intel E5-2697 v2: 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20: 2496 cores, 13 SMX, 192 cores per SMX, 3.5 (1.1) TFLOPS SP (DP), ~2.4k CHF

Total for 1027 or 2027 with 2 K20 GPUs: ~15k CHF => 12 CPU cores/GPU
Total for 2027 with 4 K20 GPUs: ~20k CHF => 6 CPU cores/GPU
Power & Cooling
SDX racks:
• Max. power: 12 kW
• Usable space: 47 U
• Current power ~300 W per motherboard => max. 40 motherboards per rack
• Compare a 2U unit: a) 4 motherboards, 8 CPUs: 1.1 kW; b) 1 motherboard, 2 CPUs with 2 GPUs (750 W) or 4 GPUs (1.2 kW)

Based on max. power: K20 GPU: 225 W c.f. E5-2697 v2 CPU: 130 W (need to measure typical power).

Illustrative farm configuration, 50 racks total:

Configuration                               | Farm nodes | CPUs  | CPU cores (max threads) | GPUs (SMX)     | Required throughput per node (per CPU core)
40 nodes per rack, ~300 W/node              | 2,000      | 4,000 | 48,000 (96,000)         | 0              | 50 Hz (2.1 Hz)
10 nodes per rack, 4 GPU/node, ~1200 W/node | 500        | 1,000 | 12,000 (24,000)         | 2,000 (26,000) | 200 Hz (8.3 Hz)
16 nodes per rack, 2 GPU/node, ~750 W/node  | 800        | 1,600 | 19,200 (38,400)         | 1,600 (20,800) | 125 Hz (5.2 Hz)

(x4 and x2.5 fewer CPU cores, respectively, than the CPU-only configuration)
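The node counts in the table follow directly from the 12 kW cooling budget; a minimal sketch using the per-node power figures from this slide:

```python
# Nodes per rack limited by the 12 kW rack cooling budget
# (power figures as quoted on the slide).
rack_kw, racks = 12.0, 50
configs = {
    "CPU-only (~300 W/node)":       0.300,
    "2 CPU + 4 GPU (~1200 W/node)": 1.200,
    "2 CPU + 2 GPU (~750 W/node)":  0.750,
}
cores_per_node = 24  # 2x 12-core CPUs

for name, node_kw in configs.items():
    nodes = int(rack_kw // node_kw) * racks
    print(f"{name}: {nodes} nodes, {nodes * cores_per_node} CPU cores")
```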
Summary
• The current limiting factor is cooling: 12 kW/rack => adding GPUs means removing CPUs
• A factor 2.5–4 fewer CPUs requires a corresponding increase in CPU throughput
• Financial cost per motherboard (2U box with 8 CPUs versus 2 CPUs + 4 GPUs): CPU+GPU is a factor ~2 more expensive
• => win with the CPU+GPU solution when the throughput per CPU is increased by more than a factor 8 => 90% of the work (by CPU time) transferred to GPU
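The factor 8 and the ~90% figure are consistent under a simple model (an assumption, not spelled out on the slide): offloaded work runs in negligible time on the GPU, so the throughput gain per CPU is 1/(1-f) for an offloaded CPU-time fraction f:

```python
# Break-even sketch for the summary numbers: a CPU+GPU node costs ~2x a
# CPU-only node and the rack holds 4x fewer CPUs, so each remaining CPU
# must deliver ~8x the throughput. Assuming offloaded work takes
# negligible time on the GPU, gain per CPU = 1 / (1 - f).
cost_factor = 2.0      # CPU+GPU node ~2x more expensive
cpu_reduction = 4.0    # 4x fewer CPUs per rack

required_gain = cost_factor * cpu_reduction   # 8x
f = 1.0 - 1.0 / required_gain                 # 0.875, i.e. ~90%
print(f"Need {required_gain:.0f}x throughput per CPU "
      f"=> offload {100 * f:.1f}% of CPU time (~90%)")
```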
Discussion
• Benefits:
  – If we can manage to port the bulk of the time-consuming code to GPU, the benefit is potentially much better scaling with pile-up, i.e.
    • No combinatorial code left on the CPU => execution times will scale slowly with pile-up
    • Code on the GPU is parallel and will scale slowly with pile-up
• Costs:
  – Significant effort needed to port code
  – Need to support different GPU generations with rolling replacements
  – Potential divergence from offline
  – Need to support a CPU version of the code for simulation
  – Possibly more expensive than a CPU-only farm
=> A CPU+GPU solution is attractive IF a CPU-based farm cannot provide enough processing power.
=> However, it currently looks like a CPU-only farm is the least-cost solution.
=> Discuss!
CPU
• Coming: e.g. Xeon E5-2699 v3, 18 cores and 36 threads; 3,960 EUR / $5,392
GPU
Prices (US $):

Qty | K40    | K20X   | K20    | M2090 | C2050
1   | 4,435  | 3,200  | 2,695  | 1,825 | 1,100
2   | 8,870  | 9,600  | 8,085  | 5,475 | 2,200
4   | 17,740 | 12,800 | 10,780 | 7,300 | 4,400
Increase in throughput per CPU when a GPU is added

[Plot: throughput gain vs fraction of work moved to GPU (fraction defined in terms of execution time on CPU), for several speed-up factors t(CPU)/t(GPU); the CPU code is serial and waits for GPU completion.]

If the CPU count is reduced by a factor 4, we need a factor 4 increase in throughput to break even, i.e. 75% of the work moved to GPU.
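The curve behind this plot can be sketched with the stated model, where the CPU blocks while the GPU works; `throughput_gain` is an illustrative helper, not code from the project:

```python
def throughput_gain(f, s):
    """Throughput gain per CPU core when a fraction f of the CPU
    execution time is offloaded to a GPU with speed-up s, and the CPU
    waits for GPU completion: t -> (1 - f) * t + f * t / s."""
    return 1.0 / ((1.0 - f) + f / s)

# Break-even for a 4x CPU reduction needs a 4x gain: in the limit of a
# very large GPU speed-up this requires offloading 75% of the CPU time.
print(f"f=0.75, s->inf: gain ~{throughput_gain(0.75, 1e9):.2f}")

# With finite GPU speed-ups the gain at f=0.75 falls short of 4x:
for s in (5, 10, 20):
    print(f"s={s}: gain at f=0.75 is {throughput_gain(0.75, s):.2f}")
```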
Speed-up factors
HLT time budget: 60% tracking, 20% Calo, 10% Muon, 10% other
• Cost of GPU - cost of CPU
• Cost of effort for the online version
• Cost for simulation
…
Example: CPU #1: 12 CPU cores, 12/24 CPU threads; GPU #1: 15 SMX, 2880 cores; GPU #2: 15 SMX, 2880 cores.
• Tracking on CPU: 120 ms tracking + 240 ms other = 360 ms per event
• Tracking on GPU: 10 ms CPU-side + 240 ms other = 250 ms per event (69%)
• CPU time: x0.69 => throughput x1.44
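One reading of these numbers (an interpretation of the slide, not stated explicitly): per event, tracking takes 120 ms on the CPU but only 10 ms of CPU time once offloaded, while 240 ms of other work stays on the CPU:

```python
# Per-event CPU budget before and after offloading tracking to the GPU
# (120/240/10 ms figures as quoted on the slide).
tracking_cpu_ms, other_cpu_ms = 120.0, 240.0
tracking_residual_ms = 10.0   # CPU-side time once tracking runs on the GPU

t_before = tracking_cpu_ms + other_cpu_ms      # 360 ms
t_after = tracking_residual_ms + other_cpu_ms  # 250 ms

print(f"CPU time per event: x{t_after / t_before:.2f} (69%)")
print(f"Throughput per core: x{t_before / t_after:.2f}")
```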
6 jobs per GPU
Data Preparation Code
Power & Cooling
SDX racks:
• Upper level: 27? XPU racks
  – each 47 U usable; 9.5 kW
  – 1U nodes, 31, 32 or 40 per rack (=> max 300 W per 1U)
  – current power consumption 6–9 kW per rack
• Lower level: partially equipped with 10 racks (+6 pre-series racks)
  – each 47 U (could be 52 U with additional reinforcing); 15 kW
  – 2U 4-blade servers, 1100 W, 8 or 10 per rack (9–11 kW)

GPU: C2050 <238 W, K20 <225 W, K40 <235 W, c.f. CPU: 130 W (12 cores, 2.7 GHz) => GPU max. power consumption 80% higher than CPU.
=> Adding 1 GPU roughly doubles the power consumption of a node.

50 racks:

Configuration                               | Nodes (motherboards) | CPUs  | CPU cores (max threads) | GPUs (SMX)     | Throughput per node (per CPU core)
40 nodes per rack                           | 2,000                | 4,000 | 48,000 (96,000)         | 0              | 50 Hz (2.08 Hz)
10 nodes per rack, 4 GPU/node, ~1200 W/node | 500                  | 1,000 | 12,000 (24,000)         | 2,000 (30,000) | 200 Hz (8.33 Hz)
15 nodes per rack, 2 GPU/node, ~800 W/node  | 750                  | 1,500 | 18,000 (36,000)         | 1,500 (22,500) | 133 Hz (5.55 Hz)
Packaging
• 1U: 2x E5-2600 or E5-2600 v2 + 3 GPU, e.g. 2x12 = 24 CPU cores, 3 GPUs => 8 CPU cores/GPU (+ GPUs: 3 x $4,435)
• 2U: 2x E5-2600 or E5-2600 v2 + 4 GPU, e.g. 2x12 = 24 CPU cores, 4 GPUs => 6 CPU cores/GPU

GPU: e.g. K40: 2880 cores, 15 SMX, 192 cores per SMX, 4.3 (1.4) TFLOPS SP (DP): $4,400