
Page 1: The Mont-Blanc prototype...Per-node power monitoring • The EMB features a power sensor for every SDB • TI INA209 power meter • 1 sample every 16ms, 5% accuracy • Data is aggregated

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

http://www.montblanc-project.eu

The Mont-Blanc prototype

Alex Ramirez
Barcelona Supercomputing Center

Page 2

Commodity components drive HPC

• Microprocessors replaced Vector/SIMD supercomputers
  • They were not faster
  • They were cheaper

Top500 1993, 1st edition:

• Cray vector, 41%
• MasPar SIMD, 11%
• Convex/HP vector, 5%

May 19, 2014, Prototypes workshop

Page 3

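Mobile SoC vs Server processor

Performance
• Mobile SoC: 5.2 GFLOPS
• Server processor: 153 GFLOPS (x30)

Cost
• Mobile SoC: 21$ (2)
• Server processor: 1500$ (3) (x70)

Mobile SoC, CPU + embedded GPU
• Performance: 32.3 GFLOPS (1), server processor is x5
• Cost: 21$ (?), server processor is x70

1. 6.8 GFLOPS from CPU + 25.5 GFLOPS from embedded GPU
2. Leaked Tegra3 price from the Nexus 7 Bill of Materials
3. Non-discounted list price for the 8-core Intel E5 Sandy Bridge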
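
The multipliers on this slide can be rechecked from the quoted figures. The short sketch below only redoes that arithmetic; the variable names are illustrative and nothing beyond the slide's numbers is assumed.

```python
# Figures quoted on the slide (GFLOPS and US dollars).
mobile_cpu_gflops = 5.2
server_gflops = 153.0
mobile_cost_usd = 21.0
server_cost_usd = 1500.0

# Footnote 1: CPU plus embedded GPU of the second mobile configuration.
mobile_cpu_gpu_gflops = 6.8 + 25.5

# How many times faster / more expensive the server processor is.
perf_gap_cpu = server_gflops / mobile_cpu_gflops
perf_gap_cpu_gpu = server_gflops / mobile_cpu_gpu_gflops
cost_gap = server_cost_usd / mobile_cost_usd

print(f"performance gap, mobile CPU only:  x{perf_gap_cpu:.1f}")
print(f"performance gap, mobile CPU + GPU: x{perf_gap_cpu_gpu:.1f}")
print(f"cost gap:                          x{cost_gap:.1f}")
```

The slide rounds these ratios to x30, x5 and x70.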

Page 4

Samsung Exynos 5 Dual Superphone SoC

• 32nm HKMG
• Dual-core ARM Cortex-A15 @ 1.7 GHz
• Quad-core ARM Mali-T604
  • OpenCL 1.1
• Dual-channel DDR3
• USB 3.0 to 1 GbE bridge
• All in a low-power mobile socket

Page 5

Samsung Daughter Board (SDB)

• CPU + GPU + DRAM + storage + network … all in a compute card just 8.5 x 5.6 cm

Board components (from the slide diagram):
• Exynos 5 Dual: 2x ARM Cortex-A15 + ARM Mali-T604
• 4 GB of DDR3-1600
• uSD slot, up to 64 GB
• USB 3.0 to 1 GbE bridge

Page 6

Embedded Mother Board (EMB)

• 15 node-cluster in a standard Bull B505 enclosure

Blade contents (from the slide diagram):
• 15 compute cards
  • 30 x ARM Cortex-A15
  • 15 x ARM Mali-T604 GPU
  • 60 GB DDR3 (15 x 4 GB)
• 1 GbE crossbar switch
• Cluster management
• 2 x 10 GbE links
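
As a rough check on the blade-level network, the sketch below compares the aggregate bandwidth of the compute cards with the blade uplinks. It uses only the link counts and speeds listed for the EMB; the variable names are illustrative, and protocol overheads are ignored.

```python
# Link counts and nominal speeds listed for one EMB.
sdb_links = 15          # one 1 GbE link per compute card
sdb_link_gbps = 1
uplinks = 2             # 10 GbE links leaving the blade
uplink_gbps = 10

downstream_gbps = sdb_links * sdb_link_gbps
upstream_gbps = uplinks * uplink_gbps

# A ratio above 1.0 would mean the uplinks are oversubscribed.
oversubscription = downstream_gbps / upstream_gbps

print(f"cards: {downstream_gbps} Gb/s, uplinks: {upstream_gbps} Gb/s, "
      f"ratio: {oversubscription:.2f}")
```

With these nominal numbers the uplinks can carry all card traffic leaving the blade at once.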

Page 7

Mont-Blanc server chassis

• 9 blades in a standard 7U BullX chassis
• Shared cooling, PSU, chassis management

Chassis contents (from the slide diagram):
• 9 x Compute blades: 135 x Compute cards + 36 x 10 GbE links
  • 270 x ARM Cortex-A15
  • 135 x ARM Mali-T604 GPU
  • 540 GB DDR3-1600
• Chassis management panel

Page 8

The Mont-Blanc prototype

• 6 BullX chassis
• 54 Compute blades
• 810 Compute cards
  • 1620 CPU cores
  • 810 GPU
  • 3.2 TB of DRAM
  • 52 TB of Flash
• 26 TFLOPS
• 18 kW
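
The system totals follow from the per-card figures given on the earlier slides. The sketch below rebuilds them; it assumes every uSD slot holds its 64 GB maximum and uses decimal terabytes, and the constant names are illustrative.

```python
# Per-level counts taken from the SDB, EMB and chassis slides.
CHASSIS = 6
BLADES_PER_CHASSIS = 9
CARDS_PER_BLADE = 15
CPU_CORES_PER_CARD = 2      # Exynos 5 Dual: 2x Cortex-A15
GPUS_PER_CARD = 1           # one Mali-T604 per card
DRAM_GB_PER_CARD = 4
FLASH_GB_PER_CARD = 64      # assumption: uSD slot filled to its maximum

blades = CHASSIS * BLADES_PER_CHASSIS
cards = blades * CARDS_PER_BLADE
cpu_cores = cards * CPU_CORES_PER_CARD
gpus = cards * GPUS_PER_CARD
dram_tb = cards * DRAM_GB_PER_CARD / 1000
flash_tb = cards * FLASH_GB_PER_CARD / 1000

# Efficiency implied by the quoted 26 TFLOPS and 18 kW.
PEAK_TFLOPS = 26
POWER_KW = 18
gflops_per_watt = (PEAK_TFLOPS * 1000) / (POWER_KW * 1000)

print(blades, cards, cpu_cores, gpus)
print(f"{dram_tb:.2f} TB DRAM, {flash_tb:.2f} TB flash")
print(f"{gflops_per_watt:.2f} GFLOPS/W")
```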

Page 9

Interconnection Network

[Diagram: each SDB connects to its EMB at 1 Gb/s (15 x 1 Gb/s per EMB); each EMB has two 10 Gb/s uplinks; the remaining link labels read 18 x 10 Gb/s, 160 Gb/s, and N x 10 Gb/s to the Lustre server(s)]

Page 10

Per-node power monitoring

• The EMB features a power sensor for every SDB
  • TI INA209 power meter
  • 1 sample every 16 ms, 5% accuracy
• Data is aggregated on an FPGA on the EMB
  • 15 aggregated samples every 1.12 s
• Samples offloaded to the BMC via I2C every 500 ms
• User can access readings in BMC through management network using IPMI calls
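
The sampling figures above suggest a simple aggregation scheme, sketched below. The slide does not say how the FPGA combines readings, so the sketch assumes each aggregated value is the mean of all raw readings taken for one SDB during a 1.12 s window (1.12 s / 16 ms = 70 readings). The function names and the 5 W test input are made up for illustration.

```python
SAMPLE_PERIOD_S = 0.016   # one INA209 reading every 16 ms (from the slide)
WINDOW_S = 1.12           # aggregation period (from the slide)
SDBS_PER_EMB = 15         # one sensor per SDB, 15 SDBs per EMB

# Assumption: an aggregated value averages every raw reading in the window.
SAMPLES_PER_WINDOW = round(WINDOW_S / SAMPLE_PERIOD_S)


def aggregate_window(raw_watts):
    """Reduce one window of raw readings to one mean power value per SDB."""
    if len(raw_watts) != SDBS_PER_EMB:
        raise ValueError("expected one list of readings per SDB")
    means = []
    for readings in raw_watts:
        if len(readings) != SAMPLES_PER_WINDOW:
            raise ValueError("expected a full window of readings")
        means.append(sum(readings) / len(readings))
    return means


def window_energy_joules(mean_watts):
    """Energy used by one SDB during one window, from its mean power."""
    return mean_watts * WINDOW_S


# Synthetic input: every SDB draws a constant, made-up 5 W.
raw = [[5.0] * SAMPLES_PER_WINDOW for _ in range(SDBS_PER_EMB)]
means = aggregate_window(raw)
blade_energy = sum(window_energy_joules(m) for m in means)
print(f"{len(means)} aggregated samples, {blade_energy:.1f} J per window")
```

Integrating mean power over each window in this way gives per-node energy, which is the quantity needed to compare runs of different length.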

Page 11

Partners + roles

• BSC
  • Concept + (shifting) architecture requirements
  • SoC benchmarking
• ARM
  • System software stack
  • Linux kernel + OpenCL drivers
  • Network benchmarking
  • System software stack
• LRZ
  • Power monitoring requirements
• Bull
  • System architecture
  • PCB design

Page 12

Prototype development

• Oct’12
  • SoC selected for prototype integration
• Nov’12
  • Prototype architecture to use SoM / microserver concept (instead of flat multi-node blade)
• Mar’13
  • Finalization of interconnect network
    • 10 GbE switch integrated in Bull chassis
    • 10 GbE pass-through module integrated in Bull chassis
    • 10 GbE cables from the top of the blade
    • 10 GbE cables from the front of the blade
• Jul’13
  • First SDB samples received (EVT boards)
  • Never-ending OS bring-up and driver work starts …
• Sep’13
  • First EMB samples received (EVT boards)
• Mar’14
  • Second round of SDB and EMB boards (DVT boards)
  • Green light for prototype procurement

Page 13

Prototype procurement

• Complex procurement due to BSC + FP7 rules
• Prototype provider can’t be Bull
  • Because Bull is a partner …
  • … and this is going to be BSC property
  • BSC can’t pay Bull with Project funds
• Bypass Bull, and order the hardware directly from their provider
  • Bull will still integrate the hardware and deploy the prototype
• Still fighting with bureaucracy to publish prototype procurement
  • Budget is above 150,000 € => Must be public
  • Exclusive contract assigned to one provider
    • Must argue technical + IP reasons
    • Timing is NOT a valid reason!
  • But still, anyone could apply …

Page 14

Prototype deployment

• Large-scale prototype not deployed yet …
  • … trouble still needs to happen there
• SDB bring-up has been a never-ending process
  • Mismatch between kernel version + OpenCL driver + GbE driver
  • Then add Lustre client version
    • Lustre client did not work for kernels 3.7 to 3.10 …
    • … we dropped Lustre for GlusterFS …
    • We performed all the benchmarking on a GlusterFS server
    • … and Lustre works again in kernel 3.11
• Hardware platform for application developers has been in short supply
  • Arndale kits with Exynos 5 Dual out of stock in June’13
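
The kernel and Lustre mismatch described above is the kind of constraint that is easy to encode as a pre-flight check. The sketch below is hypothetical: the function name is invented, the broken range 3.7 to 3.10 is taken from the slide, and kernels older than 3.7 are assumed to work only because the slide says Lustre works "again" in 3.11.

```python
import re

# Kernel range in which the slide reports a broken Lustre client.
BROKEN_FIRST = (3, 7)
BROKEN_LAST = (3, 10)


def lustre_client_expected_to_work(kernel_release):
    """Return False for kernel releases inside the reported broken range."""
    match = re.match(r"(\d+)\.(\d+)", kernel_release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {kernel_release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    # Tuples of integers compare numerically, so (3, 10) sorts after (3, 7).
    return not (BROKEN_FIRST <= version <= BROKEN_LAST)


for release in ("3.6.0", "3.8.13", "3.10.1", "3.11.0"):
    print(release, lustre_client_expected_to_work(release))
```

On a running node the release string could come from `platform.release()` in the Python standard library.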

Page 15

Prototype evaluation

• What works
  • Everything works
    • Multi-core CPU
    • Embedded GPU accelerator + OpenCL
    • GbE interconnect
    • Lustre client
    • HPC software stack
      • Cluster management console
      • Performance monitoring + analysis
      • Debugger
      • …
• What doesn’t work
  • Wait until we deploy the large scale version …
  • … Ethernet use in HPC isn’t simple

Page 16

Exynos 5 Dual vs. Quad-core Intel i7

[Bar chart: Performance and Energy, relative to Quad-A9 (y-axis 0 to 14), for Quad Cortex-A9, Dual Cortex-A15, Quad Mali-T604, and Quad i7]

Page 17

Exynos 5 Octa (5420)

• Quad-core ARM Cortex-A15, for performance
• Quad-core ARM Cortex-A7, for energy efficiency
• Six-core ARM Mali-T628, as OpenCL accelerator
  • 50% more GPU cores than Exynos 5 Dual
  • 50% higher compute performance per core
• Higher CPU and DDR frequencies
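
Taken together, the two GPU figures imply a combined gain over the Exynos 5 Dual. The sketch below assumes the core-count and per-core improvements simply multiply, which is a best case that ignores memory bandwidth and other limits; the variable names are illustrative.

```python
# GPU core counts from the Exynos 5 Dual and Exynos 5 Octa slides.
dual_gpu_cores = 4    # quad-core Mali-T604
octa_gpu_cores = 6    # six-core Mali-T628

core_ratio = octa_gpu_cores / dual_gpu_cores   # "50% more GPU cores"
per_core_ratio = 1.5                           # "50% higher ... per core"

# Assumption: the two gains are independent and multiply.
gpu_speedup = core_ratio * per_core_ratio

print(f"peak GPU compute vs Exynos 5 Dual: x{gpu_speedup:.2f}")
```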

Page 18

Exynos 5 Octa projected performance …

[Bar chart: Performance relative to Quad-A9 (y-axis 0 to 20) for Quad Cortex-A9, Dual Cortex-A15, Quad Mali-T604, Quad i7, and Exynos 5 Octa]

Speculative data, no commitment from ARM or Samsung implied

Page 19

Interconnect evaluation: latency

• TCP/IP adds significant CPU overhead
• OpenMX driver interfaces “directly” to the Ethernet NIC
• USB in Exynos 5 adds extra latency on top of network stack
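
The slides do not say how latency was measured. A common approach is a ping-pong test, sketched below as an illustration only: it bounces small messages over a local socket pair, so it exercises the software path on one machine and not Ethernet, USB, or OpenMX. All function names and parameters here are made up.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo(sock, n_messages, size):
    """Send every received message straight back to the sender."""
    for _ in range(n_messages):
        sock.sendall(recv_exact(sock, size))


def ping_pong_latency_us(n_messages=1000, size=8):
    """Estimate one-way latency as half the mean round-trip time."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=echo, args=(b, n_messages, size))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(n_messages):
        a.sendall(payload)
        recv_exact(a, size)
    elapsed = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    return elapsed / n_messages / 2 * 1e6


print(f"one-way latency: {ping_pong_latency_us():.1f} us")
```

Running the same loop between two nodes, once over TCP and once over a lighter protocol, is how the per-stack overheads listed above would typically be separated.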

Page 20

Interconnect evaluation: bandwidth

• TCP/IP overhead prevents Tegra2 from achieving full bandwidth
  • OpenMX does achieve peak bandwidth
• USB overheads prevent Exynos 5 from achieving full bandwidth, even with OpenMX

Page 21

Advances over State of the Art

• Developed an entire HPC software ecosystem on ARM
  • Including OpenCL driver for GPU accelerator
  • Performance monitoring + tracing + analysis
  • Advanced parallel programming models
    • MPI + OmpSs @ OpenCL + FORTRAN
• Fine-grain per-node power monitoring
  • Opens the door to future per-node power management
  • Combined with SoC power management features
• Microserver architecture for HPC
  • Built on commodity SoC + commodity network (Ethernet)

Page 22

Limitations of current mobile processors for HPC

• 32-bit memory controller
  • Even if ARM Cortex-A15 offers 40-bit address space
• No ECC protection in memory
  • Limited scalability, errors will appear beyond a certain number of nodes
• No standard server I/O interfaces
  • Do NOT provide native Ethernet or PCI Express
  • Provide USB 3.0 and SATA (required for tablets)
• No network protocol off-load engine
  • TCP/IP, OpenMX, USB protocol stacks run on the CPU
• Thermal package not designed for sustained full-power operation

• All these are implementation decisions, not unsolvable problems
  • Only need a business case to justify the cost of including the new features … such as the HPC and server markets
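
Two of the limitations above can be made concrete with a little arithmetic. The sketch below compares what 32-bit and 40-bit byte addresses can reach, and shows how the chance of at least one memory error grows with node count. The per-node error probability is a hypothetical placeholder, not a measured value, and node failures are assumed independent.

```python
GIB = 2 ** 30


def addressable_bytes(address_bits):
    """Bytes reachable with byte-granular addresses of the given width."""
    return 2 ** address_bits


def probability_any_node_error(per_node_probability, nodes):
    """Chance that at least one of `nodes` independent nodes sees an error."""
    return 1.0 - (1.0 - per_node_probability) ** nodes


controller_gib = addressable_bytes(32) // GIB   # 32-bit memory controller
core_gib = addressable_bytes(40) // GIB         # 40-bit Cortex-A15 addressing
print(f"32-bit: {controller_gib} GiB, 40-bit: {core_gib} GiB")

# Hypothetical per-node error probability over some fixed interval.
HYPOTHETICAL_P = 0.001
for nodes in (1, 15, 135, 810):
    p_any = probability_any_node_error(HYPOTHETICAL_P, nodes)
    print(f"{nodes:4d} nodes: {p_any:.3f}")
```

Whatever the real per-node rate is, the second function shows why a rate that is negligible on one tablet becomes visible across hundreds of nodes.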

Page 23

Future directions

• Continue exploiting commodity SoC (from tablets / smartphones)
  • No ECC memory protection
    • Software error checking + checkpointing
  • No server I/O interfaces + no protocol offload
    • Custom interconnect protocol + interface + switch
• Develop server versions of those commodity SoC
  • Small + low-power SoC using the same IP
  • Integrating ECC checksums
  • Integrating 10/40 Gb/s Ethernet + TCP/IP protocol offload
• Develop HPC-class versions of those SoC
  • Large + high-end many-core versions using the same IP
  • Develop custom packaging + liquid cooling

• All of them rely on the same Mont-Blanc software stack
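
One way to read "software error checking + checkpointing" is a checkpoint that carries its own checksum, so silent corruption is caught at restore time. The sketch below illustrates that idea and is not the project's implementation: the function names are invented, the state is kept in memory, and SHA-256 and pickle are arbitrary choices from the Python standard library.

```python
import hashlib
import pickle


def save_checkpoint(state):
    """Serialize application state and compute a checksum over the bytes."""
    payload = pickle.dumps(state)
    digest = hashlib.sha256(payload).hexdigest()
    return payload, digest


def restore_checkpoint(payload, digest):
    """Return the saved state, refusing payloads that fail the checksum."""
    if hashlib.sha256(payload).hexdigest() != digest:
        raise ValueError("checkpoint corrupted")
    return pickle.loads(payload)


state = {"step": 42, "values": [1.0, 2.0, 3.0]}
payload, digest = save_checkpoint(state)
assert restore_checkpoint(payload, digest) == state

# Simulate a single flipped bit, as an unprotected memory error might cause.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
try:
    restore_checkpoint(corrupted, digest)
except ValueError as error:
    print("detected:", error)
```

On detection, the application would fall back to an older checkpoint and recompute, trading extra CPU time for the missing ECC hardware.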

Page 24

2016 SuperPhone SoC: 80 + 160 GFLOPS

• Quad-core ARM Cortex-A57, for performance
  • 2.5 GHz x 8 ops/cycle = 20 GFLOPS / core
• Quad-core ARM Cortex-A53, for energy efficiency
• 16-core ARM Mali-T760, as OpenCL accelerator
  • 2x more cores than Mali-T678 + higher performance/watt
  • 833 MHz x 12 ops/cycle = 10 GFLOPS / core
• 10 Gb/s IO interface (USB 3.1)
• DDR4 higher frequency + bus width / extra channels
  • 25.6 GB/s

Speculative data extrapolated from public sources, no commitment from ARM is implied
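
The "80 + 160 GFLOPS" in the slide title follows from the per-core figures. The sketch below redoes that peak-rate arithmetic (cores x clock x operations per cycle); the function name is illustrative, and the Cortex-A53 cores are left out because the slide gives no figures for them.

```python
def peak_gflops(cores, clock_ghz, ops_per_cycle):
    """Peak rate as cores x clock (GHz) x floating-point ops per cycle."""
    return cores * clock_ghz * ops_per_cycle


# Quad-core Cortex-A57 at 2.5 GHz, 8 ops/cycle (from the slide).
cpu_gflops = peak_gflops(cores=4, clock_ghz=2.5, ops_per_cycle=8)

# 16-core Mali-T760 at 833 MHz, 12 ops/cycle (from the slide).
gpu_gflops = peak_gflops(cores=16, clock_ghz=0.833, ops_per_cycle=12)

print(f"CPU: {cpu_gflops:.0f} GFLOPS, GPU: {gpu_gflops:.0f} GFLOPS")
```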

Page 25

Conclusions

• The convergence of Embedded and HPC technologies has happened already
  • We have enabled the software already
• Leverage all of the embedded systems technology to build a new class of HPC system
  • Automated SoC design
  • Automatic core customization
  • SoC power management
  • Decouple IP provider from semiconductor provider
