supercomputer 'fugaku' development · l1 inst. cache branch predictor decode & issue...

13
SUPERCOMPUTER “FUGAKU” DEVELOPMENT Toshiyuki Shimizu FUJITSU LIMITED June 17, 2019 Copyright 2019 FUJITSU LIMITED

Upload: others

Post on 10-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

SUPERCOMPUTER “FUGAKU”DEVELOPMENT

Toshiyuki Shimizu

FUJITSU LIMITED

June 17, 2019

Copyright 2019 FUJITSU LIMITED

Page 2: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, FrankfurtPost-K Activities, ISC19, Frankfurt

“Fugaku” is named after Mt. Fuji

Highest mountain in Japan

Very broad gradual slopes

Copyright 2019 FUJITSU LIMITED

Supercomputer “Fugaku”, Formerly Known as Post-K

1

Page 3: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, FrankfurtPost-K Activities, ISC19, Frankfurt Copyright 2019 FUJITSU LIMITED

Supercomputer “Fugaku”, Formerly Known as Post-K

ApproachFocus

Application performance

Power efficiency

Usability

Co-design w/ application developers and Fujitsu-designed CPU core w/ high memory bandwidth utilizing HBM2

Leading-edge Si-technology, Fujitsu's proven low power & high performance logic design, and power-controlling knobs

Arm®v8-A ISA with Scalable Vector Extension (“SVE”), and Arm standard Linux

“Fugaku” is named after Mt. Fuji

Highest mountain in Japan

Very broad gradual slopes

2

Page 4: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

Fujitsu-designed CPU Core w/ High Memory Bandwidth

A64FX out-of-order controls in cores, caches, and memories achieve superior throughput

HBM2 8GiBHBM2 8GiBHBM2 8GiBHBM2 8GiB

CMG

L1D 64KiB, 4way

512-bit wide SIMD 2x FMAs

CoreCore

230+GB/s

115+GB/s

12x Compute Cores + 1x Assistant Core

L2 Cache 8MiB, 16way

256GB/s

115+GB/s

57+GB/s

Core Core

x4

BW and calc. perf. A64FX B/F

DP floating perf. (TFlops) 2.7+ -

L1 data cache (TB/s) 11+ 4

L2 cache (TB/s) 3.6+ 1.3

Memory BW (GB/s) 1024 0.37

Copyright 2019 FUJITSU LIMITED3

Page 5: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

A64FX Optimized Load Efficiency for Apps Performance

128 bytes/cycle sustained bandwidth even for unaligned SIMD load

“Combined Gather” doubles gather (indirect) load’s data throughput, when target elements are within a“128-byte aligned block” for a pair of two regs, even & odd

L1D cacheRead port0

Read port1

Read data064B/cycle

Read data164B/cycle

Mem.

128B

0 1 2 3 4 5 6 7

0 1 3 2 6 7 45

8B

Regs

flow-1

flow-2

flow-4 flow-3

Maximizes BW to 32 bytes/cyc.Copyright 2019 FUJITSU LIMITED

Suggested through Co-design work w/ app teams

4

Page 6: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

A64FX Power Knobs to Reduce Power Consumption

“Power knob” limits units’ activities via user APIs

Performance/Watt can be optimized by utilizing Power knobs

Copyright 2019 FUJITSU LIMITED

L1 inst.cache

BranchPredictor

Decode& Issue

RSE0

RSA

RSE1

RSBR

PGPREXA

EXB

EAGAEXC

EAGBEXD

PFPR

FetchPort

Store Port

L1 datacache

HBM2 Controller

Fetch Issue Dispatch Reg-read Execute Cache and Memory

CSE

Commit

PC

ControlRegisters

L2 cache

HBM2

Write Buffer

Tofu controller

FLA

PPR

FLB

PRX

FLA pipeline only

HBM2 B/W adjust(units of 10%)

Frequency reduction

Decode width: 4 to 2

EXA pipeline only

PCI Controller 52 cores

5

Page 7: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

TSMC 7nm FinFET & CoWoS

Broadcom SerDes, HBM I/O, and SRAMs

87.86 billion transistors

594 signal pins

A64FX Leading-edge Si-technology

Copyright 2019 FUJITSU LIMITED

Core Core Core Core Core Core Core Core Core Core

Core Core Core Core Core

Core Core Core Core

Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core Core Core

Core Core Core Core

L2Cache

L2Cache

L2Cache

L2Cache

HB

M2

Interfa

ceH

BM

2 Interfa

ce

HB

M2

inte

rfa

ceH

BM

2 In

terf

ace

PCIe InterfaceTofuD Interface

RIN

G-B

us

6

Page 8: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

“Fugaku” CPU Performance Evaluation (1/3)

Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx

Copyright 2019 FUJITSU LIMITED7

Page 9: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

“Fugaku” CPU Performance Evaluation (2/3)

Himeno Benchmark (Fortran90)

Stencil calculation to solve Poisson’s equation by Jacobi method

† “Performance evaluation of a vector supercomputer SX-aurora TSUBASA”,SC18, https://dl.acm.org/citation.cfm?id=3291728

Copyright 2019 FUJITSU LIMITED

GFl

ops

† †

8

Page 10: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

“Fugaku” CPU Performance Evaluation (3/3)

Copyright 2019 FUJITSU LIMITED

WRF: Weather Research and Forecasting model Vectorizing loops including IF-constructs is key optimization

Source code tuning using directives promotes compiler optimizations

xx

9

Page 11: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

OSS Application Porting @ Arm HPC Users Group

Copyright 2019 FUJITSU LIMITED

http://arm-hpc.gitlab.io/

(http://arm-hpc.gitlab.io/)Application Lang. GCC LLVM Arm Fujitsu

LAMMPS C++ Modified Modified Modified Modified

GROMACS C Modified Modified Modified Modified

GAMESS* Fortran Modified Modified Modified Modified

OpenFOAM C++ Modified Modified Modified Modified

Siesta* Fortran Ok in as is Issues found Issues found Modified

NAMD C++ Modified Modified Modified Modified

WRF Fortran Modified Modified Modified Modified

Quantum ESPRESSO Fortran Ok in as is Ok in as is Ok in as is Modified

NWChem Fortran Ok in as is Modified Modified Modified

ABINIT Fortran Modified Modified Modified Modified

CP2K Fortran Ok in as is Issues found Issues found Modified

NEST* C++ Ok in as is Modified Modified Modified

USQCD (MILC) C Ok in as is Modified Modified Modified

BLAST* C++ Ok in as is Modified Modified Modified

10

Page 12: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Post-K Activities, ISC19, Frankfurt

Summary: “Fugaku” and Fujitsu Commercial Units

“Fugaku” is designed and runs applications at the highest level performance to be worthy of the name

Arm HPC ecosystem and expanding apps portfolio are likened to the broad gradual slopes of Mt. Fuji

Copyright 2019 FUJITSU LIMITED

Broadcom SerDes, HBM I/O, and SRAMsTSMC 7nm FinFET & CoWoS

Fujitsu designed 48-core CPU

Image of commercial unit from Fujitsu

Fujitsu began production of “Fugaku”, also advances productization of commercial units based on the supercomputer technology

11

Page 13: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2

Copyright 2019 FUJITSU LIMITED