supercomputer 'fugaku' development · l1 inst. cache branch predictor decode & issue...
TRANSCRIPT
![Page 1: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/1.jpg)
SUPERCOMPUTER “FUGAKU”DEVELOPMENT
Toshiyuki Shimizu
FUJITSU LIMITED
June 17, 2019
Copyright 2019 FUJITSU LIMITED
![Page 2: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/2.jpg)
Post-K Activities, ISC19, FrankfurtPost-K Activities, ISC19, Frankfurt
“Fugaku” is named after Mt. Fuji
Highest mountain in Japan
Very broad gradual slopes
Copyright 2019 FUJITSU LIMITED
Supercomputer “Fugaku”, Formerly Known as Post-K
1
![Page 3: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/3.jpg)
Post-K Activities, ISC19, FrankfurtPost-K Activities, ISC19, Frankfurt Copyright 2019 FUJITSU LIMITED
Supercomputer “Fugaku”, Formerly Known as Post-K
ApproachFocus
Application performance
Power efficiency
Usability
Co-design w/ application developers and Fujitsu-designed CPU core w/ high memory bandwidth utilizing HBM2
Leading-edge Si-technology, Fujitsu's proven low power & high performance logic design, and power-controlling knobs
Arm®v8-A ISA with Scalable Vector Extension (“SVE”), and Arm standard Linux
“Fugaku” is named after Mt. Fuji
Highest mountain in Japan
Very broad gradual slopes
2
![Page 4: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/4.jpg)
Post-K Activities, ISC19, Frankfurt
Fujitsu-designed CPU Core w/ High Memory Bandwidth
A64FX out-of-order controls in cores, caches, and memories achieve superior throughput
HBM2 8GiBHBM2 8GiBHBM2 8GiBHBM2 8GiB
CMG
L1D 64KiB, 4way
512-bit wide SIMD 2x FMAs
CoreCore
230+GB/s
115+GB/s
12x Compute Cores + 1x Assistant Core
L2 Cache 8MiB, 16way
256GB/s
115+GB/s
57+GB/s
Core Core
x4
BW and calc. perf. A64FX B/F
DP floating perf. (TFlops) 2.7+ -
L1 data cache (TB/s) 11+ 4
L2 cache (TB/s) 3.6+ 1.3
Memory BW (GB/s) 1024 0.37
Copyright 2019 FUJITSU LIMITED3
![Page 5: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/5.jpg)
Post-K Activities, ISC19, Frankfurt
A64FX Optimized Load Efficiency for Apps Performance
128 bytes/cycle sustained bandwidth even for unaligned SIMD load
“Combined Gather” doubles gather (indirect) load’s data throughput, when target elements are within a“128-byte aligned block” for a pair of two regs, even & odd
L1D cacheRead port0
Read port1
Read data064B/cycle
Read data164B/cycle
Mem.
128B
0 1 2 3 4 5 6 7
0 1 3 2 6 7 45
8B
Regs
flow-1
flow-2
flow-4 flow-3
Maximizes BW to 32 bytes/cyc.Copyright 2019 FUJITSU LIMITED
Suggested through Co-design work w/ app teams
4
![Page 6: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/6.jpg)
Post-K Activities, ISC19, Frankfurt
A64FX Power Knobs to Reduce Power Consumption
“Power knob” limits units’ activities via user APIs
Performance/Watt can be optimized by utilizing Power knobs
Copyright 2019 FUJITSU LIMITED
L1 inst.cache
BranchPredictor
Decode& Issue
RSE0
RSA
RSE1
RSBR
PGPREXA
EXB
EAGAEXC
EAGBEXD
PFPR
FetchPort
Store Port
L1 datacache
HBM2 Controller
Fetch Issue Dispatch Reg-read Execute Cache and Memory
CSE
Commit
PC
ControlRegisters
L2 cache
HBM2
Write Buffer
Tofu controller
FLA
PPR
FLB
PRX
FLA pipeline only
HBM2 B/W adjust(units of 10%)
Frequency reduction
Decode width: 4 to 2
EXA pipeline only
PCI Controller 52 cores
5
![Page 7: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/7.jpg)
Post-K Activities, ISC19, Frankfurt
TSMC 7nm FinFET & CoWoS
Broadcom SerDes, HBM I/O, and SRAMs
87.86 billion transistors
594 signal pins
A64FX Leading-edge Si-technology
Copyright 2019 FUJITSU LIMITED
Core Core Core Core Core Core Core Core Core Core
Core Core Core Core Core
Core Core Core Core
Core Core Core Core Core Core Core
Core Core Core Core Core Core Core Core Core Core Core Core
Core Core Core Core Core Core Core Core Core Core
Core Core Core Core
L2Cache
L2Cache
L2Cache
L2Cache
HB
M2
Interfa
ceH
BM
2 Interfa
ce
HB
M2
inte
rfa
ceH
BM
2 In
terf
ace
PCIe InterfaceTofuD Interface
RIN
G-B
us
6
![Page 8: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/8.jpg)
Post-K Activities, ISC19, Frankfurt
“Fugaku” CPU Performance Evaluation (1/3)
Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
Copyright 2019 FUJITSU LIMITED7
![Page 9: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/9.jpg)
Post-K Activities, ISC19, Frankfurt
“Fugaku” CPU Performance Evaluation (2/3)
Himeno Benchmark (Fortran90)
Stencil calculation to solve Poisson’s equation by Jacobi method
† “Performance evaluation of a vector supercomputer SX-aurora TSUBASA”,SC18, https://dl.acm.org/citation.cfm?id=3291728
Copyright 2019 FUJITSU LIMITED
GFl
ops
† †
8
![Page 10: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/10.jpg)
Post-K Activities, ISC19, Frankfurt
“Fugaku” CPU Performance Evaluation (3/3)
Copyright 2019 FUJITSU LIMITED
WRF: Weather Research and Forecasting model Vectorizing loops including IF-constructs is key optimization
Source code tuning using directives promotes compiler optimizations
xx
9
![Page 11: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/11.jpg)
Post-K Activities, ISC19, Frankfurt
OSS Application Porting @ Arm HPC Users Group
Copyright 2019 FUJITSU LIMITED
http://arm-hpc.gitlab.io/
(http://arm-hpc.gitlab.io/)Application Lang. GCC LLVM Arm Fujitsu
LAMMPS C++ Modified Modified Modified Modified
GROMACS C Modified Modified Modified Modified
GAMESS* Fortran Modified Modified Modified Modified
OpenFOAM C++ Modified Modified Modified Modified
Siesta* Fortran Ok in as is Issues found Issues found Modified
NAMD C++ Modified Modified Modified Modified
WRF Fortran Modified Modified Modified Modified
Quantum ESPRESSO Fortran Ok in as is Ok in as is Ok in as is Modified
NWChem Fortran Ok in as is Modified Modified Modified
ABINIT Fortran Modified Modified Modified Modified
CP2K Fortran Ok in as is Issues found Issues found Modified
NEST* C++ Ok in as is Modified Modified Modified
USQCD (MILC) C Ok in as is Modified Modified Modified
BLAST* C++ Ok in as is Modified Modified Modified
10
![Page 12: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/12.jpg)
Post-K Activities, ISC19, Frankfurt
Summary: “Fugaku” and Fujitsu Commercial Units
“Fugaku” is designed and runs applications at the highest level performance to be worthy of the name
Arm HPC ecosystem and expanding apps portfolio are likened to the broad gradual slopes of Mt. Fuji
Copyright 2019 FUJITSU LIMITED
Broadcom SerDes, HBM I/O, and SRAMsTSMC 7nm FinFET & CoWoS
Fujitsu designed 48-core CPU
Image of commercial unit from Fujitsu
Fujitsu began production of “Fugaku”, also advances productization of commercial units based on the supercomputer technology
11
![Page 13: SUPERCOMPUTER 'FUGAKU' DEVELOPMENT · L1 inst. cache Branch Predictor Decode & Issue RSE0 RSA RSE1 RSBR PGPR EXA EXB EAGA EXC EAGB EXD PFPR Fetch Port Store Port L1 data cache HBM2](https://reader030.vdocuments.us/reader030/viewer/2022011912/5fa4c06db6d7730a08189beb/html5/thumbnails/13.jpg)
Copyright 2019 FUJITSU LIMITED