post-k supercomputer - fujitsu with arm collaboration with linaro hpc-sig/openhpc collaboration with...

Post-K Supercomputer

Fujitsu’s High-end HPC Development

Fujitsu has provided HPC systems with original technologies, developed for over 40 years, to accelerate advanced research

The K computer continues to be competitive in various fields, from advanced research to manufacturing

K computer

RIKEN and Fujitsu are developing the Post-K to achieve superior application performance

Post-K computer

HPCG No. 1 (2017)

Graph500 No. 1 (2017)

Gordon Bell Prize Finalist (2016)

PRIMEHPC FX10

PRIMEHPC FX100

Japan’s Post-K Computer Development Project

Overview

• RIKEN and Fujitsu are developing the Post-K computer, which aims to be the most advanced general-purpose supercomputer in the world

Goals and Approaches

Project Goals

Application performance

Power efficiency

User convenience

Approaches

• Fujitsu-original CPU and interconnect

• Superior compiler optimization

• Effective use of hardware resources through a co-design approach

• Building the Arm HPC ecosystem

• Excellent application portability

Post-K Fujitsu Original CPU and Interconnect

The CPU was designed to support the Arm SVE instruction set architecture (including FP16)

The CPU & Tofu maintain the programming models and provide high application performance

Functions & Architecture                    Post-K           K computer

Processor
  Base ISA + SIMD Extensions                Armv8-A + SVE    SPARC-V9 + HPC-ACE
  SIMD width [bits]                         512              128
  FMA: Floating-point multiply and add      ✔                ✔
  Inter-core barrier                        ✔                ✔
  Sector cache                              ✔ Enhanced       ✔
  Hardware “prefetch” assist                ✔ Enhanced       ✔

Interconnect
  Tofu                                      ✔ Enhanced       ✔

Our Approach to Post-K High Performance

The compiler cooperates with hardware to improve performance.

• Designed to satisfy both performance/power and usability

Improve memory bandwidth (Memory Bandwidth-intensive Applications)

• Compiler: Software Prefetch, Loop-Blocking

• Hardware: Stacked Memory

Improve computational efficiency (Calculation-intensive Applications)

• Compiler: Software Pipelining with Loop Fission, Auto-Vectorization with SVE

• Hardware: Out-of-Order, SVE

“Smart” Auto-Vectorization for Arm SVE: Next-Generation SIMD ISA

# of kernels vectorized on TSVC*

TSVC (total kernels)    K computer    Post-K Goal
Fortran (135)           89            111
C (151)                 108           121

*[Fortran] D. Callahan, J. Dongarra, and D. Levine, “Vectorizing compilers: a test suite and results,” in Supercomputing '88, pp. 98-105.

*[C] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, “An Evaluation of Vectorizing Compilers,” PACT 2011, pp. 372-382.

// Loop s482 in TSVC kernels
// is vectorized by SVE
for (int i = 0; i < LEN; i++) {
    a[i] += b[i] * c[i];
    if (c[i] > b[i]) break;
}
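The early-exit loop above is hard to vectorize on fixed-width SIMD because the `break` depends on data. A minimal sketch of how per-lane predication can handle it is given here in plain C. It emulates a predicate register with an `int` array and is not the code the Fujitsu compiler generates. The lane count `VL` and the function names `s482_scalar` and `s482_predicated` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lane count: a 512-bit vector holding 64-bit doubles. */
#define VL 8

/* Scalar reference: the s482 loop. Returns how many elements were updated. */
size_t s482_scalar(double *a, const double *b, const double *c, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        a[i] += b[i] * c[i];
        if (c[i] > b[i])
            return i + 1;            /* element i is updated before the break */
    }
    return len;
}

/* Emulated predicated vector loop: VL lanes are processed per step. */
size_t s482_predicated(double *a, const double *b, const double *c, size_t len)
{
    assert(a && b && c);
    size_t done = 0;

    for (size_t base = 0; base < len; base += VL) {
        int active[VL];              /* emulated per-lane predicate */
        int hit = 0;                 /* did a lane so far request the break? */

        /* Build the predicate: in bounds, and no earlier lane has broken. */
        for (size_t lane = 0; lane < VL; lane++) {
            size_t i = base + lane;
            active[lane] = (i < len) && !hit;
            if (active[lane] && c[i] > b[i])
                hit = 1;             /* this lane stays active, later ones do not */
        }

        /* Predicated multiply-add: inactive lanes leave a[] untouched. */
        for (size_t lane = 0; lane < VL; lane++) {
            if (active[lane]) {
                a[base + lane] += b[base + lane] * c[base + lane];
                done++;
            }
        }

        if (hit)
            break;                   /* leave the vector loop after this step */
    }
    return done;
}
```

Both functions update the same elements, including the element that triggers the break, provided that `a`, `b`, and `c` do not overlap.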

Arm SVE Features (Fujitsu contributed to the specifications)

• Per-lane predication

• Gather-load and scatter-store

• HPC-focused instructions, e.g. reciprocal instructions, math acceleration instructions, etc.

Advantages of Post-K w/ Fujitsu Compiler

• High vectorization rate: Smart Auto-Vectorization handles the following loops:
  - containing “if” statements
  - containing list access
  - partial vectorization

• Efficient utilization of vector instructions, e.g. gather/scatter and packed SIMD

• Highly optimized executables, e.g. utilizing deep knowledge of the ISA and the Post-K microarchitecture

“Smart” Loop Fission: Increasing the Opportunity for “Software Pipelining” (SWP)

Fujitsu’s Software Pipelining with Smart Loop Fission

Splitting a large loop reduces the number of registers each resulting loop requires, which enables SWP and improves performance

Algorithm pre-trained by machine learning

• Trained on millions of patterns, machine learning determines the best weights for deciding where to fission

• Machine learning evaluates the number of registers, memory accesses, cache hit ratio, etc.

[Figure: A large basic block (instructions 1 to 9) has insufficient registers for SWP. Smart Loop Fission, driven by the machine-learned algorithm, splits it into fissioned basic blocks that each have sufficient registers for SWP, and each fissioned block is then software-pipelined.]
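The fission step can be sketched with a hand-split loop. This is only an illustration of the transformation, not the machine-learned algorithm described above. The function names `fused` and `fissioned` and the loop bodies are hypothetical. The sketch assumes the output arrays do not overlap the inputs, which is what makes the split legal.

```c
#include <assert.h>
#include <stddef.h>

/* One large loop body: the temporaries of both statement groups are live
   in the same iteration, so the loop needs more registers. */
void fused(double *x, double *y, const double *p, const double *q, size_t n)
{
    assert(x && y && p && q);
    for (size_t i = 0; i < n; i++) {
        double t0 = p[i] * p[i] + 1.0;   /* group 1 */
        double t1 = t0 * p[i] - 2.0;
        x[i] = t0 + t1;

        double u0 = q[i] * 3.0 + p[i];   /* group 2 */
        double u1 = u0 * u0 - q[i];
        y[i] = u0 - u1;
    }
}

/* After fission: two smaller loops, each with fewer live values per
   iteration and therefore easier to software-pipeline. */
void fissioned(double *x, double *y, const double *p, const double *q, size_t n)
{
    assert(x && y && p && q);
    for (size_t i = 0; i < n; i++) {     /* loop 1: group 1 only */
        double t0 = p[i] * p[i] + 1.0;
        double t1 = t0 * p[i] - 2.0;
        x[i] = t0 + t1;
    }
    for (size_t i = 0; i < n; i++) {     /* loop 2: group 2 only */
        double u0 = q[i] * 3.0 + p[i];
        double u1 = u0 * u0 - q[i];
        y[i] = u0 - u1;
    }
}
```

The trade-off is that `p` is read in both loops after fission, so the split costs extra memory traffic. This is presumably why the slides list memory access and cache hit ratio among the factors the algorithm evaluates.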

Effectiveness of Software Pipelining (SWP) with “Smart” Loop Fission

NICAM-DC-MINI

• A benchmark from one of the world's most famous climate simulations (search “NICAM-DC-MINI”)

SWP with Smart Loop Fission improves performance

(Source: http://www.riken.jp/pr/topics/2013/20130920_1/)

[Figure: NICAM-DC-MINI single-core breakdown of execution time on FX100 (w/ 32 SIMD registers), comparing SWP without Loop Fission and SWP with Loop Fission. Categories: 4 inst. committed, 2-3 inst. committed, 1 inst. committed, Wait (calculation), Wait (data access). Result: 31% speedup, with Wait (calculation) reduced.]

Building the Arm HPC Ecosystem

Fujitsu collaborates closely with partners & communities to contribute to the prosperity of the Arm HPC Ecosystem, making Arm systems easy to use

Arm HPC Ecosystem

• Hardware: Open & Efficient HPC Spec.

• Middleware: Porting & Optimizing HPC Software Stacks

• Application: Porting & Tuning HPC Applications

Fujitsu’s Contributions for the Arm HPC Ecosystem

Application

Middleware

Hardware

Timeline: 2017 to 2020

Porting & Tuning HPC applications

Developing an SVE simulator

Contributed to SVE specifications

Extending the Linux OS to support SVE and a flexible RAS framework

Augmenting OSS compiler for HPC (e.g. Clang)

Optimizing scientific libraries

Porting scientific libraries (PLASMA/SCOTCH/SLEPc ported)

Collaboration with Arm

Collaboration with Linaro HPC-SIG/OpenHPC

Collaboration with various users, developers, and vendors

Fujitsu is collaborating with Arm, Linaro HPC-SIG, and OpenHPC on many projects

Appendix: What is Software Pipelining?

Software pipelining is instruction scheduling that overlaps the instructions of one loop iteration with those of the following iterations

It executes as many instructions in parallel as possible, making loop execution more efficient

Effect of SWP on the K computer: 2.2x with SWP on 1 kernel of NICAM-DC (Dynamic Core of Non-hydrostatic Icosahedral Atmospheric Model)

[Figure: Concept Image of Software Pipelining. The instructions of the Original Loop are split, shifted, and overlaid by the scheduler to form the Software Pipelined Loop, which consists of a Prologue, a Kernel, and an Epilogue. Instructions execute in order. *All instructions need 2 cycles to finish.]
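The prologue, kernel, and epilogue structure can be sketched by pipelining a small loop by hand. Compilers do this at the instruction level. The C version below only illustrates the schedule, using a hypothetical three-stage body (load, compute, store) and the assumed function names `swp_plain` and `swp_pipelined`. It assumes `x` and `y` do not overlap, because loads are moved ahead of earlier stores.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: each iteration runs load, compute, store back to back. */
void swp_plain(double *y, const double *x, double k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double v = x[i];             /* stage 1: load    */
        double r = v * k + 1.0;      /* stage 2: compute */
        y[i] = r;                    /* stage 3: store   */
    }
}

/* Hand-pipelined loop: one kernel trip works on three different iterations. */
void swp_pipelined(double *y, const double *x, double k, size_t n)
{
    assert(y && x);
    if (n < 2) {                     /* too short to fill the pipeline */
        swp_plain(y, x, k, n);
        return;
    }

    /* Prologue: start iterations 0 and 1. */
    double v = x[0];                 /* load(0)    */
    double r = v * k + 1.0;          /* compute(0) */
    v = x[1];                        /* load(1)    */

    /* Kernel: steady state. */
    for (size_t i = 2; i < n; i++) {
        y[i - 2] = r;                /* store(i-2)   */
        r = v * k + 1.0;             /* compute(i-1) */
        v = x[i];                    /* load(i)      */
    }

    /* Epilogue: drain the last two iterations. */
    y[n - 2] = r;                    /* store(n-2)   */
    r = v * k + 1.0;                 /* compute(n-1) */
    y[n - 1] = r;                    /* store(n-1)   */
}
```

In the kernel, the store, compute, and load belong to different iterations and do not depend on each other within a trip, so hardware can issue them in parallel. Each in-flight iteration also keeps its own values live, which is why register pressure limits how large a loop can be pipelined.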
