Post on 26-May-2018
Post-K Supercomputer
Copyright 2017 FUJITSU LIMITED
Fujitsu’s High-end HPC Development
Fujitsu has provided HPC systems with original technologies, developed for over 40 years, to accelerate advanced research
The K computer continues to be competitive in various fields, from advanced research to manufacturing
• HPCG No.1 (2017)
• Graph500 No.1 (2017)
• Gordon Bell Prize Finalist (2016)
RIKEN and Fujitsu are developing the Post-K to achieve superior application performance
[Figure labels: K computer (image © RIKEN), Post-K computer, PRIMEHPC FX10, PRIMEHPC FX100]
Japan’s Post-K Computer Development Project
Overview
• RIKEN and Fujitsu are developing the Post-K computer, which aims to be the most advanced general-purpose supercomputer in the world
Goals and Approaches
Project Goals
• Application performance
• Power efficiency
• User convenience
Approaches
• Fujitsu-original CPU and interconnect
• Superior compiler optimization
• Effective use of hardware resources through a co-design approach
• Building the Arm HPC ecosystem
• Excellent application portability
Post-K Fujitsu Original CPU and Interconnect
The CPU was designed to support the Arm SVE instruction set architecture (including FP16)
The CPU and the Tofu interconnect maintain the existing programming models and provide high application performance
Functions & Architecture                    | Post-K        | K computer
Processor: Base ISA + SIMD extensions       | Armv8-A + SVE | SPARC-V9 + HPC-ACE
Processor: SIMD width [bits]                | 512           | 128
Processor: FMA (floating-point multiply and add) | ✔        | ✔
Processor: Inter-core barrier               | ✔             | ✔
Processor: Sector cache                     | ✔ Enhanced    | ✔
Processor: Hardware “prefetch” assist       | ✔ Enhanced    | ✔
Interconnect: Tofu                          | ✔ Enhanced    | ✔
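To make the SIMD width row concrete, the sketch below computes how many elements fit in one register. The function name simd_lanes is hypothetical and only illustrates the arithmetic; the widths 512 and 128 come from the table, and the element sizes (64-bit double, 16-bit FP16) are standard.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements ("lanes") that fit in one SIMD register.
   simd_bits: register width in bits (512 for Post-K, 128 for the K computer).
   elem_bits: element width in bits (64 = double, 32 = float, 16 = FP16).
   simd_lanes is a hypothetical helper, not part of any Fujitsu or Arm API. */
size_t simd_lanes(size_t simd_bits, size_t elem_bits)
{
    assert(elem_bits > 0 && simd_bits % elem_bits == 0);
    return simd_bits / elem_bits;
}
```

Under these assumptions, simd_lanes(512, 64) gives 8 doubles per register against simd_lanes(128, 64) = 2, so one SIMD instruction can cover four times as many elements.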
Our Approach to Post-K High Performance
The compiler cooperates with hardware to improve performance.
• Designed to satisfy both performance/power and usability
Improve memory bandwidth (memory bandwidth-intensive applications)
• Compiler: Software Prefetch, Loop-Blocking
• Hardware: Stacked Memory
Improve computational efficiency (calculation-intensive applications)
• Compiler: Software Pipelining with Loop Fission, Auto-Vectorization with SVE
• Hardware: Out-of-Order, SVE
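As an illustration of the loop-blocking item, here is a minimal hand-written sketch of the transformation on a matrix transpose. It is not the Fujitsu compiler's output; the function names and the tile size BLOCK are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile size; in practice it is tuned to the cache */

/* Naive transpose: dst is written with stride n, which is unfriendly to the cache. */
void transpose_naive(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked transpose: the same work, visited tile by tile so that the rows of
   src and dst touched by one tile can stay in cache while the tile is processed. */
void transpose_blocked(double *dst, const double *src, size_t n)
{
    assert(dst != src);  /* out-of-place only */
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t i_end = ii + BLOCK < n ? ii + BLOCK : n;
            size_t j_end = jj + BLOCK < n ? jj + BLOCK : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
}
```

Both functions produce the same result; only the order of memory accesses changes, which is what lets blocking reduce memory traffic.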
“Smart” Auto-Vectorization for Arm SVE: Next-Generation SIMD ISA
# of kernels vectorized on TSVC*
TSVC (total kernels) | K computer | Post-K Goal
Fortran (135)        | 89         | 111
C (151)              | 108        | 121
* [Fortran] D. Callahan, J. Dongarra, and D. Levine, “Vectorizing compilers: a test suite and results,” in Supercomputing '88, pp. 98-105.
* [C] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, “An Evaluation of Vectorizing Compilers,” PACT 2011, pp. 372-382.
// Loop s482 in TSVC kernels
// is vectorized by SVE
for (int i = 0; i < LEN; i++) {
    a[i] += b[i] * c[i];
    if (c[i] > b[i]) break;
}
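One way to see why an early-exit loop like s482 can be vectorized is to emulate per-lane predication in plain C. The sketch below is an assumption-laden illustration, not generated SVE code: VL is a hypothetical lane count, the function names are made up, and the arrays are assumed not to alias.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VL 8  /* hypothetical lane count: 512-bit vector / 64-bit double */

/* Scalar reference: the s482 loop as written on the slide. */
void s482_scalar(double *a, const double *b, const double *c, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        a[i] += b[i] * c[i];
        if (c[i] > b[i]) break;
    }
}

/* Chunked version that emulates per-lane predication in scalar C. */
void s482_predicated(double *a, const double *b, const double *c, size_t len)
{
    assert(a != b && a != c);  /* assumption: no aliasing between a and the inputs */
    for (size_t base = 0; base < len; base += VL) {
        bool active[VL];
        bool hit = false;
        /* Build the predicate: a lane is active if it is in range and no
           earlier lane has triggered the break. The triggering lane itself
           stays active, because the scalar loop updates a[i] before breaking. */
        for (size_t l = 0; l < VL; l++) {
            size_t i = base + l;
            active[l] = (i < len) && !hit;
            if (active[l] && c[i] > b[i]) hit = true;
        }
        /* Apply the update only on active lanes. */
        for (size_t l = 0; l < VL; l++)
            if (active[l]) a[base + l] += b[base + l] * c[base + l];
        if (hit) break;  /* stop after the chunk that contained the break */
    }
}
```

The predicate plays the role of a mask: lanes after the break are simply switched off, so the whole chunk can be processed with the same instruction sequence.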
Arm SVE Features (Fujitsu contributed to specifications)
• Per-lane predication
• Gather-load and scatter-store
• HPC-focused instructions, e.g. reciprocal instructions, math acceleration instructions, etc.
Advantages of Post-K w/ Fujitsu Compiler
• High vectorization rate: smart auto-vectorization of the following loops:
  - containing “if” statements
  - containing list access
  - partial vectorization
• Efficient utilization of vectors, e.g. gather/scatter and packed SIMD
• Highly optimized executables, e.g. utilizing deep knowledge of the ISA and the Post-K microarchitecture
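The “list access” case refers to loops that read through an index array. A minimal sketch of how such a loop maps onto a gather-load follows; it is plain C that mimics the shape of the vector code, with a hypothetical lane count VL and made-up function names.

```c
#include <assert.h>
#include <stddef.h>

#define VL 8  /* hypothetical lane count */

/* List-access loop in scalar form: y[i] += x[idx[i]]. */
void list_add_scalar(double *y, const double *x, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += x[idx[i]];
}

/* Same loop in vector-shaped chunks: gather up to VL elements, then update. */
void list_add_gather(double *y, const double *x, const size_t *idx, size_t n)
{
    assert(y != x);  /* assumption: input and output do not alias */
    for (size_t base = 0; base < n; base += VL) {
        /* The last chunk may be short; with predication the unused lanes are masked off. */
        size_t lanes = n - base < VL ? n - base : VL;
        double gathered[VL];
        for (size_t l = 0; l < lanes; l++)  /* gather-load: one indexed read per lane */
            gathered[l] = x[idx[base + l]];
        for (size_t l = 0; l < lanes; l++)  /* contiguous update of y */
            y[base + l] += gathered[l];
    }
}
```

Without a gather instruction, the indexed reads would force the loop back to scalar code; with one, the whole body stays in vector form.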
“Smart” Loop Fission: Increasing the Opportunity for “Software Pipelining” (SWP)
Enables large loop SWP by reducing required registers, improving performance
Algorithm pre-trained by machine learning
• Trained by millions of patterns, machine learning determines the best weight of where to fission
• Machine learning evaluates # of registers, memory access, cache hit ratio, etc.
[Figure: Fujitsu’s Software Pipelining with Smart Loop Fission. A loop with a large basic block (instructions 1 to 10) has insufficient registers for SWP. Smart Loop Fission by a machine-learned algorithm splits it into two loops with fissioned basic blocks, each with sufficient registers for SWP. SWP is then applied, giving SWPed basic blocks.]
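The fission step itself can be sketched by hand on a small loop. In this example the split point is chosen manually, whereas the slide describes a machine-learned algorithm choosing it; the function names are hypothetical and the arrays are assumed not to alias.

```c
#include <assert.h>
#include <stddef.h>

/* Fused loop: both statements share one body, so the temporaries of both
   are live at the same time and compete for registers. */
void fused(double *p, double *q, const double *a, const double *b,
           const double *c, const double *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        p[i] = a[i] * b[i] + a[i];
        q[i] = c[i] * d[i] - d[i];
    }
}

/* Fissioned loops: each loop has a smaller body and needs fewer registers
   at once. The split is legal here because the two statements are independent. */
void fissioned(double *p, double *q, const double *a, const double *b,
               const double *c, const double *d, size_t n)
{
    assert(p != q);  /* assumption: outputs do not alias each other or the inputs */
    for (size_t i = 0; i < n; i++)
        p[i] = a[i] * b[i] + a[i];
    for (size_t i = 0; i < n; i++)
        q[i] = c[i] * d[i] - d[i];
}
```

The trade-off is an extra pass over the iteration space in exchange for lower register pressure per loop, which is what makes software pipelining of each smaller loop feasible.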
Effectiveness of Software Pipelining (SWP) with “Smart” Loop Fission
NICAM-DC-MINI
• A benchmark from one of the world's most famous climate simulations (search “NICAM-DC-MINI”)
(Source: http://www.riken.jp/pr/topics/2013/20130920_1/)
SWP with Smart Loop Fission improves performance
[Figure: NICAM-DC-MINI single-core breakdown of execution time on FX100 (w/ 32 SIMD registers). Bars: SWP without Loop Fission, SWP with Loop Fission. Y axis: normalized execution time. Breakdown categories: 4 inst. committed, 2-3 inst. committed, 1 inst. committed, Wait (calculation), Wait (data access). Annotations: 31% speedup; Wait (calculation) reduced.]
Building the Arm HPC Ecosystem
Fujitsu collaborates closely with partners and communities to contribute to the prosperity of the Arm HPC ecosystem, making Arm systems easy to use
Arm HPC Ecosystem
• Hardware: Open & Efficient HPC Spec.
• Middleware: Porting & Optimizing HPC Software Stacks
• Application: Porting & Tuning HPC Applications
Fujitsu’s Contributions for the Arm HPC Ecosystem
Fujitsu is collaborating with Arm, Linaro HPC-SIG, and OpenHPC on many projects
Layers: Application, Middleware, Hardware
Activities (2017 to 2020)
• Porting & tuning HPC applications
• Developing an SVE simulator
• Contributed to SVE specifications
• Extending the Linux OS to support SVE and a flexible RAS framework
• Augmenting OSS compilers for HPC (e.g. Clang)
• Optimizing scientific libraries
• Porting scientific libraries (PLASMA/SCOTCH/SLEPc ported)
Collaborations
• With Arm
• With Linaro HPC-SIG/OpenHPC
• With various users, developers, and vendors
Appendix: What is Software Pipelining?
• Instruction scheduling which overlaps the instructions of one loop iteration with those of the following iterations
• Executes as many instructions in parallel as possible, more efficiently
[Figure: Concept Image of Software Pipelining. In the original loop, instructions ① to ⑥ execute in order, each followed by a wait (*all instructions need 2 cycles to finish). The schedule splits the loop body, shifts the parts, and overlays them, giving a software pipelined loop made of a prologue, a kernel, and an epilogue.]
[Figure: Effect of SWP in K computer, 1 kernel of NICAM-DC (Dynamic Core of Non-hydrostatic Icosahedral Atmospheric Model). Y axis: execution cycle count ratio (0 to 1). Bars: noSWP, withSWP. Annotation: 2.2x.]
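The prologue, kernel, and epilogue structure can be sketched by hand in C. This is a two-stage illustration written for this transcript, not compiler output: the load for iteration i+1 is issued while iteration i computes and stores. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: each iteration loads x[i], then computes and stores y[i]. */
void axpb_plain(double *y, const double *x, size_t n, double a, double b)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + b;
}

/* Hand-pipelined loop: the load of the next iteration overlaps the
   compute and store of the current one. */
void axpb_pipelined(double *y, const double *x, size_t n, double a, double b)
{
    assert(y != x);  /* assumption: input and output do not alias */
    if (n == 0) return;
    double cur = x[0];                    /* prologue: load for iteration 0 */
    for (size_t i = 0; i + 1 < n; i++) {  /* kernel: steady state */
        double next = x[i + 1];           /* stage 1 of iteration i + 1 */
        y[i] = a * cur + b;               /* stage 2 of iteration i */
        cur = next;
    }
    y[n - 1] = a * cur + b;               /* epilogue: finish the last iteration */
}
```

Each kernel iteration holds values from two original iterations at once, which is why software pipelining raises register demand and why the slides pair it with loop fission.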