hpci system development team software development team · under the measurement rule of the v3.0,...
TRANSCRIPT
HPCI System Development Team
Software Development Team
[Figure: weak-scaling plot, GFLOPS (log) versus processes (log), 8 to 32768 processes]
[Figure: bar chart of time [s], Original versus Tuned; speed-up ×19]
Result for SC15
Tune: Coloring for SYMGS
Evaluating the original HPCG code on the K computer showed the following. Scalability was linear, so tuning of the parallelization was not necessary. Single-CPU performance was low because SYMGS is not multi-threaded. We therefore targeted single-CPU performance.
Weak-scaling performance
Cost components in CG
Using V2.4, we obtained an HPCG score of 0.461 PFLOPS on 82,944 processors. This score ranked 2nd on the list at SC15.
Significant Improvement Obtained
We also tried the following additional optimizations: memory serialization for the matrix; improved data access ordering for SYMGS; loop optimization for SPMV and SYMGS; parameter adjustment; and refinement of miscellaneous routines. We obtained a 19-fold improvement.
In the original code, SYMGS cannot be multi-threaded because there are data dependencies between rows. To avoid these dependencies, we employed colored blocking, which divides the mesh into blocks and assigns a color to each block.
Nodes in blocks of the same color are reordered to be contiguous in memory. There are no data dependencies between blocks of the same color, so those blocks can be processed in parallel by multiple threads.
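The colored blocking described above can be illustrated with a minimal sketch. The poster does not state the coloring scheme, so the 8-color parity rule used here is an assumption, as are the function names and the cubic mesh of edge length n with block edge length bs. The dependency check uses the 27-point stencil of the HPCG benchmark.

```python
from itertools import product


def block_color(ix, iy, iz, bs):
    """Color of the block containing node (ix, iy, iz).

    Assumption: the parity of the block index along each axis
    selects one of 2 * 2 * 2 = 8 colors.
    """
    bx, by, bz = ix // bs, iy // bs, iz // bs
    return (bx % 2) + 2 * (by % 2) + 4 * (bz % 2)


def reorder_by_color(n, bs):
    """Return all nodes of an n*n*n mesh, grouped by color and then by
    block, so that nodes of same-colored blocks are contiguous."""
    def key(node):
        ix, iy, iz = node
        block = (ix // bs, iy // bs, iz // bs)
        return (block_color(ix, iy, iz, bs), block, node)

    return sorted(product(range(n), repeat=3), key=key)


def same_color_blocks_are_independent(n, bs):
    """Check that no pair of 27-point-stencil neighbors lies in two
    different blocks that share a color."""
    for ix, iy, iz in product(range(n), repeat=3):
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            jx, jy, jz = ix + dx, iy + dy, iz + dz
            if not (0 <= jx < n and 0 <= jy < n and 0 <= jz < n):
                continue
            block_i = (ix // bs, iy // bs, iz // bs)
            block_j = (jx // bs, jy // bs, jz // bs)
            if block_i != block_j and (
                block_color(ix, iy, iz, bs) == block_color(jx, jy, jz, bs)
            ):
                return False
    return True
```

Under this assumed scheme, two neighboring nodes in different blocks always differ in block-index parity along at least one axis, so blocks of one color never depend on each other and could be handed to separate threads, with the colors processed one after another.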
[Figure: the input mesh for a process (nodes) is divided into blocks R.1 to R.5, which are then colored. Each block has almost the same number of nodes.]
Rank  Computer    HPL [PFLOPS]  HPCG [PFLOPS]  Ratio to HPL [%]
1     Tianhe-2    33.86         0.580          1.7
2     K computer  10.51         0.461          4.4
3     Titan       17.59         0.322          1.8
4     Trinity      8.10         0.183          2.3
5     Mira         8.59         0.167          1.9
http://www.hpcg-benchmark.org/custom/index.html?lid=155&slid=282
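The "Ratio to HPL" column in the table above is the HPCG score divided by the HPL score. A short sketch of that arithmetic, using only the values from the table (the variable and function names are illustrative):

```python
# (computer, HPL PFLOPS, HPCG PFLOPS) as listed in the table above
results = [
    ("Tianhe-2", 33.86, 0.580),
    ("K computer", 10.51, 0.461),
    ("Titan", 17.59, 0.322),
    ("Trinity", 8.10, 0.183),
    ("Mira", 8.59, 0.167),
]


def hpcg_to_hpl_percent(hpl, hpcg):
    """HPCG score as a percentage of the HPL score."""
    return 100.0 * hpcg / hpl


# Round to one decimal place, as in the table.
ratios = {
    name: round(hpcg_to_hpl_percent(hpl, hpcg), 1)
    for name, hpl, hpcg in results
}
```

The K computer has the highest ratio of the five systems, although its absolute HPCG score is second.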
Evaluation of the Original HPCG on the K computer
ComputeSYMGS_ref: 93.09% (not multi-threaded)
ComputeSPMV_ref: 5.23%
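The cost breakdown indicates why multi-threading SYMGS was the priority. As a rough, idealized estimate that is not a result from the poster, Amdahl's law gives the speed-up obtained when only the SYMGS share of the run time is parallelized. The sketch assumes perfect scaling over 8 threads (the K computer's SPARC64 VIIIfx CPU has 8 cores) and leaves all other components unchanged; the function name is illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speed-up when only `parallel_fraction` of the run time
    is spread perfectly over `workers` threads (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# SYMGS accounts for 93.09% of the original run time.
# Assumption: ideal scaling over 8 threads, nothing else changes.
estimate = amdahl_speedup(0.9309, 8)  # about 5.4
```

Threading SYMGS alone would therefore give roughly a 5-fold gain under these assumptions, well short of the reported 19-fold improvement, which also includes the other optimizations listed on the poster.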
Under the V3.0 measurement rule, the performance of the computational kernels is the same as with V2.4, but the HPCG score will decrease.
[Figure: blocks R.1, R.2, R.3, R.4, ..., R.8, ... assigned to CPU cores.]
Comparing CG run time for 5 CG calls
HPCI : High Performance Computing Infrastructure
Nagoya University
University of Tsukuba
University of Tokyo
HPCI WEST HUB
Tokyo Institute of Technology
Kyoto University
Osaka University
Riken AICS
HPCI EAST HUB
Hokkaido University
Tohoku University
Kyushu University
The Institute of Statistical Mathematics
Collaboration between the system and the application to improve the system and its usability
HPCG Performance Tuning on the K computer
One-stop service for nationwide supercomputer resources
Support from project proposal to user report. The K computer and other major supercomputers in Japan, including large-scale MPP, GPU cluster, PC cluster, and vector machines, are available.
The HPCI helpdesk is available to HPCI users. It provides consultation before a project proposal is submitted, technical support for each computer resource, and guidance on the user report after the project finishes.
HPCI Shared Storage
Provides a 22 PByte single-view file space for all HPCI users.
Available from all HPCI supercomputer resources.
HPCI single sign-on authentication: if a user wants to transfer computational results to the HPCI shared storage after a computational session, the user does not have to log in again to the HPCI shared storage.
Automatic file replication: HPCI shared storage keeps 2 file replicas by default.
Secure network communication: HPCI users can specify a data encryption method such as “gsi” in ~/.gfarm2fs.
Single Sign-On Authentication
Allows efficient use of nationwide supercomputer resources. HPCI users can use all their granted HPC resources seamlessly after signing on to any one of the resources.
[Figure: two Power versus Time diagrams, annotated ×n, 1/n, and n/n=1.]
Analysis of correlation between power consumption and application performance
Standby power and memory throughput have a greater impact on the power consumption of the K computer.
A trial of improving power consumption and system operation
An evaluation model of the power consumption of each performance-tuned application.
We estimated how much the performance of each major user's applications differs from the field average.
If both CPU performance and memory performance were improved to the field average, NODE×TIME could be reduced as shown below.
If we improve the performance of the 10 (50) major user applications, 25M (66M) more NODE×TIME becomes available.
In terms of power consumption, improving the performance of the 10 (50) major user applications would save 2.7 GWh (7.0 GWh) of energy.
[Figure: Node × Time and Power Saving Ratio. Ratio [%] (0.0% to 20.0%) versus the number of applications (0 to 100); series: Node × Time, Power saving.]
δPower [MW/sys] = 8.8128
                + 1.3659 × Floating Calc. [%]
                + 4.3906 × Memory Throughput [%]
                + 0.0857 × L2 Throughput [%]
                + 2.3299 × L1D Throughput [%]
                + 0.2429 × Integer Calc. [%]
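The power model above can be evaluated with a small sketch. Two readings of the extracted formula are assumptions here, not statements from the poster: 8.8128 is treated as an additive constant term, and each [%] input is treated as a utilization ratio between 0 and 1 (a 0 to 100 scale would give implausibly large values). The dictionary keys and function name are illustrative.

```python
# Coefficients in MW per system, as read from the poster's formula.
COEFFS_MW = {
    "floating_calc": 1.3659,
    "memory_throughput": 4.3906,
    "l2_throughput": 0.0857,
    "l1d_throughput": 2.3299,
    "integer_calc": 0.2429,
}

# Assumption: 8.8128 is an additive constant term of the model.
CONSTANT_MW = 8.8128


def estimate_power_mw(utilization):
    """Estimate system power in MW.

    `utilization` maps keys of COEFFS_MW to ratios in [0, 1]
    (assumed scale); missing keys count as 0.
    """
    return CONSTANT_MW + sum(
        coeff * utilization.get(name, 0.0)
        for name, coeff in COEFFS_MW.items()
    )


# Hypothetical job using half of the peak memory throughput.
example = estimate_power_mw({"memory_throughput": 0.5})
```

With these readings, the constant term and the memory-throughput coefficient are the two largest contributions, which matches the poster's remark about standby power and memory throughput.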
Simple kernel loops
Improvement of Operation Efficiency and Effort for Power Consumption Reduction on the K computer
[Figure: received power [MW] (0.0 to 16.0) versus Day/Time; series: Received power, CGS1, CGS2; annotations: HPCG start, Limit of Power Consumption.]
[Figure: Power [MW/sys] (0.0 to 6.0) for kernels 1 to 37; series: Floating Calc., Memory Throughput, L2 Throughput, L1D Throughput, Integer Calc., Power Measurement; kernel groups: B/F Change, Throughput Change, STREAM, Floating Calc., Integer Calc., DGEMM.]
Recently, jobs with large power consumption have increased, and total power consumption has exceeded the limit set by the contract with the provider. If total power consumption exceeds the power limit, we have to pay an additional charge as a penalty.
JAMSTEC