PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers
K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi
Kyushu University, ISIT, IST
Background
• “Peta” is tremendous!
  – Compared with “Giga” or “Tera” scale machines
How are you, Mr. Tera?
I am fine! How about you, Mr. Peta?
• To develop a “Peta-scale” supercomputer, we need to…
  – Explore the design space of both the computation nodes and the interconnection network!
  – Verify the effective performance that can be achieved!
• So, we need a performance evaluation environment for peta-scale supercomputers!
Our Goal!
• Problem…
  – Simulations are three orders of magnitude slower than real machines!
  – “Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e. available machines)!
  – How can we bridge the gap?
• Develop an efficient performance evaluation environment: PSI-SIM
  – Separate compute-node simulation from network simulation!
  – Abstract the target application program to accelerate simulation!
Performance-Evaluation Flow of PSI-SIM
[Figure: flow diagram. Components: Parallelized Application (e.g. Peta-scale), BSIM-Parser, Skeleton Code, BSIM-Logger, Comm. Profile (w/o Latency), NSIM, Comm. Profile (w/ Latency), ANA. Target-machine inputs: DB for Processors, Performance Info., Interconnect Arch., Interconnect Configuration. Outputs: Visualization, Hints for Optimization.]
• Step 1: Generate a skeleton code
• Step 2: Execute on an existing machine
• Step 3: Simulate the interconnection network
• Step 4: Visualize and analyze the results
What is the Skeleton Code?

Original code:

    foo() {
        Inst. Block A
        for (i = 0; i < n; i++) {
            Inst. Block B
            if (hoge) {
                Inst. Block C
            } else {
                Inst. Block D
            }
            Inst. Block E
        }
        MPI_Comm.
        Inst. Block F
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                Func();
    }

Skeleton code:

    foo() {
        BSIM_ADD_TIME(10ms)
        MPI_Comm.
        BSIM_ADD_TIME(1ms)
        BSIM_ADD_TIME(15s)
    }

• Computation blocks are replaced by “estimated” execution times!
• Other modifications (e.g. reducing the required memory size)
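
The idea of replacing computation blocks with estimated times can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `VirtualClock` class and its `add_time` and `comm` methods are hypothetical stand-ins for `BSIM_ADD_TIME` and the retained MPI call, and the mapping of each estimate to a particular instruction block is a guess from the slide layout, not something the slides state.

```python
class VirtualClock:
    """Hypothetical per-process virtual timer, in seconds."""

    def __init__(self):
        self.now = 0.0
        self.events = []  # (virtual time, label) for each communication call

    def add_time(self, seconds):
        # Stand-in for BSIM_ADD_TIME: advance the timer, execute nothing.
        self.now += seconds

    def comm(self, label):
        # Stand-in for the MPI call kept in the skeleton:
        # record the virtual time at which it is issued.
        self.events.append((self.now, label))


def foo_skeleton(clock):
    clock.add_time(0.010)  # assumed: Inst. Blocks A-E (10 ms on the slide)
    clock.comm("MPI_Comm")
    clock.add_time(0.001)  # assumed: Inst. Block F (1 ms on the slide)
    clock.add_time(15.0)   # assumed: nested Func() loops (15 s on the slide)


clock = VirtualClock()
foo_skeleton(clock)
print(clock.events)  # one communication event, issued at 10 ms of virtual time
print(clock.now)     # total virtual time, about 15.011 s
```

The skeleton runs in microseconds of real time while the virtual timer reports roughly 15 s, which is the source of the speedup the slide describes.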
Generating Communication Profile
• BSIM-Logger
  – Executes the skeleton code on an existing machine
  – Emulates the behavior of the target machine
  – Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why fast?
  – Abstracted computation blocks are NOT executed (only virtual timers are updated)
  – Real communications are masked, but accurate logs are still generated
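
A zero-latency communication profile can be illustrated with a small sequential replay. This is a sketch under assumptions, not BSIM-Logger itself: the operation format (`compute`, `send`, `recv`), the function name `run_zero_latency`, and the eager (non-blocking) send semantics are all hypothetical. Each rank keeps a virtual clock, and a receive completes at the later of the receiver's clock and the sender's send time, since the ideal network adds no delay.

```python
def run_zero_latency(programs):
    """Replay per-rank operation lists and log communication events.

    programs[r] is a list of ops for rank r (hypothetical format):
      ("compute", seconds), ("send", dest, nbytes), ("recv", src)
    Returns (profile, clocks), where profile holds tuples
    (virtual_time, rank, op, peer, nbytes).
    """
    n = len(programs)
    clock = [0.0] * n
    pc = [0] * n        # next op index per rank
    in_flight = {}      # (src, dst) -> FIFO list of (send_time, nbytes)
    profile = []
    progressed = True
    while progressed:
        progressed = False
        for rank in range(n):
            while pc[rank] < len(programs[rank]):
                op = programs[rank][pc[rank]]
                if op[0] == "compute":
                    clock[rank] += op[1]  # only the virtual timer moves
                elif op[0] == "send":
                    _, dest, nbytes = op
                    in_flight.setdefault((rank, dest), []).append(
                        (clock[rank], nbytes))
                    profile.append((clock[rank], rank, "send", dest, nbytes))
                else:  # "recv"
                    src = op[1]
                    queue = in_flight.get((src, rank), [])
                    if not queue:
                        break  # matching send not issued yet; try other ranks
                    sent_at, nbytes = queue.pop(0)
                    # Zero latency: data is available the instant it is sent.
                    clock[rank] = max(clock[rank], sent_at)
                    profile.append((clock[rank], rank, "recv", src, nbytes))
                pc[rank] += 1
                progressed = True
    return sorted(profile), clock


programs = [
    [("compute", 2.0), ("send", 1, 1024), ("recv", 1)],
    [("compute", 0.5), ("recv", 0), ("compute", 1.0), ("send", 0, 8)],
]
profile, clocks = run_zero_latency(programs)
for event in profile:
    print(event)
```

In this example rank 1 finishes its own computation at 0.5 s but waits until 2.0 s for rank 0's message, so both ranks end at 3.0 s of virtual time.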
How Fast? How Accurate?
[Figure: ERI (Electron Repulsion Integral). Two plots, “Time for logging (s)” and “Exe. Time Predicted (s)”, each comparing Original and Skeleton.]
[Figure: NAS PARALLEL FT. Two plots, “Time for logging (s)” and “Exe. Time Predicted (s)”, each comparing Original and Skeleton.]
Fast, Flexible Interconnection Network Simulator
• NSIM
  – Takes the communication profile and a network configuration file as input
  – Generates a communication profile with estimated interconnect latency
• Why fast? Why flexible?
  – Parallelized implementation
  – Supports a number of parameters
    • Topology, spec. of routers/switches, buffer size, and so on
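
The step from a zero-latency profile to a profile with latency can be sketched as follows. The slides do not describe NSIM's network model, so everything here is an assumption for illustration: the `NetworkConfig` fields, the ring topology, and the simple cost model (per-hop delay plus size divided by bandwidth, with no contention or buffering).

```python
from dataclasses import dataclass


@dataclass
class NetworkConfig:
    """Hypothetical network configuration (not NSIM's file format)."""
    num_nodes: int
    hop_latency_s: float          # delay per router/link hop, in seconds
    bandwidth_bytes_per_s: float  # link bandwidth


def ring_hops(src, dst, num_nodes):
    # Shortest path length on a bidirectional ring.
    d = abs(src - dst)
    return min(d, num_nodes - d)


def add_latency(messages, cfg):
    """messages: (send_time, src, dst, nbytes) from a zero-latency profile.

    Returns the same tuples with an estimated arrival time appended.
    """
    out = []
    for send_time, src, dst, nbytes in messages:
        latency = (ring_hops(src, dst, cfg.num_nodes) * cfg.hop_latency_s
                   + nbytes / cfg.bandwidth_bytes_per_s)
        out.append((send_time, src, dst, nbytes, send_time + latency))
    return out


cfg = NetworkConfig(num_nodes=8, hop_latency_s=1e-6, bandwidth_bytes_per_s=1e9)
messages = [(2.0, 0, 1, 1000), (3.0, 7, 1, 0)]
with_latency = add_latency(messages, cfg)
for row in with_latency:
    print(row)
```

A real simulator would also propagate each delay to the events that depend on it and model contention in routers and buffers; this sketch only annotates individual messages.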
Performance of BSIM + NSIM
• Performance prediction for HPL execution on a 16-node PC cluster
• <120 s (problem size = 5,000) @ 8 CPUs
• About 9,000 MPI-Comm./s @ 8 CPUs
[Figure: Execution Time (s), Measured vs. Predicted. Error = 5.3%. Not skeleton execution.]
ANA GroupWork Viewer
• Group Work: indicates load balance
• Performance Indicator: execution time after load-balance optimization
• Communication Indicator: amount of communication per second
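
As a rough illustration of what such indicators might measure, the sketch below derives three numbers from per-process busy times and a message count. The slides do not define how ANA computes its indicators, so the function name `indicators` and all three formulas are assumptions: load imbalance as the ratio of the busiest process to the mean, the time after load-balance optimization as the mean busy time (perfect balance), and communication rate as messages per elapsed second.

```python
def indicators(busy_seconds, num_messages, elapsed_seconds):
    """Hypothetical indicators; not ANA's actual definitions."""
    mean_busy = sum(busy_seconds) / len(busy_seconds)
    return {
        # 1.0 means perfectly balanced; larger means more imbalance.
        "imbalance": max(busy_seconds) / mean_busy,
        # Assumed lower bound if work were spread evenly over processes.
        "balanced_time_s": mean_busy,
        # Communication volume expressed as messages per second.
        "comm_per_s": num_messages / elapsed_seconds,
    }


result = indicators([4.0, 2.0, 3.0, 3.0], num_messages=600, elapsed_seconds=4.0)
print(result)
```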
Conclusions
• PSI-SIM
  – Performance evaluation environment for supercomputers
  – BSIM + NSIM + ANA
• Ongoing work: performance prediction for
  – a “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
  – a “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine
Backup Slides
Peta-scale Performance Prediction
• Assumptions
  – HPL problem size: 3 million
  – # of nodes: 4K (PSI-SIMD)
  – BSIM: uses 32 CPUs (3 GHz Xeon)
  – NSIM: 10,000 MPI-Comm./s @ 8 CPUs
• How long will it take?
  – BSIM: about 300 h (<2 weeks)
  – NSIM: about ??
• Under estimation…
Predicted execution time (FT)
Error: -11.6%
Error: -11.3%
Target machine?: rscc
Used machine?: rscc
Communication profile time (FT)
86% reduction
19% reduction
Target machine?: rscc
Used machine?: rscc
Predicted execution time (ERI)
Error: -0.2%
Error: 1.5%
Error: -0.6%
Target machine?: rscc
Used machine?: rscc
Communication profile generation time (ERI)
91% reduction
96% reduction
97% reduction
Target machine?: rscc
Used machine?: rscc
Execution-time prediction performance
Communication latency
Larger evaluated application ⇒ better prediction accuracy
Prediction accuracy: 94.7%
Simulation time (problem size fixed at 2000)
More processes in the evaluated application ⇒ better parallel processing efficiency
Recent results (speedup)
(unit: minutes)
16 processes
256 processes
1,024 processes
Performance of NSIM
Accuracy: 94.7%
7.92, 8.36, 8.04
114 s
Target machine?: PSI-hexa
Used machine?: PSI-hexa