project 38 success requires modsim · project 38 at a glance project 38 is a set of vendor-agnostic...
TRANSCRIPT
Project 38 success requires ModSim
David J. Mountain
Advanced Computing Systems Research Program (ACS)
Laboratory for Physical Sciences, Research Park
Outline
What the heck is Project 38?
How is Project 38 using ModSim?
Why is ModSim valuable?
Project 38 is a team effort
BackgroundSeptember 2016 meeting(s) with HPC experts, held to discuss/respond to the June
2016 announcement of China’s TaihuLight supercomputer, reached the following
consensus (a report [1] is available):
• The HPC technology ecosystem is changing in ways that are less favorable to HPC
• There are national security implications to this change
• Leadership in innovative architectures is critical
• Joint architectural explorations might be useful and interesting
September 2017 technical deep dive plus follow on meetings, VTC, telecons
Improved understanding of key applications
Specialized architecture development process and examples
Possible architectures and applications of interest
Value proposition for collaborating
Refinement of exploration space and exploration process
Joint explorations/collaborations on purpose built architectures have promise
It requires a new approach to exploring and developing HPC systems
Project 38 is an attempt to define this new approach
[1] https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_TechMeetingReport.pdf
Project 38 at a glance
Project 38 is a set of vendor-agnostic architectural explorations involving NSA, the
DOE Office of Science, and NNSA (these latter 2 organizations are referred to below
as “DOE”). These explorations are expected to accomplish the following:
Near-term goal: Quantify the performance value and identify the potential costs of
specific architectural concepts against a limited set of applications of interest to both
the DOE and NSA.
Long-term goal: Develop an enduring capability for DOE and NSA to jointly explore
architectural innovations and quantify their value.
High level guidelines
Improvements in data access are the initial focus for architectural ideas and
applications – primarily on the node
Cost-benefit analysis
Baseline comparison is to expected roadmaps/ECP+ (business as usual)
Primarily a performance comparison
“Cost” is adverse changes to programming models, SW stacks, etc.
Milestones2018 Quantify the benefits of the explorations
ModSim is the primary exploration path
2019 Complete the existing explorations -- define the primary SW issues
Document the results
2020 Improve cost-benefit analyses
Push best ideas towards implementation
Runtime
Frameworks
Application
AlgorithmData
Structures
Libraries
OS
Compiler / Debugger
Tools
Application Kernels1D FFT
HPC Challenge benchmark
Kripke
Kunen et al, “Kripke – A massively parallel transport mini-app”, American
Nuclear Society M&C, April 2015.
HPGMG
Adams et al, “HPGMG 1.0: A benchmark for ranking high performance
computing systems,” Tech report, hpgmg.org, 2014.
Tensor Contraction Engine
Baumgartner et al, “Synthesis of High-Performance Parallel Programs for a
Class of Ab Initio Quantum Chemistry Models,” Proceedings of the IEEE, Feb 2005.
PIC codes, represented by PICSAR
picsar.net
HipMer
Georganas et al, “MerBench: PGAS benchmarks for High Performance Genome
Assembly,” PAW17, November 2017.
Sparse Matrix Trisolve and other common sparse matrix operations
Architectural ideasWord Granularity Scratchpad Memory
DRE and DRE (?)
Message Queues
Scatter/Gather
Remote Atomics
Fixed Function Accelerators
Word Granularity Scratchpad Memory:Gather-scatter within processor tile
more effective SIMD
Data Recoding EngineSub-word granularity
Handles branch-heavy code (avg. 20x improvement over using processor core)
Hardware Message Queues (with atomic queue/dequeue)Gather-scatter between processor tiles
Async between tiles to eliminate overhead of barriers
SPM
Bank
SPM
Bank
SPM
Bank
SPM
Bank
SPM
Bank
SPM
Bank
SPM
Bank
SPM
Bank
Register File
CrossbarXBarLightweight
In-Order Scalar Core
L1I$L1D$
Arbiter
MQI
StreamPrefetch
Unit
ActivationQueue
Addr12
64
Data32
Local Memory
StreamBuffer
Ve
ctor R
eg
isters
2048
DataRegistersStateReg
8 12
Dispatch Unit Action Unit
Adder
ALU
MUX
ARB
memoryslice
grid ’processors’
particle ’processors’
buffers
get {index,delta}
put {index,delta}
Particles(streamed from memory)
ARB
memoryslice
ARB
memoryslice
ARB
memoryslice
ARB
memoryslice
ARB
memoryslice
ARB
memoryslice
ARB
memoryslice
particles/s < STREAM/64B
8 updates/particle (classic)32 updates/particle (Gyro)Throughput > 32*particles/s
could be private caches or cache banks
(n.b., grid >> SPM)
sized for memory latency,load balance
PIC Charge(mass) Depositionfor(i=0..#particles)
for(0..7 points) // x4 for Gyrogrid[ foo(pos[i],point) ] += goo(pos[i],point);
Data Rearrangement Engine(DRE)
Data Mover Control Processor
AXI Memory Interface
Local
Memory
Bus
Command Messages
(address, length…)
CP memory
To Peripheral
Interconnect
AXI Interconnect
Memory
Read and Write
Stream Switch FIFOHostAdapter
DMA operations
ModSim tools used
Higher level architectural exploration
Empirical GPP Roofline on KNL
More ModSim tools usedlower level design explorations
Performance Counters…• Full apps in distributed memory
• DRAM, Cache, FPU, IPC Performance Counters w/LIKWID, VTune, NVProf
• x86, KNL, NVIDIA GPU support
Analytic Modeling
Code Analysis: ExaSAT (code opt design space)
SRAM Model: Cacti and p-Cacti for sub-22nm
Microsoft Excel
Roofline Model (+roofline advisor)• x86, KNL, and GPU support
• Multilevel hierarchy, Multi-bottleneck analysis, stride-k access, divides, …
Simulation & Instrumentation…• Kernels (trade speed for detail)
• Cache Simulator: SDE (Intel Advisor, x86)
• Core: Chisel “Spike” simulator and Verilator
• NOC Model: OpenSoC + BookSim for Parameter sweeps
Emulation (Chisel HW Generators)
FPGA and synthesis
1113
Chisel RISC-V OpenSOC
AXIOpenSoC
FabricCPU(s)
HMC
AXI
AXI
CPU(s)
AXI CPU(s)
AX
I
CPU(s)
AXI
CPU(s)
AX
I
AXI
10
Gb
E
PC
Ie
Verilog
FPGA ASIC
Hardware
Compilation
Software
Compilation
SystemC
Simulation
C++
Simulation
Scala
Chisel
Open Source Extensible ISA/Cores Open Source fabricTo integrate accelerators
And logic into SOC
DSL for rapid prototypingof circuits, systems, and
arch simulator components
Platform for experimentation Parameterized hardware generation
Back-end to synthesizeHW with different devices
Or new logic families
Re-implement processorWith different devices orExtend w/accelerators
0
2000000
4000000
6000000
8000000
10000000
12000000
64:64:64 2:2:65536 20:20:656
Example:SearchingforOptimalCacheLatency(ns)/DifferentStencilOrganizationat2048K
L1latency Mainmemorylatency Coherencelatency
0%
50%
100%
Cache Latency (best coherence case)
computation latency prefetch warm up latency
coherence latency
90%
95%
100%
DMA Latency (10 ns initial latency, DMA
list)
computation latency DMA latency
Optimal Blocking
Stencil Study (effect of coherence and word granularity DMA
on basic stencil performance)
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
2048
20:20:656
512
12:12:456
128
8:8:256
32
4:6:171
8
4:8:8
2
4:8:8
0.5
4:4:4
0.125
2:2:4
scrachpad speedup per computation
4.5x
SpeedupEnables extreme
strong scaling
Scatter/Gather results
Simple IPC CalculationLots of caveats
Effect 1: Reduce $ MissesAssume covered L1 misses become hits
13-40% IPC improvement
Effect 2: Integer offloadAssume half of address-calc integer operations are offloaded, require 0.1 cycles (vs. 0.4)
17-44% IPC Improvement
Effect 3: Improved $ performanceIf covered cache accesses go to scratchpad, other accesses more efficient?
0%
10%
20%
30%
40%
50%
hpgmg Kripke XSBench miniMD
Estimated IPC Improvement
from S/GReduced $
Misses
Here’s a great idea!Scatter/Gather
Could have a substantial positive impact on performance
Possible to identify regions amenable to lots of in-memory
operations, high reuse, good compaction
Good progress on design (DRE work) – high feasibility
Word-Granularity Scratchpad
Can reduce post-$ accesses / make better use of cache
Not useful for all apps (Kripke, XSBench?)
Verdict: Combine them!
Focus on S/G
Target Arch: S/G to/from scratchpad
Option: w/ HW Synchronization
Option: w/ Recoding engine
Combine!
High level results
Three Architectures
Scatter-Gather: Good body of evidence for performance, programmability, good
foundation for design
Word-granularity Scratchpad: Some evidence for improved performance,
programmability
Atomics: Chicken-and-Egg, coherency issues, need better application drivers
Benefit Scatter/GatherWord-granularity
Scratchpad
Kripke Good >20% Not Word-Gran
hpgmg Good >15% Ongoing
XSBench Good >28% Mild?
Additional value of ModSim
Detailed knowledge transfer
OCCAM
Extension of knowledge
Classified/unclassified boundaries
Proprietary/open boundaries
Thanks!
Special thanks to Arun Rodrigues and John Shalf
Project 38
Happy Birthday Bob Mrosky!