project 38 success requires modsim · project 38 at a glance project 38 is a set of vendor-agnostic...

Project 38 success requires ModSim

David J. Mountain

Advanced Computing Systems Research Program (ACS)

Laboratory for Physical Sciences, Research Park

[email protected]

Outline

What the heck is Project 38?

How is Project 38 using ModSim?

Why is ModSim valuable?

Project 38 is a team effort

BackgroundSeptember 2016 meeting(s) with HPC experts, held to discuss/respond to the June

2016 announcement of China’s TaihuLight supercomputer, reached the following

consensus (a report [1] is available):

• The HPC technology ecosystem is changing in ways that are less favorable to HPC

• There are national security implications to this change

• Leadership in innovative architectures is critical

• Joint architectural explorations might be useful and interesting

September 2017 technical deep dive plus follow on meetings, VTC, telecons

Improved understanding of key applications

Specialized architecture development process and examples

Possible architectures and applications of interest

Value proposition for collaborating

Refinement of exploration space and exploration process

Joint explorations/collaborations on purpose built architectures have promise

It requires a new approach to exploring and developing HPC systems

Project 38 is an attempt to define this new approach

[1] https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_TechMeetingReport.pdf

Project 38 at a glance

Project 38 is a set of vendor-agnostic architectural explorations involving NSA, the

DOE Office of Science, and NNSA (these latter 2 organizations are referred to below

as “DOE”). These explorations are expected to accomplish the following:

Near-term goal: Quantify the performance value and identify the potential costs of

specific architectural concepts against a limited set of applications of interest to both

the DOE and NSA.

Long-term goal: Develop an enduring capability for DOE and NSA to jointly explore

architectural innovations and quantify their value.

High level guidelines

Improvements in data access are the initial focus for architectural ideas and

applications – primarily on the node

Cost-benefit analysis

Baseline comparison is to expected roadmaps/ECP+ (business as usual)

Primarily a performance comparison

“Cost” is adverse changes to programming models, SW stacks, etc.

Milestones2018 Quantify the benefits of the explorations

ModSim is the primary exploration path

2019 Complete the existing explorations -- define the primary SW issues

Document the results

2020 Improve cost-benefit analyses

Push best ideas towards implementation

Runtime

Frameworks

Application

AlgorithmData

Structures

Libraries

OS

Compiler / Debugger

Tools

Application Kernels1D FFT

HPC Challenge benchmark

Kripke

Kunen et al, “Kripke – A massively parallel transport mini-app”, American

Nuclear Society M&C, April 2015.

HPGMG

Adams et al, “HPGMG 1.0: A benchmark for ranking high performance

computing systems,” Tech report, hpgmg.org, 2014.

Tensor Contraction Engine

Baumgartner et al, “Synthesis of High-Performance Parallel Programs for a

Class of Ab Initio Quantum Chemistry Models,” Proceedings of the IEEE, Feb 2005.

PIC codes, represented by PICSAR

picsar.net

HipMer

Georganas et al, “MerBench: PGAS benchmarks for High Performance Genome

Assembly,” PAW17, November 2017.

Sparse Matrix Trisolve and other common sparse matrix operations

Architectural ideasWord Granularity Scratchpad Memory

DRE and DRE (?)

Message Queues

Scatter/Gather

Remote Atomics

Fixed Function Accelerators

Word Granularity Scratchpad Memory:Gather-scatter within processor tile

more effective SIMD

Data Recoding EngineSub-word granularity

Handles branch-heavy code (avg. 20x improvement over using processor core)

Hardware Message Queues (with atomic queue/dequeue)Gather-scatter between processor tiles

Async between tiles to eliminate overhead of barriers

SPM

Bank

SPM

Bank

SPM

Bank

SPM

Bank

SPM

Bank

SPM

Bank

SPM

Bank

SPM

Bank

Register File

CrossbarXBarLightweight

In-Order Scalar Core

L1I$L1D$

Arbiter

MQI

StreamPrefetch

Unit

ActivationQueue

Addr12

64

Data32

Local Memory

StreamBuffer

Ve

ctor R

eg

isters

2048

DataRegistersStateReg

8 12

Dispatch Unit Action Unit

Adder

ALU

MUX

ARB

memoryslice

grid ’processors’

particle ’processors’

buffers

get {index,delta}

put {index,delta}

Particles(streamed from memory)

ARB

memoryslice

ARB

memoryslice

ARB

memoryslice

ARB

memoryslice

ARB

memoryslice

ARB

memoryslice

ARB

memoryslice

particles/s < STREAM/64B

8 updates/particle (classic)32 updates/particle (Gyro)Throughput > 32*particles/s

could be private caches or cache banks

(n.b., grid >> SPM)

sized for memory latency,load balance

PIC Charge(mass) Depositionfor(i=0..#particles)

for(0..7 points) // x4 for Gyrogrid[ foo(pos[i],point) ] += goo(pos[i],point);

Data Rearrangement Engine(DRE)

Data Mover Control Processor

AXI Memory Interface

Local

Memory

Bus

Command Messages

(address, length…)

CP memory

To Peripheral

Interconnect

AXI Interconnect

Memory

Read and Write

Stream Switch FIFOHostAdapter

DMA operations

ModSim tools used

Higher level architectural exploration

Empirical GPP Roofline on KNL

More ModSim tools usedlower level design explorations

Performance Counters…• Full apps in distributed memory

• DRAM, Cache, FPU, IPC Performance Counters w/LIKWID, VTune, NVProf

• x86, KNL, NVIDIA GPU support

Analytic Modeling

Code Analysis: ExaSAT (code opt design space)

SRAM Model: Cacti and p-Cacti for sub-22nm

Microsoft Excel

Roofline Model (+roofline advisor)• x86, KNL, and GPU support

• Multilevel hierarchy, Multi-bottleneck analysis, stride-k access, divides, …

Simulation & Instrumentation…• Kernels (trade speed for detail)

• Cache Simulator: SDE (Intel Advisor, x86)

• Core: Chisel “Spike” simulator and Verilator

• NOC Model: OpenSoC + BookSim for Parameter sweeps

Emulation (Chisel HW Generators)

FPGA and synthesis

1113

Chisel RISC-V OpenSOC

AXIOpenSoC

FabricCPU(s)

HMC

AXI

AXI

CPU(s)

AXI CPU(s)

AX

I

CPU(s)

AXI

CPU(s)

AX

I

AXI

10

Gb

E

PC

Ie

Verilog

FPGA ASIC

Hardware

Compilation

Software

Compilation

SystemC

Simulation

C++

Simulation

Scala

Chisel

Open Source Extensible ISA/Cores Open Source fabricTo integrate accelerators

And logic into SOC

DSL for rapid prototypingof circuits, systems, and

arch simulator components

Platform for experimentation Parameterized hardware generation

Back-end to synthesizeHW with different devices

Or new logic families

Re-implement processorWith different devices orExtend w/accelerators

0

2000000

4000000

6000000

8000000

10000000

12000000

64:64:64 2:2:65536 20:20:656

Example:SearchingforOptimalCacheLatency(ns)/DifferentStencilOrganizationat2048K

L1latency Mainmemorylatency Coherencelatency

0%

50%

100%

Cache Latency (best coherence case)

computation latency prefetch warm up latency

coherence latency

90%

95%

100%

DMA Latency (10 ns initial latency, DMA

list)

computation latency DMA latency

Optimal Blocking

Stencil Study (effect of coherence and word granularity DMA

on basic stencil performance)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

2048

20:20:656

512

12:12:456

128

8:8:256

32

4:6:171

8

4:8:8

2

4:8:8

0.5

4:4:4

0.125

2:2:4

scrachpad speedup per computation

4.5x

SpeedupEnables extreme

strong scaling

Scatter/Gather results

Simple IPC CalculationLots of caveats

Effect 1: Reduce $ MissesAssume covered L1 misses become hits

13-40% IPC improvement

Effect 2: Integer offloadAssume half of address-calc integer operations are offloaded, require 0.1 cycles (vs. 0.4)

17-44% IPC Improvement

Effect 3: Improved $ performanceIf covered cache accesses go to scratchpad, other accesses more efficient?

0%

10%

20%

30%

40%

50%

hpgmg Kripke XSBench miniMD

Estimated IPC Improvement

from S/GReduced $

Misses

Here’s a great idea!Scatter/Gather

Could have a substantial positive impact on performance

Possible to identify regions amenable to lots of in-memory

operations, high reuse, good compaction

Good progress on design (DRE work) – high feasibility

Word-Granularity Scratchpad

Can reduce post-$ accesses / make better use of cache

Not useful for all apps (Kripke, XSBench?)

Verdict: Combine them!

Focus on S/G

Target Arch: S/G to/from scratchpad

Option: w/ HW Synchronization

Option: w/ Recoding engine

Combine!

High level results

Three Architectures

Scatter-Gather: Good body of evidence for performance, programmability, good

foundation for design

Word-granularity Scratchpad: Some evidence for improved performance,

programmability

Atomics: Chicken-and-Egg, coherency issues, need better application drivers

Benefit Scatter/GatherWord-granularity

Scratchpad

Kripke Good >20% Not Word-Gran

hpgmg Good >15% Ongoing

XSBench Good >28% Mild?

Additional value of ModSim

Detailed knowledge transfer

OCCAM

Extension of knowledge

Classified/unclassified boundaries

Proprietary/open boundaries

Thanks!

Special thanks to Arun Rodrigues and John Shalf

Project 38

Happy Birthday Bob Mrosky!

project 38 success requires modsim · project 38 at a glance project 38 is a set of vendor-agnostic...

Documents