Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Early Experiences with Simulation-Based Co-Design with Proxy Applications

Arun Rodrigues

Co-Design and Simulation
•Why simulate? Why Co-Design?
•Early Results
•Future Directions

[Figure: co-design loop linking SW/Apps, Arch, and Hardware, with arrows labeled Bottlenecks, Impacts, and Capabilities]

Why/How Simulate? Why/How Co-Design?

Power is the Problem
•2018 Exascale Machine
–1 Exaop/sec
–100s petabyte/sec memory bandwidth
–100s petabyte/sec interconnect bandwidth
–No major architecture changes
•Consider power
–1 pJ/op * 1 Exaop/sec = 1 MW
–1 MW for one year ≈ $1 M
–$200-400M / year power bill
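The power arithmetic on this slide can be checked in a few lines of C. This is a minimal sketch: the function names are hypothetical, and the electricity price of $0.10 per kWh is an assumption borrowed from the US retail average quoted on the Worldwide Impact slide.

```c
#include <assert.h>

/* Sustained power in watts for a machine that spends joules_per_op
 * on each of ops_per_sec operations. */
double power_watts(double joules_per_op, double ops_per_sec)
{
    assert(joules_per_op >= 0.0 && ops_per_sec >= 0.0);
    return joules_per_op * ops_per_sec;
}

/* Cost in dollars of drawing `watts` continuously for one year.
 * usd_per_kwh is an assumed electricity price, not a figure from this slide. */
double annual_cost_usd(double watts, double usd_per_kwh)
{
    const double hours_per_year = 24.0 * 365.0; /* 8760 h, ignoring leap years */
    assert(watts >= 0.0 && usd_per_kwh >= 0.0);
    return (watts / 1000.0) * hours_per_year * usd_per_kwh;
}
```

Calling `power_watts(1e-12, 1e18)` gives 1e6 W, so every picojoule spent per operation costs 1 MW at one exaop/sec. `annual_cost_usd(1e6, 0.10)` gives about $876,000, which the slide rounds to $1 M per MW-year; at that rate a $200-400M annual bill corresponds to roughly 200-400 MW.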

management of data, combined with better overlap of communication and computation, could reduce bandwidth requirements by 50% or more.

Table 2: Effects of Power Reduction Techniques on an Exascale System

              2018 Estimate   Reduction Techniques           Reduced Power   Reduction
Processing    224 MW          Simpler processor, Reduce FP   11.2 MW         95%
Memory        125 MW          Closer Proximity               37.5 MW         70%
Interconnect  24 MW           Message Overlap                12 MW           50%
Total         373 MW                                         60.7 MW         84%

Applying these techniques to the hypothetical system from the introduction, we see a power reduction of 84%. This turns a wholly impractical machine, consuming hundreds of megawatts and costing significant fractions of a billion dollars to power, into a more attainable machine consuming tens of megawatts. More aggressive application of the techniques presented here, and improvements in technology, could reduce this further.
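The totals in Table 2 follow directly from the per-component estimates and reduction percentages. The sketch below recomputes them; the struct and function names are hypothetical, and only the numbers come from the table.

```c
#include <assert.h>
#include <stddef.h>

struct component {
    const char *name;
    double estimate_mw;  /* 2018 estimate from Table 2 */
    double reduction;    /* fractional reduction from Table 2 */
};

static const struct component table2[] = {
    { "Processing",   224.0, 0.95 },
    { "Memory",       125.0, 0.70 },
    { "Interconnect",  24.0, 0.50 },
};
enum { TABLE2_ROWS = sizeof table2 / sizeof table2[0] };

/* Sum of the unreduced estimates, in MW. */
double total_estimate_mw(const struct component *c, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += c[i].estimate_mw;
    return sum;
}

/* Sum of the estimates after each component's reduction is applied, in MW. */
double total_reduced_mw(const struct component *c, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(c[i].reduction >= 0.0 && c[i].reduction <= 1.0);
        sum += c[i].estimate_mw * (1.0 - c[i].reduction);
    }
    return sum;
}

/* Overall fractional reduction across all components. */
double overall_reduction(const struct component *c, size_t n)
{
    return 1.0 - total_reduced_mw(c, n) / total_estimate_mw(c, n);
}
```

With the table's rows this yields 373 MW before, 60.7 MW after, and an overall reduction of about 0.837, which the table rounds to 84%.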

Figure 13 – Lessons from Embedded Systems

From Jensen “Embedded systems and exascale computing.” CiSE 2010

              Energy         Conventional
Processor     62.5 pJ/op     62.5 MW
Memory        31.25 pJ/bit   125 MW
Interconnect  6 pJ/bit       24 MW
Total                        211.5 MW

Worldwide Impact
"Total power used by servers [in 2005] represented ... an amount comparable to that for color televisions."
-ESTIMATING TOTAL POWER CONSUMPTION BY SERVERS IN THE U.S. AND THE WORLD, Jonathan G. Koomey

  3741e9 kWh           Total US power consumption
* 3-4%                 used by computers (>2% servers, >1% household computer use)
= 112 - 150e9 kWh      US computer power consumption
* $0.10/kWh            Retail cost, US average 2009
= $11 - $15 Billion    US$ in compute power
* 3-5                  in 2005 the US was roughly 1/3 of servers, by power
= $33 - $75 Billion    US$ in worldwide computer power
* 15-35%               DRAM memory power
= $5 - $25 Billion     US$ in DRAM power
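The chain of estimates above is a product of low and high factors, which is easy to reproduce. A minimal sketch follows; the `range` type and function names are hypothetical, and every factor is taken from the slide.

```c
#include <assert.h>

struct range { double lo, hi; };

/* Multiply a non-negative range by a non-negative factor range. */
struct range scale(struct range r, double f_lo, double f_hi)
{
    struct range out;
    assert(r.lo >= 0.0 && r.lo <= r.hi);
    assert(f_lo >= 0.0 && f_lo <= f_hi);
    out.lo = r.lo * f_lo;
    out.hi = r.hi * f_hi;
    return out;
}

/* Worldwide DRAM electricity cost in US dollars, following the slide's steps. */
struct range dram_power_cost_usd(void)
{
    struct range us_kwh = { 3741e9, 3741e9 };               /* total US consumption */
    struct range computer_kwh = scale(us_kwh, 0.03, 0.04);  /* 3-4% used by computers */
    struct range us_usd = scale(computer_kwh, 0.10, 0.10);  /* $0.10 per kWh */
    struct range world_usd = scale(us_usd, 3.0, 5.0);       /* US is 1/3 to 1/5 of the world */
    return scale(world_usd, 0.15, 0.35);                    /* 15-35% goes to DRAM */
}
```

Carried through without intermediate rounding, the chain gives roughly $5.1 billion to $26.2 billion, in line with the slide's rounded $5 - $25 billion.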

View of the Simulation Problem

[Figure: four dimensions of the simulation problem, combined with "X" marks]
•Multiple Audiences: application writers, purchasers, designers; system procurement, algorithm co-design, architecture research, language research; network, processor, system; present systems, future systems
•Scale: many cores + memory, many many nodes, many many many threads; multi-physics apps, informatics apps
•Complexity: communication libraries, run-times, OS effects, existing languages, new languages
•Constraints: performance, cost, power, reliability, cooling, usability, risk, size

Major Simulation Challenges
•Multiple Objectives
–Performance used to be the only criterion
–Now: energy, cost, power, reliability, etc.
•Scale & Detail
–Many system characteristics require detail to measure
–Detailed simulation takes too long (10^4-10^5 times slower than real time)
•Accuracy
–Systems more complex
–Vendors don’t reveal necessary details

Approach: Multiple Models
•Back of the Envelope Analysis (Analytical Models): Provide ability to reason about the system and perform tradeoffs
•High Level (State Machines): Explore high level application/Sys SW. structure, event counts
•Behavioral (Cycle-Approximate, Cycle-Accurate Models): Detailed exploration of tradeoffs for arch, application, sys sw.
•Hardware Prototype (Prototypes): Provide hard numbers for power/area/feasibility

SST Simulation Project Overview

Goals
•Become the standard architectural simulation framework for HPC
•Be able to evaluate future systems on DOE workloads
•Use supercomputers to design supercomputers

Technical Approach
•Parallel
–Parallel Discrete Event core with conservative optimization over MPI
•Holistic
–Integrated Tech. Models for power
–McPAT, Sim-Panalyzer
•Multiscale
–Detailed and simple models for processor, network, and memory
•Open
–Open Core, non viral, modular

Consortium
•“Best of Breed” simulation suite
•Combine Lab, academic, & industry

Status
•Current Release (2.1) at code.google.com/p/sst-simulator/
•Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model

Co-Design Goals
•Co-Design = HW & SW
•Iterative Simulation provides a testbed to explore HW & SW Ideas
•Software
–Application, Algorithm, Runtime, OS experiments
–Provide feedback to software teams on power, performance, etc. for FUTURE systems
•Hardware
–Goal: Design new hardware
–Must target several years out to have a chance of influence
–Provide mechanism to evaluate new HW ideas, convince vendors

Early Results

•Design Space Exploration
•NVRAM

Design Space Exploration

Design Space Exploration Example
•Design Space Exploration
–Inputs
  •Memory technology (DDR2, DDR3, GDDR5)
  •Core width (1, 2, 4, 8 wide issue)
  •Cache size (32/32/1M or 64/64/2M)
–Outputs: Power, Performance, Cost
•Methodology
–Performance models: gem5/x86, DRAMSim2
–Energy Models: DRAMSim2, McPAT
–Cost Models: IC Knowledge
•Key Questions
–What is a good cache size? Core width?
–Which DRAM technology?
•Example of the sorts of questions simulation can answer
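The three inputs listed above define a small full-factorial design space of 3 x 4 x 2 = 24 configurations. The sketch below enumerates them; the type and function names are hypothetical, and handing each point to the performance, energy, and cost models is not shown.

```c
#include <assert.h>
#include <stddef.h>

struct design_point {
    const char *memory;  /* DDR2, DDR3, or GDDR5 */
    int issue_width;     /* 1, 2, 4, or 8 */
    const char *cache;   /* cache option as written on the slide */
};

static const char *const MEMORIES[] = { "DDR2", "DDR3", "GDDR5" };
static const int WIDTHS[] = { 1, 2, 4, 8 };
static const char *const CACHES[] = { "32/32/1M", "64/64/2M" };

/* Fill out[] with every combination of the three inputs and return the
 * number of points written. capacity is the number of slots in out[]. */
size_t enumerate_design_space(struct design_point *out, size_t capacity)
{
    size_t n = 0;
    for (size_t m = 0; m < sizeof MEMORIES / sizeof MEMORIES[0]; ++m)
        for (size_t w = 0; w < sizeof WIDTHS / sizeof WIDTHS[0]; ++w)
            for (size_t c = 0; c < sizeof CACHES / sizeof CACHES[0]; ++c) {
                assert(n < capacity);
                out[n].memory = MEMORIES[m];
                out[n].issue_width = WIDTHS[w];
                out[n].cache = CACHES[c];
                ++n;
            }
    return n;
}
```

A caller would allocate at least 24 slots, call `enumerate_design_space`, and then run each point through the simulators for every target application.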

[Figure: simulated node with cores and a memory controller (MC) attached to four DIMMs. Cores: small or large cores, small or large caches, execution-based processor model. Memory: DDR2, DDR3, or GDDR5, detailed DRAM model.]

Partial Co-Design

Target Apps
•HPCCG
•Lulesh

Cache Size
•Larger caches increase processor size, power
•Avg. Power increase: 6.75%
•Avg. Cost increase: 3.76%
•Avg. Performance improvement
–Lulesh: 1.40%
–HPCCG: 6.73%
•Conclusion: Lulesh probably wouldn’t benefit, HPCCG marginal benefit

[Figure: Effect of Larger Caches. Bar chart of the percentage increase from the larger cache in Power, System Cost, Lulesh Performance, and HPCCG Performance.]

Which Memory System
•Options
–DDR2: Cheap, low power, antiquated
–DDR3: Higher performance, reasonable power
–GDDR5: Expensive, high power, very fast
•Pure performance:
–GDDR 26-47% faster than DDR3 (Lulesh)
–GDDR 32-41% faster than DDR3 (HPCCG)
–GDDR Wins?

[Figures: Lulesh Performance and HPCCG Performance. Normalized performance versus processor width (1, 2, 4, 8, average) for DDR2, DDR3, and GDDR5; each panel marks the "Better" direction.]

Energy & Cost
•GDDR better performance
•DDR3 generally does better on perf/Watt
–Lulesh: -3% to 107%
–HPCCG: 0 to 100%
–GDDR does well at higher processor widths
–DDR2 sometimes slightly better than DDR3 on HPCCG
•perf/$
–DDR3 better for narrow cores, GDDR better for wide
–DDR slightly better overall

[Figures: Lulesh Performance per Watt, HPCCG Performance per Watt, Lulesh Performance per Dollar, and HPCCG Performance per Dollar. Normalized values versus processor width (1, 2, 4, 8, average) for DDR2, DDR3, and GDDR5; each pair of panels marks the "Better" direction.]

Processor Width
•Wider processor can issue more instructions/cycle
•Consumes more area, power
–Cost is super-linear wrt area increase
•Power often increases faster than performance
–E.g. 8-wide processor 78% faster on Lulesh, uses 123% more power
•1-2 wide cores most power efficient
•2-4 most cost efficient
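The 8-wide Lulesh example shows why power efficiency drops as cores widen. The sketch below turns a speedup and a power ratio into relative performance per watt, time to solution, and energy to solution. The names are hypothetical, and both ratios are assumed to be relative to the same baseline core, which the slide does not name.

```c
#include <assert.h>

struct efficiency {
    double perf_per_watt;      /* relative to the baseline core */
    double time_to_solution;   /* relative run time */
    double energy_to_solution; /* relative energy = power * time */
};

/* speedup and power_ratio are relative to the same baseline, e.g.
 * 1.78 for "78% faster" and 2.23 for "123% more power". */
struct efficiency relative_efficiency(double speedup, double power_ratio)
{
    struct efficiency e;
    assert(speedup > 0.0 && power_ratio > 0.0);
    e.perf_per_watt = speedup / power_ratio;
    e.time_to_solution = 1.0 / speedup;
    e.energy_to_solution = power_ratio * e.time_to_solution;
    return e;
}
```

For `relative_efficiency(1.78, 2.23)` the run finishes in about 56% of the baseline time, but performance per watt falls to about 0.80 of the baseline and the energy to solution rises by about 25%.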

[Figures: Lulesh: Effect of Processor Width and HPCCG: Effect of Processor Width. Normalized performance versus processor width (1, 2, 4, 8), with Power and Cost series; each panel marks the "Better" direction.]

X to Solution

•Wider processors provide shorter time to solution

•Require much more energy to solution

[Figures: Lulesh: Effect of Processor Width and HPCCG: Effect of Processor Width. Normalized Energy to Solution and Time to Solution versus processor width (1, 2, 4, 8); each panel marks the "Better" direction.]

Design Space Exploration Results
•Fastest memory technology not always best (DDR beats GDDR) due to power, cost

•No “best” processor - depends on tradeoff between cost, performance, power

•Can provide better understanding of which configurations are best for a given application

•Can be used as basis for application optimization, vendor guidance

Lulesh Pareto Optimal Designs

Width  Memory  Cache  Power  Performance  Cost
1      DDR3    Small  1.00   1.00         1.0
2      DDR3    Small  1.43   1.65         1.3
2      GDDR5   Small  3.00   2.28         2.0
4      GDDR5   Small  3.57   2.92         2.3
8      GDDR5   Small  5.29   3.62         3.4
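A design is Pareto optimal when no other design is at least as good on every objective and strictly better on one. The sketch below checks that property for the five Lulesh rows, assuming power and cost are minimized and performance is maximized. The type and function names are hypothetical; the deck does not describe how the set was actually selected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct design {
    int width;
    const char *memory;
    double power;        /* lower is better */
    double performance;  /* higher is better */
    double cost;         /* lower is better */
};

/* Rows of the Lulesh Pareto table (all use the small cache). */
static const struct design lulesh_designs[] = {
    { 1, "DDR3",  1.00, 1.00, 1.0 },
    { 2, "DDR3",  1.43, 1.65, 1.3 },
    { 2, "GDDR5", 3.00, 2.28, 2.0 },
    { 4, "GDDR5", 3.57, 2.92, 2.3 },
    { 8, "GDDR5", 5.29, 3.62, 3.4 },
};
enum { LULESH_DESIGNS = sizeof lulesh_designs / sizeof lulesh_designs[0] };

/* a dominates b if a is no worse on every objective and better on at least one. */
bool dominates(const struct design *a, const struct design *b)
{
    bool no_worse = a->power <= b->power
                 && a->performance >= b->performance
                 && a->cost <= b->cost;
    bool better = a->power < b->power
               || a->performance > b->performance
               || a->cost < b->cost;
    return no_worse && better;
}

/* True if no other design in set[0..n-1] dominates set[i]. */
bool is_pareto_optimal(const struct design *set, size_t n, size_t i)
{
    assert(i < n);
    for (size_t j = 0; j < n; ++j)
        if (j != i && dominates(&set[j], &set[i]))
            return false;
    return true;
}
```

All five rows pass the check: each step up in performance costs more power and more money, so no row dominates another.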

Non-Volatile Memory
•Factor of 3-8 less expensive (per bit)
•Lower Power
•Much higher latency
•Wear-out issues
•Traditionally used in block storage (SSDs, thumb drives)
•Could it be used to replace/augment DRAM?

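Occupy Xyce
[Figures: Distribution of Loads to memory pages, Xyce ("Occupy NVRAM"), shown beside Distribution of Income to People, US ("Occupy Wall Street").]

Other Apps
[Figures: load distributions for other applications, labeled Good, Good, ???, and Bad.]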

Enter Co-design
•Lots of issues to resolve to determine whether NVRAM is feasible
•Algorithm
–Is infrequently used data easy to identify?
–Can prefetching hints be given to the hardware?
–How much can be used?
•Runtime/OS
–How is NVRAM presented to the application?
–How is NVRAM managed between libraries and apps?
•Architecture
–How much NVRAM?
–Speeds? Integration strategy? Caching mechanism?
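One way to approach the first algorithm question is to measure how concentrated the loads are across memory pages, as in the Xyce load distribution. The sketch below assumes per-page load counts are already available from a trace or simulator; the function names and input format are hypothetical. Pages outside the hot set would be the candidates for NVRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator: descending order of load counts. */
static int by_count_desc(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/* Fraction of all loads that hit the `top` most heavily loaded pages.
 * loads[] is sorted in place. Returns 0.0 when there are no loads. */
double hot_page_share(unsigned long *loads, size_t pages, size_t top)
{
    unsigned long total = 0, hot = 0;
    assert(top <= pages);
    qsort(loads, pages, sizeof loads[0], by_count_desc);
    for (size_t i = 0; i < pages; ++i) {
        total += loads[i];
        if (i < top)
            hot += loads[i];
    }
    return total ? (double)hot / (double)total : 0.0;
}
```

With made-up counts of 5, 90, 1, 3, and 1 loads on five pages, `hot_page_share(loads, 5, 1)` returns 0.9: a single page receives 90% of the loads, the kind of skew that would make the remaining pages easy to identify as infrequently used.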

(Some) Future Directions

Advanced 3D Architecture

•Advanced Architecture
–3D Integration of Memory
–Silicon Photonics
•HW design questions
•Application Tuning

[Figure: three stacked-node design options (Option 1, Option 2, Option 3). Each stacks DRAM layers, connected by TSVs, with a processing layer (Proc., MT Proc., Logic, Mem. Ctrl., NIC & Router) and an optics layer (Modulator, Filter, Recv); one option includes an Electrical Link.]

Processing-In-Memory
•Perform computation closer to the data
–Less moving data
–Fewer round trips
•Space
–DRAM Die is O(100) mm^2
–64-bit MIPS 5kf in 65nm is 1.5-2.2 mm^2
•Programmability
–Needs exploration
  •OpenMP accelerator model?
–OS Integration?
•Power/Energy
–What are the thermal constraints?

[Figure: Operations Offloaded. Percentage (0% to 100%) of FP and Mem operations offloaded for miniMD and HPCCG.]

    for (size_t i = startat; i < stopat; ++i) {
        double sum = 0.0;
        const unsigned int cur_nnz = nnz_in_row[i];
        const double * const cur_vals = ptr_to_vals_in_row[i];
        const int * const cur_inds = ptr_to_inds_in_row[i];
        /* dot product of sparse row i with the dense vector x */
        for (unsigned int j = 0; j < cur_nnz; j++)
            sum += cur_vals[j] * x[cur_inds[j]];
        ((struct HPC_sparsemv_o_args*)arg)->y[i] = sum;
    }

←PIM Section
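The kernel on the slide depends on HPCCG's own data structures. For experimentation, the same row-wise sparse matrix-vector product can be written as a self-contained function. This is a sketch, not HPCCG's code: the function name and argument list are hypothetical, and only the loop body mirrors the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Compute y[i] = sum over j of vals_in_row[i][j] * x[inds_in_row[i][j]]
 * for rows start..stop-1, one sparse row at a time. */
void sparsemv_rows(size_t start, size_t stop,
                   const unsigned int *nnz_in_row,
                   const double *const *vals_in_row,
                   const int *const *inds_in_row,
                   const double *x, double *y)
{
    assert(start <= stop);
    for (size_t i = start; i < stop; ++i) {
        const double *cur_vals = vals_in_row[i];
        const int *cur_inds = inds_in_row[i];
        double sum = 0.0;
        for (unsigned int j = 0; j < nnz_in_row[i]; ++j)
            sum += cur_vals[j] * x[cur_inds[j]];
        y[i] = sum;
    }
}
```

The `start` and `stop` bounds let a caller hand disjoint row ranges to different workers, which is the shape of work a processing-in-memory unit would receive. The indirect `x[cur_inds[j]]` access is the irregular memory traffic that motivates moving the loop closer to the data.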

Summary
•Simulation is a multi-faceted problem
•Can be a powerful tool in Co-design
•Compact Applications are a useful tool for simulation, experimentation

•We need better tools to understand our software and hardware

•Simulators can be a clearing house for new ideas and experiments

Bonus Slides

Skeleton Apps

[Figure: a Common Interface connecting Skeleton Apps and Traces to simulators and models (Runnemede Simulator, NMSU Stochastic Simulator, M5, MacSim, IRIS, Network Models, DRAMSim) through Memory Refs, Traces, Stochastic Blocks, Comm. Events, Memoization, and Exec. Blocks.]

•A common interface for “meso-scale” simulation

•Multiple forms of simulation (execution, traces, stochastic)

•Multiple levels of detail (Cycle accurate to analytical)

[Figure: block diagram with Thread Handling & Remote Fork, Proc Queue, Off-Chip, Gen. Memories, WU, Filter, FIFO, Double Ended, and Filtered blocks.]

Power is the Problem
•Processors
–Aggressive design reduces energy to 11 pJ/operation
–FP and $ access scales
•Memory DDRx
•Network
–Short distance electrical (~2pJ/bit)
–Routing: 1pJ/bit
–Long distance electrical/optical (4.6pJ/bit)
•~75MW=$75M/yr in power (w/o cooling)
•76% in data movement

                 Amount    Energy       Energy (MW)
Processor        1 Eop/s   22 pJ/op     22
Memory Capacity  200 PB    28 mW/GB     5.6
Mem. BW          1 Eb/s    30 pJ/bit    30
Network          4 Eb/s    4.3 pJ/bit   17.2
Total                                   74.8

Operation  Freq.  Energy (pJ)
Basic      100%   11
$ Access   50%    20
FP op      10%    10
Average           22
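Both tables can be reproduced from their inputs. The sketch below recomputes the 22 pJ average operation and the 74.8 MW total, using hypothetical function names. The slide does not say how the 76% data-movement figure was derived; counting cache accesses, memory bandwidth, and network traffic as data movement is an assumption that happens to reproduce it.

```c
#include <assert.h>

/* Average energy per operation in pJ: every operation pays the basic cost,
 * and a fraction also pays for a cache access or a floating-point op. */
double average_op_energy_pj(void)
{
    return 1.00 * 11.0   /* basic, 100% of operations */
         + 0.50 * 20.0   /* $ access, 50% */
         + 0.10 * 10.0;  /* FP op, 10% */
}

struct budget_mw {
    double processor, mem_capacity, mem_bw, network, total;
};

/* Power budget in MW from the Amount and Energy columns. */
struct budget_mw exascale_budget(void)
{
    struct budget_mw b;
    b.processor    = 1e18 * average_op_energy_pj() * 1e-12 / 1e6; /* 1 Eop/s */
    b.mem_capacity = 200e6 * 28e-3 / 1e6;   /* 200 PB = 200e6 GB at 28 mW/GB */
    b.mem_bw       = 1e18 * 30e-12 / 1e6;   /* 1 Eb/s at 30 pJ/bit */
    b.network      = 4e18 * 4.3e-12 / 1e6;  /* 4 Eb/s at 4.3 pJ/bit */
    b.total = b.processor + b.mem_capacity + b.mem_bw + b.network;
    return b;
}

/* ASSUMPTION: "data movement" = cache accesses + memory bandwidth + network. */
double data_movement_share(const struct budget_mw *b)
{
    double cache_mw = 1e18 * (0.50 * 20.0) * 1e-12 / 1e6; /* 10 pJ per op on average */
    return (cache_mw + b->mem_bw + b->network) / b->total;
}
```

Under that reading, data movement accounts for (10 + 30 + 17.2) / 74.8, or about 76%, of the budget.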

Simulation Examples

Action                 “Translation”                “Software”             Model
Explore Design Space   Compiler                     Executable             Detailed 1K Node Model
CoDesign Tool          Instruction Trace Generator  Trace                  Fast Estimator
Advanced Architecture  Compiler + Manual Mod.       Executable             Vendor + Open Models
Algorithm Prototyping  ASPEN                        Communication Pattern  Abstract 1M Node Network Model

Diversity!

[Figure: simulation models placed by Complexity/Detail versus Execution Time: genericProc, Macro, M5 O3, RS Router, DRAMSim, Stochastic, simpleRouter, ResiliencySim, scheduleSim, Zesto.]

Commonalities
•Discrete Event or time stepped
•Amenable to event counting for power modeling
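Event counting for power modeling can be sketched as follows: the simulator reports each event as it is processed, the model keeps a counter per event kind, and energy is the sum of counts times a per-event energy. Everything here is hypothetical and is not SST's interface. The per-event energies reuse the values from the Operation table purely as illustration, and events are assumed to arrive already in time order.

```c
#include <assert.h>
#include <stddef.h>

enum event_kind { EV_BASIC_OP, EV_CACHE_ACCESS, EV_FP_OP, EV_KINDS };

/* Illustrative energy per event in pJ (values from the Operation table). */
static const double ENERGY_PJ[EV_KINDS] = { 11.0, 20.0, 10.0 };

struct event {
    double time_ns;
    enum event_kind kind;
};

struct power_model {
    unsigned long counts[EV_KINDS];
    double last_time_ns;
};

/* Count n events; they must be sorted by non-decreasing time. */
void count_events(struct power_model *pm, const struct event *events, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        assert(events[i].time_ns >= pm->last_time_ns);
        assert((int)events[i].kind >= 0 && (int)events[i].kind < EV_KINDS);
        pm->counts[events[i].kind]++;
        pm->last_time_ns = events[i].time_ns;
    }
}

/* Total energy in pJ: sum of count * per-event energy. */
double total_energy_pj(const struct power_model *pm)
{
    double sum = 0.0;
    for (int k = 0; k < EV_KINDS; ++k)
        sum += (double)pm->counts[k] * ENERGY_PJ[k];
    return sum;
}

/* Average power in mW over the simulated interval (1 pJ/ns = 1 mW). */
double average_power_mw(const struct power_model *pm)
{
    return pm->last_time_ns > 0.0 ? total_energy_pj(pm) / pm->last_time_ns : 0.0;
}
```

Because the model only needs counts, the same accounting works whether the events come from a discrete-event core or from a time-stepped loop that emits events each step.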
