Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Early Experiences with Simulation-Based Co-Design with Proxy Applications
Arun Rodrigues
Co-Design and Simulation
•Why simulate? Why Co-Design?
•Early Results
•Future Directions
[Figure: co-design loop linking Arch, SW/Apps, and Hardware, with arrows labeled Bottlenecks, Impacts, and Capabilities.]
Why/How Simulate? Why/How Co-Design?
Power is the Problem
•2018 Exascale Machine
–1 Exaop/sec
–100s petabyte/sec memory bandwidth
–100s petabyte/sec interconnect bandwidth
–No major architecture changes
•Consider power
–1 pJ * 1 Exa = 1 MW
–1 MW/year = $1 M
–$200-400M / year power bill
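The arithmetic on this slide can be checked in a few lines. The sketch below assumes the slide's rule of thumb of roughly $1M per megawatt-year; the function names are illustrative and do not come from the talk.

```python
PICO = 1e-12  # joules per picojoule
EXA = 1e18    # operations per second in one exaop/sec

def power_megawatts(energy_pj_per_op: float, ops_per_second: float) -> float:
    """Sustained power in MW if every operation costs `energy_pj_per_op` pJ."""
    watts = energy_pj_per_op * PICO * ops_per_second
    return watts / 1e6

def yearly_bill_musd(megawatts: float, musd_per_mw_year: float = 1.0) -> float:
    """Yearly power bill in millions of dollars (rule of thumb from the slide)."""
    return megawatts * musd_per_mw_year

# 1 pJ per operation at 1 exaop/sec is about 1 MW.
print(power_megawatts(1.0, EXA))

# Under the same rule of thumb, a $200-400M yearly bill
# corresponds to roughly 200-400 MW of sustained power.
print(yearly_bill_musd(200.0), yearly_bill_musd(400.0))
```

Every picojoule spent per operation therefore costs about a megawatt at exascale, which is why per-operation energy dominates the discussion that follows.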
management of data, combined with better overlap of communication and computation could reduce bandwidth
requirements by 50% or more.
Table 2: Effects of Power Reduction Techniques on an Exascale System
              2018 Estimate  Reduction Techniques          Reduced Power  Reduction
Processing    224 MW         Simpler processor, Reduce FP  11.2 MW        95%
Memory        125 MW         Closer Proximity              37.5 MW        70%
Interconnect  24 MW          Message Overlap               12 MW          50%
Total         373 MW                                       60.7 MW        84%
Applying these techniques to the hypothetical system from the introduction, we see a power reduction of 84%. This turns a
wholly impractical machine consuming hundreds of megawatts and costing significant fractions of a billion dollars to
power to a more attainable machine consuming tens of megawatts. More aggressive application of the techniques presented
here, and improvements in technology could reduce this further.
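The totals in Table 2 follow from the per-row estimates and reduction percentages. The short check below uses only the numbers printed in the table; the variable and function names are illustrative.

```python
# (2018 estimate in MW, fractional reduction), copied from Table 2.
TABLE_2 = {
    "Processing": (224.0, 0.95),
    "Memory": (125.0, 0.70),
    "Interconnect": (24.0, 0.50),
}

def reduced_power(estimate_mw: float, reduction: float) -> float:
    """Power remaining after applying a fractional reduction."""
    return estimate_mw * (1.0 - reduction)

total_before = sum(estimate for estimate, _ in TABLE_2.values())
total_after = sum(reduced_power(e, r) for e, r in TABLE_2.values())
overall_reduction = 1.0 - total_after / total_before

print(f"before: {total_before:.1f} MW")
print(f"after:  {total_after:.1f} MW")
print(f"overall reduction: {overall_reduction:.1%}")
```

The overall 84% is weighted toward processing because that row contributes most of the original 373 MW.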
Figure 13 – Lessons from Embedded Systems
From Jensen “Embedded systems and exascale computing.” CiSE 2010
              Energy        Conventional
Processor     62.5 pJ/op    62.5 MW
Memory        31.25 pJ/bit  125 MW
Interconnect  6 pJ/bit      24 MW
Total                       211.5 MW
Worldwide Impact
"Total power used by servers [in 2005] represented ... an amount comparable to that for color televisions."
-ESTIMATING TOTAL POWER CONSUMPTION BY SERVERS IN THE U.S. AND THE WORLD, Jonathan G. Koomey
  3741e9 kWh         Total US power consumption
* 3-4%               used by computers (>2% servers, >1% household computer use)
= 112 - 150e9 kWh    US computer power consumption
* $0.1/kWh           Retail cost, US average 2009
= $11 - $15 billion  US$ in compute power
* 3-5                In 2005 the US was roughly 1/3 of servers, by power
= $33 - $75 billion  US$ in worldwide computer power
* 15-35%             DRAM memory power
= $5 - $25 billion   US$ in DRAM power
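The chain of multiplications above can be reproduced as a range calculation. This is a minimal sketch using only the figures on the slide; the names are illustrative. The slide seems to round at each step, so the unrounded products differ slightly, and the high end of the DRAM estimate comes out nearer $26 billion than $25 billion when computed this way.

```python
US_TOTAL_KWH = 3741e9            # total US consumption, from the slide
COMPUTER_SHARE = (0.03, 0.04)    # 3-4% used by computers
PRICE_USD_PER_KWH = 0.10         # retail cost, US average 2009
WORLD_MULTIPLIER = (3.0, 5.0)    # US was roughly 1/3 of servers, by power
DRAM_SHARE = (0.15, 0.35)        # 15-35% DRAM memory power

def scale(bounds, factors):
    """Multiply a (low, high) range by a (low, high) pair of factors."""
    return (bounds[0] * factors[0], bounds[1] * factors[1])

def estimate():
    us_kwh = scale((US_TOTAL_KWH, US_TOTAL_KWH), COMPUTER_SHARE)
    us_usd = scale(us_kwh, (PRICE_USD_PER_KWH, PRICE_USD_PER_KWH))
    world_usd = scale(us_usd, WORLD_MULTIPLIER)
    dram_usd = scale(world_usd, DRAM_SHARE)
    return {"us_kwh": us_kwh, "us_usd": us_usd,
            "world_usd": world_usd, "dram_usd": dram_usd}

result = estimate()
for name, (low, high) in result.items():
    print(f"{name}: {low / 1e9:.1f} to {high / 1e9:.1f} billion")
```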
View of the Simulation Problem
•Multiple Audiences: application writers, purchasers, designers; system procurement, algorithm co-design, architecture research, language research
•Scale: network, processor, system; present systems, future systems; many cores + memory, many-many nodes, many-many-many threads
•Complexity: multi-physics apps, informatics apps; communication libraries, run-times, OS effects; existing languages, new languages
•Constraints: performance, cost, power, reliability, cooling, usability, risk, size
Major Simulation Challenges
•Multiple Objectives
–Performance used to be the only criterion
–Now: energy, cost, power, reliability, etc.
•Scale & Detail
–Many system characteristics require detail to measure
–Detailed simulation takes too long (10^4-10^5 times slower than real time)
•Accuracy
–Systems are more complex
–Vendors don’t reveal necessary details
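To put the 10^4-10^5 slowdown in perspective, the sketch below converts simulated time into wall-clock time. The helper names are illustrative, and the one-second workload is an arbitrary example, not a figure from the talk.

```python
def wall_clock_seconds(simulated_seconds: float, slowdown: float) -> float:
    """Wall-clock time needed to simulate a given span of target time."""
    return simulated_seconds * slowdown

def to_hours(seconds: float) -> float:
    return seconds / 3600.0

# One second of target execution at the two ends of the quoted range.
low_hours = to_hours(wall_clock_seconds(1.0, 1e4))
high_hours = to_hours(wall_clock_seconds(1.0, 1e5))
print(f"{low_hours:.1f} to {high_hours:.1f} hours per simulated second")
```

A single simulated second costs somewhere between a few hours and more than a day, which motivates mixing detailed models with faster, more abstract ones.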
Approach: Multiple Models
Back of the Envelope Analysis | Analytical Models | Provide ability to reason about the system and perform tradeoffs
High Level | State Machines | Explore high level application/sys SW structure, event counts
Behavioral | Cycle-Approximate, Cycle-Accurate Models | Detailed exploration of tradeoffs for arch, application, sys SW
Hardware Prototype | Prototypes | Provide hard numbers for power/area/feasibility
SST Simulation Project Overview
Goals
•Become the standard architectural simulation framework for HPC
•Be able to evaluate future systems on DOE workloads
•Use supercomputers to design supercomputers
Technical Approach
•Parallel
–Parallel Discrete Event core with conservative optimization over MPI
•Holistic
–Integrated Tech. Models for power
–McPAT, Sim-Panalyzer
•Multiscale
–Detailed and simple models for processor, network, and memory
•Open
–Open Core, non viral, modular
Consortium
•“Best of Breed” simulation suite
•Combine Lab, academic, & industry
Status
•Current Release (2.1) at code.google.com/p/sst-simulator/
•Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
Co-Design Goals
•Co-Design = HW & SW
•Iterative Simulation provides a testbed to explore HW & SW Ideas
•Software
–Application, Algorithm, Runtime, OS experiments
–Provide feedback to software teams on power, performance, etc. for FUTURE systems
•Hardware
–Goal: Design new hardware
–Must target several years out to have chance of influence
–Provide mechanism to evaluate new HW ideas, convince vendors
Early Results
•Design Space Exploration
•NVRAM
Design Space Exploration
Design Space Exploration Example
•Design Space Exploration
–Inputs
  •Memory technology (DDR2, DDR3, GDDR5)
  •Core width (1,2,4,8 wide issue)
  •Cache size (32/32/1M or 64/64/2M)
–Outputs: Power, Performance, Cost
•Methodology
–Performance models: GeM5/x86, DRAMSim2
–Energy Models: DRAMSim2, McPAT
–Cost Models: IC Knowledge
–Key Questions
  •What is good cache size? Core Width?
  •Which DRAM technology?
•Example of sorts of questions simulation can answer
[Figure: simulated node with cores and a memory controller (MC) attached to four DIMMs (DDR2, DDR3, or GDDR5); small or large cores, small or large caches; execution-based processor model and detailed DRAM model; labeled "Partial Co-Design". Target apps: HPCCG, Lulesh.]
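The input space described on this slide is small enough to enumerate exhaustively. The sketch below lists every combination of the three inputs; the dictionary keys and function name are illustrative, and the actual study's configuration format is not shown in the talk.

```python
from itertools import product

MEMORY = ("DDR2", "DDR3", "GDDR5")
ISSUE_WIDTH = (1, 2, 4, 8)
CACHE = ("32/32/1M", "64/64/2M")

def design_points():
    """Enumerate every (memory, width, cache) combination."""
    return [
        {"memory": memory, "width": width, "cache": cache}
        for memory, width, cache in product(MEMORY, ISSUE_WIDTH, CACHE)
    ]

points = design_points()
print(len(points), "configurations per application")
print(points[0])
```

With two target applications, each configuration would be simulated twice, once per application, to produce the power, performance, and cost outputs.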
Cache Size
•Larger caches increase processor size, power
•Avg. Power increase: 6.75%
•Avg. Cost increase: 3.76%
•Avg. Performance improvement
–Lulesh: 1.40%
–HPCCG: 6.73%
•Conclusion: Lulesh probably wouldn’t benefit, HPCCG marginal benefit
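One way to read the conclusion is to compare each performance gain against the extra power and cost it requires. The sketch below does that with the averages quoted above; treating the averages as if they applied to a single configuration is a simplifying assumption, and the names are illustrative.

```python
POWER_INCREASE = 0.0675
COST_INCREASE = 0.0376
PERF_GAIN = {"Lulesh": 0.0140, "HPCCG": 0.0673}

def relative_efficiency(perf_gain: float, resource_increase: float) -> float:
    """Ratio of new to old performance-per-resource (above 1.0 is an improvement)."""
    return (1.0 + perf_gain) / (1.0 + resource_increase)

for app, gain in PERF_GAIN.items():
    per_watt = relative_efficiency(gain, POWER_INCREASE)
    per_dollar = relative_efficiency(gain, COST_INCREASE)
    print(f"{app}: perf/Watt x{per_watt:.3f}, perf/$ x{per_dollar:.3f}")
```

Under this reading, Lulesh loses on both performance per Watt and performance per dollar, while HPCCG roughly breaks even on power and gains slightly on cost.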
[Figure: Effect of Larger Caches. Bars show the increase from the larger cache in power, system cost, Lulesh performance, and HPCCG performance.]
Which Memory System
•Options
–DDR2: Cheap, low power, antiquated
–DDR3: Higher performance, reasonable power
–GDDR5: Expensive, high power, very fast
•Pure performance:
–GDDR 26-47% faster than DDR3 (Lulesh)
–GDDR 32-41% faster than DDR3 (HPCCG)
–GDDR Wins?
[Figures: Lulesh Performance and HPCCG Performance. Normalized performance vs. processor width (1, 2, 4, 8, average) for DDR2, DDR3, and GDDR5; higher is better.]
Energy & Cost
•GDDR better performance
•DDR3 generally does better on perf/Watt
–Lulesh: -3% to 107%
–HPCCG: 0 to 100%
–GDDR does well at higher processor widths
–DDR2 sometimes slightly better than DDR3 on HPCCG
•perf/$
–DDR3 better for narrow cores, GDDR better for wide
–DDR slightly better overall
[Figures: Lulesh and HPCCG Performance per Watt, and Lulesh and HPCCG Performance per Dollar. Normalized values vs. processor width (1, 2, 4, 8, average) for DDR2, DDR3, and GDDR5; higher is better.]
Processor Width
•Wider processor can issue more instructions/cycle
•Consumes more area, power
–Cost is super-linear wrt area increase
•Power often increases faster than performance
–E.g. 8-wide processor 78% faster on Lulesh, uses 123% more power
•1-2 wide cores most power efficient
•2-4 most cost efficient
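The 8-wide example can be turned into efficiency ratios. The sketch below assumes that "78% faster" means 1.78x the baseline's performance and "123% more power" means 2.23x its power; the slide does not state the baseline, and the function names are illustrative.

```python
def perf_per_watt_ratio(speedup: float, power_increase: float) -> float:
    """New performance-per-Watt relative to the baseline (1.0 = unchanged)."""
    return (1.0 + speedup) / (1.0 + power_increase)

def energy_to_solution_ratio(speedup: float, power_increase: float) -> float:
    """Energy = power x time, and time shrinks by the speedup factor."""
    return (1.0 + power_increase) / (1.0 + speedup)

ratio = perf_per_watt_ratio(0.78, 1.23)
energy = energy_to_solution_ratio(0.78, 1.23)
print(f"perf/Watt: x{ratio:.2f}, energy to solution: x{energy:.2f}")
```

Under these assumptions the wide core finishes sooner but delivers about 20% less performance per Watt and spends about 25% more energy per solution.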
[Figures: Lulesh: Effect of Processor Width, and HPCCG: Effect of Processor Width. Normalized performance relative to power and to cost vs. processor width (1, 2, 4, 8); higher is better.]
X to Solution
•Wider processors provide shorter time to solution
•Require much more energy to solution
[Figures: Lulesh: Effect of Processor Width, and HPCCG: Effect of Processor Width. Normalized energy to solution and time to solution vs. processor width (1, 2, 4, 8).]
Design Space Exploration Results
•Fastest memory technology not always best (DDR beats GDDR) due to power, cost
•No “best” processor - depends on tradeoff between cost, performance, power
•Can provide better understanding of which configurations are best for a given application
•Can be used as basis for application optimization, vendor guidance
Lulesh Pareto Optimal Designs
Width  Memory  Cache  Power  Performance  Cost
1      DDR3    Small  1.00   1.00         1.0
2      DDR3    Small  1.43   1.65         1.3
2      GDDR5   Small  3.00   2.28         2.0
4      GDDR5   Small  3.57   2.92         2.3
8      GDDR5   Small  5.29   3.62         3.4
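A design is Pareto optimal when no other design is at least as good on every objective and strictly better on one. The sketch below applies that filter to the five Lulesh rows (lower power and cost are better, higher performance is better). The extra `candidate` point is hypothetical and is included only to show a dominated design being removed; it is not from the study.

```python
from typing import NamedTuple

class Design(NamedTuple):
    width: int
    memory: str
    cache: str
    power: float
    performance: float
    cost: float

# The five rows of the Lulesh table (normalized values).
LULESH_DESIGNS = [
    Design(1, "DDR3", "Small", 1.00, 1.00, 1.0),
    Design(2, "DDR3", "Small", 1.43, 1.65, 1.3),
    Design(2, "GDDR5", "Small", 3.00, 2.28, 2.0),
    Design(4, "GDDR5", "Small", 3.57, 2.92, 2.3),
    Design(8, "GDDR5", "Small", 5.29, 3.62, 3.4),
]

def dominates(a: Design, b: Design) -> bool:
    """True if `a` is no worse than `b` everywhere and better somewhere."""
    no_worse = (a.power <= b.power and a.cost <= b.cost
                and a.performance >= b.performance)
    better = (a.power < b.power or a.cost < b.cost
              or a.performance > b.performance)
    return no_worse and better

def pareto_front(designs):
    return [d for d in designs if not any(dominates(o, d) for o in designs)]

# Hypothetical design, invented for illustration only.
candidate = Design(4, "DDR2", "Large", 3.60, 2.00, 2.5)
front = pareto_front(LULESH_DESIGNS + [candidate])
print(len(front), "Pareto-optimal designs")
```

The five tabulated rows survive the filter because each step up in performance costs more power and money, so none dominates another.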
NVRAM
Non-Volatile Memory
•Factor of 3-8 less expensive (per bit)
•Lower Power
•Much higher latency
•Wear-out issues
•Traditionally used in block storage (SSDs, thumb drives)
•Could it be used to replace/augment DRAM?
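Occupy Xyce
[Figure: distribution of loads to memory pages in Xyce, shown beside the distribution of income to people in the US; captions "Occupy Wall Street" and "Occupy NVRAM".]
Other Apps
[Figure: load distributions for other applications, with panels labeled Good, Good, ???, and Bad.]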
Enter Co-design
•Lots of issues determine whether NVRAM is feasible
•Algorithm
–Is infrequently used data easy to identify?
–Can prefetching hints be given to the hardware?
–How much can be used?
•Runtime/OS
–How is NVRAM presented to the application?
–How is NVRAM managed between libraries and apps?
•Architecture
–How much NVRAM?
–Speeds? Integration strategy? Caching mechanism?
(Some) Future Directions
Advanced 3D Architecture
•Advanced Architecture
–3D Integration of Memory
–Silicon Photonics
•HW design questions
•Application Tuning
[Figure: three 3D-stacked node designs (Option 1, Option 2, Option 3). Each combines processors (Proc., MT Proc.) or logic, memory controllers, and a NIC & router with DRAM layers connected by TSVs and an optics layer of modulators, filters, and receivers; one option uses an electrical link.]
Processing-In-Memory
•Perform computation closer to the data
–Less moving data
–Fewer round trips
•Space
–DRAM Die is O(100) mm^2
–64-bit MIPS 5kf in 65nm is 1.5-2.2 mm^2
•Programmability
–Needs exploration
  •OpenMP accelerator model?
–OS Integration?
•Power/Energy
–What are the thermal constraints?
[Figure: Operations Offloaded. Percentage (0% to 100%) of FP and Mem operations offloaded for miniMD and HPCCG.]
for (size_t i = startat; i < stopat; ++i) {
  double sum = 0.0;
  const unsigned int cur_nnz = nnz_in_row[i];
  const double * const cur_vals = ptr_to_vals_in_row[i];
  const int * const cur_inds = ptr_to_inds_in_row[i];
  sum = 0.0;
  for (unsigned int j=0; j<cur_nnz; j++)
    sum += cur_vals[j] * x[cur_inds[j]];
  ((struct HPC_sparsemv_o_args*)arg)->y[i] = sum;
}
←PIM Section
Summary
•Simulation is a multi-faceted problem
•Can be a powerful tool in Co-design
•Compact Applications are a useful tool for simulation, experimentation
•We need better tools to understand our software and hardware
•Simulators can be a clearing house for new ideas and experiments
Bonus Slides
Skeleton Apps
[Figure: a common interface connecting simulators and models (Runnemede Simulator, NMSU Stochastic Simulator, M5, MacSim, IRIS Network Models, DRAMSim) with inputs from skeleton apps and traces; items exchanged include memory refs, traces, stochastic blocks, comm. events, memoization, and exec. blocks.]
•A common interface for “meso-scale” simulation
•Multiple forms of simulation (execution, traces, stochastic)
•Multiple levels of detail (Cycle accurate to analytical)
Thread Handling & Remote Fork
[Figure: block diagram with a Proc Queue, Off-Chip, Gen. Memories, WU Filter, FIFO, Double Ended, and Filtered blocks.]
Power is the Problem
•Processors
–Aggressive design reduces energy to 11 pJ/operation
–FP and $ access scales
•Memory DDRx
•Network
–Short distance electrical (~2pJ/bit)
–Routing: 1pJ/bit
–Long distance electrical/optical (4.6pJ/bit)
•~75MW=$75M/yr in power (w/o cooling)
•76% in data movement

                 Amount   Energy      Energy (MW)
Processor        1 Eop/s  22 pJ/op    22
Memory Capacity  200 PB   28 mW/GB    5.6
Mem. BW          1 Eb/s   30 pJ/bit   30
Network          4 Eb/s   4.3 pJ/bit  17.2
Total                                 74.8

Operation  Freq.  Energy
Basic      100%   11
$ Access   50%    20
FP op      10%    10
Average           22
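Both tables on this slide can be rebuilt from their rows. The sketch below computes the frequency-weighted average energy per operation and the resulting power budget. The last step is an inference rather than something the slide states: counting memory bandwidth, network, and the cache-access share of processor energy as "data movement" is one accounting that lands near the quoted 76%. All names are illustrative.

```python
# (operation, frequency, energy in pJ), from the operation table.
OP_MIX = [("Basic", 1.00, 11.0), ("$ Access", 0.50, 20.0), ("FP op", 0.10, 10.0)]

def average_pj_per_op(mix) -> float:
    """Frequency-weighted energy per operation in pJ."""
    return sum(freq * energy for _, freq, energy in mix)

def megawatts(pj_per_unit: float, exa_units_per_s: float) -> float:
    """pJ x exa-units/sec = 1e-12 J x 1e18 /s = 1e6 W, i.e. MW directly."""
    return pj_per_unit * exa_units_per_s

avg_pj = average_pj_per_op(OP_MIX)
budget_mw = {
    "Processor": megawatts(avg_pj, 1.0),         # 1 Eop/s
    "Memory Capacity": 200e6 * 28e-3 / 1e6,      # 200 PB at 28 mW/GB
    "Mem. BW": megawatts(30.0, 1.0),             # 1 Eb/s at 30 pJ/bit
    "Network": megawatts(4.3, 4.0),              # 4 Eb/s at 4.3 pJ/bit
}
total_mw = sum(budget_mw.values())

# Assumed accounting: cache accesses (50% x 20 pJ) count as data movement.
cache_mw = megawatts(0.50 * 20.0, 1.0)
movement_share = (budget_mw["Mem. BW"] + budget_mw["Network"] + cache_mw) / total_mw

print(f"average: {avg_pj:.1f} pJ/op, total: {total_mw:.1f} MW")
print(f"data movement share (assumed accounting): {movement_share:.0%}")
```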
Simulation Examples
Action | “Translation” | “Software” | Model
Explore Design Space | Compiler | Executable | Detailed 1K Node Model
CoDesign Tool | Instruction Trace Generator | Trace | Fast Estimator
Advanced Architecture | Compiler + Manual Mod. | Executable | Vendor + Open Models
Algorithm Prototyping | ASPEN | Communication Pattern | Abstract 1M Node Network Model
Diversity!
[Figure: simulator components placed by complexity/detail against execution time: genericProc, Macro, M5 O3, RS Router, DRAMSim, Stochastic, simpleRouter, ResiliencySim, scheduleSim, Zesto.]
Commonalities
•Discrete Event or time stepped
•Amenable to event counting for power modeling
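The two commonalities fit together naturally: a discrete-event loop processes events in time order, and counting events by type gives a simple energy estimate. The sketch below is a minimal serial illustration, not the SST core or its API. The class, the event names, and the workload are hypothetical; the per-event energies reuse the 11 pJ basic operation and 20 pJ cache access from the bonus power slide.

```python
import heapq
from collections import Counter

# Energy per event type in pJ (values borrowed from the power slide).
ENERGY_PJ = {"basic_op": 11.0, "cache_access": 20.0}

class Simulator:
    """Minimal serial discrete-event loop with per-type event counting."""

    def __init__(self):
        self.now = 0
        self._queue = []   # heap of (time, sequence, kind, handler)
        self._seq = 0      # tie-breaker so handlers are never compared
        self.counts = Counter()

    def schedule(self, delay, kind, handler=None):
        heapq.heappush(self._queue, (self.now + delay, self._seq, kind, handler))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, kind, handler = heapq.heappop(self._queue)
            self.counts[kind] += 1
            if handler is not None:
                handler(self)

    def energy_pj(self):
        return sum(ENERGY_PJ[kind] * n for kind, n in self.counts.items())

def with_cache_access(sim):
    # Hypothetical behavior: this operation triggers a cache access 2 cycles later.
    sim.schedule(2, "cache_access")

sim = Simulator()
for cycle in range(4):
    handler = with_cache_access if cycle % 2 == 0 else None
    sim.schedule(cycle, "basic_op", handler)
sim.run()
print(dict(sim.counts), sim.energy_pj(), "pJ at time", sim.now)
```

Because the energy model only reads the counters, the same approach works for a time-stepped model that increments counts each step.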