November 25, 2002
Workload Benchmarks that Scale in Multiple Dimensions
John L. Gustafson
Sun Labs HPC Workload Characterization and System Analysis Team

Goal
Create a suite of purpose-based benchmarks representative of HPC, which adjust in several dimensions to match real workloads.
Description must be architecture-independent and language-independent.
Conjecture: This approach will yield improved predictive methods that are relatively invariant as technology evolves.

What Does it Mean to “Scale” a Workload?
Performance = Work/Time.
Work is usually undefined in benchmarks, so it’s “fixed” to avoid the issue.
FLOPS are not work. Not in 2002, anyway.
Multiple instances (small, medium, large) don’t make a workload “scalable.”
True scalability requires an objective function.

Latency Tradeoffs Drive Need for New HPC Approaches
ILLIAC IV, 1970: latency = 1000 nsec
SGI O2000, 1996: latency = 800 nsec
Sun SF15K, 2001: latency = 400 nsec
…but traditional complexity analysis counts operations, ignores memory latency!
[Figure: time for one memory access and time for one operation, 1950 to 2010, on a scale from 1 ns to 1 ms]

Moore’s Law near limit for microprocessors, and this time we mean it!
The clock can’t make it across a 20 mm die at current GHz rates.
Either we go to multiple cores or use Non-Uniform Cache Access (NUCA).
This is a physical limit, not technological. [Source: Chuck Moore, UT-Austin]
[Figure: die panels at 130 nm, 100 nm, 70 nm, and 35 nm]

Bandwidth Burns Energy; Maybe Our Measure of Work is… Joules?
Operation                        Energy (130 nm, 1.2 V)
Move 32 bits off chip            1300 to 1900 pJ
Move 32 bits across 10 mm chip   100 pJ
Read 32 bits from 8K RAM         50 pJ
32-bit register read             10 pJ
32-bit ALU operation             5 pJ
[Source: Bill Dally, Stanford]

Purpose-Based Benchmarks
A purpose-based benchmark states an objective function that has direct interest to humans.
An activity-based benchmark states computer operations to be performed, usually defined by source code.
Either can be made to scale in multiple dimensions, but it’s harder to do with activity-based benchmarks.

Means-Based vs Ends-Based Metrics
MEANS-BASED               ENDS-BASED
Flop/s                    Time to Compute Answer
Bytes of RAM              Detail, Content of Answer
Number of Processors      Feasible Problems to Attempt
Use of Commodity Parts    Cost, Availability of System
Word Size                 Closeness to Actual Physics
ECC Memory                Reliability of Answer
Speedup                   Product Line Range

Two-Way Taxonomy Examples
[Chart: benchmarks placed on a two-way grid, Activity-Based vs. Purpose-Based and Fixed-Size vs. Scalable]
Benchmarks shown: LINPACK; NAS Parallel; SPEC (any year); LINPEAK; SLALOM; Streams, LMBench; NSA suites; Many RFP tests; TPC-x, ECPerf; HINT (?); Sun’s HPC Suite

Example of Prediction Failure
[Figure: scatter plot; axis label: LINPACK speed relative to IBM 3090; axis ranges 0 to 8 and 0.0 to 2.5]
Nonlinear, non-monotone, low correlation

Peak FLOPS as Inverse Predictor
[Figure: scatter plot; axis label: Peak Advertised Performance, GFLOPS; axis ranges 0 to 15 and 0.0 to 2.5]
• Negative correlation
• Hardware imbalance
Teraflops advocates, please note!

Why F.P. Op Counts Make Poor Workload Definitions
HIGHER FLOP/S RATES              FASTER ANSWERS
Explicit Timestepping            Implicit Timestepping
Conventional Matrix Multiply     Strassen, Winograd Methods
Cholesky Decomposition           PC Conjugate Gradient
All-to-All N-Body Methods        Barnes-Hut, Greengard
Successive Over-Relaxation       Multigrid
Time-Domain Operators            FFTs
Recompute Gaussian Integrals     Compute Once and Store
Material Property Function       Table Look-Up

Peak Bandwidth Sometimes Works Surprisingly Well
• High correlation
• Monotone (4 points)
• Nonlinear, though
[Figure: scatter plot; axis label: Peak Memory Bandwidth, GB/sec; axis ranges 0 to 80 and 0.0 to 2.5]
An HPC Taxonomy
Electronic Design
Nuclear Applications
Mechanical Design
Mechanical Engineering
Radar CrossSection
Crash Simulation
Fluid Dynamics
Weather, Climate modeling
Signal/Image Processing
Encryption
Life Sciences
Financial Modeling
Petroleum
For each of these, we seek to match:
• Problem size
• Data locality
• Predominant data type
• Dynamic behavior in time
• Spatial irregularity
• Demands on I/O
The aim is to avoid the “toy benchmark” problem.

Some Invariant Workload Dimensions
[Figure: workload categories positioned along several dimensions. Labels: Data Parallelism; Real-Time; Physical Simulation; Math; Boolean; Small Discrete; Large Discrete; Low-precision continuous; Highly data-parallel; No locality of reference; Workload Category]

Example of Workload Dimension: Data Parallelism (estimated)
Data-parallel content, listed from 100% (first) to 0% (last):
• Gene matching, SETI, factoring large integers
• Vibrational analysis, ray tracing
• Dense matrix multiply, factoring; convolution and filtering
• Eulerian fluid dynamics (decomposed spatially)
• N-Body problems
• Stress-strain analysis with finite elements; crash testing
• Lagrangian fluid dynamics (decomposed by fluid element)
• “Easy” databases (conflict-free, few updates)
• FFTs (frequency-space filtering)
• Particle simulations with imposed fields, game-tree exploration
• Circuit simulation
• Economic simulations
• Typical database applications; transaction processing

Prior Art: HINT
English Description of Scalable Task
Divide the unit square into 2^i by 2^i rectangles. Bound the integral of (1-x)/(1+x) using that resolution and hierarchical interval subdivision. Use only the knowledge that the function is monotone decreasing.
[Figure: plot of the function on the unit square, with regions marked: known to contribute to lower bound; limited by arithmetic precision; available for further refinement; known not to contribute to upper bound]
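A minimal sketch of the bounding task described above, in Python. This is not the HINT source code: the name `hint_bounds`, the greedy choice of which interval to split next, and the use of floating-point areas instead of counting rectangles on a 2^i by 2^i grid are all assumptions made for illustration. The sketch relies only on the function being monotone decreasing, so on any interval [a, b] the area lies between f(b)(b - a) and f(a)(b - a).

```python
import math


def f(x):
    """Integrand from the task description; monotone decreasing on [0, 1]."""
    return (1.0 - x) / (1.0 + x)


# Exact value of the integral on [0, 1], used only to check the bounds.
EXACT = 2.0 * math.log(2.0) - 1.0


def gap(interval):
    """Uncertainty contributed by one interval: upper area minus lower area."""
    a, b = interval
    return (f(a) - f(b)) * (b - a)


def hint_bounds(splits):
    """Return (lower, upper) bounds on the integral after `splits` bisections.

    Assumption: each step bisects the interval with the largest remaining gap.
    """
    intervals = [(0.0, 1.0)]
    for _ in range(splits):
        worst = max(range(len(intervals)), key=lambda k: gap(intervals[k]))
        a, b = intervals.pop(worst)
        mid = 0.5 * (a + b)
        intervals.extend([(a, mid), (mid, b)])
    # Monotone decreasing: the right endpoint gives the lower bound,
    # the left endpoint gives the upper bound.
    lower = sum(f(b) * (b - a) for a, b in intervals)
    upper = sum(f(a) * (b - a) for a, b in intervals)
    return lower, upper


if __name__ == "__main__":
    for n in (0, 10, 100, 1000):
        lo, hi = hint_bounds(n)
        print(n, lo, hi, hi - lo)
```

The gap between the two bounds shrinks as more subdivisions are made, which is what lets the task scale with available time and memory.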

Region of Computation as Workload Scales
[Figure: log-log plot with vertical axis from 1e+05 to 1e+09 and horizontal axis “Time in seconds” from 1e-06 to 100. The region of computation is bounded by three limits: limited by latency; limited by “peak speed”; limited by precision or memory]

Serial Systems Used to Test Model
System  MHz  CPU        Primary Cache  Secondary Cache  RAM    Operating System  Compiler
Indy 1  133  R4600      16 KB          0.5 MB           64 MB  IRIX 6.2          Mips C
Indy 2  100  R4400      16 KB          1.0 MB           64 MB  IRIX 6.2          Mips C
Indy 3  133  R4600      16 KB          None             64 MB  IRIX 6.2          Mips C
Indy 4  100  R4600      16 KB          None             64 MB  IRIX 6.2          Mips C
Indy 5  200  R4000      16 KB          1.0 MB           64 MB  IRIX 6.2          Mips C
PC      200  Pent. Pro  16 KB          0.5 MB           64 MB  Linux             gcc

Curve Crossings Predict Inconsistent Rankings
[Figure: curves plotted against “Memory in Bytes” (100 to 100M), vertical axis 0 to 0.7, for Pentium Pro PC, SGI Indy 1, SGI Indy 5, SGI Indy 3, SGI Indy 4, and SGI Indy 2]
Note obvious cache regions.

Shared Memory Systems Tested
System         MHz  CPU     Primary Cache  Secondary Cache  RAM     Operating System  Compiler
SGI Challenge  194  R10000  32 KB          0.5 MB           0.5 GB  IRIX 6.2          Mips C
SGI Onyx       194  R10000  32 KB          1.0 MB           0.5 GB  IRIX 6.2          Mips C
Cray C90       250  Vector  None           None             2.0 GB  UNICOS            Cray C

Here, Differences are Not Subtle!
[Figure: curves plotted against “Memory in Bytes” (100 to 1G), vertical axis 0 to 4.0, for Cray C90, SGI Challenge, and SGI Onyx]

Parallel Systems Tested
System    Number of Processors  MHz  CPU        Pri. Cache  Sec. Cache  RAM           Op. System  Compiler
SGI Onyx  1, 2, 4, 8            194  R10000     32 KB       2.0 MB      2 GB (8-way)  IRIX 6.2    Mips C
nCUBE 2S  16, 32, 64, 128       25   custom     -           -           4 MB/node     Vertex      cc
Cray T3D  32, 64                150  Alpha EV4  32 KB       -           64 MB/node    UNICOS      Cray C 4.0.3.5
IBM SP2   8, 16, 32, 64, 128    67   RS6000     256 KB      -           128 MB/node   AIX 3       mpcc

HINT Curves for Parallel Systems
[Figure: log-log plot, vertical axis 1 to 100, horizontal axis “Time in seconds” from 0.0001 to 1, with curves for SGI Onyx (8), IBM SP2 (8), IBM SP2 (64), IBM SP2 (128), Cray T3D (32), and nCUBE 2S (256)]

SPECint Correlation with HINT
Benchmark  HINT Sample  Correlation  HINT Sample  Rank Corr.
go         27 KB        0.9964       468 KB       0.996
m88ksim    39 KB        0.9985       164 KB       0.996
gcc        180 KB       0.9970       514 KB       0.996
compress   290 KB       0.9990       164 KB       0.996
li         17 KB        0.9947       164 KB       0.996
ijpeg      198 KB       0.9986       164 KB       0.992
perl       36 KB        0.9989       164 KB       0.979
vortex     180 KB       0.9985       514 KB       perfect
SPECint    180 KB       0.9996       514 KB       perfect
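For readers who want to reproduce the two statistics reported in the table, here is a small sketch in plain Python. The function names are illustrative, and the sample numbers in the `__main__` block are made up for the example; they are not measurements from this talk. "Correlation" is taken here to mean the ordinary (Pearson) correlation coefficient, and "rank correlation" is computed as the Pearson correlation of the ranks, which is an assumption about how the table was produced.

```python
import math


def pearson(xs, ys):
    """Ordinary linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation(xs, ys):
    """Correlation of the rankings: 1.0 means the two lists order systems identically."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Hypothetical numbers: HINT speed at one memory size vs. an application score.
    hint_speed = [1.0, 2.0, 3.0, 4.5, 6.0]
    app_score = [s * s for s in hint_speed]
    print(pearson(hint_speed, app_score))
    print(rank_correlation(hint_speed, app_score))
```

A rank correlation of 1.0 (the "perfect" entries) means the HINT sample point orders the machines exactly as the benchmark does, even when the relationship is not linear.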

SPECfp Correlation with HINT
Benchmark  HINT Sample  Correlation  HINT Sample  Rank Corr.
tomcatv    net          0.9990       4963 KB      0.996
swim       net          0.9998       4963 KB      0.983
su2cor     893 KB       0.9970       net          0.996
hydro2d    1080 KB      0.9983       4963 KB      0.996
mgrid      29 KB        0.9977       net          0.996
applu      net          0.9997       4963 KB      0.992
turb3d     16 KB        0.9994       net          0.996
apsi       1080 KB      0.9979       net          0.992
fpppp      29 KB        0.9933       16 KB        0.996
wave5      893 KB       0.9962       4101 KB      perfect
SPECfp     180 KB       0.9984       4101 KB      0.996

SPEC Linear Fits
SPECint95 = 7.5 × (MQUIPS at 180 KB), within 15%
SPECfp95 = 11. × (MQUIPS at 893 KB), within 23%
Time to run SPEC is about 8 hours.
Time to run HINT is about 10 minutes.

Heuristic App Profile: hydro2d
[Figure: horizontal axis “Sample HINT memory point”, 10^2 to 10^8; vertical axis 0.965 to 1.000]

Worst Prediction: FT, from NPB
[Figure: scatter plot; axis label: Net QUIPS; axis ranges 0 to 350 and 0 to 0.8]

Scalability Allows >0.995 Correlation with Other Benchmarks
[Figure: other benchmarks placed along a “Memory in Bytes” axis (100 to 10^8): Fhourstone, Dhrystone, Tower of Hanoi, Queens, Fibonacci, etc.; Whetstone; LINPACK 100×100; SPECint; SPECfp; LINPACK 1000×1000; Stream]

You know those “EnergyGuide” stickers you see on refrigerators, water heaters, air conditioners, etc.…?
Wouldn’t it be great if they had something like this at retail computer stores?
[Mock rating label]
WINTEL PERSONAL COMPUTER, 64 MEGABYTES MAIN MEMORY
MODEL 9600
32-BIT INTEGER RATING
THIS MODEL: 15.6 net MQUIPS
June 1998 model with lowest performance: 3.2
June 1998 model with highest performance: 17.9
ONLY SINGLE-USER PERSONAL COMPUTERS ARE USED ON THIS SCALE
ESTIMATES ARE BASED ON THE INT® PERFORMANCE MEASURE
Your performance will vary depending on how you use the product, and on how you modify it by installing software. This test is based on patented, scalable methods developed by a federal laboratory.
Estimated performance over a range of uses
How fast will this model run different types of problems?
[Chart: vertical axis 0.0 to 1.6; horizontal axis “Time for a Computing Task” from 1 microsecond to 1 second; curves for 64-bit floating point and 32-bit integer]
Ask your salesperson for information about the needs of your application.

Benchmark #1: Truss Design
TASK: Given a point that must support a 100 ton load and three attachment points, find the geometry of struts and cables that creates the lightest structure. Each node requires 1 kg of steel. Structure must support its own weight.
[Figure: three attachment points (0,y1,z1), (0,y2,z2), (0,y3,z3) and a load point (L,y0,z0) carrying 100 tons, L meters from the attachment points]

Strut Design, continued
[Figure label: Stiffness (strength)]
More complex structures have higher strength. The benchmark initially sets the topology, and then perturbs the xyz positions of node points to optimize resulting total mass required.

Strut Algorithm (Preprocess)
Read problem geometry and N = number of vertices from nonvolatile storage.
Iterate until there are N vertices:
    Sort edges (cables or struts) by length.
    Bisect longest edge, creating new vertex.
    Adjust vertex position to make it non-collinear.
    Add two non-coplanar edges between new vertex and neighbor vertices.
    Compute new lengths of edges.
Save entire mesh description to nonvolatile storage.
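The refinement loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's code: `refine_mesh` is a hypothetical name, the "neighbor vertices" are taken to be the two nearest vertices other than the endpoints of the bisected edge, the non-collinear adjustment is a small random nudge, no explicit non-coplanarity check is made, and reading and writing the mesh on nonvolatile storage is left out.

```python
import math
import random


def refine_mesh(vertices, edges, n_target, jitter=0.01, seed=0):
    """Grow a mesh to n_target vertices by repeatedly bisecting the longest edge.

    vertices: list of (x, y, z) tuples; edges: list of (i, j) index pairs.
    Needs at least four starting vertices so two neighbors always exist.
    """
    rng = random.Random(seed)
    vertices = [tuple(v) for v in vertices]
    edges = [tuple(e) for e in edges]

    def length(edge):
        return math.dist(vertices[edge[0]], vertices[edge[1]])

    while len(vertices) < n_target:
        edges.sort(key=length)           # sort edges by length
        i, j = edges.pop()               # the longest edge is now last
        span = math.dist(vertices[i], vertices[j])
        midpoint = [(a + b) / 2.0 for a, b in zip(vertices[i], vertices[j])]
        # Nudge the new vertex off the original line (assumed form of the adjustment).
        new_vertex = tuple(m + rng.uniform(-jitter, jitter) * span for m in midpoint)
        # Assumption: connect to the two nearest vertices that are not endpoints.
        others = [k for k in range(len(vertices)) if k not in (i, j)]
        others.sort(key=lambda k: math.dist(vertices[k], new_vertex))
        vertices.append(new_vertex)
        v = len(vertices) - 1
        edges += [(i, v), (v, j)] + [(k, v) for k in others[:2]]

    lengths = [length(e) for e in edges]  # recompute all edge lengths
    return vertices, edges, lengths


if __name__ == "__main__":
    # Hypothetical starting geometry: one load point and three attachment points.
    start_vertices = [(4.0, 0.0, 0.0), (0.0, -1.0, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, -1.0)]
    start_edges = [(0, 1), (0, 2), (0, 3)]
    vs, es, ls = refine_mesh(start_vertices, start_edges, n_target=10)
    print(len(vs), len(es), max(ls))
```

Each bisection removes one edge and adds four, so the vertex count N directly controls the size of the linear system the inner solver has to handle.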

Strut Algorithm (Inner Solver)
For each edge:
    For each edge that touches the same vertex:
        Project edge vectors to this edge to obtain A_ij.
Compute external force from edge weights + load.
Solve linear system such that vertex forces = 0.
Compute required cross-sections of cables and struts.
Return the total weight of the truss.
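To make the "vertex forces = 0" step concrete, here is a sketch for the smallest possible case: a single loaded vertex held by three members running to fixed attachment points. It is not the benchmark's solver. The function names, the use of Cramer's rule, gravity along the negative z axis, the allowable stress and density values, and the omission of self-weight and of the 1 kg per node are all simplifying assumptions for illustration.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def tripod_forces(load_point, anchors, load):
    """Solve sum_k t_k * u_k + load = 0 for the three member forces t_k.

    u_k is the unit vector from the load point toward anchor k, so a positive
    t_k means tension (a cable) and a negative t_k means compression (a strut).
    """
    units, lengths = [], []
    for p in anchors:
        d = [p[k] - load_point[k] for k in range(3)]
        n = math.sqrt(sum(x * x for x in d))
        units.append([x / n for x in d])
        lengths.append(n)
    # Column c of the matrix is the unit vector of member c.
    matrix = [[units[c][r] for c in range(3)] for r in range(3)]
    rhs = [-x for x in load]
    d0 = det3(matrix)
    forces = []
    for c in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][c] = rhs[r]
        forces.append(det3(replaced) / d0)   # Cramer's rule
    return forces, lengths


def truss_mass(forces, lengths, allowable_stress=250e6, density=7850.0):
    """Mass of members sized so that |force| / area equals the allowable stress.

    The stress (Pa) and density (kg/m^3) are assumed, steel-like values.
    """
    return sum(density * abs(t) / allowable_stress * length
               for t, length in zip(forces, lengths))


if __name__ == "__main__":
    forces, lengths = tripod_forces(
        (2.0, 0.0, 0.0),
        [(0.0, -1.0, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, -1.0)],
        (0.0, 0.0, -1000.0),
    )
    print(forces, truss_mass(forces, lengths))
```

In the full benchmark the same balance is written for every vertex at once, which is where the larger matrix A_ij comes from.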

Strut Optimization: Point Method
Iterate until time limit reached:
    Pick a strut (randomly or sequentially).
    Vary xyz coordinates of a vertex, adjusting edge lengths that connect.
    Modify equations and re-solve.
    If truss weight is reduced, keep the modification;
    Else restore original mass distribution (or use annealing).
Report best solution found for structure.
This imitates the actions of a human engineer exploring a design space. Note the possibility of massive parallelism at the job level; each processor can try a different variation of the structure. The best solution found is then shared globally and used as a new starting point.
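The keep-if-better loop can be sketched generically. In this Python illustration the truss solver is replaced by a stand-in `toy_weight` function, a fixed step budget stands in for the time limit, and the names and step size are assumptions; only the accept/reject logic mirrors the description above.

```python
import random


def toy_weight(positions):
    """Stand-in for the inner solver: returns a 'weight' for a list of coordinates.

    Hypothetical objective with its minimum value of 5.0 when every coordinate is 1.0.
    """
    return 5.0 + sum((x - 1.0) ** 2 for x in positions)


def point_optimize(weight, start, steps=2000, delta=0.1, seed=0):
    """Perturb one coordinate at a time and keep the change only if weight drops."""
    rng = random.Random(seed)
    best = list(start)
    best_weight = weight(best)
    for _ in range(steps):                      # step budget stands in for the time limit
        trial = list(best)
        k = rng.randrange(len(trial))           # pick a coordinate at random
        trial[k] += rng.uniform(-delta, delta)  # vary it by a small amount
        trial_weight = weight(trial)            # "modify equations and re-solve"
        if trial_weight < best_weight:          # keep the modification ...
            best, best_weight = trial, trial_weight
        # ... otherwise `best` is left untouched, which restores the original.
    return best, best_weight


if __name__ == "__main__":
    solution, w = point_optimize(toy_weight, [0.0, 0.0, 0.0])
    print(solution, w)
```

Running several copies with different seeds and sharing the best result between rounds is one simple way to get the job-level parallelism mentioned above.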

Strut Optimization: Interval Method
Set initial bounds on vertex positions (can be very conservative).
Iterate until time limit reached:
    Subdivide M-dimensional space of vertex positions into subregions.
    For each subregion:
        Compute the truss weight, as an interval bound.
    Share bounds globally to exclude subregions from search.
Report best solution (range) found for structure.
This replaces trial-and-error with rigorous exclusion of infeasible sets. Massive parallelism is easy; each processor can try a different subspace.
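A small Python sketch of the exclusion idea, using a toy objective in place of the truss weight. The objective, the function names, the bisect-the-widest-dimension rule, and the fixed number of rounds (instead of a time limit) are assumptions for illustration. The essential step is the one described above: any subregion whose lower bound exceeds the best upper bound found so far cannot contain the minimum and is dropped.

```python
def square_bounds(lo, hi, center):
    """Range of (x - center)**2 for x anywhere in [lo, hi]."""
    a, b = lo - center, hi - center
    upper = max(a * a, b * b)
    lower = 0.0 if a <= 0.0 <= b else min(a * a, b * b)
    return lower, upper


def weight_bounds(box):
    """Interval bound on the toy weight 5 + sum((x_d - 1)**2) over a box."""
    lower = upper = 5.0
    for lo, hi in box:
        l, u = square_bounds(lo, hi, 1.0)
        lower += l
        upper += u
    return lower, upper


def interval_optimize(box, rounds=20):
    """Return surviving subregions and a (lower, upper) range for the minimum weight."""
    regions = [list(box)]
    for _ in range(rounds):
        children = []
        for region in regions:
            # Bisect each surviving region along its widest dimension.
            k = max(range(len(region)), key=lambda d: region[d][1] - region[d][0])
            lo, hi = region[k]
            mid = 0.5 * (lo + hi)
            children.append(region[:k] + [(lo, mid)] + region[k + 1:])
            children.append(region[:k] + [(mid, hi)] + region[k + 1:])
        bounds = [weight_bounds(r) for r in children]
        best_upper = min(u for _, u in bounds)   # the bound that is shared globally
        regions = [r for r, (l, _) in zip(children, bounds) if l <= best_upper]
    bounds = [weight_bounds(r) for r in regions]
    return regions, (min(l for l, _ in bounds), min(u for _, u in bounds))


if __name__ == "__main__":
    survivors, (low, high) = interval_optimize([(-4.0, 4.0), (-4.0, 4.0)])
    print(len(survivors), low, high)
```

Unlike the point method, the result is a guaranteed range for the optimum, and the bound evaluations for different subregions are independent of one another.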

Strut Optimization Dimensions
Number of vertices (if too many, fewer trials)
Number of trials (if too many, fewer vertices)
Type of solver (iterative, direct)
Precision of solver (match to workload!)
Search strategy (point, point parallel, interval exclusion)

Benchmark #2: Radiosity
[Figure: problem geometry with labeled dimensions]
TASK: Given the geometry above and three 1 by 1 diffuse light sources, find the placement of the light sources that results in the most even illumination of the bottom surface. All surfaces are Lambertian reflectors. Reflectivity of the top surface, lights, and vertical surfaces is 0.95; reflectivity of the bottom surface and the occluder is 0.70. Figure of merit = brightest/darkest ratio.

Radiosity, continued
Surfaces are subdivided into patches. Using more patches gives a better result. Point method uses Monte Carlo and only subdivides the bottom surface once; interval method uses an iterative solver and recursively subdivides all surfaces.

Radiosity Solver (Point)
Read problem geometry and N = number of patches from nonvolatile storage.
Subdivide bottom surface into N patches.
Until all patches have three-sigma confidence:
    Fire a random photon from a light source.
    Track reflections using probabilities until photon is absorbed.
    If absorbed in a bottom surface patch, increment histogram.
Find maximum and minimum photon counts.
Save bottom surface radiosities to nonvolatile storage.
Compute ratio.
Note: Highly parallel if care is used to create independent random number generation. Easy tests for occlusion compared to interval method.

Radiosity Solver (Interval)
Read problem geometry from nonvolatile storage.
Create initial subdivision into large rectangles.
Set up form factor matrix, and initial bounds on radiosity.
Until all patch intervals are 1% of the lightest-darkest range:
    Subdivide the patch with the largest uncertainty.
    Use radiosity equation (contractive mapping) to find new bounds.
    Compute lightest-darkest range on bottom surface.
Save subdivision geometry and patch ranges to nonvolatile storage.
Compute ratio.
Note: Parallel if asynchronous updating of radiosity is allowed (runs will not be exactly repeatable but will always converge). Closed-form expressions for form factors exist for this problem.
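The "contractive mapping" step can be illustrated with the standard radiosity equation B_i = E_i + rho_i * sum_j F_ij * B_j. The Python sketch below covers only that step for a fixed set of patches; it does no patch subdivision. The three-patch form factor matrix in the example is made up rather than derived from the benchmark geometry, and `radiosity_bounds` is a hypothetical name. It assumes non-negative emission, reflectivities below 1, and form factor rows that sum to at most 1, in which case iterating from zero gives rising lower bounds and iterating from a sufficiently large constant gives falling upper bounds.

```python
def radiosity_bounds(emission, reflectivity, form_factor, sweeps=500):
    """Bracket the radiosity of each patch by iterating the radiosity equation.

    Returns (lower, upper) lists; both converge to the same solution.
    """
    n = len(emission)

    def step(b):
        # One application of B_i = E_i + rho_i * sum_j F_ij * B_j.
        return [
            emission[i] + reflectivity[i] * sum(form_factor[i][j] * b[j] for j in range(n))
            for i in range(n)
        ]

    lower = [0.0] * n
    # A constant this large is mapped to something no larger, so it is a safe upper start.
    top = max(emission) / (1.0 - max(reflectivity))
    upper = [top] * n
    for _ in range(sweeps):
        lower = step(lower)
        upper = step(upper)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical 3-patch scene: patch 0 is a light; reflectivities follow the task text.
    emission = [1.0, 0.0, 0.0]
    reflectivity = [0.95, 0.95, 0.70]
    form_factor = [
        [0.0, 0.5, 0.3],
        [0.5, 0.0, 0.3],
        [0.3, 0.3, 0.0],
    ]
    low, high = radiosity_bounds(emission, reflectivity, form_factor, sweeps=5)
    print(low, high)
```

Because each sweep only tightens the bracket, patches can be updated in any order, which is consistent with the note above that asynchronous updating still converges.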

Radiosity Optimization: Point
Iterate until time limit reached:
    Pick a light source coordinate.
    Vary it by some small amount, like 0.1 meter.
    Re-solve the radiosity problem using the point method.
    Compare radiation evenness on bottom surface:
        If ratio is closer to 1.0, keep the modification;
        Else restore original light position.
Report best solution found for position of lights.
As before, this allows job parallelism if information about the search is shared by all processors after each run.

Radiosity Optimization: Interval
Set initial bounds on light positions (must be on ceiling, disjoint).
Iterate until time limit reached:
    Subdivide the 6-dimensional space of light positions.
    For each subset of the search space:
        Bound the ratio of lightest/darkest surface patch.
    Share bounds globally to exclude subregions from search.
Report best solution (range) found for light positions.
Parallelism exists at the job level and within each solver step.

Benchmark #3: Life Sciences (Proteomics)
Given a sequence of N peptides and a time limit T, find the minimum-energy conformation of the peptide sequence.
Figure of merit: N, or N/T
This approaches protein folding as N grows.
Answer validity can be tested against experiment.
Currently, N~55 is the frontier.

Benchmark #4: Electronic Design
Design an N-bit adder with carry lookahead in a given process technology (like 0.10 micron, FO4, 6-layer Cu interconnect). Simulate with a cycle-based simulator for a complete set of test vectors. Optimize to minimize clock cycle and chip area.
Figure of merit: Clock speed or area.
This captures both integer (logical) and floating-point (analog) aspects of electronic design.

Benchmark #5: Financial Modeling
Generate real-time market behavior drawn from historical data to drive workload. Execute trades based on estimates of future value for N financial instruments over a period T.
Objective function: Profit!

Benchmark #6: Weather Modeling
Generate real-time weather behavior drawn from historical data to drive workload. Predict weather (temperature, precipitation, cloud cover, pressure, wind speed) for N days in advance.
Objective function: Minimum total log(error)/time.

Benchmark #7: Petroleum Reservoir Management
Given a geological structure containing oil, water, gas, and a set of M injector wells and N extraction wells, position the wells to maximize the total oil and gas extracted over a period of time T.
Objective function: Maximum fuel extracted.

SUMMARY
Scalability is much easier if the workloads are purpose-based.
Multiple dimensions of scalability arise naturally as adjustable parameters.
Predictive value looks promising based on prior experience with HINT.
We will share our HPC workload benchmarks with the HPC community when completed.