1
Enterprise Platforms Group
Pinpointing Representative Portions of Large Intel Itanium
Programs with Dynamic Instrumentation
Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew
Sun, Anand Karunanidhi
Enterprise Platform Group, Intel Corporation
Presented at MICRO-37: Portland, OR, Dec. 6th, 2004
IA32/EM64T/IPF
2
Target: LARGE Applications
• With little/no manual intervention
• Within reasonable time
Goal: Accurate Performance Prediction
3
Instruction Counts: Some Itanium Applications
Application          # Instructions (billions)
SPECINT (average)      142
SPECFP (average)       373
RenderMan magic        463
Fluent L2            3,979
Amber rt             3,994
Ls-Dyna 3cars        4,932
4
Whole-Program Simulation is Slow
Simulation Time in YEARS @ 10,000 Instructions/Second
Application          Years
SPECINT (average)     0.4
SPECFP (average)      1.2
RenderMan magic       1.5
Fluent L2            12.6
Amber rt             12.7
Ls-Dyna 3cars        15.6
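The simulation times follow from dividing each instruction count by the simulation rate. A minimal sketch of that arithmetic, assuming a 365-day year; the slide does not state how it rounds, so SPECINT comes out at about 0.45 here rather than the 0.4 shown. The names `SIM_RATE` and `simulation_years` are illustrative, not from the toolkit.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: 365-day year
SIM_RATE = 10_000                    # instructions per second, from the slide

# Instruction counts in billions, from the "Instruction Counts" slide
INSTRUCTIONS_BILLIONS = {
    "SPECINT (average)": 142,
    "SPECFP (average)": 373,
    "RenderMan magic": 463,
    "Fluent L2": 3979,
    "Amber rt": 3994,
    "Ls-Dyna 3cars": 4932,
}

def simulation_years(instructions_billions, rate=SIM_RATE):
    """Years needed to simulate a whole run at `rate` instructions/second."""
    seconds = instructions_billions * 1e9 / rate
    return seconds / SECONDS_PER_YEAR

if __name__ == "__main__":
    for name, count in INSTRUCTIONS_BILLIONS.items():
        print(f"{name:20s} {simulation_years(count):6.2f} years")
```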
5
Solution: Select Simulation Points
• Manually
• Randomly
  – Anywhere
  – From uniform regions
• Fine-grain sampling (SMARTS: CMU)
• By program-phase analysis (SimPoint: UCSD, iPart: Intel/MRL)
6
Running Commercial Applications on Simulators is Hard
• Resource requirements: disks etc.
  – Need to modify/re-configure the simulator
• OS dependencies
  – Need support for specific kernel and device drivers
• License checking
  – Need special action
7
Solution: Native Execution with Instrumentation
Use PIN to select simulation points (PinPoints) and generate traces
PIN: A dynamic-instrumentation system
+ A tool for writing tools
+ No special compiler/linker flags required
8
PIN-Tools: Profiling, Trace Generation and more...
[Diagram: PIN-based profiler → Profile → Simulation Point Selection → PinPoints → PIN-based Trace Generator, PIN-based Branch Predictor, Your Simulator Here]
9
Simulation Point Selection with SimPoint [UCSD]
Why SimPoint?
• Instrumentation based
• Microarchitecture independent
• Works well (results later)
Applied to multi-threaded programs
[Diagram: PIN-based profiler → Basic Block Vectors → SimPoint Tools → PinPoints]
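To illustrate the idea behind phase-based selection, here is a minimal sketch: each execution slice is summarized by a basic block vector (execution counts per basic block), similar vectors are clustered, and one representative slice per cluster is kept with a weight equal to the cluster's share of the run. This is not the SimPoint implementation, which performs additional steps such as dimensionality reduction and choosing the number of clusters automatically. The vectors, the fixed `k`, and all function names are assumptions made for the example.

```python
def normalize(bbv):
    """Scale a basic block vector so its entries sum to 1."""
    total = sum(bbv)
    return [x / total for x in bbv]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means; the first k points seed the centroids (sketch-only choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def pick_simulation_points(bbvs, k):
    """Return (slice_index, weight) for the slice nearest each cluster centroid."""
    points = [normalize(v) for v in bbvs]
    labels, centroids = kmeans(points, k)
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        rep = min(members, key=lambda i: dist2(points[i], centroids[c]))
        chosen.append((rep, len(members) / len(points)))
    return chosen

# Hypothetical data: 6 slices, 3 basic blocks, two obvious phases
BBVS = [
    [90, 10, 0],
    [5, 5, 90],
    [88, 12, 0],
    [92, 8, 0],
    [4, 6, 90],
    [6, 4, 90],
]
SIM_POINTS = pick_simulation_points(BBVS, k=2)
```

Because the vectors depend only on which code executes, the selection is independent of the microarchitecture, which matches the slide's second bullet.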
10
Multiple Sources of Error
Goal: Accurate Performance Prediction
Error Source: Phase detection
Error Source: Non-repeatability
Error Source: Warm-up, Modeling
PinPoints → Traces → Simulation → Stats (CPI)
Phase-detection is not enough!
Need Trace Generation and Simulation
11
Main Contributions
• A toolkit that automatically:
  – Profiles, finds phases/simulation regions (PinPoints)
  – Validates that PinPoints are representative
  – Generates traces for simulators
Available for Itanium/IA32/EM64T
• Evaluations in a production environment
12
The PinPoints Toolkit
Phase Detection + PinPoint Selection
PinPoints file
H/W counters-based Validation (pfmon: Itanium, PAPI: IA32)
Compute CPI: Whole Program vs. Weighted Sum for PinPoints
Match?
Trace Generation/Simulation
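The validation step compares the whole-program CPI with a weighted sum of the per-PinPoint CPIs. A minimal sketch of that comparison follows; the weights, CPI values, and function names are hypothetical, and the code does not reflect how the toolkit reads hardware counters.

```python
def weighted_cpi(pinpoints):
    """Weighted sum of per-PinPoint CPIs; `pinpoints` is a list of (weight, cpi)."""
    total_weight = sum(w for w, _ in pinpoints)
    return sum(w * cpi for w, cpi in pinpoints) / total_weight

def relative_error(predicted, actual):
    """Absolute prediction error as a fraction of the actual value."""
    return abs(predicted - actual) / actual

# Hypothetical measurements: (weight, CPI measured over that PinPoint)
PINPOINTS = [(0.50, 1.20), (0.30, 0.80), (0.20, 2.00)]
WHOLE_PROGRAM_CPI = 1.30   # hypothetical whole-run measurement

PREDICTED = weighted_cpi(PINPOINTS)
ERROR = relative_error(PREDICTED, WHOLE_PROGRAM_CPI)

if __name__ == "__main__":
    print(f"predicted CPI {PREDICTED:.2f}, error {100 * ERROR:.1f}%")
```

Dividing by the total weight keeps the result meaningful even if the weights in a PinPoints file do not sum exactly to 1.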
13
Evaluations
Applications: Built w/ Intel’s compilers (high opt)
  HPC: Fluent, AMBER, LS-Dyna, RenderMan
  SPEC2000: Processed 8-9 times
Test Configurations: Linux (RedHat)
  Merced     Itanium (1)   800 MHz   L3: 2 MB
  McKinley   Itanium-2     900 MHz   L3: 1.5 MB
  Madison    Itanium-2     1.3 GHz   L3: 3-6 MB
14
PinPoints Generated
Program              # Retired Instructions (billions)   # PinPoints (250 million insts. EACH)
AMBER-rt             3,994                               6
Fluent-m3            2,625                               8
LS-DYNA              4,932                               6
SPECINT2000 (avg.)     142                               4
SPECFP2000 (avg.)      373                               5
• PinPoints << 1% of program execution
• Turnaround time (Traces): Few days
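The "<< 1%" claim can be checked directly from the table: the covered fraction is the number of PinPoints times 250 million instructions, divided by the retired instruction count. A minimal sketch, using the slide's figures and illustrative names; for the SPEC rows the inputs are averages, so the result is only indicative.

```python
SLICE_INSTRUCTIONS = 250e6   # each PinPoint is 250 million instructions (from the slide)

# program: (retired instructions in billions, number of PinPoints), from the table
RUNS = {
    "AMBER-rt": (3994, 6),
    "Fluent-m3": (2625, 8),
    "LS-DYNA": (4932, 6),
    "SPECINT2000 (avg.)": (142, 4),
    "SPECFP2000 (avg.)": (373, 5),
}

def coverage_percent(retired_billions, num_pinpoints):
    """Share of the run covered by the PinPoints, in percent."""
    return 100.0 * num_pinpoints * SLICE_INSTRUCTIONS / (retired_billions * 1e9)

COVERAGE = {name: coverage_percent(*row) for name, row in RUNS.items()}

if __name__ == "__main__":
    for name, pct in COVERAGE.items():
        print(f"{name:20s} {pct:.3f}% of execution")
```

With these inputs the largest share is the SPECINT2000 average at about 0.7%, and the HPC applications are all below 0.1%.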
15
Results: Overview
• PinPoints: Whole-program CPI prediction (SPEC2000 and HPC applications):
  – Average CPI prediction error ~5%
  – PinPoints better than random selection
• Predicting speedup between microarchitectures
  – PinPoints can be used to evaluate microarchitecture variations
• PinPoints traces: Prediction of native SPEC2000 ratios
  – INT within 8%, FP within 3%
More results in the paper
16
CPI: Actual vs. Predicted
SPEC2000: Itanium-Madison
[Chart: CPI (0.1 to 2.1) per benchmark; series: Whole_pgm_CPI, PinPoints_CPI]
17
SPEC2000 CPI Prediction
Average Error: Madison: 2.8%, Merced: 3.2%, McKinley: 2.7%
[Chart: CPI (0.1 to 2.1) and % Abs(Delta in CPI) (0% to 100%) per benchmark; series: Whole_pgm_CPI, PinPoints_CPI, %Delta]
18
HPC Applications CPI Prediction
Average Error: Madison: 5.0%
[Chart: CPI (0.0 to 0.8) and % Abs(Delta CPI) (0% to 100%) per application; series: Whole_pgm_CPI, PinPoints_CPI, %Delta]
19
Comparison With Random Selection [48 unique program runs]
Cumulative Distribution of CPI Errors for SPEC2000
[Chart: % of Runs (5% to 95%) vs. CPI Error (0% to 30%); series: PinPoints: N Points, Random: N Points, Uniform Random: N Points]
20
Comparison With Random Selection [18 unique program runs]
Cumulative Distribution of CPI Errors for HPC apps.
[Chart: % of Runs (5% to 95%) vs. CPI Error (0% to 30%); series: PinPoints: N Points, Random: N Points, Uniform Random: N Points]
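A cumulative distribution of this kind reports, for each error threshold, the fraction of runs whose CPI error is at or below that threshold. A minimal sketch of how such a curve could be tabulated; the error values are made up for illustration and are not the measurements behind the charts.

```python
def fraction_within(errors_pct, threshold_pct):
    """Fraction of runs whose CPI error is at or below the threshold."""
    return sum(e <= threshold_pct for e in errors_pct) / len(errors_pct)

def cumulative_distribution(errors_pct, thresholds_pct):
    """List of (threshold, fraction of runs within threshold) pairs."""
    return [(t, fraction_within(errors_pct, t)) for t in thresholds_pct]

# Hypothetical per-run CPI errors in percent (NOT data from the slides)
ERRORS_PCT = [1.0, 2.5, 3.0, 4.5, 7.0, 12.0, 0.5, 6.0]

# Same x-axis range as the charts: 0% to 30% in steps of 2%
CDF = cumulative_distribution(ERRORS_PCT, range(0, 32, 2))

if __name__ == "__main__":
    for threshold, fraction in CDF:
        print(f"error <= {threshold:2d}%: {100 * fraction:5.1f}% of runs")
```

A selection method is better when its curve rises faster, that is, when more runs fall under a small error threshold.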
21
Speedup: Merced → McKinley
SPEC2000
[Chart: Speedup (0 to 6) per benchmark; series: McKinley:Actual]
22
PinPoints Speedup Prediction: SPEC2000: Merced → McKinley
[Chart: Speedup (0 to 6) per benchmark; series: McKinley:Actual, McKinley:Predicted]
23
PinPoints: Speedup Prediction Across Multiple Microarchitectures
Same Binaries/PinPoints
[Chart: Speedup (0 to 6) per benchmark; series: McKinley:Actual, McKinley:Predicted, Madison:Actual, Madison:Predicted]
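Because the same binaries and PinPoints are used on every machine, the instruction count cancels out and a speedup can be estimated from CPI and clock frequency alone. The slides do not spell out the exact calculation, so the following is a sketch under that assumption; the CPI values are hypothetical, and the frequencies are the Merced and McKinley clocks from the test configurations.

```python
def time_per_instruction(cpi, freq_hz):
    """Average seconds per instruction: cycles per instruction / cycles per second."""
    return cpi / freq_hz

def speedup(base_cpi, base_hz, new_cpi, new_hz):
    """Speedup of the new machine over the base machine for the same binary."""
    return time_per_instruction(base_cpi, base_hz) / time_per_instruction(new_cpi, new_hz)

MERCED_HZ = 800e6     # from the test configurations slide
MCKINLEY_HZ = 900e6   # from the test configurations slide

# Hypothetical CPIs, e.g. weighted sums over the same PinPoints on each machine
MERCED_CPI = 2.0
MCKINLEY_CPI = 1.0

PREDICTED_SPEEDUP = speedup(MERCED_CPI, MERCED_HZ, MCKINLEY_CPI, MCKINLEY_HZ)

if __name__ == "__main__":
    print(f"predicted speedup: {PREDICTED_SPEEDUP:.2f}x")
```

Feeding the PinPoints-based CPI into this ratio gives a predicted speedup, and feeding the whole-program CPI gives the actual one.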
24
Putting it All Together: From PinPoints to Projections
PinPoints → Traces → Simulation → Stats (CPI)
Does simulation of traces for PinPoints predict native performance?
Error Source: Phase detection
Error Source: Non-repeatability
Error Source: Warm-up, Modeling
Error: Cumulative
25
CPI Prediction with Simulation
SPEC2000: Itanium Madison
[Chart: CPI (0.0 to 1.4) and Abs(% Delta) (0% to 100%) per benchmark; series: Actual: Native Hardware, Simulated: PinPoints (traces), % Delta]
26
Native SPEC2000 Ratios [Spring 2004]
Itanium: Madison 1.5GHz/6MB L3
SPEC Ratio
SPECfp     2075
SPECint    1174
27
Performance Prediction from PinPoints Traces
Itanium: Madison 1.5GHz/6MB L3
SPEC Ratio   Native   Simulated
SPECfp         2075        2126
SPECint        1174        1270
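The prediction error for each suite can be read off these four numbers. A minimal sketch, assuming the error is taken relative to the native ratio (the slides do not say which baseline they use) and that the first pair of chart values is native and the second pair simulated:

```python
# SPEC ratios from the chart; which value belongs to which series is inferred
NATIVE = {"SPECfp": 2075, "SPECint": 1174}
SIMULATED = {"SPECfp": 2126, "SPECint": 1270}

def percent_error(simulated, native):
    """Absolute error of the simulated ratio, as a percentage of the native ratio."""
    return 100.0 * abs(simulated - native) / native

ERRORS = {suite: percent_error(SIMULATED[suite], NATIVE[suite]) for suite in NATIVE}

if __name__ == "__main__":
    for suite, err in ERRORS.items():
        print(f"{suite}: {err:.1f}% error")
```

Under these assumptions SPECfp comes out near 2.5% and SPECint near 8.2%, in line with the "FP within 3%" and roughly "INT within 8%" figures quoted in the deck.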
28
Summary
PinPoints toolkit: Automatic simulation region selection, tracing, and validation
Dynamic instrumentation (PIN) → LARGE programs
• PinPoints: << 1% of execution
  Capture whole-program CPI
  – Average error < 5% for SPEC2000, HPC apps.
  – Better than random selection
• PinPoints traces: Predict SPEC2000 ratios
  – INT within 8%, FP within 3%
29
Try it out!
(PIN + PinPoints) toolkit (New):
http://rogue.colorado.edu/Pin
30
Backup: Simulator Warm-up
• Strategy 1: Large slice-size (250 million instructions)
  – Too coarse-grain for phase detection
  – Too much simulation time
• Strategy 2: 7 warm-up traces per simulation trace (30 million instructions)
  Art (SPECFP2000): First pinpoint touches most of the working set
  – Simulate all pinpoint traces in succession