An Improved Analytical Superscalar Microprocessor Memory Model
Xi Chen and Tor Aamodt
Electrical and Computer Engineering, University of British Columbia
June 22nd, 2008 (MoBS-2008)
Outline
- Introduction
- Background
- Modeling Long Latency Memory Systems
  - Pending Hits
  - Accurate Hidden Miss Latency Estimation
  - A Limited Number of Outstanding Cache Misses
  - SWAM & SWAM-MLP
- Methodology
- Results
- Future Work and Conclusions
Introduction
- Processor design is a complicated task
- Cycle-accurate performance simulators are extensively used to explore the design space
  - Creating and debugging a simulator takes time
  - Running detailed simulations is slow
Introduction
- Simulating future large-scale CMPs?
- Over 8 months to simulate 1 second of LCMP (32-core, 4-thread-per-core) workload execution (Li Zhao et al., "Exploring Large-Scale CMP Architectures Using ManySim," IEEE Micro, Issue 4, 2007)
Introduction
- Analytical Modeling: an alternative to cycle-accurate simulation
- An analytical model takes program characteristics and microarchitectural parameters as inputs and produces a performance estimate
Introduction
- Analytical Modeling vs. Performance Simulation
  - Pros
    - Fast (orders of magnitude faster)
    - Provides more insight for chip designers
  - Cons
    - Less accurate than performance simulation
    - Covers only major microarchitectural parameters
Background
- First-order Model
  - Tejas Karkhanis and James Smith. A First-Order Superscalar Processor Model. ISCA'04.
  - Stable performance is disrupted by different types of miss-events
[Figure: IPC over time, dipping at miss-event #1, #2, and #3]
Background
- First-order Model
  - The overall performance (CPI) is modeled as the sum of isolated CPI components due to each type of miss-event:

    CPI_modeled = CPI_base + CPI_bmsp + CPI_D$miss + CPI_I$miss

    where CPI_base is the CPI in the absence of miss-events, CPI_bmsp accounts for branch mispredictions, CPI_D$miss for data cache misses, and CPI_I$miss for instruction cache misses
  - The baseline in our paper is our careful re-implementation of the first-order model based upon the available details described in the ISCA'04 paper and its follow-up work
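The additive structure of the first-order model can be sketched in a few lines; the numeric component values below are purely illustrative, not measurements from the paper.

```python
# Minimal sketch of the additive first-order CPI model: overall CPI is
# the sum of the base CPI and the isolated miss-event components.
def modeled_cpi(cpi_base, cpi_bmsp, cpi_dmiss, cpi_imiss):
    """CPI_modeled = CPI_base + CPI_bmsp + CPI_D$miss + CPI_I$miss."""
    return cpi_base + cpi_bmsp + cpi_dmiss + cpi_imiss

# Illustrative component values (not from the paper).
print(modeled_cpi(0.5, 0.10, 0.80, 0.05))
```

The key modeling assumption, stated in the ISCA'04 work, is that miss-events are rare enough that their penalties can be computed in isolation and summed.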
Pending Data Cache Hits
[Figure: CPI_D$miss for mcf vs. memory access latency (20-500 cycles), comparing the actual value against the model without pending hits (w/o PH)]
Contributions
- Pending Hits: error is reduced from 43.5% to 27.5%
- Accurate Hidden Miss Latency Estimation: 15.5% to 10.3%
- Profile Window Selection: 29.2% to 10.3% (and, with a limited number of MSHRs, 32.4% to 9.2%)
[Figure: baseline vs. contribution profile windows over instructions i1-i16]
Modeling CPI_D$miss: Baseline
- Assume the instruction window size is eight
[Figure: profiling an instruction stream i1-i19 in ROB-sized windows; analyzing the first window (i1-i8) updates num_serialized_D$miss from 0 to 1, and the second window (i9-i16) updates it from 1 to 3]
Modeling CPI_D$miss: Baseline
- mem_lat: main memory latency
- total_num_instructions: total number of instructions committed
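A hedged sketch of how the baseline presumably combines these quantities: each serialized D$ miss exposes roughly one full memory latency, amortized over all committed instructions. This is inferred from the first-order model's structure; the paper's full expression also applies the hidden-latency compensation it discusses.

```python
# Inferred baseline combination (sketch, not the authors' exact formula):
# serialized misses each cost about mem_lat cycles, spread over the run.
def cpi_dmiss(num_serialized_dmiss, mem_lat, total_num_instructions):
    return num_serialized_dmiss * mem_lat / total_num_instructions

print(cpi_dmiss(3, 200, 1000))  # -> 0.6
```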
Pending Data Cache Hits
- A pending data cache hit results from a memory reference to a cache block for which a request has already been initiated by another instruction
- Pending data cache hits are common due to spatial locality in applications
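The distinction above can be sketched with a toy trace classifier. The block size, latency model, and names (`BLOCK`, `classify`) are illustrative assumptions, not the authors' code: a missed block is "in flight" for `mem_lat` cycles, and any reference to it during that window is a pending hit rather than a new miss.

```python
BLOCK = 64  # cache block size in bytes (an assumption, not from the paper)

def classify(trace, mem_lat=200):
    """trace: list of (cycle, byte_address); returns one label per access."""
    in_flight = {}  # block number -> cycle its fill completes
    filled = set()  # blocks already resident in the cache
    labels = []
    for cycle, addr in trace:
        blk = addr // BLOCK
        # Promote any outstanding fills that have completed by now.
        for b in [b for b, t in in_flight.items() if t <= cycle]:
            filled.add(b)
            del in_flight[b]
        if blk in filled:
            labels.append("hit")
        elif blk in in_flight:
            labels.append("pending hit")  # request already initiated
        else:
            labels.append("miss")
            in_flight[blk] = cycle + mem_lat
    return labels

# Two loads to the same block: a miss followed by a pending hit.
print(classify([(0, 0), (1, 4), (5, 4096)]))
# -> ['miss', 'pending hit', 'miss']
```

A timing-unaware cache simulator lacks the `cycle` input entirely, which is exactly why it cannot produce the "pending hit" label.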
Pending Data Cache Hits
- A (traditional) cache simulator cannot identify pending cache hits due to the lack of timing information
[Figure: for the sequence "ld r2, (0)r1; ld r3, (4)r1; add r4, r2, r5; add r6, r3, r5; ld r7, (0)r6", the timing-unaware simulator records L1 and L2 miss/hit outcomes without recognizing that the second load references a block already in flight]
Pending Data Cache Hits
[Figure: dependence graph of i1-i9 for the sequence "ld r2, (0)r1; ld r3, (4)r1; add r4, r2, r5; add r6, r3, r5; ld r7, (0)r6; or r8, r4, r5; add r9, r5, r7; or r10, r8, r9"; not considering pending hits, the profile performs num_serialized_D$miss += 1, a wrong value]
Modeling Pending Hits
[Figure: the same dependence graph with cache outcomes annotated: i1 and i5 miss, while i2 is a pending hit on i1's block; considering pending hits, the profile performs num_serialized_D$miss += 2, the correct value, with the pending hit's remaining latency modeled by CPI_base]
- Error is reduced from 43.5% to 27.5% with the best fixed-cycle compensation for miss latency
Compensating Overestimate [K&S ISCA'04, Karkhanis PhD Thesis]
- Part of the latency of a miss may be hidden
- When a miss issues, it can be at (or near) the commit side, the middle, or the fetch side of the ROB; the closer the miss is to the fetch side, the more older instructions ahead of it can commit during the miss and hide its latency
[Figure: a ROB stretching from fetch to commit, with the missing load shown at each of the three positions]
Compensating Overestimate
n Fixed-cycle compensation used by prior work is not accurate for all the benchmarks we study.
n We propose to compensate the overestimate using the average distance between consecutive misses.
Compensating Overestimate
[Figure: a ROB from fetch to commit containing two ld misses; over 70% of instructions can be used to hide miss latency in pointer chasing]
- dist: average distance between two consecutive misses (the distance between two misses is saturated at the size of the ROB)
- Error is reduced from 15.5% (best fixed-cycle compensation) to 10.3%
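The `dist` statistic above can be sketched as follows; the function name and the treatment of a trace with fewer than two misses are illustrative assumptions, and only the distance computation itself is shown, not how the model converts it into hidden cycles.

```python
# Sketch: average distance between consecutive D$ misses, with each
# individual distance saturated at the ROB size, as the slide describes.
def average_miss_distance(miss_positions, rob_size):
    """miss_positions: committed-instruction indices of the D$ misses."""
    if len(miss_positions) < 2:
        return rob_size  # assumption: an isolated miss spans the full ROB
    gaps = [min(b - a, rob_size)
            for a, b in zip(miss_positions, miss_positions[1:])]
    return sum(gaps) / len(gaps)

# Gaps of 3 and 47 instructions; the second saturates at ROB size 8.
print(average_miss_distance([0, 3, 50], rob_size=8))  # -> 5.5
```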
Modeling a limited number of MSHRs
n The model thus far has assumed that the number of outstanding cache misses supported is unlimited.
n We propose a technique to model a limited number of outstanding cache misses supported.
Modeling a limited number of MSHRs
- We stop a profile step and update num_serialized_D$miss when the number of misses analyzed equals the number of Miss Status Holding Registers (MSHRs)
[Figure: profiling i1-i16 with ROBsize = 8 and N_MSHR = 4]
- With plain profiling, error reduces from 32.4% to 23.9% for 8 MSHRs
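The MSHR-limited profile step can be sketched as a window generator; the helper name and window bookkeeping are illustrative, not the authors' code. A window closes either when it reaches the ROB size or as soon as it has accumulated N_MSHR misses, since a further miss cannot overlap with the ones already outstanding.

```python
# Sketch of MSHR-limited plain profiling: cut each profile window at the
# ROB size or at the MSHR count, whichever is hit first.
def profile_windows(is_miss, rob_size, n_mshr):
    """is_miss: per-instruction miss flags; yields (start, end) windows."""
    start, misses = 0, 0
    for i, m in enumerate(is_miss):
        misses += m
        if misses == n_mshr or i - start + 1 == rob_size:
            yield (start, i + 1)
            start, misses = i + 1, 0
    if start < len(is_miss):
        yield (start, len(is_miss))  # trailing partial window

# Four leading misses fill the 4 MSHRs, ending the first window early.
print(list(profile_windows([1, 1, 1, 1, 0, 0, 0, 0], rob_size=8, n_mshr=4)))
# -> [(0, 4), (4, 8)]
```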
Start-with-a-miss (SWAM) Profiling
- Each profile step starts with a cache miss
[Figure: instructions i1-i16 partitioned under plain profiling (baseline), sliding profiling, and SWAM profiling]
- Sliding profiling does not improve accuracy much but slows the model down significantly
- SWAM reduces the error from 29.2% (plain) to 10.3%
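SWAM window selection can be sketched as below (an illustrative helper, not the authors' code): instead of fixed back-to-back windows, each ROB-sized window begins at the next cache miss, mirroring how a full ROB stalls behind a blocking miss.

```python
# Sketch of SWAM: advance to the next miss, then open a ROB-sized window
# there; instructions between windows that never miss are skipped over.
def swam_windows(is_miss, rob_size):
    """is_miss: per-instruction miss flags; yields (start, end) windows."""
    i, n = 0, len(is_miss)
    while i < n:
        if is_miss[i]:
            yield (i, min(i + rob_size, n))
            i += rob_size
        else:
            i += 1

# The first miss is at i3 (index 2), so the window is [2, 10), not [0, 8).
print(list(swam_windows([0, 0, 1] + [0] * 13, rob_size=8)))  # -> [(2, 10)]
```

Aligning every window to a miss is what recovers most of the accuracy that fixed window boundaries lose, without the cost of sliding a window across every instruction.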
SWAM-MLP
[Figure: a SWAM profile window i1-i8 with ROBsize = 8 and N_MSHR = 4; SWAM performs num_serialized_miss += 4, a wrong value, while SWAM-MLP, which accounts for the memory-level parallelism among the misses, performs num_serialized_miss += 2, the correct value]
- Error drops from 12.8% (SWAM) to 9.2% (SWAM-MLP) for 8 MSHRs
- Error drops from 23.2% (SWAM) to 9.9% (SWAM-MLP) for 4 MSHRs
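One reading of the slide's example (8 misses, 4 MSHRs, contribution of 2 rather than 4): misses that can share the MSHRs overlap, so a window with k independent misses contributes about ceil(k / N_MSHR) serialized miss groups. This is an interpretation of the example, hedged, not the authors' exact formulation.

```python
import math

# Sketch of the SWAM-MLP count: overlappable misses are grouped by MSHR
# capacity, and only one serialized latency is charged per group.
def serialized_groups(num_misses, n_mshr):
    return math.ceil(num_misses / n_mshr)

print(serialized_groups(8, 4))  # -> 2
```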
Methodology
- We modified SimpleScalar and used a set of memory-intensive benchmarks from the SPEC 2000 and Olden suites whose cache misses per thousand instructions (MPKI) exceed 10 for our cache configurations
Results
- Unlimited number of outstanding cache misses supported
[Figure: per-benchmark error (%) for Plain w/o PH, Plain w/ PH, and SWAM w/ PH across app, art, eqk, luc, swm, mcf, em, hth, prm, lbm, and the mean; mean error drops from 39.7% to 29.2% to 10.3%]
Results
- Modeling a limited number of MSHRs
  - SWAM-MLP further decreases the error of SWAM from 12.8% to 9.2% (8 MSHRs, e.g., Prescott), and from 23.2% to 9.9% (4 MSHRs)
[Figure: per-benchmark error (%) for Plain w/o MSHR, Plain w/ MSHR, SWAM, and SWAM-MLP across app, art, eqk, luc, swm, mcf, em, hth, prm, lbm, and the mean]
- Our improvements do not slow down the first-order model
Current / Future Work
- Analytically modeling the performance impact of hardware data prefetching (via the pending hits caused by data prefetches)
- Analytically modeling the throughput of fine-grain multithreaded microprocessors (e.g., Sun's Niagara)
- Extending the analytical model to CMPs with superscalar, out-of-order execution cores
- Analytically modeling the performance of SMT superscalar cores
Conclusions
- Modeling Long Latency Memory Systems
  - Pending Data Cache Hits
  - Accurate Hidden Miss Latency Estimation
  - Modeling MSHRs
  - SWAM & SWAM-MLP
- Overall, our improvements reduce the error of our baseline from 39.7% to 10.3%; the error stays below 10% when modeling MSHRs. Our improvements do not slow down the first-order model.