Access Region Locality for High-Bandwidth Processor Memory System Design
Sangyeun Cho Samsung/U of Minnesota
Pen-Chung Yew U of Minnesota
Gyungho Lee Iowa State U
32nd Annual International Symposium on Microarchitecture
MICRO-32, November 17, 1999
Cho, Yew, and Lee 4
Wide-Issue Superscalar Processors
[Figure: pipeline diagram. Stages: Fetch, Decode, Dispatch, Complete, Retire. Structures: Instruction/Decode Buffer, Dispatch Buffer, Reservation Stations, Load/Store Units, Store Buffer, Reorder/Completion Buffer, data cache ($$).]
Current Generation
– Alpha 21264
– Intel’s Merced
Future Generation (IEEE Computer, Sept. ‘97)
– Superspeculative Processors
– Trace Processors
Multi-Ported Data Cache
Replicated Cache
– Alpha 21164
Time-Division Multiplexed Cache
– Alpha 21264
Interleaved Cache
– MIPS R10K
[Figure: three multi-porting schemes fed from Fetch. Replicated: cache copies $$ X and $$ Y serving Store, Load, Load. Time-division multiplexed: one cache serving Load/Store 1 and Load/Store 2. Interleaved: banks $$ Even and $$ Odd serving the "Even" and "Odd" Load/Store.]
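A minimal sketch of why an interleaved (even/odd) cache does not always deliver two accesses per cycle. The line size and the choice of line-address parity as the bank selector are assumptions for illustration, not details from the slides.

```python
LINE_BYTES = 32  # assumed cache line size


def bank(addr: int) -> int:
    """Select the bank from the parity of the line address: 0 = Even, 1 = Odd."""
    return (addr // LINE_BYTES) & 1


def conflicts(addr_a: int, addr_b: int) -> bool:
    """Two same-cycle accesses conflict when they map to the same bank."""
    return bank(addr_a) == bank(addr_b)
```

Under these assumptions, accesses to addresses 0 and 32 can proceed together, while accesses to 0 and 64 compete for the Even bank.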
Window Logic Complexity
Pointed out as the major hardware complexity (Palacharla et al., ISCA ‘97)
More severe for Memory window
– Difficult to partition
– Thick network needed to connect RSs and LSUs
[Figure: Dispatch feeds the Reservation Stations, which connect through a single wide Network to four LSUs sharing one data cache ($$).]
Data Decoupling: What is it?
A Divide-and-Conquer approach
– Instruction stream partitioned before entering RS
– Narrower networks
– Fewer ports to each cache
– Needs mechanism for proper partitioning
[Figure: Dispatch feeds the Reservation Stations. Network "X" connects two LSUs to cache $$ "X"; Network "Y" connects two LSUs to cache $$ "Y".]
Data Decoupling: Operating Issues
Memory Stream Partitioning
– Hardware classification
– Compiler classification
Load Balancing
– Enough instructions in different groups?
– Are they well interleaved?
[Figure: Dispatch, marked with a "?", choosing which Reservation Stations each instruction goes to.]
Access Region: Overview
Access Region R
– R = (L, U)
– L: Lower Bound on Addr. U: Upper Bound on Addr.
For R = (A, B) and Q = (C, D), if (D<A) or (B<C),
– Region R and Q are said to be exclusive or non-overlapping.
Locations in exclusive regions are independent.
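The exclusivity test can be sketched as below. This assumes the slide's letters denote R = (A, B) and Q = (C, D) and that bounds are inclusive; the class name, function name, and example addresses are hypothetical.

```python
from typing import NamedTuple


class AccessRegion(NamedTuple):
    lower: int  # L: lower bound on address
    upper: int  # U: upper bound on address (treated as inclusive, an assumption)


def exclusive(r: AccessRegion, q: AccessRegion) -> bool:
    """(D < A) or (B < C): Q ends before R starts, or R ends before Q starts."""
    return q.upper < r.lower or r.upper < q.lower


# Hypothetical address ranges, for illustration only.
data = AccessRegion(0x1000_0000, 0x1000_FFFF)
stack = AccessRegion(0x7FFF_0000, 0x7FFF_FFFF)
```

Here `exclusive(data, stack)` holds, so no location in one region can alias a location in the other.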
Access Region and Mem. Instructions
Partitioning Memory Space
One way of partitioning memory space into regions:
– Data Region / Heap Region / Stack Region
This work assumes this partitioning.
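A sketch of classifying an address into one of the three regions. The boundary addresses are made up for illustration; a real layout depends on the ABI, the OS, and the program.

```python
from enum import Enum


class RegionKind(Enum):
    DATA = "Data"
    HEAP = "Heap"
    STACK = "Stack"


# Hypothetical boundaries, not taken from the slides.
HEAP_START = 0x1001_0000   # assumed end of static data, start of heap
STACK_LIMIT = 0x7000_0000  # assumed lowest address the stack may reach


def classify(addr: int) -> RegionKind:
    """Map an address to Data, Heap, or Stack under the assumed layout."""
    if addr >= STACK_LIMIT:
        return RegionKind.STACK
    if addr >= HEAP_START:
        return RegionKind.HEAP
    return RegionKind.DATA
```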
Partitioning Memory Space, Cont’d
Many accesses are toward Data and Stack regions. Some programs don’t access the Heap region at all.
[Figure: bar chart (y-axis in %, 0 to 40) of Data, Heap, and Stack accesses for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg., and FP.Avg.]
Partitioning Memory Space, Cont’d
Accesses to Data region are less bursty than others. Programs such as ijpeg have clustered region accesses.
[Figure: Std. Dev. / Avg. (y-axis 0.00 to 3.00) for the Data, Heap, and Stack regions, per benchmark: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid. Window Size = 32.]
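One plausible reading of the Std. Dev. / Avg. metric, sketched under assumptions: the reference stream is cut into consecutive windows, accesses to one region are counted per window, and the standard deviation of those counts is divided by their mean. The slides do not spell out the exact procedure, so the function below is illustrative only.

```python
from statistics import mean, pstdev


def burstiness(trace, region, window=32):
    """Std. dev. / avg. of per-window access counts for one region.

    `trace` is a sequence of region labels, one per memory reference.
    Returns 0.0 when the region is never accessed or the trace is too short.
    """
    counts = [
        sum(1 for r in trace[i:i + window] if r == region)
        for i in range(0, len(trace) - window + 1, window)
    ]
    if not counts:
        return 0.0
    avg = mean(counts)
    return pstdev(counts) / avg if avg else 0.0


steady = ["D", "S"] * 64                # Data and Stack perfectly interleaved
bursty = (["D"] * 32 + ["S"] * 32) * 2  # Data arrives in clusters
```

With these inputs the steady trace scores 0.0 for Data and the clustered trace scores 1.0; a lower value means more constant demand on that region's cache.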
Partitioning Memory Space, Cont’d
W/ a large window, Stack accesses become less bursty. Data and Stack regions have quite stable, constant demand.
[Figure: Std. Dev. / Avg. (y-axis 0.00 to 3.00) for the Data, Heap, and Stack regions, per benchmark: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid. Window Size = 64.]
Partitioning Memory Space, Cont’d
[Figure: stacked bars (0 to 1) breaking down memory instructions by the set of regions they access (D, H, S, D/H, D/S, H/S, D/H/S) for go (099), m88ksim (124), gcc (126), compress (129), li (130), ijpeg (132), perl (134), vortex (147), tomcatv (101), swim (102), su2cor (103), mgrid (107), Int.Avg, and FP.Avg. Bar labels, unattributed in this transcript: 1.9%, 1.8%, 51.1%, 50.4%, 1.6%, 16.2%, 45.4%, 31.6%.]
Many instructions access a single region (~98%). Multi-region-accessing instructions account for 0 to 9.6% of dynamic memory references.
Access Region Locality
“A memory reference instruction typically accesses a single region at run time”
– Only about 2% of all static memory instructions access more than a single region.
“(Thus) the region it accesses is highly predictable”
– Simple predictors with a small look-up table achieve high prediction accuracy.
Predicting Regions: Unlimited Case
One predictor per memory instruction
Predictor types:
– 1-bit history saver (0: Data, 1: Stack)
– 2-bit saturating counter
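The two predictor types can be sketched as follows. The class names, the initial counter value, and the threshold of the 2-bit counter are assumptions; the slides only name the schemes.

```python
class OneBitPredictor:
    """Remembers the last region seen: 0 = Data, 1 = Stack."""

    def __init__(self):
        self.last = 0

    def predict(self) -> int:
        return self.last

    def update(self, actual: int) -> None:
        self.last = actual


class TwoBitPredictor:
    """Saturating counter in 0..3; predicts Stack when the counter is 2 or 3."""

    def __init__(self):
        self.counter = 0

    def predict(self) -> int:
        return 1 if self.counter >= 2 else 0

    def update(self, actual: int) -> None:
        if actual:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(predictor, outcomes) -> float:
    """Fraction of accesses whose region was predicted before being observed."""
    hits = 0
    for actual in outcomes:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(outcomes)
```

On a stream of ten Stack accesses, the 1-bit scheme in this sketch mispredicts once and the 2-bit counter twice, because the counter needs two updates to cross its threshold.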
Predicting Regions: Adding Context
Run-time context
– Caller’s ID (CID): in Link Register
– Global Branch History (GBH)
– Hybrid of above
Predicting Regions: Utilizing Static Info.
Some instructions’ access regions are revealed through architecture and compiler conventions:
– Use of Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
– Use of Global Pointer ($GP) suggests that the region is non-Stack.
– For others, assume non-Stack.
Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
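The static rules above can be sketched as a decode-time check on the base register. The register spellings and the extra "revealed" flag are illustrative assumptions.

```python
STACK, NON_STACK = 1, 0


def static_region(base_reg: str):
    """Return (region, revealed) from the base register of a load or store.

    `revealed` is True when the register convention itself gives the region,
    and False when non-Stack is only the default assumption.
    """
    reg = base_reg.lower()
    if reg in ("$sp", "$fp"):
        return STACK, True
    if reg == "$gp":
        return NON_STACK, True
    return NON_STACK, False
```

Instructions for which `revealed` is True would not need a predictor entry, which is one way to read the later remark about saving table space.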
Region Pred. Result: Unlimited Case
[Figure: stacked bars of Correctly Classified Instr. (0% to 100%), split into "Known from Instr." and "Predicted", per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg) for the Static, Simple 1-bit, w/ GBH, w/ CID, and w/ Hybrid schemes.]
1-bit predictors do better than 2-bit predictors (not shown). Hybrid context bits achieve the best prediction rate on average.
Predicting Regions: Limited-Size ARPT
Low n bits of PC, XOR’ed with hybrid context bits, are used to index into the Access Region Prediction Table (ARPT):
– Table entries initialized to 0’s
– 1 to denote stack access
– Decoding information exploited to save ARPT space
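A minimal sketch of a limited-size ARPT with 1-bit entries. How the hybrid context bits are formed is not given here, so `context` is passed in as an opaque integer; dropping the two low PC bits assumes 4-byte aligned instructions. Both choices are assumptions.

```python
class ARPT:
    """Access Region Prediction Table: 2**n_bits entries, 0 = non-stack, 1 = stack."""

    def __init__(self, n_bits: int):
        self.mask = (1 << n_bits) - 1
        # One byte per entry for simplicity; hardware would store one bit each.
        self.table = bytearray(1 << n_bits)  # all entries start at 0

    def _index(self, pc: int, context: int) -> int:
        # Low n bits of the (word-aligned) PC XOR'ed with the context bits.
        return ((pc >> 2) ^ context) & self.mask

    def predict(self, pc: int, context: int = 0) -> int:
        return self.table[self._index(pc, context)]

    def update(self, pc: int, actual: int, context: int = 0) -> None:
        self.table[self._index(pc, context)] = actual


arpt = ARPT(15)               # 32K entries; at 1 bit each that is 4 KB
arpt.update(0x0040_0100, 1)   # this load was observed to access the stack
```

After the update, `arpt.predict(0x0040_0100)` returns 1, while the same PC under a different context value indexes a different entry.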
Region Prediction Result: ARPT
[Figure: Prediction Rate (y-axis 98% to 100%) per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg), w/ and w/o Compiler Hints, for Unlimited, 8 KB, 4 KB, 2 KB, and 1 KB ARPTs.]
Over 99.9% accuracy w/ 4 KB or larger ARPT w/o compiler hints. Compiler hints relieve pressure due to smaller sizes.
Dynamic Data Decoupling, Cont’d
Dynamically predicting access regions to classify memory instructions:
– Utilize Access Region Prediction Table (ARPT).
– Utilize any region information revealed through instruction decoding.
Dispatching partitioned memory instructions into separate memory pipelines, connected to separate caches.
Dynamically Verifying Region Prediction
– Let TLB (i.e., page table) contain verification information such that memory access is reissued on mis-predictions.
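The verification step can be sketched as below, assuming each TLB entry carries one region bit per page and that a mismatch sends the access to the other pipeline. The page size, the dictionary-based TLB, and the function name are illustrative, not from the slides.

```python
PAGE_SHIFT = 12  # assumed 4 KB pages


def verify_access(addr: int, predicted: int, tlb: dict):
    """Compare the predicted region with the region bit stored in the TLB.

    Returns (actual_region, reissued). `tlb` maps a virtual page number to a
    region bit (0 = non-stack, 1 = stack).
    """
    actual = tlb[addr >> PAGE_SHIFT]
    reissued = predicted != actual  # misprediction: replay in the right pipeline
    return actual, reissued


# Hypothetical page-to-region mapping.
tlb = {0x7FFFF: 1, 0x10000: 0}
```

A stack address predicted as non-stack, such as `verify_access(0x7FFF_F010, 0, tlb)`, comes back as `(1, True)`, meaning the access is reissued.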
Base Machine Model
Issue Width         16
Registers           32 GPRs / 32 FPRs
ROB / LSQ Size      256 / 128
Functional Units    Integer: 16 ALUs, 4 MULT/DIV Units
                    FP: 16 ALUs, 4 MULT/DIV Units
Value Pred.         16K-Entry Stride-Based Predictor
L1 D-Cache          64 KB, 2-Way Set-Associative, 2-Cycle Access
L2 D-Cache          512 KB, 4-Way Set-Associative, 12-Cycle Access
Memory              50-Cycle Access, Fully Interleaved
LV-Cache            4 KB, Direct-Mapped, 1-Cycle Access
ARPT                32K 1-Bit Entries
I-Cache             Perfect (100% Hit) Cache, 1-Cycle Access
Branch Prediction   Perfect (100% Correct) Prediction
Instruction Lat.    Same as MIPS R10000
Overall Performance
[Figure: performance over the (2+0) configuration (y-axis 1.00 to 2.00) per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int.Avg, FP.Avg) for the configurations (3+0), 2 cycle; (3+0), 3 cycle; (4+0), 3 cycle; (2+2); (2+3); (3+3); (16+0), 2 cycle. Per-bar values are not recoverable from this transcript.]
Conclusions
Access Region Locality says
– Memory instructions access few regions at run time.
– Accessed regions are accurately predictable.
Access Region Locality leads to Access Region Prediction techniques.
Access Region Prediction allows Dynamic Data Decoupling, shown to achieve comparable performance to very wide data caches.
Impact of LVC Size
2KB and 4KB LVCs achieve high hit rates (~99.9%).
Set associativity less important if LVC is 2KB or more.
Small, simple LVC works well.
[Figure: Miss Rate (%) (y-axis 0 to 9) for LVC sizes 0.5K, 1K, 2K, and 4K; series: 126.gcc, Avg., 129.compress. Bar labels, unattributed in this transcript: 8.42, 3.98, 1.12, 2.30, 0.73, 0.44, 0.19, 0.09, 0.02, 0.00, 0.00, 0.00.]