NC STATE UNIVERSITY
Center for Efficient, Scalable, and Reliable Computing
Department of Electrical & Computer Engineering
North Carolina State University
EXACT: Explicit Dynamic-Branch Prediction with Active Updates
Muawya Al-Otoom, Elliott Forbes, Eric Rotenberg
Int’l Conference on Computing Frontiers, May 16-19, 2010. © Eric Rotenberg
Branch Prediction and Performance
1K-entry instruction window
128-entry issue queue (similar to Nehalem)
4-wide fetch/issue/retire
16 KB gshare
Randomly remove 0% to 100% of mispredictions
[Figure: IPC (0 to 4) versus % of remaining branch mispredicts removed, for bzip2, gzip, and twolf. Plot annotations: 89%, 91%, 95%.]
Example
while ( . . . ) {
    . . .
    for (i = 0; i < 16; i++) {
        if (a[i] > 10) { . . . }
        else { . . . }
    }
    . . .
}

   LOAD   r5 = a[i]
X: BRANCH r5 <= 10
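Example (cont.)
Values loaded by LOAD and outcomes of BRANCH X (1 where a[i] <= 10):
i          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
a[i]       4  11  15   2   7   1   3  52   9   3   8   5  55   8  33  18
outcome    1   0   0   1   1   1   1   0   1   1   1   1   0   1   0   0
[Diagram: the PC of branch X and the GHR (0 0 1) are combined to index the PHT; the entry shown holds 0.]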
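The history-indexed lookup in this example can be sketched in a few lines. This is a minimal illustration, not the predictor evaluated in the talk: it assumes 2-bit saturating counters, records only outcomes of branch X in the GHR, and leaves the PC out of the index because a single static branch has a constant PC. The function name and the `passes` parameter are hypothetical. Because of these simplifications, the contexts it produces need not match the ones drawn on the slides.

```python
def steady_state_mispredicts(values, history_bits=3, passes=50):
    """Mispredictions of branch X in the final pass over `values`.

    Branch X is taken when a[i] <= 10. The pattern history table (PHT) holds
    2-bit saturating counters and is indexed by the global history register
    (GHR) alone, since the PC of the single branch X is constant here.
    """
    mask = (1 << history_bits) - 1
    pht = [1] * (1 << history_bits)  # every counter starts weakly not-taken
    ghr = 0
    mispredicts = 0
    for _ in range(passes):
        mispredicts = 0
        for value in values:
            taken = value <= 10
            if (pht[ghr] >= 2) != taken:
                mispredicts += 1
            # Passive update: train the counter, then shift in the outcome.
            pht[ghr] = min(3, pht[ghr] + 1) if taken else max(0, pht[ghr] - 1)
            ghr = ((ghr << 1) | int(taken)) & mask
    return mispredicts


A = [4, 11, 15, 2, 7, 1, 3, 52, 9, 3, 8, 5, 55, 8, 33, 18]
print(steady_state_mispredicts(A, history_bits=3))
print(steady_state_mispredicts(A, history_bits=16))
```

Under these assumptions the 3-bit run keeps mispredicting on every pass over this array, while the 16-bit run settles to zero mispredictions.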
Good Scenario #1
• a[4] has unique GHR context.
• GHR implicitly identifies a[4].
• Dedicated prediction for a[4].
[Diagram: same array and outcomes as in Example (cont.). The PC of X and the GHR (0 0 1) index a PHT entry (0) labeled "dedicated prediction for a[4]".]
Good Scenario #2
• a[6], a[10] have same GHR context.
• GHR does not distinguish a[6], a[10].
• Shared prediction for a[6], a[10].
• Fortunately, they have same outcome.
[Diagram: same array and outcomes as in Example (cont.). The PC of X and the GHR (1 0 1) index a PHT entry (1) labeled "shared prediction for a[6], a[10]".]
Bad Scenario #1
• a[6], a[15] have same GHR context.
• GHR does not distinguish a[6], a[15].
• Shared prediction for a[6], a[15].
• Problem: they have different outcomes.
[Diagram: same array and outcomes as in Example (cont.). The PC of X and the GHR (1 0 1) index a PHT entry (?) labeled "shared prediction for a[6], a[15]".]
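One way to make this scenario concrete is to group dynamic branch instances by their history context and flag any context whose instances disagree. The sketch below is an illustration under stated assumptions: the history holds only prior outcomes of the same branch, the outcome sequence is hypothetical (it is not the array from the slides), and the function name is made up for this example.

```python
from collections import defaultdict


def conflicting_contexts(outcomes, history_bits):
    """Map each history context to the instances that share it, keeping only
    contexts whose instances have different outcomes (Bad Scenario #1)."""
    groups = defaultdict(list)
    for i in range(history_bits, len(outcomes)):
        context = tuple(outcomes[i - history_bits:i])
        groups[context].append(i)
    return {
        context: indices
        for context, indices in groups.items()
        if len({outcomes[i] for i in indices}) > 1
    }


# Hypothetical outcome sequence, chosen so that one 3-bit context is ambiguous.
hypothetical = [1, 0, 1, 1, 1, 0, 1, 0]
print(conflicting_contexts(hypothetical, history_bits=3))
print(conflicting_contexts(hypothetical, history_bits=4))
```

With 3 bits of history, instances 3 and 7 share the context (1, 0, 1) but resolve differently; with 4 bits the ambiguity disappears for this particular sequence.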
Bad Scenario #1 (cont.)
Indirect solution
• Use longer history.
• GHR now distinguishes a[6], a[15].
[Diagram: same array and outcomes as in Example (cont.). Two PHT lookups with a longer GHR: the histories for a[6] and a[15] now differ and select separate entries, labeled "dedicated prediction for a[6]" and "dedicated prediction for a[15]".]
Bad Scenario #1 (cont.)
Indirect solution
• Use longer history.
• GHR now distinguishes a[6], a[15].
• Ambiguity not eradicated: a[10], a[15].
[Diagram: same array and outcomes as in Example (cont.). The PC of X and the longer GHR index a PHT entry (?) labeled "shared prediction for a[10], a[15]".]
Direct solution
• Use address of element.
• Dedicated predictions for different elements.
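The direct solution can be sketched as a table keyed by the branch PC and the address of the element it tests. This is an illustrative model only: `BASE`, `ELEM_SIZE`, and `BRANCH_X_PC` are hypothetical values, each entry simply remembers the last outcome, and the element address is assumed to be known at prediction time (the implementation challenges later in the talk address exactly that assumption).

```python
BASE = 0x1000         # hypothetical base address of a[]
ELEM_SIZE = 4         # hypothetical element size in bytes
BRANCH_X_PC = 0x0808  # hypothetical PC of branch X


def mispredicts_per_pass(values, passes=2):
    """Count mispredictions per pass with one dedicated entry per element."""
    table = {}  # (PC, element address) -> last outcome seen
    results = []
    for _ in range(passes):
        mispredicts = 0
        for i, value in enumerate(values):
            key = (BRANCH_X_PC, BASE + i * ELEM_SIZE)
            taken = value <= 10
            if table.get(key, False) != taken:  # unseen entries predict not-taken
                mispredicts += 1
            table[key] = taken
        results.append(mispredicts)
    return results


A = [4, 11, 15, 2, 7, 1, 3, 52, 9, 3, 8, 5, 55, 8, 33, 18]
print(mispredicts_per_pass(A))  # prints [10, 0]
```

The first pass pays for training (ten of the sixteen elements are taken and the untrained table predicts not-taken); after that no two elements share an entry, so the second pass has no mispredictions as long as the array is unchanged.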
Bad Scenario #2
STORE a[4] = 6
[Diagram: same array as in Example (cont.), with a[4] overwritten by 6 (new outcome 1). The PC of X and the GHR (0 0 1) index a PHT entry that still holds 0, labeled "stale prediction for a[4]".]
Conventional passive updates
• Predictor’s entry is stale after store
• Retrained after suffering misprediction
Solution: Active updates
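The difference between passive and active updates can be shown with a single predictor entry. The sketch below is a toy model under assumptions: the element address and its initial value 33 are hypothetical, the predictor entry is a plain boolean, and the store is applied to the predictor instantly when active updates are enabled.

```python
def branch_taken(value):
    """Outcome of branch X for a loaded value (taken when value <= 10)."""
    return value <= 10


def mispredicts_after_store(active_updates, visits=3):
    """Mispredictions on `visits` re-encounters of the branch after a store."""
    addr = 0x1010                                   # hypothetical element address
    memory = {addr: 33}                             # hypothetical initial value
    predictor = {addr: branch_taken(memory[addr])}  # trained entry: not-taken

    memory[addr] = 6                                # the store changes the outcome
    if active_updates:
        predictor[addr] = branch_taken(6)           # store updates the entry directly

    mispredicts = 0
    for _ in range(visits):
        actual = branch_taken(memory[addr])
        if predictor[addr] != actual:
            mispredicts += 1
        predictor[addr] = actual                    # passive update at retirement
    return mispredicts


print(mispredicts_after_store(active_updates=False))  # prints 1
print(mispredicts_after_store(active_updates=True))   # prints 0
```

With passive updates alone, the entry is corrected only after the first misprediction; with the active update it is already correct when the branch is next encountered.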
Big Picture
• Branch predictor should mirror a program’s objects
[Diagram: Conventional versus Proposed, each showing the branch predictor and memory.]
Big Picture
• Branch predictor should mirror a program’s objects
• Branch predictor should mirror changes as they happen
[Diagram: Conventional versus Proposed, each showing the branch predictor, memory, and a Store.]
Characterize Mispredictions
• Bad Scenario #1: Measure how often global branch history does not distinguish different dynamic branches that have different outcomes
• Bad Scenario #2: Measure how often stores cause stale predictions
• Evaluate for two history-based predictors
  – Very large gselect
  – Very large L-TAGE
Characterize Mispredictions
Bad Scenario #1
Characterize Mispredictions
Bad Scenario #2
Note: Predominance of Bad Scenario #1 may obscure occurrences of Bad Scenario #2.
Problems and Solutions
Two problems:
• {PC, global branch history} does not always distinguish different dynamic branches that have different outcomes
• Stores to memory change branch outcomes
Two solutions:
• Explicitly identify dynamic branches to provide dedicated predictions for them (EX)
  – branch ID = hash(PC, load addresses)
• Stores “actively update” the branch predictor (ACT)
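The slide gives the branch ID only as hash(PC, load addresses) and does not specify the hash function. The sketch below fills that in with a simple shift-and-XOR mix folded down to an index width, purely as an assumption for illustration; the function name, the 32-bit intermediate width, and the `index_bits` parameter are all hypothetical.

```python
def branch_id(pc, load_addresses, index_bits=16):
    """Illustrative branch ID: mix the PC with each load address, then fold.

    The shift-and-XOR mixing is an assumption; the source only states
    branch ID = hash(PC, load addresses).
    """
    h = pc
    for addr in load_addresses:
        h = ((h << 5) ^ (h >> 2) ^ addr) & 0xFFFFFFFF  # keep 32 bits
    mask = (1 << index_bits) - 1
    return (h ^ (h >> index_bits)) & mask              # fold to the index width


# Same static branch, different elements: different IDs for these inputs.
print(hex(branch_id(0x0808, [0x1000])))  # prints 0x1303
print(hex(branch_id(0x0808, [0x1004])))  # prints 0x1307
```

Any hash with this shape gives each (branch, element) pair its own ID most of the time; distinct pairs can still collide once the result is folded to a finite index.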
[Diagram labels: Bad Scenario #1, Bad Scenario #2]
• Load-dependent branches use explicit predictor
  – Index with branch ID (EX)
  – Index with branch ID and perform active updates (EXACT)
• Other branches use the default predictor (e.g., L-TAGE)
3 Implementation Challenges
1. Indexing the explicit predictor
  – Branch ID unknown at fetch time
  – Loads are unlikely to have computed their addresses by the time the branch is fetched
2. Large explicit predictor
  – Many different IDs contribute to mispredictions
  – Predictor size normally limited by cycle time
3. Large active update unit
  – Too large to implement with dedicated storage
Implementation Challenge #1: Indexing the Explicit Predictor
Insight: sequences of IDs repeat due to traversing data structures
Use ID of a prior retired branch at a fixed distance N
[Figure: two plots of misprediction rate % (0% to 12%) versus distance N (0 to 30) for bzip2, crafty, gzip, and twolf, one for each case below. A plot annotation reads "use ID of branch itself".]
IDEAL: Assumes ID of Nth branch away is always available.
REAL: Use explicit predictor only if Nth branch away is retired. Otherwise use default predictor.
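The IDEAL and REAL policies can be contrasted with a small model of a repeated traversal. Everything here is an assumption made for illustration: the explicit table remembers the last outcome per index, the default predictor is a stand-in that always predicts not-taken, and `in_flight[i]` is a hypothetical count of older branches that have not yet retired when branch i is fetched.

```python
def run_traversals(ids, outcomes, distance, in_flight=None, passes=2):
    """Return (mispredicts, default_uses) for each pass over the branch sequence.

    The explicit table is indexed by the ID of the branch `distance` positions
    earlier. `in_flight[i]` is a hypothetical count of older, not yet retired
    branches when branch i is fetched; None models IDEAL indexing.
    """
    table = {}  # prior branch ID -> last outcome of the branch `distance` later
    stats = []
    for _ in range(passes):
        mispredicts = default_uses = 0
        for i, taken in enumerate(outcomes):
            prior = i - distance
            retired = in_flight is None or distance > in_flight[i]
            if prior >= 0 and retired and ids[prior] in table:
                prediction = table[ids[prior]]
            else:
                prediction = False  # stand-in default predictor: always not-taken
                default_uses += 1
            mispredicts += prediction != taken
            if prior >= 0:
                table[ids[prior]] = taken  # trained when the branch retires
        stats.append((mispredicts, default_uses))
    return stats


ids = [10, 11, 12, 13, 14, 15]                     # hypothetical branch IDs
outcomes = [True, False, True, True, False, True]  # hypothetical outcomes
print(run_traversals(ids, outcomes, distance=2))                     # IDEAL
print(run_traversals(ids, outcomes, distance=2, in_flight=[3] * 6))  # REAL
```

In the IDEAL run the second traversal mispredicts only where no prior ID exists. In the REAL run with three branches always in flight, a distance of 2 never reaches a retired branch, so every prediction falls back to the default predictor.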
Implementation Challenge #2: Large Explicit Predictor
• Need large explicit predictor with fast cycle time
• Indexing with prior retired branch IDs makes it easily pipelinable
• In general, pipelining is straightforward if the index does not depend on immediately preceding predictions
Implementation Challenge #2: Large Explicit Predictor
[Diagram: timeline from CYCLE 1 to CYCLE 7 in which ID1, ID2, and ID3 from the GBQ are each combined (+) with the BHR.]
Implementation Challenge #3: Large Storage for Active Updates
• Most benchmarks are tolerant of 400+ cycles of active update latency
• Large distance between stores and re-encounters of the branches they update
[Figure: misprediction rate % (0% to 12%) versus active update latency in cycles (0 to 1000) for bzip2, crafty, gzip, and twolf.]
Implementation Challenge #3: Large Storage for Active Updates
• Exploit “Predictor Virtualization” [Burcea et al.]
• Eliminate significant amount of dedicated storage
• Use small L1 table backed by full table in physical memory
• The full table in physical memory is transparently cached in the general-purpose memory hierarchy (e.g., L2$)
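Putting it all together
[Diagram: Processor Pipeline from Fetch to Retire, connected to ID Gen, the Explicit Predictor, the Default Predictor, and the Active Update Unit. Arrow labels: "past ID for predicting current branch", "passive updates", "stores", "active updates". Callouts 1, 2, and 3 mark parts of the design.]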
[Figure: gzip. Misprediction rate % (4% to 9%) versus cost in K-Bytes (log scale, 1 to 10000). Series: gshare, gshare+EXACT, gshare+EXACT (VIRT. SACT), L-TAGE, L-TAGE+EXACT, L-TAGE+EXACT (VIRT. SACT).]
Active Update Unit
• First proposal to use store instructions to update the branch predictor
• Store instructions might:
  – Change a branch outcome in the explicit predictor
  – Change a trip-count in the explicit loop predictor
• Two mechanisms required:
  – Convert store address into a predictor index
  – Convert store value into a branch-outcome or trip-count
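Active Update Example
0x0400  STORE 0x20 X
        . . .
0x0800  LOAD R5 X
0x0808  BLE R5, 0x30, target
[Diagram: the store address X indexes the Single-Address Conversion Table (SACT), whose entry holds 0xABCD and 0x0808; 0xABCD selects an entry in the Explicit Predictor. The PC 0x0808 indexes the Ranges Reuse Table (RRT), whose entry holds T min = 0x10, T max = 0x30, N min = 0x31, N max = 0x50. The store value 0x20 is compared against these ranges, yielding T or N.]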
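The two conversions in this example can be traced in code. The sketch below is one reading of the slide, not a description of the actual hardware: the SACT is modeled as a map from store address to (explicit predictor index, branch PC), the RRT as a map from branch PC to taken and not-taken value ranges, and the numeric address chosen for X is hypothetical.

```python
X = 0x7000  # hypothetical numeric address standing in for X

# SACT: store address -> (explicit predictor index, PC of the dependent branch)
sact = {X: (0xABCD, 0x0808)}
# RRT: branch PC -> value ranges known to give taken (T) or not-taken (N)
rrt = {0x0808: {"T": (0x10, 0x30), "N": (0x31, 0x50)}}
explicit_predictor = {}


def active_update(store_addr, store_value):
    """Apply a store to the explicit predictor; return the outcome written."""
    entry = sact.get(store_addr)
    if entry is None:
        return None                    # no tracked branch depends on this address
    index, pc = entry
    for outcome, (low, high) in rrt.get(pc, {}).items():
        if low <= store_value <= high:
            explicit_predictor[index] = outcome
            return outcome
    return None                        # value outside every known range


print(active_update(X, 0x20))          # prints T
print(hex(0xABCD), explicit_predictor[0xABCD])
```

Here the store of 0x20 falls in the taken range for BLE R5, 0x30, so the entry at index 0xABCD is set to T before the branch at 0x0808 is encountered again. A value outside both ranges leaves the entry unchanged in this sketch.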
Future Work
• Rich ISA support
  – Empower compiler or programmer to directly index into the explicit predictor, directly manage it, and directly manage the instruction fetch unit
  – Close gap between real and ideal indexing
  – More efficient hardware