predicting conditional branches with fusion-based hybrid predictors gabriel h. loh yale university...

Predicting Conditional Branches With Fusion-Based Hybrid Predictors

Gabriel H. Loh
Yale University
Dept. of Computer Science

Dana S. Henry
Yale University
Depts. of Elec. Eng. & Comp. Sci.

This research was funded by NSF Grant MIP-9702281

The Branch Prediction Problem

• 1 out of 5 instructions is a branch
• May require many cycles to resolve
  – P4 has 20 cycle branch resolution pipeline
  – Future pipeline depths likely to increase [Sprangle02]
• Predict branches to keep pipeline full

PC Compute Branch resolution

Bigger Predictors = More Accurate
(but bigger predictors = slower)

• Larger predictors tend to yield more accurate predictions
• Faster cycle times force smaller branch predictors
• Overriding predictor couples small, fast predictor with a large, multi-cycle predictor [Jiménez2000]
  – performs close to ideal large-fast predictor

Hybrid Predictors

• Wide variety of branch prediction algorithms available
• Hybrid combines more than one “stand-alone” or component predictor [McFarling93]:

[Diagram: P1, P2, Meta-Predictor, Final Prediction]
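A minimal sketch of the selection idea in the diagram above, assuming a single 2-bit saturating counter as the meta-predictor (real hybrids keep a table of such counters; the names `TwoBitCounter`, `select`, and `train_chooser` are hypothetical and not from the slides):

```python
class TwoBitCounter:
    """Saturating 2-bit counter; values 2 and 3 mean 'prefer P2'."""

    def __init__(self, value=1):
        self.value = value

    def predict(self):
        return self.value >= 2

    def update(self, up):
        # Saturate at 0 and 3 so one outlier does not flip a strong preference.
        self.value = min(3, self.value + 1) if up else max(0, self.value - 1)


def select(p1_pred, p2_pred, chooser):
    """Selection: the final prediction is exactly ONE component's prediction."""
    return p2_pred if chooser.predict() else p1_pred


def train_chooser(p1_pred, p2_pred, outcome, chooser):
    """Move the chooser toward whichever component was right.

    Nothing is learned when the components agree (assumed update policy).
    """
    if p1_pred != p2_pred:
        chooser.update(up=(p2_pred == outcome))
```

The final prediction can never differ from both components, which is the limitation the later slides contrast with fusion.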

Multi-Hybrids

[Diagram: P1, P2, …, Pn; Pr. Encoder] “Multi-Hybrid” [Evers96]
[Diagram: P1, P2, M1; P3, P4, M2; M3] “Quad-Hybrid” [Evers00]

Our Idea: Prediction Fusion

[Diagram: Prediction Selection over P1, P2, P3, …, Pn]
[Diagram: Prediction Fusion over P1, P2, P3, …, Pn]

Early Attempt from ML

• Weighted Majority algorithm [LW94]
  – Better predictors get assigned larger weights
  – Make final prediction with larger sum
• Predictor with largest weight not always correct

[Diagram: weight sums 0.487 and 0.513; components P1 through P8]
P2, P6 and P7 say “not-taken”
P1, P3, P4, P5 and P8 say “taken”
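A minimal sketch of the Weighted Majority scheme described above: sum the weights on each side, predict the side with the larger sum, then shrink the weight of every component that was wrong. The penalty factor `beta` and the function names are assumptions for illustration, not values from the slides.

```python
def weighted_majority_predict(weights, predictions):
    """Return True (taken) if the 'taken' voters hold at least half the weight."""
    taken = sum(w for w, p in zip(weights, predictions) if p)
    not_taken = sum(w for w, p in zip(weights, predictions) if not p)
    return taken >= not_taken


def weighted_majority_update(weights, predictions, outcome, beta=0.5):
    """Multiply the weight of each mispredicting component by beta (assumed 0.5)."""
    return [w * beta if p != outcome else w
            for w, p in zip(weights, predictions)]
```

Note that the component with the single largest weight can still be outvoted by several smaller ones, so the final prediction is a blend rather than a selection.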

Outline

• COLT Predictor
• Choosing parameters and components
• Performance
• Prediction distributions, component choice

COLT Organization

[Diagram: Branch Address and Branch History; component predictors P1, P2, P3, …, Pn with prediction bits 1 0 1 0 …; Mapping Table; VMT]
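The organization above can be sketched as follows. This is an illustrative model, not the authors' implementation: it assumes the VMT index is a plain concatenation of low branch-address bits and history bits, that each mapping table holds one saturating counter per combination of component predictions, and that counters start weakly not-taken. The class name `ColtSketch` and all parameter names are hypothetical.

```python
class ColtSketch:
    def __init__(self, n_components, addr_bits, hist_bits, counter_bits=4):
        self.addr_bits = addr_bits
        self.hist_bits = hist_bits
        self.max = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)
        n_tables = 1 << (addr_bits + hist_bits)
        # One mapping table per (address, history) index; each table has
        # 2**n_components counters, initialized just below the threshold.
        self.vmt = [[self.threshold - 1] * (1 << n_components)
                    for _ in range(n_tables)]

    def _locate(self, pc, history, preds):
        a = pc & ((1 << self.addr_bits) - 1)
        h = history & ((1 << self.hist_bits) - 1)
        table = self.vmt[(a << self.hist_bits) | h]
        entry = 0
        for p in preds:  # the component predictions form the table index
            entry = (entry << 1) | int(p)
        return table, entry

    def predict(self, pc, history, preds):
        table, entry = self._locate(pc, history, preds)
        return table[entry] >= self.threshold

    def update(self, pc, history, preds, taken):
        table, entry = self._locate(pc, history, preds)
        if taken:
            table[entry] = min(self.max, table[entry] + 1)
        else:
            table[entry] = max(0, table[entry] - 1)
```

Because the counter is indexed by the whole vector of component predictions, the final prediction may disagree with every component.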


Pathological Example

P1 P2 P3
0  0  0

Actual outcome = 1 (taken)

Example (cont’d)

Selection:
P1 P2 P3
0  0  0
Outcome is always wrong

COLT:
P1 P2 P3
0  0  0
[Diagram: the predictions 0 0 0 index the VMT (entries shown: 1 1 0 1), which outputs 1]
Can recognize and remember this pattern
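The pathological case above can be replayed in a few lines. This sketch assumes a single mapping table of 2-bit saturating counters initialized weakly not-taken (the counter width here is only for illustration); `run_pathological` is a hypothetical helper.

```python
def run_pathological(n_branches=10):
    """Replay a branch where P1, P2, P3 all say not-taken but it is taken."""
    preds = (0, 0, 0)
    outcome = 1
    mapping_table = [1] * 8  # one 2-bit counter per P1P2P3 combination
    selection_correct = 0
    fusion_correct = 0
    for _ in range(n_branches):
        # Selection: even the best possible choice of component predicts 0.
        selection_correct += int(any(p == outcome for p in preds))
        # Fusion: the prediction vector indexes a counter that can learn
        # "when everyone says not-taken, the branch is taken".
        entry = (preds[0] << 2) | (preds[1] << 1) | preds[2]
        fusion_pred = int(mapping_table[entry] >= 2)
        fusion_correct += int(fusion_pred == outcome)
        if outcome:
            mapping_table[entry] = min(3, mapping_table[entry] + 1)
        else:
            mapping_table[entry] = max(0, mapping_table[entry] - 1)
    return selection_correct, fusion_correct
```

Under these assumptions, selection never gets the branch right, while the fused table mispredicts only until its counter crosses the threshold.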

COLT Lookup Delay

[Timing diagram (axis: time): P1, P2, …, Pn; prediction bits 1 0 0 1 1 …; MT Select; critical delay; Prediction]

Design Choices

• # of branch address bits
• # of branch history bits
  } Determines number of mapping tables
• # of components
  } Determines size of individual MT’s
• Choice of components
  – gshare, PAs, gskewed, …
  – History length, PHT size, …
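As a rough illustration of how these choices interact, the sketch below computes table counts and storage under the assumption that the VMT index simply concatenates the address and history bits (so there are 2^(address bits + history bits) mapping tables) and that each mapping table has one counter per combination of component predictions. The function name and the example values are hypothetical.

```python
def colt_dimensions(addr_bits, hist_bits, n_components, counter_bits):
    """Return (number of mapping tables, entries per table, total counter bits)."""
    n_tables = 2 ** (addr_bits + hist_bits)   # set by the index width
    entries_per_table = 2 ** n_components     # one entry per prediction vector
    total_bits = n_tables * entries_per_table * counter_bits
    return n_tables, entries_per_table, total_bits


# Illustrative values only: 2 address bits, 7 history bits,
# 4 components, 4-bit counters.
tables, entries, bits = colt_dimensions(2, 7, 4, 4)
```

Each extra component doubles the size of every mapping table, which is one reason the number of components has to be chosen carefully.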

Predictor Components

• Global History
  – gshare [McFarling93]
  – Bi-Mode [Lee97]
  – Enhanced gskewed [Michaud97]
  – YAGS [Eden98]
• Local History
  – PAs [Yeh94]
  – pskewed [Evers96]
• Other
  – 2bC (bimodal) [Smith81]
  – Loop [Chang95]
  – alloyed Perceptron [Jiménez02]

History lengths optimized on test data sets

Total of 59 configurations
Sizes vary up to 64KB
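For readers unfamiliar with the components, here is a minimal sketch of gshare, one of the global-history predictors listed above: the branch address is XORed with the global history to index a table of 2-bit counters. It assumes the history is as long as the table index; the class and parameter names are hypothetical.

```python
class Gshare:
    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.pht = [1] * (1 << index_bits)  # 2-bit counters, weakly not-taken
        self.history = 0                    # global history, newest bit lowest

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

A branch that strictly alternates taken/not-taken is learned after a short warm-up, since the two history patterns map to different counters.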

Huge Search Space

• 2^59 ways to choose components, on top of the ways to choose COLT parameters
• We use a genetic search

gene format: [component bits …] [VMT Size] [history length]
  bit-k = 0 means don’t include Pk
  bit-k = 1 means do include Pk
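A hedged sketch of decoding such a gene. The slides give only the field order (component bits, VMT size, history length); the field widths and the assumption that the VMT size field stores a base-2 logarithm are illustrative choices, and `decode_gene` is a hypothetical helper.

```python
def decode_gene(bits, n_candidates, vmt_bits, hist_bits):
    """Split a gene (list of 0/1) into components, VMT size, and history length."""
    assert len(bits) == n_candidates + vmt_bits + hist_bits
    # bit-k = 1 means candidate component k is included.
    included = [k for k in range(n_candidates) if bits[k] == 1]
    vmt_field = bits[n_candidates:n_candidates + vmt_bits]
    hist_field = bits[n_candidates + vmt_bits:]
    # Assumption: the VMT field encodes log2 of the VMT size.
    log2_vmt = int("".join(map(str, vmt_field)), 2)
    hist_len = int("".join(map(str, hist_field)), 2)
    return included, 2 ** log2_vmt, hist_len
```

For example, with 4 candidates and two 4-bit fields, the gene `1011 1011 0111` would select candidates 0, 2 and 3 with a VMT size of 2048 and a history length of 7.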

Methodology

• SPEC2000 integer benchmarks
  – For tuning/optimization: 10M branches from test
  – For evaluation: 500M branches from train
    • Skipped first 100M branches
  – Compiled with cc -arch ev6 -O4 -fast -non_shared
• SimpleScalar simulator
  – sim-safe for trace collection
  – MASE for ILP simulations

Genetic Search COLT Results

Name / Size (KB) | Components                                                           | VMT   | Counter width | History length
16               | alpct(34/10) gskewed(12) gshare(8)                                   | 2048  | 4             | 8
32               | alpct(34/10) gshare(15) gshare(9) PAs(7)                             | 8192  | 4             | 7
64               | alpct(40/14) gshare(16) YAGS(11) pskewed(6)                          | 16384 | 4             | 10
128              | alpct(40/14) alpct(38/14) gshare(16) gskewed(13) YAGS(12) PAs(8)     | 16384 | 4             | 7
256              | alpct(50/18) alpct(34/10) gshare(18) Bi-Mode(16) gskewed(15) PAs(8)  | 32768 | 4             | 4

Overall Predictor Performance

Per-Benchmark Performance

ILP Performance

• Simulated CPU:
  – 6-issue
  – 20 cycle pipeline
  – Same functional units, latencies, caches as Intel P4/NetBurst microarchitecture

[Diagram: 1-cycle 2bC; + 4-cycle OR alpct; + 4-cycle OR COLT; Ideal 1-cycle COLT]

ILP Impact

COLT Parameter Sensitivity

• Mapping table counter widths
• Number of mapping tables
• Number of history bits for VMT index

Counter Width

VMT Size

History Length

Explaining Choice of Components

• Parameter sensitivity results show the GA performed well for the COLT parameters
• Why did it choose the component predictors that it did?

Classifying COLT Predictions

• We examined the (32KB) COLT config.
• For each mapping table lookup, we examine the neighboring entries:

[Diagram: P1 P2 P3 P4 predict 1 0 0 1, selecting mapping table entry 1001]
entry 0001 = NT
entry 1001 = T
entry 1101 = T
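The neighbor check can be sketched as follows: flip one component's bit at a time and see whether the mapping table entry would predict a different direction. If it does, that component's prediction matters for this lookup. The 4-bit counter threshold and the convention that P1 is the most significant index bit are assumptions; `distinguishing_components` is a hypothetical helper.

```python
def distinguishing_components(mapping_table, entry, n_components, threshold=8):
    """Return indexes (0 = P1) of components whose flipped bit changes the direction."""
    base = mapping_table[entry] >= threshold
    result = []
    for k in range(n_components):
        bit = 1 << (n_components - 1 - k)  # assume P1 is the top index bit
        neighbor = entry ^ bit             # same lookup with component k flipped
        if (mapping_table[neighbor] >= threshold) != base:
            result.append(k)
    return result


# Mirror the slide: entry 1001 = T, entry 1101 = T, entry 0001 = NT.
table = [12] * 16
table[0b0001] = 3
distinguishing = distinguishing_components(table, 0b1001, 4)
```

In this example only P1 distinguishes the outcome. An empty result means all neighboring entries agree.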

Classifying Predictions (cont’d)

32KB COLT: gshare(9), gshare(14), PAs(7), alpct(34/10)

Classes:
easy: all neighboring entries agree
short: only gshare(9) distinguishes
long: only gshare(14) distinguishes
local: only PAs(7) distinguishes
perceptron: only alpct(34/10) distinguishes
multi-length: mix of gshare(9), (14) or alpct
mixed: both global and local components

Prediction Classifications

Related Work/Issues

• Alloyed history [Skadron00]
• Variable path history length [Stark98]
• Dynamic history length fitting [Juan98]
• Interference reduction [lots…]

COLT handles all of these cases*

* Doesn’t support partial update policies

Open Research

• Better individual components
• Augment with SBI [Manne99], agree [Sprangle97]
• Better fusion algorithms
• Hybrid fusion/selection algorithms
• Other domains (branch confidence prediction, value prediction, memory dependence prediction, instruction criticality prediction, …)

Summary

• Fusion is more powerful than selection
  – Combines multiple sources of information
• Branch behavior is very varied
  – Need long, short, global and local histories, multiple simultaneous lengths and types of history
• COLT is one possible fusion-based predictor
  – Combines multiple types of information
  – Current “best” purely dynamic predictor*

Questions?
