evaluation of branch predictors


The University of Texas at Dallas

Department of Electrical Engineering

EECE/CS 6304: COMPUTER ARCHITECTURE

PROJECT #2

"ANALYSIS OF DIFFERENT TYPES OF BRANCH PREDICTORS"

Submitted by,

Bharat Biyani (2021152193)

Shree Viswa Shamanthan L D (2021180127)


INTRODUCTION

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known for sure (i.e., before the branch executes). The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

In this project, we analyze the behavior of different branch predictor configurations on three well-recognized benchmarks, namely GCC, ANAGRAM and GO. We used SimpleScalar's sim-outorder simulator, which models all the execution aspects of the Alpha 21264. The simulations provide the CPI values, which we use to compare the configurations across the benchmarks.

We used three types of hardware-based branch prediction strategies:

1) Bimodal Predictor: It is a simple predictor, which uses 2-bit saturating counters to predict if a

given branch is likely to be taken or not.

2) Two Level Predictor: A two-level adaptive predictor with an n-bit history can predict any repetitive sequence of any period, provided all of its n-bit sub-sequences are different. Its advantage is that it can quickly learn to predict an arbitrary repetitive pattern.

3) Combined Predictor: A hybrid predictor, also called a combined predictor, implements more than one prediction mechanism. The final prediction comes either from a meta-predictor that remembers which of the predictors has made the best predictions in the past, or from a majority vote among an odd number of different predictors.
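As an illustration of the first two schemes, the following is a minimal sketch and not SimpleScalar's implementation. The names (bimodal_predict, twolevel_update and so on), the 256-entry table, the 4-bit global history and the word-aligned PC shift are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256u  /* assumed power-of-two bimodal table size */
#define HIST_BITS  4u    /* assumed global history length */

/* 2-bit saturating counter: 0 and 1 predict not-taken, 2 and 3 predict taken. */
static bool counter_predict(uint8_t c) { return c >= 2; }

static uint8_t counter_update(uint8_t c, bool taken)
{
    if (taken)
        return (uint8_t)(c < 3 ? c + 1 : 3);
    return (uint8_t)(c > 0 ? c - 1 : 0);
}

/* Bimodal predictor: one counter per slot, indexed by low-order PC bits. */
struct bimodal {
    uint8_t table[TABLE_SIZE];
};

static unsigned bimodal_index(uint32_t pc)
{
    return (pc >> 2) & (TABLE_SIZE - 1u);  /* assumes 4-byte instructions */
}

bool bimodal_predict(const struct bimodal *p, uint32_t pc)
{
    assert(p != NULL);
    return counter_predict(p->table[bimodal_index(pc)]);
}

void bimodal_update(struct bimodal *p, uint32_t pc, bool taken)
{
    assert(p != NULL);
    unsigned i = bimodal_index(pc);
    p->table[i] = counter_update(p->table[i], taken);
}

/* Two-level predictor: the last HIST_BITS outcomes select the counter. */
struct twolevel {
    uint32_t history;                /* newest outcome in the lowest bit */
    uint8_t  table[1u << HIST_BITS];
};

bool twolevel_predict(const struct twolevel *p)
{
    assert(p != NULL);
    return counter_predict(p->table[p->history & ((1u << HIST_BITS) - 1u)]);
}

void twolevel_update(struct twolevel *p, bool taken)
{
    assert(p != NULL);
    uint32_t mask = (1u << HIST_BITS) - 1u;
    uint32_t i = p->history & mask;
    p->table[i] = counter_update(p->table[i], taken);
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & mask;
}
```

With zero-initialized tables, the bimodal sketch needs two taken outcomes before it predicts taken for a branch, while the two-level sketch learns an alternating taken/not-taken pattern because each 4-bit history selects its own counter.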

P a g e | 2

Part 1: Performance analysis of different types of branch predictors

The simulations cover different Return Address Stack (RAS) sizes and different types of branch predictor.

Baseline default RAS: Bimodal predictor with the default value for RAS.

-bpred bimod -bpred:bimod 256 -bpred:ras 8 -bpred:btb 64 2

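2 Level Predictor: Two-level adaptive predictor that uses two-bit counters to hold the prediction state. The arguments to -bpred:2lev are <l1size> <l2size> <hist_size> <xor>, so this configuration has one 4-bit history register indexing a 256-entry second-level table, with XOR disabled.

-bpred 2lev -bpred:2lev 1 256 4 0 -bpred:ras 8 -bpred:btb 64 2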

Comb: Combines a two-level and a bimodal predictor.

-bpred comb -bpred:comb 256 -bpred:bimod 256 -bpred:2lev 1 256 4 0 -bpred:ras 8 -bpred:btb 64 2

RAS 4: Change the return address stack (RAS) size to 4.


RAS 16: Change the return address stack (RAS) size to 16.
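To make the role of the RAS size concrete, here is a minimal sketch of a return address stack as a circular buffer. It is an illustration under stated assumptions, not SimpleScalar's code: the struct layout, the function names and the RAS_MAX bound are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAS_MAX 16u  /* largest depth tried in this report */

struct ras {
    uint32_t slot[RAS_MAX];
    unsigned size;  /* configured depth, e.g. 4, 8 or 16 */
    unsigned tos;   /* index of the most recently pushed entry */
};

void ras_init(struct ras *r, unsigned size)
{
    assert(r != NULL && size > 0 && size <= RAS_MAX);
    for (unsigned i = 0; i < RAS_MAX; i++)
        r->slot[i] = 0;
    r->size = size;
    r->tos = size - 1;
}

/* On a call: push the return address; wraps and overwrites the oldest entry. */
void ras_push(struct ras *r, uint32_t ret_addr)
{
    assert(r != NULL);
    r->tos = (r->tos + 1) % r->size;
    r->slot[r->tos] = ret_addr;
}

/* On a return: pop the predicted return target. */
uint32_t ras_pop(struct ras *r)
{
    assert(r != NULL);
    uint32_t target = r->slot[r->tos];
    r->tos = (r->tos + r->size - 1) % r->size;
    return target;
}
```

In this sketch, five nested calls on a depth-4 stack overwrite the oldest return address, so the four innermost returns are predicted correctly and the fifth is not; a depth of 8 or 16 predicts all five.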


Performance Analysis based on CPI

                            Benchmarks
Sr. No.  Configuration      GCC     ANAGRAM  GO
1        Baseline           0.95    0.4674   0.7571
2        2 Level Predictor  0.9822  0.4605   0.7893
3        Comb               0.8678  0.4546   0.7516
4        Bimod: RAS 4       0.9538  0.4678   0.7574
5        Bimod: RAS 16      0.9498  0.4674   0.7571

Graphical Representation with above CPI

Figure: CPI (y-axis, 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, RAS 4 and RAS 16 configurations (x-axis); series: ANAGRAM, GO, GCC.


The above graph shows the CPI of each branch predictor configuration for the three benchmarks.

Analysis: Benchmark – GCC vs BP Configurations

The GCC benchmark has a higher CPI than the other two benchmarks. Its CPI is lowest (0.8678) with the combination of the two-level and bimodal predictors (Comb), and highest (0.9822) with the 2-level predictor alone.
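The size of these CPI differences can be expressed as a percentage change relative to the baseline. The helper below is a small illustrative sketch; the function name cpi_reduction_pct is our own and is not part of SimpleScalar.

```c
#include <assert.h>

/* Percentage by which new_cpi is lower than base_cpi (negative if higher). */
double cpi_reduction_pct(double base_cpi, double new_cpi)
{
    assert(base_cpi > 0.0);
    return (base_cpi - new_cpi) / base_cpi * 100.0;
}
```

For GCC, cpi_reduction_pct(0.95, 0.8678) is about 8.65, i.e. Comb lowers CPI by roughly 8.65% relative to the baseline, while cpi_reduction_pct(0.95, 0.9822) is about -3.39, i.e. the 2-level predictor raises CPI by roughly 3.39%.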

Analysis: Benchmark – ANAGRAM vs BP Configurations

From the above graph, we can infer that the ANAGRAM benchmark has a lower CPI than the other two benchmarks. Its CPI is fairly constant across all the branch predictor configurations (0.4546 to 0.4678) and is lowest for the combination of the two-level and bimodal predictors (Comb).

Analysis: Benchmark – GO vs BP Configurations

The above graph shows that the GO benchmark has a lower CPI than the GCC benchmark. Its CPI is almost constant across all the branch predictor configurations and is lowest (0.7516) for the combination of the two-level and bimodal predictors (Comb). With respect to RAS size, changing the bimodal configuration's return address stack from a size of 4 to a size of 16 lowers CPI only slightly (0.7574 to 0.7571), so RAS size does not have much impact on CPI.


Performance Analysis based on Address Hit Rates


Sr. No.  Configuration      GCC     ANAGRAM  GO
1        Baseline           0.6734  0.956    0.7071
2        2 Level Predictor  0.6253  0.9575   0.6484
3        Comb               0.8339  0.9694   0.709
4        Bimod: RAS 4       0.6697  0.9555   0.7067
5        Bimod: RAS 16      0.6736  0.9605   0.7071

Graphical Representation with above Address Hit Rates

The above graph shows the address hit rates of the different branch predictor configurations for each benchmark.

For the ANAGRAM benchmark, address hit rates are high for every configuration (0.9555 to 0.9694); bimod with a Return Address Stack (RAS) size of 4 is the lowest.

For the GO benchmark, address hit rates are close to 0.71 for every configuration except the 2-level predictor (0.6484).

For the GCC benchmark, the 2-level predictor again gives the lowest address hit rate (0.6253), and Comb gives the highest (0.8339).

Figure: address hit rate (y-axis, 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, Bimod: RAS 4 and Bimod: RAS 16 configurations (x-axis); series: GCC, GO, ANAGRAM.


Performance Analysis based on Direction Hit Rates


Sr. No.  Configuration      GCC     ANAGRAM  GO
1        Baseline           0.6734  0.9605   0.7929
2        2 Level Predictor  0.7919  0.9614   0.7372
3        Comb               0.8617  0.9738   0.7978
4        Bimod: RAS 4       0.8431  0.9605   0.7929
5        Bimod: RAS 16      0.8431  0.9605   0.7929

The graph of the Direction Hit Rates for each benchmark gives more information on the effect of the branch predictor configurations on the different benchmarks.

Graphical Representation with above Direction Hit Rates

The Direction Hit Rates stay fairly constant across configurations for ANAGRAM and GO. ANAGRAM has higher direction hit rates than the other two benchmarks. For GO, the 2-level predictor gives the lowest direction hit rate (0.7372). For GCC, the table shows the rate rising from 0.6734 at the baseline Return Address Stack size of 8 to 0.8431 when the size is changed to 4 or 16.

Figure: direction hit rate (y-axis, 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, Bimod: RAS 4 and Bimod: RAS 16 configurations (x-axis); series: GCC, GO, ANAGRAM.


Part 2: Modification of the code to accommodate address misses

We modified the following two files in SimpleScalar:

1) bpred.h

2) bpred.c

1) Changes in file bpred.h:

----------------

/* branch predictor def */
struct bpred_t {
  ------
  } dirpred;

  struct {
    --------
  } retstack;

  /* stats */
  counter_t addr_hits;    /* num correct addr-predictions */
  counter_t dir_hits;     /* num correct dir-predictions (incl addr) */
  counter_t addr_misses;  /* num addr_misses */
  counter_t used_ras;     /* num RAS predictions used */
  counter_t used_bimod;   /* num bimodal predictions used (BPredComb) */
  -----------
};

2) Changes in file bpred.c:

-----------

  sprintf(buf, "%s.dir_hits", name);
  stat_reg_counter(sdb, buf,
                   "total number of direction-predicted hits "
                   "(includes addr-hits)",
                   &pred->dir_hits, 0, NULL);
  sprintf(buf, "%s.addr_misses", name);
  stat_reg_counter(sdb, buf, "total number of addr-misses",
                   &pred->addr_misses, 0, NULL);

-----------

  if (bpred == NULL)
    return;

  bpred->dir_hits = 0;
  bpred->addr_misses = 0;

-----------

  /* Have a branch here */

  if (correct)
    pred->addr_hits++;

  if (!!pred_taken == !!taken)
    pred->dir_hits++;
  else
    pred->misses++;

  pred->addr_misses = (pred->misses + pred->dir_hits - pred->addr_hits);

-----------

-----------

}
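The counter arithmetic in the excerpt above can be checked in isolation. The following is a self-contained sketch, not part of SimpleScalar: struct bpred_stats, stats_record and the local counter_t typedef are stand-ins introduced only for this example. Every branch increments either dir_hits or misses, so their sum is the number of branches seen, and subtracting addr_hits leaves the number of address misses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long counter_t;  /* stand-in for SimpleScalar's counter_t */

/* Hypothetical container holding only the counters used in the excerpt. */
struct bpred_stats {
    counter_t addr_hits;
    counter_t dir_hits;
    counter_t misses;
    counter_t addr_misses;
};

/* Mirrors the update logic shown above for one resolved branch. */
void stats_record(struct bpred_stats *s, bool correct,
                  bool pred_taken, bool taken)
{
    assert(s != NULL);
    if (correct)
        s->addr_hits++;
    if (pred_taken == taken)
        s->dir_hits++;
    else
        s->misses++;
    /* dir_hits + misses counts every branch; the rest are address misses. */
    s->addr_misses = s->misses + s->dir_hits - s->addr_hits;
}
```

After recording one fully correct branch, one with the right direction but the wrong target, and one with the wrong direction, the sketch reports addr_hits = 1 and addr_misses = 2, which together account for all three branches.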


Part 3: Comparison of BTB Performance

The simulation is done for the following configurations of Branch Target Buffer:

Baseline BTB configuration: 64 sets, 2 way associativity

-bpred bimod -bpred:bimod 256 -bpred:btb 64 2

Showing the effect of the number of sets in BTB with the following options


-bpred bimod -bpred:bimod 256 -bpred:btb 128 2

Showing the effect of associativity when the total size of BTB is fixed with the following options



Performance Analysis based on addr_hits


Sr. No.  Configuration   GCC      ANAGRAM  GO
1        64 sets/2 way   2235498  2771048  1934760
2        32 sets/2 way   2095859  2746365  1832302
3        128 sets/2 way  2389785  2777415  2008597
4        32 sets/4 way   2260256  2775372  1936745
5        128 sets/1 way  2197498  2759944  1893595

Graphical Representation with above addr_hits

Figure: addr_hits (y-axis, 0 to 3000000) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way BTB configurations (x-axis); series: GO, GCC, ANAGRAM.


The above graph shows the address hits for the various Branch Target Buffer (BTB) configurations on each benchmark. Of the three benchmarks, ANAGRAM has the highest address hits, GCC has moderate address hits, and GO has the fewest. For all three benchmarks, address hits are lowest for the BTB with 32 sets and 2-way set associativity (2746365, 2095859 and 1832302 respectively) and highest for the BTB with 128 sets and 2-way set associativity.

Comparison of BTB Performance based on addr_misses


Sr. No.  Configuration   GCC      ANAGRAM  GO
1        64 sets/2 way   1084176  127541   801464
2        32 sets/2 way   1223815  152224   903922
3        128 sets/2 way  929889   121174   727627
4        32 sets/4 way   1059418  123217   799479
5        128 sets/1 way  1122176  138645   842629
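As a consistency check, the addr_hits and addr_misses counts can be turned back into an address hit rate. The helper below is an illustrative sketch with a name of our own choosing (addr_hit_rate); it simply divides hits by the total of hits and misses.

```c
#include <assert.h>

/* Address hit rate from raw counts: hits / (hits + misses). */
double addr_hit_rate(unsigned long hits, unsigned long misses)
{
    assert(hits + misses > 0);
    return (double)hits / (double)(hits + misses);
}
```

For the 64 sets/2 way baseline this gives about 0.6734 for GCC (2235498 hits, 1084176 misses), 0.9560 for ANAGRAM and 0.7071 for GO, which agrees with the baseline address hit rates reported in Part 1.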

Graphical Representation with above addr_misses

From the above graph, as expected, address misses are lowest for the ANAGRAM benchmark, and GCC has the most address misses of the three benchmarks. Decreasing the number of sets from 64 to 32 increases the address misses, and increasing the number of sets from 64 to 128 decreases them, because more sets reduce capacity misses. In the 32 sets/4 way configuration, even though the number of sets drops from 64 to 32, address misses decrease because the higher associativity reduces conflict misses. In the 128 sets/1 way configuration, direct mapping increases conflict misses, so addr_misses rise above the baseline even though the number of sets is doubled.
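The conflict-miss effect can be illustrated with a small set-associative BTB model. This is a sketch under stated assumptions, not SimpleScalar's BTB: the struct, the function names, the word-aligned index and the round-robin replacement (chosen for brevity) are all hypothetical, and a tag of 0 marks an empty entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTB_MAX_ENTRIES 256u  /* enough for 128 sets x 2 ways */

struct btb {
    unsigned sets, assoc;
    uint32_t tag[BTB_MAX_ENTRIES];          /* branch address; 0 = empty */
    uint32_t target[BTB_MAX_ENTRIES];
    unsigned next_victim[BTB_MAX_ENTRIES];  /* per-set round-robin pointer */
};

void btb_init(struct btb *b, unsigned sets, unsigned assoc)
{
    assert(b != NULL && sets > 0 && (sets & (sets - 1u)) == 0);
    assert(assoc > 0 && sets * assoc <= BTB_MAX_ENTRIES);
    memset(b, 0, sizeof *b);
    b->sets = sets;
    b->assoc = assoc;
}

/* Set index: low-order bits of the word-aligned branch address. */
static unsigned btb_set(const struct btb *b, uint32_t pc)
{
    return (pc >> 2) & (b->sets - 1u);
}

bool btb_lookup(const struct btb *b, uint32_t pc, uint32_t *target)
{
    assert(b != NULL && target != NULL);
    unsigned base = btb_set(b, pc) * b->assoc;
    for (unsigned w = 0; w < b->assoc; w++) {
        if (b->tag[base + w] == pc) {
            *target = b->target[base + w];
            return true;
        }
    }
    return false;
}

void btb_insert(struct btb *b, uint32_t pc, uint32_t target)
{
    assert(b != NULL && pc != 0);
    unsigned set = btb_set(b, pc);
    unsigned base = set * b->assoc;
    unsigned w;
    for (w = 0; w < b->assoc; w++)      /* reuse the entry if already present */
        if (b->tag[base + w] == pc)
            break;
    if (w == b->assoc) {                /* otherwise evict round-robin */
        w = b->next_victim[set];
        b->next_victim[set] = (w + 1u) % b->assoc;
    }
    b->tag[base + w] = pc;
    b->target[base + w] = target;
}
```

In this model, branches at 0x1000 and 0x1200 map to the same set in all three 128-entry configurations. With 128 sets/1 way the second branch evicts the first, whereas with 64 sets/2 way or 32 sets/4 way both entries fit in the set.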

Figure: addr_misses (y-axis, 0 to 1400000); series: ANAGRAM, GO, GCC.


Comparison of BTB Performance based on CPI


Sr. No.  Configuration   GCC     ANAGRAM  GO
1        64 sets/2 way   0.9500  0.4674   0.7571
2        32 sets/2 way   0.9664  0.4711   0.7645
3        128 sets/2 way  0.9304  0.4664   0.7496
4        32 sets/4 way   0.9491  0.4670   0.7575
5        128 sets/1 way  0.9528  0.4686   0.7583

Graphical Representation with above CPI

From the above graph, CPI remains fairly constant across BTB configurations for every benchmark. Among the benchmarks, ANAGRAM has the lowest CPI and GCC has the highest. CPI is higher for the 32 sets/2 way configuration than for 64 sets/2 way, which has twice as many sets. For the 32 sets/4 way and 128 sets/1 way configurations, which have the same total number of entries as the baseline, CPI is almost equal to the 64 sets/2 way CPI. The configuration with 128 sets and associativity 2 gives the lowest CPI of all the configurations.

Figure: CPI (y-axis, 0 to 1.2); series: GCC, ANAGRAM, GO.


Comparison of BTB Performance based on Branch Predictor Hit Rates


Sr. No.  Configuration   GCC     ANAGRAM  GO
1        64 sets/2 way   0.6779  0.9546   0.6926
2        32 sets/2 way   0.636   0.9476   0.6527
3        128 sets/2 way  0.7221  0.9557   0.7225
4        32 sets/4 way   0.6852  0.9573   0.6931
5        128 sets/1 way  0.665   0.9518   0.6775

Graphical Representation with above Branch Predictor Hit Rates

The above graph shows that the branch predictor hit rate for all the benchmarks drops when the number of sets in the BTB decreases. Comparing the hit rates across configurations, the BTB with 32 sets and 2-way set associativity gives the lowest branch prediction hit rate for all the benchmarks.

CONCLUSION

For an optimal branch predictor, a BTB with more sets is recommended, but the tradeoff between cost and performance should be taken into consideration.

For high address hit rates and direction hit rates, the simulation results suggest that the combination of the two-level and bimodal predictors is the better configuration.

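Figure (Branch Predictor Hit Rates): hit rate (y-axis, 0 to 1.2) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way BTB configurations (x-axis); series: GCC, ANAGRAM, GO.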