
Page 1: Loop Unrolling

1

Loop Unrolling

• Determine whether loop unrolling is useful by checking that the loop iterations are independent

• Determine the address offsets for the different loads/stores

• Increases program size

• Use different registers to avoid unnecessary constraints forced by using the same registers for different computations

• Puts more stress (pressure) on registers

• Eliminate the extra test and branch instructions and adjust the loop termination and iteration code

(A C sketch of these steps follows.)
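As an illustration (not part of the original slide), here is a minimal C sketch of these steps applied to the x[i] = x[i] + s loop used in the examples that follow; the unroll factor of four and the function names are arbitrary choices for the sketch.

/* Original loop: one add and one test-and-branch per element. */
void scale_orig(double *x, double s) {
    for (int i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled by 4: iterations are independent, so four elements are
 * processed per pass; the address offsets differ (i, i-1, i-2, i-3),
 * and only one loop test/branch executes per four elements.
 * 1000 is divisible by 4, so no cleanup loop is needed here. */
void scale_unrolled(double *x, double s) {
    for (int i = 1000; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}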

Page 2: Loop Unrolling

2

Loop Unrolling

• If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together as long as the order within each iteration is preserved

• If a loop has dependences across iterations, it is not parallel and these dependences are referred to as “loop-carried”

Page 3: Loop Unrolling

3

Example

Example 1:
for (i=1000; i>0; i=i-1)
  x[i] = x[i] + s;

Example 2:
for (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];      /* S1 */
  B[i+1] = B[i] + A[i+1];    /* S2 */
}

Example 3:
for (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];        /* S1 */
  B[i+1] = C[i] + D[i];      /* S2 */
}

Example 4:
for (i=1000; i>0; i=i-1)
  x[i] = x[i-3] + s;         /* S1 */

Page 4: Loop Unrolling

4

Example

Example 1:
for (i=1000; i>0; i=i-1)
  x[i] = x[i] + s;
No dependences.

Example 2:
for (i=1; i<=100; i=i+1) {
  A[i+1] = A[i] + C[i];      /* S1 */
  B[i+1] = B[i] + A[i+1];    /* S2 */
}
S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration.

Example 3:
for (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];        /* S1 */
  B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from the previous iteration.

Example 4:
for (i=1000; i>0; i=i-1)
  x[i] = x[i-3] + s;         /* S1 */
S1 depends on S1 from three iterations earlier; this is referred to as a recurrence. Dependence distance 3; limited parallelism.

Page 5: Loop Unrolling

5

Constructing Parallel Loops

If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel

Original loop:
for (i=1; i<=100; i=i+1) {
  A[i] = A[i] + B[i];        /* S1 */
  B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from the previous iteration.

Restructured loop:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S3 */
  A[i+1] = A[i+1] + B[i+1];  /* S4 */
}
B[101] = C[100] + D[100];
S4 depends on S3 of the same iteration.

Loop unrolling reduces impact of branches on pipeline; another way is branch prediction

Page 6: Loop Unrolling

6

Static Branch Prediction

• Scheduling code around a delayed branch
• To reorder code around branches, we need to predict the branch statically at compile time
• Simplest scheme is to predict a branch as taken
  • Average misprediction = untaken branch frequency = 34% for SPEC

[Chart: misprediction rate of always-taken prediction for the SPEC benchmarks, ranging from about 4% to 22%.]

• A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run:

Page 7: Loop Unrolling

7

Dynamic Branch Prediction

• Why does prediction work?
  • Underlying algorithm has regularities
  • Data that is being operated on has regularities
• Is dynamic branch prediction better than static branch prediction?
  • Seems to be
  • There are a small number of important branches in programs which have dynamic behavior

Page 8: Loop Unrolling

8

Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table (BHT): the lower bits of the branch PC index a table of 1-bit values
  • Says whether or not the branch was taken last time
  • No address (tag) check
• Problem: in a loop, a 1-bit BHT causes two mispredictions (see the sketch below):
  • At the end of the loop, when it exits instead of looping as before
  • The first time through the loop on the next pass, when it predicts exit instead of looping
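A minimal C sketch of such a 1-bit BHT, assuming a 4096-entry table indexed by the low-order PC bits (the size and indexing are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 4096           /* illustrative size */

static bool bht[BHT_ENTRIES];      /* 1 bit per entry: taken last time? */

/* Predict using the low-order PC bits; there is no tag check, so two
 * different branches can alias to the same entry. */
bool predict_1bit(uint32_t pc) {
    return bht[(pc >> 2) & (BHT_ENTRIES - 1)];
}

/* After the branch resolves, remember only the latest outcome. A loop
 * branch therefore mispredicts twice per loop visit: once on exit (the
 * bit still says "taken") and once on re-entry (the bit now says
 * "not taken"). */
void update_1bit(uint32_t pc, bool taken) {
    bht[(pc >> 2) & (BHT_ENTRIES - 1)] = taken;
}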

Page 9: Loop Unrolling

9

Dynamic Branch Prediction

• Solution: 2-bit scheme where the prediction is changed only after two successive mispredictions
• More sophisticated: count the number of times the branch is taken

[Figure: state diagram of the 2-bit branch predictor.]

Page 10: Loop Unrolling

10

Branch History Table

• Mispredict because either:
  • Wrong guess for that branch
  • Got the branch history of the wrong branch when indexing the table

• 4096-entry table:

[Chart: misprediction rates of a 4096-entry 2-bit BHT on the SPEC89 benchmarks (eqntott, espresso, gcc, li, doduc, spice, fpppp, matrix300, nasa7), ranging from 0% to 18%.]

Page 11: Loop Unrolling

11

Correlated Branch Prediction

• Standard 2-bit predictor uses local information
  • Fails to look at the global picture

•Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table

•Global Branch History: m-bit shift register keeping T/NT status of last m branches.

Page 12: Loop Unrolling

12

Correlated Branch Prediction

• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table

• In general, (m,n) predictor means record last m branches to select between 2^m history tables each with n-bit counters

• Old 2-bit BHT is then a (0,2) predictor

if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) do something;
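A hedged C sketch of an (m,n) predictor with m = 2 and 2-bit counters, i.e., a (2,2) predictor; the table size, indexing, and names are assumptions for illustration only:

#include <stdbool.h>
#include <stdint.h>

#define M 2                               /* global history bits */
#define N_ENTRIES 1024                    /* per-table entries (illustrative) */

static uint8_t table[1 << M][N_ENTRIES];  /* 2-bit counters, values 0..3 */
static uint8_t ghist;                     /* last M branch outcomes */

bool predict_mn(uint32_t pc) {
    uint8_t ctr = table[ghist][(pc >> 2) % N_ENTRIES];
    return ctr >= 2;                      /* predict taken if counter is 2 or 3 */
}

void update_mn(uint32_t pc, bool taken) {
    uint8_t *ctr = &table[ghist][(pc >> 2) % N_ENTRIES];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    /* Shift the newest outcome into the global history register. */
    ghist = ((ghist << 1) | (taken ? 1 : 0)) & ((1 << M) - 1);
}

With M set to 0 this collapses to a single table of 2-bit counters, matching the statement above that the plain 2-bit BHT is a (0,2) predictor.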

Page 13: Loop Unrolling

13

Correlated Branch Prediction

(2,2) predictor: the behavior of the last two branches selects between four predictions for the next branch, updating just that prediction.

Page 14: Loop Unrolling

14

Accuracy of Different Schemes

[Chart: frequency of mispredictions on the SPEC89 benchmarks (nasa7, matrix300, doduc, spice, fpppp, gcc, espresso, eqntott, li, tomcatv) for three schemes: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT. Misprediction frequencies range from 0% to about 11%.]

Page 15: Loop Unrolling

15

Tournament Predictors

• Multilevel branch predictor
• Selector for the global and local predictors of correlating branch prediction
• Use an n-bit saturating counter to choose between predictors
• Usual choice is between global and local predictors

Page 16: Loop Unrolling

16

Tournament Predictors

• A local predictor might work well for some branches or programs, while a global predictor might work well for others

• Provide one of each and maintain another predictor to identify which predictor is best for each branch

[Diagram: tournament predictor. The branch PC indexes a table of 2-bit saturating counters, which drives a MUX that selects between the local predictor and the global predictor.]

Page 17: Loop Unrolling

17

Tournament Predictors

• Tournament predictor using, say, 4K 2-bit counters indexed by the local branch address. Chooses between:
• Global predictor
  • 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  • Each entry is a standard 2-bit predictor
• Local predictor
  • Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  • The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

(A sketch of the selector logic follows.)
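A rough C sketch of the selector logic only; predict_local() and predict_global() stand in for the two component predictors above and are assumed, not defined, and the table size is illustrative:

#include <stdbool.h>
#include <stdint.h>

#define SEL_ENTRIES 4096

/* 2-bit chooser counters: 0-1 favor the local predictor, 2-3 the global one. */
static uint8_t selector[SEL_ENTRIES];

/* Hypothetical component predictors, assumed to exist elsewhere. */
extern bool predict_local(uint32_t pc);
extern bool predict_global(uint32_t pc);

bool predict_tournament(uint32_t pc) {
    uint32_t idx = (pc >> 2) & (SEL_ENTRIES - 1);
    return (selector[idx] >= 2) ? predict_global(pc) : predict_local(pc);
}

/* Train the chooser only when the two predictors disagreed, moving it
 * toward whichever one was correct for this branch. */
void update_selector(uint32_t pc, bool local_correct, bool global_correct) {
    uint32_t idx = (pc >> 2) & (SEL_ENTRIES - 1);
    if (global_correct && !local_correct && selector[idx] < 3) selector[idx]++;
    if (local_correct && !global_correct && selector[idx] > 0) selector[idx]--;
}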

Page 18: Loop Unrolling

18

Tournament Predictors

•Advantage of tournament predictor is ability to select the right predictor for a particular branch

Page 19: Loop Unrolling

19

1-Bit Bimodal Prediction (SimpleScalar Term)

• For each branch, keep track of what happened last time and use that outcome as the prediction

• What are prediction accuracies for branches 1 and 2 below:

while (1) {
  for (i=0; i<10; i++) {   /* branch-1 */
    ...
  }
  for (j=0; j<20; j++) {   /* branch-2 */
    ...
  }
}
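Working it out (not on the original slide), under the usual assumption that each loop-closing branch is taken on every evaluation except the last one of each pass: with a 1-bit predictor, branch-1 mispredicts twice per pass, once on loop exit and once on re-entry while the bit still records the exit, so it is right 8 times out of 10, about 80%; branch-2 is right 18 times out of 20, about 90%.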

Page 20: Loop Unrolling

20

2-Bit Bimodal Prediction (SimpleScalar Term)

• For each branch, maintain a 2-bit saturating counter:
  if the branch is taken:     counter = min(3, counter+1)
  if the branch is not taken: counter = max(0, counter-1)

• If (counter >= 2), predict taken, else predict not taken

• Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)

• Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)

• Can be easily extended to N-bits (in most processors, N=2)
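A minimal C rendering of this update rule, assuming a 4096-entry counter table indexed by the low PC bits (size and indexing are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define CTR_ENTRIES 4096

static uint8_t ctr[CTR_ENTRIES];   /* 2-bit saturating counters, values 0..3 */

bool predict_2bit(uint32_t pc) {
    return ctr[(pc >> 2) & (CTR_ENTRIES - 1)] >= 2;   /* >= 2 means predict taken */
}

void update_2bit(uint32_t pc, bool taken) {
    uint8_t *c = &ctr[(pc >> 2) & (CTR_ENTRIES - 1)];
    if (taken) { if (*c < 3) (*c)++; }   /* counter = min(3, counter+1) */
    else       { if (*c > 0) (*c)--; }   /* counter = max(0, counter-1) */
}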

Page 21: Loop Unrolling

21

Branch Target Buffers (BTB)

•Branch target calculation is costly and stalls the instruction fetch.

•BTB stores PCs the same way as caches

•The PC of a branch is sent to the BTB

•When a match is found the corresponding Predicted PC is returned

•If the branch was predicted taken, instruction fetch continues at the returned predicted PC
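A hedged C sketch of a direct-mapped BTB along these lines; the entry count, the use of the full PC as the tag, and the function names are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512                 /* illustrative size */

struct btb_entry {
    bool     valid;
    uint32_t tag;                       /* branch PC stored as the tag */
    uint32_t target;                    /* predicted target PC */
};

static struct btb_entry btb[BTB_ENTRIES];

/* Look up the fetch PC; on a hit, return the predicted target so that
 * fetch can continue at the predicted PC. Returns false on a miss. */
bool btb_lookup(uint32_t pc, uint32_t *predicted_pc) {
    struct btb_entry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    if (e->valid && e->tag == pc) {
        *predicted_pc = e->target;
        return true;
    }
    return false;
}

/* Insert or update an entry when a taken branch resolves. */
void btb_update(uint32_t pc, uint32_t target) {
    struct btb_entry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    e->valid  = true;
    e->tag    = pc;
    e->target = target;
}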

Page 22: Loop Unrolling

22

Branch Target Buffers (BTB)

Page 23: Loop Unrolling

23

Branch Prediction

• Sophisticated techniques:
• A “branch target buffer” to help us look up the destination
• Correlating predictors that base the prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction is based on what happened in previous branches)

• Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.

• A “branch delay slot” which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)

•Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!

•Modern processors predict correctly 95% of the time!

Page 24: Loop Unrolling

24

Pipeline without Branch Predictor

[Figure: branch handling in the IF stage without a predictor; the next PC is either PC + 4 or the branch target, available only after the register read and compare.]

In the 5-stage pipeline, a branch completes in two cycles. If the branch went the wrong way, one incorrect instruction is fetched. One stall cycle per incorrect branch.

Page 25: Loop Unrolling

25

Pipeline with Branch Predictor

[Figure: branch handling in the IF stage with a branch predictor supplying the next PC while the register read, compare, and branch-target computation complete.]

In the 5-stage pipeline, a branch completes in two cycles. If the branch went the wrong way, one incorrect instruction is fetched. One stall cycle per incorrect branch.

Page 26: Loop Unrolling

26

Branch Mispredict Penalty

• Assume: no data or structural hazards; only control hazards; every 5th instruction is a branch; branch predictor accuracy is 90%

• Slowdown = 1 / (1 + stalls per instruction)

• Stalls per instruction = % branches x %mispreds x penalty = 20% x 10% x 1 = 0.02

• Slowdown = 1/1.02 ; if penalty = 20, slowdown = 1/1.4
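A tiny C snippet that reproduces this arithmetic for the two penalties mentioned (purely illustrative):

#include <stdio.h>

/* Relative performance = 1 / (1 + stalls per instruction), where
 * stalls per instruction = branch fraction * mispredict rate * penalty. */
int main(void) {
    double branch_frac = 0.20, mispredict = 0.10;
    int penalties[] = {1, 20};
    for (int i = 0; i < 2; i++) {
        double stalls = branch_frac * mispredict * penalties[i];
        printf("penalty %2d: stalls/instr = %.2f, slowdown = 1/%.2f\n",
               penalties[i], stalls, 1.0 + stalls);
    }
    return 0;
}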

Page 27: Loop Unrolling

27

Dynamic Vs. Static ILP

• Static ILP:
  + The compiler finds parallelism: no extra hardware, higher clock speeds, and lower power
  + The compiler knows what comes next: better global schedule
  - The compiler cannot react to dynamic events (e.g., cache misses)
  - Cannot re-order instructions unless you provide hardware and extra instructions to detect violations (eats into the low-complexity/power argument)
  - Static branch prediction is poor; even statically scheduled processors use hardware branch predictors

Page 28: Loop Unrolling

28

Dynamic Scheduling

• Hardware rearranges instruction execution to reduce stalls
  • Maintains data flow and exception behavior
• Advantages:
  • Handles cases where dependences are unknown at compile time
  • Simplifies the compiler
  • Allows the processor to tolerate unpredictable delays (e.g., cache misses) by executing other code
  • Allows code compiled with one pipeline in mind to run efficiently on a different pipeline
• Disadvantage:
  • Complex hardware

Page 29: Loop Unrolling

29

Dynamic Scheduling

• Simple pipelining technique:
  • In-order instruction issue and execution
  • If an instruction is stalled, no later instructions can proceed
  • A dependence between two closely spaced instructions leads to a hazard
• What if there are multiple functional units?
  • Units could stay idle
  • If instruction “j” depends on a long-running instruction “i”, all instructions after “j” stall

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14

The stall can be eliminated by not requiring instructions to execute in order.

Page 30: Loop Unrolling

30

Idea

• Classic five-stage pipeline
  • Structural and data hazards could be checked during ID
• What do we need to allow us to execute the SUB.D?
• Separate the ID process into two steps:
  • Check for hazards (issue)
  • Decode (read operands)
• In-order instruction issue (in program order)
  • Begin execution as soon as operands are available
  • Out-of-order execution (out-of-order completion)

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14

Page 31: Loop Unrolling

31

Out-of-order complications

• Introduces the possibility of WAW and WAR hazards
  • These do not exist in the 5-stage pipeline
• Solution:
  • Register renaming

DIV.D F0,F2,F4
ADD.D F6,F0,F8     (reads F8, writes F6)
SUB.D F8,F10,F14   (WAR on F8 with the ADD.D)
MUL.D F6,F10,F8    (WAW on F6 with the ADD.D)

Page 32: Loop Unrolling

32

Out-of-order complications

• Handling exceptions
  • Out-of-order completion must preserve exception behavior
  • Exactly those exceptions that would arise if the program were executed in strict program order actually do arise
• Preserve exception behavior by:
  • Ensuring that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed

Page 33: Loop Unrolling

33

Splitting ID Stage

• Instruction Fetch:
  • Fetch into a register or queue
• Instruction Decode
  • Issue:
    • Decode instructions
    • Check for structural hazards
  • Read operands:
    • Wait until there are no data hazards
    • Then read operands
• Execute
  • Just as in the 5-stage pipeline execute stage
  • May take multiple cycles

Page 34: Loop Unrolling

34

Hardware requirement

•Pipeline must allow multiple instructions to be in execution stage

• Multiple functional units

•Instructions pass through issue stage in order (in-order issue)

•Instructions can be stalled and bypass each other in the second stage (read operands)

•Instructions enter execution stage out-of-order

Page 35: Loop Unrolling

35

Dynamic Scheduling using Tomasulo’s Method

• Sophisticated scheme to allow out-of-order execution
• Objective is to minimize RAW hazards
  • Introduces register renaming to minimize WAR and WAW hazards
• Many variations of this technique are used in modern processors
• Common features:
  • Tracking instruction dependences so that an instruction executes as soon as (and only when) its operands are available (resolves RAW)
  • Renaming destination registers (WAR, WAW)

Page 36: Loop Unrolling

36

Dynamic Scheduling using Tomasulo’s Method

•Register renaming:

Before renaming:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8

After renaming (S and T are temporary registers):
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T

Finding any remaining use of F8 requires a sophisticated compiler or hardware.

Page 37: Loop Unrolling

37

Dynamic Scheduling using Tomasulo’s Method

• Register renaming is provided by reservation stations
  • Buffer the operands of instructions waiting to issue
• A reservation station fetches and buffers an operand as soon as it is available
  • Eliminates the need to get the operand from a register
• Pending instructions designate the reservation station that will provide their input (register renaming)
• Successive writes to a register: only the last one is used to update it
• There can be more reservation stations than real registers!
  • Name dependences that can’t be eliminated by a compiler can now be eliminated by hardware

Page 38: Loop Unrolling

38

Dynamic Scheduling using Tomasulo’s Method

• Data structures attached to:
  • Reservation stations
  • Load/store buffers
  • Register file
• Once an instruction has issued and is waiting for a source operand
  • It refers to the operand by the number of the reservation station to which the instruction that will generate the value has been assigned
  • If the reservation station number is 0, the operand is already in the register file
• 1-cycle latency between source and result
  • The effective latency between a producing instruction and a consuming instruction is at least 1 cycle longer than the latency of the functional unit producing the result

Page 39: Loop Unrolling

39

Reservation Station

• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
  • For loads, Vk is used to hold the offset
• Qj, Qk: reservation stations producing the source registers (the values to be written)
  • Note: Qj, Qk = 0 => ready
• Busy: indicates that the reservation station or FU is busy
• A: holds information for the memory address calculation for a load/store
  • Initially, the immediate value is stored in A
• Register result status (Qi): indicates the number of the reservation station that contains the operation whose result should be stored into this register
  • Blank when there are no pending instructions that will write that register

(A C struct sketch of these fields follows.)
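A hedged C struct sketch of these fields; field types, widths, and the number of stations are assumptions for illustration, not from the slides:

#include <stdbool.h>
#include <stdint.h>

/* One reservation station entry, mirroring the fields listed above. */
struct rs_entry {
    bool    busy;      /* station (and its FU) is in use               */
    int     op;        /* operation to perform, e.g. ADD, SUB, MUL     */
    double  Vj, Vk;    /* source operand values (Vk holds load offset) */
    int     Qj, Qk;    /* RS numbers producing the sources; 0 = ready  */
    int32_t A;         /* address/immediate info for loads and stores  */
};

/* Register result status: for each architectural register, the RS
 * number that will write it, or 0 if there is no pending writer. */
struct reg_status {
    int Qi;
};

struct rs_entry   stations[8];     /* illustrative station count */
struct reg_status regstat[32];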

Page 40: Loop Unrolling

40

Dynamic Scheduling using Tomasulo’s Method

Page 41: Loop Unrolling

D. Patterson’s Tomasulo Slides

41

Page 42: Loop Unrolling

42

Hardware-Based Speculation

• Branch prediction reduces the stalls attributable to branches
• For a processor executing multiple instructions per cycle:
  • Just predicting the branch is not enough
  • A multiple-issue processor may execute a branch every clock cycle

•Exploiting parallelism requires that we overcome the limitation of control dependence

Page 43: Loop Unrolling

43

Hardware-Based Speculation

• Greater ILP: overcome control dependence by having the hardware speculate on the outcome of branches and execute the program as if the guesses were correct
  • An extension of branch prediction with dynamic scheduling
  • Speculation: fetch, issue, and execute instructions as if branch predictions were always correct
  • Dynamic scheduling alone only fetches and issues such instructions
• Essentially a data flow execution model: operations execute as soon as their operands are available

Page 44: Loop Unrolling

44

Hardware-Based Speculation

• 3 components of HW-based speculation:
• Dynamic branch prediction to choose which instructions to execute
• Speculation to allow execution of instructions before control dependences are resolved
  • Requires the ability to undo the effects of an incorrectly speculated sequence
• Dynamic scheduling to deal with the scheduling of different combinations of basic blocks
  • Without speculation, dynamic scheduling only partially overlaps basic blocks because it requires that a branch be resolved before actually executing any instructions in the successor basic block

Page 45: Loop Unrolling

45

Hardware-Based Speculation in Tomasulo

• The key idea:
  • Allow instructions to execute out of order
  • Force instructions to commit in order
  • Prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits
• Hence:
  • Must separate execution from allowing an instruction to finish, or “commit”
  • Instructions may finish execution considerably before they are ready to commit
• This additional step is called instruction commit

Page 46: Loop Unrolling

46

Hardware-Based Speculation in Tomasulo

•When an instruction is no longer speculative, allow it to update the register file or memory

•Requires additional set of buffers to hold results of instructions that have finished execution but have not committed : reorder buffer (ROB)

•This reorder buffer (ROB) is also used to pass results among instructions that may be speculated

Page 47: Loop Unrolling

47

Reorder Buffer

•In Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file

•With speculation, the register file is not updated until the instruction commits

• (we know definitively that the instruction should execute)

•Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit

• ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo’s algorithm

• ROB extends the architectural registers, just as the RS do

Page 48: Loop Unrolling

48

Reorder Buffer Structure (Four fields)

• Instruction type field
  • Indicates whether the instruction is a branch (and has no destination result), a store (which has a memory address destination), or a register operation (ALU operation or load, which has a register destination)
• Destination field
  • Supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written
• Value field
  • Holds the value of the instruction result until the instruction commits
• Ready field
  • Indicates that the instruction has completed execution, and the value is ready

(A C struct sketch of these fields follows.)
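A hedged C struct sketch of an ROB entry with these four fields; types and the ROB size are illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>

enum rob_type { ROB_BRANCH, ROB_STORE, ROB_REGOP };   /* instruction type field */

/* One reorder-buffer entry with the four fields described above. */
struct rob_entry {
    enum rob_type type;      /* branch, store, or register operation          */
    uint32_t      dest;      /* register number, or memory address for a store */
    double        value;     /* result held here until commit                  */
    bool          ready;     /* execution finished, value is valid             */
};

#define ROB_SIZE 32
struct rob_entry rob[ROB_SIZE];    /* managed as a FIFO: head commits, tail issues */
int rob_head, rob_tail, rob_count;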

Page 49: Loop Unrolling

49

Reorder Buffer Operation

• Holds instructions in FIFO order, exactly as issued
• When instructions complete, results are placed into the ROB
  • Supplies operands to other instructions between execution complete and commit => more registers, like the RS
  • Tag results with the ROB entry number instead of the reservation station number
• Instructions commit => values at the head of the ROB are placed in the registers
• As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions

[Diagram: the FP op queue and reorder buffer feed the reservation stations of the FP adders; on commit, results move from the ROB to the FP registers (the commit path).]

Page 50: Loop Unrolling

50

Page 51: Loop Unrolling

51

Page 52: Loop Unrolling

52

4 Steps of Speculative Tomasulo

1. Issue — get an instruction from the instruction queue

If a reservation station and a reorder buffer slot are free, issue the instruction, send the operands to the reservation station, and send the reorder buffer number allocated for the result to the reservation station (to tag the result when it is placed on the CDB).

2. Execution — operate on operands (EX)

When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute. Checks RAW (sometimes called “issue”).

Page 53: Loop Unrolling

53

4 Steps of Speculative Tomasulo

3. Write result — finish execution (WB)

Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

If the value to be stored is available, it is written into the Value field of the ROB entry for the store.

If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the Value field of the ROB entry of the store is updated.

Page 54: Loop Unrolling

54

4 Steps of Speculative Tomasulo

4. Commit

a) When an instruction reaches the head of the ROB and its result is present in the buffer:
   • Update the register with the result and remove the instruction from the ROB.
b) Committing a store is similar, except that memory is updated rather than a result register.
c) If a branch with an incorrect prediction reaches the head of the ROB:
   • It indicates that the speculation was wrong.
   • The ROB is flushed and execution is restarted at the correct successor of the branch.
d) If the branch was correctly predicted, the branch is finished.
   • Once an instruction commits, its entry in the ROB is reclaimed. If the ROB fills, we simply stop issuing instructions until an entry is made free.

(A sketch of the commit step follows.)
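A rough C sketch of this commit logic, building on the hypothetical rob_entry/ROB arrays sketched after the reorder-buffer-structure slide; regfile[], store_to_memory(), branch_was_mispredicted(), and flush_and_restart() are assumed helpers, not real APIs:

/* Assumed helpers and state, for illustration only. */
extern double regfile[32];
extern void   store_to_memory(uint32_t addr, double value);
extern bool   branch_was_mispredicted(const struct rob_entry *e);
extern void   flush_and_restart(const struct rob_entry *e);

void commit_step(void) {
    if (rob_count == 0 || !rob[rob_head].ready)
        return;                               /* head not finished: wait        */
    struct rob_entry *e = &rob[rob_head];
    if (e->type == ROB_REGOP) {
        regfile[e->dest] = e->value;          /* update architectural register  */
    } else if (e->type == ROB_STORE) {
        store_to_memory(e->dest, e->value);   /* memory is updated only now     */
    } else if (branch_was_mispredicted(e)) {  /* remaining case: a branch       */
        flush_and_restart(e);                 /* squash the ROB, refetch target */
        return;
    }
    rob_head = (rob_head + 1) % ROB_SIZE;     /* reclaim the head entry         */
    rob_count--;
}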

Page 55: Loop Unrolling

55

Speculation vs. Dynamic Scheduling

• With speculation, no instruction after the earliest uncompleted instruction (MUL.D) is allowed to complete. In contrast, with dynamic scheduling alone the faster instructions (SUB.D and ADD.D) have also completed.

• ROB can dynamically execute code while maintaining a precise interrupt model.

• if MUL.D caused an interrupt, we could simply wait until it reached the head of the ROB and take the interrupt, flushing any other pending instructions from the ROB. Because instruction commit happens in order, this yields a precise exception.

• By contrast, in Tomasulo’s algorithm, the SUB.D and ADD.D completed before the MUL.D raised the exception. F8 and F6 could be overwritten, and the interrupt would be imprecise.

L.D   F6,32(R2)
L.D   F2,44(R3)
MUL.D F0,F2,F4
SUB.D F8,F6,F2
DIV.D F10,F0,F6
ADD.D F6,F8,F2

Page 56: Loop Unrolling

56

Design Details - I

• Instructions enter the pipeline in order

• No need for branch delay slots if prediction happens in time

• Instructions leave the pipeline in order – all instructions that enter also get placed in the ROB – the process of an instruction leaving the ROB (in order) is called commit – an instruction commits only if it and all instructions before it have completed successfully (without an exception)

• To preserve precise exceptions, a result is written into the register file only when the instruction commits – until then, the result is saved in a temporary register in the ROB

Page 57: Loop Unrolling

57

Design Details - II

• Instructions get renamed and placed in the issue queue – some operands are available (T1-T6; R1-R32), while others are being produced by instructions in flight (T1-T6)

• As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue – instructions now know if their operands are ready

• When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution)

Page 58: Loop Unrolling

58

Design Details - III

• If instr-3 raises an exception, wait until it reaches the top of the ROB – at this point, R1-R32 contain results for all instructions up to instr-3 – save registers, save PC of instr-3, and service the exception

• If branch is a mispredict, flush all instructions after the branch and start on the correct path – mispredicted instrs will not have updated registers (the branch cannot commit until it has completed and the flush happens as soon as the branch completes)