

Lecture 5: EITF20 Computer Architecture

Anders Ardö

EIT – Electrical and Information Technology, Lund University

November 19, 2014

A. Ardö, EIT Lecture 5: EITF20 Computer Architecture November 19, 2014 1 / 59


Outline

1 Reiteration

2 Dynamic scheduling - Tomasulo

3 Superscalar, VLIW

4 Speculation

5 ILP limitations

6 What we have done so far


Instruction Level Parallelism - ILP

ILP: Overlap execution of unrelated instructions: Pipelining

Two main approaches:

  DYNAMIC ⇒ hardware detects parallelism
  STATIC ⇒ software detects parallelism

Often a mix between both.

Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
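
As a small worked example of the CPI formula, the sketch below simply adds the four terms. The stall values are hypothetical numbers chosen for illustration (expressed as stall cycles per instruction); they do not come from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Average cycles per instruction; every stall term is in stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall contributions, for illustration only.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.20, control_stalls=0.15)
print(f"Pipeline CPI = {cpi:.2f}")
```

With these assumed numbers the pipeline runs at about 1.4 cycles per instruction, so removing the data hazard stalls alone would bring it to roughly 1.2.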


Why loop unrolling works

Longer sequences of straight-line code without branches (longer basic blocks) allow for easier static rescheduling by the compiler

Longer basic blocks also facilitate dynamic rescheduling such as Scoreboard and Tomasulo’s algorithm
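
The shape of the transformation can be sketched as follows. This is only an illustration: Python itself gains nothing from unrolling, and the function names and the unroll factor of four are assumptions. The point is that the unrolled body is a longer basic block with one loop branch per four elements and four independent accumulators that a scheduler could overlap.

```python
def sum_rolled(values):
    total = 0.0
    for v in values:            # one loop branch per element
        total += v
    return total


def sum_unrolled4(values):
    """Same sum with the loop body unrolled four times."""
    n = len(values)
    s0 = s1 = s2 = s3 = 0.0     # four independent accumulators
    i = 0
    while i + 4 <= n:           # one loop branch per four elements
        s0 += values[i]
        s1 += values[i + 1]
        s2 += values[i + 2]
        s3 += values[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    while i < n:                # clean-up loop for the leftover elements
        total += values[i]
        i += 1
    return total


data = list(range(1, 11))
print(sum_rolled(data), sum_unrolled4(data))
```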


Dynamic Branch Prediction

Branches limit performance because:
  Branch penalties
  Limit to available Instruction Level Parallelism

Solution: Dynamic branch prediction to predict the outcome of conditional branches.

Benefits:
  Reduce the time until the branch condition is known
  Reduce the time to calculate the branch target address
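
One common way to predict branch outcomes dynamically is a table of 2-bit saturating counters. The slide does not name a specific scheme, so the predictor below is an assumed example, with a hypothetical table size and branch address. It predicts only the direction of the branch; target address prediction (a branch target buffer) is not modelled.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by the low bits of the branch address."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries    # 0, 1 = predict not taken; 2, 3 = predict taken

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(predictor, pc, outcomes):
    """Predict, then train, for each outcome; return the number of correct predictions."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


# A loop branch that is taken nine times and then falls through, executed twice.
outcomes = ([True] * 9 + [False]) * 2
hits = count_correct(TwoBitPredictor(), pc=0x40, outcomes=outcomes)
print(f"{hits} of {len(outcomes)} predictions correct")
```

Because the counter needs two wrong guesses to change direction, the single loop exit does not disturb the prediction for the next run of the loop.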


Dependencies

Two instructions must be independent in order to execute in parallel

There are three general types of dependencies that limit parallelism:
  Data dependencies
  Name dependencies
  Control dependencies

Dependencies are properties of the program

Whether a dependency leads to a hazard or not is a property of the pipeline implementation
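
Data and name dependencies between two instructions can be found by comparing their register operands. The sketch below is a minimal illustration: the `Instr` tuple and the `dependencies` helper are hypothetical names, the three instructions are borrowed from the WAR example later in the lecture, and control dependencies are not modelled.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")


def dependencies(first, second):
    """Register dependencies of `second` on the earlier instruction `first`."""
    found = []
    if first.dest is not None and first.dest in second.srcs:
        found.append("RAW (true data dependency)")
    if second.dest is not None and second.dest in first.srcs:
        found.append("WAR (name dependency)")
    if first.dest is not None and first.dest == second.dest:
        found.append("WAW (name dependency)")
    return found


ld = Instr("LD", "F6", ("R2",))
divd = Instr("DIVD", "F10", ("F0", "F6"))
addd = Instr("ADDD", "F6", ("F8", "F2"))

print(dependencies(ld, divd))    # DIVD reads the F6 that LD writes
print(dependencies(divd, addd))  # ADDD overwrites the F6 that DIVD reads
print(dependencies(ld, addd))    # both write F6
```

The function only reports dependencies in the program; whether any of them stalls the machine depends on the pipeline, as the slide notes.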


Scoreboard pipeline

Goal of scoreboarding is to maintain an execution rate of one instruction per clock cycle by executing an instruction as early as possible.

Instructions execute out-of-order when there are sufficient resources and no data dependencies.

A scoreboard is a hardware unit that keeps track of
  the instructions that are in the process of being executed,
  the functional units that are doing the executing,
  and the registers that will hold the results of those units.

A scoreboard centrally performs all hazard detection and resolution and thus controls the instruction progression from one step to the next.


Summary

ILP:
  Rescheduling and loop unrolling are important to take advantage of potential Instruction Level Parallelism

Dynamic instruction scheduling
  An alternative to compile-time scheduling
  Does not need recompilation to increase performance
  Used in most new processor implementations

Dynamic Branch Prediction
  reduce branch penalties by early prediction of conditional branch outcomes


Questions!

QUESTIONS?

COMMENTS?


Lecture 5 agenda

Chapters 2.4-2.8, 3.1-3.4 in "Computer Architecture"

1 Reiteration

2 Dynamic scheduling - Tomasulo

3 Superscalar, VLIW

4 Speculation

5 ILP limitations

6 What we have done so far


Scoreboard pipeline

Issue: Decode and check for structural hazards

Read operands: wait until no data hazards, then read operands

All data hazards are handled by the scoreboard


Limitations with Scoreboard

The number of scoreboard entries (window size)
The number and types of functional units
Number of datapaths to registers
The presence of name dependencies

Tomasulo’s algorithm addresses the last two limitations.


Tomasulo’s Algorithm

Another dynamic instruction scheduling algorithm

For IBM 360/91, a few years after the CDC 6600 (Scoreboard)

Goal: High performance without compiler support

Differences between Tomasulo & Scoreboard:
  Control & Buffers distributed with FUs (called reservation stations) vs. centralized in Scoreboard
  Register names in instructions replaced by pointers to reservation station buffer (HW register renaming)
  Common Data Bus broadcasts results to all FUs
  Loads and Stores treated as FUs as well


Tomasulo Organization


Three Stages of Tomasulo Alg.

1. Issue – get instruction from FP Op Queue
   If reservation station free (no structural hazard), the instruction is issued together with its operands (renames registers)

2. Execution – operate on operands (EX)
   When both operands are ready, then execute; if not ready, watch Common Data Bus (CDB) for operands (snooping)

3. Write result – finish execution (WB)
   Write on CDB to all awaiting functional units; mark reservation station available

Normal bus: data + destination
Common Data Bus: data + source (snooping)
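
The three stages can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the IBM 360/91 design: the class names, the `Vj/Vk/Qj/Qk` style fields and the station names `Add1`/`Add2` are illustrative, timing is ignored, and the caller supplies the computed result instead of a real functional unit.

```python
class ReservationStation:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.vj = self.vk = None    # operand values, once known
        self.qj = self.qk = None    # stations that will produce the missing operands

    def ready(self):
        """Stage 2 condition: execution may start when both operands are present."""
        return self.busy and self.qj is None and self.qk is None


class Tomasulo:
    def __init__(self, station_names, regs):
        self.stations = {n: ReservationStation(n) for n in station_names}
        self.regs = dict(regs)      # register file values
        self.producer = {}          # register -> station that will write it

    def _read(self, reg):
        if reg in self.producer:    # value still pending: return a pointer to its producer
            return self.producer[reg], None
        return None, self.regs[reg]

    def issue(self, op, dest, src1, src2):
        """Stage 1: claim a free station and rename the operands."""
        rs = next((s for s in self.stations.values() if not s.busy), None)
        if rs is None:
            return None             # structural hazard: issue stalls
        rs.busy, rs.op = True, op
        rs.qj, rs.vj = self._read(src1)
        rs.qk, rs.vk = self._read(src2)
        self.producer[dest] = rs.name
        return rs.name

    def write_result(self, station_name, value):
        """Stage 3: broadcast (source station, value) on the Common Data Bus."""
        for rs in self.stations.values():       # every station snoops the bus
            if rs.qj == station_name:
                rs.qj, rs.vj = None, value
            if rs.qk == station_name:
                rs.qk, rs.vk = None, value
        for reg, prod in list(self.producer.items()):
            if prod == station_name:            # register still waits for this station
                self.regs[reg] = value
                del self.producer[reg]
        self.stations[station_name].busy = False


m = Tomasulo(["Add1", "Add2"], {"F0": 2.0, "F2": 3.0, "F4": 4.0})
a = m.issue("ADDD", "F6", "F0", "F2")   # both operands available at issue
b = m.issue("ADDD", "F8", "F6", "F4")   # F6 is pending, so the station waits on `a`
waiting_on = m.stations[b].qj
m.write_result(a, 2.0 + 3.0)            # the broadcast wakes up station `b`
print(waiting_on, m.stations[b].ready(), m.stations[b].vj)
```

The broadcast carries the name of the producing station rather than a destination, which is what lets every waiting station pick the value up by itself.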


Tomasulo example, cycle 0


Tomasulo example, cycle 1


Tomasulo example, cycle 3


Tomasulo example, cycle 4


Tomasulo example, cycle 5


Tomasulo example, cycle 7


Tomasulo example, cycle 10


Elimination of WAR hazards

Example:

    LD    F6, 34(R2)
    ...   ...
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

ADDD can safely finish before DIVD has read register F6 because:
  DIVD has renamed register F6 to point at the reservation station
  LD broadcasts its result on the Common Data Bus

Register renaming can thus be done:
  statically by the compiler
  dynamically by the hardware
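
The effect of renaming on this example can be sketched as below. It is a simplified model: the fresh tags `T0`, `T1`, ... stand in for reservation station names (or compiler-chosen registers), and the `rename` helper is a hypothetical name.

```python
import itertools


def rename(program):
    """Give every destination a fresh tag; sources use the latest tag of their register."""
    fresh = (f"T{i}" for i in itertools.count())
    latest = {}                      # architectural register -> current tag
    renamed = []
    for op, dest, srcs in program:
        # Look up the sources before the destination mapping is updated.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        latest[dest] = next(fresh)
        renamed.append((op, latest[dest], new_srcs))
    return renamed


program = [
    ("LD",   "F6",  ("R2",)),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6",  ("F8", "F2")),
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, DIVD reads the tag written by LD while ADDD writes a different tag, so the order in which ADDD completes no longer matters to DIVD.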


Tomasulo example, cycle 11


Tomasulo example, cycle 16


Tomasulo example, cycle 57


Benefits Tomasulo

distributed hazard detection logic
distributed reservation stations
Common Data Bus (CDB) with snooping

elimination of WAR and WAW hazards (renaming registers)


Dynamic scheduling Tomasulo - summary

tolerates unpredictable delays
compile for one pipeline - run effectively on another
significant increase in HW complexity
out-of-order execution, completion
register renaming


Getting CPI < 1!

Issuing multiple instructions per clock cycle

Superscalar: varying number of instructions/cycle (1-8) scheduled by compiler or HW
  IBM Power5, Pentium 4, Sun SuperSparc, DEC Alpha
  Simple hardware, complicated compiler or...
  Very complex hardware but simple for compiler

Very Long Instruction Word (VLIW): fixed number of instructions (3-5) scheduled by the compiler
  HP/Intel IA-64, Itanium
  Simple hardware, difficult for compiler
  high performance through extensive compiler optimization


Approaches for multiple issue

             Issue    Hazard detection  Scheduling     Characteristics / examples
Superscalar  dynamic  HW                static         in-order execution; ARM
Superscalar  dynamic  HW                dynamic        out-of-order execution
Superscalar  dynamic  HW                dynamic        speculation; Pentium 4, IBM Power5
VLIW         static   compiler          static         TI C6x
EPIC         static   compiler          mostly static  Itanium


Very Long Instruction Word (VLIW)

A number of functional units that independently execute instructions in parallel.
The compiler decides which instructions can execute in parallel
No hazard detection needed


Itanium instruction format


Itanium architecture


Limits of VLIW

Limited Instruction Level Parallelism
  With n functional units and k pipeline stages we need n × k independent instructions to utilize the hardware

Memory and register bandwidth
  With increasing number of functional units, the number of ports needed at the memory or register file must increase to prevent structural hazards

Code size
  Compiler scheduled pipeline “bubbles” take up space in the instruction
  Need more aggressive loop unrolling to work well which also increases code size

No binary code compatibility
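
Two of these limits lend themselves to a quick calculation. The sketch below uses hypothetical numbers (4 functional units, 5 pipeline stages, and an invented schedule) to show the n × k requirement and how unfilled slots inflate code size when every cycle becomes one fixed-width instruction word.

```python
def independent_instructions_needed(functional_units, pipeline_stages):
    """n x k: instructions in flight needed to keep every stage of every unit busy."""
    return functional_units * pipeline_stages


def bundle_slots(ops_per_cycle, width):
    """Return (useful slots, NOP slots) when each cycle is one fixed-width bundle."""
    if any(n > width for n in ops_per_cycle):
        raise ValueError("a cycle cannot hold more operations than the bundle width")
    useful = sum(ops_per_cycle)
    nops = sum(width - n for n in ops_per_cycle)
    return useful, nops


needed = independent_instructions_needed(functional_units=4, pipeline_stages=5)

# Hypothetical schedule: independent operations the compiler found for each cycle.
schedule = [3, 1, 2, 1, 4]
useful, nops = bundle_slots(schedule, width=4)
print(needed, useful, nops)
```

In this invented schedule nearly half of the instruction slots hold no useful work, which is the code-size cost the slide refers to.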


HW supported speculation

A combination of three main ideas:
  Dynamic instruction scheduling; take advantage of ILP
  Dynamic branch prediction; allows instruction scheduling across branches
  Speculative execution; execute instructions before all control dependencies are resolved

Hardware based speculation uses data-flow execution: instructions execute when their operands are available


HW vs. SW speculation

Advantages:
  Dynamic runtime disambiguation of memory addresses
  Dynamic branch prediction is often better than static prediction, which limits the performance of SW speculation
  HW speculation can maintain a precise exception model
  Can achieve higher performance on older code (without recompilation)

Main disadvantage:
  Extremely complex implementation and extensive need for hardware resources


Tomasulo extended to handle speculation


Re-order buffer - ROB

Data structure

entry  instruction type  destination  value  ready
1
2
...
n

supports speculative execution
instructions commit in order
precise exceptions
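
A minimal sketch of such a buffer, using the four fields from the table, is shown below. The class and method names are assumptions, and for brevity every committed entry writes a register (stores to memory are not modelled).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    instr_type: str          # e.g. "alu", "load", "branch"
    destination: str         # register that receives the result
    value: object = None
    ready: bool = False


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()           # head of the queue = oldest instruction

    def allocate(self, instr_type, destination):
        if len(self.entries) == self.size:
            return None                  # buffer full: issue must stall
        entry = RobEntry(instr_type, destination)
        self.entries.append(entry)
        return entry

    def commit(self, regs):
        """Retire ready entries from the head only, so registers update in program order."""
        committed = []
        while self.entries and self.entries[0].ready:
            entry = self.entries.popleft()
            regs[entry.destination] = entry.value
            committed.append(entry.destination)
        return committed


rob = ReorderBuffer(size=4)
regs = {}
first = rob.allocate("alu", "F0")
second = rob.allocate("alu", "F2")
second.value, second.ready = 7.0, True   # the younger instruction finishes first
early = rob.commit(regs)                 # nothing commits: the head is not ready
first.value, first.ready = 1.0, True
late = rob.commit(regs)                  # both commit, oldest first
print(early, late)
```

Because results reach the register file only at commit, an exception or a wrong prediction can discard younger entries without leaving any trace in the architectural state.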


Four steps of Speculative Tomasulo

1. Issue – get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer nr. for destination

2. Execution – operate on operands (EX)
   If both operands ready: execute; if not, watch CDB for result; when both operands are in reservation station: execute

3. Write result – complete execution
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available

4. Commit – update register with reorder result
   When instr. is at head of reorder buffer & result is present; update register with result (or store to memory) and remove instr. from reorder buffer; (handle misspeculations and precise exceptions)


Misspeculation!

Commit – branch prediction wrong
  When branch instr. is at head of reorder buffer & incorrect prediction: remove all instr. from reorder buffer (flush); restart execution at correct instruction

Expensive ⇒ try to recover as early as possible

Performance sensitive to branch prediction/speculation mechanism
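
The flush at commit can be sketched as follows. The dictionary fields (`predicted_taken`, `taken`, `target`, `fallthrough`) and the addresses are hypothetical; the reorder buffer is modelled as a plain list with the oldest instruction first.

```python
def commit_head(rob, regs):
    """Commit the oldest entry; on a mispredicted branch, flush and return the restart PC."""
    if not rob:
        return None
    head = rob[0]
    if not head["ready"]:
        return None                          # head not finished: nothing commits
    if head["type"] == "branch":
        if head["predicted_taken"] != head["taken"]:
            restart = head["target"] if head["taken"] else head["fallthrough"]
            rob.clear()                      # squash the branch and everything younger
            return restart
        rob.pop(0)                           # correct prediction: just retire the branch
        return None
    regs[head["dest"]] = head["value"]       # ordinary instruction: update the register
    rob.pop(0)
    return None


regs = {}
rob = [
    {"type": "branch", "ready": True, "predicted_taken": True, "taken": False,
     "target": 100, "fallthrough": 8},
    {"type": "alu", "ready": True, "dest": "F0", "value": 9.0},   # speculative work
]
restart_pc = commit_head(rob, regs)
print(restart_pc, rob, regs)
```

The speculatively computed value for F0 is discarded without ever reaching the register file; all the work done after the branch is lost, which is why recovery is expensive.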


Multiple issue and speculation

Possible to extend Tomasulo with both multiple issue and speculation.
Major issues – instruction issue and monitoring CDB
Must be able to handle multiple commits
Alternative to Tomasulo is to use extra physical registers for both architecturally visible registers and temporary values with register renaming


Tomasulo speculation - increased complexity


Dynamic scheduling, speculation - summary

tolerates unpredictable delays
compile for one pipeline - run effectively on another
allows speculation
  multiple branches
  in-order commit
  precise exceptions
  time, energy; recovery
significant increase in HW complexity
out-of-order execution, completion
register renaming


Sandy Bridge microarchitecture


ILP

How much performance can we get by utilizing ILP?


A model of an ideal processor

Provides a base for ILP measurements
No structural hazards
Register renaming – infinite virtual registers and all WAW & WAR hazards avoided
Machine with perfect speculation
  Branch prediction – perfect; no mispredictions
  Jump prediction – all jumps perfectly predicted
Memory-address alias analysis – addresses are known & a store can be moved before a load provided addresses not equal
Perfect caches

There are only true data dependencies left!
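
Under these assumptions the achievable ILP of a code fragment is its instruction count divided by the length of the longest chain of true dependencies. The sketch below computes that bound for a small invented program, assuming unit latency and unlimited issue width; the function name and register names are hypothetical.

```python
def dataflow_ilp(program, latency=1):
    """Start each instruction as soon as its true (RAW) dependencies allow.

    `program` is a list of (destination, sources) pairs in program order.
    WAR and WAW hazards are ignored, as perfect renaming would remove them.
    """
    if not program:
        return 0.0
    ready_at = {}            # register -> cycle when its newest value is available
    last_cycle = 0
    for dest, srcs in program:
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        finish = start + latency
        ready_at[dest] = finish
        last_cycle = max(last_cycle, finish)
    return len(program) / last_cycle


program = [
    ("r1", ("a",)),
    ("r2", ("b",)),
    ("r3", ("r1", "r2")),
    ("r4", ("c",)),
    ("r5", ("r4",)),
    ("r1", ("r3", "r5")),    # reuses r1; the name dependency costs nothing here
]
ilp = dataflow_ilp(program)
print(f"ILP = {ilp:.1f} instructions per cycle")
```

Six instructions with a longest dependency chain of three give an ideal ILP of 2 in this example; real window sizes, branch predictors and register counts can only lower that figure.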


Upper Limit to ILP


Impact window size


More realistic HW: Branch impact


More realistic HW: Register impact


Summary

Software (compiler) tricks:
  Loop unrolling
  Static instruction scheduling (with register renaming)
  ... and more

Hardware tricks:
  Dynamic instruction scheduling
  Dynamic branch prediction
  Multiple issue – Superscalar, VLIW
  Speculative execution
  ... and more


AMD Phenom CPU


Intel Core2


Intel Core2 chip (Nehalem)
