Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91)
Please recall: Multicycle instructions lead to the requirement of out-of-order execution.
- Control flow scheduling, when performed centrally at the time of decode: ==> scoreboarding technique, implemented in the CDC 6600.
- Dataflow scheduling, if performed in a distributed manner by the FUs themselves at execute time. Instructions are decoded and issued to reservation stations, where they await their operands. ==> Tomasulo scheme in the IBM System/360 Model 91 processor, which is the basis of modern superscalar processors.

Scoreboard Summary
Main advantage:
- managing multiple FUs
- out-of-order execution of multi-cycle operations
- maintaining all data dependences (RAW, WAW, WAR)
Scoreboard limitations:
- single-issue scheme (however, the scheme is extendable to multiple issue)
- in-order issue
- no renaming: antidependences and output dependences may lead to WAR and WAW stalls
- no forwarding hardware: all results go through the registers
General limitations (not only valid for scoreboarding):
- number and types of FUs, since contention for FUs leads to structural hazards
- the amount of parallelism available in the code (dependences lead to stalls)

Tomasulo scheme removes some of the scoreboard limitations by forwarding and renaming hardware, but is still single issue and in-order issue.

Register Renaming
A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but no data is transmitted between Inst1 and Inst2.
If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered.
The technique that eliminates name dependences in registers, and thereby avoids WAR and WAW hazards, is called register renaming.
Register renaming can be done statically (by the compiler) or dynamically (by hardware).
Tomasulo’s algorithm performs register renaming in hardware!
Dynamic renaming in memory is much harder to perform!
Why? Pointer aliasing problems.
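The principle of renaming can be illustrated with a short sketch. It assumes an unlimited pool of hypothetical physical registers P0, P1, ...; the function name `rename` and the tuple encoding are illustrative only. Tomasulo's hardware renames to reservation station tags rather than to such a pool, so this shows the general idea, not the 360/91 mechanism.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh name so that only RAW dependences remain.

    program: list of (op, dest, src1, src2) over architectural register names.
    """
    fresh = (f"P{i}" for i in count())   # hypothetical physical registers
    mapping = {}                         # architectural name -> current physical name
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources use the latest mapping, which preserves RAW dependences.
        # A register that was never written keeps its architectural name.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # Each write gets a new name, which removes WAR and WAW conflicts.
        mapping[dest] = next(fresh)
        renamed.append((op, mapping[dest], s1, s2))
    return renamed

program = [
    ("mul", "R1", "R3", "R5"),
    ("sub", "R2", "R4", "R3"),
    ("div", "R6", "R1", "R4"),
    ("add", "R4", "R2", "R3"),   # WAR with div, which reads R4
]
for instr in rename(program):
    print(instr)
```

After renaming, add writes P3 while div still reads R4, so the two instructions no longer conflict.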

Tomasulo Algorithm
- Developed for the IBM 360/91 in 1967 (about 3 years after the CDC 6600).
- Hazard detection and execution control are distributed among the functional units (vs. centralized in the scoreboard).
- Reservation stations at each functional unit control when an instruction can begin execution at that unit.
- The Common Data Bus (CDB) broadcasts results to all reservation stations (of all FUs).
- Loads and stores are treated as FUs as well.
- Each register has additional flags.

Tomasulo Organization
[Figure: Tomasulo organization. Components shown: instruction unit, control, registers, load buffers, load/store reservation stations, memory, reservation stations, functional units, operand bus, Common Data Bus (CDB).]

Reservation Station Components
Each FU has one or more reservation stations. A reservation station holds:
- instructions that have been issued and are awaiting execution at a functional unit,
- the operands for that instruction if they have already been computed (or the source of the operands otherwise),
- the information needed to control the instruction once it has begun execution.
The reservation stations buffer the operands of instructions waiting to issue, eliminating the need to get the operands from registers (similar to forwarding).
The register specifiers in a reservation station hold register values (scoreboarding: only pointers to the registers!) or pointers to the reservation stations that produce the result.
WAR hazards are avoided because an operand is already stored in the reservation station even when a write to the same register is performed out of order.
WAW hazards are avoided because pointers to reservation stations, instead of register pointers, are used as tags on the CDB.

Reservation Station Entries
- Empty: indicates whether the reservation station is empty
- InFU: indicates that the instruction is being executed in the FU; remains set until completion
- Op: operation to perform in the unit (e.g., + or -)
- Dest: destination of the result (in the scheduling example, the number of the destination register)
- Src1, Src2: values of the source operands
- RS1, RS2: tags of the reservation stations producing the source registers
- Vld1, Vld2: valid flags indicating whether the values are available
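The entry layout can be written down as a record. This is a minimal sketch, assuming Python field names derived from the slide labels; the `ready` method is an addition for illustration and encodes the dispatch condition (both operands valid, not yet in the FU).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    tag: int                      # number of this reservation station
    empty: bool = True            # Empty
    in_fu: bool = False           # InFU
    op: Optional[str] = None      # Op
    dest: Optional[int] = None    # Dest
    src1: Optional[str] = None    # Src1 (None while not yet available)
    vld1: bool = False            # Vld1
    rs1: int = 0                  # RS1 (tag of the producing station, 0 = none)
    src2: Optional[str] = None    # Src2
    vld2: bool = False            # Vld2
    rs2: int = 0                  # RS2

    def ready(self) -> bool:
        """True if the instruction can be dispatched to its FU."""
        return not self.empty and not self.in_fu and self.vld1 and self.vld2

# The div entry as it appears in cycle 3 of the scheduling example of this lecture.
div = ReservationStation(tag=4, empty=False, op="div", dest=6,
                         vld1=False, rs1=3, src2="(R4)", vld2=True)
print(div.ready())   # False: Src1 is still to be produced by station 3
div.src1, div.vld1, div.rs1 = "(R3)*(R5)", True, 0
print(div.ready())   # True
```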
Tomasulo Organization
[Figure: Tomasulo organization. Reservation stations with the fields Empty, InFU, Op, Dest, Src1, Vld1, RS1, Src2, Vld2, RS2 (RS status) and registers with the fields Value, Vld, RS (register status).]

CDB and Reservation Stations
After completion of an instruction from an RS, a result token is formed and passed on the common data bus (CDB) to the register file and, by snooping, directly to all RSs (thus eliminating the need to get the operand value from a register).
The traffic passing on the CDB is continually monitored. A result on the CDB is copied into all RSs awaiting it.
The CDB allows all units that are waiting for an operand to be loaded simultaneously. Hence, an RS fetches and buffers an operand as soon as it becomes available (dataflow principle).
The load buffers and load/store reservation stations hold data or addresses coming from and going to memory.
Register result status in the register set: indicates which reservation station will write each register, if one exists. It is blank when no pending instruction will write that register.
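The snooping described above can be sketched as a single broadcast step. The dictionary layout and the function name `broadcast` are assumptions made for this illustration; the field names follow the reservation station and register status fields of the lecture.

```python
def broadcast(tag, data, stations, registers):
    """Deliver one CDB result token (tag, data) to all snooping units."""
    for rs in stations:
        if rs["empty"]:
            continue
        for n in ("1", "2"):
            # An operand is copied only into stations that wait for this tag.
            if not rs["vld" + n] and rs["rs" + n] == tag:
                rs["src" + n], rs["vld" + n], rs["rs" + n] = data, True, 0
    for reg in registers.values():
        # Register result status: only a register that still expects the
        # result of this station is written.
        if not reg["vld"] and reg["rs"] == tag:
            reg["value"], reg["vld"], reg["rs"] = data, True, 0

# State just before the result of station 3 (a mul) appears on the CDB.
stations = [{"empty": False, "op": "div", "src1": None, "vld1": False, "rs1": 3,
             "src2": "(R4)", "vld2": True, "rs2": 0}]
registers = {1: {"value": None, "vld": False, "rs": 3},
             6: {"value": None, "vld": False, "rs": 4}}
broadcast(3, "(R3)*(R5)", stations, registers)
print(stations[0]["src1"], registers[1]["value"], registers[6]["vld"])
# (R3)*(R5) (R3)*(R5) False
```

The waiting station and register 1 receive the value in the same step; register 6 stays invalid because it waits for station 4.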

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from the instruction queue. If a reservation station is free, the Tomasulo algorithm issues the instruction and fetches operands from registers if possible. In-order issue!
2. Execution: operate on operands (EX). When both operands are ready, dispatch to the FU and execute; if not ready, watch the CDB for the result (check for RAWs). Out-of-order dispatch and execution!
3. Write result: finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available.

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 0: token.tag = -, token.data = -
Registers (register status):
R  Value  Vld  RS
1  -      1    0
2  -      1    0
3  (R3)   1    0
4  (R4)   1    0
5  (R5)   1    0
6  -      1    0
Reservation stations (RS status):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
Sadd  1   1
Sadd  2   1
Smul  3   1
Sdiv  4   1
We assume: mul and div need 4 EX cycles, sub and add need 1 EX cycle.

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 1: token.tag = -, token.data = -
Registers (register status):
R  Value  Vld  RS
1  -      0    3
2  -      1    0
3  (R3)   1    0
4  (R4)   1    0
5  (R5)   1    0
6  -      1    0
Reservation stations (RS status):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
Sadd  1   1
Sadd  2   1
Smul  3   0      0     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   1

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 2: token.tag = -, token.data = -
Registers (register status):
R  Value  Vld  RS
1  -      0    3
2  -      0    1
3  (R3)   1    0
4  (R4)   1    0
5  (R5)   1    0
6  -      1    0
Reservation stations (RS status):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
Sadd  1   0      0     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1
Smul  3   0      1     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   1

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 3: token.tag = -, token.data = -
Registers (register status):
R  Value  Vld  RS
1  -      0    3
2  -      0    1
3  (R3)   1    0
4  (R4)   1    0
5  (R5)   1    0
6  -      0    4
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   0      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1
Smul  3   0      1     mul  1     (R3)       1     0    (R5)  1     0    3
Sdiv  4   0      0     div  6     -          0     3    (R4)  1     0

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 4: token.tag = 1, token.data = (R4)-(R3)
Registers (register status):
R  Value      Vld  RS
1  -          0    3
2  (R4)-(R3)  1    0
3  (R3)       1    0
4  -          0    2
5  (R5)       1    0
6  -          0    4
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   0      0     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   0      1     mul  1     (R3)       1     0    (R5)  1     0    2
Sdiv  4   0      0     div  6     -          0     3    (R4)  1     0
sub writes its result on the CDB and frees its RS; add is issued to RS 2 and gets the result from the CDB in the same cycle.

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 5: token.tag = -, token.data = -
Registers (register status):
R  Value      Vld  RS
1  -          0    3
2  (R4)-(R3)  1    0
3  (R3)       1    0
4  -          0    2
5  (R5)       1    0
6  -          0    4
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   0      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   0      1     mul  1     (R3)       1     0    (R5)  1     0    1
Sdiv  4   0      0     div  6     -          0     3    (R4)  1     0

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 6: token.tag = 2, token.data = (R4)-(R3)+(R3)
Registers (register status):
R  Value           Vld  RS
1  -               0    3
2  (R4)-(R3)       1    0
3  (R3)            1    0
4  (R4)-(R3)+(R3)  1    0
5  (R5)            1    0
6  -               0    4
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   0      1     mul  1     (R3)       1     0    (R5)  1     0    0
Sdiv  4   0      0     div  6     -          0     3    (R4)  1     0
add and mul complete in the same cycle and compete for the CDB; add gets the CDB, mul is deferred.
Please note the WAR hazard, which is automatically solved: add updates Reg4 before div starts executing; however, div has already stored the previous value in its reservation station (only works with in-order issue!).

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 7: token.tag = 3, token.data = (R3)*(R5)
Registers (register status):
R  Value           Vld  RS
1  (R3)*(R5)       1    0
2  (R4)-(R3)       1    0
3  (R3)            1    0
4  (R4)-(R3)+(R3)  1    0
5  (R5)            1    0
6  -               0    4
Reservation stations (RS status):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   1      1     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   0      0     div  6     (R3)*(R5)  1     0    (R4)  1     0

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 8: token.tag = -, token.data = -
Registers (register status):
R  Value           Vld  RS
1  (R3)*(R5)       1    0
2  (R4)-(R3)       1    0
3  (R3)            1    0
4  (R4)-(R3)+(R3)  1    0
5  (R5)            1    0
6  -               0    4
Reservation stations (RS status):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   1      1     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 9: token.tag = -, token.data = -
Registers (register status):
R  Value           Vld  RS
1  (R3)*(R5)       1    0
2  (R4)-(R3)       1    0
3  (R3)            1    0
4  (R4)-(R3)+(R3)  1    0
5  (R5)            1    0
6  -               0    4
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   1      1     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    3

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 10: token.tag = -, token.data = -
Registers (register status):
R  Value           Vld  RS
1  (R3)*(R5)       1    0
2  (R4)-(R3)       1    0
3  (R3)            1    0
4  (R4)-(R3)+(R3)  1    0
5  (R5)            1    0
6  -               0    4
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   1      1     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    2

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 11: token.tag = -, token.data = -
Registers (register status):
R  Value           Vld  RS
1  (R3)*(R5)       1    0
2  (R4)-(R3)       1    0
3  (R3)            1    0
4  (R4)-(R3)+(R3)  1    0
5  (R5)            1    0
6  -               0    4
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   1      1     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    1

Tomasulo Scheduling
Instructions: mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3
cycle 12: token.tag = 4, token.data = (R3)*(R5)/(R4)
Registers (register status):
R  Value           Vld  RS
1  (R3)*(R5)       1    0
2  (R4)-(R3)       1    0
3  (R3)            1    0
4  (R4)-(R3)+(R3)  1    0
5  (R5)            1    0
6  (R3)*(R5)/(R4)  1    0
Reservation stations (RS status; Remaining = remaining cycles in FU):
FU    RS  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Remaining
Sadd  1   1      1     sub  2     (R4)       1     0    (R3)  1     0
Sadd  2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
Smul  3   1      1     mul  1     (R3)       1     0    (R5)  1     0
Sdiv  4   1      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    0
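The walkthrough from cycle 0 to cycle 12 can be replayed with a small cycle-level simulator. The sketch below is an illustration, not the lecture's or IBM's implementation. It uses the stated latencies (mul and div 4 EX cycles, sub and add 1 EX cycle) and adds the following assumptions: the CDB goes to the completed station with the lowest tag (consistent with cycle 6, where add in RS 2 wins over mul in RS 3), a station freed in a cycle is not reused in that same cycle (consistent with cycle 4, where add goes to RS 2), FUs are never a structural bottleneck, and concrete numbers stand in for the symbolic register contents. All names (`simulate`, `Station`, `LATENCY`, ...) are hypothetical.

```python
import operator
from dataclasses import dataclass

OPS = {"mul": operator.mul, "div": operator.truediv,
       "add": operator.add, "sub": operator.sub}
LATENCY = {"mul": 4, "div": 4, "add": 1, "sub": 1}     # EX cycles
UNIT = {"mul": "Smul", "div": "Sdiv", "add": "Sadd", "sub": "Sadd"}

@dataclass
class Station:
    tag: int
    unit: str
    empty: bool = True
    in_fu: bool = False
    op: str = ""
    dest: int = 0
    src: list = None       # operand values (None while still pending)
    wait: list = None      # producing RS tag per operand, 0 = value is valid
    done_at: int = 0

def simulate(program, regs):
    stations = [Station(1, "Sadd"), Station(2, "Sadd"),
                Station(3, "Smul"), Station(4, "Sdiv")]
    status = {r: 0 for r in regs}    # register status: tag of pending writer, 0 = none
    writeback = {}                   # RS tag -> cycle of its CDB write
    pc, cycle = 0, 0
    while pc < len(program) or any(not s.empty for s in stations):
        cycle += 1
        free = [s for s in stations if s.empty]    # stations free at cycle start
        # Execute: dispatch stations whose operands were complete last cycle.
        for s in stations:
            if not s.empty and not s.in_fu and s.wait == [0, 0]:
                s.in_fu, s.done_at = True, cycle + LATENCY[s.op]
        # Write result: one token per cycle, lowest tag wins (assumed priority).
        done = [s for s in stations if s.in_fu and s.done_at <= cycle]
        if done:
            s = min(done, key=lambda st: st.tag)
            value = OPS[s.op](*s.src)
            for t in stations:                     # snooping reservation stations
                if t.empty:
                    continue
                for i in (0, 1):
                    if t.wait[i] == s.tag:
                        t.src[i], t.wait[i] = value, 0
            if status[s.dest] == s.tag:            # only the latest writer updates
                regs[s.dest], status[s.dest] = value, 0
            s.empty, s.in_fu = True, False
            writeback[s.tag] = cycle
        # Issue: in order, at most one instruction per cycle.
        if pc < len(program):
            op, dest, a, b = program[pc]
            slot = next((s for s in free if s.unit == UNIT[op]), None)
            if slot is not None:
                slot.empty, slot.in_fu, slot.op, slot.dest = False, False, op, dest
                slot.wait = [status[a], status[b]]
                slot.src = [regs[r] if status[r] == 0 else None for r in (a, b)]
                status[dest] = slot.tag
                pc += 1
    return regs, writeback, cycle

program = [("mul", 1, 3, 5), ("sub", 2, 4, 3), ("div", 6, 1, 4), ("add", 4, 2, 3)]
regs, writeback, cycles = simulate(program, {1: 0, 2: 0, 3: 2, 4: 8, 5: 3, 6: 0})
print(writeback, cycles)
```

Under these assumptions the run reproduces the write-back cycles of the walkthrough: sub (RS 1) in cycle 4, add (RS 2) in cycle 6, mul (RS 3) in cycle 7 and div (RS 4) in cycle 12.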

Comment on the Original Tomasulo Scheme
In the original Tomasulo scheme, the CDB is reserved at least two cycles in advance:
- each instruction stays at least two cycles in the EX phase
- CDB resource conflicts are solved at CDB reservation time (before execution)
In contrast, we assume CDB resource conflict resolution in the WB stage (see cycle 6 in the example).
What happens when an instruction is issued and one of its operands is on the CDB in the same cycle? This is uncertain in the original Tomasulo paper! We assume the instruction already snoops the CDB in the issue phase (see cycle 4 in the example).

Tomasulo Summary
- Prevents the registers from becoming a bottleneck (forwarding from the CDB to the reservation stations)
- Avoids WAR and WAW hazards
- Not limited to basic blocks (provided branch prediction)
Lasting contributions:
- Dynamic scheduling
- Register renaming in reservation stations
However: single-issue scheme, in-order issue scheme!
Implementation in IBM 360/91

IBM 360/91
- Belongs to the family of the IBM System/360 architecture, all members of which share the ISA.
- The IBM System/360 Model 91 was deeply pipelined (overall pipeline length was 20 stages).
- Floating-point execution unit: two separate, fully pipelined floating-point FUs, the adder and the multiplier/divider. The FUs could be used concurrently.
- Addition took two cycles, multiplication three cycles, and division eleven cycles.
- Three reservation stations (RS) were associated with the adder, and two with the multiplier/divider.
- A speculative branch prediction was used that speculated that the branch will be taken when the branch target instruction is within the last eight instructions.
- Memory had a 10-cycle access; it was fully buffered and 32-way interleaved. The processor could have up to 32 memory accesses pending to reduce latency. But no cache.
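A rough sketch of why 32-way interleaving helps with a 10-cycle access. It assumes, beyond what the lecture states, low-order interleaving (bank = address mod 32), that one request can be started per cycle, and that a bank is busy for the full access time. `stream_cycles` is a hypothetical helper, not a model of the real 360/91 storage system.

```python
def stream_cycles(addresses, banks=32, access=10):
    """Cycle in which the last access of a request stream completes."""
    busy_until = [0] * banks        # first cycle in which each bank is free again
    cycle = 0                       # start cycle of the most recent request
    for addr in addresses:
        bank = addr % banks         # low-order interleaving (assumed)
        # At most one new request per cycle, and only to a free bank.
        cycle = max(cycle + 1, busy_until[bank])
        busy_until[bank] = cycle + access
    return cycle + access

print(stream_cycles(range(32)))              # 32 consecutive addresses, 32 banks
print(stream_cycles(range(0, 32 * 32, 32)))  # 32 addresses that all hit bank 0
```

With consecutive addresses the accesses overlap and the stream finishes about one access time after the last request; when every address maps to the same bank, the accesses are serialized.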
IBM 360/91
[Figure: IBM 360/91 floating-point unit. Labels shown: Floating-Point Buffers (FLB), Floating-Point Operating Stack, Floating-Point Registers (FLR), From Instruction Unit, From Store Unit, To Store Unit, Decoder, Reservation Stations, Add Unit, Multiply/Divide Unit, Common Data Bus (CDB).]

IBM 360/91 Implementation Details
The processor had about 120 000 gates implemented in ECL technology with a 60 ns basic CPU clock.
IBM produced about 12 of the IBM System/360 Model 91 and perhaps twice that number of Model 195 (which was based on Model 91 but had a faster cycle and incorporated a cache).

Lessons Learned from CISC
- Modern processors use ideas from both the RISC and the CISC approach.
- Out-of-order execution is not a new concept: it existed twenty-five years ago on CISC machines, on the CDC 6600 as scoreboarding and on the IBM System/360 Model 91 as the Tomasulo scheme.
- Out-of-order scheduling is quite similar to dataflow and is referred to as micro dataflow by microprocessor researchers.
Next: Chapter 4: Multiple-issue (Superscalar Processors)