Computer Architecture
Pipelines & Superscalars
Sunset over the Pacific Ocean
Taken from Iolanthe II about 100nm north of Cape Reanga
Pipelines
• Data Hazards
• Code:
lw $4, 0($1)
add $15, $1, $1
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
The last four instructions all depend on the result ($2) produced by the sub!
MIPS instructions have the format
op dest, srca, srcb
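The dependencies in the sequence above can be found mechanically. The following is a minimal Python sketch, assuming only the instruction forms that appear on the slide; the names parse and raw_dependencies are illustrative and not part of the lecture.

```python
import re

def parse(line):
    """Return (op, dest, sources) for one instruction string."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\d+", rest)
    if op == "sw":
        # sw src, offset(base): reads both registers, writes none
        return op, None, regs
    # op dest, srca, srcb   or   lw dest, offset(base)
    return op, regs[0], regs[1:]

def raw_dependencies(program):
    """List (producer_index, consumer_index, register) RAW pairs."""
    deps = []
    last_writer = {}  # register -> index of the most recent writer
    for i, line in enumerate(program):
        op, dest, srcs = parse(line)
        for reg in dict.fromkeys(srcs):  # drop repeats, keep order
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps

PROGRAM = [
    "lw $4, 0($1)",
    "add $15, $1, $1",
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
]

for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(f"{PROGRAM[consumer]!r} reads {reg} written by {PROGRAM[producer]!r}")
```

With indices starting at 0, every one of the last four instructions is reported as reading $2 from the sub at index 2.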
Pipelines - Data hazards
• Examine the pipeline (ignore first 2!)
• r2 ($2) is only updated in time for the add!
Pipelines - Data Hazards
• Compiler solution
  • Insert NOOPs
  • Inefficient!
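A rough Python sketch of the NOOP-insertion idea. It assumes a classic five-stage pipeline (IF, ID, EX, MEM, WB) with no forwarding, and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three slots after its producer. That distance (MIN_DISTANCE) and the helper names are assumptions made for illustration, not details taken from the slides.

```python
import re

MIN_DISTANCE = 3  # assumed: consumer's ID stage may coincide with producer's WB

def reads_writes(line):
    """Return (set of registers read, register written or None)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\d+", rest)
    if op == "sw":
        return set(regs), None
    return set(regs[1:]), regs[0]

def insert_nops(program):
    """Pad the program with nops so every RAW distance is >= MIN_DISTANCE."""
    out = []
    ready_at = {}  # register -> earliest slot at which it may be read
    for line in program:
        reads, dest = reads_writes(line)
        slot = max([len(out)] + [ready_at.get(r, 0) for r in reads])
        out.extend(["nop"] * (slot - len(out)))
        out.append(line)
        if dest is not None:
            ready_at[dest] = slot + MIN_DISTANCE
    return out

PADDED = insert_nops([
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
])
print("\n".join(PADDED))
```

Under these assumptions two nops are inserted after the sub, and those two issue slots do no useful work, which is the inefficiency the slide points out.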
Pipelines - Data Hazards
• Second compiler solution
  • Reorder
Original order:
lw $4, 0($1)
add $15, $1, $1
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Reordered:
sub $2, $1, $3
lw $4, 0($1)
add $15, $1, $1
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
These two (the lw and the first add, which the sub moves past) must not define $1 or $3!
[Figure labels: Read, Written]
Pipelines - Data Hazards
• Second compiler solution
  • Reorder
sub $2, $1, $3
lw $4, 0($1)
add $15, $1, $1
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure labels: Read, Written, First use of $2]
Pipelines - Data Hazards
• Compiler analyses dependencies
  • Register definitions
  • Register use
  • Read After Write (RAW) dependency
• No dependencies
  • Instruction can be moved!
sub $2, $1, $3
lw $4, 0($1)
add $15, $1, $1
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure labels: Written, Uses of $2]
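The "can this instruction be moved?" test can be sketched in Python as below. This is a simplified illustration that looks only at register definitions and uses; it ignores memory dependencies between loads and stores, and the function names are hypothetical.

```python
import re

def reads_writes(line):
    """Return (set of registers read, set of registers written)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\d+", rest)
    if op == "sw":
        return set(regs), set()
    return set(regs[1:]), {regs[0]}

def can_move_up(program, src, dst):
    """True if program[src] may be hoisted to position dst (dst < src)."""
    m_reads, m_writes = reads_writes(program[src])
    for line in program[dst:src]:          # every instruction it would pass
        p_reads, p_writes = reads_writes(line)
        if p_writes & m_reads:             # it needs a value defined here
            return False
        if p_reads & m_writes:             # it would clobber a value read here
            return False
        if p_writes & m_writes:            # final value of the register would change
            return False
    return True

ORIGINAL = [
    "lw $4, 0($1)",
    "add $15, $1, $1",
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
]
print(can_move_up(ORIGINAL, 2, 0))  # the sub, hoisted to the top
print(can_move_up(ORIGINAL, 3, 0))  # the and, which needs $2 from the sub
```

The sub can be hoisted because the lw and the add it passes define neither $1 nor $3 and do not touch $2; the and cannot, because it would pass the instruction that defines $2.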
Pipelines - Data Hazards
• Hardware solution
  • Value forwarding
• Hardware detects dependency
  • scoreboard
• Forwards result from WB to EX for subsequent use
• Hardware
  • Transparent to software!
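A small Python sketch of the decision a forwarding unit makes for one EX-stage operand. It follows the common textbook five-stage arrangement with two bypass sources (the EX/MEM and MEM/WB pipeline registers), which is an assumption here: the slide itself only mentions forwarding from WB to EX. Register numbers are plain integers, register 0 is treated as the MIPS hard-wired zero, and None means "the earlier instruction writes no register".

```python
def forward_select(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where an EX-stage source operand should come from."""
    if src_reg != 0 and src_reg == ex_mem_dest:
        return "EX/MEM"    # newest value, from the instruction one ahead
    if src_reg != 0 and src_reg == mem_wb_dest:
        return "MEM/WB"    # value about to be written back
    return "REGFILE"       # no dependency in flight, use the register file

# and $12, $2, $5 immediately after sub $2, $1, $3
print(forward_select(2, ex_mem_dest=2, mem_wb_dest=None))
# or $13, $6, $2 two instructions after the sub
print(forward_select(2, ex_mem_dest=12, mem_wb_dest=2))
# operand $5 has no producer in flight
print(forward_select(5, ex_mem_dest=2, mem_wb_dest=None))
```

Checking the EX/MEM source first matters: if both in-flight instructions write the same register, the more recent result is the one the consumer must see.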
Data Hazards - classification
• Read after Write (RAW)
  • Instruction 1 must write before instruction 2 reads
• Write after Write (WAW)
  • Instructions 1 and 2 both write
  • Instruction 2 must write after 1
• Write after Read (WAR)
  • Instruction 1 reads
  • Instruction 2 writes (overwrites)
  • Instruction 2 must not write before 1 reads
Reordering algorithms must consider all three!
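The three classes can be told apart from the registers each instruction reads and writes. A minimal Python sketch, in which each instruction is represented as a hypothetical (dest, sources) pair rather than parsed text:

```python
def classify(first, second):
    """Return the set of hazard classes between two instructions.

    Each instruction is a (dest, sources) pair: dest is a register name
    or None, sources is a tuple of register names. `first` comes earlier
    in program order.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    hazards = set()
    if dest1 is not None and dest1 in srcs2:
        hazards.add("RAW")   # second reads what first writes
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")   # both write the same register
    if dest2 is not None and dest2 in srcs1:
        hazards.add("WAR")   # second overwrites what first reads
    return hazards

sub = ("$2", ("$1", "$3"))       # sub $2, $1, $3
and_ = ("$12", ("$2", "$5"))     # and $12, $2, $5
print(classify(sub, and_))
# add $1, $2, $3 followed by sub $2, $4, $5
print(classify(("$1", ("$2", "$3")), ("$2", ("$4", "$5"))))
```

A pair of instructions may be swapped only when the returned set is empty, which is why a reordering pass has to check all three classes and not just RAW.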
Lecture 5 - Key Points
• Data Hazards
  • RAW - most common
  • WAW
  • WAR
• Compiler looks for dependencies
  • then re-orders
• Hardware
  • Scoreboard
    • Monitors dependencies
    • ensures correct operation
  • Value forwarding hardware
    • Forwards results from EX stage
Pipelines - Exceptions
• Caused by overflow, underflow
• Example
add $1, $2, $1
  • Overflow detected in EX stage
  • Causes jump to exception handler
    • as branch - remainder of pipeline flushed
but
• Compiler needs original $1 causing overflow
  • Register must not be overwritten
  • EX stage needs to squash WB operation
• Precise Exception problem - more later!
Superpipelines
• Time to complete each instruction = t
  • Total: Fetch + decode + fetch operands + operation + write-back
• Clock frequency: f = 1/t
• An n-stage pipeline allows n instructions ‘in flight’ simultaneously
• Each pipeline stage does 1/n of the work
  • Each stage requires time t/n
• Assumes a perfectly balanced pipeline!
  • Balanced = each stage requires the same time
Clock frequency: f_pipe = 1/(t/n) = n/t
Increasing n increases processor power?
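The arithmetic above can be tried out with a few lines of Python. This sketch assumes the ideal case the slide describes (perfectly balanced stages, no hazards, no register overhead) and uses the standard fill-then-stream count of n + N - 1 cycles for N instructions; the function names and the example numbers are illustrative only.

```python
def unpipelined_time(t, count):
    """Total time when each instruction takes t and they run one by one."""
    return count * t

def pipelined_time(t, n, count):
    """Total time on an ideal n-stage pipeline with stage time t/n."""
    stage = t / n
    # n cycles to fill the pipe, then one instruction completes per cycle
    return (n + count - 1) * stage

T_NS = 10.0      # assumed instruction time in ns
STAGES = 5
COUNT = 100
print(unpipelined_time(T_NS, COUNT))        # ns without pipelining
print(pipelined_time(T_NS, STAGES, COUNT))  # ns with a 5-stage pipeline
```

For long instruction streams the ratio of the two times approaches n, which is the n/t clock frequency result; the slides that follow explain why real pipelines fall short of it.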
Pipelines - Depth
• Pipeline can’t be too deep
  • Hazards are frequent
  • many stalls in deep pipelines
[Figure: Relative Performance (0.5 to 2.5) against Pipeline Depth (1, 2, 4, 8, 16), with a region labelled "Too Deep!"]
[Figure repeated with an added label: "Superpipelined"]
Pipeline depth
• Increasing number of stages
  • Each stage adds overheads
• Problems balancing pipeline
  • Require t_pd1 ≈ t_pd2 ≈ t_pd3
• Stage time is t_pdj + t_pdreg
• n stages means n t_pdreg overhead
[Figure: three pipeline stages, each a Register followed by an Operation (work) block, annotated with delays t_pdreg and t_pd1, t_pd2, t_pd3]
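The effect of the per-stage register delay can be illustrated with a short Python sketch. It assumes the work t divides evenly over n stages, that every stage pays the same register delay t_reg, and that the baseline is an unpipelined machine taking time t per instruction; all names and numbers are illustrative.

```python
def stage_time(t, n, t_reg):
    """Clock period of a balanced n-stage pipeline with register overhead."""
    return t / n + t_reg

def speedup(t, n, t_reg):
    """Throughput relative to an unpipelined machine taking t per instruction."""
    return t / stage_time(t, n, t_reg)

T_NS = 10.0     # assumed total work per instruction, ns
T_REG_NS = 0.5  # assumed register delay per stage, ns

for n in (1, 5, 20, 1000):
    print(n, round(speedup(T_NS, n, T_REG_NS), 2))
```

With these numbers the speedup can never exceed t / t_reg = 20 however many stages are added, so the register overhead alone puts a ceiling on the benefit of deeper pipelines.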
CISC and pipelines
• High Speed CISC processors are pipelined
  • Overlap IF, EX
• Variable
  • instruction length
  • running time (number of microcode cycles)
  • pipeline imbalance
  • “backup” in pipe stages
  • complicate hazard detection
• Complex addressing modes
  • auto-increment updates address register
  • multiple memory accesses required
smooth pipeline flow more difficult!
Instruction Queues
• Vital performance determinant
  • Rate of instruction fetch
• High Performance processors
  • Fetch multiple instructions in each cycle
    • 2 - 4 common
  • Use wide datapath to memory
    • PowerPC 604: 128 bits = 4 instructions
• Despatch unit
  • Examine dependencies
  • Determine which instructions can be despatched
Instruction Queues
• Q “matches” fetch/despatch rates
• General Strategy for matching Producers - Consumers
  • Use of FIFO-style Queues
  • Absorb Asynchronous Delivery / Consumption Rates
• Provides Elasticity in pipelines
[Figure: Producer, FIFO, Consumer, with Differing Instantaneous Rates]
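A toy Python simulation of the elasticity a FIFO provides between a bursty fetch unit and a steady despatch unit. The arrival pattern, the despatch width, and the unbounded queue are all simplifying assumptions chosen for illustration.

```python
from collections import deque

def run_with_queue(arrivals, dispatch_width):
    """Return the number of instructions despatched in each cycle."""
    queue = deque()
    dispatched = []
    next_id = 0
    for fetched in arrivals:
        for _ in range(fetched):           # producer: bursty delivery
            queue.append(next_id)
            next_id += 1
        take = min(dispatch_width, len(queue))
        issued = [queue.popleft() for _ in range(take)]  # consumer: FIFO order
        dispatched.append(len(issued))
    return dispatched

def run_without_queue(arrivals, dispatch_width):
    """Direct coupling: whatever cannot be taken this cycle is not buffered."""
    return [min(fetched, dispatch_width) for fetched in arrivals]

BURSTY = [4, 0, 4, 0]  # four instructions every other cycle
print(run_with_queue(BURSTY, 2))
print(run_without_queue(BURSTY, 2))
```

Both sides have the same average rate (two instructions per cycle), but only the queued version keeps the consumer busy every cycle; without buffering it idles whenever the producer delivers nothing.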
Superscalar Processors
PowerPC organisation
PowerPC 601 ~1993
[Figure label: Boundary of the Si die]
New - Look in the “Example Processors” section of the Web notes
3-way SuperScalar
• Integer
• Branch
• Floating Point
A newer machine will have more functional units here!
Superscalar Processors
• Multiple Functional Units
  • PowerPC 604: 6-way superscalar
• Despatch Unit
  • Sends “ready” instructions to all free units
  • PowerPC 604:
    • potential 4 instructions/cycle (pipeline lengths are different!)
    • reality: 2-3 instructions/cycle? (program dependent!)
[Figure labels: Branch Unit, Load/Store Unit, 3 Integer Units, Floating Point Unit]
Superscalar Processors
• Mix of functional units
  • Up to 8-way superscalar common now
    • 2 Floating point units
      • Usually have ~3 cycle latency
    • 3 Integer Arithmetic
    • Branch unit
    • Load / store unit
    • + ….?
• Marketing departments can play some games with the ‘n’ of an n-way superscalar!
Pentium Quad Core - 2008
• Distinguish between
  • Multiple ‘cores’ (separate processors) – later –
  and
  • Superscalars – multiple functional units per processor
    ☺ “Wide dynamic execution” in Intel-speak
• Quad core
  • 4 cores
  • Complete up to 4 instructions / cycle each
  • IIU can issue four instructions / cycle
  • 3 MB L2 cache / processor (total 12 MB)
  • Master clock 3.2 GHz, front side bus 1.6 GHz
  • 771 pins
Superscalar Limitations
• To achieve maximum performance
  • Instruction mix must match Functional Unit mix
    • eg if we have 2 Integer ALUs, 2 FPUs, 1 branch unit, 1 load/store unit
    • Instruction issue unit (IIU) can issue 4 instructions
    • Each four instructions should be able to use 4 of the functional units
  • If instruction stream doesn’t have right mix
    • Some functional units will remain idle
• FPUs require multiple cycles
  • Additional stalls
• Pipeline hazards stall pipeline
• 4-way superscalar gets 1.8-3 instructions completed per cycle
  • Program dependent!
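The instruction-mix point can be illustrated with a very simple in-order issue model in Python. It uses the functional unit mix from the example above and a 4-wide issue unit, and it deliberately ignores data dependencies and multi-cycle latencies, so it shows only the effect of the mix. The unit names and the function are assumptions made for this sketch.

```python
UNITS = {"int": 2, "fp": 2, "branch": 1, "mem": 1}  # mix from the slide's example
ISSUE_WIDTH = 4

def cycles_to_issue(stream, units, issue_width):
    """Cycles needed to issue `stream` in order, one unit per instruction."""
    cycles = 0
    i = 0
    while i < len(stream):
        free = dict(units)   # every unit is free again each cycle
        issued = 0
        while (i < len(stream) and issued < issue_width
               and free.get(stream[i], 0) > 0):
            free[stream[i]] -= 1
            issued += 1
            i += 1
        if issued == 0:
            raise ValueError(f"no functional unit for {stream[i]!r}")
        cycles += 1
    return cycles

balanced = ["int", "fp", "mem", "branch"] * 2
integer_only = ["int"] * 8
for name, stream in (("balanced", balanced), ("integer only", integer_only)):
    cycles = cycles_to_issue(stream, UNITS, ISSUE_WIDTH)
    print(name, cycles, "cycles,", len(stream) / cycles, "instructions/cycle")
```

The balanced stream sustains the full issue width, while the integer-only stream is limited to the two integer ALUs and leaves the other units idle. Hazards and FPU latency would reduce both figures further.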