![Page 1: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/1.jpg)
CSC 2405 Computer Systems II
Advanced Topics
![Page 2: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/2.jpg)
Instruction Set Architecture
![Page 3: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/3.jpg)
Chapter 4
Instruction Set Architecture
Assembly Language View
– Processor state: registers, memory, …
– Instructions: addl, movl, leal, …
  How instructions are encoded as bytes
Layer of Abstraction
– Above: how to program machine
  Processor executes instructions in a sequence
– Below: what needs to be built
  Use variety of tricks to make it run fast
  E.g., execute multiple instructions simultaneously
[Figure: abstraction layers, top to bottom: Application Program; Compiler, OS; ISA; CPU Design; Circuit Design; Chip Layout]
![Page 4: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/4.jpg)
Instruction Set Architectures
Basic ISA Classes

| Stack | Accumulator | Register (register-memory) | Register (load-store) |
|---|---|---|---|
| Push A | Load A | Load R1, A | Load R1, A |
| Push B | Add B | Add R1, B | Load R2, B |
| Add | Store C | Store C, R1 | Add R3, R1, R2 |
| Pop C | | | Store C, R3 |

The differences among these ISA classes are easiest to see in the examples here, all of which implement C = A + B.
Register architectures are the class that won out. More registers on the CPU generally help, since more operands can stay out of memory.
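
As an illustration of the table above, here is a minimal Python sketch of two of the four classes (stack and load-store) running their C = A + B sequences. The tuple-based instruction format and the function names `run_stack` and `run_load_store` are assumptions made for this sketch, not part of any real ISA.

```python
def run_stack(program, mem):
    """Stack machine: operands are implicit (top of stack)."""
    stack = []
    for op, *args in program:
        if op == "push":            # push mem[addr] onto the stack
            stack.append(mem[args[0]])
        elif op == "add":           # pop two values, push their sum
            stack.append(stack.pop() + stack.pop())
        elif op == "pop":           # pop the top of stack into mem[addr]
            mem[args[0]] = stack.pop()
    return mem


def run_load_store(program, mem):
    """Load-store machine: only load/store touch memory; add uses registers."""
    regs = {}
    for op, *args in program:
        if op == "load":            # load Rd, addr
            regs[args[0]] = mem[args[1]]
        elif op == "add":           # add Rd, Rs1, Rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":         # store addr, Rs
            mem[args[0]] = regs[args[1]]
    return mem


stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
ls_prog = [("load", "R1", "A"), ("load", "R2", "B"),
           ("add", "R3", "R1", "R2"), ("store", "C", "R3")]

print(run_stack(stack_prog, {"A": 2, "B": 3})["C"])
print(run_load_store(ls_prog, {"A": 2, "B": 3})["C"])
```

Both programs leave the same value in C; the difference is where the operands live while the addition happens.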
![Page 5: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/5.jpg)
80x86 Instruction Frequency
![Page 6: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/6.jpg)
Relative Frequency of Control Instructions

| Operation | Integer | Floating Pt |
|---|---|---|
| Call/Return | 19% | 8% |
| Jumps | 6% | 10% |
| Branches | 75% | 82% |

Design hardware to handle branches quickly, since these occur most frequently
![Page 7: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/7.jpg)
CISC Instruction Sets
– Complex Instruction Set Computer
– Dominant style through mid-80’s
Stack-oriented instruction set
– Use stack to pass arguments, save program counter
– Explicit push and pop instructions
Arithmetic instructions can access memory
– addl %eax, 12(%ebx,%ecx,4)
  Requires memory read and write
  Complex address calculation
Condition codes
– Set as side effect of arithmetic and logical instructions
Philosophy
– Add instructions to perform “typical” programming tasks
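
The "complex address calculation" in `12(%ebx,%ecx,4)` is displacement + base + index × scale. A small sketch of that arithmetic follows; the function name `effective_address` and the register values are hypothetical, chosen only for illustration.

```python
def effective_address(disp, base, index, scale):
    """Address for the x86 operand form disp(base, index, scale)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    # Wrap to 32 bits, as on a 32-bit machine.
    return (disp + base + index * scale) & 0xFFFFFFFF


# 12(%ebx,%ecx,4) with assumed values %ebx = 0x1000 and %ecx = 3
addr = effective_address(12, 0x1000, 3, 4)
print(hex(addr))
```

The `addl` on the slide then reads the word at this address, adds `%eax`, and writes the sum back to the same address, which is why it needs both a memory read and a memory write.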
![Page 8: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/8.jpg)
RISC Instruction Sets
– Reduced Instruction Set Computer
– Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)
Fewer, simpler instructions
– Might take more to get given task done
– Can execute them with small and fast hardware
Register-oriented instruction set
– Many more (typically 32) registers
– Use for arguments, return pointer, temporaries
Only load and store instructions can access memory
– Similar to Y86 mrmovl and rmmovl
No condition codes
– Test instructions return 0/1 in register
![Page 9: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/9.jpg)
Example RISC Instruction Formats
Register-Register (R-type): ADD R1, R2, R3
– Fields: Op [31:26], rs1 [25:21], rs2 [20:16], rd [15:11], func in the low-order bits (the figure marks boundaries at bits 11/10 and 6/5)
– ALU reg. operations, read/write special registers and moves
Register-Immediate (I-type): SUB R1, R2, #3
– Fields: Op [31:26], rs1 [25:21], rd [20:16], immediate [15:0]
– ALU imm. operations, loads and stores, conditional branch, jump (and link)
Jump / Call (J-type): JUMP end
– Fields: Op [31:26], offset added to PC [25:0]
– Jump, jump and link, trap and return from exception
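
To make the fixed-width layout concrete, the sketch below packs and unpacks an I-type word, assuming the field positions Op [31:26], rs1 [25:21], rd [20:16], immediate [15:0]. The opcode value `0x0A` is made up for the example; the slides do not give opcode assignments.

```python
def encode_i_type(op, rs1, rd, imm):
    """Pack the four I-type fields into one 32-bit word."""
    return ((op & 0x3F) << 26) | ((rs1 & 0x1F) << 21) | ((rd & 0x1F) << 16) | (imm & 0xFFFF)


def decode_i_type(word):
    """Recover (op, rs1, rd, imm) with shifts and masks."""
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0xFFFF


# SUB R1, R2, #3 with a hypothetical opcode 0x0A: rd = 1, rs1 = 2, imm = 3
word = encode_i_type(0x0A, rs1=2, rd=1, imm=3)
print(hex(word), decode_i_type(word))
```

Because every field sits at a fixed bit position, decoding needs no knowledge of the opcode, which is one of the properties that makes such formats easy to pipeline.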
![Page 10: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/10.jpg)
CISC vs. RISC
Original Debate
– Strong opinions!
– CISC proponents: easy for compiler, fewer code bytes
– RISC proponents: better for optimizing compilers, can make run fast with simple chip design
Current Status
– For desktop processors, choice of ISA not a technical issue
  With enough hardware, can make anything run fast
  Code compatibility more important
– For embedded processors, RISC makes sense
  Smaller, cheaper, less power
![Page 11: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/11.jpg)
Logic Design
![Page 12: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/12.jpg)
Overview of Logic Design
Fundamental Hardware Requirements
– Communication
  How to get values from one place to another
– Computation
– Storage
Bits are Our Friends
– Everything expressed in terms of values 0 and 1
– Communication
  Low or high voltage on wire
– Computation
  Compute Boolean functions
– Storage
  Store bits of information
![Page 13: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/13.jpg)
Digital Signals
– Use voltage thresholds to extract discrete values from continuous signal
– Simplest version: 1-bit signal
  Either high range (1) or low range (0)
  With guard range between them
– Not strongly affected by noise or low quality circuit elements
  Can make circuits simple, small, and fast
[Figure: voltage vs. time waveform, read as the bit sequence 0, 1, 0]
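
A rough sketch of the thresholding idea: voltages in the low range read as 0, voltages in the high range read as 1, and anything in the guard range is treated as undefined. The threshold values 0.8 V and 2.0 V are assumptions for illustration only; the slides give no numbers.

```python
def to_bit(voltage, low_max=0.8, high_min=2.0):
    """Map a voltage to 0, 1, or None (guard range). Thresholds are assumed."""
    if voltage <= low_max:
        return 0
    if voltage >= high_min:
        return 1
    return None  # inside the guard range: not a valid logic level


samples = [0.2, 1.4, 3.1]
print([to_bit(v) for v in samples])
```

A small amount of noise added to 0.2 V or 3.1 V still lands in the same range, which is the sense in which digital signals are not strongly affected by noise.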
![Page 14: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/14.jpg)
Computing with Logic Gates
– Outputs are Boolean functions of inputs
– Respond continuously to changes in inputs
  With some, small delay
And: out = a && b
Or: out = a || b
Not: out = !a
[Figure: voltage vs. time for inputs a, b and output a && b, showing the rising delay and falling delay]
![Page 15: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/15.jpg)
Combinational Circuits
Acyclic Network of Logic Gates
– Continuously responds to changes on primary inputs
– Primary outputs become (after some delay) Boolean functions of primary inputs
[Figure: acyclic network connecting primary inputs to primary outputs]
![Page 16: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/16.jpg)
Bit Equality
– Generate 1 if a and b are equal
Hardware Control Language (HCL)
– Very simple hardware description language
  Boolean operations have syntax similar to C logical operations
– We’ll use it to describe control logic for processors
[Figure: "Bit equal" circuit with inputs a, b and output eq]
HCL Expression
bool eq = (a&&b)||(!a&&!b)
![Page 17: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/17.jpg)
Word Equality
– 32-bit word size
– HCL representation
  Equality operation
  Generates Boolean value
[Figure: 32 "Bit equal" blocks compare a31/b31 ... a0/b0, producing eq31 ... eq0, which combine into Eq]
Word-Level Representation
[Figure: "==" block with inputs A, B and output Eq]
HCL Representation
bool Eq = (A == B)
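
The bit-level HCL expression and the word-level comparison can be checked against each other in a few lines of Python. This is only a sketch: `bit_eq` mirrors `(a&&b)||(!a&&!b)`, and `word_eq` (a name introduced here) combines 32 bit-level results the way the figure suggests.

```python
from itertools import product


def bit_eq(a, b):
    """Mirror of the HCL expression: (a && b) || (!a && !b)."""
    return (a and b) or (not a and not b)


def word_eq(A, B, width=32):
    """Word equality built from one bit_eq per bit position."""
    return all(bit_eq(bool((A >> i) & 1), bool((B >> i) & 1)) for i in range(width))


# Exhaustive truth table for the 1-bit case
for a, b in product([False, True], repeat=2):
    print(a, b, bit_eq(a, b))

print(word_eq(0xDEADBEEF, 0xDEADBEEF), word_eq(0xDEADBEEF, 0xDEADBEEE))
```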
![Page 18: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/18.jpg)
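1-Bit Latch
D Latch
[Figure: D latch built around an R-S latch; inputs D (data) and C (clock), outputs Q+ and Q–]
Latching (C = 1)
– Outputs follow the data input: Q+ = d, Q– = !d
Storing (C = 0)
– Outputs hold the stored value regardless of d: Q+ = q, Q– = !q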
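
The two modes of the D latch can be sketched behaviorally as follows. This models only the input/output behavior (transparent while the clock is high, holding while it is low), not the gate-level R-S structure; the class name `DLatch` is an assumption of this sketch.

```python
class DLatch:
    """Behavioral model of a level-sensitive D latch."""

    def __init__(self, q=0):
        self.q = q  # stored bit

    def update(self, d, c):
        """Apply data d and clock c; return (Q+, Q-)."""
        if c == 1:
            self.q = d  # latching: output follows the data input
        # c == 0: storing, the old value is kept regardless of d
        return self.q, 1 - self.q


latch = DLatch()
print(latch.update(d=1, c=1))  # latching
print(latch.update(d=0, c=0))  # storing
print(latch.update(d=0, c=1))  # latching again
```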
![Page 19: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/19.jpg)
Registers
– Stores word of data
  Different from program registers seen in assembly code
– Collection of edge-triggered latches
– Loads input on rising edge of clock
[Figure: structure of an 8-bit register; eight latches with data inputs i7 ... i0, outputs o7 ... o0, and a shared clock; symbol with input I, output O, and Clock]
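
In contrast to a level-sensitive latch, a register changes its output only on the rising edge of the clock. A minimal behavioral sketch, with the class name `Register` and the method `tick` chosen for this example:

```python
class Register:
    """Behavioral model of an edge-triggered register."""

    def __init__(self, value=0):
        self.out = value
        self._prev_clock = 0

    def tick(self, data_in, clock):
        """Present data_in and the current clock level; return the output."""
        if self._prev_clock == 0 and clock == 1:
            self.out = data_in  # load only on the 0 -> 1 transition
        self._prev_clock = clock
        return self.out


r = Register()
trace = [r.tick(0x5A, 0), r.tick(0x5A, 1), r.tick(0x33, 1),
         r.tick(0x33, 0), r.tick(0x33, 1)]
print([hex(v) for v in trace])
```

Note that changing the input while the clock stays high has no effect; the new value is picked up only at the next rising edge.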
![Page 20: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/20.jpg)
Random-Access Memory
– Stores multiple words of memory
  Address input specifies which word to read or write
– Register file
  Holds values of program registers %eax, %esp, etc.
  Register identifier serves as address
  ID 8 implies no read or write performed
– Multiple Ports
  Can read and/or write multiple words in one cycle
  Each has separate address and data input/output
[Figure: register file with read ports A (srcA, valA) and B (srcB, valB), write port W (dstW, valW), and Clock]
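
A small sketch of a register file with two read ports and one write port, using the port names from the figure. Returning 0 when a read port is given ID 8 is an assumption of this sketch; the slide only says that no read is performed.

```python
RNONE = 8  # register ID meaning "no register", as on the slide


class RegisterFile:
    """Two read ports (A, B) and one write port (W)."""

    def __init__(self, num_regs=8):
        self.regs = [0] * num_regs

    def read(self, srcA, srcB):
        """Both read ports are serviced together, as in one cycle."""
        valA = self.regs[srcA] if srcA != RNONE else 0  # assumed default
        valB = self.regs[srcB] if srcB != RNONE else 0
        return valA, valB

    def write(self, dstW, valW):
        """Write port; ID 8 means nothing is written."""
        if dstW != RNONE:
            self.regs[dstW] = valW


rf = RegisterFile()
rf.write(0, 42)
rf.write(RNONE, 99)  # ignored
print(rf.read(0, RNONE))
```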
![Page 21: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/21.jpg)
Basic Logic Gates
NOTE: okay to use just a circle for NOT:
![Page 22: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/22.jpg)
More than 2 Inputs?
AND/OR can take any number of inputs.
– AND = 1 if all inputs are 1.
– OR = 1 if any input is 1.
– Similar for NAND/NOR.
Can implement with multiple two-input gates
![Page 23: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/23.jpg)
Logical Completeness
Can implement ANY truth table with AND, OR, NOT.

| A | B | C | D |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 |

1. AND combinations that yield a "1" in the truth table.
2. OR the results of the AND gates.
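
The two-step recipe can be sketched in Python: collect the rows whose output is 1, build one AND term per row (inverting the inputs that are 0 in that row), and OR the terms together. The helper name `sum_of_products` is introduced here for illustration.

```python
TRUTH = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 0,
}


def sum_of_products(table):
    """Build a function from a truth table using only AND, OR, NOT."""
    # Step 1: one AND term per row that yields 1.
    terms = [inputs for inputs, out in table.items() if out == 1]

    def circuit(*bits):
        # Each AND term uses b where the row has 1 and NOT b where it has 0.
        # Step 2: OR the results of the AND terms.
        return int(any(all((b if want else not b) for b, want in zip(bits, term))
                       for term in terms))

    return circuit


D = sum_of_products(TRUTH)
print([D(*inputs) for inputs in TRUTH])
```

For the table above this yields D = (!A && B && !C) || (A && !B && C).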
![Page 24: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/24.jpg)
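DeMorgan's Law
Converting AND to OR (with some help from NOT)
Consider the following gate: [Figure: gate diagram]

| A | B | NOT A | NOT B | (NOT A) AND (NOT B) | NOT((NOT A) AND (NOT B)) |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 |

To convert AND to OR (or vice versa), invert inputs and output.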
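
The rule can be checked exhaustively over all four input combinations. The function names below are chosen for this sketch.

```python
from itertools import product


def or_via_and(a, b):
    """OR built from AND: invert the inputs and the output."""
    return not ((not a) and (not b))


def and_via_or(a, b):
    """AND built from OR: invert the inputs and the output."""
    return not ((not a) or (not b))


checks = [
    (or_via_and(a, b) == (a or b)) and (and_via_or(a, b) == (a and b))
    for a, b in product([False, True], repeat=2)
]
print(all(checks))
```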
![Page 25: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/25.jpg)
Decoder
n inputs, 2^n outputs
– exactly one output is 1 for each possible input pattern
[Figure: 2-bit decoder]
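
A decoder can be sketched as a function from n input bits to a one-hot list of 2^n outputs. Treating the first input bit as the most significant is an assumption of this sketch.

```python
def decode(bits):
    """n input bits (most significant first) -> one-hot list of 2**n outputs."""
    n = len(bits)
    index = 0
    for b in bits:
        index = (index << 1) | b  # interpret the inputs as a binary number
    return [1 if i == index else 0 for i in range(2 ** n)]


for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pattern, decode(pattern))
```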
![Page 26: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/26.jpg)
Sequential Processors
![Page 27: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/27.jpg)
Sequential HW Structure
State
– Program counter register (PC)
– Condition code register (CC)
– Register File
– Memories
  Access same memory space
  Data: for reading/writing program data
  Instruction: for reading instructions
Instruction Flow
– Read instruction at address specified by PC
– Process through stages
– Update program counter
[Figure: sequential datapath with stages Fetch, Decode, Execute, Memory, Write back, and PC update. Blocks: instruction memory, PC increment, register file (ports A, B, M, E), ALU, CC, data memory. Signals: icode, ifun, rA, rB, valC, valP, srcA, srcB, dstA, dstB, valA, valB, aluA, aluB, Bch, valE, Addr, Data, valM, newPC]
![Page 28: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/28.jpg)
Sequential Stages
Fetch
– Read instruction from instruction memory
Decode
– Read program registers
Execute
– Compute value or address
Memory
– Read or write data
Write Back
– Write program registers
PC
– Update program counter
![Page 29: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/29.jpg)
Instruction Decoding
Instruction Format
– Instruction byte icode:ifun
– Optional register byte rA:rB
– Optional constant word valC
[Figure: example encoding "5 0 rA rB D" split into icode:ifun, rA:rB (optional), and valC (optional)]
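
The sketch below splits an instruction that carries both optional fields, like the "5 0 rA rB D" example, into its parts. It assumes the high nibble of each byte holds the first field (icode, rA) and that the 4-byte constant is stored little-endian; the register numbers and displacement in the sample bytes are made up.

```python
import struct


def decode_instruction(code):
    """Split a 6-byte instruction into (icode, ifun, rA, rB, valC)."""
    icode, ifun = code[0] >> 4, code[0] & 0xF   # instruction byte icode:ifun
    rA, rB = code[1] >> 4, code[1] & 0xF        # register byte rA:rB
    (valC,) = struct.unpack_from("<i", code, 2)  # 4-byte constant word
    return icode, ifun, rA, rB, valC


# icode 5, ifun 0, rA = 2, rB = 3, D = 12 (hypothetical operand values)
sample = bytes([0x50, 0x23, 0x0C, 0x00, 0x00, 0x00])
print(decode_instruction(sample))
```

A full decoder would first look at icode to decide whether the register byte and constant word are present at all; this sketch skips that step.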
![Page 30: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/30.jpg)
Sequential Summary
Implementation
– Express every instruction as series of simple steps
– Follow same general flow for each instruction type
– Assemble registers, memories, predesigned combinational blocks
– Connect with control logic
Limitations
– Too slow to be practical
– In one cycle, must propagate through instruction memory, register file, ALU, and data memory
– Would need to run clock very slowly
– Hardware units only active for fraction of clock cycle
![Page 31: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/31.jpg)
Pipelined Processors
![Page 32: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/32.jpg)
What is Pipelining
Computers execute billions of instructions, so instruction throughput is what matters
IDEA: Divide instruction execution up into several pipeline stages. For example: IF ID EX MEM WB
Simultaneously have different instructions in different pipeline stages
The length of the longest pipeline stage determines the cycle time
Desirable pipeline features (e.g., RISC):
– all instructions same length
– registers located in same place in instruction format
– memory operands only in loads or stores
![Page 33: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/33.jpg)
What Is Pipelining
Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
![Page 34: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/34.jpg)
What Is Pipelining
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: task order vs. time, 6 PM to midnight; loads A, B, C, D run one after another, each taking 30 + 40 + 20 minutes]
![Page 35: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/35.jpg)
What Is Pipelining
Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
[Figure: task order vs. time starting at 6 PM; loads A, B, C, D overlap, with successive intervals of 30, 40, 40, 40, 40, 20 minutes]
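
The 6-hour and 3.5-hour figures can be reproduced with a short calculation. The sketch assumes each stage handles one load at a time and that a load may wait between stages until the next stage is free; the function names are chosen for this example.

```python
def sequential_time(stages, n_tasks):
    """Each task runs all stages before the next task starts."""
    return n_tasks * sum(stages)


def pipelined_time(stages, n_tasks):
    """Each stage starts a task as soon as both the task and the stage are ready."""
    stage_free_at = [0] * len(stages)  # when each stage finishes its current task
    for _ in range(n_tasks):
        t = 0  # time at which this task is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(t, stage_free_at[s])
            stage_free_at[s] = start + duration
            t = stage_free_at[s]
    return stage_free_at[-1]


laundry = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_time(laundry, 4) / 60, "hours sequential")
print(pipelined_time(laundry, 4) / 60, "hours pipelined")
```

The pipelined total is 30 + 4 × 40 + 20 minutes: once the pipeline is full, the 40-minute dryer sets the pace.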
![Page 36: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/36.jpg)
Pipelining Lessons
Pipelining doesn’t help latency of single task; it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline on a 6 PM to 9 PM axis; loads A, B, C, D with successive intervals of 30, 40, 40, 40, 40, 20 minutes]
![Page 37: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/37.jpg)
Real-World Pipelines: Car Washes
Idea
– Divide process into independent stages
– Move objects through stages in sequence
– At any given time, multiple objects being processed
[Figure: car washes arranged as Sequential, Parallel, and Pipelined]
![Page 38: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/38.jpg)
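Pipeline Diagrams
Unpipelined
– Cannot start new operation until previous one completes
[Figure: OP1, OP2, OP3 occupy consecutive, non-overlapping time intervals]
3-Way Pipelined
– Up to 3 operations in process simultaneously
[Figure: OP1, OP2, OP3 each pass through stages A, B, C, each operation starting one stage after the previous one]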
![Page 39: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/39.jpg)
Data Dependencies
System
– Each operation depends on result from preceding one
[Figure: combinational logic and a clocked register (Reg, Clock); OP1, OP2, OP3 execute one after another over time]
![Page 40: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/40.jpg)
Data Hazards
– Result does not feed back around in time for next operation
– Pipelining has changed behavior of system
[Figure: three-stage pipeline (Comb. logic A, B, C, each followed by a clocked register); OP1 through OP4 each pass through stages A, B, C in overlapped fashion]
![Page 41: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/41.jpg)
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7; instruction order Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through Ifetch, Reg, ALU, DMem, Reg]
![Page 42: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/42.jpg)
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7; instruction order Load, Instr 1, Instr 2, Stall, Instr 3; the stall appears as a bubble in each of the five stages]
How do you “bubble” the pipe?
![Page 43: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/43.jpg)
Data Hazard on R1
[Figure: pipeline diagram over clock cycles with stages IF, ID/RF, EX, MEM, WB; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
Instruction order:
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
![Page 44: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/44.jpg)
Three Generic Data Hazards
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
![Page 45: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/45.jpg)
Three Generic Data Hazards
Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
![Page 46: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/46.jpg)
Three Generic Data Hazards
Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
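
The three cases can be told apart by comparing destination and source registers of an earlier instruction I and a later instruction J. The sketch below does only that register-name comparison; whether a given dependence actually causes a hazard depends on the pipeline. The tuple format (dest, src1, src2) and the function name are assumptions of this example.

```python
def classify_dependences(instr_i, instr_j):
    """Compare I (earlier) and J (later), each given as (dest, src1, src2)."""
    dest_i, *srcs_i = instr_i
    dest_j, *srcs_j = instr_j
    found = []
    if dest_i in srcs_j:
        found.append("RAW")  # J reads what I writes
    if dest_j in srcs_i:
        found.append("WAR")  # J writes what I reads
    if dest_i == dest_j:
        found.append("WAW")  # J writes what I writes
    return found


# The I/J pairs from the three slides
print(classify_dependences(("r1", "r2", "r3"), ("r4", "r1", "r3")))  # add; sub
print(classify_dependences(("r4", "r1", "r3"), ("r1", "r2", "r3")))  # sub; add
print(classify_dependences(("r1", "r4", "r3"), ("r1", "r2", "r3")))  # sub; add
```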
![Page 47: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/47.jpg)
Data Forwarding
Naïve Pipeline
– Register isn’t written until completion of write-back stage
– Source operands read from register file in decode stage
  Needs to be in register file at start of stage
Observation
– Value generated in execute or memory stage
Trick
– Pass value directly from generating instruction to decode stage
– Needs to be available at end of decode stage
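
One way to picture the trick is as a selection step in decode: before falling back to the register file, check whether an older instruction still in the pipeline is about to write the source register, and if so take its value directly. This is a simplified sketch, not the actual control logic; the function name and the `in_flight` list (ordered youngest first) are assumptions.

```python
def read_operand(src, regfile, in_flight):
    """Return the value for register src as seen by the decode stage.

    in_flight: (dest, value) pairs for older instructions that have
    produced a result but not yet written it back, youngest first.
    """
    for dest, value in in_flight:
        if dest == src:
            return value      # forwarded from a later pipeline stage
    return regfile[src]       # no pending write: use the register file


regfile = {"r1": 0, "r2": 5, "r3": 3}
# add r1,r2,r3 has computed 8 in execute but has not reached write-back
in_flight = [("r1", 8)]
print(read_operand("r1", regfile, in_flight))  # value for sub r4,r1,r3
print(read_operand("r3", regfile, in_flight))
```

Searching youngest first matters when two in-flight instructions write the same register: the decode stage must see the most recent value.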
![Page 48: CSC 2405 Computer Systems II](https://reader036.vdocuments.us/reader036/viewer/2022062519/56814de7550346895dbb57e5/html5/thumbnails/48.jpg)
Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over clock cycles; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
Instruction order:
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11