Digital Logic
Ch. 4 and Appendix C
Gates
The most obvious gates are AND and OR
Combined with NOT, we can implement any logic function
Conventions
Zero volts is logic 0
5 volts is logic 1
Unless we use negative logic
Most computers use smaller voltages now
1.5 volts is used by DDR3 memories
In this case 1.5 volts is logic 1
Due to electrical noise, the logic levels are defined by a range.
Other gates
The little circle means "not"
NOR gate (NOT-OR)
NAND gate
WHAT???
Truth Tables
It is the opposite of an AND gate
It is a NAND gate
Example
Try to figure out what this does
It is a one-bit adder with carry-in.
Simpler Drawing
Programmable Logic Arrays
PLA for short
The dots are really fuses inside a chip
Fuses can be programmed once
Can implement any logic function
Modern "fuses" are programmed many times
PLAs on hormones are called
Field Programmable Gate Arrays (FPGA)
PLAs
AND gate array
OR gate array
Standard Components
Decoders
Multiplexers
ROM
Decoder
Multiplexer
ROM
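The decoder and multiplexer slides are figure-only, so here is a minimal behavioral sketch of the two in Verilog (introduced later in these notes); the module and port names are my own, not from the slides.

// Hypothetical 2-to-4 decoder: exactly one output line goes high,
// selected by the binary value on the 2-bit input.
module decoder2to4(input [1:0] sel, output [3:0] out);
  assign out = 4'b0001 << sel;
endmodule

// Hypothetical 4-to-1 multiplexer: the select lines route one of
// the four data inputs to the single output.
module mux4to1(input [1:0] sel, input [3:0] in, output out);
  assign out = in[sel];
endmodule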
Boolean Algebra Laws
Identity Law: A+0=A, A*1=A
Zero & One Law: A+1=1, A*0=0
Existence of inverse: A+A' = 1, A*A' = 0
Commutative Law: A+B=B+A, A*B=B*A
Associative Law: A+(B+C)=(A+B)+C
A*(B*C)=(A*B)*C
Distributive Law: A*(B+C)=A*B+A*C
A+(B*C)=(A+B)*(A+C)
De Morgan's Law
(A+B)' = A' * B'
(A*B)' = A' + B'
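One way to convince yourself of both identities is to enumerate all four input combinations; a simulation-only Verilog sketch (module name is mine, and Verilog itself is introduced later in these notes):

module demorgan_check;
  reg A, B;
  integer i;
  initial begin
    for (i = 0; i < 4; i = i + 1) begin
      {A, B} = i;                   // low two bits of i give all four combinations
      if (~(A | B) !== (~A & ~B))   // (A+B)' = A'B'
        $display("NOR identity fails at A=%b B=%b", A, B);
      if (~(A & B) !== (~A | ~B))   // (A*B)' = A'+B'
        $display("NAND identity fails at A=%b B=%b", A, B);
    end
  end
endmodule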
Principle of Duality
AND and OR are symmetric
So are 0 and 1
Optimization
Two different logic expressions can have exactly the same behavior.
Two different expressions with identical behavior may have different implementation costs
Choosing the cheapest is optimization
May have to satisfy other criteria
Propagation delay, no glitches, etc.
Optimization
AB + AB'
=A(B+B')
=A*1=A
A'B'C + ABC
= (A'B' + AB)C
=( (A'B' + A)(A'B' + B) )C
= (B' + A)(A' + B)C
Half Adder
S = A'B + AB'
C = AB

A B | S C
0 0 | 0 0
0 1 | 1 0
1 0 | 1 0
1 1 | 0 1
Full Adder
S = A'B'C + A'BC' + AB'C' + ABC
Cout = ABC + A'BC + ABC' + AB'C
Cout
= AB + BC + CA (optimized)
A B C | S Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1
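The optimized equations drop straight into Verilog (introduced on the next slides); a minimal sketch, with module and port names of my own choosing:

module full_adder(input A, B, Cin, output S, Cout);
  assign S = A ^ B ^ Cin;                        // equivalent to the four-minterm sum above
  assign Cout = (A & B) | (B & Cin) | (Cin & A); // Cout = AB + BC + CA
endmodule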
Verilog
A hardware description language
Can be used to design, optimize and simulate hardware
Started in the mid-80s as a hardware simulation system
Hardware synthesis was added later
Its main competitor is VHDL
What can Verilog do?
Describe a circuit for simulation purposes
Many Verilog constructs are synthesizable.
Allows the designer to specify
Behavior and/or
Structure
Structure of a Verilog Module
Contains initial constructs
Parallel blocks called always constructs
Continuous assignments to specify combinational circuits (gates w/o memory)
Instances of other modules
Elements of Verilog
Wire: mathematical abstraction of a real wire
Can have 4 possible values!!
True or 1
False or 0
X: unknown (not yet defined, unconnected etc)
Z: high impedance. Electrically disconnected; a smart trick electronics engineers have invented.
Elements of Verilog
Registers (reg)
Are memory elements
Verilog compiler may map them to actual memory elements (flip-flops)
Same set of possible values
Elements of Verilog
Constants
Can be specified as plain constants like 3, 15, 20...
Often we want to specify the bit-width of a constant
4'b0011 is a 4-bit representation of 3
5'b00011 is a 5-bit representation of 3
4'b1101 is a 4-bit representation of -3 (2's compl.)
4'hF is a 4-bit representation of 15
Operators in Verilog
+, -, *, / like C
&, |, ~, ^ again like C
==, !=, <, >, <=, >= like C
<<, >> like C
cond ? expr1 : expr2 like C
Operators in Verilog
But adds to C
Unary &, |, ^: apply the operator across all bits of the operand
{A,B} the bits of A followed by the bits of B
{x{const}} is {const,const... x times}
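A small simulation-only sketch of these extra operators (module name and values are mine):

module op_demo;
  reg [3:0] A = 4'b1010;
  reg [1:0] B = 2'b01;
  initial begin
    $display("&A = %b", &A);          // reduction AND over all bits of A -> 0
    $display("|A = %b", |A);          // reduction OR -> 1
    $display("^A = %b", ^A);          // reduction XOR (parity) -> 0
    $display("{A,B} = %b", {A, B});   // concatenation -> 101001
    $display("{3{B}} = %b", {3{B}});  // replication -> 010101
  end
endmodule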
Combinational Circuits
A network of gates
Directed graph
There should be no cycles
Output determined exclusively by inputs
Implement logic functions
No memory elements
Memory Elements
We can think of memory elements as combinational circuits with feedback
We would rather think of them as little black boxes
Sometimes memory is implemented using other technologies (capacitors for DRAM)
Combinational Circuits
module half_adder(A, B, Sum, Carry);
input A, B;
output Sum, Carry;
assign Sum = A ^ B;   // sum is the XOR of the inputs
assign Carry = A & B; // carry is the AND of the inputs
endmodule
Combinational Circuits
Use the assign keyword
They represent permanent connections
The assign keyword can specify only combinational circuits
Combinational circuits can also be specified with the always construct
The always construct can specify sequential circuits as well
The always construct
module half_adder(A, B, S, C);
input A, B;
output reg S, C;
always @(A, B) begin
case ({A, B})
2'b00: begin S = 0; C = 0; end
2'b01: begin S = 1; C = 0; end
2'b10: begin S = 1; C = 0; end
2'b11: begin S = 0; C = 1; end
endcase
end
endmodule
Combinational with always
Previous example used always to implement a half-adder
Uses blocking assignments
Pretty much the same as C
If properly written, most compilers will not use flip-flops to implement it:
All input signals are on the sensitivity list
Every execution path assigns a value to the same bits
Sequential Circuits
Any circuit that contains memory
If it contains memory then it has “state”
If it has state then the state changes, so it goes through a sequence of states
Hence the name sequential.
Sequential Circuits
How come signals don't rush around the loop uncontrollably?
This is where the “clock” comes in
It is the same clock you see on the specs of your CPU
With every clock pulse the signal goes around once
These are called synchronous sequential circuits
There are also asynchronous sequential circuits
Typical Latch
Still....
Unless the width of the clock pulse is wisely selected...
The signal will travel around more than once
These latches are useful in some cases, but not good enough for our current task
Falling-edge-triggered FF
Edge-triggered D Flip-Flop
module DFF(clock, D, Q, Qb);
input clock, D;
output reg Q;
output Qb;
assign Qb = ~Q;         // Qb is always the complement of Q
always @(posedge clock) // Q is sampled only on the rising edge
Q <= D;
endmodule
Timings
Timing is complex
We use a simplified model
Setup time: time the input to the FF has to be stable before the clock edge
Hold time: time the input has to be stable after the clock edge
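Putting the two together (a simplified model, not from the slides): the clock period must be long enough for a value launched at one edge to pass through the combinational logic and be stable at the next FF for the setup time before the following edge:

T_clock >= t_clock-to-Q + t_combinational(worst case) + t_setup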
Multibit Wires and Registers
reg [31:0] regA;
regA[0] is the LSB;
wire [31:0] ALUout;
reg [31:0] regfile[0:31];
regfile[0] is the first register in the register file.
MIPS ALU
module MIPSALU (ALUctl, A, B, ALUOut, Zero);
input [3:0] ALUctl;
input [31:0] A, B;
output reg [31:0] ALUOut;
output Zero;
assign Zero = (ALUOut==0); //Zero is true if ALUOut is 0
always @(ALUctl, A, B) begin //reevaluate if these change
case (ALUctl)
0: ALUOut = A & B;
1: ALUOut = A | B;
2: ALUOut = A + B;
6: ALUOut = A - B;
7: ALUOut = A < B ? 1 : 0;
12: ALUOut = ~(A | B); // result is nor
default: ALUOut = 0;
endcase
end
endmodule
Register File
Register File: read
Register File: write
Register File: Verilog
module rfile(R1, R2, W, WD, Wctl, RD1, RD2, clock);
input [5:0] R1,R2,W;
input [31:0] WD;
input Wctl, clock;
output [31:0] RD1,RD2;
reg [31:0] RF[31:0];
assign RD1 = RF[R1]; // read ports are combinational
assign RD2 = RF[R2];
always @(posedge clock)
if (Wctl) RF[W] <= WD; // write port is synchronous
endmodule
Specifying Gates
Verilog allows the designer to specify individual gates
Can be bulky
Similar syntax can be used for user defined modules
Half Adder
module HA(A, B, S, C);
input A, B;
output S, C;
wire An, Bn, ABn, AnB;
not N1(An, A);    // An = A'
not N2(Bn, B);    // Bn = B'
and (ABn, A, Bn); // ABn = AB'
and (AnB, An, B); // AnB = A'B
or (S, ABn, AnB); // S = AB' + A'B
and (C, A, B);    // C = AB
endmodule
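As noted above, similar syntax instantiates user-defined modules; a sketch of a full adder built from two HA instances plus an OR gate (the module name FA and the wire names are mine):

module FA(A, B, Cin, S, Cout);
input A, B, Cin;
output S, Cout;
wire S1, C1, C2;
HA ha1(A, B, S1, C1);   // S1 = A xor B, C1 = AB
HA ha2(S1, Cin, S, C2); // S = S1 xor Cin, C2 = S1*Cin
or (Cout, C1, C2);      // carry out if either stage produced a carry
endmodule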
Speeding Up Addition
Carry propagation is what slows down addition
Sometimes the LSB of the input will affect the MSB of the output
We design for the worst-case scenario
The simpler adders are called ripple adders
Carry LookAhead
a0, a1, a2, ... and b0, b1, b2, ... are the inputs
c0, c1, c2, ... are the carries.
c1 = b0 c0 + a0 c0 + a0 b0
c1 = a0 b0 + c0 (a0 + b0)
c1 = g0 + c0 p0
g0 = a0 b0; p0 = a0 + b0;
Carry LookAhead
Define
gi = ai bi
pi = ai + bi
Then
ci+1 = gi + pi ci
Carry LookAhead
c1 = g0 + p0 c0
c2 = g1 + p1 g0 + p1 p0 c0
c3 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
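These equations translate almost line for line into Verilog; a minimal 4-bit sketch (module and signal names are mine, not from the slides):

module cla4(input [3:0] a, b, input c0, output [3:0] sum, output c4);
  wire [3:0] g = a & b; // generate:  gi = ai bi
  wire [3:0] p = a | b; // propagate: pi = ai + bi
  // Every carry is computed directly from g, p and c0 -- no rippling.
  wire c1 = g[0] | (p[0] & c0);
  wire c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
  wire c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0);
  assign c4 = g[3] | (p[3] & c3);        // could be expanded the same way
  assign sum = a ^ b ^ {c3, c2, c1, c0}; // si = ai xor bi xor ci
endmodule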
Control Hazards
Whenever we have a branch/jump/jal/whatever
We find out which way we branch at the MEM stage
Meanwhile we have loaded the next three instructions
We have to flush the pipeline
We waste three cycles
What is the problem?
Jumps/branches are very common: sometimes 25% of the instructions
If our processor is 4-way superscalar, wasting three cycles means we do not execute 12 instructions!
Longer pipelines suffer even more
Solutions
Delayed branch
Means that the next instruction is always executed
Ideally an instruction from before that is independent of the branch
An instruction from the fall-through that has no effect if the branch is taken
An instruction from the target that has no effect if the branch falls through
A nop if all else is unavailable
We save at most one cycle
Solutions
Decide the branch at the ID stage
Requires extra hardware
Saves two cycles
With branch delay can be stall-free
Solutions
Always predict not taken
The easiest... just do what we did so far
Fails miserably for loops
Solutions
Predict taken
We can do this at the ID stage
Waste one cycle only if prediction is correct
Combined with delayed branch the cost goes to zero (if correct)
Works fine for many loops
Solutions
Statically predict taken/not taken
Can be done with heuristics
Or by giving the compiler an execution trace
Just have two variants of every branch instruction
Easy to implement
Works great for numerical programs
Not so great for non-numerical
Solutions
Dynamic prediction
The most advanced and most popular
Requires a lot of silicon area
Can be done by hashing the address to a small memory (Branch Prediction Buffer)
Memory remembers 1 bit (taken/not taken)
Loops have two mispredictions
Can be solved with two-bit prediction (see the sketch below)
There are many far more sophisticated techniques
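As a sketch of the two-bit scheme (my own minimal Verilog for one buffer entry, not from the slides): the entry is a saturating counter, so a prediction must fail twice before it flips, which removes the double misprediction that loops cause with one bit.

module two_bit_predictor(
  input clock,
  input update,  // asserted when the real branch outcome is known
  input taken,   // the actual outcome
  output predict_taken
);
  reg [1:0] state = 2'b00; // 00/01 predict not taken, 10/11 predict taken
  assign predict_taken = state[1];
  always @(posedge clock)
    if (update) begin
      if (taken && state != 2'b11)  state <= state + 2'b01; // count up, saturate
      if (!taken && state != 2'b00) state <= state - 2'b01; // count down, saturate
    end
endmodule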
Solutions
Speculation
The technique nowadays
Good when the control hazard is compounded by a data hazard
Should allow out-of-order execution
Should provide a way to undo a change after a failed speculation
Exceptions/Interrupts
There is a difference
Exceptions are caused by an internal condition
Error, system call
Interrupts are caused by external conditions
I/O complete, mouse clicks
In many cases all are called interrupts
They are handled in more or less the same way
Why bother
A computer that does not communicate with its environment is called a brick
The extra hardware to detect and handle interrupts is large and contributes to the slowing down of the clock
That's part of the reason why some coprocessors run so much faster
How are they handled
The CPU provides relevant info in two registers
EPC (Exception Program Counter), 32 bits
Cause Register, 32 bits but many unused
Alternatively use vectored interrupts
For each possible cause there is an entry in the vector
In more detail
Another form of control hazard
Instead of branching to a user-space address, branch to a kernel-space address
Branches happen only at a particular stage in the pipeline, but exceptions can happen almost anywhere
More than one exception can happen at the same time in different instructions
We may need to restart the instruction after the exception is handled
Some exceptions are handled on the spot, others where they happened.
Instruction Level Parallelism
What drove the speed of CPUs 1995-2005
Pipelining is the oldest technique
Race to reduce hazards
Programmer is unaware of the parallelism
The key is multiple issue
We encounter hazards on hormones
Two kinds
Static multiple issue (VLIW)
Fixed-form issue packet
Was the technique used on Itanium
There are usually restrictions on what instructions can be packaged together
In some designs the compiler has to guarantee no data/structural hazards within the issue packet
Extra cost
If we allow the issue of an ALU and a memory instruction at the same time we need
Twice as many ports on the register file
An extra adder to calculate the effective address
Ability to detect/forward many more hazards between different issue packets
Stalls create twice as much delay
Advantage
With two issue we have a possibly twice as fast processor (if the world was made by angels)
We do not need much more hardware
With a good compiler the C programmer will never know
Disadvantage
We have to recompile for new architectures
We save a bit on hardware but it is hard to make use of advances immediately
Software vendors hated it
Itanium is dead.
Example: VLIW for MIPS
A simplified static multiple issue MIPS-like processor
Can issue one ALU/branch and one load/store instruction per cycle.
Ignores dependencies within the issue packet.
Stalls/forwards for dependencies between issue packets
Example: VLIW for MIPS

ALU/branch   IF ID EX M  WB
Load/Store   IF ID EX M  WB
ALU/branch      IF ID EX M  WB
Load/Store      IF ID EX M  WB
ALU/branch         IF ID EX M  WB
Load/Store         IF ID EX M  WB
ALU/branch            IF ID EX M  WB
Load/Store            IF ID EX M  WB

(one ALU/branch and one load/store issue together each cycle; each packet starts one cycle after the previous one)
Example: scheduling code
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $0, Loop
Scheduled

ALU/branch slot       Load/store slot
Loop: nop             lw   $t0, 0($s1)
addi $s1, $s1, -4     nop
addu $t0, $t0, $s2    nop
bne  $s1, $0, Loop    sw   $t0, 4($s1)
The Verdict
We can do it in 4 issue packets instead of five instructions
Before we had one or two stalls, so it would take 6-7 cycles to execute, plus the stalls due to the branch
If we optimize the single issue version we can get it down to 5 cycles plus branch stalls
Now we can execute it in 4 cycles plus branch stalls.
Observations
We now have many more stalls/nops than single issue
The new stalls/nops eat up most of the improvement
It is not worth the extra hardware/power consumption
Is it the end of the road?
Loop unrolling
Compiler optimization
Can be done easily when loops are independent
Sometimes even when they are not independent
Reduces the loop overhead
Fewer instructions executed
Allows more freedom in scheduling
Fewer stalls/nops
The code

ALU/branch slot             Load/store slot
Loop: addi $s1, $s1, -16    lw   $t0, 0($s1)
nop                         lw   $t1, 12($s1)
addu $t0, $t0, $s2          lw   $t2, 8($s1)
addu $t1, $t1, $s2          lw   $t3, 4($s1)
addu $t2, $t2, $s2          sw   $t0, 16($s1)
addu $t3, $t3, $s2          sw   $t1, 12($s1)
nop                         sw   $t2, 8($s1)
bne  $s1, $0, Loop          sw   $t3, 4($s1)
The tricks we used
Unroll the loop, eliminate the branches, simplify loop variable updating
Use more temp registers
This is called register renaming
We need to do it if we have anti-dependence or name dependence
We may run out of registers or need more saving/restoring
Longer code
May not be optimal in all architectures
Dynamic Multiple Issue
A.K.A. superscalars
The processor decides if it is going to issue 0, 1, 2... instructions
Instructions are allowed to execute out of order
But not necessarily complete out of order
The processor decides how many instructions to issue
The compiler does not need to know.
Dynamic Pipeline scheduling
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
The sub instruction can execute before addu (sub does not depend on the lw result, while addu must wait for $t0)
Dynamic Pipeline
Block diagram: the instruction fetch/decode unit (IF/ID) feeds a set of reservation stations, each with its own execution unit; all results flow into a common commit unit.
The bad news
Dynamic multiple issue CPUs have been available for decades
Some can issue more than 4 instructions per cycle
They rarely complete more than 2 per cycle on average
Have to be conservative to maintain correctness (pointer aliasing)
Power Efficiency
Power has emerged as the limiting factor
Cost of energy goes up
Huge server farms are common
Ability to eliminate heat is limited
Battery life is very important
Environmental concerns
Fallacies and Pitfalls
Pipelining is easy
Real pipelining is quite complex
Pipelining is independent of technology
The huge number of transistors offers options that annul previous techniques (huge pipelines vs delayed branches)
Some "optimizations" in the ISA spoil the speed of the pipeline.