Digital Logic
Ch. 4 and Appendix C
Gates
The most obvious gates are AND and OR
Combined with NOT, we can implement any logic function
Conventions
Zero volts is logic 0
5 volts is logic 1
Unless we use negative logic
Most computers use smaller voltages now
1.5 volts is used by DDR3 memories
In this case 1.5 volts is logic 1
Due to electrical noise, the logic levels are defined by a range.
Other gates
The little circle means "not"
NOR gate (NOT-OR)
NAND gate
WHAT???
Truth Tables
It is the opposite of an AND gate
It is a NAND gate
Example
Try to figure out what this does
It is a one-bit adder with carry-in.
Simpler Drawing
Programmable Logic Arrays
PLA for short
The dots are really fuses inside a chip
Fuses can be programmed once
Can implement any logic function
Modern "fuses" are programmed many times
PLAs on hormones are called
Field Programmable Gate Arrays (FPGA)
PLAs
AND gate array
OR gate array
Standard Components
Decoders
Multiplexers
ROM
Decoder
Multiplexer
ROM
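The decoder and multiplexer slides are figure-only, so here is a minimal behavioral sketch of the two in Verilog (introduced later in these notes); the module and port names are my own, not from the slides.

// Hypothetical 2-to-4 decoder: exactly one output line goes high,
// selected by the binary value on the 2-bit input.
module decoder2to4(input [1:0] sel, output [3:0] out);
  assign out = 4'b0001 << sel;
endmodule

// Hypothetical 4-to-1 multiplexer: the select lines route one of
// the four data inputs to the single output.
module mux4to1(input [1:0] sel, input [3:0] in, output out);
  assign out = in[sel];
endmodule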
Boolean Algebra Laws
Identity Law: A+0=A, A*1=A
Zero & One Law: A+1=1, A*0=0
Existence of inverse: A+A' = 1, A*A' = 0
Commutative Law: A+B=B+A, A*B=B*A
Associative Law: A+(B+C)=(A+B)+C
A*(B*C)=(A*B)*C
Distributive Law: A*(B+C)=A*B+A*C
A+(B*C)=(A+B)*(A+C)
De Morgan's Law
(A+B)' = A' * B'
(A*B)' = A' + B'
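One way to convince yourself of both identities is to enumerate all four input combinations; a simulation-only Verilog sketch (module name is mine, and Verilog itself is introduced later in these notes):

module demorgan_check;
  reg A, B;
  integer i;
  initial begin
    for (i = 0; i < 4; i = i + 1) begin
      {A, B} = i;                   // low two bits of i give all four combinations
      if (~(A | B) !== (~A & ~B))   // (A+B)' = A'B'
        $display("NOR identity fails at A=%b B=%b", A, B);
      if (~(A & B) !== (~A | ~B))   // (A*B)' = A'+B'
        $display("NAND identity fails at A=%b B=%b", A, B);
    end
  end
endmodule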
Principle of Duality
AND and OR are symmetric
So are 0 and 1
Optimization
Two different logic expressions can have exactly the same behavior.
Two different expressions with identical behavior may have different implementation costs
Choosing the cheapest is optimization
May have to satisfy other criteria
Propagation delay, no glitches, etc.
Optimization
AB + AB'
=A(B+B')
=A*1=A
A'B'C + ABC
= (A'B' + AB)C
=( (A'B' + A)(A'B' + B) )C
= (B' + A)(A' + B)C
Half Adder
S = A'B + AB'
C = AB

A B | S C
0 0 | 0 0
0 1 | 1 0
1 0 | 1 0
1 1 | 0 1
Full Adder
S = A'B'C + A'BC' + AB'C' + ABC
Cout = ABC + A'BC + ABC' + AB'C
Cout
= AB + BC + CA (optimized)
A B C | S Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1
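The optimized equations drop straight into Verilog (introduced on the next slides); a minimal sketch, with module and port names of my own choosing:

module full_adder(input A, B, Cin, output S, Cout);
  assign S = A ^ B ^ Cin;                        // equivalent to the four-minterm sum above
  assign Cout = (A & B) | (B & Cin) | (Cin & A); // Cout = AB + BC + CA
endmodule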
Verilog
A hardware description language
Can be used to design, optimize and simulate hardware
Started in the mid-80s as a hardware simulation system
Hardware synthesis was added later
Its main competitor is VHDL
What can Verilog do?
Describe a circuit for simulation purposes
Many Verilog constructs are synthesizable.
Allows the designer to specify
Behavior and/or
Structure
Structure of a Verilog Module
Contains initial constructs
Parallel blocks called always constructs
Continuous assignments to specify combinational circuits (gates w/o memory)
Instances of other modules
Elements of Verilog
Wire: mathematical abstraction of a real wire
Can have 4 possible values!!
True or 1
False or 0
X: unknown (not yet defined, unconnected etc)
Z: high impedance. Electrically disconnected; a smart trick electronics engineers have invented.
Elements of Verilog
Registers (reg)
Are memory elements
Verilog compiler may map them to actual memory elements (flip-flops)
Same set of possible values
Elements of Verilog
Constants
Can be specified as plain constants like 3, 15, 20...
Often we want to specify the bit-width of a constant
4'b0011 is a 4-bit representation of 3
5'b00011 is a 5-bit representation of 3
4'b1101 is a 4-bit representation of -3 (2's compl.)
4'hF is a 4-bit representation of 15
Operators in Verilog
+, -, *, / like C
&, |, ~, ^ again like C
==, !=, <, >, <=, >= like C
<<, >> like C
cond ? expr1 : expr2 like C
Operators in Verilog
But adds to C
Unary &, |, ^: apply the operator across all bits of the operand
{A,B} the bits of A followed by the bits of B
{x{const}} is {const,const... x times}
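A small simulation-only sketch of these extra operators (module name and values are mine):

module op_demo;
  reg [3:0] A = 4'b1010;
  reg [1:0] B = 2'b01;
  initial begin
    $display("&A = %b", &A);          // reduction AND over all bits of A -> 0
    $display("|A = %b", |A);          // reduction OR -> 1
    $display("^A = %b", ^A);          // reduction XOR (parity) -> 0
    $display("{A,B} = %b", {A, B});   // concatenation -> 101001
    $display("{3{B}} = %b", {3{B}});  // replication -> 010101
  end
endmodule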
Combinational Circuits
A network of gates
Directed graph
There should be no cycles
Output determined exclusively by inputs
Implement logic functions
No memory elements
Memory Elements
We can think of memory elements as combinational circuits with feedback
We would rather think of them as little black boxes
Sometimes memory is implemented using other technologies (capacitors for DRAM)
Combinational Circuits
module half_adder(A, B, Sum, Carry);
input A, B;
output Sum, Carry;
assign Sum = A ^ B;   // sum is the XOR of the inputs
assign Carry = A & B; // carry is the AND of the inputs
endmodule
Combinational Circuits
Use the assign keyword
They represent permanent connections
The assign keyword can specify only combinational circuits
Combinational circuits can also be specified with the always construct
The always construct can specify sequential circuits as well
The always construct
module half_adder(A, B, S, C);
input A, B;
output reg S, C;
always @(A, B) begin
case ({A, B})
2'b00: begin S = 0; C = 0; end
2'b01: begin S = 1; C = 0; end
2'b10: begin S = 1; C = 0; end
2'b11: begin S = 0; C = 1; end
endcase
end
endmodule
Combinational with always
Previous example used always to implement a half-adder
Uses blocking assignments
Pretty much the same as C
If properly written, most compilers will not use flip-flops to implement it:
All input signals are on the sensitivity list
Every execution path assigns a value to the same bits
Sequential Circuits
Any circuit that contains memory
If it contains memory then it has “state”
If it has state then the state changes, so it goes through a sequence of states
Hence the name sequential.
Sequential Circuits
How come signals don't rush around the loop uncontrollably?
This is where the “clock” comes in
It is the same clock you see on the specs of your CPU
With every clock pulse the signal goes around once
These are called synchronous sequential circuits
There are also asynchronous sequential circuits
Typical Latch
Still....
Unless the width of the clock pulse is wisely selected...
The signal will travel around more than once
These latches are useful in some cases, but not good enough for our current task
Falling-edge-triggered FF
Edge-triggered D Flip-Flop
module DFF(clock, D, Q, Qb);
input clock, D;
output reg Q;
output Qb;
assign Qb = ~Q;         // Qb is always the complement of Q
always @(posedge clock) // Q is sampled only on the rising edge
Q <= D;
endmodule
Timings
Timing is complex
We use a simplified model
Setup time: time the input to the FF has to be stable before the clock edge
Hold time: time the input has to be stable after the clock edge
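Putting the two together (a simplified model, not from the slides): the clock period must be long enough for a value launched at one edge to pass through the combinational logic and be stable at the next FF for the setup time before the following edge:

T_clock >= t_clock-to-Q + t_combinational(worst case) + t_setup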
Multibit Wires and Registers
reg [31:0] regA;
regA[0] is the LSB;
wire [31:0] ALUout;
reg [31:0] regfile[0:31];
regfile[0] is the first register in the register file.
MIPS ALU
module MIPSALU (ALUctl, A, B, ALUOut, Zero);
input [3:0] ALUctl;
input [31:0] A, B;
output reg [31:0] ALUOut;
output Zero;
assign Zero = (ALUOut==0); //Zero is true if ALUOut is 0
always @(ALUctl, A, B) begin //reevaluate if these change
case (ALUctl)
0: ALUOut = A & B;
1: ALUOut = A | B;
2: ALUOut = A + B;
6: ALUOut = A - B;
7: ALUOut = A < B ? 1 : 0;
12: ALUOut = ~(A | B); // result is nor
default: ALUOut = 0;
endcase
end
endmodule
Register File
Register File: read
Register File: write
Register File: Verilog
module rfile(R1, R2, W, WD, Wctl, RD1, RD2, clock);
input [5:0] R1,R2,W;
input [31:0] WD;
input Wctl, clock;
output [31:0] RD1,RD2;
reg [31:0] RF[31:0];
assign RD1 = RF[R1]; // read ports are combinational
assign RD2 = RF[R2];
always @(posedge clock)
if (Wctl) RF[W] <= WD; // write port is synchronous
endmodule
Specifying Gates
Verilog allows the designer to specify individual gates
Can be bulky
Similar syntax can be used for user defined modules
Half Adder
module HA(A, B, S, C);
input A, B;
output S, C;
wire An, Bn, ABn, AnB;
not N1(An, A);    // An = A'
not N2(Bn, B);    // Bn = B'
and (ABn, A, Bn); // ABn = AB'
and (AnB, An, B); // AnB = A'B
or (S, ABn, AnB); // S = AB' + A'B
and (C, A, B);    // C = AB
endmodule
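As noted above, similar syntax instantiates user-defined modules; a sketch of a full adder built from two HA instances plus an OR gate (the module name FA and the wire names are mine):

module FA(A, B, Cin, S, Cout);
input A, B, Cin;
output S, Cout;
wire S1, C1, C2;
HA ha1(A, B, S1, C1);   // S1 = A xor B, C1 = AB
HA ha2(S1, Cin, S, C2); // S = S1 xor Cin, C2 = S1*Cin
or (Cout, C1, C2);      // carry out if either stage produced a carry
endmodule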
Speeding Up Addition
Carry propagation is what slows down addition
Sometimes the LSB of the input will affect the MSB of the output
We design for the worst-case scenario
The simpler adders are called ripple adders
Carry LookAhead
a0, a1, a2, ... and b0, b1, b2, ... are the inputs
c0, c1, c2, ... are the carries.
c1 = b0 c0 + a0 c0 + a0 b0
c1 = a0 b0 + c0 (a0 + b0)
c1 = g0 + c0 p0
g0 = a0 b0; p0 = a0 + b0;
Carry LookAhead
Define
gi = ai bi
pi = ai + bi
Then
ci+1 = gi + pi ci
Carry LookAhead
c1 = g0 + p0 c0
c2 = g1 + p1 g0 + p1 p0 c0
c3 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0
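These equations translate almost line for line into Verilog; a minimal 4-bit sketch (module and signal names are mine, not from the slides):

module cla4(input [3:0] a, b, input c0, output [3:0] sum, output c4);
  wire [3:0] g = a & b; // generate:  gi = ai bi
  wire [3:0] p = a | b; // propagate: pi = ai + bi
  // Every carry is computed directly from g, p and c0 -- no rippling.
  wire c1 = g[0] | (p[0] & c0);
  wire c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
  wire c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0);
  assign c4 = g[3] | (p[3] & c3);        // could be expanded the same way
  assign sum = a ^ b ^ {c3, c2, c1, c0}; // si = ai xor bi xor ci
endmodule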
Control Hazards
Whenever we have a branch/jump/jal/whatever
We find out which way we branch at the MEM stage
Meanwhile we have loaded the next three instructions
We have to flush the pipeline
We waste three cycles
What is the problem?
Jumps/branches are very common: sometimes 25% of the instructions
If our processor is 4-way superscalar, wasting three cycles means we do not execute 12 instructions!
Longer pipelines suffer even more
Solutions
Delayed branch
Means that the next instruction is always executed
Ideally an instruction from before that is independent of the branch
An instruction from the fall-through that has no effect if the branch is taken
An instruction from the target that has no effect if the branch falls through
A nop if all else is unavailable
We save at most one cycle
Solutions
Decide the branch at the ID stage
Requires extra hardware
Saves two cycles
With branch delay can be stall-free
Solutions
Always predict not taken
The easiest... just do what we did so far
Fails miserably for loops
Solutions
Predict taken
We can do this at the ID stage
Waste one cycle only if prediction is correct
Combined with delayed branch the cost goes to zero (if correct)
Works fine for many loops
Solutions
Statically predict taken/not taken
Can be done with heuristics
Or by giving the compiler an execution trace
Just have two variants of every branch instruction
Easy to implement
Works great for numerical programs
Not so great for non-numerical
Solutions
Dynamic prediction
The most advanced and most popular
Requires a lot of silicon area
Can be done by hashing the address to a small memory (Branch Prediction Buffer)
Memory remembers 1 bit (taken/not taken)
Loops have two mispredictions
Can be solved with two-bit prediction (see the sketch below)
There are many far more sophisticated techniques
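As a sketch of the two-bit scheme (my own minimal Verilog for one buffer entry, not from the slides): the entry is a saturating counter, so a prediction must fail twice before it flips, which removes the double misprediction that loops cause with one bit.

module two_bit_predictor(
  input clock,
  input update,  // asserted when the real branch outcome is known
  input taken,   // the actual outcome
  output predict_taken
);
  reg [1:0] state = 2'b00; // 00/01 predict not taken, 10/11 predict taken
  assign predict_taken = state[1];
  always @(posedge clock)
    if (update) begin
      if (taken && state != 2'b11)  state <= state + 2'b01; // count up, saturate
      if (!taken && state != 2'b00) state <= state - 2'b01; // count down, saturate
    end
endmodule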
Solutions
Speculation
The technique nowadays
Good when the control hazard is compounded by a data hazard
Should allow out-of-order execution
Should provide a way to undo a change after a failed speculation
Exceptions/Interrupts
There is a difference
Exceptions are caused by an internal condition
Error, system call
Interrupts are caused by external conditions
I/O complete, mouse clicks
In many cases all are called interrupts
They are handled in more or less the same way
Why bother
A computer that does not communicate with its environment is called a brick
The extra hardware to detect and handle interrupts is large and contributes to the slowing down of the clock
That's part of the reason why some coprocessors run so much faster
How are they handled
The CPU provides relevant info in two registers
EPC (Exception Program Counter), 32 bits
Cause Register, 32 bits but many unused
Alternatively use vectored interrupts
For each possible cause there is an entry in the vector
In more detail
Another form of control hazard
Instead of branching to a user-space address, branch to a kernel-space address
Branches happen only at a particular stage in the pipeline, but exceptions can happen almost anywhere
More than one exception can happen at the same time in different instructions
We may need to restart the instruction after the exception is handled
Some exceptions are handled on the spot, others where they happened.
Instruction Level Parallelism
What drove the speed of CPUs 1995-2005
Pipelining is the oldest technique
Race to reduce hazards
Programmer is unaware of the parallelism
The key is multiple issue
We encounter hazards on hormones
Two kinds
Static multiple issue (VLIW)
Fixed-form issue packet
Was the technique used on Itanium
There are usually restrictions on what instructions can be packaged together
In some designs the compiler has to guarantee no data/structural hazards within the issue packet
Extra cost
If we allow the issue of an ALU and a memory instruction at the same time we need
Twice as many ports on the register file
An extra adder to calculate the effective address
Ability to detect/forward many more hazards between different issue packets
Stalls create twice as much delay
Advantage
With two issue we have a possibly twice as fast processor (if the world was made by angels)
We do not need much more hardware
With a good compiler the C programmer will never know
Disadvantage
We have to recompile for new architectures
We save a bit on hardware but it is hard to make use of advances immediately
Software vendors hated it
Itanium is dead.
Example: VLIW for MIPS
A simplified static multiple issue MIPS-like processor
Can issue one ALU/branch and one load/store instruction per cycle.
Ignores dependencies within the issue packet.
Stalls/forwards for dependencies between issue packets
Example: VLIW for MIPS

ALU/branch   IF ID EX M  WB
Load/Store   IF ID EX M  WB
ALU/branch      IF ID EX M  WB
Load/Store      IF ID EX M  WB
ALU/branch         IF ID EX M  WB
Load/Store         IF ID EX M  WB
ALU/branch            IF ID EX M  WB
Load/Store            IF ID EX M  WB

(one ALU/branch and one load/store issue together each cycle; each packet starts one cycle after the previous one)
Example: scheduling code
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $0, Loop
Scheduled

ALU/branch slot       Load/store slot
Loop: nop             lw   $t0, 0($s1)
addi $s1, $s1, -4     nop
addu $t0, $t0, $s2    nop
bne  $s1, $0, Loop    sw   $t0, 4($s1)
The Verdict
We can do it in 4 issue packets instead of five instructions
Before we had one or two stalls, so it would take 6-7 cycles to execute, plus the stalls due to the branch
If we optimize the single issue version we can get it down to 5 cycles plus branch stalls
Now we can execute it in 4 cycles plus branch stalls.
Observations
We now have many more stalls/nops than single issue
The new stalls/nops eat up most of the improvement
It is not worth the extra hardware/power consumption
Is it the end of the road?
Loop unrolling
Compiler optimization
Can be done easily when loops are independent
Sometimes even when they are not independent
Reduces the loop overhead
Fewer instructions executed
Allows more freedom in scheduling
Fewer stalls/nops
The code

ALU/branch slot             Load/store slot
Loop: addi $s1, $s1, -16    lw   $t0, 0($s1)
nop                         lw   $t1, 12($s1)
addu $t0, $t0, $s2          lw   $t2, 8($s1)
addu $t1, $t1, $s2          lw   $t3, 4($s1)
addu $t2, $t2, $s2          sw   $t0, 16($s1)
addu $t3, $t3, $s2          sw   $t1, 12($s1)
nop                         sw   $t2, 8($s1)
bne  $s1, $0, Loop          sw   $t3, 4($s1)
The tricks we used
Unroll the loop, eliminate the branches, simplify loop variable updating
Use more temp registers
This is called register renaming
We need to do it if we have anti-dependence or name dependence
We may run out of registers or need more saving/restoring
Longer code
May not be optimal in all architectures
Dynamic Multiple Issue
A.K.A. superscalars
The processor decides if it is going to issue 0, 1, 2... instructions
Instructions are allowed to execute out of order
But not necessarily complete out of order
The processor decides how many instructions to issue
The compiler does not need to know.
Dynamic Pipeline scheduling
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
The sub instruction can execute before addu (sub does not depend on the lw result, while addu must wait for $t0)
Dynamic Pipeline
Block diagram: the instruction fetch/decode unit (IF/ID) feeds a set of reservation stations, each with its own execution unit; all results flow into a common commit unit.
The bad news
Dynamic multiple issue CPUs have been available for decades
Some can issue more than 4 instructions per cycle
They rarely complete more than 2 per cycle on average
Have to be conservative to maintain correctness (pointer aliasing)
Power Efficiency
Power has emerged as the limiting factor
Cost of energy goes up
Huge server farms are common
Ability to eliminate heat is limited
Battery life is very important
Environmental concerns
Fallacies and Pitfalls
Pipelining is easy
Real pipelining is quite complex
Pipelining is independent of technology
The huge number of transistors offers options that annul previous techniques (huge pipelines vs delayed branches)
Some "optimizations" in the ISA spoil the speed of the pipeline.