Computer Architecture Midterm Notes



Computer Architecture Midterm 1 Notes: The Name of This Class is Computer Architecture

[Note: Any topics that are marked with an asterisk (*) are optional and will probably not be on the test.]

Recommended Reading

1. Digital Design and Computer Architecture, 2nd Edition – The undergraduate-level computer architecture textbook issued by Dr. John; it serves as a tight, well-written, and holistic introduction to basic digital design, assembly language, and computer architecture. It also has the best explanation of pipelined microarchitectures that I’ve ever seen. Strongly recommended.

2. Computer Systems Design and Architecture – Like Digital Design and Computer Architecture, this text provides a concise, well-explained overview of introductory computer architecture. Unlike that book, it includes historical architectures and nitty-gritty design details that are neglected by Digital Design and Computer Architecture and Computer Organization and Design.

3. Computer Organization and Design – The undergraduate-level computer architecture textbook written by the same authors as Computer Architecture: A Quantitative Approach. While not as clean a text as Digital Design and Computer Architecture, it has better coverage of basic computer performance equations and computer arithmetic.

4. Modern Processor Design: Fundamentals of Superscalar Processors – This book serves as an introduction to various advanced computer architecture topics, including several that are not covered in Computer Architecture: A Quantitative Approach, such as superscalar processors.

5. Computer Architecture: A Quantitative Approach, 5th Edition – Written by two of the greatest computer architects in the business, this book has long been regarded as the computer architecture bible by computing industry professionals. Not for the faint of heart, it provides a dense and esoteric, if poorly organized, look at the state of the art of computer architecture. Don’t read this until you’ve read Digital Design and Computer Architecture or Computer Organization and Design first.

I: The Hardware-Software Interface

Hardware vs. Software

• Hardware includes:
  o The CPU, the design of which involves:
    ! Instruction set
    ! Processor
    ! Control unit
• Software includes:
  o The compiler
  o The operating system (OS)


• A computer can be thought of as the following components, all connected by a bus:
  o Central Processing Unit (CPU)
  o Memory
  o Input/Output (I/O) devices
• In turn, a CPU contains:
  o An ALU
  o A general-purpose register file
  o Various special-purpose registers (e.g., the program counter)
  o The control unit, which routes data across the CPU

Pipelining (Temporal Parallelism)

• Common design technique for improving performance by breaking one large combinational circuit into several smaller combinational circuits.
  o Without pipelining, the rate at which tasks can be completed is limited by the time to complete the entire task.
  o With pipelining, it is limited only by the time it takes to complete each stage.
• Registers are inserted between stages to divide the task into smaller subtasks that can be run with a faster clock cycle.
  o The registers prevent data from one stage from catching up to and corrupting data in the next stage.
• Pipelining has the potential to multiply the throughput (the number of tasks completed per unit time) by the number of stages, at the expense of latency (the time it takes to complete a single task); see the sketch after this list.

• Problems:
  o More resources (e.g., registers) are needed for each stage.
  o It’s difficult to split a task evenly into subtasks that take the same amount of time.
    ! Throughput is limited by the performance of the slowest stage.
  o Dependencies can occur between different tasks in the pipeline.
    ! The data for a later instruction depends on the data from an earlier instruction.
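A rough sanity check of these tradeoffs, as a minimal Python sketch (the stage delays and register overhead are made-up values, not from the notes):

```python
# Compare an unpipelined circuit against a 4-stage pipelined version.
stage_delays_ns = [10, 12, 8, 10]   # assumed combinational delay of each stage
reg_overhead_ns = 1                 # assumed register setup + clk-to-q cost

unpipelined_cycle = sum(stage_delays_ns)                  # one long path: 40 ns
pipelined_cycle = max(stage_delays_ns) + reg_overhead_ns  # slowest stage sets the clock: 13 ns

print(f"throughput: {1/unpipelined_cycle:.4f} vs {1/pipelined_cycle:.4f} tasks/ns")
print(f"latency:    {unpipelined_cycle} ns vs {pipelined_cycle * len(stage_delays_ns)} ns")
```

Note how the throughput gain is less than 4x because the stages are uneven, and the latency actually gets worse (52 ns vs. 40 ns).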


The Compilation Process: From High-Level Code to Machine Code

• The process:
  1. A compiler takes high-level code (e.g., C, C++, Fortran) and translates it into the assembly code for a given processor architecture (e.g., MIPS, x86, DLX).
  2. An assembler then takes the assembly code, links it up with other bits of assembly code, and converts it into the equivalent machine code (the numbers stored in memory that represent processor instructions).
• Each high-level statement can correspond to several assembly instructions, depending on the architecture in use.
  o Variables in a high-level language generally correspond to locations in memory.
  o Data from memory (at least in load-store architectures) has to be loaded into registers in the CPU before it can be operated on.


The Evolution of Bit Sizes

• The bit size of a processor can variously refer to:
  o The operand size (the bit length of the largest number that an architecture can naturally operate on)
  o The bus size (the bit length of the data bus between the processor and memory)
• A brief history of Intel processors:
  o Early processors (e.g., the Intel 4004, Intel 4040) were 4-bit.
    ! Number range: 0 to 2^4 − 1
  o Later processors (the Intel 8008, 8080, and 8085) were 8-bit.
    ! Number range: 0 to 2^8 − 1
  o Early x86 processors (the Intel 8086/8088 to the Intel 80286) were 16-bit.
    ! Number range: 0 to 2^16 − 1
  o Later x86 processors (the Intel 80386 to the Pentium 4) were 32-bit.
    ! Number range: 0 to 2^32 − 1
  o Most modern x86 processors (later Pentium 4 processors on up) are 64-bit.
    ! Number range: 0 to 2^64 − 1


Memory Arrays

• A CPU has access to a separate memory array with which to store large amounts of data.
• Memory arrays are organized as N × M arrays of memory cells:
  o N rows, M columns.
  o Each row is a word containing data; each column is one bit position within a word.
  o Each cell stores one bit of data.
• An N-to-2^N decoder is used to convert the address into the selection of a specific word within the memory array, where N is the number of address bits:

  N = log2(number of addresses), i.e., 2^N = number of addresses
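A minimal behavioral sketch of that decoder in Python (function and variable names are mine):

```python
def decode(address: int, n_bits: int) -> list[int]:
    """N-to-2^N decoder: assert exactly one word line for the given address."""
    return [1 if row == address else 0 for row in range(2 ** n_bits)]

print(decode(0b10, 2))  # [0, 0, 1, 0] -> only word line 2 is asserted
```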

Each location is specified by an Address. The value read or written is called Data. An array with N-bit addresses and M-bit data has 2^N rows and M columns. Each row of data is called a word. Thus, the array contains 2^N M-bit words.

Figure 5.39 shows a memory array with two address bits and three data bits. The two address bits specify one of the four rows (data words) in the array. Each data word is three bits wide. Figure 5.39(b) shows some possible contents of the memory array.

The depth of an array is the number of rows, and the width is the number of columns, also called the word size. The size of an array is given as depth × width. Figure 5.39 is a 4-word × 3-bit array, or simply a 4 × 3 array. The symbol for a 1024-word × 32-bit array is shown in Figure 5.40. The total size of this array is 32 kilobits (Kb).

Bit Cells: Memory arrays are built as an array of bit cells, each of which stores 1 bit of data. Figure 5.41 shows that each bit cell is connected to a wordline and a bitline. For each combination of address bits, the memory asserts a single wordline that activates the bit cells in that row. When the wordline is HIGH, the stored bit transfers to or from the bitline. Otherwise, the bitline is disconnected from the bit cell. The circuitry to store the bit varies with memory type.

To read a bit cell, the bitline is initially left floating (Z). Then the wordline is turned ON, allowing the stored value to drive the bitline to 0 or 1. To write a bit cell, the bitline is strongly driven to the desired value. Then the wordline is turned ON, connecting the bitline to the stored bit. The strongly driven bitline overpowers the contents of the bit cell, writing the desired value into the stored bit.

Organization: Figure 5.42 shows the internal organization of a 4 × 3 memory array. Of course, practical memories are much larger, but the behavior of larger arrays can be extrapolated from the smaller array. In this example, the array stores the data from Figure 5.39(b).

During a memory read, a wordline is asserted, and the corresponding row of bit cells drives the bitlines HIGH or LOW. During a memory write, the bitlines are driven HIGH or LOW first, and then a wordline is asserted, allowing the bitline values to be stored in that row of bit cells. For example, to read Address 10, the bitlines are left floating, the decoder asserts wordline2, and the data stored in that row of bit cells (100) reads out onto the Data bitlines. To write the value 001 to Address 11, the bitlines are driven to the value 001, then wordline3 is asserted and the new value (001) is stored in the bit cells.

[Figure 5.39: 4 × 3 memory array: (a) symbol, (b) function]
[Figure 5.40: 32 Kb array: depth = 2^10 = 1024 words, width = 32 bits]
[Figure 5.41: Bit cell (stored bit connected to a wordline and a bitline)]


Memory Ports: All memories have one or more ports. Each port gives read and/or write access to one memory address. The previous examples were all single-ported memories.

Multiported memories can access several addresses simultaneously. Figure 5.43 shows a three-ported memory with two read ports and one write port. Port 1 reads the data from address A1 onto the read data output RD1. Port 2 reads the data from address A2 onto RD2. Port 3 writes the data from the write data input WD3 into address A3 on the rising edge of the clock if the write enable WE3 is asserted.

Memory Types: Memory arrays are specified by their size (depth × width) and the number and type of ports. All memory arrays store data as an array of bit cells, but they differ in how they store bits.

Memories are classified based on how they store bits in the bit cell. The broadest classification is random access memory (RAM) versus read only memory (ROM). RAM is volatile, meaning that it loses its data when the power is turned off. ROM is nonvolatile, meaning that it retains its data indefinitely, even without a power source.

RAM and ROM received their names for historical reasons that are no longer very meaningful. RAM is called random access memory because any data word is accessed with the same delay as any other. In contrast, a sequential access memory, such as a tape recorder, accesses nearby data more quickly than faraway data (e.g., at the other end of the tape).

[Figure 5.42: 4 × 3 memory array (a 2:4 decoder asserts one of wordline3–wordline0; the selected row's bit cells drive bitline2–bitline0 / Data2–Data0)]
[Figure 5.43: Three-ported memory (read ports A1/RD1 and A2/RD2; clocked write port A3/WD3/WE3)]


• Larger memory units require more elaborate decode schemes to deal with fan-out limitations, as well as to tie together smaller memory units into bigger ones.
• Some binary prefixes you should know:
  o K: Kilobyte/Kibibyte (2^10 bytes)
  o M: Megabyte/Mebibyte (2^20 bytes)
  o G: Gigabyte/Gibibyte (2^30 bytes)


II: Measuring Computer Performance

Speedup

• Given the execution time of an old computer system T_old and the execution time of a newer system T_new, the speedup of the new system over the old one is given by:

  S = T_old / T_new

• What computer architects mean when saying that system A is n% faster than system B:

  S = T_B / T_A = (100 + n) / 100

  n (%)   T_A   T_B   Speedup
  50      100   150   (100 + 50) / 100 = 1.5 times faster
  100     100   200   (100 + 100) / 100 = 2 times faster
  0       100   100   (100 + 0) / 100 = 1 (no speedup)

Amdahl’s Law

• The quantitative version of the law of diminishing returns.
• States that the benefit of a given performance enhancement is limited by the number of times the enhancement is used.
  o Even if dramatic improvements are made to part of a program’s performance, the overall speedup is limited if that part accounts for only a small fraction of the overall execution time.


Derivation:

• Consider a program whose unimproved execution time is given by T_old and whose improved execution time is given by T_new.
  o T_x gives the execution time of the unaffected parts of the program.
  o T_y gives the execution time of the affected parts of the program prior to enhancement.
  o T_y' gives the execution time of the affected parts of the program after enhancement.
  o S_f is the feature speedup, the amount by which a given part of the program improves:

    S_f = T_y / T_y'

• The overall speedup of the program due to the feature speedup S_f is found by dividing the numerator and denominator by T_y:

  S = T_old / T_new
    = (T_x + T_y) / (T_x + T_y')
    = (T_x/T_y + 1) / (T_x/T_y + T_y'/T_y)
    = (T_x/T_y + 1) / (T_x/T_y + 1/S_f)


• Plotting T_x/T_y vs. speedup shows that:
  o Maximum speedup occurs when the execution time of the unaffected parts of the program is minimized (T_x = 0, Speedup = S_f):

    S = (0/T_y + 1) / (0/T_y + 1/S_f) = 1 / (1/S_f) = S_f

  o As T_x increases, the speedup due to the feature speedup diminishes and approaches 1 (no speedup):

    S = lim_{T_x → ∞} (T_x/T_y + 1) / (T_x/T_y + 1/S_f) = 1

Example (S_f = 5, T_x/T_y = 4):

  S = (T_x/T_y + 1) / (T_x/T_y + 1/S_f) = (4 + 1) / (4 + 1/5) = 5/4.2 ≈ 1.19


Equivalent Formulations:

  Execution time = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

  T_new = T_y/S_f + T_x = T_y/(T_y/T_y') + T_x = T_y' + T_x

  Speedup = Execution time before / (Execution time before − Execution time affected + Execution time affected / Feature speedup)

  S = T_old / (T_old − T_y + T_y/S_f)

  Speedup = 1 / ((1 − Fraction of time affected) + Fraction of time affected / Feature speedup)

  S = 1 / ((1 − f_y) + f_y/S_f), where f_y = T_y/T_old
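Both forms are easy to check numerically; a minimal Python sketch (function and argument names are mine):

```python
def amdahl_speedup(frac_affected: float, feature_speedup: float) -> float:
    """Overall speedup when `frac_affected` of execution time is sped up by `feature_speedup`."""
    return 1.0 / ((1.0 - frac_affected) + frac_affected / feature_speedup)

# The worked example above: S_f = 5 with T_x/T_y = 4, so T_y is 1/5 of T_old.
print(amdahl_speedup(frac_affected=1/5, feature_speedup=5))  # ~1.19
```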

Measuring the Time to Run a Program

• The time required to run a program is given by the following equation:

  T_program = (instructions/program) × (clock cycles/instruction) × (time/clock cycle)
            = (instructions/program) × (clock cycles/instruction) × (1/frequency)
            = number of instructions × clocks per instruction (CPI) × (1/frequency)

• Although decreasing the number of instructions in a program to improve execution time sounds like a good idea, efforts to do so (e.g., CISC architectures) tend to increase the clock cycle time (and thus decrease the frequency), potentially decreasing performance.
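A minimal sketch of the equation with assumed numbers (all values are made up for illustration):

```python
instructions = 2_000_000   # dynamic instruction count (assumed)
cpi = 1.5                  # average clocks per instruction (assumed)
frequency_hz = 2e9         # assumed 2 GHz clock

t_program = instructions * cpi * (1 / frequency_hz)
print(f"{t_program * 1e3:.3f} ms")  # 1.500 ms
```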


III: Basic Logic Design

Combinational vs. Sequential Logic

• In combinational logic:
  o Output depends strictly on input.
  o There are no state elements in the circuit (flip-flops, latches, etc.).
• In sequential logic:
  o Output depends on both input and the current state.
  o There are state elements present.


Gate-Level Design

• This course only goes down to gate-level primitives (AND, OR, NOT, etc.).
• Two in particular, the NAND and NOR gates, are referred to as the “universal” gates because they are easier to build and can implement all of the other basic gates.
• The XOR gate is of special interest in digital design.
  o The XOR gate outputs a 1 if an odd number of its inputs are 1.


• One application of the XOR gate is parity checking: testing data integrity by checking whether a packet of data has an odd number of 1 bits (odd parity) or an even number of 1 bits (even parity).
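A minimal Python sketch of parity computation, mirroring what an XOR tree does in hardware (names are mine):

```python
def parity(word: int) -> int:
    """Return 1 if `word` contains an odd number of 1 bits, else 0."""
    p = 0
    while word:
        p ^= word & 1   # XOR each bit into the running parity
        word >>= 1
    return p

print(parity(0b1011))  # 1 -> odd parity
print(parity(0b1001))  # 0 -> even parity
```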


Combinational Circuit Design Example: The Full Adder

SOP and POS Representations of Boolean Functions

• Boolean functions can be represented in either Sum-of-Products (SOP) or Product-of-Sums (POS) form.
• From a hardware standpoint, they are implemented as two-level logic consisting of either AOI (AND-OR-Invert) or OAI (OR-AND-Invert) logic.

  SOP (e.g.): F = AB + ĀC + ABC
  POS (e.g.): F = (A + B)(Ā + C)

Gate Delay

• Because all combinational logic has an SOP or POS representation, a good approximation of the delay of a combinational circuit is given by:

  T_comb = 2d, where d = gate delay

• This approximation disregards the effect that inverters and gate fan-in/fan-out have on delay.


Digital Building Blocks

MUX

• Commonly used combinational circuit that chooses an output from one of 2^n inputs using an n-bit select signal.
• Word-level/register-level muxes can also be used instead of bit-level muxes.

A human with a bit of experience can find a good solution by inspection. Neither of the authors has ever used a Karnaugh map in real life to solve a practical problem. But the insight gained from the principles underlying Karnaugh maps is valuable. And Karnaugh maps often appear at job interviews!

2.8 Combinational Building Blocks: Combinational logic is often grouped into larger building blocks to build more complex systems. This is an application of the principle of abstraction, hiding the unnecessary gate-level details to emphasize the function of the building block. We have already studied three such building blocks: full adders (from Section 2.1), priority circuits (from Section 2.4), and seven-segment display decoders (from Section 2.7). This section introduces two more commonly used building blocks: multiplexers and decoders. Chapter 5 covers other combinational building blocks.

2.8.1 Multiplexers: Multiplexers are among the most commonly used combinational circuits. They choose an output from among several possible inputs based on the value of a select signal. A multiplexer is sometimes affectionately called a mux.

2:1 Multiplexer: Figure 2.54 shows the schematic and truth table for a 2:1 multiplexer with two data inputs D0 and D1, a select input S, and one output Y. The multiplexer chooses between the two data inputs based on the select: if S = 0, Y = D0, and if S = 1, Y = D1. S is also called a control signal because it controls what the multiplexer does. A 2:1 multiplexer can be built from sum-of-products logic as shown in Figure 2.55. The Boolean equation for the multiplexer may be derived by inspection: Y = S̄D0 + SD1.

[Figure 2.53: K-map solution with don’t cares (seven-segment decoder outputs Sa and Sb)]
[Figure 2.54: 2:1 multiplexer symbol and truth table]


Decoder

• A combinational circuit with n input bits and 2^n output bits in which exactly one of the outputs is asserted at a time, depending on the value of the inputs.
• Used to decode a smaller set of signals into a much larger set.

Example 2.14: DECODER IMPLEMENTATION

Implement a 2:4 decoder with AND, OR, and NOT gates.

Solution: Figure 2.64 shows an implementation for the 2:4 decoder using four AND gates. Each gate depends on either the true or the complementary form of each input. In general, an N:2^N decoder can be constructed from 2^N N-input AND gates that accept the various combinations of true or complementary inputs. Each output in a decoder represents a single minterm. For example, Y0 represents the minterm Ā1Ā0. This fact will be handy when using decoders with other digital building blocks.

Decoder Logic: Decoders can be combined with OR gates to build logic functions. Figure 2.65 shows the two-input XNOR function using a 2:4 decoder and a single OR gate. Because each output of a decoder represents a single minterm, the function is built as the OR of all the minterms in the function. In Figure 2.65, Y = ĀB̄ + AB, the complement of A ⊕ B.

[Figure 2.62: Alyssa’s new circuit]
[Figure 2.63: 2:4 decoder symbol and truth table]
[Figure 2.64: 2:4 decoder implementation]
[Figure 2.65: Logic function (XNOR) built from a 2:4 decoder and an OR gate]


• Used in memory circuits to select a word in memory given an address.

Flip-Flops

• Edge-sensitive memory elements (unlike latches, which are level-sensitive).
• Types:
  o SR*
  o JK*
    ! A more powerful flip-flop class.
    ! Results in smaller control logic than equivalent sequential circuits using D flip-flops.

Case IVa: Q = 0. Because S and Q are FALSE, N2 produces a TRUE output on Q̄, as shown in Figure 3.4(a). Now N1 receives one TRUE input, Q̄, so its output, Q, is FALSE, just as we had assumed.

Case IVb: Q = 1. Because Q is TRUE, N2 produces a FALSE output on Q̄, as shown in Figure 3.4(b). Now N1 receives two FALSE inputs, R and Q̄, so its output, Q, is TRUE, just as we had assumed.

Putting this all together, suppose Q has some known prior value, which we will call Q_prev, before we enter Case IV. Q_prev is either 0 or 1, and represents the state of the system. When R and S are 0, Q will remember this old value, Q_prev, and Q̄ will be its complement, Q̄_prev. This circuit has memory.

The truth table in Figure 3.5 summarizes these four cases. The inputs S and R stand for Set and Reset. To set a bit means to make it TRUE. To reset a bit means to make it FALSE. The outputs, Q and Q̄, are normally complementary. When R is asserted, Q is reset to 0 and Q̄ does the opposite. When S is asserted, Q is set to 1 and Q̄ does the opposite. When neither input is asserted, Q remembers its old value, Q_prev. Asserting both S and R simultaneously doesn’t make much sense because it means the latch should be set and reset at the same time, which is impossible. The poor confused circuit responds by making both outputs 0.

The SR latch is represented by the symbol in Figure 3.6. Using the symbol is an application of abstraction and modularity. There are various ways to build an SR latch, such as using different logic gates or transistors. Nevertheless, any circuit element with the relationship specified by the truth table in Figure 3.5 and the symbol in Figure 3.6 is called an SR latch.

Like the cross-coupled inverters, the SR latch is a bistable element with one bit of state stored in Q. However, the state can be controlled through the S and R inputs. When R is asserted, the state is reset to 0. When S is asserted, the state is set to 1. When neither is asserted, the state retains its old value. Notice that the entire history of inputs can be accounted for by the single state variable Q.

[Figure 3.4: Bistable states of SR latch]
[Figure 3.5: SR latch truth table (S R = 00: Q = Q_prev; 01: Q = 0; 10: Q = 1; 11: Q = 0)]
[Figure 3.6: SR latch symbol]


  o D
    ! Most commonly used flip-flop type.
    ! The input that comes in is what comes out on the next clock cycle.
    ! Stores 1 bit.
    ! Used to implement registers.

Registers

• Made using several flip-flops

3.2.3 D Flip-Flop

A D flip-flop can be built from two back-to-back D latches controlled by complementary clocks, as shown in Figure 3.8(a). The first latch, L1, is called the master. The second latch, L2, is called the slave. The node between them is named N1. A symbol for the D flip-flop is given in Figure 3.8(b). When the Q̄ output is not needed, the symbol is often condensed as in Figure 3.8(c).

When CLK = 0, the master latch is transparent and the slave is opaque. Therefore, whatever value was at D propagates through to N1. When CLK = 1, the master goes opaque and the slave becomes transparent. The value at N1 propagates through to Q, but N1 is cut off from D. Hence, whatever value was at D immediately before the clock rises from 0 to 1 gets copied to Q immediately after the clock rises. At all other times, Q retains its old value, because there is always an opaque latch blocking the path between D and Q.

In other words, a D flip-flop copies D to Q on the rising edge of the clock, and remembers its state at all other times. Reread this definition until you have it memorized; one of the most common problems for beginning digital designers is to forget what a flip-flop does. The rising edge of the clock is often just called the clock edge for brevity. The D input specifies what the new state will be. The clock edge indicates when the state should be updated.

A D flip-flop is also known as a master-slave flip-flop, an edge-triggered flip-flop, or a positive edge-triggered flip-flop. The triangle in the symbol denotes an edge-triggered clock input. The Q̄ output is often omitted when it is not needed.

Example 3.1: FLIP-FLOP TRANSISTOR COUNT

How many transistors are needed to build the D flip-flop described in this section?

Solution: A NAND or NOR gate uses four transistors. A NOT gate uses two transistors. An AND gate is built from a NAND and a NOT, so it uses six transistors. The SR latch uses two NOR gates, or eight transistors. The D latch uses an SR latch, two AND gates, and a NOT gate, or 22 transistors. The D flip-flop uses two D latches and a NOT gate, or 46 transistors. Section 3.2.7 describes a more efficient CMOS implementation using transmission gates.

3.2.4 Register

An N-bit register is a bank of N flip-flops that share a common CLK input, so that all bits of the register are updated at the same time. Registers are the key building block of most sequential circuits. Figure 3.9 …

The precise distinction between flip-flops and latches is somewhat muddled and has evolved over time. In common industry usage, a flip-flop is edge-triggered. In other words, it is a bistable element with a clock input. The state of the flip-flop changes only in response to a clock edge, such as when the clock rises from 0 to 1. Bistable elements without an edge-triggered clock are commonly called latches. The term flip-flop or latch by itself usually refers to a D flip-flop or D latch, respectively, because these are the types most commonly used in practice.

[Figure 3.8: D flip-flop: (a) schematic, (b) symbol, (c) condensed symbol]


Bus

• Interconnect between multiple devices with multiple entry and exit points.
• Devices connect to the bus via tri-state buffers (unidirectional switches that output a high-impedance value (Z) when S = 0).
  o The tri-state buffers are needed to isolate un-accessed devices from the bus when not in use.
• Unidirectional and bidirectional interconnects:
  o Unidirectional interconnects to the bus are handled using a single tri-state buffer.
  o Bidirectional interconnects are handled using two tri-state buffers in opposite directions, with separate control signals.


Sequential Circuit Design (Finite State Machines)

• Recall that sequential circuits have outputs that depend on both the input and the current state:

  S_{t+1} = f(S_t, X_t)
  Z_t = g(S_t, X_t)

Designing a Finite State Machine:

1. Define the specification.
2. Determine the state information to be encoded.
3. Design a state transition diagram (state graph).
4. Encode the state symbols.
5. Translate the state transition diagram to a state transition table (STT).
6. Generate the combinational circuits that implement the STT.

Example: Consecutive 1’s Detector

1. Specification: design a circuit that produces two consecutive clock cycles of 1’s on the output Z when it sees three consecutive clock cycles of 1’s on the input x starting from the last 1.


2. State Information: how many consecutive 1’s have been seen so far.

3. State Transition Diagram (Mealy FSM):

4. State Symbols:

  State  S1  S0
  a      0   0
  b      0   1
  c      1   0
  d      1   1

5. State Transition Table:

  S1  S0  x  |  S1'  S0'  z
  0   0   0  |  0    0    0
  0   0   1  |  0    1    0
  0   1   0  |  0    0    0
  0   1   1  |  1    0    0
  1   0   0  |  0    0    0
  1   0   1  |  1    1    1
  1   1   0  |  0    0    1
  1   1   1  |  1    1    1


6. Combinational Logic:
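The gate equations come from K-maps over the STT (figure not reproduced). As a behavioral sanity check, the table from step 5 can be simulated directly; a minimal Python sketch using the state encoding from step 4 (names are mine):

```python
# Keys are (S1, S0, x); values are (S1', S0', z), copied from the STT above.
STT = {
    (0, 0, 0): (0, 0, 0), (0, 0, 1): (0, 1, 0),
    (0, 1, 0): (0, 0, 0), (0, 1, 1): (1, 0, 0),
    (1, 0, 0): (0, 0, 0), (1, 0, 1): (1, 1, 1),
    (1, 1, 0): (0, 0, 1), (1, 1, 1): (1, 1, 1),
}

def run(inputs):
    s1 = s0 = 0                      # reset to state a = 00
    outputs = []
    for x in inputs:
        s1, s0, z = STT[(s1, s0, x)]
        outputs.append(z)
    return outputs

print(run([1, 1, 1, 0, 0]))  # [0, 0, 1, 1, 0]: Z = 1 for two cycles after three 1's
```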

Moore vs. Mealy Finite State Machines*

• Moore Machine: a finite state machine in which the output depends only on the current state.
  o Inputs are tied to transition arrows, while outputs are tied to a specific state.
• Mealy Machine: a finite state machine in which the output depends on both the current state and the inputs.
  o Inputs and outputs are tied to transition arrows in the format “input/output”.

Table 3.15 shows the combined state transition and output table for the Mealy machine. The Mealy machine requires only one bit of state. Consider using a binary state encoding: S0 = 0 and S1 = 1. Table 3.16 rewrites the state transition and output table with these encodings. From these tables, we find the next state and output equations by inspection:

  S'0 = Ā     (3.10)
  Y = S0·A    (3.11)

The Moore and Mealy machine schematics are shown in Figure 3.31. The timing diagrams for each machine are shown in Figure 3.32. The two machines follow a different sequence of states. Moreover, the Mealy machine’s output rises a cycle sooner because it responds to the input rather than waiting for the state change. If the Mealy output were delayed through a flip-flop, it would match the Moore output. When choosing your FSM design style, consider when you want your outputs to respond.

Table 3.11: Moore state transition table

  Current State S  |  Input A  |  Next State S'
  S0               |  0        |  S1
  S0               |  1        |  S0
  S1               |  0        |  S1
  S1               |  1        |  S2
  S2               |  0        |  S1
  S2               |  1        |  S0

Table 3.12: Moore output table

  Current State S  |  Output Y
  S0               |  0
  S1               |  0
  S2               |  1

[Figure 3.30: FSM state transition diagrams: (a) Moore machine, (b) Mealy machine]



IV: Number Systems and Unsigned Integers

• Number System: a system for representing numbers.

  Base 10:  0  1  2  3  4  5  6  7  8  9  10  11  12  13
  Base 14:  0  1  2  3  4  5  6  7  8  9  A   B   C   D

  23_4 = 2×4^1 + 3×4^0 = 11_10
  23_10 = 2×10^1 + 3×10^0 = 23_10
  357_10 = 3×10^2 + 5×10^1 + 7×10^0

Binary Integer Weighting System

  N = b_{n−1}·2^{n−1} + b_{n−2}·2^{n−2} + … + b_1·2^1 + b_0·2^0, where n = number of bits

Base Conversion Algorithm

1. Divide the number by the base until the result equals 0.
2. Store the remainders from the LSB to the MSB.
3. Use the result of the last division as the dividend of the next division.


57_10 from base 10 to base 2:

  57 ÷ 2 = 28  remainder 1  (the remainder of 57 ÷ 2 is the LSB)
  28 ÷ 2 = 14  remainder 0
  14 ÷ 2 = 7   remainder 0
  7 ÷ 2 = 3    remainder 1
  3 ÷ 2 = 1    remainder 1
  1 ÷ 2 = 0    remainder 1

  57_10 = 111001_2

• Each divide-by-2 operation shifts the result right by 1 (which is why right shifts by n double as dividing by 2^n).

212_5 from base 5 to base 2*:

  212_5 ÷ 2 = 103_5  remainder 1
  103_5 ÷ 2 = 24_5   remainder 0
  24_5 ÷ 2 = 12_5    remainder 0
  12_5 ÷ 2 = 3_5     remainder 1
  3_5 ÷ 2 = 1_5      remainder 1
  1_5 ÷ 2 = 0_5      remainder 1

  212_5 = 111001_2

212_5 from base 5 to base 7*:

  212_5 ÷ 7 = 13_5  remainder 1
  13_5 ÷ 7 = 1_5    remainder 1
  1_5 ÷ 7 = 0_5     remainder 1

  212_5 = 111_7

• In general, the base conversion algorithm is base-agnostic: it works regardless of the original base of the number (since it relies purely on the results of division operations).
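The repeated-division algorithm above, as a minimal Python sketch (function name is mine):

```python
def to_base(n: int, base: int) -> str:
    """Convert a non-negative integer to a digit string in `base` (2..16)."""
    if n == 0:
        return "0"
    digits = ""
    while n > 0:                    # divide until the quotient is 0
        n, r = divmod(n, base)      # the quotient feeds the next division
        digits = "0123456789ABCDEF"[r] + digits  # remainders fill LSB -> MSB
    return digits

print(to_base(57, 2))  # 111001
print(to_base(57, 7))  # 111  (57 is 212 in base 5)
```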


Unsigned Integer Ranges for n-Digit Numbers

• Base-10 range: 0 ≤ N ≤ 10^n − 1
• Base-2 range: 0 ≤ N ≤ 2^n − 1
• Base-7 range: 0 ≤ N ≤ 7^n − 1

V: Fixed-Point Representation

Fixed-Point Basics

• The “point” refers to the position of the radix (“decimal”) point.
• Fixed point fixes the point’s position, while floating point lets the point’s position change.
• Tradeoff between range and precision:
  o Moving the point left increases precision at the expense of range.
  o Moving the point right increases range at the expense of precision.

Example: 4-bit number

  xxxx.   ← high range, low precision
  xxx.x
  .xxxx   ← low range, high precision

Fixed Point Weighting System

  N = b_{n−1}·2^{n−1} + b_{n−2}·2^{n−2} + … + b_0·2^0 + b_{−1}·2^{−1} + … + b_{−m}·2^{−m}

  n = number of integer bits, m = number of fractional bits


Decimal to Binary Fixed Point Conversion Algorithm

1. Convert the integer part using the base conversion algorithm for integers:
   a. Divide the number by the base until the result equals 0.
   b. Store the remainders from the LSB to the MSB.
   c. Use the result of the last division as the dividend of the next division.
2. Convert the fractional part using the following algorithm:
   a. Multiply the fraction by the base.
   b. Append the integer part of the product to the binary fraction.
   c. Use the fractional part of the product as the multiplicand of the next multiplication operation.
   d. Stop when the maximum number of fractional bits is reached or when the fractional part is 0.

4.3_10 from base 10 to base 2 (five fractional bits):

• Integer component:

  4_10 = 100_2

• Fractional component:

  0.3 × 2 = 0.6 → 0
  0.6 × 2 = 1.2 → 1
  0.2 × 2 = 0.4 → 0
  0.4 × 2 = 0.8 → 0
  0.8 × 2 = 1.6 → 1

  0.3_10 ≈ 0.01001_2

• Final result:

  4.3_10 ≈ 100.01001_2

4.25_10 from base 10 to base 2:

• Integer component:

  4_10 = 100_2

• Fractional component:

  0.25 × 2 = 0.5 → 0
  0.5 × 2 = 1.0 → 1

  0.25_10 = 0.01_2


• Final result:

  4.25_10 = 100.01_2
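Both fractional conversions can be checked with a minimal Python sketch of the multiply-by-the-base loop (function name is mine):

```python
def frac_to_binary(frac: float, max_bits: int) -> str:
    """Convert 0 <= frac < 1 to a binary fraction of at most max_bits bits."""
    bits = ""
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # the integer part of the product is the next bit
        bits += str(bit)
        frac -= bit            # keep only the fractional part for the next step
    return "0." + (bits or "0")

print(frac_to_binary(0.3, 5))   # 0.01001 (approximate; 0.3 is not exact in binary)
print(frac_to_binary(0.25, 5))  # 0.01    (terminates early)
```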

• Each multiply-by-2 operation shifts the result left by 1 (which is why left shifts by n double as multiplying by 2^n).
  o The integer part generated by this algorithm corresponds to the carry-out.

VI: Signed Integers

Overview

  Number  Sign and Magnitude  1's Complement  2's Complement
  7       0111                0111            0111
  6       0110                0110            0110
  5       0101                0101            0101
  4       0100                0100            0100
  3       0011                0011            0011
  2       0010                0010            0010
  1       0001                0001            0001
  0       0000                0000            0000
  −0      1000                1111            (0000)
  −1      1001                1110            1111
  −2      1010                1101            1110
  −3      1011                1100            1101
  −4      1100                1011            1100
  −5      1101                1010            1011
  −6      1110                1001            1010
  −7      1111                1000            1001
  −8      n/a                 n/a             1000


Sign and Magnitude

Overview:
• The most significant bit represents the sign.
• The remaining bits represent the magnitude.

Issues:
• Two representations of 0 (positive and negative zero).
  o Must check for two different zeroes when testing for zero in a program.
• Incompatibility with unsigned addition hardware.
  o A performance hit is incurred.

1's Complement

Overview:
• Bit-by-bit complement of the unsigned number.
• An MSB of 1 indicates a negative number, while an MSB of 0 indicates a positive number.

Issues:
• Duplicate representation of zero.
• Requires an additional end-around carry for unsigned adder hardware to work with 1's complement numbers.


Conversion from unsigned:

• Complement the bits.

Face value*:

• The value that a 1's complement number would have if it were interpreted as an unsigned integer.
• Formula:

  fv(x) = x,               x > 0
  fv(x) = 2^n − |x| − 1,   x < 0

Range/Weight System*:

• Integer:

  N = b_{n−1}·(−(2^{n−1} − 2^0)) + b_{n−2}·2^{n−2} + … + b_1·2^1 + b_0·2^0, n = number of bits

  o MSB weight: −(2^{n−1} − 2^0)
  o LSB weight: 2^0
  o Range:

    −(2^{n−1} − 1) ≤ N ≤ 2^{n−1} − 1

  o Examples:

    5_10 = 0101_2 = 2^2 + 2^0 = 4 + 1 = 5
    −5_10 = 1010_2 = −(2^3 − 2^0) + 2^1 = −(8 − 1) + 2 = −5


• Fraction:

  N = b_{n−1}·(−(2^{n−1} − 2^{−m})) + b_{n−2}·2^{n−2} + … + b_0·2^0 + b_{−1}·2^{−1} + … + b_{−m}·2^{−m}

  n = number of integer bits, m = number of fractional bits

  o MSB weight: −(2^{n−1} − 2^{−m})
  o LSB weight: 2^{−m}

Fractions*

1. Shift the bits left until no fractional part remains (multiply by 2^x).
2. Take the 1's complement of the resulting integer.
3. Shift the bits back (divide by 2^x).

Example:


Sign Extension*

• MSB: sign extend.
• LSB: sign extend (with the MSB).


2's Complement

Overview:
• Subtraction of the unsigned number from 2^n.
• Alternatively, complement the bits and add 1.

Resolved issues:
• No duplicate zeroes.
• Can use unmodified unsigned adder hardware.

Conversion from unsigned (three equivalent methods):

1. Complement the bits, then add 1.
2. Subtract from 2^n, where n is the number of bits.
   o The most conceptually useful algorithm.
3. Scan from right to left until you hit the first 1, then complement all subsequent (more significant) bits.
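The first two methods are easy to cross-check in Python (a minimal sketch; names are mine):

```python
N = 4  # bit width

def complement_add_one(x: int) -> int:
    """Method 1: complement the bits, then add 1 (masked to N bits)."""
    return (~x + 1) & (2**N - 1)

def subtract_from_2n(x: int) -> int:
    """Method 2: subtract from 2^N."""
    return (2**N - x) % 2**N

x = 6
print(f"{complement_add_one(x):04b}")                # 1010 = -6 in 4-bit 2's complement
print(complement_add_one(x) == subtract_from_2n(x))  # True: the methods agree
```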


Face value*:

• The value that a 2's complement number would have if it were interpreted as an unsigned integer.
  o Based on the "subtract from 2^n" conversion approach.
• Formula:

  fv(x) = x,           x > 0
  fv(x) = 2^n − |x|,   x < 0

• If fv(x) > 2^n in an arithmetic operation, subtract 2^n from the result to get the final face value.
  o Represents the carry-out.

Examples:

• n = 4, x = 6:

  0110_2 = 0×(−2^3) + 1×2^2 + 1×2^1 + 0×2^0 = 6
  fv(6) = 6

• n = 4, x = −6:

  1010_2 = 1×(−2^3) + 0×2^2 + 1×2^1 + 0×2^0 = −8 + 2 = −6
  fv(−6) = 2^4 − |−6| = 16 − 6 = 10


Range/Weight System:

• Integer:

  N = b_{n−1}·(−2^{n−1}) + b_{n−2}·2^{n−2} + … + b_1·2^1 + b_0·2^0, n = number of bits

  o MSB weight: −2^{n−1}
  o LSB weight: 2^0
  o Range:

    −2^{n−1} ≤ N ≤ 2^{n−1} − 1

  o Example:

    1011_2 = −2^3 + 2^1 + 2^0 = −8 + 2 + 1 = −5

• Fraction*:

  N = b_{n−1}·(−2^{n−1}) + b_{n−2}·2^{n−2} + … + b_0·2^0 + b_{−1}·2^{−1} + … + b_{−m}·2^{−m}

  n = number of integer bits, m = number of fractional bits

  o MSB weight: −2^{n−1}
  o LSB weight: 2^{−m}


Fraction Conversion Algorithm

1. Shift the bits left until no fractional part remains (multiply by 2^x).
2. Take the 2's complement of the resulting integer.
3. Shift the bits back (divide by 2^x).

Example:


Sign Extension*

• MSB: sign extend.
• LSB: zero extend.


VII: Floating-Point Representation

Floating-Point vs. Fixed-Point

• Fixed point:
  o Has high range or high precision, but not both.
• Floating point:
  o Has a higher range than fixed point.
  o Has high, variable precision.
    ! High precision for small values.
    ! Low precision for large values (not that we care).


An Illustrative Example:

• Two problems show up with this crude representation:
  o Precision is variable (and not very good at representing really small numbers, due to only supporting positive exponents).
  o There are multiple redundant representations of the same number (due to the leading implied 0 in the mantissa).
    ! A non-normalized format.


A Better Floating-Point Format

• Dealing with duplication:
  o The leading bit of the mantissa is always 1 anyway, right? Might as well make it implicit so we can have another bit for more precision and range (as well as get rid of that pesky duplication problem).
  o This makes it a normalized format.
• Dealing with precision:
  o Having only positive values of E is awful. A signed value of E is much better for precision, and lets us represent smaller exponents.
• Support for negative numbers should be added as well:

  N = (−1)^S × 1.M × 2^E

• A good floating-point format should put the exponent field first, since it has greater weight than the mantissa.
  o This makes it possible to reuse integer comparison hardware, since the value with the bigger face value is the larger number in this scheme.


Why Blindly Using Signed Numbers For The Exponent Field Is A Bad Idea

• While you can use vanilla signed integers to represent the exponent, doing so is a bad idea.
  o It makes floating-point comparison much harder, since you can't just assume that the number with the bigger face value is the larger number.
• Instead, the preferred solution is to use a biased unsigned integer.
  o This allows simple integer comparisons of the exponent field, letting us reuse our existing integer hardware.
• The actual exponent is calculated as exp = E − b, where:
  o E = face value of the exponent field
  o b = bias = 2^{e−1} − 1
  o e = bit length of E

  N = (−1)^S · 1.M × 2^{E−b}

  E    2's complement   2's complement + 3 (biased)
  4    0100             111
  3    011              110
  2    010              101
  1    001              100
  0    000              011
  −1   111              010
  −2   110              001
  −3   101              000
  −4   100              N/A


General Floating-Point Format

• The general formula for the decimal value of a floating-point format is:

  N = (−1)^S × 1.M × 2^{E−b}

  S = sign
  M = mantissa
  E = exponent (biased)
  b = bias = 2^{e−1} − 1
  e = bit length of the exponent field


VIII: IEEE-754 Format

• Comes in two flavors:
  o Single precision (32-bit)
    ! C data type: float
  o Double precision (64-bit)
    ! C data type: double

Single-Precision Format

  N = (−1)^S × 1.M × 2^{E−b} = (−1)^S × 1.M × 2^{E−127}

  S = sign = 1 bit
  M = mantissa = 23 bits
  E = exponent (face value) = 8 bits
  b = bias = 2^{e−1} − 1 = 2^7 − 1 = 128 − 1 = 127
  e = bit length of exponent field = 8 bits

Single-Precision Range:

• The maximum (255) and minimum (0) values of E are reserved for infinity and zero/denormalized numbers, respectively.
• Actual range of the exponent:

  −126 ≤ E − 127 ≤ 127
  1 ≤ E ≤ 254


• Range of normalized numbers:

  1.000 × 2^{−126} ≤ |N| ≤ 1.11…1 × 2^{127} ≈ 1.000 × 2^{128}

Example:

  N = −13.625_10 = −1101.101_2 = −1.101101_2 × 2^3 = −1.101101 × 2^{130−127}

  S = 1
  E = 130_10 = 10000010_2
  M = 10110100…0_2
  N = 0xC15A0000
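The encoding can be double-checked with Python's standard struct module (a minimal sketch):

```python
import struct

# Reinterpret the 32-bit single-precision pattern of -13.625 as an unsigned int.
bits = struct.unpack(">I", struct.pack(">f", -13.625))[0]
print(hex(bits))                      # 0xc15a0000, matching the example above

sign = bits >> 31                     # 1 -> negative
exponent = (bits >> 23) & 0xFF        # 130 (biased); 130 - 127 = 3
mantissa = bits & 0x7FFFFF            # fraction field (implied leading 1)
print(sign, exponent, bin(mantissa))  # 1 130 0b10110100000000000000000
```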


Single-Precision Exceptional Values

• Certain IEEE-754 encodings are reserved for special cases:

  E    M    Value
  255  ≠0   Not a Number (NaN)
  255  0    ±∞
  0    ≠0   Denormalized (see next section)
  0    0    ±0

• If an arithmetic operation between two numbers produces an exponent larger or smaller than the normalized range, the following rules hold:
  o If the unbiased exponent of the result is less than −126 after renormalizing the result, return a denormalized number (see next section).
  o If the unbiased exponent of the result is greater than 127 after renormalizing the result, return ±∞.
• If x ≠ ∞ and x > 0, then the following operations hold:

  x + ∞ = ∞
  ∞ − x = ∞
  x · ∞ = ∞
  x ÷ 0 = ∞
  0 · ∞ = NaN
  ∞ − ∞ = NaN
  0 ÷ 0 = NaN


Single-Precision Denormalized Values (E = 0, M ≠ 0)

• Special case used to further increase precision for extremely small values:

  N = (−1)^S × 0.M × 2^{−126}
  0.0…01 × 2^{−126} ≤ |N| ≤ 0.111… × 2^{−126}

Example:

  a = 1.01 × 2^{−129}

• a is outside of the normalized range (E − 127 < −126), and must be represented as a denormalized number (i.e., as a multiple of 2^{−126} with a leading 0 instead of a leading 1):

  a = 1.01 × 2^{−129} = 0.00101 × 2^{−126}

  S = 0
  E = 0
  M = 00101 0…
  a = 0x00140000


IX: Unsigned Adders

Ripple-Carry Adder

• Direct implementation of the vanilla binary addition algorithm.
  o Agonizingly slow, but cheap to implement in terms of hardware cost.

Full Adder Logic Functions:

  C_out = AB + A·C_in + B·C_in
  S = A ⊕ B ⊕ C_in

Worst-Case Delay:

  T_RCA = n · T_FA = n · 2d = 2nd

  T_FA = full adder delay = 2d
  n = number of bits

• Corresponds to the maximum number of times that a carry signal needs to propagate through the RCA chain.
• Has O(n) delay.
• Example: d = 2 ns, n = 32:

  T_32-bit RCA = 2nd = 2 · 32 · 2 ns = 128 ns
  f_max = 1/T_32-bit RCA = 1/(128 ns) ≈ 7.8 MHz


Determining the Actual Delay:

• If a 1-1 bit pair is found, generate a carry-out signal.
  o Introduces 2d delay.
• Propagate the carry signal until it hits the next 1-1 or 0-0 pair.
  o Introduces 2d delay for each bit pair traversed by the carry chain.
• The sum is finalized at a 1-1 or 0-0 pair.
  o 2d delay to finalize the sum.
• If a 1-0 or 0-1 pair is found and there is no prior carry-in, introduce 2d delay to complete the sum.

[Worked RCA carry-chain example figure not reproduced]


Carry-Lookahead Adder (CLA)

• Fast adder hardware in which all carry signals are generated in parallel based on the state of the input bits.

Logic Equations

• Consider the Boolean equation for generating the next carry-out signal:

  C_{i+1} = A_iB_i + A_iC_i + B_iC_i
          = A_iB_i + (A_i + B_i)C_i
          = g_i + p_iC_i

• Generate (g_i): indicates that a bit pair will generate a carry-out on its own:

  g_i = A_iB_i

• Propagate (p_i): indicates that a bit pair will generate a carry-out signal given a carry-in of 1:

  p_i = A_i + B_i


• Carries (C_i): set to 1 only if the bit in question is a 1-1 pair or there's a carry-propagate chain leading to that bit:

  C_1 = g_0 + p_0C_0
  C_2 = g_1 + p_1C_1 = g_1 + p_1g_0 + p_1p_0C_0
  C_3 = g_2 + p_2C_2 = g_2 + p_2g_1 + p_2p_1g_0 + p_2p_1p_0C_0
  C_4 = g_3 + p_3C_3 = g_3 + p_3g_2 + p_3p_2g_1 + p_3p_2p_1g_0 + p_3p_2p_1p_0C_0
  …
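A minimal Python sketch of the lookahead carries (in hardware each C_i is one flattened two-level expression; the loop here just evaluates the same recurrence):

```python
def cla_carries(a_bits, b_bits, c0=0):
    """Return C1..Cn for bit lists given LSB-first, via generate/propagate."""
    g = [ai & bi for ai, bi in zip(a_bits, b_bits)]  # g_i = A_i B_i
    p = [ai | bi for ai, bi in zip(a_bits, b_bits)]  # p_i = A_i + B_i
    carries, c = [], c0
    for gi, pi in zip(g, p):
        c = gi | (pi & c)      # C_{i+1} = g_i + p_i C_i
        carries.append(c)
    return carries

print(cla_carries([1, 0, 0, 1], [1, 1, 1, 0]))  # [1, 1, 1, 1] for 9 + 7
```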

Worst-Case Delay

• Ideal (no fan-in limit) case:

  T_CLA = T_pg + T_CLL + T_sum = d + 2d + 2d = 5d

  T_pg = delay of the propagate/generate signal generator (AND/OR gate delay) = 1d
  T_CLL = delay of the carry-lookahead logic (CLL) = d_AND + d_OR = d + d = 2d
  T_sum = delay of the sum generator circuits (XOR gate delay) = 2d

  o Assumes that gates can be as wide as necessary (infinite fan-in).

• Non-ideal (limited fan-in) case:
  o In practice, fan-in places limits on the performance of the CLA.
    ! The worst-case delay is determined by the size of the longest carry generation hardware.
  o Realistic logic gates only support a maximum number of inputs, which is defined by the fan-in f. To implement logic gates with larger numbers of inputs, additional levels of logic gates are needed.
    ! The maximum number of logic inputs N that can be supported with a fan-in f and n layers of logic gates is given by:

      N = f^n

    ! This relation can be rearranged to obtain the number of layers of gates needed to support N input signals given a fan-in f:

      n = log_f N


  o The above expression for n leads to the following equation for the non-ideal gate delay of an n-bit CLA given a fan-in of f:

    T_CLA = T_pg + T_CLL + T_sum
          = D_pg + ⌈log_f(n + 1)⌉·D_AND + ⌈log_f(n + 1)⌉·D_OR + D_sum
          = D_pg + ⌈log_f(n + 1)⌉·d + ⌈log_f(n + 1)⌉·d + D_sum
          = d + 2⌈log_f(n + 1)⌉·d + 2d

    f = gate fan-in

    ! The delay is O(log n).
  o The minimum amount of fan-in needed to implement an ideal n-bit CLA with a CLL gate delay of 2d is given by:

    f = n + 1, n = size of the operands in bits

    ! Based on the fact that the maximum fan-in is driven by the case where a carry-in from C_0 propagates through all n bit pairs to the last carry-out signal.
  o Determining the number of gate levels needed, the intuitive way:
    ! Take the fan-in f:
      • If f is greater than n, it's 2-level logic.
      • Otherwise, multiply by f again to get the maximum number of inputs that can be driven.
      • If it's still not enough, keep multiplying by f until it's large enough to accommodate all possible inputs.
    ! The number of times you multiply by f determines the number of logic levels you need.


Examples:

• f = 6, n = 32:

  T_CLA(n = 32) = d + 2⌈log_6(32 + 1)⌉·d + 2d = d + 2(2)d + 2d = 7d

• f = 5, n = 32:

  T_CLA(n = 32) = d + 2⌈log_5(32 + 1)⌉·d + 2d = d + 2(3)d + 2d = 9d


Block CLA (BCLA)

• Modularized CLA design that accounts for the fan-in limits of real logic gates.
  o Like the CSA, it divides the adder up into blocks.
  o Each block has its own CLL logic.
  o Slower than the CLA due to carry-propagation delay between CLA blocks, but less expensive to implement in terms of gates used.
• Determining the actual delay*:
  o Fixed 1d delay for propagate/generate signal generation.
  o 2d delay for initial carry-out generation.
  o Additional 2d delay for each carry finalization signal (due to carry-in from another block).
  o 2d delay for sum generation.
• Worst-case delay:

  T_BCLA = T_pg + b·T_CLL + T_sum = d + b(2d) + 2d

  b = number of blocks

[Worked BCLA carry-propagation example figure not reproduced]


X: Signed Adders

2's Complement Adder

• Same as the unsigned adder, except the carry-out is discarded.

Cases*:

1. A > 0, B > 0:

  fv(A) + fv(B) = A + B

2. A < 0, B < 0:

  fv(A) + fv(B) = (2^n − |A|) + (2^n − |B|)
                = 2^n + 2^n − (|A| + |B|)
                = 2^n − (|A| + |B|)  (disregarding the carry-out)

3. A > 0, B < 0:

  fv(A) + fv(B) = A + 2^n − |B| = 2^n + (A − |B|)
                = A − |B| (disregarding the carry-out),  A ≥ |B|
                = 2^n − (|B| − A),                       A < |B|

Worst-case delay*:

  T_2C = 2nd

• Identical to the unsigned RCA.


Overflow detection*:

• Unsigned:

  C_n = 1

• Signed (2's complement): overflow occurs when both operands have the same sign but the sum has the opposite sign:

  V = Ā_{n−1}·B̄_{n−1}·S_{n−1} + A_{n−1}·B_{n−1}·S̄_{n−1}

1's Complement Adder

• Uses an unsigned adder with the carry-out connected to the carry-in (end-around carry).

Cases*:

1. A > 0, B > 0:

  fv(A) + fv(B) = A + B

2. A < 0, B < 0:

  fv(A) + fv(B) = (2^n − |A| − 1) + (2^n − |B| − 1)
                = 2^n + 2^n − (|A| + |B|) − 1 − 1
                = 2^n − (|A| + |B|) − 1 − 1 + 1  (with end-around carry)
                = 2^n − (|A| + |B|) − 1


3. A > 0, B < 0:

  fv(A) + fv(B) = A + 2^n − |B| − 1 = 2^n + (A − |B|) − 1
                = A − |B| (with end-around carry),  A ≥ |B|
                = 2^n − (|B| − A) − 1,              A < |B|

Worst-case delay*:

  T_1C = (n + 1) · 2d

Determining the actual delay:

• Same as the RCA algorithm, but with an additional wrap-around traversal for the end-around carry if a carry-out is generated.


2's Complement Signed Adder/Subtracter

• Uses an add/subtract toggle signal to select the operation, in conjunction with an array of XOR gates.
• The A/S signal feeds into the XOR gates and the carry-in signal.
  o The XOR gates selectively complement B depending on whether the operation is addition (0) or subtraction (1).
  o The additional A/S signal feeding into C_0 accommodates the "add 1" part of forming −B.
• A word-level XOR gate is needed.
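A minimal Python sketch of that datapath (bit width and names are mine):

```python
N = 8
MASK = (1 << N) - 1

def add_sub(a: int, b: int, a_s: int) -> int:
    """a_s = 0: a + b; a_s = 1: a - b, computed as a + ~b + 1, modulo 2^N."""
    b_xor = (b ^ (MASK if a_s else 0)) & MASK  # XOR array complements B when subtracting
    return (a + b_xor + a_s) & MASK            # A/S into C0 supplies the "+1"

print(add_sub(9, 7, 0))  # 16
print(add_sub(9, 7, 1))  # 2
```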


XI: Unsigned Multipliers

Basic Unsigned Algorithm

• Same as the normal grade-school algorithm, but with ones and zeroes.
	o For each cycle, each bit of the multiplier is ANDed with each bit of the multiplicand
• Requires 2n bits for the product:

	0 ≤ A ≤ 2^n - 1
	0 ≤ B ≤ 2^n - 1
	0 ≤ A × B ≤ (2^n - 1)^2 = 2^2n - 2^(n+1) + 1
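For example (my arithmetic): with n = 4, the largest possible product is 15 × 15 = 225 = 2^8 - 2^5 + 1, which fits in 2n = 8 bits (225 ≤ 255) but not in 7 (225 > 127).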


Add-and-Shift Multiplier

• The multiplicand A is stored in an n-bit register.
• A 2-to-1 n-bit mux is used to select between A and 0.
• A shift register is initialized to the value of the multiplier B.
	o The current LSB of B is used to select from the mux.
• The output of the mux is passed to an n-bit adder.
	o The n-bit adder takes as input the output of the mux and the value of a second n-bit shift register that is initialized to 0
• The output of the adder feeds into said second n-bit shift register + an additional 1-bit shift register (for the carry)
• Delay (assuming an RCA is used as the adder):

	D_AS = n(D_mux + D_adder + D_shift) = n(2d + 2nd + 2d)

	o O(n^2) delay.

	b_i	Operation
	0  	+0
	1  	+A
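A behavioral sketch of the loop in Python (my own; the registers become plain integers, with the multiplier held in the low half of a combined product register):

    def add_shift_multiply(a: int, b: int, n: int) -> int:
        """Unsigned add-and-shift multiplier: each cycle, the LSB of B selects
        A or 0 through the mux, the adder adds it into the upper half of the
        product, and the product/multiplier pair shifts right one bit."""
        product = b                        # low half holds the multiplier B
        for _ in range(n):
            if product & 1:                # LSB of B selects the mux input
                product += a << n          # add A into the upper n bits (+ carry)
            product >>= 1                  # shift product and multiplier right
        return product                     # 2n-bit result

    print(add_shift_multiply(13, 11, 4))   # 143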


XII: Signed Multipliers and the Booth Algorithm

Intuitive Explanation

• One does not simply apply the unsigned multiplication algorithm directly to signed integers.

• Instead, one good way to think about it is to interpret the multiplier in terms of the number that it currently represents for a given cycle of the add-and-shift multiplier algorithm.

o For unsigned multiplication, this is pretty straightforward, as the weighting remains consistent as more bits from the MSB are pulled in.

o For signed multiplication, the interpretation of the number weighting changes dramatically as the weighting for each bit position changes from positive to negative.

• As a result, the value of the intermediate result changes as the intermediate value of the multiplier changes from positive to negative depending on the next set of bits pulled in.


• In the worked example (a figure not reproduced here), notice how the interpretation of the intermediate value of the multiplier B changes as new bits are pulled in during the multiplication process.

• In more general terms, the following changes occur as new bits are pulled into the MSB and the interpretation changes:

o If the new MSB is the same as the old MSB, this is interpreted as a sign extension of the existing value of the multiplier B, and nothing is added to the intermediate value of the multiplier.


o If the new MSB is different from the old MSB:
	! 0-to-1 transition: The multiplier B is now negative; subtract A (-A)
	! 1-to-0 transition: The multiplier B is now positive; add A (+A)


The Binary Recoding Algorithm*

• More formally, this process is explained in terms of a recoding scheme in which binary numbers are re-interpreted in a three-digit scheme consisting of 1, 0, and 1̄ (the 1̄ indicates a value of -1).
• This recoded number scheme is based on complements of long chains of ones.
	o Any sequence of 1s can be converted to the difference of two binary numbers.
		! Converts a chain of 1s to a -1 at the first digit and a +1 at the digit after the last 1 in the original chain.
	o Works with both signed (2's complement) and unsigned integers.
• Provides the theoretical underpinning behind the Booth Algorithm.

Recoding Algorithm:

1. Pre-processing:
	a. Leading 0 addition:
		i. For unsigned numbers, add a leading 0 before recoding.
		ii. For signed numbers, do not add the leading 0.
	b. For all numbers, add a "ghost 0" after the LSB.
2. If a 1 is found, keep going left until:
	a. A 0 is hit. For that group of digits:
		i. Replace the leading 0 with a 1.
		ii. Replace the middle 1s (if any) with 0s.
		iii. Replace the last 1 with a 1̄.
	b. The most significant bit is hit, and that bit is a 1. For that group of 1s:
		i. Set the leading bits to 0.
		ii. Set the last 1 to 1̄.
3. If more than one 0 is found in a row, keep copying 0s until the next 1 is hit.

Example 1: 01111 => 1 0 0 0 1̄

	2^3 + 2^2 + 2^1 + 2^0 => 2^4 + (-2^0)

Example 2: Unsigned Recoding

	Weights:  2^12 2^11 2^10 2^9 2^8 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0 | Ghost 0
	{0,1}:     0    1    0    1   1   0   0   1   1   0   1   0   1  |   0
	{1,0,1̄}:   1    1̄    1    0   1̄   0   1   0   1̄   1   1̄   1   1̄

Example 3: Signed Recoding

	Weights:  -2^9 2^8 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0 | Ghost 0
	{0,1}:      1   1   0   1   0   0   0   1   1   1  |   0
	{1,0,1̄}:    0   1̄   1   1̄   0   0   1   0   0   1̄
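The per-digit rule r_i = b_{i-1} - b_i (with the ghost 0 at b_{-1}) reproduces the rows above; a small Python sketch of it (my own, not from the notes):

    def booth_recode(bits):
        """Recode a binary digit list (MSB first) into {1, 0, -1} digits.
        For unsigned input, prepend the leading 0 before calling."""
        b = bits + [0]                       # append the ghost 0 after the LSB
        return [b[i + 1] - b[i] for i in range(len(bits))]

    # Example 1: 01111 -> [1, 0, 0, 0, -1], i.e. 2^4 - 2^0 = 15
    print(booth_recode([0, 1, 1, 1, 1]))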


Booth Algorithm

• Modification of the shift-and-add algorithm that works with both unsigned and signed (2's complement) numbers.
	o Based on the recoding algorithm described earlier

Algorithm:

1. If the number is unsigned, add a "ghost 0" above the MSB (a leading 0).
2. Add a "ghost 0" after the LSB, at b_{-1}.
3. For each bit, compare b_i and b_{i-1} to determine the addend and the operation to perform:
	a. Operation:
		b_i = 1: subtract
		b_i = 0: add
	b. Addend:
		b_i ⊕ b_{i-1} = 0: +0
		b_i ⊕ b_{i-1} = 1: ±A
4. Add/subtract the addend to/from the results register using 2's complement arithmetic. Discard the carry-out signal.
5. Shift the multiplicand at the end of each evaluation, sign-extending it by one bit each time. Repeat until all bits have been exhausted.

Hardware Implementation

• Requires an n-bit 2's complement adder/subtracter
• An additional 1-bit buffer is needed to determine the next operation based on both the current and previous bits.
• Delay:

	D_Booth = n(D_mux + D_adder + D_shift) = n(2d + 2nd + 2d)

	o Also O(n^2)
	o D_mux includes the delay of the XOR gate and the mux itself.

	{0,1}: b_i b_{i-1}	{1,0,1̄}	Action
	0 0               	0      	+0
	0 1               	1      	+A
	1 0               	1̄      	-A
	1 1               	0      	+0
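A behavioral Python sketch of the algorithm (my own code, not from the notes); it applies the table above, accumulating ±A·2^i mod 2^2n and discarding carry-outs:

    def booth_multiply(a: int, b: int, n: int) -> int:
        """Multiply two n-bit 2's-complement integers with the Booth algorithm."""
        mask = (1 << (2 * n)) - 1        # the product register is 2n bits wide
        multiplicand = a & mask          # A, sign-extended to 2n bits
        product = 0
        prev = 0                         # the "ghost 0" at b_{-1}
        for i in range(n):
            bit = (b >> i) & 1
            if (bit, prev) == (1, 0):    # 0-to-1 transition: subtract A
                product = (product - (multiplicand << i)) & mask
            elif (bit, prev) == (0, 1):  # 1-to-0 transition: add A
                product = (product + (multiplicand << i)) & mask
            prev = bit                   # 1-bit buffer holding b_i for next step
        if product >= 1 << (2 * n - 1):  # reinterpret the 2n bits as signed
            product -= 1 << (2 * n)
        return product

    print(booth_multiply(3, -4, 4))      # -12
    print(booth_multiply(-5, -7, 4))     # 35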


Formal Proof

• For two n-bit unsigned integers A and B, the binary multiplication operation can be represented by the following expansion:

	A × B = A × (2^(n-1) b_{n-1} + 2^(n-2) b_{n-2} + ⋯ + 2^1 b_1 + 2^0 b_0)

• Similarly, if A and B are n-bit 2's complement integers, the expansion of the operation can also be represented as:

	A × B = A × (-2^(n-1) b_{n-1} + 2^(n-2) b_{n-2} + ⋯ + 2^1 b_1 + 2^0 b_0)

	o Notice that, due to the 2's complement weight system, the most significant bit has a negative weight.

• The above expression can then be generalized into a finite series:

	A × B = A × [-2^(n-1) b_{n-1} + (2^(n-1) - 2^(n-2)) b_{n-2} + (2^(n-2) - 2^(n-3)) b_{n-3} + ⋯ + (2^2 - 2^1) b_1 + (2^1 - 2^0) b_0]
	      = A × [(-b_{n-1} + b_{n-2}) 2^(n-1) + (-b_{n-2} + b_{n-3}) 2^(n-2) + ⋯ + (-b_1 + b_0) 2^1 + (-b_0 + 0) 2^0]
	      = Σ_{i=0}^{n-1} A ∙ (-b_i + b_{i-1}) ∙ 2^i,   with b_{-1} = 0

• In general, for any term in the above finite series, the value of A ∙ (-b_i + b_{i-1}) is given by the following table:

	b_i	b_{i-1}	A ∙ (-b_i + b_{i-1})
	0  	0      	0
	0  	1      	A
	1  	0      	-A
	1  	1      	0

	o This corresponds neatly to the table of operations for the Booth Algorithm.
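A quick numeric check of the final series (my own sketch):

    def booth_series(a: int, b: int, n: int) -> int:
        """Evaluate the sum over i of A * (-b_i + b_{i-1}) * 2^i for an
        n-bit 2's-complement multiplier B, with b_{-1} = 0."""
        total = 0
        prev = 0                                 # b_{-1}
        for i in range(n):
            bi = (b >> i) & 1
            total += a * (-bi + prev) * (1 << i)
            prev = bi
        return total

    print(booth_series(7, -3, 4), 7 * -3)        # -21 -21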


Examples: (the worked Booth multiplication examples in the original were figures and did not survive text extraction)

One-Step Booth Algorithm

• Note that each of the intermediate terms in the Booth Algorithm is sign-extended.

XIII: Combinational Array Multiplier (CAM)

• Maps the add-and-shift multiplication algorithm directly to an n x (n-1) array of full adders, where n is the bit size.
	o Pre-generates the intermediate multiplication values M_i using an n x n array of AND gates, then feeds the M_i data into a chain of full adders
	o All addition operations are carried out in parallel.
• Number of AND gates needed:

	n ∙ n = n^2

	o n rows of n AND gates.


• Number of full adders needed:

	2 ∙ [n(n - 1)/2] = n(n - 1)

	o n rows of (n-1) full adders.

• Worst-case delay:

	D_CAM = D_AND + D_carry-prop-chain
	      = D_AND + D_FA ∙ (number of full adders on the critical path)
	      = d + 2d ∙ 2(n - 1)

	o Has O(n) delay.
	o The critical path must travel through two (n-1)-sized sets of full adders (2(n-1) ∙ 2d delay).
		! In an n-bit CAM, the array can be divided into two (n-1) by (n-1) triangles.
		! The critical path runs along the carry propagation chain starting from the full adder in the upper right corner and propagating to the last full adder in the bottom left corner.
	o An additional d of delay is incurred due to generating the M_i values (AND operation).
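For example (my arithmetic, from the formulas above): a 32-bit CAM needs 32^2 = 1024 AND gates and 32(31) = 992 full adders, with a worst-case delay of D_CAM = d + 2d ∙ 2(31) = 125d, linear in n, whereas the sequential add-and-shift multiplier's n(2d + 2nd + 2d) is quadratic.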


XIV: Instruction Set Architecture (ISA)

• Instruction Set Architecture: The part of a computer that’s visible to the programmer or compiler writer; the instructions available for a given processor.

What to Consider in an Instruction Set

1. Operations to perform
	a. Addition, subtraction, multiplication, division, square root, move, load/store, etc.
2. Operand storage
	a. Store it in external memory, or within the CPU?
3. Number of explicit operands
4. Addressing modes
	a. How to specify the operand location.
5. Type, size, etc. of operands
	a. BCD, byte-sized integers, IEEE-754 single-precision, etc.

RISC vs. CISC

• CISC Instruction Sets:
	o Have lots of complex instructions.
		! Instructions can take more than one clock cycle to complete.
		! Clock cycles may be longer because individual instructions take longer to complete.
	o Generally produce shorter code than RISC processors for the same program.
• RISC Instruction Sets:
	o Have fewer, simpler instructions.
		! Instructions typically take only one clock cycle to complete.
		! Clock cycles are generally shorter due to simpler instructions.
	o Generally produce longer code than CISC processors for the same program.

Examples of operand conventions:

	add               // all operands are implicit
	add r1, r2, r3    // all operands are explicitly defined
	add 4(r5)         // register-indirect memory addressing


Example: A Hypothetical CISC Instruction

	add 4(r1), 0x40abfb06, (r2,r6)

• This instruction would:
	o Take a long time to execute (due to accessing memory from multiple places, adding the data, and storing it to another memory location).
	o Take up more than one instruction word (due to the explicit memory address in the instruction), leading to an instruction set of non-uniform size.

1. Stack Machine
2. Accumulator Machine
3. General Purpose Register (GPR) Machine

Stack Machine

• Processor architecture in which specially allocated memory is organized as a stack, a last-in-first-out (LIFO) linear data structure.

o Stacks are last-in-first-out linear data structures that are analogous to piles of plates; data can either be pushed onto the top of the stack or popped off of the stack.

o Also known as a 0-address machine, since ALU operations do not have any explicit operands. Instead, operations act on the top two values in the stack.

o Operands at the top of the stack are popped during arithmetic operations, and the results are pushed back on top of the stack.

[Textbook excerpt (Computer Systems Design and Architecture, Fig. 2.6): the 0-address (stack) machine. Operands are pushed onto the stack from memory, ALU operations implicitly operate on the top members of the stack, and the result replaces them. Op3 = Op1 + Op2 compiles to: push Op1; push Op2; add; pop Op3. Push and pop still require a memory address (1 opcode byte + 3 address bytes = 4 bytes each), plus 1 byte for the add, for a total of 4 × 3 + 1 = 13 bytes. The drawback of a 0-address computer is that operands must always be in the top two stack locations, so extra instructions may be needed to get them there.]


• Stack machines have no internal data registers; all stack data is allocated in main memory.
	o Data near the bottom of the stack is allocated at the highest memory address, while data near the top of the stack is allocated at lower memory addresses.
• A special register inside of the CPU, known as the stack pointer (SP), keeps track of the memory location of the top of the stack.
	o SP is incremented/decremented by the size of the stack operands in bytes.
• Problems with stack machines:
	o No internal CPU data registers
		! Have to make numerous slow accesses to main memory
	o Data is retained only briefly, and is destroyed with each ALU operation
		! Intermediate values have to be recalculated

RTL Code For Some Common Stack Machine Instructions

• push x:

	sp    ← sp - 4             Move SP up the stack
	M[sp] ← M[x]               Push the data at x onto the top of the stack

• pop y:

	M[y]  ← M[sp]              Copy data from the top of the stack to a memory location
	sp    ← sp + 4             Move SP down the stack

• add:

	sp    ← sp + 4             Move SP down the stack
	M[sp] ← M[sp] + M[sp-4]    Add the two topmost values of the stack

• sub:

	sp    ← sp + 4             Move SP down the stack
	M[sp] ← M[sp] - M[sp-4]    Subtract the two topmost values of the stack

• mul:

	sp    ← sp + 4             Move SP down the stack
	M[sp] ← M[sp] × M[sp-4]    Multiply the two topmost values of the stack


• div:

	sp    ← sp + 4             Move SP down the stack
	M[sp] ← M[sp] ÷ M[sp-4]    Divide the two topmost values of the stack
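A toy Python interpreter for the push/pop/add/mul RTL above (my own sketch: `mem` and `stack` are dicts standing in for main memory, and the stack grows downward):

    mem = {"a": 3, "b": 4, "c": 5, "d": 2}    # named locations for operands
    stack = {}                                 # stack region of memory
    sp = 100                                   # stack pointer (byte address)

    def push(x):
        global sp
        sp -= 4                                # sp    <- sp - 4
        stack[sp] = mem[x]                     # M[sp] <- M[x]

    def pop(y):
        global sp
        mem[y] = stack[sp]                     # M[y]  <- M[sp]
        sp += 4                                # sp    <- sp + 4

    def add():
        global sp
        sp += 4                                # sp    <- sp + 4
        stack[sp] = stack[sp] + stack[sp - 4]  # M[sp] <- M[sp] + M[sp-4]

    def mul():
        global sp
        sp += 4
        stack[sp] = stack[sp] * stack[sp - 4]

    # x = (a + b) * (c + d), as in the example program below
    push("a"); push("b"); add()
    push("c"); push("d"); add()
    mul(); pop("x")
    print(mem["x"])   # 49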

Example Assembly Program

• C code (infix notation):

	x = (a + b) × (c + d);

• Reverse Polish Notation:
	o Postfix notation: operations are listed in the order that they would be performed in a stack machine.

	a b + c d + ×


• Equivalent Assembly Program:

	Instruction	Reads	Writes	Total memory accesses
	push a     	1    	1     	2
	push b     	1    	1     	2
	add        	2    	1     	3
	push c     	1    	1     	2
	push d     	1    	1     	2
	add        	2    	1     	3
	mult       	2    	1     	3
	pop x      	1    	1     	2
	Total      	     	      	19


Accumulator Machine

• Processor architecture in which a special-purpose register called an accumulator serves as the source of one of the operands as well as the destination for arithmetic results, allowing it to accumulate data.
	o Also known as a 1-address machine, since there is only one explicitly specified operand in ALU operations. The other operand is implicitly the accumulator.
• Popular in early mainframes and microcomputers, due to allowing for small programs and small CPU memory requirements.
• In accumulator machines:
	o Fewer memory accesses are made overall than in stack machines
	o Temporary variables still can't be easily held in the machine
		! Operands must be reloaded from memory into the CPU each time they're accessed
		! Intermediate results must be recalculated

[Textbook excerpt (Computer Systems Design and Architecture, Figs. 2.4-2.5): instruction formats. The 2-address machine: add Op2, Op1 computes Op2 ← Op2 + Op1, with an 8-bit opcode and two 24-bit addresses that say where to find the operands and where to put the result. The 1-address machine: add Op1Addr computes Acc ← Acc + Op1, with an 8-bit opcode and one 24-bit operand address.]


RTL Code For Some Common Accumulator Machine Instructions

• load x:

	Acc  ← M[x]

• store y:

	M[y] ← Acc

• add z:

	Acc  ← Acc + M[z]

Example Assembly Program

	x = (a + b) × (c + d)

	Instruction	Reads	Writes	Total memory accesses
	load a     	1    	0     	1
	add b      	1    	0     	1
	store x    	0    	1     	1
	load c     	1    	0     	1
	add d      	1    	0     	1
	mult x     	1    	0     	1
	store x    	0    	1     	1
	Total      	     	      	7


General Purpose Register (GPR) Machine

• Processor architecture that uses a set of numbered registers that have few restrictions on their use.
	o The dominant processor architecture in modern computer architecture.
• General purpose registers vs. special purpose registers:
	o General Purpose Registers (GPR): Registers on a computer that can be used for almost any purpose, with few restrictions on their use.
	o Special Purpose Registers (SPR): Registers on a computer that are set aside for specific purposes and have several restrictions on their use.
		! Examples: stack pointer, program counter, condition code register (CCR), etc.
• GPR machines:
	o Make even fewer memory accesses than stack machines or accumulator machines.
	o Make it easy to store temporary and intermediate values in the machine.

[Textbook excerpt (Computer Systems Design and Architecture, Fig. 2.7): the general register machine and its instruction formats, e.g. load R8, Op1 (R8 ← Op1) and add R2, R4, R6 (R2 ← R4 + R6). General register machines became popular as memory prices fell. Real machines are usually classed as load/store (register-to-register: memory access is limited to load and store, and ALU operations take operands from and write results to registers only), register-memory (one operand or the result must be an accumulator or general register), or memory-to-memory (both the operands and the result may reside in memory).]


• Example Assembly Program

	a = b + c;

RISC:

	load r1, b
	load r2, c
	add r3, r1, r2
	store a, r3

CISC:

	add a, b, c

RISC (Two-Operand):

	load r1, b
	load r2, c
	add r1, r2      // r1 ← r1 + r2
	store a, r1

The (P,Q)-GPR Naming Convention

• GPR machines are categorized in terms of:
	o Q: the maximum number of operands supported by ALU instructions

		2 ≤ Q ≤ 3

	o P: the maximum number of memory operands supported by ALU instructions

		0 ≤ P ≤ 3


Example Assembly Programs

	x = (a + b) × (c + d)

(0,3)-GPR (Load-Store/Register-Register):

	Instruction     	Memory Accesses
	load r1, a      	1
	load r2, b      	1
	add r3, r1, r2  	0
	load r4, c      	1
	load r5, d      	1
	add r6, r4, r5  	0
	mult r7, r3, r6 	0
	store x, r7     	1
	Total           	5

(1,2)-GPR (Register-Memory):

	Instruction   	Memory Accesses
	load r1, a    	1
	add r1, b     	1
	load r2, c    	1
	add r2, d     	1
	mult r1, r2   	0
	store x, r1   	1
	Total         	5

(3,3)-GPR (Memory-Memory):

	Instruction    	Memory Accesses
	add r1, a, b   	2
	add r2, c, d   	2
	mult x, r1, r2 	1
	Total          	5

• The name in parentheses refers to the Hennessy and Patterson term for that particular GPR machine.
• Why not:
	o A (2,2)-GPR machine?
		! A terrible idea: with two operands, one memory operand must double as the destination, so in-memory values would get overwritten and lost.
	o A (2,3)-GPR machine?
		! A great idea, but one that's not covered here.