alu architecture and isa extensions - ece...


Upload: nguyenkhuong

Post on 30-May-2018


ALU Architecture and ISA Extensions

Lecture notes from MKP, H. H. Lee and S. Yalamanchili

(2)

Reading

• Sections 3.2-3.5 (only those elements covered in class)

• Sections 3.6-3.8

• Appendix B.5

• Practice Problems: 26, 27

• Goal: Understand the
  - ISA view of the core microarchitecture
    - Time, space, energy
  - Organization of functional units and register files into basic data path blocks


(3)

Overview

• Instruction Set Architectures have a purpose
  - Applications dictate what we need
• We only have a fixed number of bits
  - Impact on accuracy
• More is not better
  - We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
  - Addition/subtraction, multiplication, division

(4)

Reminder: ISA

[Figure: the ISA view of a processor and its byte-addressed memory. Labels: byte addressed memory (0x00, 0x01, 0x02, 0x03, 0x1F, 0xFFFFFFFF); Memory Map (Reserved, Text Segment, Data segment (static), Dynamic Data, stack); Arithmetic Logic Unit (ALU); Processor Internal Buses; Memory Interface; Register File (Programmer Visible State); Program Counter; Instruction register; Kernel registers; Programmer Invisible State. Caption: "Who sees what?"]

(5)

Arithmetic for Computers

• Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
• Operations on floating-point real numbers
  - Representation and operations

• Let us first look at integers

(6)

Integer Addition (3.2)

• Example: 7 + 6

• Overflow if result out of range
  - Adding +ve and –ve operands: no overflow
  - Adding two +ve operands: overflow if result sign is 1
  - Adding two –ve operands: overflow if result sign is 0
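The sign rules above can be checked with a small sketch. It is only an illustration: the names `add32` and `MASK32` are made up here, and 32-bit operands are modelled as Python integers masked to 32 bits.

```python
MASK32 = 0xFFFFFFFF  # assumed helper: keeps values to 32 bits

def add32(a, b):
    """Add two 32-bit two's complement bit patterns.

    Returns (result_bits, overflow) using the slide's sign rule.
    """
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    # Mixed-sign operands can never overflow. Same-sign operands
    # overflow exactly when the result's sign differs from theirs.
    overflow = (sign_a == sign_b) and (sign_r != sign_a)
    return result, overflow

print(add32(7, 6))            # the slide's example, 7 + 6
print(add32(0x7FFFFFFF, 1))   # largest positive value plus one
```

The second call adds two positive operands and produces a result whose sign bit is 1, which is the overflow case listed above.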


(7)

Integer Subtraction

• Add negation of second operand
• Example: 7 – 6 = 7 + (–6)

  +7: 0000 0000 … 0000 0111
  –6: 1111 1111 … 1111 1010
  +1: 0000 0000 … 0000 0001

  (2’s complement representation)

• Overflow if result out of range
  - Subtracting two +ve or two –ve operands: no overflow
  - Subtracting +ve from –ve operand: overflow if result sign is 0
  - Subtracting –ve from +ve operand: overflow if result sign is 1
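A minimal sketch of subtraction as "add the negation", with the overflow rule from this slide. The function name `sub32` and the constant `MASK32` are illustrative assumptions, not names from the course material.

```python
MASK32 = 0xFFFFFFFF  # assumed helper: keeps values to 32 bits

def sub32(a, b):
    """Compute a - b on 32-bit two's complement bit patterns.

    Returns (result_bits, overflow).
    """
    a &= MASK32
    b &= MASK32
    neg_b = (~b + 1) & MASK32          # two's complement negation: invert, add 1
    result = (a + neg_b) & MASK32
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    # Same-sign operands cannot overflow on subtraction. With different
    # signs, overflow occurs when the result's sign differs from a's sign.
    overflow = (sign_a != sign_b) and (sign_r != sign_a)
    return result, overflow

print(hex((~6 + 1) & MASK32))  # bit pattern of -6
print(sub32(7, 6))             # the slide's example, 7 - 6
```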

(8)

ISA Impact

• Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    - Save PC in exception program counter (EPC) register
    - Jump to predefined handler address
    - mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
• ALU design leads to many solutions. We look at one simple example


(9)

ISA View

• Register-to-register data path
• We want this to be as fast as possible

[Figure: CPU/Core with register file ($0, $1, … $31) feeding the ALU]

(10)

Multiplication (3.3)

• Long multiplication

      1000     multiplicand
    × 1001     multiplier
    ------
      1000
     0000
    0000
   1000
   -------
   1001000     product

Length of product is the sum of operand lengths
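Long multiplication translates directly into a shift-and-add loop. The sketch below is an illustration for unsigned operands only; the function name and the `n` parameter (operand width in bits) are assumptions made for this example.

```python
def shift_add_multiply(multiplicand, multiplier, n=32):
    """Unsigned long multiplication of two n-bit operands.

    For each 1 bit in the multiplier, add a copy of the multiplicand
    shifted to that bit position. The product needs up to 2n bits.
    """
    product = 0
    for i in range(n):
        if (multiplier >> i) & 1:
            product += multiplicand << i   # partial product for bit i
    return product

# The slide's example: 1000 x 1001 with 4-bit operands.
print(bin(shift_add_multiply(0b1000, 0b1001, n=4)))
```

Each loop iteration corresponds to one row of partial products in the worked example; a hardware multiplier with multiple adders sums those rows in parallel instead of one per step.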


(11)

A Multiplier

• Uses multiple adders
  - Cost/performance tradeoff
• Can be pipelined
• Several multiplications can be performed in parallel

(12)

Division (3.4)

• Check for 0 divisor
• Long division approach
  - If divisor ≤ dividend bits: 1 bit in quotient, subtract
  - Otherwise: 0 bit in quotient, bring down next dividend bit
• Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required

            1001      quotient
  1000 ) 1001010      divisor ) dividend
        -1000
            10
            101
            1010
           -1000
              10      remainder

n-bit operands yield n-bit quotient and remainder
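Restoring division can be sketched as below for unsigned operands. This is one illustrative formulation, not necessarily the hardware organization used in the text; the function name and the `n` parameter are assumptions.

```python
def restoring_divide(dividend, divisor, n=32):
    """Unsigned restoring division of n-bit operands.

    Returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("check for 0 divisor first")
    quotient = 0
    remainder = 0
    for i in range(n - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor               # trial subtraction
        if remainder < 0:
            remainder += divisor           # restore: add divisor back
            quotient = quotient << 1       # 0 bit in quotient
        else:
            quotient = (quotient << 1) | 1 # 1 bit in quotient
    return quotient, remainder

# The slide's example: 1001010 divided by 1000.
q, r = restoring_divide(0b1001010, 0b1000, n=8)
print(bin(q), bin(r))
```

Each step depends on the sign of the previous remainder, which is why the loop cannot be unrolled into independent parallel additions the way multiplication can.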


(13)

Faster Division

• Can’t use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Still require multiple steps
• Customized implementations for high performance, e.g., supercomputers

(14)

ISA View

• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from these registers
  - mfhi, mflo
• What other instructions would you add? Cost?

[Figure: CPU/Core with register file ($0, $1, … $31), ALU, and a Multiply/Divide unit with its Hi and Lo registers]
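As a rough model of how the Hi and Lo registers are used: in MIPS, an unsigned multiply leaves the upper 32 bits of the 64-bit product in Hi and the lower 32 bits in Lo, while a divide leaves the quotient in Lo and the remainder in Hi. The class below is a hypothetical sketch of that behaviour (the class name and method layout are invented for illustration), not a simulator.

```python
MASK32 = 0xFFFFFFFF  # assumed helper: keeps values to 32 bits

class HiLoUnit:
    """Toy model of the multiply/divide unit and its Hi/Lo registers."""

    def __init__(self):
        self.hi = 0
        self.lo = 0

    def multu(self, rs, rt):
        product = (rs & MASK32) * (rt & MASK32)  # up to 64 bits
        self.hi = product >> 32
        self.lo = product & MASK32

    def divu(self, rs, rt):
        # Real hardware does not trap on a zero divisor; this sketch
        # simply raises instead of modelling that case.
        if rt & MASK32 == 0:
            raise ZeroDivisionError("divisor is zero")
        self.lo = (rs & MASK32) // (rt & MASK32)  # quotient
        self.hi = (rs & MASK32) % (rt & MASK32)   # remainder

    def mfhi(self):
        return self.hi

    def mflo(self):
        return self.lo

unit = HiLoUnit()
unit.multu(0xFFFFFFFF, 2)       # product does not fit in 32 bits
print(hex(unit.mfhi()), hex(unit.mflo()))
```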


(15)

Floating Point (3.5)

• Representation for non-integral numbers
  - Including very small and very large numbers
• Like scientific notation
  - $-2.34 \times 10^{56}$ (normalized)
  - $+0.002 \times 10^{-4}$ (not normalized)
  - $+987.02 \times 10^{9}$ (not normalized)
• In binary
  - $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
• Types float and double in C

(16)

IEEE 754 Floating-point Representation

Single Precision (32-bit)

  Bits     31       30 to 23    22 to 0
  Field    S        exponent    significand
  Width    1 bit    8 bits      23 bits

  $(-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - 127}$

Double Precision (64-bit)

  Bits     63       62 to 52    51 to 32       31 to 0
  Field    S        exponent    significand    significand (continued)
  Width    1 bit    11 bits     20 bits        32 bits

  $(-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - 1023}$
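The single-precision layout can be explored with Python's standard `struct` module. The sketch below splits a value into its three fields and re-evaluates the formula above; the function names are illustrative, and `value_of` handles normalized numbers only (no zeros, denormals, infinities, or NaNs).

```python
import struct

def decode_single(x):
    """Return (sign, exponent, fraction) fields of x encoded as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30 to 23, biased by 127
    fraction = bits & 0x7FFFFF        # bits 22 to 0
    return sign, exponent, fraction

def value_of(sign, exponent, fraction):
    """Evaluate (-1)^sign x (1 + fraction) x 2^(exponent - 127), normalized case only."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

fields = decode_single(-0.75)   # -0.75 = -1.1 (binary) x 2^-1
print(fields, value_of(*fields))
```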


(17)

Floating Point Standard

• Defined by IEEE Std 754-1985

• Developed in response to divergence of representations
  - Portability issues for scientific code

• Now almost universally adopted

• Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)

(18)

FP Adder Hardware

• Much more complex than integer adder

• Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions

• FP adder usually takes several cycles
  - Can be pipelined

Example: FP Addition


(19)

FP Adder Hardware

[Figure: FP adder datapath, annotated with Step 1 through Step 4]

(20)

FP Arithmetic Hardware

• FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder

• FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP ↔ integer conversion

• Operations usually take several cycles
  - Can be pipelined


(21)

ISA Impact

• FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA

• Separate FP registers
  - 32 single-precision: $f0, $f1, … $f31
  - Paired for double-precision: $f0/$f1, $f2/$f3, …
    - Release 2 of the MIPS ISA supports 32 × 64-bit FP registers

• FP instructions operate only on FP registers
  - Programs generally do not perform integer ops on FP data, or vice versa
  - More registers with minimal code-size impact

(22)

ISA View: The Co-Processor

• Floating point operations access a separate set of 32-bit registers
  - Pairs of 32-bit registers are used for double precision

[Figure: CPU/Core (register file $0, $1, … $31; ALU; Multiply/Divide unit with Hi and Lo), Co-Processor 1 (register file $0, $1, … $31; FP ALU), and Co-Processor 0 (BadVAddr, Status, Cause, EPC; covered later)]

(23)

Associativity

• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail

                    (x+y)+z       x+(y+z)
  x                -1.50E+38     -1.50E+38
  y                 1.50E+38      1.50E+38
  z                 1.0           1.0
  inner sum         0.00E+00      1.50E+38
  result            1.00E+00      0.00E+00

• Need to validate parallel programs under varying degrees of parallelism
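The table's values can be reproduced with a short sketch. Python floats are double precision, so the helper `f32` (a name assumed for this example) rounds each intermediate result to single precision via the standard `struct` module.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add_f32(a, b):
    """Single-precision addition, modelled by rounding the double sum."""
    return f32(f32(a) + f32(b))

x, y, z = f32(-1.5e38), f32(1.5e38), f32(1.0)

left = add_f32(add_f32(x, y), z)    # (x + y) + z
right = add_f32(x, add_f32(y, z))   # x + (y + z)
print(left, right)
```

In the second grouping, z is far smaller than the spacing between representable values near 1.5E+38, so y + z rounds back to y and the 1.0 is lost before x cancels it.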

(24)

Performance Issues

• Latency of instructions
  - Integer instructions can take a single cycle
  - Floating point instructions can take multiple cycles
  - Some (FP Divide) can take hundreds of cycles

• What about energy? (We will get to that shortly)

• What other instructions would you like in hardware?
  - Would some applications change your mind?

• How do you decide whether to add new instructions?


(25)

Domain Impact on the ISA: Example

  Scientific Computing        Embedded Systems
  • Floats                    • Integers
  • Double precision          • Lower precision
  • Massive data              • Streaming data
  • Power constrained         • Security support
                              • Energy constrained

(26)

Summary

• ISAs support operations required of application domains
  - Note the differences between embedded systems and supercomputers!
  - Signed, unsigned, FP, etc.

• Bounded precision effects
  - Software must be careful about how the hardware is used, e.g., associativity
  - Need standards to promote portability

• Avoid “kitchen sink” designs
  - There is no free lunch
  - Impact on speed and energy → we will get to this later


(27)

Study Guide

• Perform 2’s complement addition and subtraction (review)

• Add a few more instructions to the simple ALU
  - Add an XOR instruction
  - Add an instruction that returns the max of its inputs
  - Make sure all control signals are accounted for

• Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review)

• Execute the programs (class website) that use floating point numbers. Study the memory/register contents via single step execution

(28)

Study Guide (cont.)

• Write a few simple programs for
  - Multiplication/division of signed and unsigned numbers
    - Use numbers that produce >32-bit results
    - Move to/from HI and LO registers (find the instructions for doing so)
  - Addition/subtraction of floating point numbers

• Try to write a simple program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)


(29)

Glossary

• Co-processor
• Instruction set extensions
• Overflow
• Precision
• Signed arithmetic support
• Unsigned arithmetic support

Backup


(31)

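Integer ALU (arithmetic logic unit) (B.5)

• Build a 1-bit ALU, and use 32 of them (bit-slice)

[Figure: ALU block with inputs a and b, an operation control input, and a result output; accompanying table with columns op, a, b, res]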

(32)

Single Bit ALU

[Figure: 1-bit ALU. Inputs A and B; a multiplexer with inputs 0 and 1, controlled by Operation, selects Result]

Implements only AND and OR operations

(33)

Adding Functionality

• We can add additional operators (to a point)

• How about addition?

• Review full adders from digital design

  $c_{out} = a\,b + a\,c_{in} + b\,c_{in}$
  $sum = a \oplus b \oplus c_{in}$

[Figure: full adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
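The two full-adder equations map one-to-one onto bitwise operators. A minimal sketch (the function name is chosen for illustration; inputs are single bits, 0 or 1):

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, cout)."""
    s = a ^ b ^ cin                           # sum = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)    # cout = ab + a*cin + b*cin
    return s, cout

# Print the full truth table.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            print(a, b, cin, full_adder(a, b, cin))
```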

(34)

Building a 32-bit ALU

[Figure: left, a 1-bit ALU slice with inputs a, b, CarryIn, and Operation, a multiplexer with inputs 0, 1, 2 producing Result, and a CarryOut output; right, 32 slices ALU0 through ALU31 chained through CarryIn/CarryOut, with inputs a0/b0 … a31/b31, a shared Operation control, and outputs Result0 … Result31]

(35)

Subtraction (a – b)?

• Two's complement approach: just negate b and add 1.

• How do we negate?

• A clever solution:

[Figure: left, the 32-slice ripple-carry ALU (ALU0 through ALU31) with a shared Binvert control, shared Operation control, and the CarryIn of ALU0 used for subtraction (sub); right, the 1-bit slice extended with a Binvert multiplexer (inputs 0 and 1) that selects b or its inverse ahead of the Operation multiplexer (inputs 0, 1, 2)]
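The "clever solution" can be sketched bit by bit: inverting b with Binvert and setting the CarryIn of slice 0 to 1 computes a + ~b + 1, which is a – b. The function names and the word-assembly loop below are illustrative assumptions; Operation uses the multiplexer numbering from the figure (0 = AND, 1 = OR, 2 = add).

```python
def alu_slice(a, b, binvert, carry_in, operation):
    """One-bit ALU slice: returns (result, carry_out)."""
    b_eff = b ^ binvert                        # Binvert mux: b or NOT b
    and_out = a & b_eff
    or_out = a | b_eff
    sum_out = a ^ b_eff ^ carry_in             # full adder sum
    carry_out = (a & b_eff) | (a & carry_in) | (b_eff & carry_in)
    result = (and_out, or_out, sum_out)[operation]  # Operation mux
    return result, carry_out

def ripple_alu(a, b, binvert, carry_in, operation, width=32):
    """Chain `width` slices; each CarryOut feeds the next CarryIn."""
    result = 0
    carry = carry_in
    for i in range(width):
        bit, carry = alu_slice((a >> i) & 1, (b >> i) & 1,
                               binvert, carry, operation)
        result |= bit << i
    return result

print(ripple_alu(7, 6, binvert=0, carry_in=0, operation=2))  # add
print(ripple_alu(7, 6, binvert=1, carry_in=1, operation=2))  # subtract
```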

(36)

Tailoring the ALU to the MIPS

• Need to support the set-on-less-than instruction (slt)
  - remember: slt is an arithmetic instruction
  - produces a 1 if rs < rt and 0 otherwise
  - use subtraction: (a-b) < 0 implies a < b

• Need to support test for equality (beq $t5, $t6, label)
  - use subtraction: (a-b) = 0 implies a = b


(37)

[Figure: left, the 32-slice ALU (ALU0 through ALU31) with Binvert, CarryIn, and Operation controls; each slice has a Less input, which is 0 for ALU1 through ALU31, while the Set output of ALU31 feeds the Less input of ALU0; ALU31 also produces Overflow. Right, the 1-bit slice with the Operation multiplexer extended to inputs 0, 1, 2, 3, where input 3 is Less]

What is Result31 when (a-b) < 0?

Unsigned vs. signed support

(38)

Test for equality

• Notice control lines:

  000 = and
  001 = or
  010 = add
  110 = subtract
  111 = slt

• Note: zero is a 1 when the result is zero!

[Figure: final 32-bit ALU (ALU0 through ALU31) with Bnegate and Operation controls, Less inputs, Set output from ALU31, Overflow output, and a Zero output derived from Result0 … Result31]

Note test for overflow!
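A word-level sketch of the finished ALU, decoding the 3-bit control lines listed above (the top bit acts as Bnegate, the low two bits select the Operation multiplexer input). Names are illustrative. One assumption goes beyond the figure: the figure takes Set directly from the adder output of ALU31, whereas this sketch XORs that sign bit with Overflow so that slt stays correct when a – b overflows.

```python
MASK32 = 0xFFFFFFFF  # assumed helper: keeps values to 32 bits

def alu(a, b, control):
    """32-bit ALU model: returns (result, zero, overflow)."""
    a &= MASK32
    bnegate = (control >> 2) & 1
    operation = control & 0b11
    b_eff = (~b & MASK32) if bnegate else (b & MASK32)
    total = (a + b_eff + bnegate) & MASK32     # Bnegate doubles as CarryIn
    sign_a, sign_b, sign_s = a >> 31, b_eff >> 31, total >> 31
    overflow = (sign_a == sign_b) and (sign_s != sign_a)
    if operation == 0:
        result = a & b_eff
    elif operation == 1:
        result = a | b_eff
    elif operation == 2:
        result = total
    else:
        # slt: Set goes to the Less input of bit 0; all other bits are 0.
        # Assumption: sign XOR overflow, not the raw sign bit.
        result = sign_s ^ overflow
    zero = result == 0                         # Zero is 1 when result is zero
    return result, zero, overflow

print(alu(7, 6, 0b010))   # add
print(alu(7, 7, 0b110))   # subtract: Zero signals equality, as used by beq
print(alu(5, 9, 0b111))   # slt
```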