ALU Architecture and ISA Extensions
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
(2)
Reading
• Sections 3.2-3.5 (only those elements covered in class)
• Sections 3.6-3.8
• Appendix B.5
• Practice Problems: 26, 27
• Goal: Understand the
  - ISA view of the core microarchitecture
    - Time, space, energy
  - Organization of functional units and register files into basic data path blocks
(3)
Overview
• Instruction Set Architectures have a purpose
  - Applications dictate what we need
• We only have a fixed number of bits
  - Impact on accuracy
• More is not better
  - We cannot afford everything we want
• Basic Arithmetic Logic Unit (ALU) Design
  - Addition/subtraction, multiplication, division
(4)
Reminder: ISA
[Figure: "Who sees what?" Labels: byte addressed memory (up to 0xFFFFFFFF); Arithmetic Logic Unit (ALU); Register File (Programmer Visible State), entries 0x00, 0x01, 0x02, 0x03, … 0x1F; Processor Internal Buses; Memory Interface; Program Counter; Instruction register; Programmer Invisible State; Kernel registers. Memory Map: stack, Dynamic Data, Data segment (static), Text Segment, Reserved.]
(5)
Arithmetic for Computers
• Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
• Operation on floating-point real numbers
  - Representation and operations
• Let us first look at integers
(6)
Integer Addition (3.2)
• Example: 7 + 6
• Overflow if result out of range
  - Adding +ve and –ve operands, no overflow
  - Adding two +ve operands
    - Overflow if result sign is 1
  - Adding two –ve operands
    - Overflow if result sign is 0
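The overflow rules for addition can be checked with a small model. This is a minimal sketch, assuming 32-bit two's complement operands; the names `MASK32`, `to_signed32`, and `add32` are made up for illustration and are not from the slides.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits (hypothetical helper constant)

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return bits - (1 << 32) if bits & 0x80000000 else bits

def add32(a, b):
    """Add two signed 32-bit values; return (wrapped result, overflow flag)."""
    result = to_signed32((a + b) & MASK32)
    # Overflow only when both operands share a sign and the result's sign differs.
    overflow = (a >= 0) == (b >= 0) and (result >= 0) != (a >= 0)
    return result, overflow

print(add32(7, 6))          # (13, False)
print(add32(2**31 - 1, 1))  # (-2147483648, True): two +ve operands, result sign is 1
print(add32(-2**31, -1))    # (2147483647, True): two -ve operands, result sign is 0
```

Mixed-sign operands never trip the flag, which matches the "no overflow" case on the slide.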
(7)
Integer Subtraction
• Add negation of second operand
• Example: 7 – 6 = 7 + (–6), in 2’s complement representation
    +7: 0000 0000 … 0000 0111
    –6: 1111 1111 … 1111 1010
    +1: 0000 0000 … 0000 0001
• Overflow if result out of range
  - Subtracting two +ve or two –ve operands, no overflow
  - Subtracting +ve from –ve operand
    - Overflow if result sign is 0
  - Subtracting –ve from +ve operand
    - Overflow if result sign is 1
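Subtraction as "add the negation" can be sketched the same way. The model below assumes 32-bit operands and forms the negation as invert-and-add-one; `sub32` and its helpers are illustrative names, not part of the lecture material.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits (hypothetical helper constant)

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return bits - (1 << 32) if bits & 0x80000000 else bits

def sub32(a, b):
    """Compute a - b as a + (~b + 1) on 32-bit patterns; return (result, overflow)."""
    a_bits = a & MASK32
    neg_b_bits = ((~b & MASK32) + 1) & MASK32  # two's complement negation of b
    result = to_signed32((a_bits + neg_b_bits) & MASK32)
    # Overflow only when the operands differ in sign and the result's sign differs from a's.
    overflow = (a >= 0) != (b >= 0) and (result >= 0) != (a >= 0)
    return result, overflow

print(sub32(7, 6))           # (1, False)
print(sub32(-2**31, 1))      # (2147483647, True): +ve from -ve, result sign is 0
print(sub32(2**31 - 1, -1))  # (-2147483648, True): -ve from +ve, result sign is 1
```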
(8)
ISA Impact
• Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addiu, subu instructions
• Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    - Save PC in exception program counter (EPC) register
    - Jump to predefined handler address
    - mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
• ALU design leads to many solutions. We look at one simple example
(9)
ISA View
• Register-to-Register data path
• We want this to be as fast as possible
[Figure: CPU/Core with register file ($0, $1, …, $31) and ALU]
(10)
Multiplication (3.3)
• Long multiplication

       1000   multiplicand
     × 1001   multiplier
     ------
       1000
      0000
     0000
    1000
    -------
    1001000   product

• Length of product is the sum of operand lengths
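The long-multiplication example can be reproduced with a shift-and-add loop. This is a sketch for unsigned operands only; the function name and the `n_bits` parameter are assumptions chosen for illustration.

```python
def shift_add_multiply(multiplicand, multiplier, n_bits=4):
    """Multiply two unsigned n-bit values by summing shifted partial products."""
    product = 0
    for i in range(n_bits):
        if (multiplier >> i) & 1:         # multiplier bit i is 1
            product += multiplicand << i  # add the multiplicand shifted i places
    return product

p = shift_add_multiply(0b1000, 0b1001)
print(format(p, "b"))  # 1001000
```

The 7-bit product fits within the 8 bits allowed by two 4-bit operands.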
(11)
A Multiplier
• Uses multiple adders
  - Cost/performance tradeoff
• Can be pipelined
  - Several multiplications performed in parallel
(12)
Division (3.4)
• Check for 0 divisor
• Long division approach
  - If divisor ≤ dividend bits
    - 1 bit in quotient, subtract
  - Otherwise
    - 0 bit in quotient, bring down next dividend bit
• Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back
• Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required

              1001     quotient
    1000 ) 1001010     divisor ) dividend
          -1000
              10
              101
              1010
             -1000
                10     remainder

• n-bit operands yield n-bit quotient and remainder
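The restoring-division steps can be sketched as a bit-serial loop. This assumes unsigned operands; `restoring_divide` and `n_bits` are illustrative names, and signed division would wrap this with the sign adjustment listed on the slide.

```python
def restoring_divide(dividend, divisor, n_bits=8):
    """Unsigned restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is 0")  # the slide's "check for 0 divisor"
    quotient = 0
    remainder = 0
    for i in range(n_bits - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor                # trial subtraction
        if remainder < 0:
            remainder += divisor            # restore; quotient bit is 0
            quotient = quotient << 1
        else:
            quotient = (quotient << 1) | 1  # keep the difference; quotient bit is 1
    return quotient, remainder

q, r = restoring_divide(0b1001010, 0b1000)
print(format(q, "b"), format(r, "b"))  # 1001 10
```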
(13)
Faster Division
• Can’t use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
• Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Still require multiple steps
• Customized implementations for high performance, e.g., supercomputers
(14)
ISA View
• Additional function units and registers (Hi/Lo)
• Additional instructions to move data to/from these registers
  - mfhi, mflo
• What other instructions would you add? Cost?
[Figure: CPU/Core with register file ($0, $1, …, $31), ALU, Multiply/Divide unit, and Hi and Lo registers]
(15)
Floating Point (3.5)
• Representation for non-integral numbers
  - Including very small and very large numbers
• Like scientific notation
  - –2.34 × 10^56 (normalized)
  - +0.002 × 10^–4 (not normalized)
  - +987.02 × 10^9 (not normalized)
• In binary
  - ±1.xxxxxxx (base 2) × 2^yyyy
• Types float and double in C
(16)
IEEE 754 Floating-point Representation
Single Precision (32-bit)
  bit 31: S (1 bit) | bits 30-23: exponent (8 bits) | bits 22-0: significand (23 bits)
  value = (–1)^sign × (1 + fraction) × 2^(exponent – 127)

Double Precision (64-bit)
  bit 63: S (1 bit) | bits 62-52: exponent (11 bits) | bits 51-32: significand (20 bits)
  bits 31-0: significand, continued (32 bits)
  value = (–1)^sign × (1 + fraction) × 2^(exponent – 1023)
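The single-precision layout can be explored by pulling the three fields out of an encoded value. The sketch below uses Python's standard `struct` module; `decode_single` and `value_of` are illustrative names, and `value_of` covers normalized numbers only (exponent fields 0 and 255 are reserved for special cases and are not handled here).

```python
import struct

def decode_single(x):
    """Split a float (rounded to single precision) into (sign, exponent, fraction) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction

def value_of(sign, exponent, fraction):
    """Apply (-1)^sign * (1 + fraction) * 2^(exponent - 127) for normalized numbers."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

fields = decode_single(-0.75)
print(fields)             # (1, 126, 4194304)
print(value_of(*fields))  # -0.75
```

Here –0.75 is –1.1 (base 2) × 2^–1, so the stored exponent is 126 and the fraction field has only its top bit set.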
(17)
Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of representations
  - Portability issues for scientific code
• Now almost universally adopted
• Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)
(18)
FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
• FP adder usually takes several cycles
  - Can be pipelined
Example: FP Addition
(19)
FP Adder Hardware
[Figure: FP adder datapath, annotated with Step 1 through Step 4]
(20)
FP Arithmetic Hardware
• FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
• FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP ↔ integer conversion
• Operations usually take several cycles
  - Can be pipelined
(21)
ISA Impact
• FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
• Separate FP registers
  - 32 single-precision: $f0, $f1, … $f31
  - Paired for double-precision: $f0/$f1, $f2/$f3, …
    - Release 2 of the MIPS ISA supports 32 × 64-bit FP registers
• FP instructions operate only on FP registers
  - Programs generally do not perform integer ops on FP data, or vice versa
  - More registers with minimal code-size impact
(22)
ISA View: The Co-Processor
• Floating point operations access a separate set of 32-bit registers
  - Pairs of 32-bit registers are used for double precision
[Figure: CPU/Core (register file $0, $1, …, $31; ALU; Multiply/Divide unit; Hi and Lo registers), Co-Processor 1 (FP ALU with its own registers $0, $1, …, $31), and Co-Processor 0 (BadVAddr, Status, Cause, EPC; covered later)]
(23)
Associativity
• Floating point arithmetic is not associative
• Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail

                   (x+y)+z           x+(y+z)
    x             -1.50E+38         -1.50E+38
    y              1.50E+38          1.50E+38
    z              1.0               1.0
    inner sum      x+y = 0.00E+00    y+z = 1.50E+38
    result         1.00E+00          0.00E+00

• Need to validate parallel programs under varying degrees of parallelism
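The table's values can be tried directly. One caveat for this sketch: Python floats are double precision rather than the single precision used on the slide, but 1.0 is still far below the spacing between representable values near 1.5 × 10^38, so the same effect appears.

```python
x, y, z = -1.5e38, 1.5e38, 1.0

left = (x + y) + z   # x + y is exactly 0.0, so z survives
right = x + (y + z)  # y + z rounds back to y, so z is lost

print(left, right)  # 1.0 0.0
```

A parallel reduction that regroups the same three operands could therefore return either answer.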
(24)
Performance Issues
• Latency of instructions
  - Integer instructions can take a single cycle
  - Floating point instructions can take multiple cycles
  - Some (FP Divide) can take hundreds of cycles
• What about energy? (We will get to that shortly)
• What other instructions would you like in hardware?
  - Would some applications change your mind?
• How do you decide whether to add new instructions?
(25)
Domain Impact on the ISA: Example
Scientific Computing
• Floats
• Double precision
• Massive data
• Power constrained

Embedded Systems
• Integers
• Lower precision
• Streaming data
• Security support
• Energy constrained
(26)
Summary
• ISAs support operations required of application domains
  - Note the differences between embedded systems and supercomputers!
  - Signed, unsigned, FP, etc.
• Bounded precision effects
  - Software must be careful about how the hardware is used, e.g., associativity
  - Need standards to promote portability
• Avoid “kitchen sink” designs
  - There is no free lunch
  - Impact on speed and energy → we will get to this later
(27)
Study Guide
• Perform 2’s complement addition and subtraction (review)
• Add a few more instructions to the simple ALU
  - Add an XOR instruction
  - Add an instruction that returns the max of its inputs
  - Make sure all control signals are accounted for
• Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review)
• Execute the programs (class website) that use floating point numbers. Study the memory/register contents via single step execution
(28)
Study Guide (cont.)
• Write a few simple programs for
  - Multiplication/division of signed and unsigned numbers
    - Use numbers that produce >32-bit results
    - Move to/from HI and LO registers (find the instructions for doing so)
  - Addition/subtraction of floating point numbers
• Try to write a simple program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)
(29)
Glossary
• Co-processor
• Instruction set extensions
• Overflow
• Precision
• Signed arithmetic support
• Unsigned arithmetic support
Backup
(31)
Integer ALU (arithmetic logic unit) (B.5)
• Build a 1-bit ALU, and use 32 of them (bit-slice)
[Figure: ALU symbol with inputs a and b, control input operation, and output result; table with columns op, a, b, res]
(32)
Single Bit ALU
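[Figure: 1-bit ALU. Inputs A and B feed an AND gate and an OR gate; a multiplexer with inputs 0 and 1, controlled by Operation, selects Result]
• Implements only AND and OR operations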
(33)
Adding Functionality
• We can add additional operators (to a point)
• How about addition?
• Review full adders from digital design

    cout = a·b + a·cin + b·cin
    sum  = a ⊕ b ⊕ cin

[Figure: full adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
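The full-adder equations translate directly into bitwise operations, and chaining 32 of them gives the ripple-carry adder that the bit-slice ALU is built around. This is a sketch with illustrative names (`full_adder`, `ripple_add`), modelling each wire as a 0/1 integer.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, cout) for bits a, b, cin."""
    s = a ^ b ^ cin                         # sum = a XOR b XOR cin
    cout = (a & b) | (a & cin) | (b & cin)  # cout = ab + a*cin + b*cin
    return s, cout

def ripple_add(a, b, n_bits=32):
    """Chain n_bits full adders; the carry out of bit i feeds the carry in of bit i+1."""
    carry = 0
    result = 0
    for i in range(n_bits):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry

print(ripple_add(7, 6))           # (13, 0)
print(ripple_add(0xFFFFFFFF, 1))  # (0, 1): the carry ripples out of bit 31
```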
(34)
Building a 32-bit ALU
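[Figure, left: 1-bit ALU with inputs a, b, and CarryIn; a multiplexer with inputs 0, 1, 2, controlled by Operation, selects Result; the adder also produces CarryOut]
[Figure, right: 32-bit ALU built from ALU0, ALU1, ALU2, … ALU31. ALUi takes ai and bi and produces Resulti; the CarryOut of each slice feeds the CarryIn of the next; Operation is shared by all slices]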
(35)
Subtraction (a – b)?
• Two's complement approach: just negate b and add 1.
• How do we negate?
• A clever solution:
[Figure, left: 1-bit ALU in which a Binvert control selects b or its complement through a multiplexer (inputs 0, 1) ahead of the adder; the Operation multiplexer (inputs 0, 1, 2) selects Result]
[Figure, right: 32-bit ALU built from ALU0 … ALU31 with shared Binvert and Operation controls; for sub, the CarryIn of ALU0 supplies the added 1]
(36)
Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
  - remember: slt is an arithmetic instruction
  - produces a 1 if rs < rt and 0 otherwise
  - use subtraction: (a-b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, label)
  - use subtraction: (a-b) = 0 implies a = b
(37)
[Figure, left: 1-bit ALU with inputs a, b, Less, and CarryIn, and controls Binvert and Operation; the Operation multiplexer has inputs 0, 1, 2, 3, with Less on input 3]
[Figure, right: 32-bit ALU built from ALU0 … ALU31. The Less inputs of ALU1 … ALU31 are 0; ALU31 produces Set and Overflow, and Set is fed back to the Less input of ALU0]
• What is Result31 when (a-b) < 0?
• Unsigned vs. signed support
(38)
Test for equality
• Notice control lines:
    000 = and
    001 = or
    010 = add
    110 = subtract
    111 = slt
• Note: Zero is a 1 when the result is zero!
• Note test for overflow!
[Figure: 32-bit ALU built from ALU0 … ALU31 with controls Bnegate and Operation; Set from ALU31 feeds the Less input of ALU0; outputs Result0 … Result31, Overflow, and Zero]
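The control-line encoding can be exercised with a behavioural model of the whole 32-bit ALU. This is a sketch, not the gate-level design in the figure: `alu` is an illustrative name, the top control bit is treated as Bnegate (invert b and inject a carry-in of 1), and, as an assumption beyond the figure, slt XORs the adder's sign bit with the overflow signal so the comparison remains correct when a – b overflows.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits (hypothetical helper constant)

def alu(control, a, b):
    """Model of the 32-bit ALU; returns (result, zero, overflow) for 32-bit patterns."""
    bnegate = (control >> 2) & 1  # 1 for subtract and slt
    operation = control & 0b11    # 0 = and, 1 = or, 2 = adder output, 3 = slt
    a &= MASK32
    b_in = (~b & MASK32) if bnegate else (b & MASK32)
    s = (a + b_in + bnegate) & MASK32  # a + b, or a + ~b + 1 = a - b
    # Signed overflow: both adder inputs share a sign that differs from the sum's sign.
    adder_overflow = (((a ^ s) & (b_in ^ s)) >> 31) & 1
    if operation == 0:
        result = a & b_in
    elif operation == 1:
        result = a | b_in
    elif operation == 2:
        result = s
    else:
        # Assumption: correct the sign bit with overflow before using it as Less.
        result = ((s >> 31) ^ adder_overflow) & 1
    zero = 1 if result == 0 else 0
    overflow = adder_overflow if operation == 2 else 0
    return result, zero, overflow

print(alu(0b010, 7, 6))           # (13, 0, 0)
print(alu(0b110, 7, 7))           # (0, 1, 0): Zero is 1, so beq would be taken
print(alu(0b111, 5, 9))           # (1, 0, 0): 5 < 9
print(alu(0b010, 0x7FFFFFFF, 1))  # (2147483648, 0, 1): signed overflow
```

A beq can be modelled by issuing a subtract (110) and testing the Zero output.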