single instruction multiple data
![Page 1: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/1.jpg)
Single Instruction Multiple Data
Another approach to ILP and performance
![Page 2: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/2.jpg)
Outline
• Array Processors / “True” SIMD• Vector Processors• Multimedia Extensions in modern instruction sets
![Page 3: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/3.jpg)
SIMD: Motivation
• Let’s start with an example:• ILLIAC IV, U of Illinois, 1972 (prototype)
• Reasoning: How to Improve Performance• Rely on Faster Circuits
• Cost/circuit increases with circuit speed• At some point, cost/performance unfavorable
• Concurrency: • Replicate Resources• Do more per cycle
![Page 4: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/4.jpg)
SIMD: Motivation contd.
• Replication to the extreme: Multi-processor
• Very flexible, but costly
• Do we need all this flexibility?
• There are middle-ground designs where only parts are replicated
[Figure: a uniprocessor (CU, ALU, MEM) next to a multiprocessor built by replicating the whole CU + ALU + MEM unit]
![Page 5: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/5.jpg)
SIMD: Motivation Contd.
• Recall: • Part of architecture is understanding application needs
• Many Apps:• for i = 0 to infinity
• a(i) = b(i) + c
• Same operation over many tuples of data• Mostly independent across iterations
![Page 6: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/6.jpg)
SIMD Architecture
• Replicate Datapath, not the control• All PEs work in tandem• CU orchestrates operations
[Figure: one CU driving several PEs, each with its own MEM; each PE holds an ALU, a μCU and registers]
![Page 7: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/7.jpg)
ILLIAC IV
• Goal:• 1 Gops/sec• 256 PEs as four partitions of 64 PEs
• What was built• 0.2 Minsts/sec (we’ll talk about peak performance as ops)• 64 PEs• Prototype due date 1972
![Page 8: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/8.jpg)
ILLIAC IV
[Figure: ILLIAC IV organization: a CU, several PEs each with its own PMEM, and an I/O processor]
![Page 9: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/9.jpg)
ILLIAC IV Processing Element (PE)
• 64-bit numbers, float or fixed point• Multiples of smaller numbers that add up to 64-bits
• Today’s multimedia extensions• PMEM: One local memory module per PE
• 2K x 64-bits• 188ns access / 350ns cycle (includes conflict resolution)
• 100K components per PE
![Page 10: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/10.jpg)
PE Contd.
• PE mode: Active or Inactive, CU sets mode• All PEs operate in lock-step• Routing insts to move data from PE to PE• The CU can execute instructions while PE’s are busy
• Another degree of concurrency• Datatypes
• 64b float
• 64b logical
• 48b fixed
• 32b float
• 24b fixed
• 8b fixed
![Page 11: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/11.jpg)
Peak Compute Bandwidth
• 64 PEs• Each can perform:
• 1 64b, 2 32b, or 8 8b operations
• Or, in total:
• 64 elems, 128 elems, or 512 elems
• Peak:• 150M 64b ops/sec up to 10G 32b ops/sec• The last figure is for integer ops• Each int op takes 66ns (4 per PE in parallel)
![Page 12: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/12.jpg)
Control Unit (CU)
• A simple CPU• Can execute instructions w/o PE intervention• Coordinates all PEs• 64 64b registers, D0-D63• 4 64b Accumulators A0-A3• Ops:
• Integer ops• Shifts• Boolean• Loop control• Index PMem
[Figure: CU datapath: registers D0 to D63 and accumulators A0 to A3 around the CU's ALU]
![Page 13: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/13.jpg)
Processing Element (PE)
• 64 bit regs
• A: Accumulator
• B: 2nd operand for binary ops
• R: Routing – Inter-PE Communication
• S: Temporary
• X: Index for PMEM 16bits
• D: mode 8bits
• Communication:
• PMEM only from local PE
• Amongst PE with R
[Figure: PEi datapath: registers A, B, R, S, X and D around the ALU, the local PMEMi, and routing links to PEi-1, PEi+1, PEi-8 and PEi+8]
![Page 14: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/14.jpg)
Datapaths
• CU Bus: Insts and Data from PMEM to CU in 8 words• CDB: Broadcast to all PEs
• E.g., constants for adds• Routing Network: amongst R registers• Mode: To activate/de-activate PEs
[Figure: the CU and the PE/PMEM pairs connected by the Control Unit Bus, the Mode lines, the Common Data Bus and the Routing network]
![Page 15: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/15.jpg)
Routing Network
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
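[Figure: the 8 x 8 grid above with wraparound links. PE i connects to PEs i-1, i+1, i-8 and i+8: the end of each row continues into the next row, and the top and bottom rows connect to each other. Example: PE 20 connects to PEs 19, 21, 12 and 28.]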
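The neighbor rule of the routing network can be sketched in a few lines of Python. This is only an illustration: it assumes 0-indexed PEs and modulo-64 wraparound, and the function name `illiac_neighbors` is made up for this example.

```python
def illiac_neighbors(i, n_pes=64, row=8):
    """Return the four PEs directly reachable from PE i (0-indexed).

    Assumption: links wrap around modulo n_pes, so the last PE of a row
    is followed by the first PE of the next row.
    """
    return {
        "i-1": (i - 1) % n_pes,
        "i+1": (i + 1) % n_pes,
        "i-8": (i - row) % n_pes,
        "i+8": (i + row) % n_pes,
    }


print(illiac_neighbors(20))  # the PE singled out in the routing diagram
print(illiac_neighbors(63))  # a corner PE, where every link wraps
```

Any other transfer has to be composed from these four moves, which is why routing distance matters when laying out data across the PEs.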
![Page 16: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/16.jpg)
Using ILLIAC IV: Example #1
• DO 10 I = 1 TO 64
• 10 C(I) = A(I) + B(I)
• LDA a + 2 load A(i) into A (same a per PMEM)
• ADDRN a + 1 add B(i) into A
• STA a store A into C(i)
[Figure: PMEM1 to PMEM64; PMEMi holds C(i), A(i) and B(i) at the same local addresses, with C(i) at address a]
![Page 17: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/17.jpg)
Using ILLIAC IV: Example #2
• DO 10 I = 2 TO 64• 10 A(I) = B(I) + A(I-1)• Expand into:
• A(N) = A(1) + Sum B(i) [i = 2 to N]
• We get:• DO 10 N=2 TO 64• S = S + B(N)• 10 A(N) = S
![Page 18: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/18.jpg)
Using ILLIAC IV: Example #2 contd.
1. Enable all PEs
2. All load A from a
3. i = 0
4. All R = A (including those inactive)
5. All route R to PE(2^i) to the right
6. j = 2^i – 1
7. Disable all PEs 1 through j
8. A = A + R (R contains a partial sum of many A(i))
9. i = i + 1
10. if i < lg(64) goto 4
11. Enable All PEs
12. All store A at (a + 1)
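The twelve steps can be simulated in Python to check that lg(64) = 6 routing rounds are enough. This is a sketch under assumptions, not ILLIAC IV code: PEs are 0-indexed here, so the disabled PEs in round i are 0 through 2^i - 1, PEs are re-enabled at the start of every round, and the name `illiac_prefix_sum` is hypothetical.

```python
from itertools import accumulate


def illiac_prefix_sum(values):
    """Lock-step running sum across PEs in log2(n) routing rounds."""
    n = len(values)
    a = list(values)          # step 2: every PE loads its A register
    shift = 1                 # 2**i, starting with i = 0 (step 3)
    while shift < n:          # step 10: repeat while i < lg(n)
        r = list(a)           # step 4: R = A in every PE, active or not
        # step 5: every PE sends R to the PE `shift` places to its right
        routed = [r[(k - shift) % n] for k in range(n)]
        # step 7: PEs 0 .. shift-1 have no valid sender, so they sit out
        for k in range(shift, n):
            a[k] += routed[k]  # step 8: A = A + R
        shift *= 2             # step 9: i = i + 1
    return a


data = list(range(1, 65))      # stand-in for A(1), B(2), ..., B(64)
result = illiac_prefix_sum(data)
print(result[:4], result[-1])
print(result == list(accumulate(data)))
```

Each round doubles the number of terms folded into every partial sum, which is where the lg(64) bound in step 10 comes from.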
![Page 19: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/19.jpg)
Using ILLIAC IV: Example #2 contd.
• Initial State:• PMEM(1)[a] = A(1)• PMEM(1+i)[a] = B(i+1)
• For example, at PE1• STEP 1: A = A(1)
• from PE2 we get B(2)
• STEP 2: A = A(1) + B(2)• from PE4 we get B(4) + B(5)
• STEP 3: A = A(1) + B(2) + B(4) + B(5)• From PE8 we get B(8) + B(7) + B(12) + B(13)
![Page 20: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/20.jpg)
Vector Processors
SIMD over time
![Page 21: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/21.jpg)
Vector Processors
• Vector Datatype• Apply same operation on all elements of the vector• No dependences amongst elements• Same motivation as SIMD
![Page 22: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/22.jpg)
Properties of Vector Processors
• One Vector instruction implies lots of work• Fewer instructions
• Each result independent of previous result• Multiple operations in parallel• Simpler design; no need for dependence checks• Higher clock rate• Compiler must help
• Fewer Branches• Memory access pattern per vector inst known
• Prefetching effect• Amortize mem latency• Can exploit high-bandwidth mem system• Less/no need for data caches
![Page 23: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/23.jpg)
Classes of Vector Processors
• Memory to memory• Vectors are in memory
• Load/store• Vectors are in registers• Load/store to communicate with memory• This prevailed
![Page 24: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/24.jpg)
Historical Perspective
• Mid-60s: performance concerns• SIMD processor arrays• Also fast Scalar machines
• CDC 6600
• Texas Instruments ASC, 1972• Memory to memory vector
• Cray Develops CRAY-1, 1978
![Page 25: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/25.jpg)
CRAY-1
• Fast and simple scalar processor
• 80 MHz
• Vector register concept
• Much simpler ISA
• Reduced memory pressure
• Tight integration of scalar and vector units
• Cylindrical design to minimize wire lengths
• Freon Cooling
![Page 26: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/26.jpg)
Physical Organization of CRAY-1
![Page 27: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/27.jpg)
Components of Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch
• Vector Registers:
• Fixed length memory bank holding a single vector reg
• Typically 8-32 Vregs, up to 8Kbits per Vreg
• At least 2 read ports and 1 write port
• Can be viewed as an array of N elements
• Vector Functional Units:
• Fully pipelined. New op per cycle
• Typically 2 to 8 FUs: integer and FP
• Multiple datapaths to process multiple elements per cycle if needed
• Vector Load/Store Units (LSUs):
• Fully pipelined
• Multiple elems fetched/stored per cycle
• May have multiple LSUs
• Cross-bar:
• Connects FUs, LSUs and registers
![Page 28: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/28.jpg)
CRAY-1 Organization
• Simple 16-bit Reg-to-Reg ISA
• Use two 16-bit to get Imm
• Natural combinations of scalar and vector
• Scalar bit-vectors match vector length
• Gather/Scatter M-R
• Cond. Merge
![Page 29: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/29.jpg)
CRAY-1 CPU
• Scalar and vector modes
• 12.5 ns clock
• 64-bit words
• Int & FP units
• 12 FUs
• 8 24-bit A regs
• 64 B regs (temp storage for A)
• 8 64-bit S regs
• 64 T regs (temp storage for S)
• 8 V regs, each with 64 elements of 64 bits
![Page 30: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/30.jpg)
CRAY-1 CPU
• Vector Length Register• Can use only a prefix of a vreg
• Vector Mask Register• Can use only a subset of a vreg
• Real Time Register (counts clock cycles)• Four instruction buffers
• 64 16-bit parcels• 128 Basic Instructions• Interrupt Control• NO virtual memory system
![Page 31: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/31.jpg)
Cray-1 Memory System
• 1M 64b words + 8 check bits (single error correction, double error detection)
• 16 banks of 64K words• 4 clocks period• 1 word per clock for B, T and Vreg• 1 word per 2 clocks for A & S• 4 words per clock for inst buffers
![Page 32: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/32.jpg)
Instruction Format
• Fields: g h i j k m
• Bits: 0-3 4-6 7-9 10-12 13-15 16-31
• Bit counts: 4 3 3 3 3 16
• X X opcode
• Rd Rs1 Rs2
• A/S B/T
![Page 33: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/33.jpg)
Basic Vector Instructions
• Format of each entry: Inst Operands: Operation (Comment)
• VADD.VV V1, V2, V3: V1=V2+V3 (vector + vector)
• VADD.SV V1, R0, V2: V1=R0+V2 (scalar + vector)
• VMUL.VV V1, V2, V3: V1=V2*V3 (vector * vector)
• VMUL.SV V1, R0, V2: V1=R0*V2 (scalar * vector)
• VLD V1, R0: V1=M[R0…R0+63] (stride = 1)
• VLDS V1, R1, R2: V1=M[R1…R1+63*R2] (stride = R2)
• VLDX V1, R1, V2: V1=M[R1+V2[i], i=0 to 63] (gather)
• VST: store equiv of VLD
• VSTS: store equiv of VLDS
• VSTX V1, R1, V2: M[R1+V2[i], i=0 to 63]=V1 (scatter)
![Page 34: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/34.jpg)
Vector Memory Operations
• Load/Store move groups of data between memory and registers
• Addressing Modes
• Unit-stride: Fastest
• Non-Unit, constant stride (interleaved memory helps)
• Indexed (gather-scatter)
• Vector equiv of register indirect
• Sparse arrays
• Can vectorize more loops
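The three addressing modes differ only in how the address of element i is formed. The Python sketch below models memory as a flat list of words; the function names mirror the VLD/VLDS/VLDX/VSTX mnemonics but are otherwise illustrative, and addresses are counted in words rather than bytes as a simplifying assumption.

```python
def vld(mem, base, vl=64):
    """Unit stride: element i comes from mem[base + i]."""
    return [mem[base + i] for i in range(vl)]


def vlds(mem, base, stride, vl=64):
    """Constant stride: element i comes from mem[base + i * stride]."""
    return [mem[base + i * stride] for i in range(vl)]


def vldx(mem, base, index, vl=64):
    """Indexed load (gather): element i comes from mem[base + index[i]]."""
    return [mem[base + index[i]] for i in range(vl)]


def vstx(mem, base, index, values, vl=64):
    """Indexed store (scatter): values[i] goes to mem[base + index[i]]."""
    for i in range(vl):
        mem[base + index[i]] = values[i]


mem = list(range(100))
print(vld(mem, 10, vl=4))
print(vlds(mem, 10, 3, vl=4))
print(vldx(mem, 10, [5, 0, 2, 7], vl=4))
```

Gather and scatter are what let a loop over a sparse array, such as `y[idx[i]] += x[i]` with distinct indices, be expressed with vector instructions.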
![Page 35: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/35.jpg)
Vector Code Example
• Y[0:63] = Y[0:63] + a * X[0:63]
• LD R0, a
• VLD V1, Rx (load X[] in V1)
• VLD V2, Ry (load Y[] in V2)
• VMUL.SV V3, R0, V1 (V3 = X[]*a)
• VADD.VV V4, V2, V3 (V4 = Y[]+X[]*a)
• VST Ry, V4 (store in Y[])
![Page 36: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/36.jpg)
Scalar Equivalent
• LD R0, a
• LI R5, 512 (offset at the end of X[])
• Loop: LD R2, 0(Rx)
• MULTD R2, R0, R2
• LD R3, 0(Ry)
• ADD R4, R2, R3
• ST R4, 0(Ry)
• ADD Rx, Rx, 8
• ADD Ry, Ry, 8
• SUB R5, R5, 8
• BNE Loop
Vector version, for comparison:
• LD R0, a
• VLD V1, Rx
• VLD V2, Ry
• VMUL.SV V3, R0, V1
• VADD.VV V4, V2, V3
• VST Ry, V4
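A quick count of executed instructions shows how much fetch and decode work the vector version saves. The numbers are read directly off the two listings (2 setup instructions plus a 9-instruction loop body, versus 6 vector-version instructions); the helper name is made up for this sketch.

```python
def dynamic_instruction_counts(n=64, setup=2, loop_body=9, vector_insts=6):
    """Instructions executed for an n-element a*X + Y.

    setup, loop_body and vector_insts are counted from the two listings.
    """
    scalar = setup + loop_body * n
    return scalar, vector_insts


scalar, vector = dynamic_instruction_counts()
print(scalar, vector)  # 578 scalar instructions versus 6
```

The count says nothing about execution time on its own, since each vector instruction still processes 64 elements; it only measures instruction bandwidth and branch count.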
![Page 37: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/37.jpg)
Vector Length Register
• Allows us to vectorize code where the elements do not exactly fit within the vector register
• What if we need a vector of just 32 elems?• Vector length register:
• Operate up to this element• Can be anything from 0 to Maximum (64 in CRAY-1)
• Can also be used to support runtime vector length variability
![Page 38: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/38.jpg)
Strip Mining
• Suppose (application vector length) AVL > MVL (max vector length)
• Each strip handles up to MVL elems
• One strip is shorter and handles the leftover AVL mod MVL elems
• VL = (AVL mod MVL)
• for (i = 0; i < VL; i++)
• Y[i] = A*X[i] + Y[i]
• low = (AVL mod MVL)
• VL = MVL
• for (i = low; i < AVL; i++)
• Y[i] = A*X[i] + Y[i] (processed VL = MVL elems at a time)
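Strip mining can be sketched in Python as follows. Each list-slice update stands in for one group of vector instructions executed with the vector length register set to `vl`; the function name and the returned list of strip lengths are illustrative additions, and handling the short leftover strip first is just one common arrangement.

```python
def strip_mined_axpy(a, x, y, mvl=64):
    """Compute y = a*x + y in strips of at most mvl elements.

    Returns the vector length used for each strip so the schedule is visible.
    """
    avl = len(x)
    strips = []
    low = 0
    vl = avl % mvl                     # short leftover strip goes first
    for _ in range(avl // mvl + 1):
        if vl > 0:
            # one "vector instruction sequence" with VL = vl
            y[low:low + vl] = [a * xi + yi
                               for xi, yi in zip(x[low:low + vl],
                                                 y[low:low + vl])]
            strips.append(vl)
        low += vl
        vl = mvl                       # every later strip is full length
    return strips


x = [float(i) for i in range(200)]
y = [1.0] * 200
print(strip_mined_axpy(2.0, x, y))     # strip lengths used
print(y[:3], y[-1])
```

Because the strip length is taken from the data at run time, the same loop also covers vectors shorter than MVL and vectors whose length is not known at compile time.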
![Page 39: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/39.jpg)
Optimization #1: Chaining
• Subsequent vector op can be initiated as soon as a preceding vector op it depends upon produces its first result
• Example
• Vadd.vv v1, v2, v3
• Vadd.sv v4, v1, R0
[Figure: timelines for the two adds. Unchained: the second add is initiated only after V1(63), the last element of the first add, is produced. Chained: the second add is initiated as soon as V1(1) is produced, so V4(1) to V4(63) overlap with the remaining V1 elements.]
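The benefit of chaining can be estimated with a deliberately simple timing model. The model is an assumption made for this sketch, not taken from the slides: each functional unit has the same pipeline latency (the `latency` parameter, a hypothetical value) and then delivers one element per cycle.

```python
def unchained_cycles(n, latency):
    """Second op starts only after the first has produced all n elements."""
    return 2 * (latency + n)


def chained_cycles(n, latency):
    """Second op starts as soon as the first result element is available."""
    return 2 * latency + n


n, latency = 64, 6   # 64-element vectors; latency of 6 is an example value
print(unchained_cycles(n, latency), chained_cycles(n, latency))
```

Under this model the saving is n cycles per dependent pair of operations, so long vectors gain the most.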
![Page 40: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/40.jpg)
Optimization #2: Conditional Execution
• Vector Mask Register
• Bit vector: used as predicate
• If 0, the operation is not performed for the corresponding pair
• VLD V1, Ra
• VLD V2, Rb
• VCMP.NEQ.VV VMR, V1, V2
• VSUB.VV V3, V1, V2 (VMR)
• VST V3, Ra
• For (i = 0; i < 64; i++)
• if (A[i] != B[i]) A[i] = A[i] - B[i]
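Masked execution can be modeled in Python with a list of booleans standing in for the vector mask register. The helper names are hypothetical, and the sketch assumes that a masked-off element keeps whatever the destination already held.

```python
def vcmp_neq(v1, v2):
    """Build the mask: True where the two elements differ."""
    return [a != b for a, b in zip(v1, v2)]


def vsub_masked(dest, v1, v2, mask):
    """dest[i] = v1[i] - v2[i] where mask[i] is set; otherwise dest[i] is kept."""
    return [a - b if m else d for d, a, b, m in zip(dest, v1, v2, mask)]


A = [5, 3, 7, 7]
B = [5, 1, 7, 2]
vmr = vcmp_neq(A, B)
A = vsub_masked(A, A, B, vmr)   # if (A[i] != B[i]) A[i] = A[i] - B[i]
print(vmr, A)
```

The branch in the loop body disappears: every element is visited, and the mask decides which results are committed.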
![Page 41: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/41.jpg)
Optimization #3: Multi-lane Implementation
• Vectors are interleaved so that multiple elems can be accessed per cycle
• Replicate resources
• Equivalent of Superscalar
• Because there are no intra-vector dependences and inter-vector dependences are aligned (elem(i) to elem(i)), there is no need for inter-bank communication
![Page 42: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/42.jpg)
Two Ways to View Vectorization
• Classic Approach: Inner-loop
• Think of the machine as having 32 vector registers with 16 elems each
• 1 instruction updates all elements of a vector
• Vectorize single dimension array operations
• A new approach: Outer-loop
• Think of the machine as 16 “virtual processors”, each with 32 scalar registers
• 1 instruction updates a register in all 16 VPs
• Good for irregular kernels
• Hardware is the same for both• These describe the compiler’s perspective
![Page 43: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/43.jpg)
Startup Cost
![Page 44: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/44.jpg)
Execution Cost
![Page 45: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/45.jpg)
Multimedia extensions
SIMD in modern CPUs
![Page 46: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/46.jpg)
Multimedia ISA Extensions
• Intel’s MMX• The Basics• Instruction Set• Examples• Integration into Pentium • Relationship to vector ISAs
• AMD’s 3DNow!• Intel’s ISSE (a.k.a. KNI)
![Page 47: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/47.jpg)
MMX: Basics
• Multimedia applications are becoming popular
• Are current ISAs a good match for them?
• Methodology:
• Consider a number of “typical” applications
• Can we do better?
• Cost vs. performance vs. utility tradeoffs
• Net Result: Intel’s MMX
• Can also be viewed as an attempt to maintain market share
• If people are going to use these kinds of applications, we had better support them
![Page 48: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/48.jpg)
Multimedia Applications
• Most multimedia apps have lots of parallelism:
• for I = here to infinity
• out[I] = in_a[I] * in_b[I]
• At runtime:
• out[0] = in_a[0] * in_b[0]
• out[1] = in_a[1] * in_b[1]
• out[2] = in_a[2] * in_b[2]
• out[3] = in_a[3] * in_b[3]
• …..
• Also, work on short integers:
• in_a[i] is 0 to 255 for example (8-bit color)
• or, 0 to 64k (16-bit audio)
![Page 49: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/49.jpg)
Observations
• 32-bit registers are wasted
• only using part of them and we know
• ALUs underutilized and we know
• Instruction specification is inefficient
• even though we know that a lot of the same operations will be performed, we still have to specify each of them individually
• Instruction bandwidth
• Discovering Parallelism
• Memory Ports?
• Could read four elements of an array with one 32-bit load
• Same for stores
• The hardware will have a hard time discovering this
• Coalescing and dependences
![Page 50: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/50.jpg)
MMX Contd.
• Can do better than traditional ISA• new data types• new instructions
• Pack data in 64-bit words• bytes• “words” (16 bits)• “double words” (32 bits)
• Operate on packed data like short vectors• SIMD• First used in Livermore S-1 (> 20 years)
![Page 51: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/51.jpg)
MMX:Example
Up to 8 operations (64bit) go in parallel
Potential improvement: 8x
In practice less but still good
Besides another reason to think your machine is obsolete
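To make the "8 operations in parallel" idea concrete, the Python sketch below adds eight packed bytes held in one 64-bit integer using ordinary integer operations. It imitates the effect of a wraparound packed byte add; the function names are chosen to echo MMX mnemonics but this is a software model, not how the hardware is built.

```python
HIGH = 0x8080808080808080   # top bit of every byte lane
LOW = 0x7F7F7F7F7F7F7F7F    # low 7 bits of every byte lane


def pack_bytes(values):
    """Pack 8 byte values (lane 0 first) into one 64-bit integer."""
    return int.from_bytes(bytes(values), "little")


def unpack_bytes(word):
    """Split a 64-bit integer back into its 8 byte lanes."""
    return list(word.to_bytes(8, "little"))


def paddb(x, y):
    """Lane-wise 8-bit add with wraparound; no carry crosses a lane."""
    # Add the low 7 bits of each lane, then fold in the top bits with XOR.
    return ((x & LOW) + (y & LOW)) ^ ((x ^ y) & HIGH)


a = pack_bytes([1, 2, 250, 255, 0, 100, 200, 7])
b = pack_bytes([1, 2, 10, 1, 0, 100, 100, 7])
print(unpack_bytes(paddb(a, b)))
```

One 64-bit operation replaces eight byte-sized additions, which is where the potential 8x comes from.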
![Page 52: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/52.jpg)
Data Types
![Page 53: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/53.jpg)
MMX: Instruction Set
• 57 new instructions• Integer Arithmetic
• add/sub/mul• multiply add• signed/unsigned• saturating/wraparound
• Shifts• Compare (form mask)• Pack/Unpack• Move
• from/to memory• from/to registers
![Page 54: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/54.jpg)
Arithmetic
• Conventional: Wrap-around
• on overflow, wrap around to the bottom of the range (MININT for signed, 0 for unsigned)
• on underflow, wrap around to MAXINT
• Think of digital audio
• What happens when you turn volume to the MAX?
• Similar for pictures
• Saturating arithmetic:
• on overflow, stay at MAXINT
• on underflow, stay at MININT
• Two flavors:
• unsigned
• signed
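The difference between the two overflow policies is easy to see on single elements. The helpers below are a Python sketch for unsigned 8-bit and signed 16-bit values; the function names are illustrative only.

```python
def add_wrap_u8(a, b):
    """Unsigned 8-bit add, wraparound: keep only the low 8 bits."""
    return (a + b) & 0xFF


def add_sat_u8(a, b):
    """Unsigned 8-bit add, saturating: clamp at 255."""
    return min(a + b, 0xFF)


def add_wrap_s16(a, b):
    """Signed 16-bit add, wraparound (two's complement)."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s


def add_sat_s16(a, b):
    """Signed 16-bit add, saturating: clamp to [-32768, 32767]."""
    return max(-32768, min(32767, a + b))


print(add_wrap_u8(200, 100), add_sat_u8(200, 100))
print(add_wrap_s16(30000, 10000), add_sat_s16(30000, 10000))
```

With wraparound, a bright pixel or loud sample that overflows flips to a dark or strongly negative value; saturation pins it at the limit, which is usually the less objectionable artifact.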
![Page 55: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/55.jpg)
Operations• Mult/Add
• Compares
• Conversion
• Interpolation/Transpose• Unpack (e.g., byte to word)• Pack (e.g., word to byte)
![Page 56: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/56.jpg)
Matrix Transpose 4x4
• Source rows, 4 x 16-bit words each (high word first):
• m03 m02 m01 m00
• m13 m12 m11 m10
• m23 m22 m21 m20
• m33 m32 m31 m30
• punpcklwd on rows 0 and 1: m11 m01 m10 m00
• punpcklwd on rows 2 and 3: m31 m21 m30 m20
• punpckldq on the two results: m30 m20 m10 m00
• punpckhdq on the two results: m31 m21 m11 m01
• That’s for the first two rows
• Transposed result:
• m30 m20 m10 m00
• m31 m21 m11 m01
• m32 m22 m12 m02
• m33 m23 m13 m03
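The unpack sequence can be checked with a small Python model in which a 64-bit register is a list of four 16-bit words, lowest word first. The helper functions model the interleaving behavior of the MMX unpack instructions of the same names; using the high-word variant `punpckhwd` for the last two output rows is an assumption, since the slide only shows the first two.

```python
def punpcklwd(d, s):
    """Interleave the low words of d and s."""
    return [d[0], s[0], d[1], s[1]]


def punpckhwd(d, s):
    """Interleave the high words of d and s."""
    return [d[2], s[2], d[3], s[3]]


def punpckldq(d, s):
    """Low doubleword of d followed by low doubleword of s."""
    return [d[0], d[1], s[0], s[1]]


def punpckhdq(d, s):
    """High doubleword of d followed by high doubleword of s."""
    return [d[2], d[3], s[2], s[3]]


def transpose4x4(rows):
    """Transpose a 4x4 matrix of words using only unpack operations."""
    r0, r1, r2, r3 = rows
    t0, t1 = punpcklwd(r0, r1), punpcklwd(r2, r3)   # columns 0 and 1 mixed
    t2, t3 = punpckhwd(r0, r1), punpckhwd(r2, r3)   # columns 2 and 3 mixed
    return [punpckldq(t0, t1), punpckhdq(t0, t1),
            punpckldq(t2, t3), punpckhdq(t2, t3)]


m = [[10 * r + c for c in range(4)] for r in range(4)]  # entry rc encodes m_rc
print(transpose4x4(m))
```

Eight unpack operations transpose the whole 4x4 block without touching individual elements.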
![Page 57: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/57.jpg)
Examples
• Image Compositing
• A and B images fade-in and fade-out
• A * fade + B * (1 - fade), OR
• (A - B) * fade + B
• Image Overlay
• Sprite: e.g., mouse cursor
• Sprite: normal colors + transparent
• for i = 1 to Sprite_Length
• if A[I] = clear_color then
• Out_frame[I] = C[I]
• else Out_frame[I] = A[I]
• Matrix Transpose
• Convert from row major to column major
• Used in JPEG
![Page 58: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/58.jpg)
Chroma Keying
• for (i=0; i<image_size; i++) • if (x[i] == Blue) new_image[i] =y[i]• else new_image[i] = x[i];
![Page 59: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/59.jpg)
Chroma Keying Code
• Movq mm3, mem1
• Load eight pixels from the person’s image
• Movq mm4, mem2
• Load eight pixels from the background image
• Pcmpeqb mm1, mm3
• Pand mm4, mm1
• Pandn mm1, mm3
• Por mm4, mm1
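The four-instruction select can be traced in Python, with each MMX register modeled as a list of eight bytes. The sketch assumes that mm1 already holds the Blue key value in every byte before the compare, which the listing does not show; the helper names follow the instruction mnemonics.

```python
def pcmpeqb(a, b):
    """Per-byte compare: 0xFF where equal, 0x00 elsewhere."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pand(a, b):
    return [x & y for x, y in zip(a, b)]


def pandn(a, b):
    """(NOT a) AND b, per byte."""
    return [(~x & 0xFF) & y for x, y in zip(a, b)]


def por(a, b):
    return [x | y for x, y in zip(a, b)]


def chroma_key8(person, background, blue):
    mm3, mm4 = person, background      # the two Movq loads
    mm1 = [blue] * 8                   # assumed: key color preloaded in mm1
    mm1 = pcmpeqb(mm1, mm3)            # mask = (x == Blue)
    mm4 = pand(mm4, mm1)               # background where the mask is set
    mm1 = pandn(mm1, mm3)              # person where the mask is clear
    return por(mm4, mm1)               # merge the two halves


person = [7, 1, 2, 7, 3, 7, 4, 5]
background = [10, 11, 12, 13, 14, 15, 16, 17]
print(chroma_key8(person, background, blue=7))
```

The if/else of the scalar loop becomes a compare that builds a mask followed by three logical operations, so eight pixels are selected with no branches.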
![Page 60: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/60.jpg)
Integration into Pentium
• Major issue: OS compatibility
• Create new registers?
• Share registers with FP
• Existing OSes will save/restore
• Use 64-bit datapaths
• Pipe capable of 2 MMX IPC
• Separate MEM and Execute stage
![Page 61: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/61.jpg)
“Recent” Multimedia Extensions
• Intel MMX: integer arithmetic only
• New algorithms -> new needs
• Need for massive amounts of FP ops
• Solution? MMX like ISA but for FP not only integer
• Example: AMD’s 3DNow!
• New data type:
• 2 packed single-precision FP
• 2 x 32-bits
• sign + exponent + significand
• New instructions
• Speedup potential: 2x
![Page 62: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/62.jpg)
AMD’s 3DNow!
• 21 new instructions
• Average: motivated by MPEG
• Add, Sub, Reverse Sub, Mul
• Accumulate
• (A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2)
• Comparison (create mask)
• Min, Max (pairwise)
• Reciprocal and SQRT
• Approximation: 1st step and other steps
• Prefetch
• Integer from/to FP conversion
• All operate on packed FP data
• sign * 2^(exponent - 127) * mantissa
![Page 63: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/63.jpg)
Recent Extensions Cont.
• Intel’s ISSE• very similar to AMD’s 3DNow!• But has separate registers
• Lessons?• Applications change over time• Careful when introducing new instructions
• How useful are they?• Cost?• LEGACY: are they going to be useful in the future?
• Everyone has their own Multimedia Instruction set these days
• read handout
![Page 64: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/64.jpg)
Intel’s SSE
• Multimedia/Internet?
• 70 new instructions
• Major Types:
• SIMD-FP 128-bit wide 4 x 32-bit FP
• Data movement and re-organization
• Type conversion
• Int to Fp and vice versa• Scalar/FP precision
• State Save/Restore• New SSE registers not like MMX
• Memory Streaming• Prefetch to specified hierarchy level
• New Media• Absolute Diff, Rounded AVG, MIN/MAX
![Page 65: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/65.jpg)
Altivec (PowerPC Mmedia Ext)
• 128-bit registers• 8, 16, or 32 bit data types• Scalar or single-precision FP• 162 Instructions• Saturation or Modulo arithmetic• Four operand Instructions
• 3 sources, 1 target
![Page 66: Single instruction multiple data](https://reader036.vdocuments.us/reader036/viewer/2022081503/589bc5c51a28ab082b8b61a9/html5/thumbnails/66.jpg)
Altivec Design Process
• Look at Mmedia Kernel
• Justify new instructions
• Video
• 8bit int LowQ, 16bit int HighQ
• Audio
• 16bit int LowQ, SP FP HighQ
• Image Processing
• 8bit int LowQ, 16bit int HighQ
• 3D Graphics
• 16bit int LowQ, SP FP HighQ
• Speech Recog.
• 16bit int LowQ, SP FP HighQ
• Communications/Crypto
• 8bit or 16bit unsigned int