9 - cs.uregina.cazhang/802-09/part3.doc · web viewparallel compare greater 8,4,2 all is where...

Intel MMX(TM) Technology

1. Introduction to MMX Multimedia Extension Technology

1.1 Features of the MMX Technology

- MMX technology accelerates multimedia and communication applications by adding new instructions and defining new 64-bit data types.

- MMX technology introduces new general-purpose instructions. These instructions operate in parallel on multiple data elements packed into 64-bit quantities. They accelerate applications with compute-intensive algorithms that perform localized, recurring operations on small native data. These applications include motion video, combined graphics with video, image processing, audio synthesis, speech synthesis and compression, telephony, video conferencing, 2D graphics, and 3D graphics.

- Single Instruction, Multiple Data (SIMD) technique. The MMX technology uses the SIMD technique to speed up software by processing multiple data elements in parallel with a single instruction. The MMX technology supports parallel operations on byte, word, and doubleword data elements, and on the new quadword (64-bit) integer data type.

- 57 new instructions

- Eight 64-bit wide MMX registers (MM0~MM7)

- Four new data types


1.2 How MMX works

To understand Intel's MMX instructions, consider a register file with a number of 64-bit registers. MMX actually reuses the eight floating-point registers (the 64-bit significand field of each) as the MMX integer register file, but this is just a cost-saving design choice. More registers would have increased the performance impact of MMX. A 64-bit MMX register can be viewed as holding a single 64-bit integer or a vector of 2, 4, or 8 progressively narrower integer operands. Various arithmetic, logic, comparison, and rearrangement instructions are provided for each of these vector lengths. Table 1 lists the instruction classes and their effects. The column labeled "Vector" indicates the number of independent vector elements that can be specified in the 64-bit operand.

TABLE 1. Intel MMX instructions.

Class          Instruction                    Vector  Op type            Function or results
-------------  -----------------------------  ------  -----------------  ----------------------------------
Copy           Register copy                          32 bits            Integer register <-> MMX register
               Parallel pack                  4,2     Saturate           Convert to narrower elements
               Parallel unpack low            8,4,2                      Merge lower halves of 2 vectors
               Parallel unpack high           8,4,2                      Merge upper halves of 2 vectors
Arithmetic     Parallel add                   8,4,2   Wrap/Saturate (1)  Add; inhibit carry at boundaries
               Parallel subtract              8,4,2   Wrap/Saturate (1)  Subtract with carry inhibition
               Parallel multiply low          4                          Multiply, keep the 4 low halves
               Parallel multiply high         4                          Multiply, keep the 4 high halves
               Parallel multiply-add          4                          Multiply, add adjacent products (2)
               Parallel compare equal         8,4,2                      All 1s where equal, else all 0s
               Parallel compare greater       8,4,2                      All 1s where greater, else all 0s
Shift          Parallel left shift logical    4,2,1                      Shift left, respect boundaries
               Parallel right shift logical   4,2,1                      Shift right, respect boundaries
               Parallel right shift arith     4,2                        Arith shift within each (half)word
Logic          Parallel AND                   1       Bitwise            dest ← (src1) ∧ (src2)
               Parallel ANDNOT                1       Bitwise            dest ← (src1) ∧ (src2)′
               Parallel OR                    1       Bitwise            dest ← (src1) ∨ (src2)
               Parallel XOR                   1       Bitwise            dest ← (src1) ⊕ (src2)
Memory access  Parallel load MMX reg                  32 or 64 bits      Address given in integer register
               Parallel store MMX reg                 32 or 64 bits      Address given in integer register
Control        Empty FP tag bits                                         Required for compatibility (3)

(1) Wrap simply means dropping the carry-out; saturation may be unsigned or signed.

(2) Four 16-bit multiplications, four 32-bit intermediate results, two 32-bit final results.

(3) Floating-point tag bits help with faster context switching, among other functions.

The parallel pack and unpack instructions are quite interesting and useful. Unpacking allows vector elements to be extended to the next larger width (byte to halfword, halfword to word, word to doubleword) to allow computation of intermediate results with greater precision. The unpacking operations can be specified as follows:

8-vectors, low: xxxxabcd, xxxxefgh --> aebfcgdh
8-vectors, high: abcdxxxx, efghxxxx --> aebfcgdh
4-vectors, low: xxab, xxcd --> acbd
4-vectors, high: abxx, cdxx --> acbd
2-vectors, low: xa, xb --> ab
2-vectors, high: ax, bx --> ab

where i-vector means a vector of length i and letters stand for vector elements in the operands (on the left of the arrow) and results (on the right). As an example, when the first vector is all 0s, this operation effectively doubles the width of the lower or upper elements of the second operand through 0-extension (e.g., 0000, xxcd --> 0c0d). Packing performs the reverse conversion in that it allows returning the results to the original width:

4-vector: a1a0b1b0c1c0d1d0 --> 0000abcd

2-vector: a1a0b1b0 --> 00ab

In filtering, for example, pixel values may have to be multiplied by the corresponding filter coefficients; this usually leads to overflow if elements are not first unpacked.
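The widening and narrowing steps described above can be sketched in portable C. This is only an emulation on a plain 64-bit integer, not real MMX code; the function names are made up for illustration, and the 16-bit elements are treated as unsigned for simplicity.

```c
#include <stdint.h>

/* Zero-extend the four low bytes of v into four 16-bit elements.
   This is the effect of "unpack low" with an all-0s first operand
   (0000, xxcd --> 0c0d in the notation above). */
uint64_t unpack_low_bytes_zero(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (v >> (8 * i)) & 0xFF;
        r |= byte << (16 * i);
    }
    return r;
}

/* Pack four 16-bit elements back into the four low bytes,
   clamping any value above 255 to 255 (unsigned saturation). */
uint64_t pack_words_unsigned_sat(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t word = (v >> (16 * i)) & 0xFFFF;
        if (word > 0xFF)
            word = 0xFF;
        r |= word << (8 * i);
    }
    return r;
}
```

A filter would unpack the pixels, multiply and accumulate in the 16-bit elements, and pack the results back to bytes at the end.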

A key feature of MMX is the capability for saturating arithmetic. With ordinary unsigned arithmetic, overflow causes the apparent result (after dropping the carry-out bit) to become smaller than either operand. This is known as wrapped arithmetic. When applied to arithmetic on pixel attributes, this type of wraparound may lead to anomalies such as bright pixels within regions that are supposed to be fairly dark. With saturating arithmetic, results that exceed the maximum representable value are forced to that maximum value. A saturating unsigned adder, for example, can be built from an ordinary adder and a multiplexer at the output that chooses between the adder result and a constant, depending on the carry-out bit. For signed results, saturating arithmetic can be defined in an analogous way, with the most positive or negative value used depending on the direction of overflow. For example, the results of the wrapped and the saturating unsigned addition of A9 and F3 (hexadecimal) are 9C and FF, respectively.
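The difference between the two kinds of addition can be checked with a small C sketch for a single unsigned byte. The function names are illustrative; the comparison against 0xFF plays the role of the carry-out bit that drives the multiplexer.

```c
#include <stdint.h>

/* Wrapped addition: the carry-out is simply dropped. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturating addition: a sum above 0xFF (carry-out set) selects
   the constant 0xFF instead of the adder output. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (sum > 0xFF) ? 0xFF : (uint8_t)sum;
}
```

With the operands from the text, add_wrap_u8(0xA9, 0xF3) yields 0x9C and add_sat_u8(0xA9, 0xF3) yields 0xFF.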

Arithmetic with subword parallelism requires modifying the ALU to treat 64-bit words in a variety of ways, depending on the vector length. For wrapped addition, this capability is easy to provide and essentially comes for free. Saturating addition requires detection of overflow within subwords and choosing either the adder output or a constant as the operation result within that subword. Multiplication is slightly harder, but still quite cost-effective in terms of circuit implementation (supplying the details is left as an exercise). The effects of parallel multiplication and parallel multiply-add MMX instructions are depicted in Figure 1. Parallel comparison instructions are illustrated in Figure 2.
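To illustrate why wrapped subword addition is cheap, the following C sketch performs eight independent byte additions with an ordinary 64-bit adder by keeping carries from crossing byte boundaries. It is a software analogy of the hardware carry inhibition, not a description of Intel's circuit, and the function name is an assumption.

```c
#include <stdint.h>

/* Add eight packed bytes with wrap-around using one 64-bit addition.
   The top bit of every byte is masked off before adding, so a carry
   can never leave its byte; the top bits are then restored with XOR. */
uint64_t padd_bytes_wrap(uint64_t a, uint64_t b)
{
    const uint64_t msb = 0x8080808080808080ULL;
    uint64_t low7 = (a & ~msb) + (b & ~msb);  /* carries stop at bit 7 of each byte */
    return low7 ^ ((a ^ b) & msb);            /* add the top bits without carry-out */
}
```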

Note that MMX deals exclusively with integer values. A similar capability was added in subsequent extensions to the Intel processors to provide similar speedups with 32- or 64-bit floating-point operands, packed within 128-bit registers. This capability is known as the Streaming SIMD Extensions (SSE); the 64-bit floating-point operands arrived with SSE2.


Figure 1. Parallel multiplication and multiply-add in MMX
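The effect of the parallel multiply and multiply-add instructions can also be sketched in portable C on a 64-bit integer holding four signed 16-bit elements. The helper and function names below are illustrative assumptions, not MMX mnemonics.

```c
#include <stdint.h>

/* Read 16-bit element i (0 = least significant) as a signed value. */
static int32_t word_lane(uint64_t v, int i)
{
    int32_t x = (int32_t)((v >> (16 * i)) & 0xFFFF);
    return (x >= 0x8000) ? x - 0x10000 : x;
}

/* Parallel multiply high: keep the upper 16 bits of each 32-bit product. */
uint64_t pmul_high(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        int32_t p = word_lane(a, i) * word_lane(b, i);
        r |= (uint64_t)(((uint32_t)p >> 16) & 0xFFFF) << (16 * i);
    }
    return r;
}

/* Parallel multiply low: keep the lower 16 bits of each 32-bit product. */
uint64_t pmul_low(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        int32_t p = word_lane(a, i) * word_lane(b, i);
        r |= (uint64_t)((uint32_t)p & 0xFFFF) << (16 * i);
    }
    return r;
}

/* Parallel multiply-add: four products, adjacent pairs summed into
   two 32-bit results (sums wrap around on overflow). */
uint64_t pmul_add(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)(word_lane(a, 0) * word_lane(b, 0))
                + (uint32_t)(word_lane(a, 1) * word_lane(b, 1));
    uint32_t hi = (uint32_t)(word_lane(a, 2) * word_lane(b, 2))
                + (uint32_t)(word_lane(a, 3) * word_lane(b, 3));
    return ((uint64_t)hi << 32) | lo;
}
```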

Figure 2. Parallel comparisons in MMX

2. MMX New Data Types & MMX Registers

2.1 MMX New Data Types

The principal data type of the MMX technology is the packed fixed-point integer. The decimal point of the fixed-point values is implicit and is left for the user to control for maximum flexibility.

The MMX technology defines the following four new 64-bit quantities:

(1) Packed byte: Eight bytes packed into one 64-bit quantity
(2) Packed word: Four words packed into one 64-bit quantity
(3) Packed doubleword: Two doublewords packed into one 64-bit quantity
(4) Quadword: One 64-bit quantity

2.2 MMX Registers

The IA MMX technology provides eight 64-bit, general-purpose registers. The registers are aliased on the floating-point registers. The operating system handles the MMX technology as it would handle floating-point. The MMX registers can hold packed 64-bit data types. The MMX instructions access the MMX registers directly using the register names MM0 to MM7. The MMX registers can be used to perform calculations on data. They cannot be used to address memory; addressing is accomplished by using the integer registers and standard IA addressing modes.
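The packed views of one 64-bit register value listed in 2.1 can be made concrete with a few C accessors. This is a sketch with hypothetical function names; element 0 is taken to be the least significant element.

```c
#include <stdint.h>

/* Packed byte view: element i of eight bytes (i = 0..7). */
uint8_t packed_byte(uint64_t q, int i)
{
    return (uint8_t)(q >> (8 * i));
}

/* Packed word view: element i of four 16-bit words (i = 0..3). */
uint16_t packed_word(uint64_t q, int i)
{
    return (uint16_t)(q >> (16 * i));
}

/* Packed doubleword view: element i of two 32-bit doublewords (i = 0..1). */
uint32_t packed_dword(uint64_t q, int i)
{
    return (uint32_t)(q >> (32 * i));
}
```

The same 64 bits are interpreted differently by each view; the instruction suffix (B, W, D, Q) selects which view an MMX instruction uses.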

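3. MMX Instructions (Total 57) Overview

3.1 Types of Instructions

• Arithmetic: add, subtract, multiply, arithmetic shift and multiply add
• Comparison: compare for equality, compare for greater than
• Logic: AND, AND NOT, OR, and XOR
• Shift: shift left logical, shift right logical, shift right arithmetic
• Conversion: pack, unpack
• Data transfer: move 32 bits (MOVD), move 64 bits (MOVQ)
• EMMS: empty MMX state

3.2 MMX Instructions: Syntax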

• Typical MMX instruction:
-- Prefix: P for Packed
-- Instruction operation: for example, ADD, CMP, XOR
-- Suffix:
   US for Unsigned Saturation
   S for Signed Saturation
   B, W, D, Q for the data type (byte, word, doubleword, quadword)
• Example: PADDUSW (Packed Add Unsigned with Saturation for Word)

3.3 MMX Instructions: Format

• For data transfer instructions:
-- destination and source operands can reside in memory, integer registers, or MMX registers
• For all other MMX instructions:
-- destination operand: MMX register
-- source operand: MMX register, memory, or immediate operand


3.4 MMX Instructions: Conventions

• Operand order: the destination operand is written on the left and the source operand on the right, e.g. PSLLW mm, mm/m64
• Memory address: refers to the least significant byte of the data

3.5 MMX Instructions: Conventions

• Wrap around: on overflow or underflow the result is truncated; only the lower (least significant) bits are returned. The carry is ignored.
• Saturation: on overflow or underflow the result is clipped (saturated) to a data-range limit for the data type.

Data type       Lower limit   Upper limit
signed byte     80H           7FH
signed word     8000H         7FFFH
unsigned byte   00H           FFH
unsigned word   0000H         FFFFH

• e.g. for unsigned byte: E5H + 62H = FFH (saturation); E5H + 62H = 47H (wrap around)

4. MMX Instructions

Arithmetic (PADD, Wrap around)

• PADDB mm, mm/m64. Operation:
mm(7…0) ← mm(7…0) + mm/m64(7…0)
mm(15…8) ← mm(15…8) + mm/m64(15…8)
…
mm(63…56) ← mm(63…56) + mm/m64(63…56)

• PADDW mm, mm/m64. Operation:
mm(15…0) ← mm(15…0) + mm/m64(15…0)
mm(31…16) ← mm(31…16) + mm/m64(31…16)
…
mm(63…48) ← mm(63…48) + mm/m64(63…48)

• PADDD mm, mm/m64. Operation:
mm(31…0) ← mm(31…0) + mm/m64(31…0)
mm(63…32) ← mm(63…32) + mm/m64(63…32)
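The element-by-element definition of PADDW can be mirrored in portable C. The following is a sketch of the semantics only (a lane loop on a 64-bit integer), with a function name chosen for illustration; it is not how the instruction is invoked.

```c
#include <stdint.h>

/* Emulate PADDW: four independent 16-bit additions with wrap-around. */
uint64_t paddw(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        uint16_t sum = (uint16_t)(x + y);   /* carry out of bit 15 is dropped */
        r |= (uint64_t)sum << (16 * i);
    }
    return r;
}
```

Note that the element FFFFH + 0001H becomes 0000H without disturbing the neighboring element.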

Arithmetic (PADD, saturation)

• PADDSB mm, mm/m64. Operation:
mm(7…0) ← SaturateToSignedByte(mm(7…0) + mm/m64(7…0))
mm(15…8) ← SaturateToSignedByte(mm(15…8) + mm/m64(15…8))
…
mm(63…56) ← SaturateToSignedByte(mm(63…56) + mm/m64(63…56))

• PADDSW mm, mm/m64. Operation:
mm(15…0) ← SaturateToSignedWord(mm(15…0) + mm/m64(15…0))
mm(31…16) ← SaturateToSignedWord(mm(31…16) + mm/m64(31…16))
…
mm(63…48) ← SaturateToSignedWord(mm(63…48) + mm/m64(63…48))
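A possible C rendering of SaturateToSignedByte and of PADDSB built on it is shown below. The names sat_s8 and paddsb are illustrative; the limits 7FH and 80H are those of the signed byte type.

```c
#include <stdint.h>

/* Clamp a wide intermediate result to the signed byte range. */
static int sat_s8(int v)
{
    if (v > 127)
        return 127;    /* 7FH */
    if (v < -128)
        return -128;   /* 80H */
    return v;
}

/* Emulate PADDSB: eight independent signed byte additions with saturation. */
uint64_t paddsb(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        int x = (int)((a >> (8 * i)) & 0xFF);
        int y = (int)((b >> (8 * i)) & 0xFF);
        if (x >= 0x80) x -= 0x100;   /* reinterpret as signed */
        if (y >= 0x80) y -= 0x100;
        r |= (uint64_t)(uint8_t)sat_s8(x + y) << (8 * i);
    }
    return r;
}
```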

Arithmetic

• Packed Add Unsigned with Saturation
--- PADDUSB mm, mm/m64
--- PADDUSW mm, mm/m64
• Subtraction:
--- PSUB[B,W,D] mm, mm/m64 (Wrap Around)
--- PSUBS[B,W] mm, mm/m64 (Signed Saturation)
--- PSUBUS[B,W] mm, mm/m64 (Unsigned Saturation)
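Unsigned saturating subtraction clamps at the lower limit 00H instead of wrapping. The sketch below emulates PSUBUSB in portable C and shows one common use of it, the per-byte absolute difference; both function names are assumptions made for this example.

```c
#include <stdint.h>

/* Emulate PSUBUSB: eight unsigned byte subtractions, clamped at 0. */
uint64_t psubusb(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = (unsigned)((a >> (8 * i)) & 0xFF);
        unsigned y = (unsigned)((b >> (8 * i)) & 0xFF);
        unsigned d = (x > y) ? x - y : 0;   /* underflow saturates to 00H */
        r |= (uint64_t)d << (8 * i);
    }
    return r;
}

/* |a - b| per byte: one of the two saturated differences is always 0,
   so OR-ing them yields the absolute difference. */
uint64_t abs_diff_bytes(uint64_t a, uint64_t b)
{
    return psubusb(a, b) | psubusb(b, a);
}
```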


• Packed Multiply and Add
--- PMADDWD mm, mm/m64: Multiply the signed packed words in the MMX register by the signed packed words in MMX reg/memory. Add the 32-bit products pairwise and store the two sums in the MMX register as doublewords.

• Packed Multiply High
--- PMULHW mm, mm/m64: Multiply the signed packed words in the MMX register by the signed packed words in MMX reg/memory, then store the high-order 16 bits of each result in the MMX register.
mm(15…0) ← (mm(15…0) * mm/m64(15…0)) (31…16);
mm(31…16) ← (mm(31…16) * mm/m64(31…16)) (31…16);
mm(47…32) ← (mm(47…32) * mm/m64(47…32)) (31…16);
mm(63…48) ← (mm(63…48) * mm/m64(63…48)) (31…16);

• Packed Multiply Low
--- PMULLW mm, mm/m64: Multiply the signed packed words in the MMX register by the signed packed words in MMX reg/memory, then store the low-order 16 bits of each result in the MMX register.
mm(15…0) ← (mm(15…0) * mm/m64(15…0)) (15…0);
mm(31…16) ← (mm(31…16) * mm/m64(31…16)) (15…0);
mm(47…32) ← (mm(47…32) * mm/m64(47…32)) (15…0);
mm(63…48) ← (mm(63…48) * mm/m64(63…48)) (15…0);

Comparison


• Packed Compare for Equality [byte, word, doubleword]
--- PCMPEQB mm, mm/m64: each byte becomes 0xff if equal, else 0
--- PCMPEQW mm, mm/m64: each word becomes 0xffff if equal, else 0
--- PCMPEQD mm, mm/m64: each doubleword becomes 0xffffffff if equal, else 0
• Packed Compare for Greater Than (signed) [byte, word, doubleword]
--- PCMPGT[B, W, D] mm, mm/m64
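The mask-producing behavior of the compare instructions can be sketched for the word variants as follows. This is a portable C emulation with illustrative function names; the greater-than comparison treats the elements as signed.

```c
#include <stdint.h>

/* Read 16-bit element i as a signed value. */
static int32_t signed_word(uint64_t v, int i)
{
    int32_t x = (int32_t)((v >> (16 * i)) & 0xFFFF);
    return (x >= 0x8000) ? x - 0x10000 : x;
}

/* Emulate PCMPEQW: 0xFFFF where elements are equal, else 0. */
uint64_t pcmpeqw(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        if (signed_word(a, i) == signed_word(b, i))
            r |= (uint64_t)0xFFFF << (16 * i);
    return r;
}

/* Emulate PCMPGTW: 0xFFFF where a's element is greater (signed), else 0. */
uint64_t pcmpgtw(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        if (signed_word(a, i) > signed_word(b, i))
            r |= (uint64_t)0xFFFF << (16 * i);
    return r;
}
```

Because the result is a mask rather than a flag, it can be fed directly into the logic instructions to select elements without branching.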

Logic

• Bit-wise Logical Exclusive OR
--- PXOR mm, mm/m64
mm ← mm XOR mm/m64

• Bit-wise Logical AND
--- PAND mm, mm/m64
mm ← mm AND mm/m64

• Bit-wise Logical AND NOT
--- PANDN mm, mm/m64
mm ← (NOT mm) AND mm/m64

• Bit-wise Logical OR
--- POR mm, mm/m64
mm ← mm OR mm/m64
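One typical use of these logic operations is a branch-free select driven by an all-1s/all-0s mask such as a packed compare produces. The C sketch below follows the operand order given above for PANDN (the destination is the operand that is inverted); the function names are illustrative.

```c
#include <stdint.h>

/* Emulate PANDN: dest <- (NOT dest) AND src. */
uint64_t pandn(uint64_t dest, uint64_t src)
{
    return ~dest & src;
}

/* Per-element select: take a where the mask is all 1s, b where it is all 0s.
   Corresponds to PAND + PANDN + POR on real hardware. */
uint64_t select_by_mask(uint64_t mask, uint64_t a, uint64_t b)
{
    return (mask & a) | pandn(mask, b);
}
```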

Shift

• Packed shift left logical (shifting in zeros)
--- PSLL[W, D, Q] mm, mm/m64

• Packed shift right logical (shifting in zeros)
--- PSRL[W, D, Q] mm, mm/m64

• Packed shift right arithmetic (shifting in sign bits)
--- PSRA[W, D] mm, mm/m64

The shift count may also be given as an 8-bit immediate.
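The difference between the logical and arithmetic right shifts shows up only for elements whose sign bit is set. The following portable C sketch emulates the word variants; the function names are chosen for illustration, and counts above 15 are handled the way the hardware does (all zeros for the logical shift, all sign bits for the arithmetic shift).

```c
#include <stdint.h>

/* Emulate PSRLW: logical right shift of each 16-bit element. */
uint64_t psrlw(uint64_t v, unsigned count)
{
    uint64_t r = 0;
    if (count > 15)
        return 0;                       /* every bit is shifted out */
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(v >> (16 * i));
        r |= (uint64_t)(uint16_t)(x >> count) << (16 * i);
    }
    return r;
}

/* Emulate PSRAW: arithmetic right shift of each 16-bit element. */
uint64_t psraw(uint64_t v, unsigned count)
{
    uint64_t r = 0;
    if (count > 15)
        count = 15;                     /* result is all sign bits */
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(v >> (16 * i));
        /* vacated high bits are filled with copies of the sign bit */
        uint16_t fill = (x & 0x8000) ? (uint16_t)~(0xFFFFu >> count) : 0;
        r |= (uint64_t)(uint16_t)((x >> count) | fill) << (16 * i);
    }
    return r;
}
```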

Conversion

• Pack with unsigned saturation
--- PACKUSWB mm, mm/m64: Pack and saturate signed words from the MMX register and MMX register/memory into unsigned bytes in the MMX register.
mm(7…0) ← SaturateSignedWordToUnsignedByte mm(15…0);
mm(15…8) ← SaturateSignedWordToUnsignedByte mm(31…16);
mm(23…16) ← SaturateSignedWordToUnsignedByte mm(47…32);
mm(31…24) ← SaturateSignedWordToUnsignedByte mm(63…48);
mm(39…32) ← SaturateSignedWordToUnsignedByte mm/m64(15…0);
mm(47…40) ← SaturateSignedWordToUnsignedByte mm/m64(31…16);
mm(55…48) ← SaturateSignedWordToUnsignedByte mm/m64(47…32);
mm(63…56) ← SaturateSignedWordToUnsignedByte mm/m64(63…48);

• Pack with signed saturation
--- PACKSSWB mm, mm/m64: Pack and saturate signed words from the MMX register and MMX register/memory into signed bytes in the MMX register.
mm(7…0) ← SaturateSignedWordToSignedByte mm(15…0);
mm(15…8) ← SaturateSignedWordToSignedByte mm(31…16);
mm(23…16) ← SaturateSignedWordToSignedByte mm(47…32);
mm(31…24) ← SaturateSignedWordToSignedByte mm(63…48);
mm(39…32) ← SaturateSignedWordToSignedByte mm/m64(15…0);
mm(47…40) ← SaturateSignedWordToSignedByte mm/m64(31…16);
mm(55…48) ← SaturateSignedWordToSignedByte mm/m64(47…32);
mm(63…56) ← SaturateSignedWordToSignedByte mm/m64(63…48);

• Pack with signed saturation
--- PACKSSDW mm, mm/m64: Pack and saturate signed dwords from the MMX register and MMX register/memory into signed words in the MMX register.
mm(15…0) ← SaturateSignedDwordToSignedWord mm(31…0);
mm(31…16) ← SaturateSignedDwordToSignedWord mm(63…32);
mm(47…32) ← SaturateSignedDwordToSignedWord mm/m64(31…0);
mm(63…48) ← SaturateSignedDwordToSignedWord mm/m64(63…32);

• Unpack High Packed Data
--- PUNPCKH[BW, WD, DQ] mm, mm/m64: Unpack and interleave the high-order data elements of the destination and source operands into the destination operand. The low-order elements are ignored. E.g. PUNPCKHWD (right-hand sides refer to the operand values before the instruction):
mm(63…48) ← mm/m64(63…48);
mm(47…32) ← mm(63…48);
mm(31…16) ← mm/m64(47…32);
mm(15…0) ← mm(47…32);

• Unpack Low Packed Data
--- PUNPCKL[BW, WD, DQ] mm, mm/m64: Unpack and interleave the low-order data elements of the destination and source operands into the destination operand. The high-order elements are ignored. E.g. PUNPCKLWD (right-hand sides refer to the operand values before the instruction):
mm(63…48) ← mm/m64(31…16);
mm(47…32) ← mm(31…16);
mm(31…16) ← mm/m64(15…0);
mm(15…0) ← mm(15…0);
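The interleaving order of the word unpack instructions is easy to get wrong, so a small C emulation may help. It follows the assignments listed above (source elements land in the more significant position of each pair); the function names are simply the mnemonics in lower case and are used here for illustration only.

```c
#include <stdint.h>

/* Emulate PUNPCKLWD: interleave the two low words of dest and src. */
uint64_t punpcklwd(uint64_t dest, uint64_t src)
{
    uint64_t d0 = dest & 0xFFFF, d1 = (dest >> 16) & 0xFFFF;
    uint64_t s0 = src & 0xFFFF,  s1 = (src >> 16) & 0xFFFF;
    return (s1 << 48) | (d1 << 32) | (s0 << 16) | d0;
}

/* Emulate PUNPCKHWD: the same interleave applied to the two high words. */
uint64_t punpckhwd(uint64_t dest, uint64_t src)
{
    return punpcklwd(dest >> 32, src >> 32);
}
```

Passing 0 as src zero-extends the words of dest to doublewords.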


Data Transfer

• Move 32 bits
--- MOVD mm, r/m32: move 32 bits from integer register/memory to MMX register
mm(63…0) ← ZeroExtend(r/m32);
--- MOVD r/m32, mm: move 32 bits from MMX register to integer register/memory
r/m32 ← mm(31…0);

• Move 64 bits
--- MOVQ mm, mm/m64: move 64 bits from MMX register/memory to MMX register
mm ← mm/m64;
--- MOVQ mm/m64, mm: move 64 bits from MMX register to MMX register/memory
mm/m64 ← mm;

Instruction Samples

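e.g.

MOVD MM0, EAX ; MM0 ← zero-extended EAX
PSLLQ MM0, 32 ; move it to the upper half of MM0
MOVD MM1, EBX ; MM1 ← zero-extended EBX
POR MM0, MM1 ; MM0 now holds EAX in bits 63…32 and EBX in bits 31…0

MOVQ MM2, MM3 ; copy MM3
PSLLQ MM3, 1 ; shift the original left by one bit
PXOR MM3, MM2 ; MM3 ← (MM3 shifted left by 1) XOR (old MM3)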
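The effect of the two instruction sequences can be traced with ordinary 64-bit integer operations in C. This is an emulation for checking the data flow, with variable names that mirror the registers and function names that are assumptions.

```c
#include <stdint.h>

/* First sample: build one 64-bit value from two 32-bit registers. */
uint64_t combine_eax_ebx(uint32_t eax, uint32_t ebx)
{
    uint64_t mm0 = eax;   /* MOVD MM0, EAX (zero-extends) */
    mm0 <<= 32;           /* PSLLQ MM0, 32 */
    uint64_t mm1 = ebx;   /* MOVD MM1, EBX */
    mm0 |= mm1;           /* POR MM0, MM1 */
    return mm0;
}

/* Second sample: XOR a value with itself shifted left by one bit. */
uint64_t xor_with_shifted(uint64_t mm3)
{
    uint64_t mm2 = mm3;   /* MOVQ MM2, MM3 */
    mm3 <<= 1;            /* PSLLQ MM3, 1 */
    mm3 ^= mm2;           /* PXOR MM3, MM2 */
    return mm3;
}
```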

5. MMX Code Optimization

5.1 Code Optimization Guidelines

• Use the current compiler
• Do not intermix MMX instructions and FP instructions
• Use the opcode reg, mem instruction format whenever possible
• Put an EMMS instruction at the end of all MMX code sections that will transition to FP code
• Optimize data cache bandwidth to the MMX registers

5.2 Accessing Memory

• Pentium II and III:
-- opcode reg, mem (2 micro-ops)
-- opcode reg, reg (1 micro-op)
• Recommend: merge loads whenever the same address is used more than twice (code that is not memory-bound).
• Recommend: merge loads whenever the same address is used more than once (memory-bound code).
• Change MOVQ reg, reg and opcode reg, mem to MOVQ reg, mem and opcode reg, reg to save one micro-op.


6. Programming Tools and Examples

6.1 Programming Tools

• MASM 6.11 or above, with the 6.14 patch (install ML614.exe).

• VC++ 6.0 can compile key functions written in assembly language that include MMX instructions.

• Some C/C++ compilers also include the MASM tool.

6.2 Programming Examples

// Name: cpu_test.c
// Purpose: to test some MMX instructions
// Note: uses the 32-bit VC++ inline assembler (_asm).

#include <stdio.h>

// to test if the CPU is MMX compatible
int cpu_test(void);

// to shift x left by 16 bits and append the low 8 bits of y; returns the result
unsigned int MMX_test(unsigned int x, unsigned int y);

// Main function for the program
int main(void)
{
    unsigned int x = 0x12345678;
    unsigned int y = 0x99999999;
    int found_MMX = cpu_test();

    if (found_MMX == 1)
        printf("This CPU supports MMX technology\n");
    else
        printf("This CPU does NOT support MMX technology\n");

    // test the MMX instructions
    printf("The original value of x is 0x%x\n", x);
    printf("The original value of y is 0x%x\n", y);
    x = MMX_test(x, y);
    printf("After left shifting x by 16 bits and appending the\n");
    printf("low 8 bits of y, the value of x is 0x%x\n", x);
    return 0;
}

// Function Name: cpu_test
// Return: 1 if the CPU supports MMX, otherwise 2
int cpu_test(void)
{
    _asm {
        // test if the CPU supports MMX
        mov eax, 1
        cpuid
        test edx, 00800000h     // bit 23 of EDX is the MMX feature flag
        jnz found
        mov eax, 2
        jmp done
    found:
        mov eax, 1
    done:
        // no EMMS here: no MMX instruction has been executed, and EMMS
        // would fault on a CPU without MMX support
    }
    // Return with result in EAX
}

// Function Name: MMX_test
// Parameters: two unsigned integers x and y
// Purpose: to test some MMX instructions
// Return: x shifted left by 16 bits with the low 8 bits of y appended
unsigned int MMX_test(unsigned int x, unsigned int y)
{
    _asm {
        mov eax, x
        mov ebx, y
        movd mm0, eax
        mov eax, 0xff
        movd mm2, eax
        psllq mm0, 16
        movd mm1, ebx
        pand mm1, mm2
        por mm0, mm1
        movd eax, mm0
        emms                    // leave the MMX state clean for later FP code
    }
    // Return with result in EAX
}

Results:

This CPU supports MMX technology
The original value of x is 0x12345678
The original value of y is 0x99999999
After left shifting x by 16 bits and appending the
low 8 bits of y, the value of x is 0x56780099
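For readers without a 32-bit VC++ toolchain, the data flow of MMX_test can be reproduced in portable C. The function below is a stand-in written for this purpose (its name is an assumption); it mirrors the instruction sequence step by step and reproduces the value shown in the results.

```c
#include <stdint.h>

/* Portable emulation of MMX_test: each statement mirrors one instruction. */
uint32_t mmx_test_emulated(uint32_t x, uint32_t y)
{
    uint64_t mm0 = x;        /* movd mm0, eax (zero-extends x) */
    uint64_t mm2 = 0xff;     /* movd mm2, eax with eax = 0xff */
    mm0 <<= 16;              /* psllq mm0, 16 */
    uint64_t mm1 = y;        /* movd mm1, ebx */
    mm1 &= mm2;              /* pand mm1, mm2: keep the low 8 bits of y */
    mm0 |= mm1;              /* por mm0, mm1 */
    return (uint32_t)mm0;    /* movd eax, mm0: only the low 32 bits are returned */
}
```

The upper 16 bits of x (0x1234) end up in bits 47…32 of the 64-bit value and are discarded by the final 32-bit move, which is why the result is 0x56780099.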

