Measuring the Performance of Multimedia Instruction Sets

Nathan Slingerland ([email protected]), Apple Computer, Inc.
Alan Jay Smith ([email protected]), University of California at Berkeley

Funding for this research has been provided by the State of California under the MICRO program, and by Cisco Corporation, Fujitsu Microelectronics, IBM, Intel Corporation, Maxtor Corporation, Microsoft Corporation, Sun Microsystems, Toshiba Corporation and Veritas Software Corporation.

Abstract

Many microprocessor instruction sets include instructions for accelerating multimedia applications such as DVD playback, speech recognition and 3-D graphics. Despite general agreement on the need to support this emerging workload, there are considerable differences between the instruction sets that have been designed to do so. In this paper we study the performance of five instruction sets on kernels extracted from a broad multimedia workload. We compare the performance of contemporary implementations of each extension against each other as well as to the original compiled C performance. From our analysis we determine how well multimedia workloads map to current instruction sets, noting what was useful and what was not. We also propose two enhancements: fat subwords and strided memory operations.

Keywords: SIMD, subword parallel, multimedia, performance, measurement, MMX, SSE, AltiVec, VIS, MVI

1. Introduction

Specialized instructions have been introduced by microprocessor vendors to support the unique computational demands of multimedia applications. The mismatch between wide data paths and the relatively short data types found in multimedia applications has led the industry to embrace SIMD (single instruction, multiple data) style processing. Unlike traditional forms of SIMD computing, in which multiple individual processors execute the same instruction, multimedia instructions are executed by a single processor and pack multiple short data elements into a single wide (64- or 128-bit) register, with all of the sub-elements being operated on in parallel.

The goal of this paper is to quantify how architectural differences between multimedia instruction sets translate into differences in performance. Prior studies have primarily focused on a single instruction set in isolation and have measured the performance of kernels taken from vendor-provided libraries [1], [5], [8], [25], [26], [30], [32]. Our contribution is unique in that we do not focus exclusively on a single architecture, and we study the performance of kernels derived from a real, measured, general-purpose multimedia workload. Our results are obtained from actual hardware measurements rather than through simulation, which lends confidence to our conclusions.

Section 2 summarizes the multimedia workload we studied, and details the sixteen computationally important kernels which we extracted from it. Our methodology for recoding the kernels with multimedia instructions and measuring them is described in Section 3. An overview of the five instruction sets, and their implementations, is given in Section 4.

Our analysis is divided into two parts. Section 5 reflects our experience in coding the kernels, and lends insight into those instruction set features that we found useful. In Section 6 we compare the performance of five different processors, each of which implements a particular multimedia instruction set. Section 7 proposes two new directions for multimedia architectures on general purpose microprocessors: fat subwords and strided memory operations. Readers who are interested in a much more detailed discussion of the material presented here should see [36].

2. Workload

The lack of a standardized multimedia benchmark has meant that workload selection is the most difficult aspect of any study of multimedia. It was for this reason that we developed the Berkeley multimedia workload, which is described in [35] (we recommend this paper to any reader concerned about how applications were selected or the technical merit and relevance of the Berkeley multimedia workload). It details why we felt existing multimedia benchmarks were unsuitable, develops and characterizes our workload, and extracts the kernels that we study here.

In selecting the component applications, we strove to cover as many types of media processing as possible: image compression (DjVu, JPEG), 3-D graphics (Mesa, POVray), document rendering (Ghostscript), audio synthesis (Timidity), audio compression (ADPCM, LAME, mpg123), video compression (MPEG-2 at DVD and HDTV resolutions), speech synthesis (Rsynth), speech compression (GSM), speech recognition (Rasta) and video games (Doom). Open source software was used both for its portability (allowing for cross-platform comparisons) and the fact that we could analyze the source code directly.

3. Methodology

3.1. Berkeley Multimedia Kernel Library

In order to measure the performance of the multimedia instruction sets, we distilled from our Berkeley multimedia workload a set of computationally important kernel functions, which when taken together form the Berkeley multimedia kernel library (BMKL). Kernels were chosen based on their computational significance and amenability to hand optimization with SIMD instructions; [35] discusses the kernel selection process in detail. Table 1 lists the characteristics of the kernel codes examined.

Kernel Name                | Source Application | Data Type  | Sat Arith | Native Width | Src Lines | Static Instr | %Static Instr | %CPU Cycles
---------------------------|--------------------|------------|-----------|--------------|-----------|--------------|---------------|------------
Add Block                  | MPEG-2 Decode      | 8-bit (U)  | yes       | 64-bits      | 46        | 191          | 0.1%          | 13.7%
Block Match                | MPEG-2 Encode      | 8-bit (U)  |           | 128-bits     | 52        | 294          | 0.4%          | 59.8%
Clip Test and Project      | Mesa               | FP         |           | limited      | 95        | 447          | 0.1%          | 0.8%
Color Space Conversion     | JPEG Encode        | 8-bit (U)  |           | limited      | 22        | 78           | 0.1%          | 9.8%
DCT                        | MPEG-2 Encode      | 16-bit (S) |           | 128-bits     | 14        | 116          | 0.1%          | 12.3%
FFT                        | LAME               | FP         |           | limited      | 208       | 981          | 4.4%          | 14.5%
Inverse DCT                | MPEG-2 Decode      | 16-bit (S) | yes       | 128-bits     | 75        | 649          | 0.3%          | 29.7%
Max Value                  | LAME               | 32-bit (S) |           | unlimited    | 8         | 39           | 0.2%          | 12.0%
Mix                        | Timidity           | 16-bit (S) | yes       | unlimited    | 143       | 829          | 1.0%          | 35.7%
Quantize                   | LAME               | FP         |           | unlimited    | 55        | 312          | 1.4%          | 15.3%
Short Term Analysis Filter | GSM Encode         | 16-bit (S) | yes       | 128-bits     | 15        | 79           | 0.1%          | 20.2%
Short Term Synth Filter    | GSM Decode         | 16-bit (S) | yes       | 128-bits     | 15        | 114          | 0.2%          | 72.7%
Subsample Horizontal       | MPEG-2 Encode      | 8-bit (U)  | yes       | 88-bits      | 35        | 244          | 0.3%          | 2.6%
Subsample Vertical         | MPEG-2 Encode      | 8-bit (U)  | yes       | unlimited    | 76        | 478          | 0.6%          | 2.1%
Synthesis Filtering        | Mpg123             | FP         | yes       | 512-bits     | 67        | 348          | 0.4%          | 39.6%
Transform and Normalize    | Mesa               | FP         |           | limited      | 51        | 354          | 0.1%          | 0.7%

Table 1. Multimedia Kernels Studied - from left to right, the columns list 1) primary data type, specified as N-bit ({Unsigned, Signed}) integer or floating point (FP), 2) whether saturating arithmetic is used, 3) native width: the longest width which does not load/store/compute excess unused elements, 4) static C source line count (for those lines which are executed), 5) static instruction count (for those instructions which are executed), 6) percentage of total static instructions, 7) percentage of total CPU time spent in the kernel. The latter three statistics are machine specific and are for the original C code on a Compaq DS20 (dual 500 MHz Alpha 21264, Tru64 Unix v5.0 Rev. 910) machine.

From a performance standpoint, no piece of code can realistically be extracted and studied in isolation. All of the parent applications in the Berkeley multimedia workload were modified to make calls to the BMKL rather than their original internal functions. By measuring our kernel codes from within real applications we could realistically include the effects of the shared resources of our test systems (e.g. CPU, caches, TLB, memory, and registers).

3.2. Coding Process

Ideally, high level language compilers would be able to systematically and automatically identify parallelizable sections of code and generate the appropriate SIMD instruction sequence. The lack of languages which allow programmers to specify data types and overflow semantics at variable declaration time has hindered the development of automated compiler support for multimedia instruction sets [10].

As is the case with digital signal processors (DSPs), the most efficient way to program with multimedia extensions is to have an expert programmer tune software using assembly language [19]. Although this is more tedious and error prone than other methods such as hand coded standard shared libraries or automatic code generation by a compiler, it is a method that is available on every platform, and allows for the greatest flexibility and precision when coding.

We chose to code in assembly language ourselves rather than measure vendor supplied optimized libraries because the latter would make a fair cross-platform comparison next to impossible. Each vendor implements a different set of functions which presume particular (and likely different) data formats. For example, a vendor supplied color space conversion function might assume a band-interleaved format when the program to be accelerated uses a band-separated structure. Some overhead would then be needed to transform data structures into the expected format, perturbing the system under test and adding unnecessary overhead.

All of the assembly codes in our study were written by the same programmer (Slingerland) with an approximately equal amount of time spent on each platform. Coding ourselves had the supplemental benefit of preventing differences in programmer ability or time spent coding from potentially skewing our cross-vendor comparison. This also gave us considerable insight into the reasons for performance differences, where we would otherwise be limited to simply reporting our measurements. It is important to stress that the kernel algorithms and applications in question were studied in considerable detail before assembly coding began. Our level of understanding made it possible to play the role of applications expert and thereby make educated choices about SIMD algorithms, sufficient precision and required data types; this understanding is reflected in the development and characterization of our workload in [34].

In some cases it was not practical to code a SIMD version of a kernel if an instruction set lacked the requisite functionality. Sun's VIS and DEC's MVI do not support packed floating point, so floating point kernels were not recoded for those platforms. DEC's MVI does not support any data communication (e.g. permute, mix, merge) or subword parallel integer multiply instructions. In general, if there was no compelling opportunity for performance gain, the C version was considered to be a particular platform's "solution" to a kernel when the needed SIMD operations were not provided. Situations which called for a C solution are clearly denoted in our results, and the typically lesser performance of compiler generated code must be kept in mind when comparing against the other (hand-tuned) kernels.

3.2.1. C Compilers and Optimization Flags

The most architecturally tuned compiler on each architecture was used to compile the C reference version of the Berkeley multimedia kernel library. The optimization flags used were those which gave the best general speedup and highest level of optimization without resorting to tuning the compiler flags for each particular kernel. The specific compiler and associated optimization flags used on each platform are listed in [36].

4. Instruction Sets

In this paper we study five architectures with different multimedia instruction sets (Table 2 lists the parameters for the exact parts). Many of the instruction sets (AMD's 3DNow!, DEC's MVI, Intel's MMX, Sun's VIS) use 64-bit wide registers, while Motorola's AltiVec and Intel's SSE are 128 bits wide. The size of the register file available on each architecture varies widely, ranging from eight 64-bit registers on the x86 based AMD Athlon to thirty-two 128-bit wide registers with Motorola's AltiVec extension.

Parameter                      | AMD Athlon          | DEC Alpha 21264 | Intel Pentium III | Motorola 7400 (G4) | Sun UltraSPARC IIi
-------------------------------|---------------------|-----------------|-------------------|--------------------|-------------------
Clock (MHz)                    | 500                 | 500             | 500               | 500                | 360
SIMD Extension(s)              | MMX/3DNow!+         | MVI             | MMX/SSE           | AltiVec            | VIS
SIMD Instructions              | 57/24               | 13              | 57/70             | 162                | 121
First Shipped                  | 6/1999              | 2/1998          | 2/1999            | 8/1999             | 11/1998
Transistors (×10^6)            | 22.0                | 15.2            | 9.5               | 6.5                | 5.8
Process (µm) [Die Size (mm^2)] | 0.25 [184]          | 0.25 [225]      | 0.20 [105]        | 0.20 [83]          | 0.25 [148]
L1 $I Cache, $D Cache (KB)     | 64, 64              | 64, 64          | 16, 16            | 32, 32             | 32, 64
L2 Cache (MB)                  | 0.5                 | 4               | 0.5               | 1                  | 2
Register File (# × width)      | FP(8x64b)/FP(8x64b) | Int (31x64b)    | FP(8x64b)/8x128b  | 32x128b            | FP(32x64b)
Inst. Reorder Buffer Entries   | 72                  | 35              | 40                | 6                  | 12
SIMD Int, FP Multimedia Units  | 2 (combined)        | 2, 0            | 2 (combined)      | 3, 1               | 2, 0
SIMD Int Add Latency           | 2 [64b/cycle]       | -               | 1 [64b/cycle]     | 1 [128b/cycle]     | 1 [64b/cycle]
SIMD Int Multiply Latency      | 3 [64b/cycle]       | -               | 3 [64b/cycle]     | 3 [128b/cycle]     | 3 [64b/cycle]
SIMD FP Add Latency            | 4 [64b/cycle]       | -               | 4 [64b/cycle]     | 4 [128b/cycle]     | -
SIMD FP Multiply Latency       | 4 [64b/cycle]       | -               | 5 [64b/cycle]     | 4 [128b/cycle]     | -
SIMD FP ~1/sqrt Latency        | 3 [32b/cycle]       | -               | 2 [64b/cycle]     | 4 [128b/cycle]     | -

Table 2. Microprocessors Studied - All parameters are for the actual chips used in this study. A SIMD register file may be shared with existing integer (Int) or floating point (FP) registers, or be separate. Note that some of the processors implement several multimedia extensions (e.g. the Pentium III has MMX and SSE); the corresponding parameters for each are separated with "/". A dash (-) indicates that a particular operation is not available on a given architecture, and so no latency and throughput numbers are given. DEC's MVI extension (on the 21264) does not include any of the listed operations, but all MVI instructions have a latency of 3 cycles [64b/cycle]. [2], [3], [4], [6], [7], [9], [12], [13], [16], [18], [24], [27], [28], [40]

All of the multimedia extensions support integer operations, although the types and widths of available operations vary greatly. The earliest multimedia instruction sets (e.g. Sun's VIS and DEC's MVI) had design goals limited by the immaturity of their target workload and its unproven benefit to performance. Any approach which greatly modified the overall architecture or significantly affected die size was out of the question, so leveraging as much functionality as possible from the existing chip architecture was a high priority. Thus, the Sun and DEC extensions do not implement subword parallel floating point instructions. Intel's first multimedia extension, MMX (1997), had as one of its primary design goals to not require any new operating system support, while Intel's later SSE (1999) added new operating system maintained state (eight 128-bit wide registers) for the first time since the introduction of the Intel386 instruction set in 1985.

The processors studied (Table 2) vary in terms of instruction latency as well as throughput per clock cycle. Processor clock rates were all 500 MHz, with the exception of the Sun UltraSPARC IIi, for which only a 360 MHz system was available. Although multimedia extensions primarily focus on extracting data level parallelism, most modern microprocessors are also superscalar, and thereby allow multiple multimedia instructions to be issued every cycle. All of the architectures we look at are fully pipelined, so barring any data dependencies, one new SIMD instruction can begin per parallel functional unit each clock cycle. Typically, there are separate functional units for SIMD integer and SIMD floating point processing, although on the x86 architectures they are combined. [35] surveys current multimedia instruction sets in more detail.

5. Instruction Set Analysis

This section encompasses our experience coding each of the sixteen kernels in five different multimedia extensions. We point out those instruction set features (data types and operations) that we found to be useful in accelerating our workload, as well as those for which we found no application or that created significant bottlenecks in current multimedia architectures. Readers primarily interested in our performance measurements of specific implementations of these instruction sets can skip ahead to Section 6 and return to this section at their leisure. The complete original C source code for each kernel, as well as the assembly language SIMD implementations of the BMKL, are available on the web at http://www.cs.berkeley.edu/~slingn/research/. Please contact the authors by email should this URL ever become invalid.

5.1. Register File and Data Path

Multimedia instruction sets can be broadly categorized according to the location and geometry of the register file upon which SIMD instructions operate. Alternatives include the reuse of the existing integer or floating point register files, or implementing an entirely separate one. The type of register file affects the width and therefore the number of subwords that can be operated on simultaneously (the vector length).

Implementing multimedia instructions on the integer data path has the advantage that the functional units for shift and logical operations need not be replicated. Subword parallel addition and subtraction are easily created by blocking the appropriate carry bits. However, modifications to the integer data path to accommodate multimedia instructions can potentially adversely affect the critical path of the integer pipeline [19]. In addition, on architectures such as x86 (AMD, Intel) and PowerPC (Motorola) the integer data path is 32 bits wide, making a shared integer data path approach less compelling due to the limited amount of data level parallelism possible.

The reuse of floating point rather than integer registers has the advantage that they need not be shared with pointers, loop counters, and other scalar variables. In addition, multimedia and floating point instructions are not typically used simultaneously [19], so there is seldom any added register pressure. Finally, many architectures support at least double-precision (64-bit wide) operations, which is often wider than the integer data path.

A separate data path that is not shared with existing integer or floating point data paths has the advantage of simpler pipeline control and potentially an increased number of registers. Disadvantages include the need for saving and restoring the new registers during context switches, as well as the relative difficulty and high overhead of moving data between register files. This can potentially limit performance if scalar code depends upon a vector result.

A significant design parameter for SIMD instruction sets is vector length (the number of packed elements or subwords that can be operated on simultaneously). A vector length that is too short limits the ability to exploit data parallelism, while an excessively long vector length can degrade performance due to insufficient available data parallelism in an algorithm, as well as increase the amount of clean-up code overhead. Table 1 classifies the "native width" of each of the kernels:

unlimited width - data elements are truly independent and are naturally arranged so as to be amenable to SIMD processing. Program loops can be strip mined at almost any length, with the increase in performance of a longer vector being directly proportional to the increase in vector length.

limited width - data elements are independent, but there is some overhead required to rearrange input data so that it may be operated on in a SIMD manner. The performance advantage of longer vectors is limited by the overhead (which typically increases with vector length) required to employ them.

exact width - a precise natural width which can be considered to be the right match for the kernel. This width is the longest width that does not load, store, or compute unused elements.

Most multimedia kernels either have unlimited or limited native widths, or, most often, an exact required width of 128 bits. The remaining exact native widths (64, 88 and 512 bits) each come up only once. Thus, we consider a total vector length of 128 bits to be optimal for most multimedia algorithms.

5.1.1. Number of Registers

Multimedia applications and their kernels can generally take advantage of quite large register files. This rule of thumb is echoed in the architectures of dedicated media processors such as MicroUnity's chip, which has a 64 x 64-bit register file (also accessible as 128 x 32-bits) [11], while the Philips Trimedia TM-1 has a 128 x 32-bit register file [31].

The DCT and IDCT kernels are excellent examples of where the importance of a large register file can be seen. The two-dimensional discrete cosine transform (DCT) is the algorithmic centerpiece of many lossy image compression methods; it is similar to the discrete Fourier transform (DFT) in that it maps values from the time domain to the frequency domain, producing an array of coefficients representing frequencies [17]. The inverse DCT maps in the opposite direction, from the frequency to the time domain.

A two dimensional (2-D) DCT or IDCT is efficiently computed by one dimensional (1-D) transforms on each row, followed by 1-D transforms on each column, and then scaling the resulting matrix appropriately. A SIMD approach requires that multiple data elements from several iterations be operated on in parallel for the greatest efficiency. Computation is straightforward for the 1-D column DCT, since the corresponding elements of each loop iteration are adjacent in memory (assuming a row-major storage format). However, performing a 1-D row DCT is more problematic since the corresponding elements of adjacent rows are not adjacent in memory. The solution is to employ a matrix transposition (making corresponding "row" elements adjacent in memory), perform the desired computation, and then re-transpose the matrix to put the result back into the correct configuration. We found this technique to be of performance benefit only on those architectures with register files able to hold the entire 16-bit 8x8 matrix (at least sixteen 64-bit or eight 128-bit registers).
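The transpose step can be sketched in scalar C (a hypothetical illustration assuming row-major int16_t storage; real SIMD implementations perform it in registers with merge/unpack or permute instructions, which is why a register file holding all 64 coefficients matters):

```c
#include <stdint.h>

/* In-place transpose of an 8x8 block of 16-bit coefficients. After
 * transposing, elements that belonged to the same row are adjacent in
 * memory, so the row-wise 1-D transform can reuse the same SIMD code
 * as the column pass. */
void transpose8x8(int16_t m[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++) {
            int16_t t = m[i][j];  /* swap across the diagonal */
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
}
```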

5.2. Data Types

The data types supported by the different multimedia instruction sets include {signed, unsigned} {8, 16, 32, 64}-bit values, as well as single precision floating point. Most of the instruction sets do not support all of these types, and usually support only a subset of operations on each. To determine which data types and operations are useful, we looked at the dynamic SIMD instruction counts for the kernels when running the Berkeley multimedia workload. We then broke down the dynamic SIMD instruction counts on each architecture in two ways: data type distribution per instruction class (e.g. add, multiply) and per kernel. Here we discuss only the salient features of these categorizations; the full set of results is available in [36].

It is useful to distinguish an application's storage data type (how data is stored in memory or on disk) from its computational data type (the intermediate data type used to compute a result). In general, storage data types, although narrow and therefore potentially offering the greatest degrees of parallelism, are insufficiently wide for intermediate computations to occur without overflow.

Image and video data is typically stored as packed unsigned 8-bit values, but intermediate processing usually requires precision greater than 8 bits. With the exception of width promotion, most 8-bit functionality was wasted on our set of multimedia kernels. The signed 16-bit data type is the most heavily used because it is both the native data type for audio and speech data, as well as the typical intermediate data type for video and imaging. On the wide end of the spectrum, 32-bit and longer data types are typically only used for accumulation and simple data communication operations such as alignment. Operations on 32-bit values tend to be limited to addition and subtraction (for accumulation), width promotion and demotion (for converting to a narrower output data type) and shifts (for data alignment).
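The storage/computational distinction can be made concrete with a small hypothetical example (not a BMKL kernel): averaging two 8-bit pixels. The sum of two 8-bit values needs 9 bits, so the intermediate must be widened before demoting back to the 8-bit storage type; the SIMD analogue is width promotion to 16-bit subwords.

```c
#include <stdint.h>

/* Rounded average of two pixels. Storage type: unsigned 8-bit.
 * Computational type: 16-bit, to hold the 9-bit intermediate sum
 * without overflow. */
uint8_t avg_pixel(uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)a + (uint16_t)b + 1;  /* widen, then round */
    return (uint8_t)(t >> 1);                    /* demote to storage type */
}
```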

Single precision floating point plays an important role in many of the multimedia kernels, such as the geometry stage of 3-D rendering (the clipping and transform kernels) and the FFT, where the wide dynamic range of floating point is required. Double precision SIMD is uncommon; currently it is only available with Intel's SSE2 extension on the Pentium 4 processor (released after this study was completed). This data type is targeted more towards applications such as scientific and engineering workloads rather than multimedia [14], [15], and has no use in the kernels we examine.

5.3. Operations

One of the primary goals of this work is to separate useful SIMD operations from those that are not. The large differences among current multimedia instruction sets for general purpose processors are fertile ground for making such a determination, because many different design choices have been made. Please note the very important caveat that this determination is made with respect to supporting our multimedia workload. We explicitly comment on those instructions which we found to be "useless" for our workload but for which we believe the designers had some other purpose or workload in mind.

5.3.1. Modulo/Saturation

Modulo arithmetic "wraps around" to the next representable value when overflow occurs, while saturating arithmetic clamps the output to the highest or lowest representable value for positive and negative overflow, respectively. Saturation is useful because it produces aesthetically desirable results in imaging and audio computations. For example, when adding pixels in an imaging application, modulo addition is undesirable because if overflow occurs, a small change in operand values may result in a glaring visual difference (e.g. the addition of two white pixels resulting in a black pixel). Saturating arithmetic also allows overflow in multiple packed elements to be dealt with efficiently. If overflow within packed elements were handled similarly to traditional scalar arithmetic, an overflowing SIMD operation would have to be repeated and tested serially to determine which element overflowed. The only added cost of saturating arithmetic is separate instructions for signed and unsigned operands, since the values must be interpreted by the hardware as a particular data type. This is unlike modulo operations, for which the same instruction can be used for both unsigned and signed 2's complement values.

The kernels in the BMKL that employ saturating arithmetic are noted in Table 1. From this it is clear that the most important types of saturating arithmetic are unsigned 8-bit and signed 16-bit integers. Saturating 32-bit operations are of little value since overflow is not often a concern for such a wide data type. Despite the utility of saturating operations, modulo computations are still important because they allow for results to be numerically identical to existing scalar algorithms. This is sometimes a consideration for the sake of compatibility and comparability.
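The two overflow behaviors can be sketched in scalar C for unsigned 8-bit operands (a hypothetical illustration; a SIMD saturating add applies the same rule to every packed element in one instruction):

```c
#include <stdint.h>

uint8_t add_modulo(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);              /* wraps around on overflow */
}

uint8_t add_saturate(uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)a + (uint16_t)b;
    return (uint8_t)(t > 255 ? 255 : t);  /* clamps to the maximum */
}
```

Adding two bright pixels (e.g. 200 + 100) wraps to a dark value (44) under modulo arithmetic but clamps to white (255) under saturation, which is exactly the visual artifact described above.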

5.3.2. Shift

SIMD shift operations provide an inexpensive way to perform multiplication and division by powers of two. They are also crucial for supporting fixed point integer arithmetic, in which a common sequence of operations is the multiplication of an N-bit integer by an M-bit fixed point fractional constant, producing an (N+M)-bit result with a binary point at the Mth most significant bit. At the end of computation, the final result is rounded by adding the fixed point fraction representing 0.5, and then shifting right M bits to eliminate the fractional bits.
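The round-then-shift sequence can be written out in scalar C. This is a hypothetical sketch with N = 16 and M = 14 (a Q14 fractional constant), not a kernel from the BMKL:

```c
#include <stdint.h>

#define M 14  /* fraction bits in the fixed-point constant */

/* Multiply a 16-bit sample by a Q14 fractional constant. The 32-bit
 * product has its binary point at bit M; adding 1 << (M-1) (the fixed
 * point representation of 0.5) and then shifting right by M bits
 * rounds the result back to an integer. */
int16_t fixmul(int16_t x, int16_t c_q14)
{
    int32_t prod = (int32_t)x * c_q14;              /* (N+M)-bit result */
    return (int16_t)((prod + (1 << (M - 1))) >> M); /* round, rescale */
}
```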

5.3.3. Min/Max and Comparisons

Min and max instructions output the minimum or maximum values of the corresponding

elements in two input registers, respectively. A max instruction is of obvious utility in the

maximum value search kernel, which searches through an array of signed 32-bit integers for the

greatest maximum absolute value. Max and min instructions have other less apparent uses as

well. Signed minimum and maximum operations are often used with a constant second operand

to saturate results to arbitrary ranges (this is the case in the IDCT kernel which clips its output

value range to -256...+255 (9-bit signed integer)). Architecturally, the implementation cost of

max and min instructions should be low since the necessary comparators must already exist for

saturating arithmetic. The only difference is that instead of comparing to a constant, a register

value is used.
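For illustration, the clip-to-arbitrary-range idiom can be sketched in scalar C; a SIMD version applies the same two operations to every packed element at once:

```c
#include <stdint.h>

/* Saturate x to [lo, hi] using only min and max, as in the IDCT
 * kernel's clip of results to the 9-bit signed range -256...+255. */
static int32_t clamp_minmax(int32_t x, int32_t lo, int32_t hi)
{
    x = (x < hi) ? x : hi;   /* min(x, hi) */
    x = (x > lo) ? x : lo;   /* max(x, lo) */
    return x;
}
/* e.g. clamp_minmax(300, -256, 255) == 255 */
```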

We have found that in many cases where comparisons seem to be required, max and min

instructions are sufficient. Subword integer comparisons were only useful on architectures

lacking max/min operations (e.g. Sun’s VIS). In general the same limited usefulness held true for

subword floating point comparisons, with the exception of a specialized comparison used in the

project and clip test kernel. In that kernel three dimensional (3-D) objects are first mapped to 2-D

space through a matrix multiplication of 1x4 vectors and 4x4 matrices. Objects are then clipped

to the viewable area to avoid unnecessary rendering. Motorola’s AltiVec includes a specialized


comparison instruction, vcmpbfp, which handles this boundary testing. The four coordinates of

a vertex {x, y, z, w} are tested in parallel to see if any clipping is needed; a branch instruction

can then act as a fast out if no clipping is necessary. This technique is extremely effective

because no clipping is the common case, with most vertices within screen boundaries. For the

Mesa “gears” application, the fast out case held true for 61946 of the 64560 (96.0%) clipping

tests performed in our application run of 30 frames rendered at 1024x768 resolution.
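A scalar sketch of the bounds test shows why a single parallel comparison plus a branch makes an effective fast out; this is a common formulation of the homogeneous clip test, not the exact vcmpbfp semantics:

```c
#include <stdbool.h>

/* A vertex {x, y, z, w} needs clipping when any coordinate lies
 * outside the +/-w view volume; vcmpbfp performs the bounds tests
 * in parallel so a single branch can skip the clipping code. */
static bool needs_clipping(float x, float y, float z, float w)
{
    return x < -w || x > w ||
           y < -w || y > w ||
           z < -w || z > w;
}
```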

5.3.4. Sum of Absolute Differences

A sum of absolute differences (SAD) instruction operates on a pair of packed 8-bit

unsigned input registers, summing the absolute differences between the respective packed

elements and placing (or accumulating) the scalar sum in another register. The block match

kernel sums the absolute differences between the corresponding pixels (8-bit unsigned) of two

16x16 macroblocks, and is the only kernel in our workload in which SAD instructions are used.
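In scalar C, the operation that a SAD instruction replaces looks like the following sketch; the block match kernel applies it with n = 256, one 16x16 macroblock:

```c
#include <stdint.h>

/* Sum of absolute differences over n packed 8-bit unsigned elements;
 * a SAD instruction computes this reduction in a single operation. */
static uint32_t sad_u8(const uint8_t *a, const uint8_t *b, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint32_t)(d < 0 ? -d : d);
    }
    return sum;
}
```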

Although DEC’s MVI extension is quite small (only 13 instructions), one of the few

operations that DEC did include was SAD. DEC architects found (in agreement with our

experience) that this operation provides the most performance benefit of all multimedia

extension operations [33]. Interestingly, Intel’s MMX, although a much richer set of instructions,

did not include this operation (it was later included in both AMD’s enhanced 3DNow! as well as

Intel’s SSE extensions to MMX). Sun’s VIS also includes a sum of absolute differences

instruction.

Processor            Instruction   Latency : Throughput
AMD Athlon           PSADBW        3 cycles : 64 bits every cycle
DEC Alpha            PERR          2 cycles : 64 bits every cycle
Intel Pentium III    PSADBW        5 cycles : 64 bits every 2 cycles
Sun UltraSPARC IIi   PDIST         3 cycles : 64 bits every cycle

Table 3. SAD Instruction Survey

The Motorola G4 microprocessor was the only CPU in our survey that did not include

some form of SAD operation, forcing us to synthesize it from other instructions. Although Intel’s

SSE extension includes the psadbw instruction, this offers only a one cycle (per macroblock

row) performance advantage when compared to an AltiVec implementation. Despite the fact that

Intel has a performance win with a dedicated SAD instruction, several instructions are still

required due to the fact that MMX is a 64-bit wide extension and each row is 128-bits long. To


estimate the latency of a hypothetical SAD instruction for a 128-bit extension such as AltiVec,

we first try to understand the latency of this instruction on other architectures (Table 3).

The latency of a 128-bit SAD instruction would be higher than the 64-bit instructions listed

in the table because this instruction requires a cascade of adders to sum (reduce) the differences

between the elements. An N-bit SAD instruction (M=N/8) can be broken down into three steps:

1) calculate M 8-bit differences, 2) compute the absolute value of the differences, 3) perform

log2M cascaded summations. The architects of the 64-bit DEC MVI extension comment that a 3-

cycle implementation of PERR would have been easily attainable, but in the end they achieved a
more aggressive 2-cycle instruction [7]. If a SAD operation were to be implemented

in Motorola’s AltiVec, we estimate it would have a latency of 4 cycles; certainly superior to the

existing 9 cycle multi-instruction solution.

5.3.5. Average

In addition to computing the sum of absolute differences, half-pixel interpolation, for

which MPEG-2 encoding offers three varieties, is also important. Interpolation is done by

averaging a set of pixel values with pixels spatially offset by one horizontally, vertically or both.

Integer average operations were only used in the block match kernel (for interpolation). This

kernel operates on 8-bit unsigned values and was the only type of average instruction that was

used within our workload.
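A scalar model of the packed average used for half-pixel interpolation; most SIMD average instructions round upward, which we assume here:

```c
#include <stdint.h>

/* Rounded average of two 8-bit unsigned pixels: (a + b + 1) >> 1.
 * Half-pixel interpolation averages each pixel with a neighbor
 * offset by one horizontally, vertically, or both. */
static uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}
/* e.g. avg_u8(10, 13) == 12 and avg_u8(255, 255) == 255 */
```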

5.3.6. High Latency Function Approximation.

3-D rendering applications use floating point math functions such as reciprocal (1/x) and
reciprocal square root (1/√x) that typically have very high latency. In the transform and normalize kernel,

graphics primitives are transformed to the viewer’s frame of reference through matrix

multiplications, an operation which has at its heart a floating point reciprocal square root

operation. On some architectures scalar versions of these functions (reciprocal and square root)

are computed in software, while others have hardware instructions. The computation of these

functions is iterative, so an operation’s latency is directly proportional to the number of bits of

precision returned; full IEEE compliant operations return 24-bits of mantissa.

All of the SIMD floating point extensions (AMD’s 3DNow!, Intel’s SSE and Motorola’s

AltiVec) include approximation instructions for 1/x and 1/√x. These are typically implemented


as hardware lookup tables, returning k-bits of precision. In Intel’s SSE, for example, approximate

reciprocal (rcp) and reciprocal square root (rsqrt) return 12-bits of mantissa.

One unique aspect of Intel’s SSE instruction set is that not only does it include

approximations of 1/x and 1/√x, but it also includes full precision (24 bits of mantissa)
versions of x/y and √x. Of course, this added precision comes at a price, namely much higher

latency than the pipelined approximations (in fact Intel’s full precision instructions are not

pipelined on the Pentium III). A better approach is to use the Newton-Raphson method for

improving the accuracy of approximations in software through specially derived functions. In the

case of 1/√a it is possible to iteratively increase the precision of an initial approximation x0
through the equation: x1 = x0 - (0.5 · a · x0³ - 0.5 · x0) = 0.5 · x0 · (3.0 - a · x0²).
Employing an approximation instruction in conjunction with the Newton-Raphson method

to achieve full precision is actually faster than the full precision version of the instruction that

Intel provides (25 vs. 36 cycles). One iteration of the Newton-Raphson method is enough to

improve a 12-bit approximation to the full 24-bit precision of IEEE single precision.
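The refinement step can be sketched directly from the equation above; any rough seed stands in here for the 12-bit hardware approximation:

```c
#include <math.h>

/* One Newton-Raphson iteration refining an approximation x0 of
 * 1/sqrt(a):  x1 = 0.5 * x0 * (3.0 - a * x0 * x0). */
static float rsqrt_refine(float a, float x0)
{
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}
```

On SSE the seed would come from the rsqrtss/rsqrtps approximation instructions; one such iteration lifts a 12-bit approximation to full single precision.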

The cost of the Newton-Raphson method is the register space needed to hold intermediate

values and constants. AMD’s 3DNow! extension circumvents this by including instructions to

internally perform the Newton-Raphson method, rather than having the programmer explicitly

implement it. One oddity of AMD's reciprocal square root instructions is that they
are actually scalar; they produce only one result value, based on the lower packed element. These

instructions are a part of 3DNow! and not the normal x87 floating-point; they expect input data

to be in a packed form (64-bits comprised of two 32-bit single-precision sub-words). It is

necessary to execute two Newton-Raphson operations to compute both subwords when using

AMD’s 3DNow! extension (20 cycles per subword).

5.3.7. Floating Point Rounding and Exception Handling

The techniques which SIMD instruction sets employ to deal with floating point rounding

and exception handling are very similar to those used with integer overflow. Checking result

flags or generating an exception from a packed operation requires considerable time to determine

which packed element caused the problem. Fortunately, in most cases where an exception might

be raised it is possible to fill in a value which will give reasonable results for most applications


and continue computation without interruption. In the case of rounding, it is often acceptable to

incur a slight loss in precision and only use the flush to zero (FTZ) rounding mode [41]. Flush to

zero clamps to a minimum representable result in the event of underflow (a number too small to

be represented in single precision floating point). This speeds execution because no explicit error

condition checking need be done, and is similar to saturating integer arithmetic where maximum

or minimum result values are substituted rather than checking for and reporting positive or

negative overflow. Only Intel’s SSE implements full IEEE compliant SIMD floating point

exceptions and all four IEEE rounding modes.

5.3.8. Type Conversion

Width promotion is the expansion of an N-bit value to some larger width. For unsigned

fixed point numbers this requires zero extension (filling any additional bits with zeros). Zero

extension is usually not specified as such in a multimedia architecture because it overlaps in

functionality with data rearrangement instructions such as unpack or merge if packed values are

merged with another register which has been zeroed prior to merging. Signed element unpacking

is not as simple, but is rarely supported directly by hardware; only the AltiVec instruction set

includes it. It can be synthesized with multiplication by one, since a multiplication yields a result
whose width is the sum of the widths of its operands.

Video and imaging algorithms often utilize an 8-bit unsigned data type for storage, but 16-

bits for computation, making 8-bit to 16-bit zero extension quite useful. Audio and speech

algorithms typically employ signed 16-bit values, but because multiplication by a fractional fixed

point constant is a common operation, these values are often unpacked as a natural consequence

of computation. So, although a signed unpack operation would likely be faster than

multiplication by 1, it is seldom necessary to resort to this in practice.
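Scalar models make the two promotions concrete; the multiply-by-one trick works because the full-width product of a signed 16-bit value and the constant 1 is exactly the sign-extended value:

```c
#include <stdint.h>

/* Zero extension: merging 8-bit data with a zeroed register is
 * equivalent to widening with the high bits cleared. */
static uint16_t zext8to16(uint8_t x)
{
    return (uint16_t)x;
}

/* Signed promotion via multiplication by one: the 32-bit product
 * of a signed 16-bit value and 1 is the sign-extended value. */
static int32_t sext16to32_by_mul(int16_t x)
{
    return (int32_t)x * 1;
}
```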

5.3.9. Data Rearrangement

SIMD instructions perform the same operation on multiple data elements. Because all of

the data within a register must be treated identically, the ability to efficiently rearrange data bytes

within and between registers is critical to performance. We will refer to this class of instructions

as “data communication”. Interleave or merge instructions (also referred to as mixing or

unpacking) merge alternate data elements from the upper or lower half of the elements in each of

two source registers. Align or rotate operations allow for arbitrary byte-boundary data

realignment of the data in two source registers; essentially a shift operation that is performed in


byte multiples. Both interleave and align type operations have hard coded data communication

patterns. Insert and extract operations allow for a specific packed element to be extracted as a

scalar or a scalar value to be inserted to a specified location. Shuffle (also called permute)

operations allow for greater flexibility than those operations with fixed communication patterns,

but this added flexibility requires that the communication pattern be specified either in a third

source register or as an immediate value in part of the instruction encoding.
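As a scalar model, an AltiVec-style permute selects each destination byte from the 32-byte concatenation of two source registers according to a pattern; this is a sketch of the semantics, not the vperm instruction itself:

```c
#include <stdint.h>

/* Each byte of pat selects one byte from the concatenation a||b;
 * indices 0..15 pick from a, indices 16..31 pick from b. */
static void permute16(const uint8_t a[16], const uint8_t b[16],
                      const uint8_t pat[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t s = pat[i] & 0x1f;            /* index into a||b */
        out[i] = (s < 16) ? a[s] : b[s - 16];
    }
}
```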

[21] presents a novel set of simple data communication primitives which can perform all

24 permutations of a 2x2 matrix in a single cycle on a processor with dual data communication

functional units. This is useful because any larger data communication problem can be

decomposed into 2x2 matrices. Although this approach might make some very complex data

communication patterns slow to compute, we have found that most multimedia algorithms have

patterns which are relatively simple. Because of this we endorse [21]’s technique for supplying

the data communication needs of multimedia applications.

A related class of instructions that we found to be quite useful when programming with the

AltiVec extension was “splat” instructions. A splat operation places either an immediate scalar or

specified element from a source register into every element of the destination register. This was

very useful when constants were required; on other architectures it is necessary to statically store

these types of values in memory, and then load them into a register as needed.

5.3.10. Prefetching

Prefetching is a hardware or software technique which tries to predict data access needs in

advance, overlapping memory access with useful computation. Although we will not otherwise

examine the performance implications of prefetch instructions (which would be a useful

extension to this study), we mention them briefly because they are often a part of multimedia

instruction sets due to the highly predictable nature of data accesses in multimedia applications.

Prefetching can be implemented in software, hardware or both, with each technique

distinguished by its methods for detection (discovering when prefetching will be of benefit) and

synchronization (the means for ensuring that data is available in the cache before it is needed but

not so far in advance that it pollutes the cache by purging still needed items) [38].

Many multimedia instruction sets take a purely software approach by providing prefetch

instructions that fetch only the cache block at a specified address into the cache from main

memory (without blocking normal load/store instruction accesses). Determining the ideal


location for this type of prefetch instruction depends on many architectural parameters.

Unfortunately, these include such things as the number of CPU clocks for memory latency and

the number of CPU clocks to transfer a cache line, which are both highly machine dependent and

not readily available to the programmer.
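A minimal software-prefetch sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is purely an assumed tuning value, since, as noted above, the right distance depends on machine parameters the programmer rarely knows:

```c
/* Sum an array while prefetching a fixed distance ahead.  The third
 * argument to __builtin_prefetch (temporal locality hint) is 0 here
 * because streaming data will not be reused. */
static long sum_with_prefetch(const int *src, long n)
{
    const long ahead = 16;   /* assumed prefetch distance, in elements */
    long sum = 0;
    for (long i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + ahead < n)
            __builtin_prefetch(&src[i + ahead], 0 /* read */, 0);
#endif
        sum += src[i];
    }
    return sum;
}
```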

Another technique is a software-hardware hybrid approach in which a software-directed

hardware prefetching scheme is used. One variation of this is implemented by Motorola’s

AltiVec and its data stream touch (dst) instruction, which indicates the memory

sequence or pattern that is likely to be accessed; a separate prefetch instruction need not be

issued for each data element as hardware optimizes the number of cache blocks to prefetch.

Thus, it is not necessary for the programmer to know the parameters of the cache system.

Currently there is no means for synchronization in AltiVec implementations – the data specified

by a dst is brought into the cache as quickly as possible rather than at the same rate as its

consumption. Thus, it is currently necessary to periodically issue dst instructions specifying

short access patterns rather than a single long instruction. This limits the potential benefit of this

technique over a pure software approach. Several hardware techniques for maintaining the

synchronization of data prefetch and consumption in hybrid solutions as well as pure hardware

prefetching are discussed in [29].

5.4. Bottlenecks and Unused Functionality

In this section we discuss those features which appear in multimedia instruction sets that

do not appear to be useful, and are not “free”; i.e. they aren’t a low (or no) cost side effect of

some other useful feature.

5.4.1. Instruction Primitives

The VIS instruction set does not include full 16-bit multiply instructions. It instead offers

multiplication primitives, the results of which must be combined through addition (see Listing 1

and Listing 2).

    fmul8sux16  %f0, %f2, %f4
    fmul8ulx16  %f0, %f2, %f6
    fpadd16     %f4, %f6, %f8

Listing 1. Sun VIS 16-bit x 16-bit → 16-bit Multiply

    fmuld8sux16 %f0, %f2, %f4
    fmuld8ulx16 %f0, %f2, %f6
    fpadd32     %f4, %f6, %f8

Listing 2. Sun VIS 16-bit x 16-bit → 32-bit Multiply


The reason that the architects of VIS divided up 16-bit multiplication in this way was to

decrease die area. Not providing a full 16x16 multiplier subunit cut the size of the arrays in half

[42]. Unfortunately, dividing an operation into several instructions which are not otherwise

useful in and of themselves increases register pressure, decreases instruction decoding bandwidth

and creates additional data dependencies. Splitting SIMD instructions (which have been

introduced for their ability to extract data parallelism) can actually cripple a superscalar

processor’s ability to extract instruction level parallelism. A multi-cycle operation can be a better

solution than a multi-instruction operation because instruction latencies can be transparently

upgraded in future processors, while poor instruction semantics cannot be repaired without

adding new instructions.
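The decomposition can be mirrored in scalar C: splitting the 16-bit multiplicand into a signed high byte and an unsigned low byte yields two partial products whose sum is the full product. This is a sketch of the arithmetic only; the real VIS instructions operate on packed elements:

```c
#include <stdint.h>

/* 16x16 -> 32-bit multiply via the two partial products VIS uses:
 * (signed high byte of a) * b * 256  +  (unsigned low byte of a) * b. */
static int32_t mul16_split(int16_t a, int16_t b)
{
    int8_t  hi = (int8_t)((uint16_t)a >> 8);  /* signed upper 8 bits   */
    uint8_t lo = (uint8_t)(a & 0xff);         /* unsigned lower 8 bits */
    return (int32_t)hi * b * 256 + (int32_t)lo * b;
}
/* mul16_split(a, b) equals (int32_t)a * b for all 16-bit inputs */
```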

5.4.2. Unused High Latency Approximation Instructions

Floating point reciprocal (1/x) approximation instructions, although available on several platforms,

did not find application in the kernels we studied. AltiVec includes approximate log2 (logarithm

base 2) and exp2 (2 raised to a power) instructions, which can potentially be used in the lighting

stage of a 3-D rendering pipeline. Lighting calculations are currently handled by 3-D accelerator

cards, and not the CPU, and so were not included in the BMKL.

5.4.3. Unused Pixel Conversion Instructions

Motorola’s AltiVec extension includes pixel pack (vpkpx) and pixel unpack (vupkhpx,

vupklpx) instructions for converting between 32-bit true color and 16-bit color representations.

These did not find application within the BMKL, although it is possible that they might be of

utility in situations where AltiVec needs to operate on 16-bit color data; many video games use

16-bit textures, for example.

5.4.4. Unused Memory Access Instructions

Sun’s VIS includes two sets of instructions for accelerating multimedia operations with

sophisticated memory addressing needs. The first, edge8, edge16, and edge32, produce bit

vectors to be used in conjunction with partial store instructions to deal with the boundaries in
2-D images. The second group of addressing instructions includes array8, array16 and

array32 which find use in volumetric imaging (the process of displaying a two dimensional

slice of three dimensional data). An array instruction converts (x,y,z) coordinates into a memory


address. The Berkeley multimedia workload does not include any volumetric imaging

applications, so it is not surprising that these instructions found no use in our workload.

5.4.5. Singular, Highly Utilized Resources

Although we usually think of SIMD architectures as extracting data level parallelism, all of

the implementations of the instruction sets we have examined are also superscalar, with multiple

parallel SIMD functional units. In fact, unless the SIMD vector length is long enough to hold the

entire data set being operated on, there is almost always the potential to extract instruction as

well as data level parallelism.

Sun’s VIS architecture does not include subword parallel shift instructions. Instead, the

graphics status register (GSR) has a 3-bit addr_offset field that is used implicitly for byte

granularity alignment, as well as a 4-bit scale_factor field for packing/truncation

operations. The VIS GSR is a serializing bottleneck because any time packing or alignment

functionality is needed, it must be pushed through the GSR. Because VIS lacks subword shift

operations, we found ourselves synthesizing such operations with the packing and alignment

operations whenever we could not otherwise avoid them. Even with the careful planning of

packing and alignment operations it was often necessary to write to the GSR (a slow operation)

several times during each iteration of the loops of our multimedia kernels. The serializing effect

of this singular resource prevented VIS operations from proceeding at the fullest possible degree

of (instruction) parallelism.

5.5. Overall Instruction Mix

Table 4 shows the total mix of dynamic instructions executed by each architecture. Counts

are for the code within the kernels only, and do not include instructions from the remainder of

each application. “Control state” instructions are those that interact with special purpose machine

registers (e.g. the GSR on Sun’s VIS). Branches and other data dependent operations such as
conditional moves are categorized as “control flow” type instructions.

We can see that SIMD kernels utilize a significant number of scalar operations for

pointers, loop variables and clean up code, evidence that sharing the integer data path is not a

good idea. Intel’s SSE is a 128-bit wide extension, as compared to AMD’s 64-bit wide 3DNow!,

explaining why Intel’s overall instruction count is lower by about 1 billion (22%) instructions.

The same reasoning applies to Motorola’s G4 which has the 128-bit wide (both packed integer

and floating point) AltiVec extension. DEC’s bloated instruction count is due to the fact that


their MVI extension is very limited in functionality (13 instructions in total), and so many

operations need to be done with scalar instructions.

AMD Athlon

DEC 21264A

Intel Pentium III

Motorola G4

Sun UltraSPARC IIi

                          AMD       DEC       Intel        Motorola  Sun
                          Athlon    21264A    Pentium III  G4        UltraSPARC IIi
Int Load/Store            5.4%      20.3%     6.2%         2.4%      2.6%
Int Arithmetic            12.1%     29.2%     12.3%        19.1%     26.6%
Int Control Flow          12.8%     11.3%     13.4%        13.8%     9.3%
Int Data Communication    2.3%      6.7%      2.1%         1.1%      0.0%
FP Load/Store             3.2%      11.2%     4.0%         7.4%      8.0%
FP Arithmetic             0.0%      18.1%     0.0%         2.3%      8.4%
FP Control Flow           0.0%      0.1%      0.0%         0.0%      6.2%
FP Data Communication     0.0%      0.0%      0.0%         0.0%      0.0%
SIMD Load/Store           19.4%     --        21.0%        17.9%     11.7%
SIMD Data Communication   12.3%     --        12.6%        17.0%     9.7%
SIMD Int Arithmetic       12.5%     3.1%      18.8%        11.8%     13.6%
SIMD Int Control Flow     0.0%      --        0.0%         0.0%      3.5%
SIMD FP Arithmetic        19.9%     --        9.3%         6.9%      --
SIMD FP Control Flow      0.0%      --        0.0%         0.0%      --
Control State             0.2%      0.0%      0.2%         0.3%      0.4%
Total Count               4.49E+09  7.45E+09  3.52E+09     4.53E+09  5.60E+09

Table 4. Overall Instruction Mix – values are percentages of the total counts listed at the bottom of each column. Control state instructions are those that interact with special purpose registers. Control flow instructions include branches as well as conditional moves and comparisons. Dashes (--) indicate functionality that was not implemented by a particular architecture (as opposed to implemented but unused, which is indicated by 0.0%).

Data communication operations represent the overhead necessary to put data in a format

amenable to SIMD operations. Ideally, these types of instructions should make up as small a part

of the dynamic instruction mix as possible. Table 4 reveals that the Motorola AltiVec extension

executes the largest percentage of these instructions. This is due to the fact that a wider register

width means that it is less likely that the data is arranged correctly as first loaded from memory.

Also, data communication operations are used by AltiVec to simulate unaligned access to

memory (typically a separate instruction on other architectures).

6. Performance Comparison

6.1. SIMD Speedup over C

In order to establish a baseline for performance, the average execution time of the C and

SIMD assembly versions of the BMKL were measured on each platform (Table 5). A speedup

from 2x-5x was typical, although some kernels were sped up considerably more. Note that the C

compiler for the Apple system is from a pre-release version of MacOS X (developer’s preview 3)

and is known to be weak, inflating the speedup of AltiVec over C.


C time (ns):
  AMD Athlon 500         1,087  1,801  7,374  1.22E+08  24,332  125,484  2,999  2,407  956  34,233  15,268  21,849  1.60E+07  2.55E+07  7,308  11,527  1.0E+07  3.7E+04
  DEC Alpha 21264 500    669  979  8,591  2.24E+07  12,475  67,321  1,276  1,143  616  19,549  11,120  19,208  9.21E+06  1.47E+07  4,900  7,990  2.9E+06  2.2E+04
  Intel Pentium III 500  969  1,855  7,938  6.01E+07  38,192  147,047  2,772  2,536  1,065  37,100  16,129  43,582  1.58E+07  2.20E+07  8,349  20,491  6.1E+06  4.1E+04
  Motorola 7400 (G4) 500 1,054  1,466  8,906  5.80E+07  33,176  232,272  3,326  4,110  1,710  29,488  24,156  23,721  1.75E+07  2.93E+07  4,917  81,985  6.6E+06  4.7E+04
  Sun UltraSPARC IIi 360 1,662  2,408  12,222  6.20E+07  30,922  111,258  3,782  6,165  3,550  34,928  36,773  48,660  1.77E+07  3.99E+07  7,138  14,455  7.5E+06  5.3E+04
  Sun UltraSPARC IIi 500* 1,197  1,734  8,800  4.47E+07  22,264  80,106  2,723  4,439  2,556  25,148  26,477  35,035  1.27E+07  2.87E+07  5,140  10,408  5.4E+06  3.8E+04
  Average                1,088  1,702  9,006  6.49E+07  27,819  136,677  2,831  3,272  1,579  31,059  20,689  31,404  1.52E+07  2.63E+07  6,522  27,290  6.7E+06  4.2E+04

SIMD time (ns):
  AMD Athlon 500         114  257  4,559  1.29E+07  1,438  63,446  827  604  341  18,557  9,887  7,425  1.13E+07  1.08E+07  2,585  4,597  2.2E+06  1.1E+04
  DEC Alpha 21264 500    350  379  1.30E+07  1,168  9.26E+06  4.49E+06  4.5E+06  6.6E+04
  Intel Pentium III 500  152  381  6,336  1.08E+07  1,228  84,661  993  669  528  15,010  11,003  5,759  1.17E+07  1.00E+07  4,718  4,338  2.0E+06  1.2E+04
  Motorola 7400 (G4) 500 423  437  3,427  1.30E+07  907  116,828  847  618  446  14,335  3,729  3,672  1.08E+07  3.07E+06  2,727  2,998  1.7E+06  1.0E+04
  Sun UltraSPARC IIi 360 342  852  2.30E+07  3,353  2,508  8,715  1,716  15,009  15,202  1.95E+07  1.05E+07  4.8E+06  3.2E+04
  Sun UltraSPARC IIi 500* 246  613  1.66E+07  2,415  1,806  6,275  1,235  10,806  10,945  1.41E+07  7.53E+06  3.5E+06  2.3E+04
  Average                276  461  4,774  1.45E+07  1,732  88,312  1,294  2,355  758  15,967  9,907  8,015  1.25E+07  7.77E+06  3,343  3,978  2.2E+06  1.5E+04

SIMD speedup relative to C:
  AMD Athlon 500         9.6x  7.0x  1.6x  9.4x  16.9x  2.0x  3.6x  4.0x  2.8x  1.8x  1.5x  2.9x  1.4x  2.4x  2.8x  2.5x  4.5x  3.4x
  DEC Alpha 21264 500    1.9x  2.6x  1.7x  1.0x  1.0x  3.3x  1.9x  1.7x
  Intel Pentium III 500  6.4x  4.9x  1.3x  5.6x  31.1x  1.7x  2.8x  3.8x  2.0x  2.5x  1.5x  7.6x  1.3x  2.2x  1.8x  4.7x  5.1x  3.3x
  Motorola 7400 (G4) 500 2.5x  3.4x  2.6x  4.5x  36.6x  2.0x  3.9x  6.6x  3.8x  2.1x  6.5x  6.5x  1.6x  9.6x  1.8x  27.4x  7.6x  4.6x
  Sun UltraSPARC IIi 500* 4.9x  2.8x  2.7x  9.2x  1.5x  0.7x  2.1x  2.5x  3.2x  0.9x  3.8x  3.1x  2.5x
  Average                5.0x  4.1x  1.8x  4.8x  23.5x  1.9x  3.0x  3.2x  2.7x  2.1x  3.0x  5.0x  1.3x  4.2x  2.1x  11.5x  5.0x  3.6x

C %DM:
  AMD Athlon 500         -9%  -15%  11%  -99%  7%  4%  -14%  18%  31%  -18%  18%  24%  -12%  -6%  -19%  56%  -1%
  DEC Alpha 21264 500    33%  38%  -3%  64%  52%  48%  51%  61%  55%  33%  40%  33%  35%  39%  20%  70%  42%
  Intel Pentium III 500  3%  -18%  5%  2%  -46%  -13%  -6%  13%  23%  -27%  13%  -52%  -11%  9%  -36%  23%  -7%
  Motorola 7400 (G4) 500 -6%  6%  -7%  6%  -27%  -78%  -27%  -40%  -24%  -1%  -30%  17%  -23%  -22%  20%  -210%  -28%
  Sun UltraSPARC IIi 500* -20%  -11%  -6%  27%  15%  39%  -4%  -52%  -85%  14%  -42%  -22%  10%  -19%  16%  61%  -5%

SIMD %DM:
  AMD Athlon 500         56%  38%  28%  3%  -9%  23%  28%  68%  46%  0%  -6%  21%  1%  -50%  36%  24%  19%
  DEC Alpha 21264 500    -36%  8%  -35%  2%  52%  18%  -11%  37%  3%  -6%  -19%  -104%  19%  38%  -22%  -32%  -6%
  Intel Pentium III 500  41%  8%  0%  19%  7%  -3%  14%  64%  17%  19%  -18%  39%  -2%  -40%  -18%  28%  11%
  Motorola 7400 (G4) 500 -65%  -6%  46%  2%  31%  -42%  26%  67%  30%  23%  60%  61%  5%  57%  32%  51%  24%
  Sun UltraSPARC IIi 500* 4%  -48%  -39%  -25%  -82%  3%  -57%  -236%  -95%  -36%  -16%  -16%  -23%  -5%  -28%  -72%  -48%

Table 5. Kernel Performance Statistics – From top to bottom: C time (ns), SIMD time (ns), SIMD speedup relative to C, C percent deviation from the mean (%DM) and SIMD %DM (see text for further details on this metric). Kernels missing from a row were not implemented in SIMD on that architecture due to insufficient instruction set functionality. The DCT kernel was originally coded in floating point, but implemented in fixed point integer for SIMD codes. (*)Sun UltraSPARC IIi 500MHz values are extrapolated from measurements taken on a 360MHz system.

The DCT and IDCT algorithms are of the same computational order of magnitude so it

might seem strange that they demonstrate such different performance improvements (Table 5).

The reason for this is the DCT kernel was originally coded in floating point, but implemented in

fixed point integer for the SIMD codes. The IDCT is in fixed point integer for both the original

and SIMD versions. Fixed point integer calculations are faster than floating point on most

architectures, and provide sufficient precision for the DCT and IDCT in video applications [39].

6.2. Cross-Architecture Comparison

All of our performance measurements use the metric of time rather than instruction counts

since each instruction set does not necessarily do the same amount of work (computation) in the

same number of instructions, nor take the same amount of time to perform similar instructions.

Our analysis is complicated by the fact that the Sun UltraSPARC IIi used in our study has

a 360 MHz clock, while the remainder of the chips are 500 MHz parts. Of course clock


frequency is not the only determinant of performance; there can be significantly disparate

amounts of work accomplished between machines in each cycle depending on the pipeline length

and number of functional units available. We have chosen to present scaled results (by 500/360 =

1.4x) alongside our measurements from the actual 360 MHz system in order to place an upper

bound on UltraSPARC IIi performance and eliminate a degree of freedom from our comparison.

Certainly whether or not the scaled results are a fair basis for comparison is debatable. Even now

(October 2001) the maximum available clock frequency for the UltraSPARC IIi is 480MHz,

while the other architectures in our study have scaled considerably more in frequency. The

semiconductor process of the UltraSPARC IIi (see Table 2) is comparable to that of the other

processors in our study, so it is not clear that scaling is appropriate. The slower clock frequency

of the UltraSPARC IIi most likely stems from greater delays between latches.

To measure how well or how poorly a given platform performs relative to the competition

we use the metric of percent deviation from the mean:

    %DM_i = 100 · (t̄ - t_i) / t̄

where t_i is the time taken on platform i, and t̄ is the average performance (time) across all
of the platforms examined for a particular kernel. This metric indicates how much better or
worse than average performance is, and provides a normalized result for computing the average
improvement or degradation in performance. Please note that we use the scaled (500 MHz)
UltraSPARC IIi values in computing the mean t̄. Table 5 lists the average %DM (the

arithmetic average of the %DM values for all of the kernels on a given platform), which is a

good general measure of how well a platform stacks up against other architectures.
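The metric is trivial to compute; a sketch in C, with the sign convention that a positive value means faster than the cross-platform mean:

```c
/* Percent deviation from the mean for platform i:
 *   %DM_i = 100 * (mean - t_i) / mean
 * Positive values indicate better (faster) than average performance. */
static double pct_dev_from_mean(double t_i, double mean)
{
    return 100.0 * (mean - t_i) / mean;
}
/* e.g. pct_dev_from_mean(50.0, 100.0) == 50.0 */
```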

6.2.1. AMD Athlon, Intel Pentium III

At first glance the AMD Athlon and Intel Pentium III processors in our study appear to be

very similar: both run at 500 MHz, and both share Intel’s MMX SIMD integer extension. In fact,

because we implemented the Intel kernel set first, it was possible to reuse much of the same code

for the AMD SIMD version of many of the integer kernels in the BMKL. The SIMD integer

kernels include all but the clip test, FFT, quantize, synthesis filter and transform kernels. It is

interesting to observe the differences in performance between the two processors on what is

often identical code. Among the important differences between the two processors are [37]:


• Athlon has 64 KB instruction and data L1 caches, while the Pentium III’s are each 16 KB

• Athlon has three instruction decoders working in parallel, while Pentium III has two

• Athlon’s pipeline is 10-cycles long, while Pentium III’s is 12-17 cycles; the cost of branches

in the Pentium III is exacerbated by its weaker branch prediction unit

• Intel’s MMX instruction latency is lower (1 cycle add/sub, 1 cycle shuffle, 3 cycle mult)

compared to Athlon (2 cycle add/sub, 2 cycle shuffle, 3 cycle mult)

• AMD’s 3DNow! FP latency (4 cycle add/sub, 4 cycle multiply) is lower than Intel’s SSE (4

cycle add/sub, 5 cycle multiply, throughput of 64-bits per cycle like 3DNow!)

So, although the SIMD integer instruction set is the same in both cases, AMD’s Athlon

SIMD integer is a more powerful implementation; its only deficiency is the higher latency of

simple SIMD integer instructions when compared to the Pentium III. We can see this very

clearly in Table 5, although where each architectural difference comes into play depends on the

kernel.

One interesting case is that of the DCT and IDCT kernels - the Pentium III is faster on the

DCT, while the Athlon is faster on the IDCT. This is surprising given that the two kernels are

algorithmically quite similar - each is a mirror image of the other. Upon further investigation, we

found that the DCT code has shuffle instructions (pshufw) in several places that are not

scheduled well for the Athlon’s higher shuffle latency, a circumstance which was not

duplicated in the IDCT code.

In terms of multimedia, the greatest difference between the two processors is their SIMD

floating point extensions. AMD’s 3DNow! is quite similar to MMX, reusing the eight x87

floating point registers as 64-bit wide SIMD registers. Intel’s SSE has new 128-bit wide registers

and instructions, although in the Pentium III implementation, each 128-bit instruction is actually

decoded into two 64-bit wide micro-ops. This gives the Pentium III more overall register space

to work with, although it has the same number of architectural registers (eight) as the Athlon.

In the arena of floating point operations, the AMD Athlon architecture eliminates the only

deficiency it had compared to the Pentium III in SIMD integer - namely, slightly higher

instruction latency. From Table 5, we can see that overall SIMD AMD Athlon code does 19%

better than average, while the Pentium III is 11% better than the average of all of the multimedia

instruction sets. Both chips perform well, but the AMD Athlon matches or outperforms the Intel

chip on all but one SIMD floating point kernel (quantize).


Quantization is a process by which a real value is converted to a discrete integer

representation. In the quantization kernel, an array of real double precision floating point values,

xr[], is converted to an array of 32-bit integers, ix[], according to the function

ix[i] = ⌊(istep ⋅ |xr[i]|)^(3/4) + 0.4054⌋

Note that the original C implementation utilizes a lookup table for some values (not shown in Listing 3).

static INT32 lutab[10000];
INT32 l_end;
FLOAT64 xr[];
INT32 ix[];
FLOAT64 *istep_p;
FLOAT32 temp;

for (i = 0; i < l_end; i++) {
    temp = (*istep_p) * fabs(xr[i]);
    ix[i] = (int)(sqrt(sqrt(temp) * temp) + 0.4054);
}

Listing 3. Quantize

Although SIMD floating point instruction sets include approximations for 1/√x rather
than √x, the equivalence √x = x ⋅ (1/√x) can be used. The reason for AMD’s lackluster

performance on the quantize kernel is that AMD’s reciprocal square root instructions are

actually scalar - they only compute the result for the lower packed element. This costs 3DNow!

performance on this highly square-root intensive kernel.

6.2.2. DEC Alpha 21264

From Table 5 we see that the DEC Alpha platform far outstrips the other systems in terms

of compiled C performance. It has been claimed that a broader multimedia instruction set would

not be useful on Alpha, as an extension like Intel’s MMX only fixes x86 legacy-architecture

performance deficiencies which are not present with the Alpha architecture [33]. Our

performance comparison shows this argument to be false, as the kernels programmed with the

extensions from AMD, Intel and Motorola were able to not only match the performance of those

on the Alpha 21264, but often exceeded it. This indicates that sophisticated multimedia

instruction sets are far more than a means of overcoming x86 architectural limitations.

6.2.3. Motorola G4

Motorola’s AltiVec was the only multimedia extension which was architected from the

ground up - all of the others in some way leverage existing resources. The instructions included

in AltiVec are far more general than those required by multimedia applications. Almost every

possible operation and data type is supported, which should allow it to be applied to other


application domains as well. A 128-bit register width, combined with latencies that are at least as

good as those found on the other (64-bit) architectures, allows AltiVec to come in with the best

overall performance (23.7% better than average on SIMD accelerated code, according to Table

5). However, performance on a few of the kernels is still poor, especially add block and FFT.

The add block kernel’s problem has already been discussed - the 128-bit register width is

actually too long for this kernel, causing unnecessary data to be loaded from memory, since there

is no 64-bit short load in AltiVec, although there are 8, 16, and 32-bit short loads. The reason for

the poor FFT performance is not entirely clear, although revisiting our code revealed that the

way in which we coded stores to unaligned memory could be improved slightly.

6.2.4. Sun UltraSPARC IIi

The performance of the VIS multimedia extension was mediocre at best (even when scaled

from 360 to 500 MHz). It is the oldest multimedia instruction set we have looked at, and the second
to be released, after HP’s MAX-1 (1996). As we have pointed out earlier, this instruction set

suffers from some odd instruction choices (e.g. multiplication primitives), missing functionality

(no subword shift operations) and an over-utilized graphics status register that is a bottleneck.

7. Future Directions for Multimedia Instruction Sets

SIMD instructions split traditional wide scalar data paths (e.g. 32- or 64-bits) into multiple

narrow data paths which perform the same operation on each packed element. This is of

significant performance benefit for multimedia applications where data types are relatively

narrow and algorithms often perform the same operation over a large number of elements. The

greatest performance limiting factor facing current SIMD implementations is dealing with data

that is not natively in a format amenable to SIMD processing. We have found two major sources

of mismatch: storage data types which are inappropriate for computation (necessitating

conversion overhead before and likely also following computation) and operands which are not

adjacent in memory; recall our earlier example of the matrix transpose step in the DCT kernel.

The first type has led us to propose “fat subwords” (Section 7.1), while the latter suggests

“strided” memory operations (Section 7.2).

7.1. Fat Subwords

With current SIMD instruction sets, data is loaded into registers in the same packed format

that it is stored in memory. Frequently, unpacking is required before operations are performed,


and the unpacked data of course no longer fits within a single register. We therefore propose a

“fat subword” architecture which has the following characteristics: 1) registers (and therefore

subwords) that are wider than the data to be loaded into them, 2) implicit unpack with load and

3) implicit pack (and saturate) with store.
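These three properties can be modeled in scalar C for one element of the 8-bit unsigned storage / 24-bit signed computation pairing (an illustrative sketch of the proposal, not an existing ISA; names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the proposed fat-subword load/store semantics for one element:
   an 8-bit unsigned storage value is implicitly widened to a 24-bit signed
   computation subword on load, and saturated back to 8-bit unsigned on
   store. */
static int32_t load_8u_as_24s(uint8_t x) {
    return (int32_t)x;              /* zero-extend: always fits in 24 bits */
}

static uint8_t store_24s_as_8u(int32_t x) {
    if (x < 0)   return 0;          /* saturate below */
    if (x > 255) return 255;        /* saturate above */
    return (uint8_t)x;
}
```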

A fat subword architecture is in some ways similar to the as yet unimplemented MIPS

MDMX instruction set [23]. The MIPS MDMX extension has a single 192-bit accumulator

register as its architectural cornerstone, with the more usual style of register to register SIMD

operations also included; non-accumulator SIMD operations share the 64-bit floating point

datapath. The destination of normal SIMD instructions can be either another SIMD register or

the accumulator (to be loaded or accumulated). When accumulating packed unsigned bytes, the

accumulator is partitioned into eight 24-bit unsigned subwords. Packed 16-bit operations cause

the accumulator to be split into four 48-bit subwords.

What is good about the MDMX accumulator approach is that implicit width promotion

provides an elegant solution to overflow and other issues caused by packing data as tightly as

possible. For example, multiplication is semantically difficult to deal with on current SIMD

architectures because the result of a multiply is longer than either operand [22]. This problem is

avoided by the MDMX accumulator, because there is enough extra space to prevent overflow.
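The headroom is easy to quantify for unsigned bytes: each 8×8-bit product fits in 16 bits, so a 24-bit accumulator subword has 8 guard bits and can absorb at least 2^8 = 256 worst-case products before it can wrap. A small behavioral model (ours, not MDMX code):

```c
#include <assert.h>
#include <stdint.h>

/* Model of one 24-bit accumulator subword summing 8x8-bit products. Each
   product fits in 16 bits, so the 8 guard bits of the 24-bit subword absorb
   at least 256 worst-case accumulations before wraparound. */
static uint32_t mac_bytes_24(const uint8_t *a, const uint8_t *b, int n) {
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = (acc + (uint32_t)a[i] * b[i]) & 0xFFFFFF;  /* 24-bit subword */
    return acc;
}
```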

Fixed point arithmetic is also more precise because full precision results can be

accumulated, and the total rounded only once at the end of the loop. Similarly, most scalar

multimedia algorithms only specify saturation at the end of computation because it is more

precise. Unfortunately, in SIMD architectures where the packed elements maximally fill the

available register space, the choice is either to be imprecise (compute with saturation at every

step) or lose parallelism (explicitly promote the input data to be wider and saturate at the end).

Saturating arithmetic can also produce unexpected results since, unlike normal addition, the

order of operations matters.
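A two-line example makes the order dependence concrete for signed bytes (the saturating helper is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Saturating signed-byte addition: unlike modular addition it is not
   associative, so the order in which elements are combined changes the
   result. */
static int8_t addsat_s8(int a, int b) {
    int s = a + b;
    if (s > 127)  return 127;
    if (s < -128) return -128;
    return (int8_t)s;
}
```

Summing 100, 50 and -50 left to right saturates early and yields 77, while combining 50 and -50 first yields the exact result, 100.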

While the MIPS MDMX solution may seem elegant in principle, it ignores the

architectural aspect of actually making such a design fast. The accumulator is a singular and

unique shared resource, and as such has a tendency to limit instruction level parallelism. We

found that the existence of only a single accumulator would be a severe handicap to avoiding

data dependencies. On a non-accumulator but otherwise superscalar architecture it is usually

possible to perform some other useful, non data dependent operation in parallel so that


processing can proceed at the greatest degree of instruction level parallelism possible. On

MDMX all computations that need to use the accumulator must proceed serially since

accumulation is inherently dependent on the previous result.

        128-bit total load/store width      64-bit total load/store width
        Storage     Computation             Storage     Computation
        8U          12S x 16                8U          24S x 8
        16S         24S x 8                 16S         48S x 4
        32S         48S x 4                 -           -
        32FP        32FP x 4                -           -

Table 6. Supported Data Types - data types are divided into storage (as stored in memory) and computation (the width of arithmetic and other instruction elements) widths.

7.1.1. Fat Subword Data Types

To avoid the problems of MDMX, we suggest that a normal register file architecture be

used (as opposed to a single accumulator), with the entire SIMD register file made wider than the

width of loads and stores. For the sake of example we have chosen a width of 192-bits for our

datapath (registers and functional units), corresponding to a load/store width of either 64- or 128-

bits. A 64-bit only implementation would also be possible but of course with a lesser degree of

potential parallelism. Another possibility, should very wide registers prove infeasible, is the use

of register sets (e.g. triplets in the case of 64-bit wide registers). A similar approach was used by

Cyrix in their extension to MMX (found in their 6x86MX/M-II processor) to compensate for the

eight register encoding limit inherent in x86 based architectures.

Supported storage data types include those that we have seen to be of importance in

multimedia for storage formats: 8-bit unsigned, 16-bit signed, 32-bit signed and single precision

floating point. The data types supported by packed arithmetic operations for computation are

different: 12-bit signed, 24-bit signed, 48-bit signed and single precision floating point.

Depending on the algorithm it may be desirable to access memory at either 64-bit or 128-bit

widths (see Table 6), which would then be unpacked to different computational widths: a smaller

number of high-precision subwords (64-bit storage width) or a larger number of low-precision

subwords (128-bit storage width). This allows for better matching with the natural width of each

algorithm.
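The arithmetic behind Table 6 is simple: every computation format exactly fills the 192-bit datapath chosen for this example (for 32FP the subword is still 48 bits wide, with 16 bits idle). A trivial sanity check, with a helper name of our own:

```c
#include <assert.h>

/* Each (count x computation width) pairing from Table 6 fills the proposed
   192-bit datapath exactly; 32FP values occupy 48-bit subwords with 16
   idle bits each. */
static int fills_192(int count, int subword_bits) {
    return count * subword_bits == 192;
}
```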

In the case of packed floating point operations we suggest operating on four single

precision values in parallel, leaving the remaining 16 bits of each subword idle. This allows for

SIMD computations to precisely match scalar results while still being able to take advantage of

the same data rearrangement capabilities used for integer data types.


7.1.2. Fat Subword Operations

In [36] we present our proposed instruction set (97 instructions in total) for a fat subword

SIMD multimedia extension. Unlike existing instruction sets which are fundamentally byte-

based, this instruction set is centered on quantities which are multiples of 12-bits wide. This can

be thought of as four extra bits of precision for every byte of actual storage data loaded. Because

we have fundamentally changed the treatment of multimedia data types, several types of typical

instructions are no longer useful. These include operations such as average, pack/unpack and

truncating multiplication.

7.1.3. Fat Subword Example

A small example of how to code for the proposed architecture is shown in Figure 1. Note

that load and store instructions specify both computational and storage data types.

lvx8u24s  v0,[0x1000]
lvx8u24s  v1,[0x1008]
mult24s   v2,v0,v1
stv24s8u  v2,[0x1000]

Figure 1. Fat Subword Example – two vectors of eight 8-bit unsigned integers are loaded from addresses 0x1000 and 0x1008 into vector registers v0 and v1 respectively, being implicitly expanded to 24-bit signed notation. Vector registers v0 and v1 are then multiplied and the result placed in v2. Overflow cannot occur because the maximum width of each result is 16 bits (the sum of the operand widths). The result vector is then simultaneously repacked to 8-bit unsigned notation (with saturation) and stored out to memory address 0x1000.
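The Figure 1 sequence can be modeled element-wise in C (a behavioral sketch of the proposed instructions, not real intrinsics):

```c
#include <assert.h>
#include <stdint.h>

/* Behavioral model of the Figure 1 sequence: two vectors of 8-bit unsigned
   values are widened to 24-bit signed subwords, multiplied (each product
   fits in 16 bits, so no overflow), then repacked to 8-bit unsigned with
   saturation on store. */
static void mult_8u_via_24s(const uint8_t *a, const uint8_t *b,
                            uint8_t *out, int n) {
    for (int i = 0; i < n; i++) {
        int32_t p = (int32_t)a[i] * (int32_t)b[i];   /* 24S multiply */
        out[i] = (uint8_t)(p > 255 ? 255 : p);       /* saturating repack */
    }
}
```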

7.2. Strided Memory Access

We propose that SIMD architectures implement strided load and store instructions to make

the gathering of non-adjacent data elements more efficient. This is similar to the prefetch

mechanism in AltiVec, except that the data elements would be assembled together by the

hardware into a single register, rather than simply loaded into the cache. Of course such a

memory operation would necessarily be slower than a traditional one, but it would cut down

immensely on the overhead of reorganizing data after it is loaded from memory.

Strided loads and stores would have three operands, vD, rA and rB, where vD is the destination


fat-subword register, rA is the base address and rB contains a description of the memory access

pattern:

width - width of the data elements to be loaded [2 bits]: 00 = 8-bits, 01 = 16-bits, 10 = 32-
bits, 11 = 64-bits.

offset - number of bytes offset from the base address from which to begin loading [6 bits],

interpreted as a signed value.

stride - number of bytes between the effective address of one element in the sequence and

the next [24 bits], interpreted as a signed value.
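A software model makes the proposed semantics concrete; the bit positions of the rB fields and the helper names below are our assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding of the rB access descriptor: width (2 bits),
   signed offset (6 bits), signed stride (24 bits). Field positions are
   assumed for illustration only. */
static uint32_t make_stride_desc(unsigned width, int offset, int stride) {
    return ((width & 0x3u) << 30) |
           (((unsigned)offset & 0x3Fu) << 24) |
           ((unsigned)stride & 0xFFFFFFu);
}

/* Behavioral model of a strided load of n 8-bit elements: start offset
   bytes past the base address, stepping stride bytes between elements. */
static void strided_load_u8(uint8_t *dst, const uint8_t *base,
                            ptrdiff_t offset, ptrdiff_t stride, int n) {
    const uint8_t *p = base + offset;
    for (int i = 0; i < n; i++) {
        dst[i] = *p;
        p += stride;
    }
}
```

With band-interleaved RGB pixels, the red band is gathered with offset 0 and stride 3, green with offset 1, and blue with offset 2.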

The color space conversion kernel is an excellent example of where strided load and store

instructions could be used. Pixel data consists of one or more channels or bands, with each

channel representing some independent value associated with a given pixel’s (x,y) position. A

single channel, for example, represents grayscale, while a three (or more) channel image is

typically color. The band data may be interleaved (each pixel’s red, green, and blue data are

adjacent in memory) or separated (e.g. the red data for adjacent pixels are adjacent in memory).

In image processing algorithms such as color space conversion we operate on each channel in a

different way, so band separated format is the most convenient for SIMD processing. Converting

from the RGB to YCBCR color space is done through the conversion coefficients shown in

Listing 4.

UINT8 *rowp, *y_p, *u_p, *v_p;
INT32 red = *rowp++, green = *rowp++, blue = *rowp++;

*y_p++ = +0.2558*red + 0.5022*green + 0.0975*blue + 16.5;
*u_p++ = -0.1476*red - 0.2899*green + 0.4375*blue + 128.5;
*v_p++ = +0.4375*red - 0.3664*green - 0.0711*blue + 128.5;

Listing 4. Color Space Conversion

Listing 5 replaces thirty-eight instructions in the original AltiVec color space conversion

kernel (the corresponding code fragment is listed in [36]). In the original AltiVec code, it was

necessary to load six permute control vectors (each 128-bits wide) before executing the six

vperm instructions required to rearrange the data into band separated format.

rgb_to_yuv:
    oris    r11,r11,RED_PATTERN
    oris    r12,r12,GREEN_PATTERN
    oris    r13,r13,BLUE_PATTERN
    lvxstrd v28,0,r3,r11  ;; v28: |r0|r1|r2|r3|r4|r5|r6|r7|r8|r9|rA|rB|rC|rD|rE|rF|
    lvxstrd v29,0,r3,r12  ;; v29: |g0|g1|g2|g3|g4|g5|g6|g7|g8|g9|gA|gB|gC|gD|gE|gF|
    lvxstrd v30,0,r3,r13  ;; v30: |b0|b1|b2|b3|b4|b5|b6|b7|b8|b9|bA|bB|bC|bD|bE|bF|

Listing 5. Modified Color Space Conversion


Traditional SIMD data communication operations have trouble with data that is not aligned

on boundaries that are powers of two; in the case of color space conversion, visually adjacent

pixels from each band are spaced 3 bytes apart. Strided loads and stores are by definition

unaligned, so this would need to be handled by the load/store hardware in the CPU. It would also

make sense to have additional versions of these instructions to provide hints to the memory

hardware whenever data elements will not be of near-term use (especially if the stride is on the

same or greater order of the line size). Another approach to strided data access is presented in

[43] which discusses the Impulse system architecture. Impulse allows for a strided mapping that

creates dense cache lines from array elements that are not contiguous in memory.

8. Conclusion

Specialized instructions have been introduced by microprocessor vendors to support the

unique computational demands of multimedia applications. The mismatch between wide data

paths and the relatively short data types found in multimedia applications has led the industry to

embrace SIMD (single instruction, multiple data) style processing. A SIMD approach has the

significant added benefit of simplifying control circuitry, typically one of the most difficult

aspects of superscalar microprocessor design. We have found that the greatest performance

limiting factor facing current SIMD implementations is dealing with data that is not natively in a

format amenable to SIMD processing.

Storage and computation have opposing goals – the first seeks to use a minimal number of

bits in order to conserve space, while the latter requires extra bits to maintain precision during

intermediate calculations. With current SIMD instruction sets, data is loaded into registers in the

same packed format that it is stored in memory. This is inefficient because unpacking is

frequently required before operations are performed, and the unpacked data of course no longer

fits within a single register. When it is time to write back SIMD data to memory the data must be

repacked and saturated to fit within a typically narrower storage data type. Our proposed fat-

subword register architecture reduces the amount of unpacking, packing and saturation overhead

that SIMD computation must incur by encoding the current and desired computational and

storage data type in SIMD load/store instructions and then implicitly performing the requisite

unpacking or packing and sign extension or saturation.

SIMD instructions perform the same operation on multiple data elements. Because all of

the data within a register must be treated identically, the ability to efficiently rearrange data bytes


within and between registers has become critical to performance. With regard to this we have

suggested strided memory operations as having the potential to greatly simplify access to sparse

data structures as well as the conversion between band-interleaved and band-separated storage

formats. If these strided operations can be made fast in hardware (this seems likely given that

strided vector loads and stores have long been integral components of traditional vector
supercomputers, so their implementation should already be well understood), one of the major

impediments to efficient SIMD processing will be lifted.

We have presented a unified comparison of five multimedia instruction sets for general

purpose microprocessors, which discusses the advantages and disadvantages of specific features

of current instruction sets. In addition to examining the five instruction sets from a functional

standpoint, we have also empirically compared the performance of specific implementations of

these instruction sets on real hardware. This has allowed us to measure typical speedups relative

to a highly tuned compiler on a given platform as well as to make cross-architecture

comparisons.

Today most microprocessors found in desktop workstations are equipped with multimedia

extensions that remain unused due to the lack of compiler support for these instruction sets [20]

(our analysis has been based entirely on measurements of hand tuned SIMD kernels). A valuable

extension to this study would be an analysis of what characteristics would make SIMD
instructions most amenable to use by a compiler. In addition, multimedia applications, and
therefore our target workload, are constantly evolving: MPEG-4 and MP3-Pro audio are only
two examples of up and coming algorithms that demand further examination. As these and other

algorithms and applications come to light we will need to revisit our design choices for

multimedia instruction sets.

References [1] G.E. Allen, B.L. Evans, L.K. John, “Real-Time High-Throughput Sonar Beamforming Kernels

Using Native Signal Processing and Memory Latency Hiding Techniques,” Proc. 33rd IEEE

Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, Calif., 1999, pp. 137-141

[2] Advanced Micro Devices, Inc., “AMD Athlon Processor x86 Code Optimization Guide,”

Publication #22007, Rev. G, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf,

(current Apr. 2000)


[3] Advanced Micro Devices, Inc., “AMD Athlon Processor Technical Brief,” Publication #22054,

Rev. D, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22054.pdf, (current Apr. 2000)

[4] Advanced Micro Devices, “3DNow! Technology vs. KNI,” White Paper,

http://www.amd.com/products/cpg/3dnow/vskni.html, (current Apr. 2000)

[5] R. Bhargava, L.K. John, B.L. Evans, R. Radhakrishnan, “Evaluating MMX Technology Using DSP

and Multimedia Applications,” Proc. 31st IEEE Intl. Symp. on Microarchitecture (MICRO-31),

Dallas, Texas, 1998, pp. 37-46

[6] T. Burd, “General Processor Info,” CPU Info Center, http://bwrc.eecs.berkeley.edu/CIC/

summary/local/summary.pdf, (current Apr. 2000)

[7] D.A. Carlson, R.W. Castelino, R.O. Mueller, “Multimedia Extensions for a 550-MHz RISC

Microprocessor,” IEEE J. of Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1618-1624

[8] W. Chen, H. John Reekie, S. Bhave, E.A. Lee, “Native Signal Processing on the UltraSparc in the

Ptolemy Environment,” Proc. 30th Ann. Asilomar Conf. on Signals, Systems, and Computers,

Pacific Grove, California, 1996, vol. 2, pp. 1368-1372

[9] Compaq Computer Corporation, “Alpha 21264 Microprocessor Hardware Reference Manual,” Part

No. DS-0027A-TE, Feb. 2000, http://www.support.compaq.com/alpha-tools/

documentation/current/21264_EV67/ds-0027a-te_21264_hrm.pdf, (current Apr. 2000)

[10] T.M. Conte, P.K. Dubey, M.D. Jennings, R.B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song,

A. Wolfe, "Challenges to Combining General-Purpose and Multimedia Processors," IEEE

Computer, vol. 30, no. 12, Dec. 1997, pp. 33-37

[11] C. Hansen, “MicroUnity’s MediaProcessor Architecture,” IEEE Micro, vol. 16, no. 4, Aug. 1996,

pp. 34-41

[12] Intel Corporation, “Intel Introduces The Pentium Processor With MMX Technology,” January 8,

1997 Press Release, http://www.intel.com/pressroom/archive/releases/dp010897.htm, (current Apr.

2000)

[13] Intel Corporation, “Intel Architecture Software Developer’s Manual, Volume 1: Basic

Architecture,” Publication 243190, http://developer.intel.com/design/pentiumii/

manuals/24319002.PDF, (current Apr. 2000)

[14] Intel Corporation, “IA32 Intel Architecture Software Developer’s Manual with Preliminary Intel

Pentium 4 Processor Information,” Volume 1: Basic Architecture,

http://developer.intel.com/design/processor/future/manuals/24547001.pdf, (current Sept. 2000)


[15] Intel Corporation, “Intel Announces New NetBurst Micro-Architecture for Pentium IV Processor,”

http://www.intel.com/pressroom/archive/releases/dp082200.htm, (current Sept. 2000)

[16] J. Keshava, V. Pentkovski, “Pentium III Processor Implementation Tradeoffs,” Intel Tech. J.,

Quarter 2, 1999, http://developer.intel.com/technology/itj/q21999/pdf/impliment.pdf, (current Apr.

2000)

[17] T. Kientzle, “Implementing Fast DCTs,” Dr. Dobb’s J., vol. 24, no. 3, March 1999, pp. 115-119

[18] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, G. Zyner, “The Visual Instruction Set (VIS) in

UltraSPARC,” Proc. Compcon ’95, San Francisco, 1995, pp. 462-469

[19] I. Kuroda, T. Nishitani, “Multimedia Processors,” Proc. IEEE, vol. 86 no. 6, June 1998, pp. 1203-

1221

[20] S. Larsen, S. Amarasinghe, “Exploiting Superword Level Parallelism with Multimedia Instruction

Sets,” Proc. ACM SIGPLAN ‘00 Conf. on Programming Language Design and Implementation,

Vancouver, BC Canada, 2000, pp. 145-156

[21] R.B. Lee, “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in

MicroSIMD Architectures,” Proc. IEEE Intl. Conf. on Application-Specific Systems, Architectures,

and Processors, Boston, 2000, pg. 3-14

[22] R.B. Lee, “Multimedia Extensions for General Purpose Processors,” Proc. IEEE Workshop on VLSI

Signal Processing, Leicester, UK, 1997, pp. 1-15

[23] MIPS Technologies, Inc., “MIPS Extension for Digital Media with 3D,” White Paper, Mar. 12,
1997, http://www.mips.com/Documentation/isa5_tech_brf.pdf, (current Apr. 2000)

[24] Motorola Inc., “MPC7400 RISC Microprocessor User’s Manual, Rev. 0,” Document

MPC7400UM/D, http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/

MPC7400UM.pdf, (current Apr. 2000)

[25] J. Nakashima, K. Tallman, “The VIS Advantage: Benchmark Results Chart VIS Performance,”

White Paper, Oct. 1996, http://www.sun.com/microelectronics/vis/, (current Apr. 2000)

[26] H. Nguyen, L.K. John, “Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using

the AltiVec Technology,” Proc. 1999 Int’l Conf. on Supercomputing, Rhodes, Greece, 1999, pp.

11-20

[27] K. Noer, “Heat Dissipation Per Square Millimeter Die Size Specifications,”

http://home.worldonline.dk/~noer/, (current Apr. 2000)


[28] K.B. Normoyle, M.A. Csoppenszky, A. Tzeng, T.P. Johnson, C.D. Furman, J. Mostoufi,

“UltraSPARC-IIi: Expanding the Boundaries of a System on a Chip,” IEEE Micro, vol. 18, no. 2,

Mar./Apr. 1998, pp. 14-24

[29] A.D. Pimentel, P. Struik, P. van der Wolf, L.O. Hertzberger, “Hardware versus Hybrid Data

Prefetching in Multimedia Processors: A Case Study,” Proc. IEEE Intl. Performance, Computing

and Communications Conference, Phoenix, Ariz., 2000, pp. 525-531

[30] P. Ranganathan, S. Adve, N.P. Jouppi, “Performance of Image and Video Processing with General-

Purpose Processors and Media ISA Extensions,” Proc. 26th Ann. Intl. Symp. on Computer

Architecture, Atlanta, 1999, pp. 124-135

[31] S. Rathnam, G. Slavenberg, “An Architectural Overview of the Programmable Multimedia

Processor, TM-1,” Proc. of COMPCON ‘96, Santa Clara, Calif., 1996, pp. 319-326

[32] D.S. Rice, “High-Performance Image Processing Using Special-Purpose CPU Instructions: The

UltraSPARC Visual Instruction Set,” Univ. Calif. at Berkeley, Master’s Report, Mar., 1996

[33] P. Rubinfeld, B. Rose, M. McCallig, “Motion Video Instruction Extensions for Alpha,” White

Paper, Oct. 1996, http://www.digital.com/alphaoem/papers/pmvi-abstract.htm, (current Apr. 2000)

[34] N.T. Slingerland, A.J. Smith, “Design and Characterization of the Berkeley Multimedia Workload,”

University of California at Berkeley Technical Report CSD-00-1122, Dec. 2000

[35] N.T. Slingerland, A.J. Smith, “Multimedia Instruction Sets for General Purpose Microprocessors: A

Survey,” University of California at Berkeley Technical Report CS-00-1124, Dec. 2000

[36] N.T. Slingerland, A.J. Smith, “Measuring the Performance of Multimedia Instruction Sets,”

University of California at Berkeley Technical Report CSD-00-1125, Dec. 2000

[37] A. Stiller, “Architecture Contest,” c’t Magazine, vol. 16/99,

http://www.heise.de/ct/english/99/16/092/, (current Dec. 2000)

[38] P. Struik, P. van der Wolf, A.D. Pimentel, “A Combined Hardware/Software Solution for Stream

Prefetching in Multimedia Applications,” Proc. 10th Ann. Symp. on Electronic Imaging

(Multimedia Hardware Architectures track), San Jose, Calif., 1998, pp. 120-130

[39] M.T. Sun ed., “IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete

Cosine Transform”, IEEE Standard 1180-1990, IEEE, 1991

[40] Sun Microsystems Inc., “UltraSPARC-IIi User’s Manual,” 1997, Part No. 805-0087-01,

http://www.sun.com/microelectronics/UltraSPARC-IIi/docs/805-0087.pdf, (current Apr. 2000)


[41] S. Thakkar, T. Huff, “The Internet Streaming SIMD Extensions,” Intel Tech. J., Q2 1999,

http://developer.intel.com/technology/itj/q21999.htm, (current Apr. 2000)

[42] M. Tremblay, J.M. O’Connor, V. Narayanan, L. He, “VIS Speeds New Media Processing,” IEEE

Micro, vol. 16, no. 4, Aug. 1996, pp. 10-20

[43] L. Zhang, J.B. Carter, W. Hsieh, S.A. McKee, “Memory System Support for Image Processing,”

Proc. 1999 International Conference on Parallel Architectures and Compilation Techniques,

Newport Beach, Calif., 1999, pp. 98-107