Funding for this research has been provided by the State of California under the MICRO program, and by Cisco Corporation, Fujitsu Microelectronics, IBM, Intel Corporation, Maxtor Corporation, Microsoft Corporation, Sun Microsystems, Toshiba Corporation and Veritas Software Corporation.
Measuring the Performance of Multimedia Instruction Sets

Nathan Slingerland ([email protected]), Apple Computer, Inc.
Alan Jay Smith ([email protected]), University of California at Berkeley
Abstract

Many microprocessor instruction sets include instructions for accelerating multimedia
applications such as DVD playback, speech recognition and 3-D graphics. Despite general
agreement on the need to support this emerging workload, there are considerable differences
between the instruction sets that have been designed to do so. In this paper we study the
performance of five instruction sets on kernels extracted from a broad multimedia workload. We
compare the performance of contemporary implementations of each extension against each other
as well as to the original compiled C performance. From our analysis we determine how well
multimedia workloads map to current instruction sets, noting what was useful and what was not.
We also propose two enhancements: fat subwords and strided memory operations.
Keywords: SIMD, subword parallel, multimedia, performance, measurement, MMX, SSE,
AltiVec, VIS, MVI
1. Introduction
Specialized instructions have been introduced by microprocessor vendors to support the
unique computational demands of multimedia applications. The mismatch between wide data
paths and the relatively short data types found in multimedia applications has led the industry to
embrace SIMD (single instruction, multiple data) style processing. Unlike traditional forms of
SIMD computing in which multiple individual processors execute the same instruction,
multimedia instructions are executed by a single processor, and pack multiple short data
elements into a single wide (64 or 128-bit) register, with all of the sub-elements being operated
on in parallel.
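For concreteness, consider an unsigned 8-bit saturating add, first in scalar C and then with Intel's MMX intrinsics (an illustrative sketch of ours, not BMKL code; handling of any remaining n mod 8 elements is omitted for brevity):

    #include <mmintrin.h>  /* MMX intrinsics (x86) */

    /* Scalar: one 8-bit saturating add per iteration. */
    void add_sat_scalar(unsigned char *d, const unsigned char *a,
                        const unsigned char *b, int n) {
        for (int i = 0; i < n; i++) {
            int s = a[i] + b[i];
            d[i] = (unsigned char)(s > 255 ? 255 : s);
        }
    }

    /* SIMD: eight 8-bit saturating adds per instruction (64-bit register). */
    void add_sat_mmx(unsigned char *d, const unsigned char *a,
                     const unsigned char *b, int n) {
        for (int i = 0; i + 8 <= n; i += 8) {
            __m64 va = *(const __m64 *)(a + i);
            __m64 vb = *(const __m64 *)(b + i);
            *(__m64 *)(d + i) = _mm_adds_pu8(va, vb);  /* PADDUSB */
        }
        _mm_empty();  /* release the MMX state shared with x87 */
    }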
The goal of this paper is to quantify how architectural differences between multimedia
instruction sets translate into differences in performance. Prior studies have primarily focused on
a single instruction set in isolation and have measured the performance of kernels taken from
vendor provided libraries [1], [5], [8], [25], [26], [30], [32]. Our contribution is unique as we do
not focus exclusively on a single architecture, and we study the performance of kernels derived
from a real, measured, general purpose multimedia workload. Our results are obtained from
actual hardware measurements rather than through simulation, instilling confidence in our
results.
Section 2 summarizes the multimedia workload we studied, and details the sixteen
computationally important kernels which we extracted from it. Our methodology for recoding
the kernels with multimedia instructions and their measurement is described in Section 3. An
overview of the five instruction sets, and their implementations, is given in Section 4.
Our analysis is divided into two parts. Section 5 reflects our experience in coding the
kernels, and lends insight into those instruction set features that we found useful. In Section 6 we
compare the performance of five different processors, each of which implements a particular
multimedia instruction set. Section 7 proposes two new directions for multimedia architectures
on general purpose microprocessors: fat subwords and strided memory operations. Readers who
are interested in a much more detailed discussion of the material presented here should see [36].
2. Workload
The lack of a standardized multimedia benchmark has meant that workload selection is the
most difficult aspect of any study of multimedia. It was for this reason that we developed the
Berkeley multimedia workload, which is described in [35] (we recommend this paper to any
reader concerned about how applications were selected or the technical merit and relevance of
the Berkeley multimedia workload). It details why we felt existing multimedia benchmarks were
unsuitable, develops and characterizes our workload, and extracts the kernels that we study here.
In selecting the component applications, we strove to cover as many types of media
processing as possible: image compression (DjVu, JPEG), 3-D graphics (Mesa, POVray),
document rendering (Ghostscript), audio synthesis (Timidity), audio compression (ADPCM,
LAME, mpg123), video compression (MPEG-2 at DVD and HDTV resolutions), speech
synthesis (Rsynth), speech compression (GSM), speech recognition (Rasta) and video game
(Doom) applications. Open source software was used both for its portability (allowing for cross
platform comparisons) and the fact that we could analyze the source code directly.
3. Methodology
3.1. Berkeley Multimedia Kernel Library
In order to measure the performance of the multimedia instruction sets, we distilled from
our Berkeley multimedia workload a set of computationally important kernel functions, which
when taken together form the Berkeley multimedia kernel library (BMKL). Kernels were chosen
based on their computational significance and amenability to hand optimization with SIMD
instructions; [35] discusses the kernel selection process in detail. Table 1 lists the characteristics
of the kernel codes examined.
Kernel Name                | Source Application | Data Type  | Sat Arith | Native Width | Src Lines | Static Instr | %Static Instr | %CPU Cycles
Add Block                  | MPEG-2 Decode      | 8-bit (U)  | √         | 64-bits      | 46        | 191          | 0.1%          | 13.7%
Block Match                | MPEG-2 Encode      | 8-bit (U)  |           | 128-bits     | 52        | 294          | 0.4%          | 59.8%
Clip Test and Project      | Mesa               | FP         |           | limited      | 95        | 447          | 0.1%          | 0.8%
Color Space Conversion     | JPEG Encode        | 8-bit (U)  |           | limited      | 22        | 78           | 0.1%          | 9.8%
DCT                        | MPEG-2 Encode      | 16-bit (S) |           | 128-bits     | 14        | 116          | 0.1%          | 12.3%
FFT                        | LAME               | FP         |           | limited      | 208       | 981          | 4.4%          | 14.5%
Inverse DCT                | MPEG-2 Decode      | 16-bit (S) | √         | 128-bits     | 75        | 649          | 0.3%          | 29.7%
Max Value                  | LAME               | 32-bit (S) |           | unlimited    | 8         | 39           | 0.2%          | 12.0%
Mix                        | Timidity           | 16-bit (S) | √         | unlimited    | 143       | 829          | 1.0%          | 35.7%
Quantize                   | LAME               | FP         |           | unlimited    | 55        | 312          | 1.4%          | 15.3%
Short Term Analysis Filter | GSM Encode         | 16-bit (S) | √         | 128-bits     | 15        | 79           | 0.1%          | 20.2%
Short Term Synth Filter    | GSM Decode         | 16-bit (S) | √         | 128-bits     | 15        | 114          | 0.2%          | 72.7%
Subsample Horizontal       | MPEG-2 Encode      | 8-bit (U)  | √         | 88-bits      | 35        | 244          | 0.3%          | 2.6%
Subsample Vertical         | MPEG-2 Encode      | 8-bit (U)  | √         | unlimited    | 76        | 478          | 0.6%          | 2.1%
Synthesis Filtering        | Mpg123             | FP         | √         | 512-bits     | 67        | 348          | 0.4%          | 39.6%
Transform and Normalize    | Mesa               | FP         |           | limited      | 51        | 354          | 0.1%          | 0.7%

Table 1. Multimedia Kernels Studied - from left to right, the columns list 1) primary data type, specified as N-bit ({Unsigned, Signed}) integer or floating point (FP), 2) whether saturating arithmetic is used, 3) native width: the longest width which does not load/store/compute excess unused elements, 4) static C source line count (for those lines which are executed), 5) static instruction count (for those instructions which are executed), 6) percentage of total static instructions, 7) percentage of total CPU time spent in the kernel. The latter three statistics are machine specific and are for the original C code on a Compaq DS20 (dual 500 MHz Alpha 21264, Tru64 Unix v5.0 Rev. 910) machine.
From a performance standpoint, no piece of code can realistically be extracted and studied
in isolation. All of the parent applications in the Berkeley multimedia workload were modified to
make calls to the BMKL rather than their original internal functions. By measuring our kernel
codes from within real applications we could realistically include the effects of the shared
resources of our test systems (e.g. CPU, caches, TLB, memory, and registers).
3.2. Coding Process
Ideally, high level language compilers would be able to systematically and automatically
identify parallelizable sections of code and generate the appropriate SIMD instruction sequence.
The lack of languages which allow programmers to specify data types and overflow semantics at
variable declaration time has hindered the development of automated compiler support for
multimedia instruction sets [10].
As is the case with digital signal processors (DSPs), the most efficient way to program
with multimedia extensions is to have an expert programmer tune software using assembly
language [19]. Although this is more tedious and error prone than other methods such as hand
coded standard shared libraries or automatic code generation by a compiler, it is a method that is
available on every platform, and allows for the greatest flexibility and precision when coding.
We chose to code in assembly language ourselves rather than measure vendor supplied
optimized libraries because they would make a fair cross-platform comparison next to
impossible. Each vendor implements a different set of functions which presume particular (and
likely different) data formats. For example, a vendor supplied color space conversion function
might assume band-interleaved format when the program to be accelerated uses a band-separated
structure. Some overhead would then be needed to transform data structures into the expected
format, perturbing the system under test and adding unnecessary overhead.
All of the assembly codes in our study were written by the same programmer (Slingerland)
with an approximately equal amount of time spent on each platform. Coding ourselves had the
supplemental benefit of preventing any differences in programmer ability and time spent coding
between the vendors from potentially skewing our comparison. This also gave us considerable
insight into the reasons for performance differences, where we would otherwise be limited to
simply reporting our measurements. It is important to stress that the kernel algorithms and
applications in question were studied in considerable detail before assembly coding began. Our
level of understanding made it possible to play the role of applications expert and thereby make
educated choices about SIMD algorithms, sufficient precision and required data types, and is
reflected in the development and characterization of our workload in [34].
In some cases it was not practical to code a SIMD version of a kernel if an instruction set
lacked the requisite functionality. Sun’s VIS and DEC’s MVI do not support packed floating
point, so floating point kernels were not recoded for those platforms. DEC’s MVI does not
support any data communication (e.g. permute, mix, merge) or subword parallel integer multiply
instructions. In general, if there was no compelling opportunity for performance gain, the C
version was considered to be a particular platform’s “solution” to a kernel when the needed
SIMD operations were not provided. Situations which called for a C solution are clearly denoted
in our results, and the typically lesser performance of compiler generated code must be kept in
mind when comparing against the other (hand-tuned) kernels.
3.2.1. C Compilers and Optimization Flags
The most architecturally tuned compiler on each architecture was used to compile the C
reference version of the Berkeley Multimedia Kernel library. The optimization flags used were
those which give the best general speedup and highest level of optimization without resorting to
tuning the compiler flags for each particular kernel. The specific compiler and associated
optimization flags used on each platform are listed in [36].
4. Instruction Sets
In this paper we study five architectures with different multimedia instruction sets (Table 2
lists the parameters for the exact parts). Many of the instruction sets (AMD’s 3DNow!, DEC’s
MVI, Intel’s MMX, Sun’s VIS) use 64-bit wide registers, while Motorola’s AltiVec and Intel’s
SSE are 128-bits wide. The size of the register file available on each architecture varies widely,
ranging from eight 64-bit registers on the x86 based AMD Athlon to thirty-two 128-bit wide
registers with Motorola’s AltiVec extension.
Parameter | AMD Athlon | DEC Alpha 21264 | Intel Pentium III | Motorola 7400 (G4) | Sun UltraSPARC IIi
Clock (MHz) | 500 | 500 | 500 | 500 | 360
SIMD Extension(s) | MMX/3DNow!+ | MVI | MMX/SSE | AltiVec | VIS
SIMD Instructions | 57/24 | 13 | 57/70 | 162 | 121
First Shipped | 6/1999 | 2/1998 | 2/1999 | 8/1999 | 11/1998
Transistors (×10^6) | 22.0 | 15.2 | 9.5 | 6.5 | 5.8
Process (µm) [Die Size (mm²)] | 0.25 [184] | 0.25 [225] | 0.20 [105] | 0.20 [83] | 0.25 [148]
L1 $I Cache, $D Cache (KB) | 64, 64 | 64, 64 | 16, 16 | 32, 32 | 32, 64
L2 Cache (MB) | 0.5 | 4 | 0.5 | 1 | 2
Register File (# × width) | FP(8x64b)/FP(8x64b) | Int(31x64b) | FP(8x64b)/8x128b | 32x128b | FP(32x64b)
Inst. Reorder Buffer Entries | 72 | 35 | 40 | 6 | 12
SIMD Int, FP Multimedia Units | 2 (combined) | 2, 0 | 2 (combined) | 3, 1 | 2, 0
SIMD Int Add Latency | 2 [64b/cycle] | - | 1 [64b/cycle] | 1 [128b/cycle] | 1 [64b/cycle]
SIMD Int Multiply Latency | 3 [64b/cycle] | - | 3 [64b/cycle] | 3 [128b/cycle] | 3 [64b/cycle]
SIMD FP Add Latency | 4 [64b/cycle] | - | 4 [64b/cycle] | 4 [128b/cycle] | -
SIMD FP Multiply Latency | 4 [64b/cycle] | - | 5 [64b/cycle] | 4 [128b/cycle] | -
SIMD FP ~1/sqrt Latency | 3 [32b/cycle] | - | 2 [64b/cycle] | 4 [128b/cycle] | -

Table 2. Microprocessors Studied - All parameters are for the actual chips used in this study. A SIMD register file may be shared with existing integer (Int) or floating point (FP) registers, or be separate. Note that some of the processors implement several multimedia extensions (e.g. Pentium III has MMX and SSE) - the corresponding parameters for each are separated with “/”. A dash (-) indicates that a particular operation is not available on a given architecture, and so no latency and throughput numbers are given. DEC’s MVI extension (on the 21264) does not include any of the listed operations, but all MVI instructions have a latency of 3 cycles [64b/cycle]. [2], [3], [4], [6], [7], [9], [12], [13], [16], [18], [24], [27], [28], [40]
All of the multimedia extensions support integer operations, although the types and widths
of available operations vary greatly. The earliest multimedia instruction sets (e.g. Sun’s VIS and
DEC’s MVI) had design goals limited by the immaturity of their target workload and unproven
benefit to performance. Any approach which greatly modified the overall architecture or
significantly affected die size was out of the question, so leveraging as much functionality as
possible from the existing chip architecture was a high priority. Thus, the Sun and DEC
extensions do not implement subword parallel floating point instructions. Intel’s first multimedia
extension, MMX (1997), had as one of its primary design goals to not require any new operating
system support, while Intel’s later SSE (1999) added new operating-system-maintained state
(eight 128-bit wide registers) for the first time since the introduction of the Intel386 instruction
set in 1985.
The processors studied (Table 2) vary in terms of instruction latency as well as the
throughput per clock cycle. Processor clock rates were all 500 MHz, with the exception of the
Sun UltraSPARC IIi, for which only a 360 MHz system was available. Although multimedia
extensions primarily focus on extracting data level parallelism, most modern microprocessors are
also superscalar, and thereby allow for multiple multimedia instructions to be issued every cycle.
All of the architectures we look at are fully pipelined, so barring any data dependencies, one new
SIMD instruction can begin per parallel functional unit each clock cycle. Typically, there are
separate functional units for SIMD integer and SIMD floating point processing, although on the
x86 architectures they are combined. [35] surveys current multimedia instruction sets in more
detail.
5. Instruction Set Analysis
This section encompasses our experience coding each of the sixteen kernels in five
different multimedia extensions. We point out those instruction set features (data types and
operations) that we found to be useful in accelerating our workload, as well as those for which
we found no application or that created significant bottlenecks in current multimedia
architectures. Those readers primarily interested in our performance measurements of specific
implementations of these instruction sets can skip ahead to Section 6 and return to Section 5
at their leisure. The complete original C source code for each kernel as well as for the assembly
language SIMD implementations of the BMKL are available on the web at
http://www.cs.berkeley.edu/~slingn/research/. Please contact the authors by email should this
URL ever become invalid.
5.1. Register File and Data Path
Multimedia instruction sets can be broadly categorized according to the location and
geometry of the register file upon which SIMD instructions operate. Alternatives include the
reuse of the existing integer or floating point register files, or implementing an entirely separate
one. The type of register file affects the width and therefore the number of subwords that can be
operated on simultaneously (vector length).
Implementing multimedia instructions on the integer data path has the advantage that the
functional units for shift and logical operations need not be replicated. Subword parallel addition
and subtraction are easily created by blocking the appropriate carry bits. However, modifications
to the integer data path to accommodate multimedia instructions can potentially adversely affect
the critical path of the integer data-path pipeline [19]. In addition, on architectures such as x86
(AMD, Intel) and PowerPC (Motorola) the integer data path is 32-bits wide, making a shared
integer data path approach less compelling due to the limited amount of data level parallelism
possible.
The reuse of floating point rather than integer registers has the advantage that they need
not be shared with pointers and loop and other variables. In addition, multimedia and floating
point instructions are not typically used simultaneously [19], so there is seldom any added
register pressure. Finally, many architectures support at least double-precision (64-bit wide)
operations, which is often wider than the integer data path.
A separate data path that is not shared with existing integer or floating point data paths has
the advantage of simpler pipeline control and potentially an increased number of registers.
Disadvantages include the need for saving and restoring the new registers during context
switches, as well as the relative difficulty and high overhead of moving data between register
files. This can potentially limit performance if scalar code depends upon a vector result.
A significant design parameter for SIMD instruction sets is vector length (the number of
packed elements or subwords that can be operated on simultaneously). A vector length that is too
short limits the ability to exploit data parallelism, while an excessively long vector length can
degrade performance due to insufficiently available data parallelism in an algorithm as well as
increase the amount of clean up code overhead. Table 1 classifies the “native width” of each of
the kernels:
unlimited width - data elements are truly independent and are naturally arranged so as to be
amenable to SIMD processing. Program loops can be strip mined at almost any length, with the
increase in performance of a longer vector being directly proportional to the increase in vector
length.
limited width - data elements are independent, but there is some overhead required to
rearrange input data so that it may be operated on in a SIMD manner. The performance
advantage of longer vectors is limited by the overhead (which typically increases with vector
length) required to employ them.
exact width - a precise natural width which can be considered to be the right match for the
kernel. This width is the longest width that does not load, store, or compute unused elements.
Most multimedia kernels either have unlimited or limited native widths or, most often, an exact
required width of 128-bits. The remaining exact native widths (64-, 88- and 512-bits) come up only once
each. Thus, we consider a total vector length of 128-bits to be optimal for most multimedia
algorithms.
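The interaction between vector length and clean up code can be seen in a strip mined loop. The following C sketch is illustrative (VLEN and all names are ours); a scalar inner loop stands in for the packed operations:

    #define VLEN 8  /* e.g. eight 16-bit subwords in a 128-bit register */

    /* Scale 16-bit samples by a Q15 constant, VLEN elements per SIMD
     * iteration, with scalar clean-up code for the remainder. */
    void scale_q15(short *x, short c, int n) {
        int i;
        for (i = 0; i + VLEN <= n; i += VLEN)
            for (int j = 0; j < VLEN; j++)      /* stand-in for SIMD body */
                x[i + j] = (short)(((int)x[i + j] * c) >> 15);
        for (; i < n; i++)                      /* clean-up code */
            x[i] = (short)(((int)x[i] * c) >> 15);
    }

For an unlimited width kernel, doubling VLEN roughly halves the trip count; for a limited width kernel the rearrangement overhead grows with VLEN and erodes the gain.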
5.1.1. Number of Registers
Multimedia applications and their kernels can generally take advantage of quite large
register files. This rule of thumb is echoed in the architectures of dedicated media processors
such as MicroUnity’s chip, which has a 64 x 64-bit register file (also accessible as 128 x 32-bits)
[11], while the Philips Trimedia TM-1 has a 128 x 32-bit register file [31].
The DCT and IDCT kernels are excellent examples of where the importance of a large
register file can be seen. The two-dimensional discrete cosine transform (DCT) is the algorithmic
centerpiece to many lossy image compression methods; it is similar to the discrete Fourier
transform (DFT) in that it maps values from the time domain to the frequency domain, producing
an array of coefficients representing frequencies [17]. The inverse DCT maps in the opposite
direction, from the frequency to the time domain.
A two dimensional (2-D) DCT or IDCT is efficiently computed by one dimensional (1-D)
transforms on each row followed by 1-D transforms on each column and then scaling the
resulting matrix appropriately. A SIMD approach requires that multiple data elements from
several iterations be operated on in parallel for the greatest efficiency. Computation is
straightforward for the 1-D column DCT, since the corresponding elements of each loop iteration
are adjacent in memory (assuming a row-major storage format). However, performing a 1-D row
DCT is more problematic since the corresponding elements of adjacent rows are not adjacent in
memory. The solution is to employ a matrix transposition (making corresponding “row”
elements adjacent in memory), perform the desired computation, and then re-transpose the
matrix to put the result matrix back into the correct configuration. We found this technique to
only be of performance benefit on those architectures with register files that were able to hold the
entire 16-bit 8x8 matrix (at least 16 64-bit or 8 128-bit registers).
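The technique is illustrated below by its 4x4 single precision analogue, written with Intel's SSE intrinsics (a sketch of ours, not BMKL code; the kernels themselves transpose a 16-bit 8x8 matrix with the same unpack/shuffle approach):

    #include <xmmintrin.h>  /* SSE */

    /* Transpose a 4x4 float matrix entirely in registers so that the
     * "row" direction becomes the "column" direction for a 1-D pass. */
    void transpose4x4(float m[4][4]) {
        __m128 r0 = _mm_loadu_ps(m[0]);
        __m128 r1 = _mm_loadu_ps(m[1]);
        __m128 r2 = _mm_loadu_ps(m[2]);
        __m128 r3 = _mm_loadu_ps(m[3]);
        _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* expands to unpacks/moves */
        _mm_storeu_ps(m[0], r0);
        _mm_storeu_ps(m[1], r1);
        _mm_storeu_ps(m[2], r2);
        _mm_storeu_ps(m[3], r3);
    }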
5.2. Data Types
The data types supported by the different multimedia instruction sets include {signed,
unsigned} {8, 16, 32, 64} bit values, as well as single precision floating point. Most of the
instruction sets do not support all of these types, and usually only a subset of operations on each.
To determine which data types and operations are useful, we looked at the dynamic SIMD
instruction counts for the kernels when running the Berkeley multimedia workload. We then
broke down the dynamic SIMD instruction counts on each architecture in two ways: data type
distribution per instruction class (e.g. add, multiply) and per kernel. Here we discuss only the
salient features of tables of these categorizations, although the full set of results is available in
[36].
It is useful to distinguish an application’s storage data type (how data is stored in memory
or on disk) from its computational data type (the intermediate data type used to compute a
result). In general, storage data types, although narrow and therefore potentially offering the
greatest degrees of parallelism, are insufficiently wide for intermediate computations to occur
without overflow.
Image and video data is typically stored as packed unsigned 8-bit values, but intermediate
processing usually requires precision greater than 8-bits. With the exception of width promotion,
most 8-bit functionality was wasted on our set of multimedia kernels. The signed 16-bit data type
is the most heavily used because it is both the native data type for audio and speech data, as well
as the typical intermediate data type for video and imaging. On the wide end of the spectrum,
32-bit and longer data types are typically only used for accumulation and simple data
communication operations such as alignment. Operations on 32-bit values tend to be limited to
addition and subtraction (for accumulation), width promotion and demotion (for converting to a
narrower output data type) and shifts (for data alignment).
Single precision floating point plays an important role in many of the multimedia kernels
such as the geometry stage of 3-D rendering (the clipping and transform kernels) and the FFT,
where the wide dynamic range of floating point is required. Double precision SIMD is
uncommon; currently it is only available with Intel’s SSE2 extension on their Pentium IV
processor (released after this study was completed). This data type is targeted more towards
applications such as scientific and engineering workloads rather than multimedia [14], [15], and
has no use in the kernels we examine.
5.3. Operations
One of the primary goals of this work is to separate useful SIMD operations from those
that are not. The large differences in current multimedia instruction sets for general purpose
processors are fertile ground for making such a determination because many different design
choices have been made. Please note the very important caveat that this determination is made
with respect to supporting our multimedia workload. We explicitly comment on those
instructions which we found to be “useless” for our workload but for which we believe the
designers had some other purpose or workload in mind.
5.3.1. Modulo/Saturation
Modulo arithmetic “wraps around” to the next representable value when overflow occurs,
while saturating arithmetic clamps the output value to the highest or lowest representable value
for positive and negative overflow respectively. This characteristic is useful because of its
desirable aesthetic result in imaging and audio multimedia computations. For example, when
adding pixels in an imaging application, modulo addition is undesirable because if overflow
occurs, a small change in operand values may result in a glaring visual difference (e.g. the
addition of two white pixels results in a black pixel). Saturating arithmetic also allows for
overflow in multiple packed elements to be dealt with efficiently. If overflow within packed
elements were to be handled similarly to traditional scalar arithmetic, an overflowing SIMD
operation would have to be repeated and tested serially to determine which element overflowed.
The only added cost of saturating arithmetic is separate instructions for signed and unsigned
operands since the values must be interpreted by the hardware as a particular data type. This is
unlike modulo operations for which the same instruction can be used for both unsigned and
signed 2’s complement values.
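In scalar C terms the two behaviors differ as follows (an illustrative sketch; the packed forms perform eight or sixteen such adds per instruction):

    #include <stdint.h>

    /* Modulo addition wraps: 250 + 10 -> 4, a glaring visual error
     * when the operands are pixels. */
    uint8_t add_modulo(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }

    /* Saturating addition clamps: 250 + 10 -> 255, so a bright pixel
     * stays bright. */
    uint8_t add_saturate(uint8_t a, uint8_t b) {
        unsigned s = (unsigned)a + b;
        return (uint8_t)(s > 255 ? 255 : s);
    }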
The kernels in the BMKL that employ saturating arithmetic are noted in Table 1. From this
it is clear that the most important types of saturating arithmetic are unsigned 8-bit and signed
16-bit integers. Saturating 32-bit operations are of little value since overflow is not often a concern
for such a wide data type. Despite the utility of saturating operations, modulo computations are
still important because they allow for results to be numerically identical to existing scalar
algorithms. This is sometimes a consideration for the sake of compatibility and comparability.
5.3.2. Shift
SIMD shift operations provide an inexpensive way to perform multiplication and division
by powers of two. They are also crucial for supporting fixed point integer arithmetic in which a
common sequence of operations is the multiplication of an N-bit integer by an M-bit fixed point
fractional constant, producing an (N+M)-bit result, with a binary point at the Mth most significant
bit. At the end of computation, the final result is rounded by adding the fixed point fraction
representing 0.5, and then shifting right M-bits to eliminate the fractional bits.
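For example, a multiply by a Q15 fractional constant (M = 15) with rounding can be written in scalar C as follows (an illustrative sketch):

    #include <stdint.h>

    int16_t fixmul_q15(int16_t x, int16_t c) {
        int32_t t = (int32_t)x * c;  /* (N+M)-bit intermediate result */
        t += 1 << 14;                /* add 0.5 in Q15 to round */
        return (int16_t)(t >> 15);   /* shift right M bits */
    }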
5.3.3. Min/Max and Comparisons
Min and max instructions output the minimum or maximum values of the corresponding
elements in two input registers, respectively. A max instruction is of obvious utility in the
maximum value search kernel, which searches through an array of signed 32-bit integers for the
greatest maximum absolute value. Max and min instructions have other less apparent uses as
well. Signed minimum and maximum operations are often used with a constant second operand
to saturate results to arbitrary ranges (this is the case in the IDCT kernel which clips its output
value range to -256...+255 (9-bit signed integer)). Architecturally, the implementation cost of
max and min instructions should be low since the necessary comparators must already exist for
saturating arithmetic. The only difference is that instead of comparing to a constant, a register
value is used.
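The IDCT clipping mentioned above thus reduces to one max and one min (scalar C sketch of ours; the packed forms clip every element at once against constants held in registers):

    #include <stdint.h>

    static int32_t imax(int32_t a, int32_t b) { return a > b ? a : b; }
    static int32_t imin(int32_t a, int32_t b) { return a < b ? a : b; }

    /* Clip an IDCT output to the 9-bit signed range -256...+255. */
    int32_t clip9(int32_t v) { return imin(imax(v, -256), 255); }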
We have found that in many cases where comparisons seem to be required, max and min
instructions are sufficient. Subword integer comparisons were only useful on architectures
lacking max/min operations (e.g. Sun’s VIS). In general the same limited usefulness held true for
subword floating point comparisons, with the exception of a specialized comparison used in the
project and clip test kernel. In that kernel three dimensional (3-D) objects are first mapped to 2-D
space through a matrix multiplication of 1x4 vectors and 4x4 matrices. Objects are then clipped
to the viewable area to avoid unnecessary rendering. Motorola’s AltiVec includes a specialized
comparison instruction, vcmpbfp, which handles this boundary testing. The four coordinates of
a vertex {x, y, z, w} are tested in parallel to see if any clipping is needed; a branch instruction
can then act as a fast out if no clipping is necessary. This technique is extremely effective
because no clipping is the common case, with most vertices within screen boundaries. For the
Mesa “gears” application, the fast out case held true for 61946 of the 64560 (96.0%) clipping
tests performed in our application run of 30 frames rendered at 1024x768 resolution.
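A hedged sketch of this fast out, written in C with AltiVec intrinsics (the wrapper and its argument names are ours; vec_all_in compiles to vcmpbfp plus a condition register test):

    #include <altivec.h>

    /* Return nonzero only if some coordinate of the vertex {x,y,z,w}
     * falls outside the clip bounds; zero is the common fast-out case. */
    int needs_clipping(vector float vertex, vector float bounds) {
        return !vec_all_in(vertex, bounds);
    }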
5.3.4. Sum of Absolute Differences
A sum of absolute differences (SAD) instruction operates on a pair of packed 8-bit
unsigned input registers, summing the absolute differences between the respective packed
elements and placing (or accumulating) the scalar sum in another register. The block match
kernel sums the absolute differences between the corresponding pixels (8-bit unsigned) of two
16x16 macroblocks, and is the only kernel in which the sum of absolute differences (SAD)
instruction is used.
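For reference, the scalar form of the computation is shown below (a sketch of ours); a SAD instruction collapses the subtract, absolute value and accumulate sequence of the inner loop for eight pixels at a time:

    #include <stdint.h>
    #include <stdlib.h>  /* abs() */

    unsigned sad16x16(const uint8_t *cur, const uint8_t *ref, int stride) {
        unsigned sum = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++)
                sum += (unsigned)abs(cur[x] - ref[x]);
            cur += stride;
            ref += stride;
        }
        return sum;
    }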
Although DEC’s MVI extension is quite small (only 13 instructions), one of the few
operations that DEC did include was SAD. DEC architects found (in agreement with our
experience) that this operation provides the most performance benefit of all multimedia
extension operations [33]. Interestingly, Intel’s MMX, although a much richer set of instructions,
did not include this operation (it was later included in both AMD’s enhanced 3DNow! as well as
Intel’s SSE extensions to MMX). Sun’s VIS also includes a sum of absolute differences
instruction.
Processor          | Instruction | Latency : Throughput
AMD Athlon         | PSADBW      | 3 cycles : 64b every 1 cycle
DEC Alpha          | PERR        | 2 cycles : 64b every 1 cycle
Intel Pentium III  | PSADBW      | 5 cycles : 64b every 2 cycles
Sun UltraSPARC IIi | PDIST       | 3 cycles : 64b every 1 cycle

Table 3. SAD Instruction Survey
The Motorola G4 microprocessor was the only CPU in our survey that did not include
some form of SAD operation, forcing us to synthesize it from other instructions. Although Intel’s
SSE extension includes the psadbw instruction, this offers only a one cycle (per macroblock
row) performance advantage when compared to an AltiVec implementation. Despite the fact that
Intel has a performance win with a dedicated SAD instruction, several instructions are still
required due to the fact that MMX is a 64-bit wide extension and each row is 128-bits long. To
estimate the latency of a hypothetical SAD instruction for a 128-bit extension such as AltiVec,
we first try to understand the latency of this instruction on other architectures (Table 3).
The latency of a 128-bit SAD instruction would be higher than the 64-bit instructions listed
in the table because this instruction requires a cascade of adders to sum (reduce) the differences
between the elements. An N-bit SAD instruction (M=N/8) can be broken down into three steps:
1) calculate M 8-bit differences, 2) compute the absolute value of the differences, 3) perform
log2(M) cascaded summations. The architects of the 64-bit DEC MVI extension comment that a
3-cycle implementation of PERR would have been easily attainable, but in the end they
achieved a more aggressive 2-cycle instruction [7]. If a SAD operation were to be implemented
in Motorola’s AltiVec, we estimate it would have a latency of 4 cycles; certainly superior to the
existing 9 cycle multi-instruction solution.
5.3.5. Average
In addition to computing the sum of absolute differences, half-pixel interpolation, for
which MPEG-2 encoding offers three varieties, is also important. Interpolation is done by
averaging a set of pixel values with pixels spatially offset by one horizontally, vertically or both.
Integer average operations were only used in the block match kernel (for interpolation). This
kernel operates on 8-bit unsigned values and was the only type of average instruction that was
used within our workload.
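The scalar operation is simple (sketch below); the value of a packed average instruction lies in computing it for eight or sixteen pixel pairs at once without overflowing the 8-bit elements:

    #include <stdint.h>

    /* Rounded average of two pixels, the half-pixel interpolation
     * primitive; the intermediate sum needs more than 8 bits. */
    uint8_t avg_round(uint8_t a, uint8_t b) {
        return (uint8_t)(((unsigned)a + b + 1) >> 1);
    }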
5.3.6. High Latency Function Approximation
3-D rendering applications use floating point math functions such as the reciprocal (1/x) and
the reciprocal square root (1/√x) that are typically very high latency. In the transform and normalize kernel,
graphics primitives are transformed to the viewer’s frame of reference through matrix
multiplications, an operation which has at its heart a floating point reciprocal square root
operation. On some architectures scalar versions of these functions (reciprocal and square root)
are computed in software, while others have hardware instructions. The computation of these
functions is iterative, so an operation’s latency is directly proportional to the number of bits of
precision returned; full IEEE compliant operations return 24-bits of mantissa.
All of the SIMD floating point extensions (AMD’s 3DNow!, Intel’s SSE and Motorola’s
AltiVec) include approximation instructions for 1/x and 1/√x. These are typically implemented
as hardware lookup tables, returning k-bits of precision. In Intel’s SSE, for example, approximate
reciprocal (rcp) and reciprocal square root (rsqrt) return 12-bits of mantissa.
One unique aspect of Intel’s SSE instruction set is that not only does it include
approximations of 1/x and 1/√x, but it also includes full precision (24-bits of mantissa)
versions of x/y and √x. Of course, this added precision comes at a price – namely much higher
latency than the pipelined approximations (in fact Intel’s full precision instructions are not
pipelined on the Pentium III). A better approach is to use the Newton-Raphson method for
improving the accuracy of approximations in software through specially derived functions. In the
case of 1/√a, it is possible to iteratively increase the precision of an initial approximation x0
through the equation: x1 = x0 − (0.5 ⋅ a ⋅ x0^3 − 0.5 ⋅ x0) = 0.5 ⋅ x0 ⋅ (3.0 − a ⋅ x0^2).
Employing an approximation instruction in conjunction with the Newton-Raphson method
to achieve full precision is actually faster than the full precision version of the instruction that
Intel provides (25 vs. 36 cycles). One iteration of the Newton-Raphson method is enough to
improve a 12-bit approximation to the full 24-bit precision of IEEE single precision.
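A sketch of this refinement using Intel's SSE intrinsics (our code, not a vendor library routine):

    #include <xmmintrin.h>  /* SSE */

    /* Refine the ~12-bit RSQRTPS estimate to ~24 bits with one
     * Newton-Raphson step: x1 = 0.5 * x0 * (3.0 - a * x0^2). */
    __m128 rsqrt_refined(__m128 a) {
        const __m128 half  = _mm_set1_ps(0.5f);
        const __m128 three = _mm_set1_ps(3.0f);
        __m128 x0 = _mm_rsqrt_ps(a);                    /* approximation */
        __m128 t  = _mm_mul_ps(_mm_mul_ps(a, x0), x0);  /* a * x0^2 */
        return _mm_mul_ps(_mm_mul_ps(half, x0),
                          _mm_sub_ps(three, t));
    }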
The cost of the Newton-Raphson method is the register space needed to hold intermediate
values and constants. AMD’s 3DNow! extension circumvents this by including instructions to
internally perform the Newton-Raphson method, rather than having the programmer explicitly
implement it. The only odd thing about AMD’s reciprocal square root instructions is that they
are actually scalar; they only produce one result value, based on the lower packed element. These
instructions are a part of 3DNow! and not the normal x87 floating-point; they expect input data
to be in a packed form (64-bits comprised of two 32-bit single-precision sub-words). It is
necessary to execute two Newton-Raphson operations to compute both subwords when using
AMD’s 3DNow! extension (20 cycles per subword).
5.3.7. Floating Point Rounding and Exception Handling
The techniques which SIMD instruction sets employ to deal with floating point rounding
and exception handling are very similar to those used with integer overflow. Checking result
flags or generating an exception from a packed operation requires considerable time to determine
which packed element caused the problem. Fortunately, in most cases where an exception might
be raised it is possible to fill in a value which will give reasonable results for most applications
and continue computation without interruption. In the case of rounding, it is often acceptable to
incur a slight loss in precision and only use the flush to zero (FTZ) rounding mode [41]. Flush to
zero substitutes zero for a result which underflows (a number too small to
be represented in single precision floating point). This speeds execution because no explicit error
condition checking need be done, and is similar to saturating integer arithmetic where maximum
or minimum result values are substituted rather than checking for and reporting positive or
negative overflow. Only Intel’s SSE implements full IEEE compliant SIMD floating point
exceptions and all four IEEE rounding modes.
5.3.8. Type Conversion
Width promotion is the expansion of an N-bit value to some larger width. For unsigned
fixed point numbers this requires zero extension (filling any additional bits with zeros). Zero
extension is usually not specified as such in a multimedia architecture because it overlaps in
functionality with data rearrangement instructions such as unpack or merge if packed values are
merged with another register which has been zeroed prior to merging. Signed element unpacking
is not as simple, but is rarely supported directly by hardware; only the AltiVec instruction set
includes it. It can be synthesized with multiplication by one since a multiplication yields a result
that has the total overall width of both its operands.
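With MMX intrinsics, for example, zero extension falls out of an unpack against a zeroed register (an illustrative sketch):

    #include <mmintrin.h>

    /* Zero-extend the low four unsigned 8-bit elements of v to 16 bits:
     * interleaving with zeros makes the zeros the high byte of each
     * resulting 16-bit element. */
    __m64 zext_lo_8to16(__m64 v) {
        return _mm_unpacklo_pi8(v, _mm_setzero_si64());  /* PUNPCKLBW */
    }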
Video and imaging algorithms often utilize an 8-bit unsigned data type for storage, but
16-bits for computation, making 8-bit to 16-bit zero extension quite useful. Audio and speech
algorithms typically employ signed 16-bit values, but because multiplication by a fractional fixed
point constant is a common operation, these values are often unpacked as a natural consequence
of computation. So, although a signed unpack operation would likely be faster than
multiplication by 1, it is seldom necessary to resort to this in practice.
5.3.9. Data Rearrangement
SIMD instructions perform the same operation on multiple data elements. Because all of
the data within a register must be treated identically, the ability to efficiently rearrange data bytes
within and between registers is critical to performance. We will refer to this class of instructions
as “data communication”. Interleave or merge instructions (also referred to as mixing or
unpacking) merge alternate data elements from the upper or lower half of the elements in each of
two source registers. Align or rotate operations allow for arbitrary byte-boundary data
realignment of the data in two source registers; essentially a shift operation that is performed in
byte multiples. Both interleave and align type operations have hard coded data communication
patterns. Insert and extract operations allow for a specific packed element to be extracted as a
scalar or a scalar value to be inserted to a specified location. Shuffle (also called permute)
operations allow for greater flexibility than those operations with fixed communication patterns,
but this added flexibility requires that the communication pattern be specified either in a third
source register or as an immediate value in part of the instruction encoding.
[21] presents a novel set of simple data communication primitives which can perform all
24 permutations of a 2x2 matrix in a single cycle on a processor with dual data communication
functional units. This is useful because any larger data communication problem can be
decomposed into 2x2 matrices. Although this approach might make some very complex data
communication patterns slow to compute, we have found that most multimedia algorithms have
patterns which are relatively simple. Because of this we endorse [21]’s technique for supplying
the data communication needs of multimedia applications.
A related class of instructions that we found to be quite useful when programming with the
AltiVec extension was “splat” instructions. A splat operation places either an immediate scalar or
specified element from a source register into every element of the destination register. This was
very useful when constants were required; on other architectures it is necessary to statically store
these types of values in memory, and then load them into a register as needed.
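For example, with AltiVec intrinsics (an illustrative sketch):

    #include <altivec.h>

    /* Broadcast a small constant into every 16-bit element without a
     * memory load; the immediate must lie in -16...15. */
    vector signed short rounding_const(void) {
        return vec_splat_s16(8);        /* VSPLTISH */
    }

    /* Replicate element 0 of an existing vector across all elements. */
    vector signed short splat0(vector signed short v) {
        return vec_splat(v, 0);         /* VSPLTH */
    }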
5.3.10. Prefetching
Prefetching is a hardware or software technique which tries to predict data access needs in
advance, overlapping memory access with useful computation. Although we will not otherwise
examine the performance implications of prefetch instructions (which would be a useful
extension to this study), we mention them briefly because they are often a part of multimedia
instruction sets due to the highly predictable nature of data accesses in multimedia applications.
Prefetching can be implemented in software, hardware or both, with each technique
distinguished by its methods for detection (discovering when prefetching will be of benefit) and
synchronization (the means for ensuring that data is available in the cache before it is needed but
not so far in advance that it pollutes the cache by purging still needed items) [38].
Many multimedia instruction sets take a purely software approach by providing prefetch
instructions that fetch only the cache block at a specified address into the cache from main
memory (without blocking normal load/store instruction accesses). Determining the ideal
location for this type of prefetch instruction depends on many architectural parameters.
Unfortunately, these include such things as the number of CPU clocks for memory latency and
the number of CPU clocks to transfer a cache line, which are both highly machine dependent and
not readily available to the programmer.
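A purely software-prefetched loop might therefore look as follows (a sketch using the GCC __builtin_prefetch builtin; the prefetch distance is an assumed value that must be tuned to the machine at hand):

    #define PREF_AHEAD 64  /* bytes ahead of use; machine dependent */

    long sum_bytes(const unsigned char *src, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            /* hint: read-only access, low temporal locality */
            __builtin_prefetch(src + i + PREF_AHEAD, 0, 0);
            sum += src[i];
        }
        return sum;
    }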
Another technique is a software-hardware hybrid approach in which a software-directed
hardware prefetching scheme is used. One variation of this is implemented by Motorola’s
AltiVec and its data stream touch instruction (dst) instruction which indicates the memory
sequence or pattern that is likely to be accessed; a separate prefetch instruction need not be
issued for each data element as hardware optimizes the number of cache blocks to prefetch.
Thus, it is not necessary for the programmer to know the parameters of the cache system.
Currently there is no means for synchronization in AltiVec implementations – the data specified
by a dst is brought into the cache as quickly as possible rather than at the same rate as its
consumption. Thus, it is currently necessary to periodically issue dst instructions specifying
short access patterns rather than a single long instruction. This limits the potential benefit of this
technique over a pure software approach. Several hardware techniques for maintaining the
synchronization of data prefetch and consumption in hybrid solutions as well as pure hardware
prefetching are discussed in [29].
5.4. Bottlenecks and Unused Functionality
In this section we discuss those features which appear in multimedia instruction sets that
do not appear to be useful, and are not “free”; i.e. they aren’t a low (or no) cost side effect of
some other useful feature.
5.4.1. Instruction Primitives
The VIS instruction set does not include full 16-bit multiply instructions. It instead offers
multiplication primitives, the results of which must be combined through addition (see Listing 1
and Listing 2).

    fmul8sux16  %f0, %f2, %f4
    fmul8ulx16  %f0, %f2, %f6
    fpadd16     %f4, %f6, %f8

Listing 1. Sun VIS 16-bit x 16-bit → 16-bit Multiply

    fmuld8sux16 %f0, %f2, %f4
    fmuld8ulx16 %f0, %f2, %f6
    fpadd32     %f4, %f6, %f8

Listing 2. Sun VIS 16-bit x 16-bit → 32-bit Multiply
The reason that the architects of VIS divided up 16-bit multiplication in this way was to
decrease die area. Not providing a full 16x16 multiplier subunit cut the size of the arrays in half
[42]. Unfortunately, dividing an operation into several instructions which are not otherwise
useful in and of themselves increases register pressure, decreases instruction decoding bandwidth
and creates additional data dependencies. Splitting SIMD instructions (which have been
introduced for their ability to extract data parallelism) can actually cripple a superscalar
processor’s ability to extract instruction level parallelism. A multi-cycle operation can be a better
solution than a multi-instruction operation because instruction latencies can be transparently
upgraded in future processors, while poor instruction semantics cannot be repaired without
adding new instructions.
5.4.2. Unused High Latency Approximation Instructions
Floating point approximation of 1x instructions, although available on several platforms,
did not find application in the kernels we studied. AltiVec includes approximate log2 (logarithm
base 2) and exp2 (2 raised to a power) instructions, which can potentially be used in the lighting
stage of a 3-D rendering pipeline. Lighting calculations are currently handled by 3-D accelerator
cards, and not the CPU, and so were not included in the BMKL.
5.4.3. Unused Pixel Conversion Instructions
Motorola’s AltiVec extension includes pixel pack (vpkpx) and pixel unpack (vupkhpx,
vupklpx) instructions for converting between 32-bit true color and 16-bit color representations.
These did not find application within the BMKL, although it is possible that they might be of
utility in situations where AltiVec needs to operate on 16-bit color data; many video games use
16-bit textures, for example.
5.4.4. Unused Memory Access Instructions
Sun’s VIS includes two sets of instructions for accelerating multimedia operations with
sophisticated memory addressing needs. The first, edge8, edge16, and edge32, produce bit
vectors to be used in conjunction with partial store instructions to deal with the boundaries in
2-D images. The second group of addressing instructions includes array8, array16 and
array32 which find use in volumetric imaging (the process of displaying a two dimensional
slice of three dimensional data). An array instruction converts (x,y,z) coordinates into a memory
address. The Berkeley multimedia workload does not include any volumetric imaging
applications, so it is not surprising that these instructions found no use in our workload.
5.4.5. Singular, Highly Utilized Resources
Although we usually think of SIMD architectures as extracting data level parallelism, all of
the implementations of the instruction sets we have examined are also superscalar, with multiple
parallel SIMD functional units. In fact, unless the SIMD vector length is long enough to hold the
entire data set being operated on, there is almost always the potential to extract instruction as
well as data level parallelism.
Sun’s VIS architecture does not include subword parallel shift instructions. Instead, the
graphics status register (GSR) has a 3-bit addr_offset field that is used implicitly for byte
granularity alignment, as well as a 4-bit scale_factor field for packing/truncation
operations. The VIS GSR is a serializing bottleneck because any time packing or alignment
functionality is needed, it must be pushed through the GSR. Because VIS lacks subword shift
operations, we found ourselves synthesizing such operations with the packing and alignment
operations whenever we could not otherwise avoid them. Even with the careful planning of
packing and alignment operations it was often necessary to write to the GSR (a slow operation)
several times during each iteration of the loops of our multimedia kernels. The serializing effect
of this singular resource prevented VIS operations from proceeding at the fullest possible degree
of (instruction) parallelism.
5.5. Overall Instruction Mix
Table 4 shows the total mix of dynamic instructions executed by each architecture. Counts
are for the code within the kernels only, and do not include instructions from the remainder of
each application. “Control state” instructions are those that interact with special purpose machine
registers (e.g. the GSR on Sun’s VIS). Branches and other data dependent codes such as
conditional moves are categorized as “control flow” type instructions.
We can see that SIMD kernels utilize a significant number of scalar operations for
pointers, loop variables and clean up code, evidence that sharing the integer data path is not a
good idea. Intel’s SSE is a 128-bit wide extension, as compared to AMD’s 64-bit wide 3DNow!,
explaining why Intel’s overall instruction count is lower by about 1 billion (22%) instructions.
The same reasoning applies to Motorola’s G4 which has the 128-bit wide (both packed integer
and floating point) AltiVec extension. DEC’s bloated instruction count is due to the fact that
their MVI extension is very limited in functionality (13 instructions in total), and so many
operations need to be done with scalar instructions.
Instruction Class        | AMD Athlon | DEC 21264A | Intel Pentium III | Motorola G4 | Sun UltraSPARC IIi
Int Load/Store           | 5.4%       | 20.3%      | 6.2%              | 2.4%        | 2.6%
Int Arithmetic           | 12.1%      | 29.2%      | 12.3%             | 19.1%       | 26.6%
Int Control Flow         | 12.8%      | 11.3%      | 13.4%             | 13.8%       | 9.3%
Int Data Communication   | 2.3%       | 6.7%       | 2.1%              | 1.1%        | 0.0%
FP Load/Store            | 3.2%       | 11.2%      | 4.0%              | 7.4%        | 8.0%
FP Arithmetic            | 0.0%       | 18.1%      | 0.0%              | 2.3%        | 8.4%
FP Control Flow          | 0.0%       | 0.1%       | 0.0%              | 0.0%        | 6.2%
FP Data Communication    | 0.0%       | 0.0%       | 0.0%              | 0.0%        | 0.0%
SIMD Load/Store          | 19.4%      | n/a        | 21.0%             | 17.9%       | 11.7%
SIMD Data Communication  | 12.3%      | n/a        | 12.6%             | 17.0%       | 9.7%
SIMD Int Arithmetic      | 12.5%      | 3.1%       | 18.8%             | 11.8%       | 13.6%
SIMD Int Control Flow    | 0.0%       | n/a        | 0.0%              | 0.0%        | 3.5%
SIMD FP Arithmetic       | 19.9%      | n/a        | 9.3%              | 6.9%        | n/a
SIMD FP Control Flow     | 0.0%       | n/a        | 0.0%              | 0.0%        | n/a
Control State            | 0.2%       | 0.0%       | 0.2%              | 0.3%        | 0.4%
Total Count              | 4.49E+09   | 7.45E+09   | 3.52E+09          | 4.53E+09    | 5.60E+09

Table 4. Overall Instruction Mix – values are percentages of the total counts listed at the bottom of each column. Control state instructions are those that interact with special purpose registers. Control flow instructions include branches as well as conditional moves and comparisons. Cells marked “n/a” indicate functionality that was not implemented by a particular architecture (as opposed to implemented but unused, which is indicated by 0.0%).
Data communication operations represent the overhead necessary to put data in a format
amenable to SIMD operations. Ideally, these types of instructions should make up as small a part
of the dynamic instruction mix as possible. Table 4 reveals that the Motorola AltiVec extension
executes the largest percentage of these instructions. This is due to the fact that a wider register
width means that it is less likely that the data is arranged correctly as first loaded from memory.
Also, data communication operations are used by AltiVec to simulate unaligned access to
memory (typically a separate instruction on other architectures).
6. Performance Comparison
6.1. SIMD Speedup over C
In order to establish a baseline for performance, the average execution time of the C and
SIMD assembly versions of the BMKL were measured on each platform (Table 5). A speedup
from 2x-5x was typical, although some kernels were sped up considerably more. Note that the C
compiler for the Apple system is from a pre-release version of MacOS X (developer’s preview 3)
and is known to be weak, inflating the speedup of AltiVec over C.
C time (ns):
AMD Athlon 500:          1,087  1,801  7,374  1.22E+08  24,332  125,484  2,999  2,407  956  34,233  15,268  21,849  1.60E+07  2.55E+07  7,308  11,527  1.0E+07  3.7E+04
DEC Alpha 21264 500:     669  979  8,591  2.24E+07  12,475  67,321  1,276  1,143  616  19,549  11,120  19,208  9.21E+06  1.47E+07  4,900  7,990  2.9E+06  2.2E+04
Intel Pentium III 500:   969  1,855  7,938  6.01E+07  38,192  147,047  2,772  2,536  1,065  37,100  16,129  43,582  1.58E+07  2.20E+07  8,349  20,491  6.1E+06  4.1E+04
Motorola 7400 (G4) 500:  1,054  1,466  8,906  5.80E+07  33,176  232,272  3,326  4,110  1,710  29,488  24,156  23,721  1.75E+07  2.93E+07  4,917  81,985  6.6E+06  4.7E+04
Sun UltraSPARC IIi 360:  1,662  2,408  12,222  6.20E+07  30,922  111,258  3,782  6,165  3,550  34,928  36,773  48,660  1.77E+07  3.99E+07  7,138  14,455  7.5E+06  5.3E+04
Sun UltraSPARC IIi 500*: 1,197  1,734  8,800  4.47E+07  22,264  80,106  2,723  4,439  2,556  25,148  26,477  35,035  1.27E+07  2.87E+07  5,140  10,408  5.4E+06  3.8E+04
Average:                 1,088  1,702  9,006  6.49E+07  27,819  136,677  2,831  3,272  1,579  31,059  20,689  31,404  1.52E+07  2.63E+07  6,522  27,290  6.7E+06  4.2E+04

SIMD time (ns):
AMD Athlon 500:          114  257  4,559  1.29E+07  1,438  63,446  827  604  341  18,557  9,887  7,425  1.13E+07  1.08E+07  2,585  4,597  2.2E+06  1.1E+04
DEC Alpha 21264 500:     350  379  1.30E+07  1,168  9.26E+06  4.49E+06  4.5E+06  6.6E+04
Intel Pentium III 500:   152  381  6,336  1.08E+07  1,228  84,661  993  669  528  15,010  11,003  5,759  1.17E+07  1.00E+07  4,718  4,338  2.0E+06  1.2E+04
Motorola 7400 (G4) 500:  423  437  3,427  1.30E+07  907  116,828  847  618  446  14,335  3,729  3,672  1.08E+07  3.07E+06  2,727  2,998  1.7E+06  1.0E+04
Sun UltraSPARC IIi 360:  342  852  2.30E+07  3,353  2,508  8,715  1,716  15,009  15,202  1.95E+07  1.05E+07  4.8E+06  3.2E+04
Sun UltraSPARC IIi 500*: 246  613  1.66E+07  2,415  1,806  6,275  1,235  10,806  10,945  1.41E+07  7.53E+06  3.5E+06  2.3E+04
Average:                 276  461  4,774  1.45E+07  1,732  88,312  1,294  2,355  758  15,967  9,907  8,015  1.25E+07  7.77E+06  3,343  3,978  2.2E+06  1.5E+04

SIMD speedup relative to C:
AMD Athlon 500:          9.6x 7.0x 1.6x 9.4x 16.9x 2.0x 3.6x 4.0x 2.8x 1.8x 1.5x 2.9x 1.4x 2.4x 2.8x 2.5x 4.5x 3.4x
DEC Alpha 21264 500:     1.9x 2.6x 1.7x 1.0x 1.0x 3.3x 1.9x 1.7x
Intel Pentium III 500:   6.4x 4.9x 1.3x 5.6x 31.1x 1.7x 2.8x 3.8x 2.0x 2.5x 1.5x 7.6x 1.3x 2.2x 1.8x 4.7x 5.1x 3.3x
Motorola 7400 (G4) 500:  2.5x 3.4x 2.6x 4.5x 36.6x 2.0x 3.9x 6.6x 3.8x 2.1x 6.5x 6.5x 1.6x 9.6x 1.8x 27.4x 7.6x 4.6x
Sun UltraSPARC IIi 500*: 4.9x 2.8x 2.7x 9.2x 1.5x 0.7x 2.1x 2.5x 3.2x 0.9x 3.8x 3.1x 2.5x
Average:                 5.0x 4.1x 1.8x 4.8x 23.5x 1.9x 3.0x 3.2x 2.7x 2.1x 3.0x 5.0x 1.3x 4.2x 2.1x 11.5x 5.0x 3.6x

C %DM:
AMD Athlon 500:          -9% -15% 11% -99% 7% 4% -14% 18% 31% -18% 18% 24% -12% -6% -19% 56% -1%
DEC Alpha 21264 500:     33% 38% -3% 64% 52% 48% 51% 61% 55% 33% 40% 33% 35% 39% 20% 70% 42%
Intel Pentium III 500:   3% -18% 5% 2% -46% -13% -6% 13% 23% -27% 13% -52% -11% 9% -36% 23% -7%
Motorola 7400 (G4) 500:  -6% 6% -7% 6% -27% -78% -27% -40% -24% -1% -30% 17% -23% -22% 20% -210% -28%
Sun UltraSPARC IIi 500*: -20% -11% -6% 27% 15% 39% -4% -52% -85% 14% -42% -22% 10% -19% 16% 61% -5%

SIMD %DM:
AMD Athlon 500:          56% 38% 28% 3% -9% 23% 28% 68% 46% 0% -6% 21% 1% -50% 36% 24% 19%
DEC Alpha 21264 500:     -36% 8% -35% 2% 52% 18% -11% 37% 3% -6% -19% -104% 19% 38% -22% -32% -6%
Intel Pentium III 500:   41% 8% 0% 19% 7% -3% 14% 64% 17% 19% -18% 39% -2% -40% -18% 28% 11%
Motorola 7400 (G4) 500:  -65% -6% 46% 2% 31% -42% 26% 67% 30% 23% 60% 61% 5% 57% 32% 51% 24%
Sun UltraSPARC IIi 500*: 4% -48% -39% -25% -82% 3% -57% -236% -95% -36% -16% -16% -23% -5% -28% -72% -48%

Table 5. Kernel Performance Statistics – From top to bottom: C time (ns), SIMD time (ns), SIMD speedup relative to C, C percent deviation from the mean (%DM) and SIMD %DM (see text for further details on this metric); within each block, each row lists per-kernel values, with the final column giving the overall average. Kernels that were not implemented in SIMD due to insufficient instruction set functionality are omitted from the affected rows. The DCT kernel was originally coded in floating point, but implemented in fixed point integer for SIMD codes. (*) Sun UltraSPARC IIi 500 MHz values are extrapolated from measurements taken on a 360 MHz system.
The DCT and IDCT algorithms are of the same computational order of magnitude so it
might seem strange that they demonstrate such different performance improvements (Table 5).
The reason for this is the DCT kernel was originally coded in floating point, but implemented in
fixed point integer for the SIMD codes. The IDCT is in fixed point integer for both the original
and SIMD versions. Fixed point integer calculations are faster than floating point on most
architectures, and provide sufficient precision for the DCT and IDCT in video applications [39].
6.2. Cross-Architecture Comparison
All of our performance measurements use the metric of time rather than instruction counts
since each instruction set does not necessarily do the same amount of work (computation) in the
same number of instructions, nor take the same amount of time to perform similar instructions.
Our analysis is complicated by the fact that the Sun UltraSPARC IIi used in our study has
a 360 MHz clock, while the remainder of the chips are 500 MHz parts. Of course clock
frequency is not the only determinant of performance; there can be significantly disparate
amounts of work accomplished between machines in each cycle depending on the pipeline length
and number of functional units available. We have chosen to present scaled results (by 500/360 =
1.4x) alongside our measurements from the actual 360 MHz system in order to place an upper
bound on UltraSPARC IIi performance and eliminate a degree of freedom from our comparison.
Certainly whether or not the scaled results are a fair basis for comparison is debatable. Even now
(October 2001) the maximum available clock frequency for the UltraSPARC IIi is 480MHz,
while the other architectures in our study have scaled considerably more in frequency. The
semiconductor process of the UltraSPARC IIi (see Table 2) is comparable to that of the other
processors in our study, so it is not clear that scaling is appropriate. The slower clock frequency
of the UltraSPARC IIi most likely stems from greater delays between latches.
To measure how well or how poorly a given platform performs relative to the competition
we use the metric of percent deviation from the mean:
%DM_i = 100 · (t̄ − t_i) / t̄

where t_i is the time taken on platform i, and t̄ is the average time across all of the
platforms examined for a particular kernel. This metric indicates how much better or worse than
average a platform's performance is (positive values are better than average, since lower time is
better), and provides a normalized result for computing the average improvement or degradation
in performance. Please note that we use the scaled (500 MHz) UltraSPARC IIi values in
computing the mean t̄. Table 5 lists the average %DM (the arithmetic average of the %DM
values for all of the kernels on a given platform), which is a
good general measure of how well a platform stacks up against other architectures.
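For concreteness, here is a minimal C sketch of this computation. Fed the first-kernel SIMD times from Table 5 (114, 350, 152, 423 and 246 ns), it reproduces the first column of SIMD %DM values (56%, -36%, 41%, -65% and 4%):

#include <stdio.h>

/* Percent deviation from the mean: positive values are better
   (faster) than average, negative values are worse (slower). */
double percent_dm(double t_i, double t_mean) {
    return 100.0 * (t_mean - t_i) / t_mean;
}

int main(void) {
    /* First-kernel SIMD times (ns) from Table 5: Athlon, Alpha,
       Pentium III, G4, UltraSPARC IIi (scaled to 500 MHz). */
    double t[] = { 114.0, 350.0, 152.0, 423.0, 246.0 };
    int n = sizeof t / sizeof t[0];

    double t_mean = 0.0;
    for (int i = 0; i < n; i++) t_mean += t[i];
    t_mean /= n;                      /* 257 ns */

    for (int i = 0; i < n; i++)       /* prints +56, -36, +41, -65, +4 */
        printf("%%DM = %+.0f%%\n", percent_dm(t[i], t_mean));
    return 0;
}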
6.2.1. AMD Athlon, Intel Pentium III
At first glance the AMD Athlon and Intel Pentium III processors in our study appear to be
very similar: both run at 500 MHz, and both share Intel’s MMX SIMD integer extension. In fact,
because we implemented the Intel kernel set first, it was possible to reuse much of the same code
for the AMD SIMD version of many of the integer kernels in the BMKL. The SIMD integer
kernels include all but the clip test, FFT, quantize, synthesis filter and transform kernels. It is
interesting to observe the differences in performance between the two processors on what is
often identical code. Among the important differences between the two processors are [37]:
• Athlon has 64 KB instruction and data L1 caches, while the Pentium III’s are each 16 KB
• Athlon has three instruction decoders working in parallel, while Pentium III has two
• Athlon’s pipeline is 10 cycles long, while the Pentium III’s is 12-17 cycles; the cost of branches
on the Pentium III is exacerbated by its weaker branch prediction unit
• Intel’s MMX instruction latency is lower (1 cycle add/sub, 1 cycle shuffle, 3 cycle mult)
compared to Athlon (2 cycle add/sub, 2 cycle shuffle, 3 cycle mult)
• AMD’s 3DNow! FP latency (4 cycle add/sub, 4 cycle multiply) is lower than Intel’s SSE (4
cycle add/sub, 5 cycle multiply, throughput of 64-bits per cycle like 3DNow!)
So, although the SIMD integer instruction set is the same in both cases, AMD’s Athlon
SIMD integer is a more powerful implementation; its only deficiency is the higher latency of
simple SIMD integer instructions when compared to the Pentium III. We can see this very
clearly in Table 5, although where each architectural difference comes into play depends on the
kernel.
One interesting case is that of the DCT and IDCT kernels - the Pentium III is faster on the
DCT, while the Athlon is faster on the IDCT. This is surprising given that the two kernels are
algorithmically quite similar - each is a mirror image of the other. Upon further investigation, we
found that the DCT code has shuffle instructions (pshufw) in several places that are not
scheduled well for the Athlon’s higher shuffle latency, a circumstance which was not
duplicated in the IDCT code.
In terms of multimedia, the greatest difference between the two processors is their SIMD
floating point extensions. AMD’s 3DNow! is quite similar to MMX, reusing the eight x87
floating point registers as 64-bit wide SIMD registers. Intel’s SSE has new 128-bit wide registers
and instructions, although in the Pentium III implementation, each 128-bit instruction is actually
decoded into two 64-bit wide micro-ops. This gives the Pentium III more overall register space
to work with, although it has the same number of registers.
In the arena of floating point operations, the AMD Athlon architecture eliminates the only
deficiency it had compared to the Pentium III in SIMD integer - namely, slightly higher
instruction latency. From Table 5, we can see that overall SIMD AMD Athlon code does 19%
better than average, while the Pentium III is 11% better than the average of all of the multimedia
instruction sets. Both chips perform well, but the AMD Athlon matches or outperforms the Intel
chip on all but one SIMD floating point kernel (quantize).
Quantization is a process by which a real value is converted to a discrete integer
representation. In the quantization kernel, an array of real double precision floating point values,
xr[], is converted to an array of 32-bit integers, ix[], according to
ix[i] = (INT32)(√(√x · x) + 0.4054), where x = istep · |xr[i]|. Note that the original C
implementation utilizes a lookup table for some values (not shown in Listing 3).

static INT32 lutab[10000];
INT32 l_end;
FLOAT64 xr[];
INT32 ix[];
FLOAT64 *istep_p;
FLOAT32 temp;

for (i = 0; i < l_end; i++) {
    temp = (*istep_p) * fabs(xr[i]);
    ix[i] = (int)(sqrt(sqrt(temp) * temp) + 0.4054);
}
Listing 3. Quantize
Although SIMD floating point instruction sets include approximations for 1/√x rather
than √x, the equivalence √x = x · (1/√x) can be used. The reason for AMD’s lackluster
performance on the quantize kernel is that AMD’s reciprocal square root instructions are
actually scalar - they only compute the result for the lower packed element. This costs 3DNow!
performance on this highly square-root intensive kernel.
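For illustration, a minimal SSE intrinsics sketch of this identity (illustrative only, not the code measured in this study):

#include <xmmintrin.h>

/* Approximate sqrt(x) for four packed floats as x * rsqrt(x).
   Note: rsqrtps is only a ~12-bit approximation, and x == 0 yields
   0 * inf = NaN, so real code must guard zero inputs or apply a
   Newton-Raphson refinement step. */
static inline __m128 sqrt_via_rsqrt(__m128 x) {
    return _mm_mul_ps(x, _mm_rsqrt_ps(x));
}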
6.2.2. DEC Alpha 21264
From Table 5 we see that the DEC Alpha platform far outstrips the other systems in terms
of compiled C performance. It has been claimed that a broader multimedia instruction set would
not be useful on Alpha, as an extension like Intel’s MMX only fixes x86 legacy-architecture
performance deficiencies which are not present with the Alpha architecture [33]. Our
performance comparison shows this argument to be false: the kernels programmed with the
extensions from AMD, Intel and Motorola not only matched the performance of the Alpha
21264, but often exceeded it. This indicates that sophisticated multimedia
instruction sets are far more than a means of overcoming x86 architectural limitations.
6.2.3. Motorola G4
Motorola’s AltiVec was the only multimedia extension which was architected from the
ground up - all of the others in some way leverage existing resources. The instructions included
in AltiVec are far more general than those required by multimedia applications. Almost every
possible operation and data type is supported, which should allow it to be applied to other
application domains as well. A 128-bit register width, combined with latencies that are at least as
good as those found on the other (64-bit) architectures, allows AltiVec to come in with the best
overall performance (23.7% better than average on SIMD accelerated code, according to Table
5). However, performance on a few of the kernels is still poor, especially add block and FFT.
The add block kernel’s problem has already been discussed - the 128-bit register width is
actually too long for this kernel, causing unnecessary data to be loaded from memory, since there
is no 64-bit short load in AltiVec, although there are 8, 16, and 32-bit short loads. The reason for
the poor FFT performance is not entirely clear, although revisiting our code revealed that the
way in which we coded stores to unaligned memory could be improved slightly.
6.2.4. Sun UltraSPARC IIi
The performance of the VIS multimedia extension was mediocre at best (even when scaled
from 360 to 500 MHz). It is the oldest multimedia instruction set we have looked at, the second
to be released after HP’s MAX-1 (1996). As we have pointed out earlier, this instruction set
suffers from some odd instruction choices (e.g. multiplication primitives), missing functionality
(no subword shift operations) and an over-utilized graphics status register that is a bottleneck.
7. Future Directions for Multimedia Instruction Sets
SIMD instructions split traditional wide scalar data paths (e.g. 32- or 64-bits) into multiple
narrow data paths which perform the same operation on each packed element. This is of
significant performance benefit for multimedia applications where data types are relatively
narrow and algorithms often perform the same operation over a large number of elements. The
greatest performance limiting factor facing current SIMD implementations is dealing with data
that is not natively in a format amenable to SIMD processing. We have found two major sources
of mismatch: storage data types which are inappropriate for computation (necessitating
conversion overhead before and likely also following computation) and operands which are not
adjacent in memory; recall our earlier example of the matrix transpose step in the DCT kernel.
The first type has led us to propose “fat subwords” (Section 7.1), while the latter suggests
“strided” memory operations (Section 7.2).
7.1. Fat Subwords
With current SIMD instruction sets, data is loaded into registers in the same packed format
that it is stored in memory. Frequently, unpacking is required before operations are performed,
and the unpacked data of course no longer fits within a single register. We therefore propose a
“fat subword” architecture which has the following characteristics: 1) registers (and therefore
subwords) that are wider than the data to be loaded into them, 2) implicit unpack with load and
3) implicit pack (and saturate) with store.
A fat subword architecture is in some ways similar to the as yet unimplemented MIPS
MDMX instruction set [23]. The MIPS MDMX extension has a single 192-bit accumulator
register as its architectural cornerstone, with the more usual style of register to register SIMD
operations also included; non-accumulator SIMD operations share the 64-bit floating point
datapath. The destination of normal SIMD instructions can be either another SIMD register or
the accumulator (to be loaded or accumulated). When accumulating packed unsigned bytes, the
accumulator is partitioned into eight 24-bit unsigned subwords. Packed 16-bit operations cause
the accumulator to be split into four 48-bit subwords.
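A rough C model of this partitioning, purely for illustration (MDMX itself is a hardware accumulator, not a C API):

#include <stdint.h>

#define LANES 8                     /* eight unsigned-byte lanes */
#define ACC_MASK ((1u << 24) - 1)   /* each lane is a 24-bit subword */

/* Model of the 192-bit MDMX accumulator when accumulating packed
   unsigned bytes: eight independent 24-bit unsigned subwords. The
   extra width absorbs many accumulations without overflow (65,793
   additions of 255, or 258 accumulated 8x8-bit products). */
void acc_add_u8(uint32_t acc[LANES], const uint8_t v[LANES]) {
    for (int i = 0; i < LANES; i++)
        acc[i] = (acc[i] + v[i]) & ACC_MASK;  /* wrap models 24-bit width */
}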
What is good about the MDMX accumulator approach is that implicit width promotion
provides an elegant solution to overflow and other issues caused by packing data as tightly as
possible. For example, multiplication is semantically difficult to deal with on current SIMD
architectures because the result of a multiply is longer than either operand [22]. This problem is
avoided by the MDMX accumulator, because there is enough extra space to prevent overflow.
Fixed point arithmetic is also more precise because full precision results can be
accumulated, and the total rounded only once at the end of the loop. Similarly, most scalar
multimedia algorithms only specify saturation at the end of computation because it is more
precise. Unfortunately, in SIMD architectures where the packed elements maximally fill the
available register space, the choice is either to be imprecise (compute with saturation at every
step) or lose parallelism (explicitly promote the input data to be wider and saturate at the end).
Saturating arithmetic can also produce unexpected results since, unlike normal addition, the
order of operations matters.
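A small C example makes the ordering effect concrete; with 8-bit signed saturation, summing the same three values in a different order produces different results:

#include <stdio.h>

/* Clamp to the signed 8-bit range. */
int sat8(int x) { return x > 127 ? 127 : (x < -128 ? -128 : x); }

/* Signed 8-bit saturating add. */
int sadd8(int a, int b) { return sat8(a + b); }

int main(void) {
    /* (100 +sat 50) +sat -50 = 77, but 100 +sat (50 +sat -50) = 100:
       unlike ordinary addition, saturating addition is not associative. */
    printf("%d\n", sadd8(sadd8(100, 50), -50)); /* 77  */
    printf("%d\n", sadd8(100, sadd8(50, -50))); /* 100 */
    return 0;
}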
While the MIPS MDMX solution may seem elegant in principle, it ignores the
architectural aspect of actually making such a design fast. The accumulator is a singular and
unique shared resource, and as such has a tendency to limit instruction level parallelism. We
found that the existence of only a single accumulator would be a severe handicap to avoiding
data dependencies. On a non-accumulator but otherwise superscalar architecture it is usually
possible to perform some other useful, non data dependent operation in parallel so that
processing can proceed at the greatest degree of instruction level parallelism possible. On
MDMX all computations that need to use the accumulator must proceed serially since
accumulation is inherently dependent on the previous result.
128-bit total load/store width        64-bit total load/store width
Storage    Computation                Storage    Computation
8U         12S x 16                   8U         24S x 8
16S        24S x 8                    16S        48S x 4
32S        48S x 4                    -          -
32FP       32FP x 4                   -          -

Table 6. Supported Data Types - data types are divided into storage (as stored in memory) and computation (the width of arithmetic and other instruction elements) widths.
7.1.1. Fat Subword Data Types
To avoid the problems of MDMX, we suggest that a normal register file architecture be
used (as opposed to a single accumulator), with the entire SIMD register file made wider than the
width of loads and stores. For the sake of example we have chosen a width of 192-bits for our
datapath (registers and functional units), corresponding to a load/store width of either 64- or 128-
bits. A 64-bit only implementation would also be possible but of course with a lesser degree of
potential parallelism. Another possibility, should very wide registers prove infeasible, is the use
of register sets (e.g. triplets in the case of 64-bit wide registers). A similar approach was used by
Cyrix in their extension to MMX (found in their 6x86MX/M-II processor) to compensate for the
eight register encoding limit inherent in x86 based architectures.
Supported storage data types include those that we have seen to be of importance in
multimedia for storage formats: 8-bit unsigned, 16-bit signed, 32-bit signed and single precision
floating point. The data types supported by packed arithmetic operations for computation are
different: 12-bit signed, 24-bit signed, 48-bit signed and single precision floating point.
Depending on the algorithm it may be desirable to access memory at either 64-bit or 128-bit
widths (see Table 6), which would then be unpacked to different computational widths: a smaller
number of high-precision subwords (64-bit storage width) or a larger number of low-precision
subwords (128-bit storage width). This allows for better matching with the natural width of each
algorithm.
In the case of packed floating point operations we suggest operating on four single
precision values in parallel, leaving the remaining 16 bits of each subword idle. This allows for
SIMD computations to precisely match scalar results while still being able to take advantage of
the same data rearrangement capabilities used for integer data types.
7.1.2. Fat Subword Operations
In [36] we present our proposed instruction set (97 instructions in total) for a fat subword
SIMD multimedia extension. Unlike existing instruction sets which are fundamentally byte-
based, this instruction set is centered on quantities which are multiples of 12-bits wide. This can
be thought of as four extra bits of precision for every byte of actual storage data loaded. Because
we have fundamentally changed the treatment of multimedia data types, several types of typical
instructions are no longer useful. These include operations such as average, pack/unpack and
truncating multiplication.
7.1.3. Fat Subword Example
A small example of how to code for the proposed architecture is shown in Figure 1. Note
that load and store instructions specify both computational and storage data types.
lvx8u24s v0,[1000]
lvx8u24s v1,[1008]
mult24s  v2,v0,v1
stv24s8u v2,[1000]
Figure 1. Fat Subword Example – two vectors of eight 8-bit unsigned integers are loaded and implicitly expanded to 24-bit signed notation from address 0x1000 and 0x1008 into vector registers v0 and v1 respectively. Vector register v0 and v1 are then multiplied and the result placed in v2. Overflow cannot occur because the maximum width of the result is 16-bits per multiplication (the sum of the operand widths). The result vector is then simultaneously repacked to 8-bit unsigned notation and stored out to memory address 0x1000.
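To make these semantics concrete, the following C model emulates the three instructions of Figure 1; the mnemonics come from our proposal, and the C functions are purely illustrative:

#include <stdint.h>

#define NELEM 8  /* eight 8-bit unsigned elements per 64-bit load */

/* lvx8u24s: load 8-bit unsigned storage data and implicitly widen to
   the 24-bit signed computation format (modeled with int32_t lanes). */
void lvx8u24s(int32_t v[NELEM], const uint8_t *mem) {
    for (int i = 0; i < NELEM; i++) v[i] = (int32_t)mem[i];
}

/* mult24s: packed multiply; an 8x8-bit product is at most 16 bits,
   so it cannot overflow a 24-bit subword. */
void mult24s(int32_t d[NELEM], const int32_t a[NELEM], const int32_t b[NELEM]) {
    for (int i = 0; i < NELEM; i++) d[i] = a[i] * b[i];
}

/* stv24s8u: store back to 8-bit unsigned storage format with implicit
   saturation to [0, 255]. */
void stv24s8u(uint8_t *mem, const int32_t v[NELEM]) {
    for (int i = 0; i < NELEM; i++)
        mem[i] = (uint8_t)(v[i] < 0 ? 0 : (v[i] > 255 ? 255 : v[i]));
}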
7.2. Strided Memory Access
We propose that SIMD architectures implement strided load and store instructions to make
the gathering of non-adjacent data elements more efficient. This is similar to the prefetch
mechanism in AltiVec, except that the data elements would be assembled together by the
hardware into a single register, rather than simply loaded into the cache. Of course such a
memory operation would necessarily be slower than a traditional one, but it would cut down
immensely on the overhead that would have to go into reorganizing data as loaded from memory.
Strided loads and stores would have three operands, vD, rA and rB, where vD is the destination
fat-subword register, rA is the base address and rB contains a description of the memory access
pattern:
width - width of the data elements to be loaded [2 bits]: 00 = 8 bits, 01 = 16 bits, 10 = 32 bits, 11 = 64 bits.
offset - number of bytes offset from the base address from which to begin loading [6 bits], interpreted as a signed value.
stride - number of bytes between the effective address of one element in the sequence and the next [24 bits], interpreted as a signed value.
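The following C sketch illustrates the rB descriptor and the gather behavior it encodes; the bit placement of the fields is our own illustrative choice (only the field widths are specified above):

#include <stdint.h>

/* Pack the access pattern into a 32-bit descriptor: width [2 bits],
   offset [6 bits, signed], stride [24 bits, signed]. A hardware
   decoder would sign-extend the offset and stride fields. */
uint32_t make_stride_desc(unsigned width_log2, int offset, int stride) {
    return ((uint32_t)(width_log2 & 0x3) << 30) |
           ((uint32_t)(offset & 0x3F) << 24) |
           ((uint32_t)(stride & 0xFFFFFF));
}

/* Software model of the gather a strided load would perform in
   hardware: sixteen 8-bit elements into a 128-bit register. */
void gather8(uint8_t dst[16], const uint8_t *base, int offset, int stride) {
    const uint8_t *p = base + offset;
    for (int i = 0; i < 16; i++) dst[i] = p[i * stride];
}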
The color space conversion kernel is an excellent example of where strided load and store
instructions could be used. Pixel data consists of one or more channels or bands, with each
channel representing some independent value associated with a given pixel’s (x,y) position. A
single channel, for example, represents grayscale, while a three (or more) channel image is
typically color. The band data may be interleaved (each pixel’s red, green, and blue data are
adjacent in memory) or separated (e.g. the red data for adjacent pixels are adjacent in memory).
In image processing algorithms such as color space conversion we operate on each channel in a
different way, so band separated format is the most convenient for SIMD processing. Converting
from the RGB to YCBCR color space is done through the conversion coefficients shown in
Listing 4.

UINT8 *rowp, *y_p, *u_p, *v_p;
INT32 red = *rowp++, green = *rowp++, blue = *rowp++;
*y_p++ = +0.2558*red + 0.5022*green + 0.0975*blue + 16.5;
*u_p++ = -0.1476*red - 0.2899*green + 0.4375*blue + 128.5;
*v_p++ = +0.4375*red - 0.3664*green - 0.0711*blue + 128.5;
Listing 4. Color Space Conversion
Listing 5 replaces thirty-eight instructions in the original AltiVec color space conversion
kernel (the corresponding code fragment is listed in [36]). In the original AltiVec code, it was
necessary to load six permute control vectors (each 128-bits wide) before executing the six
vperm instructions required to rearrange the data into band separated format.

rgb_to_yuv:
    oris    r11,r11,RED_PATTERN
    oris    r12,r12,GREEN_PATTERN
    oris    r13,r13,BLUE_PATTERN
    lvxstrd v28,0,r3,r11  ;; v28: |r0|r1|r2|r3|r4|r5|r6|r7|r8|r9|rA|rB|rC|rD|rE|rF|
    lvxstrd v29,0,r3,r12  ;; v29: |g0|g1|g2|g3|g4|g5|g6|g7|g8|g9|gA|gB|gC|gD|gE|gF|
    lvxstrd v30,0,r3,r13  ;; v30: |b0|b1|b2|b3|b4|b5|b6|b7|b8|b9|bA|bB|bC|bD|bE|bF|
Listing 5. Modified Color Space Conversion
Traditional SIMD data communication operations have trouble with data that is not aligned
on boundaries that are powers of two; in the case of color space conversion, visually adjacent
pixels from each band are spaced 3 bytes apart. Strided loads and stores are by definition
unaligned, so this would need to be handled by the load/store hardware in the CPU. It would also
make sense to have additional versions of these instructions to provide hints to the memory
hardware whenever data elements will not be of near-term use (especially if the stride is of the
same or greater order as the cache line size). Another approach to strided data access is presented in
[43] which discusses the Impulse system architecture. Impulse allows for a strided mapping that
creates dense cache lines from array elements that are not contiguous in memory.
8. Conclusion
Specialized instructions have been introduced by microprocessor vendors to support the
unique computational demands of multimedia applications. The mismatch between wide data
paths and the relatively short data types found in multimedia applications has led the industry to
embrace SIMD (single instruction, multiple data) style processing. A SIMD approach has the
significant added benefit of simplifying control circuitry, typically one of the most difficult
aspects of superscalar microprocessor design. We have found that the greatest performance
limiting factor facing current SIMD implementations is dealing with data that is not natively in a
format amenable to SIMD processing.
Storage and computation have opposing goals – the first seeks to use a minimal number of
bits in order to conserve space, while the latter requires extra bits to maintain precision during
intermediate calculations. With current SIMD instruction sets, data is loaded into registers in the
same packed format that it is stored in memory. This is inefficient because unpacking is
frequently required before operations are performed, and the unpacked data of course no longer
fits within a single register. When it is time to write back SIMD data to memory the data must be
repacked and saturated to fit within a typically narrower storage data type. Our proposed fat-
subword register architecture reduces the amount of unpacking, packing and saturation overhead
that SIMD computation must incur by encoding the current and desired computational and
storage data type in SIMD load/store instructions and then implicitly performing the requisite
unpacking or packing and sign extension or saturation.
SIMD instructions perform the same operation on multiple data elements. Because all of
the data within a register must be treated identically, the ability to efficiently rearrange data bytes
within and between registers has become critical to performance. With regard to this we have
suggested strided memory operations as having the potential to greatly simplify access to sparse
data structures as well as the conversion between band-interleaved and band-separated storage
formats. If these strided operations can be made fast in hardware (this seems likely, given that
strided vector loads and stores have long been integral components of traditional vector
supercomputers, so their implementation should already be well understood), one of the major
impediments to efficient SIMD processing will be lifted.
We have presented a unified comparison of five multimedia instruction sets for general
purpose microprocessors, which discusses the advantages and disadvantages of specific features
of current instruction sets. In addition to examining the five instruction sets from a functional
standpoint, we have also empirically compared the performance of specific implementations of
these instruction sets on real hardware. This has allowed us to measure typical speedups relative
to a highly tuned compiler on a given platform as well as to make cross-architecture
comparisons.
Today most microprocessors found in desktop workstations are equipped with multimedia
extensions that remain unused due to the lack of compiler support for these instruction sets [20]
(our analysis has been based entirely on measurements of hand tuned SIMD kernels). A valuable
extension to this study would be an analysis of what characteristics would make SIMD
instructions most amenable to use by a compiler. In addition, multimedia applications, and
therefore our target workload, are constantly evolving: MPEG-4 and MP3-Pro audio are only
two examples of up-and-coming algorithms that demand further examination. As these and other
algorithms and applications come to light we will need to revisit our design choices for
multimedia instruction sets.
References

[1] G.E. Allen, B.L. Evans, L.K. John, “Real-Time High-Throughput Sonar Beamforming Kernels
Using Native Signal Processing and Memory Latency Hiding Techniques,” Proc. 33rd IEEE
Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, Calif., 1999, pp. 137-141
[2] Advanced Micro Devices, Inc., “AMD Athlon Processor x86 Code Optimization Guide,”
Publication #22007, Rev. G, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf,
(current Apr. 2000)
[3] Advanced Micro Devices, Inc., “AMD Athlon Processor Technical Brief,” Publication #22054,
Rev. D, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22054.pdf, (current Apr. 2000)
[4] Advanced Micro Devices, “3DNow! Technology vs. KNI,” White Paper,
http://www.amd.com/products/cpg/3dnow/vskni.html, (current Apr. 2000)
[5] R. Bhargava, L.K. John, B.L. Evans, R. Radhakrishnan, “Evaluating MMX Technology Using DSP
and Multimedia Applications,” Proc. 31st IEEE Intl. Symp. on Microarchitecture (MICRO-31),
Dallas, Texas, 1998, pp. 37-46
[6] T. Burd, “General Processor Info,” CPU Info Center, http://bwrc.eecs.berkeley.edu/CIC/
summary/local/summary.pdf, (current Apr. 2000)
[7] D.A. Carlson, R.W. Castelino, R.O. Mueller, “Multimedia Extensions for a 550-MHz RISC
Microprocessor,” IEEE J. of Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1618-1624
[8] W. Chen, H. John Reekie, S. Bhave, E.A. Lee, “Native Signal Processing on the UltraSparc in the
Ptolemy Environment,” Proc. 30th Ann. Asilomar Conf. on Signals, Systems, and Computers,
Pacific Grove, California, 1996, vol. 2, pp. 1368-1372
[9] Compaq Computer Corporation, “Alpha 21264 Microprocessor Hardware Reference Manual,” Part
No. DS-0027A-TE, Feb. 2000, http://www.support.compaq.com/alpha-tools/
documentation/current/21264_EV67/ds-0027a-te_21264_hrm.pdf, (current Apr. 2000)
[10] T.M. Conte, P.K. Dubey, M.D. Jennings, R.B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song,
A. Wolfe, "Challenges to Combining General-Purpose and Multimedia Processors," IEEE
Computer, vol. 30, no. 12, Dec. 1997, pp. 33-37
[11] C. Hansen, “MicroUnity’s MediaProcessor Architecture,” IEEE Micro, vol. 16, no. 4, Aug. 1996,
pp. 34-41
[12] Intel Corporation, “Intel Introduces The Pentium Processor With MMX Technology,” January 8,
1997 Press Release, http://www.intel.com/pressroom/archive/releases/dp010897.htm, (current Apr.
2000)
[13] Intel Corporation, “Intel Architecture Software Developer’s Manual, Volume 1: Basic
Architecture,” Publication 243190, http://developer.intel.com/design/pentiumii/
manuals/24319002.PDF, (current Apr. 2000)
[14] Intel Corporation, “IA32 Intel Architecture Software Developer’s Manual with Preliminary Intel
Pentium 4 Processor Information,” Volume 1: Basic Architecture,
http://developer.intel.com/design/processor/future/manuals/24547001.pdf, (current Sept. 2000)
[15] Intel Corporation, “Intel Announces New NetBurst Micro-Architecture for Pentium IV Processor,”
http://www.intel.com/pressroom/archive/releases/dp082200.htm, (current Sept. 2000)
[16] J. Keshava, V. Pentkovski, “Pentium III Processor Implementation Tradeoffs,” Intel Tech. J.,
Quarter 2, 1999, http://developer.intel.com/technology/itj/q21999/pdf/impliment.pdf, (current Apr.
2000)
[17] T. Kientzle, “Implementing Fast DCTs,” Dr. Dobb’s J., vol. 24, no. 3, March 1999, pp. 115-119
[18] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, G. Zyner, “The Visual Instruction Set (VIS) in
UltraSPARC,” Proc. Compcon ’95, San Francisco, 1995, pp. 462-469
[19] I. Kuroda, T. Nishitani, “Multimedia Processors,” Proc. IEEE, vol. 86 no. 6, June 1998, pp. 1203-
1221
[20] S. Larsen, S. Amarasinghe, “Exploiting Superword Level Parallelism with Multimedia Instruction
Sets,” Proc. ACM SIGPLAN ‘00 Conf. on Programming Language Design and Implementation,
Vancouver, BC Canada, 2000, pp. 145-156
[21] R.B. Lee, “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in
MicroSIMD Architectures,” Proc. IEEE Intl. Conf. on Application-Specific Systems, Architectures,
and Processors, Boston, 2000, pp. 3-14
[22] R.B. Lee, “Multimedia Extensions for General Purpose Processors,” Proc. IEEE Workshop on VLSI
Signal Processing, Leicester, UK, 1997, pp. 1-15
[23] MIPS Technologies, Inc., “MIPS Extension for Digital Media with 3D,” White Paper, Mar. 12,
1997, http://www.mips.com/Documentation/isa5_tech_brf.pdf, (current Apr. 2000)
[24] Motorola Inc., “MPC7400 RISC Microprocessor User’s Manual, Rev. 0,” Document
MPC7400UM/D, http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/
MPC7400UM.pdf, (current Apr. 2000)
[25] J. Nakashima, K. Tallman, “The VIS Advantage: Benchmark Results Chart VIS Performance,”
White Paper, Oct. 1996, http://www.sun.com/microelectronics/vis/, (current Apr. 2000)
[26] H. Nguyen, L.K. John, “Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using
the AltiVec Technology,” Proc. 1999 Int’l Conf. on Supercomputing, Rhodes, Greece, 1999, pp.
11-20
[27] K. Noer, “Heat Dissipation Per Square Millimeter Die Size Specifications,”
http://home.worldonline.dk/~noer/, (current Apr. 2000)
[28] K.B. Normoyle, M.A. Csoppenszky, A. Tzeng, T.P. Johnson, C.D. Furman, J. Mostoufi,
“UltraSPARC-IIi: Expanding the Boundaries of a System on a Chip,” IEEE Micro, vol. 18, no. 2,
Mar./Apr. 1998, pp. 14-24
[29] A.D. Pimentel, P. Struik, P. van der Wolf, L.O. Hertzberger, “Hardware versus Hybrid Data
Prefetching in Multimedia Processors: A Case Study,” Proc. IEEE Intl. Performance, Computing
and Communications Conference, Phoenix, Ariz., 2000, pp. 525-531
[30] P. Ranganathan, S. Adve, N.P. Jouppi, “Performance of Image and Video Processing with General-
Purpose Processors and Media ISA Extensions,” Proc. 26th Ann. Intl. Symp. on Computer
Architecture, Atlanta, 1999, pp. 124-135
[31] S. Rathnam, G. Slavenberg, “An Architectural Overview of the Programmable Multimedia
Processor, TM-1,” Proc. of COMPCON ‘96, Santa Clara, Calif., 1996, pp. 319-326
[32] D.S. Rice, “High-Performance Image Processing Using Special-Purpose CPU Instructions: The
UltraSPARC Visual Instruction Set,” Univ. Calif. at Berkeley, Master’s Report, Mar. 1996
[33] P. Rubinfeld, B. Rose, M. McCallig, “Motion Video Instruction Extensions for Alpha,” White
Paper, Oct. 1996, http://www.digital.com/alphaoem/papers/pmvi-abstract.htm, (current Apr. 2000)
[34] N.T. Slingerland, A.J. Smith, “Design and Characterization of the Berkeley Multimedia Workload,”
University of California at Berkeley Technical Report CSD-00-1122, Dec. 2000
[35] N.T. Slingerland, A.J. Smith, “Multimedia Instruction Sets for General Purpose Microprocessors: A
Survey,” University of California at Berkeley Technical Report CS-00-1124, Dec. 2000
[36] N.T. Slingerland, A.J. Smith, “Measuring the Performance of Multimedia Instruction Sets,”
University of California at Berkeley Technical Report CSD-00-1125, Dec. 2000
[37] A. Stiller, “Architecture Contest,” c’t Magazine, vol. 16/99,
http://www.heise.de/ct/english/99/16/092/, (current Dec. 2000)
[38] P. Struik, P. van der Wolf, A.D. Pimentel, “A Combined Hardware/Software Solution for Stream
Prefetching in Multimedia Applications,” Proc. 10th Ann. Symp. on Electronic Imaging
(Multimedia Hardware Architectures track), San Jose, Calif., 1998, pp. 120-130
[39] M.T. Sun ed., “IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete
Cosine Transform”, IEEE Standard 1180-1990, IEEE, 1991
[40] Sun Microsystems Inc., “UltraSPARC-IIi User’s Manual,” 1997, Part No. 805-0087-01,
http://www.sun.com/microelectronics/UltraSPARC-IIi/docs/805-0087.pdf, (current Apr. 2000)
[41] S. Thakkar, T. Huff, “The Internet Streaming SIMD Extensions,” Intel Tech. J., Q2 1999,
http://developer.intel.com/technology/itj/q21999.htm, (current Apr. 2000)
[42] M. Tremblay, J.M. O’Connor, V. Narayanan, L. He, “VIS Speeds New Media Processing,” IEEE
Micro, vol. 16, no. 4, Aug. 1996, pp. 10-20
[43] L. Zhang, J.B. Carter, W. Hsieh, S.A. McKee, “Memory System Support for Image Processing,”
Proc. 1999 International Conference on Parallel Architectures and Compilation Techniques,
Newport Beach, Calif., 1999, pp. 98-107