
Vector Processors

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University

Carleton University © S. Dandamudi

Pipelining

Vector machines exploit pipelining in all their activities
- Computations
- Movement of data from/to memory

Pipelining provides overlapped execution
- Increases throughput
- Hides latency …

Pipelining (cont’d)

Pipeline overlaps execution: 6 versus 18 cycles

One measure of performance:

$$\text{Speedup} = \frac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$$

Ideal case: an n-stage pipeline should give a speedup of n

Two factors affect this:
- Pipeline fill
- Pipeline drain

N computations, each takes n * T time

Non-pipelined time = N * n * T

Pipelined time = n * T + (N - 1) * T = (n + N - 1) * T

$$\text{Speedup} = \frac{n N}{n + N - 1} = \frac{1}{1/N + 1/n - 1/(n N)}$$

[Figure: plot against the number of elements, N (axis from 0 to 150), with one curve per pipeline depth, n]
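The speedup expression above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names and the sample values of n and N are illustrative assumptions. It shows the speedup approaching n only when N is large, because of pipeline fill and drain.

```python
def pipelined_cycles(n_stages: int, n_items: int) -> int:
    """Time, in units of T, to push n_items through an n_stages pipeline."""
    return n_stages + n_items - 1


def speedup(n_stages: int, n_items: int) -> float:
    """Non-pipelined time N*n*T divided by pipelined time (n + N - 1)*T."""
    return (n_items * n_stages) / pipelined_cycles(n_stages, n_items)


if __name__ == "__main__":
    # Pipeline depths and element counts below are illustrative only.
    for n in (3, 6, 12):
        row = [round(speedup(n, N), 2) for N in (1, 10, 50, 150)]
        print(f"n = {n:2d}: {row}")
```

With a single computation (N = 1) the speedup is exactly 1, and for a fixed depth it grows toward n as N increases.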

Vector Machines

Provide high-level operations
- Work on vectors (linear arrays of numbers)

A typical vector operation
- Add two 64-element floating-point vectors
- Equivalent to an entire loop

CRAY format: V3 ← V2 VOP V1

Vector Machines (cont’d)

Consists of
- Scalar unit
  - Works on scalars
  - Address arithmetic
- Vector unit
  - Responsible for vector operations
  - Several vector functional units: integer add, FP add, FP multiply, …

Vector Machines (cont’d)

Two types of architecture
- Memory-to-memory architecture
  - Vectors are memory resident
  - The first vector machines were of this type
  - Examples: CDC STAR-100, CYBER 205
- Vector-register architecture
  - Vectors are stored in registers
  - Modern vector machines belong to this type
  - Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820

Components

Primary components of a vector-register machine
- Vector registers
  - Each register can hold a small vector
  - Example: Cray-1 has 8 vector registers
    - Each vector register can hold 64 doublewords (64-bit values)
  - Two read ports and one write port
    - Allows overlap among the vector operations

[Figure: Cray-1 architecture]

Components
- Vector functional units
  - Each unit is fully pipelined
  - Can start a new operation on every clock cycle
  - Cray-1 has six functional units: FP add, FP multiply, FP reciprocal, integer add, logical, shift
- Scalar registers
  - Store scalars
  - Compute addresses to pass on to the load/store unit

Components
- Vector load/store unit
  - Moves vectors between memory and vector registers
  - Load and store operations are pipelined
  - Some processors have more than one load/store unit
    - NEC SX/2 has 8 load/store units
- Memory
  - Designed to allow pipelined access
  - Typically uses interleaved memories (discussed later)

Some Example Vector Machines

Machine          Year   # vector registers   VR size (elements)   # load/store units
Cray-2           1985   8                    64                   1
Cray Y-MP        1988   8                    64                   2 loads/1 store
Fujitsu VP100    1982   8-256                32-1024              2
Hitachi S810     1983   32                   256                  4
NEC SX/2         1984   8+8192               256+var.             8
Convex C-1       1985   8                    128                  1

Some Example Vector Machines (cont’d)

Vector functional units
- Cray X-MP/Y-MP: 8 units
  - FP add, FP multiply, FP reciprocal
  - Integer add, 2 logical
  - Shift
  - Population count/parity

Some Example Vector Machines (cont’d)

Vector functional units (cont’d)

- NEC SX/2: 16 units
  - 4 FP add, 4 FP multiply/divide
  - 4 integer add/logical, 4 shift

Advantages of Vector Machines

Flynn’s bottleneck can be reduced
- Vector instructions significantly improve code density
  - A single vector instruction specifies a great deal of work
  - Reduces the number of instructions needed to execute a program
- Eliminates the control overhead of a loop
  - A vector instruction represents the entire loop
  - Loop overhead can be substantial

Advantages of Vector Machines (cont’d)

Impact of main memory latency can be reduced
- Vector instructions that access memory have a known pattern
  - Pipelined access can be used
  - Can exploit interleaved memory
- High latency associated with memory can be amortized over the entire vector
  - Latency is not associated with each data item when accessing a floating-point number

Advantages of Vector Machines (cont’d)

Control hazards can be reduced
- Vector machines organize data operands into regular sequences
  - Suitable for pipelined access in hardware
- A vector operation takes the place of a loop

Data hazards can be eliminated
- Due to the structured nature of the data
- Allows planned prefetching of data

Example Problem

A typical vector problem: Y = a * X + Y
- X and Y are vectors
- This problem is known as
  - SAXPY (single precision A*X Plus Y)
  - DAXPY (double precision A*X Plus Y)
- SAXPY/DAXPY represents a small piece of code that takes most of the time in the Linpack benchmark

Example Problem (cont’d)

Non-vector code fragment

            LD    F0,a
            ADDI  R4,Rx,#512    ;last address to load
    loop:   LD    F2,0(Rx)      ;F2 := M[0+Rx], i.e., load X[i]
            MULT  F2,F0,F2      ;a*X[i]
            LD    F4,0(Ry)      ;load Y[i]
            ADD   F4,F2,F4      ;a*X[i] + Y[i]
            SD    F4,0(Ry)      ;store into Y[i]
            ADDI  Rx,Rx,#8      ;increment index to X
            ADDI  Ry,Ry,#8      ;increment index to Y
            SUB   R20,R4,Rx     ;R20 := R4-Rx
            JNZ   R20,loop      ;jump if not done

9 instructions in the loop

Vector code fragment

            LD      F0,a        ;load scalar a
            LV      V1,Rx       ;load vector X
            MULTSV  V2,F0,V1    ;V2 := F0 * V1
            LV      V3,Ry       ;load vector Y
            ADDV    V4,V2,V3    ;V4 := V2 + V3
            SV      Ry,V4       ;store the result

Only 6 instructions (1 scalar load and 5 vector instructions)!
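For readers less familiar with the assembly notation, the two fragments can be paraphrased in Python. This is only an illustrative sketch: the function names are hypothetical, and Python lists stand in for memory and vector registers. The first function mirrors the element-at-a-time loop; the second mirrors the multiply-then-add sequence of whole-vector operations.

```python
def daxpy_scalar(a, x, y):
    """Element-at-a-time loop, the shape of the non-vector fragment."""
    result = list(y)
    for i in range(len(x)):
        result[i] = a * x[i] + result[i]   # MULT, ADD, SD for one element
    return result


def daxpy_vector_style(a, x, y):
    """Whole-vector steps, the shape of the vector fragment."""
    v1 = list(x)                            # LV     V1,Rx
    v2 = [a * e for e in v1]                # MULTSV V2,F0,V1
    v3 = list(y)                            # LV     V3,Ry
    v4 = [p + q for p, q in zip(v2, v3)]    # ADDV   V4,V2,V3
    return v4                               # SV     Ry,V4
```

Both functions return the same values; only the granularity of each step differs.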

Two main observations
- Execution efficiency
  - Vector code
    - Executes 6 instructions
  - Non-vector code
    - Nearly 600 instructions (9 * 64)
    - Lots of control overhead: 4 out of 9 instructions!
    - This overhead is absent in the vector code
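The instruction counts can be tallied directly from the two fragments. The sketch below is an illustration with hypothetical constant names. It assumes the two instructions before the loop label are counted as setup and that the four overhead instructions are the two index increments, the subtract, and the branch.

```python
VECTOR_LENGTH = 64      # elements processed in the example
LOOP_BODY = 9           # instructions per iteration of the non-vector loop
LOOP_OVERHEAD = 4       # ADDI, ADDI, SUB, JNZ (assumed to be the overhead)
SETUP = 2               # LD F0,a and ADDI R4,Rx,#512 before the loop
VECTOR_INSTRUCTIONS = 6 # total instructions in the vector fragment


def scalar_instruction_count(n=VECTOR_LENGTH):
    """Dynamic instruction count of the non-vector fragment for n elements."""
    return SETUP + LOOP_BODY * n


def overhead_fraction():
    """Share of each loop iteration spent on loop control."""
    return LOOP_OVERHEAD / LOOP_BODY


if __name__ == "__main__":
    print(scalar_instruction_count())         # 578, i.e. "nearly 600"
    print(round(overhead_fraction(), 2))      # 0.44
```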

Two main observations
- Frequency of pipeline interlock
  - Non-vector code
    - Every ADD must wait for MULT
    - Every SD must wait for ADD
    - Loop unrolling can eliminate this interlock
  - Vector code
    - Each instruction is independent
    - Pipeline stalls once per vector operation, not once per vector element

Vector Length

Vector register has a natural vector length
- 64 elements in CRAY systems

What if the vector has a different length? Three cases
- Vector length < vector register length
  - Use a vector length register to indicate the vector length
- Vector length = vector register length
- Vector length > vector register length

Vector Length (cont’d)

Vector length > vector register length
- Use strip mining
- The vector is partitioned into strips that are less than or equal to the vector register length
- The leftover strip, shorter than the register length, is the odd strip
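Strip mining can be sketched as follows. This is an illustrative model rather than code from the slides: the names `MVL`, `strips`, and `daxpy_strip_mined` are hypothetical, and processing the odd strip first is an assumed convention (it could equally be handled last).

```python
MVL = 64  # maximum vector length: the vector register length on CRAY systems


def strips(n, mvl=MVL):
    """Yield (start, length) pairs covering n elements.

    The odd strip of n % mvl elements comes first (an assumed ordering),
    followed by full strips of mvl elements each.
    """
    start = 0
    odd = n % mvl
    if odd:
        yield (0, odd)
        start = odd
    while start < n:
        yield (start, mvl)
        start += mvl


def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Y = a*X + Y computed one strip at a time."""
    out = list(y)
    for start, vl in strips(len(x), mvl):
        # vl plays the role of the vector length register for this strip
        for i in range(start, start + vl):
            out[i] = a * x[i] + out[i]
    return out
```

For a 200-element vector this produces one odd strip of 8 elements followed by three full strips of 64.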

Vector Stride

Vector stride
- Distance separating the elements that are to be merged into a single vector
- Measured in elements, not bytes

Multidimensional matrices may have non-unit stride access patterns
- Example: matrix multiply

Vector Stride (cont’d)

Matrix multiplication

    for (i = 1, 100)
        for (j = 1, 100)
            A[i,j] = 0
            for (k = 1, 100)
                A[i,j] = A[i,j] + B[i,k] * C[k,j]

In the inner loop over k, one of B[i,k] and C[k,j] is accessed with unit stride and the other with non-unit stride.

Access pattern of B and C depends on how the matrix is stored
- Row-major
  - Matrix is stored row-by-row
  - Used by most languages except FORTRAN
- Column-major
  - Matrix is stored column-by-column
  - Used by FORTRAN

Example 4 x 4 matrix (each entry shows its row and column number):

    11 12 13 14
    21 22 23 24
    31 32 33 34
    41 42 43 44
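The dependence of stride on storage order can be made concrete with a small address calculation. The sketch below is illustrative; the helper names are hypothetical and indices are 0-based, unlike the 1-based loops on the slide. It computes how far apart consecutive inner-loop accesses to B[i,k] and C[k,j] are, in elements.

```python
def offset(i, j, n_rows, n_cols, order):
    """Element offset of entry (i, j), 0-based, in a flat array."""
    if order == "row":
        return i * n_cols + j      # rows are contiguous
    if order == "col":
        return j * n_rows + i      # columns are contiguous
    raise ValueError("order must be 'row' or 'col'")


def inner_loop_strides(n, order):
    """Strides of B[i,k] and C[k,j] when k advances by one (n x n matrices)."""
    base = offset(0, 0, n, n, order)
    stride_b = offset(0, 1, n, n, order) - base   # B[i,k] -> B[i,k+1]
    stride_c = offset(1, 0, n, n, order) - base   # C[k,j] -> C[k+1,j]
    return stride_b, stride_c


if __name__ == "__main__":
    print(inner_loop_strides(100, "row"))   # (1, 100)
    print(inner_loop_strides(100, "col"))   # (100, 1)
```

Under row-major storage B is walked with unit stride and C with stride 100; under column-major storage the roles swap.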

Cray X-MP Instructions

Integer addition

    Vi Vj+Vk      Vi = Vj + Vk
    Vi Sj+Vk      Vi = Sj + Vk    (Sj is a scalar)

Floating-point addition

    Vi Vj+FVk     Vi = Vj + Vk
    Vi Sj+FVk     Vi = Sj + Vk    (Sj is a scalar)

Cray X-MP Instructions (cont’d)

Load instructions

    Vi ,A0,Ak     Vi = M(A0)+Ak
        Vector load with stride Ak
        Loads VL elements from memory address A0
    Vi ,A0,1      Vi = M(A0)+1
        Vector load with stride 1 (special case)

Store instructions

    ,A0,Ak Vi
        Vector store with stride Ak
        Stores VL elements starting at memory address A0
    ,A0,1 Vi
        Vector store with stride 1 (special case)

Logical AND instructions

    Vi Vj&Vk      Vi = Vj & Vk
    Vi Sj&Vk      Vi = Sj & Vk    (Sj is a scalar)

Shift instructions

    Vi Vj>Ak      Vi = Vj >> Ak
    Vi Vj<Ak      Vi = Vj << Ak
        Shift each element of Vj right or left and store the result in Vi
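The strided load and store instructions can be modelled in a few lines. This is a hedged sketch of their described behaviour, not Cray code: memory is assumed to be a flat Python list of words, and the function names and argument order are hypothetical.

```python
def vector_load(memory, a0, ak, vl):
    """Model of `Vi ,A0,Ak`: read VL words starting at A0 with stride Ak."""
    return [memory[a0 + i * ak] for i in range(vl)]


def vector_store(memory, a0, ak, vi):
    """Model of `,A0,Ak Vi`: write the elements of Vi starting at A0 with stride Ak."""
    for i, value in enumerate(vi):
        memory[a0 + i * ak] = value


if __name__ == "__main__":
    # A 4 x 4 matrix stored row by row; entry "rc" is row r, column c.
    memory = [11, 12, 13, 14, 21, 22, 23, 24, 31, 32, 33, 34, 41, 42, 43, 44]
    print(vector_load(memory, 0, 1, 4))   # first row, stride 1: [11, 12, 13, 14]
    print(vector_load(memory, 1, 4, 4))   # second column, stride 4: [12, 22, 32, 42]
```

A stride of 1 picks up contiguous words, while a stride equal to the row length picks up one column of a row-major matrix.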

Sample Vector Functional Units

Vector functional unit    # Stages    Available to chain    Vector results
                                      (clock cycles)        (clock cycles)
Integer ADD (64-bit)      3           8                     VL+8
64-bit shift              3           8                     VL+8
128-bit shift             4           9                     VL+9
Floating ADD              6           11                    VL+11
Floating MULTIPLY         7           12                    VL+12

X-MP Pipeline Operation

Three phases
- Setup phase
  - Sets functional units to perform the appropriate operation
  - Establishes routes to source and destination vector registers
  - Requires 3 clock cycles for all functional units
- Execution phase
- Shutdown phase

X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

- Execution phase
  - Source and destination vector registers are reserved
    - Cannot be used by another instruction
  - Source vector register is reserved for VL+3 clock cycles
    - VL = vector length
  - One pair of operands per clock cycle enters the first stage

- Shutdown phase
  - Shutdown time = 3 clock cycles
  - Shutdown time is the time difference between when the last result emerges and when the destination vector register becomes available for other instructions

- Shutdown phase
  - Destination register becomes available after 3 + n + (VL - 1) + 3 = n + VL + 5 clock cycles
    - Setup time = shutdown time = 3 clock cycles
    - First result comes after n clock cycles
    - Remaining (VL - 1) results come out at one per clock cycle
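The timing rule can be written as a small calculation and compared against the "Vector results" column of the functional-unit table. This is a minimal sketch with hypothetical names; the stage counts are the ones listed in that table.

```python
SETUP = 3       # setup phase, clock cycles
SHUTDOWN = 3    # shutdown phase, clock cycles


def destination_available(n_stages, vl):
    """Cycles until the destination vector register is free again."""
    first_result = n_stages     # first result emerges after n cycles
    remaining = vl - 1          # one further result per cycle
    return SETUP + first_result + remaining + SHUTDOWN


def source_reserved(vl):
    """Cycles for which a source vector register stays reserved."""
    return vl + SETUP


if __name__ == "__main__":
    stages = {"Integer ADD": 3, "128-bit shift": 4,
              "Floating ADD": 6, "Floating MULTIPLY": 7}
    for unit, n in stages.items():
        # Prints the constant c in "VL + c" for each unit
        print(unit, destination_available(n, 64) - 64)
```

For a 64-element floating-point add (n = 6) the destination register becomes available after 6 + 64 + 5 = 75 clock cycles.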

A Simple Vector Add Operation

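    A1 5
    VL A1
    V1 V2+FV3

Overlapped Vector Operations

    A1 5
    VL A1
    V1 V2+FV3
    V4 V5*FV6

The two vector instructions use different registers and different functional units.

Chaining Example

    A1 5
    VL A1
    V1 V2+FV3
    V4 V5*FV1

Here the multiply uses V1, the result of the add.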

Vector Processing Performance

Interleaved Memories

Traditional memory designs
- Provide sequential, non-overlapped access
- Use high-order interleaving

Interleaved memories
- Facilitate overlapped, pipelined access
- Used by vector and high-performance systems
- Use low-order interleaving
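The difference between the two interleaving schemes comes down to which address bits select the bank. The sketch below is illustrative only: the function names and the example sizes (8 banks of 1024 words) are assumptions, and addresses are word addresses.

```python
def high_order_bank(address, n_banks, words_per_bank):
    """High-order interleaving: the upper address bits select the bank."""
    if not 0 <= address < n_banks * words_per_bank:
        raise ValueError("address out of range")
    return address // words_per_bank, address % words_per_bank


def low_order_bank(address, n_banks):
    """Low-order interleaving: the lower address bits select the bank."""
    return address % n_banks, address // n_banks


if __name__ == "__main__":
    # Both functions return (bank, offset within bank).
    print([high_order_bank(a, 8, 1024)[0] for a in range(8)])  # all in bank 0
    print([low_order_bank(a, 8)[0] for a in range(8)])         # banks 0 through 7
```

Consecutive addresses land in the same bank under high-order interleaving, so they are served one after another, while low-order interleaving spreads them across banks so that accesses can overlap.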

Interleaved Memories (cont’d)

Two types of designs
- Synchronized access organization
  - Upper m bits are given to all memory banks simultaneously
  - Requires output latches
  - Does not efficiently support non-sequential access
- Independent access organization
  - Supports pipelined access for arbitrary access patterns
  - Requires address registers

[Figure: synchronized access organization]

[Figure: pipelined transfer of data in interleaved memories]

[Figure: independent access organization]

Number of banks: B ≥ M, where M = memory access time in cycles

Access becomes sequential if stride = B

Example: B = 8, M = 6 clock cycles
- With stride = 1, time to read 16 words = 6 + 16 = 22 clock cycles
- With stride = 8, it takes 16 * 6 = 96 clock cycles
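The two access times in the example can be reproduced with a simple estimate. This is a rough sketch under stated assumptions, with hypothetical function names: a stride touches B / gcd(B, stride) distinct banks; if that is at least M, words stream out one per cycle after the initial access time; otherwise each group of distinct banks is assumed to cost a full access time. The in-between cases are only an approximation and do not come from the slides.

```python
from math import ceil, gcd


def banks_touched(n_banks, stride):
    """Number of distinct banks visited by a constant-stride access."""
    return n_banks // gcd(n_banks, stride)


def read_time(n_words, n_banks, access_time, stride):
    """Estimated clock cycles to read n_words from low-order interleaved memory."""
    distinct = banks_touched(n_banks, stride)
    if distinct >= access_time:
        # Fully pipelined: initial latency, then one word per cycle
        return access_time + n_words
    # Too few banks: assume each round over the distinct banks costs M cycles
    return ceil(n_words / distinct) * access_time


if __name__ == "__main__":
    print(read_time(16, 8, 6, 1))   # 22
    print(read_time(16, 8, 6, 8))   # 96
```

With stride 8 and 8 banks every access falls on the same bank, which is why the pipelining benefit disappears.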

Last slide
