Post on 14-Jan-2016

Vector Processors

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University

Carleton University © S. Dandamudi

Pipelining

Vector machines exploit pipelining in all their activities
- Computations
- Movement of data from/to memory

Pipelining provides overlapped execution
- Increases throughput
- Hides latency
- …

Pipelining (cont’d)

Pipeline overlaps execution: 6 versus 18 cycles

Pipelining (cont’d)

One measure of performance:

$$\text{Speedup} = \frac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$$

Ideal case: an n-stage pipeline should give a speedup of n

Two factors affect this:
- Pipeline fill
- Pipeline drain

Pipelining (cont’d)

N computations, each takes n * T time

$$\text{Non-pipelined time} = N \cdot n \cdot T$$

$$\text{Pipelined time} = n \cdot T + (N - 1) \cdot T = (n + N - 1) \cdot T$$

$$\text{Speedup} = \frac{n \cdot N}{n + N - 1} = \frac{1}{\frac{1}{N} + \frac{1}{n} - \frac{1}{n \cdot N}}$$
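The speedup expression can be checked numerically. The sketch below is illustrative only: the function name `pipeline_speedup` is hypothetical, and the stage time T defaults to 1 because it cancels out of the ratio.

```python
def pipeline_speedup(n, N, T=1.0):
    """Speedup of an n-stage pipeline over non-pipelined execution of N computations."""
    non_pipelined = N * n * T        # every computation takes the full n * T
    pipelined = (n + N - 1) * T      # n * T for the first result, then one result per T
    return non_pipelined / pipelined


if __name__ == "__main__":
    # Pipeline depths taken from the speedup plot in the slides (n = 3, 6, 9).
    for n in (3, 6, 9):
        row = [round(pipeline_speedup(n, N), 2) for N in (1, 10, 50, 150)]
        print(f"n = {n}: {row}")
```

For N = 1 the speedup is 1 (the pipeline only fills and drains), and as N grows the speedup approaches the ideal value n.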

Pipelining (cont’d)

[Figure: speedup (axis from 1.0 to 9.0) versus number of elements, N (0 to 150), with curves for n = 3, n = 6, and n = 9]

Pipelining (cont’d)

[Figure omitted; axis label: Pipeline depth, n]

Vector Machines

Provide high-level operations
- Work on vectors (linear arrays of numbers)
- A typical vector operation
  - Add two 64-element floating-point vectors
  - Equivalent to an entire loop

CRAY format
  V3 ← V2 VOP V1

Vector Machines (cont’d)

Consists of
- Scalar unit
  - Works on scalars
  - Address arithmetic
- Vector unit
  - Responsible for vector operations
  - Several vector functional units
    - Integer add, FP add, FP multiply …

Vector Machines (cont’d)

Two types of architecture
- Memory-to-memory architecture
  - Vectors are memory resident
  - The first vector machines were of this type
  - Examples: CDC STAR-100, CYBER 205
- Vector-register architecture
  - Vectors are stored in registers
  - Modern vector machines belong to this type
  - Examples: Cray-1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820

Components

Primary components of a vector-register machine
- Vector registers
  - Each register can hold a small vector
  - Example: Cray-1 has 8 vector registers
    - Each vector register can hold 64 doublewords (64-bit values)
  - Two read ports and one write port
    - Allows overlap among the vector operations

Cray-1 Architecture

Components

Vector functional units
- Each unit is fully pipelined
- Can start a new operation on every clock cycle
- Cray-1 has six functional units
  - FP add, FP multiply, FP reciprocal, Integer add, Logical, Shift

Scalar registers
- Store scalars
- Compute addresses to pass on to the load/store unit

Components

Vector load/store unit
- Moves vectors between memory and vector registers
- Load and store operations are pipelined
- Some processors have more than one load/store unit
  - NEC SX/2 has 8 load/store units

Memory
- Designed to allow pipelined access
- Typically uses interleaved memories
  - Will discuss later

Some Example Vector Machines

Machine          Year   # VR      VR size     # LSUs
Cray-2           1985   8         64          1
Cray Y-MP        1988   8         64          2 loads/1 store
Fujitsu VP100    1982   8-256     32-1024     2
Hitachi S810     1983   32        256         4
NEC SX/2         1984   8+8192    256+var.    8
Convex C-1       1985   8         128         1

# VR = number of vector registers; VR size = elements per vector register; # LSUs = number of load/store units

Some Example Vector Machines (cont’d)

Vector functional units
- Cray X-MP/Y-MP
  - 8 units
    - FP add, FP multiply, FP reciprocal
    - Integer add, 2 logical
    - Shift
    - Population count/parity

Some Example Vector Machines (cont’d)

Vector functional units (cont’d)
- NEC SX/2
  - 16 units
    - 4 FP add, 4 FP multiply/divide
    - 4 Integer add/logical, 4 Shift

Advantages of Vector Machines

Flynn’s bottleneck can be reduced
- Vector instructions significantly improve code density
- A single vector instruction specifies a great deal of work
- Reduce the number of instructions needed to execute a program
- Eliminate control overhead of a loop
  - A vector instruction represents the entire loop
  - Loop overhead can be substantial

Advantages of Vector Machines (cont’d)

Impact of main memory latency can be reduced
- Vector instructions that access memory have a known pattern
- Pipelined access can be used
- Can exploit interleaved memory
- High latency associated with memory can be amortized over the entire vector
  - Latency is not associated with each data item, as it is when accessing a single floating-point number

Advantages of Vector Machines (cont’d)

Control hazards can be reduced
- Vector machines organize data operands into regular sequences
  - Suitable for pipelined access in hardware
  - A vector operation replaces a loop

Data hazards can be eliminated
- Due to structured nature of data
- Allows planned prefetching of data

Example Problem

A Typical Vector Problem
- Y = a * X + Y
  - X and Y are vectors
- This problem is known as
  - SAXPY (single precision A*X Plus Y)
  - DAXPY (double precision A*X Plus Y)
- SAXPY/DAXPY represents a small piece of code that takes most of the time in the benchmark

Example Problem (cont’d)

Non-vector code fragment

      LD    F0,a          ;load scalar a
      ADDI  R4,Rx,#512    ;last address to load
loop: LD    F2,0(Rx)      ;F2 := M[0+Rx], i.e., load X[i]
      MULT  F2,F0,F2      ;a*X[i]

Example Problem (cont’d)

      LD    F4,0(Ry)      ;load Y[i]
      ADD   F4,F2,F4      ;a*X[i] + Y[i]
      SD    F4,0(Ry)      ;store into Y[i]
      ADDI  Rx,Rx,#8      ;increment index to X
      ADDI  Ry,Ry,#8      ;increment index to Y
      SUB   R20,R4,Rx     ;R20 := R4-Rx
      JNZ   R20,loop      ;jump if not done

9 instructions in the loop

Example Problem (cont’d)

Vector code fragment

      LD      F0,a        ;load scalar a
      LV      V1,Rx       ;load vector X
      MULTSV  V2,F0,V1    ;V2 := F0 * V1
      LV      V3,Ry       ;load vector Y
      ADDV    V4,V2,V3    ;V4 := V2 + V3
      SV      Ry,V4       ;store the result

Only 6 instructions (one scalar load and five vector instructions)!
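The correspondence between the two fragments can be sketched in Python. This is only an illustration, not the slides' own code: the function names are hypothetical, Python lists stand in for vector registers, and each whole-list statement plays the role of one vector instruction.

```python
def daxpy_vector_style(a, x, y):
    """Compute a*X + Y with one whole-vector statement per vector instruction."""
    v1 = list(x)                               # LV     V1,Rx
    v2 = [a * xi for xi in v1]                 # MULTSV V2,F0,V1
    v3 = list(y)                               # LV     V3,Ry
    v4 = [p + q for p, q in zip(v2, v3)]       # ADDV   V4,V2,V3
    return v4                                  # SV     Ry,V4 (returned instead of stored)


def daxpy_scalar_style(a, x, y):
    """Same result, one element per loop iteration, as in the non-vector fragment."""
    out = list(y)
    for i in range(len(x)):                    # index update and branch on every element
        out[i] = a * x[i] + out[i]
    return out


if __name__ == "__main__":
    x = [float(i) for i in range(64)]
    y = [1.0] * 64
    print(daxpy_vector_style(2.0, x, y) == daxpy_scalar_style(2.0, x, y))
```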

Example Problem (cont’d)

Two main observations
- Execution efficiency
  - Vector code
    - Executes 6 instructions
  - Non-vector code
    - Nearly 600 instructions (9 * 64 = 576 in the loop)
    - Lots of control overhead
      - 4 out of 9 instructions!
      - Absent in the vector code
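The instruction counts quoted here follow from simple arithmetic on the two code fragments. The sketch below redoes that arithmetic; the constant and function names are hypothetical, and the values are read off the fragments (64 elements, 9 instructions per iteration, 4 of them loop control, 2 setup instructions before the loop).

```python
ELEMENTS = 64        # 512 bytes / 8 bytes per element in the example
LOOP_BODY = 9        # instructions per iteration of the non-vector loop
LOOP_CONTROL = 4     # ADDI, ADDI, SUB, JNZ in each iteration
SETUP = 2            # LD F0,a and ADDI R4,Rx,#512 before the loop
VECTOR_CODE = 6      # instructions in the vector fragment


def scalar_instruction_count(elements=ELEMENTS):
    """Dynamic instruction count of the non-vector fragment."""
    return SETUP + LOOP_BODY * elements


def control_overhead_count(elements=ELEMENTS):
    """Instructions spent on index updates and branching in the non-vector fragment."""
    return LOOP_CONTROL * elements


if __name__ == "__main__":
    total = scalar_instruction_count()
    print("non-vector instructions:", total)
    print("loop-control instructions:", control_overhead_count())
    print("ratio to vector code:", round(total / VECTOR_CODE, 1))
```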

Example Problem (cont’d)

Two main observations
- Frequency of pipeline interlock
  - Non-vector code
    - Every ADD must wait for MULT
    - Every SD must wait for ADD
    - Loop unrolling can eliminate this interlock
  - Vector code
    - Each instruction is independent
    - Pipeline stalls once per vector operation
      - Not once per vector element

Vector Length

Vector register has a natural vector length
- 64 elements in CRAY systems

What if the vector has a different length?
- Three cases
  - Vector length < Vector register length
    - Use a vector length register to indicate the vector length
  - Vector length = Vector register length
  - Vector length > Vector register length

Vector Length (cont’d)

Vector length > Vector register length
- Use strip mining
- Vector is partitioned into strips that are less than or equal to the vector register length

[Figure omitted: a long vector partitioned into strips, one of them labeled "Odd strip"]
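Strip mining can be sketched as a loop over strips whose length never exceeds the vector register length. The code below is a minimal illustration under stated assumptions: `MVL`, `strip_mine`, and `daxpy_strip_mined` are hypothetical names, the register length of 64 is the CRAY value mentioned earlier, and processing the odd (shorter) strip first is an arbitrary choice.

```python
MVL = 64  # assumed maximum vector length (vector register length)


def strip_mine(n, mvl=MVL):
    """Yield (start, length) strips covering n elements, each at most mvl long."""
    odd = n % mvl
    start = 0
    if odd:
        yield (start, odd)          # the odd strip, shorter than the register
        start += odd
    while start < n:
        yield (start, mvl)          # full-length strips
        start += mvl


def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Y = a*X + Y computed strip by strip."""
    out = list(y)
    for start, vl in strip_mine(len(x), mvl):
        # vl plays the role of the vector length register for this strip
        for i in range(start, start + vl):
            out[i] = a * x[i] + out[i]
    return out


if __name__ == "__main__":
    print(list(strip_mine(200)))
```

A 200-element vector, for example, is split into one odd strip of 8 elements and three full strips of 64.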

Vector Stride

Vector stride
- Distance separating the elements that are to be merged into a single vector
  - In elements, not bytes
- Typically multidimensional matrices may have non-unit stride access patterns
  - Example: matrix multiply

Vector Stride (cont’d)

Matrix multiplication

for (i = 1, 100)
    for (j = 1, 100)
        A[i,j] = 0
        for (k = 1, 100)
            A[i,j] = A[i,j] + B[i,k] * C[k,j]

In the inner loop over k, one of B[i,k] and C[k,j] is accessed with unit stride and the other with non-unit stride

Vector Stride (cont’d)

Access pattern of B and C depends on how the matrix is stored
- Row-major
  - Matrix is stored row-by-row
  - Used by most languages except FORTRAN
- Column-major
  - Matrix is stored column-by-column
  - Used by FORTRAN

Vector Stride (cont’d)

11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44
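How the 4 x 4 example matrix lands in memory under each storage order, and the stride that results, can be sketched as follows. The function names `flatten` and `stride` are hypothetical, and strides are counted in elements rather than bytes.

```python
ROWS, COLS = 4, 4
MATRIX = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
]


def flatten(matrix, order):
    """Linear memory image of the matrix in 'row' or 'column' major order."""
    if order == "row":
        return [matrix[i][j] for i in range(ROWS) for j in range(COLS)]
    if order == "column":
        return [matrix[i][j] for j in range(COLS) for i in range(ROWS)]
    raise ValueError("order must be 'row' or 'column'")


def stride(order, walk):
    """Stride in elements when walking along one 'row' or one 'column'."""
    if order == "row":
        return 1 if walk == "row" else COLS
    return ROWS if walk == "row" else 1


if __name__ == "__main__":
    print("row-major:   ", flatten(MATRIX, "row"))
    print("column-major:", flatten(MATRIX, "column"))
    # Walking down the first column of a row-major matrix needs stride COLS.
    print(flatten(MATRIX, "row")[0::stride("row", "column")])
```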

Cray X-MP Instructions

Integer addition
  Vi Vj+Vk      Vi = Vj + Vk
  Vi Sj+Vk      Vi = Sj + Vk    (Sj is a scalar)

Floating-point addition
  Vi Vj+FVk     Vi = Vj + Vk
  Vi Sj+FVk     Vi = Sj + Vk    (Sj is a scalar)

Cray X-MP Instructions (cont’d)

Load instructions

Vi ,A0,Ak     Vi = M(A0)+Ak
- Vector load with stride Ak
- Loads VL elements starting at memory address A0

Vi ,A0,1      Vi = M(A0)+1
- Vector load with stride 1
- Special case

Cray X-MP Instructions (cont’d)

Store instructions

,A0,Ak Vi
- Vector store with stride Ak
- Stores VL elements starting at memory address A0

,A0,1 Vi
- Vector store with stride 1
- Special case
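The effect of the strided load and store instructions can be modelled in a few lines. This is a sketch under assumptions, not Cray code: memory is a Python list of words, addresses and strides are counted in words, and the helper names `vector_load` and `vector_store` are hypothetical.

```python
def vector_load(memory, a0, ak, vl):
    """Model of a strided vector load: VL words starting at a0, ak words apart."""
    return [memory[a0 + i * ak] for i in range(vl)]


def vector_store(memory, a0, ak, vi):
    """Model of a strided vector store: write the elements of vi starting at a0."""
    for i, value in enumerate(vi):
        memory[a0 + i * ak] = value


if __name__ == "__main__":
    memory = list(range(100, 132))           # 32 words holding 100..131
    v1 = vector_load(memory, 2, 4, 5)        # stride 4, VL = 5
    print(v1)
    vector_store(memory, 0, 1, [0] * 5)      # stride 1 is the special case
    print(memory[:6])
```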

Cray X-MP Instructions (cont’d)

Logical AND instructions
  Vi Vj&Vk      Vi = Vj & Vk
  Vi Sj&Vk      Vi = Sj & Vk    (Sj is a scalar)

Shift instructions
  Vi Vj>Ak      Vi = Vj >> Ak
  Vi Vj<Ak      Vi = Vj << Ak
  - Left/right shift each element of Vj and store the result in Vi

Sample Vector Functional Units

Vector functional unit    # Stages    Available to chain    Vector results
Integer ADD (64-bit)      3           8                     VL+8
64-bit shift              3           8                     VL+8
128-bit shift             4           9                     VL+9
Floating ADD              6           11                    VL+11
Floating MULTIPLY         7           12                    VL+12

X-MP Pipeline Operation

Three phases
- Setup phase
  - Sets functional units to perform the appropriate operation
  - Establishes routes to source and destination vector registers
  - Requires 3 clock cycles for all functional units
- Execution phase
- Shutdown phase

X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

Execution phase
- Source and destination vector registers are reserved
  - Cannot be used by another instruction
- Source vector register is reserved for VL+3 clock cycles
  - VL = vector length
- One pair of operands/clock cycle enters the first stage

X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

Shutdown phase
- Shutdown time = 3 clock cycles
- Shutdown time: time difference between when the last result emerges and when the destination vector register becomes available for other instructions

X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

Shutdown phase
- Destination register becomes available after
  3 + n + (VL - 1) + 3 = n + VL + 5 clock cycles
  - Setup time = shutdown time = 3 clock cycles
  - First result comes after n clock cycles
  - Remaining (VL - 1) results come out at one/clock cycle
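The availability formula can be evaluated for the functional units listed on the "Sample Vector Functional Units" slide. The sketch below is illustrative: the names are hypothetical, and the stage counts are copied from that slide's "# Stages" column.

```python
SETUP = 3       # setup time in clock cycles
SHUTDOWN = 3    # shutdown time in clock cycles

# Stage counts n taken from the "Sample Vector Functional Units" table.
STAGES = {
    "Integer ADD (64-bit)": 3,
    "64-bit shift": 3,
    "128-bit shift": 4,
    "Floating ADD": 6,
    "Floating MULTIPLY": 7,
}


def destination_available(n, vl):
    """Clock cycles until the destination vector register is free again."""
    return SETUP + n + (vl - 1) + SHUTDOWN    # equals n + vl + 5


if __name__ == "__main__":
    vl = 5   # vector length used in the slides' add example
    for unit, n in STAGES.items():
        print(f"{unit}: {destination_available(n, vl)} cycles for VL = {vl}")
```

For every unit the result equals VL plus the constant shown in the table's "Vector results" column (VL+8 for n = 3, VL+11 for n = 6, and so on).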

A Simple Vector Add Operation

A1 5          A1 = 5
VL A1         VL = A1 (vector length is 5)
V1 V2+FV3     V1 = V2 + V3 (floating-point add)

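Overlapped Vector Operations

A1 5          A1 = 5
VL A1         VL = A1 (vector length is 5)
V1 V2+FV3     V1 = V2 + V3 (floating-point add)
V4 V5*FV6     V4 = V5 * V6 (floating-point multiply, no operand in common with the add)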

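Chaining Example

A1 5          A1 = 5
VL A1         VL = A1 (vector length is 5)
V1 V2+FV3     V1 = V2 + V3 (floating-point add)
V4 V5*FV1     V4 = V5 * V1 (floating-point multiply, uses the result V1 of the add)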

Vector Processing Performance

Interleaved Memories

Traditional memory designs
- Provide sequential, non-overlapped access
- Use high-order interleaving

Interleaved memories
- Facilitate overlapped, pipelined access
- Used by vector and high performance systems
- Use low-order interleaving
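The difference between the two interleaving schemes is which address bits select the bank. The following sketch assumes word addresses and a hypothetical memory of 4 banks with 8 words each; the function names are illustrative.

```python
def high_order_bank(address, words_per_bank):
    """High-order interleaving: upper address bits select the bank."""
    return address // words_per_bank, address % words_per_bank   # (bank, offset)


def low_order_bank(address, banks):
    """Low-order interleaving: lower address bits select the bank."""
    return address % banks, address // banks                     # (bank, offset)


if __name__ == "__main__":
    BANKS, WORDS_PER_BANK = 4, 8    # assumed sizes for illustration
    addresses = range(8)
    print("high-order banks:", [high_order_bank(a, WORDS_PER_BANK)[0] for a in addresses])
    print("low-order banks: ", [low_order_bank(a, BANKS)[0] for a in addresses])
```

With high-order interleaving, consecutive addresses stay in one bank, so accesses are serialized; with low-order interleaving they rotate through the banks, which is what allows overlapped, pipelined access.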

Interleaved Memories (cont’d)

Two types of designs
- Synchronized access organization
  - Upper m bits are given to all memory banks simultaneously
  - Requires output latches
  - Does not efficiently support non-sequential access
- Independent access organization
  - Supports pipelined access for arbitrary access pattern
  - Requires address registers

Interleaved Memories (cont’d)

Synchronized access organization

Interleaved Memories (cont’d)

Pipelined transfer of data in interleaved memories

Interleaved Memories (cont’d)

Independent access organization

Interleaved Memories (cont’d)

Number of banks B
- B ≥ M
  - M = memory access time in cycles

Sequential (non-overlapped) access if stride = B
- Every access goes to the same bank

Example: B = 8, M = 6 clock cycles, stride = 1
- Time to read 16 words = 6 + 16 = 22 clock cycles
- If stride is 8, it takes 16 * 6 = 96 clock cycles
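The two timings above can be reproduced with a small model. This is a sketch under assumptions: low-order interleaving is taken to mean bank = address mod B, the function names are hypothetical, and only the two extreme cases from the slide (no bank conflicts, or every access to the same bank) are modelled.

```python
from math import gcd


def banks_visited(stride, banks):
    """Number of distinct banks touched by a strided access (bank = address mod banks)."""
    return banks // gcd(stride, banks)


def read_time(words, banks, access_time, stride):
    """Clock cycles to read `words` words; covers only the two extreme cases."""
    distinct = banks_visited(stride, banks)
    if distinct >= access_time:
        # A bank is revisited only after it has finished: fully pipelined access.
        return access_time + words
    if distinct == 1:
        # Every access hits the same bank: accesses are serialized.
        return words * access_time
    raise NotImplementedError("partial bank conflicts are not modelled in this sketch")


if __name__ == "__main__":
    print(read_time(16, banks=8, access_time=6, stride=1))
    print(read_time(16, banks=8, access_time=6, stride=8))
```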
