vector processing. vector processors combine vector operands (inputs) element by element to produce...

Vector Processing

Upload: lewis-harte

Post on 15-Dec-2015

TRANSCRIPT

Page 1: Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations

Vector Processing

Vector Processors

• Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations are:

– processing one or more vectors to produce a scalar result,

– combining two vectors to produce a third vector,

– combining a scalar and a vector to produce a vector, and

– a combination of the above.
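The four kinds of operation listed above can be sketched in plain C. This is only an illustration: the function names (`dot`, `vadd`, `vscale`, `axpy`) are hypothetical and do not come from the slides, and on a vector processor each loop would correspond to one or a few vector instructions rather than a scalar loop.

```c
#include <stddef.h>

/* Vectors -> scalar: dot product of a and b. */
double dot(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Two vectors -> a third vector: c[i] = a[i] + b[i]. */
void vadd(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Scalar and vector -> vector: c[i] = k * a[i]. */
void vscale(double k, const double *a, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = k * a[i];
}

/* Combination of the above: y[i] = k * x[i] + y[i]. */
void axpy(double k, const double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = k * x[i] + y[i];
}
```

For example, with `a = {1, 2, 3}` and `b = {4, 5, 6}`, `dot(a, b, 3)` returns 32.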

Vector Processor Models

• Little can be gained from a vector processor when dealing with scalar operations, such as those shown on the previous slide, but vector (non-scalar) operations can take advantage of vector processors.

$C_i := A_i + B_i, \quad 1 \le i \le N$

On a SISD system we would code this as

for (i = 1; i <= N; i++)
    C[i] = A[i] + B[i];

Vector Processor Models

• Assuming two machine instructions for loop control and four machine instructions to implement the assignment statement (Read A, Read B, Add, Write C), the execution time is:

$6 \times N \times T$

where $T$ is the average instruction cycle time.

Vector Processor Models

• If memory could be accessed directly, without requiring loop control, there could be one instruction (add).

• The figure shows a four-stage add pipeline resulting in one add per cycle.

Vector Processor Model

• The pipeline execution time is $(4 + N - 1)T$.

• Therefore the speedup is:

$$S = \frac{6NT}{(4 + N - 1)T} = \frac{6N}{N + 3}$$
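A small C sketch of this comparison, using the instruction counts assumed in the slides (6 instructions per element for the scalar loop, a 4-stage add pipeline). The function names are hypothetical, and times are expressed in units of T.

```c
/* Scalar (SISD) loop: 6 instructions per element, in units of T. */
double scalar_time(int n) {
    return 6.0 * n;
}

/* 4-stage pipeline: 4 cycles until the first result,
   then one more result per cycle for the remaining n - 1. */
double pipeline_time(int n) {
    return 4.0 + n - 1.0;
}

/* Speedup S = 6N / (4 + N - 1). */
double speedup(int n) {
    return scalar_time(n) / pipeline_time(n);
}
```

For a single element, `speedup(1)` is 1.5 because the pipeline fill time dominates. As N grows the speedup approaches 6 without reaching it: `speedup(1000)` is about 5.98.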

Vector Processor Models

• We can generalize the previous vector model as follows:

Vector Processor Models

• Further improvements can be made:

Vector Processors

• Vector processors are supercomputers optimized for fast execution (the main criterion for design and implementation) of vectorizable scientific code that operates on large data sets.

• Vector processors are extensively pipelined to operate on array-oriented data. The CPU is highly pipelined and has a large set of registers. Memory is also pipelined and interleaved to match CPU demands.

Memory Design

• Note that if the pipe provides a result every $d$ cycles (i.e., $w = 1/d$), then memory must supply a pair of operands ($a_i$ and $b_i$) every $d$ cycles.

• Note that we need to fetch (read) two values and write a result simultaneously (within d cycles).

Memory Design

• If $d = 1$, then the memory system must have a bandwidth at least 3 times that of a conventional memory (two reads and one write per cycle). To meet memory bandwidth requirements, two approaches have been implemented in commercial machines:

1. Use of multiple independent memory modules.

2. Use of intermediate high-speed memory to:

– shorten the access cycle.

– use data several times between CPU and intermediate memory.

– provide for certain desirable patterns of data access (i.e., rows, columns, diagonals, etc.).

Memory Design

• Multiple memory modules: 3-port memory modules are used with a pipelined arithmetic unit.

• Only one port per module is active at one time but all 3 streams can be active simultaneously.

Memory Design

• Care must be taken when laying out data in memory modules; otherwise simultaneous access is denied, as is seen here.
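The layout issue can be illustrated with a rough C sketch. It assumes low-order interleaving (module number = word address mod number of modules), which the slides do not state, and it ignores pipeline timing: it only checks whether the same-index elements of the three streams A, B and C ever land in the same module. The names `module_of` and `conflict_free` are hypothetical.

```c
/* Assumed mapping: low-order interleaving across m modules. */
unsigned module_of(unsigned addr, unsigned m) {
    return addr % m;
}

/* Returns 1 if a[i], b[i] and c[i] fall in three different modules
   for every i in [0, n), and 0 as soon as two of them collide. */
int conflict_free(unsigned base_a, unsigned base_b, unsigned base_c,
                  unsigned n, unsigned m) {
    for (unsigned i = 0; i < n; i++) {
        unsigned ma = module_of(base_a + i, m);
        unsigned mb = module_of(base_b + i, m);
        unsigned mc = module_of(base_c + i, m);
        if (ma == mb || ma == mc || mb == mc)
            return 0;
    }
    return 1;
}
```

Under these assumptions, with 8 modules, arrays starting at word addresses 0, 8 and 16 always collide (`conflict_free(0, 8, 16, 8, 8)` returns 0), while staggering the bases to 0, 9 and 18 avoids the collision (`conflict_free(0, 9, 18, 8, 8)` returns 1).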

Memory Design

• The following reservation table (RT) shows the effect of 2-cycle memory access timing. Note the output conflict and the resultant delays.

Memory Design

Performance Evaluation

• Major characteristics affecting supercomputer performance:

– Clock speed

– Instruction issue rate

– Memory size

– Number of concurrent paths to memory

– Ability to fetch/store vectors efficiently

– Number of duplicate arithmetic functional units

– Chaining

– Indirect addressing capabilities

– Handling conditional blocks of code

Performance Evaluation

• High performance of vector architectures can be attributed to the following characteristics:

1. Pipelined functional units

2. Multiple functional units operating in parallel

3. Chaining of functional units

4. Large number of programmable registers

5. Block load/store capabilities with buffer registers

6. Multiprocessors operating in parallel in a coarse-grained parallel mode

7. Instruction buffers

Performance Evaluation

• Sustained computation rates (as opposed to peak computation rates obtained under ideal circumstances) depend on factors such as:

1. Level of vectorization (fraction of the code that is vectorizable)

2. Average vector length

3. Possibility of vector chaining

4. Possible overlap of scalar, vector, and memory load/store operations

5. Mechanisms to resolve memory contention

Performance Evaluation

• What is Amdahl’s Law?

Performance Evaluation

• Amdahl’s Law

• Given that the fraction of serial work in a given problem is small, say s, the maximum speedup obtainable from even an infinite number of parallel processors is only 1/s.

Performance Evaluation

• Ideally speedup is

$$S = \frac{\text{Serial Execution Time } (E_s)}{\text{Parallel Execution Time } (E_p)}$$

• Ideally parallel execution time is

$$E_p = \frac{\text{Serial Execution Time } (E_s)}{\text{Number of Processors } (P)}$$

• Speedup is then ideally $P$

Performance Evaluation

• Amdahl’s Law changes this speedup analysis to include the serial component that cannot be parallelized.

Performance Evaluation

• Let $P$ denote an application program and $T_{scalar}$ the time to execute $P$ in scalar mode (serial execution)

• $s$ is the maximum speedup

• Ideally the time to execute $P$ on the vector computer is $T_{scalar}/s$

Performance Evaluation

• The problem Amdahl pointed out is that there is always some fraction $f$ of $P$ that can be executed in parallel and some fraction $(1-f)$ that cannot.

• Therefore the actual parallel execution time is $T_{actual} = (1-f)\,T_{scalar} + f \cdot T_{scalar}/s$

Performance Evaluation

• The speedup now becomes

$$S = \frac{T_{scalar}}{T_{actual}} = \frac{T_{scalar}}{(1-f)\,T_{scalar} + f \cdot T_{scalar}/s} = \frac{1}{(1-f) + f/s}$$

Performance Evaluation

• So if $f = 1$ the speedup is $s$, the ideal speedup, and for $f = 0$ the speedup is 1.

$$S = \frac{1}{(1-f) + f/s}$$
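The formula is easy to evaluate directly. The following C sketch uses a hypothetical function name, `amdahl_speedup`, with `f` the parallelizable fraction and `s` the maximum speedup as defined on the preceding slides.

```c
/* Amdahl's Law: S = 1 / ((1 - f) + f / s)
   f: fraction of the program that can run in parallel (0..1)
   s: maximum speedup of that parallel fraction (s > 0) */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

With `s = 10`, a program that is 90% parallelizable reaches only about 5.26 (`amdahl_speedup(0.9, 10.0)`), and one that is 50% parallelizable reaches about 1.82. The serial fraction quickly dominates.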

Performance Evaluation

• For number of processors = 10

Performance Evaluation

• Time to execute loops can be used to estimate peak and sustained performance. Let

Performance Evaluation

• Then:

Programming Vector Processors

• The hardware structure that makes vector processors powerful also makes assembly code difficult to write.

Programming Vector Processors

• Programming tools:

– Languages: to express parallelism inherent in the algorithm

– Compilers: to recognize vectorizable code

– Combination of the above optimizes parallelism

Programming Vector Processors

• Vector pipelining is obviously one benefit that is exploited when executing a program.

Programming Vector Processors

• Chaining is another important characteristic of some vector processors.

• Chaining is the ability to activate additional independent functional units as soon as intermediate results are known.
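The benefit of chaining can be estimated with a simplified timing model. The sketch below is an assumption-laden illustration, not a description of any particular machine: it considers two dependent vector operations (for example a multiply feeding an add) on pipelines of depth `d1` and `d2`, uses the same "fill, then one result per cycle" model as the add pipeline earlier, and ignores memory access and instruction issue overhead. The function names are hypothetical.

```c
/* Without chaining: the second operation starts only after the
   first has produced all n results. */
long unchained_cycles(long d1, long d2, long n) {
    return (d1 + n - 1) + (d2 + n - 1);
}

/* With chaining: each result of the first unit is forwarded to the
   second as soon as it is available, so the two pipelines behave
   like one pipeline of depth d1 + d2. */
long chained_cycles(long d1, long d2, long n) {
    return d1 + d2 + n - 1;
}
```

Under this model, two 4-stage units processing 64-element vectors take 134 cycles unchained and 71 cycles chained.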

Chaining

Chaining

• Consider the following

Chaining

Simultaneous

Scalar Renaming

• How might this be improved?

Scalar Renaming

• This becomes this

• This renaming makes the code segments independent, allowing for better vectorization.
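As an illustration of scalar renaming, here is a hypothetical loop (not the code from the original slide) written both ways in C. In the first version one scalar, `x`, is reused for two unrelated values; in the second each value gets its own name.

```c
#include <stddef.h>

/* Before renaming: x carries two unrelated values, so the second
   half of the loop body is tied to the first through x. */
void reuse_scalar(const double *a, const double *b, const double *c,
                  double *y, double *z, size_t n) {
    double x;
    for (size_t i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
        x = c[i] * c[i];      /* overwrites the earlier x */
        z[i] = x + 1.0;
    }
}

/* After renaming: x1 and x2 are distinct, so the y and z
   computations no longer share any storage. */
void renamed_scalars(const double *a, const double *b, const double *c,
                     double *y, double *z, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double x1 = a[i] + b[i];
        y[i] = 2.0 * x1;
        double x2 = c[i] * c[i];
        z[i] = x2 + 1.0;
    }
}
```

Both functions compute the same `y` and `z`; only the dependences visible to the compiler differ.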

Scalar Expansion

• How might this be improved?

Scalar Expansion

• If scalar x is expanded into a vector the two statements become independent
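A minimal C sketch of scalar expansion, using a hypothetical loop rather than the one from the original slide. The scalar `x` becomes an array `xv` (supplied by the caller here, as an assumption about where the temporary lives), after which each statement can run as a whole-vector operation.

```c
#include <stddef.h>

/* Before: every iteration writes and then reads the same scalar x,
   which ties the iterations together. */
void with_scalar(const double *a, const double *b, double *y, size_t n) {
    double x;
    for (size_t i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
    }
}

/* After: x is expanded into the vector xv, so each statement
   becomes its own loop over all n elements. */
void with_expanded(const double *a, const double *b,
                   double *xv, double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        xv[i] = a[i] + b[i];
    for (size_t i = 0; i < n; i++)
        y[i] = 2.0 * xv[i];
}
```

The price is extra storage for `xv`, which is why the transformation pays off mainly when the temporary fits in a vector register.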

Loop Unrolling

• The loop becomes this
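For illustration, a vector add unrolled by a factor of four might look like the following C sketch. The loop and the unroll factor are assumptions chosen for the example, not taken from the original slide.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i], unrolled by four. */
void vadd_unrolled4(const double *a, const double *b, double *c, size_t n) {
    size_t i = 0;

    /* Main body: four independent adds per trip, so loop control
       is paid once per four elements instead of once per element. */
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder: the last n % 4 elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The remainder loop keeps the function correct when `n` is not a multiple of four.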

Loop fusion

• What about this?

Loop fusion

• Note that each loop would be equivalent to a vector instruction. X is stored back into memory by the first instruction and then retrieved by the second. If these loops are fused as follows, then memory traffic is reduced:

• What else might be done to improve this?
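A hedged C sketch of loop fusion, using a hypothetical pair of loops in which the second consumes the X produced by the first (the slide's own code is not reproduced here). In the fused version each X value is used while it is still in a register instead of being reloaded from memory.

```c
#include <stddef.h>

/* Separate loops: x is written to memory by the first loop
   and read back by the second. */
void two_loops(const double *a, const double *b, const double *c,
               double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] = a[i] + b[i];
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * c[i];
}

/* Fused loop: the sum is kept in the local t, stored once,
   and reused immediately without a second load of x[i]. */
void fused_loop(const double *a, const double *b, const double *c,
                double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double t = a[i] + b[i];
        x[i] = t;
        y[i] = t * c[i];
    }
}
```

If X is not needed after the loop, the store to `x[i]` could be dropped as well, removing that memory traffic entirely.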

Loop fusion

• Note that this is possible if there are enough registers available to retain X. If chaining is supported then the loop can be reduced to: