computer architecture vector architectures ola flygt växjö university [email protected] +46...

37
Computer Architecture Computer Architecture Vector Architectures Vector Architectures Ola Flygt Växjö University http://w3.msi.vxu.se/users/ofl/ [email protected] +46 470 70 86 49

Upload: zane-baulch

Post on 14-Jan-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Computer Computer ArchitectureArchitecture

Vector ArchitecturesVector ArchitecturesOla Flygt

Växjö Universityhttp://w3.msi.vxu.se/users/ofl/

[email protected]+46 470 70 86 49

Page 2: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Outline

IntroductionBasic priciplesSdSdExamples

Crayxcx

CH01

Page 3: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Scalar processing

4n clock cycles required to process n elements!

Time op

0 a0

4 a1

8 a2

… …

4n an

Page 4: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Pipelining

4n/(4+n) clock cycles required to process n elements!

Time

op0 op1 op2 op3

0 a0

1 a1 a0

2 a2 a1 a0

3 a3 a2 a1 a0

4 a4 a3 a2 a1

… … … … …

n an an-1 an-2 an-3

Page 5: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

PipelineBasic Principle

Stream of objects Number of objects = stream length n

Operation can be subdivided into sequence of steps Number of steps = pipeline length p

Advantage Speedup = pn/(p+n)

Stream length >> pipeline length Speedup approx.p

Speedup is limited by pipeline length!

Page 6: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Vector Operations

Operations on vectors of data (floating point numbers) Vector-vector

V1 <-V2 + V3 (component-wise sum) V1 <-- V2

Vector-scalar V1 <-c * V2

Vector-memory V <-A (vector load) A <-V (vector store)

Vector reduction c <-min(V) c <-sum(V) c <-V1 * V2 (dot product)

Page 7: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Vector Operations, cont.

Gather/scatter V1,V2 <-GATHER(A)

load all non-zero elements of A into V1 and their indices into V2

A <-SCATTER(V1,V2) store elements of V1 into A at indices denoted by V2 and fill

rest with zeros

Mask V1 <-MASK(V2,V3) store elements of V2 into V1 for which corresponding

position in V3 is non-zero

Page 8: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Example, Scalar Loop

approx. 6n clock cycles to execute loop.

Fortran loop:

DO I=1,N A(I) = A(I)+B(I)ENDDO

Scalar assembly code:

R0 <- NR1 <- IJMP J

L: R2 <- A(R1)R3 <- B(R1)R2 <- R2+R3A(R1) <- R2R1 <- R1+1

J: JLE R1, R0, L

Page 9: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Example, Vector Loop

4n clock cycles, because no loop iteration overhead (ignoring speedup by pipelining)

Fortran loop:

DO I=1,N A(I) = A(I)+B(I)ENDDO

Vectorized assembly code:

V1 <- AV2 <- BV3 <- V1+V2A <- V2

Page 10: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Chaining

Overlapping of vector instructions (see Hwang, Figure 8.18)

Hence: c+n ticks (for small c) Speedup approx.6 (c=16, n=128, s=(6*128)/(16+128)=5.33)

The longer the vector chain, the better the speedup! A <-B*C+D chaining degree 5

Vectorization speedups between 5 and 25

Page 11: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Vector Programming

How to generate vectorized code?

1.Assembly programming.2.Vectorized Libraries.3.High-level vector statements.4.Vectorizing compiler.

Page 12: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Vectorized Libraries

Predefined vector operations (partially implemented in assembly language) VECLIB, LINPACK, EISPACK, MINPACK

C = SSUM(100, A(1,2), 1, B(3,1), N)100 ...vector length

A(1,2) ...vector address A1 ...vector stride A

B(3,1) ...vector address BN ...vector stride B

Addition of matrix column to matrix row.

Page 13: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

High-Level Vector Statements

e.g. Fortran 90

INTEGER A(100), B(100), C(100), S A(1:100) = S*B(1:100)+C(1:100)

* Vector-vector operations. * Vector-scalar operations. * Vector reduction. * ...

Easy transformation into vector code.

Page 14: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Vectorizing Compiler 1. Fortran 77 DO Loop *

DO I=1, N D(I) = A(I)*B+C(I) ENDDO

2. Vectorization *

D(1:N) = A(1:N)*B+C(1:N)

3. Strip mining *

DO I=1, N/128 D(I:I+127) = A(I:I+127)*B + C(I:I+127) ENDDO IF ((N.MOD.128).NEQ.0) A((N/128)*128+1:N) = ... ENDIF

4. Code generation *

V0 <- V0*B ...

Related techniques for parallelizing compiler!

Page 15: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Vectorization

In which cases can loop be vectorized?

DO I = 1, N-1 A(I) = A(I+1)*B(I)ENDDO

| V

A(1:128) = A(2:129)*B(1:128)A(129:256) = A(130:257)*B(129:256)....

Vectorization preserves semantics.

Page 16: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Loop Vectorization

s semantics always preserved?

DO I = 2, N A(I) = A(I-1)*B(I)ENDDO

| V

A(2:129) = A(1:128)*B(2:129)A(130:257) = A(129:256)*B(130:257)....

Vectorization has changed semantics!

Page 17: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Vectorization Inhibitors

Vectorization must be conservative; when in doubt, loop must not be vectorized.

Vectorization is inhibited byFunction callsInput/output operationsGOTOs into or out of loopRecurrences (References to vector elements

modified in previous iterations)

Page 18: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Components of a vectorizing supercomputer

Page 19: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

The DS for floating-point precision

Page 20: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

The DS for integer precision

Page 21: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

How vectorization worksUn-vectorized computation

Page 22: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

How vectorization worksvectorized computation

Page 23: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

How vectorization speeds up computation

Page 24: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Speed improvementsNon-pipelined computation

Page 25: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Speed improvementspipelined computation

Page 26: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Increasing the granularity of a pipelineRepetition governed by slowest

component

Page 27: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Increasing the granularity of a pipelineGranularity increased to improve

repetition

Page 28: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Parallel computation of floating point and integer

results

Page 29: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Mixed functional and data parallelism

Page 30: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

The DS for parallel computational

functionality

Page 31: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Performance of four generations of Cray

systems

Page 32: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Communication between CPUs and memory

Page 33: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

The increasing complexity in Cray systems

Page 34: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Integration density

Page 35: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

Convex C4/XA system

Page 36: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

The configuration of the crossbar switch

Page 37: Computer Architecture Vector Architectures Ola Flygt Växjö University  Ola.Flygt@msi.vxu.se +46 470 70 86 49

The processor configuration