Chapter One: Introduction to Pipelined Processors


Superscalar Processors

• Scalar processors issue one instruction per cycle.
• Superscalar processors use multiple instruction pipelines.
• Purpose: to exploit more instruction-level parallelism in user programs.
• Only independent instructions can be executed in parallel.

Superscalar Processors

• The fundamental structure of an m-issue superscalar processor (m = 3) is as shown in the figure.

Superscalar Processors

• Here, the instruction-decoding and execution resources are increased.
• Example: a dual-pipeline superscalar processor.

Superscalar Processor - Example

• Can issue two instructions per cycle.
• There are two pipelines, each with four processing stages: fetch, decode, execute, and store.
• The two instruction streams come from a single I-cache.
• Assume each stage requires one cycle, except the execution stage.

Superscalar Processor - Example

• The four functional units of the execution stage are:

  Functional Unit    Number of stages
  Adder              2
  Multiplier         3
  Logic              1
  Load               1

• Functional units are shared on a dynamic basis.
• Look-ahead window: used for out-of-order instruction issue.

Superscalar Performance

• The time required by the scalar base machine (k stages, N instructions) is

  T(1,1) = k + N - 1

• The ideal execution time required by an m-issue superscalar machine is

  T(m,1) = k + (N - m)/m

  where k is the time required to execute the first m instructions and (N - m)/m is the time required to execute the remaining N - m instructions.

Superscalar Performance

• The ideal speedup of the superscalar machine is

  S(m,1) = T(1,1) / T(m,1) = m(N + k - 1) / (N + m(k - 1))

• As N → ∞, the speedup S(m,1) → m.
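As a quick sanity check, the timing and speedup formulas above can be evaluated numerically. A minimal Python sketch (the function names are illustrative, not from the slides):

```python
def scalar_time(k, N):
    # T(1,1) = k + N - 1: base scalar machine, k stages, N instructions
    return k + N - 1

def superscalar_time(k, N, m):
    # T(m,1) = k + (N - m)/m: ideal m-issue superscalar machine
    return k + (N - m) / m

def speedup(k, N, m):
    # S(m,1) = T(1,1) / T(m,1)
    return scalar_time(k, N) / superscalar_time(k, N, m)

# with k = 4 stages and m = 3 issue slots, the speedup climbs toward m = 3
for N in (10, 1_000, 1_000_000):
    print(N, round(speedup(4, N, 3), 3))
```

For small N the pipeline fill time k dominates and the speedup is well below m; only as N grows does S(m,1) approach its limit of m.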

Superpipeline Processors

• In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.

Superpipeline Performance

• The time to execute N instructions on a superpipelined machine of degree n with k stages is

  T(1,n) = k + (N - 1)/n

• The speedup is given by

  S(1,n) = T(1,1) / T(1,n) = n(k + N - 1) / (nk + N - 1)

• As N → ∞, S(1,n) → n.

Superpipelined Superscalar Processors

• This machine executes m instructions every cycle with a pipeline cycle time 1/n of the base cycle.

Superpipelined Superscalar Performance

• The time taken to execute N independent instructions on a superpipelined superscalar machine of degree (m,n) is

  T(m,n) = k + (N - m)/(mn)

• The speedup over the base machine is

  S(m,n) = T(1,1) / T(m,n) = mn(k + N - 1) / (mnk + N - m)

• As N → ∞, S(m,n) → mn.
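The (m,n) formulas subsume the two earlier cases: n = 1 recovers the superscalar result and m = 1 the superpipelined one. A small sketch, with illustrative names:

```python
def base_time(k, N):
    # T(1,1) = k + N - 1
    return k + N - 1

def super_time(k, N, m, n):
    # T(m,n) = k + (N - m)/(mn); n = 1 gives T(m,1), m = 1 gives T(1,n)
    return k + (N - m) / (m * n)

def speedup(k, N, m, n):
    # S(m,n) = T(1,1) / T(m,n) = mn(k + N - 1) / (mnk + N - m)
    return base_time(k, N) / super_time(k, N, m, n)

# for k = 4, m = 3, n = 2 the speedup approaches mn = 6 as N grows
print(round(speedup(4, 1_000_000, 3, 2), 3))
```

Setting m = n = 1 makes super_time collapse to base_time, which is a handy consistency check on the formula.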

Superscalar Processors

• Rely on spatial parallelism: multiple operations running concurrently on separate hardware.
• Achieved by duplicating hardware resources such as execution units and register-file ports.
• Require more transistors.

Superpipelined Processors

• Rely on temporal parallelism: overlapping multiple operations on common hardware.
• Achieved through more deeply pipelined execution units with faster clock cycles.
• Require faster transistors.

Systolic Architecture

• Conventional architectures operate on load and store operations from memory.
• This requires more memory references, which slows down the system, as shown below:

Systolic Architecture

• In systolic processing, the data to be processed flows through various operation stages and is finally placed in memory, as shown below:

Systolic Architecture

• The basic architecture consists of processing elements (PEs) that are simple and identical in behavior at all instants.
• Each PE may have some registers and an ALU.
• PEs are interlinked in a manner dictated by the requirements of the specific algorithm, e.g. a 2D mesh, hexagonal arrays, etc.

Systolic Architecture

• PEs at the boundary of the structure are connected to memory.
• Data picked up from memory circulates among the PEs that require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.
• Example: multiplication of two n x n matrices.

Example: Multiplication of two n x n matrices

• Every element of the input is picked up n times from memory, as it contributes to n elements of the output.
• To reduce this memory traffic, the systolic architecture ensures that each element is fetched only once.
• Consider an example where n = 3.

Matrix Multiplication

  a11 a12 a13     b11 b12 b13     c11 c12 c13
  a21 a22 a23  *  b21 b22 b23  =  c21 c22 c23
  a31 a32 a33     b31 b32 b33     c31 c32 c33

Conventional Method: O(n³)

  For I = 1 to N
    For J = 1 to N
      For K = 1 to N
        C[I,J] = C[I,J] + A[I,K] * B[K,J]
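The triple loop translates directly to Python. Note that the inner product indexes A by the row I and the summation index K (A[I,K], not A[J,K]). A minimal sketch, using the 3 x 3 matrix from the worked example later in the chapter:

```python
def matmul(A, B):
    # conventional O(n^3) method: n multiply-accumulates per output element
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4, 2],
     [2, 5, 3],
     [3, 2, 5]]
print(matmul(A, A))   # the chapter's example multiplies A by itself
```

Every element of A and B is read n times here, which is exactly the memory traffic the systolic method is designed to avoid.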

Systolic Method

This will run in O(n) time!

To run in O(n) time we need n x n processing units; in our example n = 3, so 9 PEs are needed.

  P9 P8 P7
  P6 P5 P4
  P1 P2 P3

For systolic processing, the input data needs to be modified as follows:

  Flip columns 1 & 3 of A:     Flip rows 1 & 3 of B:

  a13 a12 a11                  b31 b32 b33
  a23 a22 a21                  b21 b22 b23
  a33 a32 a31                  b11 b12 b13

and finally stagger the data sets for input.
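The staggering step can be sketched in Python. Here position t of each returned stream is the value entering the array at tick t, so the streams are written head-first; the slide's left-right and top-bottom flips depict the same thing with the earliest element drawn nearest the array. The helper name stagger and the None padding are illustrative:

```python
def stagger(seqs, pad=None):
    # delay stream i by i clock ticks so that stream i lags
    # stream i-1 by exactly one tick; pad the gaps with `pad`
    n = len(seqs)
    width = len(seqs[0]) + n - 1
    return [[pad] * i + list(s) + [pad] * (width - i - len(s))
            for i, s in enumerate(seqs)]

A = [[3, 4, 2],
     [2, 5, 3],
     [3, 2, 5]]
# row i of A feeds PE row i, a_i1 first
a_streams = stagger(A)
# column j of B (here B = A) feeds PE column j, b_1j first
b_streams = stagger([list(col) for col in zip(*A)])
print(a_streams[1])   # row 2's stream, delayed by one tick
```

The skewing guarantees that the operand pair destined for a given PE arrives there on the same clock tick.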

At every tick of the global system clock, data is passed to each processor from two different directions; the two values are multiplied, and the product is accumulated in the processor's register.

[Figure: the reversed rows of A and reversed columns of B, staggered by one tick each, stream into the 3 x 3 PE array (P1-P9) from two directions.]

  3 4 2     3 4 2     23 36 28
  2 5 3  *  2 5 3  =  25 39 34
  3 2 5     3 2 5     28 32 37

Using a systolic array.

Feeding the staggered operands in, each PE multiplies the pair of values arriving at it and adds the product to its register. The accumulated PE values after each clock tick are:

  Tick   P1   P2   P3   P4   P5   P6   P7   P8   P9
   1      9    0    0    0    0    0    0    0    0
   2     17   12    0    6    0    0    0    0    0
   3     23   32    6   16    8    0    9    0    0
   4     23   36   18   25   33    4   13   12    0
   5     23   36   28   25   39   19   28   22    6
   6     23   36   28   25   39   34   28   32   12
   7     23   36   28   25   39   34   28   32   37

After 7 = 3n - 2 ticks the PEs hold the complete product (P1-P3 hold c11-c13, P4-P6 hold c21-c23, P7-P9 hold c31-c33):

  23 36 28
  25 39 34
  28 32 37
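The tick-by-tick computation can be reproduced with a short simulation. This is a sketch: the index arithmetic t = i + j + k stands in for the physical passing of operands between neighbouring PEs, since the staggered value A[i][k] and B[k][j] reach PE(i,j) exactly at that tick:

```python
def systolic_matmul(A, B):
    # PE(i, j) accumulates c_ij; operands A[i][k] and B[k][j] meet at
    # PE(i, j) at tick t = i + j + k, because each value enters staggered
    # and then travels one PE per clock tick
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # the array finishes in 3n - 2 ticks
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:          # operand pair present at this PE now
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4, 2],
     [2, 5, 3],
     [3, 2, 5]]
print(systolic_matmul(A, A))   # matches the result accumulated above
```

Each (i, j, k) product is formed exactly once, on tick i + j + k, so the loop over 3n - 2 ticks performs the same n³ multiply-accumulates as the conventional method, but spread across n² PEs in O(n) time.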

Samba: Systolic Accelerator for Molecular Biological Applications

This systolic array contains 128 processors spread across 32 full-custom VLSI chips. Each chip houses 4 processors, and each processor computes 10 million matrix cells per second.