array processors ch4

Upload: arshad-matin

Post on 07-Apr-2018

  • 8/3/2019 Array Processors CH4

    ARRAY PROCESSORS

    SIMD Computer Organization

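    SIMD Computer Organization:

    Configuration I

    [Figure: Configuration I. Components shown: I/O for instructions and data, CU memory, the control unit (CU), a block labeled CP, processing elements PE0, PE1, ..., PEN-1 with PE memories PEM0, PEM1, ..., PEMN-1, an interconnection network, control lines and a data bus.]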

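    Configuration II

    [Figure: Configuration II. Components shown: I/O, CU memory, the control unit (CU), processing elements PE0, PE1, ..., PEN-1, an alignment network, memory modules M0, M1, ..., Mp-1, and a data bus.]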

    Formally, an SIMD computer is characterized by the following set of parameters:

    C = (N, F, I, M)

    N: Number of PEs in the system. Illiac IV: 64; MPP (Massively Parallel Processor): 16,384.

    F: Set of data-routing functions provided by the interconnection network. Examples: mesh, star, Omega and butterfly.

    I: Set of instructions for scalar, vector, data-routing and network-manipulation operations.

    M: Set of masking schemes, where each mask partitions the PEs into two disjoint subsets: enabled PEs and disabled PEs.

    Components in a PE:

    Ai, Bi and Ci are general-purpose registers.

    Only the contents of Ri are transferred to other PEs during data transfer. If there are N = 2^m PEs (m = number of bits required to identify a PE), then Di holds the m-bit address of the destination PE.

    Each PEi is either active or inactive during an instruction cycle: Si = 1 means active, Si = 0 means inactive.

    [Figure: PEi, containing an ALU and the registers Ai, Bi, Ci (general purpose), Si (status register), Ii (index register), Di (destination register) and Ri (data-routing register).]

    Necessity of Data Routing:

    Consider an array:

    $A = (A_0, A_1, \ldots, A_{n-1})$

    and the partial sums

    $S(k) = \sum_{i=0}^{k} A_i, \quad k = 0, 1, \ldots, n-1,$

    so that S(n-1) is the sum of all n elements.

    For n = 8 with N = 8 PEs, all eight sums are formed in log2 N = 3 steps.

    On an SISD machine, the same computation takes 8 iterations of the loop (one per element):

    sum = sum + A[i]

    Making this faster on a sequential machine would mean hard-wiring the loop into hardware, yet the same computation takes only 3 steps on an SIMD machine.

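    SIMD (A = (1, 2, ..., 8), one element per PE):

    PE    Initial   Step 1        Step 2         Step 3
    PE0   1         1             1              1              = S(0)
    PE1   2         1 + 2 = 3     3              3              = S(1)
    PE2   3         2 + 3 = 5     1 + 5 = 6      6              = S(2)
    PE3   4         3 + 4 = 7     3 + 7 = 10     10             = S(3)
    PE4   5         4 + 5 = 9     5 + 9 = 14     1 + 14 = 15    = S(4)
    PE5   6         5 + 6 = 11    7 + 11 = 18    3 + 18 = 21    = S(5)
    PE6   7         6 + 7 = 13    9 + 13 = 22    6 + 22 = 28    = S(6)
    PE7   8         7 + 8 = 15    11 + 15 = 26   10 + 26 = 36   = S(7)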

    Algorithm:

    Step 1:
      Ai → Ri,       i = 0 to 6   (PEi transfers its data Ai into Ri)
      Ri → Ri+1,     i = 0 to 6
      Ai + Ri → Ai,  i = 1 to 7

    Step 2:
      Ai → Ri,       i = 0 to 5
      Ri → Ri+2,     i = 0 to 5
      Ai + Ri → Ai,  i = 2 to 7

    Step 3:
      Ai → Ri,       i = 0 to 3
      Ri → Ri+4,     i = 0 to 3
      Ai + Ri → Ai,  i = 4 to 7
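The three steps above can be simulated in a few lines of Python. This is a minimal sketch, not actual machine code. It assumes one element per PE and models each PE's A and R registers as list entries. Masking is applied by restricting the loop ranges. The function name `recursive_doubling` is hypothetical.

```python
def recursive_doubling(values):
    """Simulate the SIMD partial-sum algorithm, one element per PE.

    Returns the final A registers and the number of routing/addition steps.
    """
    a = list(values)                 # A_i held by PE_i
    n = len(a)
    distance, steps = 1, 0
    while distance < n:
        r = [0] * n                  # R_i: data-routing registers
        # Ai -> Ri, then Ri -> R(i+distance); PEs n-distance .. n-1 are masked off.
        for i in range(n - distance):
            r[i + distance] = a[i]
        # Ai + Ri -> Ai; PEs 0 .. distance-1 are masked off.
        for i in range(distance, n):
            a[i] += r[i]
        distance *= 2
        steps += 1
    return a, steps


print(recursive_doubling(range(1, 9)))
```

Calling `recursive_doubling(range(1, 9))` reproduces the table above. The final A registers hold S(0) to S(7), and three steps are used. The routing distance doubles on each step (1, 2, 4), which is why N elements need log2 N steps.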

    Masking scheme during data routing:

    Step 1: PE7 is disabled.
    Step 2: PE6 and PE7 are disabled.
    Step 3: PE4 to PE7 are disabled.

    During addition:

    Step 1: PE0 is not involved.
    Step 2: PE0 and PE1 are not involved.
    Step 3: PE0 to PE3 are not involved.
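The masks above follow a simple pattern. At the step with routing distance d, the last d PEs have no destination, and the first d PEs receive nothing to add. The sketch below derives both masks for any number of PEs. The generator name `masks` is hypothetical. Each mask is written as a bit string with PE0 leftmost, where '1' means enabled.

```python
def masks(n_pes):
    """Yield (step, routing_mask, addition_mask) for the recursive-doubling sum.

    Masks are bit strings with PE0 leftmost; '1' = enabled, '0' = disabled.
    """
    step, distance = 1, 1
    while distance < n_pes:
        # Routing: PEs whose destination i + distance would fall off the array are masked.
        routing = "".join("1" if i < n_pes - distance else "0" for i in range(n_pes))
        # Addition: PEs 0 .. distance-1 receive no data, so they sit out.
        addition = "".join("1" if i >= distance else "0" for i in range(n_pes))
        yield step, routing, addition
        step += 1
        distance *= 2


for step, routing, addition in masks(8):
    print(f"Step {step}: routing {routing}  addition {addition}")
```

For N = 8 this prints routing masks 11111110, 11111100 and 11110000, and addition masks 01111111, 00111111 and 00001111. These match the disabled PEs listed above.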

    SIMD Interconnection Networks

    Interconnection networks
      Static networks: 1-D, 2-D, 3-D and hypercube
      Dynamic networks
        Bus-based
        Switch-based: single-stage, multistage, crossbar

    Static Networks (1-D): Linear Array

    n nodes connected by n-1 links.

    Internal nodes have degree 2; end nodes have degree 1.

    Diameter = n-1

    2-D: Ring Network

    Like a linear array, but the two end nodes are connected by an nth link.

    It can be unidirectional or bidirectional.

    Node degree (d) = 2

    Diameter (D): unidirectional = n-1; bidirectional = n/2 (⌊n/2⌋ when n is odd)

    Star Network

    Internal (centre) node degree = n-1

    External nodes have degree 1

    Network diameter = 2
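The degree and diameter figures for the linear array, ring and star can be checked by breadth-first search. This is a minimal sketch. It assumes nodes are numbered 0 to n-1 and, for the star, that node 0 is the centre. All function names are hypothetical.

```python
from collections import deque


def diameter(adj):
    """Largest BFS distance (in links) between any ordered pair of nodes."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def degrees(adj):
    """Distinct node degrees present in the network, smallest first."""
    return sorted({len(set(nbrs)) for nbrs in adj.values()})


def linear_array(n):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def ring(n, bidirectional=True):
    if bidirectional:
        return {i: [(i + 1) % n, (i - 1) % n] for i in range(n)}
    return {i: [(i + 1) % n] for i in range(n)}   # links point one way only


def star(n):
    adj = {0: list(range(1, n))}                  # node 0 is the centre
    adj.update({i: [0] for i in range(1, n)})     # every other node links only to it
    return adj
```

For n = 8 the results match the formulas above:

- `diameter(linear_array(8))` is 7.
- `diameter(ring(8))` is 4.
- `diameter(ring(8, bidirectional=False))` is 7.
- `diameter(star(8))` is 2, with node degrees 1 and 7.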

    Static Interconnection Networks (2-D): An n-dimensional mesh [torus, or wraparound mesh] is an extension of the linear array [ring]. Degree: 2-4. Example: Intel Paragon (2-D mesh).

    Mesh Interconnection Network (ILLIAC IV network)

     0  1  2  3
     4  5  6  7
     8  9 10 11
    12 13 14 15

    [Figure: 16 PEs in a 4 × 4 mesh; the wraparound links between opposite edges are labeled a to h.]

    It is a chordal ring network.

    In the Illiac IV, each processor i was connected to processors {i+1, i-1, i+4, i-4} (mod 16) (example: N = 16).

    The routing functions are:

    $R_{+1}(i) = (i + 1) \bmod N$

    $R_{-1}(i) = (i - 1) \bmod N$

    $R_{+r}(i) = (i + r) \bmod N$

    $R_{-r}(i) = (i - r) \bmod N$

    where $r = \sqrt{N}$ and $0 \le i \le N - 1$.

    The diameter of an Illiac IV mesh is $\sqrt{N} - 1$.
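The four Illiac IV routing functions define each PE's neighbours, so the √N - 1 diameter can be confirmed by breadth-first search. This is a sketch that assumes N is a perfect square. The names `illiac_neighbors` and `illiac_diameter` are hypothetical.

```python
import math
from collections import deque


def illiac_neighbors(i, n_pes):
    """R+1, R-1, R+r and R-r applied to PE i, with r = sqrt(N)."""
    r = math.isqrt(n_pes)
    return [(i + 1) % n_pes, (i - 1) % n_pes, (i + r) % n_pes, (i - r) % n_pes]


def illiac_diameter(n_pes):
    """Longest shortest path; BFS from PE0 suffices because every PE looks the same."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in illiac_neighbors(u, n_pes):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


print(illiac_diameter(16), illiac_diameter(64))
```

For N = 16 the search gives 3. For N = 64, the Illiac IV size, it gives 7. Both equal √N - 1.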

    The chordal ring network is the ILLIAC IV network, also called a partially connected network.

    [Figure: nodes 0 to 15 arranged on a ring, with chords added.]

    Adding links (e.g., chords in a circle) increases the node degree and reduces the network diameter.

    Barrel Shifting Network:

    A barrel shifter is sometimes called a plus-minus-$2^i$ network.

    Routing functions:

    $B_{+i}(j) = (j + 2^i) \bmod N$

    $B_{-i}(j) = (j - 2^i) \bmod N$

    where $0 \le j \le N - 1$ and $0 \le i < \log_2 N$. For N = 16 these are B+0, B-0, B+1, B-1, B+2, B-2, B+3 and B-3.

    In general, the diameter of a barrel shifter is $(\log_2 N)/2$.
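A breadth-first search over the B+i and B-i links is a quick way to check the barrel-shifter diameter for small N. This is a sketch that assumes N is a power of two. The helper names are hypothetical.

```python
from collections import deque


def barrel_neighbors(j, n_nodes):
    """Targets of B+i and B-i from node j, for every 0 <= i < log2 N."""
    k = n_nodes.bit_length() - 1          # log2 N, assuming N is a power of two
    out = set()
    for i in range(k):
        out.add((j + 2 ** i) % n_nodes)   # B+i
        out.add((j - 2 ** i) % n_nodes)   # B-i
    return out


def barrel_diameter(n_nodes):
    """BFS from node 0; shifting every label by a constant maps the network onto itself."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in barrel_neighbors(u, n_nodes):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


print(sorted(barrel_neighbors(0, 16)), barrel_diameter(16))
```

For N = 16 each node has 7 distinct neighbours, because B+3 and B-3 coincide, and the diameter is 2 = (log2 16)/2. For N = 8 (log2 N = 3) the search also gives 2. So in that case the value 1.5 has to be rounded up.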

    How is the barrel shifting network an enhancement over the mesh interconnection network? (N = 16; permutations in cycle notation)

    In the mesh network:

    R+1 = (0 1 2 3 ... 15)        R-1 = (15 14 13 ... 0)
    R+4 = (0 4 8 12)(1 5 9 13)(2 6 10 14)(3 7 11 15)
    R-4 = (15 11 7 3)(14 10 6 2)(13 9 5 1)(12 8 4 0)

    In the barrel shifting network:

    B+0 = (0 1 2 3 ... 15)        B-0 = (15 14 13 ... 0)
    B+1 = (0 2 4 6 8 ... 14)(1 3 5 ... 15)        B-1 = (15 13 11 ... 1)(14 12 ... 0)
    B+2 = (0 4 8 12)(1 5 9 13)(2 6 10 14)(3 7 11 15)
    B-2 = (15 11 7 3)(14 10 6 2)(13 9 5 1)(12 8 4 0)
    B+3 = (0 8)(1 9)(2 10)(3 11)(4 12)(5 13)(6 14)(7 15)
    B-3 = (15 7)(14 6)(13 5)(12 4)(11 3)(10 2)(9 1)(8 0)
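The cycle lists above can be generated mechanically. A shift by s positions (mod N) is a permutation, and following each element until it returns gives its cycle. This is a minimal sketch with hypothetical helper names. It starts each cycle at its smallest element, so R-4 comes out as (0 12 8 4)..., which is the same cycle as (12 8 4 0) written from a different starting point.

```python
def routing_cycles(shift, n=16):
    """Cycle decomposition of the permutation j -> (j + shift) mod n."""
    perm = [(j + shift) % n for j in range(n)]
    seen = [False] * n
    result = []
    for start in range(n):
        if not seen[start]:
            cycle, j = [], start
            while not seen[j]:          # follow j -> perm[j] until the cycle closes
                seen[j] = True
                cycle.append(j)
                j = perm[j]
            result.append(tuple(cycle))
    return result


def fmt(cycles):
    """Write cycles in the slide's notation, e.g. (0 4 8 12)(1 5 9 13)."""
    return "".join("(" + " ".join(map(str, c)) + ")" for c in cycles)


print("R+4 = B+2 =", fmt(routing_cycles(4)))
print("B+3 =", fmt(routing_cycles(8)))
```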

    Contd.

    Comparing the two networks:

    B+0 = R+1    B-0 = R-1
    B+2 = R+4    B-2 = R-4

    which means, in general,

    $B_{+n/2} = R_{+r}$ and $B_{-n/2} = R_{-r}$, where $n = \log_2 N$ and $r = \sqrt{N}$.
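The general identity follows from the two sets of routing functions once n = log2 N is even, because the barrel-shifter step 2^{n/2} is then exactly r = √N:

```latex
B_{\pm n/2}(j) = \left(j \pm 2^{n/2}\right) \bmod N,
\qquad 2^{n/2} = 2^{(\log_2 N)/2} = \sqrt{N} = r
\;\Longrightarrow\;
B_{\pm n/2}(j) = (j \pm r) \bmod N = R_{\pm r}(j).
```

For N = 16 we have n = 4 and 2^2 = 4 = r, which gives the observed B+2 = R+4. Likewise B±0 = R±1 because 2^0 = 1.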

    3-D: Fully Connected

    [Figure: six nodes, numbered 1 to 6, with every pair of nodes linked.]

    In the limit, we obtain a fully connected network, with a node degree of n-1 and a diameter of 1.

    3-D Networks: 3-Cube

    [Figure: 3-cube with nodes labeled 000, 001, 010, 011, 100, 101, 110 and 111.]

    A hypercube is a generalized cube. In a hypercube there are $2^n$ nodes, for some n. Each node is connected to all other nodes whose numbers differ from it in exactly one bit position. The node degree of an n-cube equals n, and so does the network diameter.
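The one-bit-difference rule translates directly into an XOR with each power of two. A breadth-first search then confirms that the diameter equals n. This is a minimal sketch with hypothetical function names.

```python
from collections import deque


def hypercube_neighbors(node, n):
    """Labels that differ from `node` in exactly one of its n bits."""
    return [node ^ (1 << b) for b in range(n)]


def hypercube_distances(n):
    """BFS distances from node 0; every node of an n-cube looks the same."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in hypercube_neighbors(u, n):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


for node in range(8):
    print(format(node, "03b"), [format(v, "03b") for v in hypercube_neighbors(node, 3)])
```

For n = 3 and n = 4 the largest distance equals n. Each node's distance from node 0 is the number of 1 bits in its label.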

    3-Cube-Connected Cycles (CCC) Network:

    It is obtained from the 3-cube network. The idea is to cut off each corner node of the 3-cube and replace it by a cycle of 3 nodes.

    In general, a k-CCC network is constructed from a k-cube by replacing each of its $n = 2^k$ corner nodes with a cycle of k nodes.

    Hence a 3-cube can be transformed into a 3-CCC with $k \times 2^k = 3 \times 2^3 = 24$ nodes.
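The construction can be sketched in Python. This assumes the standard cube-connected-cycles labelling: node (c, p) is the p-th node on the cycle that replaces corner c, and it keeps that corner's cube link along dimension p. The function name `ccc` is hypothetical.

```python
def ccc(k):
    """Cube-connected cycles built from a k-cube, as an adjacency dict."""
    adj = {}
    for corner in range(2 ** k):
        for p in range(k):
            adj[(corner, p)] = {
                (corner, (p + 1) % k),     # next node on the same cycle
                (corner, (p - 1) % k),     # previous node on the same cycle
                (corner ^ (1 << p), p),    # former cube edge along dimension p
            }
    return adj


g = ccc(3)
print(len(g), {len(nbrs) for nbrs in g.values()})
```

For k = 3 this yields the 24 nodes computed above, each of degree 3.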

    4-D Hypercube

    A 4-D hypercube consists of two 3-D hypercubes with additional links connecting corresponding processors.
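The recursive construction can be checked directly:

1. Place one copy of the 3-cube on labels 0000 to 0111 and a second copy on 1000 to 1111.
2. Link each node u to u + 8.

The result is exactly the set of one-bit-difference edges of the 4-cube. This is a minimal sketch with hypothetical function names.

```python
def hypercube_edges(n):
    """Edges of an n-cube: label pairs that differ in exactly one bit."""
    return {frozenset((u, u ^ (1 << b))) for u in range(2 ** n) for b in range(n)}


def join_two_cubes(n):
    """Two n-cubes plus links between corresponding processors."""
    offset = 2 ** n
    low = hypercube_edges(n)                                      # labels 0xxx
    high = {frozenset(v + offset for v in edge) for edge in low}  # labels 1xxx
    links = {frozenset((u, u + offset)) for u in range(offset)}   # u <-> u + 2^n
    return low | high | links


assert join_two_cubes(3) == hypercube_edges(4)
print(len(hypercube_edges(3)), len(hypercube_edges(4)))
```

The edge counts confirm the picture: each 3-cube contributes 12 links and the joining step adds 8, for the 32 links of the 4-cube.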

    Multistage Interconnection Network: Switch Modules

    A × B switch module:

    A inputs and B outputs.
    In practice, A = B = a power of 2.
    Each input is connected to one or more outputs (conflicts must be avoided).
    One-to-one (permutation) and one-to-many connections are allowed.

    Multistage Interconnection Network

    Switch Modules

    A 2 × 2 switch can be configured for:

    Straight-through

    Crossover

    Upper broadcast (upper input to both outputs)

    Lower broadcast (lower input to both outputs)

    Binary Switch

    2 × 2 switch: legitimate states = 4 (straight-through, crossover, upper broadcast, lower broadcast)
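This small model of the four legitimate states assumes that each output is driven by exactly one input. Under that assumption a state is just a choice of source input for each output. The names `STATES` and `switch` are hypothetical.

```python
from itertools import product

# Each state names the input (0 = upper, 1 = lower) driving (upper_out, lower_out).
STATES = {
    "straight-through": (0, 1),
    "crossover": (1, 0),
    "upper broadcast": (0, 0),
    "lower broadcast": (1, 1),
}


def switch(state, upper_in, lower_in):
    """Return (upper_out, lower_out) for a 2 x 2 switch in the given state."""
    inputs = (upper_in, lower_in)
    src_upper, src_lower = STATES[state]
    return inputs[src_upper], inputs[src_lower]


# Each output picks one of two inputs: 2 * 2 = 4 legitimate states, no more.
assert set(STATES.values()) == set(product((0, 1), repeat=2))
print(switch("crossover", "a", "b"))
```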

    Perfect-Shuffle Interconnection:

    This interconnection network is defined by the routing function

    $S(a_{n-1} a_{n-2} \cdots a_1 a_0)_2 = (a_{n-2} \cdots a_1 a_0 a_{n-1})_2$

    that is, a cyclic left rotation of the n-bit address.

    [Figure: perfect shuffle and inverse shuffle connections between nodes 0 to 7.]
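The routing function S is a one-bit left rotation of the address, and the inverse shuffle is the matching right rotation. This is a minimal sketch on n-bit integer addresses. The function names are hypothetical.

```python
def shuffle(x, n):
    """Perfect shuffle on n-bit addresses: rotate the bits left by one."""
    msb = (x >> (n - 1)) & 1
    return ((x << 1) & ((1 << n) - 1)) | msb


def inverse_shuffle(x, n):
    """Inverse shuffle: rotate the n-bit address right by one."""
    return (x >> 1) | ((x & 1) << (n - 1))


print([shuffle(x, 3) for x in range(8)])          # destination of each node 0..7
print([inverse_shuffle(x, 3) for x in range(8)])
```

For N = 8 the shuffle sends nodes 0 to 7 to 0, 2, 4, 6, 1, 3, 5, 7. The inverse shuffle sends them to 0, 4, 1, 5, 2, 6, 3, 7.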

    A shuffle network is not a complete interconnection network. This can be seen by looking at what happens as data is recirculated through the network.

    An exchange permutation can be added to a shuffle network to make it into a complete interconnection structure.

    [Figure: shuffle-exchange connections among nodes 0 to 7.]
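Recirculating data through the shuffle alone only rotates the address bits, so a node can reach only the rotations of its own label. The sketch below computes the nodes reachable from a start node for N = 8. It models the exchange as flipping the least significant address bit. All names are hypothetical.

```python
def shuffle(x, n):
    """Rotate the n-bit address x left by one position."""
    return ((x << 1) & ((1 << n) - 1)) | (x >> (n - 1))


def exchange(x, n):
    """Complement the least significant address bit (n is unused)."""
    return x ^ 1


def reachable(start, n, moves):
    """All nodes reachable from `start` by repeatedly applying any of `moves`."""
    seen, frontier = {start}, [start]
    while frontier:
        x = frontier.pop()
        for move in moves:
            y = move(x, n)
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen


print(sorted(reachable(1, 3, [shuffle])), sorted(reachable(1, 3, [shuffle, exchange])))
```

Starting from node 1, the shuffle alone reaches only {1, 2, 4}, and node 0 never leaves itself. Adding the exchange makes all eight nodes reachable.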

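    With a shuffle-exchange network, arbitrary cyclic shifts of an N-element array can be performed in log N steps. The exchange function complements the least significant address bit:

    $E(a_{n-1} \cdots a_1 a_0)_2 = (a_{n-1} \cdots a_1 \bar{a}_0)_2$

    Here is a diagram of a multistage Omega network for N = 8.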

    [Figure: multistage Omega network for N = 8. Inputs 0 to 7 pass through three stages, labeled Shuffle 1/Exch. 1, Shuffle 2/Exch. 2 and Shuffle 3/Exch. 3, to outputs 0 to 7.]

    Omega Network Features

    There are log p stages, each with p/2 switching elements, for a total of (p/2) log p switches.

    Simple routing algorithm: at each stage, look at the corresponding bit (starting with the MSB) of the source and destination addresses. If the bits are the same, the message passes straight through; otherwise, it is crossed over.
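The routing rule above can be written as destination-tag routing. The sketch below assumes the usual Omega wiring, in which every stage is a perfect shuffle followed by a column of 2 × 2 switches. A message leaves each switch on the output selected by the next destination bit. Two messages contend when they need the same link after the same stage. The function names are hypothetical.

```python
def omega_route(src, dst, n):
    """Destination-tag routing through an Omega network with N = 2**n ports.

    Returns one (link, setting) pair per stage: the link position the message
    occupies after that stage and whether its 2 x 2 switch is straight or crossed.
    """
    mask = (1 << n) - 1
    pos = src
    hops = []
    for stage in range(n):
        in_bit = (pos >> (n - 1)) & 1            # source bit n-1-stage, moved in by the shuffle
        out_bit = (dst >> (n - 1 - stage)) & 1   # destination bit for this stage (MSB first)
        pos = ((pos << 1) & mask) | out_bit      # shuffle, then the switch picks its output
        hops.append((pos, "straight" if in_bit == out_bit else "cross"))
    return hops


def blocked(requests, n):
    """True if two (src, dst) messages need the same link after the same stage."""
    used = set()
    for src, dst in requests:
        for stage, (link, _setting) in enumerate(omega_route(src, dst, n)):
            if (stage, link) in used:
                return True
            used.add((stage, link))
    return False


def switch_count(ports):
    """(p/2) * log2 p switching elements in total."""
    stages = ports.bit_length() - 1
    return (ports // 2) * stages


print(omega_route(0, 2, 3), blocked([(0, 2), (4, 3)], 3))
```

Some results for N = 8:

- The pair 0 → 2 and 4 → 3 collides at the first stage.
- The pair 0 → 0 and 4 → 7 can be routed together.
- `switch_count(8)` gives (8/2) × 3 = 12.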

    [Figure: path contention in an Omega network with inputs 0 to 7 and outputs 0 to 7.]

    Extra Problems

    Find the following for a 16 × 16 Omega network:

    a) Number of stages.
    b) Number of 2 × 2 switches needed in each stage.
    c) Draw a 16-input Omega network using 2 × 2 switches as building blocks.
    d) Show the switch settings for routing a message from node 1101 to node 0101 and from node 0111 to node 1001 simultaneously, and from node 2 to node 7 and from node 6 to node 4 simultaneously.
    e) Does blocking exist in the above two cases?

    Show the necessity of data-routing and masking mechanisms during the addition of 15 numbers. Assume each PE holds one element.

    Find the number of steps required to add 16 elements.

    Calculate the different routing functions.

    Show the routing and addition in each step.