array processors ch4
TRANSCRIPT
-
8/3/2019 Array Processors CH4
1/45
ARRAY PROCESSORS
SIMD Computer Organization
-
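SIMD Computer Organization: Configuration I
[Figure: block diagram of Configuration I. Labelled components: I/O; CU with CU memory and CP; instruction and data paths; control and data bus; processing elements PE0, PE1, ..., PEN-1, each with its own memory PEM0, PEM1, ..., PEMN-1; interconnection network.]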
-
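SIMD Computer Organization: Configuration II
[Figure: block diagram of Configuration II. Labelled components: I/O; CU with CU memory; data bus; processing elements PE0, PE1, ..., PEN-1; alignment network; memory modules M0, M1, ..., Mp-1.]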
-
Formally, an SIMD computer is characterized by the following set of parameters:
C = (N, F, I, M)
N: Number of PEs in the system. Illiac IV: 64; MPP (Massively Parallel Processor): 16,384.
F: Set of data routing functions provided by the interconnection network. Examples: mesh, star, Omega and butterfly.
I: Set of instructions for scalar, vector, data routing and network manipulation operations.
M: Set of masking schemes, where each mask partitions the PEs into two disjoint subsets of enabled PEs and disabled PEs.
-
Components in a PE:
Ai, Bi and Ci are general-purpose registers.
Only the contents of Ri are transferred to other PEs during data transfer.
If there are N = 2^m PEs (m = number of bits required to identify a PE), then Di holds the m-bit address of the destination PE.
Each PEi is either active or inactive during an instruction cycle:
Si = 1: active; Si = 0: inactive.
[Figure: components of PEi. Ai, Bi, Ci: general-purpose registers; Si: status register; Ii: index register; Di: destination register; Ri: data routing register; ALU.]
-
Necessity of Data Routing:
Consider an array:
A = (A0, A1, ..., An-1)
Now, for computing:
$S(n) = \sum_{i=0}^{n-1} A_i$
For n = 8 with N = 8 PEs, the addition is performed in log2 N steps, i.e. 3 steps.
In SISD, the same computation would take 8 steps (loop iterations) of:
sum = sum + A[i]
If you want to make it faster, you'll have to burn this loop into hardware steps.
But the same computation takes only 3 steps in SIMD!
-
SIMD: contents of Ai in each PEi (initial values 1 to 8)
i    Initial   Step 1      Step 2        Step 3        Result
0    1         1           1             1             S(0) = 1
1    2         1+2=3       3             3             S(1) = 3
2    3         2+3=5       1+5=6         6             S(2) = 6
3    4         3+4=7       3+7=10        10            S(3) = 10
4    5         4+5=9       5+9=14        1+14=15       S(4) = 15
5    6         5+6=11      7+11=18       3+18=21       S(5) = 21
6    7         6+7=13      9+13=22       6+22=28       S(6) = 28
7    8         7+8=15      11+15=26      10+26=36      S(7) = 36
-
Algorithm:
Step # 1: Ai transfers its data into Ri, which is routed one PE ahead.
Ai -> Ri, i = 0-6
Ri -> Ri+1, i = 0-6
Ai + Ri -> Ai, i = 1-7
Step # 2:
Ai -> Ri, i = 0-5
Ri -> Ri+2, i = 0-5
Ai + Ri -> Ai, i = 2-7
Step # 3:
Ai -> Ri, i = 0-3
Ri -> Ri+4, i = 0-3
Ai + Ri -> Ai, i = 4-7
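The three steps above can be simulated in a few lines. This is a minimal sketch in Python, not part of the original slides: the function name `simd_prefix_sum` and the list-based model of the registers Ai and Ri are assumptions made for illustration, and each inner loop stands in for what the PEs would do simultaneously.

```python
def simd_prefix_sum(values):
    """Simulate the log2(N)-step SIMD summation on N = len(values) PEs.

    A[i] models register Ai of PEi and R[i] models its data routing
    register Ri (hypothetical list-based model, one entry per PE).
    """
    n_pes = len(values)
    A = list(values)
    shift = 1
    while shift < n_pes:
        # Ai -> Ri, then Ri -> R(i+shift), for the enabled PEs i = 0 .. N-1-shift.
        R = [None] * n_pes
        for i in range(n_pes - shift):
            R[i + shift] = A[i]
        # Ai + Ri -> Ai, only in the PEs that received data (i = shift .. N-1).
        for i in range(shift, n_pes):
            A[i] = A[i] + R[i]
        shift *= 2
    return A


print(simd_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
# prints [1, 3, 6, 10, 15, 21, 28, 36]
```

The last entry is the full sum S(7) = 36, obtained after three routing-and-add steps instead of eight sequential additions.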
-
Masking Scheme:
During Data Routing:
Step # 1: PE7 is disabled.
Step # 2: PE6 and PE7 are disabled.
Step # 3: PE4 to PE7 are disabled.
During Addition:
Step # 1: PE0 is not involved.
Step # 2: PE0, PE1 are not involved.
Step # 3: PE0-PE3 are not involved.
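The masks listed above follow a simple pattern: in the step that routes data over a distance of 2^(s-1), the last 2^(s-1) PEs do not send and the first 2^(s-1) PEs do not add. A small sketch, assuming a hypothetical helper `masks_per_step` that is not part of the slides:

```python
def masks_per_step(n_pes):
    """List, per step, the PEs disabled for routing and the PEs idle during addition."""
    steps = []
    shift = 1  # routing distance: 1, 2, 4, ...
    while shift < n_pes:
        routing_disabled = list(range(n_pes - shift, n_pes))  # nothing to send to
        addition_idle = list(range(shift))                    # nothing received
        steps.append((routing_disabled, addition_idle))
        shift *= 2
    return steps


for step, (no_route, no_add) in enumerate(masks_per_step(8), start=1):
    print(f"Step {step}: routing disabled in PEs {no_route}, no addition in PEs {no_add}")
```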
-
SIMD Interconnection Networks
Interconnection networks
  Static networks: 1-D, 2-D, 3-D and hypercube
  Dynamic networks
    Bus based
    Switch based: single stage, multistage, crossbar
-
Static Networks (1D): Linear Array
n nodes connected by n-1 links.
Internal nodes have degree 2.
End nodes have degree 1.
Diameter = n-1
-
2D: Ring Network
Like a linear array, but the two end nodes are connected by an nth link.
Can be unidirectional or bidirectional.
Node degree (d) = 2
Diameter (D):
Unidirectional = n-1
Bidirectional = floor(n/2)
-
Star network
Internal node degree = n-1
External nodes have degree = 1
Network diameter = 2
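The degree and diameter figures quoted for the linear array, ring and star can be checked by building each network as an adjacency list and running a breadth-first search from every node. This is an illustrative sketch; the function names are assumptions, and node 0 is arbitrarily chosen as the centre of the star.

```python
from collections import deque


def linear_array(n):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def ring(n, bidirectional=True):
    if bidirectional:
        return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    return {i: [(i + 1) % n] for i in range(n)}


def star(n):
    adj = {0: list(range(1, n))}                 # node 0 is the centre
    adj.update({i: [0] for i in range(1, n)})
    return adj


def diameter(adj):
    """Longest shortest path, found by a BFS from every node."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


print(diameter(linear_array(8)), diameter(ring(8)),
      diameter(ring(8, bidirectional=False)), diameter(star(8)))
# prints 7 4 7 2
```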
-
8/3/2019 Array Processors CH4
15/45
Static Interconnection Networks (2D)
An n-dimensional mesh [torus, or wraparound mesh] is an extension of the linear array [ring].
Degree: 2-4
Examples: Intel Paragon (2D mesh)
-
Mesh interconnection network (ILLIAC IV N/W)
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
[Figure: wraparound connections of the 4 x 4 mesh, with matching end points labelled a to h.]
It is a chordal ring network.
-
In the Illiac IV network, each processor i is connected to four other processors. Example for N = 16:
{i+1, i-1, i+4, i-4} (mod 16).
Here are the routing functions:
$R_{+1}(i) = (i + 1) \bmod N$
$R_{-1}(i) = (i - 1) \bmod N$
$R_{+r}(i) = (i + r) \bmod N$
$R_{-r}(i) = (i - r) \bmod N$
where $r = \sqrt{N}$ and $0 \le i \le N-1$.
The diameter of an Illiac IV mesh is $\sqrt{N} - 1$.
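As a quick check of the routing functions and the quoted diameter, the sketch below applies the four functions and measures the longest shortest path with a breadth-first search. The helper names are assumptions, and N is assumed to be a perfect square.

```python
import math
from collections import deque


def illiac_neighbors(i, n_pes):
    """Apply R+1, R-1, R+r and R-r to PE i, with r = sqrt(N)."""
    r = math.isqrt(n_pes)
    return [(i + 1) % n_pes, (i - 1) % n_pes, (i + r) % n_pes, (i - r) % n_pes]


def illiac_diameter(n_pes):
    """BFS from PE 0; every PE sees the same network, so one source is enough."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in illiac_neighbors(u, n_pes):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


print(illiac_neighbors(0, 16))  # prints [1, 15, 4, 12]
print(illiac_diameter(16))      # prints 3
```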
-
The ILLIAC IV network is a chordal ring network, also called a partially connected network.
[Figure: ring of 16 nodes, numbered 0 to 15, with chords.]
By adding additional links (e.g. chords in a circle), the
node degree is increased, and the network diameter is
reduced.
-
Barrel shifting network:
A barrel shifter is sometimes called a plus-minus-$2^i$ network.
Routing functions:
$B_{+i}(j) = (j + 2^i) \bmod N$
$B_{-i}(j) = (j - 2^i) \bmod N$
where $0 \le j \le N-1$ and $0 \le i < \log_2 N$.
For N = 16, the routing functions are B+0, B-0, B+1, B-1, B+2, B-2, B+3 and B-3.
In general, the diameter of a barrel shifter is $(\log_2 N)/2$.
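The barrel shifter routing functions translate directly into code. The sketch below (helper names are assumptions; N is assumed to be a power of two) lists the PEs reachable in one step and measures the diameter for N = 16 by breadth-first search.

```python
from collections import deque


def barrel_neighbors(j, n_pes):
    """Apply B+i and B-i to PE j for every i with 0 <= i < log2(N)."""
    n_bits = n_pes.bit_length() - 1  # log2(N) when N is a power of two
    out = []
    for i in range(n_bits):
        out.append((j + 2**i) % n_pes)
        out.append((j - 2**i) % n_pes)
    return out


def barrel_diameter(n_pes):
    """BFS from PE 0; the network looks the same from every PE."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in barrel_neighbors(u, n_pes):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


print(sorted(set(barrel_neighbors(0, 16))))  # prints [1, 2, 4, 8, 12, 14, 15]
print(barrel_diameter(16))                   # prints 2
```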
-
How is the barrel shifter network an enhancement over the mesh interconnection network?
In the mesh network (N = 16):
R+1 = (0 1 2 3 ... 15)    R-1 = (15 14 13 ... 0)
R+4 = (0 4 8 12) (1 5 9 13) (2 6 10 14) (3 7 11 15)
R-4 = (15 11 7 3) (14 10 6 2) (13 9 5 1) (12 8 4 0)
In the barrel shifting network (N = 16):
B+0 = (0 1 2 3 ... 15)    B-0 = (15 14 13 ... 0)
B+1 = (0 2 4 6 8 ... 14) (1 3 5 ... 15)    B-1 = (15 13 11 ... 1) (14 12 ... 0)
B+2 = (0 4 8 12) (1 5 9 13) (2 6 10 14) (3 7 11 15)
B-2 = (15 11 7 3) (14 10 6 2) (13 9 5 1) (12 8 4 0)
B+3 = (0 8) (1 9) (2 10) (3 11) (4 12) (5 13) (6 14) (7 15)
B-3 = (15 7) (14 6) (13 5) (12 4) (11 3) (10 2) (9 1) (8 0)
-
Contd.
Comparing the two networks:
$B_{+0} = R_{+1}$, $B_{-0} = R_{-1}$
$B_{+2} = R_{+4}$, $B_{-2} = R_{-4}$
Which means, in general:
$B_{+n/2} = R_{+r}$, $B_{-n/2} = R_{-r}$, where $n = \log_2 N$ and $r = \sqrt{N}$.
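The cycle notation used in this comparison can be generated mechanically, which makes it easy to confirm the equalities for N = 16. This is a sketch under stated assumptions: `cycles`, `R` and `B` are hypothetical helper names, and each cycle is listed starting from its smallest node, so the minus-direction cycles appear in a different rotation than on the slide.

```python
def cycles(perm_fn, n_pes):
    """Cycle decomposition of a permutation of the PEs 0 .. n_pes-1."""
    seen, result = set(), []
    for start in range(n_pes):
        if start in seen:
            continue
        cycle, node = [], start
        while node not in seen:
            seen.add(node)
            cycle.append(node)
            node = perm_fn(node)
        result.append(tuple(cycle))
    return result


N = 16


def R(step):
    """Mesh routing function R(step), e.g. R(+4) or R(-1)."""
    return lambda i: (i + step) % N


def B(sign, i):
    """Barrel shifter routing function B(+i) or B(-i); sign is +1 or -1."""
    return lambda j: (j + sign * 2**i) % N


print(cycles(B(+1, 2), N))
# prints [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
print(cycles(B(+1, 2), N) == cycles(R(+4), N))  # prints True
```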
-
3D: Fully Connected
[Figure: six nodes, numbered 1 to 6, each linked to every other node.]
In the limit, we obtain a fully connected network, with a node degree of n-1 and a diameter of 1.
-
3D networks : 3- cube
[Figure: 3-cube with nodes labelled 000, 001, 010, 011, 100, 101, 110 and 111.]
A hypercube is a generalized cube.
In a hypercube, there are 2^n nodes, for some n.
Each node is connected to all other nodes whose numbers differ from it in only one bit position.
The node degree of an n-cube equals n, and so does the network diameter.
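Because neighbours differ in exactly one bit, a hypercube needs no stored adjacency list: flipping each bit gives the neighbours, and the number of differing bits gives the distance. A minimal sketch with assumed function names:

```python
def hypercube_neighbors(node, n):
    """Neighbours of a node in an n-cube: flip each of the n address bits."""
    return [node ^ (1 << bit) for bit in range(n)]


def hypercube_distance(a, b):
    """Hops between two nodes = number of bit positions in which they differ."""
    return bin(a ^ b).count("1")


n = 3
print(hypercube_neighbors(0b000, n))  # prints [1, 2, 4]
print(max(hypercube_distance(a, b)
          for a in range(2**n) for b in range(2**n)))  # prints 3
```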
-
3-cube-connected cycles (CCC) network:
It is obtained from the 3-cube network. The idea is to cut off the corner nodes of the 3-cube and replace each by a cycle of 3 nodes.
In general, a k-CCC network is constructed from a k-cube with n = 2^k cycle nodes, each vertex being replaced by a cycle of k nodes.
Hence a k-cube can be transformed into a k-CCC with k x 2^k nodes; the 3-CCC has 3 x 2^3 = 24 nodes.
-
4D hypercube
4D hypercube = two 3D hypercubes with an additional link
connecting corresponding processors
-
Multistage Interconnection Network
Switch Modules
A x B switch module:
A inputs and B outputs
In practice, A = B = a power of 2
Each input is connected to one or more outputs (conflicts must be avoided)
One-to-one (permutation) and one-to-many connections are allowed
-
Multistage Interconnection Network
Switch Modules
A 2 x 2 switch can be configured for:
Straight-through
Crossover
Upper broadcast (upper input to both outputs)
Lower broadcast (lower input to both outputs)
-
Binary Switch
2 x 2 Switch
Legitimate States = 4
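The four legitimate states of a 2 x 2 switch can be modelled as a lookup table that says which input drives each output. This is an illustrative sketch; the state names and the `switch_outputs` helper are assumptions.

```python
# For each state: (index of the input driving the upper output,
#                  index of the input driving the lower output),
# where index 0 is the upper input and index 1 is the lower input.
SWITCH_STATES = {
    "straight": (0, 1),
    "crossover": (1, 0),
    "upper_broadcast": (0, 0),
    "lower_broadcast": (1, 1),
}


def switch_outputs(state, upper_in, lower_in):
    """Return (upper output, lower output) of a 2 x 2 switch in the given state."""
    inputs = (upper_in, lower_in)
    top, bottom = SWITCH_STATES[state]
    return inputs[top], inputs[bottom]


for state in SWITCH_STATES:
    print(state, switch_outputs(state, "a", "b"))
```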
-
Perfect-shuffle interconnection:
This interconnection network is defined by the routing function
$S\big((a_{n-1} \ldots a_1 a_0)_2\big) = (a_{n-2} \ldots a_1 a_0 a_{n-1})_2$
[Figure: perfect shuffle and inverse perfect shuffle connections for N = 8, nodes 0 to 7.]
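The perfect shuffle is a one-bit left rotation of the node address, and the inverse shuffle is the matching right rotation. A short sketch, assuming addresses fit in `n_bits` bits (the function names are illustrative):

```python
def perfect_shuffle(addr, n_bits):
    """Rotate the n-bit address left by one: the MSB becomes the LSB."""
    mask = (1 << n_bits) - 1
    return ((addr << 1) & mask) | (addr >> (n_bits - 1))


def inverse_shuffle(addr, n_bits):
    """Rotate the n-bit address right by one: the LSB becomes the MSB."""
    return (addr >> 1) | ((addr & 1) << (n_bits - 1))


print([perfect_shuffle(i, 3) for i in range(8)])  # prints [0, 2, 4, 6, 1, 3, 5, 7]
print([inverse_shuffle(i, 3) for i in range(8)])  # prints [0, 4, 1, 5, 2, 6, 3, 7]
```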
-
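A shuffle network alone is not a complete interconnection network. This can be seen by looking at what happens as data is recirculated through the network: for N = 8, nodes 0 and 7 map to themselves, and the remaining nodes stay within the cycles (1 2 4) and (3 6 5).
An exchange permutation can be added to a shuffle network to make it into a complete interconnection structure.
[Figure: nodes 0 to 7 connected by shuffle and exchange links.]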
-
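With a shuffle-exchange network, arbitrary cyclic shifts of an N-element array can be performed in log2 N steps. The exchange function complements the least significant address bit:
$E\big((a_{n-1} \ldots a_1 a_0)_2\big) = (a_{n-1} \ldots a_1 \bar{a}_0)_2$
Here is a diagram of a multistage Omega network for N = 8.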
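To see why the exchange function is needed, the sketch below collects the nodes that a data item can reach by recirculating through shuffle links only, and then through shuffle and exchange links together. The helper names are assumptions made for illustration.

```python
def shuffle(addr, n_bits):
    """Perfect shuffle: rotate the n-bit address left by one."""
    mask = (1 << n_bits) - 1
    return ((addr << 1) & mask) | (addr >> (n_bits - 1))


def exchange(addr):
    """Exchange: complement the least significant address bit."""
    return addr ^ 1


def reachable(start, n_bits, use_exchange):
    """Nodes reachable from start by repeatedly following the allowed links."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        targets = [shuffle(node, n_bits)]
        if use_exchange:
            targets.append(exchange(node))
        for nxt in targets:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


print(sorted(reachable(1, 3, use_exchange=False)))  # prints [1, 2, 4]
print(sorted(reachable(1, 3, use_exchange=True)))   # prints [0, 1, 2, 3, 4, 5, 6, 7]
```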
-
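[Figure: 8 x 8 Omega network with three stages. Each stage is a perfect shuffle (Shuffle 1, Shuffle 2, Shuffle 3) followed by a column of 2 x 2 exchange switches (Exch. 1, Exch. 2, Exch. 3). Inputs and outputs are numbered 0 to 7.]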
-
Omega network features
There are log2 p stages, each with p/2 switching elements, for (p/2) log2 p switches in total.
Simple routing algorithm:
At each stage, look at the corresponding bit (starting with the MSB) of the source and destination addresses.
If the bits are the same, the message passes straight through; otherwise it is crossed over.
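The routing rule above can be sketched as a small simulation. It assumes each stage consists of a perfect shuffle followed by a column of 2 x 2 switches, tracks the line a message occupies after every stage, and treats two messages as contending when they need the same line at the same stage. The function names are hypothetical, not taken from the slides.

```python
def shuffle(addr, n_bits):
    """Perfect shuffle: rotate the n-bit address left by one."""
    mask = (1 << n_bits) - 1
    return ((addr << 1) & mask) | (addr >> (n_bits - 1))


def omega_route(src, dst, n_bits):
    """Return (switch settings, line occupied after each stage) for src -> dst."""
    settings, links = [], []
    pos = src
    for stage in range(n_bits):
        bit = n_bits - 1 - stage             # start with the MSB
        src_bit = (src >> bit) & 1
        dst_bit = (dst >> bit) & 1
        pos = shuffle(pos, n_bits)           # wiring in front of this stage
        settings.append("straight" if src_bit == dst_bit else "cross")
        pos = (pos & ~1) | dst_bit           # leave the switch on output dst_bit
        links.append(pos)
    return settings, links


def contend(route_a, route_b, n_bits):
    """True if two (src, dst) pairs need the same line at the same stage."""
    _, links_a = omega_route(*route_a, n_bits)
    _, links_b = omega_route(*route_b, n_bits)
    return any(a == b for a, b in zip(links_a, links_b))


print(omega_route(0b010, 0b111, 3))
# prints (['cross', 'straight', 'cross'], [5, 3, 7])
```

Calling `contend((src1, dst1), (src2, dst2), n_bits)` on two routes reports whether they block each other in this model.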
-
[Figure: Omega network with inputs and outputs numbered 0 to 7, illustrating path contention.]
-
Extra Problems
-
Find the following for a 16 x 16 Omega network:
a) Number of stages.
b) Number of 2 x 2 switches needed in each stage.
c) Draw a 16-input Omega network using 2 x 2 switches as building blocks.
d) Show the switch settings for routing a message from node 1101 to node 0101 and from node 0111 to node 1001 simultaneously. Do the same for routing from node 2 to 7 and from node 6 to 4 simultaneously.
e) Does blocking exist in the above two cases?
-
Show the necessity of data routing and masking mechanisms during
the addition of 15 numbers. Assume each PE holds one element.
Find the number of steps required to add 16 elements.
Calculate the different routing functions.
Show the routing and addition in each step.