Part I: A Taste of Parallel Algorithms



• We examine five simple building-block parallel operations and look at the corresponding algorithms on four simple parallel architectures: linear array, binary tree, 2D mesh, and a simple shared-variable computer.


Semigroup Computation

• Given p values x0, x1, …, x(p−1) and an associative binary operator ⊗, compute x0 ⊗ x1 ⊗ ⋯ ⊗ x(p−1). Maximum finding, summation, and logical AND/OR are common special cases.
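As a point of reference, the semigroup computation itself is a simple sequential fold; this Python sketch (the function name `semigroup` is ours, not from the slides) shows what the p processors must collectively produce:

```python
from functools import reduce

def semigroup(values, op):
    """Combine all p values with an associative binary operator op,
    yielding x0 (op) x1 (op) ... (op) x(p-1)."""
    return reduce(op, values)

# Maximum finding is the special case op = max.
semigroup([3, 8, 1, 5], max)  # -> 8
```

Any associative operator works in place of `max`: addition, minimum, logical AND, and so on.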


Parallel Prefix Computation

• With the same assumptions as in semigroup computation, compute all p prefix results x0, x0 ⊗ x1, …, x0 ⊗ x1 ⊗ ⋯ ⊗ x(p−1), so that the ith processor ends up holding the ith prefix.
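Sequentially, a prefix computation is a running fold; this Python sketch (names are ours, for illustration) shows the p outputs a parallel prefix algorithm must deliver:

```python
from itertools import accumulate

def prefixes(values, op):
    """All p prefix results: x0, x0 (op) x1, ..., x0 (op) ... (op) x(p-1)."""
    return list(accumulate(values, op))

prefixes([3, 1, 4, 1], max)  # -> [3, 3, 4, 4]
```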


Packet Routing

• A packet of information resides at Processor i and must be sent to Processor j. The problem is to route the packet through intermediate processors, if needed, such that it gets to the destination as quickly as possible.

• The problem becomes more challenging when multiple packets reside at different processors, each with its own destination.

• When each processor has at most one packet to send and one packet to receive, the packet routing problem is called one-to-one communication or 1-1 routing.


Broadcasting

• Given a value a known at a certain processor i, disseminate it to all p processors as quickly as possible, so that at the end, every processor has access to, or "knows," the value. This is sometimes referred to as one-to-all communication.

• The more general case of one-to-many communication, in which the value is sent to a subset of the processors, is known as multicasting.


Sorting

• Rather than sorting a set of records, each with a key and data elements, we focus on sorting a set of keys for simplicity.


Linear Array

• Diameter D = p − 1

• Maximum node degree d = 2

• Ring? (adding a wraparound link halves the diameter to D = ⌊p/2⌋)


Binary Tree

• If all leaf levels are identical and every nonleaf processor has two children, the binary tree is said to be complete.

• D = 2 log2((p + 1)/2), i.e., twice the height of the tree

• d = 3


2D Mesh

• D = 2(√p − 1) for a √p × √p mesh

• d = 4

• Torus? (wraparound links in both dimensions reduce the diameter to 2⌊√p/2⌋)


Shared Memory

• A shared-memory multiprocessor can be modeled as a complete graph, in which every node is connected to every other node.

• D=1

• d=p-1


Algorithms for a Linear Array (1)

• Semigroup Computation – Let us first consider a special case of semigroup computation, namely maximum finding. Each of the p processors initially holds a value, and our goal is for every processor to learn the largest of these values.
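A step-by-step simulation of the linear-array algorithm (our own sketch, not the book's code): in each synchronous step every processor exchanges values with its neighbors and keeps the largest seen so far; after p − 1 steps, every processor holds the global maximum.

```python
def linear_array_max(values):
    """Simulate max finding on a p-processor linear array.
    Each step, every processor looks at its own value and its
    neighbors' values and keeps the maximum; p - 1 steps suffice
    because information travels one link per step."""
    p = len(values)
    current = list(values)
    for _ in range(p - 1):
        current = [
            max(current[max(i - 1, 0):min(i + 2, p)])  # self + live neighbors
            for i in range(p)
        ]
    return current

linear_array_max([4, 9, 2, 7])  # -> [9, 9, 9, 9]
```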


Algorithms for a Linear Array (2)

• Parallel Prefix Computation (Case 1: one value per processor)


Algorithms for a Linear Array (3)

• Parallel Prefix Computation (Case 2: more than one value per processor)


Algorithms for a Linear Array (4)

• Packet Routing


Algorithms for a Linear Array (5)

• Broadcasting – If Processor i wants to broadcast a value a to all processors, it sends an rbcast(a) (read r-broadcast) message to its right neighbor and an lbcast(a) message to its left neighbor.


Algorithms for a Linear Array (6)

• Sorting (Case 1)


Algorithms for a Linear Array (7)

• Sorting (Case 2, odd-even transposition) (efficiency?)
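Odd-even transposition sort can be simulated directly (a sketch with our own naming): alternate compare-exchange phases on even pairs (0,1), (2,3), … and odd pairs (1,2), (3,4), …; p phases always suffice. This also answers the efficiency question unfavorably: p processors spend O(p) steps on a problem a single processor solves in O(p log p) time.

```python
def odd_even_transposition_sort(a):
    """p alternating phases of neighbor compare-exchange on a linear array."""
    a = list(a)
    p = len(a)
    for phase in range(p):
        start = phase % 2           # 0: even pairs, 1: odd pairs
        for i in range(start, p - 1, 2):
            if a[i] > a[i + 1]:     # compare-exchange with right neighbor
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

odd_even_transposition_sort([5, 3, 8, 1])  # -> [1, 3, 5, 8]
```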


Algorithms for a Binary Tree (1)

• In algorithms for a binary tree of processors, we will assume that the data elements are initially held by the leaf processors only.

• The nonleaf (inner) processors participate in the computation, but do not hold data elements of their own.


Algorithms for a Binary Tree (2)

• Semigroup Computation – Each inner node receives two values from its children, applies the operator to them, and passes the result upward to its parent.
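The upward sweep can be sketched level by level (our simulation; it assumes the number of leaves is a power of 2):

```python
def tree_semigroup(leaves, op):
    """One upward sweep of a complete binary tree: at each level, every
    inner node combines the two values received from its children."""
    level = list(leaves)
    while len(level) > 1:
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]  # value held at the root

tree_semigroup([3, 1, 4, 1, 5, 9, 2, 6], max)  # -> 9
```

If every processor needs the result, the root then broadcasts it back down the tree.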


Algorithms for a Binary Tree (3)

• Parallel Prefix Computation


Algorithms for a Binary Tree (4)

• Packet Routing – depends on the processor numbering scheme used.

– Preorder


Algorithms for a Binary Tree (5)

• Broadcasting – Processor i sends the desired data upward to the root processor, which then broadcasts the data downward to all processors.


Algorithms for a Binary Tree (6)

• Sorting


Algorithms for 2D Mesh (1)

• In all of the 2D mesh algorithms presented in this section, we use the linear-array algorithms of Section 2.3 as building blocks.

• This leads to simple algorithms, but not necessarily the most efficient ones. Mesh-based architectures and their algorithms will be discussed in great detail in Part III.


Algorithms for 2D Mesh (2)

• Semigroup Computation – For example, in finding the maximum of a set of p values, stored one per processor, the row maximums are computed first and made available to every processor in its row. Then the column maximums are identified.
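The two phases are easy to express in Python (our sketch; each list operation stands in for a linear-array semigroup computation running in parallel across all rows or columns):

```python
def mesh_semigroup_max(grid):
    """Row phase: every processor in a row learns its row maximum.
    Column phase: the same algorithm over the row maxima then yields
    the global maximum, known to every processor."""
    row_max = [max(row) for row in grid]  # row phase
    return max(row_max)                   # column phase

mesh_semigroup_max([[4, 9, 2], [7, 1, 8], [0, 6, 5]])  # -> 9
```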


Algorithms for 2D Mesh (3)

• Parallel Prefix Computation– (1) do a parallel prefix computation on each row,

– (2) do a diminished parallel prefix computation in the rightmost column, and

– (3) broadcast the results in the rightmost column to all of the elements in the respective rows and combine with the initially computed row prefix value.
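The three phases can be checked against a small example (our sketch, in row-major processor order; `None` marks the empty carry of the top row, since a diminished prefix has no value for the first element):

```python
from itertools import accumulate

def mesh_prefix(grid, op):
    """Parallel prefix on a 2D mesh, row-major order."""
    # (1) parallel prefix along each row
    rows = [list(accumulate(row, op)) for row in grid]
    # (2) diminished prefix in the rightmost column: row r's carry is
    #     the combination of the totals of all rows above it
    totals = [row[-1] for row in rows]
    carries = [None] + list(accumulate(totals[:-1], op))
    # (3) broadcast each carry across its row and combine
    return [row if c is None else [op(c, v) for v in row]
            for c, row in zip(carries, rows)]

mesh_prefix([[1, 2], [3, 4]], lambda a, b: a + b)  # -> [[1, 3], [6, 10]]
```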


Algorithms for 2D Mesh (4)

• Packet Routing – To route a data packet from the processor in Row r, Column c, to the processor in Row r', Column c', we first route it within Row r to Column c'. Then we route it in Column c' from Row r to Row r' (row-first routing).
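Row-first routing traces out an L-shaped path; this sketch (function name ours) returns the sequence of processors visited:

```python
def row_first_route(src, dst):
    """Route from (r, c) to (r2, c2) on a 2D mesh: first move along the
    source row to the destination column, then along that column."""
    (r, c), (r2, c2) = src, dst
    path = [(r, c)]
    while c != c2:                     # phase 1: correct the column
        c += 1 if c2 > c else -1
        path.append((r, c))
    while r != r2:                     # phase 2: correct the row
        r += 1 if r2 > r else -1
        path.append((r, c))
    return path

row_first_route((0, 0), (2, 2))
# -> [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
```

The path length is |r − r'| + |c − c'| hops, at most 2(√p − 1) on a √p × √p mesh.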


Algorithms for 2D Mesh (5)

• Broadcasting – (1) broadcast the packet to every processor in the source node's row, and

– (2) broadcast it in all columns.


Algorithms for 2D Mesh (6)

• Sorting


Algorithms for Shared Variables

• Semigroup Computation

• Parallel Prefix Computation

• Packet Routing (Trivial in view of the direct communication path between any pair of processors)

• Broadcasting (Trivial, as each processor can send a data item to all processors directly)

• Sorting