characteristics of embarrassingly parallel computations easily parallelizable little or no...

Characteristics of Embarrassingly Parallel Computations

• Easily parallelizable

• Little or no interaction between processes

• Can give maximum speedup if all available processors are kept busy

• The only constructs required are simply to distribute the data and to start the processes

• Since the data is not shared, message-passing multicomputers are appropriate for such computations

Representation of Images

• The most basic way to store a two-dimensional image is a pixmap, in which each pixel is stored as a binary number in a two-dimensional array.– black-and-white - 1 bit per pixel– greyscale - 8 bits per pixel– color - 24 bits (RGB)

• Geometrical transformations require mathematical operations performed on pixels coordinates– Transformations move a pixel’s position without affecting its value.– Transformations must be done at high speed to be acceptable

• Pixels transformations are independent– Truly embarrassingly parallel computations

Parallel Programming Concern

• The input data is the bitmap typically held in a file copied into an array

• Main parallel programming concern: division of bitmap into group of pixels for each process– Typically more pixels than processors

• Two general methods of grouping– By square/rectangular region– By columns/rows

• Example: A 640 x 480 image, 48 processes– Divide display area into 48 80 x 80 square areas– Divide display area into 48 rows of 640 x 10 pixels

• This method of division appears in applications involving processing 2-D data

Partition into Rows: Master Process

for (i = 0, row = 0; i < 48; i++, row = row + 10) send(row,P[i]);for (i = 0; i < 480; i++) for (j = 0; j < 640; j++) temp_map[i][j] = 0;for (i = 0; i < (640 * 480); i++) { recv(oldrow,oldcol,newrow,newcol,P[ANY]); if (!((newrow < 0)||(newrow >= 480)|| (newcol < 0)||(newcol >= 640))) temp_map[newrow][newcol] = map[oldrow][oldcol];}for (i = 0; i < 480; i++) for (j = 0; j < 640; j++) map[i][j] = temp_map[i][j];

Slave Processes

recv(row,P[MASTER]);

for (oldrow = row; oldrow < (row + 10); oldrow++)

for (oldcol = 0; oldcol < 640; oldcol++) {

newrow = oldrow + delta_x;

newcol = oldcol + delta_y;

send(oldrow,oldcol,newrow,newcol,P[MASTER]);

}

Program Analysis• Suppose each pixel requires two computational steps and

there are n x n pixels.– ts = 2n2 - (O(n2))

• Communication:– p processes– Before the computation, the starting row numbers must be sent to

each process.– The individual processes have to send back the transformed

coordinates of their group of pixels.– tcomm = p(tstartup +tdata)+n2(tstartup +4tdata) = O(p+n2)

• Computation:– Groups of n2/p.– Each pixel requires 2 additions.– tcomp = 2(n2/p) = O(n2/p)

• For fixed p, the time complexity is O(n2).

Mandelbrot Set

• The Mandelbrot set is a widely used test in parallel computer systems– It is computationally intensive

• Displaying this set is another example of processing a bit-mapped image

• In contrast to the previous example, the image is computed in this case

Mandelbrot Set (cont’d)

• Computing the function zk+1=zk2+c, is simplified by

recognizing that– z2 =a2+2abi+bi2 = a2-b2+2abi

• Hence if zreal is the real part of z and zimag is the imaginary part of z, the next iteration values can be produced by computing:– zreal =zreal

2-zimag2+creal

– zimag=2zrealzimag+cimag• The following C structure can be used to represent z:

typedef struct { float real; float imag; } complex;


• The code for computing and displaying the points requires some scaling of the coordinate system of the display area– Actual viewing area will usually be a rectangular window of any size and sited

anywhere of interest in the complex plane

• Let disp_heigt, disp_width and (x,y) be the display height, width and the coord of a point in the display area

• If this window is to display the complex plane with minimum values (real_min, imag_min) and maximum values (real_max,imag_max), each (x,y) point needs to be scaled by:

c.real = real_min + x*(real_max-real_min)/disp_width;

c.imag = imag_min + y*(imag_max-imag_min)/disp_height;

• For computational efficiency, let– Scale_real= (real_max-real_min)/disp_width– Scale_imag= *(imag_max-imag_min)/disp_height


• Including scaling, the code could be of the form:

for(x=0; x<disp_width;x++)

for(y=0; y<disp_height;y++){

c.real = real_min + ((float)x*scale_real);

c.imag = imag_min + ((float)y*scale_imag);

color = cal_pixel(c);

display(x,y,color);

}

Static Task Assignment

• Master for (i = 0,row=0; i < 48; i++,row=row+10) send(&row,P[i]);for (i = 0; i < (480*640); i++) { recv(&c,&color,P[ANY]); display(c,color); }

Static Task Assignment (cont’d)

• Slave (process i)recv(&row,P[MASTER]);

for (x = 0; i < disp_width; x++) for(y=row; y<row+10; y++){


c.imag = imag_min + ((float)y*scale_imag);

color = cal_pixel(c);

send(&c,&color,P[MASTER]); }

Dynamic Task Assignment

• Mandelbrot Set - significant iterative computation per pixel.• The number of iterations will generally be different for each

pixel.• Computers may be of different types and speeds in a cluster• Ideally, we want all processors to complete together,

achieving a system efficiency of 100%.• Assigning regions of different sizes to different processors

also has problems– Need to know a processor’s speed a priori

• Work Pool approach (processor farm)– Individual processors are supplied with work when they become idle.

• Dynamic load balancing can be achieved using a work-pool approach

Dynamic Task Assignment

• Master count=0; /* counter for termination */row=0; /* row being sent */for (k = 0, k < num_proc; k++){/*assume num_proc<disp_height*/ send(&row,P[k],data_tag); count++; row++;}do{ recv(&slave, &r, color, P[ANY],result_tag); count--; /* reduce count as rows received */ if(row<display_height){ send(&row,P[SLAVE],data_tag); row++; count++; } else send(&row,P[SLAVE],terminator_tag); display(r,color); } while(count>0);

Dynamic Task Assignment (cont’d)

• Slave

recv(&y,P[MASTER],source_tag);

while(source_tag==data_tag){

c.imag = imag_min + ((float)y*scale_imag); for(x=0;x<disp_width;x++){


color[x] = cal_pixel(c );

}

send(c,color,P[MASTER],result_tag);

recv(&y,P[MASTER],source_tage); /* recv next row */

}

Analysis

• Exact analysis of the Mandelbrot computation is complicated by not knowing how many iterations are needed for each pixel.

• The number of iterations for each pixel is some function of c but cannot exceed max.

• Therefore, the sequential time is

• Sequential time complexity of O(n).• Let us just consider the static assignment.• Three phases:• Phase 1: Communication

– First, the row number is sent to each slave one data item to each p-1 slavestcomm1 = (p-1)(tstartup + tdata)

niterts *max_

Analysis (cont’d)

• Phase 2: Computation– The slaves perform the Mandelbrot computation in parallel

• Phase 3: Communication– Results are passed back to the master, one row of pixel colors at a time.

– Suppose each slave handles u rows and there are v pixels on a row:

– For static assignment, u and v will be constant (unless the solution of the image was changed), so we can assume tcomm2 = k, a constant

1

*max_

p

nitertcomp

)(2 datastartupcomm vttut

Overall Execution Time

• Overall, the parallel time is given by

where the total number of processors is p

kttpp

nitert datastartupp

))(1(

1

*max_

characteristics of embarrassingly parallel computations easily parallelizable little or no...

Documents

mapij slide

case slide

d data slide

n x n pixels

z real

row oldrow row

group of pixels

p processes