tp parallel processing applications
TRANSCRIPT
-
8/7/2019 tp PARALLEL PROCESSING APPLICATIONS
1/39
TERM PAPER
OF COMPUTER ORGANIZATION & ARCHITECTURE
CSE (211)
TOPIC: PARALLEL PROCESSING APPLICATIONS
Submitted to: MR. SUMIT MITTU
Submitted by: ASHISH KUMAR
ROLL NO-RG1901A65
SECTION-G1901, GROUP-1
REG.NO-10800114
TABLE OF CONTENTS
INTRODUCTION TO PARALLEL PROCESSING
VARIOUS APPLICATION AREAS OF PARALLEL PROCESSING
MATRIX OPERATIONS
VIRTUAL REALITY SYSTEM
A ROBUST FRAMEWORK FOR DISTRIBUTED
PROCESSING OF GIFTS HYPERSPECTRAL DATA
BIBLIOGRAPHY
INTRODUCTION TO PARALLEL
PROCESSING
Parallel processing is the ability to
carry out multiple operations or tasks
simultaneously. The term is used in
the contexts of both human
cognition, particularly in the ability
of the brain to simultaneously
process incoming stimuli, and in
parallel computing by machines.
PARALLEL PROCESSING IN
COMPUTERS
The simultaneous use of more than
one CPU or processor core to
execute a program or multiple
computational threads. Ideally,
parallel processing makes programs
run faster because there are more
engines (CPUs or cores) running it.
In practice, it is often difficult to
divide a program in such a way that
separate CPUs or cores can execute
different portions without interfering
with each other. Most computers
have just one CPU, but some models
have several, and multi-core
processor chips are becoming the
norm. There are even computers with
thousands of CPUs.
With single-CPU, single-core
computers, it is possible to perform
parallel processing by connecting the
computers in a network. However,
this type of parallel processing
requires very sophisticated software
called distributed processing
software.
Note that parallelism differs from
concurrency. Concurrency is a term
used in the operating systems and
databases communities which refers
to the property of a system in which
multiple tasks remain logically active
and make progress at the same time
by interleaving the execution order
of the tasks and thereby creating an
illusion of simultaneously executing
instructions. Parallelism, on the other
hand, is a term typically used by the
supercomputing community to
describe executions that physically
execute simultaneously with the goal
of solving a problem in less time or
solving a larger problem in the same
time. Parallelism exploits
concurrency.
Parallel processing is also called
parallel computing. In the quest for
cheaper computing alternatives,
parallel processing provides a viable
option. Idle processor
cycles across a network can be used
effectively by sophisticated
distributed computing software.
VARIOUS APPLICATION
AREAS OF PARALLEL
PROCESSING
MATRIX
OPERATION
Performance Analysis and
Evaluation of Parallel Matrix
Multiplication Algorithms:-
Multiplication of large matrices requires a
lot of computation time, as its complexity is
O(n³). Because most current applications
require higher computational throughputs
with minimum time, many sequential and
parallel algorithms have been developed. In this
paper, a theoretical analysis for the
performance and evaluation of the parallel
matrix multiplication algorithms is carried
out. In addition, an experimental analysis is
performed to support the theoretical analysis
results. Recommendations are made based
on this analysis to select the proper parallel
multiplication algorithms. Matrix
multiplication is commonly used in the areas
of graph theory, numerical algorithms,
digital control, digital image processing and
signal processing. Multiplication of large
matrices requires a lot of computation time,
as its complexity is O(n³), where n is the
dimension of the matrix. Because most
current applications require higher
computational throughputs, many algorithms
based on sequential and parallel approaches
were developed to improve the performance
of matrix multiplication. Even with such
improvements, for example, Strassen's
algorithm for sequential matrix
multiplication has shown a limitation in
performance. For this reason, parallel
approaches have been examined for decades.
Most parallel matrix multiplication
algorithms use matrix decomposition that is
based on the number of processors available.
This includes the systolic algorithm,
Cannon's algorithm, Fox and Otto's
algorithm, PUMMA (Parallel Universal
Matrix Multiplication), SUMMA (Scalable
Universal Matrix Multiplication) and
DIMMA (Distribution Independent Matrix
Multiplication). Each of these
algorithms decomposes the matrices
into sub-matrices. During
execution, each processor calculates a
partial result using the sub-matrices that are
currently accessed by it. It successively
performs the same calculation on new
sub-matrices, adding the new results to the
previous ones. When the multiplication
sub-process is completed, the root processor
assembles the partial results and generates
the complete matrix multiplication result.
In order to make a proper selection for a
given multiplication operation and to decide
which algorithm generates the highest
throughput in the minimum time, a
comparison analysis and a performance
evaluation of the above-mentioned
algorithms is carried out using the same
performance parameters and based on
parallel processing. Moreover, these
algorithms are implemented in the C++
programming language.
PARALLEL COMPUTING
PARADIGMS
The parallel computing process depends on
how the processors are connected to memory.
The connection can be classified as
shared memory or distributed memory;
each of these two types
is discussed as follows:-
Shared Memory System: In such
a system, a single address space exists,
within which every memory location is given a
unique address and the data stored in
memory are accessible to each processor.
One processor may read data written by
another. Therefore, in order to enforce
sequential consistency, it is necessary to use
synchronization.
OpenMP is one of the popular
programming models for
shared memory systems. It provides a
portable, scalable and efficient approach to
running parallel programs in C/C++ and
FORTRAN. In OpenMP, a sequential
program can be parallelized
with compiler directives such
as #pragma omp in C and $OMP in
FORTRAN, together with library support.
Distributed Memory System: In
such a system, each processor has its own
memory and can only access its local
memory. The processors are connected with
other processors via a high-speed
communication network. Processors
exchange information with one another
using send and receive operations. A
common approach to programming
multiprocessors is to use message-passing
library routines in addition to a conventional
sequential program. MPI (Message Passing
Interface) is well suited to distributed
memory systems since it provides a widely
used message-passing standard. It
provides a practical, portable, efficient and
flexible standard for message passing. In
MPI, data is distributed among processors;
no data is shared, and data is
communicated by message passing.
THEORETICAL ANALYSIS
In order to evaluate the performance of any
matrix multiplication based on using parallel
processors and different algorithms, a
theoretical analysis is carried out based on
the following assumptions:
f = number of arithmetic operation units
tf = time per arithmetic operation
c = number of communication units
tc = time per communication unit
q = f / c = average number of flops per
communication access
Minimum possible time = f * tf, when there is
no communication
Efficiency (speedup): SP = q * (tf / tc)
Total time = f * tf + c * tc = f * tf * (1 + (tc/tf) * (1/q))
So, a larger q indicates that the time is closer
to the minimum f * tf, thus increasing the
speedup and the efficiency of the algorithm.
SP must be close to the number of processors
to achieve the highest efficiency. Using the above
assumption and with reference to the
information in Table(1) we can obtain f, c
and q for each parallel matrix multiplication
as shown in Table(2).
EXPERIMENTAL RESULTS
Table (4) shows the experimental results
obtained by implementing each algorithm
100 times; the fastest five runs are then
taken and averaged.
CONCLUSION AND FUTURE
WORKS
A theoretical analysis of the performance of
the seven most used algorithms is carried out
using one and four processors. This analysis
has shown that the systolic algorithm is
the best, producing
the highest efficiency, followed by
PUMMA, DIMMA and
then SUMMA. In addition, these algorithms are
implemented and run on one and four
processors to evaluate their performance for
a matrix multiplication. The experimental
results match the theoretical ones.
This analysis is useful for making a proper
recommendation for selecting the best
algorithm among the candidates in future
work.
VIRTUAL REALITY
SYSTEM
Application of Parallel Processing to
Rendering in a Virtual Reality System:-
The rapid rendering of images is pivotal in
all applications of computer graphics, and
perhaps
most of all in the field of virtual reality.
Research suggests that there is a limit to the
improvement in performance that can be
attained by the selection of the appropriate
algorithms
and their careful optimisation. Hardware
implementation increases expense and limits
the
flexibility of the equipment. A relatively
unexplored option is the use of parallel
processing in
image rendering, especially image rendering
in virtual reality systems.
This paper examines the application of
parallel processing to image rendering in a
virtual
reality system. The requirements of a virtual
reality system are noted. A number of
different
parallelisation techniques are proposed, and
each analyzed with regard to its
effectiveness in
meeting these requirements. Performance
values, measured from an implementation of
each
technique on a transputer architecture, are
used to highlight the advantages and
disadvantages of
each technique.
Rendering for a virtual reality
system:- Rendering is a well understood
application. Many systems have been
created in which three-dimensional scenes
have to be rendered. The use of parallelism
for rendering has also been explored.
However, previous research has not placed
the emphasis on rendering scenes in real-
time. In particular, little work has been done
on using parallel processing when rendering
for virtual reality applications. Virtual
reality applications are different in that the
images must be rendered very rapidly, in
real time, and that the delay between a user
input and the corresponding change in the
rendered output, the latency, must be small.
In working virtual reality systems, frame
rates of 6 frames/second and a latency of
less than 200 milliseconds are suggested as
minimum requirements for the illusion of
reality. These two factors, the speed of
rendering and the latency period, are used as
the criteria for evaluating the parallelisation
techniques explored in this paper.
Rendering on a single processor: -
Renderers are typically constructed by
creating a pipeline of several standard
operations. These operations include
clipping, transformation, projection and
hidden-surface removal. The description of
the scene to be displayed is fed into the
pipeline which eventually produces a
sequence of primitives (two dimensional
lines and polygons, for example) to be
drawn. The pipeline typically ends with
some routines capable of producing a visual
representation of these primitives. The
renderer described in this paper is modelled
as three stages. The first selects those
objects in the virtual world which will be
visible on the screen. The next stage
transforms the objects to screen coordinates.
The last stage performs hidden-surface
removal, clips the objects to the dimensions
of the screen and renders each primitive.
This level of breakdown is adequate for the
parallelisation used which concentrates more
on parallelisation inside each stage, rather
than between stages.
Parallelisation of the graphics
pipeline: - A number of different parallel
decomposition strategies for polygon
rendering are described in a paper by
Whitman. Four decomposition strategies
were implemented, and are described in the
following sections. The various strategies
were implemented on a transputer-based
architecture. However, in most cases, the
techniques are applicable to other
architectures. The hardware platform on
which the parallelisation was carried out was
a MIMD architecture consisting of a cluster
of 32 T800 transputers. Each transputer is a
reasonably powerful processor in its own
right, capable of running a number of
concurrent processes. The addition of four
communication links allows it to transfer
data to and from other transputers. Each
transputer has its own private memory and
all synchronisation and data transfer occurs
through the links. A single bidirectional link
can transfer a theoretical maximum of 2.35
Mbytes/s, or 1.74 Mbytes/s if
communication is only in one direction. A
set of standard test data was used to obtain
timing data from the various
implementations. The speed of the rendering
system on a single processor is given in
Table 1. The simple test world contains 30
objects amounting to about 250 polygons
(each with 4 or more sides). The complex
world consists of a single object containing
about 1650 triangular polygons.
Pipeline parallelism
The most obvious strategy when confronted
with the graphics pipeline is to place each
component of the pipeline on a separate
processor. The connections between
components can be constructed using
transputer links. Two factors are worthy of
consideration when setting up a pipeline
configuration for virtual reality on a
transputer architecture. Firstly, greater
bandwidth is obtained from the transputer
links when larger messages are transmitted.
Protocol overhead is less, and there is less
delay waiting for each of the communicating
processors to synchronise. The few bytes
representing a point are not worth
transmitting alone; it is better to wait until a
number of points are ready to be transmitted
and then send them all together. For
convenience, all the data corresponding to
an entire object is sent in one packet. The
second consideration is the latency problem,
the delay between input and the
corresponding output. Lots of data is held in
long pipelines, even with a limited amount
of data at each node. All of this must be
processed before new data entering the node
can have any effect on the image being
displayed. Since there is a certain overhead
associated with the communication between
pipeline components, the delay between data
entering the pipeline and it reaching the
output increases with pipeline length, even
though the rate at which images are
produced is increased.
This may be illustrated by an example:
Consider a job which on a single processor
would accept a datum and take 10 seconds
to process it before outputting the result.
Assume this job is capable of being
parallelised linearly. A pipeline in which
each node shares equally in the workload is
illustrated in Figure 1. The data is arriving
as fast as possible, but will take 1 second to
transmit across a link. With 5 processors the
job can be divided into a pipeline of 5
parallel components, and so each will work
on the datum for 2 seconds. Thus results can
be output every 2 seconds. The time
between a given datum being input and the
corresponding output is at least 16 seconds.
With 10 processors, output occurs once
every second but at least 21 seconds pass
between input and corresponding output.
In general:
let N be the number of processors,
let T be the time for the job on the single
processor system and
let C be the communication time between
processors. A result is then output every T/N
seconds. The latency is at least (N+1)C + T seconds.
The latency increases with the number of
processors for non-zero communication
time. For zero communication time it
remains constant, independent of the data
output rate. For this reason pipeline
parallelism is better suited to systems that
require computationally intensive rendering
with limited user interaction. A limited
rendering pipeline with three nodes was
implemented. The processor layout is shown
in Figure 2. Timing figures are given in
Table 2. The differences in speedup at
different screen resolutions are due to
change in the load balance. The renderer has
to do more work at higher resolutions and so
acts as a bottleneck.
Coarse grained parallelism
The coarse grained parallel decomposition
strategy renders successive frames on
separate processors. The transputer network
is arranged to allow a number of
independent pipelines to be created, as
shown in Figure 3. Each pipeline renders
one frame in memory which is then
transferred directly to the graphics node.
This technique promises the greatest
speedup since it requires no load balancing.
Each processor can start work on another
frame as soon as the current one is finished.
The load is also approximately equal from
one frame to the next, so they are likely to
stay in step. There is thus no waiting for
other processors to catch up. The delay
between input and output is still the time
taken for the pipeline to render the frame.
No decrease in this time occurs even when
the number of parallel pipelines is increased.
The latency is thus constant and independent
of the number of processors used. Values for
the speedup obtained from an
implementation of this technique are shown
in Figure 4. As may be seen, linear speedup
is obtained for most of these examples. The
only factor preventing linear speedup in all
cases is the bandwidth restriction on the
links to the graphics node. With each link
capable of 1.7 Mbytes/second and four links,
this allows a maximum of 26.6
frames/second at 512 x 512 resolution or
108 frames/second at 320 x 200 resolution.
The example with the simple world at 512 x
512 is using about half of the total
bandwidth available. There is a general
tendency for a drop in the speedup when
more than four pipelines are used, especially
in high traffic situations. This tends to occur
with all the parallelism strategies tried. It
can be ascribed to the necessary doubling of
the traffic on one of the links to the video
controller card.
Fine grained parallelism
The pipeline approach uses a number of
transputers connected in series. It has the
disadvantage of high latency. A more
suitable parallel decomposition for
interactive graphics applications would have
the processors connected in parallel, and all
working on the same frame. In this way the
delay between input and corresponding
output would be shortened. A profile of the
graphics routines running on a single
processor reveals that the bulk of the effort
is spent on the stage involving hidden
surface removal and polygon rendering.
Thus the bulk of the effort in parallelisation
has gone into decreasing the time spent on
this stage. The other stages of the graphics
pipeline are left as a short pipeline on
separate processors. The fine grained
parallelisation technique is an image space
pixel based parallel decomposition strategy.
The display area is partitioned into
horizontal strips, and the hidden surface
removal and rendering for each strip occurs
on a separate processor. The strips are
rendered in memory and transferred to the
graphics node which fits each block into the
correct area of screen memory. The amount
of data that is transmitted to the video
controller node is the same as with the
coarse grained parallelism. The messages
are shorter but more frequent. The processor
layout used to implement this technique is
shown in Figure 5. The shaded processors
represent one possible graphics pipeline for
rendering a single strip. The transformation
stage can be parallelised in the same
manner, with transformations restricted to
those objects appearing on any particular
strip. However, in practice most objects
appear on every strip anyway and so this
does not result in any significant saving. The
transformation stage will be placed on a
single processor to simplify the model.
Consider a simplified version of the graphics
pipeline that consists of K stages, with each
stage taking time T to complete its task.
By using N processors on one of the stages
and assuming linear speedup, the time for
that stage will be reduced to T/N. The total
time taken to traverse the pipeline is (K -
1)T + T/N. The latency decreases as the
number of processors is increased. The
restrictions given above can be relaxed, and
as long as speedup is obtained, latency will
decrease.
The speedup values obtained from
implementations of this strategy are shown
in Figure 6. The times measured are for the
entire graphics pipeline, and not just the
hidden surface removal and polygon
rendering. This latter part of the pipeline
does most of the work however, and controls
the overall speed. This technique does not
produce a large speedup, nor does the
speedup increase very much with the
number of processors used. Two factors are
found to contribute to these effects. When
rendering the horizontal strips, the polygons
that fall partially outside these strips must be
clipped. As the number of strips increases,
each strip becomes smaller and the number
of polygons needing to be clipped increases.
In one test with the screen divided into 4
strips the time taken for clipping increased
from about 25% of the time taken for
rendering to about 500% of the rendering
time. The rendering time did decrease by a
factor of 4 due to the parallelism, but the
clipping overhead was still significant. The
clipping is a significant factor with the
single object in the complex world with its
many polygons and limits the speedup the
most for this case. The second factor
involves load balancing. The picture is not
always evenly distributed over the screen.
Sometimes all the objects cluster in one
region of the screen. In this case one
processor is required to render most of the
objects while others render almost nothing.
This situation is almost equivalent to a
single pipeline. The effect of this can be
seen in the speedup values for the complex
world with 512 x 512 resolution. The single
object occupies the middle half of the
screen. Thus for three pipelines, the
processors rendering the top and bottom
slices have little work to do and the
processor rendering the middle slice does
most of the rendering. This explains the drop
in speedup over the two pipeline version
where the load is more evenly balanced. The
object fills the screen in the 320 x 200
resolution and clipping limits the speedup in
this case. Alternative techniques for
distributing sections of the display area over
N processors to improve the load balancing
include:
Placing successive scan-lines on different
processors.
Dividing the screen into N² horizontal
slices and placing successive slices on
different processors.
These alternatives, however, incur greater
overheads. The second alternative would
result in more clipping, making it more
expensive. The first alternative proved
possible to implement with surprisingly few
alterations to the rendering routines. Tests of
the method show a more consistent tendency
to increase in speed as the number of
processors is increased, as opposed to the
more erratic behaviour evidenced by the
original routine. This can be attributed to the
improved load balancing. Despite the
improvement in scaling, the added
complexity results in the improved
algorithm running about 10% slower than
the original. The frame rates obtained with
these methods were not high enough for
bandwidth limitations to be a constraint.
Object-based parallelism
An object space parallelisation strategy was
designed in which rendering of different
objects occurs on different processors. For
the case of N processors, the objects are
sorted into N groups based on their depth.
This is easily done as part of the depth sort
involved in the hidden object removal
routine. Each group is then rendered on one
of the processors. The images produced are
sent to the graphics node where the images
are combined. The combination technique is
similar to the Painter's algorithm, used for
hidden surface removal, where images
corresponding to nearer objects are painted
over those belonging to more distant ones.
The transputer has a specialised block move
instruction that allows this to be done
relatively quickly and easily. Both the
transformation and hidden surface/polygon
rendering stages can be done in parallel with
this method. The transputer network used to
implement this strategy is shown in Figure
7. The shaded nodes show the graphics
pipeline for one group of objects. This
parallelisation strategy has one immediate
disadvantage. If the scene consists of only
one object, then there is little point in using
more than one processor. Consequently a
world containing 16 objects, each with 276
triangles was used instead of the complex
world used for previous tests. Values
obtained for the speedup for an
implementation of this strategy are shown in
Figure 8. The bandwidth limitation is
particularly severe in this case. In the case
where the number of processors used is less
than five, an entire image needs to be sent
across each of the transputer links. This
places an upper limit on the frame rate of 6.6
frames per second for 512 x 512 resolution
or 27 frames per second for 320 x 200
resolution. The speedup values tend to level
off as this barrier is reached. Adding a fifth
processor only exacerbates the situation. A
further problem with this technique is load
balancing. Some objects are more complex
than others, and just distributing equal
numbers of objects may result in some
processors having to do more work. A finer,
more balanced approach would be to
distribute on a polygon basis. This could be
implemented relatively easily. However,
such an approach is not worth exploring
until higher bandwidth systems are
available.
The parallelism would also suffer when the
number of visible objects is less than the
number of processors. Polygon distribution
will be valuable for this situation as well.
The simplified version of the graphics
pipeline used in the previous latency
calculation, gives the total time taken to
traverse the pipeline as (K - 2)T + 2T/N,
since two stages of the pipeline are
parallelised. The latency is less than that for
the fine grained decomposition and also
decreases as the number of processors is
increased.
Parallel rendering in practice
With each of the techniques implemented,
analysis of run-time profiling information
showed that processor time was being
wasted in waiting for other processors to
catch up. Processors were blocking, waiting
for others to communicate. This problem
was overcome by buffering incoming
messages, enabling the process sending the
data to continue with other processing. A
notable speedup in the frame rate was
achieved. This solution had a corresponding
disadvantage. Adding extra processes on the
communication links has the effect of
lengthening the pipeline. This increases the
latency. The size of the buffers on each link
also affects the latency adversely. In order to
get the maximum benefit from this
technique the size of the buffers should be
related to the complexity of the scene. Small
buffers are better suited for scenes with a
few complex objects, while larger ones
smooth over the delays for large numbers of
simple objects.
Conclusions
This paper describes a number of different
parallelisation strategies that were tested in
an attempt to increase the rendering speed
and decrease the interaction latency in
virtual reality systems. The strategies
explored tended to exhibit a tradeoff
between speedup and latency. A frame per
processor approach gave almost linear
speedup, but left the latency unaffected.
Strategies which distributed the calculations
for a single frame over a number of
processors resulted in latency decreasing as
the number of processors increased. These
techniques did, however, have a lesser
speedup. A significant factor in the tests
carried out was the limited bandwidth of the
transputer links. An increase in the
communication speed could contribute
toward producing acceptable rendering
speeds with some of the lower latency
strategies. In particular, the technique which
rendered slices of the image on different
processors proved to be the most satisfying
to use in practice. The object per processor
strategy will become acceptable once the
communication bandwidth increases.
A Robust Framework
For Distributed
Processing Of GIFTS
Hyperspectral Data
Abstract: The Geosynchronous Imaging
Fourier Transform Spectrometer (GIFTS)
instrument can provide raw data in the order
of multiple Terabytes per day. Due to the
high data rate, satellite ground data
processing will require considerable
computing power to process and archive
data in near real-time. Cluster technologies
employing a multi-processor system
combined with a parallel file system are the
only cost-effective solution for such
processing and storage. The GIFTS data
processing system is required to generate
critical products within 5 minutes of
gathering an observation. In this paper we
present an approach for GIFTS ground
system processing based on the master-
worker paradigm, which provides
performance, reliability, and scalability of
candidate hardware and software using the
Message Passing Interface (MPI) standard.
The framework used alleviates the need for
earth scientists to understand parallel
computing and fault-tolerant operations.
Benchmarking results are presented for a
selected number of science algorithms for
the GIFTS instrument showing that
considerable performance can be gained
without sacrificing the reliability and high
availability constraints imposed on the
operational cluster system. A maximum
speedup of 54.56 (85.9% efficiency) is
obtained for a total number of 64 processors
over 64 data cubes of 128 x 128 pixels in the
long wave and short-medium wave spectral
range. This prototype system shows that
considerable performance can be gained for
candidate science algorithms without
sacrificing reliability and high availability
needed for a real-time system.
I. INTRODUCTION
Future satellite instruments can provide raw
data of the order of multiple Terabytes per
day. Due to the high data rate, satellite
ground data processing will require
considerable computing power to process
and archive this data in near real-time. Theprimary mission of the National Oceanic and
Atmospheric Administration (NOAA) is to
understand and predict changes in the
Earth's environment, which requires a
continuous capability to acquire, process and
archive data in real-time. NOAA
participation in the GIFTS technology
transfer represents a risk reduction activity
in the design of the NOAA GOES-R series
of imager and sounder instruments and their
associated science algorithms. The GIFTS
instrument uses a combination of Large area
Focal Plane Arrays (LFPAs) and a Fourier
Transform Spectrometer (FTS), providing a
spectral resolution of 0.6 cm-1 for a 128 x
128 set of 4 km foot-prints every 11 seconds
[1]. It is anticipated that the GIFTS Level-0
data rate is about 55 Mbps, or about 1.5
terabytes per day [1]. The computing power needed for Level-0 to Level-1 processing is
driven not only by the data volume but also
by the inverse Fourier transform and non-
linearity correction.
The volume of Level-1 data is
approximately the same as Level-0. Since
there is little reduction in data volume from
Level-0 to Level-1, producing Level-2 data
also requires significant computing power.
The GIFTS data processing system is required to
generate critical products within 5 minutes
of an observation being gathered. Cluster
technologies employing a multi-processor
system present the only current
economically viable option to accommodate
the processing needed for GIFTS ground
data processing. However, such systems are
inherently failure-prone; the failure of one
component may result in a failure of the
whole system if necessary measures are not taken.
Operational real-time systems need to be reliable and fault-tolerant, operate on
continuous data streams, and be operator-
friendly. To sustain high levels of system
reliability and operability in a cluster-
oriented operational environment, a fault-
tolerant data processing framework is
proposed to provide a platform for
encapsulating science algorithms and hide
the complexities involved with an
operational cluster system. Many Earth science algorithms are very complex, but
they have only a small degree of spatial
dependency and are thus ideal for parallel
processing.
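This spatial independence can be illustrated with a minimal sketch, in which each pixel of a data cube is processed independently of its neighbours (the per-pixel computation and the data layout here are hypothetical placeholders, not the GIFTS algorithms):

```python
from multiprocessing import Pool

def process_pixel(interferogram):
    # Placeholder per-pixel science computation; each pixel's
    # interferogram is processed independently of its neighbours.
    return sum(interferogram) / len(interferogram)

def process_cube(cube):
    # A "data cube" here is a list of per-pixel interferograms.
    # Because there is no spatial dependency, the pixels can be
    # farmed out to any number of workers in any order.
    with Pool(processes=4) as pool:
        return pool.map(process_pixel, cube)
```

Because results depend only on each pixel's own data, the workers need no communication with one another, which is the property the framework exploits.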
Designing a highly reliable operational
cluster system for NOAA's future ground
segments is the focus of this paper. In the
rest of this section, the work reported in the
literature that is most relevant to ours is briefly discussed. In section 2, we describe the
GIFTS instrument and the associated science
algorithms. Section 3 describes the framework
architecture used for GIFTS processing.
Section 4 describes the implementation
details of the prototype system. Section 5
shows benchmarking results and tradeoff
study for GIFTS ground system processing.
Section 6 concludes the paper and comments
on future directions of this work.
II. GIFTS INSTRUMENT AND SCIENCE
ALGORITHMS
A. GIFTS Instrument Background
The GIFTS instrument consists of
large area focal plane detector arrays
(128 x 128 pixels) within a Fourier
Transform Spectrometer (FTS), mounted on a geostationary satellite.
The instrument provides observations
of Earth infrared radiance spectra at
high spectral resolution (0.6 cm-1)
and high spatial resolution (4 km x 4
km pixels). Depending on spectral
resolution, GIFTS views a large area
(512 km x 512 km) of the Earth within a 1 to 11 second time interval.
Extended Earth coverage is achieved
by step-scanning the instrument field
of view in a contiguous fashion across
any desired portion of the visible
Earth. A visible camera provides
daytime imaging of clouds at 1 km
spatial resolution. Figure 1 shows a
selection of GIFTS measurement
modes.
GIFTS uses two detector arrays to
cover the spectral bands 685 to 1130
cm-1 and 1650 to 2250 cm-1, as
shown in Figure 2, and a Michelson
interferometer to obtain the spectrum
of radiance within these bands. The
spectral resolution of the
measurements is sufficient to resolve, within 1-2 km vertical resolution,
dynamic features of the atmospheric
temperature and moisture profiles.
The geostationary platform enables
the tracing of fine scale features of the
atmospheric water (cloud and vapor)
distribution to permit the derivation of
altitude resolved wind profiles.
Moreover, GIFTS will cover a
major portion of the visible disk with
high vertical resolution soundings in
less than 30 minutes. This feature is
important for obtaining wind profiles from geostationary temperature and
moisture sounding data. As part of
the University of Wisconsin's (UW's)
algorithm development, simulations
of expected top of atmosphere (TOA)
radiances are being used for algorithm
development and processing.
B. GIFTS Science Algorithm Pipeline
The GIFTS sensor will sample the
interferogram from each detector as a
function of optical path delay and
numerically filter the data in real-time to
reduce the data rate before transmission to
the ground-based X-band receiver. The
sensor collects views of the onboard
calibration references and deep space at
regular intervals. The ground reception
facility will decode the telemetry stream andpass the GIFTS sensor data in real-time to a
ground data processing facility. The GIFTS
science algorithms are developed by SpaceScience and Engineering Center (SSEC) at
University of Wisconsin (UW). These
algorithms can be described as a pipeline
consisting of a set of modules including an
initial Fast Fourier Transform (FFT), Non-
linearity Correction, Radiometric
Calibration, Spectral Calibration, Instrument
Line Shape Correction and Spectral
Resampling as shown in Figure 3. These
modules are described below, but complete details can be obtained from Knuteson.
Initially, FFT operations are applied to
GIFTS data to convert the measured
interferograms into complex spectra. A
complex Fourier transform and data folding
is performed to convert the complex
interferograms to complex spectra
corresponding to a wave number scale. In
the second step, we apply the non-linearity
correction. This work is still under
investigation by SSEC, UW. It is expected
that the GIFTS detector material is highly
linear in the range of photon fluxes used, but
the electronics readout of the focal plane
array can introduce small signal non-
linearity. The non-linearity correction
algorithm is being designed using the ground-based AERI instrument as well as the
Scanning-HIS instrument. In the third step,
we perform radiometric calibration. This
step addresses the GIFTS instrument
requirement to measure brightness
temperature to better than 1 K, with a
reproducibility of 0.2 K. GIFTS uses views
of two on-board blackbody sources (300 K
and 265 K) along with cold space,
sequenced at regular programmable
intervals. The temperature difference
between the two internal blackbody views
provides the sensor slope term in the
calibration equation, while the deep space
view corrects for radiant emission from the
telescope by establishing the offset term.
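The slope/offset procedure described here can be sketched as a standard two-point calibration; the function below is an illustrative sketch under the assumption that the deep-space radiance is approximately zero, and the symbols and values are not GIFTS calibration constants:

```python
def calibrate(c_scene, c_bb_hot, c_bb_cold, c_space, b_hot, b_cold):
    # Slope (gain) from the temperature difference between the two
    # internal blackbody views (300 K and 265 K in the text):
    # known radiance difference divided by observed count difference.
    gain = (b_hot - b_cold) / (c_bb_hot - c_bb_cold)
    # Offset from the deep-space view, whose radiance is taken as
    # approximately zero; this removes the telescope's own emission.
    offset = -gain * c_space
    return gain * c_scene + offset
```

For example, with blackbody counts of 200 and 100 against known radiances 100 and 50, and a space-view count of 20, a scene count of 140 calibrates to a radiance of 60.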
This is followed by spectral calibration,
which uses ground-based and aircraft FTS
systems. The spectral characteristics of these
instruments are defined by an Instrument
Line Shape (ILS) and a spectral sampling
interval. The spectral sampling scale is
maintained very accurately by the stable
laser used to trigger sampling at equal
intervals of Optical Path Difference (OPD).
The next stage is to perform the instrument
line shape correction. It has been evaluated that this effect is negligible for the GIFTS
instrument, since it has an extremely small
range of angles contributing to each
individual detector pixel (< 1 mrad in the
interferometer). As a result, the ILS variation
is extremely small and could even be
ignored without introducing significant
errors. Once the spectral calibration is
determined we perform wave number
resampling. The GIFTS radiance spectrum can be resampled from the original sampling
interval to a standard reference wave
number scale. The resampling is performed
in software using FFT, zero padding, and
linear interpolation of an oversampled
spectrum. The results of the wave number
resampling operation are equivalent to
GIFTS spectra with a common wave number
scale independent of their location in the
focal plane array. Our benchmarking results
as presented in section 5 include the initial
Fast Fourier Transform (FFT), the
radiometric calibration and the spectral
resampling stages, since the other modules
are still in the development phase and not
yet available for use in the framework.
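The resampling chain just described (FFT, zero padding, and linear interpolation of an oversampled spectrum) can be sketched with NumPy; the oversampling factor, grids, and padding layout below are illustrative assumptions, not the GIFTS implementation:

```python
import numpy as np

def resample_spectrum(spectrum, native_grid, target_grid, oversample=4):
    # Oversample the spectrum by zero padding its (inverse-FFT)
    # interferogram, then FFT back to a denser spectral grid.
    n = len(spectrum)
    igm = np.fft.ifft(spectrum)
    padded = np.zeros(n * oversample, dtype=complex)
    half = n // 2
    padded[:half] = igm[:half]    # low-index samples stay at the front
    padded[-half:] = igm[-half:]  # wrapped samples stay at the back
    dense = np.fft.fft(padded)
    dense_grid = np.linspace(native_grid[0], native_grid[-1], len(dense))
    # Linearly interpolate the oversampled spectrum onto the common
    # reference wave number scale.
    return np.interp(target_grid, dense_grid, dense.real)
```

The output is a spectrum on the requested reference wave number scale, independent of the pixel's native sampling, as the text describes.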
III. GIFTS FRAMEWORK
ARCHITECTURE
In this section we propose an architectural
model for a robust and reliable operational
satellite data processing framework consisting of:
Active/Standby Master
Active/Standby Data Input Server
Active/Standby Data Output Server
Reference and Audit Database Servers
Workers
The master is responsible for cluster
management and task scheduling. The data
input server provides the real-time data
which will be retrieved by the workers while
the data output server gathers the results
produced by the workers. The reference
database server provides access to a
database storage unit which may contain
instrument factory parameters, science
algorithm specific parameters, algorithm
descriptions, and algorithm initialization
parameters. The audit database server
provides monitoring capability of algorithms
used, produced products and so forth. The
workers do the actual processing of the
science algorithms. A Front End Processor
(FEP) receives the data from
the satellite, processes the data into packets
and frames according to the CCSDS format description. The frames constitute
interferometer measurements, performance,
engineering and diagnostics data. The FEP
provides the input data to the data input
server and is not part of the framework.
Redundancy for the master, data input, and
data output servers is provided through an
active/standby mechanism. Parallel file
systems with server fail-over capability may
serve as input and output servers. The
databases, consisting of the reference
database and the audit database, are
envisioned to consist of their own
commercially available database clusters
such as the MySQL Cluster for fault
tolerance purposes. The master is
responsible for cluster management and task
scheduling. A task is a unit of
work which contains its own set of data,
initialization parameters, reference id,
timestamp, and so forth. Each task has a unique identifier which differentiates it from
other tasks within the system. Tasks may
also be prioritized in which case tasks with
higher priorities will be executed before low
priority tasks. The data input server retrieves
the new incoming data, packages them
appropriately and assigns unique identifiers
to tasks encapsulating the new data sets. The
identifiers for the new tasks produced by the
data input server are sent to the master, which in turn schedules the new tasks to be
assigned to workers. The master may also
define scheduling policies in a way such that
certain tasks are assigned to specific
workers. We refer to this as selective
scheduling. Such policies are beneficial in
cases where specific tasks need
initialization parameters that reside
on specific workers and would be
computationally time consuming to
recompute. By assigning such tasks to
the same worker or the same group of
workers and using application-aware caches,
a large amount of computation may be saved
resulting in higher performance. All
communication between the various servers
may be performed asynchronously to
overlap computation with communication.
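Selective scheduling of this kind can be sketched as an affinity map from a task's initialization-parameter key to the worker that already holds those parameters; the class and field names below are hypothetical, not from the framework:

```python
class SelectiveScheduler:
    def __init__(self, workers):
        self.workers = list(workers)
        self.affinity = {}  # param_key -> worker that cached the parameters
        self.rr = 0         # round-robin cursor for previously unseen keys

    def assign(self, param_key):
        # Tasks sharing expensive initialization parameters are routed
        # to the worker that already computed and cached them.
        worker = self.affinity.get(param_key)
        if worker is None:
            worker = self.workers[self.rr % len(self.workers)]
            self.rr += 1
            self.affinity[param_key] = worker
        return worker
```

A second task with the same parameter key lands on the same worker, so the cached initialization is reused instead of recomputed.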
A task assigned to a worker by the master
may not complete on time for various
reasons such as network or worker node failure. A task is considered completed by
the master when the task identifier is
returned to the master by the worker. To
implement task and data redundancy
mechanisms, the task execution time, also
referred to as task latency, has to be known
in advance. Specifying task latency statically
is infeasible for most practical parallel
applications. In the current design, task
execution times are estimated dynamically
based on previous actual task execution
times and an over-estimation factor. The
over-estimation factor for each task can be
calculated as the inverse ratio of its
estimated and actual execution time. The
estimated execution time may be computed
by taking the average of a certain number of
previous task execution times multiplied by
an over-estimation factor. The over-estimation factor may or may not be a
constant.
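The dynamic latency estimate described above can be written out directly; the window size and over-estimation factor below are illustrative defaults, not values from the paper:

```python
def estimate_latency(previous_times, over_estimation=1.5, window=10):
    # Average the most recent actual task execution times and inflate
    # the result by an over-estimation factor, as described in the text.
    recent = previous_times[-window:]
    return over_estimation * sum(recent) / len(recent)
```

For instance, with previous execution times of 2.0 s and 4.0 s and a factor of 1.5, the estimated latency is 4.5 s.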
A task which has not been completed within
its estimated execution time is considered
lost by the master and will be reassigned to a
worker requesting tasks. The master keeps
two queues of tasks, one for new tasks
which have not yet been assigned to
workers, and one for assigned tasks. When the master schedules a task to be assigned to
a worker process, the task at the head of the
queue for new tasks is retrieved and
assigned to an appropriate worker. This task
is then moved to the queue of assigned
tasks. When the master receives a
notification from the worker that the
respective task has been computed, the task
is removed from the queue of assigned tasks.
However if this task is not completed within
its estimated execution time, it is considered
lost. Therefore the task is removed from the
queue of assigned tasks and is placed at the
head of the queue of new tasks to be
reassigned to a new worker. The worker
which was initially assigned the respective
task is considered faulty and will be
removed from the list of available workers.
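The two-queue bookkeeping just described can be sketched as a minimal in-memory model; the real framework exchanges MPI messages rather than method calls, and the names here are hypothetical:

```python
import time
from collections import deque

class Master:
    def __init__(self):
        self.new_tasks = deque()  # tasks not yet assigned to workers
        self.assigned = {}        # task_id -> (worker, estimated deadline)
        self.faulty = set()       # workers considered failed

    def submit(self, task_id):
        self.new_tasks.append(task_id)

    def schedule(self, worker, latency):
        # Pop the task at the head of the new-task queue and move it
        # to the assigned queue with its estimated completion deadline.
        task_id = self.new_tasks.popleft()
        self.assigned[task_id] = (worker, time.monotonic() + latency)
        return task_id

    def complete(self, task_id):
        # Worker returned the task identifier: the task is done.
        self.assigned.pop(task_id, None)

    def reap_lost(self):
        # A task past its estimated execution time is considered lost:
        # re-queue it at the head and mark its worker as faulty.
        now = time.monotonic()
        for task_id, (worker, deadline) in list(self.assigned.items()):
            if now > deadline:
                del self.assigned[task_id]
                self.faulty.add(worker)
                self.new_tasks.appendleft(task_id)
```

Lost tasks go to the head of the new-task queue so they are reassigned before fresher work, matching the recovery behaviour described above.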
IV. GIFTS PROTOTYPE
IMPLEMENTATION
The prototype system includes an HP Linux
cluster consisting of 32 dual core AMD
Opteron DL145 compute nodes with 1 GB
RAM per CPU and 4 dual core AMD
Opteron DL385 management nodes with 1
GB RAM per CPU. The system runs Red
Hat Linux Enterprise Server for AMD64 and
has a Gigabit Ethernet as well as Myrinet
interconnect for the entire system. HP
Serviceguard is used to provide failover
capability for the master node. HP
Serviceguard monitors the health of each
master node and rapidly responds to failures
in a way that minimizes or eliminates
application downtime. This system is also
supplemented by an HP Storage Works
Scalable File Share (HP SFS) which is used
to access data in parallel from all compute
nodes. It has 16 terabytes of usable storage and provides a bandwidth of 1064 Mbytes/s
for READ and 570 Mbytes/s for WRITE. It
uses the Lustre parallel file system which is
one of the most extensively used parallel file
systems.
Framework Implementation
The framework is implemented using the
Message Passing Interface (MPI) and the
C++ programming language. Since there are
many implementations of MPI, we evaluated
various implementations such as Open MPI,
MPICH-2, and LAM-MPI. We finally
chose the MPICH-2 implementation
considering its support for the MPI-2 standard.
Our framework prototype is a complete
implementation of all the various
components as described in section 3 except
for the FEP. For the current framework
implementation, the FEP was not used to
provide the data; rather the data was directly
fetched from the input server. Theframework separates the science algorithm
layer from the cluster management layer and
provides an operational platform within
which algorithm software pipelines can be
deployed for satellite data processing. There
are classes and APIs for each service such
as master, input delivery, output delivery,
reference database, audit database, and
workers. The system can be configured at
startup using configuration files wherehostnames may be specified for the various
worker nodes, reference and audit database
and so forth. Once the master finishes
initialization, it starts the workers and
communicates a unique identifier to each
worker. Once connection is established, if
there is an outstanding task to be processed,
it will be assigned to a worker by the master.
In case a worker is removed at run-time, its
tasks are reassigned to other workers as
described in section 3. In case of a master
failover, all worker-master connections are
lost and the worker processes are cleaned
up. The failover process from standby to
active involves reading the last check-point
file written by the active unit, re-creating the
system state accordingly and failing over
from standby to active unit. The standby
master takes over and creates new worker
processes as well as redistributes tasks to the
workers. The framework provides algorithm
independence through a set of base classes
and interfaces for the algorithmic tasks.
Each independent work unit is encapsulated
in the Task class containing fields for start
time, completion time, compute time, unique
task identifier, algorithm identifier, number
of pixels, input and output data sizes, and
other initialization parameters common to
most satellite data processing algorithms we
have evaluated so far. A task is started as soon as the master schedules it for a specific
worker and is considered completed as soon
as the master receives a notification from
worker that the respective task has been
completed. Hence the completion time for
each task includes communication, read,
write, and computation time.
The compute time stored in the task contains
the actual computational time for the task as
observed by the worker. Each task has a unique task identifier to differentiate it from
other tasks. Each task contains a field for an
algorithm identifier specifying the
algorithm(s) to be applied to the
data. Each task contains references to its own
set of input and output data files. Workers
execute a task simply by calling the
execute() method on an instance of the Task
class. This way the science algorithm layer
is clearly separated from the framework
layer.
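The separation between the framework layer and the science algorithm layer can be sketched with a base class of this shape; the field names follow the description above, while the class names and the FFT task body are hypothetical placeholders:

```python
import abc

class Task(abc.ABC):
    def __init__(self, task_id, algorithm_id, num_pixels):
        self.task_id = task_id            # unique task identifier
        self.algorithm_id = algorithm_id  # which algorithm(s) to apply
        self.num_pixels = num_pixels
        self.start_time = None            # set when the master schedules it
        self.completion_time = None       # set when the master is notified
        self.compute_time = None          # as observed by the worker

    @abc.abstractmethod
    def execute(self):
        """Run the encapsulated science algorithm on the task's data."""

class FftTask(Task):
    def execute(self):
        # Placeholder body: a real worker would apply the FFT stage to
        # this task's input file and write the resulting spectra.
        return f"fft applied to {self.num_pixels} pixels"
```

A worker only ever calls `execute()` on a `Task` instance, so it needs no knowledge of which science algorithm is inside.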
V. RESULTS
In this section we investigate our framework
prototype in terms of performance and
reliability. The framework prototype is
evaluated using GIFTS science algorithms
as described in section 2. Our
benchmarking results show end-to-end
performance using simulated data cubes as
well as the effect of task size on the
performance.
A. Performance
Our benchmarking results include the initial
FFT, the radiometric calibration and the
spectral resampling stages, since the other
modules are still in the development phase
and have not yet been released. For the
benchmarking, a total of 64 data cubes are used. This provides a sufficient amount of
work to each worker to hide
any anomalies. Each data cube has a total of
128 x 128 pixels. In the first experiment we
have used a constant task size of 64 data
cubes and shown the execution time, speed
up and efficiency for the GIFTS pipeline.
Two spectral ranges for the datasets are
used: the long wave infrared band (685-1129
cm-1) and the short-medium wave band
(1650-2250 cm-1). Figure 4 shows the total
execution time versus number of processors
using the end-to-end GIFTS pipeline
processing. Figure 5 shows the speedup
versus number of processors for the same
experiment. It is shown that we can achieve
a near-linear speedup of 54.56 compared to an
ideal speedup of 64. Figure 6 shows the
efficiency versus number of processors. It can be seen that we are able to maintain an
efficiency of 85 percent throughout the
experiment. In the second experiment, we
evaluated the effect of network bandwidth
by increasing the task size in terms of
number of pixels per task. This resulted in
a reduced number of messages between the
workers and the master, which in turn
reduced the worker-master communication.
This experiment was conducted using a total
of 64 data cubes running on 64 processors.
The experiment was conducted using tasks
containing 128, 256, 1024, and 2048 pixels.
Figure 7 shows the total execution time
versus task size. Increasing the task size
from 128 pixels per task to 2048 pixels per
task resulted in an increase of the total
execution time of about 0.7 percent. Hence
the effect of task size on the total execution
time is negligible. It was observed that the
total amount of computation time remained
constant, as the total work remained
constant. Figure 8 shows the effect of task
size on READ and WRITE times. For a task
size of 128 pixels per task, blocks of 128
pixels were used to perform READ and
WRITE, whereas for task sizes greater than
or equal to 256 pixels per task, blocks of 256
pixels were used. In the third experiment,
we investigated the effect of the bandwidth
of the parallel file system to determine the
optimum number of pixels which should be
used for performing I/O to obtain the
maximum speedup. In this experiment we
kept the task size to a constant value of 1024
pixels per task while increasing the block
size to perform bulk I/O. Block sizes of 32,
64, 128, 256, 512, 1024 pixels were used.
Figure 9 shows the total execution time
versus block size in pixels. It can be seen
that an optimal value of a block size of 256
pixels results in the least execution time. Figure 10 shows the effect of bulk I/O time
versus block size in pixels. An optimal value
of a block size of 256 pixels is obtained for
READ while the value of block size has
negligible effect on WRITE.
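The speedup and efficiency figures quoted above follow from the usual definitions, sketched here:

```python
def speedup(t_serial, t_parallel):
    # Ratio of single-processor time to parallel time.
    return t_serial / t_parallel

def efficiency(s, num_processors):
    # Fraction of ideal (linear) speedup actually achieved.
    return s / num_processors

# The reported speedup of 54.56 on 64 processors corresponds
# to an efficiency of roughly 85 percent.
print(round(efficiency(54.56, 64), 4))
```

This is why the 54.56 speedup and the roughly 85 percent efficiency in Figures 5 and 6 are two views of the same measurement.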
FIGURES: (Figures 4-10 are not reproduced in this transcript.)
B. Reliability
In this section, we evaluated the reliability
of the framework in case of worker failures
and their impact on the performance. A task
assigned to a worker by the master may not
complete on time due to a worker node or worker process failure. In our
framework an estimated execution time is
computed by taking the average of a certain
number of previous task execution times
multiplied by an over-estimation factor as
described in section 3. A task which has not
been completed within its estimated
execution time is considered lost by the
master and will be reassigned to another
worker requesting tasks. The worker which
was initially assigned the lost task is
considered faulty and is removed from the
list of available workers. It is assumed that
an operational system will have a safety
margin to ensure the availability of the
minimum required number of worker
processes at all times. To investigate the
capability of the framework in case of
worker failures, artificially induced faults
are introduced in the system. Worker
processes are shut down according to an
exponential distribution after they are
assigned tasks. An example of the worker failure distribution during the course of one
of the experiments is shown in figure 11. In
our experiments, we are assuming that the
required number of worker processes is 32
with a safety margin of 100 percent. Hence,
starting from 64 worker processes, workers
are shut down until only half of the initial processes are
available. We verified that the master can
successfully identify failed workers and
reassign tasks to available worker processes.
Figure 12 shows the speedup for a series of
experiments where the speedup varied from
about 38 to 53. A mean value of 44.88 is
obtained for the speedup as shown in figure
12. This value may be compared to the
speedup of 54.56 shown in figure 5 without
worker failures.
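The fault-injection scheme above can be sketched with exponentially distributed inter-failure intervals; the mean interval and the seed below are illustrative choices, not values from the experiments:

```python
import random

def failure_times(num_failures, mean_interval, seed=42):
    # Draw inter-failure intervals from an exponential distribution
    # and accumulate them into absolute worker-shutdown times.
    rng = random.Random(seed)
    times, t = [], 0.0
    for _ in range(num_failures):
        t += rng.expovariate(1.0 / mean_interval)
        times.append(t)
    return times

# E.g. schedule 32 worker shutdowns (half of 64 workers) with a
# hypothetical 10-second mean interval between failures.
schedule = failure_times(32, 10.0)
```

Each scheduled time marks when one worker process is shut down after being assigned a task, which is how the experiments force the master's lost-task recovery path to run.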
FIGURES: (Figures 11 and 12 are not reproduced in this transcript.)
VI. CONCLUSIONS AND FUTURE
WORK
In this paper we have proposed a highly
robust and reliable framework for the
distributed real-time processing of satellite
data for a GIFTS ground system.
Furthermore, we presented an architectural
model for providing performance, reliability,
and scalability of candidate hardware and
software for such a framework. It has been
shown that considerable performance can be
gained for GIFTS science algorithms
without sacrificing the reliability and high
availability requirements for the operational
system. Future work involves the processing
of GIFTS science algorithm pipeline using
real thermal vacuum data. Depending on the
flight of the GIFTS instrument, the
framework prototype as shown in this paper
needs to be transitioned into an operational
environment requiring data to be processed in near real-time. This requires further
scalability studies in terms of number of
processors needed to keep up with the
GIFTS data rate.
BIBLIOGRAPHY
The following web links, accessed via the
website www.osun.org, were used in
preparing this term paper:
http://vecpar.fe.up.pt/2006/programme/papers/36.pdf
http://www.idosi.org/wasj/wasj5(2)/14.pdf
http://comet.lehman.cuny.edu/vpan/pdf/lpieeetc01_184.pdf
http://www.cs.tufts.edu/~jacob/papers/supercomputing.deligiannidis.pdf
http://www.cs.ru.ac.za/research/Groups/vrsig/pastprojects/001parallelanddistributedvr/paper02.pdf
http://www.dhpc.adelaide.edu.au/reports/108/dhpc-108.pdf
http://ams.confex.com/ams/pdfpapers/69620.pdf