tp parallel processing applications
TRANSCRIPT
-
8/7/2019 tp PARALLEL PROCESSING APPLICATIONS
1/39
TERM PAPER
OF COMPUTER ORGANIZATION & ARCHITECTURE
CSE (211)
TOPIC: PARALLEL PROCESSING APPLICATIONS
Submitted to: MR. SUMIT MITTU
Submitted by: ASHISH KUMAR
ROLL NO-RG1901A65
SECTION-G1901, GROUP-1
REG.NO-10800114
TABLE OF CONTENTS
INTRODUCTION TO PARALLEL PROCESSING
VARIOUS APPLICATION AREAS OF PARALLEL PROCESSING
MATRIX OPERATIONS
VIRTUAL REALITY SYSTEM
A ROBUST FRAMEWORK FOR DISTRIBUTED
PROCESSING OF GIFTS HYPERSPECTRAL DATA
BIBLIOGRAPHY
INTRODUCTION TO PARALLEL
PROCESSING
Parallel processing is the ability to
carry out multiple operations or tasks
simultaneously. The term is used in
the contexts of both human
cognition, particularly in the ability
of the brain to simultaneously
process incoming stimuli, and in
parallel computing by machines.
PARALLEL PROCESSING IN
COMPUTERS
The simultaneous use of more than
one CPU or processor core to
execute a program or multiple
computational threads. Ideally,
parallel processing makes programs
run faster because there are more
engines (CPUs or cores) running it.
In practice, it is often difficult to
divide a program in such a way that
separate CPUs or cores can execute
different portions without interfering
with each other. Most computers
have just one CPU, but some models
have several, and multi-core
processor chips are becoming the
norm. There are even computers with
thousands of CPUs.
With single-CPU, single-core
computers, it is possible to perform
parallel processing by connecting the
computers in a network. However,
this type of parallel processing
requires very sophisticated software
called distributed processing
software.
Note that parallelism differs from
concurrency. Concurrency is a term
used in the operating systems and
databases communities which refers
to the property of a system in which
multiple tasks remain logically active
and make progress at the same time
by interleaving the execution order
of the tasks and thereby creating an
illusion of simultaneously executing
instructions. Parallelism, on the other
hand, is a term typically used by the
supercomputing community to
describe executions that physically
execute simultaneously with the goal
of solving a problem in less time or
solving a larger problem in the same
time. Parallelism exploits
concurrency.
Parallel processing is also called
parallel computing. In the quest for
cheaper computing alternatives,
parallel processing provides a viable
option. Idle processor
cycles across a network can be used
effectively by sophisticated
distributed computing software.
VARIOUS APPLICATION
AREAS OF PARALLEL
PROCESSING
MATRIX
OPERATION
Performance Analysis and
Evaluation of Parallel Matrix
Multiplication Algorithms:-
Multiplication of large matrices requires a
lot of computation time, as its complexity is
O(n³). Because most current applications
require higher computational throughputs
with minimum time, many sequential and
parallel algorithms have been developed. In this
paper, a theoretical analysis for the
performance and evaluation of the parallel
matrix multiplication algorithms is carried
out. In addition, an experimental analysis is
performed to support the theoretical analysis
results. Recommendations are made based
on this analysis to select the proper parallel
multiplication algorithms. Matrix
multiplication is commonly used in the areas
of graph theory, numerical algorithms,
digital control, digital image processing and
signal processing. Multiplication of large
matrices requires a lot of computation time,
as its complexity is O(n³), where n is the
dimension of the matrix. Because most
current applications require higher
computational throughputs, many algorithms
based on sequential and parallel approaches
were developed to improve the performance
of matrix multiplication. Even with such
improvements, for example, Strassen's
algorithm for sequential matrix
multiplication has shown a limitation in
performance. For this reason, parallel
approaches have been examined for decades.
Most parallel matrix multiplication
algorithms use matrix decomposition that is
based on the number of processors available.
This includes the systolic algorithm,
Cannon's algorithm, Fox and Otto's
algorithm, PUMMA (Parallel Universal
Matrix Multiplication), SUMMA (Scalable
Universal Matrix Multiplication) and
DIMMA (Distribution Independent Matrix
Multiplication). Each of these
algorithms decomposes the matrices
into sub-matrices. During
execution, each processor calculates a
partial result using the sub-matrices that are
currently accessed by it. It successively
performs the same calculation on new
sub-matrices, adding the new results to the
previous ones. When the multiplication
sub-process is completed, the root processor
assembles the partial results and generates
the complete matrix multiplication result.
In order to make a proper selection for a
given multiplication operation and to decide
which algorithm generates the highest
throughput in the minimum time, a
comparison analysis and a performance
evaluation of the above-mentioned
algorithms is carried out using the same
performance parameters and based on
parallel processing. Moreover, these
algorithms are implemented in the C++
programming language.
PARALLEL COMPUTING
PARADIGMS
The parallel computing process depends on
how the processors are connected to memory.
The connection can be classified as
shared memory or distributed memory;
each of these two types
is discussed as follows:-
Shared Memory System: In such
a system, a single address space exists,
within which every memory location is given a
unique address and the data stored in
memory are accessible to each processor.
One processor may read data written by
another. Therefore, in order to enforce
sequential consistency, it is necessary to use
synchronization.
OpenMP is one of the popular
programming models for
shared memory systems. It provides a
portable, scalable and efficient approach to
running parallel programs in C/C++ and
FORTRAN. In OpenMP, a sequential
program can be parallelized
with compiler directives such
as #pragma omp in C and $OMP in
FORTRAN, together with library support.
Distributed Memory System: In
such a system, each processor has its own
memory and can only access its local
memory. The processors are connected with
other processors via a high-speed
communication network. Processors
exchange information with one another
using send and receive operations. A
common approach to programming
multiprocessors is to use message-passing
library routines in addition to a conventional
sequential program. MPI (Message Passing
Interface) is well suited to distributed
memory systems since it provides a widely
used message-passing standard. It
provides a practical, portable, efficient and
flexible standard for message passing. In
MPI, data is distributed among processors;
no data is shared, and data is
communicated by message passing.
THEORETICAL ANALYSIS
In order to evaluate the performance of any
matrix multiplication based on using parallel
processors and different algorithms, a
theoretical analysis is carried out based on
the following assumptions:
f = number of arithmetic operation units
tf = time per arithmetic operation
c = number of communication units
tc = time per communication unit
q = f / c = average number of flops per
communication access
Minimum possible time = f * tf, when there is
no communication
Efficiency (speedup): SP = q * (tf / tc)
Total time = f * tf + c * tc = f * tf * (1 + (tc/tf) * (1/q))
So, a larger q indicates that the time is closer
to the minimum f * tf, thus increasing the
speedup and the efficiency of the algorithm.
SP must be close to the number of processors
to achieve the highest efficiency. Using the above
assumption and with reference to the
information in Table(1) we can obtain f, c
and q for each parallel matrix multiplication
as shown in Table(2).
EXPERIMENTAL RESULTS
Table (4) shows the experimental results
obtained by implementing each algorithm
100 times; the fastest five runs are then
taken and averaged.
CONCLUSION AND FUTURE
WORKS
A theoretical analysis of the performance of
the seven most used algorithms is carried out
using one and four processors. This analysis
has shown that the systolic algorithm is
the best, producing
the highest efficiency, followed by
PUMMA, DIMMA and
then SUMMA. In addition, these algorithms are
implemented and run on one and four
processors to evaluate their performance for
a matrix multiplication. The experimental
results match the theoretical ones.
This analysis is useful for making a proper
recommendation for selecting the best
algorithm among the candidates in future
work.
VIRTUAL REALITY
SYSTEM
Application of Parallel Processing to
Rendering in a Virtual Reality System:-
The rapid rendering of images is pivotal in
all applications of computer graphics, and
perhaps
most of all in the field of virtual reality.
Research suggests that there is a limit to the
improvement in performance that can be
attained by the selection of the appropriate
algorithms
and their careful optimisation. Hardware
implementation increases expense and limits
the
flexibility of the equipment. A relatively
unexplored option is the use of parallel
processing in
image rendering, especially image rendering
in virtual reality systems.
This paper examines the application of
parallel processing to image rendering in a
virtual
reality system. The requirements of a virtual
reality system are noted. A number of
different
parallelisation techniques are proposed, and
each analyzed with regard to its
effectiveness in
meeting these requirements. Performance
values, measured from an implementation of
each
technique on a transputer architecture, are
used to highlight the advantages and
disadvantages of
each technique.
Rendering for a virtual reality
system:- Rendering is a well understood
application. Many systems have been
created in which three-dimensional scenes
have to be rendered. The use of parallelism
for rendering has also been explored.
However, previous research has not placed
the emphasis on rendering scenes in real-
time. In particular, little work has been done
on using parallel processing when rendering
for virtual reality applications. Virtual
reality applications are different in that the
images must be rendered very rapidly, in
real time, and that the delay between a user
input and the corresponding change in the
rendered output, the latency, must be small.
In working virtual reality systems, frame
rates of 6 frames/second and a latency of
less than 200 milliseconds are suggested as
minimum requirements for the illusion of
reality. These two factors, the speed of
rendering and the latency period, are used as
the criteria for evaluating the parallelisation
techniques explored in this paper.
Rendering on a single processor: -
Renderers are typically constructed by
creating a pipeline of several standard
operations. These operations include
clipping, transformation, projection and
hidden-surface removal. The description of
the scene to be displayed is fed into the
pipeline which eventually produces a
sequence of primitives (two dimensional
lines and polygons, for example) to be
drawn. The pipeline typically ends with
some routines capable of producing a visual
representation of these primitives. The
renderer described in this paper is modelled
as three stages. The first selects those
objects in the virtual world which will be
visible on the screen. The next stage
transforms the objects to screen coordinates.
The last stage performs hidden-surface
removal, clips the objects to the dimensions
of the screen and renders each primitive.
This level of breakdown is adequate for the
parallelisation used which concentrates more
on parallelisation inside each stage, rather
than between stages.
Parallelisation of the graphics
pipeline: - A number of different parallel
decomposition strategies for polygon
rendering are described in a paper by
Whitman. Four decomposition strategies
were implemented, and are described in the
following sections. The various strategies
were implemented on a transputer-based
architecture. However, in most cases, the
techniques are applicable to other
architectures. The hardware platform on
which the parallelisation was carried out was
a MIMD architecture consisting of a cluster
of 32 T800 transputers. Each transputer is a
reasonably powerful processor in its own
right, capable of running a number of
concurrent processes. The addition of four
communication links allows it to transfer
data to and from other transputers. Each
transputer has its own private memory and
all synchronisation and data transfer occurs
through the links. A single bidirectional link
can transfer a theoretical maximum of 2.35
Mbytes/s, or 1.74 Mbytes/s if
communication is only in one direction. A
set of standard test data was used to obtain
timing data from the various
implementations. The speed of the rendering
system on a single processor is given in
Table 1. The simple test world contains 30
objects amounting to about 250 polygons
(each with 4 or more sides). The complex
world consists of a single object containing
about 1650 triangular polygons.
Pipeline parallelism
The most obvious strategy when confronted
with the graphics pipeline is to place each
component of the pipeline on a separate
processor. The connections between
components can be constructed using
transputer links. Two factors are worthy of
consideration when setting up a pipeline
configuration for virtual reality on a
transputer architecture. Firstly, greater
bandwidth is obtained from the transputer
links when larger messages are transmitted.
Protocol overhead is less, and there is less
delay waiting for each of the communicating
processors to synchronise. The few bytes
representing a point are not worth
transmitting alone; it is better to wait until a
number of points are ready to be transmitted
and then send them all together. For
convenience, all the data corresponding to
an entire object is sent in one packet. The
second consideration is the latency problem,
the delay between input and the
corresponding output. Lots of data is held in
long pipelines, even with a limited amount
of data at each node. All of this must be
processed before new data entering the node
can have any effect on the image being
displayed. Since there is a certain overhead
associated with the communication between
pipeline components, the delay between data
entering the pipeline and it reaching the
output increases with pipeline length, even
though the rate at which images are
produced is increased.
This may be illustrated by an example:
Consider a job which on a single processor
would accept a datum and take 10 seconds
to process it before outputting the result.
Assume this job is capable of being
parallelised linearly. A pipeline in which
each node shares equally in the workload is
illustrated in Figure 1. The data is arriving
as fast as possible, but will take 1 second to
transmit across a link. With 5 processors the
job can be divided into a pipeline of 5
parallel components, and so each will work
on the datum for 2 seconds. Thus results can
be output every 2 seconds. The time
between a given datum being input and the
corresponding output is at least 16 seconds.
With 10 processors, output occurs once
every second but at least 21 seconds pass
between input and corresponding output.
In general:
let N be the number of processors,
let T be the time for the job on the single
processor system and
let C be the communication time between
processors. A result is then output every T/N
seconds. The latency is at least (N+1)C + T seconds.
The latency increases with the number of
processors for non-zero communication
time. For zero communication time it
remains constant, independent of the data
output rate. For this reason pipeline
parallelism is better suited to systems that
require computationally intensive rendering
with limited user interaction. A limited
rendering pipeline with three nodes was
implemented. The processor layout is shown
in Figure 2. Timing figures are given in
Table 2. The differences in speedup at
different screen resolutions are due to
change in the load balance. The renderer has
to do more work at higher resolutions and so
acts as a bottleneck.
Coarse grained parallelism
The coarse grained parallel decomposition
strategy renders successive frames on
separate processors. The transputer network
is arranged to allow a number of
independent pipelines to be created, as
shown in Figure 3. Each pipeline renders
one frame in memory which is then
transferred directly to the graphics node.
This technique promises the greatest
speedup since it requires no load balancing.
Each processor can start work on another
frame as soon as the current one is finished.
The load is also approximately equal from
one frame to the next, so they are likely to
stay in step. There is thus no waiting for
other processors to catch up. The delay
between input and output is still the time
taken for the pipeline to render the frame.
No decrease in this time occurs even when
the number of parallel pipelines is increased.
The latency is thus constant and independent
of the number of processors used. Values for
the speedup obtained from an
implementation of this technique are shown
in Figure 4. As may be seen, linear speedup
is obtained for most of these examples. The
only factor preventing linear speedup in all
cases is the bandwidth restriction on the
links to the graphics node. With each link
capable of 1.7 Mbytes/second and four links,
this allows a maximum of 26.6
frames/second at 512 x 512 resolution or
108 frames/second at 320 x 200 resolution.
The example with the simple world at 512 x
512 is using about half of the total
bandwidth available. There is a general
tendency for a drop in the speedup when
more than four pipelines are used, especially
in high traffic situations. This tends to occur
with all the parallelism strategies tried. It
can be ascribed to the necessary doubling of
the traffic on one of the links to the video
controller card.
Fine grained parallelism
The pipeline approach uses a number of
transputers connected in series. It has the
disadvantage of high latency. A more
suitable parallel decomposition for
interactive graphics applications would have
the processors connected in parallel, and all
working on the same frame. In this way the
delay between input and corresponding
output would be shortened. A profile of the
graphics routines running on a single
processor reveals that the bulk of the effort
is spent on the stage involving hidden
surface removal and polygon rendering.
Thus the bulk of the effort in parallelisation
has gone into decreasing the time spent on
this stage. The other stages of the graphics
pipeline are left as a short pipeline on
separate processors. The fine grained
parallelisation technique is an image space
pixel based parallel decomposition strategy.
The display area is partitioned into
horizontal strips, and the hidden surface
removal and rendering for each strip occurs
on a separate processor. The strips are
rendered in memory and transferred to the
graphics node which fits each block into the
correct area of screen memory. The amount
of data that is transmitted to the video
controller node is the same as with the
coarse grained parallelism. The messages
are shorter but more frequent. The processor
layout used to implement this technique is
shown in Figure 5. The shaded processors
represent one possible graphics pipeline for
rendering a single strip. The transformation
stage can be parallelised in the same
manner, with transformations restricted to
those objects appearing on any particular
strip. However, in practice most objects
appear on every strip anyway and so this
does not result in any significant saving. The
transformation stage will be placed on a
single processor to simplify the model.
Consider a simplified version of the graphics
pipeline that consists of K stages, with each
stage taking time T to complete its task.
By using N processors on one of the stages
and assuming linear speedup, the time for
that stage will be reduced to T/N. The total
time taken to traverse the pipeline is (K -
1)T + T/N. The latency decreases as the
number of processors is increased. The
restrictions given above can be relaxed, and
as long as speedup is obtained, latency will
decrease.
The speedup values obtained from
implementations of this strategy are shown
in Figure 6. The times measured are for the
entire graphics pipeline, and not just the
hidden surface removal and polygon
rendering. This latter part of the pipeline
does most of the work however, and controls
the overall speed. This technique does not
produce a large speedup, nor does the
speedup increase very much with the
number of processors used. Two factors are
found to contribute to these effects. When
rendering the horizontal strips, the polygons
that fall partially outside these strips must be
clipped. As the number of strips increases,
each strip becomes smaller and the number
of polygons needing to be clipped increases.
In one test with the screen divided into 4
strips the time taken for clipping increased
from about 25% of the time taken for
rendering to about 500% of the rendering
time. The rendering time did decrease by a
factor of 4 due to the parallelism, but the
clipping overhead was still significant. The
clipping is a significant factor with the
single object in the complex world with its
many polygons and limits the speedup the
most for this case. The second factor
involves load balancing. The picture is not
always evenly distributed over the screen.
Sometimes all the objects cluster in one
region of the screen. In this case one
processor is required to render most of the
objects while others render almost nothing.
This situation is almost equivalent to a
single pipeline. The effect of this can be
seen in the speedup values for the complex
world with 512 x 512 resolution. The single
object occupies the middle half of the
screen. Thus for three pipelines, the
processors rendering the top and bottom
slices have little work to do and the
processor rendering the middle slice does
most of the rendering. This explains the drop
in speedup over the two pipeline version
where the load is more evenly balanced. The
object fills the screen in the 320 x 200
resolution and clipping limits the speedup in
this case. Alternative techniques for
distributing sections of the display area over
N processors to improve the load balancing
include:
Placing successive scan-lines on different
processors.
Dividing the screen into N² horizontal
slices and placing successive slices on
different processors.
These alternatives, however, incur greater
overheads. The second alternative would
result in more clipping, making it more
expensive. The first alternative proved
possible to implement with surprisingly few
alterations to the rendering routines. Tests of
the method show a more consistent tendency
to increase in speed as the number of
processors is increased, as opposed to the
more erratic behaviour evidenced by the
original routine. This can be attributed to the
improved load balancing. Despite the
improvement in scaling, the added
complexity results in the improved
algorithm running about 10% slower than
the original. The frame rates obtained with
these methods were not high enough for
bandwidth limitations to be a constraint.
Object-based parallelism
An object space parallelisation strategy was
designed in which rendering of different
objects occurs on different processors. For
the case of N processors, the objects are
sorted into N groups based on their depth.
This is easily done as part of the depth sort
involved in the hidden object removal
routine. Each group is then rendered on one
of the processors. The images produced are
sent to the graphics node where the images
are combined. The combination technique is
similar to the Painter's algorithm, used for
hidden surface removal, where images
corresponding to nearer objects are painted
over those belonging to more distant ones.
The transputer has a specialised block move
instruction that allows this to be done
relatively quickly and easily. Both the
transformation and hidden surface/polygon
rendering stages can be done in parallel with
this method. The transputer network used to
implement this strategy is shown in Figure
7. The shaded nodes show the graphics
pipeline for one group of objects. This
parallelisation strategy has one immediate
disadvantage. If the scene consists of only
one object, then there is little point in using
more than one processor. Consequently a
world containing 16 objects, each with 276
triangles was used instead of the complex
world used for previous tests. Values
obtained for the speedup for an
implementation of this strategy are shown in
Figure 8. The bandwidth limitation is
particularly severe in this case. In the case
where the number of processors used is less
than five, an entire image needs to be sent
across each of the transputer links. This
places an upper limit on the frame rate of 6.6
frames per second for 512 x 512 resolution
or 27 frames per second for 320 x 200
resolution. The speedup values tend to level
off as this barrier is reached. Adding a fifth
processor only exacerbates the situation. A
further problem with this technique is load
balancing. Some objects are more complex
than others, and just distributing equal
numbers of objects may result in some
processors having to do more work. A finer,
more balanced approach would be to
distribute on a polygon basis. This could be
implemented relatively easily. However,
such an approach is not worth exploring
until higher bandwidth systems are
available.
The parallelism would also suffer when the
number of visible objects is less than the
number of processors. Polygon distribution
will be valuable for this situation as well.
The simplified version of the graphics
pipeline used in the previous latency
calculation, gives the total time taken to
traverse the pipeline as (K - 2)T + 2T/N,
since two stages of the pipeline are
parallelised. The latency is less than that for
the fine grained decomposition and also
decreases as the number of processors is
increased.
Parallel rendering in practice
With each of the techniques implemented,
analysis of run-time profiling information
showed that processor time was being
wasted in waiting for other processors to
catch up. Processors were blocking, waiting
for others to communicate. This problem
was overcome by buffering incoming
messages, enabling the process sending the
data to continue with other processing. A
notable speedup in the frame rate was
achieved. This solution had a corresponding
disadvantage. Adding extra processes on the
communication links has the effect of
lengthening the pipeline. This increases the
latency. The size of the buffers on each link
also affects the latency adversely. In order to
get the maximum benefit from this
technique the size of the buffers should be
related to the complexity of the scene. Small
buffers are better suited for scenes with a
few complex objects, while larger ones
smooth over the delays for large numbers of
simple objects.
Conclusions
This paper describes a number of different
parallelisation strategies that were tested in
an attempt to increase the rendering speed
and decrease the interaction latency in
virtual reality systems. The strategies
explored tended to exhibit a tradeoff
between speedup and latency. A frame per
processor approach gave almost linear
speedup, but left the latency unaffected.
Strategies which distributed the calculations
for a single frame over a number of
processors resulted in latency decreasing as
the number of processors increased. These
techniques did, however, have a lesser
speedup. A significant factor in the tests
carried out was the limited bandwidth of the
transputer links. An increase in the
communication speed could contribute
toward producing acceptable rendering
speeds with some of the lower latency
strategies. In particular, the technique which
rendered slices of the image on different
processors proved to be the most satisfying
to use in practice. The object per processor
strategy will become acceptable once the
communication bandwidth increases.
A Robust Framework
For Distributed
Processing Of GIFTS
Hyperspectral Data
Abstract: The Geosynchronous Imaging
Fourier Transform Spectrometer (GIFTS)
instrument can provide raw data in the order
of multiple Terabytes per day. Due to the
high data rate, satellite ground data
processing will require considerable
computing power to process and archive
data in near real-time. Cluster technologies
employing a multi-processor system
combined with a parallel file system are the
only cost-effective solution for such
processing and storage. The GIFTS data
processing system is required to generate
critical products within 5 minutes of
gathering an observation. In this paper we
present an approach for GIFTS ground
system processing based on the master-
worker paradigm, which provides
performance, reliability, and scalability of
candidate hardware and software using the
Message Passing Interface (MPI) standard.
The framework used alleviates the need for
earth scientists to understand parallel
computing and fault-tolerant operations.
Benchmarking results are presented for a
selected number of science algorithms for
the GIFTS instrument showing that
considerable performance can be gained
without sacrificing the reliability and high
availability constraints imposed on the
operational cluster system. A maximum
speedup of 54.56 (85.9% efficiency) is
obtained for a total number of 64 processors
over 64 data cubes of 128 x 128 pixels in the
long wave and short-medium wave spectral
range. This prototype system shows that
considerable performance can be gained for
candidate science algorithms without
sacrificing reliability and high availability
needed for a real-time system.
I. INTRODUCTION
Future satellite instruments can provide raw
data of the order of multiple Terabytes per
day. Due to the high data rate, satellite
ground data processing will require
considerable computing power to process
and archive this data in near real-time. Theprimary mission of the National Oceanic and
Atmospheric Administration (NOAA) is to
understand and predict changes in the
Earth's environment, which requires a
continuous capability to acquire, process and
archive data in real-time. NOAA
participation in the GIFTS technology
transfer represents a risk reduction activity
in the design of the NOAA GOES-R series
of imager and sounder instruments and their
associated science algorithms. The GIFTS
instrument uses a combination of Large area
Focal Plane Arrays (LFPAs) and a Fourier
Transform Spectrometer (FTS), providing a
spectral resolution of 0.6 cm-1 for a 128 x
128 set of 4 km foot-prints every 11 seconds
[1]. It is anticipated that the GIFTS Level-0
data rate is about 55 Mbps, or about 1.5
terabytes per day [1]. The computing power needed for Level-0 to Level-1 processing is
driven not only by the data volume but also
by the inverse Fourier transform and non-
linearity correction.
The volume of Level-1 data is
approximately the same as Level-0. Since
there is little reduction in data volume from
Level-0 to Level-1, producing Level-2 data
also requires significant computing power.
The GIFTS data processing system is required to
generate critical products within 5 minutes
of an observation being gathered. Cluster
technologies employing a multi-processor
system present the only current
economically viable option to accommodate
the processing needed for GIFTS ground
data processing. However, such systems are
inherently failure-prone; the failure of one
component may result in a failure of the
whole system if necessary measures are not taken.
Operational real-time systems need to be reliable and fault-tolerant, operate on
continuous data streams, and be operator-
friendly. To sustain high levels of system
reliability and operability in a cluster-
oriented operational environment, a fault-
tolerant data processing framework is
proposed to provide a platform for
encapsulating science algorithms and hide
the complexities involved with an
operational cluster system. Many Earth science algorithms are very complex, but
they have only a small degree of spatial
dependency and are thus ideal for parallel
processing.
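This spatial independence can be illustrated with a minimal sketch, in which each pixel of a data cube is processed independently of its neighbours (the per-pixel computation and the data layout here are hypothetical placeholders, not the GIFTS algorithms):

```python
from multiprocessing import Pool

def process_pixel(interferogram):
    # Placeholder per-pixel science computation; each pixel's
    # interferogram is processed independently of its neighbours.
    return sum(interferogram) / len(interferogram)

def process_cube(cube):
    # A "data cube" here is a list of per-pixel interferograms.
    # Because there is no spatial dependency, the pixels can be
    # farmed out to any number of workers in any order.
    with Pool(processes=4) as pool:
        return pool.map(process_pixel, cube)
```

Because results depend only on each pixel's own data, the workers need no communication with one another, which is the property the framework exploits.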
Designing a highly reliable operational
cluster system for NOAA's future ground
segments is the focus of this paper. In the
rest of this section, the work reported in the
literature that is most relevant to ours is briefly discussed. In section 2, we describe the
GIFTS instrument and the associated science
algorithms. Section 3 describes the framework
architecture used for GIFTS processing.
Section 4 describes the implementation
details of the prototype system. Section 5
shows benchmarking results and tradeoff
study for GIFTS ground system processing.
Section 6 concludes the paper and comments
on future directions of this work.
II. GIFTS INSTRUMENT AND SCIENCE
ALGORITHMS
A. GIFTS Instrument Background
The GIFTS instrument consists of
large area focal plane detector arrays
(128 x 128 pixels) within a Fourier
Transform Spectrometer (FTS), mounted on a geostationary satellite.
The instrument provides observations
of Earth infrared radiance spectra at
high spectral resolution (0.6 cm-1)
and high spatial resolution (4 km x 4
km pixels). Depending on spectral
resolution, GIFTS views a large area
(512 km x 512 km) of the Earth within a 1 to 11 second time interval.
Extended Earth coverage is achieved
by step-scanning the instrument field
of view in a contiguous fashion across
any desired portion of the visible
Earth. A visible camera provides
daytime imaging of clouds at 1 km
spatial resolution. Figure 1 shows a
selection of GIFTS measurement
modes.
GIFTS uses two detector arrays to
cover the spectral bands 685 to 1130
cm-1 and 1650 to 2250 cm-1, as
shown in Figure 2, and a Michelson
interferometer to obtain the spectrum
of radiance within these bands. The
spectral resolution of the
measurements is sufficient to resolve, within 1-2 km vertical resolution,
dynamic features of the atmospheric
temperature and moisture profiles.
The geostationary platform enables
the tracing of fine scale features of the
atmospheric water (cloud and vapor)
distribution to permit the derivation of
altitude resolved wind profiles.
Moreover, GIFTS will cover a
major portion of the visible disk with
high vertical resolution soundings in
less than 30 minutes. This feature is
important for obtaining wind profiles from geostationary temperature and
moisture sounding data. As part of
the University of Wisconsin's (UW's)
algorithm development, simulations
of expected top of atmosphere (TOA)
radiances are being used for algorithm
development and processing.
B. GIFTS Science Algorithm Pipeline
The GIFTS sensor will sample the
interferogram from each detector as a
function of optical path delay and
numerically filter the data in real-time to
reduce the data rate before transmission to
the ground-based X-band receiver. The
sensor collects views of the onboard
calibration references and deep space at
regular intervals. The ground reception
facility will decode the telemetry stream andpass the GIFTS sensor data in real-time to a
ground data processing facility. The GIFTS
science algorithms are developed by SpaceScience and Engineering Center (SSEC) at
University of Wisconsin (UW). These
algorithms can be described as a pipeline
consisting of a set of modules including an
initial Fast Fourier Transform (FFT), Non-
linearity Correction, Radiometric
Calibration, Spectral Calibration, Instrument
Line Shape Correction and Spectral
Resampling as shown in Figure 3. These
modules are described below, but complete details can be obtained from Knuteson.
Initially, FFT operations are applied to
GIFTS data to convert the measured
interferograms into complex spectra. A
complex Fourier transform and data folding
is performed to convert the complex
interferograms to complex spectra
corresponding to a wave number scale. In
the second step, we apply the non-linearity
correction. This work is still under
investigation by SSEC, UW. It is expected
that the GIFTS detector material is highly
linear in the range of photon fluxes used, but
the electronics readout of the focal plane
array can introduce small signal non-
linearity. The non-linearity correction
algorithm is being designed using the ground-based AERI instrument as well as the
Scanning-HIS instrument. In the third step,
we perform radiometric calibration. This
step addresses the GIFTS instrument
requirement to measure brightness
temperature to better than 1 K, with a
reproducibility of 0.2 K. GIFTS uses views
of two on-board blackbody sources (300 K
and 265 K) along with cold space,
sequenced at regular programmable
intervals. The temperature difference
between the two internal blackbody views
provides the sensor slope term in the
calibration equation, while the deep space
view corrects for radiant emission from the
telescope by establishing the offset term.
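The slope/offset procedure described here can be sketched as a standard two-point calibration; the function below is an illustrative sketch under the assumption that the deep-space radiance is approximately zero, and the symbols and values are not GIFTS calibration constants:

```python
def calibrate(c_scene, c_bb_hot, c_bb_cold, c_space, b_hot, b_cold):
    # Slope (gain) from the temperature difference between the two
    # internal blackbody views (300 K and 265 K in the text):
    # known radiance difference divided by observed count difference.
    gain = (b_hot - b_cold) / (c_bb_hot - c_bb_cold)
    # Offset from the deep-space view, whose radiance is taken as
    # approximately zero; this removes the telescope's own emission.
    offset = -gain * c_space
    return gain * c_scene + offset
```

For example, with blackbody counts of 200 and 100 against known radiances 100 and 50, and a space-view count of 20, a scene count of 140 calibrates to a radiance of 60.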
This is followed by spectral calibration,
which uses ground-based and aircraft FTS
systems. The spectral characteristics of these
instruments are defined by an Instrument
Line Shape (ILS) and a spectral sampling
interval. The spectral sampling scale is
maintained very accurately by the stable
laser used to trigger sampling at equal
intervals of Optical Path Difference (OPD).
The next stage is to perform the instrument
line shape correction. It has been evaluated that this effect is negligible for the GIFTS
instrument, since it has an extremely small
range of angles contributing to each
individual detector pixel (< 1 mrad in the
interferometer). As a result, the ILS variation
is extremely small and could even be
ignored without introducing significant
errors. Once the spectral calibration is
determined we perform wave number
resampling. The GIFTS radiance spectrum can be resampled from the original sampling
interval to a standard reference wave
number scale. The resampling is performed
in software using FFT, zero padding, and
linear interpolation of an oversampled
spectrum. The results of the wave number
resampling operation are equivalent to
GIFTS spectra with a common wave number
scale independent of their location in the
focal plane array. Our benchmarking results
as presented in section 5 include the initial
Fast Fourier Transform (FFT), the
radiometric calibration and the spectral
resampling stages, since the other modules
are still in the development phase and not
yet available for use in the framework.
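The resampling chain just described (FFT, zero padding, and linear interpolation of an oversampled spectrum) can be sketched with NumPy; the oversampling factor, grids, and padding layout below are illustrative assumptions, not the GIFTS implementation:

```python
import numpy as np

def resample_spectrum(spectrum, native_grid, target_grid, oversample=4):
    # Oversample the spectrum by zero padding its (inverse-FFT)
    # interferogram, then FFT back to a denser spectral grid.
    n = len(spectrum)
    igm = np.fft.ifft(spectrum)
    padded = np.zeros(n * oversample, dtype=complex)
    half = n // 2
    padded[:half] = igm[:half]    # low-index samples stay at the front
    padded[-half:] = igm[-half:]  # wrapped samples stay at the back
    dense = np.fft.fft(padded)
    dense_grid = np.linspace(native_grid[0], native_grid[-1], len(dense))
    # Linearly interpolate the oversampled spectrum onto the common
    # reference wave number scale.
    return np.interp(target_grid, dense_grid, dense.real)
```

The output is a spectrum on the requested reference wave number scale, independent of the pixel's native sampling, as the text describes.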
III. GIFTS FRAMEWORK
ARCHITECTURE
In this section we propose an architectural
model for a robust and reliable operational
satellite data processing framework consisting of:
Active/Standby Master
Active/Standby Data Input Server
Active/Standby Data Output Server
Reference and Audit Database Servers
Workers
The master is responsible for cluster
management and task scheduling. The data
input server provides the real-time data
which will be retrieved by the workers while
the data output server gathers the results
produced by the workers. The reference
database server provides access to a
database storage unit which may contain
instrument factory parameters, science
algorithm specific parameters, algorithm
descriptions, and algorithm initialization
parameters. The audit database server
provides monitoring capability of algorithms
used, produced products and so forth. The
workers do the actual processing of the
science algorithms. A Front End Processor
(FEP) receives the data from
the satellite, processes the data into packets
and frames according to the CCSDS format description. The frames constitute
interferometer measurements, performance,
engineering and diagnostics data. The FEP
provides the input data to the data input
server and is not part of the framework.
Redundancy for the master, data input, and
data output servers is provided through an
active/standby mechanism. Parallel file
systems with server fail-over capability may
serve as input and output servers. The
databases, consisting of the reference
database and the audit database, are
envisioned to consist of their own
commercially available database clusters
such as the MySQL Cluster for fault
tolerance purposes. The master is
responsible for cluster management and task
scheduling. A task is a unit of
work which contains its own set of data,
initialization parameters, reference id,
timestamp, and so forth. Each task has a unique identifier which differentiates it from
other tasks within the system. Tasks may
also be prioritized in which case tasks with
higher priorities will be executed before low
priority tasks. The data input server retrieves
the new incoming data, packages them
appropriately and assigns unique identifiers
to tasks encapsulating the new data sets. The
identifiers for the new tasks produced by the
data input server are sent to the master, which in turn schedules the new tasks to be
assigned to workers. The master may also
define scheduling policies in a way such that
certain tasks are assigned to specific
workers. We refer to this as selective
scheduling. Such policies are beneficial in
cases where specific tasks need
initialization parameters that reside
on specific workers and would be
computationally time consuming to
recompute. By assigning such tasks to
the same worker or the same group of
workers and using application-aware caches,
a large amount of computation may be saved
resulting in higher performance. All
communication between the various servers
may be performed asynchronously to
overlap computation with communication.
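Selective scheduling of this kind can be sketched as an affinity map from a task's initialization-parameter key to the worker that already holds those parameters; the class and field names below are hypothetical, not from the framework:

```python
class SelectiveScheduler:
    def __init__(self, workers):
        self.workers = list(workers)
        self.affinity = {}  # param_key -> worker that cached the parameters
        self.rr = 0         # round-robin cursor for previously unseen keys

    def assign(self, param_key):
        # Tasks sharing expensive initialization parameters are routed
        # to the worker that already computed and cached them.
        worker = self.affinity.get(param_key)
        if worker is None:
            worker = self.workers[self.rr % len(self.workers)]
            self.rr += 1
            self.affinity[param_key] = worker
        return worker
```

A second task with the same parameter key lands on the same worker, so the cached initialization is reused instead of recomputed.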
A task assigned to a worker by the master
may not complete on time for various
reasons such as network or worker node failure. A task is considered completed by
the master when the task identifier is
returned to the master by the worker. To
implement task and data redundancy
mechanisms, the task execution time, also
referred to as task latency, has to be known
in advance. Specifying task latency statically
is infeasible for most practical parallel
applications. In the current design, task
execution times are estimated dynamically
based on previous actual task execution
times and an over-estimation factor. The
over-estimation factor for each task can be
calculated as the inverse ratio of its
estimated and actual execution time. The
estimated execution time may be computed
by taking the average of a certain number of
previous task execution times multiplied by
an over-estimation factor. The over-estimation factor may or may not be a
constant.
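The dynamic latency estimate described above can be written out directly; the window size and over-estimation factor below are illustrative defaults, not values from the paper:

```python
def estimate_latency(previous_times, over_estimation=1.5, window=10):
    # Average the most recent actual task execution times and inflate
    # the result by an over-estimation factor, as described in the text.
    recent = previous_times[-window:]
    return over_estimation * sum(recent) / len(recent)
```

For instance, with previous execution times of 2.0 s and 4.0 s and a factor of 1.5, the estimated latency is 4.5 s.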
A task which has not been completed within
its estimated execution time is considered
lost by the master and will be reassigned to a
worker requesting tasks. The master keeps
two queues of tasks, one for new tasks
which have not yet been assigned to
workers, and one for assigned tasks. When the master schedules a task to be assigned to
a worker process, the task at the head of the
queue for new tasks is retrieved and
assigned to an appropriate worker. This task
is then moved to the queue of assigned
tasks. When the master receives a
notification from the worker that the
respective task has been computed, the task
is removed from the queue of assigned tasks.
However if this task is not completed within
its estimated execution time, it is considered
lost. Therefore the task is removed from the
queue of assigned tasks and is placed at the
head of the queue of new tasks to be
reassigned to a new worker. The worker
which was initially assigned the respective
task is considered faulty and will be
removed from the list of available workers.
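The two-queue bookkeeping just described can be sketched as a minimal in-memory model; the real framework exchanges MPI messages rather than method calls, and the names here are hypothetical:

```python
import time
from collections import deque

class Master:
    def __init__(self):
        self.new_tasks = deque()  # tasks not yet assigned to workers
        self.assigned = {}        # task_id -> (worker, estimated deadline)
        self.faulty = set()       # workers considered failed

    def submit(self, task_id):
        self.new_tasks.append(task_id)

    def schedule(self, worker, latency):
        # Pop the task at the head of the new-task queue and move it
        # to the assigned queue with its estimated completion deadline.
        task_id = self.new_tasks.popleft()
        self.assigned[task_id] = (worker, time.monotonic() + latency)
        return task_id

    def complete(self, task_id):
        # Worker returned the task identifier: the task is done.
        self.assigned.pop(task_id, None)

    def reap_lost(self):
        # A task past its estimated execution time is considered lost:
        # re-queue it at the head and mark its worker as faulty.
        now = time.monotonic()
        for task_id, (worker, deadline) in list(self.assigned.items()):
            if now > deadline:
                del self.assigned[task_id]
                self.faulty.add(worker)
                self.new_tasks.appendleft(task_id)
```

Lost tasks go to the head of the new-task queue so they are reassigned before fresher work, matching the recovery behaviour described above.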
IV. GIFTS PROTOTYPE
IMPLEMENTATION
The prototype system includes an HP Linux
cluster consisting of 32 dual core AMD
Opteron DL145 compute nodes with 1 GB
RAM per CPU and 4 dual core AMD
Opteron DL385 management nodes with 1
GB RAM per CPU. The system runs Red
Hat Linux Enterprise Server for AMD64 and
has a Gigabit Ethernet as well as Myrinet
interconnect for the entire system. HP
Serviceguard is used to provide failover
capability for the master node. HP
Serviceguard monitors the health of each
master node and rapidly responds to failures
in a way that minimizes or eliminates
application downtime. This system is also
supplemented by an HP Storage Works
Scalable File Share (HP SFS) which is used
to access data in parallel from all compute
nodes. It has 16 terabytes of usable storage and provides a bandwidth of 1064 Mbytes/s
for READ and 570 Mbytes/s for WRITE. It
uses the Lustre parallel file system which is
one of the most extensively used parallel file
systems.
Framework Implementation
The framework is implemented using the
Message Passing Interface (MPI) and the
C++ programming language. Since there are
many implementations of MPI, we evaluated
various implementations such as Open MPI,
MPICH-2, and LAM-MPI. We finally
chose the MPICH-2 implementation
considering its support for the MPI-2 standard.
Our framework prototype is a complete
implementation of all the various
components as described in section 3 except
for the FEP. For the current framework
implementation, the FEP was not used to
provide the data; rather the data was directly
fetched from the input server. Theframework separates the science algorithm
layer from the cluster management layer and
provides an operational platform within
which algorithm software pipelines can be
deployed for satellite data processing. There
are classes and APIs for each service such
as master, input delivery, output delivery,
reference database, audit database, and
workers. The system can be configured at
startup using configuration files wherehostnames may be specified for the various
worker nodes, reference and audit database
and so forth. Once the master finishes
initialization, it starts the workers and
communicates a unique identifier to each
worker. Once connection is established, if
there is an outstanding task to be processed,
it will be assigned to a worker by the master.
In case a worker is removed at run-time, its
tasks are reassigned to other workers as
described in section 3. In case of a master
failover, all worker-master connections are
lost and the worker processes are cleaned
up. The failover process from standby to
active involves reading the last check-point
file written by the active unit, re-creating the
system state accordingly and failing over
from standby to active unit. The standby
master takes over and creates new worker
processes as well as redistributes tasks to the
workers. The framework provides algorithm
independence through a set of base classes
and interfaces for the algorithmic tasks.
Each independent work unit is encapsulated
in the Task class containing fields for start
time, completion time, compute time, unique
task identifier, algorithm identifier, number
of pixels, input and output data sizes, and
other initialization parameters common to
most satellite data processing algorithms we
have evaluated so far. A task is started as soon as the master schedules it for a specific
worker and is considered completed as soon
as the master receives a notification from
worker that the respective task has been
completed. Hence the completion time for
each task includes communication, read,
write, and computation time.
The compute time stored in the task contains
the actual computational time for the task as
observed by the worker. Each task has a unique task identifier to differentiate it from
other tasks. Each task contains a field for an
algorithm identifier specifying the
algorithm(s) to be applied to the
data. Each task contains references to its own
set of input and output data files. Workers
execute a task simply by calling the
execute() method on an instance of the Task
class. This way the science algorithm layer
is clearly separated from the framework
layer.
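The separation between the framework layer and the science algorithm layer can be sketched with a base class of this shape; the field names follow the description above, while the class names and the FFT task body are hypothetical placeholders:

```python
import abc

class Task(abc.ABC):
    def __init__(self, task_id, algorithm_id, num_pixels):
        self.task_id = task_id            # unique task identifier
        self.algorithm_id = algorithm_id  # which algorithm(s) to apply
        self.num_pixels = num_pixels
        self.start_time = None            # set when the master schedules it
        self.completion_time = None       # set when the master is notified
        self.compute_time = None          # as observed by the worker

    @abc.abstractmethod
    def execute(self):
        """Run the encapsulated science algorithm on the task's data."""

class FftTask(Task):
    def execute(self):
        # Placeholder body: a real worker would apply the FFT stage to
        # this task's input file and write the resulting spectra.
        return f"fft applied to {self.num_pixels} pixels"
```

A worker only ever calls `execute()` on a `Task` instance, so it needs no knowledge of which science algorithm is inside.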
V. RESULTS
In this section we investigate our framework
prototype in terms of performance and
reliability. The framework prototype is
evaluated using GIFTS science algorithms
as described in section 2. Our
benchmarking results show end-to-end
performance using simulated data cubes as
well as the effect of task size on the
performance.
A. Performance
Our benchmarking results include the initial
FFT, the radiometric calibration and the
spectral resampling stages, since the other
modules are still in the development phase
and have not yet been released. For the
benchmarking, a total of 64 data cubes are used. This provides a sufficient amount of
work to each worker to hide
any anomalies. Each data cube has a total of
128 x 128 pixels. In the first experiment we
have used a constant task size of 64 data
cubes and shown the execution time, speed
up and efficiency for the GIFTS pipeline.
Two spectral ranges for the datasets are
used: the long wave infrared band (685-1129
cm-1) and the short-medium wave band
(1650-2250 cm-1). Figure 4 shows the total
execution time versus number of processors
using the end-to-end GIFTS pipeline
processing. Figure 5 shows the speedup
versus number of processors for the same
experiment. It is shown that we can achieve
a near-linear speedup of 54.56 compared to an
ideal speedup of 64. Figure 6 shows the
efficiency versus number of processors. It can be seen that we are able to maintain an
efficiency of 85 percent throughout the
experiment. In the second experiment, we
evaluated the effect of network bandwidth
by increasing the task size in terms of
number of pixels per task. This resulted in
a reduced number of messages between the
workers and the master, which in turn
reduced the worker-master communication.
This experiment was conducted using a total
of 64 data cubes running on 64 processors.
The experiment was conducted using tasks
containing 128, 256, 1024, and 2048 pixels.
Figure 7 shows the total execution time
versus task size. Increasing the task size
from 128 pixels per task to 2048 pixels per
task resulted in an increase of the total
execution time of about 0.7 percent. Hence
the effect of task size on the total execution
time is negligible. It was observed that the
total amount of computation time remained
constant, as the total work remained
constant. Figure 8 shows the effect of task
size on READ and WRITE times. For a task
size of 128 pixels per task, blocks of 128
pixels were used to perform READ and
WRITE, whereas for task sizes greater than
or equal to 256 pixels per task, blocks of 256
pixels were used. In the third experiment,
we investigated the effect of the bandwidth
of the parallel file system to determine the
optimum number of pixels which should be
used for performing I/O to obtain the
maximum speedup. In this experiment we
kept the task size to a constant value of 1024
pixels per task while increasing the block
size to perform bulk I/O. Block sizes of 32,
64, 128, 256, 512, 1024 pixels were used.
Figure 9 shows the total execution time
versus block size in pixels. It can be seen
that an optimal value of a block size of 256
pixels results in the least execution time. Figure 10 shows the effect of bulk I/O time
versus block size in pixels. An optimal value
of a block size of 256 pixels is obtained for
READ while the value of block size has
negligible effect on WRITE.
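The speedup and efficiency figures quoted above follow from the usual definitions, sketched here:

```python
def speedup(t_serial, t_parallel):
    # Ratio of single-processor time to parallel time.
    return t_serial / t_parallel

def efficiency(s, num_processors):
    # Fraction of ideal (linear) speedup actually achieved.
    return s / num_processors

# The reported speedup of 54.56 on 64 processors corresponds
# to an efficiency of roughly 85 percent.
print(round(efficiency(54.56, 64), 4))
```

This is why the 54.56 speedup and the roughly 85 percent efficiency in Figures 5 and 6 are two views of the same measurement.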
FIGURES: (Figures 4-10 are not reproduced in this transcript.)
B. Reliability
In this section, we evaluated the reliability
of the framework in case of worker failures
and their impact on the performance. A task
assigned to a worker by the master may not
complete on time due to a worker node or worker process failure. In our
framework an estimated execution time is
computed by taking the average of a certain
number of previous task execution times
multiplied by an over-estimation factor as
described in section 3. A task which has not
been completed within its estimated
execution time is considered lost by the
master and will be reassigned to another
worker requesting tasks. The worker which
was initially assigned the lost task is
considered faulty and is removed from the
list of available workers. It is assumed that
an operational system will have a safety
margin to ensure the availability of the
minimum required number of worker
processes at all times. To investigate the
capability of the framework in case of
worker failures, artificially induced faults
are introduced in the system. Worker
processes are shut down according to an
exponential distribution after they are
assigned tasks. An example of the worker failure distribution during the course of one
of the experiments is shown in figure 11. In
our experiments, we are assuming that the
required number of worker processes is 32
with a safety margin of 100 percent. Hence,
starting from 64 worker processes, workers
are shut down until only half of the initial processes are
available. We verified that the master can
successfully identify failed workers and
reassign tasks to available worker processes.
Figure 12 shows the speedup for a series of
experiments where the speedup varied from
about 38 to 53. A mean value of 44.88 is
obtained for the speedup as shown in figure
12. This value may be compared to the
speedup of 54.56 shown in figure 5 without
worker failures.
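The fault-injection scheme above can be sketched with exponentially distributed inter-failure intervals; the mean interval and the seed below are illustrative choices, not values from the experiments:

```python
import random

def failure_times(num_failures, mean_interval, seed=42):
    # Draw inter-failure intervals from an exponential distribution
    # and accumulate them into absolute worker-shutdown times.
    rng = random.Random(seed)
    times, t = [], 0.0
    for _ in range(num_failures):
        t += rng.expovariate(1.0 / mean_interval)
        times.append(t)
    return times

# E.g. schedule 32 worker shutdowns (half of 64 workers) with a
# hypothetical 10-second mean interval between failures.
schedule = failure_times(32, 10.0)
```

Each scheduled time marks when one worker process is shut down after being assigned a task, which is how the experiments force the master's lost-task recovery path to run.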
FIGURES: (Figures 11 and 12 are not reproduced in this transcript.)
VI. CONCLUSIONS AND FUTURE
WORK
In this paper we have proposed a highly
robust and reliable framework for the
distributed real-time processing of satellite
data for a GIFTS ground system.
Furthermore, we presented an architectural
model for providing performance, reliability,
and scalability of candidate hardware and
software for such a framework. It has been
shown that considerable performance can be
gained for GIFTS science algorithms
without sacrificing the reliability and high
availability requirements for the operational
system. Future work involves the processing
of GIFTS science algorithm pipeline using
real thermal vacuum data. Depending on the
flight of the GIFTS instrument, the
framework prototype as shown in this paper
needs to be transitioned into an operational
environment requiring data to be processed in near real-time. This requires further
scalability studies in terms of number of
processors needed to keep up with the
GIFTS data rate.
BIBLIOGRAPHY
The following web links, accessed via the
website www.osun.org, were used in
preparing this term paper:
http://vecpar.fe.up.pt/2006/programme/papers/36.pdf
http://www.idosi.org/wasj/wasj5(2)/14.pdf
http://comet.lehman.cuny.edu/vpan/pdf/lpieeetc01_184.pdf
http://www.cs.tufts.edu/~jacob/papers/supercomputing.deligiannidis.pdf
http://www.cs.ru.ac.za/research/Groups/vrsig/pastprojects/001parallelanddistributedvr/paper02.pdf
http://www.dhpc.adelaide.edu.au/reports/108/dhpc-108.pdf
http://ams.confex.com/ams/pdfpapers/69620.pdf