
TERM PAPER
OF COMPUTER ORGANIZATION & ARCHITECTURE
CSE (211)

TOPIC: PARALLEL PROCESSING APPLICATIONS

Submitted to: MR. SUMIT MITTU
Submitted by: ASHISH KUMAR
ROLL NO: RG1901A65
SECTION: G1901
GROUP: 1
REG. NO: 10800114

TABLE OF CONTENTS

INTRODUCTION TO PARALLEL PROCESSING

VARIOUS APPLICATION AREAS OF PARALLEL PROCESSING

MATRIX OPERATIONS

VIRTUAL REALITY SYSTEM

A ROBUST FRAMEWORK FOR DISTRIBUTED PROCESSING OF GIFTS HYPERSPECTRAL DATA

BIBLIOGRAPHY

INTRODUCTION TO PARALLEL PROCESSING

Parallel processing is the ability to carry out multiple operations or tasks simultaneously. The term is used in the contexts of both human cognition, particularly the ability of the brain to process incoming stimuli simultaneously, and parallel computing by machines.

PARALLEL PROCESSING IN COMPUTERS

Parallel processing is the simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads. Ideally, parallel processing makes programs run faster because more engines (CPUs or cores) are running them. In practice, it is often difficult to divide a program in such a way that separate CPUs or cores can execute different portions without interfering with each other. Most computers have just one CPU, but some models have several, and multi-core processor chips are becoming the norm. There are even computers with thousands of CPUs.

With single-CPU, single-core computers, it is possible to perform parallel processing by connecting the computers in a network. However, this type of parallel processing requires very sophisticated software called distributed processing software.

Note that parallelism differs from concurrency. Concurrency is a term used in the operating systems and databases communities to refer to the property of a system in which multiple tasks remain logically active and make progress at the same time by interleaving their execution order, thereby creating an illusion of simultaneously executing instructions. Parallelism, on the other hand, is a term typically used by the supercomputing community to describe executions that physically execute simultaneously, with the goal of solving a problem in less time or solving a larger problem in the same time. Parallelism exploits concurrency.

Parallel processing is also called parallel computing. In the quest for cheaper computing alternatives, parallel processing provides a viable option. The idle processor cycles across a network can be used effectively by sophisticated distributed computing software.

VARIOUS APPLICATION AREAS OF PARALLEL PROCESSING

MATRIX OPERATIONS

Performance Analysis and Evaluation of Parallel Matrix Multiplication Algorithms:

Multiplication of large matrices requires a lot of computation time, as its complexity is O(n^3). Because most current applications require higher computational throughput in minimum time, many sequential and parallel algorithms have been developed. In this paper, a theoretical analysis of the performance of parallel matrix multiplication algorithms is carried out, and an experimental analysis is performed to support the theoretical results. Recommendations are made based on this analysis for selecting the proper parallel multiplication algorithm.

Matrix multiplication is commonly used in the areas of graph theory, numerical algorithms, digital control, digital image processing and signal processing. Multiplication of large matrices requires a lot of computation time, as its complexity is O(n^3), where n is the dimension of the matrix. Because most current applications require higher computational throughput, many algorithms based on sequential and parallel approaches have been developed to improve the performance of matrix multiplication. Even with such improvements, sequential methods such as Strassen's algorithm have shown limited performance; for this reason, parallel approaches have been examined for decades.

Most parallel matrix multiplication algorithms use a matrix decomposition based on the number of processors available. These include the systolic algorithm, Cannon's algorithm, Fox and Otto's algorithm, PUMMA (Parallel Universal Matrix Multiplication), SUMMA (Scalable Universal Matrix Multiplication) and DIMMA (Distribution Independent Matrix Multiplication). Each of these algorithms works on matrices decomposed into sub-matrices. During execution, a processor calculates a partial result using the sub-matrices it currently holds, then successively performs the same calculation on new sub-matrices, adding the new results to the previous ones. When the multiplication sub-process is completed, the root processor assembles the partial results and generates the complete matrix multiplication result.
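The common inner step of all these schemes is accumulating the product of two sub-matrices into a partial result. A minimal C++ sketch of that step (the block layout and names are our own assumptions, not taken from any particular algorithm above):

#include <vector>

// C += A * B for b x b blocks stored in row-major order, as a worker would
// apply to each pair of sub-matrices it currently holds.
void accumulate_block(const std::vector<double>& A,
                      const std::vector<double>& B,
                      std::vector<double>& C, int b) {
    for (int i = 0; i < b; ++i)
        for (int k = 0; k < b; ++k) {
            double a_ik = A[i * b + k];
            for (int j = 0; j < b; ++j)
                C[i * b + j] += a_ik * B[k * b + j];
        }
}

In Cannon's algorithm, for example, each processor would call such a routine once per shift of the sub-matrices, with the root assembling the accumulated blocks at the end.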

In order to make a proper selection for a given multiplication operation and to decide which algorithm is best suited to generate high throughput in minimum time, a comparative analysis and a performance evaluation of the above-mentioned algorithms is carried out using the same performance parameters and based on parallel processing. Moreover, these algorithms are implemented using the C++ programming language.

PARALLEL COMPUTING PARADIGMS

Parallel computing depends on how the processors are connected to memory. The connection scheme can be classified as a shared memory system or a distributed memory system; each of these two types is discussed as follows.

Shared Memory System: In such a system, a single address space exists, within which every memory location is given a unique address, and the data stored in memory are accessible to every processor. A processor P_i may read data written by another processor P_j; therefore, in order to enforce sequential consistency, it is necessary to use synchronization.

OpenMP is one of the popular programming interfaces for shared memory systems. It provides a portable, scalable and efficient approach to running parallel programs in C/C++ and FORTRAN. In OpenMP, a sequential program can be parallelized with preprocessor compiler directives such as #pragma omp in C and $OMP in FORTRAN, backed by library support.
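For illustration, a minimal OpenMP sketch in C++ (our own example, not from the paper): a single directive parallelizes the loop, and the reduction clause handles the shared-memory combination of the sum.

// Compile with e.g. g++ -fopenmp.
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    static double a[n];
    double sum = 0.0;
    // Iterations are split across threads; reduction(+:sum) gives each thread
    // a private partial sum that is combined at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        a[i] = 0.5 * i;
        sum += a[i];
    }
    std::printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}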

Distributed Memory System: In such a system, each processor has its own memory and can only access its local memory. The processors are connected to other processors via a high-speed communication network and exchange information with one another using send and receive operations. A common approach to programming such multiprocessors is to use message-passing library routines in addition to a conventional sequential program. MPI (Message Passing Interface) is useful for distributed memory systems since it provides a widely used standard for message-passing programs: a practical, portable, efficient and flexible standard for message passing. In MPI, data is distributed among processors; no data is shared, and data is communicated by message passing.
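A minimal sketch of this send/receive style (our own example, not the paper's code): rank 0 sends a block of doubles to rank 1.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf[4] = {1.0, 2.0, 3.0, 4.0};
    if (rank == 0) {          // sender: data lives in its own local memory
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {   // receiver: gets a copy only by message passing
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }
    MPI_Finalize();
    return 0;
}

Compile with mpicxx and run with, for example, mpirun -np 2.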

THEORETICAL ANALYSIS

In order to evaluate the performance of matrix multiplication using parallel processors and different algorithms, a theoretical analysis is carried out based on the following assumptions:

f = number of arithmetic operations
tf = time per arithmetic operation
c = number of communication units
tc = time per communication unit
q = f / c, the average number of flops per communication access

Minimum possible time = f * tf (when there is no communication)
Total time = f * tf + c * tc = f * tf * (1 + (tc / tf) * (1 / q))
Efficiency (speedup): SP = q * (tf / tc)

A larger q indicates that the total time is closer to the minimum f * tf, thus increasing the speedup and the efficiency of the algorithm. SP must be close to the number of processors to achieve the highest efficiency. Using the above assumptions and with reference to the information in Table (1), we can obtain f, c and q for each parallel matrix multiplication algorithm, as shown in Table (2).
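As a quick worked illustration of this model (the numbers are our own assumptions, not from the paper's tables): with tf = 1 ns, tc = 100 ns and q = 50 flops per communication access, the total time is f * tf * (1 + (100 / 1) * (1 / 50)) = 3 * f * tf, so communication triples the minimum time. The same arithmetic in C++:

#include <cstdio>

int main() {
    // Assumed example values: 1 ns per flop, 100 ns per communication access,
    // q = 50 flops per access.
    double tf = 1e-9, tc = 100e-9, q = 50;
    double slowdown = 1.0 + (tc / tf) * (1.0 / q);  // factor over minimum f*tf
    std::printf("total time = %.1f x (f * tf)\n", slowdown);  // prints 3.0
    return 0;
}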

EXPERIMENTAL RESULTS

Table (4) shows the experimental results obtained by running each algorithm 100 times and then averaging the fastest runs.

CONCLUSION AND FUTURE WORKS

A theoretical analysis of the performance of the seven most used algorithms is carried out using one and four processors. This analysis has shown that the systolic algorithm produces the highest efficiency, followed by PUMMA, DIMMA and then SUMMA. These algorithms are then implemented and run on one and four processors to evaluate their performance for matrix multiplication, and the experimental results match the theoretical ones. This analysis is useful for making a proper recommendation on selecting the best algorithm, and extending it is left as future work.

VIRTUAL REALITY SYSTEM

Application of Parallel Processing to Rendering in a Virtual Reality System:

The rapid rendering of images is pivotal in all applications of computer graphics, and perhaps most of all in the field of virtual reality. Research suggests that there is a limit to the improvement in performance that can be attained by the selection of appropriate algorithms and their careful optimisation. Hardware implementation increases expense and limits the flexibility of the equipment. A relatively unexplored option is the use of parallel processing in image rendering, especially image rendering in virtual reality systems.

This paper examines the application of parallel processing to image rendering in a virtual reality system. The requirements of a virtual reality system are noted. A number of different parallelisation techniques are proposed, and each is analysed with regard to its effectiveness in meeting these requirements. Performance values, measured from an implementation of each technique on a transputer architecture, are used to highlight the advantages and disadvantages of each technique.

Rendering for a virtual reality system: Rendering is a well understood application. Many systems have been created in which three-dimensional scenes have to be rendered, and the use of parallelism for rendering has also been explored. However, previous research has not placed the emphasis on rendering scenes in real time. In particular, little work has been done on using parallel processing when rendering for virtual reality applications. Virtual reality applications are different in that the images must be rendered very rapidly, in real time, and the delay between a user input and the corresponding change in the rendered output, the latency, must be small. In working virtual reality systems, frame rates of 6 frames/second and a latency of less than 200 milliseconds are suggested as minimum requirements for the illusion of reality. These two factors, the speed of rendering and the latency period, are used as the criteria for evaluating the parallelisation techniques explored in this paper.

Rendering on a single processor: Renderers are typically constructed by creating a pipeline of several standard operations. These operations include clipping, transformation, projection and hidden-surface removal. The description of the scene to be displayed is fed into the pipeline, which eventually produces a sequence of primitives (two-dimensional lines and polygons, for example) to be drawn. The pipeline typically ends with some routines capable of producing a visual representation of these primitives. The renderer described in this paper is modelled as three stages. The first selects those objects in the virtual world which will be visible on the screen. The next stage transforms the objects to screen coordinates. The last stage performs hidden-surface removal, clips the objects to the dimensions of the screen and renders each primitive. This level of breakdown is adequate for the parallelisation used, which concentrates more on parallelisation inside each stage than between stages.

Parallelisation of the graphics pipeline: A number of different parallel decomposition strategies for polygon rendering are described in a paper by Whitman. Four decomposition strategies were implemented, and are described in the following sections. The various strategies were implemented on a transputer-based architecture; however, in most cases, the techniques are applicable to other architectures. The hardware platform on which the parallelisation was carried out was a MIMD architecture consisting of a cluster of 32 T800 transputers. Each transputer is a reasonably powerful processor in its own right, capable of running a number of concurrent processes. The addition of four communication links allows it to transfer data to and from other transputers. Each transputer has its own private memory, and all synchronisation and data transfer occur through the links. A single bidirectional link can transfer a theoretical maximum of 2.35 Mbytes/s, or 1.74 Mbytes/s if communication is only in one direction. A set of standard test data was used to obtain timing data from the various implementations. The speed of the rendering system on a single processor is given in Table 1. The simple test world contains 30 objects amounting to about 250 polygons (each with 4 or more sides). The complex world consists of a single object containing about 1650 triangular polygons.

Pipeline parallelism

The most obvious strategy when confronted with the graphics pipeline is to place each component of the pipeline on a separate processor. The connections between components can be constructed using transputer links. Two factors are worthy of consideration when setting up a pipeline configuration for virtual reality on a transputer architecture. Firstly, greater bandwidth is obtained from the transputer links when larger messages are transmitted: protocol overhead is lower, and there is less delay waiting for each of the communicating processors to synchronise. The few bytes representing a point are not worth transmitting alone; it is better to wait until a number of points are ready to be transmitted and then send them all together. For convenience, all the data corresponding to an entire object is sent in one packet. The second consideration is the latency problem, the delay between input and the corresponding output. A lot of data is held in long pipelines, even with a limited amount of data at each node, and all of it must be processed before new data entering the node can have any effect on the image being displayed. Since there is a certain overhead associated with the communication between pipeline components, the delay between data entering the pipeline and reaching the output increases with pipeline length, even though the rate at which images are produced is increased.

This may be illustrated by example. Consider a job which on a single processor would accept a datum and take 10 seconds to process it before outputting the result, and assume this job is capable of being parallelised linearly. A pipeline in which each node shares equally in the workload is illustrated in Figure 1. The data arrives as fast as possible, but takes 1 second to transmit across a link. With 5 processors the job can be divided into a pipeline of 5 parallel components, so each works on the datum for 2 seconds; results can thus be output every 2 seconds, while the time between a given datum being input and the corresponding output is at least 16 seconds. With 10 processors, output occurs once every second, but at least 21 seconds pass between input and corresponding output.

In general, let N be the number of processors, let T be the time for the job on the single-processor system, and let C be the communication time between processors. A result is then output every T/N seconds, and the latency is at least (N + 1)C + T seconds.
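The example numbers above can be checked directly against these formulas; a short C++ sketch (our own) of the calculation:

#include <cstdio>

int main() {
    double T = 10.0, C = 1.0;              // job time and per-link transfer time
    for (int N : {5, 10}) {
        double period  = T / N;            // seconds between successive outputs
        double latency = (N + 1) * C + T;  // minimum input-to-output delay
        std::printf("N = %d: output every %.0f s, latency >= %.0f s\n",
                    N, period, latency);
    }
    return 0;
}

This reproduces the figures in the example: output every 2 s with a 16 s latency for 5 processors, and every 1 s with a 21 s latency for 10.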

The latency increases with the number of processors for non-zero communication time. For zero communication time it remains constant, independent of the data output rate. For this reason, pipeline parallelism is better suited to systems that require computationally intensive rendering with limited user interaction. A limited rendering pipeline with three nodes was implemented; the processor layout is shown in Figure 2, and timing figures are given in Table 2. The differences in speedup at different screen resolutions are due to changes in the load balance: the renderer has to do more work at higher resolutions and so acts as a bottleneck.

Coarse grained parallelism

The coarse grained parallel decomposition strategy renders successive frames on separate processors. The transputer network is arranged to allow a number of independent pipelines to be created, as shown in Figure 3. Each pipeline renders one frame in memory, which is then transferred directly to the graphics node. This technique promises the greatest speedup since it requires no load balancing: each processor can start work on another frame as soon as the current one is finished. The load is also approximately equal from one frame to the next, so the pipelines are likely to stay in step, and there is thus no waiting for other processors to catch up. The delay between input and output is still the time taken for one pipeline to render a frame. No decrease in this time occurs even when the number of parallel pipelines is increased; the latency is thus constant and independent of the number of processors used. Values for the speedup obtained from an implementation of this technique are shown in Figure 4. As may be seen, linear speedup is obtained for most of these examples. The only factor preventing linear speedup in all cases is the bandwidth restriction on the links to the graphics node. With each link capable of 1.74 Mbytes/s and four links, this allows a maximum of 26.6 frames/second at 512 x 512 resolution or 108 frames/second at 320 x 200 resolution. The example with the simple world at 512 x 512 uses about half of the total bandwidth available. There is a general tendency for a drop in the speedup when more than four pipelines are used, especially in high-traffic situations. This tends to occur with all the parallelism strategies tried, and can be ascribed to the necessary doubling of the traffic on one of the links to the video controller card.
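These frame-rate ceilings follow directly from the link bandwidth. A small sketch of the arithmetic (assuming one byte per pixel, which is consistent with the quoted figures):

#include <cstdio>

int main() {
    double bandwidth = 4 * 1.74e6;     // four links to the graphics node, bytes/s
    double frame512  = 512.0 * 512.0;  // bytes per 512 x 512 frame, 1 byte/pixel
    double frame320  = 320.0 * 200.0;  // bytes per 320 x 200 frame
    std::printf("512 x 512: %.1f frames/s\n", bandwidth / frame512);  // ~26.6
    std::printf("320 x 200: %.1f frames/s\n", bandwidth / frame320);  // ~108.8
    return 0;
}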

Fine grained parallelism

The pipeline approach uses a number of transputers connected in series and has the disadvantage of high latency. A more suitable parallel decomposition for interactive graphics applications would have the processors connected in parallel, all working on the same frame; in this way the delay between input and corresponding output would be shortened. A profile of the graphics routines running on a single processor reveals that the bulk of the effort is spent on the stage involving hidden-surface removal and polygon rendering, so the bulk of the parallelisation effort has gone into decreasing the time spent on this stage. The other stages of the graphics pipeline are left as a short pipeline on separate processors. The fine grained parallelisation technique is an image-space, pixel-based parallel decomposition strategy. The display area is partitioned into horizontal strips, and the hidden-surface removal and rendering for each strip occur on a separate processor (a minimal sketch of this partition appears at the end of this section). The strips are rendered in memory and transferred to the graphics node, which fits each block into the correct area of screen memory. The amount of data transmitted to the video controller node is the same as with coarse grained parallelism; the messages are shorter but more frequent. The processor layout used to implement this technique is shown in Figure 5, where the shaded processors represent one possible graphics pipeline for rendering a single strip. The transformation stage can be parallelised in the same manner, with transformations restricted to those objects appearing on any particular strip. However, in practice most objects appear on every strip anyway, so this does not result in any significant saving; the transformation stage is therefore placed on a single processor to simplify the model.

Consider a simplified version of the graphics pipeline that consists of K stages, each taking time T to complete its task. By using N processors on one of the stages and assuming linear speedup, the time for that stage is reduced to T/N. The total time taken to traverse the pipeline is then (K - 1)T + T/N, so the latency decreases as the number of processors is increased. The restrictions given above can be relaxed; as long as speedup is obtained, latency will decrease.
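A minimal sketch (hypothetical, not the paper's code) of the strip partition used by this technique, assigning each of N rendering processors a contiguous band of scan-lines:

#include <cstdio>

int main() {
    const int H = 512;   // screen height in scan-lines
    const int N = 4;     // number of rendering processors
    for (int p = 0; p < N; ++p) {
        int first = p * H / N;         // first scan-line of processor p's strip
        int last  = (p + 1) * H / N;   // one past its last scan-line
        std::printf("processor %d renders rows %d..%d\n", p, first, last - 1);
    }
    return 0;
}

The scan-line-interleaved alternative discussed below replaces the contiguous band with rows p, p + N, p + 2N, and so on, trading better load balance for extra bookkeeping.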


The speedup values obtained from implementations of this strategy are shown in Figure 6. The times measured are for the entire graphics pipeline, not just the hidden-surface removal and polygon rendering; this latter part of the pipeline does most of the work, however, and controls the overall speed. This technique does not produce a large speedup, nor does the speedup increase very much with the number of processors used. Two factors were found to contribute to these effects. First, when rendering the horizontal strips, the polygons that fall partially outside these strips must be clipped. As the number of strips increases, each strip becomes smaller and the number of polygons needing to be clipped increases. In one test with the screen divided into 4 strips, the time taken for clipping increased from about 25% of the rendering time to about 500% of the rendering time; the rendering time did decrease by a factor of 4 due to the parallelism, but the clipping overhead was still significant. Clipping is a significant factor for the single object in the complex world, with its many polygons, and limits the speedup the most in this case. The second factor involves load balancing. The picture is not always evenly distributed over the screen; sometimes all the objects cluster in one region. In this case one processor is required to render most of the objects while the others render almost nothing, a situation almost equivalent to a single pipeline. The effect of this can be seen in the speedup values for the complex world at 512 x 512 resolution. The single object occupies the middle half of the screen; thus, for three pipelines, the processors rendering the top and bottom slices have little work to do while the processor rendering the middle slice does most of the rendering. This explains the drop in speedup over the two-pipeline version, where the load is more evenly balanced. The object fills the screen at 320 x 200 resolution, and clipping limits the speedup in this case.

Alternative techniques for distributing sections of the display area over N processors to improve the load balancing include:

Placing successive scan-lines on different processors.

Dividing the screen into N^2 horizontal slices and placing successive slices on different processors.

These alternatives, however, incur greater overheads. The second alternative would result in more clipping, making it more expensive. The first alternative proved possible to implement with surprisingly few alterations to the rendering routines. Tests of the method show a more consistent tendency for speed to increase as the number of processors is increased, as opposed to the more erratic behaviour evidenced by the original routine; this can be attributed to the improved load balancing. Despite the improvement in scaling, the added complexity results in the improved algorithm running about 10% slower than the original. The frame rates obtained with these methods were not high enough for bandwidth limitations to be a constraint.

Object-based parallelism

An object-space parallelisation strategy was designed in which the rendering of different objects occurs on different processors. For the case of N processors, the objects are sorted into N groups based on their depth; this is easily done as part of the depth sort involved in the hidden-object removal routine. Each group is then rendered on one of the processors. The images produced are sent to the graphics node, where they are combined. The combination technique is similar to the Painter's algorithm, used for hidden-surface removal, in which images corresponding to nearer objects are painted over those belonging to more distant ones. The transputer has a specialised block-move instruction that allows this to be done relatively quickly and easily. Both the transformation and the hidden-surface/polygon rendering stages can be done in parallel with this method. The transputer network used to implement this strategy is shown in Figure 7; the shaded nodes show the graphics pipeline for one group of objects. This parallelisation strategy has one immediate disadvantage: if the scene consists of only one object, there is little point in using more than one processor. Consequently, a world containing 16 objects, each with 276 triangles, was used instead of the complex world used for previous tests. Values obtained for the speedup of an implementation of this strategy are shown in Figure 8. The bandwidth limitation is particularly severe in this case. Where the number of processors used is less than five, an entire image needs to be sent across each of the transputer links. This places an upper limit on the frame rate of 6.6 frames per second at 512 x 512 resolution or 27 frames per second at 320 x 200 resolution. The speedup values tend to level off as this barrier is reached, and adding a fifth processor only exacerbates the situation. A further problem with this technique is load balancing. Some objects are more complex than others, and just distributing equal numbers of objects may result in some processors having to do more work. A finer, more balanced approach would be to distribute on a polygon basis. This could be implemented relatively easily; however, such an approach is not worth exploring until higher-bandwidth systems are available.

The parallelism would also suffer when the number of visible objects is less than the number of processors; polygon distribution would be valuable for this situation as well. The simplified version of the graphics pipeline used in the previous latency calculation gives the total time taken to traverse the pipeline as (K - 2)T + 2T/N, since two stages of the pipeline are parallelised. The latency is therefore less than that for the fine grained decomposition and also decreases as the number of processors is increased.

Parallel rendering in practice

With each of the techniques implemented, analysis of run-time profiling information showed that processor time was being wasted waiting for other processors to catch up: processors were blocking, waiting for others to communicate. This problem was overcome by buffering incoming messages, enabling the process sending the data to continue with other processing; a notable speedup in the frame rate was achieved. This solution had a corresponding disadvantage. Adding extra processes on the communication links has the effect of lengthening the pipeline, which increases the latency, and the size of the buffers on each link also affects the latency adversely. In order to get the maximum benefit from this technique, the size of the buffers should be related to the complexity of the scene: small buffers are better suited to scenes with a few complex objects, while larger ones smooth over the delays for large numbers of simple objects.

Conclusions

This paper describes a number of different parallelisation strategies that were tested in an attempt to increase the rendering speed and decrease the interaction latency in virtual reality systems. The strategies explored tended to exhibit a tradeoff between speedup and latency. A frame-per-processor approach gave almost linear speedup, but left the latency unaffected. Strategies which distributed the calculations for a single frame over a number of processors resulted in latency decreasing as the number of processors increased; these techniques did, however, achieve a smaller speedup. A significant factor in the tests carried out was the limited bandwidth of the transputer links. An increase in communication speed could contribute toward producing acceptable rendering speeds with some of the lower-latency strategies. In particular, the technique which rendered slices of the image on different processors proved to be the most satisfying to use in practice. The object-per-processor strategy will become acceptable once the communication bandwidth increases.

A ROBUST FRAMEWORK FOR DISTRIBUTED PROCESSING OF GIFTS HYPERSPECTRAL DATA

Abstract: The Geosynchronous Imaging Fourier Transform Spectrometer (GIFTS) instrument can provide raw data on the order of multiple terabytes per day. Due to the high data rate, satellite ground data processing will require considerable computing power to process and archive data in near real-time. Cluster technologies employing a multi-processor system combined with a parallel file system are the only cost-effective solution for such processing and storage. The GIFTS data processing system is required to generate critical products within 5 minutes of gathering an observation. In this paper we present an approach for GIFTS ground system processing based on the master-worker paradigm, which provides performance, reliability and scalability on candidate hardware and software using the Message Passing Interface (MPI) standard. The framework used alleviates the need for earth scientists to understand parallel computing and fault-tolerant operations. Benchmarking results are presented for a selected number of science algorithms for the GIFTS instrument, showing that considerable performance can be gained without sacrificing the reliability and high-availability constraints imposed on the operational cluster system. A maximum speedup of 54.56 (85.9% efficiency) is obtained for a total of 64 processors over 64 data cubes of 128 x 128 pixels in the long-wave and short-medium-wave spectral ranges. This prototype system shows that considerable performance can be gained for candidate science algorithms without sacrificing the reliability and high availability needed for a real-time system.

I. INTRODUCTION

Future satellite instruments can provide raw data on the order of multiple terabytes per day. Due to the high data rate, satellite ground data processing will require considerable computing power to process and archive this data in near real-time. The primary mission of the National Oceanic and Atmospheric Administration (NOAA) is to understand and predict changes in the Earth's environment, which requires a continuous capability to acquire, process and archive data in real-time. NOAA's participation in the GIFTS technology transfer represents a risk-reduction activity in the design of the NOAA GOES-R series of imager and sounder instruments and their associated science algorithms. The GIFTS instrument uses a combination of Large area Focal Plane Arrays (LFPAs) and a Fourier Transform Spectrometer (FTS), providing a spectral resolution of 0.6 cm^-1 for a 128 x 128 set of 4 km footprints every 11 seconds [1]. It is anticipated that the GIFTS Level-0 data rate is about 55 Mbps, or about 1.5 terabytes per day [1]. The computing power needed for Level-0 to Level-1 processing is driven not only by the data volume but also by the inverse Fourier transform and the non-linearity correction.

The volume of Level-1 data is approximately the same as Level-0. Since there is little reduction in data volume from Level-0 to Level-1, producing Level-2 data also requires significant computing power. The GIFTS data processing system is required to generate critical products within 5 minutes of gathering an observation. Cluster technologies employing a multi-processor system present the only currently economically viable option to accommodate the processing needed for GIFTS ground data processing. However, such systems are inherently unstable: failure of one component may result in a failure of the whole system if the necessary measures are not taken. Operational real-time systems need to be reliable and fault-tolerant, operate on continuous data streams and be operator-friendly. To sustain high levels of system reliability and operability in a cluster-oriented operational environment, a fault-tolerant data processing framework is proposed to provide a platform for encapsulating science algorithms and hiding the complexities involved with an operational cluster system. Many Earth science algorithms are very complex, but they have only a small degree of spatial dependency and thus are ideal for parallel processing.

Designing a highly reliable operational cluster system for NOAA's future ground segments is the focus of this paper. In the rest of this section, work reported in the literature that is most relevant to our work is briefly discussed. In section 2, we describe the GIFTS instrument and the associated science algorithms. Section 3 describes the framework architecture used for GIFTS processing. Section 4 describes the implementation details of the prototype system. Section 5 shows benchmarking results and a tradeoff study for GIFTS ground system processing. Section 6 concludes the paper and comments on future directions of this work.

II. GIFTS INSTRUMENT AND SCIENCE ALGORITHMS

A. GIFTS Instrument Background

The GIFTS instrument consists of large-area focal plane detector arrays (128 x 128 pixels) within a Fourier Transform Spectrometer (FTS), mounted on a geostationary satellite. The instrument provides observations of Earth infrared radiance spectra at high spectral resolution (0.6 cm^-1) and high spatial resolution (4 km x 4 km pixels). Depending on spectral resolution, GIFTS views a large area (512 km x 512 km) of the Earth within a 1 to 11 second time interval. Extended Earth coverage is achieved by step-scanning the instrument field of view in a contiguous fashion across any desired portion of the visible Earth. A visible camera provides daytime imaging of clouds at 1 km spatial resolution. Figure 1 shows a selection of GIFTS measurement modes.

GIFTS uses two detector arrays to cover the spectral bands 685 to 1130 cm^-1 and 1650 to 2250 cm^-1, as shown in Figure 2, and a Michelson interferometer to obtain the spectrum of radiance within these bands. The spectral resolution of the measurements is sufficient to resolve, within 1-2 km vertical resolution, dynamic features of the atmospheric temperature and moisture profiles. The geostationary platform enables the tracing of fine-scale features of the atmospheric water (cloud and vapor) distribution, permitting the derivation of altitude-resolved wind profiles. Moreover, GIFTS will cover a major portion of the visible disk with high vertical resolution soundings in less than 30 minutes. This feature is important for obtaining wind profiles from geostationary temperature and moisture sounding data. As part of the University of Wisconsin's (UW's) algorithm development, simulations of expected top-of-atmosphere (TOA) radiances are being used for algorithm development and processing.

B. GIFTS Science Algorithm Pipeline

The GIFTS sensor will sample the interferogram from each detector as a function of optical path delay and numerically filter the data in real-time to reduce the data rate before transmission to the ground-based X-band receiver. The sensor collects views of the onboard calibration references and deep space at regular intervals. The ground reception facility will decode the telemetry stream and pass the GIFTS sensor data in real-time to a ground data processing facility. The GIFTS science algorithms are developed by the Space Science and Engineering Center (SSEC) at the University of Wisconsin (UW). These algorithms can be described as a pipeline consisting of a set of modules, including an initial Fast Fourier Transform (FFT), Non-linearity Correction, Radiometric Calibration, Spectral Calibration, Instrument Line Shape Correction and Spectral Resampling, as shown in Figure 3. These modules are described below; complete details can be obtained from Knuteson.

Initially, FFT operations are applied to GIFTS data to convert the measured interferograms into complex spectra: a complex Fourier transform and data folding are performed to convert the complex interferograms into complex spectra on a wave-number scale. In the second step we apply the non-linearity correction. This work is still under investigation by SSEC, UW. It is expected that the GIFTS detector material is highly linear in the range of photon fluxes used, but the electronics readout of the focal plane array can introduce a small signal non-linearity. The non-linearity correction algorithm is being designed using the ground-based AERI instrument as well as the Scanning-HIS instrument. In the third step, we perform radiometric calibration. This step ensures the GIFTS instrument requirement to measure brightness temperature to better than 1 K, with a reproducibility of 0.2 K. GIFTS uses views of two on-board blackbody sources (300 K and 265 K) along with cold space, sequenced at regular programmable intervals. The temperature difference between the two internal blackbody views provides the sensor slope term in the calibration equation, while the deep space view corrects for radiant emission from the telescope by establishing the offset term. This is followed by spectral calibration, which uses ground-based and aircraft FTS systems. The spectral characteristics of these instruments are defined by an Instrument Line Shape (ILS) and a spectral sampling interval. The spectral sampling scale is maintained very accurately by the stable laser used to trigger sampling at equal intervals of Optical Path Difference (OPD). The next stage is the instrument line shape correction. It has been evaluated that this effect is negligible for the GIFTS instrument, since it has an extremely small range of angles contributing to each individual detector pixel (< 1 mrad in the interferometer); as a result, the variation is extremely small and could even be ignored without introducing significant errors. Once the spectral calibration is determined, we perform wave-number resampling. The GIFTS radiance spectrum can be resampled from the original sampling interval to a standard reference wave-number scale. The resampling is performed in software using FFT, zero padding, and linear interpolation of an oversampled spectrum. The results of the wave-number resampling operation are equivalent to GIFTS spectra with a common wave-number scale independent of their location in the focal plane array. Our benchmarking results as presented in section 5 include the initial Fast Fourier Transform (FFT), the radiometric calibration and the spectral resampling stages, since the other modules are still in the development phase and not yet available for use in the framework.
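The radiometric calibration step admits a compact illustration. The sketch below shows the standard two-point FTS calibration form implied by the text, with the slope derived from the two blackbody views and the offset anchored by the cold (deep-space) view; the function and names are our own assumptions, not the GIFTS code.

#include <complex>

// Calibrated radiance for one spectral channel (assumed two-point form).
// c_scene, c_hot, c_cold: complex spectra of the scene, hot and cold views;
// b_hot, b_cold: known Planck radiances of the reference views.
double calibrate(std::complex<double> c_scene,
                 std::complex<double> c_hot,
                 std::complex<double> c_cold,
                 double b_hot, double b_cold) {
    // The complex ratio cancels the instrument phase; its real part carries
    // the signal. (b_hot - b_cold) is the slope term from the two blackbody
    // views; the cold view establishes the offset.
    std::complex<double> ratio = (c_scene - c_cold) / (c_hot - c_cold);
    return b_cold + ratio.real() * (b_hot - b_cold);
}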

III. GIFTS FRAMEWORK ARCHITECTURE

In this section we propose an architectural model for a robust and reliable operational satellite data processing framework consisting of:

Active/Standby Master
Active/Standby Data Input Server
Active/Standby Data Output Server
Reference and Audit Database Servers
Workers

The master is responsible for cluster management and task scheduling. The data input server provides the real-time data which is retrieved by the workers, while the data output server gathers the results produced by the workers. The reference database server provides access to a database storage unit which may contain instrument factory parameters, science-algorithm-specific parameters, algorithm descriptions and algorithm initialization parameters. The audit database server provides monitoring capability over the algorithms used, the products produced and so forth. The workers do the actual processing of the science algorithms. A Front End Processor (FEP) receives the data from the satellite and processes it into packets and frames according to the CCSDS format description. The frames constitute interferometer measurements, performance, engineering and diagnostics data. The FEP provides the input data to the data input server and is not part of the framework. Redundancy for the master, data input and data output servers is provided through an active/standby mechanism. Parallel file systems with server fail-over capability may serve as input and output servers. The databases, consisting of the reference database and the audit database, are envisioned to consist of their own commercially available database clusters, such as the MySQL Cluster, for fault-tolerance purposes.

The master is responsible for cluster management and task scheduling. A task is a unit of work which contains its own set of data, initialization parameters, reference id, timestamp and so forth. Each task has a unique identifier which differentiates it from other tasks within the system. Tasks may also be prioritized, in which case tasks with higher priorities are executed before low-priority tasks. The data input server retrieves the new incoming data, packages it appropriately and assigns unique identifiers to the tasks encapsulating the new data sets. The identifiers for the new tasks produced by the data input server are sent to the master, which in turn schedules the new tasks to be assigned to workers. The master may also define scheduling policies such that certain tasks are assigned to specific workers; we refer to this as selective scheduling. Such policies are beneficial in cases where specific tasks need initialization parameters which reside on specific workers and which would be computationally time-consuming to recompute. By assigning such tasks to the same worker or the same group of workers and using application-aware caches, a large amount of computation may be saved, resulting in higher performance. All communication between the various servers may be performed asynchronously to overlap computation with communication.

A task assigned to a worker by the master may not complete on time for various reasons, such as a network or worker node failure. A task is considered completed by the master when the task identifier is returned to the master by the worker. To implement task and data redundancy mechanisms, the task execution time, also referred to as the task latency, has to be known in advance. Specifying the task latency statically is unfeasible for most practical parallel applications. In the current design, task execution times are estimated dynamically based on previous actual task execution times and an over-estimation factor. The over-estimation factor for each task can be calculated as the inverse ratio of its estimated and actual execution times. The estimated execution time may be computed by taking the average of a certain number of previous task execution times multiplied by an over-estimation factor, which may or may not be a constant.
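A small illustration of this dynamic estimate (the names and the averaging window are our own assumptions, not from the paper):

#include <numeric>
#include <vector>

// Estimated task latency: the average of recent actual execution times,
// inflated by an over-estimation factor (not necessarily constant).
double estimate_latency(const std::vector<double>& recent_times,
                        double overestimation_factor) {
    double avg = std::accumulate(recent_times.begin(), recent_times.end(), 0.0)
               / recent_times.size();    // assumes at least one prior sample
    return avg * overestimation_factor;  // a task is declared lost beyond this
}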

A task which has not been completed within its estimated execution time is considered lost by the master and will be reassigned to a worker requesting tasks. The master keeps two queues of tasks: one for new tasks which have not yet been assigned to workers, and one for assigned tasks. When the master schedules a task to be assigned to a worker process, the task at the head of the queue of new tasks is retrieved and assigned to an appropriate worker; this task is then moved to the queue of assigned tasks. When the master receives a notification from the worker that the respective task has been computed, the task is removed from the queue of assigned tasks. However, if a task is not completed within its estimated execution time, it is considered lost: the task is removed from the queue of assigned tasks and placed at the head of the queue of new tasks, to be reassigned to a new worker. The worker which was initially assigned the respective task is considered faulty and is removed from the list of available workers.
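The two-queue bookkeeping described above can be sketched in a few lines of C++ (an assumed illustration, not the framework's actual code):

#include <deque>
#include <map>

struct Task { int id = 0; double deadline = 0; };  // deadline: estimated completion

std::deque<Task>    pending;    // new tasks, not yet assigned to workers
std::map<int, Task> assigned;   // task id -> task currently held by a worker

Task dispatch() {               // a worker has requested work; pending non-empty
    Task t = pending.front();
    pending.pop_front();
    assigned[t.id] = t;         // move from the new-task queue to assigned
    return t;
}

void on_completed(int task_id) {   // worker returned the task identifier
    assigned.erase(task_id);
}

void on_timeout(int task_id) {     // estimated execution time exceeded
    pending.push_front(assigned[task_id]);  // head of the queue: reassigned first
    assigned.erase(task_id);
    // the worker that held this task is also dropped from the available list
}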

IV. GIFTS PROTOTYPE IMPLEMENTATION

The prototype system includes an HP Linux cluster consisting of 32 dual-core AMD Opteron DL145 compute nodes with 1 GB RAM per CPU and 4 dual-core AMD Opteron DL385 management nodes with 1 GB RAM per CPU. The system runs Red Hat Enterprise Linux Server for AMD64 and has Gigabit Ethernet as well as a Myrinet interconnect for the entire system. HP Serviceguard is used to provide failover capability for the master node: it monitors the health of each master node and rapidly responds to failures in a way that minimizes or eliminates application downtime. This system is supplemented by an HP StorageWorks Scalable File Share (HP SFS), which is used to access data in parallel from all compute nodes. It has 16 terabytes of usable storage and provides a bandwidth of 1064 Mbytes/s for READ and 570 Mbytes/s for WRITE. It uses the Lustre parallel file system, one of the most extensively used parallel file systems.

Framework Implementation

The framework is implemented using the Message Passing Interface (MPI) and the C++ programming language. Since there are many implementations of MPI, we evaluated several of them, such as Open MPI, MPICH2 and LAM/MPI, and finally chose MPICH2 considering its support for the MPI-2 standard. Our framework prototype is a complete implementation of all the components described in section 3 except for the FEP. For the current framework implementation, the FEP was not used to provide the data; rather, the data was fetched directly from the input server. The framework separates the science algorithm layer from the cluster management layer and provides an operational platform within which algorithm software pipelines can be deployed for satellite data processing. There are classes and APIs for each service, such as master, input delivery, output delivery, reference database, audit database, and workers. The system can be configured at startup using configuration files in which hostnames may be specified for the various worker nodes, the reference and audit databases and so forth. Once the master finishes initialization, it starts the workers and communicates a unique identifier to each worker. Once a connection is established, if there is an outstanding task to be processed, it is assigned to a worker by the master. In case a worker is removed at run-time, its tasks are reassigned to other workers as described in section 3. In case of a master failover, all worker-master connections are lost and the worker processes are cleaned up. The failover process from standby to active involves reading the last checkpoint file written by the active unit, re-creating the system state accordingly and failing over from the standby to the active unit. The standby master takes over and creates new worker processes as well as redistributing tasks to the workers.

The framework provides algorithm independence through a set of base classes and interfaces for the algorithmic tasks. Each independent work unit is encapsulated in the Task class, containing fields for start time, completion time, compute time, unique task identifier, algorithm identifier, number of pixels, input and output data sizes, and other initialization parameters common to most satellite data processing algorithms we have evaluated so far. A task is started as soon as the master schedules it for a specific worker and is considered completed as soon as the master receives a notification from the worker that the respective task has been completed. Hence the completion time for each task includes communication, read, write and computation time. The compute time stored in the task contains the actual computational time for the task as observed by the worker. Each task has a unique task identifier to differentiate it from other tasks, and contains a field for an algorithm identifier specifying the algorithm or algorithms to be applied to the data. Each task contains references to its own set of input and output data files. Workers execute a task simply by calling the execute() method on an instance of the Task class. In this way the science algorithm layer is clearly separated from the framework layer.
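A sketch of the kind of Task interface the text describes (the names are our assumptions; the paper does not list the actual class definitions):

#include <cstdint>
#include <string>

class Task {
public:
    virtual ~Task() = default;
    virtual void execute() = 0;          // runs the encapsulated science algorithm(s)

    std::uint64_t task_id = 0;           // unique within the system
    std::uint32_t algorithm_id = 0;      // which algorithm(s) to apply to the data
    std::uint32_t num_pixels = 0;
    double start_time = 0, completion_time = 0, compute_time = 0;
    std::string input_file, output_file; // references to the task's own data sets
};

// Hypothetical specialization for one pipeline stage: a worker only ever
// calls execute() and never sees the framework internals.
class FftTask : public Task {
public:
    void execute() override {
        // read interferograms from input_file, transform, write spectra...
    }
};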

    V. RESULTS

    In this section we investigate our framework

    prototype in terms of performance and

    reliability. The framework prototype is

    evaluated using GIFTS science algorithms

    as described in section 2. Our

    benchmarking results show end to end

    performance using simulated data cubes as

    well as the effect of task size on the

    performance.

    A. Performance

Our benchmarking results include the initial FFT, the radiometric calibration, and the spectral resampling stages, since the other modules are still in the development phase and have not yet been released. For the benchmarking, a total of 64 data cubes is used; this provides a sufficient amount of work for each worker to hide any anomalies. Each data cube has a total of 128 x 128 pixels. In the first experiment we used a constant task size with the 64 data cubes and measured the execution time, speedup, and efficiency of the GIFTS pipeline. Two spectral ranges are used for the datasets: the long-wave infrared band (685-1129 cm^-1) and the short-medium wave band (1650-2250 cm^-1). Figure 4 shows the total execution time versus the number of processors for end-to-end GIFTS pipeline processing. Figure 5 shows the speedup versus the number of processors for the same experiment: we achieve a near-linear speedup of 54.56 against an ideal speedup of 64. Figure 6 shows the efficiency versus the number of processors; we are able to maintain an efficiency of 85 percent throughout the experiment.
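For reference, these numbers follow from the standard definitions of speedup and efficiency, which the small sketch below applies to the measured values:

    # Standard definitions: speedup S(p) = T(1) / T(p); efficiency E(p) = S(p) / p.
    p = 64                    # number of processors
    speedup = 54.56           # measured speedup from figure 5
    efficiency = speedup / p  # 54.56 / 64 = 0.8525, i.e. about 85 percent
    print(f"efficiency on {p} processors: {efficiency:.2%}")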

In the second experiment, we evaluated the effect of network bandwidth by increasing the task size in terms of the number of pixels per task. This reduces the number of messages between the workers and the master, which in turn reduces worker-master communication. The experiment was conducted with a total of 64 data cubes running on 64 processors, using tasks containing 128, 256, 1024, and 2048 pixels. Figure 7 shows the total execution time versus task size. Increasing the task size from 128 pixels per task to 2048 pixels per task increased the total execution time by only about 0.7 percent; hence the effect of task size on the total execution time is negligible. The total amount of computation time remained constant, as the total work remained constant. Figure 8 shows the effect of task size on READ and WRITE times. For a task size of 128 pixels per task, blocks of 128 pixels were used to perform READs and WRITEs, whereas for task sizes greater than or equal to 256 pixels per task, blocks of 256 pixels were used.
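The reduction in messages is easy to quantify: with a fixed total workload, the number of tasks, and hence of scheduling and notification messages, scales inversely with the task size. A back-of-the-envelope sketch, assuming one assignment and one completion message per task (the exact protocol overhead is not spelled out here):

    # Total workload: 64 data cubes of 128 x 128 pixels each.
    TOTAL_PIXELS = 64 * 128 * 128              # 1,048,576 pixels
    for task_size in (128, 256, 1024, 2048):   # pixels per task, as benchmarked
        n_tasks = TOTAL_PIXELS // task_size
        n_messages = 2 * n_tasks               # assumed: one assign + one complete
        print(f"{task_size:5d} px/task -> {n_tasks:5d} tasks, {n_messages:6d} messages")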

In the third experiment, we investigated the effect of the bandwidth of the parallel file system to determine the optimum number of pixels to use when performing I/O for maximum speedup. In this experiment we kept the task size at a constant value of 1024 pixels per task while increasing the block size used for bulk I/O; block sizes of 32, 64, 128, 256, 512, and 1024 pixels were used. Figure 9 shows the total execution time versus block size in pixels: a block size of 256 pixels results in the least execution time. Figure 10 shows the bulk I/O time versus block size in pixels. A block size of 256 pixels is optimal for READ, while the block size has a negligible effect on WRITE.
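As an illustration of the bulk-I/O pattern being benchmarked, the sketch below processes one data cube in fixed-size pixel blocks; read_pixels, process, and write_pixels are hypothetical placeholders for the framework's actual I/O and science routines:

    BLOCK_SIZE = 256         # pixels per bulk read/write; optimal per figure 9
    CUBE_PIXELS = 128 * 128  # pixels per data cube

    def process_cube(read_pixels, process, write_pixels):
        """Blocked bulk I/O: handle BLOCK_SIZE pixels per iteration (sketch)."""
        for offset in range(0, CUBE_PIXELS, BLOCK_SIZE):
            block = read_pixels(offset, BLOCK_SIZE)   # bulk READ
            write_pixels(offset, process(block))      # compute, then bulk WRITE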

FIGURES:-

[Figures 4-10: total execution time, speedup, and efficiency versus number of processors; total execution time and bulk I/O time versus task size and block size. The images are not reproduced here.]

    B. Reliability

In this section, we evaluate the reliability of the framework in the case of worker failures and their impact on performance. A task assigned to a worker by the master may not complete on time due to a worker node or worker process failure. In our framework, an estimated execution time is computed by taking the average of a certain number of previous task execution times and multiplying it by an over-estimation factor, as described in section 2. A task which has not been completed within its estimated execution time is considered lost by the master and is reassigned to another worker requesting tasks. The worker which was initially assigned the lost task is considered faulty and is removed from the list of available workers.
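A minimal sketch of this timeout rule, assuming a sliding window of recent task times and a configurable over-estimation factor (neither value is specified here):

    from collections import deque

    class TimeoutEstimator:
        """Decide when a task should be declared lost (sketch)."""

        def __init__(self, window=10, overestimation=2.0):
            self.history = deque(maxlen=window)   # recent task execution times
            self.overestimation = overestimation  # over-estimation factor

        def record(self, elapsed):
            self.history.append(elapsed)          # call for each completed task

        def is_lost(self, elapsed_so_far):
            if not self.history:
                return False                      # no estimate available yet
            estimate = sum(self.history) / len(self.history)
            return elapsed_so_far > estimate * self.overestimation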

It is assumed that an operational system will have a safety margin to ensure the availability of the minimum required number of worker processes at all times. To investigate the capability of the framework in the case of worker failures, artificially induced faults are introduced into the system: worker processes are shut down according to an exponential distribution after they are assigned tasks. An example of the worker failure distribution during the course of one of the experiments is shown in figure 11.
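A tiny sketch of this fault-injection scheme; the mean time to failure below is purely illustrative, as the distribution parameter is not given here:

    import random

    MEAN_TIME_TO_FAILURE = 120.0  # seconds; illustrative value only

    def shutdown_delay():
        """Delay before a worker is shut down, drawn from an
        exponential distribution (sketch)."""
        return random.expovariate(1.0 / MEAN_TIME_TO_FAILURE)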

In our experiments, we assume that the required number of worker processes is 32, with a safety margin of 100 percent. The experiments therefore start with 64 worker processes, which are shut down until only half of the initial processes remain available. We verified that the master can successfully identify failed workers and reassign their tasks to the available worker processes. Figure 12 shows the speedup for a series of such experiments, in which the speedup varied from about 38 to 53, with a mean value of 44.88. This may be compared to the speedup of 54.56 shown in figure 5 without worker failures.

FIGURES:-

[Figures 11-12: worker failure distribution over time, and speedup under worker failures. The images are not reproduced here.]

VI. CONCLUSIONS AND FUTURE WORK

In this paper we have proposed a highly robust and reliable framework for the distributed real-time processing of satellite data for a GIFTS ground system. Furthermore, we presented an architectural model addressing the performance, reliability, and scalability of candidate hardware and software for such a framework. It has been shown that considerable performance can be gained for the GIFTS science algorithms without sacrificing the reliability and high-availability requirements of the operational system. Future work involves processing the GIFTS science algorithm pipeline using real thermal-vacuum data. Depending on the flight of the GIFTS instrument, the framework prototype shown in this paper will need to be transitioned into an operational environment requiring data to be processed in near real-time. This requires further scalability studies in terms of the number of processors needed to keep up with the GIFTS data rate.

    BIBLIOGRAPHY

I used the following web links, found via the main website www.osun.org, to prepare this term paper:

http://vecpar.fe.up.pt/2006/programme/papers/36.pdf

http://www.idosi.org/wasj/wasj5(2)/14.pdf

http://comet.lehman.cuny.edu/vpan/pdf/lpieeetc01_184.pdf

http://www.cs.tufts.edu/~jacob/papers/supercomputing.deligiannidis.pdf

http://www.cs.ru.ac.za/research/Groups/vrsig/pastprojects/001paralleland distributedvr/paper02.pdf

http://www.dhpc.adelaide.edu.au/reports/108/dhpc-108.pdf

http://ams.confex.com/ams/pdfpapers/69620.pdf
