01 01 Parallel Computing Explained


Slides prepared from the CI-Tutor courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/

By S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University

Parallel Computing Explained

Parallel Computing Overview

    Agenda

    1 Parallel Computing Overview

    2 How to Parallelize a Code

    3 Porting Issues

4 Scalar Tuning

5 Parallel Code Tuning

    6 Timing and Profiling

    7 Cache Tuning

    8 Parallel Performance Analysis

    9 About the IBM Regatta P690

    Agenda

    1 Parallel Computing Overview

    1.1 Introduction to Parallel Computing

    1.1.1 Parallelism in our Daily Lives

    1.1.2 Parallelism in Computer Programs

1.1.3 Parallelism in Computers

1.1.4 Performance Measures

    1.1.5 More Parallelism Issues

    1.2 Comparison of Parallel Computers

    1.3 Summary

    Parallel Computing Overview

Who should read this chapter?

New Users: to learn concepts and terminology.

Intermediate Users: for review or reference.

Management Staff: to understand the basic concepts, even if you don't plan to do any programming.

    Note: Advanced users may opt to skip this chapter.

Introduction to Parallel Computing

High performance parallel computers can solve large problems much faster than a desktop computer:

fast CPUs, large memory, high speed interconnects, and high speed input/output

able to speed up computations

by making the sequential components run faster

by doing more operations in parallel

High performance parallel computers are in demand:

need for tremendous computational capabilities in science, engineering, and business

require gigabytes/terabytes of memory and gigaflops/teraflops of performance

scientists are striving for petascale performance

Introduction to Parallel Computing

HPPC are used in a wide variety of disciplines:

Meteorologists: prediction of tornadoes and thunderstorms

Computational biologists: analyze DNA sequences

Pharmaceutical companies: design of new drugs

Oil companies: seismic exploration

Wall Street: analysis of financial markets

NASA: aerospace vehicle design

Entertainment industry: special effects in movies and commercials

These complex scientific and business applications all need to perform computations on large datasets or large equations.

Parallelism in our Daily Lives

There are two types of processes that occur in computers and in our daily lives:

Sequential processes

occur in a strict order

it is not possible to do the next step until the current one is completed.

Examples

The passage of time: the sun rises and the sun sets.

Writing a term paper: pick the topic, research, and write the paper.

Parallel processes

many events happen simultaneously

Examples

Plant growth in the springtime

An orchestra

    Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

    1.1.1 Parallelism in our Daily Lives

    1.1.2 Parallelism in Computer Programs

1.1.2.1 Data Parallelism

1.1.2.2 Task Parallelism

    1.1.3 Parallelism in Computers

    1.1.4 Performance Measures

    1.1.5 More Parallelism Issues

    1.2 Comparison of Parallel Computers

    1.3 Summary

Parallelism in Computer Programs

Conventional wisdom:

Computer programs are sequential in nature

Only a small subset of them lend themselves to parallelism.

Algorithm: the "sequence of steps" necessary to do a computation.

For the first 30 years of computer use, programs were run sequentially.

The 1980's saw great successes with parallel computers.

Dr. Geoffrey Fox published a book entitled Parallel Computing Works!

many scientific accomplishments resulting from parallel computing

Parallel Computing

What a computer does when it carries out more than one computation at a time using more than one processor.

By using many processors at once, we can speed up the execution.

If one processor can perform the arithmetic in time t, then ideally p processors can perform the arithmetic in time t/p.

What if I use 100 processors? What if I use 1000 processors?

Almost every program has some form of parallelism.

You need to determine whether your data or your program can be partitioned into independent pieces that can be run simultaneously.

Decomposition is the name given to this partitioning process.

Types of parallelism: data parallelism and task parallelism.
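
As a rough worked example of the ideal case just described (illustrative numbers, not from the original slides, and ignoring all communication overhead):

$$ t_p = \frac{t}{p}, \qquad S(p) = \frac{t}{t_p} = p $$

so a computation that takes t = 1000 seconds on one processor would ideally take 10 seconds on p = 100 processors and 1 second on p = 1000 processors. Real programs fall short of this ideal because of overhead and the parts that cannot be run in parallel.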

    Data Parallelism

The same code segment runs concurrently on each processor, but each processor is assigned its own part of the data to work on.

Do loops (in Fortran) define the parallelism.

The iterations must be independent of each other.

Data parallelism is called "fine grain parallelism" because the computational work is spread into many small subtasks.

Example

Dense linear algebra, such as matrix multiplication, is a perfect candidate for data parallelism.

An example of data parallelism

Original Sequential Code:

DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO

Parallel Code:

!$OMP PARALLEL DO
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
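
The fragment above is a directive-only excerpt. Below is a minimal, self-contained sketch of the same idea (not from the original slides; array size and values are made up). Note that parallelizing the K loop as written would let several threads update the same C(I,J), so this sketch parallelizes the J loop instead, which keeps the iterations independent. Compile with an OpenMP-capable compiler, for example gfortran -fopenmp.

! Minimal sketch of OpenMP data parallelism on the matrix-multiply loop.
program data_parallel_example
  implicit none
  integer, parameter :: n = 200
  real :: a(n,n), b(n,n), c(n,n)
  integer :: i, j, k
  a = 1.0
  b = 2.0
  c = 0.0
!$OMP PARALLEL DO PRIVATE(I,K)
  do j = 1, n              ! each thread owns a distinct set of columns j
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do
!$OMP END PARALLEL DO
  print *, 'C(1,1) =', c(1,1)   ! expected value: 2*N = 400.0
end program data_parallel_example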

    Quick Intro to OpenMP

OpenMP is a portable standard for parallel directives covering both data and task parallelism.

More information about OpenMP is available on the OpenMP website.

We will have a lecture on Introduction to OpenMP later.

With OpenMP, the loop that is performed in parallel is the loop that immediately follows the Parallel Do directive. In our sample code, it's the K loop:

DO K=1,N

OpenMP Loop Parallelism

Iteration-Processor Assignments

The code segment running on each processor:

DO J=1,N
  DO I=1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
  END DO
END DO

Processor   Iterations of K   Data Elements
proc0       K=1:5             A(I, 1:5),   B(1:5, J)
proc1       K=6:10            A(I, 6:10),  B(6:10, J)
proc2       K=11:15           A(I, 11:15), B(11:15, J)
proc3       K=16:20           A(I, 16:20), B(16:20, J)

    OpenMP Style of Parallelism

The OpenMP style of parallelism can be done incrementally as follows:

1. Parallelize the most computationally intensive loop.

2. Compute performance of the code.

3. If performance is not satisfactory, parallelize another loop.

4. Repeat steps 2 and 3 as many times as needed.

The ability to perform incremental parallelism is considered a positive feature of data parallelism.

It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.

Task Parallelism

Task parallelism may be thought of as the opposite of data parallelism.

Instead of the same operations being performed on different parts of the data, each process performs different operations.

You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently.

Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks.

More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism.

Task parallelism is often easier to implement and has less overhead than data parallelism.

    Task Parallelism

The abstract code shown in the diagram is decomposed into 4 independent code segments that are labeled A, B, C, and D. The right hand side of the diagram illustrates the 4 code segments running concurrently.

Task Parallelism

Original Code:

program main
  code segment labeled A
  code segment labeled B
  code segment labeled C
  code segment labeled D
end

Parallel Code:

program main
!$OMP PARALLEL
!$OMP SECTIONS
  code segment labeled A
!$OMP SECTION
  code segment labeled B
!$OMP SECTION
  code segment labeled C
!$OMP SECTION
  code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
end
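
A minimal, self-contained sketch of the same SECTIONS structure (not from the original slides): the four code segments are stand-ins that simply report which thread executed them.

! Minimal sketch of OpenMP task parallelism with sections.
program task_parallel_example
  use omp_lib
  implicit none
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
  print *, 'segment A on thread', omp_get_thread_num()
!$OMP SECTION
  print *, 'segment B on thread', omp_get_thread_num()
!$OMP SECTION
  print *, 'segment C on thread', omp_get_thread_num()
!$OMP SECTION
  print *, 'segment D on thread', omp_get_thread_num()
!$OMP END SECTIONS
!$OMP END PARALLEL
end program task_parallel_example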

    OpenMP Task Parallelism

With OpenMP, the code that follows each SECTION(S) directive is allocated to a different processor. In our sample parallel code, the allocation of code segments to processors is as follows:

Processor   Code
proc0       code segment labeled A
proc1       code segment labeled B
proc2       code segment labeled C
proc3       code segment labeled D

    Parallelism in Computers

How parallelism is exploited and enhanced within the operating system and hardware components of a parallel computer:

    operating system

    arithmetic

    memory

    disk

Operating System Parallelism

All of the commonly used parallel computers run a version of the Unix operating system. In the table below each OS listed is in fact Unix, but the name of the Unix OS varies with each vendor.

For more information about Unix, a collection of Unix documents is available.

Parallel Computer      OS
SGI Origin2000         IRIX
HP V-Class             HP-UX
Cray T3E               Unicos
IBM SP                 AIX
Workstation Clusters   Linux

Two Unix Parallelism Features

background processing facility

With the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:

a.out > results &
man etime

cron feature

With the Unix cron feature you can submit a job that will run at a later time.

    Arithmetic Parallelism

    Multiple execution units facilitate arithmetic parallelism.

The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently.

Fused multiply and add is another parallel arithmetic feature.

Parallel computers are able to overlap multiply and add. This arithmetic is named MultiplyADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle.

Superscalar arithmetic

is the ability to issue several arithmetic operations per computer cycle. It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.
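
As a small illustration (a sketch, not from the original slides): the multiply and add in an AXPY-style loop are exactly the pair of operations that MADD/FMA hardware can complete as one fused instruction, if the compiler chooses to use it.

! Minimal sketch: a loop whose multiply and add a compiler can map onto a
! fused multiply-add (MADD/FMA) instruction on hardware that has one.
subroutine axpy(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  do i = 1, n
     y(i) = a * x(i) + y(i)   ! one multiply + one add per iteration
  end do
end subroutine axpy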

    Memory Parallelism

memory interleaving

Memory is divided into multiple banks, and consecutive data elements are interleaved among them. For example, if your computer has 2 memory banks, then data elements with even memory addresses would fall into one bank, and data elements with odd memory addresses into the other.

multiple memory ports

Port means a bi-directional memory pathway. When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s).

multiple levels of the memory hierarchy

There is global memory that any processor can access. There is memory that is local to a partition of the processors. Finally, there is memory that is local to a single processor, that is, the cache memory and the memory elements held in registers.

Cache memory

Cache is a small memory that has fast access compared with the larger main memory and serves to keep the faster processor filled with data.
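
A tiny sketch of the 2-bank interleaving rule described above (illustrative only; real memory controllers use the low-order address bits in a similar way):

! Minimal sketch: consecutive addresses alternate between bank 0 and bank 1.
program interleave_demo
  implicit none
  integer :: address, bank
  do address = 0, 7
     bank = mod(address, 2)   ! even addresses -> bank 0, odd -> bank 1
     print *, 'address', address, 'lives in bank', bank
  end do
end program interleave_demo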

    Memory Parallelism

[Figures: Memory Hierarchy and Cache Memory diagrams]

    Disk Parallelism

RAID (Redundant Array of Inexpensive Disks)

RAID disks are on most parallel computers.

The advantage of a RAID disk system is that it provides a measure of fault tolerance.

If one of the disks goes down, it can be swapped out, and the RAID disk system remains operational.

Disk Striping

When a data set is written to disk, it is striped across the RAID disk system. That is, it is broken into pieces that are written simultaneously to the different disks in the RAID disk system. When the same data set is read back in, the pieces are read in parallel, and the full data set is reassembled in memory.

    Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

    1.1.1 Parallelism in our Daily Lives

    1.1.2 Parallelism in Computer Programs

1.1.3 Parallelism in Computers

1.1.4 Performance Measures

    1.1.5 More Parallelism Issues

    1.2 Comparison of Parallel Computers

    1.3 Summary

    Performance Measures

Peak Performance

is the top speed at which the computer can operate. It is a theoretical upper limit on the computer's performance.

Sustained Performance

is the highest consistently achieved speed. It is a more realistic measure of computer performance.

Cost Performance

is used to determine if the computer is cost effective.

MHz

is a measure of the processor speed. The processor speed is commonly measured in millions of cycles per second, where a computer cycle is defined as the shortest time in which some work can be done.

MIPS

is a measure of how quickly the computer can issue instructions. Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as: memory reads and writes, logical operations, floating point operations, integer operations, and branch instructions.
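
A worked example with made-up numbers (not from the slides): a processor clocked at 500 MHz that can complete 4 floating-point operations per cycle has a theoretical peak of

$$ 500 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} \times 4\ \tfrac{\text{flops}}{\text{cycle}} = 2000\ \text{Mflops} = 2\ \text{Gflops}, $$

while its sustained performance on real applications is normally well below this peak.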

    Performance Measures

Mflops (Millions of floating point operations per second)

measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide.

Speedup

measures the benefit of parallelism. It shows how your program scales as you compute with more processors, compared to the performance on one processor. Ideal speedup happens when the performance gain is linearly proportional to the number of processors used.

Benchmarks

are used to rate the performance of parallel computers and parallel programs. A well known benchmark that is used to compare parallel computers is the Linpack benchmark. Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim.

    More Parallelism Issues

Load balancing

is the technique of evenly dividing the workload among the processors. For data parallelism it involves how iterations of loops are allocated to processors. Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread.

The problem size

must be large and must be able to grow as you compute with more processors. In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes, otherwise the overhead of passing information between processors will dominate the calculation time.

Good software tools

are essential for users of high performance parallel computers. These tools include:

parallel compilers

parallel debuggers

    More Parallelism Issues

The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important.

A workstation farm

is defined as a fast network connecting heterogeneous workstations.

The individual workstations serve as desktop systems for their owners.

When they are idle, large problems can take advantage of the unused cycles in the whole system.

An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute.

Condor

is software that provides resource management services for applications that run on heterogeneous collections of workstations.

    Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

1.2 Comparison of Parallel Computers

1.2.1 Processors

    1.2.2 Memory Organization

    1.2.3 Flow of Control

    1.2.4 Interconnection Networks

    1.2.4.1 Bus Network

    1.2.4.2 Cross-Bar Switch Network

1.2.4.3 Hypercube Network

1.2.4.4 Tree Network

    1.2.4.5 Interconnection Networks Self-test

    1.2.5 Summary of Parallel Computer Characteristics

    1.3 Summary

Comparison of Parallel Computers

Now you can explore the hardware components of parallel computers:

kinds of processors

types of memory organization

flow of control

interconnection networks

You will see what is common to these parallel computers, and what makes each one of them unique.

    Kinds of Processors

There are three types of parallel computers:

1. computers with a small number of powerful processors

Typically have tens of processors.

The cooling of these computers often requires very sophisticated and expensive equipment, making these computers very expensive for computing centers.

They are general-purpose computers that perform especially well on applications that have large vector lengths.

The examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.

    Kinds of Processors

There are three types of parallel computers:

2. computers with a large number of less powerful processors

Named a Massively Parallel Processor (MPP), typically have thousands of processors.

The processors are usually proprietary and air-cooled.

Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.

These computers are suitable for applications with a high degree of concurrency.

The MPP type of computer was popular in the 1980s.

Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MassPar company.

    Kinds of Processors

There are three types of parallel computers:

3. computers that are medium scale, in between the two extremes

Typically have hundreds of processors.

The processor chips are usually not proprietary; rather they are commodity processors like the Pentium III.

These are general-purpose computers that perform well on a wide range of applications.

The most common example of this class is the Linux Cluster.

    Trends and Examples

Processor trends:

Decade   Processor Type                    Computer Example
1970s    Pipelined, Proprietary            Cray-1
1980s    Massively Parallel, Proprietary   Thinking Machines CM2
1990s    Superscalar, RISC, Commodity      SGI Origin2000
2000s    CISC, Commodity                   Workstation Clusters

The processors on today's commonly used parallel computers:

Computer               Processor
SGI Origin2000         MIPS RISC R12000
HP V-Class             HP PA 8200
Cray T3E               Compaq Alpha
IBM SP                 IBM Power3
Workstation Clusters   Intel Pentium III, Intel Itanium

    Memory Organization

The following paragraphs describe the three types of memory organization found on parallel computers:

    distributed memory

    shared memory

    distributed shared memory

    Distributed Memory

In distributed memory computers, the total memory is partitioned into memory that is private to each processor.

There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors.

On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access.

Some examples are the Cray T3E and the IBM SP.

    Distributed Memory

When programming distributed memory computers, the code and the data should be structured such that the bulk of a processor's data accesses are to its own private (local) memory.

This is called having good data locality.

Today's distributed memory computers use message passing, such as MPI, to communicate between processors, as shown in the diagram.
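
A minimal message-passing sketch (not from the original slides) using standard MPI calls: process 0 sends one integer to process 1. Compile with an MPI Fortran wrapper (for example mpif90) and run with two processes.

! Minimal sketch of message passing between two processes.
program mpi_send_recv
  use mpi
  implicit none
  integer :: ierr, rank, value, status(MPI_STATUS_SIZE)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
     value = 42
     call MPI_Send(value, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(value, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
     print *, 'rank 1 received', value
  end if
  call MPI_Finalize(ierr)
end program mpi_send_recv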

    Distributed Memory

One advantage of distributed memory computers is that they are easy to scale. As the demand for resources grows, computer centers can easily add more memory and processors.

This is often called the LEGO block approach.

The drawback is that programming of distributed memory computers can be quite complicated.

    Shared Memory

In shared memory computers, all processors have access to a single pool of centralized memory with a uniform address space.

Any processor can address any memory location at the same speed, so there is Uniform Memory Access time (UMA).

Processors communicate with each other through the shared memory.

The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers.

They are easier to program because they resemble the programming of single processor machines.

But they don't scale like their distributed memory counterparts.

    Distributed Shared Memory

In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory. It accesses the memory of a different processor cluster in a NUMA fashion.

Memory is physically distributed but logically shared. Attention to data locality again is important.

Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.

That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.

Some examples of DSM computers are the SGI Origin2000 and the HP V-Class.

    Trends and Examples

Memory organization trends:

Decade   Memory Organization         Example
1970s    Shared Memory               Cray-1
1980s    Distributed Memory          Thinking Machines CM-2
1990s    Distributed Shared Memory   SGI Origin2000
2000s    Distributed Memory          Workstation Clusters

The memory organization of today's commonly used parallel computers:

Computer               Memory Organization
SGI Origin2000         DSM
HP V-Class             DSM
Cray T3E               Distributed
IBM SP                 Distributed
Workstation Clusters   Distributed

    Flow of Control

When you look at the control of flow you will see three types of parallel computers:

    Single Instruction Multiple Data (SIMD)

    Multiple Instruction Multiple Data (MIMD)

    Single Program Multiple Data (SPMD)

Flynn's Taxonomy

Flynn's Taxonomy, devised in 1972 by Michael Flynn of Stanford University, describes computers by how streams of instructions interact with streams of data.

There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers as shown in the diagram below:

Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD and MIMD.

Of these 4, only SIMD and MIMD are applicable to parallel computers.

Another computer type, SPMD, is a special case of MIMD.

    SIMD Computers

SIMD stands for Single Instruction Multiple Data.

Each processor follows the same set of instructions, with different data elements being allocated to each processor.

SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step.

SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks.

Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MassPar company.

The processors are commanded by the global controller that sends instructions to the processors.

It says add, and they all add. It says shift to the right, and they all shift to the right.

The processors are like obedient soldiers, marching in unison.

    MIMD Computers

MIMD stands for Multiple Instruction Multiple Data.

There are multiple instruction streams with separate code segments distributed among the processors.

MIMD is actually a superset of SIMD, so that the processors can run the same instruction stream or different instruction streams.

In addition, there are multiple data streams; different data elements are allocated to each processor.

MIMD computers can have either distributed memory or shared memory.

While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other.

MIMD computers can be used for either data parallel or task parallel applications.

Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.

    SPMD Computers

SPMD stands for Single Program Multiple Data.

SPMD is a special case of MIMD.

SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor.

With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously.

Unlike SIMD, the synchronous execution of instructions is relaxed.

An example is the execution of an if statement on a SPMD computer.

Because each processor computes with its own partition of the data elements, it may evaluate the right hand side of the if statement differently from another processor.

One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement.

Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next.

The analogies we used for describing SIMD computers can be
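
A minimal sketch of the SPMD branching behavior described above (not from the original slides; the data values are made up): every process runs the same program, but each evaluates the if test on its own data and may take a different branch.

! Minimal sketch of SPMD execution: same program, data-dependent branches.
program spmd_branch
  use mpi
  implicit none
  integer :: ierr, rank, mydata
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  mydata = rank * 10          ! stand-in for each process's partition of the data
  if (mydata > 15) then
     print *, 'rank', rank, 'took the large-data branch'
  else
     print *, 'rank', rank, 'took the small-data branch'
  end if
  call MPI_Finalize(ierr)
end program spmd_branch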

Summary of SIMD versus MIMD

                    SIMD                      MIMD
Memory              distributed memory        distributed memory or shared memory
Code Segment        same per processor        same or different
Processors Run In   lock step                 asynchronously
Data Elements       different per processor   different per processor
Applications        data parallel             data parallel or task parallel

    Trends and Examples

    Flow of control trends:

Decade   Flow of Control   Computer Example
1980's   SIMD              Thinking Machines CM-2
1990's   MIMD              SGI Origin2000
2000's   MIMD              Workstation Clusters

The flow of control on today's commonly used parallel computers:

Computer               Flow of Control
SGI Origin2000         MIMD
HP V-Class             MIMD
Cray T3E               MIMD
IBM SP                 MIMD
Workstation Clusters   MIMD

    Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

1.2 Comparison of Parallel Computers

1.2.1 Processors

    1.2.2 Memory Organization

    1.2.3 Flow of Control

    1.2.4 Interconnection Networks

    1.2.4.1 Bus Network

    1.2.4.2 Cross-Bar Switch Network

    1.2.4.3 Hypercube Network

    1.2.4.4 Tree Network

    1.2.4.5 Interconnection Networks Self-test

    1.2.5 Summary of Parallel Computer Characteristics

    1.3 Summary

    Interconnection Networks

What exactly is the interconnection network?

The interconnection network is made up of the wires and cables that define how the multiple processors of a parallel computer are connected to each other and to the memory units.

The time required to transfer data is dependent upon the specific type of the interconnection network. This transfer time is called the communication time.

What network characteristics are important?

Diameter: the maximum distance that data must travel for 2 processors to communicate.

Bandwidth: the amount of data that can be sent through a network connection.

Latency: the delay on a network while a data packet is being stored and forwarded.

Types of Interconnection Networks

The network topologies (geometric arrangements of the computer network connections) are:

Bus

Cross-bar Switch

Hypercube

    Interconnection Networks

The aspects of network issues are: Cost, Scalability, Reliability, Suitable Applications, Data Rate, Diameter, and Degree.

General Network Characteristics

Some networks can be compared in terms of their degree and diameter.

Degree: how many communicating wires are coming out of each processor. A large degree is a benefit because it has multiple paths.

Diameter: the distance between the two processors that are farthest apart. A small diameter corresponds to low latency.

    Bus Network

Bus topology is the original coaxial cable-based Local Area Network (LAN) topology in which the medium forms a single bus to which all stations are attached.

The positive aspects

It is also a mature technology that is well known and reliable.

The cost is also very low.

It is simple to construct.

The negative aspects

limited data transmission rate

not scalable in terms of performance

Example: SGI Power Challenge.

    Cross-Bar Switch Network

A cross-bar switch is a network that works through a switching mechanism to access shared memory.

It scales better than the bus network but it costs significantly more.

The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class.

Here is a diagram of a cross-bar switch network which shows the processors talking through the switchboxes to store or retrieve data in memory.

There are multiple paths for a processor to communicate with a certain memory.

The switches

Hypercube Network

In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N-dimensional cube is directly connected to N other nodes.

The fact that the number of directly connected, "nearest neighbor", nodes increases with the total size of the network is also highly desirable for a parallel computer.

The degree of a hypercube network is log n and the diameter is log n, where n is the number of processors.

Examples of computers with a hypercube network include the SGI Origin2000.
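
A worked example using the formulas on this slide (illustrative number of processors): for a hypercube with n = 16 processors,

$$ \text{degree} = \text{diameter} = \log_2 16 = 4, $$

so each processor has 4 direct neighbors and a message needs at most 4 hops between any two processors.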

    Tree Network

The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down.

This is useful for decision making applications that can be mapped as trees.

The degree of a tree network is 1. The diameter of the network is 2 log(n+1) - 2, where n is the number of processors.

The Thinking Machines CM-5 is an example of a parallel computer with this type of network.

Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.
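
A worked example using the diameter formula on this slide (illustrative number of processors): for n = 7 processors at the leaves,

$$ \text{diameter} = 2\log_2(7+1) - 2 = 2 \cdot 3 - 2 = 4, $$

so a message between the two most distant processors crosses at most 4 links on its way up and back down the tree.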

Interconnection Networks

Torus Network: A mesh with wrap-around connections in both the x and y directions.

Multistage Network: A network with more than one networking unit.

Fully Connected Network: A network where every processor is connected to every other processor.

Hypercube Network: Processors are connected as if they were corners of a multidimensional cube.

Mesh Network: A network where each interior processor is connected to its four nearest neighbors.

Interconnection Networks

Bus Based Network: Coaxial cable based LAN topology in which the medium forms a single bus to which all stations are attached.

Cross-bar Switch Network: A network that works through a switching mechanism to access shared memory.

Tree Network: The processors are the bottom nodes of the tree.

Ring Network: Each processor is connected to two others and the line of connections forms a circle.

Summary of Parallel Computer Characteristics

How many processors does the computer have?

10s?

100s?

1000s?

How powerful are the processors?

what's the MHz rate

what's the MIPS rate

What's the instruction set architecture?

RISC

CISC

Summary of Parallel Computer Characteristics

How much memory is available?

total memory

memory per processor

What kind of memory?

distributed memory

shared memory

distributed shared memory

What type of flow of control?

SIMD

MIMD

SPMD

Summary of Parallel Computer Characteristics

What is the interconnection network?

Bus

Crossbar

Hypercube

Tree

Torus

Multistage

Fully Connected

Mesh

Ring

Hybrid

Design decisions made by some of the major parallel computer vendors

Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet, Tree

    Summary

This completes our introduction to parallel computing.

You have learned about parallelism in computer programs, and also about parallelism in the hardware components of parallel computers.

In addition, you have learned about the commonly used parallel computers, and how these computers compare to each other.

There are many good texts which provide an introductory treatment of parallel computing. Here are two useful references:

Highly Parallel Computing, Second Edition. George S. Almasi and Allan Gottlieb. Benjamin/Cummings Publishers, 1994.

Parallel Computing Theory and Practice