Concepts of Parallel Programming



    Concepts of Parallel Computing

    Alf Wachsmann
    Stanford Linear Accelerator Center (SLAC)
    [email protected]

    Why do it in parallel?

    Why is parallel computing a good idea? 1 worker needs 3 days to dig a ditch.

    How long do 3 workers need?

    Parallel computing is (in the most general sense) the simultaneous use of multiple compute resources to solve a computational problem.

    What about this one: 1 tree needs 30 years to grow big.

    How long do 3 trees need?


    Parallel Addition

    Diagram in space and time
    Abstraction from communication (the hard part!)

    [Diagram: adding 1 through 16 on 8 processors. Step 1: each processor computes one pair (1+2, 3+4, ..., 15+16), giving 3, 7, 11, 15, 19, 23, 27, 31. Step 2: pairwise combination gives 10, 26, 42, 58; step 3 gives 36, 100; the final step gives 136. Wall-clock time runs along one axis, processors 1-8 along the other.]
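
    As a concrete sketch (not from the slides), the same pairwise combining is what an OpenMP reduction performs; the constants below are illustrative only:

        #include <stdio.h>

        int main(void) {
            int sum = 0;
            /* Each thread adds part of 1..16 into a private copy of sum; OpenMP
               then combines the private copies, much like the tree above. */
            #pragma omp parallel for reduction(+:sum)
            for (int i = 1; i <= 16; i++) {
                sum += i;
            }
            printf("sum = %d\n", sum);   /* prints 136 */
            return 0;
        }

    (Compile with an OpenMP-capable compiler, e.g. cc -fopenmp.)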


    Why do it in parallel?

    Algorithmic reasons:
      Save time (wall-clock time); does NOT save work!
      Solve larger problems (more memory)

    Systemic reasons:
      Transmission speed (speed of light)
      Limits to miniaturization
      Economic limits


    Maximum Gain

    Gain by doing it in parallel is

    speedup = (running time for best serial algorithm) / (running time for parallel algorithm)

    Ideally: use P processors and get P-fold speedup.

    Linear speedup in P is the best we can hope for!

    There are cases of super-linear speedup (e.g., when the per-processor working set suddenly fits into cache).


    Sequential Computer

    Architecture of serial computers:

    [Diagram: CPU performing a fetch-execute cycle, connected to Memory]

    Von Neumann architecture:
      Memory is used to store both program and data
      CPU gets instructions and/or data from memory
      Decodes instructions
      Executes them sequentially


    Parallel Computers

    Widely used classification for parallel computers: Flynn's Taxonomy (1966)

    SISD - Single Instruction, Single Data
    SIMD - Single Instruction, Multiple Data
    MISD - Multiple Instruction, Single Data
    MIMD - Multiple Instruction, Multiple Data


    Memory Architectures

    Another important classification scheme is according to the parallel computer's memory architecture:
      Shared memory
        Uniform memory access (UMA)
        Non-uniform memory access (NUMA)
      Distributed memory
      Hybrid distributed-shared memory solutions


    Shared Memory

    Multiple processors can operate independently but share the same memory resources

    Changes in a memory location effected by one processor are visible to all other processors (global address space)

    [Diagram: four CPUs all connected to one shared Memory]


    Uniform Memory Access

    Most commonly represented today by Symmetric Multiprocessor (SMP) machines

    Identical processors

    Equal access and access times to memory

    Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.


    Non-Uniform Memory Access

    Often made by physically linking two or more SMPs

    One SMP can directly access memory of another SMP

    Not all processors have equal access time to all memories

    Memory access across the link is slower

    If cache coherency is maintained, then it may also be called CC-NUMA (Cache Coherent NUMA)


    Distributed Memory

    Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors

    Distributed memory systems require a communication network to connect inter-processor memory

    The network "fabric" used for data transfer varies widely; it can be as simple as Ethernet

    [Diagram: Nodes 1-4, each with its own CPU and Memory, connected by a Network]


    Comparison

    Shared Memory
      Advantages:
        Global address space
        Data sharing between tasks is both fast and uniform
      Disadvantages:
        Lack of scalability between memory and CPUs
        Programmer responsibility for synchronization
        Expensive

    Distributed Memory
      Advantages:
        Memory is scalable with the number of processors
        Each processor can rapidly access its own memory
      Disadvantages:
        NUMA access times
        Programmer responsible for many details
        Difficult to map existing data structures


    Constellations

    Hybrid distributed-shared memory: used in most of today's parallel computers

    Cache-coherent SMP nodes

    Distributed memory is the networking of multiple SMP nodes

    [Diagram: Nodes 1-4, each an SMP with four CPUs and shared Memory, connected by a Network]


    Example Machines

    Comparison of shared and distributed memory architectures:

    CC-UMA
      Examples: SMPs, Sun Fire Exxx/Vxxx, DEC/Compaq, SGI Challenge, IBM POWER3
      Communications: MPI, Threads, OpenMP, shmem
      Scalability: to 10s of processors
      Drawbacks: limited memory bandwidth
      Software availability: declining

    CC-NUMA
      Examples: SGI Origin/Altix, Sequent, HP Exemplar, DEC/Compaq, IBM POWER4
      Communications: MPI, Threads, OpenMP, shmem
      Scalability: to 100s of processors
      Drawbacks: new architecture, point-to-point communication
      Software availability: stable

    Distributed
      Examples: Cray T3E, Maspar, IBM SP, IBM Blue Gene/L, Beowulf clusters
      Communications: MPI
      Scalability: to 1000s of processors
      Drawbacks: system administration; programming is hard to develop and maintain
      Software availability: still rising


    Parallel Programming Models

    Abstraction above hardware and memory architecture

    Several programming models in use:
      Shared Memory (parallel computing)
      Threads
      Message Passing (distributed computing)
      Data Parallel
      Hybrid approaches

    All models exist for all hardware/memory architectures


    Shared Memory Model

    Tasks share a common address space, which they read and write asynchronously

    Access control to shared memory via locks or semaphores

    No notion of ownership of data, so no need to explicitly communicate data between tasks

    Implementations:
      On shared memory machines: compilers
      On distributed memory machines: simulations


    Threads Model

    A single process has multiple, concurrent execution paths

    Most commonly used on shared memory machines and in operating systems

    prg.exe:
      ...
      call sub1
      call sub2
      do i = 1, n
        A(i) = fnct(i^3)
        B(i) = A(i) * p
      end do
      call sub3
      call sub4
      ...

    [Diagram: prg.exe spawns threads T1-T4, which run concurrently over time]


    Threads Model

    Implementations:

    POSIX Threads (Pthreads) library
      C language only
      Offered for most hardware
      Very explicit parallelism
      Requires significant programmer attention to detail

    OpenMP
      Based on compiler directives; can use sequential code
      Fortran, C, C++
      Portable / multi-platform
      Can be very easy and simple to use
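
    As an illustrative sketch (not from the slides) of the "very explicit" Pthreads style, the code below fills slices of an array from four threads; all names and sizes are made up for the example:

        /* Each thread fills its own slice of A, standing in for the
           loop body A(i) = fnct(i^3) on the previous slide. */
        #include <pthread.h>
        #include <stdio.h>

        #define N 16
        #define NTHREADS 4

        static double A[N];

        static void *worker(void *arg) {
            long t = (long)arg;                  /* thread index 0..NTHREADS-1 */
            long chunk = N / NTHREADS;
            for (long i = t * chunk; i < (t + 1) * chunk; i++)
                A[i] = (double)(i * i * i);
            return NULL;
        }

        int main(void) {
            pthread_t tid[NTHREADS];
            for (long t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL, worker, (void *)t);
            for (long t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);      /* wait for all threads */
            printf("A[15] = %g\n", A[15]);
            return 0;
        }

    The equivalent OpenMP version is a single "#pragma omp parallel for" directive in front of the loop, which is why the slide calls OpenMP easy and simple to use.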


    Message Passing Model

    Tasks exchange data through communications by sending and receiving messages

    Usually requires cooperative operations to be performed by each process: a send operation must have a matching receive operation

    [Diagram: task 0 on Machine A calls send(data); task 1 on Machine B calls receive(data); the data travels over the Network]
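
    A minimal MPI sketch (not from the slides) of this matching send/receive pair; the message contents and tag are arbitrary:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            int data = 0;
            if (rank == 0) {
                data = 42;
                MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1, tag 0 */
            } else if (rank == 1) {
                MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);                          /* matching receive */
                printf("rank 1 received %d\n", data);
            }

            MPI_Finalize();
            return 0;
        }

    Run with at least two processes, e.g. "mpirun -np 2 ./a.out".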


    Message Passing Model

    Implementations:

    Parallel Virtual Machine (PVM)
      Not much in use any more

    Message Passing Interface (MPI)
      Part 1 (MPI-1) released 1994
      Part 2 (MPI-2) released 1997
      http://www-unix.mcs.anl.gov/mpi/
      Now the de facto standard
      Fortran, C, C++
      Available on virtually all machines: OpenMPI, MPICH, LAM/MPI, many vendor-specific versions
      On shared memory machines, MPI implementations usually don't use a network for task communications


    Data Parallel Model

    A set of tasks works collectively on the same data structure

    Each task works on a different partition of the same data structure


    Data Parallel Model

    Implementations:

    Fortran 90
      ISO/ANSI extension of Fortran 77
      Additions to program structure and commands
      Variable additions: methods and arguments

    High Performance Fortran (HPF)
      Contains everything in F90
      Directives added to tell the compiler how to distribute data
      Data parallel constructs added (now part of F95)

    On distributed memory machines: translated into MPI code


    Hybrid Programming Models

    Two or more of the previous models are used in the same program

    Common examples:
      POSIX Threads and Message Passing (MPI)
      OpenMP and MPI
        Works well on a network of SMP machines
      Cluster OpenMP (Intel)

    Also used:
      Data Parallel and MPI


    Designing Parallel Programs

    No real parallelizing compilers
      A compiler knows how to parallelize certain constructs (e.g. loops)
      A compiler uses directives from the programmer

    Not simply a matter of taking a sequential algorithm and making it parallel. Sometimes a completely different algorithmic approach is necessary

    Very time-consuming and labor-intensive task


    Parallelization Techniques

    Domain decomposition
      Data is partitioned
      Each task works on a different part of the data

    Three different ways to partition data (a block partition is sketched below)
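
    As an illustrative sketch (not from the slides), a simple block partition gives each of P tasks one contiguous slice of the data; the helper below is hypothetical:

        /* Task `rank` out of `ntasks` gets the contiguous index range
           [lo, hi) of an array of length n; leftover elements go to the
           first few tasks. */
        #include <stdio.h>

        static void block_range(long n, int ntasks, int rank, long *lo, long *hi) {
            long base = n / ntasks;
            long rem  = n % ntasks;
            *lo = rank * base + (rank < rem ? rank : rem);
            *hi = *lo + base + (rank < rem ? 1 : 0);
        }

        int main(void) {
            long lo, hi;
            for (int rank = 0; rank < 3; rank++) {
                block_range(10, 3, rank, &lo, &hi);
                printf("task %d works on [%ld, %ld)\n", rank, lo, hi);
            }
            return 0;
        }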


    Parallelization Techniques

    Functional decomposition
      The problem is partitioned into a set of independent tasks

    Both types of decomposition can be, and often are, combined


    A little Theory

    Some problems can be parallelized very well: in complexity theory, the class NC ("Nick's Class") is the set of decision problems decidable in poly-logarithmic time on a parallel computer with a polynomial number of processors. In other words, a problem is in NC if there are constants c and k such that it can be solved in time O((log n)^c) using O(n^k) parallel processors.

    Source: http://en2.wikipedia.org/wiki/Class_NC


    A little Theory

    Some problems can't be parallelized at all!

    Example: calculating the Fibonacci sequence (1, 1, 2, 3, 5, 8, 13, 21, ...) by using the formula

      F(1) = 1
      F(2) = 1
      F(k+2) = F(k+1) + F(k)

    The calculation entails dependent calculations: the calculation of the k+2 value uses those of both k+1 and k. These three terms cannot be calculated independently and therefore cannot be parallelized.
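
    A small sketch (not from the slides) of why this loop is inherently serial: every iteration needs the results of the two previous iterations, so the iterations cannot be distributed across processors:

        #include <stdio.h>

        int main(void) {
            long F[20];
            F[0] = 1;                             /* F(1) */
            F[1] = 1;                             /* F(2) */
            for (int k = 2; k < 20; k++)
                F[k] = F[k - 1] + F[k - 2];       /* depends on the two prior iterations */
            printf("F(20) = %ld\n", F[19]);       /* 6765 */
            return 0;
        }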


    Communication

    Decomposed problems typically need to communicate:
      Partial results need to be combined
      Changes to neighboring data have effects on a task's data

    Some problems don't need communication:
      Embarrassingly parallel problems


    Cost of Communication

    Communicating data takes time
      Inter-task communication has overhead
      Often synchronization is necessary

    Communication is much more expensive than computation
      Communicating data needs to save a lot of computation before it pays off
      Infiniband needs < 10 µs to set up a communication
      A 2.4 GHz AMD Opteron CPU needs ~0.4 ns to perform one floating point operation (Flop)
      That is 25,000 floating point operations per communication setup!


    Latency - Bandwidth

    Latency: the amount of time for the first bit of data to arrive at the other end

    Bandwidth: how much data per time unit fits through


    Cost of Communication

    Formula for the time needed to transmit data:

      cost = L + N/B

      L = latency [s]
      N = number of bytes [byte]
      B = bandwidth [byte/s]
      cost [s]
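
    A tiny worked sketch (hypothetical numbers, not from the slides) plugging values into this formula:

        #include <stdio.h>

        int main(void) {
            double L = 10e-6;           /* latency: 10 microseconds */
            double N = 1e6;             /* message size: 1 MB */
            double B = 900e6;           /* bandwidth: 900 MB/s */
            double cost = L + N / B;    /* cost = L + N/B */
            printf("cost = %.0f microseconds\n", cost * 1e6);   /* ~1121; bandwidth dominates */
            return 0;
        }

    For small messages the latency term dominates instead, which is why the previous slide counts floating point operations per communication setup.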


    Visibility of Communication

    With MPI, communication is explicit and very visible

    Latency hiding: communicate and at the same time do some other computations
      Implemented via parallel threads or non-blocking MPI communication functions
      Makes programs faster but more complex
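
    A minimal latency-hiding sketch (not from the slides) using non-blocking MPI calls; the message and the "other work" are placeholders:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            int msg = 0;
            MPI_Request req = MPI_REQUEST_NULL;
            if (rank == 0) {
                msg = 123;
                MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);   /* start send */
            } else if (rank == 1) {
                MPI_Irecv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);   /* start receive */
            }

            /* ... do computation that does not need `msg` while the
                   transfer is in flight ... */

            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* communication must be complete here */
            if (rank == 1) printf("rank 1 got %d\n", msg);

            MPI_Finalize();
            return 0;
        }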


    Scope of Communication

    Knowing which tasks must communicate with each other is critical during the design stage of a parallel program

    Point-to-point: involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer

    Collective: involves data sharing between more than two tasks, which are often specified as being members of a common group, or collective


    Communication Hardware

    Architecture / comment / bandwidth / latency:

    Myrinet (http://www.myricom.com/)
      Proprietary but commodity
      Sustained one-way bandwidth for large messages: ~1.2 GB/s
      Latency for short messages: ~3 µs

    Infiniband (http://www.infinibandta.org/)
      Vendor-independent standard
      ~900 MB/s (4x HCAs)
      ~10 µs

    Quadrics (QsNet) (http://www.quadrics.com/)
      Expensive, proprietary
      ~900 MB/s
      ~2 µs

    Gigabit Ethernet
      Commodity
      ~100 MB/s
      ~60 µs

    Custom: SGI, IBM, Cray, Sun, Compaq, ...


    Communication Hardware

    Measured performance of common interconnects*:

    Mellanox MHGA28 (InfiniBand): latency 2.25 µs; peak bandwidth 1502 MB/s; N/2 = 512 bytes (750 MB/s); CPU overhead ~5%
    QLogic InfiniPath HT (InfiniBand): latency 1.3 µs; peak bandwidth 954 MB/s; N/2 = 385 bytes (470 MB/s); CPU overhead ~40%
    Myrinet F (proprietary): latency 2.6 µs; peak bandwidth 493 MB/s; N/2 = 2000 bytes (250 MB/s); CPU overhead ~10%
    Myrinet 10G (proprietary): latency 2.0 µs; peak bandwidth 1200 MB/s; N/2 = 2000 bytes (600 MB/s); CPU overhead ~10%
    Quadrics QM500 (proprietary): latency 1.6 µs; peak bandwidth 910 MB/s; N/2 = 1000 bytes (450 MB/s); CPU overhead ~50%
    GigE: latency 30-100 µs; peak bandwidth 125 MB/s; N/2 = 8000 bytes (60 MB/s); CPU overhead >50%
    Chelsio T210-CX (10GigE): latency 9.6 µs; peak bandwidth 860 MB/s; N/2 = 100,000 bytes (430 MB/s); CPU overhead ~50%

    N/2: message size needed to achieve half the peak bandwidth

    *Mellanox Technology testing; Ohio State University; PathScale, Myricom, Quadrics, and Chelsio websites;
     http://www.mellanox.com/applications/performance_benchmarks.php

    Synchronization

    Handshaking between tasks that are sharing data

    Types of synchronization:

    Barrier
      Usually implies that all tasks are involved
      Each task performs its work until it reaches the barrier; it then stops, or "blocks"
      When the last task reaches the barrier, all tasks are synchronized
      Used in MPI
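
    A minimal barrier sketch (not from the slides): no rank gets past MPI_Barrier until every rank in the communicator has reached it.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            printf("rank %d: doing its share of the work\n", rank);
            MPI_Barrier(MPI_COMM_WORLD);      /* block until all ranks arrive */
            printf("rank %d: everyone is synchronized\n", rank);

            MPI_Finalize();
            return 0;
        }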


    Synchronization

    More types:

    Lock/Semaphore
      Can involve any number of tasks
      Typically used to serialize (protect) access to global data or a section of code; only one task at a time may use (own) the lock / semaphore / flag
      The first task to acquire the lock "sets" it; this task can then safely (serially) access the protected data or code
      Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it
      Can be blocking or non-blocking
      Used in threads and shared memory
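
    A minimal lock sketch (not from the slides): a Pthreads mutex serializes updates to a shared counter, so only one thread owns the protected section at a time.

        #include <pthread.h>
        #include <stdio.h>

        static long counter = 0;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void *worker(void *arg) {
            (void)arg;
            for (int i = 0; i < 100000; i++) {
                pthread_mutex_lock(&lock);     /* acquire: other threads must wait */
                counter++;                     /* protected section */
                pthread_mutex_unlock(&lock);   /* release the lock */
            }
            return NULL;
        }

        int main(void) {
            pthread_t t[4];
            for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
            for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
            printf("counter = %ld\n", counter);   /* always 400000 with the lock */
            return 0;
        }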


    Synchronization

    More types:

    Synchronous communication operations
      Involve only those tasks executing a communication operation
      When a task performs a communication operation, some form of coordination is required with the other task(s) participating in the communication. For example, before a task can perform a send operation, it must first receive an acknowledgment from the receiving task that it is OK to send.


    Granularity

    Qualitative measure of the computation / communication ratio

    Typically, periods of computation are separated from periods of communication by synchronization events

    Fine-grain parallelism: small amount of computation between communications

    Coarse-grain parallelism: large amount of computation between communications


    Granularity

    Fine-grain
      Low computation-to-communication ratio
      Facilitates load balancing
      High communication overhead; less opportunity for performance enhancement

    Coarse-grain
      High computation-to-communication ratio
      More opportunity for performance increase
      Harder to load balance efficiently


    Data In- and Output

    Parallel computers with thousands of nodes can handle huge amounts of data

    It is hard to get this data in and out of the nodes
      Parallel I/O systems are still fairly new and not available for all platforms
      I/O over the network (like NFS) causes severe bottlenecks

    Help can be found with parallel file systems: Lustre, PVFS2, GPFS (IBM)

    MPI-2 provides support for parallel file systems

    Rule #1: Reduce overall I/O as much as possible!
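
    A minimal MPI-2 parallel I/O sketch (not from the slides): every rank writes its own block of one shared file at a rank-dependent offset; the file name and block size are arbitrary.

        #include <mpi.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            int data[4] = { rank, rank, rank, rank };   /* this rank's block */
            MPI_File fh;
            MPI_File_open(MPI_COMM_WORLD, "out.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
            MPI_Offset offset = (MPI_Offset)rank * sizeof(data);
            MPI_File_write_at(fh, offset, data, 4, MPI_INT, MPI_STATUS_IGNORE);
            MPI_File_close(&fh);

            MPI_Finalize();
            return 0;
        }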


    Efficiency

    Speedup: S_p = T_s / T_p (serial run time over parallel run time on p processors)

    Efficiency: E = S_p / p
      Value between zero and one
      Estimates how well-utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization
      Linear speedup, and algorithms running on a single processor, have an efficiency of 1
      Many difficult-to-parallelize algorithms have an efficiency such as 1/log p, which approaches zero as the number of processors increases


    Limits and Costs

    Besides theoretical limits and hardware limits, there are practical limits to parallel computing

    Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized:

      speedup = 1 / (1 - P)

    If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).

    If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.


    Limits and Costs

    Introducing the number of processors performing the parallel fraction of work, Amdahl's Law can be reformulated as

      speedup = 1 / (P/N + S)

      N = number of processors
      P = parallel fraction
      S = 1 - P = serial fraction

    [Graph: speedup vs. number of processors for several values of P, http://upload.wikimedia.org/wikipedia/en/7/7a/Amdahl-law.jpg]

    Speedup:
      N         P=0.50    P=0.90    P=0.99    P=1.0
      10        1.82      5.26      9.17      10
      100       1.98      9.17      50.25     100
      1000      1.99      9.91      90.99     1000
      10000     1.99      9.99      99.02     10000
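
    A small sketch (not from the slides) of the reformulated formula, reproducing two entries of the table above:

        #include <stdio.h>

        /* speedup(N, P) = 1 / (P/N + (1 - P)) */
        static double amdahl(double N, double P) {
            return 1.0 / (P / N + (1.0 - P));
        }

        int main(void) {
            printf("N=100,  P=0.90: %.2f\n", amdahl(100, 0.90));    /* 9.17 */
            printf("N=1000, P=0.99: %.2f\n", amdahl(1000, 0.99));   /* 90.99 */
            return 0;
        }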


    Typical Parallel Applications

    Applications that are well suited for parallel computers are:

    Weather and ocean patterns

    Finite Element Method (FEM; crash tests for cars)

    Fluid dynamics, aerodynamics

    Simulation of electro-magnetic problems


    Summary

    Overview of parallel computing concepts
      Hardware
      Software
      Programming

    Problems of parallel computing
      Communication is expensive (latency)
      I/O is expensive

    Techniques to work around these problems
      Problem decomposition (communicate larger chunks of data)
      Parallel file systems plus supporting hardware
      $$$$ (faster communication fabric)


    Acknowledgment/References

    Most of this talk is taken from http://www.llnl.gov/computing/tutorials/parallel_comp/

    Theory book: Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes by F. Thomson Leighton

    Hardware book: Computer Architecture: A Quantitative Approach (3rd edition) by John L. Hennessy, David A. Patterson, David Goldberg

    http://www.top500.org/
