Tuesday, September 12, 2006
Nothing is impossible for people who don't have to do
it themselves.
- Weiler
BlueGene/L
Shared Memory
Adding more CPUs increases traffic on the shared memory-CPU path.
Cache-coherent systems add further traffic for cache/memory management.
Today
Classification of parallel computers.
Programming models.
von Neumann Architecture
A common machine model known as the von Neumann computer. Uses the stored-program concept: the CPU executes a stored program that specifies a sequence of read and write operations on the memory.
How to classify parallel computers?
Flynn's Classical Taxonomy (1966)
SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data
Single Instruction, Single Data (SISD):
A serial (non-parallel) computer.
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single data: only one data stream is being used as input during any one clock cycle
This is the oldest and, until recently, the most prevalent form of computer.
Examples: most PCs, single-CPU workstations and mainframes.
Single Instruction, Multiple Data (SIMD):
A type of parallel computer. This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of small-capacity processing units.
Apply the same operation to different data values.
Single Instruction, Multiple Data (SIMD):
Single instruction: All processing units execute the same instruction at any given clock cycle.
Multiple data: Each processing unit can operate on a different data element.
Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
Single Instruction, Multiple Data (SIMD):
SIMD systems have fallen out of favor as general-purpose computers.
Still important in fields like signal processing.
Examples: Thinking Machines Corporation Connection Machine (CM-1 and CM-2), MasPar.
The SIMD approach is used in some processors for special operations.
Intel Pentium processors with MMX include a small set of SIMD-style instructions designed for use in graphics transformations that involve matrix-vector multiplication.
AMD K6/Athlon processors: 3DNow!
Multiple Instruction, Single Data (MISD):
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent instruction streams.
Does not exist in practice.
Multiple Instruction, Single Data (MISD):
Some conceivable uses might be:
Multiple frequency filters operating on a single signal stream.
Multiple cryptography algorithms attempting to crack a single coded message.
Multiple Instruction, Multiple Data (MIMD):
Currently, the most common type of parallel computer.
Most modern computers fall into this category.
Multiple Instruction, Multiple Data (MIMD):
Multiple Instruction: every processor may be executing a different instruction stream
Multiple Data: every processor may be working with a different data stream
Examples: most current supercomputers, networked parallel computers, and multiprocessor SMP computers.
Single Program Multiple Data (SPMD)
SPMD is a subset of ?
Single Program Multiple Data (SPMD)
SPMD is a subset of MIMD.
A simplification for software: most parallel programs in technical and scientific computing are SPMD.
Other approaches to parallelism
Vector computing (parallelism?)
Operations performed on vectors, often groups of 64 floating point numbers.
A single instruction may cause 64 results to be computed using vectors stored in vector registers.
Memory bandwidth is an order of magnitude greater than in non-vector computers.
Parallel Programming Models
Abstraction above hardware and memory architectures.
Inspired by parallel architecture.
Shared Memory Model
Tasks share a common address space, which they read and write asynchronously.
Various mechanisms such as locks / semaphores may be used to control access to the shared memory.
An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks.
Threads Model
A single process can have multiple, concurrent execution paths.
Threads Model
Each thread has local data, but also shares the entire resources of the program.
Threads communicate with each other through global memory (updating address locations).
This requires synchronization constructs to ensure that no two threads update the same global address at the same time.
Threads are commonly associated with shared memory architectures.
Threads Model
Vendor-proprietary thread implementations posed a portability problem, which led to standardization efforts.
From a programming point of view, threads implementations comprise:
1. A library of subroutines (POSIX threads).
Very explicit parallelism; requires significant programmer attention to detail.
APIs such as Pthreads are considered low-level primitives.
From a programming point of view, threads implementations comprise:
2. A set of compiler directives (OpenMP).
Higher-level constructs relieve the programmer from manipulating threads.
Used with Fortran, C, and C++ for programming shared address space machines.
In both cases, the programmer is responsible for determining all parallelism.
Message Passing Model
A set of tasks that use their own local memory during computation.
Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.
Tasks exchange data through communications by sending and receiving messages.
Message Passing Model
Data transfer usually requires cooperative operations to be performed by each process. A send operation must have a matching receive operation.
Message Passing Model
From a programming point of view:
Message passing implementations commonly comprise a library of subroutines.
The programmer is responsible for determining all parallelism.
Message Passing Model
A variety of message passing libraries have been available since the 1980s.
Problem: portability
In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
MPI is now the "de facto" industry standard for message passing.
For shared memory architectures, MPI implementations usually don't use a network for task communications.
Data Parallel Model
Most of the parallel work focuses on performing operations on a data set.
A set of tasks work collectively on the same data structure; however, each task works on a different partition of it.
Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".
Data Parallel Model
SIMD machines.
On shared memory architectures, all tasks may have access to the data structure through global memory.
On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.
Data Parallel Model
Programming with the data parallel model is usually accomplished by writing a program with data parallel constructs.
Compiler Directives: Allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms.
Distributed memory implementations of this model usually have the compiler convert the program into standard code with calls to a message passing library (MPI usually) to distribute the data to all the processes. All message passing is done invisibly to the programmer.
High Performance Fortran (HPF): Extensions to Fortran 90 to support data parallel programming.
Hybrid Model
An environment of networked SMP machines.
A combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP).
Historically, architectures were often tied to programming models.
Message passing model on a shared memory machine
MPI on the SGI Origin. The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory.
Ability to send and receive messages with MPI.
Shared memory model on a distributed memory machine
Kendall Square Research (KSR) ALLCACHE approach.
Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space).
This approach is referred to as "virtual shared memory".
The KSR approach is no longer used; no common distributed memory implementations of virtual shared memory currently exist.
There certainly are better implementations of some models over others.
Networks for Connecting Parallel Systems
Simple buses, 2-D and 3-D meshes, hypercube network topologies …
In the past, understanding the details of these topologies was important for programmers.
CPU Parallelism
Superscalar parallelism.
CPU Parallelism
The amount of parallelism achievable by superscalar processors is limited by instruction lookahead.
Hardware logic for dependency analysis makes up 5-10% of the total logic on conventional microprocessors.
CPU Parallelism
Explicitly parallel instructions.
Each instruction contains explicit sub-instructions for each of the different functional units in the CPU.
Very long instruction word (VLIW) ISAs rely on compilers to resolve dependencies.
Instructions that can be executed concurrently are packed into groups and sent to the processor as a single long instruction word.
Example: Intel Itanium