Multiprocessors and Thread-Level Parallelism
TRANSCRIPT
7/25/2019 Multi processors and thread level parallelism
UNIT 5
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM
CONTENT
INTRODUCTION
SYMMETRIC AND SHARED MEMORY ARCHITECTURES
PERFORMANCE OF SYMMETRIC SHARED MEMORY
ARCHITECTURES
DISTRIBUTED SHARED MEMORY AND DIRECTORY BASED
COHERENCE
BASICS OF SYNCHRONIZATION
MODELS OF MEMORY CONSISTENCY
FACTORS THAT TREND TOWARD
MULTIPROCESSORS
1. A growing interest in servers and server performance
2. A growth in data-intensive applications
3. The insight that increasing performance on the desktop is less important
4. An improved understanding of how to use multiprocessors effectively
5. The advantage of leveraging a design investment by replication rather than unique design
A TAXONOMY OF PARALLEL
ARCHITECTURES
1. Single instruction stream, single data stream
(SISD)
2. Single instruction stream, multiple data stream
(SIMD)
3. Multiple instruction stream single data stream
(MISD)
4. Multiple instruction stream, multiple data
stream (MIMD)
SIMD
Same instruction is executed by multiple processors using different data streams
Exploits data-level parallelism
Each processor has its own data memory
Single instruction memory
A control processor fetches and dispatches instructions
SISD
- Uniprocessor
MIMD
Each processor fetches its own instructions and operates on its own data
Exploits thread-level parallelism
FACTORS THAT CONTRIBUTED TO
THE RISE OF MIMD
1. Flexibility
Functions as a single-user multiprocessor
Can focus on high performance for one application
Can run multiple tasks simultaneously
2. Cost-performance
Uses the same microprocessors found in workstations and single-processor servers
Multicore chips leverage the design investment by replication
CLUSTERS
One class of MIMD
Use standard components and a network
technology
Two types: Commodity clusters
Custom clusters
COMMODITY CLUSTERS
Rely on third-party processors and interconnect technology
Are often blade- or rack-mounted servers
Focus on throughput
No communication among threads
Assembled by users rather than vendors
CUSTOM CLUSTERS
Designer customizes either the detailed node
design or the interconnect design or both
Exploit a large amount of parallelism
Require a significant amount of communication during computation
More efficient
Ex.: IBM Blue Gene
MULTICORE
Multiprocessors placed on a single die
A.k.a. on-chip multiprocessing or single-chip multiprocessing
Multiple cores share resources (cache, I/O bus)
Ex.: IBM Power5
PROCESS
Segment of code that may be run independently
Process state contains all the necessary information to execute that program
Each process is independent of the others: a multiprogramming environment
THREADS
Multiple processors executing a single program
Share the code and address space
Grain size must be large to exploit parallelism
Independent threads within a process are identified by the programmer or created by the compiler
Loop iterations within a thread can exploit data-level parallelism
MIMD CLASSIFICATION
1. Centralized shared memory architectures
2. Distributed memory processors
CENTRALIZED SHARED MEMORY
ARCHITECTURES
A few dozen processors share a single centralized memory
Large caches or multiple memory banks
Scaling is done using point-to-point connections, switches, and multiple memory banks
Symmetric relationship
Uniform access time
Called a Symmetric Shared Memory Multiprocessor (SMP) or Uniform Memory Access (UMA) architecture
DISTRIBUTED MEMORY MULTI
PROCESSORS
Physically distributed memory
Supports a large number of processors and high bandwidth
Raises the need for a high-bandwidth interconnect
Direct networks (switches) and indirect networks (multidimensional meshes) are used
BENEFITS:
1. Cost-effective to scale memory bandwidth
2. Reduces latency to access local memory
DRAWBACKS:
1. More complex data communication between processors
2. Software is needed to manage the increased memory bandwidth
MODELS FOR COMMUNICATION
AND MEMORY ARCHITECTURE
1. Communication occurs through a shared address space
Physically separated memory => one logical shared address space
Called a Distributed Shared Memory (DSM) architecture or Non-Uniform Memory Access (NUMA)
A memory reference can be made by any processor to any memory location
Access time depends on the location of the data in memory
2. The address space consists of multiple private address spaces
Addresses are logically disjoint
A private address cannot be addressed by a remote processor; the same physical address on different processors refers to different memory locations
Each processor-memory module is a separate computer
Communication is done via message passing
A.k.a. message-passing multiprocessors
CHALLENGES OF MULTI
PROCESSING
1. Limited parallelism available in programs
2. Relatively high cost of communication
3. Large latency of remote access
4. Difficult to achieve good speedup
Performance is measured using Amdahl's law
SOLUTION
Limited parallelism: algorithms with better parallel performance
Access latency: architecture design and programming
Reduce the frequency of remote accesses: hardware and software mechanisms
Tolerate latency: multithreading and prefetching
PROBLEM
Suppose you want to achieve a speedup of 80 with
100 processors. What fraction of the original
computation can be sequential?
Assume that the program operates in only two modes:
1. Parallel with all processors fully used, which is the enhanced mode
2. Serial with only one processor in use
Speedup in enhanced mode = number of processors
Fraction of enhanced mode = time spent in parallel mode
F = 99.75%
Only 0.25% of the original computation can be sequential
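The answer follows from Amdahl's law under the two-mode assumption above (enhanced-mode speedup = 100 processors); a worked derivation:

```latex
\begin{align*}
80 &= \frac{1}{\dfrac{F_{\text{parallel}}}{100} + \left(1 - F_{\text{parallel}}\right)} \\
\frac{F_{\text{parallel}}}{100} + 1 - F_{\text{parallel}} &= \frac{1}{80} = 0.0125 \\
0.99\,F_{\text{parallel}} &= 0.9875 \\
F_{\text{parallel}} &\approx 0.9975, \qquad 1 - F_{\text{parallel}} \approx 0.25\%
\end{align*}
```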
PROBLEM
(The statement and solution of this problem appeared as images on the original slides and are not captured in the transcript.)
SYMMETRIC SHARED MEMORY
ARCHITECTURE
The use of multilevel caches substantially reduces the memory bandwidth demands of a processor
Solution: creation of small-scale multiprocessors where several processors share a single physical memory connected by a shared bus
Benefit: cost-effective
They support caching of both private and shared data
Private data: Used by a single processor
Shared data: Shared between multiple processors
How are these cached?
WHAT IS MULTI PROCESSOR CACHE
COHERENCE?
A memory system is said to be coherent if:
1. A read by processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
3. Writes to the same location are serialized; two writes to the same location by any two processors are seen in the same order by all processors
Coherence: Defines the behavior of reads and
writes to the same memory location
Consistency: Defines the behavior of reads and writes with respect to accesses to other memory locations
BASIC SCHEMES FOR ENFORCING
COHERENCE
Coherent caches provide:
1. Migration: a data item can be moved to a local cache and used there
2. Replication: shared data can be simultaneously read
The protocols to maintain coherence for multiple processors are called cache coherence protocols
1. Directory based: the sharing status of a block of physical memory is kept in just one location, the directory
2. Snooping: every cache that has a copy of the data from a block of physical memory also has the sharing status of the block; no centralized state is kept
SNOOPING PROTOCOLS
1. Write invalidate
2. Write update
BASIC IMPLEMENTATION
TECHNIQUES
1. The processor acquires bus access and
broadcasts the address to be invalidated on the
bus
2. Processors continuously snoop on the bus
watching for addresses
3. The processors check whether the address on the bus is in their cache
4. If so, they invalidate the corresponding data in
their cache
5. If two processors attempt to write shared blocks
at the same time, their attempts to broadcast
an invalidate operation will be serialized
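The five steps above can be sketched as a toy simulation. This is a minimal sketch, assuming a single bus modeled as a loop over the caches and a write-through update of memory for simplicity; the class and method names are illustrative, not from the original slides.

```python
# Minimal write-invalidate snooping sketch (illustrative names).
# Each cache holds {address: value}; a write broadcasts the address on the
# "bus", and every other cache that snoops a matching line invalidates it.

class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                  # address -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def snoop_invalidate(self, addr):
        # Steps 2-4: watch the bus and invalidate a matching line.
        self.lines.pop(addr, None)

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def write(self, writer, addr, value, memory):
        # Step 1: the writer acquires the bus and broadcasts the address.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)
        writer.lines[addr] = value
        memory[addr] = value             # write-through, for simplicity

memory = {0x10: 5}
p0, p1 = SnoopingCache("P0"), SnoopingCache("P1")
bus = Bus([p0, p1])
p0.read(0x10, memory)
p1.read(0x10, memory)                    # both caches now hold 0x10
bus.write(p0, 0x10, 7, memory)           # P1's copy is invalidated
```

Because a real bus grants access to one writer at a time, simultaneous writes to shared blocks would be serialized in the order the bus arbiter grants them (step 5).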
Write update: broadcasts the write to all the cached copies of the line
Consumes bandwidth
Write-through cache: written data is sent to memory
The most recent value of a data item can be fetched from memory
Write-back cache: every processor snoops each address placed on the bus.
If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request
This in turn causes the memory access to be aborted
The cache block is then retrieved from the processor's cache
To track whether a cache block is shared, an extra bit called the state bit is associated with each cache block
When a write to a shared block occurs, the cache generates an invalidation on the bus and marks the block as exclusive
The processor with this sole copy of the block is called the owner of the block
When an invalidation is sent, the owner's state for the cache block is changed from shared to exclusive
Later, if another processor requests the cache block, the state has to be made shared again
WRITE INVALIDATE FOR A WRITE-BACK CACHE
Circles: cache states
Arcs: state transitions
Labels on the arcs: the stimulus that causes a state transition
Bold: bus actions caused by the transitions
LIMITATIONS
As the number of processors in a multiprocessor grows and memory demands grow, any centralized resource becomes a bottleneck
A single bus has to carry both the coherence traffic and the normal memory traffic
Designers can use multiple buses and interconnection networks
Attain a midway approach between shared centralized memory and distributed memory
PERFORMANCE OF SYMMETRIC SHARED
MEMORY MULTI PROCESSORS
Coherence misses can be broken into two sources:
1. True sharing miss: the first write by a processor to a shared cache block causes an invalidation to establish block ownership; a subsequent attempt by another processor to read a modified word in that cache block results in a miss
2. False sharing miss: the block is invalidated because some word in the cache block other than the one being read was written into
PROBLEM 3: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit. Any miss that would occur if the block size were one word is designated a true sharing miss.
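The event table from the original slide is not captured in this transcript, so the sequence below is an assumed example rather than the original one. This is a minimal sketch of a classifier that applies the two definitions from the previous slide to a two-word block; all names are illustrative.

```python
# Toy true/false sharing classifier for a two-word block holding x1 and x2.
# Both caches start with the block valid, and both processors are assumed
# to have already read x1, as in the problem setup.

def classify(events):
    """events: list of (processor, op, word); returns one label per event."""
    valid = {"P1": True, "P2": True}
    used = {"P1": {"x1"}, "P2": {"x1"}}   # words touched while the copy was valid
    remote = {"P1": None, "P2": None}     # word whose remote write invalidated us
    labels = []
    for proc, op, word in events:
        other = "P2" if proc == "P1" else "P1"
        if not valid[proc]:
            # Miss on an invalidated block: true sharing iff we touch the very
            # word the remote processor wrote (a one-word block would also miss)
            labels.append(("true" if remote[proc] == word else "false")
                          + " sharing miss")
            valid[proc] = True
        elif op == "write" and valid[other]:
            # Write to a shared block: an invalidation must be sent; true
            # sharing iff the other processor actually used this word
            labels.append(("true" if word in used[other] else "false")
                          + " sharing miss")
        else:
            labels.append("hit")
        used[proc].add(word)
        if op == "write":                 # invalidate the other copy
            valid[other] = False
            remote[other] = word
            used[other] = set()
    return labels

events = [("P1", "write", "x1"), ("P2", "read", "x2"), ("P1", "write", "x1"),
          ("P2", "write", "x2"), ("P1", "read", "x2")]
```

For this sequence the classifier labels the five accesses: true sharing miss, false sharing miss, false sharing miss, false sharing miss, true sharing miss.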
DISTRIBUTED SHARED MEMORY
AND DIRECTORY BASED COHERENCE
A directory keeps the state of every cached block
Information in the directory includes which caches have copies of the block, whether they are dirty, and so on
An entry in the directory is associated with each block
To prevent the directory from becoming a bottleneck, the directory is distributed along with the memory.
DIRECTORY BASED CACHE
COHERENCE PROTOCOLS
The state of each cache block can be one of the following:
1. Shared: one or more processors have the block cached, and the value in memory and in all the caches is up to date
2. Uncached: no processor has a copy of the cache block
3. Modified: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date; that processor is the owner of the block
To keep track of each potentially shared block, a bit vector is maintained for each block.
Each bit indicates whether the corresponding processor has a copy of the block
Local node: where the request originates
Home node: where the memory location and directory entry of the block reside
Remote node: a node that holds a copy of the block
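The sharer-tracking bit vector can be sketched in a few lines. A minimal sketch, assuming processor i maps to bit i; the function names are illustrative.

```python
# Per-block sharer bit vector: bit i set means processor i holds a copy.

NUM_PROCS = 4

def add_sharer(vec, proc):
    return vec | (1 << proc)         # set the processor's bit

def remove_sharer(vec, proc):
    return vec & ~(1 << proc)        # clear the processor's bit

def sharers(vec):
    return [p for p in range(NUM_PROCS) if vec & (1 << p)]

vec = 0                              # uncached: no bits set
vec = add_sharer(vec, 0)             # P0 caches the block
vec = add_sharer(vec, 2)             # P2 caches the block
```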
DIRECTORY BASED CACHE
COHERENCE PROTOCOLS
When the block is in the uncached state, the possible requests for it are:
1. Read miss: the requesting processor is sent the block from memory; the state of the block is made shared
2. Write miss: the requesting processor is sent the value and becomes the sharing node; the block is made exclusive
When the block is in the shared state, the memory value is up to date:
1. Read miss: the requesting processor is sent the requested data from memory, and it is added to the sharing set
2. Write miss: the requesting processor is sent the value. All other processors in the sharing set are sent invalidate messages, which contain the identity of the requesting processor; the state of the block is made exclusive
When the block is in the exclusive state, the current value of the block is held in the owner processor's cache:
1. Read miss: the owner processor is sent a data fetch message. The state of the block is made shared; the requesting processor is added to the sharing set, which still contains the identity of the owner
2. Data write-back: the owner processor is replacing the block and hence the block has to be written back. The memory copy is made up to date, the block becomes uncached, and the sharing set is emptied
3. Write miss: the block has a new owner. A message is sent to the old owner to invalidate the block; the state of the block remains exclusive
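The transitions on the last three slides can be collected into one small state machine. This is a minimal sketch that tracks only the directory state and the sharing set; the fetch, invalidate, and data messages described above are deliberately elided, and the names are illustrative.

```python
# Directory entry sketch: states are "uncached", "shared", "exclusive".
# Only the state and the sharing set are modeled; message traffic is elided.

class DirectoryEntry:
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()         # copies held (the owner, if exclusive)

    def read_miss(self, proc):
        # Uncached/shared: data comes from memory. Exclusive: the owner is
        # sent a fetch first. Either way the block ends up shared and the
        # requester joins the sharing set (which keeps the old owner).
        self.state = "shared"
        self.sharers.add(proc)

    def write_miss(self, proc):
        # All other copies (or the old owner) are invalidated; the
        # requester becomes the sole owner.
        self.state = "exclusive"
        self.sharers = {proc}

    def write_back(self):
        # The owner replaces the block: memory is made up to date, the
        # block becomes uncached, and the sharing set is emptied.
        self.state = "uncached"
        self.sharers = set()

entry = DirectoryEntry()
entry.read_miss("P0")                # uncached -> shared, sharers {P0}
entry.read_miss("P1")                # shared, sharers {P0, P1}
entry.write_miss("P2")               # invalidations sent; exclusive, owner P2
entry.read_miss("P0")                # fetch from P2; shared, sharers {P2, P0}
```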
SYNCHRONIZATION
Synchronization mechanisms are built with user-level software routines that rely on hardware-supplied synchronization instructions
Atomic operations: the ability to atomically read and modify a memory location
Atomic exchange: interchanges a value in a register with a value in memory
Locks: 0 is used to indicate that the lock is free; 1 is used to indicate that the lock is unavailable
Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: returns the value of a memory location and atomically increments it
IMPLEMENTING LOCKS USING
COHERENCE
Spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds
Used when the lock is expected to be held for a very short amount of time and the process of acquiring the lock is low latency
Simple implementation:
A processor could continually try to acquire the lock using an atomic operation
E.g.: exchange, testing the returned value
To release the lock, the processor stores a 0 to the lock variable
Coherence mechanism:
Use the cache coherence mechanism to maintain the lock value coherently
A processor can spin on a locally cached copy of the lock rather than going to global memory
Locality in lock access: the processor that used the lock last will use it again in the near future
Spin procedure:
A processor reads the lock variable to test its state
This is repeated until the value of the read indicates that the lock is unlocked
The processor then races with all the other waiting processors
All processors use a swap operation that reads the old value and stores a 1 into the lock variable
The single winner will see a 0, and the losers will see the 1 placed by the winner
The winning processor executes the code after the lock and then releases it by storing a 0 in the lock variable
The race then starts again
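The spin procedure above (spin on a read, then race with a swap) can be sketched with threads. This is a minimal sketch, reusing the lock-based emulation of an atomic exchange from the earlier slide; the GIL-scheduled Python threads only model, not reproduce, the hardware behavior, and all names are illustrative.

```python
# Test-and-test-and-set spin lock sketch: spin reading the lock until it
# looks free, then race with an (emulated) atomic swap; the winner sees 0.

import threading

_hw = threading.Lock()               # stands in for hardware atomicity
lock_var = [0]                       # 0 = free, 1 = held
counter = [0]

def swap(loc, new):
    with _hw:                        # emulate an atomic exchange
        old = loc[0]
        loc[0] = new
        return old

def acquire(loc):
    while True:
        while loc[0] == 1:           # spin on the (locally cached) value
            pass
        if swap(loc, 1) == 0:        # race: the single winner sees the 0
            return                   # losers see the 1 placed by the winner

def release(loc):
    loc[0] = 0                       # store 0 to free the lock; race restarts

def worker():
    for _ in range(1000):
        acquire(lock_var)
        counter[0] += 1              # critical section
        release(lock_var)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock protecting the increments, the four workers' 4000 updates do not interfere with one another.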
MODELS OF MEMORY
CONSISTENCY
Consistency:
1. When must a processor see a value that has been updated by another processor?
2. In what order must a processor observe the data writes of another processor?
Sequential consistency: the result of any execution is the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were interleaved
Sequential consistency: sequential consistency requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved.
A program is synchronized if all accesses to shared data are ordered by synchronization operations
Data race: variables are updated without ordering by synchronization; the execution outcome depends on the relative speed of the processors
Synchronization operations?
RELAXED CONSISTENCY MODELS
Allow reads and writes to complete out of order, but use synchronization operations to enforce ordering
X -> Y: operation X must complete before operation Y
Four possible orderings: R -> W; R -> R; W -> R; W -> W
1. Relaxing W -> R yields the total store ordering or processor consistency model
2. Relaxing W -> W ordering yields a model known as partial store order
3. Relaxing R -> W and R -> R yields the weak ordering and release consistency models
1. Define the four major categories of computer systems
2. List the factors that led to the rise of MIMD multiprocessors
3. Illustrate the basic architecture of a centralized shared memory multiprocessor
4. Illustrate the basic architecture of a distributed memory multiprocessor
5. Distinguish between private data and shared data
6. Define the cache coherence problem
7. List the conditions required for a memory system to be coherent
8. Define the cache coherence protocols
9. Analyze the implementation of a cache coherence protocol
10. Illustrate the performance of symmetric shared memory multiprocessors with a commercial workload application
11. Illustrate the working of a distributed memory multiprocessor
12. Demonstrate the transitions in a directory based system
13. Define spin locks
14. Define the ordering of a relaxed consistency model