-
LECTURE 6
DR. SAMMAN H. AMEEN
Shared-memory systems
PAGE 1
-
Last week
We discussed static interconnection networks.
This week we explain
1. Shared-memory systems
PAGE 2
-
PAGE 3
In today’s competitive technology landscape, where speed and cost are so important, the efficiency with which tasks are completed is crucial to success. For most tasks that humans handle, teamwork is the way to go. Technology follows the same principle: multiprocessors achieve speedup through parallel programming. And, as with teamwork, communication is essential to achieve maximum efficiency.
One popular way to allow multiple processors to communicate is to use a shared-memory architecture. This lecture describes the characteristics of shared-memory systems, the surrounding software programming paradigm, the hardware required for implementation, and Altera’s solution. Emphasis is placed on cache coherence and synchronization.
-
PAGE 4
Memory is a very important component of a computer, and even more important is the type of memory used in the computer’s architecture. Having the wrong type of memory for the system can be costly.
There are many types of memory, all of which must work together in a system to perform tasks efficiently. System efficiency is gained by having different levels of memory units with different data transfer speeds: small, fast memories near the processor at the top of the hierarchy, and larger, slower storage (such as disk or magnetic tape) at the bottom.
-
PAGE 5
The use of a shared-memory system for parallel processing has several advantages. For programmers, the coding is very similar to that of a multi-threaded application running on a uniprocessor system, because resources are shared among tasks in the same way and the same synchronization techniques apply, which makes adaptation easy. Global variables can be loaded into local memory caches, which increases the performance of applications that use shared memory extensively. Another advantage is that the most popular platforms now offer hardware extensions to take care of memory and cache coherence. This makes the incremental cost of adding processors to a multiprocessor design very low.
Two main problems need to be addressed when designing a shared-memory system: performance degradation due to contention, and coherence problems. Performance degradation can occur when multiple processors try to access the shared memory simultaneously. A typical design uses caches to solve the contention problem. However, having multiple copies of data spread throughout the caches can lead to a coherence problem. The copies in the caches are coherent if they all hold the same value. However, if one processor writes over the value of one of the copies, that copy becomes inconsistent because it no longer equals the value of the other copies. In this lecture we study a variety of shared-memory systems and their solutions to the cache coherence problem.
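As noted above, programming a shared-memory machine resembles writing a multi-threaded application with ordinary synchronization primitives. A minimal sketch in Python (the names `counter` and `increment` are illustrative, not from the slides):

```python
# Two threads updating one shared variable, serialized by a lock --
# the same synchronization technique used on shared-memory multiprocessors.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # only one thread touches the shared variable at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 20000: the lock prevents lost updates
```

Without the lock, the two read-modify-write sequences could interleave and updates would be lost, which is the software-level analogue of the coherence problems discussed below.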
-
PAGE 6
Using the shared-memory model for a multiprocessor can create a bottleneck in the architecture. Multiple processors may attempt to write at the same time, and at some instant more than one processor may be accessing the same memory location, which can greatly degrade computational throughput.
Giving each processing element a local memory and using the message-passing model alleviates this issue.
-
PAGE 7
• Uniform Memory Access (UMA)
• Non-Uniform Memory Access (NUMA)
• Cache-only Memory Architecture (COMA)
-
PAGE 8
• All processors have equal access time to any memory location.
• The interconnection network used in a UMA system can be a single bus, multiple buses, a crossbar switch, or a multistage switching network.
• Two or more CPUs and one or more memory modules all use the same bus for communication.
• Tightly-coupled systems (high degree of resource sharing)
-
PAGE 9
• Symmetric:
  - All processors have equal access to all peripheral devices.
  - All processors are identical.
• Asymmetric:
  - One processor (the master) executes the operating system.
  - Other processors may be of different types and may be dedicated to special tasks.
-
PAGE 10
• Each processor has part of the shared memory attached to it.
• There is a single address space visible to all CPUs: all local memories together form a global address space accessible by all processors.
• Access to remote memory is slower than access to local memory.
-
PAGE 11
-
PAGE 12
NUMA machines that do no caching do not scale well: having to go to remote memory every time a nonlocal memory word is accessed is a major performance hit. However, if caching is added, then cache coherence must also be added, and the model is called ccNUMA.
-
PAGE 13
• Similar to the NUMA, each processor has part of the shared memory in the COMA. However, in this case the shared memory consists of cache memory.
• A COMA system requires that data be migrated to the processor requesting it.
• There is a cache directory (D) that helps in remote cache access.
-
PAGE 14
-
PAGE 15
• In a single-cache system, coherence between memory and the cache is maintained using one of two policies: (1) write-through, and (2) write-back.
1. In write-through, the memory is updated every time the cache is updated.
2. In write-back, the memory is updated only when the block in the cache is being replaced.
Write-Through vs. Write-Back
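The two policies can be sketched with a small single-cache model in Python. The `Cache` class and its methods are illustrative names, and a real cache tracks whole blocks rather than single words:

```python
# Minimal single-cache sketch of the two memory-update policies.
class Cache:
    def __init__(self, policy):
        self.policy = policy          # "write-through" or "write-back"
        self.memory = {}              # backing store: address -> value
        self.lines = {}               # cached copies: address -> value
        self.dirty = set()            # addresses modified since load (write-back)

    def write(self, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            self.memory[addr] = value   # memory updated on every write
        else:
            self.dirty.add(addr)        # memory updated only at replacement

    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]  # write the dirty block back
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

wt = Cache("write-through")
wt.write(0, 5)
print(wt.memory[0])        # 5: memory is consistent immediately

wb = Cache("write-back")
wb.write(0, 5)
print(wb.memory.get(0))    # None: memory not yet updated
wb.evict(0)
print(wb.memory[0])        # 5: updated only when the block is replaced
```

Write-through keeps memory consistent at the cost of bus traffic on every write; write-back defers that traffic until eviction.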
-
PAGE 16
-
PAGE 17
-Multiple copies of x
-What if P1 updates x?
-
PAGE 18
• Caches play a key role in all cases
• Reduce average data access time
• Reduce bandwidth demands placed on shared interconnect
• But private processor caches create a problem
• Copies of a variable can be present in multiple caches
• A write by one processor may not become visible to others
• They’ll keep accessing the stale value in their caches
• Cache coherence problem
• Need to take actions to ensure visibility
-
PAGE 19
• There are two fundamental cache coherence policies: (1) write-invalidate, and (2) write-update.
Write-invalidate maintains consistency by reading from local caches until a write occurs. When any processor updates the value of X through a write, posting a dirty bit for X invalidates all other copies.
Write-update maintains consistency by immediately updating all copies in all caches. All dirty bits are set during each write operation.
Write-Update vs. Write-Invalidate
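A minimal sketch of the two coherence policies across several caches, assuming for simplicity that memory is updated write-through on each write; the `System` class and its method names are illustrative:

```python
# Sketch of write-invalidate vs. write-update across n caches.
class System:
    def __init__(self, n_caches, policy):
        self.policy = policy                   # "invalidate" or "update"
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cpu, addr, memory):
        self.caches[cpu][addr] = memory[addr]  # load a copy into this cache

    def write(self, cpu, addr, value, memory):
        self.caches[cpu][addr] = value
        memory[addr] = value                   # write-through, for simplicity
        for i, c in enumerate(self.caches):
            if i != cpu and addr in c:
                if self.policy == "invalidate":
                    del c[addr]                # other copies become invalid
                else:
                    c[addr] = value            # other copies updated in place

memory = {0: 5}
inv = System(2, "invalidate")
inv.read(0, 0, memory); inv.read(1, 0, memory)
inv.write(0, 0, 10, memory)
print(inv.caches[1])       # {}: the stale copy was invalidated

memory = {0: 5}
upd = System(2, "update")
upd.read(0, 0, memory); upd.read(1, 0, memory)
upd.write(0, 0, 10, memory)
print(upd.caches[1])       # {0: 10}: the copy was updated immediately
```

Invalidate pays a miss later if the other processor re-reads; update pays bus traffic now to refresh every copy.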
-
PAGE 20
• Writing to Cache in n processor case
• Write Update - Write Through
• Write Update - Write Back
• Write Invalidate - Write Through
• Write Invalidate - Write Back
-
PAGE 21
-
PAGE 22
-
PAGE 23
Snooping protocols are based on watching bus activities and carrying out the appropriate coherence commands when necessary. Global memory is moved in blocks, and each block has a state associated with it, which determines what happens to the entire contents of the block. The state of a block might change as a result of the operations Read-Miss, Read-Hit, Write-Miss, and Write-Hit.
-
PAGE 24
Multiple processors can read block copies from main
memory safely until one processor updates its copy. At
this time, all cache copies are invalidated and the
memory is updated to remain consistent.
State            Description
Valid [VALID]    The copy is consistent with global memory
Invalid [INV]    The copy is inconsistent
-
PAGE 25
X = 5
1. P reads X
2. Q reads X
3. Q updates X, X=10
4. Q reads X
5. Q updates X, X=15
6. P updates X, X=20
7. Q reads X
-
PAGE 26
Event            Memory X   P's Cache (X, State)   Q's Cache (X, State)   Comments
Original value   5          --                     --
P reads X        5          5   VALID              --                     Read-Miss
Q reads X        5          5   VALID              5   VALID              Read-Miss
Q updates X      10         5   INV                10  VALID              Write-Hit
Q reads X        10         5   INV                10  VALID              Read-Hit
Q updates X      15         5   INV                15  VALID              Write-Hit
P updates X      20         20  VALID              15  INV                Write-Miss
Q reads X        20         20  VALID              20  VALID              Read-Miss
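The trace above can be reproduced with a small Python sketch of the write-invalidate, write-through protocol (the function names and data layout are illustrative):

```python
# Two caches P and Q snooping on writes; write-invalidate, write-through.
memory = {"X": 5}
caches = {"P": {}, "Q": {}}          # addr -> [value, state]

def read(cpu, addr):
    line = caches[cpu].get(addr)
    if line and line[1] == "VALID":
        return "Read-Hit"
    caches[cpu][addr] = [memory[addr], "VALID"]   # miss: fetch from memory
    return "Read-Miss"

def write(cpu, addr, value):
    line = caches[cpu].get(addr)
    hit = line is not None and line[1] == "VALID"
    caches[cpu][addr] = [value, "VALID"]
    memory[addr] = value                          # write-through: update memory
    for other in caches:
        if other != cpu and addr in caches[other]:
            caches[other][addr][1] = "INV"        # invalidate all other copies
    return "Write-Hit" if hit else "Write-Miss"

events = [read("P", "X"), read("Q", "X"),
          write("Q", "X", 10), read("Q", "X"),
          write("Q", "X", 15), write("P", "X", 20),
          read("Q", "X")]
print(events)    # matches the Comments column of the table
print(memory["X"], caches["P"]["X"], caches["Q"]["X"])
```

Running the seven events yields exactly the miss/hit sequence in the Comments column and leaves memory and both caches holding 20.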
-
PAGE 28
A valid block can be owned by memory and shared among multiple caches, which may contain only shared copies of the block. Multiple processors can safely read these blocks from their caches until one processor updates its copy. At this time, the writer becomes the only owner of the valid block and all other copies are invalidated.
State                         Description
Shared (Read-Only) [RO]       Data is valid and can be read safely. Multiple copies can be in this state
Exclusive (Read-Write) [RW]   Only one valid cache copy exists and can be read from and written to safely. Copies in other caches are invalid
Invalid [INV]                 The copy is inconsistent
-
PAGE 29
Event            Memory X   P's Cache (X, State)   Q's Cache (X, State)   Comments
Original value   5          --                     --
P reads X        5          5   RO                 --                     Read-Miss
Q reads X        5          5   RO                 5   RO                 Read-Miss
Q updates X      5          5   INV                10  RW                 Write-Hit
Q reads X        5          5   INV                10  RW                 Read-Hit
Q updates X      5          5   INV                15  RW                 Write-Hit
P updates X      5          20  RW                 15  INV                Write-Miss
Q reads X        20         20  RO                 20  RO                 Read-Miss
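Similarly, the write-back trace above can be reproduced with a small sketch using the RO, RW, and INV states from the table (function names and data layout are illustrative):

```python
# Two caches P and Q; write-invalidate, write-back protocol.
memory = {"X": 5}
caches = {"P": {}, "Q": {}}          # addr -> [value, state]

def read(cpu, addr):
    line = caches[cpu].get(addr)
    if line and line[1] in ("RO", "RW"):
        return "Read-Hit"
    # Miss: if another cache owns an RW copy, it is written back to
    # memory and demoted to RO before this cache loads the block.
    for other in caches:
        o = caches[other].get(addr)
        if other != cpu and o and o[1] == "RW":
            memory[addr] = o[0]
            o[1] = "RO"
    caches[cpu][addr] = [memory[addr], "RO"]
    return "Read-Miss"

def write(cpu, addr, value):
    line = caches[cpu].get(addr)
    hit = line is not None and line[1] in ("RO", "RW")
    caches[cpu][addr] = [value, "RW"]             # writer becomes sole owner
    for other in caches:                          # memory is NOT updated here
        if other != cpu and addr in caches[other]:
            caches[other][addr][1] = "INV"        # invalidate other copies
    return "Write-Hit" if hit else "Write-Miss"

events = [read("P", "X"), read("Q", "X"),
          write("Q", "X", 10), read("Q", "X"),
          write("Q", "X", 15), write("P", "X", 20),
          read("Q", "X")]
print(events)         # matches the Comments column of the table
print(memory["X"])    # 20: written back only when Q re-reads the block
```

Note how memory stays at 5 through five events, exactly as in the table, and is only brought up to date when the final read forces P's dirty block to be written back.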
-
PAGE 31
-
PAGE 32
• Explain both the von Neumann architecture and the Harvard architecture, showing the advantages of each.
• What are the fundamental design issues in selecting the appropriate architecture for an interconnection network (IN) for parallel machines? Then explain the synchronous and asynchronous modes of operation.
• Discuss the SIMD architecture in detail with its various configurations.
-
PAGE 33
• What are the characteristics of CISC and RISC architectures?
• Explain Flynn's classification of computer architecture using a neat block diagram.
• Discuss the different types of dynamic interconnection networks.