
  • LECTURE 6

    DR. SAMMAN H. AMEEN

    Shared-memory systems

    1

  • Last week

We discussed static interconnection networks.

    This week we explain:

    1. Shared-memory systems

    PAGE 2

  • PAGE 3

In today’s competitive technology market, where speed and cost are so important, efficiency is crucial to success. For most human tasks, teamwork is the way to go. Technology follows the same principle: multiprocessors allow speedup through parallel programming, and, as with teamwork, communication is essential for maximum efficiency.

    One popular way to allow multiple processors to communicate is to use a shared-memory architecture. This lecture describes the characteristics of shared-memory systems, the surrounding software programming paradigm, the hardware requirements for implementation, and a description of Altera’s solution. Emphasis is placed on cache coherence and synchronization.

  • PAGE 4

Memory is a very important component of a computer, and even more important is the type of memory in a computer’s architecture. Having the wrong type of memory for the system can be costly.

    There are many types of memory, all of which must work together in a system to perform tasks efficiently. System efficiency is gained by having different levels of memory units with different data transfer speeds: small, fast memories near the processor, and at the bottom of the hierarchy a layer of slow but large storage (such as disks or magnetic tape).

  • PAGE 5

The use of a shared-memory system for parallel processing has several advantages. For programmers, the coding is very similar to a multi-threaded application running on a uniprocessor system, because the resources are shared the same way among the tasks and the same synchronization techniques are used, which makes the adaptation easy. Global variables can be loaded into local memory caches, which increases the performance of applications that use shared memory extensively. Another advantage is that the most popular platforms now offer hardware extensions to take care of memory and cache coherence. This makes the incremental cost of adding processors to a multiprocessor design very low.

    Two main problems need to be addressed when designing a shared-memory system: performance degradation due to contention, and coherence problems. Performance degradation can happen when multiple processors try to access the shared memory simultaneously. A typical design uses caches to solve the contention problem. However, having multiple copies of data spread throughout the caches can lead to a coherence problem. The copies in the caches are coherent if they all hold the same value. However, if one of the processors writes over the value of one of the copies, that copy becomes inconsistent because it no longer equals the value of the other copies. In this lecture we study a variety of shared-memory systems and their solutions to the cache coherence problem.
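The similarity to ordinary multi-threaded programming can be illustrated with a short sketch (our own example, not from the lecture): several threads update one shared variable, with a lock providing the synchronization technique the paragraph mentions.

```python
import threading

counter = 0                  # the shared variable, visible to all threads
lock = threading.Lock()      # synchronization primitive guarding it
N_THREADS, N_INCR = 4, 10_000

def worker():
    global counter
    for _ in range(N_INCR):
        with lock:           # without the lock, increments could be lost
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)               # 40000: every increment survived
```

The same pattern carries over to a shared-memory multiprocessor: the threads become processors, and the lock becomes a hardware-supported synchronization primitive.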

  • PAGE 6

Using the shared-memory model for a multiprocessor can introduce a bottleneck into the architecture. At any instant, more than one processor may be accessing the same memory location, which can greatly reduce computational throughput.

    Giving each processing element a local memory and using the message-passing model alleviates this issue.

  • PAGE 7

    • Uniform Memory Access (UMA)

    • Non-Uniform Memory Access (NUMA)

    • Cache-only Memory Architecture (COMA)

  • PAGE 8

    • All processors have equal access time to any memory location.

• The interconnection network used in the UMA can be a single bus, multiple buses, a crossbar switch, or a multistage switching network.

    • Two or more CPUs and one or more memory modules all use the same bus for communication.

    • Tightly-coupled systems (high degree of resource sharing)

  • PAGE 9

• Symmetric:
      - All processors have equal access to all peripheral devices.
      - All processors are identical.

    • Asymmetric:
      - One processor (the master) executes the operating system.
      - Other processors may be of different types and may be dedicated to special tasks.

  • PAGE 10

• Each processor has part of the shared memory attached.

    • There is a single address space visible to all CPUs. All local memories form a global address space accessible by all processors

    • Access to remote memory is slower than access to local memory.

  • PAGE 11

  • PAGE 12

NUMA machines that do no caching do not scale well: having to go to the remote memory every time a nonlocal memory word is accessed is a major performance hit. However, if caching is added, then cache coherence must also be maintained, and the model is called ccNUMA.

  • PAGE 13

• Similar to the NUMA, each processor has part of the shared memory in the COMA. However, in this case the shared memory consists of cache memory.

    • A COMA system requires that data be migrated to the processor requesting it.

    • There is a cache directory (D) that helps in remote cache access.

  • PAGE 14

  • PAGE 15

Write-Through vs. Write-Back

    • In a single-cache system, coherence between memory and the cache is maintained using one of two policies: (1) write-through, and (2) write-back.

    1. In write-through, the memory is updated every time the cache is updated.

    2. In write-back, the memory is updated only when the block in the cache is being replaced.
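The two policies can be contrasted with a toy single-cache model (class and variable names are ours, not from the lecture): write-through pushes every write to memory immediately, while write-back defers it until the block is replaced.

```python
class Cache:
    """Minimal single-cache sketch of the two update policies."""
    def __init__(self, policy):
        self.policy = policy          # "write-through" or "write-back"
        self.memory = {"X": 5}        # backing main memory
        self.cache = {}               # cached copies
        self.dirty = set()            # blocks modified but not yet flushed

    def write(self, addr, value):
        self.cache[addr] = value
        if self.policy == "write-through":
            self.memory[addr] = value # memory updated on every write
        else:
            self.dirty.add(addr)      # write-back: defer until replacement

    def evict(self, addr):
        if addr in self.dirty:        # write-back: flush to memory now
            self.memory[addr] = self.cache[addr]
            self.dirty.discard(addr)
        del self.cache[addr]

wt, wb = Cache("write-through"), Cache("write-back")
for c in (wt, wb):
    c.write("X", 10)
print(wt.memory["X"], wb.memory["X"])  # 10 5 -- write-back memory is stale
wb.evict("X")
print(wb.memory["X"])                  # 10 -- updated only at replacement
```

The stale value visible in the write-back case is exactly the window during which another processor reading memory would see old data.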

  • PAGE 16

  • PAGE 17

    -Multiple copies of x

    -What if P1 updates x?

  • PAGE 18

    • Caches play key role in all cases

    • Reduce average data access time

    • Reduce bandwidth demands placed on shared interconnect

    • But private processor caches create a problem

    • Copies of a variable can be present in multiple caches

• A write by one processor may not become visible to others

    • They’ll keep accessing stale values in their caches

    • Cache coherence problem

    • Need to take actions to ensure visibility

  • PAGE 19

Write-Update vs. Write-Invalidate

    • There are two fundamental cache coherence policies: (1) write-invalidate, and (2) write-update.

    Write-invalidate maintains consistency by reading from local caches until a write occurs. When any processor updates the value of X through a write, posting a dirty bit for X invalidates all other copies.

    Write-update maintains consistency by immediately updating all copies in all caches. All dirty bits are set during each write operation.
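A minimal sketch (our own naming, not from the lecture) contrasting the two policies when three caches hold copies of X:

```python
# Three caches, each holding a copy of X, plus a validity flag per cache.
caches = [{"X": 5}, {"X": 5}, {"X": 5}]
valid = [True, True, True]

def write_update(writer, addr, value):
    for c in caches:                  # broadcast: every copy gets the new value
        c[addr] = value

def write_invalidate(writer, addr, value):
    caches[writer][addr] = value
    for i in range(len(caches)):      # all other copies are marked invalid
        if i != writer:
            valid[i] = False

write_update(0, "X", 10)
print([c["X"] for c in caches])           # [10, 10, 10]
write_invalidate(0, "X", 20)
print([c["X"] for c in caches], valid)    # [20, 10, 10] [True, False, False]
```

After the invalidate, caches 1 and 2 still hold the old value 10, but their invalid flags force them to re-fetch X on the next access, which is where the protocols on the following slides come in.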

  • PAGE 20

    • Writing to Cache in n processor case

    • Write Update - Write Through

    • Write Update - Write Back

    • Write Invalidate - Write Through

    • Write Invalidate - Write Back

  • PAGE 21

  • PAGE 22

  • PAGE 23

    Snooping protocols are based on watching bus activities and carry out the appropriate coherency commands when necessary. Global memory is moved in blocks, and each block has a state associated with it, which determines what happens to the entire contents of the block. The state of a block might change as a result of the operations Read-Miss, Read-Hit, Write-Miss, and Write-Hit.

  • PAGE 24

Multiple processors can read block copies from main memory safely until one processor updates its copy. At this time, all cache copies are invalidated and the memory is updated to remain consistent.

    State            Description
    Valid [VALID]    The copy is consistent with global memory
    Invalid [INV]    The copy is inconsistent

  • PAGE 25

    X = 5

    1. P reads X

    2. Q reads X

    3. Q updates X, X=10

    4. Q reads X

    5. Q updates X, X=15

    6. P updates X, X=20

    7. Q reads X

  • PAGE 26

Event            Memory X   P's cache (state)   Q's cache (state)   Comment
    Original value   5
    P reads X        5          5 (VALID)                               Read-Miss
    Q reads X        5          5 (VALID)           5 (VALID)           Read-Miss
    Q updates X      10         5 (INV)             10 (VALID)          Write-Hit
    Q reads X        10         5 (INV)             10 (VALID)          Read-Hit
    Q updates X      15         5 (INV)             15 (VALID)          Write-Hit
    P updates X      20         20 (VALID)          15 (INV)            Write-Miss
    Q reads X        20         20 (VALID)          20 (VALID)          Read-Miss
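The event trace above can be replayed with a toy model of write-invalidate write-through (class and method names are ours): every write goes straight to memory, and a write by one processor invalidates all other cached copies.

```python
class WriteThroughInvalidate:
    """Toy model: one shared location X, one block per cache."""
    def __init__(self, x=5):
        self.mem = x
        self.caches = {}                       # processor -> [value, state]

    def read(self, p):
        c = self.caches.get(p)
        if c is not None and c[1] == "VALID":
            return "Read-Hit"
        self.caches[p] = [self.mem, "VALID"]   # miss: fetch from memory
        return "Read-Miss"

    def write(self, p, v):
        c = self.caches.get(p)
        hit = c is not None and c[1] == "VALID"
        for q, other in self.caches.items():   # invalidate all other copies
            if q != p:
                other[1] = "INV"
        self.caches[p] = [v, "VALID"]
        self.mem = v                           # write-through: memory updated too
        return "Write-Hit" if hit else "Write-Miss"

s = WriteThroughInvalidate()
events = [("r", "P"), ("r", "Q"), ("w", "Q", 10), ("r", "Q"),
          ("w", "Q", 15), ("w", "P", 20), ("r", "Q")]
log = [s.read(e[1]) if e[0] == "r" else s.write(e[1], e[2]) for e in events]
print(log)    # matches the Comment column of the table above
print(s.mem)  # 20
```

Running the seven events produces exactly the hit/miss sequence in the Comment column, and memory tracks every write because of the write-through policy.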

  • PAGE 27


  • PAGE 28

A valid block can be owned by memory and shared in multiple caches that can contain only the shared copies of the block. Multiple processors can safely read these blocks from their caches until one processor updates its copy. At this time, the writer becomes the only owner of the valid block and all other copies are invalidated.

    State                        Description
    Shared (Read-Only) [RO]      Data is valid and can be read safely. Multiple copies can be in this state
    Exclusive (Read-Write) [RW]  Only one valid cache copy exists and can be read from and written to safely. Copies in other caches are invalid
    Invalid [INV]                The copy is inconsistent

  • PAGE 29

Event            Memory X   P's cache (state)   Q's cache (state)   Comment
    Original value   5
    P reads X        5          5 (RO)                                  Read-Miss
    Q reads X        5          5 (RO)              5 (RO)              Read-Miss
    Q updates X      5          5 (INV)             10 (RW)             Write-Hit
    Q reads X        5          5 (INV)             10 (RW)             Read-Hit
    Q updates X      5          5 (INV)             15 (RW)             Write-Hit
    P updates X      5          20 (RW)             15 (INV)            Write-Miss
    Q reads X        20         20 (RO)             20 (RO)             Read-Miss
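The write-back variant can be replayed the same way with a simplified model (ours; it ignores block transfer to the writer on a write-miss): memory is updated only when a read-miss forces the exclusive owner to flush its block.

```python
class WriteBackInvalidate:
    """Toy model: states RO (shared), RW (exclusive owner), INV."""
    def __init__(self, x=5):
        self.mem = x
        self.caches = {}                       # processor -> [value, state]

    def _owner(self):
        for p, c in self.caches.items():
            if c[1] == "RW":
                return p
        return None

    def read(self, p):
        c = self.caches.get(p)
        if c is not None and c[1] in ("RO", "RW"):
            return "Read-Hit"
        o = self._owner()
        if o is not None:                      # owner flushes block to memory
            self.mem = self.caches[o][0]
            self.caches[o][1] = "RO"
        self.caches[p] = [self.mem, "RO"]
        return "Read-Miss"

    def write(self, p, v):
        c = self.caches.get(p)
        hit = c is not None and c[1] in ("RO", "RW")
        for q, other in self.caches.items():   # invalidate all other copies
            if q != p:
                other[1] = "INV"
        self.caches[p] = [v, "RW"]             # writer is now exclusive owner
        return "Write-Hit" if hit else "Write-Miss"

s = WriteBackInvalidate()
events = [("r", "P"), ("r", "Q"), ("w", "Q", 10), ("r", "Q"),
          ("w", "Q", 15), ("w", "P", 20), ("r", "Q")]
log = [s.read(e[1]) if e[0] == "r" else s.write(e[1], e[2]) for e in events]
print(log)    # same hit/miss sequence as the table above
print(s.mem)  # 20 -- memory stayed at 5 until the final read forced a flush
```

Note the contrast with the write-through trace: here the Memory X column stays at 5 through all three writes, and only the last read-miss brings memory up to date.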

  • PAGE 30


  • PAGE 31

  • PAGE 32

• Explain both the Von Neumann architecture and the Harvard architecture, showing the advantages of each.

    • What are the fundamental design decisions in selecting an appropriate architecture for an interconnection network (IN) for parallel machines? Then explain the synchronous and asynchronous modes of operation.

    • Discuss the SIMD architecture in detail, with its variant configurations.

  • PAGE 33

• What are the characteristics of CISC and RISC architectures?

    • Explain Flynn's classification of computer architectures using a neat block diagram.

    • Discuss the different types of dynamic interconnection networks.