
Page 1: PARALLEL MEMORY ARCHITECTURE

PARALLEL MEMORY ARCHITECTURE

CS/ECE 6810: Computer Architecture

Mahdi Nazm Bojnordi

Assistant Professor

School of Computing

University of Utah

Page 2: PARALLEL MEMORY ARCHITECTURE

Chip Multiprocessors

¨ Can be viewed as a simple SMP on a single chip

¨ CPUs are now called cores
  ¤ One thread per core

¨ Shared higher-level caches
  ¤ Typically the last level
  ¤ Lower latency
  ¤ Improved bandwidth

¨ Not necessarily homogeneous cores!

[Figure: Intel Nehalem (Core i7), with labels Core 0, Core 1, Core 3, and Shared cache]

Page 3: PARALLEL MEMORY ARCHITECTURE

Efficiency of Chip Multiprocessing

¨ Ideally, n cores provide n× performance
¨ Example: design an ideal dual-processor
  ¤ Goal: provide the same performance as uniprocessor

                     Uniprocessor   Dual-processor
Frequency            1              ?
Voltage              1              ?
Execution Time       1              1
Dynamic Power        1              ?
Dynamic Energy       1              ?
Energy Efficiency    1              ?

Page 4: PARALLEL MEMORY ARCHITECTURE

Efficiency of Chip Multiprocessing

¨ Ideally, n cores provide n× performance
¨ Example: design an ideal dual-processor
  ¤ Goal: provide the same performance as uniprocessor

                     Uniprocessor   Dual-processor
Frequency            1              0.5
Voltage              1              0.5
Execution Time       1              1
Dynamic Power        1              2 × 0.125
Dynamic Energy       1              2 × 0.125
Energy Efficiency    1              4

$f \propto V$ and $P \propto V^3$ $\Rightarrow$ $V_{dual} = 0.5\,V_{uni}$ $\Rightarrow$ $P_{dual} = 2 \times 0.125\,P_{uni}$
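The arithmetic in the table can be checked with a short sketch. It assumes the slide's model (frequency proportional to voltage, so per-core dynamic power scales as V^3) and normalizes everything to the uniprocessor. The function names are illustrative and do not come from the lecture, and energy efficiency is taken here as the inverse of relative energy.

```c
/* Relative dynamic power of one core at relative voltage v,
   assuming f is proportional to V so that P is proportional to V^3. */
double core_power(double v) {
    return v * v * v;
}

/* Relative dynamic power of a chip with `cores` identical cores. */
double chip_power(int cores, double v) {
    return cores * core_power(v);
}

/* Relative energy = relative power x relative execution time. */
double chip_energy(int cores, double v, double exec_time) {
    return chip_power(cores, v) * exec_time;
}

/* Assumption: efficiency is the inverse of relative energy. */
double energy_efficiency(int cores, double v, double exec_time) {
    return 1.0 / chip_energy(cores, v, exec_time);
}
```

For the dual-processor row, `chip_power(2, 0.5)` evaluates to 2 × 0.125 = 0.25 and `energy_efficiency(2, 0.5, 1.0)` to 4.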

Page 5: PARALLEL MEMORY ARCHITECTURE

Challenges

Page 6: PARALLEL MEMORY ARCHITECTURE

Example Code I

¨ A sequential application runs as a single thread

Kernel Function:

void kern (int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        A[i] = A[i] * A[i] + 5;
    }
}

Single Thread:

main() {
    …
    kern (1, n);
    …
}

[Figure: one processor connected to memory holding array A, elements 1 to n]

Page 7: PARALLEL MEMORY ARCHITECTURE

Example Code I

¨ Two threads operating on separate partitions

Kernel Function:

void kern (int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        A[i] = A[i] * A[i] + 5;
    }
}

Thread 0:

main() {
    …
    kern (1, n/2);
    …
}

Thread 1:

kern (n/2+1, n);

[Figure: two processors (Thread 0 and Thread 1) sharing memory that holds array A, elements 1 to n]
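One way to realize this two-way split is with POSIX threads. The sketch below is an assumption about how the slide's pseudocode might be made concrete: `run_two_threads`, `struct range`, and `worker` are hypothetical names, and index 0 of the array is left unused so that elements run from 1 to n as on the slide. Because each thread touches a disjoint half of A, no synchronization beyond the final join is needed.

```c
#include <pthread.h>
#include <stddef.h>

/* Shared array; elements 1..n are used, as on the slide. */
static int *A;

struct range { int start; int end; };

/* Same kernel as on the slide. */
static void kern(int start, int end) {
    for (int i = start; i <= end; ++i) {
        A[i] = A[i] * A[i] + 5;
    }
}

/* pthread entry point: unpack the range and run the kernel on it. */
static void *worker(void *arg) {
    const struct range *r = arg;
    kern(r->start, r->end);
    return NULL;
}

/* Runs kern over array[1..n] with two threads; returns 0 on success. */
int run_two_threads(int *array, int n) {
    A = array;
    struct range upper = { n / 2 + 1, n };
    pthread_t t1;
    if (pthread_create(&t1, NULL, worker, &upper) != 0) {
        return -1;
    }
    kern(1, n / 2);          /* the calling thread acts as Thread 0 */
    pthread_join(t1, NULL);  /* wait for Thread 1 before returning */
    return 0;
}
```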

Page 8: PARALLEL MEMORY ARCHITECTURE

Performance of Parallel Processing

¨ Recall: Amdahl’s law for theoretical speedup
  ¤ Overall speedup is limited by the fraction of the program that cannot be executed in parallel

$\text{speedup} = \dfrac{1}{f + \frac{1 - f}{n}}$

f: sequential fraction; n: number of processors

[Figure: Speedup vs. Sequential Fraction. x-axis: Number of Processors (0 to 150); y-axis: Speedup (0 to 10); one curve per sequential fraction (10%, 20%, 40%, 60%, 90%), annotated 10x, 5x, ~2x, ~1x]
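Amdahl's law, speedup = 1 / (f + (1 - f) / n) with f the sequential fraction and n the number of processors, is easy to evaluate directly. The helper names below are illustrative only.

```c
/* Amdahl's law: f is the sequential fraction (0..1), n the number of processors. */
double amdahl_speedup(double f, int n) {
    return 1.0 / (f + (1.0 - f) / n);
}

/* Upper bound on speedup as n grows without limit. Requires f > 0. */
double amdahl_limit(double f) {
    return 1.0 / f;
}
```

With a 10% sequential fraction the bound is 10x, and with 20% it is 5x, which matches the asymptotes annotated on the plot.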

Page 9: PARALLEL MEMORY ARCHITECTURE

Example Code II

¨ A single location is updated every time

Kernel Function:

void kern (int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        sum = sum + A[i];
    }
}

Thread 0:

main() {
    …
    kern (1, n);
    …
}

[Figure: one processor (Thread 0) connected to memory holding array A, elements 1 to n]

Page 10: PARALLEL MEMORY ARCHITECTURE

Example Code II

¨ A single location is updated every time

Kernel Function:

void kern (int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        sum = sum + A[i];
    }
}

Thread 0:

main() {
    …
    kern (1, n);
    …
}

[Figure: one processor (Thread 0) connected to memory holding array A (elements 1 to n) and the variable sum]

Page 11: PARALLEL MEMORY ARCHITECTURE

Example Code II

¨ Two threads operating on separate partitions

Kernel Function:

void kern (int start, int end) {
    int i;
    for (i = start; i <= end; ++i) {
        sum = sum + A[i];
    }
}

Thread 0:

main() {
    …
    kern (1, n/2);
    …
}

Thread 1:

kern (n/2+1, n);

[Figure: two processors (Thread 0 and Thread 1) sharing memory that holds array A (elements 1 to n) and the variable sum]
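Although the two threads read separate partitions of A, both update the single location `sum`. The statement `sum = sum + A[i]` is a read-modify-write, so concurrent updates from the two threads can be lost. One common remedy, sketched below and not necessarily the one the lecture goes on to use, is to give each thread a private partial sum and combine the two after joining. `parallel_sum` and `struct task` are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

struct task {
    const int *a;    /* array; elements 1..n are used */
    int start, end;  /* inclusive range handled by this thread */
    long partial;    /* private result, written only by the owning thread */
};

/* Accumulates into a local variable, then publishes the private result. */
static void sum_range(struct task *t) {
    long s = 0;
    for (int i = t->start; i <= t->end; ++i) {
        s += t->a[i];
    }
    t->partial = s;
}

static void *sum_worker(void *arg) {
    sum_range(arg);
    return NULL;
}

/* Sums a[1..n] with two threads and stores the total in *out.
   Returns 0 on success, -1 if the second thread cannot be created. */
int parallel_sum(const int *a, int n, long *out) {
    struct task t0 = { a, 1, n / 2, 0 };
    struct task t1 = { a, n / 2 + 1, n, 0 };
    pthread_t th;
    if (pthread_create(&th, NULL, sum_worker, &t1) != 0) {
        return -1;
    }
    sum_range(&t0);          /* the calling thread acts as Thread 0 */
    pthread_join(th, NULL);  /* t1.partial is safe to read after the join */
    *out = t0.partial + t1.partial;
    return 0;
}
```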

Page 12: PARALLEL MEMORY ARCHITECTURE

Communication in Multiprocessors

¨ How do multiple processor cores communicate?

Shared Memory
§ Multiple threads employ shared memory
§ Easy for programmers (loads and stores)

Message Passing
§ Explicit communication through interconnection network
§ Simple hardware

[Figure: left, Core 1 through Core N attached to a single Shared Memory; right, Core 1 through Core N, each with its own Mem, connected by an Interconnection Network]

Page 13: PARALLEL MEMORY ARCHITECTURE

Shared Memory Architectures

Uniform Memory Access
¨ Equal latency for all processors
¨ Simple software control

Non-Uniform Memory Access
¨ Access latency is proportional to proximity
  ¤ Fast local accesses

[Figure: Example UMA, Core 1 through Core 4 sharing one Memory; Example NUMA, Core 1 through Core 4, each with its own Mem and Router]

Page 14: PARALLEL MEMORY ARCHITECTURE

Network Topologies

Shared Network
¨ Low latency
¨ Low bandwidth
¨ Simple control
  ¤ e.g., bus

Point to Point Network
¨ High latency
¨ High bandwidth
¨ Complex control
  ¤ e.g., mesh, ring

[Figure: left, a shared network joining nodes (Core 1, Core 4), each with Mem and Router; right, a point-to-point network joining four nodes (Core 1 to Core 4), each with Mem and Router]

Page 15: PARALLEL MEMORY ARCHITECTURE

Challenges in Shared Memories

¨ Correctness of an application is influenced by
  ¤ Memory consistency
    ▪ All memory instructions appear to execute in the program order
    ▪ Known to the programmer
  ¤ Cache coherence
    ▪ All the processors see the same data for a particular memory address as they should have if there were no caches in the system
    ▪ Invisible to the programmer

Page 16: PARALLEL MEMORY ARCHITECTURE

Cache Coherence Problem

¨ Multiple copies of each cache block
  ¤ In main memory and caches

¨ Multiple copies can get inconsistent when writes happen
  ¤ Solution: propagate writes from one core to others

[Figure: Core 1 through Core N, each with a private cache (Cache 1 through Cache N), connected to Main Memory]

Page 17: PARALLEL MEMORY ARCHITECTURE

Scenario 1: Loading From Memory

¨ Variable A initially has value 0
¨ P1 stores value 1 into A
¨ P2 loads A from memory and sees old value 0

[Figure: P1 and P2, each with a cache, connected by a bus to Memory holding A:0]

Page 18: PARALLEL MEMORY ARCHITECTURE

Scenario 2: Loading From Cache

¨ P1 and P2 both have variable A (value 0) in their caches
¨ P1 stores value 1 into A
¨ P2 loads A from its cache and sees old value

[Figure: P1 and P2, each with a cache, connected by a bus to Memory holding A:0]
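Both scenarios can be reproduced with a tiny single-threaded model. The sketch below assumes write-back private caches with no coherence actions at all, which is one way (not the only way) to arrive at the stale reads described. Every name in it is hypothetical, and each cache holds just the one variable A.

```c
#include <stdbool.h>

/* A one-block private cache that can hold a copy of variable A. */
struct cache { bool valid; int value; };

static int memory_A = 0;  /* A initially has value 0 in main memory */

/* Load: a hit returns the cached copy; a miss fetches A from memory. */
int load(struct cache *c) {
    if (!c->valid) {
        c->value = memory_A;
        c->valid = true;
    }
    return c->value;
}

/* Store: updates the local copy only (write-back, no coherence traffic). */
void store(struct cache *c, int v) {
    c->value = v;
    c->valid = true;
}

/* Scenario 1: P1 stores 1, then P2 loads A from memory. Returns what P2 sees. */
int scenario1(void) {
    struct cache p1 = { false, 0 }, p2 = { false, 0 };
    memory_A = 0;
    store(&p1, 1);      /* new value stays in P1's cache */
    return load(&p2);   /* P2 misses and reads the old value from memory */
}

/* Scenario 2: both cache A, P1 stores 1, then P2 loads from its own cache. */
int scenario2(void) {
    struct cache p1 = { false, 0 }, p2 = { false, 0 };
    memory_A = 0;
    load(&p1);
    load(&p2);
    store(&p1, 1);      /* P2's copy is neither updated nor invalidated */
    return load(&p2);   /* P2 hits on its stale copy */
}
```

In this model both `scenario1()` and `scenario2()` return 0, the old value, even though P1 has already written 1.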

Page 19: PARALLEL MEMORY ARCHITECTURE

Cache Coherence

¨ The key operation is update/invalidate sent to all or a subset of the cores
  ¤ Software based management
    ▪ Flush: write all of the dirty blocks to memory
    ▪ Invalidate: make all of the cache blocks invalid
  ¤ Hardware based management
    ▪ Update or invalidate other copies on every write
    ▪ Send data to everyone, or only the ones who have a copy

¨ Invalidation based protocol is better. Why?
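One commonly cited reason is bus traffic: when a core writes the same block repeatedly, an update protocol broadcasts every write, while an invalidation protocol sends one invalidation and then writes locally. The sketch below counts bus messages for a trace of accesses to a single block under each policy. It is a deliberately simplified model (two cores, a miss and a broadcast each cost one message, no evictions), and all names are hypothetical.

```c
#include <stdbool.h>

#define CORES 2

struct access { int core; bool is_write; };

/* Returns true if any core other than `c` currently holds a copy. */
static bool others_share(const bool has_copy[CORES], int c) {
    for (int o = 0; o < CORES; ++o) {
        if (o != c && has_copy[o]) return true;
    }
    return false;
}

/* Counts bus messages for accesses to ONE block.
   invalidate == true  selects write-invalidate,
   invalidate == false selects write-update. */
int count_bus_messages(const struct access *trace, int len, bool invalidate) {
    bool has_copy[CORES] = { false };
    int msgs = 0;
    for (int i = 0; i < len; ++i) {
        int c = trace[i].core;
        if (!has_copy[c]) {          /* miss: fetch the block */
            ++msgs;
            has_copy[c] = true;
        }
        if (trace[i].is_write && others_share(has_copy, c)) {
            ++msgs;                  /* one broadcast: invalidation or update */
            if (invalidate) {
                for (int o = 0; o < CORES; ++o) {
                    if (o != c) has_copy[o] = false;
                }
            }
        }
    }
    return msgs;
}
```

For a trace in which both cores read the block and core 0 then writes it four times, this model counts 3 messages under invalidation and 6 under update.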

Page 20: PARALLEL MEMORY ARCHITECTURE

Snoopy Protocol

¨ Relies on a broadcast infrastructure among caches
  ¤ For example, a shared bus

¨ Every cache monitors (snoops) the traffic on the shared medium to keep the states of its cache blocks up to date

[Figure: two diagrams, each showing two cores with private L1 caches, an LLC, and Memory]

Page 21: PARALLEL MEMORY ARCHITECTURE

Simple Snooping Protocol

¨ Relies on write-through, write no-allocate cache
¨ Multiple readers are allowed
  ¤ Writes invalidate replicas
¨ Employs a simple state machine for each cache unit

[Figure: P1 and P2, each with a cache, connected by a bus to Memory holding A:0]

Page 22: PARALLEL MEMORY ARCHITECTURE

Simple Snooping State Machine

¨ Every node updates its one-bit valid flag using a simple finite state machine (FSM)

¨ Processor actions
  ¤ Load, Store, Evict

¨ Bus traffic
  ¤ BusRd, BusWr

[Figure: two-state diagram with states Valid and Invalid; transition labels shown: Load/BusRd, Load/--, Store/BusWr (twice), Evict/--, BusWr/--; legend distinguishes transitions caused by local actions from transitions caused by bus traffic]
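The valid/invalid state machine can be written as a small transition function. The extracted slide keeps the edge labels but not the arrow endpoints, so the sketch below assumes the conventional write-through, write no-allocate behavior: a load miss fetches the block, every store goes to the bus without changing the local state, and a snooped BusWr or a local eviction invalidates the copy. The type and function names are hypothetical.

```c
enum state { INVALID, VALID };
enum event { LOAD, STORE, EVICT, SNOOP_BUSRD, SNOOP_BUSWR };
enum bus   { BUS_NONE, BUS_RD, BUS_WR };

struct result {
    enum state next;  /* state after the event */
    enum bus   out;   /* bus message issued by this cache, if any */
};

/* One step of the per-block FSM for a single cache. */
struct result step(enum state s, enum event e) {
    struct result r = { s, BUS_NONE };
    switch (e) {
    case LOAD:
        if (s == INVALID) {      /* Load/BusRd: miss, fetch the block */
            r.next = VALID;
            r.out = BUS_RD;
        }                        /* Load/--: hit, nothing to do */
        break;
    case STORE:
        r.out = BUS_WR;          /* Store/BusWr: write-through; no-allocate,
                                    so the state does not change */
        break;
    case EVICT:
        r.next = INVALID;        /* Evict/-- */
        break;
    case SNOOP_BUSWR:
        r.next = INVALID;        /* BusWr/--: another cache wrote the block */
        break;
    case SNOOP_BUSRD:
        break;                   /* reads by others leave the copy intact */
    }
    return r;
}
```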

Page 23: PARALLEL MEMORY ARCHITECTURE

Snooping with Writeback Policy

¨ Problem: writes are not propagated to memory until eviction
  ¤ Cache data may be different from main memory

¨ Solution: identify the owner of the most recently updated replica
  ¤ Every data block may have only one owner at any time
  ¤ Only the owner can update the replica
  ¤ Multiple readers can share the data
    ▪ No one can write without gaining ownership first

Page 24: PARALLEL MEMORY ARCHITECTURE

Modified-Shared-Invalid Protocol

¨ Every cache block transitions among three states
  ¤ Invalid: no replica in the cache
  ¤ Shared: a read-only copy in the cache
    ▪ Multiple units may have the same copy
  ¤ Modified: a writable copy of the data in the cache
    ▪ The replica has been updated
    ▪ The cache has the only valid copy of the data block

¨ Processor actions
  ¤ Load, store, evict

¨ Bus messages
  ¤ BusRd, BusRdX, BusInv, BusWB, BusReply

Page 25: PARALLEL MEMORY ARCHITECTURE

MSI Example

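States: P1 = I, P2 = I
Processor action: Load
Bus traffic: BusRd, BusReply

[Figure: MSI state diagram with states invalid and shared; transition label shown: Load/BusRd]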

Page 26: PARALLEL MEMORY ARCHITECTURE

MSI Example

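States: P1 = S, P2 = I
Processor action: Load
Bus traffic: BusRd

[Figure: MSI state diagram with states invalid and shared; transition labels shown: Load/BusRd, Load/--, BusRd/[BusReply]]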

Page 27: PARALLEL MEMORY ARCHITECTURE

MSI Example

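States: P1 = S, P2 = S
Processor action: Evict

[Figure: MSI state diagram with states invalid and shared; transition labels shown: Load/BusRd, Load/--, BusRd/[BusReply], Evict/--]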

Page 28: PARALLEL MEMORY ARCHITECTURE

MSI Example

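States: P1 = S, P2 = I
Processor action: Store

[Figure: MSI state diagram with states invalid, shared, and modified; transition labels shown: Load/BusRd, Load/--, BusRd/[BusReply], Evict/--, BusRdX/[BusReply], Store/BusRdX, and Load, Store/--]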

Page 29: PARALLEL MEMORY ARCHITECTURE

MSI Example

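States: P1 = I, P2 = M
Processor action: Load

[Figure: MSI state diagram with states invalid, shared, and modified; transition labels shown: Load/BusRd, Load/--, BusRd/[BusReply], Evict/--, BusRdX/[BusReply], Store/BusRdX, BusRd/BusReply, and Load, Store/--]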

Page 30: PARALLEL MEMORY ARCHITECTURE

MSI Example

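States: P1 = S, P2 = S
Processor action: Store

[Figure: MSI state diagram with states invalid, shared, and modified; transition labels shown: Load/BusRd, Load/--, BusRd/[BusReply], Evict/--, BusInv,BusRdX/[BusReply], Store/BusRdX, Store/BusInv, BusRd/BusReply, and Load, Store/--]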

Page 31: PARALLEL MEMORY ARCHITECTURE

MSI Example

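States: P1 = M, P2 = I
Processor action: Store

[Figure: MSI state diagram with states invalid, shared, and modified; transition labels shown: Load/BusRd, Load/--, BusRd/[BusReply], Evict/--, BusInv,BusRdX/[BusReply], Store/BusRdX, BusRdX/BusReply, Store/BusInv, BusRd/BusReply, and Load, Store/--]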

Page 32: PARALLEL MEMORY ARCHITECTURE

MSI Example

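States: P1 = I, P2 = M
Processor action: Evict
Bus traffic: BusWB

[Figure: MSI state diagram with states invalid, shared, and modified; transition labels shown: Load/BusRd, Load/--, BusRd/[BusReply], Evict/--, BusInv,BusRdX/[BusReply], Store/BusRdX, BusRdX/BusReply, Store/BusInv, BusRd/BusReply, and Load, Store/--]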
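The MSI walkthrough can be replayed with a small two-cache model. Two things in this sketch are assumptions rather than facts stated on the slides: which processor performs each action (inferred from how the P1/P2 state labels change from one slide to the next), and the source and destination of each arrow (taken from the textbook MSI protocol, since the extracted diagrams keep only the edge labels). BusReply is not modeled, and all names are hypothetical.

```c
#define NCACHE 2

enum msi_state { MSI_I, MSI_S, MSI_M };
enum cpu_op    { OP_LOAD, OP_STORE, OP_EVICT };
enum bus_msg   { BUS_NONE, BUS_RD, BUS_RDX, BUS_INV, BUS_WB };

/* State of the one modeled block in each cache; starts as all invalid. */
static enum msi_state state[NCACHE];

void msi_reset(void) {
    for (int p = 0; p < NCACHE; ++p) state[p] = MSI_I;
}

enum msi_state msi_state_of(int p) {
    return state[p];
}

/* Snoop side: how every cache other than `src` reacts to a bus message. */
static void snoop(int src, enum bus_msg m) {
    for (int q = 0; q < NCACHE; ++q) {
        if (q == src) continue;
        if (m == BUS_RD && state[q] == MSI_M) {
            state[q] = MSI_S;   /* owner supplies the data, keeps a read-only copy */
        }
        if (m == BUS_RDX || m == BUS_INV) {
            state[q] = MSI_I;   /* another cache is taking ownership */
        }
    }
}

/* Processor side: cache `p` performs `op`; returns the bus message it issues. */
enum bus_msg msi_access(int p, enum cpu_op op) {
    enum bus_msg m = BUS_NONE;
    switch (op) {
    case OP_LOAD:
        if (state[p] == MSI_I) { m = BUS_RD; state[p] = MSI_S; }
        break;                                   /* S or M: Load/-- */
    case OP_STORE:
        if (state[p] == MSI_I)      m = BUS_RDX; /* Store/BusRdX */
        else if (state[p] == MSI_S) m = BUS_INV; /* Store/BusInv */
        state[p] = MSI_M;                        /* M: Store/-- */
        break;
    case OP_EVICT:
        if (state[p] == MSI_M) m = BUS_WB;       /* write the dirty block back */
        state[p] = MSI_I;
        break;
    }
    snoop(p, m);
    return m;
}
```

Under those assumptions, the sequence P1 load, P2 load, P2 evict, P2 store, P1 load, P1 store, P2 store, P2 evict steps the pair (P1, P2) through (S, I), (S, S), (S, I), (I, M), (S, S), (M, I), (I, M), and finally (I, I) with a BusWB, matching the state labels on the example slides.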