Cache Coherent Distributed Shared Memory


Page 1: Cache Coherent Distributed Shared Memory

Page 2: Motivations

• Small processor count
  – SMP machines
  – A single shared memory with multiple processors interconnected by a bus
• Large processor count
  – Distributed shared memory machines
  – Largely message-passing architectures

Page 3: Programming Concerns

• Message passing
  – Access to memory involves send/request packets
  – Communication costs
• Shared-memory model
  – Ease of programming
  – But not very scalable

• Scalable and easy to program?

Page 4: Distributed Shared Memory

• Physically distributed memory

• Presents a single shared address space

• Also known as NUMA machines, since memory access times are non-uniform
  – Local access times < remote access times

Page 5: DSM and Memory Access

• Big difference in accessing local versus remote data

• Large differences make it difficult to hide latency

• How about caching?
  – In short, it’s difficult
  – The obstacle is cache coherence

Page 6: Cache Coherence

• Cache coherence
  – Different processors may access values at the same memory location
  – How do we ensure data integrity at all times?
• Goal: an update by a processor at time t is visible to other processors at time t+1
• Two broad approaches
  – Snoopy protocols
  – Directory-based protocols
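
To make the problem concrete, here is a minimal Python sketch (not from the original slides; all names and addresses are illustrative) of two processors with private caches and no coherence protocol: after one processor writes, the other keeps reading a stale value.

```python
# Hypothetical sketch of the coherence problem itself: two private
# caches, no protocol. All names and addresses are illustrative.

memory = {0x100: 1}                   # shared memory: address -> value

class Cache:
    def __init__(self):
        self.lines = {}               # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:    # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]       # hit: may return a stale value!

    def write(self, addr, value):
        self.lines[addr] = value      # updates the private copy only

p0, p1 = Cache(), Cache()
print(p0.read(0x100), p1.read(0x100))  # both cache the value 1
p0.write(0x100, 42)                    # P0 updates its own copy
print(p1.read(0x100))                  # P1 still sees 1 -- stale data
```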

Page 7: Snoopy Coherence Protocols

• Transparent to the user
• Easy to implement
• For a read
  – Data is fetched from another cache or from memory
• For a write
  – All copies in other caches are invalidated
  – Write-back can be delayed or immediate
• The bus plays an important role: every cache snoops it to observe other processors’ transactions
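
A minimal sketch of the write-invalidate idea behind snoopy protocols, assuming an immediate write-back for simplicity; the Bus and SnoopyCache names are illustrative, and real protocols such as MESI track more states.

```python
# Hypothetical sketch of a write-invalidate snoopy protocol.
# Every cache watches ("snoops") the shared bus; a write by one
# processor invalidates all other cached copies of that address.

memory = {0x100: 1}

class Bus:
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, addr, writer):
        for c in self.caches:
            if c is not writer:
                c.lines.pop(addr, None)   # snooping caches drop their copy

class SnoopyCache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = memory[addr]   # fetch on miss
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, self)  # invalidate other copies
        self.lines[addr] = value
        memory[addr] = value        # immediate write-back, for simplicity

bus = Bus()
p0, p1 = SnoopyCache(bus), SnoopyCache(bus)
p0.read(0x100); p1.read(0x100)
p0.write(0x100, 42)
print(p1.read(0x100))   # miss (copy was invalidated), so P1 re-fetches 42
```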

Page 8: Example

Page 9: But it does not scale!

• Not feasible for machines with memory distributed across a large number of nodes
• The broadcast-on-bus approach scales poorly
• Leads to bus saturation
• Wastes processor cycles, since every cache in the system must snoop every transaction

Page 10: Directory-Based Cache Coherence

• A directory tracks which processors have cached each block of memory
• The directory contains information for all cache blocks in the system
• Each cache block can be in one of three states
  – Invalid
  – Shared
  – Exclusive
• To enter the Exclusive state, all other cached copies of the same memory location are invalidated
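
A minimal sketch of a directory entry with these three states; the field and method names are illustrative, and the Exclusive-to-Shared downgrade (owner write-back) is omitted for brevity.

```python
# Hypothetical sketch of a directory entry for one memory block.
# The directory records which processors hold a copy and in what
# state: Invalid, Shared, or Exclusive.

INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

class DirectoryEntry:
    def __init__(self):
        self.state = INVALID
        self.sharers = set()          # ids of processors holding a copy

    def read_miss(self, pid):
        # A read puts the block in Shared and records the requester.
        self.sharers.add(pid)
        self.state = SHARED

    def write_miss(self, pid):
        # Granting Exclusive requires invalidating every other copy.
        to_invalidate = self.sharers - {pid}
        self.sharers = {pid}
        self.state = EXCLUSIVE
        return to_invalidate          # caller sends the invalidations

entry = DirectoryEntry()
entry.read_miss(0)
entry.read_miss(1)                          # two processors share the block
print(entry.state, sorted(entry.sharers))   # Shared [0, 1]
victims = entry.write_miss(0)
print(entry.state, victims)                 # Exclusive {1}
```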

Page 11: Original form not popular

• Compared to snoopy protocols
  – Directory systems avoid broadcasting on a bus
• But all requests are served by a single directory server
  – May saturate the directory server
• Still not scalable
• How about distributing the directory?
  – Load balancing
  – A hierarchical model?

Page 12: Distributed Directory Protocol

• Involves sending messages among three node types
  – Local node: the requesting processor’s node
  – Home node: the node containing the memory location
  – Remote node: a node holding the cache block in the Exclusive state

Page 13: 3 Scenarios

• Scenario 1
  – Local node sends a request to the home node
  – Home node sends the data back to the local node
• Scenario 2
  – Local node sends a request to the home node
  – Home node redirects the request to the remote node
  – Remote node sends the data back to the local node
• Scenario 3
  – Local node sends a request for the Exclusive state
  – Home node redirects the request to the other remote nodes for invalidation
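
The three flows can be sketched as message traces; the node names and the request() helper below are illustrative, not the protocol's actual message format.

```python
# Hypothetical sketch of the three request flows among local, home,
# and remote nodes in a distributed-directory protocol.

def request(local, home, owner=None, exclusive=False):
    if exclusive:
        # Scenario 3: local wants Exclusive; home tells the other
        # remote nodes to invalidate their copies.
        print(f"{local} -> {home}: request exclusive access")
        print(f"{home} -> remotes: invalidate; then grant to {local}")
    elif owner is None:
        # Scenario 1: home holds a clean copy and replies directly.
        print(f"{local} -> {home}: read request")
        print(f"{home} -> {local}: data")
    else:
        # Scenario 2: a remote node owns the block exclusively,
        # so home redirects the request to it.
        print(f"{local} -> {home}: read request")
        print(f"{home} -> {owner}: forward request")
        print(f"{owner} -> {local}: data")

request("local", "home")                    # scenario 1
request("local", "home", owner="remote")    # scenario 2
request("local", "home", exclusive=True)    # scenario 3
```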

Page 14: Example

Page 15: Stanford DASH Multiprocessor

• First operational multiprocessor to support a scalable coherence protocol

• Demonstrates that scalability and cache coherence are not incompatible

• Two hypotheses
  – Shared-memory machines are easier to program
  – Cache coherence is vital

Page 16: Past Experience

• From experience
  – Memory access times differ widely between physical locations
  – Latency and bandwidth are important for shared-memory systems
  – Caching helps amortize the cost of memory access in a memory hierarchy

Page 17: DASH Multiprocessor

• Relaxed memory consistency model

• Observation
  – Most programs use explicit synchronization
  – Sequential consistency is therefore not necessary
  – The system can perform writes without waiting until all invalidations are complete

• Offers advantages in hiding memory latency
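
One way to sketch this observation: ordinary writes return immediately, and only a synchronization release waits for outstanding invalidation acknowledgements. The Event-based machinery below is illustrative, not DASH's actual mechanism.

```python
# Hypothetical sketch of the relaxed-consistency idea: ordinary writes
# are issued without waiting for invalidation acknowledgements; only a
# synchronization release waits for them all.

import threading

pending_acks = []                     # invalidations still in flight

def relaxed_write(addr, value, ack):
    # Issue the write and continue immediately; the invalidation
    # acknowledgement arrives in the background.
    pending_acks.append(ack)
    print(f"write {addr:#x} = {value} issued without stalling")

def release():
    # A release (e.g. an unlock) must wait until every earlier write's
    # invalidations are acknowledged, so that other processors observe
    # the writes before they see the release.
    for ack in pending_acks:
        ack.wait()
    pending_acks.clear()
    print("release: all invalidations acknowledged")

ack = threading.Event()
relaxed_write(0x100, 42, ack)
threading.Timer(0.01, ack.set).start()   # simulated late acknowledgement
release()
```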

Page 18: DASH Multiprocessor

• Non-binding software prefetch
  – Prefetches data into the cache
  – Maintains coherence
  – Transparent to the user
  – If the data is invalidated, it is simply re-fetched when actually accessed
• The compiler can issue such instructions to improve runtime performance
• Helps hide latency as well
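
A minimal sketch of what "non-binding" means: the prefetch warms the cache but remains subject to coherence, so if the line is invalidated before use, the real access simply re-fetches fresh data. All names are illustrative.

```python
# Hypothetical sketch of a non-binding prefetch: it warms the cache
# but stays coherent, so an invalidation before use is harmless.

memory = {0x200: 7}
cache = {}

def prefetch(addr):
    cache[addr] = memory[addr]          # hint: bring data in early

def invalidate(addr):
    cache.pop(addr, None)               # coherence traffic may evict it

def load(addr):
    if addr not in cache:               # invalidated? just re-fetch
        cache[addr] = memory[addr]
    return cache[addr]

prefetch(0x200)
memory[0x200] = 8                       # another processor writes...
invalidate(0x200)                       # ...so our copy is invalidated
print(load(0x200))                      # prints 8: fresh value, not stale 7
```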

Page 19: DASH Multiprocessor

• Remote access cache
  – Remote accesses are combined and buffered within individual nodes
  – Can be likened to a two-level cache hierarchy

Page 20: Lessons

• High performance requires careful planning of remote data accesses

• Scaling an application depends on other factors
  – Load balancing
  – Limited parallelism
  – The difficulty of scaling an application to use more processors

Page 21: Challenges

• Programming model?
  – A model that helps programmers reason about code, rather than fine-tune for a specific machine

• Fault tolerance and recovery?
  – More computers = higher chance of failure

• Increasing latency?
  – Deeper hierarchies = a larger variety of latencies

Page 22: Callisto

• Previously, networking gateways
  – Handled a diverse set of services
  – Handled thousands of channels
  – Required complex designs involving many chips
  – Had high power requirements
• Callisto is a gateway on a chip
  – Used to implement communication gateways for different networks

Page 23: In a nutshell

• Integrates DSPs, CPUs, RAM, and I/O channels on a single chip

• A programmable multi-service platform

• Handles 60 to 240 channels per chip

• An array of Callisto chips can fit in a small space
  – Power efficient
  – Handles a large number of channels

Page 24: Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected