Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm
Locality
• What do they mean by locality?
– locality of reference?
– temporal locality?
– spatial locality?
Temporal Locality
• Recently accessed data and instructions are likely to be accessed in the near future
Spatial Locality
• Data and instructions close to recently accessed data and instructions are likely to be accessed in the near future
Locality of Reference
• If we have good locality of reference, is that a good thing for multiprocessors?
Locality in Multiprocessors
• Good performance depends on data being local to a CPU
– Each CPU uses data from its own cache
• cache hit rate is high
• each CPU has good locality of reference
– Once data is brought into cache, it stays there
• cache contents not invalidated by other CPUs
• different CPUs have different locality of reference
Example: Shared Counter
[Diagram sequence: a counter lives in shared memory and each CPU caches a copy. Both CPUs read 0, then each increments its cached copy to 1; reading a locally cached value is OK, but each write invalidates the other CPU's copy, so the counter line bounces between caches as the count reaches 2.]
Performance
Problems
• Counter bounces between CPU caches
– cache miss rate is high
• Why not give each CPU its own piece of the counter to increment?
– take advantage of the commutativity of addition
– counter updates can be local
– reads require adding up all the pieces
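The per-CPU split can be sketched in C. This is a minimal sketch, not the paper's code: `NCPU` and the explicit `cpu` parameter are assumptions, and a real kernel would obtain the caller's processor id itself.

```c
#include <assert.h>

#define NCPU 4  /* assumed number of processors */

/* One counter slot per CPU; each CPU writes only its own slot. */
static long counter[NCPU];

/* Increments are local to the calling CPU's slot, so writes never contend. */
void counter_inc(int cpu) {
    counter[cpu]++;
}

/* A read must visit every slot; addition commutes, so the sum is exact. */
long counter_read(void) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += counter[i];
    return sum;
}
```

Note the asymmetry this buys: updates (the common case) touch one slot, while reads (assumed rare) pay for a scan of all slots.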
Array-based Counter
[Diagram sequence: the counter is an array with one element per CPU. Each CPU increments its own element (1, 1); a read adds all the elements together (1 + 1 = 2).]
Performance
• Performs no better than the shared counter!
Problem: False Sharing
• Caches operate at the granularity of cache lines
– if two pieces of the counter are in the same cache line, they cannot be cached (for writing) on more than one CPU at a time
False Sharing
[Diagram sequence: both counter elements (0,0) sit in the same cache line, so each CPU's update (1,0 then 1,1) invalidates the other CPU's copy of the whole line. The line still bounces between caches even though the CPUs write different elements.]
Solution?
• Spread the counter components out in memory: pad the array
Padded Array
[Diagram sequence: the counter elements are padded out to separate cache lines, so each CPU's increment stays within its own line. Updates are independent of each other.]
Performance
• Works better
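The padding fix amounts to giving each slot a full cache line of its own. A minimal sketch, assuming a 64-byte line size (`CACHE_LINE` and `NCPU` are assumptions; real code would query the hardware or use an alignment attribute):

```c
#include <assert.h>

#define NCPU 4            /* assumed processor count */
#define CACHE_LINE 64     /* assumed cache-line size in bytes */

/* Pad each slot to a full cache line so no two CPUs' slots share one. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counter[NCPU];

/* Same protocol as before: local increments, read-by-summation. */
void counter_inc(int cpu) {
    counter[cpu].value++;
}

long counter_read(void) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += counter[i].value;
    return sum;
}
```

The explicit `pad` member is the portable trick; compilers also offer alignment attributes (e.g. C11 `_Alignas`) that express the same intent.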
Locality in OS
• Serious performance impact
• Difficult to retrofit
• Tornado
– ground-up design
– object-oriented approach (natural locality)
Tornado
• Object-oriented approach
• Clustered objects
• Protected procedure call
• Semi-automatic garbage collection
– simplifies locking protocols
Object Oriented Structure
• Each resource is represented by an object
• Requests to virtual resources are handled independently
– no shared data structure access
– no shared locks
Why Object Oriented?
[Diagram: a global process table with entries for Process 1, Process 2, …]
Coarse-grain locking:
[Diagram sequence: a single lock protects the whole process table. A request for Process 1 takes the lock; a request for Process 2 must wait for it, even though the two requests touch different entries.]
Object Oriented Approach
class ProcessTableEntry {
    data
    lock
    code
}
Object Oriented Approach
Fine-grain, instance locking:
[Diagram: each process table entry carries its own lock, so requests for Process 1 and Process 2 each lock only their own entry and proceed in parallel.]
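Fine-grain, per-instance locking can be sketched in C by moving the lock into the entry itself. This is a sketch under assumed names (`NPROC`, the `pid`/`state` fields, the spinlock helpers); the deck's actual system encapsulates the lock behind object methods.

```c
#include <assert.h>
#include <stdatomic.h>

/* Each process-table entry carries its own lock, so operations on
 * different processes never contend for a table-wide lock. */
struct process_entry {
    atomic_flag lock;   /* per-entry spinlock */
    int pid;            /* placeholder fields */
    int state;
};

#define NPROC 64
static struct process_entry ptable[NPROC];

void ptable_init(void) {
    for (int i = 0; i < NPROC; i++)
        atomic_flag_clear(&ptable[i].lock);
}

static void entry_lock(struct process_entry *e) {
    while (atomic_flag_test_and_set(&e->lock))
        ;  /* spin until the flag was previously clear */
}

static void entry_unlock(struct process_entry *e) {
    atomic_flag_clear(&e->lock);
}

/* Update one entry under only that entry's lock. */
void set_state(int i, int state) {
    entry_lock(&ptable[i]);
    ptable[i].state = state;
    entry_unlock(&ptable[i]);
}
```

The design point: the lock lives in the same object (and, ideally, the same cache line) as the data it protects, so taking it costs no extra remote traffic.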
Clustered Objects
• Problem: how to improve locality for widely shared objects?
• A single logical object can be composed of multiple local representatives
– the reps coordinate with each other to manage the object's state
– they share the object's reference
Clustered Objects
[Diagram: a single clustered object reference resolves, per processor, to that processor's local rep.]
Clustered Objects : Implementation
• A translation table per processor
– located at the same virtual address on every processor
– each entry points to the local rep
• A clustered object reference is just a pointer into the table
– reps are created on demand, when first accessed
– a global miss-handling object installs them
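The translation-table mechanism can be sketched in C. All names here (`transtable`, `resolve`, `miss_handler`, the `rep_t` contents) are illustrative assumptions; Tornado does this with per-processor memory mapped at the same virtual address, which a flat 2-D array only simulates.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 4
#define NOBJ 16

/* A rep is the per-processor representative of a clustered object. */
typedef struct rep { int cpu; long value; } rep_t;

/* One translation table per processor; a clustered-object reference is
 * just an index into it.  NULL means "no rep installed here yet". */
static rep_t *transtable[NCPU][NOBJ];
static rep_t  rep_pool[NCPU * NOBJ];
static int    rep_next;

/* Global miss handler: install a fresh rep for (cpu, ref) on first use. */
static rep_t *miss_handler(int cpu, int ref) {
    rep_t *r = &rep_pool[rep_next++];
    r->cpu = cpu;
    r->value = 0;
    transtable[cpu][ref] = r;
    return r;
}

/* Translate a clustered-object reference to this processor's rep. */
rep_t *resolve(int cpu, int ref) {
    rep_t *r = transtable[cpu][ref];
    return r ? r : miss_handler(cpu, ref);
}
```

The same reference value thus yields a different, purely local rep on each processor, and the miss path runs only once per (processor, object) pair.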
Clustered Objects
• Degree of clustering
• Management of state
– partitioning
– distribution
– replication (how to maintain consistency?)
• Coordination between reps?
– shared memory
– remote PPCs
Counter: Clustered Object
[Diagram sequence: the clustered-object counter has one rep per CPU, reached through the shared object reference. Each CPU increments its own rep (1, then 2), independent of the other. A read invokes every rep and adds their values (1 + 1).]
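Put together, the clustered counter looks much like the padded array, but packaged behind one object. A minimal sketch with assumed names (`cc_inc`, `cc_read`, `NCPU`); the real system would reach the rep through the per-processor translation table rather than an index:

```c
#include <assert.h>

#define NCPU 4  /* assumed processor count */

/* One rep per processor; a rep holds only its local share of the count. */
struct counter_rep { long value; };

struct clustered_counter {
    struct counter_rep rep[NCPU];  /* stand-in for per-CPU reps */
};

/* Increment goes to the caller's own rep: no sharing, no invalidations. */
void cc_inc(struct clustered_counter *c, int cpu) {
    c->rep[cpu].value++;
}

/* A read must consult every rep and combine their values. */
long cc_read(const struct clustered_counter *c) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += c->rep[i].value;
    return sum;
}
```

The point of the packaging is that callers see one ordinary object interface; the degree of clustering and the rep-coordination strategy are hidden implementation choices.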
Synchronization
• Two distinct locking issues
– Locking
• mutually exclusive access to objects
– Existence guarantees
• making sure an object is not freed while still in use
Locking in Tornado
• Encapsulate locking within individual objects
• Uses clustered objects to limit contention
• Uses spin-then-block locks
Existence Guarantees: the problem
• Use a lock to protect all references to an object?
– eliminates races where one thread is accessing the object while another is deallocating it
– but results in a complex global hierarchy of locks
• Tornado: semi-automatic garbage collection
– a clustered object reference can be used at any time
– eliminates the need for reference locks
Existence Guarantees in Tornado
• Semi-automatic garbage collection:
– the programmer decides what to free; the system decides when to free it
– guarantees that object references can always be used safely
– eliminates the need for reference locks
How does it work?
• Programmer removes all persistent references
– normal cleanup, done manually
• System tracks all temporary references
– the kernel is event driven, so temporary references live only for the duration of an event
– maintain an activity counter for each processor
– delete the object only when every activity counter reaches zero
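The activity-counter idea can be sketched in C. This is a deliberately simplified sketch (names assumed; Tornado's actual scheme is more refined, distinguishing generations of events so that long-running events on one processor do not delay reclamation forever):

```c
#include <assert.h>

#define NCPU 4  /* assumed processor count */

/* In an event-driven kernel, temporary references exist only while an
 * event is running.  Count in-flight events per processor. */
static long activity[NCPU];

void event_begin(int cpu) { activity[cpu]++; }
void event_end(int cpu)   { activity[cpu]--; }

/* Once every processor's counter has drained to zero, no temporary
 * reference can remain, so a logically-deleted object is safe to free. */
int quiescent(void) {
    for (int i = 0; i < NCPU; i++)
        if (activity[i] != 0)
            return 0;
    return 1;
}
```

Readers touch only their own processor's counter, so this existence guarantee itself preserves locality, which is exactly why it replaces reference locks.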
Performance Scalability
Conclusion
• Object-oriented approach and clustered objects exploit locality to improve concurrency
• OO design has some overhead, but it is low compared to the performance advantages
• Tornado scales extremely well and achieves high performance on shared-memory multiprocessors