1
Hardware Transactional Memory
Royi Maimon, Merav Havuv
27/5/2007
2
References
M. Herlihy and J. Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures”
C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, “Unbounded Transactional Memory”
L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, and J. D. Davis, “Transactional Memory Coherence and Consistency” (Jun 2004)
3
Today
What are transactions?
What is Hardware Transactional Memory?
Various implementations of HTM
4
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
5
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
6
Lock-free
A shared data structure is lock-free if its operations do not require mutual exclusion.
If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object.
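The definition above can be made concrete with C11 atomics. This is a sketch of ours (not from the referenced papers): a lock-free counter whose increment is a compare-and-swap retry loop rather than a critical section, so a preempted thread never blocks the others.

```c
#include <stdatomic.h>

/* A lock-free counter: increment is a compare-and-swap (CAS) retry loop
 * instead of a lock.  If one thread is preempted mid-operation, no other
 * thread is blocked; at worst a CAS fails and is retried. */
typedef struct { _Atomic unsigned value; } lf_counter;

unsigned lf_increment(lf_counter *c) {
    unsigned old = atomic_load(&c->value);
    /* on failure the CAS reloads 'old' with the current value, so we retry */
    while (!atomic_compare_exchange_weak(&c->value, &old, old + 1))
        ;
    return old + 1;
}
```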
7
Lock-free (cont)
Lock-free data structures avoid common problems associated with conventional locking techniques in highly concurrent systems:
– Priority inversion
– Convoying: occurs when a process holding a lock is descheduled, and other processes capable of running are then unable to make progress
– Deadlock
8
Priority inversion
Priority inversion occurs when a lower-priority process is preempted while holding a lock needed by higher-priority processes.
9
Deadlock
Deadlock – two or more processes are waiting indefinitely for an event that can be caused by only one of the waiting processes.
Let S and Q be two resources:

P0:          P1:
Lock(S)      Lock(Q)
Lock(Q)      Lock(S)
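One standard cure for the circular wait between P0 and P1 is a global lock-acquisition order. A minimal pthread sketch of ours (not from the slides), ordering by address:

```c
#include <pthread.h>

/* Deadlock avoidance by lock ordering: both locks are always acquired in a
 * single global order (here, by address), so the circular wait above cannot
 * form.  Comparing unrelated pointers is formally unspecified in C but is
 * common practice for this idiom. */
void lock_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a > b) { pthread_mutex_t *t = a; a = b; b = t; }
    pthread_mutex_lock(a);   /* always the lower-addressed lock first */
    pthread_mutex_lock(b);
}

void unlock_both(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a); /* release order does not matter */
    pthread_mutex_unlock(b);
}
```

P0 and P1 would both call `lock_both(&S, &Q)` (in any argument order) and can no longer deadlock on this pair.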
10
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
11
What is a transaction?
A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts
If a transaction commits, all the loads and stores appear to have executed atomically
If a transaction aborts, none of its stores take effect
Transaction operations aren't visible until they commit or abort
12
Transaction properties
A transaction satisfies the following properties:
– Serializability
– Atomicity
A simplified version of the traditional database ACID properties (Atomicity, Consistency, Isolation, and Durability)
13
Transactional Memory
A new multiprocessor architecture
The goal: implementing lock-free synchronization that is
– efficient
– easy to use
compared to conventional techniques based on mutual exclusion
Implemented by straightforward extensions to multiprocessor cache-coherence protocols.
14
An Example
Locks:
if (i < j) { a = i; b = j; }
else       { a = j; b = i; }
Lock(L[a]); Lock(L[b]);
Flow[i] = Flow[i] - X;
Flow[j] = Flow[j] + X;
Unlock(L[b]); Unlock(L[a]);

Transactional Memory:
StartTransaction;
Flow[i] = Flow[i] - X;
Flow[j] = Flow[j] + X;
EndTransaction;
15
Transactional Memory
Transactions execute in commit order
[Figure: a timeline of three transactions. Transaction A (ld 0xdddd, st 0xbeef) commits first. Transaction B (ld 0xdddd, ld 0xbbbb) commits without conflict. Transaction C (ld 0xbeef) overlaps A; when A's commit writes 0xbeef, C detects a violation and re-executes with the new data.]
16
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
17
Cache-Coherence Protocol
A protocol for managing the caches of a multiprocessor system, ensuring that:
– no data is lost
– no data is overwritten before it is transferred from a cache to the target memory
When multiprocessing, each processor may have its own memory cache that is separate from the shared memory
18
The Problem (Cache-Coherence)
The problem can be solved in either of two ways:
– directory-based
– snooping
19
Snoopy Cache
All caches watch the activity on (snoop) a global bus to determine whether they have a copy of the block of data that is requested on the bus.
20
Directory-based
The data being shared is placed in a common directory that maintains the coherence between caches.
The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache.
When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
21
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
22
How Does It Work?
The following primitive instructions for accessing memory are provided:
Load-transactional (LT): reads value of a shared memory location into a private register.
Load-transactional-exclusive (LTX): Like LT, but “hinting” that the location is likely to be modified.
Store-transactional (ST) tentatively writes a value from a private register to a shared memory location.
Commit (COMMIT): attempts to make the transaction's tentative changes permanent.
Abort (ABORT): discards the transaction's tentative changes.
Validate (VALIDATE): tests the current transaction status.
23
Some definitions
Read set: the set of locations read via LT by a transaction
Write set: the set of locations accessed via LTX or ST by a transaction
Data set (footprints): the union of the read and write sets.
A set of values in memory is inconsistent if it couldn’t have been produced by any serial execution of transactions
24
Intended Use
Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would:
1. use LT or LTX to read from a set of locations
2. use VALIDATE to check that the values read are consistent
3. use ST to modify a set of locations
4. use COMMIT to make the changes permanent
If either the VALIDATE or the COMMIT fails, the process returns to step (1).
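The four-step pattern can be sketched as a C retry loop. The primitives below are stand-in software models of LT/ST/VALIDATE/COMMIT, our assumption, not the hardware instructions; in this single-threaded toy, validation and commit always succeed, so only the shape of the loop is being shown.

```c
#include <stdbool.h>

/* Toy single-threaded model of the HTM primitives. */
static long memory[16];             /* "shared" memory           */
static bool active;                 /* a transaction is running  */

static long LT(long *addr)         { active = true; return *addr; }
static void ST(long *addr, long v) { *addr = v; }
static bool VALIDATE(void)         { return active; }
static bool COMMIT(void)           { active = false; return true; }

/* Move x units from cell i to cell j, following the slide's steps:
 * read, validate, tentatively write, commit -- restart on failure. */
void transfer(int i, int j, long x) {
    for (;;) {
        long a = LT(&memory[i]);        /* 1. read the locations      */
        long b = LT(&memory[j]);
        if (!VALIDATE()) continue;      /* 2. are the reads consistent? */
        ST(&memory[i], a - x);          /* 3. tentative writes        */
        ST(&memory[j], b + x);
        if (COMMIT()) break;            /* 4. make permanent, else retry */
    }
}
```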
25
Implementation
Transactional memory is implemented by modifying standard multiprocessor cache coherence protocols
We describe here how to extend “snoopy” cache protocol for a shared bus to support transactional memory
Our transactions are short-lived activities with relatively small data sets.
26
The basic idea
Any protocol capable of detecting accessibility conflicts can also detect transaction conflict at no extra cost
Once a transaction conflict is detected, it can be resolved in a variety of ways
27
Implementation
Each processor maintains two caches:
– a regular cache for non-transactional operations
– a transactional cache for transactional operations; it holds all the tentative writes, without propagating them to other processors or to main memory until commit
Why use two caches?
28
Cache line states
Each cache line (regular or transactional) has one of the usual coherence-protocol states.
The transactional cache extends these states with transactional tags: EMPTY (no data), NORMAL (committed data), XCOMMIT (discard on commit), and XABORT (discard on abort).
29
Cleanup
When the transactional cache needs space for a new entry, it searches for:
– an EMPTY entry
– if not found, a NORMAL entry
– finally, an XCOMMIT entry
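That search order can be written as a small victim-selection routine. The tag names follow the slides; the encoding and function are our assumption. XABORT entries are never chosen, since they hold the transaction's own tentative state.

```c
/* Victim selection for the transactional cache, in the slide's priority:
 * prefer an EMPTY entry, then a NORMAL one, then an XCOMMIT one.
 * Returns the victim index, or -1 if only XABORT entries remain. */
typedef enum { EMPTY, NORMAL, XCOMMIT, XABORT } tx_tag;

int pick_victim(const tx_tag *tags, int n) {
    int normal = -1, xcommit = -1;
    for (int i = 0; i < n; i++) {
        if (tags[i] == EMPTY)                  return i;      /* free slot */
        if (tags[i] == NORMAL  && normal  < 0) normal  = i;
        if (tags[i] == XCOMMIT && xcommit < 0) xcommit = i;
    }
    if (normal >= 0) return normal;     /* next best: committed data  */
    return xcommit;                     /* last resort, or -1         */
}
```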
30
Processor actions
Each processor maintains two flags:
– The transaction active (TACTIVE) flag: indicates whether a transaction is in progress
– The transaction status (TSTATUS) flag: indicates whether that transaction is active (True) or aborted (False)
Non-transactional operations behave exactly as in the original cache-coherence protocol
31
Example – LT operation:
1. Look for an XABORT entry. Found? Return its value.
2. Not found? Look for a NORMAL entry. Found? Change it to XABORT and allocate another XCOMMIT entry.
3. Not found (cache miss)? Ask to read the block from shared memory.
– On a successful read: create two entries, XABORT and XCOMMIT.
– On an unsuccessful read: abort the transaction: set TSTATUS = FALSE, drop all XABORT entries, and set all XCOMMIT entries to NORMAL.
32
Snoopy cache actions:
Both the regular cache and the transactional cache snoop on the bus.
A cache ignores any bus cycles for lines not in that cache.
The transactional cache's behavior:
– If TSTATUS = False, or if the operation isn't transactional, the cache acts just like the regular cache, but ignores entries with a state other than NORMAL
– On an LT by another CPU, if the state is VALID the cache returns the value; for all other transactional operations it returns BUSY
33
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
34
Simulation
We’ll see an example code for the producer/consumer algorithm using transactional memory architecture.
The simulation runs on both cache-coherence protocols: snoopy and directory-based.
The simulation uses 32 processors.
The simulation finishes when 2^16 operations have completed.
35
Part Of Producer/Consumer Code
typedef struct {
  Word deqs;                /* holds the head's index */
  Word enqs;                /* holds the tail's index */
  Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq(queue *q) {
  unsigned head, tail, result;
  unsigned backoff = BACKOFF_MIN;
  unsigned wait;
  while (1) {
    result = QUEUE_EMPTY;
    tail = LTX(&q->enqs);
    head = LTX(&q->deqs);
    if (head != tail) {                          /* queue not empty? */
      result = LT(&q->items[head % QUEUE_SIZE]);
      ST(&q->deqs, head + 1);                    /* advance counter */
    }
    if (COMMIT()) break;
    /* abort => back off before retrying */
    wait = random() % (1 << backoff);
    while (wait--);
    if (backoff < BACKOFF_MAX) backoff++;
  }
  return result;
}
36
The results: [performance graphs not reproduced]
37
So Far:
In both HTM and STM, transactions shouldn't touch many memory locations
There is a (small) bound on a transaction's footprint
In addition, there is a duration limit.
38
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
39
Unbounded Transactional Memory (UTM)
UTM – a newer proposal that supports transactions of arbitrary footprint and duration.
The UTM architecture allows:
– transactions as large as virtual memory
– transactions of unlimited duration
– transactions that can migrate between processors
UTM supports a semantics for nested transactions
In contrast to previous HTM implementations, UTM is optimized for transactions below a certain size but still operates correctly for larger transactions
40
The Goal of UTM
The primary goal:
– make concurrent programming easier
– reduce implementation overhead
Why do we want unbounded TM?
Neither programmers nor compilers can easily cope with an imposed hard limit on transaction size.
41
UTM architecture
The transaction log – a data structure that maintains bookkeeping information for a transaction
Why is it needed?
– Enables transactions to survive time-slice interrupts
– Enables process migration from one processor to another
42
Two new instructions
All the programmer must specify is where a transaction begins and ends
XBEGIN pc
– Begin a new transaction. The entry point to an abort handler is specified by pc.
– If the transaction must fail, roll back the processor and memory state to what it was when XBEGIN was executed, and jump to pc.
– We can think of an XBEGIN instruction as a conditional branch to the abort handler.
XEND
– End the current transaction. If XEND completes, the transaction is committed and appears atomic.
– Nested transactions are subsumed into the outer transaction.
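The "conditional branch to the abort handler" view can be modeled in software with setjmp/longjmp. This is a toy sketch of the control transfer only, our own construction: a real XBEGIN also rolls back register and memory state, which this model does not attempt.

```c
#include <setjmp.h>

static jmp_buf tx_checkpoint;

/* XBEGIN ~ setjmp: returns 0 when the transaction starts, nonzero when it
 * aborts and control comes back to the abort handler.  XABORT ~ longjmp. */
#define XBEGIN()  setjmp(tx_checkpoint)
#define XABORT()  longjmp(tx_checkpoint, 1)

/* A retry-style abort handler, like the slide's "XBEGIN L1 ... XEND". */
int run_with_retry(void) {
    volatile int attempts = 0;   /* volatile: survives the longjmp */
    if (XBEGIN()) {
        /* abort handler: simply fall through and retry */
    }
    attempts++;
    if (attempts < 3)
        XABORT();                /* simulate two conflicts before success */
    return attempts;             /* XEND: the transaction "commits" */
}
```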
43
Transaction Semantics
A:  L1: XBEGIN L1
        ADD R1, R1, R1
        ST 1000, R1
        XEND
B:  L2: XBEGIN L2
        ADD R1, R1, R1
        ST 2000, R1
        XEND
Two transactions:
– "A" has an abort handler at L1
– "B" has an abort handler at L2
Here, the handlers implement a very simplistic retry.
44
Register renaming
A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but no data is transmitted between them.
If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered.
This technique, which eliminates name dependences in registers, is called register renaming.
Register renaming can be done statically (by the compiler) or dynamically (by the hardware).
45
Rolling back processor state
After an XBEGIN instruction, we take a snapshot of the rename table
To keep track of busy registers, we maintain an S (saved) bit for each physical register, indicating which registers are part of the active transaction; the S bits are included with every rename-table snapshot
An active transaction’s abort handler address, nesting depth, and snapshot are part of its transactional state.
46
Memory State
UTM represents the set of active transactions with a single data structure held in system memory, the x-state (short for “transaction state”).
47
Xstate Implementation
The x-state contains a transaction log for each active transaction in the system.
Each log consists of:
– A commit record: maintains the transaction's status: pending, committed, or aborted
– A vector of log entries, each corresponding to a memory block that the transaction has read or written. An entry provides:
– a pointer to the block
– the block's old value (for rollback)
– a pointer to the commit record
– pointers that form a linked list of all entries, in all transaction logs, that refer to the same block (the reader list)
48
Xstate Implementation (Cont)
The final part of the x-state consists of, for each memory block:
– a log pointer
– a read-write (RW) bit
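The description above maps naturally onto C structs. The names here are ours, chosen only to mirror the slides' vocabulary; they are a sketch, not UTM's actual layout.

```c
#include <stddef.h>

/* Sketch of the x-state's building blocks (names are our assumption). */
typedef enum { PENDING, COMMITTED, ABORTED } tx_status;

typedef struct { tx_status status; } commit_record;

typedef struct log_entry {
    void             *block;       /* pointer to the memory block           */
    long              old_value;   /* saved for rollback on abort           */
    commit_record    *record;      /* this transaction's commit record      */
    struct log_entry *next_reader; /* reader list: entries (across all      */
                                   /* transaction logs) for the same block  */
} log_entry;

/* Per-block metadata kept by the x-state. */
typedef struct {
    log_entry *log_ptr;            /* owning log entry, if any              */
    char       rw;                 /* 'R' or 'W'                            */
} block_meta;
```

Committing a transaction then amounts to flipping its single commit record from PENDING to COMMITTED, which every entry in its log observes through the `record` pointer.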
49
X-state Data Structure
[Figure: two transaction logs, each with a PENDING commit record and log entries holding an old value, a block pointer, a reader list, and a commit-record pointer. Blocks in application memory carry a log pointer and an RW bit (R or W) referencing the owning log entries.]
50
More on x-state
When a processor references a block that is already part of a pending transaction, the system checks the RW bit and log pointer to determine the correct action:
– use the old value
– use the new value
– abort the transaction
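A simplified sketch of that decision, where the encoding and function are our assumption rather than UTM's actual logic:

```c
/* Decide what happens when a block that belongs to a pending transaction
 * is referenced (a simplification of the RW-bit / log-pointer check).
 * own   -- the block's log entry belongs to the referencing transaction
 * wrote -- the block's RW bit is W (a tentative new value exists)        */
typedef enum { USE_OLD, USE_NEW, CONFLICT } action;

action decide(int own, int wrote) {
    if (own)    return USE_NEW;  /* our own tentative write: see new value */
    if (!wrote) return USE_OLD;  /* only readers so far: value is safe     */
    return CONFLICT;             /* another writer: someone must abort     */
}
```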
51
Commit action
[Figure: the same x-state; committing atomically changes transaction log 1's commit record from PENDING to COMMITTED, making its tentative values in application memory permanent.]
52
Cleanup action
[Figure: after transaction log 1 reads COMMITTED, cleanup removes its log entries and clears the log pointers and RW bits of the blocks it touched.]
53
Abort action
[Figure: aborting changes transaction log 1's commit record from PENDING to ABORTED, and the old values saved in its log entries are restored to the affected memory blocks.]
54
Transactions Conflict
A conflict: when two or more pending transactions have accessed a block and at least one of the accesses is a write.
Performing a transactional load:
– check that the log pointer refers to an entry in the current transaction log, or that the RW bit is R
Performing a transactional store:
– check that the log pointer references no other transaction
In case of a conflict, some of the conflicting transactions are aborted.
– Which transaction should be aborted?
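The two checks can be written as predicates. Here `owner` and the encoding are our simplification of the log pointer and RW bit, not UTM's actual representation:

```c
#include <stdbool.h>

/* owner  -- id of the transaction whose log the block's log pointer
 *           references, or 0 if the block is unlogged
 * rw_bit -- 'R' or 'W' for the block                                   */
bool load_allowed(int self, int owner, char rw_bit) {
    /* ok if the entry is in our own log, or all accesses so far are reads */
    return owner == 0 || owner == self || rw_bit == 'R';
}

bool store_allowed(int self, int owner) {
    /* ok only if no *other* transaction's log references the block */
    return owner == 0 || owner == self;
}
```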
55
Caching
For small transactions that fit in cache, UTM, like earlier proposed HTM systems, uses the cache-coherence protocol to identify conflicts
For transactions too big to fit in cache, the transaction's x-state overflows into the ordinary memory hierarchy
– Most log entries don't need to be created
– A transaction log is created only when the transaction has run out of physical memory
56
UTM’s Goal
Support transactions that:
– run for an indefinite length of time
– migrate from one processor to another
– have footprints bigger than the physical memory
The main technique we propose is to treat the x-state as a systemwide data structure that uses global virtual addresses
57
Benefits and Limits of UTM
Limits:
– Complicated implementation
Benefits:
– Unlimited footprint
– Unlimited duration
– Migration possible
– Good performance in the common case (small transactions)
58
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
59
LTM: Visible, Large, Frequent, Scalable
"Large Transactional Memory"
– Not truly unbounded, but simple and cheap
Minimal architectural changes, high performance
– Small modifications to cache and processor core
– No changes to main memory or the cache-coherence protocol
– Can be pin-compatible with conventional processors
60
LTM's Restrictions:
– A transaction's footprint is limited to (nearly) the size of physical memory
– Duration must be less than a time slice
– Transactions cannot migrate between processors
With these restrictions, we can implement LTM by modifying only the cache and processor core
61
LTM vs UTM
Like UTM, LTM maintains data about pending transactions in the cache and detects conflicts using the cache coherency protocol
Unlike UTM, LTM does not treat the transaction as a data structure. Instead, it binds a transaction to a particular cache.
– Transactional data overflows from the cache into a hash table in main memory
LTM and UTM have similar semantics: XBEGIN and XEND instructions are the same
In LTM, the cache plays a major part…
62
Addition to Cache
LTM adds a bit (T) per cache line to indicate that the data has been accessed as part of a pending transaction.
An additional bit (O) is added per cache set to indicate that it has overflowed.
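A toy model of how a lookup consults the T and O bits and, when the set has overflowed, falls back to a hash table in main memory. The layout and names are our assumption, kept minimal to show only the O-bit fallback.

```c
#include <stdbool.h>

/* One two-way cache set with per-line T bits and a per-set O bit, plus a
 * tiny "overflow hash table" in memory (modeled as a linear array). */
typedef struct { bool t; long tag, data; bool valid; } line;
typedef struct { bool o; line way[2]; } cache_set;
typedef struct { long key, data; bool used; } overflow_slot;

/* Look up 'tag': hit in the set, or -- only if the set has overflowed --
 * fall back to the overflow table.  Fills *out and returns true on success. */
bool ltm_lookup(cache_set *s, overflow_slot *tbl, int n, long tag, long *out) {
    for (int w = 0; w < 2; w++)
        if (s->way[w].valid && s->way[w].tag == tag) {
            *out = s->way[w].data;
            return true;                  /* ordinary cache hit           */
        }
    if (!s->o)
        return false;                     /* no overflow: a real miss     */
    for (int i = 0; i < n; i++)           /* O set: search the hash table */
        if (tbl[i].used && tbl[i].key == tag) {
            *out = tbl[i].data;
            return true;
        }
    return false;
}
```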
63
Cache overflow mechanism
The running example:
  ST 1000, 55
  XBEGIN L1
  LD R1, 1000
  ST 2000, 66
  ST 3000, 77
  LD R1, 1000
  XEND
[Figure sequence: each cache line carries a T bit, each set an O bit; an overflow hash table lives in main memory.
– ST 1000, 55 (before the transaction) fills line (1000, 55), T clear.
– Recording reads: LD R1, 1000 sets T on line 1000.
– Recording writes: ST 2000, 66 allocates line (2000, 66) with T set.
– Spilling: ST 3000, 77 needs the space held by (1000, 55); the set's O bit is set, (1000, 55) is spilled to the overflow hash table, and (3000, 77) takes its place with T set.
– Miss handling: the second LD R1, 1000 misses; because O is set, the hash table is searched and (1000, 55) is swapped back into the cache.]
69
LTM - Summary
Transactions as large as physical memory
Scalable overflow and commit
Easy to implement!
Low overhead
70
Outline
Lock-Free
Hardware Transactional Memory (HTM)
– Transactions
– Cache coherence protocol
– General Implementation
– Simulation
UTM
LTM
TCC (briefly)
Conclusions
71
Transactional Memory Coherence and Consistency (TCC)
Hammond, Wong, Chen, Carlstrom, Davis (Jun 2004).“Transactional Memory Coherence and Consistency”
All transactions, all the time!
Code is partitioned into transactions by the programmer or by tools
– Possibly at run-time, for legacy code!
All writes are buffered in caches; CPUs arbitrate system-wide for which one gets to commit
Updates are broadcast to all CPUs. CPUs detect conflicts with their own transactions and abort
72
TCC Implementation
[Figure: a TCC node. The CPU core issues loads and stores to its local cache hierarchy, whose lines are tagged with read (r) and modified (m) bits; stores also go into a write buffer. At commit, commit control broadcasts the write buffer over a shared bus or network, which the other nodes snoop.]
73
Conclusions
Unbounded, scalable, and efficient transactional memory systems can be built.
– They support large, frequent, and concurrent transactions
– They allow programmers to (finally!) use our parallel systems!
Three architectures:
– LTM: easy to realize, almost unbounded
– UTM: truly unbounded
– TCC: high performance
74
THE END…
Royi Maimon
Merav Havuv