![Page 1: Syed Ali Raza Jafri et al.1 LiteTM: Reducing HTM State Overhead T. N. Vijaykumar with Ali Jafri & Mithuna Thottethodi in HPCA ‘10](https://reader036.vdocuments.us/reader036/viewer/2022062304/56649d445503460f94a20dd5/html5/thumbnails/1.jpg)
Syed Ali Raza Jafri et al. 1
LiteTM: Reducing HTM State Overhead
T. N. Vijaykumar
with Ali Jafri & Mithuna Thottethodi in HPCA ‘10
Transactional Memory (TM)
Multicores require parallel programming
• Significantly harder than sequential programming
Locks may cause incorrect behavior
• Deadlocks/livelocks and data races
TM appears to make correct programming easier
TM implementations can be efficient
Transactions may provide better programmability and performance than locks
Previous Work
Hardware, software and hybrid TMs
• HTMs piggyback conflict detection on coherence
• STMs and HybridTMs detect conflicts in software
Recent HTMs support many features
• Transaction time and footprint not limited by hardware
• Can exceed caches and even be swapped out of memory
• Transaction-OS interactions not restricted
• In-flight context switches, page/thread migrations
• Modest hardware complexity
• No coherence protocol changes (very big deal)
Supporting these features incurs high hardware cost
HTM Cost: State overhead
HTMs need large state throughout memory hierarchy
• Numerous state bits in L1 and L2
• Hijack memory ECC → weaker protection
» E.g., 25% fewer SECDED bits in TokenTM
Supporting all features → large state in caches + weaker memory ECC → high barrier for adoption
[Figure: per-block state bits at three levels of the memory hierarchy, labeled 19 bits, 16 bits, 16 bits]
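The "25% fewer SECDED bits" figure can be checked with a little arithmetic. This is a sketch under assumptions the slide does not state: 64-byte memory blocks protected by (72,64) SECDED, i.e. 8 check bits per 64-bit word. The function names are hypothetical.

```python
# Assumptions (not on the slide): 64-byte blocks, (72,64) SECDED per 64-bit word.
BLOCK_BYTES = 64
WORD_BITS = 64
CHECK_BITS_PER_WORD = 8

def ecc_bits_per_block(block_bytes=BLOCK_BYTES):
    """Total SECDED check bits protecting one memory block."""
    words = block_bytes * 8 // WORD_BITS
    return words * CHECK_BITS_PER_WORD

def hijacked_fraction(state_bits, block_bytes=BLOCK_BYTES):
    """Fraction of a block's check bits taken over by transactional state."""
    return state_bits / ecc_bits_per_block(block_bytes)

print(ecc_bits_per_block())   # 64
print(hijacked_fraction(16))  # 0.25    (16 state bits per block)
print(hijacked_fraction(2))   # 0.03125 (2 state bits per block)
```

Under these assumptions, 16 state bits consume a quarter of the check bits, while 2 bits consume about 3%.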
HTM State Overhead
Thread Id/sharer-count + state bits per block
Thread Id to determine conflictors or own blocks
Sharer count to track multiple readers
• Ideally, need all ids but too much state → make do with counts
Avoid coherence changes → Extra bits (beyond R, W)
• E.g., TokenTM uses 5 bits instead of usual R, W
Thread Ids + sharer-counts in hardware → Detect conflicts + identify conflictors mostly in hardware
LiteTM: Key Observations
Most state information not needed in common case → Eliminate thread Ids and sharer-counts
• Intended for conflicts on L1-evicted blocks, but
• Conflict usually on L1-resident blocks
• Coherence trivially identifies L1-resident conflictors & count
Merge R,W into T
• Coherence’s “Modified” state can approximate W
• False positives possible but rare
Uncommon case: scan transactional log
LiteTM detects conflicts in h/w (all cases, like all HTMs); identifies conflictors: h/w (common) & s/w (uncommon)
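The common/uncommon split can be illustrated with a small software model. This is only a sketch: the `Thread` record, its fields and `identify_conflictors` are hypothetical names, and detection itself is assumed to have already happened in hardware.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    tid: int
    l1_tbit: set = field(default_factory=set)  # blocks with the T bit set in this thread's L1
    log: list = field(default_factory=list)    # transactional log: every block touched

def identify_conflictors(block, requester, threads):
    """Return (conflictor ids, path used) for an already-detected conflict."""
    others = [t for t in threads if t.tid != requester]
    # Common case: the conflicting copy is still L1-resident, so coherence finds it.
    resident = [t.tid for t in others if block in t.l1_tbit]
    if resident:
        return resident, "coherence"
    # Uncommon case: the block left the L1, so software scans the logs.
    walked = [t.tid for t in others if block in t.log]
    return walked, "log walk"

threads = [Thread(0), Thread(1, l1_tbit={0x40}, log=[0x40, 0x80])]
print(identify_conflictors(0x40, 0, threads))  # ([1], 'coherence')
print(identify_conflictors(0x80, 0, threads))  # ([1], 'log walk')
```

Block 0x80 is in thread 1's log but no longer in its L1, so only the log walk can name the conflictor.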
LiteTM: Contributions (1)
LiteTM reduces transactional state
Average (worst) case 4% (10%) performance loss in STAMP (8 cores)
Key reduction is removal of thread id/count (W approx is secondary)
[Figure: per-block state bits at three levels of the memory hierarchy: 2, 2, 2 bits versus 19, 16, 16 bits]
LiteTM: Contributions (2)
LiteTM compensates for the loss of
• Thread Id
• Read-sharer count
• Separate R,W bits
via novel mechanisms
• Self-log walks
• Lazy clearing of L1-spilled transaction state
• W approximation
• All-log walks (a la TokenTM)
Smaller state in caches & fewer hijacked memory ECC bits → significantly lower barrier for adoption
LiteTM in the HTM-STM spectrum
LiteTM improves HTM by pushing more into software
• i.e., by moving HTMs closer to STMs!
LiteTM differs from HybridTMs in h/w-s/w split
• Hybrids: conflict detection in h/w if fits in cache; otherwise in s/w
• LiteTM: conflict detection always in h/w; resolution in s/w
Key point: Conflict detection
• Needed for all accesses → must be fast
• Is a global operation → usually hard to do fast in software
• Closely matches coherence, which is fast → easy to piggyback
• Hence, always in hardware in LiteTM (like all HTMs)
Outline
Introduction
LiteTM transactional state
Lazy clearing
Experimental Results
Conclusion
Transactional State in L1
TokenTM (~16 bits)
R, W – transactionally read/written
R',W' + id – read/written and moved to another cache upon coherence movement
• No change in coherence
• Identifies conflictor
R+ + count – fusion of multiple read copies
LiteTM (2 bits)
T + clean/modified – transactionally read/written
T' – T moved to another cache
• No id → all-log walk if conflict
Upon conflict, abort writer and all but one reader, or all readers
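A minimal model of the 2-bit L1 state shows how "T + clean/modified" stands in for separate R and W bits, and where the false positives come from. The `L1Line` class and its method names are hypothetical, and the coherence handling is deliberately simplified.

```python
from dataclasses import dataclass

@dataclass
class L1Line:
    coherence: str = "I"    # one of "M", "E", "S", "I"
    t: bool = False         # accessed by the local transaction
    t_prime: bool = False   # transactional state moved here from another cache

    def tx_read(self):
        if self.coherence == "I":
            self.coherence = "S"   # simplified fill; the real protocol is unchanged
        self.t = True

    def tx_write(self):
        self.coherence = "M"
        self.t = True

    def treated_as_written(self):
        # W is not stored; it is approximated by T plus coherence "Modified".
        return self.t and self.coherence == "M"

clean = L1Line()
clean.tx_read()
print(clean.treated_as_written())   # False

dirty = L1Line(coherence="M")       # modified before the transaction began
dirty.tx_read()                     # the transaction only reads it
print(dirty.treated_as_written())   # True: a false positive of the W approximation
```

The second case is the kind of block on which a remote transactional reader would be treated as conflicting with a writer even though no transactional write occurred.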
Transactional State in L2 & Memory
TokenTM (~16 bits)
States in L2 & memory
• Idle (transactionally clean)
• Single reader + id
• Single writer + id
• Multiple readers + count
Conflict on multiple readers → all-log walks
LiteTM (2 bits)
States in L2 & memory
• Idle
• Single reader
• Single writer
• Multiple readers
Conflict in any state → all-log walks
No id → self-log walk
No count → no decrement of count → Lazy Clearing of ‘Multiple readers’
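The four LiteTM states fit in 2 bits. The sketch below encodes them and gives one plausible reading of the transitions; the transition rules, `MetaState` and `on_tx_access` are illustrative assumptions rather than the protocol from the paper.

```python
from enum import IntEnum

class MetaState(IntEnum):          # four states, so 2 bits suffice
    IDLE = 0b00
    SINGLE_READER = 0b01
    SINGLE_WRITER = 0b10
    MULTIPLE_READERS = 0b11

def on_tx_access(state, is_write, own_block):
    """Return (next_state, action) for a transactional access seen at L2/memory.

    own_block is the result of a self-log walk: with no thread id stored,
    the requester checks its own log to learn whether the block is its own.
    """
    single = (MetaState.SINGLE_READER, MetaState.SINGLE_WRITER)
    if state is MetaState.IDLE:
        nxt = MetaState.SINGLE_WRITER if is_write else MetaState.SINGLE_READER
        return nxt, "proceed"
    if own_block and state in single:
        nxt = MetaState.SINGLE_WRITER if is_write else state
        return nxt, "proceed"
    if not is_write and state in (MetaState.SINGLE_READER, MetaState.MULTIPLE_READERS):
        return MetaState.MULTIPLE_READERS, "proceed"   # read sharing, no count kept
    return state, "all-log walk"                       # conflict: identify in software

print(on_tx_access(MetaState.SINGLE_READER, False, False))
print(on_tx_access(MetaState.SINGLE_WRITER, False, False))
```

The first call models a second reader joining (the state becomes `MULTIPLE_READERS`); the second models a read hitting another thread's written block, which triggers an all-log walk.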
Lazy Clearing
‘Multiple readers’ conflict/commit leaves state behind
• No count → don’t know who is last reader → cannot clear
• Lazy clear on next conflict via all-log walks
All-log walk check and state clearing should be atomic
• Hardware address buffers + software support
Details in HPCA ‘10 paper
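Lazy clearing can be pictured as follows. This is a software-only sketch: a Python lock stands in for the hardware address buffers plus software support that make the check and the clear atomic, and `lazy_clear`, `meta` and `live_logs` are hypothetical names.

```python
import threading

_walk_lock = threading.Lock()   # stand-in for the atomicity mechanism (assumption)

def lazy_clear(block, meta, live_logs):
    """All-log walk triggered by a conflict on a 'multiple readers' block.

    meta:      dict mapping block -> state string
    live_logs: dict mapping thread id -> set of blocks logged by that thread's
               in-flight transaction (the requester is assumed excluded)
    Returns the ids of real conflictors; clears stale state if there are none.
    """
    with _walk_lock:            # check and clear must appear atomic
        conflictors = sorted(tid for tid, log in live_logs.items() if block in log)
        if not conflictors and meta.get(block) == "multiple readers":
            meta[block] = "idle"   # state left behind by finished readers
        return conflictors

meta = {0x40: "multiple readers", 0x80: "multiple readers"}
live_logs = {1: set(), 2: {0x80}}
print(lazy_clear(0x40, meta, live_logs), meta[0x40])  # [] idle
print(lazy_clear(0x80, meta, live_logs), meta[0x80])  # [2] multiple readers
```

Block 0x40 had only stale state, so the walk clears it; block 0x80 still has a live reader, so the conflict is real and the state stays.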
Methodology
GEMS HTM simulator on top of Simics
8-core, 1 GHz in-order issue processor
Typical memory hierarchy parameters
All STAMP benchmarks
Multiple runs for statistical significance
Transactional state bits: TokenTM 16 vs. LiteTM 2
• Also show LiteTM-1bit: read sharing triggers log walks
Hybrid-bound: Emulate spilled transactions in hybrid TMs
• 1 extra hash-table write per first transactional access
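The Hybrid-bound cost model can be sketched as one extra hash-table write the first time a transaction touches each block. The slide only gives that cost rule; the `HybridBoundTx` class and what it stores in the table are assumptions for illustration.

```python
class HybridBoundTx:
    """Charges one extra hash-table write per first transactional access."""

    def __init__(self, tid):
        self.tid = tid
        self.table = {}         # hypothetical software ownership table
        self.seen = set()       # blocks already touched by this transaction
        self.extra_writes = 0

    def access(self, block):
        if block not in self.seen:      # first access to this block
            self.seen.add(block)
            self.table[block] = self.tid
            self.extra_writes += 1      # the emulated software overhead

tx = HybridBoundTx(tid=0)
for block in [1, 2, 1, 3, 2]:
    tx.access(block)
print(tx.extra_writes)   # 3
```

Repeated accesses to blocks 1 and 2 add no further cost, so five accesses incur three extra writes.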
LiteTM Performance
• Mostly 1-3% loss; contentious, long transactions → 10% loss
• Labyrinth’s contention hurts base optimistic TM → small loss
• Lack of distinction between read-sharing and conflict degrades LiteTM-1bit; conflict detection in s/w degrades Hybrid-bound
LiteTM Aborts & Log Walks
| Benchmarks | % false aborts due to W approx. | Self-log walks per commit | All-log walks per commit |
| --- | --- | --- | --- |
| ssca2, km-low, km-high, intruder | 0 | ~0 | ~0 |
| genome | 2.5 | 0.02 | ~0 |
| vac-low | 0 | ~0 | ~0 |
| vac-high | 0 | 0.02 | 0.01 |
| yada | 0.9 | 0.3 | ~0 |
| bayes | 0.3 | 3.9 | 0.08 |
| labyrinth | 0.1 | 58 | 0.94 |
Overhead increases with contention yet still low
Conclusion
Current HTMs support many key features
Incur high transactional state overhead
• Many state bits in all caches & hijacked memory ECC bits
• High barrier for adoption
LiteTM significantly reduces transactional state
• Most state information not needed in common case
• Employs novel mechanisms for uncommon case
LiteTM reduces TokenTM’s 16 bits/block to 2 bits
• Average (worst) case 4% (10%) performance loss in STAMP
LiteTM significantly lowers the barrier for adoption
A couple points on Cliff’s talk
Main problem: Conflicts due to auxiliary data
This problem exists for all optimistic TMs
• HTMs, STMs, and hybrids
Options
• Learn from past conflicts to skew the schedule (prevent conflict)
• Repair transactional state - Martin et al. ISCA ’10 (cure conflict)
• Instead of learning, compiler can provide hints to aid prevention
These problems don’t seem big enough to give up on HTMs
Questions?
Is TokenTM overhead really high?
16 bits/L1-block is a lot in absolute terms
16 bits in memory may be hijacked from ECC
• 25% fewer SECDED bits → weaker protection
Or, 16 bits may be placed in main memory
• Increases the bandwidth requirements
Narrow Topic?
LiteTM separates
• Conflict detection (hardware)
• From conflictor identification (software)
Fundamental and can be applied to other unbounded HTMs
Focus on TokenTM
TokenTM is the only design which supports all features mentioned previously
Hence we attempt to improve TokenTM
Our design is applicable to other HTMs as well
• OneTM-concurrent's ids and
• VTM's ids (pointers to XSW in XADT)
• And counts (#entries in XADT)
What about UFO?
UFO is not a TM
Supports strong atomicity in Hybrids/STMs
We compare against upper bound on hybrids
Read-sharing Support
LiteTM allows read sharing
Multiple L1s can have T bits set
L2 has a ‘Multiple readers’ sharing state
Disallows read sharing if T bit + Modified
• Uncommon
Should logs be locked to avoid racing conflicts?
Recall: Conflicting access faults and retries
Suppose thread F is checking thread N’s log
• Looking for block X
N makes racing access to X
N takes away coherence permissions from F
After log walk F will RETRY the access to X
Coherence action will cause F to fault again
Back stop available to prevent livelock
Context switches handled similarly
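The fault-and-retry argument can be written as a loop: F never needs a lock on N's log, because any racing access by N simply makes F's retried access fault again. This is a sketch only; the callbacks and the retry bound are hypothetical, and the slide does not say how the back stop is implemented.

```python
def access_with_retry(try_access, walk_logs, max_retries=8):
    """Retry a faulting access; return the number of faults before success.

    try_access: callable returning True if the access completed, False if it faulted
    walk_logs:  callable run by the fault handler to identify/resolve conflictors
    max_retries stands in for the livelock back stop (assumed mechanism).
    """
    for faults in range(max_retries):
        if try_access():
            return faults
        walk_logs()   # a racing access during this walk just causes another fault
    raise RuntimeError("back stop: retries exhausted")

outcomes = iter([False, False, True])   # two racing conflicts, then success
walks = []
print(access_with_retry(lambda: next(outcomes), lambda: walks.append("walk")))  # 2
print(len(walks))                                                               # 2
```

Each fault costs one more log walk, and correctness never depends on the log being stable while it is read.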
Coherence Actions are Completed
Invalidations of a reader
• T' bit sent to writer
• T' states that there exists a token
Read sharing of writer
• T' bit sent to reader
STM Acceleration Easier?
STM-acceleration provides weaker semantics
• Requires at least one bit per memory block
• UFO-like mechanisms
LiteTM only 2 bits per block
• No changes to coherence protocol
• Performs better than STM-accelerated approach
Shown by our hybrid-upper bound comparison
Smallest Input Dataset
8-core setup
Suitable scaling for all benchmarks
Reasonable simulation times
• Statistical variation
Hybrid better than signature HTMs
Signature saturation causes serialized execution
TokenTM and LiteTM use per-block metastate,
Support for SMT Cores
LiteTM can support multithreaded cores
• Replicates the T bits per hardware context
• Single T' bit
T' bit → Remote transactional access
Bits everywhere are hard?
No, adding nacks/delays in coherence is hard
• Leads to deadlocks/livelocks
Adding bits is quite easy
Validity of Hybrid Bound
Upper bound on Hybrids which retry spilled TX in s/w
Does not apply to other self-proclaimed hybrids
• E.g. SigTM
SigTM uses signatures for conflict detection
Signature-based TMs have other issues
• Signature saturation causes serialization
TokenTM vs LiteTM: Transactional state for Conflict Detection
[Figure: side-by-side panels labeled TokenTM and LiteTM]
Sensitivity to Busy Buffers
No buffers → all L1 misses wait till lazy clearing → significant loss for high contention
[Figure: bar chart of performance relative to TokenTM (y-axis 0 to 1) for ssca2, km-low, km-high, intruder, genome, vac-low, vac-high, yada, bayes and labyrinth; series: 4 buffers, no buffers]
Hybrid Upper-bound
Upper bound for any hybrid that retries transactions in an STM (with software conflict detection) after a failure in HTM mode.
Transactional State Overheads
Thread Id/sharer-count + state bits per block
Avoid coherence changes → Extra bits (beyond R, W)
• Previously, conflicting access is nacked (cannot complete)
• Such nacks are invasive changes to coherence (cause deadlocks)
• TokenTM allows coherence to complete even on a conflict
» Access itself does not complete & raises an exception
• Needs transactional state to move with blocks under coherence
• Tracks non-local transactional state
• E.g., TokenTM’s R', W', R+
Thread Ids + sharer-counts in hardware → Detect conflicts + identify conflictors mostly in hardware