EazyHTM: Eager-Lazy Hardware
Transactional Memory
Saša Tomić, Cristian Perfumo, Chinmay Kulkarni,
Adrià Armejach, Adrián Cristal, Osman Unsal,
Tim Harris, Mateo Valero
Barcelona Supercomputing Center, UPC
BITS Pilani
Microsoft Research Cambridge
Why Transactional Memory?
• Lock-based parallel programming has problems
– Deadlocks, races, complexity, performance, …
• Transactional Memory (TM) to the rescue
– Optimistic concurrency control mechanism
– Easy to use
– Deadlock free
– Supports composability
– Protects data in critical sections
• Hardware-TM (HTM), Software-TM (STM) and hybrid
HTM terminology
• Atomic section/transaction: a group of instructions that appears to take effect instantaneously
• Version management: where speculative values are stored
– in-place, logging the original value, or
– buffered in private storage, published on commit
• Conflict: one TX writes a location that another TX reads
– Detection: the action of checking for conflicts
– Resolution: the action performed to resolve a conflict (e.g., abort or stalling the execution)
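The two version-management options above can be sketched in software. This is only an illustrative model, not the hardware mechanism: the class names `EagerVersioning` and `LazyVersioning` and the dictionary standing in for memory are assumptions made for the example.

```python
class EagerVersioning:
    """In-place writes; the original value is saved in an undo log."""

    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value
        self.undo_log = []        # (address, original value) pairs

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value  # speculative value is visible in place

    def commit(self):
        self.undo_log.clear()      # nothing to copy: values are already in place

    def abort(self):
        # Restore original values in reverse order of the writes.
        for addr, original in reversed(self.undo_log):
            self.memory[addr] = original
        self.undo_log.clear()


class LazyVersioning:
    """Writes are buffered privately and published on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}     # private storage: address -> speculative value

    def read(self, addr):
        # A transaction sees its own buffered writes first.
        return self.write_buffer.get(addr, self.memory[addr])

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        self.memory.update(self.write_buffer)  # publish all buffered writes
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # discard speculative values
```

In this model the in-place variant does no work at commit but must walk the log on abort, while the buffered variant discards its buffer on abort but must copy values on commit.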
Eager HTM
• A.k.a. pessimistic
• Writes in-place, detects and resolves conflicts on every access
• LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]
• Fast commit, slow abort, limited concurrency
[Figure: timeline of TX 1, TX 2 and TX 3 with a write (W), two reads (R, R), a stall and a fast commit]
Lazy HTM
• A.k.a. optimistic
• Writes buffered, detects and resolves conflicts on commit
• TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]
• Fast abort, complex commit, good concurrency
[Figure: timeline of TX 1, TX 2 and TX 3 with a write (W), two reads (R, R) and a complex commit (validate + write)]
The Motivation
Splitting conflict management
• Eager-Lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):
– Software begin, commit and abort
– Probabilistic (signature based) conflict detection
• EazyHTM is the first pure-hardware eager-lazy TM
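[Table: conflict detection vs. conflict resolution]
• Eager detection, eager resolution: LogTM
• Eager detection, lazy resolution: EazyHTM (fast commit, good concurrency)
• Lazy detection, eager resolution: impossible
• Lazy detection, lazy resolution: TCC, S-TCC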
Outline
• Motivation
• Contributions
• Hardware changes
• The Protocol
• Evaluation
• Conclusions
EazyHTM Contributions
• The best of both worlds
– Eager conflict detection: simple commit, exact list of conflicts known in advance
– Lazy conflict resolution: good concurrency
• Parallel commits of non-conflicting TXs
• Designed for CMPs (Chip-Multiprocessors)
– Exploits core proximity
– MESI/MOESI protocol upgrade (easier verification)
Hardware changes
• CPU (per core):
– Register file checkpoint
– Racers list: 1 bit per core
– Killers list: 1 bit per core
– The lists track conflicts; each is a bit-vector (32 bits for 32 cores)
• Private cache(s), added to the existing cache logic:
– SR: 1 bit per line
– SM: 1 bit per line
– These bits hold the read/write set
• Directory, added to the existing directory logic:
– TD: 1 bit per line, read-only optimization bit (details in the paper)
Racers and killers list
• If line is shared between two TXs:
– Read-Read
• No conflict
– Write-Read, Read-Write, Write-Write
• Writer adds reader TX into “racers” list
– “TXs that I have to abort” list, if I commit first
• Reader adds writer TX into “killers” list
– “TXs that can abort me” list, if they commit first
• We illustrate only the Write-after-Read (WAR) conflict
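The bookkeeping described above can be sketched with one bit per core, matching the bit-vector layout from the hardware slide. This is a simplified software model, not the hardware logic: the names `TxCore`, `record_conflict` and `cores_in` are hypothetical, and only the writer/reader case is modelled.

```python
NUM_CORES = 32  # one bit per core in each list


class TxCore:
    """Per-core conflict state, modelled as two integer bit-vectors."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.racers = 0   # bit i set: the TX on core i must be aborted if I commit first
        self.killers = 0  # bit i set: the TX on core i can abort me if it commits first


def record_conflict(writer, reader):
    """Record a conflict between a writer TX and a reader TX on a shared line."""
    writer.racers |= 1 << reader.core_id   # writer remembers whom to abort
    reader.killers |= 1 << writer.core_id  # reader remembers who can abort it


def cores_in(bits):
    """Return the core ids whose bit is set in a racers or killers vector."""
    return [i for i in range(NUM_CORES) if (bits >> i) & 1]


# Example: the TX on core 2 writes a line that the TX on core 0 has read.
tx0, tx2 = TxCore(0), TxCore(2)
record_conflict(writer=tx2, reader=tx0)
```

After the call, core 2 lists core 0 as a racer and core 0 lists core 2 as a killer; a read-read access would leave both vectors unchanged.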
EazyHTM Protocol
Conflict Detection (1/2)
Example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX.
[Figure: TX 0 and TX 2, each with racers and killers lists, and the Directory with sharers @A. Messages shown: txMark @A (replaces GETS/GETX) and ACK @A, 0 (no other sharers).]
EazyHTM Protocol
Conflict Detection (2/2)
Example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX.
[Figure: TX 0 and TX 2, each with racers and killers lists, and the Directory with sharers @A. Messages shown (steps 1 to 5): txMark @A; ACK @A, 1 (1 other sharer); txAccessor #2, @A; Reader #0, @A; Writer #2, @A (potential conflict).]
• TX 2 remembers: abort TX#0 on commit
• TX 0 remembers: TX#2 can abort me
EazyHTM Protocol
Conflict Resolution
Example: TX 0 executes BTX, RD A, CTX; TX 2 executes BTX, WR A, CTX.
• TX#2 came to the commit point first, so it aborts TX#0
[Figure: TX 0 and TX 2, each with racers and killers lists, and the Directory with sharers @A. Messages shown (steps 1 to 3): Abort from TX#2; Abort Ack from TX#0; WR @A (commit).]
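The commit-time resolution in this example can be sketched as follows. It is a minimal model under stated assumptions: the `Tx` class, the `commit` function and the state strings are hypothetical, and the message order (abort, acknowledgement, then write) is inferred from the slide rather than taken from a protocol specification.

```python
class Tx:
    """Minimal transaction state for the commit-time resolution sketch."""

    def __init__(self, tx_id):
        self.tx_id = tx_id
        self.racers = set()    # ids of TXs to abort if this TX commits first
        self.killers = set()   # ids of TXs that can abort this TX
        self.state = "running"


def commit(tx, all_txs):
    """Commit tx: abort every still-running racer, then publish the writes.

    Returns the list of messages exchanged, in the order assumed here.
    """
    if tx.state != "running":
        raise ValueError(f"TX#{tx.tx_id} is not running")
    log = []
    for victim_id in sorted(tx.racers):
        victim = all_txs[victim_id]
        if victim.state == "running":
            log.append(f"Abort from TX#{tx.tx_id}")
            victim.state = "aborted"
            log.append(f"Abort Ack from TX#{victim.tx_id}")
    log.append(f"TX#{tx.tx_id} writes (commit)")
    tx.state = "committed"
    return log


# Example from the slide: TX 2 wrote @A after TX 0 read it, and commits first.
tx0, tx2 = Tx(0), Tx(2)
tx2.racers.add(0)
tx0.killers.add(2)
log = commit(tx2, {0: tx0, 2: tx2})
```

Because the racers list was filled in during execution, the committing transaction already knows exactly which transactions to abort and needs no validation phase in this model.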
EazyHTM Protocol
Disjoint data => parallel commit
Example: TX 0 executes BTX, WR A, CTX; TX 2 executes BTX, WR B, CTX.
• TX#0 works with line @A, TX#2 works with line @B
[Figure: TX 0 and TX 2, each with racers and killers lists, and the Directory with sharers @A and sharers @B. Messages shown (steps 1 to 3 for each TX): txMark @A and txMark @B; ACK @A, 0 and ACK @B, 0 (0 other sharers); WR @A (commit) and WR @B (commit).]
• No serialization
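A small model of the directory's sharers lists shows why disjoint data leads to commits that do not interact. The `Directory` class and its `tx_mark` method are hypothetical names for this sketch; the only behaviour assumed is that a txMark request registers the sender as a sharer and that the ACK reports the other sharers of the line.

```python
from collections import defaultdict


class Directory:
    """Tracks, per line address, which transactions have marked the line."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> ids of sharer TXs

    def tx_mark(self, tx_id, addr):
        """Register tx_id as a sharer of addr and return the other sharers.

        The size of the returned set corresponds to the count carried by
        the ACK message on the slides (e.g. "ACK @A, 0").
        """
        others = set(self.sharers[addr]) - {tx_id}
        self.sharers[addr].add(tx_id)
        return others


# Disjoint data: TX 0 touches only line A, TX 2 touches only line B.
directory = Directory()
others_for_tx0 = directory.tx_mark(0, "A")
others_for_tx2 = directory.tx_mark(2, "B")

# Shared data, for contrast: a separate directory where both TXs touch line A.
shared = Directory()
shared.tx_mark(0, "A")
others_on_shared_line = shared.tx_mark(2, "A")
```

With no other sharers reported, neither transaction records a racer or a killer, so each commit sends no abort messages and the two commits can proceed independently.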
Implementation
• Implemented in M5, a full-system simulator (Alpha)
• Private L1 (32KB, 4-way, 64B CL, 2 cycles)
• Private L2 (512KB, 8-way, 64B CL, 10 cycles)
• Memory (with directory, 100 cycles)
• ICN (2D Mesh, 10 cycles per hop)
Evaluation
• Evaluated STAMP benchmarks
• Compared with Scalable-TCC-like HTM
– Same base simulator
– Implemented specialized directory protocol
• Compared with ideal lazy HTM (MESI based)
– magical conflict detection
– instant conflict resolution
– parallel write-back commit
Kmeans Low
• Small TXs (RS 15 CL; WS 5 CL)
• Low contention (10% aborts)
• Similar profile to “replacing locks with atomic”
• Near ideal performance
• K-means: groups points in an N-dimensional space into K clusters
• Most of the SPLASH-2 suite has a similar profile
[Figure: Kmeans-Low, speedup vs. processors; series: Ideal, EazyHTM, STCC]
SSCA2
• Small TXs (RS 50 CL, WS 10 CL)
• Low contention (1.2% aborts)
• Near ideal performance
• Scalability affected by barriers, not by contention
• SSCA2: large directed graph operations
[Figure: SSCA2, speedup vs. processors; series: Ideal, EazyHTM, STCC]
Yada
• Large TXs (260 CL RS, 140 CL WS)
• Moderate contention (35% aborts)
• Good performance for large TXs as well
• Yada: Delaunay mesh refinement
[Figure: Yada, speedup vs. processors; series: Ideal, EazyHTM, STCC]
Intruder
• Medium TXs (53 CL RS, 20 CL WS)
• High contention (85% aborts)
• Very bad scalability for all HTMs
• Every transaction detects conflicts over and over again; the many conflict-detection messages slow down the execution
• Intruder: signature-based network intrusion detection system
[Figure: Intruder, speedup vs. processors; series: Ideal, EazyHTM, STCC]
Only high-conflict STAMP
• >50% abort rate only
• High contention at high core counts should be optimized
• Averages:
– Labyrinth
– Intruder
– Kmeans-Hi
• Results highly affected by Intruder
[Figure: High-conflict STAMP, speedup vs. processors; series: Ideal, EazyHTM, STCC]
Only low-conflict STAMP
• <50% abort rate only
• Low abort rate necessary for scaling
• Excludes:
– Labyrinth 8-32
– Intruder 16-32
– Kmeans-Hi 32
[Figure: Scaling STAMP, speedup vs. processors; series: Ideal, EazyHTM, STCC]
Conclusions
• Introduced EazyHTM, a new HTM implementation
– Eager conflict detection, lazy conflict resolution
– Fast: performs well for low-conflict parallel applications
– Minimal changes to directory protocols (easier verification)
– As scalable as a standard directory protocol
• EazyHTM mechanism could allow (future work):
– Simpler transaction prioritization
– Less wasted work
– Better performance optimization
– Power efficient TM mechanisms