
Architectural Techniques for Memory Oversight in Multiprocessors

by

Arrvindh Shriraman

Submitted in Partial Fulfillment

of the

Requirements for the Degree

Doctor of Philosophy

Supervised by

Professor Sandhya Dwarkadas

Department of Computer Science

Arts, Sciences, and Engineering

Edmund A. Hajim School of Engineering and Applied Sciences

University of Rochester

Rochester, New York

2010


To my family and friends


Curriculum Vitae

Arrvindh Shriraman was born in Madras, India (now known as Chennai, although he always

liked the British version) and has walked this earth for ≈10,000 days. He graduated from

the University of Madras in 2004, with a Bachelor of Engineering degree in the area of Com-

puter Science. Arrvindh entered graduate studies at the University of Rochester in the Fall of

2004, pursuing research in computer architecture and systems, under the direction of Professor

Sandhya Dwarkadas. He received the Master of Science degree in Computer Science from the Uni-

versity of Rochester in 2006. In the future, Arrvindh will seek to research ideas at Simon Fraser

University that his advisor considered too risky or half-baked when he was in graduate school.

Arrvindh loves fast cars, the Finger Lakes in Western New York, and fall weather.


Acknowledgments

First and foremost, I would like to thank Sandhya Dwarkadas for helping me define and refine

the ideas in this dissertation. I was extraordinarily fortunate that she agreed to work with me,

and have grown and matured under her mentorship. I have stress tested Sandhya’s patience

many times over the years, but she has remained committed to my professional and personal

growth. It was great that she gave me the freedom to pursue my research ideas and go find my

own thing. These six years as her student have changed my life forever.

Michael Scott is the man; I continue to be amazed by his ability to google his brains for

major research works related to the topic of discussion, and then summarize them, all within a

few seconds. What is also obvious after a few minutes of conversation with him is his startling

knowledge across a range of topics, which I have drawn upon often. Most of all, I would like to

thank him for helping me refine my half-baked ideas. He has been my virtual co-advisor.

Kai Shen impressed me with his eye for engineering details and his values as a researcher. I

learned a lot by just observing his dedication to work and his drive, which kept him a few hours

longer at work after I had retired for the night. I would also like to thank him for all the history

and politics lessons he taught me over dinner.

Engin Ipek, Chen Ding, and John B. Carter (thanks for agreeing to be on my thesis com-

mittee) were all very helpful during my job hunt season and provided valuable professional

consultancy at zero cost. I have tried to learn from Engin’s dedication to his students, Chen’s

drive to continuously refine himself as a teacher, and John’s infectious enthusiasm.

Waran kickstarted my research career and motivated me to get a Ph.D. in the first place. He


helped me redefine my own limits and find work gears that I never knew existed to put in the

long hours.

My graduate studies have come to an end, but the relationships I have built at the UofR will

last a lifetime.

To Michael Spear, for teaching me how to work on collaborative projects and manage time.

Debugging RTM with him was a pleasure. I look forward to working with him in the future.

To Christopher Stewart, for lending me a comforting shoulder when I had paper rejections.

He has always heard my ideas, even before Sandhya did, and helped me with his critical review-

ing skills.

To Virendra Marathe, who is a treasure hoard of novel ideas.

To Bill Scherer and Luke Dalessandro, to whom I have always turned when I had trouble

with synchronization and C++.

To Hemayet Hossain and Hongzhou Zhao, for becoming part of the local GEMS hacker

community; they were also great architectural idea sounding boards.

More thanks to Kirk Kelsey, Xiao Zhang, Tongxin Bai, Xiaoming Gu, Amal Fahad, Stan

Park, Kostas Menychtas, and Bin Bao for being part of a fun systems group.

Thanks to everyone who spared a thought for me. I apologize if I have not mentioned your

name; it was only because I ran out of space.

I would also like to thank my friends Gundi, KP, Rumbum, Harish, and Ninja for all the

good times and essentially being my extended family here in this country. I just tallied my cell

phone minutes and I averaged 45 mins every weekend in calls to these guys, over the last 4

years. The other part of the extended family consisted of the Indian mafia at Whipple Park; this

thesis would not have been possible without their support.

The secretaries in our department have a thankless job; I would like to acknowledge them

for letting me focus on my research and taking care of everything else.

Marty taught me everything I know about the north east and I personally owe all my great

summers in Rochester to her. She showed me how to maintain a sense of humor on the job and

helped me tide over many a mini-crisis (think spilling sour milk on one’s office carpet).


JoMarie was my first contact within the department and she has taken care of all my logistics

over the years. I thank her for the countless letters she provided for the various visa procedures,

without ever asking why I needed them.

Eileen is really important, more so than most students realize. She ensured that I always got

paid on time.

To Pat, for making sense of all the pieces of conference receipts (and non-receipts) and

turning them into a kosher reimbursement form.

I would like to thank my parents, K.S.N. Sreeramen (yes, South Indians have 2–3 middle

names) and Sudha Sreeram, for always being there for me. My mother took an active hands-on

role in my education from the beginning and encouraged me to be focused and at the same time

have an open mind. My dad always believed in me. To my grandmother for all the great muruku

(an Indian snack) she kept feeding me when I was poring over my middle school homework in

the summers. Lastly, I would like to thank my late thata (grandfather) and baba (uncle) for a

memorable childhood. Life was simple back then!

This material is based upon research supported by the National Science Foundation (grant

numbers: CNS-0411127, CAREER Award CCF-0448413, CNS-0509270, CNS-0615045,

CNS-0615139, CCF-0621472, CCF-0702505, ITR/IIS-0312925, CCR-0306473, and CNS-

0834451), the National Institutes of Health (5 R21 GM079259-02 and 1 R21 HG004648-01),

IBM Faculty Partnership Awards, and the University of Rochester. Any opinions, findings, and

conclusions or recommendations expressed in this material are those of the author(s) and do not

necessarily reflect the views of the above named organizations.


Abstract

Computer architects have exploited the transistors afforded by Moore’s law to provide software

developers with high performance computing resources. Software has translated this growth in

hardware resources into improved features and applications. Unfortunately, applications have

become increasingly complex and are prone to a variety of bugs when multiple software mod-

ules interact. The advent of multicore processors introduces a new challenge, parallel program-

ming, which requires programmers to coordinate multiple tasks.

This dissertation develops general-purpose hardware mechanisms that address the dual chal-

lenges of parallel programming and software reliability. We have devised hardware mechanisms

in the memory hierarchy that shed light on the memory system and control the visibility of data

among the multiple threads. The key novelty is the use of cache coherence protocols to im-

plement hardware mechanisms that enable software to track and regulate memory accesses at

cache-line granularity. We demonstrate that exposing the events in the memory hierarchy pro-

vides useful information that was either previously invisible to software or would have required

heavyweight instrumentation.

Focusing on the challenge of parallel programming, our mechanisms aid implementations

of Transactional Memory (TM), a programming construct that seeks to simplify synchroniza-

tion of shared state. We develop two mechanisms, Alert-On-Update (AOU) and Programmable

Data Isolation (PDI), to accelerate common TM tasks. AOU selectively exposes cache coherence events, including those triggered by remote accesses, to software in the form of notifications. TM

runtimes use it to detect accesses that overlap between transactions (i.e., conflicts), and track a


transaction’s status. Programmable-Data-Isolation (PDI) allows multiple threads to temporar-

ily hide their speculative writes from concurrent threads in their private caches until software

decides to make them visible. We have used PDI and AOU to implement two TM run-time

systems, RTM and FlexTM. Both RTM and FlexTM are flexible runtimes that permit software

control of the timing of conflict resolution and the policy used for conflict management.

To address the challenge of software reliability, we propose Sentry, a lightweight, flexible

access-control mechanism. Sentry allows software to regulate the reads and writes to memory

regions at cache-line granularity based on the context in the program. Sentry coordinates the

coherence states in a novel manner to eliminate the need for permission checks entirely for a

large majority of the program’s accesses (all cache hits), thereby improving efficiency. Sentry

improves application reliability by regulating data visibility and movement among the multiple

software modules present in the application. We use a real-world webserver, Apache, as a

case study to illustrate Sentry’s ability to guard the core application from vulnerabilities in the

application’s modules.


Table of Contents


Curriculum Vitae iii

Acknowledgments iv

Abstract vii

List of Figures xvi

List of Tables xix

Foreword 1

1 Introduction and Motivation 3

1.1 Transactional Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.1.1 Our Approach : Flexible Transactional Memory . . . . . . . . . . . . . 6

1.1.2 Monitoring : Alert-On-Update . . . . . . . . . . . . . . . . . . . . . . 7

1.1.3 Isolation : Programmable-Data-Isolation . . . . . . . . . . . . . . . . 7

1.1.4 Decoupling Conflict detection from Resolution . . . . . . . . . . . . . 8

1.2 Software Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


1.2.1 Problem and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.2.2 Our Approach : Sentry . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.4 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Background 13

2.1 Concurrency in Software Execution: Transactional Memory . . . . . . . . . . 13

2.1.1 Transactional Memory in a Nutshell . . . . . . . . . . . . . . . . . . . 14

2.1.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1.3 Hardware support for small transactions . . . . . . . . . . . . . . . . . 17

2.1.4 Unbounded transactions . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.1.5 Classifying proposed TM systems . . . . . . . . . . . . . . . . . . . . 22

2.1.6 Our Approach : Flexible Transactional Memory . . . . . . . . . . . . . 24

2.2 Concurrency in Software Development :

Fine-grain Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.1 Modern processors : Paging and Segmentation . . . . . . . . . . . . . 25

2.2.2 Research Prototypes : Mondrian and Loki . . . . . . . . . . . . . . . . 26

2.2.3 Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.4 Tagged Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.5 Software-based Protection . . . . . . . . . . . . . . . . . . . . . . . . 28

2.2.6 Access Control for Debugging . . . . . . . . . . . . . . . . . . . . . . 28

2.2.7 Our Approach: Sentry . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 Monitoring: Alert-On-Update 30

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Current Monitoring Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2.1 Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


3.3 Alert-On-Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3.2 Observable events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3.3 Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.4.1 Informing Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.4.2 Intel mark bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4.3 Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.5 Application 1: AOU Assisted STMs . . . . . . . . . . . . . . . . . . . . . . . 41

3.5.1 RSTM : Indirection-Based STMs . . . . . . . . . . . . . . . . . . . . 42

3.5.2 LOCK : Lock-based STM . . . . . . . . . . . . . . . . . . . . . . . . 43

3.5.3 Challenges in STM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5.4 Using AOU to accelerate STMs . . . . . . . . . . . . . . . . . . . . . 45

3.5.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.6 Application 2: Accelerating Locks . . . . . . . . . . . . . . . . . . . . . . . . 54

3.6.1 Background : Transactional Mutex Locks . . . . . . . . . . . . . . . . 55

3.6.2 AOU Acceleration for Locks . . . . . . . . . . . . . . . . . . . . . . . 57

3.6.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.7 Application 3: Detecting Atomicity Bugs . . . . . . . . . . . . . . . . . . . . 59

3.8 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3.8.1 AOU for Fast User-space Mutexes . . . . . . . . . . . . . . . . . . . . 62

3.8.2 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.8.3 Code Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4 Isolation: Programmable Data Isolation 65

4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


4.1.1 Previous Approaches to Data Isolation . . . . . . . . . . . . . . . . . . 67

4.1.2 Our Approach : Lazy Coherence . . . . . . . . . . . . . . . . . . . . . 68

4.2 Broadcast-based TMESI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.2.1 Bulk State Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.3 Directory-based TMESI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.3.1 Conflict Summary Tables . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.4 Application of TMESI-Bcast : RTM Project . . . . . . . . . . . . . . . . . . . 77

4.4.1 RTM Transaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.4.2 Fast-Path RTM Transactions . . . . . . . . . . . . . . . . . . . . . . . 80

4.4.3 Overflow RTM Transactions . . . . . . . . . . . . . . . . . . . . . . . 81

4.4.4 Latency of RTM Transactions . . . . . . . . . . . . . . . . . . . . . . 82

4.4.5 Hardware-Software Transactions . . . . . . . . . . . . . . . . . . . . . 86

4.5 Application of TMESI-Dir: FlexTM . . . . . . . . . . . . . . . . . . . . . . . 86

4.5.1 Bounded FlexTM Transactions . . . . . . . . . . . . . . . . . . . . . . 88

4.5.2 Mixed Conflict Resolution . . . . . . . . . . . . . . . . . . . . . . . . 91

4.6 Virtualizing of Cache Overflows in FlexTM . . . . . . . . . . . . . . . . . . . 92

4.6.1 Eviction of Transactionally Read Lines . . . . . . . . . . . . . . . . . 92

4.6.2 Overflow table (OT) Controller Design . . . . . . . . . . . . . . . . . 92

4.6.3 Handling Evictions with Fine-grain Translation . . . . . . . . . . . . . 94

4.6.4 Handling OS Page Evictions . . . . . . . . . . . . . . . . . . . . . . . 98

4.6.5 Context Switch Support . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.7.1 Area Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.7.2 FlexTM Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4.7.3 FlexTM vs. Hybrid TMs and STMs . . . . . . . . . . . . . . . . . . . 104

4.7.4 FlexTM vs. Central-Arbiter Lazy HTMs . . . . . . . . . . . . . . . . . 109


4.7.5 FlexTM-S vs. Other Virtualization Mechanisms . . . . . . . . . . . . . 111

4.8 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

4.8.1 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

4.8.2 Garbage Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.8.3 Concurrent Programming . . . . . . . . . . . . . . . . . . . . . . . . . 115

5 Conflict Management and Resolution in HTMs 116

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5.2 Conflict Resolution Primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.2.1 Conflict Resolution and Contention Management . . . . . . . . . . . . 119

5.2.2 Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.3 Effectiveness of Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

5.3.1 Implementation Tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . 124

5.3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5.3.3 Effect of Wasted work . . . . . . . . . . . . . . . . . . . . . . . . . . 126

5.4 Interplay between Conflict Resolution and Management . . . . . . . . . . . . . 129

5.4.1 Wasted work in Eager and Lazy . . . . . . . . . . . . . . . . . . . . . 133

5.4.2 Concurrent Readers and Writers . . . . . . . . . . . . . . . . . . . . . 135

5.5 Mixed Conflict Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5.5.1 Implementation Tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . 137

5.5.2 Porting Mixed to other TMs . . . . . . . . . . . . . . . . . . . . . . . 139

5.6 Other studies on contention management . . . . . . . . . . . . . . . . . . . . . 140

5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

6 Protection : Sentry 144

6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6.1.1 Access Control in the Memory Hierarchy . . . . . . . . . . . . . . . . 147


6.2 Sentry : Auxiliary Memory Access Control . . . . . . . . . . . . . . . . . . . 148

6.2.1 Metadata Hardware Cache (M-Cache) . . . . . . . . . . . . . . . . . . 149

6.2.2 Permission Cache Checks . . . . . . . . . . . . . . . . . . . . . . . . 151

6.2.3 Coherence-based Access Checks . . . . . . . . . . . . . . . . . . . . . 152

6.2.4 Exception Trigger . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

6.2.5 How is the M-cache filled ? . . . . . . . . . . . . . . . . . . . . . . . 153

6.2.6 Changing Permissions . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.3 Sentry Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.3.1 Foundations for Sentry’s Protection Models . . . . . . . . . . . . . . . 157

6.3.2 One Domain Per Process . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.3.3 Intra-Process Compartments . . . . . . . . . . . . . . . . . . . . . . . 160

6.3.4 Ring Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

6.4 M-cache Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

6.4.1 Area, Latency, and Energy . . . . . . . . . . . . . . . . . . . . . . . . 164

6.4.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

6.5 Experimental System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

6.6 Application 1: Compartmentalizing Apache . . . . . . . . . . . . . . . . . . . 167

6.6.1 Compartmentalizing Code . . . . . . . . . . . . . . . . . . . . . . . . 168

6.6.2 Compartmentalizing Data . . . . . . . . . . . . . . . . . . . . . . . . 169

6.6.3 Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

6.6.4 Process-based Protection vs. Sentry . . . . . . . . . . . . . . . . . . . 171

6.6.5 Lightweight Remote Procedure Call . . . . . . . . . . . . . . . . . . . 172

6.7 Application 2: Sentry-based Watchpoint Debugger . . . . . . . . . . . . . . . 173

6.7.1 Debugging Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

6.7.2 Comparison with Other Hardware . . . . . . . . . . . . . . . . . . . . 175

6.8 Extensions for address-translation . . . . . . . . . . . . . . . . . . . . . . . . 176

6.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177


7 Summary and Future Work 178

7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

7.1.1 Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.1.2 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.1.3 Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.2.1 Transactional Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 182

7.2.2 Fine-grain memory protection . . . . . . . . . . . . . . . . . . . . . . 183

7.3 Reflections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

7.3.1 Future of Transactional Memory . . . . . . . . . . . . . . . . . . . . . 184

7.3.2 Which one of your hardware mechanisms holds the most promise? . . . 186

7.3.3 How do you know your cache protocols work? . . . . . . . . . . . . . 187

Appendices

A Supplement for Transactional Memory 189

A.1 Experimental Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

A.2 Application Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

A.3 Conflict Scenarios in Applications . . . . . . . . . . . . . . . . . . . . . . . . 193

B Coherence State Machine 196


List of Figures

1.1 Execution time breakdown in STM . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Concurrency in software development . . . . . . . . . . . . . . . . . . . . . . 9

1.3 Thesis contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1 Loss of parallelism due to locks. [RaG01] . . . . . . . . . . . . . . . . . . . . 15

3.1 Coherence protocol support for Alert-on-update. . . . . . . . . . . . . . . . . . 37

3.2 STM Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.3 Lock Stealing to make LOCK non-blocking . . . . . . . . . . . . . . . . . . . 46

3.4 Performance of STMs with AOU acceleration. . . . . . . . . . . . . . . . . . . 50

3.5 L1 cache miss rates across accelerated STMs . . . . . . . . . . . . . . . . . . 51

3.6 Timing breakdown for accelerated STMs . . . . . . . . . . . . . . . . . . . . . 52

3.7 Single-orec STM Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.8 Performance of Transactional-Mutex-Locks with AOU acceleration. . . . . . . 60

4.1 Example of data isolation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.2 TMESI Broadcast Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.3 TMESI Directory Protocol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.4 RTM metadata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


4.5 RTM transaction execution time breakdown . . . . . . . . . . . . . . . . . . . 85

4.6 Interaction of RTM-F and RTM-O transactions . . . . . . . . . . . . . . . . . 87

4.7 FlexTM Architecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 88

4.8 Pseudocode of BEGIN TRANSACTION and END TRANSACTION. . . . . . 90

4.9 Metadata for pages that have overflowed state . . . . . . . . . . . . . . . . . . 95

4.10 Software-metadata cache architecture . . . . . . . . . . . . . . . . . . . . . . 95

4.11 1 thread performance of FlexTM . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.12 16 thread performance of FlexTM . . . . . . . . . . . . . . . . . . . . . . . . 107

4.13 FlexTM vs. Centralized hardware arbiters. . . . . . . . . . . . . . . . . . . . . 111

4.14 Comparing FlexTM-S (FlexTM-Streamlines) with other TMs . . . . . . . . . . 113

4.15 Effect of signature size on FlexTM performance . . . . . . . . . . . . . . . . . 114

5.1 Contention manager actions . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.2 Studying the effect of randomized-backoff on conflict management . . . . . . . 127

5.3 Interplay of conflict management and conflict resolution . . . . . . . . . . . . 132

5.4 Interaction of access patterns with conflict resolution . . . . . . . . . . . . . . 134

5.5 Interaction of Mixed resolution with Size contention manager. . . . . . . . . . . 138

5.6 Interaction of Mixed resolution with Age contention manager. . . . . . . . . . . 138

6.1 Software modules in Apache . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

6.2 Access control in the memory hierarchy . . . . . . . . . . . . . . . . . . . . . 148

6.3 Permissions metadata cache (M-cache) . . . . . . . . . . . . . . . . . . . . . . 150

6.4 Pseudocode for inserting a new M-cache entry. . . . . . . . . . . . . . . . . . 154

6.5 Changing Permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.6 Cross-domain call execution flow . . . . . . . . . . . . . . . . . . . . . . . . . 162

6.7 L1 miss rate in applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

6.8 TLB vs. M-cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166


6.9 Performance of Sentry-protected Apache . . . . . . . . . . . . . . . . . . . . . 169

6.10 Comparing Sentry against process-based protection . . . . . . . . . . . . . . . 172

6.11 Sentry-Watcher vs. Binary instrumentation . . . . . . . . . . . . . . . . . . . 174

6.12 M-cache vs other hardware-based watchpoints . . . . . . . . . . . . . . . . . . 176

A.1 Conflict type breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

B.1 State Machine:Part 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

B.2 State Machine:Part 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200


List of Tables

2.1 Virtualization in TM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1 Classification of current monitoring mechanisms . . . . . . . . . . . . . . . . 33

3.2 Alert-on-update Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.3 Classification of proposed monitoring mechanisms . . . . . . . . . . . . . . . 40

3.4 Simulation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.5 Atomicity violation bugs defined by Lu et al. [LT06] . . . . . . . . . . . . . . 61

3.6 Execution time overhead for Atomicity violation detection . . . . . . . . . . . 62

4.1 Programmable Data Isolation API . . . . . . . . . . . . . . . . . . . . . . . . 69

4.2 Coherence state encoding for fast commits and aborts. . . . . . . . . . . . . . 72

4.3 Area overhead of FlexTM’s hardware mechanisms . . . . . . . . . . . . . . . 102

4.4 Experimental setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.1 Percentage of total (committed and aborted) txs that encounter a conflict. . . . 117

5.2 Interaction of contention manager and conflict resolution . . . . . . . . . . . . 120

5.3 Characteristics of aborted transactions . . . . . . . . . . . . . . . . . . . . . . 143

6.1 Permissions metadata cache (M-cache) API . . . . . . . . . . . . . . . . . . . 151


6.2 Mapping coherence protocol states to permission checks . . . . . . . . . . . . 152

6.3 Design tradeoffs in M-cache design . . . . . . . . . . . . . . . . . . . . . . . 164

6.4 Application Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

A.1 Transactional Workload Characteristics . . . . . . . . . . . . . . . . . . . . . 195

B.1 TMESI L1 controller states . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

B.2 TMESI L1 controller events . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

B.3 TMESI L1 controller messages . . . . . . . . . . . . . . . . . . . . . . . . . . 202

B.4 TMESI L1 cache controller actions . . . . . . . . . . . . . . . . . . . . . . . . 203


Foreword

While I am the author of this dissertation, the work in this dissertation would not have been

possible without the collaboration of various students and professors.

The monitoring mechanism in Chapter 3 was born out of discussions with Virendra Marathe

and Michael F. Spear about thread synchronization. I would like to thank Michael F. Spear for

incorporating my untested hardware mechanism within his reasonably stable (at least in com-

parison to my simulator) STM framework; our collaboration resulted in papers at ISCA’07,

SPAA’07, and TRANSACT’09. Hemayet Hossain provided valuable debugging support and

spent many hours stress testing our framework. Sandhya Dwarkadas and Michael Scott pro-

vided valuable suggestions and technical guidance throughout this project.

Programmable Data Isolation (PDI), described in Chapter 4, was first presented at the first

workshop on transactional memory (TRANSACT 2005). The cache coherence protocol for PDI

was developed by me with input from Sandhya Dwarkadas and Michael Scott. In published ma-

terial, PDI has been primarily used as a mechanism to implement versioning in transactional

memory. We discuss two TM systems in this dissertation. The first, RTM (Section 4.4), was de-

veloped primarily by me in conjunction with Michael F. Spear. Virendra Marathe and Hemayet

Hossain provided valuable debugging support. The second, FlexTM (Section 4.5), was devel-

oped by me and designed in conjunction with Sandhya Dwarkadas and Michael Scott. The

contention management study for hardware-based TMs (Chapter 5.1) inherited the software

framework from RSTM.

The Sentry system described in Chapter 6 was designed and developed by me with advice


from Sandhya Dwarkadas and Kai Shen. Kai Shen helped me define the software interface and

protection models.

Finally, I would like to thank the Wisconsin GEMS members for providing the basic simu-

lation infrastructure upon which I developed the mechanisms described in this dissertation.


Chapter 1Introduction and Motivation

Computing systems have significantly evolved over the years and have come to occupy a major

part of our daily lives. Hardware designers have fueled this growth by doubling performance

every two years. Meanwhile, software has continued to include more features and has increased

in complexity. The advent of multicore processors is also an inflection point for software de-

velopment [Pat10]. Eight-core chips are available today, and if programmers can learn how to

take advantage of them, vendors will deliver hundreds of cores within a decade. However, de-

veloping correct, high performance, and reliable programs has become challenging. The major

source of bugs is the corruption of program memory state due to unexpected interaction be-

tween the different concurrently executing parts of the program [LP08]. Fine-grain intra-thread

events and inter-thread interactions via memory make it difficult for developers to comprehend

the synchronization required to enable correct programs.

In this dissertation, we seek to utilize the transistors afforded by Moore’s law to develop new

hardware mechanisms that lead to better software development tools and programming models.

The basic idea is to use hardware to expose information that enables a program to understand its

own execution and react to it. This will also help developers understand misbehaving software.

For example, if software could easily track the locations accessed by a program, it could use this to detect concurrency bugs and check safety invariants. Similarly, if software were aware of the

cached locations, it could reorganize the code to overlap computation with cache misses and

improve performance.

It is possible to understand concurrent interactions with software-based techniques. One


possible approach is to use static program analyzers [NMW02]. Unfortunately, most static tools

require significant effort on the programmer’s side to annotate programs; limited information

available at compile-time also poses a challenge. Dynamic instrumentation-based tools can

extract information at run time that can detect concurrency bugs [SBN97; Cor] and track data

flow [NeS07]. Unfortunately, dynamic techniques impose significant performance overhead.

The work in this thesis focuses on mechanisms for the cache hierarchy, which holds a sig-

nificant fraction of a program’s execution state and serves as a medium of communication be-

tween the various parts of a program. Current hardware systems export a narrow interface to

memory (only reads and writes of words) and hide most of the data movement operations from

software. We believe that future memory systems should be designed in a manner that exposes

information about the hardware actions and events on memory locations. Our research pro-

poses hardware mechanisms that shed light on the memory system and expose information that

software can use for both self-diagnosis and improved programmability and reliability.

A key novelty in this dissertation is the utilization of cache coherence to develop general-

purpose hardware mechanisms that enable software to track and regulate memory accesses.

Current cache coherent systems already include a hardware framework to manage data commu-

nication; we show that it requires only a few extensions to enable software to observe the data accessed

by the various parts of the program. Cache coherent systems also include a state machine to

manipulate reads and writes and we demonstrate that software can exploit coherence states to

control accesses efficiently.

In this dissertation, we address the challenge of parallelism, which manifests itself at multiple levels. First, the introduction of multicore processors means future applications now depend on thread-level parallel execution for improved performance. The implications for software

are profound: historically only the most talented programmers have been able to write good

parallel code; now everyone must do it. The core part of this dissertation addresses this chal-

lenge; we discuss this in detail in Section 1.1.

Second, we also address another challenging form of parallelism in Section 1.2: the parallelism in software development. Modern software consists of collaborative artifacts with millions

of lines of code written by many developers concurrently. Fine-grain intra- and inter-thread in-

teractions via memory make it difficult for developers to track and validate the accesses arising


from the various software modules. We propose the use of fine-granularity access control to

improve interaction among the various parts of an application.

1.1 Transactional Memory

Parallel programming is hard; even given a good division of labor among threads (something

that’s often difficult to find), mainstream applications are plagued by the need to synchronize

access to shared state. Transactional Memory (TM) aims to simplify synchronization by raising

the level of abstraction. As in the database world, the programmer or compiler simply marks

a block of code as atomic; the underlying system then promises to execute the block in an

“all-or-nothing” manner isolated from similar blocks (transactions) in other threads.

Any TM implementation based on speculation must perform the following tasks: detect

and resolve conflicts between transactions executing in parallel (conflict detection), keep track

of both old and new versions of data that are modified speculatively (version management),

and ensure that non-committed transactions never perform externally visible actions due to an

inconsistent view of memory.

The mechanisms required for conflict detection and version management can be im-

plemented in hardware (HTM) [HWC04; MBM06; HeM93; AAK05; RHL05], soft-

ware (STM) [HLM03b; FrH07; MSS05; SAH06; DSS06], or some hybrid of the two

(HyTM) [DFL06; KCH06; MTC07; DCW11].

Software-only implementations have the advantage of running on legacy machines, but it

is widely acknowledged that performance competitive with fine-grain locks will require hard-

ware support. Figure 1.1 shows the overhead of an STM system. In order to track conflicts

in the absence of special hardware, a software TM (STM) system must augment a program

with instructions that read and write some sort of metadata, which leads to high performance

overhead (20% – 350%). This overhead is embedded in every thread, cannot be amortized with

parallelism, and in fact tends to increase with processor count, due to contention for metadata

access.
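
To make this overhead concrete, the sketch below shows the pattern a TL2-like STM adds to every transactional load (C++; the orec table size, version encoding, and abort-by-exception are illustrative assumptions, not the exact TL2 code):

    #include <atomic>
    #include <cstdint>

    // Hypothetical ownership-record (orec) table: one version word per
    // address "stripe"; odd values mean "locked" in this encoding.
    static std::atomic<uint64_t> orecs[1 << 20];

    struct TxAborted {};  // thrown to unwind and retry the transaction

    uint64_t stm_read(uint64_t* addr, uint64_t start_time) {
        std::atomic<uint64_t>& o =
            orecs[(reinterpret_cast<uintptr_t>(addr) >> 3) & ((1 << 20) - 1)];
        uint64_t v1 = o.load(std::memory_order_acquire);  // metadata read 1
        uint64_t val = *addr;                             // the actual data
        uint64_t v2 = o.load(std::memory_order_acquire);  // metadata read 2
        if ((v1 & 1) || v1 != v2 || v1 > start_time)      // locked, racing,
            throw TxAborted{};                            // or too new
        return val;
    }

Every thread pays the extra loads and branches on every shared read; this per-access cost is precisely what hardware support aims to remove.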

Once regarded as impractical, in part because of limits on the size and complexity of 1990s

caches, TM in the 2000s has enjoyed renewed attention. Unfortunately, it is not yet clear to


[Figure 1.1 plots normalized execution time, broken into useful work, metadata checks, and versioning, for Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth, and Vacation.]

Single-thread runs of a TL2-like STM [DSS06] system on applications from STAMP (see Appendix A). Uninstrumented code run time = 1.

Figure 1.1: Execution time breakdown in STM

us that proposed hardware TMs will provide the most practical, cost-effective, or semantically

acceptable implementation of transactions. A key limitation with current hardware TM propos-

als is their rigidity in conflict management policy. Key policy choices that have a first order

effect on TM performance are conflict resolution time (when to manage conflicts) and conflict

management policy (how to arbitrate amongst conflicting transactions). Proposed hardware

TMs employ fixed choices for eagerness of conflict resolution, strategies for conflict arbitration

and back-off, and eagerness of versioning. They embed conflict resolution and management

in silicon—policies for which current evidence suggests that no one static approach may be

acceptable and which are likely to change with the emergence of new applications.

1.1.1 Our Approach : Flexible Transactional Memory

We strive to leave policy decisions on when and how to resolve conflicts under software control,

while using hardware mechanisms to accelerate both bounded and unbounded transactions. This

strategy allows the choice of policy to be tuned to the current workload. It also allows the TM

system to reflect system-level concerns such as thread priority. The key insight that enables pol-

icy flexibility is that information gathering and decision making can be decoupled. In particular,


data versioning, access tracking, and conflict detection can be supported as decoupled/separable

mechanisms that do not embed policy.

The first TM runtime system we developed in 2005, RTM, is a hardware-software TM.

RTM promotes policy flexibility by decoupling version management from conflict detection

and management—specifically, by separating data and TM runtime metadata, and performing

conflict detection only on the latter. Software can choose (by controlling the timing of metadata

inspection and updates) when to abort concurrent conflicting transactions. Software can also use

various parameters to choose which transaction needs to be aborted. We show in this dissertation

that software control over conflict management has a first-order impact on performance. RTM

included two forms of hardware support : (1) Monitoring of cache lines, which can reduce the

overheads of conflict detection, and (2) Isolation for keeping transaction updates invisible to

other concurrent transactions.

1.1.2 Monitoring : Alert-On-Update

Alert-On-Update selectively exposes cache coherence events to software and enables an appli-

cation to track accesses from the various threads in the system. Alert-On-Update enables fast

event-based notification on addresses marked by software and serves two related roles in RTM:

it provides low overhead conflict detection and informs software, which can then manage the

conflict and it also tracks the status of the transaction and ensures a transaction is immediately

informed when it is aborted. Overall, it eliminates a large fraction of the cost for the metadata

checks shown in Figure 1.1.

1.1.3 Isolation : Programmable-Data-Isolation

To accelerate data versioning in TM, we proposed Programmable Data Isolation (PDI). PDI

allows selective use of processor-private caches as a buffer for speculative writes or for read-

ing/caching the current version of locations being speculatively written remotely. Multiple

transactions can speculatively read or write the same location enabling lazy conflict resolution.
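
The intended usage can be sketched in software as follows; pdi_tload, pdi_tstore, and the buffer below are illustrative stand-ins for the TLoad/TStore-style instructions and the speculative L1 lines described in Chapter 4:

    #include <unordered_map>

    // Stand-in for speculatively written (TStored) lines in the private L1.
    static std::unordered_map<int*, int> spec_buf;

    static int pdi_tload(int* p) {
        // We see our own speculative writes; every other thread still
        // sees the committed version of the line.
        auto it = spec_buf.find(p);
        return it != spec_buf.end() ? it->second : *p;
    }

    static void pdi_tstore(int* p, int v) { spec_buf[p] = v; }

    static void pdi_commit() {
        // The hardware performs this as a single flash state change over
        // all speculative lines; here we write them back explicitly.
        for (auto& e : spec_buf) *e.first = e.second;
        spec_buf.clear();
    }

    static void pdi_abort() { spec_buf.clear(); }  // old versions survive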

RTM allows us to experiment with a wide variety of policies for conflict detection, con-

tention management, deadlock and livelock avoidance, and virtualization. RTM falls back to a


software-only implementation of transactions in the event of overflow. Unfortunately, metadata

management imposes significant performance costs since even bounded transactions that use

hardware support need software instrumentation to inter-operate with overflow transactions.

1.1.4 Decoupling Conflict detection from Resolution

In 2008, based on the lessons we learnt from RTM, we proposed FlexTM to minimize the

software bookkeeping overheads of RTM. The key insight that enables policy flexibility is that

information gathering and decision making can be decoupled. RTM left both in software,

while FlexTM employs hardware to gather information and leaves software still in charge of

the policy. FlexTM decouples conflict detection from resolution time, with a monitoring mech-

anism we call conflict summary tables (CSTs). CSTs record the occurrence of conflicts without

necessarily forcing immediate resolution. More specifically, CSTs indicate the transactions that

conflict, rather than the locations on which they conflict. This information concisely captures

what a TM system needs to know in order to resolve conflicts at some later time.

Software can choose when to examine the tables and can use whatever other information it

desires (e.g., priorities) to drive its resolution policy.
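
As a sketch of the data structure and one policy built on it (the 16 hardware contexts and the abort_tx callback are assumptions for illustration):

    #include <bitset>

    constexpr int kContexts = 16;  // assumed number of hardware contexts

    struct CST {
        std::bitset<kContexts> r_w;  // my reads  vs. remote writes
        std::bitset<kContexts> w_r;  // my writes vs. remote reads
        std::bitset<kContexts> w_w;  // my writes vs. remote writes
    };

    // One lazy policy: at commit time, abort every transaction whose
    // writes overlap our accesses, then clear the tables.
    void resolve_at_commit(CST& me, void (*abort_tx)(int ctx)) {
        std::bitset<kContexts> losers = me.r_w | me.w_w;
        for (int i = 0; i < kContexts; ++i)
            if (losers.test(i)) abort_tx(i);
        me.r_w.reset(); me.w_r.reset(); me.w_w.reset();
    }

A priority-based policy would consult the same bitmaps but choose different losers.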

1.2 Software Reliability

1.2.1 Problem and Motivation

Software bugs in production environments lead to as much as 40% of computer system fail-

ures [MaS00]. Modern applications are large software projects and are rapidly evolving with

the collaboration of many developers. For example, the popular Firefox browser involves a large

number of concurrent developers (see Figure 1.2). To sustain this growth, most software sys-

tems employ a modular approach (or plugins) to provide extensibility to the core kernels of

large software systems. For example, internet browsers (e.g., Firefox) support a general plugin interface to enable additional functionality, and the Linux kernel has an elaborate module system for device drivers and kernel subsystems. In most modern shared memory systems, each process

has a separate, linear, demand-paged virtual address space. Most current software systems use


a single address space and link the modules into the same space. Using a single address

space provides fast flexible shared memory based communication but compromises safety. For

example, vulnerabilities in Adobe’s pdf plugin enabled attackers to inject arbitrary code into

the browser [PaF06]. Modular boundaries can only be enforced by programmer convention

and not by the runtime system. Defending against memory corruption and dirty reads requires

inspecting every load, store, and instruction fetch. There is considerable complexity and per-

formance penalty associated with doing this entirely in software due to the cost of instrumenta-

tion [NeS07] and source level modifications [NMW02]. Ideally, we would want the underlying

runtime to sandbox the module and control the accesses by the application’s modules based on

system-specified rules.

Number of concurrent developers on Firefox. X axis: Firefox release years. Y axis: Number of developers.

Source : [Tro10]

Figure 1.2: Concurrency in software development


1.2.2 Our Approach : Sentry

We propose an architectural mechanism for access control that will enable software to track and regulate accesses and thereby deliver a more robust application. We investigate Sentry, a hardware

framework that enables software to enforce protection policies at run time. The core developer

annotates the program to define the policy and then the system ensures the privacy and integrity

of a module’s private data (no external reads or writes), the safety of inter-module shared data

(by enforcing permissions specified by the application), and adherence to the module’s interface

(controlled function call points). From the software’s perspective, Sentry is a pluggable access

control mechanism for application-level fine-grain protection. The key novelty in Sentry is

the lightweight low-cost manner in which permissions are enforced. Sentry’s permissions tags

intercept only L1 misses and reuse the L1 cache coherence states to enforce permissions and

elide checks on L1 hits. Overall, this results in significant savings in the dynamic energy needed

to implement permission checks and provides design flexibility.
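
The placement of the check fits in a few lines; in the sketch below a software map stands in for the hardware M-cache of Chapter 6, and the function runs only on an L1 miss, since a line resident in the L1 was necessarily filled under the current permissions:

    #include <cstdint>
    #include <unordered_map>

    enum Perm { NONE, READ_ONLY, READ_WRITE };

    // Stand-in for the M-cache: cache-line address -> current permission.
    static std::unordered_map<uint64_t, Perm> mcache;

    // Consulted only on an L1 miss; L1 hits bypass the check entirely.
    bool permit_fill(uint64_t line_addr, bool is_write) {
        auto it = mcache.find(line_addr);
        Perm p = (it == mcache.end()) ? NONE : it->second;
        if (p == NONE) return false;                   // trap to software
        if (is_write && p == READ_ONLY) return false;  // illegal write
        return true;  // fill the L1; later hits to this line go unchecked
    }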

1.3 Thesis Contributions

Thesis Statement: As we move into the multicore era, software development is being hindered

by the need to develop and maintain parallel programs. Hardware mechanisms in the memory

hierarchy that provide support for tracking, isolating, and controlling memory accesses can help

software effectively manage program state. The required architectural support can be efficiently

implemented by exploiting the cache subsystem and coherence protocol.

Figure 1.3 provides a pictorial representation of our contributions.

1. We developed Alert-On-Update for selectively monitoring cache coherence events. We

demonstrate the usage of this mechanism in speeding up locks, enabling event-based communication, and accelerating transactions.

2. We developed Programmable-Data-Isolation for allowing multiple software threads to

concurrently issue speculative writes and to effectively control the visibility of writes.

We demonstrate the usage of this mechanism in accelerating versioning in TM.


3. To demonstrate the utility of the proposed hardware mechanisms, we develop two TM

runtime systems. The first, RTM, uses hardware support for only bounded transactions

and reverts to software TM for transactions that overflow hardware resources. The sec-

ond, FlexTM, manages conflicts with very low overhead and adds hardware support for

virtualization. Both RTM and FlexTM support flexible software-based policies for con-

flict resolution and management.

4. We developed Sentry, a lightweight access control framework that can be used for im-

proving software reliability by setting up fine-grain protection domains to encapsulate

the modules in an application.

[Figure 1.3 arranges the contributions in three layers: applications on top (synchronization, debugger, transactional memory, thread-level speculation, software safety), the proposed hardware mechanisms in the middle (Monitoring via Alert-On-Update, Isolation via Programmable Data Isolation, Protection via Sentry), and the shared-memory multicore system at the bottom.]

Top Row: Software application case studies developed in the thesis. Middle Row: Proposed hardware mechanisms. Bottom Row: System

Figure 1.3: Thesis contributions

1.4 Dissertation Structure

Chapter 2 provides background on the work in this dissertation. It presents an overview of trans-

actional memory and discusses the support required for both small and large transactions. It also

provides an overview of access control mechanisms and their usage in debugging and software

protection. Chapter 3 presents our monitoring mechanism, Alert-On-Update, and demonstrates

its application in speeding up locks, improving TM performance, and debugging data races.

Chapter 4 describes Programmable-Data-Isolation (PDI) and includes details on the additions

required to the coherence protocol. It also compares and contrasts the techniques that can be

used to virtualize hardware resources. Finally, it demonstrates the use of PDI in RTM and


FlexTM. In Chapter 5, we use the FlexTM framework to study the various conflict management

strategies in hardware-based TMs. We also suggest policies that various TM systems should

adopt based on application characteristics. Chapter 6 deals with Sentry and includes details on

the hardware and software framework. It also showcases applications of Sentry in protecting

a webserver and in watchpoint-based debugging. Chapter 7 concludes with a discussion on

possible directions for future research.


Chapter 2

Background

In this dissertation, we tackle problems with concurrency at two levels: managing the concur-

rency that arises out of multiple threads in a program and managing the concurrency present

in a multi-programmer, multi-modular application. In Section 2.1, we provide an overview of

Transactional Memory (TM), which seeks to enable programmers to tackle the concurrency in

programming. We commence with a discussion of the hardware support proposed for

small transactions, discuss the software and hardware techniques to virtualize transactional re-

sources, and finally conclude with a classification of proposed TM systems. In Section 2.2, we

discuss the use of fine-grain access control to address the challenges of safety in modularized

applications with concurrent developers. We then discuss access control mechanisms proposed

in both commercial and research prototypes, and describe access control in the context of mem-

ory protection and debugging.

2.1 Concurrency in Software Execution: Transactional Memory

For more than 40 years, Moore’s Law has packed twice as many transistors on a chip every

18 months. Between 1974 and 2004, hardware vendors used those extra transistors to equip

their processors with ever-deeper pipelines, multi-way issue, aggressive branch prediction, and

out-of-order execution, all of which served to harvest more instruction-level parallelism (ILP).

Because the transistors were smaller, vendors were also able to dramatically increase the clock


rate. All of that ended in the 2000s, when microarchitects ran out of independent things to do

while waiting for data from memory, and when the heat generated by faster clocks reached the

limits of fan-based cooling. Future performance improvements must now come from multicore

processors, which depend on thread-level parallelism.

Sadly, parallel programming is hard. Historically it has been limited mainly to servers, with

“embarrassingly parallel” workloads, and to high-end scientific applications, with enormous

data sets and enormous budgets. Applications need to set up synchronization on access to shared

state. For this, programmers have traditionally relied on mutual exclusion locks, but these suffer

from a host of problems, including the lack of composability (one cannot nest two lock-based

operations inside a new critical section without introducing the possibility of deadlock) and

the tension between concurrency and clarity: Coarse-grain lock-based algorithms are relatively

easy to understand (grab the One Big Lock, do what needs doing, and release it), but they

preclude any significant parallel speedup; Fine-grain lock-based algorithms allow independent

operations to proceed in parallel, but they are notoriously difficult to design, debug, maintain,

and understand. Transactional Memory (TM) aims to simplify synchronization by raising the

level of abstraction for critical sections.

2.1.1 Transactional Memory in a Nutshell

With transactions, software has only to mark a block of code as “atomic”; the underlying system

then promises to execute the block in an “all-or-nothing” manner isolated from similar blocks

(transactions) in other threads. The implementation is typically based on speculation: it guesses

that transactions will be independent and executes them in parallel, but watches their memory

accesses just in case. If a conflict arises (two concurrent transactions access the same location,

and at least one of them tries to write it), the implementation aborts one of the contenders, rolls

back its execution, and restarts it at a later time. In some cases it may suffice to delay one of the

contending transactions, but this does not work if, for example, each transaction tries to write

something that the other has already read.

While TM systems vary in how they handle various subtle semantic issues, all are based

on the notion of serializability: regardless of implementation, transactions appear to execute


Thread 1                            Thread 2

lock(hash_tab.mutex)                lock(hash_tab.mutex)
var = hash_tab.lookup(X);           var = hash_tab.lookup(Y);
if (!var)                           if (!var)
    hash_tab.insert(X);                 hash_tab.insert(Y);
unlock(hash_tab.mutex)              unlock(hash_tab.mutex)

Figure 2.1: Loss of parallelism due to locks. [RaG01]

in some global serial order. The writes by transaction A must never become visible to other

transactions until A commits, at which time all of its writes must be visible. Moreover, writes

by other transactions must never become visible to A partway through its own execution, even

if A is doomed to abort (otherwise A might perform some logically impossible operation with

externally visible effects). Some TM systems relax the latter requirement by sandboxing A so

that any erroneous operations it may perform do no harm to the rest of the program.

The principal motivation for TM is to simplify the synchronization of state in parallel pro-

gramming. In some cases (e.g., if transactions are used in lieu of coarse-grain locks), it may

also lead to performance improvements. An example appears in Fig. 2.1: if X≠Y, it is likely

that the critical sections of Threads 1 and 2 can execute safely in parallel. Because locks are a

low-level mechanism, they preclude such execution. TM, however, allows it. If we replace the

lock...unlock pairs with atomic{...} blocks, the typical TM implementation will execute the

two transactions concurrently, aborting and retrying one of the transactions only if they actually

conflict.
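
Written with an atomic block, the hash-table idiom of Figure 2.1 needs no lock; the spelling below uses GCC's experimental -fgnu-tm extension as one concrete realization of the abstract atomic{...} construct, with a simple open-addressed table so that every statement in the block is transaction-safe:

    constexpr int kSlots = 1024;
    static int keys[kSlots];          // 0 marks an empty slot

    void insert_if_absent(int key) {  // assumes key != 0
        int slot = key % kSlots;
        __transaction_atomic {
            // All-or-nothing: threads inserting different keys commit in
            // parallel; the runtime aborts and retries only on a conflict.
            while (keys[slot] != 0 && keys[slot] != key)
                slot = (slot + 1) % kSlots;
            if (keys[slot] == 0)
                keys[slot] = key;
        }
    }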

TM can be implemented in hardware, in software, or in some combination of the two.

Software-only implementations have the advantage of running on legacy machines, but it is

widely acknowledged that performance competitive with fine-grain locks will require hardware

support [SAH06; YBM07; CBM08]. This section aims to describe what the hardware might

look like, and briefly describe their policy choices. Section 2.1.3 describes several ways in

which brief, small-footprint transactions can be implemented entirely in hardware. Section 2.1.4

considers extension to transactions that overflow on-chip hardware resources, or must survive a

context switch.


2.1.2 Implementation

Any TM implementation based on speculation must perform the following three tasks: it must

(1) detect and resolve conflicts between transactions executing in parallel; (2) keep track of

both old and new versions of data that are modified speculatively; and (3) ensure that running

transactions never perform erroneous, externally visible actions due to an inconsistent view of

memory.

Many researchers (e.g., [MSS05; MBM06]) have conceived conflict resolution to be either

eager or lazy. An eager system detects and resolves conflicts as soon as a pair of transactions

have performed (or are about to perform) operations that preclude committing them both. A

lazy system delays conflict resolution (and possibly detection as well) until one of the trans-

actions is ready to commit. The losing transaction L may abort immediately or, if it is only

about to perform its conflicting operation (and has not done so yet), it can wait for the winning

transaction W to either abort (in which case L can proceed) or commit (in which case L may

be able to serialize after W in logical order).

Lazy conflict resolution exposes more concurrency by permitting both transactions in a

pair of concurrent read-write conflicting transactions to commit so long as the reader commits

(serializes) before the writer. Lazy conflict resolution also helps in ensuring that the conflict

winner is likely to commit: if we defer to a transaction that is ready to commit, it will generally

do so, and the system will make forward progress. Eager conflict resolution avoids investing

effort in a transaction L that is doomed to abort, but may waste the work performed so far if

it aborts L in favor of W and W subsequently fails to commit due to conflict with some third

transaction T. Recent work suggests that eager management is inherently more performance-

brittle and livelock-prone than lazy management [SDM09; ShD09]. The performance of eager

systems can be highly dependent on the choice of contention management (arbitration) policy

used to pick winners and losers, and the right choice can be application-dependent [ScS05].

Version management typically employs either direct update, in which speculative values are

written to shared data immediately, and undone on abort using an undo-log, or deferred update,

in which speculative values are written to a log and redone (written to shared data) on commit.

Direct update may be somewhat cheaper if—as we hope—transactions commit more often than


they abort. Systems that perform lazy conflict resolution, however, must generally use deferred

update, to enable parallel execution of (i.e., speculation by) conflicting writers.
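
To make the two strategies concrete, here is a minimal C++ sketch of a transactional store under each scheme; the function names and the single-threaded log structures are illustrative assumptions, not any published system's interface:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct UndoEntry { uintptr_t* addr; uintptr_t old_val; };
    std::vector<UndoEntry> undo_log;                     // direct update
    std::unordered_map<uintptr_t*, uintptr_t> redo_log;  // deferred update

    // Direct update: write in place now; remember the old value for abort.
    void tx_store_direct(uintptr_t* addr, uintptr_t val) {
        undo_log.push_back({addr, *addr});   // save the old version
        *addr = val;                         // speculative value visible in place
    }
    void tx_abort_direct() {                 // walk the log backward, restoring
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it)
            *it->addr = it->old_val;
        undo_log.clear();
    }

    // Deferred update: buffer the new value; copy it back only on commit.
    void tx_store_deferred(uintptr_t* addr, uintptr_t val) {
        redo_log[addr] = val;                // shared data untouched until commit
    }
    void tx_commit_deferred() {              // must appear atomic to other threads
        for (auto& e : redo_log) *e.first = e.second;
        redo_log.clear();
    }

The sketch makes the cost asymmetry visible: direct update does extra work only on abort, while deferred update does extra work on every commit.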

2.1.3 Hardware support for small transactions

On modern processors, locks and other synchronization mechanisms tend to be implemented

using compare-and-swap (CAS) or load-linked/store-conditional (LL/SC) instructions. Both

of these options provide the ability to read a single memory word, compute a new value, and

update the word, atomically. Transactional memory was originally conceived as a way to extend

this capability to multiple locations.

Herlihy & Moss [HeM93] The term “transactional memory” was coined by Herlihy and

Moss in 1993. In their proposal (“H&M TM”), a small “transactional cache” holds specula-

tively accessed locations, including both old and new values of locations that have been written.

Conflicts between transactions appear as an attempt to invalidate a speculatively accessed line

within the normal coherence protocol, and cause the requesting transaction to abort. A transac-

tion commits if it reaches the end of its execution while still in possession of all speculatively

accessed locations. A transaction will always abort if it accesses more locations than will fit in

the special cache, or if its thread loses the processor due to preemption or other interrupts.

Oklahoma Update [SSH93] In modern terminology, H&M TM called for eager conflict res-

olution. A contemporaneous proposal by Stone et al. envisioned lazy resolution, with a conflict

detection and resolution protocol based on two-phase commit. Dubbed “Oklahoma Update”

(after the Rodgers and Hammerstein song "All er Nothin' "), the proposal included a novel solu-

tion to the doomed transaction problem: as part of the commit protocol, an Oklahoma Update

system would immediately restart any aborted competing transactions, by branching back to a

previously saved address. By contrast, H&M TM required that a transaction explicitly poll its

status (to see if it were doomed) prior to performing any operation that might not be safe in the

wake of inconsistent reads.


AMD ASF [DiH08] Recently, researchers at AMD have proposed a multiword atomic update

mechanism as an extension to the x86-64 instruction set. Their “Advanced Synchronization

Facility” (ASF), though not a part of any current processor roadmap, has been specified in

considerable detail. As H&M TM does, it uses eager conflict resolution, but with a different

contention management strategy: where H&M TM resolves conflicts in favor of the transaction

that accessed the conflicting location first, ASF resolves it in favor of the one that accessed it last.

This “requester wins” strategy fits more easily into standard invalidation-based cache coherence

protocols, but may be somewhat more prone to livelock. As Oklahoma Update does, ASF

includes a provision for immediate abort. Most importantly, ASF provides a strong progress guarantee: a transaction that does not write more than four unique locations will eventually commit.

Sun Rock [TrC08] Prior to Oracle’s acquisition of Sun, the next generation UltraSPARC

processor [TrC08] included a thread-level speculation (TLS) mechanism that could be used to

implement transactional memory. Like H&M TM and ASF, Rock uses eager conflict manage-

ment; it resolves conflicts in favor of the requester. Like Oklahoma Update and ASF, it provides

immediate abort. In a significant advance over these systems, however, Rock implements true

processor checkpointing; on abort, all processor registers revert to the values they held when

the transaction began. Moreover, all memory accesses within the transaction (not just those

identified by special load and store instructions) are considered speculative.

Stanford TCC [HWC04] While still limited (in its original form) to small transactions, the

Transactional Coherence and Consistency (TCC) proposal of Hammond et al. represented a

major break with traditional concepts of memory access and communication. Where traditional

threads (and processors) interact via individual loads and stores, TCC expresses all interaction

in terms of transactions.

Like the multilocation commits of Oklahoma Update, TCC transactions are lazy. Individual

writes within the transaction are delayed (buffered) and propagated to the rest of the system in

bulk at commit time. Commit-time conflict detection and resolution employs either a central

hardware arbiter or a distributed two-phase protocol. As in Rock, doomed transactions suffer


an immediate abort and roll back to a processor checkpoint.

Discussion A common feature of the systems described in this section is the careful leveraging

of existing hardware mechanisms. Eager systems (H&M TM, ASF, and Rock) leverage existing

coherence protocol actions to detect transaction conflicts. In all five systems, hardware avoids

most of the overhead of both conflict detection and versioning. At the same time, transactions

in all five can abort simply because they access too much data (overflowing hardware resources)

or take too long to execute (suffering a context switch). While the systems differ in both the

eagerness of conflict resolution and choice of conflict winner, in all cases these policy choices

are embedded in the hardware; they cannot be changed in response to programmer preference

or workload characteristics.

2.1.4 Unbounded transactions

Small transactions are not sufficient if TM becomes a generic programming construct that can

interact with other system modules (e.g., file systems and middleware) that have much more

state than the typical critical section. It also seems unreasonable to expect programmers to

choose transaction boundaries based on hardware resources. While the usage model for “un-

bounded transactions” is not proven yet, we do not want to discourage their deployment. Hence,

what is needed are low overhead “unbounded” transactions that hide hardware resource limits

and persist across system events (e.g., context switches, system calls, and device interrupts).

To support unbounded transactions, a TM system must virtualize both conflict detection and

versioning. In both cases, the obvious strategy is to move transactional state from hardware to

a metadata structure in virtual memory. Concrete realizations of this strategy vary in hardware

complexity, degree of software intervention, and flexibility of conflict detection and contention

management policy.

Transactional memory is a very active area of research. Harris, Larus, and Rajwar [HLR10] provide

an excellent summary up to Fall 2010. We first categorize the design space for both versioning

and conflict detection mechanisms. Hardware-based TMs need to address two requirements: (1)

a conflict detection mechanism to track concurrent accesses and conflicts for locations evicted

out of the caches and coherence framework and (2) a versioning mechanism to maintain the


new and old values of data for an unbounded number of locations. The implementation for

these tasks is governed by performance targets, the conflict resolution policy supported, and the

hardware complexity.

Conflict Detection

The implementation choices can be broadly classified as:

Software Instrumentation: To handle large transactions, TM runtime systems can

augment a program with instructions that read and write some sort of metadata. If pro-

gram data are read more often than written (as is often the case), it is generally undesirable

for readers to modify metadata, since that tends to introduce high performance overhead.

As a result, readers are invisible to writers in most STM systems, and bear full responsi-

bility for detecting conflicts with writers. This task is commonly rolled into the problem

of validation—ensuring that the data read so far are mutually consistent.

State-of-the-art STM systems perform validation on every nonredundant read [SMP08;

SMS09; DSS06]. The supporting metadata vary greatly: In some systems, a reader in-

spects a modification timestamp or writer (owner) id associated with the location it is

reading. In other systems, the reader inspects a list of Bloom filters that capture the

write sets of recently committed transactions [SMP08]. In addition to the instrumentation overhead that limits gains from concurrency, software instrumentation adds to cache pressure, since each memory access touches additional metadata. Overall, transactions with software instrumentation can experience slowdowns on the order of 2–3× compared to transactions that have hardware support.

Hardware Acceleration: Hardware support can remove the bottlenecks associated with soft-

ware instrumentation by using hardware metadata to track accesses. There are trade-

offs associated with different tracking hardware: Bloom-filter-based signatures [CTC06; YBM07] can concisely summarize a large set of addresses with a fixed amount of space. Bloom filters require support only at the processor but are prone to performance bugs induced by false positives. Error-checking codes [BGH08] are extra metadata that are stored


along with data blocks. Such metadata can also be used to encode transactional metadata

and are precise. Unfortunately, this requires modifications to the various caches in the

memory hierarchy and requires the coherence protocol to interact with off-chip metadata.

Metadata added to cache tags are simple to implement but have space limitations; they

either require a software algorithm [DFL06] or a complex hardware controller [AAK05]

to handle cache evictions.

Virtual Memory: OS page tables include protection bits in each per-page entry to implement process isolation. A TM runtime can exploit these protection bits to set up read-only and read/write permissions at page granularity and trap concurrent accesses to detect conflicts [CNV06; CMM06]. The major performance overheads are the TLB shootdowns and OS intervention required to modify the protection bits, which occur frequently, on every nonredundant access within a transaction. A sketch of this mechanism appears below.
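
As an illustration, the following user-level POSIX sketch traps the first access to a protected page via mprotect and SIGSEGV. Real proposals such as XTM and PTM perform the equivalent work inside the OS; calling mprotect from a signal handler is common practice on Linux but not formally async-signal-safe, so this is illustrative only:

    #include <csignal>
    #include <cstdint>
    #include <cstdio>
    #include <sys/mman.h>
    #include <unistd.h>

    static void on_fault(int, siginfo_t* si, void*) {
        uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(getpagesize() - 1);
        // ...record the access / check for conflicts here...
        mprotect((void*)page, getpagesize(), PROT_READ | PROT_WRITE);
    }

    int main() {
        struct sigaction sa = {};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, nullptr);

        size_t sz = getpagesize();
        char* p = (char*)mmap(nullptr, sz, PROT_NONE,      // no access: trap on touch
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        p[0] = 1;                    // faults once; the handler restores permissions
        printf("%d\n", p[0]);
        munmap(p, sz);
        return 0;
    }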

Versioning

The conflict resolution policy critically governs the choice of versioning mechanism. A lazy policy allows concurrent transactions to read or write a shared location, necessitating a redo-log to avoid irreversible actions, while an eager policy detects conflicts prior to the access and can therefore accommodate either form of logging. The undo-log is used to restore values if

a transaction aborts while the redo-log is used to copy-update the original locations on commit;

these actions need to occur in an atomic manner for all the locations in the log. Most impor-

tantly, since a redo-log buffers new values, it needs to intervene on all other accesses to check

if it needs to supply the data; this dictates the data structure used to maintain the new values

(typically a hash table). An undo-log approach can make do with a simpler data structure (e.g.,

dynamically resizable array or vector) and typically does not need to optimize the access cost

since it is traversed only on an abort.
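
The read path makes this asymmetry concrete; the sketch below reuses the illustrative redo_log from the earlier versioning sketch:

    #include <cstdint>
    #include <unordered_map>

    extern std::unordered_map<uintptr_t*, uintptr_t> redo_log;

    uintptr_t tx_load_deferred(uintptr_t* addr) {
        auto it = redo_log.find(addr);       // hash lookup on *every* read, to see
        return it != redo_log.end()          // whether we must supply our own
             ? it->second : *addr;           // buffered value
    }

    uintptr_t tx_load_direct(uintptr_t* addr) {
        return *addr;                        // in-place data is already current
    }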

Similar to conflict detection, versioning can be implemented either with (1) Software han-

dlers, (2) Hardware acceleration, or (3) Virtual memory (i.e., translation information in the

page tables). The performance and complexity tradeoffs are similar to conflict detection. The

software approach adds handlers to all writes (to set up the log data structures) and possibly


to all reads (needed to pass values if they are buffered in a redo-log) and leads to significant

degradation in performance. The hardware approach adds significant complexity, including

new state machines that interact in a non-trivial manner with the existing memory hierarchy.

The VM approach reuses existing hardware and OS support, but suffers the performance over-

heads of having to perform page granularity cloning and buffering. An important difference be-

tween the mechanisms that implement versioning and conflict detection is that versioning deals

with data values (no false positives or negatives) and cannot trade off precision for complexity-

effectiveness like conflict detection (e.g., signatures).

2.1.5 Classifying proposed TM systems

This dissertation primarily focuses on the hardware support required for accelerating TMs. We

refer readers interested in STM systems to Michael Spear's Ph.D. dissertation [Spe09]; we use the

term STM to refer to TM systems that rely only on software instrumentation to implement the

TM runtime. In this section, we study the design of proposed hardware TM systems. Table 2.1

lists the mechanisms used by various hardware-based TMs, based on the classification scheme

discussed previously. We specify three features of each TM system: the conflict resolution

supported, the type of conflict detection mechanism, and the versioning scheme.

Table 2.1: Virtualization in TM

System                Conflict Resolution   Conflict Detection   Versioning
UTM [AAK05]           Eager                 H (controller)       H (undo-log)
VTM [RHL05]           Eager                 H (microcode)        S (redo-log)
LogTM-SE [YBM07]      Eager                 H (signature)        H (undo-log)
XTM [CMM06]           Lazy (Eager?)         VM                   VM (redo-log)
PTM-Select [CNV06]    Eager                 H (controller)       VM (undo-log)
TokenTM [BGH08]       Eager                 H (ECC)              H (undo-log)
Hybrid Systems
SigTM [MTC07]         Eager (Lazy?)         H (signature)        S (redo-log)
UFO TM [BNZ08]        Eager                 H (ECC)              S (undo-log)
HyTM [DFL06]          Eager                 S                    S (undo-log)

H = Hardware Acceleration, S = Software Instrumentation, VM = Virtual Memory.
? = Can support this conflict resolution but limits policy choices.
Hybrids (other than SigTM) use a best-effort HTM for small transactions.

The Bulk system of Ceze et al. [CTC06] decouples conflict detection from cache tags by

summarizing read/write sets in Bloom filter signatures [Blo70]. To commit, a transaction broad-

casts its write signatures, which other transactions compare to their own read and write signa-


tures to detect conflicts. Conflict management (arbitration) is first-come-first-served, and re-

quires global synchronization in hardware to order commit operations.

LogTM-SE [YBM07] integrates the cache-transparent eager versioning mechanism of

LogTM [MBM06] with Bulk style signatures. It supports efficient virtualization (i.e., con-

text switches and paging), but this is closely tied to eager versioning (undo logs), which in turn requires eager conflict resolution to avoid inconsistent reads. Since LogTM does not allow transactions to abort one another, it is possible for running transactions to "convoy" behind a suspended transaction. Like LogTM-SE, TokenTM [BGH08] uses undo-logs to implement versioning but implements conflict detection using a hardware token scheme.

UTM [AAK05] and VTM [RHL05] both perform lazy versioning using virtual memory. On

a cache miss in UTM, a hardware controller walks an uncacheable in-memory data structure that

specifies access permissions. VTM employs tables maintained in software and uses software

routines to walk the table only on cache misses if an overflow signature indicates that the block

has been speculatively modified. Like LogTM, both VTM and UTM require eager resolution

of conflicts.

Both XTM [CMM06] and PTM [CNV06] use the virtual memory mechanisms (i.e., protec-

tion and translation) present in existing operating systems to enable TM virtualization. These

virtual memory mechanisms are coarse-grained and add significant performance overhead to the

TM runtime.

Hybrid TMs [DFL06; KCH06] allow hardware to handle common-case bounded transac-

tions and fall back to software for transactions that overflow time and space resources. Hybrid

TMs must maintain metadata compatible with the fallback STM and use policies compatible

with the underlying HTM. SigTM [MTC07] employs hardware signatures for conflict detec-

tion but uses an (always on) TL2 [DSS06] style software redo-log for versioning. Like hybrid

systems, it suffers from per-access metadata bookkeeping overheads. It restricts conflict man-

agement policy (specifically, only self aborts) and requires expensive commit time arbitration

on every speculatively written location.


2.1.6 Our Approach : Flexible Transactional Memory

Published papers [MSS05; ScS05; SDM09; ShD09; BMV07] reveal performance differences

across applications of ≈10× in each direction for different approaches to contention manage-

ment, and eagerness of conflict detection (i.e., write-write sharing). It is clear that no one knows

the “right” way to do these things; it is likely there is no one right way. We propose, therefore,

that hardware serve simply to optimize the performance of transactions that are controlled fun-

damentally by software. This allows us, in almost all cases, to cleanly separate policy and

mechanism. The former is the province of software, allowing flexible policy choice; the latter

is supported by hardware in cases where we can identify an opportunity for significant perfor-

mance improvement. We present two TM systems, RTM and FlexTM that embody our design

principles.

2.2 Concurrency in Software Development :

Fine-grain Access Control

In this section, we discuss the approaches to address the challenges associated with collabora-

tive multi-programmer, multi-module software projects. Modern software systems are complex artifacts

consisting of millions of lines of code written by many developers. Developing correct, high

performance, and reliable code has thus become increasingly challenging. As one example,

consider the development framework of the Apache webserver. The system designers define a

software interface that specifies the set of functions and data that are private and/or exported to

other modules. For the sake of programming simplicity and performance, current implementa-

tions of Apache run all modules in a single process and rely on adherence to the module API to

enforce protection. A bug or safety problem in any module could potentially (and does) affect

the whole application.

The prevalence of multicore processors, resulting in the need for multiple threads of control

in order to harness the available compute power, has increased the burden on software develop-

ers. Fine-grain intra- and inter-thread interactions via memory make it difficult for developers

to track, debug, and validate the accesses arising from the various software modules. Architectural features for access control that enable software to track and regulate accesses, and thereby deliver a more robust application, are highly desirable. We begin with a discussion of access control for software protection.

2.2.1 Modern processors : Paging and Segmentation

Page-based protection Page-based protection, introduced in 1961 [BCD72], has been adopted by nearly every major OS and hardware vendor. Essentially, each user-process

has a separate, linear address space which also represents a unique protection domain (a pro-

cess). Every thread in the system belongs to only a single process for its lifetime. Furthermore,

pages also represent the minimum granularity of sharing between address spaces and only sup-

port coarse-grain protection. This design makes it difficult to set up protected sharing and to perform access control between the various modules in an application. Pages can have different permissions only when they map to separate address spaces. Even then, all words in the

page need to have the same permissions for a process. Further, if data pointers are passed across

processes, both processes need to map the shared page to the same virtual address region in their

respective address spaces or require the use of non-trivial pointer swizzling techniques [Wil92].

Finally, the overhead of context switches and OS intervention required for inter-process communication imposes significant limits on their usage.

Improvements to page-based protection Some architectures provide support for pages

shared between groups of processes. UltraSPARC and MIPS both tag TLB entries with ASIDs

(address space identifiers). A group of pages shared between processes can share an ASID (and

TLB entry). A key restriction is that every process must see the same permissions for each shared page.

Efforts were made to separate the protection mechanisms from the address space by the

proposal of a protection lookaside buffer [KCE92] and protection identifiers in HP’s PA-

RISC [WiS92]. Such systems specify the permissions for a tag (as opposed to a page) and then

associate the tag with each page that expects the same type of permissions. This introduces an

extra level of indirection when looking up the permissions. Bulk permission changes, however, can be accomplished easily, since changing a tag's permissions effectively updates the permissions of all pages associated with the tag. These systems require that a collection of pages shared between the various processes have a fixed set of permissions, but each process could choose whether to subscribe to those fixed permissions. Another key limitation is coarse granularity: all designs are confined to page-granularity (typically 4/8 KB) protection management and cannot provide the finer granularity of control required for the protection needs of today's modularized software.

Segmentation Segmentation was first introduced in the Burroughs machines [HaD68] as a technique to manage the virtual memory space as variable-granularity regions. It can also support fine-grain protected sharing amongst processes. Segments are base-and-bounds descriptors of variable-granularity memory regions. Every process (or address space) is described using a table of segments, and every instruction or data access in the processor refers to this table for the protection information. Segments (like pages) can be shared between processes. Segment-based protection is more flexible than page-based protection since it can describe variable, fine-granularity regions. There are two main drawbacks of segments: the hardware structures required to cache segment information are complex, since lookups need to check whether an address falls within a segment region, and cross-segment data and instruction accesses require special instructions, which exposes hardware limits to applications.

Segments are well suited to describing coarse-grain, mutually exclusive regions that do not change often. This feature has been exploited to enable sandboxing between modules of an application [CVP99]. While this enables separation of the modules and isolation of faults, it is hard to enable protected sharing of data between modules. In most cases, the applications have

to be modified significantly to take advantage of segments.

2.2.2 Research Prototypes : Mondrian and Loki

Recently, Mondriaan [WCA02] decoupled protection from a conventional paging framework

and implemented it using segments. An application's address space is described by a collection of variable-sized segments, each capable of supporting byte granularity. This flexibility

comes at the cost of additional hardware and operating system modifications. This work re-

places the existing protection framework (TLB) with a new permissions-lookaside-buffer (PLB)


that checks all accesses in the pipeline and needs add-ons (e.g., sidecar registers) to reduce the

performance overheads. Furthermore, it introduces new layers in the operating system to imple-

ment all protection (intra-process and inter-process) based on the PLB approach. This requires

every application to communicate its policy to the low-level supervisor.

Loki [ZKD08] adopted a different tagging strategy, choosing to tag physical memory with

labels that further map to a set of permission rules. Loki allows system software to translate

application security policies into memory tags. Threads must have the corresponding entry

in the permissions cache in order to be granted access permission. If the threads need to be

segregated into separate domains, then this would require software support to convert inter-

thread function calls into cross-domain call gates. Permission cache modifications must be

performed within the operating system. Permission revocation in the case of page swapping would require a sweep of all processes' and threads' permission caches.

2.2.3 Capabilities

Capability systems [CKD94; Lev84; CCL81; SSF99] augment object reference pointers to in-

clude information on access control. The capabilities shared between threads are marshaled

by the OS and can support generalized protection models. Typical capability implementations

change the layout of memory and fatten pointers. Software developers need to be aware of the

modified layout and typically need code rewrites, which lessens their appeal.

More importantly, the relatively large management cost for a capability (e.g., when revok-

ing access rights) makes it ill-suited for protecting fine-grain data elements. Rather, typical

capability-protected objects are external resources like files and printers, or memory segments

at page or larger granularity.

2.2.4 Tagged Memory

The memory hierarchy in tagged architectures [Feu73] carries a few additional bits per data block to store metadata. The hardware typically manipulates the tags in parallel with the data and calls into software if needed. Tagged architectures have been used in the past for setting up access control for type safety [ScT89], supporting capabilities [Lev84], and


debugging. The IBM System/360 [GiS87] tagged memory locations with 4-bit protection keys.

Only processes that owned specific protection keys could access the data and unauthorized ac-

cesses would raise an exception. While general purpose tagged architectures have not penetrated

mass market processors, there has been recent interest in exploiting them for access control for

debugging, garbage collection, and watchpoints.

2.2.5 Software-based Protection

Several approaches specifically target protection within the operating system, but lack the flex-

ibility to support application-level protection. For instance, SPIN takes advantage of type-safe

languages in constructing safe operating system extensions [BSP95]. The required use of certain

type-safe languages is too restrictive for general application development. As another example,

Nooks manages specific kernel to device driver interactions to guard against bugs [SBL03]. Un-

fortunately, these schemes require programmers to modify the applications and in some cases

change the programming model.

2.2.6 Access Control for Debugging

Hardware support for debugging is an active research area [QLZ05; ZQL04; TAC08; VRS07],

which primarily focuses on reducing the performance overhead of the access control mecha-

nism. These proposals can conveniently ignore the manageability concerns since debuggers

are typically deployed in the development phase. Space overheads are also inherently minimal

since debuggers are interested in only a limited subset of locations. Given that the number of

locations is small, debuggers typically try to provide word-granularity watchpoints.

When used for debugging applications, the space and performance overheads are directly

proportional to the number of locations being watched. Prior proposals mainly focus on reduc-

ing the overhead of intercepting memory operations and manipulating some debugger-specific

metadata. This capability is sufficient to detect a variety of memory bugs [NeS07].

The main commonality among the various acceleration proposals is the software-transparent

interception of memory accesses and the use of hardware state bits to track the watchpoint

metadata. A common feature of all these works is that they intercept all loads and stores in


the processor pipeline. They also share the metadata among different threads, which makes it

difficult to set up thread-specific access control. The main differences arise in the hardware

implementation: Bloom filters [TAC08], additional cache-tag bits [NeZ07], ECC bits [ZQL04; QLZ05], or separate lookaside structures [VRS07], which have varying levels of false-positive

and coverage characteristics. They also vary in whether the hardware semantics are hardcoded

for a specific tool [ZLL04] or support a general-purpose state machine [VRS07].

2.2.7 Our Approach: Sentry

While we share the goals of Mondriaan and Loki to allow more flexible protection models, Sen-

try employs an auxiliary protection architecture. Specifically, Sentry’s hardware is implemented

entirely outside the processor core subordinate to the existing TLB mechanism. Sentry’s design

is based on the key observation that permission checks do not change often; we intercept only

L1 misses and completely eliminate permission checks for L1 hits. Sentry intercepts a minimal

number of accesses, which enables us to save energy for the permission checks. Compared

to Mondriaan and Loki, Sentry reduces the changes required to the core hardware and oper-

ating system software. It enables flexible, low-cost protection within individual applications.

No changes are made to the existing process-based protection and the intra-process protection

models may be implemented at the user level. Finally, it incurs space or time overhead only

when auxiliary intra-application protection is needed.


Chapter 3
Monitoring: Alert-On-Update

In this chapter, we begin with a description of shared memory multicore processors. We then

introduce the design space of mechanisms to monitor memory accesses and classify existing

monitoring mechanisms (Section 3.2.1). Section 3.3 describes the Alert-On-Update (AOU)

mechanism, its implementation, and the type of memory system events that can be monitored.

We demonstrate AOU's generality by using it to accelerate STMs (Section 3.5), improve locks (Section 3.6), and detect concurrency bugs (Section 3.7).

3.1 Introduction

Multicore processors are typically based on the shared memory paradigm. A conventional

shared memory multicore processor partitions the physical memory across multiple memory

banks and includes multiple levels of low latency private caches for each processor backed up

by a large shared cache. Shared memory systems employ additional hardware to provide the

programmer the illusion of a single unified address space. These systems partition the physical

address space into blocks (i.e., cache lines) and manage caching and block transfers.

Allowing processors to cache memory blocks means that a logically unique single block of

physical memory may reside in multiple physical locations scattered across the cache levels.

A coherence protocol is required to define the interaction between the processors, the various

caches, and memory. The coherence protocol controls the block permissions when the various


system components read or write the contents of a memory block. The invalidation scheme

is the dominant mechanism for implementing cache coherence protocols. Hardware-based co-

herence protocols enforce the following invariants: (1) only one processor in the system may obtain write permission to a cache block at a time, and (2) more than one processor is allowed to obtain a copy for reading, with all such copies implicitly invalidated on a write. The hardware needs to track the cached copies of a memory block to efficiently satisfy processor accesses and generate invalidations. This tracking is typically implemented using either a snoopy bus broadcast

[ArB84] or through a directory-based coherence [ASH88] mechanism.

3.2 Current Monitoring Mechanisms

The introduction of multicore processors has increased the burden on software developers: multiple threads of control are needed to harness the available compute power. Fine-grain

intra-thread events and inter-thread interactions via memory make it difficult for developers

to comprehend and debug programs. Hence, there is a need to utilize at least some of the extra

transistors afforded by Moore’s law toward architectural features that help monitor programs

efficiently. Monitoring a process’s memory accesses, whether for passive observation or active

regulation of reads and writes, forms the basis of many program analysis techniques, debugging

tools, and sandboxing mechanisms.

Broadly defined, a Memory Monitoring mechanism tracks shared memory accesses that

arise from the various threads and possibly informs software about accesses to locations specif-

ically marked by software. Current general purpose processors incorporate three forms of hard-

ware support for memory monitoring: a memory management unit (TLB) that combines address

translation with access control for pages, watchpoint registers that support access monitoring

for words, and performance counters that passively count memory events. The TLB is managed by heavyweight OS routines and affords only coarse-grain access monitoring. Watchpoint registers

are very few in number (e.g., 4 on the x86), which deters widespread use. Finally, performance

counters typically summarize/count only local events (e.g., cache misses).

It is possible to implement memory monitoring entirely in software by instrumenting mem-

ory accesses. Instrumentation can be set up with either compiler support [NMW02] or dynamic


runtime instrumentation [NeS07]. Unfortunately, static instrumentation requires programmer effort to annotate programs and requires access to source code. Dynamic instrumentation typically adds significant overhead due to the lack of type information about the data structures and the runtime interposition needed on the entire program. Software-based instrumentation has

been widely employed in tools for detecting concurrency bugs [SBN97; Cor] and security

violations [CPM98], although reducing the performance overheads with hardware support is

appealing even in such cases.

3.2.1 Design Space

In this section, we describe our design space for classifying memory monitoring mechanisms

and study the monitoring mechanisms of current processors. The typical software environment

employs multiple threads of control, each of which issues sequences of memory accesses (loads

and stores). We use the term “local” to refer to the thread performing the access and “remote”

to refer to other threads in the system with potentially concurrent reads and writes to the same

location.

1. Access visibility refers to whether memory operations by the “local” thread are visible

to the memory monitoring mechanism at either the “local” or “remote” end. Options:

“local”, “remote” or “local and remote”

2. Event location specifies where an event handler is triggered. The hardware can choose where to deliver the information gathered about a memory operation; typically this entails stopping that thread's execution and triggering a handler.

Options: “local” or “remote”

3. Event time refers to when (and if) the handler is triggered to enable software to be syn-

chronously notified when the event occurs. Options: Break (handler triggered before ac-

cess), Report (handler triggered after access), Record (no handlers, software is expected

to poll)

4. Monitor Visibility: To be tracked, accesses obviously need to be visible to the memory monitoring mechanism. Monitor visibility is the complement of access visibility; it specifies whether the monitoring mechanism provides a response that informs the accessor of its presence.

Options: "local" or "remote" refers to whether a monitoring mechanism set up locally or remotely is visible

5. Accuracy specifies the precision of the monitoring mechanism. In the interest of

complexity-effectiveness, hardware primitives might introduce false-positives (events

even in the absence of operations) or false-negatives (lose some of the tracked opera-

tions).

In Table 3.1 we classify the monitoring primitives that current processors support. An evident limitation is the restriction of monitoring to local events only, which limits their benefit for multithreaded code with fine-grain inter-thread interactions. Furthermore, implementations provide limited support (e.g., x86 has 4 watchpoint registers) and/or impose high overhead (e.g., TLB protection changes) on software.

Table 3.1: Classification of current monitoring mechanisms

                       Access      Event     Event       Monitor     Accuracy
                       Visibility  Location  Time        Visibility
TLB                    Local       Local     Break       Local       False +
Watchpoint Regs.       Local       Local     Break/Rep.  Local       -
Performance Counters   Local       Local     Record      Local       False +/-

Accuracy: "-" = 100% accurate. Rem. = Remote, Rep. = Report.

3.3 Alert-On-Update

We propose a simple technique to selectively expose cache events to user programs. Using our

technique, threads register an alert handler and then selectively mark lines of interest as alert-

on-update (AOU). When a cache controller detects an event of interest, it notifies the local

processor, effecting a spontaneous subroutine call to the current thread’s alert handler.

A traditional advantage of shared-memory multiprocessors is their ability to support very

fast implicit communication: if thread A modifies location D, thread B will see the change as


soon as it tries to read D; no explicit receive is required. There are times, however, when B

needs to know of A's action immediately. Event-based programs and condition synchronization

are obvious examples, but there are many others. As an example, consider a program which uses

a lock L for synchronization. Typically, if L is already acquired, then a thread A that desires the lock has to repeatedly poll the status of L to detect the release of the lock. Interprocessor

interrupts are the standard alternative to polling in shared-memory multiprocessors, but they are

typically triggered by the operating system and have prohibitive latencies. This cost is truly

unfortunate, since most of the infrastructure necessary to inform remote threads of a change

to location L is already present in the cache coherence protocol. Alert-on-update provides an

effective way to reflect write notices up to user-level code.
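
As an illustration, the following sketch recasts the lock example using AOU; the __aou_* intrinsics are invented stand-ins for the set_handler/aload/arelease instructions described below (see Table 3.2), not an actual compiler interface:

    #include <atomic>

    extern void __aou_set_handler(void (*h)());  // hypothetical: set_handler %r
    extern void __aou_aload(void* addr);         // hypothetical: aload %r
    extern void __aou_arelease(void* addr);      // hypothetical: arelease %r

    std::atomic<int> L{0};                       // 0 = free, 1 = held
    volatile bool lock_may_be_free = false;

    void alert_handler() {                       // runs on any write to an aloaded line
        lock_may_be_free = true;                 // may be spurious; acquire() re-checks
    }

    void acquire() {
        __aou_set_handler(alert_handler);
        for (;;) {
            __aou_aload(&L);                     // watch the lock word's cache line
            int expected = 0;
            if (L.compare_exchange_strong(expected, 1))
                break;                           // got the lock
            while (!lock_may_be_free) { /* pause/yield instead of polling L */ }
            lock_may_be_free = false;            // alert consumed; retry the CAS
        }
        __aou_arelease(&L);                      // stop monitoring once acquired
    }

Because alerts can be spurious (e.g., false sharing), the waiting thread always re-checks the lock with a CAS rather than trusting the alert itself.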

3.3.1 Implementation

Alert-on-update can be implemented on top of any cache coherence protocol: coherence re-

quires, by definition, that a controller be notified when the data cached in a local line is written

elsewhere in the machine. The controller also knows of conflict and capacity evictions. AOU

simply alerts the processor of these events when they occur on marked lines. The alert includes

the address of the affected line and the nature of the event. Perhaps the most obvious way to

deliver a notification is with an interrupt, which the OS can transform into a user-level signal.

The overhead of signal handling, however, makes this option unattractive. We propose, there-

fore, that interrupts be used only when the processor is already running in kernel mode. If the

processor is running in user mode, an alert takes the form of a spontaneous, hardware-effected

subroutine call. Alert traps require simple additions to the processor pipeline. Modern pro-

cessors already include trap signals between the Load-Store-Unit (LSU) and Trap-Logic-Unit

(TLU) [KAO05]. AOU adds an extra message to this interface.

Note that with AOU, the thread performing the access is itself completely oblivious to whether the cache line was AOUd, unless the accessor thread also requested notification of

local accesses. Simply put, AOU is used as a way for a thread to glean information on events

on lines that are locally cached. AOU can be implemented entirely with modifications to the

local cache controller and a single additional "A" bit in the tag of the private L1, without requiring any modifications to the coherence protocol itself. Since AOU relies on tag bits, it cannot monitor a line if it is not cached. On eviction of an AOUd line, the hardware conservatively

traps to software to indicate that the line will no longer be tracked. The “A” bits are valid only

as long as the thread that set them is active on the processor; they are flash-cleared on context

switches. Subsequently, the OS wakes up the thread in its alert handler to inform it that the AOU monitoring was terminated.

An application enables AOU by using an instruction to register a handler. It then indicates

its interest in individual lines using aload (alert-load) instructions. One variety of aload

requests alerts on remote writes only; another requests alerts on any write, including local

ones. Both varieties also generate alerts on capacity and conflict misses. Hardware provides

set_handler, aload, clear_handler, and arelease instructions; special registers

to hold the handler address, the PC at the time an alert is received, and a description of the alert;

and two bits per cache line to indicate interest in local and/or remote operations. For simplicity,

we implement the alert bits in a non-shared level of the memory hierarchy (the L1 cache). If the

L1 were shared we would need separate bits for each local core or thread context.

Table 3.2: Alert-on-update Interface

Registers
%aou_handlerPC: address of handler to be called on a user-space alert
%aou_oldPC: PC immediately prior to call to %aou_handlerPC
%aou_alertType (4 bits): remote read, remote write, local write, local read, lost alert, or capacity/conflict eviction
%alert_enable (1 bit): set if alerts are to be delivered; unset when they are masked

Instructions
set_handler %r: move %r into %aou_handlerPC
clear_handler: clear %aou_handlerPC and flash-clear alert bits for all cache lines
aload %r: set alert bit for cache line containing the address in %r; set overflow condition code to indicate whether the bit was already set
arelease %r: unset alert bit for line containing the address in %r
ClearAlerts: flash-clear alert bits on all cache lines
enable_alerts: set the alert-enable bit

Cache
One extra bit per line, orthogonal to the usual state bits

We make no assumptions about the calling conventions of AOU-based code. Typically, our

alert handlers consist of a subroutine bracketed by code reminiscent of a signal trampoline. Ad-

ditional alerts are masked until the handler re-enables them. A status register allows the handler

to determine whether masked alerts led to lost information. Capacity/conflict alerts, likewise,


indicate that the local cache has lost the ability to precisely track a line. Moreover, exclusive

loads and prefetches, failed atomic operations, silent stores, and false sharing within lines may

all lead to “spurious” alerts. For these reasons user software must treat alerts conservatively: any

change to a marked line will result in an alert, but not all alerts provide precise information or

correspond to interesting changes. We expect the overhead of spurious alerts to be significantly

less than that of fruitless polling.

Alert States

Figure 3.1 shows the coherence protocol support required for Alert-on-update. Processor read

and write operations are represented by Rd and Wr respectively; alert operations are represented

as ARd and AWr. To indicate a write operation (alert or not) we use [A]Wr; we use a similar

convention, [A]Rd, for reads. Dashed boxes enclose the MESI and Alert subsets of the

state space. Notation on transitions is conventional: the part before the slash is the triggering

message; after is the ancillary action (‘–’ means none). “Flush” indicates writing the line to

the bus. S indicates a signal on the "shared" bus line, accompanying the response to a BusRd request; an overbar means "not signaled". The "Release" transition from the Alert state space to the MESI state space occurs only on the corresponding cache line. "BusRdX" invalidates the cache line and transitions it to I. "ClearAlerts" flash-clears the alert bits and reverts all the AOUd lines in the cache to their respective MESI

counterparts.

Note that the responses of the alert states to processor and bus operations mirror those of their respective MESI counterparts. This ensures that we can implement Alert-on-update without any addition to the transition state space. In fact, while the protocol diagram shows a separate set of alert states to describe the transitions, the alert bit is completely orthogonal to the coherence

protocol. Its sole purpose is to monitor cache coherence events and does not impact the protocol

responses or transitions.

[Figure 3.1: Coherence protocol support for Alert-on-update. The state diagram shows the MESI states (M, E, S, I) and their alert counterparts (AM, AE, AS), with processor ([A]Rd/[A]Wr) and bus (BusRd/BusRdX) transitions, plus the Release and ClearAlerts transitions back to the MESI states.]

3.3.2 Observable events

A key design choice in AOU is the type of events that are observable at the cache controller.

Alert bits can obviously observe all local accesses (writes and reads) from the processor. The

cache coherence protocol also guarantees that remote writes are observable at all cached copies.

Remote reads are trickier, since they are only observable if the local processor has performed a write and holds the only valid copy, i.e., remote reads are only observable if the line is locally cached in the AM state (see Figure 3.1). Also note that in the case of remote accesses, only the first remote access is visible. Subsequent accesses will not be seen, either because the alerted line is lost (remote write) or because the coherence protocol filters out remote reads by choosing to supply the line from another copy. Finally, the cache controller has to inform software of evictions (either capacity or conflict), which software can use to enable software-based tracking.

Another important design choice, one that critically impacts the implementation, is whether to allow multiple types of monitoring to be activated simultaneously. Indicating whether a cache line should monitor any combination of the four types of events (i.e., local read, local write, remote write, and remote read) would require four separate alert bits per cache line. To minimize the modifications required to the critical L1 data cache, our current design decouples this choice from the cache line: it provides a set of four configuration bits for the entire L1 cache to specify the types of events and requires only a single alert bit per cache line to switch alerts on or off. All cache lines set up for alerts therefore monitor the same types of events.

3.3.3 Virtualization

When exposing hardware primitives to user-level applications, there is a critical need to virtualize resources to free applications from the burden of reasoning about low-level hardware limits. When aloaded lines are evicted from the cache, two challenges need to be addressed: first, alert bits are available only in the cache, and second, the appropriate thread contexts that requested the alert need to be tracked. The former can possibly be addressed by extending alert bits throughout the memory hierarchy. The latter is a hard challenge: an AOUd cache line implicitly identifies the thread that requested the alert, but maintaining this information for an arbitrary number of swapped-out threads would require extensive hardware resources and would further complicate system software support. For our base design, we adopted a simple approach to virtualization. We assume that all aloaded lines belong to the active thread. Thread and process schedulers must execute clear_handler (which also areleases all aloaded cache lines) on context switches.

When a thread is inactive, the cache line associated with the thread could possibly change

without notifying the thread. When the thread is re-activated, it must perform a conservative polling operation to check whether a previously AOUd line has changed. The polling can be implemented in an application-specific manner by examining either values or version numbers; our AOU-accelerated STMs, for example, utilize version numbers (see Section 3.5 for details), as sketched below.
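
A minimal sketch of this revalidation step, with illustrative names rather than our actual STM code:

    #include <cstdint>
    #include <vector>

    struct WatchEntry { const uint64_t* version_addr; uint64_t seen_version; };
    std::vector<WatchEntry> watched;     // one entry per aloaded location

    bool revalidate_after_reschedule() {
        for (const auto& w : watched)
            if (*w.version_addr != w.seen_version)
                return false;            // location changed while we were descheduled
        return true;                     // still consistent; re-aload and continue
    }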

Another challenge that we have to deal with is the possibility of multiple alerts arising at the

same time. As with any other form of hardware interrupt, we coalesce multiple alerts. Only the

first alert has detailed information saved in the registers. When the alert handler is triggered it

disables alerts until it has saved the alert registers on the stack; any further alerts that arise in the


meantime are reflected back to software by using a flag register to indicate missed alerts. We

use existing virtualizing mechanisms like version numbers to detect alerts that may have been

missed. Section 3.5 discusses in more detail how STMs use software validation to detect missed alerts.

Finally, we need to define the behavior when a code region that is using AOU invokes system calls. To ensure that simple system calls and many common interrupts do not force conservative polling, the OS does not execute clear_handler when returning to the most recently executing user thread. If an event occurs on an aloaded line while the processor is executing

in the kernel, the alert will be delivered as an interrupt. The interrupt handler simply remembers

the contents of the AOU registers (see Table 3.2), and if control eventually returns to user space

without a context switch, the kernel effects a deferred call to the user-level handler. Moreover,

kernel routines themselves can use alert-on-update, so long as the interrupt handler is able to

distinguish between user and kernel lines, and so long as all the kernel lines are areleased

before returning to user space.

3.4 Related Work

Table 3.3 classifies proposed monitoring mechanisms and alert-on-update on the design space

discussed in Section 3.2.1. Overall, AOU is more flexible than other proposed mechanisms; it can track both remote and local accesses, trigger events at either the remote or the local processor, and can operate in break, report, or record mode. We compare and contrast

with each of the proposed mechanisms in detail below.

3.4.1 Informing Loads

More than a decade ago, Horowitz et al. [HMM96] proposed a set of memory operations that enable software to directly observe cache misses and react to them. Their proposal was to enable

runtime optimizations through fine-grain profiling of non-uniform data access latencies. They

propose informing memory instructions, which essentially couple conventional memory opera-

tions with a branch operation. If the memory operation misses in the primary L1 cache, then the

hardware transfers control to a specified handler. AOU is data centric and seeks to monitor


Table 3.3: Classification of proposed monitoring mechanisms

                          Access        Event         Event            Monitor       Accuracy
                          Visibility    Location      Time             Visibility
Informing Loads [HMM96]   Local         Local         Break            Local         -
Intel Mark bits [SAJ06]   Local         Local/Remote  Record           Local         -
Signature [CTC06]         Local/Remote  Local         Break            Local/Remote  False +
Alert-on-update           Local/Remote  Local/Remote  Break, Report,   Local         -
                                                      or Record

Accuracy: "-" = 100% accurate. Rem. = Remote, Rep. = Report. Signatures can only be used in record mode for remote accesses.

accesses (both local and remote) to a given location, while informing loads detect the cache behavior (hit or miss) of a specific memory access. Unlike informing loads, alert operations are decoupled from the instruction that informs the cache itself, i.e., the handler is triggered on subsequent operations.

3.4.2 Intel mark bits

Alert-on-update was originally introduced in a 2005 technical report and a paper at TRANS-

ACT’06 [SMD06]. Researchers in the McRT group at Intel subsequently published a variant

of AOU that uses polling instead of events to detect cache line evictions. Their HASTM sys-

tem [SAJ06] proposed a set of mark bits in the cache, which software can set up specifically to monitor remote writes. Unlike AOU, which can operate in break or report mode, Intel's mark

bits operate only in record mode (see Table 3.3). On remote write operations (or cache line evictions), the hardware sets a flag register, which software polls to detect whether any of the marked lines were affected. Expecting software to frequently poll and check the status of the cache lines introduces a performance penalty due to the noticeable increase in instruction count, and requires extra memory fences in the program to order the polling of the flag register with respect to memory operations. Alert operations are asynchronous events that completely eliminate polling and significantly reduce the overheads.


3.4.3 Signatures

In 2006, Ceze et al. [CTC06] introduced Bloom-filter-based hardware signatures that record the addresses read and written by a thread. The signature performs a membership test on coherence requests and memory accesses; on a membership hit, a handler is triggered on the accessor. Unfortunately, Bloom filters have false positives, i.e., they represent a superset of the addresses marked by the program. False positives, while a performance issue, can be tolerated in applications like transactional memory or debugging [TAC08], which require alert handlers only on the accessor. Triggering handlers on remote processors is a challenge in the presence of false positives, since spurious alerts could show up on completely unrelated threads in the system. False positives are not acceptable for applications such as event-based communication or accelerating locks (see Section 3.6). Signatures also modify the coherence protocol to inform the thread making the access about the monitoring mechanism, i.e., signatures detect an access and provide feedback to the accessor about the thread that set up the signature. Aload, however, makes no modifications to the coherence response messages, and remote accessors are not informed about AOU.
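
For concreteness, a minimal sketch of such a read/write-set signature; the size and hash functions are illustrative, not those of Bulk:

    #include <bitset>
    #include <cstdint>

    class Signature {
        std::bitset<2048> bits;                       // fixed hardware budget
        static size_t h1(uintptr_t a) { return (a >> 6) % 2048; }
        static size_t h2(uintptr_t a) { return ((a >> 6) * 2654435761u) % 2048; }
    public:
        void insert(uintptr_t block_addr) {           // record an accessed block
            bits.set(h1(block_addr));
            bits.set(h2(block_addr));
        }
        bool member(uintptr_t block_addr) const {     // test a coherence request:
            return bits.test(h1(block_addr))          // never misses an inserted
                && bits.test(h2(block_addr));         // address, but may falsely
        }                                             // match unrelated ones
        void clear() { bits.reset(); }                // flash-clear at commit/abort
    };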

3.5 Application 1: AOU Assisted STMs

In this section, we examine in detail the value that alert-on-update offers in accelerating a soft-

ware transactional memory system. We focus on two specific STM systems: (1) RSTM, an indirection-based nonblocking STM system, and (2) LOCK, a lock-based STM similar to Transactional Locking II [DSS06]. AOU can also be used to accelerate other STMs like McRT [SAH06], Microsoft's Bartok STM [HPS06], and Ennals' LibTx system [Enn06].

Pure software implementations of transactional memory (STM) can be divided into two

main camps: those that use locks under the hood (hidden from the user) and those that are non-

blocking (typically obstruction-free [HLM03a; HLM03b]). Several groups have found lock-

based implementations to be faster in practice [Enn06; DSS06; SAH06; HPS06], but nonblock-

ing implementations have other advantages: they are immune to priority inversion in event-

based code, and to performance anomalies caused by inopportune preemption or page faults.


Finally, nonblocking implementations ensure consistency even if a thread dies in the middle of a transaction.

In Section 3.5.1 and Section 3.5.2, we briefly describe the internals of two STM systems,

RSTM and LOCK. In Section 3.5.3, we discuss the challenges associated with improving per-

formance of STMs. Section 3.5.4 describes the use of AOU in accelerating STMs. Finally,

Section 3.5.5 quantifies the performance improvements obtained when incorporating AOU sup-

port in STMs.

3.5.1 RSTM : Indirection-Based STMs

Indirection-based nonblocking STM systems include DSTM [HLM03b], ASTM [MSS05], and

RSTM [SMS06]; in this dissertation we focus on RSTM. The two principal metadata structures

in RSTM (see Figure 3.2a) are the transaction descriptor and the object header. The descriptor

contains an indication of whether the transaction is active, committed, or aborted. The header

contains a pointer to the descriptor of the most recent transaction to modify the object, together

with pointers to old and new versions of the data. If the most recent writer committed in soft-

ware, the new version is valid; otherwise the old version is valid. Indirection-based STMs

typically introduce an extra level of pointer lookup through the metadata on every data access.

All modifications to objects are performed on private copies. Before it can commit, a trans-

action must acquire the headers of any objects it wishes to modify, by making them point at its

descriptor. By using a CAS instruction to change the status word in the descriptor from active

to committed, a transaction can then, in effect, make all its updates valid in one atomic step.

Prior to doing so, it must also verify that all the object clones it has been reading are still valid.
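The commit sequence can be sketched as follows (a simplification; the type and field names are ours, and contention management, cloning, and memory-ordering details are omitted):

enum Status { ACTIVE, COMMITTED, ABORTED };

bool commit(Descriptor* tx) {
    // Acquire: make each written object's header point at our descriptor.
    for (WriteEntry& w : tx->write_set)
        if (!CAS(&w.header->owner, w.seen_owner, tx))
            return false;                      // lost a conflict: abort
    // Validate: each object read must still be the version we first saw.
    for (ReadEntry& r : tx->read_set)
        if (r.header->owner != r.seen_owner)
            return false;
    // A single CAS makes all the new versions valid in one atomic step.
    return CAS(&tx->status, ACTIVE, COMMITTED);
}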

The indirection of nonblocking systems ensures that committing and aborting are both

lightweight operations, and that objects read during a transaction are immutable. Unfortunately,

the extra indirection also increases both capacity and coherence misses in the cache, by increas-

ing the number of lines required to represent an object and the number of lines that are modified

when changes to an object are committed. Ennals [Enn06] and Dice et al. [DiS06] find these cache misses to be a major source of overhead in indirection-based STMs.


[Figure 3.2: STM Metadata. (a) RSTM metadata: the object header points to the transaction descriptor (owner), and to the old and new versions of the data; all references go through the indirection pointer (header), which is easily changed to install a new version. (b) LOCK metadata: a version#/owner/lock word and a redo log (clone) that records the old version number, the in-progress modifications ("log"), and a back-pointer to the master copy of the object; readers can directly reference the location but need to check to ensure that the read value is consistent.]

3.5.2 LOCK : Lock-based STM

The RSTM indirection header is unnecessary. Pointers to shared objects need not pass through

the header to mediate conflicts and indicate acquisition. Instead we can pack lock status, version

number, and acquisition status into a single word in the object itself. Figure 3.2b depicts the per-

object metadata for the LOCK STM. Every object contains two header words, which we modify

only using atomic double-word compare-and-swap (CAS) instructions. In the common

case, an object is not owned: its redo log pointer is null and its version number is odd. In this

case, the object contents can be read directly, so long as the version number has not changed

since the first time the object was accessed by the current transaction.

As in RSTM, LOCK transactions perform all speculative writes on private object clones.

Each clone serves, in effect, as a “redo log”, to be applied to the master copy in the wake of a

successful commit. A thread must perform this copy-back before it can start a new transaction.

In the absence of hardware support, log application is performed under the protection of per-

object locks. Threads can acquire the lock eagerly (at the time of the first access) or lazily (just before commit). If a transaction attempts to read a locked object, it must wait. After completing a log application, a thread zeros the redo log pointer, updates the version number word, and releases the lock.

Metadata in the acquired state contains a pointer to the owner in the first header field, and

a pointer to the redo log in the second. Because objects are updated in place, a transaction is not


guaranteed that the (copy of an) object it reads is immutable. To ensure that all reads come from

the same consistent version of the object, we attach a version number to each object. The first

word of the redo log contains the old version number of the object. The second word contains

a back-pointer to the public version of the object. Reading transactions ensure the consistency

of the version number before using any field read from an object.
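The per-object layout can be pictured as follows (a sketch; the field names are ours):

struct RedoLog {
    uintptr_t oldVersion;      // first word: version number before acquisition
    Object*   master;          // second word: back-pointer to the public object
    // ... buffered speculative writes follow
};
struct ObjectHeader {          // both words changed together with a wide CAS
    uintptr_t versionOrOwner;  // odd version number, or pointer to owner txn
    RedoLog*  redoLog;         // null when the object is not owned
};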

3.5.3 Challenges in STM

Validation

Since many transactions can read a shared object simultaneously, STMs tend to permit "invisible reads", in which a transaction reads an object O but makes no modification to O's metadata. This behavior prevents metadata from bouncing between cache lines when several threads read O simultaneously, but in the event that a transaction modifies O while O is being read, all transactions TR that have read O are responsible for detecting the change to O and aborting themselves. To validate a single object, TR need only ensure that the object's header still references the same version of the object that TR has seen in all previous accesses. Since acquisition and locking both modify the same header field, it is straightforward for a transaction to ensure that an object has not changed; it need only compare the first word of the header to a private copy of that field that was read the first time the object was accessed. However, every time TR opens a new object, it must re-validate all previously read objects. Thus if TR reads N objects, it must perform N(N−1)/2 = O(N²) comparisons. This validation overhead is on the critical path for both RSTM and LOCK.
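The quadratic cost arises because the entire read set is re-validated on every open, as in the following sketch (names are illustrative):

// Incremental validation, repeated on each open_read: with N objects
// read, the loop body executes 1 + 2 + ... + (N-1) = N(N-1)/2 times.
Object* open_read(Txn* tx, ObjectHeader* h) {
    for (ReadEntry& r : tx->read_set)           // re-validate everything so far
        if (r.header->firstWord != r.seenWord)  // header changed: acquired
            abort_txn(tx);
    tx->read_set.push_back({h, h->firstWord});  // record first-seen word
    return object_of(h);
}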

Bookkeeping

An additional cost that many STMs face is that of bookkeeping. Clearly, in the case of invisible

reads it is necessary for a transaction to maintain a list of all objects read. Furthermore, in STMs

for non-garbage-collected languages, it is likely that transactions reuse their descriptor objects.

In this case, the O(n) storage and time overhead for tracking all reads is necessary even with

visible reads. In STMs that frequently search through their read sets, list structures may be

inappropriate, resulting in greater algorithmic and storage complexity.


Delayed Aborts

STMs cannot tightly bound the delay between when a transaction becomes destined to abort

and when it actually aborts. In the case of invisible reads, the thread must first validate its read

set to detect that it must abort, but even for visible reads, the system may not be able to bound

the time that a transaction spends outside of library code. In library-based STMs, abort and rollback mechanisms are never initiated while execution is in user-provided code; even when globally visible information dictates that a transaction must abort, the transaction may still spend considerable time doing useless work.

3.5.4 Using AOU to accelerate STMs

To keep the discussion simple, we describe only the details of incorporating AOU into LOCK.

We present a two-step evolution of STM systems characterized by increasing reliance on AOU.

LOCK can use a single aloaded line per thread to guarantee nonblocking progress. LOCK

can also use one aloaded line per object to avoid the performance overheads associated with

validation.

Restoring Nonblocking Guarantees

The sole purpose of the per-object lock in LOCK is to ensure that as soon as one thread has completed the copyback of a redo log, no other thread continues attempting to apply that log. The use of a lock for the copyback, however, costs the STM its obstruction freedom. We can restore nonblocking guarantees by making the lock revocable. Using AOU we can construct a simple special-purpose revocable lock. We do not expect the use of a single alert-on-update line to significantly increase performance; the code is equivalent in complexity to the base LOCK system.

If all threads agree on the set of locations to be written (a condition guaranteed by the redo

log), then the lock can be stolen as long as the previous lock holder is certain not to continue

writing once the lock is lost. Figure 3.3 demonstrates how the lock can be stolen using AOU. The AcquireRevocableLock() operation can either lock an unlocked object, or overwrite the lock of a locked object.


#define LOCKED 2
bool success = true
try
    set_handler({throw Alert()})
    if (log = o->HasRedoLog)
        aload(o->lock)
        if (o->AcquireRevocableLock(log))
            o->ApplyRedoLog(log)
            o->ReleaseLockAndClearLog()
        arelease(o->lock)
    clear_handler
catch (Alert)
    success = false

AcquireRevocableLock(log):
    v = versionNumber
    if (v == copyback_needed() && backoff() < THRESHOLD)
        goto AcquireRevocableLock
    return CASX(this, <v, log>, <LOCKED, log>)

A single aloaded line suffices to steal responsibility for applying a redo log to object o.

Figure 3.3: Lock Stealing to make LOCK non-blocking

In the latter case, the current lock holder will receive an immediate alert, ensuring that if the new lock holder completes, no other threads are writing the object. To avoid pathological behavior, AcquireRevocableLock() waits for a bounded

period of time before attempting to steal the lock. Since threads never hold more than one lock,

and since that lock is used only to protect log application, a single aloaded line suffices to

restore obstruction freedom.

Reducing Validation Costs

We now focus on a more pervasive use of AOU to dramatically improve performance. As in

the previous subsection, we use AOU to implement revocable locks. However, we also use AOU

to eliminate both quadratic-time inter-object validation and per-access intra-object validation in

the common case.

In the common case, a transaction reads and writes only a small number of objects. If all of a transaction's object headers fit in the cache, then once an object O is read, its header ought to remain in

the cache until the transaction completes. Barring pathological cache overflows, invalidation of

O’s header implies that O has been acquired by another transaction and the current transaction


should abort. If the transaction registers an alert handler that immediately aborts, and each object header is initially loaded using alert-on-update, all validation can be elided: an

alert is certain to precede any modification to relevant metadata, and all forms of validation fail

only if some relevant metadata is modified. In effect, the alert mechanism transforms the cache

into a self-validating read set.
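Under this scheme the read instrumentation collapses to a single aload on first access, as sketched below (aload and set_handler are the AOU primitives of this chapter; the remaining names are illustrative):

// Installed once, at transaction begin:
//   set_handler(abort_and_restart);
// Thereafter, opening an object needs no validation loop at all:
Object* open_read(Txn* tx, ObjectHeader* h) {
    aload(&h->firstWord);      // any remote acquisition of h alerts us
    return object_of(h);       // if control reaches here, everything the
}                              // transaction has read so far is still valid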

Our proposal is somewhat idealized, since the capacity of a cache is limited. If we have

abundant but finite lines that can be tagged alert-on-update, then we may tag up to K

objects (where K is based on the cache size) and then fall back to explicit incremental and

per-access validation for the remaining R − K objects in the read set. In our implementation,

transactions estimate K and decrease it when they are alerted due to overflow (detected through

the %aou_alertType register).

The runtime sometimes requires that a set of related metadata updates be allowed to com-

plete, i.e., that the transaction not be aborted immediately. This is accomplished by setting a

“do not abort me” flag. If an alert occurs while the flag is set, the handler defers its normal ac-

tions, sets another flag, and returns. When the runtime finishes its updates, it clears the first flag,

checks the second, and jumps back to the handler if action was deferred. This “deferred abort”

mechanism is also available to user applications, where it serves as a cheap, non-isolated ap-

proximation of open nested transactions [Mos06]. We use this interface to interact with external memory allocator libraries.
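A minimal sketch of this protocol (the per-thread flag names are ours):

// Per-thread flags for the deferred-abort protocol.
volatile bool no_abort = false, abort_deferred = false;

void alert_handler() {
    if (no_abort) { abort_deferred = true; return; }   // defer the abort
    abort_and_restart();                               // normal action
}
void guarded_metadata_update() {
    no_abort = true;
    // ... related metadata updates that must be allowed to complete ...
    no_abort = false;
    if (abort_deferred) alert_handler();   // take the deferred action now
}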

3.5.5 Evaluation

In this section we present experimental results that measure the impact of

alert-on-update on TM performance. We consider both throughput and the cache

miss rate, which we use as a measure of the benefit of removing indirection. Detailed

performance graphs appear in an appendix. All results were obtained through full-system

simulation.


Simulator Framework

We simulate a 16-way chip multiprocessor (CMP) using the GEMS/Simics infrastruc-

ture [MSB05]. Simulation parameters are listed in Table 3.4.

Table 3.4: Simulation Parameters (16-way CMP, private L1, shared L2)

Processor Cores   1.2GHz in-order, single issue, ideal IPC=1
L1 Cache          64KB 4-way split, 64-byte blocks, 1-cycle latency,
                  victim buffer: 32 entries
L2 Cache          8MB 8-way unified, 64-byte blocks, 4 banks, 20-cycle latency
Memory            2GB, 250-cycle latency
Interconnect      4-ary totally ordered hierarchical tree, 2-cycle link
                  latency, 64-byte links

Benchmarks

We consider the following five benchmarks: HashTable, RBTree, LFUCache, LinkedList-Release, and RandomGraph, each designed to stress different aspects of software TM. Appendix A

includes a detailed description of the benchmarks. In all benchmarks, we execute a fixed num-

ber of transactions in single-thread mode to advance the data structure to a steady state. We then

execute a fixed number of transactions concurrently in multiple threads to evaluate scalability

and throughput. During the timed trial, we also monitor L1 cache misses (read and write), as

we expect them to decrease in systems without indirection.

Runtime Systems Evaluated

We compare a total of 7 systems. As baselines, we consider RSTM (RSTM) and RSTM with

our global commit counter heuristic (RSTM+C) [SMS06]. The commit counter is a global count

of the number of transactions that have attempted to commit. When a transaction acquires an

object, it sets a local flag indicating that it must increment the counter before attempting to


commit. Now when opening a new object, a reader can skip validation if the global commit

counter has not changed since the last time the reader checked it.
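In code, the heuristic amounts to one comparison per open (a sketch; the names are ours):

// Global count of commit attempts, fetch-and-incremented by writers.
volatile int global_commits;

Object* open_read(Txn* tx, ObjectHeader* h) {
    int now = global_commits;
    if (now != tx->last_seen) {      // someone committed: must re-validate
        validate_read_set(tx);
        tx->last_seen = now;
    }                                // otherwise validation is skipped
    return object_of(h);
}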

We also evaluate the acceleration of LOCK using AOU. The first accelerated variant, LOCK-AOU_1, uses a single AOU line per thread to restore nonblocking guarantees (while, like LOCK, eliminating indirection); the second, LOCK-AOU_N, uses one AOU line per object to eliminate validation. For LOCK-AOU_1, we also consider a variant that uses the global commit counter (LOCK-AOU_1+C). Lastly, we consider a variant of RSTM (RSTM-AOU_N)

that uses AOU to avoid validation overheads (like LOCK-AOU_N) but without eliminating indi-

rection. This library does not incur per-access validation, but has increased cache pressure. The

graphs also include results for the LOCK STM of Section 3.5.2. Its performance is within 5% of LOCK-AOU_1 on average, which is unsurprising, since LOCK-AOU_1 differs from LOCK only when locks are stolen; we do not discuss LOCK further here.

To ensure a fair comparison, we use the same benchmark code, memory manager, and con-

tention managers in all systems. For contention management we use the Polka manager [ScS05]

and all TMs use eager conflict detection.

Eliminating Read-Set Validation

By leveraging abundant AOU lines, both RSTM-AOU_N and LOCK-AOU_N are able to im-

prove TM performance by 1.4–2× in HashTable, RBTree, LinkedList-Release, and LFUCache.

Single-thread RandomGraph improves by a factor of 5. Not only do these systems outperform

their unaccelerated counterparts (RSTM and LOCK-AOU_1, respectively) at all thread levels,

they also outperform our commit-counter heuristic in almost all cases.

In previous work, we showed that the commit counter entails a tradeoff: in return for a

constant-time indication of whether any transaction has committed, all transactions must seri-

alize on a single global counter. For HashTable, where transactions tend not to conflict, this

forced serialization is a bottleneck that slows performance. However, both RSTM-AOU_N and

LOCK-AOU_N avoid serializing on a counter while still enabling validation calls to be skipped.

As a result, we see HashTable improve by over 20%, whereas the commit counter actually

degrades performance with respect to RSTM.


[Figure 3.4: Performance of STMs with AOU acceleration. Five panels (HashTable, RBTree, LinkedListRelease, LFUCache, RandomGraph) plot normalized throughput against thread count (1, 2, 4, 8, 16) for RSTM, RSTM+C, RSTM-AOU_N, LOCK-AOU_1, LOCK-AOU_1+C, and LOCK-AOU_N. Results are normalized to RSTM, 1 thread. Using alert-on-update to eliminate validation improves performance by as much as a factor of 2 (a factor of 5 in RandomGraph), and outperforms the global commit counter heuristic.]


[Figure 3.5: L1 cache miss rates across accelerated STMs. Top: L1 cache misses (read and write) per transaction at 1 thread. Bottom: L1 cache misses at 16 threads. Bars compare RSTM, LOCK, LOCK-AOU_1, and LOCK-AOU_N on HashTable, RBTree, LinkedList-Release, LFUCache, and RandomGraph. Results are normalized to RSTM, 1 thread.]

In LFUCache, where transactions conflict with high likelihood and consequently do not admit scalability, we still see that removing validation without adding an expensive fetch-and-increment enables an improvement of almost 40%. Furthermore, since AOU decreases the time required to commit a transaction, performance degrades less at higher thread levels. With faster transactions, the window of conflict is smaller.

In RBTree and LinkedList, the counter is not a bottleneck, but it is imprecise. When a

writing transaction increments the counter and commits, all active transactions are forced to

validate, even if they do not conflict with the writer. Thus for longer transactions and moderate concurrency (T threads), a transaction is likely to validate (T−1)/2 = O(T) times, even if there are no conflicts. Since AOU precisely tracks conflicts, it does not fall victim to false-positive events, and thus it improves performance by a much larger amount.


[Figure 3.6: Timing breakdown for accelerated STMs. Normalized execution time for CGL, RSTM, LOCK, RSTM-AOU_N, LOCK-AOU_1, and LOCK-AOU_N on Hash, RBTree, LinkedList-Release, LFUCache, and RandomGraph, broken down into Abort, Copy, Validation, CM, MM, Bookkeeping, App Non-Tx, and App Tx components.]

For RBTree at 16 threads, AOU increases throughput by 70%, whereas the commit counter improves throughput by less than

10%. LinkedList-Release sees a 2× speedup with AOU, and only a 10% speedup with the

counter. The imprecision and false positives induced by the counter mask the concurrency of

these benchmarks.

As in other benchmarks, RandomGraph single-thread performance is slightly higher with

AOU than with the counter, since AOU does not require an expensive CAS operation. However,

the counter enables reorderings that approximate mixed invalidation [Sco06; SMS06], which

dramatically improves throughput in RandomGraph. Briefly, the counter defers detection of

conflicts between a reader and a subsequent writer of an object until the writer commits. If

the reader commits first, the conflict is ignored. This behavior is not present when using AOU,

since the writer’s acquisition will immediately alert the reader. Since the window of contention

is long in RandomGraph, and since the counter shrinks this window considerably, the commit

counter delivers substantially better throughput than AOU.

Analysis of cache misses identifies an interesting trend: When read set validation is avoided,

cache misses decrease. This is a direct result of the reduced bookkeeping afforded by AOU.

Since the transaction relies on the cache for notification of conflicts, it is not necessary to main-

tain a large list of all objects read in order to enable validation. Additionally, there is no costly

validation step that pulls metadata into the cache, possibly at the expense of objects read transac-

tionally. By reducing bookkeeping, AOU reduces cache pressure and avoids capacity evictions,

decreasing the overall miss rate. The commit counter has a similar, but less pronounced effect.


Latency and Overhead

Figure 3.6 quantifies the overheads incurred by our various TM systems in single-thread execu-

tion. Among the principal overheads, only validation and bookkeeping vary significantly across

systems; other overheads are either negligible (due to the lack of conflicts in single-threaded

code) or constant. Single-thread latency measurements are an effective way to gauge the fixed

overheads of the STM instrumentation that cannot be overcome with concurrency.

Our latency measurements reflect some instrumentation artifacts. As in HASTM [WCW07],

since object metadata is located within data objects, the cost of pulling an object into the cache

is represented as bookkeeping rather than real work (App Tx). In RSTM, this results in one level

of indirection being assigned to App Tx and the other to metadata manipulation (Bookkeeping),

as desired. However, for LOCK this artificially inflates bookkeeping. Secondly, since per-

access validation is only a three-instruction sequence (cache hit, compare, branch), we treat that

overhead as App Tx rather than as validation, in order to limit the instrumentation cost. This

incorrectly adds all per-access validation in the LOCK-based systems to App Tx overhead.

The combination of these effects paints a surprising picture. Indirection and per-access val-

idation overheads are roughly equal, resulting in a slight slowdown in LOCK for most bench-

marks despite the removal of indirection. Furthermore, in the absence of validation we see that

metadata bookkeeping is the dominant overhead. In our systems, this overhead is the cost of

flexibility and obstruction freedom: we must bookkeep eager and lazy writes separately, result-

ing in higher constant overhead per transaction, and we must execute multiple branches when

reading any object, in order to choose between visible and invisible reads (in RSTM) and eager

or lazy acquire. We also collect extensive statistics to drive contention management and adaptive

policies (not employed here) that choose between eager and lazy acquire, and between visible

and invisible readers. To support obstruction freedom and flexible contention management, our

systems must obey a protocol for stealing ownership, stealing locks, and aborting competitors

that places tens of instructions on the critical path. For large transactions, this per-object cost is

an obstacle to good performance.


Sensitivity to Cache Size

Our benchmarks present a best-case scenario for LOCK-AOU_N and RSTM-AOU_N. Even Ran-

domGraph fits entirely in the L1 cache, and thus despite hundreds of transactional objects, AOU

can still be leveraged to avoid all incremental validation overhead. Under more taxing condi-

tions (such as cache associativity constraints or read sets dramatically larger than the number

of cache lines and victim-buffer entries), the relative benefit of AOU will decrease. Assuming no commit counter, for R ≫ C objects, where C is the cache size (in lines), the expected validation overhead is O((R − C)²). Compared to the validation overhead of RSTM or LOCK (O(R²)), the cost will be lower in practice, though still quadratic. For such workloads, combining

AOU and the commit counter would appear to be an attractive option.

3.6 Application 2: Accelerating Locks

For TRANSACT 2009, I collaborated with Michael F. Spear to propose the use of AOU in ac-

celerating transactional mutex locks (TML), a scalable reader-writer lock. In shared memory

applications, locks are the most common mechanism for synchronizing access to shared mem-

ory. While transactional memory (TM) research has identified novel techniques to replace some

lock-based critical sections, these techniques suffer from many problems [MAK01]. Funda-

mentally, TM’s speculative nature does not appear to be suitable for small but highly contended

critical sections, such as those common in operating systems and language runtimes, where low

latency is critical. From an engineering perspective, modern STM runtimes typically require

significant amounts of global and per-thread metadata. This space overhead may be prohibitive

if TM is not used widely within the language runtime.

Of course, it is possible to improve the scalability of a lock-based critical section with-

out abandoning locks altogether. In particular, reader/writer (R/W) locks, read-copy-update

(RCU) [McK04],1 and sequence locks [Lam05]2 all permit multiple read-only critical sections

1 RCU writers create a new version of a data structure that will be seen by future readers. Cleanup of the old version is delayed until one can be sure (often for application-specific reasons) that no past readers are still active.

2 Sequence lock readers can "upgrade" to write status so long as no other writer is active, and can determine, when desired, whether a writer has conflicted with their activity so far. Readers must be prepared, however, to back out manually and retry on conflict.


to execute in parallel. Each of these mechanisms comes with significant strengths and limita-

tions: R/W locks allow the broadest set of operations within the critical section, but typically

require static knowledge of whether a critical section might perform writes; RCU ensures no

blocking in a read-only operation, but constrains the behavior allowed within a critical section

(such as permitting only forward traversal of a doubly-linked list); and sequence locks forbid

both function calls and any traversal of linked data structures. Furthermore, the performance

characteristics of these mechanisms (such as the two atomic operations required for a R/W

lock), or their programmer overhead (e.g., the manual instrumentation overhead for rollback

of sequence locks) make it troublesome to use them, even in situations where these techniques

appear appropriate.

The nature of many critical sections in systems software suggests an approach that spans

the boundary between locks and transactions: specifically, we may be able to leverage TM

research to create a more desirable locking mechanism. The Transactional Mutex Lock (TML), a scalable locking mechanism, was developed by Michael F. Spear. TML offers the generality of

mutex locks and the read-read scalability of sequence locks, while avoiding the atomic oper-

ation overheads of R/W locks or the usage constraints of RCU and sequence locks. We used

AOU to enable event-based communication between lock acquirer (writer) and current holders

(readers). This enables threads to detect lock-related operations without any polling (which

improves performance) and simplifies the memory management issues since threads are imme-

diately informed when the lock is stolen.

3.6.1 Background : Transactional Mutex Locks

TML is built atop an STM with minimal storage and instrumentation overheads. In contrast

to previous “lightweight” STMs, we explicitly limit both programming features and potential

scalability. In turn, TML can operate with as little as one word of global metadata, two words

of per-thread metadata, and low per-access instrumentation.

The most straightforward STM API consists of four functions: TMBegin and TMEnd mark

the boundaries of a lexically scoped transaction, and TMRead and TMWrite are used to read

and write shared locations, respectively. Compared to STMs, TML institutes several simplifications.


TMBegin:
  1 while (true)
  2   lOrec = gOrec
  3   if (isEven(lOrec))
  4     break

TMEnd:
  1 if (isOdd(lOrec))
  2   gOrec++

TMRead(addr):
  1 tmp = *addr
  2 if (gOrec != lOrec)
  3   throw aborted()
  4 return tmp

TMWrite(addr, val):
  1 if (isEven(lOrec))
  2   if (!cas(&gOrec, lOrec, lOrec + 1))
  3     throw aborted()
  4   lOrec++
  5 *addr = val

Figure 3.7: Single-orec STM Algorithm.

Any writing transaction is inevitable (it will never be rolled back). Inevitability precludes

the use of condition synchronization after a transaction’s first write (for simplicity, we omit

self-abort altogether). At the same time, it means that writes can be performed in place without

introducing the possibility of out-of-thin-air reads in concurrent readers.

Figure 3.7 lists the algorithm. A single word of global metadata (gOrec) provides all

concurrency control. When odd, it indicates that a writer is active; when even, zero or more

readers may be active. A single word of metadata is stored per transaction: a local copy of the global orec (lOrec). Instrumentation is likewise low. The single global orec (gOrec) is sampled

at transaction begin, and stored in transaction-local lOrec. To write, a transaction attempts

to atomically increment gOrec to lOrec + 1. Reads postvalidate to ensure that gOrec and

lOrec match (which is trivially true for transactions that have performed any writes), and the

commit sequence entails only incrementing gOrec in writing transactions. For simplicity, any

memory management operation (malloc or free) is treated as a write, and is prefixed with

the instrumentation of TMWrite lines 1–4.


With only one orec, the runtime does not support parallelism in the face of any writes, but

the per-write instrumentation can be simplified: no mapping function is called on each access,

and the read log has a single entry. If every transaction is expected to read at least one location

(a reasonable assumption under most, but not all transactional semantics [MBS08]), then it is

correct to hoist and combine all “prevalidate” operations to a single operation at the beginning

of the transaction. The algorithm is livelock-free: once W increments gOrec, it is guaranteed

to commit (it cannot abort due to conflicts, nor can it self-abort).

3.6.2 AOU Acceleration for Locks

AOU eliminates the need for read instrumentation within TML transactions. With AOU the

alerted processor immediately jumps to a handler that can either (a) validate the transaction,

re-mark the line, and continue, or (b) roll back and restart.

The changes to Figure 3.7 are trivial: in TMBegin, the transaction must use an aload to

sample gOrec, and must install a handler (which may be as simple as throw aborted()).

In TMEnd (or upon first TMWrite), the transaction must unmark the line holding gOrec. De-

pending on the implementation of AOU, writes may (in the best case) acquire gOrec with a

simple store, or (in the worst case) they may require an atomic read-modify-write. Transaction

behavior is unchanged from the baseline, except that a read-only transaction does not poll for

changes to gOrec. If its processor loses the line holding gOrec, it will be notified immedi-

ately.
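The resulting reader-side code is sketched below in the style of Figure 3.7 (aload, arelease, and the handler installation are the AOU primitives of this chapter; writer-side and virtualization details are discussed in the text):

TMBegin:
  1 set_handler({throw aborted()})
  2 while (true)
  3   lOrec = aload(&gOrec)   // sample gOrec and tag its line
  4   if (isEven(lOrec))
  5     break

TMRead(addr):
  1 return *addr              // no postvalidation: a change to gOrec alerts us

TMEnd:
  1 arelease(&gOrec)          // unmark the line (also done at first TMWrite)
  2 clear_handler
  3 if (isOdd(lOrec))
  4   gOrec++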

Virtualization support is straightforward: on a context switch, the AOU mark is discarded

from all lines. When a preempted thread resumes, the OS provides a signal, so that the thread

can test gOrec and abort if necessary. Additionally, writers must not be preempted between

lines 2 and 4 of TMWrite. Alternatively, they may briefly store a per-thread identifier in gOrec

during TMWrite.

By detecting alerts immediately at the time when a write is acquired, AOU eliminates all

read instrumentation. If read-only transaction T is traversing a linked data structure, and writer

transaction W will modify that data structure, then so long as W acquires gOrec before per-

forming any free() calls, T will be guaranteed to take an immediate alert (caused by the


acquisition of gOrec) before the free() can return memory to the operating system.

A key challenge with speculative critical sections like those in TML is the memory allocator. If a writer acquires the lock and deallocates a memory location (as it can with mutex locks), readers could potentially be dereferencing illegitimate data, which can lead to dangerous side effects. By providing immediate aborts, AOU ensures that concurrent readers are immediately stopped and their alert handlers are invoked. This allows transactions to employ conventional memory allocator libraries within critical sections.

3.6.3 Evaluation

We evaluate four run-time systems. The runtimes are configured as follows. The optimizations

in variants of TML were all performed by hand. We use the simulation parameters listed in

Table 3.4.

• TML – The default TML implementation, with transaction metadata accessed via thread-

local storage, and setjmp/longjmp for rollback.

• TML-tls – Removes thread-local storage overhead from TML. TM implementations typ-

ically store per-thread metadata on the heap. Every transactional access requires the

address of the calling thread’s metadata, and rather than add an extra parameter to ev-

ery function call to provide a reference to this metadata, we rely on thread-local storage

(TLS). With OS support, TLS is almost free; otherwise, an expensive pthread API

call provides TLS. Avoiding TLS overhead is a well-understood optimization [WCW07];

TML merely makes it simpler.

• TML-pwi – Builds upon TML by adding the PWI (Post-write-instrumentation) optimiza-

tion. When a transaction issues its first write to shared memory, via TMWrite, it becomes

inevitable and cannot abort. Other concurrent transactions are also guaranteed to abort,

and to block until the writer completes. Thus once a transaction performs its

first write, instrumentation is not required on any subsequent read or write. For any call to

TMRead that occurs on a path that has already called TMWrite, lines 2–3 can be skipped.

Similarly, for any call to TMWrite that occurs on a path that has already called TMWrite,


lines 1–4 can be skipped. Thus after the first write the remainder of a writing transaction

will execute as fast as one protected by a single mutex.

• TML-aou – Adds AOU support, as well as PWI optimizations, to TML. AOU eliminates

all read instrumentation, since by design any change to the lock variable will trigger an

alert at the reader's end. PWI is used to optimize the write instrumentation.

Results appear in Figure 3.8; all performance numbers are normalized to coarse-grain locks.

The ability to remove all read instrumentation has a substantial impact on the list and tree

benchmarks. The list is particularly interesting, since PWI offers little benefit. Additionally, the

impact is noticeable even on the counter, for which our compiler optimizations had little effect

on the Niagara. However, AOU introduces an unfortunate tradeoff: since AOU is modeled as a

spontaneous subroutine call, we require setjmp for rollback. This leads to noticeable overhead

on the SPARC, in part due to register windows and in part due to the single-issue cores. Best-

effort hardware TM systems [TrC08] have also proposed micro-architecture support for register

checkpointing; this support would have a clear beneficial impact on TML-aou performance.

3.7 Application 3: Detecting Atomicity Bugs

Data-race bugs occur when there are concurrent accesses to the same piece of shared data, and at least one of the accesses is a write. Flanagan and Qadeer [FlQ03] introduced the notion of atomicity violation and distinguished it from a data race. Atomicity violations are a specific class of concurrency bugs that occur due to runtime interleavings between memory accesses from different threads that break the atomicity expectations of the programmer. They typically occur when the software (or programmer) annotates the program incorrectly, so that two accesses that are expected to fall in the same critical section (or atomic region) are interleaved by an external access.

At ASPLOS 2006 Lu et al. [LT06] proposed the use of an access invariant to detect such

bugs. The access invariant is held by an instruction if the access pair, composed of itself and

its preceding local access to the same location, is never unserializably interleaved.


[Figure 3.8: Performance of Transactional-Mutex-Locks with AOU acceleration. Three panels, (a) List, (b) RBTree, and (c) Counter, plot throughput against thread count (1, 2, 4, 8, 16) for TML, TML-tls, TML-pwi, and TML-aou. Performance is normalized to a coarse-grain single-thread lock.]

The thread whose atomicity is disrupted is the local thread (its accesses are called local accesses), and the thread whose access is interleaved is the remote thread [LT06]. The accessing instruction is

known as the I-instruction, the immediately preceding access is the P-instruction, and the remote

access is the R-instruction. Violations occur when any of the interleavings listed in Table 3.5

occur. To detect these interleavings, software needs to monitor the P and I instructions and

detect the R-instruction.

Lu et al. [LT06] proposed specialized hardware bits in the cache tag to record information on the P- and R-instructions and detect violations at the I-instruction. We instead detect atomicity violations using AOU. To detect races on a location A, a thread marks the cache line as alert-on-update and software records information on each access.


Table 3.5: Atomicity violation bugs defined by Lu et al. [LT06]

P-Inst  I-Inst  R-Inst  Interleaving  Atomicity violation
R       R       W       (RR)←W        Reads not consistent within atomic section.
W       R       W       (WR)←W        Write in atomic section overwritten by external
                                      thread; read does not get the correct value.
W       W       R       (WW)←R        Writes in atomic section are not atomic; earlier
                                      write visible to remote read before later write
                                      has propagated.
R       W       W       (RW)←W        Writes in atomic section may not be consistent
                                      with read in atomic section; remote write
                                      dependent on the read may view external write.

R = Read; W = Write. Operations in parentheses indicate the pair expected to be atomic; the arrow indicates the interleaved remote operation. All operations in the table operate on a single variable.

When a remote access occurs on the line, the thread's alert handler records the %aou_alertType register (R-instruction information), %aou_alertAddress (location), and %aou_alertState (P-instruction information). At the I-instruction, AOU generates another trigger, since it is set up to monitor local accesses as well. Using the %aou_alertType register (I-instruction information) together with the previously recorded information, the system can flag violations.
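A sketch of the detection logic (the record structure is ours; the %aou registers are the AOU interface described earlier):

// Filled in by the alert handler on the remote (R) access.
struct AccessRecord { int type; void* addr; int p_info; } last_remote;

void alert_handler() {
    last_remote.type   = read_aou_alertType();     // R-instruction info
    last_remote.addr   = read_aou_alertAddress();  // location
    last_remote.p_info = read_aou_alertState();    // P-instruction info
    // The next local access (the I-instruction) raises a second alert;
    // its %aou_alertType, combined with last_remote, identifies which
    // interleaving of Table 3.5 occurred, and the violation is flagged.
}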

To measure the overheads, we used the benchmarks provided by Lu et al. [LT06] and compare the AOU and AVIO schemes with a software binary-instrumentation scheme (Valgrind3). The main difference between AOU violation detection and AVIO [LT06] is that AOU records information in software, while AVIO uses extra hardware bits to record it without disrupting the program. As can be seen from Table 3.6, the software handlers do add overhead over AVIO, but are still significantly better than an all-software approach. The overheads seem acceptable, given that AOU is also general purpose (e.g., RPC calls).

Multi-variable Atomicity Violations

While this section has discussed atomicity violations on a single variable, AOU can easily be extended to support multi-variable atomicity violation detection as well. Multi-variable violations occur when the P-instruction and the I-instruction access different variables. With AOU, the detection of the violation itself occurs in software.

3 Valgrind is not compatible with SPARC; hence we measure its overheads on a real x86 machine.


Table 3.6: Execution time overhead for atomicity violation detection

Application  AOU   AVIO  Valgrind
Apache       2.1   1.15  2567
Mysql        2.3   1.2   3215
lu-cont      1.3   1.05  589
radix        1.5   1.07  209

Base application execution time = 1. Valgrind overheads measured on real x86 hardware (3.3GHz, 16MB L2, Core2duo machine). AOU and AVIO overheads measured on the simulator.

The software can look up the list of accesses and check whether an earlier P-instruction (to a different address) was affected by a remote operation. AVIO [LT06] performs detection in hardware and hence would find it challenging to check all other cache lines for the occurrence of a P-instruction.

3.8 Other Applications

3.8.1 AOU for Fast User-space Mutexes

The low latency of alert signals shifts the tradeoff between spinning and yielding on lock ac-

quisition failure, especially in the case of user-level thread packages. Let us consider the ideal

behavior of a user-level mutex: a thread T will yield immediately when it fails to acquire lock

L, and will wake immediately when L is released. To approximate this behavior, we need only

prefix the acquire attempt with an aload of the lock. Then, on lock failure T can yield, and on

an alert we yield the current thread and switch back to T . In this manner no cycles are wasted

spinning on an unavailable lock, and no bus traffic is generated by multiple unsuccessful acquire

attempts.
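A sketch of the resulting acquire path (try_lock, yield, and the thread-package scheduler calls are illustrative):

// AOU-assisted user-level mutex acquire.
void mutex_acquire(Lock* L) {
    while (!try_lock(L)) {
        set_handler(on_lock_event);  // alert fires when L's line changes
        aload(&L->word);             // watch the lock word, then sleep:
        yield();                     // no spinning, no extra bus traffic
    }
}
void on_lock_event() {               // L was released (or acquired elsewhere)
    make_runnable(blocked_thread);   // switch back to T, which retries
}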

For optimal performance, the thread package may specify that the alert handler attempt to acquire L on T's behalf when an alert is delivered. This ensures the lowest latency between

the release of L and its acquisition by T . Additionally, if L is acquired by T ′ before the alert

handler can acquire it for T , the thread switch to T can be avoided. Furthermore, since alerts are

also generated by local events, this method is appropriate regardless of whether the lock holder

and failing acquirer reside on the same processor or separate processors. This technique is also

general to multithreaded code independent of whether that code uses transactions, and thus is


more general (and carries less overhead due to transaction rollback) than a similar proposal by

Zilles and Baugh [ZiB06].

3.8.2 Debugging

Modern microprocessors provide limited support for debuggers through watchpoint registers. On the x86 there are four debug registers, which GDB uses to watch memory regions 1, 2, or 4 bytes long. GDB must maintain reference counts for these debug registers to allow multiple watchpoints to share them. With pervasive parallelism, four debug watchpoint registers will probably be insufficient. Alert-on-update allows the debugger to set watchpoints only at the coarser cache-line granularity, but it supports a much larger number of watchpoints, up to the size of the primary cache. Using the feedback provided by the alert signal, which indicates the address being alerted on, the debugger can refine watchpoints down to word granularity in software.

3.8.3 Code Security

Due to the fine granularity of the alert-on-update mechanism, it is suitable for reacting to mem-

ory corruption in some settings where page-based detection mechanisms are either too expen-

sive or too space-inefficient. We do not consider AOU to be a fine-grained replacement for all

or even most page-based memory protection techniques: page protection traps occur before memory is modified, whereas AOU alerts occur after a location has been modified,

making AOU clearly unsuitable for tasks such as garbage collection [ApL91].

Buffer Overflows In order to detect buffer overflows in legacy code, a program could aload

portions of its stack. A particularly appealing technique, inspired by DieHard, is to use random-

ization across installations of an application: the compiler could choose a random number of

cache lines of buffering between stack frames at compile time, and then aload those empty lines

as part of the function prologue. Since the size of the padding is known at compile time, pa-

rameter passing via the stack would not be compromised, but the randomization of the padding

across builds of the program would make it unlikely that an attacker could attack multiple installations of a program with the same input. Such an attack would very likely result in an alert-based detection for at least one installation, thereby revealing the buffer vulnerability.

Notification on Dynamic Code Modification Another appealing use of aload is to permit

fine-grained protection of code pages. Although the majority of applications do not require the

ability to dynamically modify their code, and are well served by using page-level protection,

there is no mechanism by which applications that modify their own code pages can ensure that malicious code does not make unauthorized modifications.

With aload, however, a program could mark the alert bits of code pages, and then use a

private signature field to indicate to the alert handler when the application is making a safe

modification to its code. If the alert handler is invoked and the private signature matches a hash

of the address causing the alert, the handler can safely assume that the alert was caused by a

trusted operation. If the alert handler detects a signature conflict, it can assume that the code

pages are being modified by an untrusted agent, and can raise an appropriate exception.


Chapter 4

Isolation: Programmable Data Isolation

In this chapter, we elaborate on programmable-data-isolation (PDI), a hardware mechanism

that enables software to have fine-grain control over the propagation of write operations in

a multiprocessor system. Section 4.1 motivates the need for a data isolation mechanism in

shared-memory systems and briefly introduces the notion of lazy cache coherence. Section 4.2

introduces a lazy coherence protocol based on broadcast networks, TMESI-Bcast, and discusses

the coherence encoding that allows bulk state manipulations on isolated data. Section 4.3 dis-

cusses a directory-based version of TMESI and introduces a mechanism for detecting specula-

tive sharing between processors to optimize bulk state manipulations. Finally, Section 4.4 and

Section 4.5 discuss two flexible transactional memory systems, RTM and FlexTM, that exploit

TMESI to implement transactional isolation. We use these TM systems to also illustrate var-

ious techniques to virtualize data isolation when the cache runs out of space. RTM explores

a software-only approach (Section 4.4.3), while FlexTM explores two hardware techniques: a

hardware overflow controller (Section 4.6.2) and fine-grain translation hardware (Section 4.6.3).

Section 4.8 concludes with a discussion of other applications that could benefit

from PDI.


4.1 Motivation

In shared-memory systems, coherence guarantees that a write operation in software is implic-

itly visible to all threads that share that data. While this implicit propagation of writes provides fast, flexible communication, it is not always desired. For example, vulnerabilities in Adobe's PDF plugin enable attackers to make changes that are immediately visible to the

browser [PaF06]. Shared memory systems do not provide any mechanism to hide a plugin’s

updates from the core application. Once a write operation is issued, the software does not have

any control over the communication of the value to other copies in the system.

In emerging programming aids like Thread-Level-Speculation [SBV95; SCZ00] and Trans-

actional Memory (TM) [HWC04; MBM06; AAK05; RaG02; MaT02], software-marked code regions perform speculative memory operations that are assumed to be invisible to concurrent tasks until the end of the speculation region. Isolation of memory state is required to guarantee

the atomicity of such tasks.

Finally, online software testing is another application that requires low-overhead data iso-

lation capability. In vivo (IV) testing of software applications [CMK08; TXZ09; MKC07] is

a recently proposed approach, in which microbenchmarks test deployed software at specific

points. To enable isolated online testing, we need to provide each test case with its own separate snapshot of memory that sandboxes the test's updates. Currently, online tests are sandboxed within heavyweight OS processes; an alternative lightweight isolation mechanism would enable us to run more stringent tests more frequently.

In this dissertation, we propose a hardware mechanism, Programmable-Data-Isolation

(PDI), that enables software to control the transparency of writes in shared memory. Soft-

ware can use PDI to (1) decouple the execution of a write from its propagation, i.e., execute a write operation, make it immediately visible to the local thread, but hide the write operation from remote threads until specified by software, (2) atomically make visible the set of isolated memory blocks to other threads in the system, and (3) atomically "undo" isolated blocks and

restore coherency with the rest of the memory system.

We use the sample execution in Figure 4.1 to illustrate PDI. The example involves three threads: T1 and T3, both of which execute speculative, isolated code regions, and T2, which performs non-isolated loads and stores. PDI hides the stores marked for isolation within the speculative tasks T1 and T3, i.e., such stores, while visible to subsequent local accesses, are not visible to accesses in other threads. When a speculative region commits, it propagates its modifications, from which point future accesses (either local or remote) can view the new values. As shown, the updates from T1 are visible to the second load in T2. At T1's commit, T3 aborts its isolated region to ensure coherence, so that subsequent accesses receive the up-to-date values.

[Figure 4.1: Example of data isolation. Initially X = 0. Threads T1 and T3 each begin an isolated region (Begin_Isolation); T1 performs Store X,1 then Load X (1) and commits; T3 performs Store X,2 then Load X (2) and aborts. Concurrently, T2 performs Load X (0) before T1's commit and Load X (1) after it. The code regions marked for isolation are speculative. "Store addr,value" indicates the value written to addr; "Load addr (value)" indicates the value read from addr.]

4.1.1 Previous Approaches to Data Isolation

To implement data isolation, the memory system needs to support multiple versions of a cache

block. Caches inherently provide data buffering, but coherence protocols normally propa-

gate modifications as soon as possible to all copies. Hardware transactional memory pro-

posals investigated versioning mechanisms to support transaction isolation. Most HTM pro-

posals [AAK05; RHL05; MBM06; HWC04; CTC06] allow a thread to delay this propaga-

tion while executing speculatively, and then to make the entire set of written blocks visible to

other threads atomically. Previous research has adopted two alternative approaches. Some


proposals track data (new and old) at the granularity of words and individual write opera-

tions, which leads to extra storage buffers and overheads proportional to the length of speculation [GFV99; RPA97; WAF07]. Furthermore, this also increases the cost of either rollback or commit, since these operations need to be performed at a fine granularity on individual locations. Another body of work seeks to amortize the cost of speculative state and its associated coherence operations across multiple cache lines. These designs enable speculation at the granularity

of chunks (groups of memory operations), and at commit attempt to acquire coherence store

permissions for all cache lines speculatively written in that chunk [HWC04; CTM07]. These

designs require an aggressive memory system with support for bulk coherence operations and

global commit arbitration.

4.1.2 Our Approach : Lazy Coherence

The main challenges associated with supporting data isolation are as follows. First, we need a memory system in which the speculatively written values and the non-speculative current values can be maintained simultaneously in the memory hierarchy. The speculative values, while transparent to remote tasks, need to be accessible within the speculative task. Second, conventional coherence protocols maintain the single-writer or multiple-reader invariant. Speculative memory operations can be enabled across multiple processors, which leads to multiple copies of the same cache line being concurrently read and written. Finally, coherence protocol operations need to

logically appear to happen on groups of memory addresses, to preserve a mutually consistent

view.

To address these challenges we propose the notion of lazy coherence in which coherence

messages are eagerly sent out at each memory operation (speculative or non-speculative) but the

coherence state changes are performed lazily (under software control) to enable isolation. We

use a level of cache close to the processor to hold the new speculative copy of data, and rely on

shared lower levels of the memory hierarchy to hold the current values of the cache line. This

design has many benefits: we can employ a standard memory hierarchy, with conventional cache lines being used as a buffer to hold the speculative data. Since the speculative data have already drained into the cache, when a speculative chunk commits (or aborts) the new values can be published (or discarded) with a few simple operations on the local cache state.


Table 4.1: Programmable Data Isolation API

Registers
%t_in_flight                a bit to indicate that an isolated task is
                            currently executing

Instructions
begin_t                     set the %t_in_flight register to indicate the
                            start of a speculative region
tstore [%r1], %r2           write the value in register %r2 to the word at
                            address %r1; isolate the line (TMI state)
tload [%r1], %r2            read the word at address %r1, place the value in
                            register %r2, and tag the line as transactional
abort                       discard all isolated (TMI or TI) lines; clear all
                            transactional tags and reset the %t_in_flight
                            register
cas-commit [%r1], %r2, %r3  compare %r2 to the word at address %r1; if they
                            match, commit all isolated writes (TMI lines) and
                            store %r3 to the word; otherwise discard all
                            isolated writes; in either case, clear all
                            transactional tags, discard all isolated reads
                            (TI lines), and reset the %t_in_flight register

Like traditional cache coherence, lazy coherence forwards coherence messages at the time of the individual memory operations, and we reuse the actions that already exist in a conventional cache coherence protocol. The coherence messages generated by speculative operations are noted at the remote cache, but the remote processor is allowed to continue accessing its cached copy. When the remote chunk subsequently commits, we invalidate the state of the cache lines for which speculative memory operations were intercepted. This can also be achieved with

a few completely local state change operations.

4.2 Broadcast-based TMESI

In this section, we describe the first-generation transactional MESI (TMESI) protocol, which was developed as part of the RTM transactional memory project (see Section 4.4). This protocol was designed assuming that a broadcast interconnect links the processor-private L1s with the shared L2 in a multicore; like traditional broadcast protocols, it implements processor responses using shared wired-OR lines.

Table 4.1 presents the processor interface for data isolation. Speculative reads and writes use the TLoad and TStore instructions. Since PDI was first developed in the context of transactional memory, the API uses the term "transactional" to refer to speculative instructions. These instructions are interpreted as speculative when the transactional bit (%t_in_flight) is set.


As described in Section 4.4, this allows the same code path to be used by speculative blocks that use hardware support for speculation and by those that employ software support when cache resources overflow. TStore is used for writes that require isolation. TLoad is used for reads that can safely be cached despite remote TStores.
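To make the interface concrete, the following C++ sketch shows how a short speculative region might be written against this API. The wrapper functions are hypothetical stand-ins for the Table 4.1 instructions, stubbed here with plain memory operations so the sketch compiles; on TMESI hardware each would expand to the corresponding opcode, and the cache states, not the stubs, would provide the isolation.

    #include <atomic>
    #include <cstdint>

    // Hypothetical wrappers for the Table 4.1 instructions. The stubs model
    // only single-threaded semantics; real isolation comes from TMI/TI states.
    static thread_local bool t_in_flight = false;

    inline void begin_t() { t_in_flight = true; }            // set %t_in_flight
    inline uint32_t tload(const uint32_t* a) { return *a; }  // tag line transactional
    inline void tstore(uint32_t* a, uint32_t v) { *a = v; }  // buffer line in TMI
    inline bool cas_commit(std::atomic<uint32_t>* a, uint32_t expect,
                           uint32_t update) {
        t_in_flight = false;                   // region ends on commit or abort
        return a->compare_exchange_strong(expect, update);  // publish TMI on success
    }

    // Speculatively transfer 'amount' between two counters. A status word
    // serves as the CAS-Commit target, as in the RTM runtime of Section 4.4.
    bool speculative_transfer(uint32_t* from, uint32_t* to, uint32_t amount,
                              std::atomic<uint32_t>* status) {
        begin_t();
        uint32_t f = tload(from);
        uint32_t t = tload(to);
        tstore(from, f - amount);              // isolated until commit
        tstore(to, t + amount);
        return cas_commit(status, 0u, 1u);     // commit writes iff status unchanged
    }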

Speculative reads and writes employ two new coherence states: TI and TMI. These states allow a software policy, if it so chooses, to activate lazy coherence, permitting multiple speculative tasks to share cache blocks for reading and writing in a transparent manner. A conflict occurs between two copies of a cache block when two concurrent tasks both access the block and at least one of the accesses is a write. Hardware helps in the detection task by piggybacking a threatened (T) signal/message, analogous to the traditional shared (S) signal/message, on responses to read-shared bus requests whenever the line exists in TMI state somewhere in the system. The T signal warns a reader of the existence of a speculative writer.

TMI serves to buffer speculative local writes. Regardless of its previous state, a line moves to TMI in response to a PrTWr (the result of a TStore). The first TStore to a modified cache line results in a writeback prior to the transition to TMI, ensuring that lower levels of the memory hierarchy have the latest non-speculative value. A TMI line then reverts to M on commit and to I on abort. Software must ensure that among any concurrent speculative chunks that have written a line, at most one commits, and that if a conflicting reader and writer both commit, the reader does so first from the point of view of program semantics. A line in TMI state threatens remote read requests and suppresses its data response, allowing lower levels of the hierarchy to supply the non-speculative version of the data; the remote reader loads the line in TI state.

TI allows continued use of data that have been read by the current transaction, but may have

been speculatively written by a concurrent transaction in another thread. An I line moves to

TI when threatened during a TLoad; an M, E, or S line moves to TI when written by another

processor while tagged transactional (indicating that a TLoad has been performed by the current

transaction). A TI line must revert to I when the current transaction commits or aborts, because

a remote processor has made speculative changes which, if committed, would render the local

copy stale. No writeback or flush is required since the line is not dirty.

The CAS-Commit instruction performs the usual function of compare-and-swap.


[Figure 4.2 (state-transition diagram) omitted; only the caption notes are reproduced here.]

Dashed boxes enclose the MESI and TMESI subsets of the state space. On a CAS-Commit, TM, TE, TS, and TI revert to M, E, S, and I, respectively; TMI reverts to M if the CAS succeeds, or to I if it fails. Notation on transitions is conventional: the part before the slash is the triggering message; after it is the ancillary action ('–' means none). X stands for the set {BusRdX, UpgrX, BusTRdX, UpgrTX}. "Flush" indicates writing the line to the bus. S and T indicate signals on the "shared" and "threatened" bus lines, respectively. Plain, they indicate assertion by the local processor; parenthesized, they indicate the signals that accompany the response to a BusRd request. An overbar means "not signaled". For simplicity, we assume that the base protocol prefers memory–cache transfers over cache–cache transfers. The dashed transition from the TMESI state space to the MESI state space indicates that actions occur only on the corresponding cache line. "ABORT state" is the state to which the line would transition on abort. The solid "CAS-Commit" and "ABORT" transitions from the TMESI state space to the MESI state space operate on all transactional lines.

Figure 4.2: TMESI Broadcast Protocol

In addition, if the CAS succeeds, speculatively written (TMI) lines revert to M, thereby making the data visible to other readers through normal coherence actions. If the CAS fails, TMI lines are invalidated, and software branches to an abort handler. In either case, speculatively read (TI) lines revert to I, and any transactional tags are flash-cleared on M, E, and S lines.


Table 4.2: Coherence state encoding for fast commits and aborts.

  T   A   MESI   C/A   M/I   State
  0   —   00     —     —     I
  0   —   11     0     1     I
  0   —   01     —     —     S
  0   —   10     —     —     E
  0   —   11     1     —     M
  0   —   11     0     0     M
  1   —   00     —     —     TI
  1   —   01     —     —     TS
  1   —   10     —     —     TE
  1   —   11     —     0     TM
  1   —   11     —     1     TMI

  T     Line is (1) / is not (0) transactional
  A     Line is (1) / is not (0) alert-on-update
  MESI  2 bits: I (00), S (01), E (10), or M (11)
  C/A   Most recent txn committed (1) or aborted (0)
  M/I   Line is/was in TMI (1)

The motivation behind CAS-Commit is simple: software TM systems invariably use a CAS or its equivalent to commit the current transaction; we overload this instruction to make buffered transactional state once again visible to the coherence protocol. The Abort instruction clears the transactional state in the cache in the same manner as a failed CAS-Commit.

TMESI-Bcast enforces the single-writer or multiple-reader invariant for non-transactional lines. For transactional lines, the protocol also enforces that (1) TStores can update only lines in TMI state, and (2) threatened TLoads can cache the block only in TI state. Software is expected to ensure that at most one of any set of overlapping speculations commits. It can restore coherence to the system by triggering an Abort on the remote speculation's cache, without having to re-acquire exclusive access to store sets [HWC04; CTC06].

4.2.1 Bulk State Changes

A cache line can be in any one of the four MESI states (I, S, E, M), the speculative states (TI, TMI), or the transactionally tagged variants of M, E, and S. If the protocol were implemented as a pure automaton, this would imply a total of 9 stable states, compared to 4 in the base protocol. To allow fast commits and aborts of speculative state, our cache tags can be encoded in six bits, as shown in Table 4.2.

At commit time, based on the outcome of the CAS in CAS-Commit, we pull down (broadcast a 0 on abort) the C/A bit line of transactional lines for which the T bit has been conditionally enabled. The conditional pull-down is achieved by adding two series pull-down transistors to the C/A bit; together they form a logical AND of an external conditional-enable signal and the adjacent T bit, so the C/A bits of all lines with the T bit set are pulled down in bulk when the conditional enable is asserted. Following this, we flash-clear the A and T bits. Flash clear can be achieved with a single pull-down transistor enabled by a flash-clear signal. Ken Mai's thesis discusses the circuits for conditional flash clear and flash clear in detail [Mai05]. For TM, TE, TS, and TI the flash clear alone would suffice, but TMI lines must revert to M on commit and to I on abort. We use the C/A bit to distinguish between these cases: when the line is next accessed, M/I and C/A are used to interpret the state before being reset. If T is 0, the MESI bits are 11, C/A is 0, and M/I is 1, then the cache line state is invalid and the MESI bits are changed to reflect this. In all other cases, the state reflected by the MESI bits is valid.
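The lazy reinterpretation can be expressed compactly in software terms. The sketch below is our own illustration of the Table 4.2 rules, not the actual tag-array logic: it decodes a six-bit tag into the state the next access should observe.

    #include <cstdint>

    enum class LineState { I, S, E, M, TI, TS, TE, TM, TMI };

    // Decode a Table 4.2 tag. 'mesi' holds the two MESI bits
    // (I=00, S=01, E=10, M=11); 'ca' is C/A and 'mi' is M/I.
    LineState decode_tag(bool t, uint8_t mesi, bool ca, bool mi) {
        if (t) {                              // transactional: C/A is irrelevant
            switch (mesi) {
                case 0b00: return LineState::TI;
                case 0b01: return LineState::TS;
                case 0b10: return LineState::TE;
                default:   return mi ? LineState::TMI : LineState::TM;
            }
        }
        // Non-transactional: a line that was in TMI (M/I=1) when the last
        // transaction aborted (C/A=0) is actually invalid despite MESI=11.
        if (mesi == 0b11 && mi && !ca) return LineState::I;
        switch (mesi) {
            case 0b00: return LineState::I;
            case 0b01: return LineState::S;
            case 0b10: return LineState::E;
            default:   return LineState::M;
        }
    }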

Note that when the T bit is set, we can ignore the C/A bit when interpreting the coherence state. We could even eliminate the C/A bit by adding three pull-down transistors to the MESI state bits. These transistors would logically AND the conditional-enable signal, the T bit, and the M/I bit; on a conditional enable, if the T bit and M/I bit are set, the MESI state drops to I.

4.3 Directory-based TMESI

A key challenge with TMESI-Bcast is the addition of five stable states to the basic MESI pro-

tocol. TMESI-Dir adapts PDI to a directory protocol and extends it to incorporate signatures

(Section 3.4.3). It also simplifies the management of speculative reads, adding only two new

stable states to the base MESI protocol, rather than the five employed in RTM [SSH07]. Details

appear in Figure 4.3.

It would be possible to eliminate the transactionally tagged TM, TE, and TS states entirely, at the cost of some extra reloads in the event of read-write sharing. Suppose thread T1 has read line X speculatively at some point in the past. The transactional tag indicates that X was TLoaded as part of the current speculative region. A remote write to X (appearing as a BusRdX protocol message) can move X to I, because software (or other hardware, such as signatures) will be tracking potential conflicts. If TLoads are replaced with normal loads and/or the transactional tags are eliminated, T1 will need to drop X to I, but a subsequent load will bring it back to TI.


[Figure 4.3 (state-transition diagram) omitted; the caption notes and tables are reproduced below.]

Dashed boxes enclose the MESI and PDI subsets of the state space. Notation on transitions is conventional: the part before the slash is the triggering message; after it is the ancillary action ('–' means none). GETS indicates a request for a valid sharable copy; GETX for an exclusive copy; TGETX for a copy that can be speculatively updated with TStore. X stands for the set {GETX, TGETX}. "Flush" indicates a data block response to the requestor and directory. S indicates a Shared message; T a Threatened message. Plain, they indicate a response by the local processor to the remote requestor; parenthesized, they indicate the message that accompanies the response to a request. An overbar means logically "not signaled".

State encoding:

  State   M bit   V bit   T bit
  I       0       0       0
  S       0       1       0
  M       1       0       0
  E       1       1       0
  TMI     1       0       1
  TI      0       0       1

Responses to requests that hit in Wsig or Rsig:

  Request Msg   Response Msg (hit in Wsig)   Response Msg (hit in Rsig)
  GETX          Threatened                   Invalidated
  TGETX         Threatened                   Exposed-Read
  GETS          Threatened                   Shared

Figure 4.3: TMESI Directory Protocol.


The TMESI-Dir base cache protocol for private L1s and a shared L2 is an adaptation of the SGI ORIGIN 2000 [LaL97] directory-based MESI, with a full-map directory maintained at the L2. TMESI-Bcast uses the coherence states to detect speculative sharing and specify the coherence responses. With the TM, TE, and TS states eliminated, an alternative mechanism is required to track the locations read within a speculative region. Inspired by hardware TM systems like Bulk [CTC06] and LogTM-SE [YBM07], TMESI-Dir uses Bloom filter signatures [Blo70] to summarize the read and write sets of transactions in a concise but conservative fashion (i.e., false positives but no false negatives). Local L1 controllers use these signatures to respond to both the directory and the requestor (the response to the directory is used to indicate whether the cache line has been dropped or retained). Requestors make three different types of requests: GETS on a read (Load/TLoad) miss in order to get a copy of the data, GETX on a normal write (Store) miss/upgrade in order to get exclusive permissions as well as potentially an updated copy, and TGETX on a transactional store (TStore) miss/upgrade.
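A minimal software model of such a signature is sketched below; the two hash functions and the 1024-bit size are illustrative choices, not the hardware's actual parameters.

    #include <bitset>
    #include <cstdint>

    // Toy Bloom-filter signature over physical block addresses. Tests may
    // report false positives but never false negatives, which is exactly the
    // conservatism that signature-based conflict detection tolerates.
    class Signature {
        static constexpr size_t kBits = 1024;              // illustrative size
        std::bitset<kBits> bits_;
        static size_t h1(uint64_t a) { return (a >> 6) % kBits; }
        static size_t h2(uint64_t a) {
            return ((a >> 6) * 0x9E3779B97F4A7C15ull) % kBits;
        }
    public:
        void insert(uint64_t paddr) { bits_.set(h1(paddr)); bits_.set(h2(paddr)); }
        bool maybe_contains(uint64_t paddr) const {
            return bits_.test(h1(paddr)) && bits_.test(h2(paddr));
        }
        void clear() { bits_.reset(); }        // flash-clear at commit/abort
    };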

A TStore results in a transition to the TMI state in the private cache (encoded by setting the T bit and the dirty bit of conventional MESI). A TMI line reverts to M on commit (propagating the speculative modifications) and to I on abort (discarding the speculative values). To the directory, the local TMI state is analogous to the conventional E state: the directory realizes that the processor can transition to M (silent upgrade) or I (silent eviction), and any data request needs to be forwarded to the processor to discover the latest state. The only modification required at the directory is the ability to support multiple owners. We accommodate this need by adding a mechanism similar to the existing support for multiple sharers: we track owners when they issue a TGETX request and ping all of them on other requests. In response to any remote request for a TMI line, the local L1 controller sends a Threatened response, analogous to the Shared response to a GETS request on an S or E line.

In addition to transitioning the cache line to TMI, a TStore also updates the Wsig. A Wsig lookup is performed to threaten remote requestors (readers and writers alike, represented as (T) in Figure 4.3). TLoad likewise updates the Rsig.

TLoads that are threatened (i.e., that encounter a concurrent remote writer) move the line to the TI state (encoded by setting the T bit while in the I (invalid) state). (Note that a TLoad from E or S can never be threatened; the remote transition to TMI would already have moved the line to I.) TI lines must revert to I on commit or abort, because if a remote processor commits its speculative TMI block, the local copy could go stale. The TI state appears as a conventional sharer to the directory.

On L1 requests forwarded from the directory, the local cache controller tests the signatures and appends the appropriate message type to the response message. On a miss in the Wsig, the result of testing the Rsig is used; on a miss in both, the L1 cache responds as in normal MESI. The local controller also piggybacks a data response if one is deemed necessary (M state). Signature-based response types are shown in Figure 4.3. Threatened indicates write sharing (a hit in the Wsig), Exposed-Read indicates read sharing (a hit in the Rsig), and Shared or Invalidated indicates no conflict.

Hits to the Wsig always threaten the requestor, indicating a speculative writer. Hits to the Rsig on a TGETX produce an "Exposed-Read" response to indicate a reader conflict with the transactional writer. If the cache state is M, then a data response is piggybacked on all response messages. Hits to either the Rsig or the Wsig result in a Shared (S) message to the directory, so that the responder continues to be perceived as a sharer and receives future requests for access checks.

4.3.1 Conflict Summary Tables

A challenge with TMESI-Bcast is the commit operation: remote copies of cache lines that have been locally modified and committed need to be invalidated. The TMESI protocol arranges for such lines to be in the TI state at the time of the speculative write, but the commit process needs to arrange for those lines on the remote cores to transition to I when an abort is issued. To trigger the aborts, the committer needs information about the specific remote processors with which it shares cache lines speculatively. TMESI-Dir addresses this challenge by gathering information about sharers of conflicting speculative cache lines. A conflicting cache line is one shared between two or more processors where at least one of the processors has written the line speculatively. TMESI-Dir exposes this information to software, which can then make the appropriate decision about which processor to trigger an abort on.

We devise a bitmap structure, the conflict summary tables (CSTs), to record the occurrence of speculatively shared cache lines between processors. CSTs indicate which speculative regions on two processors conflict, rather than the locations on which they conflict. This information


concisely captures what software needs to know in order to resolve conflicts at the time of

its choosing. Software can choose when to examine the tables, and can use whatever other

information it desires (e.g., priorities) to drive its choice of which speculative chunk can commit.

Specifically, each processor has three conflict summary tables, each of which contains one bit for every other processor in the system. Named R-W, W-R, and W-W, the CSTs indicate that a local read (R) or write (W) has conflicted with a read or write (as suggested by the name) on the corresponding remote processor. The W-R and W-W lists at a processor P represent the speculative tasks that might need to be aborted when the speculative task at P wants to commit. The R-W list helps disambiguate abort triggers; if an abort is initiated by a processor not marked in the table, then software can safely ignore the message (i.e., it is not from a conflicting speculative task). On each coherence request, the controller reads the local Wsig and Rsig, sets the local CSTs accordingly, and includes information in its response that allows the requestor to set its own CSTs to match. When it sends a Threatened or Exposed-Read message, the responder sets the bit corresponding to the requestor in its R-W, W-W, or W-R CST, as appropriate. The requestor likewise sets the bit corresponding to the responder in its own CSTs, as appropriate, when it receives the response. In Section 4.5, we show that CSTs, when exposed to software, enable flexible software-controlled TM conflict resolution policies. A sketch of the responder-side bookkeeping appears below.
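For a machine with up to 64 processors, the three tables are simply three 64-bit vectors per processor. The sketch below models the responder-side update; the message types are taken from Figure 4.3, while the function and field names are our own.

    #include <cstdint>

    // Per-processor conflict summary tables (one bit per remote processor).
    struct CSTs {
        uint64_t r_w = 0;   // local read  conflicted with a remote write
        uint64_t w_r = 0;   // local write conflicted with a remote read
        uint64_t w_w = 0;   // local write conflicted with a remote write
    };

    enum class Response { Shared, Invalidated, Threatened, ExposedRead };

    // Responder-side bookkeeping: when our signatures force a Threatened or
    // Exposed-Read response, record the requestor. The requestor sets the
    // mirror-image bit in its own CSTs when the response arrives.
    void record_conflict(CSTs& mine, int requestor, bool request_is_write,
                         Response resp) {
        const uint64_t bit = 1ull << requestor;
        if (resp == Response::Threatened)           // hit in our Wsig
            (request_is_write ? mine.w_w : mine.w_r) |= bit;
        else if (resp == Response::ExposedRead)     // hit in our Rsig on a TGETX
            mine.r_w |= bit;
    }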

4.4 Application of TMESI-Bcast: RTM Project

In this section, we use TMESI-Bcast to develop a flexible hardware-software TM, RTM

(Rochester Transactional Memory). RTM [SSH07] promotes policy flexibility by decoupling

version management from conflict detection and management; specifically, it separates data and metadata, and performs conflict detection only on the latter. While RTM's conflict detection mechanism detects conflicts immediately, software can choose (by controlling the timing of metadata inspection and updates) when conflicts are resolved. We permit, but do not

require, read-write and write-write sharing, with delayed detection of conflicts. We also employ

a software contention manager [ScS05] to arbitrate conflicts and determine the order of com-

mits. RTM also illustrates the use of software to virtualize the proposed hardware mechanisms.

The RTM runtime is based on the open-source RSTM system [MSH06], a C++ library that

78

runs on legacy hardware. RTM uses alert-on-update and programmable data isolation to avoid

copying and to reduce bookkeeping and validation overheads, thereby improving the speed of

“fast path” transactions. When a transaction’s execution time exceeds a single quantum, or

when the working set of a transaction exceeds the ALoad or TStore capacity of the cache, RTM

restarts the transaction in a more conservative “overflow mode” that supports unboundedness in

both space and time. We use RTM to illustrate how cache evictions can be handled entirely

in software.

4.4.1 RTM Transaction

Transactions are lexically scoped, and delimited by BEGIN_TRANSACTION and

END_TRANSACTION macros. The first of these sets the alert handler for a transaction and

configures per-transaction metadata. The second issues a CAS-Commit. In order to access fields

of an object, a thread must obtain read or write permission by performing an open_RO or

open_RW call.

Every RTM transaction is represented by a descriptor (Figure 4.4) containing a serial num-

ber and a word that indicates whether the transaction is currently ACTIVE, COMMITTED, or

ABORTED. The serial number is incremented every time a new transaction begins.

Every transactional object is represented by a header containing five fields: a pointer to

an "owner" transaction, the owner's serial number, pointers to the valid (old) and speculative (new)

versions of the object, and a bitmap listing overflow transactions currently reading the object. 1
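In outline, the metadata might be declared as follows; the types and field names are our own C++ rendering of Figure 4.4, not the literal RSTM-derived declarations.

    #include <atomic>
    #include <cstdint>

    enum class TxStatus : uint32_t { ACTIVE, COMMITTED, ABORTED };

    // Per-transaction descriptor: the status word is ALoaded so that a
    // remote abort (a CAS to ABORTED) alerts the owner immediately.
    struct TxDescriptor {
        std::atomic<TxStatus> status;
        uint32_t serial;            // incremented as each new transaction begins
    };

    // Per-object header (cf. Figure 4.4).
    struct ObjectHeader {
        TxDescriptor* owner;        // last acquiring transaction
        uint32_t owner_serial;      // owner->serial captured at acquire time
        void* old_version;          // last committed payload
        void* new_version;          // speculative clone, or NULL for fast path
        uint64_t overflow_readers;  // bitmap of overflow-mode readers
    };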

Open_RO returns a pointer to the most recently committed version of the object. Typi-

cally, the owner/serial number pair indicates a COMMITTED transaction, in which case the New

pointer is valid if it is not NULL; otherwise the Old pointer is valid. If the owner/serial number

pair indicates an ABORTED transaction, then the Old pointer is always valid. When the owner

is ACTIVE, there is a conflict.
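Using the hypothetical declarations above, the version-selection rule of open_RO reduces to a few lines; the serial-number check and the contention-manager interaction are elided.

    // Select the currently valid version: for a COMMITTED owner, the New
    // pointer (if non-NULL) is current; for an ABORTED owner, the Old pointer
    // always is. An ACTIVE owner signals a conflict, returned as nullptr here
    // for the contention manager to resolve.
    void* valid_version(const ObjectHeader& h) {
        switch (h.owner->status.load()) {
            case TxStatus::COMMITTED:
                return h.new_version ? h.new_version : h.old_version;
            case TxStatus::ABORTED:
                return h.old_version;
            default:
                return nullptr;   // ACTIVE: conflict
        }
    }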

Open_RW returns a pointer to a writable copy of the object. At some point between its

open_RW and commit time, a transaction must acquire every object it has written. The acquire

1 The reader list is a software bitmap very similar to the sharer list associated with cache lines in a directory protocol. The reader list informs the writer of all the current transactions actively reading the object.


[Figure 4.4 (metadata layout diagram) omitted. It shows two transaction descriptors (each a Status word and Serial Number), an object header (Owner, Serial Number, Old Object, New Object, Overflow Readers), and the old and cloned data objects.]

Here a writer transaction is in the process of acquiring the object, overwriting the Owner pointer and Serial Number fields, and updating the Old Object pointer to refer to the previously valid copy of the data. A fast-path transaction will set the New Object field to null; an overflow transaction will set it to refer to a newly created clone. Several overflow transactions can work concurrently on their own object clones prior to acquire time, just as fast-path transactions can work concurrently on copies buffered in their caches.

Figure 4.4: RTM metadata.

operation first gets permission from a software contention manager [HLM03b; ScS05] to abort

all transactions in the overflow reader list. It then writes the owner’s ID, the owner’s serial

number, and the addresses of both the last valid version and the new speculative version into the

header using a Wide-CAS instruction (not shown in Table 4.1).2 Finally, it aborts any transactions

in the overflow reader list of the freshly acquired object.
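A sketch of the acquire step, again over the hypothetical types above: wide_cas() stands in for the multi-word hardware CAS of footnote 2, and the contention-manager and abort helpers are placeholders for this illustration.

    // Placeholder helpers for this sketch.
    bool wide_cas(ObjectHeader* h, const ObjectHeader& expect,
                  const ObjectHeader& update);      // multi-word CAS (footnote 2)
    bool cm_permission_to_abort(uint64_t readers);  // ask the contention manager
    void abort_overflow_reader(int reader_id);      // CAS reader's TSW to ABORTED

    bool acquire(ObjectHeader* h, TxDescriptor* tx, void* valid, void* clone) {
        if (!cm_permission_to_abort(h->overflow_readers))
            return false;                           // contention manager said no
        ObjectHeader expect = *h;
        ObjectHeader update = expect;
        update.owner        = tx;
        update.owner_serial = tx->serial;
        update.old_version  = valid;                // last valid version
        update.new_version  = clone;                // NULL fast-path, clone overflow
        if (!wide_cas(h, expect, update)) return false;
        for (int i = 0; i < 64; ++i)                // finally, abort overflow readers
            if (expect.overflow_readers & (1ull << i))
                abort_overflow_reader(i);
        return true;
    }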

At the end of a transaction, a thread issues a CAS-Commit to change its state from ACTIVE

to COMMITTED. If the CAS fails because another thread has set the state to ABORTED, the

transaction is retried.

2 As in Itanium's cmp8xchg16 instruction [Int06], if the first two words at location A match their "old" values, all words are swapped with the "new" values (loaded into contiguous registers). Success is detected by comparing old and new values in the registers.


4.4.2 Fast-Path RTM Transactions

Eliminating Data Copying

A fast-path transaction calls begin_hw_t inside the BEGIN_TRANSACTION macro. Subse-

quent TStores will be buffered in the cache; their data will remain invisible to other threads until

the transaction commits (at the hardware level, of course, the existence of lines to which TStores

have been made is visible in the form of "threatened" signals/messages). A fast-path transaction works directly on the valid version that would be returned by open_RO; its updates are buffered in the cache. Programmable data isolation thus avoids the need to create a separate writable copy,

as is common in software TM systems (RSTM among them). When a fast-path transaction

acquires an object, it writes a NULL into the New pointer, since the old pointer is both the last

and next valid version. A newly arriving transaction that sees mismatched serial numbers will

read the appropriate version.

Reducing Bookkeeping and Validation Costs

In most software TM systems a transaction must verify that all its previously read objects are

still valid before it performs any dangerous operation. ALoad allows validation to be achieved

essentially for free (see Section 3.5). Whenever an object is read (or opened for writing lazily),

the transaction uses ALoad to mark the object’s header in the local cache. Since transactions

cannot commit changes to an object without modifying the object header first, the remote ac-

quisition of a locally ALoaded line results in an immediate alert to the reader transaction. Freed

of the need to explicitly validate previously opened objects, software can also avoid the book-

keeping overhead. Best of all, perhaps, a transaction that acquires an object implicitly aborts

all fast-path readers of that object simply by writing the header: fast-path readers need not

add themselves to the list of readers in the header, and the O(t) cost of aborting the readers is

replaced by the invalidation already present in the cache coherence protocol. An RTM trans-

action aborts and retries in overflow mode in response to such an event, or to the invalidation or eviction of an

A-tagged or TMI line.


4.4.3 Overflow RTM Transactions

Fast-path RTM transactions are bounded in space and time; they cannot ALoad or TStore more

lines than the cache can hold, and they cannot execute across a context switch. To accommodate

larger or longer transactions, RTM employs an overflow mode with only one hardware require-

ment: that the transaction’s ALoaded descriptor remain in the cache whenever the transaction is

running. Since committing a fast-path writer updates written objects in-place, we must ensure

that a transaction in overflow mode also notices immediately when it is aborted by a competitor.

We therefore require that every transaction ALoad its own descriptor. If a competitor CAS-es

its status to ABORTED, the transaction will suffer an immediate alert. It also writes itself into

the Overflow Reader list of every object it reads; this ensures it will be explicitly aborted by

writers.

While only one ALoaded line is necessary to ensure immediate aborts and to handle valida-

tion, using a second ALoad can improve performance when an overflow transaction clones an object. If the overflow writer is cloning an object when a fast-path writer commits, the clone operation may return an internally inconsistent object. The transaction therefore ALoads the header before cloning the object, so that a conflicting commit triggers an immediate alert instead of going unnoticed. We assume in our experiments that the hardware is able (with a small

victim cache) to prefer non-ALoaded lines for eviction, and to keep at least two in the cache.

In overflow mode, the transaction leaves the %hardware_t bit clear, instructing the

hardware to interpret TLoad and TStore instructions as ordinary loads and stores. This conven-

tion allows the overflow transaction to execute the exact same user code as fast-path transac-

tions; there is no need for a separate version. Without speculative stores, open_RW calls in the

overflow transaction must clone to-be-written objects. At acquire time, the WCAS instruction

writes the address of the clone into the New field of the metadata. When open_RO calls en-

counter a header whose last Owner is committed and whose New field is non-null, they return

the New version as the current valid version.

Context Switches

To support transactions that must be preempted, we require two actions from the operating

system. When it swaps a transaction out, the operating system flash clears all the A tags. In


addition, for transactions in fast-path mode, it executes the abort instruction to discard iso-

lated lines. When it swaps the transaction back in, it starts execution in a software-specified

restart handler (separate from the alert handler). The restart handler aborts and retries if the

transaction was in fast-path mode or was swapped out in mid-clone; otherwise it re-ALoads the

transaction descriptor and checks that the transaction status has not been changed to ABORTED.

If this check succeeds, control returns as normal; otherwise the transaction jumps to its abort

code.
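A sketch of the restart handler's decision logic, with all helper names hypothetical and the descriptor types reused from the earlier sketch:

    // Run when a preempted transaction is scheduled back in.
    void aload(std::atomic<TxStatus>* w);   // re-arm alert-on-update (placeholder)
    void abort_and_retry(TxDescriptor* tx);
    void jump_to_abort_code(TxDescriptor* tx);

    void restart_handler(TxDescriptor* tx, bool was_fast_path, bool was_mid_clone) {
        if (was_fast_path || was_mid_clone) {
            abort_and_retry(tx);            // isolated lines were lost at swap-out
            return;
        }
        aload(&tx->status);                 // re-ALoad the transaction descriptor
        if (tx->status.load() == TxStatus::ABORTED)
            jump_to_abort_code(tx);         // a competitor aborted us meanwhile
        // otherwise control returns as normal
    }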

4.4.4 Latency of RTM Transactions

In this section we study the latency characteristics of RTM transactions and investigate the

overheads of the various TM runtime components. The RTM system evaluated in this disserta-

tion is an object based system in which applications need to declare the shared data and use the

specified interface. We evaluate RTM with six microbenchmarks, HashTable, RBTree, RBTree-

Large, LinkedList-Release, LFUCache and RandomGraph with varied transaction characteris-

tics. Appendix A provides a detailed description of the microbenchmarks.

We evaluate each benchmark with two RTM configurations: RTM-F always executes fast-

path transactions to extract maximum benefit from the hardware; RTM-O always executes over-

flow transactions to demonstrate worst-case throughput. We also compare RTM to RSTM and

to the RTM-Lite runtime which only uses AOU described in Chapter 3. Like a fast-path RTM

transaction, an RTM-Lite transaction ALoads the headers of the objects it reads; it does not per-

form any validation. Since PDI not available, however, it must version every acquired object.

Every RTM-Lite transaction keeps an estimate of the number of lines it can safely ALoad. If it

opens more objects than this, it keeps a list of the extra objects and validates them incremen-

tally, as RSTM does. If it suffers a “no more space” alert, it reduces its space estimate, aborts,

and restarts. As a baseline best-case single-thread execution, we compare against a coarse-

grain locking library (CGL), which enforces mutual exclusion by mapping the BEGIN_ and

END_TRANSACTION macros to acquisition and release of a single coarse-grain test-and-test-

and-set lock.


To ensure a fair comparison, we use the same benchmark code, memory manager, and con-

tention manager (Polka [ScS05]) in all systems. Briefly, Polka performs exponential backoff

when a transaction aborts and increases the stall time based on the difference between the num-

ber of read accesses performed by the conflicting transactions. To avoid the need for a hook

on every load and store, we modify the memory manager to segregate the heap and to place

shared object payloads at high addresses (metadata remains at low addresses). The simulator

then interprets memory operations on high addresses as TLoads and TStores.

Figure 4.5 presents a breakdown of transaction latency at 1 thread and 8 threads. App Tx

represents time spent in user-provided code between the BEGIN_ and END_TRANSACTION

macros; time in user code outside these macros is App Non-Tx. Validation records any time

spent by the runtime explicitly validating its read set; Copy is the time spent cloning objects;

MM is memory management; CM is contention management. Fine-grain metadata and book-

keeping operations are aggregated in Bookkeeping. For single-thread runs, the time spent in

statistics collection for contention management is merged into bookkeeping; for multi-thread

runs, Abort is the sum of all costs in aborted transactions.

We draw the following overall conclusion.

Result: Separating the metadata that tracks conflicts from the shared data that needs versioning makes both amenable to hardware acceleration and enables flexible software-controlled TMs. However, this separation of data and metadata can cause excessive bookkeeping overheads due to the indirection required on every data access. The interoperability of concurrent hardware and software transactions complicates the critical path of hardware transactions.

On a single thread, RTM-F exploits PDI and the shorter code path to attain a maximum speedup

of 8.7× on RandomGraph and a geometric-mean speedup of 3.5× across all the benchmarks.

Figure 4.5 shows that bookkeeping is the dominant overhead in RTM-F relative to CGL. One

source of overhead is RTM’s use of indirection to access an object. This in turn stems from our

choice of metadata organization. RTM decouples metadata cache blocks on which conflicts are

detected using AOU from the data cache blocks which are made invisible to the shared memory

system. Access to shared data lines incur an extra pointer access on the critical path through the

metadata.


While this supports flexible conflict resolution, it also adds a significant number of instructions to those executed by hardware transactions (22 extra instructions in open_RO, 38 extra instructions in open_RW, 47 extra instructions in begin_transaction, and 22 in end_transaction) compared to the optimal path of other hardware TMs (e.g., HASTM [SAJ06]). Bookkeeping overheads limit the gains that RTM-F can obtain for these benchmarks. We also tried to optimize the runtime to elide the per-access instrumentation for transactions that are guaranteed to run in single-thread mode (i.e., with no other active transactions). In Section 4.5 we demonstrate a more streamlined HTM that exploits the full potential of AOU and PDI.

As the breakdowns show, RTM and RTM-Lite successfully leverage ALoad to eliminate

RSTM’s validation overhead. RTM-F also eliminates copying costs. Due to the small object

sizes in benchmarks other than RBTree-Large, this gain is sometimes compensated for by dif-

ferences in metadata structure and corresponding bookkeeping overhead. Similar analysis at 8

threads (Figure 4.5) continues to show these benefits, although increased network contention

and cache misses, as well as limited concurrency in some benchmarks, cause increased latency

per transaction.

HashTable exhibits embarrassing parallelism, since transactions are short (5 cache lines read and 2 cache lines written) and conflicts are rare. These properties prevent the hardware from offering much additional benefit. In single-thread mode, the cost of copying is 4.3% of transaction latency in RSTM. Since the read set is small, the cost of validation is 16.8%. RTM-F eliminates the validation overhead and reduces bookkeeping to improve transaction latency by X%. In single-thread mode, RTM-F reduces bookkeeping by 2.2× relative to RTM-Lite due to single-thread optimizations. However, memory management and bookkeeping account for 52% of RTM-F's transaction latency. Furthermore, these costs increase with the number of threads; at 8 threads, the bookkeeping cost increases by 2.5×.

LinkedList-Release has a high cost for metadata manipulation and bookkeeping; together, they account for 88% of the latency of an RSTM transaction. RTM-F removes the validation overhead and reduces RSTM's bookkeeping overheads by 2.8×. However, it still incurs 58% overhead over CGL. RTM-Lite performs slightly better than RTM-F beyond one thread, since RTM-F incurs 11% more bookkeeping than RTM-Lite there; this increase outweighs the benefit obtained from eliminating the copying.


[Figure 4.5 (stacked-bar charts) omitted. The bars break per-transaction latency into Abort, Copy, Validation, CM, Bookkeeping, MM, App Non-Tx, and App Tx components for CGL, RSTM, RTM-Lite, and RTM-F on each benchmark.]

Breakdown of per-transaction latency for 1 thread (top) and 8 threads (bottom). All results are normalized to 1-thread RSTM.

Figure 4.5: RTM transaction execution time breakdown

Tree rebalancing in RBTree gives rise to conflicts. By deferring to transactions that have invested more work, and by backing off before retrying, the Polka [ScS05] contention manager keeps aborts to within 5% of the total transactions committed. At 8 threads, transactions spend ∼10% of their time in contention management. As shown in Figure 4.5, validation is a significant overhead in RSTM transactions (40% of transaction latency); both RTM-F and RTM-Lite eliminate the validation cost. Despite this, in single-thread mode, bookkeeping for RTM-F transactions is 17× higher than the optimal critical path of CGL. In RBTree-Large, RTM-F is able to leverage TLoad and TStore to eliminate long-latency copy operations, reducing latency by another 19%. At 8 threads, roughly 58% of RTM-F's transaction latency is bookkeeping.

LFUCache has little concurrency, and all TM systems experience very high latencies at 8 threads due to the wasted work in aborts. At 8 threads, RSTM transactions take 9.2× as long as they do on a single thread, and aborts account for 75% of total time. Commit latency in a transaction increases by 2.4× compared to a single thread. At 8 threads, RTM-F eliminates copying and validation, and reduces bookkeeping, resulting in a 2.5× reduction in latency over RSTM. LFUCache's small transactions stress RTM's bookkeeping. Even with single-thread optimizations, RTM has 2.2× higher latency than CGL. At 8 threads, the increased complexity of the multi-thread code path and the absence of the single-thread optimization mean that bookkeeping increases by 3× compared to a single thread.

Transactions in RandomGraph are complex and conflict with high probability. Validation is expensive and aborts are frequent. In RSTM, validation dominates single-thread latency, contributing 79% of overall execution time. Leveraging ALoad to eliminate validation enables RTM-F to improve performance by a factor of 8.7× over RSTM. With the validation overheads eliminated, the bookkeeping overheads become clearly noticeable: 60% of a transaction's time is spent in bookkeeping. When there is any concurrency, the choice of eager acquire causes all the TMs to livelock on RandomGraph with the Polka contention manager.3

4.4.5 Hardware-Software Transactions

The previous subsections analyzed the performance of RTM fast-path (RTM-F) transactions.

Figure 4.6 presents average transaction latency as the fraction of fast-path transactions is varied

from 0–100%, normalized to latency in the all-fast-path case. The figure shows a linear increase

in latency as the percentage of overflow transactions is increased, with the fraction of time

spent in overflow mode being directly proportional to this percentage. Overflow transactions

do not block or significantly affect the performance of fast-path transactions when executed

concurrently.

4.5 Application of TMESI-Dir: FlexTM

In this section, we develop a TM system, FlexTM, which uses conflict summary tables to provide low-overhead and flexible conflict resolution between transactions. FlexTM

3 We have tagged the top of the breakdown plots for 8 threads as "livelock" since almost 99% of transaction time is spent in aborts. This livelocking behavior can be avoided even with eager acquire by using a Greedy contention manager [GHP05a].


[Figure 4.6 (stacked-bar charts) omitted. The bars show normalized transaction latency split into fast-path (Tx-F) and overflow (Tx-O) time as the fast-path fraction varies from 100% to 0% on each benchmark.]

Breakdown of time spent in fast-path and in overflow mode, normalized to the all-fast-path execution (16 threads).

Figure 4.6: Interaction of RTM-F and RTM-O transactions

(FLEXible Transactional Memory) [SDS08] separates conflict detection from resolution and

management, and leaves software in charge of the latter. Simply put, hardware always detects

conflicts eagerly during the execution of a transaction and records them in the CSTs, but soft-

ware chooses when to notice and what to do about it.

A key contribution of FlexTM is a commit protocol that arbitrates between transactions in a

distributed fashion, allows parallel commits of an arbitrary number of transactions, and imposes

a performance penalty proportional to the number of transaction conflicts. It enables lazy conflict

resolution without commit tokens [HWC04], broadcast of write sets [HWC04; CTC06], or

ticket-based serialization [CCC07]. To our knowledge FlexTM is the first hardware TM in

which the decision to commit or abort can be an entirely local operation, even when performed

lazily by an arbitrary number of threads in parallel.

FlexTM deploys four hardware primitives: (1) Bloom filter signatures (as in Bulk [CTC06] and LogTM-SE [YBM07]) to track and summarize a transaction's read and write sets; (2) conflict summary tables (CSTs) to concisely capture conflicts between transactions; (3) the versioning system of RTM (programmable data isolation, PDI), simplified and adapted to directory-based coherence and augmented with an overflow mechanism; and (4) the Alert-On-Update mechanism to help transactions detect their status. Figure 4.7 shows the hardware additions.

[Figure 4.7 (block diagram) omitted. It shows the per-core additions: the Rsig and Wsig signatures, the R-W/W-R/W-W CSTs, AOU and PDI control, the overflow-table controller registers (thread id, Osig, overflow count, committed/speculative flag, table base addresses and geometry), and context-switch support, alongside the private L1 cache controller and the shared L2 controller with its sharer lists and signature summaries.]

Figure 4.7: FlexTM Architecture Overview

4.5.1 Bounded FlexTM Transactions

A FlexTM transaction is represented by a software descriptor (Table 4.4). This descriptor includes a status word, space for buffering the hardware state when the transaction is paused (CSTs, signatures, and overflow control registers), pointers to the abort handler (AbortPC) and contention management handler (CMPC), and a field to specify the conflict resolution mode of the transaction.

A transaction is delimited by BEGIN_TRANSACTION and END_TRANSACTION macros

(see Figure 4.8). BEGIN_TRANSACTION establishes the conflict and abort handlers for the

transaction, checkpoints the processor registers, configures per-transaction metadata, sets the

transaction status word (TSW) to active, and ALoads that word (for notification of aborts).

Some of these operations are not intrinsically required and can be set up for the entire lifetime of

a thread (e.g., AbortPC and CMPC). END_TRANSACTION aborts conflicting transactions and

tries to atomically update the status word from active to committed using CAS-Commit.


Within a transaction, the processor issues TLoads and TStores when it expects transactional

semantics, and conventional loads and stores when it wishes to bypass those semantics. TLoads

and TStores are interpreted as speculative when the hardware transaction bit (%hardware_t)

is set. This convention facilitates code sharing between transactional and non-transactional

program fragments. Ordinary loads and stores can be requested within a transaction; these could

be used to implement escape actions, update software metadata, or reduce the cost of thread-

private updates in transactions that overflow cache resources. In order to avoid the need for

compiler generation of the TLoads and TStores, our prototype implementation follows typical

HTM practice and interprets ordinary loads and stores as TLoads and TStores when they occur

within a transaction.

Transactions of a given application can employ either Eager or Lazy conflict resolution. In Eager mode, when conflicts appear through response messages (i.e., Threatened and Exposed-Read), the processor effects a subroutine call to the handler specified by CMPC. The conflict manager either stalls the requesting transaction or aborts one of the conflicting transactions. The remote transaction can be aborted by atomically CASing its TSW from active to aborted, thereby triggering an alert (since the TSW is always ALoaded); a sketch appears below. FlexTM supports a wide variety of conflict management policies (even policies that require the ability to synchronously abort a remote transaction). When an Eager transaction reaches its commit point, its CSTs will be empty, since all prior conflicts will have been resolved. It attempts to commit by executing a CAS-Commit on its TSW. If the CAS-Commit succeeds (replacing active with committed), the hardware flash-commits all locally buffered (TMI) state. The CAS-Commit fails, leaving the buffered state local, if the CAS does not find the expected value (i.e., a remote transaction managed to abort the committing transaction before the CAS-Commit could complete).
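Aborting a remote transaction in Eager mode therefore amounts to a single CAS on its status word; a minimal sketch, with names ours:

    #include <atomic>
    #include <cstdint>

    enum class TSW : uint32_t { ACTIVE, COMMITTED, ABORTED };

    // Try to abort 'victim' by CASing its (always-ALoaded) status word from
    // active to aborted; success triggers an immediate alert on the victim's
    // processor. Returns false if the victim already committed or aborted,
    // i.e., it won the race.
    bool eager_abort(std::atomic<TSW>& victim_tsw) {
        TSW expected = TSW::ACTIVE;
        return victim_tsw.compare_exchange_strong(expected, TSW::ABORTED);
    }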

In Lazy mode, transactions are not alerted into the conflict manager. The hardware simply

updates requestor and responder CSTs. To ensure serialization, a Lazy transaction must, prior to

committing, abort every concurrent transaction that conflicts with its write-set. It does so using

the END TRANSACTION() routine shown in Figure 4.8.

All of the work for the END TRANSACTION() routine occurs in software, with no need

for global arbitration [CTC06; CCC07; HWC04], blocking of other transactions [HWC04], or


BEGIN_TRANSACTION()
1. clear Rsig and Wsig
2. set AbortPC
3. set CMPC
4. TSW[my_id] = active
5. ALoad(TSW[my_id])
6. begin_t

END_TRANSACTION()  /* Non-blocking, pre-emptible */
1. if (TSW[my_id] != active) goto AbortPC
2. copy-and-clear W-R and W-W registers
3. foreach i set in W-R or W-W
4.    abort_id = manage_conflict(my_id, i)
5.    if (abort_id != NULL)  // not resolved by waiting
6.       CAS(TSW[abort_id], active, aborted)
7. CAS-Commit(TSW[my_id], active, committed)
8. if (TSW[my_id] == active)  // failed due to nonzero CST
9.    goto 1

Figure 4.8: Pseudocode of BEGIN_TRANSACTION and END_TRANSACTION.

special hardware states. The routine begins by using a copy-and-clear instruction (e.g., clruw on the SPARC) to atomically read and clear its own W-R and W-W. In lines 3–6 of Figure 4.8, for each of the bits that was set, transaction T aborts the corresponding transaction R by atomically changing R's TSW from active to aborted. Transaction R, of course, could try to CAS-Commit its TSW and race with T, but since both operations occur on R's TSW, conventional cache coherence guarantees serialization. After T has successfully aborted all conflicting peers, it performs a CAS-Commit on its own status word. If the CAS-Commit fails and the failure can be attributed to a non-zero W-R or W-W (i.e., new conflicts), the END_TRANSACTION() routine is restarted. In the case of an R-W conflict, no action is needed, since T is the reader and is about to serialize before the writer (i.e., the two transactions can commit concurrently). Software mechanisms can be used to disambiguate conflicts and avoid spurious aborts when the writer commits.

The contention management policy (line 4) in the commit process is responsible for providing various progress and performance guarantees, and the TM system can choose to plug in an application-specific policy. For example, using a Timestamp manager [ScS05] ensures livelock freedom. More recently, EazyHTM [TPK09] has exploited CST-like bitmaps to accelerate a pure HTM's commit, but it does not allow pluggable contention management policies. FlexTM's commit operation is entirely in software, and its latency is proportional to the number of conflicting transactions; in the absence of conflicts there is no overhead. Even in the presence of conflicts, aborting each conflicting transaction consumes only the latency of a single CAS operation (at most one coherence operation).


4.5.2 Mixed Conflict Resolution

While Lazy generally provides the best performance [ShD09], with its ability to exploit concurrency and ensure progress, it does introduce certain challenges. Lazy requires a multiple-writer and/or multiple-reader protocol, which makes notable additions to a basic MESI protocol. Multiple L1s need to be able to concurrently cache a block and to read and write it (quite different from the basic "S" and "M" states). This is a source of additional complexity over an Eager system and could prove to be a barrier to adoption.

Furthermore, allowing write-write sharing introduces non-trivial performance challenges. Write-write conflicts need to be conservatively treated as dueling read-write and write-read conflicts, since a transaction that obtains permission to write a block can also read it. It is not possible to allow both transactions to commit concurrently (one of them has to abort). Commit-time conflict resolution in Lazy does try to ensure progress, but the workload characteristics can lead to significant wasted work on delayed aborts and can erode the performance benefits of concurrency (see the effect on the STMBench7 workload in [ShD09]).

A possible way to address Lazy's challenges is to disallow multiple-writer sharing, since it does not seem to be prevalent in the first generation of TM workloads (see Appendix A). We extend FlexTM to support the Mixed mode proposed previously [Sco06; ShD09]. In Mixed mode, when a write-write conflict appears (a TStore receives a Threatened response), the processor effects a call to the contention manager. On read-write or write-read conflicts, the hardware records the conflict in the CSTs and allows the transaction to proceed. When the transaction reaches its commit point, it needs to handle only W-R conflicts (with an algorithm similar to Figure 4.8), as its W-W CST will be empty. Mixed tries to save the work wasted on write-write conflicts (where allowing concurrent execution does not help) and to exploit the parallelism present in W-R and R-W conflicts.

Mixed has modest versioning requirements compared to Lazy. A system that supports only Mixed (it can also support Eager) can simplify the coherence protocol and overflow mechanisms. Briefly, Mixed maintains the single-writer and/or multiple-reader invariant: it allows only one writer for a cache block (unlike Lazy), although the writer can co-exist with concurrent readers (unlike Eager). At any given instant, there is only one speculative copy accessed by the single writer and/or a non-speculative version accessed by the concurrent readers. This simplifies the design of the TMI state in the TMESI protocol: only one of the L1 caches in the system can have the line in TMI (not unlike the "M" state in MESI).

4.6 Virtualizing Cache Overflows in FlexTM

To provide the illusion of unbounded space, FlexTM must provide mechanisms to handle trans-

actional state evicted from the L1 cache. Cache evictions must be handled carefully. First,

signatures rely on forwarded requests from the directory to trigger lookups and provide con-

servative conflict hints (Threatened and Exposed-Read messages). Second, TMI lines holding

speculative values need to be buffered and cannot be merged into the shared level of the cache.

We first describe our approach to handling coherence-based conflict detection for evicted lines,

followed by two alternative schemes for versioning of evicted TMI lines in Section 4.6.2 and

Section 4.6.3.

4.6.1 Eviction of Transactionally Read Lines

Conventional MESI performs silent eviction of E and S lines to avoid the bandwidth overhead of

notifying the directory. In FlexTM, silent evictions of E, S, and TI lines also serve to ensure that

a processor continues to receive the coherence requests it needs to detect conflicts. (Directory

information is updated only in the wake of L1 responses to L2 requests, at which point any

conflict is sure to have been noticed.) When evicting a cache block in M, FlexTM updates the

L2 copy but does not remove the processor from the sharer list. Processor sharer information

can, however, be lost due to L2 evictions. To preserve the access conflict tracking mechanism,

L2 misses result in querying all L1 signatures in order to recreate the sharer list. This scheme

is much like the sticky bits used in LogTM [MBM06].

4.6.2 Overflow table (OT) Controller Design

This design has been adopted by FlexTM and employs a per-thread overflow table (OT) to buffer

evicted TMI lines. The OT is organized as a hash table in virtual memory. It is accessed both by


software and by an OT controller that sits on the L1 miss path. The latter implements (1) fast

lookups on cache misses, allowing software to be oblivious to the overflowed status of a cache

line, and (2) fast cleanup and atomic commit of overflowed state.

The controller registers required for OT support appear in Figure 4.7. They include a thread

identifier, a signature (Osig) for the overflowed cache lines, a count of the number of such lines,

a committed/speculative flag, and parameters (virtual and physical base address, number of sets

and ways) used to index into the table.

On the first overflow of a TMI cache line, the processor traps to a software handler, which

allocates an OT, fills the registers in the OT controller, and returns control to the transaction.

To minimize the state required for lookups, the current OT controller design requires the OS to

ensure that OTs of active transactions lie in physically contiguous memory. If an active transac-

tion’s OT is swapped out, then the OS invalidates the Base-Address register in the controller. If

subsequent activity requires the OT, the hardware traps to a software routine that re-establishes

a mapping. The hardware needs to ensure that new TMI lines are not evicted during OT set-up;

the L1 cache controller could easily support this by ensuring that at least one entry in the set is

free for non-TMI lines.

On a subsequent TMI eviction, the OT controller calculates the set index using the physical

address of the line, accesses the set tags of the OT region to find an empty way, and writes the

data block back to the OT instead of the L2. The controller tags the line with both its physical

address (used for associative lookup) and its virtual address (used to accommodate page-in at

commit time; see below). The controller also adds the physical address to the overflow signature

(Osig) and increments the overflow count.
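In software terms, the controller's eviction path behaves like an insert into a set-associative hash table keyed by physical address. The sketch below models it, reusing the toy Signature class from Section 4.3; the geometry and hashing are illustrative only.

    #include <cstdint>
    #include <cstring>

    constexpr int kWays  = 4;        // illustrative OT geometry
    constexpr int kSets  = 256;
    constexpr int kBlock = 64;       // cache-block size in bytes

    struct OTEntry {
        bool     valid = false;
        uint64_t paddr = 0;          // physical tag, for associative lookup
        uint64_t vaddr = 0;          // virtual tag, for page-in at copy-back
        uint8_t  data[kBlock];
    };

    struct OverflowTable {
        OTEntry   sets[kSets][kWays];
        Signature osig;              // overflow signature
        uint32_t  count = 0;

        // Model of the controller's action on a TMI eviction.
        bool insert(uint64_t paddr, uint64_t vaddr, const uint8_t* block) {
            OTEntry* set = sets[(paddr / kBlock) % kSets];
            for (int w = 0; w < kWays; ++w) {
                if (!set[w].valid) {
                    set[w].valid = true;
                    set[w].paddr = paddr;
                    set[w].vaddr = vaddr;
                    std::memcpy(set[w].data, block, kBlock);
                    osig.insert(paddr);   // lookaside filter for future misses
                    ++count;
                    return true;
                }
            }
            return false;                 // set full: trap to the OS to grow the OT
        }
    };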

The Osig provides quick lookaside checks for entries in the OT. Reads and writes that miss

in the L1 are checked against the signature. Signature hits trigger the L1-to-L2 request and

the OT lookup in parallel. On OT hits, the line is fetched from the OT, the corresponding OT

tag is invalidated, and the L2 response is squashed. This scheme is analogous to the speculative

memory request issued by the home memory controller before snoop responses are all collected.

When a remote request hits in the Osig of a committed transaction, the controller could perform

lookup in the OT, much as it does for local requests, or it could NACK the request until copy-

back completes. Our current implementation does the latter.


In addition to the functions previously described, the CAS-Commit operation sets the Committed bit in the controller's OT state. This indicates that the OT contents should be visible, activating NACKs or lookups. At the same time, the controller initiates a microcoded copy-back operation. To accommodate page evictions of the original locations, OT tags include the virtual addresses of cache blocks. These addresses are used during copy-back to ensure automatic page-in of any nonresident pages.

There are no constraints on the order in which lines from the OT are copied back to their

natural locations. This stands in contrast to time-based logs [MBM06], which must proceed in

reverse order of insertion. Remote requests need to check only committed OTs (since specula-

tive lines are private) and for only a brief span of time (during OT copy-back). On aborts, the

OT is returned to the operating system. The next overflowed transaction allocates a new OT.

When an OT set runs out of ways, the hardware generates a trap to the OS, which expands the OT

appropriately.

Although we require that OTs be physically contiguous for simplicity, they can themselves

be paged. In particular, it makes sense for the OS to swap out the OTs of descheduled threads.

A more ambitious FlexTM design could allow physically non-contiguous OTs, with controller

access mediated by more complex mapping information. With the addition of the OT controller,

software is involved only for the allocation and deallocation of the OT structure. Indirection

to the OT on misses, while unavoidable, is performed in hardware rather than in software,

thereby reducing the resulting overheads. Furthermore, FlexTM’s copyback is performed by

the controller and occurs in parallel with other useful work on the processor.

4.6.3 Handling Evictions with Fine-grain Translation

The OT controller mechanism described in the previous section requires a hardware state ma-

chine to maintain a write-buffer that is scattered across multiple levels in the memory hierarchy.

There is implementation complexity associated with the state machine that searches (writes

back and reloads) and accesses data blocks without any help from software.

In this section, we propose a more streamlined mechanism. We move the actions of main-

taining the data structure and performing the redo on commit to software, replacing the hash


table with buffer pages and introducing a metadata cache that enables hardware to access the

buffer pages without software intervention. Figure 4.9 shows the per-page software metadata,

which specifies the buffer-page address and for each cache block, the writer transaction id

(Tx id) and a “V/I” bit to indicate if the buffer block is buffering valid data. To convey the

metadata information to hardware and accelerate repeated block accesses, we install a metadata

cache (SM-cache) on the L1 miss path (see Figure 4.10).

[Figure: layout of the per-page software metadata: the page's virtual address, the buffer-page virtual address, and, for each cache line (1st through Nth), a writer transaction id (Tx Id) and a V/I bit.]

The buffer-page is in virtual memory. The V/I bit is set/unset by hardware on cache eviction/reload, respectively. The (V/I, Tx Id) pair denotes the following semantics when accessed by transaction T: (0, don't care): buffer-page cache block empty; (1, X) with X ≠ T: T conflicts with writer transaction X; (1, T): T has speculatively written the block and evicted it in the past.

Figure 4.9: Metadata for pages that have overflowed state

[Figure: the processor core with its private L1 cache (tag, state, A/T bits, data) and the SM-cache (virtual-address tag, metadata physical-address tag, metadata), sitting on the coherence, overflow (Wsig), and L1 miss/writeback paths.]

Simplified overflow support with the SM-cache. Dashed lines surround the new extension that replaces the OT controller (see Figure 4.7).

Figure 4.10: Software-metadata cache architecture

When a speculatively written cache line is evicted, the cache controller looks up the SM-

cache for the metadata and uses the buffer page address to index into the TLB (for the buffer-


page’s physical address) for writeback redirection. Multiple transactions that are possibly writ-

ing different cache blocks on the same page can share the same buffer-page.4 A miss in the

SM-cache triggers a software handler that allocates the buffer-page metadata and reloads the

SM-cache. To provide the commit handler with the virtual address of the cache block to be

written back, every SM-cache entry includes this information and is virtually indexed (note

that the data cache is still physically indexed). While the entire buffer-page is allocated when

a single cache-block in the original page is evicted, the individual buffer-page cache blocks

are used only as and when further evictions occur. This ensures that the overflow mechanism

adds overhead proportional to the number of cache blocks that are evicted (similar to the OT

controller mechanism). In contrast to this design, other page-based overflow mechanisms (e.g.,

XTM [CMM06] and PTM [CNV06]) clone the entire page if at least a single cache block on the page is evicted.

4: Virtual page synonyms are cases where multiple virtual pages point to the same physical frame and a thread can access the same location with different virtual addresses. To resolve these, since software knows which pages are synonyms, it ensures that the SM-cache is loaded with the same metadata for all the virtual synonym pages.
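The per-page metadata of Figure 4.9 and the eviction redirect can be sketched in C as follows; the structure layout mirrors the figure, while the names and the TLB/writeback helpers are hypothetical stand-ins for hardware:

    #include <stdint.h>

    #define LINES_PER_PAGE 64     /* assumed: 4KB page / 64-byte blocks */

    typedef struct {              /* per-cache-line state (Figure 4.9) */
        uint16_t tx_id;           /* writer transaction id */
        uint8_t  vi;              /* 1: buffer block holds valid data */
    } line_md;

    typedef struct {              /* per-page software metadata */
        uint64_t page_vaddr;      /* original page (virtual) */
        uint64_t buffer_vaddr;    /* buffer-page (virtual) */
        line_md  line[LINES_PER_PAGE];
    } page_md;

    /* Hypothetical stand-ins for the TLB probe and the block writeback. */
    uint64_t tlb_translate(uint64_t vaddr);
    void     write_block(uint64_t paddr, const uint8_t *blk);

    /* Eviction of a line speculatively written by transaction t: redirect
     * the writeback into the corresponding block of the buffer page. */
    int sm_evict(page_md *md, uint16_t t, uint64_t vaddr, const uint8_t *blk) {
        unsigned i = (vaddr >> 6) % LINES_PER_PAGE;
        if (md->line[i].vi && md->line[i].tx_id != t)
            return -1;            /* another writer overflowed here: trap */
        write_block(tlb_translate(md->buffer_vaddr + (i << 6)), blk);
        md->line[i].tx_id = t;
        md->line[i].vi = 1;       /* buffer block now holds valid data */
        return 0;
    }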

With data buffered, L1 misses now need to ensure that data is obtained from the appropriate

location (buffer-page or original). Similar to the OT controller design, we use an overflow

signature (Osig) to summarize addresses of evicted blocks and elide metadata checks. L1 misses

check the Osig, and signature hits require a metadata check. If the metadata indicates that

transaction T accessing the location had written the block (i.e., V/I bit is 1 and Tx Id=T), then

hardware fetches the buffer block and overrides the L2 response. It also unsets the V/I bit to

indicate that the buffer block is no longer valid (block is present in the cache). Otherwise, the

coherence response message dictates the action.
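Continuing that sketch, the metadata check on an Osig hit for a local miss by transaction t reduces to the (V/I, Tx Id) test just described (read_block is another hypothetical helper):

    void read_block(uint64_t paddr, uint8_t *blk);   /* hypothetical */

    /* Returns 1 if the buffer page supplied the data (L2 response overridden). */
    int sm_miss(page_md *md, uint16_t t, uint64_t vaddr, uint8_t *blk) {
        unsigned i = (vaddr >> 6) % LINES_PER_PAGE;
        if (md->line[i].vi && md->line[i].tx_id == t) {
            read_block(tlb_translate(md->buffer_vaddr + (i << 6)), blk);
            md->line[i].vi = 0;   /* block now lives in the cache again */
            return 1;
        }
        return 0;  /* otherwise the coherence response dictates the action */
    }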

On eviction of a speculatively written cache line that another transaction has written and

overflowed as well (i.e., V/I bit is 1 and Tx Id=X, X ≠ T), a handler is invoked that either al-

locates a new buffer-page and refills the SM-cache or resolves the conflict immediately. The

former design supports multiple writers to the same location (and enables Lazy conflict resolu-

tion), while the latter forces eager write-write conflict resolution, but enables a simpler design.

The Tx Id field supports precise detection of writer conflicts (see the FlexTM-S design below).

When a transaction commits, it copy-updates the original locations using software routines.

To ensure atomicity, the transaction updates its status word to inform concurrent accesses to hold off until the copy-back completes. It then iterates through the metadata of the various

buffer-pages in the working set and copies back the cache blocks that it has written.
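A minimal sketch of this software commit handler, reusing the structures above; the status-word helpers are hypothetical:

    void tsw_set_copyback_pending(uint16_t t);   /* hypothetical TSW updates */
    void tsw_set_committed(uint16_t t);

    /* Copy-update every block that transaction t wrote and overflowed. */
    void sm_commit_copyback(page_md *pages[], int npages, uint16_t t) {
        tsw_set_copyback_pending(t);        /* concurrent accesses hold off */
        for (int p = 0; p < npages; p++) {
            page_md *md = pages[p];
            for (unsigned i = 0; i < LINES_PER_PAGE; i++) {
                if (md->line[i].vi && md->line[i].tx_id == t) {
                    uint8_t blk[64];
                    read_block(tlb_translate(md->buffer_vaddr + (i << 6)), blk);
                    write_block(tlb_translate(md->page_vaddr + (i << 6)), blk);
                    md->line[i].vi = 0;
                }
            }
        }
        tsw_set_committed(t);               /* copy-back complete */
    }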

SM-Cache

The SM-cache stores metadata which hardware can use to accelerate block access and cache

evictions without software intervention. It resides on the L1 miss path and operates in parallel

with the L2 lookup (see Figure 4.10). SM-cache misses are handled entirely by software

handlers that index into it using the virtual page address. The L1 controller also uses a similar

technique to obtain metadata for redirecting evictions and for block reloads.

The metadata may be concurrently updated if different speculative cache-blocks in the page

are evicted at multiple processor sites. To ensure metadata consistency, the SM-cache partic-

ipates in coherence using the physical address of the metadata. This physical address tag is

inaccessible to software and is automatically filled by the hardware when an entry is allocated.

The dual-tagging of the SM-cache introduces the possibility that the two tags (virtual address

of page and physical address of metadata) might not map to the same set index. We solve this

with tag array pointers [Goo87] as in virtually-indexed caches.

FlexTM-S Transactions

To evaluate the performance of the SM-cache approach, we developed FlexTM-S. For bounded transactions, it operates similarly to FlexTM, but supports Mixed conflict resolution in lieu of Lazy.

Compared to FlexTM, FlexTM-S (1) simplifies hardware support for the versioning mecha-

nism by trading in FlexTM’s overflow hardware controller for an SM-cache (software metadata

cache) and (2) allows precise detection of conflicting writers. By restricting support to Mixed

and Eager modes, i.e., allowing only one speculative writer, the coherence protocol is also

simplified.

To ensure low overhead for detecting conflicting readers, FlexTM-S uses the Rsig for both

overflowed and cached state. To identify writer transactions, it uses a two-level scheme: if

the speculative state resides in the cache, the response message from the conflicting processor


identifies the transaction (the CST bits will identify the conflicter’s id). If the speculative state

has been evicted then the Osig membership tests will indicate the possibility of a conflict. This

type of conflict is also encoded in the response message. If an Osig conflict is indicated, the

requester checks the metadata for precise disambiguation, thereby eliminating false positives.

Since a block can be written by only one transaction (Mixed/Eager invariant), the Tx id in the

metadata precisely identifies the writer. If the metadata indicates no conflict, software loads

the SM-cache, instructing hardware to ignore the Osig response, and allows the transaction to proceed. Thus, the versioning metadata helps disambiguate writer transactions: it (1) identifies the conflicting writer precisely and (2) allows progress of non-conflicting transactions that would otherwise have required contention management (in Eager mode) due to signature false positives.
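Condensed into code, the two-level writer-identification scheme might look like the following sketch; the response-message fields are hypothetical:

    #define NO_CONFLICT (-1)

    typedef struct {          /* fields carried by the coherence response */
        int cst_hit;          /* remote CST named a conflicting transaction */
        int writer_id;        /* valid when cst_hit is set */
        int osig_hit;         /* remote Osig membership test fired */
    } conflict_resp;

    /* Identify the (unique) conflicting writer of a block, or NO_CONFLICT. */
    int writer_of(const conflict_resp *r, const page_md *md, unsigned i) {
        if (r->cst_hit)                       /* level 1: state still cached */
            return r->writer_id;
        if (r->osig_hit && md->line[i].vi)    /* level 2: check metadata */
            return md->line[i].tx_id;         /* Mixed/Eager: single writer */
        return NO_CONFLICT;  /* Osig false positive: load SM-cache, proceed */
    }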

4.6.4 Handling OS Page Evictions

The challenges that need to be considered are (1) when a page is swapped out and its frame is

reused for a different page in the application, and (2) when a page is re-mapped to a different

frame. Since signatures are built using physical addresses, (1) can lead to false positives, which

can cause spurious aborts but not correctness issues. In a more ambitious design, we could

address these challenges with virtual address-based conflict detection for non-resident pages.

For (2) we adapt a solution first proposed in LogTM-SE [YBM07]. At the time of the

unmap, active transactions are interrupted both for TLB entry shootdown (already required)

and to flush TMI lines to the OT. When the page is assigned to a new frame, the OS interrupts

all the threads that mapped the page and tests each thread’s Rsig, Wsig, and Osig for the old

address of each block. If the block is present, the new address is inserted into the signatures.
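A sketch of this remap fix-up; the one-word signature helpers stand in for the real, much wider hardware signatures:

    #include <stdint.h>

    #define PAGE_BYTES 4096   /* assumed page size */

    typedef struct {          /* saved per-thread signatures (hypothetical) */
        uint64_t rsig, wsig, osig;
    } thread_sigs;

    static int  sig_member(uint64_t s, uint64_t a) { return (s >> ((a >> 6) % 64)) & 1; }
    static void sig_insert(uint64_t *s, uint64_t a) { *s |= 1ull << ((a >> 6) % 64); }

    /* Page moved from old_pa to new_pa: re-insert every present block. */
    void remap_fixup(thread_sigs *th, uint64_t old_pa, uint64_t new_pa) {
        for (uint64_t off = 0; off < PAGE_BYTES; off += 64) {
            if (sig_member(th->rsig, old_pa + off)) sig_insert(&th->rsig, new_pa + off);
            if (sig_member(th->wsig, old_pa + off)) sig_insert(&th->wsig, new_pa + off);
            if (sig_member(th->osig, old_pa + off)) sig_insert(&th->osig, new_pa + off);
        }
    }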

Finally, there are differences between the support required from the paging mechanism for the

OT controller approach and the SM-Cache approach. The former indexes into the overflow table

using the physical address and requires the paging mechanism to update the tags in the table

entries with the new physical address. The latter needs no additional support since it uses the

virtual address of the buffer-page and at the time of writeback indexes into the TLB to obtain

the current physical address.


4.6.5 Context Switch Support

STMs provide effective virtualization support because they maintain conflict detection and ver-

sioning state in virtualizable locations and use software routines to manipulate them. For com-

mon case transactions, FlexTM uses scalable hardware support to bookkeep the state associated

with access permissions, conflicts, and versioning while controlling policy in software. In the

presence of context switches, FlexTM detaches the transactional state of suspended threads from

the hardware and manages it using software routines. This enables support for transactions to

extend across context switches (i.e., to be unbounded in time [AAK05]).

Ideally, only threads whose actions overlap with the read and write set of suspended trans-

actions should bear the software routine overhead. Both FlexTM and FlexTM-S handle context

switches in a similar manner. To track the accesses of descheduled threads, FlexTM maintains

two summary signatures, RSsig and WSsig, at the directory of the system. When suspending a

thread in the middle of a transaction, the OS unions (i.e., ORs) the signatures (Rsig and Wsig) of

the suspended thread into the current RSsig and WSsig installed at the directory. FlexTM updates

RSsig and WSsig using a Sig message that uses the L1 coherence request network to write the

uncached memory-mapped registers. The directory updates the summary signatures and returns

an ACK on the forwarding network. This avoids races between the ACK and remote requests

that were forwarded to the suspending thread/processor before the summary signatures were

updated.

Once the RSsig and WSsig are up to date, the OS invokes hardware routines to merge the

current transaction’s hardware state into virtual memory. This hardware state consists of (1) the

TMI lines in the local cache, (2) the overflow hardware registers, (3) the current Rsig and Wsig,

and (4) the CSTs. After saving this state (in the order shown), the OS issues an abort instruction,

causing the cache controller to revert all TMI and TI lines to I, and to clear the signatures, CSTs,

and overflow controller registers. This ensures that any subsequent conflicting access will miss

in the private cache and generate a directory request. In other words, for any given location,

the first conflict between the running thread and a local descheduled thread always results in

an L1 miss. The L2 controller consults the summary signatures on each such miss, and traps

to software when a conflict is detected. A TStore to a line in M state generates a write-back


(see Figure 4.2) that also tests the RSsig and WSsig for conflicts. This resolves the corner case

in which a suspended transaction TLoaded a line in M state and a new transaction on the same

processor TStores it.
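The descheduling sequence can be outlined as below; every helper is a hypothetical stand-in for the hardware or OS operation named in the text:

    #include <stdint.h>

    void directory_union_summary(uint64_t rsig, uint64_t wsig); /* Sig msg + ACK */
    void save_tmi_lines(int proc);
    void save_overflow_registers(int proc);
    void save_sigs_and_csts(int proc);
    void issue_abort_instruction(int proc);

    /* OS path when suspending a thread in the middle of a transaction. */
    void deschedule_tx(int proc, uint64_t rsig, uint64_t wsig) {
        /* 1. Fold Rsig/Wsig into RSsig/WSsig at the directory; the ACK on
         *    the forwarding network closes the race with in-flight requests. */
        directory_union_summary(rsig, wsig);
        /* 2. Save hardware transactional state to virtual memory, in order:
         *    TMI lines, overflow registers, Rsig/Wsig, CSTs. */
        save_tmi_lines(proc);
        save_overflow_registers(proc);
        save_sigs_and_csts(proc);
        /* 3. Abort instruction: revert TMI/TI lines to I and clear the
         *    signatures, CSTs, and overflow registers, so later conflicting
         *    accesses miss and get checked against the summary signatures. */
        issue_abort_instruction(proc);
    }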

On summary signature hits, a software handler mimics hardware operations on a per-thread

basis, testing signature membership and updating the CSTs of suspended transactions. When

using the SM-cache design, the software metadata from versioning can be used to precisely

identify the writer conflict. No special instructions are required, since the CSTs and signa-

tures of descheduled threads are all visible in virtual memory. Nevertheless, updates need to

be performed atomically to ensure consistency when multiple active transactions conflict with a

common descheduled transaction and update the CSTs concurrently. The OS helps the handler

distinguish among transactions running on different processors. It maintains a global conflict

management table (CMT), indexed by processor id, with the following invariant: if transac-

tion T is active, and has executed on processor P , irrespective of the state of the thread (sus-

pended/running), the transaction descriptor will be included in P ’s portion of the CMT.

The handler uses the processor ids in its CST to index into the CMT and to iterate through

transaction descriptors, testing the saved signatures for conflicts, updating the saved CSTs (if

running in lazy mode), or invoking conflict management (if running in eager mode). Similar

perusal of the CMT occurs at commit time if running in lazy mode. As always, we abort a

transaction by writing its TSW. If the remote transaction is running, an alert is triggered since it

would have previously ALoaded its TSW. Otherwise, the OS virtualizes the AOU operation by

causing the transaction to wake up in a software handler that checks and re-ALoads the TSW.
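The handler's walk over the CMT might look like this sketch; the descriptor layout and helper routines are hypothetical:

    #include <stdint.h>

    #define NPROCS 16

    typedef struct tx_desc {     /* saved transaction state (hypothetical) */
        struct tx_desc *next;    /* next descriptor for the same processor */
        int  proc_id;
        int  lazy;               /* 1: lazy mode, 0: eager mode */
        /* saved Rsig/Wsig/CSTs live here as well */
    } tx_desc;

    typedef struct {             /* global conflict management table */
        tx_desc *by_proc[NPROCS];
    } cmt_t;

    int  sigs_conflict(const tx_desc *a, const tx_desc *b);   /* hypothetical */
    void cst_set_atomic(tx_desc *victim, int proc);           /* atomic update */
    void contention_manage(tx_desc *attacker, tx_desc *enemy);

    /* Invoked on a summary-signature hit; cst is a bitmap of processor ids. */
    void summary_hit_handler(tx_desc *me, cmt_t *cmt, uint32_t cst) {
        for (int p = 0; p < NPROCS; p++) {
            if (!(cst & (1u << p))) continue;
            for (tx_desc *tx = cmt->by_proc[p]; tx; tx = tx->next) {
                if (!sigs_conflict(me, tx)) continue;
                if (me->lazy)
                    cst_set_atomic(tx, me->proc_id);  /* record for commit */
                else
                    contention_manage(me, tx);        /* resolve eagerly */
            }
        }
    }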

The directory needs to ensure that sticky bits are retained when a transaction is suspended.

Along with RSsig and WSsig, the directory maintains a bitmap indicating the processors on

which transactions are currently descheduled (the “Cores Summary” register in Figure 4.7).

When the directory would normally remove a processor from the sharers list (because a response

to a coherence request indicates that the line is no longer cached), the directory refrains from

doing so if the processor is in the Cores Summary list and the line hits in RSsig or WSsig. This

ensures that the L1 continues to receive coherence messages for lines accessed by descheduled

transactions. It will need these messages if the thread is swapped back in, even if it never reloads

the line.


When re-scheduling a thread, if the thread is being scheduled back to the same processor

from which it was swapped out, the thread’s Rsig, Wsig, CST, and OT registers are restored on

the processor. The OS then re-calculates the summary signatures for the currently swapped

out threads with active transactions and re-installs them at the directory. Thread migration is a

little more complex, since FlexTM performs write buffering and does not re-acquire ownership

of previously written cache lines. To avoid the inherent complexity, FlexTM adopts a simple

policy for migration: abort and restart.

Unlike LogTM-SE [YBM07], FlexTM is able to place the summary signature at the di-

rectory rather than on the path of every L1 access. This avoids the need for inter-processor

interrupts to install summary signatures. Since speculative state is flushed from the local cache

when descheduling a transaction, the first access to a conflicting line after re-scheduling is

guaranteed to miss, and the conflict will be caught by the summary signature at the directory.

Because it is able to abort remote transactions using AOU, FlexTM also avoids the problem of

potential convoying behind suspended transactions.

4.7 Evaluation

4.7.1 Area Analysis

In this section, we summarize the area overheads of our hardware mechanisms; area estimates

appear in Table 4.3. We consider processors from a uniform (65nm) technology generation

to better understand microarchitectural tradeoffs. Processor component sizes were estimated

using published die images. We used CACTI 6 to estimate the area overheads of the storage.

Only for the 8-way multithreaded Niagara-2 do the Rsig and Wsig have a noticeable area im-

pact: 2.2%; on Merom and Power6 they add only ∼0.1%. CACTI indicates that the signatures

should be readable and writable in less than the L1 access latency. These results appear to be

consistent with those of Sanchez et al. [SYH07]. The CSTs for their part are full-map bit-vector

registers (as wide as the number of processors), and we need only three per hardware context.

We do not expect the extra state bits in the L1 to affect the access latency because (a) they have

minimal impact on the cache area and (b) the state array is typically accessed in parallel with

the higher latency data array.


Finally, we compare the OT controller to the metadata cache (SM-cache) approach. While

the SM-cache is significantly more area hungry than the controller, it is a regular memory struc-

ture rather than a state machine. The SM-cache needs a separate hardware cache to store the

metadata while the OT controller’s metadata (i.e., hash-table index entries) contend with regular

data for L2 cache space. Overall, the OT controller adds less than 0.5% to core area. Its state

machine is similar to Niagara-2’s [KAO05] TLB walker. Niagara-2 with its 16-byte data cache

line presents a worst-case design point for the SM-cache. The small cache line leads to high overhead both in page-level metadata, since there are more cache blocks per page (4× more than Merom or Power6), and in per-cache-line metadata, since the per-line entry (17 bits) is a significant fraction of the cache line size (16 bytes). Straightforward optimizations that would save

area include organizing the metadata to represent a larger than cache line region.

Overall, with either FlexTM (which includes the OT controller) or FlexTM-S (which includes the SM-cache), the overheads imposed on out-of-order CMP cores (Merom and Power6) are well under 2%. In the case of Niagara-2 (high core multithreading and small cache lines), FlexTM's add-ons require a ∼2.6% area increase while FlexTM-S's add-ons require a ∼10% area increase.

Table 4.3: Area overhead of FlexTM's hardware mechanisms

    Processor                    Merom [SYM07]   Power6 [FMJ07]   Niagara-2 [Inc05]
    Actual die
      SMT (threads)              1               2                8
      Feature size               65nm            65nm             65nm
      Die (mm2)                  143             340              342
      Core (mm2)                 31.5            53               11.7
      L1 D (mm2)                 1.8             2.6              0.4
      Line size (bytes)          64              128              16
      L2 (mm2)                   49.6            126              92
    CACTI prediction
      Rsig + Wsig (mm2)          .033            .066             0.26
      RSsig + WSsig (mm2)        .033            .033             0.033
      CSTs (registers)           3               6                24
      Extra state bits           2 (T, A)        3 (T, A, ID)     5 (T, A, ID)
      % Core increase            0.6%            0.59%            2.6%
      % L1 Dcache increase       0.35%           0.29%            3.9%
      OT controller (mm2)        0.16            0.24             0.035
      32-entry SM-cache (mm2)    0.27            0.96

    ID: SMT context of the 'TMI' line.


Table 4.4: Experimental setup: 16-way CMP, private L1, shared L2.

    Processor cores      16 1.2GHz in-order, single-issue; non-memory IPC = 1
    L1 cache             32KB, 2-way split, 64-byte blocks, 1-cycle latency;
                         32-entry victim buffer; 2Kbit signature [CTC06, S14]
    L2 cache             8MB, 8-way, 4 banks, 64-byte blocks, 20-cycle latency
    Memory               2GB, 250-cycle latency
    Interconnect         4-ary tree, 1 cycle, 64-byte links; central arbiter (Section 4.7.4)
    Arbiter latency      30 cycles [CTM07]
    Commit msg. latency  16 cycles/link; commit messages also use the 4-ary tree

4.7.2 FlexTM Evaluation

We evaluate FlexTM through full system simulation of a 16-way chip multiprocessor (CMP)

with private L1 caches and a shared L2 (see Table 4.4(a)), on the GEMS/Simics infrastruc-

ture [MSB05].

We evaluate FlexTM using the benchmarks listed in Appendix A. Workload set 1 is a set

of microbenchmarks obtained from the RSTM package [10a] and Workload set 2 consists of

applications from STAMP [MTC07] and STMBench7 [GKV07]. Kmeans and Labyrinth spend

60-65% of their time in transactions; all other applications spend over 98% of their time in

transactions. In the microbenchmark tests, we execute a fixed number of transactions in a single

thread to warm up the structure, then fork off threads to perform the timed transactions. In Bayes

and Labyrinth we added padding to a few data structures to eliminate frequent false conflicts.

As Table A.1 in Appendix A shows, the workloads we evaluate have varied dynamic charac-

teristics. Delaunay and Genome perform a large amount of work per memory access and repre-

sent workloads in which time spent in the TM runtime is small compared to overall transaction

latency. Kmeans is essentially data parallel and, along with the HashTable microbenchmark, represents workloads that are highly scalable with no noticeable level of conflicts. Intruder also has small transactions, but exhibits a high level of contention due to dueling write-write conflicts. The short transactions in HashTable, KMeans, and Intruder suggest that TM runtime

overheads (if any) may become a significant fraction of total transaction latency. LFUCache and

Randomgraph have a large number of conflicts and do not scale; any pathologies introduced by

the TM runtime itself [BGH08] are likely to be exposed. Bayes, Labyrinth, and Vacation have


moderate working set sizes and significant levels of read-write conflicts due to the use of tree-

like data structures. RBTree is a microbenchmark version of Vacation. STMBench7 is the most

sophisticated application in our suite. It has a varied mix of large and small transactions with

varying types and levels of conflicts [GKV07].

Evaluation Dimensions

We have designed the experiments to address the following questions:

• How does FlexTM perform relative to hybrid TMs, hardware-accelerated STMs, and

STMs?

• How does FlexTM’s CST-based parallel commit compare to a centralized hardware ar-

biter design?

• How do the virtualization mechanisms deployed in FlexTM and FlexTM-S compare to

previously proposed software instrumentation (SigTM [MTC07]) and virtual memory-

based implementations [CMM06]?

4.7.3 FlexTM vs. Hybrid TMs and STMs

Result 1: Separable hardware support for conflict detection, conflict tracking, and versioning

can provide significant acceleration for software controlled TMs; eliminating software book-

keeping from the common case critical path is essential to realizing the full benefits of hardware

acceleration.

Runtime systems

We evaluate FlexTM and compare it against two different sets of Hybrid TMs and STMs with

two different sets of workloads.

Workload set 1 (WS1) interfaces with three TM systems: (1) FlexTM; (2) RTM-F [SSH07],

a hardware accelerated STM system; and (3) RSTM [MSH06], a non-blocking STM for legacy


hardware (configured to use invisible readers, with self validation for conflict detection). Work-

load set 2 (WS2), which uses a different API, interfaces with (1) FlexTM, (2) TL2, a blocking

STM for legacy hardware [DSS06], and (3) SigTM [MTC07], a hybrid TM derived from TL2

that uses signatures to track accesses to accelerate conflict detection and a software write-buffer

to version data locations. Every read or write is instrumented to insert the address into the sig-

nature and cross-reference the write-buffer to check if a new local version exists. FlexTM, the

hybrids (SigTM and RTM-F), and the STMs (RSTM and TL2) have all been set up to perform

Lazy conflict resolution.

We use the “Polka” conflict manager [ScS05] in FlexTM, RTM-F, SigTM, and RSTM. TL2

limits the choice of contention manager and uses a timestamp manager with backoff. While all

runtime systems execute on our simulated hardware, RSTM and TL2 make no use of FlexTM’s

extensions. RTM-F uses only PDI and AOU and SigTM uses only the signatures (Rsig and

Wsig). FlexTM uses all the presented mechanisms. Average speedups reported are geometric

means.

Results

Figure 4.11 shows the performance (transactions/sec) normalized to sequential thread perfor-

mance for 1 thread runs. This demonstrates that the overheads of FlexTM are minimal. For

small transactions (e.g., HashTable) there is some overhead (∼15%) for the checkpointing of

processor registers, which FlexTM performs in software — it could take advantage of check-

pointing hardware if it exists.

We study scaling and performance with 16 thread runs (Figure 4.12). To illustrate the use-

fulness of CSTs (see the table in Figure 4.12), we also report the number of conflicts encoun-

tered and resolved by an average transaction—the number of bits set in the W-R and W-W CST

registers.

The performance of both STMs suffers from the bookkeeping required to track data versions,

detect conflicts, and guarantee a consistent view of memory (validation). RTM-F exploits AOU

and PDI to eliminate validation and copying overhead, but still incurs bookkeeping that accounts

for 40–50% of execution time. SigTM uses signatures for conflict detection but performs ver-

sioning entirely in software. On average, the overhead of software-based versioning is smaller than that of software-based conflict detection, but it still accounts for as much as 30% of execution time on some workloads (e.g., STMBench7). Because it supports only lazy conflict detection, SigTM has simpler software metadata than RTM-F. RTM-F tracks conflicts for each individual transactional location and can vary the eagerness on a per-location basis.

[Figure: bar charts of normalized throughput. (a) Workload Set 1 (FlexTM, RTM-F, RSTM): HashTable, RBTree, LFUCache, RandomGraph. (b) Workload Set 2 (FlexTM, SigTM, TL2): Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth, Vacation, STMBench7.]
Throughput (transactions/10^6 cycles), normalized to sequential thread. All performance bars use 1 thread.

Figure 4.11: 1-thread performance of FlexTM

FlexTM’s hardware tracks conflicts, buffers speculative state, and ensures consistency in a

manner transparent to software, resulting in single thread performance close to that of sequen-

tial thread performance. FlexTM’s main overhead, register checkpointing, involves spilling of

local registers onto the stack and is nearly constant across thread counts. Eliminating per-access software overheads (metadata tracking, validation, and copying) allows FlexTM to realize the full potential of hardware acceleration, with an average speedup of 2× over RTM-F and 5.5× over RSTM on WS1. On WS2, FlexTM has an average speedup of 1.7× over SigTM and 4.5× over TL2.

[Figure: bar charts of normalized throughput. (a) Workload Set 1 (FlexTM, RTM-F, RSTM): HashTable, RBTree, LFUCache, RandomGraph. (b) Workload Set 2 (FlexTM, SigTM, TL2): Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth, Vacation, STMBench7.]
Throughput (transactions/10^6 cycles), normalized to sequential thread. All performance bars use 16 threads.

Figure 4.12: 16-thread performance of FlexTM

HashTable and RBTree both scale well and have significant speedup over sequential thread

performance, 10.3× and 8.3× respectively. In RSTM, validation and copying account for 22%

of execution time in HashTable and 50% in RBTree; metadata management accounts for 40%


and 30%, respectively. RTM-F eliminates the validation and copying costs, but metadata management still hinders performance improvement. FlexTM stream-

lines transaction execution and provides 2.8× and 8.3× speedup over RTM-F and RSTM re-

spectively.

LFUCache and RandomGraph do not scale (no performance improvement compared to

sequential thread performance). In LFUCache, contention for popular keys in the Zipf distribu-

tion forces transactions to serialize. Stalled writers lead to extra aborts with larger numbers

of threads, but performance eventually stabilizes for all TM systems. In RandomGraph, larger

numbers of conflicts between transactions updating the same region in the graph cause all TM

systems to experience significant levels of wasted work. The average RandomGraph transaction

reads ∼60 cache lines and writes ∼9 cache lines. In RSTM, read-set validation accounts for

80% of execution time. RTM-F eliminates this overhead, after which per-access bookkeeping

accounts for 60% of execution time. FlexTM eliminates this overhead as well, to achieve 2.7×

the performance of RTM-F.

In applications with large access set sizes (i.e., Vacation, Bayes, Labyrinth, and STM-

Bench7), TL2 suffers from the bookkeeping required prior to the first read (i.e., for checking

write sets), after each read, and at commit time (for validation) [DSS06]. This instrumentation

accounts for ∼40% of transaction execution time. SigTM uses signature-based conflict detec-

tion to eliminate this overhead. Unfortunately, both TL2 and SigTM suffer from another source

of overhead: given lazy conflict resolution, reads need to search the redo log to see previous

writes by their own transaction. Furthermore, the software commit protocol needs to lock the

metadata, perform the copyback, and then release the locks. FlexTM eliminates the cost of ver-

sioning and conflict detection and improves performance significantly, averaging 2.1× speedup

over SigTM and 4.8× over TL2.

Genome and Delaunay are workloads with a large ratio between the transaction size and

the number of accesses. TL2’s instrumentation on the reads does add significant overhead

and affects its scalability—only 3.4× and 2.1× speedup (at 16 threads) over sequential thread

performance for Genome and Delaunay respectively. SigTM eliminates the conflict detection

overhead and significantly improves performance—an average of 2.4× improvement over TL2.

FlexTM, in spite of the additional hardware support, improves performance by 22%, since the


versioning overheads account for a smaller fraction of overall transactional execution.

Finally, Kmeans and Intruder have unusually small transactions. Software handlers add

significant overhead in TL2. In Kmeans, SigTM eliminates conflict detection overhead to im-

prove performance by 2.7× over TL2. Since the write sets are small, eliminating the versioning

overheads in FlexTM only improves performance a further 24%. Intruder has a high level of

conflicts, and does not scale well, with a 1.6× speedup for FlexTM over sequential thread per-

formance (at 16 threads). Both SigTM and FlexTM eliminate the conflict detection handlers

and streamline the transactions, which leads to a change in the conflict pattern (fewer conflicts).

This improves performance significantly—3.3× and 4.2× over TL2 for SigTM and FlexTM

respectively. As in Kmeans, the versioning overheads are smaller and FlexTM’s improvement

over SigTM is restricted to 23%.

4.7.4 FlexTM vs. Central-Arbiter Lazy HTMs

Result 2: CSTs are useful: transactions do not often conflict, and even when they do, the num-

ber of conflicts per transaction is less than the total number of active transactions. FlexTM’s

distributed commit demonstrates better performance than a centralized arbiter.

As shown in Table 4.4(b), the number of conflicts encountered by a transaction is small

compared to the total number of concurrent transactions in the system. Even in workloads that

have a large number of conflicts (LFUCache and RandomGraph) a transaction typically en-

counters conflicts only about 30% of the time. Scalable workloads (e.g., Vacation, Kmeans)

encounter essentially no conflicts. This clearly suggests that global arbitration and serialized

commits will not only waste bandwidth but also restrict concurrency. CSTs enable local arbi-

tration and the distributed commit protocol allows parallel commits thereby unlocking the full

concurrency potential of the application. Also, a transaction’s commit overhead in FlexTM is

not a constant, but rather proportional to the number of conflicting transactions encountered.

In this set of experiments, we compare FlexTM’s distributed commit against two schemes

with centralized hardware arbiters: Central-Serial and Central-Parallel. In both schemes, in-

stead of using CSTs and requiring each transaction to ALoad its TSW, transactions forward their

Rsig and Wsig to a central hardware arbiter at commit time. The arbiter orders each commit


request, and broadcasts the Wsig to other processors. Every recipient uses the forwarded Wsig

to check for conflicts and abort its active transaction; it also sends an ACK as a response to the

arbiter. The arbiter collects all the ACKs and then allows the committing processor to complete.

This process adds 97 cycles to a transaction, assuming unloaded links and arbiter (latencies are

listed in Table 4.4(a)). The Serial version services only one commit request at a time (queuing

up any others); the Parallel services all non-conflicting transactions in parallel (assuming infi-

nite buffers in the arbiter). Central arbiters are similar in spirit to BulkSC [CTM07], but serve

only to order commits; they do not interact with the L2 directory.

We present results (see Figure 4.13) for all our workloads and enumerate the general trends

below:

• Arbitration latency for the Central commit scheme is on the critical path of transactions.

This gives rise to noticeable overhead in the case of short transactions (e.g., HashTable, RBTree,

LFUCache, Kmeans, and Intruder). CSTs simplify the commit process: in the absence of

conflicts, a commit requires only a single memory operation on a transaction’s cached status

word. On these workloads, CSTs improve performance by an average of 25% even over the

aggressive Central-Parallel, which only serializes a transaction commit if it conflicts with an

already in flight commit.

• Workloads that exhibit inherent parallelism with Lazy conflict resolution (all except

LFUCache and RandomGraph) suffer from serialization of commits in Central-Serial. Cen-

tral-Serial essentially queues up transaction commits and introduces the commit latency of even

other non-conflicting transactions onto the critical path. The serialization of commits could also

change the conflict pattern. In some workloads (e.g., Intruder, STMBench7), in the presence

of reader-writer conflicts as the reader transaction waits for predecessors to release the arbiter

resource, the reader could be aborted by the conflicting writer. In a system that allows paral-

lel commits the reader could finish earlier and elide the conflict entirely. CST-based commit

provides an average of ∼50% and a maximum of 112% (HashTable) improvement over Cen-

tral-Serial. Central-Parallel removes the serialization overhead, but still suffers from commit

arbitration latency.

• In benchmarks with high conflict levels (e.g., LFUCache and RandomGraph) that do not

inherently scale, Central’s conflict management strategy avoids performance degradation. The


transaction being serviced by the arbiter always commits successfully, ensuring progress and

livelock freedom. The current distributed protocol allows the possibility of livelock. However,

the CSTs streamline the commit process, narrow the vulnerability window (to essentially the

interprocessor message latency), and eliminate the problem as effectively as Central. Lazy

conflict resolution inherently eliminates livelocks as well [ShD09, SDM09].

At low conflict levels, a CST-based commit requires mostly local operations and its per-

formance should be comparable to an ideal Central-Parallel (i.e., zero message and arbitration

latency). At high conflict levels, the penalties of Central are lower compared to the overhead

of aborts and workload inherent serialization. Finally, the influence of commit latency on per-

formance is dependent on transaction latency (e.g., reducing commit latency helps Central-

Parallel approach FlexTM’s throughput in HashTable but has negligible impact on Random-

Graph’s throughput).

[Figure: bar chart of normalized throughput (1 thread = 1) for FlexTM, Central-Parallel, and Central-Serial on Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth, Vacation, STMBench7, HashTable, RBTree, LFUCache, and RandomGraph.]
Throughput (transactions/10^6 cycles), normalized to sequential thread. All performance bars use 16 threads.

Figure 4.13: FlexTM vs. centralized hardware arbiters

4.7.5 FlexTM-S vs. Other Virtualization Mechanisms

To study TM virtualization mechanisms, we downgrade our private L1 caches to 32KB 2-way.

This ensures that, in spite of the moderate write set sizes in our workloads, they experience


overflows due to associativity constraints. Every L1 has access to a 64 entry SM-cache. Each

metadata entry is 136 bytes.

We use five benchmarks in our study: Bayes, Delaunay, Labyrinth, and Vacation from the

STAMP suite, and STMBench7. As Table 4.4(b) shows, these benchmarks have the largest

write sets and are most likely to generate L1 cache overflows, enabling us to highlight tradeoffs

among the various virtualization mechanisms. The fraction of total transactions that experience

overflows in Bayes, Delaunay, Labyrinth, Vacation and STMBench7 is 11%, 8%, 25%, 9% and

32% respectively.

We compare FlexTM-S’s performance against the following Lazy TM systems:(1) FlexTM,

which employs a hardware controller for overflowed state and signatures for conflict detection,

(2) XTM [CMM06], which uses virtual memory to implement all TM operations; (3) XTM-

e, which employs virtual memory support for versioning but performs conflict detection using

cache-line granularity tag bits; and (4) SigTM [MTC07], which uses hardware signatures for

conflict detection and software instrumentation for word-granularity versioning. All systems

employ the Polka [ScS05] contention manager.

Result 1: A software maintained metadata cache is sufficient to provide virtualization sup-

port with negligible overhead.

As shown in Figure 4.14, FlexTM-S imposes a modest performance penalty (10%) com-

pared to FlexTM. This is encouraging since it is vastly simpler to implement the SM-cache

than the controller in FlexTM. The SM-cache miss and copyback handlers are the main con-

tributors to the overhead. Unlike FlexTM and FlexTM-S, which version only the overflowed

cache lines, XTM and XTM-e suffer from the overhead of page-granularity versioning. XTM's

page-granularity conflict detection also leads to excessive aborts. XTM and XTM-e both rely

on heavyweight OS mechanisms; by contrast, FlexTM-S requires only user-level interrupt han-

dlers. Finally, SigTM incurs significant overhead due to software lookaside checks to determine

if an accessed location is being buffered.

We also analyzed the influence of signature false positives. In FlexTM-S, write signature

false positives can lead to increased handler invocation for loading the SM-cache, but the soft-

ware metadata can be used to disambiguate and avoid the abort penalty. In FlexTM, signature

responses are treated as true conflicts, and cause contention manager invocations that could lead to excessive aborts. We set the Wsig and Osig to 32 bits (see Figure 4.15) to investigate the performance penalties of small write signatures.

[Figure: bar charts of throughput for FlexTM, FlexTM-S, XTM-e, SigTM, and XTM on (a) Bayes, (b) Delaunay, (c) Labyrinth, (d) Vacation, and (e) STMBench7.]
Throughput at 16 threads for FlexTM-S vs. other TMs, normalized to FlexTM.

Figure 4.14: Comparing FlexTM-S with other TMs

Result 2: As Figure 4.15 shows, FlexTM-S's use of software metadata to disambiguate

false positives helps reduce the needed size of hardware signatures while maintaining high

performance.

[Figure: bar chart of throughput for FlexTM (32-bit Wsig) and FlexTM-S (32-bit Osig and Wsig) on Bayes, Labyrinth, Delaunay, and STMBench7.]
Performance normalized to FlexTM with a 2048-bit Wsig.

Figure 4.15: Effect of signature size on FlexTM performance

4.8 Other Applications

Programmable-Data-Isolation essentially provides support for capturing a snapshot of the cache

blocks. Threads can make updates to private isolated blocks without worrying about remote

threads and always obtain the up-to-date version for read-only blocks. Memory snapshots can be

used to improve concurrency in many applications where multiple threads need to sift through

the program state as the program is executing.

4.8.1 Profiling

A number of profiling tools (e.g., memory profiling, call-graph profiling [FMF05]) rely on a

separate profiling thread to traverse and assimilate information as the application is concurrently

executing. In such cases, the program objects referenced by the profiling thread are also being

concurrently modified by the application. Locking can be used to ensure correct interaction.

However, the profiling thread has been developed by a third party, and ensuring correctness

and high performance is a challenge. Note that any consistent snapshot of the memory suffices


for the profiling thread (and does not necessarily need the up-to-date values). PDI effectively

supports profiling; the profiling thread need only execute in an isolated block and prefetch the

data it needs to profile in speculative exclusive mode.

4.8.2 Garbage Collection

Similar to profiling, a common application of separate threads iterating over program state is

garbage collection. Fast garbage collectors try to improve concurrency between mutators and

collectors with complex synchronization operations. Snapshots of memory will help the collec-

tors assimilate the dead objects [DFH04] without having to worry about the object references

being modified by the application.

4.8.3 Concurrent Programming

Most recently, Burckhardt et al. [BBL10] have proposed using isolated versions of shared data (revisions in Burckhardt's terminology) to reason about concurrency as an alternative to critical-

section (TM or locks) based synchronization. Revision-based concurrency control relies on

maintaining a separate isolated copy of the shared data per concurrent task. These concurrent

tasks (non-speculative) make updates to the private copies of the data, which are propagated

when the task completes; all conflicts are resolved without aborts using program-level seman-

tics. Revisions need isolation without conflict detection. Software-only implementations of

revisions require custom GET and PUT functions per field of a data object. PDI could elimi-

nate these handlers and provide a low overhead mechanism to implement revisions in both typed

and untyped languages. Our current implementation of PDI supports only a single isolation-requiring task per L1 cache. Future work could potentially investigate extensions to PDI for

supporting multiple lightweight tasks per L1.


Chapter 5

Conflict Management and Resolution in HTMs

This chapter seeks to analyze the interaction between TM conflict management and conflict res-

olution policies, while taking into consideration both performance and implementation trade-

offs. In Section 5.1, we introduce the various types of conflict scenarios and motivate the need

for studying conflict management and resolution policies. Section 5.2 defines the basic termi-

nology and discusses the options available to a contention manager based on when it is invoked

to resolve a specific conflict scenario. Following this, we incorporate contention management

heuristics in a step-by-step fashion. In Section 5.3, we analyze the influence of introducing

backoff (stalling) into the contention manager. Section 5.4 studies the interaction between con-

tention manager heuristics with conflict resolution time (Eager and Lazy). In Section 5.5, we

implement and evaluate a mixed conflict resolution policy (the semantics of which were defined

in [Sco06]). Finally, Section 5.6 includes a discussion on related work.

5.1 Introduction

Currently, there is very little consensus on the right way to implement transactions. Hardware

proposals are more rigid than software proposals with respect to the conflict resolution poli-

cies supported. However, this stems in large part not from a clear analysis of the tradeoffs,


but rather from a tendency to embed more straightforward policies in silicon. In general, TM

research has tended to focus on implementation tradeoffs, performance issues, and correctness

constraints while assuming conflicts are infrequent. This assumption does not seem to hold for

the first wave of TM applications that employ coarse-grain transactions (Table 5.1). Conflicts

are common with the "tries" and linked-list data structures prevalent in these workloads. In such

data structures, typically, insertion proceeds bottom-up while searching moves top-down, which

leads to non-trivial interaction and overlap between the associated writer and reader transac-

tions. Furthermore, conservative programming practices that encapsulate large regions within

a single atomic block might lead to unnecessary conflicts due to the juxtaposition of unrelated

data with different conflict properties.

Hardware support for TM seems inevitable and has already started to appear [TrC08]. How-

ever, there seems to be very little understanding and analysis of TM policies in a HTM context.

Our work seeks to remedy this situation. In the absence of conflicts, policy decisions take a

backseat and most systems perform similarly. In the presence of conflicts, performance varies

widely (orders of magnitude, see Section 4.7.2) based on policy. We focus on the interaction

between two policy decisions that affect performance in the presence of conflicts: conflict reso-

lution time and contention management policy. We informally describe these two critical design

decisions below.

Table 5.1: Percentage of total (committed and aborted) transactions that encounter a conflict.

    Benchmark    % Conf. Tx        Benchmark      % Conf. Tx
    Bayes        85%               Labyrinth      81%
    Delaunay     85%               Vacation       73%
    Genome       11%               STMBench7      68%
    Intruder     90%               LFUCache       95%
    Kmeans       15%               RandomGraph    94%

    See Appendix A for workload description. These experiments are for 16-thread runs with Lazy conflict resolution and a "committer wins" contention manager.

A conflict detection mechanism is needed to detect conflicts so that the system can ensure

that transactions do not perform erroneous externally visible actions as a result of an inconsis-

tent view. The conflict resolution time decides when the detected conflicts (if they still persist)

are managed — Eager systems (pessimistic) detect and resolve a conflict when a transaction


accesses a location. In Lazy systems (optimistic), the transaction that reaches its commit point

first will resolve the resulting conflict (although it may detect the conflict earlier). Most systems

tend to conflate detection and resolution and perform both at the same time. Here, we separate

the concepts and refer to the mechanism implemented in hardware or software as conflict detec-

tion and the policy choice of when to react when a conflict is detected as the conflict resolution

time. At the time of conflict resolution, the TM system invokes a contention manager to deter-

mine the response action. The contention manager employs a set of heuristics to decide which

transaction has to stall/retry and which can progress. The job of a good contention manager is

to mediate access to conflicting locations and maximize throughput while ensuring some level

of fairness. A contention manager’s policy is also influenced by conflict resolution time — in

Eager it is invoked before the access and could potentially elide the conflict while in Lazy it is

invoked after an access (when a conflict becomes unavoidable) and the contention manager has

to try to choose the appropriate transaction to abort.

We use FlexTM, developed in Chapter 4 (Section 4.5), as our experimental framework.

FlexTM allows software the flexibility to control both the conflict resolution time and contention

management policy. This allows us to analyze the policy tradeoffs within a common framework.

We use a diverse set of workloads from STAMP [MTC07] and STMBench7 [GKV07] that have

varied transaction parameters and conflict characteristics. Appendix A provides more details on

our experimental set up and workloads. Interestingly, while conflicts are commonplace in most

applications (Table 5.1), most conflicts (average over 90%) are between reader and writer trans-

actions, which can possibly be elided with the appropriate conflict resolution and management

policy as we demonstrate.

In this chapter, we conduct the following three studies. First, we analyze the influence of

introducing backoff (stalling) into the contention manager and how it helps with livelock issues.

Second, we implement and compare a variety of contention manager heuristics and evaluate

their interaction with conflict resolution time (Eager and Lazy). We analyze the access patterns

and transaction interleaving in our applications that lead to performance differences between

Eager and Lazy — specifically, the wasted work due to aborts and concurrency between reader

and writer transactions. Finally, we implement and evaluate a mixed conflict resolution policy

(the semantics of which were defined in [Sco06]) in the context of a hardware-accelerated TM.


The mixed policy resolves write-write conflicts eagerly to save wasted work and resolves read-

write conflicts lazily to exploit concurrency. We also briefly discuss the implementation

challenges and feasibility of modifying other HTMs to use the Mixed mode.

5.2 Conflict Resolution Primer

5.2.1 Conflict Resolution and Contention Management

The contention manager (CM) is called on any conflict and has to choose from a range of actions

based on conflict resolution time. Assuming deadlock freedom is a property of the underlying

TM system, the additional goals of the runtime, broadly defined, are to try to avoid livelock

(ensure that some transaction makes progress) and starvation (ensure that a specific transaction

that has been aborted often makes progress). It also needs to exploit as much parallelism as

possible, ensuring that transactions execute and commit concurrently (whenever possible). The

contention manager is decentralized and is invoked by the transaction that detects the conflict,

which we’ll label the attacking transaction (Ta), using the terminology introduced in [GHP05a].

The conflicting transaction that must participate in the arbitration is labeled the enemy transac-

tion (Te) (as opposed to the victim [GHP05a], since Te might actually win the arbitration). On

detecting a conflict, Ta invokes its contention manager, which decides the order in which it wants to serialize the transactions based on some heuristic: Ta before Te, or Te before Ta. The actions

carried out by the contention manager may also depend on when it was invoked, i.e., the conflict

resolution time.

The two main conflict resolution modes that have been explored by previous HTM designs are Eager and Lazy. They exploit varying levels of application parallelism based on

their approach to concurrent accesses. Eager enforces the single-writer rule and allows only multiple readers, while Lazy permits multiple writers and multiple readers to coexist until a commit. Transactions need to acquire exclusive permission to the written locations sometime prior to commit. Eager acquires this permission at access time and blocks out other transactions for the duration of the transaction, while Lazy delays this until commit, allowing concurrent

work. There is, of course, a sliding scale between access and commit time, but we have chosen

the two end points for evaluation.


With Lazy, it is possible for readers to commit concurrently with a potential writer enemy

if they reach and finish their commits earlier than the writer. This form of conflict-awareness

where potentially conflicting transactions are allowed to concurrently execute and commit un-

covers more parallelism than Eager. This parallelism tradeoff is completely orthogonal to the

contention management that typically focuses on improving progress and fairness. Eager can

possibly save more wasted work via early detection of conflicts but only if the winner commits;

if the winner of the conflict aborts, more work will be wasted. Fundamentally, Eager tries to

handle potential conflicts at access and finds it difficult to make optimal decisions about which

transaction has a higher likelihood of committing in the future. Since Lazy does not resolve

conflicts until the commit point (by which time a conflict becomes unavoidable) it allows the

contention manager to do a better job of choosing the appropriate winner (one that is likely

to commit) in a conflict. Lazy reduces the window of vulnerability (where a transaction could

abort its enemies only to find itself aborted later) to the commit window. In Eager the window

extends from the point at which the conflict was detected and the resulting contention managed

(access time) to the transaction commit time.

Figure 5.1 shows the generic set of options available to a contention manager. We now

discuss in detail the option exercised for a specific conflict type. Table 5.2 summarizes the

details. Any transaction can encounter three types of conflicts: Read-Write, Write-Read(1), and Write-Write.

(1) Read-Write and Write-Read conflicts are the converse of each other. They vary based on the transaction that notices the conflict; when a transaction T1 reads a location being concurrently accessed by T2 for writing, the conflict is classified as Read-Write at T1's end and Write-Read at T2's end.

Table 5.2: Interaction of contention manager and conflict resolution

    Conflict       Objective: Te before Ta        Objective: Ta before Te
                   Eager        Lazy              Eager        Lazy
    R(Ta)-W(Te)    WAITa: Ta    WAITc: Ta         ABrem: Te    COself: Ta
    W(Ta)-R(Te)    WAITa: Ta    WAITc: Ta         ABrem: Te    ABrem: Te
    W(Ta)-W(Te)    WAITa: Ta    WAITc: Ta         ABrem: Ta    ABrem: Ta

    R(Tx): Tx has read the location; W(Tx): Tx has written the location.
    Ta: attacking transaction; Te: enemy transaction.

plays the role of the attacker. If in Eager mode, the contention manager can try to avoid the

1Read-Write and Write-Read conflicts are the converse of each other. They vary based on the transaction thatnotices the conflict; when a transaction T1 reads a location being concurrently accessed by T2 for writing, theconflict is classified as Read-Write at T1’s end and Write-Read at T2’s end.

121

AA Write location Read locationXact_Begin

Stall ActiveAbortXXact_Commit

Te Ta

A

A

Tim

e

(a) WAITa

A

TaTe

A

(b) WAITc

Te Ta

AA

X

(c) ABrem

Te Ta

AA

X

(d) ABself

WAITa - Backoff on conflict in Eager systemsWAITc - Backoff before commit in Lazy systemsABrem - Abort remote transactionABself - Self abortCOself - Commit the transaction

Figure 5.1: Contention manager actions

conflict by waiting and allowing the enemy transaction to commit before it reads. Alternatively,

it could take the action of either ABself (self abort, see Figure 5.1 to release isolation on other

locations it may have accessed or ABrem on the writer in-order to make progress. With Lazy

systems, when the reader reaches the commit point, the reader can commit without conflict.

Write-Read: A Write-Read conflict at the high level is the same as Read-Write, except that

the writer is the attacker. If the contention manager decides to commit the reader before the

writer then the writer has to stall irrespective of the conflict resolution scheme (Eager or Lazy

). Eager systems would execute a WAITa while Lazy systems would execute a WAITc only

if the reader has not committed prior to the writer’s commit. If the writer is to serialize ahead

of the reader, the only option available is to abort the reader. In this scenario aborting early in

Eager systems might potentially save more wasted work.

Write-Write: True write-write conflicts (where neither transaction reads the location it writes) are serializable even if the concurrent transactions both commit; but, due to the constraints of coherence-based conflict detection, implementations must conservatively treat write-write conflicts as dueling read-write and write-read conflicts. For such dueling conflicts there is no serial history in which both transactions can concurrently commit: one of them has


to abort. However, since Eager systems manage conflicts before access, they can WAITa until

the conflicting transaction commits. Lazy systems have no such option and in this case will

waste work. Both Eager and Lazy may also choose to abort either transaction.

5.2.2 Design Space

As described in [GHP05a], each contention manager exports notification and feedback meth-

ods. Notification methods inform the contention manager about transaction progress. In order

to minimize overhead, unlike the STM contention managers in [GHP05a], we assume explicit

methods exist only at transaction boundary events — transaction begin, abort, commit, and stall.

Any information on access patterns is gleaned via hardware performance counters/registers.

Feedback methods tell the transaction which action to take: abort the enemy transaction, abort itself, or stall in order to give the enemy more time to complete. The decision is based on who the attacker and enemy are, the information the contention manager has on their progress, and the type of conflict detected (R-W, W-R, or W-W).
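To make this interface concrete, the following C++ sketch shows one possible rendering of the boundary-event notifications and the feedback method. All names here are ours, chosen for illustration; this is not FlexTM's actual API.

    // Hypothetical contention-manager interface (illustrative names).
    // Notification methods fire only at transaction boundary events;
    // the feedback method returns the action the transaction must take.
    enum class ConflictType { ReadWrite, WriteRead, WriteWrite };
    enum class Action { AbortEnemy, AbortSelf, Stall };

    struct TxInfo {
      int id;                   // transaction identifier
      unsigned long timestamp;  // logical age (shared-counter value)
      unsigned long reads;      // read-set size, from a hardware counter
      unsigned aborts;          // number of times this transaction aborted
    };

    class ContentionManager {
    public:
      // Notifications: called only at transaction boundaries.
      virtual void onBegin(const TxInfo& tx) = 0;
      virtual void onCommit(const TxInfo& tx) = 0;
      virtual void onAbort(const TxInfo& tx) = 0;
      virtual void onStall(const TxInfo& tx) = 0;
      // Feedback: invoked when a conflict is detected.
      virtual Action resolve(ConflictType type, const TxInfo& attacker,
                             const TxInfo& enemy) = 0;
      virtual ~ContentionManager() = default;
    };

Access-pattern statistics (e.g., the reads field) would be gleaned from performance counters rather than reported through explicit calls, matching the low-overhead design described above.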

Exploring the design spectrum of contention manager heuristics is not easy since the objec-

tives are abstract. In some sense, the contention manager heuristic has the same goals as the

heuristics that arbitrate a lock. Just as a lock manager tries to maximize concurrency and pro-

vide progress guarantees to critical sections protected by the lock, the contention manager seeks

to maximize transaction throughput while guaranteeing some level of fairness. We have tried to

adopt an organized approach: a five-dimensional design space guides the contention managers that we develop and analyze. We enumerate the design dimensions here and describe the specific contention managers in our evaluation (Section 4.7.2).

1. Conflict type: This dimension specifies whether the contention manager distinguishes

between various types of conflict. For example, with a write-write conflict the objective

might be to save wasted work while with read-write conflicts the manager might try to

optimize for higher throughput.

Options: Read-Write, Write-Read, or Write-Write

2. Implementation (I): The contention manager implementation is a tradeoff between con-

currency and implementation overheads. For example, each thread could invoke its own


instance of the contention manager (as we have discussed in this chapter), or there could be a centralized contention manager, which usually closely ties together both the conflict detection and commit protocols (e.g., lazy HTM systems [HWC04]). The latter enables global consensus

and optimizations while the former imposes less performance penalty and is potentially

more scalable.

Options: Centralized or De-centralized

3. Conflict Resolution: This controls when the contention manager is invoked,

Options: Eager, Lazy, or Mixed (see Section 5.5)

4. Election: This indicates the information used to arbitrate among transactions. There are a

number of heuristics that could be employed, such as timestamps, read-set and write-set

sizes, transaction length, etc. Early work on contention management [ScS05] explored

this design axis. In this paper, we limit the information used in a tradeoff between re-

duced implementation complexity and statistics collection overhead, and throughput in

the presence of contention.

Options: Timestamp, Read/Write set size, etc.

5. Action: Section 5.2 included a detailed discussion of the action options available to a

contention manager when invoked under various conflict scenarios. These have a critical

influence on progress and fairness properties. A contention manager that always only

stalls is prone to deadlock while one that always aborts the enemy is prone to livelock.

A good option probably lies somewhere in between. We show in our results that aside

from progress guarantees, waiting a bit before making any decision is important to overall

throughput.

Options: abort enemy, abort self, stall, increase/decrease priority, etc.

Since we are focused on the influence of software policy decisions, we only investigate

the de-centralized manager implementation. We investigate three types of conflict resolution:

Eager, Lazy, and Mixed. We study the interaction of conflict resolution with various election

strategies, with varying progress guarantees (e.g., Timestamp vs. Transaction progress). With

all these design choices, the contention manager can exercise the full range of actions.
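For reference, the five dimensions can be collapsed into a single configuration record. The sketch below is merely our shorthand for the design space, not any system's interface:

    // Hypothetical encoding of the five-dimensional design space.
    enum class Resolution { Eager, Lazy, Mixed };         // when to resolve
    enum class Placement  { Centralized, Decentralized };
    enum class Election   { Timestamp, ReadSetSize, AbortCount };

    struct ManagerConfig {
      bool       perConflictType;  // distinguish R-W, W-R, W-W conflicts?
      Placement  placement;        // implementation dimension
      Resolution resolution;       // conflict resolution dimension
      Election   election;         // arbitration information
      // The action dimension (abort enemy/self, stall, priority changes)
      // is exercised at runtime by resolve(), not fixed here.
    };

In these terms, the studies that follow fix placement to Decentralized and sweep the remaining dimensions.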


5.3 Effectiveness of Stalling

There are many different heuristics that can be employed for conflict resolution and contention

management. To better understand the usefulness of each heuristic, we analyze them in a step-

by-step fashion, targeting specific objectives. We show that livelock is a problem mainly in

Eager systems and analyze the effectiveness of randomized backoff in alleviating the prob-

lem. Our results corroborate earlier findings [ScS05] that randomized backoff is an effective

solution. We further note that the timing of backoff (whether prior to access or after aborting)

is important. We also quantify the average wasted work that can be attributed to an aborted

transaction and the work wasted cumulatively due to all aborts.

Randomized backoff is perhaps best known in the context of the Ethernet framework. In the

context of transactional memory, it is a technique used to either (1) stall a transaction restart to

mitigate repeated conflicts with another or (2) stall an access prior to actual conflict and thereby

potentially elide it entirely. There seems to be a general consensus that backoff is useful – most

STM contention managers use it [ScS05] and some Eager HTMs embed it into hardware as their default [MBM06].
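A minimal sketch of such a retry loop appears below. The constants are placeholders, and a hardware implementation would stall or spin at the coherence controller rather than sleep in software:

    #include <chrono>
    #include <random>
    #include <thread>

    // Retry a conflicting access under randomized exponential backoff.
    // try_access() stands in for reissuing the request and learning
    // whether the conflict persists.
    bool access_with_backoff(bool (*try_access)(), int max_retries = 8) {
      std::mt19937 rng(std::random_device{}());
      long bound_ns = 100;                        // initial backoff window
      for (int i = 0; i < max_retries; ++i) {
        if (try_access()) return true;            // conflict elided
        std::uniform_int_distribution<long> wait(0, bound_ns);
        std::this_thread::sleep_for(std::chrono::nanoseconds(wait(rng)));
        bound_ns *= 2;                            // widen the window
      }
      return false;  // give up: fall back to the manager's default action
    }

Randomizing the wait is what breaks the symmetry between dueling transactions and lets one of them make progress.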

We study three contention managers, Reqwin, Reqlose, and Comwin, with and without backoff. Reqwin and Reqlose are access-time schemes compatible with Eager. In Reqwin the attacker always wins and aborts the enemy, while in Reqlose the attacker always loses, aborting itself. Comwin is the simple committer-always-wins policy in Lazy. Reqwin and Reqlose, when combined with backoff (the +B systems), wait a bit, retrying the access a few times (for up to the average latency of a transaction) before falling back to their default action. Reqwin and Reqlose

strategies were first studied by Scherer [ScS05] in the context of STMs. Recently, Bobba et

al. [BMV07] studied similar schemes as part of a pathology exploration in fixed-policy HTMs.
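Using the hypothetical Action enum from the interface sketch in Section 5.2.2, the three fixed policies reduce to constant decisions (our own rendering):

    // Reqwin: the requesting (attacking) transaction always wins.
    Action reqwin_resolve()  { return Action::AbortEnemy; }
    // Reqlose: the requester always loses and aborts itself.
    Action reqlose_resolve() { return Action::AbortSelf; }
    // Comwin: the committing transaction wins; in Lazy mode its enemies
    // are aborted during the commit protocol.
    Action comwin_resolve()  { return Action::AbortEnemy; }

The +B variants wrap these defaults in the bounded retry loop sketched above, stalling and retrying before falling back to the fixed decision.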

5.3.1 Implementation Tradeoffs

There are significant challenges involved in integrating even these simple managers within the

existing framework. Reqwin requires no modifications to the coherence protocol but is prone to

livelock, while Reqlose requires the support of coherence NACKs to inform the requestor that

its access must be aborted. Comwin in previous Lazy systems [HWC04; CTC06] has required


the support of a global arbiter. In FlexTM [SDS08], it requires a software commit protocol to

collect the ids of all the enemies from the conflict bitmaps and abort them.

Stalling a transaction in the midst of its execution after discovering a conflict is not easy.

The backoff needs to occur logically prior to the access in order to avoid the conflict but it

should occur only on conflicting accesses lest it waste time unnecessarily. Hence, backoff oc-

curs after the coherence requests for an access have been sent out to find out if the access does

conflict. Therein lies the problem: coherence messages typically update metadata along the

cache hierarchy to indicate access; if backoff is invoked, the metadata needs to be cleaned up

since logically the access did not occur. This requires extra messages and states in the coher-

ence protocol. Furthermore, continually stalling without aborting can lead to deadlock (e.g.,

transactions waiting on each other). The coherence extensions needed to conservatively check

for deadlock (e.g., with timestamps on coherence messages [MBM06]) introduce verification

challenges. In FlexTM, since transactions can use alerts [SSH07] to abort remote transactions

explicitly, we use a software-based timestamp scheme similar to the greedy manager [GHP05b]

to detect deadlocks. Eliminating the need for backoff prior to an access would arguably make

the TM implementation simpler.
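The timestamp scheme can be summarized by a single rule; the sketch below assumes each transaction carries a unique, retained logical timestamp and mirrors the Greedy-style policy cited above (the actual FlexTM code may differ):

    // Deadlock avoidance: a stalled transaction may wait on an enemy only
    // if the enemy is strictly older; otherwise one of the two must abort.
    // Unique, retained timestamps make the waits-for relation acyclic, so
    // the oldest transaction in the system always makes progress.
    bool may_wait(unsigned long self_ts, unsigned long enemy_ts) {
      return enemy_ts < self_ts;  // smaller timestamp = older transaction
    }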

5.3.2 Analysis

Figure 5.2 shows the performance plots. The Y-axis plots normalized speedup compared to

sequential execution. Each bar in the plot represents a specific contention manager.

Result 1a: Backoff is an effective technique to elide conflicts and randomization ensures

progress in the face of conflicts. The introduction of backoff in existing contention managers

can significantly improve Eager’s performance. Lazy’s delayed commit inherently serves as

backoff.

Implication: HTM systems that rely on coherence protocols for conflict resolution should

include a mechanism to stall and retry a memory request when a conflict is detected. STMs

should persist with the out-of-band techniques that permit stalling.

At anything over moderate levels of contention (benchmarks other than Kmeans and Genome; see Table A.1 in Appendix A), both Reqlose and Reqwin perform poorly (see Figure 5.2). Reqlose's immediate aborts on conflicts do serve as backoff, but in these benchmarks they end up wasting more work. Bobba et al. [BMV07] observed this same trend for

other SPLASH2 workloads. Introducing backoff helps thwart these issues (see Figures 5.2

(a),(b),(f),(g)): waiting a bit prevents us from making the wrong decision and also tries to en-

sure someone makes progress. Bobba’s EL system [BMV07] is similar to Reqwin. In both of

them the requester wins the conflict but they vary in their utilization of backoff; Reqwin ap-

plies it to the requester logically prior to access whereas the EL system applies backoff to the

restart of the enemy transaction after aborting it. This leads to different levels of wasted work

compared to the Reqlose system (comparable to Bobba’s EE system); Bobba’s work reports

significant wasted work and futile stalls in the EL compared to EE while here Reqwin performs

similarly to Reqlose.

Comwin performs well even without backoff. In benchmarks with no concurrency (e.g.,

RandomGraph) Lazy ensures that the transaction aborting enemies at commit-time usually fin-

ishes successfully (i.e., some transaction makes progress). In other benchmarks (e.g., Bayes

and Vacation) it exploits concurrency (allows readers and writers to execute concurrently and

commit). Backoff improves the chance of concurrent transactions committing. With read-write

conflicts, it stalls the writer at the commit point and tries to ensure the reader’s commit oc-

curs earlier, eliding the conflict entirely (see Figure 5.1(b)). There is noticeable performance

improvement in workloads with read-write sharing such as Bayes and Vacation.

We did observe that randomizing backoff at transaction start time can help avoid convoying

that arises in irregular workloads such as STMBench7. There are many short-running concur-

rent writer transactions that desire to update the same location and when one of them commits

the rest abort, restart, and the phenomenon repeats. This is akin to the "Restart Convoy" observed by Bobba et al. [BMV07] in their microbenchmarks.

5.3.3 Effect of Wasted work

A significant fraction of the performance differences between Eager and Lazy can be attributed

to the wasted work due to aborted transactions. Since Lazy resolves conflicts later than Eager it

would be expected that the work lost when a transaction is aborted is larger.


[Figure 5.2 presents one bar plot of normalized throughput at 16 threads for each benchmark: (a) Bayes, (b) Delaunay, (c) Genome, (d) Intruder, (e) Kmeans, (f) Labyrinth, (g) Vacation, (h) STMBench7, (i) LFUCache, and (j) RandomGraph. Each plot compares Req. wins, Req. wins + B, Req. loses, Req. loses + B, Commit wins, and Commit wins + B.]

Y-axis: Normalized throughput, 1-thread throughput = 1. +B: with randomized backoff.

Figure 5.2: Studying the effect of randomized backoff on conflict management

The amount of extra work wasted when a transaction aborts depends on the time elapsed between the access that

caused the conflict (which is when Eager would have aborted the enemy) and the commit point


(which is when Lazy aborts the enemy). Table 5.3 lists the statistics for four systems: Reqlose,

Reqlose+B, Comwin and Comwin+B. The Lazy systems (Comwin policies) waste more work

on an aborted transaction (7–68%). However, the work wasted on an abort is limited since a

conflict is discovered and handled as soon as a conflicting writer reaches the commit point. As

Table A.1 in Appendix A indicates (see the Wr1 and Wrn metrics), transactions typically finish up soon after they start writing.

Furthermore, the performance loss due to wasted work is also dependent on the number

of aborts in the system (see Ab/Ct metric in Table 5.3). This is already visible when compar-

ing Reqlose with Reqlose+B: Reqlose+B performs 1.7x better than Reqlose by aborting fewer transactions (6.6x fewer). This is despite the average aborted transaction wasting 40% more work in Reqlose+B, since a transaction always performs backoff prior to aborting. A similar pattern is visible when comparing Comwin+B with Reqlose+B: Comwin+B wastes 31% more work per aborted transaction but aborts on average 1.6x fewer transactions (maximum 5.4x fewer) compared to Reqlose+B.

Eager can save wasted work, the resulting available resources must be utilized for other useful

work in order to reap any benefit. Most TM systems do not allow fast context switching to other

useful work and the resources are essentially left idle during a backoff, resulting in no overall

improvement in performance.
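The two effects combine multiplicatively: cumulative wasted work per commit is roughly the product of the two metrics reported in Table 5.3. As a back-of-the-envelope check using the averages quoted above,

\[
W_{\text{cum}} \approx \frac{\text{aborts}}{\text{commit}} \times \text{avg.\ aborted-Tx size},
\qquad
\frac{W_{\text{cum}}(\text{Reqlose+B})}{W_{\text{cum}}(\text{Reqlose})} \approx \frac{1.4}{6.6} \approx 0.21,
\]

i.e., roughly a 5x reduction in cumulative wasted work despite the larger per-abort loss.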


5.4 Interplay between Conflict Resolution and Management

In this section, we focus on the tradeoffs between the various software arbitration heuristics

(e.g., timestamp, transaction size) that prioritize transactions and try to avoid starvation. We

analyze the influence of these policies on the varying concurrency levels of Eager and Lazy. All

managers in this section include backoff to eliminate performance variations due to livelocking.

Note that backoff does not help a specific transaction make progress (i.e., starvation-freedom).

Election/arbitration heuristics help achieve this goal (refer to Section 5.2.2). Election deals with

fairness issues by increasing the priority of the starving transaction over others, letting it win

and progress on conflicts. Here, we investigate three heuristics: transaction age, read-set size,

and number of aborts, under both Eager and Lazy conflict resolution. The plots also include

Reqlose+B and Comwin+B as a baseline.

• Age: The Age manager helps with scenarios where a transaction is repeatedly aborted. It

also helps with increasing the likelihood of a transaction committing if it aborts someone.

Every transaction obtains a logical timestamp by incrementing a shared counter at the

start. If the enemy is older, the attacker waits, hoping to avoid the conflict, and aborts itself if the conflict persists beyond a fixed backoff period. If the enemy is younger it is aborted immediately. The

timestamp value is retained on aborts and no two transactions have the same age, which

guarantees that at least one transaction in the system makes progress.

• Size: The Size manager tries to ensure that (1) a transaction that is making progress

and is closer to committing does not abort and (2) read sharing is prioritized to improve

overall throughput. This heuristic approximates transaction progress by the number of

read accesses made by the transaction. It uses this count to arbitrate between conflicting

transactions.

This manager uses a performance counter to estimate the number of reads.2 Finally,

Size also considers the work done before the transaction aborted. Hence, transactions

restart with the performance counter value retained from previous aborts (similar to

Karma [ScS05]).

2 When arbitrating, software would also need to read performance counters on remote processors to compare against. For this, we use a mechanism similar to the SPARC %ASI registers.


• Aborts: The Abs manager tries to help transactions that are aborted repeatedly. Each transaction accumulates a count of the number of times it has been aborted, and on a conflict the manager uses this count to make a decision. Unlike Size, it does not need a performance counter since abort events are infrequent and can be counted in software. Abs has weaker progress guarantees than Age; two dueling transactions can end up with the same abort counter and kill each other. Like Age and Size, it always waits a bit before making the decision. (All three heuristics are sketched below.)
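The three election heuristics differ mainly in the quantity they compare. A minimal sketch, reusing the hypothetical TxInfo record from Section 5.2.2 (the code is ours and purely illustrative):

    // Each predicate returns true if the attacker should win the conflict;
    // every heuristic first waits a bounded backoff period before acting.
    bool age_wins(const TxInfo& a, const TxInfo& e)   // older tx wins
      { return a.timestamp < e.timestamp; }
    bool size_wins(const TxInfo& a, const TxInfo& e)  // larger read set wins
      { return a.reads > e.reads; }
    bool abs_wins(const TxInfo& a, const TxInfo& e)   // more past aborts wins
      { return a.aborts > e.aborts; }

Because timestamps are unique while read counts and abort counts can tie, only Age gives a total order and hence the strongest progress guarantee.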

Figure 5.3 shows the results of our evaluation of the above policies. '-E' refers to Eager systems and '-L' to Lazy systems. We have removed Kmeans and Genome from the plots

since they have very low conflict levels, with all policies demonstrating good scalability.

Result 2a: Conflict resolution mode seems to be more important than the contention management heuristic. Lazy's ability to allow concurrent readers and writers finds more parallelism than any Eager system, and this helps with overall system throughput.

Result 2b: Starvation and livelock can be practically avoided with software-based priority

mechanisms (like Age). They should be selectively applied to minimize their negative impact

on concurrency. With Lazy there is typically at least one transaction at commit point, which

manages to finish successfully, ensuring practical livelock-freedom.

Overall, Lazy exploits reader-writer sharing and allows concurrent transactions to commit

while Eager systems are inherently pessimistic, which limits their ability to find parallelism.

Also, a Lazy transaction aborts its enemies only when it reaches its commit phase, at which

point it has a better likelihood of finishing. This helps with overall useful work in the system.

Note that multi-programmed workloads could change the tradeoffs [SSH07]. Currently, on an

abort, the transaction keeps retrying until it succeeds; if the resources were to be yielded to

other independent computation, Eager could be a better choice. Power and energy budgets could also limit Lazy's speculation, which would affect the concurrency that can be exploited for performance.

As shown in Figure 5.3, a specific contention manager may help some workloads (by helping prevent starvation) while hurting performance in others (by serializing unnecessarily).


We have observed that Size performs reasonably well across all the benchmarks. It seems

to have a similar effect on Eager and Lazy alike. Size maximizes concurrency and tries to

help readers commit early, thereby eliding the conflict entirely without aborting the writer (e.g., Vacation); orthogonally, Size also helps with writer progress since typically writers that are

making progress have higher read counts and win their conflicts. Note that the number of reads

is also a good indication of the number of writes since most applications read locations before

they write them.

Age helps transactions avoid starvation in workloads that have high contention (LFUCache

and RandomGraph). Age’s timestamps ensure that a transaction gets the highest priority in a

finite time and ultimately makes progress. On other benchmarks, Age hurts performance when

interacting with Eager. This is due to the following dual pathologies: in Eager mode, (1) Age can lead to readers convoying and starving behind an older, long-running writer; in Lazy

mode, since reads are optimistic, no such convoying results. (2) Age can also result in wasteful

waiting behind an older transaction that gets aborted later on (akin to “FriendlyFire” [BMV07]);

with Lazy the likelihood is that the transaction that reaches the commit point first (the one

that would wait) is also older and therefore makes progress and avoids waiting once again.

Bobba [BMV07] explored hardware prefetching mechanisms to prioritize starving writers in a

rigid-policy HTM that prioritized enemy responders. We have shown that similar effects can be

achieved with Age; we can also explore writer priority in a more straightforward manner since

the control of when and which transaction aborts is in software.

As for the Abs manager, its performance falls between Size and Age. This is expected, since

it does not necessarily try to help with concurrency and throughput like Size but does not hurt

them with serialization like Age.

The “Serialized commit” pathology observed by others [BMV07] does not arise with our

optimized Lazy implementation, which allows parallel arbitration and commits. Lazy exhibits

a significant performance boost over even contention-management-optimized Eager systems. Although the contention manager can help eliminate pathologies, it does not affect

the concurrency exploited. In general, Lazy exploits more concurrency (reader-writer overlap),

avoids conflicts, and ensures better progress (some transaction is at the commit stage) than Ea-

ger.


[Figure 5.3 presents one bar plot of normalized throughput at 16 threads for each benchmark: (a) Bayes, (b) Delaunay, (c) Intruder, (d) Labyrinth, (e) Vacation, (f) STMBench7, (g) LFUCache, and (h) RandomGraph. Each plot compares L-Abs, E-Abs, Comwin+B, Reqlose+B, L-Age, E-Age, L-Size, and E-Size.]

L: Lazy, E: Eager conflict resolution. Y-axis: normalized throughput, 1-thread throughput = 1.

Figure 5.3: Interplay of conflict management and conflict resolution

Combining Lazy with selective invocation of the Age manager (to help starving transactions get priority) and with backoff (to avoid convoying) would lead to an optimized system that can

handle various workload characteristics.


Finally, for some workloads (e.g., STMBench7), the conflict resolution modes and con-

tention management policies explored above do not seem to have any noticeable impact on

their performance. We analyze the reasons and propose solutions in the next section.

5.4.1 Wasted work in Eager and Lazy

To analyze the performance variations between Eager and Lazy we explore the access pat-

terns and transaction interleaving prevalent across the applications we study. We focus on

two aspects: (1) the performance loss due to wasted work in aborted transactions (this section) and (2) the concurrency between readers and writers (Section 5.4.2). Earlier works [MBM06; BMV07] have commonly argued that since Lazy resolves a conflict later in a transaction's execution (i.e., at commit), doomed enemies discover their status later and end up wasting more work compared to Eager. Figure 5.4(a) sketches the scenario for a read-write

conflict. A transaction T1 reads a location A that is subsequently written by a concurrent writer

T2 that commits earlier. All of T1’s work is wasted regardless of whether a Lazy or Eager sys-

tem is used. There is extra work wasted in Lazy since T1 is not notified until T2’s commit point.

This extra work is proportional to the interval between T2’s write and commit. Since in most

transactions the writes occur in a more clustered manner than reads and towards the end of the

transaction, this extra wasted work can be expected to be limited.

The statistics in Table 5.3 indicate that Eager in fact tends to waste more cumulative work

than Lazy. This contradicts Eager’s design philosophy: detect and abort conflicting transactions

as early as possible to save wasted work. The primary reason for this is futile aborts — while in

Lazy each individual aborted transaction may waste more work (than Eager), overall, Eager can

lead to a higher number of aborts, wasting cumulatively more work than Lazy. Figure 5.4(b)

sketches the access pattern prevalent across most applications with moderate levels of conflict

(e.g., Bayes, Intruder, Labyrinth, and Vacation). In Eager, transaction T1 aborts in favor of T2, while T2 subsequently does not commit. The work in both T1 and T2 is wasted. Conversely,

if T2 aborted itself in favor of T1, it is possible that T1 is aborted subsequently leading to a

similar wastage of work. These scenarios occur because of a fundamental design principle in

Eager — address potential conflicts early even before they are confirmed (e.g., here the conflict

between T1 and T2 is confirmed only when T1 attempts to commit before T2). Essentially, on


a conflicting access, Eager expects the contention manager to speculate on which transaction is

likely to commit and this is a difficult task with multiple concurrent operations.

[Figure 5.4 sketches three transaction interleavings on a time axis, each contrasting Eager and Lazy: (a) Wasted Work, (b) Futile Aborts, and (c) Read-Write Sharing.]

(a) Wasted work due to aborts: Lazy wastes more work than Eager, proportional to the time between the write and the commit. (b) Futile aborts: Eager can cause aborts in favor of an attacker that itself does not commit; Lazy reduces the likelihood of this. (c) Concurrency between reader and writer transactions: Lazy permits conflicting transactions to concurrently execute and commit; Eager does not support this pattern.

Figure 5.4: Interaction of access patterns with conflict resolution

Lazy allows concurrent execution between even conflicting transactions and eliminates these

problems. Since T2 does not abort T1 until the commit stage, it reduces the likelihood of a third

transaction intervening to cause T2’s victory over T1 to be futile. When a transaction (T1)

aborts in favor of a transaction (T2), the likelihood of the winner (T2) committing is higher than

in Eager. The primary reason is the window of vulnerability: the time from the point at which a transaction aborts its enemy to its commit point. During this time, it could be aborted

by other transactions, which would render the earlier abort of the enemies futile. In the case of

Eager this window extends from the point of the access that causes the abort to the commit. In


the case of Lazy, it is limited to the duration of the commit phase, which is negligible in most

workloads.

Another problem with Eager is that it does not permit both T1 and T2 to concurrently ex-

ecute. If the contention manager chooses T2 as the transaction likely to succeed, the only

option for T1 would be backoff the write to A. Note that this backoff could itself lead to an-

other problem; possible cascading of a third transaction T3 behind T1 (access to location B

in Figure 5.4(b)). This kind of cascaded stalls could potentially lead to deadlock and hence

backoff has to be finite: if the conflict persists eventually one of the transactions will have to

abort. Compared to this, Lazy permits concurrency between T1 and T2, and does not require

any contention management until T2 has finished all the work and reaches the commit point.

5.4.2 Concurrent Readers and Writers

An important benefit of Lazy is the support for concurrent readers and writers to the same lo-

cation. Read-Write (and its converse Write-Read) sharing is an access pattern prevalent across

most transactional workloads (see Table A.1 in Appendix A). Conflicts between reader and writer transactions (R-W or W-R conflicts) are the dominant type (see Figure A.1 in Appendix A). Most

of the workloads (in our suite) use transactions to operate on pointer-based data structures like

trees. All transactions, irrespective of the sub-tree they are interested in, access the higher levels

in the tree (e.g., root node) to get to the sub-tree. Writer operations typically re-organize the

tree, modify the higher levels in the tree, and conflict with reader transactions. Note that modifi-

cations to the root would not in most cases affect the safety of the readers, which once they have

traversed to the sub-tree do not care about the root. Typically, the writer transaction also con-

flicts with multiple concurrent readers (e.g., Bayes, Labyrinth, Vacation). Figure 5.4(c) shows

an illustration of this conflict pattern. Both T1 and T2 read a location A, which is subsequently

written by T3. In an Eager system, if T3 is prioritized, then it needs to abort both T1 and T2 and

ends up wasting work from both transactions. However, Lazy permits all three transactions to

continue execution until T3 needs to commit. This provides a window of opportunity, between the write to A and T3's commit point, for other concurrent reader transactions to complete.

This allows Lazy to achieve more commits than Eager — in this case three compared to Eager’s

one.


5.5 Mixed Conflict Resolution

As shown in Figure 5.3, none of the contention managers seem to have any noticeable positive

impact on STMBench7’s scalability. Despite the high level of conflicts, both Eager and Lazy

perform similarly. STMBench7 has an interesting mix: unlike other workloads, it contains transactions of varying length, with long-running writer transactions interspersed with short readers and writers. This presents an unhappy tradeoff between the desire to allow more concurrency and the desire to avoid high levels of wasted work on aborts. Eager cannot exploit the concurrency, since the presence of the long-running writer blocks out other transactions. With Lazy, the abort of long writers by other potentially short (or long) writers starves them and wastes useful work. We evaluate a new conflict resolution mode in HTM systems, Mixed, which resolves read-write and write-read conflicts lazily while resolving write-write conflicts eagerly.3

For Write-Write conflicts, there is no valid execution in which two writers can concur-

rently commit. Mixed uses eager resolution to abort one of the transactions and thereby avoid

wasted work, although it is possible to elect the wrong transaction as the winner (one that will

subsequently be aborted). For Read-Write conflicts, if the reader’s commit occurs before the

writer’s then both transactions can concurrently commit. Mixed postpones conflict resolution

and contention management to commit time, trying to exploit any concurrency inherent in the

application.
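In the terms of the earlier sketches, Mixed is a one-line dispatch on conflict type (hypothetical enums from Section 5.2.2):

    // Mixed: write-write conflicts are resolved eagerly at access time;
    // read-write and write-read conflicts are deferred to commit time.
    Resolution when_to_resolve(ConflictType t) {
      return (t == ConflictType::WriteWrite) ? Resolution::Eager
                                             : Resolution::Lazy;
    }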

Figures 5.5 and 5.6 plot the performance of Mixed against Eager and Lazy. To isolate

and highlight the performance variations due to conflict resolution, we plot different contention managers on different plots: Figure 5.5 uses the Size contention manager and Figure 5.6 uses the Age contention manager (see the description of contention managers in Section 5.4).

Result 3: Mixed combines the best features of Eager and Lazy. It can save wasted work on

write-write conflicts and uncover parallelism prevalent with read-write sharing.

As the results (see Figure 5.5 and Figure 5.6) demonstrate, Mixed is able to provide a sig-

nificant boost to STMBench7 over both Eager and Lazy. In STMBench7, which has a mix of

long running writers conflicting with short running writers, resolving write-write conflicts early

3 In FlexTM [SDS08], this requires minor changes to the conflict detection mechanism. In Lazy mode, where the hardware would have just recorded the conflict in the W-W list, it now triggers the contention manager.


reduces the work wasted when the long writer aborts. Similar to Lazy it also exploits more

reader-writer concurrency compared to Eager.

When there is significant reader-writer overlap (Bayes, Delaunay, Intruder, Labyrinth, and Vacation), Mixed's performance is comparable to the Lazy system. In Section 5.4, we saw that

for the STAMP workloads and STMBench7, Size was the best performing contention manager.

Comparing the plots of the STAMP workloads and STMBench7 between Figure 5.5 and Fig-

ure 5.6, the performance order between Size and Age carries over to Mixed as well.

On LFUCache, Mixed with the Size contention manager performs badly due to dueling

writer transactions. Trying to exploit reader-writer parallelism does not help, since all transactions seek to upgrade the read location, causing a write-write conflict. Furthermore, a writer could abort another transaction only to find itself aborted later (cascaded aborts). This leads to an overall fall in throughput. On RandomGraph, Mixed livelocks with the Size contention manager, similar to Eager. Switching the contention manager to Age, which provides stronger progress guarantees, helps both Eager and Mixed achieve higher performance. Further, Mixed's ability to exploit read-write sharing helps it uncover more concurrency than Eager and improves performance.

In summary, Mixed saves some of Lazy's wasted work in the case of write-write conflicts while continuing to exploit read-write concurrency as in Lazy. The inability to exploit read-write concurrency is a fundamental design limitation of Eager. Mixed may suffer from progress problems similar to Eager's, but as in Eager these can be solved with an appropriate contention manager.

5.5.1 Implementation Tradeoffs

There is a general assumption that Eager is easier to implement in hardware compared to Lazy

because of its more modest versioning requirements. Eager conflict mode maintains the “Sin-

gle writer or Multiple reader" invariant. At most two versions must be maintained for a specific memory block: a single writer's speculative version and the non-speculative original version.


[Figure 5.5 presents bar plots comparing Eager, Lazy, and Mixed on Bayes, Delaunay, Intruder, Labyrinth, Vacation, STMBench7, LFUCache, and RDG.]

Y-axis: Normalized speedup at 16 threads; throughput normalized to sequential runs. RDG: RandomGraph. We use the Size contention manager.

Figure 5.5: Interaction of Mixed resolution with Size contention manager.

[Figure 5.6 presents bar plots comparing Eager, Lazy, and Mixed on Bayes, Delaunay, Intruder, Labyrinth, Vacation, STMBench7, LFUCache, and RDG.]

Y-axis: Normalized speedup at 16 threads; throughput normalized to sequential runs. RDG: RandomGraph. We use the Age contention manager.

Figure 5.6: Interaction of Mixed resolution with Age contention manager.

At any given instant, either multiple readers are allowed to access the non-speculative original version, or the single writer is allowed to access its speculative ver-

sion. Mixed maintains the “Single writer and/or Multiple reader” invariant. Similar to Eager,

this requires a maximum of two versions of the memory block. At any given instant, how-

ever, a single writer may access its speculative copy and/or concurrent readers may access the

non-speculative original version. The existence of a maximum of two versions simplifies the

implementation of versioning.

Conversely, Lazy is a “Multiple writer and/or Multiple reader” scheme, which explodes


the number of possible data versions required, potentially requiring as many as the number of

speculative writer transactions in addition to the one non-speculative version required by the

readers. Thus, taking implementation costs into consideration, Mixed offers a good tradeoff

between performance and complexity-effective implementation.

5.5.2 Porting Mixed to other TMs

Given Mixed’s benefits, it would be interesting to consider the changes required to adopt it

in other HTMs. Any HTM system that desires to implement Mixed’s “Single writer and/or

Multiple reader” model needs to implement the following: (1) a versioning mechanism that

allows one transaction to speculatively write a block while allowing other concurrent readers to access the non-speculative copy, and (2) a conflict detection mechanism that detects and resolves write-write conflicts at access time and resolves read-write (or, conversely, write-read) conflicts

at commit time.

Existing Lazy HTM systems (Bulk [CTC06], TCC [HWC04]) could support Mixed with

minimal changes. Adapting the versioning mechanism to support Mixed is trivial since the

system already allows multiple writers and readers to share a cache block. Changes would be needed only in the conflict detection mechanism. Lazy systems other than FlexTM (Bulk [CTC06]

and TCC [HWC04]) implement conflict detection using a centralized arbitration mechanism

that intervenes on transaction commits. Such systems implement implicit transactions in which

processors communicate with the shared memory at the granularity of transactions rather than

individual memory operations. At commit time, a transaction broadcasts its write-set so that

other transactions can compare their access sets against it and detect conflicts. Mixed would

need to augment this with a more traditional coherence framework in which individual writes

are sent out at access time. Hence, a transaction write initiates coherence requests, which other

transactions use to detect conflicts with their write-sets. Commit-time write-sets also need to be

sent out to detect conflicts with the read-set of other transactions. Since these systems combine

detection and resolution, changes are needed to detect some conflicts (write-write) eagerly and

some (read-write and write-read) lazily. FlexTM always detects conflicts eagerly but leaves the

choice of resolution to software, which can then choose to resolve the conflict eagerly or lazily

based on the conflict type.


Porting other Eager HTM systems (e.g., LogTM-SE [YBM07]) to support Mixed is more

challenging when compared to Lazy. Eager systems allow only a single writer for a cache block

at any given instant and preclude other transactions from sharing the same block. They imple-

ment versioning using an undo-log, which writes new speculative values in place and stores the old values in a log (to be restored on abort). Like Eager, Mixed allows only one transaction to write a block, but unlike Eager it requires that multiple concurrent readers be able to access the block. This means Eager systems, which typically implement an undo-log (old values in the log, new values in the memory location), need to provide concurrent readers the current committed value from the undo-log. For conflict detection, Eager HTMs piggyback on the coherence framework

— they maintain the “single writer or multiple reader” invariant. This proves sufficient for

Mixed as well since it resolves write-write conflicts eagerly at access time. The remaining challenge is Mixed's requirement for delayed commit-time resolution of read-write

conflicts. Possible implementations include either support for bulk coherence operations or the

use of conflict summary tables as in FlexTM [SDS08], which allow lazy resolution of multiple

conflicts in parallel without centralized hardware structures.

5.6 Other studies on contention management

The seminal DSTM paper by Herlihy et al. [HLM03b] introduced the concept of “contention

management” in the context of STMs. They postulated that obstruction-free algorithms enable

the separation of correctness and progress conditions (e.g., avoidance of livelock), and that a

contention manager is expected to help only with the latter. Scherer et al. [ScS05] investigated

a collection of arbitration heuristics on the DSTM framework. Each thread has its own con-

tention manager and on conflicts, transactions gather information (e.g., priority, read/write set

size, number of aborts) to decide whether aborting enemy transactions will improve system

performance. This study did not evaluate an important design choice available to the con-

tention manager: that of conflict resolution time (i.e., Eager or Lazy). Shriraman et al. [SSH07]

and Marathe et al. [MSS04] observed that laziness in conflict resolution can significantly im-

prove the throughput for certain access patterns. However, these studies did not evaluate con-

tention management. In addition, evaluation in all these studies was limited to microbench-


marks. Scott [Sco06] presents a classification of possible conflict resolution modes, including

the mixed mode, but does not discuss or evaluate implementations. Contention management can

also be viewed as a scheduling problem. Yoo et al. [YoL08] and CAR-STM [DHS08] use cen-

tralized queues to order and control the concurrent execution of transactions. These queueing

techniques preempt conflicts and save wasted work by serializing the execution of conflicting

transactions. Yoo et al. [YoL08] use a single system-wide queue and control the number of

transactions that run concurrently based on the conflict rate in the system. Dolev et al. [DHS08]

use per-processor transaction issue queues to serially execute transactions that are predicted to

conflict. While they can save wasted work, these centralized scheduling mechanisms require

expensive synchronization and could unnecessarily hinder existing concurrency. Furthermore,

the existing scheduling mechanisms serialize transactions on all types of conflict. Serializing

transactions that only have read-write overlap significantly hurts throughput and could lead to

convoying [SDS08; BMV07].

Most recently, Spear et al. [SDM09] have performed a comprehensive study of contention

management policy in STMs. Though limited to microbenchmarks, they analyze the perfor-

mance tradeoffs under various conflict scenarios and conclude that Lazy removes the need for

sophisticated contention managers in STMs. Our analysis reveals a similar trend in HTMs as

well, indicating that hardware designers must pay attention to the policies embedded in hard-

ware in order to avoid losing the benefits of hardware acceleration.

It would be fair to say that hardware supported TM systems have mainly focused on im-

plementation tradeoffs and have largely ignored policy issues. Bobba et al. [BMV07] were the

first to study the occurrence of performance pathologies due to specific conflict detection, man-

agement, and versioning policies in HTMs. Their hardware enhancements targeted progress

conditions (i.e., practical starvation-freedom, livelock-freedom) and did not focus on the con-

currency tradeoffs between Eager and Lazy (see Section 5.2). Furthermore, they limited their

study to three points in the design space, which does not fully capture the interaction between

the various other contention managers and conflict resolution policies. Baugh et al. [BNZ08]

and Ramadan et al. [RRP07] compare a limited set of previously proposed STM contention

managers in the context of Eager systems.

Ramadan et al. [RRW08] have also proposed dependence-aware transactions as


an alternative conflict resolution model. Instead of resolving conflicts, they forward data be-

tween speculative transactions and tie their destinies together (requiring support for potential

rollback of multiple dependent transactions) with the goal of uncovering more concurrency.

It is not yet clear that the performance improvements promised by dependence-awareness merit

the hardware complexity. We demonstrate here that Lazy is sufficient to uncover most of the

parallelism prevalent in the application.

5.7 Summary

We presented a comprehensive study of the interplay between policies on “when to detect” (con-

flict resolution) and “how to manage” (conflict management) conflicts in hardware-accelerated

TM systems. Although the results were obtained on a HTM framework, the conclusions and

recommendations are applicable to any type of TM: hardware, software, or hybrid, and are

corroborated by those in Spear et al. [SDM09].

Our first set of experiments corroborated recent studies showing that randomized backoff is an essential heuristic, best applied before conflict management in order to potentially side-step the

conflict. We then demonstrated that Lazy provides higher performance than Eager by exploit-

ing reader-writer concurrency prevalent in many applications and by narrowing the window of

conflict. Finally, we evaluated a Mixed conflict resolution mode in the context of HTMs. Mixed

mode retains most of the concurrency benefits of Lazy and outperforms it (by saving wasted work) in workloads dominated by write-write conflicts.


Table 5.3: Characteristics of aborted transactions

                 Eager-Reqwins                      Lazy-Committer wins
                 w/o backoff     w/ backoff        w/o backoff     w/ backoff
Benchmark        Size   Ab/Ct    Size   Ab/Ct      Size   Ab/Ct    Size   Ab/Ct
Bayes             35    5.1       43    0.56        49    0.45      52    0.31
Delaunay          39    3.1       46    0.41        51    0.24      55    0.2
Genome            33    0.01      45    0.01        51    0.01      58    0.01
Intruder          20   12         29    3.1         41    0.8       49    0.78
Kmeans            60    0.02      75    0.02        79    0.02      81    0.02
Labyrinth         19   14         31    3.6         68    0.65      79    0.59
Vacation          30    4.2       40    1.6         54    0.28      54    0.25
STMBench7         32    7.1       45    1.05        74    0.65      80    0.54
LFUCache          45    —         51   23           89   12.9       90   11
RandomGraph       35    —         49    —           48   18.3       56   19

Size: Ab-Tx Size (% CtSize), the length of the average aborted transaction as a percentage of a committed transaction's duration. Ab/Ct: average number of aborts per commit. —: livelock.


Chapter 6

Protection: Sentry

In this chapter, we describe the Sentry system that we proposed at ISCA 2010 [ShD10]. Sen-

try enables software developers to set up intra-application protection domains. Section 6.1

discusses the reliability challenges encountered by large applications that consist of multiple

modules and motivates the need for an access control mechanism. In Section 6.1.1, we dis-

cuss the design tradeoffs in the implementation of access control hardware and summarize the

limitations of current approaches. Section 6.2 presents the Sentry architecture and describes

the use of cache coherence states to implement access control. Section 6.3 presents the soft-

ware framework and compares the various intra-application protection models supported by

Sentry. In Section 6.6, we use Sentry to implement a protection model for the modules of the

Apache webserver and demonstrate that this can be achieved with moderate performance overhead (≈13%). We also develop a watchpoint debugger using Sentry and compare it against

other hardware-based watchpoint mechanisms in Section 6.7.

6.1 Motivation

Modern applications are complex artifacts consisting of millions of lines of code written by

many developers. Developing correct, high performance, and reliable code has thus become

increasingly challenging. The prevalence of multicore processors, resulting in the need for mul-

tiple threads of control in order to harness the available compute power, has increased the burden on software developers.


[Figure 6.1 depicts three Apache modules (M1: the Apache core, with http_request(), allocator(), and utility routines; M2: Mod_Cache; M3: Mod_log), each with private data among D1...D8, plus shared data and interface functions exported between modules, annotated with per-module permissions such as D5 (M1:R/W) (M2:R).]

Example of software modules in Apache. M1, M2, M3: modules; D1...D8: data elements. Dashed lines indicate shared data and function interfaces between modules. A tuple D: (M:P) indicates that module M has permission P on memory location D.

Figure 6.1: Software modules in Apache

Intra- and inter-thread interactions on shared data make it difficult for

developers to track, debug and validate the accesses arising from the various software modules.

Figure 6.1 presents a high-level representation of the developers’ view of Apache [10b] at

design time. The system designers define a software interface that specifies the set of functions

and data that are private and/or exported to other modules. For the sake of programming sim-

plicity and performance, current implementations of Apache run all modules in a single process

and rely on adherence to the module API to enforce protection. A bug or safety problem in any

module could potentially (and does) affect the whole application.

There are two main aspects to enforcing the protection of a specific datum or object: (1)

the domain, referring to the environment or context, typically the module, that a thread is ex-

ecuting in; and (2) the access privileges, the set of permissions available to a domain, for the

object [Lam71]. Access control enforces the permissions specified for the domain and object.

An instance of this use is enforcing the rule that a plug-in should not have access to the appli-

cation’s data or another plug-in’s data, while the application should have unrestricted access to

all data. Access control can also be used for program debugging in order to intercept accesses


or detect violations of specific access invariants, such as accesses to uninitialized or unallocated

locations, dangling references, unexpected write values, etc. Tracking memory bugs is an in-

tensive task, and it is especially complicated in modularized, collaborative applications where accesses to memory regions are split across code developed independently.
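Abstractly, access control evaluates a (domain, object) pair against a permission map. The sketch below captures only the model; the names and the software map standing in for hardware-cached metadata are ours, not Sentry's implementation:

    #include <cstdint>
    #include <map>
    #include <utility>

    enum class Perm : uint8_t { None, Read, ReadWrite, Execute };

    // Permission lookup keyed by (domain id, object id).
    using PermTable = std::map<std::pair<int, int>, Perm>;

    bool allowed(const PermTable& t, int domain, int object, bool is_write) {
      auto it = t.find({domain, object});
      if (it == t.end()) return false;               // default deny
      if (it->second == Perm::ReadWrite) return true;
      return !is_write && it->second == Perm::Read;  // reads need Read
    }

In Figure 6.1's notation, the tuple D5 (M2:R) would populate one such entry, and a write to D5 from domain M2 would fail the check.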

In this section, we describe Sentry, a hardware framework that enables software to enforce

flexible application-specific protection policies at runtime. The core developer annotates the

program to define the policy and then the system ensures the privacy and integrity of a module’s

private data (no external reads or writes), the safety of inter-module shared data (by enforcing

permissions as described in the annotations), and adherence to the module’s interface (con-

trolled function call access points as in Multics [Org72]).

Sentry is a light-weight, multi-purpose access control mechanism that is independent of

and subordinate to (enforced after) the page protection in the Translation-Lookaside-Buffer

(TLB). We implement Sentry using a permissions metadata cache (M-cache) that intercepts

only L1 misses and resides entirely outside the processor core. It reuses the L1 cache coherence

states in a novel manner to enforce permissions and elide checks on L1 hits. This enables

a simpler design and places fewer constraints on the pipeline cycle compared to per-access

in-core checks [WCA02]. Since checks are less frequent and occur on higher-latency L1 misses, Sentry uses up to 100x less energy than required by the in-processor option. The M-cache also has

a novel dual-tag organization that transparently maintains metadata coherence and simplifies

management (no need for heavy-weight interprocessor interrupts).

From the software’s perspective, Sentry is a pluggable access control mechanism for

application-level memory watching and protection. It works as a supplement (not a replace-

ment) to OS process-based protection. The protection models it supports are orthogonal to existing process-based protection and provide varying degrees of flexibility with which applications can set up and enforce access protection contracts. Applications that desire fine-grain protection annotate mem-

ory (code and data) regions and specify permissions. No other changes to the programming

model or language are required.


6.1.1 Access Control in the Memory Hierarchy

Access control must essentially provide a way for software to express permissions on a given location for the active threads. Hardware is expected to intercept the accesses, check them

against the expected permissions, and raise an exception if the thread does not have appropriate

rights. Current processors typically implement a physical memory hierarchy through which

an access traverses looking for data. Accesses can be intercepted at any level in the memory

hierarchy: within the processor core, in the L1 miss path (Sentry design), or any other level.

Here, we analyze the design tradeoffs between intercepting these accesses at the various levels.

Most protection schemes (Mondriaan [WCA02], Loki [ZKD08], Page-protection in the

TLB) adopt the logically simple option of intercepting all memory accesses within the pro-

cessor, even before they are visible to the memory system. After the access is intercepted, they

use the address in the access to perform a lookup into a hardware table that caches the permissions information. Since every access is intercepted, this enables support for access control at

word-level granularity of memory operations, although hardware may choose to increase the

granularity of protection to reduce the overheads of the permissions metadata. Unfortunately,

checking permissions on every access induces significant energy cost. The permissions cache

requires additional structures in highly optimized stages of the processor pipeline. Since the

access control hardware is consulted in parallel with L1 access on every load and store, the

hardware design is constrained to complete the checks before the loaded value becomes visi-

ble to other instructions. To achieve this, high-performance transistor technology will likely be

employed, leading to a noticeable energy and area budget.

Placement of access control outside the processor, lower in the memory hierarchy, is challenging. The lower levels of the cache system are shared among the processors, which makes it difficult to implement different permissions for each thread. Permission changes also need to be handled more carefully than in the in-processor approach, since already-cached data could potentially bypass the new permissions. In the past, access control at the lower levels of the memory hierarchy has been explored only to implement software-based cache coherence [SFL94].

Figure 6.2 qualitatively summarizes the tradeoffs of placing the hardware access-control cache at the various levels of the memory hierarchy.


[Figure 6.2 sketches the memory hierarchy (CPU, L1$, L2$, ..., LN$, with representative access latencies of 2-3 ns at L1 and 20-30 ns at L2) and the candidate placements of a permissions cache (Perm.$). Placements closer to the processor allow finer-granularity checks; placements farther away reduce the number of checks and permit more design flexibility.]

Figure 6.2: Access control in the memory hierarchy

As we locate the permissions cache closer to the processor, it can intercept finer-granularity accesses and can support up to word-granularity checks. However, storing permissions at the granularity of words poses a

significant challenge in terms of space overhead and requires complex encoding of the entries

in the permission cache [WCA02]. As we move the permission cache away from the processor

to lower levels in the memory hierarchy it intercepts needs to intercept fewer accesses since

the higher levels cache hits filter out many accesses. This results in significant dynamic energy

savings. If the permissions cache is located in the lower levels in the memory hierarchy it can

afford better design tradeoffs. For instance, a permissions cache located between the processor

and L1 needs to complete the checks within the L1 access latency (typically 2-3ns) whereas if

it is located between the L1 cache and L2 cache it can complete it within the L2 access latency

(typically 20-30ns). This allows it to possibly use more energy efficient transistors to conserve

both dynamic energy and static power. Placing the access control hardware further from the

processor complicates the processor interface (e.g., propagating exceptions back to software).

6.2 Sentry: Auxiliary Memory Access Control

In Sentry, we place the access checks on the L1 miss path and transparently reuse the coherence

states of the L1 to implicitly check the accesses. The L1 cache filters out cache hits, and the permissions cache needs to be accessed only on L1 misses. The L1 miss rate in most applications is low: PARSEC (1-4%), SPLASH2 (4-9%), Apache

(16%). This results in fewer lookups compared to the in-processor permissions cache and a

corresponding savings in dynamic energy. Another benefit is the design flexibility; since the

checks occur only in parallel with L2 accesses (which can take 20-30ns), we have a longer

time window to complete the checks and can trade performance for energy. Employing energy-

optimized transistor technology [MBJ07] can save leakage and dynamic power. Using LOP

transistor technology [Ass07] increases latency to 1ns, which is still well below the critical path

of the L2 access but enables a 33% reduction in dynamic energy and a 10× reduction in static

power.
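To make the energy argument concrete, the following back-of-envelope sketch (plain C, illustrative only) combines the worst-case miss rates quoted above with the 33% LOP-vs-HP dynamic energy figure; the absolute per-check energy cancels out of the ratio.

    #include <stdio.h>

    int main(void) {
        const double lop = 0.33;  /* LOP check energy relative to an HP check */
        const double miss[] = { 0.04, 0.09, 0.16 };  /* PARSEC, SPLASH2, Apache */
        const char *name[] = { "PARSEC", "SPLASH2", "Apache" };
        for (int i = 0; i < 3; i++) {
            /* checks performed per access (miss rate) times energy per check */
            printf("%-8s %.1f%% of the in-processor scheme's check energy\n",
                   name[i], miss[i] * lop * 100.0);
        }
        return 0;
    }

Even at Apache's 16% miss rate, the miss-path cache performs roughly 5% of the in-processor scheme's check energy per access under these assumptions.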

Our design choice to intercept L1 misses implies that the smallest protection granularity we

support is an L1 cache line. Sub-cache-line granularity can be supported at the cost of additional

bits in the cache. To accommodate L1 cache line granularity monitoring (typically 16-64

bytes) the memory allocator can be modified to create cache-line-aligned regions with the help

of software annotations (compiler or programmer). Software can also further disambiguate an

access exception to implement word granularity monitoring if necessary.

6.2.1 Metadata Hardware Cache (M-Cache)

In Sentry, we maintain the invariant that a thread is allowed to access the L1 memory cache

without requiring any checks, i.e., L1 accesses are implicitly validated. To implement this, we

include a metadata cache (M-cache) that controls whether data is allowed into the L1.

Figure 6.3 shows the M-cache positioned in the L1 miss path outside the processor core.

Each entry in the M-cache provides permission metadata for a specific virtual page. The per-

page metadata varies based on the intended protection model (see Section 6.3). In general, it

is a bitmap data structure that encodes 2 bits of access permission (read-only, read/write, no

access, execute) per cache line. Note that these are auxiliary to the OS protection scheme since

the access has to pass the TLB checks before getting to the memory system. This ensures that

the M-cache's policies do not interfere with the OS protection scheme. An M-cache entry is tagged on the processor side with the virtual address (VA) of the page whose accesses use the metadata for permission checks.

(Figure: the M-cache sits on the L1 miss path, between the processor core's private L1 cache and the L1-shared-L2 interface. Each M-cache entry is dual-tagged: a virtual-address tag guards the permissions metadata and its metadata domain, while a second tag holds the physical address of the metadata and snoops forwarded coherence requests. The TLB entry carries VA, PA, attributes, Owner Domain, 'F', and 'V' fields; the core adds user registers and a Thread Domain register.)

Dark lines enclose add-ons. Every TLB entry includes three fields: an 'F' bit to indicate whether the page desires fine-grain protection, a 'Domain' field which specifies the owner domain (see Section 6.3 for details), and a 'V' bit which indicates whether L1 hits may need verification. The L1 cache tag includes a 'C' (checked) bit (see Section 6.2.3).

Figure 6.3: Permissions metadata cache (M-cache)

Using virtual addresses allows user-level examination and manipulation of the M-cache entries and avoids the need to purge the M-cache on a page swap. The domain id fields included in the processor register, TLB entry, and M-cache implement protection domains (discussed in more detail in Section 6.3).

In addition to the processor-side tags, the M-cache contains a network-side tag consisting of

the physical address of the metadata (MPA) to ensure entries are invalidated if the metadata is

changed. In Section 6.2.6, we discuss the need for these tags in detail. The dual side addressing

of the cache does introduce an interesting challenge — the VA, the data address using the access

control metadata, and the MPA, the metadata word’s physical address, may index into different

positions in their respective tag arrays. We need to ensure that when the entry in one of the

tag arrays is invalidated (VA due to processor action or MPA due to network message), the

corresponding entry in the other tag array is also cleaned up. To solve this issue, we include

forward and back pointers similar to those proposed by Goodman [Goo87] — a VA tag includes

a pointer (set index and way position) to the corresponding entry in the MPA array and the MPA

tag entry includes a back pointer. Table 6.1 lists the ISA interface of the M-cache.


Table 6.1: Permissions metadata cache (M-cache) API

Registers
  %mcache_handlerPC:    address of the handler to be called on a user-space alert
  %Domain1_Handler:     address of the handler to be called within the user-level supervisor (see Section 6.3.3)
  %mcache_faultPC:      PC prior to the faulting instruction
  %mcache_faultAddress: virtual address that experienced the access exception
  %mcache_faultIns:     local write or local read
  %mcache_faultType:    M-cache miss or permission exception
  %mcache_entry:        per-page metadata; 2 bits per cache line to represent access permissions (Read-only, Read/Write, Execute, No Access)
  %mcache_vindex:       specifies the vaddr position in the M-cache

Instructions
  get_entry vaddr, %mcache_index    get an entry for vaddr and store its index in %mcache_index; hardware-specified LRU eviction policy
  inv_entry vaddr, %mcache_index    evict vaddr's metadata from the M-cache and return its position in %mcache_index
  LD vaddr, %mcache_index           load vaddr into the M-cache position pointed to by %mcache_index
  LL vaddr, %mcache_index           Load-Linked version of the above
  LD_MPA vaddr, %mcache_index       load the physical address of the cache block corresponding to vaddr into the MPA and set up the pointer to the vaddr tags
  ST %mcache_index, %mcache_entry   store the data in %mcache_entry into the M-cache entry pointed to by %mcache_index
  SC %mcache_index, %mcache_entry   Store-Conditional version of the above
  switch_call %R1, %R2              effects a subroutine call to the address in %R1 and changes the thread domain to that specified in %R2

The %Domain1_Handler register and switch_call instruction are needed for implementing protection domains and are discussed in detail in Section 6.3.3.

6.2.2 Permission Cache Checks

Sentry requires access checks only on L1 misses. Even then, metadata must only be created

when an application requires fine-grain or user-level access control. We use an unused bit

in the page-table entry to encode an “F” bit (Fine-grain) to ensure that metadata is employed

only when necessary. If the software does not need the checks, it leaves the “F” bit unset and hardware bypasses the M-cache for accesses to the page.[1] If the bit is set, then on an L1 miss the

M-cache is indexed using the virtual address of the access to examine the metadata. The virtual

address of the L1 miss is available at the load-store unit.

[1] We also use this technique to prevent accesses to the permissions metadata itself from triggering further metadata lookups.


6.2.3 Coherence-based Access Checks

Once a block is loaded into the processor’s L1 cache, we use coherence states to transparently

enforce permissions (i.e., cache hits do not check the M-cache). The coherence states guarantee

that the corresponding cache block experiences a cache hit only if the specific type of access

occurs. Without any modifications to the coherence protocol, M or E can check read/write

permissions, S can enforce read-only, and I can check any permission type. An attempt to

access data without the appropriate coherence state will result in an L1 miss, which checks the

M-cache. As shown in Table 6.2, other coherence protocol states can also be mapped onto

the basic permission types, no-access, read-only, and read/write. In this dissertation, we only

implement the basic MESI coherence protocol.

Table 6.2: Mapping coherence protocol states to permission checks

  M           Read/Write
  E           Read/Write
  O           Read/Write
  S           Read-only or Read/Write
  F [MHS09]   Read-only or Read/Write
  I           No-access, Read-only, or Read/Write
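As a sketch of how these states act as implicit permission checks (assuming the basic MESI protocol used in this dissertation; names are illustrative, not from the Sentry hardware):

    /* An access "hits" in the L1 only if the line's coherence state already
     * implies the needed right; otherwise the miss path consults the M-cache. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
    typedef enum { READ, WRITE } access_t;

    static int state_permits(mesi_t s, access_t a) {
        switch (s) {
        case MODIFIED:
        case EXCLUSIVE: return 1;          /* read/write implicitly allowed */
        case SHARED:    return a == READ;  /* read-only implicitly allowed  */
        case INVALID:   return 0;          /* any access must check M-cache */
        }
        return 0;
    }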

Data persistence in caches does introduce a challenge; data fetched into the L1 by one

thread continues to be visible to other threads on the processor. Essentially, if two threads

have different access permissions, the L1 hits could circumvent the access control mechanism.

Consider two threads T1 and T2, to which the application has granted read-write and read-only rights to location A, respectively. T1 runs on a processor, stores A, and caches it in the “M” state.

Subsequently, the OS switches in T2 to the same processor. Now, if T2 was allowed to write

A, the permission mechanism would be circumvented. To ensure complexity-effectiveness, we

need to guard against this without requiring that all L1 hits check the M-cache. We employ two

bits, a “V” (Verify) bit in the page-table to indicate whether the page has different permissions

for different threads, and a “C” (checked) bit in the cache tag, which indicates whether the cache block

has already been verified (1 yes, 0 no). All accesses (hit or miss) check permissions if the TLB

entry’s “V” bit is set and the “C” bit in the L1 cache tag is unset, indicating first access within

the thread context. Once the first access verifies permissions, the “C” bit is set. This ensures

that subsequent L1 hits to the cache line need not access the M-cache. The “C” bit of all cached

lines is flash-cleared on context switches.
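The decision logic above can be summarized in the following minimal sketch (field and function names are illustrative, not taken from the Sentry hardware):

    struct tlb_entry { unsigned v_bit : 1; };  /* page has per-thread perms */
    struct l1_line   { unsigned c_bit : 1; };  /* line verified in this ctx */

    /* Returns 1 if the access must consult the M-cache. */
    static int needs_permission_check(const struct tlb_entry *pte,
                                      const struct l1_line *line, int l1_hit)
    {
        if (!l1_hit)
            return 1;                      /* every L1 miss checks the M-cache */
        return pte->v_bit && !line->c_bit; /* first hit in this thread context */
    }
    /* After a successful check, hardware sets the line's C bit so later hits
     * proceed unchecked; a context switch flash-clears all C bits. */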


6.2.4 Exception Trigger

When an access does not have appropriate rights, the hardware triggers a permission exception,

which should occur logically before the access. To implement this, we reuse the existing ex-

ception mechanism on modern processors. The permissions checks do not need to be enforced

until instruction retirement. The M-cache response marks the instruction as an exception point

in the reorder-buffer. When the instruction is about to be committed, the exception is checked

and the software handler is triggered, if needed. The permission check, which is performed by

looking up either the L1 cache state on a hit or the M-cache metadata on a miss, has potential

impact only at the back end of the pipeline in the case of a miss. Pipeline critical paths and

therefore cycle time consequently remain unaffected.

On a permission exception, the M-cache provides the following information in registers:

the program counter of the instruction that failed the check (%mcache_faultPC), the address accessed (%mcache_faultAddress), and the type of instruction (%mcache_faultIns). There

are separate sets of registers for kernel and user-mode exceptions.

6.2.5 How is the M-cache filled?

The M-cache entries can be indexed and written under software control similar to a software

TLB. Allowing software to directly write M-cache entries (1) allows maintenance of the meta-

data in virtual memory regions using any data structure and (2) permits flexibility in setting

permission policies. Hardware never updates the M-cache entries and it is expected that soft-

ware already has a consistent copy. Hence, any evictions from the M-cache are silently dropped

(no writebacks). Since no controller is required to handle refills or eviction, the implementation

is simplified.

The ISA extensions (see Table 6.1) required are similar to the software-loaded TLB found

across many RISC architectures (e.g., SPARC, Alpha, MIPS). There are separate load and store

instructions that access the metadata using an index into the M-cache. In addition, to fill the

MPA (not typically found in the TLB) we use an instruction (LD_MPA) that takes a virtual address and sets up a tag in the MPA array for the corresponding physical address.[2] The MPA is

used by the hardware to ensure the entry is cleaned up if the metadata changes, i.e., when a

store occurs anywhere in the system to the MPA address. Typically, the metadata is maintained

in the virtual address space of the application. Figure 6.4 shows the pseudocode for the insert routine: lines 1-2 get an entry from the M-cache and set up the virtual address of the data to be protected, lines 3-5 set up the permissions metadata in the corresponding entry, and the final two instructions, lines 6-7, try to update the metadata in the M-cache entry. An exception event between lines 1 and 6 (e.g., a context switch) will cause line 6 to fail and the insert routine to be restarted.

[2] Hardware also sets up a pointer to the virtual address tags so that invalidations can clean up entries consistently in both tag arrays.

M-Cache Fill Routine()   /* X: Virtual address of data */
1. get_entry X, %mcache_index
2. LL X, %mcache_index
3. P = get_permissions(X)
4. LD P, %mcache_entry
5. LD_MPA P, %mcache_index
6. SC %mcache_index, %mcache_entry
7. if failed(SC) goto 1;

Figure 6.4: Pseudocode for inserting a new M-cache entry.

The software fill routine has control over the address chosen to be the MPA of an entry. One

option is to use the physical address of the metadata as the MPA, but in reality software can

set up any address to ensure the coherence of the M-cache entry. It is even valid and useful to

have multiple M-cache entries managed by the same MPA. For example, locations A, B, and C

could each have an entry in the M-cache where the VA is tagged with the page addresses of A,

B, and C while each of their corresponding MPA entries could be tagged with MPA X. When

X is written, a lookup in the MPA array would match three entries each with a back pointer to

the corresponding VA entries of A, B, and C, which can then be invalidated — an efficient bulk

invalidation scheme.
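A software-side sketch of this bulk scheme follows; mcache_fill() is a hypothetical wrapper around the fill routine of Figure 6.4 that takes an explicit MPA target, not an instruction from Table 6.1.

    /* Pages A, B, and C are all guarded by one shared MPA target, guard_X. */
    extern void mcache_fill(void *page_va, const void *perm_bitmap,
                            const void *mpa_target);  /* hypothetical wrapper */

    static unsigned long guard_X;                     /* shared MPA address   */

    void protect_group(void *A, void *B, void *C, const void *perms)
    {
        mcache_fill(A, perms, &guard_X);  /* VA tag = A's page, MPA tag = X */
        mcache_fill(B, perms, &guard_X);
        mcache_fill(C, perms, &guard_X);
    }

    void bulk_invalidate(void)
    {
        /* A single store to X generates coherence traffic that matches the
         * MPA tag and invalidates the entries for A, B, and C everywhere. */
        guard_X++;
    }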

6.2.6 Changing Permissions

The permissions metadata structure is in virtual memory space and this allows software to di-

rectly access and modify it. When changes to the access permissions (metadata) are made, we

need to ensure that subsequent accesses to the data locations are performed only when allowed


by the new metadata. For example, assume there are two threads T1 and T2 on processors P1

and P2 that have write permissions on location A. If T1 decides to downgrade permissions to

read-only, we need to ensure that (1) the old metadata entry in P2’s M-cache is invalidated and

(2) since P2 could have cached a copy of A in its L1 before the permissions changed, this copy

is also invalidated. Both actions are necessary to ensure that P2 obtains new permissions on the

next access. We deal with these in order.

Shooting down M-cache entries

This operation is simplified by the MPA field in the M-cache, which is used by hardware to

ensure consistency of the cached metadata. This set of tags snoops on coherence events, and any

metadata updates result in invalidations of the corresponding M-cache entry. Hence all software

has to do is update the metadata and the resulting coherence operations triggered will clean up

all M-cache entries. Action 1 in Figure 6.5 illustrates this operation. Most previously proposed

permission lookaside structures (e.g., TLB, Mondriaan [WCA02]) typically use interprocessor

interrupts and sophisticated algorithms to shootdown the entry [TKS88].

Checks of future accesses

Sentry allows L1 hits to proceed without any intervention. Hence, when permissions are down-

graded, the appropriate data blocks need to be evicted from all the L1 caches to ensure future

accesses trigger permission checks. To clean up the blocks from the L1 level, software can per-

form prefetch-exclusives to invalidate the cache block at remote L1s and then evict the cache

line from the local L1 (e.g., using PowerPC’s dcb-flush instruction). Action 2 in Figure 6.5

illustrates the cleaning up of the cached data from the L1s. A final issue to consider is that the

cleanup races with other concurrent accesses from processors and the system needs to ensure

that these accesses get the new permissions. To solve this problem, software must order the per-

mission changes (including cleaning up remote M-cache entries) before the L1 cache cleanup

so that subsequent accesses will reload the new permissions.
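Putting the ordering together, a sketch of a downgrade routine might look as follows; prefetch_exclusive(), dcb_flush(), and set_readonly_bits() are assumed helpers standing in for machine-specific operations (such as PowerPC's dcb-flush) and for the metadata update.

    #include <stddef.h>
    #define CACHE_LINE 64

    extern void set_readonly_bits(unsigned char *metadata, size_t nlines);
    extern void prefetch_exclusive(void *line); /* invalidates remote L1 copies */
    extern void dcb_flush(void *line);          /* evicts the local L1 copy     */

    void downgrade_to_readonly(char *data, size_t len, unsigned char *metadata)
    {
        /* Step 1: update the metadata first; its coherence traffic shoots
         * down the stale M-cache entries on all remote processors. */
        set_readonly_bits(metadata, len / CACHE_LINE);

        /* Step 2: only then purge the data lines so future accesses miss in
         * the L1 and reload permissions from the new metadata. */
        for (size_t off = 0; off < len; off += CACHE_LINE) {
            prefetch_exclusive(data + off);
            dcb_flush(data + off);
        }
    }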

(Figure: two processor cores, each with a private L1 cache holding data X (tagged X_Phys) and a private M-cache entry for X whose network-side tag is P_Phys. The sequence "1. Store P; 2. Store X; Flush X" drives the cleanup on the remote core.)

Shootdown of M-cache entries and cleanup of L1 cache lines using cache coherence. X: address of data; P: permissions metadata of X. X_Phys, P_Phys: physical addresses. Action 1 shows a store operation shooting down the metadata in the M-cache. Action 2 illustrates the cleanup of the cached data from the L1s.

Figure 6.5: Changing Permissions

6.3 Sentry Software

In this section, we develop the software infrastructure required to support Sentry and discuss

the various types of protection models. Sentry is a pluggable access control mechanism which

supplements OS process-based protection and this leads to three main advantages. First, it

incurs space and time overhead only when additional intra-application protection is needed as

otherwise existing page-based access control can be used. Second, the software runtime that

manages the intra-application protection can reside entirely at the user level and can operate

without much OS intervention, making the system both efficient and attractive to mainstream

OS vendors. Third, within the same system, applications using the Sentry protection framework

can co-exist with applications that do not require Sentry’s services.

When developing a protection framework, the design decisions that need to be addressed are

(1) Where is the software that regulates the access control mechanism located (e.g., user-level

or OS)? and (2) How do applications that do not take advantage of the protection framework co-exist with code that requires it?


Conventional systems are critically dependent on hiding the protection metadata from ma-

nipulation by user applications. For example, the TLB and protection systems such as Mon-

driaan and Loki [WCA02; ZKD08] are exclusively managed in privileged mode by low-level

operating system software. In order to utilize the mechanism to achieve application-specific

protection, every application in the system must cross protection boundaries via well-defined

(system call) interfaces and abstractions to convey its protection policy to system software.

Furthermore, all applications in the system need to adopt the same framework to implement

protection, whether they desire it or not. The need to satisfy each application’s requirements

usually complicates the protection framework design.

In contrast, Sentry supplements the existing process-based protection provided by the TLB.

Our objective is to support low-cost and flexible protection models, with the emphasis on

application-controlled intra-process protection. Leaving the existing process-based framework

untouched allows us to relocate all the software that manages the access control mechanism to

the user level (within each application itself). This eliminates the challenge of porting a new

protection framework to the OS and reduces the risk of an application’s policy affecting another

application. Each application can independently choose whether to use the Sentry framework

or not. User-level management also allows application-specific optimizations.

6.3.1 Foundations for Sentry’s Protection Models

Sentry’s protection models are realized by regulating the M-cache access control mechanism.

A key concept in realizing the various models is the protection domain — the context of a

thread that determines the access permissions to each data location [Lam71]. Every executing

thread at any given instant belongs to exactly one protection domain and multiple threads can

belong to the same protection domain. Furthermore, a thread can dynamically transition be-

tween different protection domains if the system policy permits it. The M-cache entries for a

particular protection domain can be thought of as a capability list—they specify the data access

permissions for a thread running in the domain. The ability to dynamically change a thread’s

protection domain context allows software to easily perform permissions changes over large

regions without changing per-entry permissions.


Domain 0 and 1

Sentry uses integer values to identify protection domains. Domain 0 is reserved for the oper-

ating system while domain 1 and larger identifiers are used by applications. Within a process,

different application domains must carry different identifiers, but domains in different processes

may share the same identifier. Sentry is focused on intra-application protection and M-cache

entries are flushed on address space (process) context switches.

Domain 1 serves as an application-level supervisor, with the main thread of execution at

the time of process creation being assigned to this domain. Specifically, Domain 1 enjoys the

following privileges:

1. Page Ownership: Domain 1 controls page (both code and data) ownership; all requests

for ownership change must be directed to the operating system via Domain 1.

2. Thread Domain: Domain 1 handles the tasks of adding, removing, and changing the

domain of a thread during its lifetime.

3. M-cache manipulation by non-owner domains: M-cache exceptions (e.g., no metadata

entry) triggered on accesses to non-owner locations (addresses owned by a different do-

main) are handled by Domain 1.

4. Cross-Domain Calls: Domain 1 ensures cross-domain calls (instruction addresses that are

in pages owned by a different domain) can occur only at specific entry points (according

to application-specified policies registered with Domain 1).

Sentry requires the support of a few tasks from the operating system. First, the OS saves

and restores thread state registers associated with Sentry. Second, the OS stores the domain

ownership in the page tables and has them loaded into the processor TLB. It also handles

ownership changes (which must be recorded in the page table) from domain 1. Finally, to ensure

the application supervisor’s control over all process memory space, the OS restricts access to

certain memory space-related system calls (including mmap and sbrk) so that they can only

be made from domain 1.


Hardware Metadata

Several hardware elements play key roles in such realization of Sentry’s protection models.

They include a global %Thread_Domain register, a per-entry Metadata_Domain field

in the M-cache, and a per-entry Owner_Domain field in the TLB (all illustrated earlier in

Figure 6.3). Below we define their specific functions.

The M-cache needs to recognize the current thread’s protection domain and distinguish data

permission information for different domains. %Thread_Domain is a new CPU register that

identifies the protection domain of the current executing thread. Metadata_Domain is a

per-entry field in the M-cache identifying the protection domain that the entry’s access permis-

sion information applies to. On an M-cache check, the %Thread_Domain register and the

access address are both used to index into the M-cache. The M-cache entry with matching

Metadata_Domain and virtual-address tag is identified and its access permission informa-

tion is then checked.

The privilege of filling M-cache entries must be carefully regulated. If the M-cache were

allowed to be modified by any domain, then a thread would be able to grant itself arbitrary

permissions to any location. We introduce a per-entry Owner_Domain field in the TLB,

which identifies the domain that “owns” the page corresponding to that entry. Only a thread

in the page’s owner domain or the exception handler in domain 1 can fill the M-cache for the

locations in that page. The hardware enforces this by guaranteeing that an M-cache entry can be

filled only when the %Thread_Domain register matches the target page's Owner_Domain[3] or when the %Thread_Domain is 1. Note that the ownership is maintained at a coarser page

granularity while the access control mechanism is managed at cache line granularity.

6.3.2 One Domain Per Process

In this model, the existing process-based protection domain boundaries are inherited without

any further refinement. Sentry’s primary benefit is to support flexible cache-line granularity

access control. In this case, all application-level threads within a process belong to protection

domain 1. The domain identifiers of 0 and 1 differentiate operating system and application

[3] The page table entry for the filled address needs to be in the TLB when filling the M-cache. We ensure that an M-cache fill instruction (see Table 6.1) triggers a TLB reload for the filled address.


executions. Sentry supports two modes of access control. In the first mode, the operating

system retains the privileges of filling the M-cache content, managing M-cache misses, and

handling permission exceptions. In the second mode, the application threads can directly per-

form these tasks. Both modes of control can co-exist in a system on a page by page basis. The

mode is determined by each page’s owner domain (0 or 1) loaded into the page table entry’s

Owner_Domain field.

The application-managed model incurs much less overhead than the OS-managed model.

For instance, the cost of permission exception handling appears as an unconditional jump in the

instruction stream (10s of cycles). In comparison, an OS-level exception needs switching of

privilege level and saving of register state, which mirrors the cost of a privilege switch (100s of

cycles). For an application-managed M-cache, the access permission metadata is visible to all

threads within a process and any thread can set up the permissions for its use. The M-cache is

flushed on process context switches to ensure that permissions metadata does not leak beyond

the process boundary. Existing process-based protection isolates data access permissions across

different processes. It is useful for supporting cache-line-grain watchpoints. Reads and writes

to specific locations can be trapped by setting appropriate M-cache permissions. The low-

cost, cache-line-grain watchpoints can help with detecting various buffer overflow and stack

smashing attacks, as well as with debugging code [ZQL04; QLZ05]. This model has weak

protection semantics since any part of the application can modify the M-cache.

The OS-managed M-cache allows safer protected sharing across the different domains. For

example, in remote procedure calls (RPCs), the argument stack can be mapped between an RPC

client and the server process with different permissions (at a page granularity on current hard-

ware [BAL90]). Further, a safe kernel-user shared memory can eliminate expensive memory

copies between the operating system and applications. Sentry supports such protected sharing

at the cache line granularity.

6.3.3 Intra-Process Compartments

This model supports multiple protection domains within a single process. A real-world use of

this model is the Apache application (see Figure 6.1) that supports various features (like caching


pages fetched from the disk) in modules loaded into the web server’s address space. The core

set of developers sets the interface between the various code and data segments. Isolating these

modules into separate domains and enforcing the interface between them (data and functions)

can improve reliability and safety.

In the compartment model, each domain owns a set of data and executable code regions.

Threads in a protection domain can set up the permissions, execute code, and access the data

owned by the domain. They are responsible for handling M-cache misses and filling it. An

owner thread may want to do this to set up application-specific monitoring, like watchpoints.

This case is akin to the application-managed model described in Section 6.3.2.

Cross-Domain Data Accesses and Code Invocation

In some cases, threads in a protection domain may need to access data that is not owned by the

domain and we call these non-owner data accesses. For instance, a web server module accesses

some request data structure from the core application in order to apply output filters. Threads

in a domain may also want to call functions that are not owned by the domain and we call

them cross-domain calls. Applications set up specific permissions for authorizing non-owner

accesses but they must be carefully checked for safety.

One of our key design objectives is to minimize the role of the operating system and al-

low much of the management tasks to be performed directly by the application. This affords

the most flexibility in terms of application-specific protection policies. It also incurs low man-

agement cost, since the operating system does not need to be invoked on policy changes. We

designate domain 1 in each process as a user-level supervisor. Domain 1 takes the primary

responsibility for coordinating cross-domain data accesses and code invocation. It is implicitly

given the ownership role for all memory in the process address space, including the ability to

fill M-cache content and handle M-cache permission exceptions. Domain 1 can also dynam-

ically change the domain (anything other than domain 0) of a running thread by modifying

the %Thread_Domain register. Given its privileges, domain 1 guards its own data and code

against unauthorized accesses by other domains by owning those pages and setting up the ap-

propriate permissions.


(Figure: domain X (caller) and domain Y (callee) above domain 1 (user-level supervisor) and domain 0 (operating system); arrows 1-4 trace the call to Y_PC and the return, both indirected through domain 1.)

Steps of a cross-domain call when a thread in domain X invokes a function whose code is owned by domain Y:

1. Domain X tries to jump into a code region that it does not own. The hardware intercepts the call by recognizing a mismatch between the %Thread_Domain register and the Owner_Domain field in the TLB entry associated with the instruction. The function entry address Y_PC is then saved. The hardware then effects an exception into Domain1_Handler while assuming the context of supervisor domain 1.

2. The domain 1 exception handler marshals the arguments on the stack, grants the appropriate data access permissions to the callee domain, and invokes the function through a special switch_call instruction. The instruction jumps into Y_PC and changes the protection domain context (%Thread_Domain) to Y.

3. On return, the function tries to jump back to the caller. Since the return target is owned by domain X, an exception is triggered into domain 1 and the return target address is saved.

4. Domain 1 finally executes switch_call back to X.

Figure 6.6: Cross-domain call execution flow
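Step 2 of this flow can be sketched in software as follows; switch_call is the ISA instruction of Table 6.1, here modeled as an intrinsic, while read_fault_pc(), lookup_registered_entry(), grant_stack_permissions(), and raise_protection_fault() are hypothetical helpers.

    typedef void (*entry_fn)(void);

    extern void switch_call(entry_fn entry, int domain); /* ISA switch_call     */
    extern entry_fn read_fault_pc(void);                 /* saved target (Y_PC) */
    extern int  lookup_registered_entry(entry_fn pc);    /* -1 if unregistered  */
    extern void grant_stack_permissions(int domain);     /* args + activations  */
    extern void raise_protection_fault(void);

    void domain1_cross_call_handler(void)
    {
        entry_fn target = read_fault_pc();
        int callee = lookup_registered_entry(target);
        if (callee < 0) {
            raise_protection_fault();  /* not an exported entry point */
            return;
        }
        grant_stack_permissions(callee); /* make the marshaled arguments
                                            visible to the callee domain */
        switch_call(target, callee);     /* jump to Y_PC, set %Thread_Domain */
    }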

Non-owner data accesses that trigger M-cache miss exceptions (i.e., no M-cache entry) need

to be handled carefully. The thread itself lacks the privilege to set up the metadata. Hence, these

operations are handled by domain 1. The address to this exception handler in domain 1 is located

in a CPU register (%Domain1_Handler), which is managed exclusively by domain 1. On

a non-owner M-cache exception, hardware changes the domain context (i.e., the %Thread_Domain

register) to domain 1 and traps to the %Domain1_Handler. Based on application specific

policies, the domain 1 exception handler can let the non-owner data access progress (finish or

raise an exception) by filling the appropriate M-cache entry. When the miss handler has finished

its operations, domain 1 reverts back to the original domain context to continue execution.

Subsequent accesses to the same location (from the non-owner domain) can proceed without

interruption.

Cross-domain calls are registered with domain 1, according to the application policies. A

cross-domain call and its return are redirected and managed by domain 1 in a four-step process

illustrated in Figure 6.6.


The indirection through domain 1 for cross-domain calls does impose a minor performance

penalty. Unlike earlier work that speeds up cross-address space calls by using shared-memory

segments [BAL90], our cross-domain call remains in the same address space and the complete

execution flow utilizes the same stack. This allows us to achieve efficiencies comparable to a

standard function call.

6.3.4 Ring Protection

The M-cache access control mechanism is highly flexible and can support different protection

models with small changes. We briefly discuss its ability to support another popular protection

model, a ring [Org72]. In this model, threads in domains 1, 2, ..., k-1 can directly access data and code owned by domain k. This model would be appropriate for workloads like transaction

processing, which typically involve multiple layers of middleware stack [IsS99], and a web

browser client, which includes support for multiple plug-in levels that depend on each other. To

realize the ring protection model, we can tune the M-cache mechanism to expand the concept of

ownership of a memory region. Specifically, for a page owned by domain k, domains 1 through

k−1 all assume the owner’s privileges, such as directly filling the M-cache content for data in

the page. Cross-domain data accesses and code executions that do not fall into this expanded

concept of ownership will still need to go through the supervisor indirection.
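The expanded ownership test is again a one-line predicate (a sketch; as before, domain 0 is the OS and is handled separately):

    /* For a page owned by domain k, domains 1 through k all hold the
     * owner's privileges under the ring model. */
    static int ring_owner_privilege(int thread_domain, int owner_domain)
    {
        return thread_domain >= 1 && thread_domain <= owner_domain;
    }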

Hierarchical protection rings were introduced in Multics [SCS77]. Hardware metadata

could be set up to regulate the instructions and memory state that can be accessed at any given

ring and restrict the points of control flow between the various ring levels (i.e., call gates). A

challenge with the system is that a program needs to be aware of the hardware protection. For

example, when using function pointers the compiler or programmer needs to specify whether

the callee points to an intra-ring or inter-ring address.


6.4 M-cache Design

6.4.1 Area, Latency, and Energy

The M-cache area implications are a function of its size and organization. All known cache

and TLB optimizations apply (banking, associativity, etc.). Most importantly, the M-cache

intercepts only L1 misses, thereby reducing its impact on the processor critical path. While

dual-tagged, each tag array is single-ported. The virtual tag is accessed only from the processor

side (for checking permissions and filling M-cache entries) and the MPA is accessed only from

the network side (for snoop-based coherence). We used CACTI 6 [MBJ07] to estimate cycle

time for a 256-entry M-cache (4KB, 4-way, 16 bytes per entry) that provides good coverage

(fine-grain permissions for a 1MB working set). We estimate that in 65 nm high-performance

technology (ITRS HP), the M-cache can be indexed within 0.5ns (1 processor cycle using our

parameters). Furthermore, since the M-cache is located outside the processor core and is ac-

cessed in parallel with a long latency L1 miss we can trade latency for energy benefits. Ta-

ble 6.3 shows a few of the possible designs when locating the permissions cache on the L1 miss

path. If we use the low power transistor technology from ITRS (see Design-2 in Table 6.3),

the access latency increases by 4× (to ∼4 cycles) compared to high-performance transistor

technology (ITRS HP). However, there is a significant energy reduction: an M-cache designed

with ITRS [Ass07] LOP consumes 1% of the leakage power and 33% of the dynamic access

power of an M-cache employing ITRS HP transistors. On average (geometric mean), across the

PARSEC, SPLASH2, and Apache workloads, the M-cache with ITRS LOP consumes 0.029nJ,

0.037nJ, and 0.27nJ respectively per memory access; in comparison a TLB with the same num-

ber of entries consumes 3nJ.

Table 6.3: Design tradeoffs in M-cache design

  Design            Latency (cycles)   Dynamic energy (pJ)   Static power (mW)
  In-Processor      1                  30                    20
  Sentry Design-1   2                  21                    2.3
  Sentry Design-2   4                  9                     0.21


6.4.2 Operation

The energy required by the hardware cache used to maintain protection information needs to be

considered carefully. Word-granularity checks require larger permission caches and also require

the permissions M-cache to intervene on each memory access. In comparison, Sentry intervenes

only on L1 misses and its dynamic energy savings are directly proportional to the number of

accesses it has to check (L1 misses). Figure 6.7 shows the miss rate of workloads. Overall

the miss rate varies between 1.4%-16% for SPLASH2 and 0.1%-4% for PARSEC. Even

in commercial workloads like Apache with large working sets the number of L1 misses is 6×

lower than the total number of accesses.

(Figure: two bar charts of per-benchmark L1 miss rates. (a) SPLASH2 and Apache: Barnes, Cholesky, FFT, LU, MP3D, Ocean, Radix, Water, Apache; y-axis 0-18%. (b) PARSEC: Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Freqmine, Stream, Swaptions, Vips, x264; y-axis 0-5%. Y-axis: L1 miss rate % (misses/accesses).)

Y-axis: Average number of L1 misses per 100 accesses per thread.

Figure 6.7: L1 miss rate in applications.

The management costs associated with any hardware protection mechanism are the latency

of switching to the privilege level that can manipulate permissions and the cost of maintaining

the coherence of the cached copies of the metadata. The M-cache can be managed entirely

at user level. In addition, the metadata physical address (MPA) tag allows the M-cache to

exploit coherence to propagate metadata invalidations to remote processors. This simplifies

the software protocol required to manage the M-cache and improves performance compared to

existing protection mechanisms that use the interprocessor interrupt mechanism.

We compare the cost of changing the metadata associated with the M-cache against the cost

of manipulating page-attribute protection bits in the TLB.

(Figure: execution cycles (log scale, 1 to 10^5) for permission changes with 1, 8, and 16 threads, comparing TLB, M-cache 1, M-cache 16, M-cache 64, and M-cache Parallel 64.)

M-cache N measures execution cycles for changing permissions on N cache lines. The L1 handles stores in order. In M-cache Parallel 64, the L1 can sustain 64 concurrent misses.

Figure 6.8: TLB vs. M-cache

We set up a microbenchmark that creates K threads (varied from 1-16) on the machine; every 10,000 instructions a random

thread is picked to make permission changes. To test the TLB, we change permissions for a

page, and to test the M-cache we change permissions for N cache lines within a page. Overall,

modifying permissions with the M-cache is 10-100× faster than TLB shootdowns (see Fig-

ure 6.8). The dominant cost with the M-cache is that of purging the data from the L1 caches (see

Section 6.2.6). The routine needs to prefetch the data cache blocks in exclusive mode in order

to invalidate all L1s, and this typically results in coherence misses. In our design, each L1 cache

allows only one outstanding miss and hence the invalidation of each cache line directly appears

on the critical path. The latency of the permission change is directly proportional to the number

of cache blocks that are invalidated (M-cache 1, M-cache 16 and M-cache 64 bars). We also

evaluate performance when 64 outstanding L1 prefetches are allowed (M-cache Parallel 64)

and show that overlapping the latency of multiple misses is sufficient to significantly reduce

permission change cost.

Teller et al. [TKS88] discuss hardware support to keep TLBs coherent; more recently, con-

current with our work, UNITD [RLS10] explored the performance benefits of coherent TLBs.

Both these works mainly seek to reduce the overheads of conventional OS TLB management

routines while Sentry employs user-level software routines to manage the M-cache.


6.5 Experimental System

Our base hardware is a 16-core chip (2 GHz clock frequency), which includes private L1 in-

struction and data write-back caches (32KB, 4-way, 1 cycle) and a banked, shared L2 (8MB,

8-way, 30 cycle), with a memory latency of 300 cycles. The cores are connected to each other

using a 4 × 4 mesh network (3 cycle link latency, 64 byte links). The L2 is organized on one

side of this mesh and the cores interface with the 16-banked L2 over a 4×16 crossbar switch.

The L1s are kept coherent with the L2 using a MESI directory protocol. This coherence proto-

col is based on the SGI ORIGIN 3000 3-hop scheme with silent evictions. Our infrastructure

is based on the full-system GEMS-Simics framework. We faithfully emulate the Sentry hard-

ware/software interface and model the latency penalty of the software handlers from within the

simulator. We simulate all memory references made by the handler and use instruction counts

to estimate the latency of non-memory instructions.

6.6 Application 1: Compartmentalizing Apache

In this section, we use Sentry’s intra-process compartment protection model (see Section 6.3.3)

to enforce a safety contract between a web server (Apache) and its modules. Apache includes

support for a standard interface to aid module development; the “Apache Portable Runtime”

(APR) [10b] exports many services including support for memory management, access to

Apache’s core data structures (e.g., file stream), and access to intercept fields in the web re-

quest. Apache’s modules are typically run in the same address space as the core web server

in the interest of efficiency and the desire to maintain a familiar programming model (conven-

tional procedure calls between Apache and the module). A module therefore has uninhibited

access to the web server’s data structures, resulting in the system’s safety relying on the module

developers’ discipline.

Our goal is to (1) isolate each module's code and data in a separate domain and ensure that the APR interface is enforced, which protects the web server's data and ensures that modules can access it only through the exported routines; and (2) achieve this isolation with simple annotations in the source code, without requiring any restructuring of the source. While the definition


of a module may be abstract, here we use the term to refer to the collection of code that the

developers included to extend the functionality of the web server. To enable protection, Sentry

annotates the source to perform the following tasks: (1) specify the ownership (domain) of code

and data regions, (2) assign permissions to the various data regions for the different domains,

and (3) assign domains to preforked threads that wait for requests. To simplify our evaluation,

we set up the core Apache code (all folders other than module/) to directly execute in domain 1

and emulate all the actions required by domain 1 (those that would be provided in the form of

a library) for cross-domain calls from within the simulator. The only modifications required to

Apache’s source code were the instructions that set up the domains and permissions.

The modules we compartmentalized are mod_cache and mod_mem_cache, which work

in concert to provide a caching engine that stores pages in memory that were previously fetched

from disk. “mod cache” consists of two parts: (1) the module-Apache interface (mod_cache.c)

and (2) the cache implementation (cache_cache.c[.h], cache_hash.c[.h], and

cache_pqueue.c[.h]). “mod cache” also needs to interface with a storage engine,

for which we use mod_mem_cache (cache_storage.c, mod_mem_cache.c). We

compiled the modules into the core web server and compartmentalized the storage engine

(mod mem cache) into domain 2 and the cache engine (mod cache) into domain 3, respectively.

6.6.1 Compartmentalizing Code

Our primary goal was to enforce the APR interface and ensure that module domains cannot

call into non-APR functions. First, we compiled in the modules and did a static object code

analysis to determine the module boundaries. We then added routines to the core Apache web

server that (1) registered the APR code regions as owned by Apache and set up read-write

permissions for Apache (domain 1) and read-only permissions for the modules and (2) de-

clared the individual module code regions as owned by the appropriate domains. The modules

expose only a subset of the functions to other modules (e.g., the storage module exports the

cache_create_entity() to the cache module as read-only while it hides the internal

cleanup_cache_mem, which is used for internal memory management). The module’s en-

tire code is accessible to the Apache web server. Finally, the pre-forked worker threads are

started in domain 1. When the threads call into a module’s routines, the exception handler in


(Figure: requests per second (0-1600) for 160 and 320 clients, comparing baseline Apache, Apache with the storage and cache modules, Apache with a Sentry-protected storage module, and Apache with Sentry-protected storage and cache modules.)

Figure 6.9: Performance of Sentry-protected Apache

domain 1 transitions the thread to the appropriate domain if the call is made to the module’s

exported APR. Non-APR calls from the module would be caught by Sentry.

Since memory ownership is assigned at the page granularity, we have to ensure that code

from two different modules or the core web server is not linked into the same page. While

current compilers do not provide explicit support for page alignment, it is fortunate that our

compiler (gcc) allows code generated from an individual source file to remain contiguous in

the linked binary. Given the contiguity of each module’s code segment, we added appropriate

padding in the form of no-ops (asm volatile nops) to the end of the source (.h and .c

files) to page align each module.
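For illustration, an alternative to no-op padding (assuming GCC; the section name and linker-script fragment below are hypothetical) is to give each module a named text section and align it in the linker script:

    /* Tag every function of the storage module with a dedicated section. */
    #define MOD_MEM_CACHE_TEXT __attribute__((section(".text.mod_mem_cache")))

    MOD_MEM_CACHE_TEXT int cache_create_entity(void)
    {
        /* ... module code ... */
        return 0;
    }

    /* Linker-script fragment (hypothetical):
     *     . = ALIGN(4096);
     *     .text.mod_mem_cache : { *(.text.mod_mem_cache) }
     */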

6.6.2 Compartmentalizing Data

Assigning ownership to data regions proved to be a simpler task. The

core web server and modules all use the APR-provided memory allocator

(srclib/apr/memory/unix/apr_pools.c). With this allocator, it was possible

to set up different pools of memory regions and request allocation from a specific pool.

We specified separate pools for each of the domains: Apache web server (domain 1),

mod mem cache module (domain 2), and mod cache module (domain 3). We then assigned


ownership of the pool to the domain that it served. The allocator itself is part of the APR

interface (domain 1). The permissions rules we set up were (1) Apache’s core (domain 1)

can read/write all memory in the program, (2) each module can read/write any memory in

their pool, and (3) a module has read-only access to some specific variables exported by other

external modules — this is where fine-grain permission assignment was useful. In some cases,

a variable exported by a module (read-only for remote domains) was located in the same

page as a local variable (no permissions). For example, the cache object in the storage engine

(mem_cache_object) contained two different sets of fields: those that described the page being

cached (e.g., html headers) and those that described the actual storage (e.g., position in the

cache). The former needs to be readable from the cache engine, while the latter should be

inaccessible.

Structure fields also posed a challenge. Consider two collocated words A and B, with permission specifications read-only and read-write, respectively. Since A and B fall in the same cache line, our runtime system would downgrade B to read-only as well and then deal with false permission exceptions. Interestingly, this case did not occur often, since developers seem to collocate declarations of variables with similar semantics. In the few cases where this was an issue, we were able to reorganize the fields in the structure after source inspection. This ensured that variables with similar permission specifications were allocated to the same cache line.
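An illustrative reorganization follows; the field names are hypothetical, loosely modeled on the mem_cache_object discussion above, and GCC's aligned attribute is assumed.

    #define CACHE_LINE 64

    struct mem_cache_object_reorg {
        /* Exported read-only to the cache engine: describes the cached page. */
        struct {
            const char *html_headers;
            unsigned long body_len;
        } exported __attribute__((aligned(CACHE_LINE)));

        /* Private to the storage engine: describes the storage itself. */
        struct {
            int cache_position;
            int refcount;
        } internal __attribute__((aligned(CACHE_LINE)));
    };

Each nested group starts on its own cache line, so the two permission classes never share a line and no false permission exceptions arise.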

6.6.3 Performance Results

We now estimate the overheads of compartmentalizing the modules in Apache. In this exper-

iment, we pre-compiled the modules into the core web server, disabled logging, and used POSIX mutexes for synchronization. The specific configuration we used is --with-mpm=worker --enable-proxy --disable-logio --enable-cgi --enable-cgid --enable-suexec --enable-cache. We configured

the Surge client for two different loads: 10 or 20 threads per processor (total of 160 or 320

threads) and set the think time to 0ms. Our total webpage database size was 1GB with the

file sizes following a Zipf distribution (maximum file size of 10 MB). We warmed up the soft-

ware data structures (e.g., OS buffer cache) for 50,000 requests and measured performance for

10,000 requests. Sentry permits incremental adoption of compartment-based protection. In our


experiments we protect Apache's modules in two stages: we first moved mod_mem_cache

into a separate domain leaving mod_cache within domain 1 (along with the core webserver).

Subsequently, we moved both mod_mem_cache and mod_cache out of domain 1, each into

a separate domain.

Figure 6.9 shows the relative performance. Sentry protection imposes an overhead of ≈13% when compartmentalizing both mod_cache and mod_mem_cache, and ≈6% when compartmentalizing just mod_mem_cache. The primary overheads with Sentry are the exe-

cution of the handlers required to set up data permissions in the M-cache and the indirection

through domain 1 needed for cross-domain calls. Approximately 20% of the functions invoked

by the module are APR routines, which involve a domain crossing. We believe the overhead is

acceptable given the safety guarantees and the level of programmer effort needed.

6.6.4 Process-based Protection vs. Sentry

To evaluate the protection schemes afforded by OS process-based protection, we develop

mod_case (a module that changes the case of ASCII characters in the webpage) using process-

based protection and compare it against Sentry’s compartment model. To employ process-

based protection we needed to make significant changes to the programming model and reor-

ganize the code and data. We had to implement a mod_case-specific shim (implemented as an Apache module) that communicates with mod_case through IPC (inter-process communication), requiring a shared-memory segment. All interaction between Apache and mod_case passes through the shim, which converts the function calls between Apache and mod_case into

IPC. Although Apache’s APR helps with setting up shared-memory segments between the shim

and the mod_case process, we still had to write explicit memory rou-

tines (for each interface function) to copy from and to the shared argument passing region. The

conversion of even this simple module to use IPC was frustrated by the inability to pass data

pointers directly between the processes and the need to use a specific interface between the

shim and the module itself. As Figure 6.10 shows, the process-based protection adds significant

overhead (33%) compared to Sentry (7%). To summarize, we believe the programming rigidity,

need to customize the interface for individual modules, and performance overheads are major

barriers to the adoption of process-based protection.


We briefly compare Sentry's cross-domain calls against existing IPC mechanisms. Despite considerable work to reduce the cost of IPC [Lie93; KEG97; BAL90], it is still about 10³× slower than an intra-process function call. In contrast, Sentry focuses on function calls between protection

domains established within a process. They have lower overheads since they share the OS

process context and crossing domains does not involve address-space switches. The indirection

through the user-level supervisor domain 1 adds 2 call instructions (to jump from the caller to domain 1 and then into the callee) and 9 instructions to align the parameters to a cache line boundary

and set appropriate permissions to the stack for the callee domain (i.e., permissions to access

only the parameters and its own activation records). The overhead at runtime varies between 20

and 30 cycles (compared to 5 cycles required for a typical function call).

(Figure: requests per second (0-1200) for three bars: Base, Process-Protection, and Sentry-Protection. Performance normalized to mod_case with no protection.)

Figure 6.10: Comparing Sentry against process-based protection

6.6.5 Lightweight Remote Procedure Call

RPC is a form of inter-process communication where the caller process invokes a handler ex-

ported by a callee process. We first describe the scheme used to implement RPC in current

OS kernels (e.g., Solaris): (1) The caller copies arguments into a memory buffer and generates

a trap into the OS. (2) The OS copies the buffer content into the address space of the callee.


It then context switches into the callee. (3) To return, the callee traps into the kernel, which

unblocks the caller process. The main overhead is due to the four privilege switches (two from

user→kernel space and two from kernel→user space) and the memory copy required. Earlier

proposals [BAL90] have optimized RPC by using a shared memory segment (i.e., shmem()) to

directly pass data between the caller and callee process. The minimum size of the shared mem-

ory segment is a page and a copy is still needed from the argument location to the shared page.

More recently, Mondrian [WCA02] postulated that word-granularity permission and translation

can eliminate the argument passing region.

We use the M-cache to provide fine-grain access control of the locations that need to be

passed from the caller to the callee process and eliminate the copying entirely. The caller

process must align the arguments to a cache line boundary and request the kernel to map the

argument memory region into the callee’s address space (requiring a user-kernel and kernel-

user crossing). We experiment with a client that makes RPC calls to a server periodically (every

30,000 cycles). The server is passed a 2 KB random alphabet string that it reverses while switch-

ing the case of characters. We compare the cost of argument passing using several approaches

— Sentry, the implementation in the Solaris RPC library, and an optimized implementation that

uses page sharing [BAL90]. Our results indicate that completely eliminating the copying pro-

vides a (∼9–10×) speedup compared to the optimized page sharing approach. The unoptimized

RPC implementation has 10³-10⁴× higher latency.

6.7 Application 2: Sentry-based Watchpoint Debugger

We developed Sentry-Watcher, a C library that applications call into for watchpoint support.

A watchpoint is usually set by the programmer to monitor a specific region of the memory

and when an access occurs to this region, it raises an exception. Sentry-Watcher employs

the one-domain per process model and operates in the application mode (permissions meta-

data and M-cache managed by a library in the application space). It exports a simple inter-

face: insert_watchpoint(), delete_watchpoint(), and set_handler(), with

which the programmer can specify the range to be monitored, the type of accesses to check, and

the exception handler.
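A hypothetical use of this interface follows; the exact signatures are not given in the text, so the prototypes and the access-type encoding below are assumptions.

    #include <stddef.h>

    extern int  insert_watchpoint(void *addr, size_t len, int access_type);
    extern int  delete_watchpoint(void *addr, size_t len);
    extern void set_handler(void (*handler)(void *fault_addr, int access_type));

    static void on_watchpoint(void *fault_addr, int access_type)
    {
        /* e.g., log the faulting address and access type, or halt */
    }

    void watch_region(char *buf, size_t n)
    {
        set_handler(on_watchpoint);
        insert_watchpoint(buf, n, /* WATCH_READWRITE: assumed encoding */ 3);
    }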


Apart from fine granularity watchpoints, there are three additional benefits with Sentry-

Watcher: (1) it supports flexible thread-specific monitoring, allowing different threads to set up

different permissions to the same location[4], (2) it supports multi-threaded code efficiently since

the hardware watchlist can be propagated in a consistent manner across multiple processors at

low overhead (also supported by MemTracker [VRS07]), and (3) it effectively virtualizes the

watchpoint metadata and allows an unbounded number of watchpoints.

[4] Since all threads run in the same domain, this would require a flush of the M-cache on a thread switch.

Table 6.4: Application Characteristics

  Benchmark   Mallocs/1s   Heap Size   Heap Access %   Bug Type
  BC          810K         60KB        75%             BO
  Ncom        2            8B          0.6%            SS
  Gzip        0            0           0%              BO, IV
  GO          10           294B        1.9%            BO
  Man         350K         210KB       89%             BO, SS
  Poly        890          11KB        27%             BO
  Squid       178K         900KB       99.5%           ML

B: Bytes, KB: Kilobytes, MB: Megabytes. BO: Buffer Overflow, SS: Stack Smash, ML: Memory Leak, IV: Invariant Violation. Squid is multi-threaded.

(Figure: normalized slowdown (application = 1) of the M-cache (256 entries) vs. Discover for BC, Ncom, GO, Gzip1, Gzip2, Man, Poly, and Squid; Discover's slowdowns are annotated as 4x, 75x, 50x, 18x, 35x, 65x, and 6x, with N/A for Squid.)

N/A: Discover is not compatible with multi-threaded code. Gzip1 detects buffer-overflow bugs; Gzip2 detects memory leak, buffer overflow, and stack smash bugs.

Figure 6.11: Sentry-Watcher vs. Binary instrumentation

6.7.1 Debugging Examples

Generic unbounded watching of memory can help widen the scope of debugging. Here, we use

the system to detect four types of bugs: (1) Buffer Overflow: this occurs when a process accesses data beyond the allocated region. To detect it, we pad all heap buffers with an additional

cache line and watch the padded region. (2) Stack Smashing — A specific case of buffer over-

flow where the overflow occurs on a stack variable and manages to modify the return address;

we watch the return addresses on the stack. Dummy arguments need to be passed to the func-

tions to set up the appropriate padding and eliminate false positives. (3) Memory Leak — We

monitor all heap allocated objects and update a per-location timestamp on every access. Any

location with a sufficiently old access timestamp is classified as a potential memory leak. (4)

Invariant Violation—Monitor application-specific variables and check specified assertions. We

demonstrate the prototype on the benchmark suite provided with iWatcher [ZQL04] — Table 6.4

describes the benchmark properties. We compare the performance of Sentry-Watcher against

Discover (a SPARC binary instrumentation tool [10c]). Sentry-Watcher is evaluated on our sim-

ulator. Discover is set up on the Sun T1000. Compared to the binary instrumentation technique,

Sentry-Watcher provides 6–75× speedup. Sentry still incurs noticeable overhead compared to

the original applications—varying between 3–50% for most applications. At worst, we en-

counter up to ∼2× overhead on the memory-leak detector experiments, which instrument all

heap accesses (see Squid in Figure 6.11).
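As an illustration of the buffer-overflow scheme described above, the following C sketch pads each heap allocation with one extra cache line and watches only the pad. The wrapper name, the 64-byte line size, and the watchpoint signature are assumptions; watching the padded region is the technique the text describes.

#include <stdlib.h>

#define LINE 64  /* assumed L1 cache-line size */
#define WP_READ  0x1
#define WP_WRITE 0x2

extern void insert_watchpoint(void *start, size_t len, int access_type);

void *wp_malloc(size_t n)
{
    /* Round the request up to a line boundary, then add one pad line. */
    size_t rounded = (n + LINE - 1) & ~(size_t)(LINE - 1);
    char *p = aligned_alloc(LINE, rounded + LINE);
    if (p == NULL)
        return NULL;
    /* Any access that reaches the pad line is an overflow of this buffer. */
    insert_watchpoint(p + rounded, LINE, WP_READ | WP_WRITE);
    return p;
}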

6.7.2 Comparison with Other Hardware

To understand the performance overheads of Sentry-Watcher, we further compared it against

the performance of MemTracker [VRS07], a hardware mechanism tuned for debugging, and

Mondrian [WCA02], a fine- and variable-grain flexible protection mechanism placed within the

processor pipeline.

In MemTracker, a software-programmable hardware controller maintains the metadata,

fetching and operating on it in parallel with the data access. To emulate the controller’s opera-

tions, we assign a 0 cycle penalty for all operations other than the initial setup of the metadata.

This is an optimistic assumption since typically metadata misses in MemTracker also add to the

overhead. In Mondrian, setting up or changing the metadata requires privilege mode switches

into and out of the operating system kernel. To estimate the cost of a fast privilege switch, we

measure the latency of an optimized low-overhead call (e.g., gethrtime()) on three widely

available microarchitectures (SPARC, x86, and PowerPC). We observed 280 cycles (SPARC),


312 cycles (x86), and 450 cycles (PowerPC).

[Figure: performance overhead (%) of MemTracker, Sentry, and Mondrian on perlbench, mcf, astar, gobmk, xalanc, h264, ocean, water, datamine, tsp, and mp3d. Effect of software handlers and cost of manipulating metadata in the OS.]

Figure 6.12: M-cache vs. other hardware-based watchpoints

To emulate Mondrian, we add a penalty of 300

cycles to every permission exception and metadata modification; we assign a 0 cycle penalty to

metadata operations. To limit the number of variables, we keep the metadata cache size fixed

across all the systems (256 entries). For this comparison, we implemented the debugger tool

discussed by the MemTracker paper [VRS07], which checks for heap errors, buffer overflow,

and stack smashing. The workloads we use are from the SPECCPU 2006 suite [Sta06]. We also

include the SPLASH2 benchmarks to verify that our findings are valid for multi-threaded work-

loads. Figure 6.12 shows that the overhead of Sentry-Watcher averages 6.2%, compared to the

idealized MemTracker’s 2.6% and Mondrian’s 14%. Since MemTracker requires a hardware

controller to fetch and manipulate the metadata while we leave all such operations in software,

we believe that our system is more complexity-effective.

6.8 Extensions for address translation

Sentry’s M-cache entries can also include other metadata. The M-cache associates software

permissions metadata with cache-line granularity address regions. We discussed another form

of software metadata in Section 4.6.3, which associated fine-grain translation information with

cache-line granularity regions. The key difference between the metadata cache structures de-

signed in Section 4.6.3 and here is the hardware interpretation of the metadata. In the case of the


fine-grain translation, hardware uses the metadata to isolate data blocks in transactional mem-

ory and redirect misses and writebacks. Here, hardware simply checks the permission bitmap

against the type of miss and raises an exception. In both the metadata caches, the hit path for

the cache is left untouched; astute readers will also observe the similarities of the organization

and tag design of both the metadata caches. This is encouraging from a hardware complexity

perspective since these designs can be essentially fused, giving rise to a single general-purpose

metadata cache, which can store translation and permissions metadata. Note that the data array

design does not need to change; it only needs to become wider to accommodate the additional

metadata.
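A minimal sketch of what a fused metadata-cache entry might look like follows; the field names and widths are illustrative assumptions, chosen only to show that the two kinds of metadata share one tag and a widened data array.

#include <stdint.h>

struct mcache_entry {
    uint64_t region_tag;    /* cache-line-granularity region address */
    uint8_t  valid;
    /* Permissions metadata (this chapter): a bitmap that hardware checks
       against the type of L1 miss. */
    uint8_t  perm_bits;
    /* Translation metadata (Section 4.6.3): used to redirect misses and
       writebacks for isolated transactional data. */
    uint64_t redirect_addr;
};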

6.9 Discussion

Our results demonstrate that fine-grain auxiliary access control can enable lightweight protec-

tion models and can support watchpoint-based debuggers. However, these results do not reflect

on the tradeoff between programmability and granularity of protection. Word-granularity pro-

tection provides maximal flexibility and requires few program changes. Since a word is the

smallest granularity of datum in the system, it enables application developers to specify permis-

sions without having to worry about collocated words. However, there is inherent complexity

associated with intercepting word-granularity accesses. Software-based systems require instru-

mentation on every access, which adds significant performance overhead. Word-granularity

accesses need to be intercepted within the processor itself and require modifications to the crit-

ical processor pipeline.

We increased the basic granularity of protection to simplify the implementation but main-

tained it at L1 cache line granularity (typically 8–16 words) to provide flexibility to the appli-

cation developers. We have demonstrated that a cache-line granularity interception provides

significant energy benefits and reduces complexity of the access control hardware. We also

ported a few modules in a widely-used webserver to use intra-application protection and dis-

cussed the challenges when using our framework. The two key challenges we had to address

were the memory allocator and structure fields. Overall, we find our intermediate approach quite

promising and think it makes the appropriate tradeoff between implementation complexity and

protection requirements.


Chapter 7

Summary and Future Work

In this chapter, I summarize the contributions of the dissertation and present directions for future

work. I also reflect on my research and comment on it with the benefit of hindsight.

7.1 Summary

Multicore processors are the only game in town and are an inflection point for software

development [Pat10]. They have exacerbated the challenges for software

development—algorithms, languages, and operating systems have all been forced to adopt par-

allelism, which increases the likelihood of concurrency bugs. In this dissertation, we have

utilized the transistors afforded by Moore’s law to develop hardware mechanisms that aid soft-

ware development. The dissertation focused on the memory system, which plays an important

role in a program’s lifetime and contains a wealth of information. I proposed hardware mech-

anisms that shed light on the memory system and expose information that software can use for

both self-diagnosis and control of data flow using higher level semantics, such as transactions.

Specifically, I developed mechanisms for Monitoring, Isolation, and Protection of memory.

These mechanisms have been designed to support fine-grain cache block regions (typically 10s

of bytes), which simplifies the interface with software. A key novelty is the use of cache coher-

ence mechanisms and caches to implement the proposed hardware mechanisms in a lightweight

manner. Modern memory hierarchies already include the requisite mechanisms for managing


data and we demonstrate that few extensions are needed to enable software to track and control

data transfers in the memory hierarchy. Here, we summarize the contribution of each mecha-

nism.

7.1.1 Monitoring

Many program analysis tasks required for debugging, performance optimization, and specu-

lative threading need to track a program’s accesses dynamically and insert instrumentation to

record information about the accesses. We investigated hardware-based monitoring to help ex-

pose the data movement in the cache hierarchy to software and reduce this overhead. We track

the coherence events that occur in shared memory and enable software to track the various

data accesses in the system. In Chapter 3, we proposed Alert-On-Update (AOU), a lightweight

mechanism which triggers a software handler on coherence events and provides information

about the event such as address and type. Specifically, we monitor lines marked by software in

the L1 cache and trigger a handler when remote (or local) events occur on such lines. Software

controls the use of and reaction to the event, and can relate the event information to program

semantics in various ways.
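The following C sketch illustrates the software side of this interaction. The intrinsic names follow the ALoad event of Appendix B, but their exact form and the handler-registration call are assumptions for exposition.

#include <stdint.h>

/* Assumed interface: aload() performs a load and sets the line's Alert
   bit (cf. the ALoad event in Table B.2); set_alert_handler() registers
   the software routine run when an alert fires. Both are hypothetical
   wrappers, not actual ISA names. */
extern void set_alert_handler(void (*h)(void *addr, int event_type));
extern uint64_t aload(volatile uint64_t *addr);

volatile uint64_t tx_status;  /* e.g., an STM transaction's status word */

static void on_alert(void *addr, int event_type)
{
    (void)addr; (void)event_type;
    /* A remote write to the marked line means this transaction may have
       been aborted; software chooses how to react. */
}

void monitor_status_word(void)
{
    set_alert_handler(on_alert);
    (void)aload(&tx_status);  /* remote writes to this line now alert us */
}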

Using AOU, we proposed the first hardware-accelerated software-based transactional mem-

ory system (Section 3.5), which boosts the performance of software transactions while main-

taining their policy flexibility. We demonstrated that AOU-acceleration can boost the perfor-

mance of STMs by 1.4-2×. In this dissertation, we also demonstrated the use of AOU in thread

synchronization, accelerating locks, and detecting concurrency bugs.

7.1.2 Isolation

In Chapter 4, we introduced the notion of data isolation, which refers to a thread’s capabil-

ity to hide modifications from other threads of the application and then expose or revoke the

modifications in bulk.

Our hardware mechanism, Programmable-Data-Isolation, is based on the observation that

the multiple levels of caching in the memory system can be designed to hold the different

versions of data. We adapt the processor private L1 caches to hold the thread-private specula-


tive version, while maintaining the globally consistent data version at the shared L2. Multiple

processors can concurrently isolate and read the same location. This scheme supports low over-

head commit and revocation of isolated data. We propose the notion of lazy coherence, in

which coherence messages are eagerly sent out at each memory operation (speculative or non-

speculative) but the coherence state changes are performed lazily (under software control) to

enable isolation. The overall coherence state design for isolation is independent of the proto-

col type and we develop both snoopy and directory protocols. The dissertation also explores

the tradeoffs between the use of hardware and software based approaches to virtualizing and

extending isolated state beyond processor caches.
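As a rough illustration, the sketch below shows how software might drive this mechanism. TLoad and TStore correspond to the processor events in Appendix B; the C wrappers and the bulk commit/revoke calls are assumed names, not the actual runtime interface.

#include <stdint.h>

/* Assumed wrappers for the TLoad/TStore processor events of Appendix B,
   plus bulk commit/revoke calls; these names are illustrative. */
extern uint64_t tload(volatile uint64_t *addr);            /* isolated read */
extern void tstore(volatile uint64_t *addr, uint64_t val); /* hidden write  */
extern void commit_isolated(void);   /* expose buffered writes in bulk  */
extern void revoke_isolated(void);   /* discard buffered writes in bulk */

void speculative_update(volatile uint64_t *counter, int success)
{
    uint64_t v = tload(counter);  /* several threads may isolate and read
                                     the same line concurrently */
    tstore(counter, v + 1);       /* new version stays in the L1; the L2
                                     keeps the globally consistent copy */
    if (success)
        commit_isolated();
    else
        revoke_isolated();
}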

Our application case study for isolation is transactional memory. A noteworthy feature of

programmable data isolation is that it permits multiple new versions of a location, allowing

different software tasks to isolate the same location concurrently. We used this feature to ac-

celerate two different TM systems: RTM, a hardware-software TM that handles large and long

transactions entirely in software, and FlexTM, a scalable flexible Lazy HTM system that does

not require centralized arbitration. While RTM accelerates STM systems by ∼2.5×, it still suffered performance overheads due to the software bookkeeping needed to interact between hardware and software transactions. FlexTM includes a more streamlined runtime system and improves performance by ∼2× relative to RTM. We demonstrated that FlexTM's distributed commit achieves ∼25% better performance than an aggressive hardware-based centralized arbiter

design (which can handle an unbounded number of parallel commits).

Both RTM and FlexTM support flexible software-defined policies for contention manage-

ment and conflict resolution laziness. In Chapter 5, using FlexTM, we investigate the interaction

between conflict management and resolution for different types of conflict scenarios and make

recommendations on what policies need to be adopted. Overall, Lazy outperformed Eager by up

to 2× with an average of 40%. In some cases Lazy’s wasted work limits performance improve-

ment to 20%. We evaluated an intermediate Mixed approach that limits the negative effects of

failed speculation and improves performance by up to 40% over Lazy.


7.1.3 Protection

In Chapter 6, we developed Sentry, a flexible memory protection mechanism that helps to im-

prove the reliability and debugging of modular software. Sentry allows software to enforce

protection among a program’s modules. Sentry ensures the integrity of a module’s private data

(no external accesses), the safety of inter-module shared data (enforce the permissions speci-

fied), and adherence to the module’s interface (regulate function invocation). The key hardware

innovation is a lightweight permissions cache that only intercepts L1 cache misses and reuses

the coherence states to implicitly validate L1 cache hits. This design is based on the observa-

tion that the permissions metadata do not change often and hence can be checked infrequently,

possibly only when they change. We use the fact that a location is in the cache to indicate that

its permissions have not changed and elide checks for the cache hits.
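In pseudocode terms, the check placement looks roughly like the C sketch below; the helper names are assumptions, and the real check is performed by hardware on the miss path rather than by software.

#include <stdbool.h>
#include <stdint.h>

#define PERM_R 0x1
#define PERM_W 0x2

/* Assumed helpers modeling the hardware structures. */
extern bool l1_hit(uint64_t line_addr);
extern uint8_t mcache_perms(uint64_t line_addr);
extern void raise_protection_exception(uint64_t line_addr);

/* Returns true if the access may proceed. */
bool sentry_check(uint64_t line_addr, bool is_write)
{
    if (l1_hit(line_addr))
        return true;               /* hit path untouched: presence in the
                                      L1 implies permissions are unchanged */
    uint8_t need = is_write ? PERM_W : PERM_R;
    if ((mcache_perms(line_addr) & need) == 0) {   /* checked on misses */
        raise_protection_exception(line_addr);
        return false;
    }
    return true;
}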

Sentry provides a protection framework using virtual memory tagging that is completely

subordinate to the existing OS-TLB framework. It realizes intra-application protection mod-

els with predominantly user-level software, requires very little OS intervention, and makes no

changes to process-based protection. We developed an intra-application compartment protec-

tion model and used it to isolate the modules of a popular web server application (Apache),

thereby protecting the core web server from buggy modules. Our evaluation demonstrated

that Apache’s module interface can be enforced at low overhead (∼13%), with few application

annotations, and in an incremental fashion. Finally, we also investigated the use of memory

protection in watchpoint-based debuggers.

7.2 Future Work

Given the broad range of problems that we tackled in this dissertation, we summarize future

work on the two main applications we explored: (1) transactional memory and (2) memory protection.


7.2.1 Transactional Memory

In this dissertation, we made a case for making TM implementations more flexible to deal

with the various conflict scenarios that arise in applications. We explored policy decisions in

the context of FlexTM. The hardware-software TM, RTM, is more appealing since it requires

changes to only the on-chip cache hierarchy and handles large and long transactions entirely in

software. Future work could explore conflict scenarios in more detail with the RTM system,

which would introduce new challenges such as the conflict between software and hardware

transactions. We would need to port benchmark suites (e.g., STAMP in Appendix A) to RTM.

This task could be made considerably easier by the recently released Intel STM compiler [Int]. As part of this exercise, one of the key challenges that needs to be addressed is the

cache-line alignment restriction.

A general limitation of HTMs that buffer speculative data in the cache is that they do not

allow speculatively written data and non-speculatively written data to reside in the same cache

line. Most HTMs track speculative locations at the granularity of cache lines and throw an

exception when a speculatively written cache line is subsequently written non-speculatively.

This could give rise to spurious aborts when two program level objects that do not have a race

happen to be co-located. We would need extensions to the compiler and memory allocator

to ensure this limitation is invisible to the programmer. An alternative would be to make the

non-speculative stores write-through to the shared cache, but the performance effects of such

write-through operations need to be investigated.

A key design decision that we made early on was to not consider support for closed or

open nesting. It was not clear to us then whether any form of support other than simple flat-

tening was needed. Furthermore, a workable solution based entirely on software has been pro-

posed [LeM08]. One could possibly revisit this assumption in light of new applications and

study the support needed for constructs like nested parallelism [VWA09]. The study would

have to also consider performance implications for the various approaches to nesting. Perhaps

the biggest challenge in hardware support is that each nesting level may need separate support

for conflict detection and versioning.

Finally, isolation is a key property which enables transactions to run independently of other


threads in the program. This feature is useful in applications like testing, debugging, and sand-

boxing. Future work could explore the interface that TM runtime systems need to support to

enable these applications. We may also need to extend the underlying TM mechanism itself

to support these new uses. For example, software testing requires the isolation mechanism to

enable software to query both the new and old versions of a location.

7.2.2 Fine-grain memory protection

In Chapter 6, we demonstrated that access control can be implemented efficiently if we take

into consideration that the permissions of many locations do not change frequently. Based on

this observation, we exploited cache states to implicitly check accesses and elide the actual per-

mission checks for the majority of the accesses. The observation about the temporal stability

of permissions can also be used to drive a software virtual-machine-based implementation of

Sentry. A virtual-machine-based design is attractive since it requires few changes in the appli-

cation and no changes to the hardware. It would also allow the development of library support

and applications before the Sentry hardware is available.

The key challenge in software-based implementations is the performance overhead of in-

strumentation that is needed on every access. This overhead can be elided by using hardware-

based access control available on current microprocessors. Current microprocessors provide

support in the form of a small number of watchpoint registers and coarse-grain page protection.

The virtual machine could dynamically utilize the hardware support for frequently and recently

accessed data where possible, while inserting run time checks for other data in the application.

For example, if all entries in a large array had the same permissions, then we can protect it using

page protection and elide instrumentation for accesses to the array. We could also dynamically

detect the most frequently used words and assign them watchpoint registers [LH10].
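For the uniform-permissions case, the standard mprotect() system call already suffices; the sketch below shows the page-granularity fallback for a read-only array (error handling and the SIGSEGV handler that would forward violations to the runtime are omitted).

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Make a large, uniformly-permissioned array read-only at page
   granularity, eliding per-access instrumentation for it. */
int protect_array_read_only(void *array, size_t bytes)
{
    uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)array & ~(page - 1);
    uintptr_t end   = ((uintptr_t)array + bytes + page - 1) & ~(page - 1);
    /* Reads run at full speed; writes fault and can be forwarded to the
       Sentry runtime by a SIGSEGV handler (omitted here). */
    return mprotect((void *)start, end - start, PROT_READ);
}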

As part of Sentry’s design, we presented four different protection models and demonstrated

Sentry on a few modules of the Apache webserver. Future work could investigate the pro-

tection models in a more formal setting and investigate more applications; an attractive study

could be the Firefox web browser. Languages such as C# and Java also provide type informa-

tion and one could potentially investigate techniques to extract protection models, and insert


permission annotations in an automated fashion. Part of this project could also devise new

pragmas and annotations to allow programmers to indicate permissions with higher level spec-

ifications [WSC10].

7.3 Reflections

In this section, I reflect on my dissertation research with the benefit of hindsight, experience of

working with memory hierarchies for over five years, and the freedom to state my opinion.

7.3.1 Future of Transactional Memory

Depending on the researcher, TM is either nothing more than a research toy [CBM08] or something that could potentially ease the task of writing parallel programs. Why the stark difference of opinion among otherwise reasonable academic researchers? Well, it depends on how you view the experiments

set up by the different research groups or in Mark Moir’s words — All short statements about

TM are wrong! [Moi10]. Concurrency control mechanisms in general are notoriously hard to

evaluate due to the influence of subtle interactions between threads; speculative techniques in

TM only add further pandemonium. As we showed in Chapter 5, introducing a simple backoff

mechanism on a conflict can change the timing of events and lead to up to a 10× improvement or loss in throughput.

TM evaluation has been largely driven by simple benchmarks and small application kernels.

In my opinion, this does not necessarily prove or disprove anything about the usefulness of TM

as the impact on a real application could possibly be minimal due to the fraction of the appli-

cation using transactions. Most researchers do understand that software-only TMs will suffer

from inherent overheads due to the instrumentation needed. However, STMs have reasonable

scalability and it may be possible to use extra cores in the system to improve programmability at

the expense of reduced performance efficiency. More work is needed to convince ourselves that

TM improves programmability. The appearance of a somewhat open standard [ACC11], mod-

ern compilers [Int], spectrum of runtime systems [10a], and system support [PHR09] will all

aid in getting TM into mainstream applications and investigating the programmability question.


Overall, in my opinion TM provides a clean atomic abstraction that can help programmers

avoid the difficult tradeoffs associated with locking conventions and avoid the possibility of

deadlock. In many cases, programmers appear to commit fewer mistakes with TM than they do

with locks. TM also allows the implementation to transparently employ speculative execution

to concurrently execute atomic actions and recover performance, when the conditions are right.

I see best-effort hardware support for TMs as being fundamental to pave the way for TM

adoption. HTMs help in tackling the uneasy tradeoffs among the scalability, latency, and live-

ness of many existing parallel programs. I am not convinced that we need anything more than

best-effort HTMs, given that most transactions tend to be small in current applications and that

there is no clear usage case for unbounded transactions in general. Best-effort HTM transac-

tions clearly help with applications such as concurrent data structures, memory allocators, and

lock elision; we only need a real system to ensure the benefits continue to hold in the real world.

It has the potential to simplify the writing and debugging of these complex parallel programs,

while improving performance in many cases.

Unfortunately, the first iteration of HTMs, ROCK, has proven to be too frustrating, with

transaction failures caused by low-level hardware conditions, which are difficult to diagnose

and even harder to resolve in a satisfactory manner. The lack of clear feedback to software has

also not helped. Going forward, I think that best-effort HTMs need to make some guarantees

about the ultimate progress of small transactions, similar to the compare-and-swap instruction.

To encourage software developers, best-effort HTMs need to streamline transactions and im-

prove performance over STMs as much as possible by providing support for tasks like register

checkpointing, even though it can be performed in software. Perhaps most importantly, best-

effort HTMs need to recognize the value of flexibility in conflict management. As we showed

in Chapter 5, to ensure scalability and freedom from pathological conditions, TM systems need

to delay choosing the winner in the event of a conflict and even then carefully make the choice.

Policies like eager requestor-wins conflict management, while directly compatible with cache

coherence, introduce significant performance problems. Care needs to be taken that we do not

hardwire policies in the HTM that cause poor performance and lead to software developers

blaming TM in general.

Finally, a key question that needs to be answered is what should best-effort HTMs look


like. While everyone agrees that it should involve minimal modifications to existing hardware,

there is less consensus on what the support should look like. Many research HTM systems (in-

cluding our own) have recognized that conflict detection constitutes the dominant overhead in

STM systems. This suggests that HTM systems should begin by including support for tracking

accesses in a transaction and detect conflicts against remote writes in the system. I particularly

think that we should avoid support for versioning in a best-effort TM because (1) it does not

constitute the dominant overhead in TMs, (2) supporting lazy conflict resolution would require versioning changes that alter the invariants of a traditional coherence protocol, and

(3) eager versioning can only support eager conflict resolution, which could introduce perfor-

mance pathologies. We demonstrated the value of decoupling conflict detection from resolution

in Chapter 5; we believe supporting flexible conflict resolution is an important goal.

Overall, I think that we should start with a system that includes alert-on-update (or bloom-

filter based access signatures) for detecting conflicts and adds conflict-summary-tables (Sec-

tion 4.3.1 in Chapter 4) to provide more information to software about conflicts. This would

allow the system to support both eager and lazy conflict resolution and allow flexible con-

tention managers to deal with pathological scenarios in applications. This design requires min-

imal changes to the coherence protocol and can be implemented with extra metadata bits at the

cores.

7.3.2 Which one of your hardware mechanisms holds the most promise?

In computer architecture, it has become increasingly hard to directly influence real micropro-

cessors due to their general complexity. Industry vendors need to be assured of many different

aspects and application cases, before they can incorporate “your favorite idea” in a future prod-

uct. Given this status, over the years, I have tried to be mindful of design complexity when

working on my research ideas. I first invented Alert-On-Update back in 2005 — it was a design

that was born out of the need for a way to let a transaction in an STM system know when it has

been aborted. All AOU requires is a single bit per cache-line (in some cases we can make do

with one bit for the entire cache) and a few status registers. It requires no changes to the cache

coherence protocol itself and our design only detects events at the cores in the coherence proto-


col. I see Alert-On-Update as probably the mechanism that holds the most promise and offers

the potential for impacting industry and academia.

AOU started with the simple question: “What could software do, if it were aware of the

coherence events?” Over the years, I have been pleasantly surprised at the number of possible

answers to this question: we have used AOU to speed up multiple TM systems, improve locks,

and enable watchpoint-based debugging. Other researchers have used extended versions of

AOU, which can also permit software to control coherence response messages, to enable thread-

speculation and software-based replay tools [NaG09]. The key to AOU’s usability has been the

design decision not to fix the semantics for an alert in hardware. The response actions to an

alert are entirely controlled in software, which significantly improves its versatility. AOU also

had well-defined behavior when interacting with regular memory operations and this enabled

us to incrementally incorporate it in applications.

The key challenge left in moving AOU into mainstream processors is virtualization. Per-

mitting multiple applications to simultaneously use the cache requires the addition of extra

metadata to the cache. To provide guarantees about the minimum number of AOU lines that

will not experience overflows requires additional logic in the cache replacement algorithms. It

would also be useful to investigate generalized techniques to handle cache overflows; in the

dissertation (Section 3.5) we investigated the use of software version numbers to handle missed

alerts in STMs.

7.3.3 How do you know your cache protocols work?

Finally, I reflect on the development and evaluation process used in my dissertation research.

Over the years, I have received multiple conference reviews posing questions about the com-

plexity and correctness of our protocols. Evaluating ideas that define the coherence protocol

interactions between multiple cores requires significant manpower and expenditure even for in-

dustry, and it is a large undertaking for a student. The SLICC language developed as part of the

GEMS tool chain [MSB05] is a boon: it allowed me to specify the protocol with varying levels

of transient states and verify system behavior with a random tester. The table-based specifica-

tion allowed me to carefully visualize the protocol’s operations and reason about the correctness


and invariants. SLICC has empowered even graduate students to actively research areas such

as coherence protocols with reasonable fidelity. I think it is the next best thing to full-blown

formal verification, which has its own challenges. One of the key features that got me hooked

on to SLICC was being able to design tests and rapidly catch obvious mistakes such as missing

transitions and events. Using a tool such as SLICC has its disadvantages: the specification may

contain latent bugs that may not be caught by random testing. SLICC also makes it hard to

specify circuit-level optimizations like broadcast networks or to directly manipulate coherence

state bits.


Appendix A

Supplement for Transactional Memory

A.1 Experimental Framework

We evaluated the TM systems discussed in Chapter 3 and Chapter 4 through full system simu-

lation of a 16-way chip multiprocessor (CMP) with private split L1 caches and a shared L2. We

use the GEMS/Simics infrastructure [MSB05], a full system functional simulator that faithfully

models the SPARC architecture. GEMS itself underwent multiple revisions during my Ph.D.;

the work described in Chapter 3 and Chapter 4 used GEMS 1.2 and the work in Chapter 6 used

GEMS 2.0. The move to a new version of the simulator was necessitated since Simics 2.2.X

licenses expired in December 2009.

The instructions specified for the Alert-On-Update and Programmable-Data-Isolation

mechanisms interface with the TM runtime systems using the standard Simics “magic instruc-

tion” mechanism. We implemented support for the TMESI protocol and AOU mechanism using

the SLICC [MSB05] framework to encode all the stable and transient states in the system. We

employ GEMS’s network model for interconnect and switch contention and use 64 byte links.

Simics allows us to run an unmodified Solaris 9 kernel on our target system with the “user-

mode-change” and “exception-handler” interface enabling us to trap user-kernel mode cross-

ings. On crossings, we suspend the current transaction context and allow the OS to handle TLB

misses, register-window overflow, and other kernel activities required by an active user context

in the midst of a transaction. On transfer back from the kernel we deliver any exception signals


received during the kernel routine, triggering any user-level handlers if required. We used such

handlers for managing the interaction between hardware and software transactions in the RTM

system.

A.2 Application Characteristics

While microbenchmarks help stress-test an implementation and identify pathologies, de-

signing and understanding policy requires a comprehensive set of realistic workloads.

In this study, we have assembled seven benchmarks from the STAMP workload suite

v0.9.9 [MCK08], STMBench7 [GKV07], a CAD database workload, and two microbench-

marks from RSTMv3.3 [MSH06]. We briefly describe the benchmarks, where transactions are

employed, and present their runtime statistics (see Table A.1). Our runtime statistics include

transaction length, read/write set sizes, read and write event timings, and average conflict levels

(number of locations on which and the number of transactions with which conflicts occur). We

have also included information on the number of conflicting transactions and type of conflicts

(i.e., Read-Write or Write-Write) in order to analyze the behavior of the applications. We corre-

late this information with the influence of contention manager heuristics and measure the ability

of Eager/Lazy to uncover parallelism in Chapter 5.

Bayes: The Bayesian network is a directed acyclic graph that represents the relations between variables in a dataset. All operations (e.g., adding dependency sub-trees, splitting

nodes) on the acyclic graph occur within transactions. There is plenty of concurrency but the

data is accessed in a fine-grain manner, resulting in potential for conflict.

Read/Write Set: Large Contention: High

Input: -v32 -r1024 -n2 -p20 -s0 -i2 -e2

Delaunay: There have been multiple variants of the Delaunay benchmark that have been re-

leased [SSD07; KCP06]. This version is from the STAMP suite and implements the Delaunay

mesh refinement [Rup95]. There are primarily two data structures: (1) a Set for holding mesh

segments and (2) a graph that stores the generated mesh triangles. Transactions protect access

to these data structures. The operations on the graph (adding nodes, refining nodes) are complex


and involve large read/write sets, which leads to significant contention.

Read/Write Set: Large Contention: Moderate

Input: -a20 -i inputs/633.2

Genome: This benchmark implements a gene sequencing algorithm that reconstructs the gene

sequence from shorter known fragments of a larger gene (short strings of alphabets A,T,C,G).

It uses transactions for (1) eliminating duplicates by inserting the fragments into a hash-set and

(2) matching segments in parallel using a string matching algorithm to find the longest match.

In general, the application is highly parallel and contention free.

Read/Write Set: Moderate Contention: Low

Input: -g256 -s16 -n16384

Intruder: This benchmark parses a set of packet traces using a three stage pipeline. There

are also multiple packet-queues that try to use the data-structures in the same pipeline stage.

Transactions are used to protect the FIFO queue in stage 1 (capture phase) and the dictionary in

stage 2 (reassembly phase).

Read/Write Set: Moderate Contention: High

Input: -a10 -l16 -n4096 -s1

Kmeans: This workload implements the popular clustering algorithm that tries to organize data

points into K clusters. This algorithm is essentially data parallel and can be implemented with

only barrier-based synchronization. In the STAMP version, transactions are used to update the

centroid variable, for which there is very little contention.

Read/Write Set: Small Contention: Low

Input: -m10 -n10 -t0.05 -i inputs/random2048-d16-c16.txt

Labyrinth: This implements a route finding algorithm in a three-dimensional grid. Multiple

threads are set up to find an optimized route, each with their own copy of the grid. They update a

separate globally shared grid that stores the best routes. All the work including the route finding

with the private copy of the grid is enclosed in a transaction, leading to a large working set. The

contention on the shared grid is also high with many transactions desiring simultaneous access

to check and update the best route.

Read/Write Set: Large Contention: High


Input: -i inputs/random-x32-y32-z32 -n96

Vacation: Implements a travel reservation system. Client threads interact with an in-memory

database that implements the database tables as a Red-Black tree. Transactions are used during

all operations on the database.

Read/Write Set: Moderate Contention: Moderate

Input: -n4 -q45 -u90 -r1048576 -t4194304

STMBench7: STMBench7 [GKV07] was designed to mimic a transaction processing CAD

database system. Its primary data structure is a complex multi-level tree in which internal nodes

and leaves at every level represent various objects. It exports up to 45 different operations with

varying transaction properties. It is highly parametrized and can be set up for different levels

of contention. Here, we simulate the default read-write workload. This benchmark has high

degrees of fine-grain parallelism at different levels in the tree.

Read/Write Set: X-Large Contention: High

Input: Reads-60%, Writes-40%. Short Traversals-40%. Long Traversals 5%, Ops. - 45%,

Mods. 10%.

µbenchmarks: We chose four data structure benchmarks from RSTMv3.3 [10a]:

a) HashTable: Transactions use a hash table with 256 buckets and overflow chains to lookup,

insert, or delete a value in the range 0 . . . 255 with equal probability. At steady state, the table

is 50% full. b) RBTree: In the red-black tree (RBTree) benchmark, transactions attempt to insert, delete, or look up values in the range 0 . . . 4095 with equal probability. At steady state

there are about 2048 objects, with about half of the values stored in leaves. c) LFUCache:

LFUCache uses a large (2048) array based index and a smaller (255 entry) priority queue to

track the most frequently accessed pages in a simulated web cache. When re-heapifying the

queue, transactions always swap a value-one node with a value-one child; this induces hysteresis

and gives each page a chance to accumulate cache hits. Pages to be accessed are randomly

chosen using a Zipf distribution: p(i) ∝ Σ_{0<j≤i} j^{−2}. d) RandomGraph: The RandomGraph

benchmark requires transactions to insert or delete vertices in an undirected graph represented

with adjacency lists. Edges in the graph are chosen at random, with each new vertex initially

having up to 4 randomly selected neighbors.


A.3 Conflict Scenarios in Applications

We also profiled the conflict patterns present in the applications and categorized conflicts into

three types: Read-Write, Write-Read or Write-Write (see Figure A.1). Read-Write and Write-

Read conflicts vary based on which transaction notices the conflict — in Read-Write the reader

notices the conflict and in Write-Read the writer notices the conflict. Read-Write conflicts can

be resolved amicably between the transactions if the reader commits before the writer since the

conflict is elided entirely. Write-Write conflicts are problematic since both transactions can’t

commit concurrently (see footnote 2 on page 6). Bayes, Delaunay, Labyrinth, STMBench7, and

Vacation all employ “trie”-like data structures extensively and transactions are primarily used

to protect the “trie” operations. There are an extensive number of conflicts arising between

transactions that perform lookups on the tree and writer transactions that perform rotations and

balancing. The primary cause of conflicts is read-write sharing (which Lazy can exploit), which accounts for ∼97% of the conflicts. The working set size is also moderate to high due to the

prevalence of pointer chasing. Intruder has small transactions but has a conflict pattern similar

to Delaunay (Read-Write and Write-Read together represent over 99% of the conflicts). Kmeans

and Genome are essentially data parallel, and have a negligible fraction of conflicts (<15%).

Finally, LFUCache and RandomGraph are both stress tests — they have small, highly contended

working sets; a significant fraction of the conflicts are write-write (87% in LFUCache and 29%

in RandomGraph) and it is not possible to avoid wasted work.


R-W: Read-Write conflict. Reader accesses location before writer. W-R: Write-Read conflict. Writer accesses location before reader. W-W: Write-Write conflicts. Represents true write-write conflicts and upgrade conflicts in which the reader in a R-W or W-R conflict writes the location. All conflicts estimated for a Lazy system.

Figure A.1: Conflict type breakdown (percentage of each conflict type for Bayes, Delaunay, Genome, Intruder, Kmeans, Labyrinth, Vacation, STMBench7, LFUCache, and RandomGraph)


Table A.1: Transactional Workload Characteristics

Benchmark     Inst/tx  Wr set  Rd set  Wr1   Rd1   WrN   RdN   CST conflicts/tx  Avg. W-W/tx  Avg. R-W/tx
Bayes         70K      150     225     0.6   0.05  0.8   0.95  3                 0            1.7
Delaunay      12K      90      178     0.5   0.12  0.85  0.9   1                 0.1          1.1
Genome        1.8K     9       49      0.55  0.09  0.99  0.85  0                 0            0
Intruder      410      41      14      0.5   0.04  0.99  0.8   2                 0            1.4
Kmeans        130      4       19      0.65  0.1   0.99  0.7   0                 0            0
Labyrinth     180K     190     160     0.57  0.01  0.99  0.9   4                 0            2
Vacation      5.5K     12      89      0.75  0.02  0.99  0.8   1                 0            1.6
STMBench7     155K     310     590     0.4   0     0.85  0.9   3                 0.5          3.6
Hash          110      1       3       0.96  0.1   0.96  0.95  0                 0            0
RBTree        2K       2       25      0.9   0.01  0.99  0.8   1                 0            1.1
LFUCache      125      1       2       0.99  0     0.99  0.78  6                 0.8          0.8
RandomGraph   11K      9       60      0.6   0     0.9   0.99  5                 0.6          3

Setup: 16 threads with lazy conflict resolution. Inst/tx - instructions per transaction; K - Kilo. Wr set (Rd set): number of written (read) cache lines. Wr1 (WrN): first (last) write event time, measured as a fraction of tx execution time; Rd - read. CST conflicts per tx: number of CST bits set; median number of conflicting transactions encountered. W-W (R-W): avg. number of conflicts per tx (number of set CST bits per tx); avg. number of conflicting locations shared between txs.


Appendix B

Coherence State Machine

This chapter shows the detailed specification for the L1 cache controllers of the broadcast version

of the TMESI protocol. We use the 2-state TMESI protocol discussed in Section 4 to keep the

discussion simple. The transition table split across Figure B.1 and Figure B.2 was generated

using the SLICC parser from the GEMS toolchain. These tables provide a clear and concise

representation of the protocol, including all the transient states and detailed actions that occur

on specific transitions. This methodology provides a clear illustration of the additional actions

and events introduced by the TMESI states.

The rows of each table correspond to the states that the cache controller can be in. The columns correspond to triggered events that cause the cache controller to take actions and move to the state indicated in the cell. Events are typically triggered as the result of a processor access or when a

message is received on the interconnect. The table entries indicate a set of actions performed in

an atomic fashion when the state change is triggered by an event.

Table B.1 indicates the states of our L1 cache controller — it includes the stable states M, E, S, I, and TM. In this broadcast-based system, the L1 controller interfaces with three logical networks: the Address network (which handles L1 broadcasts and L2 forwarded invalidations), Data response, and Snoop response. Table B.3 enumerates the types of messages sent out on these networks.

GETS, UPGRADE, GETX, TGETX, and TUPGRADE all use the address network. The snoop

response messages corresponding to ACKs are sent out on the snoop response queue. The

shared-L2 collects the snoop responses and forwards the final response on the address queue


to indicate request completion. The data messages, DATA, travel on a separate logical data

network; writebacks (WB) from the L1 to the shared-L2 also use the same network. The logical

networks, data and address, share the same physical network and contend for bandwidth, while

the snoop response queue uses a separate physical network. In SLICC, the separate physical

network for the snoop response can be implemented by instantiating a separate queue in the

chip class.

Table B.2 lists the events that trigger state changes in the L1 controller: compared to the basic MESI protocol, we add six new events, namely TStore, TLoad, and T TLoad on the processor side, and Other TGETX, Other TUPGRADE, and Ack Threat on the network side. Table B.4 lists the

actions of the coherence protocol. To illustrate a sample transition, consider a cache line in state

I which receives the trigger Load; it performs the actions ggets, c, a, k (refer to Table B.4) and moves to the state ISAD indicated after the slash (/).
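The C-style fragment below mirrors the structure of this table-driven specification for the single transition just described; the helper functions stand in for the SLICC actions of Table B.4, and the enums elide the remaining states and events.

/* Names mirror Tables B.1-B.4; bodies of the action helpers are elided. */
typedef enum { S_I, S_ISAD /* ... remaining states of Table B.1 ... */ } l1_state_t;
typedef enum { E_LOAD /* ... remaining events of Table B.2 ... */ } l1_event_t;

extern void issue_gets(void);      /* action ggets */
extern void set_l1_tag(void);      /* action c     */
extern void allocate_mshr(void);   /* action a     */
extern void pop_mandatory_q(void); /* action k     */

l1_state_t l1_transition(l1_state_t state, l1_event_t event)
{
    switch (state) {
    case S_I:
        if (event == E_LOAD) {
            issue_gets();          /* broadcast a GETS           */
            set_l1_tag();          /* claim the L1 tag           */
            allocate_mshr();       /* track the outstanding miss */
            pop_mandatory_q();     /* retire the processor event */
            return S_ISAD;         /* the state after the slash  */
        }
        break;
    default:
        break;
    }
    return state;  /* unlisted (state, event) pairs are stalls or errors */
}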

Every miss handler register (MSHR) includes three single-bit fields: a) Abort, which indicates

that the abort handler should be invoked when the request completes, b) Alert, which indicates

if the memory operation in flight is an alert memory operation. It detects racy remote writes and

sets the Abort field, if necessary, and c) Trans, which indicates transactional operation in flight.

It detects racy non-transactional writes for aborts. It sets the T bit in the cache line if threatened

(ACK Threat). Every cache line includes two bits: a) Alert, which indicates if the cache line

has been tagged as alert-on-update, and b) Trans, which tags the transactional states.

We use the Trans in each cache line to encode the TI state, and eliminate the separate state

shown in our state machine diagram (Figure 4.3 in Chapter 4). TI's transitions mirror I for

all events except the processor TLoad event, in which case TI returns a hit, while I initiates

miss activity. Hence, within the SLICC specification we can eliminate the separate state by

including a new event type, T_Load to handle the case where a TLoad is issued to a TI line;

other processor events do not distinguish between TI and I. Table B.1 also shows the transient

states that are part of the L1 cache controller. Note that the majority of the transient states are part of

the basic MESI coherence protocol. We add ten new transient states: ITMAD, ITMD, ITMA, ITMDI, STMUAD, STMD, STMDI, STMA, MTMA, and MTMI. Interestingly, the transitions from I and S into the TM state, which include the ITM[*] and STM[*] states, are similar to the transient states IM[*] and SM[*], which move to the M state. Although there


are extra states in this representation, the number of unique transitions for which the actions

differ is small. For example, when a core issues a TStore to the cache controller in State

I, it broadcasts a TGETX message and transitions to ITMAD. The other cache controllers

will provide a snoop response, which is collected by the L2. Let us assume that there were

no other sharers, in which case the L2 supplies the DATA message and sends a separate ACK

message. Based on the order in which these messages are received, ITMAD transitions to

ITMA, ITMD, and finally into TM. Now when the core issues a Store to the cache controller in I, it broadcasts a GETX and transitions to IMAD and its counterparts; essentially the IM[*] states mirror the ITM[*] states. We could potentially eliminate all the ITM[*] transient states

by including checks in the final transitions to detect if the transient states were initiated by a

TStore or a Store, and then transitioning to TM or M based on this check. This would reduce

the readability of the protocol since these checks would need more events and modifications to

the controller that triggers the events. However, it reduces the number of states in the protocol

and may potentially reduce the verification complexity.


Figure B.1: State Machine, Part 1


Note that the states are listed in the last column.

Figure B.2: State Machine, Part 2


Table B.1: TMESI L1 controller states

NP - Default state. Not Present
I - Invalid
S - Shared
E - Owned
M - Modified
TM - Transactionally modified
ISAD - Invalid, issued GETS, have not seen GETS ACK or Data yet
IMAD - Invalid, issued GETX, have not seen GETX ACK or Data yet
ITMAD - Invalid, issued TGETX, have not seen TGETX ACK or Data yet
SMUAD - Shared, issued Upgrade, have not seen upgrade yet
STMUAD - Invalid, issued TUpgrade, have not seen Upgrade ACK or Data yet
ISA - Invalid, issued GETS, have not seen GETS ACK, have seen Data
IMA - Invalid, issued GETX, have not seen GETX ACK, have seen Data
ITMA - Invalid, issued TGETX, have not seen TGETX ACK, have seen Data
SMA - Shared, issued GETX, have not seen GETX ACK, have seen Data
STMA - Shared, issued TUPGRADE, have not seen TUPGRADE ACK, have seen Data
MIA - Modified, issued WB, have not seen WB ACK yet
ISD - Invalid, issued GETS, have seen GETS, have not seen Data yet
IED - Invalid, issued GETS, have seen Ack Exclusive, have not seen Data yet
ISDI - Invalid, issued GETS, have seen GETS, have not seen Data, then saw other GETX. Move to Invalid after receiving data.
IMD - Invalid, issued GETX, have seen GETX, have not seen Data yet
ITMD - Invalid, issued TGETX, have seen ACK, have not seen Data yet
IMDS - Invalid, issued GETX, have seen GETX, have not seen Data yet, then saw other GETS
IMDI - Invalid, issued GETX, have seen GETX, have not seen Data yet, then saw other GETX
ITMDI - Invalid, issued TGETX, have seen TGETX, have not seen Data yet, then saw other GETX. Abort transaction.
SMD - Shared, issued GETX, have seen GETX, have not seen Data yet
STMD - Shared, issued TUPGRADE, have seen TUPGRADE, have not seen Data yet
SMDS - Shared, issued GETX, have seen GETX, have not seen Data yet, then saw other GETS
SMDI - Shared, issued GETX, have seen GETX, have not seen Data yet, then saw other GETX. On Data, drain store and then supply the updated cache block.
STMDI - Shared, issued TUPGRADE, have seen TUPGRADE, have not seen Data yet, then saw other GETX. Abort transaction.
MTMA - Modified, issued threatened WB. Not seen ACK
MTMI - Modified, issued threatened WB, then saw another GETX before ACK. Abort transaction.


Table B.2: TMESI L1 controller events

Processor events:
Load - Load request from the processor
ALoad - ALoad request from the processor
Store - Store request from the processor
TStore - Transactional Store from the processor
TLoad - Transactional Load from the processor
T TLoad - Represents a TLoad to a cache line with the T state set

Address queue events:
Other GETS - Occurs when we observe a GETS request from another processor
Other GETX - Occurs when we observe a GETX request from another processor
Other UPGRADE - Occurs when we observe an Upgrade request from another processor
Other TGETX - Occurs when we observe a TGETX request from another processor
Other TUPGRADE - Occurs when we observe a TUPGRADE request from another processor
Own Request - Occurs when we observe our own request in global order
Inv - L2 sent invalidation request
Ack - Occurs when it indicates an Ack
WB Ack - Occurs when it indicates a Writeback Ack
Ack Threat - Occurs when it indicates an Ack Threat
Ack Upgrade - Occurs when we observe an Ack UPGRADE; no data needed
Ack Exclusive - Occurs when it indicates an Ack Exclusive; can move to E state if data received without any intervening GET

Table B.3: TMESI L1 controller messages

Coherence messages (Message - Description - Source→Destination):
GETX - Get exclusive - L1 Request→broadcast
TGETX - Transactional GETX - L1 Request→broadcast
UPGRADE - Upgrade to exclusive - L1 Request→broadcast
TUPGRADE - Transactional Upgrade to exclusive - L1 Request→broadcast
GETS - Get shared copy - L1 Request→broadcast
INV - Invalidate - L2 eviction→L1
ACK - Generic ack to L1 - L2 ack→L1
ACK THREAT - Generic Threatened ACK - L2 ack→L1
ACK EXCLUSIVE - Exclusive copy, data response - L2 ack→L1
ACK UPGRADE - Exclusive copy for upgrade, no data response - L2 ack→L1
WB ACK - Writeback ack - L2 ack→L1

L1 coherence response messages:
DATA - Data response to forwarded coherence message - L1→L2 on data network
ACK - Invalidation response - L1→L2 on snoop response
ACK THREAT - Threat ack; L1 has line in TM - L1→L2 on snoop response
ACK EXCLUSIVE - Exclusive copy returned - L1→L2 on snoop response
WB - Writeback data - L1→L2 on data network


Table B.4: TMESI L1 cache controller actions

a - Allocate MSHR with Trans=false, Alert=false, and Abort=false.
cab - Abort handler
c - Set L1 Dcache tag equal to tag of block B.
ca - Check A bit.
ct - Check T bit.
cc - Commit handler
chka - Check for abort handler
cb - Clear A bit.
ctb - Clear T bit.
sa - Set Alert bit.
st - Set Trans bit.
d - Deallocate MSHR.
ggets - Issue GETS.
ggetu - Issue Upgrade.
ggetx - Issue GETX.
gtgetu - Issue transactional UPGRADE.
gtgetx - Issue transactional GETX.
h - Notify sequencer the load or store completed.
i - Pop incoming address queue.
j - Pop incoming data queue.
k - Pop mandatory queue.
m - Deallocate L1 cache block.
packdCopy - Issue Ack Data Copy.
packd - Issue Data-ACK.
packe - Issue Exclusive-ACK.
pack - Issue Ack.
packt - Issue threatened ACK.
packy - Issue specific Ack type based on response from remote cores.
pwd - Issue data.
pwb - Issue writeback.
pwbt - Issue threatened writeback.
qtbe - Write data from the cache into the MSHR.
sca - Save data into cache.
z - Recycle mandatory queue from processor. Can't handle access in this cycle.


Bibliography

[ACC11] Ali-Reza Adl-Tabatabai, Calin Cascaval, Steve Clamage, Robert Geva, Victor Luchangco,Virendra Marathe, Maged Michael, Mark Moir, Ravi Narayanaswamy, Clark Nelson, Yang Ni,Daniel Nussbaum, Tatiana Shpeisman, Raul Silvera, Xinmin Tian, Douglas Walls, Adam Welc,Michael Wong, and Peng Wu. Draft Specification (v3.0) of Transactional Language Constructs forC++. June 2011. http://groups.google.com/group/tm-languages.

[ASH88] Anant Agarwal, Richard Simoni, John Hennessy, and Mark Horowitz. An Evaluation of Di-rectory Schemes for Cache Coherence. In Proc. of the 15th Intl. Symp. on Computer Architecture,pages 280-289, Honolulu, HI, June 1988.

[AAK05] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean Lie.Unbounded Transactional Memory. In Proc. of the 11th Intl. Symp. on High Performance ComputerArchitecture, pages 316-327, San Francisco, CA, Feb. 2005.

[ApL91] A. W. Appel and K. Li. Virtual Memory Primitives for User Programs. In Proc. of the 4thIntl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages96-107, Santa Clara, CA, Apr. 1991.

[ArB84] J. Archibald and J.-Loup Baer. An Economical Solution to the Cache Coherence Problem. InProc. of the 11th Intl. Symp. on Computer Architecture, pages 355-362, 1984.

[Ass07] Semiconductor Industries Association. Model for Assessment of CMOS Technologies andRoadmaps (MASTAR). 2007. http://www.itrs.net/models.html.

[BNZ08] Lee Baugh, Naveen Neelakantan, and Craig Zilles. Using Hardware Memory Protection toBuild a High-Performance, Strongly Atomic Hybrid Transactional Memory. In Proc. of the 35thIntl. Symp. on Computer Architecture, Beijing, China, June 2008.

[BCD72] A. Bensoussan, C. T. Clingen, and R. C. Daley. The MULTICS Virtual Memory: Conceptsand Design. Comm. of the ACM, 15(5), May 1972.

[BAL90] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy. Lightweight Remote Pro-cedure Call. ACM Trans. on Computer Systems, 8(1):37-55, Feb. 1990. Originally presented at the12th ACM Symp. on Operating Systems Principles, Dec. 1989.

[BSP95] Brian N Bershad, Stefan Savage, Przemysław Pardyak, Emin Gun Sirer, Marc Fiuczynski,David Becker, Susan Eggers, and Craig Chambers. Extensibility, Safety and Performance in theSPIN Operating System. In Proc. of the 15th ACM Symp. on Operating Systems Principles, CopperMountain, CO, Dec. 1995.

[Blo70] Burton H. Bloom. Space/Time Trade-Off in Hash Coding with Allowable Errors. Comm. of theACM, 13(7):422-426, July 1970.

205

[BMV07] Jayaram Bobba, Kevin E. Moore, Haris Volos, Luke Yen, Mark D. Hill, Michael M. Swift, and David A. Wood. Performance Pathologies in Hardware Transactional Memory. In Proc. of the 34th Intl. Symp. on Computer Architecture, pages 32-41, San Diego, CA, June 2007.

[BGH08] Jayaram Bobba, Neelam Goyal, Mark D. Hill, Michael M. Swift, and David A. Wood. TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory. In Proc. of the 35th Intl. Symp. on Computer Architecture, Beijing, China, June 2008.

[BBL10] Sebastian Burckhardt, Alexandro Baldassin, and Daan Leijen. Concurrent Programming with Revisions and Isolation Types. In OOPSLA 2010 Conf. Proc., 2010.

[CKD94] Nicholas P. Carter, Stephen W. Keckler, and William J. Dally. Hardware Support for Fast Capability-Based Addressing. In Proc. of the 6th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 319-327, San Jose, CA, Oct. 1994.

[CBM08] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software Transactional Memory: Why Is It Only a Research Toy? ACM Queue, 6(5):46-58, Sept. 2008.

[CTC06] Luis Ceze, James Tuck, Calin Cascaval, and Josep Torrellas. Bulk Disambiguation of Speculative Threads in Multiprocessors. In Proc. of the 33rd Intl. Symp. on Computer Architecture, Boston, MA, June 2006.

[CTM07] Luis Ceze, James Tuck, Pablo Montesinos, and Josep Torrellas. BulkSC: Bulk Enforcement of Sequential Consistency. In Proc. of the 34th Intl. Symp. on Computer Architecture, San Diego, CA, June 2007.

[CCC07] Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi Cao Minh, Woongki Baek, Christos Kozyrakis, and Kunle Olukotun. A Scalable, Non-blocking Approach to Transactional Memory. In Proc. of the 13th Intl. Symp. on High Performance Computer Architecture, Phoenix, AZ, Feb. 2007.

[CVP99] Tzicker Chiueh, Ganesh Venkitachalam, and Prashant Pradhan. Integrating Segmentation and Paging Protection for Safe, Efficient and Transparent Software Extensions. In Proc. of the 17th ACM Symp. on Operating Systems Principles, Charleston, SC, Dec. 1999.

[CMK08] Matt Chu, Christian Murphy, and Gail Kaiser. Distributed In Vivo Testing of Software Applications. In Proc. of the 1st Intl. Conf. on Software Testing, Verification, and Validation, 2008.

[CNV06] Weihaw Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Biesbrouck, Gilles Pokam, Brad Calder, and Osvaldo Colavin. Unbounded Page-Based Transactional Memory. In Proc. of the 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 347-358, San Jose, CA, Oct. 2006.

[CMM06] JaeWoong Chung, Chi Cao Minh, Austen McDonald, Travis Skare, Hassan Chafi, Brian D. Carlstrom, Christos Kozyrakis, and Kunle Olukotun. Tradeoffs in Transactional Memory Virtualization. In Proc. of the 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 371-381, San Jose, CA, Oct. 2006.

[CCL81] S. Colley, G. Cox, K. Lai, J. Rattner, and R. Swanson. The Object-Based Architecture of the Intel 432. In Proc. of the IEEE COMPCON Spring ’81, Feb. 1981.

[Cor] Intel Corporation. Intel Thread Checker. http://developer.intel.com/software/products/threading/tcwin.

[CPM98] Crispin Cowan, Calton Pu, Dave Maier, Heather Hinton, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, and Qian Zhang. StackGuard: Automatic Adaptive Detection and Prevention of Buffer-Overflow Attacks. In Proc. of the 7th USENIX Security Symp., 1998.

[DCW11] L. Dalessandro, F. Carouge, S. White, Yossi Lev, Mark Moir, Michael L. Scott, and Michael F. Spear. Hybrid NOrec: A Case Study in the Effectiveness of Best Effort Hardware Transactional Memory. In Proc. of the 16th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Mar. 2011.

[DFL06] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Dan Nussbaum. Hybrid Transactional Memory. In Proc. of the 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, Oct. 2006.

[DFH04] David Detlefs, Christine Flood, Steve Heller, and Tony Printezis. Garbage-First Garbage Collection. In Proc. of the 4th Intl. Symp. on Memory Management, 2004.

[DSS06] Dave Dice, Ori Shalev, and Nir Shavit. Transactional Locking II. In Proc. of the 20th Intl. Symp. on Distributed Computing, pages 194-208, Stockholm, Sweden, Sept. 2006.

[DiS06] Dave Dice and Nir Shavit. What Really Makes Transactions Fast? In Proc. of the 1st ACM SIGPLAN Workshop on Transactional Computing, Ottawa, ON, Canada, June 2006.

[DiH08] Stephan Diestelhorst and Michael Hohmuth. Hardware Acceleration for Lock-Free Data Structures and Software-Transactional Memory. In Workshop on Exploiting Parallelism with Transactional Memory and other Hardware Assisted Methods (EPHAM), Boston, MA, Apr. 2008. In conjunction with CGO.

[DHS08] Shlomi Dolev, Danny Hendler, and Adi Suissa. CAR-STM: Scheduling-Based Collision Avoidance and Resolution for Software Transactional Memory. In Proc. of the 27th ACM Symp. on Principles of Distributed Computing, Toronto, Canada, Aug. 2008.

[Enn06] Robert Ennals. Software Transactional Memory Should Not Be Lock Free. Technical Report IRC-TR-06-052, Intel Research Cambridge, 2006.

[Feu73] Edward A. Feustel. On the Advantages of Tagged Architecture. IEEE Trans. on Computers, 22(11):644-656, 1973.

[FlQ03] C. Flanagan and S. Qadeer. A Type and Effect System for Atomicity. In Proc. of the 2003 Conf. on Programming Language Design and Implementation, June 2003.

[FrH07] Keir Fraser and Tim Harris. Concurrent Programming Without Locks. ACM Trans. on Computer Systems, 25(2):article 5, May 2007.

[FMJ07] J. Friedrich, B. McCredie, N. James, B. Huott, B. Curran, E. Fluhr, G. Mittal, E. Chan, Y. Chan, D. Plass, S. Chu, H. Le, L. Clark, J. Ripley, S. Taylor, J. Dilullo, and M. Lanzerotti. Design of the POWER6 Microprocessor. In Proc. of the Intl. Solid State Circuits Conf., pages 96-97, San Francisco, CA, Feb. 2007.

[FMF05] Nathan Froyd, J. Mellor-Crummey, and R. Fowler. Low-Overhead Call Path Profiling of Unmodified, Optimized Code. In Proc. of the 19th ACM Intl. Conf. on Supercomputing, 2005.

[GiS87] D. Gifford and A. Spector. Case Study: IBM’s System/360-370 Architecture. Comm. of the ACM, 30(4):291-307, Apr. 1987.

[GFV99] Chris Gniady, Babak Falsafi, and T. N. Vijaykumar. Is SC + ILP = RC? In Proc. of the 26th Intl. Symp. on Computer Architecture, pages 162-171, Atlanta, GA, May 1999.

[Goo87] J. R. Goodman. Coherency for Multiprocessor Virtual Address Caches. In Proc. of the 2nd Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 72-81, Palo Alto, CA, Oct. 1987.

[GHP05a] Rachid Guerraoui, Maurice Herlihy, and Bastian Pochon. Polymorphic Contention Management in SXM. In Proc. of the 19th Intl. Symp. on Distributed Computing, Cracow, Poland, Sept. 2005.

[GHP05b] Rachid Guerraoui, Maurice Herlihy, and Bastian Pochon. Toward a Theory of Transactional Contention Managers. In Proc. of the 24th ACM Symp. on Principles of Distributed Computing, Las Vegas, NV, Aug. 2005.

[GKV07] Rachid Guerraoui, Michal Kapalka, and Jan Vitek. STMBench7: A Benchmark for Software Transactional Memory. In Proc. of the 2nd EuroSys, Lisbon, Portugal, Mar. 2007.

[HWC04] Lance Hammond, Vicky Wong, Mike Chen, Ben Hertzberg, Brian Carlstrom, Manohar Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. Transactional Memory Coherence and Consistency. In Proc. of the 31st Intl. Symp. on Computer Architecture, München, Germany, June 2004.

[HPS06] Timothy Harris, Mark Plesko, Avraham Shinnar, and David Tarditi. Optimizing Memory Transactions. In Proc. of the SIGPLAN 2006 Conf. on Programming Language Design and Implementation, pages 14-25, Ottawa, ON, Canada, June 2006.

[HLR10] Tim Harris, James R. Larus, and Ravi Rajwar. Transactional Memory (2nd edition). Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2010.

[HaD68] E. A. Hauck and B. A. Dent. Burroughs’ B6500/B7500 Stack Mechanism. Proc. of the AFIPS Spring Joint Computer Conf., 32:245-251, 1968.

[HLM03a] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-Free Synchronization: Double-Ended Queues as an Example. In Proc. of the 23rd Intl. Conf. on Distributed Computing Systems, Providence, RI, May 2003.

[HLM03b] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software Transactional Memory for Dynamic-sized Data Structures. In Proc. of the 22nd ACM Symp. on Principles of Distributed Computing, pages 92-101, Boston, MA, July 2003.

[HeM93] M. Herlihy and J. E. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In Proc. of the 20th Intl. Symp. on Computer Architecture, pages 289-300, San Diego, CA, May 1993. Expanded version available as CRL 92/07, DEC Cambridge Research Laboratory, Dec. 1992.

[HMM96] Mark Horowitz, Margaret Martonosi, Todd C. Mowry, and Michael D. Smith. Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors. In Proc. of the 23rd Intl. Symp. on Computer Architecture, Philadelphia, PA, May 1996.

[Inc05] Sun Microsystems Inc. OpenSPARC T2 Core Microarchitecture Specification. July 2005.

[Int] Intel. Intel(R) C++ STM Compiler, Prototype Edition. http://software.intel.com/en-us/articles/intel-c-stm-compiler-prototype-edition/.

[Int06] Intel Corporation. Intel Itanium Architecture Software Developer’s Manual. Revision 2.2, Jan. 2006.

[IsS99] Haruna R. Isa and William R. Shockley. A Multi-threading Architecture for Multilevel Secure Transaction Processing. In Proc. of the IEEE Symp. on Security and Privacy, May 1999.

[KEG97] Frans Kaashoek, Dawson Engler, Greg Ganger, Hector Briceno, Russell Hunt, David Mazieres, Tom Pinckney, Robert Grimm, and Ken Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Proc. of the 16th ACM Symp. on Operating Systems Principles, St. Malo, France, Oct. 1997.

[KCE92] Eric J. Koldinger, Jeffrey S. Chase, and Susan J. Eggers. Architectural Support for Single Address Space Operating Systems. In Proc. of the 5th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 175-186, Boston, MA, Oct. 1992.

[KAO05] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-Way Multithreaded SPARC Processor. IEEE Micro, pages 21-29, Mar.-Apr. 2005.

[KCP06] Milind Kulkarni, L. Paul Chew, and Keshav Pingali. Using Transactions in Delaunay Mesh Generation. In Workshop on Transactional Memory Workloads, Ottawa, ON, Canada, June 2006.

[KCH06] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and Anthony Nguyen. Hybrid Transactional Memory. In Proc. of the 11th ACM Symp. on Principles and Practice of Parallel Programming, New York, NY, Mar. 2006.

[Lam05] Christoph Lameter. Effective Synchronization on Linux/NUMA Systems. In Proc. of the 2005 Gelato Federation Meeting, 2005.

[Lam71] B. W. Lampson. Protection. In Proc. of the 5th Princeton Symp. on Information Sciences and Systems, pages 437-443, Mar. 1971. Reprinted in ACM SIGOPS Operating Systems Review 8:1 (Jan. 1974), pp. 18-24.

[LaL97] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proc. of the 24th Intl. Symp. on Computer Architecture, Denver, CO, June 1997.

[LeM08] Yossi Lev and Jan-Willem Maessen. Split Hardware Transaction: True Nesting of Transactions Using Best-effort Hardware Transactional Memory. In Proc. of the 13th ACM Symp. on Principles and Practice of Parallel Programming, Salt Lake City, UT, Feb. 2008.

[Lev84] H. M. Levy. Capability-Based Computer Systems. Digital Press, Bedford, MA, 1984.

[LH10] Xin Li, Michael C. Huang, Kai Shen, and Lingkun Chu. A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility. In Proc. of the USENIX 2010 Technical Conf., Jan. 2010.

[Lie93] J. Liedtke. Improving IPC by Kernel Design. In Proc. of the 14th ACM Symp. on Operating Systems Principles, Asheville, NC, Dec. 1993.

[LT06] Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. AVIO: Detecting Atomicity Violations via Access Interleaving Invariants. In Proc. of the 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, Oct. 2006.

[LP08] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Mar. 2008.

[Mai05] Ken Mai. Design and Analysis of Reconfigurable Memories. Ph.D. dissertation, Stanford Univ., June 2005.

[MSS04] Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Design Tradeoffs in Modern Software Transactional Memory Systems. In Proc. of the 7th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, Houston, TX, Oct. 2004.

[MSS05] Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Adaptive Software Transactional Memory. In Proc. of the 19th Intl. Symp. on Distributed Computing, Cracow, Poland, Sept. 2005.

[MSH06] Virendra J. Marathe, Michael F. Spear, Christopher Heriot, Athul Acharya, David Eisenstat, William N. Scherer III, and Michael L. Scott. Lowering the Overhead of Software Transactional Memory. In Proc. of the 1st ACM SIGPLAN Workshop on Transactional Computing, Ottawa, ON, Canada, June 2006. Expanded version available as TR 893, Dept. of Computer Science, Univ. of Rochester, Mar. 2006.

[MaS00] E. Marcus and H. Stern. Blueprints for High Availability. John Wiley and Sons, 2000.

[MaT02] José F. Martínez and Josep Torrellas. Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications. In Proc. of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 18-29, San Jose, CA, Oct. 2002.

[MSB05] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E. Moore, Mark D. Hill, and David A. Wood. Multifacet’s General Execution-driven Multiprocessor Simulator (GEMS) Toolset. ACM SIGARCH Computer Architecture News, Sept. 2005.

[MAK01] P. E. McKenney, J. Appavoo, A. Kleen, O. Krieger, R. Russell, D. Sarma, and M. Soni. Read-Copy Update. In Proc. of the Ottawa Linux Symp., July 2001.

[McK04] Paul E. McKenney. Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels. Ph.D. dissertation, Dept. of Computer Science and Engineering, Oregon Graduate Institute, July 2004.

[MBS08] Vijay Menon, Steven Balensiefer, Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Bratin Saha, and Adam Welc. Practical Weak-Atomicity Semantics for Java STM. In Proc. of the 20th ACM Symp. on Parallelism in Algorithms and Architectures, pages 314-325, Munich, Germany, June 2008.

[MTC07] Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald, Nathan Bronson, Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees. In Proc. of the 34th Intl. Symp. on Computer Architecture, San Diego, CA, June 2007.

[MCK08] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. STAMP: Stanford Transactional Applications for Multi-Processing. In Proc. of the 2008 IEEE Intl. Symp. on Workload Characterization, Seattle, WA, Sept. 2008.

[Moi10] Mark Moir. All Short Sentences about Transactional Memory are Wrong! In Transactional Memory Workshop (TMW) 2010, Apr. 2010.

[MHS09] Daniel Molka, Daniel Hackenberg, Robert Schöne, and Matthias S. Müller. Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System. In Proc. of the 18th Intl. Conf. on Parallel Architectures and Compilation Techniques, Sept. 2009.

[MBM06] Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and David A. Wood. LogTM: Log-based Transactional Memory. In Proc. of the 12th Intl. Symp. on High Performance Computer Architecture, Austin, TX, Feb. 2006.

[Mos06] J. Eliot B. Moss. Open Nested Transactions: Semantics and Support. In Proc. of the 4th IEEE Workshop on Memory Performance Issues, Austin, TX, Feb. 2006. Held in conjunction with HPCA 2006.

[MBJ07] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches With CACTI 6.0. In Proc. of the 40th Intl. Symp. on Microarchitecture, Dec. 2007.

[MKC07] C. Murphy, G. Kaiser, and M. Chu. Towards In Vivo Testing of Software Applications. Technical Report cucs-038-07, Columbia University, 2007.

[NaG09] V. Nagarajan and R. Gupta. ECMon: Exposing Cache Events for Monitoring. In Proc. of the 36th Intl. Symp. on Computer Architecture, June 2009.

[NMW02] George C. Necula, Scott McPeak, and Westley Weimer. CCured: Type-Safe Retrofitting of Legacy Code. In Proc. of the 29th ACM Symp. on Principles of Programming Languages, 2002.

[NeZ07] Naveen Neelakantam and Craig Zilles. UFO: A General-Purpose User-Mode Memory Protection Technique for Application Use. Technical Report UIUCDCS-R-2007-2808, Jan. 2007.

[NeS07] Nicholas Nethercote and Julian Seward. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In Proc. of the SIGPLAN 2007 Conf. on Programming Language Design and Implementation, June 2007.

[Org72] E. I. Organick. The Multics System: An Examination of Its Structure. MIT Press, Cambridge, MA, 1972.

[PaF06] Stefano Di Paola and Giorgio Fedon. Subverting Ajax. In Proc. of the 23rd Chaos Communication Congress, 2006.

[Pat10] David Patterson. The Trouble with Multicore. IEEE Spectrum, July 2010.

[PHR09] Donald E. Porter, Owen S. Hofmann, Christopher J. Rossbach, Alex Benn, and Emmett Witchel. Operating System Transactions. In Proc. of the 22nd ACM Symp. on Operating Systems Principles, Oct. 2009.

[QLZ05] Feng Qin, Shan Lu, and Yuanyuan Zhou. SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs. In Proc. of the 11th Intl. Symp. on High Performance Computer Architecture, Feb. 2005.

[RaG01] Ravi Rajwar and James R. Goodman. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. In Proc. of the 34th Intl. Symp. on Microarchitecture, Austin, TX, Dec. 2001.

[RaG02] Ravi Rajwar and James R. Goodman. Transactional Lock-Free Execution of Lock-Based Programs. In Proc. of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 5-17, San Jose, CA, Oct. 2002.

[RHL05] Ravi Rajwar, Maurice Herlihy, and Konrad Lai. Virtualizing Transactional Memory. In Proc. of the 32nd Intl. Symp. on Computer Architecture, Madison, WI, June 2005.

[RRP07] Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter, Owen S. Hofmann, Aditya Bhandari, and Emmett Witchel. MetaTM/TxLinux: Transactional Memory For An Operating System. In Proc. of the 34th Intl. Symp. on Computer Architecture, San Diego, CA, June 2007.

[RRW08] Hany E. Ramadan, Christopher J. Rossbach, and Emmett Witchel. Dependence-Aware Transactional Memory for Increased Concurrency. In Proc. of the 41st Intl. Symp. on Microarchitecture, Dec. 2008.

[RPA97] Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, and Sarita V. Adve. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. In Proc. of the 24th Intl. Symp. on Computer Architecture, Denver, CO, June 1997.

[RLS10] Bogdan F. Romanescu, Alvin R. Lebeck, Daniel J. Sorin, and Anne Bracy. UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All. In Proc. of the 16th Intl. Symp. on High Performance Computer Architecture, Jan. 2010.

[Rup95] J. Ruppert. A Delaunay Refinement Algorithm for Quality 2-Dimensional Mesh Generation. Journal of Algorithms, pages 548-555, May 1995.

[SAJ06] Bratin Saha, Ali-Reza Adl-Tabatabai, and Quinn Jacobson. Architectural Support for Software Transactional Memory. In Proc. of the 39th Intl. Symp. on Microarchitecture, pages 185-196, Orlando, FL, Dec. 2006.

[SAH06] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime. In Proc. of the 11th ACM Symp. on Principles and Practice of Parallel Programming, pages 187-197, New York, NY, Mar. 2006.

[SYM07] N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs. The Implementation of the 65nm Dual-Core 64b Merom Processor. In Proc. of the Intl. Solid State Circuits Conf., pages 106-107, San Francisco, CA, Feb. 2007.

[SYH07] Daniel Sanchez, Luke Yen, Mark D. Hill, and Karu Sankaralingam. Implementing Signatures for Transactional Memory. In Proc. of the 40th Intl. Symp. on Microarchitecture, Chicago, IL, Dec. 2007.

[SBN97] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Trans. on Computer Systems, 15(4):391-411, Nov. 1997. Earlier version presented at the 16th ACM Symp. on Operating Systems Principles, Oct. 1997.

[ScS05] William N. Scherer III and Michael L. Scott. Randomization in STM Contention Management (poster paper). In Proc. of the 24th ACM Symp. on Principles of Distributed Computing, Las Vegas, NV, July 2005.

[ScT89] David L. Schleicher and Roger L. Taylor. System Overview of the Application System/400. IBM Systems Journal, 28(3):360-375, 1989.

[SFL94] Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, James R. Larus, and David A. Wood. Fine-grain Access Control for Distributed Shared Memory. In Proc. of the 6th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 297-306, San Jose, CA, Oct. 1994.

[SCS77] M. D. Schroeder, D. D. Clark, and J. H. Saltzer. The Multics Kernel Design Project. In Proc. of the 6th ACM Symp. on Operating Systems Principles, pages 43-56, West Lafayette, IN, Nov. 1977.

[Sco06] Michael L. Scott. Sequential Specification of Transactional Memory Semantics. In Proc. of the 1st ACM SIGPLAN Workshop on Transactional Computing, Ottawa, ON, Canada, June 2006.

[SSD07] Michael L. Scott, Michael F. Spear, Luke Dalessandro, and Virendra J. Marathe. Delaunay Triangulation with Transactions and Barriers. In IEEE Intl. Symp. on Workload Characterization, Boston, MA, Sept. 2007. Benchmarks track.

[SSF99] Jonathan S. Shapiro, Jonathan M. Smith, and David J. Farber. EROS: A Fast Capability System. In Proc. of the 17th ACM Symp. on Operating Systems Principles, Charleston, SC, Dec. 1999.

[SMD06] Arrvindh Shriraman, Virendra J. Marathe, Sandhya Dwarkadas, Michael L. Scott, David Eisenstat, Christopher Heriot, William N. Scherer III, and Michael F. Spear. Hardware Acceleration of Software Transactional Memory. In Proc. of the 1st ACM SIGPLAN Workshop on Transactional Computing, Ottawa, ON, Canada, June 2006. Expanded version available as TR 887, Dept. of Computer Science, Univ. of Rochester, Dec. 2005, revised Mar. 2006.

[SSH07] Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Sandhya Dwarkadas, and Michael L. Scott. An Integrated Hardware-Software Approach to Flexible Transactional Memory. In Proc. of the 34th Intl. Symp. on Computer Architecture, San Diego, CA, June 2007. Earlier but expanded version available as TR 910, Dept. of Computer Science, Univ. of Rochester, Dec. 2006.

[SDS08] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible Decoupled Transactional Memory Support. In Proc. of the 35th Intl. Symp. on Computer Architecture, Beijing, China, June 2008.

[ShD09] Arrvindh Shriraman and Sandhya Dwarkadas. Refereeing Conflicts in Hardware Transactional Memory. In Proc. of the 2009 ACM Intl. Conf. on Supercomputing, June 2009.

[ShD10] Arrvindh Shriraman and Sandhya Dwarkadas. Sentry: Light-Weight Auxiliary Memory Access Control. In Proc. of the 37th Intl. Symp. on Computer Architecture, June 2010.

[SBV95] Guri Sohi, Scott Breach, and T. N. Vijaykumar. Multiscalar Processors. In Proc. of the 22nd Intl. Symp. on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.

[SMS06] M. F. Spear, V. J. Marathe, W. N. Scherer III, and M. L. Scott. Conflict Detection and Validation Strategies for Software Transactional Memory. In Proc. of the 20th Intl. Symp. on Distributed Computing, pages 179-193, Stockholm, Sweden, Sept. 2006.

[SMP08] Michael F. Spear, Maged M. Michael, and Christoph von Praun. RingSTM: Scalable Transactions with a Single Atomic Instruction. In Proc. of the 20th ACM Symp. on Parallelism in Algorithms and Architectures, pages 275-284, Munich, Germany, June 2008.

[SDM09] Michael F. Spear, Luke Dalessandro, Virendra Marathe, and Michael L. Scott. A Comprehensive Strategy for Contention Management in Software Transactional Memory. In Proc. of the 14th ACM Symp. on Principles and Practice of Parallel Programming, Mar. 2009.

[Spe09] Michael F. Spear. Fast Software Transactions. Ph.D. dissertation, Univ. of Rochester, June 2009.

[SMS09] Michael Spear, Maged Michael, Michael Scott, and Peng Wu. Reducing Memory Ordering Overheads in Software Transactional Memory. In Proc. of the Intl. Symp. on Code Generation and Optimization, Mar. 2009.

[Sta06] Standard Performance Evaluation Corporation. SPEC CPU2006 Benchmarks. Mar. 2006. Available at http://www.spec.org/cpu2006/.

[SCZ00] J. Gregory Steffan, Christopher Colohan, Antonia Zhai, and Todd Mowry. A Scalable Approach to Thread-Level Speculation. In Proc. of the 27th Intl. Symp. on Computer Architecture, Vancouver, BC, Canada, June 2000.

[SSH93] Janice M. Stone, Harold S. Stone, Philip Heidelberger, and John Turek. Multiple Reservations and the Oklahoma Update. IEEE Parallel and Distributed Technology, 1(4):58-71, Nov. 1993.

[SBL03] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Reliability of Commodity Operating Systems. In Proc. of the 19th ACM Symp. on Operating Systems Principles, Bolton Landing (Lake George), NY, Oct. 2003.

[TKS88] P. J. Teller, R. Kenner, and M. Snir. TLB Consistency on Highly-Parallel Shared-Memory Multiprocessors. In Proc. of the 21st Hawaii Intl. Conf. on System Sciences, pages 184-192, Kailua-Kona, HI, Jan. 1988.

[TPK09] Sasa Tomic, Cristian Perfumo, Chinmay Kulkarni, Adria Armejach, Adrian Cristal, Osman Unsal, Tim Harris, and Mateo Valero. EazyHTM: Eager-Lazy Hardware Transactional Memory. In Proc. of the 42nd Intl. Symp. on Microarchitecture, New York, NY, Dec. 2009.

[TrC08] M. Tremblay and S. Chaudhry. A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT. In Proc. of the Intl. Solid State Circuits Conf., San Francisco, CA, Feb. 2008.

[Tro10] Trollaxor. Firefox Has Too Many Developers. 2010. http://www.trollaxor.com/2009/12/firefox-has-too-many-developers.html.

[TXZ09] Joseph Tucek, Weiwei Xiong, and Yuanyuan Zhou. Efficient Online Validation with Delta Execution. In Proc. of the 14th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2009.

[TAC08] James Tuck, Wonsun Ahn, Luis Ceze, and Josep Torrellas. SoftSig: Software-Exposed Hardware Signatures for Code Analysis and Optimization. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Seattle, WA, Mar. 2008.

[VRS07] Guru Venkataramani, Brandyn Roemer, Yan Solihin, and Milos Prvulovic. MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging. In Proc. of the 13th Intl. Symp. on High Performance Computer Architecture, Feb. 2007.

[VWA09] Haris Volos, Adam Welc, Ali-Reza Adl-Tabatabai, Tatiana Shpeisman, Xinmin Tian, and Ravi Narayanaswamy. NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems. In Proc. of the 14th ACM Symp. on Principles and Practice of Parallel Programming, 2009.

[WCW07] Cheng Wang, Wei-Yu Chen, Youfeng Wu, Bratin Saha, and Ali-Reza Adl-Tabatabai. Code Generation and Optimization for Transactional Memory Constructs in an Unmanaged Language. In Proc. of the Intl. Symp. on Code Generation and Optimization, San Jose, CA, Mar. 2007.

[WAF07] Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, and Andreas Moshovos. Mechanisms for Store-Wait-Free Multiprocessors. In Proc. of the 34th Intl. Symp. on Computer Architecture, San Diego, CA, June 2007.

[WiS92] J. Wilkes and B. Sears. A Comparison of Protection Lookaside Buffers and the PA-RISC Protection Architecture. HPL-92-55, Hewlett Packard Laboratories, Mar. 1992.

[Wil92] P. R. Wilson. Pointer Swizzling at Page Fault Time: Efficiently and Compatibly Supporting Huge Address Spaces on Standard Hardware. In Proc. of the Intl. Workshop on Object Orientation in Operating Systems, pages 364-377, Paris, France, Sept. 1992.

[WCA02] Emmett Witchel, Josh Cates, and Krste Asanovic. Mondrian Memory Protection. In Proc. of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 304-316, San Jose, CA, Oct. 2002.

[WSC10] Benjamin P. Wood, Adrian Sampson, Luis Ceze, and Dan Grossman. Composable Specifications for Structured Shared-Memory Communication. In OOPSLA 2010 Conf. Proc., 2010.

[YBM07] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill, Michael M. Swift, and David A. Wood. LogTM-SE: Decoupling Hardware Transactional Memory from Caches. In Proc. of the 13th Intl. Symp. on High Performance Computer Architecture, Phoenix, AZ, Feb. 2007.

[YoL08] Richard M. Yoo and Hsien-Hsin S. Lee. Adaptive Transaction Scheduling for Transactional Memory Systems. In Proc. of the 20th ACM Symp. on Parallelism in Algorithms and Architectures, pages 169-178, Munich, Germany, June 2008.

[ZKD08] Nickolai Zeldovich, Hari Kannan, Michael Dalton, and Christos Kozyrakis. Hardware Enforcement of Application Security Policies. In Proc. of the 8th Symp. on Operating Systems Design and Implementation, Dec. 2008.

[ZLL04] Pin Zhou, Wei Liu, Fei Long, Shan Lu, Feng Qin, Yuanyuan Zhou, Sam Midkiff, and Josep Torrellas. AccMon: Automatically Detecting Memory-Related Bugs via Program Counter-based Invariants. In Proc. of the 37th Intl. Symp. on Microarchitecture, Dec. 2004.

[ZQL04] Pin Zhou, Feng Qin, Wei Liu, Yuanyuan Zhou, and Josep Torrellas. iWatcher: Efficient Architecture Support for Software Debugging. In Proc. of the 31st Intl. Symp. on Computer Architecture, München, Germany, June 2004.

[ZiB06] Craig Zilles and Lee Baugh. Extending Hardware Transactional Memory to Support Non-Busy Waiting and Non-Transactional Actions. In Proc. of the 1st ACM SIGPLAN Workshop on Transactional Computing, Ottawa, ON, Canada, June 2006.

[10a] The Rochester Software Transactional Memory Runtime. 2010. www.cs.rochester.edu/research/synchronization/rstm/.

[10b] Apache Project. http://April.apache.org/, 2010.

[10c] Cool Tools. http://cooltools.sunsource.net/, 2010.