Automatic memory management for flexible real-time systems

Sven Gestegård Robertz

Doctoral dissertation, 2006

Department of Computer Science

Lund University

ISBN 91–628–6829–2
ISSN 1404–1219
Dissertation 24, 2006
LU-CS-DISS:2006-01

Department of Computer Science
Lund University
Box 118
SE-221 00 Lund
Sweden

Typeset using LaTeX 2ε

Cover artwork by the author, inspired by Wadler [Wad76]

Printed in Sweden by Tryckeriet i E-huset, Lund, 2006

© 2006 by Sven Gestegård Robertz

Abstract

In a flexible real-time system, the constraints on available CPU time and memory lead to resource management problems, which must be handled carefully in order to maximize quality of service while avoiding overload. Managing CPU time, that is, scheduling, is well studied, and dynamic scheduling is widely accepted in the real-time industry. In order to make safe high-level languages, like Java, practically feasible for use in hard real-time systems, memory management, and particularly the dependencies between memory and CPU usage, must be studied.

The traditional approach to incremental GC scheduling, to perform garbage collection work in proportion to the amount of allocated memory, has drawbacks such as inconsistent utilization due to bursty allocations. To remedy this, time-triggered GC scheduling is proposed. It is shown that this strategy gives real-time performance that is equal to, or better than, that of an allocation-triggered GC. It is also shown that by using a deadline-based scheduler, the GC scheduling and, consequently, the real-time performance are independent of complex and error-prone work metrics.

Time-triggered GC also allows a more high-level view of GC scheduling, as the entire GC cycle is considered rather than each individual increment. This makes it possible to schedule GC as a normal task. As the scheduling parameters are explicit in the model, the time-triggered strategy is also well suited for auto-tuning and fits well into feedback scheduling systems.

A novel approach of applying priorities to memory allocation is introduced and it is shown how this can be used to enhance the robustness of real-time applications. The proposed mechanisms can also be used to increase performance of systems with automatic memory management by limiting the amount of garbage collection work.

Together, these solutions facilitate flexible and robust automatic memory management for real-time systems. Adaptive techniques are presented, aimed at replacing or complementing a priori analysis with on-line auto-tuning. The presented ideas have been successfully implemented and validated in an experimental real-time Java environment, supporting the claim that this work is a step towards "write once, run anywhere" with hard real-time performance.

ACKNOWLEDGMENTS

I wish to thank my supervisors, Boris Magnusson and Klas Nilsson, for their support and for giving me the freedom to choose and pursue research ideas and directions I have found interesting. I am also most grateful to my assistant supervisor Roger Henriksson, who introduced me to the field of real-time garbage collection. Görel Hedin, my supervisor in a previous project at the beginning of my graduate studies, taught me much about research and technical writing. Thank you all for your patient help, valuable input, and encouragement throughout this work.

The prototype implementations have been made in cooperation with other projects, and much of the experimental work would not have been possible without the assistance of others. Thanks to Anders Ive for his help with implementing some of these ideas in the IVM virtual machine, Anders Nilsson for help with the LJRT compiler, Mathias Haage for the virtual robots, and Anders Blomdell for expert help with the embedded PowerPC platforms and insightful comments on low-level run-time system implementation. The LJRT compiler was developed using the JastAdd compiler tools by Görel Hedin, Torbjörn Ekman, and Eva Magnusson.

I also wish to thank Anton Cervin and Dan Henriksson for valuable discussions on scheduling and control systems, Torbjörn Ekman and Patrik Persson for interesting and enjoyable discussions on programming languages and real-time systems development, Ulf Asklund for teaching me a great deal about configuration management, and Christian Andersson for assisting with LaTeX tips and tricks.

I am very grateful to Anne-Marie Westerberg, Lena Ohlsson, Anna Nilsson, Peter Möller, Lars Nilsson, Tomas Richter, Jakob Westerberg and Jonas Wisbrant for all their help with practical details; everything would have been much harder without you.

I thank everybody at the Department of Computer Science and LUCAS (Lund Center for Applied Software Research) for providing ideas, perspective, discussion, and good company.

The presented research has been a collaboration between the Department of Automatic Control and the Department of Computer Science at Lund University, and was carried out within the research project “Integrated Control and Scheduling”, for which Karl-Erik Årzén, Klas Nilsson, and Ola Dahl wrote the original proposal, and the “FLEXCON – Flexible Embedded Control Systems” research program. The work has been financially supported by ARTES (A network for Real-Time research and graduate Education in Sweden) and SSF (the Swedish Foundation for Strategic Research). The experiments have been carried out in cooperation with projects financed by VINNOVA (the Swedish Agency for Innovation Systems).

Finally, I would like to thank my friends, the members of the academic symphony orchestra, and all crazy people in spex and spaax for making my years as a student in Lund thoroughly enjoyable, my parents for always being there for me, and Kerstin for her love and support.

Lund, April 2006

Sven

CONTENTS

1 Introduction
    1.1 Resource-aware computing
    1.2 Memory management
    1.3 Problem statement
    1.4 About the thesis

2 Preliminaries
    2.1 Real-time systems
        2.1.1 Concurrent programming
        2.1.2 Timing requirements
        2.1.3 Control systems
        2.1.4 Predictability and scheduling
        2.1.5 Co-existence of hard and soft processes
        2.1.6 Feedback scheduling
    2.2 Embedded systems
        2.2.1 Safety and dependability
        2.2.2 Well-defined area of application
    2.3 Memory management
        2.3.1 Garbage collection
        2.3.2 Incremental and real-time GC
        2.3.3 Semi-concurrent GC scheduling
        2.3.4 Definitions
    2.4 Real-time Java for embedded systems
        2.4.1 Real-time virtual machine
        2.4.2 The Lund Java-based real-time platform
        2.4.3 Multi-stage deployment of control software

3 Time-triggered garbage collection scheduling
    3.1 Introduction
    3.2 GC cycle time calculation
    3.3 GC work calculation
        3.3.1 Traditional GC work metrics
        3.3.2 Using time as the GC work metric
    3.4 Scheduling
        3.4.1 Fixed priority scheduling
        3.4.2 EDF scheduling
    3.5 Summary

4 Adaptive garbage collection scheduling
    4.1 Introduction
    4.2 Automatic GC cycle time tuning
        4.2.1 Application-independent auto-tuning
        4.2.2 Using information about the application
        4.2.3 Estimating allocation rate
        4.2.4 Feed-forward from the application
    4.3 GC workload prediction
        4.3.1 Black box estimation
        4.3.2 Clear box prediction
        4.3.3 Conservative prediction
    4.4 Summary

5 Priorities for memory allocation
    5.1 Introduction
    5.2 Applying priorities to memory allocations
        5.2.1 Avoiding out-of-memory situations
        5.2.2 Improving performance by reducing GC work
    5.3 Non-critical allocations
        5.3.1 Non-critical allocation limit
        5.3.2 Fixed GC cycle length
    5.4 Detailed description
        5.4.1 Calculating the GC cycle length
        5.4.2 Live memory and floating garbage
        5.4.3 GC for the low priority processes
        5.4.4 Non-critical limit calculations in the real world
        5.4.5 Time-based GC scheduling
        5.4.6 Example
    5.5 Non-critical memory in Java
    5.6 Summary

6 Memory-aware feedback scheduling
    6.1 Introduction
    6.2 GC-aware period assignment
        6.2.1 Separate GC tuning and feedback scheduler
        6.2.2 Integrated GC and feedback scheduling
    6.3 Utilizing slack
    6.4 Controlling the allocation rate
    6.5 Summary

7 GC in an uncooperative environment
    7.1 Introduction
    7.2 Exact GC in an uncooperative environment
        7.2.1 Uncooperative compiler
        7.2.2 Uncooperative scheduler
    7.3 Garbage collector interface
    7.4 Performance issues
        7.4.1 Too frequent locking
        7.4.2 A read barrier requires locking
        7.4.3 Locking at method calls
        7.4.4 Effects on optimization
    7.5 Reducing the overhead
        7.5.1 Reducing the need for synchronization
        7.5.2 Reducing the cost of synchronization
        7.5.3 Compiler optimization effects
    7.6 Summary

8 Experiments
    8.1 Experiment platforms
    8.2 Time-triggered GC
    8.3 GC cycle time auto-tuning
    8.4 GC work prediction
    8.5 Priorities for memory allocation
        8.5.1 Avoiding out-of-memory situations
        8.5.2 Improving performance
    8.6 Feedback scheduling
    8.7 Performance evaluation
        8.7.1 Inlined overhead
        8.7.2 Latency and jitter
        8.7.3 Lazy locking

9 Future work
    9.1 Adaptive GC scheduling
    9.2 Priorities for memory allocation
        9.2.1 Configurable behaviour
        9.2.2 Non-critical memory using aspects
    9.3 GC scheduling interface
    9.4 Feedback scheduling and QoS
    9.5 Distributed hard real-time systems

10 Related Work
    10.1 Time-based garbage collection scheduling
    10.2 Adaptive GC scheduling
    10.3 Memory Management in Real-Time Java
    10.4 Soft references
    10.5 GC in an uncooperative environment
    10.6 Worst case and schedulability analysis

11 Conclusions
    11.1 Contributions
    11.2 Reflections

Bibliography

CHAPTER 1

INTRODUCTION

Today, computers are used as components in all kinds of systems and products, from industrial robots and cars to home appliances and toys, and functionality that has traditionally been implemented using mechanical systems or analog electronics now often includes a computer or even a network of computers. Such embedded systems typically need to interact with an external environment, and the dynamics of the environment give rise to timing constraints on program execution. Therefore, most embedded systems are also real-time systems, meaning that they have to react to external stimuli within a specified time.

In a system with timing requirements, all parts of the system, including the run-time system, must be implemented in a way that makes them temporally predictable. The overall goal of this thesis is to develop new techniques to improve the real-time properties of run-time systems for embedded and real-time applications, particularly with respect to memory management.

As the complexity of embedded software increases, so does the engineering and programming effort required. High-level programming languages provide programmer-friendly abstractions that hide many of the low-level details, notably memory management. Apart from making programming easier, this also improves safety and robustness, as some types of common programming errors are simply not possible to make at the abstracted level. The drawback is that, at the higher level of abstraction, the programmer no longer has full control of all low-level aspects of the system. When responsibility for handling low-level tasks is transferred from the programmer to the run-time system, predictability must not be lost, and engineering decisions traditionally expressed directly in code must be possible to express through, e.g., parameters to the run-time system.

1.1 Resource-aware computing

A fundamental property that makes embedded software different from computer programs in general is that an embedded application must, to a much higher degree, execute in a resource-constrained environment. Therefore, we will very briefly examine the issues associated with resource-constrained applications, to help put the presented work into context.

In general, resource constraints on computer systems can be considered to fall into five categories, of which the first two (and, in particular, the dependencies between them) and the last one are of primary interest in this thesis:

1. CPU time

2. Memory

3. Input/Output capabilities (I/O, networking, etc.)

4. External physical (Energy, power, space)

5. Engineering effort

The first four are physical constraints that obviously cannot be violated and therefore must be taken into account in the development of an application. The engineering effort required, on the other hand, addresses the economic aspects of the development; just as there are trade-offs between the physical constraints (e.g., using a faster CPU or larger memory may allow a more sophisticated algorithm to be used, but it comes at the cost of higher power consumption), programming in a high-level language may result in slightly less efficient code than hand-written assembly language, but the development time, and the number of errors, would most likely be much lower. Therefore, economic or time-to-market concerns may favour a high-level language even if it incurs, e.g., additional run-time overhead and higher hardware costs.

Resource constraints are extra-functional requirements, and should ideally be handled independently of the functional requirements. That is further emphasized by the increasing desire to use modular, or component-based, methods of development; when the individual modules or components are developed, it is not known how they will be composed into the final system, and therefore their implementation must not depend on such knowledge.

Instead, resources should be managed by a resource manager, a component of the run-time system. In the real-time community, the resource management problem applied to CPU time has been thoroughly studied and the theoretical foundation of on-line scheduling is well established. Recently, with the development of processors with variable clock frequency, the relation between CPU time and power consumption has been investigated. Memory management, on the other hand, has been regarded as part of the application, and has not been considered in this context. While this view is viable in traditional systems, the introduction of automatic memory management complicates the picture, as illustrated by the following sketch of past, present, and future software architectures.

Traditionally, a real-time system uses on-line process scheduling and static or manual memory management. Figure 1.1 illustrates how this architecture provides temporal isolation both between different applications and between applications and the run-time system; the only resource that is managed is CPU time, and the process scheduler has full control over the CPU time assignment. Thus, one application overrunning its designed execution time cannot cause a failure of another that is executing with higher priority on the same CPU.

Figure 1.1: Traditional model, with independent processes running on top of a run-time system. (Diagram labels: P1, P2, P3; Run-time system.)

Introducing automatic memory management, as a part of a safe language, complicates the picture, and the boundary between the application and the run-time system becomes unclear, as shown in Figure 1.2. There are multiple reasons for this. First, memory is a global resource, and without global management, over-use of memory in one part of the system may cause failure of another part¹. Secondly, most current real-time garbage collectors (RTGC) are intrusive, in the sense that they can interrupt the application threads at arbitrary times, causing latencies and jitter. Scheduling of garbage collection work also often by-passes the normal task scheduler, further complicating things. Finally, the CPU requirements of the GC depend heavily on the behaviour of the application, which complicates off-line schedulability analysis, as worst case analysis would require global analysis of both memory and CPU usage of both the applications and the GC.

¹ While the focus of this thesis is on memory management, similar problems arise from shared use of any global resource for which the system does not provide arbitration.

Figure 1.2: Introducing automatic memory management into a real-time system complicates the boundary between applications and run-time system. (Diagram labels: P1, P2, P3; Run-time system; Memory management.)

The problem of resource management is that the utilization of different resources cannot be handled independently, as performing a certain task typically requires simultaneous use of several resources. In order to get the clear separation of the traditional model, future run-time systems will need to include some sort of resource manager, as in Figure 1.3. With control over resource utilization, the dependencies between different resources can be taken into account, resulting in the desired isolation between applications and the run-time system. That is an emerging trend, and research on quality-of-service and quality-of-control has resulted in techniques and mechanisms for such isolation of applications with respect to CPU time and I/O capabilities. Investigating how memory management concerns can be incorporated into that picture is part of the motivation for the presented work.

1.2 Memory management

Memory management in real-time and embedded systems is still handled in a very conservative manner and, for reasons of safety and predictability, static memory management is often the technology of choice. However, as the complexity of embedded systems increases, static implementations become problematic; they are difficult to maintain and develop, as even minor changes may require major reorganization of the software, and resource utilization may be low. For those reasons, dynamic memory management becomes increasingly desirable. While more flexible than static memory management, manually managed dynamic memory introduces new problems of predictability, robustness, and maintainability, all of which are important properties of embedded systems. Many of these problems can be overcome by automatic memory management, or garbage collection (GC). With the advent of type-safe languages like Java on the real-time systems scene, it becomes increasingly important to develop reliable, predictable, and non-intrusive garbage collectors which are capable of meeting the memory allocation demands of our applications at all times. The garbage collector should also be transparent to the application developer and not require cumbersome manual tuning to be effective on any particular platform. This thesis proposes a new approach to GC scheduling aimed at meeting these demands.

Figure 1.3: A resource manager provides the desired isolation between applications and run-time system. (Diagram labels: P1, P2, P3; Run-time system; Resource manager.)

The focus of this thesis is on GC scheduling rather than algorithm design, and the fundamental idea is to let elapsed time, rather than performed allocations, determine when to run the garbage collector, using an approach called time-triggered garbage collection. Using either knowledge of the worst-case allocation need of the application or auto-tuning techniques, it is possible to calculate a deadline for when garbage collection must be completed and new memory made available for allocation. Having an explicit deadline for the GC cycle implies that it would be possible to schedule GC using standard scheduling techniques, such as rate monotonic or earliest deadline first scheduling. This thesis investigates the feasibility of such an approach.
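As a rough illustration of the deadline calculation described above, the following Python sketch assumes the simplest possible model: a fixed amount of free memory at the start of a GC cycle and a known upper bound on the allocation rate. The function name, its parameters, and the numbers in the usage lines are hypothetical and chosen for illustration only; the actual GC cycle time calculation is developed in Chapter 3.

```python
def gc_cycle_deadline(free_bytes: float, max_alloc_rate: float) -> float:
    """Latest completion time for a GC cycle, in seconds from its start.

    Assumption (a simplification made for this sketch): the application
    allocates at most max_alloc_rate bytes per second, and free_bytes
    bytes are available when the cycle starts. The cycle must finish
    before that memory is exhausted.
    """
    if free_bytes < 0:
        raise ValueError("free_bytes must be non-negative")
    if max_alloc_rate <= 0:
        # Without allocation the free memory is never exhausted.
        return float("inf")
    return free_bytes / max_alloc_rate


# Hypothetical numbers: 1 MB free, at most 250 kB allocated per second.
deadline = gc_cycle_deadline(1_000_000, 250_000)
print(f"GC cycle must complete within {deadline} s")
```

Under these assumptions, the deadline depends only on elapsed time and the allocation bound, not on when individual allocations happen, which is what allows the GC cycle to be handed to a deadline-based scheduler.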

Another area of growing research interest and recent development is that of handling non-determinism in real-time systems, and an approach that has been successful is feedback scheduling. By using feedback control, the period times of the processes are dynamically altered in order to keep the total CPU utilization at a safe level. This is particularly useful in control systems, where it is the resulting control performance, rather than real-time performance, that is the ultimate goal. By getting the process scheduler into the loop, this allows co-design of control and real-time systems. Furthermore, worst-case analysis is not always feasible, due to non-determinism in modern computers, lack of engineering resources, or simply because a design based on worst-case assumptions would be too pessimistic and therefore yield too low average resource utilization to be economically feasible. For these reasons, it is interesting to study adaptive memory management. This thesis presents two approaches aimed at enhancing the robustness of memory management for systems run in an unknown or changing environment.
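The feedback scheduling idea mentioned above, altering period times to keep total CPU utilization at a safe level, can be sketched as follows. This is a minimal illustration under assumed simplifications, not the scheduler used in this thesis: utilization is taken as the sum of execution time divided by period for each process, and all periods are stretched by a common factor when the utilization exceeds a setpoint. The function names and the example values are hypothetical.

```python
from typing import List, Sequence


def total_utilization(exec_times: Sequence[float],
                      periods: Sequence[float]) -> float:
    """CPU utilization: sum of execution time / period over all processes."""
    return sum(c / t for c, t in zip(exec_times, periods))


def rescale_periods(exec_times: Sequence[float],
                    periods: Sequence[float],
                    setpoint: float) -> List[float]:
    """Stretch all periods by one common factor if utilization is too high.

    Assumption: simple proportional rescaling; a real feedback scheduler
    would typically weigh processes by their effect on control performance.
    """
    if setpoint <= 0:
        raise ValueError("setpoint must be positive")
    u = total_utilization(exec_times, periods)
    if u <= setpoint:
        return list(periods)  # already at a safe level, leave unchanged
    factor = u / setpoint
    return [t * factor for t in periods]


# Hypothetical task set: execution times and periods in milliseconds.
exec_times = [1.0, 2.0]
periods = [4.0, 4.0]          # utilization 0.75
new_periods = rescale_periods(exec_times, periods, setpoint=0.5)
print(new_periods, total_utilization(exec_times, new_periods))
```

If garbage collection is scheduled as a task with its own period, it could in principle be included in such a calculation, which is the kind of integration studied in Chapter 6.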

1.3 Problem statement

This work comes from a practical engineering perspective and is aimed at developing techniques that facilitate the production of embedded and real-time systems without the rigorous analysis and huge engineering effort currently required to develop hard real-time systems. Two categories of problems are addressed. The first is adding flexibility to embedded systems without jeopardizing their real-time properties. The second is how to implement hard real-time garbage collection in an actual run-time system.

Adding flexibility to hard real-time systems

In this thesis, the focus is on memory management. Previous research on flexible real-time systems has focused on process scheduling, and little attention has been given to memory management issues and their impact on the real-time behaviour of a system. Also, while many of the problems are generic to all kinds of resource allocation, memory allocation differs from CPU allocation in a major way in that preemption is not possible². Therefore, running out of memory is likely to cause the entire system to fail, while requesting too high CPU utilization may cause some or all processes to miss deadlines, but the system may be able to continue executing with decreased performance.

² In systems with virtual memory, swapping and paging may be viewed as memory preemption, but this is uncommon in embedded systems as they typically lack secondary storage. Exceptions of course exist, for instance large embedded systems like ships and power plants. However, in such systems, virtual memory should not be used for the time-critical tasks as it reduces predictability. One method of ensuring this is to lock the pages used by hard real-time tasks into RAM in order to avoid page faults.

Let us start by making three observations on real-time and embedded systems. The first one is that the need for flexibility in hard real-time systems is increasing. Component-based software development helps facilitate code reuse and makes it possible to build systems quickly by composing and configuring components. While it is possible, in theory, to perform worst case and schedulability analysis on each configuration, constraints on the amount of available engineering resources may prohibit such analysis. Therefore, adaptive techniques like feedback scheduling are increasing in popularity, as they allow a system to adapt its resource utilization in order to keep the system from overload while still producing an acceptable quality of service.

Another technique that is gaining interest is dynamic reconfiguration and code exchange, where communicating devices may send pieces of code to each other in order to perform some cooperative task. In such a system, an introduction of a new device may cause pieces of code that were not part of the original design to be executed on other devices. This is facilitated on the programming language level by, e.g., dynamic loading of code, but the run-time system aspects need further studies. For instance, in an environment where code is dynamically loaded and replaced at run-time, static worst-case analysis (and scheduling based thereupon) is not possible. Yet, it is desirable to include such techniques in hard real-time systems.

The second observation is that not all hard real-time systems are safety critical. A system is a hard real-time system if it fails or suffers major performance degradation if deadlines are missed. But, for systems that have a safe failure mode, a small theoretical risk of failure may be acceptable if the probability is low enough. This is also motivated by the high cost of the engineering effort required to make absolute guarantees that a system will never fail.

The final observation is that a problem with the current methods for real-time systems development is the gap between theory and practice; the real-time theory requires hard worst case calculations in order to guarantee schedulability. However, it is very common to use measurements or “gut feeling” estimates rather than exact analysis to obtain the worst case memory and CPU requirements, and then the quality of the real-time guarantees is no better than that of the worst case estimates. For those reasons, it may be better, both in terms of development costs and run-time performance, to reserve the hard, a priori analysis based methods for the development of systems which are safety critical and to use adaptive techniques for systems which are not.

Motivated by these observations, the high level goal of this work is to develop techniques for implementing hard real-time run-time systems, particularly memory managers, that are independent of a priori analysis of the application. That is, if an application is schedulable, the run-time system should be able to guarantee that it will execute with real-time performance: write once, run anywhere for hard real-time systems.

Making hard real-time memory management feasible in practice

The second problem addressed in this work is that previous research on hard real-time garbage collection may not be directly applicable when implementing actual real-time systems.

The first issue is the metric used to measure garbage collection work. A good metric is essential both to schedulability analysis and to the actual scheduling at run-time. Unfortunately, in much of the existing literature, the problem is either neglected or the reasoning is done at too abstract a level to be practically applicable.

Secondly, non-intrusiveness is a fundamental requirement on a hard real-time garbage collector, as GC work must not cause processes to miss their deadlines. However, the common way of implementing real-time GC is to use an incremental garbage collector that performs small portions of work at each memory allocation, in line with the application processes, and previous research has often been content with showing that it is possible to find tight upper bounds on the length of each increment. That is not a good strategy if one wants to minimize latency and jitter due to garbage collection; even though each increment has a small upper bound, if a process makes many allocations the total delay caused by garbage collection will be large. Therefore, it is not enough to prove predictability; in actual product development it is equally important to have a scheduling model that allows maximum utilization of available resources.
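The accumulation effect described above is simple arithmetic, shown here as a small Python sketch. It assumes, for illustration only, that one GC increment runs in line with every allocation and that every increment takes its full upper bound; the function name and the 50 microsecond bound are hypothetical.

```python
def accumulated_gc_delay(allocations: int, increment_bound: float) -> float:
    """Worst-case total GC delay experienced by one process invocation.

    Assumption: one GC increment is performed in line with each
    allocation, and each increment takes at most increment_bound
    time units.
    """
    if allocations < 0 or increment_bound < 0:
        raise ValueError("arguments must be non-negative")
    return allocations * increment_bound


# Hypothetical bound: each increment takes at most 50 microseconds.
for n in (1, 20, 400):
    delay = accumulated_gc_delay(n, 50)
    print(f"{n} allocations -> up to {delay} microseconds of GC delay")
```

With these assumed numbers, a process making 400 allocations per invocation could be delayed by up to 20 ms, even though no single increment exceeds 50 microseconds. A tight per-increment bound therefore says little about the latency seen by an allocation-intensive process.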

Finally, previous real-time garbage collectors have required very fine-grained analysis in order to tune them to a particular application, and the run-time scheduling has been done at the individual increment level. This has made the utilization of real-time GC difficult and tedious, and the whole concept of automatic memory management in real-time systems has often been shunned.

This work is an attempt to provide a conceptual framework and techniques that are independent of the GC implementation and allow reasoning about garbage collection scheduling at a higher level, without abstracting away the difficulties. The goal is to make it possible to schedule garbage collection as any other task.

1.4 About the thesis

Outline

The rest of the thesis is organized as follows:

Chapter 2: Preliminaries describes the fundamental concepts of the areas of real-time computing and memory management and presents previous results on which this thesis is based.

Chapter 3: Time-triggered garbage collection introduces the idea of time-triggered garbage collection and discusses its impact in fixed-priority and earliest deadline first scheduled systems.

Chapter 4: Adaptive garbage collection scheduling discusses how a time-triggered garbage collector can be made auto-tuning and presents techniques for on-line estimation of the GC cycle length and the amount of work required to perform a GC cycle.

Chapter 5: Priorities for memory allocation presents a novel notion of applying priorities to memory allocations and shows how that can increase robustness and performance of real-time systems.

Chapter 6: Memory-aware feedback scheduling presents an approach to extending traditional feedback scheduling results to also incorporate the costs of memory management explicitly in the period time optimization.

Chapter 7: GC in an uncooperative environment describes the challenges faced when implementing accurate concurrent GC in an environment where one cannot rely on cooperation from the compiler back end or scheduler.

Chapter 8: Experiments presents experimental support for the proposed techniques.

Chapter 9: Future work outlines directions for future research and points out possible areas of application for the presented ideas.

Chapter 10: Related work relates the work presented in this thesis with previous results in the areas of garbage collection scheduling, memory management for real-time Java and worst case analysis.

Chapter 11: Conclusions summarizes and discusses the contributions of this thesis.


Publications

This thesis is largely based on papers published, or under submission. The ideas of time-triggered GC and on-line estimation of GC scheduling parameters of Chapter 3 and Chapter 4 include results from

Sven Gestegard Robertz and Roger Henriksson, Time-Triggered Garbage Collection: Robust and Adaptive Real-Time GC Scheduling for Embedded Systems [RH03], in Proceedings of the ACM SIGPLAN Languages, Compilers, and Tools for Embedded Systems – 2003 (LCTES’03).

Chapter 5 and the corresponding experiments were published as

Sven Gestegard Robertz, Applying Priorities to Memory Allocation [Rob02], in Proceedings of the 2002 International Symposium on Memory Management (ISMM’02).

Chapter 6 is based on

Sven Gestegard Robertz, Dan Henriksson, and Anton Cervin, Memory-aware feedback scheduling, to be submitted.

Chapter 7 includes results presented in

Anders Nilsson and Sven Gestegard Robertz, On real-time performance of ahead-of-time compiled Java [NR05], in Proceedings of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC’05).

The prototype implementations used in the experimental verification are closely related to the development of the garbage collector interface which was presented in

Anders Ive, Anders Blomdell, Torbjorn Ekman, Roger Henriksson, Anders Nilsson, Klas Nilsson and Sven Gestegard Robertz, Garbage Collector Interface [IBE+02], in Proceedings of NWPER’02.

There is also a close relation between the development of control application prototypes and the Lund Java-Based Real-Time (LJRT) platform, which is described in

Anders Nilsson and Sven Gestegard Robertz, LJRT compiler reference manual, Department of Computer Science, Lund University, 2005–2006.

The LJRT platform is also central to the ideas presented in

Sven Gestegard Robertz, Anders Nilsson, Klas Nilsson and Mathias Haage, Multi-stage deployment of robot control software [RNNH06], to appear in Proceedings of the 8th International IFAC Symposium on Robot Control, SYROCO, September 2006.

CHAPTER 2

PRELIMINARIES

This chapter briefly presents the fundamental concepts of real-time and embedded systems, scheduling and memory management. Previous research in the fields of scheduling and automatic memory management for real-time systems, which forms a base for the remainder of this thesis, is presented and discussed.

2.1 Real-time systems

In order to understand the problems and challenges associated with memory management in real-time systems, we will now review the fundamental properties of systems with timing requirements. We will discuss what defines a real-time system, how timing requirements arise and are classified, how a set of processes may share a single CPU while still performing in a timely manner and how processes with firm timing requirements can co-exist with non real-time processes.

For any computer program, its task is, generally speaking, to produce some output based on its input values. The fundamental definition of correctness is that the program produces the right output for any valid input values. However, for many systems, typically those that interact with an external environment in some way, that is not enough. In addition to producing the right output, the definition of correctness is strengthened to also require that such a system produces the output before a given time, the deadline. Such systems are called real-time systems and typical examples are found in the areas of automatic control, communications, audio/video, interactive computer programs, etc.


2.1.1 Concurrent programming

Concurrent programming is the common name for the techniques used to allow many processes to execute in parallel on the same computer, either truly in parallel on a multi-CPU machine, or virtually so by time-sharing on a single CPU, in a consistent way. Important problems are how to deal with communication and synchronization between parallel activities. There are several reasons for using concurrent programming, but for our purposes, the most important one is to model parallelism in the external environment. A program in a control system may need to react both to occurrences of certain events and perform operations at certain times. If the different events are independent of each other and of the passage of time, there are parallel activities in the controlled system, and therefore it is convenient to have the same parallelism in the software. We will not go into any details on the different problems in concurrent programming or their solutions. For now it will suffice to state that concurrent and real-time programming are very tightly related (and sometimes even used synonymously); virtually any real-time or embedded system will consist of many processes, either cooperating or independent.

If there are more parallel processes than there are processors, as when running multiple processes on a single-CPU machine, not all processes can execute simultaneously with true parallelism. Therefore, some form of time-sharing mechanism must be used to allow them to seemingly execute in parallel, by switching back and forth between processes, interleaving their execution. Time-sharing can be implemented either explicitly in the processes, e.g. using co-routines [Knu73], or by the run-time system or OS. In the latter case, a special piece of software called the process scheduler is responsible for selecting which process gets to execute next.

Normally, the scheduler runs periodically, and at each invocation it suspends, or preempts, the running process and selects another process, which is then allowed to execute until the next scheduler period. This model is called time-slicing. How processes are scheduled obviously affects their temporal behaviour, and different scheduling techniques are presented in Section 2.1.4.

2.1.2 Timing requirements

The term real-time systems represents a wide range of applications with widely varying timing requirements, and the consequences of failing to meet deadlines also range from minor inconveniences to total failures. Computer systems can be categorized based on their real-time requirements and a brief overview of the taxonomy is given here.

Batch systems

Most computer programs do not have any real-time requirements other than that it, naturally, is desirable that the result is produced as quickly as possible in order to make the program practically usable. Examples of such programs are compilers, mathematical programs, etc. Such programs are called batch systems, as they typically take a batch of input, perform some processing, and output the result. In batch systems, the correctness of the system is completely independent of the time it takes to produce the output.

Interactive systems

The next class of systems are systems where a human user interacts with the system in the sense that the user gives a command, the system processes it and presents the result, the user issues another command, and so on. Typical examples are window systems, word processors, and other desktop applications. Here, the response time of the system must not be too long for the interaction to work well. If the system takes seconds or more to respond to commands, the user tends to be annoyed, but as long as the response times are of the same order as the human response time, typically one or two tenths of a second, the system is perceived to react instantly, and delays up to half a second are usually tolerable. Therefore, while interactive systems have some degree of real-time requirements, these are quite relaxed, and the consequences of excessive delays are merely an inconvenience.

Real-time systems

Computer systems that interact with external electrical or mechanical devices or communicate via some shared medium typically have tighter timing requirements. The term real-time systems is used to denote systems where timeliness is required for correct operation.

Systems that need to meet deadlines in order to function correctly, but where a failure to do so only causes a temporary decrease in the quality of service and does not cause the whole system to fail, are called soft real-time systems. One example is audio/video systems, where a missed deadline causes a glitch but the playback still continues. Another example is embedded systems, e.g., a computer controlling the electric windows or the cabin lighting in a car, where occasional small delays will not have any severe consequences.

If missing a deadline may cause the whole system to fail, we have a hard real-time system. Continuing the car example, the engine control system is a hard real-time system, as it is critical to the operation of the engine that the fuel injection and ignition are performed at exactly the right time.

It is common that embedded systems consist of both hard and soft real-time tasks, and then techniques such as priority-based scheduling are used to guarantee that the hard tasks always get the resources they need, possibly at the expense of the soft tasks.

Specifying temporal behaviour

Having defined a real-time system as a system that must react in a timely manner, we will now discuss how timeliness can be parametrized and measured. The real-time behaviour of a process can be specified as a tuple $(\mathcal{R}, \mathcal{C}, \mathcal{D})$, where $\mathcal{R}$ is the set of release times, $\mathcal{C}$ is the set of execution times, and $\mathcal{D}$ the set of deadlines. For a periodic process with start time $t_0$ and period time T, a constant execution time C and deadline D, we get

\begin{align*}
\mathcal{R} &= \{R_k\} = \{t_0 + kT\}; \quad k \ge 0 \\
\mathcal{C} &= C \\
\mathcal{D} &= \{D_k\} = \{t_0 + kT + D\}; \quad k \ge 0 \tag{2.1}
\end{align*}

which is the form we will assume if nothing else is stated.

In addition to the fundamental real-time requirement, that a process always finishes before its deadline, there are other aspects that are of interest when specifying real-time systems, namely latency, response time, and jitter. When a process is released, it may not always start executing directly; for instance, if another process is executing it will have to wait until that process has finished. Therefore, the actual invocation time will be some time after the release, and the difference is called latency. The response time of a process is defined as the time from release to finish. These definitions are illustrated in Figure 2.1.

Figure 2.1: Definitions of real-time parameters. $t_1$ is the release time of the process, at $t_2$ the process is invoked, and at $t_3$ the process has finished its execution and sleeps until its next release. The time from release to invocation is called latency, and the time from release to finish is called response time.

Finally, the variations in these quantities from one period to another, or jitter, may be important. Figure 2.2 shows an example of jitter in both latency and response time. Process 1 is periodic with period time T, and has the release times $\{t_0 + nT\}$, $n \ge 0$. However, in the example, the process actually starts executing at $t_0$, $t_0 + T + L_2$ and $t_0 + 2T + L_3$, and the execution is not exactly periodic. If we define latency jitter as the difference between the minimum and maximum latency, in this case we get the maximum jitter $\Delta L_{max} = L_2 - 0 = L_2$. In the example, there is also jitter in the response time, defined analogously.
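As a small illustration of these definitions, the following Python sketch computes latency, response time and jitter from a recorded trace. The `Invocation` record, the function names and the numbers in the trace are assumptions made for this example; they are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Invocation:
    release: float  # time at which the process is released
    start: float    # time at which it actually starts executing
    finish: float   # time at which it has finished executing


def latency(inv: Invocation) -> float:
    """Time from release to invocation."""
    return inv.start - inv.release


def response_time(inv: Invocation) -> float:
    """Time from release to finish."""
    return inv.finish - inv.release


def jitter(values: Sequence[float]) -> float:
    """Jitter, taken as the difference between maximum and minimum."""
    return max(values) - min(values)


# Hypothetical trace with period T = 10 and latencies 0, L2 = 3 and L3 = 1.
trace = [
    Invocation(release=0.0, start=0.0, finish=2.0),
    Invocation(release=10.0, start=13.0, finish=15.0),
    Invocation(release=20.0, start=21.0, finish=23.0),
]

latency_jitter = jitter([latency(inv) for inv in trace])
response_jitter = jitter([response_time(inv) for inv in trace])
print(latency_jitter, response_jitter)
```

In this made-up trace the minimum latency is zero, so the latency jitter equals the largest latency, in the same way as in the example above.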

It should be stressed that real-time systems are a very heterogeneous class of systems, and there are large variations in which aspects are important. For instance, one application may be very sensitive to jitter whereas in another, only the response time is important. Also, there are vast differences in time-scale. A typical video application has a sampling rate (frame rate) in the range of 25–100 Hz, while control applications may have sampling rates of tens of kilohertz. Therefore, the fields of real-time systems research and engineering are also quite heterogeneous, as the different requirements will give rise to different technical solutions.

2.1.3 Control systems

A large class of embedded systems are control systems [AW97], where the task of the computer is to control the behaviour of some external, physical, process. When implementing a controller for a continuous-time process on a computer, the process must be sampled, and most of the control theory assumes that the samples are taken at a constant rate, i.e. periodic sampling. Therefore, it is common to implement controllers as periodic processes.


Figure 2.2: Example of latency and response time jitter. Process 1 is periodic with time T, Process 2 is sporadic and has higher priority. In this figure, R denotes response time and not release time.

In controller design, it is common to discretize the continuous-time system under the assumptions of periodic sampling and constant input-output delay. Any jitter, in the sampling or output times, will cause errors in the linearized model, which in turn leads to degraded performance of the controller. Note, however, that it is not the jitter or delays in the computer task, but in the sampling and control action, that have an impact on control performance.

In analogy with the previously defined real-time parameters, for control systems we add corresponding quantities directly related to the control task. Figure 2.3 shows the execution of one invocation of a controller. In addition to the latency and response time, the control counterparts are defined. Figure 2.4 illustrates how the interactions between processes affect both latency and response time. Controller 2 is executing when Controller 1 is released, which causes latency to Controller 1. Controller 2 has higher priority than Controller 1 and is therefore allowed to preempt Controller 1, which increases the response time of Controller 1.

This illustrates how timing requirements on the controller process come from the assumptions made in the controller design. For instance, if a small sample delay is desired, the latency of the control task must be small. The integration of control design and real-time scheduling has been studied [Cer03], tools for analysis of the effects of varying delays have been developed [LC02], and there are theoretical results for taking jitter into account when doing control design [CLE+04].


Figure 2.3: Definitions of timing parameters for the control task. At $t_1$ the controller is released, at $t_2$ it is invoked, at $t_3$ the input signal is sampled, at $t_4$ the new control signal is output and at $t_5$ the process has finished its execution.

Figure 2.4: Interference from other processes affects real-time behaviour. Controller 1 is released at $t_1$, but does not get to execute until $t_2$, when Controller 2 has finished. Then, at $t_4$, Controller 1 is preempted by Controller 2 and is suspended until $t_5$.


2.1.4 Predictability and scheduling

A key attribute of proper real-time systems is predictability; if we want to make real-time guarantees, we must know how long each task may take to execute in the worst case, the worst case execution time (WCET). This is one big difference between interactive and real-time systems; in an interactive system, it is the average case performance that usually is the most interesting, as the worst case typically is quite unlikely to occur and it is possible to achieve much better performance on a given platform by disregarding the worst case and optimizing for the common case.

In real-time systems, on the other hand, predictability is paramount as the system must not fail even in the unlikely event that the worst case does occur. Therefore, in hard real-time systems it is often necessary to trade off performance for predictability; in the average case we may have a low CPU utilization in order to guarantee that there will be enough CPU time for every process in the worst case.

In order to meet these requirements on predictability, it is necessary to perform worst case analysis on execution time and memory usage and, based on this, do a priori schedulability analysis: a theoretical analysis aimed at determining whether it can be guaranteed that a given set of processes can always be scheduled so that they meet their deadlines under a given scheduling model. This is a well understood area and the theoretical foundation is well built.

The scheduling problem is, simply put, this: Given a set of processes that should execute on a shared processor, find an execution order that ensures that all processes meet their deadlines. This can be done in a number of ways. The oldest, which is still widely used in safety-critical systems, is static cyclic scheduling; the CPU time is divided into time slots and then each process invocation is statically assigned to a particular time slot. The run-time scheduling is simple; the processes of each time slot are executed in due order and when the end of the schedule is reached, execution is restarted from the top. As both execution and communication are statically scheduled, it is easy to verify that a schedule will work. The drawback is that it may be difficult to create the schedule and small changes to the processes may require that a whole new schedule is created from scratch. Also, a static schedule may result in low CPU utilization since the execution times of the different tasks are not equal and therefore, there will often be unused time in some of the time slots. If the execution times of the tasks are not constant, the length of the time slots has to be long enough to accommodate the worst case execution time, as tasks may not overrun their time slot. This further decreases the maximum safe CPU utilization.
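The verification step for a static cyclic schedule can be sketched in a few lines of Python. The schedule layout (a list of slots, each listing the processes that run in it), the process names and the WCET values are hypothetical and serve only to illustrate why unused slot time lowers the achievable utilization.

```python
def overrunning_slots(schedule, wcet, slot_length):
    """Return the indices of slots whose worst case demand exceeds the slot."""
    return [
        index
        for index, slot in enumerate(schedule)
        if sum(wcet[name] for name in slot) > slot_length
    ]


def worst_case_utilization(schedule, wcet, slot_length):
    """Fraction of the complete schedule that is used in the worst case."""
    busy = sum(wcet[name] for slot in schedule for name in slot)
    return busy / (slot_length * len(schedule))


# Hypothetical example: four slots of 10 time units each.
# Process A runs in every slot, B and C run once per schedule.
wcet = {"A": 3, "B": 5, "C": 6}
schedule = [["A", "B"], ["A"], ["A", "C"], ["A"]]

print(overrunning_slots(schedule, wcet, slot_length=10))
print(worst_case_utilization(schedule, wcet, slot_length=10))
```

Shortening the slots raises the utilization figure, but only until some slot can no longer hold the worst case demand of the processes assigned to it.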


An alternative scheduling strategy, which adds more flexibility and transfers the low-level scheduling decisions from the programmer to the run-time system, is dynamic scheduling¹; the process scheduler dynamically selects which process should be allowed to execute at any given instant based on whether that process has work to perform and on its relative importance compared to other processes in the system. The rest of this thesis will assume dynamic scheduling and now a brief introduction to various scheduling algorithms will be given.

Fixed priority scheduling

In a fixed-priority scheduler, a priority value is assigned to each process. If more than one process is ready to execute, the scheduler always gives precedence to the process with the highest priority. Usually, the scheduler also allows preemption, i.e., if a process is executing when another process with higher priority becomes ready, the lower priority process will be suspended in order to allow the higher priority process to execute without delay.

With fixed priority scheduling, it is usually not possible to have 100% processor utilization without missing deadlines. However, due to the strict priorities, such overload is handled in a way that lets the high priority processes continue executing unaffected while those with low priorities are delayed. In cases of severe overload, the low priority processes may not get any CPU time at all. This is called starvation.

A problem with fixed priority scheduling is how to assign priorities to processes. The most common approach is Rate Monotonic Scheduling (RMS), which says that the shorter the period time a process has, the higher its priority should be. If priorities are assigned in this way, standard methods for schedulability analysis exist.

¹ A note on terminology: Here, static and dynamic are used with respect to the schedule itself. Other terms are off-line and on-line scheduling. This should not be confused with the taxonomy used by Liu and Layland [LL73]. They discuss the distinction between static and dynamic scheduling algorithms based on whether priorities are fixed or may change during execution. In that context RMS and DMS are static algorithms, whereas EDF is dynamic.

A system is schedulable if all processes are guaranteed to meet their deadlines, i.e. that their worst case response time is less than the deadline. The fundamental result in fixed-priority scheduling is that if all processes are released at the same time (known as a critical instant), the system is schedulable if all processes finish before their deadline. RMS is optimal in the sense that if a set of processes is not schedulable with priorities assigned according to RMS, it will not be schedulable for any other set of priorities. It can be proved that for n independent, periodic processes, with execution time $C_i$ and period time $T_i$, an RMS system is guaranteed to be schedulable if

\[
\sum_{i=1}^{n} \frac{C_i}{T_i} < n\left(2^{1/n} - 1\right) \tag{2.2}
\]

From this, it follows that, for an arbitrary number of processes, an RMS system is schedulable if the total CPU utilization is less than 69% [LL73]. The assumptions in this result are quite restrictive and not directly applicable for exact analysis in practice. It is still, however, a good rule of thumb to be used as a starting point.
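The bound in (2.2) is easy to evaluate numerically. The Python sketch below computes it for a few values of n and applies it to a task set given as (C, T) pairs; the function names and the example task sets are assumptions made for illustration. Since the test is sufficient but not necessary, a task set that fails it is not necessarily unschedulable.

```python
import math


def rms_utilization_bound(n: int) -> float:
    """Right-hand side of (2.2) for n processes."""
    return n * (2 ** (1.0 / n) - 1)


def passes_rms_test(tasks) -> bool:
    """tasks: list of (C, T) pairs. True if the sufficient test (2.2) holds."""
    utilization = sum(c / t for c, t in tasks)
    return utilization < rms_utilization_bound(len(tasks))


for n in (1, 2, 3, 10, 1000):
    print(n, round(rms_utilization_bound(n), 4))

# The bound decreases towards ln 2 (about 0.693) as n grows.
print(round(math.log(2), 4))

# Hypothetical task sets, given as (execution time, period time).
print(passes_rms_test([(1, 4), (2, 8), (1, 10)]))   # utilization 0.60
print(passes_rms_test([(2, 4), (2, 8), (2, 10)]))   # utilization 0.95
```

The limit value ln 2 is the origin of the 69% figure quoted above.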

In the rate monotonic case, the deadline is assumed to be equal to the period time. For tasks where the response time is important, it is common to have a deadline that is shorter than the period time. In that case, priorities may be assigned based on their deadlines rather than their period times: deadline monotonic scheduling (DMS). If D = T, DMS and RMS are obviously equivalent. If D < T it has been proved that DMS is optimal, in the above sense.

For detailed analysis of real systems, the assumptions must be relaxed. In particular, processes are seldom unrelated; either they cooperate in order to perform a common task, or they compete for some common resources. In both cases, they may interfere with each other in a way that breaks the assumptions behind (2.2). This is dealt with in the generalized RMS [SRL94]. The schedulability criterion is the same, all processes should have a worst case response time shorter than their deadline, but the response time calculations are extended to take blocking, while waiting for shared resources, etc., into account.
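A response time calculation of this kind can be sketched with the standard fixed-point recurrence from the fixed-priority scheduling literature, R = C + B + Σ ⌈R/T_j⌉ C_j, where the sum runs over all processes with higher priority and B is a blocking term. The thesis text does not spell out the recurrence, so its use here, the function name and the task parameters are assumptions made for illustration.

```python
import math


def worst_case_response_time(index, tasks, blocking=0):
    """Worst case response time of tasks[index], or None if it exceeds D.

    tasks: list of (C, T, D) tuples ordered by priority, highest first.
    blocking: worst case time spent waiting for lower priority processes.
    """
    c_i, _, d_i = tasks[index]
    higher = tasks[:index]
    r = c_i + blocking
    while r <= d_i:
        # Interference from every release of a higher priority process
        # that can occur within a window of length r.
        interference = sum(math.ceil(r / t_j) * c_j for c_j, t_j, _ in higher)
        r_next = c_i + blocking + interference
        if r_next == r:
            return r  # fixed point reached
        r = r_next
    return None  # the deadline is exceeded


# Hypothetical task set (C, T, D) with rate monotonic priority order.
tasks = [(1, 4, 4), (2, 6, 6), (3, 12, 12)]
for i in range(len(tasks)):
    print(i, worst_case_response_time(i, tasks))
print(worst_case_response_time(2, tasks, blocking=3))
```

The utilization of this task set is about 0.83, which is above the bound of (2.2) for three processes, yet the calculation finds a response time within the deadline for every process when there is no blocking. This illustrates that the utilization test is only a sufficient condition.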

Earliest deadline first scheduling

Another approach to dynamic scheduling is deadline driven scheduling [LL73], also known as earliest deadline first (EDF). Here, instead of assigning fixed priorities to processes, the scheduling is done based directly on the deadlines of processes; the process with the shortest time left to its deadline is scheduled to run. Thus, this strategy requires no scheduling decisions, other than the deadline assignment, to be made by the developer; the translation from timing requirements to priorities is done by the scheduler, at run-time.

An interesting property of EDF scheduling is that 100% CPU utilization is possible and, thus, EDF scheduling is optimal in the sense that if the system is not schedulable using EDF, it will not be schedulable using any other scheduling strategy. However, the handling of overload is drastically different from a fixed priority scheduler; in an EDF system, if the requested CPU utilization is greater than 100%, all processes will miss their deadlines. In effect, the period times will be scaled so that the CPU utilization is 100% and this may be fatal to processes with hard deadlines.
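Both the selection rule and the utilization condition are short enough to state as code. The sketch below assumes independent periodic processes with deadlines equal to their period times, for which a total utilization of at most 100% is the classical EDF condition from [LL73]. The function names and the job representation are hypothetical.

```python
def edf_schedulable(tasks) -> bool:
    """tasks: list of (C, T) pairs with deadline equal to period time."""
    return sum(c / t for c, t in tasks) <= 1.0


def edf_pick(ready, now):
    """Select the ready job with the shortest time left to its deadline.

    ready: list of (name, absolute_deadline) pairs; returns None if empty.
    """
    if not ready:
        return None
    return min(ready, key=lambda job: job[1] - now)


print(edf_schedulable([(2, 4), (2, 8), (2, 8)]))    # utilization exactly 1.0
print(edf_schedulable([(3, 4), (2, 8), (1, 10)]))   # utilization 1.1
print(edf_pick([("a", 30), ("b", 12), ("c", 20)], now=10))
```

Note that the selection changes as deadlines approach, which is why no fixed priority order exists among the processes.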

It should be noted that, as there are no strict priorities in an EDF system, it may cause more jitter to high frequency tasks, compared to RMS; the fast tasks may occasionally be preempted by slower tasks that are closer to their deadline. This, however, only occurs in a heavily loaded system.

2.1.5 Co-existence of hard and soft processes

An important property in real-time and safety-critical systems is isolation between processes. In management of global resources, it is desirable to have a model that ensures that an overrun or violation of design-time assumptions in one part of the system cannot cause a shortage of resources in other parts of the system. Applied to scheduling, that means that, in a system with independent processes, if one process overruns its allocated execution time, it should not be allowed to “steal” CPU time from other processes.

In order to overcome the problems with handling overload, especially in EDF scheduled systems, techniques have been developed for letting hard real-time processes run with guaranteed deadlines while processes with soft or no deadlines may be delayed in order to keep the total CPU utilization at a safe level.

Constant bandwidth servers

One approach to handling the problem with running both deterministic and non-deterministic processes on the same processor using EDF scheduling is the constant bandwidth server (CBS) model [AB98]. For each process or group of processes, a limit on the maximum fraction of the CPU time, the CPU bandwidth, is assigned and this is enforced by the scheduler: If a server has used up its CPU quota in the current period it is blocked until the next CBS period. A set of constant bandwidth servers running on a single CPU can be viewed as if each process were running on a dedicated CPU with a given fraction of the original CPU speed. The CBS model combines the advantages of fixed priority and EDF scheduling; it is possible to guarantee that the hard real-time processes always meet their deadlines by isolating them from non-deterministic processes while still allowing 100% CPU utilization.
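The budget enforcement described above can be sketched as follows. This is a deliberately simplified model that follows the description in the text (a quota per period, blocking once the quota is used up); it is not the complete CBS algorithm of [AB98], and the class name, the method names and the assumption that granted time is consumed immediately are all made up for this example.

```python
class BandwidthServer:
    """Simplified budget enforcement for one server (illustration only)."""

    def __init__(self, budget, period):
        self.budget = budget        # CPU quota per server period
        self.period = period        # length of the server period
        self.remaining = budget     # quota left in the current period
        self.period_start = 0.0

    @property
    def bandwidth(self):
        """Maximum fraction of the CPU time available to the server."""
        return self.budget / self.period

    def _replenish(self, now):
        # Move to the period containing `now` and restore the full quota.
        elapsed = int((now - self.period_start) // self.period)
        if elapsed > 0:
            self.period_start += elapsed * self.period
            self.remaining = self.budget

    def request(self, now, demand):
        """Return how much of `demand` may execute at time `now`."""
        self._replenish(now)
        granted = min(demand, self.remaining)
        self.remaining -= granted
        return granted


server = BandwidthServer(budget=2, period=10)
print(server.bandwidth)
print(server.request(0, 1.5))    # within the quota
print(server.request(3, 1.5))    # only the rest of the quota is granted
print(server.request(5, 1.0))    # quota exhausted: blocked
print(server.request(10, 1.0))   # next period: quota is available again
```

However much the served process demands, it never receives more than its quota in any period, which is what isolates the other processes from its overruns.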

The control server model

The control server model [CE03] is an attempt to combine the predictability of static scheduling with the flexibility of dynamic scheduling aimed at ensuring jitter-free execution of control tasks. The basic idea is to use static scheduling for input and output while the computations in between are scheduled dynamically, with EDF. Isolation of tasks is provided through a mechanism similar to CBS.

The control server model was designed to facilitate component-based development of control systems. The characteristic property of the model is that all parameters affecting control performance (latency, response time, period time, etc.) are linear in CPU utilization. This enhances composability, as each component only has one knob, CPU utilization, that needs to be tuned when the system is assembled.

2.1.6 Feedback scheduling

Another approach to handling non-determinism is based on the observation that the main goal is to optimize the resulting quality of service rather than some aspect of scheduling, such as minimizing the number of missed deadlines. By using feedback control, the scheduling parameters are automatically adjusted at run-time in order to keep the CPU utilization at a safe level while optimizing the quality of service of the application. This is called feedback scheduling [AP00, Cer03, CEBA02]. One area where this approach is useful is control systems, where it has been shown that the total quality of control can be dramatically increased if the real-time requirements are relaxed.

Figure 2.5 shows the structure of a basic feedback scheduler. A set of tasks generate jobs that are passed to a run-time dispatcher. The execution times of the jobs and the total CPU utilization, U, are measured. Based on this, the scheduler adjusts the period times of the tasks, $T_i$, in order to keep the CPU utilization at the set-point, $U_{sp}$.
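One simple way to realize such an adjustment is to scale all period times by the same factor, U divided by the set-point. The sketch below assumes this uniform rescaling and uses measured execution times directly, without any filtering; the function name and the numbers are hypothetical, and a practical feedback scheduler would typically be more elaborate.

```python
def rescale_periods(periods, exec_times, u_setpoint):
    """Scale all period times so that the utilization equals the set-point."""
    utilization = sum(c / t for c, t in zip(exec_times, periods))
    scale = utilization / u_setpoint
    return [t * scale for t in periods]


# Hypothetical measurements: two tasks using 0.4 + 0.3 = 0.7 of the CPU.
periods = [10.0, 20.0]
exec_times = [4.0, 6.0]

new_periods = rescale_periods(periods, exec_times, u_setpoint=0.5)
new_utilization = sum(c / t for c, t in zip(exec_times, new_periods))
print(new_periods, new_utilization)
```

Because every period time is multiplied by the same factor, the relative rates of the tasks are preserved. The same rule shortens the period times when the measured utilization is below the set-point.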

If a system contains both hard and soft real-time tasks, it is reasonable that the CPU utilization of the soft processes should be decreased more than that of the hard processes. This can be done by using elastic scheduling [BLA02], where a stiffness value is assigned to each process and the scaling of period times is done in proportion to that value.

Figure 2.5: The structure of a basic feedback scheduler. The scheduler measures the execution time of each process, $C_i$, and the total CPU utilization, U. The period times of the tasks, $T_i$, are scaled to achieve the setpoint utilization, $U_{sp}$.

The general period assignment problem can be expressed as follows. A set of n tasks, $T_i$, $i \in \{1 \ldots n\}$, with execution time $C_i$, an adjustable period $h_i$, and a cost function $J_i(h)$ share the same computer. The task of the feedback scheduler is to assign new sampling intervals $h_1 \ldots h_n$ so that the total cost is minimized and the total CPU utilization is kept below a set-point, $U_{sp}$. This is formulated as the optimization problem

\[
\min_{h_1 \ldots h_n} \sum_{i=1}^{n} J_i(h_i)
\quad \text{subject to} \quad
\sum_{i=1}^{n} \frac{C_i}{h_i} \le U_{sp} \tag{2.3}
\]

In that formulation, the cost only depends on the sampling rate. More elaborate models, where also the state of the plant is taken into account, have been developed [HC05].
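To make (2.3) concrete, consider the special case where every cost function is linear in the period, $J_i(h_i) = \alpha_i + \beta_i h_i$ with $\beta_i > 0$. This linearity is an assumption made here for illustration and is not stated in the text. Setting the derivative of the Lagrangian to zero then gives the closed-form solution $h_i = \sqrt{C_i/\beta_i} \, \sum_j \sqrt{C_j \beta_j} / U_{sp}$, with the utilization constraint active. The Python sketch below evaluates this expression; the function name and the numbers are hypothetical.

```python
import math


def optimal_periods(exec_times, slopes, u_setpoint):
    """Periods minimizing sum(beta_i * h_i) subject to sum(C_i / h_i) <= U_sp.

    Assumes linear cost functions J_i(h) = alpha_i + beta_i * h with
    positive slopes beta_i; the constants alpha_i do not affect the result.
    """
    total = sum(math.sqrt(c * b) for c, b in zip(exec_times, slopes))
    return [
        math.sqrt(c / b) * total / u_setpoint
        for c, b in zip(exec_times, slopes)
    ]


# Hypothetical example: the first task is more sensitive to a long period.
exec_times = [1.0, 2.0]
slopes = [4.0, 1.0]

periods = optimal_periods(exec_times, slopes, u_setpoint=0.5)
utilization = sum(c / h for c, h in zip(exec_times, periods))
print(periods, utilization)
```

The task with the larger slope is given the shorter period, and the resulting utilization equals the set-point, as expected when the constraint is active.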

2.2 Embedded systems

An embedded system can be defined as a system that has a computer but is not in itself a computer, and this is currently the dominating use for computers, accounting for a vast majority of processor sales. Embedded systems are found all over the range from tiny to very large systems, and examples include intelligent price tags, home appliances, toys, mobile phones, industrial robots, cars, aircraft, ships, and power plants. Therefore, one must be careful when making general statements about embedded systems and their properties. Design and implementation of embedded systems is also a vast research area, and most of it is outside the scope of this thesis. Nonetheless, we will now briefly examine some of the key differences between embedded systems and general-purpose computers. The discussion will primarily target the small to medium sized range of systems, with single CPU computers² and memory in the range from several kilobytes to a few megabytes³, that are commonly found in e.g., process and robot controllers.

There is a strong connection between the fields of real-time systems and embedded systems, as most real-time systems are embedded and vice versa. Therefore, when studying real-time systems, it is often necessary to also consider the special properties and requirements of embedded systems. In addition to timing requirements this includes safety aspects and resource constraints.

2.2.1 Safety and dependability

A program in an embedded computer typically runs “forever” and is not directly accessible to the user. This puts stronger demands on robustness and dependability on embedded software as compared to, e.g., desktop applications⁴. While it is annoying if a word processing application occasionally crashes, users often accept having to reboot their PC once in a while. This is not the case for embedded systems; it would be unacceptable if the software of a microwave oven crashed and required the user to pull the plug to turn off the appliance.

The fact that the program never terminates means that even minor errors may, in time, cause a fault. For instance, a small memory leak in a desktop application will probably not cause any problems, as that memory will be returned to the system when the application is terminated. In contrast, in a program which never terminates, like an embedded system or a server application, even a small memory leak will eventually cause the system to run out of memory and fail.

Viewing the problem from a slightly different perspective, fault tolerance is an important aspect of embedded systems design. That is, the system should always be able to reach a safe state if a fault occurs. For some systems, this may simply mean performing an emergency stop in case of a fault, or using a watchdog mechanism to automatically reboot a computer if it stops responding. However, for many systems this is not possible, as they do not have a simple safe state. For instance, in a moving car with a drive-by-wire system, simply turning off the power does not leave the

² In the cases where more than one CPU is required, we use separate computers communicating via shared memory or a real-time network.

³ Our platforms for experiments include the Atmel ATmega128 microcontroller with an 8 bit, 8 MHz, 8 MIPS RISC CPU and 32 kilobytes of RAM and PowerPC G3 boards with 300 MHz CPU and 32 megabytes of RAM.

⁴ A desktop computer can, of course, be part of a large embedded system, but usually not of its time-critical parts.


system in a safe state; it must be actively stopped. When using embedded computers in such systems, care must be taken to ensure that the critical parts of the software will always be able to continue executing, albeit in a “safe mode” with degraded performance.

With increasing system complexity, especially in software, the engineering effort required to ensure fault tolerance increases rapidly. Therefore, run-time systems and development tools for embedded applications must provide as much support as possible for making software safe and robust.

2.2.2 Well-defined area of application

The additional extra-functional requirements on embedded software increase the complexity of system design and make software design more demanding. On the other hand, in contrast to a general-purpose computer, an embedded system is specifically designed to perform a number of well-defined tasks. This means that all aspects of the embedded hardware and software may be tailored for a particular task, making some aspects of software engineering easier.

For instance, while desktop or server applications may be subjected to widely varying workloads, embedded software in time-critical applications commonly operates in steady state most of the time, with distinct mode changes. Therefore, some problems that are undecidable in the general case, like worst case execution time or live memory analysis, may be practically feasible in an embedded systems context, as problematic program constructs like unbounded loops are typically avoided.

2.3 Memory management

Memory management consists of two principal tasks: memory allocation, which means mapping a variable or an object to a particular memory address, and de-allocation or reclamation, which means returning the chunk of memory occupied by a variable or an object to the system so that it can be used to satisfy another allocation request. Memory management can be static or dynamic, and the latter is further divided into manual and automatic memory management. We will now briefly review these different techniques and the fundamental concepts of memory management.

The oldest form of memory management is static memory management, where the space required for all variables and data structures of the program is allocated statically by the programmer or compiler. As with all static techniques, this makes it easy to verify that a program will work


and requires no run-time decisions regarding memory management, but the limitations are severe when it comes to writing programs that, e.g., build dynamic data structures depending on input. With the exception of some parts of safety-critical applications, static memory management is seldom used, due to its low flexibility, the difficulty of development and maintenance, and the low average resource utilization which often results from using off-line techniques.

Dynamic memory management [Knu73, WJNB95] overcomes these limitations by making it possible to allocate memory at any time in the program. However, this comes at the cost of having to manage memory at run-time; when the program wants to allocate more memory, the run-time system must find a suitable space in memory where the requested object will fit. As the amount of physical memory is limited, it is also necessary to reuse the memory occupied by objects that will no longer be used. This can be done manually, by explicitly inserting instructions to deallocate a certain memory area (such as free and delete in C/C++) in the code, or automatically by the run-time system.

There are two major problems with manual memory management, and both are caused by the difficulty of manually determining object lifetimes: failing to deallocate objects that will no longer be used, causing memory leaks, and deallocating objects too soon, causing dangling pointers. The effect of the former is obvious: failure to deallocate objects that are no longer needed causes excessive memory usage and may cause the system to run out of memory. The latter problem, dangling pointers, is more insidious. It arises when one part of the program deallocates an object, O1, that is still used by another part of the program. The memory occupied by O1 may then be used to allocate a new object, O2. Then, the situation where one part of the program modifies O1 and another part modifies O2 may arise. As both O1 and O2 refer to the same address, this will result in memory corruption and program failure.
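The dangling pointer scenario can be illustrated with a minimal sketch. The `ToyHeap` class below is a hypothetical model of a manually managed heap (it is not part of any system described in this thesis); it reuses the most recently freed slot, which is an assumption chosen to make the aliasing of O1 and O2 visible.

```python
class ToyHeap:
    """Hypothetical manually managed heap: fixed slots plus a free list."""

    def __init__(self, size):
        self.slots = [None] * size
        self.free_list = list(range(size))

    def malloc(self, value):
        addr = self.free_list.pop()      # reuse the most recently freed slot
        self.slots[addr] = value
        return addr

    def free(self, addr):
        self.free_list.append(addr)      # slot may now be handed out again


heap = ToyHeap(4)
o1 = heap.malloc("O1 data")   # one part of the program allocates O1
alias = o1                    # another part keeps its own copy of the address
heap.free(o1)                 # O1 is deallocated too soon
o2 = heap.malloc("O2 data")   # the allocator hands out the same slot for O2

heap.slots[alias] = "written through stale O1 pointer"
corrupted = heap.slots[o2]    # O2 now contains data its owner never wrote
```

In this model `o1` and `o2` are the same address, so the write through the stale alias silently overwrites O2.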

Determining when an object should be deallocated in order to avoid both memory leaks and dangling pointers is non-trivial in a complex system. To ensure memory consistency, systems with manually managed memory require rigid coding conventions and protocols for when allocation, deallocation and pointer passing are allowed.

In systems with automatic memory management, the task of keeping track of when an object is no longer in use and can be safely deallocated is performed by the run-time system, which frees the programmer from this complex and error-prone task. The technique used to identify and reclaim dead objects is called garbage collection (GC). Examples of early programming languages with GC are LISP [McC60] and Simula [DN76].


2.3.1 Garbage collection

There are different approaches to implementing GC [JL96]. In this thesis, we will only consider tracing collectors, that is, collectors that traverse the reference graph in order to determine which objects are live and which are not. Examples are mark-sweep [McC60] and copying [Min63, FY69] collectors. Another approach to garbage collection is reference counting [Col60], where the idea is to keep a count of how many references there are to each object and reclaim objects when the reference count reaches zero⁵.
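As a minimal sketch of reference counting, the toy model below keeps a count per object and reclaims an object when its count reaches zero. The names `Node`, `add_ref` and `drop_ref` are hypothetical helpers introduced only for this illustration. The example also shows the well-known limitation of pure reference counting: objects in a cycle keep each other's counts above zero.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.rc = 0          # number of incoming references
        self.fields = []     # outgoing references


def add_ref(holder, target):
    """Store a reference to target; holder is a Node or the root list."""
    target.rc += 1
    (holder.fields if isinstance(holder, Node) else holder).append(target)


def drop_ref(holder, target, reclaimed):
    """Remove a reference and reclaim target if its count reaches zero."""
    (holder.fields if isinstance(holder, Node) else holder).remove(target)
    target.rc -= 1
    if target.rc == 0:
        reclaimed.append(target.name)
        for child in list(target.fields):
            drop_ref(target, child, reclaimed)


roots, reclaimed = [], []
a, b, c = Node("a"), Node("b"), Node("c")
add_ref(roots, a)       # root -> a
add_ref(a, b)           # a -> b
add_ref(b, a)           # b -> a, forming a cycle
add_ref(roots, c)       # root -> c, not part of any cycle

drop_ref(roots, c, reclaimed)   # count of c reaches zero: reclaimed
drop_ref(roots, a, reclaimed)   # a is unreachable, but b -> a keeps rc at 1
```

After both root references are dropped, only `c` has been reclaimed; `a` and `b` are unreachable but still have a count of one each.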

This thesis focuses on scheduling of GC work, rather than GC algorithm design or implementation, and the presented approach is applicable to any concurrent tracing garbage collector. However, different collectors have different properties and different requirements on the collector/mutator interface. Therefore, we will now briefly review the most common fundamental algorithms for tracing garbage collection.

GC algorithms can be divided into moving or non-moving. If a non-moving collector is used, objects reside at the same address from allocation to reclamation, just as in manually managed memory (e.g., malloc and free). A moving collector, on the other hand, may move objects during collection in order, for instance, to compact the heap by moving all live objects to one end of the heap, leaving a single contiguous area of free memory, and thus avoiding (external) fragmentation. Mark-sweep is an example of a non-moving algorithm, and mark-compact and copying algorithms are moving.

The cyclic nature of garbage collection

Most (tracing) garbage collectors need to make multiple passes in order to identify the live objects and reclaim the garbage. For example, a mark-sweep collector first scans all root pointers⁶, then marks live objects and finally sweeps the heap. We call all the activities required to identify and reclaim garbage a GC cycle. E.g., in the mark-sweep case, a GC cycle consists of root scanning, pointer traversal and sweeping.

⁵ A problem with reference counting is that it cannot reclaim cyclic structures; even if a set of objects are no longer reachable from a program, cycles in the object graph will prevent reference counts from reaching zero, thereby preventing unreachable objects from being reclaimed. For this reason, pure reference counting is not suitable for embedded or other long-running software, where even small memory leaks will eventually cause the system to fail. There are, however, reference counting real-time GCs, including techniques for reclaiming cyclic structures [Rit03]. Cyclic structures are reclaimed either manually (by manually breaking cycles or using weak references), or by having a tracing GC as back-up. The former approach just transfers the responsibility to the programmer, and the latter still requires a tracing real-time GC to ensure real-time performance. Also, in a reference counting GC, the scheduling of GC work is implicit in the algorithm. Reference counts are updated at reference assignments, and thus performed in-line with the mutator code. For these reasons, reference counting is outside the scope of this thesis.

It should be noted that during some of the phases (e.g., root scanning and pointer traversal), performing GC work does not cause any memory to be reclaimed. Thus, a generic GC model for use in scheduling analysis must assume that no memory is reclaimed until the end of the GC cycle. Compacting or copying garbage collectors typically have this behaviour, whereas a non-compacting mark-sweep frees memory continuously during the sweep phase.

Mark-sweep

Mark-sweep is the classic non-moving tracing collector. A GC cycle consists of two phases: In the mark phase, the object graph is traversed and each visited object is marked as live. Then, during the sweep phase, all objects on the heap are examined and those that have not been marked are reclaimed.
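The two phases can be sketched as follows. This is a minimal, non-incremental illustration in which the heap is modelled as a dictionary from object identifiers to the identifiers they reference; that representation and the function names are assumptions made for the example, not the implementation used in this thesis.

```python
def mark(roots, heap):
    """Mark phase: traverse the object graph from the roots."""
    marked = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj not in marked:
            marked.add(obj)
            worklist.extend(heap[obj])   # follow outgoing references
    return marked


def sweep(heap, marked):
    """Sweep phase: examine every object and reclaim the unmarked ones."""
    free_list = [obj for obj in heap if obj not in marked]
    for obj in free_list:
        del heap[obj]
    return free_list


# Toy heap: object id -> ids of the objects it references (hypothetical data).
heap = {"A": ["B"], "B": [], "C": ["D"], "D": ["C"], "E": []}
roots = ["A", "E"]
live = mark(roots, heap)      # A, B and E are reachable
freed = sweep(heap, live)     # C and D form an unreachable cycle
```

Note that the unreachable cycle between `C` and `D` is reclaimed, since liveness is decided by reachability from the roots rather than by reference counts.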

Figure 2.6 shows an example of a heap before and after a mark-sweep GC cycle. As it is a non-moving algorithm, the free space is non-contiguous after the GC cycle. The free blocks are typically linked together to form a free list, just as in traditional memory allocators.

[Figure: stack and heap shown in two panels, “After marking” and “After sweeping”; legend: live object, dead object, free memory.]

Figure 2.6: Example of a heap before and after a mark-sweep GC cycle.

⁶ The roots of the object graph are objects that are, by definition, live. The roots are identified through root pointers, i.e., pointers located outside the garbage collected heap that reference objects on the heap. Typical examples are pointers located in global variables or variables on the stack.


Mark-compact

Mark-compact is a moving version of mark-sweep, where the heap is compacted in order to get one large contiguous area of free memory, which both avoids external fragmentation and makes allocation simpler. The mark phase is the same as in mark-sweep, but instead of reclaiming the dead objects, the live ones are moved. Compaction can be done in different ways, and the one described in Figure 2.7 is sliding objects. In the compact phase, for each “hole” in the heap, the next live object is moved to the start of the hole, leaving one large chunk of free memory at one end of the heap. When objects have been moved, references are updated to point to the new location of the object.
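A sliding compaction phase can be sketched as below. The heap is modelled as a list of `(name, address, size)` tuples sorted by address, and the `forwarding` dictionary is a hypothetical stand-in for whatever address-translation mechanism a real collector would use; both are assumptions for illustration only.

```python
def compact(objects, marked, refs):
    """Slide live objects towards address 0, preserving their order.

    objects: list of (name, address, size), sorted by address
    marked:  set of names found live by the mark phase
    refs:    dict name -> list of addresses of referenced objects
    """
    forwarding = {}          # old address -> new address
    free_ptr = 0             # start of the next hole to fill
    compacted = []
    for name, addr, size in objects:
        if name in marked:
            forwarding[addr] = free_ptr
            compacted.append((name, free_ptr, size))
            free_ptr += size
    # Update references so that they point to the new locations.
    new_refs = {name: [forwarding[a] for a in refs[name]]
                for name, _, _ in compacted}
    return compacted, new_refs, free_ptr


objects = [("A", 0, 4), ("X", 4, 8), ("B", 12, 2), ("Y", 14, 6), ("C", 20, 4)]
refs = {"A": [12], "B": [20], "C": [0], "X": [], "Y": []}
compacted, new_refs, free_ptr = compact(objects, {"A", "B", "C"}, refs)
# compacted == [("A", 0, 4), ("B", 4, 2), ("C", 6, 4)]; free memory starts at 10
```

All memory from `free_ptr` to the end of the heap is then one contiguous free area, so allocation can be done by simply advancing a pointer.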

[Figure: stack and heap shown in two panels, “After marking” and “After compaction”; legend: live object, dead object, free memory.]

Figure 2.7: Example of a heap before and after a mark-compact GC cycle.

Copying collectors

Copying collectors, or semi-space collectors, in their basic form, work by dividing the heap into two halves. New objects are allocated in one space, fromspace, until it is filled up. Then, the reference graph is traversed and all live objects are evacuated into tospace. When an object is evacuated, a forwarding pointer in the fromspace copy points to the new location. When the heap is scanned, all encountered references pointing to the fromspace copy of an object are updated to point to the tospace copy. Finally, the spaces are flipped, so that the old tospace becomes fromspace, and vice versa. After the flip, the new tospace contains one big area of free memory. Figure 2.8 shows an example of a GC cycle with a copying collector.
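The evacuation and scanning steps can be sketched as a simple stop-the-world copying pass. In this illustration objects are lists of fromspace indices, and the `forwarding` dictionary plays the role of the forwarding pointers stored in the fromspace copies; these modelling choices, and the scan-pointer loop in the style usually attributed to Cheney, are assumptions of the sketch.

```python
def collect(fromspace, roots):
    """Evacuate everything reachable from roots into a fresh tospace.

    fromspace: list of objects; each object is a list of fromspace indices
    roots:     list of fromspace indices
    Returns (tospace, new_roots), with all references as tospace indices.
    """
    tospace = []
    forwarding = {}                 # fromspace index -> tospace index

    def evacuate(idx):
        if idx not in forwarding:   # not yet copied
            forwarding[idx] = len(tospace)
            tospace.append(list(fromspace[idx]))  # fields still point to fromspace
        return forwarding[idx]

    new_roots = [evacuate(r) for r in roots]
    scan = 0
    while scan < len(tospace):      # scan tospace, evacuating referenced objects
        tospace[scan] = [evacuate(f) for f in tospace[scan]]
        scan += 1
    return tospace, new_roots


# Objects 1 -> 3 -> 4 -> 1 are live; objects 0 and 2 are garbage.
fromspace = [[], [3], [], [4], [1]]
tospace, new_roots = collect(fromspace, roots=[1])
# tospace == [[1], [2], [0]] and new_roots == [0]
```

After the pass, the whole of fromspace can be reused, which corresponds to the flip described above.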


[Figure: stack, fromspace and tospace shown in three panels, “Before”, “After evacuating live objects” and “After flip”; legend: object, old position of object, dead object, free memory.]

Figure 2.8: Example of a GC cycle in a copying collector. First, the reference graph is traversed and the live objects evacuated to tospace. Then, the flip is performed, and the memory of (the old) fromspace is reclaimed. This is a schematic illustration of the principle of operation. In a real collector, references are updated as they are encountered during evacuation, and the relative positions of objects may differ between fromspace and tospace.


2.3.2 Incremental and real-time GC

In the first systems with automatic memory management, the application program, or mutator⁷, allocated memory until there was no more free memory. Then, the mutator was suspended and the garbage collector performed a full GC cycle, reclaiming the unused memory. This is commonly known as stop the world garbage collection, as the whole application is stopped when the garbage collector is running. Another term is batch GC. The obvious drawback of batch GC, from a real-time perspective, is that the GC pauses, although infrequent, may be very long, which is unacceptable in a system with hard timing constraints. For such applications, long GC pauses can be avoided by making the GC incremental.

Research within the field of incremental and real-time garbage collection has been going on since the late sixties. In the earliest attempts to implement non-intrusive garbage collectors, the GC work was split into a number of very small increments which were performed interleaved with the execution of the application [Bob68, Ste75, Wad76, DLM+78, Bak78]. In order to guarantee progress of the garbage collector, a suitable number of increments of GC work are performed in connection with each memory allocation request, in proportion to the size of the request. An example of such an algorithm is Baker’s algorithm [Bak78]. Let $F_{min}$ denote the minimum amount of memory available for allocation during a GC cycle, $a$ denote the amount of memory requested, and $W_{max}$ denote the maximum amount of GC work (according to a given metric and corresponding unit) that might be required to complete a GC cycle. Then, the size $w$ of the GC work increment that must be performed in connection with the allocation in order to guarantee that we do not run out of memory before the GC cycle is complete is:

$$ w \geq \frac{W_{max} \cdot a}{F_{min}} \qquad (2.4) $$
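As a small worked example of Equation 2.4, the sketch below computes the smallest whole number of work units that satisfies the inequality for a given request. The function name and the numeric values are hypothetical and chosen only for illustration; the work unit is whatever metric the collector uses.

```python
def gc_increment(w_max, f_min, a):
    """Smallest integer w satisfying w >= w_max * a / f_min (Equation 2.4)."""
    return -(-(w_max * a) // f_min)   # integer ceiling division


# Hypothetical figures: at most 50 000 work units per GC cycle and at least
# 100 000 bytes available for allocation during a cycle.
w = gc_increment(w_max=50_000, f_min=100_000, a=64)   # 32 work units
```

With these figures, every allocated byte must be accompanied by half a unit of GC work, so a 64-byte request requires 32 units.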

Incremental GC triggered by allocation requests has at least two major disadvantages. Firstly, even if the overhead incurred by a single GC increment is small, a burst of allocation requests can lead to long accumulated delays. Secondly, in order to keep the cost of each GC increment within a low upper bound we might need to use a complex GC

⁷ The term mutator comes from the fact that, from the collector’s point of view, the application is a process that changes, or mutates, the reference graph. In the sequel, the terms mutator and application will be used synonymously, when there is no risk of confusion. However, from the view of the underlying OS, a Java application includes both the mutator threads and the collector.


work metric in order to decide when to end each increment, since a simple metric often gives a poor approximation of the temporal behaviour of the garbage collector. For instance, if a metric based on measuring the number of evacuated objects in a copying garbage collector is used, an increment which should be short according to the metric can take a long time to perform. The problem is that we might have to scan a significant number of pointers in order to find just one object to evacuate. Thus, increasing the performed amount of work according to the metric by one unit may require a virtually unbounded amount of time.

Performing GC at the time of allocation does make it easy to prove that the garbage collector will always keep up with the application, but it also means that it suffers from the inherent problem of GC work always being performed when the mutator runs, thus causing interference. This problem can be overcome by making the collector concurrent, i.e., assigning the GC work to a separate GC thread executing in parallel with the mutator threads. This is a strategy applied by a number of garbage collectors, e.g., the Appel-Ellis-Li collector [AEL88], but it has not been much used in real-time settings. Typically, in traditional concurrent collectors, no provision is made for guaranteeing that the collector keeps up with the allocation demands of the application.

Read and write barriers

When doing incremental garbage collection, the collector will operate on the heap while the mutator is potentially modifying the pointer graph. Therefore, mechanisms are required to ensure that mutator operations do not cause live objects to be missed by the collector, and that the operations of a moving collector do not cause dangling pointers in the mutator. Such mechanisms are called read and write barriers.

As barriers are performed at every reference access, it has been assumed that they, and especially read barriers, add a significant overhead. Therefore, much work has concentrated on developing algorithms that do not rely on barriers for synchronization between collector and mutator. However, a recent study found that, in many cases, the average overhead of both read and write barriers was small, and that the assumption is not always correct [BH04].

Read barriers are used in copying collectors, where reads of pointers to objects in fromspace are trapped in order to evacuate live objects and/or update pointers into fromspace to point to the new tospace copy.


Read barriers are also used in concurrent, moving collectors, like mark-compact, to ensure that pointer dereferences always return the current location of an object after it has been moved.

The current location of objects can be recorded in two ways. The first is the forwarding pointer approach typically used in copying collectors, where a field in the object header is used in fromspace objects to indicate their new, tospace, location. Another possibility is to use an indirection table outside the objects, where each object on the heap is pointed to from an entry in the table. The mutator does all accesses to objects via the indirection table, and thus, when an object is moved, only the table entry needs to be changed, and not all references to the object.
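The indirection table approach can be sketched as follows. The class `HandleHeap` and its methods are hypothetical names introduced for this illustration, and memory is modelled as a dictionary from addresses to payloads.

```python
class HandleHeap:
    """Hypothetical heap where the mutator holds handles, not addresses."""

    def __init__(self):
        self.memory = {}     # address -> object payload
        self.table = []      # handle -> current address of the object

    def allocate(self, address, payload):
        self.memory[address] = payload
        self.table.append(address)
        return len(self.table) - 1        # the mutator only sees the handle

    def read(self, handle):
        return self.memory[self.table[handle]]   # every access goes via the table

    def move(self, handle, new_address):
        old = self.table[handle]
        self.memory[new_address] = self.memory.pop(old)
        self.table[handle] = new_address  # single update, no reference scan


heap = HandleHeap()
h = heap.allocate(100, "payload")
heap.move(h, 8)               # the collector moves the object
value = heap.read(h)          # the mutator's handle is still valid
```

The cost of this design is one extra memory access per object access, in exchange for moving an object with a single table update.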

Write barriers are used in incremental or concurrent mark-sweep collectors to ensure that pointer updates during the mark phase cannot cause too few objects to be identified as live. Write barriers can be either of the snapshot-at-the-beginning or the incremental update type [Wil92, JL96], where the former is the more conservative of the two, ensuring that everything that was live at the start of the cycle will be retained. The latter works on the principle of preventing pointers to unmarked objects from being written into marked ones. An attempt to do so causes one of the objects to be (re)queued for marking.
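A minimal sketch of an incremental update write barrier is given below. In this variant the unmarked target object is queued for marking; queueing the marked source object instead would be the other option mentioned above. The class `IncrementalMarker`, the `use_barrier` flag and the toy heap are assumptions made for the illustration.

```python
class IncrementalMarker:
    def __init__(self, heap, roots, use_barrier=True):
        self.heap = heap                 # object id -> list of referenced ids
        self.marked = set()
        self.worklist = list(roots)      # objects queued for marking
        self.use_barrier = use_barrier

    def step(self):
        """Mark one object; return False when there is no work left."""
        while self.worklist:
            obj = self.worklist.pop()
            if obj not in self.marked:
                self.marked.add(obj)
                self.worklist.extend(self.heap[obj])
                return True
        return False

    def write(self, src, dst):
        """Mutator pointer store src -> dst, guarded by the write barrier."""
        self.heap[src].append(dst)
        if self.use_barrier and src in self.marked and dst not in self.marked:
            self.worklist.append(dst)    # queue the unmarked target


def run(use_barrier):
    heap = {"A": ["B"], "B": ["C"], "C": []}
    gc = IncrementalMarker(heap, roots=["A"], use_barrier=use_barrier)
    gc.step()                 # A is marked, B is queued
    gc.write("A", "C")        # pointer to unmarked C stored in marked A
    heap["B"].remove("C")     # the only other path to C is removed
    while gc.step():
        pass
    return gc.marked
```

With the barrier enabled, `run(True)` marks all three objects. Without it, `run(False)` leaves `C` unmarked although it is still reachable from `A`, so a subsequent sweep would reclaim a live object.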

2.3.3 Semi-concurrent GC scheduling

In order to satisfy the demands of hard real-time systems, a technique must be found to schedule the GC work of a concurrent GC such that the application is guaranteed to meet all of its hard deadlines. Such a scheduling technique was presented by Henriksson in [Hen98]. That work focuses on embedded systems which are assumed to have a number of high-priority (typically periodic) threads that must meet hard deadlines. It can be observed that in most embedded systems, a relatively small number of such threads exist. Apart from these, low-priority (periodic or background) threads are often executing with more relaxed deadline requirements. This leads to the fundamental idea of Henriksson’s work, which is as follows: Do not perform any GC work when the high-priority threads are executing. Instead, assign the work motivated by high-priority allocations to a separate GC thread which is run when no high-priority thread is executing. When invoked, it performs an amount of GC work proportional to the amount of memory allocated by the high-priority threads. Since the garbage collector may temporarily get behind with its work in this way, there must always be an amount


of memory reserved for the high-priority threads. Slightly modified generalized rate monotonic analysis can be used both to calculate the amount of memory which needs to be reserved and to verify that the garbage collector thread will always keep up with the high-priority threads. Garbage collection work motivated by low-priority threads is performed incrementally at allocation time. Since GC work is partly performed concurrently and partly incrementally in such a system, the approach is called semi-concurrent scheduling. A system using this scheduling strategy can be described as having three levels of priority:

1. High priority processes

2. Garbage collection required to satisfy the high priority processes

3. Low priority processes and incremental garbage collection

Figure 2.9 shows how the CPU time will be used in a system with one periodic high priority process and one low priority process.

[Figure: timeline with priority on the vertical axis and time on the horizontal axis, showing execution alternating between HP, GC and LP/GC.]

Figure 2.9: Dividing the CPU time between processes. The system consists of one periodic high priority process (HP) and one low priority process (LP). Whenever a high priority process is suspended, and no other HP process is eligible for execution, the garbage collector (GC) is run. GC work is also interleaved with the low priority process using traditional incremental garbage collection.
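The bookkeeping behind the semi-concurrent idea can be sketched as below. This is only an illustration of the principle of deferring GC work for high-priority allocations; the class name, the `work_per_byte` parameter and the `budget` argument are hypothetical, and the sketch contains none of the schedulability analysis of [Hen98].

```python
class SemiConcurrentGC:
    """Hypothetical accounting of GC work in a semi-concurrent scheme."""

    def __init__(self, work_per_byte):
        self.work_per_byte = work_per_byte
        self.pending = 0     # GC work owed on behalf of high-priority threads
        self.done = 0        # GC work performed so far

    def allocate(self, size, high_priority):
        work = size * self.work_per_byte
        if high_priority:
            self.pending += work     # deferred: no GC work in HP context
        else:
            self.done += work        # incremental: performed at allocation time

    def run_gc_thread(self, budget):
        """Called when no HP thread is eligible; returns the work performed."""
        performed = min(self.pending, budget)
        self.pending -= performed
        self.done += performed
        return performed


gc = SemiConcurrentGC(work_per_byte=2)
gc.allocate(100, high_priority=True)    # 200 units deferred to the GC thread
gc.allocate(50, high_priority=False)    # 100 units done at allocation time
first = gc.run_gc_thread(budget=150)    # GC thread is preempted after 150 units
second = gc.run_gc_thread(budget=150)   # the remaining 50 units
```

While `pending` is non-zero the collector is behind, which is why memory must be reserved for the high-priority threads.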

The effect of this scheme is that it makes it possible to guarantee hard real-time performance for threads that actually require it in a system scheduled by a fixed-priority scheduler. Since garbage collection work is not performed while high-priority threads run, we can allow ourselves to use a more coarse garbage collection work metric without affecting


real-time performance. An unnecessarily conservative metric will only prevent low-priority threads without hard deadlines from executing as often as they would prefer.

The approach still has some drawbacks, however. One drawback is that it is not immediately suitable for systems with EDF schedulers. Another drawback is that it is necessary to do a fair amount of scheduling analysis in order to tune the collector to a specific target platform.

2.3.4 Definitions

For clarity, this section and Figure 2.10 introduce the important terms used in the discussion of garbage collection scheduling. The operation of a GC is divided into GC cycles, and the time from the start (release) of a GC cycle to the end is called the GC cycle time, denoted $T_{GC}$. If nothing else is stated, the end time (deadline) of a GC cycle is equal to the release time of the following one. The execution time required to complete the GC work of one GC cycle is denoted $C_{GC}$. Scheduling of GC is aimed at avoiding out-of-memory situations, and the analysis is based on the amount of free memory, $F$, and the allocation rate, $a$.

[Figure: free memory plotted against time for GC cycle $k$; annotations: $T_{GC}(k)$, $F_s(k)$, $F_e(k)$, $t_s(k)$, $t_e(k)$ and the allocation rate $a$.]

Figure 2.10: Definitions used when discussing GC scheduling. The start and end times of a GC cycle are denoted $t_s$ and $t_e$, respectively, and $T_{GC}(k) = t_e(k) - t_s(k)$ is the GC cycle time. $F_s(k) = F(t_s(k))$ and $F_e(k) = F(t_e(k))$ are the amounts of free memory at the start and end of GC cycle $k$, respectively, and $a$ is the allocation rate.
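As a small worked illustration of how these quantities relate, consider the simplifying assumptions that the allocation rate $a$ is constant and that, as in the generic GC model of Section 2.3.1, no memory is reclaimed until the end of the GC cycle. These are assumptions of this sketch, not a result stated in this section. The free memory during cycle $k$ then decreases linearly, and avoiding an out-of-memory situation bounds the cycle time:

```latex
F(t) = F_s(k) - a \, \bigl(t - t_s(k)\bigr), \qquad t_s(k) \le t < t_e(k)

F(t) \ge 0 \ \text{throughout the cycle}
\iff F_s(k) - a \, T_{GC}(k) \ge 0
\iff T_{GC}(k) \le \frac{F_s(k)}{a}
```

For example, with a hypothetical $F_s(k) = 100$ kB and $a = 20$ kB/s, the GC cycle would have to complete within 5 s.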


2.4 Real-time Java for embedded systems

Recently, Java has become more widely used in real-time applications, and different solutions for developing and executing Java programs with timing requirements have been developed. We will now briefly review some of those.

Given a program, written in Java, there are basically two different alternatives for how to execute that program on the target platform. The first alternative is to compile the Java source code to byte code, and then have a, possibly very specialized, Java Virtual Machine (JVM) to execute the byte code representation. This is the interpreted solution (as required to be Java certified) used today for Internet programming, where the target computer type is not known at compile time. The second alternative is to compile the Java source code, or byte code, to native machine code for the intended target platform.

From the real-time garbage collection perspective, the differences between the two approaches are not significant, and the contributions of this thesis are applicable to both. The GC scheduling decisions are taken at a higher level, and are not dependent on instruction-level differences between platforms. There are also many similarities when GC implementation is considered. For instance, when a just-in-time (JIT) compiling JVM has compiled the byte codes to native code, this code is no different, from the point of view of the GC, from code that was compiled ahead of time.

2.4.1 Real-time virtual machine

In virtual machines for real-time Java, the trade-off between predictability and performance becomes apparent. Just-in-time (JIT) compilation is very hard to combine with real-time demands, and using an interpreter typically gives execution speeds 10 times slower than natively compiled code. To improve performance, some JVMs (e.g., mackinac [Mac04]) use the JIT compiler to compile the application at initialization time. That, however, comes at the cost of a significantly larger memory footprint. Also, the overhead of the JVM itself makes virtual-machine based solutions unsuitable for small embedded systems.

In order to speed up execution, reduce memory footprint or improve predictability, a number of hardware-assisted approaches to execution of Java byte-code have recently been developed. By using a co-processor to execute Java byte-codes, or by augmenting the instruction set of the processor with Java instructions, performance similar to that of native code can be achieved, without the overhead in time and space of a JIT compiler.


Most JVMs for the embedded market do not include real-time garbage collection, but rely on other mechanisms for memory management, like the scoped memory of the Real-time specification for Java (RTSJ) [B+01, Wel04].

With more and more large, high-performance computers used in control applications, the range of platforms which are referred to as “embedded” is vast. At the lower end of that range, virtual machines for embedded systems with only a few kilobytes of RAM can be found, including the Infinitesimal Virtual Machine (IVM) [Ive03] and SimpleRTJ⁸. For those systems, small memory footprint is the dominating design goal.

2.4.2 The Lund Java-based real-time platform

The Lund Java-based⁹ Realtime Platform (LJRT) makes it possible to write hard real-time applications for small and medium sized embedded computers, in a portable way, using standard Java. The LJRT platform consists of two parts, the LJRT compiler and runtime system, and the LJRT class library.

The set of target systems considered includes small (350 MHz PPC G3 with 32 MB RAM) to very small (AVR µcontroller at 8 MHz/32 kB RAM) embedded computers. Therefore, we prefer ahead-of-time compilation to using a JVM. One thing almost all CPUs have in common is that there exists a C compiler with an appropriate back-end. In the interest of maintaining good portability while compiling Java to native code, C is used as the intermediate language: the Java front-end generates C code which, in turn, is compiled by a standard C compiler, as shown in Figure 2.11.

The compiler and run-time system part is made up of three loosely coupled components: the Java compiler, the garbage collector interface (GCI), and the run-time system. The Java compiler ([Nil04, NIEH04]) generates C code where all heap object accesses are made through the generic GCI ([IBE+02]) in order to provide an abstraction from the details of the different hard real-time garbage collector (GC) implementations ([Hen98, RH03]) that are part of the run-time system. To date, the run-time system has been ported to real-time Linux/RTAI/Xenomai with both user-space and kernel-space real-time threads on PowerPC and Intel, the STORK real-time kernel on PowerPC, a locally developed real-time kernel for the ATMEL AVR series of microcontrollers, and POSIX (for running in user-space, without hard real-time guarantees). Porting to a new RTOS is quite simple and requires writing a small number of native functions to interface with the RTOS system calls.

⁸ http://www.rtjcom.com

⁹ The term Java-based is due to the fact that our way of accomplishing a J2SE-compatible (any embedded program will run with the proper concurrency behaviour on any Java-enabled desktop) real-time Java platform is not compatible with the Java license conditions from Sun (we provide a real-time improved J2SE subset affecting the RTOS API without going via the JCP and without requiring a JVM). Thus, we may not call our free solution “Java”, so we call it Java-based.

[Figure: flow diagram. The application source (<Main class>.java and other user-written classes) and the standard class library are input to the Java compiler, which produces a C file and header files. GCC combines these with the user and standard library native method implementations, the automatic memory management (GC) and the run-time system into the application executable.]

Figure 2.11: LJRT compiler overview: The application Java source, together with the required classes from the standard library, are translated to C by the Java compiler. That C code, any native method implementations provided by the user or the standard library, and run-time system code, GC, etc., are compiled and linked to produce the executable.

The LJRT class library is an open-source Java package containing classes for real-time threads, semaphores, monitors, mailboxes, etc. The LJRT library has both a pure Java implementation, allowing real-time applications implemented using the library to be executed on any JVM with proper concurrency behaviour, and native implementations, giving hard real-time performance on the target systems supported by the LJRT run-time system. The dual implementations are transparent to the user at the Java level, and the target-system specific features are automatically inserted through the LJRT compiler and run-time system.


Due to external requirements, we want to be able to use an off-the-shelf RTOS as well as external, legacy or automatically generated, C code. That, combined with using a standard C compiler as the back-end, means that we cannot rely on detailed assumptions on the behavior of the back-end C compiler or the thread scheduler, which makes implementation of a real-time GC more challenging. For instance, it means that any synchronization required between collector and application needs to be done explicitly. It also means that the generated C code must be written so that it ensures, in a portable way, that no back-end optimization causes interference with the GC. The challenges of implementing accurate real-time GC in an uncooperative environment are explored in Chapter 7.

2.4.3 Multi-stage deployment of control software

For future control systems, there is a strong need for tools and methods supporting the development and deployment of control software. To this end, we have proposed a method for developing hard real-time software based on the standard Java language and multi-stage deployment and verification towards the embedded platform [RNNH06]. As enabling technology, the LJRT platform is used, making it possible to develop embedded Java software on the desktop using standard software tools for implementation, testing, and verification, before deployment onto the embedded platform.

Development of embedded real-time software adds complexity compared to software development in general, as it typically includes writing, or interfacing to, proprietary hardware drivers (such as I/O), and cross compilation, resulting in platform-related problems. In order to mitigate these problems, it is desirable to separate platform concerns from application development. The presented method for development and deployment provides such a separation of concerns: the major part of the application can be implemented, and its correctness in logical and concurrent behaviour verified, on the desktop, where building and execution are done using the standard Java SDK, and powerful development tools are readily available. In this stage, any process I/O, etc., is simulated. With the application working on the desktop, the move towards the embedded target system is done in steps where cross-compilation, drivers for I/O, and real-time requirements can be added and tested, one at a time.


While it is desirable to be able to do as much of the development and testing of an application as possible in a standard desktop environment, the subsequent port of the application to the hard real-time embedded system can require a large effort if done in an ad hoc manner. Therefore, we propose a method for doing the transition from desktop to target in a series of steps, where only one parameter is changed in each step, in order to facilitate verification of the different components, or identification of problems.

The fundamental principle is that the source code of the application should remain unchanged during all the stages of the deployment. What is changed, as the desktop application gradually is moved towards the embedded target, is, in turn, the class library, the compiler, the computer, the I/O drivers and the thread model. When the tools and the platform have been verified to work, it is possible to do the transition directly from the simulated environment on the desktop to the target system. The major benefit of the intermediate steps appears when things do not work, or when doing verification (or development) of the platform. The possibility of doing the deployment in several steps also makes it much easier to pinpoint at what stage of the deployment an error occurs and, hence, whether the source of the error is in the application code, the tools, hardware drivers, or the operating system.

As a case study, a motion controller for the IRB-6 robot was developed. On the desktop, the application was run, in simulated time, on a standard JVM, with a virtual robot consisting of a simple dynamics model and Java3D visualization ([HN99]). On the real robot, the program was compiled to C code using the LJRT Java compiler and to native code with gcc. The target system was a Motorola MVME 2600-1 computer, with a 200 MHz PowerPC G3 CPU, and the operating system was Linux/RTAI fusion¹⁰, version 0.9.1. Figure 2.12 shows the real robot in the robot lab, and a screenshot of the virtual robot.

¹⁰ Recently, the fusion branch of the RTAI project was moved into a separate project: Xenomai. RTAI fusion v0.9 corresponds roughly to Xenomai v2.0.


Figure 2.12: The IRB-6 in the robot lab and its virtual counterpart.

CHAPTER 3

TIME-TRIGGERED GARBAGE COLLECTION SCHEDULING

Traditionally, in order to ensure sufficient progress, incremental garbage collectors have been scheduled based on the allocations of the application: for each unit of allocation, a corresponding amount of garbage collection work must be performed. This chapter presents a different approach where time, instead of allocation, is used as the trigger for GC work. That is, garbage collection is scheduled to make the GC cycle finish at a certain time, rather than after a certain amount of allocation.

In Section 3.2, an upper bound on the GC cycle time that ensures that new memory is always made available in time is formulated. Section 3.3 presents the problems associated with traditional metrics used to measure garbage collection work, and argues that time should be used as the unit for garbage collection work and that this is practically feasible. Section 3.4 discusses how the process scheduling strategy affects a time-triggered GC scheduler, and shows how time-triggered GC with a deadline-based scheduler can achieve the same objectives as the semi-concurrent scheduling strategy does in a fixed-priority system.

3.1 Introduction

In [Rob02] the idea of time-based garbage collection scheduling with a fixed GC cycle length was introduced. That made it possible to determine how much memory will be allocated during a cycle, or to reserve a certain amount of memory for the next cycle, while still making it possible to perform schedulability analysis and give real-time guarantees on the run-time system in a straightforward manner.


In that work, a hybrid approach was used, where the GC scheduling on the cycle level was time-based and the increments were scheduled using a traditional work metric in a fixed-priority scheduled system. This chapter presents time-triggered garbage collection more thoroughly, and we will see that having an explicit GC cycle time simplifies reasoning about more aspects of the memory system. It also mitigates or circumvents certain problems associated with real-time scheduling of an allocation-triggered GC. The main areas where time-triggered garbage collection scheduling has impact are:

Concurrent GC in deadline-based systems: In order to schedule GC in a way that lets us give real-time guarantees while still disturbing the mutator (application) threads as little as possible in a deadline-based system, we want to be able to schedule the GC just as any other thread. With time-triggered GC, this property is inherent in the model, as the only scheduling parameter is the deadline, and we explicitly specify the deadline of each garbage collection cycle.

GC work metric concerns: A traditionally scheduled incremental GC relies on some kind of work metric to determine whether it is in sync with the mutator or needs to perform more GC work. Therefore, such a GC relies on the accuracy of the metric, and using a poor metric may cause poor real-time performance. Errors caused by a poor metric can be avoided by using the optimal GC work metric: the actual CPU time required to complete a GC cycle. Additionally, with time-triggered GC, the actual scheduling is independent of the work metric¹ and thus a poor metric does not affect the real-time properties of the run-time system. This allows us to separate the problems of schedulability analysis² and run-time scheduling.

Bursty allocation: Applications often show bursty allocation patterns. This means that an allocation-triggered GC would have a bursty execution pattern. Time-triggered GC scheduling does not have this problem, as GC work is scheduled so that each GC cycle finishes before its deadline, regardless of when the application performs its allocations.

Unified GC scheduling: Garbage collection schedulers based on a traditional GC work metric are tightly coupled to the actual garbage collector implementation. By using a time-based approach to GC scheduling, it would be possible to separate the GC scheduler from the GC algorithm; using time as both the trigger and the GC work metric provides a simple interface between the GC and the scheduler. Also, as time is easy to measure directly, time-based GC scheduling fits very well into a feedback scheduling framework.

¹ This is not the case for semi-concurrent scheduling, see Section 3.4.

² That, of course, still requires worst-case execution time analysis.

3.2 GC cycle time calculation

With time-triggered garbage collection, there is no direct connection between the GC scheduling and the application, so the GC cycle time is the only parameter that controls the progress of the garbage collector. Thus, a time-triggered GC needs correct (or conservative) cycle time estimates in order to make real-time guarantees, as each garbage collection cycle must be completed before the application runs out of memory. This section shows how an upper bound on the GC cycle time, which guarantees that the application never runs out of memory, can be calculated.

The following symbols will be used in this section: period time ($T$), frequency ($f$), heap size ($H$), total amount of allocated memory on the heap ($A$), amount of memory allocated during this cycle ($a$), free memory ($F$), live objects ($L$), floating garbage³ ($G$), amount of memory reclaimed this cycle ($r$), the set of threads ($P$), and the maximum allocation per period of thread $j$ ($a_j$).

Lemma 1. For a set of processes, $P$, where each thread $j$ has frequency $f_j$ and allocation requirements of $a_j$ bytes per period, and with $F$ bytes of memory available at the start of the GC cycle, it is guaranteed that the cycle will be completed before the available memory is exhausted if the GC cycle time, $T_{GC}$, satisfies

$$T_{GC} \le \frac{F - \sum_{j \in P} a_j}{\sum_{j \in P} f_j \cdot a_j} \tag{3.1}$$

Proof. A GC cycle must finish before the available memory at the start of the cycle has been allocated. That is,

$$a = \sum_{j \in P} \left\lceil \frac{T_{GC}}{T_j} \right\rceil \cdot a_j \le F \tag{3.2}$$

where the ceiling is necessary to cover the worst case schedule. A slightly stronger condition is

³ Floating garbage is objects that are no longer reachable by the mutator but are still believed to be live by the collector. For example, objects that die shortly after they have been marked will not be reclaimed until the next GC cycle.


$$\sum_{j \in P} \left( \frac{T_{GC}}{T_j} + 1 \right) \cdot a_j \le F \tag{3.3}$$

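Substituting $f_j = \frac{1}{T_j}$ we get

$$\sum_{j \in P} (T_{GC} \cdot f_j + 1) \cdot a_j = T_{GC} \sum_{j \in P} f_j \cdot a_j + \sum_{j \in P} a_j \le F \tag{3.4}$$

$$\therefore \quad T_{GC} \le \frac{F - \sum_{j \in P} a_j}{\sum_{j \in P} f_j \cdot a_j}$$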
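As an illustration of Lemma 1, the following Python sketch evaluates the bound in Equation (3.1) and checks it against the exact worst-case allocation of Equation (3.2). The function names and the task set are hypothetical and chosen only for the example; they do not come from the LJRT implementation.

```python
import math

def gc_cycle_time_bound(free_bytes, threads):
    """Upper bound on T_GC according to Equation (3.1).

    threads is a list of (f_j, a_j) pairs: frequency in Hz and
    worst-case allocation in bytes per period.
    """
    per_period = sum(a for _, a in threads)        # sum of a_j
    alloc_rate = sum(f * a for f, a in threads)    # sum of f_j * a_j
    if alloc_rate <= 0:
        raise ValueError("at least one thread must allocate memory")
    return (free_bytes - per_period) / alloc_rate

def worst_case_allocation(t_gc, threads):
    """Left-hand side of Equation (3.2), using T_GC / T_j = T_GC * f_j."""
    return sum(math.ceil(t_gc * f) * a for f, a in threads)

# Hypothetical task set: (frequency in Hz, bytes allocated per period).
threads = [(100.0, 200), (10.0, 4000)]
F = 1_000_000                       # free bytes at cycle start (assumed)

t_gc = gc_cycle_time_bound(F, threads)
print(f"T_GC <= {t_gc:.3f} s")
print("worst-case allocation:", worst_case_allocation(t_gc, threads))
```

With these numbers the bound is roughly 16.6 s, and the worst-case allocation during a cycle of that length stays below F, as the lemma requires.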

The amount of free memory needs some further discussion. Since any incremental garbage collector suffers from the problem of floating garbage, we must take that into account when calculating the worst case amount of memory available at the start of a GC cycle ($F_{\min}$). Or, put differently, we may not be able to use all the free memory during a cycle if we want to be sure that there is also enough memory for the next cycle, as the amount of memory that is reclaimed by the garbage collector can vary from one cycle to another due to floating garbage. Let us now examine floating garbage in more detail.

Lemma 2. Let $a_n$ be the amount of memory that is allocated during the $n$th GC cycle and $L_{\max}$ be the maximum amount of live memory. Then, the sum of live memory and floating garbage at the start of cycle $n + 1$ satisfies the inequality

$$L_{n+1} + G_{n+1} \le L_{\max} + a_n \tag{3.5}$$

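Proof. Let $\delta_n$ be the net change in live memory during cycle $n$:

$$L_{n+1} = L_n + \delta_n \tag{3.6}$$

Let $u_n$ be the amount of memory that becomes unreachable during cycle $n$. Then,

$$\delta_n = a_n - u_n \implies u_n = a_n - \delta_n \tag{3.7}$$

which gives

$$\left. \begin{aligned} G_{n+1} &\le u_n = a_n - \delta_n \\ L_{n+1} &= L_n + \delta_n \end{aligned} \right\} \implies L_{n+1} + G_{n+1} \le L_n + a_n \tag{3.8}$$

But $\forall n,\ L_n \le L_{\max}$, which concludes the proof.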


In order to make hard guarantees, we must determine the maximum amount of memory that can be allocated during a GC cycle without risking that the system runs out of memory due to floating garbage.

Lemma 3. Let $H$ be the heap size and $L_{\max}$ be the maximum amount of live memory. Then, the maximum amount of memory that can be safely allocated during a GC cycle is

$$a_{\max} = \frac{H - L_{\max}}{2} \tag{3.9}$$

Proof. The heap contains allocated and free memory

$$H = A + F = L + G + F \tag{3.10}$$

and therefore,

$$F = H - (L + G) \tag{3.11}$$

Applying Lemma 2 to (3.11) gives that, at the start of any GC cycle,

$$F \ge H - (L_{\max} + a_{\max}) = F_{\min} \tag{3.12}$$

Thus, the worst case occurs when $L = L_{\max}$, and the remainder of the proof makes this assumption. Then the system has to be in steady state⁴ and the maximum amount of floating garbage during a worst case cycle is

$$G^{WC}_{\max} = a_{\max} \tag{3.13}$$

An upper bound on the amount of memory allocated during a GC cycle must, of course, not be greater than the minimum amount of available memory, so the trivial bound is $a_{\max} \le F_{\min}$. We will now prove the equality. Objects that are floating garbage at the start of cycle $n$ will have been reclaimed by the start of cycle $n + 1$, which means that

$$F_{n+1} \ge G_n \tag{3.14}$$

The amount of available memory at the start of cycle $n + 1$ is

$$F_{n+1} = F_n - a_n + r_n \tag{3.15}$$

Cycle $n$ is a worst case cycle ($F_n = F_{\min}$) iff the amount of floating garbage at the start of the cycle is at the maximum ($G_n = G^{WC}_{\max}$). In the worst case, $r_n = G_n$, which corresponds to equality in (3.14). Applying this to Equation (3.15) gives

$$F_{n+1} = F_{\min} - a_n + G^{WC}_{\max} = G^{WC}_{\max} \implies a_n = F_{\min} \tag{3.16}$$

⁴ I.e., for each allocated object, another object becomes unreachable.


Consequently, we can allocate all available memory during a worst case cycle while still guaranteeing that the amount of available memory at the start of the following cycle is no less than $F_{\min}$. I.e.,

$$a_{\max} = F_{\min} \tag{3.17}$$

Finally, Equations (3.12) and (3.17) give $a_{\max} = H - L_{\max} - a_{\max}$, that is,

$$a_{\max} = \frac{H - L_{\max}}{2}$$

Because the amount of floating garbage may vary, depending on how the execution of the application and the garbage collector are interleaved, the amount of memory reclaimed will also vary from cycle to cycle. Therefore, we cannot always allocate all of the available memory if we want to guarantee that the system never will run out of memory. Consequently, the length of the garbage collection cycles must be calculated based on the worst case amount of available memory.

Theorem 1. For a set of processes, $P$, where each thread $j$ has frequency $f_j$ and allocation requirements of $a_j$ bytes per period, and with a maximum total amount of live memory $L_{\max}$, it is guaranteed that every GC cycle will be completed before the available memory is exhausted if the cycle time, $T_{GC}$, satisfies

$$T_{GC} \le \frac{\dfrac{H - L_{\max}}{2} - \sum_{j \in P} a_j}{\sum_{j \in P} f_j \cdot a_j} \tag{3.18}$$

Proof. The theorem follows from lemmas 1 and 3.

Remark. The term $\sum a_j$ is typically very small compared to the amount of memory available for allocation. (If not, heap occupancy is very high, a situation which is generally avoided, as it causes GC thrashing, increasing the CPU overhead of the GC.) Therefore, under normal circumstances, and for most practical purposes, it is safe to disregard this term, to get the simplified expression

$$T_{GC} \le \frac{H - L_{\max}}{2 \cdot a} \tag{3.19}$$

where $a$ is the total allocation rate of the mutator.
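To give a feeling for how close the simplified expression is to the full bound, the following sketch evaluates Equations (3.18) and (3.19) for a hypothetical configuration. The heap size, live memory and task set are assumptions made for the example only.

```python
def theorem1_bound(heap, live_max, threads):
    """Cycle time bound of Equation (3.18).

    threads is a list of (f_j, a_j) pairs: frequency in Hz and
    worst-case allocation in bytes per period.
    """
    a_max = (heap - live_max) / 2                  # Equation (3.9)
    per_period = sum(a for _, a in threads)        # sum of a_j
    alloc_rate = sum(f * a for f, a in threads)    # sum of f_j * a_j
    return (a_max - per_period) / alloc_rate

def simplified_bound(heap, live_max, alloc_rate):
    """Simplified cycle time bound of Equation (3.19)."""
    return (heap - live_max) / (2 * alloc_rate)

# Hypothetical system: 4 MB heap, at most 1 MB live, two periodic threads.
threads = [(100.0, 200), (10.0, 4000)]
rate = sum(f * a for f, a in threads)              # 60000 bytes/s

exact = theorem1_bound(4_000_000, 1_000_000, threads)
simple = simplified_bound(4_000_000, 1_000_000, rate)
print(f"Equation (3.18): {exact:.2f} s, Equation (3.19): {simple:.2f} s")
```

In this example the two bounds are 24.93 s and 25.00 s. Since dropping the $\sum a_j$ term can only increase the bound, a cautious implementation might use Equation (3.18) whenever the per-period allocations are known.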

For an example of how varying amounts of floating garbage affect the amount of available memory, see Figure 3.1. Note that, somewhat counter-intuitively, the dangerous case is when there is less than the worst case amount of floating garbage, as this could lead to a situation where we allocate too much memory if care is not taken to avoid that.


Assume that at the start of the $n$th GC cycle there is $L_{\max}$ = 50 % live memory (black), $G$ = 25 % floating garbage (dark gray) and $F_{\min}$ = 25 % available memory (white):

When the free memory has been allocated, the floating garbage and some of the objects that died during this cycle have been marked as garbage that will be reclaimed in this cycle (light gray), and some of the old objects become floating:

The GC cycle is concluded (i.e., the objects that are not to be reclaimed are compacted and a continuous area of available memory is formed). Note that during this cycle, we reclaimed more than $F_{\min}$:

Therefore, we cannot use all the free memory during cycle $n + 1$, as that might result in less than $F_{\min}$ available memory in cycle $n + 2$. The solution is to reserve a part of the memory (striped) so that we only allocate $a_{\max} = F_{\min}$.

At the end of cycle $n + 1$:

The cycle is finished and the reserved memory is made available:

This cycle, we reclaimed less than $F_{\min}$, but the amount of reclaimed memory plus the reserved memory equals $F_{\min}$. Thus, the amount of available memory at the start of cycle $n + 2$ is $F_{\min}$ and our worst case assumptions hold.

Figure 3.1: Example of how the amount of floating garbage may vary between cycles and how our reservation strategy guarantees that there always will be at least $F_{\min}$ available memory at the start of a cycle.
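The scenario in Figure 3.1 can be reproduced with a small simulation. The sketch below is a simplified model, not the collector itself: it assumes steady state (live memory constant at $L_{\max}$, so every allocated byte is matched by a byte that dies), and it takes as a hypothetical input the fraction of the memory dying in each cycle that remains as floating garbage.

```python
def simulate(heap, live, floating_fractions, cap_allocation):
    """Return (F_min, free memory at the start of each cycle).

    floating_fractions[n] is the assumed fraction of the memory that dies
    in cycle n but is not reclaimed until the following cycle.
    """
    a_max = (heap - live) / 2            # Equation (3.9)
    f_min = heap - (live + a_max)        # Equation (3.12)
    floating = a_max                     # start in the worst case, G = a_max
    free = heap - live - floating
    history = [free]
    for phi in floating_fractions:
        # With the reservation strategy at most a_max is allocated per cycle.
        alloc = min(free, a_max) if cap_allocation else free
        new_floating = phi * alloc       # dies this cycle, reclaimed next cycle
        reclaimed = floating + (alloc - new_floating)
        free = free - alloc + reclaimed  # Equation (3.15)
        floating = new_floating
        history.append(free)
    return f_min, history

# Heap of 100 units with 50 units live, as in Figure 3.1.
f_min, capped = simulate(100, 50, [0.4, 1.0, 1.0], cap_allocation=True)
_, uncapped = simulate(100, 50, [0.4, 1.0, 1.0], cap_allocation=False)
print("F_min:", f_min)
print("with reservation:   ", capped)
print("without reservation:", uncapped)
```

In this model, the capped run never starts a cycle with less than $F_{\min}$ = 25 units of free memory, whereas the uncapped run allocates all 40 free units in the second cycle and starts the third with only 10.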


It may seem that the limit on the amount that may be allocated during a garbage collection cycle may cause unnecessarily low memory utilization, but this isn’t the case; the limit on the amount of memory that may be allocated during a GC cycle expressed in Equation (3.9) only affects the cycle time calculations. It is true that in the best case (when we have no floating garbage) at most half of the available memory is allocated during a cycle, but this has nothing to do with the total memory utilization. If the GC cycle time is reduced, the amount of allocation per cycle (and, consequently, the maximum amount of floating garbage) is also reduced. This means that if both high allocation rates and high memory utilization are required, the GC cycles will be short, but as long as $L_{\max} < H$ and there is enough CPU time to accommodate both application and GC, the system is guaranteed to work.

3.3 GC work calculation

In order to schedule an incremental or concurrent garbage collector so that it will finish at a certain time or after a certain amount of memory has been allocated, the amount of garbage collection work required to complete a GC cycle must be known. We will now examine how GC work can be expressed.

The purpose of a GC work metric is to use quantities that can be directly measured to approximate the temporal behaviour of the garbage collector as closely as possible. However, somewhat surprisingly, the real-time GC literature does not pay much attention to work metrics, and is often content with using some high level abstraction, e.g., the number of “scanned objects”, to measure GC progress. Scanning the heap is defined as doing all the GC work to complete a GC cycle. Thus, for a multi-pass GC, like for instance a mark-sweep collector, scanning involves both the mark and sweep phases. This is a way of dodging the metric problem altogether, as it does not define which quantities should be measured in order to calculate the GC work.

When studying incremental garbage collectors without hard real-time requirements, the focus is on ensuring GC progress while keeping the average GC pause time reasonably short. In a traditional, allocation-triggered garbage collector, where garbage collection work is performed in conjunction with each allocation and in proportion to the size of the requested object, it is enough to prove that the metric is conservative. Unfortunately, when applying the same incremental techniques to real-time systems, it is not enough that the GC work metric is conservative; if we want upper bounds on GC pause times, we must also have upper bounds on how conservative the work metric is. If a poor metric is used, a real-time algorithm may lose its real-time properties. For example, if we have a copying collector and use the number of evacuated objects as the work metric, we might reach a situation where we need to evacuate one more object to complete the current increment. However, this may, in the worst case, require us to scan all the remaining objects in the heap before we find a pointer that causes that last object to be evacuated. Thus, an unsuitable work metric causes the worst case amount of work, in actual execution time, of an increment that is small according to the metric, to be practically unbounded.

3.3.1 Traditional GC work metrics

For an allocation-triggered garbage collector, the minimum GC ratio, $R_{\min}$ (in work units per allocated byte), that will ensure that the GC cycle finishes before the mutator runs out of memory is

$$R_{\min} = \frac{W_{\max}}{F_{\min}}$$

where $W_{\max}$ is the worst case amount of work to complete a GC cycle and $F_{\min}$ is the worst case amount of available memory at the start of a cycle. Let the current GC ratio ($R$) be the ratio between performed work ($W$) and allocated memory ($A$):

$$R = \frac{W}{A}$$

In order to guarantee that the GC finishes on time, we must ensure that the invariant

$$R \ge R_{\min}$$

is satisfied at all times. Now, the problem is, how do we express, and measure, $W$? A common work metric for copying collectors is the evacuation pointer metric, i.e., use the amount of evacuated memory as a measure of performed GC work. Let $\Delta B$ denote the position of the evacuation pointer relative to the start of tospace (i.e., the amount of evacuated memory) and $E_{\max}$ the maximum amount of memory that may need to be evacuated. Then, the amount of performed work, $W$, and the maximum amount of work during a cycle, $W_{\max}$, will be

$$W = \Delta B$$

$$W_{\max} = E_{\max}$$
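The invariant $R \ge R_{\min}$ can be turned into a rule for how much GC work an allocation-triggered collector has to perform at each allocation. The following sketch is one possible reading under the evacuation pointer metric; the function names and the numbers are assumptions made for the example.

```python
def min_gc_ratio(w_max, f_min):
    """R_min = W_max / F_min, in work units per allocated byte."""
    return w_max / f_min

def work_owed(allocated, performed, w_max, f_min):
    """GC work still needed so that R = W / A >= R_min holds."""
    return max(0.0, min_gc_ratio(w_max, f_min) * allocated - performed)

# Illustrative values under the evacuation pointer metric (W_max = E_max).
E_MAX = 50_000    # bytes that may need to be evacuated (assumed)
F_MIN = 25_000    # worst-case free bytes at the start of a cycle (assumed)

performed = 0.0
allocated = 0
increments = []
for request in [100, 100, 5_000, 100]:    # the third request is a burst
    allocated += request
    owed = work_owed(allocated, performed, E_MAX, F_MIN)
    performed += owed                      # the collector catches up at once
    increments.append(owed)

print(increments)
```

The third increment is fifty times larger than the others, which illustrates how a bursty allocation pattern translates directly into a bursty GC execution pattern.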


Unfortunately, this metric doesn’t model the temporal behaviour of the garbage collector very well. For each allocation, an amount of garbage collection work, according to the metric, has to be performed. However, since GC progress is measured in the amount of evacuated objects, any GC activity that doesn’t cause new objects to be evacuated will not be captured by the metric. For example, tracing objects that contain only pointers to already evacuated objects will not increase $W$. In a worst case scenario, evacuating one single object may require scanning all remaining objects on the heap. Thus, this metric may, in the worst case, cause an incremental collector to have a behaviour close to that of a batch GC.

This problem is described in [Hen98], and Henriksson presents an improved evacuation pointer metric which also takes scanning of objects and roots as well as initialization of reclaimed memory into account. The improved metric, as used in his semi-concurrent GC scheduling, is

$$W = \alpha \cdot roots + \beta \cdot \Delta S + \Delta B + \gamma \cdot \Delta P$$

$$W_{\max} = \alpha \cdot roots_{\max} + \beta \cdot E_{\max} + E_{\max} + \gamma \cdot M_{HP}$$

where $S$ is the amount of scanned memory and $P$ is the amount of initialized memory. The constants $\alpha$, $\beta$ and $\gamma$ depend on the implementation of the algorithm, and $roots_{\max}$ and $E_{\max}$ depend on the application. These have to be manually tuned in order to make the discrepancy between the metric and the actual execution time as small as possible. For a compacting mark-sweep collector, a similar GC work metric looks as follows

$$W = \alpha \cdot (roots + mark) + \beta \cdot sweep + \gamma \cdot compact$$

$$W_{\max} = \alpha \cdot (roots_{\max} + live_{\max}) + \beta \cdot heapsize + \gamma \cdot live_{\max}$$
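To make the structure of the improved evacuation pointer metric concrete, the sketch below evaluates the two expressions for the copying collector. The constants and quantities are invented for the example and are not tuned values from [Hen98]; the parameter `m_hp` simply stands for the $M_{HP}$ term as it appears in the formula.

```python
from dataclasses import dataclass

@dataclass
class MetricConstants:
    """Implementation-dependent weights (hypothetical values are used below)."""
    alpha: float   # cost weight for root scanning
    beta: float    # cost weight for scanning memory
    gamma: float   # cost weight for initializing memory

def performed_work(c, roots, scanned, evacuated, initialized):
    """W = alpha*roots + beta*dS + dB + gamma*dP."""
    return c.alpha * roots + c.beta * scanned + evacuated + c.gamma * initialized

def max_work(c, roots_max, e_max, m_hp):
    """W_max = alpha*roots_max + beta*E_max + E_max + gamma*M_HP."""
    return c.alpha * roots_max + c.beta * e_max + e_max + c.gamma * m_hp

c = MetricConstants(alpha=2.0, beta=0.5, gamma=0.25)
w = performed_work(c, roots=10, scanned=100, evacuated=80, initialized=40)
w_max = max_work(c, roots_max=20, e_max=1000, m_hp=400)
print(f"W = {w}, W_max = {w_max}, progress = {w / w_max:.1%}")
```

Note that, unlike the plain evacuation pointer metric, scanning that evacuates nothing still increases `w` here through the `beta` term.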

On the other end of the scale are the concurrent algorithms that have a separate GC thread which performs garbage collection in parallel with the mutator. In this case, the collector thread is run without synchronization with the mutator, in the sense that it does GC work until the cycle is complete and then waits for another cycle to be triggered.

3.3.2 Using time as the GC work metric

As the purpose of a GC work metric is to approximate the execution time required to complete a GC cycle as closely as possible, the optimal GC work metric is the actual execution time used. This is the approach chosen here: time is used both as the trigger for the garbage collector and as the GC work metric in the actual run-time system (i.e., the total GC work of a cycle is the CPU time the system has to spend on performing garbage collection). This has, to our knowledge, not previously been done.

By using time as the GC work metric, the amount of performed work can be measured directly, which eliminates all errors in the performed work metric. The total amount of CPU time required to complete a GC cycle has to be calculated using standard worst case execution time analysis techniques⁵. Then the GC scheduling will be independent of both the application and GC implementation, and the problems with bursty allocation patterns and imperfect GC work metrics are avoided. An additional advantage is that no assumptions about the GC algorithm, implementation or application behaviour are hard-wired into the GC work metric⁶.

Another important result of using CPU time as the GC work metric is that the GC work calculations are made on a per cycle instead of a per increment basis. Thus, if the $W_{\max}$ estimates are conservative, the additional overhead will be distributed evenly across the GC cycle instead of causing individual increments to be too long as described above. Hence, using time as the GC work metric helps mitigate the negative effects of using a conservative GC work metric when using an incremental GC.

Also, using execution time as the GC work metric together with time-triggered garbage collection scheduling makes it easier to integrate the GC scheduling with the application process scheduler, since the two scheduling parameters, execution time and deadline, are explicit in the model. Thus, the GC thread can be scheduled like any other thread in EDF as well as fixed-priority systems. It also fits well into a feedback scheduling system, as it makes the execution time requirements of the garbage collector explicit. Finally, it has the advantage that it makes it possible to incorporate other factors that affect the GC execution time, but are not directly tied to the garbage collection algorithm (e.g., caches, pipelines, etc.), into the GC work calculations and measurements.

⁵ Note that this requirement is no restriction in relation to traditional real-time garbage collection techniques; if we want to be able to make hard real-time guarantees, we have to do worst case analysis. If this is not possible, it may be better to use some adaptive technique, as described in Chapter 4.

⁶ Of course, these aspects affect the GC workload and have to be taken into account when calculating the GC workload, but having a generic metric allows us to separate, e.g., the GC scheduler from the GC algorithm.


3.4 Scheduling

This section discusses how time-triggered GC scheduling can be implemented in fixed priority and deadline based systems, respectively, and how the general process scheduling policy affects the garbage collection scheduling. It also relates time-triggered GC scheduling to semi-concurrent scheduling and handling of background tasks.

Based on the cycle time calculations presented in Section 3.2, we can use standard scheduling techniques (e.g., RMS or EDF) and schedule the GC as any other thread, since the scheduling of individual GC increments is implicit; the only real requirement is that the GC cycle has ended and enough memory is made available before the application runs out of memory. As the deadline is the sole scheduling parameter, this means that the GC work calculations are only needed for schedulability analysis and not for ensuring GC progress at run-time. Hence an error in the metric alone cannot cause the GC to run too slowly, which gives a more robust system. If the system is schedulable, the GC will finish on time, without causing any other thread to miss its deadline.

In systems where hard real-time tasks co-exist with background tasks without timing requirements, we want hard guarantees that the GC always will make memory available to the real-time tasks on time, but we also want to avoid unnecessary disturbance of the background tasks. Conversely, we want to protect the GC from the background tasks in the sense that allocations performed by a background task must not cause the GC to miss its deadline or fail to make enough memory available. These problems are addressed by the semi-concurrent GC scheduling strategy. The effects of incorporating time-triggered GC and semi-concurrent scheduling will now be examined.

When implementing a semi-concurrent garbage collector under the aforementioned scheduling policies, the main difference is that in a fixed priority system we must explicitly schedule each GC increment in order to spread the garbage collection overhead evenly across the cycle. That is, each time the garbage collector is invoked, it has to determine how long that increment should be (according to the metric used) and, when enough work has been performed, the GC must suspend itself until the next increment is triggered. Otherwise, the garbage collector thread might starve low priority threads for long periods of time. In an EDF system, the scheduling of GC increments can be left to the process scheduler, as there are no fixed priorities and, thus, no risk of starvation.

A consequence of the requirement that the garbage collector must determine the length of each increment is that the actual scheduling will depend on both the cycle time and the work metric. In an EDF system, the only scheduling parameter is the deadline, and the garbage collection thread can be scheduled like any other thread. Therefore the run-time scheduling is independent of the work metric and worst-case analysis, which is a big advantage in practice, as worst-case analysis often is based on measurements rather than exact analysis.

A problem with using allocation-triggered, concurrent GC in hard real-time systems is that it is necessary to reserve a certain amount of memory for allocations of the high priority processes. Without a safety margin it is impossible to guarantee that schedulability will not be jeopardized due to special effects near the end of GC cycles [Hen98].

The reason that a safety margin is required is that when using fixed-priority scheduling, the garbage collector is never allowed to interrupt a high priority thread. Without a safety margin, the system could reach a state where there is memory left (and, thus, the cycle is not yet finished) but not enough memory for all of the allocations of a high priority thread during its execution. Since GC work is suspended during the execution of high priority threads, activating a high priority thread at such an instant would cause the system to run out of memory which, in turn, causes “panic” stop-the-world GC. Therefore it was necessary to reserve enough memory for the worst case allocation requirements of the HP threads during the maximum response time of the GC thread.

With time-triggered GC, on the other hand, this would not be a problem. As the deadline of the GC thread is explicit in the model, traditional schedulability analysis could be performed and the safety margin would not be necessary.

3.4.1 Fixed priority scheduling

In a fixed priority system, a higher priority thread always gets precedence over lower priority threads. Therefore, a semi-concurrent GC must spread the GC work evenly across the whole cycle and not do more work in each increment than absolutely necessary, in order to avoid subjecting threads that run with a lower priority than the GC thread to unnecessary starvation and excessive jitter. Thus, some GC work metric has to be used to determine if the garbage collector has made enough progress.

Naturally, for a given GC cycle time, $T_{GC}$, all the garbage collection work required to complete a GC cycle has to be performed before $T_{GC}$ seconds have elapsed. In order to ensure sufficient GC progress, the GC scheduler must maintain the invariant

$$\sum w \ge W_{\max} \cdot \frac{t - t_{cycle\ start}}{T_{GC}} \tag{3.20}$$


That is, the fraction of GC work performed should be greater than or equal to the fraction of the cycle time elapsed. This corresponds to Equation (2.4) on page 31 with time instead of allocations as the trigger on the right hand side. Scheduling garbage collection according to this invariant ensures that progress will be made at a well-defined rate regardless of if, and when, the application allocates memory.
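As a rough illustration of how a fixed-priority GC thread could use invariant (3.20) to decide the length of an increment, consider the following sketch. It assumes that time is used as the work metric, so that work is measured in CPU seconds; the function names are hypothetical, and clamping the elapsed time to the interval [0, T_GC] is a choice made for this example.

```python
def work_due(now, cycle_start, t_gc, w_max):
    """Right-hand side of invariant (3.20): work that must be done by `now`."""
    elapsed = min(max(now - cycle_start, 0.0), t_gc)   # clamp to [0, T_GC]
    return w_max * elapsed / t_gc

def next_increment(now, cycle_start, t_gc, w_max, work_done):
    """GC work to perform now so that the invariant holds again."""
    return max(0.0, work_due(now, cycle_start, t_gc, w_max) - work_done)

T_GC = 2.0     # GC cycle time in seconds (assumed)
W_MAX = 0.4    # worst-case GC CPU time per cycle in seconds (assumed)

# Half a second into the cycle, 0.06 s of GC work has been performed.
inc = next_increment(now=0.5, cycle_start=0.0, t_gc=T_GC, w_max=W_MAX,
                     work_done=0.06)
print(f"run the collector for {inc * 1000:.0f} ms, then suspend")
```

Since the required work grows linearly with elapsed time, the increment length does not depend on when or how much the application allocates.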

3.4.2 EDF scheduling

The first property of semi-concurrent scheduling, non-intrusiveness, is inherent in the EDF model; if the requested CPU utilization is less than 100%, all deadlines will be met.
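Treating the GC as one more periodic task makes the schedulability test a matter of adding its utilization to that of the application. The sketch below uses the standard EDF utilization test for periodic tasks with deadlines equal to their periods; modelling the GC thread with execution time $W_{\max}$ and period $T_{GC}$ is an assumption of this example, as are the task parameters.

```python
def edf_utilization(tasks, gc_wcet, t_gc):
    """Total requested CPU utilization, including the GC thread.

    tasks is a list of (C_i, T_i) pairs: worst-case execution time and
    period in seconds. The GC is modelled as a task (W_max, T_GC).
    """
    return sum(c / t for c, t in tasks) + gc_wcet / t_gc

def edf_schedulable(tasks, gc_wcet, t_gc):
    """EDF utilization test: all deadlines are met if U <= 1."""
    return edf_utilization(tasks, gc_wcet, t_gc) <= 1.0

# Hypothetical application threads: (execution time, period) in seconds.
tasks = [(0.002, 0.010), (0.020, 0.100)]

u = edf_utilization(tasks, gc_wcet=0.4, t_gc=2.0)
print(f"U = {u:.2f}, schedulable: {edf_schedulable(tasks, 0.4, 2.0)}")
print("with a 1.5 s GC workload:", edf_schedulable(tasks, 1.5, 2.0))
```

The second call shows the other side of the trade-off: with the same cycle time but a much larger GC workload, the requested utilization exceeds 100% and the test fails.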

The second property of the semi-concurrent model, isolating the high priority threads from the low priority ones, and thus not having to do worst-case analysis on the LP threads, can in an EDF system be achieved by using Constant Bandwidth Servers (CBS) with the addition of a priority, or importance, attribute for the servers. Then, the HP and LP threads in the semi-concurrent model would correspond to HP and LP servers.

In such a model, the threads running on HP servers would just do allocations without any GC penalty, while the threads on the LP servers would do incremental GC at allocation time. When incremental GC is performed due to an LP allocation, both the deadline and execution time of the GC thread should be decreased, as the memory allocation has reduced the amount of available memory and the incremental GC work has brought the GC cycle closer to its finish. Moving deadlines to an earlier point in time is, however, not allowed in an EDF system in the general case, as this causes a temporary increase in the requested CPU utilization and might lead to missed deadlines. This could be solved by temporarily reducing the bandwidth of the LP server by a corresponding amount or, if the remaining CPU time in the LP server's budget is too low, delaying the allocation that would cause incremental GC work until the next CBS period. In practice, however, this is not a problem, as the GC cycles typically are much longer than the period times of the application threads, and therefore the deadlines and/or server bandwidths can be adjusted at the thread release times, when it is safe to do so.

Another way to make sure that the memory management overhead never may cause the critical parts of the application to miss their deadlines is presented in Chapter 5. By introducing priorities for memory allocations, the run-time system is able to automatically prioritize memory allocation requests (i.e., deny non-critical allocations) in order to guarantee that the system will not run out of memory or become unschedulable because of a too high GC workload. In essence, this can be viewed as dividing the application into critical aspects, which are guaranteed to be executed on time, and non-critical aspects, which are only executed if it is safe to do so.

3.5 Summary

A new way of scheduling garbage collection work in real-time systems was presented; instead of using allocation as the trigger for GC work, time is used, and instead of ensuring that every GC cycle finishes before all available memory has been allocated, garbage collection is scheduled in a way that gives a fixed GC cycle time.

This approach leads to a number of desirable properties: It makes it easy to spread the garbage collection work evenly across the GC cycle. Consequently, a time-triggered GC does not suffer from the bursty execution pattern, due to the application performing allocations in bursts, that an allocation-triggered GC does.

As the most important scheduling parameter, the deadline, is explicit in the model, a time-triggered GC can be scheduled as any other process in both fixed-priority and EDF systems with real-time requirements. It is shown how a GC cycle time that guarantees that the application never runs out of memory can be calculated based on the amount of live memory and allocation rate of the application.

The metrics used to measure garbage collection work in previous real-time garbage collectors often fail to model the temporal behaviour of the garbage collector, which may cause poor real-time performance. By using time as the GC work metric, such inaccuracies can be avoided, as time can be measured directly. This also makes it suitable for use in a feedback scheduling environment.

CHAPTER 4

ADAPTIVE GARBAGE COLLECTION SCHEDULING

Worst case analysis is, in the general case, difficult even for relatively small programs and for a concurrent garbage collector it is even harder, as the execution time of the garbage collector not only depends on the GC implementation and application code per se, but also on the thread scheduling, which affects both how the application and GC are interleaved and in what order memory allocations are performed and consequently where on the heap the objects are placed. Furthermore, the execution time of the memory manager depends on memory performance, which is a big source of non-determinism on a modern computer system with caches, etc. Even if worst-case analysis could be performed it may be quite pessimistic, leading to unacceptably low CPU utilization. Using feedback control, on the other hand, makes it possible to exploit varying resource utilization among the application threads, allowing better overall utilization of both CPU and memory. If the CPU overhead of memory management is made explicit in a feedback scheduling system, that information can be measured at run-time and taken into account when scheduling the application threads¹.

We will now investigate how a time-triggered GC can be made auto-tuning by estimating the scheduling parameters of the GC thread at run-time. Section 4.1 gives an introduction to the problem to motivate the work, and gives an overview of the proposed approach. In order to schedule a task, two parameters are needed: its deadline and its execution time. Section 4.2 shows how the cycle time can be automatically tuned and Section 4.3 discusses how the amount of CPU time required to complete a GC cycle can be estimated.

¹ An approach to incorporating an auto-tuning GC into a feedback scheduler is suggested in Chapter 6.


4.1 Introduction

Manual tuning of GC scheduling parameters is based on certain assumptions about the heap usage pattern of a particular application. Tuning a real-time GC requires a great engineering effort and is usually only practically feasible for safety-critical, hard real-time systems with a small number of simple processes and not for larger systems or systems with less rigorous safety requirements.

In order to achieve greater flexibility and allow a larger number of diverse applications to run with adequate performance without requiring huge engineering efforts to tune the GC, we investigate whether it is possible to make the GC scheduler auto-tuning, which would let us run applications with real-time performance without any a priori analysis. We should also not forget that hard real-time guarantees are only as good as the worst case assumptions they are based on, so if the worst case estimates are wrong the system will fail even if the scheduling algorithms and GC work metrics used are correct. This implies that using an adaptive strategy may result in a more robust system compared to a manually tuned system where the worst-case estimates have been found using measurements and a safety margin.

The proposed adaptive garbage collection scheduling model consists of two auto-tuners: the GC cycle time (deadline) estimation and the GC work (execution time) estimation. The cycle time estimation is used directly to determine the deadline of the GC thread (which is used by the scheduler for the actual scheduling, either directly, as in the EDF case, or indirectly, when using RMS scheduling). The execution time estimation is only needed if the GC is to be used in a semi-concurrent system, where it is needed to determine the length of the increments, or in a feedback scheduling system, where the execution time is used in the on-line schedulability analysis required to guarantee that the system remains schedulable.

In a system with garbage collection, allocations can be measured continuously whereas measurements of the heap state are only available after the completion of a GC cycle. Therefore, the proposed approach has the structure sketched in Figure 4.1. The scheduling parameters are tuned based on measurements of the amount of available memory and the allocation rate. The work function describing how the execution time of the GC depends on the heap state is based on previous measurements of the heap state and GC execution time.

In the cycle time tuning, a black-box view on the application is used; the estimates do not depend on any information about the application other than the allocation rate, which can be measured directly. The state of the memory manager, on the other hand, is quite important for the execution time estimation and might therefore be necessary to take into account, either through manual or automatic tuning. Section 4.3 discusses both a black box and a clear box approach to garbage collection work estimation.

[Block diagram. Blocks: GC tuner, Scheduler/Tasks, Heap, Identification, GC work function. Signals: $C_{GC}$, $T_{GC}$, memory operations, heap state, available memory, allocations.]

Figure 4.1: Block diagram of an adaptive GC. Based on measurements of the amount of available memory, the allocation rate of the application, the heap state and the previous execution of GC, the cycle time and execution time of the GC are estimated.

4.2 Automatic GC cycle time tuning

As we have seen in Chapter 3, a GC cycle length that ensures that the application never runs out of memory can be calculated at design-time, if the allocation requirements of the (high priority) mutator threads are known. If that is not practical for some reason (for instance that the application's execution pattern varies greatly depending on operating mode or that it should be run on many different platforms and we do not want to do analysis for all possible target platforms, or even know which platform it will run on) or if we want the GC scheduler to be completely transparent to the developer, we have to use some adaptive technique to automatically tune the GC scheduling parameters on-line.

When doing on-line tuning without any information about the application, the fundamental problem is that the amount of live memory and floating garbage is not known and must be estimated in a safe and robust way. Section 4.2.1 examines, in more detail, the model for how the GC cycle time can be automatically tuned without any a priori information about the application. Section 4.2.2 investigates how the GC scheduling can be improved if some information about the behaviour of the application is available; for instance, through feed-forward of mode changes. The GC cycle length depends on the allocation rate, and as allocations typically are bursty, the allocation rate estimation must be done carefully, which is discussed in Section 4.2.3. Feed-forward from the mutator to the GC scheduler is discussed in Section 4.2.4.

4.2.1 Application-independent auto-tuning

The fundamental requirement on the GC cycle time is that each GC cycle must finish before the application runs out of memory. In an on-line GC tuner, that can be achieved by calculating or measuring the allocation rate ($a$) of the application and extrapolating at which time all the currently remaining free memory ($F$) will have been allocated; that time is the deadline of the current GC cycle.

Let $t_s$ denote the start (release) time of the current cycle and $t_e = D_{GC}$ the deadline of the GC cycle. The GC cycle must end before the time when all free memory will have been allocated. Therefore, at time $t$, $t_s \le t < t_e$, assuming that $a$ is constant, we can extrapolate when all memory has been allocated, and we get the constraint

\[
t_e \le t + \frac{F(t)}{a} \tag{4.1}
\]

which gives the cycle time

\[
T_{GC} = t_e - t_s \le t + \frac{F(t)}{a} - t_s \tag{4.2}
\]
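The extrapolation in (4.1) and (4.2) can be sketched as follows. This is only an illustration: the class and method names are hypothetical, memory is measured in bytes, time in seconds, and the allocation rate is assumed constant as in the text.

```java
/** Sketch of the simple extrapolation of Equations (4.1) and (4.2). Names are illustrative. */
final class SimpleCycleTime {
    private SimpleCycleTime() {}

    /** Latest safe deadline t_e = t + F(t)/a; unbounded if nothing is being allocated. */
    static double latestDeadline(double now, double freeBytes, double allocRate) {
        if (allocRate <= 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return now + freeBytes / allocRate;
    }

    /** Upper bound on the cycle time, T_GC <= t + F(t)/a - t_s. */
    static double maxCycleTime(double now, double freeBytes,
                               double allocRate, double cycleStart) {
        return latestDeadline(now, freeBytes, allocRate) - cycleStart;
    }
}
```

With a constant rate the bound does not depend on when it is evaluated: with 8000 bytes free at the cycle start and 1000 bytes/s allocated, the bound is 8 s both at $t = 0$ and at $t = 2$ (when 6000 bytes remain).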

The simple model of Equation (4.2) will work if the same amount of memory is reclaimed in each GC cycle, but it suffers from the same problems with floating garbage as the fixed deadline case discussed in Chapter 3, although the symptoms are a bit different. With a fixed deadline, the system might run out of memory if the GC cycle time is too long. In an adaptive system where the cycle time is tuned to ensure that this does not happen, the problem is that the system might become unschedulable. One example of this encountered during experiments with this simple model is that if there, for some reason, is much floating garbage during one cycle, little memory will be reclaimed during that cycle². Then, the following cycle will have to be very short and we get a memory trace like the one shown in Figure 4.2. This could cause real-time problems since the required CPU utilization of the GC will be much higher during the short cycles than during the long ones, as the amount of GC work is roughly the same³ in all cycles, but it has to be done in a much shorter time in the short cycles.

² Variation in the amount of floating garbage is mainly a concern when using an incremental-update GC. The conservatism of snapshot-at-the-beginning collectors will give more floating garbage but less variation.

[Plot of free memory versus time.]

Figure 4.2: Example of a very short GC cycle caused by large amounts of floating garbage.

In order to handle the worst case amount of floating garbage, memory must be reserved so that the allocations during the next cycle can be satisfied even if no objects are reclaimed during the current cycle. I.e., the GC cycle must end before all available memory has been allocated.

Theorem 2. Let $\hat{a}(t)$ be an estimate of an unknown but constant allocation rate, $a$, such that $\hat{a}(t) \ge a$. Then, for $t_s(k) \le t < t_e(k)$, the GC cycle will be completed before the available memory is exhausted if the cycle time, $T_{GC}$, satisfies

\[
T_{GC}(t) = \frac{1}{2}\left( t + \frac{F(t)}{\hat{a}(t)} - t_s(k) \right) \tag{4.3}
\]

Proof. Let $a(k)$ be the allocation rate during GC cycle $k$. Then, the amount of memory allocated during cycle $k$ is $A(k) = T_{GC} \cdot a(k)$. In the worst case, no memory is reclaimed during cycle $k$, so $A(k+1)$ bytes must be reserved for the following cycle in order to satisfy all allocations. I.e., the requirement is that

\[
F_s(k+1) \ge T_{GC} \cdot a(k+1) \tag{4.4}
\]

During cycle $k$, i.e., for $t_s(k) \le t < t_e(k)$, the amount of memory available at the start of cycle $k+1$ is

\[
F_s(k+1) \ge F(t) - (t_e(k) - t)\, a(k) \tag{4.5}
\]

with equality in the worst case, that no memory is reclaimed during cycle $k$. Using the equalities in (4.4) and (4.5) we get

\[
T_{GC} \cdot a(k+1) = F(t) - (t_e(k) - t)\, a(k) \tag{4.6}
\]

\[
\therefore\; t_e(k) = t + \frac{F(t) - T_{GC} \cdot a(k+1)}{a(k)} \tag{4.7}
\]

Thus, the GC cycle time estimate is

\[
T_{GC} = t_e(k) - t_s(k) = t + \frac{F(t) - T_{GC} \cdot a(k+1)}{a(k)} - t_s(k) \tag{4.8}
\]

which can be rearranged as

\[
T_{GC} = \frac{F(t) + (t - t_s(k))\, a(k)}{a(k) + a(k+1)} \tag{4.9}
\]

If the allocation rate is constant, i.e., $a(k+1) = a(k)$, we get (4.3).

³ Of course, this depends on the garbage collection algorithm as well as on implementation details. However, the execution time of a garbage collector typically depends on both the amount of retained and reclaimed memory. Even algorithms where there is no explicit free operation, like for instance a copying collector, have a fraction of the cost that is proportional to the amount of reclaimed memory if, e.g., the initialization of memory is taken into account.
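For readers who want the intermediate algebra, the rearrangement from (4.8) to (4.9) and the specialization to (4.3) in the proof of Theorem 2 can be spelled out as follows. No new assumptions are introduced; $a$ denotes the common value of $a(k)$ and $a(k+1)$ in the constant-rate case.

```latex
\begin{align*}
T_{GC} &= t + \frac{F(t) - T_{GC}\, a(k+1)}{a(k)} - t_s(k)
  && \text{(4.8)} \\
T_{GC}\, a(k) &= (t - t_s(k))\, a(k) + F(t) - T_{GC}\, a(k+1)
  && \text{multiply by } a(k) \\
T_{GC}\, \bigl(a(k) + a(k+1)\bigr) &= F(t) + (t - t_s(k))\, a(k)
  && \text{collect } T_{GC} \\
T_{GC} &= \frac{F(t) + (t - t_s(k))\, a(k)}{a(k) + a(k+1)}
  && \text{(4.9)} \\
T_{GC} &= \frac{F(t) + (t - t_s(k))\, a}{2a}
  = \frac{1}{2}\left( t + \frac{F(t)}{a} - t_s(k) \right)
  && a(k+1) = a(k) = a
\end{align*}
```

Since $\hat{a}(t) \ge a$, using the estimate $\hat{a}(t)$ in place of $a$ can only shorten the cycle time, so the bound stays safe.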

If the allocation rate is constant, this means that we should reserve half of the available memory at the start of the current cycle for the allocations during the next GC cycle. Doing so guarantees⁴ that we can handle the worst case, when all the objects that die during a cycle become floating garbage and will not be reclaimed until the end of the next GC cycle. Figure 4.3 shows how the memory trace of the floating garbage example would look with the reservation strategy in place; the cycles are shorter and the floating garbage anomaly in the first cycle has much less impact on the GC cycle lengths.

[Plot of free memory versus time.]

Figure 4.3: Example of how reserving memory for the next cycle mitigates the problems of floating garbage depicted in Figure 4.2.

As GC cycles are shortened, the number of GC cycles increases and consequently the incurred GC overhead increases. However, as we do not use all of the heap, the additional overhead is not as big as it would seem. Also, only allocating at most half of the available memory each GC cycle might seem wasteful, but this is the price we pay for incrementality. It should, however, be noted that this reservation strategy only affects the length of the GC cycles and not the overall memory utilization. If, for instance, the amount of allocated memory is 80% of the heap, the GC cycle length would be set so that 10% of the total memory is reserved for the next cycle.

⁴ Given, of course, that the total amount of live memory is smaller than the heap size.
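The reservation rule of Equation (4.3) and the 80% example can be sketched as below. The class and method names are hypothetical, and the numbers (a heap of 100 memory units, an estimated allocation rate of 2 units per second) are chosen only for illustration.

```java
/** Sketch of the reserving cycle time estimate of Equation (4.3). Names are illustrative. */
final class ReservingCycleTime {
    private ReservingCycleTime() {}

    /**
     * T_GC(t) = (t + F(t)/aHat - t_s) / 2: half of the time it would take
     * to exhaust the free memory, counted from the start of the cycle.
     */
    static double cycleTime(double now, double freeBytes,
                            double allocRateEstimate, double cycleStart) {
        if (allocRateEstimate <= 0.0) {
            throw new IllegalArgumentException("allocation rate estimate must be positive");
        }
        return 0.5 * (now + freeBytes / allocRateEstimate - cycleStart);
    }
}
```

With 80 of 100 units allocated, 20 units are free at the cycle start, so `cycleTime(0, 20, 2, 0)` gives 5 s. During those 5 s the application allocates 10 units (10% of the heap), leaving the other 10 units reserved for the next cycle.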

In (4.3), it is assumed that the allocation rate is constant. For a typical control system with a number of periodic threads running the same control algorithm every sample, that is a reasonable assumption, and in experiments, the GC cycle time estimates have been stable and accurate. Also, note that the assumption that $a$ is constant only means that $T_{GC}$ is chosen to ensure that the allocations can be satisfied at the current rate. If the allocation rate changes, the auto-tuner will change the $T_{GC}$ estimate according to the new allocation rate, to ensure that all allocations can be satisfied under the assumption that allocations will continue at the new rate.

4.2.2 Using information about the application

If the GC cycle times are tuned according to Equation (4.3), the risk of running out of memory due to floating garbage is reduced, but the cycle times, and thus the CPU utilization of the GC, will vary if there are big variations in the amount of floating garbage. In particular, the GC cycle time estimates will, in the average case, be quite conservative. This is due to the fact that if the GC cycle time tuner has no information about the behaviour of the mutator, it cannot differentiate between an unusually large amount of floating garbage and an actual increase in the amount of live memory, where the former should not affect the GC cycle time, but the latter should. Therefore, under the proposed strategy, it must always ensure that no more than half of the available memory at the start of a cycle is used during that cycle. That means that in the extreme case that nearly all of the objects that die during a cycle become floating garbage, the cycle time estimate will be halved, as shown in Figure 4.3. That is, of course, better than without any reservation strategy, but still unnecessarily conservative.

Based on this observation, we will now see how having information about the behaviour of the mutator can improve the GC cycle time estimates. In common special cases, additional information about the state of the mutator and memory system allows using a less conservative GC cycle time estimate. Such special cases include when the application is known to be in steady-state, and allocation and release of large data structures, including creation and deletion of processes.

If the system is known to be in steady-state, the amount of live memory is constant⁵. Then, the variations in available memory at the start of GC cycles are due to variations in floating garbage. The example in Figure 4.4 shows how this information can be used to avoid unnecessary changes to the GC cycle time.

[Plot of free memory versus time, with the difference $\Delta G$ in free memory between cycle starts marked.]

Figure 4.4: Example of how information about the amount of floating garbage allows a less conservative GC scheduling strategy. If we know that the system is in steady state, the difference in free memory at the start of the cycles is due to floating garbage. Thus, during the second cycle, we know that at least $\Delta G$ bytes will be made available after the cycle and we can allow allocation of more than half of the available memory.

Similarly, information about changes in the amount of live memory can be used. While performing worst case live memory analysis in the general case is very difficult, programmers (especially when developing real-time and embedded systems) will have a reasonably good idea about what persistent data structures each process uses. If a mode change requires some data structure to be allocated or causes some other data structure to go out of scope, this is typically known at design time. By informing the GC tuner about this, it can react to the changes in the amount of live memory sooner and in a more accurate way⁶. One special case is process creation: when a process is created, the amount of live memory will increase at a well-defined point in time. Conversely, when a process dies, the amount of live memory will decrease. Typically, a process has a set of persistent objects. E.g., in a control system, a process will typically create a set of objects for inputs, outputs, and control algorithms, and the size of these may be found by compile-time analysis, especially if doing whole-system compilation.

⁵ In practice, that is merely an approximation, as a fraction of the allocated objects are used for temporary results and not for persistent data, which adds small, high-frequency variations to the amount of live memory. Still, if the GC cycle time is much longer than the period times of mutator processes, the impact of temporary objects will be small.

⁶ Having a hint about object liveness is much less dangerous than explicit (manual) deallocation; the former only affects the scheduling of garbage collection, whereas the latter may cause dangling pointers and memory leaks.

Just as in the case where the system was known to be in steady state, if the amount of live memory has changed by a known amount, similar reasoning can be used, with the addition of taking the change in live memory into account: If there is less memory available at the start of a GC cycle than at the start of the previous one, the sum of floating garbage and live memory has changed, and if the change in live memory is known, the change in floating garbage can be calculated.

From the preceding discussion we see that information about changes in the amount of live memory, or knowing that the application is in steady state, makes it possible to estimate the amount of floating garbage. With that information, the GC cycle time estimates can be less conservative, allowing more uniform resource utilization. We will now formalize that idea. In order to reason about differences in the amount of memory reclaimed in different GC cycles, the amount of free memory just after the reclaimed memory has been made available is used. For convenience and clarity of the presentation, let $F_s(k) = F(t_s(k))$ denote the amount of free memory at the start of GC cycle $k$.

If information about the state of the memory system is available, it is possible to generalize Theorem 2 slightly.

Theorem 2a. Let $\hat{a}(t)$ be an estimate of an unknown but constant allocation rate, $a$, such that $\hat{a}(t) \ge a$, and $\Delta F_{ff}(k) \ge 0$ an amount of memory that is known to be reclaimed during GC cycle $k$. For $t_s(k) \le t < t_e(k)$, the GC cycle will be completed before the available memory is exhausted if the cycle time, $T_{GC}$, satisfies

\[
T_{GC}(t) = \frac{1}{2}\left( t + \frac{F(t) + \min(\Delta F_{ff}(k), F_s(k))}{\hat{a}(t)} - t_s(k) \right) \tag{4.10}
\]

Proof. If at least $\Delta F_{ff}(k)$ will be reclaimed during cycle $k$, then

\[
F_s(k+1) \ge F(t) - (t_e(k) - t)\, a + \Delta F_{ff}(k) \tag{4.11}
\]

In order to satisfy the allocations of cycle $k+1$, it must hold that

\[
F_s(k+1) \ge T_{GC} \cdot a \tag{4.12}
\]

In analogy with the proof of Theorem 2, that gives

\[
T_{GC} = \frac{1}{2}\left( t + \frac{F(t) + \Delta F_{ff}(k)}{a} - t_s(k) \right) \tag{4.13}
\]


However, as the GC cycle must still end before the available memory is exhausted, another condition is that

\[
F(t) - (t_e(k) - t)\, a \ge 0 \tag{4.14}
\]

which, by using $t_e(k) = T_{GC} + t_s(k)$ and reorganizing, gives

\[
F(t) - \Delta F_{ff}(k) + (t - t_s(k))\, a \ge 0 \tag{4.15}
\]

But if $a$ is constant, $F(t) + (t - t_s(k))\, a = F_s(k)$ and thus (4.13) is safe if $\Delta F_{ff}(k) \le F_s(k)$.

Otherwise, as any reclaimed memory will not be made available until the next GC cycle, the compensated GC cycle will be too long, and the system will run out of memory. Therefore, if $\Delta F_{ff}(k) > F_s(k)$, the compensating term must be limited. (4.2) is an upper bound on the feasible GC cycle times. I.e., the GC cycle time must satisfy the constraint

\[
T_{GC} = \frac{1}{2}\left( t + \frac{F(t) + X}{a} - t_s(k) \right) \le t + \frac{F(t)}{a} - t_s(k) \tag{4.16}
\]

where $X$ is the compensating term. Reorganizing gives

\[
X \le F(t) + (t - t_s)\, a \tag{4.17}
\]

and, for a constant allocation rate, that is equivalent to

\[
X \le F_s(k) \tag{4.18}
\]

which obviously is satisfied for

\[
X = \min(\Delta F_{ff}(k), F_s(k)) \tag{4.19}
\]

Thus, the amount of memory reserved for cycle $k+1$ can safely be reduced by $\min(\Delta F_{ff}(k), F_s(k))$.
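A minimal sketch of the compensated estimate of Theorem 2a, Equation (4.10), is given below. The class and parameter names are hypothetical; `knownReclaim` stands for $\Delta F_{ff}(k)$ and `freeAtCycleStart` for $F_s(k)$.

```java
/** Sketch of the compensated cycle time of Equation (4.10). Names are illustrative. */
final class CompensatedCycleTime {
    private CompensatedCycleTime() {}

    static double cycleTime(double now, double freeBytes, double freeAtCycleStart,
                            double knownReclaim, double allocRateEstimate,
                            double cycleStart) {
        if (allocRateEstimate <= 0.0) {
            throw new IllegalArgumentException("allocation rate estimate must be positive");
        }
        // The compensating term X of (4.19): never negative, never more than F_s(k).
        double compensation = Math.min(Math.max(0.0, knownReclaim), freeAtCycleStart);
        return 0.5 * (now + (freeBytes + compensation) / allocRateEstimate - cycleStart);
    }
}
```

With 20 units free at the cycle start and a rate of 2 units/s, a known reclaim of 10 units lengthens the cycle from 5 s to 7.5 s. A claimed reclaim of 50 units is capped at $F_s(k) = 20$, giving 10 s, which equals the upper bound of (4.2).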

Given information about the behaviour of the application, the amount of floating garbage can be estimated, and Theorem 2a can be applied in order to reduce conservatism in the GC cycle time.

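Theorem 3. Let $\Delta L(k)$ be the net change in live memory during GC cycle $k$. For $t_s(k) \le t < t_e(k)$, a safe upper bound on the GC cycle time is

\[
T_{GC}(t) =
\begin{cases}
\dfrac{t + \dfrac{F(t) + \min(\Delta G(k-1), F_s(k))}{\hat{a}(t)} - t_s}{2} ; & \Delta G(k-1) > 0 \\[2ex]
\dfrac{t + \dfrac{F(t)}{\hat{a}(t)} - t_s}{2} ; & \text{otherwise}
\end{cases}
\tag{4.20}
\]

where

\[
\Delta G(k) = F_s(k) - F_s(k+1) - \Delta L(k) \tag{4.21}
\]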


Proof. First, consider the change in floating garbage. The heap contains live objects, garbage, and free memory: $H = L + G + F$. As the heap size, $H$, is constant, comparing the heap state at the start of GC cycles $k$ and $k+1$, respectively, gives

\[
L_s(k) + G_s(k) + F_s(k) = L_s(k+1) + G_s(k+1) + F_s(k+1) \tag{4.22}
\]

Introducing the symbols $\Delta L$ and $\Delta G$ and rearranging gives (4.21). Now, for the GC cycle time:

(i) If $\Delta G(k-1) > 0$, the total amount of floating garbage in cycle $k-1$ must have been at least $\Delta G(k-1)$. As floating garbage will be reclaimed in the following cycle, the amount of memory made available before the start of cycle $k+1$ must also be greater than or equal to $\Delta G(k-1)$, and the result follows from Theorem 2a.

(ii) If $\Delta G(k-1) \le 0$, nothing is known about the absolute amount of floating garbage, and $T_{GC}$ must be estimated according to Theorem 2.

Remark. Equation (4.21) estimates the amount of floating garbage based on knowledge about changes in the amount of live memory and the difference in free memory in two GC cycles. That estimate can be improved by using a longer time horizon: If the system is in steady state, a high-water mark of the amount of free memory at the start of a cycle since the system entered steady state gives a minimum for the sum of live and floating objects. Thus, by comparing the current amount of free memory with the high-water mark, a less conservative estimate of the amount of floating garbage is obtained. I.e., if the system has been in steady-state since cycle $j < k$, (4.21) can be replaced by

\[
\Delta G(k) = \max_{i \in [j,k]} \{ F_s(i) - F_s(k+1) \} \tag{4.23}
\]

If feed-forward information about changes in the live memory amount is available, the high-water mark must be adjusted correspondingly.
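The floating garbage estimates (4.21) and (4.23) and the case distinction of (4.20) can be sketched as follows. The class and method names are hypothetical, and the caller is assumed to have recorded the free memory at the start of each cycle.

```java
/** Sketch of Theorem 3: floating garbage estimation and cycle time. Names are illustrative. */
final class FloatingGarbageEstimator {
    private FloatingGarbageEstimator() {}

    /** Equation (4.21): deltaG(k) = F_s(k) - F_s(k+1) - deltaL(k). */
    static double deltaG(double freeAtStartK, double freeAtStartNext, double deltaLive) {
        return freeAtStartK - freeAtStartNext - deltaLive;
    }

    /**
     * Equation (4.23): steady-state variant using the high-water mark of
     * F_s(i) for i = j..k. An empty history yields negative infinity, which
     * makes cycleTime fall back to Theorem 2.
     */
    static double deltaGHighWater(double[] freeAtStartsSinceSteady, double freeAtStartNext) {
        double highWater = Double.NEGATIVE_INFINITY;
        for (double free : freeAtStartsSinceSteady) {
            highWater = Math.max(highWater, free);
        }
        return highWater - freeAtStartNext;
    }

    /** Equation (4.20): compensate only when deltaG(k-1) is positive. */
    static double cycleTime(double now, double freeBytes, double freeAtCycleStart,
                            double deltaGPrevious, double allocRateEstimate,
                            double cycleStart) {
        double compensation = deltaGPrevious > 0.0
                ? Math.min(deltaGPrevious, freeAtCycleStart)
                : 0.0;
        return 0.5 * (now + (freeBytes + compensation) / allocRateEstimate - cycleStart);
    }
}
```

For example, if the free memory at the cycle start dropped from 30 to 20 units in steady state ($\Delta L = 0$), then $\Delta G = 10$ and the next cycle time grows from 5 s to 7.5 s at 2 units/s.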

4.2.3 Estimating allocation rate

In the preceding discussion, it was assumed that the allocation rate, or a conservative estimate of it, was available. We will now briefly examine some properties of allocation rate measurement and how such an estimate can be obtained.

As allocations are discrete events, there is, by definition, no instantaneous allocation rate that can be measured, so for any discussion, an average allocation rate must be used. Allocations are carried out at arbitrary places in the mutator code, so on a short timescale the allocation rate will vary with very high frequency. The GC cycle tuner will typically run at a much slower rate than this, and therefore the allocation rate measurements can be viewed as slow sampling of a signal with high-frequency components. Thus, there is a risk that those high-frequency components introduce low-frequency noise into the allocation rate measurement, through aliasing. Also, if there are multiple processes, and the variations happen to be aliased into frequencies that are close, the effect can be exaggerated by interference beating [AW97].

The allocation rate estimate is used to determine the GC cycle time, so the estimate must not be too low as that may cause the application to run out of memory. Therefore, using simple averaging or a normal low-pass filter is not suitable, as it may smooth out steps in the allocation rate, leading to temporary under-estimation. One method that is simple to implement and has proven to work well in practical experiments is to periodically measure the amount of allocated memory, and filter by using the maximum (averaged over a certain time window) allocation rate, combined with a forgetting factor for the max value, to give suitable responsiveness to changes.
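One possible reading of this filter is sketched below: a moving average over a fixed number of samples, followed by a maximum that decays by a forgetting factor at each sample. The class name, the window length and the exact form of the decay are assumptions made for illustration; the text does not specify them.

```java
import java.util.ArrayDeque;

/** Sketch of a max-with-forgetting allocation rate estimator. Names and details are illustrative. */
final class AllocationRateEstimator {
    private final ArrayDeque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double forgetting; // in [0, 1); closer to 1 forgets more slowly
    private double windowSum = 0.0;
    private double estimate = 0.0;

    AllocationRateEstimator(int windowSize, double forgetting) {
        if (windowSize < 1 || forgetting < 0.0 || forgetting >= 1.0) {
            throw new IllegalArgumentException("invalid filter parameters");
        }
        this.windowSize = windowSize;
        this.forgetting = forgetting;
    }

    /** Feeds the bytes allocated during the last sample period; returns the new estimate. */
    double update(double bytesAllocated, double samplePeriod) {
        double rate = bytesAllocated / samplePeriod;
        window.addLast(rate);
        windowSum += rate;
        if (window.size() > windowSize) {
            windowSum -= window.removeFirst();
        }
        double windowAverage = windowSum / window.size();
        // Steps upward are followed immediately; steps downward decay gradually.
        estimate = Math.max(windowAverage, forgetting * estimate);
        return estimate;
    }
}
```

The asymmetry is the point of the design: an increase in the allocation rate raises the estimate at once, while a decrease only lowers it gradually, so the estimate errs on the high (safe) side.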

By using feed-forward, the estimation can be improved. If threads are periodic, and execute the same code in each invocation, the allocation rate (expressed as allocations per period) can be measured exactly by simply recording the allocations performed by each thread from one release to the next. In that case, there will be no aliasing, as the sampling is synchronized with the allocations. By measuring the allocation rate separately for each periodic thread, interference effects are eliminated. If there are small variations, they can be detected and handled by some filtering (max + a forgetting factor). In addition, if the memory manager is informed about changes to the allocation rate, measurements can be low-pass filtered in order to reduce noise, while still reacting quickly to actual changes. Also, if it is known that a particular allocation is a one-time occurrence (e.g., allocating a large persistent data structure at start-up), it should not affect the allocation rate estimate (although it may affect the amount of live memory).
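The per-thread measurement could be organized as in the following sketch, which keeps, for each periodic thread, the largest number of bytes observed between two releases and sums the resulting rates. The class name, the use of thread names as keys and the plain maximum (without a forgetting factor) are simplifying assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of per-thread allocation rate measurement for periodic threads. Names are illustrative. */
final class PeriodicAllocationTracker {
    private final Map<String, Double> maxBytesPerRelease = new HashMap<>();
    private final Map<String, Double> periodSeconds = new HashMap<>();

    /** Called at each release with the bytes the thread allocated since its previous release. */
    void recordRelease(String thread, double bytesSinceLastRelease, double period) {
        double previousMax = maxBytesPerRelease.getOrDefault(thread, 0.0);
        maxBytesPerRelease.put(thread, Math.max(previousMax, bytesSinceLastRelease));
        periodSeconds.put(thread, period);
    }

    /** Total allocation rate in bytes per second, summed over all recorded threads. */
    double totalRate() {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : maxBytesPerRelease.entrySet()) {
            sum += entry.getValue() / periodSeconds.get(entry.getKey());
        }
        return sum;
    }
}
```

Because each thread is sampled at its own release times, the measurement is synchronized with that thread's allocations, which is what avoids the aliasing discussed above.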

4.2.4 Feed-forward from the application

The results in Section 4.2.2 are based on having information about the operation of the application. In order to satisfy that requirement, this section sketches a set of feed-forward operations for both qualitative and quantitative information about the memory usage of the mutator.


Qualitative feed-forward

The feed-forward has to be provided by the mutator code, where the feed-forward instructions have to be inserted manually or, perhaps, automatically by a tool. In order to be practically feasible, the amount of analysis required for finding the feed-forward information must be kept at a minimum, and therefore, a model requiring only qualitative information about the mutator behaviour is desirable. As we have seen, the simple information about whether the application is in steady state or not is quite valuable to the on-line auto-tuner.

If all threads are in steady state, the amount of live memory is constant (i.e., $\Delta L = 0$), and Theorem 3 may be used. If one or more mutator threads are in a transient state, the GC cycle time estimate must be done according to Theorem 2 until the finish of the GC cycle after the one where all threads had returned to steady state. Conversely, during the GC cycle when a thread enters the transient state, the amount of floating garbage from the cycle before is known, and is known to be reclaimed at the end of the cycle, and the GC cycle time may be calculated using Theorem 3.

Mode changes can cause transients in the memory usage pattern of an application. As discussed, the only continuously available measurements the GC auto-tuner can use are the allocation rate and the amount of allocated memory. Large one-time allocations, e.g., at the start-up of a thread or at a mode change, may cause spikes in the allocation rate measurement, leading to changes in the GC scheduling. Such effects can be mitigated with information that a certain memory allocation is a one-time occurrence.

If the GC scheduler knows that a thread is executing periodically, that can be used to improve allocation rate estimations. That information is often available, as many real-time operating systems have a special type of thread or process for periodic tasks. Additionally, threads which are not periodic in the usual real-time programming sense may execute periodically. One example of such a thread is a controller in a distributed control system, receiving measurements from a remote node over a network, as sketched in Figure 4.5. In the code, the controller is not a periodic thread; it is simply blocked waiting for the next packet on the network. However, if the sampling thread on the remote node is periodic, network packets will arrive periodically and the controller will effectively be periodic. The period time can either be measured, by recording release times, or explicitly fed forward from the application.


    while (!interrupted()) {
        Sample s = receiveSample(); // Blocking call
        Control c = compute(s);
        output(c);
    }

Figure 4.5: Simple example of a main loop of a controller thread in a distributed control system. This code is not periodic per se. However, if samples arrive at a fixed rate, it will be effectively periodic.
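Measuring the period of an effectively periodic thread, such as the one in Figure 4.5, by recording release times could look like the sketch below. The class name and the exponential smoothing with a fixed gain are assumptions made for illustration; a release is taken to be the moment the blocking receive call returns.

```java
/** Sketch of period estimation from recorded release times. Names are illustrative. */
final class ReleasePeriodEstimator {
    private final double gain; // smoothing gain in (0, 1]
    private double lastRelease = Double.NaN;
    private double period = Double.NaN;

    ReleasePeriodEstimator(double gain) {
        if (gain <= 0.0 || gain > 1.0) {
            throw new IllegalArgumentException("gain must be in (0, 1]");
        }
        this.gain = gain;
    }

    /** Records a release at the given time; returns the period estimate (NaN until two releases). */
    double onRelease(double time) {
        if (!Double.isNaN(lastRelease)) {
            double interval = time - lastRelease;
            period = Double.isNaN(period)
                    ? interval
                    : period + gain * (interval - period);
        }
        lastRelease = time;
        return period;
    }
}
```

In the loop of Figure 4.5, `onRelease` would be called with the current time right after `receiveSample()` returns. If the period is instead fed forward explicitly from the application, no such measurement is needed.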

Quantitative feed-forward

For a thread that is known to be periodic, the period time is also often known, either at design time or, in a feedback scheduling system, at run-time. For effectively periodic threads, where the period time isn't known locally, it is useful to explicitly state this information, which is available somewhere in the system. Also, if period times are changed at run-time, it is better to feed forward this information at the time of the change than to wait for it to show up in measurements.

If a mode change is known to affect the amount of live memory, that information can be used to improve the GC scheduling, as in Theorem 3. As discussed, in embedded systems development, the programmer is often required to know the memory requirements of an application to ensure that there is enough memory in the system. The memory usage figures could also be obtained by a worst case analysis tool. Using the approach taken e.g. in [Per99], annotations can be used to perform the analysis for the different modes of operation.

4.3 GC workload prediction

As discussed in Chapter 3, using semi-concurrent GC in a fixed-priority system requires good estimates of the total amount of GC work that must be performed to complete a GC cycle, as the scheduling of GC increments depends on it⁷. Also, in feedback scheduling systems, on-line schedulability analysis is performed and the allowed CPU time utilization of the application threads is tuned to keep the total requested CPU utilization at the setpoint. Therefore, in such systems, it must be possible to determine how much CPU time is required in order to complete a GC cycle. It is important that the GC work estimates are not too low, since this might cause us to allocate too large a fraction of the CPU time to the mutator, causing the GC thread to miss its deadline, which might, in turn, cause an out-of-memory situation and stop-the-world GC. The estimates should also not be too high, in order to avoid unnecessarily low CPU utilization and undue disturbance of low priority threads.

⁷ However, for the real-time performance of high priority threads, it is enough that it is conservative; over-estimating the CPU requirement of the GC only leads to (temporary) starvation of background threads.

Thus, in an adaptive system, the role of the workload estimation is to feed forward information about changes in required CPU utilization to the scheduler, so that any necessary change in scheduling parameters may be done before the measured CPU utilization gets too high. Also, in an adaptive system, there are no absolute guarantees, but rather a trade-off between safety and performance, and GC work prediction can be more or less conservative. Techniques for producing both tight and conservative $C_{GC}$ estimates will be discussed.

In many cases, the occasional under-estimation is not a problem; as stated, feedback scheduling works on the principle of measuring actual CPU utilization and changing scheduling parameters in order to handle overload conditions, and is therefore inherently robust to overload. This is reinforced by the fact that the $T_{GC}$ estimates are based on worst case assumptions and therefore usually are conservative, giving some slack in the schedule. Conversely, if a conservative $C_{GC}$ estimate is used, the resulting slack in the schedule is not wasted, but can be utilized by the mutator if the FBS is aware of when the GC is running and when it is idle.

4.3.1 Black box estimation

A black box model doesn’t use any information about the internals of the memory manager and only tries to predict future execution times based on the history. This has the advantages that it is fairly easy to implement and that it, by design, is independent of the actual garbage collector used.

A simple scheme which has been found to work fairly well in practice is to estimate the GC cycle execution time with the highest value during the last n cycles. Another alternative is to use e.g. a moving average filter, but that has a greater risk of under-estimating the execution time, whereas using the max value tends to be conservative.

The main drawback of any such approach is that it cannot take advantage of any information the memory manager has about application behaviour or system state and thus will react poorly to transients.
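The two black box alternatives can be sketched as follows. This is only an illustration under assumptions: the class name `BlackBoxGcEstimator`, the ring buffer layout and the unit of the measurements are chosen for the example and are not taken from the implementation discussed in this work.

```java
/** Hypothetical black box estimator over the last n measured GC cycle execution times. */
final class BlackBoxGcEstimator {
    private final double[] window; // ring buffer holding the last n measurements
    private int count = 0;         // number of valid entries in the buffer
    private int next = 0;          // slot that the next measurement overwrites

    BlackBoxGcEstimator(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        window = new double[n];
    }

    /** Record the measured execution time of a completed GC cycle. */
    void recordCycle(double cpuTime) {
        window[next] = cpuTime;
        next = (next + 1) % window.length;
        if (count < window.length) {
            count++;
        }
    }

    /** Highest value during the last n cycles; tends to be conservative. */
    double maxEstimate() {
        double max = 0.0;
        for (int i = 0; i < count; i++) {
            max = Math.max(max, window[i]);
        }
        return max;
    }

    /** Moving average over the same window; more likely to under-estimate. */
    double meanEstimate() {
        if (count == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += window[i];
        }
        return sum / count;
    }
}
```

The window length n trades responsiveness against conservatism: a long window remembers an old peak for many cycles, while a short one forgets it quickly and may under-estimate after a transient.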


4.3.2 Clear box prediction

In a clear box approach, the principle is to measure a number of parameters of the memory system and, using some automatic system identification technique, determine how they affect the execution time of the GC. That requires a more detailed interface between the GC scheduler and the memory management system.

In order to predict the amount of GC work, a GC work metric is required, expressing GC work as a function of the state of the heap

$$C_{GC} = f(S_h). \tag{4.24}$$

Given the structure of GC algorithms, it is reasonable to approximate the work required to perform a GC cycle with a linear combination of the components of $S_h$. For instance, the time required to mark all objects is proportional to the number of live objects, the time required to evacuate live objects depends on the size of the live memory, initialization of memory depends on the amount of dead memory, etc. Thus, an approximation of the GC workload can be expressed as

$$C_{GC} = K S_h \tag{4.25}$$

for some vector $K$, which is identified on-line. Given a function $f$, or coefficient vector $K$, the GC work estimate only depends on the heap state, and not on any internal state of the GC. This facilitates the development of a well-defined interface between the memory manager and the GC scheduler, which makes it possible to separate the two problems and, hence, implement a generic GC scheduler that can be automatically tuned to fit different GC algorithms.
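To make the separation between memory manager and GC scheduler concrete, the linear model (4.25) might be sketched as below. The interface `HeapStateSource` and the class `LinearGcWorkModel` are hypothetical names introduced for this example; the text does not prescribe a particular interface.

```java
/** Hypothetical interface: the only thing the GC scheduler asks of the memory manager. */
interface HeapStateSource {
    /** Returns the components of the (abstract) heap state S_h. */
    double[] heapState();
}

/** Hypothetical linear GC work model, C_GC = K * S_h as in (4.25). */
final class LinearGcWorkModel {
    private final double[] k; // coefficient vector K, one weight per heap state component

    LinearGcWorkModel(double[] k) {
        this.k = k.clone();
    }

    /** Evaluates the inner product of K and the given heap state. */
    double estimate(double[] sh) {
        if (sh.length != k.length) {
            throw new IllegalArgumentException("heap state and K must have the same length");
        }
        double cGc = 0.0;
        for (int i = 0; i < k.length; i++) {
            cGc += k[i] * sh[i];
        }
        return cGc;
    }

    /** Convenience overload: query the memory manager and evaluate the model. */
    double estimate(HeapStateSource source) {
        return estimate(source.heapState());
    }
}
```

Because the model only sees a vector of numbers, the same scheduler code could in principle be reused with a different collector by re-identifying K, which is the point made above.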

In order to estimate the amount of CPU time required to perform the GC work needed to finish a GC cycle, a number of problems must be solved; we need to

Measure and predict the heap state: In its simplest form, only the amount of available memory is measured. A more detailed model would take into account the amount of live memory, dead objects and other quantities that affect the execution time of the GC (e.g., the number of pointers that need to be traversed, the number of objects that will be relocated, etc.).

Measure the amount of performed GC work: This can be done in a quite straightforward manner if we use time as the GC work metric, provided that we have control over the process scheduler and have access to a high resolution timer. Some operating systems also provide execution time statistics.

Identify a GC work function: In order to predict the amount of GC work required to complete a GC cycle, a function from heap state to GC execution time has to be identified. If a linear model is assumed, the problem becomes on-line estimation of the elements of $K$, given past measurements of $C_{GC}$ and $S_h$, which can be done e.g. with a recursive least squares algorithm [AW89].

Estimate the total amount of GC work in a cycle: Finally, based on the other estimates, the total amount of work required to complete a GC cycle is estimated by inserting the predicted heap state into the identified work function.

Measuring and predicting heap state and predicting $C_{GC}$ will now be discussed in more detail.

Measuring and predicting heap state

Of course, it is not practically feasible to use the state of the heap per se when calculating the amount of GC work and therefore an abstract model is required. Objects allocated on the heap are either live or dead, but may float for one cycle, which leads us to the following abstract representation of the heap state:

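$$S_h = \begin{bmatrix} \#\ \text{live objects} \\ \#\ \text{live bytes} \\ \#\ \text{dead objects} \\ \#\ \text{dead bytes} \\ \#\ \text{floating objects} \\ \#\ \text{floating bytes} \end{bmatrix} \tag{4.26}$$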

A problem with garbage collection is that some aspects of the heap state, for instance the amount of live or dead memory, can only be observed at the end of GC cycles. Even worse, with an incremental GC, it is not possible to distinguish between live memory and floating garbage. Therefore, the heap state cannot be measured directly, but must be calculated based on what can be measured. It is possible to formulate a dynamic system that, under certain assumptions, is observable. However, for a system with n states, it takes at least n samples for the error to reach zero. In this case, one sample corresponds to one GC cycle, meaning that the model would be quite slow. Combined with the noisy measurements (e.g. due to floating garbage), such a detailed model would be problematic in practice.


Also, (4.26) fails to capture two important factors: the actual placement of the objects on the heap, and the distribution of references in objects. The placement of objects affects the GC workload since it affects which objects need to be moved in a compacting collector or the degree of fragmentation in a non-moving GC. However, taking object placement into account would essentially mean using the entire heap itself as the heap state representation. The reference content of objects affects the time required to trace the live object graph, with the extremes being data arrays at one end of the spectrum, and reference arrays at the other.

Therefore, using (4.26) as an abstraction of the heap state and attempting to predict it by simulating a dynamic system appears problematic for two reasons. Using an observer to reconstruct many states limits how quickly the model can react to changes, and the approximations made still leave out important aspects that affect the GC workload.

We need some way of predicting $S_h$ based on quantities that can be measured. Therefore, the approach taken here is to use a simplified heap state representation, only using the number of live ($L$) and dead ($D$) bytes and not taking object sizes into account⁸.

$$S_h = \begin{bmatrix} L \\ D \end{bmatrix} \tag{4.27}$$

Then, in principle, the heap state can be predicted by finding the probability of a memory cell being live or dead, respectively, and applying that to the total amount of allocated memory ($A$):

$$L = P(\text{Live}) \cdot A \tag{4.28}$$

and

$$D = P(\text{Dead}) \cdot A \tag{4.29}$$

That has the advantage that while $L$ and $D$ cannot be observed directly, $A$ can be measured at any time. However, what is interesting for predicting $C_{GC}(k)$ is a prediction of the amount of allocated memory at the end of the GC cycle, $A_e(k)$, which can be predicted by extrapolation similar to that in the $T_{GC}$ calculation:

$$A_e(k) = A_s(k) + T_{GC}(k) \cdot a \tag{4.30}$$

The prediction of $L$ and $D$ is then given by inserting that value into (4.28) and (4.29).

⁸ The terms live and dead memory are actually not very accurate in this context; what is interesting for the amount of GC work is what the garbage collector thinks is live and dead memory. In this presentation, the terms live and dead should be understood as synonyms for retained and reclaimed, respectively.


Now, $P(\text{Live})$ and $P(\text{Dead})$ must be found. Excluding startup, typical embedded or other long-running programs with a well-defined set of tasks can be expected to behave quite similarly from one GC cycle to the next. For such systems, the fraction of live (dead) memory in the previous GC cycle(s) can be used: $P(\text{Live}) = L/A$ (and correspondingly $P(\text{Dead}) = D/A$). Robustness against variations in live and dead memory due to e.g. floating garbage can be achieved by adding low pass filtering using the maximum, median or mean observed value, and responsiveness to actual change by using a forgetting factor for reducing the weight of old measurements.
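One possible way to combine low pass filtering with a forgetting factor is an exponentially weighted mean of the observed live fraction. The sketch below is an assumption made for illustration: the class name `LiveFractionEstimator` and the choice of an exponentially weighted mean (rather than, say, a windowed maximum or median) are not taken from the text, and $P(\text{Dead})$ is computed as $1 - P(\text{Live})$ under the assumption that $L + D = A$.

```java
/** Hypothetical estimator of P(Live) = L / A with a forgetting factor. */
final class LiveFractionEstimator {
    private final double lambda;         // forgetting factor in [0, 1): weight kept by old measurements
    private double pLive = 0.0;          // current filtered estimate of P(Live)
    private boolean initialised = false; // false until the first GC cycle has been observed

    LiveFractionEstimator(double lambda) {
        if (lambda < 0.0 || lambda >= 1.0) {
            throw new IllegalArgumentException("lambda must be in [0, 1)");
        }
        this.lambda = lambda;
    }

    /** Call at the end of a GC cycle with the observed live and allocated byte counts. */
    void update(double liveBytes, double allocatedBytes) {
        if (allocatedBytes <= 0.0) {
            return; // nothing allocated, no information in this cycle
        }
        double sample = liveBytes / allocatedBytes;
        // The first sample initialises the filter; later samples are blended in.
        pLive = initialised ? lambda * pLive + (1.0 - lambda) * sample : sample;
        initialised = true;
    }

    double pLive() {
        return pLive;
    }

    /** Assumes every allocated byte is either live or dead, i.e. L + D = A. */
    double pDead() {
        return 1.0 - pLive;
    }
}
```

A forgetting factor close to 1 gives heavy smoothing against floating garbage noise; a value close to 0 makes the estimate follow the most recent cycle almost directly.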

A potential problem with using only the amount of live and dead memory is that if the GC work function has been identified on-line, based on past measurements of $L$, $D$, and $C_{GC}$, there is no guarantee that the function will be valid if the distribution of objects changes, as it does not take the number of objects, pointer density, or placement into account. Therefore, an extreme change in object distribution, e.g. from the heap being dominated by a highly connected linked structure of small nodes to consisting mainly of huge data arrays, might cause a large error in the work estimate until the work function identification has had time to react. In practice, this is unlikely to be a problem. Firstly, while mode changes often occur, they are seldom as drastic as that, and with many threads, effects are likely to even out. Secondly, previous work has shown that e.g. the variation in pointer density and fraction of non-null pointers between the different SPECjvm benchmarks is quite low [BCR03a].

Predicting GC work

Now, we need to put it all together into a prediction of the amount of CPU time required to complete a GC cycle. With the simple heap state model, the GC work function is

$$C_{GC}(L, D) = \alpha L + \beta D \tag{4.31}$$

where the coefficients $\alpha$ and $\beta$ are identified on-line using e.g. a recursive least squares algorithm based on previous measurements of $L$, $D$, and $C_{GC}$.
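For the two-parameter model (4.31), a textbook recursive least squares update with a forgetting factor can be written without any matrix library. The following is a sketch under assumptions and not the identification code used in this work: the class name `GcWorkRls`, the initial covariance, and the zero initial coefficients are choices made for the example.

```java
/** Hypothetical recursive least squares identification of C_GC = alpha * L + beta * D. */
final class GcWorkRls {
    private final double lambda;  // forgetting factor in (0, 1]; 1 means no forgetting
    private double alpha = 0.0;   // estimated cost per live byte
    private double beta = 0.0;    // estimated cost per dead byte
    private double p11, p12, p22; // symmetric 2x2 covariance matrix P

    GcWorkRls(double lambda, double initialCovariance) {
        if (lambda <= 0.0 || lambda > 1.0) {
            throw new IllegalArgumentException("lambda must be in (0, 1]");
        }
        this.lambda = lambda;
        this.p11 = initialCovariance; // large value: little trust in the initial coefficients
        this.p12 = 0.0;
        this.p22 = initialCovariance;
    }

    /** Update with one completed GC cycle: observed L, D and measured C_GC. */
    void update(double l, double d, double cGc) {
        // P * phi, where phi = [L, D]
        double pl = p11 * l + p12 * d;
        double pd = p12 * l + p22 * d;
        double denom = lambda + l * pl + d * pd;
        // Gain vector
        double k1 = pl / denom;
        double k2 = pd / denom;
        // Correct the coefficients with the prediction error
        double err = cGc - (alpha * l + beta * d);
        alpha += k1 * err;
        beta += k2 * err;
        // P <- (P - K * (P * phi)^T) / lambda; P is symmetric, so phi^T * P = (P * phi)^T
        double n11 = (p11 - k1 * pl) / lambda;
        double n12 = (p12 - k1 * pd) / lambda;
        double n22 = (p22 - k2 * pd) / lambda;
        p11 = n11;
        p12 = n12;
        p22 = n22;
    }

    double alpha() { return alpha; }

    double beta() { return beta; }

    /** Evaluates the identified work function (4.31). */
    double predict(double l, double d) {
        return alpha * l + beta * d;
    }
}
```

With a forgetting factor below 1, old cycles are gradually discounted, so the coefficients can follow a change in object distribution at the cost of noisier estimates.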

With the heap state prediction of (4.28) and (4.29), (4.31) can be written

$$C_{GC} \approx (\alpha P(\text{Live}) + \beta P(\text{Dead}))\, A \tag{4.32}$$

which, according to (4.30), can be extrapolated:

$$C_{GC}(k) = (\alpha P(\text{Live}) + \beta P(\text{Dead}))\, (A_s(k) + T_{GC}(k) \cdot a) \tag{4.33}$$
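Putting (4.30) and (4.33) together, the prediction is a two-step calculation. The sketch below simply transcribes the two formulas; the class and parameter names are hypothetical, and the units (bytes, seconds, bytes per second) are assumptions for the example.

```java
/** Hypothetical transcription of the clear box prediction, equations (4.30) and (4.33). */
final class ClearBoxPredictor {
    private ClearBoxPredictor() { }

    /** (4.30): allocated memory at the end of the cycle, A_e = A_s + T_GC * a. */
    static double predictAllocatedAtEnd(double aStart, double tGc, double allocRate) {
        return aStart + tGc * allocRate;
    }

    /** (4.33): C_GC = (alpha * P(Live) + beta * P(Dead)) * A_e. */
    static double predictGcWork(double alpha, double beta,
                                double pLive, double pDead,
                                double aStart, double tGc, double allocRate) {
        double aEnd = predictAllocatedAtEnd(aStart, tGc, allocRate);
        return (alpha * pLive + beta * pDead) * aEnd;
    }
}
```

For instance, with 1000 bytes allocated at the start of the cycle, a cycle time of 2 s and an allocation rate of 250 bytes/s, the extrapolated allocation is 1500 bytes, which is then weighted by the identified coefficients and the estimated live and dead fractions.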


4.3.3 Conservative prediction

The heap state prediction, as presented in Section 4.3.2, depends on $P(\text{Live})$ and $P(\text{Dead})$, in addition to the identified GC work function. For programs with a random, or highly varying, memory usage pattern, the estimates of $P(\text{Live})$ and $P(\text{Dead})$ will contain little information, reducing the quality of the prediction. In such cases, or when robustness is a higher priority than efficiency, a conservative estimate of $C_{GC}$ can be useful. Based on (4.31), it is observed that

$$\alpha L + \beta D \le \max(\alpha, \beta)\,(L + D) \tag{4.34}$$

and $L + D$ is the total amount of allocated memory. Thus, a conservative prediction of $C_{GC}$ is given by extrapolating the amount of allocated memory at the end of the GC cycle, using the amount of memory at the start of the cycle, the GC cycle time, and the allocation rate:

$$C_{GC}(k) \le \max(\alpha, \beta)\,(A_s(k) + T_{GC}(k) \cdot a) \tag{4.35}$$

The main drawback with (4.35) is that the estimate may be very conservative if $\alpha$ and $\beta$, or $L$ and $D$, are of different magnitude. E.g., if $L = D$, the conservative estimate will be at most twice the true value of (4.31), for any $\alpha$ and $\beta$. If, however, the fraction of live memory is only 10%, this method may give an over-estimation of 10 times (in the worst case, $\beta = 0$). However, for embedded applications it is likely to be reasonable; having very low memory utilization is typically avoided for cost efficiency reasons (and due to the fact that software tends to eventually use all available resources), while a very high memory utilization should be avoided as it causes GC thrashing and poor efficiency [JL96].
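The two figures quoted above can be checked directly from (4.31) and (4.34). The short derivation below assumes non-negative cost coefficients with $\alpha + \beta > 0$, writes $A = L + D$, and introduces the symbol $r$ (not used elsewhere in the text) for the ratio between the conservative bound and the linear model.

```latex
% Over-estimation ratio of the conservative bound relative to the linear model
\[
  r = \frac{\max(\alpha,\beta)\,(L + D)}{\alpha L + \beta D}
\]
% Case 1: equal amounts of live and dead memory, L = D
\[
  r = \frac{2\max(\alpha,\beta)\,L}{(\alpha + \beta)\,L}
    = \frac{2\max(\alpha,\beta)}{\alpha + \beta} \le 2
\]
% Case 2: 10% live memory, L = 0.1A, D = 0.9A, worst case \beta = 0
\[
  r = \frac{\alpha A}{\alpha \cdot 0.1A} = 10
\]
```

In the first case the bound holds because $\alpha + \beta \ge \max(\alpha, \beta)$ for non-negative coefficients; in the second, all GC cost is attributed to the small live fraction, which is why the bound is at its loosest there.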

4.4 Summary

An approach to making a time-triggered garbage collection scheduler auto-tuning was presented, based on the observation that we need to estimate the two scheduling parameters deadline and execution time. It was shown how a GC cycle time that ensures that the application never runs out of memory can be determined at run-time, and how it is robust against variations in floating garbage. It was also shown how information about the mutator can be used to reduce the conservatism of the $T_{GC}$ tuning.

Different approaches for on-line estimation of $C_{GC}$ were presented and discussed: first, a black box, “yesterday’s weather”, approach that is simple and does not require any information about the state of the memory manager; second, a clear box method based on identifying a GC work function and predicting the state of the heap based on the allocation rate and GC cycle time; and third, a conservative variant of the clear box approach, based on an identified GC work function and worst case assumptions. For the clear box approaches, a simplified representation of the heap state was suggested in order to make implementation practically feasible. Finally, the degree of conservatism in the conservative approach was discussed and it was argued that, for typical embedded systems, it will be within reasonable limits.

On-line estimation of the scheduling parameters for the GC task makes it possible to take the GC overhead into account when doing on-line schedulability analysis, e.g. in a feedback scheduling system. It also makes it possible to make a semi-concurrent garbage collector adaptive in order to minimize the disturbance of low priority threads. Integrating the scheduling of garbage collection and the scheduling of mutator processes is an important step towards making safe object-oriented languages like Java practically feasible for many real-time applications in automatic control and embedded systems, without requiring a huge engineering effort to tune the GC.

CHAPTER 5

PRIORITIES FOR MEMORY ALLOCATION

This chapter presents a novel approach of applying priorities¹ to memory allocation and it is shown how this can be used to enhance the robustness of real-time applications. The proposed mechanisms can also be used to increase performance of systems with automatic memory management by limiting the amount of garbage collection work.

A way of introducing priorities for memory allocation in a Java system without making any changes to the syntax of the language is also proposed and this has been implemented in an experimental Java virtual machine and verified in an automatic control application.

5.1 Introduction

With the recent development in small, cheap and fast processors for embedded systems and the emerging trend of writing embedded applications in high level object oriented languages, the performance limiting bottleneck may no longer be CPU time but rather memory and memory management. This is accentuated by the high relative cost of memory in embedded systems and systems on chip.

Memory management is a system-global problem and currently puts a great responsibility on programmers. For instance, a memory leak or excessive memory allocation in one module, or component, of a system will eventually cause the entire system to run out of memory and fail. Therefore it is interesting to study whether it is possible to apply priorities to memory as well as CPU time allocation; just as we don’t want an important process to be delayed because a less important one is executing, we don’t want an unimportant memory allocation to cause a critical process to fail or be delayed because the system runs out of memory or has to do a large amount of garbage collection work to satisfy its allocation needs.

¹ Here, we use the words “memory priority” in a sense that may correspond better to the RTSJ notion of “importance” than the real-time sense of the word priority.

Therefore, a novel approach is proposed which addresses two problems: firstly, how to increase program robustness by avoiding out-of-memory problems and secondly, how to increase application performance in systems with automatic memory management by reducing the garbage collection workload. Section 5.2 briefly describes both aspects, whereas the rest of the chapter will focus on the robustness issue.

While this chapter focuses on object oriented systems with garbage collection, especially Java, the robustness issues should be equally applicable to any memory allocator. Similarly, the presentation focuses on real-time systems, but the proposed mechanisms can be useful in any system where robustness to variations in workload, or isolation between different parts, is required.

A note on terminology: in order to avoid confusion we will use the terms high priority (HP) and low priority (LP) to denote the CPU time priority of a process and the terms critical and non-critical (NC)² for our new notion of priorities for memory allocations.

5.2 Applying priorities to memory allocations

It is desirable to be able to view memory allocation as any other resource allocation. The goal of this work is to provide run-time system support for doing the most important memory allocation if the system has limited memory, in analogy with how the process scheduler makes sure that the most important process is run and less important ones are delayed if CPU time is scarce.

5.2.1 Avoiding out-of-memory situations

A high priority process in an embedded system may perform other tasks³ in addition to its core functionality. For example, a digital controller process may produce log data in addition to calculating and outputting its control signal. In such a process, memory allocations by the less important tasks (e.g., producing log data) must never interfere with the core functionality (calculating the control signal).

² The terms critical and non-critical correspond to the terms mandatory and optional sometimes used in the safety critical systems community.

³ The word task is used in the sense “a piece of work to be done” and not in any stringent real-time programming sense. For the latter, the words process and thread are used.

This can be achieved by manually ensuring that the amount of log data never exceeds a certain value, for instance by using a bounded buffer for delivering it to the logger process. Doing this manually has the drawback that the size of the buffer has to be calculated and this calculation is highly platform and application dependent. (I.e., each time a change that affects the application’s memory allocation behaviour or the amount of memory available to the application is made, the maximum amount of non-critical memory has to be recalculated.) If more than one process does unrelated non-critical memory allocations, the complexity of managing this increases rapidly. Thus, manual solutions require a lot of work and risk being unnecessarily conservative, error prone, or both.

The proposed approach to this problem is to transfer the responsibility for making the decisions about when to allow non-critical memory allocations from the programmer to the run-time system. Then, the only a priori calculation that has to be done is to calculate the amount of critical allocations done by each (high priority) process during its period and this depends only on the application and not on target platform properties like memory size.

This approach can also be used to provide a “limp home” mode, i.e. a mode of operation with lesser performance but radically lower memory consumption that will allow the application to continue executing in a low-on-memory situation, facilitating a more graceful degradation. This may be useful for adding some amount of predictability to applications with non-predictable memory requirements.

Finally, non-critical memory allocation gives programmers the possibility to add more features to a system without risking that these additions cause the system to run out of memory and jeopardize the core functionality of the system, even if it is moved to a smaller platform. E.g., a low priority process with only non-critical memory allocations cannot cause a system to fail since, if the CPU load is dangerously high, it will not get any CPU time and, if the amount of memory is too low, it will not be allowed to allocate any memory. This also has the advantage that it makes it easier to make hard real-time guarantees, since worst case and schedulability analysis only has to be done on the critical parts of the system. Such analysis still has to be done using existing techniques [JP86, SRL94, Per99].


5.2.2 Improving performance by reducing GC work

Another reason to limit non-critical memory allocations is to reduce the amount of garbage collection work needed and thereby increase the amount of CPU time available to the application. This can, in turn, improve the application’s performance by, e.g., allowing more advanced algorithms to be used. Furthermore, in a real-time GC system, such as semi-concurrent GC scheduling, additional memory allocations in a high priority process may cause starvation of low priority processes, either directly, through increased execution time, or indirectly, due to the increase in GC work caused by these allocations (since the garbage collector for the high priority processes runs at a higher priority than the system’s low priority processes). In complex systems, however, the LP process may be more important for good system performance than a secondary task of the high priority process.

With priorities for memory allocations, an application may be written so that, if the system runs low on memory, the primary tasks of both the HP and the LP processes are performed, but the less important task of the HP process is not. Hence, for the quality of service of the system, performance can be tuned in a more flexible and appropriate manner.

5.3 Non-critical allocations

The semi-concurrent garbage collection scheduling model introduces a special garbage collection scheduling for the high priority processes in order to guarantee that they are never delayed. Here, this is taken one step further by also considering the behaviour of the memory allocator and the risk of running out of memory, due to, for instance, unpredictable application behaviour or even wrong worst case estimates. This is done by introducing the notion of non-critical memory allocation requests, i.e., requests for memory that the run-time system may choose to deny without causing the program to fail.

Ultimately, what we want to do is to keep the amount of live non-critically allocated memory below a certain limit in order to guarantee that critical allocations never will fail. Unfortunately, the amount of live memory is not a very suitable measurement, since keeping track of it is not always practically possible.

Particularly, in automatically managed memory systems, where we have the problem with floating garbage⁴, there is no real way of knowing how much live memory there is in the system. The only factor we can be sure of is the amount of memory available for allocation, so we need to base our decisions on that.

⁴ Floating garbage is memory that is no longer reachable from the application but has not yet been reclaimed by the garbage collector.

5.3.1 Non-critical allocation limit

The decision whether to grant or deny a non-critical memory allocation request has to be as simple as possible if it is to be used in high performance applications. That is accomplished by introducing an allocation limit for non-critical allocations; if there is less free, or allocatable⁵, memory than this limit, no non-critical allocations may be done. This limit will vary over time; at the start of a GC cycle, we have to reserve memory for all the (critical) HP memory allocations needed during this GC cycle and then, as the HP process runs and does its allocations, the amount of reserved memory is reduced accordingly. Figure 5.1 shows schematically how the amount of allocated, reserved and free memory varies over a GC cycle.

When deciding whether to grant or deny a non-critical memory request, we look at how much allocatable memory there is, and how much memory we need to reserve for the HP process so that all its remaining memory allocations during this GC cycle will succeed. Let $n$ be the number of HP periods in a GC cycle, and $m_{HP}$ the amount of critical memory allocated during each period by the HP process. Then, $i$ HP periods into a GC cycle we need to reserve $R_{HP_i} = (n - i)\, m_{HP}$ bytes for the remaining HP periods during this GC cycle. Non-critical memory allocations should only be allowed if they won’t cause the amount of allocatable memory to drop below $R_{HP}$.
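The grant-or-deny decision described above amounts to one subtraction and one comparison. The following sketch is illustrative only; the class name `NonCriticalLimit` and its method names are hypothetical, and the experimental virtual machine may structure this differently.

```java
/** Hypothetical grant/deny test for non-critical allocation requests. */
final class NonCriticalLimit {
    private final int n;    // number of HP periods in a GC cycle
    private final long mHp; // critical bytes allocated by the HP process per period

    NonCriticalLimit(int n, long mHp) {
        if (n <= 0 || mHp < 0) {
            throw new IllegalArgumentException("n must be positive and mHp non-negative");
        }
        this.n = n;
        this.mHp = mHp;
    }

    /** Bytes to reserve i HP periods into the GC cycle: (n - i) * mHp. */
    long reserved(int i) {
        if (i < 0 || i > n) {
            throw new IllegalArgumentException("i must be in [0, n]");
        }
        return (long) (n - i) * mHp;
    }

    /** Grant only if the request leaves at least the reserved amount allocatable. */
    boolean mayAllocateNonCritical(long allocatableBytes, long requestBytes, int i) {
        return allocatableBytes - requestBytes >= reserved(i);
    }
}
```

Since the reserved amount shrinks as the HP process consumes its share, a non-critical request that is denied early in a cycle may be granted later in the same cycle.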

5.3.2 Fixed GC cycle length

In order to be able to guarantee that the HP process always will get the memory it requests, we need to make sure that the GC always keeps up with the application. I.e., after each invocation of an HP process, the GC must do enough GC work so that all the allocations during the next HP process invocation will succeed. Given the amount of memory allocated by the HP process each period and the amount of memory reserved for HP allocations, we can calculate the GC cycle time expressed in number of HP process periods. We call this time the nominal GC cycle time.

⁵ Allocatable memory is memory that is immediately available for allocation. We prefer the term allocatable memory to free memory since, depending on the memory allocator or garbage collection algorithm used, the term free memory may be difficult to define or even irrelevant. E.g., in a non-compacting system, the amount of free memory may be much larger than the amount of allocatable memory due to fragmentation.

[Figure: heap size versus time over one GC cycle. Labelled regions: allocated memory, free memory, critical memory allocated by the HP process, memory reserved for high priority allocations, memory available for non-critical allocation, non-critical memory, and the region where only critical allocations are allowed; the non-critical limit is marked.]

Figure 5.1: Schematic illustration of the limit for non-critical allocations. The dotted lines indicate the times where the non-critical limit is equal to the amount of allocatable memory, i.e., when the system starts to deny non-critical allocation requests.

To ensure that no HP allocation fails, we need to complete each GC cycle within this time, even if the actual amount of allocation done during the current GC cycle is less than the worst case. Otherwise, the situation may arise that there is allocatable memory left, but not enough for another complete HP process invocation. If an HP process is started at that time, it will require more memory than is currently available and thus, that HP process will be delayed by panic garbage collection.

5.4 Detailed description

This section describes the suggested approach in more detail. We discuss how the garbage collection cycle length can be calculated, how the decisions about when to deny non-critical memory allocation requests are taken, how the scheduling can be done and finally we give an example of how such a system may work.

5.4.1 Calculating the GC cycle length

Since we want to be able to guarantee that the application never will run out of memory while still having hard real-time constraints, we need a simple model so that we can perform, e.g., schedulability analysis. This is done by using a fixed GC cycle time which is calculated at application design-time.

The GC cycle time, the allocation rate of the HP process and the amount of memory available for non-critical allocation all affect each other and there are several ways to calculate the cycle length. One approach is to define how much memory should be reserved for HP allocations each GC cycle, $M_{HP}$. If the HP process allocates $m_{HP}$ each period we get the GC cycle length expressed in HP periods:

$$T_{GC} = n \cdot T_{HP}; \qquad n = \frac{M_{HP}}{m_{HP}} \tag{5.1}$$

Here, the GC cycle length will be the same regardless of how much total memory the system has, and changes to the amount of memory will only affect how much non-critical allocation can be made.
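As a worked instance of (5.1), consider the following figures, which are hypothetical and chosen only to make the arithmetic easy to follow: 65536 bytes reserved for HP allocations per GC cycle, 4096 bytes allocated by the HP process per period, and an HP period of 10 ms.

```latex
% Hypothetical values: M_HP = 65536 bytes, m_HP = 4096 bytes, T_HP = 10 ms
\[
  n = \frac{M_{HP}}{m_{HP}} = \frac{65536}{4096} = 16
\]
\[
  T_{GC} = n \cdot T_{HP} = 16 \cdot 10\ \text{ms} = 160\ \text{ms}
\]
```

Doubling the total heap size in this example would leave $n$ and $T_{GC}$ unchanged, since neither $M_{HP}$ nor $m_{HP}$ depends on it; only the memory left over for non-critical allocation would grow.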

Another way is to define the ratio of memory reserved for HP processes to non-critical memory. This has the advantage that the application will behave in the same way, with respect to non-critical allocations, independent of how much memory the system it is running on has. This is preferable since, while non-critical allocations cannot cause an out-of-memory situation, they add to the amount of GC work that has to be done and thus affect the schedulability analysis. Using the ratio of critical to non-critical memory instead of a fixed amount for one of the quantities has the property that the (amortized) amount of GC work per allocated object is independent of the total size of the memory; the memory size only affects the length of the GC cycles. Thus, this approach reduces the platform dependency of the schedulability analysis.

5.4.2 Live memory and floating garbage

In all calculations we must account for the amount of memory that lives across GC cycle boundaries and the floating garbage that may exist in the worst case. This can be viewed as a reduction of the (usable) heap size by a constant. If this isn’t taken into account, there will be less available memory at the start of each GC cycle than we have calculated with and the application will run out of memory.

Less obviously, it is also a problem if there is more allocatable memory at the start of a GC cycle than in the worst case, since this leads to the amount of memory available for non-critical allocations becoming too large, which could cause problems later. Therefore, we need to compensate for this, so that we always assume the worst case (i.e., we reserve a portion of memory to allow the amount of live memory or floating garbage to increase in the future).

With this taken into consideration, the least amount of free memory required in order to allow non-critical allocations during period $i$ can now be expressed as

\[ L_{NC_i} = (n - i)\, m_{HP} + f(A_{start}, C), \qquad 1 \le i \le n \tag{5.2} \]

where $A_{start}$ is the amount of allocated memory at the start of this cycle, $C$ the maximum amount of live and floating objects, and

\[ f(x, y) = \begin{cases} y - x, & x < y \\ 0, & x \ge y \end{cases} \tag{5.3} \]
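Equations (5.2) and (5.3) translate directly into code. The following is a minimal sketch, assuming all quantities are byte counts held in `long` values; the class and method names are illustrative and not taken from the implementation.

```java
/** Sketch of the non-critical limit of Equations (5.2) and (5.3). Names are illustrative. */
final class NonCriticalLimit {
    /** f(x, y) from (5.3): the amount by which x falls short of y, never negative. */
    static long f(long x, long y) {
        return x < y ? y - x : 0;
    }

    /**
     * L_NC for HP period i (1 <= i <= n) in a GC cycle of n HP periods, as in (5.2).
     * mHP: bytes allocated by the HP process per period.
     * aStart: bytes allocated at the start of the cycle.
     * c: worst-case amount of live memory plus floating garbage.
     */
    static long limit(int n, int i, long mHP, long aStart, long c) {
        if (i < 1 || i > n) {
            throw new IllegalArgumentException("need 1 <= i <= n");
        }
        return (n - i) * mHP + f(aStart, c);
    }
}
```

With four HP periods per cycle, 100 bytes per period, 300 bytes allocated at the cycle start and a worst case of 500 bytes, the limit in the first period is 3 · 100 + 200 = 500 bytes, and it falls by 100 bytes for each elapsed period.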

5.4.3 GC for the low priority processes

We will now discuss LP processes in a system with semi-concurrent GC. When LP processes are added to the system, they will also allocate memory, but the GC work corresponding to their allocations will be done at allocation time using traditional incremental GC. When LP allocations are done, the actual GC cycle time will be less than the nominal cycle time. In a traditional incremental garbage collector, this is intrinsic to the scheduling principle; the extra GC work done by the LP process advances the current GC cycle.

In our system where GC work is triggered by time, however, we have to explicitly shorten the current GC cycle. Furthermore, the new, shorter cycle time still has to be a whole number of HP process periods to ensure that there always is enough allocatable memory for one full HP process invocation. This is done by decreasing the current cycle time by $l$ HP periods, where

\[ l = \left\lceil \frac{A_{LP}}{m_{HP}} \right\rceil \tag{5.4} \]

and $A_{LP}$ is the amount of memory allocated by the low priority processes. Thus, if the nominal GC cycle length is $n$ HP periods, the effective GC cycle length due to LP memory allocations will be $n'$ HP periods, where $n' = n - l$.
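The shortening of the cycle can be sketched as follows. This is an illustration only: the names are hypothetical, and clamping the result at zero is an added assumption, since the text does not say what happens if the LP allocations exceed the whole cycle.

```java
/** Sketch of Equation (5.4) and n' = n - l. Names are illustrative. */
final class EffectiveCycleLength {
    /** l = ceil(aLP / mHP), computed with integer arithmetic. */
    static long shortening(long aLP, long mHP) {
        if (mHP <= 0 || aLP < 0) {
            throw new IllegalArgumentException("need mHP > 0 and aLP >= 0");
        }
        return (aLP + mHP - 1) / mHP;
    }

    /** n' = n - l; the clamp at zero is an assumption, not taken from the text. */
    static long effectivePeriods(long n, long aLP, long mHP) {
        return Math.max(0, n - shortening(aLP, mHP));
    }
}
```

Rounding up is what causes the quantization effect: 250 bytes of LP allocation with 100 bytes per HP period shortens the cycle by three whole periods, not 2.5.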

Note that this should only affect the effective GC cycle length (i.e., the scheduling) and not the NC limit calculations. If we were to adjust the NC limit accordingly when the GC cycle was shortened, it would be possible for non-critical allocations in an HP process to “steal” the GC work done for a critical allocation in an LP process, and that is not what we want. On the other hand, we do need to change the NC limit due to the actual critical LP allocations made, because if we don't, we would effectively reduce the amount of memory available for NC allocations. This may seem counter-intuitive, but bear in mind that the purpose of the NC limit is to limit the amount of non-critical allocations; it has nothing to do with controlling the critical allocations in LP processes.

As described above, when an allocation is made in an LP process, the corresponding GC work is done incrementally and the GC cycle is shortened so that there still will be memory for a whole number of HP process activations. Also, when an LP allocation is done, the amount of allocatable memory is decreased, and in order to maintain the same amount of memory available to non-critical allocations we have to reduce the NC limit by the size of the LP allocation.

If we have allocated $A_{LP}$ bytes of critical memory in the LP processes during this GC cycle, the NC limit can be written

\[ L_{NC_i} = (n - i)\, m_{HP} + f(A_{start}, C) - A_{LP} . \tag{5.5} \]

Non-critical allocations in LP processes, on the other hand, should not be included in $A_{LP}$. That means that, if NC LP allocations are made, $L_{NC} > 0$ at the end of the GC cycle, and the total amount of non-critical allocations allowed during the cycle is not affected by the decrease in cycle time. Just as in the case when there is more available memory than in the worst case, it is not enough to ensure that all HP critical allocations succeed in the current cycle; the ultimate objective is to limit the amount of live non-critical memory.

5.4.4 Non-critical limit calculations in the real world

In all the previous calculations in this chapter, we have assumed that a GC cycle can easily be divided into a number of HP process periods and that the memory allocations of each period are done instantaneously at the start of the period. This model is well suited for reasoning about systems and off-line analysis but doesn't lend itself well to actual implementation.

In real systems, the high priority processes often have different period times, and real programs do allocations more or less sporadically during their execution rather than at the start of a well-defined period. For these reasons, among others, an NC limit based on the number of elapsed HP periods is not very practical for run-time calculations. Instead, we will use the following algorithm:

• At the start of each GC cycle, the amount of memory needed by all the critical allocations by HP processes is calculated⁶. This is the amount of memory reserved for HP allocations (compensated for floating garbage, etc.), $R_{HP} = M_{HP} + f(A_{start}, C)$.

• Whenever a critical HP allocation is done, $R_{HP}$ is decreased by the size of the allocated object. When a critical allocation is done by a low priority process, $A_{LP}$ is increased. The non-critical limit is then updated: $L_{NC} = R_{HP} - A_{LP}$.

• If the amount of allocatable memory is less than or equal to $L_{NC}$, non-critical allocation requests will be denied.

This way, the NC limit will always be correct, regardless of how much memory the HP processes actually allocate and at what time during their execution they perform the allocations.
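The three steps of the algorithm above can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not the IVM implementation: the class and method names are hypothetical, sizes are byte counts, and synchronization between threads is left out.

```java
/** Sketch of the run-time non-critical limit bookkeeping. Names are illustrative. */
final class NonCriticalAccounting {
    private long reservedHp;   // R_HP: memory still reserved for critical HP allocations
    private long allocatedLp;  // A_LP: critical memory allocated by LP processes this cycle

    /** Step 1: at the start of a GC cycle, R_HP = M_HP + f(A_start, C). */
    void startCycle(long mHpTotal, long aStart, long c) {
        long compensation = aStart < c ? c - aStart : 0; // f(A_start, C) from (5.3)
        reservedHp = mHpTotal + compensation;
        allocatedLp = 0;
    }

    /** Step 2a: a critical HP allocation consumes part of the reservation. */
    void onCriticalHpAllocation(long size) {
        reservedHp -= size;
    }

    /** Step 2b: a critical LP allocation increases A_LP. */
    void onCriticalLpAllocation(long size) {
        allocatedLp += size;
    }

    /** L_NC = R_HP - A_LP. */
    long limit() {
        return reservedHp - allocatedLp;
    }

    /** Step 3: non-critical requests are denied when allocatable <= L_NC. */
    boolean mayAllocateNonCritical(long allocatable) {
        return allocatable > limit();
    }
}
```

The allocator would call `mayAllocateNonCritical` with the current amount of allocatable memory before serving a non-critical request, and throw the non-critical memory exception when it returns `false`.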

Another implementation issue is that our calculations assume that the garbage collector only frees memory at the very end of each GC cycle. This simplifies the non-critical limit calculations, as each cycle can be viewed independently, but when implementing support for non-critical allocations, care must be taken to ensure that this assumption holds.

⁶ The actual calculation of the worst-case memory requirements for each process could be done either manually or at compile time. Another possibility for soft real-time systems is that it could be estimated by the run-time system based on measurements from previous GC cycles.

Mark-sweep collectors, of course, need some attention as they, by nature, free memory continuously during the sweep phase. A copying collector has this behaviour in principle, but still might have to be modified; it does free all memory after the last object has been moved, but this could happen before the full GC cycle time has elapsed.

Thus, in any case the memory manager must be designed so that it does not make any memory available to the allocator until the start of the next GC cycle. Otherwise, too many non-critical allocations might be allowed in the current cycle, which might cause problems later. This also means that if the GC work metric is conservative and the garbage collector finishes early, the freed memory should not be made available to the allocator until the start of the next cycle.
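One way to picture this requirement is a memory manager that records reclaimed memory separately and only hands it to the allocator at a cycle boundary. The sketch below is a simplified illustration with hypothetical names; it tracks byte counts only and ignores fragmentation, object layout and concurrency.

```java
/** Sketch: memory freed during a GC cycle is withheld until the next cycle starts. */
final class DeferredRelease {
    private long allocatable;  // bytes the allocator may hand out in the current cycle
    private long pendingFree;  // bytes reclaimed by the GC but not yet released

    DeferredRelease(long initialAllocatable) {
        this.allocatable = initialAllocatable;
    }

    /** Called by the collector (sweep phase, or end of copying) when memory is reclaimed. */
    void onFreed(long bytes) {
        pendingFree += bytes;
    }

    /** Allocation only sees memory that was available at the start of the cycle. */
    boolean allocate(long bytes) {
        if (bytes > allocatable) {
            return false;
        }
        allocatable -= bytes;
        return true;
    }

    /** Called when the nominal GC cycle time has elapsed, even if the GC finished early. */
    void startNextCycle() {
        allocatable += pendingFree;
        pendingFree = 0;
    }

    long allocatable() {
        return allocatable;
    }
}
```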

5.4.5 Time-based GC scheduling

Traditionally, incremental garbage collectors have been implemented so that GC work has been triggered by memory allocation, and done in proportion to the amount of allocated memory. That is, when half of the memory available at the start of the cycle has been allocated, half of the GC work required to complete the cycle has been done, and when all the memory has been allocated the GC cycle is completed.

That approach to GC scheduling does not fit well into a system with non-critical allocations. The problem is that it may cause low memory utilization: if the application makes fewer critical allocations than in its worst case, the GC cycle will be longer. The limit for non-critical allocations, on the other hand, is not affected, so when the amount of allocatable memory reaches the non-critical limit, no more non-critical allocations are allowed during that GC cycle. Thus, the less critical memory the application allocates, the longer the GC cycle gets and the fewer non-critical allocations are allowed, which is not what we want.

Therefore, we use time, rather than allocation, as the trigger for GC work and do GC work in proportion to how large a fraction of the GC cycle time has elapsed. That is, when half of the GC cycle time has elapsed, the GC should have done (at least) half the work needed to complete the cycle. This ensures that each GC cycle finishes within the fixed time, even if there is allocatable memory left. Thus, time-triggered GC ensures the same non-critical memory behaviour regardless of how much critical memory the application actually allocates (as long as, of course, the allocated amount is less than the assumed worst case).
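The proportionality rule can be sketched as a progress check. The names below are hypothetical, time values are in arbitrary but consistent units (for example nanoseconds), and how the completed fraction of GC work is measured is left open.

```java
/** Sketch of the time-triggered progress rule. Names are illustrative. */
final class TimeTriggeredProgress {
    /** Fraction of the cycle's GC work that should be complete at time now, in [0, 1]. */
    static double requiredFraction(long cycleStart, long cycleLength, long now) {
        if (cycleLength <= 0) {
            throw new IllegalArgumentException("cycleLength must be positive");
        }
        double elapsed = (double) (now - cycleStart) / cycleLength;
        return Math.min(1.0, Math.max(0.0, elapsed));
    }

    /** True if the collector has done less work than the elapsed time requires. */
    static boolean isBehind(double doneFraction, long cycleStart, long cycleLength, long now) {
        return doneFraction < requiredFraction(cycleStart, cycleLength, now);
    }
}
```

Because the target depends only on elapsed time, the cycle length stays fixed whether or not the application reaches its worst-case allocation.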


5.4.6 Example

As an example, we take a system with one high priority process doing both critical and non-critical memory allocations and a set of low priority processes doing critical memory allocations.

Figure 5.2 shows how the amount of allocated and allocatable memory, respectively, varies over three GC cycles. In the first GC cycle, the amount of memory reserved for critical HP allocations (or rather, the non-critical limit) is larger than in the other two. This is because we must compensate for the fact that there is less than the maximum amount of allocated memory at the start of the GC cycle (see Section 5.4.2).

The second GC cycle shows how the system behaves when there are no allocations (and thus no incremental GC work) done by the low priority process. The first and third cycles are shorter than the nominal cycle length since low priority allocations are done.

Since we have a fixed nominal GC cycle length and use time, rather than memory allocation, to trigger GC work, the GC cycles may end before all available memory has been allocated. This can happen if the application uses less memory than in the worst case or due to quantization when low priority allocations are made (see Section 5.4.3).

5.5 Non-critical memory in Java

The main objective when implementing these ideas in a Java environment was that no changes to the syntax of the Java language should be made, and that programs written for our system should work on any Java platform (but, of course, without the added semantics of non-critical memory allocations).

The proposed approach is to use the exception mechanism of Java, so we define an exception class, NoNonCriticalMemoryException, with the added special semantics that all allocations that are done in a block which catches that exception are non-critical. Figure 5.3 shows a simple program which does both critical and non-critical memory allocations. This program will run on any Java platform with the only addition of an (empty) exception class.

[Figure 5.2: plot not reproduced. Recoverable labels: first GC cycle, second GC cycle, third GC cycle; axes "Allocatable memory", "Allocated memory" and "Denied allocations" against "Time", with "Heapsize" marked; legend entries "non-critical memory allocated by high priority process", "critical memory allocated by high priority process", "memory allocated by low priority process", "live objects and floating garbage", "Allocatable memory limit", "Memory reserved for critical HP allocations" and "Non-critical allocation limit".]

Figure 5.2: An example showing how the amounts of allocated and allocatable memory vary over time. Allocation requests for non-critical memory are denied when the amount of allocatable memory is less than or equal to the non-critical allocation limit ($R_{HP} - A_{LP}$). This happens at the end of the second GC cycle. Note that the first and third GC cycles are shorter than the nominal length due to low priority memory allocations. Also note how the non-critical limit is lowered when LP allocations are done so that the amount of memory available for non-critical allocations is not changed.

     1 void example(){
     2   Object aCriticalObject = new Object();
     3   foo(aCriticalObject); // do something important
     4   try{
     5     Object aNonCriticalObject = new Object();
     6     foo(aNonCriticalObject);
     7     doSomething();
     8     // do something
     9     // if the non-critical
    10     // allocation was successful
    11   } catch(NoNonCriticalMemoryException e){
    12     // non-critical allocation failed
    13   }
    14 }

Figure 5.3: Small example program. The allocation of aCriticalObject is always done, but the allocation of aNonCriticalObject may be denied. If the allocation fails, a NoNonCriticalMemoryException is thrown and may be handled in the catch-clause.

Non-criticality is transitive. Memory allocations in a method that is called from a non-critical region, like the calls to the methods foo() and doSomething() on lines 6 and 7 in Figure 5.3, are also non-critical. Note, however, that the first call to foo(), on line 3, is not non-critical since the call is not made from a non-critical block. This behaviour is preferable since an auxiliary function could be called both from critical and non-critical contexts. In order to make such transitivity possible without having to litter the code with try and catch clauses, the exception class NoNonCriticalMemoryException is an unchecked exception. An instance of this class can be statically allocated to avoid wasting memory.
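A minimal sketch of such an exception class, with a statically allocated instance, could look as follows. The field name `INSTANCE`, the `LoggingExample` class and the use of the four-argument `RuntimeException` constructor (available in standard Java 7 and later, here used to disable the stack trace so that throwing allocates nothing) are assumptions made for illustration; the text only requires an empty, unchecked exception class.

```java
/** Unchecked marker exception; a block that catches it is a non-critical region. */
class NoNonCriticalMemoryException extends RuntimeException {
    /** Preallocated shared instance, so that throwing it needs no new memory. */
    static final NoNonCriticalMemoryException INSTANCE = new NoNonCriticalMemoryException();

    private NoNonCriticalMemoryException() {
        // No message, no cause, suppression disabled, stack trace not writable.
        super(null, null, false, false);
    }
}

/** Hypothetical caller: logging is optional work that may be skipped. */
final class LoggingExample {
    static boolean tryLog(String message) {
        try {
            // Under the proposed semantics these allocations are non-critical.
            StringBuilder line = new StringBuilder(message);
            line.append('\n');
            return true;
        } catch (NoNonCriticalMemoryException e) {
            return false; // the run-time system denied the allocation; skip logging
        }
    }
}
```

On a standard Java platform nothing ever throws the exception, so `tryLog` always succeeds; only a run-time system that implements the added semantics would deny the allocation.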

An experimental implementation⁷ has been made using the IVM (Infinitesimal Virtual Machine) [Ive03], a very compact real-time Java virtual machine. Currently, non-critical allocations are explicitly turned on and off using a native method IVM.setMemoryPriority(). This is, however, not fundamentally different from our proposed approach since those calls could be inserted automatically by the class loader as the exception table is set up (much in the same way as monitorenter and monitorexit byte codes are inserted for synchronized methods).

5.6 Summary

It was observed that memory priority and CPU time priority need to be treated separately. The logging example shows that a process having high CPU time priority doesn't necessarily mean that all of its memory allocations are critical. The idea of applying priorities to memory allocation was introduced and it was shown how this can be used to enhance the robustness of real-time applications. The advantage this approach gives is twofold: Firstly, it provides run-time support for prioritizing memory allocations if there is not enough available memory to safely accommodate all allocation requests. Secondly, but equally important, it makes it easier to provide hard guarantees since the worst case memory usage calculations only have to be done for the critical parts of the system, as non-critical allocations cannot cause the system to fail. Furthermore, it is suggested that the same mechanisms could be used to increase performance by limiting the amount of memory allocation and, consequently, GC work.

⁷ The experiments with priorities for memory allocations are presented in Section 8.5.

The presented approach is based on the notion of non-critical memory allocation requests, which can be used by the programmer to indicate that the memory allocations done in a certain part of the program are less important than the rest. Such non-critical allocations may be allowed to fail if the run-time system decides that that memory could be of better use elsewhere or that the increased garbage collection work would degrade system performance.

The incorporation of priorities for memory allocations in an object-oriented language is studied and a way of introducing non-critical memory allocation in a Java system without making any changes to the syntax of the Java language is proposed. This has successfully been implemented in the IVM experimental Java virtual machine.

Preliminary experiments show that the mechanism is fairly easy to implement and can improve the robustness and performance of a control application by restricting its operation to the critical tasks if the system runs low on memory. It allows the programmer to write a system that performs better if run on a faster and larger system but whose critical tasks won't fail if it is run on a system with less than the ideal amount of memory. Instead, the non-critical features of the system will automatically be turned off if there isn't enough memory for them to be safely executed.

CHAPTER 6

MEMORY-AWARE FEEDBACK SCHEDULING

Feedback control is a good way to cope with uncertainties, and has successfully been used in process schedulers for real-time control systems with non-deterministic execution times, a technique known as feedback scheduling. Such scheduling is very suitable for systems which change between different operating modes with different resource utilization patterns, where using worst case assumptions would yield an unacceptably low CPU utilization. A feedback/feed-forward system can adapt to the changing requirements of the application and tune, for instance, the period times of the tasks in order to keep the CPU utilization at a safe level while optimizing the quality of service delivered by the system.

This chapter investigates how an auto-tuning time-triggered GC can be incorporated in a feedback scheduling system in order to make the memory management overhead explicit and let the process scheduler take this into account when scheduling the application tasks.

It is also studied how the priorities for memory allocations presented in Chapter 5 can be used, in a feedback scheduling system, to control the allocation rates of the application threads in order to optimize the trade-off between memory and CPU time consumption.

6.1 Introduction

Thus far, we have studied how to calculate the scheduling parameters for a time-triggered garbage collector in two different cases. In the first one, all parameters ($L_{max}$, $a_i$, etc.) were known and constant. In the second case, the parameters were estimated based on run-time measurements. In a feedback scheduling system, the GC scheduling problem comes in a third form. Here, the parameters of the mutator threads are known at any particular instant, but may change as the scheduler changes sampling rates in order to maximize the overall performance.

Previous work on feedback scheduling and automatic identification of (soft) real-time systems [AP00] has shown how self-tuning regulators can be used to control resource allocation without a priori knowledge about the task requirements. However, in existing feedback scheduling systems the memory management overhead is either ignored or treated implicitly as a part of the application's execution.

With traditional incremental garbage collectors, the memory management overhead is inlined in the application code, as a small amount of GC work is performed at each allocation. Therefore the memory management overhead can be treated as part of the mutator's execution time and no special consideration is required (although doing so may still improve performance).

With a concurrent garbage collector, that is no longer possible. As the GC work motivated by the actions of the mutator is performed by a separate task, the CPU utilization of that task must be handled explicitly by the scheduler. The problem with GC scheduling is that the GC has to finish each cycle before the available memory is exhausted, or else it will stop the world to complete the cycle, causing unacceptable delays for the hard real-time tasks. Therefore, care has to be taken to make sure that the GC is always given the CPU time (or bandwidth) it needs. This implies that we cannot use standard feedback scheduling on the garbage collection thread, as making the GC cycles longer (to reduce the GC's CPU utilization) may be fatal. In the proposed approach, the deadline and CPU utilization calculated by the GC scheduler cannot be changed by the feedback scheduler, but must be taken into account when the period times of the application threads are calculated. This corresponds to making the GC a rigid task in [BLA02].

This chapter studies how to take the memory management costs into account in the period assignment problem of the feedback scheduler. Section 6.2 derives approximate models for estimating and optimizing the GC scheduling parameters together with the period assignment. Section 6.3 briefly discusses how slack in the schedule caused by conservative estimates of the GC utilization can be utilized when the GC finishes its work before its deadline. Section 6.4 investigates how the mechanisms for different priorities on memory allocation requests from Chapter 5 can be utilized in a feedback scheduling context, where controlling the allocation rate of processes gives another degree of freedom when optimizing overall performance.


6.2 GC-aware period assignment

Recall the period assignment problem from Equation (2.3),

\[ \min_{h_1 \ldots h_n} \sum_{i=1}^{n} J_i(h_i) \quad \text{subject to} \quad \sum_{i=1}^{n} \frac{C_i}{h_i} \le U_{sp} . \]

If the cost function, $J$, is (approximated by) a linear or quadratic function, it has been shown that a closed-form solution to the optimization problem (2.3) can be found. With the cost function

\[ J_i(h_i) = \alpha_i + \gamma_i h_i \tag{6.1} \]

or, equivalently,

\[ V_i(f_i) = \alpha_i + \frac{\gamma_i}{f_i} \tag{6.2} \]

the optimal frequencies, $f_i^{\star}$, are given by

\[ f_i^{\star} = \left( \frac{\gamma_i}{C_i} \right)^{\frac{1}{2}} \frac{U_{sp}}{\sum_{j=1}^{n} (C_j \gamma_j)^{\frac{1}{2}}} \tag{6.3} \]

and for a quadratic approximation of the cost function, a similar explicit solution can be found [CEBA02].
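The closed-form solution (6.3) is cheap to evaluate at run time. The following is a minimal sketch with illustrative names, assuming execution times in seconds and strictly positive cost coefficients.

```java
/** Sketch of the closed-form period assignment of Equation (6.3). Names are illustrative. */
final class PeriodAssignment {
    /** f_i = sqrt(gamma_i / C_i) * Usp / sum_j sqrt(C_j * gamma_j). */
    static double[] optimalFrequencies(double[] c, double[] gamma, double usp) {
        if (c.length != gamma.length) {
            throw new IllegalArgumentException("array lengths differ");
        }
        double denom = 0.0;
        for (int j = 0; j < c.length; j++) {
            denom += Math.sqrt(c[j] * gamma[j]);
        }
        double[] f = new double[c.length];
        for (int i = 0; i < c.length; i++) {
            f[i] = Math.sqrt(gamma[i] / c[i]) * usp / denom;
        }
        return f;
    }

    /** Resulting utilization, sum_i C_i * f_i (equal to sum_i C_i / h_i). */
    static double utilization(double[] c, double[] f) {
        double u = 0.0;
        for (int i = 0; i < c.length; i++) {
            u += c[i] * f[i];
        }
        return u;
    }
}
```

Substituting the frequencies back into the constraint gives a utilization of exactly $U_{sp}$, which is a convenient sanity check for an implementation.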

In a system with a scheduled garbage collector, the required CPU utilization of the GC, $U_{GC}$, must be taken into account when assigning task periods in order to keep utilization below the setpoint. To get the total CPU utilization $U_{sp}$, the reference utilization for the mutator tasks in the feedback scheduler must therefore be reduced to

\[ U_{ref} = U_{sp} - U_{GC} . \tag{6.4} \]

The utilization of the garbage collector is $U_{GC} = C_{GC} / T_{GC}$ and thus, the constraint of the period assignment problem becomes

\[ \sum_{i=1}^{n} \frac{C_i}{h_i} + \frac{C_{GC}}{T_{GC}} \le U_{sp} . \tag{6.5} \]

Given the previously derived expressions for the GC cycle and execution time, we get the general expression for the required CPU utilization for GC,

\[ U_{GC} = \frac{C_{GC}(S_h)}{T_{GC}(H, L, a_1, \ldots, a_n, h_1, \ldots, h_n)} . \tag{6.6} \]


However, at run-time, all parameters are typically not known, and therefore an approximate model must be used. We will now formulate such models for compensating for $U_{GC}$ in FBS period assignment. In the first one, we will simply use a GC auto-tuner, as described in Sections 4.2 and 4.3, as a reference generator to the feedback scheduler. In the second, we will incorporate the GC tuning into the optimization problem of the feedback scheduler, in the case where $L_{max}$ is known. In the third one, we assume that $L_{max}$ is unknown and derive similar expressions based on the previously described GC auto-tuning techniques.

As far as the optimization problem is concerned, we will assume that $C_{GC}$ is constant. This just means that in the formulation of the optimization problem, we assume that $C_{GC}$ is independent of the period times of mutator tasks, and that the effects that changes to the schedule have on $C_{GC}$ are captured by the feedback loop. The interaction between the GC cycle parameter estimation and the feedback scheduling is done only through the model for $T_{GC}$.

6.2.1 Separate GC tuning and feedback scheduler

The simplest way of taking garbage collection work into account is to use the GC auto-tuner as a reference generator for the feedback scheduler. Figure 6.1 shows how the adaptive garbage collection scheduler from Chapter 4 fits into a general feedback scheduling system. The GC thread is scheduled as a normal application thread, but with the important difference that it is allowed to set its own deadline, whereas the feedback scheduler changes the deadlines of the application threads in order to optimize CPU utilization.

As mentioned, the special treatment of the GC thread is necessary since the GC will stop all application threads if the system runs out of memory, and that must be avoided as it leads to long GC pauses and unacceptable real-time performance. In this case, the GC tuner and the feedback scheduler are independent of each other, and the feedback scheduler simply uses $U_{ref}$ as in (6.4), where $T_{GC}$ and $C_{GC}$ are estimated using one of the described techniques.

[Figure 6.1: block diagram not reproduced. Recoverable labels: blocks "Scheduler", "Tasks", "Dispatcher" and "Memory manager and GC auto-tuner"; signals $U_{sp}$, $\{T_i\}$, $\{jobs\}$, $C_i$, $U$, $T_{GC}$, $C_{GC}$ and "job".]

Figure 6.1: Feedback scheduling of both application tasks and GC. The GC task issues jobs which are dispatched just as any other jobs. The only difference between the GC task and the application tasks is that the GC is allowed to set its own period time while the feedback scheduler changes the application tasks' period times in order to keep $U \le U_{sp}$.

However, in general, the different tasks have different memory requirements, and thus any changes to the scheduling will affect the GC workload. As the GC scheduler is decoupled from the feedback scheduler, such effects cannot be taken into account in the period assignment, and this is a limitation of the described approach. Instead, any changes to the allocation rate (and, hence, to $T_{GC}$ and $U_{GC}$) caused by the changes in period times are compensated for by the feedback to the GC tuner. That may, in turn, cause $U_{ref}$ to change, and therefore, this model may show oscillating behaviour. Such oscillations can, however, be avoided by using conservative settings in the GC auto-tuner. For instance, if the $U_{GC}$ prediction is filtered using the maximum value and a forgetting factor close to unity, a well damped system can be achieved, at the price of lower average utilization.
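One possible form of such a conservative filter is a peak detector with exponential forgetting. The sketch below is an assumption made for illustration: the update rule, the class name and the parameter name are not taken from the auto-tuner of Chapter 4, which may use a different filter.

```java
/** Sketch of a max-with-forgetting filter for the U_GC prediction (assumed form). */
final class UgcMaxFilter {
    private final double forgetting; // forgetting factor, in (0, 1]; close to 1 is conservative
    private double estimate;         // current U_GC prediction

    UgcMaxFilter(double forgetting) {
        if (!(forgetting > 0.0 && forgetting <= 1.0)) {
            throw new IllegalArgumentException("forgetting factor must be in (0, 1]");
        }
        this.forgetting = forgetting;
    }

    /** New prediction: the larger of the latest measurement and the decayed old prediction. */
    double update(double measured) {
        estimate = Math.max(measured, forgetting * estimate);
        return estimate;
    }
}
```

A peak in the measured utilization is adopted immediately, while a drop is followed only slowly, which damps the loop at the price of a lower average utilization for the mutator tasks.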

Another, and potentially more important, drawback of the separated approach is that the measured GC overhead is divided evenly across all mutator tasks. Thus, even if one task is responsible for the majority of the memory usage, the sampling rates of all tasks will be affected. In systems with competing (as opposed to cooperating) tasks, that may be an issue, as far as fairness in the scheduling is concerned.

6.2.2 Integrated GC and feedback scheduling

If the GC estimation and tuning is incorporated in the feedback scheduler itself, the effects on the GC utilization of changing period times can be taken into account in the period time optimization. In principle, we want to be able to express the cost of garbage collection per task and sample, in such a way that the constraint in the optimization problem is of a form that allows the existing closed-form solution to be used.

Under the previously stated assumption that $C_{GC}$ is constant, $U_{GC}$ will be a function of the GC cycle time, which, in turn, depends on the allocation rate. Thus, we get a utilization constraint with one term for the CPU requirement and one for the memory requirement of each task,

\[ \sum_{i=1}^{n} \frac{C_i + K_{GC} \cdot a_i}{h_i} \le U_{sp} \tag{6.7} \]

where $K_{GC}$ can be viewed as the cost of memory allocation, in CPU seconds per byte. With this formulation, the utilization constraint is of the same form as (2.3), as the extra term is constant (assuming $a_i$ is independent of $h_i$), and thus the existing explicit solution to the optimization problem can be used. We will now see how the utilization constraint can be expressed when the maximum amount of live memory is known and unknown, respectively.
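Since (6.7) has the same form as the original constraint, the closed-form solution (6.3) can be reused with $C_i + K_{GC} \cdot a_i$ in place of $C_i$. The following sketch illustrates this; the names are hypothetical, and $K_{GC}$ is simply passed in as a parameter.

```java
/** Sketch: period assignment with the GC cost folded into each task, as in (6.7). */
final class GcAwarePeriodAssignment {
    /** Periods h_i = 1 / f_i, with f_i from (6.3) applied to the cost C_i + K_GC * a_i. */
    static double[] optimalPeriods(double[] c, double[] a, double[] gamma,
                                   double kGc, double usp) {
        int n = c.length;
        if (a.length != n || gamma.length != n) {
            throw new IllegalArgumentException("array lengths differ");
        }
        double[] cost = new double[n];
        double denom = 0.0;
        for (int j = 0; j < n; j++) {
            cost[j] = c[j] + kGc * a[j]; // CPU time plus GC time caused by the task
            denom += Math.sqrt(cost[j] * gamma[j]);
        }
        double[] h = new double[n];
        for (int i = 0; i < n; i++) {
            double f = Math.sqrt(gamma[i] / cost[i]) * usp / denom;
            h[i] = 1.0 / f;
        }
        return h;
    }

    /** Left-hand side of (6.7) for the given periods. */
    static double utilization(double[] c, double[] a, double[] h, double kGc) {
        double u = 0.0;
        for (int i = 0; i < c.length; i++) {
            u += (c[i] + kGc * a[i]) / h[i];
        }
        return u;
    }
}
```

A task that allocates a lot of memory per sample gets a larger effective cost and hence a longer period, which addresses the fairness drawback of the separated approach.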

Using worst case live memory information

Given the maximum amount of live memory, $L_{max}$, and the amount of memory allocated per period of each task, $a_i$, we can use Theorem 1 to find the maximum allowed $T_{GC}$ and, hence, the CPU utilization:

\[ U_{GC} = C_{GC} \cdot \frac{\sum_{i=1}^{n} \frac{a_i}{h_i}}{\frac{H - L_{max}}{2} - \sum_{j=1}^{n} a_j} . \tag{6.8} \]

Inserting this expression for $U_{GC}$ into (6.5) gives the constraint

\[ \sum_{i=1}^{n} \frac{C_i + \dfrac{C_{GC}}{\frac{H - L_{max}}{2} - \sum_{j=1}^{n} a_j} \cdot a_i}{h_i} \le U_{sp} \tag{6.9} \]

which, assuming that $C_{GC}$ and $\{a_1 \ldots a_n\}$ are independent of $\{h_1 \ldots h_n\}$, can be written as (6.7).

In practice, the period time of the GC will be much longer than that of the mutator tasks, and thus $\sum_{j=1}^{n} a_j$ is typically very small compared to $H - L_{max}$. Further, if a conservative estimation of $U_{GC}$ is used, and $U_{sp} < 1$, there will always be some slack in the schedule. For these reasons, sufficient safety margins can be achieved, making it reasonably safe to approximate (6.9) with

\[ \sum_{i=1}^{n} \frac{C_i + \dfrac{C_{GC}}{\frac{H - L_{max}}{2}} \cdot a_i}{h_i} \le U_{sp} . \tag{6.10} \]

That is,

\[ K_{GC} = \frac{C_{GC}}{\frac{H - L_{max}}{2}} \tag{6.11} \]

which is precisely the GC CPU time per allocated byte.
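To get a feeling for the size of the approximation, the coefficient in (6.9) can be compared with (6.11). The sketch below uses illustrative names and made-up numbers; memory sizes are in bytes and $C_{GC}$ in seconds.

```java
/** Sketch comparing the exact GC cost coefficient of (6.9) with the approximation (6.11). */
final class GcCostPerByte {
    /** Coefficient in (6.9): C_GC / ((H - Lmax) / 2 - sum_j a_j). */
    static double exact(double cGc, double heap, double lMax, double[] a) {
        double sum = 0.0;
        for (double aj : a) {
            sum += aj;
        }
        return cGc / ((heap - lMax) / 2.0 - sum);
    }

    /** K_GC from (6.11): C_GC / ((H - Lmax) / 2). */
    static double approximate(double cGc, double heap, double lMax) {
        return cGc / ((heap - lMax) / 2.0);
    }
}
```

With an 8 MB heap, 4 MB of live memory and two tasks allocating 1000 and 3000 bytes per period, the approximation is about 0.2% smaller than the exact coefficient, so it slightly underestimates the GC cost; this is the error that the slack in the schedule is expected to absorb.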


Without a priori analysis

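The above discussion assumes $L_{max}$ to be known and that it is reasonable to use the worst case live memory. If that is not the case, $T_{GC}$ can be estimated using (4.3), and the constraint (6.5) becomes

\[ \sum_{i=1}^{n} \frac{C_i}{h_i} + \frac{2\, C_{GC}}{\dfrac{F(t)}{\sum_{i=1}^{n} \dot{a}_i} + t - t_s} \le U_{sp} \tag{6.12} \]

which, with $\dot{a}_i = a_i / h_i$ (the allocation rate of task $i$), gives

\[ \sum_{i=1}^{n} \frac{C_i}{h_i} + \frac{2\, C_{GC}}{\dfrac{F(t)}{\sum_{i=1}^{n} \frac{a_i}{h_i}} + t - t_s} \le U_{sp} \tag{6.13} \]

which can be reorganized as

\[ \sum_{i=1}^{n} \frac{C_i + \dfrac{2\, C_{GC}}{F(t) + (t - t_s) \sum_{j=1}^{n} \frac{a_j}{h_j}}\, a_i}{h_i} \le U_{sp} . \tag{6.14} \]

That is,

\[ K_{GC} = \frac{2\, C_{GC}}{F(t) + (t - t_s) \sum_{j=1}^{n} \frac{a_j}{h_j}} \tag{6.15} \]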

Unfortunately, the constraint (6.14) is not linear, meaning that the existing closed-form solution is not directly applicable. Worse yet, in this form, we get an optimization problem where both the objective function and the constraint are concave, and that makes it practically useless.

In order to remedy that, an approximation that turns (6.14) back into a linear constraint is sought. It is observed that, if the allocation rate a is constant, the denominator in (6.15) is equal to F(ts) = Fs. If that is used to linearize the constraint, we get

$$\sum_{i=1}^{n} \frac{C_i + \dfrac{2\,C_{GC}}{F(t_s)} \cdot a_i}{h_i} \le U_{sp} \tag{6.16}$$

and

$$K_{GC} = \frac{2\,C_{GC}}{F_s} . \tag{6.17}$$
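The linearization step can be spelled out as follows. This is a sketch under two assumptions: the total allocation rate a (the sum of ai/hi over all tasks) is constant during the cycle, and no memory is returned to the free pool before the cycle ends, so that the free memory decreases linearly from F(ts).

```latex
\begin{align*}
F(t) &= F(t_s) - a\,(t - t_s), \qquad a = \sum_{j=1}^{n} \frac{a_j}{h_j} \\
F(t) + (t - t_s)\,a &= F(t_s) = F_s \\
K_{GC} &= \frac{2\,C_{GC}}{F(t) + (t - t_s)\,a} = \frac{2\,C_{GC}}{F_s}
\end{align*}
```

Under these assumptions the denominator of (6.15) no longer depends on t or on the periods, which is what makes the constraint linear again.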

The error in the TGC approximation of (6.16) will increase with increasing changes in a and the effect will be greater if the change occurs later in the GC cycle. Figure 6.2 shows how the approximation error depends on the change in a and the time of change. For instance, if the allocation rate is doubled half-way into the GC cycle, the relative error in the TGC approximation will be 20%. However, as the total GC utilization typically is 5–20%, the overall impact of the error in the approximated utilization will only be a few percent. For robustness, a safety margin to accommodate such uncertainties can be added when setting Usp.

[Figure 6.2: plot. x-axis: fraction of GC cycle elapsed at switch time (0 to 1); y-axis: relative error in GC cycle time (−0.4 to 0.4).]

Figure 6.2: Relative error in TGC approximation as function of change in a and time of switch. The lines represent changes in a from a factor of 0.5 to a factor of 2. An increase in a causes underestimation of UGC.

Thus, with suitable approximations, the CPU requirement of the GC task can be included in the period assignment, while keeping the optimization problem in a form that allows the existing closed-form solutions to be used.

6.3 Utilizing slack

By making the costs of memory management explicit and taking them into account in the period time optimization, it is possible to use a concurrent garbage collector in a feedback scheduling system. In order to get a system that is robust to variations in execution times, the utilization setpoint is typically set below 100%. Also, to get stable estimates of GC scheduling parameters, the estimation needs to be conservative. That means that, in the average case, there will be some slack in the schedule, allowing the GC to finish before its deadline.

The feedback scheduler reserves a fraction of the CPU time for garbage collection. However, when the GC is not running, this CPU time could be used for mutator threads. In a system with a time-triggered GC, it is known that when the GC has finished a cycle it will not need to run again until its next release time. If the feedback scheduler is aware of the state of the GC, this means that when the GC has completed a cycle, a higher mutator utilization can be allowed until the next GC release time. That is, if the GC finishes at time tf, with ts < tf < te,

$$U_{ref}(t) = \begin{cases} U_{sp} - U_{GC}, & t_s \le t \le t_f \\ U_{sp}, & t_f < t < t_e - \delta \end{cases} \tag{6.18}$$

where δ is used to take into account the fact that increasing the mutator utilization may increase the allocation rate and, hence, shorten the time until the next GC release.

The GC cycle time and, consequently, the start time of the next GC cycle were estimated based on worst case assumptions about floating garbage, but when a GC cycle has finished, it is known how much memory was actually reclaimed. Thus, δ depends on both the amount of free memory and the allocation rate. We know that the amount of free memory at the time the GC has completed the cycle satisfies F(tf) ≥ Fmin. The requirement is the same: when the next GC cycle starts, the amount of free memory must be no less than Fmin. Therefore, the adjusted release time of the next GC cycle must satisfy

$$R'_{GC}(a) \le t_f + \frac{F(t_f) - F_{min}}{a} \tag{6.19}$$

and, with equality, we get

$$\delta = t_e - R'_{GC}(a) = t_e - \left( t_f + \frac{F(t_f) - F_{min}}{a} \right) . \tag{6.20}$$

Thus, a sufficient degree of conservatism can be used to give robustness against inaccuracies in the GC scheduling parameter estimates due to variations and approximations, without the low average CPU utilization normally associated with such conservative scheduling.
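The relations (6.18) to (6.20) can be sketched in C as shown below. The function names are hypothetical, and the value returned outside the interval covered by (6.18) is an assumption of this sketch (it falls back to the reduced reference, on the reasoning that the next GC cycle is then released). The allocation rate a is assumed to be strictly positive.

```c
/* Adjusted release time of the next GC cycle: (6.19) taken with equality.
 * F_tf is the free memory when the GC finished, a the allocation rate. */
double adjusted_release(double t_f, double F_tf, double F_min, double a)
{
    return t_f + (F_tf - F_min) / a;
}

/* delta according to (6.20). */
double slack_delta(double t_e, double t_f, double F_tf, double F_min, double a)
{
    return t_e - adjusted_release(t_f, F_tf, F_min, a);
}

/* Utilization reference for the mutator tasks according to (6.18). */
double u_ref(double t, double t_s, double t_f, double t_e, double delta,
             double U_sp, double U_gc)
{
    if (t >= t_s && t <= t_f)
        return U_sp - U_gc;      /* GC cycle in progress              */
    if (t > t_f && t < t_e - delta)
        return U_sp;             /* GC done: hand slack to mutators   */
    return U_sp - U_gc;          /* not covered by (6.18): assumption */
}
```

A feedback scheduler would presumably recompute delta whenever the measured allocation rate changes, since a higher rate moves the adjusted release time earlier.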

6.4 Controlling the allocation rate

As we have seen, the fraction of CPU time that must be reserved for garbage collection depends on the allocation rate of the mutator, which, in turn, depends on the period times of the individual threads. Therefore, in a system with garbage collection, the feedback scheduler controls the CPU usage of a thread directly, through the period assignment, but also indirectly as the period time affects the allocation rate.

The notion of priorities for memory allocations, introduced in Chapter 5, suggests that it may be possible, to some extent, to control the allocation rates of the individual threads directly. Such a mechanism may be used to increase the flexibility of a feedback scheduler, by making it possible to separate allocation of memory and CPU time. As higher memory usage means more GC work, that allows the scheduler, or resource manager, to trade off memory usage for CPU time.

Assuming that each task has a critical and a non-critical part, with memory requirements of $a^{(c)}$ and $a^{(nc)}_{max}$, respectively, we extend the cost function with a term corresponding to the increase in quality from the non-critical parts

$$J(h, a^{(nc)}, \ldots) = \ldots\,; \quad 0 \le a^{(nc)} \le a^{(nc)}_{max} \tag{6.21}$$

which gives the optimization problem

$$\begin{aligned} \min_{h_1 \ldots h_n} \quad & \sum_{i=1}^{n} J_i(h_i, a^{(nc)}_i, \ldots) \\ \text{subject to} \quad & \sum_{i=1}^{n} \frac{C_i + K_{GC} \cdot \left( a^{(c)}_i + a^{(nc)}_i \right)}{h_i} \le U_{sp} \end{aligned} \tag{6.22}$$

The motivation for introducing different priorities for memory allocations as presented in Chapter 5 was primarily to provide isolation between critical and non-critical parts of a system. Now, we focus on optimizing the performance of the application, and thus it becomes more important to take into account which allocations should be performed. The memory manager can, however, only limit the amount of non-critical allocations per time unit (typically, per GC cycle); as the run-time system doesn’t have any information about the purpose of the application threads, it cannot make any detailed decisions about exactly which allocations to allow or deny.

In order to maximize the quality of service, it is therefore better to actually communicate to each thread how much non-critical memory it is currently allowed to use. In the application, this can then be translated into performing some parts every nth sample, or something similar. Thus, while the hard limit used to ensure robustness may be enforced by the run-time system, the programmer can make more fine-grained decisions about how to make best use of the memory available to each thread.
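One hypothetical way for a thread to translate a communicated budget into "every nth sample" is sketched below. The function name, the notion of a per-GC-cycle budget in bytes, and the assumption that the optional part allocates a fixed number of bytes per run are all illustrative assumptions, not part of the system described here.

```c
/* Returns the smallest n such that running the optional (non-critical)
 * part every n-th sample keeps its allocations within budget_bytes per
 * GC cycle. Returns 0 if the optional part should not be run at all. */
unsigned optional_interval(unsigned long budget_bytes,
                           unsigned long samples_per_cycle,
                           unsigned long bytes_per_run)
{
    if (bytes_per_run == 0)
        return 1;                         /* allocates nothing: always run */

    unsigned long max_runs = budget_bytes / bytes_per_run;
    if (max_runs == 0)
        return 0;                         /* budget too small for one run  */
    if (max_runs >= samples_per_cycle)
        return 1;                         /* budget covers every sample    */

    /* ceil(samples_per_cycle / max_runs) */
    return (unsigned)((samples_per_cycle + max_runs - 1) / max_runs);
}
```

The hard limit would still be enforced by the memory manager; a calculation like this only lets the application spend its allowance evenly over the cycle.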

In an actual implementation of these ideas in a feedback scheduled system, the non-critical memory limit must be set individually for each thread and thus the interface between the feedback scheduler and memory manager must contain operations for that. Alternatively, if the threads are cooperating, it can be expressed directly in the code.

In the general formulation, (6.22) might prove hard to solve on-line. In order to test the fundamental principle, we will now investigate a simplified case, where the problem is reduced to either allowing all or no non-critical allocations of a thread in each sample.

Case study: Ball-and-beam

As an example, we take the ball-and-beam process¹, controlled by an LQG regulator. It is assumed that the angle can either be measured (which requires a measured-value object to be allocated and passed to the controller) or estimated by using an observer. I.e., the allocation of the angle measurement is non-critical. Depending on the state of the memory system, KGC (and hence the total available CPU utilization) will vary.

Using Matlab-based tools, the effects of the scheduling on control performance in the described scenario are analysed and simulated. For control performance analysis, the Jitterbug toolbox is used. Jitterbug is a tool for studying how timing affects the performance of a computer-controlled system [LC02]. The simulation was done using the TrueTime real-time kernel simulator in Matlab/Simulink, with simulated heap, a separate GC thread, and one disturbance task. The simulator is presented in Chapter 8.

Theoretical analysis

The analysis is done for two versions of the ball-and-beam controller: with or without angle measurements. The controller with the angle measurement will allocate more memory per sample, and therefore, under the discussed feedback scheduling, it will suffer a bigger penalty from the GC overhead. On the other hand, for the same sampling rate, the controller using angle measurements will perform better. In order to optimize quality of control, the cost of memory management must be balanced against the control performance, to choose which of the two controllers to use, given a certain KGC.

Figure 6.3 shows the calculated total cost for a range of sampling rates. Figure 6.4 shows the sampling rate for the two controllers as a function of KGC. The controller without angle measurements has lower memory requirement, and is therefore much less sensitive to KGC.

¹ The experiment setup is described in more detail in Chapter 8.


Putting this together, using linear cost functions, Figure 6.5 shows J as function of KGC for the two systems. The intersection of the lines is the value of KGC where the system with observed angle starts outperforming the one with measured angle, as a lower memory usage allows a higher sampling rate: the optimal Kswitch.
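The intersection can be illustrated with a small C sketch. It rests on simplifying assumptions that are not stated in the text: a single control task whose period is given by (6.7) with equality, h = (C + KGC · a)/Usp, and a linear cost approximation J(h) = alpha + beta · h for each controller. The type `controller_t` and all function names are hypothetical.

```c
/* Parameters of one controller variant (all names are assumptions). */
typedef struct {
    double alpha;  /* offset of the linear cost approximation J(h) */
    double beta;   /* slope of the linear cost approximation J(h)  */
    double C;      /* execution time per sample [s]                */
    double a;      /* memory allocated per sample [bytes]          */
} controller_t;

/* Shortest period allowed by (6.7), taken with equality, for one task. */
double period_for(const controller_t *c, double K, double U_sp)
{
    return (c->C + K * c->a) / U_sp;
}

/* Cost of running controller c at that period. */
double cost_for(const controller_t *c, double K, double U_sp)
{
    return c->alpha + c->beta * period_for(c, K, U_sp);
}

/* Solves cost_for(m, K) == cost_for(o, K) for K. Returns 1 and stores the
 * result in *K_out, or returns 0 if the two cost lines are parallel. */
int k_switch(const controller_t *m, const controller_t *o, double U_sp,
             double *K_out)
{
    double denom = m->beta * m->a - o->beta * o->a;
    if (denom == 0.0)
        return 0;
    *K_out = (U_sp * (o->alpha - m->alpha) + o->beta * o->C - m->beta * m->C)
             / denom;
    return 1;
}
```

On-line, a scheduler following this scheme would compare the current KGC estimate with the computed switching point and select the controller with the lower cost.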

[Figure 6.3: plot. x-axis: h (0 to 0.35); y-axis: J (0 to 35); curves: with observer, with measured phi.]

Figure 6.3: The calculated costs and the linear approximations.

Simulation

In order to measure the control performance of the LQG regulator, if qn is the weight of the nth state (i.e., Q = diag(q1 . . . qn)), and x is the state vector, we define the total cost as

$$J_{tot} = \int_{t} \sum_{i=1}^{n} q_i x_i^2(t) \, dt . \tag{6.23}$$
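In a sampled simulation, the integral in (6.23) has to be approximated. The following C sketch uses a simple rectangle rule with a fixed step dt; this discretization and the function name are assumptions made for illustration, not a description of how the cost was computed in the simulator.

```c
#include <stddef.h>

/* Adds the contribution of one simulation step of length dt to J_tot:
 * dt * sum_i q[i] * x[i]^2, where x is the current state vector and
 * q holds the diagonal of the weight matrix Q. */
double accumulate_cost(double J_tot, const double *q, const double *x,
                       size_t n, double dt)
{
    double stage = 0.0;
    for (size_t i = 0; i < n; i++)
        stage += q[i] * x[i] * x[i];
    return J_tot + stage * dt;
}
```

Calling this once per simulation step and carrying the returned value forward yields an approximation of Jtot over the whole run.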

Running the system with different values of the switching point Kswitch and measuring Jtot gives the plot shown in Figure 6.6, where the minimum corresponds to the optimal Kswitch. The cost is the total cost of a 160 s execution, and it is not normalized. The absolute values of the cost are not very interesting, as a direct comparison with the analysis is not possible as they show different things. The analysis calculated the cost for different, constant, values of KGC. In the simulation, KGC varied throughout the execution and at each scheduling instant the controller with the lowest cost was used. The GC was scheduled as described in Chapter 4, and the feedback scheduler used UGC to adjust the utilization reference, according to Section 6.2.1.

[Figure 6.4: plot. x-axis: K (6 to 13, ×10⁻⁵); y-axis: h (0.14 to 0.2).]

Figure 6.4: Sample rate as function of KGC.

[Figure 6.5: plot. x-axis: K (6 to 13, ×10⁻⁵); y-axis: Jtot (30 to 35).]

Figure 6.5: Cost as function of KGC. For high values of KGC, the system with the observer will outperform the one with measured angle, due to the large CPU cost of memory allocation.

[Figure 6.6: plot. x-axis: Kswitch (6 to 10, ×10⁻⁵); y-axis: Jtot (1800 to 2400).]

Figure 6.6: Total cost as function of Kswitch.

In theory, the minimum in Figure 6.6 should be at the same KGC value as the intersection of the lines in Figure 6.5. The discrepancy between the theoretical and simulated results can be explained by a combination of inaccuracies in both the models and the run-time system. The theoretical results are based on an optimal feedback scheduler, but at run-time, some approximations are required. Notably, in order to determine the mutator utilization, both the GC cycle time and the GC execution time have to be predicted. The GC cycle time is dependent on the allocation rate and object distribution, which are both affected by the mode changes. Also, in order to get a high enough KGC to reach Kswitch, the system had to be quite stressed, with a UGC around 45–55%. Thus the impact of the discussed approximations and uncertainties, which would be small in a system with lower UGC, became significant.

While the setup in this simple case study is not entirely realistic, it still illustrates the fundamental idea that if memory usage can be controlled, the total quality of control of a system can be improved by on-line optimization of the trade-off between memory and CPU usage.


6.5 Summary

In order to use scheduled garbage collection in a feedback scheduling system, the required CPU utilization of the GC task must be known, and as the GC utilization depends on the memory behaviour of mutator tasks, it must be determined on-line. Also, as feedback scheduling is typically used in systems where the workload of the mutator tasks (and, hence, their execution pattern) is variable, the GC scheduling cannot be static but must be able to react to such changes.

Different approaches to taking the GC into account in the period assignment of a feedback scheduler were suggested. In the first approach, the GC auto tuner is used as a reference generator to the feedback scheduler, using feedback to adjust the utilization reference based on measured and estimated GC utilization. In the second approach, the GC scheduling is incorporated into the period assignment of the feedback scheduler.

Both approaches have similar performance², and the major differences between them lie in implementation and fairness. The separate approach is easier to implement, as the communication between the feedback scheduler and the memory manager is kept at a minimum: the GC utilization is accounted for by changing the utilization reference of the feedback scheduler. The advantage of the integrated approach is that it increases fairness of the schedule, as the memory usage is accounted for as a part of the execution time of a task.

Feedback scheduling is a technique for on-line resource management. It was suggested that overall performance can be enhanced if memory usage could also be included in the optimization. If a controller can be run in different modes, with different memory requirements, the trade-off between memory usage and CPU usage can be optimized on-line.

This chapter has presented different examples of how communication between the memory manager and feedback scheduler is, to some extent, necessary, and, in other cases, opens new possibilities for optimization of the performance of the complete system.

² Experiments are presented in Section 8.6.

CHAPTER 7

GC IN AN UNCOOPERATIVE ENVIRONMENT

Due to external requirements, run-time systems for embedded applications may have to operate in an uncooperative environment; for instance, extra-functional requirements or historical reasons may stipulate using an off-the-shelf C compiler and RTOS or including external, legacy or automatically generated, C code. In such cases, one cannot rely on detailed assumptions on the behavior of the back-end C compiler or the thread scheduler, which makes implementation of a real-time GC more challenging. For instance, it means that any synchronization required between collector and mutator needs to be done explicitly. It also means that the generated C code must be written so that it ensures, in a portable way, that no back-end optimization causes interference with the GC.

In particular, the combination of uncooperative compiler, uncooperative scheduler, and tight real-time requirements (low latency) makes a demanding challenge. Without control over the scheduling, some compiler optimizations cannot be allowed, as threads may be preempted at any time. For instance, if we are using a copying or compacting GC algorithm, pointers must always be read from memory, and not kept in registers, as the collector may move objects at (from the mutator’s point of view) any time. Furthermore, explicit synchronization with the collector is required, which adds to the execution time overhead of memory operations.

It is shown, and experimentally verified, how it is possible to implement an accurate, concurrent GC in an uncooperative environment, with maximum latency times of a few microseconds, and acceptable run-time overhead. Potential bottlenecks are identified, and compile-time and run-time optimizations to mitigate the problems are suggested.


After the introduction in Section 7.1, Section 7.2 discusses the problems associated with concurrent GC in an uncooperative environment and presents our approach. Section 7.3 briefly describes our GC API. Section 7.4 investigates some potentially expensive performance bottlenecks and Section 7.5 discusses how they may be mitigated.

7.1 Introduction

In a run-time system for real-time Java, or other safe languages with automatic memory management, it is essential to have an accurate, or exact (i.e., non-conservative), concurrent garbage collector with adequate real-time performance. Due to the external requirements discussed below, the GC must also be able to function in an uncooperative environment, meaning that we must make sure that neither correct behaviour, nor real-time performance, is jeopardized by compiler optimizations, concurrency issues or interference from external code. The challenges encountered when designing and implementing such a run-time system include:

Real-time performance: The collector should be fully concurrent in order to make it possible to schedule GC in a non-intrusive way [Hen98, RH03]. It should also have very fine-grained incrementality to allow latency times of at most a few microseconds, as required e.g. in automatic control applications.

Usability and flexibility: Just as a system must make efficient use of system resources, it must not require unreasonable amounts of engineering effort in order to meet e.g. timing and space constraints. Therefore, an important requirement on a run-time system that is to be practically usable is that it is easy to use and offers sufficient flexibility. This means, for example, that the interface to the memory manager must be fairly simple, and that it should be flexible enough to allow migration between platforms, operating systems, and GC algorithms with little, or no, effort.

Uncooperative compiler: The major reason for using a non-GC-aware compiler is availability; there are C compilers for practically all computer platforms and therefore it is desirable to implement new compilers for high level languages using C as intermediate code and a standard C compiler as the back-end. This gives access to a portable and highly optimized back-end without having to spend the effort required to implement one. However, not having control over machine code generation makes it harder to implement accurate garbage collection. Finding roots and identifying references are more difficult as we do not have control over activation record layout, register allocation, etc.

Uncooperative scheduler: As with compilers, there are many reasons for wanting to use an off-the-shelf real-time operating system. Also, if there is a need to call external native code it is not possible to rely on specific scheduling features like preemption points to ensure safe behaviour, as such external code does not contain preemption points. For instance, a thread in a Java program may call native functions in legacy libraries or code generated from a tool (like Matlab/Simulink). As it must be possible to preempt a thread during such native calls, if we want to make guarantees on latency, preemption must be possible at any instant and not just at preemption points.

Furthermore, for the sake of portability (currently, our system runs on POSIX¹, Linux/RTAI², STORK [AB91], and a locally developed kernel for the Atmel AVR series of micro-controllers), the interface to the underlying OS, and the reliance on certain features in it, must be kept at a minimum.

In isolation, each of these aspects does not pose a large problem, but, as we will see, the difficulty comes from the combination, which generates conflicting requirements. In particular, synchronization between the mutator and collector can be a problem; full preemption in combination with a fully concurrent GC (especially a compacting one) requires mutual exclusion and, to get short latency times, quite frequent locking and unlocking, which may be a serious performance bottleneck.

Since we usually cannot expect to have control over every aspect of the execution environment of an embedded control system, we need to find a way to handle the conflicting requirements this places on the design of a concurrent GC. This includes finding out how to make a reasonable trade-off between short latency and overall performance as well as investigating what possibilities exist for reducing the impact of the design conflicts by applying a combination of compile-time and run-time optimizations.

¹ Tested on Solaris on UltraSparc and Linux on Intel and PowerPC.

² The DIAPM Real-Time Application Interface is an addition to Linux, making it possible to run hard real-time tasks at the kernel level, below Linux. Tested on Intel, PowerPC and Axis ETRAX computers.


7.2 Exact GC in an uncooperative environment

The desire to use standard real-time operating systems and standard C compilers means that we will have to construct a GC that not only is independent of operating system or compiler support, but will work despite the way the operating system and compiler work. Designing a concurrent GC for such an environment is a challenge. The GC must be synchronized with the application threads and the operating system such that the heap remains consistent no matter how the operating system chooses to schedule the system. The compiler must also be prevented from certain optimizations that could jeopardize the integrity of the heap, especially in combination with a preemptive scheduler. We must also provide some type of runtime type information in order to make it possible for the GC to identify all references.

7.2.1 Uncooperative compiler

We must ensure that all references are traversed by the collector. This includes finding references that reside in local variables as well as avoiding that references are missed because of compiler optimizations. References may reside on stacks or on the heap (or in registers, but this must be avoided). If the back-end compiler doesn’t know about references, the code has to contain explicit instructions to inform the GC about the location and scope of each reference.

A common way of implementing this is by pushing the location of any local reference variable onto a root stack [Hen98]. Another alternative is to group all local variables for each function together into a C struct and to link these structs together forming a shadow stack containing all references [Hen02].

Finding references on stacks

An accurate traversing GC must be able to correctly identify all references outside the GC heap which reference objects on the heap, i.e., it must find all the root references. Roots can consist of global variables as well as variables located within method activation records on the C stack. An efficient strategy for tracking these is required.

Keeping track of root references becomes especially hard when we use a compiler or a back-end without support for GC, since we have little or no control over activation record layout, register allocation, and code optimization. In order to gain independence from the compiler, we need to have a format of references that is known to the GC. Just storing references as C pointers is not possible, as any kind of optimization may then be performed on them.

Our approach to tracking root references is to use an auxiliary root stack, consisting of reference structures, as shown in Figure 7.1. The reference structs also reside on the C stack and are linked together in a singly linked list. Roots are registered by calling PUSH_ROOT(gc_root) and de-registered by calling POP_ROOT(gc_root)³.

    typedef struct gc_root {
        GC___REF(ObjectHead) ref;
        struct gc_root *next_root;
    } gc_root;

Figure 7.1: The structure used to track references on stacks. The GC___REF(type) macro expands to type* for a non-moving GC, or to type** for a moving GC (to accommodate the indirect table or forwarding pointer of the read barrier).
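To make the mechanism concrete, the following is a minimal single-threaded sketch of such a root stack for a non-moving GC, where GC___REF(ObjectHead) is taken to be a plain ObjectHead pointer. The functions `push_root`, `pop_root` and `mark_roots`, the global `root_top`, and the one-field `ObjectHead` are hypothetical stand-ins written for this illustration; they are not the PUSH_ROOT and POP_ROOT macros of the actual implementation.

```c
#include <stddef.h>

/* Placeholder object header: only a mark flag (assumption of this sketch). */
typedef struct ObjectHead { int marked; } ObjectHead;

typedef struct gc_root {
    ObjectHead *ref;             /* GC___REF(ObjectHead), non-moving case */
    struct gc_root *next_root;   /* next (older) root on the root stack   */
} gc_root;

gc_root *root_top = NULL;        /* hypothetical: top of the root stack   */

/* Registers a root. The gc_root itself lives on the caller's C stack;
 * taking its address forces the compiler to keep it in memory. */
void push_root(gc_root *r)
{
    r->ref = NULL;
    r->next_root = root_top;
    root_top = r;
}

/* De-registers a root. As the roots form a list, this also works when
 * roots are removed in an order other than strict stack order. */
void pop_root(gc_root *r)
{
    gc_root **p = &root_top;
    while (*p != NULL && *p != r)
        p = &(*p)->next_root;
    if (*p == r)
        *p = r->next_root;
}

/* Collector side: marks every object directly referenced from a root
 * and returns the number of non-null roots visited. */
size_t mark_roots(void)
{
    size_t n = 0;
    for (gc_root *r = root_top; r != NULL; r = r->next_root) {
        if (r->ref != NULL) {
            r->ref->marked = 1;
            n++;
        }
    }
    return n;
}
```

The collector never inspects the C stack layout; it only follows the list, which is what makes the scheme independent of the back-end compiler.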

In a multi-threaded system, each thread has its own stack, and we use one root stack per thread. This makes things like popping all local variables of a function and handling exceptions easier, and also reduces the amount of required synchronization as the thread root stacks are independent. The roots of each thread are kept in a linked list (as above), and the heads of each thread root stack (marked, in our implementation, by next_root==0) are kept in a doubly linked list. The structure is shown in Figure 7.2, and Figure 7.3 gives an example of how the set of root stacks is linked together.

Finding references in objects

In order to find references in objects (i.e., on the heap), we need to have information about the object layout. This can be implemented in several ways. One method is to associate a trace function with each object type which calls a GC function for each reference in the object. Another alternative is to insert into each object a reference to another object that contains information about the layout of the object. We call this information GC info. The GC parses this information in order to find the references to traverse.

³ In our implementation, we use the root structure as a stack. As roots are typically local variables, their lifetimes depend on the scope they are declared in, and thus exhibit a stack-like behaviour. However, as the roots are kept in a list, it is possible to register and de-register roots in an arbitrary order.

    typedef struct gc_root {
        GC___REF(ObjectHead) ref;
        struct gc_root *next_root;
        struct gc_root *top_root;
        struct gc_root *next;   // next thread
        struct gc_root *prev;   // previous thread
    } gc_root;

Figure 7.2: The root layout for multi-threaded programs. The top_root, next and prev fields are not used for the actual root elements, so this memory doesn’t need to be allocated for the roots, but having the same struct for both list heads and list nodes simplifies the traversal code.

[Figure 7.3: diagram of the root stacks. Labels: root_head; list heads static_roots, thread_1, ..., thread_n; fields ref, next_root, top_root, next, prev.]

Figure 7.3: Data structure for roots. One root stack per thread and one for static (system) objects. For the thread root stack heads, ref == null.


Each object in our implementation contains a reference to a template object. The template object contains various runtime type information about the object including the GC info and object size. The layout of the GC info is a zero-terminated array [R0, D0, R1, D1, . . . , RN, DN, 0], where Ri is the number of references and Di the number of data bytes, as is shown in Figure 7.4. In addition, we use special escape codes to indicate that the size of a variable size array of either references or data is stored in the object itself. This makes it possible to use the same GC info object for arrays that only differ in length.

By using an object layout convention which says that all objects should start with a sequence of all references followed by all data fields of the object, the total size of the GC info can be reduced to three integers. More generally, the size of the GC info array is limited to 2N+1, where N is the depth of the inheritance hierarchy of that object.

[Figure 7.4: diagram. An object with the fields ref1, ref2, ref3, long1, ref4 and its template, whose gc_info is [3, 8, 1, 0]; further labels: size, gc_fields.]

Figure 7.4: Object layout example: The object consists of three references followed by eight bytes of data, and finally one more reference.
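How a collector might parse such a GC info array can be sketched as follows. This is an illustration under simplifying assumptions, not the actual implementation: every run length is assumed to be nonzero, so that the first 0 ends the list (which covers the array [3, 8, 1, 0] of Figure 7.4); the escape codes for variable-size arrays are left out; the GC info is passed in directly instead of being fetched through the template object; and the example struct is assumed to be laid out without padding. All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Callback invoked for each reference found in an object. */
typedef void (*ref_visitor)(void *ref, void *ctx);

/* Walks the fields of an object, guided by a GC info array of alternating
 * reference counts and data byte counts, and calls visit() for each
 * reference. Returns the number of references visited. */
size_t visit_references(const void *fields, const unsigned *gc_info,
                        ref_visitor visit, void *ctx)
{
    const unsigned char *p = fields;
    size_t found = 0;
    size_t i = 0;

    while (gc_info[i] != 0) {
        /* A run of gc_info[i] references. */
        for (unsigned r = 0; r < gc_info[i]; r++) {
            void *ref;
            memcpy(&ref, p, sizeof ref);   /* read the reference field */
            visit(ref, ctx);
            p += sizeof ref;
            found++;
        }
        i++;
        if (gc_info[i] == 0)
            break;                         /* list ends after a reference run */
        p += gc_info[i];                   /* skip a run of data bytes */
        i++;
    }
    return found;
}

/* Example visitor: counts the non-null references. */
void count_non_null(void *ref, void *ctx)
{
    if (ref != NULL)
        (*(size_t *)ctx)++;
}

/* Fields matching Figure 7.4: three references, eight data bytes and one
 * more reference. */
typedef struct {
    void *ref1, *ref2, *ref3;
    unsigned char long1[8];
    void *ref4;
} example_fields;
```

With the convention that all references precede all data fields, the loop body would run exactly once, which corresponds to the three-integer GC info mentioned above.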

Ensuring safety

A problem for garbage collectors in uncooperative environments is that compiler optimizations may make it harder to find roots. For instance, if a reference variable is allocated to a register and never stored on the stack, it will not be found by the garbage collector. A conservative collector that relies on heuristics and assumptions on the stack frame layout is vulnerable to this type of problem, but this is not the case in the presented approach.

As our GC uses its own auxiliary structures to find roots on the stack, the only requirement is that all reference variables are stored in memory when the GC runs. This can be expressed in standard C by taking the address of any local reference variable (the &var construction). Then, the compiler must allocate that variable on the stack (as it must have a memory address) and ensure that it is written back to memory before function calls. As we have explicit instructions for linking local roots into our root structure, this requirement is fulfilled. Depending on how the read barrier is constructed, reference variables may also need to be declared volatile to ensure that they are read from memory, as the GC thread may have moved objects.

7.2.2 Uncooperative scheduler

With a fully concurrent GC, and without control over the thread scheduler, we must ensure that a context switch does not cause a process to be left in an inconsistent state. For example, we must prevent a thread from being preempted in the middle of the execution of a read or write barrier. A context switch may occur at any time, which means that the mutator and collector must have mutually exclusive access to the heap in order to prevent both that a process is preempted during reference operations and that a reference or an object is read when the heap is in an inconsistent state due to GC.

It should be noted that the need for synchronization between a concurrent GC and mutator threads described here is not a result of (ahead-of-time) compilation; the same issues arise in e.g. a JVM using native threads. An uncooperative compiler complicates things further as it imposes restrictions on the implementation, but the fundamental problem is that certain memory accesses and reference operations must be atomic to the mutator and collector even if they are not atomic from the scheduler’s point of view.

Locking the heap can be accomplished in various ways depending on the current platform and operating system. A method that might be advantageous on some platforms is to disable interrupts when exclusive access is required in order to prevent context switches. In many situations disabling interrupts is not feasible; for instance the operating system might prevent programs from doing so, or it might interfere with other parts of the system. In such cases, the synchronization mechanisms provided by the operating system must be used, e.g. a semaphore or a monitor.
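As an illustration of the second alternative, the sketch below protects a reference read with a POSIX mutex. The names `heap_lock`, `heap_unlock` and `read_ref_locked` are hypothetical, and the use of a plain pthread mutex is an assumption made for this example; a real-time system would also have to consider the priority inheritance properties of the lock, which are not addressed here.

```c
#include <pthread.h>

/* One global lock guarding all heap and reference operations
 * (an assumption of this sketch). */
static pthread_mutex_t heap_mutex = PTHREAD_MUTEX_INITIALIZER;

void heap_lock(void)
{
    pthread_mutex_lock(&heap_mutex);
}

void heap_unlock(void)
{
    pthread_mutex_unlock(&heap_mutex);
}

/* Reads a reference slot so that the read is atomic with respect to a
 * collector that takes the same lock. The volatile qualifier forces the
 * value to be fetched from memory rather than from a register. */
void *read_ref_locked(void *volatile *slot)
{
    void *ref;
    heap_lock();
    ref = *slot;
    heap_unlock();
    return ref;
}
```

Taking and releasing a lock around every reference operation is exactly the kind of frequent locking that can become a performance bottleneck, which motivates keeping the critical sections as short as possible.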

7.3 Garbage collector interface

Different GC algorithms require different interaction with the applica-tion. For instance, a compacting or copying collector requires a read-barrier, as objects may move, where a non-moving mark-sweep collec-tor only requires a write barrier. These differences makes it error-prone

7.3 GARBAGE COLLECTOR INTERFACE 121

and troublesome to write code generators supporting more than just onetype of GC algorithm, and it gets even worse considering hand-writtencode, which would need a major rewrite for each supported GC type.

In order to separate, and hide, the GC implementation from the ap-plication, we have specified a garbage collector interface (GCI) that pro-vides heap access primitives to the application [IBE+02]. This includesproviding the necessary synchronization for reference and heap opera-tions. The GCI is used both in the Infinitesimal Virtual Machine and inthe LJRT compiler, and with both concurrent and stop-the world ver-sions of non-moving, compacting and copying collectors. By using theinterface, no changes to the compiler or VM is required, when a new GCis added.

As the efficiency of a garbage collection algorithm is highly dependent on the behaviour of the application, the choice of GC algorithm is part of the configuration and tuning of a system. The separation provided by the GCI means that the intermediate C code generated by the LJRT compiler doesn't have to be re-generated in order to change GC; only C compilation is required, which makes experimenting with different GCs quicker and easier.

The varying requirements of different GCs, both on the set of memory access primitives and on run-time aspects, cause the interface to contain a fairly large number of operations, which makes it less than ideal for manually written code. However, as the main intended use for the GCI is generated code (especially from our Java to C translator) or low-level routines in a virtual machine, this is no major concern. As always, there is a trade-off between keeping the interface small and limiting the power of expression as little as possible. Manually writing code that accesses the heap through the GCI is, however, also quite doable.

The interface consists of primitives for initialization, object layout declaration, reference variable declaration, object allocation, reference access, field access, and function declaration and call, which adds up to 50 primitives. The GCI is implemented as C macros. As an example of how the interface looks, we take the field reference operation: reading a reference field from an object is done through the GC_GET_REF macro. Figure 7.5 shows how an expression of the type t = a.b.c, on an object a, must be split up to fit the GCI. Note that a temporary variable (tmp) is used, and how it is pushed onto and popped from the root stack. The first GC_GET_REF macro expands to executing the read barrier on a (i.e., finding a pointer to the actual object), assigning a.b to tmp and executing the write barrier.


GC_REF(Type, tmp);        // Type tmp;
GC_PUSH_ROOT(tmp);
GC_GET_REF(tmp, a, b);    // tmp = a.b;
GC_GET_REF(t, tmp, c);    // t = tmp.c;
GC_POP_ROOT(tmp);

Figure 7.5: Example of field access through the GCI.
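To make the role of each primitive in Figure 7.5 more concrete, the following is a toy sketch of what such macros could expand to for a non-moving collector. The TOY_ macros, the root stack layout and the counting-only write barrier are all assumptions made for illustration; they are not the GCI definitions, and no synchronization is included.

```c
#include <stddef.h>

typedef struct Obj { struct Obj *b; struct Obj *c; } Obj;

#define ROOT_STACK_MAX 64
static Obj **root_stack[ROOT_STACK_MAX];   /* addresses of rooted variables */
static size_t root_top = 0;
static unsigned barrier_hits = 0;          /* counts write barrier executions */

/* Hypothetical stand-ins for GC_REF, GC_PUSH_ROOT, GC_POP_ROOT, GC_GET_REF. */
#define TOY_REF(Type, v)   Type *v = NULL
#define TOY_PUSH_ROOT(v)   do { root_stack[root_top++] = (Obj **)&(v); } while (0)
#define TOY_POP_ROOT(v)    do { (void)(v); root_top--; } while (0)
/* dst = src.field, followed by a (counting-only) write barrier. */
#define TOY_GET_REF(dst, src, field) \
    do { (dst) = (src)->field; barrier_hits++; } while (0)

/* t = a.b.c, split into simple field accesses as in Figure 7.5. */
static Obj *load_a_b_c(Obj *a) {
    TOY_REF(Obj, t);
    TOY_REF(Obj, tmp);
    TOY_PUSH_ROOT(tmp);
    TOY_GET_REF(tmp, a, b);    /* tmp = a.b; */
    TOY_GET_REF(t, tmp, c);    /* t = tmp.c; */
    TOY_POP_ROOT(tmp);
    return t;
}
```

A moving collector would additionally execute a read barrier on src inside TOY_GET_REF before the field is read.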

7.4 Performance issues

This section discusses the run-time overhead incurred by the presented approach, and Section 7.5 presents some optimizations that reduce that overhead. The problems described here are to a large part due to the fact that we have conflicting requirements on our design. As stated earlier, the design criteria behind our system are that it should

• cause very low latency

• not require compiler (back end) cooperation

• not require scheduler cooperation

• have low execution time overhead

A combination of any three is fairly easy. The problem is achieving all these properties at the same time. If we add the requirement that the implementation should have a simple interface and high flexibility, it gets even more difficult.

As we cannot rely on cooperation from the scheduler, the application⁴ code must provide the required synchronization. In order to keep the latency low, tight synchronization with very small critical sections is required, and our experiments show that the dominating part of the overhead is introduced by frequent locking, so the discussion will focus on that. Furthermore, the requirement on simplicity and flexibility must also be taken into account, which means that we cannot rely on manual tweaking in order to meet the real-time requirements. However, even if a very large engineering effort can be put into manual tuning, there will still be an overhead due to the synchronization, compared to a traditional non-real-time GC.

It is stressed that the discussion in this chapter is based on the above assumptions, and that the performance problems described are caused by the high degree of synchronization necessary when we require very low latency and are constrained by an uncooperative environment (in particular the scheduler) and/or unknown external native code. If, on the other hand, the scheduler and application allow it, preemption points could be used; if preemption points are placed in a way that ensures that preemption may only occur when the heap is in a consistent state, no additional synchronization would be required. This is a commonly used technique, but as it is not suitable for our applications, it is outside the scope of this discussion.

⁴ From the operating system's point of view; application includes the GC.

7.4.1 Too frequent locking

As stated, in order to achieve short latency times, we need to make the atomic GC operations as short as possible. As the heap must be locked during GC operation (to provide mutual exclusion w.r.t. the mutator), this requires frequent locking and unlocking. Even though each lock/unlock operation is really cheap, the number of lock/unlock operations in the straightforward implementation proved to be a serious performance bottleneck. For the test application presented in the experiments, the straightforward implementation performed thousands of lock/unlock operations per sample. This was the major limiting factor on the possible sample rate for the controller.

The assignment statement t = a.b.c of the example in Figure 7.5 illustrates the problem with locks: if we expand the GC macros to show the locking instructions, we get the code in Figure 7.6. In this example, there are three pairs of gc_unlock(); gc_lock(); instructions, due to the fact that many atomic operations are executed in sequence, and that each operation has to contain the proper synchronization. This is obviously quite inefficient. If the lock and unlock instructions could be placed arbitrarily, the intermediate pairs could be removed in order to increase efficiency, resulting in the code shown in Figure 7.7, which reduces the locking overhead by 75% (one lock/unlock pair instead of four). Also, as no GC work may occur while the heap is locked, the GC_PUSH_ROOT() and GC_POP_ROOT() operations may also be removed, leaving us with just the code in Figure 7.8, which is much more efficient and has almost as small a critical section as each of the original primitives. This is, of course, a very simple example. In a real program the lock instructions could be placed at arbitrary intervals, which allows latency to be traded off against throughput. Nonetheless, the example illustrates the problem of large synchronization overhead due to small atomic operations.

The optimization described in the above example, however, requires both that the locking instructions are accessible in the interface and either tedious manual placement of locking instructions or tool support (in the compiler and/or in the run-time system) for automatically inserting lock/unlock instructions at suitable (for a certain desired latency) intervals.

GC_REF(Type, tmp);
gc_lock(); GC__IMPL_PUSH_ROOT(tmp); gc_unlock();
gc_lock(); GC__IMPL_GET_REF(tmp, a, b); gc_unlock();
gc_lock(); GC__IMPL_GET_REF(t, tmp, c); gc_unlock();
gc_lock(); GC__IMPL_POP_ROOT(tmp); gc_unlock();

Figure 7.6: Example showing the expanded macros, revealing the lock instructions enclosing the implementation-layer macros.

GC_REF(Type, tmp);
gc_lock();
GC__IMPL_PUSH_ROOT(tmp);
GC__IMPL_GET_REF(tmp, a, b);
GC__IMPL_GET_REF(t, tmp, c);
GC__IMPL_POP_ROOT(tmp);
gc_unlock();

Figure 7.7: Example with bigger critical section

GC_REF(Type, tmp);
gc_lock();
GC__IMPL_GET_REF(tmp, a, b);
GC__IMPL_GET_REF(t, tmp, c);
gc_unlock();

Figure 7.8: Example with root operations removed
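The difference between Figures 7.6 and 7.8 can be made measurable with a small counting sketch. The lock functions below only count invocations and the heap operations are plain pointer reads; all names (lock_ops, get_fine, get_coarse, Node) are hypothetical and chosen for this illustration.

```c
static unsigned lock_ops = 0;     /* number of lock/unlock pairs executed */
static int heap_locked = 0;

static void gc_lock(void)   { heap_locked = 1; lock_ops++; }
static void gc_unlock(void) { heap_locked = 0; }

typedef struct Node { struct Node *b, *c; } Node;

/* Figure 7.6 style: every primitive carries its own critical section. */
static Node *get_fine(Node *a) {
    Node *tmp, *t;
    gc_lock(); /* push root tmp */ gc_unlock();
    gc_lock(); tmp = a->b;         gc_unlock();
    gc_lock(); t = tmp->c;         gc_unlock();
    gc_lock(); /* pop root tmp */  gc_unlock();
    return t;
}

/* Figure 7.8 style: one critical section, root operations removed. */
static Node *get_coarse(Node *a) {
    Node *tmp, *t;
    gc_lock();
    tmp = a->b;
    t = tmp->c;
    gc_unlock();
    return t;
}
```

For one evaluation of t = a.b.c, get_fine increments lock_ops four times and get_coarse once, which corresponds to the 75% reduction discussed in the text.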

The heap synchronization is not trivial when really fine-grained incrementality is desired, and having explicit locking instructions increases the risk of concurrency errors, as the responsibility would be moved from the GC to the application code. It would also contradict the design goal of the GCI, that the details of the different GC implementations (in this case, the synchronization) should be hidden from the user.

As one of the intentions behind the GCI is to make it possible to switch GC algorithms without changing the application code, explicit locking instructions are problematic, as the level of synchronization required depends on the GC algorithm and implementation. For example, a copying or compacting collector needs locking at both read and write barriers, while a non-moving mark-sweep collector only has a write barrier. Also, depending on how root operations and function calls are implemented, the required synchronization varies. This could be solved by having a number of different locking instructions for different operations (for instance, gc_lock_READ(), gc_lock_WRITE(), gc_lock_ROOT(), gc_lock_CALL(), gc_lock_RETURN(), etc.), which would increase the complexity of the interface, and the risk of programming errors, significantly.

It would be possible to add a gc-lock optimization pass to the LJRT compiler by, e.g., performing analysis similar to the PMH placement in [ACM+03]. However, explicitly placing lock instructions in the application code has the drawback that their placement must depend on the intended target platform and desired timing properties. That is, the programmer, or tool, must know that, for instance, a piece of code must have critical sections that are shorter than 10 µs on a particular computer. Thus, the analysis may need to be drastically different if we compile for a small 8-bit 8 MHz micro-controller or a 2 GHz machine.

From our point of view, flexibility and portability are very important, and therefore we believe that the application code should be as generic as possible, and that low-level decisions regarding the real-time behaviour and scheduling should be left to the run-time system. This is motivated not only by portability but also by the fact that much more information is available at run-time than statically at compile-time.

7.4.2 A read barrier requires locking

Copying and compacting garbage collection algorithms have, among other things, the advantages that fragmentation is avoided and that allocation is a constant-time operation. This, however, comes at the cost that a read barrier is required, and this is potentially expensive. In our implementations, the extra cost of the read barrier itself compared to a non-moving GC is just an extra pointer dereference for each reference access. In the context of concurrent GC, the main cost of the read barrier is that it requires synchronization to prevent the collector from moving an object while it is being accessed by the mutator.
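The "extra pointer dereference" can be illustrated with a handle-based read barrier, sketched below under the assumption that every reference goes through an entry in an indirection table. The types and function names (Handle, Payload, read_barrier, move_object) are hypothetical, and the synchronization that a concurrent collector would need around both functions is deliberately left out.

```c
#include <stddef.h>

typedef struct { int value; } Payload;

/* A reference is an entry in an indirection table that points to the
 * current location of the object. */
typedef struct { Payload *actual; } Handle;

/* Read barrier: one extra dereference to find the actual object. */
static Payload *read_barrier(const Handle *h) { return h->actual; }

/* What a moving collector does: copy the object and update the handle.
 * In a concurrent setting this must not overlap with a mutator access. */
static void move_object(Handle *h, Payload *to) {
    *to = *h->actual;
    h->actual = to;
}

/* Mutator-side field read through the read barrier. */
static int read_value(const Handle *h) { return read_barrier(h)->value; }
```

Because the mutator never caches the raw pointer across operations, a read after move_object finds the object at its new location.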

The total synchronization overhead incurred by the read barrier is determined by two factors: the cost of each lock/unlock operation, and the number of accesses. Typically, variables are read much more often than they are written (or at least as often, for most programs), and thus any overhead associated with the read barrier has a higher (or at least as high) impact on overall performance than the write barrier overhead. This means that, for any application where reads are at least as frequent as writes, the locking overhead will be at least twice as high when using a moving collector compared to a non-moving one.
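The factor of two can be written out as a short calculation. The symbols are introduced here for illustration and are assumptions: $c$ is the cost of one lock/unlock pair, $N_r$ and $N_w$ are the numbers of reference reads and writes, and every barrier execution is assumed to need exactly one lock/unlock pair.

```latex
\begin{align*}
  O_{\text{non-moving}} &= c \, N_w
    && \text{(write barrier only)} \\
  O_{\text{moving}}     &= c \, (N_r + N_w)
    && \text{(read and write barriers)} \\
  \frac{O_{\text{moving}}}{O_{\text{non-moving}}}
                        &= 1 + \frac{N_r}{N_w} \;\ge\; 2
    && \text{whenever } N_r \ge N_w .
\end{align*}
```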

7.4.3 Locking at method calls

Passing references as parameters to a function may need to be protected. For instance, if a reference can exist only as a parameter (e.g., as in foo(new Bar())), the GC must not run until the parameter has been rooted (registered as a root) in the called context.

For functions returning references, it is possible (or likely) that the object to be returned has been allocated in the called function. Therefore, when the function returns, the only reference to that object is the return value, which may have been referenced only by a local variable. This means that the return value must be protected, both to make sure that the object is retained and scanned, and to prevent the GC from moving the object until a proper reference has been rooted and the write barrier executed in the calling context.

7.4.4 Effects on optimization

Another important aspect of synchronization, which is not addressed here, is the interaction with an optimizing back end. For instance, it can make a big difference if the lock/unlock instructions can be inlined by the compiler or if they have to be function calls. Specifically, if the synchronization instructions break up basic blocks, this would severely limit an optimizing compiler's options.

7.5 Reducing the overhead

This section outlines some observations that can be used to drastically reduce the overhead associated with heap locking. It is shown how the cost of function calls and root operations can be significantly reduced and how the overhead can be almost completely eliminated for highest priority threads.

7.5.1 Reducing the need for synchronization

The level of required synchronization is affected both by the choice of GC algorithm (e.g., if a read barrier is required or not) and by different implementation decisions in the compiler and run-time system. This section gives examples of how those issues can be addressed in the compiler and in the run-time system, respectively.

Root alias analysis In a typical object-oriented program, a large part of the local variables will be of reference types, and thus there will be many root references. A special case is the temporaries used in LJRT programs. As stated, the GCI requires complex constructs, such as foo = a.b.c, to be split up into simple attribute accesses as shown in Figure 7.5. This means that a lot of roots have to be pushed onto and popped from the root stack, causing a significant execution time overhead, primarily from the required synchronization.

It can, however, be observed that in order to ensure correct GC behavior, it is enough that each live object is reachable from one root⁵. This means that the number of necessary root operations, and thereby the overhead, can be reduced; if it can be statically determined that a variable will only reference objects that are also referenced by another variable with a longer lifetime, the “inner” variable does not have to be registered as a root. We call this root alias analysis, and the compile-time analysis is trivial, as we do whole-program compilation. With this optimization, the push and pop operations in Figure 7.5 would be removed, which means that there will be no additional overhead from having the temporary variables explicitly in the code. In a typical Java program, the amount of “root duplication” is, in our experience, very high, as the associativity between objects tends to be high: between 50% and 70% of roots (including temporaries) were found to be statically redundant in our experiments. A large portion of the local variables that really need to be rooted are temporary references required to keep a newly allocated object live before its constructor has completed. This is needed to keep latency low; as the constructor can be of arbitrary length, it cannot be treated as atomic.

⁵ This does not hold for moving collectors that use forwarding pointers in the objects, as the roots are used for updating pointers as well as for finding live objects; for this optimization to work, the read barrier must be implemented using an indirect table outside the object.

As an example of how the root alias analysis works, we take the code fragment in Figure 7.9. There, f and b will (or may) reference objects that are allocated in the context of main, so these variables must be registered as roots, as they are the only references to the new objects. On the other hand, in proc, we know that the parameters have been registered as roots in the calling scope. Local analysis in proc can statically determine that t1 and t2 only reference objects that are reachable from (the attributes of) the parameters, and therefore it is not necessary to register these variables as roots. In contrast, local analysis alone cannot tell whether b1 is an alias for something already rooted or not. By analyzing the method Bar.x(), it is seen that x only returns an object reachable from an attribute. Therefore, b1 does not need to be registered as a root.

void main() {
    Foo f; Bar b;
    ...
    f = new Foo();
    b = new Bar();
    ...
    proc(f,b);
}
void proc(Foo foo, Bar bar) {
    Test t1, t2; Bar b1;
    ...
    t1 = foo.test1;
    t2 = foo.test2;
    b1 = bar.x();
    ...
}
class Foo {
    Test test1, test2;
    ...
}
class Bar {
    Bar b;
    ...
    public Bar x() { return b; }
}

Figure 7.9: Root alias example

If we are doing whole-program compilation, all calls to functions returning references can be analyzed and will finally boil down to either an attribute access (which doesn't require rooting) or an allocation (which does). In a separate compilation context, it is not generally possible to perform the whole-program root alias analysis, but the local analysis may still be used to get rid of unnecessary roots caused by temporary variables.
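The "boils down to an attribute access or an allocation" rule can be sketched as a small recursive classification. The C model below is a simplification made for illustration: the data structures (Expr, Func) and function names are hypothetical, only three expression kinds are modelled, and recursive call chains are assumed not to occur (a real analysis would need a visited set or a fixed-point iteration).

```c
#include <stddef.h>
#include <stdbool.h>

typedef enum { EXPR_ALLOC, EXPR_FIELD, EXPR_CALL } ExprKind;

struct Func;
typedef struct Expr {
    ExprKind kind;
    const struct Func *callee;   /* only used for EXPR_CALL */
} Expr;

typedef struct Func {
    bool has_body;               /* false for native methods */
    const Expr *returns;         /* expressions used in return statements */
    size_t n_returns;
} Func;

static bool expr_is_new_root(const Expr *e);

/* A function may return a new root if any of its return expressions may. */
static bool func_returns_new_root(const Func *f) {
    if (!f->has_body) return true;          /* no body: be conservative */
    for (size_t i = 0; i < f->n_returns; i++)
        if (expr_is_new_root(&f->returns[i])) return true;
    return false;
}

static bool expr_is_new_root(const Expr *e) {
    switch (e->kind) {
    case EXPR_ALLOC: return true;           /* new object: must be rooted */
    case EXPR_FIELD: return false;          /* reachable from its owner */
    case EXPR_CALL:  return func_returns_new_root(e->callee);
    }
    return true;                            /* unknown kind: be conservative */
}
```

In terms of Figure 7.9, a call to Bar.x() corresponds to an EXPR_CALL whose callee returns an EXPR_FIELD, so the receiving variable b1 is not classified as a new root.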

The implementation of the root alias analysis is quite simple, and the majority of the code is shown in Figure 7.10. This is code written for JastAdd II, an aspect-oriented compiler compiler tool [Ekm04, NIEH04], but it is basically Java code for evaluating the attributes of syntax tree nodes. For example, the first method describes the evaluation of the isNewRoot attribute in VariableDeclaration nodes, which will evaluate to true if there is any statement that may cause the variable to contain a unique root.

In the case of method overriding in subclasses, the analysis of whether a method call may return a new root must analyze all overriding implementations of the method which may be executed, which may yield a conservative result. For the sake of readability, that code has been left out from the figure.

boolean VariableDeclaration.isNewRoot() {
    boolean result = false; Stmt stmt = null;
    ASTNode scope = getSurroundingScope();
    foreach stmt in scope {
        result |= stmt.isNewRoot(this);
    }
    return result;
}
boolean ExprStmt.isNewRoot(VariableDeclaration varDecl) {
    if (getExpr() instanceof AssignSimpleExpr) {
        AssignSimpleExpr expr = (AssignSimpleExpr) getExpr();
        return expr.getDest().isUse(varDecl) &&
               expr.getSource().isNewRoot();
    }
    return false;
}
boolean MethodAccess.isNewRoot() { return decl().isNewRoot(); }
boolean VarAccess.isNewRoot() { return decl().isNewRoot(); }
boolean MethodDecl.isNewRoot() { return returnsNewRoot(); }
boolean InstanceExpr.isNewRoot() { return true; }

boolean Block.returnsNewRoot() {
    boolean result = false;
    for (int i=0; i<getNumStmt(); i++) {
        result |= getStmt(i).returnsNewRoot();
    }
    return result;
}
boolean ReturnStmt.returnsNewRoot() {
    boolean result = false;
    if (hasResult()) { result = getResult().isNewRoot(); }
    return result;
}
boolean MethodDecl.returnsNewRoot() {
    // Native methods do not have bodies, so let's be conservative
    boolean result = true;
    if (hasBlock()) { result = getBlock().returnsNewRoot(); }
    return result;
}

Figure 7.10: Root alias analysis in the front-end


Function calls For function calls, the level of locking required depends on how reference arguments are passed: as references or as actual pointers (i.e., if the read barrier is executed in the caller or in the callee). In our implementation, reference structures are stack allocated and thus will not be moved by the GC. Therefore, if references are passed by reference (i.e., a pointer to the reference structure is passed), no new roots are pushed in the callee and no heap locking is required. As the caller will always outlive the callee, if parameters to functions are known to be rooted in the calling context they don't have to be rooted again in the called context. Similarly, we know that the return value of a function will be used in the calling function (or not at all). Therefore, the variable that will receive the return value must already be rooted, so if we pass a reference to this variable to the called function, it can be assigned before the return, which removes the need to protect the return value. If function arguments and return values are handled in this way, no locking is required for function calls.
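The return-value convention can be sketched as follows. This is an illustrative model only: toy_alloc, make_obj, caller and the root_pushes counter are hypothetical names, the allocator is a trivial bump allocator over a static pool, and rooting is modelled by a counter rather than a real root stack.

```c
#include <stddef.h>

typedef struct Obj { int id; } Obj;

static Obj pool[8];                 /* trivial stand-in for the heap */
static size_t pool_used = 0;
static unsigned root_pushes = 0;    /* counts root registrations */

static Obj *toy_alloc(int id) {
    Obj *o = &pool[pool_used++];
    o->id = id;
    return o;
}

/* The callee stores its result directly in the caller's reference slot,
 * which is already rooted, so the return value never exists unprotected. */
static void make_obj(Obj **result, int id) {
    *result = toy_alloc(id);
}

static int caller(void) {
    Obj *r = NULL;
    root_pushes++;                  /* r is rooted once, in the caller */
    make_obj(&r, 42);               /* no root push or lock in the callee */
    return r->id;
}
```

The point of the design is that the slot receiving the result lives in the caller's frame, which outlives the call, so neither the callee nor the return path needs its own root or critical section.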

Root stacks in multi-threaded programs Another example of overhead caused by an uncooperative environment is the root stacks. In multi-threaded programs, each thread has its own root stack, and therefore, all root operations (i.e., push and pop) require a pointer to the root stack of the current thread. In a system where the thread scheduler is Java-aware, the root stack pointer is part of the execution context of each thread and is saved and restored automatically.

In systems which cannot rely on scheduler cooperation, this has to be handled in the application code. As the root operations are part of the application code, and the current thread is not known at compile time, this must be looked up at run time. Looking up the root stack at each root operation is quite inefficient, so this should be done once for each function call and cached. Similarly, if no root operations are done in a function (like in, e.g., a typical math function of the standard library), such lookup is unnecessary. Therefore, lookup of the thread root stack is done lazily at the first root operation of each function and the result is cached. This can be implemented quite efficiently.
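A minimal sketch of the lazy, per-function lookup follows. The names (RootStack, lookup_root_stack, push_root, pop_root) are hypothetical, and the thread-specific lookup is replaced by a function returning a single static stack while counting how often it is called.

```c
#include <stddef.h>

typedef struct { void *slots[32]; size_t top; } RootStack;

static RootStack the_stack;
static unsigned lookups = 0;

/* Stand-in for a (comparatively expensive) per-thread lookup. */
static RootStack *lookup_root_stack(void) {
    lookups++;
    return &the_stack;
}

/* Root operations take the function-local cache and fill it on first use. */
static void push_root(RootStack **cache, void *ref) {
    if (*cache == NULL) *cache = lookup_root_stack();
    (*cache)->slots[(*cache)->top++] = ref;
}

static void pop_root(RootStack **cache) {
    if (*cache == NULL) *cache = lookup_root_stack();
    (*cache)->top--;
}

/* A function with root operations performs exactly one lookup. */
static int with_roots(void) {
    RootStack *cache = NULL;       /* per-invocation cache */
    int a = 1, b = 2;
    push_root(&cache, &a);
    push_root(&cache, &b);
    pop_root(&cache);
    pop_root(&cache);
    return a + b;
}

/* A function without root operations performs no lookup at all. */
static int pure_math(int x) { return x * x; }
```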

Highest priority threads If a thread is known to have the highest priority, it will never be preempted by another thread during its execution. Therefore, it is enough to lock the heap (or rather, ensure that no other thread has locked the heap) each time such a thread starts executing. For a periodic thread, this could be implemented by placing a gc_lock(); at the start and a gc_unlock(); at the end of each sample. This almost completely removes the locking overhead for the (set of) highest priority thread(s) without affecting the real-time behaviour of the application⁶.

Furthermore, it enables much more aggressive optimizations to be applied to the code of HP threads, as it is known that no GC can occur during execution and the heap only needs to be in a consistent state when the HP thread stops executing. This means that also a part of the read and write barrier calls can be removed, reducing the inlined overhead and, as another consequence, allowing more optimizations in the back end, at the machine code level.

This assumes independent threads, which is a reasonable assumption for the high priority threads in a control system. If a thread contains blocking calls (e.g., semaphore or monitor operations), the heap must be unlocked before each such call, or there will be a risk of deadlock.

7.5.2 Reducing the cost of synchronization

With fine-grained memory operations and heap-intensive applications, such as Java programs, the heap is almost always locked, so whenever preemption occurs, the probability that the heap is locked is high. Assume that a thread (T1) is executing and is in the middle of a memory operation. Then, a context switch occurs; the thread that is scheduled to run (T2) will probably try to lock the heap very soon after the context switch and be blocked. Then T1, which is holding the heap lock, is scheduled to run again until it releases the heap lock, allowing T2 to continue its execution. This means that there will be three context switches instead of one, increasing the execution time overhead due to such context switch chatter.

Keeping the latency caused by locking low is a requirement, so just increasing the size of the critical sections is not a viable solution. Therefore, we need a solution that allows very fine-grained preemption without the overhead of frequent unlocking and re-locking. We also need to make sure that context switches are not performed when the heap is locked.

This section will sketch three possible solutions based on turning off interrupts, preemption points, and a proposed technique, lazy locking, respectively.

Turning off interrupts The straightforward solution is to simply implement gc_lock() by turning off (clock) interrupts and gc_unlock() by turning them on again. On most architectures, interrupt requests that arrive when interrupts are masked are latched, so that when the interrupts are turned back on, any missed interrupt will be generated and the corresponding interrupt routine is executed. On such an architecture, this will give the desired semantics that if a time-slice ends, and preemption should take place, when the heap is locked, the context switch is delayed until the heap lock is released. Turning off interrupts may, however, not be allowed by the OS, or have negative effects on other parts of the system, e.g., interrupt-based drivers for peripherals.

⁶ Assuming that no preemption takes place between threads of the same priority, as is the common case in real-time systems. This is no restriction, as if the system is schedulable the ordering between threads of the same priority doesn't matter.

Preemption points By using a scheduler which only allows preemption at certain, pre-determined points, we can avoid frequent locking/unlocking. In fact, if the memory accesses are taken into account when placing preemption points so that preemption is only allowed when the heap is in a consistent state, no additional housekeeping or synchronization is needed in order to ensure correct GC operation.

Preemption points are problematic for two reasons. The first is that most standard real-time operating systems don't support them. The second is that calling external native code (that doesn't have preemption points) may cause priority inversion. An illustrative example is a background thread calling an external routine with a long execution time. As external code doesn't have preemption points, high priority threads may be delayed indefinitely. One solution is switching to “native” preemption when calling external code and then switching back to preemption points when executing known code. Drawbacks include a more complex scheduler implementation, and increased latency for external code due to the extra housekeeping required. The latter may not be acceptable if the external code is run in timing-critical parts of the application, e.g., if the external code is a controller generated from a Simulink diagram or low-level legacy code.

Lazy locking If turning off interrupts or using preemption points is not possible or desirable, an alternative strategy for reducing the locking overhead is based on the observation that, while the frequent locking and unlocking is required in order to achieve low latency, in the common case, the heap is unlocked, and then shortly re-locked by the same thread. Thus, most of the locking operations are really unnecessary and most unlock–lock pairs could be removed without changing the behavior of the program (other than reduced overhead). The problem is just determining which lock and unlock operations need to be performed. This could be done statically, but the analysis would be difficult and highly dependent on the low-level scheduling, control flow based on input data, etc. Therefore, a dynamic, on-line approach is preferable.

For example, consider a code sequence like the one in Figure 7.11. If we are executing in the marked region, and no clock interrupt has arrived (i.e., the thread will not yet be preempted), it is unnecessary to perform the unlocking and re-locking operations. Thus, if we could dynamically decide whether to perform the unlock/lock operations (in a way that is much cheaper than actually performing the locking), the overhead could be reduced. Then, when a clock interrupt occurs, the heap should really be unlocked at the next unlock instruction and the context switch performed.

    gc_lock();
    ...
--> gc_unlock();
--> gc_lock();
--> ...
--> gc_unlock();
--> gc_lock();
    ...
    gc_unlock();

Figure 7.11: Locking example: Small atomic operations cause frequent locking.

One way of implementing this is by having two versions of the operations: the actual lock/unlock operations (which are executed when the locking is required) and “NOP” versions that are used when unlocking and re-locking isn't necessary. Then, the run-time system ensures that the correct version is run at each time to both guarantee the correct semantics and achieve the best performance. In principle, an implementation of this scheme looks like the sketch in Figure 7.12. This method gives behavior similar to that of preemption points with regard to heap accesses, but without requiring additional housekeeping in order to allow external native code to be run with real-time guarantees.

In the sketched implementation, the reschedule function in the scheduler is modified to include the lazy locking related operations. If modifying the scheduler is not possible, or practically feasible, much of the benefit of lazy locking can still be obtained if the OS has a callback hook for a method to be called at context switches. In fact, this is the method used in our Linux/RTAI prototype, and it gives the same reduction of the number of locking operations, but does not address context switch chatter. That may, however, be a reasonable trade-off for not having to modify the scheduler.


void (*gc_lock)(void);
void (*gc_unlock)(void);

void gc_lock_real(void)
{
    lock(heap_mutex);
    gc_lock = f_nop;
    gc_unlock = f_nop;
}
void gc_unlock_real(void)
{
    unlock(heap_mutex);
    yield();
}
void f_nop(void) { return; }
void reschedule(void)
{
    if (is_locked(heap_mutex)) {
        gc_lock = gc_lock_real;
        gc_unlock = gc_unlock_real;
    } else {
        /* perform actual context switch */
    }
}

Figure 7.12: Lazy locking implementation sketch
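The effect of the scheme in Figure 7.12 can be explored with a single-threaded simulation. The sketch below follows the structure of the figure, but the mutex, the yield and the clock interrupt are all simulated with plain variables and counters, and run_mutator with its tick parameter is a hypothetical driver added for this illustration.

```c
#include <stdbool.h>

/* Simulated state; a real system would use an OS mutex and scheduler. */
static bool heap_locked = false;
static unsigned real_locks = 0, real_unlocks = 0, context_switches = 0;

static void gc_lock_real(void);
static void gc_unlock_real(void);
static void f_nop(void) { }

/* The application always calls through these pointers. */
static void (*gc_lock)(void) = gc_lock_real;
static void (*gc_unlock)(void) = gc_unlock_real;

static void gc_lock_real(void) {
    heap_locked = true;
    real_locks++;
    gc_lock = f_nop;        /* following unlock/lock pairs become no-ops */
    gc_unlock = f_nop;      /* until a context switch is pending */
}

static void gc_unlock_real(void) {
    heap_locked = false;
    real_unlocks++;
    context_switches++;     /* stands in for yield() */
}

/* Called by the (simulated) clock interrupt. */
static void reschedule(void) {
    if (heap_locked) {
        gc_lock = gc_lock_real;       /* defer the switch to the next unlock */
        gc_unlock = gc_unlock_real;
    } else {
        context_switches++;           /* heap consistent: switch right away */
    }
}

/* Run `ops` small atomic heap operations with a tick every `tick_every`. */
static void run_mutator(unsigned ops, unsigned tick_every) {
    for (unsigned i = 1; i <= ops; i++) {
        gc_lock();
        /* ... one short atomic heap operation ... */
        if (i % tick_every == 0) reschedule();
        gc_unlock();
    }
    if (heap_locked) gc_unlock_real();  /* never finish with the heap locked */
}
```

With run_mutator(1000, 100), the simulation performs 10 real lock operations and 10 real unlock operations instead of 1000 of each, while every pending context switch is still served at the first unlock after the tick.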

There are, of course, many other small details that must be taken care of when implementing such a scheme; e.g., the system must ensure that the heap is always unlocked before a blocking call is made or before a thread dies; otherwise there is a risk of deadlock.

7.5.3 Compiler optimization effects

Another problem with locking is that the lock/unlock operations are function calls or inline assembler, which tend to break basic blocks and interfere with compiler optimizations. This is, partly, intentional, as many optimizations are not safe in the general case. For instance, we must make sure that pointers to objects (obtained through the read barrier) are always read from memory. Otherwise, objects may have been moved since the last access, and such race conditions will lead to memory corruption.

However, preventing such optimizations is really only needed when a context switch actually has taken place; as long as the same thread is executing, any optimization is legal, as long as the heap and all references in memory are consistent at the next context switch. Thus, performance could be improved significantly if it were possible to implement lazy locking in a way that the fast case did not break basic blocks. We believe that this could be done with self-modifying code, injecting the lock/unlock operations into the code where they are needed and modifying the lock/unlock instructions so that they ensure heap consistency. This, of course, requires detailed information about the inner workings of the optimizing back end and target architecture and cannot be done in a simple or portable way.

7.6 Summary

An implementation of a framework for accurate, concurrent, real-time garbage collection aimed at embedded systems was presented. It allows very low latency and works for automatically generated C code, a standard C compiler and a standard real-time operating system, and we have evaluated its performance in a robotics application. The results show that it is possible to use accurate garbage collection in an uncooperative environment for real-time applications which require latency times as low as a few microseconds.

However, due to the restrictions imposed by the uncooperative environment, especially the scheduler, explicit synchronization between mutator and collector is required, and this adds to the execution time overhead of memory operations. That means that we get a trade-off between performance (throughput) and predictability (latency): if we require low latency, the critical sections must be small, and that, in turn, requires more frequent synchronization. The synchronization overhead must also be taken into account when choosing GC algorithm; e.g., a copying or compacting GC requires a read barrier (which requires synchronization) and this may have a big impact on throughput, as reads are typically much more common than writes.

Further, optimizations, in both the Java compiler and in the GC implementation, aimed at reducing both the cost of and need for synchronization were presented. Experiments show that the overhead can be significantly reduced without affecting the worst-case latency.

It is concluded that concurrent GC is feasible for use in hard real-time systems, even in an uncooperative environment. The run-time overhead can be kept at a reasonable level, and that cost may in many cases be acceptable in order to get the safety and predictability of accurate GC also in hard real-time threads.

CHAPTER 8

EXPERIMENTS

This chapter presents experimental support for the proposed techniques, after a brief presentation of the applications and execution environments. Section 8.2 presents experiments with time-triggered GC and shows how different scheduling methods affect the scheduling of GC work. Auto-tuning of the GC cycle time is studied in Section 8.3, and Section 8.4 presents experiments illustrating the different approaches to GC execution time estimation. Section 8.5 shows how priorities for memory allocations can improve robustness and performance. Section 8.6 contains experiments with memory-aware feedback scheduling. The performance of the LJRT run-time system, illustrating an accurate GC in an uncooperative environment, is examined in Section 8.7.

8.1 Experiment platforms

The experimental verification has been carried out in two control applications: the ball-and-beam, a simple control process, and motion control of industrial robots. The execution platforms have been the IVM virtual machine [Ive03] and natively compiled Java using the LJRT platform.

The applications were chosen since we need benchmarks that are representative of the kind of systems that benefit from a low-latency GC, i.e. real-time control systems. Standard benchmark suites such as SPECjvm98 don’t fit very well in this context because of their batch-oriented character. In batch programs, incremental and concurrent GC just add overhead without yielding any benefit. Also, batch programs and embedded control systems typically have drastically different memory usage patterns; the former tend to build some data structure, do some computations on it, and then deallocate it, whereas the latter typically run “forever” in steady state.

Ball and beam process

As a test platform, a simple control system for a lab process which balances a ball on a beam was used. The angular velocity of the beam is controlled in order to roll the ball to a given position on the beam. A photo of the lab process is shown in Figure 8.1.

Figure 8.1: The ball-and-beam process. The beam can be rotated to roll the ball to the desired position. Sensors measure the position of the ball and the angle of the beam.

The control was performed by a Java application consisting of three threads: a user interface, a reference generator, and a controller. In addition to doing the actual control, the controller thread sends log data back to the user interface thread as illustrated in Figure 8.2. The reference generator and controller are run at a much higher rate than the UI thread.

Figure 8.2: The ball-and-beam control application consists of three threads: user interface, reference generator and controller. The data communicated between the threads is indicated by the arrows.
[Diagram: blocks UI, RefGen, Control and Process; arrows labelled setpoint, reference and log data.]

The garbage collector used is an incremental mark-compact collector. The traces were collected by instrumenting the RT-kernel and the Java virtual machine, respectively, with logging calls at memory operations and context switches. Logging was done to a dedicated memory area and uploaded via a serial line after each experiment. The time-triggered and adaptive GC experiments were performed using compiled Java [NEN02] on a 350 MHz PowerPC and the memory allocation priority experiments were done using the IVM virtual machine [Ive03] on a STORK [AB91]/Linux platform.

Industrial robots

A recent master’s thesis project [Lin04] made a Java implementation of the low-level servo controller for an ABB IRB-2000 industrial robot (given a desired motor velocity for each of the six joints, suitable torque values and the corresponding AC motor currents are calculated). Position samples are received from, and control signals are sent to, the robot over a real-time network.

Also, a motion controller for an ABB IRB-6 was implemented. This is a standard PI controller. On the IRB-6, local I/O on the control computer is used, making the control code simpler as sampling, calculation and output of the control signal are all performed in the same thread. With the exception of the drivers for the analog and digital I/O in the target system, the complete applications were written in Java. The IRB-6 controller was developed as a case study on the multi-stage deployment method presented in Section 2.4.3.


TrueTime-based memory management simulator

The proposed techniques for adaptive GC scheduling have been tested in a simulated environment. The simulations were carried out using TrueTime, a Matlab and Simulink-based system for studying embedded control systems by co-simulating the timing properties of a real-time kernel and the continuous time dynamics of the process under control [HCA02]. On top of this, a simulator for a concurrent GC was implemented. The GC simulation is based on a generic heap model and a mark-sweep garbage collector. The heap model is driven by the mutator’s allocation of objects and pointer assignment, and the GC is used to determine the number of live and dead objects (the mark routine) or to reclaim memory (sweep).

Based on the numbers and sizes of live and dead objects found by the mark routine, the amount of GC work required to complete a GC cycle is computed with a hand-written GC work function. This allows simulation of different GC algorithms by simply changing the work function (currently, there are work functions for a mark-sweep, a mark-compact, and a copying collector). In each invocation of the GC task, the heap state is measured and the GC work function is evaluated, and when the execution time of the GC task is equal to or greater than the total work of that cycle, the cycle finishes. Thus, the simulation is fairly accurate, as the actions of the mutator affect the GC workload of the current cycle. That also means that the scheduling will affect the amount of floating garbage: if the GC gets much CPU time early in the cycle, and finishes early, fewer objects will have had time to die.

Figure 8.3: Screenshot of the TrueTime-based memory simulator.
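The work-function mechanism can be illustrated with a small sketch. The following Python fragment is not the simulator's code (which is built on TrueTime in Matlab/Simulink); the `HeapState` fields, the cost coefficients and the three function bodies are assumptions chosen only to show how swapping the work function changes the simulated collector.

```python
from dataclasses import dataclass


@dataclass
class HeapState:
    """Snapshot of the heap as found by the mark routine (hypothetical fields)."""
    live_objects: int
    live_bytes: int
    dead_objects: int
    dead_bytes: int


# Hypothetical cost coefficients, in seconds; not taken from the thesis.
MARK_PER_OBJECT = 2e-6
SWEEP_PER_OBJECT = 1e-6
MOVE_PER_BYTE = 5e-9


def mark_sweep_work(heap: HeapState) -> float:
    """Mark every live object, then sweep over all objects."""
    mark = MARK_PER_OBJECT * heap.live_objects
    sweep = SWEEP_PER_OBJECT * (heap.live_objects + heap.dead_objects)
    return mark + sweep


def mark_compact_work(heap: HeapState) -> float:
    """As mark-sweep, plus the cost of moving the live data."""
    return mark_sweep_work(heap) + MOVE_PER_BYTE * heap.live_bytes


def copying_work(heap: HeapState) -> float:
    """Only live objects are visited and copied; dead objects cost nothing."""
    return MARK_PER_OBJECT * heap.live_objects + MOVE_PER_BYTE * heap.live_bytes


def cycle_finished(executed_time: float, heap: HeapState, work_fn) -> bool:
    """The cycle ends once the GC task has run for at least the required work."""
    return executed_time >= work_fn(heap)
```

Because the heap state is measured again at every invocation of the GC task, the value returned by the work function can change during a cycle as the mutator allocates and drops objects.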

Figure 8.3 shows a screenshot of the memory simulator. The physical process and the control computer are Simulink blocks, and both the states of the process and different signals in the computer, such as the schedule, amount of free memory, GC cycle time and execution time, are available as Simulink signals.

The application used in the simulations consists of two threads, a controller for the ball-and-beam, and a disturbance thread that generates garbage by allocating objects, filling and releasing buffers. It also causes transients by switching between operating modes with different allocation rates (varying size of allocated objects) and live memory amounts (varying buffer sizes).

8.2 Time-triggered GC

This section illustrates the run-time behaviour of allocation-triggered and time-triggered garbage collection and shows the difference between traditionally scheduled incremental GC, where each increment is scheduled individually and the work is spread evenly across the GC cycle, and EDF-scheduled time-triggered GC. In the plots showing the thread scheduling, the threads are numbered as follows: idle (-2), GC (-1), main (0), controller (1), reference generator (2) and UI (3).

Figure 8.4 shows an execution trace of a run with allocation-triggered increments. In Figure 8.5 the same program is run with time-triggered GC with metric-scheduled increments, and Figure 8.6 shows the corresponding trace with time-triggered, EDF-scheduled garbage collection. At the macro level, the executions are almost the same; the memory traces are nearly identical and the mutator threads get to run when they should. The big difference is between the versions where the individual increments are scheduled separately, in order to spread the work evenly across the cycle, and the EDF-scheduled version. Figure 8.7 and Figure 8.8 show a close-up view of the thread graphs. Note that the allocation-driven garbage collector performs a much larger number of minuscule increments as it spreads the GC work more evenly across the GC cycle even though there is idle time in the schedule. The deadline-scheduled version, on the other hand, finishes as quickly as possible, which is shown by the longer GC invocation without any idle time.


If the application has a bursty allocation pattern, the difference between allocation- and time-triggered scheduling becomes more discernible. A simple experiment was performed where the low-frequency UI thread was modified to allocate a large number of objects at each invocation. Memory traces of this execution are shown in Figure 8.9 and Figure 8.10, and close-ups of the thread graph are shown in Figure 8.11 and Figure 8.12. In this case, both the memory trace and the scheduling are different.

The difference between allocation-triggered and time-triggered GC when it comes to handling bursty allocations is shown in the scheduling graphs. With the allocation-triggered GC, when the UI thread (number 3) has executed and made the large allocation, the following GC increment is much longer than the other increments. Notice that, by necessity, the cycle length of the time-triggered GC has been shortened in order to accommodate the higher allocation rate.
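The effect of a burst on increment length can be shown with a toy calculation. This is a minimal sketch under simplifying assumptions that are not from the experiments: GC work is counted in abstract work units, `work_per_byte` is a hypothetical tuning constant, and the time-triggered collector is modelled as spreading the same total work evenly over the increments of a cycle.

```python
def allocation_triggered(allocations, work_per_byte):
    """GC work done after each mutator invocation, proportional to what it allocated."""
    return [work_per_byte * a for a in allocations]


def evenly_spread(allocations, work_per_byte):
    """The same total GC work, divided evenly over the increments of the cycle."""
    total = work_per_byte * sum(allocations)
    n = len(allocations)
    return [total / n] * n


# Nine small allocations followed by one burst (bytes per invocation, made-up values).
allocations = [100] * 9 + [10000]
at_increments = allocation_triggered(allocations, work_per_byte=2)
tt_increments = evenly_spread(allocations, work_per_byte=2)

print("longest allocation-triggered increment:", max(at_increments))
print("longest evenly spread increment:", max(tt_increments))
```

Both strategies perform the same total work in this model; only the distribution differs, and the burst shows up as a single long increment in the allocation-triggered case.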

Figure 8.4: Memory trace and schedule for the ball-on-beam application using allocation-triggered GC.
[Plots: thread schedule (Thread) and free memory (×10^4) against Time.]


Figure 8.5: Time-triggered with individually scheduled increments.
[Plots: thread schedule (Thread) and free memory (×10^4) against Time.]

Figure 8.6: Time-triggered, EDF scheduled.
[Plots: thread schedule (Thread) and free memory (×10^4) against Time.]


Figure 8.7: Thread scheduling with the allocation-triggered GC. As the allocations performed during each thread period are small, the corresponding GC increment is also very short. The schedule of the time-triggered, metric-based scheduler is quite similar, as both schedulers spread the GC work evenly across the cycle and the constant allocation rate of the application makes it possible to tune the work metric used in the allocation-triggered GC.
[Plot: thread schedule (Thread) against Time.]

Figure 8.8: Thread plot with the EDF scheduled GC. When a GC cycle is started, the garbage collector uses all idle time in order to perform the work required to finish the GC cycle as quickly as possible and then remains idle until the start of the next cycle. Each increment is, however, still very short in order to avoid disturbing the application threads more than necessary. This can be seen at t = 10 s. Here, the GC thread is released just before the application threads. Thread number 2 preempts the GC, but since the GC has locked the heap, when thread 2 attempts a heap operation it is blocked until the GC finishes its current increment. Thread 2 was blocked for 0.4 milliseconds.
[Plot: thread schedule (Thread) against Time.]


Figure 8.9: Memory trace of an application with bursty allocations and allocation-triggered GC.
[Plot: free memory (×10^4) against Time.]

Figure 8.10: Memory trace of an application with bursty allocations and time-triggered GC.
[Plot: free memory (×10^4) against Time.]

Figure 8.11: Part of the thread graph corresponding to Figure 8.9. Note how a large allocation in thread 3 causes a long GC increment.
[Plot: thread schedule (Thread) against Time.]

Figure 8.12: Part of the thread graph corresponding to Figure 8.10. As GC work is not triggered by allocations, the GC work is spread evenly across the GC cycle, and long increments are avoided.
[Plot: thread schedule (Thread) against Time.]


8.3 GC cycle time auto-tuning

This section examines the adaptive GC cycle time estimates described in Section 4.2. Two sets of experiments are presented. The first uses an actual application executing on a real computer, and the second was done in a simulator.

The first set of experiments shows the ball-and-beam application running on the PowerPC/STORK platform, using EDF scheduling for threads. Figure 8.13 shows a memory trace of the system with the auto-tuner enabled. The fast threads run at 100 Hz. Figure 8.14 shows how the auto-tuner reacts to changes in allocation rate. At t = 10 s, the frequency of the high priority threads is increased from 20 to 100 Hz and at t = 20 s the frequency is lowered to 20 Hz. The GC is scheduled so that it will work even if all the dead objects in one cycle are floating garbage; i.e., we reserve a part of the available memory for the next GC cycle as expressed in Equation (4.3). Note the step in the T_GC graph near t = 2.5 s; no memory was freed during the first GC cycle, and therefore T_GC is halved.

As memory allocations typically are bursty, the measurement of the allocation rate is filtered in order to keep the deadline estimates more stable and reduce the update frequency for the scheduling parameters. Care must be taken not to underestimate the allocation rate, as this might lead to an out-of-memory situation, so we must react quickly to actual changes in allocation rate while avoiding chatter due to bursty allocations. The rise time in the allocation rate plots is due to such filtering.
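One way to meet both requirements, reacting quickly to increases while smoothing out bursts, is a low-pass filter with different gains for rising and falling samples. The sketch below is an assumption for illustration only: the text states that the measurement is low-pass filtered but does not give the filter, and the function name and the gains `alpha_up` and `alpha_down` are hypothetical.

```python
def filter_allocation_rate(samples, alpha_up=0.9, alpha_down=0.1):
    """First-order low-pass filter that follows increases faster than decreases.

    A large gain on rising samples keeps the estimate from lagging far below
    the real allocation rate; a small gain on falling samples avoids chatter.
    """
    estimate = samples[0]
    filtered = [estimate]
    for sample in samples[1:]:
        alpha = alpha_up if sample > estimate else alpha_down
        estimate = estimate + alpha * (sample - estimate)
        filtered.append(estimate)
    return filtered


# Made-up allocation rate samples in bytes/second: low, high, low.
rates = [1000.0] * 3 + [5000.0] * 3 + [1000.0] * 3
smoothed = filter_allocation_rate(rates)
```

With these gains the estimate covers most of a step increase in a single sample but decays slowly after the rate drops, which errs on the side of overestimating the allocation rate.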

The second set of experiments was run in the simulated environment. Figure 8.15 shows how the T_GC tuner responds to changing allocation rates. Figure 8.16 shows the same experiment with the steady-state ΔG compensation of Theorem 3. At the start, one of the threads runs in an allocation-intensive mode with a random allocation pattern. At t = 100, it changes to a steady-state mode with a lower allocation rate, and at t = 200 it changes back to the random mode. Note that the ΔG compensation reduces the amount of reserved memory when the mutator is in steady state, and how this reduces the variations in GC cycle times.


Figure 8.13: Memory trace of the system with adaptive GC cycle length. The topmost plot shows the amount of available memory (in bytes), the middle plot shows the estimated GC cycle length (in milliseconds) and the bottom plot shows the LP filtered allocation rate measurement (in bytes/second).
[Plots: free memory, GC cycle time and allocation rate against Time.]

Figure 8.14: How the GC scheduler reacts to changes in allocation rate. At t = 10 s, the frequency of the high priority threads is increased from 20 to 100 Hz and at t = 20 s the frequency is lowered to 20 Hz.
[Plots: free memory, GC cycle time and allocation rate against Time.]


Figure 8.15: Operation of the adaptive GC cycle length tuner. The topmost plot shows the amount of available memory (in bytes), the middle plot shows the GC cycle length (in seconds) and the bottom plot shows the allocation rate measurement (in bytes per second; the solid line is the actual samples and the dashed line shows the filtered measurement used for the T_GC calculation).
[Plots: free memory, GC cycle time and allocation rate against time/s.]

Figure 8.16: Operation of the adaptive GC cycle length tuner with steady-state ΔG compensation. Note how T_GC is held constant in spite of varying amounts of floating garbage in the steady-state phase.
[Plots: free memory, GC cycle time and allocation rate against time/s.]


8.4 GC work prediction

This section examines the performance of the different approaches to C_GC estimation. The same simulated setup as in the cycle time tuning experiments was used. In these experiments, the change from the random allocation mode to the steady-state mode was at t = 75 s. Note that, as a feedback scheduler was used, the period times of the mutator threads, and thus the allocation rates, differ. As the C_GC estimates are based on the history of the GC thread, at start-up, the system is run with default values for a number of GC cycles. Also, initial allocations performed by the run-time system and application threads affect the amount of live memory and are therefore included in the simulation. No effort has been made to handle start-up of the simulated system in a graceful way, and in order to allow the effects of such transients to die out, the plots start at t = 50 s.

Figure 8.17 shows the black-box approach, using the maximum value of C_GC of the last four GC cycles as the prediction. As no actual prediction is done, this method occasionally under-estimates C_GC.
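The black-box estimator is simple enough to sketch directly. The class below is a minimal illustration, not the simulator's implementation; the class and method names, and the `default` start-up value, are assumptions.

```python
from collections import deque


class MaxOfLastN:
    """Predict C_GC as the largest value observed in the last n GC cycles."""

    def __init__(self, n=4, default=0.0):
        self.history = deque(maxlen=n)  # old cycles fall out automatically
        self.default = default          # used until a first cycle has completed

    def predict(self):
        return max(self.history) if self.history else self.default

    def observe(self, c_gc):
        """Record the measured GC execution time of a finished cycle."""
        self.history.append(c_gc)


predictor = MaxOfLastN(n=4, default=1.0)
for measured in [0.5, 0.6, 0.4, 0.7]:  # made-up cycle times in seconds
    predictor.observe(measured)
```

If the next cycle takes longer than anything in the window (for instance 0.9 s against a prediction of 0.7 s), the prediction is too low for that cycle, which is the under-estimation noted above.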

Figure 8.17: A trace of the amount of free memory and the C_GC estimate using max of the last 4 cycles. In the C_GC plot, the solid line is the amount of CPU time spent on the current GC cycle, and the dashed line is the C_GC estimate.
[Plots: free memory (×10^5) and CPU time/s against time/s.]


Figure 8.18: C_GC estimate using P(Live) and P(Dead) according to Equation 4.33.
[Plots: free memory (×10^5) and CPU time/s against time/s.]

Figure 8.18 shows the clear-box prediction. Note how the C_GC estimate is increased in the cycle after the mode change, as the amount of memory on the heap is increased due to the ΔG compensation, but the estimates of P(Live) and P(Dead) are still at their old values (in this experiment, the forgetting factor was 0.9, meaning that old values decay by 10% each sample).
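The lag after a mode change follows from the forgetting factor. Equation 4.33 is not reproduced here; the sketch below assumes a plain exponentially weighted update, which is one common form consistent with the statement that old values decay by 10% each sample. The function name and the sample values are hypothetical.

```python
def update_estimate(old_estimate, sample, forgetting=0.9):
    """Exponentially weighted update: keep 90% of the old value, add 10% of the new."""
    return forgetting * old_estimate + (1.0 - forgetting) * sample


# Made-up example: the true fraction steps from 0.3 to 0.6 at a mode change.
estimate = 0.3
trace = []
for _ in range(22):
    estimate = update_estimate(estimate, 0.6)
    trace.append(estimate)
```

After n samples the remaining error is the initial error times 0.9^n, so with this update it takes about 22 samples before the estimate is within 10% of the size of the step.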

Figure 8.19 shows the conservative prediction. The fraction of live memory was about 0.3, and the over-estimation was about a factor of two. In the steady-state mode, the actual U_GC was about 0.1, meaning that the over-estimation caused 10% slack. In this experiment, the slack was not made available to the mutator, which can be seen in the visibly lower allocation rate (compared to the other two experiments), particularly during the allocation-intense phase at the beginning.

As discussed, using a GC algorithm where live objects account for the greater part of the GC work, combined with a low fraction of live objects, may cause large over-estimation of C_GC. This is illustrated in the example of Table 8.1, where the same application was run with different heap sizes. For this experiment, the application threads were scheduled with fixed period times (i.e., no feedback scheduling) in order to study the effect of the heap size on the C_GC prediction without having the prediction affect the scheduling. It should be noticed that even if the conservatism increases as the fraction of live memory decreases, the required GC utilization still decreases. Therefore, using a smaller heap to get better C_GC prediction is, in general, not a good idea. It can also be noted that the over-estimation in the experiment is less than the worst case conservatism. That can be explained by the fact that floating garbage makes the fraction of live objects, and hence the GC work, larger than the ideal best case. This effect is exaggerated by using the same, fixed, period times for the application threads: as the GC utilization decreases, the slack in the schedule increases, allowing the GC to finish earlier and thereby causing more floating garbage.

Figure 8.19: Conservative C_GC estimate using Equation 4.35. In this experiment, P(Live) ≈ 0.32 and α/β ≈ 2.6.
[Plots: free memory (×10^5) and CPU time/s against time/s.]

Heap size           250000    500000    2500000
C_GC (estimated)    1.5 s     2.3 s     10 s
C_GC (real)         1 s       1.3 s     3.9 s
T_GC                15 s      30 s      160 s
U_GC (estimated)    10%       7.7%      6.3%
U_GC (real)         6.7%      4.3%      2.4%

Table 8.1: Effects of heap size on the conservative C_GC estimation and GC overhead. While the degree of conservatism of the estimation increases as the heap size increases (i.e., as the fraction of live memory decreases), the total GC overhead (both estimated and real GC utilization) decreases.
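The relation between the rows of Table 8.1 can be checked with a few lines of arithmetic. The sketch below assumes that the utilization rows are U_GC = C_GC / T_GC and that the first C_GC row is the estimate and the second the measured value, which is how the caption reads; the function names are illustrative.

```python
# (heap size, estimated C_GC [s], real C_GC [s], T_GC [s]), values from Table 8.1
rows = [
    (250000, 1.5, 1.0, 15.0),
    (500000, 2.3, 1.3, 30.0),
    (2500000, 10.0, 3.9, 160.0),
]


def utilization(c_gc, t_gc):
    """Fraction of CPU time needed by the GC: U_GC = C_GC / T_GC."""
    return c_gc / t_gc


def overestimation(c_estimated, c_real):
    """Degree of conservatism: how many times too large the estimate is."""
    return c_estimated / c_real


for heap, c_est, c_real, t_gc in rows:
    print(
        f"heap {heap}: U_GC estimated {100 * utilization(c_est, t_gc):.2f}%, "
        f"real {100 * utilization(c_real, t_gc):.2f}%, "
        f"over-estimation factor {overestimation(c_est, c_real):.2f}"
    )
```

The over-estimation factor grows with the heap size while both utilizations fall, which is the trade-off described in the text.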


8.5 Priorities for memory allocation

It was claimed that introducing priorities for memory allocations, together with run-time system support for denying unimportant memory allocations if memory is scarce, can help increase both the robustness (by avoiding out-of-memory situations) and performance (by limiting the amount of garbage collection work) of real-time systems. This section presents experimental support for those claims. Experiments were run on the physical ball-and-beam process.

8.5.1 Avoiding out-of-memory situations

Two scenarios were encountered where non-critical memory allocations can help make sure that a change to a previously working system doesn’t risk breaking it: increasing the sampling rate of the controller and reducing the amount of memory available to the application.

When the sampling rate is increased, the controller both uses a larger part of the CPU time and allocates log data at a higher rate, until we get to a point where the user interface thread doesn’t get the CPU time needed for consuming all the log data and the application runs out of memory and fails. By making the log data allocations non-critical, this cannot happen and the control is not affected.

Reducing the available memory¹ will, obviously, at some point cause the application to fail. However, by making the allocation of log data non-critical, the minimum memory requirement for the application may be significantly reduced compared to the original version.

The following traces illustrate the first scenario. In these experiments, the periods of the reference generator and the controller were both 20 ms, and a log data object was about 60 bytes. Figure 8.20 shows a run of the ball-and-beam system without non-critical memory. The high allocation rate causes a large GC workload and the UI process is starved, eventually leading to failure.

In the first half of the run the controller (1) and reference generator (2) threads run unimpeded, and the control was OK until t = 90. After that, the frequent panic stop-the-world GC cycles caused such long delays that the controller dropped the ball. The CPU load is almost 100% and the idle thread (0) is not run except in the very beginning. The reason that the maximum amount of allocatable memory increases in the middle is that when the GC cycles get shorter there is less floating garbage.

¹ This could occur either by actually running the system on a smaller platform or, perhaps more likely, by adding more threads to the system.


Figure 8.20: A sample run of the ball-and-beam system without using memory priorities. The UI thread (3) doesn’t get enough CPU time to consume all plot data that is produced. After t = 75 it is totally starved by the GC. Then, less and less memory is available and more and more CPU time is spent doing panic GC.
[Plots: thread schedule (Thread) and free memory (×10^4) against Time / seconds.]

Figure 8.21 shows the same system where the allocation of log data has been made non-critical, and the log data allocation is kept at a sustainable level. In this experiment, more than half of the log data allocation requests were allowed. Figure 8.22 shows a close-up of Figure 8.21 where you can see the non-critical behaviour more clearly.
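The idea of denying non-critical requests when memory is scarce can be sketched as follows. This is a simplified model under stated assumptions, not the run-time system's allocator: the `Heap` class, the `critical` flag and the rule that a non-critical request is denied when it would take free memory below a fixed limit are all hypothetical (the experiments only mention a non-critical limit in the free memory plot).

```python
class Heap:
    """Toy heap that tracks free bytes and a limit for non-critical allocations."""

    def __init__(self, size, non_critical_limit):
        self.free = size
        self.non_critical_limit = non_critical_limit

    def allocate(self, nbytes, critical=True):
        """Return True if the allocation succeeds, False if it is denied.

        Critical requests fail only when memory is truly exhausted. Non-critical
        requests are denied as soon as they would push free memory below the limit.
        """
        if nbytes > self.free:
            if critical:
                raise MemoryError("out of memory")
            return False
        if not critical and self.free - nbytes < self.non_critical_limit:
            return False
        self.free -= nbytes
        return True


heap = Heap(size=1000, non_critical_limit=400)
log_ok = heap.allocate(500, critical=False)       # allowed: 500 bytes remain
log_denied = heap.allocate(200, critical=False)   # denied: would leave 300 bytes
control_ok = heap.allocate(200, critical=True)    # critical data is still served
```

A caller such as the logging code must be prepared for a denied request, for example by dropping that log sample.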

8.5.2 Improving performance

The experiments also indicate that it is possible to achieve better control performance by limiting the amount of non-critical memory allocations. The plots in Figure 8.23 show two runs of the ball-and-beam application without and with non-critical memory allocations enabled, respectively. The position of the ball is in the interval [−10, 10].

In the version without non-critical allocations, the high allocation rate occasionally forces the garbage collector to do a full garbage collection cycle in order to reclaim enough memory to satisfy the allocation needs. This delays the high priority controller process so that it misses its deadline which, in turn, degrades the control performance.

When the allocation of log data is made non-critical, the allocation is kept below the safe limit and the system runs as designed, with more consistent control performance.


Figure 8.21: A run of the ball-and-beam system with log-data allocations made non-critical. In the thread plot you see that the UI thread gets CPU time throughout the run. The third plot shows the amount of memory allocated by low priority processes during this cycle. The fourth plot shows if non-critical allocations succeed or not; high level means success and low level is deny.
[Plots against Time / seconds: Thread, Free memory (×10^4), LP alloced (×10^4), NC alloc.]

Figure 8.22: Close-up to show the non-critical memory behaviour. The dotted line in the free memory plot is the non-critical limit. Note how the GC cycles are shortened when low priority allocations are made.
[Plots against Time / seconds: Thread, Free memory (×10^4), LP alloced (×10^4), NC alloc.]


Figure 8.23: Plots showing the reference value and the measured position for the ball-and-beam process. Plot a shows the system without non-critical memory allocations and plot b shows the system where the allocation of plot data is non-critical. The irregular behaviour in a, around samples 2500, 4000, 6000, and 8500, is caused by the controller process being delayed by the garbage collector due to the program running out of allocatable memory and forcing a complete garbage collection cycle.
[Panels: a) log data objects are always allocated; b) allocation of log data is non-critical.]

8.6 Feedback scheduling

This section presents simulations illustrating the behaviour of the different approaches to memory-aware feedback scheduling. The application consists of the TrueTime version of the ball-and-beam controller and a disturbance task.

Three simulations are shown, with parameters and resulting control performance according to Table 8.2. Figures 8.24, 8.25, and 8.26 show the reference utilization for the mutator and the sampling period of the controller task, under both the separate and the integrated feedback scheduler. In all plots, the solid line represents the integrated scheduler, and the dashed line is the version with a separate GC tuner. The integrated scheduler used the simplified constraint (6.17). As the approximation introduces errors, for a better comparison, the resulting period times of the integrated scheduler were scaled to get a mutator utilization of exactly U_ref.
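The scaling step can be illustrated with a short sketch. It assumes the usual utilization definition U = Σ C_i / h_i, with C_i the execution time and h_i the period of task i, and that all periods are scaled by the same factor; the function names and the example values are hypothetical and not taken from the simulations.

```python
def utilization(exec_times, periods):
    """Total utilization of a task set: the sum of C_i / h_i."""
    return sum(c / h for c, h in zip(exec_times, periods))


def scale_periods(exec_times, periods, u_ref):
    """Scale every period by the same factor so that the utilization becomes u_ref."""
    factor = utilization(exec_times, periods) / u_ref
    return [h * factor for h in periods]


# Made-up task set: execution times and periods in seconds.
exec_times = [0.01, 0.02]
periods = [0.04, 0.1]            # utilization 0.25 + 0.20 = 0.45
scaled = scale_periods(exec_times, periods, u_ref=0.5)
```

Scaling by U / U_ref preserves the ratios between the periods, so the relative rates chosen by the scheduler are unchanged while the total utilization lands exactly on the reference.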

These experiments show similar performance for both the separate and integrated versions. That supports the claim that the most important difference is the fairness issue, as in the integrated version, the amount of allocation affects the period assignment. This is apparent in Figure 8.26, where the sampling period of the controller (h1) differs more between the two schedulers than in the other experiments, due to the bigger difference in memory usage. It can also be seen that as the GC utilization decreases, the variation in U_ref also decreases.

Heap size    a1     a2     Cost, separate    Cost, integrated
100000       300    640    474               468
200000       300    640    473               426
200000       300    64     312               395

Table 8.2: Parameters and control performance (cost, less is better) for the different experiments. a1 is the allocation per sample of the controller, and a2 of the disturbance task.

Figure 8.24: U_ref and h1, a1 = 300, a2 = 640, H = 100000
[Plots: U_ref and sampling interval/s against time/s.]


[Figure 8.25: plots of U_ref and sampling interval/s against time/s.]

[Figure 8.26: plots of U_ref and sampling interval/s against time/s.]



8.7 Performance evaluation

In order to study the overhead due to GC in LJRT applications, the IRB-2000 slave controller was used. The program was compiled to C using the LJRT compiler and to native code with gcc, and executed on a 350 MHz PowerPC G3 with 32 MB RAM running Linux/RTAI. Real-time performance and throughput (latency caused by memory operations and maximum possible sampling rate) were measured.

It should be noted that no attempt is made to compare the suitability of different GC algorithms for real-time systems; the implementations of the two collectors presented here are quite different, and should be considered as proof-of-concept prototypes, so a direct comparison is not meaningful. The aim of the experiments is to verify the feasibility of very fine-grained incremental garbage collection in an uncooperative environment.

8.7.1 Inlined overhead

This section studies how the choice of GC algorithm and the proposed optimizations affect the inlined overhead (e.g., read and write barriers, heap synchronization, GC housekeeping, etc.) and, consequently, throughput. Here, the inlined overhead is estimated by measuring the maximum possible sample rate.

The maximum sample rate measurements don’t include the actual time to perform GC work, but only the inlined overhead, as the amount of time spent on GC work depends heavily on both the GC implementations and the allocation pattern of the application. Another reason for not taking the actual GC work into account in the throughput measurements is that our GC scheduling model is based on scheduling GC work so that it doesn’t disturb the real-time tasks. Making sure that there is enough time for the GC to execute and meet its deadlines is a separate issue.

The sample rate measurements show that, in this application, the overhead of a moving collector is about twice that of a non-moving mark-sweep collector. The extra overhead comes from the additional locking for the read barrier. This is almost the best case, as many of the objects in this application are very short lived (they only live for one sample) and are only accessed once. Table 8.3 shows the number of locks per sample, the time spent in mutex operations and, based on this, the maximum possible sample rate if the actual computation took zero time.


Algorithm     locks/sample    time (µs)    max rate (Hz)
Non-moving    2980            715          1298
Moving        6088            1461         684

Table 8.3: Frequency of heap locking and maximum possible sample rates based on lock overhead using mutex locking and no optimization.
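Two quantities implied by Table 8.3 are worth computing: the cost of a single lock operation and the ratio between the two collectors. The sketch below uses only the locks/sample and time columns; the variable and function names are illustrative.

```python
# Values from Table 8.3: (locks per sample, time in mutex operations per sample in µs)
table_8_3 = {
    "non-moving": (2980, 715.0),
    "moving": (6088, 1461.0),
}


def cost_per_lock_us(locks, time_us):
    """Average time per lock operation, in microseconds."""
    return time_us / locks


per_lock = {name: cost_per_lock_us(*row) for name, row in table_8_3.items()}
overhead_ratio = table_8_3["moving"][1] / table_8_3["non-moving"][1]

for name, cost in per_lock.items():
    print(f"{name}: {cost:.3f} us per lock")
print(f"moving / non-moving lock time: {overhead_ratio:.2f}")
```

Both rows give about 0.24 µs per lock, so the difference between the collectors comes from the number of lock operations rather than their individual cost, and the ratio of roughly 2 matches the statement that the moving collector has about twice the overhead.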

Figure 8.27 shows how the choice of locking primitive and the root alias optimization affect total throughput, i.e., the maximum possible sample rate. As a baseline, the batch-copy collector without locking allowed 4317 Hz without the root alias optimization and 11038 Hz with it. As the batch-copy collector doesn’t have any read or write barriers, this also gives a rough indication of the performance that can be expected from applying the optimization of eliminating all GC synchronization for highest priority threads.²

The big difference between the mark-sweep and the mark-compact collector is caused by the extra synchronization required for the read barrier in the mark-compact case. With synchronization turned off,³ there is no big difference between a moving and a non-moving collector. In this example, the overhead of the read barrier is compensated by the cheaper allocation.⁴ For an application with a larger number of reads per write, the impact of the read barrier would be bigger.

8.7.2 Latency and jitter

In order to estimate how the GC synchronization affects thread latency, the gc_lock() and gc_unlock() instructions were instrumented to measure the time the heap was locked. The heap locking method used in this experiment was interrupt masking, which prevents context switches between gc_lock() and gc_unlock(). Thus, handling of any clock interrupt arriving when the heap is locked is delayed until gc_unlock() is executed. This delay is the added thread latency due to GC synchronization. Table 8.4 shows max, mean and median numbers for two GC algorithms. This gives an estimate of the worst case latency due to GC locking, which would occur if a high priority thread were released just after the heap was locked.
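The kind of instrumentation described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the class LockTimer and its methods are hypothetical stand-ins for instrumented gc_lock() and gc_unlock() calls, and the clock is injectable so that the sketch can be exercised with known values.

```python
import statistics
import time


class LockTimer:
    """Records how long each critical section is held (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock      # returns a timestamp in nanoseconds
        self._start = None
        self.held_ns = []

    def lock(self):
        # Stand-in for gc_lock(): remember when the heap became locked.
        self._start = self._clock()

    def unlock(self):
        # Stand-in for gc_unlock(): record the time the heap was locked.
        self.held_ns.append(self._clock() - self._start)
        self._start = None

    def summary_us(self):
        """Return (max, mean, median) of the locking times in microseconds."""
        us = [ns / 1000 for ns in self.held_ns]
        return max(us), statistics.mean(us), statistics.median(us)


# Usage with a fake clock; the timestamps are made up for illustration.
ticks = iter([0, 2000, 5000, 8000, 10000, 22000])
timer = LockTimer(clock=lambda: next(ticks))
for _ in range(3):
    timer.lock()
    timer.unlock()
print(timer.summary_us())
```

The maximum of the recorded values corresponds to the worst case added latency discussed above, since a clock interrupt arriving just after lock() is delayed for the whole locked interval.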

² This is a conservative estimate, as knowing that a HP thread will run uninterrupted also enables much more aggressive optimizations to be applied.

³ Of course, running without synchronization is not safe and may cause race conditions and memory corruption, so this is done for reference only and is not practically usable.

⁴ In the mark-compact collector, allocation is done by simply incrementing a pointer, whereas in the mark-sweep case, freelist search and block splitting is done.

[Plot omitted: sample rate / Hz (axis 0 to 7000) for configurations 1 to 8, with one series each for the mark-compact and mark-sweep collectors.]

Figure 8.27: The effect on throughput of different locking primitives and root optimization. The configurations are 1) Mutex locking, 2) Mutex locking with root alias optimization, 3) Lazy mutex locking, 4) Lazy mutex locking with root alias optimization, 5) Interrupt masking (cli/sti), 6) Interrupt masking with root alias optimization, 7) No locking and 8) No locking with root alias optimization.


The amount of added latency depends on how large the atomic operations are. Lower latency can be achieved by making the atomic operations smaller at the cost of more frequent locking and lower throughput. Keeping the locking time short is most challenging in the implementation of the GC, especially in a moving GC for which it is non-trivial to achieve shorter locking times than the time it takes to move one object. This is possible, however, either by detecting that moving an object failed [Hen98] or by limiting the size of individual heap objects [Sie02].

Algorithm      Max    Average   Median
Mark-sweep     12.3   2.7       2.6
Mark-compact   14.8   3.6       3.6

Table 8.4: Heap locking times (µs) for the two GCs.

A second experiment measured the actual latency of a thread with highest priority. In this experiment, the IRB-6 motion controller was used, as in that application the controller thread is explicitly periodic. Figure 8.28 shows the real-time performance of the controller thread on the target system; the jitter is quite low, typically below ±3 µs.

[Plots omitted: sampling interval/µs versus sample number; the left panel covers samples 0 to 1000 (interval axis 485 to 515), the right panel samples 0 to 100 (interval axis 495 to 505).]

Figure 8.28: Measured sampling intervals for 1000 consecutive sampling instants; the nominal sampling period was 500 µs, and the jitter was typically less than ±3 µs, with a maximum of ±10 µs. The right plot shows a close-up of the first 100 samples.
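As an illustration of how jitter figures of this kind can be derived from logged sampling instants, the sketch below computes the deviation of each sampling interval from the nominal period. The function names and the timestamps are assumptions made for the example; they are not the measured data behind Figure 8.28.

```python
def interval_jitter_us(timestamps_us, nominal_period_us):
    """Deviation of each sampling interval from the nominal period."""
    intervals = [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]
    return [interval - nominal_period_us for interval in intervals]


def peak_jitter_us(timestamps_us, nominal_period_us):
    """Largest absolute deviation from the nominal period."""
    return max(abs(j) for j in interval_jitter_us(timestamps_us, nominal_period_us))


# Made-up sampling instants (in microseconds) for a nominal 500 us period.
stamps = [0, 500, 1002, 1499, 2000]
print(interval_jitter_us(stamps, 500))
print(peak_jitter_us(stamps, 500))
```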


8.7.3 Lazy locking

This experiment investigates the impact of lazy locking on the number of lock operations that are actually performed. Figure 8.29 shows the frequencies of locks in the vanilla version and real and lazy locks in the lazy version. This shows that only a small fraction of the locks actually need to be performed and thus that the locking overhead can be significantly reduced. For instance, in the receiver thread, which performs most of the computations, only 0.03% of the lock instructions in the code actually cause a mutex operation.

Figure 8.29: Comparing the vanilla version (left) to the one with lazy locks (right), showing the frequencies of real and lazy locks for each of the application threads. Please note that the scale is logarithmic. In this experiment, the mark-compact collector was used. The application was run for a fixed amount of time; the vanilla version ran 17191 samples and the lazy version 23672 samples, so the numbers are not expected to add up.

CHAPTER 9

FUTURE WORK

This chapter outlines open problems and possible directions for future research in areas discussed in, or related to, this thesis.

9.1 Adaptive GC scheduling

The proposed approaches to $C_{GC}$ prediction must be regarded as preliminary. Using a more detailed model of the heap state and predicting heap state by simulating a dynamic system was dismissed as not practically feasible. However, for applications with small variations in memory usage and long periods between mode changes, and GC algorithms where different size distributions, etc., have large impact on the workload, it may be interesting to further explore that approach.

In the simplified clear-box workload prediction, the probabilities of live and dead cells are quite simplistic. Here it would be interesting to study how modelling the mutator as a stochastic process could improve the quality of the prediction.

The experimental evaluation of the presented techniques for auto-tuning GC scheduling has been limited to a small number of applications, and while they are representative for embedded control systems, a bigger set of benchmarks, including applications outside the field of automatic control, is desirable. The GC execution time prediction has to date only been tested in the simulated environment, and will require evaluation also in an actual run-time system.

An interesting research issue is raised by the difference in how the GC increments are scheduled in the fixed priority and EDF systems described in this thesis: Is it desirable to spread GC work evenly across the cycle even if that means leaving idle time at the start of the cycle? One advantage of that approach is that it may give objects allocated at the start of the cycle time to die, which decreases the average amount of floating garbage when using an incremental-update collector. The major drawback is that it leaves less slack in the schedule towards the end of the cycle and therefore makes the system more vulnerable to changes in CPU utilization. This may be of particular importance in an adaptive system where robustness to variation in resource utilization is one of the key factors.

Another interesting situation is a system with a few hard real-time threads, which require a certain CPU percentage, and a set of soft real-time threads. Then, after allocating the required CPU time to the hard real-time threads, the remaining CPU bandwidth should be divided between GC and soft real-time threads. Solving Equation 3.4 or 3.18 for $a$ instead of $T_{GC}$ would yield a safe allocation rate and hence, period time, for each low priority thread.

9.2 Priorities for memory allocation

Preliminary experiments indicate that having run-time support for dividing memory allocations into critical or non-critical can increase both robustness and performance of real-time software. However, more experiments on larger systems and systems with high performance requirements (e.g., low latency) will have to be done.

In the presented work, only two levels of priority for memory allocation (critical and non-critical) are used. That has the advantage of being easy to handle, both at design time and in the run-time system, where the former is the more important. For the programmer, it makes the design decision quite clear: is a certain piece of code critical or not? Having more levels of priority would increase the power of expression of the model but, at the same time, make the meaning of a priority level less obvious. Nonetheless, an interesting direction of further study is whether there are applications where additional advantages may be gained from having an arbitrary number of memory priority levels.

9.2.1 Configurable behaviour

Models for controlling when to fail non-critical allocations should be studied. In the logging example the optimal behaviour of the system depends on what the intended use of the log data is; if it is for system identification we want as long consecutive series of data as possible but the amount of time between the series is of less importance. Therefore, in such an application, we want every non-critical allocation request to be granted up to a point where no more non-critical requests are granted during that cycle. On the other hand, if the data is to be used for plotting or supervision, we want the samples to be equally spaced, i.e., every nth non-critical allocation request should be granted. Furthermore, usually a set of allocations is needed in order to perform a certain task. If the last allocation of such a set is denied, the whole task has to be abandoned for that time. That should also be taken into account when deciding whether to grant or deny an allocation request.

Also, would it be possible to offer different profiles from which the programmer could choose the one that best fits a particular application? Could such profiles co-exist in one application, i.e., different parts of the application having different non-critical memory policies?
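The two granting behaviours discussed in this section (consecutive series for system identification, evenly spaced samples for plotting) could be expressed as interchangeable policy objects. The following is only a sketch under assumptions: the class names, the per-cycle budget and the counting scheme are hypothetical and are not part of the presented run-time system.

```python
class GrantUntilCutoff:
    """Grant every non-critical request until a per-cycle budget is used up."""

    def __init__(self, budget):
        self.budget = budget
        self.granted = 0

    def new_cycle(self):
        # Called at the start of each cycle to reset the budget.
        self.granted = 0

    def request(self):
        if self.granted < self.budget:
            self.granted += 1
            return True
        return False


class GrantEveryNth:
    """Grant every n-th non-critical request, giving evenly spaced samples."""

    def __init__(self, n):
        self.n = n
        self.count = 0

    def request(self):
        self.count += 1
        return self.count % self.n == 0


consecutive = GrantUntilCutoff(budget=3)
spaced = GrantEveryNth(n=3)
print([consecutive.request() for _ in range(6)])
print([spaced.request() for _ in range(6)])
```

Because both classes expose the same request() method, different parts of an application could in principle hold different policy objects, which is one way to picture the co-existing profiles asked about above.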

9.2.2 Non-critical memory using aspects

In this work, focus is on embedded real-time systems and the approach, as presented here, relies on the fact that we can modify the memory allocator. For systems without hard real-time constraints, however, it may be possible to achieve the same advantages without having to do any modifications to the Java platform. One way of doing this could be by using aspect oriented software development [AOS]. The cross-cutting concern in this case is the handling of low-on-memory situations. It should be investigated whether it is suitable to, e.g., divide the tasks into critical and non-critical aspects and dynamically weave in the non-critical parts only if the system has enough memory. We believe that it is possible to use, e.g., the property-based cross-cutting of AspectJ [KHH+01] to insert a test whether an allocation should be done before each call to a constructor.

9.3 GC scheduling interface

The experimental platforms were implemented using the garbage collection interface (GCI) [IBE+02] developed by our research group. The GCI is a programmer’s interface consisting of a well-defined set of memory operations and the goal of the GCI is to make it possible to separate the GC implementation from its usage even in a hard real-time system and in an uncooperative environment like an optimizing compiler back end that is unaware of garbage collection. The GCI makes it possible to change GC algorithms without making any changes to the rest of the run-time system or the code generation.

The scheduling principles presented in this thesis make it possible to separate the GC scheduling from the GC implementation. When a black box approach to on-line GC scheduling is used in the current prototype implementation, it is possible to change garbage collector without modifying the scheduler. However, if we want to allow a clear box approach, it is necessary to specify a GC scheduling interface that defines how the communication between the GC algorithm and the GC scheduler is done, and that requires further investigation. Furthermore, the communication between the process scheduler and the GC scheduler must be studied and formalized.

9.4 Feedback scheduling and QoS

The results in Chapter 6 have only been tested in a simulated environment, and further evaluation on real implementations is required. In particular, the proposed idea of controlling the allocation rate of processes has to be developed further. In the presented case study, the change of allocation rate was implemented by switching between different controllers, and while suitable for some control systems, it is no general solution. This also motivates further research on how to make the behaviour of non-critical memory allocations configurable.

9.5 Distributed hard real-time systems

Another area where the presented techniques may have impact is temporally predictable distributed systems. In a distributed system, the nodes can be seen as components and the whole system as being constructed by composition of node components. When designing such systems, one important factor is the ease of composing systems out of components, composability. The time-triggered architecture [Kop02, KB03] addresses the composability problem and its important features include time-triggered communication and temporal firewalls, i.e., interfaces between the components specifying what data should be available or communicated at what time. Such interfaces make it possible to guarantee that if the individual components conform to their specified interfaces, the resulting system will work as intended. They also solve problems of safety critical systems like, for instance, maintaining a global time base and determining data validity.


In order to utilize automatic memory management in such temporally predictable components, it seems that it would be helpful, if not necessary, to be able to guarantee that the memory manager is also temporally predictable. As time-triggered GC scheduling has an explicit deadline and therefore makes it possible to guarantee that a GC cycle finishes and makes a certain amount of memory available at a certain time, it would be interesting to study the impact of time-triggered GC in this field of application.

For the same reasons, time-triggered GC scheduling might also be useful together with the linear control server model [CE03], which uses time-triggered I/O in order to avoid degraded control performance due to scheduling jitter.

CHAPTER 10

RELATED WORK

This chapter presents work, related to the contributions of this thesis, in the areas of GC scheduling, memory management for real-time Java, and worst case analysis.

10.1 Time-based garbage collection scheduling

The fundamental idea of the presented work is that a deadline is assigned to the GC, and then the GC is scheduled as any other thread in the system using an arbitrary scheduling policy. That is, the presented time-triggered approach to scheduling differs from other GC scheduling strategies (including the Metronome, the deferrable server approach, and semi-concurrent scheduling) in that it does not address the scheduling of the individual GC increments but leaves that to the normal process scheduler. As a consequence, the time-triggered approach contains no explicit rules for the relative priorities of GC and mutator threads.

However, in a typical application of time-triggered GC, the GC will seldom or never preempt mutator threads. If rate-monotonic scheduling is used, the period time of the GC will typically be much longer than that of mutator threads. If earliest deadline first is used, the deadline of the GC will be a relatively long time into the future most of the time. If there is some slack in the schedule, the GC will finish its work before being so close to its deadline that it actually preempts mutator tasks.
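The EDF case of this argument can be illustrated with a small tick-based simulation of one periodic mutator task and one GC task whose deadline is the end of its GC cycle. This is a sketch under stated assumptions: the function, its parameters and the task set are hypothetical, time is discretized into unit ticks, deadline ties are resolved in favour of the mutator, and each job is assumed to finish before its next release.

```python
def simulate_edf(horizon, mutator_period, mutator_wcet, gc_period, gc_work):
    """Simulate EDF for one mutator task and one GC task.

    Returns (ticks in which the GC ran while a mutator job was ready,
             list of times at which a GC cycle completed).
    """
    mut_left = gc_left = 0
    mut_deadline = gc_deadline = 0
    delayed, gc_done = 0, []
    for t in range(horizon):
        if t % mutator_period == 0:      # release a mutator job
            mut_left, mut_deadline = mutator_wcet, t + mutator_period
        if t % gc_period == 0:           # start a new GC cycle
            gc_left, gc_deadline = gc_work, t + gc_period
        # EDF: run the ready job with the earliest absolute deadline.
        if mut_left and (not gc_left or mut_deadline <= gc_deadline):
            mut_left -= 1
        elif gc_left:
            if mut_left:
                delayed += 1             # GC ran ahead of a ready mutator job
            gc_left -= 1
            if gc_left == 0:
                gc_done.append(t + 1)
    return delayed, gc_done


# Mutator: period 5, execution time 2. GC: cycle time 20, work 6.
print(simulate_edf(horizon=40, mutator_period=5, mutator_wcet=2,
                   gc_period=20, gc_work=6))
```

With this task set (total utilization 0.7), the GC only runs in the gaps left by the mutator and completes each cycle at time 10 and 30, well before its deadlines at 20 and 40, so it never runs while a mutator job is ready.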


Henriksson

Using time as the GC work metric was discussed in [Hen98] as this would solve the problem of traditional GC work metrics failing to capture the temporal behaviour of the garbage collector. The approach was, however, dismissed as impractical, since it requires a high resolution clock. However, most current embedded platforms (even smaller ones, such as the Atmel AVR) have timers with resolution of the same magnitude as the CPU clock, which is more than adequate for these purposes. Thus, on such platforms, using time as the fundamental GC work metric is practically possible, and offers advantages over ad hoc metrics.

Bacon et al

The problems of allocation-triggered GC scheduling in real-time systems, particularly the uneven GC overhead and, consequently, mutator CPU utilization, caused by variances in allocation rate, are addressed by David F. Bacon et al and their Metronome collector [BCR03b, BCR03a]. To achieve even and predictable mutator CPU utilization, time-based scheduling, where the collector and mutator are interleaved using fixed time quanta, is proposed.

The work of Bacon et al is largely motivated by the same concerns and has much in common with the work presented in this thesis. One fundamental feature of time-based GC scheduling common to both approaches is that they turn garbage collection into a periodic activity instead of a sporadic one as allocation-triggered GC does.

The main difference between the model proposed by Bacon et al and the time-triggered GC scheduling model presented in this thesis lies in the level at which GC scheduling is considered; the period time of their model is at the quantum level while the period of the time-triggered GC is the GC cycle. Also, the fixed time quanta of the Metronome explicitly state how the GC work should be scheduled while the time-triggered model specifies a deadline and leaves the actual scheduling decisions to the underlying process scheduler.

The behaviour of the approach of Bacon et al is, at a large time scale, similar to that of a semi-concurrent GC or a time-triggered GC in that the CPU utilization of the mutator is predictable and consistent and independent of bursty allocation rate of the mutator.¹ However, at a more fine-grained level, the garbage collector may still preempt the mutator as the GC is scheduled to run for one GC quantum after each mutator quantum. Also, the Metronome time quanta are in the millisecond range, whereas the atomic operations of the collectors in the LJRT run-time system are a few microseconds. Here, the design goals behind their collector differ from the ones driving the work presented in this thesis; they focus on low overhead and consistent utilization while non-intrusiveness and low GC induced latency and jitter, possibly at the cost of higher inlined overhead, are the key issues behind this thesis.

¹ The interleaving of GC and background processes in the semi-concurrent model may be almost identical; quantization effects due to atomic GC primitives make a GC scheduled according to Equation (3.20) behave as a time-based GC with small GC and mutator quanta.

Qian et al

Time-triggered GC was also proposed in [YSaSC02] as a means to spread GC work more evenly and minimize the number of GC invocations and heap usage when the application’s allocation pattern is bursty. The focus of that work is on measuring object lifetimes but they note that similar concerns are relevant in server applications.

Previous object life span studies have used an allocation-triggered approach, calling the GC every n KB of allocation. Qian et al supplement this with a time based approach by periodically performing a GC cycle, e.g., every 100 ms. In their paper, no effort is made to ensure that the collector keeps up with the mutator since this is not a problem in their application; it is sufficient that the GC cycle time can be manually tuned to suit a particular application.
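The difference between the two triggers under a bursty allocation pattern can be sketched as below. The functions, the threshold, the period and the allocation trace are all assumptions made for illustration; they are not taken from the cited paper.

```python
def allocation_triggered(events, threshold_bytes):
    """Times at which a GC is invoked after every threshold_bytes allocated.

    events is a list of (time_ms, allocated_bytes) pairs in time order.
    """
    invocations, since_last_gc = [], 0
    for time_ms, size in events:
        since_last_gc += size
        if since_last_gc >= threshold_bytes:
            invocations.append(time_ms)
            since_last_gc = 0
    return invocations


def time_triggered(end_ms, period_ms):
    """Times at which a GC cycle is started every period_ms up to end_ms."""
    return list(range(period_ms, end_ms + 1, period_ms))


# A burst of eight 1 KB allocations at 10..17 ms, then no allocation.
burst = [(10 + i, 1024) for i in range(8)]
print(allocation_triggered(burst, threshold_bytes=2048))
print(time_triggered(end_ms=400, period_ms=100))
```

In this made-up trace, the allocation-triggered collector is invoked four times within the burst (at 11, 13, 15 and 17 ms), whereas the time-triggered invocations are evenly spaced regardless of the allocation pattern.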

They also hint that the time-triggered approach can be applicable to embedded systems by using the timing information of the processes to run the GC when the number of live objects is small. The focus is still on efficiency and minimizing the number of GC invocations and they do not address any real-time issues.

Xian and Xiong

Work on GC scheduling aimed at minimizing the memory requirement was presented in [XX05]. That approach is based on minimizing the response time of the GC, and thereby the memory required to satisfy allocation requests while the GC is running, and the technique used is to treat the GC as an aperiodic task and run it in a deferrable server [SLS95].

Embedded systems are typically quite predictable, including memory usage. Thus the GC really is a periodic task, and by making this explicit in the model, standard techniques for scheduling periodic tasks can be used. Treating the GC as aperiodic and allowing it to interrupt mutator tasks at arbitrary times will lead to increased jitter.


10.2 Adaptive GC scheduling

An approach to adaptive GC scheduling aimed at minimizing the GC overhead is suggested by Henriksson [Hen96]. The idea is that at the start of the GC cycle, garbage collection is performed at a rate that will allow the GC to finish on time in the average case. Then, at a certain (a priori calculated) point, if the GC workload in the current cycle was more than the average, the GC rate is increased to the maximum rate in order to finish on time. Thus, this adaptive GC rate improves the average performance while still guaranteeing that the GC will not stop the application from meeting its deadlines in the worst case.

That approach is particularly useful if the difference between the worst and average case GC work is large, and the worst case is rare. As the conservative $C_{GC}$ prediction presented here may, under some circumstances, be very conservative, a similar approach might be very useful in the applications discussed in this thesis. By using both a conservative estimate and a record of average $U_{GC}$, the GC could be scheduled according to the average case at the beginning of each cycle. Then, at a certain point in time (determined by the maximum allowed $U_{GC}$), if the GC hasn’t finished its work, $U_{GC}$ is raised to the maximum.
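One simple way to picture the a priori calculated switch point is as the latest time at which the remaining worst case work can still be completed at the maximum rate. The sketch below rests on assumptions that are not taken from [Hen96] or from the thesis: work is measured in abstract units, the GC progresses at a constant rate in each phase, and the function and parameter names are hypothetical.

```python
def switch_time(cycle_time, worst_work, avg_rate, max_rate):
    """Latest time to switch from avg_rate to max_rate within one GC cycle.

    Solves avg_rate * t + max_rate * (cycle_time - t) = worst_work for t.
    """
    if max_rate <= avg_rate:
        raise ValueError("max_rate must exceed avg_rate")
    if max_rate * cycle_time < worst_work:
        raise ValueError("worst case work does not fit in the cycle")
    return (max_rate * cycle_time - worst_work) / (max_rate - avg_rate)


# Cycle of 100 ms, worst case 60 units of work, average case 20 units
# (so avg_rate = 0.2 units/ms), and a maximum rate of 1.0 unit/ms.
t = switch_time(cycle_time=100, worst_work=60, avg_rate=0.2, max_rate=1.0)
print(t)
```

In this example the switch point is 50 ms: 10 units are done at the average rate in the first half of the cycle, and the remaining 50 units of the worst case fit in the second half at the maximum rate.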

Engelstad and Vandendorpe [EV91] mention using a heuristic for controlling the “steal rate” of their garbage collector. A GC increment is performed every n allocations and GC progress is measured. If forward progress is not made, n is decreased and vice versa.
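A heuristic of this kind might look like the sketch below. The text above only states that n is decreased when no forward progress is made and increased otherwise; the halving and doubling steps, the bounds and the function name are assumptions chosen for illustration.

```python
def adjust_steal_interval(n, made_progress, n_min=1, n_max=1024):
    """Return the new number of allocations between GC increments.

    A smaller n means that GC increments are performed more often.
    """
    if made_progress:
        return min(n * 2, n_max)   # collector keeps up: collect less often
    return max(n // 2, n_min)      # collector falls behind: collect more often


n = 8
for progress in (False, False, True):
    n = adjust_steal_interval(n, progress)
    print(n)
```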

Siebert [Sie02] also uses an adaptive scheme to minimize GC overhead; based on the current memory utilization, a proper value for how much GC work to be performed for every allocated byte is determined. The fundamental difference between that work and the adaptive scheduling presented in this thesis is that Siebert requires an upper bound on the fraction of allocated memory to be known and the adaptivity is an optimization to avoid unnecessarily long GC increments if the actual amount of allocated memory is less than the worst case. The adaptive scheduling presented in this thesis requires no a priori analysis and is purely based on measuring the state of the memory system. This gives increased flexibility at the cost of a priori guarantees.

10.3 Memory Management in Real-Time Java

There are two specifications for real-time Java: the Real-Time Specification for Java (RTSJ) [B+01] and the Real-Time Core Extensions (RTCE) [JC00]. Both try to solve the real-time garbage collection problem by avoiding it. They assume that garbage collection is not feasible in real-time systems and instead propose region-based approaches to memory management for the real-time threads. The non-real-time threads do their memory allocation on a heap with traditional garbage collection.

RTSJ uses scoped memory areas for high priority threads. Objects allocated in scoped memory areas are not garbage collected but instead the whole memory area is reclaimed when the program exits the scope in which the memory area was allocated. The access restrictions associated with scoped memory (e.g., objects allocated on the heap may not reference objects in scoped memory, and real time threads aren’t allowed to access the heap²) make inter-thread communication more difficult. Real-time threads, however, may share scoped memory areas.

In RTCE, real-time objects are allocated in core memory, and may not access objects on the garbage collected baseline heap. Objects on the heap may, with some restrictions, access core objects through special method calls. Core objects are allocated in an allocation context. When an allocation context is released, all objects in it may be eligible for reclamation but, since there might be references from the baseline heap, the actual reclamation is done by the baseline garbage collector when all of the objects in the allocation context are unreachable. Thus, a non-real-time garbage collector is used to reclaim the memory used by the real-time processes.

In RTCE, there are no limitations on which allocation contexts objects may reference, so it is up to the programmer not to release an allocation context while it is still referenced. RTCE also specifies stack allocation of real-time objects, which are to be automatically reclaimed as the scope is exited. To allocate stack objects, a set of restrictions applies and the reference must explicitly be declared stackable.

Under both of these specifications, behaviour similar to our non-critical allocations can be achieved by using one memory area (or allocation context) for critical memory and another (or the heap) for the non-critical objects. The drawbacks of these approaches compared to the one proposed in this thesis are, firstly, that a much higher responsibility is placed on the programmer by removing the safety that garbage collection provides from the most critical parts of the system. Secondly, the access restrictions between the different types of memory make communication between low and high priority threads more complicated.

² Since the heap is garbage collected, real-time threads with hard time constraints must be of the type NoHeapRealTimeThread in order to avoid interference from the garbage collector.


10.4 Soft references

The notion of non-critical allocation is somewhat related to the soft references found, for instance, in Java [J2S] in that they both aim to prevent out of memory errors due to too many objects not absolutely needed for the correct operation of the program. In analogy with the Java terminology, non-critical allocations could be called “soft allocations”.

The difference lies in when the system decides that it is running low on memory and starts trying to limit memory usage. With the approach presented here, the decision is taken at allocation time, preventing a low on memory situation from arising. When using soft references, on the other hand, all allocations are carried out, and the decision about when to reclaim softly reachable objects is left to the garbage collector. There is also a difference in the intended usage; soft references were introduced to facilitate the implementation of, e.g., caches, where objects’ lifetimes are nondeterministic (i.e., you never know whether a cached value will be accessed again in the future or not, but it’s best to keep it as long as the memory permits). Thus, while soft references may be used to achieve a logical behaviour similar to our non-critical allocations, the increase in the amount of required GC work when the system is already low on memory makes this use of soft references unsuitable for real-time applications.

10.5 GC in an uncooperative environment

The work presented here focuses on hard real-time systems for which we want to use standard C compilers and to run our application under a standard real-time operating system. This means that we have to find a way to deal with an uncooperative environment. The standard way to introduce GC in uncooperative environments is to use conservative garbage collectors such as the one devised by Boehm and Weiser [BW88]. Such GC algorithms are, however, not suitable for the types of real-time systems we are interested in, since we need the GC to be both accurate and predictable.

An algorithm which is accurate and which works well with standard C compilers is the one presented by Henderson [Hen02]. His aim was to refute the “common wisdom” that accurate GC is not possible without support from the compiler back-end by presenting such a GC for single-threaded applications and stop-the-world GC. The work presented in this thesis also illustrates that GC is possible without special compiler support. An important part of the contribution is to show that accurate GC without support from the compiler or scheduler is possible also for concurrent GC and multi-threaded applications with strict real-time requirements.

10.6 Worst case and schedulability analysis

Good worst case estimates for execution time and memory usage are crucial for making any kind of real-time guarantees. In order to make such analysis feasible in industry, tool support is required.

Alan C. Shaw has developed a technique, timing schema [Sha89], for formalizing execution time analysis. A timing tool for a subset of C has also been developed [PS91].

In order to give continuous feedback to the developer, an interactive programming environment with worst case analysis functionality is desirable. The experimental tool Skånerost developed at our department provides interactive worst case execution time and memory consumption analysis based on timing schema and source code annotations for (currently a subset of) the Java language [PH99, Per99, PH00].

The WCET group at Uppsala University has presented research on, and tool support for, worst case analysis of C code without the requirement for programmer annotations, based on flow analysis and pipeline simulation [EES01, Eng02].

Another approach to schedulability analysis and automatic verification of real-time systems based on timed automata has been developed in the UPPAAL project [LPY97, Pet99].

Current approaches to worst case analysis are often highly complex when applied to life-size programs. A different approach to temporally predictable software is proposed by Puschner [Pus02]. That approach is based on trading off performance for predictability by writing (or automatically transforming) programs in such a way that they are inherently predictable: single path programming. It is not clear how this approach affects dynamic memory management.

For systems where threads execute with fixed period times, off-line assignment of GC scheduling parameters can be used. To facilitate this, a framework for performing static (compile-time) analysis of allocation rates was presented by Mann et al. [MDLC05].

CHAPTER 11

CONCLUSIONS

Motivated by the desire for greater flexibility in real-time systems, and the need to handle non-determinism and variations in resource utilization, new approaches to memory management have been presented.

A model for scheduling garbage collection work, time-triggered GC scheduling, that has several benefits compared to previous techniques is proposed. The single scheduling requirement that the garbage collector must finish before its deadline makes it especially suitable for earliest deadline first (EDF) systems, for which we have not seen any similar approaches.

The handling of non-determinism, and the desire to enable the run-time system to provide real-time performance without requiring worst case analysis, motivated two approaches to adaptive memory management. Firstly, techniques to accomplish auto-tuning of a concurrent, real-time, time-triggered garbage collector were examined. Adaptive GC scheduling contains two problems: to determine the scheduling parameters of the GC process and to keep a task set with varying resource utilization schedulable. Both problems were addressed. Methods for on-line estimation of both period and execution time of the GC were developed, and an approach to taking the CPU utilisation of the GC into account in a feedback scheduler was suggested.

Another approach to handling non-determinism and enhancing robustness, applying priorities to memory allocation, was presented. It was observed that often, systems contain parts that are not critical to the core functionality. Thus, if the computer is running low on memory, we want run-time system support for selecting the most important memory allocations, just as the process scheduler makes sure that the most important processes get precedence over less important ones if CPU time is scarce.

11.1 Contributions

Time-triggered garbage collection scheduling

A number of problems related to GC scheduling were addressed:

• Using time rather than allocation as the trigger for GC work solves the problem of bursty allocations causing long GC pauses. It also allows us to spread the GC work evenly across the GC cycle. In essence, by turning GC work into a periodic activity rather than a sporadic one, the scheduling of GC is simplified.

• The metric used to measure GC work has a big impact on the GC scheduling. The optimal GC work metric is the CPU time required to perform the GC work and it is proposed that it is practically possible to use time as the GC work metric at run-time.

• Implementing non-intrusive concurrent GC with guaranteed progress in an EDF scheduled system has been problematic. Time-triggered GC scheduling provides an explicit deadline for each GC cycle and therefore fits nicely into an EDF system.

• Time-triggered GC makes it possible to schedule the GC thread as any other thread. The GC work metric is only used for schedulability analysis and therefore, the problems of poor real-time performance caused by a poor metric are avoided.

These ideas form a novel approach to non-intrusive, concurrent garbage collection scheduling in real-time systems.
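The last two points can be illustrated with a small sketch. Assuming independent periodic tasks whose deadlines equal their periods, the classic utilization-based EDF test [LL73] admits a task set when the total utilization does not exceed one, so the GC can be added as one more (execution time, period) pair. The function name and all numbers are hypothetical, and the GC execution time stands for the estimated CPU time of a whole GC cycle, which enters only the analysis and not the run-time scheduling.

```python
def edf_schedulable(tasks):
    """Utilization-based EDF test for periodic tasks with deadline == period.

    tasks: iterable of (execution_time, period) pairs in the same time unit.
    """
    return sum(c / t for c, t in tasks) <= 1.0


# Hypothetical mutator task set: (worst-case execution time, period) in ms.
mutators = [(1.0, 5.0), (2.0, 10.0)]

# The GC is treated as just another periodic task: the estimated CPU time
# per GC cycle and the GC cycle time, which is also its deadline.
# Both figures are assumed for the example.
gc_task = (3.0, 10.0)

print(edf_schedulable(mutators + [gc_task]))
```

With these figures the total utilization is 0.7 and the set is admitted; raising the GC work to 7 ms per 10 ms cycle pushes the utilization above one and the test fails.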

Adaptive garbage collection scheduling

As the time-triggered approach to garbage collection scheduling allows us to make scheduling decisions at the level of GC cycles rather than individual increments, it lends itself well to auto-tuning. Techniques for estimating both the GC cycle time and the amount of GC work required to complete a cycle were presented and their applicability was experimentally verified.

The proposed techniques can facilitate the implementation of more flexible real-time systems as they make it possible to use GC in a real-time system without the need for tedious manual tuning.
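The following sketch shows one way such on-line estimation could be structured. It is not the estimator developed in this thesis: an exponentially weighted moving average is used as a stand-in, and the class `GcTuner`, its method names, and the smoothing factor are hypothetical. The GC cycle time is estimated as the time it would take the mutators to consume the currently free memory at the smoothed allocation rate, which is a simplifying assumption.

```python
class GcTuner:
    """Smooths per-cycle measurements into GC scheduling parameters."""

    def __init__(self, alpha=0.25):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha       # smoothing factor (assumed value)
        self.gc_work_s = None    # smoothed GC CPU time per cycle, seconds
        self.alloc_rate = None   # smoothed allocation rate, bytes/second

    def _smooth(self, old, sample):
        # Exponentially weighted moving average; the first sample is used as-is.
        return sample if old is None else old + self.alpha * (sample - old)

    def observe_cycle(self, gc_cpu_time_s, bytes_allocated, elapsed_s):
        """Record the measurements taken over one completed GC cycle."""
        if elapsed_s <= 0:
            raise ValueError("elapsed_s must be positive")
        self.gc_work_s = self._smooth(self.gc_work_s, gc_cpu_time_s)
        self.alloc_rate = self._smooth(self.alloc_rate,
                                       bytes_allocated / elapsed_s)

    def next_cycle_time(self, free_bytes):
        """Estimated time until free_bytes is consumed at the smoothed rate."""
        if not self.alloc_rate:
            return float("inf")  # no allocation observed yet
        return free_bytes / self.alloc_rate

    def gc_utilization(self, free_bytes):
        """Estimated CPU utilization of the GC task over the next cycle."""
        if self.gc_work_s is None:
            return 0.0
        return self.gc_work_s / self.next_cycle_time(free_bytes)
```

The two outputs, cycle time and utilization, correspond to the two scheduling parameters discussed above: the deadline of the GC task and the share of CPU time it needs.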

Priorities for memory allocations

Based on the observation that, often, not all of the code in a hard real-time system is critical, the idea of applying priorities to memory allocation was presented. This can be used to enhance the robustness of real-time and embedded systems in two ways:

• It provides run-time support for prioritizing memory allocations if there is not enough memory for all allocation requests and thereby facilitates development of robust applications.

• It makes it easier to provide hard guarantees since the worst case memory usage only has to be analyzed for the critical parts of the system as non-critical allocations cannot cause the system to fail.

Furthermore, experiments also show that the same mechanisms can be used to increase performance by limiting the amount of memory allocation and, consequently, garbage collection work.
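A minimal sketch of the idea, under stated assumptions, is given below. The class `PriorityAllocator` and its fixed reserve threshold are hypothetical and only illustrate the principle that non-critical requests are denied before they can endanger critical ones; the mechanism presented in the thesis is not reproduced here, and reclamation by the garbage collector is not modelled.

```python
class PriorityAllocator:
    """Toy memory accounting with two allocation priorities."""

    def __init__(self, heap_bytes, critical_reserve):
        if not 0 <= critical_reserve <= heap_bytes:
            raise ValueError("reserve must lie between 0 and the heap size")
        self.free_bytes = heap_bytes
        # Memory that non-critical allocations may never consume (assumed policy).
        self.critical_reserve = critical_reserve

    def allocate(self, nbytes, critical=False):
        """Return True if the request is granted, False if it is denied.

        A non-critical request is denied when granting it would eat into the
        reserve. A critical request may use all free memory and raises
        MemoryError only when the heap is truly exhausted.
        """
        floor = 0 if critical else self.critical_reserve
        if self.free_bytes - nbytes < floor:
            if critical:
                raise MemoryError("critical allocation failed")
            return False
        self.free_bytes -= nbytes
        return True
```

With this structure the worst case memory usage needs to be bounded only for the critical allocations, against the size of the reserve, while a denied non-critical request is an ordinary return value that the application has to handle.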

Memory-aware feedback scheduling

In order to use scheduled garbage collection in a feedback scheduling system, the required CPU utilization of the GC task must be known, and as the GC utilization depends on the memory behaviour of mutator tasks, it must be determined on-line. It was investigated how an auto-tuning time-triggered GC can be incorporated in a feedback scheduling system in order to make the memory management overhead explicit and let the process scheduler take this into account when scheduling the application threads.

It was also suggested that non-critical memory allocations can be used, in a feedback scheduling system, to control the allocation rates of the application threads in order to optimize the trade-off between memory and CPU time usage.
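The interaction between GC utilization and the feedback scheduler can be sketched as follows. The function `rescale_periods`, the utilization set-point, and the policy of stretching all application periods by a common factor are assumptions chosen for brevity; they are one possible way of letting the scheduler account for the GC load, not the scheme investigated in the thesis.

```python
def rescale_periods(tasks, gc_utilization, setpoint=1.0):
    """Stretch application task periods so that the GC load still fits.

    tasks: list of (execution_time, nominal_period) pairs.
    gc_utilization: on-line estimate of the CPU utilization of the GC task.
    setpoint: total utilization the scheduler aims for (assumed parameter).
    Returns the list of periods to use, never shorter than the nominal ones.
    """
    budget = setpoint - gc_utilization
    if budget <= 0:
        raise ValueError("GC alone exceeds the utilization set-point")
    demand = sum(c / t for c, t in tasks)
    # Only slow tasks down when the nominal demand exceeds the remaining budget.
    scale = max(1.0, demand / budget)
    return [t * scale for _, t in tasks]
```

For two hypothetical tasks with a combined nominal utilization of 0.5 and a set-point of 0.9, an estimated GC utilization of 0.2 leaves the periods untouched, whereas an estimate of 0.5 stretches both periods by a factor of 1.25.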

Accurate real-time GC in an uncooperative environment

An implementation of a framework for accurate, concurrent, real-time garbage collection aimed at embedded systems was presented. It allows very low latency and works for a system including legacy and automatically generated C code, and an off-the-shelf compiler and operating system. Restrictions imposed by the uncooperative environment, especially the scheduler, make explicit synchronization between mutator and collector necessary, adding to the execution time overhead of memory operations. The sources of that overhead were analyzed, and possible remedies presented.

Experiments show that it is possible to use accurate garbage collection in an uncooperative environment for multi-threaded real-time applications which require latency times as low as a few microseconds, and that the run-time overhead can be kept at a reasonable level.

11.2 Reflections

In the introduction, it was stated that an important property of a memory manager to be used in a flexible real-time system is that the real-time performance at run-time must be independent of a priori schedulability analysis. That is, if the total requested CPU utilization of mutator and collector is low enough that the system is schedulable, the actual schedule produced by the run-time system will allow all tasks to meet their deadlines. The inherent robustness of the time-triggered GC scheduling model and the property that low-level scheduling decisions are left to the process scheduler, combined with the presented approaches to adaptive GC scheduling and memory allocation, help resolve the memory management issues of flexible real-time software.

The second goal was to develop a model that makes it possible to schedule garbage collection as any other task while still guaranteeing sufficient progress. This thesis shows that time-triggered GC scheduling has this property under both fixed priority and EDF scheduling.

The fundamental idea behind time-triggered GC scheduling is to turn garbage collection into a periodic activity that can be scheduled using standard scheduling techniques. It is my belief that adding a specialized scheduler for the individual GC increments merely adds overhead: if the system (including GC) is schedulable, there is no point in running the GC with a higher priority than the mutator, as that just increases jitter. If the system is not schedulable, it is better to explicitly limit the CPU usage of the mutator (using, e.g., constant bandwidth servers, feedback scheduling, or a similar approach) than to run the GC at the highest priority with a specialized scheduler. The result is the same: the mutator threads are delayed and the GC is given enough CPU time to ensure sufficient progress.

On-line resource management through adaptive techniques and approaches to handling overload is an important part of the presented work. Based on the observation that not all hard real-time systems (or not all parts of a hard real-time system) are safety critical, techniques that transfer some resource-management tasks from the programmer to the run-time system were proposed and discussed. While adaptive systems cannot give absolute guarantees, the presented mechanisms work together in enhancing robustness and providing isolation between different parts of a system. Their lack of hard guarantees can also be a strength as it forces the engineer to consider, and provides tools for managing, the uncertainty that is unavoidable in increasingly complex embedded systems.

Automatic memory management is essential to the use of safe object-oriented languages, and the presented contributions are a step towards making real-time garbage collection practically feasible.

BIBLIOGRAPHY

[AB91] Leif Andersson and Anders Blomdell. A real-time programming environment and a real-time kernel. In Lars Asplund, editor, National Swedish Symposium on Real-Time Systems, Technical Report No 30 1991-06-21. Dept. of Computer Systems, Uppsala University, Uppsala, Sweden, 1991.

[AB98] Luca Abeni and Giorgio Buttazzo. Integrating multimedia applications in hard real-time systems. In Proceedings of the 1998 IEEE Real-Time Systems Symposium, Madrid, Spain, December 1998.

[ACM+03] Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé, et al. Energy management for real-time embedded applications with compiler support. In LCTES’03 [LCT03].

[AEL88] Andrew W. Appel, John R. Ellis, and Kai Li. Real-time concurrent collection on stock multiprocessors. In Proceedings of the SIGPLAN’88 Conference on Programming Language Design and Implementation, Atlanta, Georgia, June 1988.

[AOS] Aspect-oriented software development web site; http://www.aosd.net.

[AP00] Luca Abeni and Luigi Palopoli. On adaptive control techniques in real-time resource allocation. In Proceedings of the IEEE Euromicro Conference on Real-Time Systems, Stockholm, Sweden, June 2000.

[AW89] Karl Johan Åström and Björn Wittenmark. Adaptive Control. Addison-Wesley, 1989.

[AW97] Karl Johan Åström and Björn Wittenmark. Computer-controlled systems, Theory and design. Prentice Hall, 1997.

[B+01] Greg Bollella et al. The Real-Time Specification for Java. Addison-Wesley, 2001.

[Bak78] Henry G. Baker. List processing in real time on a serial computer. Communications of the ACM, 21(4):280–294, April 1978.

[BCR03a] David F. Bacon, Perry Cheng, and V. T. Rajan. Controlling fragmentation and space consumption in the metronome, a real-time garbage collector for Java. In LCTES’03 [LCT03].

[BCR03b] David F. Bacon, Perry Cheng, and V. T. Rajan. A real-time garbage collector with low overhead and consistent utilization. In Proceedings of POPL’03, New Orleans, Louisiana, USA, January 2003.

[BH04] Stephen M Blackburn and Anthony L Hosking. Barriers: Friend or foe? In Proceedings of the 2004 International Symposium on Memory Management (ISMM’04), Vancouver, Canada, October 2004. ACM Press.

[BLA02] Giorgio Buttazzo, Giuseppe Lipari, and Luca Abeni. Elastic scheduling for flexible workload management. IEEE Transactions on Computers, 51(3), March 2002.

[Bob68] D. G. Bobrow. Managing re-entrant structures using reference counts. ACM Transactions on Programming Languages and Systems, 11(3), July 1968.

[BW88] Hans-J. Boehm and Mark Weiser. Garbage collection in an uncooperative environment. Software – Practice and Experience, 18(9), September 1988.

[CE03] Anton Cervin and Johan Eker. The Control Server: A computational model for real-time control tasks. In Proceedings of the 15th Euromicro Conference on Real-Time Systems, Porto, Portugal, July 2003.

[CEBA02] Anton Cervin, Johan Eker, Bo Bernhardsson, and Karl-Erik Årzén. Feedback-feedforward scheduling of control tasks. Real-Time Systems, 23(1), July 2002.

[Cer03] Anton Cervin. Integrated Control and Real-Time Scheduling. PhD thesis, Department of Automatic Control, Lund Institute of Technology, Sweden, April 2003.

[CLE+04] Anton Cervin, Bo Lincoln, Johan Eker, Karl-Erik Årzén, and Giorgio Buttazzo. The jitter margin and its application in the design of real-time control systems. In Proceedings of the 10th International Conference on Real-Time and Embedded Computing Systems and Applications, Göteborg, Sweden, August 2004.

[Col60] G. E. Collins. A method for overlapping and erasure of lists. Communications of the ACM, 3(12), December 1960.

[DLM+78] E. W. Dijkstra, L. Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens. On-the-fly garbage collection: An exercise in cooperation. Communications of the ACM, 21(11), November 1978.

[DN76] Ole Johan Dahl and Kristen Nygaard. SIMULA – A language for Programming and Description of Discrete Event Systems. Norwegian Computing Center, Oslo, Norway, 5th edition, September 1976.

[EES01] Jakob Engblom, Andreas Ermedahl, and Friedhelm Stappert. A worst-case execution-time analysis tool prototype for embedded real-time systems. In Proceedings of the Workshop on Real-Time Tools (RT-TOOLS 2001), August 2001.

[Ekm04] Torbjörn Ekman. Rewritable Reference Attributed Grammars — design, implementation, and applications. Licentiate thesis, Department of Computer Science, Lund University, 2004.

[Eng02] Jakob Engblom. Processor Pipelines and Static Worst-Case Execution Time Analysis. PhD thesis, Department of Information Technology, Uppsala University, 2002.

[EV91] Steven L. Engelstad and James E. Vandendorpe. Automatic storage management for systems with real time constraints. In OOPSLA ’91 GC Workshop, 1991.

[FY69] R. Fenichel and J. Yochelson. A lisp garbage collector for virtual memory computer systems. Communications of the ACM, 12(11), November 1969.

[HC05] Dan Henriksson and Anton Cervin. Optimal on-line sampling period assignment for real-time control tasks based on plant state information. In Proceedings of the Joint 44th IEEE Conference on Decision and Control and European Control Conference, Seville, Spain, 2005.

[HCA02] Dan Henriksson, Anton Cervin, and Karl-Erik Årzén. TrueTime: Simulation of control loops under shared computer resources. In Proceedings of the 15th IFAC World Congress on Automatic Control, Barcelona, Spain, July 2002.

[Hen96] Roger Henriksson. Adaptive scheduling of incremental copying garbage collection for interactive applications. In Proceedings of the 1996 Nordic Workshop on Programming Environment Research (NWPER’96), Aalborg, Denmark, 1996.

[Hen98] Roger Henriksson. Scheduling Garbage Collection in Embedded Systems. PhD thesis, Department of Computer Science, Lund Institute of Technology, Lund University, 1998.

[Hen02] Fergus Henderson. Accurate garbage collection in an uncooperative environment. In ISMM’02 [ISM02].

[HN99] Mathias Haage and Klas Nilsson. On the scalability of visualization in manufacturing. In Proceedings of ETFA ’99, 1999.

[IBE+02] Anders Ive, Anders Blomdell, Torbjörn Ekman, Roger Henriksson, Anders Nilsson, Klas Nilsson, and Sven Gestegård Robertz. Garbage collector interface. In Proceedings of NWPER’02, Copenhagen, Denmark, August 2002.

[ISM02] Proceedings of the 2002 International Symposium on Memory Management (ISMM’02), Berlin, Germany, June 2002. ACM Press.

[Ive03] Anders Ive. Towards an embedded real-time Java virtual machine. Lic. eng. thesis, Department of Computer Science, Lund Institute of Technology, Lund University, 2003.

[J2S] Java 2 platform, standard edition, API specification. Sun Microsystems. http://java.sun.com.

[JC00] J-Consortium. Real-time core extensions for the java platform. International J Consortium Specification, 2000.

[JL96] Richard Jones and Raphael Lins. Garbage Collection. Algorithms for Automatic Dynamic Memory Management. John Wiley & Sons, 1996.

[JP86] M Joseph and P Pandya. Finding response times in a real-time system. The Computer Journal, 29(5), 1986.

[KB03] Hermann Kopetz and Gunther Bauer. The time-triggered architecture. Proceedings of the IEEE, 91(1):112–126, January 2003.

[KHH+01] Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William G. Griswold. An overview of AspectJ. In Jørgen Lindskov Knudsen, editor, Proceedings of the European Conference on Object Oriented Programming (ECOOP). Springer-Verlag, 2001.

[Knu73] Donald E. Knuth. The Art of Computer Programming. Fundamental Algorithms. Addison-Wesley, 1973.

[Kop02] Hermann Kopetz. Time-triggered real-time computing. IFAC World Congress, Barcelona, July 2002, IFAC Press, July 2002.

[LC02] Bo Lincoln and Anton Cervin. Jitterbug: A tool for analysis of real-time control performance. In Proceedings of the 41st IEEE Conference on Decision and Control, Las Vegas, NV, December 2002.

[LCT03] Proceedings of the 2003 ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES’03), San Diego, California, USA, June 2003. ACM Press.

[Lin04] Daniel Linden. Estimating the Overhead for Automatic Memory Management in Real-Time Systems. Master’s thesis, Department of Computer Science, Lund Institute of Technology, Lund University, 2004.

[LL73] C. L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM, 20(1), 1973.

[LPY97] Kim G. Larsen, Paul Pettersson, and Wang Yi. UPPAAL in a Nutshell. Int. Journal on Software Tools for Technology Transfer, 1(1–2):134–152, October 1997.

[Mac04] The Real-Time Java Platform, white paper. http://research.sun.com/projects/mackinac/mackinac whitepaper.pdf, June 2004.

[McC60] J. McCarthy. Recursive functions of symbolic expressions and their computation by machine. Communications of the ACM, 3(4), April 1960.

[MDLC05] Tobias Mann, Morgan Deters, Rob LeGrand, and Ron K. Cytron. Static determination of allocation rates to support real-time garbage collection. In Proceedings of the 2005 Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES’05). ACM, 2005.

[Min63] M. L. Minsky. A lisp garbage collector algorithm using serial secondary storage. Memo 58 (rev.) Project Mac, M.I.T., Cambridge, Mass., December 1963.

[NEN02] Anders Nilsson, Torbjörn Ekman, and Klas Nilsson. Real Java for real time – gain and pain. In Proceedings of CASES-2002, pages 304–311. ACM Press, October 2002.

[NIEH04] Anders Nilsson, Anders Ive, Torbjörn Ekman, and Görel Hedin. Implementing Java compilers using ReRAGs. Nordic Journal of Computing, 11(3):213–234, 2004.

[Nil04] Anders Nilsson. Compiling Java for Real-Time Systems. Lic. eng. thesis, Department of Computer Science, Lund Institute of Technology, Lund University, 2004.

[NR05] Anders Nilsson and Sven Gestegård Robertz. On real-time performance of ahead-of-time compiled Java. In Proceedings of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC’05), Seattle, Washington, May 2005.

[Per99] Patrik Persson. Live memory analysis for garbage collection in embedded systems. In Proceedings of the ACM SIGPLAN 1999 Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES’99), Atlanta, Georgia, May 1999.

[Pet99] Paul Pettersson. Modelling and Verification of Real-Time Systems Using Timed Automata: Theory and Practice. PhD thesis, Uppsala University, 1999.

[PH99] Patrik Persson and Görel Hedin. Interactive execution time predictions using reference attributed grammars. In Proceedings of WAGA’99: Second Workshop on Attribute Grammars and their Applications, Amsterdam, The Netherlands, March 1999.

[PH00] Patrik Persson and Görel Hedin. An interactive environment for real-time software development. In Proceedings of the 33rd International Conference on Technology of Object-Oriented Languages (TOOLS Europe 2000), St. Malo, France, June 2000.

[PS91] Chang Yun Park and Alan C. Shaw. Experiments with a program timing tool based on source-level timing schema. Computer, 24(5), May 1991.

[Pus02] Peter Puschner. Is worst-case execution-time analysis a non-problem? — Towards new software and hardware architectures. In Proceedings of the 2nd International Workshop on Worst-Case Execution Time Analysis (WCET 2002), Vienna, Austria, June 2002.

[RH03] Sven Gestegård Robertz and Roger Henriksson. Time-triggered garbage collection — robust and adaptive real-time GC scheduling for embedded systems. In LCTES’03 [LCT03].

[Rit03] Tobias Ritzau. Memory Efficient Hard Real-Time Garbage Collection. PhD thesis, Department of Computer and Information Science, Linköping University, 2003.

[RNNH06] Sven Gestegård Robertz, Anders Nilsson, Klas Nilsson, and Mathias Haage. Multi-stage deployment of robot control software. In Proceedings of the 8th International IFAC Symposium on Robot Control, SYROCO, September 2006. To appear.

[Rob02] Sven Gestegård Robertz. Applying priorities to memory allocation. In ISMM’02 [ISM02].

[Sha89] Alan C. Shaw. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15(7), 1989.

[Sie02] Fridtjof Siebert. Hard Realtime Garbage Collection in Modern Object Oriented Programming Languages. PhD thesis, Fakultät für Informatik, Universität Karlsruhe, 2002.

[SLS95] Jay K. Strosnider, John P. Lehoczky, and Lui Sha. The deferrable server algorithm for enhanced aperiodic responsiveness in hard real-time environments. IEEE Transactions on Computers, 44(1), January 1995.

[SRL94] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Generalized rate-monotonic scheduling theory. Proceedings of the IEEE, 82(1), 1994.

[Ste75] G. R. Steele, Jr. Multiprocessing compactifying garbage collection. Communications of the ACM, 18(9), September 1975.

[Wad76] P. L. Wadler. Analysis of an algorithm for real time garbage collection. Communications of the ACM, 19(9), September 1976.

[Wel04] Andy Wellings. Concurrent and Real-Time Programming in Java. Wiley, September 2004. ISBN 0-470-84437-X.

[Wil92] Paul R. Wilson. Uniprocessor garbage collection techniques. In Yves Bekkers and Jacques Cohen, editors, International Workshop on Memory Management, number 637 in Lecture Notes in Computer Science, pages 1–42, St. Malo, France, September 1992. Springer-Verlag.

[WJNB95] Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Dynamic storage allocation: A survey and critical review. In Proc. 1995 International Workshop on Memory Management, Kinross, Scotland, September 1995.

[XX05] Yuqiang Xian and Guangze Xiong. Minimizing memory requirement of real-time systems with concurrent garbage collector. ACM SIGPLAN Notices, 40(3), 2005.

[YSaSC02] Qian Yang, Witawas Srisa-an, Therapon Skotiniotis, and J. Morris Chang. Java virtual machine timing probes – a study of object life span and GC. In Proceedings of 21st IEEE International Performance, Computing and Communications Conference (IPCCC), Phoenix, Arizona, April 2002.
