
Debugging Distributed Ada Programs

J S Briggs

S D Jamieson

G W Randall

I C Wand

Real-time and Distributed Systems Research Group
Department of Computer Science

University of York

June 1994

ABSTRACT

This is the final report on the PAPA (Distributed Ada Debugging) project (MoD Contract number: NUW72D/1090).

The project’s objective was to determine the requirements of a tool to support the testing and debugging of a distributed system implemented in Ada and produce a specification for such a tool. The work was to support the SMCS project being conducted by BAe Sema on behalf of the Ministry of Defence.

In this report we describe the work we have done during the project, and show how a tool can be constructed which collects traces produced by an Ada tasking run-time system and displays them to the user in a range of formats. A prototype tool, RELATE-2, has been developed to demonstrate some of our findings.

Table of Contents

1. INTRODUCTION
   1.1 Debugging
   1.2 Debugging distributed systems
   1.3 Previous work at York
   1.4 The SMCS project
      1.4.1 Debugging SMCS
      1.4.2 Problems encountered by BAe Sema
   1.5 Objectives of this project
   1.6 Format of this report
2. THE DEBUGGING PROCESS
   2.1 An overview of Cognitive Psychology
   2.2 Cognitive models of the debugging process
      2.2.1 Hypothesis refinement
      2.2.2 Hypothesis refinement techniques
   2.3 A new taxonomy of debugging refinement strategies
   2.4 Summary
3. REQUIREMENTS OF AN ADA DEBUGGING TOOL
   3.1 RELATE-1
      3.1.1 The level 1 view
      3.1.2 The level 2 view
      3.1.3 Features for manipulating the views
      3.1.4 Problems with RELATE-1
   3.2 RELATE-2 – further views of the program trace
      3.2.1 Application-level view
      3.2.2 Time-line view
      3.2.3 Features for manipulating the views
   3.3 Summary
4. AN INSTRUMENTATION MECHANISM FOR TRACING ADA TASKS
   4.1 RELATE instrumentation requirements
      4.1.1 Creation and Termination
      4.1.2 Activation
      4.1.3 Rendezvous
      4.1.4 Remote procedure call
      4.1.5 Abort
      4.1.6 Exception handling and delay
   4.2 General monitoring requirements
      4.2.1 Instrumentation techniques
      4.2.2 Organising execution observations
      4.2.3 Implementing replay
   4.3 Summary
5. CONCLUSIONS
   5.1 Summary
   5.2 Application to SMCS
   5.3 Further work
      5.3.1 Need to evaluate debugging method
      5.3.2 Integration of event-based debugging with breakpoint debugging
      5.3.3 Debugging at the design level
      5.3.4 Generalising the trace display
   5.4 Conclusion

Appendix A. SMCS system description

REFERENCES


Chapter 1

INTRODUCTION

This report describes the work of the Distributed Ada Debugging project (codenamed ‘‘Papa’’) at the University of York. The project was funded by the UK Ministry of Defence, specifically the Director General Submarines (DGSM), to provide technical advice which could be used to the benefit of the SMCS (Submarine Control System) project being conducted on behalf of DGSM by BAe Sema Ltd.

This introduction will first of all define what we mean by debugging, and explain why it is hard to do in the context of a distributed system. We will then make some observations about the SMCS project and the problems it has encountered. Finally the outline of the remainder of this report will be described.

1.1. Debugging

The ANSI/IEEE glossary defines debugging as ‘‘the process of locating, analysing and correcting suspected faults’’, where fault is defined as an accidental condition that causes a program to fail to perform its required function. It is an important part of the development cycle of software in particular and systems in general. The ‘‘traditional’’ approach to debugging usually involves executing the program, stopping it at some point to examine the values in variables or to see which procedures are active, then either continuing to a further breakpoint or re-executing the program from the beginning in order to stop it at an earlier point.

Debugging is largely a hypothesis testing process. The programmer has a hypothesis as to what he thinks may be wrong with the program and he attempts to find evidence to support it. The process may be long and repetitive – firstly because it may take several executions of the program in order to test one hypothesis, and secondly because many early hypotheses may be wrong and new ones will be created as evidence mounts. Evidence, both in the form of information obtained from the program and the conclusions drawn from that information, may be massive and so pose problems of organisation and storage. Traditional debugging is therefore a time-consuming process and is intellectually difficult because of the need to continually develop hypotheses about the program’s behaviour and to abstract from large amounts of data.

1.2. Debugging distributed systems

Debugging distributed systems is even more difficult than debugging single processor systems. According to an extensive review of work in the area of debugging concurrent programs [21], there are four main problems.

1. The problem of scale. Distributed systems can be very large, so increasing the complexity of the program and hence the complexity of debugging it. This is not a problem exclusive to distributed systems, and techniques of abstraction drawn from other areas can be applied.

2. The ‘‘probe effect’’ (sometimes referred to as intrusion). It is frequently the case that the attempt to find out more information about the program may itself contribute to or obscure the erroneous behaviour. For example, the time taken to execute debugging operations may affect the time to perform the program itself. The mere action of switching on the debugging mode of a program may cause a bug to disappear, or reveal a bug that was not previously apparent. For this reason, systems may have to be implemented in such a way that the debugging mechanism is always live (i.e. is present in the production version of the system), since the behaviour of the system without the debugging mechanism is, by definition, not ‘‘debugged’’.

3. Non-repeatability. Because of race conditions within the program, the bug may not necessarily recur if the program is re-executed. It is often necessary to duplicate exactly the conditions under which the program was originally run, but this is not always possible, especially where the program relies on complex combinations of external stimuli.

4. The lack of a synchronised global clock. It is often necessary to relate chronologically events which occur on different processors. Although techniques for the implementation of a global clock exist, by their very nature they often intrude into the working of the system or affect the repeatability of the bug. However, many systems provide adequate information for debugging without a global clock.
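One family of techniques for ordering events without a synchronised physical clock uses logical clocks. As an illustration only (this sketch is not from the report, and is not how any particular Ada run-time system implements it), a Lamport-style logical clock orders each receive after its matching send, giving a consistent partial order across processors:

```python
# Sketch of Lamport logical clocks: a consistent partial ordering of
# events across processors without any synchronised global clock.
# Class and event names are invented for illustration.

class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0          # local logical clock
        self.events = []        # (timestamp, description) pairs

    def local_event(self, description):
        self.clock += 1
        self.events.append((self.clock, description))

    def send(self, description):
        self.clock += 1
        self.events.append((self.clock, "send " + description))
        return self.clock       # the timestamp travels with the message

    def receive(self, sent_at, description):
        # A receive is ordered after the corresponding send.
        self.clock = max(self.clock, sent_at) + 1
        self.events.append((self.clock, "recv " + description))

p, q = Process("P"), Process("Q")
p.local_event("compute")
t = p.send("m1")
q.receive(t, "m1")             # q's clock jumps past p's send timestamp
q.local_event("compute")
assert q.events[0][0] > t      # receive is ordered after the send
```

Note that such clocks give only a partial order (concurrent events on different processors may carry unrelated timestamps), which is one reason the intrusion and repeatability caveats above still apply.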

A further problem of debugging distributed systems is that the above may be compounded when different parts of the program are written in different programming languages and/or are run on heterogeneous processors. This suggests a requirement for a debugging system that is both language and machine independent. Machine independence suggests run-time system independence as a corollary.

It is possible to debug distributed programs in a ‘‘traditional’’ way. By considering each process in isolation, conventional debugging techniques may be applied to discover algorithmic errors within those processes. However concurrent programs, whether they are run on a single processor or on a distributed system, are affected by a class of errors which are not applicable to single-process programs. These errors concern the ways in which processes interact with each other and with external interfaces. In a distributed system these interactions are often timing dependent, so introducing further potential for error. Distributed debugging is therefore not simply determining the logical correctness of the program, but also its timing correctness.

1.3. Previous work at York

The Papa project was set up in the context of much previous work at York involved with the Ada programming language and distributed and real-time systems. Three projects in particular deserve mention because of their relevance to this work.

From 1979 to 1986 a major project was undertaken to develop the York Ada Workbench Compiler [13, 27]. This produced the first validated Ada compiler in the UK, and established the reputation of the Department as a centre of expertise in the language. The York Ada compiler is used extensively both at York and elsewhere, and forms the foundation for much later work.

Secondly work was undertaken which developed a model of distributed Ada programs. The Ada 83 language definition does not explicitly address issues of distribution, but there are ways in which distribution can be imposed on top of the existing facilities. One of these is the so-called ‘‘virtual node’’ approach [16]. In this the program is seen as a collection of virtual nodes, each corresponding to a set of library unit packages. Each virtual node is located on one processor, though the same processor may house more than one virtual node. Each virtual node can have its own tasks, but communication between nodes is achieved by means of a remote procedure call mechanism rather than a task rendezvous mechanism.
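The shape of the virtual node model can be sketched in a few lines of language-neutral Python. The node names, the `rpc` helper and the exported procedure below are invented for illustration and are not taken from the York work:

```python
# Sketch of the "virtual node" model: the program is a collection of
# virtual nodes; several nodes may share a processor; cross-node
# communication uses remote procedure call, not task rendezvous.

class VirtualNode:
    def __init__(self, name, processor):
        self.name = name
        self.processor = processor   # several nodes may share one processor
        self.exports = {}            # procedures callable from other nodes

    def export(self, proc_name, fn):
        self.exports[proc_name] = fn

class Program:
    def __init__(self):
        self.nodes = {}

    def add(self, node):
        self.nodes[node.name] = node

    def rpc(self, node_name, proc_name, *args):
        # Stands in for marshalling, network transfer and unmarshalling.
        return self.nodes[node_name].exports[proc_name](*args)

prog = Program()
sensors = VirtualNode("sensors", processor=1)
display = VirtualNode("display", processor=1)   # same processor is allowed
sensors.export("read_depth", lambda: 42)
prog.add(sensors)
prog.add(display)
assert prog.rpc("sensors", "read_depth") == 42
```

The point of the sketch is the restriction it encodes: a node never touches another node's tasks directly, so all inter-node traffic passes through one mechanism that a tracing tool can observe.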

Finally, an MEng project student, Andrew Cobbett, produced a system called RELATE [12] (Replay of Extremely Large Ada Tasking Environments) which can be used to display information about the tasking activity taking place in an Ada program. It displays a trace of inter-task events in a graphical form showing the status and both static and dynamic inter-relationship between tasks in the program. At runtime, the executing program creates a file of messages: one message for each inter-task event. RELATE can then be used to replay portions of the program, illustrating the status of each relevant task. Changes of task state are encoded in a manner that makes the execution of the trace reversible, so allowing the user of RELATE to go backwards through the execution of the program as well as forwards. More detail on Cobbett’s RELATE will be found in chapter 3.
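One way such a reversible encoding can work (a sketch of ours, not Cobbett’s actual message format) is to record, with each state change, the state it replaces; replay can then step in either direction:

```python
# Sketch of a reversible trace: each event stores (task, old_state,
# new_state), so stepping backwards just restores the old state.
# Task names and states are invented for illustration.

class Replayer:
    def __init__(self, initial_states):
        self.states = dict(initial_states)   # task name -> current state
        self.trace = []                      # (task, old_state, new_state)
        self.position = 0                    # replay cursor into the trace

    def record(self, task, new_state):
        self.trace.append((task, self.states[task], new_state))
        self.states[task] = new_state
        self.position = len(self.trace)

    def step_back(self):
        self.position -= 1
        task, old, _new = self.trace[self.position]
        self.states[task] = old              # reversible: restore old state

    def step_forward(self):
        task, _old, new = self.trace[self.position]
        self.states[task] = new
        self.position += 1

r = Replayer({"producer": "running", "consumer": "blocked"})
r.record("consumer", "in rendezvous")
r.record("consumer", "running")
r.step_back()
assert r.states["consumer"] == "in rendezvous"
r.step_back()
assert r.states["consumer"] == "blocked"
r.step_forward()
assert r.states["consumer"] == "in rendezvous"
```

The cost of reversibility here is one extra field per event; in exchange, the replayer never needs to re-execute the program from the beginning to reach an earlier point.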

1.4. The SMCS project

The SMCS project is being conducted by BAe Sema Ltd to design and implement a command and control system for use in Royal Navy submarines. It has encountered numerous problems in its development, some of which are related to the problems of debugging a large distributed system.

Some salient details of the SMCS system can be found in appendix A to this report.

1.4.1. Debugging SMCS

Debugging SMCS is mainly carried out in traditional ways using a breakpoint debugger (the TeleSoft one). When a bug is detected, the normal thing to do would be to re-execute the system, setting breakpoints in the suspect task. At the breakpoint, the debugger can be used to examine the values of variables and to set further breakpoints. The debugger connects to the cards through an Intel proprietary interface that relates to the object module format.

The major problem in debugging is the length of time it takes to do anything. The critical path activity is perceived to be the length of time needed to rebuild the system, with the performance of the debugger second. Rebuild times are of the order of hours, and debugger response (at least for the first request to read the value of a variable) of the order of minutes.

Aside from the ability to set breakpoints within processes, there is some tracing capability based on information that the TeleSoft Exec outputs.

Software Analysis Workstations (SAWs) are available. These provide non-intrusive probing of the bus traffic between a processor and the rest of its card. However there are only a small number of them available, so a bug has to be narrowed down to the appropriate card before one can be used.

When the system goes operational, built in to it will be facilities for dumping various information whenever the system goes down. This may include data areas, task state information and trace buffers. This can be achieved by doing a warm reboot, whereby the system can be restarted without re-initialising the bulk of data. However there is a conflict between dumping more information and getting the system operational again quickly. While the developer might want as much information as possible recorded to investigate the bug, the submarine captain might have urgent motives to have the system working again as soon as possible! A black box recorder for the submarine was one idea under consideration.

Looking to the future, the SSCS (Surface Ship Command System) is already under development (also at BAe Sema). After that the next generation of systems will require more processing power, since they will include such things as knowledge-based systems. More off-the-shelf software (e.g. database systems) will be used, so there will be an increasing need to accept the software provider’s interfaces as they come. This means that we will no longer be able to add, say, a level of monitoring output beyond that supplied by the provider.

1.4.2. Problems encountered by BAe Sema

The problems being encountered by BAe Sema in their development of SMCS are typical of the general problems of debugging large distributed systems. From what we have been told about the project, and from our observations of their work during a brief visit, we surmise that the major problems they are encountering are:

a) when a bug is detected, the size and complexity of the system militate against locating it easily;

b) the debugging tools that are available work at the source language level, whereas much of the software of the system is produced automatically from JSD designs – which is therefore the level at which most programmers comprehend the system;

c) it is only possible to debug one node of the system at a time – therefore it is difficult to obtain the ‘‘big picture’’ of how the system is operating;

d) the existing debugging tools are relatively slow;

e) the existing debugging tools require the system to be re-executed from the beginning inorder to examine state at an earlier point.

The bottom line with all these problems is the amount of time that programmers have to sit around unproductively, waiting either for the system to do something or for the debugging tools to provide them with some information. Either we must improve the efficiency of these tools, or we must raise the tools to a higher level at which they fit in more effectively with the debugging process that goes on in their users’ minds.

1.5. Objectives of this project

The initial objectives of this project were:

(i) to implement a distributed debugging monitoring mechanism;

(ii) to investigate the integration of ‘‘traditional’’ and ‘‘event-based’’ debugging analysis tools and their application in a distributed system;

(iii) to investigate the problems of presenting large amounts of information relating to debugging distributed systems.

These were to be carried out in the context of providing advice and ideas that might be of use and benefit to BAe Sema and the SMCS project. We will show that the objectives have broadly been met.


1.6. Format of this report

In chapter 2 we discuss the debugging process from the viewpoint of cognitive psychology and describe the various strategies that debugging users (both novice and expert) may adopt to develop and refine hypotheses of program behaviour.

Chapters 3 and 4 describe the prototype system we have built. Chapter 3 looks at how the debugging process can be viewed from the user’s perspective, and how the facilities we provide fit into the model of the debugging process described in chapter 2. Chapter 4 looks at the Ada-related features of the system and shows how a tracing mechanism can be implemented in an Ada run-time system.

Chapter 5 draws some conclusions from the above and discusses the directions in which we believe debugging research in the future should go.

In this report, we use the phrase ‘‘debugging user’’ to signify the person (or persons) who is attempting to debug a program. In this way we distinguish that role from that of ‘‘debugger’’, which is the name often given to the software tool that the debugging user uses, and from that of ‘‘user’’, which could be construed to mean the end user of the program that is being debugged.


Chapter 2

THE DEBUGGING PROCESS

Little has been published on what constitutes the process of debugging. Thus it is perhaps the one element of the software engineering life-cycle that has no theoretical basis or methodological approach. The authors adopt the pragmatic view that by encouraging and formalising those processes normally associated with ‘‘expert’’ behaviour, it should be possible to improve debugging performance.

This chapter attempts to develop a coherent model of human behaviour in debugging situations, based on the results of psychological analyses reported in the literature, with a view to distilling the representational and information requirements thereof. It first of all gives a very brief overview of the subject (from a psychology perspective), then describes the different strategies which have been identified in the literature as being applicable to the debugging process. Finally, it presents a new, two-dimensional taxonomy of these strategies.

2.1. An overview of Cognitive Psychology

Cognitive psychology is the study of the mechanisms by which mental processes are carried out, and the kinds of knowledge required for each process. Essentially it is an attempt to understand the nature of human intelligence and how people think.

Central to this branch of psychology is the information processing approach [4]. In this scheme one attempts to describe behaviour as a sequence of mental operations that transform a perceived stimulus (usually visual, auditory or tactile) or a desired goal into a series of motor actions that control the movements of arms, fingers, eyes and the vocal system.

Figure 2.1 – The information processing approach
[Diagram: sensory input feeds an information processor, which produces motor and speech output.]

The human information processing system makes demands on the person’s storehouse of facts, procedures and experiences in order to solve the problem at hand. It is still unclear how the human cognitive system is organised and controlled. It is attractive to think of the brain’s function as being analogous to a digital computer; however, this compartmentalisation is not likely to be useful given current computer architectures. An aspect of cognitive processing that has been given considerable attention is that of human memory. In essence there are three memory modules:

- 9 -

(i) perceptual memory – a fleeting store of visual, auditory and other sensory information, the persistence of which is not consciously controllable;

(ii) short-term memory – limited in the number of chunks of information† and in persistence; it is used in the execution of conscious mental tasks;

(iii) long-term memory – finite in capacity but capable of storing vast amounts of information of different types for long periods of time.

However, the process of organisation itself requires a finite amount of resources. Research in this field suggests that when the information we are required to assimilate exceeds our capacity to organise and store it (e.g. there is too much of it or it arrives too fast), the task at hand will become error prone or even impossible.

Figure 2.2 – The cognitive triad
[Diagram: a triad linking the real world, a representation, and the human agent.]

By introducing an appropriate representation into the cognitive environment (i.e. one in which the information is already arranged in meaningful units), we alleviate the burden of organisation and in effect improve our cognitive capacity. However if the information is not organised in a similar format to that required for the immediate problem, the process of translation may exceed our cognitive capacity and thereby render the process error prone.

Similarly, if we can introduce a representation whose organisation is appropriate for a model associated with a more effective process (the ‘‘right abstraction’’) we may encourage the use of that process.

2.2. Cognitive models of the debugging process

In order to establish the most appropriate way to structure information for the purposes of debugging we must first understand how people debug.

† A ‘‘chunk’’ of information is where several related pieces of information are grouped together and then considered as a single piece. For example the 10 digits of a telephone number might be ‘‘chunked’’ into four digits of STD code and two three-digit or three two-digit chunks for the line number.


2.2.1. Hypothesis refinement

Modern debugging philosophies [5] generally accept that the debugging process is best represented by a hypothesis refinement model – see Figure 2.3.

Figure 2.3 – Hypothesis refinement
[Diagram: an initial hypothesis set, derived from the error report, feeds a cycle of hypothesis selection, hypothesis verification and hypothesis set modification; the cycle repeats until the bug is located.]

According to this, developers generate an initial hypothesis set based on the error report, their present knowledge of the program and more general programming heuristics. This hypothesis set describes the discrepancy, as well as the bug† which is supposedly responsible for causing the defect in question.

The hypothesis set is then iteratively modified through a continual process of selection and refinement until the current hypothesis is either verified or refuted.
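The cycle in Figure 2.3 can be rendered as a short sketch (an illustrative reading of the model, not an algorithm from the literature); here `verify`, `modify` and `select` stand in for the developer’s own judgement:

```python
# Illustrative rendering of the hypothesis refinement model: start from
# an initial hypothesis set derived from the error report, then repeatedly
# select, verify and, on refutation, modify until a hypothesis is confirmed.

def refine(initial_hypotheses, verify, modify, select=max):
    hypotheses = set(initial_hypotheses)
    while hypotheses:
        candidate = select(hypotheses)       # e.g. most plausible first
        if verify(candidate):
            return candidate                 # bug located
        hypotheses.discard(candidate)        # refuted: drop it ...
        hypotheses |= modify(candidate)      # ... and add refined replacements
    return None                              # hypothesis set exhausted

# Toy usage: hypotheses are suspect line numbers; the bug is at line 7.
found = refine(
    initial_hypotheses={3, 12},
    verify=lambda h: h == 7,
    modify=lambda h: {7} if h == 12 else set(),
    select=max,
)
assert found == 7
```

The sketch makes one property of the model explicit: every unit of work happens inside the loop, so anything that speeds up verification or sharpens modification (the subject of the strategies below) speeds up debugging as a whole.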

Several studies have concluded that this is indeed a faithful model of the debugging process, but it is insufficient for our purposes because it does not explain why an expert finds bugs faster than a novice. Firstly it does not account for the finding that debugging ability is closely related to chunking ability [15]. (This is no surprise when we consider the volume of data debugging users are required to utilise in conjunction with the STM information bottleneck.) Nor does it account for the finding that debugging ability correlates strongly with comprehension ability. Secondly it does not form a sufficient basis on which to evaluate the adequacy of different representations of debugging information.

† Carver and Klahr [20] define discrepancy as being the difference between the program plan and the program output (behavioural), and bug as the cause of the discrepancy (software).


2.2.2. Hypothesis refinement techniques

To develop a more detailed understanding of debugging, we must consider the techniques (or cognitive processes) and knowledge (or cognitive structures) employed in generating and verifying debugging hypotheses.

Gugerty and Olson [15] identify three classes of mechanism employed (often together) during the refinement process.

(i) Comprehension based strategies. In this mechanism the mismatch between the program plan (what the program is supposed to do) and the program output (what the program actually does) is used to identify candidate bug locations. The debugging user attempts to construct a mental model of how the program does what it is supposed to be doing.

(ii) Topographic strategies. Clues in the program output are used to locate the bug. The debugging user tries to narrow down the set of places in the program where the bug could occur.

(iii) Symptomatic strategies. Prior knowledge (i.e. experience) is used to identify bugs with similar properties and thereby locate the bug. Unfortunately little research has been conducted in this area.

These strategies are generally applicable to fault diagnosis scenarios and we can illustrate the distinctions between them by reference to car maintenance.

- Comprehension: The mechanic might build up a hierarchical mental model of a car by decomposing it into its constituent subsystems and parts. If the real car doesn’t fit with the model in some respect, then that aspect of the car is a candidate for the location of the fault.

- Topographic: The mechanic would attempt to localise the problem by observing the car’s behaviour, e.g. an indicator light is not working and subsequent testing of the electrical subsystem traces the fault to a loose bulb connector.

- Symptomatic: In this approach the mechanic might identify a form of defective behaviour with a particular fault, e.g. if he were to hear screeching he might associate this with fan-belt slippage, clutch slippage, or brake-pad wear depending on the conditions associated with the sound.

Program comprehension is generally accepted to be a fundamental skill in many software development activities, and not limited to debugging. Because of its importance many theoretical and practical studies of programmer behaviour have focussed on this strategy and the models constructed therein.

Wiedenbeck [30] identifies three classes of comprehension mechanism:

- Bottom-up comprehension. In studies such as those of Shneiderman and Mayer [25], and more recently Basili and Mills [6], it was found that by studying and interpreting small sections of code, and by subsequently coalescing these into continually higher and higher units of abstraction, developers were able to construct layered representations of program knowledge in a bottom-up fashion.

- Top-down comprehension. In Jeffries’ experiments on programmer behaviour [17], it was reported that by reading the program in execution sequence (rather than line by line) debugging users developed hierarchical models based on program structure. Consequently it was concluded that well-structured programs facilitate improved comprehension.


- Hypothesis-driven comprehension. In Brooks’ investigation of the comprehension process [8, 9], a second type of top-down process was described. Hypotheses were generated and iteratively refined by skimming through the documentation for ‘‘beacons’’ (stereotypical sections of code recognisable from previous experience) whose presence or absence served as strong indicators of program function, subject to the domain knowledge of the developer (i.e. this process becomes more effective with increasing experience). This model is particularly interesting for two reasons:

  (i) it bears a strong similarity to the debugging process itself; and

  (ii) its use of static stereotypical features gives us some idea of the processes that must be taking place in symptomatic (behavioural) refinement strategies.

Since topographic strategies are peculiar to activities such as program maintenance and debugging, far less research has been carried out in this area. However typical strategies include:

• Program modification. This involves instrumenting the program itself to produce program output that will help to localise the cause of defective behaviour.

• Slicing [28, 29]. In this method a ‘‘slice’’ is derived by identifying those components of the program which could interact with the component which is producing the defective behaviour. Occasionally several slices are combined through intersection to form a ‘‘dice’’ and thereby further localise the bug.

• Backtracking [1, 2, 22]. This technique produces what is effectively a subset of a slice by ‘‘back-stepping’’ through the execution history and examining the interaction of a program component with that which produced the defective behaviour.

In essence topographic searches are characterised by activity which attempts to isolate the sphere of influence of the bug by relating behaviour to the static software product, thereby reducing the cognitive complexity of the analysis. Considerable interest has been shown in developing highly focussed debugging environments based on dynamic slicing [3], a technique which combines backtracking and slicing.
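To illustrate the idea (this is a sketch of the general technique, not of any particular published algorithm), a dynamic slice can be computed by walking an execution trace backwards from a defective value and collecting exactly those steps that contributed to it. The trace layout and names below are invented for this example:

```python
# Minimal sketch of dynamic slicing over a recorded execution trace.
# Each trace entry records the statement executed, the variables it
# defined and the variables it used. Walking the trace backwards from
# the defective output collects exactly those entries that contributed
# to it.

def dynamic_slice(trace, target_var):
    """Return indices of trace entries that influence target_var."""
    wanted = {target_var}          # values still to be explained
    in_slice = []
    for i in range(len(trace) - 1, -1, -1):
        stmt, defined, used = trace[i]
        if defined & wanted:       # this entry produced a wanted value
            in_slice.append(i)
            wanted -= defined      # ...so those values are explained
            wanted |= used         # ...but its inputs now matter
    return sorted(in_slice)

# x := 1; y := 2; z := y * 2; x := x + z   -- the assignment chain
# through z pulls 'y := 2' into the slice of x
trace = [
    ("x := 1",     {"x"}, set()),
    ("y := 2",     {"y"}, set()),
    ("z := y * 2", {"z"}, {"y"}),
    ("x := x + z", {"x"}, {"x", "z"}),
]
print(dynamic_slice(trace, "x"))   # -> [0, 1, 2, 3]
```

The same walk restricted to z would keep only entries 1 and 2, which is the pruning effect that makes slices useful.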

While all the processes described above tend to generate hierarchical representations of program knowledge consisting of units of ever higher abstraction (as we might expect considering the advantages of chunking described earlier), the application of such models has been found to vary significantly according to the experience of the user.

• Novice models. Studies have shown that novice debugging users tend to deal with complexity by employing an ‘‘as-needed’’ approach. In smaller programs this leads to the construction of a weak mental structural model, grouped according to similar language constructs. However as complexity increases, this mode breaks down, and novices then become entirely reliant on topographic and symptomatic isolation strategies such as program modification.

• Expert models. Experts, on the other hand, tend to use a more systematic comprehension process, developing a network of procedural or ‘‘goal-plan’’ descriptions of the target program capturing both structural and causal aspects of program composition. Having identified candidate faults on this basis, hypothesis refinement through comprehension is augmented by topographic and symptomatic activity. This tends to reduce the need for invasive strategies and frequently allows experts to remove several defects simultaneously.

2.3. A new taxonomy of debugging refinement strategies

If we compare Wiedenbeck’s comprehension classification and Gugerty and Olson’s classification, we observe that symptomatic and beacon-based comprehension have in common that they both exploit prior knowledge to short-circuit the fault location process. Consequently we identify two dimensions among hypothesis refinement strategies (Figure 2.4):

• dynamism: whether we operate on program behaviour (dynamic), or on some structural aspect of the software product (static, i.e. code/design/requirements); and

• familiarity: whether experience of a situation allows us to make intuitive leaps between levels of abstraction (pre-conceptualised), rather than having to build our own abstractions (post-conceptualised).

Figure 2.4 – Classification of debugging strategies

[Figure: a two-by-two classification with axes static/dynamic and pre-/post-conceptualised: beacons (static, pre-conceptualised), comprehension (static, post-conceptualised), symptomatic (dynamic, pre-conceptualised) and topographic (dynamic, post-conceptualised).]

This model has been developed from the results of empirical studies of debugging behaviour in, by-and-large, uni-processor environments. However, we may surmise that this is likely to remain a faithful model in the context of distributed systems since:

a) If anything, the cognitive bottleneck associated with short-term memory is likely to worsen as the volume and complexity of information we are required to manipulate increases. Accordingly, we are likely to see the importance and usefulness of layered causal models of program structure increase.

b) Similarly, multiple threads of control will complicate topographical strategies, but by adopting them we are likely to be able to eliminate large portions of defect-irrelevant code.

c) The volume of data may obscure the existence of stereotypical features, so it will be increasingly necessary to employ pre-conceptualised strategies to reduce the complexity.

2.4. Summary

In order to improve debugging performance by encouraging those strategies associated with expert behaviour, we primarily want to provide support for a systematic approach to debugging in which we allow the user to:

• develop hierarchical, causal frameworks through conventional comprehension (primarily top-down models) and beacon-based comprehension (identifying stereotypical sections of code); and

• isolate defective constructs through topographical activities (i.e. we need to support slicing, backtracking and dynamic slicing) and symptomatic activities.

In general we find that these processes are characterised by a capacity to reduce information complexity and thereby make debugging easier to do, rather than by actually recognising the bugs for us.

In the next chapter we look at how these processes can be supported by the facilities of a debugging tool.

Chapter 3

REQUIREMENTS OF AN ADA DEBUGGING TOOL

In Chapter 2 we discussed the debugging process from a cognitive psychology point of view. In this chapter we look at how some of the features of a debugging tool contribute to that process, and describe the prototype tool that we have developed.

In the literature, the key point that is made is that the debugging system should be transparent. Goldszmidt et al [14] suggest that a debugging facility may be deemed transparent when ‘‘the debugger itself does not affect the computation of the program being debugged’’. This is behavioural transparency – the debugging user does not have to modify their source code in order to debug the program, nor does use of the debugging tool alter the behaviour of the program (the ‘‘probe effect’’), nor does the program behave differently when its execution is repeated. We should also desire transparency of use: the wish that the debugging tool does not make intense demands of its user that detract from their ability to apply their mental processes to the problem to be solved.

3.1. RELATE-1

The work of the Papa project is based upon that of a previous project carried out by Andrew Cobbett [11, 12]. Its aim was to provide a graphical representation of (potentially large numbers of) Ada tasks at runtime. Called RELATE (Replay of Extremely Large Ada Tasking Environments),† Cobbett’s system consisted of two parts: a modified runtime system and a post-runtime replay tool (see Figure 3.1 overleaf). By augmenting the runtime system of the York Ada compiler with a set of probes, Cobbett was able to record a trace of application activity in terms of a series of task state transitions or events which was consistent with a (simplified) version of the standard Ada task state model. The system was applied only to programs running on a single processor.

† In this document we will refer to Cobbett’s work as RELATE-1 to distinguish it from our later work, which we will call RELATE-2.

Figure 3.1 – RELATE architecture

[Figure: the Ada application runs on the modified run-time system, which writes an event stream to a trace file; the replay tool then reads the event stream back from the trace file.]

To display the information contained in the event trace file, Cobbett devised two views which he termed level 1 and level 2.

3.1.1. The level 1 view

The level 1 view is based on the Ada language rules that state that every task has a parent task (the task that caused it to be created) and that also every task has a task upon which it is dependent (the task in whose scope it is declared). (In many cases the parent of a task and the one on which it depends are the same.) This gives us two parallel hierarchies of task structure that each lends itself to being depicted as a tree diagram.

Figure 3.2 – RELATE-1 screen-shot showing the level 1 view and an information box

Figure 3.2 shows a parental tree of six tasks (one is the parent of the other five). In this view, each task is represented by a symbol which denotes its state at the ‘‘current’’ point in execution. An exception has just been raised in one of the tasks while in rendezvous. The information box to the bottom right shows some details about that task.

A replay facility can be used to step through the event trace either forwards or backwards. This is controlled by the buttons and time bar at the bottom of the screen. At each step of the trace, the symbol for a task changes if the corresponding event changed its state.

This view gives the debugging user a picture of the tasking state of the target system at any point in its execution. It is therefore possible to examine the state of the system, say at the point where the symptoms of a bug were detected, and then work backwards looking for a state that appears anomalous.

3.1.2. The level 2 view

Cobbett felt that the level 1 view alone did not provide a sufficiently detailed characterisation of tasking behaviour. For example, during rendezvous there is no direct link on the display between the entering task and the accepting task. In an earlier project at York, a system called GRAP [24] had been developed to animate the interaction of small numbers of tasks using an amended version of the Booch notation [7]. An example of this is illustrated in Figure 3.3.

Figure 3.3 – The level 2 view

[Figure: a Booch-style diagram of four tasks (Interface_Manager, Calculator, Input_task, Layout_task), annotated with task identifiers and entry queue counts, in states such as in-rendezvous, blocked on accept, blocked on completion and blocked on select-with-terminate.]

Cobbett proposed to use such a mechanism to represent in more detail the activity of tasks selected from the level 1 display. He called this the level 2 view but was not able to implement it in the time available to him. Obviously, the level of detail contained within each object would seriously limit the number of them that could be shown on the screen at any one time.

3.1.3. Features for manipulating the views

RELATE-1 provides a number of facilities to support the level 1 view and to allow users to manipulate it.

Navigation

Potentially, RELATE-1 could be used to examine a program consisting of a very large number of tasks. To fit all these onto the screen at the same time would require the size of the state symbols to be reduced beyond the point at which they could be recognised. Cobbett’s solution was to keep the size of the symbols fixed, and to generate a tree that might be bigger than the on-screen window. The user can control which part of this to display by moving a rectangle representing the window around a small skeletal map of the entire tree structure. This action scrolls the display in the main window.

The compressed view only shows each task as a small tick mark, but it is useful both as a means of allowing the user to specify which part of the tree he/she wants to look at, and of finding out whereabouts in the whole structure the present view is situated. Obviously in pathological cases the application may be too large even for the map facility, and for this reason RELATE allows the map itself to be scrolled in the same way as the main window.

Time bar

The user controls progress through the event trace by means of a time bar. The display can be animated by clicking one end of the bar to move forward one event, or the other end to move backward one event. Reasoning that we may be primarily interested in unusual behaviour such as exception propagation, buttons are provided to allow the user to skip forward to the next exception or backward to the previous one. The position of the time bar is mirrored by a clock which shows the actual (chronometric) time at which the event occurred relative to the start of the program.

Information boxes

The level 1 view actually shows very little information about a task. All that can be seen in the main window is the symbol denoting the state of the task at the ‘‘current’’ time controlled by the time bar. To find out the name of the task (if it has one), its unique identifier, and further details about it such as the number of tasks queued on each entry, the user can call up a textual information box. This applies to a selected task at the ‘‘current’’ time, but gets updated if the time bar moves.

Search facilities

It is sometimes necessary to search for a particular task in the task tree. This might be by name, unique identifier or by type. RELATE-1 provides a facility for this.

Multiple windows

It is sometimes convenient to have more than one view of the program on the screen simultaneously. RELATE-1 therefore provides a mechanism to create new windows and to move and resize them on the screen. In RELATE-1 the only use for multiple windows is to have separate views of the tree in the two hierarchical formats (parental and dependency) or to look at two distinct parts of a large tree structure. However if each window had its own time bar, then this mechanism could be used to compare the state of the system at different points in time.

3.1.4. Problems with RELATE-1

If we consider RELATE-1 in the context of the model of cognition presented in the previous chapter, we find that the original RELATE-1 concept (i.e. both levels 1 and 2) indeed tends to enforce the systematic approach to debugging that we wish to encourage. However we have two reservations about the tool.

Firstly, it does not support comprehension strategies very well. The structure that the tool displays to the user shows only the hierarchical parent/dependent relationships between tasks. Tasks that are related in other ways may be large distances apart on the tree diagram and there is no mechanism to allow the user to specify any other relationship. Further, the tool provides no explicit support for beacon-oriented strategies, though a user may find that he/she comes to recognise ‘‘beacons’’ in tree structures.

Secondly, it is difficult for a user to establish any behavioural context for the events that are shown. Although the time bar allows the user to animate the execution of the program, it is not easy to identify the consequences of some particular event or, working backwards, the events that contributed to the occurrence of an event. Without such behavioural context we are unable to employ symptomatic strategies effectively.

It is with these thoughts in mind that we move on to describe the work carried out during the Papa project to extend RELATE-1. In the next section we will discuss the main features of this new version, which we call RELATE-2.

3.2. RELATE-2 – further views of the program trace

Cobbett’s RELATE-1 provided only one view of the event trace. Our work has added two further views which provide enhanced support for the strategies described in Chapter 2. The three views together provide a hierarchy of functionally distinct levels that enable the user to look at the program at different levels of abstraction.

The first, and highest, level is what we term the application-level view. This is a view of the program as a whole and is based on the model of distribution that we have adopted. The view shows each of the virtual nodes as a single entity, and allows us to show some details of the interaction between them. This view is described in more detail in section 3.2.1.

The second level is what we term the node-level view, and is simply Cobbett’s level 1 view confined to the tasks running on a particular virtual node. Because it has been described above (and in Cobbett’s publications) we will not dwell further on it here.

The third level is what we term the timeline view. In this we display task activity in a chronologically-ordered form that allows inter-task activity to be highlighted, and the timing relationships between events to be shown. This view is described in section 3.2.2.

3.2.1. Application-level view

In trying to debug a distributed system, it is obviously beneficial to have a view of the program that corresponds to the distribution of the program. At this level we would expect to be able to examine the interactions between the different components of the system. This can be used to identify bugs such as bottlenecks in communication, missing communication, or unwanted communication.

An example application-level view

The application-level view in RELATE-2 displays the application program according to its partition into virtual nodes. Each node is shown as a circle (the choice of symbol is arbitrary). The representation is opaque since no details of the node are displayed.† Initially the nodes are simply drawn around the perimeter of a circle, but the user is able to move them using the mouse and can arrange them in any way, including perhaps one that reflects some physical arrangement of them. The user can also specify that a number of nodes together constitute a group and assign the group an arbitrary name. The group can then be moved around the display as a whole.

Figure 3.4 – Opaque node representation

[Figure: nodes drawn as circles, some gathered into two named groups, with arrows between nodes representing communication.]

Figure 3.4 shows several nodes, some of which have been arranged into groups. The illustration also shows arrows drawn between nodes representing communication. The exact meaning of the arrows is flexible. Among other things they could be used to represent:

• communication that is ‘‘currently’’ active (i.e. a message sent but no reply yet received);

• communication that has been completed;

• communication that has been completed recently (e.g. within some arbitrary period of time);

• potential communication (i.e. nodes which might communicate even if they have not done so).

This representation gives some support for topographic strategies since it enables the user to display the nodes in some meaningful arrangement and to visualise inter-node activity. Further support might show such detail as:

• communication load (the number of messages sent/completed);

• communication state (whether communication is complete or incomplete);

• ‘‘order’’ of communication (which messages were sent or received first).

† In our prototype absolutely no details are shown. However it is probably desirable to show at least the name or some other identifying mark to aid user navigation.

An alternative representation

Trying to include too much detail in a diagrammatic form such as that proposed above may cause the representation to become cluttered and difficult to interpret. Consequently we looked at variants of the view in which the relative positioning of the nodes is used to convey some information. One particular variant that we implemented was the clock face proximity diagram (see Figure 3.5).

Figure 3.5 – Clock face proximity diagram

[Figure: the reference node at the centre of a clock face; radius runs from maximum RPC load at the centre to zero load at the perimeter, and nodes are placed clockwise from a datum line at 3 o’clock in order of recency of communication.]

In this depiction, a node N is chosen by the user and automatically placed at the centre of the diagram. The other nodes are then positioned automatically according to the characteristics of their communication to/from N. The polar co-ordinates of radius and angle can be used to denote two independent variables associated with each node.

In the example above, radius is used to denote relative communication load, while angle denotes the chronological order in which messages were most recently received from each node. The node (or group of nodes) that has communicated with N most recently is placed at 3 o’clock, with other nodes (or groups) located in order of most recent communication in a clockwise fashion. The radius of each node from the centre of the diagram indicates the communication load between it and N. Highest load is indicated by smallest radius, with no communication denoted by an arbitrary maximum value.

This form of representation might be useful in detecting bugs associated with order of communication, or loading problems such as bottlenecks or starvation. Our choice of load and order as the variables to be displayed was arbitrary, as was the choice to display them as polar co-ordinates. A diagram with rectangular co-ordinates could show the same thing.
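The placement rule just described can be sketched in a few lines. The following Python fragment is an illustrative reconstruction, not the RELATE-2 code; the data shapes, constants and names are our own assumptions:

```python
# Sketch of the clock face proximity layout: the most recently
# heard-from node sits at 3 o'clock, later ones follow clockwise,
# and heavier RPC load pulls a node closer to the centre.
import math

MAX_RADIUS = 100.0  # arbitrary radius used for nodes with zero load

def clock_face_layout(stats):
    """stats: {node: (load, last_msg_time)} for traffic with node N.
    Returns {node: (x, y)} relative to N at the origin."""
    max_load = max(load for load, _ in stats.values()) or 1
    # order nodes by recency, most recent first
    by_recency = sorted(stats, key=lambda n: -stats[n][1])
    step = 2 * math.pi / len(by_recency)
    pos = {}
    for i, node in enumerate(by_recency):
        load, _ = stats[node]
        radius = MAX_RADIUS * (1 - load / max_load)  # high load -> small radius
        angle = -i * step        # start at 3 o'clock, proceed clockwise
        pos[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return pos

stats = {"B": (10, 42.0), "C": (5, 40.0), "D": (0, 7.5)}
layout = clock_face_layout(stats)
# B: most recent and heaviest load -> placed at the centre
print(layout["B"])   # -> (0.0, 0.0)
```

Note that D, which never communicated, lands at the arbitrary maximum radius, matching the ‘‘zero RPC load’’ perimeter in Figure 3.5.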

Grouping nodes

Our implementation of the application-level view allows the user to cluster nodes together into groups. Providing a facility to do this manually is a useful means of assisting user-defined abstraction, for example identifying related groups of nodes. Initially the layout of our representation conveys no information about the relationships between nodes at all. Moving the nodes around and grouping them together allows the user to model one.

However an automatic means of grouping is desirable, especially in cases where there are a vast number of nodes. Performing the grouping based on physical relatedness (e.g. in our distributed system two virtual nodes could be said to be physically related if they were running on the same processor) is relatively easy and can sometimes be carried out based on the results of static analysis. This is an example of a comprehension strategy.

Divining some relationship between nodes based on the event trace is more interesting from our point of view, and is an example of a topographic strategy. We implemented an automatic clustering facility to group together nodes in a way that optimised some measure of ‘‘relatedness’’. Since it is generally accepted that in principle good design is characterised by modules which exhibit high cohesion and low coupling,† we measure the relatedness of a given group of nodes by subtracting a metric for their coupling from a metric for their cohesion,‡ and used a simulated annealing algorithm to optimise the assignment of nodes to clusters.

Automatic clustering, as well as being useful in a large system where manual grouping may be too time-consuming and potentially error-prone, can also show up unintended relationships between nodes in smaller systems, and also the absence of a relation where one is expected.
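To make the approach concrete, here is a much-simplified Python sketch of annealing-based clustering. The relatedness measure (intra-cluster message count minus inter-cluster message count) is a crude stand-in for the Patel and Kunz metrics, and the swap-based neighbourhood, balanced two-way split and cooling parameters are all our own assumptions:

```python
# Simulated annealing over assignments of nodes to two clusters,
# scoring a partition by cohesion (intra-cluster messages) minus
# coupling (inter-cluster messages).
import math
import random

def relatedness(assign, msgs):
    """msgs: {(a, b): count} of messages sent from node a to node b."""
    cohesion = sum(n for (a, b), n in msgs.items() if assign[a] == assign[b])
    coupling = sum(n for (a, b), n in msgs.items() if assign[a] != assign[b])
    return cohesion - coupling

def anneal(nodes, msgs, steps=20000, temp=5.0, cooling=0.999):
    """Refine an arbitrary balanced split by swapping pairs of nodes."""
    random.seed(1)                       # deterministic for the example
    assign = {n: i % 2 for i, n in enumerate(nodes)}
    score = relatedness(assign, msgs)
    for _ in range(steps):
        a, b = random.sample(nodes, 2)
        if assign[a] == assign[b]:
            continue                     # swapping changes nothing
        assign[a], assign[b] = assign[b], assign[a]
        new_score = relatedness(assign, msgs)
        # keep improvements; keep worsenings with probability exp(d/T)
        if new_score >= score or random.random() < math.exp((new_score - score) / temp):
            score = new_score
        else:
            assign[a], assign[b] = assign[b], assign[a]
        temp *= cooling
    return assign, score

# two chatty pairs (A<->B, C<->D) plus one stray message A->C
msgs = {("A", "B"): 9, ("B", "A"): 7, ("C", "D"): 8, ("D", "C"): 6, ("A", "C"): 1}
assign, score = anneal(["A", "B", "C", "D"], msgs)
print(assign["A"] == assign["B"], assign["C"] == assign["D"])
```

The point of the sketch is only that the assignment of nodes to clusters is searched rather than enumerated; a production version would use the real metrics, allow uneven cluster sizes and tune the cooling schedule.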

3.2.2. Time-line view

The application of topographic strategies needs some depiction of behavioural context, otherwise it is difficult to identify the dynamic relationships between events that occur in the program. In a distributed system, our concern is with the representation of behavioural context (i.e. the sequence of events leading up to a state, and the sequence of events that state precipitates) and how this helps the user to develop a true sense of the parallelism of the system. We want to do this without the user’s understanding being impaired by factors such as communication overheads, relative processor speeds, or the particular total ordering of events imposed by reference to a global clock.

Cai et al [10] suggest that there are essentially two techniques by which we may graphically represent parallel behaviour.

(i) Animation techniques (i.e. displaying a series of snapshots of program state one after the other) have been used in several systems to convey parallel program behaviour. Typically these show a number of separate nodes whose interaction is represented by displaying appropriate connections as the program runs. Indeed this is precisely the type of view we have dealt with in the previous section. However animation alone is not a particularly effective technique for representing behavioural context.

(ii) Time-process diagrams are a two-dimensional display with (logical) time on one axis and a series of processes on the other (see Figure 3.6). Examples of such techniques are Lamport’s space-time diagrams [19] and Stone’s concurrency maps [26]. While these representations offer detailed and fairly compact representations of behavioural context, they suffer from high information density requirements, which limits their capacity to convey both a complete picture of application state and long event histories.

† Cohesion is defined to be the degree of functional relatedness of the constituent parts of a module, and evaluates the degree to which these elements perform a single function. Coupling is defined to be the degree of interconnection with other units, and essentially is a measure of our ability to reason about any unit in isolation. These measures are clearly related to our ability to comprehend a program.
‡ The metrics we used are due to Patel [23] and Kunz [18] and are based on the number of messages sent from each node to each other.

Figure 3.6 – General time-line format

[Figure: two horizontal lines labelled Task A and Task B drawn against a horizontal time axis.]

McDowell and Helmbold [21] have observed that these classes of view are complementary and suggest that the ideal debugging toolkit should integrate both forms of representation.

In RELATE-2 we provide a timeline view of tasking behaviour. Event traces for individual tasks are represented on separate horizontal lines. To maintain consistency and to aid ease of learning, we use the same symbology as in the node-level view (Cobbett’s original symbols plus a few new ones necessary in a distributed system). A symbol denotes the start of a new task state. The duration of the state is then represented by the distance between symbols. Periods during which a task is executable are denoted by solid lines, and all blocked periods are signified by dashed lines. Interaction between tasks is illustrated through the use of diagonal lines to represent causal interactions, and vertical lines to represent synchronous interaction such as rendezvous. Purely synchronous interaction is denoted by dotted lines, whereas interactions which involve communication (i.e. data exchange) are represented as a solid line. Figure 3.7 shows a screen shot from our tool employing the above conventions. Figure 3.8 explains the meaning of the RELATE-2 extended symbol set.

The timeline view is useful in that it enables the user to identify missing or unexpected interactions between tasks. It also gives a good sense of the parallelism of the system.

Figure 3.7 – RELATE-2 timeline view

[Figure: screen shot showing one timeline per task for A.main, A.allocator, three A.rpc_stub tasks, B.main, B.screen_server, three B.rpc_stub tasks and C.main.]

The above diagram shows a portion of the timelines of a program from the point of its start. In the diagram the names of the tasks are to the left and are in the form nodename.task_name. The task name ‘‘rpc_stub’’ is used to denote a server task created for the purposes of remote procedure call. It shows tasks being created (A.allocator and B.screen_server for example) and their parents (A.main and B.main) resuming once activation is complete. A.main makes a remote procedure call which is served by the first B.rpc_stub. When the latter terminates, A.main resumes. During the RPC call, B.rpc_stub rendezvous with B.screen_server. Three server tasks (all shown as A.rpc_stub) are created to deal with calls from a task that is not shown on the screen. They all attempt to enter A.allocator and are serviced in turn (the first one terminating just before the right hand edge of the window).

Figure 3.8 – Key to RELATE-2 symbols (see also the key in Figure 3.2)

[Figure: key mapping each timeline symbol to a task state. Standard Ada tasking symbols: created, activating, executing, completed, terminated, abnormal, delayed, exception raised, exception caught, executing abort, blocked on entry, blocked on timed entry, blocked on accept, blocked on select, blocked on timed select, blocked on rendezvous complete, conditional entry, conditional select, suspended on child activation, suspended on dependant termination, and suspended on terminate with select. York distributed Ada remote procedure call symbols: client RPC, server task creation, and server task created and suspended on library package elaboration.]

3.2.3. Features for manipulating the views

In the previous sections we have described different views of a program’s event trace. As with RELATE-1, the usefulness of these views depends a great deal on the features that we provide in the debugging tool that enable them to be manipulated. In this section we will address some of the facilities which we believe a tool should provide. Our prototype tool implements some but not all of these.

Time bar

The time bar in the context of a distributed system does not mean exactly the same as in a single-processor system, since there is normally no concept of a ‘‘global’’ time. However the time bar is still a useful means of navigation.

In RELATE-2, instead of having a single time bar, each view has its own. For the node-level view this is essential, since each virtual node has its own total ordering of events. It may lead to the user drawing a false conclusion if the events of the system as a whole are totally ordered.

Where nodes interact in a manner where their local clocks can be synchronised, their time bars could be tied together. For example, if node A receives a message from node B, then the action of incrementing the time bar for node A could be made to increment B’s time bar to the point in time when the message was sent. If the user had an active display of node B (at the node level) it would be updated correspondingly.
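A Python sketch of this tying-together rule, under the assumption that each receive event in a node’s trace records the sending node and the index of the matching send (the event record layout is invented for illustration):

```python
# Stepping node A's time bar past a receive event drags node B's bar
# at least as far as the matching send, since the send must causally
# precede the receive.

# per-node event traces; a receive names the sending node and the
# index of the send event in that node's trace
events = {
    "A": [("local", None), ("local", None), ("recv", ("B", 1))],
    "B": [("local", None), ("send", ("A", 2)), ("local", None)],
}
bars = {"A": 0, "B": 0}   # current time-bar position per node

def step_forward(node):
    """Advance node's time bar one event, tying in other bars."""
    bars[node] += 1
    kind, ref = events[node][bars[node]]
    if kind == "recv":
        sender, send_index = ref
        # the send must already have happened: snap the sender's bar
        bars[sender] = max(bars[sender], send_index)

step_forward("A")          # A's local event; B untouched
print(bars)                # -> {'A': 1, 'B': 0}
step_forward("A")          # A receives from B: B's bar snaps to the send
print(bars)                # -> {'A': 2, 'B': 1}
```

An active node-level display of B would redraw whenever its bar is moved in this way.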

Animation

We provide a facility whereby a view or views can be animated at a speed determined by the user. This enables users to play (either forwards or backwards) through sections of the event trace as a sequence of frames (like a cartoon). Of course in the timeline view the animation simply appears as scrolling as the timelines move left or right across the screen.

Information boxes

The RELATE-1 information boxes for individual tasks have been joined by an information box facility for each virtual node. These can be called up from the application-level view. An example is shown in Figure 3.9.

Figure 3.9 – RELATE-2 screen-shot showing application-level view and information box

[Figure: screen shot showing an application-level view with nodes gathered into two groups, a parental view window, and an information box for node B giving its node identifier, machine (sun707), program file, interface list and subprogram list.]

Navigation

To assist the user to know where in the overall picture he/she is, we provide a skeletal map in the timeline view. We also provide scroll bars as an alternative means of moving around.

The horizontal scroll bar of the timeline view could simply be the time bar for that view. However it might be desirable to keep them separate – the time bar used to manipulate the ‘‘current time’’ (and to drive any other views that may be active and synchronised with the timeline view) whereas the scroll bar would merely control what portion of the view the user looked at.

Manipulation of layout

It is important to provide the user with the means to alter their view to match their model of the program. We talked in the description of the application-level view about allowing the user to move the symbols representing nodes around the display so that he/she can model any inter-node relationship. Similarly in the timeline view we would want the user to be able to adjust the spacing between task state symbols (to increase or decrease the number of symbols visible), and to vary the order in which the tasks are arranged vertically (to place related tasks near each other). This is in addition to the provision of any automatic facilities for helping to ensure that representations are more easily interpretable. One heuristic for this might be to generate a layout arrangement that minimises the number of overlapping lines.

It is just as important to allow the user to alter the view so that they do not get the impression that the default view they have been presented with is the only one. In the timeline view the display of task states appears to impose a total order on the events, whereas in fact the display represents one of a number of total orderings which are consistent with the partial ordering defined by task interactions. If we allow the user to slide events along a task timeline, or to stretch or compress a timeline, then we allow them to visualise alternative orderings. Of course their changes must be consistent with the partial ordering, and attempts to move an event outside the valid range ought to be resisted, or result in a compensating movement of events on other timelines.
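The check that a user’s rearrangement is still legitimate amounts to asking whether the new left-to-right order of events is a linear extension of the happens-before relation. A Python sketch, with invented event names and edge set:

```python
# A proposed placement of events is acceptable only if every
# happens-before constraint still points left-to-right.

def is_valid_ordering(order, happens_before):
    """order: list of events, earliest first.
    happens_before: set of (e1, e2) pairs meaning e1 must precede e2."""
    position = {e: i for i, e in enumerate(order)}
    return all(position[a] < position[b] for a, b in happens_before)

# a1 -> a2 on one timeline, b1 -> b2 on another, and a rendezvous
# forcing a1 before b2
hb = {("a1", "a2"), ("b1", "b2"), ("a1", "b2")}

print(is_valid_ordering(["a1", "b1", "a2", "b2"], hb))  # -> True
# sliding b2 before a1 violates the rendezvous constraint
print(is_valid_ordering(["b1", "b2", "a1", "a2"], hb))  # -> False
```

A tool would run such a check continuously while an event is dragged, resisting the move (or compensating on other timelines) as soon as it fails.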

Filtering

With the vast amount of information that may be available from a system with perhaps tens of processors, hundreds of tasks, and events occurring over a long period of time, we clearly need to provide mechanisms to reduce complexity. This is particularly important when carrying out topographic strategies – these are the ones that are likely to involve most information.

Filtering facilities allow the user to discard irrelevant detail. In the timeline-level view we envisage filters to remove a particular class of event (e.g. anything to do with delay-statements), or the entire timeline for a task. With both of these we must ensure that causal relationships are not concealed, and that where interaction takes place through hidden events, the relationship between the tasks involved is not itself hidden.

A key form of filtering is to highlight those events which could have a bearing on a particular event. Those events that causally precede some event are said to form its dynamic slice. This is relatively straightforward to calculate from the event traces and can be conveniently displayed (see Figure 3.10) either by highlighting the events that are in the slice, or pruning out those that are not.
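The slice calculation is essentially a backwards reachability search over the recorded causal edges. A minimal sketch (the event names and the `causes` structure are illustrative, not RELATE-2's actual data structures):

```python
def dynamic_slice(target, causes):
    """Return the set of events that causally precede `target`.

    `causes` maps each event to its immediate causal predecessors:
    the previous event on the same task's timeline plus, for events
    such as Unblock, the matching event in the partner task's trace.
    """
    slice_set = set()
    worklist = [target]
    while worklist:
        event = worklist.pop()
        for pred in causes.get(event, ()):
            if pred not in slice_set:
                slice_set.add(pred)
                worklist.append(pred)
    return slice_set

# Toy traces: task A calls an entry of task B; task C runs independently.
causes = {
    "A2": ["A1"], "A3": ["A2", "B2"],   # A3 unblocks when B ends the accept
    "B2": ["B1", "A2"],                 # B's accept matches A's entry call
    "C2": ["C1"],
}
print(sorted(dynamic_slice("A3", causes)))   # C's events are pruned out
```

Highlighting the returned set (or hiding its complement) gives exactly the two displays described above.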

Figure 3.10 – Filtering by dynamic slicing

[Figure omitted; labels: ‘‘Dynamic Slice’’ and ‘‘Emergent Event’’]

Abstractions

Event abstraction reduces information complexity through the recognition of common patterns of application behaviour, and the combination of them into higher-level discrete units. This can be carried out either automatically or manually. The program can then be reasoned about in those higher level terms. As examples we may recognise the pattern of events that constitutes a nested rendezvous, or those that occur when a child task is spawned. Both of these are composed of a number of primitive level events, but are sufficiently well-defined to enable them to be abstracted into higher level units.

While such services have not been implemented in RELATE-2, event abstraction has been explored in the Dalek project,22 where the combination of primitive events into ever higher level events was found to be a useful debugging facility even in sequential systems. We can observe that not only would this provide an extremely flexible complexity reduction technique, but that ultimately pattern recognition services such as this could become the basis of a symptomatic-debugging toolkit.
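To illustrate the kind of pattern matching involved, the sketch below folds one hard-coded pattern – a rendezvous as seen from the calling task – into a single higher-level unit. The primitive event names follow Figure 4.4, but the function itself is hypothetical, part of neither RELATE-2 nor Dalek:

```python
def abstract_rendezvous(trace):
    """Collapse each (SusEntry, Unblock) pair in a calling task's trace
    into a single higher-level 'Rendezvous' event."""
    out, i = [], 0
    while i < len(trace):
        if (i + 1 < len(trace)
                and trace[i][0] == "SusEntry"
                and trace[i + 1][0] == "Unblock"):
            # The SusEntry attributes (called task, entry) survive
            # in the abstracted event.
            out.append(("Rendezvous",) + trace[i][1:])
            i += 2
        else:
            out.append(trace[i])
            i += 1
    return out

trace = [("Create",), ("SusEntry", "T2", "E1"), ("Unblock",), ("Terminate",)]
print(abstract_rendezvous(trace))
```

A real abstraction facility would accept user-defined patterns rather than hard-coding one, and would apply the same folding recursively to build ever higher level events.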

3.3. Summary

In Chapter 2, we categorised debugging strategies into four classes. In summary let us consider those four classes in turn and show which of the facilities described above provides support for each.

Topographic
• Three of the features we have discussed above provide support for topographic debugging strategies:

(i) filtering, and in particular dynamic slicing, allows the user of the tool to narrow down very quickly those parts of the program which may affect the part where the bug has been detected;

(ii) the application-level view allows the user to find out about the behaviour of the program at the level of inter-node communication;

(iii) automatic grouping of nodes based on the event trace allows the user to divine erroneous behaviour from the relationships drawn between nodes.

Comprehension-based
• RELATE-2 contains some support for comprehension-based strategies, in that it allows the user to construct mental models of how the program works, in a small number of forms. Grouping nodes according to some physical or structural relationship may assist the user to understand the structure of the program, and thereby to detect flaws in it.

• Facilities such as abstraction and filtering also assist in doing this in that they allow the user to create higher level models of the program and its structure.

Symptomatic
• Symptomatic strategies are also supported by abstraction facilities. Being able to consider what a program does at a higher level of abstraction helps the user to associate particular defective behaviour with particular program features.

• Navigation and layout manipulation facilities also provide support in that they make it easier for the user to identify particular patterns of program behaviour.

Beacon-based
• It is difficult to judge what support RELATE-2 provides for beacon-based strategies, since we do not know what sort of beacons an experienced user of the tool might identify. We can say that the support for both comprehension-based and symptomatic strategies provides the user with mechanisms that can be used to establish and recognise patterns that reveal what the program is doing.

In this chapter we have shown how information about a program can be displayed. In the next chapter we go on to consider how that information may be collected.


Chapter 4

AN INSTRUMENTATION MECHANISM FOR TRACING ADA TASKS

A post-mortem debugging tool like RELATE depends on the program being debugged producing a trace of its execution. In this chapter we discuss the implementation issues that underpin this, including those concerned with collecting information from a distributed system. We present our requirements at a more abstract level, leaving the instrumentation techniques to be examined in the context of the requirements of a general system. In order that the chapter can be followed by someone unfamiliar with Ada tasking, we briefly summarise the semantics of the relevant Ada constructs as we go along.

A York Distributed Ada program is pre-partitioned into a set of virtual nodes for distribution. Tasks are the unit of parallelism – they execute independently when placed on different processors (within the restrictions of programmed synchronisation), and concurrently under some intra-node scheduling algorithm when placed on the same processor. Inter-node communication is by remote procedure call; synchronisation is achieved by task rendezvous.

We characterise program execution as a set of events, where an event is some action of interest. The events which occur in a single task are totally ordered, as a consequence of sequential execution, while events in different tasks are only ordered if task synchronisation or communication takes place.

4.1. RELATE instrumentation requirements

RELATE’s primary purpose is to portray the tasking behaviour observed during a program’s execution. This is characterised as a series of tasking events for which a program can be instrumented. Our monitoring mechanism generates a set of tasking event traces, and a set of scheduling event traces, one per processor. This information may then be used for the identification of synchronisation errors (livelock, deadlock, race conditions), for detecting resource related problems such as starvation, and for helping users to understand tasking semantics and their consequences.

Our tasking event set is derived from the state model of Figure 4.1. An event identifies a transition or, more accurately, captures sufficient information to identify a new state and the attributes specific to it, given the set of events which precede it and the initial program state. For example, an entry call event specifies the called task and the called entry and the type of call (timed, unconditional, etc.).

Figure 4.1 – Ada tasking state model

[State diagram; states: Non-Existent, Created, Activating, Executing, Completed, Terminated, Abnormal, Delayed, Blocked on accept, Blocked on select, Blocked on entry, Blocked on RPC, Blocked on child activation, Blocked on end rendezvous, Blocked on dependent termination]

Events are described by tuples of the form:

{event type, originating task, attributes, timestamp}

Timestamps are logical rather than chronological and can be used to order all events which occurred on a single processor or in a single task. Attributes record the additional information necessary to locate causally related events in other task traces and consequently to permit trace merging.
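One way to realise these tuples and their per-task logical timestamps is sketched below. The field and class names are our own, for illustration, and the timestamp is simply the task's own monotonic event counter:

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Event:
    event_type: str     # e.g. "SusEntry", "Unblock"
    task: str           # originating task identity
    attributes: tuple   # e.g. (called task, called entry, call type)
    timestamp: int      # logical: per-task event counter

class TaskTrace:
    """Stamps each recorded event with the task's own monotonic counter."""
    def __init__(self, task):
        self.task = task
        self._clock = count(1)
        self.events = []

    def record(self, event_type, *attributes):
        ev = Event(event_type, self.task, attributes, next(self._clock))
        self.events.append(ev)
        return ev

t = TaskTrace("worker")
t.record("SusEntry", "server", "Get", "unconditional")
t.record("Unblock")
print([(e.event_type, e.timestamp) for e in t.events])
```

Because the counter is per task, timestamps totally order a single task's events but say nothing about the relative order of events in different tasks; that ordering is recovered from the attributes, as described below.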

In our implementation, the trace is produced by instrumenting the Ada runtime system to produce an event record each time the runtime system is called upon to perform some tasking-related operation. The event records are output to a file (one per virtual node) using normal operating system functions.

In the remainder of this section, we describe our primitive event set. With slight concessions to fault tolerance, the defined events are reasonably minimal in size and content – that is they contain just enough information to reconstruct behaviour. This is desirable because creating, manipulating and storing large records may cause probe effect problems, and even if not, execution delays can make debugging frustrating and inefficient.


The elements of our tasking event set characterise task activation, task termination, task rendezvous, remote procedure call and abort. We describe events in terms of the specified tasking semantics.

4.1.1. Creation and Termination

A task creates a child task or tasks either (i) through the elaboration of the declarative region of a program unit such as a subprogram, task, package or block statement; or (ii) by dynamic instantiation via new. If an exception is raised during elaboration then the newly created tasks become terminated.

Task objects created by the elaboration of a declarative region depend on the instance of execution of the program unit in which the declaration is located. If dependent tasks have not terminated on completion of execution of the unit then execution of it suspends until they do. This controlled parallelism ensures that objects instantiated during elaboration are only destroyed when there are no active tasks which may refer to them, and that the tasks themselves are visible through the duration of their existence.

The runtime creation of tasks is provided for by dynamic instantiation via access types. An access type can be declared to refer to objects of a task type. Task objects referenced by an access object are created using the new command. Such objects may be ‘‘assigned’’ to access objects and passed as parameters. They can become completely anonymous (invisible to other tasks in the program) if access values are lost. A task created through an access object depends on the instance of execution of the program unit which defined the access type. As before, this aids reasoning about a program and preserves data dependencies.

A unit instance with dependants is called a master and may be either a task, a currently executing block statement or subprogram or library package. Dependency is transitive: a task that depends on a subprogram or block statement executed by another master is dependent on this other master. When a master is a task and execution suspends for direct dependant termination, that task is said to be completed.

The table in Figure 4.2 lists our creation and termination events.

Figure 4.2 – Task creation and termination events

Event type    Attributes                                Comment
Create        Task identity, name, priority, number     Task created
              of entries, master identity
SusDepTerm    Master identity                           Suspend on dependant termination
Complete      None                                      Suspend on termination of task body dependants
Terminate     None
Unblock       [exception identity]                      Dependants have terminated

The attributes of a Create event are:
• the task’s identity;
• its name (if it has one);
• its priority;
• the number of entries it has; and
• its master’s identity.†

Master information must always be available at task creation as it may be used at any time in the execution of an abort command.

The terminate events are Complete, SusDepTerm and Terminate. SusDepTerm signifies suspension of task execution on completion of a master and Complete distinguishes the case when the master is the task itself. The SusDepTerm event has a single attribute, that of the master’s identity. An Unblock event signifies resumption of execution by a task once dependants have terminated.

We associate an instance of a SusDepTerm event with a set of dependent tasks’ Terminate events using the information captured in their creation events. It might seem more appropriate that master information be explicitly captured with the termination events, but we record it at task creation in case of trace failure and to reduce the number of times it is actually reported should termination be implemented using ‘‘select with terminate’’ (see section on rendezvous below).

4.1.2. Activation

The process of elaboration of the declarative region in the code body associated with a newly created task is described as task activation. On completion of its activation a task commences execution. Should activation fail due to an exception being raised or the task being aborted, the task becomes completed – an exception raised during activation cannot be caught, as an exception handler may potentially refer to uninstantiated or corrupted objects. So that some control can be provided in the event of failure, activation is synchronised with parent execution. The parent task is suspended on commencement of child activation and only unblocked when activation is completed. A TASKING_ERROR exception is raised in the parent should activation of one or more child tasks fail.

The table in Figure 4.3 lists our activation events.

The ActStart and ActEnd events mark the start and end of the activation of a child task, while SusChildAct marks suspension of the parent on its children’s activation. If an activation is aborted then, for the purposes of trace merging, we assume a Complete event marks the end of activation.

The suspension of the parent and the commencement of the child’s activation are synchronised and a parent only unblocks when child activation completes. We use the identity of a SusChildAct event instance as an attribute of ActStart event instances so that the associated events can be located. An event instance identity is composed of its source task identity and timestamp and is unique. We assume, since we have synchronisation between events, that such a mechanism can be implemented.

Figure 4.3 – Task activation events

Event type     Attributes                    Comment
SusChildAct    None                          Suspend on child activation
Unblock        [TASKING_ERROR exception]     Child activation complete
ActStart       SusChildAct event identity    Start activation
ActEnd         None                          Activation complete

____________
† For the purpose of trace merging, the master’s identity need only be a unique logical value, that is it need not be related to its source code name. When a master is a task instance, its identity is simply that of the instance. For a master which is not a task instance, the identity must be more complex. It should take the form of a tuple

{master-task identity, unique value}

where the master-task identity is that of the task responsible for the master’s execution (i.e. transitively following dependency relations, the next master which is a task), and the unique value is the portion required for unique identification. A master task identity is used to locate the trace containing the execution events of the master task.

4.1.3. Rendezvous

In Ada, task synchronisation and communication is supported through the rendezvous construct. A task may provide a number of entries to which it is willing to accept calls from other tasks. An entry may specify some statements to be executed when a call is accepted. The mechanism uses asymmetric naming: an accepting task does not have access to a calling task’s identity.

A calling task may perform unconditional, conditional and timed entry calls. In a conditional entry call, if the accepting task is not in a state to receive it, the call returns and the calling task continues execution. If a called task has completed or been aborted, the exception TASKING_ERROR is raised. In an unconditional entry call, the entering task blocks indefinitely until the call is accepted or it is aborted. A timed call specifies a maximum wait time.

An accepting task may accept entry calls by either an accept-statement or a select-statement. An accept-statement is used to perform an unconditional wait on a single specified entry. A select-statement supports unconditional, conditional, timed or terminate waits on a number of specified entries. Should calls be queued on more than one of the specified entries, the choice of which call is accepted is not defined by the language.

A select-with-terminate option is a useful mechanism for synchronising task termination without explicit programming. It is used to specify tasks which terminate when no longer needed. Should a task be suspended on a select-with-terminate, the condition of termination is simply that no task will ever access it again. Slightly more formally, the condition is that the task is dependent on some master, the execution of which is completed, and all tasks dependent on that master (directly or transitively) are either terminated or waiting on a select-with-terminate. On satisfaction of this condition, termination is carried out in two steps. Firstly, all tasks suspended on select-with-terminate become terminated. Secondly, should the master be a task, the task terminates, else the task which executed the master is unblocked and continues.

When an entry call is accepted, rendezvous commences. The entering task is suspended until the rendezvous is complete, and the accepting task resumes execution. A rendezvous is completed when the accepting task either completes execution of the statements associated with the entry, or is aborted and itself becomes completed. If an exception is raised in the accepting task and not caught before rendezvous completion it will be propagated to the entering task. Should the accepting task be aborted, it becomes abnormal, and may not make the transition to terminated before end of rendezvous.

The table in Figure 4.4 lists the rendezvous events.

Figure 4.4 – Task rendezvous events

Event type     Attributes                             Comment
SusAccept      Entry identity                         Suspend on accept
SusSelect      Open entry identities, select type     Suspend on select
               [, delay value]
EnterAccept    Accepted task identity                 Enter accept code body
EndAccept      None                                   Accept code completed
SusEntry       Called task identity, called entry     Suspend on entry call
               identity, call type
SusEndEntry    None                                   Entry call accepted
Unblock        [exception identity]                   Entered task resumes execution on rendezvous completion

SusAccept and SusSelect correspond to the execution of accept and select statements respectively. The execution of an entry call is signified by a SusEntry event.

Start of rendezvous is signified by the EnterAccept and SusEndEntry event pair. The identity of the called entry is not an attribute of the EnterAccept event as it will be either an attribute of a prior SusAccept event or the entering task’s corresponding SusEntry event.

EndAccept signifies normal completion of execution (by the accepting task) of the statements associated with an entry, and hence end of rendezvous. End of rendezvous in the entering task is signified by an Unblock event with a possible exception attribute. For the purpose of merging traces, in the case of an abort we consider the end of the accept to be signified by a Complete event (i.e. it is this event which unblocks the entered task). This also applies if the accepting task is aborted.

Since task event traces are totally ordered, this information is all that we need to match up rendezvous events in both calling and accepting tasks.
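The matching rule can be sketched as follows: because each trace is totally ordered, the n-th SusEntry that a given caller directs at a given acceptor pairs with the n-th EnterAccept that the acceptor records for that caller. This is a deliberately simplified illustration (events are bare tuples; queuing on multiple entries and select alternatives are ignored), not RELATE-2's actual merge code:

```python
def match_rendezvous(caller_trace, acceptor_trace, caller, acceptor):
    """Pair the caller's SusEntry events with the acceptor's EnterAccept
    events, relying only on each trace's total order.

    Events are (type, *attributes) tuples; per Figure 4.4, SusEntry
    carries the called task's identity and EnterAccept the accepted
    (calling) task's identity.
    """
    calls = [e for e in caller_trace
             if e[0] == "SusEntry" and e[1] == acceptor]
    accepts = [e for e in acceptor_trace
               if e[0] == "EnterAccept" and e[1] == caller]
    # Total order within each trace: the n-th filtered call
    # matches the n-th filtered accept.
    return list(zip(calls, accepts))

caller_trace = [("SusEntry", "B", "Put"), ("Unblock",),
                ("SusEntry", "B", "Put")]
acceptor_trace = [("EnterAccept", "A"), ("EndAccept",),
                  ("EnterAccept", "A")]
pairs = match_rendezvous(caller_trace, acceptor_trace, "A", "B")
print(len(pairs))   # two matched rendezvous
```

Each matched pair contributes a cross-trace causal edge, which is exactly what the merge step of the previous section consumes.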

4.1.4. Remote procedure call

Remote procedure call is the mechanism used for communicating between virtual nodes in a York Distributed Ada program. A task in one node may call a subprogram implemented in another. However, for a subprogram to be called remotely, it must be declared in a library unit package that defines the interface to the virtual node.

Subprogram calls are executed on behalf of a remote client (calling) task by a server task. While a call is being carried out on its behalf, the client task remains suspended, resuming when the server task terminates and any results have been returned. There exists the possibility that a subprogram might not yet be elaborated at the point at which a server is created to call it. In such a situation the server will be blocked, and only set running when elaboration finally occurs.


A server task is merely an implementational convenience and not a ‘‘proper’’ Ada task. It is not visible to the client program and does not behave in the same way as a normal task. Specifically, it does not activate and never has direct dependants which would cause it to block in a completed state (Figure 4.5).

Figure 4.5 – RPC server state diagram

[State diagram; states: Non-existent, Created, Blocked on RPC elaboration, Suspended, Executing, Terminated; transitions labelled: new request, subprogram elaborated, subprogram not elaborated, complete request]

In a distributed Ada program there is the possibility of communication failure. Failure may be partial – a node or nodes may fail – or the communications infrastructure itself may fail, leaving tasks unable to perform remote procedure calls or receive results from ones in progress.

In York Distributed Ada such occurrences are dealt with using exceptions. In the event of a node failing, client tasks are unblocked and the exception NODE_FAILURE is raised in each. Further calls in the future to the same node will elicit the same outcome. Should a failure in the communications subsystem occur, any remote procedure call activity results in a COMMS_FAILURE exception.

The table in Figure 4.6 lists our remote procedure call events.

SusRPC marks the suspension of the client task on a remote call and SCreate marks the creation of the server task. The identity of the subprogram called, and which node it resides in, are associated with each RPC call. Under normal circumstances, we should need only to record these with one of the two events. However, in anticipation of a failure resulting in the loss of a trace, we record them as attributes of both. As with rendezvous events, we use trace ordering to match call events in the client to corresponding creation events in the server.†

____________
† Event timestamps are capable of ordering all events which occurred in tasks executed on the same processor and hence can order instances of server task creation.


Figure 4.6 – Remote procedure call events

Event type   Attributes                                  Comment
SusRPC       Called node identity, called interface      Client task suspends on remote procedure call
             and subprogram identities
Unblock      [exception identity]                        Client unblocked by rpc completion
SCreate      Calling task identity, called interface     Server task creation
             and subprogram identities, [blocked
             for elaboration flag]
Terminate    None                                        Server task terminated

4.1.5. Abort

In Ada an abort statement may be used as an emergency means of terminating one or more specified tasks. On execution of an abort, each of the named tasks becomes abnormal unless it is already terminated. Similarly any task which depends on one of the named tasks becomes abnormal unless already terminated. If an abnormal task is blocked, i.e. execution is suspended on an accept statement, a select statement, a delay statement, or on an entry call which has not been accepted, then the task becomes completed. A task which has not begun activation becomes completed and hence terminated as it can have no dependants. This completes the execution of the abort command.

Any abnormal tasks not blocked as above become completed not later than the next synchronisation point (such as an entry call or end of activation). A calling task suspended in rendezvous does not terminate until that rendezvous completes, and an RPC client task does not terminate until the remote procedure call returns or fails.

The table in Figure 4.7 lists our abort events.

Figure 4.7 – Task abort events

Event type   Attributes                Comment
Abort        Aborted task identities   Commence execution of Abort
Abnormal     None                      Task becomes aborted
Complete     None                      Abnormal task becomes completed

4.1.6. Exception handling and delay

The table in Figure 4.8 lists the last of the primitive events, those for handling delays and exceptions. Though not exclusively part of tasking, we record exception propagation as this is extremely useful for tracking down errors.

Figure 4.8 – Exception and delay events

Event type   Attributes           Comment
SusDelay     Delay Value          Suspend on completion of delay
Unblock      None                 Delay completed
Raise        Exception identity   Exception raised during execution
Handle       None                 Exception caught by handler

We have already listed the Unblock event in the various other contexts in which it is used. An exception identity is an optional attribute for those cases (such as the unblocking of a parent suspended on child activation) where such an occurrence is a possibility. We could have instead used a separate Raise event, requiring it to precede the Unblock so that the association might be recognised. The processing needed to ensure the ordering would seem comparable to that required to combine the two events into one, but we chose the latter as it has the advantage of a lower monitoring overhead, i.e. less data to capture.

4.2. General monitoring requirements

When debugging a program, one monitors the program to record how its state changes as execution proceeds. In the previous section we have described how we monitor a program from the point of view of Ada tasking semantics. In this section we look at three issues: how to implement instrumentation in the program, how to ensure that traces from different virtual nodes can be merged, and how the trace can be used to implement deterministic replay.

4.2.1. Instrumentation techniques

There are three main ways in which it is possible to implement instrumentation of a program:source code modification, runtime system modification, and breakpointing.

Source code modification is an attractive technique for a variety of reasons:

(i) it does not require access to the innards of a proprietary language system, therefore it helps make a debugging tool portable;

(ii) it is straightforward to do manually (though this might be error prone);

(iii) it is possible to do it automatically;

(iv) if we were not solely interested in monitoring tasking, the instrumentation need not be limited to one particular aspect of the program’s behaviour.

Source code modification does however have its disadvantages:

(i) there is likely to be a significant compilation overhead;

(ii) it cannot be applied to programs (or libraries) for which the source code is not available;

(iii) it provides no facility to access private information about a task (for example from the runtime system’s task control block) so identifying tasks uniquely may be difficult;

(iv) it provides no mechanism for affecting task execution control; and

(v) event reporting is not atomic as we have no control over scheduling.

Implementing the instrumentation by runtime system modification (as we have done in the York Distributed Ada implementation) solves all the downside problems of source code modification but unfortunately destroys most of the advantages that source code modification can provide. Its further disadvantage is that it can only monitor events that involve calls to the runtime system. It would be possible to have a hybrid system in which the runtime system did some of the monitoring and source code inserts the rest. By this means a basic set of events monitored by the runtime system could be supplemented by a further set produced by calls in the source.

The two techniques above share one disadvantage in that the monitoring is not immutable. If program code and monitoring system share the same address space for code, data or both there exists the possibility that a fault in the program might corrupt the monitoring system.

The third technique for monitoring, breakpointing, avoids this last problem by monitoring the program from the outside. If the monitoring is performed by another process (in a separate address space) then normally it cannot be corrupted. However this requires operating system and/or processor support to achieve and is typically very time intensive since often the target program cannot run without extensive intervention. The monitoring process needs to know where to set the breakpoints in the target program. To do this requires a knowledge of the program and its translation system, but this is usually available since it forms the basis of most source code debuggers.

4.2.2. Organising execution observations

In our distributed system, instrumentation traces are produced by each virtual node. In order to consider the behaviour of the Ada program as a whole, we need to be able to resolve references between nodes and between tasks, in other words to merge the separate traces to form a single trace, at least conceptually. These references between nodes and tasks do after all represent the synchronisation events that we are most interested in.

In our implementation all events are timestamped so that we can order all those from the same task. As described in section 4.1, we record where necessary the identities of those tasks with which there is interaction. We then take this information and apply our knowledge of Ada tasking semantics (for example, we know a child task completes activation before its parent unblocks) to construct the complete ordering.

Effectively, we capture sufficient tasking information to identify direct relationships between events which occur in different tasks. To identify relations between arbitrary events we must transitively search through our trace set, or process the set to generate timestamps that give us a partial ordering.

Partial-ordering timestamps are generated using a logical clock algorithm. They can be used to assert, for two events e1 and e2, that e1 happened before e2, or vice versa, or that e1 and e2 are concurrent (i.e. they could have happened simultaneously or in either order). Such a clock algorithm may be either applied as a post-processing device or embedded in the system under observation. The former is normally better in that it interferes less with the executing program (see the paragraph below on overheads). However the latter is an attractive proposition as it frees us from the monitoring of events which may not be of immediate interest. Thus we could potentially choose not to record and store some events but as long as we keep the clocks up to date we will still maintain a sense of parallelism and causality when considering those events we do choose to store.

With respect to Ada, the principles of a logical clock are that each task maintains its own notion of local time based on locally observable events, and maintains a notion of times in other tasks based on information received through interactions. The whole forms a ‘‘global time’’ system. Usually in implementation, a local time is represented by a counter which is incremented for each event occurrence, and global time is represented by sets of counter values. It is these value sets which are propagated between tasks; simple rules are used to update a task’s global time interpretation.
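
A minimal Python sketch of such a logical clock (a vector clock in the style of Lamport), using hypothetical task names; it illustrates the counter value sets and update rules described above, not our actual implementation:

```python
# Each task owns a counter value set (its "global time" interpretation):
# a map from task names to the latest counter values known to this task.
class VectorClock:
    def __init__(self, owner):
        self.owner = owner
        self.counts = {owner: 0}

    def local_event(self):
        # Local time: tick our own counter for each observable event and
        # record the whole value set as the event's timestamp.
        self.counts[self.owner] += 1
        return dict(self.counts)

    def send(self):
        # The value set is propagated with the message.
        self.counts[self.owner] += 1
        return dict(self.counts)

    def receive(self, msg_counts):
        # Simple update rule: component-wise maximum, then tick.
        for task, c in msg_counts.items():
            self.counts[task] = max(self.counts.get(task, 0), c)
        self.counts[self.owner] += 1

def happened_before(a, b):
    # a -> b iff a's value set is dominated by b's and they differ.
    return all(a.get(t, 0) <= b.get(t, 0) for t in a) and a != b

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)

t1, t2, t3 = VectorClock("T1"), VectorClock("T2"), VectorClock("T3")
e1 = t1.send()         # T1 sends a message to T2
t2.receive(e1)
e2 = t2.local_event()  # event in T2 after the receive: e1 -> e2
e3 = t3.local_event()  # unrelated event in T3: concurrent with e1
```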

A clock implementation generates computational, communication and storage overheads which are proportional to clock data structure sizes (i.e. the size of counter value sets). At any instant in execution, the sizes depend on static task structure and the dynamic tasking and communication activity that has occurred prior to that point. The size of the set in task T is equal to the number of tasks from which T has received inter-task communication (applied transitively to include those that have sent messages to those that have sent messages to T, and so on). For it to be viable to use embedded logical clocks to capture causality, we would want the overheads to be within the bounds of the available resources. We cannot avoid the probe effect but we would wish to prevent more extreme effects such as the triggering of timeouts which would perhaps cause the system under observation to enter some failure mode of operation. The other concern must be that overheads are sufficiently low to make the monitoring system usable in terms of response time for the user.

4.2.3. Implementing replay

Though not directly relevant to our work on the display of Ada tasking related information, the work described above on program tracing has led us to consider some of the issues concerned with replaying the execution of a program.

The execution of a distributed program is inherently non-deterministic due to arbitrary factors such as the time taken for messages to be transmitted and the vagaries of loading on each processor. This makes it difficult to reproduce exactly the behaviour of a program in the frequent case where the debugging user wants to go back and look at the execution again. It is therefore highly desirable to provide a deterministic replay mechanism that makes it possible to observe the same execution sequence as many times as desired.

The basic technique is to monitor execution with sufficient detail to permit the direction of subsequent executions along the same path. Since in our system we record traces at the node level, we are able, under certain constraints, to provide a mechanism that permits the execution of a single virtual node to be repeated without having to re-execute the rest of the program.

In general to perform replay we need to monitor three types of information. From these, the precise path of program execution can be reproduced.

Input from external sources

It is necessary to record the content of each input and the logical time at which it arrived. However if a system can be constructed in which all input is received through one or a small number of nodes which is/are perhaps simple enough not to require replay-driven debugging, then the replay of other nodes to which the input is disseminated degenerates into consideration only of the other two types of information, below. This is the approach that we adopted, hence our prototype can only be used to replay nodes without input from external sources.

Inter-node message reception

Message reception is non-deterministic in the order and in the times at which messages arrive. Since the effect of receiving a message is to place a task on a scheduling queue then for repeatability we need to ensure the order of queuing is duplicated on subsequent executions.

This means synchronising the placement of a task on a queue with the order of scheduling. We don’t need to ensure queuing is at exactly the same point in the scheduling sequence, merely that it occurs before the scheduler queues the task which follows. Should the point arrive at which that following task is to be queued, then we can suspend execution until the appropriate message arrives and order can be preserved. When messages arrive ahead of time, we can either buffer them until reception is permissible, or process them and hold the associated tasks until they can be queued.
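
The buffering alternative can be sketched as follows (a Python illustration with a hypothetical recorded queueing order; in a real tool this logic would sit in the runtime system’s message layer):

```python
from collections import deque

class ReplayAdmitter:
    """Releases tasks for queuing only in the order recorded during
    the original execution; messages arriving ahead of time are
    buffered until the recorded order permits them."""

    def __init__(self, recorded_order):
        self.expected = deque(recorded_order)  # message ids, original order
        self.buffered = {}                     # early arrivals
        self.ready = []                        # tasks released, replay order

    def arrive(self, msg_id, task_id):
        self.buffered[msg_id] = task_id
        # Release every message that is next in the recorded order.
        while self.expected and self.expected[0] in self.buffered:
            self.ready.append(self.buffered.pop(self.expected.popleft()))

admitter = ReplayAdmitter(["m1", "m2", "m3"])
admitter.arrive("m2", "T_b")  # ahead of time: buffered
admitter.arrive("m1", "T_a")  # releases m1, then the buffered m2
admitter.arrive("m3", "T_c")
```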

Task scheduling

The scheduling algorithm used to concurrently execute tasks on the same processor determines the order in which their events are interleaved.

In general, scheduling may be non-deterministic for the reasons that when a new task is due to be scheduled there may be an arbitrary choice between ones of the same priority, or that if a clock mechanism is used to control the length of time a task executes, it may not be sufficiently accurate to ensure that given the same start state, it will halt execution at the same point. Assuming a worst case, we need to record the sequence of context switches made. For each we need the identity of the task chosen for execution and the address at which it was halted. During replay we can halt the task at the same point using conventional breakpointing techniques.
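
The worst-case recording can be illustrated with a Python sketch in which hypothetical ‘‘halt points’’ are step counts standing in for code addresses (a real replayer would plant a breakpoint at the recorded address instead):

```python
def replay(tasks, switch_log):
    """Drive tasks in the recorded context-switch order, halting each
    after the recorded number of steps."""
    trace = []
    for task_id, steps_until_halt in switch_log:
        step = tasks[task_id]
        for _ in range(steps_until_halt):
            trace.append((task_id, step()))
    return trace

def make_counter():
    # Stand-in for a task: each step returns the next value.
    n = 0
    def step():
        nonlocal n
        n += 1
        return n
    return step

tasks = {"T1": make_counter(), "T2": make_counter()}
# Recorded log: T1 ran 2 steps, then T2 ran 1, then T1 ran 1 more.
trace = replay(tasks, [("T1", 2), ("T2", 1), ("T1", 1)])
```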

The Ada language rules allow for non-determinism in one particular case: a select-statement is allowed to make an arbitrary choice of which of two or more entry calls to accept. We need to record which entry is chosen so that we can direct the decisions on subsequent executions.

In the York Ada environment the replay of task scheduling is very simple. The run-to-completion strategy that is adopted means that all the information required is stored in the event trace, and the total ordering defined by the original execution of the tasks running on a single processor drives the replay.

4.3. Summary

In this chapter we have described how a tracing mechanism can be linked to the events that occur in an Ada tasking system, and how this can be extended to include inter-node communication events. We have discussed some of the advantages and disadvantages of implementing instrumentation in the runtime system as opposed to at the source code level or by tracing the program from outside. In addition we have discussed briefly the way in which traces from different nodes can be merged together, and how the tracing mechanism forms a foundation for the implementation of a replay mechanism.

Chapter 5

CONCLUSIONS

5.1. Summary

In this report we have looked at the problem of using tracing information to debug distributed Ada programs from three points of view:

•  from the viewpoint of the debugging process – how do users debug, and what makes good debugging users better than poor ones?

•  from the viewpoint of the display tool, and what facilities it should provide to present information (derived from the trace) to the debugging user in the most effective way or ways;

•  from the viewpoint of the program being debugged and what information about it should be captured in the trace.

In the first of these viewpoints we make no claim to have advanced the discipline of cognitive psychology. Instead we have attempted to consolidate some general principles into a model that identifies four distinct strategies that a user may adopt when debugging a program. These four strategies – comprehension based, topographic, symptomatic and beacon based – correspond to the four points in a two-dimensional split between, on the one hand, techniques that rely on the user’s previous experience and those that rely on what he/she can find out about the program in question, and, on the other hand, techniques that involve looking at the program as an object and those that involve looking at a particular execution of it.

This classification enables us to indicate in what general way the facilities provided by our RELATE-2 tool help the debugging process. Therefore in the second viewpoint we concentrate on facilities without presenting anything but a basic justification as to how they might prove effective in practice. The proper test of these awaits future work (see section 5.3.1 below). The key facilities would seem to be to provide more than one way of looking at a trace, and to provide functionality to support abstraction of, and navigation around, the information available.

In the third viewpoint we present a definition of what information we record about Ada tasking and inter-process communication activity. We present this as a reasonably useful approach to take, rather than attempting to justify it as being optimal. Nor do we claim that instrumenting the runtime system is the only or best technique for collecting the information: instead we have discussed the options available to an implementor.

5.2. Application to SMCS

How does this work apply to SMCS?

From talking to the personnel involved at BAe Sema it seems that debugging at the Ada tasking level is no longer a requirement, thereby reducing the relevance of the first part of Chapter 4, and that there is more interest in the more general application of tracing techniques. We believe therefore that the most important contribution of this work to the SMCS and similar projects will be found in consideration of the tool ideas of Chapter 3, and perhaps in relating those to the processes of Chapter 2. It is there that we have presented our ideas about the way in which information useful to the debugging user can be displayed and manipulated by a debugging tool.

In the Introduction (§1.4.2), we listed five major problems that we understood that SMCS was encountering. We have addressed three of these:

a) RELATE-2 illustrates how a tool can provide navigation, automation and abstraction aids to fight the complexity of the system and make locating bugs easier.

c) The application level view provided in RELATE-2 is an example of how the system can be comprehended at a global level to obtain the ‘‘big picture’’ that is, perhaps, lacking at present.

e) RELATE-2 allows reverse execution of a program provided that one wishes only to consider the program as a series of tasking and communication events. At that level it makes it unnecessary to re-execute the program from the beginning in order to examine state at an earlier point. If it is necessary to re-execute from the beginning, the section on replay (§4.2.3) provides the foundation of a mechanism to replay one or a small number of nodes in isolation.

We have not addressed the further problems of relating bugs to the design level at which most programmers comprehend the system, nor of the performance of debugging tools.

5.3. Further work

The above suggests to us four areas in which further work could be done.

5.3.1. Need to evaluate debugging method

We have quite deliberately not done any evaluation work of the RELATE-2 tool in this project. To do so would have required us to spend a significant amount of time developing a distributed application, and this would have detracted from our development of the tool. Now that we have a prototype tool, evaluation of it is an obvious next step.

The fundamental question that needs to be answered is whether a tool such as RELATE can prove useful in the development of a large Ada distributed system. This ‘‘usefulness’’ need not only be its effectiveness at tracking down bugs – its use to comprehend the structure of a program is equally valid. The evaluation could be carried out at York (in which case it could be done using one or more exemplar systems drawn from previous projects in the department or suggested by MoD) or in an industrial-scale project. The advantage of the latter is that it will have a greater degree of realism; the major disadvantage is that RELATE-2 may have to be ported to another environment.

It is important that the evaluation encompasses more than one application system. Distributed systems come in many shapes and forms and it is not clear that the sort of bugs that might be found in one are necessarily found in another.

5.3.2. Integration of event-based debugging with breakpoint debugging

So far we have developed RELATE purely as an event-based debugger. This is useful in allowing the debugging user to track bugs down to the chosen level of events, in our case tasking interactions.

Using the application-level view of RELATE-2 it is possible to track the cause of a bug down to a particular node in the distributed system. Using the node-level view it is possible to refine that down to a particular task or group of tasks. Using the timeline view it is possible to further refine that down to an individual event or group of events. What RELATE-2 cannot currently do is take that a step further and allow the user to identify a particular line of source code as being the cause. This is normally the function of a ‘‘traditional’’ debugger – one which allows the user to set breakpoints at particular points and examine the values of variables, etc.

The integration of event-based and breakpoint debuggers is a most interesting potential area for investigation with regard to the ‘‘state of the art’’ in debugging. At a panel session held as part of the 1991 Workshop on Parallel and Distributed Debugging, Tom LeBlanc expressed the view that integration of the many existing debugging ideas was a key research issue, but that it was not as well advanced as other issues. It is our view that this is still the case, citing as evidence of this that no papers directly addressing integration were presented at the 1993 Workshop.

The ultimate goal of a tool developer should be to provide an integrated set of tools which allows the user to seamlessly move from one view of the program to another. By this means, one tool could be used to navigate to a point in the program, and then, without having to repeat the navigation, another tool used to provide different information relating to that point. If the toolset is also extensible then that would be a bonus.

5.3.3. Debugging at the design level

The earliest and crudest debugging tools provided their user with access to the binary codes representing a program’s data and instructions. Later on symbolic debuggers were developed that translated the codes into machine instruction formats and more readable representations of data. Later still means were developed of storing compiler symbol table information in an executable file, so enabling the debugging tool to relate entities in the program to their high-level language source code.

This is essentially still the state of the art in traditional debugging tools. The user can refer to a point in the program by name – either by reference to the name of the program unit, or by reference to its line number. Variables can also be referred to by name, and their values displayed and altered in a form according to their type. Gradually the level at which debugging takes place has moved to higher and higher levels of virtual machine.

The next step is to move up to the design level. The SMCS project is an example of a system in which most code is written in a high-level design notation (in this case JSD) and then translated automatically into source code (in this case Ada). From the developers’ point of view, the system is defined in the design notation, and they want to be able to relate bugs back to that level. Instead they are forced to debug at the source code level – a level at which, as a result of the translation process, there is a large volume of difficult-to-identify code that poses problems of complexity and lack of abstraction.

It is desirable for a debugging user to be able to examine the execution of the program in terms of the design level. To do this it would be necessary to include, in the source code produced by the translation process, sufficient additional information to allow a design-level debugging tool to relate the execution of the program back to the components of the design.

As well as JSD, it would be interesting to apply the investigation to HOOD because of its use in relation to Ada.

5.3.4. Generalising the trace display

In the work we have done to date, we have concentrated exclusively on how to gather and display information about Ada tasking events. However, much of the work described in Chapter 3 also applies to other sorts of event.

A further avenue for study would be to develop a general model in which different types of events could be accommodated. Applications could then be instrumented to produce traces of these events, and a generalised display tool could be used to show them in a variety of formats.

As discussed in section 4.2.1, the instrumentation mechanism might be built in to the application system by the developer, either at the source code level or runtime system level, or applied externally. It does not have to be the case that the developer instruments the application – if the operating system produces a trace, or there is a facility in a high-level language implementation that allows a runtime trace to be enabled, then this information may be able to be incorporated. The flexibility it provides in choosing the level at which tracing takes place is a major advantage of having a generalised mechanism.

A generalised mechanism should also facilitate the integration of different types of trace produced by the same application. For example, separate traces could be produced by the runtime system level and by source code inserts and, by suitable definition of the relationship between the two, provide two levels of abstraction when looking at the application.

A generalised display tool might provide facilities for looking at traces in a set of standard ways (e.g. a snapshot of the state of the system at a particular time, or a timeline view showing state transitions) and also user-extensible facilities to display information in application-oriented ways. By such means, the user of the tool can tailor it to the specific requirements both of the application and of the form which the tracing information takes.

5.4. Conclusion

In this report we have shown how a tracing mechanism can be implemented to record information about Ada tasking and inter-node communication activity. We have constructed a prototype tool which displays this information in a variety of formats that allow the user of it to investigate the structure and execution of the program. We have shown how it provides support for a number of different cognitive techniques. Finally we have identified four possible directions for further research.

Appendix A. SMCS system description

NOTE. This section is based on notes taken at a July 1992 meeting in York between project members and representatives of MoD and BAe Sema. It may not therefore reflect the current state of affairs precisely.

The SMCS system hardware consists of a number of logical nodes connected together by a fibre-optic ring. The nodes are a number of multi-function consoles (MFCs) which are the main operator display devices. There are also a number of common service nodes (CSNs) which hold the system database and also provide number-crunching and other support, and some I/O nodes which are the connection to the sensor and control systems on board. There is also a system console and a small number of remote terminals. All hardware components are duplicated.

Each node typically consists of one or more ‘‘cards’’, each containing one processor. The processors are a variety of 386, 486 and 68020 CPUs with some Transputers in the CSNs for target plot calculations. The cards within a node are connected by a Multibus II link, and one card within each node acts as the fibre-optic ring gateway.

Each node has a serial port and an Ethernet port. The Ethernet connections are only used in the development version of the system and are there for down-loading and debugging purposes.

The basic operation of the system is that data received by the input nodes, say the location of a target, is passed to the CSN which stores it in the database and updates relevant ‘‘pictures’’. The updated parts of the picture would be broadcast around the ring, and any node interested in it (e.g. an MFC which was displaying that picture) would pick it up and update its local copy.

The system has no dynamic task creation, and the allocation of tasks to cards is done at build time. There is some problem of ‘‘heap-creep’’, and data is paged between the MFCs and the CSNs.

BAe Sema’s ‘‘house culture’’ for the development of software is based around JSD and they have a number of software tools to help them develop systems from a JSD design. JSD objects are mapped on to Ada constructs. This mapping is performed automatically.

The system’s software is conceptually in two parts: an infrastructure layer and an applications layer. The infrastructure layer is itself built on top of the TeleSoft Ada executive – there are some direct calls from the top layer to the Exec.

Each of the infrastructure and applications layers consists of a number of Ada tasks. Typically a card might have 10 or so tasks running on it, of which 2 might be applications tasks and the rest part of the infrastructure.

REFERENCES

1. H. Agrawal and E.H. Spafford, ‘‘A bibliography on debugging and backtracking’’, ACM SIGSOFT Software Engineering Notes, pp. 49-51 (April 1989).

2. H. Agrawal, R.A. DeMillo and E.H. Spafford, ‘‘An execution backtracking approach to debugging’’, IEEE Software, pp. 21-26 (May 1991).

3. H. Agrawal, R.A. DeMillo and E.H. Spafford, ‘‘Debugging with dynamic slicing and backtracking’’, Software – Practice and Experience 23(6), pp. 589-616 (June 1993).

4. J.R. Anderson, Cognitive psychology (and its implications), W.H. Freeman (1985).

5. K. Araki, Z. Furukawa and J. Cheng, ‘‘A general framework for debugging’’, IEEE Software, pp. 14-20 (May 1991).

6. V.R. Basili and H.D. Mills, ‘‘Understanding and documenting programs’’, IEEE Trans Software Engineering 8, pp. 270-283 (1982).

7. G. Booch, Software engineering with Ada, Benjamin/Cummings (1983).

8. R. Brooks, ‘‘Towards a theory of the comprehension of computer programs’’, International Journal of Man-Machine Studies 18, pp. 543-554 (1983).

9. R. Brooks, ‘‘Toward a theory of the cognitive processes in computer programming’’, in Tutorial: human factors in software development, ed. B. Curtis, B.D. Carroll, J. Cotton, E. Nahouraii, F.E. Petry and C. Wu, IEEE Computer Society Press (1985).

10. W. Cai, W.J. Milne and S.J. Turner, ‘‘Graphical views of the behaviour of parallel programs’’, Journal of Parallel and Distributed Computing 18, pp. 223-230 (1993).

11. A. Cobbett, RELATE: Replay of Extremely Large Ada Tasking Environments, University of York 4th year Project (21 Jul 89).

12. A.P. Cobbett and I.C. Wand, ‘‘The debugging of large multi-task Ada programs’’, Ada User 10(Supplement), pp. 122-131 (1989). Proceedings of the Ada UK 8th International Conference.

13. J.R. Firth, C.H. Forsyth, L. Tsao, K.S. Walker and I.C. Wand, ‘‘York Ada Workbench Compiler Release 2 User Guide’’, YCS.87, Department of Computer Science, University of York (March 1987).

14. G.S. Goldszmidt, S. Yemini and S. Katz, ‘‘High-level language for debugging concurrent programs’’, ACM Transactions on Computing Systems 8(4), pp. 311-336 (November 1990).

15. L. Gugerty and G.M. Olson, ‘‘Comprehension differences in debugging by skilled and novice programmers’’, in Empirical studies of programmers, ed. E. Soloway and S. Iyengar, Ablex Publishing (1986).

16. A.D. Hutcheon and A.J. Wellings, ‘‘The Virtual Node Approach to Programming Distributed Embedded Systems in Ada’’, YCS.115, Department of Computer Science, University of York (February 1989).

17. R.A. Jeffries, Comparison of debugging behaviour of novice and expert programmers, Department of Psychology, Carnegie Mellon University (1982).

18. T. Kunz, Programming paradigms and clustering rules, Technische Hochschule Darmstadt (February 1993).

19. L. Lamport, ‘‘Time, clocks, and the ordering of events in a distributed system’’, Communications of the ACM 21(7), pp. 558-564 (July 1978).

20. S. McCoy-Carver and S. Clarke-Risinger, ‘‘Improving children’s debugging skills’’, in Empirical studies of programmers: second workshop, ed. G.M. Olson, S. Sheppard and E. Soloway, Ablex Publishing (1987).

21. C.E. McDowell and D.P. Helmbold, ‘‘Debugging concurrent programs’’, Computing Surveys 21(4), pp. 593-622 (December 1989).

22. R.A. Olsson, R.H. Crawford and W.W. Ho, ‘‘A dataflow approach to event based debugging’’, Software Practice & Experience 21(2), pp. 209-229 (February 1991).

23. S. Patel, W. Chu and R. Baxter, ‘‘A measure for composite module cohesion’’, pp. 38-48 in Proceedings of the 14th International Conference on Software Engineering, Melbourne, Australia (May 1992).

24. D.J. Rowland, Graphical replay of a multi-task Ada program, University of York MSc Project (September 1986).

25. B. Shneiderman and R. Mayer, ‘‘Syntactic/semantic interactions in programmer behaviour: a model and experimental results’’, in Tutorial: human factors in software development, ed. B. Curtis, B.D. Carroll, J. Cotton, E. Nahouraii, F.E. Petry and C. Wu, IEEE Computer Society Press (1985).

26. J.M. Stone, ‘‘A graphical representation of concurrent processes’’, SIGPLAN Notices 24(1), pp. 226-235 (January 1989).

27. I.C. Wand, J.R. Firth, C.H. Forsyth, L. Tsao and K.S. Walker, ‘‘Facts and figures about the York Ada compiler’’, Ada Letters VII(4) (July/August 1987).

28. M. Weiser, ‘‘Program slicing’’, in Tutorial: human factors in software development, ed. B. Curtis, B.D. Carroll, J. Cotton, E. Nahouraii, F.E. Petry and C. Wu, IEEE Computer Society Press (1985).

29. M. Weiser and J. Lyle, ‘‘Experiments on slicing-based debugging aids’’, in Empirical studies of programmers, ed. E. Soloway and S. Iyengar, Ablex Publishing (1986).

30. S. Wiedenbeck, ‘‘Beacons in computer program comprehension’’, International Journal of Man-Machine Studies 25, pp. 697-709 (1986).
