System architecture recovery for open source software integration: A prototype to support system comprehension and component substitution

P. Charland, D. Ouellet, M. Salois
DRDC Valcartier

Defence R&D Canada – Valcartier

Technical Report

DRDC Valcartier TR 2008-464

May 2009

Principal Author

Original signed by Philippe Charland

Philippe Charland

Defence Scientist

Approved by

Original signed by Guy Turcotte

Guy Turcotte

Section Head, System of Systems

Approved for release by

Original signed by Christian Carrier

Christian Carrier

Chief Scientist

© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2009

© Sa Majesté la Reine (en droit du Canada), telle que représentée par le ministre de la Défense nationale, 2009

Abstract

Software component substitution is a common maintenance activity. It can informally be defined as the process of replacing an existing component in a system with a candidate component that meets new functional or non-functional requirements. A component is usually considered a black box that provides and requires services through its interfaces. Before a component can be substituted, the existing software system must be understood in order to identify the scope of the component to replace and the services it provides, and to comprehend how it depends on other components and how its replacement could affect the overall system behavior. These tasks are already challenging for large and complex systems, and they are further complicated by the fact that, for most such systems, the source code is the only complete and up-to-date documentation available. As a result, before a component can be substituted, the architecture of the existing system must be understood. This technical report describes a suite of tools to assist with this task. The suite was implemented in Eclipse, an extensible integrated development environment (IDE). A case study demonstrates the applicability of the suite on a realistic software component substitution example.

Résumé

La substitution de composants logiciels est une activité de maintenance courante. Elle peut être définie de façon informelle comme étant le processus pour remplacer un composant existant dans un système par un composant candidat répondant à de nouvelles exigences fonctionnelles ou non fonctionnelles. Un composant est généralement considéré comme une boîte noire qui fournit et requiert des services par l’entremise de ses interfaces. Avant qu’un composant puisse être substitué, le système existant doit être compris afin d’identifier la délimitation du composant à remplacer, les services qu’il fournit et comprendre comment il dépend d’autres composants ou comment son remplacement pourrait affecter le fonctionnement de l’ensemble du système. Bien que ces tâches soient déjà difficiles pour des systèmes complexes et de grande taille, elles sont davantage compliquées par le fait, que pour la plupart d’entre eux, le code source est la seule documentation disponible qui soit complète et à jour. Par conséquent, avant qu’un composant puisse être substitué, l’architecture du système existant doit être comprise. Ce rapport technique décrit une suite d’outils pour aider dans cette tâche. Celle-ci a été implantée dans Eclipse, un environnement de développement intégré (EDI) extensible. Une étude de cas est présentée pour démontrer l’applicabilité de cette suite d’outils sur un exemple concret de substitution d’un composant logiciel.

Executive summary

System architecture recovery for open source software integration: A prototype to support system comprehension and component substitution

Philippe Charland; David Ouellet; Martin Salois; DRDC Valcartier TR 2008-464; Defence R&D Canada – Valcartier; May 2009.

Introduction: During the last two decades, the software market has been dominated by Commercial Off-the-Shelf (COTS) products. During this period, the inherent limitations associated with these products have emerged. This has paved the way for a movement towards Free and Open Source Software (FOSS). FOSS refers to programs whose source code is made freely available to anyone interested in using or modifying it. For the past ten years, the FOSS movement has been steadily growing in importance.

Results: In order to facilitate the use of FOSS in the Canadian Forces’ (CF) systems, the members of the Opening up Architectures of Software-Intensive Systems (OASIS) research project have developed a prototype to assist with the process of component substitution. The objective is to support two cases: first, the replacement of a COTS component in an existing Java software system with its FOSS equivalent; second, the replacement of a FOSS component with a modified and enhanced version of the same component. The prototype was implemented in Java as a collection of Eclipse plug-ins.

This project, a successor of 15ak, was carried out for the Joint (15av) and Air Force (13jf) client groups from April 2005 to March 2008, mainly by the System Engineering and Architecture group of the System of Systems section at DRDC Valcartier. Using about 3.5 full-time equivalent (FTE) people and a contracting budget of approximately $1M, it also leveraged expertise and tools from other projects of the late Information and Knowledge Management section.

The techniques implemented as part of the prototype make it possible to understand the architecture of existing software systems. In a component substitution, this fundamental step must be performed first to identify the scope of the component to replace and the services it provides, and to comprehend how it depends on other components and how its replacement could affect the overall system behavior. The implemented techniques fall into the following four broad categories.

Textual, lexical, and syntactical analysis - Common to these techniques is their focus on the source code and its representations. The vast majority of them are based on a static analysis of the software system in which the source code is first parsed, stored in an intermediate representation, and then represented at various abstraction levels. The stored information can typically be retrieved using techniques associated with the intermediate format.

Visualization - Software visualization aims at providing high-level views of the system under study, by hiding irrelevant low-level details, and focussing on the abstract views of the relevant information. Visualization techniques are usually applied in combination with lexical and syntactical analysis techniques to provide different views and interpretations of the underlying information.

Execution and testing - Typically dynamic in nature, these techniques are based on the observation of the system behavior, as well as on the profiling and monitoring of its execution.

Domain knowledge-based analysis - These approaches focus on recovering the domain semantics of a software system, i.e., to understand the functionality of the source code in terms of the system’s application domain. They combine domain knowledge representation and source code analysis. The objective is to establish traceability links between the source code and the application domain.
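As an illustration of the first category, the parse, store, and query steps can be sketched in a few lines of Java. This is a minimal sketch and not the OASIS prototype's implementation: the class FactRepositorySketch, its methods, and the package names are hypothetical, and the facts are entered by hand where a real extractor would derive them from parsed source code. It keeps call facts in a simple in-memory representation and answers two questions relevant to component substitution: which types outside a component call into it, and which of the component's types they call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from the OASIS prototype.
public class FactRepositorySketch {

    // One stored fact: some method of callerType calls some method of calleeType.
    private static final class CallFact {
        final String callerType;
        final String calleeType;

        CallFact(String callerType, String calleeType) {
            this.callerType = callerType;
            this.calleeType = calleeType;
        }
    }

    // In-memory stand-in for the intermediate representation of extracted facts.
    private final List<CallFact> facts = new ArrayList<>();

    // Records a call fact, as a fact extractor would after parsing the source code.
    public void add(String callerType, String calleeType) {
        facts.add(new CallFact(callerType, calleeType));
    }

    // A type belongs to a component if its qualified name lies in the package or a subpackage.
    private static boolean inPackage(String typeName, String packageName) {
        return typeName.startsWith(packageName + ".");
    }

    // Types outside the component that call into it, i.e. code affected by a substitution.
    public Set<String> clientsOf(String componentPackage) {
        Set<String> clients = new TreeSet<>();
        for (CallFact fact : facts) {
            if (inPackage(fact.calleeType, componentPackage)
                    && !inPackage(fact.callerType, componentPackage)) {
                clients.add(fact.callerType);
            }
        }
        return clients;
    }

    // Types inside the component that are called from outside, i.e. its observed services.
    public Set<String> entryPointsOf(String componentPackage) {
        Set<String> entryPoints = new TreeSet<>();
        for (CallFact fact : facts) {
            if (inPackage(fact.calleeType, componentPackage)
                    && !inPackage(fact.callerType, componentPackage)) {
                entryPoints.add(fact.calleeType);
            }
        }
        return entryPoints;
    }

    public static void main(String[] args) {
        FactRepositorySketch repository = new FactRepositorySketch();
        repository.add("app.ui.SearchPanel", "vendor.index.IndexReader");
        repository.add("app.batch.Importer", "vendor.index.IndexWriter");
        repository.add("vendor.index.IndexWriter", "vendor.index.Segment");

        System.out.println("Clients: " + repository.clientsOf("vendor.index"));
        System.out.println("Entry points: " + repository.entryPointsOf("vendor.index"));
    }
}
```

Running such queries before a substitution gives a first estimate of the component's boundary: the entry points approximate the services a replacement must provide, and the clients are the parts of the system that the replacement could affect.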

Significance: The work carried out as part of this research project is important for any organization developing or maintaining software systems since, for most such systems, the source code is the only documentation that is both complete and up-to-date. This report is therefore intended for software developers who have to understand existing source code, as well as for managers and decision makers of organizations acquiring software systems that will have to be maintained.

Being able to understand the architecture of existing systems is critical for the CF if they want to ensure that their software systems conform to the architectures that were originally envisioned and that they evolve in a disciplined manner. Furthermore, increasing code diversity by using FOSS instead of relying exclusively on COTS could reduce the number of cyber attacks and the speed at which they proliferate. It could also decrease the CF’s dependence on commercial software systems, which can become too expensive to purchase or be discontinued.

Future plans: Depending on the client’s needs, future work might consist of continuing the development of the prototype so that it can comprehend the architecture of C/C++ systems using the Eclipse C/C++ Development Tooling (CDT) project. Examples of such systems that could be of interest are the large acquisition projects from the Directorate of Technical Airworthiness and Engineering Support (DTAES). Another possibility would be to adapt the techniques and tools to support assembly language for national security purposes. A further option would be to continue improving the prototype for Java, as it seems to have become the language of choice for Command and Control (C2) systems.

Sommaire

System architecture recovery for open source software integration: A prototype to support system comprehension and component substitution

Philippe Charland; David Ouellet; Martin Salois; DRDC Valcartier TR 2008-464; R & D pour la défense Canada – Valcartier; mai 2009.

Introduction : Au cours des deux dernières décennies, le marché du logiciel a été dominé par les produits commerciaux sur étagère. Durant cette période, les limitations intrinsèques rattachées à ces produits sont apparues. Cela a ouvert la voie à un mouvement vers des logiciels à code source libre et ouvert. Ce mouvement réfère à des programmes dont le code source est disponible sans aucune contrainte à quiconque est intéressé à l’utiliser ou à le modifier. Depuis les dix dernières années, les logiciels à code source libre et ouvert ont pris une importance croissante.

Résultats : Afin de faciliter l’usage de logiciels à code source libre et ouvert dans les systèmes des Forces canadiennes (FC), les membres du projet de recherche Ouverture d’Architectures de Systèmes Informatisés Significativement (OASIS) ont développé un prototype pour aider au processus de substitution de composants. L’objectif est de supporter, dans un premier temps, le remplacement d’un composant commercial sur étagère dans un système logiciel Java existant par un composant à code source libre et ouvert équivalent. Dans un deuxième temps, l’objectif est de pouvoir remplacer un composant à code source libre et ouvert par une version modifiée et améliorée de ce même composant. Le prototype a été implanté en Java dans une collection de plugiciels Eclipse.

Ce projet, un successeur de 15ak, a été réalisé pour les groupes clients des forces interarmées (15av) et aérienne (13jf), à partir du mois d’avril 2005 jusqu’au mois de mars 2008, principalement par le groupe Ingénierie et architecture de systèmes de la section Système de systèmes de RDDC Valcartier. Utilisant l’équivalent temps plein (ETP) de 3,5 personnes et un budget d’environ 1 M$ pour la passation de contrats, le projet a aussi misé sur les compétences et les outils venant d’autres projets de la défunte section Gestion de l’information et de la connaissance.

Les techniques qui ont été implantées dans le cadre du prototype permettent de comprendre l’architecture de systèmes logiciels existants. Cette étape fondamentale doit être accomplie en premier dans le cas d’une substitution de composants afin d’identifier la délimitation du composant à remplacer, les services qu’il fournit et comprendre comment il dépend d’autres composants ou comment son remplacement pourrait affecter le fonctionnement de l’ensemble du système. Les techniques implantées se situent dans les quatre grandes catégories suivantes.

Analyse textuelle, lexicale et syntaxique - Commun à ces techniques est qu’elles s’articulent autour du code source et de ses représentations. La vaste majorité d’entre elles sont basées sur une analyse statique du système logiciel dans laquelle le code source est tout d’abord analysé grammaticalement, mémorisé dans une représentation intermédiaire et puis représenté à différents niveaux d’abstraction. L’information mémorisée peut généralement être récupérée en utilisant des techniques associées au format intermédiaire.

Visualisation - La visualisation logicielle vise à fournir des vues de haut niveau du système logiciel à l’étude, en masquant les détails de bas niveau non pertinents et en se concentrant sur les vues abstraites des informations pertinentes. Les techniques de visualisation sont généralement appliquées en combinaison avec des techniques d’analyse lexicale et syntaxique pour fournir des vues et interprétations différentes de l’information sous-jacente.

Exécutions et tests - Généralement de nature dynamique, ces techniques sont basées sur l’observation du fonctionnement d’un système, ainsi que sur le profilage et la surveillance de son exécution.

Analyse basée sur la connaissance du domaine - Ces approches se concentrent sur la récupération de la sémantique du domaine d’un système logiciel, c’est-à-dire comprendre la fonctionnalité du code source en fonction du domaine d’application du système. L’objectif est d’établir des liens de traçabilité entre le code source et le domaine d’application.

Importance : Le travail qui a été accompli dans le cadre de ce projet de recherche est important pour toutes les organisations développant ou faisant la maintenance de systèmes logiciels, puisque pour la majorité de ceux-ci, le code source est la seule documentation qui est à la fois complète et à jour. Pour cette raison, l’audience cible de ce rapport sont les développeurs de logiciels qui ont à comprendre du code source existant ainsi que les gestionnaires et décideurs d’organisations acquérant des systèmes logiciels qui devront être maintenus.

Être capable de comprendre les architectures de systèmes existants est critique pour les FC si elles veulent s’assurer que leurs systèmes logiciels sont conformes aux architectures qui avaient été imaginées à l’origine et qu’ils évoluent de façon méthodique. De plus, augmenter la diversité du code en utilisant des logiciels à code source libre et ouvert plutôt que de dépendre exclusivement de produits commerciaux sur étagère pourrait réduire le nombre ainsi que la vitesse à laquelle les cyberattaques prolifèrent. Cela pourrait aussi réduire la dépendance des FC à l’égard des systèmes logiciels commerciaux qui peuvent devenir trop onéreux ou cesser d’être produits.

Perspectives : Selon les besoins du client, les travaux futurs pourraient consister à continuer le développement du prototype pour être capable de comprendre les architectures de systèmes développés en C/C++, en utilisant le projet Eclipse C/C++ Development Tooling (CDT). Des exemples de tels systèmes qui pourraient être d’intérêt sont les projets de grande acquisition du Directeur - navigabilité aérienne et soutien technique (DNAST). Une autre possibilité serait d’adapter les techniques et outils à l’assembleur pour des besoins de sécurité nationale. Une autre option serait de continuer à perfectionner le prototype pour le Java, étant donné qu’il semble être devenu le langage de choix pour les systèmes de commandement et contrôle (C2).

Table of contents

Abstract
Résumé
Executive summary
Sommaire
Table of contents
List of figures
List of tables
1 Introduction
2 Background
  2.1 Software component
  2.2 COTS component
  2.3 FOSS component
  2.4 Component granularity levels
  2.5 OASIS v2 prototype
3 OASIS v2 functionalities
  3.1 Infrastructure
    3.1.1 Repositories
      3.1.1.1 Source code
      3.1.1.2 Facts
      3.1.1.3 Exchange models
    3.1.2 Data access
      3.1.2.1 Data object definition
      3.1.2.2 Marshalling
      3.1.2.3 Unmarshalling
    3.1.3 Information management services
      3.1.3.1 Exchange model definition
      3.1.3.2 Browsing
      3.1.3.3 Querying
      3.1.3.4 Publishing
  3.2 Fact extraction
    3.2.1 Static fact extraction
      3.2.1.1 Parsing
      3.2.1.2 Decompilation
      3.2.1.3 Build file parsing
    3.2.2 Dynamic fact extraction
      3.2.2.1 Instrumentation
      3.2.2.2 Profiling
  3.3 Analysis
    3.3.1 Software metrics
    3.3.2 Domain knowledge definition and exploitation
  3.4 UML sequence diagrams
    3.4.1 OASIS sequence explorer views
    3.4.2 Grouping
    3.4.3 Loop recognition
    3.4.4 Searching, filtering and annotating
    3.4.5 Sequence diagram comparison
4 OASIS v2 implementation
  4.1 Eclipse
    4.1.1 The Eclipse project
    4.1.2 The Tools project
    4.1.3 The Technology project
    4.1.4 Plug-in architecture
    4.1.5 User Interface framework
      4.1.5.1 SWT
      4.1.5.2 JFace
      4.1.5.3 Workbench
  4.2 Infrastructure
    4.2.1 Source code repository
    4.2.2 Facts and exchange models repositories
    4.2.3 EMF
      4.2.3.1 Ecore model
      4.2.3.2 Ecore model creation and edition
      4.2.3.3 XMI serialization
    4.2.4 Data access
      4.2.4.1 Data object definition
      4.2.4.2 Marshalling and unmarshalling
    4.2.5 Querying and publishing
  4.3 Tools and techniques
    4.3.1 Parsing
    4.3.2 Decompilation
    4.3.3 Build file parsing
    4.3.4 Instrumenting
      4.3.4.1 Execution trace logging
    4.3.5 Profiling
    4.3.6 Software metrics
    4.3.7 Domain knowledge definition and exploitation
      4.3.7.1 TagSEA
  4.4 Sequence diagrams
    4.4.1 Execution trace based sequence diagrams
    4.4.2 SWT, JFace and Zest
    4.4.3 Sequence diagram comparison
5 Case study
  5.1 Application domain model
  5.2 Instrumentation
  5.3 Sequence diagram comparison
  5.4 Source code browsing and tagging
  5.5 Summary and discussion
6 Conclusions and future work
References
List of symbols/abbreviations/acronyms/initialisms
Distribution list

List of figures

Figure 1: OASIS functional architecture
Figure 2: Subset of the functional architecture implemented for OASIS v2
Figure 3: Exchange model for a call graph
Figure 4: Generic Object Browser view
Figure 5: Abstract syntax tree view
Figure 6: Java source code
Figure 7: Java source code [54]
Figure 8: Java bytecode of the fac() method [54]
Figure 9: Instrumented Java bytecode of the fac() method [54]
Figure 10: Memory Statistics view
Figure 11: Execution Statistics view
Figure 12: Software Metrics view
Figure 13: Domain model of the DocSearcher application
Figure 14: Location of interest tagged directly in source code
Figure 15: Domain Model and Waypoint views
Figure 16: UML sequence diagram [67]
Figure 17: OASIS Sequence Explorer views
Figure 18: Sequence Diagram view without (top) and with (bottom) the clone pane
Figure 19: Child activation grouped (top) and ungrouped (bottom)
Figure 20: java.lang package ungrouped (top) and grouped (bottom)
Figure 21: Loop recognition: One iteration of the loop displayed (i = 0) (top left); All iterations displayed (top right); Two iterations grouped (bottom)
Figure 22: Conditional statements and exception handling blocks
Figure 23: Package filtering
Figure 24: Sequence diagram comparison
Figure 25: Sequence diagram comparison without filters (top) and with filters (bottom)
Figure 26: Eclipse workbench
Figure 27: EMF unifies Java, XML, and UML [78]
Figure 28: Annotated Java interfaces [78]
Figure 29: UML diagram of interfaces [78]
Figure 30: XML schema [78]
Figure 31: Simplified subset of the Ecore model [78]
Figure 32: Ecore instances of the purchase order system [78]
Figure 33: EclipseUML graphical editor
Figure 34: Purchase order model serialized in XMI [78]
Figure 35: Teneo mapping process overview [83]
Figure 36: OASIS v2 infrastructure design [83]
Figure 37: yWorks Ant Explorer view
Figure 38: Targets page of a probe
Figure 39: Fragment page of a probe
Figure 40: Structure of an XRat file
Figure 41: Thread Analysis view
Figure 42: GMF overview [95]
Figure 43: TagSEA view
Figure 44: Generation of a sequence diagram based on an execution trace
Figure 45: Two sequences of numbers
Figure 46: Differences between two sequences of numbers
Figure 47: DocSearcher screenshot
Figure 48: Application domain model of DocSearcher
Figure 49: Sequence diagram comparison without filtering
Figure 50: Sequence diagram comparison with filtering
Figure 51: Sub-sequence diagram
Figure 52: DocSearcher addDocToIndex() method
Figure 53: DocSearcher addDocToIndex() method (continued)
Figure 54: DocSearcher getFileType() method

List of tables

Table 1: Component granularity levels

Table 2: XML and relational schemas

Table 3: In memory and RDBMS storage representation of a Core model

Table 4: XRat tag attributes

1 Introduction

During the last two decades, the software market has been dominated by Commercial Off-the-Shelf (COTS) products [1] such as the Windows operating system [2], the Office productivity suite [3], as well as the Oracle relational database management system (RDBMS) [4]. Although such software can offer a great number of functionalities, COTS products have limitations. The most important ones are that their source code is not public, that they are affected by security issues, and that users have to pay for expensive upgrades to obtain the most current, robust, and secure version. Furthermore, since the cost of switching from one COTS product to another one is high, users often stick with an inferior product, a phenomenon known as the lock-in effect.

The intrinsic limitations of COTS products have paved the way to a movement based on Free and Open Source Software (FOSS). FOSS refers to programs whose source code is made freely available for use or modification to anyone interested in using or working with it. It is developed either by groups of volunteers or by large organizations which want to experiment with a different business model based on collaborative development. For the past ten years, the FOSS movement has been constantly growing in importance. According to Spinellis and Szyperski [5], more than 115,000 open source projects were registered at the main four open source forums (i.e., www.freshmeat.net, www.sourceforge.net, www.cpan.com, and FreeBSD). In May 2004, it was estimated that 115 FOSS applications had achieved a maturity level comparable or superior to their COTS equivalents [5]. The most frequently cited FOSS products are the Linux operating system and the Apache Web server [6, 7].

One advantage that FOSS has over proprietary software concerns security. The fact that the source code is available encourages peer reviews, testing, and quality audits by a larger community of users and developers compared to their COTS equivalents. This results in an increased confidence level [8, 9]. COTS products are also often bloated with features. Since FOSS applications are smaller, they limit the opportunities for security exploits. Furthermore, as their source code is public, it can be enriched with assertions and other complementary safety checks. Moreover, not relying only on COTS software and increasing the code diversity using FOSS can help reduce the number as well as the speed at which cyber attacks proliferate [1].

In order to facilitate the use of FOSS in the Canadian Forces' (CF) systems, Defence Research and Development Canada (DRDC) Valcartier started a research project in 2005 called Command and Control Information Systems (C2IS) Architecture Understanding for Open Source Software Integration (first 15av, then 13jf). This project is the continuation of the Opening up Architectures of Software-Intensive Systems (OASIS) research project [10] (15ak), which started in 2002 and ended in 2005. The objectives of the C2IS Architecture Understanding for Open Source Software Integration project are to develop tools and techniques to understand the architecture of complex C2IS as well as assess and adapt FOSS components prior to their integration in these systems.

The integration of FOSS components will take the form of substitutions. The goal is to develop technical solutions to support, in the first case, the replacement of a COTS component in a C2IS by its FOSS equivalent and in the second case, the substitution of a FOSS component by a modified and enhanced version of the same component. This technical report presents the prototype which was developed as part of the project to assist in performing such tasks. It is built on the results obtained as part of the OASIS project [11, 12, 13]. The prototype was developed in Java as a collection of Eclipse [14] plug-ins. Eclipse is an extensible open source integrated development environment (IDE).

The remainder of this technical report is organized as follows: Chapter 2 provides background information about the research project. Chapters 3 and 4 respectively describe the prototype's functionalities and its implementation. A case study of a component substitution performed using the prototype is then presented in Chapter 5. Finally, Chapter 6 provides conclusions and future work.


2 Background

As mentioned in the introduction, the C2IS Architecture Understanding for Open Source Software Integration project is the continuation of the OASIS research project. The objective of the latter was to develop technical solutions to reduce the time needed to comprehend the architecture of already existing software systems. As part of this project, a functional architecture of the ideal tool for software system understanding was conceptualized (Figure 1) [12] and a prototype implementing a subset of it was developed in Java as a collection of Eclipse plug-ins [13].

Figure 1: OASIS functional architecture

Before substituting a component in a software system with another one, the existing system must be understood to identify the scope of the component to replace, the services it provides, as well as to comprehend how it depends on other components or how its replacement could affect the overall system behavior [15]. Although these tasks are already challenging for large-scale and complex systems, they are further complicated by the fact that for most of them, the specification and design artefacts are unavailable or of poor quality [16]. In such cases, the source code is the only documentation available [17]. As a result, before a component can be substituted, the mental reconstruction of the existing system has to take place. This requires the recovery of models that incorporate architecturally significant concepts such as architectural level components, connectors, design patterns, component distribution information, and architectural decisions [15]. Typically, these types of information cannot be directly extracted from source code without domain knowledge. Also, they must be reconstructed and expressed during the recovery process by producing models at different abstraction levels. Furthermore, the mapping between the abstraction levels must be retained and available during the component comprehension and substitution process [15].

Because the OASIS project aimed at recovering such architectural models from existing software systems, the functional architecture designed as part of it, as well as the first version of the prototype (OASIS v1), served as the building blocks for the different Eclipse plug-ins developed to support component substitution. But before giving an overview of the functionalities of this collection of plug-ins called OASIS v2, the following terms need to be defined in the context of the present technical report: software component, COTS component, FOSS component, and component granularity levels.

2.1 Software component

Both the term component and what constitutes one are ambiguous, and a large number of definitions and interpretations can be found in the literature [15]. One simple and compact definition of a component is the following: “binary units of independent production, acquisition and deployment” [18]. But other looser definitions can also be found, such as “a physical, replaceable part of a system that packages implementation and provides the realization of a set of interfaces. A component represents a physical piece of implementation of a system, including software code (source, binary or executable) or equivalents such as scripts or command files” [19]. Yet another definition of a component is “a unit of design (at any level), for which a structure is defined, a name identifying the component is associated, and for which design guidelines, in the form of design documentation, are provided in order to support the reuse of the component and to illustrate the context where it can be reused” [19].

Within this document, a component is defined as a “set of related entities that together have either functional or abstract cohesion. They therefore form a cluster that corresponds to a set of subprograms (functions and procedures), variables, constants, and user-defined types” [15]. Based on this definition, “a component therefore corresponds to a group of related elements with a unifying common goal or concept relevant at the architectural level” [15]. It can be implemented in a variety of forms, such as executable, source code/file, object code, object library, or operating system (OS) built-in facility [20].

2.2 COTS component

What constitutes a COTS is ambiguous, as most articles published in the literature adopt their own definition. The definition of a COTS component used in this document is based on the one from the Center for Empirically Based Software Engineering (CeBASE) for COTS, i.e., a software product [21]:

developed by a third party (which controls its ongoing support and evolution),


bought, licensed, or acquired for the purpose of integration into a larger system as an integral part, i.e., that will be delivered as part of the system to the customer of that system,

which might or might not allow modification at the source code level,

but may include mechanisms for customization,

and is bought and used by a significant number of systems developers.

For the purpose of this document, the following property is added to the previous definition [15]:

the source code of a COTS component is not available for analysis or modification.

2.3 FOSS component

As mentioned in the introduction, FOSS refers to software which “gives users the right to run, copy, distribute, study, change, and improve it as they see fit, without having to ask permission from or make additional payments to any external group or person” [22]. In this context, the word free does not refer to financial cost, but to the autonomy rights granted by FOSS to its users [22]. The phrase open source software refers to FOSS that uses any of the licenses [23] approved by the Open Source Initiative (OSI) [24]. This list of licenses is based on the open source definition [25] derived from the set of software user rights [26] formulated in the late 1980s by Richard Stallman of the Free Software Foundation [27]. The OSI licenses include additional criteria aimed at ensuring the fairness of the licenses. Despite these differences, both result in the selection of almost identical sets of licenses [22]. Free software usually qualifies as open source and vice versa, as they both derive from Stallman's list of software user rights.

In the context of this document, a FOSS component is characterized by the fact that it is [15]:

developed by a community of developers who control its ongoing support and evolution,

available at no cost,

licensed with rights for users to study, change, and improve its design through the availability of its source code, licensed or acquired,

used by a significant number of systems developers.

2.4 Component granularity levels

A typical modeling aspect of a software component is the granularity level at which it is defined [28]. For example, in Enterprise Resource Planning (ERP) systems which were first introduced in the 1990s, functionally complete subsystems are considered as basic components. An information system is therefore developed by assembling and customizing these components [29]. An ERP suite provides a single, homogeneous solution for a significant number of back-office functions in an organization, such as integrated finance, human resources, and manufacturing/supply-chain processes, by defining common semantics and models for the organization, as well as a single architecture [30]. More recently, the trend is towards finer-grained components and application frameworks [28, 31], such as the various e-commerce components suites and frameworks offered by different vendors, e.g., [32], where components are used as building blocks for assembling new e-commerce applications and portals [29].

In the context of the present research project, the granularity levels of the components to substitute in C2IS will greatly vary. A component could be as small as a few lines of source code, or as large as an entire self-contained software system. Table 1 lists the different component granularity levels considered, as well as their equivalent software artefacts.

Table 1: Component granularity levels

Granularity level | Corresponding software artefacts
Small | Lines of code (e.g., 2-3 LOC)
Medium | Method(s), function(s), class
Large | Library (e.g., dll, jar), package, application domain functionality (e.g., mission planning)
Very large | Subsystem, system (single or distributed)

Although each of the above granularity levels requires different techniques and tools to first locate a software component in a system and then substitute it, they complement each other. The techniques that operate at a low level of abstraction, e.g., the statement or method level, often build the basis for techniques used to substitute components at a higher granularity level, i.e., a subsystem or complete system.

2.5 OASIS v2 prototype

To assist the CF in using FOSS in their software systems, a prototype was developed to support the process of component substitution. As previously mentioned, there are different sizes of components one can consider in a C2IS. The prototype presented in this document focuses on the large and very large granularity levels (e.g., library, package, application domain functionality, and subsystem). By focusing on such granularity levels, i.e., the architectural level, the prototype can reuse some of the technical solutions which were implemented as part of the OASIS research project, hence the name OASIS v2.

The OASIS v2 prototype supports the component substitution process for object-oriented software systems written in Java, although the software metrics it provides also support the C/C++ programming language. This was done to partially address one limitation of the existing tools for architecture comprehension previously identified in [33], i.e., multi-language support.


It has to be noted that there is no single technique that can support a general component substitution process. A combination of different techniques is required to provide the insights and information needed to guide maintainers during the process, as it is a multidimensional problem domain that requires the incorporation of multiple information artefacts, at different abstraction levels. To complicate matters, the source code implementing the component can be located in non-contiguous parts, even though they are conceptually related.

To address these issues, the OASIS v2 prototype implements a selected subset of the functional architecture presented at the beginning of this section. The implemented techniques, as well as the infrastructure supporting them, are highlighted in yellow in Figure 2.

Figure 2: Subset of the functional architecture implemented for OASIS v2

The techniques and infrastructure implemented in the OASIS v2 prototype will be discussed in greater detail in Chapters 3 and 4, but the techniques applied during a component substitution process typically fall within one of the following four categories [15]:

Textual, lexical, and syntactical analysis - Common to these techniques is their focus on the source code and its representations. The vast majority of them are based on a static analysis of the software system in which the source code is first parsed, stored in an intermediate representation such as an abstract syntax tree (AST), and then represented at various abstraction levels. The stored information can typically be retrieved using techniques associated with the intermediate format. Additional filtering can be performed with graph-based analysis techniques (e.g., control and data flow).

Visualization - Software visualization aims at providing high-level views of the system under study, by hiding irrelevant low-level details, and focussing on the abstract views of the relevant information. Visualization techniques are usually applied in combination with lexical and syntactical analysis techniques to provide different views and interpretations of the underlying information.

Execution and testing - The techniques within this category are typically dynamic in nature, as they are based on the observation of the system behavior, as well as on the profiling and monitoring of its execution.

Domain knowledge-based analysis - These approaches focus on recovering the domain semantics of a software system, i.e., to understand the functionality of the source code in terms of the system's application domain. They combine domain knowledge representation and source code analysis. The objective is to establish traceability links between the source code and the application domain.


3 OASIS v2 functionalities

This section describes the functionalities OASIS v2 offers to assist in understanding the architecture of software-intensive systems. Their implementation is discussed in Chapter 4.

3.1 Infrastructure

The infrastructure provides the foundations of OASIS v2 on which the different techniques to analyze and visualize the architecture of a software system under study are built. It consists of the Repositories, Data Access, and Information Management subsystems of Figure 2.

3.1.1 Repositories

In order to understand an existing software system at the architectural level, one needs to have access to different types of information. In the course of the comprehension process, additional information will also be generated. This information needs to be stored persistently in repositories. OASIS v2 contains three such repositories, which store respectively the source code, facts, and models associated with the software system under study. These repositories are logical ones and are not necessarily implemented as databases.

3.1.1.1 Source code

Component substitution, like most high-level reverse engineering analysis and architecture recovery activities, is based on the software system source code. It is therefore one type of information which has to be stored in the repositories. OASIS v2 supports two mechanisms to manage source code and its associated files. They are described in Subsection 4.2.1.

3.1.1.2 Facts

This repository contains the basic facts about a subject system, at a low level of abstraction. These facts are usually extracted using lexical or parser-based tools, in the case of static information, or profiling tools for dynamic information. As each tool usually has its own specific data schema, the interoperability between tools is limited and often restricted to the use of a standard exchange format, such as the Graph eXchange Language (GXL) [34], to describe the schema in the case of graph-based tools.

3.1.1.3 Exchange models

To address the limitation mentioned in the previous paragraph and facilitate the interoperability among tools to be integrated within OASIS v2, the infrastructure provides exchange models. These models allow independently developed tools to be integrated within OASIS v2 and be able to define and share common data. The Models repository persistently stores the instances of these exchange models.


3.1.2 Data access

The Data Access subsystem handles the mapping of low level data elements to higher level constructs. This supports the goal previously mentioned, i.e., to allow different tools to use their own data schema to persistently store information, while still being able to share data between them. The service groups of this subsystem provide functionalities to define and transform data elements to conform to the different exchange models.

3.1.2.1 Data object definition

To allow different tools to interoperate, their data elements must be compatible. As the tools integrated within OASIS v2 all use exchange models, data object definition consists of mapping them to the persistence technology used, in this case, a PostgreSQL database [35]. In OASIS v2, Hibernate [36] is used to generate the database schema corresponding to an exchange model. PostgreSQL and Hibernate are covered in more detail in Section 4.2.

3.1.2.2 Marshalling

Marshalling consists in pulling, from the repositories, information elements, packaging them into a data object, and then sending it to the information consumer that requested it. In OASIS v2, the implementation of this functionality is provided by the application programming interface (API) of Hibernate, which can load objects stored in a database.

3.1.2.3 Unmarshalling

Unmarshalling is the opposite of marshalling. It consists in separating a data object into its constituent information elements and storing them in their corresponding repository. As for marshalling, the implementation of this functionality is also provided by Hibernate.
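The round trip between unmarshalling and marshalling can be illustrated with a minimal Java sketch. It does not use Hibernate: a plain in-memory map stands in for the repository, and the names `RepositorySketch`, `CallFact`, `unmarshal`, and `marshal` are hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a repository; OASIS v2 delegates this work to Hibernate.
class RepositorySketch {

    // A simple data object: one "caller calls callee" fact.
    static final class CallFact {
        final String caller;
        final String callee;

        CallFact(String caller, String callee) {
            this.caller = caller;
            this.callee = callee;
        }
    }

    // Each stored row maps a column name to one information element.
    private final Map<Integer, Map<String, String>> rows = new HashMap<>();
    private int nextId = 1;

    // Unmarshalling: split the data object into its elements and store them.
    int unmarshal(CallFact fact) {
        Map<String, String> row = new HashMap<>();
        row.put("caller", fact.caller);
        row.put("callee", fact.callee);
        int id = nextId++;
        rows.put(id, row);
        return id;
    }

    // Marshalling: pull the elements from the store and package them into a data object.
    CallFact marshal(int id) {
        Map<String, String> row = rows.get(id);
        if (row == null) {
            throw new IllegalArgumentException("No fact stored under id " + id);
        }
        return new CallFact(row.get("caller"), row.get("callee"));
    }

    public static void main(String[] args) {
        RepositorySketch repository = new RepositorySketch();
        int id = repository.unmarshal(new CallFact("Indexer.addDoc", "Indexer.getFileType"));
        CallFact loaded = repository.marshal(id);
        System.out.println(loaded.caller + " -> " + loaded.callee);
    }
}
```

The information consumer only sees `CallFact` objects; how the elements are laid out in the store remains an internal detail of the repository.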

3.1.3 Information management services

Information Management is the top level subsystem of the infrastructure. It provides services to integrate tools within OASIS v2 so that they can use the Repositories and Data Access subsystems.

3.1.3.1 Exchange model definition

Exchange models are the foundation for the other services provided by the infrastructure, as they support data integration. Using exchange models, independently developed tools can define and share common data. As a result, before a tool can be integrated within OASIS v2, either by developing it or by reusing an existing one, its exchange model must be specified. For example, if a tool generating call graphs had to be incorporated, an exchange model similar to the one shown in Figure 3 would first need to be defined.


[UML class diagram showing NamedElement (+name : String), Package, Method, Type, Parameter, Class, Interface, and Datatype, with associations labelled returnType, calls, and type (multiplicities 0..*).]

Figure 3: Exchange model for a call graph

Once defined, the Data Object Definition service group, through Hibernate, could then generate its corresponding relational model and store instances of this exchange model into the Models repository. As a result, it could be used by other subsystems, e.g., Visualization, which needs to operate on call graphs. Furthermore, other information management services could retrieve and store instances of this exchange model without having to deal with the intricacies of the persistence mechanism used.
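As a rough illustration of what a call graph exchange model might look like once expressed as Java classes, the following sketch defines a reduced model and a traversal over it. The class names follow the elements named in the figure, but the inheritance relationships, the omission of Package and Parameter, and the `reachableFrom` helper are assumptions made for this example, not the model used in OASIS v2.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical rendering of a call graph exchange model.
class CallGraphModelSketch {

    // Every model element carries a name.
    static class NamedElement {
        final String name;

        NamedElement(String name) {
            this.name = name;
        }
    }

    // A type that can serve as the return type of a method.
    static class Type extends NamedElement {
        Type(String name) {
            super(name);
        }
    }

    // A method with a return type and the methods it calls (0..*).
    static class Method extends NamedElement {
        final Type returnType;
        final List<Method> calls = new ArrayList<>();

        Method(String name, Type returnType) {
            super(name);
            this.returnType = returnType;
        }
    }

    // Names of all methods reachable from 'start' by following 'calls' edges.
    // Method names are assumed to be unique within the model.
    static Set<String> reachableFrom(Method start) {
        Set<String> names = new LinkedHashSet<>();
        for (Method callee : start.calls) {
            visit(callee, names);
        }
        return names;
    }

    private static void visit(Method method, Set<String> names) {
        if (!names.add(method.name)) {
            return; // already visited; this guards against cycles
        }
        for (Method callee : method.calls) {
            visit(callee, names);
        }
    }

    public static void main(String[] args) {
        Type voidType = new Type("void");
        Method index = new Method("index", voidType);
        Method parse = new Method("parse", voidType);
        Method store = new Method("store", voidType);
        index.calls.add(parse);
        parse.calls.add(store);
        System.out.println(reachableFrom(index));
    }
}
```

A tool consuming such a model, for example a visualization component, would only depend on these classes and not on the tool that produced the call graph.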

In OASIS v2, the definition of exchange models is done using Omondo's freely available EclipseUML graphical editor [37]. EclipseUML is covered in more detail in Subsection 4.2.3.2.

3.1.3.2 Browsing

Browsing consists in exploring a body of information, based on the organization of the collections or scanning lists, rather than by direct searching [38]. In OASIS v2, browsing is done using the Generic Object Browser (GOB). Figure 4 displays the GOB view in Eclipse.


Figure 4: Generic Object Browser view

Examples of information that can be browsed using the GOB are the different exchange models and their instances. As illustrated in Figure 4, the GOB displays in a tree-like composition structure the model's classes, their attributes and types, subclasses, superclasses, as well as their association and composition relationships. When integrating an existing tool in OASIS v2, this view can reveal if there already exists an exchange model for the data required and/or produced by the tool. For example, the exchange model displayed in Figure 3 could be used by another tool that produces call graphs for a different object-oriented programming language.

The GOB provides two operation modes: class and instance. While in class mode, as in Figure 4, the user browses through an exchange model. When in instance mode, the user browses through the actual objects of an exchange model instance. In this mode, the names of the objects and the values of their attributes are displayed.

3.1.3.3 Querying

Querying is the process by which data matching a set of criteria are retrieved from a repository. Examples of data that can be retrieved in OASIS v2 are the data contained in the instances of an exchange model.


3.1.3.4 Publishing

Publishing is the main activity performed by the tools producing data in OASIS v2, with respect to repositories. It consists in persistently storing the data contained in an instance of an exchange model. Upon publication, it becomes available to all the tools using the information contained in the repositories.
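The relationship between publishing and querying can be sketched in a few lines of Java. The `ModelsRepositorySketch` class and its `publish` and `query` methods are hypothetical; an in-memory list replaces the persistent Models repository, and a `java.util.function.Predicate` plays the role of the set of criteria.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for the Models repository.
class ModelsRepositorySketch<T> {

    private final List<T> published = new ArrayList<>();

    // Publishing: store an exchange model instance so that other tools can use it.
    void publish(T instance) {
        published.add(instance);
    }

    // Querying: retrieve every published instance matching the given criteria.
    List<T> query(Predicate<? super T> criteria) {
        List<T> matches = new ArrayList<>();
        for (T candidate : published) {
            if (criteria.test(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        ModelsRepositorySketch<String> repository = new ModelsRepositorySketch<>();
        repository.publish("Indexer.addDoc");
        repository.publish("Searcher.find");
        System.out.println(repository.query(name -> name.startsWith("Indexer")));
    }
}
```

In this sketch, a producing tool calls `publish` once per instance, and any consuming tool can later call `query` without knowing which tool produced the data.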

3.2 Fact extraction

Fact extraction consists in finding pieces of information about a software system. It is a fundamental step of reverse engineering and architecture recovery techniques and, as a result, often has to be performed first [39]. This means that before any high-level reverse engineering analyses or architecture recovery activities can be performed, available information about a system has to be extracted and aggregated in a fact base. Such a fact base forms the foundation for further analysis tasks that are conducted next, either manually or (semi)-automatically using tools [39].

Fact extraction can either be static or dynamic. OASIS v2 supports both, as explained in the next two subsections.

3.2.1 Static fact extraction

Static fact extraction provides information which is obtained by observing only the artefacts of a system [40]. A common technique for extracting static facts from source code is parsing.

3.2.1.1 Parsing

Informally, a parser is a program which receives input in the form of source code instructions and breaks them into parts such as objects, methods, and attributes [39]. This collected data, as well as the dependencies among the extracted entities, e.g., inheritance and association relationships, are then added to a fact base.

More formally, parsing transforms source code into a data structure, usually a tree, which is suitable for later processing and captures the implied hierarchy of the source code. A parser generally operates in a two-stage process. First, it identifies the tokens in the source code and then builds a parse tree using them.

A token is a categorized block of text, usually consisting of indivisible characters known as lexemes. Examples of tokens include literals, operators, and identifiers. A lexical analyzer initially reads the lexemes and categorizes them according to function, giving them meaning. This assignment of meaning is known as tokenization. A parse tree, or concrete syntax tree, is then generated from these tokens. A parse tree represents the syntactic structure of the source code according to a grammar.
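The first stage, tokenization, can be sketched with a small regular-expression-based lexical analyzer. This is a simplified illustration and not the lexer used by OASIS v2: the four token categories, the `TokenizerSketch` class, and its `tokenize` method are assumptions made for this example, and only a tiny subset of Java-like syntax is recognized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal lexical analyzer: reads lexemes and assigns each one a category.
class TokenizerSketch {

    static final class Token {
        final String category;
        final String lexeme;

        Token(String category, String lexeme) {
            this.category = category;
            this.lexeme = lexeme;
        }

        @Override
        public String toString() {
            return category + "(" + lexeme + ")";
        }
    }

    private static final String[] CATEGORIES = {"LITERAL", "IDENTIFIER", "OPERATOR", "SEPARATOR"};

    // One named group per token category; alternatives are tried left to right.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<LITERAL>\\d+)"
            + "|(?<IDENTIFIER>[A-Za-z_]\\w*)"
            + "|(?<OPERATOR>[-+*/=])"
            + "|(?<SEPARATOR>[();,])");

    static List<Token> tokenize(String source) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(source);
        int position = 0;
        while (position < source.length()) {
            if (Character.isWhitespace(source.charAt(position))) {
                position++; // whitespace separates lexemes but yields no token
                continue;
            }
            matcher.region(position, source.length());
            if (!matcher.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at " + position);
            }
            for (String category : CATEGORIES) {
                String lexeme = matcher.group(category);
                if (lexeme != null) {
                    tokens.add(new Token(category, lexeme));
                    break;
                }
            }
            position = matcher.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("total = count + 42;"));
    }
}
```

The resulting token list is what the second stage of a parser would consume to build a parse tree according to the grammar.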

In OASIS v2, an abstract syntax tree (AST) is used instead of a parse tree. In a parser, an AST is an intermediate between a parse tree and a data structure. The latter is often used as a compiler or interpreter's internal representation of a computer program, while it is being optimized and from which code is generated.

An AST captures the essential structure of the source code in a tree form, while omitting unnecessary syntactic details. It differs from a parse tree by excluding nodes representing punctuation marks, such as the semi-colons terminating statements or the commas separating method arguments. It also omits tree nodes representing unary productions in the grammar. The omitted information is instead implied by the structure of the AST [41].
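To make the difference with a parse tree concrete, the following sketch builds an AST for simple arithmetic expressions with a hand-written recursive descent parser. The grammar (addition, multiplication, integers, and parentheses) and all names, such as `AstSketch`, `Num`, and `BinOp`, are assumptions chosen for illustration; the sketch is unrelated to the Java parser used in OASIS v2.

```java
// Recursive descent parser that produces an AST for expressions such as "(1 + 2) * 3".
class AstSketch {

    interface Node {
        String describe();
    }

    static final class Num implements Node {
        final int value;

        Num(int value) {
            this.value = value;
        }

        public String describe() {
            return Integer.toString(value);
        }
    }

    static final class BinOp implements Node {
        final char operator;
        final Node left;
        final Node right;

        BinOp(char operator, Node left, Node right) {
            this.operator = operator;
            this.left = left;
            this.right = right;
        }

        public String describe() {
            String name = operator == '+' ? "add" : "mul";
            return name + "(" + left.describe() + ", " + right.describe() + ")";
        }
    }

    private final String input;
    private int position = 0;

    private AstSketch(String source) {
        this.input = source.replaceAll("\\s+", "");
    }

    static Node parse(String source) {
        AstSketch parser = new AstSketch(source);
        Node root = parser.expression();
        if (parser.position != parser.input.length()) {
            throw new IllegalArgumentException("Unexpected input at " + parser.position);
        }
        return root;
    }

    // expression := term ('+' term)*
    private Node expression() {
        Node left = term();
        while (peek() == '+') {
            position++;
            left = new BinOp('+', left, term());
        }
        return left;
    }

    // term := factor ('*' factor)*
    private Node term() {
        Node left = factor();
        while (peek() == '*') {
            position++;
            left = new BinOp('*', left, factor());
        }
        return left;
    }

    // factor := NUMBER | '(' expression ')'
    private Node factor() {
        if (peek() == '(') {
            position++;
            Node inner = expression();
            if (peek() != ')') {
                throw new IllegalArgumentException("')' expected at " + position);
            }
            position++;
            return inner; // no node is created for the parentheses themselves
        }
        int start = position;
        while (Character.isDigit(peek())) {
            position++;
        }
        if (start == position) {
            throw new IllegalArgumentException("Number expected at " + position);
        }
        return new Num(Integer.parseInt(input.substring(start, position)));
    }

    private char peek() {
        return position < input.length() ? input.charAt(position) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(parse("(1 + 2) * 3").describe());
    }
}
```

The parentheses of the input never appear as nodes: the grouping they express is carried by the nesting of `BinOp` nodes. Likewise, a lone number yields a single `Num` node rather than a chain of expression, term, and factor nodes.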

Figure 5 shows a section of the AST for the Java source code partially displayed in Figure 6. In OASIS v2, parsing is used, among other things, to compute the metrics covered in Subsection 3.3.1.

Figure 5: Abstract syntax tree view


Figure 6: Java source code

3.2.1.2 Decompilation

In the case of a COTS substitution, the source code of the component to replace is usually not available. Therefore, parsing, as well as other analysis techniques, cannot be performed to identify the services the component provides. In this circumstance, the decompilation of the binary code or bytecode is the only option.

A decompiler, or reverse compiler, is a program which attempts to perform the inverse process of the compiler. Given an executable program compiled using a high level programming language, the objective is to generate a high level language program which performs the same function as the executable program [42].

Decompiling executable programs is not a trivial task, as one faces several difficulties. Some of the problems are the separation of data and code, the reconstruction of control structures, and the recovery of high-level data types [43]. Also, any meaningful names given by programmers to variables and methods to facilitate their identification are not usually stored in an executable file. Therefore, they cannot be recovered by the decompiler. Another problem is the great number of subroutines introduced by the compiler [42] to set up its environment and for runtime support. These are usually written in assembly and most of the time, cannot be translated into a high level language. In addition, library routines, written either in the compiler language or in assembly, are also included by the linker. As an example, a “hello world” program compiled in C generates 23 different procedures [42]. To improve the decompilation process, decompilers make use of knowledge about certain compilers and libraries used in the compilation of the file to be decompiled [43].

One case for which the decompilation is somewhat easier is Java, the programming language supported by OASIS v2. The reason is that Java bytecode is relatively high-level and is guaranteed to be well-formed and well-typed due to verification constraints [44]. Therefore, it provides an ideal basis for the decompilation back to Java source code. Another reason why decompiling bytecode is easier is that the most common way of producing class files is to use Sun's javac compiler, which has specific compilation patterns [45]. However, decompiling Java bytecode has been complicated by the fact that there is an increasing number of compilers that can generate bytecode for other languages (e.g., AspectJ and C), as well as by the use of bytecode optimizers and obfuscators. These produce faster and/or smaller class files in the first case and classes which are harder to decompile and understand in the latter. Although the bytecode generated by these tools is both correct and verifiable, it is much more complex than the one produced by javac [44].
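Because Java class files retain method names and signatures, part of the information about the services a compiled component provides can be recovered without the source code. The sketch below uses the standard Java reflection API for this purpose. It is a much lighter technique than decompilation and is not presented here as the approach taken in OASIS v2; the `Indexer` class is a hypothetical component that would, in practice, be loaded from a jar file.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Lists the public methods a compiled class offers, using reflection only.
class ServiceListingSketch {

    // Hypothetical component whose source code is assumed to be unavailable.
    static class Indexer {
        public boolean addDocument(String path) {
            return path != null;
        }

        public int size() {
            return 0;
        }

        private void flush() {
        }
    }

    static List<String> publicServices(Class<?> component) {
        List<String> services = new ArrayList<>();
        for (Method method : component.getDeclaredMethods()) {
            if (!Modifier.isPublic(method.getModifiers()) || method.isSynthetic()) {
                continue; // keep only the methods visible to client code
            }
            StringBuilder signature = new StringBuilder();
            signature.append(method.getReturnType().getSimpleName())
                     .append(' ')
                     .append(method.getName())
                     .append('(');
            Class<?>[] parameters = method.getParameterTypes();
            for (int i = 0; i < parameters.length; i++) {
                if (i > 0) {
                    signature.append(", ");
                }
                signature.append(parameters[i].getSimpleName());
            }
            services.add(signature.append(')').toString());
        }
        Collections.sort(services); // getDeclaredMethods() returns methods in no particular order
        return services;
    }

    public static void main(String[] args) {
        for (String service : publicServices(Indexer.class)) {
            System.out.println(service);
        }
    }
}
```

Reflection only reveals signatures. Understanding what a service actually does still requires decompilation or dynamic analysis.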

3.2.1.3 Build file parsing

After a software system has been designed and implemented, it has to be configured, compiled, and linked for the particular environment in which it will be deployed [46]. For small systems developed for a unique platform, the Make utility [47] and a single Makefile, for example, are usually sufficient for system building. However, in the case of large and complex systems running on multiple platforms and supporting several functional configurations, the build process is more complicated.

Since the target systems of the current research project are large-scale military applications, their configuration and build-time properties should be extracted from build management artefacts, such as build and configuration files [46]. Knowing the compilation dependencies between the compilation units of a system, the time-sequence configuration of the compilation procedure, and which portions of source code are automatically generated at build time would provide valuable insights to understand an existing system before a component could be substituted.

The software systems to be analyzed by OASIS v2 will be military applications developed in Java. Therefore, Ant [48] is the build format supported.
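As an illustration of the kind of fact that can be extracted from an Ant build file, the following minimal sketch reads the dependencies declared between targets. It is not the OASIS v2 implementation: the class and method names (AntTargetExtractor, targetDependencies) are hypothetical, and the sketch only assumes the standard Ant convention that a target lists its prerequisites in a comma-separated depends attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: extracts target dependencies from an Ant build file. */
class AntTargetExtractor {

    /** Maps each target name to the targets listed in its "depends" attribute. */
    static Map<String, List<String>> targetDependencies(String buildXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(buildXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed build file", e);
        }
        Map<String, List<String>> dependencies = new LinkedHashMap<>();
        NodeList targets = doc.getElementsByTagName("target");
        for (int i = 0; i < targets.getLength(); i++) {
            Element target = (Element) targets.item(i);
            List<String> names = new ArrayList<>();
            // getAttribute returns "" when "depends" is absent, which yields no names.
            for (String name : target.getAttribute("depends").split(",")) {
                if (!name.trim().isEmpty()) {
                    names.add(name.trim());
                }
            }
            dependencies.put(target.getAttribute("name"), names);
        }
        return dependencies;
    }

    public static void main(String[] args) {
        String buildXml =
              "<project name=\"demo\" default=\"jar\">"
            + "<target name=\"init\"/>"
            + "<target name=\"compile\" depends=\"init\"/>"
            + "<target name=\"jar\" depends=\"compile, init\"/>"
            + "</project>";
        System.out.println(targetDependencies(buildXml));
    }
}
```

The resulting map preserves the declaration order of the targets, which gives a first approximation of the time-sequence configuration of the build.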


3.2.2 Dynamic fact extraction

Dynamic fact extraction provides information which is obtained by observing the system during execution [40]. With the heterogeneity and dynamism of today's software systems, it is difficult to comprehend them outside the actual time and context in which they execute [49]. Therefore, most of the time, architecture comprehension cannot rely only on static information. It must be complemented by dynamic analysis, which captures, for example, the exchange of control and data between the various components at run time. This information increases the level of precision provided by static analysis and as a result, improves understanding. In general, when collecting dynamic information about a set of executions, one is interested in collecting information for some specific entities in the code (e.g., method calls and paths) and in a subset of the program (e.g., in a specific module or set of modules) [50].

3.2.2.1 Instrumentation

One technique commonly used to collect information about a system's behavior is instrumentation. As opposed to general-purpose program transformations, instrumentation only aims at gathering additional information about a system, rather than modifying its original structure and behavior. Therefore, it only allows minor side effects, such as increases in execution time or changes to the log file [51]. As an example, Java bytecode instrumentation uses structural and semantic information provided by the language and platform specifications both to identify instrumentation points and to avoid affecting the original program structure and behavior [51]. Such instrumentation does not remove program elements (e.g., classes, fields, and methods). Variables defined by the original program may be read but not written. Instrumentation may add its own variables, even to existing program elements (e.g., new fields or local variables), and those variables may be read or written by it. Instrumentation may also insert new code into original program methods, and invoke other methods from this code, provided that original variables are not modified as a result of these invocations. Finally, instrumentation may outline code, i.e., move all or part of the method code into a new method and replace it in the original method with the invocation of the new one [51].

Once executed, an instrumented program generates an execution trace, which can be defined as a record of the sequence of instructions executed that often takes the form of a list of code labels encountered [52].

There are two different kinds of instrumentation: source and binary code (or bytecode). In the first case, trace statements are added into the source code of an application. In the second one, trace statements are inserted into the binary code or bytecode, which includes applications as well as dynamic and shared libraries. Instrumenting source code is easier than binary code, as one can work in a high-level language. However, the disadvantage is that after it has been instrumented, the modified source code has to be recompiled in order to be able to execute the tracing statements and therefore, generate an execution trace.
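The source-level variant can be sketched on the fac() method of Figure 7. The example below is only illustrative and assumes a hypothetical in-memory trace (the TRACE list and the TracedFactorial class are not part of OASIS v2, which works at the bytecode level). In line with the constraints described above, the added statements read the original variable n without writing it, and the only new variable, result, belongs to the instrumentation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical source-level instrumentation of the fac() method of Figure 7. */
class TracedFactorial {

    /** The execution trace: one label per traced event, in execution order. */
    static final List<String> TRACE = new ArrayList<>();

    static int fac(int n) {
        TRACE.add("enter fac(" + n + ")");   // added trace statement
        int result = (n == 0) ? 1 : n * fac(n - 1);
        TRACE.add("exit fac -> " + result);  // added trace statement
        return result;
    }

    public static void main(String[] args) {
        fac(2);
        // Prints the enter/exit labels in the order in which they were recorded.
        TRACE.forEach(System.out::println);
    }
}
```

Because the trace statements live in the source, the class has to be recompiled before a trace can be produced, which is the drawback noted above.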

As the objective of the OASIS v2 prototype is to assist with the process of component substitution for object-oriented software systems written in Java, it supports bytecode instrumentation. It allows users to specify (1) the types of entities to instrument, (2) the parts of the code in which those entities must be instrumented, and (3) the kind of information to collect from the different entity types [53]. Bytecode instrumentation is used to create execution traces from which sequence diagrams will be generated.

Figure 7 shows a sample Java program which prompts for a number and prints its factorial. Figure 8 displays the resulting bytecode of the compiled fac() method and Figure 9, its bytecode after it has been instrumented using the Byte Code Engineering Library (BCEL) [54].

    import java.io.*;

    public class Factorial {
        private static BufferedReader in = new BufferedReader(new
            InputStreamReader(System.in));

        public static final int fac(int n) {
            return (n == 0) ? 1 : n * fac(n - 1);
        }

        public static final int readInt() {
            int n = 4711;
            try {
                System.out.print("Please enter a number> ");
                n = Integer.parseInt(in.readLine());
            } catch (IOException e1) {
                System.err.println(e1);
            } catch (NumberFormatException e2) {
                System.err.println(e2);
            }
            return n;
        }

        public static void main(String[] argv) {
            int n = readInt();
            System.out.println("Factorial of " + n + " is " + fac(n));
        }
    }

Figure 7: Java source code [54]

    0:  iload_0
    1:  ifne #8
    4:  iconst_1
    5:  goto #16
    8:  iload_0
    9:  iload_0
    10: iconst_1
    11: isub
    12: invokestatic Factorial.fac (I)I (12)
    15: imul
    16: ireturn

Figure 8: Java bytecode of the fac() method [54]


    0:  iload_0
    1:  ifne #8
    4:  iconst_1
    5:  goto #16
    8:  iload_0
    9:  iload_0
    10: iconst_1
    11: isub
    12: invokestatic Factorial.fac (I)I (12)
    15: imul
    16: ireturn

    0:  sipush 4711
    3:  istore_0
    4:  getstatic java.lang.System.out Ljava/io/PrintStream;
    7:  ldc "Please enter a number> "
    9:  invokevirtual java.io.PrintStream.print (Ljava/lang/String;)V
    12: getstatic Factorial.in Ljava/io/BufferedReader;
    15: invokevirtual java.io.BufferedReader.readLine ()Ljava/lang/String;
    18: invokestatic java.lang.Integer.parseInt (Ljava/lang/String;)I
    21: istore_0
    22: goto #44
    25: astore_1
    26: getstatic java.lang.System.err Ljava/io/PrintStream;
    29: aload_1
    30: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
    33: goto #44
    36: astore_1
    37: getstatic java.lang.System.err Ljava/io/PrintStream;
    40: aload_1
    41: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
    44: iload_0
    45: ireturn

Figure 9: Instrumented Java bytecode of the fac() method [54]

One limitation of instrumentation is that the behavior of the instrumented system may differ from the expected one (e.g., deadlines may be missed), as a consequence of the delays introduced by the execution of the added code [55]. This issue is unavoidable, as observing a system changes the system [56]. However, this should not be a problem in the present case, as the systems targeted by OASIS v2 are not, at the moment, hard real-time systems with deadlines. As a result, the delays introduced by the instrumentation should not change the intended behavior of the system. Also, in order to limit the impact of instrumentation, only the constructs required to obtain the necessary information are instrumented.

3.2.2.2 Profiling

Profiling injects instrumentation statements into the binary code or bytecode of a software system to analyze the performance and resource utilization of its execution. It is useful for comprehension, as it allows identifying the portions of its source code which dominate execution time. It is also useful for component substitution to get an understanding of the complex interactions between the source code, third-party libraries, operating system, hardware, networks, and other processes.

Figures 10 and 11 show two profiling views provided by OASIS v2. Figure 10 is a snapshot of the Memory Statistics view, which displays statistics about an application's heap. It provides, for a package or class, the following information: the total number of instances which have been created (Total Instances), the number of instances that have not been garbage collected (Live Instances), the summed size of all live instances (Active Size), the total size of all created instances, including those removed by garbage collection (Total Size), and the average number of garbage collections the instances survived (Avg. Age).

Figure 10: Memory Statistics view

Figure 11 is a snapshot of the Execution Statistics view, which displays statistics about an application's execution time and method calls. It provides, for a package, class, or method, the following information: the time to execute an invocation, excluding the time spent in other methods called during the invocation (Base Time), the Average Base Time (i.e., the Base Time divided by the number of calls), the time to execute all methods called from an invocation (Cumulative Time), and the number of calls (Calls).


Figure 11: Execution Statistics view
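The relationship between these figures can be sketched by reducing a list of method entry and exit events to per-method totals. The sketch below is a hypothetical illustration, not the code behind the Execution Statistics view: the Event, Stats, and summarize names are invented, timestamps are in arbitrary units, and Cumulative Time is read, as an assumption, as the time of an invocation including the methods it calls. With this reading, recursive methods would have nested invocations counted more than once.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reduction of an execution trace to per-method statistics. */
class ExecutionStatistics {

    /** One trace event: a method was entered or exited at a given time. */
    static final class Event {
        final String method;
        final boolean enter;
        final long time;

        Event(String method, boolean enter, long time) {
            this.method = method;
            this.enter = enter;
            this.time = time;
        }
    }

    /** Totals accumulated over all invocations of one method. */
    static final class Stats {
        long baseTime;        // time spent in the method itself
        long cumulativeTime;  // base time plus time spent in callees (assumed reading)
        int calls;

        double averageBaseTime() {
            return calls == 0 ? 0.0 : (double) baseTime / calls;
        }
    }

    /** An invocation that has been entered but not yet exited. */
    private static final class Frame {
        final String method;
        final long start;
        long calleeTime;

        Frame(String method, long start) {
            this.method = method;
            this.start = start;
        }
    }

    static Map<String, Stats> summarize(List<Event> events) {
        Map<String, Stats> stats = new LinkedHashMap<>();
        Deque<Frame> stack = new ArrayDeque<>();
        for (Event e : events) {
            if (e.enter) {
                stack.push(new Frame(e.method, e.time));
                continue;
            }
            Frame done = stack.pop();
            long elapsed = e.time - done.start;
            Stats s = stats.computeIfAbsent(done.method, k -> new Stats());
            s.calls++;
            s.cumulativeTime += elapsed;
            s.baseTime += elapsed - done.calleeTime;
            if (!stack.isEmpty()) {
                stack.peek().calleeTime += elapsed;  // charge the time to the caller's frame
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        List<Event> trace = Arrays.asList(
            new Event("main", true, 0),
            new Event("helper", true, 10),
            new Event("helper", false, 40),
            new Event("main", false, 100));
        summarize(trace).forEach((method, s) -> System.out.println(
            method + ": base=" + s.baseTime
                + " cumulative=" + s.cumulativeTime + " calls=" + s.calls));
    }
}
```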

3.3 Analysis

The Analysis subsystem provides techniques to [57]:

Separate a system into its constituent parts, in order to identify or classify the elements of communication.

Make explicit the relationships among those elements to determine their connections and interactions.

Recognize the organizational principles of the arrangement and structure that hold these elements together.

In OASIS v2, the Analysis subsystem is composed of two service groups: Software Metrics and Domain Knowledge Definition and Exploitation.

3.3.1 Software metrics

A metric measures a property of a piece of software or its specifications. OASIS v2 provides an extensive set of metrics, as it has been shown that they can provide guidance in analyzing the quality of the design and source code of a system, as well as its potential maintainability and comprehension [58]. Therefore, they can be used to predict the effort involved in performing a component substitution.


Although OASIS v2 provides size, complexity, and object-oriented class metrics, the ones most relevant for the granularity levels of interest are the object-oriented package metrics. In [59], they have been shown to be particularly useful for architecture recovery and comprehension. Figure 12 shows a screenshot of the software metrics view provided by OASIS v2. Numbers highlighted in red indicate metrics which are out of the optimal range.

Figure 12: Software Metrics view

As mentioned in Section 2.5, the provided metrics support the C/C++ and Java programming languages. For Java, the notion of a package is well defined. In the case of C++, it is defined as the set of classes in the modules of a single directory [60].

The following suite of object-oriented package metrics is based on the work of Martin [61].

Afferent Coupling (Ca) - Counts the number of other packages which depend on classes within the analyzed package. Ca is an indicator of the level of responsibility of a package.

Efferent Coupling (Ce) - Counts the number of other packages that the classes within the analyzed package depend upon. Ce is an indicator of the package's independence.

Abstractness (A) - A is the ratio of the number of abstract classes within a package relative to the total number of classes it contains. The range of this metric is from 0 to 1. An abstractness value of zero (A = 0) indicates a completely concrete package, while a value of one (A = 1) indicates a completely abstract package.

Instability (I) - Instability is defined as the ratio of efferent coupling to total coupling, i.e., I = Ce / (Ca + Ce). This metric is an indicator of the package's resilience to change, i.e., the effort to change a package without impacting other packages within the application. The range of this metric goes from 0 to 1. An I of 0 reveals a completely stable package, while an I of 1 indicates that the package is completely unstable.

Distance from the Main Sequence (DMS) - Calculates the perpendicular distance of a package from the idealized line given by A + I = 1. It indicates the package's balance between abstractness and stability. A package squarely on the main sequence is perfectly balanced with respect to abstractness and stability. Ideally, packages should either be completely abstract and stable (I = 0, A = 1), or completely concrete and unstable (I = 1, A = 0). The range for this metric goes from 0 to 1. A DMS of 0 indicates that a package is coincident with the main sequence, while a DMS of 1 reveals that the package is as far as possible from the main sequence.
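The derived metrics above can be sketched as follows. This is an illustrative sketch and not the OASIS v2 code: the PackageMetrics class is hypothetical, and returning 0 for a package with no classes or no couplings is an assumed convention. Since the stated range of DMS is 0 to 1, the sketch assumes the normalized form |A + I - 1|; the strictly geometric perpendicular distance to the line A + I = 1 would divide this value by the square root of 2.

```java
/** Hypothetical calculator for the package metrics based on Martin [61]. */
class PackageMetrics {

    /** A: ratio of abstract classes to all classes in the package. */
    static double abstractness(int abstractClasses, int totalClasses) {
        return totalClasses == 0 ? 0.0 : (double) abstractClasses / totalClasses;
    }

    /** I = Ce / (Ca + Ce): share of the package's couplings that are outgoing. */
    static double instability(int afferent, int efferent) {
        int total = afferent + efferent;
        return total == 0 ? 0.0 : (double) efferent / total;
    }

    /** DMS in its normalized form (assumption): 0 on the main sequence, 1 at worst. */
    static double distanceFromMainSequence(double abstractness, double instability) {
        return Math.abs(abstractness + instability - 1.0);
    }

    public static void main(String[] args) {
        // A package with 1 abstract class out of 4, Ca = 3 and Ce = 1.
        double a = abstractness(1, 4);
        double i = instability(3, 1);
        System.out.println("A = " + a + ", I = " + i
            + ", DMS = " + distanceFromMainSequence(a, i));
    }
}
```

In this example, the package is mostly concrete (A = 0.25) yet mostly stable (I = 0.25), so it lies halfway between the main sequence and the corner where packages are concrete and stable.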

3.3.2 Domain knowledge definition and exploitation

Before a component in a software system can be substituted by another one, the comprehension of the existing system has to take place. This involves the recovery of models incorporating architecturally significant concepts. Usually, these models cannot be directly extracted from source code without domain knowledge, as reverse engineering remains a human-intensive activity that requires knowledge from software engineers and domain experts, in addition to automated tool guidance [62]. A domain is a problem area characterized by its vocabulary, common assumptions, architectural solution approaches, and literature [63]. Domain knowledge is needed to relate the operations of a software system to the application domain in order to understand their semantics. The objective is to establish traceability links between the source code and the application domain.

Recovering domain knowledge from existing software systems is an inherently difficult task. The challenge lies in both the extraction and inference of knowledge from available sources such as domain experts, source code, and databases. There exist two major approaches for domain knowledge recovery. The first one consists in extracting knowledge from domain experts and storing this information for further processing. The second approach involves inferring domain knowledge by analyzing and reasoning about the source code using, for example, natural language processing techniques.

The method provided in OASIS v2 for domain knowledge recovery lies more towards the first approach. As illustrated in Figure 13, an editor allows software engineers and domain experts to model the application domain of the software system under study, in this case, DocSearcher, a search tool for indexing and searching files on a computer. The document formats it currently supports include Word, Excel, PDF, Open / Star Office, RTF, text, and HTML.

In the Domain Model editor, a node represents a domain concept. In the example displayed in Figure 13, the concepts are Index, Document, and Document_Type. A relationship between two nodes is represented by an explicit named link (e.g., IndexToDocument, DocumentToType, and IndexToType). The domain model diagram results in a vocabulary of terms representing entities of the domain and their relationships, which together imply certain semantic information. At the current stage, the metamodel used by the editor does not impose many constraints on the concepts and the relationships between the entities. This could later be refined to set additional restrictions if its usage reveals this to be a necessity.


Figure 13: Domain model of the DocSearcher application

After the concepts and relationships of the application domain have been specified by a domain expert, they can be tagged once located in the source code. A tag is an annotation to which a waypoint is associated [64]. As with the Global Positioning System (GPS), a waypoint in the source code of an application corresponds to a location of interest. In this case, it could be the location of a source code element such as a package, file, class, or method. In the Domain Model editor, tags are indicated below the names of concepts and relationships. In Figure 13, there are three tags for the domain concepts, i.e., ds_index, ds_document, and ds_document_type, as well as one tag for each of the relationships between them, i.e., ds_IndexToDocument, ds_DocumentToType, and ds_IndexToType.

Once a concept from the domain model has been located in the source code, e.g., the class defining the index, a person trying to understand this application could insert its corresponding tag in the Index.java file, as illustrated in Figure 14. This will automatically associate the waypoint org.jab.docsearch.Index with the ds_Index tag.


Figure 14: Location of interest tagged directly in source code
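The association between concepts, tags, and waypoints can be pictured as two lookups, as in the minimal sketch below. The DomainModelIndex class and its methods are hypothetical and do not reflect the data structures actually used by OASIS v2; only the concept name, the tag, and the waypoint are taken from the DocSearcher example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index linking domain concepts to tags and tags to waypoints. */
class DomainModelIndex {

    private final Map<String, String> tagByConcept = new LinkedHashMap<>();
    private final Map<String, List<String>> waypointsByTag = new LinkedHashMap<>();

    /** Declares a concept (or relationship) of the domain model together with its tag. */
    void defineConcept(String concept, String tag) {
        tagByConcept.put(concept, tag);
    }

    /** Records that the tag was found at the given source code location. */
    void addWaypoint(String tag, String location) {
        waypointsByTag.computeIfAbsent(tag, k -> new ArrayList<>()).add(location);
    }

    /** Returns the locations to visit for a concept, or an empty list if none is tagged. */
    List<String> waypointsFor(String concept) {
        String tag = tagByConcept.get(concept);
        if (tag == null) {
            return Collections.emptyList();
        }
        return waypointsByTag.getOrDefault(tag, Collections.emptyList());
    }

    public static void main(String[] args) {
        DomainModelIndex index = new DomainModelIndex();
        index.defineConcept("Index", "ds_Index");
        index.addWaypoint("ds_Index", "org.jab.docsearch.Index");
        System.out.println(index.waypointsFor("Index"));
    }
}
```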

Once the tags and waypoints are created, they can be used to navigate and understand the source code of an unfamiliar system using the Domain Model view, as illustrated in Figure 15. In this view, when the user clicks on a concept or relationship to which tags are associated, their corresponding waypoints and related properties get displayed in the Waypoints view. For example, in Figure 15, the Index concept is selected, resulting in the org.jab.docsearch.Index waypoint and its description being displayed. Clicking on this entry opens the Index.java file in the Java editor positioned at the appropriate location and highlights the waypointed source code element.


Figure 15: Domain Model and Waypoint views

The Domain Model editor and view support the comprehension of unfamiliar source code using the top-down approach proposed by Brooks [65]. Using this strategy, the knowledge about the application domain is first reconstructed and then mapped onto the source code [65]. This approach is required to reconstruct and understand a software system at the architectural level, as it allows mapping source code elements to their corresponding operational concepts.

3.4 UML sequence diagrams

The Unified Modeling Language (UML) is “a language for visualizing, specifying, constructing, and documenting the artefacts of a software-intensive system” [66]. In UML, one way to model the dynamic aspects of a system, which involves modeling concrete or prototypical instances of classes, interfaces, components, and nodes, in addition to the messages which are dispatched between them, is to use sequence diagrams [66].


Informally, a sequence diagram shows a set of roles and the messages sent and received by the instances playing these roles. More formally, a sequence diagram shows a set of objects and their relationships, including the messages that may be dispatched among them, and emphasizes the time ordering of messages [66].

Figure 16: UML sequence diagram [67]

As illustrated in Figure 16, in a sequence diagram, the objects involved in the interaction are placed at the top, across the horizontal axis. The object initiating the interaction is typically placed at the left and increasingly more subordinate objects, to the right. Messages sent and received by these objects are then arranged along the vertical axis, in order of increasing time, from top to bottom [66].
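This layout rule lends itself to a simple derivation from an execution trace. The sketch below is a hypothetical illustration rather than the OASIS v2 diagram generator: the SequenceLayout and Call names, as well as the object and message names in main, are invented. It assumes the calls are already listed in time order and that the order in which objects first appear is a reasonable approximation of "initiator on the left, subordinates to the right".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical derivation of a sequence diagram layout from recorded calls. */
class SequenceLayout {

    /** One recorded message between two objects. */
    static final class Call {
        final String caller;
        final String callee;
        final String message;

        Call(String caller, String callee, String message) {
            this.caller = caller;
            this.callee = callee;
            this.message = message;
        }
    }

    /** Horizontal axis: objects from left to right, in order of first appearance. */
    static List<String> participants(List<Call> calls) {
        Set<String> order = new LinkedHashSet<>();
        for (Call c : calls) {
            order.add(c.caller);
            order.add(c.callee);
        }
        return new ArrayList<>(order);
    }

    /** Vertical axis: messages from top to bottom, in order of increasing time. */
    static List<String> messages(List<Call> calls) {
        List<String> lines = new ArrayList<>();
        for (Call c : calls) {
            lines.add(c.caller + " -> " + c.callee + ": " + c.message);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Call> calls = Arrays.asList(
            new Call("client", "index", "search"),
            new Call("index", "document", "getType"),
            new Call("client", "document", "open"));
        System.out.println(participants(calls));
        messages(calls).forEach(System.out::println);
    }
}
```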

Two features characterize sequence diagrams: lifeline and focus of control. An object lifeline, depicted as a vertical dashed line, represents the existence of an object over a period of time. The objects that will be in existence for the duration of the interaction are aligned at the top, with their lifelines drawn from the top to the bottom of the diagram. Objects can also be created and destroyed during an interaction. In the former case, their lifelines start with the receipt of the create message. In the latter case, their lifelines end with the receipt of the destroy message, and are given the visual cue of a large X, marking the end of their lives [66].

The second feature characterizing sequence diagrams is the focus of control.
