DEPENDENCIES IN GEOGRAPHICALLY DISTRIBUTED

SOFTWARE DEVELOPMENT:

OVERCOMING THE LIMITS OF MODULARITY [1]

Marcelo Cataldo

CMU-ISRI-07-120

December 2007

School of Computer Science, Institute for Software Research

Carnegie Mellon University, Pittsburgh, PA

Thesis Committee

Kathleen M. Carley, Co-Chair

James D. Herbsleb, Co-Chair

Len J. Bass

David Redmiles

Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Copyright © 2007 Marcelo Cataldo

[1] This dissertation was supported by the National Science Foundation under Grant No. IIS-0414698, Grant No. IIS-0534656 and Grant No. IGERT 9972762, by the U.S. Army Research Laboratory under Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0011, by the Office of Naval Research (ONR N00014-06-1-0921) and by the Air Force Research Lab with Charles River Analytics SC060701.

Keywords: geographically distributed software development, collaborative software

development, coordination, software dependencies.

Dedicated to Pei-Chi and to my parents, Antonio and Mirta

ACKNOWLEDGEMENTS

I have been very fortunate to work with an outstanding dissertation committee in

Kathleen Carley, Jim Herbsleb, Len Bass and David Redmiles. I am particularly indebted

to Kathleen and Jim for being the best advisors a student could hope for. I also would like

to thank my family for their patience and encouragement, especially my wife Pei-Chi,

without whom my life as a doctoral student would have been a lot less enjoyable.

Throughout this process, many others helped shape my views and research.

Special thanks go to Matthew Bass, Audris Mockus, Jeffrey Reminga, Jeffrey Roberts

and Patrick Wagstrom.

ABSTRACT

Geographically distributed software development (GDSD) is becoming pervasive.

Hence, the constraints on communication and their negative impact on developers’ ability to

coordinate effectively are a growing problem that consistently results in sub-par

performance of GDSD teams. Past research argues that geographically distributed teams

do better when their work is almost independent from each other. In software

engineering, modularization is the traditional technique intended to reduce the

interdependencies among the modules that constitute a system. The modular design

argument suggests that by reducing the technical dependencies, the work dependencies

between teams developing interdependent modules are also reduced. Consequently, a

modular product structure leads to an equivalent modular task structure. This dissertation

argues that modularization is not a sufficient representation of work dependencies in the

context of software development and it proposes a method for measuring socio-technical

congruence, defined as the relationship between the structure of work dependencies and

the coordination patterns of the organization doing the technical work. Two empirical

studies assessed the impact of socio-technical congruence on development productivity

and product quality. In addition, a third empirical study explores how developers in a

geographically distributed software development organization evolve their coordination

patterns to overcome the limitations of the modular design approach.

Collectively, this dissertation makes important contributions to the software engineering,

CSCW and organizational literatures. First, the empirical evaluation of the congruence

framework showed the importance of understanding the dynamic nature of software

development. Identifying the “right” set of product dependencies that determine the

relevant work dependencies and coordinating accordingly has a significant impact on

reducing the resolution time of modification requests. The analyses showed traditional

software dependencies, such as syntactic relationships, tend to capture a relatively stable

view of product dependencies that is not representative of the dynamism in product

dependencies that emerges as software systems are implemented. On the other hand,

logical dependencies provide a more accurate representation of the most relevant product

dependencies in software development projects. Secondly, this dissertation moves

forward our understanding of the relationship between product and work dependencies

and software quality. Logical dependencies among software modules and work

dependencies were found to be two very significant factors affecting the failure proneness

of software modules. Finally, the longitudinal analysis of coordination activities in a

GDSD project showed that developers centrally positioned in the social system of

information exchanges and coordination activities performed a critical bridging function

across formal teams and geographical locations. Moreover, those same individuals

contributed an average of 57% of development effort in terms of implementing the

software system in each release covered by the data.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
  THE NATURE OF SOFTWARE DEVELOPMENT AND MODULAR DESIGN
  THE NATURE OF SOFTWARE DEVELOPMENT AND INTERDEPENDENCY THEORIES
  RESEARCH QUESTIONS
CHAPTER 2: A FRAMEWORK FOR IDENTIFICATION OF WORK DEPENDENCIES
  THE CONCEPT OF SOCIO-TECHNICAL CONGRUENCE
  IDENTIFICATION OF COORDINATION REQUIREMENTS
  MEASURING SOCIO-TECHNICAL CONGRUENCE
CHAPTER 3: TERMINOLOGY AND DESCRIPTION OF THE DATASETS
  TERMINOLOGY
  DATASETS
    Project A
    Project B
    Project C
    Project D
CHAPTER 4: METHODS FOR IDENTIFYING WORK DEPENDENCIES IN SOFTWARE DEVELOPMENT PROJECTS
  TWO APPROACHES TO DETERMINE PRODUCT DEPENDENCIES IN SOFTWARE SYSTEMS
    General Properties and Evolution of the FCT Task Dependency Matrix
    General Properties and Evolution of the CGRAPH Task Dependency Matrix
  COMPARATIVE ANALYSIS OF THE TASK DEPENDENCY MATRICES
  COMPARATIVE ANALYSIS OF THE COORDINATION REQUIREMENT MATRICES
CHAPTER 5: DEPENDENCIES, CONGRUENCE AND THEIR IMPACT ON DEVELOPMENT PRODUCTIVITY
  STUDY I: CONGRUENCE AND DEVELOPMENT PRODUCTIVITY
    Research Questions
    Method
      Description of the Measures
      Description of the Model and Preliminary Analysis
    Results
      The Evolution of Coordination Requirements
      The Impact of Congruence on Resolution Time of MRs
      The Evolution of Congruence over Time
    Discussion
CHAPTER 6: DEPENDENCIES, CONGRUENCE AND THEIR IMPACT ON SOFTWARE QUALITY
  STUDY II: THE STRUCTURE OF DEPENDENCIES, CONGRUENCE AND PRODUCT QUALITY
    Research Questions
    Method
      Description of the Data and Measures
    Results
      The Impact of Dependencies
      Stability Analysis
      Checks for Random Temporal Effects
    Discussion
CHAPTER 7: THE EVOLUTION OF COORDINATION BEHAVIOR
  STUDY III: THE EVOLUTION OF COORDINATION BEHAVIOR
    Research Questions
    Method
      Description of the Data
      Description of Measures
    Results
      General Patterns of Coordination Behavior
      On the Relationship between Network Position and Productivity
      Stability of Coordination Patterns
      Drivers of Coordination Patterns
    Discussion
CHAPTER 8: APPLICATIONS
  APPLICATIONS FOR SOFTWARE DEVELOPERS
    Enhancing coordination needs awareness
    Enhancing awareness of product dependencies
    Other applications of the congruence framework
  MANAGERIAL APPLICATIONS
    Project-wide view of coordination patterns
    Identifying critical software and organizational agents and units
CHAPTER 9: CONCLUSIONS
  CONTRIBUTIONS
  LIMITATIONS
  FUTURE WORK
    Identification of coordination requirements in early stages of software projects
    The impact of formal roles in development organizations
    Communication beyond team and location boundaries and individual-level performance
    Applying the congruence framework in other types of tasks
REFERENCES
APPENDIX A: SURVEY FOR PROJECT D

LIST OF TABLES

TABLE 1: DESCRIPTIVE STATISTICS FOR DEPENDENT AND CONTROL VARIABLES
TABLE 2: DESCRIPTIVE STATISTICS FOR CONGRUENCE MEASURES (FCT METHOD)
TABLE 3: DESCRIPTIVE STATISTICS FOR CONGRUENCE MEASURES (CGRAPH METHOD)
TABLE 4: PAIR-WISE CORRELATIONS
TABLE 5: RESULTS FROM OLS REGRESSION OF EFFECTS ON RESOLUTION TIME (FCT METHOD)
TABLE 6: RESULTS FROM OLS REGRESSION OF EFFECTS ON RESOLUTION TIME (CGRAPH METHOD)
TABLE 7: EFFECT OF TIME ON CONGRUENCE
TABLE 8: DIFFERENCES BETWEEN DEVELOPERS’ POPULATION
TABLE 9: DESCRIPTIVE STATISTICS FOR LAST RELEASE OF PROJECT A
TABLE 10: DESCRIPTIVE STATISTICS FOR LAST RELEASE OF PROJECT C
TABLE 11: PAIR-WISE CORRELATIONS FOR LAST RELEASE OF PROJECT A (* P < 0.01)
TABLE 12: PAIR-WISE CORRELATIONS FOR LAST RELEASE OF PROJECT C (* P < 0.01)
TABLE 13: BASELINE MODEL FOR FAILURE PRONENESS
TABLE 14: IMPACT OF SYNTACTIC DEPENDENCIES ON FAILURE PRONENESS
TABLE 15: IMPACT OF LOGICAL DEPENDENCIES ON FAILURE PRONENESS
TABLE 16: IMPACT OF WORKFLOW DEPENDENCIES ON FAILURE PRONENESS
TABLE 17: IMPACT OF COORDINATION REQUIREMENTS ON FAILURE PRONENESS
TABLE 18: IMPACT OF CONGRUENCE ON FAILURE PRONENESS
TABLE 19: IMPACT OF TECHNICAL DEPENDENCIES, WORK DEPENDENCIES AND CONGRUENCE ACROSS RELEASES IN PROJECT A
TABLE 20: IMPACT OF TECHNICAL DEPENDENCIES, WORK DEPENDENCIES AND CONGRUENCE ACROSS RELEASES IN PROJECT C
TABLE 21: RANDOM-EFFECTS MODEL OF FAILURE PRONENESS
TABLE 22: DESCRIPTIVE STATISTICS FOR IRC DATASET
TABLE 23: RESULTS OF THE MULTI-LEVEL REGRESSION MODEL USING THE IRC DATA
TABLE 24: RESULTS OF THE MULTI-LEVEL REGRESSION MODEL USING THE MR DATA
TABLE 25: RESULTS FROM MULTI-LEVEL REGRESSION MODEL USING PROJECT D DATA
TABLE 26: STABILITY OF THE COORDINATION NETWORKS
TABLE 27: PREDICTING COORDINATION ACTIVITIES

LIST OF FIGURES

FIGURE 1: THE CONCEPT OF CONGRUENCE
FIGURE 2: EVOLUTION OF THE DENSITY AND CLUSTERING LEVEL OF THE TD MATRICES (FCT METHOD)
FIGURE 3: EVOLUTION OF THE CHANGE IN THE INFORMATION CONTAINED IN THE TD MATRICES (FCT METHOD)
FIGURE 4: AVERAGE CUMULATIVE DENSITY OF THE TD MATRIX (FCT METHOD)
FIGURE 5: EVOLUTION OF THE DENSITY LEVEL OF THE TD MATRICES (CGRAPH METHOD)
FIGURE 6: EVOLUTION OF THE CHANGE IN THE INFORMATION CONTAINED IN THE TD MATRICES (CGRAPH METHOD)
FIGURE 7: COMPARISON BETWEEN TD MATRICES GENERATED BY THE FCT AND CGRAPH METHODS
FIGURE 8: EVOLUTION OF DENSITY AND CLUSTERING LEVEL OF THE CR MATRICES (FCT METHOD)
FIGURE 9: EVOLUTION OF DENSITY AND CLUSTERING LEVEL OF THE CR MATRICES (CGRAPH METHOD)
FIGURE 10: THE EVOLUTION OF COORDINATION REQUIREMENTS ON A MONTHLY BASIS
FIGURE 11: THE EVOLUTION OF COORDINATION REQUIREMENTS IN OPEN SOURCE PROJECTS
FIGURE 12: EVOLUTION OF THE CONGRUENCE MEASURES ACROSS RELEASES
FIGURE 13: PROPORTION OF CHANGES PER DEVELOPER PER RELEASE
FIGURE 14: CONGRUENCE MEASURES ACROSS RELEASES BASED ON TOP CONTRIBUTORS INTERACTIONS
FIGURE 15: CONGRUENCE MEASURES ACROSS RELEASES FOR THE REST OF THE DEVELOPERS
FIGURE 16: OVER TIME COORDINATION PATTERNS FROM THE MR SYSTEM DATA
FIGURE 17: OVER TIME COORDINATION PATTERNS FROM THE IRC DATA
FIGURE 18: COORDINATION PATTERNS ACROSS FORMAL TEAMS AND GEOGRAPHICAL LOCATIONS
FIGURE 19: LOCATION X NETWORK POSITION INTERACTION EFFECT
FIGURE 20: COORDINATION PATTERNS AND PRODUCTIVITY
FIGURE 21: THE SIZE OF THE CORE GROUP OVER TIME AND TOP PERFORMERS MEMBERSHIP
FIGURE 22: COMPOSITION OF THE CORE GROUP OVER TIME BY PRODUCTIVITY LEVELS
FIGURE 23: AMOUNT OF CHANGE IN DYADS CONNECTIONS

CHAPTER 1: INTRODUCTION

Over the past couple of decades, geographically distributed work has become

pervasive and software development organizations are no exception. Factors such as

access to talent, acquisitions and the need to reduce the time-to-market of new products

are the driving forces for the increasing number of geographically distributed software

development (GDSD) projects (Herbsleb & Moitra, 2001; Karolak, 1998). Unfortunately,

this new trend has its costs. Distance leads to numerous problems in communication and

coordination, and ultimately, impacts the performance of software development teams

(Herbsleb et al, 2000; Herbsleb & Mockus, 2003). The failure to identify work

dependencies among developers or development teams results in coordination problems.

A growing body of work on coordination in software development suggests that the

identification and the management of dependencies is a fundamental challenge in

software development organizations, particularly in those that are geographically

distributed (some examples are: Cataldo et al, 2007; de Souza, 2005; Grinter et al, 1999;

Herbsleb et al, 2000; Herbsleb & Mockus, 2003). The modular product design literature

has developed an important body of research on interdependency, for instance, the work

on design structure matrices to find alternative structures that reduce dependencies

among the various components of the system (Eppinger et al, 1994; Sullivan et al, 2001).

Interdependency is central to organizations and it has also been a perennial research topic

in organizational theory (DeSanctis et al, 1999; Staudenmayer, 1997). Those research

streams could inform the design of software development organizations so they are better

able to identify and manage work dependencies. However, we first need to understand

the assumptions of the different theoretical views and how those assumptions relate to the

characteristics of software development tasks.

The Nature of Software Development and Modular Design

The idea of dividing a complex task into smaller manageable units is consistent

with the reductionist view (Simon, 1962; von Hippel, 1990) which is well developed in

the product development literature (Eppinger et al, 1994). Projects typically have a

general description of the system’s components and their relationships or a more detailed

report such as an architectural or high-level design document. Managers use the information

in those documents to divide the development effort into work items that are assigned to

specific development teams, minimizing the interdependencies among those teams

(Conway, 1968; Eppinger et al, 1994; Sullivan et al, 2001). In the system design

literature, it has long been speculated that the structure of a product inevitably resembles

the structure of the organization that designs it (Conway, 1968). In Conway’s original

formulation, he reasoned that coordinating product design decisions requires

communication among the engineers making those decisions. If everyone needs to talk to

everyone, the communication overhead does not scale well for projects of any size.

Therefore, products must be split into components, with limited technical dependencies [2]

among them, and each component assigned to a single team. Conway (1968) proposed

that the component structure and organizational structure stand in a homomorphic

relation, in that more than one component can be assigned to a team, but a component

must be assigned to a single team.
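
The reasoning above can be illustrated with a minimal Python sketch. It is an illustration only, not part of Conway's formulation or of the method developed in this dissertation. The component names, team names, and function names are hypothetical. The sketch shows how all-to-all communication grows with headcount, checks the single-owner constraint, and lists the technical dependencies that cross team boundaries under a given assignment.

```python
def pairwise_channels(n_engineers: int) -> int:
    """Distinct communication pairs if everyone needs to talk to everyone."""
    return n_engineers * (n_engineers - 1) // 2


def single_owner(component_to_teams: dict[str, set[str]]) -> bool:
    """True when each component is assigned to exactly one team.

    A team may still own several components (a many-to-one mapping).
    """
    return all(len(teams) == 1 for teams in component_to_teams.values())


def cross_team_dependencies(
    dependencies: list[tuple[str, str]], owner: dict[str, str]
) -> list[tuple[str, str]]:
    """Technical dependencies whose two components belong to different teams."""
    return [(a, b) for a, b in dependencies if owner[a] != owner[b]]


# Hypothetical example: three components split across two teams.
ownership = {"ui": {"team_x"}, "billing": {"team_x"}, "storage": {"team_y"}}
owner = {component: next(iter(teams)) for component, teams in ownership.items()}
dependencies = [("ui", "billing"), ("billing", "storage")]

print(pairwise_channels(10))                         # 45
print(single_owner(ownership))                       # True
print(cross_team_dependencies(dependencies, owner))  # [('billing', 'storage')]
```

Under this assignment only the billing-to-storage dependency spans two teams, which is the kind of limited cross-team coupling the modular design argument aims for.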

[2] The terms “technical dependency” and “product dependency” are used interchangeably throughout this dissertation.

A similar argument has been proposed in the strategic management literature.

Baldwin and Clark (2000, page 90) argued that modularization makes complexity

manageable, enables parallel work and tolerates uncertainty. Because the design decisions are

hidden within the modules, which communicate through standard interfaces,

modularization adds value by allowing independent experimentation with, and substitution of,

modules (Baldwin & Clark, 2000). Moreover, Baldwin and Clark (2000, page 89)

argued that a modular design structure leads to an equivalent modular task structure.

Their view thus aligns with Conway’s idea that one or more modules can be assigned to

one organizational unit and work can be conducted almost independently of others. In the

context of software engineering, a similar approach was first articulated by Parnas (1972)

as modular software design. Parnas (1972) argued that modules ought to be considered

work items instead of just a collection of subprograms. Development work can then

continue independently and in parallel across different modules. Parnas’ views also

coincide with the theoretical arguments from the product design and strategic management

literatures.

All three theoretical views rely on two interrelated assumptions. The authors

assume a simple and obvious relationship between product modularization and task

modularization. Hence, the modularization theories argue, reducing the technical

interdependencies among modules reduces task interdependencies, which consequently

reduces the need for communication among work groups. Unfortunately, there are several

problems with these assumptions. First, existing software modularization approaches

only use a subset of the technical dependencies, typically syntactic relationships, of a

software system (Garcia et al, 2007). Consequently, potentially relevant work dependencies might

be ignored. Secondly, recent empirical evidence indicates that the relationship between

product structure and task structure is not as simple as previously assumed. Moreover, the

theorized similarity between product and task structures diminishes over time (Cataldo et

al, 2006).

Thirdly, promoting minimal communication between teams responsible for

interdependent modules is problematic. The computer-mediated communication literature

suggests that loosely coupling tasks is the appropriate approach when teams are

geographically distributed (Olson & Olson, 2000). However, recent studies suggest that

minimal communication between teams, collocated or distributed, is detrimental to the

success of projects. The product development literature argues that information hiding,

which leads to minimal communication between teams, is an inevitable antecedent of

variability in the evolution of projects resulting, typically, in integration problems

(Yassine et al, 2003). In the context of software development, de Souza and colleagues

(2004) found that information hiding led development teams to be unaware of other

teams’ work resulting in coordination problems. Grinter and colleagues (1999) reported

similar findings for geographically distributed software development projects. The

authors highlighted that the main consequence of reducing the teams’ need to

communicate was to increase costs because problems were discovered too late in the

development process. Those findings do not suggest that modularization is not useful.

They highlight the need to supplement it with coordination mechanisms to allow

developers to deal correctly with the assumptions that are not captured in the

specification of the dependencies.

Another problem associated with the assumptions of modular design is the nature

and stability of the interfaces between software modules. Although the program

dependency literature defines technical dependencies as a syntactic or semantic

relationship between statements (Podgurski & Clarke, 1990), the same ideas are applied

at the level of modules. Relationships among modules can therefore range from

syntactic, for instance a function call from module A to module B, to more complex

semantic dependencies where, for example, the computations done in one module affect

the behavior of another module. Some authors refer to those types of semantic

dependencies as dynamic (Bass et al, 2003) or logical (Gall et al, 1998). Even in the

simple case of a function call between two modules, the complexity and the degree of

dependency varies, for instance, if we consider the number of parameters of a function

call or we compare parameters passed by value versus parameters passed by reference.

Cataldo et al (2007) presented case studies where even simple interfaces between

modules developed by remote teams created coordination breakdowns and integration

problems. The authors reported that semantic dependencies were even more problematic

and they argued that the developers’ ability to identify and manage dependencies was

hindered by several inter-related factors such as development processes, organizational

attributes (e.g. structure, management style) and uncertainty of the interfaces. In a field

study of a large software project, de Souza (2005) found that interfaces tended to

change often and their design details tended to be incomplete, leading to serious

integration problems. These findings suggest that the interfaces between software modules

might differ in complexity and, often, it is not possible to specify those interfaces at the

necessary level of detail, increasing the likelihood of future changes to them. This lack of

stability represents a constant challenge for software development organizations.
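
The distinction between syntactic and semantic dependencies drawn in this paragraph can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not an example taken from the cited studies. All function names and data are hypothetical, and since Python passes object references, in-place mutation stands in for passing a parameter by reference. The function count_above never calls either normalization function, so no syntactic dependency links them, yet its result changes depending on which one ran first on the shared data.

```python
def normalize_in_place(readings: list[float]) -> None:
    """'Module A', by-reference style: rescales the caller's list itself."""
    peak = max(readings)
    for i, value in enumerate(readings):
        readings[i] = value / peak


def normalized_copy(readings: list[float]) -> list[float]:
    """'Module A', by-value style: leaves the caller's list untouched."""
    peak = max(readings)
    return [value / peak for value in readings]


def count_above(readings: list[float], threshold: float) -> int:
    """'Module B': makes no call into module A, only reads the shared data."""
    return sum(1 for value in readings if value > threshold)


shared = [2.0, 4.0, 8.0]
print(count_above(shared, 1.5))   # 3

scaled = normalized_copy(shared)  # shared is unchanged
print(count_above(shared, 1.5))   # 3

normalize_in_place(shared)        # shared is now [0.25, 0.5, 1.0]
print(count_above(shared, 1.5))   # 0
```

A call-graph analysis of this code would report no relationship between the two modules, even though a change to the normalization step silently alters the behavior of the counting step.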

In sum, the modularization approach is a very useful tool for dividing the

development of a complex software system into manageable units. However,

modularization is not a sufficient representation of work dependencies in software

development activities. The relationship between the task dependency structure and the

product structure is not as simple as theorized. Appropriate mechanisms are then required

to identify relevant work dependencies and, consequently, maintain suitable levels of

communication and coordination among teams developing interdependent modules,

particularly, in the case of geographically distributed software development.

The Nature of Software Development and Interdependency Theories

Coordination is a central concept in organizations. The idea of dividing labor

into interdependent units is well developed, and mechanisms for coping with varying

degrees of interdependency have been proposed in the traditional organizational literature

(for instance, March & Simon, 1958; Thompson, 1967; Galbraith, 1973; Staudenmayer,

1997). More recent work, particularly in organizational design, has focused on

computational and mathematical approaches to examine how organizational designs, that

use different models of communication and coordination, are affected by factors such as

stress, task decomposition, quality of information exchanged, and ability to adapt (for

instance, Carley and Lin, 1995, 1997; Handley & Levis, 2001; Perdu & Levis, 1998).

Thus, both streams of work, traditional organizational theory and computational and

mathematical organizational theory (CMOT), are relevant to the problem of coordination

in software development projects.

In the traditional organizational theory, March and Simon (1958) argued that

coordination encompasses more than just a traditional division of labor and assignment of

tasks. The authors proposed numerous mechanisms, such as the division of the task into

nearly independent parts and they also argued that schedules and feedback mechanisms

are required when interdependence is unavoidable. Thompson (1967) extended March

and Simon’s work by matching three mechanisms: standardization, plan, and mutual

adjustment, to stylized categorizations of dependencies such as pooled, sequential, and

reciprocal. Galbraith (1973) argued that low levels of interdependency can be managed

by traditional mechanisms such as rules and programs. However, as the level of

interdependency increases, additional mechanisms are required, such as slack resources

and lateral communication (Galbraith, 1973). Mintzberg (1979) took an organizational-

level perspective and argued that specific coordination mechanisms are properties of

particular kinds of organizations and environments. Crowston (1991) developed a

typology of coordination problems to catalog coordination mechanisms that address

specific types of interdependencies. Staudenmayer (1997) grouped the contributions of

March and Simon, Thompson, and others into the information processing theories of

interdependency which, she argued, rely on the assumptions of determinism and stability.

In other words, those theoretical views focus on predictable and static tasks

(Staudenmayer, 1997). This limitation of the information processing argument is not

problematic if software development tasks can be identified a priori and the set of

interdependencies that arise from the division of labor are managed with the appropriate

set of mechanisms. If we think in terms of project management activities, coarse-grain

development activities such as “develop component A” or “implement feature X” can

typically be identified at relatively early stages of a project. Some dependencies among

those development tasks are typically easy to identify. For instance, particular work items

need to be finished before other work items can start. Work items that can only be

assigned to specific teams because of the skill set required would represent another

example. Then, specific organizational forms can be used to manage the dependencies

among those coarse-grain development tasks (Malone & Crowston, 1991), even in the

case of geographically distributed development organizations (Grinter et al, 1999).

Unfortunately, there are several characteristics of software development activities

that limit the applicability of traditional organizational theories as well as the more recent

CMOT work. First, it is widely accepted among software engineering researchers and

practitioners that the requirements of the system become known over time or those

requirements change as time progresses (Leffingwell & Widrig, 2003). In some cases the

changes in the requirements result in minor alterations of specific development tasks. In

other cases, new features have to be added or features under development are eliminated.

These events introduce a certain level of dynamism in software development that

challenges the determinism and stability assumptions of the information processing views

of interdependency.

Secondly, the dynamic nature of finer-grain dependencies that arise as part of the

development of a piece of code is not well suited for traditional organizational theories of

coordination. The act of developing a software system consists of a collection of design

decisions, either at the architectural level or at the implementation level. Those design

decisions introduce constraints that might establish new dependencies among the various

parts of the system, modify existing ones or even eliminate dependencies. The changes in

dependencies can generate new coordination requirements that are quite difficult to

identify a priori, particularly when they are not obvious, or as a project matures over time

(Henderson & Clark, 1990; Sosa et al, 2004). Failure to discover the changes in

coordination needs might have a profound impact on the quality of the product (Curtis et

al, 1988), on productivity (Herbsleb & Mockus, 2003) and even on the projects’ overall

design (Bass et al, 2006). In addition, little is known about the specific impact of the

various types of dependencies that arise among parts of a software system such as explicit

versus implicit dependencies or syntactic versus logical dependencies. Consequently, the use of

computational and mathematical organizational theory approaches is limited by

the lack of a theoretical framework that guides the modeling of the relationships

between organizational tasks, their dependencies and the need to communicate and

coordinate.

In sum, software development tasks are embedded in an evolving network of

coordination requirements that need to be satisfied. The coarse-grain and idealized

approaches suggested by the organization theory literature are not appropriate to identify

and manage such a dynamic web of interdependencies. A finer-grain view of

coordination would provide a better framework in dynamic knowledge-intensive tasks

such as software development.

Research Questions

In the previous sections, I highlighted the limitations of the current mechanisms

for identifying and managing dependencies in geographically distributed software

development organizations. Product modularization does not necessarily yield an

equivalent task modularization structure and additional mechanisms are required to

maintain appropriate levels of coordination among workgroups. The nature of software

development, such as the attributes and stability of interfaces among modules and the

dynamics of technical dependencies, limits the applicability of established task

decomposability and coordination approaches. Moreover, these characteristics are a

constant challenge for software development organizations, particularly, for those

geographically distributed. This dissertation addresses the problem of work dependencies

in software development by examining how to use technical dependencies to determine

work dependencies and by investigating the impact of those work dependencies in the

development process. Specifically, I address the following general research questions:

RQ 1: How can relevant task dependencies be identified from technical

dependencies?

RQ 2: What is the impact of those task dependencies on traditional outcome

variables such as productivity and quality?

The rest of this document is organized as follows. Chapter 2 presents a framework

for identifying and managing dependencies. Chapter 3 introduces terminology used in

this dissertation and describes the various datasets used in the empirical studies. In

chapter 4, I examine different methods of identifying work dependencies from technical

dependencies. Chapter 5 presents the first empirical study that examines the impact on

development productivity of the mismatches between coordination requirements and

coordination behavior. In chapter 6, I study the impact of the structure of technical and

work dependencies on software quality. The last empirical study which explores the

usage of the proposed framework for examining the relationship between coordination

behavior and developer-level performance is described in chapter 7. Chapter 8 describes

developer and managerial applications of the results reported in this dissertation. Finally,

chapter 9 describes the contributions of this research endeavor, its limitations as well as

future research directions.

CHAPTER 2: A FRAMEWORK FOR IDENTIFICATION OF WORK DEPENDENCIES

It has long been observed that organizations carry out complex tasks by dividing

them into smaller interdependent work units assigned to groups and coordination arises as

a response to those interdependent activities (March & Simon, 1958). Communication

channels emerge in the formal and informal organizations. Over time, those information

conduits develop around the interactions that are most critical to the organization’s main

task (Galbraith, 1973). This is particularly important in product development

organizations which organize themselves around their products’ architectures because the

main components of their products define the organization’s key subtasks (von Hippel,

1990). Organizations also develop filters that identify the most relevant information

pertinent to the task at hand (Daft & Weick, 1990). Changes in task dependencies,

however, jeopardize the appropriateness of the information flows and filters and can

disrupt the organization’s ability to coordinate effectively. For example, Henderson &

Clark (1990) found that minor changes in product architecture can generate substantial

changes in task dependencies, and can have drastic consequences for the organizations’

ability to coordinate work. If effective ways of identifying detailed work dependencies

and tracking their changes over time exist, we would be in a much better position to

design mechanisms that could help to align information flow with work dependencies.

Identifying work dependencies and determining the appropriate coordination

mechanism to address the dependencies is not a trivial problem. Coordination is a

recurrent topic in the organizational theory literature and many stylized types of task

dependencies and coordination mechanisms have been proposed over the past several

decades (Crowston, 1991; Galbraith, 1973; Malone & Crowston, 1994; March & Simon,

1958; Mintzberg, 1979; Thompson, 1967). However, numerous types of work, in

particular non-routine knowledge-intensive activities, are potentially full of fine-grain

dependencies that might change on a daily or hourly basis. Conventional coordination

mechanisms like standard operating procedures or routines would have very limited

applicability in these dynamic contexts. Therefore, designing mechanisms to handle

rapidly shifting coordination needs requires a more fine-grained level of analysis than

what the traditional views of coordination provide.

In the context of software development, a technical dependency in the software

system represents a coordination need that relevant software developers might need to

address. Ignoring coordination requirements could lead to an increased number

of defects, integration problems and longer development times (Curtis et al, 1988;

Espinosa et al, 2002; Kraut et al, 1995; Herbsleb & Mockus, 2003). When members of a

team are physically collocated and coordination requirements involve individuals from

the same team, there are numerous ways for team members to identify the needs to

coordinate and act on them such as group and status meetings and managerial

intervention. The problem of identifying the need to coordinate is further complicated

when coordination requirements change rapidly (Cataldo et al, 2006). In this chapter, I

present a framework to determine the coordination requirements among developers. The

objective of the framework is two-fold. The first objective is to provide a fine-grain level of analysis of

coordination. The second objective is to allow for identification of work dependencies

from alternative representations of technical dependencies of the system. I also propose a

measure of “fit” between work dependencies and the coordination activities performed by

the software developers.

The Concept of Socio-Technical Congruence

Product development endeavors involve two fundamental elements: a technical

and a social component. The technical properties of the product to develop, the processes,

the tasks, and the technology employed in the development effort constitute the technical

component. The second element is composed of the individuals involved

in the development process, their attitudes and behaviors. In other words, a product

development project can be thought of as a socio-technical system where the two

components, the technical and the social elements, need to be aligned in order to have a

successful project. A key issue, then, is to understand how we can examine the

relationship between those two dimensions, the technical and the social. Two lines of

work are particularly relevant in this context. First, the concept of “fit” from

organizational literature refers to the match between a particular organizational design

and the organization’s ability to carry out a task (Burton & Obel, 1998). The work in this

line of research has, traditionally, focused on two factors: the temporal dependencies

among tasks that are assigned to organizational groups and the formal organizational

structure as a means of communication and coordination (Carley & Ren, 2001; Levchuk

et al, 2004). Secondly, the research on dynamic analysis of social networks provides an

innovative approach, called the meta-matrix, to examine the dynamic co-evolution of

relationships among multiple types of entities such as resources, tasks, and individuals

(Carley, 2002; Krackhardt & Carley, 1998). The concept of socio-technical congruence

presented in this chapter builds on the idea of “fit” from the organizational theory

literature and, from a mathematical standpoint, builds on the meta-matrix model from the

dynamic network analysis literature. Combining those two lines of research allows for

two important contributions to the literature. First, the socio-technical congruence

framework presented here provides a fine-grain level of analysis. Secondly, the measure

facilitates assessing the role of coordination activities in multiple and complementary

ways as well as examining the impact of several types of dependencies.

Figure 1 presents an intuitive representation of the measure of congruence

formally defined later in this chapter. A group of workers have a set of work

dependencies which defines a set of coordination requirements. When the coordination

activities carried out by those workers define a pattern of coordination similar to that

defined by the coordination requirements (case A in Figure 1), we have high levels of

congruence or “good fit”. If the patterns of coordination requirements and coordination

activities do not match, we have low levels of congruence or a “poor fit” (case B in

Figure 1).

Formally, socio-technical congruence is defined as the match between the

coordination requirements established by the dependencies among tasks and the actual

coordination activities carried out by the workers. In other words, the concept of

congruence has two components, coordination needs and coordination activities, and the

following sections discuss the mathematical framework to measure them.

Figure 1: The Concept of Congruence

Identification of Coordination Requirements

In order to identify which set of individuals should be coordinating their

activities, we need to represent two sets of relationships. One set is represented by which

individuals are working on which tasks. The relationships or dependencies among tasks

represent the second element. Past research has used a matrix formalization to capture

and relate those two pieces of information. For instance, Carley and Ren (2001) proposed

a metric, called resource congruence, to measure the relationship between the resources

required to perform a task and workers’ access to those resources. The same metric was

further examined by Carley and colleagues (2003) in the context of covert networks.

In the framework proposed in this chapter, assignments of individuals to

particular work items are represented by a people-by-task matrix where a one in cell ij

indicates that worker i is assigned to task j. I will refer to this matrix as Task Assignments

(TA). Following the same approach, the set of dependencies among tasks can be

represented as a square matrix where a cell ij (or cell ji) indicates that task i and task j are

interdependent. I will refer to this matrix as Task Dependencies (TD). Now, if the Task

Assignment and Task Dependencies matrices are multiplied, a people by task matrix is

obtained that represents the set of tasks a particular worker should be aware of, given the

work items the person is responsible for and the dependencies of those work items with

other tasks. Finally, a representation of the coordination requirements among the

different workers is obtained by multiplying the product of the Task Assignment and Task

Dependencies matrices by the transpose of the Task Assignment matrix. This product

results in a people by people matrix where a cell ij (or cell ji) indicates the extent to

which person i works on tasks that share dependencies with the tasks worked on by

person j. In other words, the resulting matrix represents the Coordination Requirements

or the extent to which each pair of people needs to coordinate their work. Formally, the

Coordination Requirements matrix is determined by the following product:

$CR = TA \times TD \times TA^{T}$ (Equation 1)

where TA is the Task Assignments matrix, TD is the Task Dependencies matrix and $TA^{T}$

is the transpose of the Task Assignments matrix.
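
As an illustration, Equation 1 can be sketched in a few lines of plain Python. The matrices below are hypothetical (three workers, four tasks) and are not taken from any of the datasets. The helper names `matmul`, `transpose` and `coordination_requirements` are assumptions introduced only for this sketch.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def coordination_requirements(ta, td):
    """Compute CR = TA * TD * TA^T (Equation 1)."""
    return matmul(matmul(ta, td), transpose(ta))


# Hypothetical Task Assignments matrix (people by task).
ta = [
    [1, 1, 0, 0],  # worker 0 is assigned to tasks 0 and 1
    [0, 0, 1, 0],  # worker 1 is assigned to task 2
    [0, 0, 0, 1],  # worker 2 is assigned to task 3
]

# Hypothetical Task Dependencies matrix (task by task, symmetric).
td = [
    [0, 0, 1, 0],  # task 0 and task 2 are interdependent
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

cr = coordination_requirements(ta, td)
for row in cr:
    print(row)
```

In this example only workers 0 and 1 obtain a nonzero entry, because task 0 (worker 0) depends on task 2 (worker 1). Worker 2 has no coordination requirement with anyone.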

This framework provides alternative ways of thinking about coordination

requirements among workers depending on what type of data is used to populate the Task

Dependencies matrix. Past work has focused on temporal relationships between tasks, for

instance, task A needs to be done before task B (e.g. Levchuk et al, 2003). In the context

of software development, such a way of thinking about task dependencies is quite common.

Alternative views could be based on high level roles in the development organizations

(e.g. integration and testing depends on development) or task dependencies based on

product dependencies in the actual software code (e.g. function calls between modules).

The focus of this dissertation is on the relationship between the work dependency structure and the product dependency

structure because, as argued in chapter 1, the difficulty of identifying and

managing certain types of product dependencies is a critical factor in coordination

success and ultimately in productivity and quality.

Measuring Socio-Technical Congruence

Given a particular Coordination Requirements matrix constructed from relating

product dependencies to work dependencies, we can compare it to an Actual

Coordination (CA) matrix that represents the interactions workers engaged in through

different means of coordination. I refer to the match between those two matrices as

socio-technical congruence. Then, given a particular set of dependencies among tasks,

congruence is the proportion of coordination activities that actually occurred (given by

the Actual Coordination matrix) relative to the total number of coordination activities that

should have taken place (given by the Coordination Requirements matrix). For example,

if the Coordination Requirements matrix shows that 10 pairs should coordinate, and of

these, 5 show Actual Coordination interactions, then the congruence is 0.5. Formally, we

define congruence as follows:

$\mathrm{Diff}(CR, CA) = \mathrm{card}\,\{\, \mathit{diff}_{ij} \mid cr_{ij} > 0 \;\&\; ca_{ij} > 0 \,\}$

$|CR| = \mathrm{card}\,\{\, cr_{ij} > 0 \,\}$

We have,

$\mathrm{Congruence}(CR, CA) = \mathrm{Diff}(CR, CA) \,/\, |CR|$ (Equation 2)

In sum, the value of congruence belongs to the [0,1] interval and represents the

proportion of coordination requirements that were satisfied through some type of

coordination activity or mechanism. The measure of socio-technical congruence proposed

here provides a new way of thinking about coordination, particularly, by providing a fine-

grain level of analysis of different types of product dependencies and allowing us to

examine how coordination needs are impacted by them.
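
The congruence measure of Equation 2 can be sketched as follows. This is a minimal illustration and rests on assumptions. The matrices are hypothetical. Every cell with a positive Coordination Requirements entry is counted, exactly as the formula is written. Returning `None` when no coordination is required is a choice made for this sketch, since the text does not define that case.

```python
def congruence(cr, ca):
    """Share of required coordination cells that show actual coordination.

    cr: Coordination Requirements matrix (people by people)
    ca: Actual Coordination matrix (people by people, same shape)
    """
    n = len(cr)
    # Cells where coordination is required: cr_ij > 0.
    required = [(i, j) for i in range(n) for j in range(n) if cr[i][j] > 0]
    if not required:
        return None  # assumption: undefined when nothing is required
    # Required cells that also show actual coordination: ca_ij > 0.
    satisfied = sum(1 for i, j in required if ca[i][j] > 0)
    return satisfied / len(required)


# Hypothetical matrices for three workers.
cr = [
    [0, 2, 1],
    [2, 0, 0],
    [1, 0, 0],
]
ca = [
    [0, 1, 0],
    [1, 0, 1],  # workers 1 and 2 interact, but no requirement exists
    [0, 1, 0],
]

print(congruence(cr, ca))
```

Here four cells require coordination and two of them show actual coordination, so the sketch yields 0.5. The interaction between workers 1 and 2 does not affect the value because it is not required.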

CHAPTER 3: TERMINOLOGY AND DESCRIPTION OF THE DATASETS

Terminology

In this section, I define several terms that are used throughout the empirical studies as

well as in the description of the datasets:

Source code file: A source code file represents a collection of functions, methods, and

data type declarations and definitions that implement part of or an entire functionality of

a software system. In this dissertation, I will use the terms source code file and module

interchangeably. This definition does not refer to or imply any specific way of partitioning a

system into implementation modules.

Commit: A commit represents an actual modification to one or more source code files in

the version control system. A particular commit contains at least the following attributes: a

date of submission, an author or developer responsible, a list of one or more files and the

modifications to those files. The terms submission and changelist are used as synonyms

of a commit throughout this document.

Modification request (MR): A modification request represents a work item that refers to a

conceptual change to the software that involves modifications to a set of source code files

(Mockus & Weiss, 2000). The changes could represent the development of new

functionality or the resolution of a defect encountered by a developer, the quality

assurance organization or reported by a customer. A modification request consists of one

or more commits from a version control system.

Lines of code (LOC): In various parts of the dissertation, we refer to lines of code as a

measure of size of a system or a module. The measure refers to non-blank non-comment

lines of code.
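
The relationship between commits and modification requests defined above can be pictured as a small data model. The sketch below is only illustrative. The class names, field names, file paths, developer names and dates are all hypothetical and do not describe the schema of any dataset used in this dissertation.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Commit:
    submitted: date   # date of submission
    author: str       # developer responsible
    files: list       # one or more source code file paths
    # The modifications themselves (diffs) are omitted in this sketch.


@dataclass
class ModificationRequest:
    mr_id: str
    commits: list = field(default_factory=list)  # one or more commits

    def files_changed(self):
        """Set of source code files touched by any commit of this MR."""
        return {f for c in self.commits for f in c.files}

    def developers(self):
        """Set of developers who submitted commits for this MR."""
        return {c.author for c in self.commits}


# Hypothetical modification request made of two commits.
mr = ModificationRequest("MR-1", [
    Commit(date(2005, 3, 1), "dev_a", ["net/io.c", "net/io.h"]),
    Commit(date(2005, 3, 4), "dev_b", ["net/io.c", "core/sched.c"]),
])

print(sorted(mr.files_changed()))
print(sorted(mr.developers()))
```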

Datasets

In order to address the research questions outlined in chapter 1, data from several

geographically distributed software development projects was collected. The

characteristics of those projects and the data are described in the rest of this chapter.

Project A

I collected data from a software development project of a large distributed system

produced by a company that operates in the data storage industry. The data covered a

period of 39 months of development activity and the first four releases of the product.

The company had one hundred and fourteen developers grouped into eight development

teams distributed across three development locations. All the developers worked full time

on the project during the time period covered by the data. The system was composed of

about 5 million lines of code distributed in 7,737 source code files, mostly in C,

with a small portion (117 files and fewer than 96,000 lines of code) in C++. A

total of 8,257 resolved modification requests were identified in the data.

Those MRs involved 67,652 commits to the version control system.

Software developers communicated and coordinated using various means.

Opportunities for interaction existed when working in the same formal team or when

working in the same location. Developers also used tools such as Internet Relay Chat

(IRC) and an MR tracking system to interact and coordinate their work. For instance, the

MR tracking system kept track of the progress of the task, comments and observations

made by developers as well as additional material used in the development process. I

collected communication and coordination information from these two systems. Finally, I

also collected demographic data about the developers such as their programming and

domain experience and level of formal education.

Project A represents the main source of data for the various empirical studies

presented in this dissertation. In order to address potential external validity concerns, data

from additional projects was used in each empirical study. Those projects are described in

the following paragraphs.

Project B

Version control data from three open source projects from the Apache Software

Foundation was collected. I focused on changes to the software that were associated with

modification requests that were resolved between February of 2001 and January of 2003.

There were a total of 1068 modification requests resolved in that timeframe involving

1972 commits in the version control system. Those modification requests were related to

three different projects, Ant, Tomcat and Struts, where a total of seventy-five engineers

participated in the development effort.

Project C

The project involved the development of an embedded software system for a

communications device developed by a major telecommunications company. Forty

engineers participated in the project. The data covered a period of five years and the last

six releases of the product. All the developers but one worked in the same development

facility located in the United States. The remote developer worked in Australia. The

system was composed of approximately 1.2 million lines of C and C++ code distributed

in 1224 modules, 427 of which were written in C++. Data associated with

about 7000 modification requests constituted the dataset.

Project D

This project was a large medical device system where the development

organization had eighty-three engineers grouped into 10 teams distributed across four

development locations, one in India, one in Eastern Europe and two in the United States.

Architects, some of the technical leads and managers were also in the development

facilities located in the United States. All the developers worked full time on the project

during the time period covered by the data. Engineers had formal roles such as architect,

team lead, tester or developer. The project was organized into iterations which constitute

fixed periods of time, about 8 weeks, focused on the development of a set of

requirements defined at the beginning of the iteration. The data covered the 7th iteration

of the project. A survey instrument based on a roster approach was used to collect

coordination activity twice during the development iteration.

CHAPTER 4: METHODS FOR IDENTIFYING WORK DEPENDENCIES IN SOFTWARE DEVELOPMENT PROJECTS

In this chapter, I explore different methods of determining work dependencies

from product dependencies (e.g. relationships among the source code files of a software

system). Then, those work dependencies will allow us to identify coordination

requirements among software developers as proposed in the congruence framework

introduced in chapter 2.

Two Approaches to Determine Product Dependencies in Software Systems

The traditional view of software dependency has its origins in compiler

optimizations and focuses on control and dataflow relationships (Horwitz et al, 1990).

This approach extracts relational information between specific units of analysis such as

statements, functions or methods, as well as modules, typically, from the source code of a

system or from an intermediate representation of the software code such as bytecodes or

abstract syntax trees. These relationships can represent either a data-related dependency

(e.g. a particular data structure modified by a function and used in another function) or a

functional dependency (e.g. method A calls method B). This type of dependency analysis

technique has been widely used in a research context to examine the relationship

between coupling and quality of a software system (e.g. Hutchins & Basili, 1985; Selby

& Basili, 1991). Syntactic dependency analysis is also used by software developers to

improve their understanding of programs and the linkages among the various parts of

those programs (Murphy et al, 1998).

One characteristic of these relational structures such as a call-graph, and for that

matter other graphs such as inheritance and data dependencies graphs, is that they provide

a particular view of the system-wide structure. Moreover, the accuracy of the information

represented in these graphs depends on the ability of the tool used to identify all the

appropriate types of syntactic relationships allowed by the underlying programming

language (Murphy et al, 1998).
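
To make the syntactic view concrete, the sketch below lifts function-level call relationships to a file-by-file dependency matrix. It assumes that the call edges have already been extracted by some analysis tool, which is not shown. The function names, file names and the decision to make the matrix symmetric and to count calls are assumptions of this illustration, not a description of the tooling used in this dissertation.

```python
def file_dependencies_from_calls(defined_in, calls):
    """Lift function-level call edges to a symmetric file-by-file matrix.

    defined_in: dict mapping a function name to the file that defines it
    calls: iterable of (caller, callee) function-name pairs
    """
    files = sorted(set(defined_in.values()))
    index = {f: i for i, f in enumerate(files)}
    td = [[0] * len(files) for _ in files]
    for caller, callee in calls:
        a = index[defined_in[caller]]
        b = index[defined_in[callee]]
        if a != b:  # calls inside one file are not cross-file dependencies
            td[a][b] += 1
            td[b][a] += 1
    return files, td


# Hypothetical functions and call edges.
defined_in = {
    "main": "main.c",
    "open_conn": "net.c",
    "send": "net.c",
    "log": "util.c",
}
calls = [
    ("main", "open_conn"),
    ("open_conn", "log"),
    ("open_conn", "send"),  # same file, ignored
    ("send", "log"),
]

files, td = file_dependencies_from_calls(defined_in, calls)
print(files)
for row in td:
    print(row)
```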

An alternative mechanism of identifying dependencies consists of examining the

set of source code files that are modified together as part of a modification request. This

approach is equivalent to the approach proposed by Gall and colleagues (1998) in the

software evolution literature to identify logical dependencies between modules. A source

code file can be viewed as representing a “bundle” of technical decisions. If a

modification request can be implemented by changing only one file, it provides no

evidence of any dependencies among files. However, when a modification request

requires changes to more than one file, it can be assumed that decisions about the change

to one file in a modification request depend in some way on the decisions made about

changes to the other files involved in implementing the modification request.

Dependencies could range from syntactic, for instance a function call between files, to

more complex semantic dependencies where the computations done in one file affect

the behavior of another file. This approach would represent a better estimate for

semantic dependencies relative to call graphs or data graphs because it does not rely on

language constructs to establish the dependency relationship between source code files.

The remainder of this dissertation refers to this approach to identify dependencies as the

“Files Changed Together” (FCT) method. I will refer to the method to identify

dependencies based on syntactic functional and data relationships described earlier as the

CGRAPH method.
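
A minimal sketch of the FCT idea follows, assuming each modification request has already been reduced to the set of files it changed. The file names are hypothetical. Counting how many modification requests changed a pair of files together is an assumption of this sketch. The text only requires that a co-change be recorded as evidence of a dependency.

```python
from itertools import combinations


def fct_matrix(mr_file_sets):
    """Build a symmetric file-by-file matrix from files changed together.

    mr_file_sets: list of sets, one per modification request, each holding
    the files changed to implement that request.
    """
    files = sorted(set().union(*mr_file_sets))
    index = {f: i for i, f in enumerate(files)}
    td = [[0] * len(files) for _ in files]
    for file_set in mr_file_sets:
        # Every pair of files in the same MR is treated as interdependent.
        for a, b in combinations(sorted(set(file_set)), 2):
            td[index[a]][index[b]] += 1
            td[index[b]][index[a]] += 1
    return files, td


# Hypothetical modification requests.
mrs = [
    {"a.c", "b.c"},
    {"a.c", "b.c", "c.c"},
    {"d.c"},  # single-file MR: no evidence of a dependency
]

files, td = fct_matrix(mrs)
print(files)
for row in td:
    print(row)
```

The single-file request contributes a row and column of zeros, which matches the observation that such a request provides no evidence of dependencies among files.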

The Task Dependency (TD) matrices produced by the techniques described in the

previous paragraphs could change over time as new product dependencies are created or

existing ones are removed. Moreover, the information captured by the TD matrix

constructed with the FCT method might differ from the TD matrix constructed with the

CGRAPH method. Those changes or differences could potentially impact the measures of

coordination requirements (equation 1) and congruence (equation 2). Then,

understanding the general properties of the task dependency matrices, how they evolve

over time and how they differ from each other is critical to assess the impact of

socio-technical congruence on outcome variables such as development productivity and

software quality. The following sections address these issues using the data from Project

A.

General Properties and Evolution of the FCT Task Dependency Matrix

Using the FCT method, I constructed monthly TD matrices which captured all the

changes to the code associated with the set of modification requests resolved in each month.

Since a graph and a matrix are equivalent representations of a set of relational data, I can

use widely accepted graph measures to examine the general properties of the TD matrices

(I use the terms graph and network interchangeably throughout the dissertation).

One basic measure is the density of the graph which provides a general idea of the level

of interconnectivity among the nodes of the graph. In this research context, density

translates to the overall degree of interdependence amongst the source code files in the

system. A second useful network measure is the clustering coefficient (Watts, 1999), which

indicates the extent to which there are clusters of interdependent source code files that are

also interdependent amongst themselves. Those two measures, density and clustering

coefficient, provide a general view of the structural properties of the TD matrices.

Figure 2 shows the evolution of the density and clustering coefficient measures

over the time covered by the data. The density of the monthly TD matrices is relatively

low, with a few exceptions where the levels of density exceed 0.01 (avg=0.0033,

min=0.0004, max=0.0204). The clustering coefficient measure shows modest levels

(avg=0.0925, min=0.0023, max=0.1774) suggesting a small degree of interdependent

clusters of files in the TD matrices. In sum, the results indicate that, on a monthly basis, a

small set of dependencies are identified, and those dependencies tend to be modestly

clustered.
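The two measures can be sketched for a small undirected graph as follows. This is an illustration under stated assumptions, not the tooling used for Project A: density is taken as the fraction of possible file pairs that are linked, the clustering coefficient is taken as the average of the local coefficients, and nodes with fewer than two neighbors are assumed to contribute zero.

```python
from itertools import combinations


def density(nodes, edges):
    """Fraction of all possible undirected pairs that are linked."""
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))


def clustering_coefficient(nodes, edges):
    """Average, over all nodes, of the share of a node's neighbor
    pairs that are themselves linked."""
    neighbors = {v: set() for v in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    total = 0.0
    for v in nodes:
        k = len(neighbors[v])
        if k < 2:
            continue  # assumption: such nodes contribute 0
        links = sum(1 for a, b in combinations(neighbors[v], 2)
                    if b in neighbors[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0


# Hypothetical four-file system; each undirected edge is listed once
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
d = density(nodes, edges)                  # 4 of 6 possible pairs
cc = clustering_coefficient(nodes, edges)  # (1 + 1 + 1/3 + 0) / 4
```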

Figure 2: Evolution of the Density and Clustering level of the TD matrices (FCT method). [Plot not reproduced; x-axis: Month in the Dataset (1 to 39); y-axis: Measure Level (0.000 to 0.200); series: Density, Clustering Coefficient.]

An instance of a set of source code files changing together as part of a

modification request represents a piece of evidence indicating the existence of a product

dependency, potentially logical or implicit in nature. In order to capture the representative

set of product dependencies, an understanding of the degree of change in the information

contained in the TD matrices is required. If the matrices are relatively stable, that suggests

that considering a short time slice could suffice to capture all relevant product

dependencies. On the other hand, if the information contained in the monthly TD matrices

changes significantly from time t to time t+1, it is necessary to identify the appropriate

time window size that would yield an accurate representation of the product

dependencies. Figure 3 shows the percentage of change in the information contained in a

TD matrix from time t relative to the TD matrix from time t-1. The set of technical

dependencies captured differs significantly from month to month, with an average change

of 37% (min=5.11%, max=49.94%). These results suggest that the changes to the source

code are affecting different sets of source code files over time. Hence, it is necessary to

explore how many months of information would constitute an accurate and representative

set of technical dependencies that could be used to compute the Coordination

Requirement matrices.

Figure 3: Evolution of the Change in the Information Contained in the TD matrices (FCT method). [Plot not reproduced; x-axis ticks: 2 to 38; y-axis: Percentage of Change (0% to 100%).]

The following procedure was used to explore the time window size necessary to

capture the relevant product dependencies. First, the union of all the k-tuples of

consecutive TD matrices is computed, where k represents the number of months of data

used to compute the new TD matrices and it ranges from 2 to 39 months. For instance, in

the case of k=2, this computation outputs TD matrices that contain all the dependencies

based on the changes made to the software between months 1 and 2, months 2 and 3,

months 3 and 4, and so forth. The second step is to average the network density value of

all the matrices associated with a particular value of k. Finally, I plotted that average

value of network density for each value of k. Figure 4 depicts the results of this

procedure. As the number of months of data considered to compute the TD matrix

increases, the density level of that TD matrix increases monotonically until month 19

where a density value of 0.0109 is reached. The remaining 20 months of data increase the

density of the TD matrix from 0.0109 up to 0.01151. In other words, any additional month

of data beyond 19 months does not yield a significant increase in the value of the density

of the TD matrix, indicating that any additional month of data does not contribute any

additional information value in terms of technical dependencies. In view of this result, I

used a time period of 19 months to compute the TD matrix used in the calculations of the

coordination requirements.
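The window-size procedure described above can be sketched as follows. The sketch assumes that each monthly TD matrix is reduced to a set of unordered file pairs and that density is computed over a fixed number of files; the function names and the toy data are hypothetical.

```python
def window_union_densities(monthly_edges, n_files, k):
    """Density of the union of every run of k consecutive monthly edge sets."""
    possible = n_files * (n_files - 1) / 2
    densities = []
    for start in range(len(monthly_edges) - k + 1):
        union = set().union(*monthly_edges[start:start + k])
        densities.append(len(union) / possible)
    return densities


def average_density_by_window(monthly_edges, n_files):
    """Average union density for every window size k from 2 up to the
    number of months available."""
    averages = {}
    for k in range(2, len(monthly_edges) + 1):
        values = window_union_densities(monthly_edges, n_files, k)
        averages[k] = sum(values) / len(values)
    return averages


# Hypothetical data: three months, four files (six possible pairs)
months = [
    {("a", "b")},
    {("a", "b"), ("b", "c")},
    {("c", "d")},
]
avg = average_density_by_window(months, n_files=4)
# k=2: windows have 2 and 3 distinct pairs -> (2/6 + 3/6) / 2
# k=3: one window with 3 distinct pairs    -> 3/6
```

Plotting `avg[k]` against `k` gives a curve of the kind shown in Figure 4; the point at which the curve flattens suggests a window size beyond which additional months add few new dependencies.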

Figure 4: Average Cumulative Density of the TD matrix (FCT method). [Plot not reproduced; x-axis: Amount of Months of Data (2 to 38); y-axis: Average Cumulative Density Level (0.000 to 0.012).]

General Properties and Evolution of the CGRAPH Task Dependency Matrix

In the case of the CGRAPH method, the dependencies between source code files are

determined based on data and functional references. Data references are represented by

relationships where a source code file, A, references a data object in a second source code

file, B. Functional references are represented by relationships where a source code file, A,

invokes a function or a method declared in a second source code file, B. Unlike the

relationships in the FCT method, data and functional references are directional; that is,

the pair of source code files (A,B) is considered different from the pair (B,A).

I collected quarterly data for this type of dependency information, mapping each

quarter to the corresponding 3 months of the data discussed in the previous paragraphs. I

used the C-REX tool (Hassan and Holt, 2004) to identify programming language tokens

and references in each entity of each source code file. This analysis was performed over

the entire source code of the system4 at the end of the 3rd month of each quarter. Using

the resulting data, I computed dependencies between source code files by identifying

data, function and method references that cross the boundary of each source code file. In

other words, each cell ij of the TD matrix computed with the CGRAPH method represents

the number of data/function/method references that exist from file i to file j.
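A minimal sketch of how such a matrix could be assembled is shown below. It assumes that a reference extractor has already produced one (source file, target file) pair per data, function or method reference; the function name `cgraph_matrix` and the file names are hypothetical, and the sketch does not reproduce the output format of C-REX.

```python
from collections import Counter


def cgraph_matrix(references):
    """Build a directed, weighted TD matrix as a sparse mapping
    (file_i, file_j) -> number of references from file_i to file_j."""
    td = Counter()
    for src, dst in references:
        if src != dst:           # keep only references that cross a file boundary
            td[(src, dst)] += 1  # directional: (A, B) is distinct from (B, A)
    return td


# Hypothetical extracted references
refs = [
    ("a.c", "b.c"), ("a.c", "b.c"),  # a.c uses b.c twice
    ("b.c", "a.c"),                  # b.c uses a.c once
    ("a.c", "a.c"),                  # internal reference, ignored
    ("c.c", "log.c"),
]
td = cgraph_matrix(refs)
```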

Figure 5 shows the evolution of the network density measure over each quarter.

The TD matrices have higher levels of density (avg=0.0311, min=0.0261, max=0.0322)

relative to those obtained using the FCT method5. In terms of the evolution of the

clustering coefficient measure, we see that the levels are also very stable over time, and

higher (avg=0.1862, min=0.1738, max=0.1909) than those reported for the TD matrices

created with the FCT method. The density of the TD matrices produced by the CGRAPH

method is significantly higher than the density of the matrices produced by the FCT method. This

difference could stem primarily from two characteristics of the source code of a system.

First, the CGRAPH method identifies numerous technical dependencies that involve files

that, once developed, are rarely modified. Cross-cutting concerns such as logging, tracing

and security are good examples. Commonly used low-level functionality such as memory

and thread management and basic storage types such as lists and queues are another

example. A second factor that might contribute to higher levels of density of the TD

matrices is the technical dependencies that exist with and between automatically

generated source code files. One such example is the source code for remote procedure

calls (RPCs). The FCT method would capture dependencies between caller and callee of

an RPC if there are changes to the RPC specification or functionality. On the other hand, the

CGRAPH method would capture the complete path of dependencies from the caller

through the RPC stubs, marshalling and communication code all the way to the callee.

Given the potential bias that these two factors could have in the computations of

dependencies, I removed them from the quarterly call graphs and recomputed the density

measures for each quarterly TD matrix. The results showed a reduction in the density

(avg=0.0289, min=0.0241, max=0.0299). However, the density levels remained

significantly higher than those for TD matrices created with the FCT method when

considering the 19-month window for development activity.

4 The set of files used in the analysis also included the automatically generated source code files from functionality such as remote procedure calls.

5 The maximum level of density of a TD matrix produced by the FCT method is 0.01151 if all 39 months of development activity are considered.

Figure 5: Evolution of the Density level of the TD matrices (CGRAPH method). [Plot not reproduced; x-axis: Quarter in the Dataset (Q1 to Q13); y-axis: Measure Level (0.000 to 0.250).]

We also examined the percentage of change in the information contained in a TD

matrix from quarter t relative to the TD matrix from quarter t-1. Figure 6 shows that the rate

of change is relatively low (avg=0.24%, min=0.1%, max=0.9%). Those rates of change

consider only whether the relationship between files exists or not. If we extend the idea of

change to also consider a modification in the weight of the relationship (e.g., number of

calls between files), the rate of change increases (avg=1.1%, min=0.4%, max=3%);

however, it remains relatively stable over time. This result is not particularly

surprising since significant changes in the overall syntactic dependency structure of a

system would imply major code refactoring efforts or architectural changes, events that

do not occur often. A similar pattern of stability was found in the TD matrices produced

by the FCT method when I accumulated the commit information from 19 consecutive

months. Thus, we could think of the volatility that the monthly TD matrices produced by

the FCT method showed as an indication of how the development work evolves over time

rather than of how the overall structure of the technical dependencies changes

over time. In sum, the CGRAPH method produces TD matrices that contain significantly

more product dependency information relative to those produced by the FCT method.

Moreover, a fraction of the product dependencies identified by the two methods

differed significantly.

Figure 6: Evolution of the Change in the Information Contained in the TD matrices (CGRAPH method). [Plot not reproduced; x-axis: Quarter in the Dataset (Q2 to Q13); y-axis: Percentage of Change (0% to 5%); series: Change in Number of Edges, Change in Edge Weights.]

Comparative Analysis of the Task Dependency Matrices

Although the analyses described above provide valuable information about the

various TD matrices, they do not tell us anything regarding the similarity in the sets of

technical dependencies identified by the FCT and CGRAPH methods. One of the

advantages of the FCT method is the potential to identify technical dependencies that

might not necessarily be captured by a simple syntactic dependency among modules of a

software system such as semantic dependencies (Gall et al, 1998). This argument

suggests that a comparison between the TD matrices generated by the two methods, FCT

and CGRAPH, might show differences, possibly significant. The first step of this analysis

was to compute the following two operations: TD(FCT) - TD(CGRAPH) and TD(CGRAPH) -

TD(FCT). These operations, which are equivalent to the set difference operation, allow us to

determine which dependencies identified by the FCT method are not identified

by the CGRAPH method and vice versa. The focus is to identify whether a relationship

between two modules exists in one matrix, in the other, or in both. Hence, I do not consider

the differences in the weight on the linkages. I compared quarterly TD(CGRAPH) matrices

against the TD(FCT) computed for a period of time of the 19 months prior to the end of the

quarter. For the first two quarters, I did not have 19 months' worth of past data to compute

the TD(FCT) matrices. Therefore, I used 13 months to construct the TD(FCT) that was compared

to the TD(CGRAPH) matrix from the first quarter, and 16 months in the case of the second

quarter comparison.
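The comparison can be sketched with ordinary set operations. Two points are assumptions of this sketch rather than details stated in the text: the direction of CGRAPH references is dropped so that both methods yield unordered file pairs, and each percentage is expressed relative to the number of dependencies found by the method on the left-hand side of the difference. The function names and file names are hypothetical.

```python
def undirected(pairs):
    """Reduce (possibly directed, weighted) links to unordered file pairs."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def compare(fct_pairs, cgraph_pairs):
    """Share of each method's dependencies that the other method misses."""
    fct, cgr = undirected(fct_pairs), undirected(cgraph_pairs)
    only_fct = fct - cgr  # TD(FCT) - TD(CGRAPH)
    only_cgr = cgr - fct  # TD(CGRAPH) - TD(FCT)
    return {
        "pct_fct_not_in_cgraph": 100 * len(only_fct) / len(fct) if fct else 0.0,
        "pct_cgraph_not_in_fct": 100 * len(only_cgr) / len(cgr) if cgr else 0.0,
    }


# Hypothetical dependency sets
fct_pairs = [("a", "b"), ("b", "c")]
cgraph_pairs = [("b", "a"), ("a", "d"), ("c", "d"), ("d", "c")]
result = compare(fct_pairs, cgraph_pairs)
# FCT finds {a-b, b-c}; CGRAPH finds {a-b, a-d, c-d}
# 1 of 2 FCT pairs and 2 of 3 CGRAPH pairs are unique to their method
```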

Figure 7 shows the comparison between the TD matrices. On average, 14.6% of the

dependencies in the TD matrix computed using the FCT method were not

identified by the CGRAPH method (min=12.4%, max=17.1%). As discussed earlier, the

TD matrices computed using the CGRAPH method are denser and that situation is clearly

reflected in this comparison. On average, the TD matrix computed using the CGRAPH

method had 74.3% of product dependencies that were not identified by the FCT method

(min=70.6%, max=79.2%).

Figure 7: Comparison between TD matrices generated by the FCT and CGRAPH methods. [Plot not reproduced; x-axis: Quarter in Dataset (Q1 to Q13); y-axis: Percentage of Non-identified Dependencies (0.0% to 100.0%); series: FCT-CGR, CGR-FCT.]

Comparative Analysis of the Coordination Requirement Matrices

As described in chapter 2, the Coordination Requirements matrix (CR) is a

function of two elements: the TA matrix and the TD matrix. Using the different methods

for identifying technical dependencies to construct TD matrices will result in different CR

matrices. Hence, we also need to examine the general properties of both types of CR

matrices. We use the data from the modification requests resolved in each month to

compute the TA matrix. In terms of computing the TD matrix, we use a 19-month moving

window in the case of the FCT method or the corresponding quarterly TD matrix in the

case of the CGRAPH method. Figure 8 shows the evolution of the density and clustering

coefficient measures for the CR matrices constructed based on the FCT method. We

observe that the density of the monthly CR matrices is low (avg=0.0655, min=0.0005,

max=0.1429) while the clustering coefficient measure shows relatively high levels

(avg=0.3179, min=0.0308, max=0.4331) suggesting an important degree of

interdependent clusters of files in the CR matrices.
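The way a CR matrix depends on the TA and TD matrices can be sketched as a matrix product. The sketch assumes that equation 1 of chapter 2 takes the form CR = TA x TD x TA^T, with TA a developers-by-files matrix and TD a files-by-files matrix; that form is not restated in this section, so it should be read as an assumption. The helper names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def coordination_requirements(ta, td):
    """Assumed form of equation 1: CR = TA x TD x TA^T."""
    return matmul(matmul(ta, td), transpose(ta))


# Hypothetical data: 2 developers, 3 files
ta = [[1, 0, 0],   # developer 0 changed file 0
      [0, 1, 1]]   # developer 1 changed files 1 and 2
td = [[0, 1, 0],   # files 0 and 1 depend on each other
      [1, 0, 0],
      [0, 0, 0]]
cr = coordination_requirements(ta, td)
print(cr)  # [[0, 1], [1, 0]]
```

In this toy case the two developers need to coordinate because they changed files that depend on each other. A denser TD matrix, such as one produced by the CGRAPH method, yields more non-zero cells in CR for the same TA matrix.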

Figure 8: Evolution of Density and Clustering level of the CR matrices (FCT method). [Plot not reproduced; x-axis ticks: 1 to 39; y-axis: Measure Level (0.000 to 0.500).]

Figure 9 shows the evolution of the density and clustering coefficient measures for

the CR matrices constructed based on the CGRAPH method. The clustering

coefficient values (avg=0.3979, min=0.0312, max=0.5402) are relatively similar to those

shown in Figure 8. On the other hand, the CR matrices created using the CGRAPH

method are significantly more dense (avg=0.1509, min=0.0009, max=0.2408) than those

created using the FCT method. In other words, CR matrices constructed with the

CGRAPH method would suggest significantly higher levels of coordination requirements for the

developers. It is therefore important to understand if the additional coordination needs are

indeed necessary. The question is addressed in chapter 5.

Figure 9: Evolution of Density and Clustering level of the CR matrices (CGRAPH method). [Plot not reproduced; x-axis ticks: 1 to 39; y-axis: Measure Level (0.000 to 0.600); series: Density, Clustering Coefficient.]

Chapters 5 and 6 present two empirical studies that use the dependency

identification techniques discussed in the previous paragraphs (FCT and CGRAPH) to

examine the mismatch between coordination needs and coordination activities and their

impact on two traditional outcome variables: development productivity and product

quality.

CHAPTER 5: DEPENDENCIES, CONGRUENCE AND THEIR

IMPACT ON DEVELOPMENT PRODUCTIVITY

Identifying work dependencies and determining the appropriate coordination

mechanisms to address the dependencies is not a trivial problem. Coordination is a

recurrent topic in the organizational theory literature and, as discussed in chapters 1 and

2, many stylized types of task dependencies and coordination mechanisms have been

proposed over the past several decades. These perspectives are useful in the context of

enduring structures. However, numerous types of work, for instance non-routine

knowledge-intensive activities such as software development, are potentially full of fine-

grain dependencies that might change on a daily or hourly basis. Conventional

coordination mechanisms like standard operating procedures or routines would have very

limited applicability in these dynamic contexts. Failure to identify the new needs for

coordination and information exchange might hinder the organization’s ability to adapt to

changes in their competitive environment (Henderson & Clark, 1990). The study reported

in this chapter represents the first step in the examination of how the gaps between

coordination needs and actual coordination activity impact outcome variables, such as

development productivity, in the context of software development activities.

Study I: Congruence and Development Productivity

Software development is populated with rapidly changing dependencies and this

attribute of software development tasks is a potential source of coordination problems

which impacts productivity. The analysis presented in this study focuses, first, on

exploring the dynamism in the coordination requirements and, second, on examining the

impact that coordination activity congruent with coordination needs has on development

performance.

Research Questions

When members of a team are physically collocated and coordination requirements

within the team change, there are numero
