DEPENDENCIES IN GEOGRAPHICALLY DISTRIBUTED
SOFTWARE DEVELOPMENT:
OVERCOMING THE LIMITS OF MODULARITY¹
Marcelo Cataldo
CMU-ISRI-07-120
December 2007
Institute for Software Research, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA
Thesis Committee
Kathleen M. Carley, Co-Chair
James D. Herbsleb, Co-Chair
Len J. Bass
David Redmiles
Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy
Copyright © 2007 Marcelo Cataldo
¹ This dissertation was supported by the National Science Foundation under Grant No. IIS-0414698, Grant No. IIS-0534656 and Grant No. IGERT 9972762, by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0011, by the Office of Naval Research (ONR N00014-06-1-0921) and by the Air Force Research Lab with Charles River Analytics SC060701.
Keywords: geographically distributed software development, collaborative software
development, coordination, software dependencies.
Dedicated to Pei-Chi and to my parents, Antonio and Mirta
ACKNOWLEDGEMENTS
I have been very fortunate to work with an outstanding dissertation committee in
Kathleen Carley, Jim Herbsleb, Len Bass and David Redmiles. I am particularly indebted
to Kathleen and Jim for being the best advisors a student could hope for. I would also like
to thank my family for their patience and encouragement, especially my wife Pei-Chi,
without whom my life as a doctoral student would have been a lot less enjoyable.
Throughout this process, many others helped shape my views and research.
Special thanks go to Matthew Bass, Audris Mockus, Jeffrey Reminga, Jeffrey Roberts
and Patrick Wagstrom.
ABSTRACT
Geographically distributed software development (GDSD) is becoming pervasive.
Hence, constraints on communication and their negative impact on developers’ ability to
coordinate effectively are a growing problem that consistently results in sub-par
performance of GDSD teams. Past research argues that geographically distributed teams
do better when their work is almost independent from each other. In software
engineering, modularization is the traditional technique intended to reduce the
interdependencies among the modules that constitute a system. The modular design
argument suggests that by reducing the technical dependencies, the work dependencies
between teams developing interdependent modules are also reduced. Consequently, a
modular product structure leads to an equivalent modular task structure. This dissertation
argues that modularization is not a sufficient representation of work dependencies in the
context of software development and it proposes a method for measuring socio-technical
congruence, defined as the relationship between the structure of work dependencies and
the coordination patterns of the organization doing the technical work. Two empirical
studies assessed the impact of socio-technical congruence on development productivity
and product quality. In addition, a third empirical study explores how developers in a
geographically distributed software development organization evolve their coordination
patterns to overcome the limitations of the modular design approach.
Collectively, this dissertation makes important contributions to the software engineering,
CSCW and organizational literatures. First, the empirical evaluation of the congruence
framework showed the importance of understanding the dynamic nature of software
development. Identifying the “right” set of product dependencies that determine the
relevant work dependencies and coordinating accordingly has a significant impact on
reducing the resolution time of modification requests. The analyses showed that traditional
software dependencies, such as syntactic relationships, tend to capture a relatively stable
view of product dependencies that is not representative of the dynamism in product
dependencies that emerges as software systems are implemented. On the other hand,
logical dependencies provide a more accurate representation of the most relevant product
dependencies in software development projects. Secondly, this dissertation moves
forward our understanding of the relationship between product and work dependencies
and software quality. Logical dependencies among software modules and work
dependencies were found to be two very significant factors affecting the failure proneness
of software modules. Finally, the longitudinal analysis of coordination activities in a
GDSD project showed that developers centrally positioned in the social system of
information exchanges and coordination activities performed a critical bridging function
across formal teams and geographical locations. Moreover, those same individuals
contributed an average of 57% of development effort in terms of implementing the
software system in each release covered by the data.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
    THE NATURE OF SOFTWARE DEVELOPMENT AND MODULAR DESIGN
    THE NATURE OF SOFTWARE DEVELOPMENT AND INTERDEPENDENCY THEORIES
    RESEARCH QUESTIONS
CHAPTER 2: A FRAMEWORK FOR IDENTIFICATION OF WORK DEPENDENCIES
    THE CONCEPT OF SOCIO-TECHNICAL CONGRUENCE
    IDENTIFICATION OF COORDINATION REQUIREMENTS
    MEASURING SOCIO-TECHNICAL CONGRUENCE
CHAPTER 3: TERMINOLOGY AND DESCRIPTION OF THE DATASETS
    TERMINOLOGY
    DATASETS
        Project A
        Project B
        Project C
        Project D
CHAPTER 4: METHODS FOR IDENTIFYING WORK DEPENDENCIES IN SOFTWARE DEVELOPMENT PROJECTS
    TWO APPROACHES TO DETERMINE PRODUCT DEPENDENCIES IN SOFTWARE SYSTEMS
        General Properties and Evolution of the FCT Task Dependency Matrix
        General Properties and Evolution of the CGRAPH Task Dependency Matrix
    COMPARATIVE ANALYSIS OF THE TASK DEPENDENCY MATRICES
    COMPARATIVE ANALYSIS OF THE COORDINATION REQUIREMENT MATRICES
CHAPTER 5: DEPENDENCIES, CONGRUENCE AND THEIR IMPACT ON DEVELOPMENT PRODUCTIVITY
    STUDY I: CONGRUENCE AND DEVELOPMENT PRODUCTIVITY
        Research Questions
        Method
            Description of the Measures
            Description of the Model and Preliminary Analysis
        Results
            The Evolution of Coordination Requirements
            The Impact of Congruence on Resolution Time of MRs
            The Evolution of Congruence over Time
        Discussion
CHAPTER 6: DEPENDENCIES, CONGRUENCE AND THEIR IMPACT ON SOFTWARE QUALITY
    STUDY II: THE STRUCTURE OF DEPENDENCIES, CONGRUENCE AND PRODUCT QUALITY
        Research Questions
        Method
            Description of the Data and Measures
        Results
            The Impact of Dependencies
            Stability Analysis
            Checks for Random Temporal Effects
        Discussion
CHAPTER 7: THE EVOLUTION OF COORDINATION BEHAVIOR
    STUDY III: THE EVOLUTION OF COORDINATION BEHAVIOR
        Research Questions
        Method
            Description of the Data
            Description of Measures
        Results
            General Patterns of Coordination Behavior
            On the Relationship between Network Position and Productivity
            Stability of Coordination Patterns
            Drivers of Coordination Patterns
        Discussion
CHAPTER 8: APPLICATIONS
    APPLICATIONS FOR SOFTWARE DEVELOPERS
        Enhancing coordination needs awareness
        Enhancing awareness of product dependencies
        Other applications of the congruence framework
    MANAGERIAL APPLICATIONS
        Project-wide view of coordination patterns
        Identifying critical software and organizational agents and units
CHAPTER 9: CONCLUSIONS
    CONTRIBUTIONS
    LIMITATIONS
    FUTURE WORK
        Identification of coordination requirements in early stages of software projects
        The impact of formal roles in development organizations
        Communication beyond team and location boundaries and individual-level performance
        Applying the congruence framework in other types of tasks
REFERENCES
APPENDIX A: SURVEY FOR PROJECT D
LIST OF TABLES
TABLE 1: DESCRIPTIVE STATISTICS FOR DEPENDENT AND CONTROL VARIABLES
TABLE 2: DESCRIPTIVE STATISTICS FOR CONGRUENCE MEASURES (FCT METHOD)
TABLE 3: DESCRIPTIVE STATISTICS FOR CONGRUENCE MEASURES (CGRAPH METHOD)
TABLE 4: PAIR-WISE CORRELATIONS
TABLE 5: RESULTS FROM OLS REGRESSION OF EFFECTS ON RESOLUTION TIME (FCT METHOD)
TABLE 6: RESULTS FROM OLS REGRESSION OF EFFECTS ON RESOLUTION TIME (CGRAPH METHOD)
TABLE 7: EFFECT OF TIME ON CONGRUENCE
TABLE 8: DIFFERENCES BETWEEN DEVELOPERS’ POPULATION
TABLE 9: DESCRIPTIVE STATISTICS FOR LAST RELEASE OF PROJECT A
TABLE 10: DESCRIPTIVE STATISTICS FOR LAST RELEASE OF PROJECT C
TABLE 11: PAIR-WISE CORRELATIONS FOR LAST RELEASE OF PROJECT A (* P < 0.01)
TABLE 12: PAIR-WISE CORRELATIONS FOR LAST RELEASE OF PROJECT C (* P < 0.01)
TABLE 13: BASELINE MODEL FOR FAILURE PRONENESS
TABLE 14: IMPACT OF SYNTACTIC DEPENDENCIES ON FAILURE PRONENESS
TABLE 15: IMPACT OF LOGICAL DEPENDENCIES ON FAILURE PRONENESS
TABLE 16: IMPACT OF WORKFLOW DEPENDENCIES ON FAILURE PRONENESS
TABLE 17: IMPACT OF COORDINATION REQUIREMENTS ON FAILURE PRONENESS
TABLE 18: IMPACT OF CONGRUENCE ON FAILURE PRONENESS
TABLE 19: IMPACT OF TECHNICAL DEPENDENCIES, WORK DEPENDENCIES AND CONGRUENCE ACROSS RELEASES IN PROJECT A
TABLE 20: IMPACT OF TECHNICAL DEPENDENCIES, WORK DEPENDENCIES AND CONGRUENCE ACROSS RELEASES IN PROJECT C
TABLE 21: RANDOM-EFFECTS MODEL OF FAILURE PRONENESS
TABLE 22: DESCRIPTIVE STATISTICS FOR IRC DATASET
TABLE 23: RESULTS OF THE MULTI-LEVEL REGRESSION MODEL USING THE IRC DATA
TABLE 24: RESULTS OF THE MULTI-LEVEL REGRESSION MODEL USING THE MR DATA
TABLE 25: RESULTS FROM MULTI-LEVEL REGRESSION MODEL USING PROJECT D DATA
TABLE 26: STABILITY OF THE COORDINATION NETWORKS
TABLE 27: PREDICTING COORDINATION ACTIVITIES
LIST OF FIGURES
FIGURE 1: THE CONCEPT OF CONGRUENCE
FIGURE 2: EVOLUTION OF THE DENSITY AND CLUSTERING LEVEL OF THE TD MATRICES (FCT METHOD)
FIGURE 3: EVOLUTION OF THE CHANGE IN THE INFORMATION CONTAINED IN THE TD MATRICES (FCT METHOD)
FIGURE 4: AVERAGE CUMULATIVE DENSITY OF THE TD MATRIX (FCT METHOD)
FIGURE 5: EVOLUTION OF THE DENSITY LEVEL OF THE TD MATRICES (CGRAPH METHOD)
FIGURE 6: EVOLUTION OF THE CHANGE IN THE INFORMATION CONTAINED IN THE TD MATRICES (CGRAPH METHOD)
FIGURE 7: COMPARISON BETWEEN TD MATRICES GENERATED BY THE FCT AND CGRAPH METHODS
FIGURE 8: EVOLUTION OF DENSITY AND CLUSTERING LEVEL OF THE CR MATRICES (FCT METHOD)
FIGURE 9: EVOLUTION OF DENSITY AND CLUSTERING LEVEL OF THE CR MATRICES (CGRAPH METHOD)
FIGURE 10: THE EVOLUTION OF COORDINATION REQUIREMENTS ON A MONTHLY BASIS
FIGURE 11: THE EVOLUTION OF COORDINATION REQUIREMENTS IN OPEN SOURCE PROJECTS
FIGURE 12: EVOLUTION OF THE CONGRUENCE MEASURES ACROSS RELEASES
FIGURE 13: PROPORTION OF CHANGES PER DEVELOPER PER RELEASE
FIGURE 14: CONGRUENCE MEASURES ACROSS RELEASES BASED ON TOP CONTRIBUTORS INTERACTIONS
FIGURE 15: CONGRUENCE MEASURES ACROSS RELEASES FOR THE REST OF THE DEVELOPERS
FIGURE 16: OVER TIME COORDINATION PATTERNS FROM THE MR SYSTEM DATA
FIGURE 17: OVER TIME COORDINATION PATTERNS FROM THE IRC DATA
FIGURE 18: COORDINATION PATTERNS ACROSS FORMAL TEAMS AND GEOGRAPHICAL LOCATIONS
FIGURE 19: LOCATION X NETWORK POSITION INTERACTION EFFECT
FIGURE 20: COORDINATION PATTERNS AND PRODUCTIVITY
FIGURE 21: THE SIZE OF THE CORE GROUP OVER TIME AND TOP PERFORMERS MEMBERSHIP
FIGURE 22: COMPOSITION OF THE CORE GROUP OVER TIME BY PRODUCTIVITY LEVELS
FIGURE 23: AMOUNT OF CHANGE IN DYADS CONNECTIONS
CHAPTER 1: INTRODUCTION
Over the past couple of decades, geographically distributed work has become
pervasive and software development organizations are no exception. Factors such as
access to talent, acquisitions and the need to reduce the time-to-market of new products
are the driving forces for the increasing number of geographically distributed software
development (GDSD) projects (Herbsleb & Moitra, 2001; Karolak, 1998). Unfortunately,
this new trend has its costs. Distance leads to numerous problems in communication and
coordination, and ultimately, impacts the performance of software development teams
(Herbsleb et al, 2000; Herbsleb & Mockus, 2003). The failure to identify work
dependencies among developers or development teams results in coordination problems.
A growing body of work on coordination in software development suggests that the
identification and the management of dependencies is a fundamental challenge in
software development organizations, particularly in those that are geographically
distributed (some examples are: Cataldo et al, 2007; de Souza, 2005; Grinter et al, 1999;
Herbsleb et al, 2000; Herbsleb & Mockus, 2003). The modular product design literature
has developed an important body of research on interdependency, for instance, the work
on design structure matrices to find alternative structures that reduce dependencies
among the various components of the system (Eppinger et al, 1994; Sullivan et al, 2001).
Interdependency is central to organizations and it has also been a perennial research topic
in organizational theory (DeSanctis et al, 1999; Staudenmayer, 1997). Those research
streams could inform the design of software development organizations so they are better
able to identify and manage work dependencies. However, we first need to understand
the assumptions of the different theoretical views and how those assumptions relate to the
characteristics of software development tasks.
The Nature of Software Development and Modular Design
The idea of dividing a complex task into smaller manageable units is consistent
with the reductionist view (Simon, 1962; von Hippel, 1990) which is well developed in
the product development literature (Eppinger et al, 1994). Projects, typically, have a
general description of the system’s components and their relationships or a more detailed
report such as an architectural or high-level design document. Managers use the information
in those documents to divide the development effort into work items that are assigned to
specific development teams minimizing the interdependencies among those teams
(Conway, 1968; Eppinger et al, 1994; Sullivan et al, 2001). In the system design
literature, it has long been speculated that the structure of a product inevitably resembles
the structure of the organization that designs it (Conway, 1968). In Conway’s original
formulation, he reasoned that coordinating product design decisions requires
communication among the engineers making those decisions. If everyone needs to talk to
everyone, the communication overhead does not scale well for projects of any size.
Therefore, products must be split into components, with limited technical dependencies²
among them, and each component assigned to a single team. Conway (1968) proposed
that the component structure and organizational structure stand in a homomorphic
relation, in that more than one component can be assigned to a team, but a component
must be assigned to a single team.
² The terms “technical dependency” and “product dependency” are used interchangeably throughout this dissertation.
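Conway’s reasoning above can be made concrete with a small sketch. The Python below is illustrative only and is not taken from this dissertation: the function names, component names and team names are all hypothetical. It counts the communication paths needed when each of n engineers must talk to every other engineer, n(n-1)/2, and it models the component-to-team assignment as a mapping in which every component has exactly one owning team while a team may own several components.

```python
def pairwise_channels(n_engineers: int) -> int:
    """Distinct pairs of engineers when everyone must talk to everyone."""
    return n_engineers * (n_engineers - 1) // 2


def cross_team_dependencies(dependencies, owner):
    """Technical dependencies whose two components belong to different teams."""
    return [(a, b) for a, b in dependencies if owner[a] != owner[b]]


# Hypothetical assignment: a dict gives each component exactly one team,
# while one team (team_1) may own several components.
owner = {
    "lexer": "team_1",
    "parser": "team_1",
    "codegen": "team_2",
    "runtime": "team_3",
}

# Hypothetical technical dependencies: (component, component it depends on).
dependencies = [("parser", "lexer"), ("codegen", "parser"), ("runtime", "codegen")]

if __name__ == "__main__":
    print(pairwise_channels(5), pairwise_channels(50))
    print(cross_team_dependencies(dependencies, owner))
```

Under the modular design argument, only the pairs returned by `cross_team_dependencies` would call for communication across team boundaries. Whether the technical dependencies fed into such a computation capture all the relevant work dependencies is the question the rest of this chapter raises.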
A similar argument has been proposed in the strategic management literature.
Baldwin and Clark (2000, page 90) argued that modularization makes complexity
manageable, enables parallel work and tolerates uncertainty. Because design decisions are
hidden within modules that communicate through standard interfaces,
modularization adds value by allowing independent experimentation with modules and
their substitution (Baldwin & Clark, 2000). Moreover, Baldwin and Clark (2000, page 89)
argued that a modular design structure leads to an equivalent modular task structure.
Then, their view aligns with Conway’s idea that one or more modules can be assigned to
one organizational unit and work can be conducted almost independently of others. In the
context of software engineering, a similar approach was first articulated by Parnas (1972)
as modular software design. Parnas (1972) argued that modules ought to be considered
work items instead of just a collection of subprograms. Then, development work can
continue independently and in parallel across different modules. Parnas’ views also
coincide with the theoretical arguments from product design and strategic management
literatures.
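As an illustration of the information-hiding idea described above, the following Python sketch shows two modules that hide different design decisions behind the same interface, so that a client module can be developed against the interface alone. It is a minimal sketch under assumed names: `SymbolTable`, `DictSymbolTable`, `ListSymbolTable` and `total` are hypothetical and do not come from Parnas (1972) or from this dissertation.

```python
from typing import Dict, List, Protocol, Tuple


class SymbolTable(Protocol):
    """The interface that other modules are allowed to depend on."""

    def put(self, name: str, value: int) -> None: ...

    def get(self, name: str) -> int: ...


class DictSymbolTable:
    """One hidden design decision: store entries in a hash map."""

    def __init__(self) -> None:
        self._entries: Dict[str, int] = {}

    def put(self, name: str, value: int) -> None:
        self._entries[name] = value

    def get(self, name: str) -> int:
        return self._entries[name]


class ListSymbolTable:
    """An alternative hidden decision: store entries in an append-only list."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, int]] = []

    def put(self, name: str, value: int) -> None:
        self._entries.append((name, value))

    def get(self, name: str) -> int:
        # Scan from the newest entry so that later puts override earlier ones.
        for entry_name, value in reversed(self._entries):
            if entry_name == name:
                return value
        raise KeyError(name)


def total(table: SymbolTable, names: List[str]) -> int:
    """A client module that relies only on the interface."""
    return sum(table.get(name) for name in names)
```

Either implementation can be substituted without touching `total`, which is the sense in which work on the modules could proceed independently and in parallel. The sketch covers only the syntactic contract; it says nothing about assumptions that the interface does not state.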
All three theoretical views rely on two interrelated assumptions. The authors
assume a simple and obvious relationship between product modularization and task
modularization. Hence, the modularization theories argue, reducing the technical
interdependencies among modules reduces task interdependencies, which, consequently,
reduces the need for communication among work groups. Unfortunately, there are several
problems with these assumptions. First, existing software modularization approaches
only use a subset of the technical dependencies, typically syntactic relationships, of a
software system (Garcia et al, 2007). Then, potentially relevant work dependencies might
be ignored. Secondly, recent empirical evidence indicates that the relationship between
product structure and task structure is not as simple as previously assumed. Moreover, the
theorized similarity between product and task structures diminishes over time (Cataldo et
al, 2006).
Thirdly, promoting minimal communication between teams responsible for
interdependent modules is problematic. The computer-mediated communication literature
suggests that loosely coupling tasks is the appropriate approach when teams are
geographically distributed (Olson & Olson, 2000). However, recent studies suggest that
minimal communication between teams, collocated or distributed, is detrimental to the
success of projects. The product development literature argues that information hiding,
which leads to minimal communication between teams, is an inevitable antecedent of
variability in the evolution of projects resulting, typically, in integration problems
(Yassine et al, 2003). In the context of software development, de Souza and colleagues
(2004) found that information hiding led development teams to be unaware of other
teams’ work, resulting in coordination problems. Grinter and colleagues (1999) reported
similar findings for geographically distributed software development projects. The
authors highlighted that the main consequence of reducing the teams’ need to
communicate was to increase costs because problems were discovered too late in the
development process. Those findings do not suggest that modularization is not useful.
They highlight the need to supplement it with coordination mechanisms to allow
developers to deal correctly with the assumptions that are not captured in the
specification of the dependencies.
Another problem associated with the assumptions of modular design is the nature
and stability of the interfaces between software modules. Although the program
dependency literature defines technical dependencies as a syntactic or semantic
relationship between statements (Podgurski & Clarke, 1990), the same ideas are applied
at the level of modules. Then, relationships among modules could also range from
syntactic, for instance a function call from module A to module B, to more complex
semantic dependencies where, for example, the computations done in one module affect
the behavior of another module. Some authors refer to those types of semantic
dependencies as dynamic (Bass et al, 2003) or logical (Gall et al, 1998). Even in the
simple case of a function call between two modules, the complexity and the degree of
dependency varies, for instance, if we consider the number of parameters of a function
call or we compare parameters passed by value versus parameters passed by reference.
Cataldo et al (2007) presented case studies where even simple interfaces between
modules developed by remote teams created coordination breakdowns and integration
problems. The authors reported that semantic dependencies were even more problematic
and they argued that the developers’ ability to identify and manage dependencies was
hindered by several inter-related factors such as development processes, organizational
attributes (e.g. structure, management style) and uncertainty of the interfaces. In a field
study of a large software project, de Souza (2005) found that interfaces tended to
change often and their design details tended to be incomplete, leading to serious
integration problems. These findings suggest that the interfaces between software modules
might differ in complexity and, often, it is not possible to specify those interfaces at the
necessary level of detail, increasing the likelihood of future changes to them. This lack of
stability represents a constant challenge for software development organizations.
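The distinction drawn above between syntactic and semantic dependencies can be sketched in a few lines of Python. The example is hypothetical and not drawn from the projects studied in this dissertation; all function names are assumptions. Python passes object references, so the contrast between an immutable and a mutable argument is used here only as an approximation of passing by value versus passing by reference.

```python
def doubled(x: int) -> int:
    """The argument is an immutable int, so the caller's variable cannot change."""
    return x * 2


def double_in_place(values: list) -> None:
    """The argument is a mutable list, so the caller observes the modification."""
    for i in range(len(values)):
        values[i] *= 2


# Semantic dependency: neither function below calls the other, yet the
# consumer silently relies on the producer storing the value in seconds.
def producer_record(store: dict, elapsed_seconds: float) -> None:
    store["elapsed"] = elapsed_seconds


def consumer_timed_out(store: dict, limit_seconds: float) -> bool:
    return store["elapsed"] > limit_seconds
```

A tool that inspects only function calls would see a dependency on `doubled` or `double_in_place` wherever they are invoked, but no link between `producer_record` and `consumer_timed_out`. If the team owning the producer switched to storing milliseconds, the consumer would start reporting timeouts even though no interface signature had changed.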
In sum, the modularization approach is a very useful tool for dividing the
development of a complex software system into manageable units. However,
modularization is not a sufficient representation of work dependencies in software
development activities. The relationship between the task dependency structure and the
product structure is not as simple as theorized. Appropriate mechanisms are then required
to identify relevant work dependencies and, consequently, maintain suitable levels of
communication and coordination among teams developing interdependent modules,
particularly, in the case of geographically distributed software development.
The Nature of Software Development and Interdependency Theories
Coordination is a central concept in organizations. The idea of division of labor
into interdependent units is well developed, and mechanisms for coping with varying
degrees of interdependency have been proposed in the traditional organizational literature
(for instance, March & Simon, 1958; Thompson, 1967; Galbraith, 1973; Staudenmayer,
1997). More recent work, particularly in organizational design, has focused on
computational and mathematical approaches to examine how organizational designs that
use different models of communication and coordination are affected by factors such as
stress, task decomposition, quality of information exchanged, and ability to adapt (for
instance, Carley and Lin, 1995, 1997; Handley & Levis, 2001; Perdu & Levis, 1998).
Thus, both streams of work, traditional organizational theory and computational and
mathematical organizational theory (CMOT), are relevant to the problem of coordination
in software development projects.
In the traditional organizational theory, March and Simon (1958) argued that
coordination encompasses more than just a traditional division of labor and assignment of
tasks. The authors proposed numerous mechanisms, such as the division of the task into
nearly independent parts, and they also argued that schedules and feedback mechanisms
are required when interdependence is unavoidable. Thompson (1967) extended March
and Simon’s work by matching three mechanisms (standardization, plan, and mutual
adjustment) to stylized categories of dependencies (pooled, sequential, and
reciprocal). Galbraith (1973) argued that low levels of interdependency can be managed
by traditional mechanisms such as rules and programs. However, as the level of
interdependency increases, additional mechanisms such as slack resources
and lateral communication are required (Galbraith, 1973). Mintzberg (1979) took an
organizational-level perspective and argued that specific coordination mechanisms are properties of
particular kinds of organizations and environments. Crowston (1991) developed a
typology of coordination problems to catalog coordination mechanisms that address
specific types of interdependencies. Staudenmayer (1997) grouped the contributions of
March and Simon, Thompson, and others into the information processing theories of
interdependency which, she argued, rely on the assumptions of determinism and stability.
In other words, those theoretical views focus on predictable and static tasks
(Staudenmayer, 1997). This limitation of the information processing argument is not
problematic if software development tasks can be identified a priori and the set of
interdependencies that arise from the division of labor are managed with the appropriate
set of mechanisms. If we think in terms of project management activities, coarse-grain
development activities such as “develop component A” or “implement feature X” can
typically be identified at relatively early stages of a project. Some dependencies among
those development tasks are typically easy to identify. For instance, particular work items
need to be finished before other work items can start. Work items that can only be
assigned to specific teams because of the skill set required would represent another
example. Then, specific organizational forms can be used to manage the dependencies
among those coarse-grain development tasks (Malone & Crowston, 1991), even in the
case of geographically distributed development organizations (Grinter et al, 1999).
Unfortunately, there are several characteristics of software development activities
that limit the applicability of traditional organizational theories as well as the more recent
CMOT work. First, it is widely accepted among software engineering researchers and
practitioners that the requirements of the system become known over time or those
requirements change as time progresses (Leffingwell & Widrig, 2003). In some cases the
changes in the requirements result in minor alterations of specific development tasks. In
other cases, new features have to be added or features under development are eliminated.
These events introduce a certain level of dynamism in software development that
challenges the determinism and stability assumptions of the information processing views
of interdependency.
Secondly, the dynamic nature of finer-grain dependencies that arise as part of the
development of a piece of code is not well suited for traditional organizational theories of
coordination. The act of developing a software system consists of a collection of design
decisions, either at the architectural level or at the implementation level. Those design
decisions introduce constraints that might establish new dependencies among the various
parts of the system, modify existing ones or even eliminate dependencies. The changes in
dependencies can generate new coordination requirements that are quite difficult to
identify a priori, particularly when they are not obvious, or as a project matures over time
(Henderson & Clark, 1990; Sosa et al, 2004). Failure to discover the changes in
coordination needs might have a profound impact on the quality of the product (Curtis et
al, 1988), on productivity (Herbsleb & Mockus, 2003) and even on the projects’ overall
design (Bass et al, 2006). In addition, little is known about the specific impact of the
various types of dependencies that arise among parts of a software system such as explicit
versus implicit dependencies or syntactic versus logical dependencies. Thus, the use of
the computational and mathematical organizational theory approaches is limited because
of the lack of a theoretical framework that guides the modeling of the relationships
between the organizational tasks, their dependencies and the need to communicate and
coordinate.
In sum, software development tasks are embedded in an evolving network of
coordination requirements that need to be satisfied. The coarse-grain and idealized
approaches suggested by the organization theory literature are not appropriate to identify
and manage such a dynamic web of interdependencies. A finer-grain view of
coordination would provide a better framework in dynamic knowledge-intensive tasks
such as software development.
Research Questions
In the previous sections, I highlighted the limitations of the current mechanisms
for identifying and managing dependencies in geographically distributed software
development organizations. Product modularization does not necessarily yield an
equivalent task modularization structure and additional mechanisms are required to
maintain appropriate levels of coordination among workgroups. The nature of software
development, such as the attributes and stability of interfaces among modules and the
dynamics of technical dependencies, limits the applicability of established task
decomposability and coordination approaches. Moreover, these characteristics are a
constant challenge for software development organizations, particularly for those that are
geographically distributed. This dissertation addresses the problem of work dependencies
in software development by examining how to use technical dependencies to determine
work dependencies and by investigating the impact of those work dependencies in the
development process. Specifically, I address the following general research questions:
RQ 1: How can relevant task dependencies be identified from technical
dependencies?
RQ 2: What is the impact of those task dependencies on traditional outcome
variables such as productivity and quality?
The rest of this document is organized as follows. Chapter 2 presents a framework
for identifying and managing dependencies. Chapter 3 introduces terminology used in
this dissertation and describes the various datasets used in the empirical studies. In
chapter 4, I examine different methods of identifying work dependencies from technical
dependencies. Chapter 5 presents the first empirical study that examines the impact on
development productivity of the mismatches between coordination requirements and
coordination behavior. In chapter 6, I study the impact of the structure of technical and
work dependencies on software quality. The last empirical study, which explores the
use of the proposed framework for examining the relationship between coordination
behavior and developer-level performance, is described in chapter 7. Chapter 8 describes
developer and managerial applications of the results reported in this dissertation. Finally,
chapter 9 describes the contributions of this research endeavor, its limitations as well as
future research directions.
CHAPTER 2: A FRAMEWORK FOR IDENTIFICATION OF WORK
DEPENDENCIES
It has long been observed that organizations carry out complex tasks by dividing
them into smaller interdependent work units assigned to groups and coordination arises as
a response to those interdependent activities (March & Simon, 1958). Communication
channels emerge in the formal and informal organizations. Over time, those information
conduits develop around the interactions that are most critical to the organization’s main
task (Galbraith, 1973). This is particularly important in product development
organizations which organize themselves around their products’ architectures because the
main components of their products define the organization’s key subtasks (von Hippel,
1990). Organizations also develop filters that identify the most relevant information
pertinent to the task at hand (Daft & Weick, 1990). Changes in task dependencies,
however, jeopardize the appropriateness of the information flows and filters and can
disrupt the organization’s ability to coordinate effectively. For example, Henderson &
Clark (1990) found that minor changes in product architecture can generate substantial
changes in task dependencies, and can have drastic consequences for the organizations’
ability to coordinate work. If effective ways of identifying detailed work dependencies
and tracking their changes over time existed, we would be in a much better position to
design mechanisms that could help to align information flow with work dependencies.
Identifying work dependencies and determining the appropriate coordination
mechanism to address the dependencies is not a trivial problem. Coordination is a
recurrent topic in the organizational theory literature and many stylized types of task
dependencies and coordination mechanisms have been proposed over the past several
decades (Crowston, 1991; Galbraith, 1973; Malone & Crowston, 1994; March & Simon,
1958; Mintzberg, 1979; Thompson, 1967). However, numerous types of work, in
particular non-routine knowledge-intensive activities, are potentially full of fine-grain
dependencies that might change on a daily or hourly basis. Conventional coordination
mechanisms like standard operating procedures or routines would have very limited
applicability in these dynamic contexts. Therefore, designing mechanisms to handle
rapidly shifting coordination needs requires a more fine-grained level of analysis than
what the traditional views of coordination provide.
In the context of software development, a technical dependency in the software
system represents a coordination need that relevant software developers might need to
address. Ignoring coordination requirements could lead to an increased number
of defects, problems in integration and longer development time (Curtis et al, 1988;
Espinosa et al, 2002; Kraut et al, 1995; Herbsleb & Mockus, 2003). When members of a
team are physically collocated and coordination requirements involve individuals from
the same team, there are numerous ways for team members to identify the needs to
coordinate and act on them such as group and status meetings and managerial
intervention. The problem of identifying the need to coordinate is further complicated
when coordination requirements change rapidly (Cataldo et al, 2006). In this chapter, I
present a framework to determine the coordination requirements among developers. The
objective of the framework is two-fold. The first objective is to provide a fine-grain level of analysis of
coordination. The second objective is to allow for identification of work dependencies
from alternative representations of technical dependencies of the system. I also propose a
measure of “fit” between work dependencies and the coordination activities performed by
the software developers.
The Concept of Socio-Technical Congruence
Product development endeavors involve two fundamental elements: a technical
and a social component. The technical properties of the product to develop, the processes,
the tasks, and the technology employed in the development effort constitute the technical
component. The second element is composed of the individuals in the organization who are involved
in the development process, together with their attitudes and behaviors. In other words, a product
development project can be thought of as a socio-technical system where the two
components, the technical and the social elements, need to be aligned in order to have a
successful project. A key issue, then, is to understand how we can examine the
relationship between those two dimensions, the technical and the social. Two lines of
work are particularly relevant in this context. First, the concept of “fit” from
organizational literature refers to the match between a particular organizational design
and the organization’s ability to carry out a task (Burton & Obel, 1998). The work in this
line of research has, traditionally, focused on two factors: the temporal dependencies
among tasks that are assigned to organizational groups and the formal organizational
structure as a means of communication and coordination (Carley & Ren, 2001; Levchuk
et al, 2004). Secondly, the research on dynamic analysis of social networks provides an
innovative approach, called the meta-matrix, to examine the dynamic co-evolution of
relationships among multiple types of entities such as resources, tasks, and individuals
(Carley, 2002; Krackhardt & Carley, 1998). The concept of socio-technical congruence
presented in this chapter builds on the idea of “fit” from the organizational theory
literature and, from a mathematical standpoint, builds on the meta-matrix model from the
dynamic network analysis literature. Combining those two lines of research allows for
two important contributions to the literature. First, the socio-technical congruence
framework presented here provides a fine-grain level of analysis. Secondly, the measure
facilitates assessing the role of coordination activities in multiple and complementary
ways as well as examining the impact of several types of dependencies.
Figure 1 presents an intuitive representation of the measure of congruence
formally defined later in this chapter. A group of workers has a set of work
dependencies, which defines a set of coordination requirements. When the coordination
activities carried out by those workers define a pattern of coordination similar to that
defined by the coordination requirements (case A in Figure 1), we have high levels of
congruence or “good fit”. If the patterns of coordination requirements and coordination
activities do not match, we have low levels of congruence or a “poor fit” (case B in
Figure 1).
Formally, socio-technical congruence is defined as the match between the
coordination requirements established by the dependencies among tasks and the actual
coordination activities carried out by the workers. In other words, the concept of
congruence has two components, coordination needs and coordination activities, and the
following sections discuss the mathematical framework to measure them.
Figure 1: The Concept of Congruence
Identification of Coordination Requirements
In order to identify which set of individuals should be coordinating their
activities, we need to represent two sets of relationships. One set is represented by which
individuals are working on which tasks. The relationships or dependencies among tasks
represent the second element. Past research has used a matrix formalization to capture
and relate those two pieces of information. For instance, Carley and Ren (2001) proposed
a metric, called resource congruence, to measure the relationship between the resources
required to perform a task and workers’ access to those resources. The same metric was
further examined by Carley and colleagues (2003) in the context of covert networks.
In the framework proposed in this chapter, the assignment of individuals to
particular work items is represented by a people-by-task matrix where a one in cell ij
indicates that worker i is assigned to task j. I will refer to this matrix as Task Assignments
(TA). Following the same approach, the set of dependencies among tasks can be
represented as a square matrix where a nonzero cell ij (or cell ji) indicates that task i and task j are
interdependent. I will refer to this matrix as Task Dependencies (TD). Now, if the Task
Assignments and Task Dependencies matrices are multiplied, a people-by-task matrix is
obtained that represents the set of tasks a particular worker should be aware of, given the
work items the person is responsible for and the dependencies of those work items with
other tasks. Finally, a representation of the coordination requirements among the
different workers is obtained by multiplying the product of the Task Assignments and Task
Dependencies matrices by the transpose of the Task Assignments matrix. This product
results in a people-by-people matrix where a cell ij (or cell ji) indicates the extent to
which person i works on tasks that share dependencies with the tasks worked on by
person j. In other words, the resulting matrix represents the Coordination Requirements
or the extent to which each pair of people needs to coordinate their work. Formally, the
Coordination Requirements matrix is determined by the following product:
$CR = TA \times TD \times TA^{T}$ (Equation 1)
where TA is the Task Assignments matrix, TD is the Task Dependencies matrix and $TA^{T}$
is the transpose of the Task Assignments matrix.
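The matrix product in Equation 1 can be illustrated with a minimal sketch. The three workers, four tasks, and the single dependency between task 0 and task 2 below are hypothetical values chosen for illustration, and the helper names (`matmul`, `transpose`, `coordination_requirements`) are assumptions of this sketch rather than part of the framework.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def coordination_requirements(ta, td):
    """Compute CR = TA * TD * TA^T as in Equation 1."""
    return matmul(matmul(ta, td), transpose(ta))


# Hypothetical Task Assignments: TA[i][j] = 1 if worker i is assigned to task j.
TA = [
    [1, 1, 0, 0],  # worker 0 works on tasks 0 and 1
    [0, 0, 1, 0],  # worker 1 works on task 2
    [0, 0, 0, 1],  # worker 2 works on task 3
]

# Hypothetical Task Dependencies: symmetric, task 0 and task 2 are interdependent.
TD = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

CR = coordination_requirements(TA, TD)
# CR is [[0, 1, 0], [1, 0, 0], [0, 0, 0]]:
# only workers 0 and 1 share a coordination requirement.
```

In this example, worker 2 has no coordination requirement because task 3 has no dependencies, even though all three workers belong to the same hypothetical project.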
This framework provides alternative ways of thinking about coordination
requirements among workers depending on what type of data is used to populate the Task
Dependencies matrix. Past work has focused on temporal relationships between tasks, for
instance, task A needs to be done before task B (e.g. Levchuk et al, 2003). In the context
of software development, such a way of thinking about task dependencies is quite common.
Alternative views could be based on high level roles in the development organizations
(e.g. integration and testing depends on development) or task dependencies based on
product dependencies in the actual software code (e.g. function calls between modules).
The focus of this dissertation is on the relationship between the work dependency structure
and the product dependency structure because, as argued in chapter 1, the difficulty of identifying and
managing certain types of product dependencies is a critical factor in coordination
success and ultimately in productivity and quality.
Measuring Socio-Technical Congruence
Given a particular Coordination Requirements matrix constructed from relating
product dependencies to work dependencies, we can compare it to an Actual
Coordination (CA) matrix that represents the interactions workers engaged in through
different means of coordination. I refer to the match between those two matrices as socio-technical
congruence. Then, given a particular set of dependencies among tasks,
congruence is the proportion of coordination activities that actually occurred (given by
the Actual Coordination matrix) relative to the total number of coordination activities that
should have taken place (given by the Coordination Requirements matrix). For example,
if the Coordination Requirements matrix shows that 10 pairs should coordinate, and of
these, 5 show Actual Coordination interactions, then the congruence is 0.5. Formally, we
define congruence as follows:
$\mathrm{Diff}(CR, CA) = \mathrm{card}\,\{\, \mathit{diff}_{ij} \mid cr_{ij} > 0 \wedge ca_{ij} > 0 \,\}$
$|CR| = \mathrm{card}\,\{\, cr_{ij} > 0 \,\}$
We have,
$\mathrm{Congruence}(CR, CA) = \mathrm{Diff}(CR, CA) \,/\, |CR|$ (Equation 2)
In sum, the value of congruence belongs to the [0,1] interval that represents the
proportion of coordination requirements that were satisfied through some type of
coordination activity or mechanism. The measure of socio-technical congruence proposed
here provides a new way of thinking about coordination, particularly, by providing a fine-
grain level of analysis of different types of product dependencies and allowing us to
examine how coordination needs are impacted by them.
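As a minimal sketch of Equation 2, the function below counts the cells with a positive coordination requirement and the subset of those cells that also show actual coordination. The example matrices are hypothetical. Two choices are assumptions of this sketch and are not specified in the text: cells are counted exactly as the equation is written (no special treatment of the diagonal), and `None` is returned when there are no coordination requirements, since the ratio is undefined in that case.

```python
def congruence(cr, ca):
    """Proportion of coordination requirements matched by actual coordination."""
    # Cells where a coordination requirement exists (cr_ij > 0).
    required = [(i, j)
                for i, row in enumerate(cr)
                for j, value in enumerate(row)
                if value > 0]
    if not required:
        return None  # assumption: congruence is undefined without requirements
    # Required cells that also show actual coordination (ca_ij > 0).
    satisfied = sum(1 for i, j in required if ca[i][j] > 0)
    return satisfied / len(required)


# Hypothetical Coordination Requirements among three workers.
CR_EXAMPLE = [
    [0, 2, 1],
    [2, 0, 0],
    [1, 0, 0],
]

# Hypothetical Actual Coordination: workers 0 and 1 interacted, as did 1 and 2.
CA_EXAMPLE = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

RESULT = congruence(CR_EXAMPLE, CA_EXAMPLE)  # 2 of 4 required cells satisfied: 0.5
```

Note that the interaction between workers 1 and 2 does not raise the value, because the measure only considers pairs for which a coordination requirement exists.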
CHAPTER 3: TERMINOLOGY AND DESCRIPTION OF THE
DATASETS
Terminology
In this section, I define several terms that are used throughout the empirical studies as
well as in the description of the datasets:
Source code file: A source code file represents a collection of functions, methods, and
data type declarations and definitions that implement part of or an entire functionality of
a software system. In this dissertation, I will use the terms source code file and module
interchangeably. This definition does not refer or imply any specific way of partitioning a
system into implementation modules.
Commit: A commit represents an actual modification to one or more source code files in
the version control system. A particular commit contains at least the following attributes: a
date of submission, an author or developer responsible, a list of one or more files and the
modifications to those files. The terms submission and changelist are used as synonyms
of a commit throughout this document.
Modification request (MR): A modification request represents a work item that refers to a
conceptual change to the software that involves modifications to a set of source code files
(Mockus & Weiss, 2000). The changes could represent the development of new
functionality or the resolution of a defect encountered by a developer, the quality
assurance organization or reported by a customer. A modification request consists of one
or more commits from a version control system.
Lines of code (LOC): In various parts of the dissertation, I refer to lines of code as a
measure of size of a system or a module. The measure refers to non-blank non-comment
lines of code.
Datasets
In order to address the research questions outlined in chapter 1, data from several
geographically distributed software development projects was collected. The
characteristics of those projects and the data are described in the rest of this chapter.
Project A
I collected data from a software development project of a large distributed system
produced by a company that operates in the data storage industry. The data covered a
period of 39 months of development activity and the first four releases of the product.
The company had one hundred and fourteen developers grouped into eight development
teams distributed across three development locations. All the developers worked full time
on the project during the time period covered by the data. The system was composed of
about 5 million lines of code distributed in 7737 source code files, mostly in C,
with a small portion (117 files and fewer than 96000 lines of code) in C++.
Data corresponding to a total of 8,257 resolved modification requests were identified.
Those MRs involved 67,652 commits to the version control system.
Software developers communicated and coordinated using various means.
Opportunities for interaction existed when working in the same formal team or when
working in the same location. Developers also used tools such as Internet Relay Chat
(IRC) and an MR tracking system to interact and coordinate their work. For instance, the
MR tracking system kept track of the progress of the task, comments and observations
made by developers as well as additional material used in the development process. I
collected communication and coordination information from these two systems. Finally, I
also collected demographic data about the developers such as their programming and
domain experience and level of formal education.
Project A represents the main source of data for the various empirical studies
presented in this dissertation. In order to address potential external validity concerns, data
from additional projects was used in each empirical study. Those projects are described in
the following paragraphs.
Project B
Version control data from three open source projects from the Apache Software
Foundation was collected. I focused on changes to the software that were associated with
modification requests resolved between February of 2001 and January of 2003.
There were a total of 1068 modification requests resolved in that timeframe involving
1972 commits in the version control system. Those modification requests were related to
three different projects, Ant, Tomcat and Struts, where a total of seventy-five engineers
participated in the development effort.
Project C
The project involved the development of an embedded software system for a
communications device developed by a major telecommunications company. Forty
engineers participated in the project. The data covered a period of five years and the last
six releases of the product. All the developers but one worked in the same development
facility located in the United States. The remote developer worked in Australia. The
system was composed of approximately 1.2 million lines of C and C++ code distributed
in 1224 modules, with 427 modules written in C++. Data associated with
about 7000 modification requests constituted the dataset.
Project D
This project was a large medical device system where the development
organization had eighty-three engineers grouped into 10 teams distributed across four
development locations, one in India, one in Eastern Europe and two in the United States.
Architects, some of the technical leads and managers were also in the development
facilities located in the United States. All the developers worked full time on the project
during the time period covered by the data. Engineers had formal roles such as architect,
team lead, tester or developer. The project was organized into iterations which constitute
fixed periods of time, about 8 weeks, focused on the development of a set of
requirements defined at the beginning of the iteration. The data covered the 7th iteration
of the project. A survey instrument based on a roster approach was used to collect
coordination activity twice during the development iteration.
CHAPTER 4: METHODS FOR IDENTIFYING WORK
DEPENDENCIES IN SOFTWARE DEVELOPMENT PROJECTS
In this chapter, I explore different methods of determining work dependencies
from product dependencies (e.g. relationships among the source code files of a software
system). Then, those work dependencies will allow us to identify coordination
requirements among software developers as proposed in the congruence framework
introduced in chapter 2.
Two Approaches to Determine Product Dependencies in Software Systems
The traditional view of software dependency has its origins in compiler
optimizations and focuses on control and dataflow relationships (Horwitz et al, 1990).
This approach extracts relational information between specific units of analysis such as
statements, functions or methods, as well as modules, typically, from the source code of a
system or from an intermediate representation of the software code such as bytecodes or
abstract syntax trees. These relationships can represent either a data-related dependency
(e.g. a particular data structure modified by a function and used in another function) or a
functional dependency (e.g. method A calls method B). This type of dependency analysis
technique has been widely used in a research context to examine the relationship
between coupling and quality of a software system (e.g. Hutchins & Basili, 1985; Selby
& Basili, 1991). Syntactic dependency analysis is also used by software developers to
improve their understanding of programs and the linkages among the various parts of
those programs (Murphy et al, 1998).
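To make the idea of extracting functional dependencies from an abstract syntax tree concrete, the following is an illustrative sketch only. It uses Python's standard `ast` module on a small hypothetical snippet, whereas the systems studied here were written in C and C++ and the tooling used for them is not described in this section. The function name `call_edges` and the sample source are assumptions of this sketch.

```python
import ast

# Hypothetical source text with three functions and two internal calls.
SOURCE = '''
def parse(text):
    return text.split()

def count(text):
    return len(parse(text))

def report(text):
    print(count(text))
'''


def call_edges(source):
    """Return sorted (caller, callee) pairs for calls between top-level functions."""
    tree = ast.parse(source)
    # Names of the functions defined at the top level of the source.
    defined = {node.name for node in tree.body
               if isinstance(node, ast.FunctionDef)}
    edges = set()
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        # Look for direct calls by name to another defined function.
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in defined):
                edges.add((func.name, node.func.id))
    return sorted(edges)


EDGES = call_edges(SOURCE)
# EDGES is [('count', 'parse'), ('report', 'count')]
```

The sketch only detects direct calls by name; calls through attributes, function pointers, or dynamic dispatch are missed, which mirrors the point that the accuracy of such graphs depends on what the extraction tool can recognize.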
One characteristic of these relational structures such as a call-graph, and for that
matter other graphs such as inheritance and data dependencies graphs, is that they provide
a particular view of the system-wide structure. Moreover, the accuracy of the information
represented in these graphs depends on the ability of the tool used to identify all the
appropriate types of syntactic relationships allowed by the underlying programming
language (Murphy et al, 1998).
An alternative mechanism of identifying dependencies consists of examining the
set of source code files that are modified together as part of a modification request. This
approach is equivalent to the approach proposed by Gall and colleagues (1998) in the
software evolution literature to identify logical dependencies between modules. A source
code file can be viewed as representing a “bundle” of technical decisions. If a
modification request can be implemented by changing only one file, it provides no
evidence of any dependencies among files. However, when a modification request
requires changes to more than one file, it can be assumed that decisions about the change
to one file in a modification request depend in some way on the decisions made about
changes to the other files involved in implementing the modification request.
Dependencies could range from syntactic, for instance a function call between files, to
more complex semantic dependencies where the computations done in one file affect
the behavior of another file. This approach would represent a better estimate of
semantic dependencies relative to call graphs or data graphs because it does not rely on
language constructs to establish the dependency relationship between source code files.
The remainder of this dissertation refers to this approach to identify dependencies as the
“Files Changed Together” (FCT) method. I will refer to the method to identify
dependencies based on syntactic functional and data relationships described earlier as the
CGRAPH method.
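The FCT method can be sketched as follows, assuming each modification request is reduced to the set of files touched by its commits. The file names and modification requests are hypothetical, and the choice to store co-change counts (rather than a binary indicator) is an assumption of this sketch, not something stated in the text.

```python
from itertools import combinations


def fct_dependencies(modification_requests):
    """Build a symmetric file-by-file matrix counting files changed together."""
    # Stable ordering of all files seen in any modification request.
    files = sorted({f for mr in modification_requests for f in mr})
    index = {f: i for i, f in enumerate(files)}
    td = [[0] * len(files) for _ in files]
    for mr in modification_requests:
        # Every pair of files changed in the same MR is evidence of a dependency.
        for a, b in combinations(sorted(set(mr)), 2):
            td[index[a]][index[b]] += 1
            td[index[b]][index[a]] += 1
    return files, td


# Hypothetical modification requests, each given as the set of files it changed.
MRS = [
    {"net.c", "proto.c"},
    {"disk.c"},                      # single-file MR: no evidence of dependency
    {"net.c", "proto.c", "log.c"},
]

FILES, TD = fct_dependencies(MRS)
# FILES is ['disk.c', 'log.c', 'net.c', 'proto.c']
# TD[2][3] is 2 because net.c and proto.c changed together in two MRs.
```

A single-file modification request contributes nothing to the matrix, consistent with the observation that it provides no evidence of dependencies among files.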
The Task Dependency (TD) matrices produced by the techniques described in the
previous paragraphs could change over time as new product dependencies are created or
existing ones are removed. Moreover, the information captured by the TD matrix
constructed with the FCT method might differ from the TD matrix constructed with the
CGRAPH method. Those changes or differences could potentially impact the measures of
coordination requirements (Equation 1) and congruence (Equation 2). Thus,
understanding the general properties of the task dependency matrices, how they evolve
over time and how they differ from each other is critical to assessing the impact of
socio-technical congruence on outcome variables such as development productivity and
software quality. The following sections address these issues using the data from Project
A.
General Properties and Evolution of the FCT Task Dependency Matrix
Using the FCT method, I constructed monthly TD matrices which captured all the
changes to the code associated with the set of modification requests resolved in each month.
Since a graph and a matrix are equivalent representations of a set of relational data, I can
use widely accepted graph measures to examine the general properties of the TD matrices
(I use the terms graph and network interchangeably throughout the dissertation).
One basic measure is the density of the graph, which provides a general idea of the level
of interconnectivity among the nodes of the graph. In this research context, density
translates to the overall degree of interdependence amongst the source code files in the
system. A second useful network measure is the clustering coefficient (Watts, 1999), which
indicates the extent to which there are clusters of interdependent source code files that are
also interdependent amongst themselves. Those two measures, density and clustering
coefficient, provide a general view of the structural properties of the TD matrices.
Figure 2 shows the evolution of the density and clustering coefficient measures
over the time covered by the data. The density of the monthly TD matrices is relatively
low, with a few exceptions where the levels of density exceed 0.01 (avg=0.0033,
min=0.0004, max=0.0204). The clustering coefficient measure shows modest levels
(avg=0.0925, min=0.0023, max=0.1774), suggesting a small degree of interdependent
clusters of files in the TD matrices. In sum, the results indicate that, on a monthly basis, a
small set of dependencies is identified, and those dependencies tend to be modestly
clustered.
[Figure 2 plot. Series: Density, Clustering Coefficient. X-axis: Month in the Dataset (1 to 39). Y-axis: Measure Level (0.000 to 0.200).]
Figure 2: Evolution of the Density and Clustering level of the TD matrices (FCT
method)
An instance of a set of source code files changing together as part of a
modification request represents a piece of evidence indicating the existence of a product
dependency, potentially logical or implicit in nature. In order to capture the representative
set of product dependencies, an understanding of the degree of change in the information
contained in the TD matrices is required. If the matrices are relatively stable, a short time
slice could suffice to capture all relevant product dependencies. On the other hand, if the
information contained in the monthly TD matrices changes significantly from time t to
time t+1, it is necessary to identify the appropriate time window size that would yield an
accurate representation of the product dependencies. Figure 3 shows the percentage of
change in the information contained in a TD matrix from time t relative to the TD matrix
from time t-1. The set of technical dependencies captured differs significantly from month
to month, with an average change of 37% (min=5.11%, max=49.94%). These results
suggest that the changes to the source code affect different sets of source code files over
time. Hence, it is necessary to explore how many months of information would constitute
an accurate and representative set of technical dependencies that could be used to
compute the Coordination Requirement matrices.
[Figure 3 plot. X-axis: Month in the Dataset (2 to 39). Y-axis: Percentage of Change (0% to 100%).]
Figure 3: Evolution of the Change in the Information Contained in the TD matrices
(FCT method)
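The text does not state the formula behind the percentage of change, so the sketch below shows only one plausible reading: the share of dependencies that appear in exactly one of two consecutive TD matrices, relative to all dependencies present in either. Both that definition and the names `edge_set` and `percent_change` are assumptions made for illustration.

```python
def edge_set(pairs):
    """Represent an unweighted, undirected TD matrix as a set of file pairs."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def percent_change(prev_edges, curr_edges):
    """Percentage of dependencies added or removed between two periods,
    relative to the union of both periods (assumed definition)."""
    union = prev_edges | curr_edges
    if not union:
        return 0.0
    return 100.0 * len(prev_edges ^ curr_edges) / len(union)


# Hypothetical TD matrices for months t-1 and t.
month_prev = edge_set([("a.c", "b.c"), ("b.c", "c.c"), ("c.c", "d.c")])
month_curr = edge_set([("b.c", "a.c"), ("b.c", "c.c"), ("d.c", "e.c")])
print(percent_change(month_prev, month_curr))  # 50.0
```

Two of the four distinct dependencies differ between the two hypothetical months, hence the 50% change. A different normalisation (for example, relative to the earlier month only) would give different values.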
The following procedure was used to explore the time window size necessary to
capture the relevant product dependencies. First, the union of each k-tuple of
consecutive TD matrices is computed, where k represents the number of months of data
used to compute the new TD matrices and ranges from 2 to 39 months. For instance, in
the case of k=2, this computation outputs TD matrices that contain all the dependencies
based on the changes made to the software in months 1 and 2, months 2 and 3,
months 3 and 4, and so forth. The second step is to average the network density value of
all the matrices associated with a particular value of k. Finally, I plotted that average
value of network density for each value of k. Figure 4 depicts the results of this
procedure. As the number of months of data considered to compute the TD matrix
increases, the density level of that TD matrix increases monotonically until month 19,
where a density value of 0.0109 is reached. The remaining 20 months of data increase the
density of the TD matrix from 0.0109 up to 0.01151. In other words, any additional month
of data beyond 19 months does not yield a significant increase in the density of the TD
matrix, indicating that it contributes little additional information in terms of technical
dependencies. In view of this result, I used a time period of 19 months to compute the TD
matrix used in the calculations of the coordination requirements.
[Figure 4 plot. X-axis: Amount of Months of Data (2 to 39). Y-axis: Average Cumulative Density Level (0.000 to 0.012).]
Figure 4: Average Cumulative Density of the TD matrix (FCT method)
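The window-size procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the analysis: each monthly TD matrix is represented as a set of undirected file pairs, the number of files is held fixed across months, and the function names are hypothetical.

```python
def density(edges, n_files):
    """Fraction of possible undirected file pairs that are linked."""
    possible = n_files * (n_files - 1) / 2
    return len(edges) / possible if possible else 0.0


def average_cumulative_density(monthly_edges, n_files):
    """For each window size k, average the density of the union of every
    run of k consecutive monthly TD matrices."""
    months = len(monthly_edges)
    result = {}
    for k in range(2, months + 1):
        densities = []
        for start in range(months - k + 1):
            # Union of the k monthly edge sets starting at `start`.
            window = set().union(*monthly_edges[start:start + k])
            densities.append(density(window, n_files))
        result[k] = sum(densities) / len(densities)
    return result


# Hypothetical four months of co-change data over four files.
ab, bc, cd = frozenset("ab"), frozenset("bc"), frozenset("cd")
monthly = [{ab}, {ab, bc}, {cd}, {ab}]
for k, value in average_cumulative_density(monthly, n_files=4).items():
    print(k, round(value, 4))
```

In the toy data the average density stops growing at k=3, which corresponds to the plateau that motivates choosing a fixed window size.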
General Properties and Evolution of the CGRAPH Task Dependency Matrix
In the case of the CGRAPH method, the dependencies between source code files are
determined based on data and functional references. Data references are represented by
relationships where a source code file, A, references a data object in a second source code
file, B. Functional references are represented by relationships where a source code file, A,
invokes a function or a method declared in a second source code file, B. Unlike the
relationships in the FCT method, data and functional references are directional; that is,
the pair of source code files (A,B) is considered different from the pair (B,A).
I collected quarterly data for this type of dependency information, mapping each
quarter to the corresponding 3 months of the data discussed in the previous paragraphs. I
used the C-REX tool (Hassan and Holt, 2004) to identify programming language tokens
and references in each entity of each source code file. This analysis was performed over
the entire source code of the system4 at the end of the 3rd month of each quarter. Using
the resulting data, I computed dependencies between source code files by identifying
data, function and method references that cross the boundary of each source code file. In
other words, each cell ij of the TD matrix computed with the CGRAPH method represents
the number of data/function/method references that exist from file i to file j.
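To make the structure of such a matrix concrete, the sketch below builds a directed, weighted TD matrix from a list of cross-file references. The input format (a list of source/target file pairs), the file names and the function name are hypothetical; the output format of the C-REX tool is not reproduced here.

```python
from collections import Counter


def cgraph_td_matrix(references, files):
    """Return a matrix where cell [i][j] counts references from files[i]
    to files[j]; references within a single file are ignored."""
    index = {name: i for i, name in enumerate(files)}
    counts = Counter(
        (index[src], index[dst]) for src, dst in references if src != dst
    )
    n = len(files)
    matrix = [[0] * n for _ in range(n)]
    for (i, j), count in counts.items():
        matrix[i][j] = count
    return matrix


# Hypothetical references extracted from one quarterly snapshot.
files = ["a.c", "b.c", "log.c"]
references = [
    ("a.c", "b.c"),    # a.c calls a function declared in b.c
    ("a.c", "b.c"),    # a.c reads a data object declared in b.c
    ("b.c", "log.c"),
    ("a.c", "log.c"),
    ("a.c", "a.c"),    # stays inside the file, so it is not a dependency
]
td = cgraph_td_matrix(references, files)
print(td)  # [[0, 2, 1], [0, 0, 1], [0, 0, 0]]
```

The matrix is not symmetric: `td[0][1]` is 2 while `td[1][0]` is 0, which reflects the directional nature of data and functional references.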
Figure 5 shows the evolution of the network density measure over each quarter.
The TD matrices have higher levels of density (avg=0.0311, min=0.0261, max=0.0322)
relative to those obtained using the FCT method5. In terms of the evolution of the
clustering coefficient measure, we see that the levels are also very stable over time, and
higher (avg=0.1862, min=0.1738, max=0.1909) than those reported for the TD matrices
created with the FCT method. The density of the TD matrices produced by the CGRAPH
method is significantly higher than the density of the matrices produced by the FCT
method. This difference could stem primarily from two characteristics of the source code
of a system. First, the CGRAPH method identifies numerous technical dependencies that
involve files that, once developed, are rarely modified. Cross-cutting concerns such as
logging, tracing and security are good examples. Commonly used low-level functionality
such as memory and thread management and basic storage types such as lists and queues
are another example. A second factor that might contribute to higher levels of density of
the TD matrices is the technical dependencies that exist with and between automatically
generated source code files. One such example is the source code for remote procedure
calls (RPCs). The FCT method would capture dependencies between the caller and callee
of an RPC if there are changes to the RPC specification or functionality. On the other
hand, the CGRAPH method would capture the complete path of dependencies from the
caller through the RPC stubs, marshalling and communication code all the way to the
callee.
Given the potential bias that these two factors could have in the computations of
dependencies, I removed them from the quarterly call graphs and recomputed the density
measures for each quarterly TD matrix. The results showed a reduction in the density
(avg=0.0289, min=0.0241, max=0.0299). However, the density levels remained
significantly higher than those for TD matrices created with the FCT method when
considering the 19-month window for development activity.
4 The set of files used in the analysis also included the automatically generated source code files from functionality such as remote procedure calls.
5 The maximum level of density of a TD matrix produced by the FCT method is 0.01151 if all 39 months of development activity are considered.
[Figure 5 plot. Series: Density, Clustering Coefficient. X-axis: Quarter in the Dataset (Q1 to Q13). Y-axis: Measure Level (0.000 to 0.250).]
Figure 5: Evolution of the Density level of the TD matrices (CGRAPH method)
We also examined the percentage of change in the information contained in a TD
matrix from quarter t relative to the TD matrix from quarter t-1. Figure 6 shows that the
rate of change is relatively low (avg=0.24%, min=0.1%, max=0.9%). Those rates of change
reflect only whether a relationship between files exists or not. If we extend the idea of
change to also consider a modification in the weight of the relationship (e.g. number of
calls between files), the rate of change increases (avg=1.1%, min=0.4%, max=3%);
however, it remains relatively stable over time. This result is not particularly
surprising since significant changes in the overall syntactic dependency structure of a
system would imply major code refactoring efforts or architectural changes, events that
do not occur often. A similar pattern of stability was found in the TD matrices produced
by the FCT method when I accumulated the commit information from 19 consecutive
months. Thus, we could think of the volatility that the monthly TD matrices produced by
the FCT method showed as an indication of how the development work evolves over time
rather than of how the overall structure of the technical dependencies changes
over time. In sum, the CGRAPH method produces TD matrices that contain significantly
more product dependency information relative to those produced by the FCT method.
Moreover, a fraction of the product dependencies identified by the two methods
differed significantly.
[Figure 6 plot. Series: Change in Number of Edges, Change in Edge Weights. X-axis: Quarter in the Dataset (Q2 to Q13). Y-axis: Percentage of Change (0% to 5%).]
Figure 6: Evolution of the Change in the Information Contained in the TD matrices
(CGRAPH method)
Comparative Analysis of the Task Dependency Matrices
Although the analyses described above provide valuable information about the
various TD matrices, they do not tell us anything regarding the similarity in the sets of
technical dependencies identified by the two methods, FCT and CGRAPH. One of the
advantages of the FCT method is the potential to identify technical dependencies that
might not necessarily be captured by a simple syntactic dependency among modules of a
software system, such as semantic dependencies (Gall et al., 1998). This argument
suggests that a comparison between the TD matrices generated by the two methods, FCT
and CGRAPH, might show differences, possibly significant. The first step of this analysis
was to compute the following two operations: TD(FCT) - TD(CGRAPH) and TD(CGRAPH) -
TD(FCT). These operations, which are equivalent to the set difference operation, allow us to
determine which dependencies identified by the FCT method are not identified
by the CGRAPH method and vice versa. The focus is to identify whether a relationship
between two modules exists in one matrix, the other or both. Hence, I do not consider
the differences in the weight of the linkages. I compared quarterly TD(CGRAPH) matrices
against the TD(FCT) computed for the period of 19 months prior to the end of the
quarter. For the first two quarters, I did not have 19 months' worth of past data to compute
the TD(FCT) matrices. Therefore, I used 13 months to construct the TD(FCT) that was
compared to the TD(CGRAPH) matrix from the first quarter, and 16 months in the case of
the second quarter comparison.
Figure 7 shows the comparison between the TD matrices. On average, 14.6% of the
dependencies in the TD matrix computed using the FCT method were not
identified by the CGRAPH method (min=12.4%, max=17.1%). As discussed earlier, the
TD matrices computed using the CGRAPH method are denser and that situation is clearly
reflected in this comparison. On average, 74.3% of the product dependencies in the TD
matrix computed using the CGRAPH method were not identified by the FCT method
(min=70.6%, max=79.2%).
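The set difference comparison can be illustrated with a short sketch. Two assumptions are made here that the text does not spell out: the directional CGRAPH pairs are collapsed into undirected pairs so that they are comparable with the FCT pairs, and each percentage is taken relative to the number of dependencies in the matrix being examined. The function names and file names are hypothetical.

```python
def undirected(pairs):
    """Collapse file pairs into an unweighted, undirected dependency set."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def pct_not_identified(own, other):
    """Percentage of the dependencies in `own` that are absent from `other`."""
    if not own:
        return 0.0
    return 100.0 * len(own - other) / len(own)


# Hypothetical dependency sets for one quarter.
td_fct = undirected([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")])
td_cgraph = undirected([
    ("b", "a"), ("c", "b"), ("a", "c"), ("d", "c"),
    ("a", "e"), ("b", "e"), ("c", "e"), ("d", "e"),
])

print(pct_not_identified(td_fct, td_cgraph))   # 25.0  (TD(FCT) - TD(CGRAPH))
print(pct_not_identified(td_cgraph, td_fct))   # 62.5  (TD(CGRAPH) - TD(FCT))
```

Because only the existence of a link matters in this comparison, edge weights are dropped before the sets are compared.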
[Figure 7 plot. Series: FCT-CGR, CGR-FCT. X-axis: Quarter in Dataset (Q1 to Q13). Y-axis: Percentage of Non-identified Dependencies (0.0% to 100.0%).]
Figure 7: Comparison between TD matrices generated by the FCT and CGRAPH
methods
Comparative Analysis of the Coordination Requirement Matrices
As described in chapter 2, the Coordination Requirements matrix (CR) is a
function of two elements: the TA matrix and the TD matrix. Using the different methods
for identifying technical dependencies to construct TD matrices will result in different CR
matrices. Hence, we also need to examine the general properties of both types of CR
matrices. I used the data from the modification requests resolved in each month to
compute the TA matrix. To compute the TD matrix, I used a 19-month moving
window in the case of the FCT method or the corresponding quarterly TD matrix in the
case of the CGRAPH method. Figure 8 shows the evolution of the density and clustering
coefficient measures for the CR matrices constructed based on the FCT method. We
observe that the density of the monthly CR matrices is low (avg=0.0655, min=0.0005,
max=0.1429) while the clustering coefficient measure shows relatively high levels
(avg=0.3179, min=0.0308, max=0.4331), suggesting an important degree of
interdependent clusters of files in the CR matrices.
[Figure 8 plot. Series: Density, Clustering Coefficient. X-axis: Month in the Dataset (1 to 39). Y-axis: Measure Level (0.000 to 0.500).]
Figure 8: Evolution of Density and Clustering level of the CR matrices (FCT
method)
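For reference, the way a CR matrix combines the two inputs can be sketched as a matrix product. The sketch assumes that equation 1 in chapter 2 has the form CR = TA x TD x transpose(TA), with TA a developer-by-file assignment matrix; that form is not restated in this chapter, so it should be read as an assumption, and the helper names and data are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def coordination_requirements(ta, td):
    """Assumed form of equation 1: CR = TA x TD x transpose(TA)."""
    return matmul(matmul(ta, td), transpose(ta))


# Hypothetical month: developer 0 changed files 0 and 1, developer 1 changed file 2.
ta = [
    [1, 1, 0],
    [0, 0, 1],
]
# Files 0 and 2 depend on each other; file 1 has no dependencies.
td = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
]
print(coordination_requirements(ta, td))  # [[0, 1], [1, 0]]
```

The non-zero off-diagonal cells indicate that the two developers need to coordinate because they worked on interdependent files. A denser TD matrix, such as the one produced by the CGRAPH method, yields more non-zero cells for the same TA matrix.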
Figure 9 shows the evolution of the density and clustering coefficient measures for
the CR matrices constructed based on the CGRAPH method. Although the clustering
coefficient values (avg=0.3979, min=0.0312, max=0.5402) are relatively similar to those
shown in Figure 8, the CR matrices created using the CGRAPH
method are significantly denser (avg=0.1509, min=0.0009, max=0.2408) than those
created using the FCT method. In other words, CR matrices constructed with the
CGRAPH method would suggest significantly higher levels of coordination requirements
for the developers. Thus, it is important to understand whether the additional coordination
needs are indeed necessary. This question is addressed in chapter 5.
[Figure 9 plot. Series: Density, Clustering Coefficient. X-axis: Month in the Dataset (1 to 39). Y-axis: Measure Level (0.000 to 0.600).]
Figure 9: Evolution of Density and Clustering level of the CR matrices (CGRAPH
method)
Chapters 5 and 6 present two empirical studies that use the dependency
identification techniques discussed in the previous paragraphs (FCT and CGRAPH) to
examine the mismatch between coordination needs and coordination activities and its
impact on two traditional outcome variables: development productivity and product
quality.
CHAPTER 5: DEPENDENCIES, CONGRUENCE AND THEIR
IMPACT ON DEVELOPMENT PRODUCTIVITY
Identifying work dependencies and determining the appropriate coordination
mechanisms to address the dependencies is not a trivial problem. Coordination is a
recurrent topic in the organizational theory literature and, as discussed in chapters 1 and
2, many stylized types of task dependencies and coordination mechanisms have been
proposed over the past several decades. These perspectives are useful in the context of
enduring structures. However, numerous types of work, for instance non-routine
knowledge-intensive activities such as software development, are potentially full of
fine-grained dependencies that might change on a daily or hourly basis. Conventional
coordination mechanisms like standard operating procedures or routines would have very
limited applicability in these dynamic contexts. Failure to identify the new needs for
coordination and information exchange might hinder the organization’s ability to adapt to
changes in its competitive environment (Henderson & Clark, 1990). The study reported
in this chapter represents the first step in the examination of how the gaps between
coordination needs and actual coordination activity impact outcome variables, such as
development productivity, in the context of software development activities.
Study I: Congruence and Development Productivity
Software development is populated with rapidly changing dependencies and this
attribute of software development tasks is a potential source of coordination problems
that impact productivity. The analysis presented in this study focuses, first, on
exploring the dynamism in the coordination requirements and, second, on examining the
impact that coordination activity congruent with coordination needs has on development
performance.
Research Questions
When members of a team are physically collocated and coordination requirements
within the team change, there are numero