    DEPENDENCIES IN GEOGRAPHICALLY DISTRIBUTED SOFTWARE DEVELOPMENT:
    OVERCOMING THE LIMITS OF MODULARITY¹

    Marcelo Cataldo

    CMU-ISRI-07-120

    December 2007

    School of Computer Science Institute for Software Research

    Carnegie Mellon University Pittsburgh, PA

    Thesis Committee

    Kathleen M. Carley, Co-Chair

    James D. Herbsleb, Co-Chair

    Len J. Bass

    David Redmiles

    Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

    Copyright © 2007 Marcelo Cataldo

    ¹ This dissertation was supported by the National Science Foundation under Grant No. IIS-0414698, Grant No. IIS-0534656 and Grant No. IGERT 9972762, by the U.S. Army Research Laboratory under Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0011, by the Office of Naval Research (ONR N00014-06-1-0921) and by the Air Force Research Lab with Charles River Analytics SC060701.

    Keywords: geographically distributed software development, collaborative software development, coordination, software dependencies.

    Dedicated to Pei-Chi and to my parents, Antonio and Mirta

    ACKNOWLEDGEMENTS

    I have been very fortunate to work with an outstanding dissertation committee in Kathleen Carley, Jim Herbsleb, Len Bass and David Redmiles. I am particularly indebted to Kathleen and Jim for being the best advisors a student could hope for. I would also like to thank my family for their patience and encouragement, especially my wife Pei-Chi, without whom my life as a doctoral student would have been a lot less enjoyable.

    Throughout this process, many others helped shape my views and research. Special thanks go to Matthew Bass, Audris Mockus, Jeffrey Reminga, Jeffrey Roberts and Patrick Wagstrom.

    ABSTRACT

    Geographically distributed software development (GDSD) is becoming pervasive. Hence, the constraints on communication, and their negative impact on developers’ ability to coordinate effectively, are a growing problem that consistently results in sub-par performance of GDSD teams. Past research argues that geographically distributed teams do better when their work is almost independent from each other. In software engineering, modularization is the traditional technique intended to reduce the interdependencies among the modules that constitute a system. The modular design argument suggests that by reducing the technical dependencies, the work dependencies between teams developing interdependent modules are also reduced. Consequently, the argument goes, a modular product structure leads to an equivalent modular task structure. This dissertation argues that modularization is not a sufficient representation of work dependencies in the context of software development and it proposes a method for measuring socio-technical congruence, defined as the relationship between the structure of work dependencies and the coordination patterns of the organization doing the technical work. Two empirical studies assessed the impact of socio-technical congruence on development productivity and product quality. In addition, a third empirical study explores how developers in a geographically distributed software development organization evolve their coordination patterns to overcome the limitations of the modular design approach.

    Collectively, this dissertation makes important contributions to the software engineering, CSCW and organizational literatures. First, the empirical evaluation of the congruence framework showed the importance of understanding the dynamic nature of software development. Identifying the “right” set of product dependencies that determine the relevant work dependencies, and coordinating accordingly, has a significant impact on reducing the resolution time of modification requests. The analyses showed that traditional software dependencies, such as syntactic relationships, tend to capture a relatively stable view of product dependencies that is not representative of the dynamism in product dependencies that emerges as software systems are implemented. On the other hand, logical dependencies provide a more accurate representation of the most relevant product dependencies in software development projects. Secondly, this dissertation moves forward our understanding of the relationship between product and work dependencies and software quality. Logical dependencies among software modules and work dependencies were found to be two very significant factors affecting the failure proneness of software modules. Finally, the longitudinal analysis of coordination activities in a GDSD project showed that developers centrally positioned in the social system of information exchanges and coordination activities performed a critical bridging function across formal teams and geographical locations. Moreover, those same individuals contributed an average of 57% of the development effort, in terms of implementing the software system, in each release covered by the data.

    TABLE OF CONTENTS

    ACKNOWLEDGEMENTS
    ABSTRACT
    TABLE OF CONTENTS
    LIST OF TABLES
    LIST OF FIGURES

    CHAPTER 1: INTRODUCTION
        THE NATURE OF SOFTWARE DEVELOPMENT AND MODULAR DESIGN
        THE NATURE OF SOFTWARE DEVELOPMENT AND INTERDEPENDENCY THEORIES
        RESEARCH QUESTIONS

    CHAPTER 2: A FRAMEWORK FOR IDENTIFICATION OF WORK DEPENDENCIES
        THE CONCEPT OF SOCIO-TECHNICAL CONGRUENCE
        IDENTIFICATION OF COORDINATION REQUIREMENTS
        MEASURING SOCIO-TECHNICAL CONGRUENCE

    CHAPTER 3: TERMINOLOGY AND DESCRIPTION OF THE DATASETS
        TERMINOLOGY
        DATASETS
            Project A
            Project B
            Project C
            Project D

    CHAPTER 4: METHODS FOR IDENTIFYING WORK DEPENDENCIES IN SOFTWARE DEVELOPMENT PROJECTS
        TWO APPROACHES TO DETERMINE PRODUCT DEPENDENCIES IN SOFTWARE SYSTEMS
            General Properties and Evolution of the FCT Task Dependency Matrix
            General Properties and Evolution of the CGRAPH Task Dependency Matrix
        COMPARATIVE ANALYSIS OF THE TASK DEPENDENCY MATRICES
        COMPARATIVE ANALYSIS OF THE COORDINATION REQUIREMENT MATRICES

    CHAPTER 5: DEPENDENCIES, CONGRUENCE AND THEIR IMPACT ON DEVELOPMENT PRODUCTIVITY
        STUDY I: CONGRUENCE AND DEVELOPMENT PRODUCTIVITY
            Research Questions
            Method
                Description of the Measures
                Description of the Model and Preliminary Analysis
            Results
                The Evolution of Coordination Requirements
                The Impact of Congruence on Resolution Time of MRs
                The Evolution of Congruence over Time
            Discussion

    CHAPTER 6: DEPENDENCIES, CONGRUENCE AND THEIR IMPACT ON SOFTWARE QUALITY
        STUDY II: THE STRUCTURE OF DEPENDENCIES, CONGRUENCE AND PRODUCT QUALITY
            Research Questions
            Method
                Description of the Data and Measures
            Results
                The Impact of Dependencies
                Stability Analysis
                Checks for Random Temporal Effects
            Discussion

    CHAPTER 7: THE EVOLUTION OF COORDINATION BEHAVIOR
        STUDY III: THE EVOLUTION OF COORDINATION BEHAVIOR
            Research Questions
            Method
                Description of the Data
                Description of Measures
            Results
                General Patterns of Coordination Behavior
                On the Relationship between Network Position and Productivity
                Stability of Coordination Patterns
                Drivers of Coordination Patterns
            Discussion

    CHAPTER 8: APPLICATIONS
        APPLICATIONS FOR SOFTWARE DEVELOPERS
            Enhancing coordination needs awareness
            Enhancing awareness of product dependencies
            Other applications of the congruence framework
        MANAGERIAL APPLICATIONS
            Project-wide view of coordination patterns
            Identifying critical software and organizational agents and units

    CHAPTER 9: CONCLUSIONS
        CONTRIBUTIONS
        LIMITATIONS
        FUTURE WORK
            Identification of coordination requirements in early stages of software projects
            The impact of formal roles in development organizations
            Communication beyond team and location boundaries and individual-level performance
            Applying the congruence framework in other types of tasks

    REFERENCES

    APPENDIX A: SURVEY FOR PROJECT D

    LIST OF TABLES

    TABLE 1: DESCRIPTIVE STATISTICS FOR DEPENDENT AND CONTROL VARIABLES
    TABLE 2: DESCRIPTIVE STATISTICS FOR CONGRUENCE MEASURES (FCT METHOD)
    TABLE 3: DESCRIPTIVE STATISTICS FOR CONGRUENCE MEASURES (CGRAPH METHOD)
    TABLE 4: PAIR-WISE CORRELATIONS
    TABLE 5: RESULTS FROM OLS REGRESSION OF EFFECTS ON RESOLUTION TIME (FCT METHOD)
    TABLE 6: RESULTS FROM OLS REGRESSION OF EFFECTS ON RESOLUTION TIME (CGRAPH METHOD)
    TABLE 7: EFFECT OF TIME ON CONGRUENCE
    TABLE 8: DIFFERENCES BETWEEN DEVELOPERS’ POPULATION
    TABLE 9: DESCRIPTIVE STATISTICS FOR LAST RELEASE OF PROJECT A
    TABLE 10: DESCRIPTIVE STATISTICS FOR LAST RELEASE OF PROJECT C
    TABLE 11: PAIR-WISE CORRELATIONS FOR LAST RELEASE OF PROJECT A (* P < 0.01)
    TABLE 12: PAIR-WISE CORRELATIONS FOR LAST RELEASE OF PROJECT C (* P < 0.01)
    TABLE 13: BASELINE MODEL FOR FAILURE PRONENESS
    TABLE 14: IMPACT OF SYNTACTIC DEPENDENCIES ON FAILURE PRONENESS
    TABLE 15: IMPACT OF LOGICAL DEPENDENCIES ON FAILURE PRONENESS
    TABLE 16: IMPACT OF WORKFLOW DEPENDENCIES ON FAILURE PRONENESS
    TABLE 17: IMPACT OF COORDINATION REQUIREMENTS ON FAILURE PRONENESS
    TABLE 18: IMPACT OF CONGRUENCE ON FAILURE PRONENESS
    TABLE 19: IMPACT OF TECHNICAL DEPENDENCIES, WORK DEPENDENCIES AND CONGRUENCE ACROSS RELEASES IN PROJECT A
    TABLE 20: IMPACT OF TECHNICAL DEPENDENCIES, WORK DEPENDENCIES AND CONGRUENCE ACROSS RELEASES IN PROJECT C
    TABLE 21: RANDOM-EFFECTS MODEL OF FAILURE PRONENESS
    TABLE 22: DESCRIPTIVE STATISTICS FOR IRC DATASET
    TABLE 23: RESULTS OF THE MULTI-LEVEL REGRESSION MODEL USING THE IRC DATA
    TABLE 24: RESULTS OF THE MULTI-LEVEL REGRESSION MODEL USING THE MR DATA
    TABLE 25: RESULTS FROM MULTI-LEVEL REGRESSION MODEL USING PROJECT D DATA
    TABLE 26: STABILITY OF THE COORDINATION NETWORKS
    TABLE 27: PREDICTING COORDINATION ACTIVITIES

    LIST OF FIGURES

    FIGURE 1: THE CONCEPT OF CONGRUENCE
    FIGURE 2: EVOLUTION OF THE DENSITY AND CLUSTERING LEVEL OF THE TD MATRICES (FCT METHOD)
    FIGURE 3: EVOLUTION OF THE CHANGE IN THE INFORMATION CONTAINED IN THE TD MATRICES (FCT METHOD)
    FIGURE 4: AVERAGE CUMULATIVE DENSITY OF THE TD MATRIX (FCT METHOD)
    FIGURE 5: EVOLUTION OF THE DENSITY LEVEL OF THE TD MATRICES (CGRAPH METHOD)
    FIGURE 6: EVOLUTION OF THE CHANGE IN THE INFORMATION CONTAINED IN THE TD MATRICES (CGRAPH METHOD)
    FIGURE 7: COMPARISON BETWEEN TD MATRICES GENERATED BY THE FCT AND CGRAPH METHODS
    FIGURE 8: EVOLUTION OF DENSITY AND CLUSTERING LEVEL OF THE CR MATRICES (FCT METHOD)
    FIGURE 9: EVOLUTION OF DENSITY AND CLUSTERING LEVEL OF THE CR MATRICES (CGRAPH METHOD)
    FIGURE 10: THE EVOLUTION OF COORDINATION REQUIREMENTS ON A MONTHLY BASIS
    FIGURE 11: THE EVOLUTION OF COORDINATION REQUIREMENTS IN OPEN SOURCE PROJECTS
    FIGURE 12: EVOLUTION OF THE CONGRUENCE MEASURES ACROSS RELEASES
    FIGURE 13: PROPORTION OF CHANGES PER DEVELOPER PER RELEASE
    FIGURE 14: CONGRUENCE MEASURES ACROSS RELEASES BASED ON TOP CONTRIBUTORS INTERACTIONS
    FIGURE 15: CONGRUENCE MEASURES ACROSS RELEASES FOR THE REST OF THE DEVELOPERS
    FIGURE 16: OVER TIME COORDINATION PATTERNS FROM THE MR SYSTEM DATA
    FIGURE 17: OVER TIME COORDINATION PATTERNS FROM THE IRC DATA
    FIGURE 18: COORDINATION PATTERNS ACROSS FORMAL TEAMS AND GEOGRAPHICAL LOCATIONS
    FIGURE 19: LOCATION X NETWORK POSITION INTERACTION EFFECT
    FIGURE 20: COORDINATION PATTERNS AND PRODUCTIVITY
    FIGURE 21: THE SIZE OF THE CORE GROUP OVER TIME AND TOP PERFORMERS MEMBERSHIP
    FIGURE 22: COMPOSITION OF THE CORE GROUP OVER TIME BY PRODUCTIVITY LEVELS
    FIGURE 23: AMOUNT OF CHANGE IN DYADS CONNECTIONS

    CHAPTER 1: INTRODUCTION

    Over the past couple of decades, geographically distributed work has become pervasive and software development organizations are no exception. Factors such as access to talent, acquisitions and the need to reduce the time-to-market of new products are the driving forces behind the increasing number of geographically distributed software development (GDSD) projects (Herbsleb & Moitra, 2001; Karolak, 1998). Unfortunately, this trend has its costs. Distance leads to numerous problems in communication and coordination and, ultimately, impacts the performance of software development teams (Herbsleb et al, 2000; Herbsleb & Mockus, 2003). The failure to identify work dependencies among developers or development teams results in coordination problems.

    A growing body of work on coordination in software development suggests that the identification and management of dependencies is a fundamental challenge in software development organizations, particularly in those that are geographically distributed (for example, Cataldo et al, 2007; de Souza, 2005; Grinter et al, 1999; Herbsleb et al, 2000; Herbsleb & Mockus, 2003). The modular product design literature has developed an important body of research on interdependency, for instance, the work on design structure matrices to find alternative structures that reduce dependencies among the various components of the system (Eppinger et al, 1994; Sullivan et al, 2001). Interdependency is central to organizations and it has also been a perennial research topic in organizational theory (DeSanctis et al, 1999; Staudenmeyer, 1997). Those research streams could inform the design of software development organizations so they are better able to identify and manage work dependencies. However, we first need to understand the assumptions of the different theoretical views and how those assumptions relate to the characteristics of software development tasks.

    The Nature of Software Development and Modular Design

    The idea of dividing a complex task into smaller manageable units is consistent with the reductionist view (Simon, 1962; von Hippel, 1990), which is well developed in the product development literature (Eppinger et al, 1994). Projects typically have a general description of the system’s components and their relationships, or a more detailed report such as an architectural or high-level design document. Managers use the information in those documents to divide the development effort into work items that are assigned to specific development teams, minimizing the interdependencies among those teams (Conway, 1968; Eppinger et al, 1994; Sullivan et al, 2001). In the system design literature, it has long been speculated that the structure of a product inevitably resembles the structure of the organization that designs it (Conway, 1968). In his original formulation, Conway reasoned that coordinating product design decisions requires communication among the engineers making those decisions. If everyone needs to talk to everyone, the communication overhead does not scale well for projects of any appreciable size. Therefore, products must be split into components, with limited technical dependencies² among them, and each component assigned to a single team. Conway (1968) proposed that the component structure and organizational structure stand in a homomorphic relation, in that more than one component can be assigned to a team, but a component must be assigned to a single team.

    ² The terms “technical dependency” and “product dependency” are used interchangeably throughout this dissertation.
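    The assignment rule described above (several components may share a team, but each component belongs to exactly one team) can be illustrated with a minimal Python sketch. The component names, team names and both helper functions are hypothetical and are used only for illustration; they are not taken from Conway (1968) or from this dissertation.

```python
from collections import defaultdict


def single_team_violations(components, assignments):
    """Return components that are not assigned to exactly one team.

    `assignments` is an iterable of (component, team) pairs.
    Unassigned components are reported with an empty team list.
    """
    teams_by_component = defaultdict(set)
    for component, team in assignments:
        teams_by_component[component].add(team)
    return {
        c: sorted(teams_by_component[c])
        for c in components
        if len(teams_by_component[c]) != 1
    }


def induced_team_dependencies(component_deps, team_of):
    """Map component-level dependencies onto the teams owning them.

    Dependencies between components owned by the same team are dropped,
    because they create no cross-team coordination need in this model.
    """
    return {
        (team_of[a], team_of[b])
        for a, b in component_deps
        if team_of[a] != team_of[b]
    }


# Hypothetical example: three components, two teams.
components = ["ui", "billing", "storage"]
assignments = [("ui", "team_x"), ("billing", "team_x"), ("storage", "team_y")]
component_deps = [("ui", "billing"), ("billing", "storage")]

violations = single_team_violations(components, assignments)
team_deps = induced_team_dependencies(component_deps, dict(assignments))

# Assigning "storage" to a second team breaks the single-team rule.
split = single_team_violations(components, assignments + [("storage", "team_x")])
```

    In this toy example, only the dependency between "billing" and "storage" crosses a team boundary, which is the sense in which limited technical dependencies are expected to limit the need for communication between teams.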

    A similar argument has been proposed in the strategic management literature. Baldwin and Clark (2000, page 90) argued that modularization makes complexity manageable, enables parallel work and tolerates uncertainty. Design decisions are hidden within the modules, which communicate through standard interfaces; modularization thus adds value by allowing independent experimentation with, and substitution of, modules (Baldwin & Clark, 2000). Moreover, Baldwin and Clark (2000, page 89) argued that a modular design structure leads to an equivalent modular task structure. Their view thus aligns with Conway’s idea that one or more modules can be assigned to one organizational unit and work can be conducted almost independently of others. In the context of software engineering, a similar approach was first articulated by Parnas (1972) as modular software design. Parnas (1972) argued that modules ought to be considered work items instead of just a collection of subprograms. Development work can then proceed independently and in parallel across different modules. Parnas’ views also coincide with the theoretical arguments from the product design and strategic management literatures.

    All three theoretical views rely on two interrelated assumptions. The authors assume a simple and obvious relationship between product modularization and task modularization. Hence, the modularization theories argue, reducing the technical interdependencies among modules reduces task interdependencies, which in turn reduces the need for communication among work groups. Unfortunately, there are several problems with these assumptions. First, existing software modularization approaches only use a subset of the technical dependencies of a software system, typically syntactic relationships (Garcia et al, 2007). Consequently, potentially relevant work dependencies might be ignored. Secondly, recent empirical evidence indicates that the relationship between product structure and task structure is not as simple as previously assumed. Moreover, the theorized similarity between product and task structures diminishes over time (Cataldo et al, 2006).

    Thirdly, promoting minimal communication between teams responsible for interdependent modules is problematic. The computer-mediated communication literature suggests that loosely coupling tasks is the appropriate approach when teams are geographically distributed (Olson & Olson, 2000). However, recent studies suggest that minimal communication between teams, collocated or distributed, is detrimental to the success of projects. The product development literature argues that information hiding, which leads to minimal communication between teams, is an inevitable antecedent of variability in the evolution of projects, typically resulting in integration problems (Yassine et al, 2003). In the context of software development, de Souza and colleagues (2004) found that information hiding led development teams to be unaware of other teams’ work, resulting in coordination problems. Grinter and colleagues (1999) reported similar findings for geographically distributed software development projects. The authors highlighted that the main consequence of reducing the teams’ need to communicate was to increase costs because problems were discovered too late in the development process. Those findings do not suggest that modularization is not useful; rather, they highlight the need to supplement it with coordination mechanisms that allow developers to deal correctly with the assumptions that are not captured in the specification of the dependencies.

  • 6

    Another problem associated with the assumptions of modular design is the nature

    and stability of the interfaces between software modules. Although, the program

    dependency literature defines technical dependencies as a syntactic or semantic

    relationship between statements (Podgurski & Clarke, 1990), the same ideas are applied

    at the level of modules. Then, relationships among modules could also range from

    syntactic, for instance a function call from module A to module B, to more complex

    semantic dependencies where, for example, the computations done in one module affect

    the behavior of another module. Some authors refer to those types of semantic

    dependencies as dynamic (Bass et al, 2003) or logical (Gall et al, 1998). Even in the

    simple case of a function call between two modules, the complexity and the degree of

    dependency varies, for instance, if we consider the number of parameters of a function

    call or we compare parameters passed by value versus parameters passed by reference.

    Cataldo et al (2007) presented case studies where even simple interfaces between

    modules developed by remote teams create coordination breakdown and integration

    problems. The authors reported that semantic dependencies were even more problematic

    and they argued that the developers’ ability to identify and manage dependencies was

    hindered by several inter-related factors such as development processes, organizational

    attributes (e.g. structure, management style) and uncertainty of the interfaces. In a field

    study of a large software project, de Souza (2005) found that interfaces tended to

    change often and their design details tended to be incomplete, leading to serious

    integration problems. These findings argue that the interfaces between software modules

    might differ in complexity and, often, it is not possible to specify those interfaces at the

    necessary level of detail, increasing the likelihood of future changes to them. This lack of

    stability represents a constant challenge for software development organizations.

    In sum, the modularization approach is a very useful tool for dividing the

    development of a complex software system into manageable units. However,

    modularization is not a sufficient representation of work dependencies in software

    development activities. The relationship between the task dependency structure and the

    product structure is not as simple as theorized. Appropriate mechanisms are then required

    to identify relevant work dependencies and, consequently, maintain suitable levels of

    communication and coordination among teams developing interdependent modules,

    particularly, in the case of geographically distributed software development.

    The Nature of Software Development and Interdependency Theories

    Coordination is a central concept in organizations. The idea of dividing labor

    into interdependent units is well developed, and mechanisms for coping with varying

    degrees of interdependency have been proposed in the traditional organizational literature

    (for instance, March & Simon, 1958; Thompson, 1967; Galbraith, 1973; Staudenmayer,

    1997). More recent work, particularly in organizational design, has focused on

    computational and mathematical approaches to examine how organizational designs, that

    use different models of communication and coordination, are affected by factors such as

    stress, task decomposition, quality of information exchanged, and ability to adapt (for

    instance, Carley and Lin, 1995, 1997; Handley & Levis, 2001; Perdu & Levis, 1998).

    Then both streams of work, traditional organizational theory and computational and

    mathematical organizational theory (CMOT), are relevant to the problem of coordination

    in software development projects.

    In the traditional organizational theory, March and Simon (1958) argued that

    coordination encompasses more than just a traditional division of labor and assignment of

    tasks. The authors proposed numerous mechanisms such as the division of the task into

    nearly independent parts and they also argued that schedules and feedback mechanisms

    are required when interdependence is unavoidable. Thompson (1967) extended March

    and Simon’s work by matching three mechanisms: standardization, plan, and mutual

    adjustment, to stylized categorizations of dependencies such as pooled, sequential, and

    reciprocal. Galbraith (1973) argued that low levels of interdependency can be managed

    by traditional mechanisms such as rules and programs. However, as the level of

    interdependency increases additional mechanisms are required such as slack resources

    and lateral communication (Galbraith, 1973). Mintzberg (1979) took an organizational-

    level perspective and argued that specific coordination mechanisms are properties of

    particular kinds of organizations and environments. Crowston (1991) developed a

    typology of coordination problems to catalog coordination mechanisms that address

    specific types of interdependencies. Staudenmayer (1997) grouped the contributions of

    March and Simon, Thompson, and others into the information processing theories of

    interdependency which, she argued, rely on the assumptions of determinism and stability.

    In other words, those theoretical views focus on predictable and static tasks

    (Staudenmayer, 1997). This limitation of the information processing argument is not

    problematic if software development tasks can be identified a priori and the set of

    interdependencies that arise from the division of labor are managed with the appropriate

    set of mechanisms. If we think in terms of project management activities, coarse-grain

    development activities such as “develop component A” or “implement feature X” can

    typically be identified at relatively early stages of the projects. Some dependencies among

    those development tasks are typically easy to identify. For instance, particular work items

    need to be finished before other work items can start. Work items that can only be

    assigned to specific teams because of the skill set required would represent another

    example. Then, specific organizational forms can be used to manage the dependencies

    among those coarse-grain development tasks (Malone & Crowston, 1991), even in the

    case of geographically distributed development organizations (Grinter et al, 1999).

    Unfortunately, there are several characteristics of software development activities

    that limit the applicability of traditional organizational theories as well as the more recent

    CMOT work. First, it is widely accepted among software engineering researchers and

    practitioners that the requirements of the system become known over time or those

    requirements change as time progresses (Leffingwell & Widrig, 2003). In some cases the

    changes in the requirements result in minor alterations of specific development tasks. In

    other cases, new features have to be added or features under development are eliminated.

    These events introduce a certain level of dynamism in software development that

    challenges the determinism and stability assumptions of the information processing views

    of interdependency.

    Secondly, the dynamic nature of finer-grain dependencies that arise as part of the

    development of a piece of code is not well suited for traditional organizational theories of

    coordination. The act of developing a software system consists of a collection of design

    decisions, either at the architectural level or at the implementation level. Those design

    decisions introduce constraints that might establish new dependencies among the various

    parts of the system, modify existing ones or even eliminate dependencies. The changes in

    dependencies can generate new coordination requirements that are quite difficult to

    identify a priori, particularly when they are not obvious, or as a project matures over time

    (Henderson & Clark, 1990; Sosa et al, 2004). Failure to discover the changes in

    coordination needs might have a profound impact on the quality of the product (Curtis et

    al, 1988), on productivity (Herbsleb & Mockus, 2003) and even on the projects’ overall

    design (Bass et al, 2006). In addition, little is known about the specific impact of the

    various types of dependencies that arise among parts of a software system such as explicit

    versus implicit dependencies or syntactic versus logical dependencies. Then, the use of

    the computational and mathematical organizational theory approaches is limited because

    of the lack of a theoretical framework that guides the modeling of the relationships

    between the organizational tasks, their dependencies and the need to communicate and

    coordinate.

    In sum, software development tasks are embedded in an evolving network of

    coordination requirements that need to be satisfied. The coarse-grain and idealized

    approaches suggested by the organization theory literature are not appropriate to identify

    and manage such a dynamic web of interdependencies. A finer-grain view of

    coordination would provide a better framework in dynamic knowledge-intensive tasks

    such as software development.

    Research Questions

    In the previous sections, I highlighted the limitations of the current mechanisms

    for identifying and managing dependencies in geographically distributed software

    development organizations. Product modularization does not necessarily yield an

    equivalent task modularization structure and additional mechanisms are required to

    maintain appropriate levels of coordination among workgroups. The nature of software

    development such as the attributes and stability of interfaces among modules and the

    dynamics of technical dependencies, limit the applicability of established task

    decomposability and coordination approaches. Moreover, these characteristics are a

    constant challenge for software development organizations, particularly, for those

    geographically distributed. This dissertation addresses the problem of work dependencies

    in software development by examining how to use technical dependencies to determine

    work dependencies and by investigating the impact of those work dependencies in the

    development process. Specifically, I address the following general research questions:

    RQ 1: How can relevant task dependencies be identified from technical

    dependencies?

    RQ 2: What is the impact of those task dependencies on traditional outcome

    variables such as productivity and quality?

    The rest of this document is organized as follows. Chapter 2 presents a framework

    for identifying and managing dependencies. Chapter 3 introduces terminology used in

    this dissertation and describes the various datasets used in the empirical studies. In

    chapter 4, I examine different methods of identifying work dependencies from technical

    dependencies. Chapter 5 presents the first empirical study that examines the impact on

    development productivity of the mismatches between coordination requirements and

    coordination behavior. In chapter 6, I study the impact of the structure of technical and

    work dependencies on software quality. The last empirical study which explores the

    usage of the proposed framework for examining the relationship between coordination

    behavior and developer-level performance is described in chapter 7. Chapter 8 describes

    developer and managerial applications of the results reported in this dissertation. Finally,

    chapter 9 describes the contributions of this research endeavor, its limitations as well as

    future research directions.

    CHAPTER 2: A FRAMEWORK FOR IDENTIFICATION OF WORK

    DEPENDENCIES

    It has long been observed that organizations carry out complex tasks by dividing

    them into smaller interdependent work units assigned to groups and coordination arises as

    a response to those interdependent activities (March & Simon, 1958). Communication

    channels emerge in the formal and informal organizations. Over time, those information

    conduits develop around the interactions that are most critical to the organization’s main

    task (Galbraith, 1973). This is particularly important in product development

    organizations which organize themselves around their products’ architectures because the

    main components of their products define the organization’s key subtasks (von Hippel,

    1990). Organizations also develop filters that identify the most relevant information

    pertinent to the task at hand (Daft & Weick, 1990). Changes in task dependencies,

    however, jeopardize the appropriateness of the information flows and filters and can

    disrupt the organization’s ability to coordinate effectively. For example, Henderson &

    Clark (1990) found that minor changes in product architecture can generate substantial

    changes in task dependencies, and can have drastic consequences for the organizations’

    ability to coordinate work. If effective ways of identifying detailed work dependencies

    and tracking their changes over time existed, we would be in a much better position to

    design mechanisms that could help to align information flow with work dependencies.

    Identifying work dependencies and determining the appropriate coordination

    mechanism to address the dependencies is not a trivial problem. Coordination is a

    recurrent topic in the organizational theory literature and many stylized types of task

    dependencies and coordination mechanisms have been proposed over the past several

    decades (Crowston, 1991; Galbraith, 1973; Malone & Crowston, 1994; March & Simon,

    1958; Mintzberg, 1979; Thompson, 1967). However, numerous types of work, in

    particular non-routine knowledge-intensive activities, are potentially full of fine-grain

    dependencies that might change on a daily or hourly basis. Conventional coordination

    mechanisms like standard operating procedures or routines would have very limited

    applicability in these dynamic contexts. Therefore, designing mechanisms to handle

    rapidly shifting coordination needs requires a more fine-grained level of analysis than

    what the traditional views of coordination provide.

    In the context of software development, a technical dependency in the software

    system represents a coordination need that relevant software developers might need to

    address. Ignoring coordination requirements could lead to an increased number

    of defects, integration problems and longer development times (Curtis et al, 1988;

    Espinosa et al, 2002; Kraut et al, 1995; Herbsleb & Mockus, 2003). When members of a

    team are physically collocated and coordination requirements involve individuals from

    the same team, there are numerous ways for team members to identify the needs to

    coordinate and act on them such as group and status meetings and managerial

    intervention. The problem of identifying the need to coordinate is further complicated

    when coordination requirements change rapidly (Cataldo et al, 2006). In this chapter, I

    present a framework to determine the coordination requirements among developers. The

    objective of the framework is two-fold. The first is to provide a fine-grain level of analysis of

    coordination. The second objective is to allow for identification of work dependencies

    from alternative representations of technical dependencies of the system. I also propose a

    measure of “fit” between work dependencies and the coordination activities performed by

    the software developers.

    The Concept of Socio-Technical Congruence

    Product development endeavors involve two fundamental elements: a technical

    and a social component. The technical properties of the product to develop, the processes,

    the tasks, and the technology employed in the development effort constitute the technical

    component. The second element is composed of the members of the organization involved

    in the development process, their attitudes and behaviors. In other words, a product

    development project can be thought of as a socio-technical system where the two

    components, the technical and the social elements, need to be aligned in order to have a

    successful project. Then, a key issue is to understand how we can examine the

    relationship between those two, the technical and the social, dimensions. Two lines of

    work are particularly relevant in this context. First, the concept of “fit” from

    organizational literature refers to the match between a particular organizational design

    and the organization’s ability to carry out a task (Burton & Obel, 1998). The work in this

    line of research has, traditionally, focused on two factors: the temporal dependencies

    among tasks that are assigned to organizational groups and the formal organizational

    structure as a means of communication and coordination (Carley & Ren, 2001; Levchuk

    et al, 2004). Secondly, the research on dynamic analysis of social networks provides an

    innovative approach, called the meta-matrix, to examine the dynamic co-evolution of

    relationships among multiple types of entities such as resources, tasks, and individuals

    (Carley, 2002; Krackhardt & Carley, 1998). The concept of socio-technical congruence

    presented in this chapter builds on the idea of “fit” from the organizational theory

    literature and, from a mathematical standpoint, builds on the meta-matrix model from the

    dynamic network analysis literature. Combining those two lines of research allows for

    two important contributions to the literature. First, the socio-technical congruence

    framework presented here provides a fine-grain level of analysis. Secondly, the measure

    facilitates assessing the role of coordination activities in multiple and complementary

    ways as well as examining the impact of several types of dependencies.

    Figure 1 presents an intuitive representation of the measure of congruence

    formally defined later in this chapter. A group of workers have a set of work

    dependencies which defines a set of coordination requirements. When the coordination

    activities carried out by those workers define a pattern of coordination similar to that

    defined by the coordination requirements (case A in Figure 1), we have high levels of

    congruence or “good fit”. If the patterns of coordination requirements and coordination

    activities do not match, we have low levels of congruence or a “poor fit” (case B in

    Figure 1).

    Formally, socio-technical congruence is defined as the match between the

    coordination requirements established by the dependencies among tasks and the actual

    coordination activities carried out by the workers. In other words, the concept of

    congruence has two components, coordination needs and coordination activities, and the

    following sections discuss the mathematical framework to measure them.

    Figure 1: The Concept of Congruence

    Identification of Coordination Requirements

    In order to identify which set of individuals should be coordinating their

    activities, we need to represent two sets of relationships. One set is represented by which

    individuals are working on which tasks. The relationships or dependencies among tasks

    represent the second element. Past research has used a matrix formalization to capture

    and relate those two pieces of information. For instance, Carley and Ren (2001) proposed

    a metric, called resource congruence, to measure the relationship between the resources

    required to perform a task and workers’ access to those resources. The same metric was

    further examined by Carley and colleagues (2003) in the context of covert networks.

    In the framework proposed in this chapter, the assignment of individuals to

    particular work items is represented by a people-by-task matrix where a one in cell ij

    indicates that worker i is assigned to task j. I will refer to this matrix as Task Assignments

    (TA). Following the same approach, the set of dependencies among tasks can be

    represented as a square matrix where a cell ij (or cell ji) indicates that task i and task j are

    interdependent. I will refer to this matrix as Task Dependencies (TD). Now, if the Task

    Assignment and Task Dependencies matrices are multiplied, a people by task matrix is

    obtained that represents the set of tasks a particular worker should be aware of, given the

    work items the person is responsible for and the dependencies of those work items with

    other tasks. Finally, a representation of the coordination requirements among the

    different workers is obtained by multiplying the product of the Task Assignment and Task

    Dependencies matrices by the transpose of the Task Assignment matrix. This product

    results in a people by people matrix where a cell ij (or cell ji) indicates the extent to

    which person i works on tasks that share dependencies with the tasks worked on by

    person j. In other words, the resulting matrix represents the Coordination Requirements

    or the extent to which each pair of people needs to coordinate their work. Formally, the

    Coordination Requirements matrix is determined by the following product:

    $\mathit{CR} = \mathit{TA} \times \mathit{TD} \times \mathit{TA}^{T}$ (Equation 1)

    where TA is the Task Assignments matrix, TD is the Task Dependencies matrix and $\mathit{TA}^{T}$

    is the transpose of the Task Assignments matrix.
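
    As a minimal illustration of Equation 1, the following Python sketch computes a Coordination Requirements matrix for a hypothetical project with three developers and four tasks. The data and the helper names (matmul, transpose, coordination_requirements) are assumptions made for this example only, as is the choice of a symmetric 0/1 Task Dependencies matrix with a zero diagonal.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def coordination_requirements(ta, td):
    """Compute CR = TA * TD * TA^T (Equation 1)."""
    return matmul(matmul(ta, td), transpose(ta))


# Hypothetical Task Assignments: rows are developers, columns are tasks.
TA = [
    [1, 1, 0, 0],  # developer 0 works on tasks 0 and 1
    [0, 0, 1, 0],  # developer 1 works on task 2
    [0, 0, 0, 1],  # developer 2 works on task 3
]

# Hypothetical Task Dependencies: tasks 1 and 2 are interdependent,
# and so are tasks 2 and 3 (symmetric, zero diagonal).
TD = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

CR = coordination_requirements(TA, TD)
for row in CR:
    print(row)
```

    In this example, developer 1 ends up with a non-zero entry for both developer 0 and developer 2, because task 2 shares dependencies with tasks owned by each of them, while developers 0 and 2 have no coordination requirement with each other.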

    This framework provides alternative ways of thinking about coordination

    requirements among workers depending on what type of data is used to populate the Task

    Dependencies matrix. Past work has focused on temporal relationships between tasks, for

    instance, task A needs to be done before task B (e.g. Levchuk et al, 2003). In the context

    of software development, such way of thinking about task dependencies is quite common.

    Alternative views could be based on high level roles in the development organizations

    (e.g. integration and testing depends on development) or task dependencies based on

    product dependencies in the actual software code (e.g. function calls between modules).

    The focus of this dissertation is on the relationship between the work dependency structure

    and the product dependency structure because, as argued in chapter 1, the difficulty of identifying and

    managing certain types of product dependencies is a critical factor in coordination

    success and ultimately in productivity and quality.

    Measuring Socio-Technical Congruence

    Given a particular Coordination Requirements matrix constructed from relating

    product dependencies to work dependencies, we can compare it to an Actual

    Coordination (CA) matrix that represents the interactions workers engaged in through

    different means of coordination. I refer to the match between those two matrices as

    socio-technical congruence. Then, given a particular set of dependencies among tasks,

    congruence is the proportion of coordination activities that actually occurred (given by

    the Actual Coordination matrix) relative to the total number of coordination activities that

    should have taken place (given by the Coordination Requirements matrix). For example,

    if the Coordination Requirements matrix shows that 10 pairs should coordinate, and of

    these, 5 show Actual Coordination interactions, then the congruence is 0.5. Formally, we

    define congruence as follows:

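    $\mathrm{Diff}(\mathit{CR}, \mathit{CA}) = \mathrm{card}\,\{\, \mathit{diff}_{ij} \mid cr_{ij} > 0 \wedge ca_{ij} > 0 \,\}$

    $|\mathit{CR}| = \mathrm{card}\,\{\, cr_{ij} > 0 \,\}$

    We have,

    $\mathrm{Congruence}(\mathit{CR}, \mathit{CA}) = \mathrm{Diff}(\mathit{CR}, \mathit{CA}) \,/\, |\mathit{CR}|$ (Equation 2)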

    In sum, the value of congruence belongs to the [0,1] interval that represents the

    proportion of coordination requirements that were satisfied through some type of

    coordination activity or mechanism. The measure of socio-technical congruence proposed

    here provides a new way of thinking about coordination, particularly, by providing a fine-

    grain level of analysis of different types of product dependencies and allowing us to

    examine how coordination needs are impacted by them.
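
    The congruence measure can be sketched in a few lines of Python. The matrices below are hypothetical, the function name congruence is introduced only for this illustration, and returning None when no coordination is required is an assumption of the sketch rather than part of the definition.

```python
def congruence(cr, ca):
    """Proportion of required coordination cells that show actual coordination.

    cr: Coordination Requirements matrix (people by people).
    ca: Actual Coordination matrix of the same shape.
    """
    required = [
        (i, j)
        for i, row in enumerate(cr)
        for j, value in enumerate(row)
        if value > 0
    ]
    if not required:
        return None  # assumption: congruence is undefined with no requirements
    satisfied = [(i, j) for (i, j) in required if ca[i][j] > 0]
    return len(satisfied) / len(required)


# Hypothetical requirements: developer 1 needs to coordinate with 0 and 2.
CR = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

# Hypothetical observed coordination: only developers 0 and 1 interacted.
CA = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

print(congruence(CR, CA))
```

    Here four cells of the requirements matrix are non-zero and two of them are matched by actual coordination, so the sketch yields a congruence of 0.5.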

    CHAPTER 3: TERMINOLOGY AND DESCRIPTION OF THE

    DATASETS

    Terminology

    In this section, I define several terms that are used throughout the empirical studies as

    well as in the description of the datasets:

    Source code file: A source code file represents a collection of functions, methods, and

    data type declarations and definitions that implement part of or an entire functionality of

    a software system. In this dissertation, I will use the terms source code file and module

    interchangeably. This definition does not refer or imply any specific way of partitioning a

    system into implementation modules.

    Commit: A commit represents an actual modification to one or more source code files in

    the version control system. A particular commit contains at least the following attributes: a

    date of submission, an author or developer responsible, a list of one or more files and the

    modifications to those files. The terms submission and changelist are used as synonyms

    of a commit throughout this document.

    Modification request (MR): A modification request represents a work item that refers to a

    conceptual change to the software that involves modifications to a set of source code files

    (Mockus & Weiss, 2000). The changes could represent the development of new

    functionality or the resolution of a defect encountered by a developer, the quality

    assurance organization or reported by a customer. A modification request consists of one

    or more commits from a version control system.

    Lines of code (LOC): In various parts of the dissertation, we refer to lines of code as a

    measure of size of a system or a module. The measure refers to non-blank non-comment

    lines of code.

    Datasets

    In order to address the research questions outlined in chapter 1, data from several

    geographically distributed software development projects was collected. The

    characteristics of those projects and the data are described in the rest of this chapter.

    Project A

    I collected data from a software development project of a large distributed system

    produced by a company that operates in the data storage industry. The data covered a

    period of 39 months of development activity and the first four releases of the product.

    The company had one hundred and fourteen developers grouped into eight development

    teams distributed across three development locations. All the developers worked full time

    on the project during the time period covered by the data. The system was composed of

    about 5 million lines of code distributed in 7737 source code files mostly in C language

    and a small portion (117 files and less than 96000 lines of code) in C++ language. The

    data corresponding to a total of 8,257 resolved modification requests were identified.

    Those MRs involved 67,652 commits to the version control system.

    Software developers communicated and coordinated using various means.

    Opportunities for interaction existed when working in the same formal team or when

    working in the same location. Developers also used tools such as Internet Relay Chat

    (IRC) and an MR tracking system to interact and coordinate their work. For instance, the

    MR tracking system kept track of the progress of the task, comments and observations

    made by developers as well as additional material used in the development process. I

    collected communication and coordination information from these two systems. Finally, I

    also collected demographic data about the developers such as their programming and

    domain experience and level of formal education.

    Project A represents the main source of data for the various empirical studies

    presented in this dissertation. In order to address potential external validity concerns, data

    from additional projects was used in each empirical study. Those projects are described in

    the following paragraphs.

    Project B

    Version control data from three open source projects from the Apache Software

    Foundation was collected. I focused on changes to the software that were associated with

    modification requests that were resolved between February of 2001 and January of 2003.

    There were a total of 1068 modification requests resolved in that timeframe involving

    1972 commits in the version control system. Those modification requests were related to

    three different projects, Ant, Tomcat and Struts, where a total of seventy-five engineers

    participated in the development effort.

    Project C

    The project involved the development of an embedded software system for a

    communications device developed by a major telecommunications company. Forty

    engineers participated in the project. The data covered a period of five years and the last

    six releases of the product. All the developers but one worked in the same development

    facility located in the United States. The remote developer worked in Australia. The

    system was composed of approximately 1.2 million lines of C and C++ code distributed

    in 1224 modules with 427 modules written in C++. Data associated with

    about 7000 modification requests constituted the dataset.

    Project D

    This project was a large medical device system where the development

    organization had eighty-three engineers grouped into 10 teams distributed across four

    development locations, one in India, one in Eastern Europe and two in the United States.

    Architects, some of the technical leads and managers were also in the development

    facilities located in the United States. All the developers worked full time on the project

    during the time period covered by the data. Engineers had formal roles such as architect,

    team lead, tester or developer. The project was organized into iterations which constitute

    fixed periods of time, about 8 weeks, focused on the development of a set of

    requirements defined at the beginning of the iteration. The data covered the 7th iteration

    of the project. A survey instrument based on a roster approach was used to collect

    coordination activity twice during the development iteration.

    CHAPTER 4: METHODS FOR IDENTIFYING WORK

    DEPENDENCIES IN SOFTWARE DEVELOPMENT PROJECTS

    In this chapter, I explore different methods of determining work dependencies

    from product dependencies (e.g. relationships among the source code files of a software

    system). Then, those work dependencies will allow us to identify coordination

    requirements among software developers as proposed in the congruence framework

    introduced in chapter 2.

    Two Approaches to Determine Product Dependencies in Software Systems

    The traditional view of software dependency has its origins in compiler

    optimizations and focuses on control and dataflow relationships (Horwitz et al, 1990).

    This approach extracts relational information between specific units of analysis such as

    statements, functions or methods, as well as modules, typically, from the source code of a

    system or from an intermediate representation of the software code such as bytecodes or

    abstract syntax trees. These relationships can represent either a data-related dependency

    (e.g. a particular data structure modified by a function and used in another function) or a

    functional dependency (e.g. method A calls method B). This type of dependency analysis

    technique has been widely used in a research context to examine the relationship

    between coupling and quality of a software system (e.g. Hutchins & Basili, 1985; Selby

    & Basili, 1991). Syntactic dependency analyses are also used by software developers to

    improve their understanding of programs and the linkages among the various parts of

    those programs (Murphy et al, 1998).


    One characteristic of these relational structures such as a call-graph, and for that

    matter other graphs such as inheritance and data dependency graphs, is that they provide

    a particular view of the system-wide structure. Moreover, the accuracy of the information

    represented in these graphs depends on the ability of the tool used to identify all the

    appropriate types of syntactic relationships allowed by the underlying programming

    language (Murphy et al, 1998).

    An alternative mechanism of identifying dependencies consists of examining the

    set of source code files that are modified together as part of a modification request. This

    approach is equivalent to the approach proposed by Gall and colleagues (1998) in the

    software evolution literature to identify logical dependencies between modules. A source

    code file can be viewed as representing a “bundle” of technical decisions. If a

    modification request can be implemented by changing only one file, it provides no

    evidence of any dependencies among files. However, when a modification request

    requires changes to more than one file, it can be assumed that decisions about the change

    to one file in a modification request depend in some way on the decisions made about

    changes to the other files involved in implementing the modification request.

    Dependencies could range from syntactic, for instance a function call between files, to

    more complex semantic dependencies where the computations done in one file affect

    the behavior of another file. This approach would represent a better estimate for

    semantic dependencies relative to call graphs or data graphs because it does not rely on

    language constructs to establish the dependency relationship between source code files.

    The remainder of this dissertation refers to this approach to identify dependencies as the

    “Files Changed Together” (FCT) method. I will refer to the method to identify


    dependencies based on syntactic functional and data relationships described earlier as the

    CGRAPH method.
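As an illustration of the FCT idea, the following is a minimal sketch, not the tooling used in this dissertation. It assumes each modification request is available as a collection of file names (a hypothetical input format) and counts how often each unordered pair of files was changed together. The function name `fct_dependencies` and the sample file names are invented for the example.

```python
from collections import Counter
from itertools import combinations


def fct_dependencies(modification_requests):
    """Count how often each unordered pair of files changed together.

    modification_requests: iterable of collections of file names, one
    collection per modification request (hypothetical input format).
    """
    pair_counts = Counter()
    for files in modification_requests:
        # A request that touches a single file yields no pairs, so it
        # contributes no evidence of a dependency.
        for a, b in combinations(sorted(set(files)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical modification requests for illustration only.
mrs = [
    {"parser.c", "lexer.c"},
    {"parser.c", "lexer.c", "ast.c"},
    {"log.c"},
]
td_fct = fct_dependencies(mrs)
print(td_fct[("lexer.c", "parser.c")])  # 2
```

Sorting the file names before pairing keeps each pair in a canonical order, which reflects the fact that the FCT relationship has no direction.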

    The Task Dependency (TD) matrices produced by the techniques described in the

    previous paragraphs could change over time as new product dependencies are created or

    existing ones are removed. Moreover, the information captured by the TD matrix

    constructed with the FCT method might differ from the TD matrix constructed with the

    CGRAPH method. Those changes or differences could potentially impact the measures of

    coordination requirements (equation 1) and congruence (equation 2). Then,

    understanding the general properties of the task dependency matrices, how they evolve

    over time and how they differ from each other is critical to assess the impact of

    socio-technical congruence on outcome variables such as development productivity and

    software quality. The following sections address these issues using the data from Project

    A.

    General Properties and Evolution of the FCT Task Dependency Matrix

    Using the FCT method, I constructed monthly TD matrices which captured all the

    changes to the code associated with the set of modification requests resolved in each month.

    Since a graph and a matrix are equivalent representations of a set of relational data, I can

    use widely accepted graph measures to examine the general properties of the TD matrices³.

    One basic measure is the density of the graph, which provides a general idea of the level

    of interconnectivity among the nodes of the graph. In this research context, density

    translates to the overall degree of interdependence amongst the source code files in the

    system. A second useful network measure is the clustering coefficient (Watts, 1999), which

    indicates the extent to which there are clusters of interdependent source code files that are

    also interdependent amongst themselves. Those two measures, density and clustering

    coefficient, provide a general view of the structural properties of the TD matrices.

    ³ I use the terms graph and network interchangeably throughout the dissertation.

    Figure 2 shows the evolution of the density and clustering coefficient measures

    over the time covered by the data. The density of the monthly TD matrices is relatively

    low, with a few exceptions where the levels of density exceed 0.01 (avg=0.0033,

    min=0.0004, max=0.0204). The clustering coefficient measure shows modest levels

    (avg=0.0925, min=0.0023, max=0.1774) suggesting a small degree of interdependent

    clusters of files in the TD matrices. In sum, the results indicate that, on a monthly basis, a

    small set of dependencies is identified, and those dependencies tend to be modestly

    clustered.
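The two measures can be sketched as follows for an undirected dependency graph. This is an illustrative sketch under stated assumptions rather than the computation used for Figure 2: density is taken as the fraction of possible file pairs that are linked, and the clustering coefficient is taken as the average of the local clustering coefficients, with files that have fewer than two neighbours contributing zero (conventions differ on that point). The function names and sample data are hypothetical.

```python
def density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0


def clustering_coefficient(nodes, edges):
    """Average local clustering coefficient of an undirected graph.

    Assumption: nodes with fewer than two neighbours contribute 0.
    """
    neighbours = {v: set() for v in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    total = 0.0
    for v in nodes:
        nbrs = neighbours[v]
        k = len(nbrs)
        if k < 2:
            continue
        # Count the links that exist among the neighbours of v.
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in neighbours[a])
        total += links / (k * (k - 1) / 2)
    return total / len(nodes) if nodes else 0.0


# Hypothetical files and dependencies (each pair listed once).
files = ["a.c", "b.c", "c.c", "d.c"]
deps = {("a.c", "b.c"), ("a.c", "c.c"), ("b.c", "c.c"), ("c.c", "d.c")}
print(density(len(files), deps))
print(clustering_coefficient(files, deps))
```

In this toy graph, four of the six possible pairs are linked, and the triangle formed by the first three files is what drives the clustering value.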

    Figure 2: Evolution of the Density and Clustering level of the TD matrices (FCT method)

    [Plot: x-axis, Month in the Dataset; y-axis, Measure Level; series, Density and Clustering Coefficient.]

    An instance of a set of source code files changing together as part of a

    modification request represents a piece of evidence indicating the existence of a product

    dependency, potentially logical or implicit in nature. In order to capture the representative

    set of product dependencies, an understanding of the degree of change in the information

    contained in the TD matrices is required. If the matrices are relatively stable, that suggests

    that considering a short time slice could suffice to capture all relevant product

    dependencies. On the other hand, if the information contained in the monthly TD matrices

    changes significantly from time t to time t+1, it is necessary to identify the appropriate

    time window size that would yield an accurate representation of the product

    dependencies. Figure 3 shows the percentage of change in the information contained in a

    TD matrix from time t relative to the TD matrix from time t-1. The set of technical

    dependencies captured differs significantly from month to month with an average change

    of 37% (min=5.11%, max=49.94%). These results suggest that the changes to the source

    code are affecting different sets of source code files over time. Hence, it is necessary to

    explore how many months of information would constitute an accurate and representative

    set of technical dependencies that could be used to compute the Coordination

    Requirement matrices.
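The text does not spell out how the percentage of change between consecutive TD matrices is calculated. Purely as an illustration, the sketch below uses one plausible choice, the Jaccard distance between the two sets of dependencies (pairs present in only one month, divided by pairs present in either month). This definition is an assumption and may differ from the measure plotted in Figure 3. The function name `percent_change` is hypothetical.

```python
def percent_change(prev_edges, curr_edges):
    """Jaccard distance between two edge sets, as a percentage.

    Assumption: this is one possible change measure, not necessarily
    the one used to produce the figures in this chapter.
    """
    prev_edges, curr_edges = set(prev_edges), set(curr_edges)
    union = prev_edges | curr_edges
    if not union:
        return 0.0
    # Symmetric difference: dependencies that appear in only one month.
    return 100.0 * len(prev_edges ^ curr_edges) / len(union)


# Hypothetical dependency sets for two consecutive months.
month_1 = {("a.c", "b.c"), ("b.c", "c.c")}
month_2 = {("b.c", "c.c"), ("c.c", "d.c")}
print(percent_change(month_1, month_2))
```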

    Figure 3: Evolution of the Change in the Information Contained in the TD matrices (FCT method)

    [Plot: x-axis, Month in the Dataset; y-axis, Percentage of Change.]

    The following procedure was used to explore the time window size necessary to

    capture the relevant product dependencies. First, the union of all the k-tuples of

    consecutive TD matrices is computed, where k represents the number of months of data

    used to compute the new TD matrices and it ranges from 2 to 39 months. For instance, in

    the case of k=2, this computation outputs TD matrices that contain all the dependencies

    based on the changes made to the software between months 1 and 2, months 2 and 3,

    months 3 and 4, and so forth. The second step is to average the network density value of

    all the matrices associated with a particular value of k. Finally, I plotted that average

    value of network density for each value of k. Figure 4 depicts the results of this

    procedure. As the number of months of data considered to compute the TD matrix


    increases, the density level of that TD matrix increases monotonically until month 19

    where a density value of 0.0109 is reached. The remaining 20 months of data increase the

    density of the TD matrix from 0.0109 up to 0.01151. In other words, any additional month

    of data beyond 19 months does not yield a significant increase in the value of the density

    of the TD matrix, indicating that any additional month of data does not contribute any

    additional information value in terms of technical dependencies. In view of this result, I

    used a time period of 19 months to compute the TD matrix used in the calculations of the

    coordination requirements.
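The window procedure described above can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state explicitly: each monthly TD matrix is represented as a set of unordered file pairs, and the density is computed against a fixed number of files in the system (`n_files`). All names and data are hypothetical.

```python
def average_window_density(monthly_edges, n_files, k):
    """Mean density of the unions of every k consecutive monthly edge sets."""
    if not 1 <= k <= len(monthly_edges):
        raise ValueError("k must be between 1 and the number of months")
    possible = n_files * (n_files - 1) / 2
    densities = []
    for start in range(len(monthly_edges) - k + 1):
        # Union of the dependencies observed in k consecutive months.
        window = set().union(*monthly_edges[start:start + k])
        densities.append(len(window) / possible)
    return sum(densities) / len(densities)


# Three hypothetical months of dependencies over a system of four files.
monthly = [{("a", "b")}, {("b", "c")}, {("a", "b"), ("c", "d")}]
curve = {k: average_window_density(monthly, n_files=4, k=k) for k in (1, 2, 3)}
print(curve)
```

Plotting `curve` against k gives the kind of saturation curve shown in Figure 4; the point where the curve flattens suggests a window size beyond which additional months add few new dependencies.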

    Figure 4: Average Cumulative Density of the TD matrix (FCT method)

    [Plot: x-axis, Amount of Months of Data; y-axis, Average Cumulative Density Level.]

    General Properties and Evolution of the CGRAPH Task Dependency Matrix

    In the case of the CGRAPH method, the dependencies between source code files are

    determined based on data and functional references. Data references are represented by

    relationships where a source code file, A, references a data object in a second source code

    file B. Functional references are represented by relationships where a source code file, A,


    invokes a function or a method declared in a second source code file B. Unlike the

    relationships in the FCT method, data and functional references are directional, that is,

    the pair of source code files (A,B) is considered different from the pair (B,A).

    I collected quarterly data for this type of dependency information, mapping each

    quarter to the corresponding 3 months of the data discussed in the previous paragraphs. I

    used the C-REX tool (Hassan and Holt, 2004) to identify programming language tokens

    and references in each entity of each source code file. This analysis was performed over

    the entire source code of the system⁴ at the end of the 3rd month of each quarter. Using

    the resulting data, I computed dependencies between source code files by identifying

    data, function and method references that cross the boundary of each source code file. In

    other words, each cell ij of the TD matrix computed with the CGRAPH method represents

    the number of data/function/method references that exist from file i to file j.
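The construction of a CGRAPH cell can be sketched as follows. This is a minimal illustration that assumes the references have already been extracted into (from_file, to_file) tuples; that input format is hypothetical and is not the output format of C-REX. The function name `cgraph_matrix` is also invented for the example.

```python
from collections import Counter


def cgraph_matrix(references):
    """Count cross-file references per ordered (source, target) pair.

    references: iterable of (from_file, to_file) tuples, one per data,
    function or method reference (hypothetical pre-extracted format).
    """
    counts = Counter()
    for src, dst in references:
        if src != dst:  # keep only references that cross a file boundary
            counts[(src, dst)] += 1
    return counts


# Hypothetical references for illustration only.
refs = [
    ("ui.c", "net.c"),
    ("ui.c", "net.c"),
    ("net.c", "log.c"),
    ("net.c", "net.c"),
]
td_cgraph = cgraph_matrix(refs)
print(td_cgraph[("ui.c", "net.c")])  # 2
print(td_cgraph[("net.c", "ui.c")])  # 0
```

Because the pairs are ordered, the count for (A, B) is kept separate from the count for (B, A), which mirrors the directional nature of data and functional references.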

    Figure 5 shows the evolution of the network density measure over each quarter.

    The TD matrices have higher levels of density (avg=0.0311, min=0.0261, max=0.0322)

    relative to those obtained using the FCT method⁵. In terms of the evolution of the

    clustering coefficient measure, we see that the levels are also very stable over time, and

    higher (avg=0.1862, min=0.1738, max=0.1909) than those reported for the TD matrices

    created with the FCT method. The density of the TD matrices produced by the CGRAPH

    is significantly higher than the density of the matrices produced by the FCT method. This

    difference could stem primarily from two characteristics of the source code of a system.

    First, the CGRAPH method identifies numerous technical dependencies that involve files

    that, once developed, are rarely modified. Cross-cutting concerns such as logging, tracing

    and security are good examples. Commonly used low level functionality such as memory

    and thread management and basic storage types such as lists and queues are another

    example. A second factor that might contribute to higher levels of density of the TD

    matrices is the technical dependencies that exist with and between automatically

    generated source code files. One such example is the source code for remote procedure

    calls (RPCs). The FCT method would capture dependencies between caller and callee of

    an RPC if there are changes to the RPC specification or functionality. On the other hand, the

    CGRAPH method would capture the complete path of dependencies from the caller

    through the RPC stubs, marshalling and communication code all the way to the callee.

    Given the potential bias that these two factors could have in the computations of

    dependencies, I removed the corresponding dependencies from the quarterly call graphs and

    recomputed the density measures for each quarterly TD matrix. The results showed a

    reduction in the density (avg=0.0289, min=0.0241, max=0.0299). However, the density levels

    remained significantly higher than those for TD matrices created with the FCT method when

    considering the 19 month window for development activity.

    ⁴ The set of files used in the analysis also included the automatically generated source code files from functionality such as remote procedure calls.

    ⁵ The maximum level of density of a TD matrix produced by the FCT method is 0.01151 if all 39 months of development activity are considered.

    Figure 5: Evolution of the Density level of the TD matrices (CGRAPH method)

    [Plot: x-axis, Quarter in the Dataset; y-axis, Measure Level; series, Density and Clustering Coefficient.]

    We also examined the percentage of change in the information contained in a TD

    matrix from quarter t relative to the TD matrix from quarter t-1. Figure 6 shows that the rate

    of change is relatively low (avg=0.24%, min=0.1%, max=0.9%). Those rates of change

    consider only whether the relationship between files exists or not. If we extend the idea of

    change to also consider a modification in the weight of the relationship (e.g. number of

    calls between files), the rates of change increase (avg=1.1%, min=0.4%, max=3%);

    however, they remain relatively stable over time. This result is not particularly

    surprising since significant changes in the overall syntactic dependency structure of a

    system would imply major code refactoring efforts or architectural changes, events that

    do not occur often. A similar pattern of stability was found in the TD matrices produced

    by the FCT method when I accumulated the commit information from 19 consecutive

    months. Then, we could think of the volatility that the monthly TD matrices produced by

    the FCT method showed as an indication of how the development work evolves over time

    rather than of how the overall structure of the technical dependencies changes

    over time. In sum, the CGRAPH method produces TD matrices that contain significantly

    more product dependency information relative to those produced by the FCT method.

    Moreover, a fraction of the product dependencies identified by the two methods

    differed significantly.

    Figure 6: Evolution of the Change in the Information Contained in the TD matrices (CGRAPH method)

    [Plot: x-axis, Quarter in the Dataset; y-axis, Percentage of Change; series, Change in Number of Edges and Change in Edge Weights.]

    Comparative Analysis of the Task Dependency Matrices

    Although the analyses described above provide valuable information about the

    various TD matrices, they do not tell us anything regarding the similarity in the sets of

    technical dependencies identified by the FCT and CGRAPH methods. One of the

    advantages of the FCT method is the potential to identify technical dependencies that


    might not necessarily be captured by a simple syntactic dependency among modules of a

    software system such as semantic dependencies (Gall et al, 1998). This argument

    suggests that a comparison between the TD matrices generated by the two methods, FCT

    and CGRAPH, might show differences, possibly significant. The first step of this analysis

    was to compute the following two operations: TD(FCT) - TD(CGRAPH) and TD(CGRAPH) -

    TD(FCT). These operations, which are equivalent to the set difference operation, allow us to

    determine which dependencies that are identified by the FCT method are not identified

    by the CGRAPH method and vice versa. The focus is to identify whether a relationship

    between two modules exists in one matrix, the other, or both. Hence, I do not consider

    the differences in the weight on the linkages. I compared quarterly TD(CGRAPH) matrices

    against the TD(FCT) computed for a period of time of the 19 months prior to the end of the

    quarter. For the first two quarters, I did not have 19 months' worth of past data to compute

    the TD(FCT) matrices. Therefore, I used 13 months to construct the TD(FCT) that was compared

    to the TD(CGRAPH) matrix from the first quarter, and 16 months in the case of the second

    quarter comparison.
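The set-difference comparison can be sketched as follows. The sketch makes two assumptions that the text leaves open: directed CGRAPH pairs are collapsed into unordered pairs so that they can be compared with the undirected FCT pairs, and each percentage is expressed relative to the number of dependencies found by the first method. Edge weights are ignored, as described above. All names and data are hypothetical.

```python
def undirected(edges):
    """Collapse directed pairs (A, B) and (B, A) into one unordered pair."""
    return {tuple(sorted(pair)) for pair in edges if pair[0] != pair[1]}


def percent_missed(found_by, other):
    """Percentage of dependencies in found_by that are absent from other."""
    found_by, other = undirected(found_by), undirected(other)
    if not found_by:
        return 0.0
    # Set difference: dependencies seen by one method but not the other.
    return 100.0 * len(found_by - other) / len(found_by)


# Hypothetical dependency sets for illustration only.
fct = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")}
cgraph = {("b", "a"), ("c", "b"), ("d", "c"), ("a", "e"), ("e", "b")}
print(percent_missed(fct, cgraph))  # share of FCT pairs not in CGRAPH
print(percent_missed(cgraph, fct))  # share of CGRAPH pairs not in FCT
```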

    Figure 7 shows the comparison between the TD matrices. The TD matrix computed

    using the FCT method has an average of 14.6% of the dependencies that were not

    identified by the CGRAPH method (min=12.4%, max=17.1%). As discussed earlier, the

    TD matrices computed using the CGRAPH method are denser and that situation is clearly

    reflected in this comparison. On average, the TD matrix computed using the CGRAPH

    method had 74.3% of product dependencies that were not identified by the FCT method

    (min=70.6%, max=79.2%).

    Figure 7: Comparison between TD matrices generated by the FCT and CGRAPH methods

    [Plot: x-axis, Quarter in Dataset; y-axis, Percentage of Non-identified Dependencies; series, FCT-CGR and CGR-FCT.]

    Comparative Analysis of the Coordination Requirement Matrices

    As described in chapter 2, the Coordination Requirements matrix (CR) is a

    function of two elements: the TA matrix and the TD matrix. Using the different methods

    for identifying technical dependencies to construct TD matrices will result in different CR

    matrices. Hence, we also need to examine the general properties of both types of CR

    matrices. I used the data from the modification requests resolved in each month to

    compute the TA matrix. In terms of computing the TD matrix, we use a 19-month moving

    window in the case of the FCT method or the corresponding quarterly TD matrix in the

    case of the CGRAPH method. Figure 8 shows the evolution of the density and clustering

    coefficient measures for the CR matrices constructed based on the FCT method. We

    observe that the density of the monthly CR matrices is low (avg=0.0655, min=0.0005,


    max=0.1429) while the clustering coefficient measure shows relatively high levels

    (avg=0.3179, min=0.0308, max=0.4331) suggesting an important degree of

    interdependent clusters of files in the CR matrices.
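To make the role of the TA and TD matrices concrete, the following sketch assumes that the coordination requirements are obtained as the matrix product TA × TD × TAᵀ, where rows of TA are developers and columns are files. Equation 1 is defined in chapter 2 and is not reproduced in this section, so this product form should be read as an assumption of the example. The helper names and the toy matrices are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def coordination_requirements(ta, td):
    """Developer-by-developer matrix computed as TA x TD x TA-transposed.

    Assumption: this product form stands in for equation 1 of chapter 2.
    """
    ta_t = [list(col) for col in zip(*ta)]
    return matmul(matmul(ta, td), ta_t)


# Two hypothetical developers and three files.
# Developer 0 changed file 0; developer 1 changed files 1 and 2.
ta = [[1, 0, 0],
      [0, 1, 1]]
# Files 0 and 1 depend on each other; file 2 has no dependencies.
td = [[0, 1, 0],
      [1, 0, 0],
      [0, 0, 0]]
cr = coordination_requirements(ta, td)
print(cr)  # [[0, 1], [1, 0]]
```

In this toy case the two developers need to coordinate because the file changed by the first developer depends on one of the files changed by the second.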

    Figure 8: Evolution of Density and Clustering level of the CR matrices (FCT method)

    [Plot: x-axis, Month in the Dataset; y-axis, Measure Level; series, Density and Clustering Coefficient.]

    Figure 9 shows the evolution of the density and clustering coefficient measures for

    the CR matrices constructed based on the CGRAPH method. The clustering

    coefficient values (avg=0.3979, min=0.0312, max=0.5402) are relatively similar to those

    shown in Figure 8. On the other hand, the CR matrices created using the CGRAPH

    method are significantly more dense (avg=0.1509, min=0.0009, max=0.2408) than those

    created using the FCT method. In other words, CR matrices constructed with the

    CGRAPH method would suggest significantly higher levels of coordination requirements for the

    developers. Then, it is important to understand if the additional coordination needs are

    indeed necessary. The question is addressed in chapter 5.

    Figure 9: Evolution of Density and Clustering level of the CR matrices (CGRAPH method)

    [Plot: x-axis, Month in the Dataset; y-axis, Measure Level; series, Density and Clustering Coefficient.]

    Chapters 5 and 6 present two empirical studies that use the dependency

    identification techniques discussed in the previous paragraphs (FCT and CGRAPH) to

    examine the mismatch between coordination needs and coordination activities and their

    impact on two traditional outcome variables: development productivity and product

    quality.


    CHAPTER 5: DEPENDENCIES, CONGRUENCE AND THEIR

    IMPACT ON DEVELOPMENT PRODUCTIVITY

    Identifying work dependencies and determining the appropriate coordination

    mechanisms to address the dependencies is not a trivial problem. Coordination is a

    recurrent topic in the organizational theory literature and, as discussed in chapters 1 and

    2, many stylized types of task dependencies and coordination mechanisms have been

    proposed over the past several decades. These perspectives are useful in the context of

    enduring structures. However, numerous types of work, for instance non-routine

    knowledge-intensive activities such as software development, are potentially full of fine-

    grain dependencies that might change on a daily or hourly basis. Conventional

    coordination mechanisms like standard operating procedures or routines would have very

    limited applicability in these dynamic contexts. Failure to identify the new needs for

    coordination and information exchange might hinder the organization’s ability to adapt to

    changes in its competitive environment (Henderson & Clark, 1990). The study reported

    in this chapter represents the first step in the examination of how the gaps between

    coordination needs and actual coordination activity impact outcome variables, such as

    development productivity, in the context of software development activities.

    Study I: Congruence and Development Productivity

    Software development is populated with rapidly changing dependencies and this

    attribute of software development tasks is a potential source of coordination problems

    which impacts productivity. The analysis presented in this study focuses, first, on

    exploring the dynamism in the coordination requirements and, second, on examining the

    impact that coordination activity congruent with coordination needs has on development

    performance.

    Research Questions

    When members of a team are physically collocated and coordination requirements

    within the team change, there are numero