visualizing work processes in software engineering with developer

9
Visualizing Work Processes in Software Engineering with Developer Rivers Michael Burch, Tanja Munz, Fabian Beck, and Daniel Weiskopf VISUS, University of Stuttgart, Germany Email: {michael.burch,fabian.beck,weiskopf}@visus.uni-stuttgart.de Abstract—Work processes involving dozens or hundreds of collaborators are complex and difficult to manage. Problems within the process may have severe organizational and financial consequences. Visualization helps monitor and analyze those processes. In this paper, we study the development of large software systems as an example of a complex work process. We introduce Developer Rivers, a timeline-based visualization technique that shows how developers work on software modules. The flow of developers’ activity is visualized by a river metaphor: activities are transferred between modules represented as rivers. Interactively switching between hierarchically organized modules and workload metrics allows for exploring multiple facets of the work process. We study typical development patterns by applying our visualization to Python and the Linux kernel. I. I NTRODUCTION Work processes in general, and software development in particular, involve a product that is built and people building the product. Conway [9] states in his well-known law that these two aspects interact: “[...] organizations which design systems [...] are constrained to produce designs which are copies of the communication structures of these organiza- tions.” This duality between people and systems needs to be considered for understanding work processes. Previous visu- alizations on work processes and software evolution, however, either showed one of the two aspects only or did not scale well to large systems and long histories. With the visualization technique presented in this paper, we intend to bring both aspects of work processes closer together to help researchers, analysts, and project managers better understand, monitor, and control the development process. Our main goal was to build a system that visually and technically scales to large projects (e.g., Python, Linux kernel), where monitoring the work activity is most crucial and difficult to manage without tool support. Since software development is an important, cost-intensive process, we focus on visualizing the evolution of software projects. These need to be studied at a high level of abstraction and over long spans of time. However, we should as well be able to drill down to smaller subsystems, individual developers, and specific events. Aspects that might be explored with the help of the visualization are, for instance, main activities of developers, responsibilities for a module, individual developer roles, special evolutionary events, the flow of development activity, or general trends. We designed a visualization technique called Developer Rivers, which connects developer activity with modules of a software system. Our technique represents development activity (i.e., developers modifying files) as river-like flows on a timeline, shown in Figure 1 for the evolution of Python. Each of the colored rivers encodes the developer activity within a user-defined module; rivers branch and merge according to the flow of activity between modules. A hierarchical representa- tion of the project on the left is used to interactively browse and select modules. In a case study, we apply the technique to Python and the Linux kernel using multiple configurations of the visualization. We report specific findings for these software systems and discuss insights gained on the qualities and limitations of Developer Rivers. II. RELATED WORK Previous work related to Developer Rivers comes from several fields: software visualization [11], time-oriented data visualization [1], and the visualization of graphs [38] and hier- archies [16]. In general, our visualization combines a Sankey diagram [33], [27] (also referred to as flow map [25]) with a timeline. An additional hierarchy representation is used for navigation. While several hierarchy visualization techniques exist [16], we decided to use icicle plots [18] because they are compact as well as easy to label and colorize. A Sankey diagram is a node-link diagram that encodes branching flows or quantities in the width of links. Those have already been mapped to a timeline, to encode split- ting and merging of objects [28], the evolution of character interaction in stories [20], or the temporal development of topics in document collections [10]. In particular, our work is visually similar to the diagrams by Burch et al. [4] for visualizing eye movements of participants of an eye-tracking study. Using smooth bends and stacking the flows on top of each other without margins creates the impression of rivers. It looks similar to multivariate time-series visualizations such as ThemeRiver [17] or Streamgraphs [6]; these, however, do not show transitions between different rivers or streams. RankExplorer [30] uses flow-in and flow-out color bars as an alternative to links for encoding transitions in river-like categorical timeseries representations. A common visualization for general work processes is the Gantt chart [13], which maps hierarchically organized tasks to a timeline and visualizes dependencies between the tasks. Also a number of approaches already focus on visualizing the evolution of software. Taking a rather technical perspective, releases or commits are visualized on a timeline [3], [12], [32], [36], [37], [35]; among these, the Code Flows approach [32]

Upload: dobao

Post on 14-Feb-2017

227 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Visualizing Work Processes in Software Engineering with Developer

Visualizing Work Processes inSoftware Engineering with Developer Rivers

Michael Burch, Tanja Munz, Fabian Beck, and Daniel WeiskopfVISUS, University of Stuttgart, Germany

Email: {michael.burch,fabian.beck,weiskopf}@visus.uni-stuttgart.de

Abstract—Work processes involving dozens or hundreds ofcollaborators are complex and difficult to manage. Problemswithin the process may have severe organizational and financialconsequences. Visualization helps monitor and analyze thoseprocesses. In this paper, we study the development of largesoftware systems as an example of a complex work process.We introduce Developer Rivers, a timeline-based visualizationtechnique that shows how developers work on software modules.The flow of developers’ activity is visualized by a river metaphor:activities are transferred between modules represented as rivers.Interactively switching between hierarchically organized modulesand workload metrics allows for exploring multiple facets of thework process. We study typical development patterns by applyingour visualization to Python and the Linux kernel.

I. INTRODUCTION

Work processes in general, and software development inparticular, involve a product that is built and people buildingthe product. Conway [9] states in his well-known law thatthese two aspects interact: “[...] organizations which designsystems [...] are constrained to produce designs which arecopies of the communication structures of these organiza-tions.” This duality between people and systems needs to beconsidered for understanding work processes. Previous visu-alizations on work processes and software evolution, however,either showed one of the two aspects only or did not scalewell to large systems and long histories.

With the visualization technique presented in this paper, weintend to bring both aspects of work processes closer togetherto help researchers, analysts, and project managers betterunderstand, monitor, and control the development process. Ourmain goal was to build a system that visually and technicallyscales to large projects (e.g., Python, Linux kernel), wheremonitoring the work activity is most crucial and difficult tomanage without tool support. Since software development isan important, cost-intensive process, we focus on visualizingthe evolution of software projects. These need to be studiedat a high level of abstraction and over long spans of time.However, we should as well be able to drill down to smallersubsystems, individual developers, and specific events. Aspectsthat might be explored with the help of the visualizationare, for instance, main activities of developers, responsibilitiesfor a module, individual developer roles, special evolutionaryevents, the flow of development activity, or general trends.

We designed a visualization technique called DeveloperRivers, which connects developer activity with modules ofa software system. Our technique represents development

activity (i.e., developers modifying files) as river-like flows ona timeline, shown in Figure 1 for the evolution of Python. Eachof the colored rivers encodes the developer activity within auser-defined module; rivers branch and merge according to theflow of activity between modules. A hierarchical representa-tion of the project on the left is used to interactively browseand select modules. In a case study, we apply the techniqueto Python and the Linux kernel using multiple configurationsof the visualization. We report specific findings for thesesoftware systems and discuss insights gained on the qualitiesand limitations of Developer Rivers.

II. RELATED WORK

Previous work related to Developer Rivers comes fromseveral fields: software visualization [11], time-oriented datavisualization [1], and the visualization of graphs [38] and hier-archies [16]. In general, our visualization combines a Sankeydiagram [33], [27] (also referred to as flow map [25]) witha timeline. An additional hierarchy representation is used fornavigation. While several hierarchy visualization techniquesexist [16], we decided to use icicle plots [18] because theyare compact as well as easy to label and colorize.

A Sankey diagram is a node-link diagram that encodesbranching flows or quantities in the width of links. Thosehave already been mapped to a timeline, to encode split-ting and merging of objects [28], the evolution of characterinteraction in stories [20], or the temporal development oftopics in document collections [10]. In particular, our workis visually similar to the diagrams by Burch et al. [4] forvisualizing eye movements of participants of an eye-trackingstudy. Using smooth bends and stacking the flows on top ofeach other without margins creates the impression of rivers. Itlooks similar to multivariate time-series visualizations suchas ThemeRiver [17] or Streamgraphs [6]; these, however,do not show transitions between different rivers or streams.RankExplorer [30] uses flow-in and flow-out color bars asan alternative to links for encoding transitions in river-likecategorical timeseries representations.

A common visualization for general work processes is theGantt chart [13], which maps hierarchically organized tasksto a timeline and visualizes dependencies between the tasks.Also a number of approaches already focus on visualizing theevolution of software. Taking a rather technical perspective,releases or commits are visualized on a timeline [3], [12], [32],[36], [37], [35]; among these, the Code Flows approach [32]

beckfn
Typewriter
VISSOFT 2015
Page 2: Visualizing Work Processes in Software Engineering with Developer

Fig. 1. The flow of developer activity in main modules of Python from January 1991 to December 2011.

also uses river-like streams—including splits, merges, andswaps—but represents flow of code fragments rather thandeveloper activity. The technical evolution of software systemscan also be visualized as sequence of coupling graphs [2], [5],[8] or a set of time-dependent software metrics [19], [26].

Considering social aspects of the development process,Storey et al. [31] survey approaches up to 2005. New tech-niques have been introduced since then: Timeline-based visu-alizations of software evolution, for instance, show evolvingcommunities of developers [24], [23]. These approaches aresimilar to ours but do not directly relate developers with themodified code artifacts within the timeline-based visualization;at least, Software Evolution Storylines [23] are prepared toindirectly relate files and developers by mapping the mostfrequently changed file type of a developer to the color of linerepresenting the developer. CodeSaw [14], in contrast, showsthe activity of core developers, however, without encodingrelations between developers. Ownership Maps [15] encodechanging files on a timeline and color each of these accordingto the author who currently ‘owns’ the file; this approachonly allows for discerning few developers. When color codingdevelopers as well, similar limitations hold for the approachby Weißgerber et al. [39] and XFlow [29].

Animation was also applied visualizing the role of devel-opers in software evolution. StarGate [21] shows the projecthierarchy as a radial icicle plot and depicts an animatedcollaboration graph in its center; developers are placed closeto the files that they changed recently. In code swarm [22]and Gource [7], the developers and changed files are directlyconnected in an animated node-link graph. While code swarmorganizes files as ‘satellites’ of developers, Gource emphasizesthe project hierarchy in the layout. In contrast, we do not useanimation because static representations on a timeline haveclear advantages for representing complex information [34]:in static diagrams, usually a good overview is provided fortime sequences, whereas animation lacks this overview.

III. VISUALIZATION TECHNIQUE

The visualization combines two separate components: thehierarchy view for the module group selection and the riversview for showing the time-varying developer activities.

A. Developer Activity Model

For a given software repository, we model a commit c ∈C as a triple c = (t, d, f) consisting of a timestamp t, adeveloper name d ∈ D and a set f ⊆ F of involved files.The set C denotes all commits, the set D all developers, andthe set F all files involved in all commits. The organization offiles F into modules is defined as a hierarchy H (i.e., a tree).Any subhierarchy of H is called module, any set of disjointmodules is called module group.

To analyze the sequence of time, we partition the sequenceof commits into intervals, either equally-sized with respect tothe frame of time or with respect to the number of commits.For every interval of commits and every module, we calculatethe individual activity of each devloper as the number offiles of the module edited by the developer within the givensequence of commits. By summing all individual developeractivities of one module, a total developer activity of themodule can be derived. Hence, each interval of commits canbe described as a vector of module-specific developer activity,the full evolution of the module group as a sequence of suchvectors.

To explicitly represent the changes between the consecutivecommit intervals, we calculate weighted transition matricesMi ∈Mat((l+1)× (l+1),R+

0 ). They describe the transitionof developer activity between two commit intervals dividedby the modules in the module group. Diagonal cells of thematrix represent a constant activity within the same module,off-diagonal cells a transition of activity from one moduleto another. An l + 1-th module group has to be addedto model influents and effluents, i.e., developers joining orleaving the project. To prevent that activity is transfered from

Page 3: Visualizing Work Processes in Software Engineering with Developer

(a) (b)

Fig. 2. A Developer River represented as (a) traditional ThemeRiver and (b)as an expanded representation showing the weighted transitions.

(a) (b)

(c) (d)Fig. 3. Weighted transitions between (a) the same and (b) two differentmodule groups, (c) an influent, (d) an effluent.

one developer to the other, the transition matrices are firstcomputed for each developer individually, before building thefinal matrices Mi by summing up all developer-specific ones.

B. Developer Rivers

A Developer River is composed of several elements as canbe seen in Figure 2. If it is represented in an expanded view,transitions are visually shown, as well as influents and effluents(see Figure 2 (b)). To compute this view, weighted transitionmatrices are required, for the traditional ThemeRiver view(Figure 2 (a)) the transition matrices are not needed.

1) Transitions, Influents, and Effluents: The transi-tions show the developer behavior in adjacent timeor commit intervals. For the visualizations of the sin-gle elements, the rendering algorithm needs four pointsP1(xP1 , yP1), P2(xP2 , yP2), P3(xP3 , yP3), and P4(xP4 , yP4)(see also Figure 3).

Transitions are used to represent how developers changetheir behavior between different module groups. Those areconstructed by applying cubic Bezier curves, see Figures 3 (a)and (b). The difference to the AOI rivers of Burch et al. [4] isthat we allow the change of river thicknesses from one intervalto the next one. The color of a transition between two differentmodules is a linear gradient from the color of the start moduleto the color of the target module.

Influents show new developers joining in the current timestep and effluents developers which stop contributing in thenext time step. Influents are drawn starting from top andmerging into the main river which runs from left to right.Effluents leave the main river similarly to the bottom, whichis illustrated in Figures 3 (c) and (d).

In the following, we describe how influents are treated byour layout and rendering algorithm, effluents are processedanalogously. Assume that an influent runs from P1P2 to P3P4

(see Figure 3 (c)). While the points P3 and P4 are given bypositions inside the main river, P1 and P2 are computed ina way that all corresponding influents are horizontally placednext to each other. Overlap is avoided by placing an influentmore to the right the lower it is merging into the main river.

The influent has to merge orthogonally into the mainriver which demands four more points Si(xSi, ySi) (Figure 3(c)). Their coordinates can be computed in the followingmanner: By regarding the radius for the innermost circler = min(xP3 − xP1, yP3 − yP1) · v where v ∈ [0, 1], thefollowing coordinates can be computed for the Si:

xS1= xP1

, xS2= xP2

,

yS1= yS2

= yP3− r,

yS3= yP3

, yS4= yP4

,

xS3 = xS4 = xP1 + r.

The segments P1S1, P2S2, P3S3, and P4S4 are straightlines. S1C1 and S3C1 are responsible for the radius of theoutermost circle and S2C2 and S4C2 for the innermost one.Circle segments are placed between S1 and S3 as well asbetween S2 and S4. The value of v expresses the ratio forthe circle radius. In the visualizations we choose v = 0.7 toreduce the effect of large bends, i.e., to produce aestheticallyappealing diagrams. We apply a fade-out visual appearanceat the top and bottom respectively to prevent influents andeffluents dominating the visualization.

2) Adjustment of the Transition Matrices: Since developersmight work in several modules before doing a commit, wehave to adjust the transition matrix concept introduced inthe work of Burch et al. [4]. For this reason, transitionsbetween two subsequent time steps or time intervals cannotbe regarded as a single transition between exactly two modulegroups. In general, developers’ amount of work has to beadded/subtracted to/from the weights of several rivers, alsowhen they enter or leave certain modules of interest. Anotherdifference is that each developer typically has varying amountsof work in different work spaces. This aspect is also taken intoaccount by the layout of Developer Rivers.

3) Vertical Order and Label Placements: In the verticalorder of the subrivers we first render the influents, then thetransitions, and finally the effluents. We first draw those tran-sitions to the same module group and as a second step thosewhich lead to the remaining module groups. All transitionsbetween two subsequent time intervals are ordered by their

Page 4: Visualizing Work Processes in Software Engineering with Developer

height, i.e., thinner subrivers are drawn in the foreground tomake crossing transitions more readable.

We additionally annotate each time step by labels expressingthe developers working most actively in the correspondingmodule groups, scaling the font size according to activity.However, we cannot show all developer names due to spacelimitations. To discern labels from colored rivers, labels aredisplayed as semi-transparent rounded rectangles on whichthe corresponding developer name is drawn. The most activedeveloper is centered, followed to the top and bottom by otherdevelopers in order of decreasing activity.

IV. USAGE STRATEGIES

When applying Developer Rivers in practice, certain strate-gies help efficiently use the tool. On the one hand, weintroduce visual patterns that have a specific meaning in theapplication domain; finding those in the visualization alreadyreveals some of the relevant information. On the other hand,we describe application scenarios for our approach; each ofthem specifies a configuration and a form of application weidentified as particularly useful.

A. Visual Patterns

Certain frequent patterns can be obtained from the rivertrajectories. These are simple geometric structures that can bedetected by looking only at two neighboring periods and thetransitions in between. To judge their relevance, not only thetopology of the transitions needs to be considered but alsotheir strength and context. If the patterns are not symmetric,a mirrored complement exists:

• Inflow/Outflow: A transition from or to the outside ofthe diagram identifies a developer entering or leaving theproject, i.e., one who is not active in the previous periodor in the following one.

• Constant Flow: An intra-transition with a constant widthindicates a group of developers continuing to work ona module with steady effort. While the sum of activitydoes not change considerably, the activities of individualdevelopers of this group might vary.

• Growth/Decline: An intra-transition with an increasingor decreasing strength hints at a group of developers thatkeep working on a module but with changing total effort.

• Split/Merge: A module that is split into or merged frommultiple flows shows a qualitative change of developeractivity (i.e., developers’ relative focus switches betweenmodules). While at least one inter-transition is requiredfor this pattern, one of the flows can be an intra-transition.

• Exchange: A pair of intra-transitions connecting twomodules in opposite directions at the same time is aspecific qualitative change of activity: some developersmove between the two modules in both directions.

While these patterns only provide basic findings, the in-terplay or sequence of those standard patterns might createmore complex patterns that lead to more advanced insights.Composites of the patterns can be formed through temporal

succession (i.e., one pattern follows the other), through rep-etition (i.e., a pattern is repeated constantly or cyclically), orthrough co-occurrence (i.e., two patterns appear in the sametimestep).

B. Application Scenarios

Developer Rivers are interactively configurable and allow,for instance, to select modules, to vary time frame andresolution, to select developers, or to change coloring andlabels. The following scenarios are configurations we consideruseful and apply throughout the case studies in Section V:

• Main Module Overview: The main modules of thesystem are selected, which usually consist of the maindirectories. While it is possible to simply select alldirectories of a level, it often is advisable to do a manualselection in order to obtain a better selection of modules.

• File Type Overview: The tool enables the automaticdefinition of modules by file types. We recommend thatone does not use all file types existing in the project asmodules but restricts the selection to the most frequentones.

• Developer Sparklines: Selecting a single developer, avisualization is created that reveals specific characteristicsof the individual activity of the developer. The reducedcomplexity of the visualization allows to down-scale thesize of the visualization and create a kind of sparkline(i.e., a word-sized graphic). Several of these developersparklines can be shown as juxtaposed small multiples.

• Subsystems Details: Selecting modules in a subdirectoryof the system shows details of the specific subsystem.During the analysis of the subsystem, the hierarchy canbe switched to only showing the subsystem.

We might want to focus on a specific time frame (e.g.,only recent development), which is orthogonal to the scenariosdescribed above and, hence, can be combined with these.

V. CASE STUDIES

We applied Developer Rivers to two reasonably large opensource software projects with a considerable developmenthistory: Python and the Linux kernel. The case study takesthe perspective of an analyst or researcher trying to figure outhow the development processes of the projects are organized,how responsibilities are distributed among core developers,and what subsystems have been and currently are the mainfocus of development. It follows a top-down approach, firststarting with high level observations of the history, then inves-tigating the role of main developers, and analyzing examplesof subsystems.

A. Python

As first example, we investigated the evolution of the ref-erence implementation of the programming language Python.While the development of Python started in 19891, we re-trieved exactly 21 years of development history from January

1http://python-history.blogspot.de/2009/01/brief-timeline-of-python.html

Page 5: Visualizing Work Processes in Software Engineering with Developer

1991–1993 1994–1996 1997–1999 2000–2002 2003–2005 2006–2008 2009–2011

Fig. 4. Python file type overview, 1991–2011, 3-years interval.

1, 1991 to December 31, 2011 from the Mercurial repositoryof the project. This dataset consists of 70,813 commits by159 developers. This project forms a suitable example fordemonstrating our approach as it has reasonable size andhistory. Since the same dataset has already been visualizedwith code swarm [22], Gource [7], and Software EvolutionStorylines [23], it might be considered a form of benchmark.

1) Project Overview: Figure 1 gives an overview of theproject. For sake of readability and simplicity, we cut thedevelopment history in clean periods covering only full years.In particular, we split the history of the project into sevenperiods each showing three years of development. Based onthe project’s documentation2, we selected four of the mostimportant directories of the project as modules:

• Lib: “The part of the standard library implemented inpure Python.” (blue)

• Modules: “The part of the standard library (plus someother code) that is implemented in C.” (green)

• Doc: “The official documentation.” (red)• Tools: “Various tools that are (or have been) used to

maintain Python.” (pink)One of the most obvious observations in Figure 1, which

shows the main module overview, is that the project underwenta considerable growth with respect to numbers of changed filesduring the studied period and modules. The growth was quiteconstant, only with a slight decrease of changes during 2003–2005. The activity in the individual modules roughly followsthe size of the modules in the project hierarchy: most commitscovered the Lib directory, followed by Doc and Modules; Toolsplayed only a minor role. This relation is mainly preservedover the years, only with an exception in the years 1997–1999, where Doc had the highest activity. This seems to bea particular phase of documentation, which might have beenneglected during the early years of development.

Figure 4 shows the same time division as in Figure 1 butdivides the project by the six most frequent file types intorivers (file type overview). Since file types largely conformas the modules selected before (Doc → .TEX, .RST; Lib →.PY; Modules → .C, .H; Tools → [diverse]), most of theabove observation can be confirmed, such as the specificphase of documentation in the third period. Additional tothat, Figure 4 provides some new insights: Visually quite

2http://docs.python.org/devguide/setup.html

apparent, during the last two periods, from 2006 and onward,the documentation format has changed from .TEX (LATEX)to .RST (reStructuredText), not instantly but gradually (twomain transitions between the orange and red river). As aposting on the Python developer mailing list tells3, the ideaof switching the documentation had already been discussedsince 2002; Guido van Rossum, the initiator of Python, reactedpositively on the idea but was not directly convinced4. Anotherobservation from Figure 4 is that the role of Python files (.PY)in relation to C code (.C, .H) is slowly increasing over thedevelopment history: at the beginning, the number of changeswith respect to these two groups of files were nearly balanced,while at the end of the studied period, more than twice thenumber of changes referred to Python files than to C files.Hence, the library code seems to have gained importance inrelation to the core implementation.

2) Developers: The labeled rivers of Figure 1 show whowas working on which part of Python. In general, the fluctu-ation of developers’ activity was quite high because the maindevelopers of a river rarely stayed the same from one timeperiod to the next. Within a time period, different developernames for the selected modules indicate a certain division ofresponsibilities. Nevertheless, some developers (e.g., Guidovan Rossum, Fred Drake, or Georg Brandl) appear as mainones of different modules. For the Doc directory, this is quitenatural because implemented functionality (in one of the otherdirectories) might need to be documented specifically.

As Figure 1 also shows, only few developers worked on theproject at the beginning with minimal fluctuation (at most 9developers during 1991–1999). In the period of 2000–2002,the project attracted many new developers and the numberof them jumped to 50 (maybe in the context of the releaseof Python 2.0 in October 2000). Most of them kept workingon the project throughout 2003–2005, although the number ofcommits declined slightly. From 2006 and onward, however,the fluctuation of developers increased: Many developers thatwere active between 2006–2008 did not continue working onthe project during 2009–2011.

In general, developers do not often seem to switch theirfocus of work activity between the main parts: only fewstronger transitions can be observed in Figure 1. The fewexisting transitions, however, may point to special events inthe history of the project. While transitions can be exploredinteractively by retrieving additional information in tooltips,restricting the time frame increases the temporal resolution.We demonstrate this for the early years of Python, 1991–2001, in Figure 5. The higher temporal resolution (now 2 yearsinstead of 3 years) and the vertical zoom much better reveal theactivities and transitions in the focused period. For instance,we see that Guido van Rossum stayed the only main developerof Python for a long time. A first interesting pair of transitionsfrom Lib and Modules to Doc between the second (1992–1993) and third (1994-1995) period was caused by Guido

3https://mail.python.org/pipermail/python-dev/2002-April/022099.html4https://mail.python.org/pipermail/python-dev/2002-April/022131.html

Page 6: Visualizing Work Processes in Software Engineering with Developer

1991–1992

DocLib Modules Tools

1992–1993 1994–1995 1996–1997 1999–2000

Fig. 5. Python main module overview, 1991–2000, 2-years interval.

DocLib Modules Tools

1995 2000 2005 2010

1. Guido van Rossum

2. Georg Brandl

3. Fred Drake

4. Benjamin Peterson

5. Jack Jansen

Fig. 6. Python developer sparklines of top 5 developers, 1991–2011, 1-yearinterval.

van Rossum editing more documentation files (as confirmedby using details on demand). These transitions fall togetherwith Fred Drake entering development and changing Doc files.For the period of 1996–1997, the documentation effort wasconsiderably intensified, which can be traced back largely toFred Drake, who then became the most active developer ofDoc. After that, during 1999–2000, Fred Drake started also toworking in Lib and Modules and Guido van Rossum changedhis focus back to these parts—this causes the strong transitionsfrom Doc to the blue and green river between the last twoperiods.

Generating developer sparklines for the top 5 developersof Python allows to confirm most of these observations. TheDeveloper Rivers as shown in Figure 6 encode again the fullperiod of 1991–2011; since a stretched aspect ratio matchesthe character of a sparkline, the temporal resolution can be

1997–2001 2002–2006 2007–2011

Fig. 7. Python subsystem details of the Tools directory, 1997–2011, 5-yearsinterval.

increased to one year per period. Studying the sparklines ofGuido van Rossum and Fred Drake, we see the increasingeffort in documentation and the subsequent switch of focustowards implementation. But much more insight can be gainedbased on these diagrams; just to name a few examples: Guidovan Rossum steadily worked less on the project during 1998–2004 (with respect to changed files) followed by a suddenpeak in 2007. Georg Brandl joined the project late (2005); hisefforts switch back and forth between Lib and Doc. BenjaminPeterson, in contrast to the others, started to contribute heavilyalready in his first year. Jack Jansen had two unconnectedperiods of high activity, while Fred Drake particularly focusedon Doc.

3) Subsystems: With these observations as a background, itis now interesting to go into the details of the module structure.As an example, we chose the Tools directory and mark someof its main subdiretories as new modules. Figure 7 depicts thesubsystem details; due to limited space, we here restricted thestudied period to 1997–2011, divided into 5-years intervals.The resulting diagram is quite different from those discussedbefore: the overall number of changes is not increasing, butroughly stays constant; there is a large variety of dominancecomparing two time periods; and there are many transitionconnections between the different modules.

In the first period (1997–2001), scripts (a collection ofscripts for various purposes), freeze (a Python compiler forUnix), pynche (a color editor), idle (a Python code editor),and compiler (a Python bytecode compiler) assembled themain activity. In the transition to the next period (2002–2006),there is no considerable outflow—nearly all Tools developerscontinued to work on the project, however, with overall lessactivity while some other developers joined. Within the period,scripts (see above), bgen (a source code generator), idle(see above), and pybench (a benchmark suite) were mostactive, following the strongest transitions between the twofirst periods, considerable developer activity moved from idleand compiler to scripts (split in idle and compiler, mergein scripts). In the last period (2007–2011), we again find a

Page 7: Visualizing Work Processes in Software Engineering with Developer

#developers

drivers fs arch kernel net Documentation

2006 2007 2008 2009 2010 20122011 2013

Fig. 8. Linux main module overview, 2006–2013, 1-year interval.

different collection of modules being most active; in particular,buildbot (a continuous integration framework) draws consid-erable attention of developers. Only the scripts directory hasmajor activity across all considered periods, but the developersare largely changing (considerable inflows and outflows).

B. Linux KernelThe second project for this case study is the Linux kernel.

Although we were only able to retrieve a part of the project’shistory, this part is already at least one order of magnitudelarger than the Python project: its Git repository for 8 yearsfrom January 1, 2006 to December 31, 2013 contains 408,555commits by 11,048 developers.

1) Project Overview: Figure 8 provides the projectoverview for the following directories selected as modules(descriptions accoding to the Linux Documentation Project5):

• drivers: “the system’s device drivers” (red)• fs: “file system code” (blue)• arch: “architecture specific kernel code” (green)• kernel: “main kernel code” (pink)• net: “kernel’s networking code” (cyan)• Documentation: documentation files (orange)In contrast to Python, the Linux kernel did not undergo an

extensive growth of changes in the studied period (however,we also study a shorter period), but just a slight growth.A minor exception is the year 2013, where developmentactivity slightly declined with respect to the previous year. Itis further interesting to note that the overall pattern createdby the flows is quite similar across all periods: a smallinflow distributed among all selected modules (relative to theirsize in the period), a similar but even smaller outflow, largeconstant flows for all modules, and only small inter-transitions;only between drivers and arch, there are considerable inter-transitions showing an exchange pattern—this might be partly

5http://www.tldp.org/LDP/tlk/sources/sources.html

2008 2009 2010 2013

1. Al Viro

2. Greg Kroah-Hartman

3. Tejun Heo

4. David Howells

5. Russell King

drivers fs arch kernel net Documentation

20072006 2011 2012

Fig. 9. Linux developer sparklines of top 5 developers, 2006–2013, 0.5-yearsinterval.

explained through their size, but could also mean that thesetwo modules are related so that developers naturally switchbetween them. In general, the development of the Linux kernelseems to be an established work process with low amount ofvariance in the developer activity.

2) Developers: Although the overall development activityis quite stable, the most active developers within the modulesselected in Figure 8 change quite often over the full timeperiod. To further investigate the stability of developer roles,Figure 9 shows developer sparklines for the top 5 mostactive ones. We find, for instance, that some developers haveclear responsibilities that stay constant over time (e.g., Greg

Page 8: Visualizing Work Processes in Software Engineering with Developer

2006–2007 2008–2009 2010–2011 2012–2013

sched time trace irq power events

Fig. 10. Linux subsystem details of the kernel directory, 2006–2013, 1-yearinterval.

Kroah-Hartman → drivers or Russell King → arch). Otherdevelopers, in contrast, switch their focus of activity (Al Viro:arch, drivers, fs; David Howells: drivers, arch). The overallactivity of them is considerably varying over time, but noneof the main developers left the project in this period.

3) Subsystems: Maybe the most important subsystem of theproject is the kernel module. Figure 10 shows this directory assubsystem details with the most active subdirectory selected asmodules. A first surprising observation is that this subsystemis not as stable as the overall system, even though 2-yearsintervals were selected to increase the readability of the figure.The overall activity as well as the relative activity in thedifferent modules changes. In the first period (2006–2007),there is only little activity, while in the second period (2008–2009) much work is done in trace (the kernel tracing systems),to a large extent by new developers but also by developerspreviously working on time (time keeping functionality) andirq (handling interrupt request). Only some of this activityis preserved throughout 2010–2011, but irq got more activeagain. During the last period (2012–2013), previously quiteinactive modules attract activity: sched (kernel scheduler) andevents (performance events). Hence, the stability of the overallsystem and the relative stability of the main developers are notpreserved in this subsystem, which is a particularly importantone but also small in relation to the complete project.

VI. DISCUSSION

The case study shows that Developer Rivers scale wellto large systems and long evolution histories. Although therivers highly aggregate the data, still interesting observationscan be made at this highest level of abstraction. Interactivelyselecting smaller subsystems or choosing a narrower windowof time further allows for investigating details. Selectingindividual developers creates quite characteristic lines that canbe interpreted even in small size like sparklines.

Nevertheless, our approach has a number of limitations,one of the main ones being the constraints in the numberof modules and time periods that can be visualized: Sinceonly a small number of colors can be visually discernedand vertical space is required for labeling the rivers, onlyup to about 10–12 rivers can be easily discerned, similar

as it is the case for ThemeRiver [17]. Cushion effects, likeused in Code Flows [32], could improve the discernibility ofrivers somewhat. The limitation with respect to the numberof time periods is caused by the transitions that connectthe modules across time periods. Like in other approachesshowing transitions on a timeline, the transition links requiresome horizontal space to be readable.

Like in most visualizations, visual and data artifacts maysometimes mislead the analyst. For Developer Rivers, theselected time granularity could make a difference: for instance,an entering developer who starts to heavily contribute tothe project might either appear as a strong inflow or by aweaker inflow directly followed by a growth. Also, it cannot bevisually discerned between developers not active only duringone period of time and people leaving the project permanently.Knowing these potential problems, they can be mitigated byusing the interaction techniques of Developer Rivers, such asretrieving details on demand.

A general limitation is that we tested the approach onlywith two open source systems. Development activity mightbe different in other projects, in particular, commercial ones.Hence, we do not draw any generalizing conclusions on devel-opment patterns based on our case study. Another problem thatour work shares with other studies on development activity isthat choosing a metric to estimate work effort is difficult. Wedecided to use the number of changed files as a simple metricthat is easy to interpret, but this metric is replaceable by anyother metric of development activity. Also developers usingmultiple alias names could be a problem; integrating a namedisambiguation approach could mitigate this issue.

VII. CONCLUSION AND FUTURE WORK

In this paper, we introduced Developer Rivers, a visual-ization approach for understanding development activity inlarge-scale software systems. Based on the committed codechanges, we relate the software system in the form of hier-archically organized modules to developers. Good overviewof the software evolution is provided through the timeline-based visualization technique. A flexible configuration and theinteraction techniques allow the application of the visualizationin different scenarios and on different levels of abstractions.Described visual patterns help quickly gain insights from thevisualization. The extensive case studies provide examples thatshow the usefulness of the scenarios and patterns; they alsodemonstrate that our approach scales to large systems.

For future application of Developer Rivers, we see two maindirections: First, we want to make the tool ready for easyapplication by practitioners, for instance, software consultants,project managers, and lead open source developers. For this,mainly technical issues still need to be solved (e.g., bundlingpreprocessing scripts and visualization tool or integration intoexisting development environments), but also the features ofthe tool need to be restricted to the most central ones toincrease acceptance. Second, we see the visualization as apotentially helpful tool for software engineering researchers toderive hypotheses about development patterns and developer

Page 9: Visualizing Work Processes in Software Engineering with Developer

roles. These hypotheses might act as starting points of newresearch projects and experiments studying the socio-technicalaspects of software projects.

ACKNOWLEDGMENTS

Fabian Beck is indebted to the Baden-Wurttemberg Stiftungfor the financial support of this research project within thePostdoctoral Fellowship for Leading Early Career Researchers.

REFERENCES

[1] Wolfgang Aigner, Silvia Miksch, Heidrun Schumann, and ChristianTominski. Visualization of Time-Oriented Data. Human-ComputerInteraction Series. Springer, 2011.

[2] Dirk Beyer and Ahmed E Hassan. Animated visualization of softwarehistory using evolution storyboards. In Proceedings of the 13th WorkingConference on Reverse Engineering, WCRE, pages 199–210. IEEEComputer Society, 2006.

[3] Michael Burch, Fabian Beck, and Stephan Diehl. Timeline Trees:Visualizing sequences of transactions in information hierarchies. InProceedings of 9th International Working Conference on AdvancedVisual Interfaces, AVI, pages 75–82. ACM, 2008.

[4] Michael Burch, Andreas Kull, and Daniel Weiskopf. AOI rivers forvisualizing dynamic eye gaze frequencies. Computer Graphics Forum,32:281–290, 2013.

[5] Michael Burch, Corinna Vehlow, Fabian Beck, Stephan Diehl, andDaniel Weiskopf. Parallel Edge Splatting for scalable dynamic graphvisualization. IEEE Transactions on Visualization and Computer Graph-ics, 17:2344–2353, 2011.

[6] Lee Byron and Martin Wattenberg. Stacked Graphs – geometry &aesthetics. IEEE Transactions on Visualization and Computer Graphics,14(6):1245–1252, 2008.

[7] Andrew H Caudwell. Gource: Visualizing software version controlhistory. In Companion to the 25th Annual ACM SIGPLAN Conference onObject-Oriented Programming, Systems, Languages, and Applications,SPLASH/OOPSLA, pages 73–74. ACM, 2010.

[8] Christian Collberg, Stephen Kobourov, Jasvir Nagra, Jacob Pitts, andKevin Wampler. A system for graph-based visualization of the evolutionof software. In Proceedings of the 2003 ACM Symposium on SoftwareVisualization, SoftVis, pages 77–86. ACM, 2003.

[9] Melvin E Conway. How do committees invent? Datamation Journal,14(4):28–31, 1968.

[10] Weiwei Cui, Shixia Liu, Li Tan, Conglei Shi, Yangqiu Song, Zekai Gao,Huamin Qu, and Xin Tong. TextFlow: Towards better understandingof evolving topics in text. IEEE Transactions on Visualization andComputer Graphics, 17(12):2412–2421, 2011.

[11] Stephan Diehl. Software Visualization - Visualizing the Structure,Behaviour, and Evolution of Software. Springer, 2007.

[12] Harald Gall, Mehdi Jazayeri, and Claudio Riva. Visualizing softwarerelease histories: The use of color and third dimension. In Proceedingsof the IEEE International Conference on Software Maintenance, ICSM,pages 99–108. IEEE Computer Society, 1999.

[13] Henry Laurence Gantt. Work, Wages, and Profits. Engineering MagazineCompany, 2nd edition, 1913.

[14] Eric Gilbert and Karrie Karahalios. CodeSaw: a social visualization ofdistributed software development. In Proceedings of the 11th IFIP TC 13International Conference on Human-Computer Interaction, INTERACT,pages 303–316. Springer, 2007.

[15] Tudor Girba, Adrian Kuhn, Mauricio Seeberger, and Stephane Ducasse.How developers drive software evolution. In Proceedings of the 8thInternational Workshop on Principles of Software Evolution, IWPSE,pages 113–122. IEEE, 2005.

[16] Martin Graham and Jessie Kennedy. A survey of multiple tree visuali-sation. Information Visualization, 9(4):235–252, 2010.

[17] Susan Havre, Elizabeth Hetzler, Paul Whitney, and Lucy Nowell. The-meRiver: Visualizing thematic changes in large document collections.IEEE Transactions on Visualization and Computer Graphics, 8(1):9–20,2002.

[18] Joseph B Kruskal and James M Landwehr. Icicle Plots: Better displaysfor hierarchical clustering. The American Statistician, 37(2):162–168,1983.

[19] Michele Lanza. The Evolution Matrix: Recovering Software Evolutionusing Software Visualization Techniques. In Proceedings of the 4thInternational Workshop on Principles of Software Evolution, IWPSE,pages 37–42. ACM, 2001.

[20] Shixia Liu, Yingcai Wu, Enxun Wei, Mengchen Liu, and Yang Liu.StoryFlow: Tracking the evolution of stories. IEEE Transactions onVisualization and Computer Graphics, 19(12):2436–2445, 2013.

[21] Michael Ogawa and Kwan-Liu Ma. StarGate: A unified, interactivevisualization of software projects. In Proceedings of the IEEE PacificVisualization Symposium, PacificVis, pages 191–198, 2008.

[22] Michael Ogawa and Kwan-Liu Ma. code swarm: A design study inorganic software visualization. IEEE Transactions on Visualization andComputer Graphics, 15(6):1097–1104, 2009.

[23] Michael Ogawa and Kwan-Liu Ma. Software Evolution Storylines. InProceedings of the 5th International Symposium on Software Visualiza-tion, SoftVis, pages 35–42. ACM, 2010.

[24] Michael Ogawa, Kwan-Liu Ma, Christian Bird, Premkumar Devanbu,and Alex Gourley. Visualizing social interaction in open source softwareprojects. In Proceedings of the 6th International Asia-Pacific Symposiumon Visualization, APVIS, pages 25–32. IEEE, 2007.

[25] Doantam Phan, Ling Xiao, Ron Yeh, Pat Hanrahan, and Terry Winograd.Flow Map Layout. In Proceedings of the 2005 IEEE Symposium onInformation Visualization, INFOVIS. IEEE Computer Society, 2005.

[26] Martin Pinzger, Harald Gall, Michael Fischer, and Michele Lanza.Visualizing multiple evolution metrics. In Proceedings of the 2005 ACMSymposium on Software Visualization, SoftVis, pages 67–75. ACM,2005.

[27] Patrick Riehmann, Manfred Hanfler, and Bernd Froehlich. InteractiveSankey Diagrams. In Proceedings of the IEEE Symposium on Informa-tion Visualization, INFOVIS, pages 233–240. IEEE, 2005.

[28] Kay A Robbins, Clinton L Jeffery, and Steven Robbins. Visualizationof splitting and merging processes. Journal of Visual Languages andComputing, 11(6):593–614, 2000.

[29] Francisco Santana, Gustavo Oliva, Cleidson RB de Souza, and Marco AGerosa. XFlow: An extensible tool for empirical analysis of softwaresystems evolution. In Proceedings of the VIII Experimental SoftwareEngineering Latin American Workshop, volume 11 of ESELAW, 2011.

[30] Conglei Shi, Weiwei Cui, Shixia Liu, Panpan Xu, Wei Chen, andHuamin Qu. RankExplorer: Visualization of ranking changes in largetime series data. IEEE Transactions on Visualization and ComputerGraphics, 18(12):2669–2678, 2012.

[31] Margaret A Storey, Davor Cubranic, and Daniel M German. On the useof visualization to support awareness of human activities in softwaredevelopment: a survey and a framework. In Proceedings of the 2005ACM Symposium on Software visualization, SoftVis, pages 193–202.ACM, 2005.

[32] Alexandru Telea and David Auber. Code Flows: Visualizing structuralevolution of source code. Computer Graphics Forum, 27(3):831–838,2008.

[33] Edward R Tufte. The Visual Display of Quantitative Information.Cheshire, CT: Graphics Press, 1983.

[34] Barbara Tversky, Julie Bauer Morrison, and Mireille Betrancourt. An-imation: can it facilitate? International Journal of Human-ComputerStudies, 57(4):247–262, 2002.

[35] Lucian Voinea, Alex Telea, and Jarke J van Wijk. CVSscan: visualizationof code evolution. In Proceedings of the 2005 ACM Symposium onSoftware Visualization, SoftVis, pages 47–56. ACM, 2005.

[36] Lucian Voinea and Alexandru Telea. CVSgrab: Mining the history oflarge software projects. In Proceedings of the Joint Eurographics–IEEE VGTC Symposium on Visualization, EuroVis, pages 187–194.Eurographics Association, 2006.

[37] Lucian Voinea and Alexandru Telea. Visual querying and analysis oflarge software repositories. Empirical Software Engineering, 14(3):316–340, 2009.

[38] Tatiana von Landesberger, Arjan Kuijper, Tobias Schreck, JornKohlhammer, Jarke J van Wijk, Jean-Daniel Fekete, and Dieter WFellner. Visual analysis of large graphs. Computer Graphics Forum,30(6):1719–1749, 2011.

[39] Peter Weißgerber, Mathias Pohl, and Michael Burch. Visual datamining in software archives to detect how developers work together.In Proceedings of Workshop on Mining Software Repositories, MSR,page 9, 2007.