

Page 1: [IEEE 2012 34th International Conference on Software Engineering (ICSE) - Zurich, Switzerland (2012.06.2-2012.06.9)] 2012 34th International Conference on Software Engineering (ICSE)

Automatically Detecting Developer Activities and Problems in Software Development Work

Tobias Roehm
Technische Universität München
Munich, Germany
[email protected]

Walid Maalej
Technische Universität München
Munich, Germany
[email protected]

Abstract: Detecting the current activity of developers and the problems they are facing is a prerequisite for context-aware assistance and for capturing developers’ experiences during their work. We present an approach to detect the current activity of software developers and whether they are facing a problem. By observing developer actions like changing code or searching the web, we detect whether developers are locating the cause of a problem, searching for a solution, or applying a solution.

We model development work as a recurring problem solution cycle, detect developer actions by instrumenting the IDE, translate developer actions to observations using ontologies, and infer developer activities by using Hidden Markov Models. In a preliminary evaluation, our approach was able to correctly detect 72% of all activities. However, a broader, more reliable evaluation is still needed.

Keywords: activity detection, task management, machine learning, ontologies, context-aware software engineering

I. INTRODUCTION

Software engineering has become a knowledge-intensive effort due to rapidly changing technologies and frameworks, a growing number of tools, differing requirements in different domains (e.g. web or mobile development), and the complexity of software systems. Hence, approaches and tools to support software engineers in knowledge acquisition, knowledge management, and knowledge exchange are important, especially for geographically distributed teams.

In this paper, we address the problem of detecting the current activity of software developers and whether they are facing a problem based on their actions. Following the observations of Maalej and Happel [5], [6] and Roehm et al. [10], we developed a model of development work that is structured as recurring iterations of activities like “Searching for the cause of a problem”, “Searching for a solution”, or “Testing a solution”. Our approach detects these activities by monitoring fine-grained developer actions like changing source code or performing a Google search and associating the observed actions with the current activity of developers. The activities are not directly observable because they are composed of many actions, and inferring an activity from a single action is not trivial because the same action can be performed in several activities. An illustrative scenario can be found in Table I. It shows actions and activities of a developer while fixing a bug. Our goal is to infer the current activity of that developer from the actions.

Table I
ILLUSTRATIVE SCENARIO

| Step | Action of the developer | Activity |
|------|-------------------------|----------|
| 1 | Read ticket: Ticket description “Can’t write XML file” | Get problem |
| 2 | Debug exception: “File already exists” thrown in class XMLWriter | Locate cause of problem |
| 3 | Change code: Code that appends string “001” to filename if file exists | Apply solution |
| 4 | Rerun application: Problem re-occurs because file with appended “001” to filename also exists | Test solution |
| 5 | Google search: Search string is “file exists append number” | Search solution |
| 6 | Change code: Copy and adapt sample code from web forum that appends incremented number to filename | Apply solution |
| 7 | Rerun application: Successful run | Test solution |

The ability to detect the current activity of developers and whether they are facing a problem provides important context information for recording the experiences of developers with certain problems and the solutions applied. This information can be recorded and made accessible to developers facing similar problems. Further, such knowledge can be used to provide context-aware help to developers, e.g. by presenting only information that is relevant in the current activity or for the current problem.

The contribution of this paper is twofold. First, we model development work as a problem solution cycle. Second, we introduce an approach to infer the developer’s activity in the problem solution cycle from observable actions. The remainder of the paper is organized as follows. Section II reviews related work. Section III describes our approach. Section IV presents a first evaluation and Section V summarizes future directions.

II. RELATED WORK

Maalej and Happel [4] describe an approach for knowledge sharing in distributed software teams, which is the overall framework for this work.

978-1-4673-1067-3/12/$31.00 © 2012 IEEE. ICSE 2012, Zurich, Switzerland. New Ideas and Emerging Results.


Work has been done to detect tasks of software developers and to support them when switching tasks. Kersten and Murphy [3] capture the context of the current task and support developers in reestablishing task context after a task switch. Coman and Sillitti [1] split interactions of developers in the IDE into task-related subsections based on investigated code. Maalej and Sahm [7] capture artifacts such as files a developer interacts with during work on a particular task as context information. This approach does not detect task switches automatically. Similarly, Parnin and Görg [8] capture code methods a developer interacted with as context. They do not detect activity boundaries but assume a session length of one day.

Detecting the current task of knowledge workers has been studied, too. TaskTracer, presented by Dragunov et al. [2], collects task-related user interaction and resource access data. Task switches have to be indicated by users proactively. Shen et al. [11], [12] describe an approach to detect the current activity of a knowledge worker and recommend relevant resources. They predict the current activity based on resource-activity associations that are learned from usage data and matched with the resources the current user interacted with.

Our work is complementary to these approaches as we aim at detecting high-level situations of developers that are not related to a specific task. Further, we are detecting situations in which developers are facing problems.

III. ACTIVITY DETECTION APPROACH

In this section we describe our approach to detect developer activities based on developer actions.

A. Modeling Development Work as Problem Solution Cycle

Roehm et al. [10] observed developers during their daily work and found that many developers follow a structured problem-solution-test work pattern. Maalej and Happel [5], [6] studied how developers describe their tasks and found that developers describe their work in terms of problems or issues. These observations are the basis of our approach to describe development work.

In our approach, development work is modeled as a problem solution cycle (PSC) consisting of several activities (see Figure 1). Problems can be either a defined task or an unplanned issue occurring during development work. For each problem one or more iterations of the PSC are performed until it is solved. A typical iteration is to locate the cause of a problem, search for a solution, apply the solution and then test it. If a problem is not solved after an iteration, a new iteration for the current problem is started. Otherwise another problem is tackled by starting an iteration for the new problem.

Iterations can vary depending on the developer and the work context. For example, a developer can skip a step because she knows the solution for a problem from experience, or because she has no time for documenting the solution.

Figure 1. Problem solution cycle (PSC). Activities shown in the figure: Get problem, Locate cause of problem, Search solution, Apply solution, Test solution, Document solution.

Due to interruptions, a developer might switch from the PSC of one problem (e.g. task A) to another problem (e.g. unplanned issue B) and resume after completing the interrupting problem [8]. As a result, a problem solution cycle for one problem is usually not carried out sequentially from the start to the end. Context switches mean that a developer can jump into the problem solution cycle at any position.

We identified six activities that a developer can perform at a certain point of time. They are shown in Table II. Our goal is to detect which of these activities a developer is currently performing, e.g. if she is currently searching for a solution or locating the cause of a problem.

B. Monitoring Developer Actions

To observe developer actions, we use the TeamWeaver framework [4]. This framework instruments the Eclipse IDE and senses developer actions such as changes of Java code, invocations of a method in source code, web searches, or occurrences of compiler errors. Each occurrence of a developer action is stored as an instantiation of a concept from an ontology describing the software engineering domain. This ontology includes an action hierarchy, e.g. action “change visibility of a Java method” is asserted to be a subclass of action “change an artifact”.

Table II
ACTIVITIES IN THE PROBLEM SOLUTION CYCLE

| Activity | Description | Typical actions of a developer |
|----------|-------------|--------------------------------|
| GET PROBLEM | Developer learns about the existence of a problem | Read an e-mail or ticket, Get an exception, Get a failed test |
| LOCATE CAUSE OF PROBLEM | Developer knows that a problem exists but not the cause, Developer searches for the cause | Read code, Reproduce an error, Debug |
| SEARCH SOLUTION | Problem and its cause are known but not a solution, Developer searches for solution | Read code, Web search, Ask other developers, Read documentation |
| APPLY SOLUTION | Solution for problem is known, Developer applies the solution | Refactor code, Write new code, Copy and adapt code |
| TEST SOLUTION | Developer tests the implemented solution | Execute application, Run test cases |
| DOCUMENT SOLUTION | Developer documents the solution | Write code comments, Comment ticket, Write commit message |

Table III
OBSERVED DEVELOPER ACTIONS

| Observation | Description | Examples |
|-------------|-------------|----------|
| UNKNOWN | Unknown observation | |
| SEARCH | Search commands | Google Web search |
| READ | Reading source code | Read source code |
| CHANGE | Changing source code | Write new code, Change code |
| PROBLEM | Problem occurring during development work | Runtime exception, Compiler error |
| EXECUTE COMMAND | Eclipse commands | Go to line start, Use code completion |
| BREAKPOINT | Manipulation of a breakpoint | Set breakpoint, Remove breakpoint |

We use this ontology of developer actions to translate the actions (provided by the TeamWeaver framework) into observations (used as input for the learning algorithm) as follows. If a developer changes source code, the TeamWeaver framework detects it and creates a new action event. Each action event is an instance of a concept in the ontology. Ontology concepts can be marked as observations. The translation algorithm now inspects the current concept of the action event. If the current concept is marked as an observation, the current action event is an observation and it is returned. If not, the algorithm traverses upwards in the action hierarchy tree and checks the parent concept of the current concept. This is done recursively until a concept that is marked as an observation has been found. If no parent concept that is marked as an observation is found, the action is mapped to the observation UNKNOWN.

Using more abstract concepts as observations makes it possible to exploit the taxonomy of actions in the ontology in order to improve the detection rate when detecting activities. Consider a developer action SEARCH with two subclasses, WEBSEARCH (searching the web) and WIKISEARCH (searching a local wiki). If only SEARCH is marked as an observation and the learning algorithm is trained only with WEBSEARCH events, it can still handle WIKISEARCH correctly without being trained for it. Both special searches are abstracted to SEARCH and the algorithm detects that the user is searching for the solution to a problem.
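The translation step and the SEARCH example above can be sketched as follows. This is a minimal Python illustration, not the TeamWeaver implementation: the concept names other than SEARCH, WEBSEARCH, WIKISEARCH, CHANGE, and UNKNOWN, the `PARENT` dictionary, and the function name `to_observation` are assumptions made for the example, and the upward traversal is written as a loop rather than a recursion.

```python
from typing import Dict, Optional, Set

# Hypothetical fragment of an action hierarchy: concept -> parent concept.
# Only SEARCH, WEBSEARCH, WIKISEARCH and CHANGE are named in the text;
# the other concept names are made up for illustration.
PARENT: Dict[str, Optional[str]] = {
    "ChangeMethodVisibility": "ChangeArtifact",
    "ChangeArtifact": "CHANGE",
    "CHANGE": None,
    "WEBSEARCH": "SEARCH",
    "WIKISEARCH": "SEARCH",
    "SEARCH": None,
    "OpenPreferences": None,
}

# Concepts marked as observations (a subset of Table III).
OBSERVATIONS: Set[str] = {"SEARCH", "CHANGE"}


def to_observation(
    concept: Optional[str],
    parent: Dict[str, Optional[str]] = PARENT,
    observations: Set[str] = OBSERVATIONS,
) -> str:
    """Walk up the action hierarchy until a concept marked as an
    observation is found; fall back to UNKNOWN otherwise."""
    seen = set()  # guards against accidental cycles in the hierarchy
    current = concept
    while current is not None and current not in seen:
        if current in observations:
            return current
        seen.add(current)
        current = parent.get(current)  # None for roots and unknown concepts
    return "UNKNOWN"


if __name__ == "__main__":
    for c in ["WEBSEARCH", "WIKISEARCH", "ChangeMethodVisibility", "OpenPreferences"]:
        print(c, "->", to_observation(c))
```

Because both WEBSEARCH and WIKISEARCH resolve to SEARCH, a model trained only on web searches receives the same observation symbol for wiki searches.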

Table III gives an overview of the observations. All action events generated by the TeamWeaver framework were translated to one of these observations.

C. Inferring Activities From Observations Using HMMs

In this section we describe why we chose Hidden Markov Models (HMM) and how we use them to infer developer activities from observations.

Selection of the learning algorithm: Observations cannot be mapped to activities directly for several reasons. First, an observation may occur in different activities, e.g. the observation CHANGE may belong to the activities APPLYSOLUTION or DOCUMENTSOLUTION. Second, the mapping between an observation and an activity often depends on the previous observation, e.g. observation CHANGE after a successful run may indicate that the developer is documenting whereas a CHANGE after a web search may indicate the application of a solution. Third, several observations may be observed in a single activity before switching to another activity. These reasons rule out a direct mapping from observations to activities and lead to three requirements for a learning algorithm: 1) It must be able to map visible, low-level observations to invisible, high-level activities. 2) It must be able to cope with the fact that a specific type of observation can be mapped to multiple activities depending on context. 3) It has to take into account the history of observations because the current activity depends on the sequence of former observations and activities.

Usage of HMMs: HMMs [9] are designed to infer hidden states from visible observations. At each point in time, an HMM is in a certain state and when it reads an observation, it updates its state. Each state depends directly on the current observation and the previous state (by state transition probabilities) and indirectly on all observations and states seen before. We chose HMMs because they are designed to infer hidden states from visible observations (requirement 1), deal with requirement 2) by probability maximization, and fulfill requirement 3) as each state depends indirectly on the historical sequence of states. Our activities correspond to HMM states and our observations correspond to HMM observations. We implemented the HMMs using the jahmm library¹.
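The dependency structure described in this paragraph can be written compactly in the standard notation of Rabiner [9]. The symbols are not used in the paper itself and are introduced here only as a sketch: $q_t$ is the activity (hidden state) at step $t$, $o_t$ the observation at step $t$, $\pi$ the initial probabilities, $A = (a_{ij})$ the state transition probabilities, and $B = (b_j(o))$ the observation probabilities.

```latex
P(o_1,\dots,o_T,\; q_1,\dots,q_T \mid \lambda)
  = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t),
\qquad \lambda = (\pi, A, B)
```

Each factor involves only the current observation and the previous state, which is the direct dependency named above; the dependency on earlier observations and states arises only through the product.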

HMM Operation: An HMM is used to predict the current activity of developers based on their actions. Each developer action is detected by the TeamWeaver framework, translated to an observation, and fed to the HMM. Based on the current state, the sequence of all previous observations, and the current observation, the HMM determines its new state. Technically, each observation is appended to the sequence of observations and then the Viterbi algorithm is executed to compute the most probable state sequence corresponding to the observation sequence. From this most probable state sequence, the last state is returned as the new state. The algorithm runs online, right after detection of a developer action.
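The online prediction step can be sketched as follows. This is a minimal Python illustration of the Viterbi recursion, not the jahmm-based implementation used in the paper; the function name `viterbi_last_state` and all probabilities in the toy model are made-up assumptions. Since only the last state of the most probable path is needed, the sketch keeps the best path score per state and omits back-pointers.

```python
import math
from typing import Dict, List, Sequence


def _log(p: float) -> float:
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_last_state(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> str:
    """Return the final state of the most probable state sequence."""
    if not obs:
        raise ValueError("at least one observation is required")
    # best[s]: log-probability of the most probable path that ends in s
    best = {s: _log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in states}
    for o in obs[1:]:
        best = {
            s: max(best[r] + _log(trans_p[r][s]) for r in states)
            + _log(emit_p[s][o])
            for s in states
        }
    return max(states, key=lambda s: best[s])


# Toy two-state model; every number below is invented for illustration.
STATES = ["SEARCHSOLUTION", "APPLYSOLUTION"]
START = {"SEARCHSOLUTION": 0.5, "APPLYSOLUTION": 0.5}
TRANS = {
    "SEARCHSOLUTION": {"SEARCHSOLUTION": 0.6, "APPLYSOLUTION": 0.4},
    "APPLYSOLUTION": {"SEARCHSOLUTION": 0.3, "APPLYSOLUTION": 0.7},
}
EMIT = {
    "SEARCHSOLUTION": {"SEARCH": 0.9, "CHANGE": 0.1},
    "APPLYSOLUTION": {"SEARCH": 0.1, "CHANGE": 0.9},
}

if __name__ == "__main__":
    history: List[str] = []
    for observation in ["SEARCH", "CHANGE", "SEARCH"]:
        history.append(observation)  # append, then re-run Viterbi
        print(observation, "->", viterbi_last_state(history, STATES, START, TRANS, EMIT))
```

Re-running the recursion over the whole history after every action mirrors the description above; an incremental variant could carry `best` forward between calls instead.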

HMM Training: An HMM has to be trained before it can be used for prediction. We implemented an offline learning algorithm that is executed at system startup. Parameters to be learned are the initial probability, the state transition probability, and the observation probability. A first HMM model is estimated by doing frequency counts and calculating fractions with sample data. This first HMM model is optimized by the Baum-Welch learner.

¹ http://code.google.com/p/jahmm/

In our case, the initial probability corresponds to the probability of starting a PSC iteration with a certain activity, the state transition probability corresponds to the probability of transitioning from activity A1 to activity A2 when the next action is observed, and the observation probability corresponds to the probability of observing a certain action when being in a specific activity state. These probabilities are learned using a data set consisting of protocols of several developers’ implementation work. A protocol is a sequence of observations enriched with the corresponding activities.
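The first estimation step, frequency counts over labeled protocols, can be sketched as follows. This is a minimal Python illustration under stated assumptions: a protocol is represented as a list of (observation, activity) pairs, the function name `estimate_by_counting` and the example protocol are invented, no smoothing is applied, and the subsequent Baum-Welch optimization is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Protocol = Sequence[Tuple[str, str]]  # (observation, activity) pairs


def _normalize(counter: Counter) -> Dict[str, float]:
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def estimate_by_counting(protocols: List[Protocol]):
    """Estimate initial, transition and observation probabilities
    from labeled protocols by counting and calculating fractions."""
    start: Counter = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for protocol in protocols:
        if not protocol:
            continue
        start[protocol[0][1]] += 1  # activity the protocol starts with
        for observation, activity in protocol:
            emit[activity][observation] += 1
        for (_, prev), (_, cur) in zip(protocol, protocol[1:]):
            trans[prev][cur] += 1  # one transition per consecutive pair
    return (
        _normalize(start),
        {s: _normalize(c) for s, c in trans.items()},
        {s: _normalize(c) for s, c in emit.items()},
    )


# Invented example protocol, loosely following Table I.
EXAMPLE_PROTOCOL = [
    ("PROBLEM", "GETPROBLEM"),
    ("READ", "LOCATECAUSEOFPROBLEM"),
    ("BREAKPOINT", "LOCATECAUSEOFPROBLEM"),
    ("CHANGE", "APPLYSOLUTION"),
]

if __name__ == "__main__":
    start_p, trans_p, emit_p = estimate_by_counting([EXAMPLE_PROTOCOL])
    print(start_p)
    print(trans_p)
    print(emit_p)
```

With so little data many probabilities come out as zero, which is one reason a refinement step and a larger data set matter in practice.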

IV. PRELIMINARY EVALUATION

In order to test our approach, we conducted a preliminary evaluation by recording our own actions during a development session and feeding them to the HMM. To collect data about developer activities, we extended Eureka [5], a tool that collects developers’ textual descriptions about their work and associates them with events sensed by the TeamWeaver framework. We extended Eureka by adding a drop-down menu with all activities of the PSC. A developer can use this menu to report on her activity in the PSC when switching to another activity by selecting the finished activity from the menu. The developer actions sensed by TeamWeaver are appended automatically.

We used Eureka to record one day of development carried out by one of the authors. This resulted in a collection of 18 activities and 316 actions. We trained the HMM with this data set. Afterwards, we fed all observations one after the other to the HMM and used the HMM to predict the current activity. We then compared the activity predicted by the HMM with the activity indicated by the developer in the protocol. This experiment resulted in a 72% detection rate, that is, 72% of the activities predicted by the HMM matched the activities specified by the developer.
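The comparison behind the detection rate can be sketched as follows. This is a minimal Python illustration that assumes one prediction per observation, compared against the activity reported for that observation; the paper does not state the exact counting unit, and the function name `detection_rate` and the sample labels are invented.

```python
from typing import Sequence


def detection_rate(predicted: Sequence[str], reported: Sequence[str]) -> float:
    """Fraction of predictions that match the developer-reported activity."""
    if len(predicted) != len(reported):
        raise ValueError("both sequences must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    matches = sum(p == r for p, r in zip(predicted, reported))
    return matches / len(predicted)


if __name__ == "__main__":
    # Invented labels, not data from the evaluation.
    predicted = ["GETPROBLEM", "SEARCHSOLUTION", "APPLYSOLUTION", "TESTSOLUTION"]
    reported = ["GETPROBLEM", "SEARCHSOLUTION", "DOCUMENTSOLUTION", "TESTSOLUTION"]
    print(detection_rate(predicted, reported))
```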

V. CONCLUSION AND FUTURE WORK

In this paper we presented an approach to detect the activity of developers based on their actions and to recognize when developers face problems. We model development work as a problem solution cycle, detect developer actions by instrumenting their tools, translate developer actions to more general observations using ontologies, and use Hidden Markov Models to infer developer activities from their actions. A preliminary experiment with a detection rate of 72% indicates that the approach is promising. However, we are aware that a larger, more reliable evaluation with different subjects and a careful analysis of false positive and false negative predictions has to be conducted.

We are currently developing several improvements to enhance our approach, e.g. using a more fine-grained ontology to describe developer actions. Further research is needed to apply our approach to context-aware developer support and to automatic knowledge recording and sharing.

REFERENCES

[1] I. D. Coman and A. Sillitti. Automated identification of tasks in development sessions. In Proc. of the 16th IEEE Int. Conf. on Program Comprehension, ICPC’08, pages 212–217, 2008.

[2] A. N. Dragunov, T. G. Dietterich, K. Johnsrude, M. McLaughlin, L. Li, and J. L. Herlocker. TaskTracer: A desktop environment to support multi-tasking knowledge workers. In Proc. of the 10th Int. Conf. on Intelligent User Interfaces, IUI’05, pages 75–82, 2005.

[3] M. Kersten and G. C. Murphy. Using task context to improve programmer productivity. In Proc. of the 14th ACM SIGSOFT Int. Symposium on Foundations of Software Engineering, pages 1–11. ACM, 2006.

[4] W. Maalej and H.-J. Happel. A lightweight approach for knowledge sharing in distributed software teams. In Practical Aspects of Knowledge Management, volume 5345 of LNCS, pages 14–25. Springer Berlin / Heidelberg, 2008.

[5] W. Maalej and H.-J. Happel. From work to word: How do software developers describe their work? In Proc. of the 6th IEEE Int. Working Conf. on Mining Software Repositories, MSR’09, pages 121–130, 2009.

[6] W. Maalej and H.-J. Happel. Can development work describe itself? In Proc. of the 7th Int. Working Conf. on Mining Software Repositories, MSR’10, pages 191–200, 2010.

[7] W. Maalej and A. Sahm. Assisting engineers in switching artifacts by using task semantic and interaction history. In Proc. of the 2nd Int. Workshop on Recommendation Systems for Software Engineering, pages 59–63, 2010.

[8] C. Parnin and C. Görg. Building usage contexts during program comprehension. In Proc. of 14th IEEE Int. Conf. on Program Comprehension, ICPC’06, pages 13–22, 2006.

[9] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[10] T. Roehm, R. Tiarks, R. Koschke, and W. Maalej. How do professional developers comprehend software? In Proc. of the 34th Int. Conf. on Software Engineering, ICSE’12, 2012.

[11] J. Shen, W. Geyer, M. Muller, C. Dugan, B. Brownholtz, and D. R. Millen. Automatically finding and recommending resources to support knowledge workers’ activities. In Proc. of the 13th Int. Conf. on Intelligent User Interfaces, IUI’08, pages 207–216, 2008.

[12] J. Shen, J. Irvine, X. Bao, M. Goodman, S. Kolibaba, A. Tran, F. Carl, B. Kirschner, S. Stumpf, and T. G. Dietterich. Detecting and correcting user activity switches: Algorithms and interfaces. In Proc. of the 14th Int. Conf. on Intelligent User Interfaces, IUI’09, pages 117–126, 2009.
