


IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 1

How We Refactor, and How We Know It

Emerson Murphy-Hill, Chris Parnin, and Andrew P. Black

Abstract—Refactoring is widely practiced by developers, and considerable research and development effort has been invested in refactoring tools. However, little has been reported about the adoption of refactoring tools, and many assumptions about refactoring practice have little empirical support. In this article, we examine refactoring tool usage and evaluate some of the assumptions made by other researchers. To measure tool usage, we randomly sampled code changes from four Eclipse and eight Mylyn developers and ascertained, for each refactoring, if it was performed manually or with tool support. We found that refactoring tools are seldom used: 11% by Eclipse developers and 9% by Mylyn developers. To understand refactoring practice at large, we drew from a variety of datasets spanning more than 39 000 developers, 240 000 tool-assisted refactorings, 2500 developer hours, and 12 000 version control commits. Using these data, we cast doubt on several previously-stated assumptions about how programmers refactor, while validating others. Finally, we interviewed the Eclipse and Mylyn developers to help us understand why they did not use refactoring tools, and to gather ideas for future research.

Index Terms—refactoring, refactoring tools, floss refactoring, root-canal refactoring


1 INTRODUCTION

Refactoring is the process of changing the structure of software without changing its behavior. While the practice of restructuring software has existed ever since software has been structured, the term was introduced by Opdyke and Johnson [12]. Later, Fowler popularized refactoring when he cataloged 72 different refactorings, ranging from localized changes such as INLINE TEMP, to more global changes such as TEASE APART INHERITANCE [5].

Especially in the last decade, the body of research about refactoring has been growing rapidly. Examples of such research include studies of the effect of refactoring on errors [20] and the relationship between refactoring and software quality [17]. Such research builds upon a foundation of previous work about how programmers refactor, such as what kinds of refactorings programmers perform and how frequently they perform them.

Unfortunately, this foundation is, in some cases, based on limited evidence, or on no evidence at all. For example, consider Murphy-Hill and Black’s Refactoring Cues tool that allows programmers to refactor several program elements at once [8]. If the assumption that programmers frequently want to refactor several program elements at once holds, this tool would be very useful. However, prior to the tool being introduced, no foundational research existed to support this assumption. As we show in this article, this case is not isolated; other research also rests on unsupported or weakly-supported

• E. Murphy-Hill is with the Department of Computer Science, North Carolina State University, Raleigh, NC, 27695. E-mail: [email protected]

• C. Parnin is with the College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332. E-mail: [email protected]

• A. P. Black is with the Department of Computer Science, Portland State University, Portland, OR, 97201. E-mail: [email protected]

foundations.

Without strong foundations for refactoring research, we can have only limited confidence that research built on these foundations will be valid in the larger context of real-world software development. As a step towards strengthening these foundations, this article revisits some of the assumptions and conclusions drawn in previous research. Our experimental method takes data from eight different sources (described in Section 2) and applies several different analysis strategies to them. The contributions of our work lie in both the experimental method and in the conclusions that we are able to draw:

• The RENAME refactoring tool is used much more frequently by ordinary programmers than by the developers of refactoring tools (Section 3.1);

• about 40% of refactorings performed using a tool occur in batches (Section 3.2);

• about 90% of configuration defaults in refactoring tools are not changed when programmers use the tools (Section 3.3);

• messages written by programmers in version-control commit logs do not reliably indicate the presence of refactoring in the commit (Section 3.4);

• programmers frequently floss refactor, that is, they interleave refactoring with other types of programming activity (Section 3.5);

• about half of refactorings are not high-level, so refactoring detection tools that look exclusively for high-level refactorings will not detect them (Section 3.6);

• refactorings are performed frequently (Section 3.7);

• close to 90% of refactorings are performed manually, without the help of tools (Section 3.8); and

• the kind of refactoring performed with a tool differs from the kind performed manually (Section 3.9).

In Section 4 we discuss the interaction between these conclusions and the assumptions and conclusions of other researchers.

This article is an extension of work reported at ICSE 2009 [11], which provided evidence for each of the conclusions above. The primary weakness of the ICSE work was that several of our conclusions were based on data from a single development team. This article includes analysis of four additional data sets, with the consequence that every conclusion drawn here is based on data from at least two development teams.

2 THE DATA THAT WE ANALYZED

The work described in this article is based on eight sets of data. The first set we will call Users; it was originally collected in the latter half of 2005 by Murphy and colleagues [7], who used the Mylyn Monitor tool to capture and analyze fine-grained usage data from 41 volunteer programmers in the wild using the Eclipse development environment (http://eclipse.org). These data capture an average of 66 hours of development time per programmer; about 95 percent of the programmers wrote in Java. The data include information on which Eclipse commands were executed, and at what time. Murphy and colleagues originally used these data to characterize the way programmers used Eclipse, including a coarse-grained analysis of which refactoring tools were used most often. Murphy-Hill and Black have also used these data as a source of evidence for the claim that refactoring tools are underused [10].

The second set of data we will call Everyone; it is publicly available from the Eclipse Usage Collector [18], and includes data from every user of the Eclipse Ganymede release who consented to an automated request to send data back to the Eclipse Foundation. These data aggregate activity from over 13 000 Java developers between April 2008 and January 2009, but also include non-Java developers. The data include information on the number of programmers who have used each Eclipse command (including the refactoring commands), and how many times each command was executed. We know of no other researchers who have used this data for investigating programmer behavior.

The third set of data we will call Toolsmiths; it includes refactoring histories from 4 developers who primarily maintain Eclipse’s refactoring tools. These data include detailed histories of which refactorings were executed, when they were performed, and with what configuration parameters. These data include all the information necessary to recreate the usage of a refactoring tool, assuming that the original source code is also available. These data were collected between December 2005 and August 2007, although the date ranges are different for each developer. This data set is not publicly available. The only author that we know of using similar data is Robbes [16]; he reports on refactoring tool usage by himself and one other developer.

The fourth set of data we will call Eclipse CVS, because it is the version history of the Eclipse and JUnit (http://junit.org) code bases as extracted from their Concurrent Versioning System (CVS) repositories. Commonly, CVS data must be preprocessed before analysis. This is because CVS does not record which file revisions were committed in a single transaction. The standard approach for recovering transactions is to find revisions committed by the same developer with the same commit message within a small time window [22]; we use a 60 second time window. Henceforth, we use the word “revision” to refer to a particular version of a file, and the word “commit” to refer to one of these synthesized commit transactions. We excluded from our sample (a) commits to CVS branches, which would have complicated our analysis, and (b) commits that did not include a change to a Java file.
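The transaction-recovery heuristic can be sketched as follows. This is a minimal illustration, not the implementation used in this study; the `Revision` record and its field names are hypothetical, and the sketch chains revisions by comparing each one to the previous revision in the group, one common variant of the sliding-window approach [22].

```python
from dataclasses import dataclass

@dataclass
class Revision:
    """One CVS file revision (hypothetical fields)."""
    author: str
    message: str
    timestamp: float  # seconds since epoch

def group_into_commits(revisions, window=60):
    """Group file revisions into synthesized commit transactions:
    same author, same commit message, and each revision within
    `window` seconds of the previous revision in the group."""
    commits = []
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        last = commits[-1] if commits else None
        if (last
                and last[-1].author == rev.author
                and last[-1].message == rev.message
                and rev.timestamp - last[-1].timestamp <= window):
            last.append(rev)
        else:
            commits.append([rev])
    return commits
```

For example, two revisions by the same developer with the same message 30 seconds apart fall into one commit, while a third revision 70 seconds later starts a new one.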

In our experiments, we focus on a subset of the commits in Eclipse CVS. Specifically, we randomly sampled from about 3400 source file commits (Section 3.4) that correspond to the same time period, the same projects, and the same developers represented in Toolsmiths. Using these data, two of the authors (Murphy-Hill and Parnin) inferred which refactorings were performed by comparing adjacent commits manually. While many authors have mined software repositories automatically for refactorings (for example, Weißgerber and Diehl [20]), we know of no other researchers that have compared refactoring tool logs with code histories.

The fifth set of data we call Mylyn; it includes the refactoring histories from 8 developers who primarily maintain the Mylyn project, a task-focused interface for Eclipse (www.eclipse.org/mylyn). The data format is the same as for Toolsmiths, although we obtained it through different means; the developers checked their refactoring histories into CVS while working on the project, so those histories are publicly available from Mylyn’s open-source code repository. The refactoring history spans the period from February 2006 to August 2009, though different developers worked on the project at different periods of time.

The sixth dataset we call Mylyn CVS, which is the version control history corresponding to the Mylyn refactoring history, in the same way that Eclipse CVS corresponds to Toolsmiths. We analyzed, filtered, randomly drew from, and inspected the data in the same way as with Eclipse CVS.

The seventh dataset is called UDC Events, which is a subset of the Everyone data containing more detail: instead of aggregating counts of Eclipse command uses, UDC Events contains timestamps for each command usage. This data is much like the Users data but includes 275 903 developers spanning several weeks in June and July 2009.

The final dataset is called Developer Responses. When we completed our analysis of the first six data sets, we sent a survey to three developers in the Toolsmiths dataset and four developers in the Mylyn dataset. The survey included several questions about those developers’ refactoring habits and refactoring tool use habits.1 In total, we received two responses from Toolsmith developers and three responses from Mylyn developers. We use this qualitative Developer Responses data to augment the quantitative data in the other seven data sets.

3 FINDINGS ON REFACTORING BEHAVIOR

In this section we analyze these eight sets of data and discuss our findings.

1. The survey template appears in Appendix 2.


                              ---------- Toolsmiths ----------  ------------ Mylyn -------------  ------------ Users -------------  --- Everyone ---
Refactoring Tool              Uses   Use %   Batched  Batched %  Uses   Use %    Batched  Batched %  Uses   Use %   Batched  Batched %  Uses    Use %
Rename                        670    28.7%   283      42.2%      2706   53.6%    1146     42.4%      1862   61.5%   1009     54.2%      49672   71.8%
Extract Local Variable        568    24.4%   127      22.4%      260    5.2%     57       21.9%      322    10.6%   106      32.9%      4917    7.1%
Inline                        349    15.0%   132      37.8%      110    2.2%     52       47.3%      137    4.5%    52       38.0%      1426    2.1%
Extract Method                280    12.0%   28       10.0%      316    6.3%     27       8.5%       259    8.6%    57       22.0%      3345    4.8%
Move                          147    6.3%    50       34.0%      958    19.0%    459      47.9%      171    5.6%    98       57.3%      3869    5.6%
Change Method Signature       93     4.0%    26       28.0%      191    3.8%     73       38.2%      55     1.8%    20       36.4%      1642    2.4%
Convert Local To Field        92     3.9%    12       13.0%      22     0.4%     10       45.5%      27     0.9%    10       37.0%      504     0.7%
Introduce Parameter           41     1.8%    20       48.8%      1      0.1%     0        -          16     0.5%    11       68.8%      162     0.2%
Extract Constant              22     0.9%    6        27.3%      278    5.5%     91       32.7%      81     2.7%    48       59.3%      1039    1.5%
Convert Anonymous To Nested   18     0.8%    0        0.0%       25     0.5%     0        0.0%       19     0.6%    7        36.8%      86      0.1%
Move Member Type to New File  15     0.6%    0        0.0%       55     1.1%     4        7.3%       12     0.4%    5        41.7%      343     0.5%
Pull Up                       12     0.5%    0        0.0%       17     0.3%     2        11.8%      36     1.2%    4        11.1%      397     0.6%
Encapsulate Field             11     0.5%    8        72.7%      29     0.6%     17       58.6%      4      0.1%    2        50.0%      406     0.6%
Extract Interface             2      0.1%    0        0.0%       29     0.6%     2        6.9%       15     0.5%    0        0.0%       492     0.7%
Generalize Declared Type      2      0.1%    0        0.0%       0      0.0%     0        -          4      0.1%    2        50.0%      56      0.1%
Push Down                     1      0.1%    0        -          3      0.1%     2        66.7%      1      0.1%    0        -          80      0.1%
Infer Generic Type Arguments  0      0.0%    0        -          31     0.6%     13       41.9%      3      0.1%    0        0.0%       179     0.3%
Use Supertype Where Possible  0      0.0%    0        -          1      0.1%     0        -          2      0.1%    0        0.0%       47      0.1%
Introduce Factory             0      0.0%    0        -          0      0.0%     0        -          1      0.1%    0        -          31      0.1%
Extract Superclass            7      0.3%    0        0.0%       16     0.3%     0        0.0%       *      *       *        *          158     0.2%
Extract Class                 1      0.1%    0        0.0%       0      0.0%     0        -          *      *       *        *          289     0.4%
Introduce Parameter Object    0      0.0%    0        -          0      0.0%     0        -          *      *       *        *          64      0.1%
Introduce Indirection         0      0.0%    0        -          0      0.0%     0        -          *      *       *        *          51      0.1%
Total                         2331   100%    692      29.7%      5048   100.0%   1955     38.7%      3027   100%    1431     47.3%      69255   100%

TABLE 1: Refactoring tool usage in Eclipse. Some tool logging began in the middle of the Toolsmiths and Mylyn data collection (shown in light grey in the original) and after the Users data collection (denoted with a *). The ‘-’ symbol denotes a percentage corresponding to a fraction for which the denominator is zero.

Page 4: IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. …people.engr.ncsu.edu/ermurph3/papers/tse11a.pdf · IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 1

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. T, MONTH YEAR 4

3.1 Toolsmiths and Users Differ

We hypothesize that the refactoring behavior of the programmers who develop the Eclipse refactoring tools differs from that of the programmers who use them. Toleman and Welsh assume a variant of this hypothesis — that the designers of software tools erroneously consider themselves typical tool users — and argue that the usability of software tools should be objectively evaluated [19]. To test our hypothesis, we compared the refactoring tool usage in the Toolsmith data set against the tool usage in the User and Everyone data sets. For this comparison, we will omit the Mylyn dataset, because Users and Everyone are more likely to represent the behavior of people who do not develop refactoring tools.

In Table 1, the “Uses” columns indicate the number of times each refactoring tool was invoked in that dataset. The “Use %” column presents the same measure as a percentage of the total number of refactorings. (The “Batched” columns are discussed in Section 3.2.) Notice that while the rank order of each tool is similar across Toolsmiths, Users, and Everyone — RENAME, for example, always ranks first — the proportion of uses of the individual refactoring tools varies widely between Toolsmiths and Users/Everyone. In Toolsmiths, RENAME accounts for about 29% of all refactorings, whereas in Users it accounts for about 62% and in Everyone for about 72%. We suspect that this difference is not because Users and Everyone perform more RENAMES than Toolsmiths, but because Toolsmiths are more frequent users of the other refactoring tools.

This analysis is limited in two ways. First, each data set was gathered over a different period of time, and the tools themselves may have changed between those periods. Second, the Users data include both Java and non-Java RENAME and MOVE refactorings, but the Toolsmiths, Mylyn and Everyone data report on just Java refactorings. This may inflate actual RENAME and MOVE percentages in Users.

3.2 Programmers Repeat Refactorings

We hypothesize that when programmers perform a refactoring, they typically perform several more refactorings of the same kind within a short time period. For instance, a programmer may perform several EXTRACT LOCAL VARIABLES in preparation for a single EXTRACT METHOD, or may RENAME several related instance variables at once. Based on personal experience and anecdotes from programmers, we suspect that programmers often refactor multiple pieces of code because several related program elements need to be refactored in order to perform a composite refactoring. In previous research, Murphy-Hill and Black built a refactoring tool that supported refactoring multiple program elements at once, on the assumption that this is common [8].

To determine how often programmers do in fact repeat refactorings, we used the Toolsmiths, Mylyn and Users data to measure the temporal proximity of multiple invocations of a refactoring tool. We say that refactorings of the same kind that execute within 60 seconds of each other form a batch. From our personal experience, we think that 60 seconds is usually long enough to allow the programmer to complete an Eclipse wizard-based refactoring, yet short enough to exclude refactorings that are not part of the same conceptual group. Additionally, a few refactoring tools, such as PULL UP in Eclipse, can refactor multiple program elements, so a single application of such a tool is an explicit batch of related refactorings; we measured the median batch size for these tools.

Refactoring Tool     Toolsmiths    Mylyn
MOVE                 1 (n=147)     1 (n=958)
PUSH DOWN            5 (n=1)       3 (n=3)
PULL UP              2.5 (n=12)    4 (n=17)
EXTRACT INTERFACE    4.5 (n=2)     4 (n=29)
EXTRACT SUPERCLASS   17 (n=7)      9 (n=16)

TABLE 2: The median number of explicitly batched elements used for several refactoring tools, where n is the number of total uses of that refactoring tool.

[Figure 1: line chart; y-axis: % batched (0–70); x-axis: batch threshold, in seconds (0–240); one line each for Users, Toolsmiths, and Mylyn.]

Fig. 1: Percentage of refactorings that appear in batches as a function of batch threshold, in seconds. The 60-second threshold used in Table 1 is drawn in green.
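The 60-second batching heuristic described above can be sketched as follows. This is an illustrative reimplementation, not the scripts used in the study; the encoding of tool-usage events as (timestamp, kind) pairs is an assumption.

```python
from collections import defaultdict

def find_batches(events, threshold=60):
    """Group tool-usage events into runs of the same refactoring kind,
    where consecutive invocations of a kind are at most `threshold`
    seconds apart. A batch is a run of two or more invocations.
    `events` is an iterable of (timestamp_seconds, refactoring_kind)."""
    runs_by_kind = defaultdict(list)
    for ts, kind in sorted(events):
        runs = runs_by_kind[kind]
        if runs and ts - runs[-1][-1] <= threshold:
            runs[-1].append(ts)  # extend the current run
        else:
            runs.append([ts])    # start a new run
    return {kind: [run for run in runs if len(run) >= 2]
            for kind, runs in runs_by_kind.items()}

def percent_batched(events, threshold=60):
    """Percentage of all invocations that fall inside some batch."""
    events = list(events)
    if not events:
        return 0.0
    batches = find_batches(events, threshold)
    batched = sum(len(run) for runs in batches.values() for run in runs)
    return 100.0 * batched / len(events)
```

Varying `threshold` reproduces the kind of sensitivity analysis shown in Figure 1.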

In Table 1, each “Batched” column indicates the number of refactorings that appeared as part of a batch, while each “Batched %” column indicates the percentage of refactorings that appeared as part of a batch. Overall, we can see that certain refactorings, such as RENAME, INLINE, and ENCAPSULATE FIELD, are more likely to appear as part of a batch, while others, such as EXTRACT METHOD and PULL UP, are less likely to appear in a batch. In total, we see that 30% of Toolsmiths refactorings, 39% of Mylyn refactorings, and 47% of Users refactorings appear as part of a batch.2

The median batch size for explicitly batched refactorings in tools that can refactor multiple program elements varied between tools in both Toolsmiths and Mylyn (Table 2). Overall, the table indicates that, with the exception of MOVE, most refactorings are performed in batches.

The main limitation of this analysis is that, while we wished to measure how often several related refactorings are performed in sequence, we instead used a 60-second heuristic. It may be that some related refactorings occur outside our 60-second window, and that some unrelated refactorings occur inside the window. To show how sensitive these results are to the batch threshold, Figure 1 displays the total percentage of batched refactorings for several different batch thresholds. Other metrics for detecting batches, such as burstiness, should be investigated in the future.

2. We suspect that the difference in percentages arises partially because the Toolsmiths and Mylyn data sets count the number of completed refactorings while Users counts the number of initiated refactorings. We have observed that programmers occasionally initiate a refactoring tool on some code, cancel the refactoring, and then re-initiate the same refactoring shortly thereafter [9].

Fig. 2: A configuration dialog box in Eclipse.

3.3 Programmers often don’t Configure Refactoring Tools

Refactoring tools are typically of two kinds: they either force the programmer to provide configuration information, such as whether a newly created method should be public or private — an example is shown in Figure 2 — or they quickly perform a refactoring without allowing any configuration. Configurable refactoring tools are more common in some environments, such as Netbeans (http://netbeans.org), whereas non-configurable tools are more common in others, such as X-develop (http://www.omnicore.com/en/xdevelop.htm). Which interface is preferable depends on how often programmers configure refactoring tools.

We hypothesize that programmers often don’t configure refactoring tools. We suspect this is because tweaking code manually after the refactoring may be easier than configuring the tool. In the past, we have found some limited evidence that programmers perform only a small amount of configuration of refactoring tools. When we conducted a small survey in September 2007 at a Portland Java User’s Group meeting, 8 programmers estimated that, on average, they supply configuration information only 25% of the time.

To validate this hypothesis, we counted how often programmers used various configuration options in the Toolsmiths and Mylyn data, when performing the 5 refactorings most frequently performed by Toolsmiths. We skipped refactorings that did not have configuration options. The results of the analysis are shown in Table 3. “Configuration Option” refers to a configuration parameter that the user can change. “Default Value” refers to the default value that the tool assigns to that option. “Change Frequency” refers to how often a user used a configuration option other than the default. The data suggest that refactoring tools are configured infrequently: the overall mean change frequency for these options is about 10% in Toolsmiths and 12% in Mylyn. Although different configuration options are changed from defaults with varying frequencies, almost all configuration options that we inspected were below the average configuration percentage predicted by the Portland Java User’s Group survey.
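The change-frequency measure can be computed with a small helper like the one below. The dict-of-options encoding of each logged refactoring is a hypothetical stand-in for the actual log format used in the study.

```python
def change_frequency(uses, option, default):
    """Fraction of tool uses in which `option` deviated from `default`.
    Each use is a mapping from option name to the value recorded in the
    refactoring log (hypothetical encoding); an absent option is treated
    as left at its default."""
    uses = list(uses)
    if not uses:
        return 0.0
    changed = sum(1 for u in uses if u.get(option, default) != default)
    return changed / len(uses)
```

For example, for EXTRACT METHOD the default visibility is `private`, so any use recording another visibility counts as a change.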

This analysis has several limitations. First, we could not count how often certain configuration options were changed, such as how often parameters are reordered when EXTRACT METHOD is performed. Second, we examined only the 5 most-common refactorings; configuration may be more frequent for less popular refactorings. Third, we measured how often a single configuration option is changed, but not how often any configuration option is changed for a refactoring. Fourth, we were not able to distinguish between the case where a developer purposefully used a non-default configuration option, and the case where she blindly used the non-default option left over from the last time that she used the tool.

3.4 Commit Messages don’t predict Refactoring

Several researchers have used messages attached to commits into a version control system as indicators of refactoring activity [6], [14], [15], [17]. For example, if a programmer commits code to CVS and attaches the commit message “refactored class Foo,” we might predict that the committed code contains more refactoring activity than if a programmer commits with a message that does not contain the word “refactor.” However, we hypothesize that this assumption is false, perhaps because refactoring can be an unconscious activity [2, p. 47], and perhaps because the programmer may consider the refactoring subordinate to some other activity, such as adding a feature [10].

In his thesis, Ratzinger describes the most sophisticated strategy for finding refactoring messages of which we are aware [14]: searching for the occurrence of 13 keywords, such as “move” and “rename,” and excluding “needs refactoring.” Using two different project histories, the author randomly drew 100 file modifications from each project and classified each as either a refactoring or as some other change. He found that his keyword technique accurately classified modifications 95.5% of the time. Based on this technique, Ratzinger and colleagues concluded that an increase in refactoring activity tends to be followed by a decrease in software defects [15].
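A classifier in the spirit of Ratzinger’s technique can be sketched as follows. Note that the article names only “move” and “rename” of the 13 keywords and one excluded phrase, so the lists below are an illustrative subset, not his actual list.

```python
import re

# Illustrative subset of refactoring keywords; Ratzinger's full list
# contains 13 keywords, of which only "move" and "rename" are named here.
KEYWORDS = ["refactor", "move", "rename", "extract", "inline"]
EXCLUDED_PHRASES = ["needs refactoring"]

def is_refactoring_message(message):
    """Classify a commit message as refactoring-related if it contains
    a keyword at the start of a word and none of the excluded phrases."""
    text = message.lower()
    if any(phrase in text for phrase in EXCLUDED_PHRASES):
        return False
    # \b before each keyword avoids false matches such as "remove" -> "move"
    return any(re.search(r"\b" + kw, text) for kw in KEYWORDS)
```

The word-boundary anchor matters: without it, a message like “remove dead code” would spuriously match the keyword “move.”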

We replicated Ratzinger’s experiment for the Eclipse code base. Using the Eclipse CVS data, we grouped individual file revisions into global commits as previously discussed in Section 2. We also manually removed commits whose messages referred to changes to a refactoring tool (for example, “105654 [refactoring] Convert Local Variable to Field has problems with arrays”), because such changes are false positives that occur only because the project’s code is itself implementing refactoring tools. Next, using Ratzinger’s 13 keywords, we automatically classified the log messages for the remaining 2788 commits. 10% of these commits matched the keywords, which compares with Ratzinger’s reported 11% and 13% for two other projects [14]. Next, a third party randomly drew 20 commits from the set that matched the keywords (which we will call “Labeled”) and 20 from the set that did not match (“Unlabeled”). Without knowing whether a commit was in the Labeled or Unlabeled group, two of the authors (Murphy-Hill and Parnin) manually compared each committed version of the code against the previous version, inferring how many and which refactorings were performed, and whether at least one non-refactoring change was made. Together, Murphy-Hill and Parnin compared these 40 commits over the span of about 6 hours, comparing the code using a single computer and Eclipse’s standard compare tool.

Refactoring Tool        Configuration Option                                 Default Value   Change Frequency
                                                                                             Toolsmiths   Mylyn
Extract Local Variable  Declare the local variable as ‘final’                false           5%           0%
Extract Method          New method visibility                                private         6%           6%
                        Declare thrown runtime exceptions                    false           24%          0%
                        Generate method comment                              false           9%           1%
Rename Type             Update references                                    true            3%           0%
                        Update similarly named variables and methods         false           24%          26%
                        Update textual occurrences in comments and strings   false           15%          23%
                        Update fully qualified names in non-Java text files  true            7%           35%
Rename Method           Update references                                    true            0%           0%
                        Keep original method as delegate to renamed method   false           1%           1%
Inline Method           Delete method declaration                            true            9%           1%

TABLE 3: Refactoring tool configuration in Eclipse.

The results are shown in Table 4, under the Eclipse CVS heading. In the left column, the kind of Change is listed. Pure Whitespace means that the developer changed only whitespace or comments; No Refactoring means that the developer did not refactor but did change program behavior; Some Refactoring means that the developer both refactored and changed program behavior; and Pure Refactoring means the programmer refactored but did not change program behavior. The center column counts the number of Labeled commits with each kind of change, and the right column counts the number of Unlabeled commits. The parenthesized lists record the number of refactorings found in each commit. For instance, the table shows that in 5 commits, when a programmer mentioned a refactoring keyword in the CVS commit message, the programmer made both functional and refactoring changes. The 5 commits contained 1, 4, 11, 15, and 17 refactorings.

These results suggest that classifying CVS commits by commit message does not provide a complete picture of refactoring activity. While all 6 pure-refactoring commits were identified by commit messages that contained one of the refactoring keywords, commits labeled with a refactoring keyword contained far fewer refactorings (63, or 36% of the total) than those not so labeled (112, or 64%). Figure 3 shows the variety of refactorings in Labeled (darker bars) commits and Unlabeled (lighter bars) commits. We will explain the (H), (M), and (L) tags in Section 3.6.

We replicated this experiment once more for the Mylyn CVS data set. As with Eclipse CVS, 10% of the commits were classified as refactorings using Ratzinger’s method. Under the Mylyn CVS heading, Table 4 shows the results of this experiment. These results confirm that classifying CVS commits by commit message does not provide a complete picture of refactoring, because commits labeled with a refactoring keyword contained fewer refactorings (52, or 46%) than those not so labeled (60, or 54%).

There are several limitations to this analysis. First, while we tried to replicate Ratzinger’s experiment [14] as closely as was practicable, the original experiment was not completely specified, so we cannot say with certainty that the observed differences were not due to methodology. Likewise, observed differences may be due to differences in the projects studied. Indeed, after we completed this analysis, a personal communication with Ratzinger revealed that the original experiment included and excluded keywords specific to the projects being analyzed. Second, because the process of gathering and inspecting subsequent code revisions is labor intensive, our sample size (40 commits in total) is smaller than would otherwise be desirable. Third, the classification of a code change as a refactoring is somewhat subjective. For example, if a developer removes code known to her never to be executed, she may legitimately classify that activity as a refactoring, although to an outside observer it may appear to be the removal of a feature. We tried to be conservative, classifying changes as refactorings only when we were confident that they preserved behavior. Moreover, because the comparison was blind, any bias introduced in classification would have applied equally to both Labeled and Unlabeled commit sets.

3.5 Floss Refactoring is Common

In previous work, Murphy-Hill and Black distinguished two tactics that programmers use when refactoring: floss refactoring and root-canal refactoring [10]. During floss refactoring, the programmer uses refactoring as a means to reach a specific end, such as adding a feature or fixing a bug. Thus, during floss refactoring the programmer intersperses other kinds of program changes with refactorings to keep the code healthy. Root-canal refactoring, in contrast, is used for correcting deteriorated code and involves a protracted process consisting exclusively of refactoring.

                      Eclipse CVS                            Mylyn CVS
Change            Labeled            Unlabeled               Labeled                  Unlabeled
Pure Whitespace   1                  3                       1                        1
No Refactoring    8                  11                      6                        6
Some Refactoring  5 (1,4,11,15,17)   6 (2,9,11,23,30,37)     9 (1,1,1,2,3,5,7,9,11)   11 (1,1,2,2,2,3,4,6,8,9,14)
Pure Refactoring  6 (1,1,2,3,3,5)    0                       4 (1,1,4,6)              2 (1,7)
Total             20 (63)            20 (112)                20 (52)                  20 (60)

TABLE 4: Refactoring between commits in Eclipse CVS and Mylyn CVS. Plain numbers count commits in the given category; tuples contain the number of refactorings in each commit.

A survey of the literature suggested that floss refactoring is the recommended tactic, but provided only limited evidence that it is the more common tactic [10].

Why does this matter? Case studies in the literature, for example those reported by Pizka [13] and by Bourquin and Keller [1], describe root-canal refactoring. However, inferences drawn from these studies will be generally applicable only if most refactorings are indeed root-canals.

We can estimate which refactoring tactic is used more frequently from the Eclipse CVS and Mylyn CVS data. We first define behavioral indicators of floss and root-canal refactoring during programming sessions, which (in contrast to the intentional definitions given above) we can hope to recognize in the data. For convenience, we define a programming session as the period of time between consecutive commits to CVS by a single programmer. In a particular session, if a programmer both refactors and makes a semantic change, then we say that the programmer is floss refactoring. If a programmer refactors during a session but does not change the semantics of the program, then we say that the programmer is root-canal refactoring. Note that a true root-canal refactoring must also last an extended period of time, or take place over several sessions. The above behavioral definitions relax this requirement, and so will tend to over-estimate the number of root canals.
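These behavioral definitions can be restated as a small classifier. The following is an illustrative Python sketch with invented field names, not code used in the study:

```python
from dataclasses import dataclass

@dataclass
class Session:
    """The activity between two consecutive commits by one programmer."""
    refactorings: int       # number of refactorings observed in the session
    semantic_change: bool   # did the session also change program behavior?

def tactic(session):
    """Classify a session using the behavioral indicators defined above."""
    if session.refactorings == 0:
        return "none"
    # Refactoring mixed with a behavioral change indicates floss refactoring;
    # refactoring alone indicates (at most) root-canal refactoring.
    return "floss" if session.semantic_change else "root-canal"
```

As the text notes, this classifier over-approximates root canals, because it ignores the requirement that a true root canal be protracted.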

The results suggest that floss refactoring is more common than root-canal refactoring. Returning to Table 4, we can see that “Some Refactoring”, indicative of floss refactoring, accounted for 28% (11/40) of commits in Eclipse CVS and 50% (20/40) in Mylyn CVS. Comparatively, Pure Refactoring, indicative of root-canal refactoring, accounts for 15% (6/40) of commits in both Eclipse CVS and Mylyn CVS. Normalizing for the fact that only 10% (4/40) of all commits were labeled with refactoring keywords in Eclipse CVS, commits indicating floss refactoring would account for 30% of all commits while commits indicating root-canal would account for only 3% of all commits.³ Looking at the Eclipse CVS data another way, 98% of individual refactorings would occur as part of a Some Refactoring (floss) commit, while only 2% would occur as part of a Pure Refactoring (root-canal) commit, again after normalizing for labeled commits. For the Mylyn CVS data set, 86% of individual refactorings would occur as part of a Some Refactoring (floss) commit, while 14% would occur as part of a Pure Refactoring (root-canal) commit.

3. Our normalization procedure is described in Appendix 1.

We also notice that for Eclipse CVS in Table 4, the “Some Refactoring” (floss) row tends to show more refactorings per commit than the “Pure Refactoring” (root-canal) row. However, this trend was not confirmed in the Mylyn CVS data; the number of refactorings in floss commits does not appear to be significantly different from the number in root-canal commits.

Pure refactoring with tools is relatively infrequent in the Users data set, suggesting that very little root-canal refactoring occurred in Users as well. We counted the number of refactorings performed using a tool during sessions in that data. In no more than 10 out of 2671 sessions did programmers use a refactoring tool without also manually editing their program. In other words, in less than 0.4% of commits did we observe the possibility of root-canal refactoring using only refactoring tools.

Our analysis of Table 4 is subject to the same limitations described in Section 3.4. The analysis of the Users data set (but not the analysis of Table 4) is also limited in that we consider only those refactorings performed using tools. Some refactorings may have been performed by hand; these would appear in the data as edits, thus possibly inflating the count of floss refactoring and reducing the count of root-canal refactoring.

3.6 Many Refactorings are Medium and Low-level

Refactorings operate at a wide range of levels, from as low-level as single expressions to as high-level as whole inheritance hierarchies. Past research has often drawn conclusions based on observations of high-level refactorings. For example, several researchers have used automatic refactoring-detection tools to find refactorings in version histories, but these tools can generally detect only those refactorings that modify packages, classes, and member signatures [3], [4], [20], [21]. The tools generally do not detect sub-method-level refactorings, such as EXTRACT LOCAL VARIABLE and INTRODUCE ASSERTION. We hypothesize that in practice programmers also perform many lower-level refactorings. We suspect this because lower-level refactorings will not change the program’s interface, and thus programmers may feel more free to perform them.

[Figure 3 is a bar chart that cannot be reproduced in this text copy. Its bars compare Manual and Tool refactoring counts in Labeled and Unlabeled commits, for refactorings such as Rename Class, Extract Method, Move Member, and Remove Parameter, each tagged high (H), medium (M), or low (L) level.]

Fig. 3: Refactorings over 40 sessions for Mylyn (at left) and 40 sessions for Eclipse (at right). Refactorings shown include only those with tool support in Eclipse.

To investigate this hypothesis, we divided the refactorings that we observed in our manual inspection of Eclipse CVS commits into three levels — High, Medium and Low. We classified refactoring tool uses in the Mylyn and Toolsmiths data in the same way. High-level refactorings are those that change the signatures of classes, methods, or fields; refactorings at this level include RENAME CLASS, MOVE STATIC FIELD, and ADD PARAMETER. Medium-level refactorings are those that change the signatures of classes, methods, and fields and also significantly change blocks of code; this level includes EXTRACT METHOD, INLINE CONSTANT, and CONVERT ANONYMOUS TYPE TO NESTED TYPE. Because medium-level refactorings affect both signatures and code, automated analysis of them requires more sophistication, and automated refactoring detectors may not identify them properly. Low-level refactorings are those that make changes only to blocks of code; low-level refactorings include EXTRACT LOCAL VARIABLE, RENAME LOCAL VARIABLE, and ADD ASSERTION. Refactorings with tool support that were found in the Eclipse CVS and Mylyn CVS data sets are labeled as high (H), medium (M), and low (L) in Figure 3.⁴
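As a concrete illustration of a low-level refactoring, consider EXTRACT LOCAL VARIABLE, which names an intermediate expression without touching any signature. The example below is an invented sketch in Python for brevity (the study itself concerns Java code in Eclipse):

```python
# Before EXTRACT LOCAL VARIABLE:
def total_before(quantity, unit_price):
    return quantity * unit_price * 0.9  # apply a 10% discount

# After: the discount expression is named. Behavior is preserved and the
# function's signature (its interface) is unchanged, which is why such
# refactorings are invisible to signature-based detection tools.
def total_after(quantity, unit_price):
    discounted = quantity * unit_price * 0.9  # apply a 10% discount
    return discounted
```

Because only the body of the function changes, a detector that compares package, class, and member signatures between revisions sees no difference at all.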

The results of this analysis are displayed in Table 5. For each level of refactoring, we show what percentage of refactorings from Eclipse CVS (normalized), Toolsmiths, Mylyn CVS (normalized), and Mylyn make up that level. We see that many low- and medium-level refactorings do indeed take place; as a consequence, tools that detect only high-level refactorings will miss 24 to 60 percent of refactorings.

          Eclipse CVS   Toolsmiths   Mylyn CVS   Mylyn
Low           21%           33%          28%       10%
Medium        11%           27%          22%       14%
High          68%           40%          50%       76%

TABLE 5: Refactoring level percentages in four data sets.

4. Note that one refactoring, GENERALIZE DECLARED TYPE, can be either high (if the type is declared in a signature) or low (if the type is declared in the body of a method). This refactoring was excluded from the analysis reflected in the data in Table 5.

3.7 Refactorings are Frequent

While the concept of refactoring is now popular, it is not entirely clear how commonly refactoring is practiced. In Xing and Stroulia’s automated analysis of the Eclipse code base, the authors conclude that “indeed refactoring is a frequent practice” [21]. The authors make this claim largely based on observing a large number of structural changes, 70% of which are considered to be refactoring. However, this figure is based on manually excluding 75% of semantic changes — resulting in refactorings accounting for 16% of all changes. Further, their automated approach suffers from several limitations, such as the failure to detect low-level refactorings, imprecision when distinguishing signature changes from semantic changes, and the coarse granularity available from the inspection of CVS revisions.

To validate the hypothesis that refactoring is a frequent practice, we characterize the occurrence of refactoring activity in the Users, Toolsmiths, and Mylyn data. Note that these data sets contain records of only those refactorings that were performed with tools.

In order for refactoring activity to be defined as frequent, we seek to apply criteria that require refactoring to be habitual and to occur at regular intervals. For example, if refactoring occurs just before a software release, but not at other times, then we would not want to claim that refactoring is frequent. First, we examined the Toolsmiths and Mylyn data to ascertain how refactoring activity was spread throughout the development cycle. Second, we examined the Users data to determine how often refactoring occurred within a programming session, and whether there was significant variation across the population.

In the Toolsmiths data, we found that refactoring occurred throughout the Eclipse development cycle. In 2006, an average of 30 refactorings took place each week; in 2007, there were 46 refactorings per week. Only two weeks in 2006 did not have any refactoring activity, and one of these was a winter holiday week. In 2007, refactoring occurred every week. In the Mylyn data, we find a similar trend for the first two years, but a drop in the frequency, and eventually the magnitude, of refactoring in the last two years. Specifically, refactoring activity did not occur for 1 week in both 2006 and 2007, 7 weeks in 2008, and 8 (out of 34) weeks in 2009. The respective averages were 31, 28, 36, and 6 tool refactorings per week. However, the average number of commits per week also declined in recent years (62, 65, 41, and 22); because there was decreased development activity, we would expect lower refactoring activity.
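The weekly-activity measure can be sketched as follows; the event shape (lists of dates of tool refactorings and of commits) is an assumption made for illustration:

```python
from collections import Counter
from datetime import date

def per_iso_week(event_dates):
    """Count events per (ISO year, ISO week) pair."""
    return Counter((d.isocalendar()[0], d.isocalendar()[1]) for d in event_dates)

def idle_refactoring_weeks(commit_dates, refactoring_dates):
    """Weeks with development activity (commits) but no tool refactorings."""
    return set(per_iso_week(commit_dates)) - set(per_iso_week(refactoring_dates))
```

Averaging the per-week counts within a year gives figures comparable to the 30 and 46 refactorings per week reported above.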

In the Users data set, we found refactoring activity distributed throughout the programming sessions. Overall, 41% of programming sessions contained refactoring activity. More interestingly, if we assume that the number of edits (changes to a program made with an editor) approximates how much work was done during a session, then significantly more work was done in sessions with refactoring than without refactoring. We found that, on average, sessions without refactoring activity contained an order of magnitude fewer edits than sessions with refactoring. Looking at it a different way, sessions that contained refactoring also contained, on average, 71% of the total edits made by the programmer. This was consistent across the population: 22 of 31 programmers had an average greater than 72%, whereas the remaining 9 ranged from 0% to 63%. This analysis of the Users data suggests that when programmers must make large changes to a code base, refactoring is a common way to prepare for those changes.

Inspecting refactorings performed using a tool does not have the limitations of automated analysis; it is independent of the granularity of commits and semantic changes, and captures all levels of refactoring activity. However, it has its own limitation: the exclusion of manual refactoring. Including manual refactorings can only increase the observed frequency of refactoring. Indeed, this is likely: as we will see in Section 3.8, many refactorings are in fact performed manually.

3.8 Refactoring Tools are Underused

A programmer may perform a refactoring manually, or may choose to use an automated refactoring tool if one is available for the refactoring that she needs to perform. Ideally, a programmer will always use a refactoring tool if one is available, because automated refactorings are theoretically faster and less error-prone than manual refactorings. However, in one survey of 16 students, only 2 reported having used refactoring tools, and even then only 20% and 60% of the time [10]. In another survey of 112 agile enthusiasts, we found that the developers reported refactoring with a tool a median of 68% of the time [10]. Both of these estimates of usage are surprisingly low, but they are still only estimates. We hypothesize that programmers often do not use refactoring tools. We suspect this is because existing tools may not have a sufficiently usable user interface.

To validate this hypothesis, we correlated the refactorings that we observed by manually inspecting Eclipse CVS commits with the refactoring tool usages in the Toolsmiths data set. Similarly, we performed the same correlation for the Mylyn CVS commits and tool usages in the Mylyn data set. A refactoring found by manual inspection can be correlated with the application of a refactoring tool by looking for tool applications between commits. For example, the Toolsmiths data provides sufficient detail (time, new variable name, and location) to correlate an EXTRACT LOCAL VARIABLE performed with a tool with an EXTRACT LOCAL VARIABLE observed by manually inspecting adjacent commits in Eclipse CVS.
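The correlation can be sketched as a match on refactoring kind, target identifier, and time. The record fields and the fixed look-back window below are simplifications invented for illustration; the real Toolsmiths log records details such as timestamps and new variable names.

```python
from datetime import datetime, timedelta

def links_to_tool_use(observed, tool_log, lookback=timedelta(days=3)):
    """True if an observed refactoring (found by diffing adjacent commits)
    can be linked to a logged tool application of the same kind on the same
    target, performed within the look-back window before the commit."""
    for use in tool_log:
        if (use["kind"] == observed["kind"]
                and use["target"] == observed["target"]
                and timedelta(0) <= observed["commit_time"] - use["time"] <= lookback):
            return True
    return False
```

An observed refactoring for which this returns False for every logged tool use is counted as performed manually.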

After analyzing the Toolsmiths data, we were unable to link 89% of 145 observed refactorings that had tool support to any use of a refactoring tool (also 89% when normalized). After analyzing the Mylyn data, we were unable to link 78% of 72 observed refactorings that had tool support to any use of a refactoring tool (91% when normalized). This suggests that the developers associated with the Toolsmiths and Mylyn data primarily refactor manually. An unexpected finding was that 31 refactorings that were performed with tools were not visible by comparing revisions in CVS for the Toolsmiths data; the same phenomenon was observed 13 times with the Mylyn data. It appears that most of these refactorings occurred in methods or expressions that were later removed, or in newly created code that had not yet been committed to CVS.

Frequency     0      1      2-5     6-10    >10
Percentage   79%    10%     8%      1%     0.05%

TABLE 6: Distribution of the number of tool refactorings per week from 39 729 tool-using developers, collected from 395 814 weeks with development activity.

Overall, the results support the hypothesis that programmers refactor manually in lieu of using tools. Measured tool usage was even lower than the median estimate from the professional agile developer survey. This suggests that either programmers unconsciously or consciously overestimate their tool usage (perhaps refactoring is often an unconscious activity, or perhaps expert programmers are embarrassed to admit that they do not use refactoring tools), or that expert programmers prefer to refactor manually.

To observe if tools were underused by a larger population, we analyzed the UDC Events data, which includes timestamped Eclipse commands from developers. In the UDC Events data, of the 275 903 participants, only 39 729 participants had used refactoring tools during the period covered by the data. We would expect that not all of the participants would have used refactoring tools, because this dataset included non-Java developers, and many participants were not active users of Eclipse. From the 39 729 tool-using participants, we examined how many times they used a refactoring tool each week. We also counted weeks where developers had development activity, but no refactoring tool usage. The distribution is presented in Table 6. Nearly 80% of the weekly development sessions did not have any refactoring tool usage, even among those who had used refactoring tools at some point. When developers did use refactoring tools, the usage within a week mostly remained in the single digits. The results suggest that tool usage may be as low among a wider population as with the Eclipse and Mylyn developers we have studied.

This analysis suffers from several limitations. First, it is possible that some tool usage data from Toolsmiths may be missing. If programmers used multiple computers during development, some of which were not included in the data set, this would result in under-reporting of tool usage. Given a single commit, we could be more certain that we have a record of all refactoring tool uses over the code in that commit if we have a record of at least one refactoring tool use applied to that code since the previous commit. If we apply our analysis only to those commits, then 73% of refactorings (also 73% when normalized) cannot be linked with a tool usage. Likewise, data from Mylyn may be missing because developers may not have checked their refactoring histories into CVS. In fact, one developer in Developer Responses explicitly confirmed that sometimes he does not commit refactoring history. If we apply our analysis only to those commits for which we have at least one refactoring tool use, then 41% of refactorings (45% when normalized) cannot be linked with a tool usage. Second, refactorings that occurred at an earlier time might not be committed until much later; this would inflate the count of refactorings found in CVS that we could not correlate to the use of a tool, and thus cause us to underestimate tool usage. We tried to limit this possibility by looking back several days before a commit to find uses of refactoring tools, but may not have been completely successful. Finally, in our analysis of UDC Events, we cannot discount the possibility that this population refactors less frequently in general, because we have no estimate of their manual refactoring.

3.9 Different Refactorings are Performed with and without Tools

Some refactorings are more prone to being performed by hand than others. We have recently identified a surprising discrepancy between how programmers want to refactor and how they actually refactor using tools [10]. Programmers typically want to perform EXTRACT METHOD more often than RENAME, but programmers actually perform RENAME with tools more often than they perform EXTRACT METHOD with tools. (This can also be seen in all four groups of programmers in Table 1.) Comparing these results, we inferred that the EXTRACT METHOD tool is underused: the refactoring is instead being performed manually. However, it is unclear what other refactoring tools are underused. Moreover, there may be some refactorings that must be performed manually because no tool yet exists. We suspect that the reason that some kinds of refactoring — especially RENAME — are more often performed with tools is because these tools have simpler user interfaces.

To explore this suspicion, we examined how the kinds of refactorings differed between refactorings performed by hand and refactorings performed using a tool. We once again correlated the refactorings that we found by manually inspecting Eclipse CVS commits with the refactoring tool usage in the Toolsmiths data. We repeated this process for the Mylyn CVS commits and Mylyn data. Finally, when inspecting the Eclipse CVS and Mylyn CVS commits, we identified several refactorings that currently have no tool support.

The results are shown in Figure 3. Tool indicates how many refactorings were performed with a tool; Manual indicates how many were performed without. The figure shows that manual refactorings were performed much more often for certain kinds of refactoring. For example, EXTRACT METHOD is performed 9 times manually but just once with a tool in Eclipse CVS; REMOVE PARAMETER is never performed with a tool in the Mylyn CVS commits. Moreover, no refactorings were performed more often with a tool than manually in both Eclipse CVS and Mylyn CVS together. We can also see from the figure that many kinds of refactorings were performed exclusively by hand, despite having tool support.

Most refactorings that programmers performed had tool support. However, 30 refactorings from the Eclipse CVS commits and 36 refactorings from the Mylyn CVS commits did not have tool support. One of the most popular of these was MODIFY ENTITY PROPERTY, performed 8 times in the Eclipse CVS commits and 4 times in the Mylyn CVS commits, which would allow developers to safely modify properties such as static or final. A frequent but unsupported refactoring, REMOVE DECLARED EXCEPTION, occurred 12 times in the Mylyn CVS commits; it was commonly used to remove unnecessary exceptions from method signatures. Finally, we observed 3 instances of a REPLACE ARRAY WITH LIST refactoring in the Mylyn CVS commits. The same limitations apply as in Section 3.8.

4 DISCUSSION

How do the results presented in Section 3 affect future refactoring research and tools?

4.1 Tool-Usage Behavior

Several of our findings illuminate the behavior of programmers using refactoring tools. For example, our finding about how toolsmiths differ from ordinary programmers in terms of refactoring tool usage (Section 3.1) suggests that most kinds of refactorings will not be used as frequently as the toolsmiths hoped, when compared to the RENAME refactoring. For the toolsmith, this means that improving the underused tools or their documentation (especially the tool for EXTRACT LOCAL VARIABLE) may increase tool use.

Other findings provide insight into the typical workflow involved in refactoring. Consider that refactoring tools are often used repeatedly (Section 3.2), and that programmers often do not configure refactoring tools (Section 3.3). For the toolsmith, this means that configuration-less refactoring tools, which have recently seen increasing support in Eclipse and other environments, will suit the majority of, but not all, refactoring situations. In addition, our findings about the batching of refactorings provide evidence that tools that force the programmer to repeatedly select, initiate, and configure can waste programmers’ time. This was in fact one of the motivations for Murphy-Hill and Black’s refactoring cues, a tool that allows the programmer to select several program elements for refactoring at one time [8].

Questions still remain for researchers to answer. Why is the RENAME refactoring tool so much more popular than other refactoring tools? Why do some refactorings tend to be batched while others do not? Moreover, our experiments should be repeated in other projects and for other refactorings to validate our findings.

4.2 Detecting Refactoring

In our experiments we investigated the assumptions underlying several commonly used refactoring-detection techniques. It appears that some techniques may need refinement to address some of the concerns that we have uncovered. Our finding that commit messages in version histories are unreliable indicators of refactoring activity (Section 3.4) is at variance with an earlier finding by Ratzinger [14]. It also casts doubt on the reliability of previous research that relies on this technique [6], [15], [17]. Thus, further replication of this experiment in other contexts is needed to establish more conclusive results.

Our finding that many refactorings are medium- or low-level suggests that refactoring-detection techniques used by Weißgerber and Diehl [20], Dig and colleagues [4], Counsell and colleagues [3], and, to a lesser extent, Xing and Stroulia [21], will not detect a significant proportion of refactorings. The effect that this has on the conclusions drawn by these authors depends on the scope of those conclusions. For example, Xing and Stroulia’s conclusion that refactorings are frequent can be strengthened by taking low-level refactorings into consideration. In contrast, Dig and colleagues’ tool was intended to help automatically upgrade library clients, and thus has no need to find low-level refactorings. In general, researchers who wish to detect refactorings automatically should be aware of what level of refactorings their tool can detect.

Researchers can make refactoring detection techniques more comprehensive. For example, we observed that a common reason for Ratzinger’s keyword-matching to mis-classify changes as refactorings was that a bug-report title had been included in the commit message, and this title contained refactoring keywords. By excluding bug-report titles from the keyword search, accuracy could be increased. In general, future research can complement existing refactoring detection tools with refactoring logs from tools to increase recall of low-level refactorings.
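The suggested refinement could look like the following sketch. The assumed message format, a pasted first line beginning “Bug NNN:”, is hypothetical; real bug-title conventions vary by project.

```python
import re

# Hypothetical refinement: drop a pasted bug-report title (assumed to be a
# first line of the form "Bug 12345: ...") before keyword matching, so that
# refactoring keywords in the bug title do not mis-classify the commit.
BUG_TITLE = re.compile(r"^bug\s+\d+\s*:", re.IGNORECASE)

def strip_bug_title(commit_message):
    first, _, rest = commit_message.partition("\n")
    if BUG_TITLE.match(first.strip()):
        return rest
    return commit_message
```

The keyword search is then applied to the stripped message rather than the raw one.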

4.3 Refactoring Practice

Several of our findings add to existing evidence about refactoring practice across a large population of programmers. Unfortunately, the findings also suggest that refactoring tools need further improvements before programmers will use them frequently. First, our finding that programmers refactor frequently (Section 3.7) confirms the same finding by Weißgerber and Diehl [20] and Xing and Stroulia [21]. For toolsmiths, this highlights the potential of refactoring tools, telling them that increased tool support for refactoring may be beneficial to programmers.

Second, our finding that floss refactoring is a more frequently practiced refactoring tactic than root-canal refactoring (Section 3.5) confirms that floss refactoring, in addition to being recommended by experts [5], is also popular among programmers. This has implications for toolsmiths, researchers, and educators. For toolsmiths, this means that refactoring tools should support flossing by allowing the programmer to switch quickly between refactoring and other development activities, which is not always possible with existing refactoring tools, such as those that force the programmer’s attention away from the task at hand with modal dialog boxes [10]. For researchers, studies should focus on floss refactoring for the greatest generality. For educators, it means that when they teach refactoring to students, they should teach it throughout the course rather than as one unit during which students are

taught to refactor their programs intensively. Students should understand that refactoring can be practiced both as a way of incrementally improving the whole program design, and also as a way to simplify the process of adding a feature, or to make a fragment of code easier to understand.

Lastly, our findings that many refactorings are performed without the help of tools (Section 3.8) and that the kinds of refactorings performed with tools differ from the kinds performed manually (Section 3.9) confirm the results of our survey on programmers’ under-use of refactoring tools [10]. Toolsmiths need to explore alternative interfaces and identify common refactoring workflows, such as reminding users to EXTRACT LOCAL VARIABLE before an EXTRACT METHOD, or finding an easy way to combine these refactorings: the goal should be to encourage and support programmers in taking full advantage of refactoring tools. For researchers, more information is needed about exactly why programmers do not use refactoring tools as much as they could.

4.4 Developer Responses on Manual Refactoring

Our discussion with Eclipse and Mylyn developers (Developer Responses data) about their refactoring behavior generated several insights that may guide future research. We do not claim any generality for these insights, but believe that they do provide useful seeds for future investigation.

We provided developers with several examples of a refactoring that they themselves performed with a tool in one case but without a tool in another. We then asked them to explain this behavior. From their responses, we identified three factors (awareness, opportunity, and trust) and two issues with tool workflow (touch points and disrupted flow) that may limit tool usage.

Awareness is whether a developer realizes that there is a refactoring tool in her programming environment that can change her code on her behalf. As an example, one developer was not familiar with how the tools for the INLINE refactoring worked despite being an experienced developer. Similarly, one toolsmith described awareness problems occurring in the following scenario:

• “I already know exactly how I want the code to look like.
• Because of that, my hands start doing copy-paste and the simple editing without my active control.
• After a few seconds, I realize that this would have been easier to do with a refactoring [tool].
• But since I already started performing it manually, I just finish it and continue.”

Awareness is partially a problem of education, exposure, and encouragement; future research can consider how to improve sharing and exposing developers to different IDE features. But, as the toolsmith’s scenario suggests, awareness can also be a problem when a developer knows what refactoring tools are available; future research may be able to help developers realize that the change that they are about to make can be automated with a refactoring tool.

Opportunity is similar to awareness, but differs in that the developer knows about a feature, but does not know of an opportunity to use that feature. One developer described this situation:

My problem isn’t so much with how these tools do their job, but more with [opportunity]. Are these tools . . . available within the editor when pressing Ctrl+1 [an Eclipse automated suggestion tool]? Perhaps more should be recommended based on where I’m at in the code and then perhaps I’d use them more. As it is I have to remember the functionality exists, locate it in a popup menu (hands leaving keyboard = bad) and figure out the proper name for what it is I want to achieve.

Researchers and toolsmiths can consider how to improve refactoring opportunities by introducing mechanisms for filtering applicable refactorings or conveying bad smells local to the source code currently being worked on.

The final factor is trust: does a developer have faith in an automated refactoring tool? Developers must often act on faith because they are not informed of the limitations of a tool, or of cases in which an automated refactoring may fail. Several developers mentioned that they would avoid using a refactoring tool because of worries about introducing errors or unintended side-effects. Perhaps future research can improve trust by generating inspection checklists to cover situations where the tool may fail.

Developers also described several limitations with the way that refactoring tools fit into their development workflow. The first limitation involves touch points: the set of places in the code that may be affected by a potential change. One developer described how using an automated refactoring tool would curtail what would otherwise be a richer and more thorough experience:

More often than not, when starting a manual refactoring, upon saving that first change, all dependent code throughout the code base will light up (compile errors). At this point you can survey what the refactoring will involve and potentially discover portions of the code you didn’t realize the refactoring would affect. Tool support tries to achieve this by presenting users with a preview of the changes, but this information is always being presented in an unfamiliar way (i.e., presented in a wizard dialog instead of the Package Explorer, for example). The previews are not only unfamiliar UI, but usually more difficult to use to explore the affected code than it would be in the native package explorer and editor.

Current preview dialogs do not provide the same level of participation and interaction available by visiting error locations one by one in the editor. Researchers and toolsmiths might well consider how to better engage a developer with a thorough review or more intelligent summarization of a proposed change.

A second limitation involves disrupted flow, an interruption to the focus and working memory of the developer. Besides perceived slowness, developers mentioned disrupted flow as a general concern when using refactoring tools. Consider why refactoring tools may be disruptive: development often requires deep concentration and maintenance of mental thoughts and plans, which are facilitated by the availability of textual cues and fluid movement through the source code. A developer may feel that using a refactoring tool would disrupt her concentration: by initiating a refactoring that requires configuration, a developer traps herself within a modal window, temporarily isolating herself from the source code until the refactoring is completed. The modal window blocks source text previously available on the screen and limits mobility to other parts of the code that the developer may want to inspect. Toolsmiths should consider how to better integrate refactoring tools within the programming environment in a way that limits disruption.

Finally, we believe that further examination of why developers use refactoring tools in some cases, but not in others, will help identify why refactoring tools break down and what can be done to improve these tools.

4.5 Limitations of this Study

In addition to the limitations noted in each subsection of Section 3, some characteristics of our data limit the validity of all of our analyses. First, all the data report on refactoring Java programs in the Eclipse environment. While this is a widely-used language and environment, the results presented in this article may not hold for other languages and environments. Second, Users, Toolsmiths, and Mylyn may not represent programmers in general. Third, the Users and Everyone data sets may overlap the Toolsmith or Mylyn data set: both the Users and Everyone data sets were gathered from volunteers, and some of those volunteers may have been Toolsmiths or Mylyn developers. Finally, our inspection of the Eclipse CVS and Mylyn CVS data excluded commits to branches; readers of our previous work [11] have expressed concern that developers may have refactored more in branch commits than in regular commits. However, our interviews with both Toolsmith and Mylyn developers confirmed that refactoring in branches was discouraged because of the difficulty of later merging refactored code.

4.6 Experimental Data and Tools

Our publicly available data, the SQL queries used for correlating and summarizing that data, and the tools we used for batching refactorings and grouping CVS revisions can be found at http://multiview.cs.pdx.edu/refactoring/experiments. Our normalization procedure and the survey sent to developers can be found in the appendices.

5 CONCLUSIONS

Research about refactoring, like research in all areas, relies on a solid foundation of data. In this article, we have examined the foundations of refactoring from several perspectives. In some cases, the foundations appear solid; for example, programmers do indeed refactor frequently. In other cases, the foundations appear weak; for example, when committing code to version control, developers’ messages appear not to reliably indicate refactoring activity. Our results can help the research community build better refactoring tools and techniques in the future through enhanced knowledge of how we refactor, and how we know it.

ACKNOWLEDGMENTS

We thank Barry Anderson, Christian Bird, Tim Chevalier, Danny Dig, Thomas Fritz, Markus Keller, Ciaran Llachlan Leavitt, Ralph London, Gail Murphy, Suresh Singh, and the Portland Java User’s Group for their assistance, as well as the National Science Foundation for partially funding this research under CCF-0520346. Thanks to our anonymous reviewers and the participants of the Software Engineering seminar at UIUC for their excellent suggestions.

ERRATA

In the ICSE paper on which this work is based [11], we made two minor mistakes that have been corrected in this article. We highlight the corrections here.

In the ICSE paper, we said that Low-, Medium-, and High-level refactorings made in Eclipse CVS accounted for 18%, 22%, and 60% of refactorings, respectively. We did not correctly apply our normalization procedures when we reported these numbers. In this article, in Table 5, we include the corrected numbers: 21%, 11%, and 68%.

In the ICSE paper, we reported finding 21 PUSH DOWN refactorings. We misclassified 5 of these refactorings; they should have been classified as MOVE METHOD refactorings. The correct classification is shown in this article in Figure 3.

APPENDIX 1: NORMALIZATION PROCEDURE

In Section 3.5, we discussed a normalization procedure for some reported data. To explain the procedure, we give an intuitive explanation and an example calculation below.

We wish to estimate how many pure-refactoring commits were made to CVS. Recall that previously, we sampled 20 Labeled projects and 20 Unlabeled projects, and we know that 6 Labeled commits were pure-refactoring and 0 Unlabeled commits were pure-refactoring. Naively, we might simply do the addition (6+0) and divide by the total commits to get the estimate: 6/40 = 15%. However, while this is a good estimate for our sample, it is a bad estimate for the population as a whole, because our 20-20 sample was drawn from two unequal strata. Specifically, in this naive estimate, we are giving too much weight to the 6 pure-refactoring commits, because Labeled commits account for only about 10% of total commits. So what do we do?

Instead of the naive approach, we normalize our estimate for the relative proportions of Labeled (about 10%) to Unlabeled (about 90%) commits. The following calculation gives the normalized result:

6 is the number of Labeled pure-refactoring commits.
0 is the number of Unlabeled pure-refactoring commits.
290 is the number of Labeled commits.
2498 is the number of Unlabeled commits.


(6/20) * (290/(290+2498)) + (0/20) * (1 - 290/(290+2498)) = 0.0312051649928264

And thus, we estimate that about 3% of commits containedpure-refactorings.
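The stratified estimate above can be sketched in a few lines of Python. This is a minimal illustration of the calculation, not part of the study's tooling; the function name `stratified_rate` is ours:

```python
def stratified_rate(hits_labeled, n_labeled_sample,
                    hits_unlabeled, n_unlabeled_sample,
                    total_labeled, total_unlabeled):
    """Estimate a population-wide rate from two unequal strata by
    weighting each stratum's sample rate by that stratum's share
    of the whole population."""
    total = total_labeled + total_unlabeled
    w_labeled = total_labeled / total              # e.g. 290/2788, about 10%
    rate_labeled = hits_labeled / n_labeled_sample
    rate_unlabeled = hits_unlabeled / n_unlabeled_sample
    return rate_labeled * w_labeled + rate_unlabeled * (1 - w_labeled)

# 6 of 20 sampled Labeled commits and 0 of 20 sampled Unlabeled commits
# were pure-refactoring; there were 290 Labeled and 2498 Unlabeled commits.
estimate = stratified_rate(6, 20, 0, 20, 290, 2498)
print(round(estimate, 4))  # prints 0.0312, i.e. about 3%
```

The key point is that each stratum's sample rate is weighted by that stratum's share of all commits, so the 6 Labeled hits no longer dominate the estimate.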

APPENDIX 2: SURVEY

The following email template describes a survey sent to people who developed code for the Mylyn and Toolsmiths data sets. In this template, XXX was instantiated with the developer’s first name. YYY was instantiated with either “Mylyn” or “Eclipse”, depending on which project the developer worked on. “ZZZ.refactoringname” was instantiated with a transaction number and a refactoring name, DD/MM/YYYY indicates when that refactoring was checked in to CVS, and “some comment” was instantiated with the comment that the developer made when checking in to CVS.

Subject: Your thoughts on our refactoring study

Dear XXX,

My colleagues Chris Parnin, Andrew Black, and I are completing a study about refactoring and refactoring tool use. We investigated two case studies of refactoring, one of which was of the YYY project, of which you were a committer. In short, our analysis compared your refactoring tool history (produced when you used the Eclipse refactoring tools) with what we inferred as refactoring in the code when you committed to CVS. From this, we made estimates about how often people refactor, what kinds of refactoring tools they use, and when they use or do not use refactoring tools.

We are hoping you will answer a few questions about your thoughts on issues related to your refactoring. We hope this will provide some insights into how we can interpret our results. You are one of less than 10 people that we are inviting to participate, so your comments are extremely valuable to us.

Below, you will find several interview questions; we anticipate that they will take around 15 minutes to complete in total. Unless you indicate otherwise, we would like to reserve the right to summarize or repeat your answers verbatim in our forthcoming paper. As for privacy, in the paper we do not personally identify developers by name or by CVS username (although we do say which projects we analyzed). You can respond to this questionnaire simply by replying to this email.

Because we are on a tight publishing deadline, if you choose to participate, we would appreciate your response by February 25th.

Sincerely,

Emerson Murphy-Hill, The University of British Columbia
Chris Parnin, Georgia Institute of Technology
Andrew P. Black, Portland State University

-----

• We published a first version of our analysis in a research paper called “How We Refactor, and How We Know It” in 2009 at the International Conference on Software Engineering. Did you happen to read it?

• One of our main findings was that, by comparing refactoring tool histories and the refactorings apparent in CVS, developers appear to use refactoring tools for about 10% of refactorings for which a refactoring tool is available. Speaking for yourself, why do you think you would not use a refactoring tool when one was available?

• In the attached PDFs, we have included a few snippets of code where we inferred that you performed a refactoring for which Eclipse has tool support. However, for some of the examples, we did not have a record of you using a refactoring tool and thus concluded that you refactored without one. If you are able to remember the change that you made, could you recall or infer why you did or didn’t use the tool for that refactoring?
(For each file, we use green and red annotation highlights to show what was added and removed, and grey highlights to draw your attention to specific parts of code.)
File: ZZZ.refactoringname-tool.diff.pdf
Change date: DD/MM/YYYY
CVS Comment: some comment

• Another finding was that developers sometimes repeatedly used the same refactoring tool in quick succession (e.g., used Inline, then used it again in the next few seconds). Can you think of any reasons why you might do this?

• Off the top of your head, please try to name the three refactorings that you perform most often, and three that you perform most often using refactoring tools.

• Do you plan long-term ‘refactoring campaigns’, where you engage in extended refactoring for a period of time? If so, what is the motivation? How long do these usually take?

• Are there pitfalls during these campaigns? How would you want refactoring tools to help at those times?

• In our analysis, we looked only at refactorings in commits to the main line. Do you think you refactored differently when you committed to branches?

• How do you think the fact that your team developed tools for Eclipse affected how you used refactoring tools? Do you think you used the refactoring tools more/less/about the same as the average Eclipse Java developer?

• Is there some particular Eclipse refactoring tool (or part of a tool) that doesn’t fit with the way that you refactor? If so, please tell us which tool, and what the problem is. Do you desire additional tool support when you refactor? (Below, we’ve included a list of Eclipse refactoring tools to help jog your memory.)

– Rename
– Extract Local Variable
– Inline
– Extract Method
– Move
– Change Method Signature
– Convert Local To Field
– Introduce Parameter
– Extract Constant
– Convert Anonymous To Nested
– Move Member Type to New File
– Pull Up
– Encapsulate Field
– Extract Interface
– Generalize Declared Type
– Push Down
– Infer Generic Type Arguments
– Use Supertype Where Possible
– Introduce Factory
– Extract Superclass
– Extract Class
– Introduce Parameter Object
– Introduce Indirection

• Would you like a copy of the paper when it is complete?

REFERENCES

[1] F. Bourquin and R. K. Keller. High-impact refactoring based on architecture violations. In CSMR ’07: Proceedings of the 11th European Conference on Software Maintenance and Reengineering, pages 149–158, Washington, DC, USA, 2007. IEEE Computer Society.

[2] S. Counsell, Y. Hassoun, R. Johnson, K. Mannock, and E. Mendes. Trends in Java code changes: the key to identification of refactorings? In PPPJ ’03: Proceedings of the 2nd International Conference on Principles and Practice of Programming in Java, pages 45–48, New York, NY, USA, 2003. Computer Science Press, Inc.

[3] S. Counsell, Y. Hassoun, G. Loizou, and R. Najjar. Common refactorings, a dependency graph and some code smells: an empirical study of Java OSS. In ISESE ’06: Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, pages 288–296, New York, NY, USA, 2006. ACM.

[4] D. Dig, C. Comertoglu, D. Marinov, and R. Johnson. Automated detection of refactorings in evolving components. In D. Thomas, editor, ECOOP, volume 4067 of Lecture Notes in Computer Science, pages 404–428. Springer, 2006.

[5] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[6] A. Hindle, D. M. German, and R. Holt. What do large commits tell us? A taxonomical study of large commits. In MSR ’08: Proceedings of the 2008 International Workshop on Mining Software Repositories, pages 99–108, New York, 2008. ACM.

[7] G. C. Murphy, M. Kersten, and L. Findlater. How are Java software developers using the Eclipse IDE? IEEE Software, 23(4):76–83, 2006.

[8] E. Murphy-Hill and A. P. Black. High velocity refactorings in Eclipse. In Proceedings of the Eclipse Technology eXchange at OOPSLA 2007, New York, 2007. ACM.

[9] E. Murphy-Hill and A. P. Black. Breaking the barriers to successful refactoring: Observations and tools for extract method. In ICSE ’08: Proceedings of the 30th International Conference on Software Engineering, pages 421–430, New York, 2008. ACM.

[10] E. Murphy-Hill and A. P. Black. Refactoring tools: Fitness for purpose. IEEE Software, 25(5):38–44, 2008.

[11] E. Murphy-Hill, C. Parnin, and A. P. Black. How we refactor, and how we know it. In ICSE ’09: Proceedings of the 31st International Conference on Software Engineering, New York, 2009.

[12] W. F. Opdyke and R. E. Johnson. Refactoring: An aid in designing application frameworks and evolving object-oriented systems. In SOOPPA ’90: Proceedings of the 1990 Symposium on Object-Oriented Programming Emphasizing Practical Applications, September 1990.

[13] M. Pizka. Straightening spaghetti-code with refactoring? In H. R. Arabnia and H. Reza, editors, Software Engineering Research and Practice, pages 846–852. CSREA Press, 2004.

[14] J. Ratzinger. sPACE: Software Project Assessment in the Course of Evolution. PhD thesis, Vienna University of Technology, Austria, 2007.

[15] J. Ratzinger, T. Sigmund, and H. C. Gall. On the relation of refactorings and software defect prediction. In MSR ’08: Proceedings of the 2008 International Workshop on Mining Software Repositories, pages 35–38, New York, 2008. ACM.

[16] R. Robbes. Mining a change-based software repository. In MSR ’07: Proceedings of the Fourth International Workshop on Mining Software Repositories, pages 15–23, Washington, DC, USA, 2007. IEEE Computer Society.

[17] K. Stroggylos and D. Spinellis. Refactoring – does it improve software quality? In WoSQ ’07: Proceedings of the 5th International Workshop on Software Quality, pages 10–16, Washington, DC, USA, 2007. IEEE Computer Society.

[18] The Eclipse Foundation. Usage Data Collector Results, February 12, 2009. Website, http://www.eclipse.org/org/usagedata/reports/data/commands.csv.

[19] M. A. Toleman and J. Welsh. Systematic evaluation of design choices for software development tools. Software – Concepts and Tools, 19(3):109–121, 1998.

[20] P. Weißgerber and S. Diehl. Are refactorings less error-prone than other changes? In MSR ’06: Proceedings of the 2006 International Workshop on Mining Software Repositories, pages 112–118, New York, 2006. ACM.

[21] Z. Xing and E. Stroulia. Refactoring practice: How it is and how it should be supported – an Eclipse case study. In ICSM ’06: Proceedings of the 22nd IEEE International Conference on Software Maintenance, pages 458–468, Washington, DC, USA, 2006. IEEE Computer Society.

[22] T. Zimmermann and P. Weißgerber. Preprocessing CVS data for fine-grained analysis. In MSR ’04: Proceedings of the International Workshop on Mining Software Repositories, pages 2–6, 2004.

Emerson Murphy-Hill Emerson is an assistant professor at North Carolina State University. His research interests include human-computer interaction and software tools. He holds a Ph.D. in Computer Science from Portland State University. Contact him at [email protected]; http://www.csc.ncsu.edu/faculty/emerson.


Chris Parnin Chris is a PhD student at the Georgia Institute of Technology. His research interests include the psychology of programming and empirical software engineering. Contact him at [email protected]; http://cc.gatech.edu/~vector.

Andrew P. Black Andrew is a professor at Portland State University. His research interests include the design of programming languages and programming environments. In addition to his academic posts he has also worked as an engineer at Digital Equipment Corp. He holds a D.Phil. in Computation from the University of Oxford. Contact him at [email protected]; http://www.cs.pdx.edu/~black.