TRANSCRIPT
A Metric for Software Readability
by
Raymond P.L. Buse and Westley R. Weimer
Presenters: John and Suman
Readability
• The human judgment of how easy a text is to understand
• A local, line-by-line feature
• Not related to the size of a program
• Not related to the essential complexity of the software
Readability and Maintenance
• Reading code is the most time-consuming of all maintenance activities [J. Lionel E. Deimel 1985, D. R. Raymond 1991, S. Rugaber 2000]
• 70% of software cost for maintenance [B. Boehm and V. R. Basili 2001]
• Readability also correlates with software quality, code change and defect reporting
Problem Statement
• How do we create a software readability metric that:
  – Correlates strongly with human annotators
  – Correlates strongly with software quality
• Currently no automated readability measure for software
Contributions
• A software readability metric that:
  – correlates strongly with human annotators
  – correlates strongly with software quality
• A survey of 120 human readability annotators
• A discussion of the features related to software readability
Readability Metrics for Natural Language
• Empirical and objective models of readability
• Flesch-Kincaid Grade Level has been used for over 50 years
• Based on simple features:
  – Word length
  – Sentence length
• Used by government agencies, MS Word, Google Docs
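The grade-level formula is simple enough to sketch. A minimal version, assuming a crude vowel-group heuristic for syllable counting (the official counter is more involved):

```python
import re

def count_syllables(word):
    # Crude heuristic (an assumption): count runs of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(flesch_kincaid_grade("The cat sat on the mat."), 2))  # → -1.45
```

Negative or near-zero scores are expected for trivially short sentences; the scale is calibrated for ordinary prose.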
Experimental Approach
• 120 human annotators were shown 100 code snippets
• Resulting 12,000 readability judgments available online (ok, not really)
Snippet Selection - Goals
• Length
  – Short enough to aid feature discrimination
  – Long enough to capture important readability considerations
• Logical Coherence
  – Shouldn't span methods
  – Include adjacent comments
• Avoid trivial snippets (e.g. a group of import statements)
Snippet Selection - Algorithm
• Snippet = 3 consecutive simple statements
  – Based on the authors' experience
  – Simple statements are: declarations, assignments, function calls, breaks, continues, throws and returns
• Other nearby statements are included:
  – Comments, function headers, blank lines, if, else, try-catch, switch, etc.
• Snippets cannot cross scope boundaries
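The selection rule above can be sketched as a window scan. This is a hypothetical simplification: it treats any `;`-terminated line as a "simple" statement and approximates scope boundaries by resetting the run at any non-simple line.

```python
def is_simple_statement(line):
    # Assumption: declarations, assignments, calls, breaks, continues,
    # throws, and returns all end in ';' in Java-like code.
    return line.strip().endswith(";")

def candidate_snippets(source_lines, window=3):
    # Emit every window of 3 consecutive simple statements; a
    # non-simple line (block header, brace, etc.) resets the run,
    # crudely keeping snippets from crossing scope boundaries.
    snippets, run = [], []
    for i, line in enumerate(source_lines):
        if is_simple_statement(line):
            run.append(i)
            if len(run) >= window:
                snippets.append(run[-window:])
        else:
            run = []
    return snippets
```

A real implementation would also pull in the adjacent comments, headers, and blank lines the slide mentions.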
Readability Scoring
• Readability was rated from 1 to 5:
  1 – “less readable”
  3 – “neutral”
  5 – “more readable”
Inter-annotator agreement
• Good correlation needed for a coherent model
• Pearson product-moment correlation coefficient
  – Correlation of 1 indicates perfect agreement
  – Correlation of 0 indicates no correlation (random agreement)
• Calculated pairwise over all annotators
• Average correlation of 0.56
  – Typically considered “moderate to strong”
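The pairwise computation can be sketched directly from the definition: compute Pearson's r for every pair of annotators (each annotator is a list of scores aligned by snippet) and average.

```python
from itertools import combinations

def pearson(xs, ys):
    # Pearson r = covariance / (std_x * std_y), computed directly.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def mean_pairwise_correlation(ratings):
    # ratings: one list of scores per annotator, aligned by snippet
    pairs = list(combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```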
Readability Model
Objective:
• Mechanically predict human readability judgments
• Determine code features that are predictive of readability

Usage:
• Use this model to analyze code (an automated software readability metric)
Model Generation
• Classifier – machine learning algorithms
• Instances – feature vectors computed from snippets
• Experiment procedure
  – Training phase: a set of instances labeled with the “correct answer”
  – Classify based on the score from the bimodal distribution of annotator ratings
Model Generation (contd …)
• Decide on a set of features that can be detected statically
• These features relate to the structure, density, logical complexity, and documentation of the analyzed code
• Each feature is independent of the size of the code block
Model Generation (contd …)
• Build a classifier on the set of features
• Use 10-fold cross validation
  – Randomly partition the data set into 10 subsets
  – Train on 9 subsets and test on the remaining 1
  – Repeat this process 10 times
• Mitigate any bias from partitioning by repeating the 10-fold validation 20 times
• Average the results across all of the runs
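The evaluation loop above can be sketched generically. This is a minimal version assuming caller-supplied `train_fn` and `eval_fn` hooks (hypothetical names, not from the paper):

```python
import random

def k_fold_indices(n, k, rng):
    # Shuffle indices, then deal them round-robin into k folds.
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cross_validation(instances, labels, train_fn, eval_fn,
                              k=10, repeats=20, seed=0):
    # 10-fold CV repeated 20 times with fresh random partitions,
    # averaging the evaluation score over all 200 runs.
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for fold in k_fold_indices(len(instances), k, rng):
            test = set(fold)
            train = [i for i in range(len(instances)) if i not in test]
            model = train_fn([instances[i] for i in train],
                             [labels[i] for i in train])
            scores.append(eval_fn(model,
                                  [instances[i] for i in fold],
                                  [labels[i] for i in fold]))
    return sum(scores) / len(scores)
```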
Results
• Two relevant success metrics – precision & recall
• Recall – percentage of snippets judged “more readable” by annotators that the model also classified as “more readable”
• Precision – fraction of snippets classified “more readable” by the model that annotators also judged “more readable”
• Performance is measured with the f-measure statistic: the harmonic mean of the two metrics
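The f-measure computation is a one-liner once precision and recall are in hand:

```python
def f_measure(predicted, actual, positive="more readable"):
    # F-measure = harmonic mean of precision and recall.
    tp = sum(p == positive == a for p, a in zip(predicted, actual))
    precision = tp / sum(p == positive for p in predicted)
    recall = tp / sum(a == positive for a in actual)
    return 2 * precision * recall / (precision + recall)
```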
Results (contd …)
• 0.61 – f-measure of a classifier trained on randomly generated score labels
• 0.8 – f-measure of the classifier trained on average human data
Results (contd …)
• Repeated the experiment separately for each annotator experience group (100-, 200-, and 400-level, and graduate CS students)
Interesting Facts from the Feature Analysis …
• Average line length and average number of identifiers per line are important to readability
• Average identifier length, loops, if constructs and comparison operators are not very predictive features
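The two strongly predictive features are easy to compute per snippet. A sketch, assuming a simple identifier regex (which also matches keywords, a deliberate simplification):

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def readability_features(snippet):
    # Per-snippet averages of line length and identifier count,
    # two of the most predictive features per the paper.
    lines = snippet.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    avg_ids = sum(len(IDENT.findall(l)) for l in lines) / len(lines)
    return {"avg_line_length": avg_len,
            "avg_identifiers_per_line": avg_ids}
```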
Readability Correlations (Experiment 1)
• Correlate defects detected by FindBugs* and readability metric
• Run FindBugs on benchmarks
• Classified the defect reports into two sets (one containing at least one defect, the other containing none)
• Run the trained readability classifier
• Record the f-measure for “contains a bug” with respect to the classifier judgment of “less readable”
*FindBugs – a popular static bug finding tool
Readability Correlations (Experiment 2)
• Correlates future code churn to readability
• Uses readability to predict which functions will be modified between 2 successive releases of a program
• A function is considered to have changed:
  – When its text is not exactly the same
  – Ignoring whitespace-only changes
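The change test can be sketched as whitespace-normalized comparison; the exact normalization the paper uses is an assumption here.

```python
import re

def changed_between_releases(old_body, new_body):
    # A function counts as changed when its text differs between two
    # releases, ignoring whitespace-only differences.
    normalize = lambda s: re.sub(r"\s+", "", s)
    return normalize(old_body) != normalize(new_body)
```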
Readability Correlations - Results
• Average f-measure:
  – For Experiment 1: 0.61; for Experiment 2: 0.63
Relating Metric to Software Life Cycle
• Readability tends to change over a long period of time
Relating Metric to Software Life Cycle (contd …)
• Correlate project readability against project maturity (as reported by developers)
“Projects that reach maturity tend to be more readable”
Discussion
• Identifier Length
  – No influence!
  – Long names can improve readability, but can also reduce it
  – Comments might be more appropriate
  – Authors' suggestions: improved IDEs and code inspections
• Code Comments
  – Only moderately correlated
  – Being used to “make up for” ugly code?
• Characters/identifiers per line
  – Strongly correlated
  – Just as long sentences are more difficult to understand, so are long lines of code
  – Authors' suggestion: keep lines short, even if it means breaking them up over several lines
Related Work
• Natural Language Metrics [R. F. Flesch 1948, R. Gunning 1952, J. P. Kincaid and E. A. Smith 1970, G. H. McLaughlin 1969]
• Coding Standards [S. Ambler 1997, B. B. Bederson et al. 2002, H. Sutter and A. Alexandrescu 2004]
• Style Checkers [T. Copeland 2005]
• Defect Prediction [T. J. Cheatham et al. 1995, T. L. Graves et al. 2000, T. M. Koshgoftaar et al. 1996, J. P. Kincaid and E. A. Smith 1970]
Future Work
• Examine personal preferences
  – Create personal models
• Models based on application domain
• Broader features
  – e.g. number of statements in an if block
• IDE integration
• Explore minimum set of predictive features
Conclusion
• Created a readability metric based on a specific set of human annotators
• This metric:
  – agrees with the annotators as much as they agree with each other
  – has significant correlation with conventional metrics of software quality
• Examining readability could improve language design and engineering practice