TRANSCRIPT
A Metric for Software Readability
by
Raymond P.L. Buse and Westley R. Weimer
Presenters: John and Suman
Readability
• The human judgment of how easy a text is to understand
• A local, line-by-line feature
• Not related to the size of a program
• Not related to the essential complexity of the software
Readability and Maintenance
• Reading code is the most time-consuming of all maintenance activities [J. Lionel E. Deimel 1985, D. R. Raymond 1991, S. Rugaber 2000]
• 70% of software cost for maintenance [B. Boehm and V. R. Basili 2001]
• Readability also correlates with software quality, code change and defect reporting
Problem Statement
• How do we create a software readability metric that:
  – Correlates strongly with human annotators
  – Correlates strongly with software quality
• Currently no automated readability measure for software
Contributions
• A software readability metric that:
  – correlates strongly with human annotators
  – correlates strongly with software quality
• A survey of 120 human readability annotators
• A discussion of the features related to software readability
Readability Metrics for Natural Language
• Empirical and objective models of readability
• Flesch-Kincaid Grade Level has been used for over 50 years
• Based on simple features:
  – Word length
  – Sentence length
• Used by government agencies, MS Word, Google Docs
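The grade-level formula is simple enough to sketch. A minimal version, assuming a crude vowel-group heuristic for syllable counting (the official counter is more involved):

```python
import re

def count_syllables(word):
    # Crude heuristic (an assumption): count runs of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(flesch_kincaid_grade("The cat sat on the mat."), 2))  # → -1.45
```

Negative or near-zero scores are expected for trivially short sentences; the scale is calibrated for ordinary prose.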
Experimental Approach
• 120 human annotators were shown 100 code snippets
• Resulting 12,000 readability judgments available online (ok, not really)
Snippet Selection - Goals
• Length
  – Short enough to aid feature discrimination
  – Long enough to capture important readability considerations
• Logical Coherence
  – Shouldn't span methods
  – Include adjacent comments
• Avoid trivial snippets (e.g. a group of import statements)
Snippet Selection - Algorithm
• Snippet = 3 consecutive simple statements
  – Based on the authors' experience
  – Simple statements are: declarations, assignments, function calls, breaks, continues, throws and returns
• Other nearby statements are included:
  – Comments, function headers, blank lines, if, else, try-catch, switch, etc.
• Snippets cannot cross scope boundaries
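The selection rule above can be sketched as a window scan. This is a hypothetical simplification: it treats any `;`-terminated line as a "simple" statement and approximates scope boundaries by resetting the run at any non-simple line.

```python
def is_simple_statement(line):
    # Assumption: declarations, assignments, calls, breaks, continues,
    # throws, and returns all end in ';' in Java-like code.
    return line.strip().endswith(";")

def candidate_snippets(source_lines, window=3):
    # Emit every window of 3 consecutive simple statements; a
    # non-simple line (block header, brace, etc.) resets the run,
    # crudely keeping snippets from crossing scope boundaries.
    snippets, run = [], []
    for i, line in enumerate(source_lines):
        if is_simple_statement(line):
            run.append(i)
            if len(run) >= window:
                snippets.append(run[-window:])
        else:
            run = []
    return snippets
```

A real implementation would also pull in the adjacent comments, headers, and blank lines the slide mentions.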
Readability Scoring
• Readability was rated from 1 to 5:
  1 – “less readable”
  3 – “neutral”
  5 – “more readable”
Inter-annotator agreement
• Good correlation needed for a coherent model
• Pearson product-moment correlation coefficient
  – Correlation of 1 indicates perfect agreement
  – Correlation of 0 indicates no correlation (random agreement)
• Calculated pairwise over all annotators
• Average correlation of 0.56
  – Typically considered “moderate to strong”
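The pairwise computation can be sketched directly from the definition: compute Pearson's r for every pair of annotators (each annotator is a list of scores aligned by snippet) and average.

```python
from itertools import combinations

def pearson(xs, ys):
    # Pearson r = covariance / (std_x * std_y), computed directly.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def mean_pairwise_correlation(ratings):
    # ratings: one list of scores per annotator, aligned by snippet
    pairs = list(combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)
```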
Readability Model
Objective:
• Mechanically predict human readability judgments
• Determine code features that are predictive of readability

Usage:
• Use this model to analyze code (an automated software readability metric)
Model Generation
• Classifier – machine learning algorithms
• Instances – feature vectors computed from snippets
• Experiment procedure
  – Training phase: a set of instances labeled with the “correct answer”
  – Classify based on the score from the bimodal distribution of annotator ratings
Model Generation (contd …)
• Decide on a set of features that can be detected statically
• These features relate to the structure, density, logical complexity, and documentation of the analyzed code
• Each feature is independent of the size of the code block
Model Generation (contd …)
• Build a classifier on the set of features
• Use 10-fold cross validation
  – Randomly partition the data set into 10 subsets
  – Train on 9 subsets and test on the remaining 1
  – Repeat this process 10 times
• Mitigate any bias from partitioning by repeating the 10-fold validation 20 times
• Average the results across all of the runs
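The evaluation loop above can be sketched generically. This is a minimal version assuming caller-supplied `train_fn` and `eval_fn` hooks (hypothetical names, not from the paper):

```python
import random

def k_fold_indices(n, k, rng):
    # Shuffle indices, then deal them round-robin into k folds.
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cross_validation(instances, labels, train_fn, eval_fn,
                              k=10, repeats=20, seed=0):
    # 10-fold CV repeated 20 times with fresh random partitions,
    # averaging the evaluation score over all 200 runs.
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for fold in k_fold_indices(len(instances), k, rng):
            test = set(fold)
            train = [i for i in range(len(instances)) if i not in test]
            model = train_fn([instances[i] for i in train],
                             [labels[i] for i in train])
            scores.append(eval_fn(model,
                                  [instances[i] for i in fold],
                                  [labels[i] for i in fold]))
    return sum(scores) / len(scores)
```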
Results
• Two relevant success metrics – precision & recall
• Recall – percentage of snippets judged “more readable” by annotators that the model also classified as “more readable”
• Precision – fraction of snippets classified “more readable” by the model that annotators also judged “more readable”
• Performance is measured with the f-measure statistic: the harmonic mean of the two metrics
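The f-measure computation is a one-liner once precision and recall are in hand:

```python
def f_measure(predicted, actual, positive="more readable"):
    # F-measure = harmonic mean of precision and recall.
    tp = sum(p == positive == a for p, a in zip(predicted, actual))
    precision = tp / sum(p == positive for p in predicted)
    recall = tp / sum(a == positive for a in actual)
    return 2 * precision * recall / (precision + recall)
```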
Results (contd …)
• 0.61 – f-measure of a classifier trained on randomly generated score labels
• 0.8 – f-measure of the classifier trained on average human data
Results (contd …)
• Repeated the experiment separately for each annotator experience group (100-, 200-, and 400-level, and graduate CS students)
Interesting Facts from the Feature Analysis …
• Average line length and average number of identifiers per line are important to readability
• Average identifier length, loops, if constructs and comparison operators are not very predictive features
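The two strongly predictive features are easy to compute per snippet. A sketch, assuming a simple identifier regex (which also matches keywords, a deliberate simplification):

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def readability_features(snippet):
    # Per-snippet averages of line length and identifier count,
    # two of the most predictive features per the paper.
    lines = snippet.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    avg_ids = sum(len(IDENT.findall(l)) for l in lines) / len(lines)
    return {"avg_line_length": avg_len,
            "avg_identifiers_per_line": avg_ids}
```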
Readability Correlations (Experiment 1)
• Correlate defects detected by FindBugs* and readability metric
• Run FindBugs on benchmarks
• Classified the defect reports into two sets (one containing at least one defect, the other containing none)
• Run the trained readability classifier
• Record the f-measure for “contains a bug” with respect to the classifier judgment of “less readable”
*FindBugs – a popular static bug finding tool
Readability Correlations (Experiment 2)
• Correlates future code churn to readability
• Uses readability to predict which functions will be modified between 2 successive releases of a program
• A function is considered to have changed:
  – When its text is not exactly the same
  – Ignoring whitespace-only changes
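The change test can be sketched as whitespace-normalized comparison; the exact normalization the paper uses is an assumption here.

```python
import re

def changed_between_releases(old_body, new_body):
    # A function counts as changed when its text differs between two
    # releases, ignoring whitespace-only differences.
    normalize = lambda s: re.sub(r"\s+", "", s)
    return normalize(old_body) != normalize(new_body)
```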
Readability Correlations - Results
• Average f-measure:
  – For Experiment 1: 0.61; for Experiment 2: 0.63
Relating Metric to Software Life Cycle
• Readability tends to change over a long period of time
Relating Metric to Software Life Cycle (contd …)
• Correlate project readability against project maturity (as reported by developers)
“Projects that reach maturity tend to be more readable”
Discussion
• Identifier Length
  – No influence!
  – Long names can improve readability, but can also reduce it
  – Comments might be more appropriate
  – Authors' suggestions: improved IDEs and code inspections
• Code Comments
  – Only moderately correlated
  – Being used to “make up for” ugly code?
• Characters/identifiers per line
  – Strongly correlated
  – Just as long sentences are more difficult to understand, so are long lines of code
  – Authors' suggestion: keep lines short, even if it means breaking them up over several lines
Related Work
• Natural Language Metrics [R. F. Flesch 1948, R. Gunning 1952, J. P. Kincaid and E. A. Smith 1970, G. H. McLaughlin 1969]
• Coding Standards [S. Ambler 1997, B. B. Bederson et al. 2002, H. Sutter and A. Alexandrescu 2004]
• Style Checkers [T. Copeland 2005]
• Defect Prediction [T. J. Cheatham et al. 1995, T. L. Graves et al. 2000, T. M. Koshgoftaar et al. 1996, J. P. Kincaid and E. A. Smith 1970]
Future Work
• Examine personal preferences
  – Create personal models
• Models based on application domain
• Broader features
  – e.g. number of statements in an if block
• IDE integration
• Explore minimum set of predictive features
Conclusion
• Created a readability metric based on a specific set of human annotators
• This metric:
  – agrees with the annotators as much as they agree with each other
  – has significant correlation with conventional metrics of software quality
• Examining readability could improve language design and engineering practice