The Comment Density of Open Source Software Code
Oliver Arafat
Siemens AG, Corporate Technology
Dirk Riehle
SAP Research, SAP Labs LLC
[email protected], http://dirkriehle.com
http://twitter.com/dirkriehle
Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009. Pages 195-198. [Link: http://dirkriehle.com/2009/02/04/thecommentdensityofopensourcesoftwarecode/]
Overview
Summary (blog post): The Sweet Spot of Code Commenting in Open Source [Link: http://dirkriehle.com/2009/02/04/thesweetspotofcodecommentinginopensource/]
Abstract: The development processes of open source software are different from traditional closed source development processes. Still, open source software is frequently of high quality. Thus, we are investigating how open source software creates high quality and whether it can maintain this quality for ever larger project sizes. In this paper, we look at one particular quality indicator, the density of comments in open source software code. In a large-scale study of more than 5,000 projects, we find that active open source projects document their source code, and we find that the comment density is independent of team and project size, but not of project age. In future work, we intend to correlate comment density with project success or failure.
April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.
Analysis Process and Tool Chain
[Figure: pipeline diagram with five numbered stages, from Raw Data (ohloh.net) to Aggr. Data (RDBMS). Tool labels: SQL, Java; Excel / Calc, R, spec. tools.]
Raw data source
• Local database (ohloh.net snapshot, crawled sources)
• Web services access (ohloh.net, sourceforge.net, others)
Preprocessing
• Database querying using SQL and scripts
• Java library for computationally heavyweight filters, aggregation
Aggregated data source
• Output of preprocessing stage for specific analytical tasks
• Aggregated data significantly improves analysis speed
Analytical processing
• Mines aggregated (and raw) data for insights, hypothesis testing
• At present basic processing (Excel), machine learning next
Analysis output
• Results of analytical processing: averages, distributions, correlations
• Presented as models, tables, graphs, charts, etc.
Commenting Practice in Open Source
[Figure: scatter plot of comment lines (y-axis, log scale, 1.E+00 to 1.E+08) against project size in lines of code, LoC = CL + SLoC (x-axis, log scale, 1.E+00 to 1.E+09). Linear trend line: y = 0.1596x, R² = 0.872.]
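The trend line reported on this slide has no intercept term (y = 0.1596x). The following is a minimal sketch of how a least-squares fit through the origin can be computed. The data points are invented for illustration, the function names are hypothetical, and the slides do not state which tool or which R² definition produced the reported values, so results from a spreadsheet trend line may differ.

```python
# Sketch: least-squares line through the origin, y = slope * x.
# The (loc, comment_lines) pairs below are made-up illustration data,
# not values from the study.

def fit_through_origin(xs, ys):
    """Return the slope minimizing sum((y - slope * x)^2) with no intercept."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("need at least one nonzero x value")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


def r_squared(xs, ys, slope):
    """Coefficient of determination, 1 - SS_res / SS_tot, taken about the mean of y.

    Assumption: this is one common definition; tools differ for
    zero-intercept fits.
    """
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


loc = [1_000, 5_000, 20_000, 100_000]            # project size, CL + SLoC
comment_lines = [150, 900, 3_000, 17_000]        # CL per project
slope = fit_through_origin(loc, comment_lines)
print(f"slope = {slope:.4f}, R^2 = {r_squared(loc, comment_lines, slope):.4f}")
```

With real data, the slope plays the role of an overall ratio of comment lines to total lines across projects.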
Comment Density
• Comment density
– Percentage of lines of code that are comments
– Measure of how well documented code is
– Indicator of likelihood of survival
• Definitions
– CL = comment lines
– SLoC = source lines of code
– CD = CL / (CL + SLoC)
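The definition CD = CL / (CL + SLoC) can be sketched in a few lines. The line counter below is a deliberate simplification and an assumption of this example: it only recognizes whole-line comments with a single prefix, treats a line with a trailing comment as source, and skips blank lines. The study's actual counting rules are not given on the slides.

```python
def count_lines(source: str, comment_prefix: str = "#"):
    """Return (CL, SLoC) for a source text.

    Assumptions for illustration only: blank lines count as neither,
    only whole-line comments starting with comment_prefix are comments.
    """
    cl = sloc = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line: neither comment nor source
        if stripped.startswith(comment_prefix):
            cl += 1
        else:
            sloc += 1
    return cl, sloc


def comment_density(cl: int, sloc: int) -> float:
    """CD = CL / (CL + SLoC), as defined on the slide."""
    total = cl + sloc
    if total == 0:
        raise ValueError("no comment or source lines to measure")
    return cl / total


example = "# add two numbers\nx = 1\n\ny = x + 2  # trailing comment\n"
cl, sloc = count_lines(example)
print(cl, sloc, comment_density(cl, sloc))
```

For instance, 20 comment lines and 80 source lines give CD = 20 / 100 = 0.2.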
[Figure: scatter plot of comment density (y-axis, 0% to 100%) against project size in lines of code, LoC = CL + SLoC (x-axis, log scale, 1.E+00 to 1.E+09). mean = 0.1867; median = 0.1674; stdev = 0.1088; correl = 0.00787.]
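The summary figures on this slide (mean, median, stdev, correl) can be reproduced for any list of per-project values along the following lines. This is a sketch under assumptions: the input values are invented, "correl" is taken to be the Pearson correlation between project size and comment density, and the sample standard deviation is used; the slides do not say which variants were applied.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summarize(sizes, densities):
    """Summary statistics of comment density plus its correlation with size."""
    return {
        "mean": statistics.mean(densities),
        "median": statistics.median(densities),
        "stdev": statistics.stdev(densities),  # sample standard deviation
        "correl": pearson(sizes, densities),
    }


# Invented example values, not data from the study.
project_sizes = [1_200, 8_500, 40_000, 250_000, 1_000_000]
comment_densities = [0.21, 0.15, 0.19, 0.17, 0.20]
print(summarize(project_sizes, comment_densities))
```

A correlation close to zero, as reported on the slide, indicates no linear relationship between project size and comment density.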
Variation by Programming Language
Java
y = 0.2798x; R² = 0.9334; mean = 0.2587; median = 0.2566; stdev = 0.1111; population size = 1085
PERL
y = 0.1602x; R² = 0.9464; mean = 0.1044; median = 0.0902; stdev = 0.0713; population size = 273
[Figures, one set each for Java and PERL: (1) scatter plot of comment lines against project size in lines of code (CL + SLoC), both axes log scale, with the linear trend line listed for the language; (2) scatter plot of comment density (0% to 100%) against project size in lines of code (CL + SLoC, log scale); (3) histogram of comment density, percentage of occurrences per interval, tabulated below.]

Comment Density    Java      PERL
[0.0, 0.1[         0.0841    0.5855
[0.1, 0.2[         0.2450    0.3382
[0.2, 0.3[         0.3327    0.0582
[0.3, 0.4[         0.2340    0.0073
[0.4, 0.5[         0.0841    0.0109
[0.5, 0.6[         0.0165    0.0000
[0.6, 0.7[         0.0037    0.0000
[0.7, 0.8[         0.0000    0.0000
[0.8, 0.9[         0.0000    0.0000
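The histograms on this slide group projects into half-open comment density intervals of width 0.1, such as [0.2, 0.3[. A minimal sketch of that binning follows. The function name and the input values are hypothetical, and placing a density of exactly 1.0 into the last bin is an assumption of this example.

```python
from bisect import bisect_right


def histogram_shares(densities, bins=10):
    """Share of values per half-open interval [i/bins, (i+1)/bins[.

    Assumption: a value of exactly 1.0 is placed in the last bin.
    """
    edges = [i / bins for i in range(bins + 1)]
    counts = [0] * bins
    for cd in densities:
        if not 0.0 <= cd <= 1.0:
            raise ValueError(f"comment density out of range: {cd}")
        # bisect_right puts a value equal to a lower edge into that bin.
        index = min(bisect_right(edges, cd) - 1, bins - 1)
        counts[index] += 1
    total = len(densities)
    return [count / total for count in counts]


# Invented example values, not data from the study.
shares = histogram_shares([0.05, 0.15, 0.18, 0.30])
for i, share in enumerate(shares):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}[  {share:.4f}")
```

Comparing bin edges directly, instead of dividing by the bin width, avoids misplacing boundary values such as 0.3 because of floating-point rounding.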
More here: How Open Source Comments (by Programming Language)[Link: http://dirkriehle.com/2008/11/10/howopensourcecommentsbyprogramminglanguage/]
Comment Density by Commit Size
[Figure: comment density (y-axis, 0% to 100%) against source lines of code in commit, SLoC (x-axis, 0 to 100). Series: Comment Density by SLoC Size; Total Average Comment Density.]
sloc size = 1-100: mean = 0.2513; median = 0.2340; stdev = 0.0626
sloc size = 50-100: mean = 0.2234; median = 0.2190; stdev = 0.0169
sloc size = 80-100: mean = 0.2224; median = 0.2171; stdev = 0.0209
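One plausible way to arrive at a "comment density by commit size" series is to group commits by their SLoC count and average the comment density within each group. The sketch below assumes exactly that; the slides do not describe the aggregation, and the function name, the input tuples, and the values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def density_by_commit_size(commits):
    """Map each commit size (SLoC in commit) to the mean comment density.

    commits: iterable of (sloc, comment_lines) pairs, one per commit.
    Assumption: commits with no lines at all are skipped.
    """
    groups = defaultdict(list)
    for sloc, cl in commits:
        if sloc + cl > 0:
            groups[sloc].append(cl / (cl + sloc))
    return {size: mean(values) for size, values in sorted(groups.items())}


# Invented example values, not data from the study.
commits = [(80, 20), (80, 20), (50, 50), (10, 2)]
for size, cd in density_by_commit_size(commits).items():
    print(f"sloc = {size:3d}  mean comment density = {cd:.4f}")
```

Summary statistics for a size range, such as those listed on this slide, would then be taken over the per-size averages within that range.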
Comment Density by Team Size
[Figure: comment density and percentage of projects (two y-axes, each 0% to 50%) against team size in number of committers (x-axis, 0 to 100). Series: Comment Density by Team Size; Total Average Comment Density; Percent Projects with Team Size.]
team size = 1-20: mean = 0.1914; median = 0.1878; stdev = 0.0255
team size = 1-50: mean = 0.1922; median = 0.1906; stdev = 0.0425
team size = 1-100: mean = 0.1856; median = 0.1857; stdev = 0.0641; correl = 0.0550
Comment Density by Project Age
[Figure: average comment density (y-axis, 17.50% to 19.50%) against project age in months (x-axis, 0 to 48). correl = 0.9054.]
Commenting Practice Summary
• Successful open source projects follow a consistent commenting practice
– 1 out of 20 commits serves commenting purposes only
– 1 out of 5 non-empty lines is a comment line
• Average comment density is independent of project size
• Average comment density is independent of team size
• Average comment density slowly falls with project age
References
• Dirk Riehle, John Ellenberger, Tamir Menahem, Boris Mikhailovski, Yuri Natchetoi, Barak Naveh, Thomas Odenwald. “Bringing Open Source Best Practices into Corporations Using a Software Forge.” IEEE Software, 2009. See: http://dirkriehle.com/2009/02/11/opencollaborationwithincorporationsusingsoftwareforges/
• Dirk Riehle. “The Economic Motivation of Open Source: Stakeholder Perspectives.” IEEE Computer, vol. 40, no. 4 (April 2007). Pages 25-32. See: http://dirkriehle.com/computerscience/research/2007/computer2007.html
• Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of the 42nd Hawaii International Conference on System Sciences (HICSS 42). IEEE Press, 2009. See: http://dirkriehle.com/2008/09/23/thecommitsizedistributionofopensourcesoftware/
• Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Pages 197-209. See: http://dirkriehle.com/2008/03/14/thetotalgrowthofopensource/
• Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Pages 273-280. See: http://dirkriehle.com/2008/03/08/continuousintegrationinopensourcesoftwaredevelopment/
• Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Pages 105-115. See: http://dirkriehle.com/2009/02/11/estimatingcommitsizesefficiently/
• Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009. Pages 195-198. See: http://dirkriehle.com/2009/02/04/thecommentdensityofopensourcesoftwarecode/
Thank you! Questions?
For any feedback or questions, please email the authors:
Oliver Arafat
Siemens AG, Corporate Technology
Dirk Riehle
SAP Research, SAP Labs LLC
[email protected], http://dirkriehle.com
http://twitter.com/dirkriehle