Ayelet Israeli and Dror G. Feitelson, “The Linux Kernel as a Case Study in Software Evolution”. Journal of Systems and Software 83(3), pp. 485-501, Mar 2010.
Presented by Dror Feitelson.
Synopsis
• A study of 810 versions of the Linux kernel released over 14 years, comparing the evolution of the system to Lehman’s Laws of software evolution.
• Conclusion: several laws are supported by the data.
• Observation: average complexity is decreasing with time.
Linux Background
• First announced August 1991
• First release March 1994
• Dual release scheme till 2003
– Odd versions are development (1.1, 1.3, 2.1, 2.3, 2.5)
– Even versions are production (1.0, 1.2, 2.0, 2.2, 2.4)
• New release scheme in 2.6
– New version every 2-3 months
– Development is distributed, no official releases
• Full source code of all versions available online
Linux Kernel Versions
• Paper used all 810 versions from March 1994 to August 2008 (all .h and .c files)
– 144 production
– 429 development
– 237 of 2.6
• Unprecedented scale of investigation
• Other researchers used only production or only .c or a sample of versions
• Some versions (test kernels and release candidates) were missed
1.0   v1.0/linux-1.0
1.1   v1.1/v1.1.0
      v1.1/linux-1.1.*
1.2   v1.2/linux-1.2.*
1.3   v1.3/linux-1.3.*
      v1.3/linux-pre2.0.*
2.0   v2.0/linux-2.0.*
2.1   v2.1/linux-2.1.*
      v2.1/linux-2.2.0-pre*
2.2   v2.2/linux-2.2.*
2.3   v2.3/linux-2.3.*
      v2.3/linux-2.3.99-pre*
      v2.4/old-test-kernels/linux-2.4.0-*
2.4   v2.4/linux-2.4.*
2.5   v2.5/linux-2.5.*
      v2.6/pre-releases/linux-2.6.0-test*
2.6   v2.6/linux-2.6.*
      v2.6/testing/v2.6.*/linux-2.6.*-rc*
      v2.6/longterm/v2.6.*/linux-2.6.*
Kernel version locations on www.kernel.org
CPP Problems
• Kernel is littered with preprocessor directives
• Removed them in order to analyze all the code
– This is what the developers see
• Sometimes this leads to incorrect syntax
– Files where this happened were ignored
– About 1.5% of the code
• Alternative is to perform preprocessing (used by others)
– Induces code bloat (macros and #include)
– Only one configuration of the system
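A minimal sketch of what removing preprocessor directives can look like, and why it sometimes yields incorrect syntax. This is not the script used in the paper. The function name `strip_cpp_directives` and the C fragment are made up for illustration, and the sketch does not handle lines starting with `#` inside block comments.

```python
import re

def strip_cpp_directives(source: str) -> str:
    """Drop preprocessor directive lines, keeping all ordinary code lines."""
    kept = []
    in_directive = False  # True while inside a backslash-continued directive
    for line in source.splitlines():
        if in_directive or re.match(r"\s*#", line):
            # A trailing backslash means the directive continues on the next line.
            in_directive = line.rstrip().endswith("\\")
            continue
        kept.append(line)
    return "\n".join(kept)

# Hypothetical fragment: both branches of the #ifdef open a brace,
# but only one closing brace follows.
example = """\
#ifdef CONFIG_SMP
    if (cpu > 0) {
#else
    if (1) {
#endif
        do_work();
    }
"""

print(strip_cpp_directives(example))
```

The output keeps both `if (...) {` lines, so the braces no longer balance. This is the kind of file that would have to be ignored under the approach described above.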
Evolution Background
• Textbooks: Software developed in well-defined phases:
– Elicit requirements
– Create specifications
– Design the system
– Implement
– Test and correct
– Install and maintain
• Reality: Software evolves:
– Start with a small useful project
– Users will introduce new requirements
– Adapt the system to do what is needed
– Needs cannot be anticipated in advance
Three Types of Programs
• S-type: derived from well defined formal specifications
• P-type: can’t derive a formal solution, so use an iterative process to find and refine a solution
• E-type: a program that becomes embedded in its environment and changes with it; mechanizing an activity changes it and induces new requirements
Continuous evolution instead of development and then maintenance
Early Data
• Data on OS/370
– Size in modules
– As function of release serial no.
• Main results
– Steady growth
– Ripple effect
– Instability in late releases
[Figure: OS/360-370, size relative to RSN 1 as a function of release serial number (RSN), with trend S(i) = S(i-1) + 0.19]
Lehman's Laws
1) Continuing change (adaptation)
2) Increasing complexity (unless refactored)
3) Self regulation (of rate of change)
4) Invariant work rate (inertia)
5) Conservation of familiarity (of users and developers)
6) Continuing growth (more features)
7) Declining quality (unless maintained)
8) Feedback system (at multiple levels)
The Idea
• Lehman used little data from closed-source systems
• A lot of data is now available
• Use Linux data to see if it supports Lehman’s Laws
• In particular try to use software metrics to quantify the laws
Law VI
Continuing Growth
The functional capability of E-type systems must be continually enhanced to maintain user satisfaction over system lifetime
• Law requires new functionality to be added
• Can also be interpreted as requiring growth in size
• Are the two interpretations equivalent?
Lehman’s Data
• Size in modules as function of release number for OS/360 and other systems
• Grows, but growth rate often seen to decline
– Though not for OS/360
– Turski suggested inverse square law: $s_{i+1} = s_i + \frac{E}{s_i^2}$
– Idea: effort $E$ is spent on all possible interactions among $s_i$ modules
– Leads to a model where $s_i \propto \sqrt[3]{i}$
[Figure: OS/360-370, size relative to RSN 1 as a function of release serial number (RSN), with trend S(i) = S(i-1) + 0.19]
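A small sketch of how an inverse square growth law slows down over time, assuming the model has the form s(i+1) = s(i) + E / s(i)^2. Treating this as a differential equation gives s^3 ≈ 3·E·i, so the size should track the cube root of the release number. The starting size, the effort value, and the function name `turski_sizes` are arbitrary choices for illustration, not values from the paper.

```python
def turski_sizes(s1: float, effort: float, releases: int) -> list[float]:
    """Iterate s(i+1) = s(i) + effort / s(i)**2, starting from size s1."""
    sizes = [s1]
    for _ in range(releases - 1):
        s = sizes[-1]
        sizes.append(s + effort / s**2)
    return sizes

# Arbitrary illustrative parameters (not taken from any real system).
sizes = turski_sizes(s1=1.0, effort=1.0, releases=1000)

# Growth increments shrink as the system gets larger.
increments = [b - a for a, b in zip(sizes, sizes[1:])]

# Compare the simulated size with the cube-root approximation (3*E*i)**(1/3).
for i in (10, 100, 1000):
    approx = (3 * 1.0 * i) ** (1 / 3)
    print(i, round(sizes[i - 1], 3), round(sizes[i - 1] / approx, 3))
```

The ratio in the last column approaches 1 as the release number grows, which is the declining-growth behavior the inverse square law predicts.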
Law VI and Linux
• The dominant effect (so we deal with it first)
• The easiest to measure and analyze
– If interpreted as size
• Growth is super-linear (quadratic?)
• Explained by positive feedback with growth of developer base
• Functional growth is harder to quantify
Godfrey & Tu Linux Data
• LOC or tarball size as function of date, 1994-2000
• Focus on development versions
• Growth rate seen to increase
• Fits quadratic model
• Largely verified by others
• Also for other (but not all) open-source systems
Release Number vs. Time
• Doesn’t matter if releases are regular
– In Linux before 2.6 they are not
• Changes growth shape if irregular
• Question of interleaving multiple versions
– Assume version 2.3 was released after 3.0
– If sorted by number their order is reversed
– Justified because related to 2.2, not to 3.0
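The interleaving question can be illustrated with a tiny sketch that orders the same releases by version number and by date. The version labels follow the thought experiment above. The dates are invented purely for illustration, and `version_key` is a hypothetical helper.

```python
from datetime import date

# Hypothetical releases mirroring the thought experiment: 2.3 appears after 3.0.
releases = [
    ("2.2", date(2001, 1, 1)),
    ("3.0", date(2002, 1, 1)),
    ("2.3", date(2003, 1, 1)),
]

def version_key(version: str) -> tuple[int, ...]:
    """Turn '2.10' into (2, 10) so that 2.10 sorts after 2.9."""
    return tuple(int(part) for part in version.split("."))

by_number = [v for v, _ in sorted(releases, key=lambda r: version_key(r[0]))]
by_date = [v for v, _ in sorted(releases, key=lambda r: r[1])]

print(by_number)  # ['2.2', '2.3', '3.0']
print(by_date)    # ['2.2', '3.0', '2.3']
```

Sorting by number places 2.3 next to 2.2, the version it is related to, while sorting by date interleaves it after 3.0.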
Linux Growth Data
Super-linear growth
Contradicts Lehman and Turski, who claimed growth should slow down due to increasing complexity
Functionality
• Previous results hold for all common size metrics
• Different results when trying to measure functional growth
• System calls are leveling out
– Possibly reflects maturity, as predicted by Torvalds
• Config options are growing faster
– Indicates growth is in internal mechanisms rather than user-visible services
System Calls
Config Options
Law I
Continuing Change
An E-type system must be continually adapted, else it becomes progressively less satisfactory in use
• This means that software must evolve
• “Adapt” implies keeping up with a changing environment
Law I and Linux
• Change is obviously true
– In 2.6 a new version is released every 2-3 months
• Change is achieved through growth
• Adaptation to changing hardware environment
• Hard to distinguish adaptation from growth
– Is adding support for sound cards a new feature or adaptation to a changing environment?
Adaptation to New Hardware
• Special case of operating system environment
• Confined to two subdirectories
– arch (supported architectures)
– drivers (supported peripherals)
• Together about 60% of the code
• Grow together with the rest of the system at about the same rate
arch + drivers vs. Whole Kernel
Law II
Increasing Complexity
As an E-type system is changed its complexity increases and it becomes more difficult to evolve unless work is done to maintain it and reduce the complexity
• Functionality costs in complexity
• Two-sided law: supported either way
Law II and Linux
• Complexity not necessarily increasing
• System is largely modular (e.g. no coupling between file systems, scheduler, and drivers)
• New functions being added are short and simple
• Growing number but reduced fraction of high-MCC functions
• Active work to reduce complexity
McCabe Cyclomatic Complexity (MCC)
• Introduced by McCabe in 1976
• Essentially counts the minimal number of paths through the code
• Suggestion: functions with MCC>10 may require refactoring
• Easily calculated by counting predicates
– All while, for, if, and case statements
• Widely used in tools and research
• Has been criticized, but no better alternatives
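Counting predicates can be sketched in a few lines. This is an approximation under stated assumptions, not the script used in the paper: it takes the text of a single C function, drops comments and string or character literals, and returns one plus the number of `while`, `for`, `if`, and `case` keywords. The names `approx_mcc` and `NOISE` and the sample function are made up for illustration.

```python
import re

# Keywords counted as predicates: while, for, if, case.
PREDICATE = re.compile(r"\b(?:while|for|if|case)\b")

# Comments and literals are removed first so keywords inside them are not counted.
NOISE = re.compile(
    r"/\*.*?\*/"             # block comments
    r"|//[^\n]*"             # line comments
    r'|"(?:\\.|[^"\\])*"'    # string literals
    r"|'(?:\\.|[^'\\])*'",   # character literals
    re.DOTALL,
)

def approx_mcc(function_body: str) -> int:
    """Approximate MCC of one C function as 1 + number of predicate keywords."""
    text = NOISE.sub(" ", function_body)
    return 1 + len(PREDICATE.findall(text))

# Hypothetical C function used only to exercise the counter.
example = """
int classify(int x) {
    /* if this comment were counted, the result would be wrong */
    if (x < 0)
        return -1;
    for (int i = 0; i < x; i++) {
        switch (i % 3) {
        case 0: break;
        case 1: break;
        default: break;
        }
    }
    return 1;
}
"""

print(approx_mcc(example))  # prints 5: one if, one for, two case labels, plus 1
```

A check such as `approx_mcc(body) > 10` would then flag candidates for refactoring under the threshold suggested above. Variants of MCC that also count `&&` and `||` would give higher values than this sketch.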
Measuring MCC
• Use commercial static analysis tool (Klocwork)
– Requires compilation of the code
– Therefore limited to specific configuration
– Some bug and usage problems
• Use free tool (pmccabe)
– But not in this paper
• Write your own script
– Simple and what we need
– Danger of bugs and not being standard
Results
• Total MCC grows with code
• But average MCC per function is decreasing
Distribution of MCC
Possible Explanations
• Many new functions being added, and they tend to be simpler than the old ones– Indeed, new functions tend to have lower MCC
• Code is being actively improved with time
High-MCC Functions
• Distribution of MCC values is heavy-tailed
• Highest values are in the hundreds
– 369 functions with MCC ≥ 100 over the years
• Some of these functions evolve
– Massive reduction in MCC as in sys32_ioctl
– Gradual growth of MCC
– Occasional large growth in production version
• Very long, but actually not very complex
Tail of MCC Distribution
An Aside on Heavy Tails
• Definition: tail decays as a power law
• CDF: $F(x) = \Pr(X \le x)$
• CCDF: $\bar{F}(x) = \Pr(X > x)$
• Heavy tail: $\bar{F}(x) \sim x^{-a}$, $0 < a \le 2$
• LLCD: $\log \bar{F}(x) \approx -a \log x$
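A log-log complementary distribution (LLCD) plot is the empirical CCDF drawn on logarithmic axes, where a power-law tail appears as a straight line with slope -a. The sketch below builds the LLCD points and fits a least-squares line. The data is synthetic: a Pareto sample with a = 1.5 generated by inverse-transform sampling, not MCC values from the kernel. The function names are made up for illustration, and a line fit over all points is only a rough estimate of the tail index.

```python
import math
import random

def llcd_points(samples):
    """Return (log10 x, log10 CCDF(x)) pairs for the empirical distribution."""
    xs = sorted(samples)
    n = len(xs)
    points = []
    for rank, x in enumerate(xs):
        ccdf = (n - rank - 1) / n  # fraction of samples above this one (assumes no ties)
        if ccdf > 0 and x > 0:
            points.append((math.log10(x), math.log10(ccdf)))
    return points

def slope(points):
    """Least-squares slope of y against x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Synthetic Pareto sample: if U is uniform on (0, 1], then U**(-1/a) has CCDF x**(-a).
rng = random.Random(0)
a = 1.5
samples = [(1.0 - rng.random()) ** (-1.0 / a) for _ in range(20000)]

estimated_a = -slope(llcd_points(samples))
print(round(estimated_a, 2))
```

On real data only the tail is expected to be linear, so the fit would normally be restricted to the largest values.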
Law VII
Declining Quality
Unless rigorously adapted and evolved to take into account changes in the operational environment, the quality of an E-type system will appear to be declining
• Again can be supported either way
• What is “quality”?
Law VII and Linux
• Question of how to quantify quality
• Quality is most probably not decreasing
• It may even be improving
Perceived Quality
• If quality declines system will fall out of use
• Linux usage is strong and growing
• Ergo Linux quality is not declining
Measured Quality
Oman’s Maintainability Index (MI)
• HV = Halstead’s volume (N log2 n)
– Bits required to write the function
• MCC = McCabe Cyclomatic Complexity
• LoC = Lines of Code
• pCM = percent Comment lines
– Interpreted as fraction (0-1) rather than percent
$MI = 171 - 5.2 \ln(HV) - 0.23\,MCC - 16.2 \ln(LoC) + 50 \sin\left(\sqrt{2.46\,pCM}\right)$
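The index is straightforward to compute once the four inputs are known. The sketch below assumes the commonly cited form of Oman’s formula, MI = 171 - 5.2 ln(HV) - 0.23 MCC - 16.2 ln(LoC) + 50 sin(sqrt(2.46 pCM)), with pCM given as a fraction between 0 and 1. The function name and the input values are hypothetical, not measurements from the kernel.

```python
import math

def maintainability_index(hv: float, mcc: float, loc: float, pcm: float) -> float:
    """Oman's Maintainability Index; pcm is the comment fraction in [0, 1]."""
    return (
        171
        - 5.2 * math.log(hv)     # Halstead volume term
        - 0.23 * mcc             # cyclomatic complexity term
        - 16.2 * math.log(loc)   # size term
        + 50 * math.sin(math.sqrt(2.46 * pcm))  # comment bonus
    )

# Hypothetical values for two functions that differ only in complexity.
simple = maintainability_index(hv=1000, mcc=5, loc=30, pcm=0.2)
complex_ = maintainability_index(hv=1000, mcc=40, loc=30, pcm=0.2)
print(round(simple, 1), round(complex_, 1))
```

Higher values indicate better maintainability, so the function with the larger MCC scores lower.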
Changes in MI
Law IV
Invariant Work Rate
The work rate of an organization evolving an E-type software system tends to be constant over the operational lifetime of that system or phases of that lifetime
• Large organizations have inertia• What about open source communities?
Law IV and Linux
• Work on Linux is growing superlinearly
• Fraction of files handled is near constant
• Release rate is near constant
– 5-10 days per minor release till 2.5
– 2-3 months for new version in 2.6
Interpretation 1: Work Hours
• Data not available
• Ill-defined: developers typically have other day jobs
• Nevertheless, work rate is most probably not constant
– Growth in developer base
– Increased growth rate of code
Interpretation 2: Elements Handled
• Suggested by Lehman
• Use development versions (+ 1st year of 2.4)
• Includes number added (reflects growth)
• Absolute number grows with time
• Fraction of existing files relatively constant
Interpretation 3: Release Rate
• Release rate of development versions 1996-2003 around 3-6/month
– Lower in 2.4
Releases per Month
Interpretation 3: Release Rate
• Production versions have high minor release rate until next development version is forked
Rate of Minor Releases
Linear slope = steady release rate
Interpretation 3: Release Rate
• Since 2003 (version 2.6) new version every 2-3 months
• Conclusion: seems to support constant rate
Law V
Conservation of Familiarity
In general, the incremental growth (growth rate trend) of E-type systems is constrained by the need to maintain familiarity
• Capacity of humans to change constrains the rate of change
Law V and Linux
• Rapid development releases imply small change between versions
• Production versions branch off from development versions again with small change
• Large difference between production versions
– So user familiarity is not conserved
• Users may continue to use production version for long time
– Evidence for need for conservation of familiarity
Law III
Self Regulation
Global E-type system evolution is feedback regulated
• Reflects a balance between forces that demand change, and constraints on what can actually be done
Lehman’s Ripple
• Ripple indicates negative feedback control
• Or maybe alternation of major/minor releases?
Increments of Growth
• Large increment reflects desire to add more new functionality
• Small increment reflects need to stabilize
• Alternations reflect self regulation
• Also seen to some degree in Linux
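One simple way to look for alternating increments is to compute the size differences between consecutive releases and count how often an increment is a local peak or trough. This is only an illustrative measure, not the analysis used in the paper. The size series are invented and the function names are hypothetical.

```python
def increments(sizes):
    """Growth increments between consecutive releases."""
    return [b - a for a, b in zip(sizes, sizes[1:])]

def alternation_fraction(incs):
    """Fraction of interior increments that are local peaks or troughs."""
    flips = 0
    for prev, cur, nxt in zip(incs, incs[1:], incs[2:]):
        # The direction of change reverses around cur.
        if (cur - prev) * (nxt - cur) < 0:
            flips += 1
    return flips / max(len(incs) - 2, 1)

# Invented size series: one alternates large and small steps, one grows smoothly.
rippled = [100, 130, 135, 170, 176, 215, 220]
smooth = [100, 110, 121, 133, 146, 160]

print(alternation_fraction(increments(rippled)))  # 1.0
print(alternation_fraction(increments(smooth)))   # 0.0
```

For independent random increments about two thirds of the interior points are local extrema, so only values clearly above that would hint at systematic alternation.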
Alternating Increments
Law VIII
Feedback System
E-type evolution processes are multi-level, multi-loop, multi-agent feedback systems
• Extension of law III?
Law VIII and Linux
• Archetypal open-source system
• Continued development based on feedback from users
– Defect reports
– Bug fixes
– Contribution of code
• Change of release scheme in 2.6 reflects need for more rapid dissemination
• Hard to quantify
Lehman’s Laws and Linux: Summary
• Some laws are two-sided
– II (complexity), VII (quality)
• Some laws are qualitative
– I (adaptation), III (self regulation), V (familiarity), VII (quality), VIII (feedback)
• Laws need to be interpreted and quantified
– II (complexity), IV (work rate), VII (quality)
Lehman’s Laws and Linux: Summary
I     change           Adaptation to new hardware
II    complexity       Not increasing
III   self regulation  Maybe
IV    work rate        Constant release rate, superlinear growth
V     familiarity      Within production versions
VI    growth           Superlinear
VII   quality          Not decreasing
VIII  feedback         Inherent in open source paradigm