

A Micro/Macro Measure of Software Complexity

Warren Harrison University of Portland, Portland, Oregon

Curtis Cook Oregon State University, Corvallis, Oregon

A software complexity metric is a quantitative measure of the difficulty of comprehending and working with a specific piece of software. The majority of metrics currently in use focus on a program's "microcomplexity." This refers to how difficult the details of the software are to deal with. This paper proposes a method of measuring the "macrocomplexity," i.e., how difficult the overall structure of the software is to deal with, as well as the microcomplexity. We evaluate this metric using data obtained during the development of a compiler/environment project involving over 30,000 lines of C code. The new metric's performance is compared to the performance of several other popular metrics, with mixed results. We then discuss how these metrics, or any other metrics, may be used to help increase project management efficiency.

1. INTRODUCTION

Software complexity metric research has focused on obtaining quantitative measures of how difficult a piece of software is to understand and work with. It seems that when dealing with a system of software (as opposed to a single independent piece of software) there are two different levels of understanding: micro and macro. The exact meaning of these two levels can be best described by an example.

Consider the enhancement of a software system containing several subprograms. Before making the modification, the programmer must first determine which part or parts of the system are to be changed. This involves macrounderstanding, since the detailed operation of any given subprogram need not be known (their functions, however, must be determined).

Address correspondence to Warren Harrison, School of Business Administration, University of Portland, Portland, OR 97203.

The Journal of Systems and Software 7, 213-219 (1987)
© 1987 Elsevier Science Publishing Co., Inc.

All that is really necessary is to determine how each subprogram relates to the function of the overall system, and to the proposed enhancement.

Once the proper subprogram has been identified, the programmer must understand the detailed operation of the subprogram in order to make the needed modifications. This activity requires a microunderstanding of the individual subprogram.

Few macrolevel measures have been developed, as most software complexity metric research has focused on microlevel metrics. This is understandable, since much of the current work revolves around controlled experimentation. Due to the time limits imposed by such research, the software under study tends to be rather small, eliminating the possibility of studying large systems with many interacting parts.

In this paper, we present a new metric that addresses both the micro and macro aspects of complexity. As reported in this paper, this metric was found to have a significantly higher correlation with the number of errors encountered in a major software development project than several other popular metrics.

Section 2 contains a brief description of some of the most popular methods of measuring the microlevel understandability of a piece of software. Section 3 presents two methods of measuring the macrolevel understandability of a software product: Henry and Kafura's Information Flow Complexity and Hall and Preiser's Combined Network Complexity metric. The new macrolevel metric introduced in Section 4 is loosely based on these metrics. The results of a field study (a compiler project with over 30,000 lines of C code and 20 "logical modules," each consisting of a number of function definitions) are given in Section 5. For each module, errors uncovered during integration testing were compared with the various metrics.



The new metric had a significantly higher correlation with the errors than the other metrics.

Interestingly, of the six modules with the most errors (63% of the total errors), all six of the metrics studied had their highest values for the same five of these six "high error" modules. On the practical side, our study suggests that software project managers can use software complexity measures as a tool for identifying the few subprograms most likely to contain the majority of errors, and hence can allocate their testing resources more efficiently.

2. MICROCOMPLEXITY METRICS

Microlevel understanding is detailed knowledge of the internal operation of an individual subprogram or module in a system. A microlevel complexity metric measures the ease (or difficulty) of microlevel understanding. The majority of software complexity research has been devoted to developing microlevel metrics. With very few exceptions, these metrics are based on characteristics of the subprogram code. In this section, we describe the three most popular microlevel metrics: lines of code, McCabe's Cyclomatic Complexity [6], and Halstead's Effort [3].

Perhaps the oldest measure of microcomplexity is a simple count of lines of code (Delivered Source Lines, or DSL). Alternative size measures also exist, such as the number of noncomment lines of code (NonComment Source Lines, or NCSL). This measure is based on the (quite reasonable) premise that a bigger program is harder to work with than a smaller program. Previous studies have repeatedly shown that this measure is quite reliable, as microcomplexity measures go.
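As a rough illustration of how these size counts can be mechanized, the sketch below (a simplification for C-like source; the function name and the treatment of comments are our own assumptions, not part of the original study) derives DSL and NCSL directly from the program text.

    def size_counts(source: str):
        """Approximate DSL (all delivered lines) and NCSL (noncomment lines).

        Simplifying assumption: a /* ... */ comment does not share a line
        with executable code, and blank lines are not counted as NCSL.
        """
        dsl = ncsl = 0
        in_block_comment = False
        for line in source.splitlines():
            dsl += 1
            stripped = line.strip()
            if in_block_comment:
                if "*/" in stripped:
                    in_block_comment = False
                continue
            if stripped.startswith("/*"):
                in_block_comment = "*/" not in stripped
                continue
            if not stripped or stripped.startswith("//"):
                continue
            ncsl += 1
        return dsl, ncsl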

McCabe felt that the number of basic execution paths of a program was a measure of program complexity since each execution path through the program was composed of a combination of basic paths. He reasoned that the greater the number of basic paths, the larger the number of different paths through the program, and hence the more difficult it would be for a programmer to understand the internal behavior of the program. The number of basic paths is equal to the cyclomatic number of the program's control flow graph. He showed that the control flow graph did not have to be constructed because, for a "well-formed" program, the cyclomatic number was equal to one plus the number of decision statements in the program. McCabe's Cyclomatic Complexity is denoted VG, where G is the control flow graph of the program.
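The decision-counting shortcut can be sketched as follows. This is only an approximation (it matches decision keywords and short-circuit operators by simple pattern matching and will miscount tokens that appear inside strings or comments), not a faithful reimplementation of McCabe's graph-based definition.

    import re

    # Constructs that add one basic path in C-like code: if, while, for,
    # case labels, the short-circuit operators, and the conditional operator.
    _DECISION_PATTERN = re.compile(r"\b(if|while|for|case)\b|&&|\|\||\?")

    def cyclomatic_complexity(source: str) -> int:
        """Approximate McCabe's VG as (number of decisions) + 1."""
        return len(_DECISION_PATTERN.findall(source)) + 1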

Halstead’s Software Science is based on four simple counts:

1. n1: the number of unique operators
2. n2: the number of unique operands
3. N1: the total number of occurrences of operators
4. N2: the total number of occurrences of operands.

Further, these parameters can be combined to form the following two aggregate counts:

5. n = n1 + n2
6. N = N1 + N2

His Effort metric, denoted E, is intended to be an approximation of the number of elementary mental discriminations required to write a given program. Halstead suggests that Effort be calculated as:

E = V/L,

where V (Volume) is a measure of program size, calculated as:

V = N log2 n,

and L (Program Level) is a measure of “abstraction” inherent in a particular implementation of an algorithm. L is defined to be:

L = V*/V,

where V* is the "potential volume," the volume that would be observed if the program were implemented as a "built-in" function. This is typically based on the number of "potential inputs" and "potential outputs," as well as a constant number (2) of "potential operators," which would consist of the name of the "function" and the calling mechanism (e.g., "CALL").

However, it is often difficult to obtain an accurate count of potential operands. This is especially true in the current case, where we were not involved in the actual implementation of the product but instead obtained the data through an intermediary. In fact, such care had been taken to ensure that the original code could not be regenerated from the Reduced Form data that we were not even aware of the function of each module, much less its potential operands. Because of the difficulty involved in obtaining a product's potential volume, it has been suggested [1] that the "estimated" Level (L') be calculated as:

L’ =(2/d) * (n2/N2).

Thus, E may be approximated by [1]:

E' = (N log2 n) / [(2/n1) * (n2/N2)].
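Given the four basic counts, the estimated Effort follows mechanically. The sketch below assumes the counts have already been extracted (for example, from a Reduced Form); the example values in the final comment are hypothetical.

    import math

    def halstead_effort(n1: int, n2: int, N1: int, N2: int) -> float:
        """Estimate Halstead's Effort using the estimated level L'.

        n1, n2: unique operators and operands.
        N1, N2: total occurrences of operators and operands.
        """
        n = n1 + n2
        N = N1 + N2
        volume = N * math.log2(n)                # V = N log2 n
        level_estimate = (2.0 / n1) * (n2 / N2)  # L' = (2/n1) * (n2/N2)
        return volume / level_estimate           # E' = V / L'

    # Hypothetical counts: halstead_effort(15, 30, 120, 90) is roughly 2.6e4.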

3. METRICS TO MEASURE MACRO COMPLEXITY

Macrolevel understanding refers to understanding of the overall behavior of the entire system. Understanding the relationship of the subprograms to each other, and how they communicate, is the key to macrolevel understanding.

Very few macrocomplexity metrics have been developed. The two macrocomplexity metrics presented in this section were selected because they are representative and because they influenced the new macrocomplexity metric introduced in the next section. Generally, macrolevel metrics include a microlevel component, but concentrate on the communication links between subprograms: the more links, the more complex the macrolevel understanding. Potentially, each subprogram can communicate with every other subprogram in the system. Also, each potential communication link may introduce "side effects" that may affect other subprograms in the system. Hence, there is both direct and indirect communication between modules.

In a system with n subprograms, there are n(n - 1)/2 potential communication links between subprograms; for example, a system of 20 subprograms has 190 potential links. Therefore, the complexity introduced by adding additional subprograms increases at a nonlinear rate.

Henry and Kafura’s Information Flow Complexity metric [5] is composed of two parts:

1. The "detailed complexity" of an individual subprogram in a system.

2. The complexity of the relationship between the subprogram and each of the other subprograms within the system.

They defined the detailed complexity of a subprogram as its size, in number of lines of code. The relationship complexity is determined by counting the "fan-in" and "fan-out" of the subprogram.

The fan-in of a subprogram is the number of different data items whose values are affected by other subprograms in the system before being used by the subprogram being examined. The fan-out of a subprogram is the number of data items whose values are affected by the subprogram, and whose values are subsequently used by other parts of the system.

Henry and Kafura considered the relationship complexity to have a greater effect on the total complexity of an object than the detailed complexity. This is reflected by giving greater weight to the relationship complexity in the calculation of the object's complexity:

(fan-in + fan-out)^2 * length.¹

¹ Note that this differs somewhat from Henry and Kafura's original definition of their metric: fan-in and fan-out were summed in our work due to the format used in data collection. While we had access to the number of global variables and parameters, we did not have their role (i.e., input or output) available. The original definition uses the product of fan-in and fan-out. Since we could not obtain the product of fan-in and fan-out from our Reduced Form, we used the sum instead. Subsequent references to this metric will refer to a summed value, and not a product.

This metric is used to determine the complexity of a single part of a system. However, it can be easily extended to assess the complexity of the entire system by summing the complexities of the subprograms, or calculating the average complexity of the subprograms.
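A minimal sketch of this calculation, using the summed fan-in/fan-out form adopted in this study (see footnote 1), is shown below; the per-subprogram counts are assumed to be supplied by some earlier analysis step.

    def hk_complexity(fan_in: int, fan_out: int, length: int) -> int:
        """Complexity of one subprogram: (fan-in + fan-out)^2 * length.

        Note: Henry and Kafura's original definition multiplies fan-in by
        fan-out; the sum is the variant used in this study.
        """
        return (fan_in + fan_out) ** 2 * length

    def system_hk_complexity(subprograms) -> int:
        """System-level value: sum over (fan_in, fan_out, length) tuples."""
        return sum(hk_complexity(fi, fo, ln) for fi, fo, ln in subprograms)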

Hall and Preiser's Combined Network Complexity metric [2] was initially developed to measure the complexity of concurrent systems. They developed a "combined measure" that included a generalized network component (nodes are modules and edges represent invocation of modules and return of control) plus a measure of microcomplexity for each module in the system. This combined measure is calculated as:

w[1] * CN + w[2] * Σ(i = 1 to k) C[i],

where w[1] and w[2] are weighting factors (currently set to 1), CN is the network complexity, and C[i], i = 1 to k, is a micromeasure of complexity for each of the k modules in the system.

Any micromeasure can be used for C[i]; however, Hall and Preiser suggest the use of Storm and Preiser's index of complexity [7].
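In code, the combined measure is simply a weighted sum; the sketch below leaves the choice of micromeasure to the caller and assumes CN and the C[i] values have been computed elsewhere.

    def combined_network_complexity(cn, module_complexities, w1=1.0, w2=1.0):
        """Hall and Preiser's combined measure: w1 * CN + w2 * sum of C[i].

        cn: network complexity of the module invocation graph.
        module_complexities: micro-level complexity C[i] for each module.
        w1, w2: weighting factors (1 by default, as in the original paper).
        """
        return w1 * cn + w2 * sum(module_complexities)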

4. A NEW MEASURE OF SYSTEM COMPLEXITY, THE MMC METRIC

The new system complexity measure presented in this section is based loosely on the metrics of Hall and Preiser and Henry and Kafura. This metric will be referred to as the MMC (Micro/Macro Complexity) metric. The MMC metric attempts to assess the total complexity of a system by including both a microcomponent and a macrocomponent. The Micro/Macro Complexity (MMC) of a system can be calculated as:

MMC = Σ(i = 1 to #subprograms) [SC(i) * MC(i)],

where SC(i) is the system level complexity contributed by subprogram i, and MC(i) is the microlevel complexity contributed by subprogram i. The amount of system level complexity contributed by subprogram i is calculated as:

SC(i) = [Glob(i) * (#subprograms - 1) + Param(i)] * [1 - DI(i)],

where:

Glob(i) is the number of times global variables are used in subprogram i.

Param(i) is the number of times parameters are used in subprogram i.

DI(i) is the “Documentation Index” for subprogram i.


This is a measure of the quality of the subprogram's documentation. The better the documentation, the higher the index. Our working definition of this measure is the ratio of comments to total lines within the subprogram. Obviously, if DI(i) = 1, this implies that the subprogram is entirely comprised of comments, while in the case of DI(i) = 0, no comments are present in the code.

The microcomplexity contributed by subprogram i is calculated using some measure of microcomplexity, such as McCabe’s VG. For example:

MC(i) = VG(i),

where VG(i) is McCabe’s measure for subprogram i. McCabe’s metric was selected for the microcomplexity assessment because it seemed to best reflect the total complexity in extreme cases such as a system with a very few large subprograms or a large number of tiny subprograms. When DSL or NCSL were used to assess microlevel complexity, their contribution to the final result seemed to swamp the contribution of the network complexity.

The components of the MMC metric are based on the following reasoning, in the context of software maintenance or debugging. In order to locate the subsystem to be modified or containing an error, the function of each module, plus the other modules that it affects, must be determined. This is addressed by multiplying the number of data items used in each module by the number of other modules that they could affect. In a system comprised of n modules, a global data item usage could potentially affect n - 1 of the modules besides the module it is in (it would be expected that the usage would affect that module). On the other hand, a parameter usage could affect only the invoking module (i.e., the one calling the module under consideration). Thus, the weight assigned to global variable usage should be n - 1 and the weight assigned to parameter usage should be 1.

However, if a description of the global/parameter usage of the module is readily available, this will make it easier to understand the use of global items and parameters. Thus, the System Level Complexity is weighted by the Documentation Index. This weighting ranges from 0 for the “best” documentation to 1 for the “worst” (this is accomplished by subtracting the actual documentation index, which ranges from 0 to 1, from 1). If “perfect documentation” were available, global item usage and parameters would have minimal effect on comprehension. Clearly, such a level of documentation would be very difficult to attain.

To allow automation of this metric, the Documentation Index for module i can be defined using the following relation:

DI(i) = (DSL(i) - NCSL(i))/DSL(i),

where DI(i) is the documentation index of module i, NCSL(i) is the number of noncomment source lines in module i, and DSL(i) is the number of delivered (total) source lines in module i. This, in effect, is the ratio of the number of comment lines to the number of total lines in subprogram i.

This clearly allows trivial manipulation of the metric by just inserting random blank lines. However, in a statistical context, it is felt that this method of computing the Documentation Index will be a reasonable measure of the documentation.
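Putting the pieces together, the following sketch computes MMC for a system, assuming the per-subprogram counts (global usages, parameter usages, VG, DSL, and NCSL) are already available; the dictionary keys are our own naming, introduced only for illustration.

    def documentation_index(dsl: int, ncsl: int) -> float:
        """DI(i) = (DSL(i) - NCSL(i)) / DSL(i)."""
        return (dsl - ncsl) / dsl

    def mmc(subprograms) -> float:
        """Micro/Macro Complexity: sum of SC(i) * MC(i) over all subprograms.

        Each entry of `subprograms` is a dict with keys:
          glob  - Glob(i), global variable usages
          param - Param(i), parameter usages
          vg    - VG(i), used as MC(i)
          dsl, ncsl - delivered and noncomment source lines, for DI(i)
        """
        n = len(subprograms)
        total = 0.0
        for s in subprograms:
            di = documentation_index(s["dsl"], s["ncsl"])
            sc = (s["glob"] * (n - 1) + s["param"]) * (1.0 - di)
            total += sc * s["vg"]
        return total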

The major differences between the new metric and those of Henry and Kafura and Hall and Preiser are the components of the metric and the method of calculation. Hall and Preiser's metric did not explicitly include the use of global and/or parameter data. Rather, their metric considered only the number of modules in the system. Likewise, Henry and Kafura's metric, while considering the global variables and parameters used, did not explicitly take into account the number of subprograms in the system. Finally, neither metric attempts to assess the effect of documentation quality on the understandability of the system. Furthermore, the new metric puts a greater emphasis on the contribution of the network complexity to overall complexity than does Hall and Preiser's metric.

5. EVALUATING THE PERFORMANCE OF THE NEW METRIC

This section presents the results derived from a field study of a moderate-sized compiler project consisting of about 30,000 lines of C code and 20 "logical" modules (each module contained a number of function definitions).

Information on program characteristics was captured using the Reduced Form [4], which provides for each module:

1. Enumeration of operators and operands, including flow of control operators
2. Number of lines of code with comments
3. Number of lines of code without comments
4. Global/local/parameter usage information,

making available virtually all of the characteristics needed to calculate most of the more popular complexity metrics. Further, the original source program cannot be reconstructed from the Reduced Form, since only the type and number of statements can be inferred from the data, not their order or actual form.
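As an illustration only (the actual Reduced Form layout is defined in [4] and is not reproduced here), the per-module information listed above could be carried in a record like the hypothetical one below, which is sufficient to drive the metric sketches given earlier.

    from dataclasses import dataclass, field

    @dataclass
    class ModuleProfile:
        """Hypothetical per-module record; field names are ours, not the Reduced Form's."""
        name: str
        operator_counts: dict = field(default_factory=dict)  # operator -> occurrences
        operand_counts: dict = field(default_factory=dict)   # operand -> occurrences
        dsl: int = 0           # lines of code, comments included
        ncsl: int = 0          # lines of code, comments excluded
        globals_used: int = 0  # global variable usages
        params_used: int = 0   # parameter usages
        locals_used: int = 0   # local variable usages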

In addition, the error history of the project during integration testing was also made available to us. This consisted of 275 error reports, each of which was associated with a particular logical module. Since the data was organized in terms of logical modules, each of which contained a number of physical function definitions, we were able to consider the interaction of the functions within the logical modules.

We examined the relationship between the number of errors associated with each logical module and the various metrics described earlier, through the use of the Pearson correlation coefficient. The results of our study are shown in Table 1. As can be seen, the MMC metric performed significantly better than any of the other metrics examined.

Another interesting result of our study involved the analysis of the relationships among the various metrics themselves. This is also shown in Table 1. Not surprisingly, McCabe's VG was highly correlated with the MMC metric. This is to be expected, since a major component of the MMC metric is VG. Likewise, the Hall and Preiser metric was also strongly related to the MMC metric. This suggests that each of these metrics is reflecting similar characteristics within the code. However, by considering the relationship between the metrics and bugs, we can see that the performances of the VG and HNP metrics do not warrant using them as proxies for the MMC metric.

Likewise, even though the performance of a simple count of lines of code approaches (but does not surpass) the MMC metric, the relationship between the two metrics is not significant enough to warrant the use of a count of lines of code in place of the MMC metric. Furthermore, their relationship suggests that they are not measuring the same characteristics. Finally, note that the size measure, DSL, had the second highest correlation with errors.

6. USING THE NEW COMPLEXITY METRIC IN PROJECT MANAGEMENT

Oftentimes, when metrics are discussed, their application to the day-to-day concerns of a software project manager is left as an exercise to the reader. We would like to present a practical side to our study that could be used to help managers efficiently allocate testing resources.

Usually, by the time the project has reached the integration testing phase, any slack that existed in the budget has been depleted. Thus, the resources available to the project manager are usually fixed at some (usually inadequate) level. However, when dealing with a system with n modules, it is not always clear how to allocate the resources available for testing. The manager might simply use 1/n of the resources to test each of the n modules.

A manager could use a complexity metric to assist in the allocation of resources during testing. It is com- monly accepted that a small number of the modules contain a majority of the errors. Likewise, an equally small number of the modules contain an extraordinarily small number of errors. If a manager could identify these modules, he could ensure that the majority of the testing resources allocated for the “low error” modules be reallocated to the “high error” modules. The problem, of course, is to identify these modules.

Our study showed that six of the 20 modules accounted for 63% of the errors discovered during testing and another six of the 20 modules accounted for only 5% of the errors discovered (the remaining eight modules accounted for 31% of the errors). Each of the six metrics we used identified the same five of the first six modules as the most complex, and performed fairly well in identifying the second six. However, the exact rankings of these modules differed among the six metrics. The rankings of the various modules for each of the metrics, as well as the number of error reports, are shown in Table 2. The important point is that the metrics identified the few modules with the majority of errors. Thus, a project manager could use this information to allocate the bulk of the testing resources to those six modules that have been identified as being most "complex." The additional resources could be taken from those six modules that have been identified as being the least "complex."

Table 1. Results of the Studyᵃ

        BUGS  MMC   HNP   HNK   DSL   VG    E     PRC
MMC     .82   -     .90   .71   .67   .91   .80   .79
HNP     .75   .90   -     .84   .70   .95   .87   .92
HNK     .62   .71   .84   -     .77   .89   .98   .91
DSL     .76   .67   .70   .77   -     .77   .81   .70
VG      .73   .91   .95   .89   .77   -     .93   .92
E       .69   .80   .87   .98   .81   .93   -     .93
PRC     .64   .79   .92   .91   .70   .92   .93   -

ᵃ Pearson product moment coefficients of correlation for each of the metrics vs. the number of errors and the other metrics.

MMC: the new metric; HNP: Hall and Preiser's metric; HNK: Henry and Kafura's metric; DSL: Delivered Source Lines; VG: McCabe's metric; E: Halstead's Effort; PRC: number of procedures.


Table 2. Identification of Most Error Prone and Least Error Prone Modules

Module   Error    DSL      MMC      VG       E        HNK      HNP      PRC
name     ranking  ranking  ranking  ranking  ranking  ranking  ranking  ranking

m01      1        5        1        1        4        4        1        4
m02      2        4        5        5        5        5        4        5
m03      3        2        2        3        2        2        3        3
m04      4        1        4        4        3        3        5        2
m05      5        7        15       15       17       12       11       15
m06      5        3        3        2        1        1        2        1
m07      6        11       17       14       15       13       17       6
m08      7        16       11       9        10       10       13       8
m09      8        12       20       19       19       17       18       16
m10      8        6        18       18       20       18       20       18
m11      9        14       14       11       12       19       14       10
m12      10       14       6        9        7        14       12       9
m13      11       10       7        6        6        6        8        8
m14      12       15       13       13       11       9        15       7
m15      13       9        9        8        8        8        7        13
m16      14       8        10       7        14       11       10       14
m17      14       17       12       10       13       15       9        12
m18      14       13       8        12       9        7        6        11
m19      15       18       16       17       18       20       16       17
m20      15       19       19       16       16       16       19       14

+                 (.76)    (.49)    (.50)    (.49)    (.55)    (.47)    (.63)
++                (.85)    (.83)    (.81)    (.72)    (.77)    (.83)    (.67)

ᵃ Abbreviations are the same as in Table 1. +: Spearman rank correlation of each metric's ranking with the error ranking, for all 20 modules. ++: Spearman rank correlation of each metric's ranking with the error ranking, for only the first and last six ranked modules (i.e., m01 through m06 and m15 through m20).


How well each of the metrics accomplished this ranking can be determined by using the Spearman rank correlation coefficient to assess the relationship between the rank associated with each module based on the number of errors observed, and the rank associated with each module based on the metric values. As can be seen in Table 2, Lines of Code had the highest rank correlation (.76) with errors for the entire set of modules. The next highest rank correlation was a count of the number of procedures (.63).

In general, the rank correlations obtained in this manner were quite disappointing. However, if we only consider the top six and bottom six modules, the metrics display a great deal more accuracy (Lines of Code possesses a rank correlation of .85, followed by the MMC and Hall and Preiser metrics at .83). This suggests that the metrics work quite well in identifying the "extraordinary cases," but do a relatively poor job of distinguishing among modules that do not fit into one of these "extraordinary" categories (i.e., either few errors or many).

It should be noted that the Spearman rank correlation differs significantly from the Pearson correlations discussed earlier. The Spearman correlation only considers the order of the observations, not how far apart they are. The Pearson correlation, on the other hand, addresses the issue of how well the independent variable "explains" the dependent variable. As an example, the following paired values would display the same Spearman correlation, but a different Pearson correlation: the pairs A(1, 1), B(2, 2), C(3, 3) would provide both a Pearson and a Spearman correlation of 1, whereas the pairs A(1, 1), B(2, 1000), C(3, 1001) would possess a Spearman correlation of 1.00, but a Pearson correlation of 0.86. The rank correlation was used in the current situation because we wanted to consider how well the metrics ranked the modules. In Table 1, we were more interested in the metrics' "prediction" abilities; therefore, we used the Pearson correlation. Thus, the results in Tables 1 and 2 suggest that the MMC metric can provide a basis for more accurate "predictions" of error counts than can Lines of Code (or any of the other metrics analyzed, for that matter); however, Lines of Code still surpasses the performance displayed by the others in ranking the most error-prone modules.
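The distinction is easy to verify directly; the sketch below recomputes the three-point example with scipy (assuming scipy is available; the computed Pearson value comes out near the 0.86 reported above).

    from scipy.stats import pearsonr, spearmanr

    # The example pairs: rank order is identical, but the spacing differs wildly.
    x = [1, 2, 3]
    y = [1, 1000, 1001]

    rho, _ = spearmanr(x, y)  # considers only rank order -> 1.0
    r, _ = pearsonr(x, y)     # considers linear agreement -> about 0.87

    print(f"Spearman = {rho:.2f}, Pearson = {r:.2f}")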

The exact percentage of "extraordinary bug rate" modules would probably vary from situation to situation. However, the percentage of "extraordinary bug rate" modules would probably seldom go above 60% (30% high and 30% low). For example, Weiss and Basili [8] noted that in three projects, only 10% of the component parts were changed more than six times.

7. CONCLUSIONS

There are three primary conclusions from our study. For the first two conclusions, the reader should keep in mind that the study involved only a single system.

1. The new MMC metric performed considerably better than the other five metrics in terms of the Pearson correlations with errors. However, in identifying "extraordinary error rate" software, it performed only marginally better than some of the other metrics, and slightly worse than Lines of Code.

2. Our study reaffirmed what many other studies have found: Lines of Code is a good, straightforward, easily calculated, and reliable measure of program complexity. Our study suggests that it be used in place of McCabe's VG and Halstead's E measures as a microlevel complexity metric due to its relative ease of calculation and somewhat superior correlations with errors.

3. The use of a Reduced Form is an invaluable tool for acquiring software complexity data from industrial sources. It is doubtful that we would have obtained any of the data used in our study if the Reduced Form tool were not readily available for use by the organization. Certainly, we would not have had access to a general enough form of the data to experiment with the various adjustments to the metrics that are necessary in a study of this scope.

REFERENCES

1. A. Fitzsimmons and T. Love, A Review and Evaluation of Software Science, ACM Computing Surveys (March 1978), pp. 3-18.
2. N. Hall and S. Preiser, Combined Network Complexity Measures, IBM Journal of Research and Development (January 1984), pp. 15-27.
3. M. Halstead, Elements of Software Science, Elsevier, New York, 1977.
4. W. Harrison and C. Cook, A Method of Sharing Industrial Software Complexity Data, ACM SIGPLAN Notices (February 1985), pp. 42-49.
5. S. Henry and D. Kafura, Software Structure Metrics Based on Information Flow, IEEE Transactions on Software Engineering (September 1981), pp. 510-518.
6. T. McCabe, A Complexity Measure, IEEE Transactions on Software Engineering (December 1976), pp. 308-320.
7. I. Storm and S. Preiser, An Index of Complexity for Structured Programming, IEEE Proceedings of the Workshop on Quantitative Software Models (1979), pp. 130-133.
8. D. Weiss and V. Basili, Evaluating Software Development by Analysis of Changes: Some Data from the Software Engineering Laboratory, IEEE Transactions on Software Engineering (February 1985), pp. 157-168.