SQAT (Software Quality Assurance and Testing)
Post on 01-Dec-2015
Unit-III
5.1. Establishing software quality requirements:
The software quality metrics methodology is a systematic approach to
establishing quality requirements and identifying, implementing, analyzing, and
validating the process and product software quality metrics for a software system. It
spans the entire software life cycle and comprises five steps.
These steps shall be applied iteratively because insights gained from applying a
step may show the need for further evaluation of the results of prior steps.
(a) Establish software quality requirements: A list of quality factors is selected,
prioritized, and quantified at the outset of system development or system change. These
requirements shall be used to guide and control the development of the system and to assess
whether the system met the quality requirements specified in the contract.
(b) Identify software quality metrics: The software quality metrics framework is
applied in the selection of relevant metrics.
(c) Implement the software quality metrics: Tools are either procured or developed, data
is collected, and metrics are applied at each phase of the software life cycle.
(d) Analyze software quality metrics results: The metrics results are analyzed and
reported to help control the development and assess the final product.
(e) Validate the software quality metrics: Predictive metrics results are compared to the direct
metrics results to determine whether the predictive metrics accurately measure their
associated factors.
The documentation or output produced as a result of these steps is shown in table I.
Table I-Outputs of metrics methodology steps
Metric methodology step                    Output
Establish software quality requirements    - Quality requirements
Identify software quality metrics          - Approved quality metrics framework
                                           - Metrics set
                                           - Cost-benefit analysis
Implement the software quality metrics     - Description of data items
                                           - Metrics/data item
                                           - Traceability matrix
                                           - Training plan and schedule
5.1 Establish software quality requirements :
Quality requirements shall be represented in either of the following forms:
Direct metric value: A numerical target for a factor to be met in the final product.
For example, mean time to failure (MTTF) is a direct metric of final system reliability.
Predictive metric value: A numerical target related to a factor to be met during system
development. This is an intermediate requirement that is an early indicator of final system
performance. For example, design or code errors may be early predictors of final system
reliability.
5.1.1 Identify a list of possible quality requirements:
Identify quality requirements that may be applicable to the software system. Use
organizational experience, required standards, regulations, or laws to create this list.
Annex A contains sample lists of factors and subfactors. In addition, list other system
requirements that may affect the feasibility of the quality requirements. Consider
acquisition concerns, such as cost or schedule constraints, warranties, and organizational
self-interest. Do not rule out mutually exclusive requirements at this point. Focus on
factor/direct metric combinations instead of predictive metrics.
All parties involved in the creation and use of the system shall participate in the
quality requirements identification process.
5.1.2 Determine the actual list of quality requirements:
To determine the actual list from the possible quality requirements, follow the two
steps below:
Survey the involved parties: Have each involved party rate each of the listed quality
requirements by importance. Importance is a function of the system characteristics and
the viewpoints of the people involved.
Create the actual quality requirements: Resolve the results of the survey into a
single list of quality requirements. This shall involve a technical feasibility analysis of the
quality requirements. The proposed factors for this list may have cooperative or
conflicting relationships. Conflicts between requirements shall be resolved at this point.
In addition, if the choice of quality requirements is in conflict with cost, schedule, or
system functionality, one or the other shall be altered. Care shall be exercised in choosing
the desired list to ensure that the requirements are technically feasible, reasonable,
complementary, achievable, and verifiable. All involved parties shall agree to this final
list.
5.1.3 Quantify each factor:
For each factor, assign one or more direct metrics to represent the factor, and
assign direct metric values to serve as quantitative requirements for that factor. For
example, if “high efficiency” was one of the quality requirements from the previous item,
the direct metric “actual resource utilization/allocated resource utilization” with a value of
90% could represent that factor. This direct metric value is used to verify the achievement
of the quality requirement. Without it, there is no way to tell whether or not the
delivered system meets its quality requirements.
The quantified list of quality requirements and their definitions again shall be
approved by all involved parties.
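The "high efficiency" example can be sketched as a simple check. This is only an illustration: the function name is hypothetical, and it assumes that the requirement means the ratio of actual to allocated utilization must not exceed 90%, a direction the text does not state.

```python
def meets_efficiency_requirement(actual_utilization: float,
                                 allocated_utilization: float,
                                 required_ratio: float = 0.90) -> bool:
    """Compare the direct metric actual/allocated utilization with its target.

    Assumption (not stated in the source): the requirement is met when the
    ratio is at or below the target value.
    """
    if allocated_utilization <= 0:
        raise ValueError("allocated utilization must be positive")
    ratio = actual_utilization / allocated_utilization
    return ratio <= required_ratio


# Hypothetical measurements: 40 units used out of 50 allocated gives 0.80.
print(meets_efficiency_requirement(40, 50))
```

Without a quantified value such as the 0.90 used here, the function would have nothing to compare against, which is the point the clause makes.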
5.2 Identify software quality metrics:
When identifying software quality metrics, apply the software quality metrics
framework, and perform a cost-benefit analysis. Then gain commitment to the metrics
from all involved parties.
5.2.1 Apply the software quality metrics framework:
Create a chart of the quality requirements based on the hierarchical tree structure
found in figure 1. At this point, only the factor level must be complete. Next decompose
each factor into subfactors as indicated in clause 4. The decomposition into subfactors
must continue for as many levels as needed until the subfactor level is complete.
Using the software quality metrics framework, decompose the subfactors into
measurable metrics. For each validated metric on the metric level, assign a target value
and a critical value and range that should be achieved during development. The target
values constitute additional quality requirements for the system.
The framework and the target values for the metrics shall be reviewed and
approved by all involved parties.
To help ensure that metrics are used appropriately, only validated metrics (that is,
either direct metrics or metrics validated with respect to direct metrics) shall be used to
assess current and future product and process quality. Nonvalidated metrics may be
included for future analysis, but shall not be included as part of the system requirements.
Furthermore, the metrics that are used shall be those that are associated with the quality
requirements of the software project. However, given that the above conditions are
satisfied, the selection of specific metrics as candidates for validation and the selection of
specific validated metrics for application are at the discretion of the user of this standard.
Examples of metrics and experiences with their use are given in annex B.
Document each metric using the format shown in table 2.
Table 2-Metrics set
Item            Description
Name            Name given to the metric.
Impact          Indication of whether a metric may be used to alter.
Target value    Numerical value of the metric that is to be achieved in order to meet quality requirements.
Factors         Factors that are related to this metric.
5.2.2 Perform a cost-benefit analysis:
Perform a cost-benefit analysis by identifying the costs of implementing the
metrics, identifying the benefits of applying the metrics, and then adjusting the metrics
set.
5.2.2.1 Identify the costs of implementing the metrics:
Identify and document (see table 2) all the costs associated with the metrics in the metrics set.
For each metric, estimate and document the following impacts and costs.
Metric utilization cost: The costs of collecting data; automating the metric value
calculation (when possible); and analyzing, interpreting, and reporting the results.
Software development cost change: The set of metrics may imply a change in the
organizational structure used to produce the software system.
Special equipment: Hardware or software tools may have to be located, purchased,
adapted, or developed to implement the metrics.
Training: The quality assurance/control organization or the entire development team may
need training in the use of the metrics and data collection procedures.
If the introduction of metrics has caused changes in the development process, the
development team may need to be educated about the changes.
5.2.2.2 Identify the benefits of applying the metrics :
Identify and document the benefits that are associated with each metric in the
metrics set.
Some benefits to consider include the following:
- Identify quality goals and increase awareness of the goals in the software organization.
- Provide timely feedback useful in developing higher quality software.
- Increase customer satisfaction by quantifying the quality of the software before it is
delivered to the customer.
- Provide a quantitative basis for making decisions about software quality.
5.2.2.3 Adjust the metrics set :
Weigh the benefits, tangible and intangible, against the costs of each metric. If the
costs exceed the benefits of a given metric, alter or delete it from the metrics set. On the
other hand, for metrics that remain, make plans for any necessary changes to the software
development process, organizational structure, tools, and training. In most cases it will
not be feasible to quantify benefits. In these cases judgment shall be exercised in
weighing qualitative benefits against quantitative costs.
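The weighing step can be sketched as follows. This is a minimal illustration that assumes every benefit has already been given a numeric estimate in the same unit as the cost; as noted above, that is often not feasible, in which case the benefit figure stands in for a judgment. The class, field names, and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MetricEstimate:
    name: str
    cost: float      # estimated cost of collecting, computing, and reporting
    benefit: float   # estimated benefit in the same unit (a judgment when intangible)


def adjust_metrics_set(metrics):
    """Split the metrics set into metrics to keep and metrics to alter or delete."""
    kept = [m for m in metrics if m.benefit >= m.cost]
    to_review = [m for m in metrics if m.benefit < m.cost]
    return kept, to_review


# Hypothetical estimates in person-days.
candidates = [
    MetricEstimate("defect density", cost=5, benefit=20),
    MetricEstimate("manual design audit score", cost=30, benefit=10),
]
kept, to_review = adjust_metrics_set(candidates)
print([m.name for m in kept], [m.name for m in to_review])
```

Metrics in the second list are candidates for alteration or deletion; the decision itself remains with the involved parties.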
5.2.3 Gain commitment to the metrics set :
All involved parties shall review the revised metrics set to which the cost-benefit
analysis has been added. The metrics set shall be formally adopted and supported by this
group.
5.3 Implement the software quality metrics :
To implement the software quality metrics, define the data collection procedures,
use selected software to prototype the measurement process, then collect the data and
compute the metrics values.
5.3.1 Define the data collection procedures:
For each metric in the metric set, determine the data that will be collected and
determine the assumptions that will be made about the data (for example, random sample,
subjective or objective measure). The flow of data shall be shown from point of
collection to evaluation of metrics. Describe or reference when and how tools are to be
used and data storage procedures. Also, identify candidate tools. Select tools for use with
the prototyping process. A traceability matrix shall be established between metrics and
data items.
Identify the organizational entities that will directly participate in data collection
including those responsible for monitoring data collection. Describe the training and
experience required for data collection and the training process for personnel involved.
Describe each data item thoroughly, using the format shown in table 3.
Table 3-Description of a data item
Item              Description
Name              Name given to the data item.
Metrics           Metrics that are associated with the data item.
Definition        Straightforward description of the data item.
Source            Location of where the data originates.
Collector         Entity responsible for collecting the data.
Timing            Time(s) in life cycle at which the data is to be collected. (Some data items are collected more than once.)
Procedures        Methodology (for example, automated or manual) used to collect the data.
Representation    Manner in which the data is represented, for example, its precision and format.
Sample            Method used to select the data to be collected and the percentage of the available data that is to be collected.
Alternatives      Methods that may be used to collect the data other than the preferred method.
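The traceability matrix between metrics and data items can be derived from the "Metrics" row of each data item description. The sketch below is one possible representation, not a prescribed one; the data item and metric names are hypothetical.

```python
from collections import defaultdict

# Hypothetical data items, each listing the metrics it feeds
# (the "Metrics" row of table 3). Names are illustrative only.
DATA_ITEMS = [
    {"name": "inspected_loc", "metrics": ["inspection coverage"]},
    {"name": "estimated_loc", "metrics": ["inspection coverage", "defect density"]},
    {"name": "defects_found", "metrics": ["defect density"]},
]


def traceability_matrix(data_items):
    """Map each metric to the sorted list of data items needed to compute it."""
    matrix = defaultdict(list)
    for item in data_items:
        for metric in item["metrics"]:
            matrix[metric].append(item["name"])
    return {metric: sorted(names) for metric, names in matrix.items()}


def untraced_metrics(metric_names, matrix):
    """Return metrics in the metrics set that no data item supports."""
    return [m for m in metric_names if m not in matrix]


matrix = traceability_matrix(DATA_ITEMS)
print(matrix)
print(untraced_metrics(["defect density", "mean time to failure"], matrix))
```

A metric that appears in the metrics set but in no row of the matrix has no defined data collection procedure yet.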
5.3.2 Prototype the measurement process :
Test the data collection and metric computation procedures on selected software
that will act as a prototype. If possible, the samples selected should be similar to the
project(s) on which the metrics will later be used. An analysis shall be made to determine if
the data is collected uniformly and if instructions have always been interpreted in the
same manner. In particular, data requiring subjective judgments shall be checked to
determine if the descriptions and instructions are clear enough to ensure uniform results.
In addition, the cost of the measurement process for the prototype shall be
examined to verify or improve the cost analysis.
5.3.3 Collect the data and compute the metrics values:
Using the formats in table 2 and table 3, collect and store data at the appropriate
time in the life cycle. The data shall be checked for accuracy and proper unit of measure.
Data collection shall be monitored. If a sample of data is used, requirements such
as randomness, minimum size of sample, and homogeneous sampling shall be verified.
If more than one person is collecting the data, it shall be checked for uniformity.
Compute the metrics values from the collected data.
5.4 Analyze the software metrics results:
Analyzing the results of the software metrics includes interpreting the results,
identifying software quality, making software quality predictions, and ensuring
compliance with requirements.
5.4.1 Interpret the results:
The results shall be interpreted and recorded against the broad context of the
project as well as for a particular product or process of the project. The differences
between the collected metric data and the target values for the metrics shall be analyzed
against the quality requirements. Substantive differences shall be investigated.
5.4.2 Identify software quality:
Quality metric values for software components shall be determined and reviewed.
Quality metric values that are outside the anticipated tolerance intervals (low or
unexpectedly high quality) shall be identified for further study. Unacceptable quality may
be manifested as excessive complexity, inadequate documentation, lack of traceability, or
other undesirable attributes.
The existence of such conditions is an indication that the software may not satisfy
quality requirements when it becomes operational. Since many of the direct metrics that
are usually of interest cannot be measured during software development (for example,
reliability metrics), validated metrics shall be used when direct metrics are not available.
Direct or validated metrics shall be measured for software components and process steps.
The measurements shall be compared with critical values of the metrics.
Software components whose measurements deviate from the critical values shall
be analyzed in detail.
Unexpectedly high quality metric values shall cause a review of the software
development process, as the expected tolerance levels may need to be modified or the
process for identifying quality metric values may need to be improved.
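Flagging components whose metric values fall outside the anticipated tolerance interval can be sketched as below. The component names, the metric, and the interval bounds are hypothetical; the sketch only assumes that a tolerance interval is given as a lower and an upper bound.

```python
def flag_out_of_tolerance(values, low, high):
    """Return components whose metric value lies outside [low, high].

    Both directions are reported, because unexpectedly good values also
    call for a review of the process or of the tolerance levels.
    """
    flagged = {}
    for component, value in values.items():
        if value < low:
            flagged[component] = "below tolerance interval"
        elif value > high:
            flagged[component] = "above tolerance interval"
    return flagged


# Hypothetical complexity values per component, tolerance interval 2..10.
complexity = {"parser": 14, "logger": 4, "stub": 1}
print(flag_out_of_tolerance(complexity, low=2, high=10))
```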
5.4.3 Make software quality predictions:
During development, validated metrics shall be used to make predictions of direct
metric values. Predicted values of direct metrics shall be compared with target values to
determine whether to flag software components for detailed analysis. Predictions shall be
made for software components and process steps. Software components and process steps
whose predicted direct metric values deviate from the target values shall be analyzed in
detail.
Potentially, prediction is very valuable because it estimates the metric of ultimate
interest: the direct metric. However, prediction is difficult because it involves using
validated metrics from an earlier phase of the life cycle (for example, development) to
make a prediction about a different but related metric (direct metric) in a much later
phase (for example, operations).
5.4.4 Ensure compliance with requirements:
Direct metrics shall be used to ensure compliance of software products with
quality requirements during system and acceptance testing. Direct metrics shall be
measured for software components and process steps. These values shall be compared
with target values of the direct metrics that represent quality requirements. Software
components and process steps whose measurements deviate from the target values are
noncompliant.
5.5 Validate the software quality metrics:
For information on the statistical techniques used, see [B114], [B115], [B117],
or similar references.
5.5.1 Purpose of metrics validation:
The purpose of metrics validation is to identify both product and process metrics
that can predict specified quality factor values, which are quantitative representations of
quality requirements. If metrics are to be useful, they shall indicate accurately whether
quality requirements have been achieved or are likely to be achieved in the future. When
it is possible to measure factor values at the desired point in the life cycle, these direct
metrics are used to evaluate software quality. At some points in the life cycle, certain
quality factor values (for example, reliability) are not available. They are obtained after
delivery or late in the project. In these cases, other metrics are used early on in a project
to predict quality factor values.
Although quality subfactors are useful when identifying and establishing factors
and metrics, they need not be used in metrics validation, because the focus in validation
is on determining whether a statistically significant relationship exists between predictive
metric values and factor values.
Quality factors may be affected by multiple variables. A single metric, therefore,
may not sufficiently represent any one factor if it ignores these other variables.
5.5.2 Validity criteria:
To be considered valid, a predictive metric shall demonstrate a high degree of
association with the quality factors it represents. This is equivalent to accurately
portraying the quality condition(s) of a product or process. A metric may be valid with
respect to certain validity criteria and invalid with respect to other criteria.
The person in the organization who understands the consequences of the values selected
shall designate threshold values for the following:
V - square of the linear correlation coefficient
B - rank correlation coefficient
A - prediction error
P - success rate
A short numerical example follows the definition of each validity criterion.
Detailed examples of the application of metrics validation are contained in annex.
Correlation: The variation in the quality factor values explained by the variation in the
metric values, which is given by the square of the linear correlation coefficient ($R^2$) between
the metric and the corresponding factor, shall exceed V, that is, $R^2 > V$.
This criterion assesses whether there is a sufficiently strong linear association between a
factor and a metric to warrant using the metric as a substitute for the factor, when it is
infeasible to use the latter. For example, the correlation coefficient between a complexity
metric and the factor reliability may be 0.8. The square of this is 0.64. Only 64% of the
variation in the factor is explained by the variation in the metric. If V has been
established as 0.7, the conclusion would be drawn that the metric is invalid.
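The correlation criterion can be checked with a few lines of code. The sketch below computes the square of the Pearson linear correlation coefficient from paired samples and compares it with V; the function names are hypothetical, and only the 0.8 and 0.7 figures come from the example above.

```python
def r_squared(metric_values, factor_values):
    """Square of the linear (Pearson) correlation coefficient of paired samples."""
    n = len(metric_values)
    if n != len(factor_values) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_m = sum(metric_values) / n
    mean_f = sum(factor_values) / n
    s_mf = sum((m - mean_m) * (f - mean_f)
               for m, f in zip(metric_values, factor_values))
    s_mm = sum((m - mean_m) ** 2 for m in metric_values)
    s_ff = sum((f - mean_f) ** 2 for f in factor_values)
    if s_mm == 0 or s_ff == 0:
        raise ValueError("correlation is undefined for constant data")
    return (s_mf * s_mf) / (s_mm * s_ff)


def passes_correlation(r2, v):
    """The criterion holds only when R^2 strictly exceeds the threshold V."""
    return r2 > v


# The example from the text: r = 0.8 gives R^2 = 0.64, which does not exceed V = 0.7.
print(passes_correlation(0.8 ** 2, 0.7))
```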
Tracking: If a metric M is directly related to a quality factor F, for a given product or
process, then a change in a quality factor value from $F_{T1}$ to $F_{T2}$, at times T1 and T2, shall
be accompanied by a change in metric value from $M_{T1}$ to $M_{T2}$, which is in the same
direction.
For example, if a complexity metric is claimed to be a measure of reliability, then
it is reasonable to expect a change in the reliability of a software component to be
accompanied by an appropriate change in metric value (for instance, if the product
increases in reliability, the metric value should also change in a direction that indicates
the product has improved).
That is, if mean time to failure (MTTF) is used to measure reliability and is
equal to 1000 hours during testing (T1) and 1500 hours during operation (T2), a
complexity metric whose value is 8 in T1 and 6 in T2, where 6 is less complex than 8, is
said to track reliability for this software component. If this relationship is demonstrated
over a representative sample of software components, the conclusion could be drawn that
the metric can track reliability (that is, indicate changes in product reliability) over the
software life cycle.
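A tracking check for a single component can be sketched as follows. The `inverse` flag is an assumption introduced here to cover metrics such as complexity, where a lower value indicates improvement; the numbers are those of the MTTF example above.

```python
def tracks(f_t1, f_t2, m_t1, m_t2, inverse=False):
    """Return True when the metric change matches the factor change.

    inverse=True (an assumption of this sketch) means a decrease in the
    metric indicates an improvement, as with a complexity metric.
    """
    factor_change = f_t2 - f_t1
    metric_change = m_t2 - m_t1
    if inverse:
        metric_change = -metric_change
    if factor_change == 0 or metric_change == 0:
        # No movement in one value only counts as tracking if neither moved.
        return factor_change == 0 and metric_change == 0
    return (factor_change > 0) == (metric_change > 0)


# MTTF rises from 1000 to 1500 hours while complexity falls from 8 to 6.
print(tracks(1000, 1500, 8, 6, inverse=True))
```

A single component is not enough to conclude that the metric tracks reliability; the check would be repeated over a representative sample of components.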
Consistency: If factor values $F_1, F_2, \ldots, F_n$, corresponding to products or processes
$1, 2, \ldots, n$, have the relationship $F_1 > F_2 > \ldots > F_n$, then the corresponding metric
values shall have the relationship $M_1 > M_2 > \ldots > M_n$. To perform this test, compute the
coefficient of rank correlation (r) between paired values (from the same software components)
of the factor and the metric; $|r|$ shall exceed B.
This criterion assesses whether there is consistency between the
ranks of the factor values of a set of software components and the ranks of the metric
values for the same set of software components.
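The consistency test can be sketched with Spearman's rank correlation, which is one common coefficient of rank correlation; the text does not say which coefficient is intended, so this choice is an assumption. Tied values receive average ranks. The data and the threshold B = 0.7 are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_correlation(factor_values, metric_values):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(factor_values) != len(metric_values) or len(factor_values) < 2:
        raise ValueError("need at least two paired observations")
    rf, rm = average_ranks(factor_values), average_ranks(metric_values)
    mean = (len(rf) + 1) / 2  # mean rank is the same for both lists
    cov = sum((a - mean) * (b - mean) for a, b in zip(rf, rm))
    var_f = sum((a - mean) ** 2 for a in rf)
    var_m = sum((b - mean) ** 2 for b in rm)
    if var_f == 0 or var_m == 0:
        raise ValueError("rank correlation is undefined when all values tie")
    return cov / (var_f * var_m) ** 0.5


# Hypothetical MTTF (hours) and complexity values for four components.
r = rank_correlation([1000, 800, 600, 400], [2, 4, 6, 8])
print(abs(r) > 0.7)  # compare |r| with a hypothetical threshold B = 0.7
```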
Predictability: If a metric is used at time T1 to predict a quality factor for a given product
or process, it shall predict a related quality factor $F_{pT2}$ with an accuracy of
$\left| \frac{F_{aT2} - F_{pT2}}{F_{aT2}} \right| < A$, where
$F_{aT2}$ is the actual value of F at time T2. This criterion assesses whether a metric is
capable of predicting a factor value with the required accuracy. For example, if a
complexity metric is used during development to predict the reliability of a software
component during operation (T2) to be 1200 hours MTTF ($F_{pT2}$) and the actual MTTF
that is measured during operation is 1000 hours ($F_{aT2}$), then the error in prediction is 200
hours, or 20%.
If the acceptable prediction error (A) is 25%, prediction accuracy is acceptable. If
the ability to predict is demonstrated over a representative sample of software
components, the conclusion could be drawn that the metric can be used as a predictor of
reliability.
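The prediction error in this example is the absolute difference between actual and predicted values, relative to the actual value. A minimal sketch, with hypothetical function names and the figures from the example:

```python
def prediction_error(predicted, actual):
    """Relative prediction error |actual - predicted| / |actual|."""
    if actual == 0:
        raise ValueError("relative error is undefined when the actual value is 0")
    return abs(actual - predicted) / abs(actual)


def passes_predictability(predicted, actual, acceptable_error):
    """The criterion holds when the relative error is below the threshold A."""
    return prediction_error(predicted, actual) < acceptable_error


# Predicted MTTF 1200 hours, actual 1000 hours, acceptable error A = 25%.
print(prediction_error(1200, 1000))
print(passes_predictability(1200, 1000, 0.25))
```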
Discriminative power: A metric shall be able to discriminate between high quality
software components (for example, high MTTF) and low quality software components
(for example, low MTTF). For instance, the set of metric values associated with the
former should be significantly higher (or lower) than those associated with the latter.
The Mann Whitney Test and the Chi-square test for differences in probabilities
(Contingency Tables) can be used for this validation test.
Reliability: A metric shall demonstrate the above correlation, tracking, consistency,
predictability, and discriminative power properties for P percent of the applications of the
metric. This criterion is used to ensure that a metric has passed a validity test over a
sufficient number or percentage of applications so that there will be confidence that the
metric can perform its intended function consistently.
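The success rate can be computed from a record of validation outcomes. The sketch below assumes each application of the metric has been recorded as a pass or fail of the validity tests; the data and the threshold P = 80 are hypothetical, and whether the rate must reach or strictly exceed P is not stated in the text, so "at least P" is an assumption.

```python
def success_rate(outcomes):
    """Percentage of applications in which the metric passed its validity tests."""
    if not outcomes:
        raise ValueError("no applications recorded")
    return 100.0 * sum(1 for passed in outcomes if passed) / len(outcomes)


def passes_reliability(outcomes, p_percent):
    """Assumption: the criterion holds when the rate is at least P percent."""
    return success_rate(outcomes) >= p_percent


# Hypothetical history: the metric passed in 3 of 4 applications.
history = [True, True, True, False]
print(success_rate(history), passes_reliability(history, 80))
```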
5.5.3 Validation procedure:
Metrics validation shall include identifying the quality factors sample, identifying
the metrics sample, performing a statistical analysis, and documenting the results.
5.5.3.1 Identify the quality factors sample:
These factor values (for example, measurements of reliability), which represent
the quality requirements of a project, were previously identified (see 5.1.3) and collected
and stored (see 5.3.3). For validation purposes, draw a sample from the metrics database.
5.5.3.2 Identify the metrics sample:
These metrics (for example, design structure) are used to predict or to represent
quality factor values, when the factor values cannot be measured. The metrics were
previously identified (see 5.2.1) and their data collected and stored, and values computed
from the collected data (see 5.3.3). For validation purposes, draw a sample from the same
domain (for example, same software components) of the metrics database as used in
5.5.3.1.
5.5.3.3 Perform a statistical analysis :
The tests described in 5.5.2 shall be performed.
Before a metric is used to evaluate the quality of a product or process, it shall be
validated against the criteria described in 5.5.2. If a metric does not pass all of the
validity tests, it shall only be used according to the criteria prescribed by those tests (for
example, if it only passes the tracking validity test, it shall be used only for tracking the
quality of a product or process).
5.5.3.4 Document the results:
Documenting results shall include the direct metric, predictive metric, validation
criteria, and numerical results, as a minimum.
5.5.4 Additional requirements:
Additional requirements, those being the need for revalidation, confidence in
analysis results, and the stability of environment, are described in the following
subclauses.
5.5.4.1 The need for revalidation:
It is important to revalidate a predictive metric before it is used for another
environment or application. As the software engineering process changes, the validity of
metrics changes. Cumulative metric validation values may be misleading because a
metric that has been valid for several uses may become invalid. It is wise to compare the
one-time validation of a metric with its validation history to avoid being misled.
The following statements of caution should be noted:
*A validated metric may not necessarily be valid in other environments or future
applications.
*A metric that has been invalidated may be valid in other environments or future
applications.
5.5.4.2 Confidence in analysis results:
Metrics validation is a continuous process. Confidence in metrics is increased
over time as metrics are validated on a variety of projects and as the metrics database
and sample size increase. Confidence is not a static, one-time property. If a metric is
valid, confidence will increase with increased use (that is, the correlation coefficient will
be significant at decreasing values of the significance level). Greatest confidence occurs
when metrics have been tentatively validated based on data collected from previous
projects. Even when this is the case, the validation analysis will continue into future
projects as the metrics database and sample size grow.
5.5.4.3 Stability of environment:
Practicable metrics validation shall be undertaken in a stable development
environment (that is, where the design language, implementation language, or program
development tools do not change over the life of the project in which validation is
performed). In addition, there shall be at least one project in which metrics data have
been collected and validated prior to application of the predictive metrics.
This project shall be similar to the one in which the metrics are applied with
respect to software engineering skills, application, size, and software engineering
environment.
SOFTWARE QUALITY INDICATOR:
A software quality indicator (SQI) is a variable whose value can be determined
through direct analysis of product or process characteristics and whose evidential
relationship to one or more software engineering attributes is undeniable.
A Software Quality Indicator can be calculated to provide an
indication of the quality of the system by assessing system
characteristics.
Method
Assemble a quality indicator from factors that can be determined
automatically with commercial or custom code scanners, such as
the following:
- cyclomatic complexity of code (e.g., McCabe's Complexity Measure),
- unused/unreferenced code segments (these should be eliminated over time),
- average number of application calls per module (complexity is directly
proportional to the number of calls),
- size of compilation units (reasonably sized units have approximately 20
functions (or paragraphs), or about 2000 lines of code; these guidelines will
vary greatly by environment),
- use of structured programming constructs (e.g., elimination of GOTOs and
circular procedure calls).
These measures apply to traditional 3GL environments and are
more difficult to determine in environments which are using
object-oriented languages, 4GLs, or code generators.
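As an illustration of a custom code scanner, the sketch below approximates cyclomatic complexity for Python source using the standard `ast` module. It is not McCabe's full definition and not a substitute for a commercial tool: it simply counts one path plus one for each branching construct it recognizes, and the choice of constructs is an assumption of this sketch.

```python
import ast

# Constructs counted as one decision point each (an assumption of this sketch).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approximate_cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        pass
    return 0
"""
print(approximate_cyclomatic_complexity(SAMPLE))
```

A scanner of this kind could feed one component of the indicator; the other factors in the list would need their own scanners.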
Tips and Hints
With existing software, the Software Quality Indicator could also
include a measure of the reliability of the code. This can be
determined by keeping a record of how many times each module
has to be fixed in a given time period.
There are other factors which contribute to the quality of a system,
such as:
- procedure re-use,
- clarity of code and documentation,
- consistency in the application of naming conventions,
- adherence to standards,
- consistency between documentation and code,
- the presence of current unit test plans.
These factors are harder to determine automatically. However,
with the introduction of CASE tools and reverse-engineering tools,
and as more of the design and documentation of a system is
maintained in structured repositories, these measures of quality
will be easier to determine, and could be added to the indicator.
Fundamentals in Measurement Theory:
This section discusses the fundamentals of measurement theory. We outline the
relationships among theoretical concepts, definitions, and measurement, and describe
some basic measures that are used frequently. It is important to distinguish the levels of
the conceptualization process, from abstract concepts, to definitions that are used
operationally, to actual measurements. Depending on the concept and the operational
definition derived from it, different levels of measurement may be applied: nominal scale,
ordinal scale, interval scale, and ratio scale.
It is also beneficial to spell out the explicit differences among some basic
measures such as ratio, proportion, percentage, and rate. Significant amounts of wasted
effort and resources can be avoided if these fundamental measurements are well
understood.
We then focus on measurement quality. We discuss the most important issues in
measurement quality, namely, reliability and validity, and their relationships with
measurement errors. We then discuss the role of correlation in observational studies and
the criteria for causality.
3.1 Definition, Operational Definition, and Measurement:
It is undisputed that measurement is crucial to the progress of all sciences. Scientific
progress is made through observations and generalizations based on data and
measurements, the derivation of theories as a result, and in turn the confirmation or
refutation of theories via hypothesis testing based on further empirical data. As an
example, consider the proposition "the more rigorously the front end of the software
development process is executed, the better the quality at the back end."
To confirm or refute this proposition, we first need to define the key concepts. For
example, we define "the software development process" and distinguish the process steps
and activities of the front end from those of the back end. Assume that after the
requirements-gathering process, our development process consists of the following
phases:
Design
Design reviews and inspections
Code
Code inspection
Debug and development tests
Integration of components and modules to form the product
Formal machine testing
Early customer program
Integration is the development phase during which various parts and components are
integrated to form one complete software product. Usually after integration the product is
under formal change control. Specifically, after integration every change of the software
must have a specific reason (e.g., to fix a bug uncovered during testing) and must be
documented and tracked. Therefore, we may want to use integration as the cutoff point:
The design, coding, debugging, and integration phases are classified as the front end of
the development process and the formal machine testing and early customer trials
constitute the back end.
We need to specify the indicator(s) of the definition and to make it (them)
operational. For example, suppose the process documentation says all designs and code
should be inspected. One operational definition of rigorous implementation may be
inspection coverage expressed in terms of the percentage of the estimated lines of code
(LOC) or of the function points (FP) that are actually inspected. Another indicator of
good reviews and inspections could be the scoring of each inspection by the inspectors at
the end of the inspection, based on a set of criteria.
We may want to operationally use a five-point Likert scale to denote the degree of
effectiveness (e.g., 5 = very effective, 4 = effective, 3 = somewhat effective, 2 = not
effective, 1 = poor inspection). There may also be other indicators.
In addition to design, design reviews, code implementation, and code inspections,
development testing is part of our definition of the front end of the development process.
We also need to operationally define "quality at the back end" and decide
which measurement indicators to use.
For the sake of simplicity let us use defects found per KLOC (or defects per function
point) during formal machine testing as the indicator of back-end quality. From these
metrics, we can formulate several testable hypotheses such as the following:
For software projects, the higher the percentage of the designs and code that
are inspected, the lower the defect rate at the later phase of formal machine testing.
The more effective the design reviews and the code inspections as scored by the
inspection team, the lower the defect rate at the later phase of formal machine
testing.
The more thorough the development testing (in terms of test coverage) before
integration, the lower the defect rate at the formal machine testing phase.
With the hypotheses formulated, we can set out to gather data and test the hypotheses.
We also need to determine the unit of analysis for our measurement and data. In this case,
it could be at the project level or at the component level of a large project. If we are able
to collect a number of data points that form a reasonable sample size (e.g., 35 projects or
components), we can perform statistical analysis to test the hypotheses.
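As a minimal sketch of how the operational definitions above might be turned into numbers, the following Python snippet computes inspection coverage (the front-end indicator) and defects per KLOC during formal machine testing (the back-end indicator). The function names and the two sample projects are hypothetical and only illustrate the arithmetic; they are not data from any real study.

```python
def inspection_coverage(inspected_kloc, estimated_kloc):
    """Front-end rigor indicator: percentage of the estimated LOC inspected."""
    return 100.0 * inspected_kloc / estimated_kloc


def defect_rate(test_defects, estimated_kloc):
    """Back-end quality indicator: defects per KLOC in formal machine testing."""
    return test_defects / estimated_kloc


# Made-up sample data (assumption); a real study needs a reasonable sample size.
projects = [
    {"name": "A", "estimated_kloc": 40.0, "inspected_kloc": 36.0, "test_defects": 60},
    {"name": "B", "estimated_kloc": 25.0, "inspected_kloc": 10.0, "test_defects": 100},
]

for p in projects:
    coverage = inspection_coverage(p["inspected_kloc"], p["estimated_kloc"])
    rate = defect_rate(p["test_defects"], p["estimated_kloc"])
    print(p["name"], coverage, rate)
```

With one such pair of values per project or component, the hypotheses can then be examined statistically across the sample.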
The building blocks of theory are concepts and definitions. In a theoretical
definition a concept is defined in terms of other concepts that are already well
understood. In the deductive logic system, certain concepts would be taken as undefined;
they are the primitives. All other concepts would be defined in terms of the primitive
concepts. For example, the concepts of point and line may be used as undefined and the
concepts of triangle or rectangle can then be defined based on these primitives.
3.2 Level of Measurement:
We have seen that the path from theory to empirical hypothesis, and from theoretically
defined concepts to operational definitions, is by no means direct. As the example
illustrates, when we
operationalize a definition and derive measurement indicators, we must consider the scale
of measurement. For instance, to measure the quality of software inspection we may use a
five-point scale to score the inspection effectiveness or we may use percentage to indicate
the inspection coverage. For some cases, more than one measurement scale is applicable;
for others, the nature of the concept and the resultant operational definition can be
measured only with a certain scale. In this section, we briefly discuss the four levels of
measurement: nominal scale, ordinal scale, interval scale, and ratio scale.
Nominal Scale
The simplest operation in science and the lowest level of measurement is
classification. In classifying we attempt to sort elements into categories with respect to a
certain attribute. For example, if the attribute of interest is religion, we may classify the
subjects of the study into Catholics, Protestants, Jews, Buddhists, and so on. If we
classify software products by the development process models through which the
products were developed, then we may have categories such as waterfall development
process, spiral development process, iterative development process, object-oriented
programming process, and others.
In a nominal scale, the names of the categories and their sequence bear no
assumptions about relationships among categories. For instance, we place the waterfall
development process in front of spiral development process, but we do not imply that one
is "better than" or "greater than" the other.
Ordinal Scale
Ordinal scale refers to the measurement operations through which the subjects can
be compared in order. For example, we may classify families according to socio-
economic status: upper class, middle class, and lower class. We may classify software
development projects according to the SEI maturity levels or according to a process rigor
scale.
The ordinal measurement scale is at a higher level than the nominal scale in the
measurement hierarchy. Through it we are able not only to group subjects into categories,
but also to order the categories.
An ordinal scale is asymmetric in the sense that if A > B is true then B > A is
false. It has the transitivity property in that if A > B and B > C, then A > C.
An ordinal scale offers no information about the magnitude of the differences
between categories. Therefore, when we translate order relations into mathematical
operations, we cannot use operations such as addition, subtraction, multiplication, and
division. We can use "greater than" and "less than." However, in real-world applications,
for some specific types of ordinal scales (such as the Likert five-point, seven-point, or
ten-point scales), the assumption of equal distance is often made and operations such as
averaging are applied to these scales. In such cases, we should be aware that the
measurement assumption is violated, and use extreme caution when interpreting the
results of data analysis.
Interval and Ratio Scales:
An interval scale indicates the exact differences between measurement points.
The mathematical operations of addition and subtraction can be applied to interval scale
data. For instance, assuming products A, B, and C are developed in the same language, if
the defect rate of software product A is 5 defects per KLOC and product B's rate is 3.5
defects per KLOC, then we can say product A's defect level is 1.5 defects per KLOC
higher than product B's defect level. An interval scale of measurement requires a well-
defined unit of measurement that can be agreed on as a common standard and that is
repeatable.
Given a unit of measurement, it is possible to say that the difference between two
scores is 15 units or that one difference is the same as a second. Assuming product C's
defect rate is 2 defects per KLOC, we can thus say the difference in defect rate between
products A and B is the same as that between B and C.
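When an absolute (nonarbitrary) zero point can be located on an interval scale, it becomes
a ratio scale, and multiplication and division also become meaningful.
For interval and ratio scales, the measurement can be expressed in both integer and
noninteger data. Integer data are usually given in terms of frequency counts (e.g., the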
number of defects customers will encounter for a software product over a specified time
length).
3.3 Some Basic Measures
Regardless of the measurement scale, when the data are gathered we need to
analyze them to extract meaningful information. Various measures and statistics are
available for summarizing the raw data and for making comparisons across groups. In this
section we discuss some basic measures such as ratio, proportion, percentage, and rate,
which are frequently used in our daily lives as well as in various activities associated with
software development and software quality. These basic measures, while seemingly easy,
are often misused. There are also numerous sophisticated statistical techniques and
methodologies that can be employed in data analysis. However, such topics are not within
the scope of this discussion.
Ratio
A ratio results from dividing one quantity by another. The numerator and
denominator are from two distinct populations and are mutually exclusive. For example,
in demography, sex ratio is defined as
\[ \text{Sex ratio} = \frac{\text{Number of males}}{\text{Number of females}} \times 100\% \]
If the ratio is less than 100, there are more females than males; if it is greater than 100,
there are more males than females.
Ratios are also used in software metrics. The most often used, perhaps, is the ratio
of number of people in an independent test organization to the number of those in the
development group. The test/development head-count ratio could range from 1:1 to 1:10
depending on the management approach to the software development process. For the
large-ratio (e.g., 1:10) organizations, the development group usually is responsible for the
complete development (including extensive development tests) of the product, and the test
group conducts system-level testing in terms of customer environment verifications. For
the small-ratio organizations, the independent group takes the major responsibility for
testing (after debugging and code integration) and quality assurance.
Proportion
Proportion is different from ratio in that the numerator in a proportion is a part of the
denominator:
\[ p = \frac{a}{a + b} \]
Proportion also differs from ratio in that ratio is best used for two groups,
whereas proportion is used for multiple categories (or populations) of one group. In other
words, the denominator in the preceding formula can be more than just a + b. If
\[ a + b + c + d + e = N \]
then we have
\[ \frac{a}{N} + \frac{b}{N} + \frac{c}{N} + \frac{d}{N} + \frac{e}{N} = 1 \]
When the numerator and the denominator are integers and represent counts of
certain events, then p is also referred to as a relative frequency. For example, the
following gives the proportion of satisfied customers of the total customer set:
\[ p = \frac{\text{Number of satisfied customers}}{\text{Total number of customers of a software product}} \]
The numerator and the denominator in a proportion need not be integers. They can be
frequency counts as well as measurement units on a continuous scale (e.g., height in
inches, weight in pounds). When the measurement unit is not integer, proportions
are called fractions.
Percentage
A proportion or a fraction becomes a percentage when it is expressed in terms of
per hundred units (the denominator is normalized to 100). The word percent means per
hundred. A proportion p is therefore equal to 100p percent (100p%).
Percentages are frequently used to report results, and as such are frequently misused.
First, because percentages represent relative frequencies, it is important that enough
contextual information be given, especially the total number of cases, so that the readers
can interpret the information correctly. Jones (1992) observes that many reports and
presentations in the software industry are careless in using percentages and ratios. He cites
the example:
Requirements bugs were 15% of the total, design bugs were 25% of the total, coding
bugs were 50% of the total, and other bugs made up 10% of the total.
Had the results been stated as follows, it would have been much more informative:
The project consists of 8 thousand lines of code (KLOC). During its development a total
of 200 defects were detected and removed, giving a defect removal rate of 25 defects per
KLOC. Of the 200 defects, requirements bugs constituted 15%, design bugs 25%, coding
bugs 50%, and other bugs made up 10%.
A second important rule of thumb is that the total number of cases must be sufficiently
large to use percentages. Percentages computed from a small total are not stable; they also
convey an impression that a large number of cases are involved. Some writers recommend
that the minimum number of cases for which percentages should be calculated is 50. We
recommend that, depending on the number of categories, the minimum number be 30, the
smallest sample size required for parametric statistics. If the number of cases is too small,
then absolute numbers, instead of percentages, should be used. For instance:
Of the total 20 defects for the entire project of 2 KLOC, there were 3 requirements bugs,
5 design bugs, 10 coding bugs, and 2 others.
When results in percentages appear in table format, usually both the percentages and
actual numbers are shown when there is only one variable. When there are more than two
groups, such as the example in Table 3.1, it is better just to show the percentages and the
total number of cases (N) for each group. With percentages and N known, one can always
reconstruct the frequency distributions. The total of 100.0% should always be shown so
that it is clear how the percentages are computed. In a two-way table, the direction in
which the percentages are computed depends on the purpose of the comparison. For
instance, the percentages in Table 3.1 are computed vertically (the total of each column is
100.0%), and the purpose is to compare the defect-type profile across projects (e.g.,
project B proportionally has more requirements defects than project A).
In Table 3.2, the percentages are computed horizontally. The purpose here is to compare
the distribution of defects across projects for each type of defect. The interpretations of
the two tables differ. Therefore, it is important to carefully examine percentage tables to
determine exactly how the percentages are calculated.
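The two rules of thumb for percentages can be sketched in a few lines of Python. This is only an illustration: the function names and the `min_cases` parameter are assumptions, the default threshold of 30 follows the recommendation in the text, and the counts are derived from the Jones (1992) example (15%, 25%, 50%, and 10% of 200 defects) and from the 20-defect example.

```python
def defects_per_kloc(total_defects, kloc):
    """Context for the percentages: overall defect rate of the project."""
    return total_defects / kloc


def percentage_breakdown(counts, min_cases=30):
    """Return (N, percentages); percentages is None when N is too small.

    counts maps a category name to its number of cases. When the total is
    below min_cases, absolute numbers should be reported instead.
    """
    total = sum(counts.values())
    if total < min_cases:
        return total, None
    return total, {name: 100.0 * n / total for name, n in counts.items()}


# 200 defects in an 8 KLOC project: percentages are reported together with N.
large = {"requirements": 30, "design": 50, "coding": 100, "other": 20}
# 20 defects in a 2 KLOC project: too few cases, report the counts themselves.
small = {"requirements": 3, "design": 5, "coding": 10, "other": 2}

print(defects_per_kloc(200, 8), percentage_breakdown(large))
print(defects_per_kloc(20, 2), percentage_breakdown(small), small)
```

Returning N alongside the percentages keeps the contextual information that lets a reader reconstruct the frequency distribution.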
Table 3.1. Percentage Distributions
Rate
Ratios, proportions, and percentages are static summary measures. They provide a
cross-sectional view of the phenomena of interest at a specific time. The concept of rate is
associated with the dynamics (change) of the phenomena of interest; generally it can be
defined as a measure of change in one quantity (y) per unit of another quantity (x) on
which the former (y) depends. Usually the x variable is time. It is important that the time
unit always be specified when describing a rate associated with time. For instance, in
demography the crude birth rate (CBR) is defined as:
\[ \text{CBR} = \frac{B}{P} \times K \]
where B is the number of live births in a given calendar year, P is the mid-year
population, and K is a constant, usually 1,000.
The concept of exposure to risk is also central to the definition of rate, which
distinguishes rate from proportion. Simply stated, all elements or subjects in the
denominator have to be at risk of becoming or producing the elements or subjects in the
numerator. If we take a second look at the crude birth rate formula, we will note that the
denominator is mid-year population and we know that not the entire population is subject
to the risk of giving birth. Therefore, the operational definition of CBR does not comply
with the concept of population at risk, which is why it is called a "crude" rate.
Six Sigma
The term six sigma represents a stringent level of quality. It is a specific defect
rate: 3.4 defective parts per million (ppm). It was made known in the industry by
Motorola, Inc., in the late 1980s when Motorola won the first Malcolm Baldrige National
Quality Award (MBNQA). Six sigma has become an industry standard as an ultimate
quality goal.
Sigma (σ) is the Greek symbol for standard deviation. As Figure 3.2 indicates, the areas
under the curve of normal distribution defined by standard deviations are constants in
terms of percentages, regardless of the distribution parameters. The area under the curve
as defined by plus and minus one standard deviation (sigma) from the mean is 68.26%.
The area defined by plus/minus two standard deviations is 95.44%, and so forth. The area
defined by plus/minus six sigma is 99.9999998%. The area outside the six sigma area is
thus 100% - 99.9999998% = 0.0000002%.
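The areas quoted above can be checked numerically. The sketch below uses the error function from Python's standard `math` module; the helper names `area_within` and `ppm_outside` are assumptions made for illustration. The computed values agree with the quoted table figures to within rounding of the last digit.

```python
import math


def area_within(k):
    """Fraction of a normal distribution within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))


def ppm_outside(k):
    """Parts per million falling outside +/- k standard deviations.

    erfc is used instead of 1 - erf to keep precision in the far tails.
    """
    return math.erfc(k / math.sqrt(2.0)) * 1e6


for k in (1, 2, 6):
    print(k, 100.0 * area_within(k), ppm_outside(k))
```

For k = 6 the area outside is about 0.002 ppm, that is, roughly 2 defective parts per billion.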
Figure 3.3. Specification Limits, Centered Six Sigma, and Shifted (1.5 Sigma) Six
Sigma
If we take the area within the six sigma limit as the percentage of defect-free parts
and the area outside the limit as the percentage of defective parts, we find that six sigma is
equal to 2 defectives per billion parts or 0.002 defective parts per million. The
interpretation of defect rate as it relates to the normal distribution will be clearer if we
include the specification limits in the discussion, as shown in the top panel of Figure 3.3.
Given the specification limits (which were derived from customers' requirements), our
purpose is to produce parts or products within the limits. Parts or products outside the
specification limits do not conform to requirements.
If we can reduce the variations in the production process so that the six sigma
(standard deviations) variation of the production process is within the specification limits,
then we will have six sigma quality level.
The six sigma value of 0.002 ppm is from the statistical normal distribution. It
assumes that each execution of the production process will produce the exact distribution
of parts or products centered with regard to the specification limits. In reality, however,
process shifts and drifts always result from variations in process execution. The maximum
process shift as indicated by research (Harry, 1989) is 1.5 sigma. If we account for this
1.5-sigma shift in the production process, we will get the value of 3.4 ppm. Such shifting
is illustrated in the two lower panels of Figure 3.3. Given fixed specification limits, the
distribution of the production process may shift to the left or to the right. When the shift is
1.5 sigma, the area outside the specification limit on one end is 3.4 ppm, and on the other
it is nearly zero.
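The effect of the 1.5-sigma shift can be reproduced with a short calculation. This is a sketch under the assumption of a normal process with specification limits at plus and minus six sigma from the target; the function names are illustrative only.

```python
import math


def upper_tail(z):
    """Area of the standard normal curve beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def defect_ppm(spec_limit_sigma=6.0, shift_sigma=0.0):
    """Defective parts per million for a process mean shifted by shift_sigma.

    After the shift, one specification limit is (limit - shift) sigma from
    the process mean and the other is (limit + shift) sigma away.
    """
    near_side = upper_tail(spec_limit_sigma - shift_sigma)
    far_side = upper_tail(spec_limit_sigma + shift_sigma)
    return (near_side + far_side) * 1e6


print(defect_ppm(6.0, 0.0))   # centered process
print(defect_ppm(6.0, 1.5))   # process shifted by 1.5 sigma
```

The centered case gives about 0.002 ppm and the shifted case about 3.4 ppm, almost all of it from the limit that is now only 4.5 sigma from the process mean.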
The six sigma definition accounting for the 1.5-sigma shift (3.4 ppm) proposed
and used by Motorola (Harry, 1989) has become the industry standard in terms of six
sigma-level quality (versus the normal distribution's six sigma of 0.002 ppm).
Furthermore, when the production distribution shifts 1.5 sigma, the intersection points of
the normal curve and the specification limits become 4.5 sigma at one end and 7.5 sigma
at the other. Since for all practical purposes, the area outside 7.5 sigma is zero, one may
say that the area outside 4.5 sigma is 3.4 ppm.
3.4 Reliability and Validity
Recall that concepts and definitions have to be operationally defined before
measurements can be taken. Assuming operational definitions are derived and
measurements are taken, the logical question to ask is, how good are the operational
metrics and the measurement data? Do they really accomplish their task, measuring the
concept that we want to measure and doing so with good quality? Of the many criteria of
measurement quality, reliability and validity are the two most important.
Reliability refers to the consistency of a number of measurements taken using the
same measurement method on the same subject. If repeated measurements are highly
consistent or even identical, then the measurement method or the operational definition
has a high degree of reliability. If the variations among repeated measurements are large,
then reliability is low. For example, if an operational definition of a body height
measurement of children (e.g., between ages 3 and 12) includes specifications of the time
of the day to take measurements, the specific scale to use, who takes the measurements
(e.g., trained pediatric nurses), whether the measurements should be taken barefooted, and
so on, it is likely that reliable data will be obtained. If the operational definition is very
vague in terms of these considerations, the data reliability may be low. Measurements
taken in the early morning may be greater than those taken in the late afternoon because
children's bodies tend to be more stretched after a good night's sleep and become
somewhat compacted after a tiring day. Other factors that can contribute to the variations
of the measurement data include different scales, trained or untrained personnel, with or
without shoes on, and so on.
Figure 3.4. An Analogy to Validity and Reliability
From Practice of Social Research (Non-Info Trac Version), 9th edition, by E. Babbie.
© 2001
Thomson Learning. Reprinted with permission of Brooks/Cole, an imprint of the
Wadsworth
Group, a division of Thomson Learning (fax: 800-730-2215).
Note
Note that there is some tension between validity and reliability. For the data to be
reliable, the measurement must be specifically defined. In such an endeavor, the risk of
being unable to represent the theoretical concept validly may be high. On the other hand,
for the definition to have good validity, it may be quite difficult to define the
measurements precisely. For example, the measurement of church attendance may be
quite reliable because it is specific and observable. However, it may not be valid to
represent the concept of religiousness. On the other hand, to derive valid measurements of
religiousness is quite difficult. In the real world of measurements and metrics, it is not
uncommon for a certain tradeoff or balance to be made between validity and
reliability.
Validity and reliability issues come to the fore when we try to use metrics and
measurements to represent abstract theoretical constructs. In traditional quality
engineering where measurements are frequently physical and usually do not involve
abstract concepts, the counterparts of validity and reliability are termed accuracy and
precision (Juran and Gryna, 1970). Much confusion surrounds these two terms despite
their having distinctly different meanings. If we want a much higher degree of precision
in measurement (e.g., accuracy up to three digits after the decimal point when measuring
height), then our chance of getting all measurements accurate may be reduced. In
contrast, if accuracy is required only at the level of integer inch (less precise), then it is a
lot easier to meet the accuracy requirement.
3.5 Measurement Errors
In this section we discuss validity and reliability in the context of measurement
error. There are two types of measurement error: systematic and random.
Systematic measurement error is associated with validity; random error is
associated with reliability. Let us revisit our example about the bathroom weight scale
with an offset of 10 lb. Each time a person uses the scale, he will get a measurement that
is 10 lb. more than his actual body weight, in addition to the slight variations among
measurements. Therefore, the expected value of the measurements from the scale does
not equal the true value because of the systematic deviation of 10 lb. In simple formula:
\[ \text{Measured weight} = \text{True weight} + 10\ \text{lb} + \text{random variation} \]
In a general case:
\[ M = T + s + e \]
where M is the observed/measured score, T is the true score, s is systematic error, and e is
random error.
The presence of s (systematic error) makes the measurement invalid. Now let us
assume the measurement is valid and the s term is not in the equation. We have the
following:
\[ M = T + e \]
The equation still states that any observed score is not equal to the true score
because of random disturbance, the random error e. These disturbances mean that on one
measurement, a person's score may be higher than his true score and on another occasion
the measurement may be lower than the true score. However, since the disturbances are
random, it means that the positive errors are just as likely to occur as the negative errors
and these errors are expected to cancel each other. In other words, the average of these
errors in the long run, or the expected value of e, is zero: E(e) = 0. Furthermore, from
statistical theory about random error, we can also assume the following:
The correlation between the true score and the error term is zero.
There is no serial correlation between the true score and the error term.
The correlation between errors on distinct measurements is zero.
From these assumptions, we find that the expected value of the observed scores is equal
to the true score:
\[ E(M) = E(T) + E(e) = E(T) + 0 = T \]
The smaller the variance of the error term, the more reliable the measurements. Because
the true score and the error term are uncorrelated, the variance of the observed scores is
the sum of the two variances, and reliability ($\rho_m$) is the ratio of the true-score
variance to the observed-score variance:
\[ \mathrm{var}(M) = \mathrm{var}(T) + \mathrm{var}(e) \]
\[ \rho_m = \frac{\mathrm{var}(T)}{\mathrm{var}(M)} = 1 - \frac{\mathrm{var}(e)}{\mathrm{var}(M)} \]
Therefore, the reliability of a metric varies between 0 and 1. In general, the larger the
error variance relative to the variance of the observed score, the poorer the reliability. If
all variance of the observed scores is a result of random errors, then the reliability is zero
[1 - (1/1) = 0].
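To make the relation between error variance and reliability concrete, the following sketch simulates observed scores as true score plus random error and computes reliability as one minus the ratio of error variance to observed-score variance. The distribution parameters (true-score standard deviation 3, error standard deviation 1), the sample size, and the seed are arbitrary assumptions chosen for illustration; in practice the error term is not directly observable.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable
N = 5000

# Assumed model M = T + e with no systematic error (s = 0).
true_scores = [random.gauss(100.0, 3.0) for _ in range(N)]   # var(T) about 9
errors = [random.gauss(0.0, 1.0) for _ in range(N)]          # var(e) about 1
observed = [t + e for t, e in zip(true_scores, errors)]


def reliability(observed_scores, error_terms):
    """Reliability = 1 - var(e) / var(M)."""
    return 1.0 - statistics.pvariance(error_terms) / statistics.pvariance(observed_scores)


print(reliability(observed, errors))   # close to 9 / (9 + 1) = 0.9
print(reliability(errors, errors))     # all variance is error: 1 - (1/1) = 0
```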
3.5.1 Assessing Reliability
Thus far we have discussed the concept and meaning of validity and reliability
and their interpretation in the context of measurement errors. Validity is associated with
systematic error and the only way to eliminate systematic error is through better
understanding of the concept we try to measure, and through deductive logic and
reasoning to derive better definitions. Reliability is associated with random error. To
reduce random error, we need good operational definitions, and based on them, good
execution of measurement operations and data collection. In this section, we discuss how
to assess the reliability of empirical measurements.
There are several ways to assess the reliability of empirical measurements including the
test/retest method, the alternative-form method, the split-halves method, and the internal
consistency method (Carmines and Zeller, 1979). Because our purpose is to illustrate how
to use our understanding of reliability to interpret software metrics rather than in-depth
statistical examination of the subject, we take the easiest method, the retest method.
The retest method is simply taking a second measurement of the subjects some time after
the first measurement is taken and then computing the correlation between the first and
the second measurements. For instance, to evaluate the reliability of a blood pressure
machine, we would measure the blood pressures of a group of people and, after everyone
has been measured, we would take another set of measurements. The second measurement
could be taken one day later at the same time of day, or we could simply take two
measurements at one time. Either way, each person will have two scores. For the sake
of simplicity, let us confine ourselves to just one measurement, either the systolic or the
diastolic score. We then calculate the correlation between the first and second score and
the correlation coefficient is the reliability of the blood pressure machine. A schematic
representation of the test/retest method for estimating reliability is shown in Figure 3.5.
Figure 3.5. Test/Retest Method for Estimating Reliability
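The equations for the two tests can be represented as follows:
\[ M_1 = T + e_1 \]
\[ M_2 = T + e_2 \]
From the assumptions about the error terms, as we briefly stated before, it can be shown
that
\[ \rho_m = \rho(M_1, M_2) = \frac{\mathrm{var}(T)}{\mathrm{var}(M)} \]
in which $\rho_m$ is the reliability measure.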
As an example in software metrics, let us assess the reliability of the reported number of
defects found at design inspection. Assume that the inspection is formal; that is, an
inspection meeting was held and the participants include the design owner, the inspection
moderator, and the inspectors. At the meeting, each defect is acknowledged by the whole
group and the record keeping is done by the moderator. The test/retest method may
involve two record keepers and, at the end of the inspection, each turns in his recorded
number of defects. If this method is applied to a series of inspections in a development
organization, we will have two reports for each inspection over a sample of inspections.
We then calculate the correlation between the two series of reported numbers and we can
estimate the reliability of the reported inspection defects.
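The test/retest idea can be illustrated with a small simulation: two measurements of the same subjects share the true score but have independent random errors, and their correlation estimates the reliability. The Pearson helper is written out by hand, and all numeric parameters (standard deviations, sample size, seed) are assumptions for illustration rather than values from any real inspection data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(7)  # fixed seed so the sketch is repeatable
N = 5000

true_scores = [random.gauss(50.0, 3.0) for _ in range(N)]         # T, var about 9
first = [t + random.gauss(0.0, 1.0) for t in true_scores]         # M1 = T + e1
second = [t + random.gauss(0.0, 1.0) for t in true_scores]        # M2 = T + e2

# The test/retest correlation estimates var(T) / var(M) = 9 / 10.
retest_reliability = pearson(first, second)
print(retest_reliability)
```

In the inspection example, `first` and `second` would be the defect counts turned in by the two record keepers for each inspection.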
3.5.2 Correction for Attenuation
One of the important uses of reliability assessment is to adjust or correct
correlations for unreliability that results from random errors in measurements. Correlation
is perhaps one of the most important methods in software engineering and other
disciplines for analyzing relationships between metrics. For us to substantiate or refute a
hypothesis, we have to gather data for both the independent and the dependent variables
and examine the correlation of the data. Let us revisit our hypothesis testing example at
the beginning of this chapter: The more effective the design reviews and the code
inspections as scored by the inspection team, the lower the defect rate encountered at the
later phase of formal machine testing.
As mentioned, we first need to operationally define the independent variable (inspection
effectiveness) and the dependent variable (defect rate during formal machine testing).
Then we gather data on a sample of components or projects and calculate the correlation
between the independent variable and dependent variable. However, because of random
errors in the data, the resultant correlation often is lower than the true correlation. With
knowledge about the estimate of the reliability of the variables of interest, we can adjust
the observed correlation to get a more accurate picture of the relationship under
consideration. In software development, we observed that a key reason for some
theoretically sound hypotheses not being supported by actual project data is that the
operational definitions of the metrics are poor and there is too much noise in the data.
Given the observed correlation and the reliability estimates of the two variables, the
formula for correction for attenuation (Carmines and Zeller, 1979) is as follows:
\[ \rho(x_t y_t) = \frac{\rho(x_i y_i)}{\sqrt{\rho_{xx'} \, \rho_{yy'}}} \]
where
$\rho(x_t y_t)$ is the correlation corrected for attenuation, in other words, the estimated
true correlation
$\rho(x_i y_i)$ is the observed correlation, calculated from the observed data
$\rho_{xx'}$ is the estimated reliability of the X variable
$\rho_{yy'}$ is the estimated reliability of the Y variable
For example, if the observed correlation between two variables was 0.2 and the reliability
estimates were 0.5 and 0.7, respectively, for X and Y, then the correlation corrected for
attenuation would be
\[ \rho(x_t y_t) = \frac{0.2}{\sqrt{0.5 \times 0.7}} = \frac{0.2}{0.59} \approx 0.34 \]
This means that the correlation between X and Y would be 0.34 if both were measured
perfectly without error.
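The correction for attenuation is a one-line calculation. The following sketch wraps it in a function whose name and argument checks are assumptions added for illustration, and reproduces the worked example above.

```python
import math


def correct_for_attenuation(r_observed, reliability_x, reliability_y):
    """Estimated true correlation given the observed one and two reliabilities."""
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliability estimates must be in the interval (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Worked example from the text: observed r = 0.2, reliabilities 0.5 and 0.7.
corrected = correct_for_attenuation(0.2, 0.5, 0.7)
print(round(corrected, 2))
```

With perfectly reliable measures (both reliabilities equal to 1) the corrected value equals the observed correlation.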
3.6 Be Careful with Correlation
Correlation is probably the most widely used statistical method to assess
relationships among observational data (versus experimental data). However, caution
must be exercised when using correlation; otherwise, the true relationship under
investigation may be disguised or misrepresented. There are several points about
correlation that one has to know before using it.
First, although there are special types of nonlinear correlation analysis available in
statistical literature, most of the time when one mentions correlation, it means linear
correlation. Indeed, the most well-known Pearson correlation coefficient assumes a linear
relationship. Therefore, if a correlation coefficient between two variables is weak, it
simply means there is no linear relationship between the two variables. It doesn't mean
there is no relationship of any kind.
Let us look at the five types of relationship shown in Figure 3.6. Panel A
represents a positive linear relationship and panel B a negative linear relationship. Panel C
shows a curvilinear convex relationship, and panel D a concave relationship. In panel E, a
cyclical relationship (such as the Fourier series representing frequency waves) is shown.
Because correlation assumes linear relationships, when the correlation coefficients
(Pearson) for the five relationships are calculated, the results accurately show that panels
A and B have significant correlation. However, the correlation coefficients for the other
three relationships will be very weak or will show no relationship at all. For this reason, it
is highly recommended that when we use correlation we always look at the scattergrams.
If the scattergram shows a particular type of nonlinear relationship, then we need to
pursue analyses or coefficients other than linear correlation.
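A small numerical illustration of this point: the Pearson coefficient is close to plus or minus one for linear data, but zero for a symmetric convex curve even though y is completely determined by x. The data below are synthetic values chosen for illustration and are not the data behind Figure 3.6.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = list(range(-5, 6))                 # -5, -4, ..., 5

positive_linear = [2 * v + 1 for v in x]
negative_linear = [-3 * v + 4 for v in x]
convex = [v * v for v in x]            # perfectly determined by x, but not linear

print(pearson(x, positive_linear))
print(pearson(x, negative_linear))
print(pearson(x, convex))
```

The third coefficient is zero, which is why a scattergram should be inspected before concluding that two variables are unrelated.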
Figure 3.6. Five Types of Relationship Between Two Variables
Second, if the data contain noise (due to unreliability in measurement) or if the range of
the data points is large, the correlation coefficient (Pearson) will probably show no
relationship. In such a situation, we recommend using the rank-order correlation method,
such as Spearman's rank-order correlation. The Pearson correlation (the correlation we
usually refer to) requires interval scale data, whereas rank-order correlation requires
only ordinal data. If there is too much noise in the interval data, the Pearson correlation
coefficient thus calculated will be greatly attenuated. As discussed in the last section,
if we know the reliability of the variables involved, we can adjust the resultant
correlation. However, if we have no knowledge about the reliability of the variables,
rank-order correlation will be more likely to detect the underlying relationship.
Specifically, if the noise in the data did not affect the original ordering of the data
points, then rank-order correlation will be more successful in representing the true
relationship. Since both Pearson's correlation and Spearman's rank-order correlation are
covered in basic statistics textbooks and are available in most statistical software
packages, we need not get into the calculation details here.
Third, the method of linear correlation (least-squares method) is very vulnerable to
extreme values. If there are a few extreme outliers in the sample, the correlation
coefficient may be seriously affected. For example, Figure 3.7 shows a moderately negative
relationship between X and Y. However, because there are three extreme outliers at the
northeast coordinates, the correlation coefficient will become positive. This outlier
susceptibility reinforces the point that when correlation is used, one should also look at
the scatter diagram of the data.
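The second and third points can be illustrated with a minimal sketch. All data values are made up for illustration and are not the data behind Figure 3.7. The helper names pearson, average_ranks, and spearman are our own. Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values sharing an average rank.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

x = list(range(1, 11))

# Case 1 (assumed data): ordering is fully preserved, but Y spans a wide range.
y_wide = [2 ** v for v in x]
r_wide = pearson(x, y_wide)
rho_wide = spearman(x, y_wide)
print(f"wide range: Pearson = {r_wide:.2f}, Spearman = {rho_wide:.2f}")

# Case 2 (assumed data): a negative relationship plus three northeast outliers.
y_neg = [9, 7, 10, 6, 8, 4, 7, 3, 5, 2]
r_clean = pearson(x, y_neg)
r_outliers = pearson(x + [30, 32, 31], y_neg + [30, 31, 33])
print(f"without outliers: Pearson = {r_clean:.2f}")
print(f"with outliers:    Pearson = {r_outliers:.2f}")
```

In the first case the rank-order coefficient recovers the perfect ordering, while the Pearson coefficient is noticeably lower. In the second case three added points are enough to flip the sign of the Pearson coefficient.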
3.7 Criteria for Causality
The isolation of cause and effect in controlled experiments is relatively easy. For
example, a headache medicine was administered to a sample of subjects who were having
headaches. A placebo was administered to another group with headaches (who were
statistically not different from the first group). If after a certain time of taking the
headache medicine and the placebo, the headaches of the first group were reduced or
disappeared, while headaches persisted among the second group, then the curing effect of
the headache medicine is clear.
For analysis with observational data, the task is much more difficult. Researchers (e.g.,
Babbie, 1986) have identified three criteria:
1. The first requirement in a causal relationship between two variables is that the cause
precede the effect in time or as shown clearly in logic.
2. The second requirement in a causal relationship is that the two variables be empirically
correlated with one another.
3. The third requirement for a causal relationship is that the observed empirical
correlation between two variables be not the result of a spurious relationship.
The first and second requirements simply state that in addition to empirical
correlation, the relationship has to be examined in terms of sequence of occurrence or
deductive logic. Correlation is a statistical tool and could be misused without the
guidance of a logic system. For instance, it is possible to correlate the outcome of a
Super Bowl (National Football League versus American Football League) to some interesting
artifacts such as fashion (length of skirt, popular color, and so forth) and weather.
However, logic tells us that coincidence or spurious association cannot substantiate
causation.
The third requirement is a difficult one. There are several types of spurious
relationships, as Figure 3.8 shows, and sometimes it may be a formidable task to show that
the observed correlation is not due to a spurious relationship. For this reason, it is much
more difficult to prove causality in observational data than in experimental data.
Nonetheless, examining for spurious relationships is necessary for scientific reasoning; as
a result, findings from the data will be of higher quality.
Figure 3.8. Spurious Relationships
In Figure 3.8, case A is the typical spurious relationship between X and Y in which X and
Y have a common cause Z. Case B is a case of the intervening variable, in which the real
cause of Y is an intervening variable Z instead of X. In the strict sense, X is not a
direct cause of Y. However, since X causes Z and Z in turn causes Y, one could claim
causality if the sequence is not too indirect. Case C is similar to case A. However,
instead of X and Y having a common cause as in case A, both X and Y are indicators
(operational definitions) of the same concept C. It is logical that there is a correlation
between them, but causality should not be inferred.
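Case A can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic, X and Y are each generated as the common cause Z plus independent noise, and the first-order partial correlation used to control for Z is a standard statistical technique that we bring in for illustration (it is not discussed in the text above). The helper names are our own.

```python
import math
import random

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of X and Y after removing the linear effect of Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# Assumed model: Z drives both X and Y; X has no effect on Y and vice versa.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = pearson(x, y)
r_xz = pearson(x, z)
r_yz = pearson(y, z)
r_xy_given_z = partial_correlation(r_xy, r_xz, r_yz)

print(f"correlation of X and Y:               {r_xy:.2f}")
print(f"correlation of X and Y, Z controlled: {r_xy_given_z:.2f}")
```

X and Y are strongly correlated even though neither influences the other. Once Z is held constant the correlation falls to roughly zero, which is the signature of a common cause.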
An example of the spurious causal relationship due to two indicators measuring the same
concept is Halstead's (1977) software science formula for program length:
N = n1 log2 n1 + n2 log2 n2
where
N = estimated program length
n1 = number of unique operators
n2 = number of unique operands
Researchers have reported high correlations between actual program length (actual lines
of code count) and the predicted length based on the formula, sometimes as high as 0.95
(Fitzsimmons and Love, 1978). However, as Card and Agresti (1987) show, both the formula
and actual program length are functions of n1 and n2, so correlation exists by definition.
In other words, both the formula and the actual lines of code counts are operational
measurements of the concept of program length. One has to conduct an actual n1 and n2
count for the formula to work. However, n1 and n2 counts are not available until the
program is complete or almost complete. Therefore, the relationship is not a
cause-and-effect relationship and the usefulness of the formula's predictability is
limited.
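As a minimal sketch, the length estimate can be computed as follows, assuming the commonly cited form of Halstead's length equation, N = n1 log2 n1 + n2 log2 n2. The function name and the sample counts are hypothetical.

```python
import math

def halstead_estimated_length(n1, n2):
    """Estimated program length from unique operator and operand counts.

    Assumes the commonly cited form N = n1*log2(n1) + n2*log2(n2).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("n1 and n2 must be positive counts")
    return n1 * math.log2(n1) + n2 * math.log2(n2)

# Hypothetical counts taken from an already finished program.
unique_operators = 16
unique_operands = 32
estimate = halstead_estimated_length(unique_operators, unique_operands)
print(f"estimated length: {estimate:.0f}")
```

The only inputs are n1 and n2, and both must be counted from code that already exists. This is why the estimate cannot serve as an early predictor of program length.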