

Int. J. Man-Machine Studies (1984) 20, 169-188

A methodology for interactive evaluation of user reactions to software packages: an empirical analysis of system performance, interaction, and run time

AVI RUSHINEK†, SARA F. RUSHINEK‡ AND JOEL STUTZ‡

† Department of Accounting and ‡ Department of Management Science and Computer Information Systems, University of Miami, Coral Gables, Florida 33124, U.S.A.

(Received 20 November 1982, and in revised form 28 March 1983)

This study deals with identifying primary factors which determine the usefulness of computer-assisted instruction (CAI). Usefulness is defined as user performance score. The study establishes linear models which explain the effects of CAI modifications designed to improve the utilization of computer facilities as well as users' performance.

Introduction

Interactive software maintenance and modification has become an integral activity of software producers and users, especially with respect to user-oriented (friendly) systems. Coupled with the substantial rise in software costs, a decline in hardware costs, and the rapid proliferation of computerized business/accounting systems, the evaluation of software maintenance, modification, and upgrading becomes increasingly beneficial and often critical to the livelihood of the entire organizational system.

A survey-questionnaire is the traditional method used to address the urgent needs of software evaluation, especially for larger software producers and computer user organizations such as IT&T, IBM, CDC, UNIVAC, DEC, HP, etc. The present research develops and examines the feasibility of switching from a traditional, manual survey-questionnaire to a condensed, but comparable and consistent, computerized interactive software evaluation.

The perceived usefulness of software is measured in this study by the end users. Swanson (1974), Gallagher (1974) and Zmud (1978) argue that the user perception of information has four basic dimensions:

1. the significance, usefulness, or helpfulness of information;
2. the accuracy, factualness, and timeliness of the information;
3. the quality of format or physical representation and reliability of the information; and
4. the meaningfulness or the reasonableness of the information.

These four dimensions may be reduced to two primary dimensions as applied to the usefulness of financial information (Chandra, 1974; Estes, 1968; Gallagher, 1974). The two primary dimensions emerging from the four basic dimensions are:

1. usefulness (including such characteristics as in the first two dimensions); and
2. clarity (including the information content of the second pair of basic dimensions).


0020-7373/84/020169 + 20 $03.00/0 © 1984 Academic Press Inc. (London) Limited



Some inadequacies can be identified in prior research in this area. One weakness results from attempting to measure usefulness with only one item. The use of a single-item instrument, as Nunnally (1978) suggests, does not provide sufficient domain sampling for a complex construct, and therefore tends to have a very low reliability. This study uses multiple items in order to provide sufficient domain sampling. Another weakness concerns the use of subjective data such as attitudes and the lack of testing for validity. In this study, objective performance data is collected interactively by the computer. Furthermore, this study uses the performance score as a criterion for usefulness.

Still another weakness of many studies can be attributed to the manual nature of the evaluation process. Participation in a manual evaluation is a tedious and time-consuming affair, which may create a negative bias in the participants. Therefore, this study proposes to computerize the evaluation process and shorten the questionnaire to its most essential factors. The shorter questionnaire will be administered interactively from a computer terminal in order to reduce negative bias in the subjects.

Construct validity of the major dimensions of perceived usefulness can be tested by factor analysis. As stated by Kerlinger (1964), "factor analysis may also be called the most important of construct validity tools". Runkel & McGrath (1972) assert that when several measures are proposed as means of measuring a construct, the measures should relate to other measures of the same construct.

Romaniuk & Montgomerie (1976) studied the use of CAI as a supplement to a BASIC introductory computer course taken by first-year computer systems students. Eighteen students took the course with CAI while another 18 students took the course without CAI. Romaniuk & Montgomerie compared the performance and the time required to complete the course. However, they did not evaluate the quality of the CAI software prior to the experiment. The authors of this study contend that the qualities of the CAI software may strongly affect its use.

Numerous other studies report on successful applications of CAI (Barr, Beard & Atkinson, 1976; Boyle & Wright, 1977; Brown, 1966; Bunderson, 1970; Buss & Kearsley, 1976; Caldwell, Nix & Peckham, 1976; Chandra, 1974; Chizmar, Hiebert & McCarney, 1977; Computer-Assisted Instruction in Programming: AID Project, 1968-1969; Computer-Assisted Instruction in Programming: SIMPER and LOGO Project, 1968-1969; Dzida, Herda & Itzefeldt, 1978; Dorn, 1976; Ellinger & Frankland, 1976; Romaniuk & Montgomerie, 1976). However, none of them has explored in depth the process of CAI system development and modifications. The primary advantage of constantly revising the CAI system is that it is well updated and especially tailored to the needs of its current users. Several problems may be created by outdated, inadequate systems. Such problems include the following.

1. The performance score may not be sufficiently discriminative among users.
2. The number of questions may not represent the material adequately.
3. The number of times questions are repeated may be excessive.
4. The length of time to respond to each question may not be appropriate. If it is too short, for example, users may be unable to respond in time. If it is too long, users may waste precious computer time while trying to search for an appropriate response.
5. The time period in which users are exposed to the CAI may be either too early or too late.



The present study suggests that these problems can be mitigated by evaluating the possible impact of modifications. Accordingly, the evaluation will be more complete, legible, unbiased, representative, prompt, timely, and accurate, and more pleasant for the user and evaluator, if it is done methodically and systematically.

In summarizing the projects and studies presented above, it can be noted that the computer assumes varying roles of instructional nature. As previously stated, two significant findings result from the literature review. First is the fact that very few of the projects and the studies reported any extensive testing or evaluations of the CAI modification impact on the instructional process. One of the expressed purposes of the present study is to improve the tools used for CAI evaluation; however, such an evaluation system may be more broadly applied not only to CAI but to other interactive software as well.

The second finding is the fact that CAI studies have relied almost totally on manual survey-questionnaires for their evaluations and data. Thus, another focus of this study is on an examination of the potential for computerizing the evaluation system for interactive software. Accordingly, all the data collected for the present study was totally computerized, avoiding problems of traditional manual CAI evaluation. The advantages of computerizing CAI evaluation include time savings by the users as well as improved precision and reliability. Following are some of the problems of manual software evaluation which were avoided in this study:

1. incomplete evaluation forms,
2. illegible evaluation forms,
3. biased sample of system users,
4. low response rate,
5. excessive time elapsed from program execution to the evaluation,
6. time-consuming transcription of data from evaluation forms on to data-entry forms,
7. time-consuming key-punch or key-entry operations,
8. high rate of errors in data transcription, and
9. user resentment due to excessive paper work.

Objectives and hypotheses

Frequently, end-users do not participate or have sufficient input in decisions concerning the acquisition, implementation, maintenance, and modification of interactive software. As a result, vast economic resources are wasted developing software which becomes underutilized due to user dissatisfaction with systems which fail to meet their true needs. One of the obstacles to resolving this problem is the lack of effective and efficient methods for interactive software evaluation. The current study presents a specific case of the more generic problem in order to enhance the effectiveness and efficiency of such evaluations. In this case, novice computer users of CAI are test subjects in an investigation into three dimensions of the problem.

In general, the primary objective of this study is to construct a formal model for predicting the impact of CAI modifications upon user performance on these CAI programs. Formally the model can be expressed as follows:

Y_j = a + Σ_i b_i(X_ij) + e_j,



where i and j are indices for each predictor variable and observation, respectively, Y_j is the user's performance score on the CAI, a is a constant term, the b_i are regression coefficients representing the impact of each modification factor, X_ij is the score representing each modification factor for each CAI user, and e_j is an error term of unexplained performance score deviations.

The central hypotheses of this study deal with whether such regression coefficients (b_i) can be established for the population of users. Formally, the hypotheses will be stated as follows:

Null hypothesis H0: b_i = 0,

Alternative hypothesis H1: b_i ≠ 0.
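As an illustrative sketch (not from the original paper; hypothetical data, numpy only), the model above can be fitted by ordinary least squares, yielding estimates of a and the b_i that the hypotheses refer to:

```python
import numpy as np

# Hypothetical data standing in for the study's setting: 146 users (index j)
# and 3 modification factors (index i), e.g. Questnum, Asktime, Runtime.
rng = np.random.default_rng(0)
n, k = 146, 3
X = rng.uniform(1, 30, size=(n, k))            # X_ij: factor scores
b_true = np.array([-0.7, -6.0, 0.0])           # assumed coefficients
y = 94.0 + X @ b_true + rng.normal(0, 5, n)    # Y_j = a + sum_i b_i X_ij + e_j

# Ordinary least squares: prepend a column of ones for the constant term a
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a_hat, b_hat = coef[0], coef[1:]
# b_hat estimates the b_i; a coefficient near zero (here the third one)
# is the situation described by the null hypothesis b_i = 0.
```

The fitted b_hat values recover the assumed coefficients up to sampling noise; the paper's significance tests ask whether such estimates could plausibly have arisen from a population where b_i = 0.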

Methodology and procedures

The subjects in this study were 146 novice computer users receiving instruction in the BASIC programming language via formal classroom lecture supplemented by CAI tutorials. The instructor had no part in the creation and implementation of the CAI programs or the study. Therefore, he had no vested interest in the outcome.

A time schedule for the completion of the CAI tutorials was recommended, and the users were able to repeat a single tutorial an unlimited number of times. The CAI tutorials were presented interactively on a Digital Equipment Corporation DEC-10 computer. Each user had the choice of using either a cathode ray tube (CRT) or a teletype (TTY) terminal.

The sequence of events provided that the instructor first covered the material in lecture; the material was then reviewed and supplemented by the tutorials. Before using the CAI tutorial programs, the users were given a demonstration of how to execute the tutorials and received handouts on the use of the CRT and TTY terminals.

The data concerning users' scores, the length of their interaction with the computer and users' performance was recorded by the computer during the execution of the CAI program. This performance data was stored on a disk file which was inaccessible to the users.

Evaluation of CAI modifications

ASSESSMENT OF CAI MODIFICATION MODELS

The primary objective of this paper is to identify variables which may help to improve an existing CAI system or design a new system, such that it would induce favorable attitudes.

A secondary objective is to consider several models of assessing the impact of CAI modifications. Models considered include (1) the simple bivariate model, (2) the multivariate model, and (3) the ordinal multivariate model. These models may prove useful in the prediction of users' score for computer assisted instruction.

The two main questions that will be discussed are concerned with (1) sample generalizability and (2) parameters' estimates. The first question is whether the model can generalize the results of this sample observation to the population of all novice users. The second question of parameters' estimates is whether one can estimate the most likely population parameters from the examination of sample observations. The main focus here is in delineating a particular value or values for the population. For instance, the knowledge of the time required for all EDP users to use the CAI may be useful for a feasibility study. Since the computer facilities are usually limited, one may wish that only a few classes would be allowed to use the CAI at any given time.

Instead of asking what the population parameters are, one may test the null hypothesis that there is no linear relationship between the dependent (criterion) variable score and the following independent (predictor) variables.

1. Questnum--number of programmed questions intended to elicit user responses and enter them onto the CRT.

2. Asktime--number of times required to repeat a given query to elicit a correct response from the user.

3. Runtime--the elapsed time in minutes from the beginning to the end of program execution.

4. Dmonth1--dummy variable with the following values: 1 for executions during the first month, 0 for executions during the other months.

5. Dmonth2--dummy variable with values of: 1 for executions during the second month, 0 for executions during any other month.

6. Dmonth3--dummy variable with values of: 1 for executions during the third month, 0 for executions during months 1 and 2.

7. Lenmon1--a product of Dmonth1 and Runtime.

8. Lenmon2--a product of Dmonth2 and Runtime.

9. Lenmon3--a product of Dmonth3 and Runtime.
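A minimal sketch (hypothetical helper, not part of the original study's code) of how the dummy and interaction variables above could be computed for a single session:

```python
def month_dummies(month, runtime):
    """Return (Dmonth1, Dmonth2, Dmonth3, Lenmon1, Lenmon2, Lenmon3)
    for one CAI session, following the variable definitions above.
    `month` is the calendar month of execution (1, 2, 3, ...);
    `runtime` is the session's elapsed time in minutes."""
    d1 = 1 if month == 1 else 0
    d2 = 1 if month == 2 else 0
    d3 = 1 if month == 3 else 0
    # Lenmon1..3 are the Dmonth x Runtime interaction terms
    return (d1, d2, d3, d1 * runtime, d2 * runtime, d3 * runtime)
```

For a second-month session of 12.5 minutes this yields Dmonth2 = 1 and Lenmon2 = 12.5, with all other terms 0.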

In this paper, one is interested in predicting users' scores, the dependent (criterion) variable, from the number of questions asked (Questnum), the number of times these questions were asked (Asktime) and the elapsed time of each session in minutes (Runtime). For the interval models, all the variables are measured on interval scales. Through multiple regression one obtains a prediction equation that indicates how the number of questions or elapsed time could be weighted and summed to obtain the best prediction of the score for the sample. This objective requires an emphasis upon the overall dependence of the score on the independent (predictor) variables, Questnum (number of questions), Asktime (number of times questions were asked) and Runtime (length of each interactive session in minutes). Based upon this, one may have to delete variables because of insufficient contribution to the predictability of the score. The F-ratio and the significance level of each variable are the deletion criteria. These criteria will be discussed later in this paper regarding evaluation of variables.

In this paper the authors will deal with the following subjects.

1. The overall goodness-of-fit of the model.
2. The test for specific regression coefficients of the models.
3. The tests for subsets of the multivariate model.

The overall F-test uses statistical inference procedures to test the null hypothesis that the multiple correlation is zero in the population from which the sample was drawn. Expressed in another way, the test indicates whether the (assumed random) sample of observations being analyzed has been drawn from a population in which the multiple correlation is equal to zero. If this is the case, the observed multiple correlation is due to sampling fluctuations or measurement error. The F-test statistic is employed to test whether the correlation is equal to zero. The sum of squares of the regression indicates the part explained by the regression, while the sum of squares of the residual indicates the unexplained sum of squares.

The estimates of changes in the score that may result from changing the number of questions or any other predictor variable are useful in planning the development of this CAI system. For instance, one may monitor the score and not cause the user excessive frustration that results from scores which are too low. This monitoring can be performed, for instance, by raising the time limits (Runtime), and thus increasing users' scores.

The assumptions of this study are the standard multiple regression analysis assumptions, and they will be discussed in detail. One assumes that the users were honest and worked independently. The study is limited to the population of novice computer users.

This study is designed to examine the three aforementioned models for score prediction. The first model is a bivariate model. The second model to be considered is a multivariate model and the third model is the ordinal multivariate model, which includes dummy variables. In this paper, the generic hypothesis is that the multiple correlation coefficients of the three models are not equal to zero. This generic hypothesis will be formally presented, and repeatedly tested with different models, later in this paper.

SIMPLE BIVARIATE MODEL

Regression analysis generally requires that variables be measured on interval or ratio scales and the relationship among the variables be linear or additive. These restrictions are not absolute. Nominal variables can be incorporated into the regression through the use of dummy variables. The regression analysis is a general statistical technique by which one can analyze the relationship between a dependent (criterion) variable (Score) and a set of independent (predictor) variables (Questnum and others). Regression analysis may be viewed either as a descriptive tool by which the linear dependence of one variable on others is summarized, or as an inferential tool by which the relationship in the population is evaluated from the examination of sample data. Although these two aspects are closely related, it may be convenient to treat each separately on a conceptual level.

Instead of focusing on the prediction of the dependent variable, the score, and its overall dependence on the independent variables, one may concentrate on an examination of the relationship between the score and a particular independent (predictor) variable. The most important relationship of score and a single predictor variable is that of Questnum, since the number of questions is a variable which must be present in any CAI program. Although the other variables may be present in CAI, they are not as common as Questnum. One may examine the influence of the number of questions (Questnum) on score. This is performed by the bivariate model, the simplest and the most practical model, which will be tested and discussed first in this section. However, a simple regression of score on Questnum alone may not be sufficient because the number of questions is confounded with the number of times the questions were asked. That is, the larger the number of questions, the larger the number of mistakes, and the greater the number of times questions will be asked. The other variables may themselves affect the score. Therefore, one may examine the relationship between question number and score while controlling for the other variables. For this task one will use multiple regression and one will get the partial correlations, to be discussed later in this paper.

Although the bivariate model excludes the variables Runtime and Asktime, these variables may play an important role in CAI. For example, one may control the elapsed time of a session. This can be performed by using the computer clock for terminating a session after a given time. On the other hand, Asktime depends upon how well the user prepared for the session, his intelligence, and other uncontrollable variables. This condition may limit the application of the prediction model. Neverthe- less, the maximum number of repetitions allowed per question (Asktime) can be pre-programmed into a CAI program. Accordingly, it can be partially controllable and therefore it is included in the multivariate models, in later sections.

TABLE 1
Significance test for the coefficients of the bivariate model

Variable    B        S.E. B   F           Sig.    Beta       Elasticity
Questnum    -0.693   0.130    28.359      0.000   -0.12168   -0.04873
Constant    94.183   0.897    11,030.421  0.000

Table 1 gives the simple bivariate regression coefficients for the variable Questnum as a single predictor. In a simple regression analysis the value of score is predicted by the linear model

estimated score = 94.183 - 0.693 x Questnum.

The difference between the predicted score and the actual score is called the residual. The simple bivariate regression method minimizes the squared residuals. The constant of 94.183 is the score intercept, that is, the point at which the regression line crosses the score axis, and represents the predicted value of score when Questnum is equal to zero. The smallest value that Questnum assumes is 2; therefore, zero is outside the relevant range and is meaningless. The coefficient B, -0.693, is the nonstandardized regression coefficient. This is the slope of the regression line and indicates the expected change in the score with a change in the number of questions. The negative sign preceding the coefficient indicates that an increase in the number of questions is most likely to decrease the score. The regression coefficient -0.6933 always has the same sign as the standardized regression coefficient Beta of -0.12168. These coefficients will be equal only if the variables, score and Questnum, are standardized, because the variances of the two will then equal one.

Beta in the bivariate model is equal to the simple Pearson correlation coefficient. When working with the standardized data the intercept is zero rather than 94.183.
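The relationship among B, Beta, and the Pearson correlation can be checked numerically. The following sketch uses hypothetical (Questnum, score) data, not the study's data:

```python
import numpy as np

# Hypothetical (Questnum, score) pairs mimicking the bivariate setting
rng = np.random.default_rng(1)
q = rng.integers(2, 30, size=200).astype(float)
score = 94.0 - 0.7 * q + rng.normal(0, 15, size=200)

B = np.cov(q, score)[0, 1] / np.var(q, ddof=1)   # unstandardized slope
r = np.corrcoef(q, score)[0, 1]                  # Pearson correlation
Beta = B * q.std(ddof=1) / score.std(ddof=1)     # standardized slope

# In the bivariate case Beta equals r, and B always shares Beta's sign.
```

Algebraically, Beta = B × (sd of Questnum / sd of score) = cov / (sd × sd) = r, which is why the bivariate Beta and the Pearson coefficient coincide.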



Using standardized data should be simple, because the intercept is always equal to zero and it is easily transformable into nonstandardized coefficients. Furthermore, this may be the most sensible way to compare the relative effects of independent (predictor) variables upon the dependent variable score, especially if different independent (predictor) variables are measured in different units. In this case, Questnum is measured in number of questions while Runtime is measured in terms of minutes.

S.E. B is the standard error of the estimates of the unstandardized coefficients. Since the sample is large (146 users), regression coefficients (B) from repeated sampling would have a normal distribution. Therefore, one may establish the confidence interval for the estimated B. The 95% confidence interval (C.I.) would be computed in the following way:

95% C.I. of the unstandardized coefficient = -0.6933 ± 1.96 x 0.13,

95% C.I. of the variable Questnum = B ± Z(0.05) x S.E. B.
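This computation can be verified directly from the rounded values reported above (B = -0.6933, S.E. B = 0.13), so the bounds differ slightly from Table 2, which was computed from unrounded figures:

```python
# 95% confidence interval: B ± Z(0.05) x S.E. B, using the rounded
# values B = -0.6933 and S.E. B = 0.13 reported in the text
B, se, z = -0.6933, 0.13, 1.96
lo, hi = B - z * se, B + z * se
# lo ≈ -0.948, hi ≈ -0.439; zero lies outside the interval, which is
# the basis given below for rejecting the null hypothesis at Alpha = 0.05.
```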

Table 2 gives the confidence intervals for the bivariate model with the number of questions as the predictor.

TABLE 2
Bivariate model: coefficients and confidence intervals

Variable    B         95% C.I.
Questnum    -0.6933   -0.9486 to -0.4379
Constant    94.1832   92.4244 to 95.9419

The confidence interval of Questnum may be one method of testing the null hypothesis stating that the population parameter of the unstandardized regression coefficient of Questnum is equal to zero. Observing Table 2, one may see that the value of zero is excluded and is not enclosed within the bounds of the 95% confidence interval. This is evidence for concluding that one rejects the null hypothesis with Alpha equal to 0.05.

Let us return to Table 1 and examine the significance test of the unstandardized coefficients (B). The significance of B can be tested either by examining the confidence interval in Table 2, or, more conveniently, by evaluating the F-ratio in the fourth column of Table 1. If the computed F is larger than the statistical table's critical value F (d.f. 1 = 1, d.f. 2 = 1886) for a significance level of, say, 0.05, one rejects the null hypothesis. The null hypothesis in this case states that the population parameters of the unstandardized regression coefficients are equal to zero. Otherwise, it would be concluded that the observed B is not significant at the 0.05 level. That is, if the probability is greater than 0.05, one cannot reject the hypothesis that the sample was drawn from a population whose parameters are equal to zero.

The correlation coefficient (R) and the coefficient of determination (R-square) describe the strength and the direction of the regression function. At this point one should examine the overall strength of this model. Table 3 gives the aforementioned criteria, R and R-square.


TABLE 3
Bivariate model: analysis of variance

Analysis of variance   d.f.   Sum of squares   Mean square   F          Sig.
Regression             1      6899.56954       6899.56954    28.35872   0.000
Residual               1887   459,099.99562    243.29624

Mean response = 89.89673   Std dev. = 15.71056   Multiple R-square = 0.01481
R = 0.12168   Adj R-square = 0.01428   Std dev. = 15.59796   Coeff. of variability = 17.37%
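As a consistency check, the F ratio and R-square follow directly from the sums of squares (values copied from Table 3):

```python
# Sums of squares and degrees of freedom reported in Table 3
ss_reg, df_reg = 6899.56954, 1
ss_res, df_res = 459099.99562, 1887

ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res                 # mean square residual, ~243.296
F = ms_reg / ms_res                      # ~28.359, as reported
r_square = ss_reg / (ss_reg + ss_res)    # ~0.0148, as reported
```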

The value of R-square is equal to 0.01481; this may indicate that approximately 99% of the variation of score is not explained by the variation of Questnum. If one wishes to predict the score based upon the number of questions, one should not use this equation. However, one should use it to evaluate the predicted scores for small changes in the number of questions. The goal may be to determine the desired number of questions, such that the tutorials will be challenging but not excessively difficult. In order to predict the score one may refer to Table 1. The predicted score for 10 questions is computed in the following way:

predicted score for (Questnum = 10) = 94.183 - 0.693 x 10 = 87.253.

However, this predicted score may be inaccurate, since it is computed by the bivariate model using a single predictor variable, Questnum, which has been shown to account for less than 2% of the variability in the criterion variable, Score. One may examine the accuracy of the prediction by the size of the coefficient of variability, which is equal to 17.37%. This indicates that the average error in forecasting scores from the number of questions is within 17% of the prediction. The F-value of 28.36 indicates that one can reject the null hypothesis that the overall regression equation coefficients are zero at an Alpha of 0.05. This F explains the significance of the overall function, rather than individual coefficients.
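The prediction is simply the fitted line of Table 1 evaluated at a given number of questions; a one-line helper makes this explicit:

```python
def predicted_score(questnum):
    # Bivariate prediction equation from Table 1:
    # score = 94.183 - 0.693 x Questnum
    return 94.183 - 0.693 * questnum
```

For example, predicted_score(10) reproduces the hand computation of 87.253 above.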

The bivariate model, which was developed in this section, is simple to understand and to use. However, it excludes some important variables such as the number of times users were asked questions (Asktime) and the length of a single interaction with the tutorials in minutes (Runtime). Therefore, the bivariate model will be extended into a multivariate model in the following section. Thus, it will be more comprehensive and include more variables. Nevertheless, a stepwise method will be used to evaluate the potential of bivariate models with the added variables, Asktime and Runtime.

EXTENSION TO MULTIVARIATE MODELS

The basic principles of regression analysis used in the bivariate case may be extended to multiple regression. The coefficients for the single step multivariate model are given in Table 4.

B describes the value of the unstandardized partial regression coefficients. B of - 0 . 0 0 2 indicates the expected difference in Score, the dependent (criterion) variable, between two groups of users that are different by one minute on Runt ime but equal


TABLE 4
Multivariate model: significance test for the coefficients

Variable     B         S.E. B    F            Sig.     Beta       Elasticity
Questnum     6.471     0.111     3392.138     0.000     1.13582    0.45488
Runtime     -0.002     0.008     0.054        0.816    -0.00258   -0.00011
Asktime     -6.068     0.077     6139.434     0.000    -1.52810   -0.50267
Constant    94.109     0.437     46,344.377   0.000

on Questnum and Asktime; that is, Questnum and Asktime are controlled variables. This interpretation is important for differentiating between faster and slower users. Further, it enables one to estimate appropriate time limits to be used in the CAI by preprogramming time constraints for users' interactions with the computer. Such time constraints should save computer time but should not excessively reduce users' performance scores.

Equally important is another interpretation, based upon the assumption of additivity of the independent variables. If one were to change both variables, Questnum and Asktime, by one unit each, the expected change in Score would be

[B(Questnum) + B(Asktime)] = 6.471 - 6.068 = 0.403.

Changes in Questnum and Asktime can be performed in a variety of ways. The most common way of changing Questnum is by adding more questions to each module. This may be accomplished by the instructor in accordance with users' feedback. Although an upper limit on Asktime can be programmed, Asktime may be harder to manipulate than Questnum, since it depends more on the learner than on the instructor. However, if the instructor gives additional help to the users, Asktime may be reduced.

The Beta column represents important information: it indicates the relative importance of the coefficients by using the same unit of measurement for all variables. The original units of measurement of the B of Runtime are minutes. However, the units of measurement for the Beta of Runtime are standard deviations from the mean rather than minutes.

The notion of controlling for, or holding constant, a variable can be demonstrated by the difference between the simple and the partial unstandardized coefficients of Questnum. In Table 1, one notices that the simple coefficient of Questnum is equal to -0.693. However, in Table 4 it is raised to +6.471. This may be explained by the fact that one has controlled for the rest of the variables, Runtime and Asktime. Furthermore, this observation may be carried forward to the last two columns, Beta and Elasticity. One notices that the Elasticity of Questnum in the bivariate model (Table 1) is almost a perfect sum of the elasticities of Questnum, Runtime and Asktime in the multivariate model. The elasticities are derived from the coefficients and provide a relative measure of importance. The last statement may be applicable to the Beta column as well as to the Elasticity column. However, the sum of the multivariate Beta coefficients is not as close to the value of Beta in the bivariate model (Table 1). The reason for this may be the differences in distribution of the three


independent (predictor) variables, which in turn cause differences in the units of measurement of Beta, namely the standard deviation of the variables.

By using the function given in Table 4, one may compute the predicted users' scores for any given combination of Questnum, Asktime and Runtime values. The overall accuracy is reflected by R-square, the proportion of variation explained by the variables included in the regression equation, as demonstrated in the table of R-square for the multivariate model, Table 5.
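The Table 4 function can be sketched as a small prediction routine. The input values in the example below (10 questions, a 20-minute session, 10 repetitions) are hypothetical, chosen only for illustration:

```python
# Predicted Score from the multivariate model of Table 4
# (unstandardized coefficients).  Note that the Runtime term
# contributes very little: its Sig. is 0.816 in the table.
B = {"const": 94.109, "Questnum": 6.471, "Runtime": -0.002, "Asktime": -6.068}

def predict_score(questnum, runtime, asktime):
    """Predicted Score for a given combination of predictor values."""
    return (B["const"] + B["Questnum"] * questnum
            + B["Runtime"] * runtime + B["Asktime"] * asktime)

# Hypothetical input values, for illustration only:
example = predict_score(questnum=10, runtime=20, asktime=10)
```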

Some of the independent (predictor) variables are measured in different units, such as minutes and number of questions. It is therefore difficult to determine the relative importance of the independent (predictor) variables from the unstandardized partial coefficients, the B-values in Table 4. Since the relative contribution is of interest, one may refer to the Beta column in order to evaluate the relative importance of each independent (predictor) variable. The Beta column indicates that the length of interaction in minutes (Runtime) is the least important, while Asktime is the most important. Other things being equal, the standardized partial Beta indicates that a change of one standard deviation in Asktime would introduce the greatest change in Score, and a one-unit change in Runtime the least change. The increase of the absolute value of the unstandardized partial coefficient (B) of Questnum, from -0.693 up to +6.471, is due to the relatively strong negative relationship between Asktime and Score, to be discussed later.
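The conversion between the two columns is Beta = B × (s_x / s_y), which re-expresses a coefficient in standard-deviation units. In the sketch below, s_y = 15.71 is the Score standard deviation from Table 5; the predictor standard deviation s_x is a hypothetical value back-solved from the reported Beta, used only to illustrate the conversion:

```python
# Converting an unstandardized B into a standardized Beta:
# Beta = B * (s_x / s_y).  s_y is the Score standard deviation
# reported in Table 5; s_x below is hypothetical (back-solved from
# the reported Beta of Questnum), for illustration only.
S_Y = 15.71

def standardize(b, s_x, s_y=S_Y):
    """Standardized partial coefficient from B and the two std devs."""
    return b * s_x / s_y

# With a hypothetical s_x of 2.758 for Questnum, B = 6.471 maps to a
# Beta near the 1.136 reported in Table 4.
beta_questnum = standardize(6.471, 2.758)
```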

TABLE 5
Multivariate model: analysis of variance

Analysis of variance    d.f.    Sum of squares    Mean square      F            Sig.
Regression              3       358,198.36137     119,399.45379    2087.80572   0.000
Residual                1885    107,801.20379     57.18897

Mean response = 89.80673          Std dev. = 15.71056
Multiple R = 0.87674              R-square = 0.76867
Std dev. of estimate = 7.56234    Adj. R-square = 0.76830
Coeff. of variability = 8.42%

By comparing Tables 3 and 5, one sees the overall impact of adding two more variables to the model. The overall accuracy of prediction was raised from 0.01481 up to 0.76867. This indicates that the multivariate model explains approximately 0.77 of the variability in Score. The prediction accuracy in absolute units is reflected by the standard deviation of the estimate for the regression model, which was reduced from 15.71 to 7.56. The number of independent (predictor) variables in the multivariate model is indicated by the degrees of freedom of the regression (d.f. = 3). Consequently, the degrees of freedom of the residuals were reduced by two, to become 1885. One may observe that the F-value is higher in the multivariate model (2087) than in the bivariate model (28). This is due to the change in the allocation of the sum of squares: in this case, the larger part of the sum of squares is allocated to the regression while the smaller part is allocated to the residual. The regression sum of squares of the multivariate model is much larger than that of the bivariate model, which reflects the improved overall goodness-of-fit.
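The quantities compared here can be recomputed directly from the Table 5 sums of squares, which makes the F-ratio and R-square relationship explicit:

```python
# Reproducing the multivariate F-ratio and R-square from the
# Table 5 ANOVA figures.
ss_regression = 358_198.36137
ss_residual = 107_801.20379
df_regression = 3      # number of predictors
df_residual = 1885     # n - k - 1 = 1889 - 3 - 1

mean_sq_reg = ss_regression / df_regression   # about 119,399.45
mean_sq_res = ss_residual / df_residual       # about 57.19
f_ratio = mean_sq_reg / mean_sq_res           # about 2087.8

# R-square is the regression share of the total sum of squares.
r_square = ss_regression / (ss_regression + ss_residual)
```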


Reference to the F-distribution in a statistical table indicates that the probability of obtaining an F-ratio greater than or equal to 12.582 is less than 0.001. Since the F-values exceed 12.582 in both models, the bivariate and the multivariate, one would conclude that it is very unlikely that the sample was drawn from a population with a zero multiple correlation. The following is a formal generic model for testing the aforementioned hypotheses.

1. Null hypothesis H(0): R = 0, or B(1) = B(2) = B(3) = 0.
2. Alternative hypothesis H(A): R is not 0, or not all the B coefficients are 0.
3. Decision criterion: Reject H(0) if F(computed) is greater than 12.582.
4. Decision: Reject H(0), since F = 2087 > 12.582 at the 0.05 level.

The overall test indicates that not all coefficients are zero. However, specific tests will be needed for each coefficient to determine its significance.

Observing Table 4, one notices that the F-value of Runtime is much lower than the others. This indicates that one may not reject the null hypothesis at an Alpha of 0.001. However, more attention will be devoted to testing the individual coefficients later in this section.

At this point, one should examine the individual coefficients. The strategy used to test the unstandardized model coefficients, B, involves the decomposition of the explained sum of squares into its components. These components are then attributed to the different independent variables. One will compare two methods of partitioning the sum of squares, referring to them as:

1. the hierarchical method, and
2. the standard method.

In the standard method, each variable is treated as if it had been added to the equation in a separate step, after all other variables have already been included. The explained sum of squares due to the addition of a given variable is then taken as a component of variation attributable to that variable. In the hierarchical method, variables are added to the equation in a predetermined order. For variables that are added in a single step, the increment in R-square is taken as the component of variation attributable to the added variables.

Table 4 shows the F-values of the unstandardized B coefficients. The F-ratio in the third column corresponds to the B-value in the second column. The degrees of freedom for each ratio are 1 and (1889 - 3 - 1). Comparing the F-ratios from the table to the critical F-value with 1 and 1885 degrees of freedom, one notices that all B-values except that of Runtime are significant at the 0.1 level. Testing B by the standard method reflects only the direct linkages between the independent variables and Score. The hierarchical method involves adjustment for only those variables that precede a given variable. Since Questnum is first in the hierarchy, it is tested without adjustment for Asktime and Runtime. Therefore, the sum of squares attributable to Questnum will include not only that which is due to its direct influence on Score, but also that portion which was created through the Asktime-to-Questnum path.
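The hierarchical attribution described here can be sketched with the cumulative R-square values reported in Table 7: each variable is credited with the increase in R-square observed when it enters the equation, in the predetermined order.

```python
# Hierarchical (multi-step) partitioning sketch.  The cumulative
# R-square values are those reported in Table 7 for the entry order
# Questnum, Asktime, Runtime.
cumulative_r_sq = {
    "Questnum": 0.015,   # Questnum alone
    "Asktime": 0.769,    # Questnum + Asktime
    "Runtime": 0.769,    # all three predictors
}

def r_sq_changes(cumulative):
    """R-square increment credited to each variable as it enters."""
    changes, previous = {}, 0.0
    for variable, r_sq in cumulative.items():
        changes[variable] = round(r_sq - previous, 3)
        previous = r_sq
    return changes

# Questnum is credited 0.015, Asktime 0.754, Runtime 0.000:
# Asktime absorbs nearly all the explained variation once it enters.
changes = r_sq_changes(cumulative_r_sq)
```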

One may now examine the summary table for each method, starting with the standard method.

Comparing these latter F-ratios to the critical F-values, one concludes that one may reject H(0) at the 0.05 level of significance; however, the comparison


TABLE 6
Multivariate model and the standard (single-step) method

Step  Variable    F           Multi-R   R-Sq.    Change   R         Overall F   Sig.
1     Questnum    3392.138    0.122     0.015    0.015    -0.122    2087.806    0.000
      Runtime     0.054       0.123     0.015    0.000    -0.020
      Asktime     6139.434    0.877     0.769    0.753    -0.593

TABLE 7
Multivariate model: the hierarchical (multi-step) method

Step  Variable    F           Multi-R   R-Sq.    Change   R         Overall F   Sig.
1     Questnum    28.359      0.122     0.015    0.015    -0.122    28.359      0.000
2     Asktime     6145.800    0.877     0.769    0.754    -0.593    3133.253    0.000
3     Runtime     0.054       0.877     0.769    0.000    -0.020    2087.806    0.000

between Tables 6 and 7 may clarify the difference between the standard and the hierarchical methods. Table 7 is the summary table for the hierarchical method.

Comparing Tables 6 and 7, one notices from the first column of the standard method (Table 6) that only one step was taken to compute the coefficients. Using the hierarchical method (Table 7), the first step involved computation of the coefficient of Questnum, and two more steps were then necessary to compute the other F-values. The only other difference between the two methods is the F-value for Questnum. One notices that the F-value of Questnum in the hierarchical method is reduced to 28. This large reduction in the F-value does not alter the decision as to the rejection of the null hypothesis, and therefore the dependency between Questnum and Asktime is not detrimental to the validity of the multivariate model.

The reduction of the F-value of Questnum may be explained by the causality and the high correlation between Questnum and Asktime. Since there is an intrinsic order to the relationship between Questnum and Asktime, the hierarchical method is the more appropriate. The correlation between these variables is the result of causal relations. For example, a user who is asked more questions is likely to give more erroneous answers, so that questions will be repeated more times; conversely, he will have more opportunities to answer correctly.

Finally, an alternative way to test the individual coefficients is to use their confidence intervals. Table 8 gives the values of the coefficients and their 95% confidence intervals.

Table 8 shows the 0.95 confidence interval for the unstandardized regression coefficients (B column). One notices that the bounds for the variable Runtime, -0.018 and 0.0142, enclose the value of zero.
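The intervals in Table 8 follow from B ± t × S.E.(B); the following sketch assumes the two-tailed 0.05 critical t-value of about 1.961 for 1885 residual degrees of freedom:

```python
# 95% confidence interval for an unstandardized coefficient,
# B +/- t * S.E.(B).  With 1885 residual degrees of freedom the
# two-tailed 0.05 critical t-value is about 1.961.
T_CRITICAL = 1.961

def confidence_interval(b, se):
    """Lower and upper 95% bounds for a regression coefficient."""
    margin = T_CRITICAL * se
    return b - margin, b + margin

# Runtime: B = -0.0019, S.E. = 0.008 (Table 4).  The interval
# straddles zero, so the coefficient is not significant.
low, high = confidence_interval(-0.0019, 0.008)
straddles_zero = low < 0 < high
```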

The multivariate model introduces the problem of multicollinearity, created by the high intercorrelation between Questnum and Asktime. At this point, one should examine the correlation coefficients of the variables.

Table 9 demonstrates that the correlation between Asktime and Questnum is equal to 0.82292. This situation is somewhat complicated. The more strongly correlated


TABLE 8
Multivariate model: model coefficients and confidence intervals

Variable     B          95% C.I.
Questnum     6.4712     6.2533     6.6891
Runtime     -0.0019    -0.0180     0.0142
Asktime     -6.0677    -6.2196    -5.9158
Constant    94.1087    93.2513    94.9660

TABLE 9
Multivariate model: variables' correlation coefficients

             Score       Questnum    Asktime
Questnum    -0.12168
Asktime     -0.59343     0.82292
Runtime     -0.01986    -0.00317    0.00895

the independent variable is, the greater the need to control for the confounding effects. At the same time, the reliability of the partial regression coefficients as measures of the relative importance of the independent (predictor) variables is reduced. The main purpose of controlling for the confounding variables, in the first place, was to increase the reliability of evaluating the relative importance of the independent (predictor) variables. One notices that Asktime and Questnum are positively correlated. This supports the notion that as the number of questions (Questnum) increases, the questions become more difficult and thus are repeated more frequently (Asktime).
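A rough collinearity check can be made from the Table 9 correlation alone: for a pair of predictors, the variance inflation factor is 1/(1 - r²). This two-variable formula is only an approximation here, since the model has a third, nearly uncorrelated predictor, Runtime.

```python
# Collinearity check from the Table 9 correlation between Questnum
# and Asktime.  For a two-predictor model the variance inflation
# factor (VIF) is 1 / (1 - r^2); with Runtime almost uncorrelated
# with both, this is a close approximation for the full model.
r = 0.82292
vif = 1.0 / (1.0 - r ** 2)   # about 3.1

# A common rule of thumb flags VIF values above 5 (or 10) as severe;
# 3.1 signals appreciable but not crippling redundancy.
```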

TABLE 10
Multivariate model: variance/covariance matrix

             Questnum    Asktime     Runtime
Questnum     0.01235
Asktime     -0.00708     0.00600
Runtime     -0.01986    -0.00317    0.00007

Table 10 demonstrates that the coefficients of Asktime and Questnum are negatively correlated (-0.00708). Moreover, it appears from the covariance (Cov[B1, B2] = sum of [B1 - Beta1] x [B2 - Beta2]) that the magnitude of the correlation between the unstandardized coefficients increases. In other words, the absolute values of the correlations are positively correlated, while the relative values are negatively correlated. This indicates a high collinearity between Questnum and Asktime, which reduces the reliability of the multivariate model; the model should therefore be used with caution. Alternatively, the problem of having two variables with redundant information will be


resolved by either extracting a single variable out of the two, or simply removing the least important variable from the model, as shown in Table 11 and discussed later.

Finally, one may examine subsets of the multivariate model. Using only one of the highly intercorrelated variables will resolve the problem of multicollinearity. However, it may also reduce R-square, which will weaken the predictive power. Table 11 summarizes the subsets of the multivariate model.

TABLE 11
Bivariate subsets of the multivariate model

Variable     B         S.E. B    F            Sig.     Beta       Elasticity
Questnum    -0.694     0.130     28.385       0.000    -0.12174   -0.04876
Runtime     -0.015     0.017     0.785        0.376    -0.02024   -0.00090
Constant    94.266     0.902     10,929.149   0.000

Asktime     -2.356     0.074     1025.020     0.000    -0.59330   -0.19517
Runtime     -0.011     0.014     0.616        0.433    -0.01455   -0.00065
Constant   107.392     0.624     29,635.631   0.000

Table 11 demonstrates that in both models Runtime has a low F-value. Therefore, one may improve the model by excluding Runtime as a predictor. One may also conclude that the length of interaction time does not have a significant effect on the score. The only variable that is completely under our control is Questnum; therefore, it may be used as the best and most practical single predictor. By doing this, one may employ the aforementioned bivariate model, using Questnum as the only predictor.

EXTENSION TO ORDINAL MULTIVARIATE MODELS

In this section, the impact of users' experience in interacting with the computer will be discussed. The elapsed time is measured in terms of the number of months during which the users used the computer. Let us first examine whether the additional variables, Dmonth3 and Lenmon3, raise the predictive power of the model. The following is a table of R-square.

In Table 12, R-square is equal to 0.77. Compared with the R-square of 0.76867 in Table 5, one notices a very slight improvement. One may also compare the standard deviation and the coefficient of variability and conclude that the extension

TABLE 12
Ordinal multivariate model: analysis of variance

Analysis of variance    d.f.    Sum of squares    Mean square     F            Sig.
Regression              4       358,818.54971     89,704.63743    1576.80477   0.000
Residual                1884    107,181.01546     56.89014

Multiple R = 0.87750    R-square = 0.77000    Adj. R-square = 0.76951
Std dev. = 7.54255      Coeff. of variability = 8.40%


of the model to ordinal-level data may slightly enhance the predictive power of the function and reduce the scatter. In the next step, one will examine the table of coefficients and their levels of significance.

TABLE 13
Ordinal multivariate model: coefficients' significance test

Variable     B         S.E. B    F            Sig.     Beta       Elasticity
Asktime     -6.020     0.079     5871.042     0.000    -1.51612   -0.49873
Questnum     6.420     0.112     3291.358     0.000     1.12687    0.45130
Dmonth3      2.456     0.847     8.400        0.004     0.06821    0.00699
Lenmon3     -0.541     0.164     10.954       0.001    -0.07829   -0.00706
Constant    94.072     0.441     45,434.604   0.000

The two additional variables in Table 13 are Dmonth3 and Lenmon3. Dmonth3 is a dummy variable which assumes the value of 1 for the month of December and the value of zero for any other month. December is the last month in which users were allowed to interact with the computer. These variables carry ordinal rather than interval information, since their values denote merely order rather than quantities. Theoretically, one may expect that this latest interaction with the computer will have a positive contribution to performance. Indeed, it increased the users' scores: the coefficient of Dmonth3 turned out to be 2.456. It may be interpreted as a learning factor resulting from the experience users gained by using the computer for approximately three months.

Lenmon3 is the product of Dmonth3 and Runtime. It indicates how the score is affected jointly by Dmonth3 and Runtime. This may be important, since one wants to set time limits upon each interaction in order to save computer time; such limits may also encourage the user to be more thoroughly prepared and therefore obtain a higher score. The negative sign preceding the Lenmon3 coefficient indicates that excessive time allowed for interaction may have an adverse effect upon the score.
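The construction of the two ordinal predictors can be sketched as follows; the field names are illustrative, not taken from the original data set:

```python
# Constructing the ordinal-model predictors described in the text:
# Dmonth3 is a December dummy (1 in December, 0 otherwise) and
# Lenmon3 is its interaction with Runtime.  Field names are
# hypothetical, for illustration only.
def add_ordinal_predictors(record):
    """record: dict with 'month' (1-12) and 'runtime' (minutes)."""
    dmonth3 = 1 if record["month"] == 12 else 0
    lenmon3 = dmonth3 * record["runtime"]
    return {**record, "dmonth3": dmonth3, "lenmon3": lenmon3}

december = add_ordinal_predictors({"month": 12, "runtime": 25.0})
october = add_ordinal_predictors({"month": 10, "runtime": 25.0})
```

Because Lenmon3 is zero outside December, its coefficient measures how extra interaction time in the final month shifts the score, which is exactly the time-limit trade-off discussed above.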

The third and fourth columns, headed "F" and "Sig.", enable one to test the null hypothesis that each of the coefficients is equal to zero. According to Table 13, one may reject this null hypothesis at a level of significance of 0.01. The Beta and Elasticity columns indicate the relative importance of the different coefficients. One may conclude that Asktime is the most important, while Dmonth3 is the least important.

Finally, if one attempts to predict the score for different variable values, one may be interested in the confidence interval of each coefficient. At this point, one should examine the confidence intervals of the coefficients. Table 14 shows the 0.95 confidence intervals.

Table 14 may prove helpful in planning the development of the CAI software. In order to increase the reliability of the estimates, one may prefer to use interval rather than point estimates of the coefficients.

The ordinal multivariate model may be useful but should be used with caution, since its underlying assumptions may not be fully met. However, the model is designed to


TABLE 14
Ordinal multivariate model: variables' coefficients and confidence intervals

Variable     B          95% C.I.
Asktime     -6.0202    -6.1743    -5.8661
Questnum     6.4202     6.2007     6.6397
Dmonth3      2.4559     0.7940     4.1177
Lenmon3     -0.5413    -0.8620    -0.2205
Constant    94.0724    93.2069    94.9380

give a rough estimate rather than a precise prediction. Nevertheless, the ordinal multivariate model appears to be the best single alternative.

Summary, conclusion and implications

User evaluation of software packages provides important feedback for software designers and implementers. The current study concludes that this evaluation can be performed most effectively by collecting the data interactively at the computer terminal, rather than manually. This can be done by providing an interactive program that is executed immediately following the use of the software to be evaluated. Thus, user responses are immediate, the results of the evaluation can be assessed promptly, and corrective action can follow.

For acquisition decisions, the evaluation software may be appended to newly acquired software. This will be particularly useful for a satisfaction guarantee. Furthermore, interactive software evaluation would be even more useful for evaluating maintenance and modification. The evaluation of a modified or "upgraded" version will be compared with that of the previous version, in order to measure the differences in users' performance due to the modifications.

This paper also tests three predictive models for assessing the impact of CAI modifications, as well as exploring specific variables in an effort to identify those which contribute most to the effectiveness of CAI. The models were constructed with null hypotheses stating that there is no linear relationship between the dependent (criterion) variable score and the independent (predictor) variables.

In summary, this paper rejects the null hypothesis that the multiple correlations equal zero for the three main models: the bivariate, the multivariate and the ordinal multivariate. Further, the hypotheses stating that the estimated predictor coefficients are equal to zero were rejected for the variables Questnum, Asktime and Dmonth3. However, this null hypothesis was not rejected for the variable Runtime, leading to the conclusion that unless Runtime is combined with Dmonth3 (as Lenmon3), it is not a critical factor in evaluating CAI modifications.

The conclusion of this study is that a strong, significant interaction exists between users' performance and their experience in using the computer. This is apparent from the increase in performance score due to the additional experience users had gained by the third month of using the system. Moreover, the negative relationship between the elapsed time of interaction and experience reveals that an excessively long time leads


to deteriorated performance, since users come unprepared. Therefore, users take a long time to respond and their responses are often incorrect.

References

BARR, A., BEARD, M. & ATKINSON, R. C. (1976). The computer as a tutorial laboratory: the Stanford BIP project. International Journal of Man-Machine Studies, 8 (5), 567-596.

BOYLE, T. & WRIGHT, G. (1977). Computer-assisted evaluation of student achievement. Engineering Education, 68, 1-5.

BROWN, B. R. (1966). An instrument for the measure of expressed attitude toward computer-assisted instruction. In MITZEL, H. E. & BRANDON, G. L., Eds, Experimentation with Computer-Assisted Instruction in Technical Education (Semi-Annual Progress Report, Project No. OEC-5-85-074). University Park, Pennsylvania: The Pennsylvania State University.

BUNDERSON, C. V. (1970). The computer and instructional design. In HOLTZMAN, W. E., Ed., Computer-Assisted Instruction, Testing and Guidance. New York: Harper and Row.

BUSS, A. & KEARSLEY, G. (1976). Individual and small-group learning with computer-assisted instruction. AV Communication Review, 24 (1), 79-86.

CALDWELL, E., NIX, D. & PECKHAM, P. (1976). The use of CAI to provide problems for students in introductory genetics. Journal of Computer-Based Instruction, 3 (1), 13-20.

CHANDRA, G. (1974). A study of the consensus on disclosure among public accountants and security analysts. The Accounting Review, 49 (4), 733-742.

CHIZMAR, J., HIEBERT, L. D. & MCCARNEY, B. J. (1977). Assessing the impact of an instructional innovation on achievement differentials: the case of computer-assisted instruction. Journal of Economic Education, 9, 42-46.

COMPUTER-ASSISTED INSTRUCTION IN PROGRAMMING: AID PROJECT (1968-1969). Progress Reports, Stanford Program in Computer-Assisted Instruction.

COMPUTER-ASSISTED INSTRUCTION IN PROGRAMMING: SIMPER AND LOGO PROJECT (1968-1969). Progress Reports, Stanford Program in Computer-Assisted Instruction.

DORN, C. (1976). Computer assistance in veterinary medical education. Journal of Veterinary Medical Education, 3, 7-21.

DZIDA, W., HERDA, S. & ITZFELDT, W. D. (1978). User-perceived quality of interactive systems. IEEE Transactions on Software Engineering, SE-4 (4), 270-276.

ELLINGER, R. & FRANKLAND, P. (1968). Computer-assisted and lecture instruction: a comparative experiment. Journal of Geography, 75, 109-120.

ESTES, R. W. (1968). An assessment of the usefulness of current cost and price-level information by financial statement users. Journal of Accounting Research, 6 (2), 200-207.

GALLAGHER, C. A. (1974). Perceptions of the value of a management information system. Academy of Management Journal, 17 (1), 46-55.

KERLINGER, F. N. (1964). Foundations of Behavioural Research. New York: Holt, Rinehart and Winston.

NUNNALLY, J. C. (1978). Psychometric Theory. New York: McGraw-Hill.

PANKOFF, L. D. & VIRGIL, R. L. (1970). On the usefulness of financial statement information: a suggested research approach. The Accounting Review, 45 (2), 269-279.

ROMANIUK, E. W. & MONTGOMERIE, T. C. (1976). After implementing your CAI course: what's next. ERIC Ed. 151022, Alberta University, Edmonton.

ROTTER, J. (1966). Generalized expectancies for internal versus external control of reinforcement. Psychological Monographs, 80 (1), 1-28.

RUNKEL, P. J. & MCGRATH, J. E. (1972). Research on Human Behaviour, pp. 162-167. New York: Holt, Rinehart and Winston.

RUSHINEK, A., RUSHINEK, S. F., CHANG, L. S. & MOST, K. S. (1981). The dimensions of usefulness for corporate published annual reports. South East American Accounting Association Proceedings, pp. 182-188.

SWANSON, E. B. (1974). Management information systems: appreciation and involvement. Management Science, 21 (2), 178-188.

ZMUD, R. W. (1978). An empirical investigation of the dimensionality of the concept of information. Decision Sciences, 9 (2), 187-195.

Appendix A: Tutorial evaluation (manual)

Tutorial No.:

Please state your opinion on the tutorial you have just completed, by circling a number from one to seven between each pair of descriptors listed below.

                    Very          Neutral          Very

EXAMPLE: GOOD         1  2  3  4  5  6  7   BAD
1.  CLEAR             1  2  3  4  5  6  7   UNCLEAR
2.  BORING            1  2  3  4  5  6  7   INTERESTING
3.  EASY              1  2  3  4  5  6  7   HARD
4.  USELESS           1  2  3  4  5  6  7   USEFUL
5.  ENJOYABLE         1  2  3  4  5  6  7   UNENJOYABLE
6.  TOO-SHORT         1  2  3  4  5  6  7   TOO-LONG

COMMENTS AND SUGGESTIONS

Please indicate below any suggestions you might have for improving this tutorial, or any comments you would like to make about it.

Appendix B: Tutorial evaluation (computerized interactive)†

DEAR USER:

PLEASE STATE YOUR OPINION ON THE COMPUTER PROGRAM YOU HAVE JUST

COMPLETED BY ENTERING A NUMBER FROM 1 TO 7.

THE NUMBERS INDICATE THE FOLLOWING:

1. VERY STRONGLY AGREE

2 ...... STRONGLY AGREE

3 ............... AGREE

4. NEUTRAL

† User responses are underlined; the rest is printed on the terminal by the system.


5 ............... DISAGREE

6 ...... STRONGLY DISAGREE

7. VERY STRONGLY DISAGREE

(1) THIS PROGRAM WAS INTERESTING, USEFUL AND ENJOYABLE?... 3

(2) THIS PROGRAM WAS CLEAR AND EASY TO USE? ............... 2

(3) PLEASE ENTER ANY SUGGESTIONS YOU MAY HAVE ABOUT THIS

PROGRAM?...(DO NOT EXCEED TEN LINES OF COMMENTS)

The response time is too long.