
Test Disclosure and Retest Performance on the SAT

Lawrence J. Stricker
Educational Testing Service

The aim of this study was to evaluate the effects of disclosing a Scholastic Aptitude Test (SAT) form on the retest performance of examinees who initially took the disclosed form and subsequently took a different form. Retest performance was compared for three random samples of examinees who took the SAT as high school juniors in the May 1981 administration in New York and then retook it in the October 1981 administration: two experimental groups that were sent the standard set of disclosed material for the May SAT, along with either a noncommittal or an encouraging letter intended to vary their motivation to use the material, and a control group that was not sent anything. The three groups were generally similar in the level and retest reliability of their October scores, indicating that access to the disclosed material had no appreciable effects on retest performance.

Public disclosure of the content of admissions tests, originally mandated by legislation in New York and now a nationwide policy of many admissions testing programs (Brown, 1980; "Test-Takers," 1981), has potentially important consequences for the performance of examinees. Although there has been a great deal of speculation about this subject (see the reviews by Brown, 1980, and Strenio, 1979), data are scarce. It is well established, though, that very few examinees seek the disclosed material for most admissions tests, with the striking exception of the Law School Admission Test (see the review by Linn, 1982).

The only information on the effects of disclosure on test performance comes from a study of the specific recall of disclosed material (Hale, Angelis, & Thibodeau, in press). This experiment, in a classroom setting, found that the examinees achieved substantially elevated scores on special forms of the Test of English as a Foreign Language (Educational Testing Service, 1981b), which consisted of questions already disclosed to the students. These effects occurred regardless of whether the questions were discussed in class, but the extent of the effects depended on the size of the pool of disclosed items: the examinees who had been given many more disclosed questions than subsequently appeared on the special forms of the test obtained lower scores.

Nothing is known thus far about the broader impact of disclosure in the more realistic situation in which examinees take a form of a test that is subsequently disclosed, receive its questions and their answers, and then repeat the test with an entirely different form. This issue is of considerable practical importance in view of the substantial proportion of examinees who repeat admissions tests (e.g., Donlon & Angoff, 1971) and of the weight that admissions officers attach to retest scores (e.g., Educational Testing Service, 1981a).

In principle, access to the disclosed material in such circumstances, in common with retaking a test, receiving test coaching, and using test orientation materials such as guidebooks and practice tests (e.g., Educational Testing Service, 1979, 1980), has the potential for increasing examinees' familiarity with a test's instructions and content, reducing their anxiety about it, and providing an opportunity for them to drill on specific types of questions (Anastasi, 1981; Messick, 1980). Accordingly, insofar as disclosure has any impact over and above these other influences, subsequent retest scores may be affected in two distinct ways. First, the scores may be elevated. All the possible effects of disclosure that were just mentioned should contribute to score improvement. It is noteworthy that retaking a test and test coaching both produce some score gains (see the reviews by Anastasi, 1981, and Messick, 1980).

Second, the retest scores may not measure the same thing as the initial scores. Greater familiarity with the nature of a test and reduced anxiety should lead to a more veridical assessment of ability, whereas intensive drilling should produce a distorted appraisal (Messick, 1980, 1981). Hence, validity may increase or decrease, depending upon the relative importance of these two kinds of influences. Retest reliability may be lowered in any event, for both influences would reduce the correspondence between initial and retest scores.

However, the sparse data that are available on these points, based on initial scores on the Scholastic Aptitude Test (SAT; Donlon & Angoff, 1971) and ordinary retest scores on a different form of this test, suggest that the effects may be relatively small, at least for test familiarization and anxiety reduction. Although these influences should make the two scores diverge, the scores had similar validity in predicting college grades (Olsen & Schrader, 1959), and the retest reliability of the scores is extremely high, approximately .9 (Donlon & Angoff, 1971).

A related matter is that these effects on retest scores, rather than being uniform, may vary systematically with the examinees' characteristics. These include variables (such as unfamiliarity and anxiety) that may lead to poor test performance and be alterable by exposure to the disclosed material, as well as other variables (such as motivation and ability) that may determine access to the material and effective use of it. Data are lacking on this issue.

The primary aim of this study was to evaluate the effects of disclosing an SAT form on the retest performance of examinees who initially took the disclosed form and subsequently took a different form. More specifically, the goal was to determine whether receiving the disclosed test affected the level of retest scores and their retest reliability. A secondary purpose was to explore whether the effects depended on the examinees' characteristics: those that may affect performance and be alterable by exposure to the disclosed material, those that may determine access to and use of it, and demographic variables.

Method

Procedure

Three random samples, each consisting of 2,500 examinees, were drawn from those taking the SAT (Form 1Z) in the May 2, 1981, administration in New York. The samples were limited to examinees with the following characteristics, as determined from the registration form and other records about the administration: (1) junior in high school, (2) resident of New York, (3) registered on time for a Saturday administration, and (4) SAT Verbal (V) and Mathematical (M) scores both available for the administration.
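
In modern terms, this sampling step is a simple random draw without replacement from the eligible pool, split evenly into the three conditions. A minimal sketch follows; the file name and column names are hypothetical, invented here for illustration, and the study's actual selection predates this software:

    import pandas as pd

    # Hypothetical registry of May 1981 examinees; all column names are invented.
    registry = pd.read_csv("may_1981_examinees.csv")

    # Apply the four eligibility criteria described above.
    eligible = registry[
        (registry["grade"] == 11)            # junior in high school
        & (registry["state"] == "NY")        # resident of New York
        & registry["on_time_saturday"]       # registered on time, Saturday
        & registry["sat_v"].notna()
        & registry["sat_m"].notna()          # both May scores available
    ]

    # Draw 7,500 examinees without replacement, then split into three
    # nonoverlapping random samples of 2,500 each.
    drawn = eligible.sample(n=7500, replace=False, random_state=1981)
    control = drawn.iloc[:2500]
    not_encouraged = drawn.iloc[2500:5000]
    encouraged = drawn.iloc[5000:7500]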

Two of the samples, the Not Encouraged and Encouraged experimental groups, were sent the standard set of disclosed material (the operational items on the test, a copy of the examinee's answer sheet, scoring instructions, and key) that is routinely provided to those who request it. The mailing took place at approximately the same time (June 26 to 30) that the disclosed material was sent to the first of the May examinees who asked for it.

The material for the two experimental groups was accompanied by a letter from the College Board, intended to vary motivation to use the disclosed material. The letters for the two groups differed. The letter for the Not Encouraged group consisted of a single paragraph:


Although you may not have requested them, I am sending the questions and answers, as well as a copy of your own answer sheet, for those parts on the May SAT that counted toward your scores on the test. The College Board is sending these materials, on an experimental basis, to a cross-section of all students who took the test.

The letter to the Encouraged group contained the same paragraph plus an additional one:

In the event that you plan to take the SAT again, you may find these materials useful in preparing for the test. They should help you to become more familiar with the instructions and the kinds of questions used, and may make it possible for you to take the test with greater confidence.

Nothing was mailed to the third sample, the Control group.

Subsequently, the examinees in each of the three groups who retook the SAT (Form 1Y) in the October 10 to 11, 1981, administration were identified, after excluding three examinees in the Not Encouraged group and four in the Encouraged group to whom the disclosed material could not be delivered. The number of examinees with either SAT-V or SAT-M scores available for the administration was 1,248 for the Control group, 1,229 for the Not Encouraged group, and 1,272 for the Encouraged group. Of these examinees, 87 in the Control group, 59 in the Not Encouraged group, and 62 in the Encouraged group had requested the disclosed material for the May administration.

Measures

SAT scores and background variables were used in the statistical analysis. They were obtained from records for the May and October administrations and from the Student Descriptive Questionnaire completed when the examinee applied to take the SAT at the May administration. The SAT scores were (1) May SAT-V, (2) May SAT-M, (3) October SAT-V, and (4) October SAT-M. The background variables were (1) sex, (2) ethnicity, (3) father's education, (4) mother's education, (5) parents' income, (6) financial need, (7) high school type, (8) high school program, (9) high school rank, (10) high school grade-point average, and (11) educational aspiration.

Statistical Analysis

All statistical analyses were limited to the three samples of examinees who retook the SAT in the October administration and had V or M scores available for the administration. Because of missing data for SAT scores and background variables, the sample sizes fluctuated for the analyses; each analysis was based on all the available data. SAT-V and SAT-M scores were analyzed separately throughout, and parallel analyses were carried out for the May and October SAT scores.

It is important to recognize that although the original samples were comparable by virtue of being drawn randomly, self-selection could have produced differences in the fractions of these samples that were retested (the three samples used in this analysis). Hence, sample differences in the October SAT results may be attributable to differences in the composition of the samples rather than to differences in their retest performance. The influence of this self-selection can be determined by comparisons of the May and October results. Because any sample differences in the May results are presumably due to variations in sample composition produced by self-selection, it can likewise be assumed that similar differences in the October results have the same cause.

Score level. Sample differences in May and October SAT means were assessed by one-way analyses of variance. Differences in October SAT means were also appraised by one-way analyses of covariance, controlling for the pertinent May SAT scores (e.g., May SAT-V was the covariate in the analysis of October SAT-V). Interactions between samples and background variables were evaluated by corresponding two-way (sample by background variable) analyses of variance and analyses of covariance, with a separate analysis for each background variable (dichotomized, where necessary). These two-way analyses were carried out by multiple regression methods, each main effect being adjusted for the other main effect, and the interaction being adjusted for all main effects. Interactions between samples and May SAT scores (dichotomized) were also evaluated by corresponding two-way (sample by May SAT score) analyses of variance and analyses of covariance. Interactions with May SAT scores were excluded in the analyses where the same May SAT score was also the dependent variable or covariate (e.g., May SAT-V was excluded in the analysis of variance of May SAT-V and in the analysis of covariance of October SAT-V).
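
In modern terms, each of these analyses is an ordinary least-squares model, and adjusting each main effect for the other main effect, with the interaction adjusted for both, corresponds to Type II sums of squares. A sketch of the October SAT-V analyses using statsmodels, assuming a hypothetical data frame with invented column names (the study's actual computations predate this software):

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical data: one row per retested examinee. 'group' is
    # Control / Not Encouraged / Encouraged; 'sex' stands in for one
    # dichotomized background variable; scores are on the 200-800 scale.
    df = pd.read_csv("retest_data.csv")

    # One-way ANOVA: do October SAT-V means differ across the three samples?
    anova = smf.ols("oct_v ~ C(group)", data=df).fit()
    print(sm.stats.anova_lm(anova, typ=2))

    # One-way ANCOVA: the same comparison, controlling for the pertinent
    # May score (May SAT-V as the covariate for October SAT-V).
    ancova = smf.ols("oct_v ~ C(group) + may_v", data=df).fit()
    print(sm.stats.anova_lm(ancova, typ=2))

    # Two-way (sample by background variable) analysis by regression:
    # Type II adjusts each main effect for the other, and the interaction
    # is adjusted for both main effects.
    twoway = smf.ols("oct_v ~ C(group) * C(sex)", data=df).fit()
    print(sm.stats.anova_lm(twoway, typ=2))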

Retest reliability. Sample differences in the product-moment correlations between corresponding SAT scores for May and October were appraised by a χ² test (Snedecor & Cochran, 1967). Interactions between samples and background variables were evaluated sequentially by the same χ² test:

1. An overall test was made of the correlations in the six subsamples formed by dividing each sample on the basis of a background variable (dichotomized the same way as in the analyses of variance and the analyses of covariance). For instance, in the case of sex, the six subsamples were Male Control, Male Not Encouraged, Male Encouraged, Female Control, Female Not Encouraged, and Female Encouraged.

2. If this test was significant, follow-up tests were made of the correlations in the three subsamples at the same level of the background variable. For sex, one level was Male, and its subsamples were Male Control, Male Not Encouraged, and Male Encouraged; the other level was Female, and its subsamples were Female Control, Female Not Encouraged, and Female Encouraged.

This process was carried out separately for each background variable. Interactions between samples and May SAT scores were evaluated in the same way. Interactions with May SAT scores were excluded in these analyses when the correlations were based on the same May SAT score (e.g., May SAT-V was excluded in the analysis of the correlations of May SAT-V with October SAT-V).
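
The χ² test referenced here (Snedecor & Cochran, 1967) compares k independent correlations via Fisher's z transformation: each r becomes z = arctanh(r) with variance approximately 1/(n - 3), and the weighted sum of squared deviations of the z's from their pooled mean is distributed as χ² with k - 1 degrees of freedom under the hypothesis of equal correlations. A minimal sketch; the correlations in the usage line are illustrative, not the study's values:

    import numpy as np
    from scipy.stats import chi2

    def correlations_homogeneity(rs, ns):
        """Chi-square test that k independent correlations are equal
        (Fisher z method; Snedecor & Cochran, 1967)."""
        zs = np.arctanh(np.asarray(rs, dtype=float))  # Fisher z transform
        ws = np.asarray(ns, dtype=float) - 3.0        # weight = 1 / var(z)
        z_pooled = np.sum(ws * zs) / np.sum(ws)
        stat = np.sum(ws * (zs - z_pooled) ** 2)
        df = len(rs) - 1
        return stat, df, chi2.sf(stat, df)

    # Hypothetical May-October correlations for the three samples.
    stat, df, p = correlations_homogeneity(rs=[0.89, 0.91, 0.90],
                                           ns=[1248, 1229, 1272])
    print(f"chi2({df}) = {stat:.2f}, p = {p:.3f}")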

Results and Discussion

Score Level

Analyses of variance of initial scores. The means and standard deviations for the May SAT scores in the three samples and the F ratios for the one-way analyses of variance appear in Table 1. The two F ratios were not significant (p > .05); the F ratios for the interactions with the background variables and May SAT scores in the two-way analyses of variance were also not significant. These results indicate that self-selection in the examinees who returned for retesting did not affect the comparability of the samples with regard to their initial performance. This point is reinforced by the analyses of interactions, which established that the similarity of the samples extended to a variety of subsamples.

Analyses of variance of retest scores. The means and standard deviations for the October SAT scores in the three samples, along with the F ratios for the one-way analyses of variance, are also shown in Table 1. Neither F ratio was significant (p > .05). Similarly, the F ratios for the interactions with the background variables and May SAT scores in the two-way analyses of variance were not significant.

These consistently negative results strikingly demonstrate that the samples did not differ in their retest scores, even when various subgroups were examined. The present findings, taken together with the uniformly negative results in the analyses of initial scores, imply that access to the disclosed material and the motivation provided by the encouraging letter did not affect the level of retest performance, either for the total samples or the subsamples.

Analyses of covariance of retest scores. The covariance-adjusted means and standard deviations for the October SAT scores in the three samples, as well as the F ratios for the one-way analyses of covariance, are also reported in Table 1. These F ratios were not significant (p > .05). In addition, the F ratios for the interactions with the background variables and May SAT scores in the two-way analyses of covariance were not significant.

Table 1
Means and Standard Deviations for Initial, Retest, and Covariance-Adjusted Retest Scores
[Table values were not preserved in this transcript.]
Note. None of the F ratios are significant (p > .05).

These findings are congruent with the preceding results for the October SAT scores in demonstrating that the samples and subsamples did not differ in their retest scores and in suggesting that the disclosed material and the motivating letter uniformly failed to have an impact on the level of retest performance. The close resemblance between the two sets of results is not surprising, even though the present analyses take into account initial differences in the samples and the other analyses do not, for the samples were observed to be similar in the analyses of May SAT scores.

1. Tables containing the means and standard deviations for the May, October, and covariance-adjusted October SAT scores in the subsamples defined by the background variables and May SAT scores, together with summaries of the corresponding analyses of variance and analyses of covariance, are available from the author.

Retest Reliability

The correlations between the May and October SAT scores in the three samples appear in Table 2, together with the χ²'s. The χ² was significant (p < .05) for SAT-M but not for SAT-V.

In the analyses of the SAT-V correlations in the subsamples defined by the background variables and May SAT scores, none of the χ²'s for the correlations in the subsamples at the same level of these variables was significant. In the parallel analyses of the SAT-M correlations, the χ²'s were significant for one level of Sex (Male), Ethnicity (Nonwhite), and High School Rank (Top Fifth of Class). The statistics for these subsamples are also reported in Table 2.

These results indicate that the sample differences in retest reliability were very minor, being limited to extremely small divergences for SAT-M. The subsample findings also suggest that these differences were not uniform throughout the samples but stemmed from a few isolated subgroups of examinees. Whether this outcome is traceable to variations in sample composition produced by self-selection or to variations in retest performance cannot be determined. In any event, it appears that the disclosed material and the letters had no more than a negligible impact on retest reliability for the samples as a whole as well as for the various subgroups.

2. Tables containing the correlations between corresponding May and October SAT scores in the subsamples defined by the background variables and May SAT scores, together with the corresponding χ²'s, are available from the author.

Table 2
Correlations between Initial and Retest Scores for Samples and Selected Subsamples
[Table values were not preserved in this transcript.]
Note. All the correlations are significant (p < .01). *p < .005.

Conclusions

The main conclusion of this study is that access to the disclosed test material had no appreciable effects on the subsequent retest performance of examinees in general and various subgroups of them, regardless of whether the performance was defined in terms of the level or retest reliability of the new scores. It also appears, though the evidence on this point is less direct, that use of the material had no discernible effects either. These outcomes are especially noteworthy in view of the large samples involved and the accordingly high level of statistical power that they provide to detect even very small effects.
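
To make the power claim concrete: with roughly 1,250 retested examinees per sample, a one-way ANOVA is all but certain to detect even a small effect. A quick check of this, with the effect size set at Cohen's small benchmark f = .10 (an illustrative choice, not a figure from the paper):

    from statsmodels.stats.power import FTestAnovaPower

    # Power of a three-group one-way ANOVA with ~1,250 examinees per group
    # to detect a small effect (Cohen's f = 0.10) at alpha = .05.
    power = FTestAnovaPower().power(effect_size=0.10,
                                    nobs=3 * 1250,   # total sample size
                                    alpha=0.05,
                                    k_groups=3)
    print(f"power = {power:.3f}")  # effectively 1.0 for an effect this size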

Although this investigation was concerned with both access to the disclosed material and its use, the data on the latter issue were indirect because of the nature of the experimental design. Access was guaranteed in both experimental groups; use was not ensured in either of them, though an effort was made to increase use in the Encouraged group. Hence, the failure to find effects for this group implies that use of the material had no impact, but does not demonstrate it, in the absence of information about actual use by the examinees.

The general failure to find any effects in this study casts some doubt on the line of reasoning that suggested retest performance might be altered. This reasoning rests on two propositions: (1) greater familiarity with the nature of a test, reduced test anxiety, and intensive drilling on the test's items enhance performance; and (2) these factors are influenced by using the disclosed material. It is entirely possible that the second proposition is incorrect, at least in the present context, because of the nature of the examinees and the test involved. First, the examinees may have been maximally familiar with the test and minimally anxious about it by the time that they received the disclosed material. All had taken the SAT in the May administration and routinely received Taking the SAT (Educational Testing Service, 1979), an orientation booklet that contained a sample form of the SAT, when they registered for that administration. Hence, the familiarization and anxiety reduction provided by using the disclosed material may have already been accomplished. This speculation is consistent with the finding that practice on a test and other kinds of exposure to it have the greatest effects on score level for naive examinees and that the gains diminish with repeated practice and exposure (see the reviews by Bond, 1981, and Messick & Jungeblut, 1981). Second, the SAT may not be influenced by drilling because this test includes few, if any, of the kinds of items on which performance can be improved by such practice. Systematic efforts are made to eliminate such items from the SAT (Donlon & Angoff, 1971).

The negative results necessarily raise questions about the efficacy of the experimental operations, particularly the encouraging letter and the variables used to form the subgroups. The effectiveness of the letter in increasing use of the disclosed material is suggested by the impact that the model for this letter had in a study of test familiarization involving the Graduate Record Examinations Aptitude Test (Conrad, Trismen, & Miller, 1977): the original markedly affected both the amount of time that the examinees used the familiarization material and their test scores (Powers & Swinton, in press). The measures used to define the subgroups in the present investigation, though adequate for exploratory purposes, are not ideal. The May SAT scores are reasonable indexes of the ability to take advantage of the disclosed material. The background variables are adequate measures of key demographic characteristics, but no more than substitutes for direct assessments of anxiety, motivation, familiarity with tests, and so forth.

References

Anastasi, A. (1981). Diverse effects of training on tests of academic intelligence. In B. F. Green (Ed.), Issues in testing: Coaching, disclosure, and ethnic bias (pp. 5-19). San Francisco: Jossey-Bass.

Bond, L. (1981). Bias in mental tests. In B. F. Green (Ed.), Issues in testing: Coaching, disclosure, and ethnic bias (pp. 55-77). San Francisco: Jossey-Bass.

Brown, R. (1980). Searching for the truth about "truth in testing" legislation. Denver CO: Education Commission of the States.

Conrad, L., Trismen, D., & Miller, R. (Eds.). (1977). Graduate Record Examinations technical manual. Princeton NJ: Educational Testing Service.

Donlon, T. F., & Angoff, W. H. (1971). The Scholastic Aptitude Test. In W. H. Angoff (Ed.), The College Board Admissions Testing Program: A technical report on research and development activities relating to the Scholastic Aptitude Test and Achievement Tests (pp. 15-47). New York: College Entrance Examination Board.

Educational Testing Service. (1979). Taking the SAT. New York: College Entrance Examination Board.

Educational Testing Service. (1980). 4 SATs. New York: College Entrance Examination Board.

Educational Testing Service. (1981a). ATP guide for high schools and colleges 1981-82. New York: College Entrance Examination Board.

Educational Testing Service. (1981b). TOEFL test and score manual, 1981 edition. Princeton NJ: Educational Testing Service.

Hale, G. A., Angelis, P. J., & Thibodeau, L. A. (in press). Effects of test disclosure on performance in the Test of English as a Foreign Language. Language Learning.

Linn, R. L. (1982). Admissions testing on trial. American Psychologist, 37, 279-291.

Messick, S. (1980). The effectiveness of coaching for the SAT: Review and reanalysis of research from the fifties to the FTC. Princeton NJ: Educational Testing Service.

Messick, S. (1981). The controversy over coaching: Issues of effectiveness and equity. In B. F. Green (Ed.), Issues in testing: Coaching, disclosure, and ethnic bias (pp. 21-53). San Francisco: Jossey-Bass.

Messick, S., & Jungeblut, A. (1981). Time and method in coaching for the SAT. Psychological Bulletin, 89, 191-216.

Olsen, M., & Schrader, W. B. (1959). The use of preliminary and final Scholastic Aptitude Test scores in predicting college grades (ETS SR 59-19). Princeton NJ: Educational Testing Service.

Powers, D. E., & Swinton, S. S. (in press). Effects of self-study for coachable test items. Journal of Educational Psychology.

Snedecor, G. W., & Cochran, W. G. (1967). Statistical methods (6th ed.). Ames IA: Iowa State University Press.

Strenio, A., Jr. (1979). The debate over open versus secure testing: A critical review (National Consortium on Testing Staff Circular No. 6). Cambridge MA: Huron Institute.

Test-takers may ask for and get answers to SAT's next year, College Board decides. (1981, April 6). Chronicle of Higher Education, pp. 1, 10.

Acknowledgments

This study was supported by the College Entrance Examination Board. Thanks are due Donald L. Alderman, John A. Centra, Philip K. Oltman, Donald E. Powers, Donald A. Rock, and Warren W. Willingham for advising about the research design and statistical analysis; Donald Schiariti for supervising the mailing of the disclosed material; Peter E. Smith for arranging the retrieval of the data; Patricia W. Cox for statistical calculating; Norma A. Norris for computer programming; and Gordon A. Hale, Donald E. Powers, and Gretchen W. Rigol for critically reviewing a draft of this article.

Author's Address

Send requests for reprints or further information to Lawrence J. Stricker, Educational Testing Service, Princeton NJ 08541, U.S.A.
