A Comparative Study of Construct Validation of Graduation English Proficiency Tests between Universities in Mainland China and Taiwan

Byron Gong
Department of English Language and Literature
Soochow University, Taipei, Taiwan, R.O.C.

Keywords: construct validation, college graduation English test, washback effect, high-stakes test

Abstract

This paper reports findings from an analytical study of how construct validation is reflected in the graduation English tests that have been widely conducted at universities in Mainland China and Taiwan. The findings relate to key perceptions of construct validation for test designers and stakeholders to consider should the mandatory testing systems in use be endorsed by the educational authorities in the greater China area. The study analyzes documentary and empirical data to gauge how significantly construct validation is represented in these tests. The evidence shows that the construct validation behind college graduation English proficiency tests such as the CET-4 in Mainland China and the GEPT (High-Intermediate) in Taiwan is moderately well represented in the test content. Both the CET and the GEPT rest on a fairly sound rationale for selection as qualified English proficiency tests for universities at the grass-roots level. The CET in the Mainland appears to have exerted a more powerful washback influence on national college English programmes, while the autonomous and decentralized language assessment approach within universities in Taiwan gives them autonomy in defining the test construct of ability according to their own local needs. It is suggested that universities in Taiwan consider adopting a unified testing system as a benchmark graduation English proficiency test, although such a system need not be the only tool for assessing college students' English competence, lest college ELT programmes become test-driven.



INTRODUCTION

The purpose of this paper is to present a comparative study of how significantly test constructs are reflected in the graduation English tests used at universities in Mainland China and Taiwan respectively. The extent to which a large-scale test measures the intended test content effectively and satisfactorily can be considered the primary concern of language test designers. In this sense, construct validity becomes one of the most important considerations for test designers, and it is especially critical when a test becomes high-stakes, because the outcomes of a high-stakes test can be closely linked with candidates' access to upward socio-economic mobility.

This paper presents studies of large-scale benchmark tests for college graduation across the Taiwan Strait. College students in Mainland China are under enormous pressure from a mandatory national testing system, the College English Test (CET-4), while their counterparts in Taiwan face the General English Proficiency Test (GEPT, at its High-Intermediate level) or tests of an equivalent level designed by individual universities. Many universities in Taiwan demand that undergraduate students pass the GEPT or a similar test before graduation. To enhance the quality of their English education, universities in both the Mainland and Taiwan are doing their best to carry out an unprecedented educational movement by implementing a graduation English threshold test for all university students (Cheng 2008; Yang 2009; Tsai and Tsou 2009). Against this background, construct validity becomes one of the most important considerations for test designers, and it is significant to assess the construct validation of such tests and its implications for all test stakeholders. Until recently, however, there has been no research on test validation comparing the CET and GEPT in general, and little is known about their features concerning construct validity in particular. This research therefore attempts to fill that gap.

On the other hand, given the large scale and number of test candidates in Mainland China and Taiwan, it is impossible to discuss in depth how construct validity is reflected in every aspect of national tests such as the CET-4 and the GEPT (High-Intermediate) within a paper of this scope. This paper serves only to introduce the issues and concerns of test construct validation, and therefore focuses on some key aspects of how construct validation is reflected in the English language testing systems at Chinese universities. As this paper is not a case study, test construct validation is discussed only at the macro level, viewed from the angle of national interest and of local colleges and universities. (Note that in this paper, college and university are used interchangeably; both refer to institutions of higher education within the Chinese context.)


There are different kinds of English proficiency tests designed individually by local universities, but passing the CET or the GEPT can be considered a prerequisite for (non-English-major) college students to graduate from a university in both the Mainland and Taiwan (Cheng 2008; Yang and Weir 1998; MOE 2005). The grass-roots policies at most universities in the Mainland and Taiwan stipulate clearly that college students must take the CET or GEPT in order to graduate. Therefore, for the purposes of this research, the researcher focuses on the CET-4 in Mainland China and its counterpart, the GEPT High-Intermediate, in Taiwan, rather than on other levels of the GEPT that are irrelevant to this paper. Another grass-roots English proficiency test, the SCUEPT, designed locally by Soochow University in Taipei and similar to the GEPT, is also studied in order to provide an example of what universities are doing at the grass-roots level in this respect.

Although the number of CET candidates in the Mainland is much larger than that of the GEPT (or of other threshold tests such as the SCUEPT) in Taiwan, the two kinds of English proficiency tests share astonishing similarities in many important respects, such as test nature, purpose, format, method, and components. As it is common practice to compare regional or local English proficiency tests designed and applied by individual universities with national or international ones such as TOEFL or IELTS, it is significant to conduct a comparative study between the CET-4 and the GEPT (High-Intermediate) across the Taiwan Strait, although only a narrow aspect of the construct validation of such tests can be probed within the limited space of this paper. In addition, a mutual understanding of such similar tests is valuable even though the test constructs and washback effects may be interpreted differently by different test stakeholders. A comparative study of the test constructs of these similar English proficiency tests would not only benefit academic circles across the Taiwan Strait but might also support possible mutual accreditation of test results. The researcher hopes that the results of this paper will provide a new starting point and constructive implications for a possible exchange of experience in large-scale English language testing in Taiwan, as part of the momentum of the search for quality ELT assessment. The curriculum designers of college ELT programmes may also gain considerable insight from this study of test constructs for quality college ELT teaching.

THEORETICAL FRAMEWORK 

Construct validation is one of the most important considerations for test designers. Weir (2004) pointed out, "Test validation is the process of generating evidence to support the well-foundedness of inferences concerning trait from test scores, i.e., essentially, testing should be concerned with evidence-based validity." It is significant to assess the construct validation of the CET, the GEPT, and other local tests used in the Mainland and Taiwan.



However, until recently there was no research on test validation comparing the CET and GEPT in general, and little was known about their features concerning construct validity in particular. Therefore, the a priori hypothesis in this study is that the test rationale applied in test development has an important bearing on test validity in general and on consequential validity in particular. In the light of the trait view of construct validity, evidence is the focal point of this research. The theoretical framework of this paper is linked with a central issue: how construct validation is reflected in the CET test in Mainland China and the GEPT test (or other grass-roots tests) in Taiwan.

To begin with, it is necessary to consider the meaning of validity and validation. "Validity" in language testing has traditionally been understood as whether a test "measures accurately what it is intended to measure" (Hughes, 2003, p. 26). "Validation," in turn, can be modestly understood as the whole process of investigating and verifying evidence to provide sufficient justification for designing and conducting a language testing system in a certain way. In other words, a sound language testing system should be accountable for why a test is used in that way. By its nature, this is a never-ending process (Fulcher and Davidson, 2007). In his elucidation of methods of test validation, Xi (2008, p. 177) wrote:

"Validity is a theoretical notion that defines the scope and nature of validation work, whereas validation is the process of developing and evaluating evidence for a proposed score interpretation and use. The way validity is conceptualized determines the scope and the nature of validity investigations and hence the methods to gather evidence. Validation frameworks specify the process used to prioritize, integrate, and evaluate evidence collected using various methods."

Xi's explanation provides a comprehensive understanding of the relationship between validity and validation. Furthermore, it is important to understand what construct validation means for the purposes of this research. It is believed that construct validation involves an analysis of the qualities that a test is intended to measure, thus providing a basis for the rationale of a test (Weir 2004; Bachman and Palmer, 1996). According to Cronbach (1984, p. 149), construct validation is also a fluid and creative process. Therefore, the researcher adopts the following conception in this paper: the construct validity of a language test refers to an indication of how representative it is of an underlying theory of language learning, and construct validation involves an investigation of the qualities that a test measures, thus providing a basis for the rationale of a test (Davies, Brown, Elder, Hill, Lumley, and McNamara, 1999). As Bachman (2004, p. 15) wrote, "A construct, then, is an attribute that has been defined in a specific way for the purpose of a particular measurement situation." In other words, construct validity is concerned with the question: is the study actually investigating what it is supposed to be investigating? (Nunan, 1992).


In simple terms, "construct" means the idea used to support a test designer's decision as to why s/he should design or construct a test in a certain way. This seems simple, but the whole procedure of verifying construct validation can be extremely challenging, and there are various approaches to explaining construct validation from different perspectives (Fulcher, 2007, pp. 183-191).

In commenting on Cronbach and Meehl's 1955 work, Fulcher (2007, p. 10) also pointed out that "central to understanding score meaning lies the question of what evidence can be presented to support a particular score interpretation." This requires that test designers be accountable for the tests they create. Nevertheless, as Alderson (1995, p. 183) pointed out, "construct validity is the most difficult concept to explain." In language testing, construct validity is believed to refer to the totality of evidence about whether a particular operationalization of a theoretical framework adequately represents what is intended by the theoretical account of the construct being measured. It is not a simple measurement of one item such as an internal reliability coefficient. Bachman and Palmer (1996) proposed the influential notion of test usefulness, which contains six qualities: validity, reliability, authenticity, interactiveness, impact, and practicality. As for the conception of construct validation, Xi (2008) has pointed out that the constructs of language tests have become increasingly complex and may go beyond what has traditionally been defined.

Therefore, given the nature of the CET and GEPT, a thorough study of the construct validation of a large-scale test such as the national CET in Mainland China or the GEPT in Taiwan would be a huge project. Any one of the relevant lines of inquiry, such as correlational evidence, content validity, factor analysis, test usefulness, expert evaluation, or the multitrait-multimethod approach, would be a daunting project in itself, and a comprehensive study of all possible aspects of construct validity is certainly beyond the scope of this paper. Hence, due to limited space, the researcher conducts a modest study focused on some important aspects, as a starting point for further research.

Viewed broadly, the researcher holds that a good language test should be well supported by the rationale behind it, and that a good language test can bring about positive backwash effects on language teaching. Construct validation should therefore be studied to see to what extent the rationale for a high-stakes test is effectively reflected in the test purposes, contents, and results, with positive backwash effects in terms of social mobility. This study involves the supposition that positive backwash effects on national English test programmes can be enhanced only when construct validation is supported by the needed rationale. Regarding the influence of test washback, the construct validity of the above-mentioned high-stakes tests (the CET and GEPT) in the Chinese context can be viewed at two levels: the national level and the grass-roots level. At the national level, construct validity can be viewed from its symbolic value for the state interest.


In other words, the symbolic value of construct validation is related to the strategic needs of national college English education, i.e. the primary stakeholder's needs, and to how it can provide positive support for certain English test policies to be continued at the macro level. Meanwhile, at the grass-roots level, construct validity is more related to its functional value for end-users (teachers and students), namely the concern of how universities can have a reliable and valid evaluation system for the purpose of quality English education at the college level. Therefore, the fundamental question raised in this paper is: to what extent can construct validation be reflected, and quantified as significant, in college English graduation tests in Mainland China and Taiwan?

RESEARCH METHOD 

For the purposes of this research, both documentary and empirical data on the CET-4 test in Mainland China and on the GEPT High-Intermediate test (plus a representative local sample, the SCUEPT) in Taiwan are used in this paper. This paper focuses on the GEPT (High-Intermediate) for the following reason. The GEPT has different versions for different levels, namely Elementary, Intermediate, High-Intermediate, Advanced, and Superior, and each set is a complete English proficiency test. The designer of the GEPT, the LTTC (Language Training and Testing Center), states that an examinee who passes the GEPT (High-Intermediate) has a generally effective command of English, with an ability roughly equivalent to that of a university graduate in Taiwan whose major was not English (LTTC, 2003). Therefore, as this research concerns the test for non-English-major college students, only the GEPT at the High-Intermediate level is relevant and comparable with its counterpart, the CET-4, which has the same test purpose and function. GEPT tests at other levels are therefore irrelevant to this research.

It is important to realize that studying construct validation differs between psychological testing and language proficiency testing. Generally, in psychological testing, since it is difficult to tell which factors can really be considered valid causal factors in the development of mental tests and in the analysis of data collected from them, researchers need to identify which factors actually cause a change of behavior among all possible factors, for example by using latent change analysis (LCA) (Windle, 2000). Broadly speaking, the techniques used by psychometricians to identify the influence of causal factors are relatively more complicated than those used in language proficiency testing. In language proficiency testing, by contrast, experienced test designers and teachers are usually not in a "darkroom" where they must discover what counts as a language skill; they clearly do not have to verify whether the four language skills should be the test components of a high-stakes language proficiency test. Therefore, most English proficiency tests, such as TOEFL, IELTS, CET, and GEPT, contain the four language skills: listening, reading, writing, and speaking. "Language testing," as Fulcher (2007) pointed out, "is about doing. It is about creating tests."


In other words, at the macro level, language test designers do not need to establish that the four skills, rather than others, should be assessed in a foreign language proficiency test. The research approach to construct validation here also differs from that of the more complicated psychometrics. Therefore, as far as the research method of this paper is concerned, the researcher does not conduct a factor analysis in order to prove that the four language skills are important causal factors in one's language proficiency. Instead, construct validation is studied through various methods so as to build a comprehensive understanding of the issue from different perspectives. In other words, the method used for studying the construct validation of language proficiency testing differs from that used for the construct validation of psychological tests.

Although there is no single best way to study construct validity, the researcher specifically looks into aspects commonly regarded as important by testing professionals: the analysis of test specifications, internal reliability, internal correlations, factor analysis, the inter-subtest correlation matrix, content validity, and so on, as mentioned previously (Alderson et al., 1995; Bachman, 2004; Fulcher and Davidson, 2007; Xi, 2008).

Generally, an effective approach to the study of the construct validity of a language proficiency test is to look at the correlations among different test components. Researchers should be aware that it is not desirable for two components of a language proficiency test to have a very high correlation coefficient, such as one over 0.85. In this respect, Alderson et al. (1995, p. 184) pointed out:

"One way of assessing the construct validity of a test is to correlate the different test components with each other. Since the reason for having different test components is that they all measure something different and therefore contribute to the total language ability that test designers want to test, we should expect these correlations fairly low, possibly in the order of +.3 to +.5. If two components correlate very highly with each other, say +.9, we might wonder whether the two subtests are indeed testing different traits or skills, or whether they are testing essentially the same thing."

However, as far as the interpretation of correlation coefficients is concerned, there is no absolute value. Normally, a correlation between +.3 and +.7 is considered acceptable (Yang and Weir, 2000, p. 61). For the purposes of this paper, the researcher considers any correlation coefficient among test components smaller than +.3 as rather low, because when coefficients are on the small side, such as +.2, the content validity of a test becomes very questionable. Of course, at the other end of the continuum, a rather high coefficient should also be rejected, since there would be a question whether the two test components are testing essentially the same thing, as Alderson pointed out (1995, p. 184).
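To make this screening rule concrete, the following minimal Python sketch computes an inter-subtest Pearson correlation matrix and flags pairs outside the +.3 to +.7 band. The subtest names and the simulated scores are illustrative assumptions, not data from any of the tests discussed here.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 360  # hypothetical number of test candidates

    # Simulate subtest scores sharing a common ability factor so that the
    # pairwise correlations land in a plausible range for illustration.
    ability = rng.normal(0, 1, n)

    def subtest(weight: float) -> np.ndarray:
        noise = rng.normal(0, 1, n)
        return 70 + 10 * (weight * ability + np.sqrt(1 - weight**2) * noise)

    scores = pd.DataFrame({
        "listening": subtest(0.7),
        "reading": subtest(0.7),
        "writing": subtest(0.6),
        "speaking": subtest(0.6),
    })

    corr = scores.corr(method="pearson")  # inter-subtest Pearson matrix

    # Screen each pair against the +.3 to +.7 band discussed above.
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if r < 0.3:
                note = "low: content validity becomes questionable"
            elif r > 0.7:
                note = "high: subtests may measure the same trait"
            else:
                note = "acceptable"
            print(f"{cols[i]} vs {cols[j]}: r = {r:.2f} ({note})")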

As the CET-4 is said to be a criterion-related norm-referenced test (Jin, 2005; Yang and Weir, 1998), the researcher first looked into the criterion to which the CET is related, and studied how that criterion is reflected in the test specifications. Construct validation can be approached, for a start, by studying the test specifications of the CET (or the GEPT); as Alderson (1995) stated, analyzing test specifications is generally regarded as the correct starting point in studying the construct validity of a test. Then, step by step, the researcher analyzed empirical data from the CET, GEPT, and SCUEPT tests, covering internal reliability, internal correlations, factor analysis, the inter-subtest correlation matrix, content validity, and group differences. In particular, much of the first-hand empirical data on SCUEPT results was collected from a random selection of over 2,000 non-English-major second-year students at Soochow University in 2007 and 2008, and the empirical results are discussed in detail. Finally, the researcher discusses the effects of the benchmark exams in terms of both the symbolic and functional values attached to them in Mainland China and Taiwan. The researcher holds that probing these aspects yields sensible answers to the research question of this study. The relevant documentary and empirical data are discussed from the next section onward.

GRADUATION ENGLISH TESTS AT UNIVERSITIES IN MAINLAND CHINA

The educational authorities in Mainland China have been promoting an unprecedented English language testing system at universities, with millions of test candidates each year. From the very beginning, the British Council has provided support to the CET Committee, and CALS, the applied linguistics research centre at the University of Reading in Britain, has been responsible for research on the validation of the CET. The standardized national CET (College English Test), introduced by the National College English Testing Committee (NCETC) on behalf of China's Ministry of Education in 1987 and revised in 2005, has become a high-stakes college English proficiency test that undergraduate students in China are required to take before graduation (Han et al., 2004). The number of CET candidates increases every year: in the 1995 academic year, 583,135 students in China took the CET, with a passing rate of 66% (Yang and Weir, 1998), while 9.58 million students took the test in the 2005 academic year (Jin, 2005). In reality, the CET has become such a high-stakes benchmark test that most universities demand that students pass it in order to obtain their bachelor's degree. Although China's Ministry of Education altered its test policy in 2005 by stating that the CET is not to be linked with students' graduation, college students still consider the CET crucial because the CET certificate is an important criterion for many employers at job interviews (Cheng 2008).


In other words, viewed from its functional value, taking the CET is an irreversible trend for millions of Chinese college students. Although there are voices against the CET (Han et al., 2004), it is generally believed that this mandatory standardized English testing system has had a cumulative positive effect on the quality of college English education in Mainland China (Jin, 2005; Yang and Weir, 1998). Being linked with both educational and social status, the CET has thus remained high-stakes for 20 years since its launch in 1987.

Construct Validation Reflected in the CET-4 Test Specifications (Mainland China)

How well the test specifications of the CET-4 (College English Test) reflect the intended requirements of China's national teaching syllabus is considerably relevant to the degree to which construct validation is represented in terms of its symbolic value. The CET is a national standardized test designed according to China's National College English Teaching Syllabus for Non-English Majors 1999 (revised in 2007). This national syllabus stipulates specific quantitative requirements for college students' English language proficiency, with the skills of reading and listening given paramount importance (http://edu.people.com.cn/GB/8216/43375/5995154.html).

The CET has two basic versions, the CET-4 and the CET-6.

The CET-6 is for students who have passed the CET-4 and have taken the elective English courses of Bands 5-6. The CET Spoken English Test (CET-SET) is administered only to the small number of students who choose to take it, on the condition that they have passed the CET-4 with a score of 80 or above out of a full score of 100, or the CET-6 with a score of 75 or above. However, only the CET-4, the focus of this paper, is considered the benchmark test that virtually all undergraduate students need to pass; it is administered twice a year, in January and June. According to the CET Committee (2006), there are four main components in the CET-4: Listening Comprehension (35%: dialogues and long talks), Reading Comprehension (35%), Cloze (15%: one cloze passage and sentence translation), and Writing (15%: one short essay).
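As a small illustration of this weighting, the sketch below combines per-section percentage scores into a weighted composite. The section names and raw scores are hypothetical placeholders; only the 35/35/15/15 weights come from the description above.

    # CET-4 component weights as reported above; the raw section scores
    # (on a 0-100 scale here) are hypothetical, for illustration only.
    WEIGHTS = {"listening": 0.35, "reading": 0.35, "cloze": 0.15, "writing": 0.15}

    def composite(raw: dict) -> float:
        """Weighted composite of per-section percentage scores."""
        return sum(WEIGHTS[name] * score for name, score in raw.items())

    print(composite({"listening": 78, "reading": 82, "cloze": 65, "writing": 70}))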

As for the test specifications of the CET-4, the in-house report states that the guiding principle is to reflect the requirements of the national syllabus. According to the research report by Yang and Weir (1998), the CET-4 test specifications have generally met those requirements. As the test is meant to help implement the national teaching syllabus, the CET test designers paid attention to the following aspects so as to construct a theoretical framework for the test:


1) The relationship between knowledge and ability: conceptually, language is a tool for communication, and the ultimate aim of ELT is to ensure that students can use English to communicate. Therefore, the CET should test language skills more than language knowledge.

2) The relationship between fluency and accuracy: the designers of the CET have set specific speed requirements for reading, listening, and writing (i.e., 50 wpm and 129 wpm are set for reading and listening respectively in the CET-4).

3) The relationship between sentence understanding and discourse comprehension: as communication is based on discourse comprehension, the CET should take into consideration not only sentence structures but also the ability to understand discourse.

4) The relationship between receptive ability and productive ability: the CET specifications require that both passive and active skills be examined.

According to the research report by Yang and Weir (1998), the CET-4 is designed according to the above four major considerations, which constitute the basis for its construct validation at a macro level. The symbolic value of the construct validation of the CET can therefore be indicated by the degree to which the CET is accepted by both the educational authorities and university English teachers. According to an official survey by China's CET Committee (2006), the CET has successfully achieved the aims of its test specifications, and the construct validation based on the theoretical framework is well represented in each administration of the test. Specifically, the statistics provide the following implications:

 The internal reliability of the objective items in the CET reaches 0.9 or above in every administration, indicating high reliability.

 A series of questionnaire studies on the CET has indicated that 92% of college teachers in China agree that the CET can effectively reflect students' actual English proficiency, indicating high validity in terms of expert judgment.

 As the CET is a criterion-related norm-referenced test, the passing score set in the CET correlates with teachers' assessments of the candidates' passing level, with a correlation coefficient of 0.82. In addition, CET scores correlate with the rank order of class assessment results given by teachers, with a correlation coefficient of 0.7, which is very good because such a high coefficient is difficult to achieve in large-scale standardized tests.

 Over 86% of college teachers agree that the contents of the CET are appropriately designed and that each part has a proper weighting.

 The CET has a complete testing system, covering item bank management, test formation and organization, administration, statistical analysis of test results, test fairness, and practicality.


Therefore, to conclude this part, it is clear that the construct validity of the CET is mainly associated with the state interest at the national level. In other words, the test specifications of the CET reflect governmental initiatives for the centralization and standardization of language testing at a national level, with a centralized definition of the ability construct. Furthermore, empirically, China's national educational authorities have gained solid statistical support for continuing the CET policy nationwide. Next, the graduation benchmark testing system in Taiwan will be discussed.

GRADUATION ENGLISH TESTS AT UNIVERSITIES IN TAIWAN

In Taiwan, there is no mandatory island-wide English proficiency test set by Taiwan's Ministry of Education for undergraduate students to take for graduation. Unlike its counterpart in the Mainland, the educational authority in Taiwan has been carrying out an American-style, autonomous, and decentralized approach to language assessment within colleges and universities. The authority has transferred power to individual universities, giving them more freedom to decide what kind of English proficiency is needed for their undergraduate students according to each university's own principles, and many universities in Taiwan have recently announced that they will carry out their own benchmark English testing systems. In other words, a kind of threshold English test is about to be carried out across university campuses in Taiwan in the near future. In addition, undergraduate students can take other English proficiency tests as proof of their English proficiency before they finish their four-year university education, such as the GEPT (General English Proficiency Test, a criterion-referenced test set by a non-governmental organization in Taiwan), TOEIC, IELTS, or TOEFL. Nevertheless, the local GEPT is the most popular English proficiency test for college students to take, although the passing rate for college students is around 32% (LTTC, 2007).

Meanwhile, among the English proficiency tests designed by individual universities at the grass-roots level in Taiwan, different universities have their own testing systems and criteria. In contrast to the common practice of using an achievement test in the last term of the university programme, the SCUEPT (Soochow University English Proficiency Test) appears to be at the forefront of the campaign for a standardized English proficiency threshold test that undergraduate students must pass for graduation. Many other universities are also trying to design their own proficiency tests. By and large, it is clear that a variety of benchmark English tests will soon become high-stakes tests for thousands of college students in Taiwan, as such a test certificate would also help college graduates to have better opportunities in the job market.


Construct Validation in the GEPT/SCUEPT Test Specifications (Taiwan)

Now let us take a look at the benchmark graduation English tests for college students in Taiwan. There are more than 150 officially accredited universities in Taiwan, but there is no cohesive paradigm of college English assessment at the tertiary level, and no specified requirement from the educational authorities in Taiwan that all college graduates take an English proficiency test before they graduate. As mentioned earlier, the educational authority in Taiwan has transferred power to individual universities, giving them more freedom to decide what the graduation threshold test should be. According to the researcher's investigation, only some universities demand that their students take the GEPT or other public tests as part of the requirements for graduation. According to a 2007 report on the GEPT, only 22% of GEPT examinees took the test in order to submit their scores to their universities for reference. As for the specifications, the GEPT is not designed to test just college students' English proficiency but that of the general public, which is very different from the CET. In addition, there is little research on the possibility of a large-scale mandatory testing system designed especially for college English education in Taiwan. (Note: there is a new test, the College Student English Proficiency Test (CSEPT), designed by the LTTC in Taiwan, but it is still at an experimental stage and virtually no analytical statistics have been released so far.)

Notwithstanding this, the freedom from government control over standardized and centralized assessment within universities in Taiwan reflects autonomy in defining the construct of ability, whose rationale may follow different frameworks for different socioeconomic purposes. However, the negative side of such autonomy is that it can also cause various problems for local universities. Practically, the general scenario in the evaluation of English programmes among universities in Taiwan is that test scores may be inconsistent and incompatible. In other words, the macro-relationship between college students' English competence and the applied evaluation methods in Taiwan is not clear: different universities adopt their own methods of evaluating students' English proficiency, and such methods may vary from year to year, differ across departments, and even differ from one teacher to another. Thus, the situation is far from the desirable one in which testing results are mutually comparable among colleges in Taiwan in both theory and practice. This is because the current evaluation criteria for college graduates' English proficiency in Taiwan are not based on an island-wide or nationally agreed standard, such as that of the CET-4 used in Mainland China. Therefore, the evaluation results produced by different colleges and universities in Taiwan are difficult to interpret in terms of statistical analysis at the national level. In addition, at present, graduation examinations across colleges and universities are mostly progress tests or achievement tests, and the contents of such achievement tests can differ widely from one university to another, given that different teaching materials are used.


In this sense, such assessment practice in English language programmes can hardly provide a reliable and valid evaluation of college graduates' English proficiency when results are compared across institutions. For example, let us look at the achievement test scores of the same English course at two campuses of the same university.

Table 1. Pearson Correlation of Freshmen English Scores at Two Campuses

                      Taipei Campus    Kaohsiung Campus
    Taipei Campus     1.000            0.034
    Kaohsiung Campus  0.034            1.000

Note. N=900, Year 2005.

Table 1 reflects the fact that test scores are not comparable even within the same university: no significant correlation is found between the test scores at its two campuses (Taipei and Kaohsiung), with a Pearson correlation of 0.034 (p<0.01). This may suggest that, as there are no established English test syllabus and test specifications for universities in Taiwan to follow, different tests are used for evaluation (Gong, 2004). In fact, each university has to take its own approach to the assessment of its English programmes at different levels. Hence, given the different teaching materials, the tests used for graduation examinations, where required, are unsurprisingly tied to those materials; the test results are accordingly not comparable because the test contents differ, and hardly any test specifications are written for such wide-ranging tests.

Thus, as far as construct validity is concerned, the understanding and interpretation of language ability can vary across universities. With an autonomous and decentralized language testing system, each individual college or university may decide its own testing criteria according to its own needed rationale. Therefore, at the national level, the symbolic value of the construct validation of both the GEPT and individual college tests in Taiwan appears limited when compared with that in the Mainland, where the construct of the CET is closely related to the national English teaching syllabus. For universities at the grass-roots level, meanwhile, test results appear inconsistent and incompatible. Thus, English language testing can be called satisfactory and successful only in terms of each university's own interpretation of the needed test construct, or "ability."

EMPIRICAL ANALYSIS OF THE GEPT, CET AND SCUEPT

The writer has discussed how the test validation of the CET and GEPT tests can be viewed from the aspect of test specifications in the previous Sections IV and V.


Now let us take a look at how construct validation can be viewed from the aspects of internal reliability, internal correlation coefficients, and factor analysis of these tests; construct validity linked with content validity and expert evaluation will be discussed in later sections.

Test designers in both Mainland China and Taiwan have paid much attention to the issue of validity in their tests at different levels. Let us look at some more empirical data so as to obtain a better view of the CET-4 and the GEPT (or SCUEPT). First, the reliability results of the three tests are as follows: the reliability (Cronbach's alpha) of the CET is 0.9 (Yang and Weir, 1998; Yang, 2009), that of the GEPT is reported as 0.85 (LTTC 2003, p. 23), and that of the SCUEPT is 0.86 (Gong). Notably, the CET has kept its reliability at 0.9 for nearly 20 years, which is remarkably high.
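For readers who wish to reproduce such a figure, the following minimal sketch implements the standard Cronbach's alpha formula in Python over simulated dichotomous item responses; the matrix dimensions and the logistic simulation are assumptions for illustration, not any of these tests' actual data.

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha for an (n_candidates x n_items) score matrix."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Simulated right/wrong responses: 500 candidates, 40 objective items.
    rng = np.random.default_rng(1)
    ability = rng.normal(0, 1, 500)
    difficulty = rng.normal(0, 1, 40)
    p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
    responses = (rng.random((500, 40)) < p_correct).astype(float)

    print(f"alpha = {cronbach_alpha(responses):.2f}")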

Internal Correlation Coefficients of the GEPT Test in Taiwan

As mentioned previously in Section III (Research Method), internal correlation coefficients are important and useful for evaluating the construct validation of these tests. Tables 2-5 below show the internal correlation coefficients of the GEPT test in Taiwan.

Table 2. Internal Correlation Coefficients of the GEPT (Year 2000)

                 Test A (N=375)           Test B (N=360)
    Component    AL1    AL2    AL3        BL1    BL2    BL3
    L1           1                        1
    L2           0.777  1                 0.822  1
    L3           0.779  0.792  1          0.788  0.839  1
    R1           0.591  0.629  0.620      0.636  0.651  0.648
    R2           0.590  0.624  0.605      0.663  0.684  0.713
    R3           0.598  0.648  0.680      0.695  0.730  0.770

Note. AL1 = Test A Listening Part 1; AL2 = Test A Listening Part 2; AL3 = Test A Listening Part 3; BL1 = Test B Listening Part 1; BL2 = Test B Listening Part 2; BL3 = Test B Listening Part 3. The abbreviations for test components are made by the writer of this paper.
** p≤0.01 (High-Intermediate Report, LTTC, 2000, p. L-12).

Table 2 shows that all the internal correlation coefficients within the same component (Listening) of these two GEPT forms, A and B, are quite high, especially in Test B, and that the correlations between the Listening and Reading components are fairly good, with the lowest at 0.590. However, a few of the correlation coefficients in Test B are somewhat too high, falling beyond the range of +.3 to +.7. In the same year, 2000, another version of the GEPT was delivered by the LTTC, and Table 3 shows the relevant internal correlation coefficients of that test.


Table 3. Internal Correlation Coefficients of the GEPT (N=375, LTTC 2000)

    Sub-test            Reading Part A   Reading Part B   Reading Part C
    Reading Part A      1.000
    Reading Part B      0.681            1.000
    Reading Part C      0.686            0.722            1.000
    Listening Part A    0.591            0.590            0.598
    Listening Part B    0.629            0.624            0.648
    Listening Part C    0.620            0.605            0.680

Note. ** p≤0.01 (High-Intermediate Report, LTTC, 2000, p. R-11).

According to the LTTC GEPT reports, there appears to be a decreasing trend in the internal correlation coefficients of the GEPT (High-Intermediate), as can be seen in Tables 4 and 5 below.

Table 4. Internal Correlation Coefficients of the GEPT (LTTC 2003)

                 Listening  Reading  Writing  Speaking
    Listening    1.00
    Reading      0.56       1.00
    Writing      0.50       0.70     1.00
    Speaking     0.61       0.38     0.49     1.00

Note. ** p≤0.01 (High-Intermediate Report, LTTC, 2003, p. 31).

Table 5. Internal Correlation Coefficients of the GEPT (LTTC 2007)

                 Listening  Reading  Writing  Speaking
    Listening    1.00
    Reading      0.37       1.00
    Writing      0.19       0.32     1.00
    Speaking     0.39       0.23     0.38     1.00

Note. ** p≤0.01 (High-Intermediate Report, LTTC, 2007, p. 2).

Another phenomenon reflected in Tables 2-5 is that the internal correlation coefficients of the GEPT (High-Intermediate) are not very consistent, which can be observed most clearly in Table 5. Researchers need to pay more attention to this inconsistency.


Looking at Tables 4 and 5, we can see that the inter-subtest correlations of the GEPT in 2003 and 2007 generally appear to be on the small side of the commonly used range of +.3 to +.7 when compared with those of the CET, whose figures are mostly around 0.5 (p<0.01). This might indicate that the GEPT subtests are rather disintegrated with regard to language communication skills. Further efforts are probably needed to achieve less divergent construct validity in the test, which requires keeping a fair balance between convergent and divergent validity by using test items with better discrimination and difficulty indices.

The internal correlation coefficients of the GEPT 2000 test appear more convergent than those of the 2007 test. The 2000 GEPT Report suggests that the high coefficients between Listening and Reading in the GEPT may be caused by the similarity of test format and content, i.e. both are based on paragraph comprehension (LTTC, 2000). Nevertheless, in general, the GEPT 2000 and 2003 Reports provided quite satisfactory figures in terms of internal correlation coefficients.

Internal Correlation Coefficients of the CET Test in China

On the other hand, let us now take a look at the CET-4 in the Mainland. According to China's National CET Committee, the CET-4 is a standardized test designed in cooperation with the British Council (Yang and Weir, 1998). Research on the test validation of the CET is mainly the responsibility of the British team, specifically CALS in applied linguistics at the University of Reading.

As Cheng (2008) pointed out, much research has been conducted on the validity of the widely used multiple-choice question format in objective tests, the main feature of the CET. A fair number of empirical studies have been conducted in China and published in Chinese academic journals, among them many case studies concerning the validity of the CET. For example, Zhou (2004), in a comparability study of the CET, found a Pearson correlation coefficient of 0.712 (p<0.01) between two CET tests.

Regarding the internal correlation coefficients of each part of the CET-4, Yang and Weir (1998) provided valuable research results. Table 6 shows that the internal correlation coefficients of each part of the CET-4 lie between 0.3 and 0.7 (p<0.01). According to a number of studies (Yang and Weir 1998; Yang and Jin 2000; Jin 2005; Cheng 2008), it is worth noting that the CET-4 has kept a stable trend of internal correlation coefficients, with no obvious ups and downs over the past 20 years.

Therefore, compared with Table 6, the internal correlation coefficients of the GEPT are less desirable; in particular, the coefficient between Listening and Writing in GEPT 2007 (Table 5) is on the small side in comparison with that of the CET shown in Table 6, although writing could be an important factor pulling down the GEPT figures in 2007, as the LTTC explains (LTTC, 2007). The phenomenon shown in Table 2 (LTTC 2000) and Table 5 (LTTC 2007) might prompt the GEPT test designers to consider how to improve the test.

Table 6. Internal Correlation Coefficients of the CET-4 (Yang and Weir, 1998) [table not reproduced in this copy]

Regarding the test time, format, and skills involved in the Cloze component, it differs considerably from the other components of the test. Based on an analysis of the test papers, the writer found that many students did not have time to finish the last test component, the Cloze, within the specified test time. This could be the main causal factor in its low correlations with all the other test components, indicating that most college students need more training in speed reading.

Table 7. Pearson Inter-subtest Correlation Matrix of the SCUEPT Test (5/2009)

                 Total   LS      LL      SC      FR      CR      Cloze
    Total        1
    LS           .792**  1
    LL           .655**  .542**  1
    SC           .691**  .367**  .273**  1
    FR           .607**  .392**  .325**  .393**  1
    CR           .662**  .350**  .268**  .379**  .319**  1
    Cloze        .463**  .226**  .200**  .214**  .174**  .223**  1

Note. N=1660. LS = listening short conversations, LL = listening long talks, SC = sentence completion, FR = fast reading, CR = careful reading.
** p≤0.01 (2-tailed).

Meanwhile, the inter-subtest correlations of all the other test components appear quite acceptable according to the criterion set by Alderson (1995, p. 184), who wrote that "we should expect these correlations fairly low, possibly in the order of +.3 to +.5." Thus, on the whole, the internal correlation coefficients of the SCUEPT test at Soochow University prove to be sound because (1) the correlation coefficient of each subtest with the total score appears very desirable, and (2) the inter-subtest correlations of six of the seven subtests generally fall within the acceptable range of +.3 to +.5. The strong correlations between the subtests and the total score provide some grounds for the construct validation of the SCUEPT, while the weaker coefficients may indicate test components that need improving in the SCUEPT battery. Therefore, the internal correlation coefficients of the SCUEPT, the test of a typical grass-roots university in Taiwan, can be considered quite supportive of the construct validity of a locally designed English proficiency test for college graduation.

Regarding concurrent validity, the writer of this paper also administered a reading test of the SCUEPT and of the GEPT to 120 students within two weeks. A paired-samples t test was then conducted to see whether the SCUEPT differs greatly from the GEPT. Table 8 shows that there is no statistically significant difference between the SCUEPT and the GEPT (p > .05, sig. = .074).


This is a fairly clear index by which we can tentatively describe the two tests as similar.

Table 8. Paired-Samples T Test of SCUEPT and GEPT (Reading Test)

                                        95% Confidence Interval of the Difference
                   Mean   SD     SEM    Lower   Upper    t      df   Sig.
    SCUEPT - GEPT  1.048  4.576  .576   -.104   2.199    1.818  62   .074

Note. p > .05.

Next, the Pearson correlation between the two tests was calculated, as illustrated in Table 9. There was a statistically significant positive correlation between the two tests, r = 0.72 (p = .000), which statistically means they are positively related at a high level.

Table 9. Correlations between SCUEPT and GEPT (N=120)

                                  SC        GEPT
    SC     Pearson Correlation    1         .720**
    GEPT   Pearson Correlation    .720**    1

Note. ** p≤0.01 (2-tailed).

In other words, the writer cautiously holds that the usefulness of the SCUEPT can be partly supported by its high correlation with the GEPT, which is practically very useful for the students. Hence, the writer maintains that the construct validation of the SCUEPT, viewed from the angle of test usefulness, is also supported by the evidence of its high correlation with the GEPT High-Intermediate.
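The two statistics reported in Tables 8 and 9 can be computed with SciPy as in the sketch below. The paired reading scores are simulated stand-ins for the 120 students' data; only the analysis steps mirror the procedure described above.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 120  # mirrors the number of students reported above

    # Simulated paired reading scores: the GEPT score is derived from the
    # SCUEPT score plus noise, so the two measures correlate positively.
    scuept = rng.normal(70, 10, n)
    gept = 0.8 * scuept + rng.normal(14, 7, n)

    # Paired-samples t test: is the mean difference significantly non-zero?
    t_stat, t_p = stats.ttest_rel(scuept, gept)
    print(f"paired t = {t_stat:.3f}, p = {t_p:.3f}")

    # Pearson correlation: how strongly do the two tests co-vary?
    r, r_p = stats.pearsonr(scuept, gept)
    print(f"Pearson r = {r:.2f}, p = {r_p:.4g}")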

SCUEPT Viewed from a Differential-Group Experiment

Empirically, another popular method of studying construct validity is the differential-group experiment, and the writer conducted such a study with the SCUEPT only. This is because the writer holds that both the CET and the GEPT are relatively well-designed large-scale tests, and there is already much research literature based on differential-group experiments showing positive support for the construct validity of the CET and GEPT (Yang & Weir, 1998; Yang & Jin, 2000; LTTC, 2007); their results are therefore not presented here due to limited space.

The intention of this method is to detect bias in a test for or against groups of students defined by biodata characteristics (Brown, 2005, p. 227). This means a researcher compares the performance of two groups on a test: one group that obviously possesses the construct (the daytime English majors in this study) and another group that has little or less of it (the nighttime English majors and non-English majors in this study).


If the first group scores high on the test and the other group(s) score low, this is an argument for the construct validity of the test; that is, those who have the construct score higher on the test than those who do not (Alderson et al., 1995, p. 185). In Bachman's view, differences in group performance can be stated in terms of group means, as follows:

X̄_U1 > X̄_U2 > X̄_Uinto > X̄_Uprep

"If these differences were observed, this would be evidence in support of the claim that scores from this test would be useful for predicting future performance in a relevant language use domain, namely language use tasks in an academic setting" (Bachman 2004, pp. 290-292). (Note: U1, U2, Uinto, and Uprep in the above expression refer to different levels of university students in the study conducted by Brown et al. (2002), which was further adopted by Bachman, 2004, pp. 290-292.)

Therefore, the researcher of this paper followed this method to compare the performance of three groups of second-year students on the same SCUEPT English proficiency test. The first group was daytime English majors, who were considered to have high ability; the second group was night-time English majors; and the third group was non-English majors, whose English ability was presumed to be the lowest.

The empirical analysis presented in Table 10 shows clearly that the first group scored highest on average, the second group scored relatively low, and the third group of non-English majors scored lowest of all. Table 10 presents the descriptive analysis of the scores of the SCUEPT test administered in May 2009. The data consisted of 344 students at Soochow University: 63 daytime English majors, 57 night-time English majors, and 224 non-English majors. The mean scores for the three groups are X̄group1 = 67.69, X̄group2 = 59.46, and X̄group3 = 53.44 respectively; the medians are 67.4, 60, and 54. The other variables show a similar tendency.

Table 10. Test Scores of the SCUEPT Among Three Groups of Students (2009)

Group                        N      Mean     Median    Mode     SD
Daytime English Majors       63     67.69    67.40     58.20    7.73
Night-time English Majors    57     59.46    60.00     58.80    8.67
Non-English Majors           224    53.44    54.00     61.00    10.95
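To make the differential-group logic concrete, the sketch below simulates three groups matching the sizes, means, and SDs reported in Table 10 (the raw responses are not available) and adds a one-way ANOVA, a test the paper does not report but which formally checks whether the group means differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Placeholder samples drawn to match Table 10's group sizes, means, and SDs.
day_majors = rng.normal(67.69, 7.73, 63)    # daytime English majors
night_majors = rng.normal(59.46, 8.67, 57)  # night-time English majors
non_majors = rng.normal(53.44, 10.95, 224)  # non-English majors

for name, g in [("Daytime English majors", day_majors),
                ("Night-time English majors", night_majors),
                ("Non-English majors", non_majors)]:
    print(f"{name}: n = {g.size}, mean = {g.mean():.2f}, SD = {g.std(ddof=1):.2f}")

# One-way ANOVA (an addition, not reported in the paper): a small p-value would
# support the claim that the three groups differ in mean SCUEPT scores.
f, p = stats.f_oneway(day_majors, night_majors, non_majors)
print(f"F = {f:.2f}, p = {p:.4g}")
```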

As Table 10 shows, it is quite clear that there is a difference in mean test scores among the three groups, and even between the two groups of English majors. Therefore, based on these statistical results, the researcher has a strong argument for the construct validity of the test scores of the SCUEPT. That is to say, the test differentiated between students who have much of the SCUEPT proficiency construct (daytime English majors), those who have less (night-time English majors), and those who have very little of the proficiency construct (non-English majors). When other evidence is taken into consideration, such as the content validity and the inter-subtest correlation matrix mentioned in the previous sections, the writer can form a convincing argument that the SCUEPT scores also reflect the construct that the SCUEPT was designed to measure.

So far this section has studied test construct validation from the aspect of internal correlation coefficients and other aspects of the three tests. The writer can cautiously conclude this section by stating that the notion of construct validity has not only been considerably represented in the CET, GEPT, and SCUEPT tests, but has also played an important role in the development of the test specifications and construction, although there is still much to be done.

Next, it is worthwhile to take a look at how construct validation is reflected in the factor analysis of the CET and GEPT.

Factor Analysis of the CET and GEPT Test Components

As mentioned in Sections II and III, it is generally accepted that test construct validation can also be probed by factor analysis. Regarding the results of factor analysis of the CET, China's National College English Test Committee has released a number of study results. Let us take a look at the 1998 statistics in Table 11.

Table 11. CET-4 Factor Analysis (Using Principal Components Analysis)

Factor    Eigenvalue    % of total variance    Cumulative variance %
1         3.00246       60.0                   60.0
2         .68722        13.7                   73.8
3         .55023        11.0                   84.8
4         .41964        8.4                    93.2
5         .34045        6.8                    100.2

Table 12. Factor 1 Matrix from Principal Components Analysis

Factor Matrix    Factor 1
LC               .76497
RC               .80278
CL               .78927
VS               .85199
WR               .65115

Note. LC = listening comprehension; RC = reading comprehension; CL = cloze test; VS = vocabulary & structure; WR = writing. From Yang and Weir, 1998, p.227.

According to Tables 11 and 12, the eigenvalue of factor 1 of the CET is much greater than 1 and its contribution reaches 60%. Among the five test components, factor 1 accounts for most of the variable contributions, with loadings ranging between 0.651 and 0.852. Therefore, factor 1 can be considered “general English proficiency”. In other words, the construct design of the CET-4 can strongly reflect the general English proficiency level of college students.

Next, Table 13 and Table 14 show the results of the factor analysis of the GEPT (High Intermediate) in its 2003 Report, which is so far the only report that has publicly released factor analysis results for the GEPT.

Table 13. Factor Analysis of GEPT (High Intermediate) (Stage 1 Test)

GEPT (High Intermediate) Stage 1 Test: Listening and Reading

Test Component                       Factor 1    Factor 2    Factor 3
Listening Comprehension
  Q & A                              0.28        0.72        0.37
  Short Dialogues                    0.33        0.75        0.29
  Short Talks                        0.27        0.78        -0.04
Reading Comprehension
  Voc & Structure                    0.52        0.34        0.32
  Cloze Test                         0.63        0.32        0.28
  Reading                            0.70        0.36        0.25
Expl. Var                            5.05        4.19        3.23
Prp. Totl                            0.24        0.20        0.15

Note. From GEPT 2003 Report, p.14.

Table 14. Factor Analysis of GEPT (High Intermediate) (Stage 2 Test)

GEPT (High Intermediate) Stage 2 Test: Listening, Reading, Writing, and Speaking

Test Component                       Factor 1    Factor 2
Listening Comprehension
  Q & A                              0.11        0.75
  Short Dialogues                    0.11        0.71
  Short Talks                        0.26        0.80
Reading Comprehension
  Voc & Structure                    0.63        0.32
  Cloze Test                         0.63        0.28
  Reading                            0.50        0.53
Writing
  Chinese to English                 0.63        0.38
  Guided Writing                     0.71        0.32
Speaking                             0.26        0.67
Expl. Var                            4.76        4.50
Prp. Totl                            0.30        0.28

Note. From GEPT 2003 Report, p.32.

Table 13 and Table 14 were provided in the GEPT 2003 Report, but unfortunately the original eigenvalues and percentages of total variance of the factors were not clearly given in the report. Nevertheless, the report mentioned that there are three factors whose eigenvalues are greater than 1, and that 59% of the variance is accounted for by these three factors, with Factor 1 covering most of the variable contributions. The GEPT report (LTTC, 2003) wrote: “Reading accounts for more of the variance in Factor 1, and Listening accounts for more of the variance in Factor 2. Possibly, Factor 1 is related to Reading, and Factor 2 is related to Listening, and Factor 3 is related to Writing” (p.14).

Similarly, for Table 14, according to the same GEPT report, there are two factors whose eigenvalues are greater than 1, and 58% of the variance is accounted for by the two factors. The report also suggested: “Possibly, Factor 1 is related to Reading and Writing, and possibly Factor 2 is related to Listening” (p.32).

In short, comparing the factor matrices in Tables 12, 13 and 14, the writer found that the loadings of the CET are greater than those of the GEPT. According to the Kaiser criterion, researchers usually retain only factors with eigenvalues greater than 1; in essence this means that unless a factor extracts at least as much variance as the equivalent of one original variable, it is dropped. As Mousavi (2002) mentioned, if the correlations among the variables in the correlation matrix are close to zero, then no factors will emerge from the factor analysis; on the other hand, the higher the correlations among the variables, the more likely it is that one or more factors will result from the analysis (p.245). Accordingly, this might tentatively suggest that the results of the factor analysis of the CET reflect higher correlations among the test components than those of the GEPT. In general, however, the construct validation of both the CET and GEPT seemed well grounded in their own explanations; in other words, the results of factor analysis may provide further support for both tests in terms of their construct validation. Finally, as for the SCUEPT, there is no available report concerning its factor analysis. Because of this data limitation, the writer admits that he was unfortunately not able to conduct a factor analysis himself this time; such an analysis could be conducted to improve the studies of the SCUEPT in the future.
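As a minimal sketch of the kind of principal components analysis behind Tables 11-14 (simulated data, since neither CET nor GEPT candidate-level data are public): starting from the inter-subtest correlation matrix, the eigenvalues give the variance explained by each factor, the Kaiser criterion retains factors with eigenvalues above 1, and the scaled eigenvectors give the factor loadings:

```python
import numpy as np

# Simulated subtest scores (rows = candidates; columns = LC, RC, CL, VS, WR).
rng = np.random.default_rng(3)
ability = rng.normal(size=(500, 1))                    # one shared "proficiency" factor
scores = 0.8 * ability + 0.6 * rng.normal(size=(500, 5))

R = np.corrcoef(scores, rowvar=False)                  # inter-subtest correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)                   # eigendecomposition of R
order = np.argsort(eigvals)[::-1]                      # sort factors by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pct = 100 * eigvals / eigvals.sum()                    # % of total variance per factor
loadings = eigvecs * np.sqrt(eigvals)                  # component loadings (factor matrix)

print("Eigenvalues:   ", np.round(eigvals, 3))
print("% of variance: ", np.round(pct, 1))
print("Kaiser keeps:  ", eigvals > 1.0)                # retain factors with eigenvalue > 1
print("Factor 1 loads:", np.round(loadings[:, 0], 3))
```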

VALIDATION VIEWED FROM CONTENT VALIDITY

Construct validation should also be viewed from the perspective of content validity. One good way to study content validity is to gather the judgment of experts. Alderson et al. (1995) point out: “Typically, content validation involves ‘experts’ making judgements in some systematic way. A common way is for them to analyze the content of a test and to compare it with a statement of what the content ought to be” (p.173). Tables 15, 16, and 17 below show the relevant results for the CET-4 in Mainland China and for the GEPT and SCUEPT in Taiwan.

Table 15 clearly shows that only 8% of teachers have negative views of the CET-4, while the majority believes the CET-4 is credible, indicating high approval of its content validity in terms of the judgment of experts.


Table 15. Content Validity of the CET-4 Test Based on Teachers’ Evaluation

General Comments                             English Teachers (N=144)    Students (N=2490)
1. Useful/reflecting students’ ability       68%                         41.6%
2. Useful/good for jobs                      24%                         26.5%
3. Useless/not reflecting ability            4%                          14.6%
4. Useless/students unwilling to take        4%                          11.4%

Note. From Yang and Weir, 1998, pp.174-175.

On the other hand, no judgment of experts has been officially reported regarding the GEPT in Taiwan. However, the researcher surveyed 17 English teachers, who gave their judgment of the content validity of the GEPT. The question was: To what extent does the GEPT (High Intermediate) test properly reflect the claimed English proficiency that non-English major students have? On a scale from 1 to 10, the bigger the number, the less the test suffers from poor construct representation. The feedback collected by the writer is shown in Table 16.

Table 16. Teachers’ Feedback on Construct Relevance of the GEPT

Teacher    1    2    3    4    5    6    7    8    9
+          7    9    6    7    8    7    8    7    6
−

Teacher    10   11   12   13   14   15   16   17
+          6    7    7    6    8    6    7    8
−

Note. N=17, selected from universities in the greater Taipei and Kaohsiung areas in 2006.

This “expert” feedback on the GEPT shows that the agreement percentage reaches only 71% (120/170), which means that on average the GEPT has not received very high marks in terms of the judgment of experts. In this respect, the content validity of the GEPT can only be rated as moderately significant, which is still fairly good.

As for the SCUEPT, the researcher interviewed 20 college English teachers in 2007 and collected questionnaires recording their judgment of the content validity of the SCUEPT. The question put to these 20 teachers was: To what extent does the SCUEPT test not suffer from construct under-representation or construct-irrelevant variance? On a scale from 1 to 10, the bigger the number, the less the test suffers from construct under-representation. The feedback is shown in Table 17.


Table 17. Teachers’ Feedback on Construct Under-representation of the SCUEPT

Teacher    1    2    3    4    5    6    7    8    9    10
+          8    7    8    9    9    8    6    9    7    9
−

Teacher    11   12   13   14   15   16   17   18   19   20
+          8    8    7    9    9    8    9    7    9    8
−

Note. N=20, selected from universities in the greater Taipei area in 2007.

This “expert” feedback shows that the agreement percentage reaches 81% (162/200), i.e. the SCUEPT test content does not appear to suffer from construct under-representation. Relatively speaking, therefore, the content validity of the SCUEPT can be rated as significantly strong. However, further research is still needed, because the selected samples of 17 and 20 teachers may not be adequately representative; larger samples are needed for more robust results.
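For reproducibility, the agreement percentages can be recomputed directly from Tables 16 and 17. The sketch below assumes, as the reported ratios 120/170 and 162/200 imply, that the “agreement percentage” is the total of the ratings divided by the maximum possible total of 10 points per teacher:

```python
# Ratings transcribed from Table 16 (GEPT, 17 teachers) and Table 17 (SCUEPT, 20 teachers).
gept_ratings = [7, 9, 6, 7, 8, 7, 8, 7, 6, 6, 7, 7, 6, 8, 6, 7, 8]
scuept_ratings = [8, 7, 8, 9, 9, 8, 6, 9, 7, 9, 8, 8, 7, 9, 9, 8, 9, 7, 9, 8]

def agreement(ratings, scale_max=10):
    """Total points awarded divided by the maximum possible total."""
    return sum(ratings) / (scale_max * len(ratings))

print(f"GEPT:   {agreement(gept_ratings):.0%}")    # 120/170 -> 71%
print(f"SCUEPT: {agreement(scuept_ratings):.0%}")  # 162/200 -> 81%
```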

DISCUSSION

Test Washback

The educational authorities and educational institutions at different levels in both the Mainland and Taiwan have put testing in a position to affect the huge number of stakeholders involved. The influence of testing in China, especially in relation to the nationwide College English Test (CET), has a close relationship with the national College English Teaching Syllabus: the test construct validation clearly needs to be in accordance with the national syllabus first. As for the CET, test washback is first linked with the needs of the primary stakeholders, and then with those of less important ones. Universities consider CET results a reliable index of the quality of their ELT programmes. Meanwhile, CET results are directly linked with test takers’ social mobility in terms of job hunting. The washback effect of the CET is considerably influential on the stakeholders at each level (Jin & Wu, 2010). The stakes of the GEPT (or other tests) in Taiwan, however, differ greatly from those in China, and the washback effect of the GEPT is weaker because the GEPT is not the only tool adopted for assessing college students’ English competence. In fact, many universities have created their own criteria according to their different frameworks of construct validity.

Specifically, the symbolic value of the CET in China can be considered enormous for college ELT programmes in modern Chinese education. The implementation of the CET in Mainland China has caused millions of college students to study English, and it is no exaggeration to say that China now has the largest English-learning population in the world. Universities at grass-roots level in China have utilized the functional value of the test to the extreme, and, to some extent, the CET has caused high tension between college ELT teaching and students’ functional needs (Han et al., 2004). The washback effect has played a critical role in tertiary ELT curricula across all universities in China. Drawing on studies of the CET, Cheng (2008) wrote about both positive and negative washback:

“Most of the CET stakeholders think highly of the test, especially its design, administration, marking, and the new measures adopted in recent years. They believe that positive washback of the test is much greater than the negative washback, and the negative washback is primarily due to the misuse of the test. However, some CET stakeholders are dissatisfied with the overuse of multiple-choice (MC) format in the test, the lack of direct score reports to the teachers, the incomplete evaluation of students’ English proficiency without a compulsory spoken English test, and the use of the test as the sole means in evaluating the quality of college English teaching and learning” (p.30).

The issue of CET washback is complicated, with many factors determining the outcome of college ELT programmes. But most college teachers hold that such a standardized English language testing system has brought about a cumulative positive effect on the quality of college English education in China (Jin, 2005; Yang & Weir, 1998; Cheng, 2008). Clearly, tertiary ELT programmes at grass-roots level in China have to make adjustments accordingly under the influence of CET washback.

In Taiwan, the GEPT has become an important English test whose national symbolic value has made all stakeholders cherish its educational and social significance. Compared with the CET in China, the GEPT is not a high-stakes test for most college students in Taiwan, as there is no obligation that all college students pass the GEPT. However, the GEPT has successfully influenced all test stakeholders, especially the primary ones in tertiary ELT education (LTTC, 2007, 2003). This is clearly reflected in the fact that almost all universities in Taiwan have made the GEPT the first choice for their students to take, whether as mandatory or optional (LTTC, 2007), and college English teachers have made more effort to teach their students. Specifically, some universities have clearly stated that all students have to pass the GEPT or an alternative (such as the SCUEPT) to obtain a bachelor’s degree. Accordingly, universities encourage students to take the GEPT or a test designed by the university at grass-roots level. It can hardly be denied that the most powerful test washback effects, such as those of the GEPT, are reflected not just in the improvement of each university’s ELT syllabus design but also in its social value and function. Officially, the Ministry of Education in Taiwan demanded a few years ago that at least 50% of college students should pass the GEPT by 2007. In turn, most universities at grass-roots level started to carry out various award plans, providing financial support to encourage students to take the GEPT. This partially shows that the functional value of the


GEPT is high, as the test’s usefulness is widely accepted. As Bachman (2004) pointed out, “An overriding consideration in designing, developing and using language tests is that of test usefulness” (p.5). This means that test usefulness relates to many important areas that test designers need to know, and test construct validity also needs to be related to test usefulness. We can hardly imagine a test being considered socially valid if it turns out to be of little use by any means.

Test Symbolic and Functional Values

This paper has introduced and discussed, from different aspects, the issues and concerns of the construct validation of the CET in Mainland China and of the GEPT and the SCUEPT (as a representative English proficiency test designed by a local university) in Taiwan.

At the macro level, for the Chinese educational authorities in the Mainland, the CET serves the state interest and the needs of national higher education. The CET reflects governmental initiatives for the centralization and standardization of language testing at a national level, with a centralized definition of ability construct validation. The CET is a mandatory standardized English language testing system in the sense that all college students in the Mainland need to pass this high-stakes benchmark test for graduation. Therefore, the CET has virtually mobilized all Chinese college students to study English hard. Although there are criticisms of the CET (Han et al., 2004), the major positive symbolic and functional value of its backwash effects is that all universities in China have realized the importance of college English education at the national level and have taken various actions to promote college English actively, which has brought about a cumulative positive effect on the quality of college English education nationwide (Jin, 2005). In this sense, the construct validation of this high-stakes national test is linked with the needs of the national ELT curriculum. The symbolic value of the construct validation of the CET in China is related to the needed rationale not only educationally but also socially and economically, because the CET has been popular in the job market across the Mainland (Jin, 2005; Cheng, 2008).

However, as for the GEPT, the educational authorities in Taiwan have adopted a completely different approach to the assessment of college English education. Universities in Taiwan have much more freedom to decide what kind of English proficiency is needed for their undergraduate students, according to each university’s own understanding and interpretation of the test construct of ability. The autonomous and decentralized language assessment within colleges and universities in Taiwan provides autonomy in defining the test constructs of ability according to their own local needs. Therefore, the GEPT is not high-stakes, and the symbolic value of the construct validation of the GEPT appears limited when compared with that in the Mainland, where the construct of the CET is closely related to the national English teaching syllabus.


In view of the symbolic value of construct validation, the researcher believes that the CET has comparatively more credit, because a set of dedicated test specifications has been designed that is directly in line with the purpose of the CET in the Mainland. By contrast, the GEPT’s test specifications are not specially designed to evaluate college graduates’ English competency, let alone to serve as a graduation threshold. Therefore, strictly speaking, the purpose of the GEPT and its backwash effects on college English education do not completely accord with the needs of college English programs in Taiwan; there is no comparison between the CET and GEPT in this sense. On the credit side, the SCUEPT is designed as a graduation threshold, but the symbolic value of its construct validity is limited, as it serves just one individual university in Taiwan.

At grass-roots level, the functional value of the construct validation of the CET and the GEPT (or SCUEPT) can be viewed in the light of their specific needed rationale. Generally, the CET, GEPT, and SCUEPT all have solid empirical data to support their own rationale regarding internal reliability, correlation coefficients, content validity, etc., which are all satisfactorily acceptable at the micro level. However, where different tests are used for graduation examinations in Taiwan, the test results are unsurprisingly not comparable, because the test contents differ and there are hardly any test specifications written for such wide-ranging tests. So the functional value of construct validity is also limited to each individual university in Taiwan.

CONCLUSION

In this final section, the writer concludes with the findings and provides implications. With both documentary and empirical data, the writer has examined the construct validation of the CET and GEPT (and SCUEPT) tests from various aspects and provided justification for how construct validation can be quantified and reflected in the CET and GEPT, which are used as graduation English tests in Mainland China and Taiwan.

Evidence has shown that the construct validation behind college graduation English proficiency tests such as the CET and GEPT (and SCUEPT) in Mainland China and Taiwan has been effectively and moderately represented in the test content. Such a claim is satisfactorily supported by the analysis of documentary and empirical data according to the major research approaches. Empirical analysis by means of test specifications, internal reliability, internal correlations, factor analysis, the (Pearson) inter-subtest correlation matrix, content validity, etc. has also provided strong support for the claim and answered the main research question. The findings have shown that construct validation can be strongly reflected, and quantified as significant, in college English graduation tests in Mainland China and Taiwan. In this sense, both the CET and GEPT tests have a fairly sound rationale to be selected as qualified English proficiency tests for universities at


grass-roots level to use. In a narrow sense, however, the CET in the Mainland appears to have exerted a more powerful washback influence on its national college English programmes. The researcher suggests that universities in Taiwan should consider using a unified testing system as a kind of benchmark graduation English proficiency test, although such a unified testing system should not be the only tool adopted for assessing college students’ English competence, so as to avoid college ELT programmes becoming test-driven. In short, this research has provided an initiative for the study of the construct validation behind college graduation English proficiency tests such as the CET and GEPT (and SCUEPT) in Mainland China and Taiwan. In a sense, this research has filled a gap in this field as a starting point, although much remains to be done.

As Bachman (2004) says, “It is never sufficient for the purpose of supporting a validation argument” (p.279), and there is still much more to be done in the field of language testing in general and test validity in particular. The researcher firmly believes that the issue of construct validation can be studied more thoroughly from different perspectives. It is hoped that the results of this study provide a clear indication of the degree to which construct validation can be reflected in the CET and GEPT tests, so as to help test designers improve their language test design. Meanwhile, the researcher also hopes that the results of this paper can provide a new starting point for a possible exchange of experience in large-scale English language testing, as part of the momentum of the search for quality college ELT testing.


REFERENCES

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

Bachman, L. F. (2004). Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press.

Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford: Oxford University Press.

Brown, J. D. (2005). Testing in Language Programs: A Comprehensive Guide to English Language Assessment. New York: McGraw-Hill Education (Asia).

Cheng, L. (2008). The key to success: English language testing in China. Language Testing, 25(1), 15-37.

Cronbach, L. J. (1984). Essentials of Psychological Testing (4th ed.). New York: Harper and Row.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Studies in Language Testing 7: Dictionary of Language Testing. Cambridge: Cambridge University Press.

Fulcher, G., & Davidson, F. (2007). Language Testing and Assessment. London: Routledge.

Gong, B. (2004). A need for a unified assessment of college English language programs: Some theoretical and practical considerations for quality ELT in Taiwan. Shih Chien Management Commentary, 1.

Han, B., Dai, M., & Yang, L. (2004). Problems with College English Test as emerged from a survey. Foreign Languages and Their Teaching, 179(2), 17-23.

Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge: Cambridge University Press.

Jin, Y. (2005). The National College English Test of China. In L. Hamp-Lyons (Chair), The big tests: Intentions and evidence. Symposium presented at the International Association of Applied Linguistics (AILA) 2005 Conference, Madison, WI.

Jin, Y., & Wu, J. (2010). A preliminary study of the Internet-based CET-4. Paper presented at the 12th Academic Forum on English Language Testing in Asia (AFELTA 2010), Taipei, Taiwan.

LTTC (2000, 2003, 2007). A Statistical Report on the Scores of a GEPT Test. Taipei: LTTC.

MOE (2005). Higher Education Newsletter, 166. Taipei: Ministry of Education.

Mousavi, S. A. (2002). An Encyclopedic Dictionary of Language Testing (3rd ed.). Taipei: Tung Hua Book Co.

National College English Testing Committee, PRC (2006). College English Test Sample Papers. Shanghai, China: Shanghai Foreign Language Education Press.

Tsai, Y., & Tsou, C. (2009). A standardized English language proficiency test as the graduation benchmark: Student perspectives on its application in higher education. Assessment in Education: Principles, Policy & Practice, 16(3), 319-330.

Weir, C. (2004). Language Testing and Validation: An Evidence-based Approach. Basingstoke: Palgrave.

Windle, M. (2000). A latent growth curve model of delinquent activity among adolescents. Applied Developmental Science, 4(4), 193-207.

Xi, X. (2008). Methods of validation. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of Language and Education (pp. 177-196). New York: Springer.

Yang, H. (2009). The sociological aspects of language testing. Paper presented at the 2009 LTTC International Conference on English Language Teaching and Testing, Taipei, Taiwan.

Yang, H., & Jin, Y. (2000). Score interpretation of CET. In Proceedings of the Third International Conference on English Language Testing in Asia (pp. 32-40). Hong Kong: Hong Kong Examinations Authority.

Yang, H., & Weir, C. (1998). Validation Study of the National College English Test (3rd ed.). Shanghai, China: Shanghai Foreign Language Education Press.


A Comparative Study of Construct Validation of Graduation English Proficiency Tests between Universities in Mainland China and Taiwan

Chinese Abstract

This paper presents a comparative study of the construct validation of graduation English proficiency tests for university students in Mainland China and Taiwan. Construct validity has important guiding significance in the development of language tests; the extent to which a test can validly measure its intended content is the first question that developers of college graduation benchmark English tests on both sides of the Strait must consider. Based on documentary and quantitative data, the paper explores how the notion of construct validity is embodied in the mainstream college graduation benchmark English tests on the two sides.

The paper discusses the validation of test construct validity at length and examines it quantitatively; the quantitative methods employed offer practical reference value for substantiating construct validity. The study finds, first, that the CET-4 in the Mainland and the GEPT (High-Intermediate) in Taiwan, each serving as a mainstream college graduation benchmark English test, both embody construct validation soundly and appropriately. Second, the notion of construct validity in language testing should also be explored in terms of practical utility, which is particularly important for a full understanding of the washback effects of testing on teaching, above all for large-scale tests. Third, in the quantitative analysis, the study of the Mainland's CET-4 and Taiwan's GEPT (and the Soochow University English Proficiency Test) obtains positive support for the notion of construct validity in such respects as internal reliability, inter-subtest correlation matrices, content validity, and differential-group experiments. Finally, the author holds that the social and practical significance of large-scale graduation benchmark English tests also needs to be included in the consideration of construct validity.

The study further shows that, by comparison, the Mainland's CET-4 displays a more concentrated embodiment of validity; its washback effect is therefore strong, with a positive, promoting influence on college English teaching. The validation performance of Taiwan's GEPT (or the tests developed by individual universities) is less stable, and the scores of tests developed by different universities are poorly comparable. On the positive side, however, each university can realize the requirements of construct validity according to its own needs. Broadly speaking, a correct understanding and implementation of construct validity is the guarantee that large-scale tests exert positive washback effects on college English teaching. This paper is therefore also of good reference value for researchers and teachers engaged in language testing.

Keywords: construct validation, college graduation benchmark English test, washback effect, high-stakes test