Inductive Reasoning 1

RUNNING HEAD: Cross-Cultural Equivalence of an Inductive Reasoning Test

Inductive Reasoning in Zambia, Turkey, and The Netherlands:

Establishing Cross-Cultural Equivalence

Fons J. R. van de Vijver

Tilburg University

The Netherlands

Mailing address:

Fons J. R. van de Vijver

Department of Psychology

Tilburg University

PO Box 90153

5000 LE Tilburg

The Netherlands

Phone: +31 13 466 2528

Fax: +31 13 466 2370

E-mail: [email protected]

Acknowledgment. The help of Cigdem Kagitcibasi and Robert Serpell in making the

data collection possible in Turkey and Zambia is gratefully acknowledged.


Abstract

Tasks of inductive reasoning and its component processes were administered to

704 Zambian, 877 Turkish, and 632 Dutch pupils from the highest two grades of

primary and the lowest two grades of secondary school. All items were constructed

using item-generating rules. Three types of equivalence were examined: structural

equivalence (Does an instrument measure the same psychological concept in each

country?), measurement unit equivalence (Do the scales have the same metric in

each country?), and full score equivalence (full comparability of scores across

countries). Structural and measurement unit equivalence were examined in two

ways. First, a MIMIC (multiple indicators, multiple causes) structural equation model

was fitted, with tasks for component processes as input and inductive reasoning

tasks as output. Second, using a linear logistic model, the relationship between item

difficulties and the difficulties of their constituent item-generating rules was

examined in each country. Both analyses of equivalence provided strong evidence

for structural equivalence, but only partial evidence for measurement unit

equivalence; full score equivalence was not supported.


Equivalence of a Measure of Inductive Reasoning

in Zambia, Turkey, and The Netherlands

Inductive reasoning has been a topic of considerable interest to cross-cultural

researchers, mainly because of its strong relationship with general intelligence

(Carroll, 1993; Gustafsson, 1984; Jensen, 1980). Many cultural populations have

been studied using common tasks of inductive reasoning such as number series

extrapolations (e.g., How should the following series be continued: 1, 4, 9, 16,...?),

figure series extrapolations such as Raven’s Progressive Matrices, analogical

reasoning (e.g., Complete the following: day : night :: white : ...?), and exclusion

tasks (e.g., Mark the odd one out: (a) 21, (b) 14, (c) 28, (d) 63, (e) 32). Studies of

inductive reasoning among nonwestern populations were reviewed by Irvine (1969,

1979; Irvine & Berry, 1988). He concluded that the structure found among western

participants with exploratory factor-analytic techniques is usually replicated. More

recent comparative studies, often based on comparisons of ethnic groups in the

U.S.A., have confirmed this conclusion (e.g., Fan, Willson, & Reynolds, 1995; Geary

& Whitworth, 1988; Hakstian & Vandenberg, 1979; Hennessy & Merrifield, 1976;

Naglieri & Jensen, 1987; Ree & Carretta, 1995; Reschly, 1978; Sandoval, 1982;

Sung & Dawis, 1981; Taylor & Ziegler, 1987; Valencia & Rankin, 1986; Valencia,

Rankin, & Oakland, 1997). Major differences in structure (for instance as reported by

Claassen & Cudeck, 1985) are exceptional. Inductive reasoning provides a strong

case for what Waitz, a nineteenth-century philosopher, called “the psychic unity of

mankind” (Jahoda & Krewer, 1997), according to which the basic structure and

operations of the cognitive system are universal while manifestations of these


structures may vary across cultures, depending on what is relevant in a particular

cultural context.

The validity of cross-cultural comparisons can be jeopardized by bias;

examples of bias sources are country differences in stimulus familiarity (Serpell,

1979) and item translations (Ellis, 1990; Ellis, Becker, & Kimmel, 1993). Bias refers

to the presence of score differences that do not reflect differences in the target

construct. Much research has been reported on fair test use; this research addresses

the question of whether a test predicts an external criterion such as job success

equally well in different ethnic, age or gender groups (e.g., Hunter, Schmidt, &

Hunter, 1979). The present study examines not bias in test use but bias in test

meaning; in other words, no reference is made here to social bias, unfairness, or

differential predictive validity. The study focuses on the question of whether

the same score, obtained in different cultural groups, has the same meaning

across these groups; scores for which this holds are unbiased. Two types of approaches have

been developed to deal with bias in cognitive tests. The first type, known under

various labels such as culture-free, culture-fair, and culture-reduced testing (Jensen,

1980), attempts to eliminate or minimize the differential influence of cultural factors,

like education, by adapting instrument features that may induce unwanted score

differences across countries. Raven's Matrices Tests are often considered to

exemplify this approach (e.g., Jensen, 1980). Despite the obvious importance of

good test design, the approach has come under critical scrutiny; it has been argued

that culture and test performance are so inextricably linked that a culture-free test

does not exist (Frijda & Jahoda, 1966; Greenfield, 1997).


Second, various statistical procedures have been proposed to examine the

appropriateness of psychological instruments in different ethnic groups. Examples

are exploratory factor analysis followed by target rotations and the computation of

factorial agreement between ethnic groups (Barrett, Petrides, Eysenck, & Eysenck,

1998; McCrae & Costa, 1997), simultaneous components analysis (Zuckerman,

Kuhlman, Thornquist, & Kiers, 1991), item bias statistics (Holland & Wainer, 1993),

and structural equation modeling (Little, 1997). It is remarkable that a priori and a

posteriori approaches (test adaptations and statistical techniques, respectively) have

almost never been combined, despite their common aim, mutual relevance, and

complementarity.

The present paper attempts to integrate a priori and a posteriori approaches

and takes equivalence as a starting point. Equivalence refers to the similarity of

psychological meaning across cultural groups (i.e., the absence of bias). Three

hierarchical types of equivalence can be envisaged (Van de Vijver & Leung, 1997a,

b). At the lowest level the issue of similarity of a psychological construct, as

measured by a test in different cultures, is addressed. An instrument shows

structural (also called functional) equivalence if it measures the same construct in

each cultural population studied. There is no claim that scores or measurement units

are comparable across cultures. In fact, instruments may be different across

cultures; structural equivalence is supported if it can be shown that in each culture

the same underlying construct (e.g., inductive reasoning) has been measured. The

intermediate level refers to measurement unit equivalence, defined by equal scale

units and unequal scale origins across cultural groups (e.g., the Celsius and Kelvin

temperature scales). In practical terms, this type of equivalence is


found when the same instrument has been administered in different groups but

scores are not directly comparable across groups because of the presence of

moderating variables with a bearing on group mean scores, such as intergroup

differences in stimulus familiarity. Structural equation modeling is suitable to address

measurement unit equivalence because it allows for a comparison of score metrics

across cultural groups. The third and highest level is called full score equivalence

and refers to identity of both scale units and origins. Only in the latter case can

scores be compared both within and across cultures using techniques like t tests and

analyses of (co)variance.
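
The distinction between measurement unit equivalence and full score equivalence can be illustrated with the temperature analogy. The following is a minimal sketch in Python and not part of the original study; the readings and all names (such as group_a_celsius and successive_differences) are hypothetical. It shows that when two groups share scale units but not scale origins, differences within each group are comparable, whereas a direct comparison of group means reflects the shift in origin.

```python
# Hypothetical illustration of measurement unit equivalence:
# equal scale units, unequal scale origins (Celsius vs. Kelvin).
KELVIN_OFFSET = 273.15  # shift of origin between the two scales


def successive_differences(values):
    """Differences between adjacent readings, rounded to suppress float noise."""
    return [round(b - a, 6) for a, b in zip(values, values[1:])]


def mean(values):
    return sum(values) / len(values)


# The same three (made-up) temperatures, recorded on two scales.
group_a_celsius = [10.0, 20.0, 35.0]
group_b_kelvin = [c + KELVIN_OFFSET for c in group_a_celsius]

# Within-group differences agree because the units are identical.
within_a = successive_differences(group_a_celsius)
within_b = successive_differences(group_b_kelvin)

# The raw difference in means reflects only the offset in origin.
mean_gap = mean(group_b_kelvin) - mean(group_a_celsius)
```

In this analogy, a comparison of raw means would suggest a large group difference although the underlying temperatures are identical; only the within-group differences can be interpreted.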

Full score equivalence assumes the complete absence of bias in the

measurement. Score differences between and within cultures are then entirely due to

differences in inductive reasoning. There are no fully adequate statistical tests of full score

equivalence, but some go a long way. The first is indirect and involves the use of

additional variables to (dis)confirm a particular interpretation of cross-cultural score

differences (Poortinga & Van de Vijver, 1987). Suppose that Raven’s Standard

Progressive Matrices Test is administered to adults in the U.S.A. and to illiterate

Bushmen. It may well be that the test provides a good picture of inductive reasoning

in both cultures. However, it is likely that score differences between the groups are

influenced by educational differences between them. Score differences within

and across groups have a different meaning in this case. A measure of test-

wiseness or previous test exposure, administered to all participants, can be used to

(dis)confirm that cross-cultural score differences are due to bias. Full score

equivalence is then not demonstrated but assumed, and corollaries are tested.


Other tests of full score equivalence that have been proposed compare the

patterning of cross-cultural score differences across items or subtests, often within

the framework of structural equation modeling. An example is multilevel covariance

structure analysis (Muthén, 1991, 1994) that compares the factor structure of pooled

within-country data to between-country data. Such an analysis assumes a sizable

number of cultural groups involved. Another example involves the modeling of latent

means in a structural model (e.g., Little, 1997). A frequently employed approach,

often based on item response theory, which is applicable when a small number of

cultures have been studied, is the examination of differential item functioning or item

bias (e.g., Holland & Wainer, 1993; Van der Linden & Hambleton, 1997). As long as

the sources of bias (such as education) affect all items in a more or less uniform

way, no statistical techniques will indicate that between-group differences are of a

different nature than within-group differences. Only if bias affects some items but not

others can the proposed techniques identify it. In sum, the establishment of full score

equivalence is an intricate issue. In many empirical studies dealing with mental

tests, this form of equivalence is merely assumed. As a consequence, statements

about the size of cross-cultural score differences often have an unknown validity.

Sternberg and Kaufman’s (1998) observation that we know that there are population

differences in human abilities, but that their nature is elusive, is very pertinent.

In line with current thinking in validity theory (Embretson, 1983; Messick,

1988), the present study combines test design and statistical analyses to deal with

bias (and equivalence). A distinction is made between internal and external

procedures to establish equivalence, depending on whether the procedure is based


on information derived from the scrutinized test itself (internal) or from additional

tests (external).

The present study examines the structural, measurement unit, and full score

equivalence of a measure of inductive reasoning in three culturally widely divergent

populations (Zambia, Turkey, and the Netherlands). Structural and measurement

unit equivalence are studied using both an internal and an external procedure. The

internal procedure to examine equivalence is based on item-generating rules that

underlie the instruments. In the external procedure, equivalence is scrutinized by

comparing the contribution of skill components to inductive reasoning across

countries. Three components are presumably relevant in the types of inductive

reasoning tasks studied here (Sternberg, 1977). The first is classification: treating

stimuli as exemplars of higher order concepts (e.g., the set CDEF as four

consecutive letters in the alphabet, as an instance of a group with one vowel, as a

group with three consonants, etcetera). Individuals are more successful in inductive

reasoning tasks when they can generate more of these classifications. Therefore, in

addition to classification, the skill to generate underlying rules on the basis of a

stimulus set was also tested. Finally, each generated rule has to be tested (e.g., Do

other groups also have four consecutive letters?). The latter skill, labeled rule

testing, was also assessed.


Method

Participants

An important consideration in the choice of countries was the presumed

strong influence of schooling on test performance (Van de Vijver, 1997); the

expenditure per head on education, a proxy for school quality, is strongly influenced

by national affluence. Countries with considerable differences in school systems and

educational expenditures per child were chosen. Furthermore, inclusion of at least

three different cultural groups decreases the number of alternative hypotheses to

explain cross-cultural differences (Campbell & Naroll, 1972). Zambia, Turkey, and

the Netherlands show considerable differences in educational systems and GDP

(per capita); the GDP figures per capita for 1995 were US$ 382, 2,814, and 25,635

for the three countries, respectively. School life expectancy of the three countries is

7.3, 9.7, and 15.5 years, respectively (United Nations, 1999). The choice of Zambia was also

made because of its lingua franca in school; English is the school language in

Zambia, which was convenient for developing and administering tasks.

In each country pupils of four consecutive grades were involved. In the

Netherlands these were the last two grades of primary school (Grades 5 and 6) and

the first two grades of secondary school. The same procedure was applied in

Zambia, where primary school has seven grades. In a pilot study it was found that

the tasks could not be adequately administered to pupils from Grade 5 because

most of these children still have insufficient knowledge of English, which is the first

language of few Zambians. Children start attending primary school in Turkey and

the Netherlands at the age of six, while schooling starts one year later in Zambia; as

a consequence, the Zambian pupils were on average two years older. The Zambian


sample comprised more than 20 cultural groups (the three largest being Tonga,

21%; Bemba, 13%; and Nyanja, 11%); the Turkish group was 99% Turkish, while in

the Dutch group 93% were Dutch, 2% Moroccan, and 2% Turkish.

Primary schooling in Turkey has five grades; pupils from the fifth grade of

primary school and the first three grades of secondary school were involved.

Secondary education is markedly different in the three countries. In Zambia a

nation-wide examination (with tests for reasoning and school achievement) at the

end of the last grade of primary school, Grade 7, is utilized to select pupils for

secondary school. After Grade 7, fewer than 20% of pupils continue their

education in either public or private secondary schools. Admittance to public schools

is conditional on the score at the Grade 7 Examination. Cutoff scores vary per

region and depend on the number of places available in secondary schools. In

urban areas there are some private schools; admittance to these schools usually

does not depend on examination results, but is mainly dependent on availability of

places as well as the ability and willingness of parents to pay school fees.

Participants both from public and private schools were included in our study. The

tremendous dropout at the end of Grade 7 has undoubtedly adversely affected the

generalizability of the data to the Zambian population at large and has also jeopardized

the comparability of the age cohorts, both within Zambia and across the three

countries. In Turkey and the Netherlands secondary schooling is more or less

intellectually streamed. An attempt was made to retain the intellectual heterogeneity

of the primary school group at secondary school level by selecting various types of

schools. The intellectual heterogeneity of the samples is clearly larger in Turkey and


the Netherlands than in Zambia; yet, none of the samples may be fully representative

of the age groups of their respective countries.

Insert Table 1 about here

Sample sizes are presented in Table 1; of the participants recruited 56%

came from urban and 44% from rural schools; 46% were female and 54% were male.

Instruments

The battery consisted of eight tasks, four with figures and four with letters as

stimuli. Each of these two stimulus modes had the same composition: a task of

inductive reasoning and three tasks of skill components that are assumed to

constitute important aspects of inductive reasoning. The first is rule classification,

called encoding in Sternberg’s (1977) model of analogical reasoning. The second is

rule generating, a combination of inference and mapping. The third is rule testing, a

combination of comparing and justification.

All tasks are based on item-generating rules, schematically presented in

Appendix A. All figure tasks are based on the following three item-generating rules:

(a) The same number of figure elements is added to subsequent figures in a

period (periods consist of either circles or squares, but never of both. A

period defines the number of figures that belong together. Examples of

items of all tasks, in which the item-generating rules are illustrated, can be

found in Appendix B).

(b) The same number of elements is subtracted from subsequent figures in a

period.


(c) The same number of elements is, alternatingly, added to and subtracted

from subsequent figures in a period.

The three item-generating rules are an example of a facet, a generic term for

all item features that are systematically varied across items. Two more facets

applied to all figure tasks. First, the number of figures in a period varies from two to

four. Second, the number of elements that are added to or subtracted from

successive figures of a period varies from one to three. Whenever possible, all

facet levels were crossed. However, for some combinations of facet levels no item

could be generated. For example, as each figure can have (in addition to a circle or

a square that are present in all items) only five elements (namely a hat, arrow, dot,

line, or bow), it is impossible to construct an item with two or three elements added

to each of four figures in a period.
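
The crossing of facet levels can be sketched as follows. This Python fragment is an illustration only and is not the procedure used to construct the tasks; the rule labels and the EXCLUDED set are assumptions. The set lists only the infeasible combinations named in the text (two or three elements added in a period of four figures); the actual tasks may have excluded further combinations.

```python
from itertools import product

# Facet levels as described in the text (labels are hypothetical).
RULES = ("add", "subtract", "alternate")
PERIOD_LENGTHS = (2, 3, 4)      # number of figures in a period
ELEMENTS_CHANGED = (1, 2, 3)    # elements added/subtracted per step

# Assumption: only the infeasible combinations explicitly named in the
# text are excluded here; the real design may have excluded more.
EXCLUDED = {("add", 4, 2), ("add", 4, 3)}


def crossed_facets():
    """Return all facet-level combinations that are not excluded."""
    return [
        combo
        for combo in product(RULES, PERIOD_LENGTHS, ELEMENTS_CHANGED)
        if combo not in EXCLUDED
    ]


designs = crossed_facets()
```

Each remaining tuple in designs describes one cell of the design for which an item could, in principle, be written.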

Inductive Reasoning Figures is a task of 30 items. Each item has five rows of

12 figures, the first eight of which are identical. One of the rows has been composed

according to a rule while in the other rows the rule has not been applied

consistently. The pupil has to mark the correct row.

Besides the common facets, two additional facets were used to generate the

items of Inductive Reasoning Figures. First, the figure elements added or subtracted

are either the same or different across periods. In the example of Appendix B there

is a constant variation because in each period a dot is added first, followed by a

dash and a hat. Second, periods do or do not repeat one another, meaning that the

first figures of each period are identical (except for a possible swap of circle and

square).


The 36 items of Rule Classification Figures consist of eight figures. Below

these figures the three item-generating rules were printed. In addition, the

alternative "None of the rules applies" has been added. The pupil had to indicate

which of the four alternatives applies to the eight figures above.

In addition to the common facets, the task has three additional facets. The

first two are the same as in Inductive Reasoning Figures. The third refers to the

presence or absence of periodicity cues. These cues refer to the presence of both

circles and squares in an item (as illustrated in the first item of Appendix B) or the

presence of either squares or circles (if all circles of the example were changed

into squares, no periodicity cues would be present).

Whereas in Inductive Reasoning Figures the number of different elements of

a figure could be either one, two, or three, Rule Classification Figures has another

level of this facet, referring to a variable number of elements. For example, in the

first period one element is added to subsequent figures and in the second period

two elements.

Each of the 36 items of Rule Generating Figures consists of a set of six

figures under which three lines with the numbers 1 to 6 are printed. In each item

one, two, or three triplets (i.e., groups of three figures) have been composed

according to one of the item-generating rules. Any of the six figures of an item can

be part of one, two, or three triplets. Pupils were asked to indicate all triplets that

constitute valid periods of figures. No information about the number of valid triplets

in each particular item was given. The total number of triplets was 63. In the data

analysis these were treated as separate, dichotomously scored items.


Two facets, in addition to the common ones, were included. First, periodicity

cues are either present or absent; the facet has the same meaning as in Rule

Classification Figures. Second, the number of valid triplets is one, two, or three.

A verbal specification is given at the top of each item of Rule Testing Figures.

In this specification three characteristics of the item are given, namely the

periodicity, the item-generating rule, and the number of elements varied between

subsequent figures of a period (e.g., "There are 4 figures in a group. 1 thing is

subtracted from figures which come after each other in a group"). Below this

specification four rows of eight figures have been drawn. One of the rows of eight

figures has been composed completely according to the specification. In some items

none of the four rows has been composed according to the specification. In this

case a fifth response alternative, "None of the rows has been composed according

to the specification" applies. This facet is labeled “None/one of the rules applies”.

The pupil has to mark the correct answer.

The facets and facet levels of Rule Testing Figures and Rule Classification

Figures were identical. In addition, the facet “Rows (do not) repeat each other” is

included in the former task. In some items the rows are fairly similar to each other

(except for minor variations that were essential for the solution), while in other items

each row has a completely different set of eight figures.

The letter tasks were based on five item-generating rules:

(a) Each group of letters has the same number of vowels. The vowels used in

the task are A, E, I, O, and U. As the status of the letter Y can easily

create confusion in English and Dutch where it can be both a consonant


and a vowel, the letter was never used in connection to the first item-

generating rule;

(b) Each group of letters has an equal number of identical letters that are the

same across groups (e.g., BBBB BBBB);

(c) Each group of letters has an equal number of identical letters that are not

the same across groups (e.g., GGGG LLLL);

(d) Each group of letters has a number of letters that appear the same (i.e., 1,

2, 3, or 4) number of positions after each other in the alphabet. The letters

A and B have a difference of one position, the letters A and C a difference

of two positions, etcetera;

(e) Each group of letters has a number of letters that appear the same (i.e., 1,

2, 3, or 4) number of positions before each other in the alphabet.

A second facet refers to the number of letters to which the rule applies. The number

could vary from 1 to 6. All items of the letter tasks are based on a combination of the

two facets described (i.e., item rule and number of letters). As in the figure tasks,

not all combinations of the facets are possible; for example, applications of the

fourth and fifth rule assume an item rule that is based on at least two letters in a

group.

Inductive Reasoning Letters bears resemblance to the Letter Sets Test in the

ETS-Kit of Factor-Referenced Tests (Ekstrom, French, & Harman, 1976). Each of

the 45 items consists of five groups of six letters. Four out of these five groups are

based on the same combination of the two facets (e.g., they all have two vowels).

The pupil has to mark the odd one out.
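
The logic of an item based on the first item-generating rule (equal numbers of vowels) can be sketched as follows. The Python code is a hypothetical illustration; the letter groups are invented and are not items from the test, and the function names are assumptions.

```python
from collections import Counter

VOWELS = set("AEIOU")  # Y is deliberately excluded, as in the task


def vowel_count(group):
    """Number of vowels in one group of letters."""
    return sum(1 for letter in group if letter in VOWELS)


def odd_one_out(groups):
    """Index of the single group whose vowel count deviates from the
    majority, or None if there is no unique deviant group."""
    counts = [vowel_count(g) for g in groups]
    majority, _ = Counter(counts).most_common(1)[0]
    deviants = [i for i, c in enumerate(counts) if c != majority]
    return deviants[0] if len(deviants) == 1 else None


# Invented item: four groups with two vowels, one group with three.
item = ["ABCEDF", "GHIJOK", "AEIBCD", "LMUNAP", "RSETIV"]
answer = odd_one_out(item)
```

The sketch covers the vowel rule only; the other four item-generating rules would each need their own check.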


Each of the 36 items of Rule Classification Letters consists of three groups of

six letters. Most items have been constructed according to some combination of the

two facets, while in some items the combination has not been used consistently. The

five item-generating rules are printed under the groups of letters. The pupil has to

indicate which item-generating rule underlies the item. If the rule has not been

applied consistently, the sixth response alternative, "None of the rules applies," is

the correct answer. As in Rule Classification Figures, the facet “None/one of the

rules applies” is included.

Each of the 30 items of Rule Generating Letters consists of a set of 6 groups

of letters (each made up of 1 to 9 letters). In each item up to five triplets of (subsets

of) the groups of letters have been composed according to a combination of the two

facets. The pupil is asked to indicate as many triplets as possible. Again, the pupils

were not informed about the exact number of triplets in each item. The total number

of triplets is 90, which are treated as separate items in the data analysis.

As in Rule Generating Figures, a facet about the number of valid triplets

(ranging here from 1 to 5) applies to the items, in addition to the common facets.

Rule Testing Letters consists of 36 items. Each item starts with a verbal

specification. In the specification two characteristics of the item are given, namely

the item-generating rule and the number of letters pertaining to the rule (e.g., "In

each group of letters there are 3 vowels"). Below this specification four rows of three

groups of six letters are printed. In most items one of the rows has been

composed according to the specification. The pupil has to find this row. In some

items none of the four rows has been composed according to the specification. In

this case a fifth response alternative, "None of the rows has been composed


according to the specification above the item," applies. The pupil has to indicate

which of the five alternatives applies.

The task had three facets, besides the common ones. As in Rule

Classification Letters, there is a facet indicating whether or not one of the

alternatives follows the rule. Also, as in Rule Testing Figures, there is a facet

indicating whether or not rows repeat each other.

The Turkish and Roman alphabets are not entirely identical. The (Roman)

letters Q, W, and X were not used here since they are uncommon in Turkish. The

presence of specifically Turkish letters, such as Ç, Ö, and Ü, necessitated the

introduction of small changes in the stimulus material (e.g., the sequence ABCD in

the Zambian and Dutch stimulus materials was changed into ABCÇ in Turkish).

Administration

The tasks were administered without time limit to all pupils of a class;

however, in the rural areas in Zambia the number of desks available was often

insufficient for all pupils to work simultaneously as each pupil had to have his or her

own test booklet and answer sheet. The experimenter then randomly selected a

number of participants.

The tasks were administered by local testers. The number of testers in

Zambia, Turkey, and the Netherlands were two, three, and two (the author being

one of them), respectively. Five were psychologists and three were experienced

psychological assistants. All testers followed a one-day training in the administration

of the tasks.


In Zambia English was used in the administration. A supplementary sheet in

Nyanja, the main language of the Lusaka region, was included in the test booklet

that explained the item-generating rules. Turkish was the testing language in the

Turkish group, and Dutch in the Dutch group.

The administration of all tasks to all pupils would presumably have taken

three school days. In order to avoid the loss of motivation and test fatigue, two

experimental conditions were introduced: the figure and the letter condition. The two

tasks of inductive reasoning were always administered; in the figure condition rule

classification, rule generating, and rule testing tasks with figures were also included,

while the three additional letter tasks were administered in the letter condition;

sample sizes for each condition are given in Table 1. So, all pupils received five

tasks: two tasks of inductive reasoning and three tasks of skill components (either

the three figure or the three letter skill component tasks). The administration of the

five tasks took place on two consecutive school days. The order of administration of

the tasks was random, with the constraint that the two tasks of inductive reasoning

were given on one day (either the first or the second testing day) and the remaining

on the other one.
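
The constraint on the order of administration can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions: the text does not specify how the randomization was carried out or at what level (class or pupil), so the function draw_schedule and its two-step shuffle are hypothetical.

```python
import random

INDUCTIVE_TASKS = ["Inductive Reasoning Figures", "Inductive Reasoning Letters"]
COMPONENT_TASKS = {
    "figure": [
        "Rule Classification Figures",
        "Rule Generating Figures",
        "Rule Testing Figures",
    ],
    "letter": [
        "Rule Classification Letters",
        "Rule Generating Letters",
        "Rule Testing Letters",
    ],
}


def draw_schedule(condition, rng):
    """Return [day1_tasks, day2_tasks] for one administration.

    Assumption: the two inductive reasoning tasks share one day and the
    three component tasks share the other; order is otherwise random.
    """
    inductive_day = INDUCTIVE_TASKS[:]
    component_day = COMPONENT_TASKS[condition][:]
    rng.shuffle(inductive_day)   # random order within the inductive day
    rng.shuffle(component_day)   # random order within the component day
    days = [inductive_day, component_day]
    rng.shuffle(days)            # either block may come first
    return days


schedule = draw_schedule("figure", random.Random(0))
```

Passing a seeded random.Random instance keeps the drawn schedule reproducible.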

Each of the eight instruments started with a one-page description of

the task, which was read aloud by the experimenter to the pupils; item-generating

rules of the stimulus mode were specified. This instruction was included in the

pupils’ test booklets. Examples were then presented of each of the item-generating

rules; explicit reference was made to which rule applied. Finally, the pupils were

asked to answer a number of exercises that, again, covered all item-generating

rules. After this instruction, the pupils were asked to answer the actual items. In


each figure task the serial position of each figure was printed on top of the item in

order to minimize the computational load of the task. The alphabet was printed at

the top of each page of the letter tasks, with the vowels underlined. It was indicated

to the pupils that they were allowed to look back at the instructions and examples

(e.g., to consult the item-generating rules). Experience showed that this was

infrequently done, probably because all tasks of a single stimulus mode utilized the

same rules.

Results

The section begins with a description of preliminary analyses, followed by the

main analyses. Per analysis, the hypothesis, statistical procedure and findings are

reported.

Preliminary Analyses

The internal consistencies of the instruments (Cronbach’s alpha) were

computed per culture and grade. Inductive Reasoning Figures showed an average

of .86 (range: .79-.93), Rule Classification Figures .83 (.71-.90), Rule Generating

Figures .89 (.84-.95), Rule Testing Figures .85 (.81-.89), Inductive Reasoning

Letters .79 (.69-.88), Rule Classification Letters .83 (.73-.90), Rule Generating

Letters .93 (.90-.95), and Rule Testing Letters .78 (.63-.85). Overall, the internal

consistencies yielded adequate values. Country differences were examined in a

procedure described by Hakstian and Whalen (1976). Data of all grades were

combined. The M statistic, which follows a chi-square distribution with two degrees of

freedom, was significant for Inductive Reasoning Figures (M = 64.92, p < .001), Rule

Classification Figures (M = 10.57, p < .01), Rule Generating Figures (M = 34.06, p


< .001), Rule Classification Letters (M = 11.57, p < .01), and Rule Testing Letters (M

= 12.40, p < .01). The Dutch group tended to have lower internal consistencies (a

possible explanation is given later).
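
For readers less familiar with the coefficient, Cronbach's alpha for dichotomously scored items can be computed as in the following sketch. The Python code uses the standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores); the function name and the toy response matrix are hypothetical and are not data from this study.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented responses of four pupils to three dichotomous items.
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_responses)
```

Because the same variance estimator is used in numerator and denominator, the choice between population and sample variance does not change the result.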

Insert Tables 2 and 3 about here

The average proportions of correctly solved items per country, grade, and

task are given in Table 2. Differences in average scores were tested in a

multivariate analysis of variance with country (3 levels; Zambia, Turkey, and the

Netherlands), grade (4 levels; 5 through 8), and gender (2 levels). Separate

analyses were carried out for the letter and figure mode. It may be noted that the

present analysis is reported merely for exploratory purposes to give insight into the

relative contribution of each factor to the overall score variation; conclusions about

country differences in inductive reasoning or its components are premature until full

score equivalence across countries has been shown. Table 3 gives the

estimated effect sizes (proportion of variance accounted for). The results were

essentially similar for the two modes. Country was highly significant (p < .001) in all

tasks, usually explaining more than 10%. Zambian pupils tended to show the lowest

scores and Dutch pupils the highest scores. Grade differences were as expected; as

can be confirmed in Table 2, scores increased with grade. The effect sizes were

substantial, usually larger than 10%, and highly significant for all tasks (p < .001).

Gender differences were small; significant differences were found for Rule Testing

Figures and Inductive Reasoning Letters (girls scored higher on both tasks), but

gender differences did not explain more than 1% on any task. The country by grade


interaction was significant in all analyses, explaining between 1 and 5%. As can be

seen in Table 2, score increases with grade tended to be smaller in the Netherlands

than in the other two countries. Country differences in scores were large in all

grades but tended to become smaller with age. These results are in line with a

meta-analysis (Van de Vijver, 1997) in which in the age range examined here, there

was no increase of cross-cultural score differences with age (contrary to what would

be predicted from Jensen’s, 1977, cumulative deficit hypothesis). Other interactions

were usually smaller and often not significant.

Structural Equivalence in Internal Procedure

Hypothesis. The first hypothesis addresses equivalence in internal

procedures by examining the decomposition of the item difficulties. The hypothesis

states that facet levels provide an adequate decomposition of the item difficulties of

each task in each country (Hypothesis 1a). See Table 4 for an overview of the

hypotheses and their tests.

Statistical procedure. Structural equivalence of all tests is examined using the linear

logistic model (LLM) (Fischer, 1974, 1995). It is an extension of the Rasch model,

which is one of the frequently employed models in item response theory. The Rasch

model holds that the probability that a subject k (k = 1, …, K) responds correctly to

item i is given by

$\exp(\theta_k - \beta_i) / [1 + \exp(\theta_k - \beta_i)]$, (1)

in which $\theta_k$ represents the person’s ability and $\beta_i$ the item difficulty. An item is


represented by only one parameter, namely its difficulty (unlike some other models

in item response theory in which each item also has a discrimination parameter,

sometimes in addition to a pseudo-guessing parameter). A sufficient statistic for

estimating a person’s ability is the total number of correctly solved items on the task.

Analogously, the number of correct responses at an item provides a sufficient

statistic for estimating the item difficulty. For our present purposes the main interest

is in item parameters.
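
The response function of equation (1) can be sketched as follows. The function and argument names are illustrative assumptions: `theta` stands for the person's ability and `beta` for the item difficulty.

```python
import math


def rasch_probability(theta, beta):
    """Probability of a correct response under the Rasch model (equation 1)."""
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))


# Ability equal to difficulty gives an even chance of a correct response.
p_equal = rasch_probability(0.0, 0.0)
# An able pupil on an easy item, and a less able pupil on a difficult item.
p_easy = rasch_probability(1.0, -1.0)
p_hard = rasch_probability(-1.0, 1.0)
```

Only the difference between ability and difficulty enters the function, which is why a single difficulty parameter per item suffices.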

The LLM imposes a constraint on the item parameter by specifying that the

item difficulty is the sum of an intercept (that is irrelevant here) and a sum of

underlying facet level difficulties, $\eta_j$:

$\beta_i = \alpha + \sum_j q_{ij}\,\eta_j$ (2)

The second step aims at estimating the facet level difficulties ($\eta_j$). Suppose that the

item is “BBBBNM BBBBKJ BBBBHJ BBFTHG BBBBHN”. In terms of the facets, the

item can be classified as involving (a) four letters (facet: number of letters); (b) equal

letters within and across groups of letters (facet: item rule). The above model

equation (2) specifies that the item parameter will be the sum of an intercept, two

facet level difficulties (namely the difficulty parameter of items dealing with four

letters and the difficulty parameter of items dealing with equal letters within and

across groups of letters), and a residual component.

The matrix Q (with elements qij) defines the independent variable; the matrix

has m rows (the number of items of the task) and n columns (the number of

independent facet levels of the task). Entries of the Q matrix are zero or one


depending on whether the facet level is absent or present in the item (interactions of

facets were not examined). In order to guarantee uniqueness of the parameter

estimates in the LLM, linear dependencies in the design matrix were removed by

leaving the first level of each facet out of the design matrix. This (arbitrary) choice

implied that the first level of each facet has a difficulty level of zero and that the size

and significance of other facet levels should be interpreted relative to this “anchor.”
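
The construction of a row of the Q matrix and the decomposition of equation (2) can be sketched as follows. This is a minimal sketch: the facet level labels, the numerical difficulties, and the function names are made up for illustration and are not estimates from the present study. The first level of each facet is left out of the design matrix and therefore serves as the anchor with a difficulty of zero.

```python
# Hypothetical facets of a letter task; the level labels are illustrative only.
FACETS = {
    "number_of_letters": ["two", "three", "four"],
    "item_rule": ["equal_within_and_across", "equal_within_unequal_across"],
}


def design_columns(facets):
    """Column labels of Q: every facet level except the first (the anchor)."""
    return [(facet, level) for facet, levels in facets.items() for level in levels[1:]]


def q_row(item, columns):
    """0/1 row of Q for one item, given the item's level on each facet."""
    return [1 if item[facet] == level else 0 for facet, level in columns]


def faceted_difficulty(row, eta, intercept=0.0):
    """Equation (2): intercept plus the facet level difficulties present in the item."""
    return intercept + sum(q * e for q, e in zip(row, eta))


columns = design_columns(FACETS)
# An item with four letters that are equal within and across groups.
item = {"number_of_letters": "four", "item_rule": "equal_within_and_across"}
row = q_row(item, columns)
eta = [-0.4, -0.9, 0.7]  # made-up difficulties, one per column of Q
beta = faceted_difficulty(row, eta)
```

Because the item rule of this item is the anchor level, only the facet level "four letters" contributes to the reconstructed item difficulty.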

The sufficient statistic for estimating the basic parameters is the number of

correct answers at the items that make up the facet level. As a consequence, there

will be a perfect rank order between this number of correct answers and $\eta_j$. Various

procedures have been developed to estimate the basic parameters. In the present

study conditional maximum likelihood estimation was used (details of this

computationally rather involved procedure are given by Fischer, 1974, 1995). An

important property of the LLM is the sample independence of its parameters;

estimates of the item difficulty and the basic parameters are not influenced by the

overall ability level of the pupils. This property is attractive here because it allows for

their estimation, even when average scores of cultural groups differ.

The LLM is a two-step procedure; the first consists of a Rasch analysis.

Estimates of item ($\beta_i$) and person ($\theta_k$) parameters of equation (1) are obtained. In the

second step the parameters of equation (2) are estimated. The item parameters are

used in the evaluation of the fit of the model. The fit of an LLM can be evaluated in

various ways. First, a likelihood ratio test can be computed, comparing the likelihood

of the (unrestricted) Rasch model to the (restricted) LLM. The statistic is of limited

value here. The ratio is affected by guessing (Van de Vijver, 1986). Because all

tasks employed a multiple-choice format, it is unrealistic to assume that a Rasch


model would hold. The usage of an LLM may seem questionable here because of

the occurrence of guessing (pupils were instructed to answer all items). However,

Van de Vijver (1986) has shown that guessing gives rise to a reduction of the

variance of the estimated person and item parameters but correlations of both

estimated parameters with their true values are hardly affected. A useful heuristic to

evaluate the degree of fit of the LLM is provided by the correlation between the

Rasch parameters of the first step of the analysis and the item parameters of the

second step, reconstructed by means of the design matrix. It amounts to correlating

item parameters of the first step ($\beta_i$) (the “unfaceted item difficulties”) with the item

parameters of the second step, using $\beta_i^* = \sum_j q_{ij}\,\eta_j$ (the “faceted item difficulties”). The

latter vector gives the item parameters estimated on the basis of the estimated facet

level difficulties. Higher correlations point to a better approximation of item level

difficulties by facet level difficulties and hence, to a better modelability of inductive

reasoning.
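
The fit heuristic described above can be sketched as follows, assuming that the facet level difficulties have already been estimated. All numerical values are made up for illustration, and the sketch does not perform the conditional maximum likelihood estimation itself.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up values: Q has one row per item and one column per facet level.
Q = [[0, 0], [1, 0], [0, 1], [1, 1]]
eta = [0.8, 1.5]                    # hypothetical facet level difficulties
unfaceted = [-1.1, -0.2, 0.3, 1.0]  # hypothetical Rasch item difficulties

# Faceted item difficulties reconstructed from the design matrix.
faceted = [sum(q * e for q, e in zip(row, eta)) for row in Q]
fit = pearson(unfaceted, faceted)
```

The intercept of equation (2) is omitted because adding a constant to all faceted difficulties leaves the correlation unchanged.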

Every task has its own design matrix, consisting both of facets that were

common to all tasks of a mode (e.g., the item-generating rules) and task-specific

facets (e.g., the number of correct answers in the rule generating tasks). The

analyses were carried out per country and grade. The item difficulties (with different

values per country and grade) and the Q matrix (invariant across grades and

countries for a specific task) were input to the analyses. This procedure was

repeated for each task, making a total of 8 (tasks) x 4 (grades) x 3 (countries) = 96

analyses.

The LLM is applied here as one of two tests of structural equivalence. This

type of equivalence addresses the relationship between measurement outcomes


and the underlying construct. The facets of the tasks are assumed to influence the

difficulty of the items. For example, it can be expected that rules in items of the letter

tasks are easier when they involve more letters. The analysis of structural

equivalence examines whether the facets exert an influence on item difficulty in

each culture. In more operational terms, structural equivalence is supported if the

correlation of each analysis is significantly larger than zero. A significant correlation

points to a contribution of the facet levels to the item difficulty: It indicates that the

facet levels contribute to the prediction of item difficulties.


Hypothesis test. As can be seen in Table 5, the correlations between the

unfaceted Rasch item parameters (of equation 1) and the faceted item parameters

(of equation 2) were high for all tasks in each grade in each country. These high

correlations provide powerful evidence that the same facets influence item difficulty

in each country. It can be concluded that Hypothesis 1a, according to which the item

difficulty decomposition would be adequate in each country was strongly supported.


The second question involves the patterning of the correlations of Table 5.

This question was addressed in an analysis of variance, with country (3 levels:

Zambia, Turkey, and the Netherlands), stimulus mode (2 levels: figure and letters

tests), and type of skill (4 levels: inductive reasoning and each of the three skill


component tasks) as independent variables; the correlation was the dependent

variable. The four grades were treated as independent replications. As can be seen

in Table 6, all main effects and first order interactions were significant. A significantly

lower correlation (and hence a poorer fit of the data to the model) was found for

figure tasks than for letter tasks (averages of .87 and .91, respectively), F(2, 72) =

24.85, p < .001. The effect was considerable, explaining 22% of the total score

variation. About the same percentage was explained by skill components, F(3, 72) =

37.94, p < .001. The lowest correlations were obtained for rule classification

and rule generating (both .87), followed by inductive reasoning (.89), while rule testing

showed the highest value (.93). The high values of the latter may be due to a

combination of a large number of items (long tests) and the presence of both very

easy and difficult facet levels in the rule testing tasks in each country; such facet

levels increase the dispersion and will give rise to high correlations. Country

differences explained about 10% of the score variation; the correlations of the

Turkish and Zambian groups were very close to each other (.91 and .90,

respectively), while the value for the Dutch group was .87. A closer inspection of the

data revealed that the largest differences between the countries were found for

tasks with relatively high proportions of correctly solved items. Therefore, the

difference in fit may be due to ceiling effects in the Dutch group, which by definition

reduce the correlation. The most important interaction, explaining 16% of the total

score variation, was observed between country and stimulus mode, F(2, 72) =

19.92, p < .001. Whereas the correlations did not differ more than .03 for both

stimulus modes in Zambia and Turkey, the difference in the Dutch sample was .09.


The interaction of stimulus mode and skill component was also significant,

F(3, 72) = 9.61, p < .001. Correlations of rule generating and rule testing differed on

average by .03 between the figure mode and the letter mode, while a much

larger difference of .08 was observed for rule classification. The interaction of country

and skill was significant though less important (explaining only 5%). The score

differences of the cultures were relatively small for rule generating and rule testing

and much larger for inductive reasoning and rule classification, mainly due to the

relatively low scores of the Dutch. Again, ceiling effects may have induced the effect

(not necessarily the largest attainable score because some facet levels remained

beyond the reach of many pupils, even the highest scorers). In sum, the analysis of

the correlations revealed high values for all tasks in the three countries. The

observed country differences were presumably more due to ceiling effects than to

country differences in modelability of inductive reasoning and its components.

Ceiling effects may also explain the lower internal consistencies in the Dutch data,

discussed before.

Insert Figure 1 and 2 about here

The estimated facet level difficulties (the estimated values of $\eta_j$, cf. equation

2) of all tests have been drawn in Figure 1 (figure tests) and Figure 2 (letter tests).

Higher values of $\eta_j$ refer to more difficult facet levels. The most striking finding of

both Figures is the proximity of the three country curves; this points to the cross-

cultural similarity in pattern of difficult and easy facet levels, which yields further

evidence for the structural equivalence of the instruments in the present samples.


Furthermore, most facet levels behaved as expected. As for the figure tasks, the

third item-generating rule (about alternating additions and subtractions) was

invariably the most difficult. Items were more difficult when they dealt with shorter

periods, when a variable number of elements were added or subtracted in

subsequent figures of a period, when periodicity cues were absent, and when

periods did not repeat each other. The number of valid triplets (only present in the

rule-generating task) showed large variation. Pupils found it relatively easy to

retrieve one correct solution, but relatively difficult to find all solutions when the item

contained two or three valid triplets.

The difficulty patterning of the letter tasks also followed expectation. Dealing

with equal letters was easier than dealing with positions in the alphabet. Items about

equal letters within and across groups (e.g., BBBBBB BBBBBB) were easier than

items about letters that were equal within and unequal across groups (e.g., BBBBBB

GGGGGG). Items were easier when the underlying rule involved more letters (which

facilitates recognition). Items about positions in the alphabet (the last two item-

generating rules of the letter mode) were easier when they involved smaller jumps

(e.g., ABCD was easier to recognize as a group of letters in which the position of

letters in the alphabet is important than ACEG, which in turn was easier to recognize than

ADGJ). As in the generating task of the figure mode, a strong effect of the number

of valid triplets was found. Finding all solutions turned out to be difficult and valid

triplets were often overlooked.


Measurement Unit Equivalence in Internal Procedure

Hypothesis. For each task the same facet level difficulties apply in each

country (Hypothesis 1b; cf. Table 4).

Statistical procedure. The LLM parameters can also be used to test

measurement unit equivalence. This type of equivalence goes beyond structural

equivalence by assuming that the tasks as applied in the three countries have the

same measurement units (but not necessarily the same scale origins). If the

estimated parameters of equation 2 are invariant across countries except for

random fluctuations, there is strong evidence for the invariance of the measurement

units of the test scales. This invariance would imply that the estimated facet level

difficulties in a particular country could be replaced by the difficulty of the same facet

in another country without affecting the fit of the model. For these analyses the data

for the grades in a country were combined because of the primary interest in country

differences.

Hypothesis test. Standard errors of the estimated facet level difficulties

ranged from 0.05 to 0.10. As can be derived from Figures 1 and 2, in each task there

are facet levels that differ significantly across countries. It can be safely concluded

that scores did not show complete measurement unit equivalence.

Yet, it is also clear from these Figures that some facet levels are not

significantly different across countries. So, the question arises to what extent facet

levels are identical across countries. The question was addressed using intraclass

correlations, measuring the absolute agreement of the estimated facet level

difficulties in the three countries. The absolute agreement of the estimated basic

parameters of a single task across countries was evaluated; per task the intraclass


correlation of the country by facet level matrix was computed. The letter tasks

showed consistently higher values than the figure tasks. The average agreement

coefficient was .91 for the figure tasks and .96 for the letter tasks (all intraclass

correlations were significantly above zero, p < .001). The high within-task agreement

points to an overall strong agreement of facet levels across countries. The estimated

facet level difficulties come close to being interchangeable across countries (despite

the significant differences of some facet levels).
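
An intraclass correlation for absolute agreement can be computed from a facet level by country matrix as in the following sketch. It assumes the two-way, single-measure form that is often labeled ICC(A,1); the text does not state which variant was used, so this choice is an assumption, as are the function name and the made-up matrix.

```python
def icc_absolute_agreement(matrix):
    """ICC(A,1): rows are facet levels, columns are countries."""
    n = len(matrix)      # number of facet levels
    k = len(matrix[0])   # number of countries
    grand = sum(sum(row) for row in matrix) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]
    # Sums of squares of the two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Made-up difficulties of four facet levels in three countries; the countries
# show the same pattern but differ by a constant shift.
shifted = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
icc_shifted = icc_absolute_agreement(shifted)
```

A constant shift between countries lowers the coefficient although the pattern of easy and difficult facet levels is identical, which is what distinguishes absolute agreement from mere consistency.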

A recurrent theme in the analysis is the better modelability of the letter tasks

as compared to the figure tasks, due to the wider range of facet level difficulties in the

letter than in the figure mode. The range differences may be a consequence of the

choices made in the test design stage. One of the problems of many existing figure

tests is their often implicit definition of permitted stimulus transformations (e.g.,

rotating and flipping). This lack of clarity, presumably an important source of cross-

cultural score differences, was avoided in the present study by spelling out all

permitted transformations in the test instructions. Apparently, the price to be paid for

providing the pupils with this information is a small variation in facet level difficulties.

Structural Equivalence in External Procedure

Hypothesis. The skill components contribute to inductive reasoning in each

country (Hypothesis 2a; cf. Table 4).

Insert Figure 3 about here

Statistical procedure. External procedures to establish equivalence scrutinize


the relationships between inductive reasoning and its componential skills. A specific

type of structural equation model was used, namely a MIMIC model (Multiple

Indicators, Multiple Causes; see Van Haaften & Van de Vijver, 1996, for another

cross-cultural application). A MIMIC is a model that links input and output through

one latent variable (see Figure 3). The core of the model is the latent variable,

labeled inductive reasoning. This variable, $\eta$, is measured by the two tasks of

inductive reasoning (the output variables). The input to the inductive reasoning

factor comes from the skill components; the components are said to influence the

latent factor and this influence is reflected in the two tasks of inductive reasoning. In

sum, the MIMIC model states that inductive reasoning is measured by two tasks

(IRF and IRL) and is influenced by three components (classification, rule generating,

and rule testing). The model equations are as follows:

$y_1 = \lambda_1 \eta + \varepsilon_1$; (3)

$y_2 = \lambda_2 \eta + \varepsilon_2$,

in which $y_1$ and $y_2$ denote observed scores on the two tasks of inductive thinking, $\lambda_1$

and $\lambda_2$ the factor loadings, and $\varepsilon_1$ and $\varepsilon_2$ error components. The latent variable, $\eta$, is

linked to the skill components in a linear regression function:

$\eta = \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_3 + \zeta$, (4)

where the gammas are the regression coefficients, the x-variables refer to the skill

components, and $\zeta$ is the error component. In order to make the estimates

identifiable, the factor loading of IRF, $\lambda_1$, was fixed at one.
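
The variances and covariances implied by a MIMIC model of this form can be sketched as follows. The sketch assumes that all error components are uncorrelated with one another and with the skill components. The names `phi` (covariance matrix of the skill components), `psi` (error variance of the latent variable), and `theta_eps` (error variances of the two tasks) are labels assumed here, and all numerical values are made up.

```python
def mimic_implied_moments(lam, gamma, phi, psi, theta_eps):
    """Model-implied moments of a MIMIC model with one latent variable.

    lam: factor loadings; gamma: regression coefficients;
    phi: covariance matrix of the x-variables; psi: error variance of the
    latent variable; theta_eps: error variances of the y-variables.
    """
    # phi_gamma[i] is the covariance of x_i with the latent variable.
    phi_gamma = [sum(p * g for p, g in zip(row, gamma)) for row in phi]
    # Variance of the latent variable: explained part plus error variance.
    var_eta = sum(g * pg for g, pg in zip(gamma, phi_gamma)) + psi
    cov_yy = [
        [li * lj * var_eta + (theta_eps[i] if i == j else 0.0)
         for j, lj in enumerate(lam)]
        for i, li in enumerate(lam)
    ]
    cov_yx = [[li * pg for pg in phi_gamma] for li in lam]
    return var_eta, cov_yy, cov_yx


# Made-up parameter values; the first loading is fixed at one.
lam = [1.0, 0.8]
gamma = [0.2, 0.3, 0.4]
phi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
var_eta, cov_yy, cov_yx = mimic_implied_moments(lam, gamma, phi, 0.5, [0.3, 0.4])
```

Fitting the model amounts to choosing parameter values for which these implied moments come as close as possible to the observed covariance matrix of the five tasks.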

An attractive feature of structural equation modeling is its allowance for

multigroup analysis. This means that the adequacy of the above model equations for

the data can be evaluated for all 12 data sets (4 grades x 3 countries) at once. The


fit statistics yield an overall assessment, covering all data sets.

The theoretical model underlying the study stipulates that the three skill

components constitute essential elements of inductive reasoning. In terms of the

MIMIC analysis, this means that structural equivalence would be supported by a

good fit of a model with three input and two output variables as described. Nested

models were analyzed. In the first step all parameters were held fixed across data

sets, while in subsequent steps similarity constraints were lifted in the following

order (cf. Table 7): the error variance (unreliability) of the tasks of inductive

reasoning, the intercorrelations of the tasks of skill component, the error variance of

the latent variable, the regression coefficients, and the factor loadings. The order

was chosen in such a way that invariance of relationships involving the latent

variable (i.e., regression coefficients and factor loadings) was retained as long as

possible. More precisely, structural equivalence would be supported when the

MIMIC model with the fewest equality constraints across countries shows a good fit

and all MIMIC parameters differ from zero (hypothesis 2a). It would mean that the

tasks of inductive reasoning constitute a single factor that is influenced by the same

skill component in each analysis (the possibility that there is a good fit but that some

regression coefficients or factor loadings are negative is not further considered here

because no covariances were negative).


Hypothesis test. The relationship of the skill components and inductive

reasoning tasks was examined in a MIMIC model (see Table 7; more details are


given in Appendix C). Nested models were fitted to the data of both stimulus modes.

The choice of a MIMIC model was mainly based on the relatively large change of all

fit statistics when constraints were imposed on the phi matrices (the covariance

matrices of the component skills; see the figure of Appendix B); therefore, the model

with equal factor loadings, regression coefficients, and error variances was chosen.

Although the letter tasks showed a better fit than the figure tasks, the choice of a

model was less straightforward. A MIMIC model with a similar pattern of free and

fixed parameters in both stimulus modes was chosen, mainly because of parsimony

(see footnote to Table 7 for a more elaborate explanation).

The standardized solution of the two models is given in Figure 3. As

hypothesized, all loadings and regression coefficients were positive and significant

(p < .01). It can be concluded that inductive reasoning with figure and letter stimuli

involves the same components in each country. This supports structural

equivalence, as predicted in hypothesis 2a. The regression coefficients of the figure

component tasks were unequal to each other: rule classification was least important,

followed by rule generating, while rule testing showed the largest contribution to

inductive reasoning. The letter mode did not show this patterning; the regression

coefficients of the component tasks of the letter mode were rather similar to one

another.

Measurement Unit Equivalence in External Procedure

Hypothesis. The skill components contribute in the same way to inductive

reasoning in each country (Hypothesis 2b; cf. Table 4).


Statistical procedure. Measurement unit equivalence can be scrutinized by

introducing and testing equality constraints in the MIMIC model. This type of

equivalence would be supported when a single MIMIC model with identical

parameter values holds in all countries. It may be noted that this test is stricter than

the ones proposed in the literature. Whereas the latter tend to analyze all tasks in a

single exploratory or confirmatory factor analysis, more specific relationships

between the tasks are considered here.

Hypothesis test. The psychologically most salient elements of the MIMIC, the

factor loadings, regression coefficients, and the explained variance of the latent

variable, were found to be invariant across countries. However, measurement unit

equivalence also requires the other parameter matrices to be invariant. In the figure

mode the model with equality constraints for all matrices showed a rather poor fit,

with an NNFI of .88, a GFI of .96, and an RMSEA of .045. An inspection of the delta

chi square values indicated that in particular the introduction of equality of

covariances of the skill components ($\Phi$) reduced the fit significantly. The letter tasks

showed a similar picture; the most restrictive model revealed values of .89 for the

NNFI, .82 for the GFI, and .041 for the RMSEA, which can be interpreted as a rather

poor fit. Again, equality of the $\Phi$ matrices led to a significant reduction of the fit. As

in our internal procedure to examine measurement unit equivalence, we found some

but inconclusive evidence for the measurement unit equivalence of the task scores

across countries; hypothesis 2b had to be rejected.


Full Score Equivalence

Hypothesis. Both tasks of inductive reasoning show full score equivalence

(Hypothesis 3; cf. Table 4).

Statistical procedure. Full score equivalence can be examined in an item bias

analysis. A logistic regression model was applied to analyze item bias (Rogers &

Swaminathan, 1993). Advantages of the model are the possibility to include more

than two groups and to examine both uniform and nonuniform bias (Mellenbergh,

1982). The combined samples of the three countries are used to determine cutoff

scores that split up the sample into three score level groups (low, medium, and high)

of about the same size. In the logistic regression procedure, culture (dummy coded),

score level, and their interaction are the independent variables, while the item

response is the dependent variable. A significant main effect of culture points to

uniform bias: individuals from at least one country show an unexpectedly low or high

score across all score levels on the item as compared to individuals with the same

test score from other cultures. A significant interaction points to nonuniform bias: the

systematic difference of the scores depends here on the score level; for example,

country differences in scores among low scorers are not found among high scorers.

Alpha was set at a (low) level of .001 in the item bias analyses in order to prevent

inflation of Type I errors, due to multiple testing (although, obviously, the power of

the procedure is adversely affected by this choice).
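
The grouping into score levels and the cell proportions on which the bias analysis rests can be sketched as follows. The sketch does not fit the logistic regression model itself; it only shows how pooled cutoff scores define the score levels and how the proportion of correct responses per culture and score level is obtained. Culture labels, scores, and function names are hypothetical.

```python
def score_level_cutoffs(total_scores):
    """Two cutoffs that split the pooled sample into three groups of about equal size."""
    ordered = sorted(total_scores)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def score_level(score, cutoffs):
    """Assign a total score to the low, medium, or high score level."""
    low_cut, high_cut = cutoffs
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "medium"
    return "high"


def proportions_correct(records, cutoffs):
    """records: (culture, total_score, item_response) tuples; item_response is 0 or 1."""
    cells = {}
    for culture, total, response in records:
        key = (culture, score_level(total, cutoffs))
        correct, count = cells.get(key, (0, 0))
        cells[key] = (correct + response, count + 1)
    return {key: correct / count for key, (correct, count) in cells.items()}


# Made-up records for two hypothetical cultures, A and B.
records = [
    ("A", 5, 0), ("A", 6, 1), ("B", 5, 0), ("B", 7, 0),
    ("A", 12, 1), ("B", 13, 0), ("A", 15, 1), ("B", 14, 1),
    ("A", 22, 1), ("B", 21, 1), ("A", 25, 1), ("B", 24, 1),
]
cutoffs = score_level_cutoffs([total for _, total, _ in records])
props = proportions_correct(records, cutoffs)
```

In these made-up data the two cultures differ at the low and medium score levels but not at the high level, which is the kind of pattern that a significant interaction of culture and score level would flag as nonuniform bias.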

Insert Figure 4 about here


Hypothesis test. In the introduction two approaches were mentioned to

examine full score equivalence that are based on structural equation modeling:

multilevel covariance structure analysis and the modeling of latent means. The

former could not be used due to the small number of countries involved, while the

latter was precluded because of the incomplete support of measurement unit

equivalence. This lack of support indeed prohibits any analysis of full score

equivalence. Yet, because the bias analysis yielded interesting results, it is reported

here for exploratory purposes. Of the 30 items of the IRF, 15 items were biased (13

items uniform, 11 items nonuniform), mainly involving the Dutch-Zambian

comparison. The occurrence of bias was related to the difficulty of the items; both

the easiest and most difficult items showed the most bias. The correlation between

the presence of bias (0 = absent, 1 = present) and the deviance of the item score

from the mean (i.e., average item score - overall average) was .64 (p < .001). The

correlation suggests a methodological artifact, such as floor and ceiling effects. This

was confirmed by an inspection of the contingency tables underlying the logistic

regression analyses. Figure 4 depicts empirical item characteristic curves of two

items that showed both uniform and nonuniform bias. The upper panel shows a

relatively easy item (with an overall mean of .79) and the lower panel a relatively

difficult item (mean of .33). The bias for the easy item is induced by country

differences at the lowest score level that are not reproduced at the higher levels.

Analogously, the scores for the difficult item remain close to the guessing level

(of .20) in the two lowest score levels, while there is more score differentiation in the

highest scoring group. The score patterns of Figure 4 were found for several items.

It appears that ceiling and floor effects led to item bias in the IRF.


Three items were found to be biased in the IRL (one uniform and two

nonuniform). The Zambian pupils showed relatively high scores on these items.

Because the items were few and involved different facet levels, the reasons for the

bias were not understood, a fairly common finding in item bias research (cf. Van de

Vijver & Leung, 1997b). Floor and ceiling effects did not occur, which points to an

important difference between the two tasks of inductive reasoning; whereas at the

IRF pupils tended to answer items either with a very low or a very high level of

accuracy, pupil scores at the IRL varied more gradually. Similarly, in the IRF there

were no facet levels that were either too difficult or too easy for most of the sample,

but both types of facet levels were present in the IRL.

Discussion

The equivalence of two tasks of inductive reasoning was examined in a

cross-cultural study involving 632 Dutch, 877 Turkish, and 704 Zambian pupils from

the highest two grades from primary and the lowest two grades from secondary

school. Two stimulus modes were examined: letters and figures. In each mode tasks

for inductive reasoning and for each of its components, classification, generation,

and testing, were administered. The structural, measurement unit, and full score

equivalence of the instruments in these countries were studied. A MIMIC model was

fitted to the data, linking skill components to inductive reasoning through a latent

variable, labeled inductive reasoning (external procedure). A linear logistic model

was utilized to examine to what extent in each country item difficulties could be

adequately decomposed into the underlying rules that were used to generate the

items (internal procedure). In keeping with past research, structural equivalence was

strongly supported; yet, measurement unit equivalence was not fully supported. It is


interesting to note that two different statistical models, item response theory (LLM)

and structural equation modeling (MIMIC), looking at different aspects of the data

(facet level difficulties in the LLM and covariances of tasks with componential skills)

yielded the same conclusion about measurement unit equivalence.

Critics might argue that the emphasis on equivalence of the present study is a

misnomer and detracts attention from the real cross-cultural differences

observed with these carefully constructed instruments. In this line of reasoning the

results would show massive differences in inductive reasoning across countries, with

Zambian pupils having the lowest skill level, Turkish pupils having an intermediate

position, and Dutch pupils having the highest level. The validity of this conclusion is

underscored by the LLM analyses in which it was found that most facet level

difficulties are identical and interchangeable across the three countries while a small

number is country dependent. Even if score comparisons are restricted to the facet

levels with the same difficulties, at least some of the score differences of the

countries are likely to remain. In this line of reasoning the current study has

demonstrated the presence of at least some but presumably large differences in

inductive reasoning, with Western subjects showing the highest skill levels. In my

view, this interpretation is based on a simplistic and untenable view of country score

differences. These differences are not just a matter of differences in inductive

reasoning skills. It may well be that differences of country scores on the tasks are

partly or entirely due to additional factors. Various educational factors may play a

role here, as is often the case in comparisons of highly dissimilar cultural groups. In

a meta-analysis, Van de Vijver (1997) found that educational expenditure is a

significant predictor of country differences in mental test performance. Does quality


of schooling have an influence on inductive reasoning? I concur with Cole (1996),

who after reviewing the available cross-cultural evidence, concluded that schooling

does not have a formative influence on higher-order forms of thinking but tends to

broaden the domains in which these skills can be successfully applied. Schooling

facilitates the use of skills through training and through exposure to psychological and

educational tests (cf. Rogoff, 1981; Serpell, 1993). The educational differences of

the populations of the current study are massive. For example, attending

kindergarten is more common in Turkey and the Netherlands than in Zambia, and

schools in Zambia have a fraction of the learning material that schools in Turkey and

the Netherlands have at their disposal. The interpretation of the country differences

observed in the present study as reflecting real differences is based on an

underestimation of the impact of various context-related (educational) factors and an

overestimation of the ability of the tasks employed here to measure inductive reasoning

in all countries. Tasks that capitalize less on schooling and are more derived from

everyday experiences may show a different patterning of country differences.

The present results replicate the findings of many studies on structural

equivalence; strong support was found that the instruments measure inductive

reasoning in the three countries. The present results make it very unlikely that there

are major cross-cultural differences in strategies and processes involved in inductive

reasoning in the populations studied. These results extend findings of numerous

factor analytic studies in showing that skill components contribute in a largely

identical way to inductive thinking and item difficulty is governed by complexity rules

that are largely identical across cultures.


The results also show that comparisons of scores obtained in different

countries are not allowed, despite the careful item construction process. This

negative finding on the numerical comparability may be due to the large cultural

distance of the countries involved here. However, it also points to the need to

address measurement unit and full score equivalence in cross-cultural research.

Cross-cultural comparisons of data that have not been scrutinized for equivalence

abound in the literature. In fact, it is rather difficult to find examples of data in which

the equivalence was examined in an appropriate way. Scores are often numerically

compared across cultures (assuming full score equivalence) when only structural

equivalence has been demonstrated; examples can be found in the cross-cultural

comparisons of the Eysenck personality scales (e.g., Barrett et al., 1998). It is

difficult to defend the practice of comparing scores across cultures when equivalence

has not been tested or when only structural equivalence has been observed. The

present study underscores the need to study equivalence of data before comparing

test scores. A more prudent treatment of cross-cultural score differences is badly

needed. We have firmly established the commonality of basic cognitive functions in

several cultural and ethnic groups (Waitz’s “psychic unity”), but we still have to come

to grips with the question of how to design cognitive tests that allow for numerical

score comparisons across a wide cultural range.

A final issue concerns the external validity of the present findings: To what

populations can the present results be generalized? The three countries involved in

the study differ greatly in affluence. Given the strong findings on

structural equivalence, it is realistic to assume that inductive reasoning is a universal

with largely identical components in schooled populations, at least as of the end of


primary school. Future studies should address the question of whether

measurement unit equivalence would be fully supported when the cultural distance

between the countries is smaller.


References

Barrett, P. T., Petrides, K. V., Eysenck, S. B. G., & Eysenck, H. J. (1998). The

Eysenck Personality Questionnaire: An examination of the factorial similarity of P, E,

N, and L across 34 countries. Personality and Individual Differences, 25, 805-819.

Campbell, D. T., & Naroll, R. (1972). The mutual methodological relevance of

anthropology and psychology. In F. L. K. Hsu (Ed.), Psychological anthropology.

Cambridge, MA: Schenkman.

Carroll, J. B. (1993). Human cognitive abilities. A survey of factor-analytic

studies. Cambridge: Cambridge University Press.

Claassen, N. C., & Cudeck, R. (1985). Die faktorstruktuur van die Nuwe Suid-

Afrikaanse Groeptoets (NSAG) by verskillende bevolkingsgroepe [The factor

structure of the New South African Group Test (NSAGT) in various population

groups]. South African Journal of Psychology, 15, 1-10.

Cole, M. (1996). Cultural psychology: A once and future discipline.

Cambridge, MA: Harvard University Press.

Ekstrom, R. B., French, J. W., & Harman, H. H. (1976). Kit of factor-

referenced tests. Princeton, NJ: Educational Testing Service.

Ellis, B. B. (1990). Assessing intelligence cross-nationally: A case for

differential item functioning detection. Intelligence, 14, 61-78.

Ellis, B. B., Becker, P., & Kimmel, H. D. (1993). An item response theory

evaluation of an English version of the Trier Personality Inventory (TPI). Journal of

Cross-Cultural Psychology, 24, 133-148.

Embretson, S. E. (1983). Construct validity: Construct representation versus

nomothetic span. Psychological Bulletin, 93, 179-197.


Fan, X., Willson, V. L., & Reynolds, C. R. (1995). Assessing the similarity of

the factor structure of the K-ABC for African-American and White children. Journal of

Psychoeducational Assessment, 13, 120-131.

Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests

[Introduction to the theory of psychological tests]. Bern: Huber.

Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W.

Molenaar (Eds.), Rasch models. Foundations, recent developments and

applications. New York: Springer.

Frijda, N., & Jahoda, G. (1966). On the scope and methods of cross-cultural

research. International Journal of Psychology, 1, 109-127.

Geary, D. C., & Whitworth, R. H. (1988). Is the factor structure of the WISC-R

different for Anglo- and Mexican-American children? Journal of Psychoeducational

Assessment, 6, 253-260.

Greenfield, P. M. (1997). You can't take it with you: Why ability assessments

don't cross cultures. American Psychologist, 52, 1115-1124.

Gustafsson, J-E. (1984). A unifying model for the structure of intellectual

abilities. Intelligence, 8, 179-203.

Hakstian, A. R., & Vandenberg, S. G. (1979). The cross-cultural

generalizability of a higher-order cognitive structure model. Intelligence, 3, 73-103.

Hakstian, A. R., & Whalen, T. E. (1976). A k-sample significance test for

independent alpha coefficients. Psychometrika, 41, 219-231.

Hennessy, J. J., & Merrifield, P. R. (1976). A comparison of the factor

structures of mental abilities in four ethnic groups. Journal of Educational

Psychology, 68, 754-759.


Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning.

Hillsdale, NJ: Erlbaum.

Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of

employment tests by race: A comprehensive review and analysis. Psychological

Bulletin, 86, 721-735.

Irvine, S. H. (1969). Factor analysis of African abilities and attainments:

Constructs across cultures. Psychological Bulletin, 71, 20-32.

Irvine, S. H. (1979). The place of factor analysis in cross-cultural methodology

and its contribution to cognitive theory. In L. Eckensberger, W. Lonner, & Y. H.

Poortinga (Eds.), Cross-cultural contributions to psychology. Lisse, the Netherlands:

Swets & Zeitlinger.

Irvine, S. H., & Berry, J. W. (1988). The abilities of mankind: A revaluation. In

S. H. Irvine & J. W. Berry (Eds.), Human abilities in cultural context. Cambridge:

Cambridge University Press.

Jahoda, G., & Krewer, B. (1997). History of cross-cultural and cultural

psychology. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-

cultural psychology (2nd ed., vol. 1). Chicago: Allyn & Bacon.

Jensen, A. R. (1977). Cumulative deficit in intelligence of Blacks in the rural

South. Developmental Psychology, 13, 184-191.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

Little, T. D. (1997). Mean and covariance structures (MACS) analyses of

cross-cultural data: Practical and theoretical issues. Multivariate Behavioral

Research, 32, 53-76.


McCrae, R. R., & Costa, P. T. (1997). Personality trait structure as a human

universal. American Psychologist, 52, 509-516.

Mellenbergh, G. J. (1982). Contingency table models for assessing item bias.

Journal of Educational Statistics, 7, 105-118.

Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational measurement

(3rd ed). Hillsdale, NJ: Erlbaum.

Muthén, B. O. (1991). Multilevel factor analysis of class and student

achievement components. Journal of Educational Measurement, 28, 338-354.

Muthén, B. O. (1994). Multilevel covariance structure analysis. Sociological

Methods & Research, 22, 376-398.

Naglieri, J. A., & Jensen, A. R. (1987). Comparison of Black-White differences

on the WISC-R and the K-ABC: Spearman's hypothesis. Intelligence, 11, 21-43.

Poortinga, Y. H., & Van de Vijver, F. J. R. (1987). Explaining cross-cultural

differences: Bias analysis and beyond. Journal of Cross-Cultural Psychology, 18,

259-282.

Ree, M. J., & Carretta, T. R. (1995). Group differences in aptitude factor

structure on the ASVAB. Educational and Psychological Measurement, 55, 268-277.

Reschly, D. (1978). WISC-R factor structures among Anglos, Blacks,

Chicanos, and Native-American Papagos. Journal of Consulting and Clinical


Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression

and Mantel-Haenszel procedures for detecting differential item functioning. Applied

Psychological Measurement, 17, 105-116.


Rogoff, B. (1981). Schooling and the development of cognitive skills. In H. C.

Triandis & A. Heron (Eds.), Handbook of cross-cultural psychology: Volume 4,

Developmental psychology. Boston: Allyn & Bacon.

Sandoval, J. (1982). The WISC-R factorial validity for minority groups and

Spearman's hypothesis. Journal of School Psychology, 20, 198-204.

Serpell, R. (1979). How specific are perceptual skills? British Journal of


Serpell, R. (1993). The significance of schooling. Life journeys in an African

society. Cambridge: Cambridge University Press.

Sternberg, R. J. (1977). Intelligence, information processing, and analogical

reasoning: The componential analysis of human abilities. New York: Wiley.

Sternberg, R. J., & Kaufman, J. C. (1998). Human abilities. Annual Review of


Sung, Y. H., & Dawis, R. V. (1981). Level and factor structure differences in

selected abilities across race and sex groups. Journal of Applied Psychology, 66,

613-624.

Taylor, R. L., & Ziegler, E. W. (1987). Comparison of the first principal factor

on the WISC-R across ethnic groups. Educational and Psychological Measurement,

47, 691-694.

Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs,

No. 1.

United Nations (1999). Indicators on education [On-line]. Available Internet:

www.un.org/depts/unsd/social/education.htm.


Valencia, R. R., & Rankin, R. J. (1986). Factor analysis of the K-ABC for

groups of Anglo and Mexican American children. Journal of Educational

Measurement, 23, 209-219.

Valencia, R. R., Rankin, R. J., & Oakland, T. (1997). WISC-R factor structures

among White, Mexican American, and African American children: A research note.

Psychology in the Schools, 34, 11-16.

Van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied

Psychological Measurement, 10, 45-57.

Van de Vijver, F. J. R. (1997). Meta-analysis of cross-cultural comparisons of

cognitive test performance. Journal of Cross-Cultural Psychology, 28, 678-709.

Van de Vijver, F. J. R., & Leung, K. (1997a). Methods and data analysis of

comparative research. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.),

Handbook of cross-cultural psychology, 2nd Ed., Vol. 1. Chicago: Allyn & Bacon.

Van de Vijver, F. J. R., & Leung, K. (1997b). Methods and data analysis for

cross-cultural research. Newbury Park, CA: Sage.

Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item

response theory. New York: Springer.

Van Haaften, E. H., & Van de Vijver, F. J. R. (1996). Psychological

consequences of environmental degradation. Journal of Health Psychology, 1, 411-

429.

Willemsen, M. E., & Van de Vijver, F. J. R. (under review). Context effects in

logical reasoning in the Netherlands and Zambia.


Zuckerman, M., Kuhlman, D. M., Thornquist, M., & Kiers, H. A. L. (1991). Five

(or three) robust questionnaire scale factors of personality without culture.

Personality and Individual Differences, 12, 929-941.


Table 1

Sample Size per Culture, Grade, and Experimental Condition

Grade^a

Country Test condition 5 6 7 8 Total

Zambia Figure 80 79 94 123 376

Letter 81 81 87 79 328

Turkey Figure 127 97 95 102 421

Letter 139 107 110 100 456

Netherlands Figure 117 74 51 77 319

Letter 83 91 77 62 313

Total 627 529 514 543 2213

^a In Zambia the grades are 6, 7, 8, and 9, respectively.


Table 2

Average Proportion of Correctly Solved Items per Task, Grade, and Culture

Task

Country Grade IRF RCF RGF RTF IRL RCL RGL RTL

Zambia 6 .40 .44 .53 .39 .49 .40 .37 .41

7 .55 .53 .55 .43 .56 .60 .47 .56

8 .56 .64 .55 .53 .58 .61 .44 .58

9 .62 .68 .64 .54 .61 .54 .48 .58

Turkey 5 .47 .51 .44 .42 .50 .54 .39 .49

6 .48 .57 .56 .46 .52 .53 .42 .47

7 .66 .73 .64 .58 .64 .70 .56 .65

8 .65 .75 .69 .63 .64 .71 .52 .60

Netherlands 5 .67 .80 .65 .64 .60 .63 .51 .58

6 .74 .73 .72 .67 .64 .68 .57 .66

7 .70 .74 .70 .66 .68 .72 .63 .67

8 .78 .84 .76 .77 .70 .74 .60 .69

IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule

Generating Figures; RTF: Rule Testing Figures; IRL: Inductive Reasoning Letters;

RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing

Letters


Table 3

Effect Sizes of Multivariate Analyses of Variance of the Psychological Tests per Test Mode

                                        Skill component
Independent    Multi-       Inductive   Rule             Rule         Rule
variable       variate^a    reasoning   classification   generating   testing

(a) Figure mode
Country (C)    .135***      .132***     .183***          .139***      .200***
Grade (G)      .073***      .108***     .149***          .132***      .125***
Sex (S)        .011*        .001        .002             .001         .010**
C × G          .035***      .013*       .063***          .041***      .014*
C × S          .012**       .016***     .017***          .014***      .014***
G × S          .009**       .004        .006             .004         .011**
C × G × S      .009*        .010        .007             .008         .004

(b) Letter mode
Country        .102***      .078***     .113***          .130***      .113***
Grade          .061***      .122***     .114***          .088***      .125***
Sex            .014**       .010**      .002             .000         .001
C × G          .030***      .035***     .051***          .028***      .051***
C × S          .014***      .014**      .002             .013**       .016***
G × S          .005         .007        .000             .003         .001
C × G × S      .014***      .017**      .012             .018**       .009

Note. Significance levels of the effect sizes refer to the probability level of the corresponding F ratio of the independent variable(s).

^a Wilks’ lambda. *p < .05. **p < .01. ***p < .001.


Table 4

Overview of the Hypothesis Tests and the Statistical Models Used

Columns: Procedure to establish equivalence; Statistical aspects (Question examined, Statistical model used); Conditions for equivalence (Structural equivalence, Measurement unit equivalence, Full score equivalence)

Internal

  Focus on tests of inductive reasoning
    Question examined: Are facet level difficulties and item difficulties related?
    Statistical model used: Linear logistic model
    Structural equivalence: Correlations significant in each country (hypothesis 1a)
    Measurement unit equivalence: Correlations significant and identical across countries (hypothesis 1b)

  Focus on tests of inductive reasoning
    Question examined: Is there item bias?
    Statistical model used: Logistic regression
    Full score equivalence: Absence of item bias (hypothesis 3)

External

  Focus on relationship of skill components and inductive reasoning
    Question examined: Are tests of skill components and inductive reasoning related?
    Statistical model used: Structural equation modeling
    Structural equivalence: MIMIC parameters significant in each country (hypothesis 2a)
    Measurement unit equivalence: MIMIC parameters significant and identical across countries (hypothesis 2b)

MIMIC = Multiple Indicators Multiple Causes


Table 5

Accuracy of the Design Matrices per Task and per Country: Means (and

Standard Deviations) of Correlation

Stimulus mode

Figures Letters

Skill Zam Tur Net Zam Tur Net

Inductive reasoning .90 (.03) .90 (.02) .81 (.03) .90 (.02) .92 (.01) .90 (.02)

Rule classification .84 (.04) .87 (.01) .76 (.02) .92 (.01) .93 (.02) .88 (.03)

Rule generating .88 (.01) .86 (.03) .81 (.02) .87 (.01) .89 (.02) .92 (.02)

Rule testing .95 (.01) .93 (.01) .91 (.03) .94 (.01) .95 (.01) .94 (.01)

Net = Netherlands. Tur = Turkey. Zam = Zambia.


Table 6

Analysis of Variance of Correlations with Country, Stimulus Mode, and Skill as

Independent Variables

Source df F Variance explained

Country (C) 2 24.85*** .10

Stimulus mode (S) 1 79.18*** .22

Skill (Sk) 3 37.94*** .21

C × S 2 19.92*** .16

C × Sk 6 3.70** .05

S × Sk 3 9.61*** .10

C × S × Sk 6 1.95 .03

Within-cell error 72 (.0006) .14

*p < .05. **p < .01. ***p < .001.


Table 7

Fit Indices for Nested Multiple Indicators Multiple Causes Models of Figure and Letter Tasks

                                       Contribution to χ² per
                                       country (percentage)
Invariant matrices   χ² (df)           Zam   Tur   Net   NNFI   GFI   RMSEA   Δχ² (df)

(a) Figure mode
Λy …                 533.97*** (167)   21    32    47    .88    .96   .045
Λy …                 437.28*** (145)   19    35    47    .89    .96   .043    96.69*** (22)
Λy …                 180.52*** (79)    20    27    53    .93    .98   .034    256.76*** (66)
Λy …                 134.01*** (46)    20    19    61    .91    .98   .042    46.51 (33)
Λy …                 98.72*** (35)     20    18    62    .90    .99   .041    35.29*** (11)

(b) Letter mode
Λy …                 473.33*** (167)   33    33    34    .89    .82   .041
Λy …                 364.63*** (145)   28    30    41    .91    .87   .037    108.70*** (22)
Λy …                 180.60*** (79)    35    26    39    .92    .90   .034    184.03*** (66)
Λy …                 82.07** (46)      56    25    19    .95    .94   .027    98.53*** (33)
Λy …                 61.61** (35)      54    24    23    .96    .94   .027    20.46* (11)

Note. The choice of a MIMIC model for the figure tests was mainly based on the relatively large change of all fit statistics when constraints were imposed on the phi matrices; therefore, the model with equal factor loadings, regression coefficients, and error variances was chosen. The same model of free, fixed, and constrained parameters also showed an adequate fit for the letter tests. Releasing constraints on the regression coefficients revealed a significant increase of fit. The question was addressed as to whether the decrease of the χ² statistic was due to systematic country differences in the regression coefficients. An inspection of the regression coefficients per country did not show a clear patterning of country differences. The same question was also addressed by two more analyses; in the first the regression coefficients were allowed to vary across countries but not across grades, while in the second analysis variation was allowed across grades but not across countries. It was found that equality of regression coefficients across the four grades of a country yielded a poorer fit than equality across the three countries per grade (first analysis: χ²(73, N = 1094) = 168.40, p < .001; GFI = .91, NNFI = .92, RMSEA = .035; second analysis: χ²(70, N = 1094) = 131.54, p < .001; GFI = .90, NNFI = .95, RMSEA = .029). The two analyses confirmed that a choice of a model of equal regression coefficients of the letter mode across countries does not lead to the elimination of relevant country differences.

Net = the Netherlands; Tur = Turkey; Zam = Zambia; NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation; Δχ² = decrease of χ² value.

*p < .05. **p < .01. ***p < .001.
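The nested-model comparisons in Table 7 rest on chi-square difference tests: the drop in χ² between two adjacent models is referred to a chi-square distribution with degrees of freedom equal to the drop in df. The sketch below recomputes these differences and their tail probabilities from the χ² and df values in the table. The helper names (`chi2_sf`, `nested_differences`) are illustrative, and the tail probability is computed with a standard recurrence for the incomplete gamma function; this is not necessarily how the original analyses were run.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting values: Q(1/2, z) = erfc(sqrt(z)), Q(1, z) = exp(-z).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def nested_differences(models):
    """Return (delta chi-square, delta df, p) for each pair of adjacent models."""
    out = []
    for (chi_a, df_a), (chi_b, df_b) in zip(models, models[1:]):
        d_chi, d_df = chi_a - chi_b, df_a - df_b
        out.append((round(d_chi, 2), d_df, chi2_sf(d_chi, d_df)))
    return out


# (chi-square, df) per row of Table 7, most constrained model first.
figure_models = [(533.97, 167), (437.28, 145), (180.52, 79), (134.01, 46), (98.72, 35)]
letter_models = [(473.33, 167), (364.63, 145), (180.60, 79), (82.07, 46), (61.61, 35)]

figure_diffs = nested_differences(figure_models)
letter_diffs = nested_differences(letter_models)
for d_chi, d_df, p in figure_diffs + letter_diffs:
    print(f"delta chi2 = {d_chi:7.2f}, delta df = {d_df:2d}, p = {p:.4f}")
```

The only nonsignificant step is the figure-mode difference of 46.51 with 33 df, which agrees with the missing asterisk in the table.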


Figure Captions

Figure 1. Estimated facet level difficulties per test and country of the figure mode

Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; P3: Number of figures per period: 3; P4: Number of figures per period: 4; D2: Number of different elements of subsequent figures: 2; D3: Number of different elements of subsequent figures: 3; DV: Number of different elements of subsequent figures: variable; V: Variation across periods: variable; C: Periodicity cues: absent; PR: Periods repeat each other: no; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.

Figure 2. Estimated facet level difficulties per test and country of the letter mode

Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; R4: Item rule: 4; R5: Item rule: 5; L2: Number of letters: 2; L3: Number of letters: 3; L4: Number of letters: 4; L5: Number of letters: 5; L6: Number of letters: 6; LV: Number of letters: variable; D2: Difference in positions in alphabet: 2; D3: Difference in positions in alphabet: 3; D4: Difference in positions in alphabet: 4; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; V4: Number of valid triplets: 4; V5: Number of valid triplets: 5; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.

Figure 3. Multiple Indicators Multiple Causes model (standardized solution).

Figure 4. Examples of biased items: (a) easy item; (b) difficult item


[Figure 1. Four panels plotting difficulty (y-axis) against facet level (x-axis) for Zam, Tur, and Net.
(a) Inductive Reasoning Figures: R2, R3, P3, P4, D2, D3, V, PR
(b) Rule Classification Figures: R2, R3, P3, P4, D2, D3, DV, V, C, PR, F
(c) Rule Generating Figures: R2, R3, D2, D3, V2, V3
(d) Rule Testing Figures: R2, R3, P3, P4, D2, D3, DV, V, C, F, NR]

[Figure 2. Four panels plotting difficulty (y-axis) against facet level (x-axis) for Zam, Tur, and Net.
(a) Inductive Reasoning Letters: R2, R3, R4, R5, L2, L3, L4, L5, L6, D2, D3, D4
(b) Rule Classification Letters: R2, R3, R4, R5, L2, L3, L4, L5, L6, D2, D3, D4, F
(c) Rule Generating Letters: R2, R3, R4, R5, L2, L3, L4, D2, D3, D4, V2, V3, V4, V5
(d) Rule Testing Letters: R2, R3, R4, R5, L3, L4, L5, L6, LV, D2, D3, D4, F, NR]


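[Figure 3. Path diagrams of the MIMIC model (standardized solution).
Figure mode: Rule classification figures, Rule generating figures, and Rule testing figures predict the latent variable Inductive reasoning, which is measured by Inductive reasoning figures and Inductive reasoning letters. Coefficients in extracted order: .24, .39, .46, .73, .67.
Letter mode: Rule classification letters, Rule generating letters, and Rule testing letters predict the latent variable Inductive reasoning, which is measured by Inductive reasoning figures and Inductive reasoning letters. Coefficients in extracted order: .37, .34, .34, .63, .74.
Two further values, .15 and .23, appear in the diagram.]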


[Figure 4. Average score (y-axis, 0.4 to 1) by score level (x-axis: Low, Medium, High) for Zambia, Turkey, and the Netherlands.]


Appendix A: Test Facets

The following table provides a description of the facets of the examples of the figure tests:

Test

Facet^a / Level / IRF RCF RGF^b RTF

Item rule
  1  * *
  2  *
  3  *

Number of figures per period
  2  -
  3  * -
  4  * - *

Number of different elements of subsequent figures
  1  * * *
  2
  3  *
  Variable  - -

Variation across periods
  Constant  * * - *
  Variable  -

Periodicity cues
  Present  - -
  Absent  - * -

Periods repeat each other
  Yes  * - *
  No  * -

Number of valid triplets
  1  - - -
  2  - - * -
  3  - - -

One of the alternatives follows the rule
  Yes  - * - *
  No  - -

Rows repeat each other
  Yes  - - - *
  No  - - -

Note. An asterisk indicates that the corresponding facet level applies to the item; a dash indicates that the facet is not present in the test. ^a See the text for an explanation of the facets. ^b The description refers to the first correct answer (1-3-5). IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule Generating Figures; RTF: Rule Testing Figures.


The following table provides a description of the facets of the examples of the letter tests:

Test

Facet^a / Level / IRL RCL RGL^b RTL

Item rule
  1  * *
  2
  3  *
  4
  5  *

Number of letters
  1
  2
  3  * *
  4  *
  5
  6  *
  Variable

Difference in positions in alphabet
  1  *
  2
  3
  4

Number of valid triplets
  1  - - -
  2  - - -
  3  - - * -
  4  - - -
  5  - - -

One of the alternatives follows the rule
  Yes  - * - *
  No  - -

Rows repeat each other
  Yes  - - - *
  No  - - -

Note. An asterisk indicates that the corresponding facet level applies to the item; a dash indicates that the facet is not present in the test. ^a See the text for an explanation of the facets. ^b The description refers to the first correct answer (2-4-6). IRL: Inductive Reasoning Letters; RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing Letters.


Appendix B: Examples of test items

(a) Inductive Reasoning Figures: Subject is asked to indicate which row consistently follows one of the item generating rules.

[Item figure: rows 1 to 5, each containing figures 1 to 12]

(Correct answer: 3)

(b) Rule Classification Figures: Subject is asked to indicate which rule applies to the eight figures.

[Item figure: figures 1 to 8]

1. One or more things are added to figures which come after each other in a group.

2. One or more things are subtracted from figures which come after each other in a group.

3. In turn, one or more things are added to figures which come after each other in a group and then, the same number of things is subtracted.

4. None of the rules applies.

(Correct answer: 3)

(c) Rule Generating Figures: Subject is asked to find one or more groups of three figures that follow one of the item generating rules.

1 - 2 - 3 - 4 - 5 - 6

1 - 2 - 3 - 4 - 5 - 6

1 - 2 - 3 - 4 - 5 - 6

(Correct answers: 3-4-6 and 2-3-4)

(d) Rule Testing Figures: Subject is asked to indicate which row of figures follows the rule at the top of the item.

The rule is: There are 4 figures in a group. 1 thing is ADDED to figures which come after each other in a group.

[Item figure: rows 1 to 4, each containing figures 1 to 8; option 5 reads "None of these"]

(Correct answer: 4)


(e) Inductive Reasoning Letters: Subject is asked to indicate which group of letters does not follow the rule of the other four.

1: MLKJIH   2: GFEDCB   3: UTSRQP   4: ONMLKH   5: XWVUTS

(Correct answer: 4)

(f) Rule Classification Letters: Subject is asked to indicate which rule applies to the three groups of letters.

SRRRTZ   VVVWXZ   KKKCDF

1. Each group of letters has the same number of vowels.

2. Each group of letters has an equal number of identical letters and these letters are the same across groups.

3. Each group of letters has an equal number of identical letters and these letters are not the same across groups.

4. Each group of letters has a number of letters which appear the same number of positions after each other in the alphabet.

5. Each group of letters has a number of letters which appear the same number of positions before each other in the alphabet.

6. None of the rules applies.
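Item (e) can be checked mechanically, which illustrates how an item-generating rule determines the correct answer. The sketch below assumes that the 30 extracted letters form five groups of six and that the rule is a constant step between neighboring letters in the alphabet; both the grouping and the helper names are assumptions made for illustration.

```python
def alphabet_steps(group):
    """Differences in alphabet position between neighboring letters."""
    return [ord(b) - ord(a) for a, b in zip(group, group[1:])]


def follows_constant_step(group):
    """True if every neighboring pair of letters is the same distance apart."""
    return len(set(alphabet_steps(group))) == 1


# Groups 1 to 5 of item (e); the split into six-letter groups is an assumption.
groups = ["MLKJIH", "GFEDCB", "UTSRQP", "ONMLKH", "XWVUTS"]

odd_ones_out = [
    number
    for number, group in enumerate(groups, start=1)
    if not follows_constant_step(group)
]
print(odd_ones_out)
```

Under these assumptions only group 4 breaks the pattern (K is followed by H rather than J), in line with the correct answer reported for the item.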



(g) Rule Generating Letters: Subject is asked to find one or more groups of three boxes of letters that follow one of the item generating rules.

F G H L L O A I L L V B C I D O U P Q R L L E A

1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6

(Correct answers: 2-4-6, 1-2-6, and 1-4-5)

(h) Rule Testing Letters: Subject is asked to indicate which row of figures follows the rule at the top of the item.

The rule is: In each box there are four vowels

1   AOUVWI   SRZEIO   VGAOUI
2   AOUVWI   SAREIO   VGAOUI
3   BOUVWI   SAREOI   VGAOUQ
4   AOUVWJ   SARDDO   VGAPUQ
5   None of these

(Correct answer: 2)


Appendix C: Parameter Estimates of the MIMIC Model per Mode and Cultural Group

A more detailed description of the MIMIC analyses is given here. To simplify the presentation and reduce the number of figures, the covariance matrices of the four grades were pooled per country prior to the analyses (as a consequence, the numbers in this Appendix and in Table 7 are not directly comparable). The table presents an overview of the estimated parameters (top) and fit (bottom). Going from left to right in the table, equality constraints are added cumulatively, starting with the “core parameters” of the model, the factor loadings (λy), followed by the regression coefficients (γ), the error variance of the latent construct, labeled Inductive Reasoning (ψ), the covariances of the predictors (φ), and the error variances of the tasks of inductive reasoning (θε). Cells with three numbers give the parameter estimates for the Dutch, Turkish, and Zambian groups, respectively (e.g., the values 1.09, 0.79, and 0.69 were the factor loadings of the IRL task in these groups in the solution without any equality constraints across cultural groups); cells with one number contain parameter estimates that were constrained to be identical across countries; cells marked “Same” contain values equal to those of the neighboring cell to the left. All parameter estimates are significant (p < .05).
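The parameters named above can be placed in the usual equations of a MIMIC model. The following is a sketch in LISREL-style notation, which is assumed here rather than taken from the text: x1 to x3 stand for the three predictor tasks (their order is not specified), η for the latent construct Inductive Reasoning, y1 for the indicator whose loading is fixed at one, and y2 for the other indicator (the IRL task, according to the example given above).

```latex
\begin{align}
\eta &= \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_3 + \zeta, & \operatorname{Var}(\zeta) &= \psi, \\
y_1 &= \lambda_1 \eta + \varepsilon_1, \quad \lambda_1 = 1, & \operatorname{Var}(\varepsilon_1) &= \theta_{\varepsilon 1}, \\
y_2 &= \lambda_2 \eta + \varepsilon_2, & \operatorname{Var}(\varepsilon_2) &= \theta_{\varepsilon 2}, \\
\operatorname{Cov}(x_j, x_k) &= \phi_{jk}.
\end{align}
```

In this notation, each successive column of the table constrains one more set of these parameters to be equal across the three countries.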

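Schematic diagram of MIMIC models: the tasks Rule Classification, Rule Generating, and Rule Testing (Figures or Letters, depending on the mode) are the predictors of the latent construct Inductive Reasoning (IR), which is measured by the tasks Inductive Reasoning Figures (IRF) and Inductive Reasoning Letters (IRL).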
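In a model of this form, the proportion of variance in the latent construct accounted for by the predictors equals γ'Φγ / (γ'Φγ + ψ), where γ holds the regression coefficients, Φ the covariance matrix of the predictors, and ψ the error variance of the latent construct. The Python sketch below recomputes this proportion from the figure-mode estimates of the solution without equality constraints, as reported in the table of this appendix. The function names are hypothetical, and it is assumed that the subscripts of the coefficients and of the covariances refer to the same ordering of the predictors.

```python
def explained_share(gamma, phi, psi):
    """Return gamma' Phi gamma / (gamma' Phi gamma + psi)."""
    n = len(gamma)
    explained = sum(gamma[i] * phi[i][j] * gamma[j]
                    for i in range(n) for j in range(n))
    return explained / (explained + psi)


def symmetric(p11, p21, p22, p31, p32, p33):
    """Build the full 3x3 covariance matrix from its lower triangle."""
    return [[p11, p21, p31],
            [p21, p22, p32],
            [p31, p32, p33]]


# Figure mode, no equality constraints: (gamma, Phi, psi) per group.
groups = {
    "Dutch": ([0.11, 0.18, 0.31],
              symmetric(27.28, 29.48, 111.76, 15.35, 31.79, 30.02), 0.44),
    "Turkish": ([0.19, 0.15, 0.34],
                symmetric(39.25, 36.82, 102.06, 23.72, 35.93, 37.66), 4.48),
    "Zambian": ([0.21, 0.19, 0.34],
                symmetric(45.69, 35.11, 104.22, 26.31, 33.89, 45.42), 5.86),
}

r2 = {name: explained_share(g, p, s) for name, (g, p, s) in groups.items()}
for name, value in r2.items():
    print(f"{name}: {value:.2f}")
```

To two decimals, the results agree with the proportions reported for IR in the first column of the figure-mode panel (0.97, 0.79, and 0.79).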


(a) Figure mode

           No equality constraints          Invariant parameters across countries
Parameter  (Dutch / Turkish / Zambian)      λy                     λy, γ                  λy, γ, ψ               λy, γ, ψ, φ            λy, γ, ψ, φ, θε
λ2         1.09 / 0.79 / 0.69               0.83                   0.82                   0.83                   0.83                   0.82
γ1         0.11 / 0.19 / 0.21               0.12 / 0.19 / 0.20     0.17                   0.17                   0.17                   0.17
γ2         0.18 / 0.15 / 0.19               0.20 / 0.14 / 0.18     0.17                   0.17                   0.17                   0.18
γ3         0.31 / 0.34 / 0.34               0.37 / 0.33 / 0.31     0.34                   0.33                   0.33                   0.33
φ11        27.28 / 39.25 / 45.69            Same                   Same                   Same                   37.99                  37.99
φ21        29.48 / 36.82 / 35.11            Same                   Same                   Same                   34.14                  34.14
φ22        111.76 / 102.06 / 104.22         Same                   Same                   Same                   105.56                 105.56
φ31        15.35 / 23.72 / 26.31            Same                   Same                   Same                   22.20                  22.20
φ32        31.79 / 35.93 / 33.89            Same                   Same                   Same                   34.06                  34.06
φ33        30.02 / 37.66 / 45.42            Same                   Same                   Same                   38.08                  38.08
ψ          0.44 / 4.48 / 5.86               0.30 / 4.28 / 4.73     0.43 / 4.39 / 4.83     3.10                   3.10                   3.41
θε1        17.39 / 18.05 / 23.25            17.61 / 18.27 / 24.53  17.42 / 18.25 / 24.35  15.27 / 19.25 / 25.80  15.27 / 19.25 / 25.80  20.08
θε2        18.73 / 15.93 / 19.69            19.49 / 15.81 / 19.34  19.69 / 15.82 / 19.42  18.44 / 16.41 / 20.29  18.44 / 16.41 / 20.29  18.12

Proportion of variance accounted forᵃ
IR         0.97 / 0.79 / 0.79               0.98 / 0.79 / 0.80     0.97 / 0.80 / 0.79     0.86                   0.84                   0.84
IRF        0.43 / 0.54 / 0.54               0.48 / 0.53 / 0.49     0.47 / 0.55 / 0.49     0.54 / 0.51 / 0.45     0.57 / 0.51 / 0.44     0.51
IRL        0.46 / 0.45 / 0.40               0.37 / 0.47 / 0.45     0.34 / 0.48 / 0.45     0.40 / 0.46 / 0.42     0.43 / 0.45 / 0.40     0.43

Fit indices
χ²(df)     13.25 (2) / 8.32 (2) / 3.58 (2)  38.35 (8)              43.42 (14)             49.67 (16)             101.92 (28)            123.82 (32)
prob.      .001 / .016 / .167               .000                   .000                   .000                   .000                   .000
Δχ²(df)    -                                12.20 (2)              5.07 (6)               6.25 (2)               52.25 (12)             21.90 (4)
prob.      -                                .002                   .535                   .044                   .000                   .000
NNFI       0.91 / 0.96 / 0.99               0.95                   0.97                   0.97                   0.96                   0.96
GFI        0.98 / 0.99 / 1.00               0.99                   0.99                   0.99                   0.97                   0.96
RMSEA      0.13 / 0.09 / 0.05               0.10                   0.07                   0.08                   0.08                   0.09


(b) Letter mode

           No equality constraints          Invariant parameters across countries
Parameter  (Dutch / Turkish / Zambian)      λy                     λy, γ                  λy, γ, ψ               λy, γ, ψ, φ            λy, γ, ψ, φ, θε
λ2         1.49 / 1.23 / 0.76               1.21                   1.18                   1.19                   1.19                   1.12
γ1         0.11 / 0.29 / 0.20               0.13 / 0.30 / 0.15     0.23                   0.22                   0.22                   0.22
γ2         0.11 / 0.07 / 0.11               0.12 / 0.07 / 0.09     0.09                   0.09                   0.09                   0.10
γ3         0.31 / 0.17 / 0.31               0.36 / 0.18 / 0.21     0.23                   0.23                   0.23                   0.24
φ11        30.24 / 37.06 / 38.53            Same                   Same                   Same                   35.55                  35.55
φ21        38.14 / 49.81 / 49.59            Same                   Same                   Same                   46.41                  46.41
φ22        161.85 / 199.72 / 196.20         Same                   Same                   Same                   187.85                 187.85
φ31        15.25 / 17.54 / 18.99            Same                   Same                   Same                   17.32                  17.32
φ32        29.74 / 39.36 / 46.45            Same                   Same                   Same                   38.77                  38.77
φ33        19.14 / 25.90 / 32.63            Same                   Same                   Same                   25.97                  25.97
ψ          2.38 / 1.78 / 7.62               2.85 / 1.81 / 4.41     3.27 / 1.98 / 4.54     2.69                   2.69                   3.29
θε1        21.74 / 14.95 / 30.96            21.49 / 14.91 / 35.30  21.19 / 14.85 / 34.43  21.51 / 14.42 / 35.76  21.51 / 14.42 / 35.76  22.28
θε2        11.03 / 15.49 / 21.21            12.22 / 15.55 / 19.36  12.78 / 15.66 / 20.33  13.34 / 14.95 / 22.53  13.34 / 14.95 / 22.53  16.45

Proportion of variance accounted forᵃ
IR         0.77 / 0.85 / 0.66               0.79 / 0.85 / 0.64     0.72 / 0.84 / 0.71     0.81                   0.79                   0.76
IRF        0.32 / 0.44 / 0.42               0.39 / 0.44 / 0.26     0.35 / 0.45 / 0.31     0.34 / 0.47 / 0.28     0.37 / 0.47 / 0.26     0.38
IRL        0.68 / 0.53 / 0.38               0.62 / 0.53 / 0.48     0.56 / 0.52 / 0.52     0.53 / 0.55 / 0.46     0.57 / 0.54 / 0.44     0.51

Fit indices
χ²(df)     1.32 (2) / 2.44 (2) / 2.32 (2)   24.42 (8)              54.32 (14)             57.35 (16)             104.51 (28)            187.75 (32)
prob.      .517 / .295 / .313               .002                   .000                   .000                   .000                   .000
Δχ²(df)    -                                18.34 (2)              29.90 (6)              3.03 (2)               47.16 (12)             83.24 (4)
prob.      -                                .000                   .000                   .220                   .000                   .000
NNFI       1.01 / 1.00 / 1.00               0.97                   0.96                   0.96                   0.96                   0.93
GFI        1.00 / 1.00 / 1.00               0.98                   0.98                   0.97                   0.96                   0.92
RMSEA      0.00 / 0.02 / 0.02               0.07                   0.09                   0.08                   0.08                   0.12

Note. Values in cells refer to the nonstandardized solution; λ1 is fixed at a value of one. IR: Inductive Reasoning (latent construct). IRF: Inductive Reasoning Figures. IRL: Inductive Reasoning Letters. NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation.
ᵃ The last three rows refer to proportions of variance accounted for in the latent variable and the two inductive reasoning tasks, respectively.
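The incremental fit rows can be retraced with a short Python sketch. It takes the letter-mode chi-square values and degrees of freedom from the table, treats the sum of the three per-country fits as the unconstrained baseline (an assumption about how the first increment was obtained), and computes each chi-square difference with its probability. To stay within the standard library, the tail probability uses the closed-form expression that holds only for even degrees of freedom, which covers all differences in this table; the function names are hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut requires a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


def nested_comparisons(models):
    """Compare each model with the less constrained model preceding it."""
    results = []
    for (_, chi_prev, df_prev), (name, chi, df) in zip(models, models[1:]):
        d_chi = chi - chi_prev
        d_df = df - df_prev
        results.append((name, d_chi, d_df, chi2_upper_tail_even_df(d_chi, d_df)))
    return results


# Letter mode: (invariant parameters, chi-square, df), as given in the table.
letter_mode = [
    ("none", 1.32 + 2.44 + 2.32, 2 + 2 + 2),   # assumed baseline: summed per-country fits
    ("loadings", 24.42, 8),
    ("+ regression coefficients", 54.32, 14),
    ("+ latent error variance", 57.35, 16),
    ("+ predictor covariances", 104.51, 28),
    ("+ task error variances", 187.75, 32),
]

comparisons = nested_comparisons(letter_mode)
for name, d_chi, d_df, p in comparisons:
    print(f"{name}: delta chi2 = {d_chi:.2f}, delta df = {d_df}, p = {p:.3f}")
```

For the letter mode, the differences computed this way match the tabled increments (18.34, 29.90, 3.03, 47.16, and 83.24); only the step that adds the error variance of the latent construct is nonsignificant at the .05 level.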
