Inductive Reasoning 1
RUNNING HEAD: Cross-Cultural Equivalence of an Inductive Reasoning Test
Inductive Reasoning in Zambia, Turkey, and The Netherlands:
Establishing Cross-Cultural Equivalence
Fons J. R. van de Vijver
Tilburg University
The Netherlands
Mailing address:
Fons J. R. van de Vijver
Department of Psychology
Tilburg University
PO Box 90153
5000 LE Tilburg
The Netherlands
Phone: +31 13 466 2528
Fax: +31 13 466 2370
E-mail: [email protected]
Acknowledgment. The help of Cigdem Kagitcibasi and Robert Serpell in making the
data collection possible in Turkey and Zambia is gratefully acknowledged.
Abstract
Tasks of inductive reasoning and its component processes were administered to
704 Zambian, 877 Turkish, and 632 Dutch pupils from the highest two grades of
primary and the lowest two grades of secondary school. All items were constructed
using item-generating rules. Three types of equivalence were examined: structural
equivalence (Does an instrument measure the same psychological concept in each
country?), measurement unit equivalence (Do the scales have the same metric in
each country?), and full score equivalence (full comparability of scores across
countries). Structural and measurement unit equivalence were examined in two
ways. First, a MIMIC (multiple indicators, multiple causes) structural equation model
was fitted, with tasks for component processes as input and inductive reasoning
tasks as output. Second, using a linear logistic model, the relationship between item
difficulties and the difficulties of their constituent item-generating rules was
examined in each country. Both analyses of equivalence provided strong evidence
for structural equivalence, but only partial evidence for measurement unit
equivalence; full score equivalence was not supported.
Equivalence of a Measure of Inductive Reasoning
in Zambia, Turkey, and The Netherlands
Inductive reasoning has been a topic of considerable interest to cross-cultural
researchers, mainly because of its strong relationship with general intelligence
(Carroll, 1993; Gustafsson, 1984; Jensen, 1980). Many cultural populations have
been studied using common tasks of inductive reasoning such as number series
extrapolations (e.g., How should the following series be continued: 1, 4, 9, 16,...?),
figure series extrapolations such as Raven’s Progressive Matrices, analogical
reasoning (e.g., Complete the following: day : night :: white : ...?), and exclusion
tasks (e.g., Mark the odd one out: (a) 21, (b) 14, (c) 28, (d) 63, (e) 32). Studies of
inductive reasoning among nonwestern populations were reviewed by Irvine (1969,
1979; Irvine & Berry, 1988). He concluded that the structure found among western
participants with exploratory factor-analytic techniques is usually replicated. More
recent comparative studies, often based on comparisons of ethnic groups in the
U.S.A., have confirmed this conclusion (e.g., Fan, Willson, & Reynolds, 1995; Geary
& Whitworth, 1988; Hakstian & Vandenberg, 1979; Hennessy & Merrifield, 1976;
Naglieri & Jensen, 1987; Ree & Carretta, 1995; Reschly, 1978; Sandoval, 1982;
Sung & Dawis, 1981; Taylor & Ziegler, 1987; Valencia & Rankin, 1986; Valencia,
Rankin, & Oakland, 1997). Major differences in structure (for instance as reported by
Claassen & Cudeck, 1985) are exceptional. Inductive reasoning provides a strong
case for what Waitz, a nineteenth-century philosopher, called “the psychic unity of
mankind” (Jahoda & Krewer, 1997), according to which the basic structure and
operations of the cognitive system are universal while manifestations of these
structures may vary across cultures, depending on what is relevant in a particular
cultural context.
The validity of cross-cultural comparisons can be jeopardized by bias;
examples of bias sources are country differences in stimulus familiarity (Serpell,
1979) and item translations (Ellis, 1990; Ellis, Becker, & Kimmel, 1993). Bias refers
to the presence of score differences that do not reflect differences in the target
construct. Much research has been reported on fair test use; the question is
addressed there whether a test predicts an external criterion such as job success
equally well in different ethnic, age or gender groups (e.g., Hunter, Schmidt, &
Hunter, 1979). The present study examines bias not in test use but in test
meaning; in other words, no reference is made here to social bias, unfairness, and
differential predictive validity. The present study focuses on the question whether
the same score, obtained in different cultural groups, has the same meaning
across these groups. Scores for which this holds are unbiased. Two types of approaches have
been developed to deal with bias in cognitive tests. The first type, known under
various labels such as culture-free, culture-fair, and culture-reduced testing (Jensen,
1980), attempts to eliminate or minimize the differential influence of cultural factors,
like education, by adapting instrument features that may induce unwanted score
differences across countries. Raven's Matrices Tests are often considered to
exemplify this approach (e.g., Jensen, 1980). Despite the obvious importance of
good test design, the approach has come under critical scrutiny; it has been argued
that culture and test performance are so inextricably linked that a culture-free test
does not exist (Frijda & Jahoda, 1966; Greenfield, 1997).
Second, various statistical procedures have been proposed to examine the
appropriateness of psychological instruments in different ethnic groups. Examples
are exploratory factor analysis followed by target rotations and the computation of
factorial agreement between ethnic groups (Barrett, Petrides, Eysenck, & Eysenck,
1998; McCrae & Costa, 1997), simultaneous components analysis (Zuckerman,
Kuhlman, Thornquist, & Kiers, 1991), item bias statistics (Holland & Wainer, 1993),
and structural equation modeling (Little, 1997). It is remarkable that a priori and a
posteriori approaches (test adaptations and statistical techniques, respectively) have
almost never been combined, despite their common aim, mutual relevance, and
complementarity.
The present paper attempts to integrate a priori and a posteriori approaches
and takes equivalence as a starting point. Equivalence refers to the similarity of
psychological meaning across cultural groups (i.e., the absence of bias). Three
hierarchical types of equivalence can be envisaged (Van de Vijver & Leung, 1997a,
b). At the lowest level the issue of similarity of a psychological construct, as
measured by a test in different cultures, is addressed. An instrument shows
structural (also called functional) equivalence if it measures the same construct in
each cultural population studied. There is no claim that scores or measurement units
are comparable across cultures. In fact, instruments may be different across
cultures; structural equivalence is supported if it can be shown that in each culture
the same underlying construct (e.g., inductive reasoning) has been measured. The
intermediate level refers to measurement unit equivalence, defined by equal scale
units and unequal scale origins across cultural groups (e.g., the temperature scales
in degrees of Celsius and Kelvin). In practical terms, this type of equivalence is
found when the same instrument has been administered in different groups but
scores are not directly comparable across groups because of the presence of
moderating variables with a bearing on group mean scores, such as intergroup
differences in stimulus familiarity. Structural equation modeling is suitable to address
measurement unit equivalence because it allows for a comparison of score metrics
across cultural groups. The third and highest level is called full score equivalence
and refers to identity of both scale units and origins. Only in the latter case can
scores be compared both within and across cultures using techniques like t tests and
analyses of (co)variance.
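The distinction between measurement unit equivalence and full score equivalence can be illustrated with the temperature example mentioned above. The following is a minimal sketch and not part of the study; the function name and the readings are hypothetical and serve only to show that equal units with unequal origins permit within-group comparisons but not direct comparisons of group means.

```python
def celsius_to_kelvin(celsius):
    """Same unit size as Celsius, but a different origin (offset of 273.15)."""
    return celsius + 273.15


def mean(values):
    return sum(values) / len(values)


# Hypothetical readings: group A is recorded in Celsius, group B in Kelvin.
# The underlying temperatures are identical in both groups.
group_a_celsius = [10.0, 20.0, 30.0]
group_b_kelvin = [celsius_to_kelvin(c) for c in (10.0, 20.0, 30.0)]

# Within-group differences are comparable because the units are equal.
diff_a = group_a_celsius[-1] - group_a_celsius[0]
diff_b = group_b_kelvin[-1] - group_b_kelvin[0]

# The raw group means are not comparable because the origins differ:
# the gap reflects the offset between the scales, not a real difference.
mean_gap = mean(group_b_kelvin) - mean(group_a_celsius)
```

In this sketch the within-group differences agree across the two scales, whereas the gap between the raw means equals the offset between the scale origins. This corresponds to the situation in which a moderating variable such as stimulus familiarity shifts the scores of one group.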
Full score equivalence assumes the complete absence of bias in the
measurement. Score differences between and within cultures are entirely due to
inductive reasoning. There are no fully adequate statistical tests of full score
equivalence, but some go a long way. The first is indirect and involves the use of
additional variables to (dis)confirm a particular interpretation of cross-cultural score
differences (Poortinga & Van de Vijver, 1987). Suppose that Raven’s Standard
Progressive Matrices Test is administered to adults in the U.S.A. and to illiterate
Bushmen. It may well be that the test provides a good picture of inductive reasoning
in both cultures. However, it is likely that differences between the countries are
influenced by educational differences between the groups. Score differences within
and across groups have a different meaning in this case. A measure of test-
wiseness or previous test exposure, administered to all participants, can be used to
(dis)confirm that cross-cultural score differences are due to bias. Full score
equivalence is then not demonstrated but assumed, and corollaries are tested.
Other tests of full score equivalence that have been proposed compare the
patterning of cross-cultural score differences across items or subtests, often within
the framework of structural equation modeling. An example is multilevel covariance
structure analysis (Muthén, 1991, 1994) that compares the factor structure of pooled
within-country data to between-country data. Such an analysis assumes a sizable
number of cultural groups involved. Another example involves the modeling of latent
means in a structural model (e.g., Little, 1997). A frequently employed approach,
often based on item response theory, which is applicable when a small number of
cultures have been studied, is the examination of differential item functioning or item
bias (e.g., Holland & Wainer, 1993; Van der Linden & Hambleton, 1997). As long as
the sources of bias (such as education) affect all items in a more or less uniform
way, no statistical techniques will indicate that between-group differences are of a
different nature than within-group differences. Only if bias affects some items but not
others can the proposed techniques identify it. In sum, the establishment of full score
equivalence is an intricate issue. In many empirical studies dealing with mental
tests, this form of equivalence is merely assumed. As a consequence, statements
about the size of cross-cultural score differences often have an unknown validity.
Sternberg and Kaufman’s (1998) observation that we know that there are population
differences in human abilities, but that their nature is elusive, is very pertinent.
In line with current thinking in validity theory (Embretson, 1983; Messick,
1988), the present study combines test design and statistical analyses to deal with
bias (and equivalence). A distinction is made between internal and external
procedures to establish equivalence, depending on whether the procedure is based
on information derived from the scrutinized test itself (internal) or from additional
tests (external).
The present study examines the structural, measurement unit, and full score
equivalence of a measure of inductive reasoning in three culturally widely divergent
populations (Zambia, Turkey, and the Netherlands). Structural and measurement
unit equivalence are studied using both an internal and external procedure. The
internal procedure to examine equivalence is based on item-generating rules that
underlie the instruments. In the external procedure, equivalence is scrutinized by
comparing the contribution of skill components to inductive reasoning across
countries. Three components are presumably relevant in the types of inductive
reasoning tasks studied here (Sternberg, 1977). The first is classification: treating
stimuli as exemplars of higher order concepts (e.g., the set CDEF as four
consecutive letters in the alphabet, as an instance of a group with one vowel, as a
group with three consonants, etcetera). Individuals are more successful in inductive
reasoning tasks when they can generate more of these classifications. Therefore, in
addition to classification, the skill to generate underlying rules on the basis of a
stimulus set was also tested. Finally, each generated rule has to be tested (e.g., Do
other groups also have four consecutive letters?). The latter skill, labeled rule
testing, was also assessed.
Method
Participants
An important consideration in the choice of countries was the presumed
strong influence of schooling on test performance (Van de Vijver, 1997); the
expenditure per head on education, a proxy for school quality, is strongly influenced
by national affluence. Countries with considerable differences in school systems and
educational expenditures per child were chosen. Furthermore, inclusion of at least
three different cultural groups decreases the number of alternative hypotheses to
explain cross-cultural differences (Campbell & Naroll, 1972). Zambia, Turkey, and
the Netherlands show considerable differences in educational systems and GDP
(per capita); the GDP figures per capita for 1995 were US$ 382, 2,814, and 25,635
for the three countries, respectively. School life expectancy of the three countries is
7.3, 9.7, and 15.5 years (United Nations, 1999). The choice of Zambia was also
made because of its lingua franca in school; English is the school language in
Zambia, which was convenient for developing and administering tasks.
In each country pupils of four subsequent grades were involved. In the
Netherlands these were the last two grades of primary school (Grades 5 and 6) and
the first two grades of secondary school. The same procedure was applied in
Zambia, where primary school has seven grades. In a pilot study it was found that
the tasks could not be adequately administered to pupils from Grade 5 because
most of these children still have insufficient knowledge of English, which is the first
language of few Zambians. Children start attending primary school in Turkey and
the Netherlands at the age of six, while schooling starts one year later in Zambia; as
a consequence, the Zambian pupils were on average two years older. The Zambian
sample comprised more than 20 cultural groups (the three largest being Tonga,
21%; Bemba, 13%; and Nyanja, 11%); the Turkish group was 99% Turkish, while in
the Dutch group 93% were Dutch, 2% Moroccan, and 2% Turkish.
Primary schooling in Turkey has five grades; pupils from the fifth grade of
primary school and the first three grades of secondary school were involved.
Secondary education is markedly different in the three countries. In Zambia a
nation-wide examination (with tests for reasoning and school achievement) at the
end of the last grade of primary school, Grade 7, is utilized to select pupils for
secondary school. After Grade 7, fewer than 20% of pupils continue their
education in either public or private secondary schools. Admittance to public schools
is conditional on the score at the Grade 7 Examination. Cutoff scores vary per
region and depend on the number of places available in secondary schools. In
urban areas there are some private schools; admittance to these schools usually
does not depend on examination results, but is mainly dependent on availability of
places as well as the ability and willingness of parents to pay school fees.
Participants both from public and private schools were included in our study. The
tremendous dropout at the end of Grade 7 has undoubtedly adversely affected the
generalizability of the data to the Zambian population at large, and it has also jeopardized
the comparability of the age cohorts, both within Zambia and across the three
countries. In Turkey and the Netherlands secondary schooling is more or less
intellectually streamed. An attempt was made to retain the intellectual heterogeneity
of the primary school group at secondary school level by selecting various types of
schools. The intellectual heterogeneity of the samples is clearly larger in Turkey and
the Netherlands than in Zambia; yet, none of the samples may be fully representative
of the age groups of their respective countries.
Insert Table 1 about here
Sample sizes are presented in Table 1; of the participants recruited, 56%
came from urban and 44% from rural schools; 46% were female and 54% were male.
Instruments
The battery consisted of eight tasks, four with figures and four with letters as
stimuli. Each of these two stimulus modes had the same composition: a task of
inductive reasoning and three tasks of skill components that are assumed to
constitute important aspects of inductive reasoning. The first is rule classification,
called encoding in Sternberg’s (1977) model of analogical reasoning. The second is
rule generating, a combination of inference and mapping. The third is rule testing, a
combination of comparing and justification.
All tasks are based on item-generating rules, schematically presented in
Appendix A. All figure tasks are based on the following three item-generating rules:
(a) The same number of figure elements is added to subsequent figures in a
period (periods consist of either circles or squares, but never of both. A
period defines the number of figures that belong together. Examples of
items of all tasks, in which the item-generating rules are illustrated, can be
found in Appendix B).
(b) The same number of elements is subtracted from subsequent figures in a
period.
(c) The same number of elements is, alternatingly, added to and subtracted
from subsequent figures in a period.
The three item-generating rules are an example of a facet, a generic term for
all item features that are systematically varied across items. Two more facets
apply to all figure tasks. First, the number of figures in a period varies from two to
four. Second, the number of elements that are added to or subtracted from
successive figures of a period varies from one to three. Whenever possible, all
facet levels were crossed. However, for some combinations of facet levels no item
could be generated. For example, as each figure can have (in addition to a circle or
a square that are present in all items) only five elements (namely a hat, arrow, dot,
line, or bow), it is impossible to construct an item with two or three elements added
to each of four figures in a period.
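The crossing of facet levels, and the exclusion of combinations for which no item can be generated, can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to construct the items: the feasibility criterion (the number of elements in a figure must stay between zero and five throughout a period) and all names are hypothetical. The criterion reproduces the example given above, but it may also exclude combinations that the text does not discuss.

```python
from itertools import product

RULES = ("add", "subtract", "alternate")
PERIOD_LENGTHS = (2, 3, 4)   # number of figures in a period
STEP_SIZES = (1, 2, 3)       # elements added or subtracted per step
MAX_ELEMENTS = 5             # hat, arrow, dot, line, bow


def is_feasible(rule, period_length, step_size):
    """Assumed criterion: element counts must stay between 0 and MAX_ELEMENTS."""
    if rule == "alternate":
        # Counts swing between n and n + step_size, so a single step must fit.
        return step_size <= MAX_ELEMENTS
    # Pure addition or subtraction changes the count (period_length - 1) times.
    return step_size * (period_length - 1) <= MAX_ELEMENTS


# Cross all facet levels, then keep only the combinations that can be drawn.
all_cells = list(product(RULES, PERIOD_LENGTHS, STEP_SIZES))
feasible_cells = [cell for cell in all_cells if is_feasible(*cell)]
```

Under this assumed criterion, adding two or three elements to each figure of a four-figure period is excluded, in line with the example in the text, while adding one element per figure in such a period remains possible.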
Inductive Reasoning Figures is a task of 30 items. Each item has five rows of
12 figures, of which the first eight are identical. One of the rows has been composed
according to a rule while in the other rows the rule has not been applied
consistently. The pupil has to mark the correct row.
Besides the common facets, two additional facets were used to generate the
items of Inductive Reasoning Figures. First, the figure elements added or subtracted
are either the same or different across periods. In the example of Appendix B there
is a constant variation because in each period a dot is added first, followed by a
dash and a hat. Second, periods do or do not repeat one another, meaning that the
first figures of each period are identical (except for a possible swap of circle and
square).
The 36 items of Rule Classification Figures consist of eight figures. Below
these figures the three item-generating rules were printed. In addition, the
alternative "None of the rules applies" has been added. The pupil had to indicate
which of the four alternatives applies to the eight figures above.
In addition to the common facets, the task has three additional facets. The
first two are the same as in Inductive Reasoning Figures. The third refers to the
presence or absence of periodicity cues. These cues refer to the presence of both
circles and squares in an item (as illustrated in the first item of Appendix B) or the
presence of either squares or circles (if all circles of the example were changed
into squares, no periodicity cues would be present).
Whereas in Inductive Reasoning Figures the number of different elements of
a figure could be either one, two, or three, Rule Classification Figures has another
level of this facet, referring to a variable number of elements. For example, in the
first period one element is added to subsequent figures and in the second period
two elements.
Each of the 36 items of Rule Generating Figures consists of a set of six
figures under which three lines with the numbers 1 to 6 are printed. In each item
one, two, or three triplets (i.e., groups of three figures) have been composed
according to one of the item-generating rules. Any of the six figures of an item can
be part of one, two, or three triplets. Pupils were asked to indicate all triplets that
constitute valid periods of figures. No information about the number of valid triplets
in each particular item was given. The total number of triplets was 63. In the data
analysis these were treated as separate, dichotomously scored items.
Two facets, in addition to the common ones, were included. First, periodicity
cues are either present or absent; the facet has the same meaning as in Rule
Classification Figures. Second, the number of valid triplets is one, two, or three.
A verbal specification is given at the top of each item of Rule Testing Figures.
In this specification three characteristics of the item are given, namely the
periodicity, the item-generating rule, and the number of elements varied between
subsequent figures of a period (e.g., "There are 4 figures in a group. 1 thing is
subtracted from figures which come after each other in a group"). Below this
specification four rows of eight figures have been drawn. One of the rows of eight
figures has been composed completely according to the specification. In some items
none of the four rows has been composed according to the specification. In this
case a fifth response alternative, "None of the rows has been composed according
to the specification" applies. This facet is labeled “None/one of the rules applies”.
The pupil has to mark the correct answer.
The facets and facet levels of Rule Testing Figures and Rule Classification
Figures were identical. In addition, the facet “Rows (do not) repeat each other” is
included in the former task. In some items the rows are fairly similar to each other
(except for minor variations that were essential for the solution), while in other items
each row has a completely different set of eight figures.
The letter tasks were based on five item-generating rules:
(a) Each group of letters has the same number of vowels. The vowels used in
the task are A, E, I, O, and U. As the status of the letter Y can easily
create confusion in English and Dutch where it can be both a consonant
and a vowel, the letter was never used in connection to the first item-
generating rule;
(b) Each group of letters has an equal number of identical letters that are the
same across groups (e.g., BBBB BBBB);
(c) Each group of letters has an equal number of identical letters that are not
the same across groups (e.g., GGGG LLLL);
(d) Each group of letters has a number of letters that appear the same (i.e., 1,
2, 3, or 4) number of positions after each other in the alphabet. The letters
A and B have a difference of one position, the letters A and C a difference
of two positions, etcetera;
(e) Each group of letters has a number of letters that appear the same (i.e., 1,
2, 3, or 4) number of positions before each other in the alphabet.
A second facet refers to the number of letters to which the rule applies. The number
could vary from 1 to 6. All items of the letter tasks are based on a combination of the
two facets described (i.e., item rule and number of letters). As in the figure tasks,
not all combinations of the facets are possible; for example, applications of the
fourth and fifth rule assume an item rule that is based on at least two letters in a
group.
Inductive Reasoning Letters bears resemblance to the Letter Sets Test in the
ETS-Kit of Factor-Referenced Tests (Ekstrom, French, & Harman, 1976). Each of
the 45 items consists of five groups of six letters. Four out of these five groups are
based on the same combination of the two facets (e.g., they all have two vowels).
The pupil has to mark the odd one out.
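The logic of an odd-one-out item can be sketched for the first item-generating rule (an equal number of vowels per group). The code below is a hypothetical illustration and not the software used to generate or score the items; the function names and the example item are invented for this purpose.

```python
from collections import Counter

VOWELS = set("AEIOU")  # Y is excluded, as in the task description


def vowel_count(group):
    """Number of vowels in one group of letters."""
    return sum(1 for letter in group if letter in VOWELS)


def odd_one_out(groups, feature=vowel_count):
    """Return the index of the single group whose feature value deviates.

    Returns None unless exactly one group differs from all the others.
    """
    values = [feature(group) for group in groups]
    counts = Counter(values)
    rare = [i for i, value in enumerate(values) if counts[value] == 1]
    exactly_one_deviates = len(counts) == 2 and len(rare) == 1
    return rare[0] if exactly_one_deviates else None


# Hypothetical item: four groups have two vowels, the fourth group has three.
item = ["BACEDF", "GHIJOK", "LMUNAP", "AEIRST", "TOVUZB"]
answer = odd_one_out(item)
```

Other item-generating rules could be handled by passing a different feature function, for example one that counts identical letters in a group.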
Each of the 36 items of Rule Classification Letters consists of three groups of
six letters. Most items have been constructed according to some combination of the
two facets, while in some items the combination has not been used consistently. The
five item-generating rules are printed under the groups of letters. The pupil has to
indicate which item-generating rule underlies the item. If the rule has not been
applied consistently, the sixth response alternative, "None of the rules applies," is
the correct answer. As in Rule Classification Figures, the facet “None/one of the
rules applies” is included.
Each of the 30 items of Rule Generating Letters consists of a set of 6 groups
of letters (each made up of 1 to 9 letters). In each item up to five triplets of (subsets
of) the groups of letters have been composed according to a combination of the two
facets. The pupil is asked to indicate as many triplets as possible. Again, the pupils
were not informed about the exact number of triplets in each item. The total number
of triplets is 90, which are treated as separate items in the data analysis.
As in Rule Generating Figures, a facet about the number of valid triplets
(ranging here from 1 to 5) applies to the items, in addition to the common facets.
Rule Testing Letters consists of 36 items. Each item starts with a verbal
specification. In the specification two characteristics of the item are given, namely
the item-generating rule and the number of letters pertaining to the rule (e.g., "In
each group of letters there are 3 vowels"). Below this specification four rows of three
groups of six letters are printed. In most items one of the rows has been
composed according to the specification. The pupil has to find this row. In some
items none of the four rows has been composed according to the specification. In
this case a fifth response alternative, "None of the rows has been composed
according to the specification above the item," applies. The pupil has to indicate
which of the five alternatives applies.
The task had three facets, besides the common ones. As in Rule
Classification Letters, there is a facet indicating whether or not one of the
alternatives follows the rule. Also, as in Rule Testing Figures, there is a facet
indicating whether or not rows repeat each other.
The Turkish and Roman alphabets are not entirely identical. The (Roman)
letters Q, W, and X were not used here since they are uncommon in Turkish. The
presence of specifically Turkish letters, such as Ç, Ö, and Ü, necessitated the
introduction of small changes in the stimulus material (e.g., the sequence ABCD in
the Zambian and Dutch stimulus materials was changed into ABCÇ in Turkish).
Administration
The tasks were administered without time limit to all pupils of a class;
however, in the rural areas in Zambia the number of desks available was often
insufficient for all pupils to work simultaneously as each pupil had to have his or her
own test booklet and answer sheet. The experimenter then selected randomly a
number of participants.
The tasks were administered by local testers. The numbers of testers in
Zambia, Turkey, and the Netherlands were two, three, and two (the author being
one of them), respectively. Five were psychologists and three were experienced
psychological assistants. All testers followed a one-day training in the administration
of the tasks.
In Zambia English was used in the administration. A supplementary sheet in
Nyanja, the main language of the Lusaka region, was included in the test booklet
that explained the item-generating rules. Turkish was the testing language in the
Turkish group and Dutch in the Dutch group.
The administration of all tasks to all pupils would presumably have taken
three school days. In order to avoid the loss of motivation and test fatigue, two
experimental conditions were introduced: the figure and the letter condition. The two
tasks of inductive reasoning were always administered; in the figure condition rule
classification, rule generating, and rule testing tasks with figures were also included,
while the three additional letter tasks were administered in the letter condition;
sample sizes for each condition are given in Table 1. So, all pupils received five
tasks: two tasks of inductive reasoning and three tasks of skill components (either
the three figure or the three letter skill component tasks). The administration of the
five tasks took place on two consecutive school days. The order of administration of
the tasks was random, with the constraint that the two tasks of inductive reasoning
were given on one day (either the first or the second testing day) and the remaining
on the other one.
Each of the eight instruments started with a one-page description of
the task, which was read aloud by the experimenter to the pupils; item-generating
rules of the stimulus mode were specified. This instruction was included in the
pupils’ test booklets. Examples were then presented of each of the item-generating
rules; explicit reference was made to which rule applied. Finally, the pupils were
asked to answer a number of exercises that, again, covered all item-generating
rules. After this instruction, the pupils were asked to answer the actual items. In
each figure task the serial position of each figure was printed on top of the item in
order to minimize the computational load of the task. The alphabet was printed at
the top of each page of the letter tasks, with the vowels underlined. It was indicated
to the pupils that they were allowed to look back at the instructions and examples
(e.g., to consult the item-generating rules). Experience showed that this was
infrequently done, probably because all tasks of a single stimulus mode utilized the
same rules.
Results
The section begins with a description of preliminary analyses, followed by the
main analyses. Per analysis, the hypothesis, statistical procedure and findings are
reported.
Preliminary Analyses
The internal consistencies of the instruments (Cronbach’s alpha) were
computed per culture and grade. Inductive Reasoning Figures showed an average
of .86 (range: .79-.93), Rule Classification Figures .83 (.71-.90), Rule Generating
Figures .89 (.84-.95), Rule Testing Figures .85 (.81-.89), Inductive Reasoning
Letters .79 (.69-.88), Rule Classification Letters .83 (.73-.90), Rule Generating
Letters .93 (.90-.95), and Rule Testing Letters .78 (.63-.85). Overall, the internal
consistencies yielded adequate values. Country differences were examined in a
procedure described by Hakstian and Whalen (1976). Data of all grades were
combined. The M statistic, which follows a chi-square distribution with two degrees of
freedom, was significant for Inductive Reasoning Figures (M = 64.92, p < .001), Rule
Classification Figures (M = 10.57, p < .01), Rule Generating Figures (M = 34.06, p
< .001), Rule Classification Letters (M = 11.57, p < .01), and Rule Testing Letters (M
= 12.40, p < .01). The Dutch group tended to have lower internal consistencies (a
possible explanation is given later).
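For readers less familiar with the coefficient, the computation of Cronbach's alpha can be sketched as follows. The sketch uses the standard formula (the number of items divided by the number of items minus one, multiplied by one minus the ratio of summed item variances to total score variance); the function name and the response matrix are hypothetical and do not reproduce data from the present study.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents' item-score lists."""
    n_items = len(scores[0])
    # Variance of each item across respondents.
    item_variances = [
        pvariance([row[j] for row in scores]) for j in range(n_items)
    ]
    # Variance of the total score across respondents.
    total_variance = pvariance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical dichotomous responses (rows: pupils, columns: items).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
```

Because alpha depends on the ratio of item variances to total score variance, a group with restricted score variance on a task (for example, because of ceiling effects) would be expected to show a lower value.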
Insert Tables 2 and 3 about here
The average proportions of correctly solved items per country, grade, and
task are given in Table 2. Differences in average scores were tested in a
multivariate analysis of variance with country (3 levels; Zambia, Turkey, and the
Netherlands), grade (4 levels; 5 through 8), and gender (2 levels). Separate
analyses were carried out for the letter and figure mode. It may be noted that the
present analysis is reported merely for exploratory purposes, to give insight into the
relative contribution of each factor to the overall score variation; conclusions about
country differences in inductive reasoning or its components are premature until full
score equivalence across countries has been shown. Table 3 gives the
estimated effect sizes (proportion of variance accounted for). The results were
essentially similar for the two modes. Country was highly significant (p < .001) in all
tasks, usually explaining more than 10% of the variance. Zambian pupils tended to show the lowest
scores and Dutch pupils the highest scores. Grade differences were as expected; as
can be confirmed in Table 2, scores increased with grade. The effect sizes were
substantial, usually larger than 10%, and highly significant for all tasks (p < .001).
Gender differences were small; significant differences were found for Rule Testing
Figures and Inductive Reasoning Letters (girls scored higher on both tasks), but
gender differences did not explain more than 1% of the variance on any task. The country by grade
interaction was significant in all analyses, explaining between 1 and 5%. As can be
seen in Table 2, score increases with grade tended to be smaller in the Netherlands
than in the other two countries. Country differences in scores were large in all
grades but tended to become smaller with age. These results are in line with a
meta-analysis (Van de Vijver, 1997) in which, in the age range examined here, there
was no increase of cross-cultural score differences with age (contrary to what would
be predicted from Jensen’s, 1977, cumulative deficit hypothesis). Other interactions
were usually smaller and often not significant.
Structural Equivalence in Internal Procedure
Hypothesis. The first hypothesis addresses equivalence in internal
procedures by examining the decomposition of the item difficulties. The hypothesis
states that facet levels provide an adequate decomposition of the item difficulties of
each task in each country (Hypothesis 1a). See Table 4 for an overview of the
hypotheses and their tests.
Statistical procedure. Structural equivalence of all tests is examined using the linear
logistic model (LLM) (Fischer, 1974, 1995). It is an extension of the Rasch model,
which is one of the frequently employed models in item response theory. The Rasch
model holds that the probability that a subject k (k = 1, …, K) responds correctly to
item i is given by
$$\frac{\exp(\theta_k - \beta_i)}{1 + \exp(\theta_k - \beta_i)}, \tag{1}$$
in which $\theta_k$ represents the person’s ability and $\beta_i$ the item difficulty. An item is
represented by only one parameter, namely its difficulty (unlike some other models
in item response theory in which each item also has a discrimination parameter,
sometimes in addition to a pseudo-guessing parameter). A sufficient statistic for
estimating a person’s ability is the total number of correctly solved items on the task.
Analogously, the number of correct responses at an item provides a sufficient
statistic for estimating the item difficulty. For our present purposes the main interest
is in item parameters.
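The Rasch response function of equation (1) can be sketched in a few lines of Python. The function name is hypothetical; the arguments stand for the person ability (theta) and the item difficulty (beta) defined in the text.

```python
import math

def rasch_probability(theta, beta):
    """Probability of a correct response under the Rasch model (equation 1)."""
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))

# A pupil whose ability equals the item difficulty has a 50% chance of success.
print(rasch_probability(0.0, 0.0))

# Only the difference between ability and difficulty matters.
print(rasch_probability(1.0, 0.5), rasch_probability(0.5, 0.0))
```

Because the model depends only on the difference between ability and difficulty, shifting both by the same constant leaves all response probabilities unchanged; this is why the scale origin has to be fixed by a constraint.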
The LLM imposes a constraint on the item parameter by specifying that the
item difficulty is the sum of an intercept (that is irrelevant here) and a sum of
underlying facet level difficulties, $\eta_j$:
$$\beta_i = \alpha + \sum_j q_{ij}\,\eta_j \tag{2}$$
The second step aims at estimating the facet level difficulties ($\eta$). Suppose that the
item is “BBBBNM BBBBKJ BBBBHJ BBFTHG BBBBHN”. In terms of the facets, the
item can be classified as involving (a) four letters (facet: number of letters); (b) equal
letters within and across groups of letters (facet: item rule). The above model
equation (2) specifies that the item parameter will be the sum of an intercept, two
facet level difficulties (namely the difficulty parameter of items dealing with four
letters and the difficulty parameter of items dealing with equal letters within and
across groups of letters), and a residual component.
The matrix Q (with elements qij) defines the independent variable; the matrix
has m rows (the number of items of the task) and n columns (the number of
independent facet levels of the task). Entries of the Q matrix are zero or one
depending on whether the facet level is absent or present in the item (interactions of
facets were not examined). In order to guarantee uniqueness of the parameter
estimates in the LLM, linear dependencies in the design matrix were removed by
leaving the first level of each facet out of the design matrix. This (arbitrary) choice
implied that the first level of each facet has a difficulty level of zero and that the size
and significance of other facet levels should be interpreted relative to this “anchor.”
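The construction of the design matrix described above can be sketched as follows. The facet names, the facet levels, their ordering, and the numerical facet level difficulties are all hypothetical and serve only to show how the first level of each facet is left out as the anchor and how an item difficulty is rebuilt from its facet levels (equation 2).

```python
# Hypothetical facets of a letter task; the first level of each facet is the anchor.
facets = {
    "number_of_letters": [2, 3, 4],
    "item_rule": ["equal_within_and_across",
                  "equal_within_unequal_across",
                  "alphabet_position"],
}

def design_columns(facets):
    """Columns of Q: every facet level except the first (anchor) level."""
    return [(name, level) for name, levels in facets.items() for level in levels[1:]]

def q_row(item, facets):
    """Row of Q for one item: 1 if the facet level is present, else 0."""
    return [1 if item[name] == level else 0 for name, level in design_columns(facets)]

def faceted_difficulty(item, facets, eta, intercept=0.0):
    """Item difficulty as intercept plus the sum of its facet level difficulties."""
    columns = design_columns(facets)
    return intercept + sum(q * eta[col] for q, col in zip(q_row(item, facets), columns))

# Hypothetical facet level difficulties (relative to the anchor levels, which are zero)
eta = {
    ("number_of_letters", 3): -0.4,
    ("number_of_letters", 4): -0.9,
    ("item_rule", "equal_within_unequal_across"): 0.6,
    ("item_rule", "alphabet_position"): 1.5,
}

item = {"number_of_letters": 4, "item_rule": "equal_within_and_across"}
row = q_row(item, facets)
difficulty = faceted_difficulty(item, facets, eta)
print(row, difficulty)
```

In this toy coding the item rule of the example item happens to be the anchor level, so only the "four letters" facet level contributes to its reconstructed difficulty.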
The sufficient statistic for estimating the basic parameters is the number of
correct answers at the items that make up the facet level. As a consequence, there
will be a perfect rank order between this number of correct answers and $\eta_j$. Various
procedures have been developed to estimate the basic parameters. In the present
study conditional maximum likelihood estimation was used (details of this
computationally rather involved procedure are given by Fischer, 1974, 1995). An
important property of the LLM is the sample independence of its parameters;
estimates of the item difficulty and the basic parameters are not influenced by the
overall ability level of the pupils. This property is attractive here because it allows for
their estimation, even when average scores of cultural groups differ.
The LLM is a two-step procedure; the first consists of a Rasch analysis.
Estimates of item ($\beta$) and person ($\theta$) parameters of equation (1) are obtained. In the
second step the parameters of equation (2) are estimated. The item parameters are
used in the evaluation of the fit of the model. The fit of an LLM can be evaluated in
various ways. First, a likelihood ratio test can be computed, comparing the likelihood
of the (unrestricted) Rasch model to the (restricted) LLM. The statistic is of limited
value here. The ratio is affected by guessing (Van de Vijver, 1986). Because all
tasks employed a multiple-choice format, it is unrealistic to assume that a Rasch
model would hold. The usage of an LLM may seem questionable here because of
the occurrence of guessing (pupils were instructed to answer all items). However,
Van de Vijver (1986) has shown that guessing gives rise to a reduction of the
variance of the estimated person and item parameters but correlations of both
estimated parameters with their true values are hardly affected. A useful heuristic to
evaluate the degree of fit of the LLM is provided by the correlation between the
Rasch item parameters of the first step of the analysis and the item parameters of the
second step, as reconstructed by means of the design matrix. It amounts to correlating
the item parameters of the first step ($\beta_i$) (the “unfaceted item difficulties”) with the item
parameters of the second step, using $\beta_i^* = \sum_j q_{ij}\,\eta_j$ (the “faceted item difficulties”). The
latter vector gives the item parameters estimated on the basis of the estimated facet
level difficulties. Higher correlations point to a better approximation of item level
difficulties by facet level difficulties and hence, to a better modelability of inductive
reasoning.
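The fit heuristic can be illustrated with a short Python sketch that rebuilds the faceted item difficulties from a design matrix and correlates them with the unfaceted Rasch difficulties. The design matrix, the facet level difficulties, and the Rasch estimates below are hypothetical numbers chosen for illustration, not estimates from the study.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical design matrix: 4 items, 2 non-anchor facet levels
Q = [[0, 0], [1, 0], [0, 1], [1, 1]]
eta = [0.8, 1.2]                      # hypothetical facet level difficulties
beta_rasch = [-1.1, -0.1, 0.3, 0.9]   # hypothetical unfaceted item difficulties

# Faceted item difficulties: sum of the facet level difficulties present in each item
beta_star = [sum(q * e for q, e in zip(row, eta)) for row in Q]

r = pearson(beta_rasch, beta_star)
print(round(r, 3))
```

The intercept of equation (2) is omitted from the reconstruction because adding a constant to all item difficulties does not change the correlation.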
Every task has its own design matrix, consisting both of facets that were
common to all tasks of a mode (e.g., the item-generating rules) and task-specific
facets (e.g., the number of correct answers in the rule generating tasks). The
analyses were carried out per country and grade. The item difficulties (with different
values per country and grade) and the Q matrix (invariant across grades and
countries for a specific task) were input to the analyses. This procedure was
repeated for each task, making a total of 8 (tasks) x 4 (grades) x 3 (countries) = 96
analyses.
The LLM is applied here as one of two tests of structural equivalence. This
type of equivalence addresses the relationship between measurement outcomes
and the underlying construct. The facets of the tasks are assumed to influence the
difficulty of the items. For example, it can be expected that rules in items of the letter
tasks are easier when they involve more letters. The analysis of structural
equivalence examines whether the facets exert an influence on item difficulty in
each culture. In more operational terms, structural equivalence is supported if the
correlation of each analysis is significantly larger than zero. A significant correlation
points to a contribution of the facet levels to the item difficulty: It indicates that the
facet levels contribute to the prediction of item difficulties.
Insert Table 5 about here
Hypothesis test. As can be seen in Table 5, the correlations between the
unfaceted Rasch item parameters (of equation 1) and the faceted item parameters
(of equation 2) were high for all tasks in each grade in each country. These high
correlations provide powerful evidence that the same facets influence item difficulty
in each country. It can be concluded that Hypothesis 1a, according to which the item
difficulty decomposition would be adequate in each country, was strongly supported.
Insert Table 6 about here
The second question involves the patterning of the correlations of Table 5.
This question was addressed in an analysis of variance, with country (3 levels:
Zambia, Turkey, and the Netherlands), stimulus mode (2 levels: figure and letters
tests), and type of skill (4 levels: inductive reasoning and each of the three skill
component tasks) as independent variables; the correlation was the dependent
variable. The four grades were treated as independent replications. As can be seen
in Table 6, all main effects and first order interactions were significant. A significantly
lower correlation (and hence a poorer fit of the data to the model) was found for
figure tasks than for letter tasks (averages of .87 and .91, respectively), F(2, 72) =
24.85, p < .001. The effect was considerable, explaining 22% of the total score
variation. About the same percentage was explained by skill components, F(3, 72) =
37.94, p < .001. The lowest correlations were obtained for rule classification
and rule generating (both .87), followed by inductive reasoning (.89), while rule testing
showed the highest value (.93). The high values of the latter may be due to a
combination of a large number of items (long tests) and the presence of both very
easy and difficult facet levels in the rule testing tasks in each country; such facet
levels increase the dispersion and will give rise to high correlations. Country
differences explained about 10% of the score variation; the correlations of the
Turkish and Zambian groups were very close to each other (.91 and .90,
respectively), while the value for the Dutch group was .87. A closer inspection of the
data revealed that the largest differences between the countries were found for
tasks with relatively high proportions of correctly solved items. Therefore, the
difference in fit may be due to ceiling effects in the Dutch group, which by definition
reduce the correlation. The most important interaction, explaining 16% of the total
score variation, was observed between country and stimulus mode, F(2, 72) =
19.92, p < .001. Whereas the correlations of the two stimulus modes did not differ by more
than .03 in Zambia and Turkey, the difference in the Dutch sample was .09.
The interaction of stimulus mode and skill component was also significant,
F(3, 72) = 9.61, p < .001. Correlations of rule generating and rule testing differed on
average by .03 between the figure mode and the letter mode, while a much
larger difference of .08 was observed for rule classification. The interaction of country
and skill was significant though less important (explaining only 5%). The score
differences of the cultures were relatively small for rule generating and rule testing
and much larger for inductive reasoning and rule classification, mainly due to the
relatively low scores of the Dutch. Again, ceiling effects may have induced the effect
(not necessarily at the largest attainable score, because some facet levels remained
beyond the reach of many pupils, even the highest scorers). In sum, the analysis of
the correlations revealed high values for all tasks in the three countries. The
observed country differences were presumably more due to ceiling effects than to
country differences in modelability of inductive reasoning and its components.
Ceiling effects may also explain the lower internal consistencies in the Dutch data,
discussed before.
Insert Figure 1 and 2 about here
The estimated facet level difficulties (the estimated values of $\eta$, cf. equation
2) of all tests have been drawn in Figure 1 (figure tests) and Figure 2 (letter tests).
Higher values of $\eta$ refer to more difficult facet levels. The most striking finding of
both Figures is the proximity of the three country curves; this points to the cross-
cultural similarity in pattern of difficult and easy facet levels, which yields further
evidence for the structural equivalence of the instruments in the present samples.
Furthermore, most facet levels behaved as expected. As for the figure tasks, the
third item-generating rule (about alternating additions and subtractions) was
invariably the most difficult. Items were more difficult when they dealt with shorter
periods, when a variable number of elements were added or subtracted in
subsequent figures of a period, when periodicity cues were absent, and when
periods did not repeat each other. The number of valid triplets (only present in the
rule-generating task) showed large variation. Pupils found it relatively easy to
retrieve one correct solution, but relatively difficult to find all solutions when the item
contained two or three valid triplets.
The difficulty patterning of the letter tasks also followed expectation. Dealing
with equal letters was easier than dealing with positions in the alphabet. Items about
equal letters within and across groups (e.g., BBBBBB BBBBBB) were easier than
items about letters that were equal within and unequal across groups (e.g., BBBBBB
GGGGGG). Items were easier when the underlying rule involved more letters (which
facilitates recognition). Items about positions in the alphabet (the last two item-
generating rules of the letter mode) were easier when they involved smaller jumps
(e.g., ABCD was easier to recognize as a group of letters in which the position of
letters in the alphabet is important than ACEG, which in turn was easier to recognize than
ADGJ). As in the rule-generating task of the figure mode, a strong effect of the number
of valid triplets was found. Finding all solutions turned out to be difficult and valid
triplets were often overlooked.
Measurement Unit Equivalence in Internal Procedure
Hypothesis. For each task the same facet level difficulties apply in each
country (Hypothesis 1b; cf. Table 4).
Statistical procedure. The LLM parameters can also be used to test
measurement unit equivalence. This type of equivalence goes beyond structural
equivalence by assuming that the tasks as applied in the three countries have the
same measurement units (but not necessarily the same scale origins). If the
estimated parameters of equation 2 are invariant across countries except for
random fluctuations, there is strong evidence for the invariance of the measurement
units of the test scales. This invariance would imply that the estimated facet level
difficulties in a particular country could be replaced by the difficulty of the same facet
in another country without affecting the fit of the model. For these analyses the data
for the grades in a country were combined because of the primary interest in country
differences.
Hypothesis test. Standard errors of the estimated facet level difficulties
ranged from 0.05 to 0.10. As can be derived from Figures 1 and 2, in each task there
are facet levels that differ significantly across countries. It can be safely concluded
that scores did not show complete measurement unit equivalence.
Yet, it is also clear from these Figures that some facet levels are not
significantly different across countries. So, the question arises to what extent facet
levels are identical across countries. The question was addressed using intraclass
correlations, measuring the absolute agreement of the estimated facet level
difficulties in the three countries. The absolute agreement of the estimated basic
parameters of a single task across countries was evaluated; per task the intraclass
correlation of the country by facet level matrix was computed. The letter tasks
showed consistently higher values than the figure tasks. The average agreement
coefficient was .91 for the figure tasks and .96 for the letter tasks (all intraclass
correlations were significantly above zero, p < .001). The high within-task agreement
points to an overall strong agreement of facet levels across countries. The estimated
facet level difficulties come close to being interchangeable across countries (despite
the significant differences of some facet levels).
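A minimal Python sketch of an absolute-agreement intraclass correlation for a facet level by country matrix is given below. The text does not state which variant of the coefficient was computed; the sketch assumes the two-way, single-measure, absolute-agreement form, and the function name and example matrices are hypothetical.

```python
def icc_absolute_agreement(matrix):
    """Two-way, single-measure, absolute-agreement intraclass correlation.

    matrix: rows are facet levels, columns are countries (an assumed layout).
    """
    n, k = len(matrix), len(matrix[0])
    grand = sum(map(sum, matrix)) / (n * k)
    row_means = [sum(r) / k for r in matrix]
    col_means = [sum(r[j] for r in matrix) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in matrix for v in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err))

# Identical difficulties in every country: perfect absolute agreement
identical = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
# Same rank order and spacing, but one country is shifted by a constant
shifted = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]

icc_identical = icc_absolute_agreement(identical)
icc_shifted = icc_absolute_agreement(shifted)
print(icc_identical, icc_shifted)
```

The second example shows why absolute agreement is the stricter criterion: a constant shift between countries leaves the pattern of easy and difficult facet levels intact but still lowers the coefficient.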
A recurrent theme in the analysis is the better modelability of the letter tasks
as compared to the figure tasks, due to the wider range of facet level difficulties in the
letter than in the figure mode. The range differences may be a consequence of the
choices made in the test design stage. One of the problems of many existing figure
tests is their often implicit definition of permitted stimulus transformations (e.g.,
rotating and flipping). This lack of clarity, presumably an important source of cross-
cultural score differences, was avoided in the present study by spelling out all
permitted transformations in the test instructions. Apparently, the price to be paid for
providing the pupils with this information is a small variation in facet level difficulties.
Structural Equivalence in External Procedure
Hypothesis. The skill components contribute to inductive reasoning in each
country (Hypothesis 2a; cf. Table 4).
Insert Figure 3 about here
Statistical procedure. External procedures to establish equivalence scrutinize
the relationships between inductive reasoning and its componential skills. A specific
type of structural equation model was used, namely a MIMIC model (Multiple
Indicators, Multiple Causes; see Van Haaften & Van de Vijver, 1996, for another
cross-cultural application). A MIMIC is a model that links input and output through
one latent variable (see Figure 3). The core of the model is the latent variable,
labeled inductive reasoning. This variable, $\eta$, is measured by the two tasks of
inductive reasoning (the output variables). The input to the inductive reasoning
factor comes from the skill components; the components are said to influence the
latent factor and this influence is reflected in the two tasks of inductive reasoning. In
sum, the MIMIC model states that inductive reasoning is measured by two tasks
(IRF and IRL) and is influenced by three components (classification, rule generating,
and rule testing). The model equations are as follows:
$$y_1 = \lambda_1 \eta + \varepsilon_1, \qquad y_2 = \lambda_2 \eta + \varepsilon_2, \tag{3}$$
in which $y_1$ and $y_2$ denote observed scores on the two tasks of inductive thinking, $\lambda_1$
and $\lambda_2$ the factor loadings, and $\varepsilon_1$ and $\varepsilon_2$ error components. The latent variable, $\eta$, is
linked to the skill components in a linear regression function:
$$\eta = \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_3 + \zeta, \tag{4}$$
where the gammas are the regression coefficients, the x-variables refer to the skill
components, and $\zeta$ is the error component. In order to make the estimates
identifiable, the factor loading of IRF, $\lambda_1$, was fixed at one.
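To make the structure of equations (3) and (4) concrete, the following Python sketch computes the covariances implied by a one-factor MIMIC model with two output tasks and three input components. It assumes that the error terms are uncorrelated with each other and with the skill components. All parameter values are hypothetical; the argument names (loadings, regression coefficients, covariance matrix of the components, residual variance of the latent variable, error variances of the tasks) follow the description in the text.

```python
def mimic_implied(lam, gamma, phi, psi, theta):
    """Model-implied (co)variances of a one-factor MIMIC model.

    lam:   factor loadings of the output tasks (first one fixed at 1)
    gamma: regression coefficients of the input components
    phi:   covariance matrix of the input components
    psi:   variance of the error term of the latent variable
    theta: error variances of the output tasks
    """
    p, m = len(gamma), len(lam)
    # Covariance of each component with the latent variable: phi times gamma
    phi_gamma = [sum(phi[r][c] * gamma[c] for c in range(p)) for r in range(p)]
    # Variance of the latent variable: gamma' phi gamma + psi
    var_eta = sum(gamma[r] * phi_gamma[r] for r in range(p)) + psi
    cov_yy = [[lam[a] * lam[b] * var_eta + (theta[a] if a == b else 0.0)
               for b in range(m)] for a in range(m)]
    cov_yx = [[lam[a] * phi_gamma[r] for r in range(p)] for a in range(m)]
    return var_eta, cov_yy, cov_yx

# Hypothetical parameter values
lam = [1.0, 0.8]
gamma = [0.2, 0.3, 0.4]
phi = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
var_eta, cov_yy, cov_yx = mimic_implied(lam, gamma, phi, psi=0.3, theta=[0.4, 0.5])
print(var_eta, cov_yy[0][1], cov_yx[1])
```

The sketch makes visible that the components are related to both tasks only through the latent variable: each covariance between a task and a component is the task's loading times the component's covariance with the latent variable.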
An attractive feature of structural equation modeling is its allowance for
multigroup analysis. This means that the adequacy of the above model equations for
the data can be evaluated for all 12 data sets (4 grades x 3 countries) at once. The
fit statistics yield an overall assessment, covering all data sets.
The theoretical model underlying the study stipulates that the three skill
components constitute essential elements of inductive reasoning. In terms of the
MIMIC analysis, this means that structural equivalence would be supported by a
good fit of a model with three input and two output variables as described. Nested
models were analyzed. In the first step all parameters were held fixed across data
sets, while in subsequent steps similarity constraints were lifted in the following
order (cf. Table 7): the error variance (unreliability) of the tasks of inductive
reasoning, the intercorrelations of the tasks of skill component, the error variance of
the latent variable, the regression coefficients, and the factor loadings. The order
was chosen in such a way that invariance of relationships involving the latent
variable (i.e., regression coefficients and factor loadings) was retained as long as
possible. More precisely, structural equivalence would be supported when the
MIMIC model with the fewest equality constraints across countries shows a good fit
and all MIMIC parameters differ from zero (hypothesis 2a). It would mean that the
tasks of inductive reasoning constitute a single factor that is influenced by the same
skill components in each analysis (the possibility that there is a good fit but that some
regression coefficients or factor loadings are negative is not further considered here
because no covariances were negative).
Insert Table 7 about here
Hypothesis test. The relationship of the skill components and inductive
reasoning tasks was examined in a MIMIC model (see Table 7; more details are
given in Appendix C). Nested models were fitted to the data of both stimulus modes.
The choice of a MIMIC model was mainly based on the relatively large change of all
fit statistics when constraints were imposed on the phi matrices (the covariance
matrices of the component skills; see the figure of Appendix B); therefore, the model
with equal factor loadings, regression coefficients, and error variances was chosen.
Although the letter tasks showed a better fit than the figure tasks, the choice of a
model was less straightforward. A MIMIC model with a similar pattern of free and
fixed parameters in both stimulus modes was chosen, mainly because of parsimony
(see footnote to Table 7 for a more elaborate explanation).
The standardized solution of the two models is given in Figure 3. As
hypothesized, all loadings and regression coefficients were positive and significant
(p < .01). It can be concluded that inductive reasoning with figure and letter stimuli
involves the same components in each country. This supports structural
equivalence, as predicted in hypothesis 2a. The regression coefficients of the figure
component tasks were unequal to each other: rule classification was least important,
followed by rule generating, while rule testing showed the largest contribution to
inductive reasoning. The letter mode did not show this patterning; the regression
coefficients of the component tasks of the letter mode were rather similar to one
another.
Measurement Unit Equivalence in External Procedure
Hypothesis. The skill components contribute in the same way to inductive
reasoning in each country (Hypothesis 2b; cf. Table 4).
Statistical procedure. Measurement unit equivalence can be scrutinized by
introducing and testing equality constraints in the MIMIC model. This type of
equivalence would be supported when a single MIMIC model with identical
parameter values holds in all countries. It may be noted that this test is stricter than
the ones proposed in the literature. Whereas the latter tend to analyze all tasks in a
single exploratory or confirmatory factor analysis, more specific relationships
between the tasks are considered here.
Hypothesis test. The psychologically most salient elements of the MIMIC, the
factor loadings, regression coefficients, and the explained variance of the latent
variable, were found to be invariant across countries. However, measurement unit
equivalence also requires the other parameter matrices to be invariant. In the figure
mode the model with equality constraints for all matrices showed a rather poor fit,
with an NNFI of .88, a GFI of .96, and an RMSEA of .045. An inspection of the delta
chi square values indicated that in particular the introduction of equality of
covariances of the skill components ($\Phi$) reduced the fit significantly. The letter tasks
showed a similar picture; the most restrictive model revealed values of .89 for the
NNFI, .82 for the GFI, and .041 for the RMSEA, which can be interpreted as a rather
poor fit. Again, equality of the $\Phi$ matrices led to a significant reduction of the fit. Like
in our internal procedure to examine measurement unit equivalence, we found some
but inconclusive evidence for the measurement unit equivalence of the task scores
across countries; hypothesis 2b had to be rejected.
Full Score Equivalence
Hypothesis. Both tasks of inductive reasoning show full score equivalence
(Hypothesis 3; cf. Table 4).
Statistical procedure. Full score equivalence can be examined in an item bias
analysis. A logistic regression model was applied to analyze item bias (Rogers &
Swaminathan, 1993). Advantages of the model are the possibility to include more
than two groups and to examine both uniform and nonuniform bias (Mellenbergh,
1982). The combined samples of the three countries are used to determine cutoff
scores that split up the sample in three score level groups (low, medium, and high)
of about the same size. In the logistic regression procedure, culture (dummy coded),
score level, and their interaction are the independent variables, while the item
response is the dependent variable. A significant main effect of culture points to
uniform bias: individuals from at least one country show an unexpectedly low or high
score across all score levels on the item as compared to individuals with the same
test score from other cultures. A significant interaction points to nonuniform bias: the
systematic difference of the scores depends here on the score level; for example,
country differences in scores among low scorers are not found among high scorers.
Alpha was set at a (low) level of .001 in the item bias analyses in order to prevent
inflation of Type I errors, due to multiple testing (although, obviously, the power of
the procedure is adversely affected by this choice).
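The logic of the bias analysis can be sketched in Python with simulated data. The sketch is not the analysis of the study: it simulates one hypothetical item with a built-in uniform bias, draws the score level at random instead of deriving it from test scores, treats score level as a numeric predictor (0, 1, 2), and tests uniform and nonuniform bias with likelihood ratio tests of nested logistic regression models. With three countries and a numeric score level both tests have two degrees of freedom, for which the chi-square tail probability is exp(-x/2).

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x

def max_loglik(X, y, iterations=25):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    p = len(X[0])
    b = [0.0] * p
    for _ in range(iterations):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = 1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(b, xi))))
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        b = [bj + sj for bj, sj in zip(b, solve(hess, grad))]
    z = [sum(bj * xj for bj, xj in zip(b, xi)) for xi in X]
    return sum(yi * zi - math.log(1.0 + math.exp(zi)) for yi, zi in zip(y, z))

# Simulate one hypothetical item: 3 countries x 300 pupils, uniform bias for country 2
random.seed(1)
rows = []
for country in range(3):
    for _ in range(300):
        level = random.randrange(3)            # 0 = low, 1 = medium, 2 = high
        shift = 1.5 if country == 2 else 0.0   # built-in uniform bias
        p_correct = 1.0 / (1.0 + math.exp(-(-1.0 + 1.0 * level + shift)))
        rows.append((country, level, 1 if random.random() < p_correct else 0))

y = [r[2] for r in rows]
base = [[1.0, l] for c, l, _ in rows]                        # score level only
uniform = [[1.0, l, float(c == 1), float(c == 2)] for c, l, _ in rows]  # + country dummies
full = [u + [u[2] * u[1], u[3] * u[1]] for u in uniform]     # + country x level

ll_base, ll_uniform, ll_full = (max_loglik(X, y) for X in (base, uniform, full))
lr_uniform = 2.0 * (ll_uniform - ll_base)       # main effect of country
lr_nonuniform = 2.0 * (ll_full - ll_uniform)    # country by score level interaction
p_uniform = math.exp(-max(lr_uniform, 0.0) / 2.0)        # chi-square tail, df = 2
p_nonuniform = math.exp(-max(lr_nonuniform, 0.0) / 2.0)  # chi-square tail, df = 2
print(p_uniform < 0.001, p_nonuniform < 0.001)
```

In an application the score level would be obtained by splitting the combined sample into three groups of about equal size on the basis of the total test score, and the two tests would be repeated for every item at the chosen alpha level.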
Insert Figure 4 about here
Hypothesis test. In the introduction two approaches were mentioned to
examine full score equivalence that are based on structural equation modeling:
multilevel covariance structure analysis and the modeling of latent means. The
former could not be used due to the small number of countries involved, while the
latter was precluded because of the incomplete support of measurement unit
equivalence. This lack of support indeed prohibits any analysis of full score
equivalence. Yet, because the bias analysis yielded interesting results, it is reported
here for exploratory purposes. Of the 30 items of the IRF, 15 items were biased (13
items uniform, 11 items nonuniform), mainly involving the Dutch-Zambian
comparison. The occurrence of bias was related to the difficulty of the items; both
the easiest and most difficult items showed the most bias. The correlation between
the presence of bias (0 = absent, 1 = present) and the deviance of the item score
from the mean (i.e., average item score - overall average) was .64 (p < .001). The
correlation suggests a methodological artifact, such as floor and ceiling effects. This
was confirmed by an inspection of the contingency tables underlying the logistic
regression analyses. Figure 4 depicts empirical item characteristic curves of two
items that showed both uniform and nonuniform bias. The upper panel shows a
relatively easy item (with an overall mean of .79) and the lower panel a relatively
difficult item (mean of .33). The bias for the easy item is induced by country
differences at the lowest score level that are not reproduced at the higher levels.
Analogously, the scores for the difficult item remain close to the guessing level
(of .20) in the two lowest score levels, while there is more score differentiation in the
highest scoring group. The score patterns of Figure 4 were found for several items.
It appears that ceiling and floor effects led to item bias in the IRF.
Three items were found to be biased in the IRL (one uniform and two
nonuniform). The Zambian pupils showed relatively high scores on these items.
Because the items were few and involved different facet levels, the reasons for the
bias were not understood, a fairly common finding in item bias research (cf. Van de
Vijver & Leung, 1997b). Floor and ceiling effects did not occur, which points to an
important difference between the two tasks of inductive reasoning; whereas at the
IRF pupils tended to answer items either with a very low or a very high level of
accuracy, pupil scores at the IRL varied more gradually. Similarly, in the IRF there
were no facet levels that were either too difficult or too easy for most of the sample,
but both types of facet levels were present in the IRL.
Discussion
The equivalence of two tasks of inductive reasoning was examined in a
cross-cultural study involving 632 Dutch, 877 Turkish, and 704 Zambian pupils from
the highest two grades from primary and the lowest two grades from secondary
school. Two stimulus modes were examined: letters and figures. In each mode tasks
for inductive reasoning and for each of its components, classification, generation,
and testing, were administered. The structural, measurement unit, and full score
equivalence of the instruments in these countries were studied. A MIMIC model was
fitted to the data, linking skill components to inductive reasoning through a latent
variable, labeled inductive reasoning (external procedure). A linear logistic model
was utilized to examine to what extent in each country item difficulties could be
adequately decomposed into the underlying rules that were used to generate the
items (internal procedure). In keeping with past research, structural equivalence was
strongly supported; yet, measurement unit equivalence was not fully supported. It is
interesting to note that two different statistical models, item response theory (LLM)
and structural equation modeling (MIMIC), looking at different aspects of the data
(facet level difficulties in the LLM and covariances of tasks with componential skills)
yielded the same conclusion about measurement unit equivalence.
Critics might argue that the emphasis on equivalence of the present study is a
misnomer and detracts attention from the real cross-cultural differences
observed with these carefully constructed instruments. In this line of reasoning the
results would show massive differences in inductive reasoning across countries, with
Zambian pupils having the lowest skill level, Turkish pupils having an intermediate
position, and Dutch pupils having the highest level. The validity of this conclusion is
underscored by the LLM analyses in which it was found that most facet level
difficulties are identical and interchangeable across the three countries while a small
number is country dependent. Even if score comparisons are restricted to the facet
levels with the same difficulties, at least some of the score differences of the
countries are likely to remain. In this line of reasoning the current study has
demonstrated the presence of at least some but presumably large differences in
inductive reasoning, with Western subjects showing the highest skill levels. In my
view, this interpretation rests on a simplistic and untenable view of country score
differences. These differences are not just a matter of differences in inductive
reasoning skills. It may well be that differences of country scores on the tasks are
partly or entirely due to additional factors. Various educational factors may play a
role here, as is often the case in comparisons of highly dissimilar cultural groups. In
a meta-analysis, Van de Vijver (1997) found that educational expenditure is a
significant predictor of country differences in mental test performance. Does quality
of schooling have an influence on inductive reasoning? I concur with Cole (1996),
who, after reviewing the available cross-cultural evidence, concluded that schooling
does not have a formative influence on higher-order forms of thinking but tends to
broaden the domains in which these skills can be successfully applied. Schooling
facilitates the use of these skills through training and through exposure to
psychological and educational tests (cf. Rogoff, 1981; Serpell, 1993). The educational differences of
the populations of the current study are massive. For example, attending
kindergarten is more common in Turkey and the Netherlands than in Zambia, and
schools in Zambia have a fraction of the learning material that schools in Turkey and
the Netherlands have at their disposal. The interpretation of the country differences
observed in the present study as reflecting real differences is based on an
underestimation of the impact of various context-related (educational) factors and an
overestimation of the ability of the tasks employed here to measure inductive reasoning
in all countries. Tasks that capitalize less on schooling and are more derived from
everyday experiences may show a different patterning of country differences.
The present results replicate the findings of many studies on structural
equivalence; there was strong support for the view that the instruments measure
inductive reasoning in the three countries. The present results make it very unlikely that there
are major cross-cultural differences in strategies and processes involved in inductive
reasoning in the populations studied. These results extend findings of numerous
factor analytic studies in showing that skill components contribute in a largely
identical way to inductive thinking and that item difficulty is governed by complexity rules
that are largely identical across cultures.
The results also show that numerical comparisons of scores obtained in different
countries are not warranted, despite the careful item construction process. This
negative finding on the numerical comparability may be due to the large cultural
distance of the countries involved here. However, it also points to the need to
address measurement unit and full score equivalence in cross-cultural research.
Cross-cultural comparisons of data that have not been scrutinized for equivalence
abound in the literature. In fact, it is rather difficult to find examples of data in which
the equivalence was examined in an appropriate way. Scores are often numerically
compared across cultures (assuming full score equivalence) when only structural
equivalence has been demonstrated; examples can be found in the cross-cultural
comparisons of the Eysenck personality scales (e.g., Barrett et al., 1998). It is
difficult to defend the practice of comparing scores across cultures when equivalence
has not been tested or when only structural equivalence has been observed. The
present study underscores the need to study equivalence of data before comparing
test scores. A more prudent treatment of cross-cultural score differences is badly
needed. We have firmly established the commonality of basic cognitive functions in
several cultural and ethnic groups (Waitz’s “psychic unity”), but we still have to come
to grips with the question of how to design cognitive tests that allow for numerical
score comparisons across a wide cultural range.
A final issue concerns the external validity of the present findings: To what
populations can the present results be generalized? The three countries involved in
the study differ greatly in affluence. Given the strong findings on
structural equivalence, it is realistic to assume that inductive reasoning is a universal
with largely identical components in schooled populations, at least as of the end of
primary school. Future studies should address the question of whether
measurement unit equivalence would be fully supported when the cultural distance
between the countries is smaller.
References
Barrett, P. T., Petrides, K. V., Eysenck, S. B. G., & Eysenck, H. J. (1998). The
Eysenck Personality Questionnaire: An examination of the factorial similarity of P, E,
N, and L across 34 countries. Personality and Individual Differences, 25, 805-819.
Campbell, D. T., & Naroll, R. (1972). The mutual methodological relevance of
anthropology and psychology. In F. L. K. Hsu (Ed.), Psychological anthropology.
Cambridge, MA: Schenkman.
Carroll, J. B. (1993). Human cognitive abilities. A survey of factor-analytic
studies. Cambridge: Cambridge University Press.
Claassen, N. C., & Cudeck, R. (1985). Die faktorstruktuur van die Nuwe Suid-
Afrikaanse Groeptoets (NSAG) by verskillende bevolkingsgroepe [The factor
structure of the New South African Group Test (NSAGT) in various population
groups]. South African Journal of Psychology, 15, 1-10.
Cole, M. (1996). Cultural psychology: A once and future discipline.
Cambridge, MA: Harvard University Press.
Ekstrom, R. B., French, J. W., & Harman, H. H. (1976). Kit of factor-
referenced tests. Princeton, NJ: Educational Testing Service.
Ellis, B. B. (1990). Assessing intelligence cross-nationally: A case for
differential item functioning detection. Intelligence, 14, 61-78.
Ellis, B. B., Becker, P., & Kimmel, H. D. (1993). An item response theory
evaluation of an English version of the Trier Personality Inventory (TPI). Journal of
Cross-Cultural Psychology, 24, 133-148.
Embretson, S. E. (1983). Construct validity: Construct representation versus
nomothetic span. Psychological Bulletin, 93, 179-197.
Fan, X., Willson, V. L., & Reynolds, C. R. (1995). Assessing the similarity of
the factor structure of the K-ABC for African-American and White children. Journal of
Psychoeducational Assessment, 13, 120-131.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests
[Introduction to the theory of psychological tests]. Bern: Huber.
Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W.
Molenaar (Eds.), Rasch models. Foundations, recent developments and
applications. New York: Springer.
Frijda, N., & Jahoda, G. (1966). On the scope and methods of cross-cultural
research. International Journal of Psychology, 1, 109-127.
Geary, D. C., & Whitworth, R. H. (1988). Is the factor structure of the WISC-R
different for Anglo- and Mexican-American children? Journal of Psychoeducational
Assessment, 6, 253-260.
Greenfield, P. M. (1997). You can't take it with you: Why ability assessments
don't cross cultures. American Psychologist, 52, 1115-1124.
Gustafsson, J-E. (1984). A unifying model for the structure of intellectual
abilities. Intelligence, 8, 179-203.
Hakstian, A. R., & Vandenberg, S. G. (1979). The cross-cultural
generalizability of a higher-order cognitive structure model. Intelligence, 3, 73-103.
Hakstian, A. R., & Whalen, T. E. (1976). A k-sample significance test for
independent alpha coefficients. Psychometrika, 41, 219-231.
Hennessy, J. J., & Merrifield, P. R. (1976). A comparison of the factor
structures of mental abilities in four ethnic groups. Journal of Educational
Psychology, 68, 754-759.
Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning.
Hillsdale, NJ: Erlbaum.
Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of
employment tests by race: A comprehensive review and analysis. Psychological
Bulletin, 86, 721-735.
Irvine, S. H. (1969). Factor analysis of African abilities and attainments:
Constructs across cultures. Psychological Bulletin, 71, 20-32.
Irvine, S. H. (1979). The place of factor analysis in cross-cultural methodology
and its contribution to cognitive theory. In L. Eckensberger, W. Lonner, & Y. H.
Poortinga (Eds.), Cross-cultural contributions to psychology. Lisse, the Netherlands:
Swets & Zeitlinger.
Irvine, S. H., & Berry, J. W. (1988). The abilities of mankind: A revaluation. In
S. H. Irvine & J. W. Berry (Eds.), Human abilities in cultural context. Cambridge:
Cambridge University Press.
Jahoda, G., & Krewer, B. (1997). History of cross-cultural and cultural
psychology. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of
cross-cultural psychology (2nd ed., Vol. 1). Chicago: Allyn & Bacon.
Jensen, A. R. (1977). Cumulative deficit in intelligence of Blacks in the rural
South. Developmental Psychology, 13, 184-191.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of
cross-cultural data: Practical and theoretical issues. Multivariate Behavioral
Research, 32, 53-76.
McCrae, R. R., & Costa, P. T. (1997). Personality trait structure as a human
universal. American Psychologist, 52, 509-516.
Mellenbergh, G. J. (1982). Contingency table models for assessing item bias.
Journal of Educational Statistics, 7, 105-118.
Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational measurement
(3rd ed.). Hillsdale, NJ: Erlbaum.
Muthén, B. O. (1991). Multilevel factor analysis of class and student
achievement components. Journal of Educational Measurement, 28, 338-354.
Muthén, B. O. (1994). Multilevel covariance structure analysis. Sociological
Methods & Research, 22, 376-398.
Naglieri, J. A., & Jensen, A. R. (1987). Comparison of Black-White differences
on the WISC-R and the K-ABC: Spearman's hypothesis. Intelligence, 11, 21-43.
Poortinga, Y. H., & Van de Vijver, F. J. R. (1987). Explaining cross-cultural
differences: Bias analysis and beyond. Journal of Cross-Cultural Psychology, 18,
259-282.
Ree, M. J., & Carretta, T. R. (1995). Group differences in aptitude factor
structure on the ASVAB. Educational and Psychological Measurement, 55, 268-277.
Reschly, D. (1978). WISC-R factor structures among Anglos, Blacks,
Chicanos, and Native-American Papagos. Journal of Consulting and Clinical
Psychology, 46, 417-422.
Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression
and Mantel-Haenszel procedures for detecting differential item functioning. Applied
Psychological Measurement, 17, 105-116.
Rogoff, B. (1981). Schooling and the development of cognitive skills. In H. C.
Triandis & A. Heron (Eds.), Handbook of cross-cultural psychology: Volume 4,
Developmental psychology. Boston: Allyn & Bacon.
Sandoval, J. (1982). The WISC-R factorial validity for minority groups and
Spearman's hypothesis. Journal of School Psychology, 20, 198-204.
Serpell, R. (1979). How specific are perceptual skills? British Journal of
Psychology, 70, 365-380.
Serpell, R. (1993). The significance of schooling. Life journeys in an African
society. Cambridge: Cambridge University Press.
Sternberg, R. J. (1977). Intelligence, information processing, and analogical
reasoning: The componential analysis of human abilities. New York: Wiley.
Sternberg, R. J., & Kaufman, J. C. (1998). Human abilities. Annual Review of
Psychology, 49, 479-502.
Sung, Y. H., & Dawis, R. V. (1981). Level and factor structure differences in
selected abilities across race and sex groups. Journal of Applied Psychology, 66,
613-624.
Taylor, R. L., & Ziegler, E. W. (1987). Comparison of the first principal factor
on the WISC-R across ethnic groups. Educational and Psychological Measurement,
47, 691-694.
Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs,
No. 1.
United Nations (1999). Indicators on education [On-line]. Available Internet:
www.un.org/depts/unsd/social/education.htm.
Valencia, R. R., & Rankin, R. J. (1986). Factor analysis of the K-ABC for
groups of Anglo and Mexican American children. Journal of Educational
Measurement, 23, 209-219.
Valencia, R. R., Rankin, R. J., & Oakland, T. (1997). WISC-R factor structures
among White, Mexican American, and African American children: A research note.
Psychology in the Schools, 34, 11-16.
Van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied
Psychological Measurement, 10, 45-57.
Van de Vijver, F. J. R. (1997). Meta-analysis of cross-cultural comparisons of
cognitive test performance. Journal of Cross-Cultural Psychology, 28, 678-709.
Van de Vijver, F. J. R., & Leung, K. (1997a). Methods and data analysis of
comparative research. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.),
Handbook of cross-cultural psychology (2nd ed., Vol. 1). Chicago: Allyn & Bacon.
Van de Vijver, F. J. R., & Leung, K. (1997b). Methods and data analysis for
cross-cultural research. Newbury Park, CA: Sage.
Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item
response theory. New York: Springer.
Van Haaften, E. H., & Van de Vijver, F. J. R. (1996). Psychological
consequences of environmental degradation. Journal of Health Psychology, 1, 411-
429.
Willemsen, M. E., & Van de Vijver, F. J. R. (under review). Context effects in
logical reasoning in the Netherlands and Zambia.
Zuckerman, M., Kuhlman, D. M., Thornquist, M., & Kiers, H. A. L. (1991). Five
(or three) robust questionnaire scale factors of personality without culture.
Personality and Individual Differences, 12, 929-941.
Table 1
Sample Size per Culture, Grade, and Experimental Condition
                                     Gradeᵃ
Country       Test condition     5     6     7     8    Total
Zambia        Figure            80    79    94   123      376
              Letter            81    81    87    79      328
Turkey        Figure           127    97    95   102      421
              Letter           139   107   110   100      456
Netherlands   Figure           117    74    51    77      319
              Letter            83    91    77    62      313
Total                          627   529   514   543     2213
ᵃIn Zambia the grades are 6, 7, 8, and 9, respectively.
Table 2
Average Proportion of Correctly Solved Items per Task, Grade, and Culture
Task
Country Grade IRF RCF RGF RTF IRL RCL RGL RTL
Zambia 6 .40 .44 .53 .39 .49 .40 .37 .41
7 .55 .53 .55 .43 .56 .60 .47 .56
8 .56 .64 .55 .53 .58 .61 .44 .58
9 .62 .68 .64 .54 .61 .54 .48 .58
Turkey 5 .47 .51 .44 .42 .50 .54 .39 .49
6 .48 .57 .56 .46 .52 .53 .42 .47
7 .66 .73 .64 .58 .64 .70 .56 .65
8 .65 .75 .69 .63 .64 .71 .52 .60
Netherlands 5 .67 .80 .65 .64 .60 .63 .51 .58
6 .74 .73 .72 .67 .64 .68 .57 .66
7 .70 .74 .70 .66 .68 .72 .63 .67
8 .78 .84 .76 .77 .70 .74 .60 .69
IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule
Generating Figures; RTF: Rule Testing Figures; IRL: Inductive Reasoning Letters;
RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing
Letters
Table 3
Effect Sizes of Multivariate Analyses of Variance of the Psychological Tests per Test Mode
                                          Skill component
Independent     Multi-      Inductive    Rule              Rule          Rule
variable        variateᵃ    reasoning    classification    generating    testing
(a) Figure mode
Country (C)     .135***     .132***      .183***           .139***       .200***
Grade (G)       .073***     .108***      .149***           .132***       .125***
Sex (S)         .011*       .001         .002              .001          .010**
C × G           .035***     .013*        .063***           .041***       .014*
C × S           .012**      .016***      .017***           .014***       .014***
G × S           .009**      .004         .006              .004          .011**
C × G × S       .009*       .010         .007              .008          .004
(b) Letter mode
Country         .102***     .078***      .113***           .130***       .113***
Grade           .061***     .122***      .114***           .088***       .125***
Sex             .014**      .010**       .002              .000          .001
C × G           .030***     .035***      .051***           .028***       .051***
C × S           .014***     .014**       .002              .013**        .016***
G × S           .005        .007         .000              .003          .001
C × G × S       .014***     .017**       .012              .018**        .009
Note. Significance levels of the effect sizes refer to the probability level of the
corresponding F ratio of the independent variable(s).
ᵃWilks’ lambda. *p < .05. **p < .01. ***p < .001.
Table 4
Overview of the Hypothesis Tests and the Statistical Models Used

Internal procedure (focus on tests of inductive reasoning)
  Question examined: Are facet level difficulties and item difficulties related?
  Statistical model used: Linear logistic model
  Condition for structural equivalence: Correlations significant in each country (hypothesis 1a)
  Condition for measurement unit equivalence: Correlations significant and identical across countries (hypothesis 1b)

Internal procedure (focus on tests of inductive reasoning)
  Question examined: Is there item bias?
  Statistical model used: Logistic regression
  Condition for full score equivalence: Absence of item bias (hypothesis 3)

External procedure (focus on relationship of skill components and inductive reasoning)
  Question examined: Are tests of skill components and inductive reasoning related?
  Statistical model used: Structural equation modeling
  Condition for structural equivalence: MIMIC parameters significant in each country (hypothesis 2a)
  Condition for measurement unit equivalence: MIMIC parameters significant and identical across countries (hypothesis 2b)

MIMIC = Multiple Indicators Multiple Causes.
Table 5
Accuracy of the Design Matrices per Task and per Country: Means (and
Standard Deviations) of Correlation
Stimulus mode
Figures Letters
Skill Zam Tur Net Zam Tur Net
Inductive reasoning .90 (.03) .90 (.02) .81 (.03) .90 (.02) .92 (.01) .90 (.02)
Rule classification .84 (.04) .87 (.01) .76 (.02) .92 (.01) .93 (.02) .88 (.03)
Rule generating .88 (.01) .86 (.03) .81 (.02) .87 (.01) .89 (.02) .92 (.02)
Rule testing .95 (.01) .93 (.01) .91 (.03) .94 (.01) .95 (.01) .94 (.01)
Net = Netherlands. Tur = Turkey. Zam = Zambia.
Table 6
Analysis of Variance of Correlations with Country, Stimulus Mode, and Skill as
Independent Variables
Source               df    F           Variance explained
Country (C)           2    24.85***    .10
Stimulus mode (S)     1    79.18***    .22
Skill (Sk)            3    37.94***    .21
C × S                 2    19.92***    .16
C × Sk                6     3.70**     .05
S × Sk                3     9.61***    .10
C × S × Sk            6     1.95       .03
Within-cell error    72    (.0006)     .14
*p < .05. **p < .01. ***p < .001.
Table 7
Fit Indices for Nested Multiple Indicators Multiple Causes Models of Figure and Letter Tasks
                                         Contribution to χ²
                                         per country (percentage)
Invariant matrices    χ² (df)            Zam    Tur    Net    NNFI    GFI    RMSEA    Δχ² (df)
(a) Figure mode
Λy, Γ, Ψ, Φ, Θε       533.97*** (167)    21     32     47     .88     .96    .045
Λy, Γ, Ψ, Φ           437.28*** (145)    19     35     47     .89     .96    .043     96.69*** (22)
Λy, Γ, Ψ              180.52*** (79)     20     27     53     .93     .98    .034     256.76*** (66)
Λy, Ψ                 134.01*** (46)     20     19     61     .91     .98    .042     46.51 (33)
Λy                    98.72*** (35)      20     18     62     .90     .99    .041     35.29*** (11)
(b) Letter mode
Λy, Γ, Ψ, Φ, Θε       473.33*** (167)    33     33     34     .89     .82    .041
Λy, Γ, Ψ, Φ           364.63*** (145)    28     30     41     .91     .87    .037     108.70*** (22)
Λy, Γ, Ψ              180.60*** (79)     35     26     39     .92     .90    .034     184.03*** (66)
Λy, Ψ                 82.07** (46)       56     25     19     .95     .94    .027     98.53*** (33)
Λy                    61.61** (35)       54     24     23     .96     .94    .027     20.46* (11)
Note. The choice of a MIMIC model for the figure tests was mainly based on the relatively large change of all fit statistics when constraints were imposed on the phi matrices; therefore, the model with equal factor loadings, regression coefficients, and error variances was chosen. The same model of free, fixed, and constrained parameters also showed an adequate fit for the letter tests. Releasing constraints on the regression coefficients revealed a significant increase of fit. The question was addressed as to whether the decrease of the χ² statistic was due to systematic country differences in the regression coefficients. An inspection of the regression coefficients per country did not show a clear patterning of country differences. The same question was also addressed by two more analyses; in the first the regression coefficients were allowed to vary across countries but not across grades, while in the second analysis variation was allowed across grades but not across countries. It was found that equality of regression coefficients across the four grades of a country yielded a poorer fit than equality across the three countries per grade (first analysis: χ²(73, N = 1094) = 168.40, p < .001; GFI = .91, NNFI = .92, RMSEA = .035; second analysis: χ²(70, N = 1094) = 131.54, p < .001; GFI = .90, NNFI = .95, RMSEA = .029). The two analyses confirmed that a choice of a model of equal regression coefficients of the letter mode across countries does not lead to the elimination of relevant country differences.
Net = the Netherlands; Tur = Turkey; Zam = Zambia; NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation; Δχ² = decrease of χ² value.
*p < .05. **p < .01. ***p < .001.
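The last column of Table 7 can be retraced with a short sketch: each entry is the difference in χ² and in degrees of freedom between a model and the next, less constrained one. The χ² values and degrees of freedom below are copied from the table; the p-value helper uses the Wilson-Hilferty normal approximation, which is an assumption made here to keep the example free of external libraries and is not the procedure reported in the study.

```python
from math import erfc, sqrt

# (chi-square, df) of the nested models in Table 7, most to least constrained.
FIGURE_MODELS = [(533.97, 167), (437.28, 145), (180.52, 79), (134.01, 46), (98.72, 35)]
LETTER_MODELS = [(473.33, 167), (364.63, 145), (180.60, 79), (82.07, 46), (61.61, 35)]


def chi_square_differences(models):
    """Return (delta chi-square, delta df) for each pair of successive models."""
    return [
        (round(chi_a - chi_b, 2), df_a - df_b)
        for (chi_a, df_a), (chi_b, df_b) in zip(models, models[1:])
    ]


def approx_p_value(chi2, df):
    """Approximate upper-tail p-value (Wilson-Hilferty normal approximation)."""
    z = ((chi2 / df) ** (1 / 3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
    return 0.5 * erfc(z / sqrt(2))


figure_steps = chi_square_differences(FIGURE_MODELS)
letter_steps = chi_square_differences(LETTER_MODELS)
# Approximate p-value for every step, in the same order as the table rows.
figure_p = [approx_p_value(chi2, df) for chi2, df in figure_steps]
letter_p = [approx_p_value(chi2, df) for chi2, df in letter_steps]
```

Under this approximation the figure-mode step with 33 degrees of freedom is the only one that does not reach the .05 level, which agrees with the absence of asterisks on that entry in the table.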
Figure Captions
Figure 1. Estimated facet level difficulties per test and country of the figure mode
Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; P3: Number of figures per period: 3; P4: Number of figures per period: 4; D2: Number of different elements of subsequent figures: 2; D3: Number of different elements of subsequent figures: 3; DV: Number of different elements of subsequent figures: variable; V: Variation across periods: variable; C: Periodicity cues: absent; PR: Periods repeat each other: no; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.
Figure 2. Estimated facet level difficulties per test and country of the letter mode
Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; R4: Item rule: 4; R5: Item rule: 5; L2: Number of letters: 2; L3: Number of letters: 3; L4: Number of letters: 4; L5: Number of letters: 5; L6: Number of letters: 6; LV: Number of letters: variable; D2: Difference in positions in alphabet: 2; D3: Difference in positions in alphabet: 3; D4: Difference in positions in alphabet: 4; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; V4: Number of valid triplets: 4; V5: Number of valid triplets: 5; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.
Figure 3. Multiple Indicators Multiple Causes model (standardized solution).
Figure 4. Examples of biased items: (a) easy item; (b) difficult item
[Figure 1. Horizontal axis: facet level; vertical axis: difficulty; separate lines for Zam, Tur, and Net.
(a) Inductive Reasoning Figures: R2, R3, P3, P4, D2, D3, V, PR
(b) Rule Classification Figures: R2, R3, P3, P4, D2, D3, DV, V, C, PR, F
(c) Rule Generating Figures: R2, R3, D2, D3, V2, V3
(d) Rule Testing Figures: R2, R3, P3, P4, D2, D3, DV, V, C, F, NR]
[Figure 2. Horizontal axis: facet level; vertical axis: difficulty; separate lines for Zam, Tur, and Net.
(a) Inductive Reasoning Letters: R2, R3, R4, R5, L2, L3, L4, L5, L6, D2, D3, D4
(b) Rule Classification Letters: R2, R3, R4, R5, L2, L3, L4, L5, L6, D2, D3, D4, F
(c) Rule Generating Letters: R2, R3, R4, R5, L2, L3, L4, D2, D3, D4, V2, V3, V4, V5
(d) Rule Testing Letters: R2, R3, R4, R5, L3, L4, L5, L6, LV, D2, D3, D4, F, NR]
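[Figure 3. Path diagrams of the MIMIC model, one per stimulus mode.
Figure mode: predictors Rule classification figures, Rule generating figures, and Rule testing figures; latent variable Inductive reasoning; indicators Inductive reasoning figures and Inductive reasoning letters. Values shown, in extracted order: .24, .39, .46, .73, .67.
Letter mode: predictors Rule classification letters, Rule generating letters, and Rule testing letters; latent variable Inductive reasoning; indicators Inductive reasoning figures and Inductive reasoning letters. Values shown, in extracted order: .37, .34, .34, .63, .74.
Further values shown in the diagram: .15, .23.]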
[Figure 4. Horizontal axis: score level (Low, Medium, High); vertical axis: average score (0.4 to 1); separate lines for Zambia, Turkey, and the Netherlands.]
Appendix A: Test Facets
The following table provides a description of the facets of the examples of the figure tests:
Test
Facetᵃ Level IRF RCF RGFᵇ RTF
Item rule 1 * *
Item rule 2 *
Item rule 3 *
Number of figures per period 2 -
Number of figures per period 3 * -
Number of figures per period 4 * - *
Number of different elements of subsequent figures 1 * * *
Number of different elements of subsequent figures 2
Number of different elements of subsequent figures 3 *
Number of different elements of subsequent figures Variable - -
Variation across periods Constant * * - *
Variation across periods Variable -
Periodicity cues Present - -
Periodicity cues Absent - * -
Periods repeat each other Yes * - *
Periods repeat each other No * -
Number of valid triplets 1 - - -
Number of valid triplets 2 - - * -
Number of valid triplets 3 - - -
One of the alternatives follows the rule Yes - * - *
One of the alternatives follows the rule No - -
Rows repeat each other Yes - - - *
Rows repeat each other No - - -
Note. An asterisk indicates that the corresponding facet level applies to the item; a dash indicates that the facet is not present in the test.
ᵃSee the text for an explanation of the facets. ᵇThe description refers to the first correct answer (1-3-5).
IRF: Inductive Reasoning Figures; RCF: Rule Classification Figures; RGF: Rule Generating Figures; RTF: Rule Testing Figures.
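The following table provides a description of the facets of the examples of the letter tests:
Test
Facetᵃ Level IRL RCL RGLᵇ RTL
Item rule 1 * *
Item rule 2
Item rule 3 *
Item rule 4
Item rule 5 *
Number of letters 1
Number of letters 2
Number of letters 3 * *
Number of letters 4 *
Number of letters 5
Number of letters 6 *
Number of letters Variable
Difference in positions in alphabet 1 *
Difference in positions in alphabet 2
Difference in positions in alphabet 3
Difference in positions in alphabet 4
Number of valid triplets 1 - - -
Number of valid triplets 2 - - -
Number of valid triplets 3 - - * -
Number of valid triplets 4 - - -
Number of valid triplets 5 - - -
One of the alternatives follows the rule Yes - * - *
One of the alternatives follows the rule No - -
Rows repeat each other Yes - - - *
Rows repeat each other No - - -
Note. An asterisk indicates that the corresponding facet level applies to the item; a dash indicates that the facet is not present in the test.
ᵃSee the text for an explanation of the facets. ᵇThe description refers to the first correct answer (2-4-6).
IRL: Inductive Reasoning Letters; RCL: Rule Classification Letters; RGL: Rule Generating Letters; RTL: Rule Testing Letters.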
Appendix B: Examples of Test Items
(a) Inductive Reasoning Figures: Subject is asked to indicate which row consistently follows one of the item generating rules.
[Item figure: five rows, numbered 1 to 5, each containing twelve figures, numbered 1 to 12.]
(Correct answer: 3)
(b) Rule Classification Figures: Subject is asked to indicate which rule applies to the eight figures.
[Item figure: eight figures, numbered 1 to 8.]
1. One or more things are added to figures which come after each other in a group.
2. One or more things are subtracted from figures which come after each other in a group.
3. In turn, one or more things are added to figures which come after each other in a group and then, the same number of things is subtracted.
4. None of the rules applies.
(Correct answer: 3)
(c) Rule Generating Figures: Subject is asked to find one or more groups of three figures that follow one of the item generating rules.
[Item figure: figures numbered 1 to 6; the answer form has three rows reading 1 - 2 - 3 - 4 - 5 - 6.]
(Correct answers: 3-4-6 and 2-3-4)
(d) Rule Testing Figures: Subject is asked to indicate which row of figures follows the rule at the top of the item.
The rule is: There are 4 figures in a group. 1 thing is ADDED to figures which come after each other in a group.
[Item figure: alternatives numbered 1 to 5; rows of eight figures, numbered 1 to 8, and the alternative "None of these".]
(Correct answer: 4)
(e) Inductive Reasoning Letters: Subject is asked to indicate which group of letters does not follow the rule of the other four.
1  M L K J I H
2  G F E D C B
3  U T S R Q P
4  O N M L K H
5  X W V U T S
(Correct answer: 4)
(f) Rule Classification Letters: Subject is asked to indicate which rule applies to the three groups of letters.
S R R R T Z    V V V W X Z    K K K C D F
1. Each group of letters has the same number of vowels.
2. Each group of letters has an equal number of identical letters and these letters are the same across groups.
3. Each group of letters has an equal number of identical letters and these letters are not the same across groups.
4. Each group of letters has a number of letters which appear the same number of positions after each other in the alphabet.
5. Each group of letters has a number of letters which appear the same number of positions before each other in the alphabet.
6. None of the rules applies.
(Correct answer: 3)
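The Inductive Reasoning Letters example (e) can be checked mechanically with a small sketch. It assumes that the thirty letters form five groups of six and that the rule shared by four of the groups is a constant step between neighbouring letters in the alphabet; both points are readings of the example item made for this illustration, not a statement of the item-generating rules used in the tests.

```python
# The five letter groups of example (e), assuming six letters per group.
GROUPS = ["MLKJIH", "GFEDCB", "UTSRQP", "ONMLKH", "XWVUTS"]


def alphabet_steps(group):
    """Differences in alphabet position between neighbouring letters."""
    return [ord(right) - ord(left) for left, right in zip(group, group[1:])]


def follows_constant_step(group):
    """True if every neighbouring pair of letters is the same distance apart."""
    return len(set(alphabet_steps(group))) == 1


# Groups are numbered from 1, as in the test item.
odd_ones_out = [
    number
    for number, group in enumerate(GROUPS, start=1)
    if not follows_constant_step(group)
]
```

Only the fourth group breaks the pattern (its last step is three positions instead of one), which matches the keyed answer.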
(g) Rule Generating Letters: Subject is asked to find one or more groups of three boxes of letters that follow one of the item generating rules.
F G H L L O A I L L V B C I D O U P Q R L L E A
[Answer form: five rows reading 1 2 3 4 5 6.]
(Correct answers: 2-4-6, 1-2-6, and 1-4-5)
(h) Rule Testing Letters: Subject is asked to indicate which row of figures follows the rule at the top of the item.
The rule is: In each box there are four vowels.
1  A O U V W I    S R Z E I O    V G A O U I
2  A O U V W I    S A R E I O    V G A O U I
3  B O U V W I    S A R E O I    V G A O U Q
4  A O U V W J    S A R D D O    V G A P U Q
5  None of these
(Correct answer: 2)
Inductive Reasoning 71
Appendix C: Parameter Estimates of the MIMIC Model per Mode and Cultural Group
A more detailed description of the MIMIC analyses is given here. To simplify the presentation and reduce the number of figures to be presented, the covariance matrices of the four grades were pooled per country prior to the analyses (as a consequence, the numbers in this Appendix and in Table 7 are not directly comparable). The table presents an overview of the estimated parameters (top) and fit (bottom). Going from left to right in the table, equality constraints are increased, starting with the “core parameters” of the model, the factor loadings (λy), followed by the regression coefficients (γ), the error variance of the latent construct, labeled Inductive Reasoning (ψ), the covariances of the predictors (φ), and the error variances of the tasks of inductive reasoning (θε). Cells with three different numbers represent the parameter estimates for the Dutch, Turkish, and Zambian groups, respectively (e.g., the values 1.09, 0.79, and 0.69 were the factor loadings of the IRL task in these groups in the solution without any equality constraints across cultural groups); cells with one number contain parameter estimates that were set to be identical across countries; cells with an arrow and the word “Same” contain the same values as their left neighboring cell. All parameter estimates are significant (p < .05).
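For reference, the model described above can be written out in conventional LISREL notation. The notation is an assumption made here for illustration: x_1, x_2, and x_3 stand for the three component tasks (taken to be Rule Classification, Rule Generating, and Rule Testing, in that order), η for the latent construct Inductive Reasoning, and y_1 and y_2 for the IRF and IRL tasks.

```latex
\begin{align*}
\eta &= \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_3 + \zeta, & \operatorname{Var}(\zeta) &= \psi, \\
y_1 &= \lambda_1 \eta + \varepsilon_1 \quad (\lambda_1 = 1), & \operatorname{Var}(\varepsilon_1) &= \theta_1, \\
y_2 &= \lambda_2 \eta + \varepsilon_2, & \operatorname{Var}(\varepsilon_2) &= \theta_2, \\
\operatorname{Cov}(x_j, x_k) &= \phi_{jk}, & j, k &= 1, 2, 3.
\end{align*}
```

Each successive column of the table then forces one more of these parameter sets to take a single value in all three countries.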
[Schematic diagram of MIMIC models: Rule Classification, Rule Generating, and Rule Testing (Figures or Letters) are the predictors of the latent construct Inductive Reasoning (IR), which is measured by Inductive Reasoning Figures (IRF) and Inductive Reasoning Letters (IRL).]
(a) Figure mode
Cells with three values list the Dutch/Turkish/Zambian estimates.

          Invariant parameters across countries
Parameter No equality constraints  λy                    λy, γ                 λy, γ, ψ              λy, γ, ψ, φ           λy, γ, ψ, φ, θε
λ2        1.09/0.79/0.69           0.83                  0.82                  0.83                  0.83                  0.82
γ1        0.11/0.19/0.21           0.12/0.19/0.20        0.17                  0.17                  0.17                  0.17
γ2        0.18/0.15/0.19           0.20/0.14/0.18        0.17                  0.17                  0.17                  0.18
γ3        0.31/0.34/0.34           0.37/0.33/0.31        0.34                  0.33                  0.33                  0.33
φ11       27.28/39.25/45.69        ← Same                ← Same                ← Same                37.99                 37.99
φ21       29.48/36.82/35.11        ← Same                ← Same                ← Same                34.14                 34.14
φ22       111.76/102.06/104.22     ← Same                ← Same                ← Same                105.56                105.56
φ31       15.35/23.72/26.31        ← Same                ← Same                ← Same                22.20                 22.20
φ32       31.79/35.93/33.89        ← Same                ← Same                ← Same                34.06                 34.06
φ33       30.02/37.66/45.42        ← Same                ← Same                ← Same                38.08                 38.08
ψ         0.44/4.48/5.86           0.30/4.28/4.73        0.43/4.39/4.83        3.10                  3.10                  3.41
θ1        17.39/18.05/23.25        17.61/18.27/24.53     17.42/18.25/24.35     15.27/19.25/25.80     15.27/19.25/25.80     20.08
θ2        18.73/15.93/19.69        19.49/15.81/19.34     19.69/15.82/19.42     18.44/16.41/20.29     18.44/16.41/20.29     18.12

Proportion of variance accounted for (note a)
IR        0.97/0.79/0.79           0.98/0.79/0.80        0.97/0.80/0.79        0.86                  0.84                  0.84
IRF       0.43/0.54/0.54           0.48/0.53/0.49        0.47/0.55/0.49        0.54/0.51/0.45        0.57/0.51/0.44        0.51
IRL       0.46/0.45/0.40           0.37/0.47/0.45        0.34/0.48/0.45        0.40/0.46/0.42        0.43/0.45/0.40        0.43

Fit indices
χ² (df)   13.25/8.32/3.58 (2/2/2)  38.35 (8)             43.42 (14)            49.67 (16)            101.92 (28)           123.82 (32)
prob.     .001/.016/.167           .000                  .000                  .000                  .000                  .000
Δχ² (df)                           12.20 (2)             5.07 (6)              6.25 (2)              52.25 (12)            21.90 (4)
prob.                              .002                  .535                  .044                  .000                  .000
NNFI      0.91/0.96/0.99           0.95                  0.97                  0.97                  0.96                  0.96
GFI       0.98/0.99/1.00           0.99                  0.99                  0.99                  0.97                  0.96
RMSEA     0.13/0.09/0.05           0.10                  0.07                  0.08                  0.08                  0.09
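The fit indices report a χ² value for each model and, two rows further down, the difference in χ² between successive nested models with its probability. The sketch below retraces those differences for the five constrained figure-mode models. It is an illustration only, not the procedure used for the original analyses: the helper name `chi2_sf_even_df` is hypothetical, and it uses a closed-form tail probability that holds only for an even number of degrees of freedom, which happens to cover every comparison made here.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# (chi-square, df) of the constrained figure-mode models, left to right
models = [(38.35, 8), (43.42, 14), (49.67, 16), (101.92, 28), (123.82, 32)]

differences = []
for (chi_a, df_a), (chi_b, df_b) in zip(models, models[1:]):
    d_chi, d_df = chi_b - chi_a, df_b - df_a
    differences.append((round(d_chi, 2), d_df, chi2_sf_even_df(d_chi, d_df)))

for d_chi, d_df, p in differences:
    print(f"delta chi2 = {d_chi:.2f} (df = {d_df}), p = {p:.3f}")
```

Rounded to three decimals, the resulting probabilities (.535, .044, .000, .000) agree with those listed for the last four figure-mode comparisons.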
(b) Letter mode
Cells with three values list the Dutch/Turkish/Zambian estimates.

          Invariant parameters across countries
Parameter No equality constraints  λy                    λy, γ                 λy, γ, ψ              λy, γ, ψ, φ           λy, γ, ψ, φ, θε
λ2        1.49/1.23/0.76           1.21                  1.18                  1.19                  1.19                  1.12
γ1        0.11/0.29/0.20           0.13/0.30/0.15        0.23                  0.22                  0.22                  0.22
γ2        0.11/0.07/0.11           0.12/0.07/0.09        0.09                  0.09                  0.09                  0.10
γ3        0.31/0.17/0.31           0.36/0.18/0.21        0.23                  0.23                  0.23                  0.24
φ11       30.24/37.06/38.53        ← Same                ← Same                ← Same                35.55                 35.55
φ21       38.14/49.81/49.59        ← Same                ← Same                ← Same                46.41                 46.41
φ22       161.85/199.72/196.20     ← Same                ← Same                ← Same                187.85                187.85
φ31       15.25/17.54/18.99        ← Same                ← Same                ← Same                17.32                 17.32
φ32       29.74/39.36/46.45        ← Same                ← Same                ← Same                38.77                 38.77
φ33       19.14/25.90/32.63        ← Same                ← Same                ← Same                25.97                 25.97
ψ         2.38/1.78/7.62           2.85/1.81/4.41        3.27/1.98/4.54        2.69                  2.69                  3.29
θ1        21.74/14.95/30.96        21.49/14.91/35.30     21.19/14.85/34.43     21.51/14.42/35.76     21.51/14.42/35.76     22.28
θ2        11.03/15.49/21.21        12.22/15.55/19.36     12.78/15.66/20.33     13.34/14.95/22.53     13.34/14.95/22.53     16.45

Proportion of variance accounted for (note a)
IR        0.77/0.85/0.66           0.79/0.85/0.64        0.72/0.84/0.71        0.81                  0.79                  0.76
IRF       0.32/0.44/0.42           0.39/0.44/0.26        0.35/0.45/0.31        0.34/0.47/0.28        0.37/0.47/0.26        0.38
IRL       0.68/0.53/0.38           0.62/0.53/0.48        0.56/0.52/0.52        0.53/0.55/0.46        0.57/0.54/0.44        0.51

Fit indices
χ² (df)   1.32/2.44/2.32 (2/2/2)   24.42 (8)             54.32 (14)            57.35 (16)            104.51 (28)           187.75 (32)
prob.     .517/.295/.313           .002                  .000                  .000                  .000                  .000
Δχ² (df)                           18.34 (2)             29.90 (6)             3.03 (2)              47.16 (12)            83.24 (4)
prob.                              .000                  .000                  .220                  .000                  .000
NNFI      1.01/1.00/1.00           0.97                  0.96                  0.96                  0.96                  0.93
GFI       1.00/1.00/1.00           0.98                  0.98                  0.97                  0.96                  0.92
RMSEA     0.00/0.02/0.02           0.07                  0.09                  0.08                  0.08                  0.12

Note. Values in cells refer to the nonstandardized solution; λ1 is fixed at a value of one.
a. The three rows under this heading refer to proportions of variance accounted for in the latent variable and the two inductive reasoning tasks, respectively.
IR: Inductive Reasoning (latent construct). IRF: Inductive Reasoning Figures. IRL: Inductive Reasoning Letters. NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation.
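The proportions of variance accounted for follow from the other estimates in the same column. As a rough check, the sketch below recomputes them for the fully constrained figure-mode solution (the rightmost column of the figure-mode panel). It assumes the conventional LISREL reading of the parameters: γ for the regression coefficients, φ for the predictor covariances, ψ for the error variance of the latent construct, θ for the error variances of the two tasks, and λ for the loadings. The function and variable names are illustrative and do not come from the original analyses.

```python
def explained_variance(gamma, phi, psi, loadings, theta):
    """Return R^2 for the latent variable and for each of its indicators."""
    n = len(gamma)
    # Variance of the latent variable explained by the predictors: gamma' Phi gamma
    explained = sum(
        gamma[j] * phi[j][k] * gamma[k] for j in range(n) for k in range(n)
    )
    latent_var = explained + psi
    r2_latent = explained / latent_var
    r2_indicators = [
        lam ** 2 * latent_var / (lam ** 2 * latent_var + err)
        for lam, err in zip(loadings, theta)
    ]
    return r2_latent, r2_indicators


# Fully constrained figure-mode estimates (values shared by all three countries)
gamma = [0.17, 0.18, 0.33]
phi = [
    [37.99, 34.14, 22.20],
    [34.14, 105.56, 34.06],
    [22.20, 34.06, 38.08],
]
psi = 3.41
loadings = [1.0, 0.82]  # IRF (fixed at one) and IRL
theta = [20.08, 18.12]

r2_ir, (r2_irf, r2_irl) = explained_variance(gamma, phi, psi, loadings, theta)
print(round(r2_ir, 2), round(r2_irf, 2), round(r2_irl, 2))
```

With these inputs the three proportions round to 0.84, 0.51, and 0.43, the values listed for IR, IRF, and IRL in that column. Because the inputs are themselves rounded to two decimals, the same check on other columns can differ from the tabled values in the second decimal.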