journal of applied measurement, 1 2000 moral … · moral and evaluative reasoning across the...

26
JOURNAL OF APPLIED MEASUREMENT, 1 (4), 346-37 1 Copyrighto 2000 Moral and Evaluative Reasoning Across the Life-span Theo Linda Dawson University of California at Berkeley In a longitudinal/cross sectional study of moral and evaluative reasoning, Armon interviewed 23 females and 19 males, ages ranging from 5 at the first test time (1977) to 86 at the 4th (1989) test-time. Rasch analysis of Armon's data demonstratedthat Armon's and Kohlberg's measures tap a single underlying dimension of reasoning; that individual stages across five items measure the same levels of reasoning, and that development on all items progresses at about the same rate. Participants found it easier to apply already available reasoning structures to new areas than to reason at a new stage, implying that stage transition is step-like. Requests for reprints should be sent to Theo L. Dawson, Graduate School of Education, Tolman Hall, University of California at Berkeley, Berkeley, CA 94720, e-mail: [email protected].

Upload: phungliem

Post on 08-May-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

JOURNAL OF APPLIED MEASUREMENT, 1 (4), 346-37 1

Copyrighto 2000

Moral and Evaluative Reasoning Across the Life-span

Theo Linda Dawson University of California at Berkeley

In a longitudinal/cross sectional study of moral and evaluative reasoning, Armon interviewed 23 females and 19 males, ages ranging from 5 at the first test time (1977) to 86 at the 4th (1989) test-time. Rasch analysis of Armon's data demonstrated that Armon's and Kohlberg's measures tap a single underlying dimension of reasoning; that individual stages across five items measure the same levels of reasoning, and that development on all items progresses at about the same rate. Participants found it easier to apply already available reasoning structures to new areas than to reason at a new stage, implying that stage transition is step-like.

Requests for reprints should be sent to Theo L. Dawson, Graduate School of Education, Tolman Hall, University of California at Berkeley, Berkeley, CA 94720, e-mail: [email protected].

This paper is a reexamination of Armon's (1984a, 1984b, 1993,1995) 13- year lifespan study of moral reasoning (the Moral Judgement Interview [MJI], Colby and Kohlberg, 1987b) and evaluative reasoning about the good (the Good Life Interview, Armon, l984b). In an earlier paper, Armon and Dawson (1997) performed a developmental analysis of the moral rea- soning results from this study using multiple regression analysis. We re- ported a curvilinear developmental trend with significant positive development in all but the oldest group of participants (50- 86 years), and no significant decline in stage scores before old age. This analysis also revealed a moderate linear relationship between moral development and educational attainment, though education was found to be neither neces- sary nor sufficient for developmental advance. Questions pertaining to the nature of stage change were not addressed due to the limitations of the regression approach. The present paper reports the results of a Rasch analy- sis of Arrnon's data and discusses these with specific reference the char- acteristics of stages of development.

Rasch Modeling and Structural-Developmental Research

Rasch models are well known among psychometricians, who use them extensively in current efforts to develop test instruments that are more sensitive to the variety and range of abilities expressed in students' work. In fact, a Rasch model can be a particularly useful bootstrapping instrument1, since it gives researchers clear and accurate feedback about the performance of items and scoring methods. During recent years, there has been an increasing interest in its value as a tool for the empirical investigation of structural-developmental theory. For example, Demetriou, Efklides, Papadaki, Papantoniou, and Economou (1993) employed Rasch analysis to examine four areas of causal-experimental reasoning, com- prising combinatorial, hypothesis handling, experimentation, and model construction abilities. Further, Bond (1991, 1992a, 19920; Bond and Bunting, 1995) and Noelting, Rousseau, and Coudd (1994) have used the partial credit model to devise tests of some of Piaget's metatheoretical constructs, while Noelting, Coudd, and Rousseau (1995) have demon- strated that the model can be used to differentiate between domains of reasoning. Muller, Reene, and Overton (1994), Muller, Winn, and Overton (1995), and Muller, Sokol and Overton (1999) have used the model to examine the structure of performances on recursive thinking, deductive reasoning, and class reasoning tasks. Finally, Dawson (1998, 2000, sub- mitted) has employed Rasch models to model stages of evaluative reason-

ing about education and to examine the structure of Kohlberg's Standard Issue Scoring System (Colby and Kohlberg, 1987b).

Rasch models are intended for data that are (1) hierarchically or- dered and (2) unidimensional. Rasch models create an interval scale, called a logit scale, on which estimated values for both persons and items can be arranged if the data fit the model. The range of the logit scale and the size of the spaces between item and individual estimates is determined from patterns of performance.

Hierarchy: Rasch analysis is probabilistic. Rasch uses a log trans- formation to convert ordered response categories onto an interval scale. This confers several advantages. First, unlike correlational or factorial analytical methods, Rasch models simultaneously place each individual on a measurement scale relative to other individuals based on performance on items, and places items on the same scale based on the performance of persons on each item. This feature is of central interest to developmental researchers, because developmental theories have built in ordinality- certain behaviors are expected to occur before other behaviors-and in most developmental theories, this ordinality is based on the idea that there is some kind of increasing difficulty of behaviors or complexity of their underlying structures. Rasch analysis helps to determine whether hypoth- esized sequences of development are demonstrated empirically in data by simultaneously (1) scaling items by their difficulty, based on the relative performance of persons on individual items, and (2) ordering persons on this scale based on their performances across items. This means that both group trends and individual growth can be examined.

Unidimensionality: As already mentioned, Rasch models are unidi- mensional (Elliott, 1982; Harnbleton, et al., 1991; Masters, 1988). This means that they are designed to measure performance on a single dimen- sion of ability. By providing goodness of fit values for each item, Rasch analysis allows us to assess whether the items work together well enough to support the claim that they assess the same dimension of performance (Elliot, 1982; Hautamaki, 1989). For example, using Rasch analytical methods, Demetriou, et al. (1993) have shown that reasoning on four dis- tinct types of causal-experimental tasks represents a single underlying construct. The unidimensionality of Rasch models is not only useful for demonstrating that test items measure a single dimension of ability, but can be used to detect when items do not measure the same processes. Noelting, Coudt5, and Rousseau (1995) have used this Rasch measure-

ment requirement to highlight differences between the task demands of spatial and logicomathematical tests.

Prior to the availability of Rasch methods, the unidimensionality of developmental tests was difficult to determine. In spite of this problem, some attempts had been made. For example, Commons and his colleagues (Commons, et al., 1989; Commons, et al., 19841, using a correlational approach, reported that in the social realm, developmental sequences based on Piagetian structural criteria appear to tap a common underlying di- mension of reasoning, while those that are based on other criteria do not. However, it is argued that correlations are not an adequate indicator of unidimensionality because unidimensionality is not one of the assump- tions of correlational methods (Elliott, 1982; Hambleton, et al., 1991; Masters, 1988).

The Structure d'ensemble: The unidimensionality of Rasch mod- els, combined with their ability to differentiate between different levels of performance, also allow the researcher to determine whether different items designed to measure the same levels of performance actually do so. At the same time, these features can help to clarify whether or not indi- viduals apply the same level of reasoning to different tasks both within and between domains of reason.

One of the more controversial concepts of structural developmental theory is Piaget's and Inhelder's (Piaget and Inhelder 1969) structure d'ensemble. One of the implications of this concept of a "structure of the whole" is that at any given stage, reasoning would reach a state of equi- librium within the individual such that the same global structures or orga- nizing principles would be applied across a wide range of problems within or between domains. These periods of equilibrium would be character- ized by a uniformity of actions and their justifications, along with a pause in the developmental process. This way of thinking about stages led to a step-like conceptualization of global stage change with which Piaget and Inhelder were dissatisfied (Tomlinson-Keasy, 1982; Vuyk, 1981). Part of Piaget's dissatisfaction with this way of representing global stage change was that it might not be exclusive to change at the global level. He posited that qualitative shifts, in which existing structures are reorganized into a new structure with qualitatively different features, might be seen in smaller developmental steps as well as in global stage shifts. A second reason for this dissatisfaction was his concern that stage issues, rather than the more dynamic aspects of his theory, would come to dominate inquiry in struc-

tural development. As a consequence of these and other concerns, in their later work, Piaget and Inhelder came to regard the concept of structure d'ensemble as an "unfortunate expression, which should not be taken lit- erally" (Vuyk, 1981, p. 193), preferring to represent development as a 'spiral of knowing' (Ginsburg and Opper, 1988) in which the concept of stage was of less importance than the dynamic features of the develop- mental process.

Despite Piaget's altered perspective, stage remains an important psy- chological construct for many developmentalists (Berkowitz and Keller, 1994; Bidell and Fischer, 1992; Commons and Hallinan, 1990; Fischer, 1980; Fischer, Hand and Russel, 1984; Walker and Taylor, 1991). How- ever, attempts to examine developmental stages have yielded contradic- tory results. Some researchers report findings that support the concept of structure d'ensemble such as relative structural uniformity in reasoning across tasks (Commons, et al., 1989; Commons and Grotzer, 1990; Edelstein, Keller and Wahlen, 1984) while others report results that are inconsistent with this postulate, such as a lack of structural uniformity (see, for example, Shantz, 1975; and Turiel, 1983a). There appear to be at least two possible reasons for these divergent findings. The first of these derives from differences in the way in which structure is conceptualized. Commons and his colleagues (Commons, et al., 1984; 1989) suggest that one can legitimately compare the level of an individual's reasoning across tasks only when scoring systems are based upon the same structural crite- ria. If one accepts this notion, it follows that Piaget's concept of structure d'ensemble should be tested only with measures that are based on his concept of structure.

Unfortunately, there are several interpretations of Piaget's concept of structure d'ensemble. Flavell (1971) outlines three possible concep- tions of stage transition, each consistent with a different interpretation. In the first model, change from one to another is abrupt and all encompass- ing, with rapid application of a new structure to existing knowledge and a long period of equilibrium. This conception is generally viewed as a cari- cature of Piaget's thinking (Bidell and Fischer, 1992; Vuyk, 198 1). In the second case, the structures of a new stage are applied gradually, but rea- soning with the structures of the next stage does not take place until the individual's reasoning has consolidated at the modal stage. Several re- searchers have reported results that support this model (Berkowitz and Keller, 1994; Commons, et al., 1989; Commons and Grotzer, 1990;

Edelstein, Keller and Wahlen, 1984; Flavell, 1982; Walker and Taylor, 1991). Finally, in Flavell's third model of stage transition, development within a stage can continue long after some of the structures of the next stage, or even the stage after that, are in use.

When researchers employing Piaget's concept of structure examine stages within a single domain, they generally find evidence that supports the second model (see, for example, Armon, 1984; Commons, et al., 1989; Commons and Grotzer, 1990; Dawson, 1998; Flavell, 1982; Kuhn, Langer, Kohlberg, and Haan, 1977; Rosenberg, 1988; Selman, 197 1; Walker, 1980; Walker and Richards, 1979; see also the discussion by Bidell and Fischer, 1992). However, when researchers examine development across domains, or use measures that do not employ Piaget's conception of structure, they more commonly find evidence that supports the third model (Commons, et al., 1989; Demetriou, et al., 1993; King, Kitchener, Wood, and Davison, 1989; Turiel and Davidson, 1986).

There are a number of possible reasons for this pattern of results. First, even when stage sequences in different domains are based on simi- lar conceptions of structure, comparing stage achievements across do- mains is not a simple task. In a study in which they compare reasoning across the domains of intellectual, moral, and ego development, King, Kitchener, Wood, and Davison (1989) point out that no method exists for determining just how stages in different domains should be related to one another. This is at least partially because most structural developmental scoring systems, at least in the area of social development, are the domain specific products of systems of "spiral bootstrapping," similar to those employed by Kohlberg and his colleagues (described by Colby and Kohlberg, 1987a). In these systems, theoretical criteria are used to estab- lish the categories of reason encompassed within a given domain, and interview procedures are developed to explore reasoning in the domain delineated by these categories. Data are then collected and submitted to analysis, in which specific structural and/or content-based criteria are employed to order responses on a developmental continuum. This con- tinuum is then tested for sequentiality by gathering and scoring longitudi- nal data. Feedback from these longitudinal data often contributes to the refinement of stage scoring criteria and/or the reexamination of theoreti- cal postulates. Ultimately, this process produces a coding system that can identify developmental stages with a high level of reliability within a spe- cific domain, but it does not directly contribute to an understanding of

how stages in one sequence should be related to stages in another se- quence. Further, Commons and Richards (1996) argue that a lack of evi- dence for homogeneity in development across domains reflects, in part, the strong influence of environmental and social factors on development. Given a theory of development that posits interacting biological, environ- mental, and social influences (Piaget and Inhelder, 1969), varying rates of development in different domains would arguably be easier to account for than uniformity across domains.

Questions about (1) whether development is step-like or gradual, (2) the relationship between development in different domains, and (3) the relationships between stages assessed with different measures, are difficult to address with conventional statistical methods (see Fischer and Bidell, 1998, for an overview of research focused on these issues). Rasch analysis can help address these questions. First, it can be employed to determine whether development occurs in a step-like or smooth manner, because the log transformation converts ordinal stage scores into diffi- culty estimates on a continuous metric (Hambleton, Clauser, Mazor, and Jones, 1993; Masters, 1982; Masters, 1994; Smith, 1986). Gaps between groups of items and/or groups of persons may appear on this scale of difficulty, or logit scale (Bond, 1992b, 1995; Noelting, et al., 1994). Wil- son (1989), Draney (1996), and Dawson (1998; submitted) have argued that these gaps may reflect periods of relative consolidation that occur prior to stage advance. Second, these models can be employed in com- parisons of developmental processes across knowledge domains.~ata from multiple domains can be modeled simultaneously, making it possible to examine similarities and differences in patterns of development across domains. Finally, simultaneous analysis of data from two scoring systems can also be used to assess the degree to which different stage scoring systems assess the same underlying dimension of performance.

In summary, many of the questions and issues raised in connection with developmental stage theory can be explored with the assistance of Rasch models. In the present analysis, the following questions are ad- dressed: (1) Do developmental sequences that examine reasoning about the good life, good friendship, good work, good person, and morality tap a single underlying dimension of reasoning? (2) Do patterns of develop- ment across these sequences appear gradual or step-like? (3) Is there a tendency for development to progress more rapidly in some areas than in others?

Method

The data employed in this analysis are from Armon's (1984a; Armon and Dawson, 1987) study of moral reasoning and evaluative reasoning about the good. Armon interviewed a total of 42 individuals ranging in age from 5 years, at the first time of testing in 1977, to 86 years, at the final time of testing in 1989. Interviews also took place in 1981 and 1985.

To assess moral reasoning, form A (three moral dilemmas: Heinz, Judge, Joe) of the MJI (Colby and Kohlberg, 1987b) was used with all participants except three who were familiar with it. For these participants, Forms B (three dilemmas: Dr. Jefferson, Judge, Judy) or C (three dilem- mas: Korean, Valjean, Karl) were used. The same form was used with each participant at each time. Correlations between forms, as reported in the Standard Issue Scoring Manual (Colby and Kohlberg, 1987), are .82 between forms A and C, and .84 between forms B and C. To assess any differential effects of form in the present instance, Rasch analyses were conducted both with and without the cases who were administered forms B and C. Because no systematic differences were found between the re- sults from these analyses, forms B and C are treated here as equivalent to form A and data from respondents who received these forms are included in the following analysis.

Armon's (1984b) Good Life interview was also administered at each test-time. The Good Life interview includes questions about (1) the good life, (2) good work, (3) good friendship, and (4) the good person. The interviews are open-ended. For example, the good life segment of the interview opens with the question, "What is the good life?' Responses are probed with questions like, "Why is that good?'until the interviewer, based upon explicit criteria, concludes that the participant has given an adequate description of his or her reasoning on each theme. Complete interview procedures for evaluative reasoning about the good are described in detail in Arrnon's (1984b) Good Life Scoring Manual.

Both Arrnon and Kohlberg describe structural developmental se- quences that meet Piagetian criteria for structured wholeness, sequentiality, qualitative differences between stages, and hierarchical integration (Armon 1984b, Colby and Kohlberg, 1987b). Moreover, stages in Armon's and Kohlberg's sequences reflect similar organizations, such that each stage in Armon's sequence meets structural criteria similar to those of the stage with the same number in Kohlberg's sequence.

Altogether, 18 males and 25 females participated in the study. Aver- age educational attainment for adult males at Time 1 was 12.94 (SD 6.80) years; for females it was 13.04 (SD 5.26) years. This group difference was not found to be statistically significant. Four adults had completed doctorates, nine had completed some post-undergraduate work, six had completed college, and nine had completed some college. Average educa- tional attainments for both children and adults rose significantly over the course of the study (see Armon and Dawson, 1987). On the whole, the sample was well educated, affluent, and predominantly white. More de- tailed information about the demographics of the sample can be obtained from earlier reports (Armon, 19840; Armon and Dawson, 1987).

Procedure

Interviews were individually administered to each participant, tape recorded, and transcribed. Of the participants who were interviewed on two or more occasions, 43 were interviewed at Time 1 (1977), 42 were interviewed at Time 2 (1981), 42 were interviewed at Time 3 (1985), and 41 were interviewed at Time 4 (1989-90). Participants who completed fewer than 4 interviews were not significantly different on any of the demographic variables included in this study from those who completed all 4 interviews.

Scoring and Analysis

The MJIs were scored by Armon, who had obtained a .95 reliability rating with the first author of the Standard Issue Scoring Manual (Colby and Kohlberg, 1987b). To check for rater bias, half of the interviews from Times 1 and 2, and one quarter of the interviews from Times 3 and 4 were scored by independent raters (trained undergraduate students). Different raters were used at each subsequent test time. At each test-time, the inde- pendent raters and the first author provided identical scores at least 80% of the time and never provided scores that were more than 1/2 stage apart. Inter rater, test-re-test, and alternate form reliability estimates for the MJI indicate an overall error in the range of 113 stage (Colby and Kohlberg, 1987a).

Evaluative reasoning questionnaires were scored by Armon and two trained undergraduate students. Agreement rates among raters were com- parable to those for the moral reasoning interviews. (See Armon, 19840; Armon and Dawson, 1987) for more details on the scoring of evaluative reasoning material.)

Both Colby and Kohlberg's and Armon's scoring systems produce scores that can be reported as Global Stage Scores (GSS) ranging from 1 to 5 in 112 stage increments. (See Colby and Kohlberg, 1987a, and Armon, 1984b, for detailed scoring information.) For each of the five themes, participants were interviewed at length, providing several examples of their reasoning. Then, all of the examples for each theme were scored individually and the scores averaged, using a standard weighting system (described in Armon, 1984a), resulting in a single score for each partici- pant on each theme. Half-stage scores on all five items can best be inter- preted as mixed scores, since virtually all participants who were assigned half-stage scores demonstrated a mix of adjacent stage scores on their judgments.

Analyses in this report are conducted on the pooled longitudinal and cross-sectional data, meaning that all observations are treated as indepen- dent, raising the n from 43 (the total number of participants) to 168 (the total number of interviews across four test times) This practice has two major advantages. First, power is increased when data collected at each test time are treated as coming from independent observations. This is permissible when a study design involves more than two test-times sepa- rated by relatively long intervals (Rogosa and Willett, 1985; Willett, 1989). Second, the longitudinal information incorporated into the analysis in this way adds valuable information about growth trends. Specifically, by in- cluding longitudinal data for individuals, it can be shown that age differ- , .

ences reflect actual trends at the individual level rather than artifacts of statistical averaging.

Results

Version 2.1 of Quest (Adams and Khoo, 1993) was used to analyze the data. Five item scores were included in the analysis, one each for reasoning about the (1) good life, (2) good friendship, (3) good work, (4) the good person, and (5) moral judgment. The partial credit model, rather than the ratio scale model was employed because this paper addresses questions about the distance between levels of the items.

The logit scale developed from the analysis spans 18 logits, from -9 to +9, with 0 (set by default) at the mean of the item estimates. Item difficulty estimates are distributed along the logit scale from -8.17 to +7.17. Item fit statistics are within acceptable parameters (infit t c 2.00), supporting the claim that they measure a single underlying dimension (Table 1). Standard

Table 1

Item Estimates (Thresholds) N = 168 ITEM NAME SCOREMAXSCR THRESHOLDIS INFT OUTFT INFT OUTFT

1 2 3 4 5 6 7 MNSQ MNSQ t t

1 Life 5 4 6 1036 1-8.17 -4.22 -2.50 -.41 1.95 4.86 7.17 1.08 1.11 .7 .7

I 1 1.26 .78 .63 .50 .42 .59

2 Work 416 888 .79 I

I I -3.69 -2.77 -.42 1.75 4.63 6.89 .79 .80 -1.8 -.8 .75 .69 .50 .46 .57 .75 1

5Moral 1639 11761-8.14 -5.90 -3.97 -.60 1.98 4.78 6.541.77 .78-2.1 -1.6 I 1.23 .80 .69 .45 .44 .56 .67

Mean ! ! 0.00 .96 .98 -.4 -.l

errors for the item estimates range from .42 to 1.26'logits (M = .70, SD = .22), while errors for person estimates range from .59 to 1.90 logits (M = .78, SD = 24). Reliability of the person estimates is .93, with a mean of .85 logits and an adjusted standard deviation of 3.24 logits (see Appendix 1 for complete person statistics).

Figure 1 displays the results of the analysis as an item person map. The long vertical black band toward the right of the figure shows the logit scale, a scale of relative item and person measures. To the left of this band are the estimates for individuals at each of the four times of testing. These estimates are in separate columns, one for each testing occasion, 1977,1981, 1985, and 1989. Individual participants are represented by case numbers, so that the developmental trajectory of each individual can be traced. As an illustration, trajectories for case 1, a woman aged 30 years at the onset of the study; case 11, a boy age 11 at the onset of the study; and case 18, a woman who entered the study at age 58 are represented with dotted lines. Seven case numbers are outlined with boxes, to indicate that those perfor- mance patterns underfit the Rasch requirements at that time of testing.

Quest does not provide estimates for cases with perfect zero scores and perfect high scores, because they provide insufficient data for estima- tion. Values for the cases with perfect scores were estimated as follows:

Estimate for perfect scores = E + E-,-ESÃ ( E - E ) (1)

E~32-Es31 In this equation, E represents the person estimate for a given score. This procedure located individuals with perfect scores at 9.49.The same pro-

Figure

Moral and Evaluative Reasoning: Item & Person Estimates with Means

Time 1M = .64, sd 4.57: Time 2M = .71, sd = 3.97; Time 3M = 1.49, sd = 3.40; Time 4M = 2.54, sd = 3.35

F (3,168) = 5.05, pe.01

cess was employed to calculate estimates for cases with zero scores, though, in this case, the rate of change from a score of 1 to a score of 2 was not available because there were no individuals in the sample who received this score, so 112 the rate of change from step 1 to 3 was used to impute an estimate for a score of 2. This procedure located individuals with zero scores (all items scored stage 1 at an individual test time) at -9.76 logits.

To the right of the vertical band are the item estimates for the good life (L), good work (W), good friendship (F), the good person (P), and moral judgment (M), respectively. Large horizontal grey bands indicate the approximate range for each level of reasoning from full-stage 2 to full-stage 5 across the five items. Note that the distance between these bands is similar from stage to s t a g e 4 5 logits. The half-stage items tend to cluster below or overlap these bands, which are intended only to visually demarcate the groups of item estimates and do not include stan- dard errors for the items.

While there is no agreed upon method for determining whether groups of items are separated to the extent that they can be said to represent distinct levels of difficulty, a conservative approach is to calculate stan- dard error bands around each of the items to determine whether the error bands for item estimates representing one stage of development overlap those of an adjacent stage. Given an average standard error of .70 (2 stan- dard errors = 1.4), the error bands around full-stage estimates and the half-stage estimates at the adjacent higher levels overlap, indicating that the gaps are not statistically significant. However, the error bands around the estimates for adjacent full-stages do not overlap, indicating that they are statistically significant. These and other patterns in the data are dis- cussed further below.

Finally, four narrow, horizontal, black bands show the means of the person estimates at each of the test times. Mean development in this group over the course of the study spanned 4 logits. An ANOVA, comparing person estimate means at the four test-times indicates that mean stage advance is significant, F(3,168) = 5.05, p < .01.?

Table 2 shows the actual scores, infit mean squares, and t values for individuals whose performances do not fit the model ( t > k2.00). Interest- ingly, the first 12 of these individuals had scores across items that were very similar-they overfit the model. The negative infit statistics indicate that there is less variation in the difference between expected and ob- served scores than predicted. Note that all items for each of these indi- viduals are scored at the same level. These cases are not problematic theoretically, because the narrow range of their scores is consistent with the hypothesis that individuals tend to think in structurally similar ways across these themes.

The large positive infit statistics for the following 7 participants in- dicate relatively inconsistent performances. There is an unexpectedly large

Table 2

Person Estimates with Infit Ts Over +2.00 Case Time Count Total Est Infit MSQ Infit t L W F P M

difference between the expected and observed scores of these individu- als-the performances underfit the model. Given that the 7 cases with underfitting response patterns represent only about 4% of the total sample (n = 168), it seems reasonable to allow that these patterns reflect mea- surement error (p < .05) (Smith, 1986).

Full-stage and half-stage scores across items cluster within relatively narrow bands of the logit scale, indicating that individual full-stage and half-stage scores across all five items appear to measure the same level of reasoning. Moreover, there are no patterns in the distribution of the stages that would suggest that development on any one item progresses more rapidly than on any of the other items.

The instance, at stage 2.5, in which moral judgment appears to be markedly easier than reasoning on any of the good life themes, might be due to differences in the interview formats of Armon's and Kohlberg's

measures. Specifically, the dilemma format of Kohlberg's approach might be better at eliciting reasoning competence at the lower stages than are the more open-ended evaluative reasoning questionnaires. Though it seems this hypothesis has never been tested, it is consonant with Piaget's description of cognitive development (Piaget and Inhelder, 1969), in that Kohlberg's dilemmas present more concrete situations for young participants to con- sider, while Armon's questions are more open-ended and abstract.

The proximity of the estimates for full-stages and the half-stages immediately preceding them could indicate a failure to adequately differ- entiate between half-stages and full-stages, particularly since the half- stage scores are primarily composed of a mixture of adjacent full-stage scores, but this explanation does not account for the tendency of half- stage thresholds to cluster nearer to the adjacent higher stage band than the adjacent lower stage band, a pattern we also observed in an analysis of a similar set of data (Dawson, in press). This last pattern suggests that once a higher stage of reasoning has appeared in one area, it is relatively easy to extend it into other areas, at least when compared to the difficulty of moving from reasoning at one full-stage to any amount of reasoning at the next stage. Such a pattern is consistent with the view that cognitive development involves qualitative change rather than cumulative change.

The pattern seen here is also consistent with the concept of consoli- dation embodied in Piaget's notion of structure d'ensemble. The tight clustering of difficulty estimates for each stage across each of the five themes shows that stage of reasoning tends to be relatively stable across content areas. In fact, 52 respondents were scored at a single stage on at least 4 out of 5 themes (the 5th score was at the same stage or an adjacent half-stage), and another 52 respondents had scores across the 5 themes that were within half a stage. Of the remaining cases, 33 had scores that were within one full stage, 11 had scores that spanned more than one full stage, and 20 had data that were so incomplete that it was not possible to classify them with this scheme. The overwhelming majority of respon- dents (70% of 148 classifiable cases) received scores across themes that were within half a stage of one another, while 22% had scores that were within one stage of one another and only 7% had scores that were more than one stage apart. These patterns of performance are reflected in the gaps between item estimates. However, as reported above, the gaps be- tween groups of full-stage estimates and groups of adjacent, higher, half- stage estimates are not statistically significant, though the gaps between adjacent full-stage estimates are statistically significant. Consequently,

while the model suggests support for qualitative, stage-like change, with periods of consolidation before advance, the standard errors for the item estimates are too large to provide unequivocal support for this notion.

Finally, the intervals between bands of full-stage item estimates in Figure 1 are roughly the same from stage to stage-between 4 and 5 logits. The regularity of these intervals suggests that the difficulty of moving from one stage to the next is about the same. This pattern might be an artifact of the research design, but it might also reflect similarities in the task demands of stage transitions.

Discussion

Armon's longitudinal dataset is a rich source of information about development across the lifespan. The present analysis focuses on only a small subset of the questions that can be explored with these data, in par- ticular, those related to structural-developmental propositions about stage and structure. In the past, the investigation of these propositions, using conventional statistical methods, has led to equivocal and often contra- dictory conclusions. However, the Rasch model's assumptions of unidi- mensionality and hierarchy are compatible with those of structural-developmental theory, and thus it provides a particularly suit- able means for addressing propositions related to stage and structure.

This paper examined the development of moral reasoning and evalu- ative reasoning about the good. It was found that similar processes or structures are applied to reasoning in both domains, and that individual stages of items in these domains measure the same levels of reasoning performance. Further, a hierarchy of difficulty was clearly differentiated for all items, and trends in the hierarchical distribution of stages suggest that consolidation occurs in a pattern that is compatible with Piaget's con- cept of structure d'ensemble, though it is important to keep in mind that the domains of reasoning examined here are closely related and, conse- quently, these results do not address the claim that similar patterns exist across a wide range of domains.

This analysis provides information that may be useful in refining Armon's and Kohlberg's research methodologies. Armon's open-ended interview format might result in underestimation of the ability of younger reasoners, as indicated by how much easier (about 1.5 logits) moral rea- soning at stage 2.5 appears to be relative to evaluative reasoning at the same stage. On the other hand, it is also possible that the structural crite-

362 DAWSON

ria for Armon's stage 2.5 are not the same as those for Kohlberg's stage 2.5. Methodological questions are also raised by the participants whose scores do not fit the parameters of the Rasch model, all of whom are in the stage 4.5 range of the logit scale. In fact two patterns of misfit occur

, predominantly within in this range-overfit and underfit. It would be worthwhile to take another look at data collection and scoring at this level, to see if methodological flaws contributed to this pattern.

Partly because their empirical consequences have been difficult to assess, structural developmental concepts like Piaget's structure d'ensemble no longer interest many developmental researchers. This analy- sis demonstrates a method of testing such concepts using Rasch model- ing, and in doing so provides support for further investigation of these developmental constructs.

The scope of this paper has been deliberately confined to an exami- nation of cross-sectional structural-developmental trends. Neither indi- vidual development nor environmental effects have been scrutinized. However, both can readily be examined with a Rasch modeling approach. First, individual developmental trajectories can be traced by referring to the case numbers in Figure 1. By comparing these trajectories, individu- als who stand out can be readily identified. For example, participant num- ber 18, a woman who entered the study in 1977 at age 58, gains about 315 of a stage over the 12 years of the study. The data from her interviews may hold valuable clues about development in late adulthood, particu- larly when compared to data from cases in which development is not ob- served. Finally, the person and item estimates from this Rasch analysis are subject to the usual statistical procedures and can be employed to help answer questions about relationships between development and other fac- tors. For example, the estimates from this analysis can be employed to construct Hierarchical Linear Models examining the impact of education, age, sex, and other variables on the trajectory of individual development. Such an analysis is currently being undertaken on moral judgment data from 80 respondents tested from 3-4 times at 4 year intervals.

Acknowledgement

I would like to thank Cheryl Armon for permission to use her rich lifespan longitudinal database. In addition, I thank Trevor Bond, who in- troduced me to Rasch analysis and gave many hours to this project. Thanks also to Michael Commons and to Mark Wilson for their assistance.

Footnotes

' In the sense that it provides feedback that can contribute to both theory and instrument construction as part of a relational or discursive research process as described in Overton (1998).

We are currently employing Hierarchical Linear Modeling (HLM) to examine stage advance in this sample.

References Adams, R. J., and Khoo, S. T. (1993). Quest: The interactive test analysis sys-

tem. Victoria, Australia: Australian Council for Educational Research Ltd.

Armon, C. (1984a). Ideals of the good life. In M. Commons, F. Richards, and C. Armon (Eds.), Beyond formal operations, Vol. 1. Late adolescent and adult cognitive development. New York: Praeger.

Armon, C. (1984b). Ideals of the good life: Evaluative reasoning in children and adults. Unpublished Doctoral dissertation, Harvard University, Boston.

Arrnon, C. (1993). Developmental conceptions of good work: A longitudinal study. In J. Demick and P. M. Miller (Eds.), Development in the workplace (pp. 21- 37). Hillsdale, NJ: Lawrence Erlbaum.

Arrnon, C. (1995). Moral judgment and self-reported moral events in adulthood. Journal of Adult Development, 2,49-62.

Armon, C., and Dawson, T. (1997). Developmental trajectories in moral reason- ing across the lifespan. Journal of Moral Education, 26,433-453.

Berkowitz, M. W., and Keller, M. (1994). Transitional processes in social cogni- tive development: A longitudinal study. International Journal of Behavioral Development, 17,447-467.

Bidell, T. R., and Fischer, K. W. (1992). Beyond the stage debate: Action, structure, and variability in Piagetian theory and research. In R. J. Stemberg and C. A. Berg (Eds.), Intellectual Development, (pp. 100-160). New York: Cambridge University Press.

Bond, T. G. (1991, November). Assessing developmental levels in children's think- ing: Matching measurement model to cognitive theory. Unpublished paper presented at the annual Conference for the Australian Association for Research in Education, Gold Coast.

Bond, T. G. (1992a, June). Bridging the Kantian gulf: Piaget's rationalism, psychology's empiricism, and item response theory. Unpublished paper pre- sented at the annual symposium of the Jean Piaget Society, Montreal.

Bond, T. G. (1992b, June). An empirical validation of Piaget 3 logico-mathemati- cal model for formal operational thinking: Matching measurement model to cognitive theory. Unpublished paper presented at the annual symposium of the Jean Piaget Society, Montreal.

Bond, T., G. (1995). Piaget and measurement 11: Empirical validation of the Piagetian model. Archives de Psychologie, 63, 155-185.

Bond, T. G. and Bunting, E. (1995). Piaget and measurement 111: Reassessing the mkthode clinique. Archives de Psychologie, 63.

Colby, A., and Kohlberg, L. (1987a). The measurement of moraljudgment, Vol. 1. Theo- retical foundations and research validation. New Yok Cambridge University Press.

Colby, A., and Kohlberg, L. (1987b). The measurement of moral judgment, Vol. 2. Standard issue scoring manual. New York: Cambridge University Press.

Commons, M. L., Armon, C., Richards, F. A., Schrader, D. E., Farrell, E. W., Tappan, M. B., and Bauer, N. F. (1989). A multidomain study of adult devel- opment. In J. Sinnot, F. Richards, C. Armon, and M. Commons (Eds.), Adult development, Vol. 1. Comparisons and applications of developitental models (pp. 33-56). New York: Praeger.

Commons, M. L., and Grotzer, T. A. (1990). The relationship between Piagetian and Kohlbergian stage: An examination of the "necessary but not sufficient relationship". In M. L. Commons, C. Armon, L. Kohlberg, F. A. Richards, T. A. Grotzer, and J. D. Sinnott (Eds.), Adult development, Vol. 2. Models and methods in the study of adolescent and adult thought (pp. 205-231). New York: Praeger Publishers.

Commons, M. L., and Hallinan, P. W. (1990). Intelligent pattern recognition: Hierarchical organization of concepts. In M. L. Commons, R. J. Herrnstein, S. M. Kosslyn, and D. B. Mumford (Eds.), Computational and clinical approaches to pattern recognition and concept formation. Quantitative analyses of be- havior, Vol. 9 (pp. 127- 153). Hillsdale: Lawrence Erlbaum.

Commons, M. L., and Richards, F. A. (1996). Behavior analytic approach to a dialectics of stage performance and stage change. Unpublished manuscript.

Commons, M. L., Richards, F. A., with Ruf, F. J., Armstrong-Roche, M., and Bretzius, S. (1984). A general model of stage theory. In M. Commons, F. A. Richards, and C. Armon (Eds.), Beyond Formal Operations (pp. 120-140). New York: Praeger.

Dawson, T. L. (1998). "A good education is ..." A life-span investigation of de- velopmental and conceptual features of evaluative reasoning about educa- tion. Unpublished doctoral dissertation, University of California at Berkeley, Berkeley, CA.

Dawson, T. L. (2000, June). A stage is a stage is a stage. Part 2: A direct com- parison of two scoring systems. Paper presented at the annual meeting of the Jean Piaget Society, Montreal, Quebec.

Dawson, T. L. (in press). New tools, new insights: Kohlberg's moral reasoning stages revisited. International Journal of Behavioral Development.

Demetriou, A., Efklides, A., Papadaki, M., Papantoniou, G., and Economou, A. (1993). Structure and development of causal-experimental thought: From early adolescence to youth. Developmental Psychology, 29,480-497.

Draney, K. L. (1996). The polytomous Saltus model: A mixture model approach to the diagnosis of developmental differences. Unpublished doctoral disserta- tion, University of California at Berkeley, Berkeley, CA.

Edelstein, W., Keller, M., and Wahlen, K. (1984). Structure and content in social cogni- tion: Conceptual and empirical analyses. Child Development, 55,15 14-1526.

Elliott, C. (1982). The measurement characteristics of developmental tests. In S. Modgil and C. Modgil (Eds.), Jean Piaget: Consensus and controversy, (pp. 241-255). London: Holt, Rinehart, and Winston.

Fischer, K. (1980). A theory of cognitive development: The control and con- struction of hierarchies of skills. Psychological Review, 87,477-53 1.

Fischer, K. W., Hand, H. H., and Russel, S. (1984). The development of abstrac- tions in adolescence and adulthood. In M. L. Commons, F. A. Richards, and C. Annon (Eds.), Beyond formal operations. Late adolescent and adult cog- nitive development, (pp. 43-73). New York: Praeger.

Fischer, K. W., and Bidell, T. R. (1998). Dynamic development of psychological structures in action and thought. In W. Damon and R. M. Lemer (Eds.), Hand- book of Child Psychology: Theoretical models of human development, (51h

ed., pp. 467-561). New York: John Wiley and Sons.

Flavell, J. H. (1971). Stage-related properties of cognitive development. Cogni- tive Psychology, 2,421 -453.

Flavell, J. H. (1982). On cognitive development. Child Development, 53, 1-10.

Ginsburg, H. I?, and Opper, S. (1988). Piaget's theory of intellectual develop- ment (3rd ed), Englewood Cliffs, NJ: Prentice-Hall.

Hambleton, R. K., Clauser, B. E., Mazor, K. M., and Jones, R. W. (1993). Ad- vances in the detection of differentially functioning test items. European Journal of Psychological Assessment, 9, 1-1 8.

Hambleton, R. K., Swarninathan, H., and Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications, Inc.

Hautamaki, J. (1989). The application of a Rasch model on Piagetian measures of stages of thinking. In P. Adley (Ed.), Adolescent Development and School Science, (pp. 342-349). London: Falmer.

King, P. M., Kitchener, K. S., Wood, P. K., and Davison, M. L. (1989). Relation- ships across developmental domains: A longitudinal study of intellectual, moral, and ego development. In M. L. Commons, J. D. Sinnot, F. A. Richards, and C. Annon (Eds.), Adult development. Vol. 1. Comparisons and applications of developmental models, (pp. 57-71). New York: Praeger.

Kohlberg, L. (1969). Stage and sequence: The cognitive-developmental approach to socialization. In D. G o s h (Ed.), Handbook of socialization theory and research, (pp. 347-480). Chicago: Rand McNally.

Kohlberg, L., and Armon, C. (1984). Three types of stage models used in the study of adult development. In M. L. Commons, F. A. Richards, and C. Armon (Eds.), Beyond formal operations: Late adolescent and adult cognitive devel- opment, (pp. 383-394). New York: Praeger.

Kohlberg, L., Levine, C., and Hewer, A. (1983). Moral stages: A cun-ent formulation and a response to critics. Contributions to Human Development, 10,104-166.

Kuhn, D., Langer, J., Kohlberg, L., and Haan, N. S. (1977). The development of formal operations in logical and moral judgment. Genetic Psychology Mono- graphs, 95,97-188.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Masters, G. N. (1988). Measurement models for ordered response categories. In R. Langeheine and J. Rost (Eds.), Latent trait and latent class models (pp. 11- 29). New York: Plenum Press.

Masters, G. N. (1994). Partial credit model. In T. Hush and T. N. Postlethwaite (Eds.), The international encyclopedia of education (pp. 4302-4307). Lon- don: Pergammon.

Muller, U., Reene, K., and Overton, W. F. (1994, June). Rasch scaling of a de- ductive reasoning task. Paper presented at the Annual Symposium of the Jean Piaget Society, Philadelphia, PA.

Muller, U., Sokol, B., and Overton, W. 0. (1999). Developmental sequences in class reasoning and proportional reasoning. Journal of Experimental Child Psychology, 74 (69- 106).

Muller, U., Winn, M., and Overton, W. (1995, June). Rasch analysis of two re- cursive thinking tasks. Paper presented at the Annual Symposium of the Jean Piaget Society, Berkeley, CA.

Noelting, G., Coudd, G., and Rousseau, J. P. (1995, June). Rasch analysis ap- plied to multi-domain tasks. Paper presented at the Twenty-Fifth Annual Sym- posium of the Jean Piaget Society, Berkeley, CA.

Noelting, G., Rousseau, J. P., and Coud6, G. (1994, June). Rasch analysis ap- plied to Piagetian-type problems. Paper presented at the Annual Symposium of the Jean Piaget Society, Chicago, IL.

Overton, W. (1998). Developmental psychology: Philosophy, concepts, and methodol- ogy. In (Eds.) W. Darnon and R. M. Lemer, Handbook of childpsychology: Theoreti- c d models of human development. New York. John Wiley and Sons

Piaget, J. and Inhelder, B. (1969). The psychology of the child. New York: Basic Books.

Rosenberg, S. (1988). Reason, ideology, and politics. Princeton NJ: Princeton University Press.

Rogosa, D. R., and Willett, J. B. (1985). Understanding correlates of change by modeling individual differences in growth. Psychometrika, 50,203-228.

Selman, R. L. (1971). The relation of role taking to the development of moral judgment in children. Child Development, 42,79-91.

Shantz, C. U. (1975). The development of social cognition. In E. M. Hetherington (Ed.), Review of child development research, (Vol. 5, ). Chicago: University of Chicago Press.

Smith, R. M. (1986). Person fit in the Rasch model. Educational and Psycho- logical Measurement, 46,359-372.

Tomlinson-Keasy, C. (1982). Structures, functions, and stages: A trio of unre- solved issues in formal operations. In S. Modgil and C. Modgil (Eds.), Jean Piaget: Consensus and controversy, (pp. 13 1 - 153). New York: Praeger.

Turiel, E. (1983a). The development of social knowledge: Morality and conven- tion. Cambridge, England: Cambridge University Press.

Turiel, E., and Davidson, P. (1986). Heterogeneity, inconsistency, and asynchrony in the development of cognitive structures. In I. Levin (Ed.), Stage and stmc- ture: Reopening the debate (pp. 106-143). Norwood, NJ: Ablex.

Vuyk, R. (1981). Critique and overview of Piaget's genetic epistemology, 1965- 1980. (Vol. 1). New York: Academic Press.

Walker, L. J. (1980). Cognitive and perspective-taking prerequisites for the de- velopment of moral reasoning. Child Development, 57, 13 1- 139.

Walker, L. J. (1982). The sequentiality of Kohlberg's stages of moral develop- ment. Child Development, 53, 1330-1336.

Walker, L. J., and Taylor, J. H. (1991). Stage transitions in moral reasoning: A longitudinal study of developmental processes. Developmental Psychology, 27,330-337.

Walker, L. J., and Richards, B. S. (1979). Stimulating transitions in moral reason- ing as a function of stage of cognitive development. Developmental Psychol- ogy, 15,95-103.

Willett, J. B. (1989). Some results on reliability for the longitudinal measurement of change: Implications for the design of studies of individual growth. Educa- tional and Psychological Measurement, 49,587-602.

Wilson, M. (1989). Saltus: A psychometric model of discontinuity in cognitive development. Psychological Bulletin, 105,276-289.

Appendix

Case Estimates in Estimate Order N = 168

--

Case Time Score Maxscr Estimate Error Infit Outfit Inft Outft Mnsq Mnsq t t

(Appendix continued on next page.)

STAGE TRANSITION IN MORAL AND EVALUATIVE REASONING 369

(Appendix continued from previous page.)

Case Time Score Maxscr Estimate Error Infit Outfit Inft Outft Mnsa Mnsa t t

.95 .67 .87 8 -.04 .10

(Appendix continued on next page.)

370 DAWSON

(Appendix continued from previous page.)

Case Time Score Maxscr Estimate Error Infit Outfit Inn Outft Mnsq Mnsq t t

(Appendix continued on next page.)

STAGE TRANSITION IN MORAL AND EVALUATIVE REASONING 37 1

(Appendix continued from previous page.)

Case Time Score Maxscr Estimate Error Infit Outfit Inft Outft Mnsq Mnsq t t

Zeroscores

Mean .85 .85 .87 -.33 -.I8 SD 3.37 .95 1.04 1.26 1.09