The Pennsylvania State University

The Graduate School

College of Education

EVALUATING THE EFFECT OF INTELLIGENT TUTORING

OF STRUCTURE STRATEGY ON STUDENTS’ READING ABILITY

USING A COGNITIVE DIAGNOSTIC MODELING APPROACH

A Thesis in

Educational Psychology

by

Xiaoli Jiang

© 2017 Xiaoli Jiang

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Master of Science

August 2017

The thesis of Xiaoli Jiang was reviewed and approved* by the following:

Pui-Wa Lei

Professor of Education

Thesis Advisor

Bonnie J. F. Meyer

Professor of Education

Peggy Van Meter

Associate Professor of Education

Professor in Charge, Educational Psychology Program

*Signatures are on file in the Graduate School.

ABSTRACT

Reading comprehension ability is generally acknowledged to be crucial for

learners of all ages. Previous studies have consistently shown that the intelligent tutoring

of structure strategy (ITSS) is a promising approach to improving average reading

comprehension at the group level. However, it remains unclear whether ITSS instruction

is equally effective in promoting different reading subskills for all individuals, and

whether its effect differs across grade levels or between female and male learners.

The current study addressed those questions by retrofitting a cognitive diagnostic

modeling (CDM) technique, which is designed to provide formative diagnostic

information through fine-grained reporting of learners’ subskill or attribute mastery

profiles, to an existing reading test. The G-DINA model was used to examine which

of the four subskills (literal, inferential, critical, and affective subskills) as represented in

the Gray Silent Reading Test (GSRT) were significantly improved through ITSS

instruction as compared to a business-as-usual control group.

Our findings showed that all four reading subskills were significantly enhanced,

and that there was insufficient evidence to claim that the effect of ITSS

instruction differed in improving the four subskills. However, the growth rates of the four

subskills differed from pretest to posttest, with the inferential subskill showing

the greatest growth. In addition, the data collected supported the other two

hypotheses that the effect of ITSS instruction in promoting reading subskills did not

differ between Grade 4 and Grade 5 students or between female and male students.

The results of the study are expected to shed light on both instruction and learning

of reading in that teachers can remediate students’ weaknesses in reading subskills

by adapting instructional design and providing customized reading intervention

programs to aid student learning. The limitations of the present study and future research

directions are also discussed.

Table of Contents

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENT
Chapter 1 Introduction
1.1 Statement of the Problem
1.2 Significance of the Study
Chapter 2 Literature Review
2.1 Overview of Cognitive Diagnostic Assessment
2.2 Cognitive Diagnostic Modeling Approach
2.2.1 Cognitive Diagnostic Models
2.2.2 Application of the G-DINA Model in Reading Tests
2.3 Reading Comprehension Subskills
2.3.1 Reading Comprehension as a Construct of Multidimensional Subskills
2.3.2 Specifying Reading Subskills and Subskill-Item Relationship
2.4 Research Questions and Study Hypotheses
Chapter 3 Research Methodology
3.1 Identifying the GSRT Reading Subskills
3.1.1 Instrumentation and Measure of Reading Comprehension
3.1.2 Reading Subskill Identification
3.1.3 Coding Procedures and Q-matrix Construction
3.2 Investigating Examinees’ Performance on the GSRT
3.2.1 Participants and Data Collection
3.2.2 Model Fit Evaluation
3.3 Data Analysis
Chapter 4 Results
4.1 On Group Level
4.2 On Individual Subskill Level
Chapter 5 Discussion, Implication, Limitations, and Future Research
5.1 Discussion of the Overall Findings
5.2 Implications for Reading Instruction and CDA of Reading
5.3 Limitation and Future Research
References
Appendix A. Q-matrix for Form A of the GSRT
Appendix B. Q-matrix for Form B of the GSRT

LIST OF FIGURES

Figure 4.1 Estimated probability of each subskill mastery profile for pretest and posttest
Figure 4.2 Difference of probability of mastery of each subskill at pretest and posttest by intervention condition

LIST OF TABLES

Table 2.1 An Example of Q-matrix
Table 2.2 An Example of Latent Classes
Table 3.1 An Analysis of Sample Reading Passages in the GSRT (Form A)
Table 3.2 Subskills Specified by the GSRT Developers
Table 3.3 Q-matrix for Form B of the GSRT (single subskill)
Table 3.4 Q-matrix for Form B of the GSRT (multiple subskills)
Table 3.5 Absolute Fit Statistics for Pretest
Table 3.6 Relative Fit Statistics for Pretest
Table 3.7 Absolute Fit Statistics for Posttest
Table 3.8 Relative Fit Statistics for Posttest
Table 4.1 Subskill Mastery Probability in Pretest and Posttest
Table 4.2 Repeated Measures Analysis of Variance for Within-Subjects and Between-Subjects Effects
Table 4.3 Mean Difference t-test

ACKNOWLEDGEMENT

I have spent a lot of time working on this thesis, yet the study would not have

been possible without the contribution and help of many people. I am grateful to

everyone who has helped me in completing the thesis. I am particularly thankful to my

advisor, Dr. Pui-Wa Lei, for her superb mentorship, guidance, encouragement, and

patience. Dr. Lei has a great vision and strong passion for research in educational

measurement, which has served as a great inspiration during my study in the Educational

Psychology program. It is she who has introduced me to the field of cognitive diagnostic

assessment, in which I am developing increasing interest.

I would also like to express my deep gratitude to Dr. Bonnie Meyer. As an expert

in reading research, she has provided me with valuable insights for my study. Dr. Meyer

is always encouraging and ready to help, which has played a very important role in my

academic journey.

I am also grateful to Janet Dillon, Robert Fogarty, Cory Kite and other colleagues

at Outreach Analytics and Reporting (OAR) in Penn State Outreach and Online

Education for their help. Janet is a wonderful supervisor, who is always supportive and

cares about both my research and my life. My GA work experience at OAR has been

terrific because of her and the OAR team that she leads.

Special thanks must go to my peers Xiuyan Guo, Yao Xiong, and Xinyue Li in the

Educational Psychology program. Their help is always appreciated.

My gratefulness to my family is beyond words. I am indebted to them all for their

continuous understanding, love, and support.

Chapter 1 Introduction

Reading comprehension ability, the ability to understand written text, is generally

acknowledged to be crucial for learners of all ages both inside and outside the school

environment. Reading comprehension is fundamental and imperative for K-12 students

to have successful academic performance in school as they first learn to read before they

start reading to learn. On the other hand, reading comprehension is a challenge for

school children; U.S. school students have been underprepared for reading, as evidenced

by state-level and national-level reading tests (NAEP, 2007).

It is widely accepted that reading strategies are necessary for readers to maintain

appropriate comprehension. Various strategies have consistently been shown

to help readers comprehend better (Alfassi, 2004; Johnson-Glenberg, 2000;

Sung, Chang, & Huang, 2008). Therefore, reading instructors often

teach reading strategies to help learners overcome reading obstacles

(McNamara, 2007). One of the most efficient strategies, for which there is much research

and practice, is training students to use the structure strategy to facilitate

their reading comprehension.

Research on the structure strategy spans 40 years (Wijekumar,

Meyer, & Lei, 2012) and has indicated that the structure strategy can improve reading

comprehension achievement for both young readers (e.g., 3rd graders) and adult readers (e.g.,

Meyer, Brandt, & Bluth, 1980; Meyer et al., 2002; Meyer et al., 2010; Wijekumar et al.,

2014). The structure strategy, designed by Dr. Bonnie Meyer in 1975, directs readers to focus on

the organizational structure of a text and then to follow this structure to determine

what is important (Meyer et al., 1980). Essentially, the structure strategy teaches readers

to identify the structure of an expository text (e.g., comparison, cause and effect, and

sequence) by identifying the key signaling words and to organize their reading

comprehension by using that structure (Meyer & Wijekumar, 2007).

Extensive research has been conducted in which trained teachers taught students

how to use the structure strategy. The web-based Intelligent Tutoring System for the

Structure Strategy (ITSS), funded by the U.S. Department of Education, creates user

models by capturing tutoring interactions between human tutors and students and

converts them to computerized interactions. ITSS has demonstrated success in improving

reading comprehension for elementary and middle school students (e.g., Meyer et al.,

2010; Wijekumar et al., 2014).

1.1 Statement of the Problem

The structure strategy is a “well-tested method that helps readers to focus on the

text organization, helping them to organize their reading accordingly, and show

significant improvement in recall of expository text” (Project Description. Retrieved

September 13, 2016, from https://itss.psu.edu/itss/index.php/project-54). Specifically, the

structure strategy is a text-structure-based reading comprehension approach that requires

readers to: a) identify specific signaling words for the text structure (e.g., description,

comparison, sequence, problem-and-solution, and cause-and-effect), b) impose a top-

level structure on the reading text to create a main idea, and c) obtain proper

comprehension of the details in a text by combining both the signaling words and the

main idea (Wijekumar et al., 2014).

This suggests that the structure strategy has three major steps: a) to find out the

text structure by identifying the key signaling words such as “in contrast” for the

comparison structure (Meyer, 1975), b) to write down the main idea of the text that has a

particular text structure with the help of signaling words identified, and c) to recall the

text without referring to the text by using the text structure and main idea already

obtained. The three steps can be summarized as identifying the main idea

(i.e., the first two steps) and figuring out detailed explicit information (i.e., the third step)

depicted in a given passage through the use of a text structure. Obtaining explicit

information from a text is literal-level comprehension, whereas figuring out the main idea

involves both literal and inferential comprehension, since the main idea of a passage can

be either explicitly expressed (literal level) or implied, in which case it must be

inferred by readers (inferential level).

Research studies have consistently shown that the structure strategy helps readers

of all ages improve their reading comprehension in that those who have received the

structure strategy instruction perform significantly better than their counterparts who

have not at a group level on assessments such as standardized reading comprehension

tests (e.g., Gray Silent Reading Test) or researcher-designed tests (e.g., Carrell, 1985;

Englert & Thomas, 1987; Meyer et al., 1980; Meyer et al., 2002; Meyer et al., 2010;

Meyer, Wijekumar, & Lin, 2011; Meyer & Poon, 2001; Meyer & Rice, 1989;

Wijekumar, Meyer, & Lei, 2012; Wijekumar et al., 2014; Williams et al., 2005).

However, it remains unclear whether the structure strategy is equally effective in

promoting different reading subskills for all individuals.

The Gray Silent Reading Test (GSRT) mentioned above is a standardized reading

assessment that measures silent reading comprehension ability. The GSRT has two

equivalent forms (Form A and Form B) and each form is composed of 13

developmentally sequenced reading passages/stories with five multiple-choice questions

following each passage. There are four reading attributes or subskills underlying the

GSRT, which are specified by the test developers in the Gray Silent Reading Tests:

Examiner’s Manual (Manual hereafter) (Wiederholt & Blalock, 2000), namely, literal,

inferential, critical, and affective subskills.

According to the Manual (Wiederholt & Blalock, 2000), literal comprehension is

the subskill of identifying explicit information within the text, both general (such as the

directly expressed main idea) and specific (such as the explicitly stated supporting

details). Inferential comprehension is the subskill of inferring meanings that go beyond the stated

information, which includes inferring main ideas and facts, or of linking text

information to other situations to make predictions, such as what is likely to happen based on

the current text. That is, inferential comprehension requires a mental process by which

we obtain information not explicitly stated and reach a conclusion based on specific

evidence or given information. Critical comprehension refers to readers’ ability to

analyze, evaluate, or make reasonable judgment about the text’s content. Therefore,

critical comprehension is also termed evaluative comprehension in the literature of

reading research. According to Bloom’s Taxonomy (1956), to analyze and to evaluate

are higher-order thinking skills. Affective comprehension involves readers’ personal or

emotional response to the text. Bloom’s Taxonomy of educational objectives is divided

into three domains: cognitive, affective, and psychomotor. Bloom identified six levels

within the cognitive domain—knowledge, comprehension, application, analysis,

synthesis, and evaluation, which are also applied in reading comprehension (more

elaboration on this can be found in Chapter 2). The affective domain includes the manner in

which people deal with things emotionally, and affective comprehension involves readers’

emotions and feelings toward a text.

The present study aims to move one step beyond previous studies

by looking into the effect of structure strategy instruction on reading comprehension

subskills. The research questions addressed in the present study are: Is ITSS instruction

equally effective in improving the reading subskills as represented in the GSRT? Does

the effect of ITSS instruction vary by grade level (4th grade vs. 5th grade) or gender

(female vs. male)? Cognitive diagnostic modeling (CDM) will be applied to

examine 4th and 5th grade students’ mastery of reading subskills based on their

performance on the GSRT before and after ITSS instruction and to compare it with that of their

business-as-usual control counterparts in order to address the research questions. An introduction

to CDM is given in the subsequent chapter.

1.2 Significance of the Study

McNamara (2007) stressed the importance of developing reading comprehension

skills in elementary school children, and further argued that development and instruction

in reading subskills are as important as fostering basic language skills for these students.

Therefore, it would be helpful to obtain more diagnostic information regarding students’

reading subskill mastery profiles by identifying the specific subskills that can be

significantly enhanced through ITSS instruction. Once the diagnostic information is

obtained, it will shed light on the design of reading skill instruction, training or

intervention, which in turn can benefit student learning. That is to say, this study will

identify the skill areas that need further strengthening within ITSS, and it is hoped that

more customized and personalized intervention(s) can be developed and delivered to

individuals in need.

Chapter 2 Literature Review

This chapter is devoted to providing background information necessary for a

thorough understanding of the context of the present study, which encompasses the terms,

concepts, and previous research studies pertinent to cognitive diagnostic models and

the cognitive diagnostic modeling approach, along with reading comprehension assessment

within the cognitive diagnostic assessment framework.

2.1 Overview of Cognitive Diagnostic Assessment

Knowing What Students Know (National Research Council, 2001) discussed some

measurement models that allow the merger of advances in cognitive science and

psychometric theories and facilitate inferences more relevant to learning. The

psychometric models include models known as cognitive diagnostic models (CDMs),

which can be used to understand the skills and cognitive processes involved in a task.

The approach that takes advantage of CDMs for designing and interpreting diagnostic

assessment is known as cognitive diagnostic assessment (CDA). The CDA approach

combines theories of cognition of interest with statistical models intended to make

inferences about examinees’ mastery profile of tested skills or their knowledge state of

the cognitive skills. The cognitive skills represent the processes or strategies that

examinees utilize in order to correctly solve tasks (Jang, 2008). Broader terms for

cognitive skills include knowledge and strategies that are used to successfully solve a

problem. Terms like skill, subskill, and attribute are often used interchangeably in the

literature.

CDA is an assessment approach that employs a cognitive diagnostic model to

identify items that measure specific skills and then uses this model to direct the

psychometric analyses of the examinees’ item response patterns for test score inferences

(Gierl, Cui, & Zhou, 2009). Designed to evaluate learners’ cognitive skills or attributes

specified in a cognitive model of test performance, CDA has a fundamental assumption

that every task or each item in a test can be described in terms of multiple cognitive

subskills or attributes that must be mastered by an individual in order to successfully

accomplish the task or answer each item correctly (Gierl et al., 2000).

When CDA is applied to language assessments, such as reading

comprehension tests, there are generally two approaches. First, a test can be designed by

developing diagnostic items that enable researchers to infer the skill processes underlying

the item responses (Jang, 2008; Lee & Sawaki, 2009). With the diagnostic items

developed, CDMs can be applied to further test the knowledge, subskills, or attributes

underlying each item. This approach, although identified as the most-needed area of future CDA research

(Jang, 2009b), is not commonly used because the development of cognitively

diagnostic tests is costly, time-consuming, and labor-intensive. In contrast, the second

approach is presently much more common; that is, cognitive diagnostic modeling is

usually retrofitted to an existing non-diagnostic test, even though the development of CDA

has largely been motivated by the need for formative assessment techniques (Jang, 2005,

2008). Most applications of cognitive diagnostic modeling in the literature involve the

retrofitting of CDMs to assessments designed by using either classical test theory (CTT)

or item response theory (IRT), which is described as an addition of a new technology to

an older system (Gierl, Roberts, Alves, & Gotzmann, 2009). According to Liu, Huggins-

Manley, and Bulut (2017), retrofitting cognitive diagnostic models to existing non-

diagnostic assessments is “a plausible approach to obtain more actionable scores or

understand more about the constructs themselves” (p. 1).

Diagnostic information can be extracted from non-diagnostic tests, and the

resulting information can further provide insight into both teaching and learning (Lee &

Sawaki, 2009). Furthermore, there are currently many large-scale tests available

including various types of standardized reading comprehension tests such as the GRE and

TOEFL. Substantial resources have been invested in developing them, so it would

be cost-effective if they could be used for other purposes such as extracting diagnostic

information about students’ mastery and non-mastery status of specified subskills (Chen

& de la Torre, 2014).

When a CDA analysis is conducted using the retrofitting method, the following four

basic steps or procedures are typically followed. First, identify and define a set of

subskills required to perform successfully in a given test, which may or may not be

provided by the test developers. Second, construct or create a Q-matrix displaying the

relationship between test items and subskills specified. A Q-matrix is an item-by-subskill

incidence matrix that demonstrates which subskill or multiple subskills are needed for

successful performance for each item on the test (as shown in Table 2.1). In other words,

the Q-matrix specifies the set of latent traits, including knowledge, strategies, or

subskills, necessary for each item. Third, select a proper cognitive diagnostic model and

analyze the examinees’ test performance data by utilizing this model with the Q-matrix

already developed. Fourth, provide diagnostic feedback of the assessment results and

report the scores with diagnostic information (i.e., individual’s skill mastery

status/profile) to inform examinees, teachers, and parents of the examinees’ strengths and

weaknesses in a domain such as reading comprehension or mathematics (Lee & Sawaki,

2009).

George and his colleagues simplified the four steps, summarizing that only two

steps, a qualitative step and a quantitative step, are involved in a CDA analysis (George,

Robitzsch, Kiefer, Groß, & Ünlü, 2016). The first step is a qualitative one in which

researchers subdivide the tested ability into different abilities or subskills. Subsequently,

the researchers specify in a Q-matrix which skills are needed to master each of the items.

The second step is a quantitative one in which a cognitive diagnostic model is applied

and examinees are classified into dichotomous latent skill classes that predict their

mastery or non-mastery of the skills defined in the first step.
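To make the two steps concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the estimation procedure used in the present study (which relies on the G-DINA model): it assumes the simpler DINA response rule, fixed and made-up slip and guess values, a uniform prior over latent classes, and a five-item Q-matrix with the same pattern as the example in Table 2.1. The names `p_correct` and `classify` are hypothetical.

```python
from itertools import product

# Step 1 (qualitative): an illustrative Q-matrix.
# Rows are items, columns are subskills; 1 means the item requires the subskill.
Q = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]

SLIP = 0.1   # assumed: probability that a master of all required subskills errs
GUESS = 0.2  # assumed: probability that an examinee lacking a subskill succeeds


def p_correct(profile, q_row):
    """DINA-style rule: high success probability only if every required subskill is mastered."""
    has_all = all(a >= q for a, q in zip(profile, q_row))
    return 1 - SLIP if has_all else GUESS


def classify(responses):
    """Step 2 (quantitative): posterior over all latent classes for one response pattern."""
    classes = list(product([0, 1], repeat=len(Q[0])))
    likelihoods = []
    for profile in classes:
        lik = 1.0
        for x, q_row in zip(responses, Q):
            p = p_correct(profile, q_row)
            lik *= p if x == 1 else 1 - p
        likelihoods.append(lik)
    total = sum(likelihoods)  # uniform prior, so normalizing the likelihoods suffices
    posteriors = {c: lik / total for c, lik in zip(classes, likelihoods)}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# An examinee who answers only Item 3 correctly.
best, posteriors = classify([0, 0, 1, 0, 0])
print(best)  # (1, 0, 0)
```

Under these assumptions, an examinee who answers only Item 3 correctly is classified as a master of the first subskill alone, because Item 3 is the only item that requires nothing beyond that subskill. In an operational analysis the item parameters and class proportions would be estimated from the data rather than fixed in advance.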

The Q-matrix plays a key role in CDA analysis as it provides the specification of

the subskill-item relationship. For a J × K Q-matrix, where J represents the number of items

and K the number of subskills, q_jk denotes the element in row j and column k (“1” means

that item j assesses attribute k, and “0” otherwise). Table 2.1 illustrates what a typical

Q-matrix looks like. Suppose three subskills have been identified and defined. The

subskills measured by each item are indicated in this item-by-subskill binary matrix. For

example, Item 1 requires the first and third subskills for a correct response, and Item 3

only requires the first subskill.

Table 2.1 An Example of Q-matrix

Item   Subskill 1   Subskill 2   Subskill 3
1      1            0            1
2      0            1            1
3      1            0            0
4      0            0            1
5      1            1            0
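The Q-matrix in Table 2.1 can also be read programmatically. The short Python sketch below is illustrative only; it stores each row of the table and looks up which subskills an item requires and which items measure a given subskill. The names `Q_MATRIX`, `required_subskills`, and `items_measuring` are hypothetical.

```python
# Q-matrix from Table 2.1: keys are item numbers, values are the item's row
# (one 0/1 entry per subskill, in the order Subskill 1, 2, 3).
Q_MATRIX = {
    1: (1, 0, 1),
    2: (0, 1, 1),
    3: (1, 0, 0),
    4: (0, 0, 1),
    5: (1, 1, 0),
}


def required_subskills(item):
    """Return the 1-based indices of the subskills that an item measures."""
    return [k + 1 for k, q in enumerate(Q_MATRIX[item]) if q == 1]


def items_measuring(subskill):
    """Return the items whose Q-matrix row has a 1 for the given subskill."""
    return [j for j, row in Q_MATRIX.items() if row[subskill - 1] == 1]


print(required_subskills(1))  # [1, 3]
print(required_subskills(3))  # [1]
print(items_measuring(3))     # [1, 2, 4]
```

The first two lookups reproduce the reading of the table given in the text: Item 1 requires the first and third subskills, and Item 3 requires only the first.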

2.2 Cognitive Diagnostic Modeling Approach

Cognitive diagnostic models are a class of psychometric models developed

primarily for assessing students’ mastery and non-mastery status on a set of fine-grained

skills or attributes within a domain in order to provide a multivariate view of their

strengths and weaknesses (Templin & Bradshaw, 2013). Cognitive in this context means

CDMs are able to provide detailed information of students’ cognitive strengths and

weaknesses (Huebner, 2010), which can be used to adapt instructional designs and

determine the skills and knowledge that need to be further developed. This differs from

summative assessment which is generally evaluative rather than diagnostic. As the

detection of students’ strengths and weaknesses is gradually becoming an essential part of

educational assessment (Huff & Goodman, 2007), the CDM approach has received

increased research interest in recent years from researchers and educators in educational

and psychological fields. CDMs also have some alternative labels such as diagnostic

classification models (DCMs; Rupp, Templin, & Henson, 2010), latent response models

(Maris, 1995), and structured located latent class models (Xu & von Davier, 2006).

According to Rupp and Templin (2008), CDMs are “probabilistic, confirmatory

multidimensional latent-variable models with a simple or complex loading structure” (p.

226). CDMs are probabilistic because they predict the probability of an observable

categorical response from unobservable latent categorical variables and provide

formative information in the form of attribute profiles. The unobserved latent variables

are usually termed and labeled in the literature as skills, subskills, attributes, or even

more broadly as abilities, strategies, or knowledge as introduced earlier. These terms are

often used interchangeably in the literature to refer to the unobserved latent variables or

latent traits (Ravand & Robitzsch, 2015). Attributes and subskills, commonly denoted as

α, will be the terms used throughout most of this thesis for the purpose of clarity and

consistency. Examinees are assigned multidimensional attribute profiles, and further

diagnostically classified as either masters or non-masters of each subskill involved within

a domain. The examinee’s attribute profile is analogous to their subskill profile, latent

class, or knowledge state of subskills involved in a test. The four terms—attribute

profile, subskill (mastery) profile, latent classes, and knowledge state—are used

interchangeably throughout the present study.

Given k subskills, there are a total of 2^k attribute profiles or latent classes

(Gierl, Cui, & Zhou, 2009; Rupp, Templin, & Henson, 2010). Table 2.2 is an example of

eight latent classes when there are three subskills (k = 3) underlying performance on a

test. The examinees falling into Class 1 have mastered none of the three subskills,

whereas the examinees classified into Class 2 have only mastered the third subskill.

Examinees in either Class 1 (masters of none of the subskills) or Class 8 (masters of all three

subskills) are referred to as having a flat skill mastery profile (Ravand, Barati, &

Widhiarso, 2013).

Table 2.2 An Example of Latent Classes

Latent Class Number   Representation of Latent Class
Class 1               [0,0,0]
Class 2               [0,0,1]
Class 3               [0,1,0]
Class 4               [0,1,1]
Class 5               [1,0,0]
Class 6               [1,0,1]
Class 7               [1,1,0]
Class 8               [1,1,1]
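The enumeration in Table 2.2 follows directly from the binary nature of the attributes. As a minimal illustration, the Python sketch below generates every mastery profile for a given number of subskills and flags the two flat profiles; the function names `latent_classes` and `is_flat` are hypothetical.

```python
from itertools import product


def latent_classes(k):
    """List every mastery profile for k binary subskills, in the order of Table 2.2."""
    return [list(profile) for profile in product([0, 1], repeat=k)]


def is_flat(profile):
    """A flat profile masters either none or all of the subskills."""
    return sum(profile) in (0, len(profile))


classes = latent_classes(3)
for number, profile in enumerate(classes, start=1):
    label = " (flat)" if is_flat(profile) else ""
    print(f"Class {number}: {profile}{label}")
```

With three subskills the loop prints the eight classes of Table 2.2 and marks Class 1 and Class 8 as flat. Calling `latent_classes(4)` for the four GSRT subskills would yield 16 profiles.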

The cognitive diagnostic method using CDMs is now receiving wider attention due

to its advantages over traditional analysis of test results with CTT or unidimensional IRT

models. Both CTT and the most commonly used unidimensional IRT models are geared

towards measuring students’ overall performance by providing their unidimensional

ability values, typically with a summative scaled score and/or a percentile rank. In the IRT

framework, for example, it is generally assumed that a single latent trait, denoted as θ,

underlies an individual’s performance on an item or in a test. Students with higher latent

trait values have a higher probability of giving a correct response. In contrast, within the CDM

framework, an individual’s performance on an item is assumed to be a function of

multiple, discrete latent traits generically referred to as attributes or subskills, as

explained earlier. Students’ successful performance on a test requires the

successful application of the attributes underlying the test.
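The contrast between the two frameworks can be made explicit with two item response functions. The equations below are offered only as an illustration and are not taken from this section: the first is the two-parameter logistic model, one common unidimensional IRT model, and the second is the DINA model, one of the simplest CDMs. The discrimination and difficulty parameters a_j and b_j and the slip and guess parameters s_j and g_j are notation introduced here for the example, with θ_i the single latent trait, α_ik the binary attributes, and q_jk the Q-matrix entries.

```latex
% Unidimensional IRT (2PL): one continuous latent trait \theta_i
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j(\theta_i - b_j)\right]}

% CDM (DINA): K discrete attributes \alpha_{ik} \in \{0, 1\}
\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}, \qquad
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}}
```

In the first equation the probability of a correct response rises smoothly with θ_i. In the second it takes only two values per item, depending on whether examinee i has mastered every attribute that item j requires (η_ij = 1) or lacks at least one of them (η_ij = 0).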

A CDM framework can provide multidimensional results and examinees’

attribute profiles whereas neither CTT nor unidimensional IRT can provide diagnostic

information on examinees’ mastery or non-mastery state of tested subskills. Diagnostic

score reports based on these subskill profiles can then be provided to researchers,

educators, students, and their parents. Proper actions can be taken accordingly in view of

the examinees’ strengths and weaknesses in a domain in order to guide their

improvements in the ongoing teaching and learning context. This demonstrates that, as

Jang (2008) pointed out, the CDA approach is aimed at promoting assessment for

learning and learning process as opposed to assessment of learning in traditional methods.

Despite these advantages, CDA using CDMs has gained popularity among researchers and

educators only in recent years. Ravand and Robitzsch (2015)

suggested two reasons for its underutilization. First, CDMs are newer and more

complex than conventional IRT models, so many researchers

remain unfamiliar with them and with how to use them. Second, model specifications

for CDMs are more complex and more subject to estimation problems because they are

highly parameterized models.

In the past decade, CDMs have been applied to tests in different domains such as

mathematics and reading comprehension tests using the retrofitting approach. A number

of research studies with regard to reading tests, second language reading tests in

particular, have been conducted using CDMs (e.g., Chen & Chen, 2015; Gao, 2007;

Hartz & Roussos, 2008; Jang, 2005; Jang, Dunlop, Wagner, Kim, & Gu, 2013; Li, 2011;

Li & Suen, 2013; Ma & Meng, 2014; Ravand, Barati, & Widhiarso, 2013; Sawaki, Kim,

& Gentile, 2009).

Like most CDM applications in the literature, the GSRT test used in the present

study was not originally designed for cognitive diagnostic purpose. By applying the

retrofitting approach, it is hoped that diagnostic information extracted from this

standardized reading test can facilitate the understanding of the effectiveness of ITSS

instruction on specific reading subskills.

2.2.1 Cognitive Diagnostic Models

In the past decade, there has been rapid progress in cognitive diagnostic

assessment practice. A wide array of CDMs have been developed, as evidenced by over

120 CDMs reviewed by Fu and Li (2007, cited in Lee & Sawaki, 2009). These CDMs

can be generally grouped into three families including non-compensatory CDMs,

compensatory CDMs, and general CDMs. The terms refer to how subskills are related to

modeling the probability of a correct response (Lissitz et al., 2014).

In a non-compensatory CDM framework, a correct response to an item can be

obtained only if all required subskills for this item are present. The deterministic input,

noisy “and” gate model (DINA; Haertel, 1989; Junker & Sijtsma, 2001), the Rule Space

Model (Tatsuoka, 1983), and the Reparameterized Unified Model (RUM; Hartz, 2002),

which is also known as the Fusion Model (Hartz, 2002), all belong to this family. In contrast, in

compensatory CDMs, not all skills are required; mastery of only some of the subskills

required for a correct response can compensate for non-mastery of other subskills. The

deterministic input, noisy “or” gate model (DINO; Templin & Henson, 2006) and the

Additive Cognitive Diagnostic Model (ACDM; de la Torre, 2011) both belong to this

second family. Finally, general CDMs allow for both types of relationships. The general

diagnostic model (GDM; von Davier, 2008; Xu & von Davier, 2006) and the generalized

DINA model (G-DINA; de la Torre, 2011) are examples of general CDMs.
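
The contrast between the two restricted families can be illustrated with a short sketch. It assumes the common guessing/slip parameterization of the DINA and DINO models; the function names and the parameter values (guess = 0.2, slip = 0.1) are hypothetical and chosen only for illustration.

```python
def dina_prob(alpha, q, guess, slip):
    """DINA (non-compensatory): all required attributes must be mastered."""
    has_all = all(a == 1 for a, needed in zip(alpha, q) if needed == 1)
    return 1 - slip if has_all else guess

def dino_prob(alpha, q, guess, slip):
    """DINO (compensatory): mastering any one required attribute suffices."""
    has_any = any(a == 1 for a, needed in zip(alpha, q) if needed == 1)
    return 1 - slip if has_any else guess

# Hypothetical item requiring the second and third of four attributes.
q_item = [0, 1, 1, 0]
# Hypothetical examinee who has mastered only the second attribute.
alpha = [0, 1, 0, 0]

print("DINA:", dina_prob(alpha, q_item, guess=0.2, slip=0.1))
print("DINO:", dino_prob(alpha, q_item, guess=0.2, slip=0.1))
```

Under these assumptions, the examinee is treated as a non-master of the item by the DINA rule but as a master by the DINO rule. When an item requires only one attribute, the two rules return the same probability for every attribute profile.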

Non-compensatory models are more popular than compensatory models in

performing cognitive diagnostic analyses because the former generate more fine-grained

diagnostic information than the latter can (Li, 2011). Application of CDMs is most

common with mathematics and reading comprehension. It is generally agreed that non-

compensatory models are appropriate for mathematics assessments, while compensatory

models are better for reading assessments since many researchers believe reading skills

are compensatory in nature (Li, Hunter, & Lei, 2015). A number of studies have been

conducted where CDMs are applied to large-scale reading assessments (e.g., Chen &

Chen, 2015; Jang, 2009b; Kim, A. Y., 2011; Kim, H. S., 2011; Li, 2011; Li & Suen,

2013; Ravand, Barati, & Widhiarso, 2013).

As a general cognitive diagnostic model, the G-DINA model is a generalization of

the DINA model (Junker & Sijtsma, 2001). The G-DINA model relaxes the highly

restrictive DINA assumption that all examinees who lack at least one required subskill have the

same probability of success. Based on a Q-matrix of dimensions J (number of items) × K

(number of attributes), the probability formula for the G-DINA model is given by de la

Torre (2011) as follows:

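$$
P(\boldsymbol{\alpha}_{lj}^{*}) = \delta_{j0} + \sum_{k=1}^{K_j^{*}} \delta_{jk}\,\alpha_{lk} + \sum_{k'=k+1}^{K_j^{*}} \sum_{k=1}^{K_j^{*}-1} \delta_{jkk'}\,\alpha_{lk}\,\alpha_{lk'} + \cdots + \delta_{j12 \ldots K_j^{*}} \prod_{k=1}^{K_j^{*}} \alpha_{lk}
$$

where,

$\boldsymbol{\alpha}_{lj}^{*}$ is the binary latent attribute vector for latent class l

$\delta_{j0}$ is the intercept for item j

$\delta_{jk}$ is the main effect of mastering attribute $\alpha_{k}$ for item j

$\delta_{jkk'}$ is the interaction effect due to attributes $\alpha_{k}$ and $\alpha_{k'}$ for item j

In the formula above, $\delta_{j0}$ represents the baseline probability of a correct response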

when none of the required subskills is mastered by an examinee, which can be viewed as

the guessing parameter. With guessing, a correct response is still possible in this case.
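
To make the structure of the formula concrete, the sketch below evaluates the identity-link G-DINA probability for a single item. It is an illustration under stated assumptions rather than the estimation procedure used in the study: the function name gdina_prob, the dictionary layout for the delta parameters, and all numeric values are hypothetical.

```python
from itertools import combinations
from math import prod

def gdina_prob(alpha, delta):
    """P(correct) for one item under the identity-link G-DINA model.

    alpha: tuple of 0/1 mastery indicators for the attributes the item requires.
    delta: dict mapping a tuple of attribute indices to its effect;
           () is the intercept, (0,) a main effect, (0, 1) an interaction.
    """
    total = 0.0
    for size in range(len(alpha) + 1):
        for subset in combinations(range(len(alpha)), size):
            # The product over an empty subset is 1, which yields the intercept.
            total += delta.get(subset, 0.0) * prod(alpha[k] for k in subset)
    return total

# Hypothetical parameters for an item that requires two attributes.
delta_item = {(): 0.2, (0,): 0.3, (1,): 0.1, (0, 1): 0.3}

for profile in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(profile, round(gdina_prob(profile, delta_item), 3))
```

With these illustrative values, an examinee who has mastered neither attribute succeeds with the intercept probability alone, and each additional mastered attribute contributes its main effect plus any applicable interaction.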

When used, the G-DINA model is usually accepted for its simplicity of

computation and estimation in identifying the role an individual subskill plays for

someone in completing a task (Tatsuoka & Tatsuoka, 1982). Nevertheless, the G-DINA

model is much less frequently adopted in educational research than other

CDMs such as the Fusion Model and the DINA model. The G-DINA model has drawn researchers’

attention only in recent years, especially after de la Torre’s (2011) illustration of

estimation and application of the G-DINA model using both real-world data and

simulated data.

2.2.2 Application of the G-DINA Model in Reading Tests

Li and her colleagues reviewed previous CDM analyses of reading tests,

summarizing each study in the literature on the test analyzed, the cognitive diagnostic

model(s) used, and the software used (Li, Hunter, & Lei, 2015). A total of 15 studies were

reviewed and summarized in tabular form. None of the 15

studies used the G-DINA model; instead, the Rule Space Model and the

Fusion Model were most often used.

The feature of the G-DINA model aligns with that of reading comprehension in

that this model “is sensitive to the integrative nature of reading comprehension skills and

can detect the interactive relationship among them” (Chen & Chen, 2015, p. 4). This

suggests that the G-DINA model can in theory be applied to reading tests. The few

studies available in the extant literature indicate that the G-DINA model has

been successfully applied to reading tests. For example, Chen and Chen (2015) examined

the relationships among the five reading subskills (i.e., identifying explicit information,

generalizing main idea, interpretation and explanation, getting inference, and evaluation

and comment) specified and defined by six content experts by applying the G-DINA

model and analyzing the test response data of 1,029 British secondary school students’

performance on a reading test (i.e., Programme for International Student Assessment

English) with 20 items. The research demonstrated that the G-DINA model caters to the

characteristics of reading subskills and suggested that the model be applied to tests that

involve hierarchical skills.

In view of the research findings that the G-DINA model is well suited for

modeling reading comprehension subskills and the fact that very few educational studies

have employed the G-DINA model, it might be a meaningful endeavor to apply this

model to analyze students’ test performance data on the GSRT reading test to extract

more fine-grained diagnostic information about examinees at the attribute level for the

current study. The goodness of fit of several more parsimonious variants of the G-DINA

model will be further evaluated to determine the best fitting model.

2.3 Reading Comprehension Subskills

Reading is usually viewed as comprehension—“the process of simultaneously

extracting and constructing meaning through interaction and engagement with written

language” (RAND Reading Study Group, 2002, p. xiii). Reading ability, the ability to

comprehend meaning from the written text, is a key skill for learning. Reading

comprehension and reading ability are used interchangeably throughout the study. Given

the important role reading plays in our lives, there has long existed a debate about the

nature of reading comprehension. A myriad of theories have been proposed to describe

and explain reading comprehension, among which cognitive processing perspectives of

reading remain the most influential to date (Tracey & Morrow, 2006). Because reading is

an unobservable construct or mental process, educational psychologists in the late 1950s

started to extensively describe the underlying cognitive processes involved in reading

comprehension. The cognitive processing view thus emerged and remains popular

today among researchers. Nevertheless, the nature of reading is still not well understood, and

there is no agreed-upon view on the dimensionality of the reading construct.

2.3.1 Reading Comprehension as a Construct of Multidimensional Subskills

In general, there are two opposite views on the dimensionality of reading

comprehension. Many experts hold the view that reading comprehension is unitary and

holistic, and thus it cannot be further divided (e.g., Lunzer, Waite, & Dolan, 1979).

Others, on the contrary, argue that reading comprehension is multi-divisible or that

reading ability is composed of separable components or subskills (e.g., Carr & Levy,

1990; Davis, 1968; Munby, 1978).

Experts in the second group believe that reading comprehension consists of

separable components or constituents; in other words, multiple subskills are involved in

reading comprehension (e.g., Davis, 1944; Munby, 1978). However, they differ on the

number of reading subskills that exist. For example, Davis (1944) reviewed the relevant

literature and identified nine skills deemed most basic and important by authorities in the

domain of reading, such as ability to select the appropriate meaning for a word or phrase

in light of its particular contextual setting, ability to select the main thought of a passage,

ability to draw inferences from a passage about its contents, ability to draw inferences

about a writer, and so forth. In a subsequent factor analysis study, Davis (1968) found

eight skills in reading: recalling word meanings, obtaining word meaning based on the

context clues, finding out explicit information, integrating ideas of the text, drawing

inferences based on given information, figuring out the writer’s writing purpose, tone

etc., identifying the writer’s techniques, and identifying the text structure. Munby (1978)

contended that 19 skills are required for reading comprehension, such as skimming,

scanning, having basic reference skills, and so on. Scarborough (2001) suggested that

reading largely consisted of word recognition skill and language comprehension skill,

which further consist of three and five subskills, respectively. This review

shows that the number of skills identified as part of reading comprehension varies,

which indicates the complex nature of reading ability. This intricate nature of reading

makes it difficult to find a suitable model for assessment of learners’ reading ability (Kim

A. H., 2011).

According to Anderson and Pearson (1984), drawing inferences, one of the

common reading skills, is an essential part of the reading comprehension process, even

among young children. There are researchers who argue that hierarchical relationships

exist among reading comprehension skills. Bloom’s Taxonomy (1956) consisted of six

categories in the cognitive domain used to express the level of expertise required to

achieve an educational goal: knowledge, comprehension, application, analysis, synthesis,

and evaluation. The six categories lie along a continuum from simple to complex and

from low level to higher level. This taxonomy is also used in reading comprehension by

some researchers holding the position that it would be comparatively more difficult for

readers to attain the higher level of reading skills such as synthesis and evaluation (e.g.,

Gray, 1960).

In contrast to this view, some researchers have found that lower-level reading

skills are actually not prerequisites for higher-level reading skills (e.g., Alderson, 1990;

Matthews, 1990), which is evidenced by examples that some examinees can perform

successfully on items requiring higher-level reading skills but fail on items requiring

lower-level skills. In addition, the construction-integration model (Kintsch & Kintsch,

2005) proposed that reading is a cognitive process that involves decoding, making inferences, and

integrating information during reading. This model also supports the multiple-component

view of reading comprehension.

In the present study, we adopted the divisibility view of reading comprehension

and contended that multiple reading comprehension subskills are required to complete

reading tasks. CDMs are well-suited for modeling multiple subskills in a confirmatory

fashion. That is, comprehension subskills required for an assessment are defined a priori

in CDM analyses.

2.3.2 Specifying Reading Subskills and Subskill-Item Relationship

As described at the beginning of this chapter, the first step in carrying out a

cognitive diagnostic assessment using the retrofitting approach is to specify and

define a set of tested subskills. Considering the importance of proper specification of a

set of tested subskills, a variety of sources are usually utilized, which include test

specifications, content domain theories, respondents’ think-aloud interviews, and a

literature review on relevant research (Leighton & Gierl, 2007; Leighton, Gierl, &

Hunka, 2004). When the subskills are not specified by the test developers, which is

common in retrofitting, it is especially important to take advantage of different sources to

identify the set of subskills. Studies applying CDMs to assessments have used one or a

combination of the above-mentioned sources.

In specifying subskills underlying a given assessment from existing items, some

considerations are important for researchers to keep in mind. One of the considerations is

that the total number of subskills, k, should be small (Liu, Huggins-Manley & Bulut,

2017). Since examinees are classified into 2^k attribute profiles or latent classes, it would

be very difficult to estimate the attributes and interpret the results given a large number of

attributes (Liu, Huggins-Manley & Bulut, 2017; Xu & Zhang, 2016). Generally, the

retrofitting studies in the literature specify three to five attributes or subskills (Liu,

Huggins-Manley & Bulut, 2017) so that examinees are classified into 8 (k = 3), 16 (k = 4),

or 32 (k = 5) subskill profiles or latent classes.

Upon completing the specification of reading subskills, subskill-item relationships

should be examined and recorded in a Q-matrix. There are generally three ways of

selecting coders to analyze subskill-item relationships in order to create a Q-matrix.

First, the author(s) of a study can serve as coders when applying diagnostic models to an

existing test (e.g., Buck et al., 1997; Gierl, Leighton, et al., 2009). This is in many cases due

to either funding issues or the fact that people not involved in the project cannot be

invited to serve as coders (Buck & Tatsuoka, 1998). Second, content experts are

recruited and included in the research to code the items (e.g., von Davier, 2008; Wang,

Gierl, & Leighton, 2006). It is generally believed that content experts are experienced in

teaching in a domain such as reading; they are familiar with the examinee population of a

given test and understand examinees’ subskill processes regarding problem solving.

Third, some researchers recruit graduate students to assist with the coding (e.g., Jang,

2005; Wang & Gierl, 2007; Wang & Gierl, 2011). Compared to content experts,

graduate students can be more easily recruited. Wang and Gierl (2007) illustrated the use

of attribute hierarchy method (AHM) by applying AHM to students’ response data from

the SAT Critical Reading Subtest. In specifying the attribute hierarchy, the researchers

recruited two graduate students in cognitive psychology, who had experience in

conducting verbal report study and were familiar with reading comprehension research.

In practice, the three ways outlined may have some overlap as in a situation where the

authors of a study are a cohort of graduate students.

The three ways introduced above have all been used in the literature for both

subskills specification and the development of Q-matrix, with no consensus so far about

which way is superior to the others. One might claim that content experts are

authoritative in a domain and should always be included in the coding process. It should

be noted, however, that experts’ ability is substantially higher than that of the examinees

so that experts may approach questions in considerably different ways than examinees

do. There is no evidence suggesting that subskills identified by content experts are

indeed used by examinees in solving problems (Leighton & Gierl, 2007). Furthermore,

the number of coders involved in a study also varies across studies, with the most

common number being two (e.g., Birenbaum & Tatsuoka, 1993; Wang & Gierl, 2011).

Departing from common practice, Jang (2005) recruited five coders to review all the test

items of the LanguEdge Reading Comprehension Test (Form 1) and to decide primary

subskills required by each item. Additionally, there are also studies in which a single

coder is used in the coding process (e.g., Buck & Tatsuoka, 1998; Leighton, Cui, & Cor,

2009). Therefore, there is no agreed-upon way of identifying subskills, nor an agreed-upon number of

coders to specify the subskill-item relationships and code the Q-matrix.

Researchers may employ any reasonable method and any number of coders depending on

their own situation.

In addition to the involvement of coders, a number of researchers also use

examinees’ think-aloud protocols or verbal reports as an important empirical source of

information in constructing a Q-matrix (e.g., Buck, 1991, 1994; Jang, 2005; Li, 2011;

Wang & Gierl, 2007). The think-aloud protocol is a qualitative research method in which

items in a test are given to a sample of examinees in order to obtain their thinking

processes utilized in understanding, conceptualizing, reasoning, and responding to the

items (Leighton, 2004; Leighton & Gierl, 2011).

Researchers have adopted the methods outlined above in constructing their Q-

matrix. For example, Jang (2005) came up with a Q-matrix by using extensive

preliminary analyses of tasks and examinees’ performance, and qualitative analysis of

verbal protocols from both coders/raters and examinees in the NG TOEFL reading

comprehension test. Similarly, Li and Suen (2013) constructed their initial Q-matrix for

the MELAB reading assessment by employing various sources including reviewing

related literature, obtaining examinees’ think-aloud protocols, and inviting content

experts for their ratings. The initial Q-matrix was then validated using preliminary

empirical evidence.

2.4 Research Questions and Study Hypotheses

Using the CDM approach, the present research attempts to investigate the

following research questions:

(1) Is ITSS instruction equally effective in promoting students’ reading subskills

as represented in the GSRT (i.e., literal, inferential, critical, and affective

subskills)?

(2) Does the effect of ITSS instruction in improving reading subskills differ by

grade level (Grade 4 vs. Grade 5)?

(3) Does the effect of ITSS instruction in improving reading subskills differ by

gender (female vs. male)?

The set of subskills involved in the GSRT will be elaborated in the next chapter

on methodology of the study. Students’ performance on the GSRT in terms of presence

or absence of these subskills before and after ITSS instruction will be evaluated relative to a

business-as-usual control condition to address the research questions.

As mentioned in the introduction to structure strategy in Chapter 1, two reading

subskills, namely, identifying main idea and extracting explicit information from text, are

primarily emphasized by the structure strategy (i.e., the ITSS instruction in the present

study), which involve literal and inferential comprehension. In addition, since the

structure strategy has been shown to be effective for users of different age groups in increasing their

reading comprehension, no grade or gender difference in reading achievement was

expected through the ITSS instruction. Consequently, the hypotheses of the present study

are as follows.

Hypothesis 1: ITSS instruction would not be equally effective in promoting reading subskills

as represented in the GSRT; students’ literal and inferential subskills would be

significantly improved through ITSS instruction compared to their counterparts who did

not receive ITSS instruction, controlling for their prior reading subskill levels. Students

who did or did not receive ITSS instruction were expected to perform similarly on the

other two subskills—critical and affective subskills.

Hypothesis 2: The effect of ITSS instruction in improving reading subskills would not differ

between 4th- and 5th-grade students.

Hypothesis 3: The effect of ITSS instruction in improving reading subskills would not differ

between female and male students.

Chapter 3 Research Methodology

Chapter 3 describes the methodology of the current study, which consists of two

parts. The first part mainly focuses on describing the subskills involved in the GSRT and

the construction of the Q-matrix for both forms of the GSRT. The second part deals with the

investigation of examinees’ performance on the GSRT at both pretest (Form B) and

posttest (Form A). Examinees’ test response

data in pretest and posttest were analyzed based on their treatment groups using the G-

DINA model in order to examine their strengths (presence/mastery of reading subskills as

represented in the GSRT) and weaknesses (absence/non-mastery of these reading

subskills). Results were used to address the research questions and the findings were

expected to provide diagnostic feedback to the ITSS developers and educators for future

instructional purposes.

3.1 Identifying the GSRT Reading Subskills

As stated previously, the first step in conducting a cognitive diagnostic

assessment using retrofitting method is to identify and define a set of subskills because

successful performance on a test requires a series of successful implementations of these

subskills. This is the fundamental assumption of CDA that each item in a test can be

described in terms of multiple cognitive subskills that must be mastered in order for an

individual to answer each item correctly. Without this first step, the diagnostic analysis

cannot be performed.

3.1.1 Instrumentation and Measure of Reading Comprehension

The reading assessment employed in the current study is the Gray Silent Reading

Test, a standardized reading test that helps test administrators or teachers quickly and

efficiently measure and assess an individual’s silent reading comprehension abilities.

That is, the GSRT is used to test whether an individual has developed, or is developing,

the ability to read silently with proper comprehension. The GSRT consists of two parallel

forms (Form A and Form B), with each containing 13 reading passages or stories that are

developmentally sequenced. That is to say, passages gradually become longer and more

difficult from the first one to the last. Following each passage, there are five

multiple-choice questions, so each form of the GSRT consists of 65 items.

The GSRT reading test can be either group or individually administered and each

form of the test yields raw scores, grade equivalents, age equivalents, percentiles and a

Silent Reading Quotient. The GSRT is appropriate for individuals from 7 to 25

years of age. The test has been used for a variety of subgroups including males, females,

European Americans, African Americans, Hispanic Americans, Asian Americans, Native

Americans, the Learning Disabled, and so forth.

Table 3.1 is an illustration of some selected passages from Form A of the GSRT,

suggesting that the reading passages in the GSRT differ in multiple ways including

content, content structure, length, and difficulty level. Form B of the GSRT shares the

same feature.

Table 3.1 An Analysis of Sample Reading Passages in the GSRT (Form A)

Reading Passage    Topic/Theme        Content Structure    Length
Passage 1          A boy              Narration            41 words
Passage 4          Girls at a party   Narration            100 words
Passage 6          Sharks             Description          132 words
Passage 10         Folk dances        Description          157 words
Passage 12         Advertisements     Argumentation        159 words

3.1.2 Reading Subskill Identification

Like most current applications of cognitive diagnostic models, we retrofitted the

diagnostic model to the GSRT, a non-diagnostic assessment. When an assessment is

non-diagnostic but the use of the assessment is diagnostic, it is essential to first identify a

set of subskills involved in the assessment and subsequently analyze each item to figure

out the subskill-item relationship by indicating what subskills are necessary for each item

in order for examinees to provide a correct response, namely, the construction of the Q-

matrix.

As previously introduced, a variety of sources are usually utilized to specify the

set of skills involved in a task. The four subskills involved in the GSRT are specified in

the Manual (2000) and can thus be used directly; they are literal, inferential, critical, and

affective subskills. In addition, we also looked into the literature with regard to reading

subskills and all four subskills, particularly literal, inferential, critical subskills, are

commonly used reading attributes. Table 3.2 provides a summary of the definitions of

the four subskills.

Table 3.2 Subskills Specified by the GSRT Developers

Reading subskills    Definition and description
Literal              Forming a global understanding of a passage/text; locating and
                     recognizing information explicitly stated.
Inferential          Inferring information not explicitly stated by using given
                     information; linking text information to make predictions.
Critical             Analyzing, evaluating, or making reasonable judgment about the
                     text, forms of writing or author’s opinions.
Affective            Personal or emotional response to the text.

3.1.3 Coding Procedures and Q-matrix Construction

For the current study, the author along with two other graduate students (2nd year

and 4th year in Educational Psychology program at Penn State) served as the coders.

Think-aloud protocol was not used since the data had already been collected long before

they were diagnostically analyzed, making it impractical to take a sample of examinees to

examine their multidimensional cognitive processes that they were engaged in when

taking the GSRT.

Coding Procedures. The author introduced to the other two coders both the

GSRT and the four subskills for the test. The coding process was done independently for

the first round. The three coders each worked on their own to determine which subskill(s) are

required by each item and to create their own Q-matrices. Each coder came up with two

Q-matrices: one coding only a single subskill for each item and the other coding

multiple subskills. Reading through the items in the GSRT shows that some

items require only a single subskill. For example, an item asks examinees about a

detail that one passage explicitly expresses, suggesting that it measures only the literal

subskill. Other items may involve two or more subskills, but one of them is

dominant. The single-subskill Q-matrix therefore depicts only the one subskill required by each

item, which can be either the only subskill required or the dominant subskill. The

other Q-matrix, however, encompasses all of the subskills required. After the independent

first round, the coding

results were compared and any disagreement was discussed until a consensus was

achieved. The inter-rater reliability (i.e., Pearson’s correlation) for independent coding

was 0.861 for the first set of Q-matrices and 0.804 for the second set. After all

disagreements were resolved, all coders reached consensus (i.e., the inter-rater reliability

reached 1.00) that the two common Q-matrices were acceptable based on their own

understanding and standards.
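
One way such an agreement index could be computed is sketched below. The sketch assumes that each coder’s Q-matrix is flattened into a vector of 0/1 entries and that Pearson’s correlation is taken between two coders’ vectors; the study does not describe how the three coders’ results were combined, so this is an assumption, and the two small matrices are hypothetical rather than the coders’ actual data.

```python
from math import sqrt

def flatten(q_matrix):
    """Turn an items-by-subskills Q-matrix into one long list of 0/1 entries."""
    return [entry for row in q_matrix for entry in row]

def pearson(x, y):
    """Pearson's correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical codings of three items on four subskills by two coders.
coder_a = [[0, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 0]]
coder_b = [[0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]

print(round(pearson(flatten(coder_a), flatten(coder_b)), 3))
```

Identical codings give a correlation of 1.00, which corresponds to the consensus stage described above.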

Common Q-matrix. For illustration purposes, Table 3.3 displays a portion of the

first common Q-matrix for Form B of the GSRT with each item requiring only one

subskill. Table 3.4 is the other common Q-matrix with some items coded as requiring

multiple reading subskills. The complete Q-matrices for both forms of the

GSRT can be found in Appendix A and Appendix B.

Table 3.3 Q-matrix for Form B of the GSRT (single subskill)

        Subskill
Item    Literal    Inferential    Critical    Affective
9       0          1              0           0
10      0          1              0           0
11      1          0              0           0
12      1          0              0           0
13      1          0              0           0
14      0          1              0           0

Table 3.4 Q-matrix for Form B of the GSRT (multiple subskills)

        Subskill
Item    Literal    Inferential    Critical    Affective
9       0          1              0           0
10      0          1              1           0
11      1          0              0           0
12      1          0              0           0
13      1          0              0           0
14      0          1              1           0

3.2 Investigating Examinees’ Performance on the GSRT

3.2.1 Participants and Data Collection

The present study used a secondary dataset from a large randomized controlled trial

of the ITSS intervention. Participants of the study included both 4th and 5th graders in

the states of Michigan and Pennsylvania, United States, with a total number of 5,790

students. A check on the GSRT performance data found that there were missing values; a

total of 5,393 students participated in the pretest and 5,125 students participated in the

posttest. The ITSS intervention group had 3.47% (201 cases) missing on pretest and

7.13% (413 cases) missing on posttest, while the control group had 3.38% (196 cases)

missing on pretest and 4.35% (252 cases) missing on posttest. Little’s missing

completely at random (MCAR) test was conducted, which yielded a non-significant

result, χ²(261) = 1740, p = 0.93. This suggested that the missing data could be viewed as

MCAR. Therefore, listwise deletion was employed, and all cases with missing values were

dropped from the analysis. The analysis sample size was 4,728 students.
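
The missing-data counts reported above can be cross-checked with simple arithmetic. The sketch below uses only the figures given in this section; the variable and function names are hypothetical, and it assumes that the reported percentages are expressed relative to all 5,790 students, which the text does not state explicitly.

```python
TOTAL_STUDENTS = 5790
TESTED = {"pretest": 5393, "posttest": 5125}
MISSING = {
    "ITSS": {"pretest": 201, "posttest": 413},
    "control": {"pretest": 196, "posttest": 252},
}

def missing_by_occasion(occasion):
    """Sum the missing cases across both groups for one testing occasion."""
    return sum(group[occasion] for group in MISSING.values())

def percent_of_total(cases):
    """Express a case count as a percentage of all enrolled students."""
    return 100 * cases / TOTAL_STUDENTS

for occasion, tested in TESTED.items():
    # Students not tested should equal the missing cases summed over groups.
    print(occasion, TOTAL_STUDENTS - tested, missing_by_occasion(occasion))

for group, counts in MISSING.items():
    for occasion, cases in counts.items():
        print(group, occasion, f"{percent_of_total(cases):.2f}%")
```

For both occasions, the number of students without a score equals the sum of the missing cases in the two groups.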

Fourth-grade students from 131 classrooms and fifth-grade students from 128

classrooms, drawn from both rural and suburban schools, participated in the study. The

schools signed a memorandum indicating their agreement to participate in the study. The

participating classrooms were randomly assigned to experimental conditions (ITSS

intervention vs. control) within schools, with participants being approximately balanced

across both conditions. For example, 65 classrooms of 5th grade were assigned to the

ITSS intervention condition and 63 classrooms to the control condition. The students in

the intervention group used the ITSS system as a partial substitute for the standard language

arts curriculum for one class period (30-45 minutes) per week for six to seven months,

whereas those in the control group used only their regular language arts curriculum.

To prevent intervention diffusion, the teachers in charge of the intervention

classrooms were instructed not to share the ITSS software access passwords or relevant

reading materials with other teachers in the school. In addition, students in the control

condition were to be given access to ITSS in the next academic year. Form B of the GSRT

was administered to all participants before ITSS instruction. Form A of the GSRT was

administered about six months later, once ITSS instruction for the intervention group was

completed, under the same conditions as the pretest administration. Participants’ response

data of both pretest and posttest for both intervention and control groups were analyzed

for the current study.

3.2.2 Model Fit Evaluation

As with any other statistical model, the estimated parameters of a CDM are

interpretable only when the model fits the data. Evaluating how well a single model fits the

data is an absolute fit evaluation. On the other hand, with the availability of various

CDMs, it is also important to compare a model with rival models and select the most

appropriate one. This kind of model fit evaluation is referred to as relative fit

evaluation. Both absolute and relative fit indices can be obtained using the CDM

package (Robitzsch, Kiefer, George, & Uenlue, 2016) in R.

To obtain the fit statistics, we began with the first Q-matrix (i.e., unidimensional

Q-matrix) in which each item was coded as only requiring one reading subskill for

correct response with four CDMs (G-DINA, DINA, ACDM, and DINO). This Q-matrix

resulted in equivalent models with the same fit statistics for all four CDMs, which

indicated that when only one subskill is required for each item, there is no need to

distinguish compensatory or non-compensatory CDMs. We then used the second Q-

matrix (i.e., multidimensional Q-matrix) in which some items require multiple subskills

with these four CDMs. The absolute and relative model fit indices for both pretest and

posttest that were obtained from this second round of analysis are presented in Table 3.5,

Table 3.6, Table 3.7, and Table 3.8, respectively.

Table 3.5 Absolute Fit Statistics for Pretest

                             G-DINA      DINA      ACDM      DINO
maxX2                       759.998   770.650   953.824   787.778
p_maxX2                           0         0         0         0
MADcor                        0.055     0.059     0.061     0.060
SRMSR                         0.080     0.079     0.086     0.080
100*MADRESIDCOV (MADRCOV)     0.882     0.887     0.977     0.903
MADQ3                         0.035     0.057     0.038     0.052
MADaQ3                        0.036     0.058     0.039     0.053

Table 3.5 summarizes the absolute model fit statistics for the pretest of all

participants, including MADcor, SRMSR, 100*MADRESIDCOV, MADQ3, and

MADaQ3. According to Robitzsch, Kiefer, George, and Uenlue (2016), for each of these

fit statistics, “it holds that smaller values (values near to zero) indicate better fit” (p. 136).

Measures including MADcor, SRMSR, and 100*MADRESIDCOV are in fact effect

sizes of absolute model fit, which compare observed and predicted covariances of item

pairs (Ravand & Robitzsch, 2015). The smaller an effect size is, the better a model fits

the data. The G-DINA model had the smallest value on every one of these indices except

SRMSR, where the DINA model was marginally smaller (0.079 vs. 0.080). The

significance test of absolute model fit indicates that there was a significant model misfit

for all four models (p_maxX2 reported as 0, i.e., p < .001). However, the maxX2 value for

the G-DINA model was the smallest (maxX2 = 759.998), suggesting that the other three

models had worse model fit. In summary, the absolute fit statistics showed that, overall, the

G-DINA model fitted our data the best and was superior to the other models for our study.
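Because smaller values indicate better fit on each index, the comparison in Table 3.5 amounts to finding the column with the minimum in every row. The following sketch, which is illustrative and not part of the original analysis, copies the pretest values from Table 3.5 and reports the best-fitting model per index; the variable names are hypothetical.

```python
# Pretest absolute fit values copied from Table 3.5 (column order as in the table).
MODELS = ("G-DINA", "DINA", "ACDM", "DINO")
PRETEST_FIT = {
    "maxX2": (759.998, 770.650, 953.824, 787.778),
    "MADcor": (0.055, 0.059, 0.061, 0.060),
    "SRMSR": (0.080, 0.079, 0.086, 0.080),
    "100*MADRESIDCOV": (0.882, 0.887, 0.977, 0.903),
    "MADQ3": (0.035, 0.057, 0.038, 0.052),
    "MADaQ3": (0.036, 0.058, 0.039, 0.053),
}

# For each index, pick the model whose value is smallest (smaller = better fit).
best_model = {
    index: MODELS[min(range(len(MODELS)), key=values.__getitem__)]
    for index, values in PRETEST_FIT.items()
}

for index, model in best_model.items():
    print(f"{index:16s} -> {model}")
```

With the tabled values, G-DINA has the smallest value on five of the six indices; on SRMSR the DINA value (0.079) is slightly below the G-DINA value (0.080).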

Table 3.6 Relative Fit Statistics for Pretest


              G-DINA        DINA        ACDM        DINO
Loglike    -153045.0   -157153.6   -154443.3   -156911.3
Deviance    306090.0    314307.2    308886.5    313822.5
Npars          189.0       145.0       165.0       145.0
Nobs          5393.0      5393.0      5393.0      5393.0
AIC         306468.0    314597.2    309216.5    314112.5
BIC         307714.0    315553.2    310304.4    315068.5
AIC3        306657.0    314742.2    309381.5    314257.5
AICc        306481.8    314605.3    309227.0    314120.6
CAIC        307903.0    315698.2    310469.4    315213.5

Table 3.6 reports the relative fit statistics for the pretest, namely information criteria

such as AIC, BIC, AIC3, AICc (i.e., sample-size-adjusted AIC), and CAIC (consistent

AIC). On every criterion, the G-DINA model performed the best among the four

models, as it yielded the smallest value for each fit index.
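Each information criterion in Table 3.6 is the model deviance plus a penalty that grows with the number of parameters. The sketch below applies the standard textbook definitions to the first column of Table 3.6 (deviance 306090.0, 189 parameters, 5,393 observations). It is an illustration rather than the code used in the study, and it is an assumption that the software applied exactly these definitions, although they do reproduce the tabled values to within rounding.

```python
import math


def information_criteria(deviance: float, n_pars: int, n_obs: int) -> dict:
    """Standard penalized-deviance criteria; smaller values indicate better relative fit."""
    aic = deviance + 2 * n_pars
    return {
        "AIC": aic,
        "BIC": deviance + math.log(n_obs) * n_pars,
        "AIC3": deviance + 3 * n_pars,
        # Small-sample correction added to AIC.
        "AICc": aic + 2 * n_pars * (n_pars + 1) / (n_obs - n_pars - 1),
        "CAIC": deviance + (math.log(n_obs) + 1) * n_pars,
    }


# Deviance, Npars, and Nobs from the first (G-DINA) column of Table 3.6.
gdina_pretest = information_criteria(306090.0, 189, 5393)
for name, value in gdina_pretest.items():
    print(f"{name:5s} {value:.1f}")
```

The criteria differ only in how heavily they penalize model complexity, which is why a heavily parameterized model such as G-DINA can still be preferred when its deviance is sufficiently smaller.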

The absolute and relative model fit indices for posttest are presented in Table 3.7

and Table 3.8, respectively. The G-DINA model performed the best amongst all models

for all fit statistics.

Table 3.7 Absolute Fit Statistics for Posttest


                             G-DINA      DINA      ACDM      DINO
maxX2                       757.101   812.882   989.811   802.082
p_maxX2                           0         0         0         0
MADcor                        0.050     0.058     0.054     0.057
SRMSR                         0.075     0.080     0.078     0.078
100*MADRESIDCOV (MADRCOV)     0.752     0.862     0.813     0.839
MADQ3                         0.035     0.053     0.038     0.054
MADaQ3                        0.036     0.053     0.037     0.054

Table 3.8 Relative Fit Statistics for Posttest


              G-DINA        DINA        ACDM        DINO
Loglike    -147549.2   -151017.1   -148333.8   -151109.5
Deviance    295098.5    302034.2    296667.6    302219.1
Npars          201.0       145.0       173.0       145.0
Nobs          5124.0      5124.0      5124.0      5124.0
AIC         295500.5    302324.2    297009.6    302509.1
BIC         296815.4    303272.7    298128.2    303457.6
AIC3        295701.5    302469.2    297180.6    302654.1
AICc        295517.0    302332.7    297021.5    302517.6
CAIC        297016.4    303417.7    298299.2    303602.6

The absolute and relative fit statistics presented in the tables above all suggested

that the G-DINA model fit the 4th- and 5th-grade students’ test response data better than

the three competing models. Therefore, the G-DINA model with the multidimensional

Q-matrix was used to analyze students’ GSRT test performance data for further diagnostic

inferences to address the research questions.

3.3 Data Analysis

To test the three hypotheses in the study, the G-DINA model was fit to

participants’ item response data, separately by pretest and posttest, using the CDM

package in R. More details on the analysis procedure are provided as follows.

On the group level, the subskill mastery patterns of all participants were generated in

terms of their 16 (2^k, with k = 4 subskills) subskill mastery profiles and the estimated

probability of each profile, for both pretest and posttest and for both intervention and

control groups. Examinees’ 16 subskill mastery profiles were then compared between

pretest and posttest for both groups, based on the estimated probability of each subskill

mastery profile, which could yield a global picture of students’ subskill mastery status. The

subskill mastery profiles classify students into different latent classes, which provide

information about their strengths and weaknesses regarding the reading subskills. On the

individual subskill level, the probability of mastery for each of the four subskills on both

pretest and posttest was also obtained to see the change/gain in subskill mastery probability

from pretest to posttest for both intervention and control groups.

To test the three hypotheses, a factorial repeated measures (i.e., repeated on four

subskills) ANOVA (RM-ANOVA) was performed. In this model, the outcome variable

was the change/difference of mastery probability of four subskills from pretest to

posttest. Subskill was the within-subject factor and there were three between-subject

factors with each having two levels—experimental condition (1 = intervention group

using ITSS instruction; 0 = control group using regular curriculum, not ITSS instruction),

grade level (1 = 5th grade; 0 = 4th grade), and gender (1 = female; 0 = male).

The first hypothesis was that the effect of ITSS instruction was not the same on

all four subskills; its effect was stronger on literal and inferential subskills so that these

two subskills were significantly promoted through ITSS instruction while critical and

affective subskills were not. To test this hypothesis, the interaction between ITSS

instruction and subskill was examined. If the interaction was not significant at the

0.05 significance level, there would be insufficient evidence to support the claim

that the effect of ITSS instruction differed across subskills.

Two other interactions were taken into consideration and investigated as well to

test the second and third hypotheses that the effect of ITSS instruction in improving

reading subskills did not differ by grade level or gender. Specifically, the interaction

between subskill and grade level and the interaction between subskill and gender were

examined to test these two hypotheses, respectively. The hypotheses would be supported

if neither of the interactions was significant.

Chapter 4 Results

Chapter 4 reports the findings regarding 4th- and 5th-grade students’ performance

on the GSRT on both pretest and posttest in terms of the students’ strengths and

weaknesses in the four reading subskills and the effect of ITSS instruction.

4.1 On Group Level

To begin with, we analyzed the class probabilities or estimated occurrence

proportions of each subskill mastery profile for all students in both pretest and posttest

and intervention and control groups when the G-DINA model was applied. With four

subskills involved in the GSRT, the total number of subskill mastery profiles or latent

classes was 16 (i.e., 2^4).
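To make the count of latent classes concrete, the sketch below enumerates every mastery profile for four binary subskills. It is an illustration only. The position order (literal, inferential, critical, affective) follows the way profiles such as [1011] are read in this chapter; the numbering of classes in binary counting order is an assumption that happens to place [1111] at Class 16, and the class ordering produced by the estimation software may differ.

```python
from itertools import product

# Position of each subskill inside a profile such as [1011] (1 = mastered).
SUBSKILLS = ("literal", "inferential", "critical", "affective")

# All 0/1 vectors of length four, in binary counting order (assumed ordering).
profiles = ["".join(str(bit) for bit in bits)
            for bits in product((0, 1), repeat=len(SUBSKILLS))]

for class_id, profile in enumerate(profiles, start=1):
    mastered = [name for name, bit in zip(SUBSKILLS, profile) if bit == "1"]
    print(f"Class {class_id:2d} [{profile}] masters: {', '.join(mastered) or 'none'}")
```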

The probabilities of each subskill mastery profile for pretest and posttest and

intervention and control groups based on the G-DINA model are presented in Figure 4.1.

Figure 4.1 Estimated probability of each subskill mastery profile for pretest and posttest

Figure 4.1 presents examinees’ 16 subskill mastery profiles between pretest and

posttest for both intervention and control groups, which is diagnostic information

pertinent to students’ strengths and weaknesses in the four reading subskills. In CDM

applications in retrofitting contexts, it is usually the case that examinees’ flat skill

mastery profiles (either mastery of no subskills or mastery of all subskills) are

predominant. Figure 4.1 shows that many students belong to Class 16, that is, masters of all

four subskills ([1111]). With regard to pretest, it can be seen that the performance of

the control group and the intervention group was quite similar in terms of their subskill mastery

profiles, indicating that the reading ability of the two groups was originally at a similar

level before ITSS intervention. The proportion of students who were masters of none of

the four subskills (see subskill mastery profile [0000]) decreased from pretest to posttest

for the two experimental groups. In addition, the proportion of students who were

masters of all four subskills (see subskill mastery profile [1111]) increased from pretest

to posttest. According to Liu et al. (2017), it should be expected in most retrofitting

contexts that the majority of examinees are classified either as non-masters on each

subskill ([0000]) or masters on all subskills ([1111]).

The above pattern of profiles also suggests that students who were non-masters of

inferential comprehension but masters of all other three subskills (see the subskill

mastery profile of [1011]) decreased sharply from pretest to posttest for both

experimental groups. Apart from random measurement error, this may be because

the inferential subskill grew so rapidly from pretest to posttest that fewer

students were non-masters of this subskill at posttest. This will be further discussed in

the subsequent section about the growth rate of the four reading subskills.

4.2 On Individual Subskill Level

We then looked into students’ mastery levels of each individual subskill by

examining their subskill mastery profile. Table 4.1 shows the subskill probability for

each reading subskill in pretest and posttest by experimental condition.

Table 4.1 Subskill Mastery Probability in Pretest and Posttest

                     Pretest                   Posttest
Subskills      Control  Intervention     Control  Intervention
Literal         0.6408     0.6512         0.6554     0.6876
Inferential     0.4906     0.4784         0.6051     0.6160
Critical        0.6533     0.6610         0.6696     0.7064
Affective       0.6280     0.6362         0.6904     0.7247

Table 4.1 indicates that among the four reading subskills, critical subskill, a

higher-order reading subskill as deemed by many researchers, was the subskill best

mastered in pretest for both groups. This finding was in line with some researchers’

views introduced in Chapter 2 that lower-level reading subskills are not prerequisites for

higher-level reading subskills. It is therefore plausible that readers, such as

elementary school students, master some higher-order reading subskills before they

master the lower-level subskills.

Figure 4.2 presents the change/difference of subskill mastery probability from

pretest to posttest for both groups as well as the difference between intervention and

control groups at both pretest and posttest.

Figure 4.2 Difference of probability of mastery of each subskill at pretest and posttest by

intervention condition

Figure 4.2 shows that control and intervention groups performed similarly in

pretest with regard to the probability of mastery of all four reading subskills, which was

what we expected. In pretest, there was no significant difference in the estimated

probability of mastery between control and intervention conditions of literal subskill;

t(5382.3) = -0.092, p = 0.927. Likewise, there was no significant difference for

inferential, critical, and affective subskills between the two treatment groups at pretest;

t(5379.4) = 1.368, p = 0.171; t(5383.5) = -0.078, p = 0.937; and t(5386.1) = -0.489, p =

0.624, respectively. Figure 4.2 also shows a larger difference in the probability of

mastery of all four subskills, the inferential subskill in particular, between pretest and posttest

for students in the intervention condition than for students in the control condition, which

was consistent with the research findings in the literature that the structure strategy is

effective in improving readers’ reading comprehension ability.
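The pretest-to-posttest changes described above can be recomputed directly from the group-level mastery probabilities in Table 4.1. The sketch below is illustrative and uses hypothetical helper names; it works only with the tabled group means, so it does not reproduce the individual-level data behind Figure 4.2 or the significance tests.

```python
# Mastery probabilities copied from Table 4.1 as (pretest, posttest) pairs.
MASTERY = {
    "literal":     {"control": (0.6408, 0.6554), "intervention": (0.6512, 0.6876)},
    "inferential": {"control": (0.4906, 0.6051), "intervention": (0.4784, 0.6160)},
    "critical":    {"control": (0.6533, 0.6696), "intervention": (0.6610, 0.7064)},
    "affective":   {"control": (0.6280, 0.6904), "intervention": (0.6362, 0.7247)},
}


def gain(subskill: str, group: str) -> float:
    """Posttest minus pretest mastery probability, rounded to four decimals."""
    pre, post = MASTERY[subskill][group]
    return round(post - pre, 4)


def largest_gain(group: str) -> str:
    """Subskill with the largest pretest-to-posttest gain in a group."""
    return max(MASTERY, key=lambda subskill: gain(subskill, group))


for subskill in MASTERY:
    print(f"{subskill:12s} control {gain(subskill, 'control'):+.4f}   "
          f"intervention {gain(subskill, 'intervention'):+.4f}")
```

On these group means, the gain is larger in the intervention group than in the control group for every subskill, and the inferential subskill shows the largest gain in both groups.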

A four-way RM-ANOVA was subsequently conducted to test the three research

hypotheses. Prior to the analysis, the missing data pattern was explored, and the missing

values were found to be distributed completely at random across all observations.

One of the assumptions for RM-ANOVA is that of sphericity, which refers to the

equality of the variances of the differences between levels of the repeated measures factor. To

test whether the sphericity assumption was satisfied, Mauchly’s Test of Sphericity was used.

The test yielded a highly significant result with W = 0.051, χ²(5) = 18453.497, p < 0.001,

which suggested that the sphericity assumption was not met. In this situation, the

corrections, most notably the Greenhouse-Geisser and Huynh-Feldt epsilon corrections,

were used instead. The corresponding corrective coefficients were: Greenhouse-Geisser

ε = 0.417 and Huynh-Feldt ε = 0.418. Table 4.2 summarizes the results of the four-way

RM-ANOVA, depicting the RM-ANOVA results for within-subjects and between-

subjects effects.
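The degrees of freedom involved in the sphericity test and its correction follow from the number of repeated-measures levels (k = 4 subskills). As a rough illustration, not the software output itself, the sketch below uses the standard relations: Mauchly's test has k(k - 1)/2 - 1 degrees of freedom, and an epsilon correction multiplies the uncorrected within-subject degrees of freedom (k - 1) by ε. The function names are hypothetical.

```python
def mauchly_df(k: int) -> int:
    """Degrees of freedom of Mauchly's chi-square test for k repeated-measures levels."""
    return k * (k - 1) // 2 - 1


def corrected_df(k: int, epsilon: float) -> float:
    """Within-subject numerator df after an epsilon (e.g., Greenhouse-Geisser) correction."""
    return epsilon * (k - 1)


def epsilon_lower_bound(k: int) -> float:
    """Smallest possible epsilon, reached under maximal violation of sphericity."""
    return 1 / (k - 1)


K = 4  # four reading subskills
print(mauchly_df(K))                      # df of the chi-square statistic
print(round(corrected_df(K, 0.417), 3))   # Greenhouse-Geisser corrected df
print(round(epsilon_lower_bound(K), 3))   # lower bound for epsilon
```

With ε = 0.417 the corrected degrees of freedom equal 1.251, the value that appears in the df column for the within-subject effects, and ε lies fairly close to its lower bound of 1/3, in line with the strong violation indicated by Mauchly's test.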

Table 4.2 Repeated Measures Analysis of Variance for Within-Subjects and Between-

Subjects Effects

                                                          p                    p
Source                        df         MS         F     (Greenhouse-Geisser) (Huynh-Feldt)

Within-subject
Subskill                     1.251     33.074   129.965   .000                 .000
Subskill*Female              1.251       .767     3.014   .073                 .073
Subskill*Grade               1.251       .357     1.405   .242                 .242
Subskill*ITSS                1.251       .028      .110   .796                 .796
Subskill*Female*Grade        1.251       .208      .819   .391                 .391
Subskill*Female*ITSS         1.251       .132      .517   .512                 .512
Subskill*Grade*ITSS          1.251       .535     2.104   .142                 .142
Subskill*Female*Grade*ITSS   1.251       .205      .807   .395                 .395
Error                     7767.403       .254

Between-subject
Source                        df         MS         F     p
Intercept                    1        102.086   213.986   .000***
Female                       1           .005      .010   .920
Grade                        1           .960     2.011   .156
ITSS                         1          3.905     8.186   .004**
Female*Grade                 1           .064      .134   .714
Female*ITSS                  1           .403      .845   .358
Grade*ITSS                   1          1.055     2.211   .137
Female*Grade*ITSS            1           .005      .010   .921
Error                     2961.643       .477

Note: Female coded as 1 = Female and 0 = Male; Grade coded as 1 = Grade 5 and 0 =

Grade 4; ITSS coded as 1 = ITSS intervention (i.e., intervention group) and 0 = no ITSS

intervention (i.e., control group).

*p < .05. **p < .01. ***p < .001.

Table 4.2 indicates that the main effect of ITSS instruction was statistically

significant at the 0.05 significance level (p = 0.004), suggesting that ITSS instruction was

effective in promoting students’ reading subskills. This finding was consistent with the

findings in the literature that ITSS instruction can promote reading comprehension.

Hypothesis 1. Table 4.2 shows that the interaction between subskill and ITSS

instruction was not significant at the 0.05 significance level (p = 0.796), suggesting that

there was insufficient evidence in the data that ITSS instruction was

differentially effective on different reading subskills. Therefore, the first hypothesis, that

the effect of ITSS instruction differed across subskills and was greater for the literal and

inferential subskills, was not supported.

Hypotheses 2 and 3. The second and third hypotheses were that the effect of

ITSS instruction in improving reading subskills did not vary by grade level or by gender.

Table 4.2 indicates that the interaction between subskill and grade level was not

significant at the 0.05 significance level (p = 0.242), indicating that there was

insufficient evidence in our data to suggest that the effect of ITSS instruction

differed between Grade 4 and Grade 5. Likewise, the same table shows that the interaction

between subskill and gender was not significant at the 0.05 significance level (p = 0.073)

either, suggesting that there was not enough evidence in our data that the effect

of ITSS instruction differed between female and male students.

Consequently, both Hypothesis 2 and Hypothesis 3 were supported by the data collected.

In addition, the main effect of subskill was significant at the 0.05 significance

level (p < 0.001), so we followed this effect up with post hoc tests. The results are

presented in Table 4.3.

Table 4.3 Mean Difference t-test

                                           95% Simultaneous
                                           confidence interval
Comparison   Mean difference   p           Lower     Upper
L - I             -.102        .000***     -.122     -.082
L - C             -.006        .209        -.015      .002
L - A             -.051        .002**      -.057     -.045
I - C              .095        .000***      .073      .117
I - A              .051        .000***      .030      .071
C - A             -.045        .000***     -.049     -.040

Note: L = Literal; I = Inferential; C = Critical; A = Affective.

*p < .05. **p < .01. ***p < .001.

A Bonferroni correction has been applied for multiple comparisons.
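With four subskills there are six pairwise comparisons, which is why a multiplicity correction is needed. The sketch below is illustrative only: it lists the comparisons, computes the per-comparison threshold that a Bonferroni correction implies at an overall level of 0.05, and checks that the mean differences in Table 4.3 are internally consistent (for example, I - C should equal (L - C) - (L - I) up to rounding). Whether the software adjusted the p-values or the threshold is not stated in the text, so the threshold form used here is an assumption.

```python
from itertools import combinations

LABELS = ("L", "I", "C", "A")  # Literal, Inferential, Critical, Affective
PAIRS = list(combinations(LABELS, 2))

ALPHA = 0.05
# Bonferroni: divide the overall alpha by the number of comparisons.
alpha_per_comparison = ALPHA / len(PAIRS)

# Mean differences in pretest-to-posttest gain, copied from Table 4.3.
REPORTED = {
    ("L", "I"): -0.102, ("L", "C"): -0.006, ("L", "A"): -0.051,
    ("I", "C"): 0.095, ("I", "A"): 0.051, ("C", "A"): -0.045,
}


def implied(a: str, b: str) -> float:
    """Difference a - b implied by the two comparisons against L: (L - b) - (L - a)."""
    return REPORTED[("L", b)] - REPORTED[("L", a)]


print(PAIRS)
print(round(alpha_per_comparison, 5))
for a, b in [("I", "C"), ("I", "A"), ("C", "A")]:
    print(f"{a} - {b}: reported {REPORTED[(a, b)]:+.3f}, implied {implied(a, b):+.3f}")
```

The implied and reported differences agree to within 0.001, which is the precision of the table.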

Table 4.3 presents the specific mean difference between a pair of subskills with

regard to the difference of subskill mastery probability from pretest to posttest, which

represents the growth rate of subskills between pretest and posttest. It follows that the

growth rate or magnitude of growth was not the same across the four reading subskills.

The largest mean difference was observed between literal and inferential subskills, and

the smallest significant difference was between critical and affective subskills. Both

Table 4.3 and Figure 4.2 suggest that the growth rate for the four subskills was different,

with inferential subskill having the greatest growth rate from pretest to posttest. The

greatest growth rate for inferential subskill may be accounted for by the following

reasons.

First, there exist differences in the developmental trajectories of reading subskills

(e.g., Leppanen, Niemi, Aunola, & Nurmi, 2004; Paris, 2005), which have often been

neglected in theories about reading (Paris, 2005). Research in the literature has shown

that reading skills neither develop in a linear fashion nor grow at the same rate. Paris

(2005) postulated that reading skills involved both constrained and unconstrained skills,

with the former being composed of “alphabetic knowledge, phonemic awareness, and

oral reading fluency” (p. 187). Constrained skills develop from non-existent to high in

childhood at a rapid pace, and they are “constrained developmentally, conceptually, and

by measurement” (p. 192). Take alphabetic knowledge, for example: children may have

mastered the 26 letters at a very young age. In contrast, unconstrained

reading skills, such as the four reading skills specified in the present study, continue to

develop across a lifetime, and they develop at varying rates.

Second, with varying growth rate for different reading subskills, the inferential

subskill showed the greatest growth rate from pretest to posttest for both intervention

group and control group. This might be because teachers in these schools

emphasized this subskill more than other skills. Hansen (1981) provided a strategy for

improving inferential comprehension subskill. According to the author, reading

comprehension skills needed to be taught in school, which is also the actual practice in

U.S. elementary schools. The author further pointed out that reading teachers differed

in their emphasis when teaching reading skills, such that some teachers emphasized literal

skill whereas others emphasized drawing inferences. It was possible that the large

difference in inferential subskill from pretest to posttest arose because teachers emphasized

instruction on the inferential subskill.

Third, the varying growth rate of the four reading skills may also be caused by

differences between the two forms (i.e., Form A and Form B) of the GSRT, even

though they are supposed to be parallel test forms. In addition, practice in taking the test

might also contribute to the difference in growth rate.

In summary, the first hypothesis was partially supported in that all four subskills,

not only literal and inferential subskills as hypothesized, have been significantly

improved through ITSS instruction. With regard to the growth rate of the four subskills

from pretest to posttest, however, it was not the same across the four subskills, with

inferential subskill having the greatest growth rate. The reasons for the difference in

growth rate were also discussed from multiple perspectives. The second and third

hypotheses that the effect of ITSS instruction was the same between 4th and 5th graders

and between female and male students were both supported by our data.

Chapter 5 Discussion, Implications, Limitations, and Future Research

This chapter discusses the overall findings of the present study and its

implications for reading instructional designs and learning and cognitive diagnostic

assessment of reading tests. In addition, limitations of the study and future research

directions are also discussed.

5.1 Discussion of the Overall Findings

The present study evaluated the effect of ITSS instruction on 4th and 5th grade

students’ reading ability in terms of reading subskills as represented in the standardized

reading test, Gray Silent Reading Test, by applying a cognitive diagnostic modeling

approach. Four reading subskills are involved in the GSRT as specified by the test

developers: literal, inferential, critical, and affective subskills. The G-DINA model, a

generalized cognitive diagnostic model, was applied because it fitted the students’ GSRT

response data the best from both absolute and relative fit evaluations.

The findings yielded clear evidence that all four subskills of 4th and 5th grade

students (literal, inferential, critical, and affective) were significantly improved

through ITSS instruction compared to those of students who used the regular reading

instruction curriculum. That critical and affective subskills were also significantly

improved through the ITSS instruction was not expected as the structure strategy is

primarily focused on literal and inferential reading subskills.

On the other hand, this finding can be sensibly explained by the strong

correlations between the reading subskills. Many studies in reading have demonstrated

the strong correlations between reading subskills, for instance, between low-level skills

and higher-level skills (e.g., Landi, 2010; Landi & Perfetti, 2007; Shankweiler, 1989).

Liu and his colleagues claimed that attributes or subskills obtained from retrofitting an

assessment not originally developed for diagnostic purpose are expected to be highly

correlated because the test is usually unidimensional when designed under CTT or

conventional IRT framework (Liu, Huggins-Manley, & Bulut, 2017). When the

correlations are close to 1, it follows that examinees are classified into

either the all-mastery or the non-mastery profile, namely, the flat skill mastery profile introduced

in Chapter 2. As a result, when reading subskills are closely related to one another, it is

highly likely that critical and affective subskills are also improved given that literal and

inferential subskills are enhanced through the ITSS instruction.

Although all four reading subskills were significantly improved through ITSS

instruction, their growth rate was not the same over time from pretest to posttest, with

inferential subskill having the greatest growth. No significant difference was found

between that of literal and critical subskills but there was significant difference between

all other pairs of subskills.

Furthermore, for the four significantly improved reading subskills, no significant

evidence was obtained to show that ITSS instruction was not equally effective in

promoting them.

In terms of the effect of ITSS instruction by grade level and by gender, our data

supported the hypotheses that the effect in improving reading subskills did not differ

between 4th- and 5th-grade students or between female and male students.

In summary, our findings demonstrated that ITSS instruction was effective in

significantly improving all four reading subskills as represented in the GSRT, which

agreed with the findings in literature that ITSS instruction significantly promotes

learners’ reading comprehension achievement. The present research adds to the

literature by providing more fine-grained information as to what specific reading skills

have been improved through ITSS instruction and by showing that the effect of ITSS

instruction was the same for our sample between 4th and 5th graders and between female

and male students.

5.2 Implications for Reading Instruction and CDA of Reading

Under both CTT and traditional unidimensional IRT frameworks, learners’

reading comprehension performance is represented as a single score. Consequently,

when a reading test is taken, none of the policy-makers, teachers, or students themselves

are clear about the students’ strengths or weaknesses in terms of reading subskills. With

cognitive diagnostic modeling analyses, however, students are classified into

multidimensional reading subskill profiles. Accordingly, the teachers are aware of

whether an individual is a master or non-master of a specific subskill with the fine-

grained feedback provided by the CDM analyses results.

The findings of the present study have several implications for future reading

instructional designs or reading intervention. First, the structure strategy can be provided

to students as a supplement to the regular standard language arts curriculum to enhance

students’ reading ability, since it has consistently been shown to be an effective reading

strategy. Second, through the CDM assessment performed in this study, students’

strengths and weaknesses in their reading subskills were identified and represented as

different latent classes, which teachers can use to remediate students’ weaknesses

by providing additional reading materials or even personalized reading intervention

programs. Students who are weak in certain subskills can focus more on their

weaknesses and take action accordingly.

In addition, the present study also has important implications for cognitive

diagnostic assessment of reading comprehension. The present study used the G-DINA

model in addressing the research questions. In the literature, however, the G-DINA

model has not been used much for the cognitive diagnostic assessment of reading tests.

Instead, the Fusion model and the DINA model have been much more commonly used. In the

present study, the G-DINA model proved to fit our data better than the DINA model

in the model fit evaluation procedure. The successful application of the G-DINA

model in the current study provides evidence that this cognitive diagnostic model can be

a good option to be used for reading assessments in extracting readers’ mastery and non-

mastery reading skill profiles.

5.3 Limitations and Future Research

The present study used a cognitive diagnostic modeling approach with retrofitting

method to test the research questions about which subskill(s) as represented in the GSRT

would be significantly improved through ITSS instruction, whether the effect was the

same across subskills, and whether the effect was the same by grade level and by gender.

The study has a couple of limitations. First of all, the primary limitation lies in

the retrofitting approach utilized in the study. Popular as retrofitting is, Liu, Huggins-

Manley, and Bulut (2017) pointed out the problems and challenges that retrofitting

approach possesses in terms of assessment design and statistical quality due to the fact

that cognitive diagnostic models are applied in non-diagnostic tests. Therefore, the CDA

approach should be used with caution. Future research using the CDA

approach should focus more on creating a test that is designed by developing

diagnostic items that enable researchers to infer the skill processes underlying the item

responses rather than the retrofitting method, namely, the first approach to performing

CDA analyses as introduced in Chapter 2.

Another limitation is that examinees’ think-aloud protocol was not involved in the

present study as a source for creating the Q-matrix since it was impractical to recruit a

sample of examinees at the time when this study was conducted. As noted earlier,

experts’ coding might be different from the approach that examinees actually use in

practice because they may approach questions very differently. Therefore, the

examinees’ think-aloud protocol may be a very valuable source in developing and

validating the Q-matrix, which is critical in conducting any cognitive diagnostic

modeling study. Should conditions allow it, future studies applying the CDM approach should

take this perspective into consideration.

References

Alderson, J.C. (1990). Testing reading comprehension skills (Part One). Reading in a

Foreign Language, 6 (2), 425-438.

Alderson, J.C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as

embodied in test questions. Reading in a Foreign Language, 5, 253-270.

Alfassi, M. (2004). Reading to learn: Effects of combined strategy instruction on high

school students. Journal of Educational Research, 97, 171-184.

Anderson, R. C., & Pearson, P. D. (1984). A schema-theoretic view of basic processes in

reading. In P. D. Pearson (Ed.), Handbook of reading research (pp. 255-292). New

York: Longman.

Bloom, B. S. (1956). Taxonomy of educational objectives: The classification of

educational goals. New York: Longmans, Green.

Carr, T. H., & Levy, B. A. (Eds.). (1990). Reading and its development: Component skills

approaches. San Diego: Academic Press.

Carrell, L. P. (1985). Facilitating ESL Reading by Teaching Text Structure. TESOL

Quarterly, 19(4), 727-752.

Chen, H., & Chen, J. (2015). Exploring reading comprehension skill relationships

through the G-DINA model. Educational Psychology, 1-20.

Chen, J. S., & de la Torre, J. (2014). A Procedure for Diagnostically Modeling Extant

Large-Scale Assessment Data: The Case of the Programme for International

Student Assessment in Reading. Psychology, 5, 1967-1978.

Davis, F. (1944). Fundamental factors of comprehension in reading. Psychometrika, 9,

185-197.

de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179-

199.

Englert, C. S., & Thomas, C. C. (1987). Sensitivity to text structure in reading and

writing: A comparison between learning disabled and non-learning disabled

students. Learning Disability Quarterly, 10(2), 93-105.

Gao, L. (2007). Cognitive-psychometric modeling of the MELAB reading items

(Unpublished doctoral dissertation). University of Alberta, Edmonton, Alberta.

George, A. C., Robitzsch, A., Kiefer, T., Groß, J., & Ünlü, A. (2016). The R package

CDM for cognitive diagnosis models. Journal of Statistical Software, 74(2), 1-24.

Gierl, M. J., Cui, Y., & Zhou, J. (2009). Reliability and attribute-based scoring in

cognitive diagnostic assessment. Journal of Educational Measurement, 46, 293-

313.

Gierl, M. J., Leighton, J. P., Wang, C., Zhou, J., Gokiert, R., & Tan, A. (2009).

Validating cognitive models of task performance in algebra on the SAT®. New

York: The College Board.

Gierl, M. J., Roberts, M., Alves, C., & Gotzmann, A. (2009). Using judgments from

content specialists to develop cognitive models for diagnostic assessments. Paper

presented at the annual meeting of National Council on Measurement in Education,

San Diego, CA.

Gray, W. S. (1960). The major aspects of reading. In H. Robinson (Ed.), Sequential

development of reading abilities (Vol. 90, pp. 8-24). Chicago: Chicago University

Press.

Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of

achievement items. Journal of Educational Measurement, 26, 333-352.

Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive

abilities: Blending theory with practicality (Unpublished doctoral dissertation).

University of Illinois at Urbana-Champaign, Urbana-Champaign, IL.

Hartz, S., & Roussos, L. (2008). The Fusion model for skills diagnosis: Blending theory

with practice (ETS research report, No. RR-08-71). Princeton, NJ: Educational

Testing Service.

Hansen, J. (1981). An inferential comprehension strategy for use with primary grade

children. The Reading Teacher, 34(6), 665-669.

Hou, L., de la Torre, J., & Nandakumar, R. (2014). Differential item functioning

assessment in cognitive diagnostic modeling: Application of the Wald test to

investigate DIF in the DINA model. Journal of Educational Measurement, 51(1),

98-125.

Hubner, A. (2010). An Overview of Recent Development in Cognitive Diagnostic

Computer Adaptive Assessments. Practical Assessment, Research & Evaluation,

15(3), 1-7.

Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In

J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education:

Theory and applications (pp. 19-60). Cambridge, UK: Cambridge University Press.

Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on teaching

and learning in the context of NG-TOEFL (Unpublished doctoral dissertation).

University of Illinois at Urbana-Champaign, Urbana-Champaign, IL.

Jang, E. E. (2008). A Framework for Cognitive Diagnostic Assessment. Paper presented

at conference held at Iowa State University on September 21 and 22 of 2007, Ames,

IA.

Jang, E. E. (2009a). Demystifying a Q-matrix for making diagnostic inferences about L2

reading skills, Language Assessment Quarterly, 6(3), 210-238.

Jang, E. E. (2009b). Cognitive diagnostic assessment of L2 reading comprehension

ability: Validity arguments for applying Fusion Model to LanguEdge assessment.

Language Testing, 26(1), 31-73.

Jang, E. E., Dunlop, M., Wagner, M., Kim, Y., & Gu, Z. (2013). Elementary school

ELLs’ reading skill profiles using cognitive diagnosis modeling: Roles of length of

residence and home language environment. Language Learning, 63(3), 400-436.

Johnson-Glenberg, M. C. (2000). Training reading comprehension in adequate

decoders/poor comprehenders: Verbal versus visual strategies. Journal of

Educational Psychology, 92, 772-782.

Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions,

and connections with nonparametric item response theory. Applied Psychological

Measurement, 25(3), 258-272.

Kim, A. Y. (2011). Examining second language reading components in relation to

reading test performance for diagnostic purposes: A fusion model approach.

(Unpublished doctoral dissertation). Columbia University, New York, NY.

Kim, H. S. (2011). Diagnosing Examinees’ Attributes-mastery Using the Bayesian

Inference for Binomial Proportion: a New Method for Cognitive Diagnostic

Assessment. (Unpublished doctoral dissertation). Georgia Institute of Technology,

Atlanta, GA.

Kintsch, W., & Kintsch, E. (2005). Comprehension. In S. G. Paris & S. A. Stahl (Eds.),

Children’s reading comprehension and assessment (pp. 71-92). Mahwah, NJ:

Erlbaum.

Landi, N. (2010). An examination of the relationship between reading comprehension,

higher-level and lower-level reading sub-skills in adults. Reading and Writing,

23(6), 701-717.

Landi, N., & Perfetti, C. A. (2007). An electrophysiological investigation of semantic and

phonological processing in skilled and less skilled comprehenders. Brain and

Language, 102, 30-45.

Lee, Y-W., & Sawaki, Y. (2009). Cognitive diagnosis approaches to language

assessment: An overview. Language Assessment Quarterly, 6(3), 172-

189.

Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used

in educational measurement to make inferences about examinees’ thinking

processes. Educational Measurement: Issues and Practice, 26(2), 3-16.

Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for

cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of

Educational Measurement, 41(3), 205-237.

Leppanen, U., Niemi, P., Aunola, K., & Nurmi, J.-E. (2004). Development of reading

skills among preschool and primary school pupils. Reading Research Quarterly,

39(1), 72-93.

Li, H. (2011). Evaluating language group differences in the subskills of reading using a

cognitive diagnostic modeling and differential skill functioning approach

(Unpublished doctoral dissertation). Pennsylvania State University, State College,

PA.

Li, H., Hunter, C. V., & Lei, P.-W. (2015). The selection of cognitive diagnostic models

for a reading comprehension test. Language Testing, 1(19), 1-19.

Li, H., & Suen, H. K. (2013). Detecting native language group differences at the subskills

level of reading: A differential skill functioning approach. Language Testing, 30(2),

273-298.

Lissitz, B., Jiao, H., Li, M., Lee, D. Y., & Kang, Y. (2014). Cognitive Diagnostic Models:

Executive Report for the Maryland State Department of Education by the MARC

Team. Retrieved from

http://marces.org/current/ExecutiveReport_MARC_2014_Cognitive%20Diagnostic%20Models.pdf.

Liu, R., Huggins-Manley, A. C., & Bulut, O. (2017). Retrofitting Diagnostic

Classification Models to Responses from IRT-Based Assessment Forms.

Educational and Psychological Measurement, 27, 1-27.

Lunzer, E., Waite, M., & Dolan, T. (1979). Comprehension and comprehension tests. In

E. Lunzer & K. Gardner (Eds.), The effective use of reading (pp. 37-71). London:

Heinemann Educational Books.

Ma, X., & Meng, Y. (2014). Towards Personalized English Learning Diagnosis:

Cognitive Diagnostic Modelling for EFL Listening. Asian Journal of Education and

e-Learning, 2(5), 336-348.

Maris, E. (1995). Psychometric latent response models. Psychometrika, 60(4), 523-547.

Matthews, M. (1990). Skill taxonomies and problems for the testing of reading. Reading

in a Foreign Language, 7(1), 511-517.

McNamara, D. S. (2007). Reading comprehension strategies: Theories, interventions,

and technologies. New York: Lawrence Erlbaum Associates.

Meyer, B. J. F. (1975). The organization of prose and its effects on memory. Amsterdam:

North Holland.

Meyer, B. J. F., Brandt, D. M., & Bluth, G. J. (1980). Use of the top-level structure in

text: Key for reading comprehension of ninth-grade students. Reading Research

Quarterly, 16, 72-103.

Meyer, B. J. F., Middlemiss, W., Theodorou, E., Brezinski, K. L., McDougall, J., &

Bartlett, B. J. (2002). Effects of structure strategy instruction delivered to fifth-

grade children using the internet with and without the aid of older adult tutors.

Journal of Educational Psychology, 94(3), 486-519.

Meyer, B. J. F., & Poon, L. W. (2001). Effects of structure strategy training and signaling

on recall of text. Journal of Educational Psychology, 93, 141-159.

Meyer, B. J. F., & Rice, G. E. (1989). Prose processing in adulthood: The text, the reader,

and the task. In L. W. Poon, D. C. Rubin, & B. A. Wilson (Eds.), Everyday

cognition in adulthood and later life (pp. 157-194). New York, NY: Cambridge

University Press.

Meyer, B. J. F., & Wijekumar, K. (2007). A web-based tutoring system for the structure

strategy: Theoretical background, design, and findings. In D. S. McNamara (Ed.),

Reading comprehension strategies: Theories, interventions, and technologies (pp.

347-375). Mahwah, NJ: Lawrence Erlbaum Associates.

Meyer, B. J. F., Wijekumar, K., & Lin, Y. (2011). Individualizing a web-based structure

strategy intervention for fifth graders’ comprehension of nonfiction. Journal of

Educational Psychology, 103(1), 140-168.

Meyer, B. J. F., Wijekumar, K., Middlemiss, W., Higley, K., Lei, P., Meier, C., &

Spielvogel, J. (2010). Web-based tutoring of the structure strategy with or without

elaborated feedback or choice for fifth- and seventh-grade readers. Reading

Research Quarterly, 45(1), 62-92.

Meyer, B. J. F., Young, C. J., & Bartlett, B. J. (1989). Memory improved: Reading and

memory enhancement across the life span through strategic text structures.

Hillsdale, NJ: Lawrence Erlbaum.

Munby, J. (1978). Communicative syllabus design. Cambridge, UK: Cambridge

University Press.

National Assessment of Educational Progress (NAEP). (2007). Available online at

http://nationsreportcard.gov/reading_2007/.

National Research Council. (2001). Knowing what students know: The science and

design of educational assessment. Washington, DC: National Academies Press.

Paris, S. G. (2005). Reinterpreting the development of reading skills. Reading Research

Quarterly, 40(2), 184-202.

RAND Reading Study Group. (2002). Reading for understanding: Toward an R&D

program in reading comprehension. Washington, DC: RAND Education.

Ravand, H., Barati, H., & Widhiarso, W. (2013). Exploring Diagnostic Capacity of a

High Stakes Reading Comprehension Test: A Pedagogical Demonstration. Iranian

Journal of Language Testing, 3(1).

Ravand, H., & Robitzsch, A. (2015). Cognitive Diagnostic Modeling Using R. Practical

Assessment, Research & Evaluation, 20(11), 1-12.

Robitzsch, A., Kiefer, T., George, A. C., & Uenlue, A. (2016). CDM: Cognitive

Diagnosis Modeling. R package version 5.1-0. https://CRAN.R-project.org/package=CDM

Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification

models: A comprehensive review of the current state-of-the-art. Measurement:

Interdisciplinary Research and Perspectives, 6(4), 219-262.

Rupp, A. A., Templin, J. L., & Henson, R. A. (2010). Diagnostic assessment: Theory,

methods, and applications. New York: Guilford Press.

Sawaki, Y., Kim, H. J., & Gentile, C. (2009). Q-Matrix construction: Defining the link

between constructs and test items in large-scale reading and listening

comprehension assessments. Language Assessment Quarterly, 6(3), 190-209.

Scarborough, H. S. (2001). Connecting early language and later reading (dis)abilities:

Evidence, theory, and practice. In S. B. Neuman & D. K. Dickinson (Eds.),

Handbook of early literacy research (pp. 97-110). New York: The Guilford Press.

Shankweiler, D. (1989). How problems of comprehension are related to difficulties in

decoding. In D. Shankweiler & I. Y. Liberman (Eds.), Phonology and reading

disability: Solving the reading puzzle (pp. 35-68). Ann Arbor, MI: The University

of Michigan Press.

Sung, Y., Chang, K., & Huang, J. (2008). Improving children’s reading comprehension

and use of strategies through computer-based strategy training. Computers in

Human Behavior, 24, 1552-1571.

Tatsuoka, K. K., & Tatsuoka, M. M. (1982). Detection of aberrant response patterns and

their effect on dimensionality. Journal of Educational Statistics, 7, 215-231.

Templin, J. L., & Bradshaw, L. (2013). Measuring the reliability of diagnostic

classification model examinee estimates. Journal of Classification, 30(2), 251-275.

Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using

cognitive diagnosis models. Psychological Methods, 11(3), 287-305.

Tracey, D. H., & Morrow, L. M. (2006). Lenses on reading: An introduction to theories

and models. New York: Guilford Press.

von Davier, M. (2008). A general diagnostic model applied to language testing data.

British Journal of Mathematical and Statistical Psychology, 61(2), 287.

Wang, C., & Gierl, M. J. (2011). Using the attribute hierarchy method to make diagnostic

inferences about examinees’ cognitive skills in critical reading. Journal of

Educational Measurement, 48(2), 165-187.

Wang, C., Gierl, M. J., & Leighton, J. P. (2006). Investigating the cognitive attributes

underlying student performance on a foreign language reading test: An application

of the attribute hierarchy method. Paper presented at the annual meeting of the

National Council on Measurement in Education, San Francisco, California.

Wiederholt, J. L., & Blalock, G. (2000). Gray Silent Reading Tests: Examiners’ Manual.

Austin, Texas: PRO-ED, Inc.

Wijekumar, K., Meyer, B. J. F., & Lei, P. (2012). Large-scale randomized controlled trial

with 4th graders using intelligent tutoring of the structure strategy to improve

nonfiction reading comprehension. Educational Technology Research and

Development, 60(6), 987-1013.

Wijekumar, K., Meyer, B. J. F., Lei, P., Lin, Y., Johnson, L. A., Spielvogel, J. A.,

Shurmatz, K. M., Ray, M., & Cook, M. (2014). Multisite randomized controlled

trial examining intelligent tutoring of structure strategy for fifth-grade readers.

Journal of Research on Educational Effectiveness, 7, 331-357.

Williams, J. P., Hall, K. M., Lauer, K. D., Stafford, K. B., DeSisto, L. A., & deCani, J. S.

(2005). Expository text comprehension in the primary grade classroom. Journal of

Educational Psychology, 97(4), 538-550.

Xu, G., & Zhang, S. (2016). Identifiability of diagnostic classification models.

Psychometrika, 81(3), 625-649.

Xu, X., & von Davier, M. (2006). Cognitive diagnosis for NAEP proficiency data

(Research Report No. RR-06-08). Princeton, NJ: Educational Testing Service.

Appendix A. Q-matrix for Form A of the GSRT

Item Literal Inferential Critical Affective

1 0 1 0 0

2 1 0 0 0

3 0 1 1 0

4 0 1 1 0

5 0 1 1 0

6 1 0 0 0

7 0 1 0 0

8 0 1 1 0

9 0 1 1 0

10 0 1 1 0

11 1 0 0 0

12 1 0 0 0

13 0 1 0 0

14 0 0 0 1

15 0 1 1 0

16 1 0 0 0

17 1 0 0 0

18 0 1 0 0

19 0 0 0 1

20 0 1 1 0

21 0 1 0 0

22 1 0 0 0

23 1 0 0 0

24 0 0 0 1

25 0 1 1 0

26 1 0 0 0

27 1 0 0 0

28 0 1 1 0

29 0 1 1 0

30 0 1 1 0

31 1 0 0 0

32 0 1 1 0

33 0 0 0 1

34 0 1 1 0

35 0 1 1 0

36 1 0 0 0

37 1 0 0 0

38 0 1 0 0

39 0 1 1 0

40 0 1 1 0

41 1 0 0 0

42 1 0 0 0

43 0 1 1 0

44 0 1 1 0

45 0 1 1 0

46 0 1 1 0

47 0 1 0 0

48 0 1 0 0

49 0 1 1 0

50 0 1 1 0

51 1 0 0 0

52 0 1 0 0

53 0 1 0 0

54 1 0 0 0

55 0 1 1 0

56 0 1 0 0

57 1 0 0 0

58 0 1 0 0

59 0 1 1 0

60 0 1 1 0

61 0 1 1 0

62 0 1 1 0

63 0 0 0 1

64 0 1 1 0

65 0 1 1 0
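
For readers who want to work with this Q-matrix programmatically, the following is a minimal Python sketch of one possible way to load rows in the layout above and tally how many items measure each attribute. Only the first ten rows of Form A are reproduced, copied from the table; the function names (parse_q_matrix, items_per_attribute) are illustrative assumptions and are not part of the analyses reported in this thesis.

```python
# Attribute order follows the column header of the table above.
ATTRIBUTES = ("Literal", "Inferential", "Critical", "Affective")

# First ten rows of the Form A Q-matrix, copied from the table (item, then 0/1 entries).
FORM_A_EXCERPT = """
1 0 1 0 0
2 1 0 0 0
3 0 1 1 0
4 0 1 1 0
5 0 1 1 0
6 1 0 0 0
7 0 1 0 0
8 0 1 1 0
9 0 1 1 0
10 0 1 1 0
"""


def parse_q_matrix(text):
    """Return {item_number: (literal, inferential, critical, affective)}."""
    q = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        item, *entries = (int(f) for f in fields)
        if len(entries) != len(ATTRIBUTES) or any(e not in (0, 1) for e in entries):
            raise ValueError(f"malformed Q-matrix row: {line!r}")
        q[item] = tuple(entries)
    return q


def items_per_attribute(q):
    """Count, for each attribute, the items whose Q-matrix entry is 1."""
    return {
        name: sum(row[k] for row in q.values())
        for k, name in enumerate(ATTRIBUTES)
    }


q_excerpt = parse_q_matrix(FORM_A_EXCERPT)
counts = items_per_attribute(q_excerpt)
print(counts)
```

Running the same two functions on all 65 rows would give the number of items available to measure each subskill, which is a quick check on how well each attribute is represented in the form.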

Appendix B. Q-matrix for Form B of the GSRT

Item Literal Inferential Critical Affective

1 1 0 0 0

2 1 0 0 0

3 1 0 0 0

4 0 1 0 0

5 0 1 0 0

6 1 0 0 0

7 0 1 0 0

8 0 1 0 0

9 0 1 0 0

10 0 1 1 0

11 1 0 0 0

12 1 0 0 0

13 1 0 0 0

14 0 1 1 0

15 0 1 1 0

16 1 0 0 0

17 0 0 0 1

18 0 1 0 0

19 0 1 1 0

20 0 1 1 0

21 1 0 0 0

22 1 0 0 0

23 0 1 1 0

24 0 0 0 1

25 0 1 1 0

26 1 0 0 0

27 1 0 0 0

28 0 1 1 0

29 0 1 1 0

30 0 1 1 0

31 1 0 0 0

32 0 1 0 0

33 0 1 1 0

34 0 1 0 0

35 0 1 1 0

36 0 1 0 0

37 1 0 0 0

38 0 1 1 0

39 1 0 0 0

40 0 1 0 0

41 1 0 0 0

42 0 1 1 0

43 1 0 0 0

44 0 0 0 1

45 0 1 1 0

46 1 0 0 0

47 0 1 0 0

48 0 1 0 0

49 0 1 0 0

50 0 1 0 0

51 0 1 0 0

52 0 1 0 0

53 0 1 1 0

54 0 1 1 0

55 0 1 1 0

56 0 1 0 0

57 0 1 1 0

58 0 1 1 0

59 0 1 1 0

60 0 1 1 0

61 0 1 1 0

62 1 0 0 0

63 0 1 0 0

64 0 1 0 0

65 0 1 0 0
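
To illustrate how a Q-matrix row links an examinee's attribute profile to an item, the sketch below computes the deterministic "ideal response" under a conjunctive (DINA-type) rule, in which an item is expected to be answered correctly only when every attribute it requires has been mastered. This is a simplified special case shown for intuition, not the G-DINA estimation used in this thesis. The three rows are copied from the Form B table above; the mastery profile alpha and the function name ideal_response are hypothetical.

```python
# Three rows copied from the Form B Q-matrix above:
# (Literal, Inferential, Critical, Affective)
FORM_B_ROWS = {
    1: (1, 0, 0, 0),
    10: (0, 1, 1, 0),
    17: (0, 0, 0, 1),
}


def ideal_response(alpha, q_row):
    """Return 1 if every attribute required by q_row is mastered in alpha, else 0.

    Conjunctive (DINA-type) rule: a required attribute (q = 1) that is not
    mastered (alpha = 0) makes the ideal response 0.
    """
    return int(all(a >= q for a, q in zip(alpha, q_row)))


# Hypothetical examinee who has mastered the literal and inferential
# subskills but not the critical or affective ones.
alpha = (1, 1, 0, 0)

ideal = {item: ideal_response(alpha, row) for item, row in FORM_B_ROWS.items()}
print(ideal)
```

In an actual CDM, these ideal responses are combined with estimated item parameters (for example, slip and guessing in the DINA model) to give response probabilities; the sketch omits that step.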
