The Pennsylvania State University
The Graduate School
College of Education
EVALUATING THE EFFECT OF INTELLIGENT TUTORING
OF STRUCTURE STRATEGY ON STUDENTS’ READING ABILITY
USING A COGNITIVE DIAGNOSTIC MODELING APPROACH
A Thesis in
Educational Psychology
by
Xiaoli Jiang
© 2017 Xiaoli Jiang
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
August 2017
The thesis of Xiaoli Jiang was reviewed and approved* by the following:
Pui-Wa Lei
Professor of Education
Thesis Advisor
Bonnie J. F. Meyer
Professor of Education
Peggy Van Meter
Associate Professor of Education
Professor in Charge, Educational Psychology Program
*Signatures are on file in the Graduate School.
ABSTRACT
Reading comprehension ability is generally acknowledged to be crucial for
learners of all ages. Previous studies have consistently shown that the intelligent tutoring
of structure strategy (ITSS) is a promising approach to improving average reading
comprehension at the group level. However, it remains unclear whether ITSS instruction
is equally effective in promoting different reading subskills for all individuals, and
whether its effect differs between grade levels or between female and male learners.
The current study addressed these questions by applying a cognitive diagnostic
modeling (CDM) technique through a retrofitting approach. CDM is designed to provide
formative diagnostic information through fine-grained reporting of learners’ subskill or
attribute mastery profiles. The G-DINA model was used to examine which
of the four subskills (literal, inferential, critical, and affective) represented in
the Gray Silent Reading Test (GSRT) were significantly improved through ITSS
instruction as compared to a business-as-usual control group.
Our findings showed that all four reading subskills were significantly enhanced,
and that there was insufficient evidence to claim that the effect of ITSS
instruction differed across the four subskills. However, the growth rates of the four
subskills differed from pretest to posttest, with the inferential subskill having
the greatest growth rate. In addition, the data supported the other two
hypotheses that the effect of ITSS instruction in promoting reading subskills did not
differ between Grade 4 and Grade 5 students or between female and male students.
The results of the study are expected to inform both the teaching and learning
of reading, in that teachers can remediate students’ weaknesses in reading subskills
by adapting instructional design and providing customized reading intervention
programs to aid student learning. The limitations of the present study and future research
directions are also discussed.
Table of Contents
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENT
Chapter 1 Introduction
1.1 Statement of the Problem
1.2 Significance of the Study
Chapter 2 Literature Review
2.1 Overview of Cognitive Diagnostic Assessment
2.2 Cognitive Diagnostic Modeling Approach
2.2.1 Cognitive Diagnostic Models
2.2.2 Application of the G-DINA Model in Reading Tests
2.3 Reading Comprehension Subskills
2.3.1 Reading Comprehension as a Construct of Multidimensional Subskills
2.3.2 Specifying Reading Subskills and Subskill-Item Relationship
2.4 Research Questions and Study Hypotheses
Chapter 3 Research Methodology
3.1 Identifying the GSRT Reading Subskills
3.1.1 Instrumentation and Measure of Reading Comprehension
3.1.2 Reading Subskill Identification
3.1.3 Coding Procedures and Q-matrix Construction
3.2 Investigating Examinees’ Performance on the GSRT
3.2.1 Participants and Data Collection
3.2.2 Model Fit Evaluation
3.3 Data Analysis
Chapter 4 Results
4.1 On Group Level
4.2 On Individual Subskill Level
Chapter 5 Discussion, Implication, Limitations, and Future Research
5.1 Discussion of the Overall Findings
5.2 Implications for Reading Instruction and CDA of Reading
5.3 Limitation and Future Research
References
Appendix A. Q-matrix for Form A of the GSRT
Appendix B. Q-matrix for Form B of the GSRT
LIST OF FIGURES
Figure 4.1 Estimated probability of each subskill mastery profile for pretest and posttest
Figure 4.2 Difference of probability of mastery of each subskill at pretest and posttest by
intervention condition
LIST OF TABLES
Table 2.1 An Example of Q-matrix
Table 2.2 An Example of Latent Classes
Table 3.1 An Analysis of Sample Reading Passages in the GSRT (Form A)
Table 3.2 Subskills Specified by the GSRT Developers
Table 3.3 Q-matrix for Form B of the GSRT (single subskill)
Table 3.4 Q-matrix for Form B of the GSRT (multiple subskills)
Table 3.5 Absolute Fit Statistics for Pretest
Table 3.6 Relative Fit Statistics for Pretest
Table 3.7 Absolute Fit Statistics for Posttest
Table 3.8 Relative Fit Statistics for Posttest
Table 4.1 Subskill Mastery Probability in Pretest and Posttest
Table 4.2 Repeated Measures Analysis of Variance for Within-Subjects and Between-Subjects Effects
Table 4.3 Mean Difference t-test
ACKNOWLEDGEMENT
I have spent a lot of time working on this thesis, yet the study would not have
been possible without the contribution and help of many people. I am grateful to
everyone who has helped me in completing the thesis. I am particularly thankful to my
advisor, Dr. Pui-Wa Lei, for her superb mentorship, guidance, encouragement, and
patience. Dr. Lei has a great vision and strong passion for research in educational
measurement, which has served as a great inspiration during my study in the Educational
Psychology program. It was she who introduced me to the field of cognitive diagnostic
assessment, in which I am developing increasing interest.
I would also like to express my deep gratitude to Dr. Bonnie Meyer. As an expert
in reading research, she has provided me with valuable insights for my study. Dr. Meyer
is always encouraging and ready to help, which has played a very important role in my
academic journey.
I am also grateful to Janet Dillon, Robert Fogarty, Cory Kite and other colleagues
at Outreach Analytics and Reporting (OAR) in Penn State Outreach and Online
Education for their help. Janet is a wonderful supervisor, who is always supportive and
cares for both my research and my life. My GA work experience at OAR has been
terrific because of her and the OAR team that she leads.
Special thanks must go to my peers Xiuyan Guo, Yao Xiong, and Xinyue Li in
the Educational Psychology program. Their help is always appreciated.
My gratitude to my family is beyond words. I am indebted to them all for their
continuous understanding, love, and support.
Chapter 1 Introduction
Reading comprehension ability, the ability to understand written text, is generally
acknowledged to be crucial for learners of all ages both inside and outside the school
environment. Reading comprehension is fundamental for K-12 students
to perform successfully in school, as they first learn to read before they
start reading to learn. At the same time, reading comprehension is a challenge for
school children; U.S. students have been underprepared for reading, as evidenced
by state-level and national-level reading tests (NAEP, 2007).
It is widely accepted that reading strategies are necessary for readers to maintain
appropriate comprehension. Various strategies have consistently been shown
to improve readers’ comprehension (Alfassi, 2004; Johnson-Glenberg, 2000;
Sung, Chang, & Huang, 2008). Therefore, reading instructors often
teach reading strategies in order to address learners’ reading obstacles
(McNamara, 2007). One of the most effective strategies, supported by much research
and practice, is training students to facilitate their reading comprehension by using the
structure strategy.
Research on the structure strategy spans 40 years (Wijekumar,
Meyer, & Lei, 2012) and has indicated that the structure strategy can improve reading
comprehension achievement for both young readers (e.g., 3rd graders) and adult readers (e.g.,
Meyer, Brandt, & Bluth, 1980; Meyer et al., 2002; Meyer et al., 2010; Wijekumar et al.,
2014). The structure strategy, designed by Dr. Bonnie Meyer in 1975, directs readers to focus on
the organizational structure of a text and then to follow this structure to determine
what is important (Meyer et al., 1980). Essentially, the structure strategy teaches readers
to find out the structure of an expository text (e.g., comparison, cause and effect, and
sequence) by identifying the key signaling words and to organize their reading
comprehension by using that structure (Meyer & Wijekumar, 2007).
Extensive research has been conducted with trained teachers instructing students in
how to use the structure strategy. The web-based Intelligent Tutoring System for the
Structure Strategy (ITSS), funded by the U.S. Department of Education, creates user
models by capturing tutoring interactions between human tutors and students and
converts them to computerized interactions. ITSS has demonstrated success in improving
reading comprehension for elementary and middle school students (e.g., Meyer et al.,
2010; Wijekumar et al., 2014).
1.1 Statement of the Problem
The structure strategy is a “well-tested method that helps readers to focus on the
text organization, helping them to organize their reading accordingly, and show
significant improvement in recall of expository text” (Project Description. Retrieved
September 13, 2016, from https://itss.psu.edu/itss/index.php/project-54). Specifically, the
structure strategy is a text-structure-based reading comprehension approach that requires
readers to: a) identify specific signaling words for the text structure (e.g., description,
comparison, sequence, problem-and-solution, and cause-and-effect), b) impose a top-
level structure on the reading text to create a main idea, and c) obtain proper
comprehension of the details in a text by combining both the signaling words and the
main idea (Wijekumar et al., 2014).
This suggests that the structure strategy has three major steps: a) to find out the
text structure by identifying the key signaling words such as “in contrast” for the
comparison structure (Meyer, 1975), b) to write down the main idea of the text that has a
particular text structure with the help of signaling words identified, and c) to recall the
text without referring to the text by using the text structure and main idea already
obtained. The three steps can be summarized as identifying the main idea
(i.e., the first two steps) and figuring out detailed explicit information (i.e., the third step)
depicted in a given passage through the use of a text structure. Obtaining explicit
information from a text is literal-level comprehension, whereas figuring out the main idea
involves both literal and inferential comprehension, since the main idea of a passage can
be either explicitly expressed (literal level) or implied and therefore
inferred by readers (inferential level).
Research studies have consistently shown that the structure strategy helps readers
of all ages improve their reading comprehension in that those who have received the
structure strategy instruction perform significantly better than their counterparts who
have not at a group level on assessments such as standardized reading comprehension
tests (e.g., Gray Silent Reading Test) or researcher-designed tests (e.g., Carrell, 1985;
Englert & Thomas, 1987; Meyer et al., 1980; Meyer et al., 2002; Meyer et al., 2010;
Meyer, Wijekumar, & Lin, 2011; Meyer & Poon, 2001; Meyer & Rice, 1989;
Wijekumar, Meyer, & Lei, 2012; Wijekumar et al., 2014; Williams et al., 2005).
However, it remains unclear whether the structure strategy is equally effective in
promoting different reading subskills for all individuals.
The Gray Silent Reading Test (GSRT) mentioned above is a standardized reading
assessment that measures silent reading comprehension ability. The GSRT has two
equivalent forms (Form A and Form B) and each form is composed of 13
developmentally sequenced reading passages/stories with five multiple-choice questions
following each passage. There are four reading attributes or subskills underlying the
GSRT, which are specified by the test developers in the Gray Silent Reading Tests:
Examiner’s Manual (Manual hereafter) (Wiederholt & Blalock, 2000), namely, literal,
inferential, critical, and affective subskills.
According to the Manual (Wiederholt & Blalock, 2000), literal comprehension is
the subskill of identifying explicit information within the text, both general (such as the
directly expressed main idea) and specific (such as the explicitly stated supporting
details). Inferential comprehension is the subskill of inferring meanings that go beyond the
stated information, which includes inferring main ideas and facts and linking text
information to other situations to make predictions, such as what is likely to happen based on
the current text. That is, inferential comprehension requires a mental process by which
readers obtain information not explicitly stated and reach a conclusion based on specific
evidence or given information. Critical comprehension refers to readers’ ability to
analyze, evaluate, or make reasonable judgments about the text’s content. Therefore,
critical comprehension is also termed evaluative comprehension in the literature of
reading research. According to Bloom’s Taxonomy (1956), analyzing and evaluating
are higher-order thinking skills. Affective comprehension involves readers’ personal or
emotional response to the text. Bloom’s Taxonomy of educational objectives is divided
into three domains: cognitive, affective, and psychomotor. Bloom identified six levels
within the cognitive domain (knowledge, comprehension, application, analysis,
synthesis, and evaluation), which are also applied in reading comprehension (see
Chapter 2 for further elaboration). The affective domain includes the manner in
which people deal with things emotionally, and affective comprehension involves readers’
emotions and feelings toward a text.
The present study aims to move one step beyond previous studies
by examining the effect of structure strategy instruction on reading comprehension
subskills. The research questions addressed in the present study are: Is ITSS instruction
equally effective in improving the reading subskills as represented in the GSRT? Does
the effect of ITSS instruction vary by grade level (4th grade vs. 5th grade) or gender
(female vs. male)? To address these questions, cognitive diagnostic modeling (CDM) will
be applied to examine 4th and 5th grade students’ mastery of reading subskills based on their
performance on the GSRT before and after ITSS instruction and to compare it with that of
their business-as-usual control counterparts. An introduction
to CDM is given in the subsequent chapter.
1.2 Significance of the Study
McNamara (2007) stressed the importance of developing reading comprehension
skills in elementary school children, and further argued that development and instruction
in reading subskills are as important as fostering basic language skills for these students.
Therefore, it would be helpful to obtain more diagnostic information regarding students’
reading subskill mastery profiles by identifying the specific subskills that can be
significantly enhanced through ITSS instruction. Once the diagnostic information is
obtained, it will shed light on the design of reading skill instruction, training, or
intervention, which in turn can benefit student learning. That is to say, this study will
identify the skill areas that need further strengthening within ITSS, and it is hoped that
more customized and personalized interventions can be developed and delivered to
individuals in need.
Chapter 2 Literature Review
This chapter is devoted to providing background information necessary for a
thorough understanding of the context of the present study, which encompasses the terms,
concepts, and previous research studies pertinent to cognitive diagnostic models and
the cognitive diagnostic modeling approach, along with reading comprehension assessment
within the cognitive diagnostic assessment framework.
2.1 Overview of Cognitive Diagnostic Assessment
Knowing What Students Know (National Research Council, 2001) discussed some
measurement models that allow the merger of advances in cognitive science and
psychometric theories and facilitate inferences more relevant to learning. The
psychometric models include models known as cognitive diagnostic models (CDMs),
which can be used to understand the skills and cognitive processes involved in a task.
The approach that takes advantage of CDMs for designing and interpreting diagnostic
assessment is known as cognitive diagnostic assessment (CDA). The CDA approach
combines theories of cognition of interest with statistical models intended to make
inferences about examinees’ mastery profile of tested skills or their knowledge state of
the cognitive skills. The cognitive skills represent the processes or strategies that
examinees utilize in order to correctly solve tasks (Jang, 2008). Broader terms for
cognitive skills include knowledge and strategies that are used to successfully solve a
problem. Terms like skill, subskill, and attribute are often used interchangeably in the
literature.
CDA is an assessment approach that employs a cognitive diagnostic model to
identify items that measure specific skills, and then uses this model to direct the
psychometric analyses of the examinees’ item response patterns for test score inferences
(Gierl, Cui, & Zhou, 2009). Designed to evaluate learners’ cognitive skills or attributes
specified in a cognitive model of test performance, CDA has a fundamental assumption
that every task or each item in a test can be described in terms of multiple cognitive
subskills or attributes that must be mastered by an individual in order to successfully
accomplish the task or answer each item correctly (Gierl et al., 2000).
In applying CDA to language assessments, such as reading
comprehension, there are generally two approaches. In the first, a test is designed by
developing diagnostic items that enable researchers to infer the skill processes underlying
the item responses (Jang, 2008; Lee & Sawaki, 2009). With the diagnostic items
developed, CDMs can be applied to further test the knowledge, subskills, or attributes
underlying each item. This approach, although described as the most-needed area of future
CDA research (Jang, 2009b), is not commonly used because the development of cognitively
diagnostic tests is costly, time-consuming, and labor intensive. The second
approach is presently much more common: cognitive diagnostic modeling is
retrofitted to an existing non-diagnostic test, even though the development of CDA
has largely been motivated by the need for formative assessment techniques (Jang, 2005,
2008). Most applications of cognitive diagnostic modeling in the literature involve the
retrofitting of CDMs to assessments designed by using either classical test theory (CTT)
or item response theory (IRT), which is described as an addition of a new technology to
an older system (Gierl, Roberts, Alves, & Gotzmann, 2009). According to Liu, Huggins-
Manley, and Bulut (2017), retrofitting cognitive diagnostic models to existing non-
diagnostic assessments is “a plausible approach to obtain more actionable scores or
understand more about the constructs themselves” (p. 1).
Diagnostic information can be extracted from non-diagnostic tests, and the
resulting information can further provide insight into both teaching and learning (Lee &
Sawaki, 2009). Furthermore, there are currently many large-scale tests available,
including various types of standardized reading comprehension tests like the GRE and
TOEFL. Substantial resources have been invested in developing them, so it would
be cost-effective if they could be used for other purposes such as extracting diagnostic
information about students’ mastery and non-mastery status of specified subskills (Chen
& de la Torre, 2014).
When a CDA analysis is conducted using the retrofitting method, the following four
basic steps or procedures are typically followed. First, identify and define a set of
subskills required to perform successfully in a given test, which may or may not be
provided by the test developers. Second, construct or create a Q-matrix displaying the
relationship between test items and subskills specified. A Q-matrix is an item-by-subskill
incidence matrix that demonstrates which subskill or multiple subskills are needed for
successful performance for each item on the test (as shown in Table 2.1). In other words,
the Q-matrix specifies the set of latent traits, including knowledge, strategies, or
subskills, necessary for each item. Third, select a proper cognitive diagnostic model and
analyze the examinees’ test performance data by utilizing this model with the Q-matrix
already developed. Fourth, provide diagnostic feedback of the assessment results and
report the scores with diagnostic information (i.e., individual’s skill mastery
status/profile) to inform examinees, teachers, and parents of the examinees’ strengths and
weaknesses in a domain such as reading comprehension or mathematics (Lee & Sawaki,
2009).
George and his colleagues simplified the four steps, summarizing that only two
steps, a qualitative step and a quantitative step, are involved in a CDA analysis (George,
Robitzsch, Kiefer, Groß, & Ünlü, 2016). The first step is a qualitative one in which
researchers subdivide the tested ability into different abilities or subskills. Subsequently,
the researchers specify in a Q-matrix which skills must be mastered to answer each item.
The second step is a quantitative one in which a cognitive diagnostic model is applied
and examinees are classified into dichotomous latent skill classes that predict their
mastery or non-mastery of the skills defined in the first step.
The Q-matrix plays a key role in CDA analysis as it specifies the
subskill-item relationship. For a $J \times K$ Q-matrix, where $J$ represents the number of items
and $K$ the number of subskills, $q_{jk}$ denotes the element in row $j$ and column $k$ (“1” means
that item $j$ assesses attribute $k$, and “0” otherwise). Table 2.1 illustrates what a typical
Q-matrix looks like. Suppose three subskills have been identified and defined. The
subskills measured by each item are indicated in this item-by-subskill binary matrix. For
example, Item 1 requires the first and third subskills for a correct response, and Item 3
only requires the first subskill.
Table 2.1 An Example of Q-matrix
Item  Subskill 1  Subskill 2  Subskill 3
1     1           0           1
2     0           1           1
3     1           0           0
4     0           0           1
5     1           1           0
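As an illustration only, the Q-matrix in Table 2.1 can be written as a small binary array. The following Python sketch is not part of the original analysis; the names `Q_MATRIX` and `required_subskills` are hypothetical and are introduced here simply to show how the subskills required by each item can be read off a row of the matrix.

```python
# Q-matrix from Table 2.1: rows are Items 1-5, columns are Subskills 1-3.
# A 1 in row j, column k means that item j requires subskill k.
# (Q_MATRIX and required_subskills are illustrative names, not from the thesis.)
Q_MATRIX = [
    [1, 0, 1],  # Item 1
    [0, 1, 1],  # Item 2
    [1, 0, 0],  # Item 3
    [0, 0, 1],  # Item 4
    [1, 1, 0],  # Item 5
]


def required_subskills(q_matrix, item):
    """Return the 1-based subskill numbers required by a 1-based item number."""
    row = q_matrix[item - 1]
    return [k + 1 for k, q_jk in enumerate(row) if q_jk == 1]


if __name__ == "__main__":
    for item in range(1, len(Q_MATRIX) + 1):
        print(f"Item {item} requires subskills {required_subskills(Q_MATRIX, item)}")
```

Under this representation, Item 1 maps to Subskills 1 and 3 and Item 3 maps to Subskill 1 only, matching the reading of Table 2.1 given in the text.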
2.2 Cognitive Diagnostic Modeling Approach
Cognitive diagnostic models are a class of psychometric models developed
primarily for assessing students’ mastery and non-mastery status on a set of fine-grained
skills or attributes within a domain in order to provide a multivariate view of their
strengths and weaknesses (Templin & Bradshaw, 2013). Cognitive in this context means
CDMs are able to provide detailed information of students’ cognitive strengths and
weaknesses (Huebner, 2010), which can be used to adapt instructional designs and
determine the skills and knowledge that need to be further developed. This differs from
summative assessment which is generally evaluative rather than diagnostic. As the
detection of students’ strengths and weaknesses is gradually becoming an essential part of
educational assessment (Huff & Goodman, 2007), the CDM approach has received
increased research interest in recent years from researchers and educators in educational
and psychological fields. CDMs also have some alternative labels such as diagnostic
classification models (DCMs; Rupp, Templin, & Henson, 2010), latent response models
(Maris, 1995), and structured located latent class models (Xu & von Davier, 2006).
According to Rupp and Templin (2008), CDMs are “probabilistic, confirmatory
multidimensional latent-variable models with a simple or complex loading structure” (p.
226). CDMs are probabilistic because they predict the probability of an observable
categorical response from unobservable latent categorical variables and provide
formative information in the form of attribute profiles. The unobserved latent variables
are usually labeled in the literature as skills, subskills, attributes, or even
more broadly as abilities, strategies, or knowledge, as introduced earlier. These terms are
often used interchangeably in the literature to refer to the unobserved latent variables or
latent traits (Ravand & Robitzsch, 2015). The terms attribute and subskill, commonly denoted as
α, will be used throughout this thesis for the purpose of clarity and
consistency. Examinees are assigned multidimensional attribute profiles, and further
diagnostically classified as either masters or non-masters of each subskill involved within
a domain. The examinee’s attribute profile is analogous to their subskill profile, latent
class, or knowledge state of subskills involved in a test. The four terms—attribute
profile, subskill (mastery) profile, latent classes, and knowledge state—are used
interchangeably throughout the present study.
Given $K$ subskills, there are a total of $2^K$ attribute profiles or latent classes
(Gierl, Cui, & Zhou, 2009; Rupp, Templin, & Henson, 2010). Table 2.2 is an example of
eight latent classes when there are three subskills ($K = 3$) underlying performance on a
test. The examinees falling into Class 1 have mastered none of the three subskills,
whereas the examinees classified into Class 2 have only mastered the third subskill.
Examinees in either Class 1 (masters of none of the subskills) or Class 8 (masters of all three
subskills) are referred to as having a flat skill mastery profile (Ravand, Barati, &
Widhiarso, 2013).
Table 2.2 An Example of Latent Classes
Latent Class Number  Representation of Latent Class
Class 1              [0,0,0]
Class 2              [0,0,1]
Class 3              [0,1,0]
Class 4              [0,1,1]
Class 5              [1,0,0]
Class 6              [1,0,1]
Class 7              [1,1,0]
Class 8              [1,1,1]
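The latent classes in Table 2.2 can be enumerated mechanically. The following Python sketch is offered only as an illustration, with hypothetical helper names `latent_classes` and `is_flat` that do not appear in the thesis; it lists every binary attribute profile for a given number of subskills and flags the flat profiles described above.

```python
from itertools import product


def latent_classes(num_subskills):
    """Enumerate all 2**K binary attribute profiles for K subskills.

    Profiles come out in the same order as Table 2.2, from [0,0,0] to [1,1,1]
    when K = 3. (Illustrative helper, not part of the original analysis.)
    """
    return [list(profile) for profile in product([0, 1], repeat=num_subskills)]


def is_flat(profile):
    """A flat profile masters either none or all of the subskills."""
    return sum(profile) in (0, len(profile))


if __name__ == "__main__":
    for number, profile in enumerate(latent_classes(3), start=1):
        label = " (flat)" if is_flat(profile) else ""
        print(f"Class {number}: {profile}{label}")
```

With three subskills the enumeration yields eight classes, of which only Class 1 and Class 8 are flat.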
The cognitive diagnostic method using CDMs is now receiving wider attention due
to its advantages over traditional analysis of test results with CTT or unidimensional IRT
models. Both CTT and the most commonly used unidimensional IRT models are geared
towards measuring students’ overall performance by providing their unidimensional
ability values, typically with a summative scaled score and/or a percentile rank. In the IRT
framework, for example, it is generally assumed that a single latent trait denoted as $\theta$
underlies an individual’s performance on an item or in a test. Students with higher latent
trait values have a higher probability of giving a correct response. In contrast, within the CDM
framework, an individual’s performance on an item is assumed to be a function of
multiple, discrete latent traits generically referred to as attributes or subskills, as
explained earlier. Students’ successful performance on a test requires
successful implementation of the attributes underlying the test.
A CDM framework can provide multidimensional results and examinees’
attribute profiles whereas neither CTT nor unidimensional IRT can provide diagnostic
information on examinees’ mastery or non-mastery state of tested subskills. Diagnostic
score reports based on these subskill profiles can then be provided to researchers,
educators, students, and their parents. Proper actions can be taken accordingly in view of
the examinees’ strengths and weaknesses in a domain in order to guide their
improvements in the ongoing teaching and learning context. This demonstrates that, as
Jang (2008) pointed out, the CDA approach is aimed at promoting assessment for
learning and learning process as opposed to assessment of learning in traditional methods.
Despite these advantages, CDA using CDMs has gained popularity among researchers and
educators only in recent years. Ravand and Robitzsch (2015)
suggested that its underutilization is due to two reasons. First, CDMs are novel and more
complex than conventional IRT models, so many researchers
remain unfamiliar with CDMs and with how to use them. Second, model specifications
for CDMs are more complex and more subject to estimation problems because they are
highly parameterized models.
In the past decade, CDMs have been applied to tests in different domains such as
mathematics and reading comprehension tests using the retrofitting approach. A number
of research studies with regard to reading tests, second language reading tests in
particular, have been conducted using CDMs (e.g., Chen & Chen, 2015; Gao, 2007;
Hartz & Roussos, 2008; Jang, 2005; Jang, Dunlop, Wagner, Kim, & Gu, 2013; Li, 2011;
Li & Suen, 2013; Ma & Meng, 2014; Ravand, Barati, & Widhiarso, 2013; Sawaki, Kim,
& Gentile, 2009).
Like most CDM applications in the literature, the GSRT test used in the present
study was not originally designed for cognitive diagnostic purpose. By applying the
retrofitting approach, it is hoped that diagnostic information extracted from this
standardized reading test can facilitate the understanding of the effectiveness of ITSS
instruction on specific reading subskills.
2.2.1 Cognitive Diagnostic Models
In the past decade, there has been rapid progress in cognitive diagnostic
assessment practice. A wide array of CDMs has been developed, as evidenced by the over
120 CDMs reviewed by Fu and Li (2007, cited in Lee & Sawaki, 2009). These CDMs
can generally be grouped into three families: non-compensatory CDMs,
compensatory CDMs, and general CDMs. The terms refer to how subskills combine in
modeling the probability of a correct response (Lissitz et al., 2014).
In a non-compensatory CDM framework, a correct response to an item can be
obtained only if all required subskills for this item are present. The deterministic input,
noisy “and” gate model (DINA; Haertel, 1989; Junker & Sijtsma, 2001), the Rule Space
Model (Tatsuoka, 1983), and the Reparameterized Unified Model (RUM; Hartz, 2002),
also known as the Fusion Model, all belong to this family. In contrast, in
compensatory CDMs, not all skills are required; mastery of some of the subskills
required for a correct response can compensate for non-mastery of the others. The
deterministic input, noisy “or” gate model (DINO; Templin & Henson, 2006) and the
Additive Cognitive Diagnostic Model (ACDM; de la Torre, 2011) both belong to this
second family. Finally, general CDMs allow for both types of relationships. The general
diagnostic model (GDM; von Davier, 2008; Xu & von Davier, 2006) and the generalized
DINA model (G-DINA; de la Torre, 2011) are examples of general CDMs.
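The contrast between the first two families can be illustrated with a minimal sketch. It assumes the usual two-parameter form of the DINA and DINO models, in which each item has a guessing and a slipping parameter; the function names, the item, and the parameter values are hypothetical and are not taken from the present study.

```python
def dina_prob(alpha, q_row, guess, slip):
    """DINA (conjunctive): high success probability only if ALL required attributes are mastered."""
    has_all = all(a == 1 for a, r in zip(alpha, q_row) if r == 1)
    return 1 - slip if has_all else guess


def dino_prob(alpha, q_row, guess, slip):
    """DINO (disjunctive): high success probability if AT LEAST ONE required attribute is mastered."""
    has_any = any(a == 1 for a, r in zip(alpha, q_row) if r == 1)
    return 1 - slip if has_any else guess


# Hypothetical item that requires attributes 1 and 2 out of three attributes.
q_row = (1, 1, 0)
guess, slip = 0.2, 0.1  # illustrative values, not estimates from the thesis

for alpha in [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 0)]:
    print(alpha, dina_prob(alpha, q_row, guess, slip), dino_prob(alpha, q_row, guess, slip))
```

An examinee who has mastered only one of the two required attributes is treated like a non-master under DINA but like a master under DINO, which is the sense in which one subskill can compensate for another.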
Non-compensatory models are more popular than compensatory models in
performing cognitive diagnostic analyses because the former generate more fine-grained
diagnostic information than the latter can (Li, 2011). Application of CDMs is most
common with mathematics and reading comprehension. It is generally agreed that non-
compensatory models are appropriate for mathematics assessments, while compensatory
models are better for reading assessments since many researchers believe reading skills
are compensatory in nature (Li, Hunter, & Lei, 2015). A number of studies have been
conducted where CDMs are applied to large-scale reading assessments (e.g., Chen &
Chen, 2015; Jang, 2009b; Kim, A. Y., 2011; Kim, H. S., 2011; Li, 2011; Li & Suen,
2013; Ravand, Barati, & Widhiarso, 2013).
As a general cognitive diagnostic model, the G-DINA model is a generalization of
the DINA model (Junker & Sijtsma, 2001). The G-DINA model relaxes the highly
restrictive DINA assumption that all examinees who lack at least one required subskill have the
same probability of success. Based on a Q-matrix of dimensions J (number of items) × K
(number of attributes), the probability formula for the G-DINA model is given by de la
Torre (2011) as follows:
\[
P(\alpha_{lj}^{*}) = \delta_{j0} + \sum_{k=1}^{K_j^{*}} \delta_{jk}\,\alpha_{lk}
+ \sum_{k=1}^{K_j^{*}-1} \sum_{k'=k+1}^{K_j^{*}} \delta_{jkk'}\,\alpha_{lk}\,\alpha_{lk'}
+ \cdots + \delta_{j12 \ldots K_j^{*}} \prod_{k=1}^{K_j^{*}} \alpha_{lk}
\]
where
$\alpha_{lj}^{*}$ is the binary latent attribute vector for latent class l,
$\delta_{j0}$ is the intercept for item j,
$\delta_{jk}$ is the main effect of mastering attribute $\alpha_k$ for item j, and
$\delta_{jkk'}$ is the interaction effect due to attributes $\alpha_k$ and $\alpha_{k'}$ for item j.
In the formula above, $\delta_{j0}$ represents the baseline probability of a correct response
when none of the required subskills is mastered by an examinee, which can be viewed as
the guessing parameter. With guessing, a correct response is still possible in this case.
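As a rough illustration of how the terms of the formula above combine, the following sketch evaluates the identity-link G-DINA probability for a single item. The function name, the dictionary layout for the effects, and all numeric values are assumptions made for this example only; they are not estimates from the present study.

```python
from itertools import combinations
from math import prod


def gdina_prob(alpha, delta):
    """Identity-link G-DINA success probability for one item.

    alpha: tuple of 0/1 mastery indicators for the attributes the item requires.
    delta: dict mapping a tuple of attribute indices to its effect;
           the empty tuple () holds the intercept (baseline probability).
    """
    k = len(alpha)
    p = delta.get((), 0.0)
    for order in range(1, k + 1):
        for subset in combinations(range(k), order):
            # A term contributes only when every attribute in the subset is mastered.
            p += delta.get(subset, 0.0) * prod(alpha[i] for i in subset)
    return p


# Hypothetical item requiring two attributes: intercept, two main effects, one interaction.
delta_general = {(): 0.2, (0,): 0.1, (1,): 0.2, (0, 1): 0.4}

# DINA as a special case: only the intercept and the highest-order interaction are nonzero.
delta_dina = {(): 0.2, (0, 1): 0.7}

for alpha in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(alpha, round(gdina_prob(alpha, delta_general), 3), round(gdina_prob(alpha, delta_dina), 3))
```

Under the general parameterization each partial-mastery profile can have its own success probability, whereas under the DINA restriction every profile short of full mastery stays at the baseline.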
When used, the G-DINA model is usually accepted for its simplicity of
computation and estimation in identifying the role an individual subskill plays for
someone in completing a task (Tatsuoka & Tatsuoka, 1982). Nevertheless, the G-DINA
model is much less frequently adopted in educational research than other
CDMs such as the Fusion Model and the DINA model. The G-DINA model has drawn researchers’
attention only in recent years, especially after de la Torre’s (2011) illustration of
estimation and application of the G-DINA model using both real-world data and
simulated data.
2.2.2 Application of the G-DINA Model in Reading Tests
Li and her colleagues reviewed previous CDM analyses of reading tests,
summarizing each study in the literature on the test analyzed, the cognitive diagnostic
model(s) used, and software used (Li, Hunter, & Lei, 2015). A total of 15 studies have
been reviewed, and the summary of review was organized in a tabular form. Of the 15
studies reviewed, none of them used the G-DINA model; instead, Rule Space Model and
Fusion Model were usually used.
The feature of the G-DINA model aligns with that of reading comprehension in
that this model “is sensitive to the integrative nature of reading comprehension skills and
can detect the interactive relationship among them” (Chen & Chen, 2015, p. 4). This
suggests that the G-DINA model can in theory be applied to reading tests. In the limited
sources that can be tracked in the extant literature, it is found that the G-DINA model has
been successfully applied in reading tests. For example, Chen & Chen (2015) examined
the relationships among the five reading subskills (i.e., identifying explicit information,
generalizing main idea, interpretation and explanation, getting inference, and evaluation
and comment) specified and defined by six content experts by applying the G-DINA
model and analyzing the test response data of 1,029 British secondary school students’
performance on a reading test (i.e., Programme for International Student Assessment
English) with 20 items. The research demonstrated that the G-DINA model caters to the
characteristics of reading subskills and suggested that the model be applied to tests that
involve hierarchical skills.
In view of the research findings that the G-DINA model is well suited for
modeling reading comprehension subskills and the fact that very few educational studies
have employed the G-DINA model, it might be a meaningful endeavor to apply this
model to analyze students’ test performance data on the GSRT reading test to extract
more fine-grained diagnostic information about examinees at the attribute level for the
current study. Goodness of fit of different more parsimonious variants of the G-DINA
model will be further evaluated to determine the best fitting model.
2.3 Reading Comprehension Subskills
Reading is usually viewed as comprehension—“the process of simultaneously
extracting and constructing meaning through interaction and engagement with written
language” (RAND Reading Study Group, 2002, p. xiii). Reading ability, the ability to
comprehend meaning from the written text, is a key skill for learning. Reading
comprehension and reading ability are used interchangeably throughout the study. Given
the important role reading plays in our lives, there has long existed a debate about the
nature of reading comprehension. A myriad of theories have been proposed to describe
and explain reading comprehension, among which cognitive processing perspectives of
reading remain the most influential to date (Tracey & Morrow, 2006). Because reading is
an unobservable construct or mental process, educational psychologists in the late 1950s
started to extensively describe the underlying cognitive processes involved in reading
comprehension. The cognitive processing view thus emerged and remains popular
among researchers today. Nevertheless, the nature of reading is still poorly understood and
there is no agreed-upon view on the dimensionality of the reading construct.
2.3.1 Reading Comprehension as a Construct of Multidimensional Subskills
In general, there are two opposite views on the dimensionality of reading
comprehension. Many experts hold the view that reading comprehension is unitary and
holistic, and thus it cannot be further divided (e.g., Lunzer, Waite, & Dolan, 1979).
Others, on the contrary, argue that reading comprehension is multi-divisible or that
reading ability is composed of separable components or subskills (e.g., Carr & Levy,
1990; Davis, 1968; Munby, 1978).
Experts in the second group believe that reading comprehension consists of
separable components or constituents, in other words, multiple subskills are involved in
reading comprehension (e.g., Davis, 1944; Munby, 1978). However, they differ on the
number of reading subskills that exist. For example, Davis (1944) reviewed the relevant
literature and identified nine skills deemed most basic and important by authorities in the
domain of reading, such as ability to select the appropriate meaning for a word or phrase
in light of its particular contextual setting, ability to select the main thought of a passage,
ability to draw inferences from a passage about its contents, ability to draw inferences
about a writer, and so forth. In a subsequent factor analysis study, Davis (1968) found
eight skills in reading: recalling word meanings, obtaining word meaning based on the
context clues, finding out explicit information, integrating ideas of the text, drawing
inferences based on given information, figuring out the writer’s writing purpose, tone
etc., identifying the writer’s techniques, and identifying the text structure. Munby (1978)
contended that 19 skills are required for reading comprehension, such as skimming,
scanning, having basic reference skills, and so on. Scarborough (2001) suggested that
reading largely consisted of word recognition skill and language comprehension skill,
which further consist of three and five subskills, respectively. This review
shows that the number of skills identified as components of reading comprehension varies,
which indicates the complex nature of reading ability. This intricate nature of reading
makes it difficult to find a suitable model for assessing learners’ reading ability (Kim
A. H., 2011).
According to Anderson and Pearson (1984), drawing inferences, one of the
common reading skills, is an essential part of the reading comprehension process, even
among young children. There are researchers who argue that hierarchical relationships
exist among reading comprehension skills. Bloom’s Taxonomy (1956) consisted of six
categories in the cognitive domain used to express the level of expertise required to
achieve an educational goal: knowledge, comprehension, application, analysis, synthesis,
and evaluation. The six categories lie along a continuum from simple to complex and
from low level to higher level. This taxonomy is also used in reading comprehension by
some researchers holding the position that it would be comparatively more difficult for
readers to attain the higher level of reading skills such as synthesis and evaluation (e.g.,
Gray, 1960).
In contrast to this view, some researchers have found that lower-level reading
skills are actually not prerequisites for higher-level reading skills (e.g., Alderson, 1990;
Matthews, 1990), which is evidenced by examples that some examinees can perform
successfully on items requiring higher-level reading skills but fail on items requiring
lower-level skills. In addition, the construction-integration model (Kintsch & Kintsch,
2005) proposed that reading is a cognitive process involving decoding, making inferences, and
integrating information. This model also supports the multiple-component
view of reading comprehension.
In the present study, we adopted the divisibility view of reading comprehension
and contended that multiple reading comprehension subskills are required to complete
reading tasks. CDMs are well-suited for modeling multiple subskills in a confirmatory
fashion. That is, comprehension subskills required for an assessment are defined a priori
in CDM analyses.
2.3.2 Specifying Reading Subskills and Subskill-Item Relationship
As introduced at the beginning of the chapter in the procedure for carrying out a
cognitive diagnostic assessment using the retrofitting approach, the first step is to specify and
define a set of tested subskills. Considering the importance of proper specification of a
set of tested subskills, a variety of sources are usually utilized, which include test
specifications, content domain theories, respondents’ think-aloud interviews, and a
literature review on relevant research (Leighton & Gierl, 2007; Leighton, Gierl, &
Hunka, 2004). When the subskills are not specified by the test developers, which is
common in retrofitting, it is especially important to take advantage of different sources to
identify the set of subskills. One or a combination of all the above-mentioned sources is
found in the literature of application of CDMs in an assessment.
In specifying subskills underlying a given assessment from existing items, some
considerations are important for researchers to keep in mind. One of the considerations is
that the total number of subskills, k, should be small (Liu, Huggins-Manley & Bulut,
2017). Since examinees are classified into $2^k$ attribute profiles or latent classes, it would
be very difficult to estimate the attributes and interpret the results given a large number of
attributes (Liu, Huggins-Manley & Bulut, 2017; Xu & Zhang, 2016). Generally, the
retrofitting studies in the literature specify three to five attributes or subskills (Liu,
Huggins-Manley & Bulut, 2017) so that examinees are classified into 8 (k = 3), 16 (k = 4),
or 32 (k = 5) subskill profiles or latent classes.
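The growth in the number of latent classes can be made concrete with a short sketch that enumerates every mastery profile for k binary attributes. The function name is hypothetical and introduced only for illustration; for k = 3 the enumeration follows the same ordering as Table 2.2.

```python
from itertools import product


def latent_classes(k):
    """Return all 2**k binary mastery profiles for k attributes."""
    return [list(profile) for profile in product((0, 1), repeat=k)]


# With three attributes there are eight profiles, from [0, 0, 0] to [1, 1, 1].
for number, profile in enumerate(latent_classes(3), start=1):
    print(f"Class {number}: {profile}")

# The count doubles with each added attribute.
for k in (3, 4, 5):
    print(k, len(latent_classes(k)))
```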
Upon completing the specification of reading subskills, subskill-item relationships
should be examined and recorded in a Q-matrix. There are generally three ways of
selecting coders to analyze subskill-item relationships in order to create a Q-matrix.
First, the author(s) of a study can serve as coders when applying diagnostic models to an
existing test (e.g., Buck et al., 1997; Gierl, Leighton, et al., 2009). It is in many cases due
to either funding issues or the fact that people not involved in the project cannot be
invited to serve as coders (Buck & Tatsuoka, 1998). Second, content experts are
recruited and included in the research to code the items (e.g., von Davier, 2008; Wang,
Gierl, & Leighton, 2006). It is generally believed that content experts are experienced in
teaching in a domain such as reading; they are familiar with the examinee population of a
given test and understand examinees’ subskill processes regarding problem solving.
Third, some researchers recruit graduate students to assist with the coding (e.g., Jang,
2005; Wang & Gierl, 2007; Wang & Gierl, 2011). Compared to content experts,
graduate students can be more easily recruited. Wang and Gierl (2007) illustrated the use
of attribute hierarchy method (AHM) by applying AHM to students’ response data from
the SAT Critical Reading Subtest. In specifying the attribute hierarchy, the researchers
recruited two graduate students in cognitive psychology, who had experience in
conducting verbal report study and were familiar with reading comprehension research.
In practice, the three ways outlined may have some overlap as in a situation where the
authors of a study are a cohort of graduate students.
The three ways introduced above have all been used in the literature for both
subskills specification and the development of Q-matrix, with no consensus so far about
which way is superior to the others. One might claim that content experts are
authoritative in a domain and should always be included in the coding process. It should
be noted, however, that experts’ ability is substantially higher than that of the examinees
so that experts may approach questions in considerably different ways than examinees
do. There is no evidence suggesting that subskills identified by content experts are
indeed used by examinees in solving problems (Leighton & Gierl, 2007). Furthermore,
the number of coders involved in a study also varies across studies, with the most
common number being two (e.g., Birenbaum & Tatsuoka, 1993; Wang & Gierl, 2011).
Unlike the common practice, Jang (2005) recruited five coders to review the entire test
items of the LanguEdge Reading Comprehension Test (Form 1) and to decide primary
subskills required by each item. Additionally, there are also studies in which a single
coder is used in the coding process (e.g., Buck & Tatsuoka, 1998; Leighton, Cui, & Cor,
2009). Therefore, there is no agreed-upon way of identifying subskills, nor an agreed-upon number of
coders to specify the subskill-item relationships and code the Q-matrix.
Researchers may employ any reasonable method and any number of coders depending on
their own situation.
In addition to the involvement of coders, a number of researchers also use
examinees’ think-aloud protocols or verbal reports as an important empirical source of
information in constructing Q-matrix (e.g., Buck, 1991, 1994; Jang, 2005; Li, 2011;
Wang & Gierl, 2007). Think-aloud protocol is a qualitative research method in which
items in a test are given to a sample of examinees in order to obtain their thinking
processes utilized in understanding, conceptualizing, reasoning, and responding to the
items (Leighton, 2004; Leighton & Gierl, 2011).
Researchers have adopted the methods outlined above in constructing their Q-
matrix. For example, Jang (2005) came up with a Q-matrix by using extensive
preliminary analyses of tasks and examinees’ performance, and qualitative analysis of
verbal protocols from both coders/raters and examinees in the NG TOEFL reading
comprehension test. Similarly, Li and Suen (2013) constructed their initial Q-matrix for
the MELAB reading assessment by employing various sources including reviewing
related literature, obtaining examinees’ think-aloud protocols, and inviting content
experts for their ratings. The initial Q-matrix was then validated using preliminary
empirical evidence.
2.4 Research Questions and Study Hypotheses
Using the CDM approach, the present research attempts to investigate the
following research questions:
(1) Is ITSS instruction equally effective in promoting students’ reading subskills
as represented in the GSRT (i.e., literal, inferential, critical, and affective
subskills)?
(2) Does the effect of ITSS instruction in improving reading subskills differ by
grade level (Grade 4 vs. Grade 5)?
(3) Does the effect of ITSS instruction in improving reading subskills differ by
gender (female vs. male)?
The set of subskills involved in the GSRT will be elaborated in the next chapter
on methodology of the study. Students’ performance on the GSRT in terms of presence
or absence of these subskills before and after ITSS instruction will be evaluated relative to a
business-as-usual control condition to address the research questions.
As mentioned in the introduction to structure strategy in Chapter 1, two reading
subskills, namely, identifying main idea and extracting explicit information from text, are
primarily emphasized by the structure strategy (i.e., the ITSS instruction in the present
study); these fall under literal and inferential comprehension. In addition, since the
structure strategy has been shown to be effective for users of different age groups in increasing their
reading comprehension, no grade or gender difference in reading achievement was
expected through the ITSS instruction. Consequently, the hypotheses of the present study
are as follows.
Hypothesis 1: ITSS instruction was not equally effective in promoting reading subskills
as represented in the GSRT; students’ literal and inferential subskills would be
significantly improved through ITSS instruction compared to their counterparts who did
not receive ITSS instruction controlling for their prior reading subskill levels. Students
who did or did not receive ITSS instruction were expected to perform similarly on the
other two subskills—critical and affective subskills.
Hypothesis 2: The effect of ITSS instruction in improving reading subskills did not differ
between 4th- and 5th-grade students.
Hypothesis 3: The effect of ITSS instruction in improving reading subskills did not differ
between female and male students.
Chapter 3 Research Methodology
Chapter 3 describes the methodology of the current study, which consists of two
parts. The first part mainly focuses on describing the subskills involved in the GSRT and
the construction of Q-matrix for both forms of the GSRT. The second part deals with the
investigation of examinees’ performance on the GSRT, of both pretest (i.e., Form B was
administered) and posttest (i.e., Form A was administered). Examinees’ test response
data in pretest and posttest were analyzed based on their treatment groups using the G-
DINA model in order to examine their strengths (presence/mastery of reading subskills as
represented in the GSRT) and weaknesses (absence/non-mastery of these reading
subskills). Results were used to address the research questions and the findings were
expected to provide diagnostic feedback to the ITSS developers and educators for future
instructional purposes.
3.1 Identifying the GSRT Reading Subskills
As stated previously, the first step in conducting a cognitive diagnostic
assessment using retrofitting method is to identify and define a set of subskills because
successful performance on a test requires a series of successful implementations of these
subskills. This is the fundamental assumption of CDA that each item in a test can be
described in terms of multiple cognitive subskills that must be mastered in order for an
individual to answer each item correctly. Without this first step, the diagnostic analysis
cannot be performed.
3.1.1 Instrumentation and Measure of Reading Comprehension
The reading assessment employed in the current study is the Gray Silent Reading
Test, a standardized reading test that helps test administrators or teachers quickly and
efficiently measure and assess an individual’s silent reading comprehension abilities.
Namely, the GSRT is used to test whether the individual has developed, or is developing,
the ability to read silently with proper comprehension. The GSRT consists of two parallel
forms (Form A and Form B), with each containing 13 reading passages or stories that are
developmentally sequenced. That is to say, passages gradually become longer and more
difficult from the first one to the last. Following each passage, there are five multiple-
choice questions, making each form of the GSRT consist of 65 items.
The GSRT reading test can be either group or individually administered and each
form of the test yields raw scores, grade equivalents, age equivalents, percentiles and a
Silent Reading Quotient. The GSRT is appropriate for individuals from 7 to 25
years of age. The test has been used for a variety of subgroups including males, females,
European Americans, African Americans, Hispanic Americans, Asian Americans, Native
Americans, the Learning Disabled, and so forth.
Table 3.1 is an illustration of some selected passages from Form A of the GSRT,
suggesting that the reading passages in the GSRT differ in multiple ways including
content, content structure, length, and difficulty level. Form B of the GSRT shares the
same feature.
Table 3.1 An Analysis of Sample Reading Passages in the GSRT (Form A)
Reading Passage   Topic/Theme        Content Structure   Length
Passage 1         A boy              Narration           41 words
Passage 4         Girls at a party   Narration           100 words
Passage 6         Sharks             Description         132 words
Passage 10        Folk dances        Description         157 words
Passage 12        Advertisements     Argumentation       159 words
3.1.2 Reading Subskill Identification
Like most current applications of cognitive diagnostic models, we retrofitted the
diagnostic model to the GSRT, a non-diagnostic assessment. When an assessment is
non-diagnostic but the use of the assessment is diagnostic, it is essential to first identify a
set of subskills involved in the assessment and subsequently analyze each item to figure
out the subskill-item relationship by indicating what subskills are necessary for each item
in order for examinees to provide a correct response, namely, the construction of the Q-
matrix.
As previously introduced, a variety of sources are usually utilized to specify the
set of skills involved in a task. The four subskills involved in the GSRT are specified in
the Manual (2000) and can thus be used directly; they are literal, inferential, critical, and
affective subskills. In addition, we also looked into the literature with regard to reading
subskills and found that all four subskills, particularly the literal, inferential, and critical subskills, are
commonly used reading attributes. Table 3.2 provides a summary of the definitions of
the four subskills.
Table 3.2 Subskills Specified by the GSRT Developers
Reading subskill   Definition and description
Literal            Forming a global understanding of a passage/text; locating and
                   recognizing information explicitly stated.
Inferential        Inferring information not explicitly stated by using given
                   information; linking text information to make predictions.
Critical           Analyzing, evaluating, or making reasonable judgment about the
                   text, forms of writing or author’s opinions.
Affective          Personal or emotional response to the text.
3.1.3 Coding Procedures and Q-matrix Construction
For the current study, the author along with two other graduate students (2nd year
and 4th year in Educational Psychology program at Penn State) served as the coders.
Think-aloud protocols were not used since the data had been collected long before
they were diagnostically analyzed, making it impractical to sample examinees and
examine the cognitive processes they engaged in when
taking the GSRT.
Coding Procedures. The author introduced both the
GSRT and the four subskills for the test to the other two coders. In the first round, the three coders
worked independently to determine what subskill(s) each item
requires and to create their own Q-matrices. Each coder came up with two
Q-matrices: one containing a single subskill for each item and the other allowing
multiple subskills. Reading through the items in the GSRT shows that some
items require only a single subskill. For example, an item that asks examinees about a
detail that a passage explicitly states measures only the literal
subskill. Other items may involve two or more subskills, one of which is
dominant. The single-subskill Q-matrix therefore records only one subskill for each
item, which is either the only subskill required or the dominant one. The
other Q-matrix, however, records all of the subskills required. After the independent first round, the coding
results were compared and any disagreement was discussed until a consensus was
reached. The inter-rater reliability (i.e., Pearson’s correlation) for independent coding
was 0.861 for the first set of Q-matrices and 0.804 for the second set. After all
disagreements were resolved, all coders reached consensus (i.e., the inter-rater reliability
reached 1.00) that the two sets of common Q-matrices were acceptable based on their own
understanding and standards.
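The text does not state how Pearson’s correlation was aggregated across the three coders. One plausible reading, sketched below purely as an assumption, is to flatten each coder’s binary Q-matrix into a vector, correlate every pair of coders, and average the pairwise values. The three small codings are hypothetical and are not the coders’ actual data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flatten(q_matrix):
    """Turn an items-by-subskills matrix into one long vector of 0/1 entries."""
    return [entry for row in q_matrix for entry in row]


# Hypothetical independent codings of three items on four subskills.
coder_a = [[0, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 0]]
coder_b = [[0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
coder_c = [[0, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 0]]

pairwise = [
    pearson(flatten(first), flatten(second))
    for first, second in combinations([coder_a, coder_b, coder_c], 2)
]
mean_r = sum(pairwise) / len(pairwise)
print([round(r, 3) for r in pairwise], round(mean_r, 3))
```

Identical codings give a correlation of 1, which matches the value reported once consensus was reached.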
Common Q-matrix. For illustration purpose, Table 3.3 displays a portion of the
first common Q-matrix for Form B of the GSRT with each item requiring only one
subskill. Table 3.4 is the other common Q-matrix with some items coded as requiring
multiple reading subskills. Both complete Q-matrices for each of the two forms of the
GSRT can be found in Appendix A and Appendix B.
Table 3.3 Q-matrix for Form B of the GSRT (single subskill)
        Subskill
Item    Literal   Inferential   Critical   Affective
9       0         1             0          0
10      0         1             0          0
11      1         0             0          0
12      1         0             0          0
13      1         0             0          0
14      0         1             0          0
Table 3.4 Q-matrix for Form B of the GSRT (multiple subskills)
        Subskill
Item    Literal   Inferential   Critical   Affective
9       0         1             0          0
10      0         1             1          0
11      1         0             0          0
12      1         0             0          0
13      1         0             0          0
14      0         1             1          0
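For readers who want to work with the Q-matrix programmatically, the rows of Table 3.4 can be encoded as a simple mapping from item number to a 0/1 vector. This is only an illustrative sketch; the variable and function names are hypothetical, and the analyses in this study were run in R rather than with this code.

```python
SUBSKILLS = ("Literal", "Inferential", "Critical", "Affective")

# Rows copied from Table 3.4 (Form B, items 9-14, multiple-subskill coding).
q_matrix_multi = {
    9: (0, 1, 0, 0),
    10: (0, 1, 1, 0),
    11: (1, 0, 0, 0),
    12: (1, 0, 0, 0),
    13: (1, 0, 0, 0),
    14: (0, 1, 1, 0),
}


def required_subskills(q_row):
    """Names of the subskills flagged with 1 in a Q-matrix row."""
    return [name for name, flag in zip(SUBSKILLS, q_row) if flag == 1]


# Items whose row contains more than one required subskill.
multi_items = [item for item, row in q_matrix_multi.items() if sum(row) > 1]

for item, row in q_matrix_multi.items():
    print(item, required_subskills(row))
print("Items coded with multiple subskills:", multi_items)
```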
3.2 Investigating Examinees’ Performance on the GSRT
3.2.1 Participants and Data Collection
The present study used a secondary dataset from a large randomized controlled trial
of the ITSS intervention. Participants of the study included both 4th and 5th graders in
the states of Michigan and Pennsylvania, United States, with a total number of 5,790
students. A check on the GSRT performance data found that there were missing values; a
total of 5,393 students participated in the pretest and 5,125 students participated in the
posttest. The ITSS intervention group had 3.47% (201 cases) missing on pretest and
7.13% (413 cases) missing on posttest, while the control group had 3.38% (196 cases)
missing on pretest and 4.35% (252 cases) missing on posttest. Little’s missing
completely at random (MCAR) test was conducted, which yielded a non-significant
result, χ²(261) = 1740, p = 0.93. This suggested that the missing data could be viewed as
MCAR. Therefore, listwise deletion was employed, and cases with missing values were
dropped from the analysis. The analysis sample size was 4,728 students.
The 4th grade students from 131 classrooms, and the 5th grade students from 128
classrooms, including both rural and suburban schools, participated in the study. The
schools signed a memorandum indicating their agreement to participate in the study. The
participating classrooms were randomly assigned to experimental conditions (ITSS
intervention vs. control) within schools, with participants being approximately balanced
across both conditions. For example, 65 classrooms of 5th grade were assigned to the
ITSS intervention condition and 63 classrooms to the control condition. The students in
the intervention group used the ITSS system as a partial substitute for the standard language
arts curriculum for one class period (30-45 minutes) per week for six to seven months,
whereas those in the control group used only their regular language arts curriculum.
To prevent intervention diffusion, the teachers in charge of the intervention
classrooms were instructed not to share the ITSS software access passwords or relevant
reading materials with other teachers in the school. In addition, students in the control
condition would be given access to ITSS in the next academic year. Form B of the GSRT
was administered to all participants before ITSS instruction. Form A of the GSRT was
administered about six months later when ITSS instruction for the intervention group was
completed under the same conditions as the pretest administration. Participants’ response
data of both pretest and posttest for both intervention and control groups were analyzed
for the current study.
3.2.2 Model Fit Evaluation
As with any other statistical model, the estimated parameters in CDMs are
interpretable only when the model fits the data. When a model is evaluated for its fit to the
data, it is an absolute fit evaluation. On the other hand, with the availability of various
CDMs, it is also important to compare a model with other rival ones and select the most
appropriate model. This way of model fit evaluation is referred to as relative fit
evaluation. Both absolute and relative fit indices can be obtained using the CDM
packages (Robitzsch, Kiefer, George, & Uenlue, 2016) in R.
To obtain the fit statistics, we first fit four CDMs (G-DINA, DINA, ACDM, and
DINO) using the first Q-matrix (i.e., the unidimensional Q-matrix), in which each item
was coded as requiring only one reading subskill for a correct response. This Q-matrix
resulted in equivalent models with the same fit statistics for all four CDMs, which
indicated that when only one subskill is required for each item, there is no need to
distinguish between compensatory and non-compensatory CDMs. We then fit the same four
CDMs using the second Q-matrix (i.e., the multidimensional Q-matrix), in which some
items require multiple subskills. The absolute and relative model fit indices obtained
from this second round of analysis are presented in Tables 3.5 and 3.6 for the pretest
and in Tables 3.7 and 3.8 for the posttest.
Table 3.5 Absolute Fit Statistics for Pretest
G-DINA DINA ACDM DINO
maxX2 759.998 770.650 953.824 787.778
p_maxX2 0 0 0 0
MADcor 0.055 0.059 0.061 0.060
SRMSR 0.080 0.079 0.086 0.080
100*MADRESIDCOV (MADRCOV) 0.882 0.887 0.977 0.903
MADQ3 0.035 0.057 0.038 0.052
MADaQ3 0.036 0.058 0.039 0.053
Table 3.5 summarizes the absolute model fit statistics for the pretest of all
participants, including MADcor, SRMSR, 100*MADRESIDCOV, MADQ3, and
MADaQ3. According to Robitzsch, Kiefer, George, and Uenlue (2016), for each of these
fit statistics, “it holds that smaller values (values near to zero) indicate better fit” (p. 136).
Measures including MADcor, SRMSR, and 100*MADRESIDCOV are in fact effect
sizes of absolute model fit that compare observed and predicted covariances of item
pairs (Ravand & Robitzsch, 2015). The smaller an effect size is, the better a model fits
the data. The G-DINA model yielded the smallest value on each of these statistics
except SRMSR, on which the DINA model was marginally smaller (0.079 vs. 0.080), so
overall the G-DINA model fit better than the DINA, ACDM, and DINO models. The
significance test of absolute model fit indicates that there was significant model
misfit for all four models (p reported as 0). However, the maxX2 value for the
G-DINA model was the smallest (maxX2 = 759.998), suggesting that the other three
models had worse model fit. In summary, the absolute fit statistics showed that the
G-DINA model fit our data the best and was superior to the other models for our study.
Table 3.6 Relative Fit Statistics for Pretest
G-DINA DINA ACDM DINO
Loglike -153045.0 -157153.6 -154443.3 -156911.3
Deviance 306090.0 314307.2 308886.5 313822.5
Npars 189.0 145.0 165.0 145.0
Nobs 5393.0 5393.0 5393.0 5393.0
AIC 306468.0 314597.2 309216.5 314112.5
BIC 307714.0 315553.2 310304.4 315068.5
AIC3 306657.0 314742.2 309381.5 314257.5
AICc 306481.8 314605.3 309227.0 314120.6
CAIC 307903.0 315698.2 310469.4 315213.5
Table 3.6 presents the relative fit statistics for the pretest, in particular
information criteria such as AIC, BIC, AIC3, AICc (i.e., sample-size-adjusted AIC), and
CAIC (consistent AIC). For all the fit statistics, the G-DINA model performed the best
among the four models, as it yielded the smallest value for each fit index.
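As an illustration of how the information criteria in Table 3.6 relate to the deviance, the number of parameters, and the number of observations, the following is a minimal sketch using the standard textbook definitions. The function name `information_criteria` is hypothetical, and the formulas are assumed to match those used by the CDM package because they approximately reproduce the tabled values; this is not the package's own code.

```python
import math

def information_criteria(deviance, n_pars, n_obs):
    """Standard information criteria computed from a deviance (-2 * log-likelihood).

    Hypothetical helper for illustration; textbook definitions are assumed.
    """
    aic = deviance + 2 * n_pars
    return {
        "AIC": aic,
        "BIC": deviance + n_pars * math.log(n_obs),
        "AIC3": deviance + 3 * n_pars,
        # Small-sample correction added to AIC.
        "AICc": aic + 2 * n_pars * (n_pars + 1) / (n_obs - n_pars - 1),
        "CAIC": deviance + n_pars * (math.log(n_obs) + 1),
    }

# Deviance, Npars, and Nobs for the pretest G-DINA and DINA models (Table 3.6).
gdina_pretest = information_criteria(deviance=306090.0, n_pars=189, n_obs=5393)
dina_pretest = information_criteria(deviance=314307.2, n_pars=145, n_obs=5393)

for name in gdina_pretest:
    print(f"{name}: G-DINA = {gdina_pretest[name]:.1f}, DINA = {dina_pretest[name]:.1f}")
```

Because every criterion adds a penalty for the number of parameters to the deviance, the G-DINA model can have the smallest values only if its gain in likelihood outweighs its larger parameter count, which is the case here.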
The absolute and relative model fit indices for posttest are presented in Table 3.7
and Table 3.8, respectively. The G-DINA model performed the best amongst all models
for all fit statistics.
Table 3.7 Absolute Fit Statistics for Posttest
G-DINA DINA ACDM DINO
maxX2 757.101 812.882 989.811 802.082
p_maxX2 0 0 0 0
MADcor 0.050 0.058 0.054 0.057
SRMSR 0.075 0.080 0.078 0.078
100*MADRESIDCOV (MADRCOV) 0.752 0.862 0.813 0.839
MADQ3 0.035 0.053 0.038 0.054
MADaQ3 0.036 0.053 0.037 0.054
Table 3.8 Relative Fit Statistics for Posttest
G-DINA DINA ACDM DINO
Loglike -147549.2 -151017.1 -148333.8 -151109.5
Deviance 295098.5 302034.2 296667.6 302219.1
Npars 201.0 145.0 173.0 145.0
Nobs 5124.0 5124.0 5124.0 5124.0
AIC 295500.5 302324.2 297009.6 302509.1
BIC 296815.4 303272.7 298128.2 303457.6
AIC3 295701.5 302469.2 297180.6 302654.1
AICc 295517.0 302332.7 297021.5 302517.6
CAIC 297016.4 303417.7 298299.2 303602.6
The absolute and relative fit statistics presented in the tables above all suggested
that the G-DINA model fit the 4th- and 5th-grade students’ test response data better
than the three competing models. Therefore, the G-DINA model with the multidimensional
Q-matrix was used to analyze students’ GSRT test performance data for further
diagnostic inferences to address the research questions.
3.3 Data Analysis
To test the three hypotheses in the study, the G-DINA model was fit to
participants’ item response data, separately by pretest and posttest, using the CDM
package in R. More details on the analysis procedure are provided as follows.
On the group level, the subskill mastery patterns of all participants in terms of
their 16 (2^k, with k = 4 subskills) subskill mastery profiles and the estimated
probability of mastery were generated for both pretest and posttest and for both
intervention and control groups. Examinees’ 16 subskill mastery profiles at pretest and
posttest for both intervention and control groups were then compared based on the
estimated probability of each subskill mastery profile, which yields a global picture of
students’ subskill mastery status. The subskill mastery profiles classify students into
different latent classes, which provide information about their strengths and weaknesses
regarding the reading subskills. On the individual subskill level, the probability of
mastery for each of the four subskills on both pretest and posttest was also obtained to
examine the change (gain) in mastery probability from pretest to posttest for both
intervention and control groups.
To test the three hypotheses, a factorial repeated measures (i.e., repeated on four
subskills) ANOVA (RM-ANOVA) was performed. In this model, the outcome variable
was the change/difference of mastery probability of four subskills from pretest to
posttest. Subskill was the within-subject factor and there were three between-subject
factors with each having two levels—experimental condition (1 = intervention group
using ITSS instruction; 0 = control group using regular curriculum, not ITSS instruction),
grade level (1 = 5th grade; 0 = 4th grade), and gender (1 = female; 0 = male).
The first hypothesis was that the effect of ITSS instruction was not the same on
all four subskills; its effect was stronger on literal and inferential subskills so that these
two subskills were significantly promoted through ITSS instruction while critical and
affective subskills were not. To test this hypothesis, the interaction between ITSS
instruction and subskill was examined. If the interaction was not significant at the
0.05 significance level, there would be insufficient evidence to support the claim
that the effect of ITSS instruction differed across subskills.
Two other interactions were investigated as well to test the second and third
hypotheses that the effect of ITSS instruction in improving reading subskills did not
differ by grade level or gender. Specifically, the interaction between subskill and
grade level and the interaction between subskill and gender were examined to test
these two hypotheses, respectively. The hypotheses would be supported if neither
interaction was significant.
Chapter 4 Results
Chapter 4 reports the findings regarding 4th- and 5th-grade students’ performance
on the GSRT on both pretest and posttest in terms of the students’ strengths and
weaknesses in the four reading subskills and the effect of ITSS instruction.
4.1 On Group Level
To begin with, we analyzed the class probabilities or estimated occurrence
proportions of each subskill mastery profile for all students in both pretest and posttest
and intervention and control groups when the G-DINA model was applied. With four
subskills involved in the GSRT, the total number of subskill mastery profiles or latent
classes was 16 (i.e., 2^4).
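The count of 16 latent classes follows from each of the four subskills being either mastered (1) or not mastered (0). A minimal sketch that enumerates the profiles is given below; the subskill order (literal, inferential, critical, affective) follows the [1011] example used in this chapter, while the function name and the lexicographic class ordering are illustrative assumptions rather than the ordering used by any particular software.

```python
from itertools import product

# Subskill order assumed from the profile notation used in this chapter.
SUBSKILLS = ["literal", "inferential", "critical", "affective"]

def mastery_profiles(n_attributes):
    """List every binary mastery profile for n_attributes subskills."""
    return ["".join(str(bit) for bit in bits)
            for bits in product((0, 1), repeat=n_attributes)]

profiles = mastery_profiles(len(SUBSKILLS))
print(len(profiles))              # 16 profiles, i.e., 2 ** 4
print(profiles[0], profiles[-1])  # 0000 (master of none) and 1111 (master of all)

# Decode one profile: 1011 is a non-master of the inferential subskill only.
decoded = dict(zip(SUBSKILLS, (int(c) for c in "1011")))
print(decoded)
```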
The probabilities of each subskill mastery profile for pretest and posttest and for
intervention and control groups based on the G-DINA model are presented in Figure 4.1.
Figure 4.1 Estimated probability of each subskill mastery profile for pretest and posttest
Figure 4.1 presents examinees’ 16 subskill mastery profiles at pretest and
posttest for both intervention and control groups, which is diagnostic information
pertinent to students’ strengths and weaknesses in the four reading subskills. In CDM
applications in retrofitting contexts, it is usually the case that examinees’ flat skill
mastery profiles (either mastery of no subskills or mastery of all subskills) are
predominant. Figure 4.1 shows that many students belong to Class 16, the masters of all
four subskills ([1111]). With regard to the pretest, the performance of the control
group and the intervention group was quite similar in terms of their subskill mastery
profiles, indicating that the reading ability of the two groups was at a similar
level before the ITSS intervention. The proportion of students who were masters of none of
the four subskills (see subskill mastery profile [0000]) decreased from pretest to posttest
for the two experimental groups. In addition, the proportion of students who were
masters of all four subskills (see subskill mastery profile [1111]) increased from pretest
to posttest. According to Liu et al. (2017), it should be expected in most retrofitting
contexts that the majority of examinees are classified either as non-masters of every
subskill ([0000]) or as masters of all subskills ([1111]).
The above pattern of profiles also suggests that the proportion of students who were
non-masters of inferential comprehension but masters of the other three subskills (see
the subskill mastery profile of [1011]) decreased sharply from pretest to posttest for
both experimental groups. In addition to random measurement error, this might be because
the inferential subskill grew so rapidly from pretest to posttest that fewer
students were non-masters of this subskill at posttest. This will be further discussed in
the subsequent section about the growth rate of the four reading subskills.
4.2 On Individual Subskill Level
We then looked into students’ mastery levels of each individual subskill by
examining their subskill mastery profile. Table 4.1 shows the subskill probability for
each reading subskill in pretest and posttest by experimental condition.
Table 4.1 Subskill Mastery Probability in Pretest and Posttest
Pretest Posttest
Subskills Control Intervention Control Intervention
Literal 0.6408 0.6512 0.6554 0.6876
Inferential 0.4906 0.4784 0.6051 0.6160
Critical 0.6533 0.6610 0.6696 0.7064
Affective 0.6280 0.6362 0.6904 0.7247
Table 4.1 indicates that among the four reading subskills, the critical subskill, a
higher-order reading subskill as deemed by many researchers, was the subskill best
mastered at pretest for both groups. This finding was in line with some researchers’
views introduced in Chapter 2 that lower-level reading subskills are not prerequisites for
higher-level reading subskills. Therefore, it is plausible that readers such as
elementary school students master higher-order reading subskills before they
master the lower-level subskills.
Figure 4.2 presents the change/difference of subskill mastery probability from
pretest to posttest for both groups as well as the difference between intervention and
control groups at both pretest and posttest.
Figure 4.2 Difference of probability of mastery of each subskill at pretest and posttest by
intervention condition
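The pretest-to-posttest change in mastery probability can be recomputed directly from the values reported in Table 4.1. The sketch below is purely descriptive arithmetic on those tabled means; the variable names are hypothetical, and the differences it produces are not the significance tests reported in this chapter.

```python
# (pretest, posttest) mastery probabilities copied from Table 4.1.
table_4_1 = {
    "Literal":     {"control": (0.6408, 0.6554), "intervention": (0.6512, 0.6876)},
    "Inferential": {"control": (0.4906, 0.6051), "intervention": (0.4784, 0.6160)},
    "Critical":    {"control": (0.6533, 0.6696), "intervention": (0.6610, 0.7064)},
    "Affective":   {"control": (0.6280, 0.6904), "intervention": (0.6362, 0.7247)},
}

# Gain = posttest probability minus pretest probability, per subskill and group.
gains = {
    subskill: {group: round(post - pre, 4) for group, (pre, post) in groups.items()}
    for subskill, groups in table_4_1.items()
}

for subskill, by_group in gains.items():
    print(f"{subskill:<12} control = {by_group['control']:.4f}  "
          f"intervention = {by_group['intervention']:.4f}")

# Subskill with the largest descriptive gain in each group.
largest = {group: max(gains, key=lambda s: gains[s][group])
           for group in ("control", "intervention")}
print(largest)
```

In these descriptive terms, the inferential subskill shows the largest gain in both groups, and the intervention group's gain exceeds the control group's gain on every subskill.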
Figure 4.2 shows that the control and intervention groups performed similarly at
pretest with regard to the probability of mastery of all four reading subskills, which was
what we expected. At pretest, there was no significant difference between the control
and intervention conditions in the estimated probability of mastery of the literal subskill,
t(5382.3) = -0.092, p = 0.927. Likewise, there was no significant difference for the
inferential, critical, and affective subskills between the two treatment groups at pretest,
t(5379.4) = 1.368, p = 0.171; t(5383.5) = -0.078, p = 0.937; and t(5386.1) = -0.489, p =
0.624, respectively. Figure 4.2 also shows a larger difference in the probability of
mastery between pretest and posttest for all four subskills, the inferential subskill in
particular, for students in the intervention condition than for students in the control
condition, which was consistent with the research findings in the literature that the
structure strategy is effective in improving readers’ reading comprehension ability.
A four-way RM-ANOVA was subsequently conducted to test the three research
hypotheses. Prior to the analysis, the missing data pattern was explored, and the missing
values were found to be missing completely at random across all observations.
One of the assumptions of RM-ANOVA is sphericity, which refers to the equality of
the variances of the differences between all pairs of levels of the repeated-measures
factor. To test whether the sphericity assumption was satisfied, Mauchly’s Test of
Sphericity was used. The test yielded a highly significant result, W = 0.051,
χ2(5) = 18453.497, p < 0.001, which suggested that the sphericity assumption was not met.
Therefore, the Greenhouse-Geisser and Huynh-Feldt epsilon corrections to the degrees of
freedom were applied. The corresponding correction coefficients were Greenhouse-Geisser
ε = 0.417 and Huynh-Feldt ε = 0.418. Table 4.2 summarizes the results of the four-way
RM-ANOVA for within-subjects and between-subjects effects.
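To make the role of the epsilon correction concrete, the sketch below shows how a corrected numerator degrees of freedom is obtained by multiplying the uncorrected value (number of levels minus one) by epsilon. The function names are hypothetical; the calculation is the standard definition of the correction and is consistent with the df of 1.251 reported for the within-subject effects in Table 4.2.

```python
def corrected_effect_df(epsilon, n_levels):
    """Epsilon-corrected numerator df for a repeated-measures factor."""
    return epsilon * (n_levels - 1)

def epsilon_lower_bound(n_levels):
    """Smallest possible epsilon, reached under maximal violation of sphericity."""
    return 1.0 / (n_levels - 1)

N_SUBSKILLS = 4             # levels of the within-subject factor
GREENHOUSE_GEISSER = 0.417  # epsilon reported in the text

print(round(corrected_effect_df(GREENHOUSE_GEISSER, N_SUBSKILLS), 3))  # 1.251
print(round(epsilon_lower_bound(N_SUBSKILLS), 3))                      # 0.333
```

An epsilon of 1 would leave the degrees of freedom unchanged at 3; the reported value of 0.417 lies close to the lower bound of 1/3, which reflects the strong departure from sphericity indicated by Mauchly’s test.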
Table 4.2 Repeated Measures Analysis of Variance for Within-Subjects and Between-
Subjects Effects
Source df MS F p (Greenhouse-Geisser) p (Huynh-Feldt)
Within-subject
Subskill 1.251 33.074 129.965 .000 .000
Subskill*Female 1.251 .767 3.014 .073 .073
Subskill*Grade 1.251 .357 1.405 .242 .242
Subskill*ITSS 1.251 .028 .110 .796 .796
Subskill*Female*Grade 1.251 .208 .819 .391 .391
Subskill*Female*ITSS 1.251 .132 .517 .512 .512
Subskill*Grade*ITSS 1.251 .535 2.104 .142 .142
Subskill*Female*Grade*ITSS 1.251 .205 .807 .395 .395
Error 7767.403 .254
Between-subject
Intercept 1 102.086 213.986 .000***
Female 1 .005 .010 .920
Grade 1 .960 2.011 .156
ITSS 1 3.905 8.186 .004**
Female*Grade 1 .064 .134 .714
Female*ITSS 1 .403 .845 .358
Grade*ITSS 1 1.055 2.211 .137
Female*Grade*ITSS 1 .005 .010 .921
Error 2961.643 .477
Note: Female coded as 1 = Female and 0 = Male; Grade coded as 1 = Grade 5 and 0 =
Grade 4; ITSS coded as 1 = ITSS intervention (i.e., intervention group) and 0 = no ITSS
intervention (i.e., control group).
*p < .05. **p < .01. ***p < .001.
Table 4.2 indicates that the main effect of ITSS instruction was statistically
significant at the 0.05 significance level (p = 0.004), suggesting that ITSS instruction was
effective in promoting students’ reading subskills. This finding was consistent with the
findings in the literature that ITSS instruction can promote reading comprehension.
Hypothesis 1. Table 4.2 shows that the interaction between subskill and ITSS
instruction was not significant at the 0.05 significance level (p = 0.796), suggesting that
there was insufficient evidence in the data that ITSS instruction was
differentially effective on different reading subskills. Therefore, the first hypothesis,
that the effect of ITSS instruction differed across subskills and was greater for the
literal and inferential subskills, was not supported.
Hypotheses 2 and 3. The second and third hypotheses were that the effect of
ITSS instruction in improving reading subskills did not vary by grade level or by gender.
Table 4.2 indicates that the interaction between subskill and grade level was not
significant at the 0.05 significance level (p = 0.242), indicating that there was
insufficient evidence in our data that the effect of ITSS instruction differed
between Grade 4 and Grade 5. Likewise, the same table shows that the interaction
between subskill and gender was not significant at the 0.05 significance level (p = 0.073)
either, indicating that there was insufficient evidence in our data that the effect
of ITSS instruction differed between female and male students.
Consequently, both Hypothesis 2 and Hypothesis 3 were supported by the data collected.
In addition, the main effect of subskill was significant at the 0.05 significance
level (p < 0.001), so we followed this effect up with post hoc tests. The results are
presented in Table 4.3.
Table 4.3 Mean Difference t-test
Comparison Mean difference p Lower Upper (95% simultaneous confidence interval)
L - I -.102 .000*** -.122 -.082
L - C -.006 .209 -.015 .002
L - A -.051 .002** -.057 -.045
I - C .095 .000*** .073 .117
I - A .051 .000*** .030 .071
C - A -.045 .000*** -.049 -.040
Note: L = Literal; I = Inferential; C = Critical; A = Affective.
*p < .05. **p < .01. ***p < .001.
A Bonferroni correction has been applied for multiple comparisons.
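For readers unfamiliar with the Bonferroni correction, the following minimal sketch shows how the six pairwise comparisons among four subskills arise and how the correction can be expressed either as a stricter per-comparison alpha or as an adjusted p-value. The function name `bonferroni_adjust` is hypothetical, and which of the two equivalent forms underlies the p-values in Table 4.3 is not stated in the text.

```python
from itertools import combinations

SUBSKILLS = ["L", "I", "C", "A"]  # Literal, Inferential, Critical, Affective
ALPHA = 0.05

# All unordered pairs of the four subskills: 4 choose 2 = 6 comparisons.
pairs = list(combinations(SUBSKILLS, 2))
n_comparisons = len(pairs)

# Form 1: compare each unadjusted p-value with alpha / m.
alpha_per_comparison = ALPHA / n_comparisons

# Form 2: multiply each unadjusted p-value by m (capped at 1), compare with alpha.
def bonferroni_adjust(p_value, m):
    return min(1.0, p_value * m)

print(pairs)
print(n_comparisons, round(alpha_per_comparison, 4))
print(bonferroni_adjust(0.01, n_comparisons))
```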
Table 4.3 presents the specific mean difference between a pair of subskills with
regard to the difference of subskill mastery probability from pretest to posttest, which
represents the growth rate of subskills between pretest and posttest. It follows that the
growth rate or magnitude of growth was not the same across the four reading subskills.
The largest mean difference was observed between literal and inferential subskills, and
the smallest significant difference was between critical and affective subskills. Both
Table 4.3 and Figure 4.2 suggest that the growth rate for the four subskills was different,
with inferential subskill having the greatest growth rate from pretest to posttest. The
greatest growth rate for inferential subskill may be accounted for by the following
reasons.
First, there exist differences in the developmental trajectories of reading subskills
(e.g., Leppanen, Niemi, Aunola, & Nurmi, 2004; Paris, 2005), which have often been
neglected in theories about reading (Paris, 2005). Research in the literature has shown
that reading skills neither develop in a linear fashion nor grow at the same rate. Paris
(2005) postulated that reading skills involve both constrained and unconstrained skills,
with the former being composed of “alphabetic knowledge, phonemic awareness, and
oral reading fluency” (p. 187). Constrained skills develop rapidly from non-existent to
high in childhood, and they are “constrained developmentally, conceptually, and
by measurement” (p. 192). Take alphabetic knowledge, for example: children may have
mastered the 26 letters at a very young age. In contrast, unconstrained
reading skills, such as the four reading skills specified in the present study, continue to
develop across a lifetime, and they develop at varying rates.
Second, with varying growth rates for different reading subskills, the inferential
subskill showed the greatest growth from pretest to posttest for both the intervention
group and the control group. This might be because teachers in these schools
emphasized this subskill more than other skills. Hansen (1981) provided a strategy for
improving the inferential comprehension subskill. According to the author, reading
comprehension skills need to be taught in school, which is also the actual practice in
U.S. elementary schools. The author further pointed out that reading teachers differ in
what they emphasize when teaching reading skills, such that some teachers emphasize the
literal skill whereas others emphasize drawing inferences. It is possible that the large
difference in the inferential subskill from pretest to posttest arose because teachers
emphasized instruction on the inferential subskill.
Third, the varying growth rates of the four reading skills may also be caused by
differences between the two forms (i.e., Form A and Form B) of the GSRT, even
though they are supposed to be parallel test forms. In addition, practice effects from
taking the test might also contribute to the difference in growth rate.
In summary, the first hypothesis was partially supported in that all four subskills,
not only the literal and inferential subskills as hypothesized, were significantly
improved through ITSS instruction. The growth rate from pretest to posttest, however,
was not the same across the four subskills, with the inferential subskill having the
greatest growth rate. Possible reasons for the difference in growth rate were also
discussed from multiple perspectives. The second and third
hypotheses that the effect of ITSS instruction was the same between 4th and 5th graders
and between female and male students were both supported by our data.
Chapter 5 Discussion, Implication, Limitations, and Future Research
This chapter discusses the overall findings of the present study and its
implications for reading instructional designs and learning and cognitive diagnostic
assessment of reading tests. In addition, limitations of the study and future research
directions are also discussed.
5.1 Discussion of the Overall Findings
The present study evaluated the effect of ITSS instruction on 4th and 5th grade
students’ reading ability in terms of reading subskills as represented in the standardized
reading test, Gray Silent Reading Test, by applying a cognitive diagnostic modeling
approach. Four reading subskills are involved in the GSRT as specified by the test
developers: literal, inferential, critical, and affective subskills. The G-DINA model, a
generalized cognitive diagnostic model, was applied because it fitted the students’ GSRT
response data the best from both absolute and relative fit evaluations.
The findings yielded clear evidence that all four subskills of 4th- and 5th-grade
students (literal, inferential, critical, and affective) were significantly improved
through ITSS instruction compared with those of students who used the regular reading
instruction curriculum. That the critical and affective subskills were also significantly
improved through ITSS instruction was not expected, as the structure strategy is
primarily focused on literal and inferential reading subskills.
On the other hand, this finding can be sensibly explained by the strong
correlations between the reading subskills. Many studies in reading have demonstrated
the strong correlations between reading subskills, for instance, between low-level skills
and higher-level skills (e.g., Landi, 2010; Landi & Perfetti, 2007; Shankweiler, 1989).
Liu and his colleagues claimed that attributes or subskills obtained from retrofitting an
assessment not originally developed for diagnostic purpose are expected to be highly
correlated because the test is usually unidimensional when designed under CTT or
conventional IRT framework (Liu, Huggins-Manley, & Bulut, 2017). When the
correlations are close to 1, it follows that examinees are classified into
either the all-mastery or the non-mastery profile, namely, the flat skill mastery profile
introduced in Chapter 2. As a result, when reading subskills are closely related to one
another, it is highly likely that the critical and affective subskills also improve when
the literal and inferential subskills are enhanced through ITSS instruction.
Although all four reading subskills were significantly improved through ITSS
instruction, their growth rates from pretest to posttest were not the same, with the
inferential subskill having the greatest growth. No significant difference in growth was
found between the literal and critical subskills, but there were significant differences
between all other pairs of subskills.
Furthermore, for the four significantly improved reading subskills, no statistically
significant evidence was obtained that ITSS instruction was differentially effective in
promoting them.
In terms of the effect of ITSS instruction by grade level and by gender, our data
supported the hypotheses that the effect in improving reading subskills did not differ
between 4th- and 5th-grade students or between female and male students.
In summary, our findings demonstrated that ITSS instruction was effective in
significantly improving all four reading subskills as represented in the GSRT, which
agreed with the findings in the literature that ITSS instruction significantly promotes
learners’ reading comprehension achievement. The present research adds to the
literature by providing more fine-grained information as to which specific reading skills
were improved through ITSS instruction and by showing that the effect of ITSS instruction
was the same in our sample for 4th and 5th graders and for female and male students.
5.2 Implications for Reading Instruction and CDA of Reading
Under both CTT and traditional unidimensional IRT frameworks, learners’
reading comprehension performance is represented as a single score. Consequently,
when a reading test is taken, neither policy-makers, teachers, nor students themselves
are clear about the students’ strengths or weaknesses in terms of reading subskills. With
cognitive diagnostic modeling analyses, however, students are classified into
multidimensional reading subskill profiles. Accordingly, with the fine-grained feedback
provided by the CDM analysis results, teachers can learn whether an individual is a
master or non-master of a specific subskill.
The findings of the present study have several implications for future reading
instructional designs or reading interventions. First, the structure strategy can be
provided to students as a supplement to the standard language arts curriculum to enhance
students’ reading ability, since it has consistently been shown to be an effective reading
strategy. Second, through the CDM assessment performed in this study, students’
strengths and weaknesses in their reading subskills were identified and represented as
different latent classes, which teachers can use to remediate students’ weaknesses
by providing additional reading materials or even personalized reading intervention
programs. Students who are weak in certain subskills can focus more on their
weaknesses and take action accordingly.
In addition, the present study also has important implications for cognitive
diagnostic assessment of reading comprehension. The present study used the G-DINA
model in addressing the research questions. In the literature, however, the G-DINA
model has not been used much for the cognitive diagnostic assessment of reading tests.
Instead, the Fusion model and the DINA model have been much more commonly used. In the
present study, the G-DINA model was shown to fit our data better than the DINA model
in the model fit evaluation. The successful application of the G-DINA
model in the current study provides evidence that this cognitive diagnostic model can be
a good option for reading assessments in extracting readers’ mastery and non-mastery
reading skill profiles.
5.3 Limitation and Future Research
The present study used a cognitive diagnostic modeling approach with the retrofitting
method to address the research questions about which subskill(s) as represented in the GSRT
would be significantly improved through ITSS instruction, whether the effect was the
same across subskills, and whether the effect was the same by grade level and by gender.
The study has a couple of limitations. First of all, the primary limitation lies in
the retrofitting approach utilized in the study. Popular as retrofitting is, Liu,
Huggins-Manley, and Bulut (2017) pointed out the problems and challenges that the
retrofitting approach poses in terms of assessment design and statistical quality, because
cognitive diagnostic models are applied to non-diagnostic tests. Therefore, the CDA
approach should be used with caution. Future research on the CDA approach should
focus more on constructing tests from purposely developed diagnostic items that enable
researchers to infer the skill processes underlying the item
responses, rather than on the retrofitting method; that is, it should follow the first
approach to performing CDA analyses introduced in Chapter 2.
Another limitation is that examinees’ think-aloud protocols were not used in the
present study as a source for creating the Q-matrix, since it was impractical to recruit a
sample of examinees at the time when this study was conducted. As noted earlier,
experts’ coding might differ from the approach that examinees actually use in
practice, because examinees may approach questions very differently. Therefore,
examinees’ think-aloud protocols may be a very valuable source in developing and
validating the Q-matrix, which is critical in conducting any cognitive diagnostic
modeling study. Should conditions allow, future studies applying the CDM approach should
take this source into consideration.
References
Alderson, J.C. (1990). Testing reading comprehension skills (Part One). Reading in a
Foreign Language, 6 (2), 425-438.
Alderson, J.C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as
embodied in test questions. Reading in a Foreign Language, 5, 253-270.
Alfassi, M. (2004). Reading to learn: Effects of combined strategy instruction on high
school students. Journal of Educational Research, 97, 171-184.
Anderson, R. C., & Pearson, P. D. (1984). A schema-theoretic view of basic processes in
reading. In P. D. Pearson (Ed.), Handbook of reading research (pp. 255-292). New
York: Longman.
Bloom, B. S. (1956). Taxonomy of educational objectives: The classification of
educational goals. New York: Longmans, Green.
Carr, T. H., & Levy, B. A. (Eds.). (1990). Reading and its development: Component skills
approaches. San Diego: Academic Press.
Carrell, L. P. (1985). Facilitating ESL Reading by Teaching Text Structure. TESOL
Quarterly, 19(4), 727-752.
Chen, H., & Chen, J. (2015). Exploring reading comprehension skill relationships
through the G-DINA model. Educational Psychology, 1-20.
Chen, J. S., & de la Torre, J. (2014). A Procedure for Diagnostically Modeling Extant
Large-Scale Assessment Data: The Case of the Programme for International
Student Assessment in Reading. Psychology, 5, 1967-1978.
Davis, F. (1944). Fundamental factors of comprehension in reading. Psychometrika, 9,
185-197.
de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179-
199.
Englert, C. S., & Thomas, C. C. (1987). Sensitivity to text structure in reading and
writing: A comparison between learning disabled and non-learning disabled
students. Learning Disability Quarterly, 10(2), 93-105.
Gao, L. (2007). Cognitive-psychometric modeling of the MELAB reading items
(Unpublished doctoral dissertation). University of Alberta, Edmonton, Alberta.
George, A. C., Robitzsch, A., Kiefer, T., Groß, J., & Ünlü, A. (2016). The R package
CDM for cognitive diagnosis models. Journal of Statistical Software, 74(2), 1-24.
Gierl, M. J., Cui, Y., & Zhou, J. (2009). Reliability and attribute-based scoring in
cognitive diagnostic assessment. Journal of Educational Measurement, 46, 293-
313.
Gierl, M. J., Leighton, J. P., Wang, C., Zhou, J., Gokiert, R., & Tan, A. (2009).
Validating cognitive models of task performance in algebra on the SAT®. New
York: The College Board.
Gierl, M. J., Roberts, M., Alves, C., & Gotzmann, A. (2009). Using judgments from
content specialists to develop cognitive models for diagnostic assessments. Paper
presented at the annual meeting of National Council on Measurement in Education,
San Diego, CA.
Gray, W. S. (1960). The major aspects of reading. In H. Robinson (Ed.), Sequential
development of reading abilities (Vol. 90, pp. 8-24). Chicago: Chicago University
Press.
Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of
achievement items. Journal of Educational Measurement, 26, 333-352.
Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive
abilities: Blending theory with practicality (Unpublished doctoral dissertation).
University of Illinois at Urbana-Champaign, Urbana-Champaign, IL.
Hartz, S., & Roussos, L. (2008). The Fusion model for skills diagnosis: Blending theory
with practice (ETS research report, No. RR-08-71). Princeton, NJ: Educational
Testing Service.
Hansen, J. (1981). An inferential comprehension strategy for use with primary grade
children. The Reading Teacher, 34(6), 665-669.
Hou, L., de la Torre, J., & Nandakumar, R. (2014). Differential item functioning
assessment in cognitive diagnostic modeling: Application of the Wald test to
investigate DIF in the DINA model. Journal of Educational Measurement, 51(1),
98-125.
Huebner, A. (2010). An overview of recent developments in cognitive diagnostic
computer adaptive assessments. Practical Assessment, Research & Evaluation,
15(3), 1-7.
Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In
J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education:
Theory and applications (pp. 19-60). Cambridge, UK: Cambridge University Press.
Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on teaching
and learning in the context of NG-TOEFL (Unpublished doctoral dissertation).
University of Illinois at Urbana-Champaign, Urbana-Champaign, IL.
Jang, E. E. (2008). A Framework for Cognitive Diagnostic Assessment. Paper presented
at conference held at Iowa State University on September 21 and 22 of 2007, Ames,
IA.
Jang, E. E. (2009a). Demystifying a Q-matrix for making diagnostic inferences about L2
reading skills. Language Assessment Quarterly, 6(3), 210-238.
Jang, E. E. (2009b). Cognitive diagnostic assessment of L2 reading comprehension
ability: Validity arguments for applying Fusion Model to LanguEdge assessment.
Language Testing, 26(1), 31-73.
Jang, E. E., Dunlop, M., Wagner, M., Kim, Y., & Gu, Z. (2013). Elementary school
ELLs’ reading skill profiles using cognitive diagnosis modeling: Roles of length of
residence and home language environment. Language Learning, 63(3), 400-436.
Johnson-Glenberg, M. C. (2000). Training reading comprehension in adequate
decoders/poor comprehenders: Verbal versus visual strategies. Journal of
Educational Psychology, 92, 772-782.
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions,
and connections with nonparametric item response theory. Applied Psychological
Measurement, 25(3), 258-272.
Kim, A. Y. (2011). Examining second language reading components in relation to
reading test performance for diagnostic purposes: A fusion model approach
(Unpublished doctoral dissertation). Columbia University, New York, NY.
Kim, H. S. (2011). Diagnosing examinees’ attributes-mastery using the Bayesian
inference for binomial proportion: A new method for cognitive diagnostic
assessment (Unpublished doctoral dissertation). Georgia Institute of Technology,
Atlanta, GA.
Kintsch, W., & Kintsch, E. (2005). Comprehension. In S. G. Paris & S. A. Stahl (Eds.),
Children’s reading comprehension and assessment (pp. 71-92). Mahwah, NJ:
Erlbaum.
Landi, N. (2010). An examination of the relationship between reading comprehension,
higher-level and lower-level reading sub-skills in adults. Reading and Writing,
23(6), 701-717.
Landi, N., & Perfetti, C. A. (2007). An electrophysiological investigation of semantic and
phonological processing in skilled and less skilled comprehenders. Brain and
Language, 102, 30-45.
Lee, Y-W., & Sawaki, Y. (2009). Cognitive diagnosis approaches to language
assessment: An overview. Language Assessment Quarterly, 6(3), 172-
189.
Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used
in educational measurement to make inferences about examinees’ thinking
processes. Educational Measurement: Issues and Practice, 26(2), 3-16.
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for
cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of
Educational Measurement, 41(3), 205-237.
Leppanen, U., Niemi, P., Aunola, K., & Nurmi, J.-E. (2004). Development of reading
skills among preschool and primary school pupils. Reading Research Quarterly,
39(1), 72-93.
Li, H. (2011). Evaluating language group differences in the subskills of reading using a
cognitive diagnostic modeling and differential skill functioning approach
(Unpublished doctoral dissertation). Pennsylvania State University, State College,
PA.
Li, H., Hunter, C. V., & Lei, P-W. (2015). The selection of cognitive diagnostic models
for a reading comprehension test. Language Testing, 1(19), 1-19.
Li, H., & Suen, H. K. (2013). Detecting native language group differences at the subskills
level of reading: A differential skill functioning approach. Language Testing, 30(2),
273-298.
Lissitz, B., Jiao, H., Li, M., Lee, D. Y., & Kang, Y. (2014). Cognitive Diagnostic Models:
Executive Report for the Maryland State Department of Education by the MARC
Team. Retrieved from
http://marces.org/current/ExecutiveReport_MARC_2014_Cognitive%20Diagnostic%20Models.pdf
Liu, R., Huggins-Manley, A. C., & Bulut, O. (2017). Retrofitting diagnostic
classification models to responses from IRT-based assessment forms.
Educational and Psychological Measurement, 27, 1-27.
Lunzer, E., Waite, M., & Dolan, T. (1979). Comprehension and comprehension tests. In
E. Lunzer & K. Gardner (Eds.), The effective use of reading (pp. 37-71). London:
Heinemann Educational Books.
Ma, X., & Meng, Y. (2014). Towards Personalized English Learning Diagnosis:
Cognitive Diagnostic Modelling for EFL Listening. Asian Journal of Education and
e-Learning, 2(5), 336-348.
Maris, E. (1995). Psychometric latent response models. Psychometrika, 60(4), 523-547.
Matthews, M. (1990). Skill taxonomies and problems for the testing of reading. Reading
in a Foreign Language, 7(1), 511-517.
McNamara, D. S. (2007). Reading comprehension strategies: Theories, interventions,
and technologies. New York: Lawrence Erlbaum Associates.
Meyer, B. J. F. (1975). The organization of prose and its effects on memory. Amsterdam:
North Holland.
Meyer, B. J. F., Brandt, D. M., & Bluth, G. J. (1980). Use of the top-level structure in
text: Key for reading comprehension of ninth-grade students. Reading Research
Quarterly, 16, 72-103.
Meyer, B. J. F., Middlemiss, W., Theodorou, E., Brezinski, K. L., McDougall, J., &
Bartlett, B. J. (2002). Effects of structure strategy instruction delivered to fifth-
grade children using the internet with and without the aid of older adult tutors.
Journal of Educational Psychology, 94(3), 486-519.
Meyer, B. J. F., & Poon, L. W. (2001). Effects of structure strategy training and signaling
on recall of text. Journal of Educational Psychology, 93, 141-159.
Meyer, B. J. F., & Rice, G. E. (1989). Prose processing in adulthood: The text, the reader,
and the task. In L. W. Poon, D. C. Rubin, & B. A. Wilson (Eds.), Everyday
cognition in adulthood and later life (pp. 157-194). New York, NY: Cambridge
University Press.
Meyer, B. J. F., & Wijekumar, K. (2007). A web-based tutoring system for the structure
strategy: Theoretical background, design, and findings. In D. S. McNamara (Ed.),
Reading comprehension strategies: Theories, interventions, and technologies (pp.
347-375). Mahwah, NJ: Lawrence Erlbaum Associates.
Meyer, B. J. F., Wijekumar, K., & Lin, Y. (2011). Individualizing a web-based structure
strategy intervention for fifth graders’ comprehension of nonfiction. Journal of
Educational Psychology, 103(1), 140-168.
Meyer, B. J. F., Wijekumar, K., Middlemiss, W., Higley, K., Lei, P., Meier, C., &
Spielvogel, J. (2010). Web-based tutoring of the structure strategy with or without
elaborated feedback or choice for fifth- and seventh-grade readers. Reading
Research Quarterly, 45(1), 62-92.
Meyer, B. J. F., Young, C. J., & Bartlett, B. J. (1989). Memory improved: Reading and
memory enhancement across the life span through strategic text structures.
Hillsdale, NJ: Lawrence Erlbaum.
Munby, J. (1978). Communicative syllabus design. Cambridge, UK: Cambridge
University Press.
National Assessment of Educational Progress (NAEP). (2007). Available online at
http://nationsreportcard.gov/reading_2007/.
National Research Council. (2001). Knowing what students know: The science and
design of educational assessment. Washington, DC: National Academies Press.
Paris, S. G. (2005). Reinterpreting the development of reading skills. Reading Research
Quarterly, 40(2), 184-202.
RAND Reading Study Group. (2002). Reading for understanding: Toward an R&D
program in reading comprehension. Washington, DC: RAND Education.
Ravand, H., Barati, H., & Widhiarso, W. (2013). Exploring Diagnostic Capacity of a
High Stakes Reading Comprehension Test: A Pedagogical Demonstration. Iranian
Journal of Language Testing, 3(1).
Ravand, H., & Robitzsch, A. (2015). Cognitive diagnostic modeling using R. Practical
Assessment, Research & Evaluation, 20(11), 1-12.
Robitzsch, A., Kiefer, T., George, A. C., & Uenlue, A. (2016). CDM: Cognitive
Diagnosis Modeling. R package version 5.1-0.
https://CRAN.R-project.org/package=CDM
Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification
models: A comprehensive review of the current state-of-the-art. Measurement:
Interdisciplinary Research and Perspectives, 6(4), 219-262.
Rupp, A. A., Templin, J. L., & Henson, R. A. (2010). Diagnostic assessment: Theory,
methods, and applications. New York: Guilford Press.
Sawaki, Y., Kim, H. J., & Gentile, C. (2009). Q-Matrix construction: Defining the link
between constructs and test items in large-scale reading and listening
comprehension assessments. Language Assessment Quarterly, 6(3), 190-209.
Scarborough, H. S. (2001). Connecting early language and later reading (dis)abilities:
Evidence, theory, and practice. In S. B. Neuman & D. K. Dickinson (Eds.),
Handbook of early literacy research (pp. 97-110). New York: The Guilford Press.
Shankweiler, D. (1989). How problems of comprehension are related to difficulties in
decoding. In D. Shankweiler & I. Y. Liberman (Eds.), Phonology and reading
disability: Solving the reading puzzle (pp. 35-68). Ann Arbor, MI: The University
of Michigan Press.
Sung, Y., Chang, K., & Huang, J. (2008). Improving children’s reading comprehension
and use of strategies through computer-based strategy training. Computers in
Human Behavior, 24, 1552-1571.
Tatsuoka, K. K., & Tatsuoka, M. M. (1982). Detection of aberrant response patterns and
their effect on dimensionality. Journal of Educational Statistics, 7, 215-231.
Templin, J. L., & Bradshaw, L. (2013). Measuring the reliability of diagnostic
classification model examinee estimates. Journal of Classification, 30(2), 251-275.
Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using
cognitive diagnosis models. Psychological Methods, 11(3), 287-305.
Tracey, D. H., & Morrow, L. M. (2006). Lenses on reading: An introduction to theories
and models. New York: Guilford Press.
von Davier, M. (2008). A general diagnostic model applied to language testing data.
British Journal of Mathematical and Statistical Psychology, 61(2), 287.
Wang, C., & Gierl, M. J. (2011). Using the attribute hierarchy method to make diagnostic
inferences about examinees’ cognitive skills in critical reading. Journal of
Educational Measurement, 48(2), 165-187.
Wang, C., Gierl, M. J., & Leighton, J. P. (2006). Investigating the cognitive attributes
underlying student performance on a foreign language reading test: An application
of the attribute hierarchy method. Paper presented at the annual meeting of the
National Council on Measurement in Education, San Francisco, California.
Wiederholt, J. L., & Blalock, G. (2000). Gray Silent Reading Tests: Examiners’ Manual.
Austin, Texas: PRO-ED, Inc.
Wijekumar, K., Meyer, B. J. F., & Lei, P. (2012). Large-scale randomized controlled trial
with 4th graders using intelligent tutoring of the structure strategy to improve
nonfiction reading comprehension. Educational Technology Research and
Development, 60(6), 987-1013.
Wijekumar, K., Meyer, B. J. F., Lei, P., Lin, Y., Johnson, L. A., Spielvogel, J. A.,
Shurmatz, K. M., Ray, M., & Cook, M. (2014). Multisite randomized controlled
trial examining intelligent tutoring of structure strategy for fifth-grade readers.
Journal of Research on Educational Effectiveness, 7, 331-357.
Williams, J. P., Hall, K. M., Lauer, K. D., Stafford, K. B., DeSisto, L. A., & deCani, J. S.
(2005). Expository text comprehension in the primary grade classroom. Journal of
Educational Psychology, 97(4), 538-550.
Xu, G., & Zhang, S. (2016). Identifiability of diagnostic classification models.
Psychometrika, 81(3), 625-649.
Xu, X., & von Davier, M. (2006). Cognitive diagnosis for NAEP proficiency data
(Research Report No. RR-06-08). Princeton, NJ: Educational Testing Service.
Appendix A. Q-matrix for Form A of the GSRT
Item Literal Inferential Critical Affective
1 0 1 0 0
2 1 0 0 0
3 0 1 1 0
4 0 1 1 0
5 0 1 1 0
6 1 0 0 0
7 0 1 0 0
8 0 1 1 0
9 0 1 1 0
10 0 1 1 0
11 1 0 0 0
12 1 0 0 0
13 0 1 0 0
14 0 0 0 1
15 0 1 1 0
16 1 0 0 0
17 1 0 0 0
18 0 1 0 0
19 0 0 0 1
20 0 1 1 0
21 0 1 0 0
22 1 0 0 0
23 1 0 0 0
24 0 0 0 1
25 0 1 1 0
26 1 0 0 0
27 1 0 0 0
28 0 1 1 0
29 0 1 1 0
30 0 1 1 0
31 1 0 0 0
32 0 1 1 0
33 0 0 0 1
34 0 1 1 0
35 0 1 1 0
36 1 0 0 0
37 1 0 0 0
38 0 1 0 0
39 0 1 1 0
40 0 1 1 0
41 1 0 0 0
42 1 0 0 0
43 0 1 1 0
44 0 1 1 0
45 0 1 1 0
46 0 1 1 0
47 0 1 0 0
48 0 1 0 0
49 0 1 1 0
50 0 1 1 0
51 1 0 0 0
52 0 1 0 0
53 0 1 0 0
54 1 0 0 0
55 0 1 1 0
56 0 1 0 0
57 1 0 0 0
58 0 1 0 0
59 0 1 1 0
60 0 1 1 0
61 0 1 1 0
62 0 1 1 0
63 0 0 0 1
64 0 1 1 0
65 0 1 1 0
Appendix B. Q-matrix for Form B of the GSRT
Item Literal Inferential Critical Affective
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 0 1 0 0
5 0 1 0 0
6 1 0 0 0
7 0 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 1 0
11 1 0 0 0
12 1 0 0 0
13 1 0 0 0
14 0 1 1 0
15 0 1 1 0
16 1 0 0 0
17 0 0 0 1
18 0 1 0 0
19 0 1 1 0
20 0 1 1 0
21 1 0 0 0
22 1 0 0 0
23 0 1 1 0
24 0 0 0 1
25 0 1 1 0
26 1 0 0 0
27 1 0 0 0
28 0 1 1 0
29 0 1 1 0
30 0 1 1 0
31 1 0 0 0
32 0 1 0 0
33 0 1 1 0
34 0 1 0 0
35 0 1 1 0
36 0 1 0 0
37 1 0 0 0
38 0 1 1 0
39 1 0 0 0
40 0 1 0 0
41 1 0 0 0
42 0 1 1 0
43 1 0 0 0
44 0 0 0 1
45 0 1 1 0
46 1 0 0 0
47 0 1 0 0
48 0 1 0 0
49 0 1 0 0
50 0 1 0 0
51 0 1 0 0
52 0 1 0 0
53 0 1 1 0
54 0 1 1 0
55 0 1 1 0
56 0 1 0 0
57 0 1 1 0
58 0 1 1 0
59 0 1 1 0
60 0 1 1 0
61 0 1 1 0
62 1 0 0 0
63 0 1 0 0
64 0 1 0 0
65 0 1 0 0
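
As a rough illustration of how a Q-matrix like those in Appendices A and B can be handled programmatically, the following Python sketch parses whitespace-delimited rows, counts how many items measure each subskill, and computes the number of latent groups (2 raised to the number of required attributes) that a saturated model such as the G-DINA distinguishes for each item. The function names are hypothetical and the four rows are an excerpt copied from Appendix B (items 1, 4, 10, and 17); this is a sketch under those assumptions, not the analysis code of the study.

```python
from collections import Counter

# Column order follows the appendix header row.
ATTRIBUTES = ("Literal", "Inferential", "Critical", "Affective")

# Excerpt only: items 1, 4, 10, and 17 of Appendix B (the full form has 65 items).
Q_TEXT = """\
1 1 0 0 0
4 0 1 0 0
10 0 1 1 0
17 0 0 0 1
"""


def parse_q_matrix(text):
    """Return {item number: tuple of 0/1 attribute indicators}."""
    q = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        item, *flags = (int(f) for f in fields)
        if len(flags) != len(ATTRIBUTES):
            raise ValueError(f"item {item}: expected {len(ATTRIBUTES)} indicators")
        q[item] = tuple(flags)
    return q


def items_per_attribute(q):
    """Count how many items require each attribute."""
    counts = Counter()
    for flags in q.values():
        for name, flag in zip(ATTRIBUTES, flags):
            counts[name] += flag
    return {name: counts[name] for name in ATTRIBUTES}


def latent_groups(q):
    """Return 2**K_j per item j, where K_j is the number of required attributes."""
    return {item: 2 ** sum(flags) for item, flags in q.items()}


q = parse_q_matrix(Q_TEXT)
print(items_per_attribute(q))
print(latent_groups(q))
```

Pasting all 65 rows of either appendix into Q_TEXT would give the per-subskill item counts for that form; a page number or other stray line left inside the table would be rejected by the length check rather than silently parsed.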