Running head: RELIABILITY REVIEW OF PIAT-R/NU
The Normative Update of the Revised Peabody Individual Achievement Test - A Review of
Reliability
Jenna M. Powell
Lamar University
Abstract
Individually administered achievement tests are given to students for a variety of reasons. The Peabody Individual Achievement Test (PIAT), originally published in 1970, was created by Frederick C. Markwardt, Jr., PhD, to assess the academic achievement of students in kindergarten through grade 12 (K-12). The original test contained five subtests. The test was revised in 1989, bringing the total to six subtests; a national sample of 1,563 students in grades K-12 was tested in the United States in the spring of 1986 to standardize the revision, and the results were analyzed through several methods of reliability testing. The content of the revision was left unchanged; however, in 1995-96 a normative update was conducted with a sample of 3,184 students. The normative update was standardized alongside four other individually administered achievement tests using a domain-norming approach: each student took one of the five full tests as well as one or more subtests from the others, ensuring that one-fifth of each grade level received the PIAT-R. The 1996 normative-update results were then compared with the 1986 revision results.
This reliability review focuses mainly on the revision and normative update of the PIAT. It also discusses each subtest and describes the sampling process, norm development, the reliability testing methods, the demographic variables used, and the administration and interpretation process.
The Normative Update of the Revised Peabody Individual Achievement Test - A Review of
Reliability
The Peabody Individual Achievement Test-Revised/Normative Update (PIAT-R/NU) is the revised and renormed version of the classic 1970 Peabody Individual Achievement Test (PIAT), an individually administered measure of academic achievement. The test was revised in 1989, and the norms were updated in 1997. Frederick Markwardt, Jr. designed the test to evaluate the academic achievement of students in kindergarten through the 12th grade (K-12) in six areas of content known as the subtests. The test uses two main item formats: multiple choice and free response, the latter including verbal and writing components. A major feature of the revision is the addition of the Written Expression subtest. The subtests are (a) General Information, (b) Reading Recognition, (c) Reading Comprehension, (d) Mathematics, (e) Spelling, and the newest subtest, (f) Written Expression. Along with the six subtests, there are three composites: (a) Total Reading, (b) Total Test, and (c) Written Language. (Markwardt, 1997)
During the revision, the order of the subtests was changed in hopes of increasing the subject's motivation and interest level. (Markwardt, 1997) The General Information subtest measures the subject's general knowledge; each question is read aloud and the subject responds orally. The Reading Recognition subtest is an oral reading test: the subject reads items out loud. In the Reading Comprehension subtest, the subject reads a sentence and chooses a picture that illustrates what the sentence stated, measuring comprehension of what was read. The Mathematics subtest is multiple choice and measures conceptual application and knowledge of mathematical facts. All items are arranged in ascending order of difficulty. The Spelling subtest begins by measuring the ability to recognize letters from their sounds; later items measure the subject's recognition of standard spellings. The newest subtest, Written Expression, has two levels. Level One is for kindergarten through first grade: the subject is asked to copy and write letters, words, and sentences from dictation. Level Two is for grades 2-12: the subject is asked to write a story after being shown a picture. (Markwardt, 1997)
The PIAT-R items measure functional knowledge and general abilities that are not specific to a particular curriculum. The test is objectively scored and has no time constraint, although the average administration time is about 60 minutes. The PIAT-R is helpful when a scholastic survey is needed, and it aids in the selection of a more diagnostic instrument. The Total Reading composite is obtained by summing the raw scores of the Reading Recognition and Reading Comprehension subtests. The Total Test composite is the sum of the raw scores of the General Information, Reading Recognition, Reading Comprehension, Mathematics, and Spelling subtests. Finally, the Written Language composite is the sum of the scaled scores for the Spelling and Written Expression subtests. (Markwardt, 1997)
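The composite rules just described amount to simple sums. The sketch below illustrates them in Python; the dictionary keys, function names, and score values are hypothetical, not taken from the manual.

```python
# Illustrative sketch of the PIAT-R composite rules described above.
# All names and example values are hypothetical, not from the manual.

def total_reading(raw):
    """Total Reading: sum of the two reading subtests' raw scores."""
    return raw["reading_recognition"] + raw["reading_comprehension"]

def total_test(raw):
    """Total Test: sum of the five original subtests' raw scores."""
    keys = ["general_information", "reading_recognition",
            "reading_comprehension", "mathematics", "spelling"]
    return sum(raw[k] for k in keys)

def written_language(scaled):
    """Written Language: sum of SCALED scores, not raw scores."""
    return scaled["spelling"] + scaled["written_expression"]

raw = {"general_information": 52, "reading_recognition": 61,
       "reading_comprehension": 58, "mathematics": 47, "spelling": 55}
print(total_reading(raw))   # 119
print(total_test(raw))      # 273
```

Note that the Written Language composite alone is built from scaled scores, reflecting the different measurement properties of the Written Expression subtest.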
The test is administered individually because of its design and materials. There are four test plates and easels, which contain all six subtests. The plates are designed to capture and hold the interest of each subject regardless of sex, age, intellect, or cultural background, and contemporary artwork is used throughout the test in an attempt to balance the representation of sexes and racial/ethnic groups among the test items. (Markwardt, 1997)
The PIAT-R is based on a national sample of 1,563 subjects tested in the spring of 1986. (Markwardt, 1997) These subjects were chosen to represent the total school population (K-12) with respect to parental socioeconomic status, sex, geographic region, race/ethnic group, and grade, based on U.S. Census Bureau data on the U.S. population. This sample size is extremely small and limits how well the population is actually represented. Parental socioeconomic status was divided into four categories: less than a high school education, high school graduate, one to three years of college or education beyond high school, and four or more years of college. Forty-two percent of parents were at the high school graduate level, while the other three categories ranged from 18.9 percent to 19.9 percent. The target value for sex was 50 percent, and the sample came very close, with female subjects totaling 50.1 percent and male subjects totaling 49.9 percent. Geographic region was divided into four categories and did not include Hawaii or Alaska. The four divisions and their percentages of students are: Northeast with 19.1 percent, North Central with 25.6 percent, South with 32.8 percent, and West with the remaining 20.6 percent. According to Markwardt, these divisions closely match the percentage of the population in each region; however, only 20 of the contiguous 48 states are represented, which is not a fair depiction of every K-12 student in those states. Race/ethnic group was also divided into four categories. The categories and student percentages are as follows: White was the majority with 73.3 percent, Black with 14.3 percent, Hispanic with 9.7 percent, and Other with 2.7 percent. The "Other" category includes Asians, Pacific Islanders, Native Americans, and Alaska Natives. (Markwardt, 1997) According to 1985 U.S. population data, the "Other" category makes up 3.2 percent of the population, yet only 2.7 percent is represented in this study.
Before discussing the major uses of the PIAT-R, it is worth reviewing the differences between the 1970 version and the 1989 revised edition. As mentioned above, the most notable changes are the reordered subtests and the addition of the Written Expression subtest. The primary reason for the revision, however, was to bring in more current content: only 35 percent of the original items were carried over from the PIAT to the PIAT-R. (Markwardt, 1997) Another addition was the Total Reading composite score, which measures overall achievement in reading. The revision also increased the number of items in the original five subtests. In four subtests (General Information, Mathematics, Reading Recognition, and Spelling), the item count went from 84 to 100, while in Reading Comprehension it went from 66 to 82. (Markwardt, 1997)
According to Markwardt, there are seven major uses for the PIAT-R. The first is individual evaluation, which provides insight into the subject's existing knowledge, educational strengths and weaknesses, and testing behavior. Program planning is another use: the test helps develop a course of action to meet the subject's unique needs. Through guidance and counseling, the test can also help parents and students understand the subject's strengths and weaknesses when deciding on future plans. Once the subject's general level of accomplishment is determined, school placement is easily achieved, and the subject can be admitted or transferred to a new school based on these scores. The test is also used to group students by achievement level. Follow-up evaluation provides a measure of educational intervention at times when a more precise test is appropriate. The final use is personnel selection and training, in which the subject's level of achievement is used for employment selection or for guiding employees to appropriate educational programs. In addition to these seven uses of the PIAT-R/NU, there are five research uses: longitudinal studies, demographic studies, program evaluation, basic research, and validation studies. (Markwardt, 1997)
The administration and interpretation of the PIAT-R are relatively simple. The administrator presents items, records responses, and then calculates scores. A single test manual contains both technical and administrative guidelines for ease of use. While test administration can be handled with little study, score interpretation should not be attempted by anyone unfamiliar with psychometrics; knowledge of educational curricula and their implications is required for accurate interpretation. (Markwardt, 1997) A flaw worth mentioning here is that the test manual never states in detail a required degree or experience level for either administrator or interpreter; it merely lists a few examples of professions that can perform these roles effectively. Interpretation focuses on the actual meaning of the scores, the confidence that can be placed in them, and the prediction of future behaviors. As previously discussed, this test is not meant to be a diagnostic tool in itself, and it is not designed to provide a highly precise assessment of achievement. Moreover, items were selected to sample a mere cross section of various curricula across the United States rather than any one school system. Because the administration process is brief and simple, people without the proper background are liable to interpret the data incorrectly. The manual mentions this high potential for misinterpretation and misuse, as well as several pitfalls to avoid.
Before discussing the reliability tests used, we should take a moment to examine the development of the norms used for standardization. Spring 1986 data were collected and standardized for all grades except kindergarten, which used fall data. With the exception of Written Expression, each subtest and composite was scaled to a standard deviation of 15 and a mean of 100, which is typical of IQ-style standard scores. Additionally, percentile ranks, stanines, grade and age equivalents, and normal curve equivalents were developed. The standard scores of the five subtests were derived through careful calculation and smoothing. (Markwardt, 1997) The function of smoothing is to iron out sampling irregularities in a distribution so that the derived norms follow a regular progression; it is defensible when the underlying data are approximately normal. Composite standard scores were computed by summing subtest raw scores for each student at each age and grade, producing the distributions from which the composite norms were derived. Age and grade equivalents were each plotted and calculated. Once the normalized standard scores were established, the normal curve equivalents, percentile ranks, and stanines could be read from the normal curve table. (Markwardt, 1997) As mentioned before, Written Expression norms are derived in another way. The raw scores do not have an adequate range, and findings showed it inappropriate to develop standard scores or age/grade norms. Therefore, Levels I and II use grade-based stanines, and Level II has an additional developmental scaled score. (Markwardt, 1997)
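The derived scores named above all follow from a normalized z-score. As a rough illustration, using the textbook normal-curve formulas rather than the manual's smoothed tables, the conversions can be sketched as:

```python
import math

def standard_score(z, mean=100.0, sd=15.0):
    """Standard score on the mean-100, SD-15 scale used by the PIAT-R."""
    return mean + sd * z

def percentile_rank(z):
    """Percentile rank from the standard normal CDF."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def stanine(z):
    """Stanine: nine-point scale with mean 5 and SD 2, clamped to 1..9."""
    return max(1, min(9, round(5 + 2 * z)))

print(standard_score(1.0))          # 115.0
print(round(percentile_rank(0.0)))  # 50
print(stanine(0.0))                 # 5
```

Actual published norms come from the smoothed empirical distributions, so real table values will differ slightly from these idealized normal-curve conversions.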
The normative update was published in 1997 using data collected in 1995-96. (Markwardt, 1997) Four other individually administered achievement tests were used in the program to perform the update. Two of them are the Brief and Comprehensive forms of the Kaufman Test of Educational Achievement (K-TEA); the other two are KeyMath Revised (KeyMath-R) and the Woodcock Reading Mastery Tests-Revised (WRMT-R). Five content domains were also defined across the batteries in this standardization process: spelling, reading comprehension, math computation, word reading, and math applications. The approach used is referred to as domain norming: each person tested is administered one full achievement test from the five, referred to as the primary test, and is also given one or more subtests from the other tests. Rasch scaling was applied to the five domains listed above, while the Written Expression and General Information subtests obtained standard scores normalized through the smoothing process explained earlier. (Markwardt, 1997)
The sample for the normative update comprised 3,184 students in kindergarten through grade 12, plus an additional 245 young adults aged 18-22 of varying educational statuses. (Markwardt, 1997) This sample size, again, is limiting and does not accurately represent the population, and the fact that only English-speaking examinees were included further limits representativeness. The norming relied primarily on the 3,184 grade-norm samples (K-12), with less emphasis on the 245 age-norm samples. (Markwardt, 1997) The demographic variables used for the normative update were similar to those of the revision, and they are the same for the grade-norm and age-norm samples except for one variation: the grade-norm variables were sex, race/ethnicity, parental education, educational placement, and geographic region, while the age-norm sample swaps educational placement for educational status.
The demographic variables were compared with U.S. population data from March 1994 and are mostly in line with it for both the grade-norm and age-norm samples. Parental education, region of the country, and race/ethnicity are each divided into four categories for both samples, as in the revised version of the test. The main difference is educational placement for the grade norms and educational status for the age norms. For educational placement, the special education sample accounts for 10.8 percent, while the U.S. population shows 10.2 percent.
The gifted sample, however, accounts for only 2.3 percent of the sample, whereas gifted students make up 4.2 percent of the U.S. population. (Markwardt, 1997) Only about half of that group is represented here.
Sample selection was done by a random procedure: the permission forms received were pooled, and a computer selected examinees from the pool. (Markwardt, 1997) Each examinee was randomly assigned one of the five tests in an organized fashion to ensure that one-fifth of each grade took each test. The norms were developed from the Rasch distributions of ability scores at each grade or age. Sampling error was reduced by using a smoothing method developed by Poste and Traub (1990). (Markwardt, 1997)
When comparing the 1996 normative-update scores with the 1986 revision scores, one must take into account several changes over the ten-year time frame: a modernized cultural environment, changed educational curricula, and shifting population demographics can all affect the results. The overall pattern was a decline for below-average students in grades 1-12, while the greatest improvements were in the Mathematics and Reading Comprehension subtests. The change per subtest breaks down as follows. General Information: above-average students in grades 1-3 improved, and below-average students in grades 4-7 declined slightly. Reading Recognition: average students in grades 2 and 3 declined, above-average students in grades 1-2 improved, and below-average students declined in grades 1-12. Reading Comprehension: average students in grades 1-2 declined, and average students in grades 8-12 improved. Total Reading: average students in grades K-2 declined, above-average students in grades 1-2 improved, and below-average students in grades 1-12 declined. Mathematics: average students declined in grades 1-3 and improved in grades 5-12, while above-average students improved in grades 2-12; the same pattern of below-average students declining in grades 1-12 appears on this subtest. Spelling: the same pattern again, with below-average students declining in grades 1-12, above-average students improving in grades K-1, and average students declining in grades 1-3. Total Test: average students declined in grades 1-3 and improved in grades 7-9, above-average students improved in grades K-8, and below-average students declined in grades 1-9. Written Expression: Level I showed a decline, and Level II showed an increase for grades 4-12. Written Language composite: average and below-average students declined in grades K-1. (Markwardt, 1997) While a chart is provided, reliability was not discussed in as much detail for the normative update as it was for the revised test.
Having said all that, we move on to the reliability of the revised test. Four methods were used, each offering a slightly different perspective on the reliability of the PIAT-R: split-half, item response theory, Kuder-Richardson, and test-retest. (Markwardt, 1997) To aid understanding, each method will first be defined and then its results stated. The first is the split-half reliability test, which is used to show consistency of performance on each subtest. It was followed by the Spearman-Brown prophecy formula to estimate the reliability of the full test as a whole. (Markwardt, 1997) Although the Spearman-Brown formula is known to inflate reliability estimates, it is acceptable here as a follow-up to the split-half analysis. Coefficients indicate greater reliability the closer they come to 1, though they never reach it. The results for the PIAT-R are presented by grade only for the first five subtests, as the Written Expression subtest is interpreted and measured differently and will be discussed separately. Split-half results for the subtests and composites by grade show a median of .98, while the corresponding results by age show a close .99. These high coefficients are partly due to the operational rule that all items below the basal are counted as correct and all items above the ceiling are counted as incorrect. The next method was the Kuder-Richardson reliability test, which measures the consistency of all items and reflects the amount of measurement error in the test. The Kuder-Richardson results have the same median values as the split-half results, which shows the content to be highly homogeneous. The next method is test-retest reliability, which simply shows the consistency of scores from one administration of the test to another: the subject is given the test and then the same test again after some period of time has elapsed. One year is the ideal interval; in this case, however, fifty randomly selected subjects were retested two to four weeks after the original test date, which is not a sufficient interval to demonstrate true reliability. The median test-retest coefficient for selected grades in the random sample was .96, and for selected ages it was also .96. The final method was item response theory reliability. This method gives different estimates of error variance and true score and relies on the seven assumptions of classical true score theory; it is based on the idea that an observed score is a combination of true score and error. (Markwardt, 1997) The total test coefficients for both grade and age using this method are by far the highest, both with a median of .99. The median reliability coefficients for the total test were all in the high .90s; however, if we break down the individual values, some subtests clearly need revising. In Mathematics, for example, the split-half reliability is a low .84 for kindergarten subjects.
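The split-half procedure with the Spearman-Brown step-up can be sketched as follows; the helper names and response data are illustrative, not from the manual.

```python
import math

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_brown(r_half):
    """Step a half-test correlation up to an estimated full-test reliability."""
    return 2 * r_half / (1 + r_half)

def split_half(items):
    """items: one 0/1 response vector per examinee; odd-even split."""
    odd = [sum(v[0::2]) for v in items]
    even = [sum(v[1::2]) for v in items]
    return spearman_brown(pearson(odd, even))

# A half-test correlation of .96 steps up to about .98 for the full test.
print(round(spearman_brown(0.96), 3))
```

Because the correction assumes the two halves are parallel, it tends to be optimistic when they are not, which is the inflation concern noted above.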
The PIAT-R manual briefly discusses the standard error of measurement (SEM), one of the vital quantities of classical true score theory. The coefficients discussed thus far are helpful for characterizing a group of subjects but do not allow interpretation of an individual's test scores; this is where the SEM is useful. Using the split-half coefficients, SEMs for standard and raw scores were computed and rounded to the nearest tenth. The values were then smoothed and transformed into whole numbers so that raw-score confidence intervals could be computed. Before smoothing, the median total test SEM by grade and by age was 5.8; after smoothing, the median was 2 for grade and 1.8 for age. (Markwardt, 1997) With a standard deviation of 15, the SEM falls well below the standard deviation, yielding reasonably tight confidence intervals after smoothing.
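These figures follow from the classical test theory relation SEM = SD * sqrt(1 - r). A minimal sketch with illustrative values: with SD 15 and a split-half reliability near .98, the SEM comes out near 2, in line with the smoothed medians quoted above.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement from classical test theory."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(score, sd, reliability, z=1.96):
    """Approximate 95% band around an observed score."""
    e = z * sem(sd, reliability)
    return (score - e, score + e)

print(round(sem(15.0, 0.98), 2))  # 2.12
```
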
Levels One (I) and Two (II) of the Written Expression subtest are examined and scored differently because the subtest has different properties than the others. The three reliability tests used for this subtest are test-retest, interrater, and internal consistency. (Markwardt, 1997) Level I tests (kindergarten and first grade) were scored independently by two individuals, and their results were correlated: first grade had a correlation of .95 and kindergarten a correlation of .90. An interrater reliability test was also used; in an interrater test, each item has a predetermined value and is scored accordingly. The Level I interrater coefficient was .88 for first grade and .91 for kindergarten. Internal consistency reliability uses the coefficient alpha formula, and the outcome was considered moderate: using spring standardization data for first grade and both fall and spring data for kindergarten, the results ranged from .60 to .69. (Markwardt, 1997) The likely reason given for this moderate result is that the content is not homogeneous and the sample size is small. For Level II Written Expression, two different prompts (pictures) were used, referred to here as Prompt A and Prompt B. Level II (grades 2-12) also used three types of reliability testing, but an alternate-forms test replaced test-retest. Coefficient alpha ranges from .69 to .91; for the total standardization sample it was .86 for Prompt A and .88 for Prompt B. Interrater correlations have a median of .58 for Prompt A and .67 for Prompt B, and internal consistency shows a median of .57 for Prompt A and .67 for Prompt B. The alternate-forms portion involved about 35 randomly selected subjects who, two to four weeks after the initial test, responded to a picture prompt different from the original; coefficient alpha for the total sample was .63. (Markwardt, 1997) Extremely low reliability is found for both Prompt A and Prompt B.
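The coefficient alpha formula mentioned above can be sketched directly from its definition: the ratio of summed item variances to total-score variance, scaled by k/(k-1). The sample scores below are made up for illustration.

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """Coefficient alpha; scores is one row of item scores per examinee."""
    k = len(scores[0])
    items = list(zip(*scores))            # columns = items
    totals = [sum(row) for row in scores]
    item_var = sum(variance(list(col)) for col in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

scores = [[1, 1], [0, 0], [1, 0], [1, 1]]  # 4 examinees, 2 items
print(round(cronbach_alpha(scores), 2))    # 0.73
```

Alpha rises when items covary strongly, which is why the heterogeneous Written Expression content yields only moderate values.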
To summarize, the PIAT was devised to measure the academic achievement of students in kindergarten through grade 12. The results suggest that the revised version largely accomplishes this goal; however, the reliability evidence is somewhat flawed. The samples for both the revision and the normative update are far too small to represent the population at large. Another major weakness is the extremely short interval between test and retest. Neither the revised version nor the normative update represented every U.S. state, even though the results are supposed to represent the entire country. A few obvious changes that could better support the high reliability figures are: increasing the sample size, including all 50 states, and lengthening the test-retest interval (a year is ideal). These changes would be a great start toward increasing the reliability of the next version of the Peabody Individual Achievement Test.
References
Markwardt, F. C. (1997). Peabody Individual Achievement Test - Revised/Normative Update. Minneapolis, MN: NCS Pearson.