Running head: RELIABILITY REVIEW OF PIAT-R/NU
The Normative Update of the Revised Peabody Individual Achievement Test - A Review of
Reliability
Jenna M. Powell
Lamar University
Abstract
Individually administered achievement tests are given to students for a variety of reasons. The Peabody Individual Achievement Test (PIAT), originally published in 1970, was created by Frederick C. Markwardt, Jr., PhD, to assess the academic achievement of students in kindergarten through grade 12 (K-12). The original test contained five subtests. The test was revised in 1989, bringing the total to six subtests; a national sample of 1,563 students in grades K-12 was tested in the United States in the spring of 1986 to standardize the revision, and the results were analyzed through several methods of reliability testing. The content of the revision was left unchanged; however, in 1995-96 a normative update was conducted with a sample of 3,184 students. The normative update was standardized alongside four other individually administered achievement tests using a domain-norming approach: each student took one of the five full tests as well as one or more subtests from the others, ensuring that one-fifth of each grade level received the PIAT-R. The 1996 normative-update results were then compared with the 1986 revision results.
This reliability review focuses mainly on the revision and normative update of the PIAT. It also discusses each subtest and describes the sampling process, norm development, the reliability testing methods, the demographic variables used, and the administration and interpretation process.
The Normative Update of the Revised Peabody Individual Achievement Test - A Review of
Reliability
The Peabody Individual Achievement Test-Revised/Normative Update (PIAT-R/NU) is the revised and renormed version of the classic 1970 Peabody Individual Achievement Test (PIAT), an individually administered measure of academic achievement. The test was revised in 1989, and the norms were updated in 1997. Frederick Markwardt, Jr. designed the test to evaluate the academic achievement of students in kindergarten through the 12th grade (K-12) in six areas of content known as the subtests. The test uses two main item formats: multiple choice and free response, the latter including verbal and writing components. A major feature of the revision is the addition of the Written Expression subtest. The subtests are (a) General Information, (b) Reading Recognition, (c) Reading Comprehension, (d) Mathematics, (e) Spelling, and the newest subtest, (f) Written Expression. Along with the six subtests, there are three composites: (a) Total Reading, (b) Total Test, and (c) Written Language. (Markwardt, 1997)
During the revision, the order of the subtests was changed in hopes of increasing the subject's motivation and interest level. (Markwardt, 1997) The General Information subtest measures the subject's general knowledge; each question is read aloud and the subject responds orally. The Reading Recognition subtest is an oral reading test: the subject reads items out loud. In the Reading Comprehension subtest, the subject reads a sentence and chooses a picture that illustrates what the sentence stated, measuring comprehension of what was read. The Mathematics subtest is multiple choice and measures conceptual application and knowledge of mathematical facts. All items are arranged in ascending order of difficulty. The Spelling subtest begins by measuring the ability to recognize letters from their sounds; later items measure the subject's recognition of standard spellings. The newest subtest, Written Expression, has two levels. Level One is for kindergarten through first grade: the subject is asked to copy and write letters, words, and sentences from dictation. Level Two is for grades 2-12: the subject is asked to write a story after being shown a picture. (Markwardt, 1997)
The PIAT-R items measure functional knowledge and general abilities that are not specific to a particular curriculum. The test is objectively scored and has no time constraint, although the average administration time is about 60 minutes. The PIAT-R is helpful when a scholastic survey is needed, and it aids in the selection of a more diagnostic instrument. The Total Reading composite is obtained by summing the raw scores of the Reading Recognition and Reading Comprehension subtests. The Total Test composite is the sum of the raw scores of the General Information, Reading Recognition, Reading Comprehension, Mathematics, and Spelling subtests. Finally, the Written Language composite is the sum of the scaled scores for the Spelling and Written Expression subtests. (Markwardt, 1997)
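The composite rules just described amount to simple sums. The sketch below illustrates them in Python; the dictionary keys, function names, and score values are hypothetical, not taken from the manual.

```python
# Illustrative sketch of the PIAT-R composite rules described above.
# All names and example values are hypothetical, not from the manual.

def total_reading(raw):
    """Total Reading: sum of the two reading subtests' raw scores."""
    return raw["reading_recognition"] + raw["reading_comprehension"]

def total_test(raw):
    """Total Test: sum of the five original subtests' raw scores."""
    keys = ["general_information", "reading_recognition",
            "reading_comprehension", "mathematics", "spelling"]
    return sum(raw[k] for k in keys)

def written_language(scaled):
    """Written Language: sum of SCALED scores, not raw scores."""
    return scaled["spelling"] + scaled["written_expression"]

raw = {"general_information": 52, "reading_recognition": 61,
       "reading_comprehension": 58, "mathematics": 47, "spelling": 55}
print(total_reading(raw))   # 119
print(total_test(raw))      # 273
```

Note that the Written Language composite alone is built from scaled scores, reflecting the different measurement properties of the Written Expression subtest.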
The test is administered individually because of its design and materials. There are four test plates and easels, which contain all six subtests. The plates are designed to capture and hold the interest of each subject regardless of sex, age, intellect, or cultural background, and contemporary artwork is used throughout the test in an attempt to balance the representation of sexes and racial/ethnic groups among the test items. (Markwardt, 1997)
The PIAT-R is based on a national sample of 1,563 subjects tested in the spring of 1986. (Markwardt, 1997) These subjects were chosen to represent the total school population (K-12) with respect to parental socioeconomic status, sex, geographic region, race/ethnic group, and grade, based on U.S. Census Bureau data on the U.S. population. This sample size is extremely small and limits how well the population is actually represented. Parental socioeconomic status was divided into four categories: less than a high school education, high school graduate, one to three years of college or education beyond high school, and four or more years of college. Forty-two percent of parents were at the high school graduate level, while the other three categories ranged from 18.9 percent to 19.9 percent. The target value for sex was 50 percent, and the sample came very close, with female subjects totaling 50.1 percent and male subjects totaling 49.9 percent. Geographic region was divided into four categories and did not include Hawaii or Alaska. The four divisions and their percentages of students are: Northeast with 19.1 percent, North Central with 25.6 percent, South with 32.8 percent, and West with the remaining 20.6 percent. According to Markwardt, these divisions closely match the percentage of the population in each region; however, only 20 of the contiguous 48 states are represented, which is not a fair depiction of every K-12 student in those states. Race/ethnic group was also divided into four categories. The categories and student percentages are as follows: White was the majority with 73.3 percent, Black with 14.3 percent, Hispanic with 9.7 percent, and Other with 2.7 percent. The "Other" category includes Asians, Pacific Islanders, Native Americans, and Alaska Natives. (Markwardt, 1997) According to 1985 U.S. population data, the "Other" category makes up 3.2 percent of the population, yet only 2.7 percent is represented in this study.
Before discussing the major uses of the PIAT-R, it is worth reviewing the differences between the 1970 version and the 1989 revised edition. As mentioned above, the most notable changes are the reordered subtests and the addition of the Written Expression subtest. The primary reason for the revision, however, was to bring in more current content: only 35 percent of the original items were carried over from the PIAT to the PIAT-R. (Markwardt, 1997) Another addition was the Total Reading composite score, which measures overall achievement in reading. The revision also increased the number of items in the original five subtests. In four subtests (General Information, Mathematics, Reading Recognition, and Spelling), the item count went from 84 to 100, while in Reading Comprehension it went from 66 to 82. (Markwardt, 1997)
According to Markwardt, there are seven major uses for the PIAT-R. The first is individual evaluation, which provides insight into the subject's existing knowledge, educational strengths and weaknesses, and testing behavior. Program planning is another use: the test helps develop a course of action to meet the subject's unique needs. Through guidance and counseling, the test can also help parents and students understand the subject's strengths and weaknesses when deciding on future plans. Once the subject's general level of accomplishment is determined, school placement is easily achieved, and the subject can be admitted or transferred to a new school based on these scores. The test is also used to group students by achievement level. Follow-up evaluation provides a measure of educational intervention at times when a more precise test is appropriate. The final use is personnel selection and training, in which the subject's level of achievement is used for employment selection or for guiding employees to appropriate educational programs. In addition to these seven uses of the PIAT-R/NU, there are five research uses: longitudinal studies, demographic studies, program evaluation, basic research, and validation studies. (Markwardt, 1997)
The administration and interpretation of the PIAT-R are relatively simple. The administrator presents items, records responses, and then calculates scores. A single test manual contains both technical and administrative guidelines for ease of use. While test administration can be handled with little study, score interpretation should not be attempted by anyone unfamiliar with psychometrics; knowledge of educational curricula and their implications is required for accurate interpretation. (Markwardt, 1997) A flaw worth mentioning here is that the test manual never states in detail a required degree or experience level for either administrator or interpreter; it merely lists a few examples of professions that can perform these roles effectively. Interpretation focuses on the actual meaning of the scores, the confidence that can be placed in them, and the prediction of future behaviors. As previously discussed, this test is not meant to be a diagnostic tool in itself, and it is not designed to provide a highly precise assessment of achievement. Moreover, items were selected to sample a mere cross section of various curricula across the United States rather than any one school system. Because the administration process is brief and simple, people without the proper background are liable to interpret the data incorrectly. The manual mentions this high potential for misinterpretation and misuse, as well as several pitfalls to avoid.
Before discussing the reliability tests used, we should take a moment to examine the development of the norms used for standardization. Spring 1986 data were collected and standardized for all grades except kindergarten, which used fall data. With the exception of Written Expression, each subtest and composite was scaled to a standard deviation of 15 and a mean of 100, which is typical of IQ-style standard scores. Additionally, percentile ranks, stanines, grade and age equivalents, and normal curve equivalents were developed. The standard scores of the five subtests were derived through careful calculation and smoothing. (Markwardt, 1997) The function of smoothing is to iron out sampling irregularities in a distribution so that the derived norms follow a regular progression; it is defensible when the underlying data are approximately normal. Composite standard scores were computed by summing subtest raw scores for each student at each age and grade, producing the distributions from which the composite norms were derived. Age and grade equivalents were each plotted and calculated. Once the normalized standard scores were established, the normal curve equivalents, percentile ranks, and stanines could be read from the normal curve table. (Markwardt, 1997) As mentioned before, Written Expression norms are derived in another way. The raw scores do not have an adequate range, and findings showed it inappropriate to develop standard scores or age/grade norms. Therefore, Levels I and II use grade-based stanines, and Level II has an additional developmental scaled score. (Markwardt, 1997)
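The derived scores named above all follow from a normalized z-score. As a rough illustration, using the textbook normal-curve formulas rather than the manual's smoothed tables, the conversions can be sketched as:

```python
import math

def standard_score(z, mean=100.0, sd=15.0):
    """Standard score on the mean-100, SD-15 scale used by the PIAT-R."""
    return mean + sd * z

def percentile_rank(z):
    """Percentile rank from the standard normal CDF."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def stanine(z):
    """Stanine: nine-point scale with mean 5 and SD 2, clamped to 1..9."""
    return max(1, min(9, round(5 + 2 * z)))

print(standard_score(1.0))          # 115.0
print(round(percentile_rank(0.0)))  # 50
print(stanine(0.0))                 # 5
```

Actual published norms come from the smoothed empirical distributions, so real table values will differ slightly from these idealized normal-curve conversions.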
The normative update was published in 1997 using data collected in 1995-96. (Markwardt, 1997) Four other individually administered achievement tests were used in the program to perform the update. Two of them are the Brief and Comprehensive forms of the Kaufman Test of Educational Achievement (K-TEA); the other two are KeyMath Revised (KeyMath-R) and the Woodcock Reading Mastery Tests-Revised (WRMT-R). Five content domains were also defined across the batteries in this standardization process: spelling, reading comprehension, math computation, word reading, and math applications. The approach used is referred to as domain norming: each person tested is administered one full achievement test from the five, referred to as the primary test, and is also given one or more subtests from the other tests. Rasch scaling was applied to the five domains listed above, while the Written Expression and General Information subtests obtained standard scores normalized through the smoothing process explained earlier. (Markwardt, 1997)
The sample for the normative update comprised 3,184 students in kindergarten through grade 12, plus an additional 245 young adults aged 18-22 of varying educational statuses. (Markwardt, 1997) This sample size, again, is limiting and does not accurately represent the population, and the fact that only English-speaking examinees were included further limits representativeness. The norming relied primarily on the 3,184 grade-norm samples (K-12), with less emphasis on the 245 age-norm samples. (Markwardt, 1997) The demographic variables used for the normative update were similar to those of the revision, and they are the same for the grade-norm and age-norm samples except for one variation: the grade-norm variables were sex, race/ethnicity, parental education, educational placement, and geographic region, while the age-norm sample swaps educational placement for educational status.
The demographic variables were compared with U.S. population data from March 1994 and are mostly in line with it for both the grade-norm and age-norm samples. Parental education, region of the country, and race/ethnicity are each divided into four categories for both samples, as in the revised version of the test. The main difference is educational placement for the grade norms and educational status for the age norms. For educational placement, the special education sample accounts for 10.8 percent, while the U.S. population shows 10.2 percent.
The gifted sample, however, accounts for only 2.3 percent of the sample, whereas gifted students make up 4.2 percent of the U.S. population. (Markwardt, 1997) Only about half of that group is represented here.
Sample selection was done by a random procedure: the permission forms received were pooled, and a computer selected examinees from the pool. (Markwardt, 1997) Each examinee was randomly assigned one of the five tests in an organized fashion to ensure that one-fifth of each grade took each test. The norms were developed from the Rasch distributions of ability scores at each grade or age. Sampling error was reduced by using a smoothing method developed by Poste and Traub (1990). (Markwardt, 1997)
When comparing the 1996 normative-update scores with the 1986 revision scores, one must take into account several changes over the ten-year time frame: a modernized cultural environment, changed educational curricula, and shifting population demographics can all affect the results. The overall pattern was a decline for below-average students in grades 1-12, while the greatest improvements were in the Mathematics and Reading Comprehension subtests. The change per subtest breaks down as follows. General Information: above-average students in grades 1-3 improved, and below-average students in grades 4-7 declined slightly. Reading Recognition: average students in grades 2 and 3 declined, above-average students in grades 1-2 improved, and below-average students declined in grades 1-12. Reading Comprehension: average students in grades 1-2 declined, and average students in grades 8-12 improved. Total Reading: average students in grades K-2 declined, above-average students in grades 1-2 improved, and below-average students in grades 1-12 declined. Mathematics: average students declined in grades 1-3 and improved in grades 5-12, while above-average students improved in grades 2-12; the same pattern of below-average students declining in grades 1-12 appears on this subtest. Spelling: the same pattern again, with below-average students declining in grades 1-12, above-average students improving in grades K-1, and average students declining in grades 1-3. Total Test: average students declined in grades 1-3 and improved in grades 7-9, above-average students improved in grades K-8, and below-average students declined in grades 1-9. Written Expression: Level I showed a decline, and Level II showed an increase for grades 4-12. Written Language composite: average and below-average students declined in grades K-1. (Markwardt, 1997) While a chart is provided, reliability was not discussed in as much detail for the normative update as it was for the revised test.
Having said all that, we move on to the reliability of the revised test. Four methods were used, each offering a slightly different perspective on the reliability of the PIAT-R: split-half, item response theory, Kuder-Richardson, and test-retest. (Markwardt, 1997) To aid understanding, each method will first be defined and then its results stated. The first is the split-half reliability test, which is used to show consistency of performance on each subtest. It was followed by the Spearman-Brown prophecy formula to estimate the reliability of the full test as a whole. (Markwardt, 1997) Although the Spearman-Brown formula is known to inflate reliability estimates, it is acceptable here as a follow-up to the split-half analysis. Coefficients indicate greater reliability the closer they come to 1, though they never reach it. The results for the PIAT-R are presented by grade only for the first five subtests, as the Written Expression subtest is interpreted and measured differently and will be discussed separately. Split-half results for the subtests and composites by grade show a median of .98, while the corresponding results by age show a close .99. These high coefficients are partly due to the operational rule that all items below the basal are counted as correct and all items above the ceiling are counted as incorrect. The next method was the Kuder-Richardson reliability test, which measures the consistency of all items and reflects the amount of measurement error in the test. The Kuder-Richardson results have the same median values as the split-half results, which shows the content to be highly homogeneous. The next method is test-retest reliability, which simply shows the consistency of scores from one administration of the test to another: the subject is given the test and then the same test again after some period of time has elapsed. One year is the ideal interval; in this case, however, fifty randomly selected subjects were retested two to four weeks after the original test date, which is not a sufficient interval to demonstrate true reliability. The median test-retest coefficient for selected grades in the random sample was .96, and for selected ages it was also .96. The final method was item response theory reliability. This method gives different estimates of error variance and true score and relies on the seven assumptions of classical true score theory; it is based on the idea that an observed score is a combination of true score and error. (Markwardt, 1997) The total test coefficients for both grade and age using this method are by far the highest, both with a median of .99. The median reliability coefficients for the total test were all in the high .90s; however, if we break down the individual values, some subtests clearly need revising. In Mathematics, for example, the split-half reliability is a low .84 for kindergarten subjects.
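The split-half procedure with the Spearman-Brown step-up can be sketched as follows; the helper names and response data are illustrative, not from the manual.

```python
import math

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_brown(r_half):
    """Step a half-test correlation up to an estimated full-test reliability."""
    return 2 * r_half / (1 + r_half)

def split_half(items):
    """items: one 0/1 response vector per examinee; odd-even split."""
    odd = [sum(v[0::2]) for v in items]
    even = [sum(v[1::2]) for v in items]
    return spearman_brown(pearson(odd, even))

# A half-test correlation of .96 steps up to about .98 for the full test.
print(round(spearman_brown(0.96), 3))
```

Because the correction assumes the two halves are parallel, it tends to be optimistic when they are not, which is the inflation concern noted above.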
The PIAT-R manual briefly discusses the standard error of measurement (SEM), one of the vital quantities of classical true score theory. The coefficients discussed thus far are helpful for characterizing a group of subjects but do not allow interpretation of an individual's test scores; this is where the SEM is useful. Using the split-half coefficients, SEMs for standard and raw scores were computed and rounded to the nearest tenth. The values were then smoothed and transformed into whole numbers so that raw-score confidence intervals could be computed. Before smoothing, the median total test SEM by grade and by age was 5.8; after smoothing, the median was 2 for grade and 1.8 for age. (Markwardt, 1997) With a standard deviation of 15, the SEM falls well below the standard deviation, yielding reasonably tight confidence intervals after smoothing.
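These figures follow from the classical test theory relation SEM = SD * sqrt(1 - r). A minimal sketch with illustrative values: with SD 15 and a split-half reliability near .98, the SEM comes out near 2, in line with the smoothed medians quoted above.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement from classical test theory."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(score, sd, reliability, z=1.96):
    """Approximate 95% band around an observed score."""
    e = z * sem(sd, reliability)
    return (score - e, score + e)

print(round(sem(15.0, 0.98), 2))  # 2.12
```
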
Levels One (I) and Two (II) of the Written Expression subtest are examined and scored differently because the subtest has different properties than the others. The three reliability tests used for this subtest are test-retest, interrater, and internal consistency. (Markwardt, 1997) Level I tests (kindergarten and first grade) were scored independently by two individuals, and their results were correlated: first grade had a correlation of .95 and kindergarten a correlation of .90. An interrater reliability test was also used; in an interrater test, each item has a predetermined value and is scored accordingly. The Level I interrater coefficient was .88 for first grade and .91 for kindergarten. Internal consistency reliability uses the coefficient alpha formula, and the outcome was considered moderate: using spring standardization data for first grade and both fall and spring data for kindergarten, the results ranged from .60 to .69. (Markwardt, 1997) The likely reason given for this moderate result is that the content is not homogeneous and the sample size is small. For Level II Written Expression, two different prompts (pictures) were used, referred to here as Prompt A and Prompt B. Level II (grades 2-12) also used three types of reliability testing, but an alternate-forms test replaced test-retest. Coefficient alpha ranges from .69 to .91; for the total standardization sample it was .86 for Prompt A and .88 for Prompt B. Interrater correlations have a median of .58 for Prompt A and .67 for Prompt B, and internal consistency shows a median of .57 for Prompt A and .67 for Prompt B. The alternate-forms portion involved about 35 randomly selected subjects who, two to four weeks after the initial test, responded to a picture prompt different from the original; coefficient alpha for the total sample was .63. (Markwardt, 1997) Extremely low reliability is found for both Prompt A and Prompt B.
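The coefficient alpha formula mentioned above can be sketched directly from its definition: the ratio of summed item variances to total-score variance, scaled by k/(k-1). The sample scores below are made up for illustration.

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """Coefficient alpha; scores is one row of item scores per examinee."""
    k = len(scores[0])
    items = list(zip(*scores))            # columns = items
    totals = [sum(row) for row in scores]
    item_var = sum(variance(list(col)) for col in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

scores = [[1, 1], [0, 0], [1, 0], [1, 1]]  # 4 examinees, 2 items
print(round(cronbach_alpha(scores), 2))    # 0.73
```

Alpha rises when items covary strongly, which is why the heterogeneous Written Expression content yields only moderate values.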
To summarize, the PIAT was devised to measure the academic achievement of students in kindergarten through grade 12. The results suggest that the revised version largely accomplishes this goal; however, the reliability evidence is somewhat flawed. The samples for both the revision and the normative update are far too small to represent the population at large. Another major weakness is the extremely short interval between test and retest. Neither the revised version nor the normative update represented every U.S. state, even though the results are supposed to represent the entire country. A few obvious changes that could better support the high reliability figures are: increasing the sample size, including all 50 states, and lengthening the test-retest interval (a year is ideal). These changes would be a great start toward increasing the reliability of the next version of the Peabody Individual Achievement Test.
References
Markwardt, F. C. (1997). Peabody Individual Achievement Test - Revised/Normative Update. Minneapolis, MN: NCS Pearson.