
Sara Betz, Chancy Rodeghero, Corey Winking, Melissa Yucuis

THE VALIDITY OF ASSESSMENT-BASED INTERPRETATIONS

“A test is only a measuring instrument, an instrument far less precise than most people believe.”

VALIDITY

A test is merely a tool we use to make inferences. If we misuse the tool by employing it in the wrong situation, it is not the tool that's defective. It's the tool-user.

CONTENT-RELATED EVIDENCE OF VALIDITY

Demonstrates the degree to which the sample of items or tasks on a test is representative of a designated domain of content.

A NEED FOR QUANTIFICATION

1) Assemble a panel of 15 to 20 knowledgeable individuals.

2) Ask them to respond Yes or No to the question: "Should a student possess the knowledge and skill measured by this test item?"

3) Compute the percentage of Yes responses for each item.

4) Ask the panel to respond to the question: "Having first considered the entire range of content knowledge, what percentage of that range is represented by this test's items?"

5) Compute the mean of all panelists' content-coverage percentage estimates.
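As an illustration, here is a minimal Python sketch of steps 3 and 5, assuming the panel's item judgments are recorded as 1 (Yes) and 0 (No); all data shown are hypothetical:

# Hypothetical panel data: one row per panelist, one column per test item;
# 1 = "Yes, students should possess this knowledge/skill", 0 = "No".
item_judgments = [
    [1, 1, 0, 1],   # panelist 1
    [1, 1, 1, 1],   # panelist 2
    [1, 0, 0, 1],   # panelist 3
]

# Each panelist's estimate of the percentage of the content domain
# represented by the test's items (step 4).
coverage_estimates = [70.0, 85.0, 60.0]

# Step 3: percentage of Yes responses for each item.
n_panelists = len(item_judgments)
for item in range(len(item_judgments[0])):
    yes_pct = 100.0 * sum(row[item] for row in item_judgments) / n_panelists
    print(f"Item {item + 1}: {yes_pct:.0f}% Yes")

# Step 5: mean of the panelists' content-coverage estimates.
mean_coverage = sum(coverage_estimates) / len(coverage_estimates)
print(f"Mean content-coverage estimate: {mean_coverage:.1f}%")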

CRITERION-RELATED EVIDENCE OF VALIDITY

Based on the extent to which a student’s score on a test allows someone to make an accurate inference about the student’s performance on a criterion variable.

Example: aptitude test administered in high school to predict what kind of grades students will earn in college.
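As a sketch, the accuracy of such predictions is commonly summarized with a correlation coefficient between predictor and criterion; the scores below are hypothetical, and statistics.correlation requires Python 3.10 or later:

from statistics import correlation

# Hypothetical data: high school aptitude scores and the same
# students' later college GPAs (the criterion variable).
aptitude = [48, 52, 61, 67, 70, 75, 81, 88]
college_gpa = [2.1, 2.4, 2.6, 3.0, 2.8, 3.3, 3.5, 3.7]

# The Pearson correlation between predictor and criterion serves as a
# criterion-related validity coefficient: the closer to 1.0, the more
# accurate the inference from test score to criterion performance.
r = correlation(aptitude, college_gpa)
print(f"Validity coefficient: r = {r:.2f}")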

CONSTRUCT-RELATED EVIDENCE OF VALIDITY

Intervention Studies: Attempts to demonstrate that students respond differently to the measure after receiving some sort of treatment.

Differential-Population Studies: Efforts to show that individuals representing distinct populations score differently on the measure.

Related-Measures Studies: Correlations, positive and negative depending on the measures, between students’ scores on the test and their scores on other measures.
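A brief hypothetical sketch of a related-measures analysis: a new reading test would be expected to correlate positively with an established reading measure (convergent evidence) and negatively with an opposed construct such as reading anxiety (all data below are invented):

from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same students on three measures.
new_reading_test = [55, 60, 62, 70, 74, 80, 85, 90]
established_reading = [52, 63, 60, 72, 71, 78, 88, 86]
reading_anxiety = [40, 35, 38, 30, 28, 22, 18, 15]

# Convergent evidence: a positive correlation with a related measure.
print(f"r with established test: {correlation(new_reading_test, established_reading):+.2f}")

# Discriminant evidence: a negative correlation with an opposed construct.
print(f"r with anxiety scale:    {correlation(new_reading_test, reading_anxiety):+.2f}")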

CONSEQUENTIAL VALIDITY

A concept, disputed by some, focused on the appropriateness of a test’s social consequences.

Teachers need to understand the difference between “validity-focused-on-inferences” and “validity-focused-on-tests”.

RELIABILITY OF ASSESSMENT DEVICES

RELIABILITY

Assessment reliability refers to consistency of measurement.

In everyday language, if an assessment device is reliable, it measures with consistency.

CONSISTENCY OF MEASUREMENT

Stability evidence
Alternate-form evidence
Internal-consistency evidence

STABILITY RELIABILITY EVIDENCE

Stability estimates of a test’s reliability are based on the consistency of measurement over time.

If students’ test results from two administrations of the same test – separated by a meaningful between-testing interval – turn out to be positively correlated, then the test is said to possess stability reliability.

STABILITY RELIABILITY EVIDENCE

Typically, educators simply administer a test to a group of students, wait a few weeks or so, then re-administer the same test to the same students.

The correlation between the two sets of scores is referred to as a stability reliability coefficient.

Classifying students into performance categories (for example, basic, proficient, or advanced) on each administration, then checking whether students land in the same category both times, is another way to gauge consistency over time (see the sketch below).
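A minimal sketch of that classification-consistency check in Python, using hypothetical performance labels:

# Hypothetical performance classifications from two administrations
# of the same test, in the same student order.
first = ["basic", "proficient", "advanced", "proficient", "basic", "advanced"]
second = ["basic", "proficient", "proficient", "proficient", "basic", "advanced"]

# Percentage of students receiving the same classification both times:
# a simple decision-consistency index of stability.
same = sum(a == b for a, b in zip(first, second))
print(f"Classification consistency: {100.0 * same / len(first):.0f}%")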

ALTERNATE-FORM RELIABILITY EVIDENCE

Alternate-form reliability evidence is collected when the same group of students completes two forms of the same assessment device and their performances on the two forms are correlated.

Standardized testing programs depend on alternate-form evidence because so many versions of the same test are in use. The forms must be equally difficult.

ALTERNATE-FORM RELIABILITY EVIDENCE

Because equally difficult test forms are so pivotal in the collection of alternate-form reliability evidence, test developers frequently try out test items to establish their difficulty, then constitute different forms so that the item difficulties of the two forms are similar.

Sometimes these item difficulties are determined via a separate field test; sometimes they are determined by embedding a small number of trial items in an operational form of a test.
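A sketch of that tryout logic with hypothetical field-test data: compute each item's difficulty (its p-value, the proportion of students answering correctly) and compare the mean difficulty of two candidate forms:

# Hypothetical field-test results: rows are students, columns are items;
# 1 = correct, 0 = incorrect.
form_a = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1]]
form_b = [[1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 1, 1]]

def p_values(responses):
    """Proportion of students answering each item correctly."""
    n = len(responses)
    return [sum(row[i] for row in responses) / n
            for i in range(len(responses[0]))]

# Forms are considered comparable when their item difficulties are similar.
for name, form in (("A", form_a), ("B", form_b)):
    ps = p_values(form)
    print(f"Form {name} p-values: {[round(p, 2) for p in ps]}, "
          f"mean difficulty: {sum(ps) / len(ps):.2f}")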

INTERNAL-CONSISTENCY RELIABILITY EVIDENCE

The focus of an internal-consistency approach to reliability is the homogeneity of the set of items that constitute a test.

If the items are functioning in a similar fashion, then the test is said to be internally consistent.

Different internal-consistency gauges, such as the Kuder-Richardson formulas or Cronbach's coefficient alpha, are used depending on the number and nature of the test's items.
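One widely used internal-consistency gauge is Cronbach's coefficient alpha (the Kuder-Richardson 20 formula is its special case for right/wrong items); a minimal sketch with hypothetical scores:

from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient alpha for a students-by-items score matrix."""
    k = len(scores[0])  # number of items
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])  # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 1 = correct, 0 = incorrect (alpha reduces to KR-20 here).
scores = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(f"alpha = {cronbach_alpha(scores):.2f}")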

STANDARD ERROR OF MEASUREMENT

Provides an estimate of the consistency of an individual test-taker's performance.

Similar to the plus-or-minus error margins now widely reported with sampling-based opinion polls.

As with error margins, the smaller the standard error of measurement, the more consistency educators can ascribe to a student’s test performance.
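A minimal sketch, assuming the classical formula in which the standard error of measurement equals the test's score standard deviation times the square root of one minus its reliability; the numbers are hypothetical:

import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: score standard deviation times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: score standard deviation of 10, reliability of 0.91.
sem = standard_error_of_measurement(10.0, 0.91)
observed = 72
# A rough plus-or-minus band around an observed score, like a poll's margin.
print(f"SEM = {sem:.1f}; score band: {observed - sem:.1f} to {observed + sem:.1f}")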

ABSENCE OF BIAS

Bias is a bad thing. Bias is a preference or inclination that inhibits impartial judgment.

Absence of Bias is a good thing.

Biased tests yield results that are likely to be misinterpreted.

During the past two decades, educators have become sensitized to the possibility that our traditional testing procedures may be biased.

Types of bias: gender, religious, geographic, linguistic, racial, etc.

Educational tests have typically been written by white, middle-class Americans; tried out on white, middle-class students; and normed on white, middle-class students.

This is not a case of maliciousness but rather one of ignorance.

Test bias is operative when there are qualities in the test itself, the way it is administered, or the manner in which the results are interpreted that unfairly penalize OR give an advantage to members of a subgroup because of their membership in that subgroup.

Every time members of a minority group score lower on a test item than majority members, does that mean the test item is biased?

Of course not! It may be biased, but it may be totally unbiased and may merely be detecting deficits in the instruction received by minority children.

P-value – the proportion of students answering an item correctly.

Disparate impact – when the test scores of different groups are decidedly different.

Items showing disparate impact should be judged to determine whether they need to be modified or jettisoned.

To address potential bias, educators should exercise evaluative judgments in reviewing:

the test itself
the procedures used to administer it
the interpretations made from the test's results

A test item is offensive when it contains elements that would insult any group of test takers on the basis of their personal characteristics.

An offended student will often be distracted when completing such items and will perform poorly on those items.

A test item is biased if it unfairly penalizes a particular group of students.

Such an item can be created inadvertently, owing to the dissimilar interests or experiences of different groups of students.


BIAS ERADICATION

Subject all potential items to a stringent review by a judgmental panel whose membership includes representatives of minority groups.

Do empirical analyses of the differences between groups of students, based on actual administrations of the test items. "Flag" items with vastly differing p-values; each flagged item should then be earmarked for judgmental review.
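A sketch of that empirical flagging step with hypothetical data: compute each item's p-value separately for two groups and flag items whose p-values differ by more than a chosen cutoff:

# Hypothetical per-group item responses: rows are students, columns are items.
group_1 = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 1, 0]]
group_2 = [[1, 0, 0], [1, 0, 1], [0, 0, 1], [1, 0, 0]]
THRESHOLD = 0.25  # arbitrary flagging cutoff for this illustration

def p_values(responses):
    """Proportion of students answering each item correctly."""
    n = len(responses)
    return [sum(row[i] for row in responses) / n
            for i in range(len(responses[0]))]

p1, p2 = p_values(group_1), p_values(group_2)
for i, (a, b) in enumerate(zip(p1, p2), start=1):
    if abs(a - b) > THRESHOLD:
        # Flagged items get earmarked for judgmental review, not auto-deleted.
        print(f"Item {i}: p-values {a:.2f} vs {b:.2f} -> flag for review")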

DETECTING BIAS IN TEST ADMINISTRATION

Examiner Variables (e.g., demeanor)
Situational Variables (e.g., an undesirable testing environment)

Students need to be equally familiar with the nature of the test being taken. Ample practice opportunities should be given for students to become accustomed to the form of the test being used.

THE ELL DILEMMA

How should ELL students be assessed, in their native language or in English?

NCLB requires that children be assessed with English-language NCLB tests after three or more consecutive years in a United States school.

Be familiar with the very latest version of NCLB-related regulations.

Keep abreast of relevant new laws as well as related case law to the education and assessment of ELL students.