Group Assignment: Reliability

Sources:
https://www.uni.edu/chfasoa/reliabilityandvalidity.htm
http://www.measuringu.com/blog/measure-reliability.php
https://explorable.com/definition-of-reliability?gid=1579
https://explorable.com/internal-consistency-reliability?gid=1579

Source 1

Upload: anitah-gotandabani

Post on 30-Jan-2016






Education Measurement


Page 1: Tugasan Kumpulan_reliability (1)






Reliability is the degree to which an assessment tool produces stable and consistent results.

Reliability is a measure of the consistency of a metric or a method.

 Every metric or method we use, including things like methods for uncovering usability problems in an interface and expert judgment, must be assessed for reliability.

Imagine that a researcher discovers a new drug that she believes helps people to become more intelligent, a process measured by a series of mental exercises. After analyzing the results, she finds that the group given the drug performed the mental tests much better than the control group.

For her results to be reliable, another researcher must be able to perform exactly the same experiment on another group of people and generate results with the same statistical significance. If repeat experiments fail, then there may be something wrong with the original research.

Types of Reliability


1. Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.  The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. 



Example:  A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first.  The obtained correlation coefficient would indicate the stability of the scores.

2. Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals.  The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. 


Example:  If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.


3. Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions.  Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed. 


Example:  Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards.  Inter-rater reliability is especially useful when judgments can be considered relatively subjective.  Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.

4. Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. 


A. Average inter-item correlation is a subtype of internal consistency reliability.  It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients.  This final step yields the average inter-item correlation.


B. Split-half reliability is another subtype of internal consistency reliability.  The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items.  The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
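The average inter-item correlation described under subtype A can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the response data are invented, and the helper names `pearson_r` and `average_inter_item_correlation` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def average_inter_item_correlation(responses):
    """Mean of the correlations between every pair of items.

    responses: one row per respondent, one column per item.
    """
    items = list(zip(*responses))          # transpose: one tuple per item
    pairs = list(combinations(items, 2))   # every unordered pair of items
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


# Invented data: five respondents answering three reading-comprehension items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]
print(round(average_inter_item_correlation(responses), 3))
```

A value close to 1 suggests the items move together, which is what one would hope for when they are meant to probe the same construct.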



The test-retest reliability method is one of the simplest ways of testing the stability and reliability of an instrument over time.

For example, if a group of students takes a test, you would expect them to show very similar results if they take the same test a few months later. This definition relies upon there being no confounding factor during the intervening time interval.

Instruments such as IQ tests and surveys are prime candidates for test-retest methodology, because there is little chance of people experiencing a sudden jump in IQ or suddenly changing their opinions.

On the other hand, educational tests are often not suitable, because students will learn much more information over the intervening period and show better results in the second test.

Test-Retest Reliability and the Ravages of Time

For example, if a group of students take a geography test just before the end of semester and another when they return to school at the beginning of the next, the tests should produce broadly the same results.

If, on the other hand, the test and retest are taken at the beginning and at the end of the semester, it can be assumed that the intervening lessons will have improved the ability of the students. Thus, test-retest reliability will be compromised and other methods, such as split-halves testing, are better.

Even if a test-retest reliability process is applied with no sign of intervening factors, there will always be some degree of error. There is a strong chance that subjects will remember some of the questions from the previous test and perform better.

Some subjects might just have had a bad day the first time around or they may not have taken the test seriously. For these reasons, students facing retakes of exams can expect to face different questions and a slightly tougher standard of marking to compensate.

Even in surveys, it is quite conceivable that there may be a big change in opinion. People may have been asked about their favourite type of bread. In the intervening period, if a bread company mounts a long and expansive advertising campaign, this is likely to influence opinion in favour of that brand. This will jeopardise the test-retest reliability, and so the analysis must be handled with caution.

Test-Retest Reliability and Confounding Factors

To give an element of quantification to test-retest reliability, the scores from the test and the retest are correlated. In practice this yields a number between zero and one, with 1 being a perfect correlation between the test and the retest.

Perfection is impossible and most researchers accept a lower level, either 0.7, 0.8 or 0.9, depending upon the particular field of research.
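As a rough illustration of this quantification, the sketch below correlates two sets of scores and compares the result with a threshold. The scores are invented, the helper name `pearson_r` is hypothetical, and the 0.7 cut-off is simply the lowest of the levels mentioned above, not a universal rule.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Invented scores for the same six students on the test and the retest.
test_scores = [72, 85, 90, 64, 78, 88]
retest_scores = [70, 88, 91, 60, 80, 85]

ACCEPTABLE_LEVEL = 0.7  # assumption: field-dependent, could equally be 0.8 or 0.9

test_retest_r = pearson_r(test_scores, retest_scores)
print(round(test_retest_r, 3), test_retest_r >= ACCEPTABLE_LEVEL)
```

Note that a correlation coefficient can in principle be negative; a negative value here would signal a serious problem with the instrument or the data rather than low-but-usable reliability.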

However, this cannot remove confounding factors completely, and a researcher must anticipate and address these during the research design to maintain test-retest reliability.

The test for correlation is much more accurate with large subject groups, which dampen the chances of a few subjects skewing the results, for whatever reason, by drowning out the extremes.

For any research program that requires qualitative rating by different researchers, it is important to establish a good level of interrater reliability, also known as interobserver reliability.

This ensures that the generated results meet the accepted criteria defining reliability, by quantitatively defining the degree of agreement between two or more observers.
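One way to put a number on the degree of agreement between two observers is sketched below. The text does not name a specific statistic, so the choice of simple percent agreement and Cohen's kappa (agreement corrected for chance) is an assumption made for illustration; the ratings and function names are invented.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Share of cases on which the two raters gave the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Invented ratings of eight portfolios by two judges.
judge_a = ["meets", "meets", "below", "exceeds", "meets", "below", "exceeds", "meets"]
judge_b = ["meets", "below", "below", "exceeds", "meets", "below", "meets", "meets"]

print(percent_agreement(judge_a, judge_b))       # 0.75
print(round(cohens_kappa(judge_a, judge_b), 3))  # 0.6
```

Kappa is lower than raw agreement here because some of the matches would be expected even if the judges rated at random with the same category frequencies.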

Interrater Reliability and the Olympics

Interrater reliability is the most easily understood form of reliability, because everybody has encountered it.

For example, any sport that uses judges, such as Olympic ice skating, or an event such as a dog show, relies upon the judges maintaining a high degree of consistency with one another. If even one of the judges is erratic in their scoring, this can jeopardize the entire system and deny a participant their rightful prize.

Outside the world of sport and hobbies, inter-rater reliability has some far more important connotations and can directly influence your life.

Examiners marking school and university exams are assessed on a regular basis, to ensure that they all adhere to the same standards. This is the most important example of interobserver reliability - it would be extremely unfair to fail an exam because the observer was having a bad day.

For most examination boards, appeals are usually rare, showing that the interrater reliability process is fairly robust.

An Example From Experience

I used to work for a bird protection charity and, every morning, we went down to the seashore and estimated the number of individuals for each bird species.

Obviously, you cannot count thousands of birds individually; apart from the huge numbers, they constantly move, leaving and rejoining the group. Using experience, we estimated the numbers and then compared our estimates.

If one person estimated 1000 dunlin, one 4000 and the other 12000, then there was something wrong with our estimation and it was highly unreliable.

If, however, we independently came up with figures of 4000, 5000 and 6000, then that was close enough for our purposes, and we knew that we could use the average with a good degree of confidence.
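The contrast between the two sets of estimates can be made concrete with a simple measure of relative spread. The coefficient of variation used here is an illustrative choice, not something the original account mentions, and the function name is hypothetical.

```python
from statistics import mean, stdev


def relative_spread(estimates):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(estimates) / mean(estimates)


unreliable_counts = [1000, 4000, 12000]  # the three widely differing estimates
usable_counts = [4000, 5000, 6000]       # the three estimates that roughly agree

print(round(relative_spread(unreliable_counts), 2))
print(round(relative_spread(usable_counts), 2))
print(mean(usable_counts))  # the average the observers would go on to use
```

The first set varies by about as much as its own mean, while the second varies by roughly a fifth of it, which matches the intuition that only the second average is worth using.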


Qualitative Assessments and Interrater Reliability

Any qualitative assessment using two or more researchers must establish interrater reliability to ensure that the results generated will be useful.

One good example is Bandura's Bobo Doll experiment, which used a scale to rate the levels of displayed aggression in young children. Apart from extensive pre-testing, the observers constantly compared and calibrated their ratings, adjusting their scales to ensure that they were as similar as possible.


Guidelines and Experience

Interobserver reliability is strengthened by establishing clear guidelines and thorough experience. If the observers are given clear and concise instructions about how to rate or estimate behavior, this increases the interobserver reliability.

Experience is also a great teacher; researchers who have worked together for a long time will be fully aware of each other's strengths, and will be surprisingly similar in their observations.

Internal consistency reliability defines the consistency of the results delivered in a test, ensuring that the various items measuring the different constructs deliver consistent scores.

For example, an English test is divided into vocabulary, spelling, punctuation and grammar. The internal consistency reliability test provides a measure of whether each of these particular aptitudes is measured correctly and reliably.

One way of testing this is by using a test-retest method, where the same test is administered some time after the initial test and the results compared.

However, this creates some problems and so many researchers prefer to measure internal consistency by including two versions of the same instrument within the same test. Our example of the English test might include two very similar questions about comma use, two about spelling and so on.

The basic principle is that the student should give the same answer to both - if they do not know how to use commas, they will get both questions wrong. A few nifty statistical manipulations will give the internal consistency reliability and allow the researcher to evaluate the reliability of the test.

There are three main techniques for measuring the internal consistency reliability, depending upon the degree, complexity and scope of the test.

They all check that the results and constructs measured by a test are correct, and the exact type used is dictated by subject, size of the data set and resources.


Split-Halves Test

The split halves test for internal consistency reliability is the easiest type, and involves dividing a test into two halves.

For example, a questionnaire to measure extroversion could be divided into odd and even questions. The results from both halves are statistically analysed, and if there is weak correlation between the two, then there is a reliability problem with the test.

The split halves test gives a measurement of between zero and one, with one meaning a perfect correlation.

The division of the questions into two sets must be random. Split halves testing was a popular way to measure reliability, because of its simplicity and speed.

However, in an age where computers can take over the laborious number crunching, scientists tend to use much more powerful tests.
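A minimal sketch of the split halves procedure follows: the items are divided at random into two sets, each respondent's total is computed for each set, and the two columns of totals are correlated. The questionnaire data, the function names and the `seed` parameter are all invented for illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half_reliability(responses, seed=None):
    """Correlate respondents' totals on two randomly chosen halves of the items.

    responses: one row per respondent, one column per item.
    seed: optional, makes the random division repeatable.
    """
    item_indices = list(range(len(responses[0])))
    random.Random(seed).shuffle(item_indices)
    half = len(item_indices) // 2
    set_a, set_b = item_indices[:half], item_indices[half:]
    totals_a = [sum(row[i] for i in set_a) for row in responses]
    totals_b = [sum(row[i] for i in set_b) for row in responses]
    return pearson_r(totals_a, totals_b)


# Invented data: six respondents rating six extroversion items from 1 to 5.
answers = [
    [5, 4, 5, 4, 5, 4],
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [1, 2, 1, 1, 2, 1],
    [4, 5, 4, 5, 4, 5],
]
print(round(split_half_reliability(answers, seed=0), 3))
```

Because the result depends on which items land in which half, different seeds give slightly different values; that dependence on a single arbitrary split is one reason the more elaborate tests are preferred.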

Kuder-Richardson Test

The Kuder-Richardson test for internal consistency reliability is a more advanced, and slightly more complex, version of the split halves test.

In this version, the test works out the average correlation for all the possible split half combinations in a test. The Kuder-Richardson test also generates a correlation of between zero and one, with a more accurate result than the split halves test. The weakness of this approach, as with split-halves, is that the answer for each question must be a simple right or wrong answer, zero or one.
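For right/wrong items, the calculation can be sketched as follows. The text does not say which Kuder-Richardson formula is meant; this sketch assumes the commonly used formula 20 (KR-20), computed from the proportion of correct answers per item and the variance of the total scores, and uses population variances throughout. The data and the function name `kr20` are invented.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 0 (wrong) or 1 (right).

    responses: one row per test taker, one column per item.
    """
    n_takers = len(responses)
    n_items = len(responses[0])
    total_variance = pvariance([sum(row) for row in responses])
    if n_items < 2 or total_variance == 0:
        raise ValueError("need at least two items and some spread in total scores")
    # For each item, p is the proportion answering correctly and q = 1 - p.
    pq_sum = 0.0
    for item in zip(*responses):
        p = sum(item) / n_takers
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Invented data: five test takers, four right/wrong questions.
scored = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scored), 3))  # 0.8
```

No splitting of the test is needed in this form: the formula works directly from the item statistics.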

For multi-scale responses, sophisticated techniques are needed to measure internal consistency reliability.


Cronbach's Alpha Test

The Cronbach's Alpha test not only averages the correlation between every possible combination of split halves, but it allows multi-level responses.

For example, a series of questions might ask the subjects to rate their response between one and five. Cronbach's Alpha gives a score of between zero and one, with 0.7 generally accepted as a sign of acceptable reliability.

The test also takes into account both the size of the sample and the number of potential responses. A 40-question test with possible ratings of 1 - 5 is seen as having more accuracy than a ten-question test with three possible levels of response.

Of course, even with Cronbach's clever methodology, which makes calculation much simpler than crunching through every possible permutation, this is still a test best left to computers and statistics spreadsheet programmes.
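To show what such a program computes, here is a minimal sketch of Cronbach's Alpha using its standard variance-based formula: the number of items k, the sum of the individual item variances, and the variance of the respondents' total scores. The ratings and the function name `cronbach_alpha` are invented for illustration.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's Alpha from item variances and the variance of total scores.

    responses: one row per respondent, one column per item.
    """
    n_items = len(responses[0])
    item_variances = [pvariance(item) for item in zip(*responses)]
    total_variance = pvariance([sum(row) for row in responses])
    if n_items < 2 or total_variance == 0:
        raise ValueError("need at least two items and some spread in total scores")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Invented data: five respondents rating three statements from 1 to 5.
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3), alpha >= 0.7)
```

When every item is scored 0 or 1, each item variance reduces to p(1 - p), so the same function gives the Kuder-Richardson result as a special case.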


Internal consistency reliability is a measure of how well a test addresses different constructs and delivers reliable scores. The test-retest method involves administering the same test, after a period of time, and comparing the results.

By contrast, measuring the internal consistency reliability involves measuring two different versions of the same item within the same test.