takers (pictured below in orange).
Notice that the lists intersect at a score of 31, which can then be used as a cut-off score.
Test Reliability
Test reliability is an index of the consistency of
scores produced by the test, with a higher value
being desirable. A value of 1.0 indicates a perfectly
reliable test. A value of 0.0 indicates the test
essentially produces random scores.
A reliable test measures what it claims to measure consistently: if a person were to take the test again, he or she would get a similar score. Reliability, in other words, refers to how dependably a test measures a characteristic; a test that yields similar scores for a person who repeats it is said to measure that characteristic reliably.
How do we account for an individual who does not
get exactly the same test score every time he or
she takes the test? Some possible reasons are as
follows:
• Test taker’s temporary psychological or
physical state. Test performance can be
influenced by a person’s psychological or
physical state at the time of testing. For
example, differing levels of anxiety, fatigue,
or motivation may affect the applicant’s test
results.
• Environmental factors. Differences in the
testing environment, such as room temperature,
lighting, noise, or even the test administrator,
can influence an individual’s test performance.
• Test form. When tests are administered on multiple dates, for security reasons, additional forms of the test may be necessary. It is expected that test forms will be revised at least annually. Test forms must be assembled to the same “test blueprint.” Different forms of a test are known as parallel forms or alternate forms. These forms are designed to have similar measurement characteristics, but they contain different items. Because the forms are not exactly the same, a test taker might do better on one form than on another.
• Multiple raters. In certain tests, scoring is determined by a rater’s judgments of the test taker’s performance or responses. Differences in training, experience, and frame of reference among raters can produce different test scores for the test taker.
These factors are sources of chance or random
measurement error in the assessment process. If
there were no random errors of measurement, the individual would get the same test score every time. The degree
to which test scores are unaffected by measurement
errors is an indication of the reliability of the test.
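One standard way to make this idea precise, not spelled out in the article but consistent with it, is the classical test theory view that an observed score is a true score plus random error; reliability is then the share of observed-score variance that is not due to error:

$$X = T + E, \qquad \text{reliability} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}$$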
Types of Reliability Estimates
There are several types of reliability estimates, each
influenced by different sources of measurement
error. The acceptable level of reliability will differ
depending on the type of test and the reliability
estimate used.
1. Test-retest reliability indicates the repeatability of test scores with the passage of time. This estimate also reflects the stability of the characteristic or construct being measured by the test. For constructs that are expected to vary over time, an acceptable test-retest reliability coefficient may be lower than is suggested in Table 3 below. (A computational sketch of these reliability estimates follows this list.)
2. Alternate or parallel form reliability indicates the
likelihood of achieving consistent test scores
if a person takes two or more forms of a test. A
high parallel form reliability coefficient indicates
that the different forms of the test are very
similar, which means that it makes virtually no
difference which version of the test a person
takes. On the other hand, a low parallel form
reliability coefficient suggests that the different
forms are probably not comparable; they may be
measuring different things and therefore cannot
be used interchangeably.
3. Inter-rater reliability applies most often to
examinations administered by examiners
(vs. objective multiple-choice examinations).
Inter-rater reliability indicates the likelihood of
achieving consistent test scores when two or
more raters score the test. On some tests, raters
evaluate responses to questions and determine
the scores. Differences in judgment among
raters are likely to produce variations in test
scores. A high inter-rater reliability coefficient
indicates that the judgment process is stable,
and the resulting scores are reliable. Inter-rater
reliability coefficients are typically lower than
other types of reliability estimates. However, it
is possible to obtain higher levels of inter-rater reliability if raters are appropriately trained.
4. Internal consistency reliability indicates the
extent to which items on a test measure the
same thing. A high internal consistency reliability
coefficient for a test indicates the items on the
test are very similar to each other in content
(homogeneous). It is important to note that the
length of a test can affect internal consistency
reliability. For example, a very lengthy test can
spuriously inflate the reliability coefficient.
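As a rough computational sketch of how the first three of these estimates are often obtained, the Python example below correlates two sets of scores in each case. The candidate scores are invented for illustration, and treating each estimate as a simple Pearson correlation is a simplification (inter-rater reliability, in particular, is often assessed with more specialized statistics); internal consistency is illustrated later in the discussion of coefficient alpha and KR-20.

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical scores for ten candidates (illustrative values only).
first_administration = [72, 65, 88, 91, 70, 84, 77, 69, 93, 80]
second_administration = [70, 68, 85, 93, 72, 81, 79, 66, 95, 78]  # same form, later date
alternate_form_scores = [74, 63, 86, 89, 73, 85, 75, 70, 90, 82]  # parallel form of the test
second_rater_scores = [71, 66, 87, 90, 69, 83, 78, 70, 92, 81]    # same performances, second rater

# 1. Test-retest reliability: correlate scores from two administrations of the same form.
test_retest_r = correlation(first_administration, second_administration)

# 2. Alternate (parallel) form reliability: correlate scores from two different forms.
parallel_form_r = correlation(first_administration, alternate_form_scores)

# 3. Inter-rater reliability: correlate the scores assigned by two independent raters.
inter_rater_r = correlation(first_administration, second_rater_scores)

print(f"test-retest r   = {test_retest_r:.2f}")
print(f"parallel-form r = {parallel_form_r:.2f}")
print(f"inter-rater r   = {inter_rater_r:.2f}")
```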
Kuder-Richardson Formula 20, or KR-20, is a reliability measure for a test with binary variables (i.e., answers that are scored as right or wrong). It indicates how consistently the test measures what it is intended to measure.
Interpretation of Reliability
The reliability of a test is indicated by the reliability
coefficient. It is denoted by the letter “r” and
expressed as a number ranging between 0 and
1.00, with r = 0 indicating no reliability and r = 1.00 indicating perfect reliability.
Generally, you will see the reliability of a test as a
decimal, for example, r = 0.80 or r = 0.93. The larger
the reliability coefficient, the more repeatable or
reliable the test scores.
TABLE 3
General Guidelines for Interpreting Reliability Coefficients

Reliability Coefficient Value | Interpretation
.90 and up | excellent
.80 - .89 | good
.70 - .79 | adequate
below .70 | may have limited applicability
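The bands in Table 3 translate directly into a small lookup. The function below is only an illustration; the name interpret_reliability is ours, not from any standard library.

```python
def interpret_reliability(r: float) -> str:
    """Map a reliability coefficient to the Table 3 guideline interpretation."""
    if r >= 0.90:
        return "excellent"
    if r >= 0.80:
        return "good"
    if r >= 0.70:
        return "adequate"
    return "may have limited applicability"

print(interpret_reliability(0.93))  # excellent
print(interpret_reliability(0.75))  # adequate
print(interpret_reliability(0.62))  # may have limited applicability
```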
One measure of reliability used is Cronbach’s alpha. This is the general form of the more commonly reported Kuder-Richardson Formula 20 (KR-20) and can be applied to tests composed of items with different numbers of points given for different response alternatives. When the coefficient alpha is applied to tests in which each item has only one correct answer and all correct answers are worth the same number of points, the resulting coefficient is identical to KR-20.
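That equivalence can be checked directly. The sketch below, using invented right/wrong (1/0) responses, computes coefficient alpha from the item variances and KR-20 from each item's p and q; because the variance of a 1/0 item equals p times q, the two coefficients coincide.

```python
from statistics import pvariance  # population variance

# Hypothetical right/wrong (1/0) responses: rows = candidates, columns = items.
responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
]

k = len(responses[0])                     # number of items
items = list(zip(*responses))             # responses grouped by item
totals = [sum(row) for row in responses]  # each candidate's total score
total_var = pvariance(totals)             # variance of the total scores

# Coefficient alpha: based on the sum of the individual item variances.
alpha = (k / (k - 1)) * (1 - sum(pvariance(item) for item in items) / total_var)

# KR-20: based on p (proportion passing) and q (proportion failing) for each item.
sum_pq = 0.0
for item in items:
    p = sum(item) / len(item)
    sum_pq += p * (1 - p)
kr20 = (k / (k - 1)) * (1 - sum_pq / total_var)

print(f"alpha = {alpha:.4f}")
print(f"KR-20 = {kr20:.4f}")  # identical to alpha for one-point, right/wrong items
```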
Estimates of test reliability are only meaningful
when there are a sufficient number of examinations
administered, typically requiring data from at least
100 candidates. While newly formed certification
bodies may not have access to sufficient data to
estimate reliability, it is expected that more mature
programs will estimate and consider statistical
reliability in their validation processes.
Kuder-Richardson Method
The KR-20 is used for items that have varying
difficulty. For example, some items might be very
easy, while others are more challenging. It should
only be used if there is a correct answer for each
question—it shouldn’t be used for questions where
partial credit is possible or for scales like the Likert
scale.
KR-20 Scores: The scores for KR-20 range from 0 to
1, where 0 is no reliability and 1 is perfect reliability.
The closer the score is to 1, the more reliable the
test.
It is denoted as:

$$\mathrm{KR\text{-}20} = \frac{n}{n-1}\left(1 - \frac{\sum pq}{\mathrm{Var}}\right)$$

where:
n = the number of items on the test
p = the proportion of people passing an item
q = the proportion of people failing that item (q = 1 - p)
Var = the variance of the total test scores
Σ = sum over all items. In other words, multiply each question’s p by q, then add the products together. If the test has 10 items, you will compute p × q for each of the 10 items and then add those 10 values to get the total.
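As a worked illustration with invented numbers: for a 10-item test in which the summed p × q values come to 1.6 and the variance of the total scores is 8.0,

$$\mathrm{KR\text{-}20} = \frac{10}{10-1}\left(1 - \frac{1.6}{8.0}\right) \approx 0.89,$$

which Table 3 would classify as good reliability.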
KR-21 Scores: If all questions in your binary test are equally challenging, use the Kuder-Richardson Formula 21 (KR-21) instead.