When all the items on a test are of approximately equal difficulty, reliability can be estimated with the simpler Kuder-Richardson Formula 21 (KR-21) method.

It is denoted as:

$$KR_{21} = \frac{n}{n-1}\left(1 - \frac{M(n - M)}{n \cdot \mathrm{Var}}\right)$$

n: number of items on the test
Var: variance of the test scores
M: mean score for the test
For example, consider an exam taken by 20 participants that contains 45 multiple-choice questions. Since all the questions in this situation are equally challenging, we would choose the KR-21 score. If we nevertheless compute KR-20, with the summation over items of the products of the proportions passing and failing each item equal to 8.0325 and a variance of 42.0275, the KR-20 score works out to -0.17, further confirming that KR-20 is the wrong statistic for this scenario. Knowing that the mean is 32.85, we can then compute the KR-21 score as 0.5299, indicating average reliability of the test.
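As a rough illustration, here is a minimal Python sketch of the textbook KR-20 and KR-21 formulas applied to a hypothetical 0/1 score matrix. The function names and the use of population variance are our own assumptions; published values can differ slightly depending on whether population or sample variance is used.

```python
from statistics import mean, pvariance

def kr20(scores):
    """KR-20 reliability for dichotomously scored items.
    scores[i][j] = 1 if person i answered item j correctly, else 0."""
    k = len(scores[0])                      # number of items
    totals = [sum(row) for row in scores]   # each person's total score
    sum_pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in scores)  # proportion answering item j correctly
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

def kr21(scores):
    """KR-21: a simpler variant that assumes all items are equally difficult."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * pvariance(totals)))
```

With real data, the matrix would have one row per candidate and one column per item; when item difficulties really are equal, KR-21 closely tracks KR-20.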
Cronbach’s Alpha: This measures reliability, or internal consistency. For a test with more than two answer possibilities (or opportunities for partial credit), use Cronbach’s alpha instead of the Kuder-Richardson formulas. Cronbach’s alpha is also commonly used to check whether multiple-question Likert-scale surveys are reliable.
It is denoted as:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} \sigma_j^2}{\sigma_X^2}\right)$$

k: number of items on the test
$\sum_{j=1}^{k} \sigma_j^2$: sum of the “j” item score variances
$\sigma_X^2$: variance of the total test scores
For example, consider an exam with 20 participants that contains 14 questions with more than two answer possibilities (or opportunities for partial credit). If the sum of the “j” item score variances is 42.9 and the variance of the total test scores is 161.4, Cronbach’s alpha works out to 0.7907, which indicates adequate-to-good reliability.
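The figure above can be reproduced directly from the summary statistics; a small sketch (the function name is our own, the values are those of the example):

```python
def cronbach_alpha(k, sum_item_variances, total_variance):
    """Cronbach's alpha from summary statistics."""
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Summary statistics from the example above:
print(round(cronbach_alpha(14, 42.9, 161.4), 4))  # -> 0.7907
```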
Test Validity
Validity indicates whether the characteristic
measured by a test is related to job qualifications
and requirements for entry-level, competent
practitioners. Validity gives meaning to the test
scores. Validity evidence indicates there is linkage
between test performance and job performance.
It is important to understand the differences between reliability and validity. Validity demonstrates how good a test is for a particular situation; reliability indicates how trustworthy a score on that test will be. Examiners must carefully select a test that is both reliable and valid for each unique situation.

Methods for Conducting Test Validation Studies

The validity of a certification examination requires analysis of the entire process, including the supporting research for the examination (job analysis and scheme development) as well as the security and integrity of the process for administering and scoring examinations. A holistic approach is necessary. Because of the diversity of facets that impact validity, statistical indicators of the validity of an examination are rarely employed, but they may be useful.
Broad constructs for analyzing certification examinations are often defined as “face validity,” “criterion-related validity,” “content-related validity,” and “construct-related validity.” The simplest
of these is face validity—whether or not the
examination appears (to examination candidates)
to relate to important elements of professional
practice. This is a qualitative metric that is important
for public acceptance and the reputation of the
examination. The remaining constructs include
quantitative metrics and are defined as follows:
1. Criterion-related validation requires
demonstration of a correlation or other
statistical relationship between test performance
and job performance. In other words, individuals
who score high on the test tend to perform better
on the job than those who score low on the test.
If the criterion is obtained at the same time the
test is given, it is called concurrent validity; if the
criterion is obtained at a later time, it is called
predictive validity.
The criterion-related validity of a test is
measured by the validity coefficient. It is
reported as a number between 0 and 1.00 and
indicates the magnitude of the relationship,
“r,” between the test and a measure of job
performance (criterion). The larger the validity
coefficient, the more confidence there is
in predictions made from the test scores.
However, a single test can never fully predict
job performance because success on the job
depends on so many varied factors. Therefore,
validity coefficients, unlike reliability coefficients,
rarely exceed r = 0.40.
The validity coefficient is denoted as:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$

x: exam score of a test taker in group 1
$\bar{x}$: arithmetic mean of the exam scores of group 1
y: exam score of a test taker in group 2
$\bar{y}$: arithmetic mean of the exam scores of group 2
As a general rule, the higher the validity coefficient, the more beneficial it is to use the test.
Validity coefficients of r = 0.21 to r = 0.35 are
typical for a single test.
For example, consider an exam that contains 45 multiple-choice questions, taken by two groups of 20 participants each. If the sum of squared deviations of individual scores from the mean score is 840.55 in group 1 and 779.2 in group 2, and the sum of the cross-products of the deviations is 792.4, we can derive a correlation of 0.98, which would be very beneficial (see Table 4).
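The same figure falls out of the summary statistics directly; here is a small Python sketch (the function name is our own, and the raw-score helper assumes paired scores for the same candidates):

```python
import math

def validity_coefficient(x, y):
    """Pearson correlation between paired score lists x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Or directly from the summary statistics in the example:
print(792.4 / math.sqrt(840.55 * 779.2))  # -> 0.979..., i.e., about 0.98
```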
TABLE 4
General Guidelines for Interpreting Validity Coefficients

Validity coefficient value    Interpretation
above .35                     very beneficial
.21 - .35                     likely to be useful
.11 - .20                     depends on circumstances
below .11                     unlikely to be useful
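If the interpretation is wanted programmatically, Table 4’s bands are trivial to encode; a sketch with the cut-offs read straight from the table (the function name is our own):

```python
def interpret_validity(r):
    """Map a validity coefficient to its Table 4 interpretation band."""
    if r > 0.35:
        return "very beneficial"
    if r >= 0.21:
        return "likely to be useful"
    if r >= 0.11:
        return "depends on circumstances"
    return "unlikely to be useful"

print(interpret_validity(0.98))  # -> very beneficial
```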
2. Content-related validation is a non-statistical
type of validity and requires a demonstration
that the content of the test represents important
job-related behaviors. In other words, test items
should be relevant to, and should directly measure, important requirements and qualifications for the job.
3. Construct-related validation requires a
demonstration that the test measures the
construct or characteristic it claims to measure,
and that this characteristic is important to
successful performance on the job.
Professionally developed tests should come with
reports on validity evidence, including detailed
explanations of how validation studies were conducted. If examiners develop their own tests or
procedures, they will need to conduct their own
validation studies.
Standard Error of Measurement (SEM)

All examinations are imperfect measures of professional competency. It is important that certification bodies are aware of this and use available statistics to estimate the level of possible errors. For traditional multiple-choice examinations, a statistical estimate of this error is called the “Standard Error of Measurement” (SEM). The SEM is comparable to the statistical estimate “Uncertainty of Measurement” (MU), which is estimated by product-testing laboratories (ISO/IEC 17025).

The SEM provides an estimate of the margin of error that is expected in an individual test score because of the imperfect reliability of the test.
The SEM represents the degree of confidence that
a person’s “true” score lies within a particular range
of scores. For example, an SEM of “2” indicates
that a test taker’s “true” score probably lies within
two points in either direction of the score he or she
receives on the test. This means that if an individual
receives a 91 on the test, there is a good chance the
true score lies somewhere between 89 and 93.
The SEM is a useful measure of the accuracy of
individual test scores. The smaller the SEM, the
more accurate the measurements.
It is denoted as:

$$SEM = SD\sqrt{1 - r_{xx}}, \qquad r_{xx} = \frac{\sigma_T^2}{\sigma_X^2}$$

SD: standard deviation of the test scores
$r_{xx}$: reliability, or precision, of the test
$\sigma_T^2$: variance of the true scores
$\sigma_X^2$: variance of the observed scores
We use the SEM to calculate confidence intervals around obtained scores:
68 % CI = Score ± SEM
95 % CI = Score ± (1.96*SEM)
99 % CI = Score ± (2.58*SEM)
For example, consider an exam with 20 participants that contains 45 multiple-choice questions. If the standard deviation of the scores is 6.65128 and the reliability of the test is 0.52988, the standard error of measurement works out to about 4.6. This implies that a candidate’s true score lies within the raw score ± 4.6 (68% CI), the raw score ± 9.02 (95% CI), or the raw score ± 11.87 (99% CI).
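As a closing sketch, the SEM and its confidence intervals are straightforward to compute; the function names (and the example raw score of 91, taken from the earlier illustration) are our own:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score, sem_value, z=1.96):
    """CI around an observed score; z = 1.0 (68%), 1.96 (95%), 2.58 (99%)."""
    return (score - z * sem_value, score + z * sem_value)

s = sem(6.65128, 0.52988)          # -> 4.56, about the 4.6 used in the text
print(confidence_interval(91, s))  # 95% CI around a raw score of 91
```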