Correlation Coefficients That Can Be Used According to Variable Types

Variable Y/X     Quantitative X           Ordinal X                           Nominal X
Quantitative Y   Pearson r                Biserial rbis                       Point Biserial rpbis
Ordinal Y        Biserial rbis            Spearman rho / Tetrachoric rtet     Rank Biserial rrbis
Nominal Y        Point Biserial rpbis     Rank Biserial rrbis                 Phi, L, C, Lambda

Tetrachoric Correlation Coefficient (rtet)
An index reflecting the degree of the relationship between two continuous variables that have both been dichotomized.
Pearson Product-Moment Correlation Coefficient
(PPMCC)
The correlation between two sets of data measures how closely they are related. The PPMCC describes the strength and direction of the linear relationship between two quantitative variables. In simple terms, it answers the question: “Can I draw a straight line to represent the data?”
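As a brief, non-authoritative sketch (the paired values below are invented for illustration), the PPMCC can be computed with SciPy’s pearsonr:

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical paired measurements (illustrative values only)
    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    y = np.array([1.9, 4.2, 5.8, 8.5, 9.9])

    # r near +1 or -1 means the points lie close to a straight line
    r, p_value = pearsonr(x, y)
    print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")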
Annex 2
Classical Test Theory
Classical test theory (CTT), sometimes called
the true score model, is the mathematics behind
creating and answering tests and measurement
scales. The goal of CTT is to improve tests, particularly their reliability and validity.
Reliability implies consistency: If you take any test
five times, you should get roughly the same results
each time. A test is valid if it measures what it’s
supposed to.
Point-Biserial Correlation Coefficient (rpbis)
This is a special case of the Pearson coefficient in which one variable is quantitative and the other is dichotomous and nominal. The calculation simplifies because the dichotomous variable is typically coded 1 (presence) and 0 (absence).
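A minimal sketch, assuming made-up data: because the point-biserial coefficient is a special case of Pearson, scipy.stats.pointbiserialr and pearsonr applied to the 0/1-coded variable return the same value.

    import numpy as np
    from scipy.stats import pearsonr, pointbiserialr

    # Hypothetical data: item score (0 = absent/wrong, 1 = present/right) vs. a quantitative score
    item = np.array([0, 0, 1, 1, 1, 0, 1, 1])
    total = np.array([12, 15, 22, 25, 20, 14, 27, 24])

    r_pb, _ = pointbiserialr(item, total)
    r_pearson, _ = pearsonr(item, total)
    print(r_pb, r_pearson)  # identical values: point-biserial is Pearson with a 0/1 variable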
True Scores
Classical test theory assumes that each person has an innate true score. It can be summed up with an equation:

X = T + E

where:
X is the observed score
T is the true score
E is random error
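As a hypothetical sketch of this equation (the true score of 70 and the error spread of 5 percentage points are assumptions chosen to match the example that follows), repeated testing with zero-mean random error produces observed scores that average out to the true score:

    import numpy as np

    rng = np.random.default_rng(0)
    true_score = 70.0                                      # T: assumed true score (percent)
    errors = rng.normal(loc=0.0, scale=5.0, size=10_000)   # E: zero-mean random error
    observed = true_score + errors                         # X = T + E over many hypothetical retests

    print(observed.mean())  # close to 70; the errors cancel out on average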
For example, let’s assume you know exactly 70% of all the material covered in a statistics course. This is your true score (T). A perfect end-of-semester test (which doesn’t exist) would reflect this true score exactly. In reality, you’re likely to score somewhere around 65% to 75%. The discrepancy of up to 5% from your true score is the error (E).

The errors are assumed to be normally distributed with a mean of zero. Hypothetically, if you took the test an infinite number of times, the average of your observed scores would equal your true score.

Phi Coefficient (φ)
A measure of association for two binary variables. It is used for contingency tables when:
• at least one variable is a nominal variable
• both variables are dichotomous variables

Y/X       0       1       Totals
1         A       B       A+B
0         C       D       C+D
Totals    A+C     B+D     N

Contingency table
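For a 2×2 table with cell counts A, B, C, and D laid out as above, the phi coefficient can be computed as φ = (AD − BC) / √((A+B)(C+D)(A+C)(B+D)). A small sketch with invented counts:

    import numpy as np

    # Hypothetical 2x2 contingency table, laid out as in the text:
    #            X=0   X=1
    #    Y=1      A     B
    #    Y=0      C     D
    A, B, C, D = 30, 10, 15, 45

    phi = (A * D - B * C) / np.sqrt((A + B) * (C + D) * (A + C) * (B + D))
    print(f"phi = {phi:.3f}")  # 0 = no association, values near +1 or -1 = strong association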
Statistics Used in Classical Test Theory
Is your test measuring what it’s supposed to?
Classical test theory is a collection of many
statistics, including the average score, item
difficulty, and the test’s reliability.
1. Correlation: Shows how two variables, X and Y, are related to each other. Different measures are used for different test types. For example, a dichotomously scored test (e.g., yes/no answers) would use the point-biserial correlation, while a polytomously scored test (one with multiple response categories per item) would use the Pearson Product-Moment Correlation Coefficient.
2. Covariance: A measure of how much two random
variables vary together. It’s similar to variance, but
where variance tells how a single variable varies,
covariance tells how two variables vary together.
3. Discrimination Index: The ability of the test to
discriminate between different levels of learning or
other concepts of interest. A high discrimination
index indicates the test is able to differentiate
between levels.
4. Item Difficulty: A measure of the difficulty of an individual test question. It is the proportion of test takers who answered the item correctly out of the total number of test takers. For example, an item difficulty of 0.89 (89/100) means that out of 100 test takers, 89 answered the item correctly. (A computational sketch covering this and several of the other statistics in this list follows the list.)
5. Reliability Coefficient: A measure of how consistently the test measures what it is intended to measure. Several methods exist for calculating the coefficient, including test-retest, parallel or alternate-form, and internal-consistency analysis. Rules of thumb for preferred levels of the coefficient are:
• For high-stakes tests (e.g., college admissions): > 0.85
• For low-stakes tests (e.g., classroom assessment): > 0.70
6. Sample Variance / Standard Deviation: Sample
variance and sample standard deviation are
measures of how spread out the scores are.
7. Standard Error of Measurement (SEM): A measure
of how much measured test scores are spread
around a “true” score.
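To make several of the statistics above concrete, here is a hedged sketch using an invented 0/1 response matrix. It computes item difficulty (proportion correct), a simple discrimination index (item-total point-biserial correlation), an internal-consistency reliability estimate (KR-20, one of several possible methods), and the standard error of measurement:

    import numpy as np

    # Hypothetical response matrix: rows = test takers, columns = items (1 = correct, 0 = incorrect)
    responses = np.array([
        [1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 1, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 1, 1, 1, 0],
        [0, 0, 1, 0, 0],
    ])
    n_people, n_items = responses.shape
    total_scores = responses.sum(axis=1)

    # Item difficulty: proportion of test takers answering each item correctly
    difficulty = responses.mean(axis=0)

    # Discrimination index: point-biserial correlation between each item and the total score
    discrimination = np.array([
        np.corrcoef(responses[:, j], total_scores)[0, 1] for j in range(n_items)
    ])

    # KR-20 reliability (internal consistency for dichotomously scored items)
    item_var = (difficulty * (1 - difficulty)).sum()
    total_var = total_scores.var(ddof=1)
    kr20 = (n_items / (n_items - 1)) * (1 - item_var / total_var)

    # Standard error of measurement: SD of total scores times sqrt(1 - reliability)
    sem = np.sqrt(total_var) * np.sqrt(1 - kr20)

    print("difficulty:", np.round(difficulty, 2))
    print("discrimination:", np.round(discrimination, 2))
    print("KR-20 reliability:", round(kr20, 2))
    print("SEM:", round(sem, 2))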
Annex 3
Item Response Theory (IRT)
Item response theory (IRT) is a way to analyze
responses to tests or questionnaires with the goal
of improving measurement accuracy and reliability.
IRT is one way to develop tests that actually
measure what they are intended to measure (e.g.,
mathematical ability, reading ability, historical
knowledge).
The first step in IRT is the development of a two-dimensional matrix that lists examinees and their correct responses. In this matrix, 1 represents a correct answer and 0 represents an incorrect answer.
                 Item 1   Item 2   Item 3   Item 4   Item 5   Mean Proficiency Level (Q)
Person 1           1        1        1        1        1        1
Person 2           0        1        1        1        1        0.8
Person 3           0        0        1        1        1        0.6
Person 4           0        0        0        1        1        0.4
Person 5           0        0        0        0        1        0.2
Mean ID (pj)      0.8      0.6      0.4      0.2       0
A quick look at this table shows that Person 1 answered all five questions correctly (100% proficiency) while Person 4 correctly answered two questions (40% proficiency). However, proficiency isn’t the only factor in IRT; the difficulty of each question must also be considered. Suppose two test takers both score 2/5. The first may have answered two easy questions correctly, while the second may have answered two difficult questions correctly. Although both scored 40%, their proficiency is not the same.

Item response theory takes into account both the number of questions answered correctly and the difficulty of each question.
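A short numpy sketch of the matrix above (the 0/1 values are copied from the table; the variable names are mine). The row means reproduce the mean proficiency column, and the Mean ID row in the table corresponds to the proportion of examinees who answered each item incorrectly:

    import numpy as np

    # Response matrix copied from the table above (rows = persons, columns = items)
    responses = np.array([
        [1, 1, 1, 1, 1],   # Person 1
        [0, 1, 1, 1, 1],   # Person 2
        [0, 0, 1, 1, 1],   # Person 3
        [0, 0, 0, 1, 1],   # Person 4
        [0, 0, 0, 0, 1],   # Person 5
    ])

    proficiency = responses.mean(axis=1)            # row means: 1.0, 0.8, 0.6, 0.4, 0.2 (the Q column)
    proportion_correct = responses.mean(axis=0)     # column means: 0.2, 0.4, 0.6, 0.8, 1.0
    proportion_incorrect = 1 - proportion_correct   # 0.8, 0.6, 0.4, 0.2, 0.0 (the Mean ID row)

    print(proficiency)
    print(proportion_incorrect)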
There are many different models for IRT. Three of the
most popular are:
• Rasch model
• Two-parameter model
• Three-parameter model
Some researchers consider the Rasch model to be completely separate from IRT. This is mainly because the Rasch model uses only a single parameter (called a “threshold”), while general IRT models use up to three. Another reason is that IRT aims to fit a model to the data, while the Rasch model fits the data to a model. Despite these differences, both approaches are used in preference to classical test theory, in which a test taker’s scores can vary from one test to another.
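As a hedged illustration of the single-parameter idea (the ability and difficulty values below are invented), the Rasch model expresses the probability of a correct response as a logistic function of the difference between a person’s ability θ and the item’s difficulty b:

    import numpy as np

    def rasch_probability(theta, b):
        """Probability of a correct response under the Rasch (one-parameter logistic) model."""
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    # Hypothetical abilities and a single item difficulty, on the logit scale
    abilities = np.array([-1.0, 0.0, 1.0])
    difficulty = 0.5

    for theta in abilities:
        print(theta, round(rasch_probability(theta, difficulty), 2))
    # The further ability exceeds item difficulty, the higher the probability of a correct answer.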
The Rasch model
In item response theory, a model that specifies only
one parameter—item difficulty. This is thought to be
a parsimonious way to describe the relation between