Procedures for measuring attributes can be judged on a variety of merits. These include
practical as well as technical issues. All measurement procedures, whether qualitative or
quantitative, have strengths and weaknesses—no one procedure is perfect for every task. In order
to improve a study it is frequently prudent for an investigator to use multiple measurement tools
and triangulate the results.
Some of the practical issues that need to be considered for each tool include:
- Training required
- Ease of administration, scoring, analysis
- Time and effort required for respondents to complete the measure
- Completeness of the data gathered
- Potential sources of bias
- Relevance to research question
Along with the practical issues, quantitative measurement procedures (especially surveys, tests,
and scales) may be judged on the technical characteristics or psychometric properties of the
instruments. There are two major categories of psychometric properties—reliability and
validity—both of which are important for good quantitative research instruments. The following
description is a general outline of the major forms of reliability and validity. For more specific
information the reader is urged to consult a good text on psychometrics (e.g., Furr and
A good measure of some entity is expected to produce consistent scores. A procedure's
reliability is estimated using a coefficient (i.e., a numerical summary). For purposes of
service-learning research, the major types of coefficients include:
- Temporal consistency: the ability of an instrument to give consistent scores for the same
entity from one time to another. Also known as test-retest reliability, it is estimated with
the correlation coefficient between two administrations of the same scale.
- Coherence: the consistency of items within a scale. Internal consistency reliability
estimates the consistency among all items in the instrument (typically measured using
Cronbach's coefficient alpha). According to Nunnally (1967), coefficient alpha is an
estimate of the correlation between the scale and a hypothetical alternative form of the
scale of the same length. Alternatively, it is an estimate of the correlation of the scale
with the construct's true score. An important principle related to coefficient alpha
is that, other things being equal (e.g., item quality), the more items a scale contains, the
more reliable, coherent, and error-free it will be.
- Scoring agreement: the degree to which different observers or raters give consistent
scores using the same instrument, rating scale, or rubric. Also called inter-rater
reliability, it is a particularly important consideration when using rubrics. Knowing the
inter-rater reliability helps to establish the credibility of the rubric being used and helps
the investigator feel confident in the results and conclusions coming from the research.
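As a rough illustration of the three coefficients listed above, the following Python sketch computes a test-retest correlation, Cronbach's coefficient alpha, and an index of scoring agreement. Several things here are assumptions of the sketch rather than part of the text above: Cohen's kappa is only one common choice of agreement index, the Spearman-Brown projection is used simply to illustrate the principle that longer scales tend to be more reliable, all function names are hypothetical, and all scores are invented for the example.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(responses):
    """Coefficient alpha; `responses` holds one list of item scores per respondent."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # regroup scores by item
    sum_item_var = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


def spearman_brown(reliability, length_factor):
    """Projected reliability if the scale were `length_factor` times as long."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters using categorical scores."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Invented data: six respondents measured twice with the same scale.
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 10]
test_retest = pearson_r(time1, time2)

# Invented data: five respondents answering a four-item scale.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)

# Invented data: two raters scoring eight papers with a four-level rubric.
rater_a = [3, 2, 4, 1, 3, 2, 4, 3]
rater_b = [3, 2, 4, 2, 3, 2, 3, 3]
kappa = cohen_kappa(rater_a, rater_b)

print(f"test-retest r = {test_retest:.2f}")
print(f"coefficient alpha = {alpha:.2f}")
print(f"Cohen's kappa = {kappa:.2f}")
print(f"projected reliability at double length = {spearman_brown(0.70, 2):.2f}")
```

With real data an investigator would more likely rely on an established statistics package; the point of the sketch is only to show what each coefficient summarizes.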
For more information about reliability please refer to Types of Reliability
(www.socialresearchmethods.net/kb/reltypes.php) and Reliability Analysis
A valid measurement tool or procedure does a good job of measuring the concept that it
purports to measure. Validity of an instrument only applies to a specific purpose with a specific
group of people. For example, a scale is not considered simply "valid" or "invalid"; rather, it
might be considered valid for measuring social responsibility outcomes with college freshmen but
not for measuring knowledge of the nonprofit sector among professionals. Below are three main classes of validity,
each having several subtypes.
- Construct validity: The theoretical concept (e.g., intelligence, moral development,
content knowledge) that is being measured is called the construct. Construct validity
establishes that the procedure or instrument is measuring the desired construct because
the operationalization (e.g., scores on the scale) conforms to theoretical predictions. This
is the most important form of validity, because it subsumes other forms of validity.
  - Convergent validity: Correlation of scores on an instrument with other variables or
    scores that should theoretically be similar. For example, two measures of social
    responsibility should yield similar scores and therefore be highly correlated.
  - Discriminant validity: Comparison of scores on an instrument with other variables
    or scores from which it should theoretically differ. For example, a measure of
    verbal ability should not be highly correlated with artistic skills.
  - Factor structure: Factor analysis provides an empirical examination of the
    internal consistency of an instrument. The items that are theoretically supposed to
    be measuring one concept (i.e., a subscale) should correlate highly with each
    other and all load on the same factor, but have low correlations with items
    measuring a theoretically different concept (an orthogonal or independent factor).
    In some cases, the theoretical construct might have multiple dimensions and the
    factor structure will not be unidimensional, but the factor structure should
    correspond to the theoretical structure.
- Content validity: Establishes that the instrument includes items that are judged to be
representative of a clearly delineated content domain. For example, the IUPUI Center for
Service and Learning established a framework of knowledge, skills, and dispositions of a
civic-minded graduate (Bringle & Steinberg, in press). They used this conceptual
framework to develop items for an instrument, the Civic-Minded Graduate Scale. Content
validity can be assessed by the degree to which the scale items are a representative
sample of a clearly defined conceptual domain according to the evaluation of independent
  - Face validity: A subjective judgment about whether or not, on the "face of it," the
    tool seems to be measuring what it is intended to measure.
- Criterion-related validity: The degree to which an instrument is associated with a
criterion that is implicated by the theory of the construct.
  - Concurrent validity: Comparison of scores on some instrument with concurrent
    scores on another criterion (e.g., behavioral index, independent assessment of
    knowledge). If the scale and the criterion are theoretically related in some manner,
    the scores should reflect the theorized relationship. For example, a measure of
    verbal intelligence should be highly correlated with a reading achievement test
    given at the same time, because theoretically reading skill is related to verbal
  - Predictive validity: Comparison of scores on an instrument with some future
    criterion (e.g., behavior). The instrument's scores should do a reasonable job of
    predicting the future performance. For example, scores on a social responsibility
    scale would be expected to be a fairly good predictor of future post-graduation civic
    involvement (e.g., voting, volunteering).
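Several of the validity checks described above come down to comparing correlations. The following Python sketch, offered only as an illustration, contrasts a convergent correlation (two measures of the same construct), a discriminant correlation (a theoretically unrelated variable), and a predictive correlation (a later criterion). All scores and variable names are invented for the example, and the text above does not prescribe any particular cutoff for what counts as "high" or "low."

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Invented scores for six students on a social responsibility scale.
social_resp_scale = [10, 14, 8, 18, 12, 16]

# A second, hypothetical measure of the same construct (convergent evidence).
social_resp_alt = [11, 15, 9, 17, 12, 17]

# A hypothetical variable that theory says should be unrelated (discriminant evidence).
unrelated_trait = [7, 5, 6, 6, 4, 8]

# A hypothetical later criterion, e.g., volunteer hours after graduation (predictive evidence).
later_volunteer_hours = [4, 9, 2, 15, 6, 11]

convergent_r = pearson_r(social_resp_scale, social_resp_alt)
discriminant_r = pearson_r(social_resp_scale, unrelated_trait)
predictive_r = pearson_r(social_resp_scale, later_volunteer_hours)

print(f"convergent r   = {convergent_r:.2f}  (expected to be high)")
print(f"discriminant r = {discriminant_r:.2f}  (expected to be near zero)")
print(f"predictive r   = {predictive_r:.2f}  (expected to be positive)")
```

In practice the pattern across these correlations matters more than any single value: the instrument should relate strongly to what theory says it should, and weakly to what theory says it should not.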
For more specific information about test validity, see the following web pages: