Characteristics of Good Measurement Instruments

Procedures for measuring attributes can be judged on a variety of merits, including practical as well as technical issues. All measurement procedures, whether qualitative or quantitative, have strengths and weaknesses; no one procedure is ideal for every task. To strengthen a study, it is often prudent for an investigator to use multiple measurement tools and triangulate the results.

Practical Issues

Some of the practical issues that need to be considered for each tool include:

  • Cost
  • Availability
  • Training required
  • Ease of administration, scoring, analysis
  • Time and effort required for respondents to complete the measure
  • Completeness of the data gathered
  • Potential sources of bias
  • Relevance to research question

Along with the practical issues, quantitative measurement procedures (especially surveys, tests, and scales) may be judged on the technical characteristics or psychometric properties of the instruments. There are two major categories of psychometric properties—reliability and validity—both of which are important for good quantitative research instruments. The following description is a general outline of the major forms of reliability and validity. For more specific information the reader is urged to consult a good text on psychometrics (e.g., Furr and Bacharach, 2008).

Consistency (Reliability)

A good measure of some entity is expected to produce consistent scores. A procedure's reliability is estimated with a coefficient (i.e., a numerical summary). For purposes of service-learning research, the major types of coefficients include the following (a brief computational sketch appears after the list):

  • Temporal consistency: the ability of an instrument to give consistent scores for the same entity from one time to another. Also known as test-retest reliability, it is estimated with the correlation coefficient between two administrations of the same scale.
  • Coherence: the consistency of items within a scale. Internal consistency reliability estimates the consistency among all items in the instrument (typically measured using Cronbach's coefficient alpha). According to Nunnally (1967), coefficient alpha is an estimate of the correlation between the scale and a hypothetical alternative form of the scale of the same length. Alternatively, it is an estimate of the correlation of the scale with the construct's true score. An important principle related to coefficient alpha is that, other things being equal (e.g., item quality), the more items a scale contains, the more reliable, coherent, and error-free it will be.
  • Scoring agreement: the degree to which different observers or raters give consistent scores using the same instrument, rating scale, or rubric. Also called inter-rater reliability, it is a particularly important consideration when using rubrics. Knowing the inter-rater reliability helps to establish the credibility of the rubric and helps the investigator feel confident in the results and conclusions drawn from the research.
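The coefficients above can be illustrated with a short computational sketch. Everything below is hypothetical: the scale, respondents, and rater scores are invented, and the functions are minimal illustrations (assuming NumPy is available) rather than a substitute for a full statistical package. The Spearman-Brown projection is included only to illustrate the principle that, other things being equal, longer scales tend to be more reliable.

```python
import numpy as np

def test_retest(totals_time1, totals_time2):
    """Temporal consistency: Pearson correlation between two administrations."""
    return np.corrcoef(totals_time1, totals_time2)[0, 1]

def cronbach_alpha(item_scores):
    """Internal consistency for a (respondents x items) matrix of item scores."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                               # number of items
    item_variances = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of scale totals
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def spearman_brown(alpha, length_factor):
    """Projected reliability if the scale were lengthened by the given factor."""
    return (length_factor * alpha) / (1 + (length_factor - 1) * alpha)

def percent_agreement(rater_a, rater_b):
    """Scoring agreement: proportion of cases scored identically by two raters."""
    return float(np.mean(np.asarray(rater_a) == np.asarray(rater_b)))

# Hypothetical data: five respondents completing a four-item scale twice,
# plus two raters scoring the same five products with a rubric.
time1 = np.array([[4, 5, 4, 5],
                  [2, 3, 2, 3],
                  [5, 5, 4, 4],
                  [3, 3, 3, 2],
                  [4, 4, 5, 5]])
time2 = time1 + np.array([[0, 1, 0, 0],
                          [1, 0, 0, -1],
                          [0, 0, 1, 0],
                          [0, -1, 0, 1],
                          [1, 0, 0, 0]])

alpha = cronbach_alpha(time1)
print("Test-retest r:     ", round(test_retest(time1.sum(axis=1), time2.sum(axis=1)), 2))
print("Cronbach's alpha:  ", round(alpha, 2))
print("Alpha if doubled:  ", round(spearman_brown(alpha, 2), 2))
print("Rater agreement:   ", percent_agreement([3, 2, 4, 4, 5], [3, 2, 4, 3, 5]))
```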

For more information about reliability, please refer to Types of Reliability (www.socialresearchmethods.net/kb/reltypes.php) and Reliability Analysis (faculty.chass.ncsu.edu/garson/PA765/reliab.htm).

Meaningfulness (Validity)

A valid measurement tool or procedure does a good job of measuring the concept it purports to measure. The validity of an instrument applies only to a specific purpose with a specific group of people. For example, a scale is not simply "valid" or "invalid"; it might be valid for measuring social responsibility outcomes among college freshmen but not for measuring knowledge of the nonprofit sector among professionals. Below are three main classes of validity, each having several subtypes.

  • Construct validity: The theoretical concept (e.g., intelligence, moral development, content knowledge) that is being measured is called the construct. Construct validity establishes that the procedure or instrument is measuring the desired construct because the operationalization (e.g., scores on the scale) conforms to theoretical predictions. This is the most important form of validity, because it subsumes other forms of validity.
    • Convergent validity: correlation of scores on an instrument with other variables or scores that should theoretically be similar. For example, two measures of social responsibility should yield similar scores and therefore be highly correlated.
    • Discriminant validity: Comparison of scores on an instrument with other variables or scores from which it should theoretically differ. For example, a measure of verbal ability should not be highly correlated with artistic skills (see the correlation sketch after this list).
    • Factor structure: Factor analysis provides an empirical examination of the internal structure of an instrument. Items that are theoretically supposed to measure one concept (i.e., a subscale) should correlate highly with each other and load on the same factor, but have low correlations with items measuring a theoretically different concept (an orthogonal or independent factor). In some cases the theoretical construct has multiple dimensions; the factor structure will then not be unidimensional, but it should still correspond to the theoretical structure.
  • Content validity: Establishes that the instrument includes items that are judged to be representative of a clearly delineated content domain. For example, the IUPUI Center for Service and Learning established a framework of knowledge, skills, and dispositions of a civic-minded graduate (Bringle & Steinberg, in press). They used this conceptual framework to develop items for an instrument, the Civic-Minded Graduate Scale. Content validity can be assessed by the degree to which the scale items are a representative sample of a clearly defined conceptual domain according to the evaluation of independent reviewers.
    • Face validity: A subjective judgment about whether, on the "face of it," the tool seems to measure what it is intended to measure.
  • Criterion-related validity: The degree to which an instrument is associated with a criterion implied by the theory of the construct.
    • Concurrent validity: Comparison of scores on some instrument with concurrent scores on another criterion (e.g., behavioral index, independent assessment of knowledge). If the scale and the criterion are theoretically related in some manner, the scores should reflect the theorized relationship. For example, a measure of verbal intelligence should be highly correlated with a reading achievement test given at the same time, because theoretically reading skill is related to verbal intelligence.
    • Predictive validity: Comparison of scores on an instrument with some future criterion (e.g., behavior). The instrument's scores should do a reasonable job of predicting future performance. For example, scores on a social responsibility scale would be expected to be a fairly good predictor of post-graduation civic involvement (e.g., voting, volunteering), as in the sketch following this list.
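The correlation-based forms of validity evidence described above (convergent, discriminant, and predictive) can be sketched briefly. Everything here is hypothetical: the variable names, the simulated relationships, and the sample are illustrative assumptions rather than results from any actual instrument, and the same computation with a criterion measured at the same time would illustrate concurrent validity.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200  # hypothetical sample of respondents

# Simulated scores: two social-responsibility measures that should converge,
# a theoretically unrelated measure, and a future criterion (civic involvement).
social_resp_a  = rng.normal(50, 10, n)
social_resp_b  = social_resp_a + rng.normal(0, 5, n)        # theoretically similar
artistic_skill = rng.normal(50, 10, n)                      # theoretically unrelated
civic_involve  = 0.6 * social_resp_a + rng.normal(0, 8, n)  # measured after graduation

def r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return np.corrcoef(x, y)[0, 1]

# Convergent: two measures of the same construct should correlate highly.
print("Convergent   (A with B):        ", round(r(social_resp_a, social_resp_b), 2))
# Discriminant: correlation with a different construct should be near zero.
print("Discriminant (A with artistic): ", round(r(social_resp_a, artistic_skill), 2))
# Predictive: the scale should predict the later criterion reasonably well.
print("Predictive   (A with civic):    ", round(r(social_resp_a, civic_involve), 2))
```

Factor-structure evidence extends the same logic to the item level: correlations among items written for the same construct should be high, while correlations with items written for a theoretically independent construct should be low.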
For more specific information about test validity, see the following web pages: