Chapter 7: Glossary of Terms that Evaluators Use
The Art of Appropriate Evaluation



Administrative Evaluation (Process Evaluation)—An assessment of the extent to which a program was implemented or conducted according to plan. Administrative evaluations are useful to establish that a program actually reached its intended target audience with the appropriate messages the desired number of times through the selected media. Process evaluations are most useful in troubleshooting unsuccessful programs delivering proven countermeasures.

Before and After Design—An evaluation design that assesses the change in an outcome measure as the difference between pre-program levels and post-program levels. An evaluation of a school-age pedestrian safety program, for example, might observe street crossing behaviors before and after the educational program had been implemented. An increase in the proportion of children observed using the desired search patterns would provide evidence of program effectiveness. This design is sensitive to historical effects, however. If something else happened between the two assessment periods that might affect the observed behavior, then the outcome cannot be unequivocally attributed to the program. In this example, the outcome would be confounded if the local news media gave extensive coverage to a child killed or injured by a hit-and-run driver. This design is stronger if a comparison group is also assessed at the same time periods as the treatment group.
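
The arithmetic behind a before-and-after design with a comparison group can be sketched in a few lines of Python. This is only an illustration: the observation counts, the group names, and the helper functions `proportion_change` and `net_program_effect` are hypothetical and do not come from any real evaluation.

```python
def proportion_change(before_count, before_total, after_count, after_total):
    """Change in the share showing the desired behavior, in percentage points."""
    before = before_count / before_total
    after = after_count / after_total
    return (after - before) * 100


def net_program_effect(treatment, comparison):
    """Treatment-group change minus comparison-group change (percentage points)."""
    return proportion_change(*treatment) - proportion_change(*comparison)


# Hypothetical counts: (before_count, before_total, after_count, after_total)
# of children observed using the desired search pattern.
treatment_school = (40, 200, 90, 200)    # 20% before, 45% after
comparison_school = (44, 200, 54, 200)   # 22% before, 27% after

treatment_change = proportion_change(*treatment_school)
net_change = net_program_effect(treatment_school, comparison_school)

print(f"Treatment group change: {treatment_change:.1f} points")
print(f"Change net of comparison group: {net_change:.1f} points")
```

Subtracting the comparison group's change removes whatever shift both groups experienced for reasons other than the program, such as news coverage between the two observation periods.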

Bias—A potential characteristic of non-random samples that affects the program’s outcome. For example, an evaluation of a driver improvement program that is provided to volunteers cannot determine how well the program conveys information because volunteers have different motivations than “average” drivers. Researchers prefer to use random samples whenever possible to avoid bias.

Confounding Factors (or Variables)—Events other than those being investigated that may also have an effect on the outcomes of the program. For example, the results of an evaluation of a speed enforcement program could be confounded by the highway department making engineering changes in the same areas as the enforcement efforts.

Comparison Group and Treatment Group—In order to demonstrate a program’s effects, evaluators may compare a group that receives a countermeasure with an equivalent group that does not. The group getting the countermeasure is the “treatment” or “experimental” group and the other is the “comparison” or “control” group.

Correlation—A mathematical technique that assesses the extent to which one variable increases (or decreases) in value as another variable changes in value. Temperature in Fahrenheit and temperature in Celsius are perfectly correlated: as one goes up, so does the other. If one event causes another, they are necessarily correlated, but two variables that are highly correlated are not necessarily causally connected; they might both be caused by a third, unmeasured variable.
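
The Fahrenheit and Celsius example can be checked with a short Python sketch that computes the Pearson correlation coefficient, one common correlation measure (the glossary does not name a specific one). The function name `pearson_r` and the sample temperatures are chosen here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Illustrative temperatures; Fahrenheit is an exact linear function of Celsius.
celsius = [-10, 0, 10, 20, 30]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

r = pearson_r(celsius, fahrenheit)
print(f"r = {r:.3f}")
```

A coefficient of 1 indicates a perfect positive linear relationship, which is what an exact unit conversion produces. Values near 0 indicate little linear relationship.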

Cost-Benefit Analysis—A process comparing the cost of a program with the savings resulting from the outcomes of the program. While it is often difficult to identify and enumerate all the costs and benefits, the process can be meaningfully applied to a single program. For example, a law requiring motorcycle riders to wear protective helmets has limited enforcement costs compared with fairly large benefits in health care expenses and welfare benefits avoided.
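
As a rough sketch, the comparison reduces to two figures: net benefit (savings minus cost) and the benefit-cost ratio. The dollar amounts below are invented for illustration and are not estimates for any actual helmet law. The function name `cost_benefit` is likewise hypothetical.

```python
def cost_benefit(program_cost, savings):
    """Return (net benefit, benefit-cost ratio) for a single program."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    net_benefit = savings - program_cost
    ratio = savings / program_cost
    return net_benefit, ratio


# Hypothetical annual figures in dollars (assumptions, not real data).
enforcement_cost = 250_000
avoided_expenses = 1_000_000

net_benefit, ratio = cost_benefit(enforcement_cost, avoided_expenses)
print(f"Net benefit: ${net_benefit:,.0f}")
print(f"Benefit-cost ratio: {ratio:.1f}")
```

A ratio above 1 means the estimated savings exceed the cost. The hard part in practice is the step the code takes for granted: identifying and pricing every cost and benefit.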

Cost-Effectiveness Analysis—A process for determining the relative benefit of alternative programs by comparing the amount each program costs with the extent to which each affects a common measure of effectiveness. In this analysis, the outcomes of the program need not be converted to actual dollars saved. In comparing two approaches to increasing safety belt use rates, for example, one could calculate the cost of increasing belt use by, say, 5 percentage points for each program.
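
The belt-use comparison can be sketched as follows. The two programs, their costs, and their observed gains are hypothetical. The sketch also assumes that cost scales linearly with the size of the gain, which a real evaluation would need to justify.

```python
def cost_per_point(program_cost, points_gained):
    """Dollars spent per percentage point of increased belt use."""
    if points_gained <= 0:
        raise ValueError("program produced no measurable gain")
    return program_cost / points_gained


# Hypothetical programs: (total cost in dollars, observed gain in percentage points)
programs = {
    "enforcement campaign": (120_000, 6.0),
    "media campaign": (80_000, 2.5),
}

TARGET_GAIN = 5  # percentage points, as in the example above

# Assumes cost scales linearly with the gain (a simplifying assumption).
cost_for_target = {
    name: cost_per_point(cost, gain) * TARGET_GAIN
    for name, (cost, gain) in programs.items()
}

for name, cost in cost_for_target.items():
    print(f"{name}: ${cost:,.0f} per {TARGET_GAIN}-point increase")

most_cost_effective = min(cost_for_target, key=cost_for_target.get)
print(f"Most cost-effective: {most_cost_effective}")
```

Note that the cheaper program overall is not necessarily the more cost-effective one; the comparison depends on cost relative to the common measure of effectiveness.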

Evaluation Design—The plan for conducting an evaluation in a way that permits the evaluator to rule out the possibility that factors other than the program caused the observed outcomes. This plan should include a clear statement of the objectives of the program, how success will be measured, what populations will be exposed to the treatment, how treatment and comparison groups will be constituted, and how the data will be collected, analyzed, and reported.

Field Test—A study of a limited-scale implementation of a new program in a setting similar to where it is likely to be used. Field test sites are generally recruited from candidates showing a high level of interest in participation, a quality that sometimes provides an “ideal” environment rather than a “representative” one. This is not all bad, as it shows the potential benefit of a countermeasure unfettered by implementation problems.

Impact (or Outcome) Evaluation—An evaluation that determines the extent to which a program achieved its stated outcome objectives. For example, an impact evaluation of a program designed to reduce pedestrian crossings against red lights could compare the observed post-program change in the number of pedestrians crossing on the red and green cycles at selected intersections with an appropriate comparison group.

Outcome Objectives—A specification of the events that would mark the successful achievement of the program’s goals. These should be easily and unambiguously measured and closely related to the issues addressed by the program. While all traffic safety programs hope to reduce the number of traffic fatalities, reduction of fatalities is not often closely related to the program’s activities. Rather, appropriate objectives should be related to increasing use of safety belts, reducing the number of drinking drivers, improving street-crossing behavior, increasing helmet use, etc. Objectives may specify the populations of interest (e.g., decrease driving after drinking among Native Americans living in Nevada). In an ideal world, objectives should also state a quantifiable level of change (e.g., increase belt use by pickup truck drivers on 2-lane rural roads in Iowa by 10 percentage points).

Quasi-Experimental Design—A system of procedures for ruling out alternative explanations for study results when study groups could not be constituted by random assignment. While random assignment to groups is the preferred method for ruling out bias in samples, many real-world situations do not permit random assignment. Consequently, evaluators must turn to other techniques (e.g., additional comparison groups, multiple levels of treatment, comparisons over long time periods) to dismiss threats to the validity of the study.

Random Sample—A subset of a population chosen in such a way that each member of the population has an equal probability of selection. Random samples permit the use of certain statistical procedures that provide measures of the potential error in estimates of means (averages) and differences between means of two groups. A simple system for making random selections is to create an alphabetical listing of population members and select every nth name, beginning at a randomly chosen starting point within the first n names. If the population list contained 1000 names and the evaluator needed a sample of 100, she would select every 10th name.
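
Both selection approaches can be sketched with Python's standard `random` module. The population of 1000 placeholder names and the function names are hypothetical. The random starting point in the every-nth-name version is an assumption added here so that each name has the same chance of selection.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k members so that every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def systematic_sample(population, k, seed=None):
    """Take every nth member of an ordered list, starting at a random offset."""
    rng = random.Random(seed)
    step = len(population) // k      # 1000 names, sample of 100 -> every 10th
    start = rng.randrange(step)      # random start within the first n names
    return population[start::step][:k]


# Hypothetical alphabetical listing of 1000 population members.
names = [f"name_{i:04d}" for i in range(1000)]

srs = simple_random_sample(names, 100, seed=1)
systematic = systematic_sample(names, 100, seed=1)

print(len(srs), len(systematic))
print(systematic[:3])
```

Passing a `seed` makes the draw repeatable, which helps when the selection needs to be documented or audited. Omit it for an unpredictable draw.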

Reliability—An assessment of the extent to which a measurement system will give the same results if used to assess the same events on repeated occasions. A measure can be reliable, however, without being valid. For example, a weekly count of citations for driving while intoxicated may be highly repeatable. However, it is not a valid measure for evaluating a program designed to reduce the incidence of impaired driving because it is so dependent on other factors, including police motivation, program funding, and department priorities.

Representative Sample—A group of individuals deliberately chosen from a particular population to try to emulate the characteristics of the target population as a whole. When random sampling is not possible, use of a representative sample, with careful attention to defining the relevant population characteristics, may be an acceptable option. Focus groups are usually constituted using representative samples. For example, participants may be selected to match the following characteristics: 60% male, 40% female; ages 21 through 30; primary vehicle is pickup truck; drives more than 10,000 miles per year; graduated from high school and attended college for 2 or fewer years.

Statistical Significance—An estimate of the probability that the differences observed between treatment and comparison groups occurred by chance alone (i.e., that the treatment had no effect). The probability level below which results are said to be significant is somewhat arbitrary, but is usually .05 (5 chances in 100) or .10 (1 chance in 10). Statistical significance can be obtained with extremely small differences if the size of the groups is sufficiently large. While statistical significance can tell you if the results are not likely due to chance events, it cannot tell you if the size of the difference is programmatically meaningful (that is, worth the effort).
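
The effect of group size can be illustrated with a two-proportion z-test, one common significance test for comparing rates (the glossary does not prescribe a particular test). The belt-use counts below are hypothetical, and `two_proportion_p_value` is a name chosen for this sketch.

```python
import math


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two observed proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Same 1-point difference (81% vs. 80% belt use) at two hypothetical group sizes.
p_small_groups = two_proportion_p_value(405, 500, 400, 500)
p_large_groups = two_proportion_p_value(40_500, 50_000, 40_000, 50_000)

print(f"500 per group:    p = {p_small_groups:.3f}")
print(f"50,000 per group: p = {p_large_groups:.6f}")
```

With 500 observations per group the one-point difference is far from the .05 level, while with 50,000 per group the same difference falls well below it. In neither case does the p-value say whether a one-point gain is worth the effort.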

Validity—An assessment of the extent to which a measurement system actually measures what it is supposed to measure. For example, observed belt use is a much more valid measure of compliance with belt-use laws than is self-report on a survey. However, there are some circumstances (e.g., nighttime, fogged windows, high-speed locations) under which observations are not very reliable.