The quality of the measurements taken by health professionals or by measurement devices is fundamental not only for clinical care but also for research (Shoukri 2005). Measuring a variable always involves some degree of error. Often, when an observer takes a measurement, the value obtained depends on several things, such as the skills of the observer, his or her experience, the measurement instrument, and the observer’s expectations (Bland 2000). Natural continuous variation in a biological quantity can also be present (Bland 2000). For example, if two observers measure blood pressure consecutively in the same subject and under the same circumstances, the values they obtain can differ because of their different skills, experience, and expectations, because of the measurement instrument, and also because of the natural continuous variation in blood pressure. Natural continuous variation in a biological quantity (variation within the subject) is outside the control of the observer. Observer variability, however, can be minimized, for example by training observers, by using guidelines, and through automation. In fact, it is almost always possible to reduce observer variability, but it is impossible to eliminate it. Thus, the assessment of measurement agreement in its different forms is an important issue in medicine.
It is important to assess how much different observers agree when they rate the same measurement (inter-observer agreement). Sometimes the interest is in the properties of the measurement method itself; for example, we might want to see whether a new measurement technique can be reproduced by a second observer. At other times, the focus may be on the observers rather than on the measurement method. For example, we may wish to evaluate the observer agreement of different obstetricians classifying cardiotocographic patterns as normal, suspicious, or pathological before and after a course explaining the guidelines for cardiotocography interpretation. It is also important to assess how much observations agree when the same observer repeats the same measurement (intra-observer agreement).
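As a rough illustration of inter-observer agreement on a categorical rating, the sketch below computes the simple proportion of cases in which two observers assign the same category, together with a cross-tabulation of their ratings. The ratings are invented for illustration only, and the function names `proportion_of_agreement` and `cross_tabulate` are hypothetical; this is a minimal sketch of one basic summary, not a recommended analysis.

```python
from collections import Counter

# Categories used for cardiotocographic patterns in the text above.
CATEGORIES = ("normal", "suspicious", "pathological")


def proportion_of_agreement(ratings_a, ratings_b):
    """Return the fraction of cases on which two observers give the same rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both observers must rate the same cases.")
    if not ratings_a:
        raise ValueError("At least one rated case is required.")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cross_tabulate(ratings_a, ratings_b):
    """Count how often each (observer A, observer B) pair of categories occurs."""
    counts = Counter(zip(ratings_a, ratings_b))
    return {(a, b): counts[(a, b)] for a in CATEGORIES for b in CATEGORIES}


# Hypothetical ratings of eight tracings by two observers (illustrative data only).
observer_1 = ["normal", "normal", "suspicious", "pathological",
              "normal", "suspicious", "normal", "pathological"]
observer_2 = ["normal", "suspicious", "suspicious", "pathological",
              "normal", "normal", "normal", "pathological"]

agreement = proportion_of_agreement(observer_1, observer_2)
table = cross_tabulate(observer_1, observer_2)

print(f"Proportion of agreement: {agreement:.2f}")
for (a, b), n in table.items():
    if n:
        print(f"observer 1 = {a:12s} observer 2 = {b:12s} n = {n}")
```

The same two functions could be applied to one observer's first and second readings of the same cases to summarize intra-observer agreement. Note that the raw proportion does not account for agreement that would be expected by chance alone.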
Sometimes the aim is to assess the agreement between two methods of clinical measurement. For example, if a new, faster, or less invasive method agrees with the old one, i.e. if the two do not disagree enough to cause problems in clinical interpretation, we can replace the old method with the new one or use both interchangeably.
The aim could also be to test validity, that is, the agreement between a measurement and the corresponding true value (gold standard), if it is known. When a new diagnostic test is developed, for example, it is essential to test its validity. When the gold standard is not known, observer agreement provides an upper bound on the degree of validity present in the ratings. In fact, if the observer agreement is good, then there is a possibility, but by no means a guarantee, that the ratings do reflect the gold standard. On the other hand, if the observer agreement is poor, then the usefulness of the ratings is severely limited and they cannot be considered valid (Fleiss 2003). The importance of a validity study before a new measurement method or diagnostic test is introduced into medical practice is generally recognized, as is the importance of assessing agreement between two methods when an old method is to be replaced with a new one. However, although the clinical consequences of low observer agreement have been recognized for more than 50 years (Yerushalmy 1953), observer agreement studies are not always given due recognition. In general, improving observer agreement in clinical observations can have a great influence on the quality of healthcare (de Vet 2005).
Last updated: March 2011