Statistical methods to assess agreement in continuous variables

Several statistical methods for assessing agreement between measurements on continuous scales can be found in the literature; however, all of them have important limitations and no method appears to be foolproof (Luiz 2005). Moreover, it is not uncommon to find inappropriate analyses in leading journals. Bland and Altman described the problems of some applications of correlation and regression methods to agreement studies, using examples from the obstetrics and gynaecology literature (Bland 2003). Costa-Santos also found several misuses and misinterpretations of agreement methods in the obstetrics and gynaecology literature (Costa-Santos 2005), and other authors have found inappropriate analyses in papers dealing with other areas of medicine (Gow 2008; Jakobsson 2005; Rothwell 2000; Winkfield 2003). Some authors recommend combining more than one statistical method to assess agreement (Khan 2001; Luiz 2005; Costa-Santos 2005) in order to overcome the limitations of each method, but this is not an ideal approach.

We can quantify within-subject variation by taking replicate readings on each subject and calculating the within-subject standard deviation $s_w$ by one-way analysis of variance (ANOVA), where $s_w^2$ is the residual mean square in a one-way ANOVA with the subjects as the “groups”. The repeatability coefficient, which is based on $\sqrt{2 s_w^2}$ (the standard deviation of the difference between two readings on the same subject), can also be computed. In this situation, we can also calculate the within-subject coefficient of variation, which is the within-subject standard deviation divided by the mean.
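The quantities in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `within_subject_sd` and the data are hypothetical, and the factor 1.96 applied to $\sqrt{2 s_w^2}$ is an assumption that follows Bland and Altman's usual convention for a 95% repeatability coefficient.

```python
import math

def within_subject_sd(readings):
    """readings: one inner list of replicate readings per subject."""
    ss_within = 0.0  # residual (within-subject) sum of squares
    df_within = 0    # residual degrees of freedom
    for subject in readings:
        subject_mean = sum(subject) / len(subject)
        ss_within += sum((v - subject_mean) ** 2 for v in subject)
        df_within += len(subject) - 1
    ms_within = ss_within / df_within  # residual mean square, s_w squared
    return math.sqrt(ms_within)

# Hypothetical data: 4 subjects, 2 replicate readings each
readings = [[10.0, 12.0], [20.0, 19.0], [15.0, 15.0], [30.0, 33.0]]

s_w = within_subject_sd(readings)
sd_of_difference = math.sqrt(2 * s_w ** 2)  # SD of the difference between two readings
repeatability = 1.96 * sd_of_difference     # assumed 95% convention (about 2.77 * s_w)
grand_mean = sum(sum(r) for r in readings) / sum(len(r) for r in readings)
cv_within = s_w / grand_mean                # within-subject coefficient of variation

print(f"s_w = {s_w:.3f}, repeatability = {repeatability:.3f}, CV = {cv_within:.3f}")
```

Because the residual mean square pools the within-subject sums of squares, subjects with different numbers of replicates can be mixed in the same call.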

The intraclass correlation coefficient (ICC) is frequently claimed to be suitable for assessment of observer agreement (Bartko 1966; Shoukri 1996). Sometimes the ICC is also recommended for assessing agreement between two methods of clinical measurement (Lee et al. 1989), but other authors contest this recommendation, claiming that the ICC is inappropriate for assessing the agreement between two measurement methods (Bland et al. 1990). The ICC is defined as the ratio of the variance between subjects to the total variance (Bartko 1966; Shrout 1979). These variances are derived from ANOVA, and the ANOVA model depends on the design of the agreement study. If we have a pool of observers, and for each subject we randomly select different observers to rate that subject, this corresponds to a one-way ANOVA in which the subject is a random effect and the observer is viewed as measurement error. On the other hand, if the same set of observers rates each subject, this corresponds to a two-way ANOVA design in which the subjects and the observers are separate effects. Here, observers are considered a random effect; this means that the observers in the study are considered to be a random sample from a population of potential observers. Finally, there is a two-way ANOVA design similar to the former, but in which the observers are “fixed”; the ICC estimates only apply to the observers selected in the study, since this design does not permit generalization to other observers (Shrout 1979).

The ICC (from two-way models) that should be used for assessing agreement was defined by McGraw and Wong as the “ICC for agreement”. We obtain the “ICC for consistency” or the “ICC for agreement” by excluding or not excluding, respectively, the observer variance from the denominator mean square (McGraw 1996). The systematic variability due to observers is irrelevant for the “ICC for consistency” but not for the “ICC for agreement”. The ICC usually ranges from 0 (no agreement) to 1 (perfect agreement), but it can be negative. How such negative ICC values should be interpreted is quite unclear (Giraudeau 1996). The assumptions of multivariate normality and equality of variances should be checked. An important limitation of the ICC is that it is strongly influenced by the variance of the trait in the sample or population in which it is assessed (Muller 1994). This limitation can be illustrated with the following example. Suppose that we aim to assess the inter-observer agreement of a depression scale. When applied to a random sample of the adult population (a heterogeneous population with a high variance), the scale’s inter-observer agreement may be high (high ICC); however, if the scale is applied to a very homogeneous population (with a low variance), such as patients hospitalized for acute depression, the scale’s inter-observer agreement may be low (low ICC). Consequently, ICC values have no absolute meaning, and the cut-off value of 0.75 proposed by Fleiss (Fleiss 1986), which is often reproduced to signify good agreement, has limited justification.
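The difference between the two coefficients can be made concrete with a short Python sketch. It uses the single-rating forms of the two-way ICCs computed from ANOVA mean squares, as given by McGraw and Wong; the function name `icc_two_way` and the ratings are hypothetical. In the example, the second observer always scores exactly 2 points higher than the first, so the observers are perfectly consistent but do not agree.

```python
def icc_two_way(ratings):
    """ratings: one row per subject, one column per observer (no replicates).

    Returns (ICC for consistency, ICC for agreement), single-rating forms.
    """
    n = len(ratings)     # number of subjects
    k = len(ratings[0])  # number of observers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares of the two-way ANOVA table
    ms_subjects = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_observers = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_error = ss_error / ((n - 1) * (k - 1))

    numerator = ms_subjects - ms_error
    # Consistency: observer variance excluded from the denominator
    consistency = numerator / (ms_subjects + (k - 1) * ms_error)
    # Agreement: observer variance kept in the denominator
    agreement = numerator / (
        ms_subjects + (k - 1) * ms_error + k * (ms_observers - ms_error) / n
    )
    return consistency, agreement

# Hypothetical ratings: observer 2 = observer 1 + 2 for every subject
ratings = [[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]]
icc_consistency, icc_agreement = icc_two_way(ratings)
print(f"consistency = {icc_consistency:.3f}, agreement = {icc_agreement:.3f}")
```

The systematic shift between observers leaves the consistency coefficient at its maximum while pulling the agreement coefficient down, which is why the latter is the one to report in agreement studies.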

Professor John Uebersax provides an interesting discussion of the ICC.

Lin objected to the use of the ICC as a way of assessing agreement between methods of measurement and developed the concordance correlation coefficient (CCC). Lin’s CCC is the Pearson coefficient of correlation, which assesses the closeness of the data to the line of best fit, modified by taking into account how far the line of best fit is from the 45-degree line through the origin (Lin 1989). However, there are similarities between certain specifications of the ICC and the CCC (Nickerson 1997). Moreover, some limitations of the ICC, such as the limitation of comparability of populations described above, are also present in the CCC (Atkinson et al. 1997).
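As an illustration of how the CCC penalizes departure from the 45-degree line, the sketch below computes both the Pearson correlation and the CCC, using the standard form of Lin's coefficient, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with variances and covariance divided by n. The function name `concordance_correlation` and the data are hypothetical.

```python
import math

def concordance_correlation(x, y):
    """Return (Pearson r, Lin's CCC) for paired measurements x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Variances and covariance with divisor n, as in Lin's definition
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    pearson = cov_xy / math.sqrt(var_x * var_y)
    # The squared difference in means penalizes a shift away from the 45-degree line
    ccc = 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
    return pearson, ccc

# Hypothetical data: method y reads exactly 2 units higher than method x
x = [1, 2, 3, 4, 5]
y = [3, 4, 5, 6, 7]
pearson_r, ccc = concordance_correlation(x, y)
print(f"Pearson r = {pearson_r:.3f}, CCC = {ccc:.3f}")
```

The points lie perfectly on a straight line, so the Pearson correlation is at its maximum, but that line is shifted away from the 45-degree line and the CCC drops accordingly.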

Bland claimed that the assessment of observer agreement is one of the most difficult areas in the study of clinical measurement, and he suggested assessing whether measurements taken on the same subject by different observers vary more than measurements taken by the same observer, and if so by how much, taking a sample of observers to make repeated observations on each of a sample of subjects. It is possible then to assess, by ANOVA, how much the variation between measurements on the subject is increased when different observers make the measurements (Bland 2004).

To assess agreement between two methods of clinical measurement, Bland and Altman proposed the limits of agreement approach (Bland 1986). This method is also sometimes used to assess agreement between different observers, and some authors even recommend the use of both approaches (the ICC and the limits of agreement) to assess observer agreement and to assess agreement between two methods of clinical measurement in continuous variables (Khan 2001; Luiz 2005). In fact, the two approaches have different pros and cons. Unlike the ICC, the limits of agreement distinguish between random error and bias. Limits of agreement can be calculated from the mean difference between the measurements of two methods in the same subjects and the standard deviation of these differences. Approximately 95% of these differences will lie between the mean difference ± 1.96 standard deviations of the differences. The limits of agreement approach depends on some assumptions about the data: that the mean and standard deviation of the differences are constant throughout the range of measurement, and that the differences come from an approximately normal distribution. An easy way to check these assumptions is a scatter plot of the difference against the average of the two measurements and a histogram of the differences (Bland 1986). Limits of agreement are expressed in the units of the measurement scale, so it is not possible to state in general how narrow the limits of agreement should be to represent “good” agreement. Based on the limits of agreement, deciding whether agreement is acceptable or not is always a clinical, and not a statistical, judgement. Limits of agreement and the ICC can provide inconsistent results in agreement studies. Accordingly, their results need to be interpreted with caution, keeping their respective limitations in mind (Costa-Santos 2001).
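The calculation described above can be sketched as follows. The function name `limits_of_agreement` and the measurements are hypothetical; the sketch also returns the (average, difference) pairs that would be drawn in the scatter plot used to check the assumptions.

```python
import statistics

def limits_of_agreement(method_a, method_b):
    """Return (bias, lower limit, upper limit, plot points) for paired data."""
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(differences)        # mean difference
    sd_diff = statistics.stdev(differences)    # sample SD of the differences
    lower = bias - 1.96 * sd_diff
    upper = bias + 1.96 * sd_diff
    # (average, difference) pairs for the difference-against-average plot
    points = [((a + b) / 2, a - b) for a, b in zip(method_a, method_b)]
    return bias, lower, upper, points

# Hypothetical paired measurements by two methods on the same five subjects
method_a = [10, 12, 15, 20, 22]
method_b = [11, 11, 17, 19, 25]

bias, lower, upper, points = limits_of_agreement(method_a, method_b)
print(f"bias = {bias:.2f}, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

The bias and the width of the interval are reported separately, which is how the method distinguishes systematic difference from random error. Five pairs are far too few for a real study; the data are only meant to make the arithmetic easy to follow.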

Professor Martin Bland describes the limits of agreement method.

The limits of agreement approach still works in validity studies, when one method of measurement is a “gold standard”. It tells us how far from the “gold standard” the measurements by the method are likely to be. By calculating differences as “method” minus “gold standard”, we are able to say that the new method might exceed the “gold standard” by up to the upper limit and fall below the “gold standard” by as much as the lower limit. However, plotting the difference against the “gold standard” is misleading: the differences should be plotted against the mean of the method and the gold standard (Bland 1995). Often, the reason for measuring agreement between one method and the true values is to assess a new diagnostic test by comparing the test results with the true diagnosis. The test may be based on a continuous variable, with the disease indicated if the value is above or below a given level, or it may be a categorical variable such as positive or negative. In both cases, we are interested in knowing the proportion of disease positives that are test positive (the sensitivity) and the proportion of disease negatives that are test negative (the specificity). If the test is based on a continuous variable, a receiver operating characteristic (ROC) curve can be used, plotting the sensitivity against one minus the specificity for different cut-off points (Metz 1978; Zweig 1993). These ROC curves enable us to find the cut-off that best combines good sensitivity and good specificity, and from the area under the curve we can compare different tests.
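A minimal Python sketch of the ROC construction follows. It assumes that higher test values indicate disease and that a subject is test positive when the value is at or above the cut-off; the function names `roc_points` and `area_under_curve` and the data are hypothetical. The area is obtained with the trapezoidal rule.

```python
def roc_points(values, diseased):
    """Return (1 - specificity, sensitivity) pairs, one per cut-off."""
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every value: nobody tests positive
    for cutoff in sorted(set(values), reverse=True):
        true_pos = sum(1 for v, d in zip(values, diseased) if d and v >= cutoff)
        false_pos = sum(1 for v, d in zip(values, diseased) if not d and v >= cutoff)
        sensitivity = true_pos / n_pos
        one_minus_specificity = false_pos / n_neg
        points.append((one_minus_specificity, sensitivity))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical test values and true disease status (1 = diseased)
values = [1, 2, 3, 4, 5, 6, 7, 8]
diseased = [0, 0, 0, 1, 0, 1, 1, 1]

points = roc_points(values, diseased)
auc = area_under_curve(points)
print(f"AUC = {auc:.4f}")
```

Each point corresponds to one candidate cut-off, so inspecting the list directly shows the sensitivity and specificity trade-off at every threshold.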

As no strategy for assessing agreement seems to be fail-safe for comparing the degree of agreement, or disagreement, a new information-based measure of disagreement was proposed. This approach can be very useful for comparing the degree of disagreement among different populations (Costa-Santos 2010).

Entropy, introduced by Shannon (Shannon 1948), can be described as the average amount of information contained in a variable. The sum over all logarithms of possible outcomes of the variable is a valid measure of the amount of information, or uncertainty, contained in a variable.
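For readers who want to see entropy as a number, here is a small sketch using the usual definition for a discrete variable, minus the sum of p·log₂(p) over the outcomes. The base-2 logarithm (entropy in bits), the function name `shannon_entropy` and the example distributions are assumptions made for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution given as probabilities."""
    # Outcomes with zero probability contribute nothing to the sum
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])    # maximal uncertainty for two outcomes
biased_coin = shannon_entropy([0.9, 0.1])  # more predictable, so less information
print(f"fair coin: {fair_coin:.3f} bits, biased coin: {biased_coin:.3f} bits")
```

The more predictable the variable, the lower its entropy.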

Consider that we aim to measure disagreement between measurements obtained by Observer Y (variable Y) and Observer X (variable X). For variable Y, consider a vector Y that can take non-negative values $(y_1, \ldots, y_n)$ and, for variable X, a vector X that can take non-negative values $(x_1, \ldots, x_n)$.

The disagreement between Y and X is related to the differences between them. So, we consider

$$\log |x_i - y_i|$$

the amount of information contained in the differences between observers.

By adding 1 to the differences, we avoid the behavior of the logarithmic function between 0 and 1. To get a value between 0 and 1, we normalize the amount of information contained in the differences to obtain the following information-based measure of disagreement (IBMD):

$$\mathrm{IBMD} = \frac{1}{n} \sum_{i=1}^{n} \log\left( \frac{|x_i - y_i|}{\max(x_i, y_i)} + 1 \right)$$

with the convention

$$\frac{|0 - 0|}{\max(0, 0)} = 0$$
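The IBMD formula above can be sketched in Python as follows. The text writes only “log”; the sketch assumes a base-2 logarithm, which makes the measure reach exactly 1 when one of the two measurements in every pair is zero and the other is not. The function name `ibmd` and the data are hypothetical.

```python
import math

def ibmd(x, y):
    """Information-based measure of disagreement for paired non-negative values."""
    total = 0.0
    for xi, yi in zip(x, y):
        largest = max(xi, yi)
        # Convention from the text: |0 - 0| / max(0, 0) is taken to be 0
        ratio = 0.0 if largest == 0 else abs(xi - yi) / largest
        total += math.log2(ratio + 1)  # base 2 is an assumption of this sketch
    return total / len(x)

# Hypothetical measurements by two observers on the same subjects
no_disagreement = ibmd([1, 2, 3], [1, 2, 3])
maximal_disagreement = ibmd([0, 5], [4, 0])
partial_disagreement = ibmd([2, 4], [4, 4])
print(no_disagreement, maximal_disagreement, round(partial_disagreement, 4))
```

Dividing each difference by the larger of the two values makes the measure relative to the size of the measurements, which is what allows comparison across populations with different ranges.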

The information-based measure of disagreement can be calculated using the online calculator or Winpepi.

Last updated: February 2012