Method agreement analysis – common pitfalls and a review of methodology

A common question in clinical research is whether a new method of measurement is equivalent to an established one. As a statistical consultant at PHASTAR, I am seeing an increase in the number of trials where a new artificial intelligence or machine learning diagnostic tool is being compared either to a pre-existing tool or to a clinician. Methodology for the analysis of binary data is well established, but methodology for continuous outcomes is less developed. Here we shall review current methodology and outline some of the common pitfalls. It should be noted that concordance analysis doesn’t guarantee the correctness of methods of measurement; rather, it shows the degree to which different measuring techniques agree with each other. To properly evaluate a new method of measurement, quantities pertaining to the validity of measures, such as sensitivity, specificity, and positive and negative predictive values, should also be considered.

When we measure a biological variable in a number of individuals, or repeatedly in the same individual, we will always expect a scatter of values, as inter- and intra-individual variability is likely to be evident. Much of this variability can be attributed to variation in associated factors such as age, gender, genetics, etc. In contrast, measurement error arises because the observed (measured) values and the true values differ. This measurement error could be random (sometimes higher, sometimes lower, but balancing out on average), or it could be systematic, where the observed values are consistently higher or lower than the true values. It is this measurement error that we wish to eliminate or minimize, as it concerns the overall accuracy of observations.

When considering binary outcomes, the diagonal elements of a contingency table show the frequencies of agreement. Cohen’s kappa is a measure of agreement that is considered more robust than a simple percentage agreement, as it takes into account the possibility of the agreement occurring by chance. It is given by:

  kappa = (p0 - pE) / (1 - pE)

where:
  • n = total number of observations
  • OD = sum of observed frequencies along the diagonal
  • ED = sum of expected frequencies along the diagonal
  • p0 = OD/n
  • pE = ED/n

Perfect agreement is then evident when Cohen’s kappa equals 1 and a value equal to zero suggests that the agreement is no better than that which would have been obtained by chance alone.
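
The calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a validated implementation: the function name `cohens_kappa` and the 2x2 table of counts are hypothetical, and the expected frequencies are taken to be the usual row total times column total divided by n.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table given as a list of rows."""
    k = len(table)
    n = sum(sum(row) for row in table)                      # total observations
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed_diag = sum(table[i][i] for i in range(k))      # OD
    # Expected count in each diagonal cell under chance agreement (ED)
    expected_diag = sum(row_totals[i] * col_totals[i] / n for i in range(k))

    p0 = observed_diag / n
    pe = expected_diag / n
    return (p0 - pe) / (1 - pe)


# Hypothetical counts: rows = new tool (positive, negative),
# columns = clinician (positive, negative)
table = [[40, 10],
         [5, 45]]
kappa = cohens_kappa(table)
print(round(kappa, 2))  # 0.7
```

Here the raw agreement is 85%, but half of that would be expected by chance alone, so kappa is noticeably lower than the simple percentage.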

Cohen’s kappa is an inappropriate measure of the agreement between a pair of measurements when the value of interest is continuous or ordinal. Paired observations are often incorrectly evaluated for agreement using the Pearson correlation coefficient. Correlation measures the strength of a linear relationship between two variables, and perfect correlation corresponds to points lying along any straight line. In contrast, agreement requires the points to lie along the line of equality. Data that have poor agreement can still produce high correlations, as Figure 1 shows. The correlation plot of hemoglobin levels shows that the values are highly correlated. They in fact have a correlation coefficient of 0.98, a near-perfect correlation. But we can see from the plot that hemoglobin measures “HB2” are consistently higher than hemoglobin measures “HB1” (as the values lie above the line of equality), so the two measures have poor agreement.

Figure 1: Correlation plot for hemoglobin with the line of equality.
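
The distinction between correlation and agreement can be reproduced with a small sketch. The values below are made up for illustration (they are not the data behind Figure 1), and `pearson_r` is a hypothetical helper written out by hand so that the example needs only the standard library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical hemoglobin readings: the second method reads about 2 units higher
hb1 = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
hb2 = [12.1, 12.9, 14.2, 15.0, 15.9, 17.1]

r = pearson_r(hb1, hb2)
diffs = [b - a for a, b in zip(hb1, hb2)]
mean_diff = sum(diffs) / len(diffs)

print(f"correlation = {r:.3f}")
print(f"mean difference (HB2 - HB1) = {mean_diff:.2f}")
print("HB2 above the line of equality for every pair:", all(d > 0 for d in diffs))
```

The correlation is close to 1 even though every pair lies above the line of equality, which is exactly the situation in which correlation gives a misleading impression of agreement.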

An alternative might be to consider the paired t-test, which tests whether the mean difference between paired observations is zero, but this would also be a common statistical pitfall. The paired t-test can be non-significant if the average difference between paired values is small, even if the differences between the two methods for individual subjects are large.

Figure 2: Correlation plot for systolic blood pressure with line of equality.

Figure 2 shows the correlation plot for systolic blood pressure measurements, and we can see that, for individuals, the differences between the two methods of measurement are large at the two extremes. The average difference across all individuals, however, is in fact small at 1.32, and a paired t-test yields a p-value of 0.70. This could be read as showing no difference between the two methods, even though the plot suggests that they do not agree. Additionally, when comparing two methods of measurement, it is unlikely that they will agree exactly by giving identical results for all individuals. We typically wish to know by how much the methods differ; if this is not enough to cause problems, we can replace the old method with the new. A significance test does not answer that question: with sufficiently large sample sizes, even small differences between measurements would yield small p-values.
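
The first of these pitfalls can be illustrated with a short sketch. The readings are invented for the purpose (they are not the Figure 2 data), the helper `paired_t_statistic` is hypothetical, and instead of computing a p-value the t statistic is compared with 2.262, the two-sided 5% critical value of the t distribution with 9 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return the paired t statistic, the mean difference and the differences."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample SD of the differences
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, mean_diff, diffs


# Hypothetical systolic blood pressure readings from two methods.
# Method A reads lower than B at low pressures and higher at high pressures.
sbp_a = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
sbp_b = [120, 125, 130, 135, 140, 150, 155, 160, 165, 168]

t, mean_diff, diffs = paired_t_statistic(sbp_a, sbp_b)
T_CRIT_DF9 = 2.262  # two-sided 5% critical value for 9 degrees of freedom

print(f"mean difference = {mean_diff:.2f}")
print(f"largest individual difference = {max(abs(d) for d in diffs)}")
print(f"t = {t:.3f}, significant at 5%: {abs(t) > T_CRIT_DF9}")
```

The positive and negative differences cancel, so the test is far from significant even though individual readings differ by as much as 22 units.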

The Bland-Altman plot displays the difference between each pair of readings against the mean of that pair, and offers insight into the extent of agreement. We summarize the lack of agreement by calculating the bias, estimated by the mean difference, d, together with the standard deviation of the differences, s. If the differences are normally distributed, we would expect about 95% of them to lie between approximately d - 2s and d + 2s. These lower and upper bounds are referred to as the “limits of agreement” and allow the assessment of concurrence. If differences within these limits are not clinically important, the methods can be used interchangeably. An example of the Bland-Altman plot for hemoglobin is given in Figure 3.

Figure 3: Bland-Altman plot for hemoglobin.

Importantly, there are no uniform criteria for what constitutes acceptable limits of agreement. This is a subjective decision that must be made from a clinical perspective, depends on the variable being measured, and must be pre-specified.
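
As a rough sketch of the Bland-Altman quantities, the bias and limits of agreement can be computed as follows. The data, the helper `bland_altman` and the clinical threshold `CLINICAL_LIMIT` are all hypothetical; the multiplier of 2 follows the approximation d - 2s to d + 2s (1.96 is also commonly used), and the plot itself would be a scatter of `diffs` against `means` with horizontal lines at the three summary values.

```python
import statistics


def bland_altman(x, y):
    """Bias and approximate limits of agreement for paired measurements."""
    diffs = [b - a for a, b in zip(x, y)]          # difference for each pair
    means = [(a + b) / 2 for a, b in zip(x, y)]    # x-axis of the plot
    bias = statistics.mean(diffs)                  # d, the mean difference
    s = statistics.stdev(diffs)                    # s, the SD of the differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - 2 * s,
        "upper": bias + 2 * s,
    }


# Hypothetical hemoglobin readings from two methods
hb1 = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
hb2 = [12.1, 12.9, 14.2, 15.0, 15.9, 17.1]

result = bland_altman(hb1, hb2)

# Hypothetical pre-specified threshold: differences beyond +/-1.0 matter clinically
CLINICAL_LIMIT = 1.0
acceptable = (result["lower"] >= -CLINICAL_LIMIT
              and result["upper"] <= CLINICAL_LIMIT)

print(f"bias = {result['bias']:.2f}")
print(f"limits of agreement = {result['lower']:.2f} to {result['upper']:.2f}")
print("within pre-specified clinical limits:", acceptable)
```

With these made-up numbers the limits of agreement are narrow but sit well away from zero, so the methods would not be judged interchangeable against the assumed threshold despite their high correlation.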

Why is all this important? The number of trials using the Bland-Altman method as a primary analysis appears to be on the rise, but a closer examination of some of these studies shows that the methodology is often misused and that sample size calculations are not properly considered. Chhapola et al. (2015) found that there is incomplete reporting of the Bland-Altman methodology in published clinical trials and state “despite its simplicity, B-A appears not to be completely understood by researchers, reviewers and editors of journals.”[1] As the need for comparisons of methods of measurement increases, so will the use of the Bland-Altman methodology, and we, as statisticians, have a crucial role to play in making sure that it is used correctly.


  1. V Chhapola, SK Kanwal, and R Brar. Reporting standards for Bland–Altman agreement analysis in laboratory research: a cross-sectional survey of current practice. Annals of Clinical Biochemistry, 2015. 52(3):382-386.