Site-level anomaly detection in multi-site clinical trials: A machine learning approach

At Phastar we have experience supporting study monitoring during clinical trials. Our visualisation tools and expertise (see Delivering an Interim Analysis During a Pandemic from a Data Management perspective) enable teams to monitor the data as it is collected during the trial. We explored whether machine learning could support these monitoring activities by identifying unusual behaviour or data anomalies that are not easily detected by a human.

Phastar data management teams, like others, monitor the trial sites to minimise the risk of poor-quality data. One approach they use is centralized and statistical monitoring, where site data is evaluated for risks in real time from a single off-site location. Centralized and statistical monitoring can involve sophisticated visualisations and complex statistical algorithms to discover data outliers and anomalies. As part of the Centre For Analytical Excellence, we undertook a research project to explore methods to improve the detection of site-level anomalies in multi-site clinical trials using machine learning. The goal is to apply such an approach to inform study teams of unusual site behaviour so that they can follow up accordingly.

The approach was based on that of Massi et al., who explored whether they could detect fraud across different hospitals. Their overall approach used hospital-level data, from which they selected key features and built models to detect outliers. The results were provided to the team, who manually validated them and followed up where necessary. The difference in this project (the workflow is summarised in Figure 1) is that the source data is clinical trial data together with any meta-data, and the features extracted are specific to the context of clinical trials. Explainable AI is key here: it was not sufficient simply to list the anomalous sites for the team. Instead, the team needed to know the anomalous sites together with an explanation of why they might be anomalous, to support further follow-up.

Figure 1. The workflow adopted in the identification of anomalous sites from the clinical data.

During this research phase, data was simulated to include repeated visits for 5000 participants from 150 sites. Real-world distributions were used for many of the features, and anomalous values, both inliers and outliers, were inserted into selected features for 12 of the sites (8%). The features, or variables, used with a machine learning model can have a big impact on its performance, and in this project we explored different features to understand this impact. The data was first summarised across visits for each subject, and the subject-level summaries were then summarised across sites, using means and standard deviations for all variables in the dataset. In addition, derived features were explored, for example a feature that measures the ‘inlierness’ of a particular variable, which might be of value if data was fabricated.
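The two-stage summarisation described above can be sketched in a few lines of pandas. This is a minimal illustration rather than the project's actual pipeline: the column names (`site_id`, `subject_id`, `systolic_bp`, `weight_kg`), the toy data sizes, and the SD-ratio used as a stand-in for ‘inlierness’ are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical long-format trial data: one row per participant visit.
n_sites, subjects_per_site, n_visits = 6, 10, 4
rows = []
for site in range(n_sites):
    for subj in range(subjects_per_site):
        for visit in range(n_visits):
            rows.append({
                "site_id": f"S{site:02d}",
                "subject_id": f"S{site:02d}-{subj:03d}",
                "visit": visit,
                "systolic_bp": rng.normal(125, 12),
                "weight_kg": rng.normal(78, 14),
            })
visits = pd.DataFrame(rows)
measures = ["systolic_bp", "weight_kg"]

# Step 1: summarise each participant across their visits.
subject_level = visits.groupby(["site_id", "subject_id"])[measures].mean()

# Step 2: summarise participants within each site (mean and SD per variable).
site_level = subject_level.groupby(level="site_id").agg(["mean", "std"])
site_level.columns = [f"{var}_{stat}" for var, stat in site_level.columns]

# Illustrative 'inlierness' proxy (an assumption, not the article's definition):
# the within-site SD relative to the SD across all participants. Values well
# below 1 mean a site's data cluster more tightly than expected.
overall_sd = subject_level[measures].std()
for var in measures:
    site_level[f"{var}_sd_ratio"] = site_level[f"{var}_std"] / overall_sd[var]

print(site_level.round(2))
```

The result is one row per site, which is the shape an anomaly detection model needs as input. Other summary statistics could be added in the same `agg` call.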

In this approach the anomalous sites were not known in advance, so an unsupervised learning approach was used, one where the algorithm looks for patterns in the data without knowing the ‘right’ answer. Identifying anomalous sites is often more complex than simply observing outliers on a chart: there may be subtle changes across different variables, which require a more sophisticated approach. We explored two such anomaly detection methods, Isolation Forest (a tree-based method) and DBSCAN (a clustering-based method), and investigated the effect of different parameters on their performance on the simulated dataset.
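As a rough illustration of how the two methods can be applied to site-level features, the sketch below uses scikit-learn on a toy feature matrix. The data, the way anomalies are injected, and the parameter values (`contamination`, `eps`, `min_samples`) are assumptions chosen for the example, not the settings used in the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)

# Hypothetical site-level feature matrix: 150 sites x 4 summary features.
n_sites, n_features, n_anomalous = 150, 4, 12
X = rng.normal(size=(n_sites, n_features))

# Make the last 12 sites (8%) anomalous, each on one feature, with differing
# sizes and directions so that the anomalies do not form a cluster of their own.
truth = np.zeros(n_sites, dtype=bool)
truth[-n_anomalous:] = True
for k in range(n_anomalous):
    row = n_sites - n_anomalous + k
    X[row, k % n_features] += (-1) ** (k // n_features) * (10.0 + 3.0 * k)

# Scale with median and IQR so the anomalies themselves do not distort the scale.
X_scaled = RobustScaler().fit_transform(X)

# Isolation Forest: sites that are isolated by few random splits are flagged (-1).
iso = IsolationForest(n_estimators=200, contamination=0.08, random_state=0)
iso_flag = iso.fit_predict(X_scaled) == -1

# DBSCAN: sites that belong to no dense cluster are labelled as noise (-1).
db = DBSCAN(eps=1.5, min_samples=5)
db_flag = db.fit_predict(X_scaled) == -1

for name, flag in [("Isolation Forest", iso_flag), ("DBSCAN", db_flag)]:
    recall = (flag & truth).sum() / truth.sum()
    print(f"{name}: flagged {flag.sum()} sites, recall on injected sites {recall:.2f}")
```

Because the anomalies here are injected, recall can be computed directly. With real trial data there is no ground truth, so flagged sites would need manual review by the study team.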

Overall, the results were encouraging. The performance was comparable across both DBSCAN and Isolation Forest, and we developed insights about the inclusion of the different features. Moreover, we explored methods to explain the predictions, enabling clinical teams to identify not only the anomalous sites but also the features that are likely to have contributed to such a prediction.
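The article does not specify which explanation methods were explored. As one simple, model-agnostic possibility, the sketch below ranks the features of a flagged site by how far they sit from the typical site, using robust z-scores. The function name, feature names, and data are hypothetical.

```python
import numpy as np

def top_deviating_features(X, feature_names, site_index, k=3):
    """Rank one site's features by robust z-score (median and MAD based)."""
    median = np.median(X, axis=0)
    mad = np.median(np.abs(X - median), axis=0)
    # 1.4826 rescales the MAD to match the SD under normality;
    # fall back to 1.0 if a feature has zero spread.
    scale = np.where(mad > 0, 1.4826 * mad, 1.0)
    z = (X[site_index] - median) / scale
    order = np.argsort(-np.abs(z))[:k]
    return [(feature_names[i], float(z[i])) for i in order]

rng = np.random.default_rng(1)
feature_names = ["systolic_bp_mean", "systolic_bp_std",
                 "weight_kg_mean", "weight_kg_std"]

# Hypothetical standardised site-level features for 150 sites.
X = rng.normal(size=(150, 4))
X[7, 1] = -6.0  # a site with implausibly low variability in blood pressure

for name, z in top_deviating_features(X, feature_names, site_index=7, k=2):
    print(f"{name}: robust z = {z:+.1f}")
```

A listing like this gives the study team a starting point for follow-up, although it looks at each feature separately and would miss anomalies that only appear in combinations of features.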

We are looking for partners to explore these methods in more detail, to apply them to a real dataset and to support clinical study teams in identifying potentially anomalous sites within their study. If you would like to find out more, please contact us.

Massi, M.C., Leva, F. and Lettieri, E. (2020) BMC Medical Informatics and Decision Making. 20 (160):2754

Liu, F.T., Ting, K.M. and Zhou, Z.-H. (2008) Eighth IEEE International Conference on Data Mining, 413–422

Ester, M., Kriegel, H., Sander, J. and Xu, X. (1996) KDD-96 Proceedings, 226–231