Smart signal detection - Part one
As part of the Centre for Analytical Excellence, we recently delivered a challenging yet enjoyable project on detecting signals in historical clinical trial data and developing models to predict signals in new trials. Over a 3-month period, experts from within PHASTAR were given a large amount of historical data to digest, generate signals from and build predictive models on.
At a high level, the remit for the team was to discover ‘signals’ in the data. They first needed to define what was meant by a signal, whether there was something specific they were interested in, and how the signal information would be used. Their approach is summarised in Figure 1.
As with any project of this type, the team started by understanding the data. What datasets are present and are these common across the different studies? What variables do we have within those datasets and when have they been collected during the study? How much missing data is there? What is the variability within the data and studies and are different variables correlated?
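As an illustration of these first questions, here is a minimal Python sketch of the kind of checks involved: which variables are shared across studies, how much data is missing, and whether two variables are correlated. The study names, variable names and values are invented for the example and are not taken from the project data.

```python
from math import sqrt

# Hypothetical toy data: study -> variable -> values (None marks a missing value).
studies = {
    "study_a": {"age": [54, 61, None, 47], "alt": [30.0, 42.0, 35.0, None]},
    "study_b": {"age": [39, 58, 66, 50], "alt": [28.0, None, 51.0, 33.0],
                "qol": [7, 5, 6, 8]},
}

def common_variables(studies):
    """Return the variables that are present in every study."""
    return set.intersection(*(set(data) for data in studies.values()))

def missing_fraction(values):
    """Return the share of records with no value recorded."""
    return sum(v is None for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation, using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sd_x = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (sd_x * sd_y)

shared = common_variables(studies)
for study, data in studies.items():
    for var in sorted(shared):
        print(study, var, f"missing={missing_fraction(data[var]):.2f}")
    print(study, "corr(age, alt) =", round(pearson(data["age"], data["alt"]), 2))
```

In practice these checks would be run per dataset and per visit, since a variable can be well populated at baseline and sparse at later timepoints.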
Once the team had an overall understanding of the data, the next step was to define the features that could be extracted and used as input to the machine learning methods. In an ideal world, the historical data would be complete (no missing data) and all studies would contain the same, or at least complementary, variables from which to generate signals and build models. Unfortunately, as is often the case, variables were not always consistent across studies, and the team explored 2 complementary approaches to identify features.
The first was computational: applying automated techniques to select variables from the datasets. The second was manual: trial data experts in the team identified and defined features based on the clinical data and explored derivations where appropriate. The manual approach turned out to be the more effective of the two. These features covered demography, adverse events, laboratory measures, patient questionnaires, clinician measures and physical examination results.
Event time was an important additional element to this project. Data was collected from the very start of each trial through to follow-up, and the visit schedules in between varied across the different studies. The team explored how to effectively incorporate the time element within the approach.
Following creation of the features, the team explored 6 machine learning methods, each with a different rationale for use. For example, logistic regression was implemented because it is a useful tool for understanding which variables have a significant effect on the dependent variable. Naïve Bayes with updating was implemented because it allows beliefs to be updated as data from additional (i.e., later) timepoints become available. Using these different methods, the team identified signals in the data and demonstrated good performance, with approximately 75% accuracy and a sensitivity (the proportion of positive cases correctly predicted) of up to 81%.
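To make the idea of updating beliefs and the two performance measures concrete, here is a minimal Python sketch. It is not the team's implementation: the prior, the likelihoods, the observations and the labels are all invented for illustration. The update function applies Bayes' rule for a single binary feature; chaining it across features and visits relies on the naïve assumption that observations are conditionally independent given the class.

```python
def update_posterior(prior, observed, p_given_signal, p_given_no_signal):
    """Apply Bayes' rule once for a binary feature.

    prior: current probability that the patient belongs to the signal class.
    observed: True if the feature was present at this visit.
    p_given_signal / p_given_no_signal: assumed probability that the feature
    is present in each class (hypothetical values in this example).
    """
    like_signal = p_given_signal if observed else 1 - p_given_signal
    like_no_signal = p_given_no_signal if observed else 1 - p_given_no_signal
    numerator = prior * like_signal
    return numerator / (numerator + (1 - prior) * like_no_signal)

def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity) for binary labels where 1 is positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)  # share of true positives that were found
    return accuracy, sensitivity

# Belief is revised as each later visit adds an observation.
belief = 0.2  # hypothetical prior probability of a signal
for visit, seen in enumerate([True, True, False], start=1):
    belief = update_posterior(belief, seen, p_given_signal=0.7,
                              p_given_no_signal=0.2)
    print(f"after visit {visit}: P(signal) = {belief:.3f}")

# Hypothetical true and predicted labels for 8 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(accuracy_and_sensitivity(y_true, y_pred))
```

Accuracy counts every correct prediction, whereas sensitivity looks only at the patients who truly belong to the positive class, which is why the two figures can differ.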
Throughout this work, a key element was to enable the study team to follow and understand the methods, and to explore and monitor the signals during an ongoing clinical trial. We developed an interactive platform to enable this, which will be discussed in a future blog.
The biggest learning from this project is that close collaboration with the study team is critical, particularly in understanding what the team want to know and how the results will be used. The study team are often the experts in the data: for example, they understand which variables are potentially important, how effective each would be as a signal, and whether collecting it would be invasive for the patient, in which case regular monitoring may not be an option. Effective collaboration really is key.
Please get in touch if you would like to find out more or if you think we can support you.