Addressing data sparsity in clinical trials – and female sparsity in data science

Digital Health Technologies (DHTs) are democratizing data collection during clinical trials, while promising to make research more efficient and more patient-centric. However, shifting the power to input data from clinicians to participants increases the risk of missed data points.

Where it occurs, this data sparsity can lead to incomplete submissions, threatening the success of otherwise highly promising drug candidates.

Speaking at a Women in Data Science (WIDS) Kenya conference, held at the Microsoft Africa Development Centre in Kenya in March, Pamela Adede, Data Operations Programmer at Phastar, explained the causes and consequences of data sparsity in digital-era clinical trials, outlined possible solutions, and made the case that a woman’s place is in STEM.

Digital data sources and data sparsity

Recent years have seen an explosion of digital health technologies, from wearables and sensors to electronic patient-reported outcomes (ePROs) and diaries, being deployed in clinical trials. This shift presents sponsors with a raft of potential benefits.

It has, for example, enabled decentralized clinical trials (DCTs), which replace at least a proportion of site visits with home health and remote monitoring. It’s an efficiency-boosting approach that streamlines drug development and, by removing geographical restrictions to participation, expands accessibility and inclusivity.

DHTs can also help sponsors gain a clear idea of how their products are used “in the real world”, better understand what matters the most to patients, and demonstrate all this to regulators.

As with any technology trend, this shift comes with potential pitfalls. It moves control of data generation and entry from highly trained staff to clinical trial participants, who may not bring the same precision or attention to detail. This is compounded by the fact that DHTs vastly increase the number of data points being collected.

It can all add up to data sparsity, or missing values within records or time series – missed data points that can negatively impact a sponsor’s ability to demonstrate product efficacy.

By extension, then, it can bring expensive drug development programs to their knees, and place insurmountable barriers between life-changing new drugs and the people who need them.
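To make the idea of sparsity concrete, the short Python sketch below computes the share of missing values per field in a handful of diary records. The field names and values are invented for illustration and do not come from any real trial; a real study would work from its own data model.

```python
# Hypothetical daily ePRO diary entries (illustrative only).
# None marks a value the participant did not record.
records = [
    {"pain_score": 3.0, "sleep_hours": 7.5, "steps": 4200.0},
    {"pain_score": None, "sleep_hours": 6.0, "steps": None},
    {"pain_score": 4.0, "sleep_hours": None, "steps": 5100.0},
    {"pain_score": None, "sleep_hours": 8.0, "steps": 3900.0},
]


def missing_rate(rows, field):
    """Return the fraction of rows in which `field` is missing (None or absent)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


# Sparsity per field, expressed as a fraction between 0 and 1.
sparsity = {field: missing_rate(records, field) for field in records[0]}
print(sparsity)
```

A per-field view like this is only a starting point, but it shows quickly which measurements participants are skipping most often.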

Risk mitigation

Data science, which uses statistics, scientific computing, scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from data, offers possible solutions to this costly problem.

The first is data augmentation, or artificially increasing the amount of data by generating new data points from the existing data. In simple terms, it extends the data and generates more records, thereby making up for the missing information. Because it extrapolates from existing data, however, it does run the risk of introducing bias.
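As a rough illustration, one very simple form of augmentation is to create synthetic copies of observed values with a small amount of random noise added ("jittering"). This is only one of many possible techniques and is not necessarily the one described in the talk; the function name, parameters, and blood pressure readings below are assumptions made for the example.

```python
import random


def augment_with_jitter(values, copies=2, noise_sd=0.1, seed=0):
    """Return the observed values followed by `copies` noisy duplicates of each.

    Hypothetical helper: noise is drawn from a normal distribution with
    mean 0 and standard deviation `noise_sd`. A fixed seed keeps the
    output reproducible.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(copies):
        for value in values:
            synthetic.append(value + rng.gauss(0.0, noise_sd))
    return list(values) + synthetic


# Illustrative systolic blood pressure readings (not real trial data).
observed = [120.0, 118.5, 121.0]
augmented = augment_with_jitter(observed, copies=2, noise_sd=0.5)
print(len(observed), "observed ->", len(augmented), "after augmentation")
```

Every synthetic value here sits close to an observed one, which is exactly why augmentation can amplify whatever bias the observed data already carries.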

Another option is risk-based monitoring, a form of centralized monitoring in which sponsors are able to review the study-wide data in near real time as it accumulates. Data science tools build visualizations and provide oversight of the trial, allowing monitors to quickly spot missing data and untimely data entry, and take timely corrective action.
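The kind of check that sits behind such a monitoring view can be sketched in a few lines of Python. The example below flags diary entries that are missing or were entered late. The participant IDs, dates, and the two-day tolerance are all assumptions chosen for illustration, not values from any protocol.

```python
from datetime import date

# Hypothetical entries: (participant_id, scheduled_date, entered_date or None).
entries = [
    ("P001", date(2024, 3, 1), date(2024, 3, 1)),
    ("P001", date(2024, 3, 2), None),
    ("P002", date(2024, 3, 1), date(2024, 3, 5)),
    ("P002", date(2024, 3, 2), date(2024, 3, 3)),
]

MAX_DELAY_DAYS = 2  # assumed tolerance before an entry counts as late


def flag_entries(rows, max_delay_days=MAX_DELAY_DAYS):
    """Return (participant_id, scheduled_date, reason) for each problem entry."""
    flags = []
    for participant_id, scheduled, entered in rows:
        if entered is None:
            flags.append((participant_id, scheduled, "missing"))
        elif (entered - scheduled).days > max_delay_days:
            flags.append((participant_id, scheduled, "late"))
    return flags


flags = flag_entries(entries)
for participant_id, scheduled, reason in flags:
    print(participant_id, scheduled.isoformat(), reason)
```

Run on accumulating data, a list like this gives monitors a short queue of entries to follow up while corrective action is still possible.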

Some approaches are best suited to particular types of missing data. Researchers can use Artificial Intelligence (AI) tools, for example, to predict the likelihood of patient dropout. This enables them to recruit patients with a higher chance of participating to the end of the trial and avoid data sparsity linked to poor retention. And estimation equations, which replace missing data with averages from across the data set, can be useful when dealing with empty age fields.
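Replacing missing values with an average is commonly known as mean imputation. A minimal Python sketch for the empty age field example follows; the ages are invented, and the helper name is hypothetical.

```python
from statistics import mean

# Illustrative ages; None marks an empty age field.
ages = [34, None, 51, 47, None, 28]


def impute_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


imputed = impute_mean(ages)
print(imputed)
```

Filling every gap with the same value keeps the average unchanged but understates the spread of the data, so this approach fits best when only a small share of values is missing.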

The power of data science

Data science is at the forefront of many innovations in clinical research. It is being used in the collection, management, and analysis of clinical data, automating processes and reducing error rates. This is important because securing the overall quality of clinical data is paramount to ensuring quality care and appropriate decision-making in the medical and healthcare fields.

But it doesn’t stop there. Data science is also playing a role in solving many of the world’s biggest challenges, from tackling climate change to searching for elusive dark matter and halting the spread of political disinformation.

In short, the opportunities are huge. Yet women are typically underrepresented in this future-shaping field. In fact, only around 15% of data scientists, and between 15% and 22% of all professionals in data science-related roles, are female [1].

That’s why WIDS advocates for change, staging international conferences that bring together women working in cutting-edge, innovative data science fields from across the spectrum to discuss their results.

Founded by female data scientists for female data scientists, the first conference was held at Stanford University, in the United States, on International Women’s Day eight years ago. Since then, there have been almost 200 events in 50-plus countries, attracting more than 100,000 participants.

These conferences are different in that they do not bring people together to talk about gender and diversity in technology. Rather, they focus on breakthrough data science that may not be showcased at meetings where women tend to be underrepresented.

Crucially, they demonstrate that it is possible for women to thrive in the technology field – it’s a matter of staying diligent, finding mentors, and being willing to learn.


  1. What’s Keeping Women Out of Data Science? (2020). https://www.bcg.com/publications/2020/what-keeps-women-out-data-science