Can't See the Wood for the Trees? Making Sense of the Data During a Global Pandemic

PHASTAR’s Head of Statistical Research and Consultancy, Professor Jennifer Rogers, was recently announced as the winner of the annual HealthWatch Award for 2020. In her acceptance speech, Jen grappled with the COVID-19 statistics that we have been seeing since the start of the pandemic, trying to make sense of messy data and calling into question some of the more dubious claims that have been made.

I’ve always felt that public engagement is an important part of my career, and I have strived to be an ambassador for my field, bringing statistics to the masses. This isn’t just because I am passionate about statistics, but also because I feel it’s important that the general public have a basic understanding of statistical concepts. Every single day we are bombarded with news headlines telling us what we should and shouldn’t be doing, and we are expected to use these sometimes-shocking headlines to inform decisions about how to live our everyday lives. These headlines can sometimes be misleading, and our own personal experiences can skew our interpretation, so I feel it is essential to give people the tools they need to ask the right questions of the data they see, so that they can make informed, evidence-based decisions.


2020 has been a unique year for the world of medical statistics due to the COVID-19 pandemic. I don’t think I can remember a time when statistics and data have been more in the spotlight. During the first wave of the pandemic, when much of the world was in lockdown, the UK Government gave daily press briefings presenting the latest data as they were updated from day to day. There has been an absolute flood of data that individuals must wade through to try to answer important questions such as “How does COVID-19 spread?”, “Where is it now?”, “Who is most at risk?”, and “What treatments are safe and effective?”. But COVID-19 is a brand-new disease and attempts to tackle its spread are constantly evolving. When we are in an ever-changing landscape, how do we answer these key questions? Let’s look at possibly the simplest question we could ask: how many COVID-19 cases are there in the UK? At first glance, this seems like it should be a relatively straightforward question to answer, but even this turns out to be more complicated than it might first appear.

Let’s look at the reported number of cases published by the UK Government each day (Figure 1). Ideally, one would hope that these data would be a good proxy for the underlying prevalence in the wider general population.

Figure 1: Number of individuals who have had at least one lab-confirmed positive COVID-19 test result, by date reported, 24th February – 14th October 2020. https://coronavirus.data.gov.uk/cases.


We know that testing only picks up a fraction of all COVID-19 cases, but ideally this daily reported number of cases would give a good representation of the shape of the true prevalence in the UK. Unfortunately, this is very unlikely to be the case, and using these reported case numbers to infer UK-wide prevalence is a really difficult task. One of the main complications is that both the number of people being tested and who is being tested have been changing throughout the last six months. So, can we update our prevalence estimates to accommodate different test numbers? We know that prevalence is the number of positive cases in the population divided by the size of the population, so under some heavy assumptions we can sometimes estimate prevalence by considering the number of positive tests divided by the number of tests. This assumes that:

  1. tests are 100% accurate, and
  2. random samples of the population are tested.

Assumption 1 is unlikely to be true, but you would hope testing would be accurate enough that this wouldn’t be too much of an issue. Assumption 2, however, definitely doesn’t hold in the case of COVID-19, as those tested are typically individuals who suspect they may have the virus, either because they are presenting with symptoms or because they have come into contact with someone who has tested positive. Nevertheless, we can use this relationship between test numbers and test positivity to give us an indication of just how different the shape of the true prevalence curve might be compared to the reported case data that we see. Figure 2 shows a hypothetical prevalence per 100,000 people, estimated by considering the number of positive tests divided by the number of tests.
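To make the arithmetic behind that estimate concrete, here is a minimal sketch of the calculation in Python. The test counts below are invented purely for illustration (they are not the published UK figures), but they show how a larger raw number of positives can still translate into a lower positivity-based prevalence estimate once many more tests are being carried out.

```python
# Minimal sketch of a test-positivity-based prevalence estimate.
# It relies on the two (unrealistic) assumptions above: perfectly accurate
# tests and a random sample of the population being tested.

def estimated_prevalence_per_100k(positive_tests: int, total_tests: int) -> float:
    """Scale the proportion of positive tests up to a rate per 100,000 people."""
    positivity = positive_tests / total_tests
    return positivity * 100_000

# Invented figures for two periods with very different testing capacity.
first_wave = estimated_prevalence_per_100k(positive_tests=3_000, total_tests=10_000)
second_wave = estimated_prevalence_per_100k(positive_tests=15_000, total_tests=250_000)

print(f"First wave estimate:  {first_wave:,.0f} per 100,000")   # 30,000 per 100,000
print(f"Second wave estimate: {second_wave:,.0f} per 100,000")  # 6,000 per 100,000
```

In this made-up example the second period reports five times as many positive tests, yet the positivity-based estimate comes out far lower, which is exactly the kind of distortion that raw case counts can hide.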

Figure 2: Estimated COVID-19 prevalence (number of cases per 100,000 of the population).

To reiterate, I know that this plot has been produced under unrealistic assumptions and is unlikely to be 100% accurate, but it does give an interesting insight into just how much difference there could be between the shape of the reported case curves published by the Government and the true underlying prevalence in the general population. Specifically, we see that the underlying prevalence during this second wave is likely to be much smaller than the prevalence during the first wave, despite the reported case numbers being much higher now than they were in March and April. This calls into question the relevance of the reported case numbers that we see updated daily and then announced in the media like some kind of “number theatre” (thanks to Sir David Spiegelhalter for that wonderful phrase!). I think it’s a great thing to see the Government trying to be transparent and engaging the general public with data-driven decision making, but it’s important that the strengths and limitations of these rapid but messy data are thoroughly discussed.


COVID-19 has provided us with a unique opportunity to highlight the essential role that statistics has to play in society. Analysis of COVID-19 data has been, and will continue to be, essential in the fight against the virus. But good statistics don’t just appear. One silver lining of the COVID-19 pandemic is that I hope people now appreciate the value and importance of numbers, and the work that statisticians do every day to turn those numbers into useful narratives. People now talk about and lean on statistics to make informed, data-driven decisions. Hopefully that will be a lasting legacy of this global emergency.

If you would like to watch Jen’s lecture in its entirety, you can do so here. HealthWatch is the UK charity established in 1991 to promote science and integrity in healthcare; see https://www.healthwatch-uk.org/ for more information.