PHASTAR's Zara Ghodsi discusses a new method for the imputation of time series data based on singular spectrum analysis (SSA)

Missing values estimation in clinical time series

Missing data is a common problem in clinical trials despite all our best efforts to minimise it through design. It is likely to occur in most randomised controlled trials. When missing data is present, the ability to conduct intention-to-treat analyses, which require the complete inclusion of all data from all randomised patients, is compromised, and this can bias the results. For this reason, much research is focussed on analytical techniques to estimate unbiased effects in the presence of missing data, including imputation-based methodologies. PHASTAR statistician Zara Ghodsi has recently had a paper published in which she proposes a new method for the imputation of time series data based on singular spectrum analysis (SSA).[1]

The idea of SSA is to decompose a time series into oscillatory components and noise, with the aim of characterising the regular behaviour of the underlying dynamical system. First, a lag-covariance matrix must be calculated, which measures the covariances between lagged values of the time series X. The idea is to calculate the covariance between X_t and X_{t+k} for a delay, or lag, k. If this covariance is positive, then the values X_t and X_{t+k} tend to vary together. We calculate these covariances for M lags, where we refer to M as the window size. If, for example, M = 4, we would have a covariance vector of size 4, covering lags of k = 0, 1, 2, 3, and these values fill an M x M lag-covariance matrix whose (i, j) entry is the covariance at lag |i - j|. We then compute the eigenvalues and eigenvectors of this matrix; the M eigenvectors of the lag-covariance matrix are called temporal empirical orthogonal functions (EOFs).
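As a minimal sketch of the lag-covariance step, the Python snippet below builds the M x M matrix for a toy sinusoid and extracts its EOFs. The function name `lag_covariance_eofs`, the toy series and the choice of a Toeplitz estimator normalised by 1/(N - k) are assumptions made for illustration; they are not details taken from the paper.

```python
import numpy as np

def lag_covariance_eofs(x, M):
    """Toeplitz lag-covariance matrix for lags 0..M-1, with its eigenpairs."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                        # centre the series
    N = len(x)
    # c[k] estimates the covariance between X_t and X_{t+k}
    c = np.array([np.dot(x[:N - k], x[k:]) / (N - k) for k in range(M)])
    # Entry (i, j) of the matrix depends only on the lag |i - j|
    lags = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    C = c[lags]
    eigvals, eofs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder so the largest comes first
    return C, eigvals[order], eofs[:, order]

# Toy series (hypothetical): a sinusoid with a period of 20 time points
t = np.arange(200)
series = np.sin(2 * np.pi * t / 20)
C, eigvals, eofs = lag_covariance_eofs(series, M=4)
print(C.shape)
print(eigvals)
```

Each column of `eofs` is one EOF, ordered by decreasing eigenvalue.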

These EOFs can then be used to construct the principal components (PCs) of the time series. These PCs can then be projected back onto the eigenvectors to obtain time series referred to as the reconstructed components (RCs) in the original coordinates, with each one corresponding to one of the PCs. An oscillatory component is characterised by a pair of nearly equal SSA eigenvalues whose associated PCs are in approximate phase quadrature. In this way, we can separate the time series into its oscillatory components and noise components. We can use the RCs to “filter” the time series by using fewer than the total number of RCs, so that we filter out the part of the time series that we suppose to be noise rather than signal.
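The PC and RC construction can be sketched in Python as follows. This is an illustrative sketch rather than the paper's code: the function name `ssa_components`, the noisy toy series and the choice of estimating the lag-covariance from the matrix of lagged windows are all assumptions. With that choice, the RCs add back up to the original series exactly, so keeping only the leading RCs acts as a filter.

```python
import numpy as np

def ssa_components(x, M):
    """Return eigenvalues, principal components and reconstructed components."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    K = N - M + 1
    Y = np.array([x[i:i + M] for i in range(K)])   # rows are lagged windows
    C = Y.T @ Y / K                                # M x M lag-covariance estimate
    eigvals, E = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, E = eigvals[order], E[:, order]       # EOFs, largest eigenvalue first
    PC = Y @ E                                     # project windows onto the EOFs
    RC = np.zeros((N, M))
    counts = np.zeros(N)
    for i in range(K):
        counts[i:i + M] += 1                       # windows covering each time point
    for k in range(M):
        Xk = np.outer(PC[:, k], E[:, k])           # rank-one piece for component k
        for i in range(K):
            RC[i:i + M, k] += Xk[i]                # accumulate along anti-diagonals
    RC /= counts[:, None]                          # average the overlapping values
    return eigvals, PC, RC

# Toy data (hypothetical): sinusoid plus Gaussian noise
rng = np.random.default_rng(0)
t = np.arange(200)
clean = np.sin(2 * np.pi * t / 20)
noisy = clean + 0.3 * rng.standard_normal(200)

eigvals, PC, RC = ssa_components(noisy, M=40)
filtered = RC[:, :2].sum(axis=1)                   # keep only the leading pair of RCs
print(eigvals[:4])
```

For a series dominated by one oscillation, the first two eigenvalues are expected to be close to each other, which is the pairing described above.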

Here Zara talks us through how she used SSA as a method for missing data imputation.

In clinical research, missing values are particularly prominent in long trials or in studies with lower patient adherence to the study protocol (e.g. trials focused on psychiatric disorders)[2]. In such trials, ignoring missing values may result in biased estimates or invalid conclusions, so adopting a reliable imputation method should be regarded as an essential consideration.

In clinical time series, which contain measurements collected at different time points in the life cycle of a trial, different reasons may lead to datasets with various proportions of missing data. For example, a series of blood glucose levels measured at each visit may contain missing data because of a faulty glucometer or because of patient absence. Excluding records that are missing because of medical device failure reduces the power of the study. Additionally, data missing due to patient absence may be more likely to be extreme values (e.g. an appointment was missed because the patient was recovering from a hypoglycaemic episode), so ignoring missingness in this case may lead to underestimating the variance and narrowing the confidence interval.[2-3]

This study introduces a new method of missing value imputation based on singular spectrum analysis (SSA). The SSA technique comprises two complementary stages, decomposition and reconstruction, each of which consists of two separate steps: embedding and singular value decomposition in the first stage, then grouping and diagonal averaging in the second. The decomposition stage breaks a time series down into several components. The reconstruction stage yields the signal of the series using the leading eigentriples of the trajectory matrix.
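In standard SSA notation, the two stages can be written out as below. The symbols are assumptions introduced here for illustration, following the usual textbook presentation rather than anything stated in this article: x_1, ..., x_N is the series, L the window length, K = N - L + 1, and lambda_i the eigenvalues of the matrix X X^T in decreasing order.

```latex
% Embedding: build the L x K trajectory matrix from lagged windows
\mathbf{X} = [X_1 : \dots : X_K], \qquad
X_j = (x_j, x_{j+1}, \dots, x_{j+L-1})^{\top}

% Singular value decomposition into eigentriples (\sqrt{\lambda_i}, U_i, V_i)
\mathbf{X} = \sum_{i=1}^{d} \sqrt{\lambda_i}\, U_i V_i^{\top}, \qquad
d = \operatorname{rank}(\mathbf{X})

% Grouping: keep the r leading eigentriples as the signal
\widehat{\mathbf{X}} = \sum_{i=1}^{r} \sqrt{\lambda_i}\, U_i V_i^{\top}

% Diagonal averaging: average the entries of \widehat{\mathbf{X}} on each anti-diagonal
\tilde{x}_n = \frac{1}{|A_n|} \sum_{(l,k) \in A_n} \widehat{X}_{lk}, \qquad
A_n = \{(l,k) : l + k = n + 1,\ 1 \le l \le L,\ 1 \le k \le K\}
```

The reconstructed series is then the sequence of averaged values for n = 1, ..., N.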

To impute the missing values, they are first replaced by initial values and then repeatedly reconstructed until convergence. The final reconstructed values are taken as the imputed values. The imputation algorithm contains the following steps:

  1. Set suitable initial values in place of missing data (e.g. mean of the non-missing data).
  2. Choose reasonable values of window length (L) and the number of leading eigentriples (r).
  3. Reconstruct the time series where its missing data are replaced with initial values.
  4. Replace the values of time series at missing locations with their reconstructed values.
  5. Reconstruct the time series.
  6. Repeat steps 4 and 5 until the maximum absolute difference between the replaced values in consecutive iterations is less than d, where d is the convergence threshold (a small positive number).
  7. Take the final replaced values as the imputed values. For more detailed information, see [1].
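The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from [1]: the function names `ssa_reconstruct` and `ssa_impute`, the `max_iter` safeguard, the default threshold and the toy sinusoid are all hypothetical, and missing values are assumed to be coded as NaN.

```python
import numpy as np

def ssa_reconstruct(x, L, r):
    """Rank-r SSA reconstruction: embed, SVD, keep r eigentriples, average."""
    N = len(x)
    K = N - L + 1
    X = np.column_stack([x[j:j + L] for j in range(K)])   # L x K trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xr = (U[:, :r] * s[:r]) @ Vt[:r]                       # leading r eigentriples
    recon = np.zeros(N)
    counts = np.zeros(N)
    for j in range(K):                                     # diagonal averaging
        recon[j:j + L] += Xr[:, j]
        counts[j:j + L] += 1
    return recon / counts

def ssa_impute(y, L, r, d=1e-6, max_iter=1000):
    """Iterative SSA imputation of the NaN entries of y."""
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    filled = y.copy()
    if not missing.any():
        return filled
    filled[missing] = np.nanmean(y)                # step 1: initial values
    for _ in range(max_iter):                      # max_iter is an added safeguard
        recon = ssa_reconstruct(filled, L, r)      # steps 3 and 5: reconstruct
        change = np.max(np.abs(recon[missing] - filled[missing]))
        filled[missing] = recon[missing]           # step 4: replace missing values
        if change < d:                             # step 6: convergence check
            break
    return filled                                  # step 7: imputed series

# Toy example (hypothetical): sinusoid with 30 randomly removed points
rng = np.random.default_rng(1)
truth = np.sin(2 * np.pi * np.arange(200) / 20)
missing_idx = rng.choice(200, size=30, replace=False)
y = truth.copy()
y[missing_idx] = np.nan
imputed = ssa_impute(y, L=40, r=2)                 # step 2: L and r chosen by hand
```

Only the missing positions are ever overwritten, so the observed values pass through unchanged. In practice L and r would be chosen from the data rather than fixed by hand as here.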

Figure 1 below depicts the application of the introduced method to a time series of length 491 (206 missing values), where the values imputed using the SSA method are shown in red. Note how the imputed values are not only consistent with the general pattern of the data, but also show a level of volatility similar to that of the original dataset, giving readers a reliable view of the long-term behaviour of the series.

The performance of the newly introduced method was also compared with well-established imputation methods: interpolation (linear, spline and Stineman), Kalman smoothing (ARIMA and StructTS), last observation carried forward (LOCF), next observation carried backward (NOCB), simple moving average (SMA), linear weighted moving average (LWMA) and exponential weighted moving average (EWMA). To evaluate the performance of these methods, 10% to 40% of the observations were randomly removed from the time series. The removed values were then estimated by each method, and the mean squared error (MSE) between the estimates and the true values was used as the main comparison criterion. Results suggest that the SSA-based technique produces robust estimates of missing values and outperforms the other imputation techniques considered, even in datasets with large proportions of missing data.
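The evaluation design described above can be sketched in Python as follows. The sketch covers only two of the simpler baselines named in the text (linear interpolation and LOCF) on a hypothetical toy series; the function names, the random seed and the toy data are assumptions, and the snippet does not reproduce the study's results. Any imputation function that takes a series with NaN entries and returns a completed series could be passed to `evaluate` in the same way.

```python
import numpy as np

def linear_interp(y):
    """Fill NaNs by linear interpolation between observed neighbours."""
    y = y.copy()
    miss = np.isnan(y)
    idx = np.arange(len(y))
    y[miss] = np.interp(idx[miss], idx[~miss], y[~miss])
    return y

def locf(y):
    """Fill NaNs with the last observed value (first point must be observed)."""
    y = y.copy()
    for i in range(1, len(y)):
        if np.isnan(y[i]):
            y[i] = y[i - 1]
    return y

def evaluate(series, impute, fractions=(0.1, 0.2, 0.3, 0.4), seed=0):
    """MSE at randomly removed positions, for each missing fraction."""
    rng = np.random.default_rng(seed)
    n = len(series)
    scores = {}
    for p in fractions:
        # Keep index 0 observed so that LOCF always has a value to carry forward
        drop = rng.choice(np.arange(1, n), size=int(round(p * n)), replace=False)
        y = series.copy()
        y[drop] = np.nan
        estimate = impute(y)
        scores[p] = float(np.mean((estimate[drop] - series[drop]) ** 2))
    return scores

# Toy series (hypothetical): a smooth sinusoid with a period of 50 time points
series = np.sin(2 * np.pi * np.arange(200) / 50)
lin_scores = evaluate(series, linear_interp)
locf_scores = evaluate(series, locf)
print(lin_scores)
print(locf_scores)
```

Using the same seed for every method means each one is scored on the same removed positions, which keeps the comparison fair.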

SSA is a non-parametric time series analysis technique which does not rely on parametric assumptions such as stationarity of the series or normality of the residuals. The results of this study suggest that it can serve as a practical imputation method in clinical studies.


References

  1. Hassani, H., Kalantari, M., & Ghodsi, Z. (2019). Evaluating the Performance of Multiple Imputation Methods for Handling Missing Values in Time Series Data: A Study Focused on East Africa, Soil Carbonate-Stable Isotope Data. Stats, 2(4), 457-467.
  2. Kaushal, S. (2014). Missing data in clinical trials: Pitfalls and remedies. International Journal of Applied and Basic Medical Research, 4(Suppl 1), S6-7.
  3. Little, R. J., D'Agostino, R., Cohen, M. L., Dickersin, K., Emerson, S. S., Farrar, J. T., ... & Neaton, J. D. (2012). The prevention and treatment of missing data in clinical trials. New England Journal of Medicine, 367(14), 1355-1360.