Multiplicity

Recently, a draft FDA guidance for industry, “Multiple Endpoints in Clinical Trials” [1], provided an overview of multiplicity, which is a controversial topic. The controversy stems from a common lack of clarity among practitioners about when and how multiplicity arises and which adjustments to implement so that the probability of false positive conclusions, i.e. the type I error rate, is constrained to a pre-specified significance level alpha when inference is performed for multiple hypothesis tests simultaneously [2]. Once a trial is shown to be successful on the primary endpoint(s), there may be other attributes of the drug’s effect which are informative for inclusion in the physician labelling, and testing these additional endpoints is itself a source of multiplicity.

Multiplicity has to be accounted for when several tests are performed at the same time, rather than a single inference from a comparison of two treatment arms on a single endpoint. Some of the reasons which lead to multiplicity include: (1) multiple primary endpoints which are each individually relevant to address the primary study objective, (2) more than two treatment arms, i.e. multiple treatments, combinations of treatments or different doses of the same treatment, (3) repeated measurements, i.e. an outcome measurement is recorded at different time points for each patient, (4) interim analyses and (5) subgroup analyses, where patient outcomes may vary when subsetting by variables such as biomarker status, age group or diabetes status.

Without adjustment, the overall type I error rate can be inflated, which makes it easier to obtain one or more significant results and can lead to the erroneous conclusion that a treatment is effective. The probability of making one or more type I errors across a family of tests is known as the family-wise error rate (FWER). For k independent tests, each performed at level alpha, it equals 1-(1-alpha)^k when all null hypotheses are true; for correlated tests the inflation is smaller but still present. The FDA typically requires “strong” control of the FWER over primary and secondary endpoints intended as potential label claims, i.e. control of the overall type I error probability regardless of which and how many endpoints have no effect.
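
To give a feel for how quickly the FWER grows, the following is a minimal sketch that evaluates 1-(1-alpha)^k for a few values of k. It assumes independent tests and that every null hypothesis is true; the function name fwer_independent and the chosen values of k are ours, for illustration only.

```python
def fwer_independent(alpha: float, k: int) -> float:
    """Probability of at least one type I error across k independent tests,
    each performed at level alpha, when all null hypotheses are true."""
    return 1.0 - (1.0 - alpha) ** k


# Illustrative values of k (an assumption, not taken from any guidance).
fwer_table = {k: round(fwer_independent(0.05, k), 4) for k in (1, 2, 5, 10, 20)}

for k, fwer in fwer_table.items():
    print(f"k = {k:2d}  unadjusted FWER = {fwer:.4f}")
```

Under these assumptions, five unadjusted tests at alpha = 0.05 already carry a chance of roughly 23% of at least one false positive, and twenty tests a chance of roughly 64%.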

Given that the probability of a type I error increases with the number of tests, it is generally preferable to have fewer tests. Every additional test reduces the significance level available to each individual test after adjustment, and sample size calculations that target a significant test statistic at a stricter level call for more patients to be recruited. One practical approach is to define a single primary endpoint and assign it the full alpha, while keeping a small number of secondary hypotheses which are adjusted for multiplicity. However, more complex approaches are often needed to maximise the chance of success.

In the last few years we have witnessed multiple new approaches to account for multiplicity. In SAS, PROC MULTTEST may be used to adjust raw p-values obtained from unadjusted hypothesis tests for multiplicity.

Many of these approaches are based on FWER control. There are single-step procedures, which do not depend on the order of the tests or on the rejection decisions for other hypotheses. Examples are the conservative Bonferroni correction, which tests each of the k hypotheses at an equal share alpha/k of the overall alpha, and the Sidak correction, which explicitly models the FWER under independence by testing each hypothesis at level 1-(1-alpha)^(1/k).
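
As a rough illustration of the two single-step corrections, the sketch below computes the per-test significance level and the adjusted p-values for each. The function names and the example p-values are hypothetical; the Sidak formulas assume independent tests.

```python
def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-test level under Bonferroni: an equal share of the overall alpha."""
    return alpha / k


def sidak_threshold(alpha: float, k: int) -> float:
    """Per-test level under Sidak: solves 1 - (1 - a)^k = alpha for a."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


def bonferroni_adjust(pvals):
    """Bonferroni-adjusted p-values, capped at 1."""
    k = len(pvals)
    return [min(1.0, p * k) for p in pvals]


def sidak_adjust(pvals):
    """Sidak-adjusted p-values (independence assumed)."""
    k = len(pvals)
    return [1.0 - (1.0 - p) ** k for p in pvals]


# Hypothetical raw p-values from four unadjusted tests.
raw = [0.004, 0.012, 0.030, 0.200]
print("Bonferroni level:", bonferroni_threshold(0.05, len(raw)))
print("Sidak level:     ", sidak_threshold(0.05, len(raw)))
print("Bonferroni adj.: ", bonferroni_adjust(raw))
print("Sidak adj.:      ", sidak_adjust(raw))
```

For four tests at an overall alpha of 0.05, the Bonferroni level is 0.0125 and the Sidak level is about 0.0127, so Sidak is marginally less conservative when independence holds.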

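Multi-step procedures are better at preserving power. In these methods the p-values are ordered and compared sequentially against progressively less strict thresholds, with a rejection decision rule that either (1) starts at the smallest p-value and moves towards the largest, stopping at the first hypothesis that cannot be rejected (the step-down Holm procedure), or (2) starts at the largest p-value and moves towards the smallest, rejecting all remaining hypotheses as soon as one p-value meets its threshold (the step-up Hochberg procedure). Holm is valid under any dependence structure, whereas Hochberg relies on independence or certain forms of positive dependence between the tests.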
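
The two decision rules can be sketched as follows. This is a minimal illustration rather than a validated implementation: both functions compare the i-th smallest of k p-values with alpha/(k-i+1) and differ only in the direction of travel. The function names and the example p-values are hypothetical.

```python
def holm_reject(pvals, alpha=0.05):
    """Step-down: start at the smallest p-value, stop at the first failure."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])  # indices, ascending p
    reject = [False] * k
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        if pvals[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break  # this and all larger p-values are not rejected
    return reject


def hochberg_reject(pvals, alpha=0.05):
    """Step-up: start at the largest p-value; at the first success,
    reject that hypothesis and every hypothesis with a smaller p-value."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for rank in range(k - 1, -1, -1):  # rank k-1 is the largest p-value
        if pvals[order[rank]] <= alpha / (k - rank):
            for idx in order[: rank + 1]:
                reject[idx] = True
            break
    return reject


# Hypothetical p-values for three endpoints.
pvals = [0.04, 0.02, 0.03]
print("Holm:    ", holm_reject(pvals))
print("Hochberg:", hochberg_reject(pvals))
```

With these p-values Holm rejects nothing, because the smallest p-value 0.02 exceeds 0.05/3, while Hochberg rejects all three hypotheses, because the largest p-value 0.04 is below 0.05. Hochberg always rejects at least as much as Holm, at the price of the dependence assumption.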

To conclude, we point to the False Discovery Rate (FDR) as an alternative to the strong control of the FWER requested by the FDA. FDR-controlling procedures do not control the probability of making any type I error at all; instead they control the expected proportion of incorrectly rejected hypotheses among all rejected hypotheses, which is typically set to less than 10 percent. In FWER a fixed overall error rate defines a rejection region, while in FDR a fixed rejection region defines the error rate. For example, a significance level of 0.05 means that, on average, 1 in 20 true null hypotheses is rejected, while a 5% FDR implies that, in expectation, the null hypothesis is actually true in 5% of the significant tests. FWER control is suitable if you want to avoid any falsely rejected null hypotheses (false positives) and accept fewer significant decisions in general, which is why it is used for regulatory approval of new drugs. However, in other applications, such as genomics and exploratory screening, we are willing to accept some false positives in exchange for a higher number of correctly rejected hypotheses. In other words, FWER control yields fewer type I errors while FDR control has higher power. Furthermore, FDR is not designed to handle complex decision rules, such as superiority vs noninferiority and primary vs secondary endpoints.
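
The text above does not name a specific FDR method. A widely used one is the Benjamini-Hochberg step-up procedure, which we use here purely as an illustration: it rejects the hypotheses with the i smallest p-values, where i is the largest rank whose p-value is at most (i/m)*q for m tests and a target FDR of q. The function names and p-values below are hypothetical, and the procedure assumes independent or positively dependent tests.

```python
def benjamini_hochberg_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure at target FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, ascending p
    reject = [False] * m
    for rank in range(m - 1, -1, -1):  # start at the largest p-value
        if pvals[order[rank]] <= (rank + 1) / m * q:
            for idx in order[: rank + 1]:
                reject[idx] = True
            break
    return reject


def bonferroni_reject(pvals, alpha=0.05):
    """Single-step FWER control, shown for comparison."""
    return [p <= alpha / len(pvals) for p in pvals]


# Hypothetical p-values from five exploratory tests.
pvals = [0.001, 0.012, 0.020, 0.030, 0.200]
print("BH (FDR 5%):       ", benjamini_hochberg_reject(pvals))
print("Bonferroni (FWER): ", bonferroni_reject(pvals))
```

On this example the FDR procedure rejects four hypotheses while Bonferroni rejects only one, which reflects the power advantage of FDR control in screening settings where some false positives are tolerable.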

If you require support with multiplicity considerations please do not hesitate to get in touch with PHASTAR.

[1] Multiple Endpoints in Clinical Trials: Draft Guidance for Industry: https://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm536750.pdf

[2] Dmitrienko A, Tamhane AC, Bretz F, eds. Multiple Testing Problems in Pharmaceutical Statistics. New York: CRC Press, 2009