Sample Size for a Diagnostic Study
Case study
PHASTAR recently provided design input and sample size calculations for a study of a new diagnostic test to detect sepsis in blood samples. For patients presenting with suspected sepsis, the current method of confirming the diagnosis is blood culture, which takes 36 to 48 hours to process. Because sepsis is severe and escalates rapidly, treatment must be initiated immediately; since it is not possible to wait for a result, the patient is treated for sepsis empirically. There are 25 different bacteria types that can cause sepsis, and the treatment depends on the bacteria type. The current blood culture method cannot determine the bacteria type, so the patient is treated for the most common sepsis-causing bacteria type (E. coli). The new diagnostic test seeks to address both issues: it can return a sepsis diagnosis within 3 to 4 hours while also identifying the individual bacteria type.
Blood culture is the current standard method of testing, but it is not the definitive method. Hence, as per the FDA guidance document (Guidance for Industry and FDA Staff: Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests [1], dated 13 March 2003), the current method is to be regarded as a non-reference standard rather than a gold standard. The definitive method of testing for sepsis is genome sequencing of the blood sample, but it is not routinely used because it is expensive.
In the proposed study, each patient will provide one blood sample. Each blood sample will be tested for sepsis using both the new diagnostic test and the current (non-reference standard) test. In addition, all blood samples will need to be tested via genome sequencing (true gold standard) to obtain the true result. This is needed to confirm every result, whether the new diagnostic test and the current (blood culture) test agree or disagree. Each test has two possible outcomes, Positive and Negative. The possible combinations of outcomes are therefore:
Actual (Genome Sequencing) | Current (Blood Culture) Test | New Diagnostic Test |
Positive | Positive | Positive |
Positive | Positive | Negative |
Positive | Negative | Positive |
Positive | Negative | Negative |
Negative | Positive | Positive |
Negative | Positive | Negative |
Negative | Negative | Positive |
Negative | Negative | Negative |
As per the FDA guidance document, comparing a new test to a non-reference standard does not give a true result. Furthermore, discrepant resolution is inappropriate: outcomes that are altered or updated by discrepant resolution should not be used to estimate the sensitivity and specificity of a new test, or the agreement between a new test and a non-reference standard. Instead, all blood samples will also need to be tested via genome sequencing (true gold standard) to get a definitive result.
The proposed primary endpoints for the study are:
- The number of positive results using the new diagnostic test compared to the number of positive results from the current (non-reference standard) test
- The number of correct bacteria types identified using the new diagnostic test as confirmed via genome sequencing (true gold standard)
The power to identify the individual bacteria type will be low given the low prevalence of each of the 25 bacteria types. Only the top 5 bacteria types have a prevalence of >5%, whilst the bottom 10 bacteria types have a prevalence of <1%.
At this stage, multiple testing for the co-primary endpoints has not yet been considered.
The proposed analysis will be based on a paired test in proportions, using the following approach:
- Construct two 2x2 tables – one for the new diagnostic test versus genome sequencing (true gold standard) and one for the current (non-reference standard) test versus genome sequencing (true gold standard)
- From each 2x2 table, estimate the sensitivity and specificity for the new diagnostic test and the current (non-reference standard) test
- Compare the sensitivity of the new diagnostic test versus the current (non-reference standard) test using the difference in proportions and similarly for specificity
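The three analysis steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the `samples` data are hypothetical toy values (not study results), and the helper names `two_by_two` and `sens_spec` are assumptions introduced here.

```python
from typing import List, Tuple

# Each sample: (sequencing_positive, current_positive, new_positive).
# Hypothetical toy data for illustration only; these are not study results.
samples: List[Tuple[bool, bool, bool]] = [
    (True, True, True), (True, True, True),
    (True, False, True), (True, False, True),
    (False, False, False), (False, False, False),
    (False, False, False), (False, False, False),
    (False, True, False), (False, False, True),
]


def two_by_two(rows, test_index):
    """Return (TP, FP, FN, TN) for one test versus genome sequencing."""
    tp = fp = fn = tn = 0
    for row in rows:
        truth, result = row[0], row[test_index]
        if truth and result:
            tp += 1
        elif truth and not result:
            fn += 1
        elif not truth and result:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sens_spec(table):
    """Estimate sensitivity and specificity from a (TP, FP, FN, TN) table."""
    tp, fp, fn, tn = table
    return tp / (tp + fn), tn / (tn + fp)


# One 2x2 table per test, each against the gold standard.
sens_cur, spec_cur = sens_spec(two_by_two(samples, 1))
sens_new, spec_new = sens_spec(two_by_two(samples, 2))

# Differences in proportions (new minus current).
diff_sens = sens_new - sens_cur
diff_spec = spec_new - spec_cur
print(f"Sensitivity: new={sens_new:.2f}, current={sens_cur:.2f}, diff={diff_sens:.2f}")
print(f"Specificity: new={spec_new:.2f}, current={spec_cur:.2f}, diff={diff_spec:.2f}")
```

Because both tests are applied to the same blood samples, the two sensitivity estimates (and the two specificity estimates) are paired, which is why the sample size calculation is based on a paired test.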
Sample size calculations were conducted using nQuery, based on a paired test of non-inferiority in proportions, with the following assumptions:
- nQuery - paired test of non-inferiority in proportions
- Significance level = 5%
- One-sided
- Non-inferiority margin = 5%
- H0: π1 – π0 ≤ -0.05 (inferior) versus the alternative hypothesis that H1: π1 - π0 > -0.05 (not inferior)
- Non-inferiority limit difference Δ0 = -0.05 (null hypothesis of inferiority)
- Expected difference Δ1 = 0 (alternative hypothesis of non-inferiority)
- Proportion discordant η = π10 + π01 (proportion of tests that will disagree under the alternative hypothesis)
- Number of discordant pairs varies according to sensitivity and specificity
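To make the role of these inputs concrete, the following sketch uses a standard normal-approximation formula for a paired non-inferiority test, n = (z_α + z_β)² (η − Δ1²) / (Δ1 − Δ0)². This formula is an assumption chosen for illustration; it is not confirmed to be the exact method nQuery uses, and the function name `paired_noninferiority_n` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def paired_noninferiority_n(eta, delta0=-0.05, delta1=0.0,
                            alpha=0.05, power=0.90):
    """Approximate number of pairs for a one-sided paired non-inferiority test.

    eta    : proportion discordant (pi10 + pi01) under the alternative
    delta0 : non-inferiority limit difference (null hypothesis)
    delta1 : expected difference (alternative hypothesis)
    Assumes a normal approximation; not necessarily nQuery's algorithm.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = eta - delta1 ** 2  # per-pair variance of the difference
    return ceil((z_alpha + z_beta) ** 2 * variance / (delta1 - delta0) ** 2)


# Illustrative input: 9% of pairs discordant, 90% power.
n_example = paired_noninferiority_n(eta=0.09)
print(n_example)
```

Under this approximation, the required sample size grows in proportion to the proportion discordant, which is why the number of discordant pairs (driven by sensitivity and specificity) matters so much.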
Example
If the sensitivity and specificity of the current test are 85% and 95%, respectively, and the positive rate from sequencing is 40%, as summarized in the table below, then the required sample size is 309 samples.
Current Test | Sequencing Positive | Sequencing Negative | Total |
Positive | 34% | 3% | 37% |
Negative | 6% | 57% | 63% |
Total | 40% | 60% | 100% |

Sensitivity | Specificity |
85% | 95% |
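The cell percentages in the table follow directly from the assumed sensitivity, specificity and positive rate. A minimal sketch of that arithmetic is shown here; the function name `joint_cells` is a hypothetical helper introduced for illustration.

```python
def joint_cells(sensitivity, specificity, positive_rate):
    """Joint proportions of current-test result versus sequencing result."""
    tp = positive_rate * sensitivity              # test positive, sequencing positive
    fn = positive_rate * (1 - sensitivity)        # test negative, sequencing positive
    tn = (1 - positive_rate) * specificity        # test negative, sequencing negative
    fp = (1 - positive_rate) * (1 - specificity)  # test positive, sequencing negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


cells = joint_cells(sensitivity=0.85, specificity=0.95, positive_rate=0.40)

# The discordant proportion is the sum of the two off-diagonal cells.
discordant = cells["FP"] + cells["FN"]

for name, value in cells.items():
    print(f"{name}: {value:.0%}")
print(f"Proportion discordant: {discordant:.0%}")
```

The two off-diagonal cells (3% and 6%) are the ones that sum to the 9% proportion discordant.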
Since this is based only on the number of discordant samples, it is necessary to check that the sample size will be adequate. To do this, use another sample size table in nQuery to compute the power for a test of non-equivalence based on the observed lower limit of the confidence interval for the difference in proportions.
- nQuery – lower confidence limit for difference in paired proportions (simulation)
- Confidence level 1 – α (one-sided) = 0.95
- Expected difference π1 - π0 Δ1 = 0
- Proportion discordant η = π10 + π01 = 0.09 (from the above table – 3% + 6% = 9%)
Use the nQuery side table to compute the remaining values and transfer to main table (may need to overwrite the expected difference to ensure that it remains as 0 rather than the computed value).
- Lower limit for π1 – π0 LL = -0.05
- Number of simulations = 1000
- Random seed = 2020
- n = 309 (sample size as calculated from nQuery in the first step above)
nQuery then computes the estimated power. If the estimated power is less than 90%, increase the sample size and recompute with a larger value of n. Repeat this until the estimated power is greater than 90%.
In this example, the estimated power is greater than 90% when n = 309.
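The simulation check can be approximated outside nQuery with a short Monte Carlo sketch. Several things here are assumptions made for illustration: the discordant pairs are split evenly (π10 = π01 = η/2 when the expected difference is 0), the lower confidence limit is a simple Wald limit, and the function name `simulate_power` is hypothetical. nQuery's interval method and random number generator may differ, so the same seed will not reproduce its output.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_power(n, eta, delta1=0.0, lower_limit=-0.05,
                   alpha=0.05, n_sims=1000, seed=2020):
    """Estimate the power as the fraction of simulated studies whose
    one-sided lower confidence limit for pi1 - pi0 exceeds lower_limit."""
    p10 = (eta + delta1) / 2  # probability of a discordant pair favouring test 1
    p01 = (eta - delta1) / 2  # probability of a discordant pair favouring test 0
    z = NormalDist().inv_cdf(1 - alpha)
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_sims):
        n10 = n01 = 0
        for _ in range(n):
            u = rng.random()
            if u < p10:
                n10 += 1
            elif u < p10 + p01:
                n01 += 1
        diff = (n10 - n01) / n
        # Wald standard error for a difference in paired proportions.
        se = sqrt(((n10 + n01) / n - diff ** 2) / n)
        if diff - z * se > lower_limit:
            successes += 1
    return successes / n_sims


power = simulate_power(n=309, eta=0.09)
print(f"Estimated power: {power:.3f}")
```

If the estimate falls below the target, the same function can be called again with a larger `n` until the target power is reached, mirroring the iterative procedure described above.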
Sample sizes were generated for a range of values of sensitivity (80%, 85%, 90%, 95%), specificity (90%, 95%), positive rate (40%, 30%, 20%, 10%) and power (80%, 85%, 90%).
Estimated Sample Sizes Required for Varying Values of Sensitivity, Specificity and Power
40% Positive Rate
Sensitivity | Specificity | Non-Inferiority Limit | Power | Alpha | Positive Rate | Negative Rate | Proportion Discordant | n |
80% | 90% | 0.05 | 80% | 0.05 | 40% | 60% | 14.0% | 347 |
85% | 90% | 0.05 | 80% | 0.05 | 40% | 60% | 12.0% | 297 |
90% | 90% | 0.05 | 80% | 0.05 | 40% | 60% | 10.0% | 248 |
95% | 90% | 0.05 | 80% | 0.05 | 40% | 60% | 8.0% | 199 |
80% | 95% | 0.05 | 80% | 0.05 | 40% | 60% | 11.0% | 273 |
85% | 95% | 0.05 | 80% | 0.05 | 40% | 60% | 9.0% | 223 |
90% | 95% | 0.05 | 80% | 0.05 | 40% | 60% | 7.0% | 174 |
95% | 95% | 0.05 | 80% | 0.05 | 40% | 60% | 5.0% | 149 |
80% | 90% | 0.05 | 85% | 0.05 | 40% | 60% | 14.0% | 403 |
85% | 90% | 0.05 | 85% | 0.05 | 40% | 60% | 12.0% | 346 |
90% | 90% | 0.05 | 85% | 0.05 | 40% | 60% | 10.0% | 288 |
95% | 90% | 0.05 | 85% | 0.05 | 40% | 60% | 8.0% | 231 |
80% | 95% | 0.05 | 85% | 0.05 | 40% | 60% | 11.0% | 317 |
85% | 95% | 0.05 | 85% | 0.05 | 40% | 60% | 9.0% | 259 |
90% | 95% | 0.05 | 85% | 0.05 | 40% | 60% | 7.0% | 204 |
95% | 95% | 0.05 | 85% | 0.05 | 40% | 60% | 5.0% | 173 |
80% | 90% | 0.05 | 90% | 0.05 | 40% | 60% | 14.0% | 480 |
85% | 90% | 0.05 | 90% | 0.05 | 40% | 60% | 12.0% | 412 |
90% | 90% | 0.05 | 90% | 0.05 | 40% | 60% | 10.0% | 343 |
95% | 90% | 0.05 | 90% | 0.05 | 40% | 60% | 8.0% | 275 |
80% | 95% | 0.05 | 90% | 0.05 | 40% | 60% | 11.0% | 377 |
85% | 95% | 0.05 | 90% | 0.05 | 40% | 60% | 9.0% | 309 |
90% | 95% | 0.05 | 90% | 0.05 | 40% | 60% | 7.0% | 240 |
95% | 95% | 0.05 | 90% | 0.05 | 40% | 60% | 5.0% | 206 |
30% Positive Rate
Sensitivity | Specificity | Non-Inferiority Limit | Power | Alpha | Positive Rate | Negative Rate | Proportion Discordant | n |
80% | 90% | 0.05 | 80% | 0.05 | 30% | 70% | 13.0% | 322 |
85% | 90% | 0.05 | 80% | 0.05 | 30% | 70% | 11.5% | 285 |
90% | 90% | 0.05 | 80% | 0.05 | 30% | 70% | 10.0% | 248 |
95% | 90% | 0.05 | 80% | 0.05 | 30% | 70% | 8.5% | 210 |
80% | 95% | 0.05 | 80% | 0.05 | 30% | 70% | 9.5% | 235 |
85% | 95% | 0.05 | 80% | 0.05 | 30% | 70% | 8.0% | 205 |
90% | 95% | 0.05 | 80% | 0.05 | 30% | 70% | 6.5% | 168 |
95% | 95% | 0.05 | 80% | 0.05 | 30% | 70% | 5.5% | 137 |
80% | 90% | 0.05 | 85% | 0.05 | 30% | 70% | 13.0% | 374 |
85% | 90% | 0.05 | 85% | 0.05 | 30% | 70% | 11.5% | 331 |
90% | 90% | 0.05 | 85% | 0.05 | 30% | 70% | 10.0% | 288 |
95% | 90% | 0.05 | 85% | 0.05 | 30% | 70% | 8.5% | 216 |
80% | 95% | 0.05 | 85% | 0.05 | 30% | 70% | 9.5% | 274 |
85% | 95% | 0.05 | 85% | 0.05 | 30% | 70% | 8.0% | 231 |
90% | 95% | 0.05 | 85% | 0.05 | 30% | 70% | 6.5% | 195 |
95% | 95% | 0.05 | 85% | 0.05 | 30% | 70% | 5.5% | 159 |
80% | 90% | 0.05 | 90% | 0.05 | 30% | 70% | 13.0% | 446 |
85% | 90% | 0.05 | 90% | 0.05 | 30% | 70% | 11.5% | 394 |
90% | 90% | 0.05 | 90% | 0.05 | 30% | 70% | 10.0% | 343 |
95% | 90% | 0.05 | 90% | 0.05 | 30% | 70% | 8.5% | 290 |
80% | 95% | 0.05 | 90% | 0.05 | 30% | 70% | 9.5% | 327 |
85% | 95% | 0.05 | 90% | 0.05 | 30% | 70% | 8.0% | 277 |
90% | 95% | 0.05 | 90% | 0.05 | 30% | 70% | 6.5% | 223 |
95% | 95% | 0.05 | 90% | 0.05 | 30% | 70% | 5.5% | 189 |
20% Positive Rate
Sensitivity | Specificity | Non-Inferiority Limit | Power | Alpha | Positive Rate | Negative Rate | Proportion Discordant | n |
80% | 90% | 0.05 | 80% | 0.05 | 20% | 80% | 12.0% | 300 |
85% | 90% | 0.05 | 80% | 0.05 | 20% | 80% | 11.0% | 277 |
90% | 90% | 0.05 | 80% | 0.05 | 20% | 80% | 10.0% | 255 |
95% | 90% | 0.05 | 80% | 0.05 | 20% | 80% | 9.0% | 227 |
80% | 95% | 0.05 | 80% | 0.05 | 20% | 80% | 8.0% | 210 |
85% | 95% | 0.05 | 80% | 0.05 | 20% | 80% | 7.0% | 182 |
90% | 95% | 0.05 | 80% | 0.05 | 20% | 80% | 6.0% | 162 |
95% | 95% | 0.05 | 80% | 0.05 | 20% | 80% | 5.5% | 143 |
80% | 90% | 0.05 | 85% | 0.05 | 20% | 80% | 12.0% | 346 |
85% | 90% | 0.05 | 85% | 0.05 | 20% | 80% | 11.0% | 317 |
90% | 90% | 0.05 | 85% | 0.05 | 20% | 80% | 10.0% | 288 |
95% | 90% | 0.05 | 85% | 0.05 | 20% | 80% | 9.0% | 259 |
80% | 95% | 0.05 | 85% | 0.05 | 20% | 80% | 8.0% | 231 |
85% | 95% | 0.05 | 85% | 0.05 | 20% | 80% | 7.0% | 210 |
90% | 95% | 0.05 | 85% | 0.05 | 20% | 80% | 6.0% | 185 |
95% | 95% | 0.05 | 85% | 0.05 | 20% | 80% | 5.5% | 159 |
80% | 90% | 0.05 | 90% | 0.05 | 20% | 80% | 12.0% | 415 |
85% | 90% | 0.05 | 90% | 0.05 | 20% | 80% | 11.0% | 382 |
90% | 90% | 0.05 | 90% | 0.05 | 20% | 80% | 10.0% | 343 |
95% | 90% | 0.05 | 90% | 0.05 | 20% | 80% | 9.0% | 309 |
80% | 95% | 0.05 | 90% | 0.05 | 20% | 80% | 8.0% | 280 |
85% | 95% | 0.05 | 90% | 0.05 | 20% | 80% | 7.0% | 245 |
90% | 95% | 0.05 | 90% | 0.05 | 20% | 80% | 6.0% | 215 |
95% | 95% | 0.05 | 90% | 0.05 | 20% | 80% | 5.5% | 189 |
10% Positive Rate
Sensitivity | Specificity | Non-Inferiority Limit | Power | Alpha | Positive Rate | Negative Rate | Proportion Discordant | n |
80% | 90% | 0.05 | 80% | 0.05 | 10% | 90% | 11.0% | 275 |
85% | 90% | 0.05 | 80% | 0.05 | 10% | 90% | 10.5% | 262 |
90% | 90% | 0.05 | 80% | 0.05 | 10% | 90% | 10.0% | 252 |
95% | 90% | 0.05 | 80% | 0.05 | 10% | 90% | 9.5% | 250 |
80% | 95% | 0.05 | 80% | 0.05 | 10% | 90% | 6.5% | 182 |
85% | 95% | 0.05 | 80% | 0.05 | 10% | 90% | 6.0% | 170 |
90% | 95% | 0.05 | 80% | 0.05 | 10% | 90% | 5.5% | 158 |
95% | 95% | 0.05 | 80% | 0.05 | 10% | 90% | 5.0% | 157 |
80% | 90% | 0.05 | 85% | 0.05 | 10% | 90% | 11.0% | 320 |
85% | 90% | 0.05 | 85% | 0.05 | 10% | 90% | 10.5% | 302 |
90% | 90% | 0.05 | 85% | 0.05 | 10% | 90% | 10.0% | 288 |
95% | 90% | 0.05 | 85% | 0.05 | 10% | 90% | 9.5% | 285 |
80% | 95% | 0.05 | 85% | 0.05 | 10% | 90% | 6.5% | 207 |
85% | 95% | 0.05 | 85% | 0.05 | 10% | 90% | 6.0% | 197 |
90% | 95% | 0.05 | 85% | 0.05 | 10% | 90% | 5.5% | 180 |
95% | 95% | 0.05 | 85% | 0.05 | 10% | 90% | 5.0% | 180 |
80% | 90% | 0.05 | 90% | 0.05 | 10% | 90% | 11.0% | 377 |
85% | 90% | 0.05 | 90% | 0.05 | 10% | 90% | 10.5% | 365 |
90% | 90% | 0.05 | 90% | 0.05 | 10% | 90% | 10.0% | 352 |
95% | 90% | 0.05 | 90% | 0.05 | 10% | 90% | 9.5% | 352 |
80% | 95% | 0.05 | 90% | 0.05 | 10% | 90% | 6.5% | 240 |
85% | 95% | 0.05 | 90% | 0.05 | 10% | 90% | 6.0% | 230 |
90% | 95% | 0.05 | 90% | 0.05 | 10% | 90% | 5.5% | 207 |
95% | 95% | 0.05 | 90% | 0.05 | 10% | 90% | 5.0% | 207 |
Reference