---
title: "Lab 6: Large Scale Inference"
subtitle: "High Dimensional Data Analysis practicals"
author: "Adapted by Milan Malfait and Leo Fuhrhop"
date: "21 Feb 2022 <br/> (Last updated: 2025-11-26)"
references:
- id: alon1999broad
type: article-journal
author:
- family: Alon
given: Uri
- family: Barkai
given: Naama
- family: Notterman
given: Daniel A
- family: Gish
given: Kurt
- family: Ybarra
given: Suzanne
- family: Mack
given: Daniel
- family: Levine
given: Arnold J
issued:
- year: 1999
title: Broad patterns of gene expression revealed by clustering analysis of tumor
and normal colon tissues probed by oligonucleotide arrays
container-title: Proceedings of the National Academy of Sciences
publisher: National Acad Sciences
page: 6745-6750
volume: '96'
issue: '12'
---
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.align = "center",
fig.width = 8,
fig.asp = 0.618,
out.width = "100%"
)
```
### [Change log](https://github.com/statOmics/HDDA/commits/master/Lab6-Large-Scale-Inference.Rmd) {-}
***
```{r libraries, warning=FALSE, message=FALSE}
## install packages with:
# if (!requireNamespace("remotes", quietly = TRUE)) {
# install.packages("remotes")
# }
# remotes::install_github("statOmics/HDDAData")
library(HDDAData)
```
# Introduction
**In this lab session we will look at the following topics**
- The family-wise error rate (FWER)
- The false discovery rate (FDR)
- The multiple testing problem in a real dataset
## Testing many hypotheses {#testing-hypotheses}
To demonstrate the ideas we will be working with, we will simulate artificial data.
Note that since we are doing simulations, we can control everything and also know
exactly what the underlying "true" distribution is (sometimes also called the "ground truth").
However, keep in mind that such an idealized setting is rarely representative of real-world data.
In particular, we will simulate data for multiple hypothesis tests where
__the null hypothesis is always true__. This means that for the $i$-th test, if we let
$\mu_{1i}$ and $\mu_{2i}$ represent the means of the two populations of interest,
the null hypothesis for comparing the two means is
$H_{0i} : \mu_{1i} = \mu_{2i}$. Let the alternative hypothesis of this test be
$H_{1i} : \mu_{1i} \neq \mu_{2i}$, i.e., we perform a *two-tailed* test.
Suppose we have data collected from both populations, given by $X_{1i}$ and $X_{2i}$
of size $n_{1i}$ and $n_{2i}$, respectively, and assume that both populations
have the same known variance $\sigma_i^2$. Then we can test this hypothesis by using a
__Z-test statistic__, given by
$$ Z_i=\frac{\bar X_{1i}-\bar X_{2i}}{\sigma_i\sqrt{1/n_{1i}+1/n_{2i}}} \, .$$
Under the null hypothesis, the scores will be distributed as a *standard normal*
$Z_i\sim N(0,1)$. The null hypothesis is rejected in favor of the alternative if
$z_i < \phi_{\alpha_i/2}$ or $z_i > \phi_{1-\alpha_i/2}$, or equivalently if
$|z_i| > \phi_{1-\alpha_i/2}$, where $\phi_{\alpha_i/2}$ is the $\alpha_i/2$th
quantile of the standard normal distribution.
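As a minimal sketch of this rejection rule (using base R's `qnorm()`; the cutoff `alpha` and statistic `z_obs` below are illustrative values, not taken from the lab data):

```{r rejection-rule-sketch}
alpha <- 0.05
# Upper critical value phi_{1 - alpha/2} of the standard normal (about 1.96)
z_crit <- qnorm(1 - alpha / 2)
z_crit
# Reject H_0 when the observed statistic exceeds the critical value in absolute value
z_obs <- -2.3 # a hypothetical observed Z-statistic
abs(z_obs) > z_crit
```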
In a multiple hypothesis testing setting, we will perform $m$ tests using the
same test statistic. If the *null* were true for all hypotheses, we would end up
with a sample $z_1,\ldots,z_m$ from a standard normal distribution.
In this setting, we can only make one type of error: wrongly rejecting the null
hypothesis, i.e. a __type 1 error__. The probability of making this error is given
by
$$ \alpha_i=\text{P}\{\text{ reject } H_{0i} | H_{0i}\} $$
(the $| H_{0i}$ part should be read as "given that the null hypothesis is true").
If we now perform $m_0$ such tests (using the 0 subscript to denote that they
are all hypotheses for which the *null* is true), we can summarise the possible
outcomes as in the table below:
```{r, echo=FALSE}
knitr::kable(
structure(list(
`Accept H_0` = "U (True Negative)",
`Reject H_0` = "V (False Positive)",
`Total` = "m_0"
),
.Names = c("Accept H_0", "Reject H_0", "Total"),
row.names = c("H_0 True"),
class = "data.frame"
)
)
```
Here, $U$ and $V$ represent the total number of true negative and false positive
results we get, respectively, out of $m_0$ total tests.
As an example, let's simulate the scenario above. In the code below, the `replicate`
function is used to repeat the same procedure a number of times.
Essentially, the following steps are performed:
1. Sample $N = 50$ observations for 2 hypothetical groups from normal distributions
with the same mean and known variance.
2. Calculate Z-test statistics based on the 2 group means.
3. Repeat steps 1 to 2 `m0` times.
This mimics performing `m0` hypothesis tests on data for which we know the null
hypothesis is true.
```{r}
# Set parameters for the simulation
N <- 50 # samples per group
m0 <- 1000 # number of hypothesis tests
mu_1 <- 3 # true mean group 1
mu_2 <- 3 # true mean group 2
sigma <- 1 # known variance
denom <- sqrt(sigma) * sqrt(2 / N) # denominator for Z-scores: sd * sqrt(1/N + 1/N)
set.seed(987654321) # seed for reproducibility
z_stats <- replicate(m0, {
group1 <- rnorm(N, mean = mu_1, sd = sqrt(sigma))
group2 <- rnorm(N, mean = mu_2, sd = sqrt(sigma))
# Calculate Z-test statistic
(mean(group2) - mean(group1)) / denom
})
# Visualize Z-test statistics
hist(z_stats, breaks = 50, freq = FALSE, main = "Z-test statistics under H_0")
# Overlay theoretical standard normal density
x <- seq(-5, 5, length.out = 100)
lines(x, dnorm(x), col = "dodgerblue", lwd = 3)
# Draw vertical lines at the 2.5th and 97.5th percentiles
abline(v = qnorm(c(0.025, 0.975)), col = "firebrick", lty = 2, lwd = 3)
```
We see that the Z-test statistics nicely follow a standard normal distribution.
The vertical dashed lines indicate the 2.5th and 97.5th percentiles of the standard
normal. The regions outside these lines indicate the Z-scores that we would call
significant if we used a cut-off of $\alpha = 0.05$ for a *two-tailed* test.
So, even though we simulated data under the null hypothesis, our Z-test still
returns "significant" results for a number of cases just by chance!
Let's calculate the p-values for our hypothesis tests and see what the damage is.
To calculate the p-values, we use the `pnorm()` function in this case, which returns
the value of the standard normal CDF (i.e. `pnorm(x)` = $P(Z < x)$). Since we consider
a two-tailed test, we take the absolute values of the Z-scores and set the `lower.tail`
argument in `pnorm` to `FALSE` (by default it's `TRUE`), so that we get
`pnorm(abs(x), lower.tail = FALSE)` = $P(Z > |x|)$ and multiply this value by 2.
```{r}
null_pvals <- 2 * pnorm(abs(z_stats), lower.tail = FALSE)
alpha <- 0.05
hist(null_pvals, breaks = seq(0, 1, by = 0.05))
abline(v = alpha, col = "firebrick", lwd = 3)
called <- (null_pvals < alpha)
# V = number of false positives, in this case: all significant tests
(V <- sum(called))
mean(called) # V / m0
```
So we get 51 significant tests (false positives) out of a total of 1000. This
proportion (0.051) is, unsurprisingly, approximately equal to our significance cutoff $\alpha$.
Note also that, under the null hypothesis, the p-values are uniformly distributed by definition.
If we had carried out only a few tests (say 10), we would be unlikely to observe
a false positive (on average: $0.05 \times 10 = 0.5$ false positives) at $\alpha = 0.05$,
but since we are now carrying out so many, we are almost guaranteed to get false positives.
This is what is known as the __multiple hypothesis testing problem__.
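The scale of the problem can be sketched directly: under the null, each test has probability $\alpha$ of producing a false positive, so the expected number of false positives grows linearly with the number of tests (the `m0` values below are illustrative):

```{r expected-false-positives}
alpha <- 0.05
m0 <- c(10, 100, 1000, 10000)
# Expected number of false positives E[V] = m0 * alpha when all nulls are true
data.frame(m0 = m0, expected_false_positives = m0 * alpha)
```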
Note that in real-world data the quantities $U$ and $V$ are *unknown* (because
we don't know the truth; if we did, we wouldn't have to carry out any hypothesis
tests in the first place). However, by using simulated data, we do know the truth
and so we can explore these quantities.
## The family-wise error rate
If we carry out many tests, we're almost guaranteed to get type I errors, just by
chance. Therefore the type I error rate is no longer a relevant metric.
Instead, we consider the __Familywise Error Rate (FWER)__, given by
$$
\text{FWER}=\text{P}\{V > 0\}
= \text{P}\{\text{rejecting at least one } H_{0i}| H_0\}
$$
where $H_0$ is the intersection of all partial null hypotheses $H_{0i}$, $i=1,\ldots,m_0$.
In general, we would prefer testing procedures that keep the FWER under control,
i.e. correct for the multiplicity problem.
So instead of choosing $\alpha$ to control the probability of getting a false positive
in each test, we try to control the FWER, i.e. the probability of getting
*at least* one false positive in our *set* of hypothesis tests.
In the following exercises, we will explore this idea on some synthetic data. We will
simulate performing hypothesis tests for which the null hypothesis is always true,
for different values of `m0`.
To simplify the code, we will directly sample the Z-test statistics from
a standard normal (instead of sampling the individual groups and then performing
the test on them, as shown above). We can also skip the p-value calculation and
compare the absolute values of our Z-test statistics directly to the $(1 - \alpha/2)$th quantile of the standard
normal distribution (using the `qnorm()` function).
### Tasks {-}
#### 1. Set `m0 = 5` and generate `m0` Z-test statistics by sampling from the standard normal distribution. Check which ones are significant using a cutoff `alpha = 0.05` {-}
#### 2. Calculate $V$, the number of significant tests. {-}
#### 3. Check whether `V > 0` (at least one significant) and store the result in a variable. {-}
#### 4. Repeat steps 1-3 1000 times and keep track of the result in step 3 by storing it in a vector. {-}
#### 5. Now compute the FWER as the proportion of times $V$ was greater than 0. {-}
#### 6. Repeat the same procedure for `m0 = 50` and `m0 = 1000`. Interpret the results. {-}
<details><summary>Solution</summary>
To avoid repeated code, the simulations over the different values of `m_0` are combined into one code block.
```{r}
# Repeat 1000 times for m0 = 5, 50 and 1000
set.seed(987654321)
B <- 1000 # number of simulations
m0_vec <- c(5, 50, 1000)
alpha <- 0.05
# Initialize empty vector to store FWER results
fwer <- numeric(length = length(m0_vec))
names(fwer) <- paste("m0 =", m0_vec)
# Loop over m0 values and simulate test statistics
for (i in seq_along(m0_vec)) {
# Simulate B times, each time recording V > 0
at_least_one_signif <- replicate(B, {
null_stats <- rnorm(m0_vec[i], mean = 0, sd = 1) # simulate z-test
significant_tests <- abs(null_stats) > qnorm(1 - alpha / 2)
sum(significant_tests) > 0
})
fwer[i] <- mean(at_least_one_signif) # proportion of V >= 1
}
fwer
```
</details>
# Controlling the FWER
We saw in the previous exercises that when conducting a large number of hypothesis
tests the FWER becomes unacceptably large. So what can we do to lower it?
The most straightforward way would be to simply lower $\alpha$ (so that we get fewer false positives).
Under the null hypothesis (so the p-values are uniformly distributed), we can write the FWER
for $m$ independent tests as
$$
\begin{align}
\text{FWER} = P(\text{at least one rejection}) &= 1 - P(\text{no rejections}) \\
&= 1 - \prod_{i=1}^{m} P(p_i > \alpha) \\
&= 1 - \prod_{i=1}^{m} (1 - \alpha) \\
&= 1 - (1 - \alpha)^{m} \, ,
\end{align}
$$
where $p_i$ is the p-value for the $i$-th test.
Šidák's correction solves the above equation at a given FWER level (e.g., 0.05)
for $\alpha$ to control the FWER using
$$ \alpha = 1 - (1 - \text{FWER})^{1/m} \, .$$
Without assuming independent tests, the Bonferroni correction uses that
$$
\begin{align}
\text{FWER} &= P(\text{at least one rejection}) \\
&= P(\cup_{i=1}^m \{ p_i \le \alpha \}) \\
&\le \sum_{i=1}^{m} P(p_i \le \alpha) \\
&= m \alpha \, ,
\end{align}
$$
where the inequality follows from the union bound, and solves the equation
at a given FWER level (e.g., 0.05) for $\alpha$ to control the FWER with
$$ \alpha = \frac{\text{FWER}}{m} \, .$$
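As a quick side-by-side sketch (using the same `m0` values as in the simulations), the two corrected cut-offs are nearly identical, with Bonferroni always at least as strict as Šidák:

```{r compare-corrections}
fwer_target <- 0.05
m0_vec <- c(5, 50, 1000)
sidak_alpha <- 1 - (1 - fwer_target)^(1 / m0_vec)
bonferroni_alpha <- fwer_target / m0_vec
# Bonferroni alpha <= Sidak alpha for every m0
data.frame(m0 = m0_vec, sidak = sidak_alpha, bonferroni = bonferroni_alpha)
```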
### Tasks {-}
#### 1. Determine the `alpha` for Šidák's correction at `m0 = 5`, `m0 = 50` and `m0 = 1000`. Use these values to repeat the simulation from the previous exercise, and confirm that the FWER is controlled at 0.05. {-}
<details><summary>Solution</summary>
```{r}
# Repeat 1000 times for m0 = 5, 50 and 1000
set.seed(987654321)
B <- 1000 # number of simulations
m0_vec <- c(5, 50, 1000)
alpha_vec <- 1 - (1 - 0.05)^(1 / m0_vec)
alpha_vec
# Initialize empty vector to store FWER results
fwer <- numeric(length = length(m0_vec))
names(fwer) <- paste("m0 =", m0_vec)
# Loop over m0 values and simulate test statistics
for (i in seq_along(m0_vec)) {
# Simulate B times, each time recording V > 0
at_least_one_signif <- replicate(B, {
null_stats <- rnorm(m0_vec[i], mean = 0, sd = 1) # simulate z-test
significant_tests <- abs(null_stats) > qnorm(1 - alpha_vec[i] / 2)
sum(significant_tests) > 0
})
fwer[i] <- mean(at_least_one_signif) # proportion of V >= 1
}
fwer
```
</details>
#### 2. Determine the `alpha` for the Bonferroni correction and repeat the simulation one more time. How do the results compare to Šidák's correction? {-}
<details><summary>Solution</summary>
```{r}
# Repeat 1000 times for m0 = 5, 50 and 1000
set.seed(987654321)
B <- 1000 # number of simulations
m0_vec <- c(5, 50, 1000)
alpha_vec <- 0.05 / m0_vec
alpha_vec
# Initialize empty vector to store FWER results
fwer <- numeric(length = length(m0_vec))
names(fwer) <- paste("m0 =", m0_vec)
# Loop over m0 values and simulate test statistics
for (i in seq_along(m0_vec)) {
# Simulate B times, each time recording V > 0
at_least_one_signif <- replicate(B, {
null_stats <- rnorm(m0_vec[i], mean = 0, sd = 1) # simulate z-test
significant_tests <- abs(null_stats) > qnorm(1 - alpha_vec[i] / 2)
sum(significant_tests) > 0
})
fwer[i] <- mean(at_least_one_signif) # proportion of V >= 1
}
fwer
```
</details>
In general, controlling the FWER at levels such as 0.05 with Šidák's correction
or the Bonferroni correction is very conservative, i.e., if there were true alternatives
(which we did not consider in the simulations), we might miss them.
# The False Discovery Rate (FDR)
Now let's consider a situation where we have both true null and true alternative
cases. In this situation we are prone to two types of errors: false positives
(type 1, as in the previous case) and false negatives (type 2, i.e. accepting
$H_0$ while it's actually false).
We can then extend the table from [before](#testing-hypotheses):
```{r, echo=FALSE}
knitr::kable(
structure(list(
`Accept H_0` = c("U (True Negative)", "T (False Negative)", "W"),
`Reject H_0` = c("V (False Positive)", "S (True Positive)", "R"),
`Total` = c("m_0", "m_1", "m")
),
.Names = c("Accept H_0", "Reject H_0", "Total"),
row.names = c("H_0 True", "H_1 True", "Total"),
class = "data.frame"
)
)
```
In addition to the FWER, another important quantity in multiple hypothesis
testing is the __False Discovery Rate (FDR)__:
$$
\text{FDR} = \text{E}\left[\frac{V}{R}\right] \, ,
$$
which is the expected proportion of false positives among all positive tests
(with the convention that $V/R = 0$ when $R = 0$).
The ratio $V/R$ is also called the False Discovery Proportion (FDP).
However, the only quantities from the table above that are observable
in real data are $W$, $R$ and $m$. So to illustrate the concept, we will again
make use of simulated data.
In the code below, 1000 tests are simulated: for 90% of them the null hypothesis
is true, while for the other 10% the alternative is true.
We assume the *effect size* (difference between the 2 groups) to be equal to 3
under the alternative, so that we can sample the test statistics from a normal
distribution with mean 3.
```{r}
m <- 1000 # total number of hypotheses
p0 <- 0.9 # 90% of cases are true null
m0 <- round(p0 * m)
m1 <- round((1 - p0) * m)
set.seed(987654321)
alpha <- 0.05
null_stats <- rnorm(m0)
alt_stats <- rnorm(m1, mean = 3)
null_tests <- abs(null_stats) > qnorm(1 - alpha / 2)
alt_tests <- abs(alt_stats) > qnorm(1 - alpha / 2)
(U <- sum(!null_tests)) # true negatives
(V <- sum(null_tests)) # false positives
(S <- sum(alt_tests)) # true positives
(T <- sum(!alt_tests)) # false negatives
R <- V + S # total number of positives
(FDP <- V / R)
```
We see that 35.38% of the positive cases are actually false positives.
To get an idea of the FDR, the expected FDP, we will repeat this procedure a
number of times in the following exercises.
### Tasks {-}
#### 1. Set `m0 = 90` and `m1 = 10` and generate `m = 100` test statistics, `m0` of them from the standard normal distribution $N(0, 1)$ (null distribution) and `m1` of them from $N(3, 1)$ (the alternative distribution). {-}
#### 2. Test for significance at `alpha = 0.05` by comparing the test statistics to the standard normal, for both sets of tests. {-}
#### 3. Calculate the FDP. {-}
#### 4. Repeat steps 1-3 1000 times and keep track of the FDP for each iteration. Then estimate the FDR as the mean of the FDPs. Interpret the results. {-}
<details><summary>Solution</summary>
```{r}
set.seed(987654321)
B <- 1000 # number of simulations
m0 <- 90
m1 <- 10
alpha <- 0.05
# Simulate B times, each time recording the FDP
FDP <- replicate(B, {
null_stats <- rnorm(m0, mean = 0)
alt_stats <- rnorm(m1, mean = 3)
null_tests <- abs(null_stats) > qnorm(1 - alpha / 2)
alt_tests <- abs(alt_stats) > qnorm(1 - alpha / 2)
V <- sum(null_tests) # false positives
S <- sum(alt_tests) # true positives
R <- V + S
# Return V / R
if (R == 0) 0 else V / R
})
(FDR <- mean(FDP))
```
The FDR tells us the expected proportion of positive tests that will be false
positives.
</details>
As with the FWER, we can now think of methods to control the FDR. A very popular
method that is widely used in large-scale inference for high-throughput
sequencing data is the __Benjamini-Hochberg correction__. It *adjusts* the
p-values such that the FDR is controlled at a certain level. Note that this can
equivalently be formulated as adjusting the significance cut-off $\alpha$, similarly
to how we defined the FWER-controlling techniques.
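To make the correction less of a black box, here is a minimal hand-rolled sketch of the Benjamini-Hochberg adjustment (the helper `bh_manual` is our own illustration; it mirrors what `p.adjust(..., method = "BH")` computes):

```{r bh-manual}
# Manual BH adjustment: sort the p-values, scale the i-th smallest by m / i,
# then enforce monotonicity with a cumulative minimum from the largest down
bh_manual <- function(p) {
  m <- length(p)
  o <- order(p)
  adj <- rev(cummin(rev(p[o] * m / seq_len(m))))
  pmin(adj, 1)[order(o)] # cap at 1 and restore the original order
}
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
all.equal(bh_manual(p), p.adjust(p, method = "BH"))
```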
### Tasks {-}
#### 1. Adjust the simulation code from the previous task to estimate the FDR under the Benjamini-Hochberg correction. {-}
Hint: Calculate p-values of the simulated null and alternative test statistics. Concatenate all p-values into a single vector and apply the Benjamini-Hochberg correction using `p.adjust` with `method = "BH"`. Compare the resulting adjusted p-values to $\alpha$ to estimate the FDP and FDR.
<details><summary>Solution</summary>
```{r}
set.seed(987654321)
B <- 1000 # number of simulations
m0 <- 90
m1 <- 10
alpha <- 0.05
# Simulate B times, each time recording the FDP
FDP <- replicate(B, {
null_stats <- rnorm(m0, mean = 0)
alt_stats <- rnorm(m1, mean = 3)
p_null <- (2 * pnorm(-abs(null_stats)))
p_alt <- (2 * pnorm(-abs(alt_stats)))
p_adj <- p.adjust(c(p_null, p_alt), method = "BH")
V <- sum(p_adj[1:m0] <= alpha) # false positives
S <- sum(p_adj[(m0 + 1):(m0 + m1)] <= alpha) # true positives
R <- V + S
# Return V / R
if (R == 0) 0 else V / R
})
(FDR <- mean(FDP))
```
</details>
#### 2. Repeat the simulation using `method = "bonferroni"`. Compare the resulting FDR and interpret. {-}
Note: The Bonferroni correction can either be formulated as adjusting the significance cut-off to $\alpha / m$ or, equivalently, as a p-value correction that multiplies each p-value by $m$.
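The equivalence of the two formulations can be checked on a few illustrative p-values; note that `p.adjust` caps the multiplied p-values at 1:

```{r bonferroni-equivalence}
p <- c(0.001, 0.02, 0.3, 0.8) # illustrative p-values
m <- length(p)
# Multiplying by m (capped at 1) reproduces p.adjust's Bonferroni adjustment
all.equal(pmin(1, m * p), p.adjust(p, method = "bonferroni"))
# Comparing raw p-values to alpha / m yields the same rejections
identical(p <= 0.05 / m, p.adjust(p, method = "bonferroni") <= 0.05)
```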
<details><summary>Solution</summary>
```{r}
set.seed(987654321)
B <- 1000 # number of simulations
m0 <- 90
m1 <- 10
alpha <- 0.05
# Simulate B times, each time recording the FDP
FDP <- replicate(B, {
null_stats <- rnorm(m0, mean = 0)
alt_stats <- rnorm(m1, mean = 3)
p_null <- (2 * pnorm(-abs(null_stats)))
p_alt <- (2 * pnorm(-abs(alt_stats)))
p_adj <- p.adjust(c(p_null, p_alt), method = "bonferroni")
V <- sum(p_adj[1:m0] <= alpha) # false positives
S <- sum(p_adj[(m0 + 1):(m0 + m1)] <= alpha) # true positives
R <- V + S
# Return V / R
if (R == 0) 0 else V / R
})
(FDR <- mean(FDP))
```
__Interpretation__: Both the Bonferroni correction and the Benjamini-Hochberg correction control the FDR at the desired level $\alpha$.
However, the Bonferroni correction is more conservative, resulting in an FDR that is much lower than $\alpha$. This means that the Bonferroni method results in a larger number of false negatives and could possibly lead to us missing interesting discoveries.
</details>
# Multiple testing on the data by Alon *et al.* (1999)
We will take another look at the dataset by @alon1999broad on gene expression
levels in 40 tumour and 22 normal colon tissue samples.
We previously used this data in [Lab 4](./Lab4-Sparse-PCA-LDA.html).
This time, we are interested in finding genes that are expressed *significantly*
different between the tumor and normal samples. In other words, we will perform
hypothesis tests for each of the __2000 genes__. This is clearly a multiple
testing problem.
```{r load-data}
data("Alon1999")
str(Alon1999[, 1:10]) # first 10 columns
table(Alon1999$Y) # contingency table of tissue types
```
As a reminder, the data consists of gene expression levels in 40 tumor and 22
normal colon tissue samples. The expression of 6500 human genes was measured
using an Affymetrix oligonucleotide array.
As in @alon1999broad, we use the *2000 genes with the highest minimal intensity*
across the samples.
The dataset contains one variable named `Y` with the values `t` and `n`.
This variable indicates whether the sample came from tumourous (`t`) or
normal (`n`) tissue.
### Tasks {-}
#### 1. Perform two-sample t-tests for each gene, comparing the tumor and normal tissue groups. Store the p-values and test statistics. {-}
Hint: The `t.test()` function computes a two-sample t-test with *unequal variances* ([Welch's t-test][Welch]) by default. The return object contains the elements `p.value` and `statistic`.
<details><summary>Solution</summary>
```{r}
# Split gene data and grouping variable and convert to matrix
gene_data <- as.matrix(Alon1999[, -1])
group <- Alon1999$Y
# Use `apply` to perform t-tests over all columns
t_test_results <- t(apply(gene_data, 2, function(x) {
t_test <- t.test(x ~ group)
p_val <- t_test$p.value
stat <- t_test$statistic
c(stat, "p_val" = p_val)
}))
head(t_test_results)
```
</details>
#### 2. Plot a histogram of the p-values and interpret. {-}
<details><summary>Solution</summary>
Setting `breaks` from 0 to 1 with a width of 0.05 makes sense for p-values because
the first bar will then represent the number of p-values smaller than 0.05 (which
we often use as cutoff).
```{r p_val-hist}
p_vals <- t_test_results[, "p_val"]
hist(
p_vals,
breaks = seq(0, 1, by = 0.05), main = "", xlab = "p-value",
ylim = c(0, 500)
)
```
__Interpretation__: The histogram appears close to a uniform distribution for the larger p-values, but has a lot more small p-values than would be expected under a uniform distribution.
Recall that if the null hypothesis (the gene expression is the same for tumor and normal tissue) were to hold for all genes, the p-values would be uniformly distributed on $[0, 1]$.
</details>
#### 3. How many discoveries / significant genes do you find when using $\alpha = 0.05$? {-}
<details><summary>Solution</summary>
```{r}
alpha <- 0.05
sum(p_vals <= alpha)
```
</details>
#### 4. Control the FWER at 5% using Šidák's correction. How many genes are significant after the correction? {-}
<details><summary>Solution</summary>
```{r}
m <- nrow(t_test_results)
fwer <- 0.05
adj_alpha <- 1 - (1 - fwer)^(1 / m)
# Number of significant discoveries
sum(p_vals <= adj_alpha)
```
By rearranging the equation $$ \alpha = 1 - (1 - \text{FWER})^{1/m} \, ,$$
we can equivalently adjust the p-values via
$$1 - (1 - p)^m$$
and compare them to the cut-off 5\%:
```{r}
sidak_p_vals <- 1 - (1 - p_vals)^m
# Number of significant discoveries
sum(sidak_p_vals <= fwer)
```
</details>
#### 5. Control the FWER at 5% using the Bonferroni correction. How many genes are significant after the correction? {-}
<details><summary>Solution</summary>
By adjusting $\alpha$:
```{r}
adj_alpha <- fwer / m
# Number of significant discoveries
sum(p_vals <= adj_alpha)
```
Or equivalently, by adjusting the p-values:
```{r}
bonferroni_p_vals <- p.adjust(p_vals, method = "bonferroni")
# Number of significant discoveries
sum(bonferroni_p_vals <= fwer)
```
</details>
#### 6. Control the FDR at 5% using the Benjamini-Hochberg method. How many genes are significant after the correction? {-}
<details><summary>Solution</summary>
```{r}
bh_p_vals <- p.adjust(p_vals, method = "BH")
sum(bh_p_vals < 0.05)
```
</details>
#### 7. Plot the adjusted p-values from the Benjamini-Hochberg method against the original p-values. What is the effect of the adjustment? {-}
Bonus: Compute the adjusted p-values based on Šidák's correction and the Bonferroni correction. Add them to the plot and compare.
<details><summary>Solution</summary>
Visualize the adjusted p-values for all three methods in the same figure:
```{r fdr-adjust}
cols <- c("#9d05f4", "#ecd033", "#01d9d9")
plot(
p_vals[order(p_vals)], bh_p_vals[order(p_vals)],
pch = 19, cex = 0.6, xlab = "p-value",
ylab = "Adjusted p-value", col = cols[1]
)
points(
p_vals[order(p_vals)], bonferroni_p_vals[order(p_vals)],
pch = 19, cex = 0.6, col = cols[2]
)
points(
p_vals[order(p_vals)], sidak_p_vals[order(p_vals)],
pch = 19, cex = 0.6, col = cols[3]
)
abline(a = 0, b = 1)
legend("bottomright",
c("Benjamini-Hochberg", "Bonferroni", "Sidak"),
col = cols, pch = 19
)
```
__Interpretation__: All three methods adjust the p-values upwards to account for multiple testing. The adjustment is more pronounced for small p-values, while for higher p-values, the curves of the adjusted p-values flatten out.
The Benjamini-Hochberg method at an FDR of 5% is less conservative than the other two methods. Both Šidák's correction and the Bonferroni correction at an FWER of 5% adjust the majority of p-values to exactly 1; only the small proportion of p-values that are originally close to zero remain smaller than 1.
</details>
# Further reading {-}
- Chapter 6 of the [Biomedical Data Science](http://genomicsclass.github.io/book/) book
[Welch]:https://en.wikipedia.org/wiki/Welch%27s_t-test
```{r, child="_session-info.Rmd"}
```
# References {-}