SiegelTukeyTest bug with ties

I believe there is an error in how `SiegelTukeyTest` handles ties, which is attributable to the manner in which _ranks_ from `SiegelTukeyRank` are supplied as _data_ to `wilcox.test` internally. This causes the ranks to be "re-ranked" by _wilcox.test_, thereby producing inaccurate test statistics and p-values in the presence of ties. When there are no ties, the ranks and the "re-ranked ranks" are identical, and there is no issue. When there are ties, `SiegelTukeyRank` has already allocated average ranks, which gets distorted by the internal re-ranking when passed to `wilcox.test`.

In the presence of ties, `wilcox.test` enforces `exact=FALSE` and applies a normal approximation (typically with a continuity correction and an adjustment to rescale the variance to account for ties). Thus, in the below example, `exact=FALSE` is enforced in order to demonstrate that the mismatch is in how ties are handled. The code presents two examples: one with ties; one without. This is achieved by toggling whether the medians should be aligned, but this is purely incidental: it just so happens that there are no ties without median adjustment and there are ties after median adjustment.

When running this code with `align=FALSE` (i.e., without ties), the final line returns `TRUE`. When running with `align=TRUE` (i.e., with ties), the final line returns `FALSE`.

```
X <- c(23,18,17,25,22,19,31,26,29,33)
Y <- c(21,28,32,30,41,24,35,34,27,39,36)
m <- length(X)
n <- length(Y) - 1 # as the dropped median will later come from the Y group
N <- m + n

# toggle manual median alignment
align <- FALSE
if(align) {
  X <- X - (median(X) - median(Y))
}

# allocate ranks
ranks <- SiegelTukeyRank(c(X, Y), g=rep(1:2, c(length(X), length(Y))), drop.median=TRUE)

# manually calculate Wilcoxon-Mann-Whitney test statistics
WX <- sum(ranks[ranks[,"sort.id"] == 1,"unique.ranks"])
UX <- WX - m * (m + 1)/2
UY <- m * n - UX
U <- min(UX, UY)

# normal approximation with continuity correction and tie-adjustment
E <- m * n/2
V <- m * n * (m + n + 1)/12
TIES <- table(ranks[,"unique.ranks"])
Z <- (UX - E - sign(UX - E)/2)/sqrt(V * (1 - sum(TIES^3 - TIES)/(N * (N - 1) * (N + 1))))

# p-value comparison
p1 <- 2 * pnorm(Z, lower.tail=FALSE)
p2 <- SiegelTukeyTest(X, Y, adjust.median=align, exact=FALSE)$p.value
all.equal(p1, p2)
```

I suggest that this could be fixed by modifying `SiegelTukeyRank`. Presently, it returns 
```
unique.ranks <- tapply(rank, sort.x, mean)
```
If it also returned the unadjusted ranks (simply `rank`) and these lines of `SiegelTukeyTest`
```
strank <- SiegelTukeyRank(xx, g = id)
ranks0 <- strank$unique.ranks[strank$sort.id == 0]
ranks1 <- strank$unique.ranks[strank$sort.id == 1]
```
were then replaced by 
```
strank <- SiegelTukeyRank(xx, g = id)
ranks0 <- strank$ranks[strank$sort.id == 0]
ranks1 <- strank$ranks[strank$sort.id == 1]
```
the problem would disappear.


TL:DR - the ST test should calculate WMW test statistics on ST ranks, but the code supplies ST ranks to the WMW test; this is fine, unless there are ties in the ST ranks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SiegelTukeyTest bug with ties #179

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

SiegelTukeyTest bug with ties #179

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions