Add a skipmissing argument to corkendall

With help from @nalimilan, back in February I contributed a re-write of `corkendall` that yielded quite useful speedups. (see [this commit](https://github.com/JuliaStats/StatsBase.jl/commit/11ac5b596405367b3217d3d962e22523fef9bb0d). But I need a version of `corkendall` that supports missing elements, which `StatsBase.corkendall` doesn't support at present. Hence my (public, but not registered) package [KendallTau](https://github.com/PGS62/KendallTau.jl).

I would like to propose that `corkendall` in StatsBase be replaced by code copied from KendallTau, but I know that how best to handle missing values in StatsBase has been a topic of various discussions, including a [more general approach](https://github.com/JuliaLang/Statistics.jl/pull/34) than my proposal here so I'd like to know if a pull request on these lines might be accepted.

This is the approach I have taken:

1) Defined types:
```
const RealOrMissingVector{T <: Real} = AbstractArray{<:Union{T, Missing},1}
const RealOrMissingMatrix{T <: Real} = AbstractArray{<:Union{T, Missing},2}
```

2) written an auxiliary function `skipmissingpairs` that takes a pair of arguments each of one of the above two types and does the needful to implement both a "pairwise" and "complete" handling of missings, i.e. a return a pair of vectors\matrices with the missings (or rows containing missings) removed and also type narrowed to `RealVector` or `RealMatrix`.

3) amended the methods of `corkendall` to have signatures such as:

```
function corkendall(X::Union{RealMatrix,RealOrMissingMatrix},
                    y::Union{RealVector,RealOrMissingVector};
                    skipmissing::Symbol=:undefined)
```

The proposed change would be non-breaking on existing code, which must be passing arguments of type `RealVector` and\or `RealMatrix`. Also when one or more of the arguments is of type `RealOrMissing...` then `skipmissing` defaults to `:undefined` thus forcing the user to decide whether `:pairwise` or `:complete` is the better approach for the data they have.

Other changes I have made:

4) Slightly improved consistency with `corspearman`, for example errors when dimensions of arguments don't match now align between the two functions. Obviously this proposal will take `corkendall` "ahead" of `corspearman`, but I think it would be straightforward to make equivalent changes to `corspearman`, hopefully without code duplication.

5) Some code refactoring that gives an approx 10% further speedup to corkendall, and approx 40% reduction in memory allocation. The trick was to reduce the number of calls to `permute!` in all but the vector-vector case.

6) Amended the tests to test behaviour against missing values. I also have a test suite testing corkendall against `corkendal_naive` that, for simplicity, does not use Knight's algorithm. That gives confidence that results from `corkendall` are correct for a wide variety of argument types and values but I don't propose to port that test suite to StatsBase - it's too complicated.

Could a member of StatsBase let me know their thoughts?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a skipmissing argument to corkendall #683

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add a skipmissing argument to corkendall #683

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions