With help from @nalimilan, back in February I contributed a re-write of corkendall that yielded quite useful speedups. (see this commit. But I need a version of corkendall that supports missing elements, which StatsBase.corkendall doesn't support at present. Hence my (public, but not registered) package KendallTau.
I would like to propose that corkendall in StatsBase be replaced by code copied from KendallTau, but I know that how best to handle missing values in StatsBase has been a topic of various discussions, including a more general approach than my proposal here so I'd like to know if a pull request on these lines might be accepted.
This is the approach I have taken:
- Defined types:
const RealOrMissingVector{T <: Real} = AbstractArray{<:Union{T, Missing},1}
const RealOrMissingMatrix{T <: Real} = AbstractArray{<:Union{T, Missing},2}
-
written an auxiliary function skipmissingpairs that takes a pair of arguments each of one of the above two types and does the needful to implement both a "pairwise" and "complete" handling of missings, i.e. a return a pair of vectors\matrices with the missings (or rows containing missings) removed and also type narrowed to RealVector or RealMatrix.
-
amended the methods of corkendall to have signatures such as:
function corkendall(X::Union{RealMatrix,RealOrMissingMatrix},
y::Union{RealVector,RealOrMissingVector};
skipmissing::Symbol=:undefined)
The proposed change would be non-breaking on existing code, which must be passing arguments of type RealVector and\or RealMatrix. Also when one or more of the arguments is of type RealOrMissing... then skipmissing defaults to :undefined thus forcing the user to decide whether :pairwise or :complete is the better approach for the data they have.
Other changes I have made:
-
Slightly improved consistency with corspearman, for example errors when dimensions of arguments don't match now align between the two functions. Obviously this proposal will take corkendall "ahead" of corspearman, but I think it would be straightforward to make equivalent changes to corspearman, hopefully without code duplication.
-
Some code refactoring that gives an approx 10% further speedup to corkendall, and approx 40% reduction in memory allocation. The trick was to reduce the number of calls to permute! in all but the vector-vector case.
-
Amended the tests to test behaviour against missing values. I also have a test suite testing corkendall against corkendal_naive that, for simplicity, does not use Knight's algorithm. That gives confidence that results from corkendall are correct for a wide variety of argument types and values but I don't propose to port that test suite to StatsBase - it's too complicated.
Could a member of StatsBase let me know their thoughts?
With help from @nalimilan, back in February I contributed a re-write of
corkendallthat yielded quite useful speedups. (see this commit. But I need a version ofcorkendallthat supports missing elements, whichStatsBase.corkendalldoesn't support at present. Hence my (public, but not registered) package KendallTau.I would like to propose that
corkendallin StatsBase be replaced by code copied from KendallTau, but I know that how best to handle missing values in StatsBase has been a topic of various discussions, including a more general approach than my proposal here so I'd like to know if a pull request on these lines might be accepted.This is the approach I have taken:
written an auxiliary function
skipmissingpairsthat takes a pair of arguments each of one of the above two types and does the needful to implement both a "pairwise" and "complete" handling of missings, i.e. a return a pair of vectors\matrices with the missings (or rows containing missings) removed and also type narrowed toRealVectororRealMatrix.amended the methods of
corkendallto have signatures such as:The proposed change would be non-breaking on existing code, which must be passing arguments of type
RealVectorand\orRealMatrix. Also when one or more of the arguments is of typeRealOrMissing...thenskipmissingdefaults to:undefinedthus forcing the user to decide whether:pairwiseor:completeis the better approach for the data they have.Other changes I have made:
Slightly improved consistency with
corspearman, for example errors when dimensions of arguments don't match now align between the two functions. Obviously this proposal will takecorkendall"ahead" ofcorspearman, but I think it would be straightforward to make equivalent changes tocorspearman, hopefully without code duplication.Some code refactoring that gives an approx 10% further speedup to corkendall, and approx 40% reduction in memory allocation. The trick was to reduce the number of calls to
permute!in all but the vector-vector case.Amended the tests to test behaviour against missing values. I also have a test suite testing corkendall against
corkendal_naivethat, for simplicity, does not use Knight's algorithm. That gives confidence that results fromcorkendallare correct for a wide variety of argument types and values but I don't propose to port that test suite to StatsBase - it's too complicated.Could a member of StatsBase let me know their thoughts?