The algorithm D in
Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984, page 716-17
is the main result of that article. In src/sampling.jl there are implementations of the A, C, D algorithms. Is there any special reason that B is not implemented?
The main interface function sample! uses algorithms A, C, but not D. Why? In the article on page 704 in table I, we see that D is by far the most performant.
Using sample!(a,x) where n=length(a) is much larger than k=length(x) gives awful performance. For example, when generating a random sparse nxn matrix, where n=10^7, with exactly k=10^8 nonzero entries, we sample k sorted unique elements from 1:n² = 1:10¹⁴ . Wouldn't it make more sense in
to just use
seqsample_d!?
The algorithm D in
Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984, page 716-17
is the main result of that article. In
src/sampling.jlthere are implementations of the A, C, D algorithms. Is there any special reason that B is not implemented?The main interface function
sample!uses algorithms A, C, but not D. Why? In the article on page 704 in table I, we see that D is by far the most performant.Using
sample!(a,x)wheren=length(a)is much larger thank=length(x)gives awful performance. For example, when generating a random sparse nxn matrix, wheren=10^7, with exactlyk=10^8nonzero entries, we sampleksorted unique elements from1:n² = 1:10¹⁴. Wouldn't it make more sense inStatsBase.jl/src/sampling.jl
Line 506 in 3e5382f
to just use
seqsample_d!?