[QST] MatX is around x15 slower than CuPy for the same task #688

@HugoPhibbs

Gidday.

I'm a bit of a novice with MatX and C++, and I'm looking for some help optimising my MatX code.

I'm trying to port code that I first wrote in CuPy into lightning-fast MatX code. However, my MatX implementation is a lot slower, even though it looks to me like a direct equivalent of the CuPy code. Could anybody give me some tips on where my code might be slowing down?

FYI, my working assumption is that MatX's operators are very lightweight, so the reshapes and repmats should all be quick.

My MatX code looks like:

matx::tensor_t<matx::matxFp16, 2>  GsDBSCAN::findDistancesMatX(matx::tensor_t<matx::matxFp16, 2> &X_t, matx::tensor_t<int, 2> &A_t, matx::tensor_t<int, 2> &B_t, float alpha, int batchSize) {
    const int k = A_t.Shape()[1] / 2;
    const int m = B_t.Shape()[1];

    const int n = X_t.Shape()[0];
    const int d = X_t.Shape()[1];
    int D = B_t.Shape()[0] / 2;

    batchSize = (batchSize != -1) ? batchSize: GsDBSCAN::findDistanceBatchSize(alpha, n, d, k, m);

    auto AFlat_t = matx::flatten(A_t);

    auto distances_t = matx::make_tensor<matx::matxFp16>({n, 2*k*m});

    for (int i = 0; i < n; i += batchSize) {
        int maxBatchIdx = i + batchSize - 1; // Index within X along the ROWS

        auto XSubset_t_op = matx::slice(X_t, {i, 0}, {maxBatchIdx + 1, matx::matxEnd});

        auto ABatchFlat_t_op = matx::slice(AFlat_t, {i * 2 * k}, {(maxBatchIdx + 1) * 2 * k});

        auto BBatch_t_op = matx::remap<0>(B_t, ABatchFlat_t_op);

        auto XBatch_t_op = matx::remap<0>(X_t, matx::flatten(BBatch_t_op));

        auto XBatchReshaped_t_op = matx::reshape(XBatch_t_op, {batchSize, 2*k*m, d});

        auto XSubsetReshaped_t_op = matx::reshape(XSubset_t_op, {batchSize, 1, d});

        auto YBatch_t_op = (XBatchReshaped_t_op - matx::repmat(XSubsetReshaped_t_op, {1, 2*k*m, 1})); // repmat is a workaround for subtracting tensors whose shapes are not directly compatible

        auto YBatch_t_norm_op = matx::vector_norm(YBatch_t_op, {2}, matx::NormOrder::L2);

        (matx::slice(distances_t, {i, 0}, {maxBatchIdx + 1, matx::matxEnd}) = YBatch_t_norm_op).run();
    }

    return distances_t;
}

And the equivalent CuPy code looks like:

def find_distances(X, A, B, alpha=1.2, batch_size = -1):
    k = A.shape[1] // 2
    m = B.shape[1]

    n = X.shape[0]
    d = X.shape[1]
    D = B.shape[0] // 2

    batch_size = batch_size if batch_size != -1 else get_batch_size(n, d, k, m, alpha=alpha)

    distances = cp.empty(shape=(n, 2 * k * m),
                         dtype=cp.float16)  # float32 causes a memory overload. float16 is fine (for eps 2DP)

    for i in range(0, n, batch_size):
        max_batch_idx = min(i + batch_size, X.shape[0])

        Z_batch = X[B[A[i:max_batch_idx]]]
      
        # (Edit): Changed the reshape call to be a little clearer. Z_batch_adj is equivalent to XBatchReshaped_t_op above.
        Z_batch_adj = Z_batch.reshape(batch_size, 2 * k * m,  d)

        Y_batch = Z_batch_adj - X[i:max_batch_idx, cp.newaxis, :]

        distances[i:max_batch_idx] = cp.linalg.norm(Y_batch, axis=2)

    return distances

The parameters used for both are:

n = 70_000
k = 5
m = 50
d = 784
D = 1024
batchSize ~= 250 (FYI, it should always be a divisor of n; I found that the CuPy implementation was a lot slower on the final iteration otherwise).

As for results, the MatX code takes around 14.5 seconds to complete, whereas the CuPy code takes 0.9 seconds (including CUDA synchronisations).

As a baseline, a multithreaded (64 threads) CPU implementation of the above code (using loops with no tensors involved) takes less than 0.7 seconds. A single-threaded CPU implementation takes around 7 seconds (on the same machine, of course).
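
For a sense of scale, the following sketch works out the sizes these parameters imply. It assumes 2 bytes per element (float16, as in the distances array and the matxFp16 tensors above) and a batch size of exactly 250. Whether the gathered batch is actually materialised in memory depends on the library and on how the expression is evaluated, so the per-batch figure is an upper bound on that intermediate rather than a measurement.

```python
n, k, m, d = 70_000, 5, 50, 784
batch_size = 250
bytes_per_fp16 = 2  # assumption: every tensor involved is float16

cols = 2 * k * m                          # candidate neighbours per point
num_batches = n // batch_size             # loop iterations
gathered_elems = batch_size * cols * d    # elements in one (batch, 2k*m, d) block
gathered_bytes = gathered_elems * bytes_per_fp16
distances_bytes = n * cols * bytes_per_fp16

print(cols)             # 500
print(num_batches)      # 280
print(gathered_bytes)   # 196000000
print(distances_bytes)  # 70000000
```

So each of the 280 iterations works on a block of up to 98 million elements (about 196 MB in float16), and the full output is about 70 MB.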

Sorry if the variable names are a little cryptic.

I've tested with n of around 1000 and found that the two implementations produce the same results (up to small floating-point differences).
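
One way to make "the same results" concrete for float16 output is to compare with a relative tolerance tied to float16 precision. The sketch below is an illustration only: the arrays are synthetic stand-ins, not output from either implementation, and the factor of 4 on the tolerance is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for two implementations' outputs (not real results).
exact = rng.uniform(1.0, 50.0, size=(1000, 8))
a = exact.astype(np.float16)
b = (exact * (1 + 2e-4)).astype(np.float16)  # slightly perturbed before rounding

eps16 = np.finfo(np.float16).eps  # 2**-10, roughly 9.77e-4

# Compare in float32 so the comparison itself does not add float16 rounding.
same = np.allclose(a.astype(np.float32), b.astype(np.float32),
                   rtol=4 * eps16, atol=0.0)
print(same)
```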

Thanks in advance.
