[QST] MatX is around x15 slower than CuPy for the same task

Gidday. 

I'm a bit of a novice with MatX and C++, and I'm looking for some help with optimising my MatX code.

I'm trying to port code that I first wrote in CuPy to (what I hoped would be) lightning-fast MatX code. However, my MatX implementation is a lot slower, even though it looks to me like a direct equivalent of the CuPy code. I was wondering if anybody could give me some tips as to where my code might be slowing down.

FYI, I'm working on the assumption that MatX's operators are super lightweight, so the reshapes and repmats should all be very quick.

My MatX code looks like:

```cpp
matx::tensor_t<matx::matxFp16, 2>  GsDBSCAN::findDistancesMatX(matx::tensor_t<matx::matxFp16, 2> &X_t, matx::tensor_t<int, 2> &A_t, matx::tensor_t<int, 2> &B_t, float alpha, int batchSize) {
    const int k = A_t.Shape()[1] / 2;
    const int m = B_t.Shape()[1];

    const int n = X_t.Shape()[0];
    const int d = X_t.Shape()[1];
    int D = B_t.Shape()[0] / 2;

    batchSize = (batchSize != -1) ? batchSize: GsDBSCAN::findDistanceBatchSize(alpha, n, d, k, m);

    auto AFlat_t = matx::flatten(A_t);

    auto distances_t = matx::make_tensor<matx::matxFp16>({n, 2*k*m});

    for (int i = 0; i < n; i += batchSize) {
        int maxBatchIdx = i + batchSize - 1; // Index within X along the ROWS

        auto XSubset_t_op = matx::slice(X_t, {i, 0}, {maxBatchIdx + 1, matx::matxEnd});

        auto ABatchFlat_t_op = matx::slice(AFlat_t, {i * 2 * k}, {(maxBatchIdx + 1) * 2 * k});

        auto BBatch_t_op = matx::remap<0>(B_t, ABatchFlat_t_op);

        auto XBatch_t_op = matx::remap<0>(X_t, matx::flatten(BBatch_t_op));

        auto XBatchReshaped_t_op = matx::reshape(XBatch_t_op, {batchSize, 2*k*m, d});

        auto XSubsetReshaped_t_op = matx::reshape(XSubset_t_op, {batchSize, 1, d});

        auto YBatch_t_op = (XBatchReshaped_t_op - matx::repmat(XSubsetReshaped_t_op, {1, 2*k*m, 1})); // repmat is a workaround for subtracting tensors whose shapes are not directly compatible

        auto YBatch_t_norm_op = matx::vector_norm(YBatch_t_op, {2}, matx::NormOrder::L2);

        (matx::slice(distances_t, {i, 0}, {maxBatchIdx + 1, matx::matxEnd}) = YBatch_t_norm_op).run();
    }

    return distances_t;
}
```

And the same CuPy code looks like:

```python
def find_distances(X, A, B, alpha=1.2, batch_size = -1):
    k = A.shape[1] // 2
    m = B.shape[1]

    n = X.shape[0]
    d = X.shape[1]
    D = B.shape[0] // 2

    batch_size = batch_size if batch_size != -1 else get_batch_size(n, d, k, m, alpha=alpha)

    distances = cp.empty(shape=(n, 2 * k * m),
                         dtype=cp.float16)  # float32 causes a memory overload. float16 is fine (for eps 2DP)

    for i in range(0, n, batch_size):
        max_batch_idx = min(i + batch_size, X.shape[0])

        Z_batch = X[B[A[i:max_batch_idx]]]
      
        # (Edit): Changed the reshape call to be a little clearer. Z_batch_adj is equivalent to XBatchReshaped_t_op above.
        Z_batch_adj = Z_batch.reshape(batch_size, 2 * k * m,  d)

        Y_batch = Z_batch_adj - X[i:max_batch_idx, cp.newaxis, :]

        distances[i:max_batch_idx] = cp.linalg.norm(Y_batch, axis=2)

    return distances
```

The parameters used for both are:

```
n = 70_000
k = 5
m = 50
d = 784
D = 1024
batchSize ~= 250 (FYI it should always be a divisor of n; I found that the CuPy implementation was a lot slower otherwise on the final iteration).
```
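For a sense of scale, here is the size arithmetic these parameters imply. This is only a rough sketch: it assumes `X` is stored as float16 (2 bytes per element, as in the MatX signature) and that the gathered batch is fully materialised in memory, which holds for CuPy's fancy indexing but may not hold for MatX's lazy operators.

```python
# Rough size arithmetic for one batch, using the parameters listed above.
n, k, m, d = 70_000, 5, 50, 784
batch_size = 250
bytes_per_fp16 = 2  # assumption: X and the intermediates are float16

neighbours_per_point = 2 * k * m                     # columns of `distances`
gathered_rows = batch_size * neighbours_per_point    # rows of X gathered per batch
gathered_bytes = gathered_rows * d * bytes_per_fp16  # size of Z_batch if materialised
distances_bytes = n * neighbours_per_point * bytes_per_fp16
num_batches = n // batch_size

print(neighbours_per_point, gathered_rows, gathered_bytes, distances_bytes, num_batches)
# prints: 500 125000 196000000 70000000 280
```

So each of the 280 iterations gathers 125,000 rows of `X` (about 196 MB in float16), and the output array is about 70 MB.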

Regarding results, the **MatX code takes around 14.5 seconds to complete, but CuPy takes around 0.9 seconds** (including CUDA synchronisations).
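For anyone wanting to reproduce timings like these, a minimal harness could look like the sketch below. It is not the exact code behind the numbers above; `time_call` and its `sync` parameter are hypothetical names introduced here. The idea is simply to synchronise before starting the clock and again before stopping it, so that asynchronous GPU work is counted.

```python
import time

def time_call(fn, sync=lambda: None, repeats=3):
    """Return the best wall-clock time of fn() in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        sync()                      # drain any previously queued work
        start = time.perf_counter()
        fn()
        sync()                      # wait for the work launched by fn to finish
        best = min(best, time.perf_counter() - start)
    return best

# CPU stand-in so the sketch runs anywhere. For CuPy, one could pass
# sync=cp.cuda.Stream.null.synchronize (assumes the default stream is used).
elapsed = time_call(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```

Taking the best of a few repeats also keeps one-off costs such as kernel compilation or memory-pool warm-up out of the reported figure.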

As a baseline, a multithreaded (64 threads) CPU implementation of the above code (using plain loops with no tensors involved) takes less than 0.7 seconds, and a single-threaded CPU implementation takes around 7 seconds. Both were measured on the same machine.

Sorry if the variable names are a little cryptic.

I've tested with around `n = 1000` and found that the two implementations produce the same results (apart from small floating-point differences).
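To make the intended computation explicit, here is a loop-based reference in plain Python that either GPU result could be compared against for small inputs. It is a sketch under my reading of the code above: `A` indexes rows of `B`, `B` indexes rows of `X`, and the output columns are ordered with the `A` index varying slowest. The names `find_distances_reference` and `max_abs_diff` are hypothetical helpers, and the tiny input is made up for illustration.

```python
import math

def find_distances_reference(X, A, B):
    """Row i holds ||X[B[A[i][j]][l]] - X[i]|| for every j (outer) and l (inner)."""
    distances = []
    for i, a_row in enumerate(A):
        row = []
        for b_idx in a_row:            # 2k indices into the rows of B
            for x_idx in B[b_idx]:     # m indices into the rows of X
                row.append(math.dist(X[x_idx], X[i]))
        distances.append(row)
    return distances

def max_abs_diff(P, Q):
    """Largest element-wise absolute difference between two nested lists."""
    return max(abs(p - q) for p_row, q_row in zip(P, Q) for p, q in zip(p_row, q_row))

# Tiny hand-made input (n=3, d=2, 2k=2, m=2); values are illustrative only.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
B = [[0, 1], [1, 2]]
A = [[0, 1], [1, 0], [0, 0]]
ref = find_distances_reference(X, A, B)
print(ref[0])
```

A GPU result can be converted with `.tolist()` (CuPy) and passed to `max_abs_diff` alongside `ref`. Since float16 carries only about three significant decimal digits, the tolerance used for that comparison needs to be fairly loose.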

Thanks in advance.
