
[ops] Add mx.empty() — uninitialized array allocation #3549

@megacpp

Description


Summary

MLX is missing a primitive for allocating an uninitialized array, equivalent to numpy.empty / torch.empty / jnp.empty. This is useful when a buffer will be fully overwritten by a subsequent kernel — the implicit zero-fill of mx.zeros is wasted work in that case.

Adding mx.empty(shape, dtype=..., stream=...) would close that gap.

Motivation

The concrete use case we hit: when a TileLang Metal kernel produces an output tensor, the host-side allocation only needs the right shape, dtype, and storage — the kernel fully overwrites the contents. With only mx.zeros available today, we pay for a zero memset that the kernel then immediately overwrites byte for byte. For larger output tensors (e.g. attention outputs in a transformer block), that wasted zero-fill measurably hurts throughput.

The same pattern shows up any time an MLX array is used as a write-only output buffer of an external kernel (a custom Metal op, a DLPack-imported tensor about to be filled in place, etc.).

PyTorch / NumPy / JAX all expose this primitive (torch.empty, numpy.empty, jnp.empty) for the same reason.
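For reference, the NumPy analogue guarantees only shape and dtype, not contents — a quick illustration of empty vs. zeros (using NumPy here, since mx.empty does not exist yet):

```python
import numpy as np

# np.zeros pays for an explicit fill; np.empty skips it.
# Both guarantee shape and dtype; only zeros guarantees contents.
z = np.zeros((4, 4), dtype=np.float32)
e = np.empty((4, 4), dtype=np.float32)

assert z.shape == e.shape == (4, 4)
assert z.dtype == e.dtype == np.float32
assert (z == 0).all()  # guaranteed by zeros

# e's contents are whatever the allocator handed back -- read them
# only after the consumer (here, a stand-in "kernel") writes them.
e[...] = 1.0
assert (e == 1).all()
```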

Proposed API

mx.empty(shape, dtype=mx.float32, stream=None)

Semantics:

  • Allocates an array of the given shape and dtype on the active device.
  • Does not initialize the contents — the caller is expected to write into it before reading.
  • Reuses MLX's existing allocator and dtype rules, including the existing GPU float64 restriction.
  • Rejects negative dimensions with the standard MLX shape-validation error.
  • Optional stream= argument to match the rest of the MLX ops surface.

This is intentionally a thin wrapper around the existing allocation path — no new buffer-management complexity, just skipping the fill.
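To make the intended semantics concrete, here is a minimal Python sketch of the proposed behavior, using NumPy as a stand-in allocator (mx.empty itself is hypothetical until the PR lands, and the GPU float64 check is elided since NumPy has no device concept):

```python
import numpy as np

def empty_sketch(shape, dtype=np.float32):
    """Stand-in for the proposed mx.empty: allocate without initializing.

    Mirrors the semantics above: correct shape and dtype, uninitialized
    contents, negative dimensions rejected with a shape-validation error.
    """
    if any(d < 0 for d in shape):
        raise ValueError(f"negative dimensions are not allowed: {shape}")
    return np.empty(shape, dtype=dtype)

out = empty_sketch((2, 3))
assert out.shape == (2, 3) and out.dtype == np.float32

try:
    empty_sketch((2, -1))
except ValueError:
    pass  # negative dims rejected, as proposed
else:
    raise AssertionError("expected negative-shape rejection")
```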

Prototype

We have a working implementation in our downstream fork:

  • Commit: DatasunriseOU@4acd37a ("Add uninitialized array allocation")
  • Diff: 60 LOC across 4 files: mlx/ops.cpp, mlx/ops.h, python/src/ops.cpp, python/tests/test_ops.py.

The prototype exposes the API exactly as proposed above. Tests cover default dtype, explicit dtype, negative-shape rejection, and the GPU float64 rejection path.

What we're offering

If maintainers are interested, we can rebase the prototype on current ml-explore/mlx@main and open a PR. The patch is small and independent of the DLPack work in #3531 — no shared surface, no ordering requirement.

If the team would prefer a slightly different shape (e.g. dtype as the first positional argument, or a different stream= default), happy to adjust before opening the PR.

Notes

  • One open design question: in debug builds, should mx.empty fill with NaN or another sentinel value to surface uninitialized-read bugs in user code? Neither PyTorch nor NumPy does this, and our prototype follows the same convention (raw allocation, no debug-fill). Flagging it here in case MLX has a different preference.
  • This issue is intentionally narrow per the maintainer guidance on RFC: DLPack consumer support for MLX arrays #3548 — DLPack consumer work is being handled in Add Metal DLPack zero-copy sharing #3531, and this is a small orthogonal piece that came out of the same PoC.
