UCP/MM: Add MAX_HCA_PER_GPU policy for GPU dmabuf registrations #11422

Open
tomerg-nvidia wants to merge 4 commits into openucx:master from tomerg-nvidia:limit_md_registrations

Conversation

Contributor

@tomerg-nvidia tomerg-nvidia commented May 5, 2026

What?

Add UCX_MAX_HCA_PER_GPU to limit how many HCAs UCP registers GPU memory on through dmabuf-capable MDs.

Supported values:

  • inf: register on all reachable HCAs (default)
  • 0: register on the closest reachable HCA set
  • <N>: register on up to the N closest reachable HCAs
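
The three supported values could be parsed along these lines. This is a minimal sketch, not the code in this PR: `parse_max_hca_per_gpu` and the `HCA_LIMIT_INF` sentinel are hypothetical names, and the real option presumably goes through the UCX config parser instead.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel meaning "no limit"; not a UCX definition. */
#define HCA_LIMIT_INF UINT_MAX

/*
 * Parse "inf", "0" or "<N>" into a numeric limit.
 * Returns 0 on success and -1 on malformed input.
 * A parsed value of 0 stands for "closest reachable HCA set".
 */
static int parse_max_hca_per_gpu(const char *str, unsigned *limit_p)
{
    unsigned long value;
    char *end;

    assert((str != NULL) && (limit_p != NULL));

    if (strcmp(str, "inf") == 0) {
        *limit_p = HCA_LIMIT_INF;
        return 0;
    }

    /* strtoul() accepts signs and leading spaces, so require a digit first */
    if ((str[0] < '0') || (str[0] > '9')) {
        return -1;
    }

    errno = 0;
    value = strtoul(str, &end, 10);
    if ((errno != 0) || (*end != '\0') || (value >= UINT_MAX)) {
        return -1;
    }

    *limit_p = (unsigned)value;
    return 0;
}
```

Keeping `inf` as a distinct sentinel rather than a very large number lets the caller skip selection entirely in the default case.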

Why?

UCX currently registers GPU memory on all MDs, which causes high ICM memory usage. Limiting the number of HCAs per GPU should reduce it.

How?

Precompute dmabuf-capable MD reachability and latency per memory system device during context setup. During a user ucp_mem_map() call, narrow only the dmabuf-capable part of the registration MD map according to UCX_MAX_HCA_PER_GPU. Keep non-dmabuf MDs unchanged.
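
The narrowing step can be pictured as a bitmap operation. The sketch below is an illustration under assumptions: `md_map_t` stands in for `ucp_md_map_t`, and `narrow_reg_md_map` is a hypothetical helper, not a function from this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for ucp_md_map_t: one bit per memory domain index. */
typedef uint64_t md_map_t;

/*
 * Narrow only the dmabuf-capable part of a registration MD map.
 *   reg_md_map    - MDs the registration would otherwise use
 *   dmabuf_md_map - MDs that are dmabuf-capable
 *   selected      - dmabuf-capable MDs chosen by the policy
 * Bits outside dmabuf_md_map pass through unchanged.
 */
static md_map_t narrow_reg_md_map(md_map_t reg_md_map, md_map_t dmabuf_md_map,
                                  md_map_t selected)
{
    /* The policy may only pick among dmabuf-capable MDs */
    assert((selected & ~dmabuf_md_map) == 0);

    return (reg_md_map & ~dmabuf_md_map) | (reg_md_map & selected);
}
```

For example, with `reg_md_map = 0xF`, `dmabuf_md_map = 0x6` and `selected = 0x2`, the result is `0xB`: MDs 0 and 3 are kept as non-dmabuf, and of the two dmabuf-capable MDs only MD 1 remains.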

Selection uses topology latency first, then MD use count, then MD name for deterministic tie breaking.
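
The three-level ordering described above can be sketched as a `qsort()` comparator. The struct, field names, and function names below are hypothetical and chosen for illustration only; the PR's own types are not shown in full here. The sketch covers only the numeric-limit case and does not model the `0` (closest set) or `inf` modes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical candidate record; not the PR's actual struct. */
typedef struct {
    const char *name;      /* MD name, e.g. a device name */
    double      latency;   /* topology latency to the GPU, lower is closer */
    unsigned    use_count; /* how many registrations already use this MD */
} md_candidate_t;

/* Order by latency, then by use count, then by name for determinism. */
static int md_candidate_compare(const void *a, const void *b)
{
    const md_candidate_t *lhs = a;
    const md_candidate_t *rhs = b;

    if (lhs->latency != rhs->latency) {
        return (lhs->latency < rhs->latency) ? -1 : 1;
    }

    if (lhs->use_count != rhs->use_count) {
        return (lhs->use_count < rhs->use_count) ? -1 : 1;
    }

    return strcmp(lhs->name, rhs->name);
}

/*
 * Sort candidates in place and return how many of the leading entries
 * are selected, which is at most 'limit'.
 */
static size_t select_closest_mds(md_candidate_t *cands, size_t count,
                                 size_t limit)
{
    assert((cands != NULL) || (count == 0));

    if (count > 1) {
        qsort(cands, count, sizeof(*cands), md_candidate_compare);
    }

    return (count < limit) ? count : limit;
}
```

The name comparison as the final key means that two MDs with equal latency and equal use count are always picked in the same order, so repeated runs select the same devices.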

@tomerg-nvidia tomerg-nvidia added the WIP-DNM Work in progress / Do not review label May 5, 2026
Add a configurable policy (UCX_DMABUF_REG_DEVICES) to control how many
dmabuf-capable MDs are registered during ucp_mem_map for GPU memory.
Three modes are supported: "all" (default, register every reachable
dmabuf MD), "closest" (only MDs on the same NUMA/bus), and a numeric
limit N (pick the N closest reachable MDs, load-balanced by use count).
The policy is applied only at the ucp_mem_map registration point, so
later registrations (e.g. zcopy, RNDV staging) bypass it.
@tomerg-nvidia tomerg-nvidia force-pushed the limit_md_registrations branch from 23d6c18 to 648f3ec Compare May 6, 2026 15:39
@tomerg-nvidia tomerg-nvidia requested a review from brminich May 6, 2026 15:45
@tomerg-nvidia tomerg-nvidia removed the WIP-DNM Work in progress / Do not review label May 6, 2026
@tomerg-nvidia tomerg-nvidia marked this pull request as ready for review May 6, 2026 15:45
@tomerg-nvidia tomerg-nvidia requested a review from tvegas1 May 7, 2026 03:09
Comment thread src/ucp/core/ucp_context.c Outdated
ucs_offsetof(ucp_context_config_t, proto_use_single_net_device),
UCS_CONFIG_TYPE_BOOL},

{"DMABUF_REG_DEVICES", "all",
Contributor

Suggest renaming it without DMABUF, since this option will remain even when it is no longer strictly dmabuf-related?

It should eventually replace GDA_MAX_HCA_PER_GPU, maybe MAX_HCA_PER_GPU=0/N/inf, 0 for closest, inf for all? Or a string like currently done?

Comment thread src/ucp/core/ucp_context.c Outdated
UCS_CONFIG_TYPE_BOOL},

{"DMABUF_REG_DEVICES", "all",
"Specifies which dmabuf-capable UCP memory domain(s) to use for broad "
Contributor

no dmabuf reference here

Comment thread src/ucp/core/ucp_context.c Outdated
md_map &= context->dmabuf_reg_md_map;
ucs_for_each_bit(md_index, md_map) {
context->dmabuf_reg_md[md_index].last_used =
ucs_atomic_fadd32(&context->dmabuf_reg_timestamp, 1) + 1;
Contributor

no need for atomic here?

Contributor Author

I converted it to a use count (context-local only). It still needs to be atomic because it can be accessed from multiple threads.

Comment thread src/ucp/core/ucp_context.h Outdated
ucp_dmabuf_reg_md_t dmabuf_reg_md[UCP_MAX_MDS];

/* Monotonic counter for LRU-based dmabuf MD selection */
volatile uint32_t dmabuf_reg_timestamp;
Contributor

Name it without "timestamp", as it is not really a time but rather a sequence number?

Contributor Author

Converted to a use count (per context).

Comment thread src/ucp/core/ucp_context.c Outdated
}

static ucp_md_map_t
ucp_dmabuf_reg_select_limit(const ucp_context_config_t *config,
Contributor

we can just remove all dmabuf prefixes in structs and functions

Contributor

we can keep the actual function implemented around dmabuf functionality though

Comment thread src/ucp/core/ucp_mm.c Outdated
dmabuf_md_map);
selected = ucp_context_select_dmabuf_reg_md_map(context, reachable,
sys_dev);
ucs_trace("dmabuf_policy: mem_type=%d sys_dev=%d dmabuf=0x%" PRIx64
Contributor

Use debug level, and do such computation once at startup if possible?

Comment thread src/ucp/core/ucp_mm.c Outdated
selected = ucp_context_select_dmabuf_reg_md_map(context, reachable,
sys_dev);
ucs_trace("dmabuf_policy: mem_type=%d sys_dev=%d dmabuf=0x%" PRIx64
" reachable=0x%" PRIx64 " selected=0x%" PRIx64,
Contributor

TBH for usability we'd need the HCA name if possible

Comment thread src/ucp/core/ucp_context.c Outdated
ucs_offsetof(ucp_context_config_t, proto_use_single_net_device),
UCS_CONFIG_TYPE_BOOL},

{"DMABUF_REG_DEVICES", "all",
Contributor

set to '1' by default at the potential expense of throughput.

Comment thread src/ucp/core/ucp_context.c Outdated
ucp_dmabuf_reg_select_md_t select_mds[UCP_MAX_MDS];
ucp_md_index_t md_index;

md_map &= context->dmabuf_reg_md_map;
Contributor

probably not needed as done by caller?

Comment thread src/ucp/core/ucp_mm.c Outdated
@tomerg-nvidia tomerg-nvidia changed the title from "UCP/MM: Add UCX_DMABUF_REG_DEVICES policy to limit dmabuf registrations" to "UCP/MM: Add MAX_HCA_PER_GPU policy for GPU dmabuf registrations" May 8, 2026
@tomerg-nvidia tomerg-nvidia requested a review from tvegas1 May 8, 2026 09:25

2 participants