UCP/MM: Add MAX_HCA_PER_GPU policy for GPU dmabuf registrations #11422
tomerg-nvidia wants to merge 4 commits into openucx:master
Add a configurable policy (UCX_DMABUF_REG_DEVICES) to control how many dmabuf-capable MDs are registered during ucp_mem_map for GPU memory. Three modes are supported: "all" (default, register on every reachable dmabuf MD), "closest" (only MDs on the same NUMA/bus), and a numeric limit N (pick the N closest reachable MDs, load-balanced by use count). The policy is applied only in the ucp_mem_map path, so later registrations (e.g. zcopy, RNDV staging) bypass it.
23d6c18 to 648f3ec
    ucs_offsetof(ucp_context_config_t, proto_use_single_net_device),
    UCS_CONFIG_TYPE_BOOL},

    {"DMABUF_REG_DEVICES", "all",
Suggest renaming it without DMABUF, since this option will remain even when it is no longer strictly dmabuf-related.
It should eventually replace GDA_MAX_HCA_PER_GPU. Maybe MAX_HCA_PER_GPU=0/N/inf, with 0 for closest and inf for all? Or keep a string, as is currently done?
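To make the 0/N/inf encoding floated in this comment concrete, here is a minimal standalone sketch of how such a value could be parsed. It is not UCX code: UCX has its own config parsing layer, and `parse_max_hca` and `MAX_HCA_INF` are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel meaning "no limit" (the "inf" setting). */
#define MAX_HCA_INF UINT_MAX

/*
 * Parse "inf", "0" or a positive decimal N into *value_p.
 * Returns 0 on success and -1 on malformed input.
 * 0 is kept as-is and is meant to be read as "closest HCA set".
 */
static int parse_max_hca(const char *str, unsigned *value_p)
{
    unsigned long value;
    char *end;

    if (strcmp(str, "inf") == 0) {
        *value_p = MAX_HCA_INF;
        return 0;
    }

    /* Reject empty strings, signs and leading whitespace up front */
    if ((*str < '0') || (*str > '9')) {
        return -1;
    }

    errno = 0;
    value = strtoul(str, &end, 10);
    if ((errno != 0) || (*end != '\0') || (value >= UINT_MAX)) {
        return -1;
    }

    *value_p = (unsigned)value;
    return 0;
}
```

A numeric encoding keeps the option a single scalar, at the cost of giving 0 a special meaning that a string such as "closest" would spell out.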
    UCS_CONFIG_TYPE_BOOL},

    {"DMABUF_REG_DEVICES", "all",
    "Specifies which dmabuf-capable UCP memory domain(s) to use for broad "
    md_map &= context->dmabuf_reg_md_map;
    ucs_for_each_bit(md_index, md_map) {
        context->dmabuf_reg_md[md_index].last_used =
                ucs_atomic_fadd32(&context->dmabuf_reg_timestamp, 1) + 1;
I converted it to a use count (though context-local only); it needs to be atomic because it can be accessed from multiple threads.
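As an illustration of the atomic use count described here, the following is a minimal sketch that uses C11 `<stdatomic.h>` as a stand-in for UCX's own atomic helpers (such as the `ucs_atomic_fadd32` call in the diff above). The `md_usage_t` type, the `MAX_MDS` bound and the function names are hypothetical and are not the PR's actual code.

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_MDS 64 /* hypothetical bound, one bit per MD in a 64-bit map */

/* Hypothetical per-context table of MD use counts */
typedef struct {
    atomic_uint_fast32_t use_count[MAX_MDS];
} md_usage_t;

/* Increment the use count of every MD whose bit is set in md_map.
 * Relaxed ordering is enough here because the counter is only a
 * load-balancing hint and does not guard any other data. */
static void md_usage_add(md_usage_t *usage, uint64_t md_map)
{
    unsigned md_index;

    for (md_index = 0; md_index < MAX_MDS; ++md_index) {
        if (md_map & (UINT64_C(1) << md_index)) {
            atomic_fetch_add_explicit(&usage->use_count[md_index], 1,
                                      memory_order_relaxed);
        }
    }
}

/* Read the current use count of a single MD */
static uint32_t md_usage_get(md_usage_t *usage, unsigned md_index)
{
    return (uint32_t)atomic_load_explicit(&usage->use_count[md_index],
                                          memory_order_relaxed);
}
```

Because the counts are kept per context, two contexts in the same process would balance independently of each other, which matches the "context-local only" caveat in the comment.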
    ucp_dmabuf_reg_md_t dmabuf_reg_md[UCP_MAX_MDS];

    /* Monotonic counter for LRU-based dmabuf MD selection */
    volatile uint32_t dmabuf_reg_timestamp;
Rename it without "timestamp", since it is not really a time but rather a sequence number?
converted to use count (per context)
    }

    static ucp_md_map_t
    ucp_dmabuf_reg_select_limit(const ucp_context_config_t *config,
we can just remove all dmabuf prefixes in structs and functions
we can keep the actual function implemented around dmabuf functionality though
            dmabuf_md_map);
    selected = ucp_context_select_dmabuf_reg_md_map(context, reachable,
                                                    sys_dev);
    ucs_trace("dmabuf_policy: mem_type=%d sys_dev=%d dmabuf=0x%" PRIx64
Use debug level, and do this computation once at startup if possible?
    selected = ucp_context_select_dmabuf_reg_md_map(context, reachable,
                                                    sys_dev);
    ucs_trace("dmabuf_policy: mem_type=%d sys_dev=%d dmabuf=0x%" PRIx64
              " reachable=0x%" PRIx64 " selected=0x%" PRIx64,
TBH for usability we'd need the HCA name if possible
    ucs_offsetof(ucp_context_config_t, proto_use_single_net_device),
    UCS_CONFIG_TYPE_BOOL},

    {"DMABUF_REG_DEVICES", "all",
set to '1' by default at the potential expense of throughput.
    ucp_dmabuf_reg_select_md_t select_mds[UCP_MAX_MDS];
    ucp_md_index_t md_index;

    md_map &= context->dmabuf_reg_md_map;
probably not needed as done by caller?
What?
Add UCX_MAX_HCA_PER_GPU to limit how many HCAs UCP registers GPU memory on through dmabuf-capable MDs.
Supported values:
- inf: register on all reachable HCAs (default)
- 0: register on the closest reachable HCA set
- <N>: register on up to N closest reachable HCAs
Why?
UCX currently registers GPU memory on all MDs, causing high ICM memory usage. Limiting the number of HCAs should reduce it.
How?
Precompute dmabuf-capable MD reachability and latency per memory system device during context setup. During user ucp_mem_map(), narrow only the dmabuf-capable part of the registration MD map according to UCX_MAX_HCA_PER_GPU, and keep non-dmabuf MDs unchanged.
Selection uses topology latency first, then MD use count, then MD name for deterministic tie breaking.
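The selection order described above (topology latency, then use count, then MD name) can be sketched as a comparator plus a limit. This is a standalone illustration, not the PR's implementation: `md_candidate_t`, `md_candidate_compare` and `select_md_map` are hypothetical names, and treating 0 as "all candidates that share the lowest latency" is an assumed reading of "the closest reachable HCA set".

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical candidate record; fields are illustrative only */
typedef struct {
    unsigned    md_index;  /* bit position in the MD map, must be < 64 */
    double      latency;   /* topology latency to the GPU, lower is closer */
    uint32_t    use_count; /* registrations already placed on this MD */
    const char *name;      /* MD name, final deterministic tie breaker */
} md_candidate_t;

/* Order by latency, then by use count, then by name */
static int md_candidate_compare(const void *a, const void *b)
{
    const md_candidate_t *x = a;
    const md_candidate_t *y = b;

    if (x->latency != y->latency) {
        return (x->latency < y->latency) ? -1 : 1;
    }
    if (x->use_count != y->use_count) {
        return (x->use_count < y->use_count) ? -1 : 1;
    }
    return strcmp(x->name, y->name);
}

/* Sort the candidates in place and return a bitmap of the selected MDs.
 * max_hca == 0        : every candidate tied for the lowest latency
 * max_hca == SIZE_MAX : every candidate (the "inf" setting)
 * otherwise           : the first max_hca candidates in sorted order */
static uint64_t select_md_map(md_candidate_t *cands, size_t count,
                              size_t max_hca)
{
    uint64_t selected = 0;
    size_t i;

    qsort(cands, count, sizeof(*cands), md_candidate_compare);
    for (i = 0; i < count; ++i) {
        if (max_hca == 0) {
            if (cands[i].latency != cands[0].latency) {
                break;
            }
        } else if (i >= max_hca) {
            break;
        }
        selected |= UINT64_C(1) << cands[i].md_index;
    }
    return selected;
}
```

Using the name as the last key means two processes that see the same topology and the same use counts pick the same HCAs, which is the point of the deterministic tie break.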