
fit-params : refactor + add option to output estimated memory per device#22171

Open
ggerganov wants to merge 6 commits intomasterfrom
gg/fit-params-estimate

Conversation


@ggerganov ggerganov commented Apr 20, 2026

Overview

cont #16653
ref #19070 (review)

  • Refactor the fit-params logic, moving it from libllama to libcommon
  • Add the CLI argument -fite, --fit-estimate to the llama-fit-params tool. This is useful for third-party applications to estimate the required memory for a model.

Additional information

Example:

```
llama-fit-params -m ~/models/gemma-3-4b-it/ggml-model-f16.gguf -c 32768 --fit-estimate on
```

```
0.00.196.882 I main: printing estimated memory in MiB to stdout (device, model, context, compute) ...
MTL0 7401 814 517
host 1280 0 154
```


@ggerganov ggerganov requested a review from a team as a code owner April 20, 2026 14:07
Comment thread src/llama-ext.h Outdated
```cpp
#include "llama.h"

#include <cstdint>
#include <vector>
```
Contributor


We are making llama-ext.h a C++ only header then?

Member Author


This header is not required to be C-style. It is for staging new APIs and can be C++ if needed.

Once an API is ready to become public, it has to become C-style.

Comment thread tools/fit-params/fit-params.cpp Outdated
Comment on lines 77 to 89
```cpp
for (size_t id = 0; id < devs.size(); id++) {
    printf("%s ",  ggml_backend_dev_name(devs[id]));
    printf("%zu ", dmd[id].mb.model  /1024/1024);
    printf("%zu ", dmd[id].mb.context/1024/1024);
    printf("%zu ", dmd[id].mb.compute/1024/1024);
    printf("\n");
}

printf("Host ");
printf("%zu ", dmd.back().mb.model  /1024/1024);
printf("%zu ", dmd.back().mb.context/1024/1024);
printf("%zu ", dmd.back().mb.compute/1024/1024);
printf("\n");
```
Contributor


If the intent is to parse this output programmatically I think it would be preferable to use a well-defined format like JSON.

Member Author


For me, JSON is a lot of overhead, and in this specific case I don't think it is warranted because the data we want to output is very simple.

Comment thread common/arg.cpp Outdated
Comment on lines +2430 to +2431
```cpp
{ "-fite", "--fit-estimate" }, "[on|off]",
string_format("estimate the required memory to run the model ('on' or 'off', default: '%s')", params.fit_params_est ? "on" : "off"),
```
Contributor


I think this description is confusing. It is true that with this option enabled the program will estimate the required memory, but that is what it is already doing anyway. Maybe it would be better to call this something like --fit-print (or --fit-print-json, depending on my other comment).

Comment thread tools/fit-params/fit-params.cpp Outdated
```cpp
LOG_INF("%s: printing fitted CLI arguments to stdout...\n", __func__);
common_log_flush(common_log_main());

printf("-c %" PRIu32 " -ngl %" PRIi32, cparams.n_ctx, mparams.n_gpu_layers);

if (!params.fit_params_est) {
```
Contributor


In principle, if the only goal is to disable the fitting, this can already be done by manually setting e.g. -c 0 -ngl 999. The code should then recognize that these have been set manually and will not alter them. The only downside vs. the current approach would be that it's slightly slower than retrieving llama_get_device_memory_data only once. But if we're already messing around with internal headers anyway, we may as well make llama_params_fit_impl return the device memory data instead. Then all that would need to be done is change what is being printed.

@ggerganov ggerganov requested review from a team, CISC and ngxson as code owners April 20, 2026 17:16
@ggerganov
Member Author

I took the opportunity to refactor the implementation and move the param fitting logic outside of libllama. It's something that I suggested in the original PR (#16653 (review)), and I think it makes sense to have this logic in libcommon for now. IMO the param fitting API on master is too narrow and does not allow flexibility for the user code. The idea is to eventually expose a much more lightweight and flexible API that would allow applications to implement more sophisticated logic for param fitting and memory queries.

The prototype of this API is now in llama-ext.h:

```cpp
//
// device memory querying
//

// "memory" as in physical memory for a buffer type, in bytes
struct llama_memory_breakdown_data {
    size_t model   = 0; // memory allocated for the model
    size_t context = 0; // memory allocated for the context
    size_t compute = 0; // memory allocated for temporary compute buffers

    size_t total() const {
        return model + context + compute;
    }
};

struct llama_device_memory_data {
    int64_t total;
    int64_t free;

    llama_memory_breakdown_data mb;
};

// TODO: convert to C-style data structure
using llama_memory_breakdown = std::map<ggml_backend_buffer_type_t, llama_memory_breakdown_data>;

int32_t llama_model_n_expert (const struct llama_model * model);
int32_t llama_model_n_devices(const struct llama_model * model);

ggml_backend_dev_t llama_model_get_device(const struct llama_model * model, int i);

llama_memory_breakdown llama_get_memory_breakdown(const struct llama_context * ctx);
```

@ggerganov ggerganov added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Apr 20, 2026
@ggerganov ggerganov changed the title fit-params : add option to output estimated memory per device fit-params : refactor + add option to output estimated memory per device Apr 20, 2026
Contributor

@JohannesGaessler JohannesGaessler left a comment


Please also add me as a code owner for fit.cpp.


Labels

breaking change (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility.), examples, server
