fit-params : refactor + add option to output estimated memory per device#22171
#include "llama.h"

#include <cstdint>
#include <vector>
We are making llama-ext.h a C++ only header then?
This header is not required to be C-style. It is for staging new API and can be C++ if needed.
Once an API is ready to become public, it has to become C-style.
for (size_t id = 0; id < devs.size(); id++) {
    printf("%s ", ggml_backend_dev_name(devs[id]));
    printf("%zu ", dmd[id].mb.model/1024/1024);
    printf("%zu ", dmd[id].mb.context/1024/1024);
    printf("%zu ", dmd[id].mb.compute/1024/1024);
    printf("\n");
}
printf("%s%s=%s", itbo > 0 ? "," : "", mparams.tensor_buft_overrides[itbo].pattern, ggml_backend_buft_name(mparams.tensor_buft_overrides[itbo].buft));
any_tbo = true;
    printf("Host ");
    printf("%zu ", dmd.back().mb.model/1024/1024);
    printf("%zu ", dmd.back().mb.context/1024/1024);
    printf("%zu ", dmd.back().mb.compute/1024/1024);
    printf("\n");
}
If the intent is to parse this output programmatically I think it would be preferable to use a well-defined format like JSON.
For me, JSON is a huge overhead and in this specific case I don't think it is warranted because the data that we want to output is very simple.
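To make the trade-off concrete, here is a minimal sketch of how a 3rd party application might parse one line of the plain-text output shown in the diff above. It assumes each line has the shape `<name> <model> <context> <compute>` with sizes in MiB (the printed values are byte counts divided by 1024 twice), and that device names contain no whitespace. The `mem_line` struct and `parse_mem_line` function are hypothetical names for illustration only and are not part of llama.cpp.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical record for one output line; field names are illustrative only.
struct mem_line {
    std::string name;        // device name, or "Host" for the last line
    size_t      model_mib;   // model memory in MiB
    size_t      context_mib; // context memory in MiB
    size_t      compute_mib; // compute buffer memory in MiB
};

// Parse one "<name> <model> <context> <compute>" line.
// Assumption: the name contains no whitespace. Returns std::nullopt if the
// line does not have exactly one name followed by three unsigned integers.
static std::optional<mem_line> parse_mem_line(const std::string & line) {
    std::istringstream iss(line);
    mem_line out;
    if (!(iss >> out.name >> out.model_mib >> out.context_mib >> out.compute_mib)) {
        return std::nullopt;
    }
    std::string extra;
    if (iss >> extra) {
        return std::nullopt; // trailing tokens: not the expected 4-column shape
    }
    return out;
}
```

A caller would feed each line of stdout to `parse_mem_line` and skip lines that return `std::nullopt`. The sketch breaks if a device name ever contains a space, which is the kind of ambiguity a well-defined format such as JSON would avoid.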
{ "-fite", "--fit-estimate" }, "[on|off]",
string_format("estimate the required memory to run the model ('on' or 'off', default: '%s')", params.fit_params_est ? "on" : "off"),
I think this description is confusing. It is true that with this option enabled the program will estimate the required memory, but that is already what it's doing anyway. Maybe it would be better to call this something like --fit-print (or --fit-print-json, depending on my other comment).
LOG_INF("%s: printing fitted CLI arguments to stdout...\n", __func__);
common_log_flush(common_log_main());
printf("-c %" PRIu32 " -ngl %" PRIi32, cparams.n_ctx, mparams.n_gpu_layers);
if (!params.fit_params_est) {
In principle, if the only goal is to disable the fitting, this can already be done by manually setting e.g. -c 0 -ngl 999. The code should then recognize that these have been set manually and will not alter them. The only downside vs. the current approach is that it would be slightly slower than retrieving llama_get_device_memory_data only once. But if we're already messing around with internal headers anyway, we may as well make llama_params_fit_impl return the device memory data instead. Then all that would need to be done is to change what is being printed.
I took the opportunity to refactor the implementation and move the param fitting logic outside of The prototype of this API is now in Lines 61 to 90 in 7681f17
JohannesGaessler
left a comment
Please also add me as a code owner for fit.cpp.
Overview
cont #16653
ref #19070 (review)
Move the param fitting logic from libllama to libcommon
Add option -fite, --fit-estimate to the llama-fit-params tool. This is useful for 3rd party applications to estimate the required memory for a model.
Additional information
Example:
Requirements