If using e.g. Ollama with locally running Gemma4, the response from the LLM is quite often stripped in the middle.
Reason: the hard-coded max_token limit is too low. This kind of LLMs produces "thinking" tokens, which, being counted together with response tokens, exhaust the tokens' limit causing the generation to stop with "length" as the stop reason.
Suggested solution: make max_tokens configurable so that user can set the limit higher if this occurs. See also #65
If using e.g. Ollama with locally running Gemma4, the response from the LLM is quite often stripped in the middle.
Reason: the hard-coded max_token limit is too low. This kind of LLMs produces "thinking" tokens, which, being counted together with response tokens, exhaust the tokens' limit causing the generation to stop with "length" as the stop reason.
Suggested solution: make max_tokens configurable so that user can set the limit higher if this occurs. See also #65