Approx usage for interrupted streaming requests #55
def approx_text_tokens(s): return (len(s or '') + 2)//3

def approx_obj_tokens(o):
    try: s = json.dumps(obj2dict(o), ensure_ascii=False, default=str)
    except Exception: s = str(o)
    return approx_text_tokens(s)
@jph00 Should we instead use the tiktoken-based estimator from solveit?
Why json.dumps instead of str here btw?
        api_name=api_name,
        vendor_name=vendor_name,
        usage=usage)
    chat._track(self.value)
A cancelled request exits AsyncChat._call while yielding chunks (async for chunk in res: yield chunk # exits here) and never reaches the rest of the code:
# AsyncChat._call()
...
if stream:
    if self.prefill: yield _mk_prefill(self.prefill)
    res = astream_with_complete(self, res, postproc=postproc)
    async for chunk in res: yield chunk # exits here
    res = res.value
So we manually call chat._track(self.value) here to set c.use for the interrupted request.
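The early exit described above can be reproduced in isolation. The following is a minimal sketch using only asyncio, not the library's code: stream_with_tracking, call, consume and the log list are hypothetical stand-ins for the streaming wrapper, AsyncChat._call, the caller and chat._track. It assumes the request is interrupted by cancelling the consuming task while the stream is waiting for the next chunk.

```python
import asyncio

async def stream_with_tracking(chunks, log):
    "Hypothetical inner wrapper: records what was streamed if it is interrupted."
    seen = []
    try:
        for c in chunks:
            await asyncio.sleep(0.01)  # stands in for waiting on the provider
            seen.append(c)
            yield c
    except (GeneratorExit, asyncio.CancelledError):
        # The only place that still runs on interruption, so track here
        log.append(('interrupted', ''.join(seen)))
        raise

async def call(chunks, log):
    "Hypothetical outer generator, shaped like the excerpt above."
    res = stream_with_tracking(chunks, log)
    async for chunk in res: yield chunk  # a cancelled request exits here
    log.append(('complete', None))       # skipped when the request is cancelled

async def consume(chunks, log, got_two):
    n = 0
    async for _ in call(chunks, log):
        n += 1
        if n == 2: got_two.set()  # signal that two chunks have arrived

async def main():
    log, got_two = [], asyncio.Event()
    task = asyncio.create_task(consume(['a', 'b', 'c'], log, got_two))
    await got_two.wait()
    task.cancel()  # interrupt while the third chunk is pending
    try: await task
    except asyncio.CancelledError: pass
    return log

log = asyncio.run(main())
print(log)
```

Only the handler inside the inner wrapper runs; the line after the async for loop in call is never reached, which is why the tracking has to happen in the except branch.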
def mk_client(model=None, vendor_name=None, api_name=None, api_key=None, base_url=None, xtra_hdrs=None,
              timeout=httpx.Timeout(connect=30, read=300, write=30, pool=10)):
# %% ../nbs/06_acomplete.ipynb #c714601e
def resolve_api_vendor(model=None, vendor_name=None, api_name=None, api_key=None, base_url=None):
Factored this out to be able to use it during interrupted Completion construction.
    yield postproc(chunk)
    self.value = chunk
except (GeneratorExit, asyncio.CancelledError):
    api_name,vendor_name,*_ = resolve_api_vendor(chat.model, chat.vendor_name, chat.api_name, chat.api_key, chat.base_url)
api_name and vendor_name are inferred in acomplete inside mk_client and not stored in AsyncChat, so we resolve them here using the new helper.
FinishReason = str_enum('finish_reason', 'stop', 'tool_calls', 'length', 'content_filter', 'interrupted')

# %% ../nbs/00_types.ipynb #c5a88e6f
def approx_text_tokens(s): return (len(s or '') + 2)//3
Shouldn't this be len((s or '').split()... ?
This is a heuristic an AI came up with; I couldn't find our pre-tiktoken estimator in the git history. If you have that available, that would be awesome. Was it something like:
def str_tokens(s): return int(len(s)/3.4) + 1
from https://github.com/AnswerDotAI/solveit/blob/b3d4b09dbef1f6a7437ca1c79a81d796f9ac50ed/00_db.ipynb ?
a65b4c5 to 6ca958e
@jph00 I've simplified the token approx logic down to def approx_str_tokens(o): return int(len(str(o))/3.4) + 1
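For comparison, here is a small sketch that runs the two heuristics quoted in this thread on the same input. The function bodies are copied from the comments above; the sample string is an arbitrary assumption, and neither result is checked against a real tokenizer.

```python
def approx_text_tokens(s):
    # Earlier heuristic from the diff: one token per 3 characters, rounded up
    return (len(s or '') + 2) // 3

def approx_str_tokens(o):
    # Simplified heuristic: about 3.4 characters per token, never less than 1
    return int(len(str(o)) / 3.4) + 1

sample = "The quick brown fox jumps over the lazy dog."  # 44 characters

print(approx_text_tokens(sample))  # 15
print(approx_str_tokens(sample))   # 13

# Edge cases differ: the old version maps None to 0 tokens, while the new one
# stringifies its argument first, so None is counted as the 4 characters 'None'.
print(approx_text_tokens(None))    # 0
print(approx_str_tokens(None))     # 2
print(approx_str_tokens(''))       # 1
```

Because the simplified version calls str() on whatever it receives, it accepts arbitrary objects without the json.dumps step, at the cost of counting the repr of non-string values.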
Streaming wrappers now build an interrupted Completion when a stream is cancelled or closed before provider usage is returned. The wrapper estimates prompt/output tokens, assumes 80% of input tokens were cached, normalizes the synthetic usage through the provider's existing norm_usage, and tracks it with the normal AsyncChat usage accounting.
Providers can now register approx_raw_usage hooks, so approximate usage keeps the same provider-shaped raw usage and cost path as real responses. This adds hooks for OpenAI Responses, OpenAI Chat, Anthropic, and Gemini.
This lets callers such as Solveit show and log approximate token usage/cost for interrupted prompts, including cancellations before the first streamed token.
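As a rough illustration of the flow described above, the sketch below estimates token counts, applies the 80% cached-input assumption, and hands the numbers to a per-provider hook that returns a provider-shaped usage dict. Everything here is hypothetical: the registry, the function names and the hook signature are assumptions for illustration, not the PR's actual API. Only the usage field names follow the providers' public response formats.

```python
# Hypothetical registry: api name -> function building provider-shaped raw usage
APPROX_HOOKS = {}

def register_approx_raw_usage(api_name):
    "Hypothetical decorator that registers a hook for one provider API."
    def deco(f):
        APPROX_HOOKS[api_name] = f
        return f
    return deco

def approx_str_tokens(o): return int(len(str(o)) / 3.4) + 1

@register_approx_raw_usage('anthropic')
def _anthropic(inp, out, cached):
    # Anthropic reports cache reads separately from uncached input tokens
    return dict(input_tokens=inp - cached, cache_read_input_tokens=cached, output_tokens=out)

@register_approx_raw_usage('openai_chat')
def _openai_chat(inp, out, cached):
    # OpenAI Chat reports cached tokens as a detail of the prompt total
    return dict(prompt_tokens=inp, completion_tokens=out,
                prompt_tokens_details=dict(cached_tokens=cached))

def approx_usage(api_name, prompt, partial_output, cached_frac=0.8):
    "Estimate usage for an interrupted request; cached_frac is the assumed cache hit share."
    inp, out = approx_str_tokens(prompt), approx_str_tokens(partial_output)
    cached = round(inp * cached_frac)
    return APPROX_HOOKS[api_name](inp, out, cached)

prompt, partial = 'x' * 100, 'y' * 10  # 30 and 3 estimated tokens
print(approx_usage('anthropic', prompt, partial))
print(approx_usage('openai_chat', prompt, partial))
```

Keeping the estimate in each provider's raw shape means the result can go through the same normalization and cost code as a real response, which is the motivation given in the description.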
interrupted_usage_half.mov