What problem does this solve?
Re-deploying the same artifact (e.g. after a server restart or a fix attempt) burns LLM API calls on identical codegen. The artifact hasn't changed, the prompt hasn't changed, but the full generate → validate → repair loop runs again from scratch.
Proposed solution
Cache LLM generation results keyed by a SHA-256 hash of:
- artifact file bytes
- interpreted metadata JSON (from the LLM interpretation stage, A3) — captures user answers to clarifying questions
PROMPT_VERSION constant in agent.py — bumped manually when system prompts change
This is more precise than hashing the artifact alone: two deploys of the same artifact with different answers to clarifying questions (e.g. "state_dict" vs "full model") produce different cache keys and different cached code.
Cache stored at ~/.inference-engine/cache/<hash>.json:
{
"key": "<sha256>",
"load_body": "...",
"predict_body": "...",
"framework": "pytorch",
"created_at": "2026-05-14T07:00:00Z",
"prompt_version": "3"
}
On deploy, if a cache hit exists:
- interactive:
"Cached generation found (pytorch, created 2026-05-14). Use it? [Y/n]"
--yes mode: always use cache, print "Using cached generation."
Cache is never expired by time — artifacts are immutable. Invalidated only when PROMPT_VERSION changes (different key, old entry is simply never matched).
Alternatives considered
No cache. Acceptable for one-off deploys but wasteful in CI pipelines that re-deploy the same artifact on every run.
Area
CLI (deploy / fix)
What problem does this solve?
Re-deploying the same artifact (e.g. after a server restart or a
fixattempt) burns LLM API calls on identical codegen. The artifact hasn't changed, the prompt hasn't changed, but the full generate → validate → repair loop runs again from scratch.Proposed solution
Cache LLM generation results keyed by a SHA-256 hash of:
PROMPT_VERSIONconstant inagent.py— bumped manually when system prompts changeThis is more precise than hashing the artifact alone: two deploys of the same artifact with different answers to clarifying questions (e.g. "state_dict" vs "full model") produce different cache keys and different cached code.
Cache stored at
~/.inference-engine/cache/<hash>.json:{ "key": "<sha256>", "load_body": "...", "predict_body": "...", "framework": "pytorch", "created_at": "2026-05-14T07:00:00Z", "prompt_version": "3" }On deploy, if a cache hit exists:
"Cached generation found (pytorch, created 2026-05-14). Use it? [Y/n]"--yesmode: always use cache, print"Using cached generation."Cache is never expired by time — artifacts are immutable. Invalidated only when
PROMPT_VERSIONchanges (different key, old entry is simply never matched).Alternatives considered
No cache. Acceptable for one-off deploys but wasteful in CI pipelines that re-deploy the same artifact on every run.
Area
CLI (deploy / fix)