Add near-complete Codex VSCode Support, full OAI Responses bridge by michaelw9999 · Pull Request #3 · michaelw9999/llama.cpp

michaelw9999 · 2026-05-03T08:57:36Z

This brings in automatic compaction, web_search, and file_search, and is easy to configure. For example:

model = "qwen3.5-4B-NVFP4"
model_provider = "llamacpp"
personality = "friendly"
model_context_window = 128000
model_auto_compact_token_limit = 100000
model_supports_reasoning_summaries = true
model_reasoning_summary = "auto"
model_reasoning_effort = "medium"

[model_providers.llamacpp]
name = "Local llama.cpp"
model = "Qwen3.5-4B-NVFP4.gguf"
base_url = "http://192.168.50.50:43901/v1"
supports_websockets = false

[model_providers.llamacpp.http_headers]
X-Llama-Responses-Web-Search-Wrapper = "tvly"
X-Llama-Responses-File-Search-Wrapper = "rg"
X-Llama-Responses-Reasoning-Budget-Tokens = "minimal=2048,low=4096,medium=8192,high=16384,xhigh=32768"

For automatic compaction to work, you must set model_context_window and model_auto_compact_token_limit. Summary boxes and clickable diffs with the undo button usually need model_supports_reasoning_summaries = true and model_reasoning_summary = "auto".
Just install Tavily (its shell command is tvly) and rg, or any other preferred web-search MCP or file search/locator tool; the bridge wraps the tool through the shell and integrates it more natively and intuitively. If a wrapper header is left out, the corresponding tool is hidden from the model.
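As an illustration of the X-Llama-Responses-Reasoning-Budget-Tokens value format shown above, here is a minimal sketch of how such a comma-separated `effort=tokens` string could be split into a lookup table. The function name `parse_reasoning_budgets` and the choice to skip malformed entries are assumptions made for this example, not the PR's actual parser.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not from the PR): parse "minimal=2048,low=4096"
// into {effort name -> token budget}. Entries without '=' or with a
// non-numeric value are skipped rather than treated as errors.
std::map<std::string, int> parse_reasoning_budgets(const std::string & header) {
    std::map<std::string, int> budgets;
    std::istringstream stream(header);
    std::string entry;
    while (std::getline(stream, entry, ',')) {
        const std::size_t eq = entry.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 >= entry.size()) {
            continue; // no key or no value
        }
        try {
            budgets[entry.substr(0, eq)] = std::stoi(entry.substr(eq + 1));
        } catch (const std::exception &) {
            // value was not a number: ignore this entry
        }
    }
    return budgets;
}
```

With the header from the example config, looking up "medium" in the returned map yields 8192.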

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d72b0819db

chatgpt-codex-connector · 2026-05-03T09:03:12Z

+                    {"sequence_number", seq_num++},
+                    {"output_index",    output_idx++},
                    {"item", json {
+                        {"id",        oai_resp_fc_item_id},


Emit a fresh function-call item id for each added tool call

server_task_result_cmpl_partial::update() only assigns state.oai_resp_fc_item_id after snapshotting state into the chunk fields, so to_json_oaicompat_resp() can emit response.output_item.added with {"id": oai_resp_fc_item_id} from the previous value (often empty on the first streamed tool call). This makes streamed response.function_call_arguments.delta.item_id/final output_item.done.item.id inconsistent with the announced item, which breaks clients that stitch function-call argument deltas by item_id.
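The ordering issue described above can be sketched in isolation. The types below are simplified, hypothetical stand-ins for the server structures named in the review, and the `fc_<n>` id scheme is an assumption. The point is only the order of operations: assign the fresh id to the state first, then snapshot it into the chunk, so the announced item and the later argument deltas carry the same id.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins for the server types named in the review.
struct stream_state {
    std::string oai_resp_fc_item_id; // id of the function-call item being streamed
    int         n_tool_calls = 0;
};

struct chunk {
    std::string oai_resp_fc_item_id; // value the streamed events will carry
};

// Assign a fresh id first, then snapshot, so that the "added" event and the
// following argument deltas agree on item_id.
chunk update_on_new_tool_call(stream_state & state) {
    state.oai_resp_fc_item_id = "fc_" + std::to_string(state.n_tool_calls++);
    chunk out;
    out.oai_resp_fc_item_id = state.oai_resp_fc_item_id; // snapshot after assignment
    return out;
}
```

Snapshotting before the assignment would instead hand the first tool call an empty id and every later call the id of its predecessor.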

chatgpt-codex-connector · 2026-05-03T09:03:12Z

+    if (checkpoints.empty()) {
+        return true;
+    }


Remove stale checkpoint sidecar when no checkpoints exist

When checkpoints is empty, slot_checkpoints_save() returns without touching <filepath>.checkpoints, so reusing the same save filename can leave an old sidecar file behind. A later restore will then load stale checkpoint metadata for a different KV snapshot, which can trigger invalid recurrent-state restore attempts or unnecessary full prompt reprocessing.
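One possible shape of the suggested fix, as a sketch only: when there is nothing to save, delete any existing sidecar instead of returning early. The `checkpoint` struct, the function name, and the payload written here are placeholders; only the `<filepath>.checkpoints` naming comes from the review comment.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

// Placeholder type: the real checkpoint data is not modelled here.
struct checkpoint { int pos; };

// Hypothetical sketch: remove a stale "<filepath>.checkpoints" sidecar when
// there are no checkpoints, otherwise (over)write it.
bool save_checkpoints_sketch(const std::string & filepath, const std::vector<checkpoint> & checkpoints) {
    const std::filesystem::path sidecar = filepath + ".checkpoints";
    if (checkpoints.empty()) {
        std::error_code ec;
        std::filesystem::remove(sidecar, ec); // a missing file is not an error
        return !ec;
    }
    std::ofstream out(sidecar, std::ios::binary | std::ios::trunc);
    for (const checkpoint & c : checkpoints) {
        out.write(reinterpret_cast<const char *>(&c.pos), sizeof(c.pos));
    }
    return out.good();
}
```

Saving twice to the same filename, first with checkpoints and then without, leaves no sidecar behind in this sketch.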

Create a pool of N threads that grab a chunk of up to 100 tests at a time to iterate through. The number of tests at a time decreases as fewer remain. Each thread uses its own dev and cpu backend, and set_n_threads_fn is not called on the cpu backend.

Fix some TSAN issues that arose:
- In init_tensor_uniform, don't use static vector of generators.
- Replace gmtime with versions that don't use a global variable.
- Mutex calls to print_test_result.
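A rough sketch of the chunk-grabbing scheme described in this commit message: a shared queue hands out index ranges of at most 100 tests, and the range shrinks as the remaining work runs low. The class name and the exact shrink rule (`remaining / (2 * n_threads)`) are assumptions for illustration; the commit message does not state the formula used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>

// Hypothetical work queue: hands out [begin, end) ranges of test indices.
// The shrink rule below is an assumption, not taken from the commit.
class test_chunk_queue {
public:
    test_chunk_queue(std::size_t n_tests, std::size_t n_threads)
        : next_(0), total_(n_tests), n_threads_(std::max<std::size_t>(n_threads, 1)) {}

    // Returns an empty range (begin == end) once no tests remain.
    std::pair<std::size_t, std::size_t> grab() {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::size_t remaining = total_ - next_;
        if (remaining == 0) {
            return { next_, next_ };
        }
        // At most 100 tests per grab, fewer as the queue drains, never zero.
        std::size_t chunk = std::min<std::size_t>(100, remaining / (2 * n_threads_));
        chunk = std::max<std::size_t>(chunk, 1);
        const std::size_t begin = next_;
        next_ += chunk;
        return { begin, next_ };
    }

private:
    std::mutex  mutex_;
    std::size_t next_;
    std::size_t total_;
    std::size_t n_threads_;
};
```

Each worker thread would loop on `grab()` until it receives an empty range, running the tests in the returned range with its own backends.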

* SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* hexagon: add support for CONCAT with optimized concat_2d_transposed qwen3.5 models are quite heavy on the CONCAT with large and transposed src1. * hex-concat: use fastdiv in generic version * hex-concat: make checks for transposed a bit more readable * hex-concat: reoder dma ops for better pipelining * hex-cont/cpy: optimize CPY and CONT ops The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is non a constexpr. This causes runtime divs by that value which is normally just 4 or 2 (f32/f16). * hex-get-rows: optimize GET_ROWS for large rows We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+ rows). Also bump the DMA queue depth now that we can take advantage of it. * hex-concat: unroll the inner loops of concat_2d * hex-concat: more updates to concat_2d to improve perf a bit further * hex-cpy: fixed n_rows per thread checks in the copy ops * hmx-fa: fix alignment issues while computing dma sizes * hex-set-rows: add early returns for idle threads * hvx-rope: minor optimization to replace loops with fastdiv logic * hex-rope: replace scalar tail processing with HVX * hex-rope: optimize rope cache init with HVX Add hvx-utils sin/cos helpers that use an aprox method (similar to rsqrt, inverse, etc) Use the helpers to optimize ROPE.
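Several of the items above replace runtime scalar divisions with "fastdiv" logic. For readers unfamiliar with the idea, here is a minimal sketch of the classic multiply-and-shift division by a fixed divisor (Granlund and Montgomery style). It is a generic illustration under the stated preconditions; the hexagon backend's own helpers may be implemented differently, and the names below are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d, 1 <= d <= 2^31.
struct fastdiv_values {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
};

fastdiv_values init_fastdiv(uint32_t d) {
    assert(d >= 1 && d <= (1u << 31));
    uint32_t l = 0;
    while (l < 32 && (uint64_t(1) << l) < d) {
        l++;
    }
    // mp = floor(2^32 * (2^l - d) / d) + 1, which fits in 32 bits for this range of d.
    const uint32_t mp = uint32_t(((uint64_t(1) << 32) * ((uint64_t(1) << l) - d)) / d + 1);
    return { mp, l };
}

// Computes n / d with one 32x32->64 multiply, an add and a shift.
uint32_t fastdiv(uint32_t n, fastdiv_values v) {
    const uint64_t hi = (uint64_t(n) * v.mp) >> 32; // high half of the product
    return uint32_t((hi + n) >> v.l);               // 64-bit sum avoids overflow
}
```

The constants are computed once per divisor (for example once per tensor shape), so the inner loop pays for a multiply and a shift instead of a division.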

* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d * vulkan: skip conv2d bounds checks when shapes align with tile sizes * vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d * vulkan: stage cm2 conv2d accumulator through shmem before global store * vulkan: add coopmat1 conv2d path * fallback when using too much shared memory. clean up comments * Require 16x16x16 and subgroup size 32 or 64 * check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values

* ci : skip release workflow on master when commit message contains [no release] Assisted-by: llama.cpp:local pi * ci : restrict sanitizer builds to x86_64 + fix build type the spark is apparently too slow for some reason * tests : fix undefined warning [no ci]

…#23734) * ci : move [no release] check to dedicated check_release job Move the workflow-level \`if\` condition that skips builds when the commit message contains \`[no release]\` into a lightweight \`check_release\` job. All build jobs now depend on it via \`needs\` and check its output. This ensures the skip logic is evaluated at the job level rather than at the workflow level, which is the recommended approach for conditional jobs. Assisted-by: llama.cpp:local pi * cont : use `fast` runner

…l-org#23763) * ci : fix undefined sanitizer build to use Debug build type only * ci : ccache the server builds * cont : remove ui dependency + reuse ccache for both ubuntu jobs * tmp : force ccache save * Revert "tmp : force ccache save" This reverts commit a857b03. * cont : no need for node.js

* ci : server windows set build type explicitly * cont : try windows-2025 * ci : use llvm * cont : use ninja * cont : fix shell * ci : set number of jobs correctly * ci : fix windows with vulkan ccache by using llvm * ci : server ccache only on master * ocd : fix job names [no release]

* feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines

* wip * ok: lazy bitmap API * remember to free lazy text * wip * add mtmd_helper_video * support video input on server (base64 input) * add MTMD_VIDEO config * add timestamp * update CLI * cli: allow auto-completion for video * add --video arg * fix build * update docs * rename as suggested

…ator (ggml-org#24000) * Only run webgpu CI on my fork * Add webgpu only workflow * handle buffer overlap case for concat operator * restore build-webgpu.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Run clang-format * Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Reese Levine <reeselevine1@gmail.com>

A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache empty, so its kq_mask buffer stays null and asserts at load. Guard each mask on its own buffer in set_input and can_reuse, base and swa. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* models: update converter to support smaller assistants * models: add masked_embd tensors to gemma4-assist arch * gemma-4: remove temp debug for conversion * gemma-4-mtp: filter out masked_embedding tensors during conversion

* Always export idle slots to RAM Without this, a slot's VRAM cache may not be written to RAM. If this slot happens to be busy then later on, this triggers needless preprocessing in another slot. * cont : clean-up --------- Co-authored-by: Christoph Weiss <weiss@wsoptics.de> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cpu: add GGML_OP_COL2IM_1D Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. CPU uses a gather formulation parallelized over output channels, supporting F32, F16 and BF16 with an F32 accumulator. * tests: add backend coverage for GGML_OP_COL2IM_1D Add test_col2im_1d next to the conv_transpose_1d cases, covering F32, F16 and BF16 across eight geometries: the canonical kernel = 2*stride DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and p0 = stride/2), kernel < stride with zeroed gaps, kernel not a multiple of stride, and a single column unfold. Perf mode gets three real vocoder stage shapes reporting memory bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16. * cpu: harden GGML_OP_COL2IM_1D ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph build time, before the oc division, protecting every backend at once. The kernel asserts the contiguity its flat indexing assumes and its doc states the full output length including the crop term. The kernel parallelizes over the time axis: the split stays balanced down to OC = 1, where the previous channel split was single threaded. Values are bit identical on the three real vocoder chains, two out of three improve. * tests: extend the GGML_OP_COL2IM_1D grid The eval grid grows to eleven geometries: OC = 1 (mono output stage), K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and a crop down to T_out = 2 where all the gather bounds act at once. 
* tests: add col2im_1d equivalence test tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16 and BF16 through casts of the column matrix. test-backend-ops cannot cover this for a CPU only op since the CPU backend is its own reference there. * rpc: bump protocol patch version for GGML_OP_COL2IM_1D GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the op is appended and no existing op code shifts.
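To make the overlap-add step concrete, here is a small reference sketch of a 1D col2im using the output-length formula quoted above, T_out = (T_in - 1)*s0 + K - 2*p0. It is written as a plain scatter-add over `std::vector<float>`, whereas the commit describes the CPU kernel as a gather formulation. The memory layouts are assumptions chosen for this example and are not taken from the ggml implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Output length as stated in the commit message.
int64_t col2im_1d_out_len(int64_t t_in, int64_t k, int64_t s0, int64_t p0) {
    return (t_in - 1) * s0 + k - 2 * p0;
}

// Scatter-add reference. Assumed layouts (illustrative only):
//   col[t * (K*OC) + kk * OC + c]  input column t, kernel tap kk, channel c
//   out[c * T_out + pos]           channel c, output position pos
std::vector<float> col2im_1d_ref(const std::vector<float> & col,
                                 int64_t t_in, int64_t k, int64_t oc, int64_t s0, int64_t p0) {
    assert(s0 > 0 && oc > 0 && p0 >= 0);
    assert((int64_t) col.size() == t_in * k * oc);
    const int64_t t_out = col2im_1d_out_len(t_in, k, s0, p0);
    assert(t_out > 0);
    std::vector<float> out(t_out * oc, 0.0f);
    for (int64_t t = 0; t < t_in; ++t) {
        for (int64_t kk = 0; kk < k; ++kk) {
            const int64_t pos = t * s0 + kk - p0;  // position after cropping p0 samples
            if (pos < 0 || pos >= t_out) continue; // falls into the cropped border
            for (int64_t c = 0; c < oc; ++c) {
                out[c * t_out + pos] += col[t * (k * oc) + kk * oc + c];
            }
        }
    }
    return out;
}
```

For T_in = 3, K = 4, s0 = 2, p0 = 0 the output has 8 samples, and positions covered by two overlapping kernels receive the sum of both contributions.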

…and Flash Attention (ggml-org#24123) * vulkan: add support for valve fp16 dot2 extension * use macro for dot2 path choice * properly check for the feature * add dot_product abstraction to reduce preprocessor branching

* ui: add opt-in run_javascript frontend tool Expose a run_javascript tool to the model, executed entirely in the browser through the existing agentic loop. Code runs in a Web Worker inside a sandboxed iframe with an opaque origin, isolated from the WebUI and its API. Console output, errors and the return value are fed back as the tool result. The parent enforces a hard timeout by removing the iframe, which terminates the worker. Disabled by default, toggle in Settings > Developer. * ui: address review feedback from allozaur Use the JsonSchemaType enum for the tool definition parameter types instead of raw string literals, extending it with STRING and NUMBER. Move the worker shim and the iframe harness html into their own files so the service no longer carries inline source blobs. Replace the remaining magic strings with constants: SANDBOX_EMPTY_OUTPUT and SANDBOX_TRUNCATION_NOTICE, and reuse NEWLINE_SEPARATOR for joins. * ui: move sandbox worker shim to a raw imported file Replace the inline worker template string with a real sandbox-worker.js imported as raw text, and build the iframe harness from it in sandbox-harness.ts. The raw worker ships as a string, not a module, so it is excluded from eslint and the typecheck program.

… when deepstack is not used (ggml-org#24357) * llama-graph : apply embedding scale when deepstack is not used * nits: remove non-existant hunyuan-vl from the tests * apply suggestion from @gabe-l-hart --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- Restore refusal content type handling

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to ALL streaming events
- Full response object in response.created/in_progress events
- Accept input_text type and EasyInputMessage for multi-turn input
- output_text convenience field, output_tokens_details

14 pytest tests, E2E tested with async OpenAI SDK and Codex CLI.

Refs: ggml-org#19138, ggml-org#19720, ggml-org#21174

Cherry-pick of ggml-org#20819 by European-tech. Persist context checkpoints in a companion .checkpoints file alongside slot saves. Without this, restoring a slot for hybrid/recurrent models triggers full prompt reprocessing (23s for 26K tokens on Qwen3.5-27B). With checkpoint persistence, restore takes 75ms. Binary format with magic 0x4C4C4350 ("LLCP"), versioned, backward compatible (old saves without companion file load normally).
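The magic value and the backward-compatibility rule can be illustrated with a short sketch. Only the constant 0x4C4C4350 and its "LLCP" spelling come from the commit message; the status enum, the function names, and the way the version is compared are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Magic value quoted in the commit message.
constexpr uint32_t LLCP_MAGIC = 0x4C4C4350;

// Spell a 32-bit magic as four ASCII characters, most significant byte first.
std::string magic_to_string(uint32_t magic) {
    std::string s(4, '\0');
    for (int i = 0; i < 4; ++i) {
        s[i] = char((magic >> (8 * (3 - i))) & 0xFF);
    }
    return s;
}

// Hypothetical header check: a missing companion file is treated as
// "no checkpoints", which is how old saves keep loading normally.
enum class header_status { absent, ok, bad_magic, unsupported_version };

header_status check_header(bool file_present, uint32_t magic, uint32_t version, uint32_t max_version) {
    if (!file_present)         return header_status::absent;
    if (magic != LLCP_MAGIC)   return header_status::bad_magic;
    if (version > max_version) return header_status::unsupported_version;
    return header_status::ok;
}
```

Reading the bytes of 0x4C4C4350 from most to least significant gives 'L', 'L', 'C', 'P'.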

chatgpt-codex-connector Bot reviewed May 3, 2026

jeffbolznv and others added 29 commits May 26, 2026 07:57

models : Attach Mistral3 NVFP4 weight scales (ggml-org#23629)

6fe90de

convert : support Gemma4ForCausalLM architecture (ggml-org#23682)

dbe9c0c

* convert : support Gemma4ForCausalLM architecture (ggml-org#23674) * fix indent --------- Co-authored-by: Oleg Afonin <your.email@example.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : reduce (disable SYCL and CANN builds/releases) (ggml-org#23705)

3dc7684

* ci : reduce [no ci] * cont : disable sycl, cann + rename caches [no ci] * cont : cann [no ci]

ci : move sanitizer jobs to self-hosted runners (ggml-org#23713)

ef41a69

ci : move more CPU jobs to self-hosted runners (ggml-org#23715)

678d43d

ci : remove vulkan SDK dep from webgpu job (ggml-org#23718)

3a3ed15

* ci : remove vulkan dep from webgpu build * cont : add ccache to `ubuntu-24-webgpu-wasm` * ci : fix name + add wasm test

ci : move macos jobs to the apple workflow + fix names (ggml-org#23721)

5190c2e

ci : do not allocate ccache for 3rd-party hosted runners (ggml-org#23730)

0d18aaa

* ci : do not allocate ccache for 3rd-party hosted runners [no release] * cont : add prints [no ci] [no release]

ggml-zendnn : fixed naming of matmul function (ggml-org#20964)

b4c0549

* ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>

server : fix the log message when using SSL (ggml-org#23393)

7085492

When llama-server is started with SSL key and cert, the log says that it listens on http instead of https. This patch fixes this.

convert: add MiniCPM5 tokenizer support (ggml-org#23384)

9777256

Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers. Co-authored-by: zhangtao <zhangtao2@modelbest.cn>

docs : fix duplicated "the" in granitevision and model-conversion docs (ggml-org#23767)

1d971bb

Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>

vulkan: avoid preferring transfer queue on AMD UMA devices (ggml-org#22455)

4d8cc0c

ci : remove wasm test (ggml-org#23733)

b3a739c

* run tests in correct build folder * remove wasm test

common : fix env names to all have LLAMA_ARG_ prefix (ggml-org#23778)

6b4e4bd

ci : bump cuda release to 13.3 (ggml-org#23749)

2d0656f

CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (ggml-org#23742)

fda8528

pyproject : add conversion folder and update dependencies (ggml-org#23746)

87b0a60

* add conversion folder and update dependencies * limit python version for triton * update dev-dependencies section

vendor : update cpp-httplib to 0.46.0 (ggml-org#23650)

617255d

ci : move ARM jobs to self-hosted + disable kleidiai mac release (ggml-org#23780)

ba4dd0b

* ci : move ARM jobs to 3rd-party runners + disable kleidiai release * cont : fix deps + fix names * ocd : fix names * cont : fix PR links

ngxson and others added 29 commits June 8, 2026 11:11

cli: fix spinner not show during prompt processing (ggml-org#24283)

715b86a

ggml : bump version to 0.14.0 (ggml/1533)

6a1de6f

sync : ggml

c2b1518

docker: install ffmpeg in the released image (ggml-org#24302)

3ebe862

[ggml-webgpu] Implement 2D workgroups for scale, binary, and unary ops (ggml-org#24044)

3b3da01

* Only run webgpu CI on my fork * Add webgpu only workflow * Implement 2d workgroups for more operations * fix * Fix type * Move back to global_invocation_id

server : do not parse when flushing http headers (ggml-org#24281)

42a0afd

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (ggml-org#24225)

1e1aca0

* ggml-webgpu: Improve prefill speeds + refactor matmul for quants * Fixes for editroconfig checker

ggml-webgpu: Add clang-format job (ggml-org#24308)

3ac3c20

* Add clang-format job * try local formatting

Remove case for GGML_TYPE_Q4_K in mvvq.cu (ggml-org#23528)

e3471b3

ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (ggml-org#24305)

fd3271e

* ggml-cpu : fix rms_norm_back wrong output under in-place aliasing * cont : clean-up comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

models : fix plamo2 attention_key/value_length regression (ggml-org#24317)

f0152ef

ui: fix mobile chat form overflow and bust stale bundle cache (ggml-org#24158)

efbacf8

server: log prompts to directory (ggml-org#22031)

1e91256

* server: log prompts to directory Add `--log-prompts-dir` to write each prompt to a separate text file in the specified directory. * Apply suggestion from @ngxson --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

mtmd: refactor video subproc handling (ggml-org#24316)

9682e35

* mtmd: refactor video subproc handling * Update tools/mtmd/mtmd-helper.cpp Co-authored-by: Mikko Juola <mikjuo@gmail.com> --------- Co-authored-by: Mikko Juola <mikjuo@gmail.com>

ui: Fix excessive style recalculation on hover (ggml-org#24243)

ae735b1

vulkan: reduce iq1 shared memory usage for mul_mm (ggml-org#24287)

d6d0ce8

mtmd: build_vit batching (ggml-org#24352)

49f3542

ci : fix windows release (ggml-org#24369)

e25a32e

krystophny force-pushed the master branch from d72b081 to 43fb8c0 June 9, 2026 18:33

michaelw9999 wants to merge 594 commits into michaelw9999:full-openai-responses from krystophny:master

20 participants
