Skip to content

feat: add selectable context window (200K/1M) in model picker#156

Open
rbinar wants to merge 3 commits into
Vizards:mainfrom
rbinar:feat/selectable-context-window
Open

feat: add selectable context window (200K/1M) in model picker#156
rbinar wants to merge 3 commits into
Vizards:mainfrom
rbinar:feat/selectable-context-window

Conversation

@rbinar

@rbinar rbinar commented Jun 12, 2026

Copy link
Copy Markdown

What

Adds a "Context Window" dropdown in the Copilot Chat model configuration panel, letting users switch between 200K and 1M token context windows per model — just like the existing thinking effort dropdown.

Changes

  • Combined reasoningEffort and contextSize into a single configurationSchema so both controls appear in the model picker
  • Each selectable window maps to an input/output token split whose sum equals the advertised window, because VS Code/Copilot derives the displayed context window from maxInputTokens + maxOutputTokens:
    • 1M655,360 + 393,216 = 1,048,576 — DeepSeek's official combined limit; this is byte-for-byte the split from Fix DeepSeek V4 reported context window #71, so the default path keeps the corrected total accounting
    • 200K125,000 + 75,000 = 200,000 — same 5:3 input:output reservation, scaled down
  • Base model metadata restored to 655,360 / 393,216, so only selecting 200K changes the split
  • Dropdown choice synced back to the deepseek-copilot.contextSize workspace setting after each request, persisting across sessions
  • English + Chinese i18n for dropdown labels/descriptions

Why

Previously the extension reported a single fixed context window with no way to choose. This gives users control over the speed/cost-vs-capacity trade-off (see the context-rot benchmark in the thread for why a focused 200K window can outperform 1M on retrieval), while keeping the reported total within DeepSeek's real 1M limit and preserving the #71 accounting fix.

Adds a "Context Window" dropdown in the Copilot Chat model
configuration panel, letting users switch between 200K and
1M token context windows per model.

- Combine reasoningEffort and contextSize into a single
  configurationSchema so both controls appear in the
  model picker dropdown (non-public VS Code API, same
  mechanism used by Copilot for thinking effort).
- maxInputTokens raised from 656K to 1,000,000 to support
  1M mode.
- maxOutputTokens set to 128,000 — a conservative middle
  ground that keeps the displayed context window reasonable
  (200K input + 128K output = 328K, 1M input + 128K output
  = ~1.1M) without sacrificing DeepSeek's actual generation
  headroom. This does NOT affect API-level max_tokens, which
  is controlled by a separate VS Code setting.
- Sync dropdown choice back to workspace setting
  (deepseek-copilot.contextSize) after each request so
  the value persists across sessions.
- Both DeepSeek V4 Flash and Pro supported.
- English and Chinese i18n for dropdown labels and descriptions.

Why: the default 656K input window was too small for long
sessions, while always reserving 1M adds latency and cost.
Giving users control lets them pick the right trade-off between
speed/cost (200K) and capacity (1M).
@rbinar rbinar force-pushed the feat/selectable-context-window branch from 6ab12af to 73b591d Compare June 12, 2026 22:26
@Vizards

Vizards commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Hi @rbinar Thanks for working on this. The implementation direction looks aligned with the model configuration mechanism VS Code/Copilot exposes for context size selection.

I have one concern about the rationale for choosing 200K as the secondary context size. Since DeepSeek API already provides a 1M context window at the same base price, it would be helpful to understand what evidence led to 200K specifically: is it based on DeepSeek guidance, Copilot harness behavior, latency measurements, task-quality benchmarks, or observed failure modes with larger contexts?

There is also a cost-related concern here. A major part of DeepSeek’s cost efficiency comes from prefix caching. If reporting a 200K context window causes Copilot to compact/summarize earlier, the request prefix may change more often and previously cacheable context could be lost. That may silently increase the user’s effective API cost, even if the nominal 200K vs 1M pricing is the same.

Could we add some measurements or explanation comparing 200K vs 1M, for example task success rate, compaction frequency, latency, input/cache-hit tokens, and effective cost? Without that, it is hard to evaluate whether 200K is the right trade-off or whether it may accidentally make long-running agent sessions worse.

One more compatibility note: this extension currently supports VS Code as low as 1.116. Since context-size model configuration is relatively new and the behavior may differ across VS Code versions, we probably need additional forward/backward compatibility testing around 1.116+ to make sure the dropdown, request options, and fallback behavior all work as expected.

@rbinar

rbinar commented Jun 15, 2026

Copy link
Copy Markdown
Author

Context Rot Analysis: Why 200K Outperforms 1M

The "Lost-at-the-Edges" Effect

The benchmark reveals a specific degradation pattern in the 1M context window — the model does not uniformly forget information. Instead, accuracy varies dramatically by needle position:

Position 200K (Focused) 1M (Full) Δ
Head (10%) 100% 57.1% −42.9pp
Middle (50%) 85.7% 85.7% 0pp
Tail (90%) 85.7% 71.4% −14.3pp

What This Means

  1. 1M context causes catastrophic forgetting at the beginning. The 57.1% head accuracy means that in 3 out of 7 items, the model could not retrieve a value placed in the first 10% of a 1M-token context — even though the same needle at the same relative position was retrieved perfectly (100%) in a 200K context.

  2. The middle holds up surprisingly well — both conditions score 85.7% at the 50% depth. This is consistent with the "lost-in-the-middle" literature: the middle is always the hardest region, but 200K doesn't make it worse.

  3. Tail degradation is moderate but real — 1M tail accuracy drops to 71.4% (vs 85.7% for 200K). The model struggles with both extremes in large contexts.

The Root Cause: Attention Dilution

In a 1M-token context, the attention mechanism must distribute its limited capacity across ~1,000,000 tokens. The filler content (config documentation, log excerpts, JSON schemas) competes with the actual needles for attention weight. At 200K tokens, the signal-to-noise ratio is 5× higher — the needle stands out more clearly because there is less irrelevant content around it.

The head position is particularly vulnerable in 1M because:

  • The model processes the prompt sequentially
  • By the time it reaches the user's question at the end, the head-positioned needle is "far away" in both token distance and attention layers
  • In 200K, the head needle is only ~20K tokens from the start — still within the model's effective attention range

Practical Implication for Copilot

Coding sessions place critical context at the beginning (project structure, file layout, imports) and require the model to reference it throughout. If the 1M window loses track of head-positioned information 42.9% of the time, this could manifest as:

  • Forgetting which imports are available
  • Losing track of the project structure defined early in the conversation
  • Misunderstanding initial user instructions after the conversation grows long

200K avoids this by keeping the context dense enough that even early information remains within the model's effective attention radius.


Benchmark Methodology & Full Results

Hypothesis

200K does not provide lower cost — compaction breaks the prefix cache, driving effective cost higher. 200K is a quality-vs-cost trade-off, not a "cheap mode". Default → 1M, 200K as an explicit, labeled option.

Methodology

3-tier benchmark (bench/runner.py), all using deepseek-v4-flash:

Tier Purpose Configuration
Tier 1 — Cache/Cost Measure cache-hit ratio & effective cost 10-turn conversations × 2 runs × 2 conditions (1M no-compaction vs 200K with compaction at 180K tokens). 160K token prefix, unique UUID sentinels per condition/run for cache isolation.
Tier 2 — Quality/Context-Rot Needle-in-a-haystack accuracy at 3 depths 7 items × 3 depths (head=10%, middle=50%, tail=90%) × 1 trial × 2 conditions (200K focused vs 1M full). Needles: database_config, cache_ttl, api_rate_limit, batch_size, retry_count. Judge: deepseek-v4-flash at temperature=0.
Tier 3 — Copilot A/B Real extension-host cache metrics Skipped — requires PR #158 panel instrumentation.

Model: deepseek-v4-flash (budget mode)
Pricing: cache-hit=$0.0028/M, cache-miss=$0.14/M, output=$0.28/M
Budget cap: $5.00 | Actual spend: $2.04
Total tests: 40 (Tier 1) + 42 (Tier 2) = 82 API requests

Results Summary

Metric 1M (no compaction) 200K (compaction)
Cache-hit ratio (avg) 99.3% 96.1–96.8%
Compaction events / session 0 3 (turns 7–9)
Cache-miss tokens (total) 169,948 187,939
Effective cost / session $0.0291 $0.0316 (+8.6%)
Avg latency / request 8,102 ms 8,408 ms (+3.8%)
Task accuracy (overall) 71.4% 90.5% (+19.1pp)
└ Head (beginning) 57.1% 100%
└ Middle 85.7% 85.7%
└ Tail (end) 71.4% 85.7%

Tier 1 — Per-Run Cache Data

1M (no compaction)

Run1: hit=1,464,960 miss=169,935 out=4,288 avg_lat=7,951ms cost=$0.0291
Run2: hit=1,465,856 miss=169,961 out=4,541 avg_lat=8,253ms cost=$0.0292

200K (compaction at 180K tokens)

Run1: hit=1,439,232 miss=187,371 out=4,112 avg_lat=7,886ms cost=$0.0314  (3 compactions)
Run2: hit=1,439,744 miss=188,507 out=4,827 avg_lat=8,930ms cost=$0.0318  (3 compactions)

Tier 2 — Per-Item Accuracy

Focused (200K) — 19/21 correct (90.5%)

Item 1: ✓✓✓  Item 2: ✓✓✓  Item 3: ✓✓✓  Item 4: ✓✓✓
Item 5: ✓✗✗  Item 6: ✓✓✓  Item 7: ✓✓✓

Full (1M) — 15/21 correct (71.4%)

Item 1: ✓✓✓  Item 2: ✓✓✓  Item 3: ✗✓✓  Item 4: ✗✓✓
Item 5: ✗✓✗  Item 6: ✓✓✗  Item 7: ✓✗✓

Key Findings

  1. 💰 200K costs MORE, not less — Compaction at ~180K tokens breaks the prefix cache. Cache-hit drops from 99.3% → 96.5%, and re-filling the cache costs more than the tokens saved. Per-session cost increases by ~8.6% ($0.0316 vs $0.0291).

  2. 🎯 200K is significantly more accurate — Focused 200K context achieves 90.5% accuracy vs 71.4% for 1M. The 1M window shows severe context degradation, especially at the head position (57.1%).

  3. ⏱️ Latency difference is negligible — ~300ms difference per request is not user-perceptible.

  4. 📊 Statistical note — 21 items per condition (1 trial each). Tier 2 focused: 90.5% ± 12.5% (95% CI). Tier 2 full: 71.4% ± 19.3% (95% CI). Accuracy gap direction is clear; exact magnitude has uncertainty due to sample size.

Recommendation

  • Default: 1M (lower cost, adequate for short conversations)
  • 200K as "Focused Mode" — explicit, labeled option for sessions where accuracy on early context matters (long refactoring, multi-file edits, complex architectural discussions)
  • 200K is not a cost-saving mode — it's a quality preference with a small cost premium due to cache effects

Caveats

  • DeepSeek cache is best-effort and time-volatile
  • Tier 1 compaction is a simulation; real Copilot compaction timing differs
  • 200K assumption based on literature's "safe zone" for coding tasks (most real coding fits ≤128K)
  • Prices from pricing.json, not hardcoded

@Vizards

Vizards commented Jun 17, 2026

Copy link
Copy Markdown
Owner

One more important context: we previously fixed this exact accounting issue in #71.

That PR changed the metadata from maxInputTokens: 1048576 to 655360 while keeping maxOutputTokens: 393216, because VS Code/Copilot derives the displayed context window from maxInputTokens + maxOutputTokens. The goal was to make the reported total match DeepSeek’s official 1M context window instead of reporting more than 1M.

This PR changes the metadata to:

  • maxInputTokens: 1000000
  • maxOutputTokens: 128000

which reports roughly 1.128M total context to VS Code/Copilot. Could you clarify the source for this new split and whether DeepSeek’s 1M limit is input-only or input+output combined?

If the official limit is still 1M total, then this seems to partially revert the fix from #71 and may cause Copilot to over-budget before compaction. I think the context-size selector should preserve correct total-context accounting unless we have a clear source or test showing that 1M input + 128K output is actually supported.

VS Code/Copilot derives the displayed context window from
maxInputTokens + maxOutputTokens. The selector reported
1,000,000 + 128,000 ≈ 1.128M for the 1M option, partially reverting the
accounting fixed in Vizards#71. Restore the model default to 655,360 + 393,216
(= 1,048,576, DeepSeek's official combined input+output limit) and map
each selectable window to an input/output split that sums to the
advertised total (200K → 125,000 + 75,000), preserving the same 5:3
reservation ratio.
@rbinar

rbinar commented Jun 17, 2026

Copy link
Copy Markdown
Author

Thanks @Vizards — agreed, and you're right that it partially reverted #71. Fixed in ed441a6.

The root cause was treating the dropdown value as maxInputTokens directly while keeping a separate maxOutputTokens, so the reported total (input + output) drifted above 1M. The selector now maps each window to an input/output split whose sum is the advertised total:

  • 1M655,360 + 393,216 = 1,048,576 — byte-for-byte the Fix DeepSeek V4 reported context window #71 split, so the default path matches DeepSeek's official 1M exactly.
  • 200K125,000 + 75,000 = 200,000 — same 5:3 input:output reservation, scaled down.

On the input-only vs combined question: I don't have a public DeepSeek source that splits the 1M into separate input/output budgets, so I deliberately kept #71's assumption that 1M is the total (combined) window rather than introduce a larger total. The base model metadata is back to 655,360 / 393,216; only selecting 200K changes the split. If DeepSeek ever documents a larger combined (or input-only) limit, we can bump it in a follow-up with that source.

@rbinar

rbinar commented Jun 17, 2026

Copy link
Copy Markdown
Author

Addressing the 1.116+ forward/backward compatibility concern from your first comment.

tl;dr — the fallback is safe on every supported version. The worst case is the dropdown not rendering, which silently degrades to the existing 1M default with no regression.

Behaviour at each layer

Layer Non-public field Behaviour if VS Code / Copilot doesn't support it
toChatInfo sets configurationSchema Not in 1.116 official typings Copilot Chat ignores unknown fields on LanguageModelChatInformation. Dropdown just doesn't appear; the model still works.
provideLanguageModelChatResponse reads modelConfiguration?.contextSize Not in 1.116 ProvideLanguageModelChatResponseOptions options.modelConfiguration is undefinedgetConfiguredContextSize returns 1000000 (1M). Default path.
getContextSize() reads deepseek-copilot.contextSize Standard vscode.workspace.getConfiguration Works on every VS Code version. Users can pick 200K via Settings UI even when the picker dropdown isn't rendered.
Context-size sync writes back to the workspace setting after each request Same standard API Works everywhere. When modelConfiguration isn't forwarded, the guard prepared.configuredContextSize !== currentContextSize is 1000000 !== 1000000 = false, so there's no spurious write.

Why this is already-validated territory

This is the same mechanism the existing reasoningEffort dropdown already ships with — identical ModelConfigurationOptions type augmentation and the same options.modelConfiguration?.<field> read pattern. So the context-size selector inherits exactly the same version-compatibility profile as thinking-effort, which is already in production. There's no new API boundary introduced here.

Net result

  • Dropdown not rendered (older Copilot/VS Code): falls back to the deepseek-copilot.contextSize setting, default 1M → behaviour identical to pre-PR. No regression.
  • Dropdown rendered but untouched: same 1M default.
  • Dropdown used: 200K/1M split applied, both totals stay within DeepSeek's 1M (per the Fix DeepSeek V4 reported context window #71 accounting fix above).

Every access to the non-public fields is optional-chained against unknown shapes, so there's nothing to guard beyond what's already in place. The fallback path is the safe default at every layer.

Resolve conflict in src/provider/models.ts: main added
isBYOK: true to toChatInfo return, while this branch
replaced individual maxInputTokens/maxOutputTokens with
...resolveContextWindow(). Keep both.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants