[pull] main from langwatch:main #279

Open
pull[bot] wants to merge 1741 commits into erickirt:main from langwatch:main

@pull pull Bot commented Dec 17, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

@pull pull Bot locked and limited conversation to collaborators Dec 17, 2025
@pull pull Bot added ⤵️ pull merge-conflict Resolve conflicts manually labels Dec 17, 2025
@rogeriochaves rogeriochaves force-pushed the main branch 5 times, most recently from 1e7b14c to 2209258 Compare January 21, 2026 01:15
drewdrewthis and others added 20 commits May 5, 2026 08:12
…3808)

Phase 2 of #3458 (parity grinder) — setup domain.

Bound to existing tests via JSDoc @Scenario:
  simplified-setup.feature (2/12):
    * "Shows Integration configured alert when firstMessage exists"
      and "Shows Waiting for messages when no firstMessage" —
      bound to the corresponding cases in
      project.getHasFirstMessage.unit.test.ts. The render-side
      condition is driven entirely by this procedure's return value.

Kept @unimplemented (with justifying comments — needs harness):
  simplified-setup.feature (10):
    * Setup page UI rendering (API-key visibility/copy, endpoint
      section, SDK guides, "section X is not shown" assertions, copy
      tracking) — no JSDOM render fixture for the setup page yet.

  docker-dev-worktree-isolation.feature (6):
    * Dev-shell + port-allocator behaviour. Tests for the TypeScript
      port allocator + bash worktree helpers exist in
      `packages/server/test/` and `scripts/__tests__/*.bats`, but
      neither path is in the parity check's DEFAULT_TEST_ROOTS,
      so any JSDoc binding there would not be discovered.

specs/setup net @unimplemented: 18 → 18 (-2 bound, 16 justified).

`pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of #3458 (parity grinder) — navigation domain.

Bound to existing tests via JSDoc @Scenario:
  child-drawer-nesting.feature (5/10):
    * SuiteFormDrawer → scenario editor child drawer (open + close
      + form-state preservation): bound to the suite-editor
      integration test cases that exercise that exact path.
    * SuiteFormDrawer → agentHttpEditor child drawer (open + close):
      bound to the matching cases in the same file.

Kept @unimplemented (with justifying comments — needs harness):
  child-drawer-nesting.feature (5):
    * ScenarioRunDetail → trace child drawer end-to-end flow
      (composition is exercised but not the full open/close cycle).
    * "Open a drawer from a page" (root-level) — implicit in every
      standalone drawer test but no dedicated assertion.
    * Escape-only-closes-topmost (semantics inherited from Chakra
      Drawer overlay stack; not asserted directly).
    * E2E "new scenario appears in picker" (no Playwright fixture).

  home-navigation.feature (10):
    * All page-level routing/active-state scenarios: MainMenu
      sidebar has no JSDOM render test, no Playwright/E2E suite
      for project routing. Cheap follow-up.

specs/navigation net @unimplemented: 20 → 20 (-5 bound, 15 justified).

`pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cked (#3812)

test(python-sdk): document why 34 @unimplemented scenarios stay unbound

Phase 2 of #3458 (parity grinder) — python-sdk domain.

All 34 `@unimplemented` scenarios describe Python SDK behaviour with
pytest cases under `python-sdk/tests/`. The check-feature-parity
script's TEST_FILE_RE only matches `*.test.ts/tsx`, so pytest cases
cannot bind via JSDoc today.
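
A minimal sketch of why the pytest files are invisible to the checker, assuming the file filter is a suffix match along the lines described above. The pattern below is a hypothetical stand-in, not the actual `TEST_FILE_RE` from the script.

```typescript
// Hypothetical stand-in for the checker's TEST_FILE_RE; the real pattern may differ.
const TEST_FILE_PATTERN: RegExp = /\.test\.tsx?$/;

// A file can carry scenario bindings only if the checker would scan it at all.
function isDiscoverable(path: string): boolean {
  return TEST_FILE_PATTERN.test(path);
}

const candidatePaths: string[] = [
  "project.getHasFirstMessage.unit.test.ts",
  "CreateProjectDrawer.test.tsx",
  "python-sdk/tests/test_experiment.py",
];

// Keeps the two TypeScript test files and drops the pytest file.
const discoverablePaths: string[] = candidatePaths.filter(isDiscoverable);
console.log(discoverablePaths);
```

Under this assumption, any binding written in a pytest docstring is never read, which is why the scenarios stay tagged with a header note instead.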

Affected feature files (all add a single header comment so the
reason is durable rather than orphaned `@unimplemented` tags):
  * experiment-print-summary.feature (8): tests in
    python-sdk/tests/test_experiment.py + test_examples.py.
  * prompt-tags.feature (15): tests in
    python-sdk/tests/prompts/test_prompt.py +
    test_fetch_policies.py.
  * async-experiment-parallelism.feature (11): async-loop /
    asubmit / loop-bound-resource cases tracked separately in the
    pytest suite.

specs/python-sdk net @unimplemented: 34 → 34 (0 bound, 34 justified).

Cheap structural fix tracked separately: extend
check-feature-parity.ts to read pytest docstring tags so
python-sdk and python-server scenarios can bind too.

`pnpm check:feature-parity` passes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
)

Phase 2 of #3458 (parity grinder) — projects domain.

Bound to existing tests via JSDoc @Scenario:
  create-project-drawer.feature (5/22):
    * "Project name is required" + "Project name with only
      whitespace is invalid" — bound to ProjectForm.unit.test.ts
      validation cases.
    * "Show new team name field when creating new team" + "New team
      name is required when creating team" — bound to the
      requires-new-team-name validation case.
    * "Drawer receives correct organization when opened from
      different org" — bound to CreateProjectDrawer.test.tsx
      multi-org organizationId-prop case.
    * "Allow creation when plan has override enabled" — bound to
      project.create.planLimit.integration.test.ts override case.

  project-creation-flow.feature (1/12):
    * "Project creation calls correct API endpoint" — partially
      bound to the multi-org organizationId test (asserts mutation
      call payload).

Kept @unimplemented (with justifying comments — needs harness):
  create-project-drawer.feature (17): page-render flows for entry-
    point clicks (settings/team/dropdown), drawer chrome, close
    mechanisms, team selector, loading state. No JSDOM render
    fixture covers these.
  project-creation-flow.feature (11): success/error toast, dup-name
    handling, drawer close, redirect, form reset, tracking event —
    need a happy-path component test wiring drawer + mutation.
  project-list-refresh.feature (8): query invalidation assertions +
    list/dropdown immediate-render assertions. Cheap follow-up:
    assert `organization.getAll.invalidate` + `limits.getUsage
    .invalidate` fire on the existing mutation test.

specs/projects net @unimplemented: 42 → 42 (-7 bound, 35 justified).

`pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cenarios (#3815)

Slice C2 of features-domain parity grinder (#3458). Maps KEEP-class
@unimplemented scenarios in marketing/onboarding/billing feature files to
existing tests via /** @Scenario "<title>" */ JSDoc bindings. Removes the
@unimplemented tag for each newly-bound scenario. UPDATE-class scenarios
(implementation diverged from spec wording) stay tagged + are listed in
each feature file's header comment for future scenario rewrites.
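
To make the binding format concrete, here is a rough sketch of how `/** @Scenario "<title>" */` comments could be collected from a test file. The regular expression and the function name are assumptions for illustration; the parity script's real parser is not shown in this PR. The second sample title is hypothetical and shows a single-quoted binding around a title that contains double quotes.

```typescript
// Collects scenario titles from JSDoc bindings in a test file's source text.
// Accepts double- or single-quoted titles; the tag match is case-insensitive.
// This is an illustrative parser, not the one used by check-feature-parity.
function extractBoundTitles(source: string): string[] {
  const tag = /\/\*\*\s*@scenario\s+(?:"([^"]+)"|'([^']+)')\s*\*\//gi;
  const titles: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = tag.exec(source)) !== null) {
    const title = match[1] ?? match[2];
    if (title !== undefined) titles.push(title);
  }
  return titles;
}

const sampleTestFile: string = [
  '/** @Scenario "Project name is required" */',
  'it("rejects an empty name", () => {});',
  "/** @Scenario 'Renames a column called \"id\"' */", // hypothetical title
  'it("renames the id column", () => {});',
].join("\n");

console.log(extractBoundTitles(sampleTestFile));
```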

Per-file:
  - beta-pill.feature: 5 bound, 1 PARTIAL (Suites sidebar — no integration test)
  - signup-slack-notifications.feature: all 5 bound
  - onboarding/welcome-screens.feature: all 4 bound
  - onboarding/mcp-setup-prompt-compatibility.feature: 2 bound, 1 UPDATE
  - pricing-model-aware-free-plan.feature: 6 bound, 6 spans gap
  - settings-plans-comparison.feature: 5 bound, 4 UPDATE
  - trace-limit-upgrade-message.feature: 0 bound, 4 UPDATE (wording diverged)
  - customer-io-nurturing-integration.feature: 56 bound, 8 UPDATE

Bumps `pnpm check:feature-parity` enforced count by 83.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…DK (#3818)

test(typescript-sdk): bind 28 @unimplemented scenarios

Phase 2 of #3458 (parity grinder) — typescript-sdk domain.

Pre-classified by `specs/typescript-sdk/AUDIT_MANIFEST.md`.

Bound to existing tests via JSDoc @Scenario:
  cli-docs.feature (11/11):
    * docs index/relative path/leading-slash/dup-prefix scenarios —
      bound to docs.unit.test.ts normalizeDocsUrl + docsCommand
      cases.
    * scenario-docs index + relative path + dup-prefix — bound to
      same file's scenario describe block.

  cli-error-handling.feature (6/6):
    * "Sync surfaces specific conflict message" — cli-error-
      propagation.integration.test.ts.
    * "API errors surface meaningful message" + "fall back to raw
      JSON payload" — same file.
    * "Invalid API key returns clear authentication error" +
      "Network errors surface underlying cause" — cli-error-edge-
      cases.integration.test.ts.
    * "Common error conditions map to actionable messages" — the
      describe block of cli-error-propagation-commands.integration
      .test.ts (covers all five command rows in the spec outline).

  experiment-print-summary.feature (4/4):
    * Bound to experiment-print-summary.test.ts pass/fail/exit
      cases.

  prompt-tags.feature (7/24):
    * Fetch by tag via options + tag in cache key + different tags
      produce different cache entries + unassigned tag returns
      error + shorthand syntax passthrough + fetch without tag
      returns latest — bound to prompts.facade.unit.test.ts.
    * List tags GET / returns built-in+custom + Create custom tag
      via SDK + Delete custom tag via SDK — bound to
      prompt-tag-crud.unit.test.ts.
    * Assign tag to existing version + assignment confirmation —
      bound to prompt-tags.unit.test.ts.

Kept @unimplemented (with justifying header notes — needs harness):
  cli-prompt-tags.feature (28/30 stay unimpl):
    * SDK-level renameTag rows handled by sister PR #3696 — not
      duplicated here.
    * CLI command-level rows live under
      `typescript-sdk/src/cli/commands/{prompt,tag}/__tests__/`;
      tests exist but aren't @scenario-bound yet (each one needs a
      pass-through). 4 rows are marked DUPLICATE/UPDATE in the
      manifest and stay unimpl by design.

  prompt-tags.feature (17/24 stay unimpl):
    * `prompts.create`/`prompts.update` request-body shape with a
      tags list — covered by client-factory tests.
    * E2E flows under `typescript-sdk/e2e/prompts/prompt-tags.
      e2e.test.ts` are `it.skipIf(no API key)` — cheap follow-up
      adds @Scenario markers to those skip cases.

specs/typescript-sdk net @unimplemented: 75 → 75 (-28 bound, 47 justified).

`pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NIFEST (#3819)

test(variables-ui): document why 78 @unimplemented scenarios stay unbound

Phase 2 of #3458 (parity grinder) — variables-ui domain.

All 78 `@unimplemented` scenarios across the three feature files
are pre-classified KEEP/UPDATE in
`specs/variables-ui/AUDIT_MANIFEST.md`. The underlying tests exist:

  * VariablesSection.test.tsx
  * VariableMappingInput.test.tsx
  * VariableTypeIcon.test.tsx
  * TargetVariablesPanel.test.tsx
  * VariableInsertMenu.test.tsx
  * PromptTextAreaWithVariables.test.tsx
  * liquidVariableExtraction.unit.test.ts

But each `it(...)` case still needs an individual JSDoc `@scenario`
marker — a mechanical pass per spec scenario. Adding header notes
to each feature file pointing at the manifest + test files so the
follow-up is unambiguous.

5 rows in `variable-insertion-menu.feature` are UPDATE-class (spec
diverges from source — e.g. search input only in buttonMenuMode,
badges show "Text"/"Number" not "STRING"/"OBJECT") and need spec
rewriting before binding.

specs/variables-ui net @unimplemented: 78 → 78 (0 bound, 78 justified).

`pnpm check:feature-parity` passes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tform parity (#3820)

* test(features): bind 51 + delete 15 + drop 1 file from enterprise/platform parity

Slice C3 of features-domain parity grinder (#3458). Mixed clearing actions
across enterprise/auth/platform feature files:

Bindings (51 scenarios):
  - enterprise-feature-guards.feature: 23 DUPLICATE scenarios bound
  - webhook-service-refactor.feature: 20 DUPLICATE bound (full coverage)
  - platform-evaluator-and-model-provider-tools.feature: 4 bound, 9 NoTest
  - scim-group-mapping.feature: 4 bound, 21 NoTest documented
  - user-deactivation.feature: 4 bound, 8 NoTest + 4 UPDATE documented

Deletions (16 scenario blocks + 1 file):
  - elasticsearch-write-disable-flags.feature: deleted entirely (8 DELETE
    scenarios — feature was reverted, columns dropped in migration
    20260403120000)
  - remove-dead-cost-checker-code.feature: 9 DELETE scenarios removed
    (cleanup completed in PR #2661)
  - subscription-service-refactor.feature: 3 DELETE removed
    (NullSubscriptionService never landed)
  - stripe-price-catalog-sync.feature: 3 DELETE removed (sync task
    never landed)

Header comments documenting NO_TEST and UPDATE gaps added to:
  - drawer-backdrop-transparency-blur.feature (1 NoTest)
  - tag-management.feature (6 NoTest, 1 UPDATE)
  - subscription-service-refactor.feature (5 NoTest, 1 UPDATE)
  - stripe-price-catalog-sync.feature (3 NoTest)

Bumps `pnpm check:feature-parity` enforced count by 51 + drops 8
scenarios from the @unimplemented backlog via deletion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger CI for parity work

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d tests (#3821)

Binds the lite-to-full role-upgrade enforcement scenarios in
specs/licensing/enforcement-members.feature to the existing prose unit
tests in license-limit-guard.unit.test.ts:

- "Allows upgrade from Lite Member to full member when under limit" →
  assertMemberTypeLimitNotExceeded "allows change when under limit"
- "Blocks upgrade from Lite Member to full member when at member limit" →
  assertMemberTypeLimitNotExceeded "throws when at limit"

Both tests directly exercise the role-change limit guard with
lite-to-full changeType, which is exactly what the scenarios specify.
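
The two bound cases reduce to a boundary check. The sketch below is a simplified, hypothetical guard written for illustration; it is not the body of `assertMemberTypeLimitNotExceeded`, and the parameter names are assumptions.

```typescript
// Throws when promoting one more Lite Member would exceed the full-member limit.
// "At limit" (current === max) already blocks, matching the second scenario.
function assertFullMemberSlotAvailable(currentFullMembers: number, maxFullMembers: number): void {
  if (currentFullMembers >= maxFullMembers) {
    throw new Error(`Full member limit reached (${currentFullMembers}/${maxFullMembers})`);
  }
}

// Convenience wrapper so callers can branch instead of catching.
function canUpgradeLiteMember(currentFullMembers: number, maxFullMembers: number): boolean {
  try {
    assertFullMemberSlotAvailable(currentFullMembers, maxFullMembers);
    return true;
  } catch (_err) {
    return false;
  }
}

console.log(canUpgradeLiteMember(4, 5)); // under the limit
console.log(canUpgradeLiteMember(5, 5)); // at the limit
```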

Part of parity grinder Phase 2 (#3458). Builds on iter-1 PR #3803 and
iter-2 PR #3810. The "Blocks custom role change that would exceed full
member limit" scenario is left @unimplemented because no existing test
exercises it (it requires permission-array introspection coupled to
limit checks, not just changeType).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ity (#3822)

Slice C4 of features-domain parity grinder (#3458). Maps DUPLICATE-class
@unimplemented scenarios in specs/features/scenarios/* to existing tests
via /** @Scenario "<title>" */ JSDoc bindings; removes @unimplemented from
each newly-bound scenario; deletes 1 layout-only scenario; adds header
comments documenting KEEP-class no-test gaps.

Per-file:
  - scenario-run-status-config-location.feature: 2 bound
  - scenario-id-format.feature: 2 bound, 1 NoTest documented
  - unified-agent-target-section.feature: 1 bound, 3 NoTest documented
  - on-prem-hostname-validation.feature: 8 already bound, no changes
  - extensible-scenario-metadata.feature: 11 bound (full coverage)
  - run-view-side-by-side-layout.feature: 2 bound, 4 NoTest, 1 DELETE
    ("Drawer content scrolls when it overflows" — pure CSS overflow concern)
  - workflow-agent-interpolation.feature: 5 bound, 1 NoTest (Python-only AC)
  - workflow-agent-mapping-layer.feature: 9 bound, 4 NoTest documented

Bumps `pnpm check:feature-parity` enforced count by 40.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ature files (#3806)

* test(features): bind 127 @unimplemented dataset scenarios across 7 feature files

Slice C1 of features-domain parity grinder (#3458). Maps every KEEP-class
@unimplemented scenario in the dataset feature files to its existing test via
/** @Scenario "<title>" */ JSDoc bindings. Removes the @unimplemented tag
from each newly-bound scenario.

Per-file:
  - dataset-rest-api.feature: 36 scenarios bound, 3 NO_TEST_FOUND gaps kept
    @unimplemented (batch-create 404/empty-body/size-limit)
  - dataset-typescript-sdk.feature: 32 scenarios bound, 1 gap kept
    (list records explicit pagination)
  - dataset-mcp-tools.feature: 21 scenarios fully bound
  - dataset-file-upload-api.feature: 36 scenarios fully bound
  - dataset-cli.feature: 2 @Unit scenarios bound
  - dataset-python-sdk.feature: header comment justifying that Python tests
    cannot bind via TS-only parity checker; scenarios kept @unimplemented
  - upload-utils.unit.test.ts: bindings using single-quote delimiters where
    titles contain double-quotes ("id"/"selected" column rename scenarios)

Bumps `pnpm check:feature-parity` enforced count by 127.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger CI for parity work

* ci: retrigger flaky langchain integration test

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 12 files (#3824)

* test(features): suites group A — bind 19 scenarios + document gaps in 12 files

Slice C5/A of features-domain parity grinder (#3458). Apply parity-clearing
to 12 largest suites/* feature files. Bind DUPLICATE-class scenarios to
existing tests via /** @Scenario "<title>" */ JSDoc annotations; remove the
@unimplemented tag for newly-bound scenarios; document remaining KEEP-with-
NoTest and UPDATE-class scenarios in per-file header comments.

Files (12):
  - cancel-queued-running-jobs.feature
  - unified-run-table.feature
  - run-history-group-by.feature
  - unified-run-view-layout.feature
  - external-sdk-ci-sets-in-sidebar.feature
  - rename-suites-to-runs.feature
  - real-time-run-updates.feature
  - grid-view-and-borderless-tables.feature
  - suite-list-view-status.feature
  - footer-to-header-migration.feature
  - suite-url-routing.feature
  - unified-sidebar-list-items.feature

Bumps `pnpm check:feature-parity` enforced count by 19 over baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger CI for parity work

* fix(features): address CodeRabbit nits — unbind 2 mismatched cancellation scenarios + correct header counts

CodeRabbit caught 3 issues on PR #3824:

1. langwatch/src/server/scenarios/__tests__/cancellation.unit.test.ts:
   the @Scenario annotations on lines 129/139 didn't match what the tests
   actually verify. Removed both annotations, restored @unimplemented on the
   matching scenarios in cancel-queued-running-jobs.feature, and added them
   to the file's NO_TEST gap list.

2. specs/features/suites/rename-suites-to-runs.feature: header counts
   "8 of 16" → "9 of 20" and "8 NO_TEST" → "11 NO_TEST" to reflect actual file.

3. specs/features/suites/unified-run-table.feature: header counts "5 of 12"
   → "4 of 12" and "7 NO_TEST" → "8 NO_TEST".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…les (#3827)

* test(features): C8 devtools/eval-v3 — bind 17 + delete 13 across 6 files

Slice C8 of features-domain parity grinder (#3458). Final cluster:
devtools/* (4 files, mostly skill/harness gaps) + evaluations-v3/* (2 files,
mostly bindable to existing tests).

Files:
  evaluations-v3/thread-variables-in-trace-evaluator.feature: 10 bound,
    1 DELETE
  evaluations-v3/evaluator-run-rerun-enhancements.feature: 8 bound,
    3 NoTest + 1 UPDATE
  devtools/bullboard-queue-dashboard.feature: 1 bound, 4 NoTest
  devtools/worktree-creation.feature: 0 bound (16 HARNESS_GAP)
  devtools/orchestrator-bug-fix-workflow.feature: 4 DELETE
  devtools/issue-creation-skill.feature: 8 DELETE (template diverged)

17 bindings, 13 deletions, 11 files modified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger CI for parity work

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 12 files (#3825)

* test(features): suites group B — bind 17 scenarios + document gaps in 12 files

Slice C5/B of features-domain parity grinder (#3458). 12 mid-sized
suites/* feature files. Bind DUPLICATE-class scenarios via JSDoc and
document remaining KEEP/UPDATE in header comments. 2 files fully cleared:
suite-bugfixes-1956.feature and suite-archive-confirmation-dialog.feature.

Files:
  - suite-sidebar-status-summary, inline-add-target-and-scenario-buttons,
    target-selector-select-clear-all, all-runs-scenario-names,
    suite-bugfixes-1956, suite-run-confirmation-modal,
    suite-runs-time-filter, collapsible-suite-sidebar, nested-drawer-typing,
    run-scenario-target-selector-modal-stability, sidebar-summary-status,
    suite-archive-confirmation-dialog

17 bindings across 10 test files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger CI for parity work

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mall files (#3826)

* test(features): suites group C — bind 38 scenarios + delete 3 + 11 small files

Slice C5/C of features-domain parity grinder (#3458). Process the remaining
11 small suites/* feature files plus prompts/custom-prompt-tags.feature.
Bind DUPLICATE-class scenarios via JSDoc, delete 3 obsolete scenarios from
remove-label-tag-pills.feature, and document UPDATE-class divergences via
header comments in suite-url-nesting.feature.

Files:
  Suites (11):
    - all-runs-default-open (2 bound, fully cleared)
    - simulation-run-status-consistency (no @unimplemented)
    - all-runs-panel (3 bound, fully cleared)
    - single-loading-indicator (3 bound, fully cleared)
    - remove-label-tag-pills (3 deleted; entire scenario set obsolete)
    - remove-redundant-suites-label (2 bound, fully cleared)
    - suite-empty-state (3 bound, fully cleared)
    - trace-role-cost-accumulation (no @unimplemented)
    - suite-url-nesting (4 UPDATE — URL scheme diverged; header comment)
    - all-runs-batch-origin-label (2 bound, 1 NoTest)
    - all-runs-group-by (4 bound, fully cleared)
  Prompts (1):
    - custom-prompt-tags (19 bound, 1 NoTest)

Total: 38 bindings, 3 deletions, 11 test files updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger CI for parity work

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lan parity (#3829)

Adds a new "counting unit by organization profile" describe block to
usage-meter-policy.unit.test.ts that maps each scenario title verbatim
to a unit test. Removes @unimplemented from 5 scenarios in the feature
file. The 6th scenario (Self-hosted free org never blocked) stays
@unimplemented + documented as a NoTest gap because UsageService.checkLimit
has no IS_SAAS short-circuit yet.

Bound scenarios:
  - Free TIERED organization counts each span toward the limit
  - Free SEAT_EVENT organization counts each span toward the limit
  - Paid TIERED organization counts each trace as one unit
  - Paid SEAT_EVENT organization counts each span toward the limit
  - Licensed organization respects its own counting rule
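
Read together, the first four scenarios say that only paid TIERED organizations count per trace. A minimal sketch of that rule follows; the type and function names are hypothetical, and the licensed case is left out because the commit only says it follows its own rule.

```typescript
type PricingModel = "TIERED" | "SEAT_EVENT";
type Plan = "free" | "paid";
type CountingUnit = "span" | "trace";

interface OrgProfile {
  plan: Plan;
  pricingModel: PricingModel;
}

// Paid TIERED organizations are metered per trace; every other combination
// listed in the scenarios above is metered per span.
function countingUnitFor(org: OrgProfile): CountingUnit {
  return org.plan === "paid" && org.pricingModel === "TIERED" ? "trace" : "span";
}

console.log(countingUnitFor({ plan: "paid", pricingModel: "TIERED" }));
console.log(countingUnitFor({ plan: "free", pricingModel: "TIERED" }));
```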

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e parity (#3830)

Extract buildLimitMessage and buildUpgradeUrl from usage.service.ts into a
dedicated limit-message.ts module so they can be unit-tested directly without
spinning up the full UsageService. Adds 4 tests, one per scenario in
trace-limit-upgrade-message.feature, covering:

  - Free-tier org on SaaS told to upgrade with correct unit
  - Free-tier org on self-hosted told to buy a license
  - Paid TIERED org on SaaS told to upgrade with traces unit
  - Paid TIERED org on self-hosted told to buy a license

Refactor is behavior-preserving — usage.service.ts still calls the same
functions, just imported instead of inline. All 19 existing usage.service
tests still pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dge cases (#3832)

* test(features): bind 3 + implement 1 dataset-rest-api batch records edge cases

The header claimed 3 NoTest gaps for batch-create-records edge cases. In
reality 2 of them already had matching tests:
  - 404 for non-existent dataset (it("returns 404 Not Found") line 886)
  - 422 for empty body (it("returns 422 Unprocessable Entity for empty body")
    line 903)

Added @Scenario JSDoc to those 2 tests + wrote 1 NEW test for the
maximum-batch-size enforcement (1001-entry payload returns 422).
Removed @unimplemented from all 3 scenarios + updated header to claim
full 39/39 binding.
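
The three edge cases can be pictured as a small status decision. This is a sketch under stated assumptions: the limit of 1000 is inferred from the 1001-entry test, the order of checks is a guess, and the success code is a placeholder rather than the endpoint's documented response.

```typescript
// Assumed limit, inferred from "1001-entry payload returns 422".
const ASSUMED_MAX_BATCH_SIZE = 1000;
// Placeholder for the success path; the real endpoint may use a different code.
const PLACEHOLDER_SUCCESS_STATUS = 200;

function batchCreateStatus(datasetExists: boolean, entries: unknown[]): number {
  if (!datasetExists) return 404; // non-existent dataset
  if (entries.length === 0) return 422; // empty body
  if (entries.length > ASSUMED_MAX_BATCH_SIZE) return 422; // over the batch limit
  return PLACEHOLDER_SUCCESS_STATUS;
}

// Builds a payload of n empty records for the examples below.
function makeEntries(n: number): unknown[] {
  const entries: unknown[] = [];
  for (let i = 0; i < n; i++) entries.push({});
  return entries;
}

console.log(batchCreateStatus(true, makeEntries(1001)));
```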

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger SIGABRT runtime crash

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
)

Phase 2 of #3458 (parity grinder) — members domain.

Bound to existing tests via JSDoc @Scenario:
  member-role-team-restrictions.feature (12/22):
    * "Lite Member" label + "Lite Member org role restricts team
      role to Viewer only" + "Member org role excludes Viewer" +
      "Member/Admin includes custom roles" + "Admin org role has
      all team role options" — bound to memberRoleConstraints
      unit tests covering getOrganizationRoleLabel +
      isTeamRoleAllowedForOrganizationRole.
    * Auto-correction scenarios (Member→Lite Member, Admin→Lite
      Member, Lite Member→Member, Admin→Member, Member→Admin) —
      bound to getAutoCorrectedTeamRoleForOrganizationRole tests.
    * "All team assignments respect Lite Member" + "Switching org
      role updates all team assignments" — bound to
      computeEffectiveTeamRoleUpdates EXTERNAL case.
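
As a reading aid, the constraint scenarios above can be sketched as a single predicate. The role names are assumptions (in particular, treating `EXTERNAL` as the Lite Member org role is inferred from the test name above), and this is not the body of `isTeamRoleAllowedForOrganizationRole`.

```typescript
type OrgRole = "ADMIN" | "MEMBER" | "EXTERNAL"; // EXTERNAL assumed to mean "Lite Member"
type TeamRole = "ADMIN" | "MEMBER" | "VIEWER" | "CUSTOM";

// Lite Members may only be Viewers, Members may be anything except Viewer,
// and Admins may hold any team role, including custom roles.
function isTeamRoleAllowed(orgRole: OrgRole, teamRole: TeamRole): boolean {
  switch (orgRole) {
    case "EXTERNAL":
      return teamRole === "VIEWER";
    case "MEMBER":
      return teamRole !== "VIEWER";
    case "ADMIN":
      return true;
  }
}

console.log(isTeamRoleAllowed("EXTERNAL", "MEMBER"));
console.log(isTeamRoleAllowed("MEMBER", "CUSTOM"));
```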

  update-pending-invitation.feature (4/25):
    * "Member creates invitation request that requires approval"
      — bound to InvitesTable Pending Approval badge case.
    * "Admin creates immediate invite" — bound to InvitesTable
      Invited badge case.
    * "Admin approves invitation request" + "Admin rejects" —
      bound to WaitingApprovalActions admin Approve/Reject buttons.

Kept @unimplemented (with justifying comments — needs harness):
  member-role-team-restrictions.feature (10):
    * UI dropdown render scenarios — no JSDOM render of the
      AddMembers form yet.
    * "API rejects non-Viewer for Lite Members" + "Saving enforces
      Viewer" + "Editing does not persist until save" — covered in
      `organization.member-roles.integration.test.ts` but that
      suite is `describe.skip()` (#3240).

  update-pending-invitation.feature (21):
    * Backend integration covered by
      `organization.invites.integration.test.ts` — `describe.skip()`
      (#3240).
    * Playwright e2e under `agentic-e2e-tests/tests/members/*` —
      `test.fixme()` (#1811 CI flakiness).

specs/members net @unimplemented: 47 → 47 (-16 bound, 31 justified).

`pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tests (#3811)

* test(private-dataplane): bind 20 @unimplemented scenarios to routing tests

Phase 2 of #3458 (parity grinder) — private-dataplane domain.

All 20 `@unimplemented` scenarios in this domain map directly onto
existing routing tests that exercise the private-vs-shared paths
end-to-end (testcontainers + real CH clients on the integration
side, env-var mocks on the unit side).

Bound to existing tests via JSDoc @Scenario:
  clickhouse-routing.feature (11/11):
    * Env-var parsing (3 scenarios) and shared-client default —
      bound to "when org has no private ClickHouse env var"
    * Private-client / private-instance routing / data-isolation
      proof (5 scenarios) — bound to "when org has a private
      ClickHouse env var configured"
    * Per-org caching — bound to "when called twice for the same
      organization"
    * getAllClickHouseInstances — bound to its dedicated case

  s3-routing.feature (5/5):
    * Env-var parse — bound to "parses a single org config"
    * Invalid JSON warning — bound to "skips the invalid entry"
    * Org-level routing (private + shared) — bound to the matching
      describe blocks
    * Project-level routing — bound to "when project belongs to org
      with private S3"

  data-isolation.feature (4/4):
    * Private-org spans-only — bound to span-storage repo "private-CH
      org" case
    * Shared-org spans-only + concurrent-routing — bound to
      "concurrent writes target different orgs"
    * Private-org events-only — bound to event-repo "private-CH org"
      case
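
The routing and caching behaviour these scenarios pin down can be sketched roughly as below. Everything here is hypothetical: the real code reads its configuration from env vars and builds real ClickHouse clients, while this sketch uses an in-memory map and a plain object to show only the private-vs-shared decision and the per-org cache.

```typescript
interface StorageClient {
  readonly target: string;
}

const SHARED_CLIENT: StorageClient = { target: "shared" };

// Returns a lookup function that routes an org to its private client when one
// is configured, and to the shared client otherwise. Private clients are
// created once per org and reused on later calls.
function createRouter(privateTargets: Map<string, string>): (orgId: string) => StorageClient {
  const cache = new Map<string, StorageClient>();
  return (orgId: string): StorageClient => {
    const target = privateTargets.get(orgId);
    if (target === undefined) return SHARED_CLIENT;
    let client = cache.get(orgId);
    if (client === undefined) {
      client = { target };
      cache.set(orgId, client);
    }
    return client;
  };
}

const clientFor = createRouter(new Map<string, string>([["org_private", "private-host"]]));
console.log(clientFor("org_private") === clientFor("org_private")); // same cached instance
console.log(clientFor("org_other") === SHARED_CLIENT); // falls back to shared
```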

specs/private-dataplane net @unimplemented: 20 → 20 (-20 bound).

`pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger flake CI

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drewdrewthis and others added 30 commits May 21, 2026 19:53
…4150)

* feat(boxd): add preview lifecycle helpers to boxd-fork.sh

Adds four pure-ish helpers (one naming helper plus three orchestration entry points):
- boxd_preview_vm_name BRANCH — returns preview-<slug(branch)>
- boxd_preview_up BRANCH — fork langwatch-golden-v2, checkout branch, docker compose up -d --build
- boxd_preview_down BRANCH — destroy the preview VM non-interactively
- boxd_preview_status BRANCH — print VM status, git HEAD, docker compose ps

Golden source controlled by LW_PREVIEW_GOLDEN_SOURCE (default: langwatch-golden-v2).
All boxd exec recipes wrap pipefail in bash -c (existing lint rule).

Unit tests in scripts/__tests__/boxd-fork.unit.bats cover:
- boxd_preview_vm_name name derivation and truncation
- LW_PREVIEW_GOLDEN_SOURCE default + env override
- arg validation on all three entry points
- pipefail-in-bash-c regression guard (extended to cover preview helpers)

Closes #4149

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(boxd): add boxd-preview, boxd-preview-down, boxd-preview-status Make targets

Three new targets in boxd.mk (all require BRANCH=<name>):
  make boxd-preview BRANCH=feat/foo         fork team golden, start full stack, print URL
  make boxd-preview-down BRANCH=feat/foo    destroy preview-<slug> VM
  make boxd-preview-status BRANCH=feat/foo  VM status + git HEAD + compose ps

Follows existing _boxd_require / BOXD_RUN_PREFIX patterns. BRANCH guard
matches fork-branch / fork-pr conventions.

Updates:
- boxd.mk: .PHONY list, surface comment, boxd-help, three new recipe blocks
- Makefile help: three-line summary under "Boxd workflows"
- dev/docs/boxd-makefile.md: quick reference, naming table, new env var row
  (LW_PREVIEW_GOLDEN_SOURCE), "PR preview lifecycle" section

Refs #4149

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(boxd-preview): validate branch shape in down/status (#4150 review)

Per CodeRabbit on PR #4150: boxd_preview_down and boxd_preview_status
only checked non-empty, then slugged to a VM name. A typo or invalid
branch could slug to a *different* valid VM name and target the wrong
VM (destroy is the real risk).

Mirror the git check-ref-format guard already in boxd_preview_up. Add
two regression tests matching the existing _up invalid-branch test.
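
The risk is easier to see with a toy slug function. The sketch below is an illustration in TypeScript, not the bash in `boxd-fork.sh`: the slug rule is an assumption, truncation is omitted, and the validity check covers only a small subset of what `git check-ref-format` enforces.

```typescript
// Hypothetical slug rule: lowercase, collapse non-alphanumerics to "-", trim dashes.
function slugify(branch: string): string {
  return branch.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

function previewVmName(branch: string): string {
  return "preview-" + slugify(branch);
}

// Simplified subset of git's ref-format rules, for illustration only.
function looksLikeValidBranch(branch: string): boolean {
  if (branch.length === 0) return false;
  if (/[\s~^:?*\[\\]/.test(branch)) return false;
  if (branch.indexOf("..") !== -1) return false;
  if (branch.charAt(0) === "-" || branch.charAt(branch.length - 1) === "/") return false;
  return true;
}

// Validate first, then derive the VM name, so a mistyped branch cannot
// silently resolve to some other branch's VM.
function previewDownTarget(branch: string): string {
  if (!looksLikeValidBranch(branch)) {
    throw new Error(`invalid branch name: ${branch}`);
  }
  return previewVmName(branch);
}

// Without the guard, the invalid "feat foo" maps to the same VM as "feat/foo".
console.log(previewVmName("feat/foo") === previewVmName("feat foo"));
```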

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(boxd-preview): move env assignment off `.` builtin

`bash -c 'LW_PREVIEW_GOLDEN_SOURCE=$(VAR) . scripts/boxd-fork.sh && fn'`
was an antipattern. Command-prefix env assignments scope to a single
command — even when that command is `.` (source). Bash unsets the var as
soon as `.` returns, so the script's `:-langwatch-golden-v2` default
ran during sourcing but evaporated before the function body needed it.
Result: `set -u` tripped on `golden="${LW_PREVIEW_GOLDEN_SOURCE}"`
in `boxd_preview_up`, failing the call with "unbound variable".

Move the assignment to the make recipe line (subshell-level env), so the
bash subprocess inherits it normally.

Discovered by actually running `make boxd-preview BRANCH=main` for the
first time during PR verification — unit tests source the script with
the env set, so they never hit this path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: MrKrustyKlaws <krusty@langwatch.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…4163)

* fix(groupQueue): update tenant cap test to match new default of 50

* fix(groupQueue): align feature file scenario title with actual default of 50

The feature file still said "defaults to 100" but the code was changed
to DEFAULT_TENANT_CAP = 50. The @Scenario annotation in the test matched
the code (50), not the stale feature title (100), causing
check-feature-parity to fail with "unknown scenario".
…ests (#4167)

* test(groupQueue): add counter conservation tests for total-pending drift

Pin the total-pending counter invariant across every lifecycle path:
- happy path (stage → dispatch → complete)
- active key expiry (documents the current drift bug)
- retryRestage lifecycle (single retry)
- restageAndBlock + unblock lifecycle
- multi-retry (3 retries then success)
- proposed fix validation (proves naive DECR-at-dispatch breaks retries)
- reconciliation invariant (counter vs actual Redis state)

These tests are Phase 0 of the counter drift fix plan — they prove
the bug exists and will serve as the regression suite for the
structural fix.

* fix(groupQueue): increase reactor dedup TTLs to reduce Redis fan-out pressure

traceUpdateBroadcast: 1s → 30s (notification only, frontend debounces)
spanStorageBroadcast: 1s → 15s (notification only, frontend refetches)
originGate:           5s → 30s (evaluationTrigger has its own 30s delay)

For a 30-span trace these three reactors were creating ~72 separate
jobs instead of ~3. With the new TTLs, dedup squashes them during the
span burst window, cutting reactor lifecycle Redis commands by ~60%.
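
A rough model of why a longer dedup TTL squashes a span burst into fewer jobs. The `countJobs` helper and the burst timing are hypothetical; the real reactors dedup through Redis, and the ~72 and ~3 figures above come from the commit, not from this sketch.

```typescript
// Hypothetical model: an event creates a job only if no job with the same
// dedup key was created within the last `ttlMs` milliseconds.
function countJobs(eventTimesMs: number[], ttlMs: number): number {
  let jobs = 0;
  let windowStart = Number.NEGATIVE_INFINITY;
  for (const t of [...eventTimesMs].sort((a, b) => a - b)) {
    if (t - windowStart >= ttlMs) {
      jobs += 1; // the dedup window has expired, so a new job is created
      windowStart = t; // and a new window opens
    }
    // otherwise the event is squashed into the job that is already pending
  }
  return jobs;
}

// Assumed burst: 30 spans arriving 1.2s apart (about 35s in total).
const burst = Array.from({ length: 30 }, (_, i) => i * 1200);

const jobsAt1s = countJobs(burst, 1_000); // every span outlives a 1s window
const jobsAt30s = countJobs(burst, 30_000); // most spans land inside one window
console.log({ jobsAt1s, jobsAt30s });
```

Under these assumed timings a single reactor goes from one job per span to a couple of jobs per burst, which is the effect the TTL bump is after.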

Active recurrence: same tenant (1,731 groups, 558K ops/sec) hitting
the same fan-out pattern as the May 21 incident.

* fix: address review comments — use dispatched groupId, remove unused var

* fix: update originGate unit test to match new 15s dedup TTL
#4168)

Cache tenant cap check results within each EVAL call to avoid
redundant GET calls. When 1,800 groups belong to one over-cap tenant,
this turns 1,800 separate GET tenant_active calls into 1 GET + 1,799
Lua table lookups.

Cache is invalidated after a successful dispatch INCRs the tenant
counter, so the cap check stays fresh when it matters.

Applied to both DISPATCH_LUA and DISPATCH_BATCH_LUA. Scan budget
is kept unchanged to avoid starvation risk for smaller tenants
when a large backlog clusters at the ZSET head.
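
The per-call cache can be sketched as follows. This is a TypeScript model of the idea, not the Lua script itself: `FakeRedis`, `scanForDispatch`, and the `tenant_active:<tenantId>` key layout are assumptions made for illustration.

```typescript
interface Group {
  groupId: string;
  tenantId: string;
}

// Hypothetical in-memory stand-in for Redis that counts GET calls.
class FakeRedis {
  gets = 0;
  private data: Map<string, number>;

  constructor(initial: Record<string, number>) {
    this.data = new Map(Object.entries(initial));
  }

  get(key: string): number {
    this.gets += 1;
    return this.data.get(key) ?? 0;
  }

  incr(key: string): void {
    this.data.set(key, (this.data.get(key) ?? 0) + 1);
  }
}

// One scan stands for one EVAL call; the cache never outlives the call.
function scanForDispatch(redis: FakeRedis, groups: Group[], cap: number): string[] {
  const overCap = new Map<string, boolean>();
  const dispatched: string[] = [];
  for (const group of groups) {
    const key = `tenant_active:${group.tenantId}`;
    let isOver = overCap.get(group.tenantId);
    if (isOver === undefined) {
      isOver = redis.get(key) >= cap; // first sighting of this tenant: one GET
      overCap.set(group.tenantId, isOver);
    }
    if (isOver) continue; // cached answer, no further GET for this tenant
    redis.incr(key);
    dispatched.push(group.groupId);
    overCap.delete(group.tenantId); // the count changed, so re-check next time
  }
  return dispatched;
}
```

With 1,800 groups from one over-cap tenant, this model issues a single GET; a tenant that is still under its cap is re-checked after every dispatch, which mirrors the invalidation rule described above.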
This reactor has been a no-op since LangWatch 3.0 (April 8) when
ES writes were fully removed in favor of ClickHouse. The handler
body is `return;` but without the `disabled` flag, every evaluation
run still goes through the full STAGE → DISPATCH → COMPLETE
lifecycle (~25 Redis commands) to execute nothing.

Adding `disabled: true` makes the projectionRouter skip the
`send()` call entirely (projectionRouter.ts:632), eliminating
the Redis cost. The pipeline registry still creates the reactor
for compatibility — it just never fires.
* fix(prompts): clamp Deploy dialog Select width to row (#4006)

The version Select used `width="auto" minWidth="180px"` with no
horizontal budget on its parent HStack, so long commit messages
pushed each environment row past the modal edge.

Constrain the trigger to `flex=1 minWidth=0 maxWidth=280px` and let
its parent HStack flex/shrink so the Select truncates instead of
overflowing. The name/dot cluster stays at intrinsic width.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(feature-parity): tag manual layout scenario @unimplemented

The "Version Select inputs stay within the modal width" scenario is asserted
visually (jsdom does not compute layout) and has no automated test binding.
Per TESTING_PHILOSOPHY.md, @Manual alone is not exempt from the binding check;
@unimplemented is required to signal the gap is tracked. The DeployPromptDialog
integration suite is also describe.skip'd due to a vitest hang (TODO #3022),
so no automated binding exists or can be added without resolving that first.

* test(evidence): add browser screenshots for Select overflow fix

Chromium-rendered before/after showing the DeployPromptDialog
version Select staying within the modal row boundary after the
flex/minWidth fix (PR #4037).

Before: selectRight=860 vs rowRight=476 (overflows by 384px)
After:  selectRight=459 vs rowRight=476 (17px margin, within bounds)

Verified by: Playwright 1.60.0 + Chromium 148.0.7778.96 (headless)
Script: /tmp/layout-verify.mjs — mirrors exact CSS from the fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: trigger fresh run after main test fix (tenantCap)

Previous CI runs cached an old merge-base where tenantCap.unit.test.ts
expected DEFAULT_TENANT_CAP = 100; main has since corrected it to 50.
Empty commit forces a new workflow checkout against updated main.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ua (#4173)

* fix(groupQueue): skip GET activeKey for over-cap groups in dispatch Lua

When a tenant is over-cap, the dispatch scripts read the activeKey
(GET) only to discard the result — the tenantOverCap gate on the
next line guarantees the group is skipped. With 10K+ over-cap groups
in the ready ZSET this wastes ~355K GETs/sec on the Redis single
thread.

Hoist the tenantOverCap check above the GET so over-cap groups skip
immediately. Behaviorally identical: all five state combinations
produce the same outcome; every write path is inside the gate and
untouched. Applied to both DISPATCH_LUA and DISPATCH_BATCH_LUA.

* fix(groupQueue): revert DISPATCH_LUA change — unused in production

Only DISPATCH_BATCH_LUA is called by the dispatcher. Keep the change
scoped to the hot path.
…cause, supersedes #3979) (#4148)

* fix(python-sdk): scope disable_sending to trace lifetime (#3981)

Root cause for offline-experiment cell traces silently dropped on the
OTLP path. The langwatch_nlp worker pool reuses processes across event
types, and `langwatch.trace(disable_sending=True)` was permanently
flipping the singleton `Client._disable_sending` flag. Once any
`execute_evaluation` or `execute_optimization` ran on a worker, every
subsequent `execute_component` for an offline-experiment cell on the
same worker had its spans dropped by `ConditionalSpanExporter`, even
though the caller never opted into disabling.

PR #3979 worked around the symptom by synthesizing a stand-in span
from the orchestrator (`getApp().traces.recordSpan(...)`). That kept
analytics charts non-empty but masked real OTLP data loss — the
"real" LLM/tool/RAG spans for offline-experiment cells were still
missing from `trace_summaries`/`stored_spans`.

Fix: `LangWatchTrace` now snapshots the client's `disable_sending`
state on `__enter__`/`__aenter__` and restores it in `_cleanup`. The
per-trace request lives on the instance (`_disable_sending_request`)
instead of mutating singleton state at construction time. Nested
traces correctly restore the outer block's state on inner exit.

Regression test exercises the exact worker-reuse shape from nlp's
isolated_process_pool: a `disable_sending=True` trace followed by a
default trace on the same singleton must not inherit the disabled
flag.

This replaces the workaround in PR #3979. With this fix the synthetic
`recordSpan` block in `evaluations-v3/execution/orchestrator.ts` can
be removed — close #3979 and target this branch for the analytics
fix instead.

Closes #3981. Supersedes #3979.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(python-sdk): refcount disable_sending for concurrent-safe scoping

Address CodeRabbit critique on PR #4148: single-slot snapshot/restore in
LangWatchTrace was unsafe for overlapping concurrent traces. Two threads
each entering a `disable_sending=True` trace could corrupt the
restoration order — one thread's exit would flip the flag back to False
while the other was still holding it.

Switch to a refcount on the Client:
- `Client.acquire_disable_sending()` / `release_disable_sending()`
  bracket a `with langwatch.trace(disable_sending=True)` block.
- Thread-safe via `threading.Lock`.
- Captures the user-set baseline on the first acquire so explicit
  `client.disable_sending = True` set outside any trace block survives
  the trace exit; only the refcount-acquired delta is released.
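
A minimal model of the refcount with a captured baseline, written in TypeScript for illustration. The actual SDK is Python and guards these fields with `threading.Lock`; this single-threaded sketch omits the lock, and the field names are assumptions.

```typescript
// Hypothetical model of the client-side refcount (no lock: this sketch runs
// synchronously on one thread, unlike the Python SDK).
class Client {
  disableSending = false; // user-visible flag
  private refcount = 0; // number of open disable_sending traces
  private baseline = false; // value the user had set before the first acquire

  acquireDisableSending(): void {
    if (this.refcount === 0) this.baseline = this.disableSending;
    this.refcount += 1;
    this.disableSending = true;
  }

  releaseDisableSending(): void {
    if (this.refcount === 0) return; // unbalanced release: ignore
    this.refcount -= 1;
    if (this.refcount === 0) this.disableSending = this.baseline;
  }
}

const client = new Client();
client.acquireDisableSending(); // trace A enters
client.acquireDisableSending(); // trace B enters while A is still open
client.releaseDisableSending(); // A exits: the flag must stay true for B
const whileBOpen = client.disableSending;
client.releaseDisableSending(); // B exits: the flag returns to the baseline
const afterAll = client.disableSending;
console.log({ whileBOpen, afterAll });
```

Restoring the baseline only when the count reaches zero is what keeps one trace's exit from re-enabling sending while another disabling trace is still open.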

Add two regression tests for the cases CodeRabbit named:
- user-set baseline survives trace exit.
- overlapping concurrent disable traces keep the flag True until the
  last one exits.

51/51 SDK tests pass (was 49 before the new cases).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(analytics): JOIN regression test for offline-experiment evaluations

Cherry-picked from #3979 (whose orchestrator workaround is no longer needed
once #3981 lands). The test seeds both `evaluation_runs` and `trace_summaries`
for synthetic offline-cell trace IDs, then asserts the analytics INNER JOIN
on (TenantId, TraceId) returns non-null metric buckets.

Independent of which write path populates trace_summaries — the test catches
future regressions in the analytics read path (JOIN, metric registry, filters)
either way.

Comments updated to reference the #3981 SDK fix as the row-source mechanism
instead of the abandoned workaround.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(analytics): retitle describe blocks to reflect runtime mechanism

Addressing CodeRabbit on PR #4148: describe-block header and one body
comment still attributed the trace_summaries row source to the
orchestrator's synthetic recordSpan workaround from #3979 (now closed).
The runtime source is now the normal OTel export pipeline (after the
SDK fix lands). Test behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: MrKrustyKlaws <krusty@langwatch.ai>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…s evaluations-v3 → experiments rename (#3893)

* docs(specs): add specs for experiments and runs list endpoints

Behavior specs for the new public REST list endpoints (experiments and
evaluation runs), the matching CLI commands, and the MCP tools. Specs
come first per BDD workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(api): add list endpoints for experiments and evaluation runs

Adds two missing public REST surfaces so API-key-authenticated clients
(CLI, MCP server, scripts) can enumerate experiments and their runs:

- GET /api/experiments — lists project experiments with run count and
  last-run timestamp. Goes through ExperimentService.getAll and
  ExperimentRunService.listRuns; no direct Prisma access from the route.
- GET /api/evaluations/v3/runs?experimentSlug=<slug> — lists runs for an
  experiment by slug. Reuses ExperimentRunService.listRuns. Returns 400
  for missing slug, 404 for unknown slug.

Both endpoints return camelCase shapes with project-scoped pagination
(default pageSize 50, max 200, hasMore flag). Refs #3885.
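
The page-size clamp and `hasMore` flag could look roughly like this. Only the default of 50, the maximum of 200, and the `hasMore` name come from the commit; the `paginate` helper, the 1-based `page` parameter, and the other field names are assumptions.

```typescript
const DEFAULT_PAGE_SIZE = 50;
const MAX_PAGE_SIZE = 200;

interface Page<T> {
  items: T[];
  page: number;
  pageSize: number;
  hasMore: boolean;
}

// Hypothetical in-memory pagination; the real endpoints page through the
// service layer rather than slicing an array.
function paginate<T>(
  rows: T[],
  page: number = 1,
  requestedSize: number = DEFAULT_PAGE_SIZE,
): Page<T> {
  const pageSize = Math.min(Math.max(Math.floor(requestedSize), 1), MAX_PAGE_SIZE);
  const safePage = Math.max(Math.floor(page), 1);
  const start = (safePage - 1) * pageSize;
  return {
    items: rows.slice(start, start + pageSize),
    page: safePage,
    pageSize,
    hasMore: start + pageSize < rows.length, // true when rows remain after this page
  };
}
```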

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sdk): add EvaluationsApiService.listExperiments / listRuns

Wraps the new GET /api/experiments and GET /api/evaluations/v3/runs
endpoints. Types are hand-written because these routes are not yet
exposed via the generated OpenAPI paths — same pattern PR #3886 used
for getRunResults. Refs #3885.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cli): add experiment list and evaluation list-runs commands

- langwatch experiment list — lists experiments in the project with
  table or JSON output, --limit caps the page size.
- langwatch evaluation list-runs --experiment <slug> — lists runs for
  the given experiment slug; table mode shows runId, status, started/
  finished and a pass-rate summary.

Both commands check the API key and surface 404s with a non-zero exit
so CI scripts can branch on missing experiments. Refs #3885.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): add platform_experiment_list and platform_evaluation_list_runs tools

Surfaces the new list endpoints to LLM agents via MCP. Both tools emit
markdown so the agent can read them directly, and both clamp the page
size (default 25, max 100) to protect the agent's context window.

platform_experiment_list discovers experiment slugs; the result feeds
into platform_evaluation_list_runs to discover runIds; those runIds
then feed into platform_evaluation_results (PR #3886) — completing the
list → runs → results discovery chain. Refs #3885.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(specs): add specs for evaluation results CLI command and MCP tool

Behavior-level scenarios covering completed-run fetch, failed-row filtering,
evaluator narrowing, JSON piping, and graceful handling of running/missing
runs — for both the new `langwatch evaluation results` CLI command and the
`platform_evaluation_results` MCP tool.

Refs #3885

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sdk): add EvaluationsApiService.getRunResults

Adds a `getRunResults({ runId })` method that fetches per-row experiment-run
results from `GET /api/evaluations/v3/runs/:runId/results`. The route is not
yet declared in the generated OpenAPI types, so the response type
(EvaluationRunResultsResponse) is hand-written to mirror the control-plane
ExperimentRunWithItems type.

Refs #3885

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cli): add `langwatch evaluation results <runId>` command

Renders per-row experiment-run results so engineers can debug evaluator
scores and missed rows from the terminal. Supports `--filter failed` to
focus on failures, `--evaluator <name>` to narrow to a specific evaluator,
`--format json` to pipe the full payload, and `--limit <n>` to cap table
output (default 20).

Refs #3885

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(mcp): add platform_evaluation_results tool

Lets coding agents fetch per-row experiment-run results to debug evaluator
scores and missed rows. Output is markdown styled like the existing
run-evaluation tool: per-evaluator averages plus row-by-row scores and
failure details. Default row cap of 50 protects the agent's context window;
agents can narrow with `filter: 'failed'` or `evaluator`, or raise `limit`
when really needed.

Refs #3885

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sdk): handle 200+null body from /runs/:runId/results for missing runs

The evaluations-v3 /results endpoint returns HTTP 200 with a JSON `null`
body when the runId does not exist (rather than 404). Without this fix,
`getRunResults` returned null and the CLI crashed with `Cannot read
properties of null (reading 'dataset')` instead of a clean error.

Detected during live dogfood of the new `evaluation results` command
(issue #3885) — bogus runId triggered a TypeError. Now the SDK throws a
proper EvaluationsApiError with operation context, and the CLI prints a
"Run not found" message.

Refs #3885
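
A sketch of the null-body guard described above. `RunNotFoundError`, `unwrapRunResults`, and `describeRun` are hypothetical names; the commit says the SDK throws its own API error type, which is not reproduced here.

```typescript
interface RunResults {
  runId: string;
  dataset: unknown[];
}

// Hypothetical error type standing in for the SDK's API error.
class RunNotFoundError extends Error {
  runId: string;

  constructor(runId: string) {
    super(`Run not found: ${runId}`);
    this.name = "RunNotFoundError";
    this.runId = runId;
  }
}

// `body` is the parsed JSON of an HTTP 200 response, which may be `null`.
function unwrapRunResults(runId: string, body: RunResults | null): RunResults {
  if (body === null) throw new RunNotFoundError(runId);
  return body;
}

// CLI-side handling: turn the typed error into a readable message instead of
// letting a TypeError escape from a property read on null.
function describeRun(runId: string, body: RunResults | null): string {
  try {
    const results = unwrapRunResults(runId, body);
    return `${results.dataset.length} rows`;
  } catch (err) {
    if (err instanceof Error && err.name === "RunNotFoundError") return "Run not found";
    throw err;
  }
}
```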

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address evaluation listing review feedback

* fix: address evaluation results follow-up feedback

* test: bind 17 scenarios for experiments/runs list across api/cli/mcp

Adds @Scenario JSDoc bindings to existing it() blocks in the integration
and unit tests for the new experiments-list, evaluation-runs-list, mcp
list tools, and CLI list commands. Marks 6 scenarios @unimplemented in
the feature files where no direct test exists yet (empty-project,
limit-cap-by-user-input, missing-API-key error, etc.) — those represent
tracked gaps to fill in a follow-up.

Resolves the langwatch-app-ci `feature-parity` failure on PR #3893.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address coderabbit + codeql feedback on PR #3893

- typescript-sdk tests: replace inline `typeof import()` with top-level
  `import type * as` aliases (CLAUDE.md rule against inline import())
- typescript-sdk CLI evaluation results: status cell now reflects
  evaluator failures, not just entry.error
- mcp-server list-experiments: combine backslash + pipe + backtick
  escapes into a single regex pass to avoid the codeql-flagged
  incomplete-escape ordering issue
- clickhouse-experiment-run.service: argMax(CreatedAt, UpdatedAt) for
  run dedup (latest version), and add RunId tiebreaker to ORDER BY
- experiment.repository: add id-desc tiebreaker to paginated findMany
- mcp-server get-evaluation-results: aggregate evaluator summary from
  filtered rows so stats match the displayed subset under filter=failed
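
The single-pass escape from the list-experiments item can be sketched as below. `escapeTableCell` is a hypothetical name. The point is that one character class handles all three characters, so a backslash inserted while escaping a pipe is never re-escaped by a later pass.

```typescript
// Escape backslash, pipe and backtick in one pass over the input.
function escapeTableCell(value: string): string {
  return value.replace(/[\\|`]/g, (ch) => `\\${ch}`);
}

console.log(escapeTableCell("a|b")); // the pipe gains one leading backslash
```

Chaining separate `replace` calls is order-sensitive: escaping pipes first and backslashes second would double the backslash that the first pass just added.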

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename evaluations-v3 → experiments across API, SDK, CLI, MCP

Aligns offline-evaluation surfaces with the project naming convention:
"evaluations" refers to online evaluations / guardrails; offline /
batch is "experiment". The `evaluations-v3` paths were misnamed
historically — this sweep renames public surfaces and supporting
internals in one pass:

- API: `/api/evaluations/v3/{execute,abort,:slug/run,runs,runs/:runId,
  runs/:runId/results}` → `/api/experiments/{...}`. Route file moved
  to `routes/experiments-v3.ts`. Existing `/api/experiments` Hono app
  (list endpoint) is unchanged; the two apps coexist under the same
  basePath.
- Server dir: `src/server/evaluations-v3/` → `src/server/experiments-v3/`
  (~30 internal imports updated).
- UI module: `src/evaluations-v3/` → `src/experiments-v3/`.
- Specs: `specs/evaluations-v3/` → `specs/experiments-v3/`.
- SDK: `EvaluationsApiService` (in `services/evaluations/`) →
  `ExperimentsApiService` (in `services/experiments/`). Local error
  class renamed to `ExperimentsApiServiceError` to avoid colliding
  with the existing `ExperimentsApiError` in `platformErrors.ts`.
  `EvaluationRun*` response/entry types renamed to `ExperimentRun*`.
  `EvaluationsFacade` and the online-eval `EvaluationsApiError` stay
  put — they correctly target `/api/evaluations/:slug/evaluate`.
- CLI: `langwatch evaluation {run,status,list-runs,results}` folded
  into the existing `langwatch experiment` group (now also exposes
  `list`). Files moved from `commands/evaluation/` → `commands/experiment/`.
- MCP: `platform_run_evaluation`, `platform_evaluation_status`,
  `platform_evaluation_list_runs`, `platform_evaluation_results` →
  `platform_run_experiment`, `platform_experiment_status`,
  `platform_experiment_list_runs`, `platform_experiment_results`.
  Tool files renamed to match; handler exports renamed
  (`handleEvaluation*` → `handleExperiment*`).

Left intentionally alone (separate concerns, larger blast radius):
- Internal types `EvaluationsV3State` / `persistedEvaluationsV3StateSchema`
  / `EvaluationV3Event` — woven through the workbench UI; follow-up.
- RBAC permission strings (`evaluations:manage`, `evaluations:view`).
- PostHog analytics event `evaluation_ran`.
- Online-eval surfaces: `/api/evaluations/:slug/evaluate`,
  `EvaluationsFacade`, `monitor` CLI group.

No backwards-compatibility aliases — PR is unreleased so all callers
are migrated in place.

Verification:
- typescript-sdk typecheck clean
- typescript-sdk build success
- typescript-sdk unit tests: 65/65 (SDK + CLI)
- MCP unit tests: 16/16 (get-experiment-results, experiment-list-tools)
- langwatch typecheck shows no rename-introduced errors (only
  pre-existing prisma-client errors caused by an unrelated root-owned
  generated file blocking `start:prepare:files`)
- CLI `--help` shows merged `experiment` group with the five subcommands
- OpenAPI generated types updated to `/api/experiments/runs/{runId}`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* polish(experiments): sweep stale "evaluation" strings in CLI + route errors

Found during E2E smoke testing on the dev server: route error messages
and CLI spinner/help strings still said "evaluation" where the rename
moved them to the experiment domain.

- routes/experiments-v3.ts: "Starting evaluation execution" →
  "Starting experiment execution", "Evaluation not found" →
  "Experiment not found", "Invalid evaluation configuration" →
  "Invalid experiment configuration", "Starting CI/CD evaluation
  execution" → "Starting CI/CD experiment execution"
- CLI experiment/run.ts: spinner "Starting evaluation" →
  "Starting experiment", success/fail/poll text "Evaluation" →
  "Experiment", failSpinner action "run evaluation" → "run experiment"
- CLI experiment/status.ts: failSpinner action "check evaluation status"
  → "check experiment status"
- CLI experiment/list-runs.ts: cross-ref text in help points at
  `langwatch experiment status/results` (not `langwatch evaluation ...`),
  failSpinner action "list evaluation runs" → "list experiment runs"
- CLI experiment/list.ts: cross-ref text in help points at
  `langwatch experiment list-runs` (not `langwatch evaluation list-runs`)
- CLI experiment/results.ts: local types EvaluationResults{Filter,
  Format,Options} → ExperimentResults*, failSpinner action "fetch
  evaluation results" → "fetch experiment results". Member names like
  `evaluations` / `evaluation` left alone — they refer to per-row
  evaluator results, which is correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(experiments): resolve run's experiment from runState, not "most recently updated experiment"

The previous fallback in GET /api/experiments/runs/:runId/results
guessed the owning experiment via the most-recently-updated experiment
in the project. This produced cryptic 404s whenever the user had edited
any other experiment after the one that owned the run, because the
downstream `experimentRunService.getRun({ projectId, experimentId,
runId })` only matches on the exact (projectId, experimentId, runId)
tuple in ClickHouse — runId alone is not unique across experiments.

Layered lookup, fast-path first:
  1. runStateManager.getRunState(runId) — Redis-backed, 24h TTL, covers
     hot runs.
  2. ?experimentSlug=<slug> query param → prisma lookup — covers older
     runs whose run state has expired but whose ClickHouse rows remain.
  3. Otherwise 404 with explicit guidance: "Pass ?experimentSlug=<slug>
     if the run is older than 24h."
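
The layered lookup above can be sketched as a small resolver. The `LookupDeps` shape and the function name are assumptions, the two lookups are synchronous only to keep the sketch short (the real ones hit Redis and Prisma), and the 404 returned for an unknown slug is a guess at reasonable behavior rather than something the commit states.

```typescript
interface LookupDeps {
  // Hypothetical signatures; the real lookups are asynchronous I/O.
  getRunState(runId: string): { experimentId: string } | null;
  findExperimentBySlug(projectId: string, slug: string): { id: string } | null;
}

type Resolved =
  | { ok: true; experimentId: string; via: "runState" | "experimentSlug" }
  | { ok: false; status: 404; message: string };

function resolveExperimentId(
  deps: LookupDeps,
  projectId: string,
  runId: string,
  experimentSlug?: string,
): Resolved {
  // 1. Fast path: run state that is still cached.
  const state = deps.getRunState(runId);
  if (state) return { ok: true, experimentId: state.experimentId, via: "runState" };

  // 2. Fallback: the caller told us which experiment owns the run.
  if (experimentSlug) {
    const experiment = deps.findExperimentBySlug(projectId, experimentSlug);
    if (experiment) return { ok: true, experimentId: experiment.id, via: "experimentSlug" };
    return { ok: false, status: 404, message: "Experiment not found" };
  }

  // 3. Nothing to go on: fail with actionable guidance instead of guessing.
  return {
    ok: false,
    status: 404,
    message: "Run not found. Pass ?experimentSlug=<slug> if the run is older than 24h.",
  };
}
```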

The SDK already encodes runId only in the path, so existing CLI / MCP
flows hit path (1) on the freshly-completed run they just polled,
which is the dominant use case. Users who chase older runs now get a
clear actionable error instead of "Experiment not found".

Also adds a null-guard around `experimentRunService.getRun` — the
return type is `ExperimentRunWithItems | null` and the route was
silently treating `null` as success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sdk-test): extend LLM simple question/response test to 30s timeout

This test makes a real ChatOpenAI call but used the default 15000ms
vitest timeout, while every sibling test in the same file
(`simple-agent-and-tool.integration.test.ts`) already overrides to
`{ timeout: 30_000 }`. Result: flaky timeouts in CI on slower OpenAI
responses (just observed on this PR's CI run despite no SDK code change
in this area).

Brings it in line with the rest of the file. No behavior change.
Boy-scout fix surfaced by the rename PR's CI run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mcp): replace inline imports with top-level imports for run-experiment handlers

Convert the two `await import(...)` calls for handleRunExperiment and
handleExperimentStatus to top-level imports, consistent with the other
experiment tool handlers already imported at the top of create-mcp-server.ts.
Addresses CodeRabbit inline-import suggestion; aligns with CLAUDE.md rule
"Never use inline import()".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings in experiment CLI/SDK/MCP

- status.ts: fix stale help text 'evaluation run' → 'experiment run'
- experiments-api.service.ts: use throw on handleApiError in getRunResults so
  TypeScript correctly narrows body to non-null before return
- experiments.facade.ts: replace summary! non-null assertion with summary ?? {}
  so a completed run with missing summary doesn't crash downstream arithmetic
- get-experiment-results.ts: extend finishedAt guard to also check stoppedAt,
  so stopped/aborted runs with partial results are displayed instead of blocked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(sdk): remove erroneous throw before never-returning handleApiError

`handleApiError` has return type `never` — TypeScript narrows past the null
guard without a throw. The redundant `throw` triggered the
`@typescript-eslint/only-throw-error` lint rule because `never` is not an
Error object.
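
A small illustration of the narrowing behavior this commit relies on. The `handleApiError` signature here is hypothetical; only its `never` return type is taken from the commit message.

```typescript
interface RunBody {
  runId: string;
}

// Hypothetical stand-in: a function declared to return `never` always throws.
function handleApiError(operation: string): never {
  throw new Error(`Failed to ${operation}`);
}

function requireBody(body: RunBody | null, operation: string): RunBody {
  if (body === null) {
    handleApiError(operation); // no `throw` keyword needed in front of the call
  }
  return body; // narrowed to RunBody because the branch above cannot complete
}
```

TypeScript treats a call to a function with an explicit `never` return type as the end of that control-flow path, so writing `throw handleApiError(...)` adds nothing to the type check.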

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(test): hoist EvaluatorEditorDrawer import in integration test

The inline `await import(...)` inside each test body caused the first test
to pay the full cold module-load cost (~10s in CI), consistently hitting the
15s test timeout. vi.mock() is hoisted by Vitest so a top-level import
receives the mocked module; no need for per-test dynamic imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(openapi): remove stale evaluations-v3 paths; regenerate api-client

Old /api/evaluations/v3/... paths survived the rebase merge alongside
the new /api/experiments/... paths, causing duplicate TypeScript
identifiers in the generated api-client.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(openapi): restore api-client.ts to minimal 3-line rename

Previous commit regenerated with a newer openapi-typescript version,
ballooning the diff to 13k lines. Restore the original PR version which
only renames the 2 path keys and 1 comment (evaluations/v3 → experiments).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(openapi): minimal path rename in openapiLangWatch.json

Replace the file with main's exact version + 3 targeted string
replacements to rename the 2 path keys and 1 description reference
from evaluations/v3 → experiments. Previous version had a diff-noise
problem from Python json.dumps re-encoding all unicode characters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: update Python SDK routes and thread experimentSlug through results API

Two merge-blocking bugs found in human review:

1. Python SDK still called /api/evaluations/v3/... routes which this PR
   renames to /api/experiments/... — would 404 in production.
   Updated platform_run.py (_start_run, _get_run_status) and
   experiment.py (_fetch_results_as_df).

2. CLI/MCP results lookup fails for runs older than 24h when Redis state
   expires. The server already accepts ?experimentSlug= to resolve from
   ClickHouse, but the SDK/CLI/MCP never forwarded it.
   Added experimentSlug param through:
   - ExperimentsApiService.getRunResults
   - experimentResultsCommand / ExperimentResultsOptions
   - CLI --experiment flag on `experiment results`
   - handleExperimentResults (MCP tool)
   - platform_experiment_results MCP schema

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add backward-compat aliases for /api/evaluations/v3 routes

Python SDK e2e tests run against production, which does not have
the new /api/experiments/... routes until this PR is deployed.
Updating the SDK to call new routes caused 404 instead of 401 for
the invalid-api-key test.

Solution:
- Revert Python SDK back to the old /api/evaluations/v3/... routes
  (safe to call on both old and new production)
- Add server-side legacyAliasApp that mirrors all
  /api/evaluations/v3/* traffic to /api/experiments/*
  so existing callers continue to work after deployment

Python SDK can be updated to the new routes in a follow-up PR once
this one is deployed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: require explicit approval for privileged PR checks

* ci: bootstrap workflow security guard

* ci: harden workflow guard parsing

* ci: tolerate commented workflow keys

* ci: accept wider workflow indentation

* ci: pin remaining workflow actions

* ci: audit privileged pull_request_target gates

* ci: skip missing codecov reports

* test: tolerate shared fixture race
… of blocking groups (#4178)

* fix(event-sourcing): retry through CH cluster-recovery errors instead of blocking

Three related fixes so fold-projection groups self-heal across a ClickHouse
rolling restart / ZooKeeper session-recovery cycle instead of parking in
gq:blocked on the first failure.

1. classifyClickHouseError now treats CH cluster-recovery codes as
   RECOVERABLE: 394 QUERY_WAS_CANCELLED, 242 TABLE_IS_READ_ONLY,
   999 KEEPER_EXCEPTION, 236 ABORTED, 33 CANNOT_READ_ALL_DATA. Same
   message fragments added for cases where @clickhouse/client surfaces
   the code only inside error.message ("Query was cancelled",
   "Table is in readonly mode", "Session expired", "Connection loss",
   "Coordination::Exception", "Write buffer has been canceled").
   Before this, a CH replica graceful shutdown produced a StoreError
   classified as CRITICAL, which made isRetryable=false in groupQueue
   and triggered RESTAGE_AND_BLOCK_LUA on the very first attempt without
   using any of the 15 retries.

2. JOB_RETRY_CONFIG bumped to maxAttempts=25, maxBackoffMs=600_000.
   The previous 15-attempt budget capped at 15s gave only ~2.5 minutes
   of total retry wait, which is shorter than a typical ZK session
   recovery + readonly-replica window. New schedule (500ms doubling up
   to a 10-minute cap, total ~2h 27m across 24 gaps) is enough room for
   any single rolling-restart cycle. Failed jobs stay in the Redis zset
   the whole time, so a longer budget never loses data, it just trades
   operator toil for auto-recovery.

3. resilient-client's inline insert retry loop now uses the same
   transient-message-fragments list, so the per-call retry also catches
   QUERY_WAS_CANCELLED / KEEPER_EXCEPTION / TABLE_IS_READ_ONLY etc.
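
A simplified sketch of the classification in item 1, using only the codes and message fragments listed above. The `classify` function, its input shape, and the two-way RECOVERABLE/CRITICAL split are assumptions; the real `classifyClickHouseError` likely handles more cases.

```typescript
type ErrorClass = "RECOVERABLE" | "CRITICAL";

// Cluster-recovery codes named in the commit message.
const RECOVERABLE_CODES = new Set<number>([394, 242, 999, 236, 33]);

// Fallback for errors where the code only appears inside the message text.
const TRANSIENT_FRAGMENTS: string[] = [
  "Query was cancelled",
  "Table is in readonly mode",
  "Session expired",
  "Connection loss",
  "Coordination::Exception",
  "Write buffer has been canceled",
];

function classify(err: { code?: number | string; message?: string }): ErrorClass {
  const code = Number(err.code); // NaN when the code is missing or not numeric
  if (Number.isInteger(code) && RECOVERABLE_CODES.has(code)) return "RECOVERABLE";
  const message = err.message ?? "";
  return TRANSIENT_FRAGMENTS.some((fragment) => message.includes(fragment))
    ? "RECOVERABLE"
    : "CRITICAL"; // assumed default for everything else in this sketch
}
```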

* fix(event-sourcing): address CodeRabbit review on PR #4178

1. errorHandling.ts: add MEMORY_LIMIT_EXCEEDED to the message-fragment
   fallback so code-241 errors stay RECOVERABLE even when the error
   surfaces as message-only (no separate `code` field).

2. errorHandling.ts: export CLICKHOUSE_TRANSIENT_MESSAGE_FRAGMENTS as
   the canonical list.

3. resilient-client.ts: import the canonical list instead of
   hand-maintaining a partial duplicate. Keeps the inline insert retry
   loop in lock-step with classifyClickHouseError so future additions
   stay in sync automatically.

4. resilient-client-metrics.unit.test.ts: add overload (202) and
   memory-limit (241) message-only cases to the transient retry matrix
   to guard the new parity.

5. errorHandling.unit.test.ts: cover MEMORY_LIMIT_EXCEEDED via
   message-only fallback.

6. shared.ts: correct cumulative-wait comment (2h 38m -> 2h 27m).
   Actual math: 1,023,500ms exponential + 7,800,000ms capped = 8,823,500ms.
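
The cumulative-wait arithmetic can be reproduced with a short helper. `totalRetryWaitMs` is a hypothetical name, and the 500ms base for the old 15-attempt schedule is an assumption (the commit states the base only for the new schedule).

```typescript
// Sum of the gaps between attempts for capped exponential backoff:
// gap k waits min(baseMs * 2^k, maxBackoffMs), and there are maxAttempts - 1 gaps.
function totalRetryWaitMs(maxAttempts: number, baseMs: number, maxBackoffMs: number): number {
  let total = 0;
  for (let gap = 0; gap < maxAttempts - 1; gap++) {
    total += Math.min(baseMs * 2 ** gap, maxBackoffMs);
  }
  return total;
}

const oldBudgetMs = totalRetryWaitMs(15, 500, 15_000); // 150,500 ms, about 2.5 minutes
const newBudgetMs = totalRetryWaitMs(25, 500, 600_000); // 8,823,500 ms, about 2h 27m
console.log({ oldBudgetMs, newBudgetMs });
```

For the new schedule the first 11 gaps are exponential (500ms up to 512,000ms, summing to 1,023,500ms) and the remaining 13 sit at the 600,000ms cap, which gives the 8,823,500ms total quoted above.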
…-chip jumps (#4179)

* feat(traces-v2): interactive error chip popover + blue glow on header-chip jumps

Two header chips (Error status, Eval) deep-link into trace-summary
sections, but the operator could lose the destination in a long
accordion list. The error chip's tooltip also claimed it would "jump
to the first span with an error" while actually just opening the
summary tab without explaining anything. Three improvements:

1. Replace the static error-chip tooltip with a HoverCard popover
   that mirrors the Exceptions accordion content - same trace-level
   message in the same red treatment, same per-span pill row sourced
   from the new shared `rankedErrorSpans` helper so the popover and
   the full accordion can't drift. Clicking a span pill inside the
   popover jumps the drawer to that span; clicking the chip itself
   opens the full Exceptions section.

2. Extend `useFocusSectionStore` with an `exceptions` target. The
   error chip now publishes a focus request through the same store
   the eval chips already use, so the trace-summary observer
   uniformly handles "expand + scroll" for all header chips.

3. Add a one-shot blue pulse (`SectionFocusGlow`) keyed off
   `data-section-focus="1"` on each Section. The focus observer
   toggles the attribute after the scroll lands, the keyframe runs
   for 1.5s and self-clears. The colour palette mirrors the
   onboarding `DrawerGlow` recipe so the two effects read as the
   same affordance - the focus pulse is just shorter, runs once,
   and stays inside the section box.

Includes a tiny `scripts/dogfood-error-trace.ts` OTLP emitter that
fans out one trace with a workflow + chain + 2 llm spans all in
error so the popover has enough surface area to dogfood.

* fix(traces-v2): paint focus glow as a portal overlay so it isn't cropped

The prior box-shadow-on-section approach split visually into two halves
(outer halo cut by the drawer's overflow scroll viewport, inset ring
occluded by the sticky section header bg) and clipped on opposite sides.

Replace it with a fixed-position overlay rendered through a Portal at
the document body. The overlay measures the section's bounding rect and
tracks it across scroll / resize for the duration of the animation, so
no overflow ancestor in the drawer chrome can crop it. Z-index sits
above the drawer's stacking context.

* feat(traces-v2): re-focus + glow exceptions when a span pill is clicked

Span pills inside ExceptionsContent (both the popover and the accordion
variant) now fire an optional `onFocusSection` callback alongside the
span jump. The popover re-runs the chip's focus pipeline so the operator
sees the section open + pulse as they land; the accordion variant
re-pulses the section the operator is already viewing so the eye tracks
back to the row that owns the selected span.

* feat(traces-v2): span pill jumps to span tab + glows its exceptions section

Two coupled changes so the popover span pills land where the operator
expects:

1. Popover `jumpToSpan` no longer forces `setActiveTab("summary")` -
   `selectSpan` already flips to the span tab internally, so leaving
   the override in place was yanking the operator back to summary
   right after selectSpan moved them.

2. Extract the focus-glow observer into `useSectionFocusGlow` and
   mount it on `SpanAccordions` as well, so the follow-up
   `requestFocus({section: "exceptions"})` lands on whichever
   accordion stack ends up rendered (summary or span). The hook
   waits via MutationObserver when the target section isn't in the
   DOM yet (span tab mounts a skeleton while `useSpanDetail` loads).
- Move scripts/dogfood-error-trace.ts to langwatch/scripts/dogfood/traces/error-trace.ts
  so the dogfood folder stays organised by feature
- Drop the blanket /* eslint-disable */ on the dogfood script and the
  unused SEC_NS constant; nothing in the file actually needed either
- Guard rankedErrorSpans depthOf against parent cycles so a malformed
  span graph can't hang the ranker (covered by new unit test)
- Surface the Exceptions accordion + focus glow for error traces that
  only have span-level failures (no rolled-up trace.error). Previously
  the header popover could list pills that led to a gated section that
  never rendered
- Restructure the two unit suites to follow given/when/it BDD nesting
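
The cycle guard on `depthOf` could look roughly like the sketch below. The `SpanNode` shape and the function signature are assumptions for illustration; only the idea (stop walking parents once a span id repeats) comes from the commit.

```typescript
// Hypothetical span shape; the real type in the repo may carry more fields.
interface SpanNode {
  spanId: string;
  parentSpanId: string | null;
}

// Depth = number of ancestors reachable by following parent links.
// The visited set stops the walk if a malformed graph loops back on itself.
function depthOf(spanId: string, byId: Map<string, SpanNode>): number {
  const visited = new Set<string>([spanId]);
  let depth = 0;
  let current = byId.get(spanId);
  while (current && current.parentSpanId !== null) {
    if (visited.has(current.parentSpanId)) break; // cycle: stop counting
    visited.add(current.parentSpanId);
    const parent = byId.get(current.parentSpanId);
    if (!parent) break; // parent missing from the payload
    depth += 1;
    current = parent;
  }
  return depth;
}
```

With a two-span cycle (`a` parented to `b`, `b` parented to `a`) the walk ends after one step instead of looping forever.
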
* fix(metrics): read backup status from system.backup_log

system.backups is an in-memory CH table that is wiped on every restart.
Since the CH image tag bumps on every app deploy, every deploy wipes the
table — and a freshly-rolled worker pod then sees zero rows and never
emits the clickhouse_backup_last_success_timestamp_seconds gauge, which
trips the "Backup Reporting Absent" alert despite backups being healthy.

system.backup_log is the persistent equivalent (same row schema for the
columns we care about) and survives CH restarts, so freshly-rolled
workers immediately see the most recent backup and emit the gauge on
their first collection tick.

* test(metrics): wrap backup_log tests in BDD given/when blocks

Per repository test style (CLAUDE.md), nested describe('given X') / describe('when Y') blocks are required for new tests. Restructures only the sections this PR added/modified; existing flat sections left alone to avoid scope creep.
#4182)

The chat-tab persistence effect deduped on message IDs only. The
latest assistant message gets a stable ID the moment streaming
starts (when content is still empty); the ID-set never changes
again for that turn, so the effect skipped every content delta.
The persisted snapshot ended up with the most recent assistant
message stuck at empty content, and convertScenarioMessagesToCopilotKit
then dropped empty-content rows on reload. Result: refresh and the
last assistant reply vanishes while older turns survive.

Key the dedup on (id, role, content-length) so each streaming chunk
re-triggers persistence. Persistence now fires once per content delta
rather than once per new message: a few dozen small localStorage
updates per turn, which is fine for the prompt playground.

Extract the key fn so the regression is unit-testable without
standing up the full CopilotKit chat stack.
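
A minimal sketch of such a key function, assuming a simplified message shape (`ChatMessage` and `persistenceKey` are illustrative names, not the ones in the repo):

```typescript
// Hypothetical message shape for illustration.
interface ChatMessage {
  id: string;
  role: string;
  content: string;
}

// Builds a dedup key that changes whenever a message is added or any
// message's content grows, so streaming deltas re-trigger persistence.
function persistenceKey(messages: ChatMessage[]): string {
  return messages
    .map((m) => `${m.id}:${m.role}:${m.content.length}`)
    .join("|");
}
```

Keying on length is cheaper than hashing the content. An edit that keeps the length identical would not change the key, which is acceptable for append-only streaming.
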
* fix(queue): move pending counter DECR to dispatch to eliminate drift

The total-pending counter drifted to 827K (real: ~1,300) because
COMPLETE_LUA only DECRs when activeKey matches stagedJobId. When the
active key expires before the worker completes, the DECR is skipped
and the INCR from stage time is never compensated.

Move DECR to DISPATCH (ZREM from :jobs — always succeeds) and add
compensating INCRs in RETRY_RESTAGE and RESTAGE_AND_BLOCK (job
re-enters :jobs). COMPLETE no longer touches the counter.

Counter semantic: "jobs currently in :jobs ZSETs" (waiting to dispatch).
Every INCR has a guaranteed DECR — no conditional paths.

4 Lua script changes:
- DISPATCH_LUA/DISPATCH_BATCH_LUA: DECR after ZREM
- COMPLETE_LUA: remove DECR + totalPendingKey
- RETRY_RESTAGE_LUA: INCR after ZADD
- RESTAGE_AND_BLOCK_LUA: INCR after ZADD
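
The counter semantic can be illustrated with an in-memory model. This is a TypeScript stand-in for the Lua scripts, not the scripts themselves; the class and method names are hypothetical, and it assumes each job id is staged at most once while it waits.

```typescript
// In-memory stand-in for a :jobs ZSET plus the total-pending counter.
class PendingModel {
  private jobs = new Set<string>();
  pending = 0;

  // Stage: the job enters :jobs, so the counter goes up.
  stage(jobId: string): void {
    this.jobs.add(jobId);
    this.pending += 1;
  }

  // DISPATCH: removing the job from :jobs is the only place the counter drops.
  dispatch(jobId: string): boolean {
    if (!this.jobs.delete(jobId)) return false;
    this.pending -= 1;
    return true;
  }

  // RETRY_RESTAGE / RESTAGE_AND_BLOCK: the job re-enters :jobs.
  restage(jobId: string): void {
    this.stage(jobId);
  }

  // COMPLETE: no counter change, even if the active key already expired.
  complete(_jobId: string): void {}

  waiting(): number {
    return this.jobs.size;
  }
}
```

In this model `pending` equals `waiting()` after every operation, which is the conservation property the change is after.
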

* fix(queue): update DRAIN_GROUP_LUA and tests for new counter semantic

DRAIN_GROUP_LUA counted active jobs in its DECRBY (pendingCount +
hadActive). With DECR moved to dispatch, the active job was already
compensated — counting it again would double-DECR.

Update DRAIN to DECRBY pendingCount only (jobs in :jobs ZSET).

Update integration tests to match the new counter semantic:
- "leaks the counter" → counter stays accurate (DECR at dispatch)
- "proposed fix validation" → validates the real 4-script fix
- "reconciliation invariant" → ZCARD(:jobs) only, not + EXISTS(:active)
- "retryRestage breaks invariant" → invariant now holds (INCR at retry)
- Drain tests: active job already DECRed at dispatch, not counted again

* fix(queue): guard INCR on ZADD insertion to prevent duplicate overcount

ZADD returns 0 when the member already exists (score update only).
Guard INCR behind inserted == 1 in RESTAGE_AND_BLOCK_LUA and
RETRY_RESTAGE_LUA so a duplicate re-stage call cannot overcount
the pending counter.
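
The guard can be sketched as follows, using a `Map` as a stand-in for the sorted set. The helper names are hypothetical; the only behaviour relied on is that ZADD reports 1 for a new member and 0 for a score update.

```typescript
// Minimal ZADD stand-in: returns 1 when the member is new,
// 0 when only the score of an existing member changed.
function zadd(zset: Map<string, number>, score: number, member: string): 0 | 1 {
  const inserted = zset.has(member) ? 0 : 1;
  zset.set(member, score);
  return inserted;
}

// Re-stage a job and bump the pending counter only on a real insertion.
function restageGuarded(
  zset: Map<string, number>,
  counter: { value: number },
  jobId: string,
  score: number,
): void {
  const inserted = zadd(zset, score, jobId);
  if (inserted === 1) counter.value += 1;
}
```

Calling `restageGuarded` twice for the same job updates the score the second time but leaves the counter at 1.
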

Addresses CodeRabbit review feedback on PR #4181.

* chore(specs): bind counter conservation scenarios to integration tests

Add @Scenario annotations linking pending-counter-conservation.feature
scenarios to their corresponding integration tests in
scripts.integration.test.ts.
…4175)

* fix(groupQueue): check tenant cap before SISMEMBER in DISPATCH_BATCH

Move the tenant cap check above the SISMEMBER blocked check in
DISPATCH_BATCH_LUA. The cap result is cached per-tenant in a Lua
table, so over-cap groups now skip with 0 Redis commands instead
of 1 SISMEMBER each. With 10K+ over-cap groups in the ready ZSET,
this eliminates ~333K SISMEMBER/sec (~74% main thread CPU).
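
The reordering can be sketched in TypeScript (the real logic is Lua inside DISPATCH_BATCH_LUA). The types and the `selectDispatchable` name are assumptions; a `Set` stands in for the Redis blocked set, and `sismemberCalls` counts how many membership checks the loop would issue.

```typescript
interface ReadyGroup {
  groupId: string;
  tenantId: string;
}

interface BatchResult {
  dispatched: string[];
  sismemberCalls: number;
}

function selectDispatchable(
  ready: ReadyGroup[],
  isTenantOverCap: (tenantId: string) => boolean, // looked up once per tenant
  blocked: Set<string>, // stands in for the Redis blocked set
): BatchResult {
  const capCache = new Map<string, boolean>();
  const dispatched: string[] = [];
  let sismemberCalls = 0;
  for (const group of ready) {
    let overCap = capCache.get(group.tenantId);
    if (overCap === undefined) {
      overCap = isTenantOverCap(group.tenantId);
      capCache.set(group.tenantId, overCap);
    }
    if (overCap) continue; // skipped before any per-group command
    sismemberCalls += 1; // membership check only for groups that can still dispatch
    if (blocked.has(group.groupId)) continue;
    dispatched.push(group.groupId);
  }
  return { dispatched, sismemberCalls };
}
```

With many over-cap groups from one tenant in the ready list, the cap is resolved once and none of those groups costs a membership check.
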

* test(groupQueue): add dispatchBatch integration tests for cap-before-SISMEMBER

Tests for DISPATCH_BATCH_LUA with tenant cap:
- Over-cap groups skipped, under-cap groups dispatched in same batch
- Blocked + over-cap group interaction
- Drift cleanup still runs for under-cap tenants

* test(groupQueue): link batch cap tests to tenant-soft-cap feature scenarios

Add 3 DISPATCH_BATCH scenarios to tenant-soft-cap.feature and annotate
integration tests with @Scenario references for parity tracking.

* test: harden cap-before-SISMEMBER integration tests

- Add cross-tenant dispatch assertion: stage a quiet-tenant job in
  second batch and assert it dispatches, not just that noisy ones don't
- Seed a zombie ready entry in drift-cleanup test so it exercises the
  actual cleanup path instead of passing trivially with both values at 0

Addresses CodeRabbit review feedback.

* fix(test): correct over-cap test to keep tenant at cap after restage

restageAndBlock DECRs the tenant_active counter, so with cap=1 and only
one dispatched group, the tenant drops to 0 (under cap) after restage.

Fix: start with cap=2 so two groups dispatch (counter=2), restage one
(counter→1), then lower cap to 1 for the second dispatch. This keeps
the tenant at cap and properly verifies over-cap groups are skipped.

* fix(test): complete quiet tenant job before lowering cap

The previous fix lowered the global cap from 2→1 but forgot that
proj_quiet also had 1 active job (counter=1), so it was also at cap
and nothing dispatched. Fix: complete proj_quiet/g1 before lowering
the cap so only proj_noisy remains at cap.

* test: assert g3 excluded from first batch to lock cap boundary
* fix(deps): move mcp-server pnpm.overrides to workspace root

pnpm warns that pnpm.overrides in mcp-server/package.json is ignored
because mcp-server is part of the langwatch/ workspace (declared via
../mcp-server in langwatch/pnpm-workspace.yaml). Overrides only take
effect at the workspace root.

protobufjs and fast-uri floors were already in langwatch/pnpm-workspace.yaml.
Adds the missing langsmith@<0.6.0 floor and removes the dead pnpm block
from mcp-server/package.json.

* fix(deps): keep mcp-server standalone overrides in pnpm-workspace.yaml

mcp-server is its own pnpm workspace (has pnpm-workspace.yaml and
pnpm-lock.yaml; CI installs it standalone via mcp-javascript-ci,
mcp-javascript-cd, sdk-javascript-ci, e2e-ci). Dropping the override
block from package.json without re-homing it broke standalone frozen
installs with ERR_PNPM_LOCKFILE_CONFIG_MISMATCH.

Re-add the same overrides at mcp-server/pnpm-workspace.yaml so the
standalone install honors them while the langwatch workspace install
continues to pick up its own overrides from langwatch/pnpm-workspace.yaml.
* feat(ops): add Redis pressure panel with memory % and engine CPU

During the 2026-05-21 incident the Redis main thread sat at 100% for 5+
hours while /ops exposed no Redis pressure signal — memory was technically
collected but only surfaced as a tiny sublabel on the DLQ tile, and engine
CPU was never collected at all. Investigators had to use the Redis CLI on
a replica to see what was happening.

This PR adds a first-class "Redis pressure" panel below the throughput
strip with three stats:

- Memory: used / max with percent, warning color when >= 80%
- Engine CPU: derived from two successive INFO cpu snapshots of
  used_cpu_{user,sys}_main_thread, warning color when >= 70%
- Connections: connected_clients

The legacy `redisMemoryUsed` sublabel under DLQ is removed in favor of the
real panel.

The engine-CPU derivation lives in a small pure helper (`redis-engine-cpu.ts`)
that returns null on the first cycle (no prior sample) and on the cycle
after a Redis restart (the cumulative counter resets and goes backwards), so
we never display a spurious huge or negative percent.
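
A sketch of what such a helper might look like. The `CpuSample` shape, the function name, the one-decimal rounding, and returning null for same-instant samples are assumptions, not details confirmed by the commit.

```typescript
interface CpuSample {
  // used_cpu_user_main_thread + used_cpu_sys_main_thread, in CPU seconds
  cpuSeconds: number;
  takenAtMs: number;
}

// Percent of one core used by the Redis main thread between two samples.
// Returns null when no meaningful value can be derived.
function engineCpuPercent(prev: CpuSample | null, next: CpuSample): number | null {
  if (prev === null) return null; // first cycle: nothing to diff against
  const elapsedSeconds = (next.takenAtMs - prev.takenAtMs) / 1000;
  if (elapsedSeconds <= 0) return null; // same-instant or out-of-order samples
  const cpuDelta = next.cpuSeconds - prev.cpuSeconds;
  if (cpuDelta < 0) return null; // counter went backwards: Redis restarted
  return Math.round((cpuDelta / elapsedSeconds) * 1000) / 10; // one decimal
}
```

For example, 3.5 CPU seconds consumed over a 5 second window yields 70.
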

Covered by specs/ops/redis-pressure.feature and tested at three layers:

- Unit (8 tests): pure CPU-percent math, including first-cycle null,
  rounding, counter rewind, and same-instant samples.
- Integration (3 tests): real Redis container — parses INFO, produces a
  CPU sample, derives a non-negative percent after generating work.
- RTL (8 tests): renders the panel inside ChakraProvider, asserts the
  used/max/percent/peak layout, the "sampling…" placeholder on the first
  cycle, the unlimited-maxmemory fallback, and warning thresholds.

Also fixes a pre-existing biome noFloatingPromises warning on the first
collect() call in OpsMetricsCollector.start().

* test(ops): browser-mode screenshot capture for RedisPressurePanel

A small vitest browser test that runs the panel inside real Chromium via
Playwright and writes PNGs to __screenshots__/ for each rendered state
(healthy, sampling, saturated, unlimited). Not a regression test: its
output is the set of screenshots that the PR description embeds.

* fix(ops): address coderabbit review (PR #4190 round 1)

- Remove unused VStack import from RedisPressurePanel
- Expose warning state as data-warning="true|false" on memory/CPU stats so
  RTL tests can pin the contract independently of Chakra v3's CSS-variable
  rendering details
- Add tests that verify data-warning is set above the 80% memory threshold
  and the 70% engine-CPU threshold, and unset below
- Align the unlimited-maxmemory spec scenario with the panel contract: the
  percent is omitted entirely (not shown as "-") when maxmemory=0

* refactor(ops): inline Redis stats into top stat strip, format bytes as MB/GB

The dedicated "Redis pressure" card stood awkwardly between the top stats and
the throughput chart. Three values fit naturally alongside Staged/s, Completed/s,
P50, P99, etc. — collapse the card and let the grid wrap.

Also fixes the memory display: Redis's `used_memory_human` returns "53.62M",
which is ambiguous (megabits? million bytes?). Format from the bytes value on
the frontend with explicit "MB"/"GB"/"KB" suffixes instead.
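
A byte formatter along these lines would produce the explicit suffixes. The 1024-based units and two-decimal precision are assumptions; the commit only states that the frontend formats from the raw byte count.

```typescript
// Formats a byte count with an explicit unit suffix.
// Assumes 1024-based units and two decimals above the byte range.
function formatBytes(bytes: number): string {
  const units = ["B", "KB", "MB", "GB"];
  let value = bytes;
  let unitIndex = 0;
  while (value >= 1024 && unitIndex < units.length - 1) {
    value /= 1024;
    unitIndex += 1;
  }
  const digits = unitIndex === 0 ? 0 : 2;
  return `${value.toFixed(digits)} ${units[unitIndex]}`;
}
```
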

* fix(ops): use raw memory ratio for warning threshold (PR #4190 round 2)

- Spec: align maxmemory (9.69GB) with the displayed sublabel value so the
  scenario isn't internally inconsistent.
- Spec: pin the new edge case (79.95% raw, 80% displayed → no warning).
- Component: separate the raw ratio (threshold check) from the rounded value
  (display) so 79.95% does not falsely trigger the eviction warning.
- Test: cover the 79.95% edge case explicitly.
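
The split between the raw ratio and the displayed value can be sketched as below. `memoryStat` and its return shape are hypothetical; the 80% threshold and the omitted percent for maxmemory=0 come from the commits above.

```typescript
const MEMORY_WARNING_RATIO = 0.8;

function memoryStat(
  usedBytes: number,
  maxBytes: number,
): { displayPercent: number | null; warning: boolean } {
  // maxmemory=0 means unlimited: no percent, no warning.
  if (maxBytes <= 0) return { displayPercent: null, warning: false };
  const ratio = usedBytes / maxBytes;
  return {
    displayPercent: Math.round(ratio * 100), // rounded value, for display only
    warning: ratio >= MEMORY_WARNING_RATIO, // threshold uses the raw ratio
  };
}
```

At 79.95% raw usage the tile shows 80 but `warning` stays false.
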

* fix(ops): stack sublabel under value and prevent wrapping (PR #4190)

When the strip got cramped, the sublabel sat awkwardly to the right of the
value and the label text broke onto two lines. Stacking the sublabel under
the value and pinning whiteSpace=nowrap on the label/value/sublabel forces
the grid to wrap whole tiles to the next row instead of mangling individual
ones.
* feat(langevals): stage large payloads to S3 via presigned URL

Topic clustering batches and long-input evaluator calls routinely
exceed Lambda's 6 MB sync invoke cap. Large topic-clustering payloads
observed in prod hit 56 MB and 85 MB, and one project lost every
daily clustering run to a 413.

Adds an opt-in S3 staging path: bodies above
LANGEVALS_STAGING_THRESHOLD_BYTES (default 5 MB) are uploaded to the
project's storage destination, the POST carries an empty body with the
GET presigned URL in X-Payload-S3-URL, and a FastAPI middleware on
langevals fetches + swaps the body before the route handler runs. Hard
caps reject oversized payloads before any network call: 16 MB for
evaluations, 180 MB for topic clustering.

Langevals never holds S3 credentials; only the control plane signs the
URL. Legacy langwatch_nlp topic clustering path stays unchanged so
self-hosted operators do not regress.

Telemetry tracked on both sides: control plane logs staged size, S3
key, TTL, and target host; langevals logs the fetched byte count and
latency.

* fix(topic-clustering): skip ES dual-write when Elasticsearch is not configured

storeResults dual-wrote topic assignments to both Elasticsearch and the
event-sourcing AssignTopic command queue. On SaaS prod (ClickHouse-only,
no ES) esClient() returned a throwing proxy whose .bulk() raised
"Elasticsearch is not configured", killing the function before the
AssignTopic block could run.

Result: Topic rows were created in Postgres (314 in the last 24h across
23 projects, confirmed via psql), but trace_summaries.TopicId in
ClickHouse never got written, leaving "Top Topics" empty in the UI for
every clustering-active project — including projects on the legacy
langwatch_nlp path that have nothing to do with the nlpgo flag flip.

Adds isElasticsearchConfigured() that mirrors esClient's resolution
order (per-org override → env) and guards the bulk write with it.
ClickHouse-only deployments now write topic assignments via the
event-sourcing path only, as intended.

Adds an end-to-end test that spawns langevals locally, points the TS
helper at real lw-dev S3, and exercises both the evaluator and
topic-clustering staging paths through real network + presigned URL
fetch. Skipped in CI (no shared dev S3 creds); run via
scripts/run-langevals-staging-e2e.sh.

* fix(langevals): coderabbit review follow-ups + feature-parity spec

* SSRF guard on X-Payload-S3-URL: reject non-https, localhost, RFC1918,
  loopback, link-local, multicast, reserved, unspecified IPs with 400
  before any httpx fetch. Six parametrized cases + a transport-level
  assertion that the rejected URL never reaches the network.
* Drop raw exception strings from staged-fetch error logs (httpx's
  __str__ embeds the full presigned URL with X-Amz-Signature). Replaced
  with error_type + upstream_status. Regression test asserts the secret
  URL substring never appears in captured log records.
* Reorder header spread in stagedLangevalsFetch so caller headers cannot
  override the contract-defining X-Payload-S3-URL or Content-Type.
* Switch test_staged_payload.py to a single import style for the
  staged_payload module (CodeQL flagged the mixed import/from import).
* Add specs/topic-clustering/trace-assignment.feature bound to the
  storeResults regression test so feature-parity CI doesn't break on the
  prior commit's unbound @Scenario annotations.
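
The header-ordering fix relies on object spread semantics: later keys win. A minimal sketch, assuming a hypothetical `buildStagedHeaders` helper and an `application/json` content type. The lowercase normalization is an addition of this sketch, since object keys are case-sensitive while HTTP header names are not.

```typescript
// Caller headers are spread first so the contract-defining headers,
// written last, always win.
function buildStagedHeaders(
  callerHeaders: Record<string, string>,
  presignedUrl: string,
): Record<string, string> {
  const normalized: Record<string, string> = {};
  for (const [name, value] of Object.entries(callerHeaders)) {
    normalized[name.toLowerCase()] = value; // avoid "Content-Type" vs "content-type" duplicates
  }
  return {
    ...normalized,
    "content-type": "application/json",
    "x-payload-s3-url": presignedUrl,
  };
}
```
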

* fix(test): poll loop-blocked counter instead of fixed 1500ms quiet window

The depth=1 loop-prevention test slept 1500ms then asserted the prom
counter had incremented. Under parallel CI load (PR #4189 hit this
twice) the reactor → BullMQ → metric-write chain doesn't always
complete in 1500ms, so the counter read 0 and the test failed with
"expected 0 to be greater than or equal to 1".

Replace the fixed sleep with `waitFor` polling the counter up to 20s.
The dispatch assertion stays as a post-condition: by the time the
counter ticks, the reactor has decided not to dispatch.

* test(e2e): exercise real OpenAI through staged-payload path

The previous e2e test posted an `exact_match` body and a topic
clustering call with a fake api_key pointing at an unreachable
embeddings endpoint. That proved the middleware fetched the staged
body but said nothing about whether the downstream call actually
produced a valid evaluator verdict or real topics.

This rewrites the e2e to:
  - Stage an `llm_boolean` payload, hit real OpenAI through the
    presigned URL hop, assert a real verdict with non-zero cost.
  - Stage a `topic_clustering_batch` payload of 12 traces in two
    semantic clusters, hit real OpenAI embeddings + naming, assert
    the response actually contains named topics and trace_id ->
    topic_id assignments.

OPENAI_API_KEY is sourced from langwatch/.env by the wrapper
script so the operator runs the same command as before. Drops the
"posts inline" scenario because the wrapper-set threshold of 200
bytes makes it impossible to fit a real evaluator body under the
inline branch, and the unit tests already cover that branch.

* refactor(topic-clustering): delete Elasticsearch dual-write entirely

Topic clustering ran a legacy ES bulk write inside storeResults, plus
two dead ES helpers (fetchCountsFromElasticsearch,
fetchTracesFromElasticsearch) that nothing called, plus an
isElasticsearchConfigured gate added earlier in this PR to keep the
bulk from killing the AssignTopic queue on SaaS.

The right fix is to delete the whole ES path. ClickHouse is the
storage for per-trace assignments via the AssignTopic command queue
plus trace_summaries.TopicId projection; Postgres is the storage for
the topic catalog. Nothing else still reads metadata.topic_id from
ES, so dual-writing was just dead work.

Drops:
- storeResults ES bulk + isElasticsearchConfigured gate
- fetchCountsFromElasticsearch / fetchTracesFromElasticsearch
  (defined but never called)
- isElasticsearchConfigured from elasticsearch.ts (no other callers)
- topicClustering.integration.test.ts (describe.skip'd ES-only test)
- ES mocks + assertions in the unit tests
- Two ES branches in trace-assignment.feature

The unit test now pins one contract: storeResults emits AssignTopic
for each trace and never touches the elasticsearch module. Any
re-introduction of an ES write would have to deliberately update this
test.

Dogfood re-tested locally on the same X0O0F0Y82Bqjl3WXyTY_6 project
with ELASTICSEARCH_NODE_URL unset (SaaS shape): 12 traces seeded,
storeResults' new no-ES path emits 12 AssignTopic commands,
trace_summaries.TopicId populated 12/12, Top Topics shows the
clustered topic in the UI.

* refactor(langevals-staging): opt-in by default + drop bash wrapper

Two cleanups suggested while dogfooding the e2e:

1. Staging is now off by default. Self-hosted langevals on a plain HTTP
   service has no 6 MB sync-invoke cap so there's nothing for the S3
   hop to dodge — bodies just go inline regardless of size. SaaS opts
   in by setting LANGEVALS_STAGING_THRESHOLD_BYTES in terraform.

   z.coerce.number().int().positive().default(5_000_000)
     -> z.coerce.number().int().positive().optional()

   stagedLangevalsFetch checks `threshold === undefined` and routes
   straight to inline POST. EVAL_MAX_PAYLOAD_BYTES and
   TOPIC_CLUSTERING_MAX_PAYLOAD_BYTES (the hard rejection caps) keep
   their defaults — those are useful even outside the staging
   scenario.

   Dropped the .env.example block that exposed all four vars to
   self-hosted operators. SaaS-specific tunables don't belong in the
   self-hosted starter template.

2. Deleted scripts/run-langevals-staging-e2e.sh. The TS e2e already
   does everything except resolve AWS profile credentials; that's one
   line of bash, folded into a new `pnpm test:e2e:langevals-staging`
   script that exports `aws configure export-credentials` output then
   runs vitest. Bucket + region default to lw-dev values in the test
   itself, so the script is single-command for anyone with lw-dev SSO
   refreshed.
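
The resulting routing decision can be summarised in a few lines. `chooseRoute` and the `Route` type are illustrative names; the rules (hard cap first, unset threshold means inline, otherwise stage bodies above the threshold) follow the description above.

```typescript
type Route = "inline" | "staged" | "rejected";

function chooseRoute(
  bodyBytes: number,
  hardCapBytes: number,
  stagingThresholdBytes?: number, // undefined when the env var is unset
): Route {
  if (bodyBytes > hardCapBytes) return "rejected"; // before any network call
  if (stagingThresholdBytes === undefined) return "inline"; // staging is opt-in
  return bodyBytes > stagingThresholdBytes ? "staged" : "inline";
}
```

With the threshold unset, a 10 MB body under a 16 MB cap goes inline; with a 5 MB threshold the same body is staged.
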

Coverage added:
  - stagedFetch.unit.test.ts gains a scenario asserting threshold-unset
    routes inline regardless of size.
  - staged-payload.feature gains a matching @Unit scenario.

* fix(langevals-staging): delete staged S3 object after use + pin inline Content-Type

Addresses Sergio's PR review:

P1 — staged payloads were written to S3 but never deleted. Those bodies
carry customer trace/input data plus provider credentials (evaluator
env X_LITELLM_*, topic-clustering litellm_params api_key /
vertex_credentials / bedrock keys), so leaving them in the bucket is a
retention/data-minimization problem, not just a Lambda-access one. The
presigned-URL TTL only bounds URL usability, not object lifetime.

Now stagePayload returns the client/bucket/key alongside the URL and
stagedLangevalsFetch deletes the object in a finally block once the
upstream responds. By then langevals has already fetched the presigned
URL during request handling, so the object is safe to drop. Delete is
best-effort and logged on failure; a bucket lifecycle rule on the
langevals-staging/ prefix (added in langwatch-saas terraform) is the
orphan/crash fallback for the rare path where the finally can't run.

P3 — the inline path spread caller headers before Content-Type, letting
callers override it; the staged path already pinned it last. Made the
inline path pin it last too so both branches behave identically (body
is always JSON-serialized here).

Tests: added a unit scenario asserting the staged object is deleted
with the same bucket+key it was written to, plus a bound @Unit feature
scenario.

* test(langevals-staging): e2e asserts staged object is deleted after use

The two real-S3 e2e scenarios previously asserted the staged object was
still present after the call (matching.length > 0). With the in-app
finally delete that's no longer true — and the old assertion masked
whether cleanup actually happened (the afterAll teardown deletes the
prefix anyway).

Flipped both to assert the object is gone (matching.length === 0) right after
stagedLangevalsFetch resolves, before teardown. The "fetched staged
payload" log still proves the body round-tripped through S3 first, so
the chain is: upload -> langevals GET from presigned URL -> in-app
delete -> nothing lingers. Verified the dev bucket lists 0 objects
under langevals-staging/ after the run.
…tching, eval OOM, shiki dispose (#4192)

* fix(traces-v2): non-selectable newline marker, evaluator filter label, annotation batching, eval OOM, shiki dispose

Five trace-explorer fixes reported from dogfooding:

1. Newline marker (compact + comfortable I/O preview). The `↵` hard-break
   glyph now renders as a zero-width `::after` pseudo-element at the end
   of the broken line, like a GitHub diff gutter: it isn't part of the
   selectable text (so copy round-trips clean multi-line text), and it
   never occupies layout width (so it can't wrap onto a line of its own).

2. Evaluator filter label. Dropped the `[type]` prefix from the evaluator
   facet label — a project's evaluators are mostly the same type, so the
   pill just ate the room the name needs. Value still keys off the
   evaluator id for saved-query round-trips. Removed the now-dead
   type-pill rendering from FacetRow.

3. annotation.getByTraceIds "Input is too big for a single dispatch".
   tRPC v10 sends queries as GET, so a page of 100+ trace ids overflowed
   the batch link's URL ceiling. New `useAnnotationsByTraceIds` hook
   chunks ids into URL-safe batches via `useQueries` and flattens — no
   upper bound. Migrated all five callers; `annotations/all` now also
   declares both queries unconditionally (fixing a latent rules-of-hooks
   issue).

4. traces.getEvaluations 500 ("Query memory limit exceeded ... column
   Inputs"). `evaluation_runs` is ORDER BY (TenantId, EvaluationId), so a
   TraceId filter can't prune granules; reading the multi-MB `Inputs`
   column across a granule OOMed. The query now falls back to a light
   projection (no Inputs) on a memory-limit error, so operators still get
   verdicts/scores instead of a 500. The eval card hides inputs when
   absent.

5. ShikiError "instance has been disposed". Chakra's shiki adapter
   disposes the highlighter in `unloadContext` on every CodeBlock unmount
   / color-mode change, but we share one app-lifetime singleton across
   all CodeBlocks — the first unmount tore it down for the rest. Neuter
   the singleton's `dispose`.
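
For item 3, the chunking step might look like the sketch below. The character budget and the `chunkTraceIds` name are assumptions; the real hook then issues one query per chunk via `useQueries` and flattens the results. Deduplicating first keeps an id from landing in two chunks.

```typescript
// Splits trace ids into batches whose combined length stays under a
// character budget, so each batch fits in a GET URL.
function chunkTraceIds(traceIds: string[], maxCharsPerChunk: number): string[][] {
  const unique = Array.from(new Set(traceIds));
  const chunks: string[][] = [];
  let current: string[] = [];
  let currentChars = 0;
  for (const id of unique) {
    const cost = id.length + 1; // +1 for a separator
    if (current.length > 0 && currentChars + cost > maxCharsPerChunk) {
      chunks.push(current);
      current = [];
      currentChars = 0;
    }
    current.push(id);
    currentChars += cost;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

An id longer than the budget still gets a chunk of its own rather than being dropped.
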

* fix(traces-v2): keep newline marker clear of the clamp ellipsis

Widen the spacing before the ↵ glyph, and suppress the marker that lands on the truncated line so it never overlaps the line-clamp's … ellipsis. Markers on fully-visible lines keep showing, so the hard-break affordance survives truncation.

* feat(traces-v2): lazy-load evaluation inputs in the drawer

The verdict list keeps the heavy Inputs blob (public API + v1 tooltip depend on it) but degrades to a light projection under ClickHouse memory pressure. To still show inputs in the drawer when that happens, EvalCard now fetches a single evaluation's inputs on expand via traces.getEvaluationInputs, keyed by EvaluationId (the table sort key) so the read prunes granules and stays within memory. Inputs already carried by the list render without an extra request.

* fix(traces-v2): address review comments

- Dedupe traceIds before chunking in useAnnotationsByTraceIds so duplicate ids spanning chunks don't double annotations.
- Count traceGroups load toward the annotations table loading state in filtered mode so it no longer flashes an empty state before the ids arrive.
- Wrap the ClickHouse light-projection retry in its own try/catch so a retry failure still goes through the standard error log + wrapped message.
- Normalize CRLF/CR in the preview newline marker so a stray carriage return can't cling to a line.
- Give the test ResizeObserver stub a one-arg constructor so callers aren't flagged for a superfluous argument.

* fix(traces-v2): make getEvaluationInputs project-protected, not public-share

The procedure was publicProcedure + checkPermissionOrPubliclyShared scoped to a single traceId, but the read is keyed by evaluationId (only tenant-scoped) with traceId dropped. A public-share token for one trace could then read any evaluation's inputs in the project by supplying another evaluationId. Switch to protectedProcedure + checkProjectPermission so the auth scope (project) matches the data scope, and drop the now-unnecessary traceId. Public-shared trace drawers still get inputs eagerly from the public getEvaluations; this lazy fallback is project-gated.

* fix(spans): make getForPromptStudio protected, not public

Prompt Studio is an authenticated project-scoped feature with no public-share path. It was declared publicProcedure but guarded only by checkProjectPermission, so it was effectively protected (anon resolves to no permission) yet labeled public. Switch to protectedProcedure to make the intent explicit and avoid the public-by-default footgun.

* docs(traces-v2): drop history-flavored code comments

Comments describe the final state of the code, not how it got there.
…#4196)

* fix(gateway): route managed-bedrock through the customer VPC endpoint

Managed-Bedrock customers reach Bedrock through their own PrivateLink VPC
endpoint, and their IAM role only authorizes bedrock:InvokeModel when the
request arrives via that endpoint. The gateway dispatched Bedrock through
bifrost, which hardcodes the public bedrock-runtime host and SigV4-signs over
it, so the request missed the VPCE and the customer role returned a 403 (the
old litellm path set aws_bedrock_runtime_endpoint to the VPCE and worked).

When a Bedrock credential carries a runtime endpoint, dispatch through the
official aws-sdk-go-v2 bedrockruntime Converse/ConverseStream with
BaseEndpoint pinned to that VPCE, so the SDK signs and sends to the right
host. The endpoint already reaches the gateway credential on the nlpgo path
(Credential.Extra). Every other Bedrock request and every other provider stays
on bifrost untouched. Responses are mapped into the same BifrostChatResponse
shape the normal path emits, so downstream cost/trace/stream handling is
unchanged.

Validated end to end against real Bedrock (Converse via BaseEndpoint returns a
200 through the dispatch path).

* fix(gateway): satisfy golangci-lint on bedrock vpce dispatch

Use promoted embedded fields, index-range over Tools to avoid value
copy, and http.StatusOK in the test.

* fix(gateway): accept litellm aws_* credential keys on bedrock vpce dispatch

The two nlpgo entry points name AWS credential fields differently: the
dispatcheradapter (Studio / workflows) translates litellm_params to the
Bifrost-canonical access_key / secret_key / session_token / region,
while the gatewayproxy (/go/proxy) keeps the litellm aws_* names. The
VPCE credential reader now honors both conventions so the dispatch is
correct regardless of which route built the credential.

Add a committed streaming live test (dispatchBedrockVPCEStream) alongside
the non-streaming one, a unit test for the aws_* key path, and assert the
dispatcheradapter threads the runtime endpoint through.

* fix(gateway): honor litellm aws_* keys on the normal bedrock key path

The gatewayproxy (/go/proxy) route keeps the litellm aws_* credential
names while the dispatcheradapter translates to canonical ones. The
Bifrost bedrock key builder read canonical only, so a /go/proxy bedrock
request landed with empty credentials. Read both conventions via the
shared credExtra helper so neither nlpgo route is left credential-less.
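
A minimal sketch of the dual-convention lookup described above. The gateway itself is Go and the real helper is credExtra; the TypeScript below only illustrates the lookup. `readAwsField`, the alias table, and the canonical-first precedence are assumptions made for illustration, not the gateway's actual code.

```typescript
// Hypothetical illustration: resolve an AWS credential field from either the
// Bifrost-canonical name or the litellm aws_* name.
type CredentialExtra = Record<string, string | undefined>;

// Canonical name -> litellm name. The pairing is an assumption for this sketch.
const AWS_KEY_ALIASES = {
  access_key: "aws_access_key_id",
  secret_key: "aws_secret_access_key",
  session_token: "aws_session_token",
  region: "aws_region_name",
} as const;

type CanonicalKey = keyof typeof AWS_KEY_ALIASES;

function readAwsField(extra: CredentialExtra, key: CanonicalKey): string {
  // Assumed precedence: canonical name first, then the litellm alias.
  const canonical = extra[key];
  if (canonical) return canonical;
  return extra[AWS_KEY_ALIASES[key]] ?? "";
}
```

With this shape, a credential built by either route resolves to the same values, and a field missing under both names comes back as an empty string rather than throwing.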

* fix(gateway): validate bedrock vpce endpoint to close SSRF surface

The runtime endpoint arrives from per-request credential Extra, so an
unconstrained value would let a request steer the gateway's outbound
call at an arbitrary host. Gate the managed-Bedrock dispatch behind
bedrockVPCEEndpoint, which requires an http/https URL whose host is
within the AWS-controlled amazonaws.com domain (covers both the public
endpoint and PrivateLink VPC endpoints) and fails closed otherwise.
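
The fail-closed check can be sketched as follows. The gateway is Go, so this TypeScript is only an illustration of the rule; `isAllowedBedrockEndpoint` is a hypothetical name. The sketch also folds in the later review fix in this PR that restricts plain http to .vpce.amazonaws.com hosts.

```typescript
// Hypothetical illustration of the endpoint allow-list; not the gateway's code.
function isAllowedBedrockEndpoint(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // unparsable input fails closed
  }

  // URL.hostname is already lowercased and excludes the port.
  const host = url.hostname;
  const inAwsDomain = host === "amazonaws.com" || host.endsWith(".amazonaws.com");
  if (!inAwsDomain) return false;

  if (url.protocol === "https:") return true;
  // Plain http is tolerated only for PrivateLink VPC endpoint hosts.
  if (url.protocol === "http:") return host.endsWith(".vpce.amazonaws.com");
  return false; // any other scheme is rejected
}
```

Matching on a dot-prefixed suffix rather than a substring is what keeps lookalike hosts such as amazonaws.com.evil.example out.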

* test(nlpgo): exercise the bedrock vpce intercept through the full nlpgo stack

The bedrock live e2e drives the real nlpgo HTTP chain (router →
dispatcheradapter → dispatcher → gateway intercept). Threading an
optional BEDROCK_VPCE_ENDPOINT into litellm_params routes that chain
through the managed-Bedrock VPCE dispatch instead of the public bifrost
path, so the endpoint-to-dispatch wiring is proven end to end, not just
the dispatch function in isolation.

* test(nlpgo): thread aws_session_token so the managed temp-cred path is exercisable

Managed-Bedrock customers authenticate with STS-temporary credentials,
which require the session token alongside the access/secret keys. Pass
AWS_SESSION_TOKEN through litellm_params when set so the harness can run
the real managed customer path, not just long-lived keys.

* fix(gateway): address coderabbit review on managed-bedrock vpce

- guard the VPCE intercept (Dispatch + DispatchStream) to RequestTypeChat so /v1/messages stays on the raw-forward path; routing Anthropic-native bodies through Converse would drop messages-only fields like thinking. The studio/eval LLM path is ChatCompletions, so the managed-bedrock customer path is unaffected.

- require https for non-PrivateLink endpoints in validateBedrockEndpoint: http is now allowed only for .vpce.amazonaws.com hosts, so public bedrock-runtime cannot be downgraded to plaintext transport. The customer http VPCE on :80 still passes (vpce host).

- bound the live stream test context with a 2m timeout so a stalled upstream cannot hang it.
…on (#4198)

Multi-MB base64 span attributes (data: image/audio data URLs embedded in
multimodal LLM input/output, plus very large params) were entering the
event-sourcing fold path. The trace-processing pipeline read-modify-writes
each trace's accumulated fold state in Redis per event; when that state is
multi-MB, every op saturates the single-threaded Redis command loop, folding
throughput collapses, staging outpaces it, and the backlog and Redis memory
diverge.

Replace attribute values over 256KB with a short placeholder in RecordSpanCommand,
the shared chokepoint every ingestion path dispatches through (collector REST
and OTLP both call commands.traces.recordSpan). Capping there shrinks the
SpanReceivedEvent before it is folded, keeping fold state in the KB range
rather than the MB range.

The cap recurses through arrayValue/kvlistValue and handles bytesValue. It never
throws on malformed values, and normal traces are untouched. It only helps new
traces; the existing backlog is handled operationally.
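
A minimal sketch of the recursive cap described above, assuming OTLP-style AnyValue objects. `capAnyValue`, the placeholder text, and measuring size by string length (an approximation of bytes) are assumptions for illustration, not the code in RecordSpanCommand.

```typescript
const MAX_ATTRIBUTE_BYTES = 256 * 1024;

// Subset of the OTLP AnyValue shape that matters for the cap.
type AnyValue = {
  stringValue?: string;
  bytesValue?: string | Uint8Array;
  arrayValue?: { values?: AnyValue[] };
  kvlistValue?: { values?: { key: string; value?: AnyValue }[] };
};

// Hypothetical placeholder wording.
function placeholder(size: number): AnyValue {
  return { stringValue: `[omitted: value of ~${size} bytes exceeded the attribute cap]` };
}

function capAnyValue(value: unknown, maxBytes = MAX_ATTRIBUTE_BYTES): unknown {
  // Malformed or primitive input is passed through untouched instead of throwing.
  if (typeof value !== "object" || value === null) return value;
  const v = value as AnyValue;

  if (typeof v.stringValue === "string") {
    // length counts UTF-16 units, used here as a cheap stand-in for bytes.
    return v.stringValue.length > maxBytes ? placeholder(v.stringValue.length) : v;
  }
  if (typeof v.bytesValue === "string" || v.bytesValue instanceof Uint8Array) {
    return v.bytesValue.length > maxBytes ? placeholder(v.bytesValue.length) : v;
  }

  const items = v.arrayValue?.values;
  if (Array.isArray(items)) {
    return { arrayValue: { values: items.map((item) => capAnyValue(item, maxBytes)) } };
  }
  const entries = v.kvlistValue?.values;
  if (Array.isArray(entries)) {
    return {
      kvlistValue: {
        values: entries.map((kv) => ({ ...kv, value: capAnyValue(kv?.value, maxBytes) })),
      },
    };
  }
  return v;
}
```

Small scalar values are returned by reference, so a normal trace pays only for the traversal.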
…ore cycle (#4199)

* fix(event-sourcing): coalesce backed-up fold groups into one apply/store cycle

Fold projections process one event at a time: read state, apply, write back.
For a single aggregate with N queued events that is N load+store round-trips
over state that grows with N (O(n^2)), and the GroupQueue's per-group active
gate forces strict one-at-a-time draining. A 233-span trace backed its fold
group up and diverged instead of draining, stalling the single-threaded Redis
command path for every tenant.

Add opt-in per-group batch coalescing: while a worker holds a group's active
slot (which already excludes the group from concurrent dispatch), drain the
group's remaining due jobs in one pop-only Lua call and fold them in a single
load/apply/store cycle, in occurredAt order. A backed-up group now drains in
O(n) and self-heals. When the queue keeps up, batches are size 1 and the
per-event path is byte-for-byte unchanged. Coalescing is bounded by a per-fold
max batch (default 100) and enabled for all fold projections; other queues are
untouched. A failed batch re-stages its drained siblings so none are lost, and
the drain decrements total-pending per job exactly as dispatch does.
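
The single load/apply/store cycle can be sketched as below. This is a simplified illustration: the store is synchronous and in-memory where the real one is Redis/ClickHouse-backed, and `foldBatch`, `FoldStore`, and the `leftover` return value are hypothetical names. Only the occurredAt ordering and the default max batch of 100 come from the description above.

```typescript
interface FoldEvent {
  id: string;
  occurredAt: number; // epoch milliseconds
}

interface FoldStore<S> {
  load(key: string): S | undefined;
  store(key: string, state: S): void;
}

function foldBatch<S, E extends FoldEvent>(
  store: FoldStore<S>,
  key: string,
  drained: E[],
  init: () => S,
  apply: (state: S, event: E) => S,
  maxBatch = 100,
): { state: S; leftover: E[] } {
  // Bound the batch, then fold in occurredAt order regardless of drain order.
  const batch = drained.slice(0, maxBatch).sort((a, b) => a.occurredAt - b.occurredAt);
  const leftover = drained.slice(maxBatch);

  let state = store.load(key) ?? init(); // one load for the whole batch
  for (const event of batch) {
    state = apply(state, event);
  }
  store.store(key, state); // one store for the whole batch
  return { state, leftover };
}
```

With N queued events this performs one load and one store instead of N of each, which is where the O(n^2) to O(n) change comes from.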

Also raise the Redis fold-cache TTL from 30s to 300s so the accumulated state
stays warm across an aggregate's event stream instead of expiring mid-stream
and forcing a ClickHouse re-read of the full state on every event.

Tests: executeBatch equivalence + ordering + out-of-order re-fold (unit);
drainGroupReady semantics and pending-counter conservation (integration, real
Redis); end-to-end coalescing, max-batch cap, disabled-passthrough, and
failed-batch re-stage (integration, real Redis).

* fix(event-sourcing): per-event reactor dispatch under coalescing + review fixes

Address review on the coalescing PR:

- BLOCKER (per-span reactors): processFoldProjectionBatch fired fold reactors
  once with only the last event, which silently dropped N-1 spans for per-span
  reactors (customEvaluationSync extracts evals from event.data.span;
  evaluationTrigger/scenarioExecution key on each event). It did so only under
  backlog (batch>=2), which is exactly the recovery path. Reactors are now
  dispatched per event with the final fold state: event-sensitive reactors see
  every event, fold-state reactors dedup via makeJobId to one effective run, and
  the O(n) win stays in the single fold load/store, not the cheap reactor
  enqueues.
- Dispatch reactors (and fold) in occurredAt order so reactor metadata and the
  final state are consistent regardless of drain/arrival order.
- Global processBatch: guard that every payload maps to the same registry entry,
  falling back to per-item processing on any mismatch (never misroute).
- Fold cache TTL is now env-overridable (LANGWATCH_FOLD_CACHE_TTL_SECONDS, read
  at call time) so residency can be dialed down without a redeploy; default 300s.
- redis-fold-cache.feature: correct the store ordering wording (ClickHouse
  first, then Redis cache) to match RedisCachedFoldStore.
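
The env-overridable TTL from the list above might look like the sketch below. Only the variable name and the 300s default come from the commit message; `foldCacheTtlSeconds`, passing the environment in as an argument, and falling back to the default on invalid values are assumptions.

```typescript
const DEFAULT_FOLD_CACHE_TTL_SECONDS = 300;

// Called on each use rather than cached at module load, so a changed
// environment value is picked up without restarting the caller's logic.
function foldCacheTtlSeconds(env: Record<string, string | undefined>): number {
  const raw = env["LANGWATCH_FOLD_CACHE_TTL_SECONDS"];
  if (raw === undefined) return DEFAULT_FOLD_CACHE_TTL_SECONDS;

  const parsed = parseInt(raw, 10);
  // Assumed behavior: unparsable or non-positive values fall back to the default.
  if (isNaN(parsed) || parsed <= 0) return DEFAULT_FOLD_CACHE_TTL_SECONDS;
  return parsed;
}
```

In a Node process the caller would pass `process.env`; taking it as a parameter keeps the function easy to test.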

Coalescing stays default-on for every fold: the final folded state is identical
to per-event folding, per-event reactor dispatch preserves event-sensitive
reactors, and out-of-order is handled identically (executeBatch sorts by
occurredAt + same checkpoint re-fold). A fold can still opt out via
options.coalesceMaxBatch = 1.
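
A simplified sketch of per-event reactor dispatch with job-id dedup, under stated assumptions: `Reactor`, `DedupQueue`, and `dispatchReactors` are hypothetical shapes, the makeJobId signature is invented for illustration, and keeping the first job for a repeated id is an assumption about the queue.

```typescript
interface SpanEvent {
  id: string;
  occurredAt: number;
}

interface Reactor {
  name: string;
  // Per-event reactors derive the id from the event; fold-state reactors
  // derive it from the aggregate only, so repeated enqueues collapse.
  makeJobId(aggregateId: string, event: SpanEvent): string;
}

interface Job<S> {
  jobId: string;
  reactor: string;
  event: SpanEvent;
  state: S;
}

class DedupQueue<S> {
  private readonly jobs = new Map<string, Job<S>>();

  enqueue(job: Job<S>): void {
    // An id that is already queued is treated as the same unit of work.
    if (!this.jobs.has(job.jobId)) this.jobs.set(job.jobId, job);
  }

  list(): Job<S>[] {
    return Array.from(this.jobs.values());
  }
}

function dispatchReactors<S>(
  aggregateId: string,
  batch: SpanEvent[],
  finalState: S,
  reactors: Reactor[],
  queue: DedupQueue<S>,
): void {
  // Dispatch in occurredAt order; every event is paired with the final state.
  const ordered = [...batch].sort((a, b) => a.occurredAt - b.occurredAt);
  for (const event of ordered) {
    for (const reactor of reactors) {
      const jobId = reactor.makeJobId(aggregateId, event);
      queue.enqueue({ jobId, reactor: reactor.name, event, state: finalState });
    }
  }
}
```

For a batch of three events, a per-event reactor ends up with three jobs and a fold-state reactor with one, while the fold itself was still loaded and stored once.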

Tests: projectionRouter unit tests assert a coalesced batch loads/stores once
but dispatches reactors per event, in occurredAt order; a deterministic
ClickHouse integration test folds many spans via executeBatch and reads the
exact span count back from the trace_summaries table.

* test(event-sourcing): poll ClickHouse read-back in coalescing integration test

The coalescing fold writes the trace summary once; reading it straight back
raced ClickHouse insert visibility. Poll the read (as the other pipeline
integration tests do) so the assertion is deterministic. Verified green against
live ClickHouse: 40 spans coalesced -> spanCount 40 persisted.

* fix(test): use branded TenantId in coalescing CH test store context

ProjectionStoreContext.tenantId is a branded TenantId, not a plain string.
Type-only fix (the store already coerces with String()); unblocks tsgo.