[pull] main from langwatch:main by pull[bot] · Pull Request #279 · erickirt/langwatch

pull · 2025-12-17T07:46:02Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

…3808) Phase 2 of #3458 (parity grinder) — setup domain. Bound to existing tests via JSDoc @Scenario: simplified-setup.feature (2/12): * "Shows Integration configured alert when firstMessage exists" and "Shows Waiting for messages when no firstMessage" — bound to the corresponding cases in project.getHasFirstMessage.unit.test.ts. The render-side condition is driven entirely by this procedure's return value. Kept @unimplemented (with justifying comments — needs harness): simplified-setup.feature (10): * Setup page UI rendering (API-key visibility/copy, endpoint section, SDK guides, "section X is not shown" assertions, copy tracking) — no JSDOM render fixture for the setup page yet. docker-dev-worktree-isolation.feature (6): * Dev-shell + port-allocator behaviour. Tests for the TypeScript port allocator + bash worktree helpers exist in `packages/server/test/` and `scripts/__tests__/*.bats`, but neither path is in the parity check's DEFAULT_TEST_ROOTS, so any JSDoc binding there would not be discovered. specs/setup net @unimplemented: 18 → 18 (-2 bound, 16 justified). `pnpm check:feature-parity` passes. `pnpm typecheck` clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 2 of #3458 (parity grinder) — navigation domain. Bound to existing tests via JSDoc @Scenario: child-drawer-nesting.feature (5/10): * SuiteFormDrawer → scenario editor child drawer (open + close + form-state preservation): bound to the suite-editor integration test cases that exercise that exact path. * SuiteFormDrawer → agentHttpEditor child drawer (open + close): bound to the matching cases in the same file. Kept @unimplemented (with justifying comments — needs harness): child-drawer-nesting.feature (5): * ScenarioRunDetail → trace child drawer end-to-end flow (composition is exercised but not the full open/close cycle). * "Open a drawer from a page" (root-level) — implicit in every standalone drawer test but no dedicated assertion. * Escape-only-closes-topmost (semantics inherited from Chakra Drawer overlay stack; not asserted directly). * E2E "new scenario appears in picker" (no Playwright fixture). home-navigation.feature (10): * All page-level routing/active-state scenarios: MainMenu sidebar has no JSDOM render test, no Playwright/E2E suite for project routing. Cheap follow-up. specs/navigation net @unimplemented: 20 → 20 (-5 bound, 15 justified). `pnpm check:feature-parity` passes. `pnpm typecheck` clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cked (#3812) test(python-sdk): document why 34 @unimplemented scenarios stay unbound Phase 2 of #3458 (parity grinder) — python-sdk domain. All 34 `@unimplemented` scenarios describe Python SDK behaviour with pytest cases under `python-sdk/tests/`. The check-feature-parity script's TEST_FILE_RE only matches `*.test.ts/tsx`, so pytest cases cannot bind via JSDoc today. Affected feature files (all add a single header comment so the reason is durable rather than orphaned `@unimplemented` tags): * experiment-print-summary.feature (8): tests in python-sdk/tests/test_experiment.py + test_examples.py. * prompt-tags.feature (15): tests in python-sdk/tests/prompts/test_prompt.py + test_fetch_policies.py. * async-experiment-parallelism.feature (11): async-loop / asubmit / loop-bound-resource cases tracked separately in the pytest suite. specs/python-sdk net @unimplemented: 34 → 34 (0 bound, 34 justified). Cheap structural fix tracked separately: extend check-feature-parity.ts to read pytest docstring tags so python-sdk and python-server scenarios can bind too. `pnpm check:feature-parity` passes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
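The discovery constraint described above (a test file can only bind if it sits under a scanned root and its name matches `TEST_FILE_RE`) can be sketched as follows. This is a minimal illustration, not the actual check-feature-parity.ts code: the regex is inferred from the "`*.test.ts/tsx`" description, and the root list and `isDiscoverable` helper are hypothetical.

```typescript
// Assumption: inferred from the "*.test.ts/tsx" description in the commit message.
const TEST_FILE_RE = /\.test\.tsx?$/;

// Hypothetical roots for illustration; the real DEFAULT_TEST_ROOTS may differ.
const DEFAULT_TEST_ROOTS: string[] = ["langwatch/src", "typescript-sdk/src"];

// A file can carry a binding only if it is under a scanned root
// AND its name matches the test-file pattern.
function isDiscoverable(
  path: string,
  roots: string[] = DEFAULT_TEST_ROOTS,
): boolean {
  const underRoot = roots.some((root) => path.indexOf(root + "/") === 0);
  return underRoot && TEST_FILE_RE.test(path);
}

// A TypeScript unit test under a scanned root is discoverable.
console.log(isDiscoverable("langwatch/src/server/project.unit.test.ts"));
// A pytest file fails the filename pattern, whatever root it is under.
console.log(isDiscoverable("python-sdk/tests/test_experiment.py"));
// A matching filename outside the scanned roots is still skipped.
console.log(isDiscoverable("packages/server/test/portAllocator.test.ts"));
```

Under these assumptions, both gaps named in the commit messages (pytest cases and tests outside the default roots) fall out of the same two-part check.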

) Phase 2 of #3458 (parity grinder), projects domain.

Bound to existing tests via JSDoc @Scenario:

create-project-drawer.feature (5/22):
* "Project name is required" + "Project name with only whitespace is invalid": bound to ProjectForm.unit.test.ts validation cases.
* "Show new team name field when creating new team" + "New team name is required when creating team": bound to the requires-new-team-name validation case.
* "Drawer receives correct organization when opened from different org": bound to CreateProjectDrawer.test.tsx multi-org organizationId-prop case.
* "Allow creation when plan has override enabled": bound to project.create.planLimit.integration.test.ts override case.

project-creation-flow.feature (1/12):
* "Project creation calls correct API endpoint": partially bound to the multi-org organizationId test (asserts mutation call payload).

Kept @unimplemented (with justifying comments; needs harness):

create-project-drawer.feature (17): page-render flows for entry-point clicks (settings/team/dropdown), drawer chrome, close mechanisms, team selector, loading state. No JSDOM render fixture covers these.

project-creation-flow.feature (11): success/error toast, dup-name handling, drawer close, redirect, form reset, tracking event. These need a happy-path component test wiring drawer + mutation.

project-list-refresh.feature (8): query invalidation assertions + list/dropdown immediate-render assertions. Cheap follow-up: assert `organization.getAll.invalidate` + `limits.getUsage.invalidate` fire on the existing mutation test.

specs/projects net @unimplemented: 42 → 42 (-7 bound, 35 justified). `pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cenarios (#3815) Slice C2 of features-domain parity grinder (#3458). Maps KEEP-class @unimplemented scenarios in marketing/onboarding/billing feature files to existing tests via /** @Scenario "<title>" */ JSDoc bindings. Removes the @unimplemented tag for each newly-bound scenario. UPDATE-class scenarios (implementation diverged from spec wording) stay tagged + are listed in each feature file's header comment for future scenario rewrites. Per-file: - beta-pill.feature: 5 bound, 1 PARTIAL (Suites sidebar — no integration test) - signup-slack-notifications.feature: all 5 bound - onboarding/welcome-screens.feature: all 4 bound - onboarding/mcp-setup-prompt-compatibility.feature: 2 bound, 1 UPDATE - pricing-model-aware-free-plan.feature: 6 bound, 6 spans gap - settings-plans-comparison.feature: 5 bound, 4 UPDATE - trace-limit-upgrade-message.feature: 0 bound, 4 UPDATE (wording diverged) - customer-io-nurturing-integration.feature: 56 bound, 8 UPDATE Bumps `pnpm check:feature-parity` enforced count by 83. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…DK (#3818)

test(typescript-sdk): bind 28 @unimplemented scenarios

Phase 2 of #3458 (parity grinder), typescript-sdk domain. Pre-classified by `specs/typescript-sdk/AUDIT_MANIFEST.md`.

Bound to existing tests via JSDoc @Scenario:

cli-docs.feature (11/11):
* docs index/relative path/leading-slash/dup-prefix scenarios: bound to docs.unit.test.ts normalizeDocsUrl + docsCommand cases.
* scenario-docs index + relative path + dup-prefix: bound to same file's scenario describe block.

cli-error-handling.feature (6/6):
* "Sync surfaces specific conflict message": cli-error-propagation.integration.test.ts.
* "API errors surface meaningful message" + "fall back to raw JSON payload": same file.
* "Invalid API key returns clear authentication error" + "Network errors surface underlying cause": cli-error-edge-cases.integration.test.ts.
* "Common error conditions map to actionable messages": the describe block of cli-error-propagation-commands.integration.test.ts (covers all five command rows in the spec outline).

experiment-print-summary.feature (4/4):
* Bound to experiment-print-summary.test.ts pass/fail/exit cases.

prompt-tags.feature (7/24):
* Fetch by tag via options + tag in cache key + different tags produce different cache entries + unassigned tag returns error + shorthand syntax passthrough + fetch without tag returns latest: bound to prompts.facade.unit.test.ts.
* List tags GET / returns built-in+custom + Create custom tag via SDK + Delete custom tag via SDK: bound to prompt-tag-crud.unit.test.ts.
* Assign tag to existing version + assignment confirmation: bound to prompt-tags.unit.test.ts.

Kept @unimplemented (with justifying header notes; needs harness):

cli-prompt-tags.feature (28/30 stay unimpl):
* SDK-level renameTag rows handled by sister PR #3696; not duplicated here.
* CLI command-level rows live under `typescript-sdk/src/cli/commands/{prompt,tag}/__tests__/`; tests exist but aren't @scenario-bound yet (each one needs a pass-through).

4 rows are marked DUPLICATE/UPDATE in the manifest and stay unimpl by design.

prompt-tags.feature (17/24 stay unimpl):
* `prompts.create`/`prompts.update` request-body shape with a tags list: covered by client-factory tests.
* E2E flows under `typescript-sdk/e2e/prompts/prompt-tags.e2e.test.ts` are `it.skipIf(no API key)`; cheap follow-up adds @Scenario markers to those skip cases.

specs/typescript-sdk net @unimplemented: 75 → 75 (-28 bound, 47 justified). `pnpm check:feature-parity` passes. `pnpm typecheck` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…NIFEST (#3819) test(variables-ui): document why 78 @unimplemented scenarios stay unbound Phase 2 of #3458 (parity grinder) — variables-ui domain. All 78 `@unimplemented` scenarios across the three feature files are pre-classified KEEP/UPDATE in `specs/variables-ui/AUDIT_MANIFEST.md`. The underlying tests exist: * VariablesSection.test.tsx * VariableMappingInput.test.tsx * VariableTypeIcon.test.tsx * TargetVariablesPanel.test.tsx * VariableInsertMenu.test.tsx * PromptTextAreaWithVariables.test.tsx * liquidVariableExtraction.unit.test.ts But each `it(...)` case still needs an individual JSDoc `@scenario` marker — a mechanical pass per spec scenario. Adding header notes to each feature file pointing at the manifest + test files so the follow-up is unambiguous. 5 rows in `variable-insertion-menu.feature` are UPDATE-class (spec diverges from source — e.g. search input only in buttonMenuMode, badges show "Text"/"Number" not "STRING"/"OBJECT") and need spec rewriting before binding. specs/variables-ui net @unimplemented: 78 → 78 (0 bound, 78 justified). `pnpm check:feature-parity` passes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tform parity (#3820) * test(features): bind 51 + delete 15 + drop 1 file from enterprise/platform parity Slice C3 of features-domain parity grinder (#3458). Mixed clearing actions across enterprise/auth/platform feature files: Bindings (51 scenarios): - enterprise-feature-guards.feature: 23 DUPLICATE scenarios bound - webhook-service-refactor.feature: 20 DUPLICATE bound (full coverage) - platform-evaluator-and-model-provider-tools.feature: 4 bound, 9 NoTest - scim-group-mapping.feature: 4 bound, 21 NoTest documented - user-deactivation.feature: 4 bound, 8 NoTest + 4 UPDATE documented Deletions (16 scenario blocks + 1 file): - elasticsearch-write-disable-flags.feature: deleted entirely (8 DELETE scenarios — feature was reverted, columns dropped in migration 20260403120000) - remove-dead-cost-checker-code.feature: 9 DELETE scenarios removed (cleanup completed in PR #2661) - subscription-service-refactor.feature: 3 DELETE removed (NullSubscriptionService never landed) - stripe-price-catalog-sync.feature: 3 DELETE removed (sync task never landed) Header comments documenting NO_TEST and UPDATE gaps added to: - drawer-backdrop-transparency-blur.feature (1 NoTest) - tag-management.feature (6 NoTest, 1 UPDATE) - subscription-service-refactor.feature (5 NoTest, 1 UPDATE) - stripe-price-catalog-sync.feature (3 NoTest) Bumps `pnpm check:feature-parity` enforced count by 51 + drops 8 scenarios from the @unimplemented backlog via deletion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger CI for parity work --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d tests (#3821)

Binds the lite-to-full role-upgrade enforcement scenarios in specs/licensing/enforcement-members.feature to the existing prose unit tests in license-limit-guard.unit.test.ts:
- "Allows upgrade from Lite Member to full member when under limit" → assertMemberTypeLimitNotExceeded "allows change when under limit"
- "Blocks upgrade from Lite Member to full member when at member limit" → assertMemberTypeLimitNotExceeded "throws when at limit"

Both tests directly exercise the role-change limit guard with lite-to-full changeType, which is exactly what the scenarios specify.

Part of parity grinder Phase 2 (#3458). Builds on iter-1 PR #3803 and iter-2 PR #3810.

The "Blocks custom role change that would exceed full member limit" scenario is left @unimplemented because no existing test exercises it (it requires permission-array introspection coupled to limit checks, not just changeType).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
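The two bound behaviours (allow a lite-to-full change when under the limit, throw when at the limit) can be illustrated with a small guard. This is a sketch under assumptions: `assertFullMemberLimit`, its parameter shape, and the error text are hypothetical stand-ins, not the signature of the real assertMemberTypeLimitNotExceeded.

```typescript
// Hypothetical change types; only "lite-to-full" is named in the commit message.
type ChangeType = "lite-to-full" | "no-change";

interface LimitCheck {
  changeType: ChangeType;
  currentFullMembers: number; // full members before the change
  maxFullMembers: number; // limit from the plan or license
}

// Stand-in for the role-change limit guard: only a lite-to-full
// upgrade consumes a full-member seat, so only that path is checked.
function assertFullMemberLimit(check: LimitCheck): void {
  if (check.changeType !== "lite-to-full") return;
  if (check.currentFullMembers >= check.maxFullMembers) {
    throw new Error("Full member limit reached"); // hypothetical message
  }
}

// Under the limit: the upgrade is allowed (no throw).
assertFullMemberLimit({
  changeType: "lite-to-full",
  currentFullMembers: 2,
  maxFullMembers: 3,
});
```

A check keyed only on `changeType` also shows why the custom-role scenario stays unbound: deciding whether a custom role counts as a full member would need the permission list, which this shape does not carry.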

…ity (#3822) Slice C4 of features-domain parity grinder (#3458). Maps DUPLICATE-class @unimplemented scenarios in specs/features/scenarios/* to existing tests via /** @Scenario "<title>" */ JSDoc bindings; removes @unimplemented from each newly-bound scenario; deletes 1 layout-only scenario; adds header comments documenting KEEP-class no-test gaps. Per-file: - scenario-run-status-config-location.feature: 2 bound - scenario-id-format.feature: 2 bound, 1 NoTest documented - unified-agent-target-section.feature: 1 bound, 3 NoTest documented - on-prem-hostname-validation.feature: 8 already bound, no changes - extensible-scenario-metadata.feature: 11 bound (full coverage) - run-view-side-by-side-layout.feature: 2 bound, 4 NoTest, 1 DELETE ("Drawer content scrolls when it overflows" — pure CSS overflow concern) - workflow-agent-interpolation.feature: 5 bound, 1 NoTest (Python-only AC) - workflow-agent-mapping-layer.feature: 9 bound, 4 NoTest documented Bumps `pnpm check:feature-parity` enforced count by 40. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ature files (#3806) * test(features): bind 127 @unimplemented dataset scenarios across 7 feature files Slice C1 of features-domain parity grinder (#3458). Maps every KEEP-class @unimplemented scenario in the dataset feature files to its existing test via /** @Scenario "<title>" */ JSDoc bindings. Removes the @unimplemented tag from each newly-bound scenario. Per-file: - dataset-rest-api.feature: 36 scenarios bound, 3 NO_TEST_FOUND gaps kept @unimplemented (batch-create 404/empty-body/size-limit) - dataset-typescript-sdk.feature: 32 scenarios bound, 1 gap kept (list records explicit pagination) - dataset-mcp-tools.feature: 21 scenarios fully bound - dataset-file-upload-api.feature: 36 scenarios fully bound - dataset-cli.feature: 2 @Unit scenarios bound - dataset-python-sdk.feature: header comment justifying that Python tests cannot bind via TS-only parity checker; scenarios kept @unimplemented - upload-utils.unit.test.ts: bindings using single-quote delimiters where titles contain double-quotes ("id"/"selected" column rename scenarios) Bumps `pnpm check:feature-parity` enforced count by 127. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger CI for parity work * ci: retrigger flaky langchain integration test --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
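The binding convention used throughout these commits, a `/** @Scenario "<title>" */` JSDoc comment on a test, with single-quote delimiters when the title itself contains double quotes, could be parsed roughly as follows. This is a minimal sketch, not the actual check-feature-parity code: `extractScenarioBindings`, `findUnboundScenarios`, and the second sample title are hypothetical.

```typescript
// Collects scenario titles from JSDoc bindings in a test file's source.
// The opening quote is captured so the same character must close the title,
// which lets a single-quoted title contain double quotes (and vice versa).
function extractScenarioBindings(source: string): string[] {
  const bindingRe = /\/\*\*\s*@scenario\s+(["'])(.+?)\1\s*\*\//gi;
  const titles: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = bindingRe.exec(source)) !== null) {
    const title = match[2];
    if (title !== undefined) titles.push(title);
  }
  return titles;
}

// Scenario titles from a feature file that no test has bound yet.
function findUnboundScenarios(
  featureTitles: string[],
  boundTitles: string[],
): string[] {
  return featureTitles.filter((title) => boundTitles.indexOf(title) === -1);
}

const sampleSource = [
  '/** @Scenario "Project name is required" */',
  'it("rejects an empty name", () => {});',
  "/** @Scenario 'Renames \"id\" column' */", // hypothetical title
  'it("renames reserved columns", () => {});',
].join("\n");

console.log(extractScenarioBindings(sampleSource));
```

Matching titles verbatim is what makes a stale feature title fail the check: a binding whose text no longer equals any scenario title has nothing to attach to.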

… 12 files (#3824)

* test(features): suites group A, bind 19 scenarios + document gaps in 12 files

Slice C5/A of features-domain parity grinder (#3458). Apply parity-clearing to 12 largest suites/* feature files. Bind DUPLICATE-class scenarios to existing tests via /** @Scenario "<title>" */ JSDoc annotations; remove the @unimplemented tag for newly-bound scenarios; document remaining KEEP-with-NoTest and UPDATE-class scenarios in per-file header comments.

Files (12):
- cancel-queued-running-jobs.feature
- unified-run-table.feature
- run-history-group-by.feature
- unified-run-view-layout.feature
- external-sdk-ci-sets-in-sidebar.feature
- rename-suites-to-runs.feature
- real-time-run-updates.feature
- grid-view-and-borderless-tables.feature
- suite-list-view-status.feature
- footer-to-header-migration.feature
- suite-url-routing.feature
- unified-sidebar-list-items.feature

Bumps `pnpm check:feature-parity` enforced count by 19 over baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger CI for parity work

* fix(features): address CodeRabbit nits, unbind 2 mismatched cancellation scenarios + correct header counts

CodeRabbit caught 3 issues on PR #3824:
1. langwatch/src/server/scenarios/__tests__/cancellation.unit.test.ts: the @Scenario annotations on lines 129/139 didn't match what the tests actually verify. Removed both annotations, restored @unimplemented on the matching scenarios in cancel-queued-running-jobs.feature, and added them to the file's NO_TEST gap list.
2. specs/features/suites/rename-suites-to-runs.feature: header counts "8 of 16" → "9 of 20" and "8 NO_TEST" → "11 NO_TEST" to reflect actual file.
3. specs/features/suites/unified-run-table.feature: header counts "5 of 12" → "4 of 12" and "7 NO_TEST" → "8 NO_TEST".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…les (#3827) * test(features): C8 devtools/eval-v3 — bind 17 + delete 13 across 6 files Slice C8 of features-domain parity grinder (#3458). Final cluster: devtools/* (4 files, mostly skill/harness gaps) + evaluations-v3/* (2 files, mostly bindable to existing tests). Files: evaluations-v3/thread-variables-in-trace-evaluator.feature: 10 bound, 1 DELETE evaluations-v3/evaluator-run-rerun-enhancements.feature: 8 bound, 3 NoTest + 1 UPDATE devtools/bullboard-queue-dashboard.feature: 1 bound, 4 NoTest devtools/worktree-creation.feature: 0 bound (16 HARNESS_GAP) devtools/orchestrator-bug-fix-workflow.feature: 4 DELETE devtools/issue-creation-skill.feature: 8 DELETE (template diverged) 17 bindings, 13 deletions, 11 files modified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger CI for parity work --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… 12 files (#3825) * test(features): suites group B — bind 17 scenarios + document gaps in 12 files Slice C5/B of features-domain parity grinder (#3458). 12 mid-sized suites/* feature files. Bind DUPLICATE-class scenarios via JSDoc and document remaining KEEP/UPDATE in header comments. 2 files fully cleared: suite-bugfixes-1956.feature and suite-archive-confirmation-dialog.feature. Files: - suite-sidebar-status-summary, inline-add-target-and-scenario-buttons, target-selector-select-clear-all, all-runs-scenario-names, suite-bugfixes-1956, suite-run-confirmation-modal, suite-runs-time-filter, collapsible-suite-sidebar, nested-drawer-typing, run-scenario-target-selector-modal-stability, sidebar-summary-status, suite-archive-confirmation-dialog 17 bindings across 10 test files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger CI for parity work --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mall files (#3826) * test(features): suites group C — bind 38 scenarios + delete 3 + 11 small files Slice C5/C of features-domain parity grinder (#3458). Process the remaining 11 small suites/* feature files plus prompts/custom-prompt-tags.feature. Bind DUPLICATE-class scenarios via JSDoc, delete 3 obsolete scenarios from remove-label-tag-pills.feature, and document UPDATE-class divergences via header comments in suite-url-nesting.feature. Files: Suites (11): - all-runs-default-open (2 bound, fully cleared) - simulation-run-status-consistency (no @unimplemented) - all-runs-panel (3 bound, fully cleared) - single-loading-indicator (3 bound, fully cleared) - remove-label-tag-pills (3 deleted; entire scenario set obsolete) - remove-redundant-suites-label (2 bound, fully cleared) - suite-empty-state (3 bound, fully cleared) - trace-role-cost-accumulation (no @unimplemented) - suite-url-nesting (4 UPDATE — URL scheme diverged; header comment) - all-runs-batch-origin-label (2 bound, 1 NoTest) - all-runs-group-by (4 bound, fully cleared) Prompts (1): - custom-prompt-tags (19 bound, 1 NoTest) Total: 38 bindings, 3 deletions, 11 test files updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger CI for parity work --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lan parity (#3829)

Adds a new "counting unit by organization profile" describe block to usage-meter-policy.unit.test.ts that maps each scenario title verbatim to a unit test. Removes @unimplemented from 5 scenarios in the feature file. The 6th scenario (Self-hosted free org never blocked) stays @unimplemented + documented as a NoTest gap because UsageService.checkLimit has no IS_SAAS short-circuit yet.

Bound scenarios:
- Free TIERED organization counts each span toward the limit
- Free SEAT_EVENT organization counts each span toward the limit
- Paid TIERED organization counts each trace as one unit
- Paid SEAT_EVENT organization counts each span toward the limit
- Licensed organization respects its own counting rule

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
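The five bound scenarios amount to a small decision table, which might be expressed as below. This is a sketch of the rule the titles describe, not the real usage-meter policy: the `OrgProfile` shape, the `licenseCountingUnit` field, and `resolveCountingUnit` are hypothetical names; only the outcomes come from the scenario titles.

```typescript
type PricingModel = "TIERED" | "SEAT_EVENT";
type CountingUnit = "span" | "trace";

// Hypothetical profile shape for illustration.
interface OrgProfile {
  pricingModel: PricingModel;
  isPaid: boolean;
  // Set only for licensed organizations, which bring their own rule.
  licenseCountingUnit?: CountingUnit;
}

function resolveCountingUnit(org: OrgProfile): CountingUnit {
  // Licensed organization respects its own counting rule.
  if (org.licenseCountingUnit !== undefined) return org.licenseCountingUnit;
  // Paid TIERED is the only profile that counts each trace as one unit.
  if (org.isPaid && org.pricingModel === "TIERED") return "trace";
  // Free TIERED, free SEAT_EVENT, and paid SEAT_EVENT count each span.
  return "span";
}

console.log(resolveCountingUnit({ pricingModel: "TIERED", isPaid: true }));
```

Writing the rule as one pure function is what allows each scenario title to map verbatim onto a single assertion.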

…e parity (#3830)

Extract buildLimitMessage and buildUpgradeUrl from usage.service.ts into a dedicated limit-message.ts module so they can be unit-tested directly without spinning up the full UsageService.

Adds 4 tests, one per scenario in trace-limit-upgrade-message.feature, covering:
- Free-tier org on SaaS told to upgrade with correct unit
- Free-tier org on self-hosted told to buy a license
- Paid TIERED org on SaaS told to upgrade with traces unit
- Paid TIERED org on self-hosted told to buy a license

Refactor is behavior-preserving: usage.service.ts still calls the same functions, just imported instead of inline. All 19 existing usage.service tests still pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dge cases (#3832)

* test(features): bind 3 + implement 1 dataset-rest-api batch records edge cases

The header claimed 3 NoTest gaps for batch-create-records edge cases. In reality 2 of them already had matching tests:
- 404 for non-existent dataset (it("returns 404 Not Found") line 886)
- 422 for empty body (it("returns 422 Unprocessable Entity for empty body") line 903)

Added @Scenario JSDoc to those 2 tests + wrote 1 NEW test for the maximum-batch-size enforcement (1001-entry payload returns 422). Removed @unimplemented from all 3 scenarios + updated header to claim full 39/39 binding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: retrigger SIGABRT runtime crash

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
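The three edge cases bound here can be summarised as one validation routine. This is a sketch under stated assumptions, not the actual route handler: the limit of 1000 is inferred from the 1001-entry test, and the success status, check order, and `validateBatchCreate` name are hypothetical.

```typescript
// Assumption: a 1001-entry payload is rejected, so the cap is taken to be 1000.
const MAX_BATCH_SIZE = 1000;

// Returns the HTTP status the batch-create endpoint would answer with.
function validateBatchCreate(
  datasetExists: boolean,
  entries: unknown[] | undefined,
): number {
  if (!datasetExists) return 404; // non-existent dataset
  if (entries === undefined || entries.length === 0) return 422; // empty body
  if (entries.length > MAX_BATCH_SIZE) return 422; // over the batch-size cap
  return 200; // hypothetical success status
}

// Build a payload one entry over the assumed cap.
const oversized: unknown[] = [];
for (let i = 0; i < MAX_BATCH_SIZE + 1; i++) {
  oversized.push({ input: "row " + i });
}

console.log(validateBatchCreate(true, oversized));
```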

) Phase 2 of #3458 (parity grinder) — members domain. Bound to existing tests via JSDoc @Scenario: member-role-team-restrictions.feature (12/22): * "Lite Member" label + "Lite Member org role restricts team role to Viewer only" + "Member org role excludes Viewer" + "Member/Admin includes custom roles" + "Admin org role has all team role options" — bound to memberRoleConstraints unit tests covering getOrganizationRoleLabel + isTeamRoleAllowedForOrganizationRole. * Auto-correction scenarios (Member→Lite Member, Admin→Lite Member, Lite Member→Member, Admin→Member, Member→Admin) — bound to getAutoCorrectedTeamRoleForOrganizationRole tests. * "All team assignments respect Lite Member" + "Switching org role updates all team assignments" — bound to computeEffectiveTeamRoleUpdates EXTERNAL case. update-pending-invitation.feature (4/25): * "Member creates invitation request that requires approval" — bound to InvitesTable Pending Approval badge case. * "Admin creates immediate invite" — bound to InvitesTable Invited badge case. * "Admin approves invitation request" + "Admin rejects" — bound to WaitingApprovalActions admin Approve/Reject buttons. Kept @unimplemented (with justifying comments — needs harness): member-role-team-restrictions.feature (10): * UI dropdown render scenarios — no JSDOM render of the AddMembers form yet. * "API rejects non-Viewer for Lite Members" + "Saving enforces Viewer" + "Editing does not persist until save" — covered in `organization.member-roles.integration.test.ts` but that suite is `describe.skip()` (#3240). update-pending-invitation.feature (21): * Backend integration covered by `organization.invites.integration.test.ts` — `describe.skip()` (#3240). * Playwright e2e under `agentic-e2e-tests/tests/members/*` — `test.fixme()` (#1811 CI flakiness). specs/members net @unimplemented: 47 → 47 (-16 bound, 31 justified). `pnpm check:feature-parity` passes. `pnpm typecheck` clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tests (#3811) * test(private-dataplane): bind 20 @unimplemented scenarios to routing tests Phase 2 of #3458 (parity grinder) — private-dataplane domain. All 20 `@unimplemented` scenarios in this domain map directly onto existing routing tests that exercise the private-vs-shared paths end-to-end (testcontainers + real CH clients on the integration side, env-var mocks on the unit side). Bound to existing tests via JSDoc @Scenario: clickhouse-routing.feature (11/11): * Env-var parsing (3 scenarios) and shared-client default — bound to "when org has no private ClickHouse env var" * Private-client / private-instance routing / data-isolation proof (5 scenarios) — bound to "when org has a private ClickHouse env var configured" * Per-org caching — bound to "when called twice for the same organization" * getAllClickHouseInstances — bound to its dedicated case s3-routing.feature (5/5): * Env-var parse — bound to "parses a single org config" * Invalid JSON warning — bound to "skips the invalid entry" * Org-level routing (private + shared) — bound to the matching describe blocks * Project-level routing — bound to "when project belongs to org with private S3" data-isolation.feature (4/4): * Private-org spans-only — bound to span-storage repo "private-CH org" case * Shared-org spans-only + concurrent-routing — bound to "concurrent writes target different orgs" * Private-org events-only — bound to event-repo "private-CH org" case specs/private-dataplane net @unimplemented: 20 → 20 (-20 bound). `pnpm check:feature-parity` passes. `pnpm typecheck` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: retrigger flake CI Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…4150)

* feat(boxd): add preview lifecycle helpers to boxd-fork.sh

Adds a naming helper and three pure-ish orchestration functions:
- boxd_preview_vm_name BRANCH: returns preview-<slug(branch)>
- boxd_preview_up BRANCH: fork langwatch-golden-v2, checkout branch, docker compose up -d --build
- boxd_preview_down BRANCH: destroy the preview VM non-interactively
- boxd_preview_status BRANCH: print VM status, git HEAD, docker compose ps

Golden source controlled by LW_PREVIEW_GOLDEN_SOURCE (default: langwatch-golden-v2). All boxd exec recipes wrap pipefail in bash -c (existing lint rule).

Unit tests in scripts/__tests__/boxd-fork.unit.bats cover:
- boxd_preview_vm_name name derivation and truncation
- LW_PREVIEW_GOLDEN_SOURCE default + env override
- arg validation on all three entry points
- pipefail-in-bash-c regression guard (extended to cover preview helpers)

Closes #4149

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(boxd): add boxd-preview, boxd-preview-down, boxd-preview-status Make targets

Three new targets in boxd.mk (all require BRANCH=<name>):
- `make boxd-preview BRANCH=feat/foo`: fork team golden, start full stack, print URL
- `make boxd-preview-down BRANCH=feat/foo`: destroy preview-<slug> VM
- `make boxd-preview-status BRANCH=feat/foo`: VM status + git HEAD + compose ps

Follows existing _boxd_require / BOXD_RUN_PREFIX patterns. BRANCH guard matches fork-branch / fork-pr conventions.

Updates:
- boxd.mk: .PHONY list, surface comment, boxd-help, three new recipe blocks
- Makefile help: three-line summary under "Boxd workflows"
- dev/docs/boxd-makefile.md: quick reference, naming table, new env var row (LW_PREVIEW_GOLDEN_SOURCE), "PR preview lifecycle" section

Refs #4149

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(boxd-preview): validate branch shape in down/status (#4150 review)

Per CodeRabbit on PR #4150: boxd_preview_down and boxd_preview_status only checked non-empty, then slugged to a VM name. A typo or invalid branch could slug to a *different* valid VM name and target the wrong VM (destroy is the real risk). Mirror the git check-ref-format guard already in boxd_preview_up. Add two regression tests matching the existing _up invalid-branch test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(boxd-preview): move env assignment off `.` builtin

`bash -c 'LW_PREVIEW_GOLDEN_SOURCE=$(VAR) . scripts/boxd-fork.sh && fn'` was an antipattern. Command-prefix env assignments scope to a single command, even when that command is `.` (source). Bash unsets the var as soon as `.` returns, so the script's `:-langwatch-golden-v2` default ran during sourcing but evaporated before the function body needed it. Result: `set -u` tripped on `golden="${LW_PREVIEW_GOLDEN_SOURCE}"` in `boxd_preview_up`, failing the call with "unbound variable".

Move the assignment to the make recipe line (subshell-level env), so the bash subprocess inherits it normally.

Discovered by actually running `make boxd-preview BRANCH=main` for the first time during PR verification. Unit tests source the script with the env set, so they never hit this path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: MrKrustyKlaws <krusty@langwatch.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
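The wrong-VM risk raised in the review is easiest to see with a concrete slug function. The real helper is a bash function in boxd-fork.sh; the TypeScript port below is purely illustrative, and its slug rules, the 40-character cap, and the names `slugifyBranch` and `previewVmName` are assumptions.

```typescript
// Hypothetical truncation length; the real limit is not stated in the commits.
const MAX_SLUG_LENGTH = 40;

// Lowercase, collapse every run of non-alphanumerics to "-", trim, truncate.
function slugifyBranch(branch: string): string {
  return branch
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "")
    .slice(0, MAX_SLUG_LENGTH)
    .replace(/-+$/, ""); // truncation can expose a trailing dash
}

function previewVmName(branch: string): string {
  return "preview-" + slugifyBranch(branch);
}

// A valid branch and a mistyped, invalid one collapse to the same VM name.
console.log(previewVmName("feat/foo"));
console.log(previewVmName("feat foo"));
```

Because slugging is lossy, a non-empty check alone cannot tell these inputs apart, which is why validating the branch shape before slugging matters most on the destroy path.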

…4163)

* fix(groupQueue): update tenant cap test to match new default of 50

* fix(groupQueue): align feature file scenario title with actual default of 50

The feature file still said "defaults to 100" but the code was changed to DEFAULT_TENANT_CAP = 50. The @Scenario annotation in the test matched the code (50), not the stale feature title (100), causing check-feature-parity to fail with "unknown scenario".

…ests (#4167)

* test(groupQueue): add counter conservation tests for total-pending drift

Pin the total-pending counter invariant across every lifecycle path:
- happy path (stage → dispatch → complete)
- active key expiry (documents the current drift bug)
- retryRestage lifecycle (single retry)
- restageAndBlock + unblock lifecycle
- multi-retry (3 retries then success)
- proposed fix validation (proves naive DECR-at-dispatch breaks retries)
- reconciliation invariant (counter vs actual Redis state)

These tests are Phase 0 of the counter drift fix plan: they prove the bug exists and will serve as the regression suite for the structural fix.

* fix(groupQueue): increase reactor dedup TTLs to reduce Redis fan-out pressure

- traceUpdateBroadcast: 1s → 30s (notification only, frontend debounces)
- spanStorageBroadcast: 1s → 15s (notification only, frontend refetches)
- originGate: 5s → 30s (evaluationTrigger has its own 30s delay)

For a 30-span trace these three reactors were creating ~72 separate jobs instead of ~3. With the new TTLs, dedup squashes them during the span burst window, cutting reactor lifecycle Redis commands by ~60%.

Active recurrence: same tenant (1,731 groups, 558K ops/sec) hitting the same fan-out pattern as the May 21 incident.

* fix: address review comments, use dispatched groupId, remove unused var

* fix: update originGate unit test to match new 15s dedup TTL
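Why a longer dedup TTL collapses a span burst into fewer jobs can be shown with a toy deduper. This is a sketch, assuming a first-wins fixed window keyed per trace and a hypothetical arrival pattern (30 spans, one every 500 ms); it does not reproduce the real reactors or the ~72 and ~3 figures above.

```typescript
// Returns a function that says whether a job for `key` should be enqueued
// at time `nowMs`, suppressing repeats inside the TTL window.
function createDeduper(ttlMs: number): (key: string, nowMs: number) => boolean {
  const lastEnqueuedAt: Record<string, number | undefined> = {};
  return (key, nowMs) => {
    const last = lastEnqueuedAt[key];
    if (last !== undefined && nowMs - last < ttlMs) return false; // squashed
    lastEnqueuedAt[key] = nowMs;
    return true;
  };
}

// Counts how many jobs one reactor would enqueue for a single trace.
function countJobs(ttlMs: number, arrivalsMs: number[]): number {
  const shouldEnqueue = createDeduper(ttlMs);
  let jobs = 0;
  for (const at of arrivalsMs) {
    if (shouldEnqueue("trace-1", at)) jobs += 1;
  }
  return jobs;
}

// Hypothetical burst: 30 spans, one every 500 ms (0 ms to 14,500 ms).
const arrivals: number[] = [];
for (let i = 0; i < 30; i++) arrivals.push(i * 500);

console.log(countJobs(1_000, arrivals)); // 1s TTL
console.log(countJobs(30_000, arrivals)); // 30s TTL
```

With this assumed burst, the 1s window lets through 15 jobs while the 30s window lets through 1, since the whole burst fits inside a single window.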

#4168) Cache tenant cap check results within each EVAL call to avoid redundant GET calls. When 1,800 groups belong to one over-cap tenant, this turns 1,800 separate GET tenant_active calls into 1 GET + 1,799 Lua table lookups. Cache is invalidated after a successful dispatch INCRs the tenant counter, so the cap check stays fresh when it matters. Applied to both DISPATCH_LUA and DISPATCH_BATCH_LUA. Scan budget is kept unchanged to avoid starvation risk for smaller tenants when a large backlog clusters at the ZSET head.
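The per-call caching described above can be modeled outside Redis. The following is a minimal TypeScript sketch, not the actual Lua from DISPATCH_LUA / DISPATCH_BATCH_LUA. The names `dispatchPass`, `CounterStore`, and `makeCountingStore` are hypothetical, and the `>=` cap comparison is an assumption. It only illustrates how a per-call table turns repeated counter reads into lookups and why the entry is dropped after a dispatch.

```typescript
// Hypothetical model of one dispatch pass. The real logic runs as Lua inside a Redis EVAL.
type Group = { groupId: string; tenantId: string };

interface CounterStore {
  get(tenantId: string): number; // stands in for GET tenant_active
  incr(tenantId: string): void; // stands in for the INCR after a successful dispatch
}

function dispatchPass(
  groups: Group[],
  store: CounterStore,
  tenantCap: number,
  maxDispatch: number,
): string[] {
  // Cache lives for a single call only, mirroring "within each EVAL call".
  const overCapCache = new Map<string, boolean>();
  const dispatched: string[] = [];

  for (const group of groups) {
    if (dispatched.length >= maxDispatch) break;

    let overCap = overCapCache.get(group.tenantId);
    if (overCap === undefined) {
      // Assumption: a tenant is over cap once its active count reaches the cap.
      overCap = store.get(group.tenantId) >= tenantCap;
      overCapCache.set(group.tenantId, overCap);
    }
    if (overCap) continue;

    store.incr(group.tenantId);
    dispatched.push(group.groupId);
    // The counter just changed, so the cached answer may be stale: drop it.
    overCapCache.delete(group.tenantId);
  }
  return dispatched;
}

// Test helper (hypothetical): an in-memory store that counts how many reads happen.
function makeCountingStore(initial: Record<string, number>) {
  const counts = new Map<string, number>(Object.entries(initial));
  let gets = 0;
  const store: CounterStore = {
    get: (tenantId) => {
      gets += 1;
      return counts.get(tenantId) ?? 0;
    },
    incr: (tenantId) => {
      counts.set(tenantId, (counts.get(tenantId) ?? 0) + 1);
    },
  };
  return { store, getCount: () => gets };
}
```

With 1,800 groups from one over-cap tenant, this model performs a single read and 1,799 map lookups. A tenant that is one slot under the cap is re-read after its dispatch, so the second group sees the updated count.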

This reactor has been a no-op since LangWatch 3.0 (April 8) when ES writes were fully removed in favor of ClickHouse. The handler body is `return;` but without the `disabled` flag, every evaluation run still goes through the full STAGE → DISPATCH → COMPLETE lifecycle (~25 Redis commands) to execute nothing. Adding `disabled: true` makes the projectionRouter skip the `send()` call entirely (projectionRouter.ts:632), eliminating the Redis cost. The pipeline registry still creates the reactor for compatibility — it just never fires.

@Manual

* fix(prompts): clamp Deploy dialog Select width to row (#4006) The version Select used `width="auto" minWidth="180px"` with no horizontal budget on its parent HStack, so long commit messages pushed each environment row past the modal edge. Constrain the trigger to `flex=1 minWidth=0 maxWidth=280px` and let its parent HStack flex/shrink so the Select truncates instead of overflowing. The name/dot cluster stays at intrinsic width. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(feature-parity): tag manual layout scenario @unimplemented The "Version Select inputs stay within the modal width" scenario is asserted visually (jsdom does not compute layout) and has no automated test binding. Per TESTING_PHILOSOPHY.md, @Manual alone is not exempt from the binding check; @unimplemented is required to signal the gap is tracked. The DeployPromptDialog integration suite is also describe.skip'd due to a vitest hang (TODO #3022), so no automated binding exists or can be added without resolving that first. * test(evidence): add browser screenshots for Select overflow fix Chromium-rendered before/after showing the DeployPromptDialog version Select staying within the modal row boundary after the flex/minWidth fix (PR #4037). Before: selectRight=860 vs rowRight=476 (overflows by 384px) After: selectRight=459 vs rowRight=476 (17px margin, within bounds) Verified by: Playwright 1.60.0 + Chromium 148.0.7778.96 (headless) Script: /tmp/layout-verify.mjs — mirrors exact CSS from the fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ci: trigger fresh run after main test fix (tenantCap) Previous CI runs cached an old merge-base where tenantCap.unit.test.ts expected DEFAULT_TENANT_CAP = 100; main has since corrected it to 50. Empty commit forces a new workflow checkout against updated main. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ua (#4173) * fix(groupQueue): skip GET activeKey for over-cap groups in dispatch Lua When a tenant is over-cap, the dispatch scripts read the activeKey (GET) only to discard the result — the tenantOverCap gate on the next line guarantees the group is skipped. With 10K+ over-cap groups in the ready ZSET this wastes ~355K GETs/sec on the Redis single thread. Hoist the tenantOverCap check above the GET so over-cap groups skip immediately. Behaviorally identical: all five state combinations produce the same outcome; every write path is inside the gate and untouched. Applied to both DISPATCH_LUA and DISPATCH_BATCH_LUA. * fix(groupQueue): revert DISPATCH_LUA change — unused in production Only DISPATCH_BATCH_LUA is called by the dispatcher. Keep the change scoped to the hot path.

…cause, supersedes #3979) (#4148) * fix(python-sdk): scope disable_sending to trace lifetime (#3981) Root cause for offline-experiment cell traces silently dropped on the OTLP path. The langwatch_nlp worker pool reuses processes across event types, and `langwatch.trace(disable_sending=True)` was permanently flipping the singleton `Client._disable_sending` flag. Once any `execute_evaluation` or `execute_optimization` ran on a worker, every subsequent `execute_component` for an offline-experiment cell on the same worker had its spans dropped by `ConditionalSpanExporter`, even though the caller never opted into disabling. PR #3979 worked around the symptom by synthesizing a stand-in span from the orchestrator (`getApp().traces.recordSpan(...)`). That kept analytics charts non-empty but masked real OTLP data loss — the "real" LLM/tool/RAG spans for offline-experiment cells were still missing from `trace_summaries`/`stored_spans`. Fix: `LangWatchTrace` now snapshots the client's `disable_sending` state on `__enter__`/`__aenter__` and restores it in `_cleanup`. The per-trace request lives on the instance (`_disable_sending_request`) instead of mutating singleton state at construction time. Nested traces correctly restore the outer block's state on inner exit. Regression test exercises the exact worker-reuse shape from nlp's isolated_process_pool: a `disable_sending=True` trace followed by a default trace on the same singleton must not inherit the disabled flag. This replaces the workaround in PR #3979. With this fix the synthetic `recordSpan` block in `evaluations-v3/execution/orchestrator.ts` can be removed — close #3979 and target this branch for the analytics fix instead. Closes #3981. Supersedes #3979. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(python-sdk): refcount disable_sending for concurrent-safe scoping Address CodeRabbit critique on PR #4148: single-slot snapshot/restore in LangWatchTrace was unsafe for overlapping concurrent traces. 
Two threads each entering a `disable_sending=True` trace could corrupt the restoration order — one thread's exit would flip the flag back to False while the other was still holding it. Switch to a refcount on the Client: - `Client.acquire_disable_sending()` / `release_disable_sending()` bracket a `with langwatch.trace(disable_sending=True)` block. - Thread-safe via `threading.Lock`. - Captures the user-set baseline on the first acquire so explicit `client.disable_sending = True` set outside any trace block survives the trace exit; only the refcount-acquired delta is released. Add two regression tests for the cases CodeRabbit named: - user-set baseline survives trace exit. - overlapping concurrent disable traces keep the flag True until the last one exits. 51/51 SDK tests pass (was 49 before the new cases). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(analytics): JOIN regression test for offline-experiment evaluations Cherry-picked from #3979 (whose orchestrator workaround is no longer needed once #3981 lands). The test seeds both `evaluation_runs` and `trace_summaries` for synthetic offline-cell trace IDs, then asserts the analytics INNER JOIN on (TenantId, TraceId) returns non-null metric buckets. Independent of which write path populates trace_summaries — the test catches future regressions in the analytics read path (JOIN, metric registry, filters) either way. Comments updated to reference the #3981 SDK fix as the row-source mechanism instead of the abandoned workaround. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(analytics): retitle describe blocks to reflect runtime mechanism Addressing CodeRabbit on PR #4148: describe-block header and one body comment still attributed the trace_summaries row source to the orchestrator's synthetic recordSpan workaround from #3979 (now closed). The runtime source is now the normal OTel export pipeline (after the SDK fix lands). Test behavior is unchanged. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: MrKrustyKlaws <krusty@langwatch.ai> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
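The refcount scheme from the second commit can be illustrated with a small model. This is a single-threaded TypeScript analog, not the Python SDK code: the class name `DisableSendingFlag` and its method names are hypothetical, and the `threading.Lock` the commit describes is omitted because nothing here runs concurrently. It shows the two properties the regression tests name: a user-set baseline survives trace exit, and overlapping holders keep the flag set until the last one releases.

```typescript
// Hypothetical TypeScript analog of the Client refcount described above.
// The Python SDK guards these operations with threading.Lock; this sketch does not need one.
class DisableSendingFlag {
  private disabled = false;
  private baseline = false; // user-set value captured on the first acquire
  private holders = 0; // number of open disable_sending=True trace blocks

  // Stands in for an explicit `client.disable_sending = ...` outside any trace block.
  setByUser(value: boolean): void {
    this.disabled = value;
  }

  isDisabled(): boolean {
    return this.disabled;
  }

  acquire(): void {
    if (this.holders === 0) this.baseline = this.disabled;
    this.holders += 1;
    this.disabled = true;
  }

  release(): void {
    if (this.holders === 0) return; // unbalanced release is ignored in this sketch
    this.holders -= 1;
    // Only the last holder restores the baseline, so exit order does not matter.
    if (this.holders === 0) this.disabled = this.baseline;
  }
}
```

The worker-reuse case from the first commit follows directly: one acquire/release pair leaves the flag exactly where it started, so a later default trace on the same singleton does not inherit the disabled state.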

@Scenario

…s evaluations-v3 → experiments rename (#3893) * docs(specs): add specs for experiments and runs list endpoints Behavior specs for the new public REST list endpoints (experiments and evaluation runs), the matching CLI commands, and the MCP tools. Specs come first per BDD workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(api): add list endpoints for experiments and evaluation runs Adds two missing public REST surfaces so API-key-authenticated clients (CLI, MCP server, scripts) can enumerate experiments and their runs: - GET /api/experiments — lists project experiments with run count and last-run timestamp. Goes through ExperimentService.getAll and ExperimentRunService.listRuns; no direct Prisma access from the route. - GET /api/evaluations/v3/runs?experimentSlug=<slug> — lists runs for an experiment by slug. Reuses ExperimentRunService.listRuns. Returns 400 for missing slug, 404 for unknown slug. Both endpoints return camelCase shapes with project-scoped pagination (default pageSize 50, max 200, hasMore flag). Refs #3885. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(sdk): add EvaluationsApiService.listExperiments / listRuns Wraps the new GET /api/experiments and GET /api/evaluations/v3/runs endpoints. Types are hand-written because these routes are not yet exposed via the generated OpenAPI paths — same pattern PR #3886 used for getRunResults. Refs #3885. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): add experiment list and evaluation list-runs commands - langwatch experiment list — lists experiments in the project with table or JSON output, --limit caps the page size. - langwatch evaluation list-runs --experiment <slug> — lists runs for the given experiment slug; table mode shows runId, status, started/ finished and a pass-rate summary. Both commands check the API key and surface 404s with a non-zero exit so CI scripts can branch on missing experiments. Refs #3885. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(mcp): add platform_experiment_list and platform_evaluation_list_runs tools Surfaces the new list endpoints to LLM agents via MCP. Both tools emit markdown so the agent can read them directly, and both clamp the page size (default 25, max 100) to protect the agent's context window. platform_experiment_list discovers experiment slugs; the result feeds into platform_evaluation_list_runs to discover runIds; those runIds then feed into platform_evaluation_results (PR #3886) — completing the list → runs → results discovery chain. Refs #3885. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(specs): add specs for evaluation results CLI command and MCP tool Behavior-level scenarios covering completed-run fetch, failed-row filtering, evaluator narrowing, JSON piping, and graceful handling of running/missing runs — for both the new `langwatch evaluation results` CLI command and the `platform_evaluation_results` MCP tool. Refs #3885 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(sdk): add EvaluationsApiService.getRunResults Adds a `getRunResults({ runId })` method that fetches per-row experiment-run results from `GET /api/evaluations/v3/runs/:runId/results`. The route is not yet declared in the generated OpenAPI types, so the response type (EvaluationRunResultsResponse) is hand-written to mirror the control-plane ExperimentRunWithItems type. Refs #3885 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): add `langwatch evaluation results <runId>` command Renders per-row experiment-run results so engineers can debug evaluator scores and missed rows from the terminal. Supports `--filter failed` to focus on failures, `--evaluator <name>` to narrow to a specific evaluator, `--format json` to pipe the full payload, and `--limit <n>` to cap table output (default 20). 
Refs #3885 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(mcp): add platform_evaluation_results tool Lets coding agents fetch per-row experiment-run results to debug evaluator scores and missed rows. Output is markdown styled like the existing run-evaluation tool: per-evaluator averages plus row-by-row scores and failure details. Default row cap of 50 protects the agent's context window; agents can narrow with `filter: 'failed'` or `evaluator`, or raise `limit` when really needed. Refs #3885 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sdk): handle 200+null body from /runs/:runId/results for missing runs The evaluations-v3 /results endpoint returns HTTP 200 with a JSON `null` body when the runId does not exist (rather than 404). Without this fix, `getRunResults` returns null and the CLI crashed with `Cannot read properties of null (reading 'dataset')` instead of a clean error. Detected during live dogfood of the new `evaluation results` command (issue #3885) — bogus runId triggered a TypeError. Now the SDK throws a proper EvaluationsApiError with operation context, and the CLI prints a "Run not found" message. Refs #3885 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address evaluation listing review feedback * fix: address evaluation results follow-up feedback * test: bind 17 scenarios for experiments/runs list across api/cli/mcp Adds @Scenario JSDoc bindings to existing it() blocks in the integration and unit tests for the new experiments-list, evaluation-runs-list, mcp list tools, and CLI list commands. Marks 6 scenarios @unimplemented in the feature files where no direct test exists yet (empty-project, limit-cap-by-user-input, missing-API-key error, etc.) — those represent tracked gaps to fill in a follow-up. Resolves the langwatch-app-ci `feature-parity` failure on PR #3893. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address coderabbit + codeql feedback on PR #3893 - typescript-sdk tests: replace inline `typeof import()` with top-level `import type * as` aliases (CLAUDE.md rule against inline import()) - typescript-sdk CLI evaluation results: status cell now reflects evaluator failures, not just entry.error - mcp-server list-experiments: combine backslash + pipe + backtick escapes into a single regex pass to avoid the codeql-flagged incomplete-escape ordering issue - clickhouse-experiment-run.service: argMax(CreatedAt, UpdatedAt) for run dedup (latest version), and add RunId tiebreaker to ORDER BY - experiment.repository: add id-desc tiebreaker to paginated findMany - mcp-server get-evaluation-results: aggregate evaluator summary from filtered rows so stats match the displayed subset under filter=failed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: rename evaluations-v3 → experiments across API, SDK, CLI, MCP Aligns offline-evaluation surfaces with the project naming convention: "evaluations" refers to online evaluations / guardrails; offline / batch is "experiment". The `evaluations-v3` paths were misnamed historically — this sweep renames public surfaces and supporting internals in one pass: - API: `/api/evaluations/v3/{execute,abort,:slug/run,runs,runs/:runId, runs/:runId/results}` → `/api/experiments/{...}`. Route file moved to `routes/experiments-v3.ts`. Existing `/api/experiments` Hono app (list endpoint) is unchanged; the two apps coexist under the same basePath. - Server dir: `src/server/evaluations-v3/` → `src/server/experiments-v3/` (~30 internal imports updated). - UI module: `src/evaluations-v3/` → `src/experiments-v3/`. - Specs: `specs/evaluations-v3/` → `specs/experiments-v3/`. - SDK: `EvaluationsApiService` (in `services/evaluations/`) → `ExperimentsApiService` (in `services/experiments/`). 
Local error class renamed to `ExperimentsApiServiceError` to avoid colliding with the existing `ExperimentsApiError` in `platformErrors.ts`. `EvaluationRun*` response/entry types renamed to `ExperimentRun*`. `EvaluationsFacade` and the online-eval `EvaluationsApiError` stay put — they correctly target `/api/evaluations/:slug/evaluate`. - CLI: `langwatch evaluation {run,status,list-runs,results}` folded into the existing `langwatch experiment` group (now also exposes `list`). Files moved from `commands/evaluation/` → `commands/experiment/`. - MCP: `platform_run_evaluation`, `platform_evaluation_status`, `platform_evaluation_list_runs`, `platform_evaluation_results` → `platform_run_experiment`, `platform_experiment_status`, `platform_experiment_list_runs`, `platform_experiment_results`. Tool files renamed to match; handler exports renamed (`handleEvaluation*` → `handleExperiment*`). Left intentionally alone (separate concerns, larger blast radius): - Internal types `EvaluationsV3State` / `persistedEvaluationsV3StateSchema` / `EvaluationV3Event` — woven through the workbench UI; follow-up. - RBAC permission strings (`evaluations:manage`, `evaluations:view`). - PostHog analytics event `evaluation_ran`. - Online-eval surfaces: `/api/evaluations/:slug/evaluate`, `EvaluationsFacade`, `monitor` CLI group. No backwards-compatibility aliases — PR is unreleased so all callers are migrated in place. 
Verification: - typescript-sdk typecheck clean - typescript-sdk build success - typescript-sdk unit tests: 65/65 (SDK + CLI) - MCP unit tests: 16/16 (get-experiment-results, experiment-list-tools) - langwatch typecheck shows no rename-introduced errors (only pre-existing prisma-client errors caused by an unrelated root-owned generated file blocking `start:prepare:files`) - CLI `--help` shows merged `experiment` group with the five subcommands - OpenAPI generated types updated to `/api/experiments/runs/{runId}` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * polish(experiments): sweep stale "evaluation" strings in CLI + route errors Found during E2E smoke testing on the dev server: route error messages and CLI spinner/help strings still said "evaluation" where the rename moved them to the experiment domain. - routes/experiments-v3.ts: "Starting evaluation execution" → "Starting experiment execution", "Evaluation not found" → "Experiment not found", "Invalid evaluation configuration" → "Invalid experiment configuration", "Starting CI/CD evaluation execution" → "Starting CI/CD experiment execution" - CLI experiment/run.ts: spinner "Starting evaluation" → "Starting experiment", success/fail/poll text "Evaluation" → "Experiment", failSpinner action "run evaluation" → "run experiment" - CLI experiment/status.ts: failSpinner action "check evaluation status" → "check experiment status" - CLI experiment/list-runs.ts: cross-ref text in help points at `langwatch experiment status/results` (not `langwatch evaluation ...`), failSpinner action "list evaluation runs" → "list experiment runs" - CLI experiment/list.ts: cross-ref text in help points at `langwatch experiment list-runs` (not `langwatch evaluation list-runs`) - CLI experiment/results.ts: local types EvaluationResults{Filter, Format,Options} → ExperimentResults*, failSpinner action "fetch evaluation results" → "fetch experiment results". 
Member names like `evaluations` / `evaluation` left alone — they refer to per-row evaluator results, which is correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(experiments): resolve run's experiment from runState, not "most recently updated experiment" The previous fallback in GET /api/experiments/runs/:runId/results guessed the owning experiment via the most-recently-updated experiment in the project. This produced cryptic 404s whenever the user had edited any other experiment after the one that owned the run, because the downstream `experimentRunService.getRun({ projectId, experimentId, runId })` only matches on the exact (projectId, experimentId, runId) tuple in ClickHouse — runId alone is not unique across experiments. Layered lookup, fast-path first: 1. runStateManager.getRunState(runId) — Redis-backed, 24h TTL, covers hot runs. 2. ?experimentSlug=<slug> query param → prisma lookup — covers older runs whose run state has expired but whose ClickHouse rows remain. 3. Otherwise 404 with explicit guidance: "Pass ?experimentSlug=<slug> if the run is older than 24h." The SDK already encodes runId only in the path, so existing CLI / MCP flows hit path (1) on the freshly-completed run they just polled, which is the dominant use case. Users who chase older runs now get a clear actionable error instead of "Experiment not found". Also adds a null-guard around `experimentRunService.getRun` — the return type is `ExperimentRunWithItems | null` and the route was silently treating `null` as success. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sdk-test): extend LLM simple question/response test to 30s timeout This test makes a real ChatOpenAI call but used the default 15000ms vitest timeout, while every sibling test in the same file (`simple-agent-and-tool.integration.test.ts`) already overrides to `{ timeout: 30_000 }`. 
Result: flaky timeouts in CI on slower OpenAI responses (just observed on this PR's CI run despite no SDK code change in this area). Brings it in line with the rest of the file. No behavior change. Boy-scout fix surfaced by the rename PR's CI run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(mcp): replace inline imports with top-level imports for run-experiment handlers Convert the two `await import(...)` calls for handleRunExperiment and handleExperimentStatus to top-level imports, consistent with the other experiment tool handlers already imported at the top of create-mcp-server.ts. Addresses CodeRabbit inline-import suggestion; aligns with CLAUDE.md rule "Never use inline import()". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address review findings in experiment CLI/SDK/MCP - status.ts: fix stale help text 'evaluation run' → 'experiment run' - experiments-api.service.ts: use throw on handleApiError in getRunResults so TypeScript correctly narrows body to non-null before return - experiments.facade.ts: replace summary! non-null assertion with summary ?? {} so a completed run with missing summary doesn't crash downstream arithmetic - get-experiment-results.ts: extend finishedAt guard to also check stoppedAt, so stopped/aborted runs with partial results are displayed instead of blocked Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(sdk): remove erroneous throw before never-returning handleApiError `handleApiError` has return type `never` — TypeScript narrows past the null guard without a throw. The redundant `throw` triggered the `@typescript-eslint/only-throw-error` lint rule because `never` is not an Error object. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(test): hoist EvaluatorEditorDrawer import in integration test The inline `await import(...)` inside each test body caused the first test to pay the full cold module-load cost (~10s in CI), consistently hitting the 15s test timeout. vi.mock() is hoisted by Vitest so a top-level import receives the mocked module; no need for per-test dynamic imports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(openapi): remove stale evaluations-v3 paths; regenerate api-client Old /api/evaluations/v3/... paths survived the rebase merge alongside the new /api/experiments/... paths, causing duplicate TypeScript identifiers in the generated api-client.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(openapi): restore api-client.ts to minimal 3-line rename Previous commit regenerated with a newer openapi-typescript version, ballooning the diff to 13k lines. Restore the original PR version which only renames the 2 path keys and 1 comment (evaluations/v3 → experiments). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(openapi): minimal path rename in openapiLangWatch.json Replace the file with main's exact version + 3 targeted string replacements to rename the 2 path keys and 1 description reference from evaluations/v3 → experiments. Previous version had a diff-noise problem from Python json.dumps re-encoding all unicode characters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: update Python SDK routes and thread experimentSlug through results API Two merge-blocking bugs found in human review: 1. Python SDK still called /api/evaluations/v3/... routes which this PR renames to /api/experiments/... — would 404 in production. Updated platform_run.py (_start_run, _get_run_status) and experiment.py (_fetch_results_as_df). 2. CLI/MCP results lookup fails for runs older than 24h when Redis state expires. 
The server already accepts ?experimentSlug= to resolve from ClickHouse, but the SDK/CLI/MCP never forwarded it. Added experimentSlug param through: - ExperimentsApiService.getRunResults - experimentResultsCommand / ExperimentResultsOptions - CLI --experiment flag on `experiment results` - handleExperimentResults (MCP tool) - platform_experiment_results MCP schema Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: add backward-compat aliases for /api/evaluations/v3 routes Python SDK e2e tests run against production, which does not have the new /api/experiments/... routes until this PR is deployed. Updating the SDK to call new routes caused 404 instead of 401 for the invalid-api-key test. Solution: - Revert Python SDK back to the old /api/evaluations/v3/... routes (safe to call on both old and new production) - Add server-side legacyAliasApp that mirrors all /api/evaluations/v3/* traffic to /api/experiments/* so existing callers continue to work after deployment Python SDK can be updated to the new routes in a follow-up PR once this one is deployed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
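The layered lookup for `GET /api/experiments/runs/:runId/results` described in this PR can be sketched as a pure function. This is an illustrative TypeScript model under stated assumptions, not the route's code: `resolveExperimentForRun`, the `Lookups` interface, and the synchronous signatures are hypothetical stand-ins for the Redis-backed run state and the prisma lookup. Only the order of the steps and the 404 guidance text come from the commit message.

```typescript
// Hypothetical model of the fast-path-first resolution order.
type Resolution =
  | { ok: true; experimentId: string; source: "runState" | "slug" }
  | { ok: false; status: 404; message: string };

interface Lookups {
  // Stands in for runStateManager.getRunState (Redis-backed, 24h TTL in the source).
  getRunState(runId: string): { experimentId: string } | null;
  // Stands in for the prisma lookup by slug.
  findExperimentBySlug(slug: string): { id: string } | null;
}

function resolveExperimentForRun(
  runId: string,
  experimentSlug: string | undefined,
  lookups: Lookups,
): Resolution {
  // 1. Hot runs: the run state still exists.
  const state = lookups.getRunState(runId);
  if (state) {
    return { ok: true, experimentId: state.experimentId, source: "runState" };
  }

  // 2. Older runs: the caller names the experiment explicitly.
  if (experimentSlug) {
    const experiment = lookups.findExperimentBySlug(experimentSlug);
    if (experiment) {
      return { ok: true, experimentId: experiment.id, source: "slug" };
    }
  }

  // 3. No guessing: fail with actionable guidance instead of picking an experiment.
  return {
    ok: false,
    status: 404,
    message: "Pass ?experimentSlug=<slug> if the run is older than 24h.",
  };
}
```

The point of the design is that step 3 never falls back to "most recently updated experiment", since a run ID alone does not identify its owner across experiments.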

#4155)

* ci: require explicit approval for privileged PR checks
* ci: bootstrap workflow security guard
* ci: harden workflow guard parsing
* ci: tolerate commented workflow keys
* ci: accept wider workflow indentation
* ci: pin remaining workflow actions
* ci: audit privileged pull_request_target gates
* ci: skip missing codecov reports
* test: tolerate shared fixture race

… of blocking groups (#4178)

* fix(event-sourcing): retry through CH cluster-recovery errors instead of blocking

Two related fixes so fold-projection groups self-heal across a ClickHouse rolling restart / ZooKeeper session-recovery cycle instead of parking in gq:blocked on the first failure.

1. classifyClickHouseError now treats CH cluster-recovery codes as RECOVERABLE: 394 QUERY_WAS_CANCELLED, 242 TABLE_IS_READ_ONLY, 999 KEEPER_EXCEPTION, 236 ABORTED, 33 CANNOT_READ_ALL_DATA. Same message fragments added for cases where @clickhouse/client surfaces the code only inside error.message ("Query was cancelled", "Table is in readonly mode", "Session expired", "Connection loss", "Coordination::Exception", "Write buffer has been canceled"). Before this, a CH replica graceful shutdown produced a StoreError classified as CRITICAL, which made isRetryable=false in groupQueue and triggered RESTAGE_AND_BLOCK_LUA on the very first attempt without using any of the 15 retries.
2. JOB_RETRY_CONFIG bumped to maxAttempts=25, maxBackoffMs=600_000. The previous 15-attempt budget capped at 15s gave only ~2.5 minutes of total retry wait, which is shorter than a typical ZK session recovery + readonly-replica window. New schedule (500ms doubling up to a 10-minute cap, total ~2h 38m across 24 gaps) is enough room for any single rolling-restart cycle. Failed jobs stay in the Redis zset the whole time, so a longer budget never loses data, it just trades operator toil for auto-recovery.
3. resilient-client's inline insert retry loop now uses the same transient-message-fragments list, so the per-call retry also catches QUERY_WAS_CANCELLED / KEEPER_EXCEPTION / TABLE_IS_READ_ONLY etc.

* fix(event-sourcing): address CodeRabbit review on PR #4178

1. errorHandling.ts: add MEMORY_LIMIT_EXCEEDED to the message-fragment fallback so code-241 errors stay RECOVERABLE even when the error surfaces as message-only (no separate `code` field).
2. errorHandling.ts: export CLICKHOUSE_TRANSIENT_MESSAGE_FRAGMENTS as the canonical list.
3. resilient-client.ts: import the canonical list instead of hand-maintaining a partial duplicate. Locks the inline insert retry loop in lock-step with classifyClickHouseError so future additions stay in sync automatically.
4. resilient-client-metrics.unit.test.ts: add overload (202) and memory-limit (241) message-only cases to the transient retry matrix to guard the new parity.
5. errorHandling.unit.test.ts: cover MEMORY_LIMIT_EXCEEDED via message-only fallback.
6. shared.ts: correct cumulative-wait comment (2h 38m -> 2h 27m). Actual math: 1,023,500ms exponential + 7,800,000ms capped = 8,823,500ms.
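The cumulative-wait arithmetic quoted for the retry schedule can be reproduced with a few lines. This is a minimal TypeScript sketch, assuming the schedule is a plain doubling backoff from a 500ms base with one gap between consecutive attempts and no jitter. The helper names `retryGapsMs` and `totalWaitMs` are hypothetical and are not the names used in `shared.ts`.

```typescript
// Hypothetical helpers: gap i is min(baseMs * 2^i, maxBackoffMs), with maxAttempts - 1 gaps.
function retryGapsMs(maxAttempts: number, baseMs: number, maxBackoffMs: number): number[] {
  const gaps: number[] = [];
  for (let i = 0; i < maxAttempts - 1; i++) {
    gaps.push(Math.min(baseMs * Math.pow(2, i), maxBackoffMs));
  }
  return gaps;
}

function totalWaitMs(maxAttempts: number, baseMs: number, maxBackoffMs: number): number {
  return retryGapsMs(maxAttempts, baseMs, maxBackoffMs).reduce((sum, gap) => sum + gap, 0);
}

// New schedule from the commit: 25 attempts, 10-minute cap.
const newTotalMs = totalWaitMs(25, 500, 600_000);
// 11 uncapped gaps (500ms .. 512,000ms) sum to 1,023,500ms,
// and the remaining 13 gaps sit at the 600,000ms cap (7,800,000ms).
console.log(newTotalMs); // 8823500, about 2h 27m

// Previous schedule: 15 attempts, 15-second cap.
const oldTotalMs = totalWaitMs(15, 500, 15_000);
console.log(oldTotalMs); // 150500, about 2.5 minutes
```

Under these assumptions the numbers match the corrected comment (8,823,500ms) and the "~2.5 minutes" figure quoted for the old budget.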

…-chip jumps (#4179) * feat(traces-v2): interactive error chip popover + blue glow on header-chip jumps Two header chips (Error status, Eval) deep-link into trace-summary sections but the operator could lose the destination in a long accordion list, and the error chip's tooltip claimed it would "jump to the first span with an error" while actually just opening the summary tab without explaining anything. Three improvements: 1. Replace the static error-chip tooltip with a HoverCard popover that mirrors the Exceptions accordion content - same trace-level message in the same red treatment, same per-span pill row sourced from the new shared `rankedErrorSpans` helper so the popover and the full accordion can't drift. Clicking a span pill inside the popover jumps the drawer to that span; clicking the chip itself opens the full Exceptions section. 2. Extend `useFocusSectionStore` with an `exceptions` target. The error chip now publishes a focus request through the same store the eval chips already use, so the trace-summary observer uniformly handles "expand + scroll" for all header chips. 3. Add a one-shot blue pulse (`SectionFocusGlow`) keyed off `data-section-focus="1"` on each Section. The focus observer toggles the attribute after the scroll lands, the keyframe runs for 1.5s and self-clears. The colour palette mirrors the onboarding `DrawerGlow` recipe so the two effects read as the same affordance - the focus pulse is just shorter, runs once, and stays inside the section box. Includes a tiny `scripts/dogfood-error-trace.ts` OTLP emitter that fans out one trace with a workflow + chain + 2 llm spans all in error so the popover has enough surface area to dogfood. * fix(traces-v2): paint focus glow as a portal overlay so it isn't cropped The prior box-shadow-on-section approach split visually into two halves (outer halo cut by the drawer's overflow scroll viewport, inset ring occluded by the sticky section header bg) and clipped on opposite sides. 
Replace it with a fixed-position overlay rendered through a Portal at the document body. The overlay measures the section's bounding rect and tracks it across scroll / resize for the duration of the animation, so no overflow ancestor in the drawer chrome can crop it. Z-index sits above the drawer's stacking context. * feat(traces-v2): re-focus + glow exceptions when a span pill is clicked Span pills inside ExceptionsContent (both the popover and the accordion variant) now fire an optional `onFocusSection` callback alongside the span jump. The popover re-runs the chip's focus pipeline so the operator sees the section open + pulse as they land; the accordion variant re-pulses the section the operator is already viewing so the eye tracks back to the row that owns the selected span. * feat(traces-v2): span pill jumps to span tab + glows its exceptions section Two coupled changes so the popover span pills land where the operator expects: 1. Popover `jumpToSpan` no longer forces `setActiveTab("summary")` - `selectSpan` already flips to the span tab internally, so leaving the override in place was yanking the operator back to summary right after selectSpan moved them. 2. Extract the focus-glow observer into `useSectionFocusGlow` and mount it on `SpanAccordions` as well, so the follow-up `requestFocus({section: "exceptions"})` lands on whichever accordion stack ends up rendered (summary or span). The hook waits via MutationObserver when the target section isn't in the DOM yet (span tab mounts a skeleton while `useSpanDetail` loads).

- Move scripts/dogfood-error-trace.ts to langwatch/scripts/dogfood/traces/error-trace.ts so the dogfood folder stays organised by feature
- Drop the blanket /* eslint-disable */ on the dogfood script and the unused SEC_NS constant; nothing in the file actually needed either
- Guard rankedErrorSpans depthOf against parent cycles so a malformed span graph can't hang the ranker (covered by new unit test)
- Surface the Exceptions accordion + focus glow for error traces that only have span-level failures (no rolled-up trace.error). Previously the header popover could list pills that led to a gated section that never rendered
- Restructure the two unit suites to follow given/when/it BDD nesting
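The cycle guard on `depthOf` mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the `SpanNode` shape and the `Map` lookup are assumptions made for the example.

```typescript
// Hypothetical span shape; the real type in the codebase may differ.
interface SpanNode {
  spanId: string;
  parentSpanId: string | null;
}

// Counts how many known ancestors a span has. A `seen` set stops the walk
// as soon as a span is revisited, so a malformed parent cycle cannot loop forever.
function depthOf(spanId: string, byId: Map<string, SpanNode>): number {
  const seen = new Set<string>();
  let depth = 0;
  let current = byId.get(spanId);
  while (current && current.parentSpanId !== null) {
    if (seen.has(current.spanId)) break; // parent cycle detected
    seen.add(current.spanId);
    const parent = byId.get(current.parentSpanId);
    if (!parent) break; // dangling parent reference: stop counting
    depth += 1;
    current = parent;
  }
  return depth;
}
```

With a bounded walk like this, the worst case is one visit per span, whatever shape the parent links take.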

* fix(metrics): read backup status from system.backup_log

system.backups is an in-memory CH table that is wiped on every restart. Since the CH image tag bumps on every app deploy, every deploy wipes the table, and a freshly-rolled worker pod then sees zero rows and never emits the clickhouse_backup_last_success_timestamp_seconds gauge, which trips the "Backup Reporting Absent" alert despite backups being healthy. system.backup_log is the persistent equivalent (same row schema for the columns we care about) and survives CH restarts, so freshly-rolled workers immediately see the most recent backup and emit the gauge on their first collection tick.

* test(metrics): wrap backup_log tests in BDD given/when blocks

Per repository test style (CLAUDE.md), nested describe('given X') / describe('when Y') blocks are required for new tests. Restructures only the sections this PR added/modified; existing flat sections left alone to avoid scope creep.

#4182) The chat-tab persistence effect deduped on message IDs only. The latest assistant message gets a stable ID the moment streaming starts (when content is still empty); the ID-set never changes again for that turn, so the effect skipped every content delta. The persisted snapshot ended up with the most recent assistant message stuck at empty content, and convertScenarioMessagesToCopilotKit then dropped empty-content rows on reload. Result: refresh and the last assistant reply vanishes while older turns survive. Key the dedup on (id, role, content-length) so each streaming chunk re-triggers persistence. The dedup factor is now per-turn-deltas instead of per-turn-new-message — a few dozen small localStorage updates per turn, which is fine for the prompt playground. Extract the key fn so the regression is unit-testable without standing up the full CopilotKit chat stack.
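A minimal sketch of the (id, role, content-length) dedup key described above. The `ChatMessage` shape and the `persistenceKey` name are assumptions for illustration; the extracted key function in the PR may look different.

```typescript
// Hypothetical message shape, assuming string content for simplicity.
interface ChatMessage {
  id: string;
  role: string;
  content: string;
}

// Builds a key that changes whenever a message is added OR its content grows,
// so each streaming chunk yields a new key and re-triggers persistence.
function persistenceKey(messages: ChatMessage[]): string {
  return messages
    .map((m) => `${m.id}:${m.role}:${m.content.length}`)
    .join("|");
}
```

Keying on IDs alone would return the same value for the empty placeholder and the fully streamed reply, which is the regression the commit describes.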

… dropdown (#4186)

* fix(queue): move pending counter DECR to dispatch to eliminate drift The total-pending counter drifted to 827K (real: ~1,300) because COMPLETE_LUA only DECRs when activeKey matches stagedJobId. When the active key expires before the worker completes, the DECR is skipped and the INCR from stage time is never compensated. Move DECR to DISPATCH (ZREM from :jobs — always succeeds) and add compensating INCRs in RETRY_RESTAGE and RESTAGE_AND_BLOCK (job re-enters :jobs). COMPLETE no longer touches the counter. Counter semantic: "jobs currently in :jobs ZSETs" (waiting to dispatch). Every INCR has a guaranteed DECR — no conditional paths. 4 Lua script changes: - DISPATCH_LUA/DISPATCH_BATCH_LUA: DECR after ZREM - COMPLETE_LUA: remove DECR + totalPendingKey - RETRY_RESTAGE_LUA: INCR after ZADD - RESTAGE_AND_BLOCK_LUA: INCR after ZADD * fix(queue): update DRAIN_GROUP_LUA and tests for new counter semantic DRAIN_GROUP_LUA counted active jobs in its DECRBY (pendingCount + hadActive). With DECR moved to dispatch, the active job was already compensated — counting it again would double-DECR. Update DRAIN to DECRBY pendingCount only (jobs in :jobs ZSET). Update integration tests to match the new counter semantic: - "leaks the counter" → counter stays accurate (DECR at dispatch) - "proposed fix validation" → validates the real 4-script fix - "reconciliation invariant" → ZCARD(:jobs) only, not + EXISTS(:active) - "retryRestage breaks invariant" → invariant now holds (INCR at retry) - Drain tests: active job already DECRed at dispatch, not counted again * fix(queue): guard INCR on ZADD insertion to prevent duplicate overcount ZADD returns 0 when the member already exists (score update only). Guard INCR behind inserted == 1 in RESTAGE_AND_BLOCK_LUA and RETRY_RESTAGE_LUA so a duplicate re-stage call cannot overcount the pending counter. Addresses CodeRabbit review feedback on PR #4181. 
* chore(specs): bind counter conservation scenarios to integration tests Add @Scenario annotations linking pending-counter-conservation.feature scenarios to their corresponding integration tests in scripts.integration.test.ts.

…4175) * fix(groupQueue): check tenant cap before SISMEMBER in DISPATCH_BATCH Move the tenant cap check above the SISMEMBER blocked check in DISPATCH_BATCH_LUA. The cap result is cached per-tenant in a Lua table, so over-cap groups now skip with 0 Redis commands instead of 1 SISMEMBER each. With 10K+ over-cap groups in the ready ZSET, this eliminates ~333K SISMEMBER/sec (~74% main thread CPU). * test(groupQueue): add dispatchBatch integration tests for cap-before-SISMEMBER Tests for DISPATCH_BATCH_LUA with tenant cap: - Over-cap groups skipped, under-cap groups dispatched in same batch - Blocked + over-cap group interaction - Drift cleanup still runs for under-cap tenants * test(groupQueue): link batch cap tests to tenant-soft-cap feature scenarios Add 3 DISPATCH_BATCH scenarios to tenant-soft-cap.feature and annotate integration tests with @Scenario references for parity tracking. * test: harden cap-before-SISMEMBER integration tests - Add cross-tenant dispatch assertion: stage a quiet-tenant job in second batch and assert it dispatches, not just that noisy ones don't - Seed a zombie ready entry in drift-cleanup test so it exercises the actual cleanup path instead of passing trivially with both values at 0 Addresses CodeRabbit review feedback. * fix(test): correct over-cap test to keep tenant at cap after restage restageAndBlock DECRs the tenant_active counter, so with cap=1 and only one dispatched group, the tenant drops to 0 (under cap) after restage. Fix: start with cap=2 so two groups dispatch (counter=2), restage one (counter→1), then lower cap to 1 for the second dispatch. This keeps the tenant at cap and properly verifies over-cap groups are skipped. * fix(test): complete quiet tenant job before lowering cap The previous fix lowered the global cap from 2→1 but forgot that proj_quiet also had 1 active job (counter=1), so it was also at cap and nothing dispatched. Fix: complete proj_quiet/g1 before lowering the cap so only proj_noisy remains at cap. 
* test: assert g3 excluded from first batch to lock cap boundary

* fix(deps): move mcp-server pnpm.overrides to workspace root

pnpm warns that pnpm.overrides in mcp-server/package.json is ignored because mcp-server is part of the langwatch/ workspace (declared via ../mcp-server in langwatch/pnpm-workspace.yaml). Overrides only take effect at the workspace root. protobufjs and fast-uri floors were already in langwatch/pnpm-workspace.yaml. Adds the missing langsmith@<0.6.0 floor and removes the dead pnpm block from mcp-server/package.json.

* fix(deps): keep mcp-server standalone overrides in pnpm-workspace.yaml

mcp-server is its own pnpm workspace (has pnpm-workspace.yaml and pnpm-lock.yaml; CI installs it standalone via mcp-javascript-ci, mcp-javascript-cd, sdk-javascript-ci, e2e-ci). Dropping the override block from package.json without re-homing it broke standalone frozen installs with ERR_PNPM_LOCKFILE_CONFIG_MISMATCH. Re-add the same overrides at mcp-server/pnpm-workspace.yaml so the standalone install honors them while the langwatch workspace install continues to pick up its own overrides from langwatch/pnpm-workspace.yaml.

* feat(ops): add Redis pressure panel with memory % and engine CPU During the 2026-05-21 incident the Redis main thread sat at 100% for 5+ hours while /ops exposed no Redis pressure signal — memory was technically collected but only surfaced as a tiny sublabel on the DLQ tile, and engine CPU was never collected at all. Investigators had to use the Redis CLI on a replica to see what was happening. This PR adds a first-class "Redis pressure" panel below the throughput strip with three stats: - Memory: used / max with percent, warning color when >= 80% - Engine CPU: derived from two successive INFO cpu snapshots of used_cpu_{user,sys}_main_thread, warning color when >= 70% - Connections: connected_clients The legacy `redisMemoryUsed` sublabel under DLQ is removed in favor of the real panel. The engine-CPU derivation lives in a small pure helper (`redis-engine-cpu.ts`) that returns null on the first cycle (no prior sample) and on the cycle after a Redis restart (cumulative counter rewinds backwards), so we never display a spurious huge or negative percent. Covered by specs/ops/redis-pressure.feature and tested at three layers: - Unit (8 tests): pure CPU-percent math, including first-cycle null, rounding, counter rewind, and same-instant samples. - Integration (3 tests): real Redis container — parses INFO, produces a CPU sample, derives a non-negative percent after generating work. - RTL (8 tests): renders the panel inside ChakraProvider, asserts the used/max/percent/peak layout, the "sampling…" placeholder on the first cycle, the unlimited-maxmemory fallback, and warning thresholds. Also fixes a pre-existing biome noFloatingPromises warning on the first collect() call in OpsMetricsCollector.start(). * test(ops): browser-mode screenshot capture for RedisPressurePanel A small vitest browser test that runs the panel inside real Chromium via Playwright and writes PNGs to __screenshots__/ for each rendered state (healthy, sampling, saturated, unlimited). 
Not a regression test — the PR-screenshots upload that the PR description embeds is the artifact. * fix(ops): address coderabbit review (PR #4190 round 1) - Remove unused VStack import from RedisPressurePanel - Expose warning state as data-warning="true|false" on memory/CPU stats so RTL tests can pin the contract independently of Chakra v3's CSS-variable rendering details - Add tests that verify data-warning is set above the 80% memory threshold and the 70% engine-CPU threshold, and unset below - Align the unlimited-maxmemory spec scenario with the panel contract: the percent is omitted entirely (not shown as "-") when maxmemory=0 * refactor(ops): inline Redis stats into top stat strip, format bytes as MB/GB The dedicated "Redis pressure" card stood awkwardly between the top stats and the throughput chart. Three values fit naturally alongside Staged/s, Completed/s, P50, P99, etc. — collapse the card and let the grid wrap. Also fixes the memory display: Redis's `used_memory_human` returns "53.62M", which is ambiguous (megabits? million bytes?). Format from the bytes value on the frontend with explicit "MB"/"GB"/"KB" suffixes instead. * fix(ops): use raw memory ratio for warning threshold (PR #4190 round 2) - Spec: align maxmemory (9.69GB) with the displayed sublabel value so the scenario isn't internally inconsistent. - Spec: pin the new edge case (79.95% raw, 80% displayed → no warning). - Component: separate the raw ratio (threshold check) from the rounded value (display) so 79.95% does not falsely trigger the eviction warning. - Test: cover the 79.95% edge case explicitly. * fix(ops): stack sublabel under value and prevent wrapping (PR #4190) When the strip got cramped, the sublabel sat awkwardly to the right of the value and the label text broke onto two lines. Stacking the sublabel under the value and pinning whiteSpace=nowrap on the label/value/sublabel forces the grid to wrap whole tiles to the next row instead of mangling individual ones.
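The engine-CPU derivation and the raw-ratio warning check can be sketched as below. This is an illustration under stated assumptions: `CpuSample`, `engineCpuPercent`, and `memoryStat` are hypothetical names, and returning null for same-instant samples is an assumption about how the real `redis-engine-cpu.ts` helper treats that case.

```typescript
// Hypothetical sample: cumulative main-thread CPU seconds (user + sys) from INFO cpu.
interface CpuSample {
  atMs: number; // wall-clock time the sample was taken
  cpuSeconds: number; // used_cpu_user_main_thread + used_cpu_sys_main_thread
}

// Percent of one core used between two samples. Returns null when there is no
// prior sample, no elapsed time, or the cumulative counter went backwards (restart).
function engineCpuPercent(prev: CpuSample | null, next: CpuSample): number | null {
  if (prev === null) return null;
  const elapsedSeconds = (next.atMs - prev.atMs) / 1000;
  if (elapsedSeconds <= 0) return null;
  const cpuDelta = next.cpuSeconds - prev.cpuSeconds;
  if (cpuDelta < 0) return null;
  return Math.round((cpuDelta / elapsedSeconds) * 100);
}

// Keeps the threshold check on the raw ratio and rounds only for display,
// so 79.95% shows as 80% without raising the warning.
function memoryStat(usedBytes: number, maxBytes: number): { displayPercent: number | null; warning: boolean } {
  if (maxBytes <= 0) return { displayPercent: null, warning: false }; // unlimited maxmemory
  const ratio = usedBytes / maxBytes;
  return { displayPercent: Math.round(ratio * 100), warning: ratio >= 0.8 };
}
```

Returning null instead of a number lets the panel show its "sampling…" placeholder instead of a spurious huge or negative percent.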

* feat(langevals): stage large payloads to S3 via presigned URL Topic clustering batches and long-input evaluator calls routinely exceed Lambda's 6 MB sync invoke cap; large topic-clustering payloads observed in prod hit 56 MB and 85 MB with one project losing every daily clustering run to a 413. Adds an opt-in S3 staging path: bodies above LANGEVALS_STAGING_THRESHOLD_BYTES (default 5 MB) are uploaded to the project's storage destination, the POST carries an empty body with the GET presigned URL in X-Payload-S3-URL, and a FastAPI middleware on langevals fetches + swaps the body before the route handler runs. Hard caps reject oversized payloads before any network call: 16 MB for evaluations, 180 MB for topic clustering. Langevals never holds S3 credentials; only the control plane signs the URL. Legacy langwatch_nlp topic clustering path stays unchanged so self-hosted operators do not regress. Telemetry tracked on both sides: control plane logs staged size, S3 key, TTL, and target host; langevals logs the fetched byte count and latency. * fix(topic-clustering): skip ES dual-write when Elasticsearch is not configured storeResults dual-wrote topic assignments to both Elasticsearch and the event-sourcing AssignTopic command queue. On SaaS prod (ClickHouse-only, no ES) esClient() returned a throwing proxy whose .bulk() raised "Elasticsearch is not configured", killing the function before the AssignTopic block could run. Result: Topic rows were created in Postgres (314 in the last 24h across 23 projects, confirmed via psql), but trace_summaries.TopicId in ClickHouse never got written, leaving "Top Topics" empty in the UI for every clustering-active project — including projects on the legacy langwatch_nlp path that have nothing to do with the nlpgo flag flip. Adds isElasticsearchConfigured() that mirrors esClient's resolution order (per-org override → env) and guards the bulk write with it. 
ClickHouse-only deployments now write topic assignments via the event-sourcing path only, as intended. Adds an end-to-end test that spawns langevals locally, points the TS helper at real lw-dev S3, and exercises both the evaluator and topic-clustering staging paths through real network + presigned URL fetch. Skipped in CI (no shared dev S3 creds); run via scripts/run-langevals-staging-e2e.sh. * fix(langevals): coderabbit review follow-ups + feature-parity spec * SSRF guard on X-Payload-S3-URL: reject non-https, localhost, RFC1918, loopback, link-local, multicast, reserved, unspecified IPs with 400 before any httpx fetch. Six parametrized cases + a transport-level assertion that the rejected URL never reaches the network. * Drop raw exception strings from staged-fetch error logs (httpx's __str__ embeds the full presigned URL with X-Amz-Signature). Replaced with error_type + upstream_status. Regression test asserts the secret URL substring never appears in captured log records. * Reorder header spread in stagedLangevalsFetch so caller headers cannot override the contract-defining X-Payload-S3-URL or Content-Type. * Switch test_staged_payload.py to a single import style for the staged_payload module (CodeQL flagged the mixed import/from import). * Add specs/topic-clustering/trace-assignment.feature bound to the storeResults regression test so feature-parity CI doesn't break on the prior commit's unbound @Scenario annotations. * fix(test): poll loop-blocked counter instead of fixed 1500ms quiet window The depth=1 loop-prevention test slept 1500ms then asserted the prom counter had incremented. Under parallel CI load (PR #4189 hit this twice) the reactor → BullMQ → metric-write chain doesn't always complete in 1500ms, so the counter read 0 and the test failed with "expected 0 to be greater than or equal to 1". Replace the fixed sleep with `waitFor` polling the counter up to 20s. 
The dispatch assertion stays as a post-condition: by the time the counter ticks, the reactor has decided not to dispatch. * test(e2e): exercise real OpenAI through staged-payload path The previous e2e test posted an `exact_match` body and a topic clustering call with a fake api_key pointing at an unreachable embeddings endpoint. That proved the middleware fetched the staged body but said nothing about whether the downstream call actually produced a valid evaluator verdict or real topics. This rewrites the e2e to: - Stage an `llm_boolean` payload, hit real OpenAI through the presigned URL hop, assert a real verdict with non-zero cost. - Stage a `topic_clustering_batch` payload of 12 traces in two semantic clusters, hit real OpenAI embeddings + naming, assert the response actually contains named topics and trace_id -> topic_id assignments. OPENAI_API_KEY is sourced from langwatch/.env by the wrapper script so the operator runs the same command as before. Drops the "posts inline" scenario because the wrapper-set threshold of 200 bytes makes it impossible to fit a real evaluator body under the inline branch, and the unit tests already cover that branch. * refactor(topic-clustering): delete Elasticsearch dual-write entirely Topic clustering ran a legacy ES bulk write inside storeResults, plus two dead ES helpers (fetchCountsFromElasticsearch, fetchTracesFromElasticsearch) that nothing called, plus an isElasticsearchConfigured gate added earlier in this PR to keep the bulk from killing the AssignTopic queue on SaaS. The right fix is to delete the whole ES path. ClickHouse is the storage for per-trace assignments via the AssignTopic command queue plus trace_summaries.TopicId projection; Postgres is the storage for the topic catalog. Nothing else still reads metadata.topic_id from ES, so dual-writing was just dead work. 
Drops: - storeResults ES bulk + isElasticsearchConfigured gate - fetchCountsFromElasticsearch / fetchTracesFromElasticsearch (defined but never called) - isElasticsearchConfigured from elasticsearch.ts (no other callers) - topicClustering.integration.test.ts (describe.skip'd ES-only test) - ES mocks + assertions in the unit tests - Two ES branches in trace-assignment.feature The unit test now pins one contract: storeResults emits AssignTopic for each trace and never touches the elasticsearch module. Any re-introduction of an ES write would have to deliberately update this test. Dogfood re-tested locally on the same X0O0F0Y82Bqjl3WXyTY_6 project with ELASTICSEARCH_NODE_URL unset (SaaS shape): 12 traces seeded, storeResults' new no-ES path emits 12 AssignTopic commands, trace_summaries.TopicId populated 12/12, Top Topics shows the clustered topic in the UI. * refactor(langevals-staging): opt-in by default + drop bash wrapper Two cleanups suggested while dogfooding the e2e: 1. Staging is now off by default. Self-hosted langevals on a plain HTTP service has no 6 MB sync-invoke cap so there's nothing for the S3 hop to dodge — bodies just go inline regardless of size. SaaS opts in by setting LANGEVALS_STAGING_THRESHOLD_BYTES in terraform. z.coerce.number().int().positive().default(5_000_000) -> z.coerce.number().int().positive().optional() stagedLangevalsFetch checks `threshold === undefined` and routes straight to inline POST. EVAL_MAX_PAYLOAD_BYTES and TOPIC_CLUSTERING_MAX_PAYLOAD_BYTES (the hard rejection caps) keep their defaults — those are useful even outside the staging scenario. Dropped the .env.example block that exposed all four vars to self-hosted operators. SaaS-specific tunables don't belong in the self-hosted starter template. 2. Deleted scripts/run-langevals-staging-e2e.sh. 
The TS e2e already does everything except resolve AWS profile credentials; that's one line of bash, folded into a new `pnpm test:e2e:langevals-staging` script that exports `aws configure export-credentials` output then runs vitest. Bucket + region default to lw-dev values in the test itself, so the script is single-command for anyone with lw-dev SSO refreshed. Coverage added: - stagedFetch.unit.test.ts gains a scenario asserting threshold-unset routes inline regardless of size. - staged-payload.feature gains a matching @Unit scenario. * fix(langevals-staging): delete staged S3 object after use + pin inline Content-Type Addresses Sergio's PR review: P1 — staged payloads were written to S3 but never deleted. Those bodies carry customer trace/input data plus provider credentials (evaluator env X_LITELLM_*, topic-clustering litellm_params api_key / vertex_credentials / bedrock keys), so leaving them in the bucket is a retention/data-minimization problem, not just a Lambda-access one. The presigned-URL TTL only bounds URL usability, not object lifetime. Now stagePayload returns the client/bucket/key alongside the URL and stagedLangevalsFetch deletes the object in a finally block once the upstream responds. By then langevals has already fetched the presigned URL during request handling, so the object is safe to drop. Delete is best-effort and logged on failure; a bucket lifecycle rule on the langevals-staging/ prefix (added in langwatch-saas terraform) is the orphan/crash fallback for the rare path where the finally can't run. P3 — the inline path spread caller headers before Content-Type, letting callers override it; the staged path already pinned it last. Made the inline path pin it last too so both branches behave identically (body is always JSON-serialized here). Tests: added a unit scenario asserting the staged object is deleted with the same bucket+key it was written to, plus a bound @Unit feature scenario. 
* test(langevals-staging): e2e asserts staged object is deleted after use The two real-S3 e2e scenarios previously asserted the staged object was still present after the call (matching.length > 0). With the in-app finally delete that's no longer true — and the old assertion masked whether cleanup actually happened (the afterAll teardown deletes the prefix anyway). Flipped both to assert the object is gone (matching === []) right after stagedLangevalsFetch resolves, before teardown. The "fetched staged payload" log still proves the body round-tripped through S3 first, so the chain is: upload -> langevals GET from presigned URL -> in-app delete -> nothing lingers. Verified the dev bucket lists 0 objects under langevals-staging/ after the run.
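The routing and header rules from the staging commits above can be summarised in a small sketch. The function names `chooseRoute` and `stagedHeaders` are hypothetical, and the sketch leaves out the S3 upload, presigning, and the finally-block delete; it only shows the decision order and the header pinning.

```typescript
type Route = "inline" | "staged";

// Decision order assumed from the commit text: the hard cap rejects first,
// an unset threshold always goes inline, otherwise bodies above the threshold are staged.
function chooseRoute(bodyBytes: number, thresholdBytes: number | undefined, maxBytes: number): Route {
  if (bodyBytes > maxBytes) {
    throw new Error(`payload of ${bodyBytes} bytes exceeds hard cap of ${maxBytes} bytes`);
  }
  if (thresholdBytes === undefined) return "inline";
  return bodyBytes > thresholdBytes ? "staged" : "inline";
}

// Caller headers are spread first so they cannot override the contract-defining
// headers. Note: this simple version does not normalise header-name casing.
function stagedHeaders(callerHeaders: Record<string, string>, presignedUrl: string): Record<string, string> {
  return {
    ...callerHeaders,
    "Content-Type": "application/json",
    "X-Payload-S3-URL": presignedUrl,
  };
}
```

Checking the cap before anything else matches the stated goal of rejecting oversized payloads before any network call.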

…tching, eval OOM, shiki dispose (#4192) * fix(traces-v2): non-selectable newline marker, evaluator filter label, annotation batching, eval OOM, shiki dispose Five trace-explorer fixes reported from dogfooding: 1. Newline marker (compact + comfortable I/O preview). The `↵` hard-break glyph now renders as a zero-width `::after` pseudo-element at the end of the broken line, like a GitHub diff gutter: it isn't part of the selectable text (so copy round-trips clean multi-line text), and it never occupies layout width (so it can't wrap onto a line of its own). 2. Evaluator filter label. Dropped the `[type]` prefix from the evaluator facet label — a project's evaluators are mostly the same type, so the pill just ate the room the name needs. Value still keys off the evaluator id for saved-query round-trips. Removed the now-dead type-pill rendering from FacetRow. 3. annotation.getByTraceIds "Input is too big for a single dispatch". tRPC v10 sends queries as GET, so a page of 100+ trace ids overflowed the batch link's URL ceiling. New `useAnnotationsByTraceIds` hook chunks ids into URL-safe batches via `useQueries` and flattens — no upper bound. Migrated all five callers; `annotations/all` now also declares both queries unconditionally (fixing a latent rules-of-hooks issue). 4. traces.getEvaluations 500 ("Query memory limit exceeded ... column Inputs"). `evaluation_runs` is ORDER BY (TenantId, EvaluationId), so a TraceId filter can't prune granules; reading the multi-MB `Inputs` column across a granule OOMed. The query now falls back to a light projection (no Inputs) on a memory-limit error, so operators still get verdicts/scores instead of a 500. The eval card hides inputs when absent. 5. ShikiError "instance has been disposed". Chakra's shiki adapter disposes the highlighter in `unloadContext` on every CodeBlock unmount / color-mode change, but we share one app-lifetime singleton across all CodeBlocks — the first unmount tore it down for the rest. 
Neuter the singleton's `dispose`. * fix(traces-v2): keep newline marker clear of the clamp ellipsis Widen the spacing before the ↵ glyph, and suppress the marker that lands on the truncated line so it never overlaps the line-clamp's … ellipsis. Markers on fully-visible lines keep showing, so the hard-break affordance survives truncation. * feat(traces-v2): lazy-load evaluation inputs in the drawer The verdict list keeps the heavy Inputs blob (public API + v1 tooltip depend on it) but degrades to a light projection under ClickHouse memory pressure. To still show inputs in the drawer when that happens, EvalCard now fetches a single evaluation's inputs on expand via traces.getEvaluationInputs, keyed by EvaluationId (the table sort key) so the read prunes granules and stays within memory. Inputs already carried by the list render without an extra request. * fix(traces-v2): address review comments - Dedupe traceIds before chunking in useAnnotationsByTraceIds so duplicate ids spanning chunks don't double annotations. - Count traceGroups load toward the annotations table loading state in filtered mode so it no longer flashes an empty state before the ids arrive. - Wrap the ClickHouse light-projection retry in its own try/catch so a retry failure still goes through the standard error log + wrapped message. - Normalize CRLF/CR in the preview newline marker so a stray carriage return can't cling to a line. - Give the test ResizeObserver stub a one-arg constructor so callers aren't flagged for a superfluous argument. * fix(traces-v2): make getEvaluationInputs project-protected, not public-share The procedure was publicProcedure + checkPermissionOrPubliclyShared scoped to a single traceId, but the read is keyed by evaluationId (only tenant-scoped) with traceId dropped. A public-share token for one trace could then read any evaluation's inputs in the project by supplying another evaluationId. 
Switch to protectedProcedure + checkProjectPermission so the auth scope (project) matches the data scope, and drop the now-unnecessary traceId. Public-shared trace drawers still get inputs eagerly from the public getEvaluations; this lazy fallback is project-gated. * fix(spans): make getForPromptStudio protected, not public Prompt Studio is an authenticated project-scoped feature with no public-share path. It was declared publicProcedure but guarded only by checkProjectPermission, so it was effectively protected (anon resolves to no permission) yet labeled public. Switch to protectedProcedure to make the intent explicit and avoid the public-by-default footgun. * docs(traces-v2): drop history-flavored code comments Comments describe the final state of the code, not how it got there.
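The dedupe-then-chunk step behind `useAnnotationsByTraceIds` might look roughly like this. It is a sketch only: `chunkTraceIds` and the fixed `maxPerChunk` parameter are assumptions, and the real hook may size batches by URL length and then issue one query per chunk via `useQueries`.

```typescript
// Deduplicates ids first so an id repeated across chunk boundaries is fetched
// (and its annotations counted) only once, then slices into fixed-size batches.
function chunkTraceIds(traceIds: string[], maxPerChunk: number): string[][] {
  if (!Number.isInteger(maxPerChunk) || maxPerChunk < 1) {
    throw new RangeError("maxPerChunk must be a positive integer");
  }
  const unique = Array.from(new Set(traceIds)); // Set preserves first-seen order
  const chunks: string[][] = [];
  for (let i = 0; i < unique.length; i += maxPerChunk) {
    chunks.push(unique.slice(i, i + maxPerChunk));
  }
  return chunks;
}
```

Flattening the per-chunk results then gives the same list a single large query would have returned, with no upper bound on the number of ids.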

…#4196) * fix(gateway): route managed-bedrock through the customer VPC endpoint Managed-Bedrock customers reach Bedrock through their own PrivateLink VPC endpoint, and their IAM role only authorizes bedrock:InvokeModel when the request arrives via that endpoint. The gateway dispatched Bedrock through bifrost, which hardcodes the public bedrock-runtime host and SigV4-signs over it, so the request missed the VPCE and the customer role returned a 403 (the old litellm path set aws_bedrock_runtime_endpoint to the VPCE and worked). When a Bedrock credential carries a runtime endpoint, dispatch through the official aws-sdk-go-v2 bedrockruntime Converse/ConverseStream with BaseEndpoint pinned to that VPCE, so the SDK signs and sends to the right host. The endpoint already reaches the gateway credential on the nlpgo path (Credential.Extra). Every other Bedrock request and every other provider stays on bifrost untouched. Responses are mapped into the same BifrostChatResponse shape the normal path emits, so downstream cost/trace/stream handling is unchanged. Validated end to end against real Bedrock (Converse via BaseEndpoint returns a 200 through the dispatch path). * fix(gateway): satisfy golangci-lint on bedrock vpce dispatch Use promoted embedded fields, index-range over Tools to avoid value copy, and http.StatusOK in the test. * fix(gateway): accept litellm aws_* credential keys on bedrock vpce dispatch The two nlpgo entry points name AWS credential fields differently: the dispatcheradapter (Studio / workflows) translates litellm_params to the Bifrost-canonical access_key / secret_key / session_token / region, while the gatewayproxy (/go/proxy) keeps the litellm aws_* names. The VPCE credential reader now honors both conventions so the dispatch is correct regardless of which route built the credential. 
Add a committed streaming live test (dispatchBedrockVPCEStream) alongside the non-streaming one, a unit test for the aws_* key path, and assert the dispatcheradapter threads the runtime endpoint through. * fix(gateway): honor litellm aws_* keys on the normal bedrock key path The gatewayproxy (/go/proxy) route keeps the litellm aws_* credential names while the dispatcheradapter translates to canonical ones. The Bifrost bedrock key builder read canonical only, so a /go/proxy bedrock request landed with empty credentials. Read both conventions via the shared credExtra helper so neither nlpgo route is left credential-less. * fix(gateway): validate bedrock vpce endpoint to close SSRF surface The runtime endpoint arrives from per-request credential Extra, so an unconstrained value would let a request steer the gateway's outbound call at an arbitrary host. Gate the managed-Bedrock dispatch behind bedrockVPCEEndpoint, which requires an http/https URL whose host is within the AWS-controlled amazonaws.com domain (covers both the public endpoint and PrivateLink VPC endpoints) and fails closed otherwise. * test(nlpgo): exercise the bedrock vpce intercept through the full nlpgo stack The bedrock live e2e drives the real nlpgo HTTP chain (router → dispatcheradapter → dispatcher → gateway intercept). Threading an optional BEDROCK_VPCE_ENDPOINT into litellm_params routes that chain through the managed-Bedrock VPCE dispatch instead of the public bifrost path, so the endpoint-to-dispatch wiring is proven end to end, not just the dispatch function in isolation. * test(nlpgo): thread aws_session_token so the managed temp-cred path is exercisable Managed-Bedrock customers authenticate with STS-temporary credentials, which require the session token alongside the access/secret keys. Pass AWS_SESSION_TOKEN through litellm_params when set so the harness can run the real managed customer path, not just long-lived keys. 
* fix(gateway): address coderabbit review on managed-bedrock vpce - guard the VPCE intercept (Dispatch + DispatchStream) to RequestTypeChat so /v1/messages stays on the raw-forward path; routing Anthropic-native bodies through Converse would drop messages-only fields like thinking. The studio/eval LLM path is ChatCompletions, so the managed-bedrock customer path is unaffected. - require https for non-PrivateLink endpoints in validateBedrockEndpoint: http is now allowed only for .vpce.amazonaws.com hosts, so public bedrock-runtime cannot be downgraded onto plaintext transport. The customer http VPCE :80 still passes (vpce host). - bound the live stream test context with a 2m timeout so a stalled upstream cannot hang it.
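The endpoint rules described above (host inside amazonaws.com, https required except for PrivateLink hosts) can be illustrated as follows. The gateway code is Go; this TypeScript port is a hedged sketch with a hypothetical name, `isAllowedBedrockEndpoint`, and the actual `validateBedrockEndpoint` may apply further checks.

```typescript
// Fails closed: anything unparseable, outside amazonaws.com, or on an
// unexpected scheme is rejected.
function isAllowedBedrockEndpoint(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  const host = url.hostname.toLowerCase();
  // The leading dot prevents look-alikes such as "evilamazonaws.com".
  if (!host.endsWith(".amazonaws.com")) return false;
  if (url.protocol === "https:") return true;
  // Plain http is tolerated only for PrivateLink VPC endpoint hosts.
  if (url.protocol === "http:") return host.endsWith(".vpce.amazonaws.com");
  return false;
}
```

Matching on the parsed hostname instead of the raw string keeps tricks like userinfo prefixes or suffix look-alikes from passing the check.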

…on (#4198) Multi-MB base64 span attributes (data: image/audio data URLs embedded in multimodal LLM input/output, plus very large params) were entering the event-sourcing fold path. The trace-processing pipeline read-modify-writes each trace's accumulated fold state in Redis per event; when that state is multi-MB, every op saturates the single-threaded Redis command loop, folding throughput collapses, staging outpaces it, and the backlog and Redis memory diverge. Cap attribute values over 256KB with a short placeholder in RecordSpanCommand, the shared chokepoint every ingestion path dispatches through (collector REST and OTLP both call commands.traces.recordSpan). Capping there shrinks the SpanReceivedEvent before it is folded, keeping fold state in KB not MB. Recurses through arrayValue/kvlistValue and handles bytesValue. Never throws on malformed values; normal traces are untouched. Helps new traces; the existing backlog is handled operationally.
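A rough sketch of the recursive cap described above, assuming an OTLP-JSON-like `AnyValue` shape. The names `capValue`, `placeholder`, and the placeholder wording are hypothetical, and string length is used as an approximation of byte size.

```typescript
// Simplified OTLP-style value; only the fields relevant to capping are modelled.
interface AnyValue {
  stringValue?: string;
  bytesValue?: string; // assumed base64 text, as in OTLP JSON
  arrayValue?: { values?: AnyValue[] };
  kvlistValue?: { values?: { key: string; value?: AnyValue }[] };
}

const MAX_ATTRIBUTE_LENGTH = 256 * 1024; // the 256KB cap, measured in characters here

function placeholder(length: number): string {
  return `[value removed: ${length} chars exceeded attribute size cap]`;
}

// Replaces oversized string/bytes values with a short placeholder and recurses
// into arrays and key-value lists. Malformed input is returned untouched.
function capValue(value: AnyValue, maxLen: number = MAX_ATTRIBUTE_LENGTH): AnyValue {
  if (value === null || typeof value !== "object") return value;
  if (typeof value.stringValue === "string" && value.stringValue.length > maxLen) {
    return { stringValue: placeholder(value.stringValue.length) };
  }
  if (typeof value.bytesValue === "string" && value.bytesValue.length > maxLen) {
    return { stringValue: placeholder(value.bytesValue.length) };
  }
  if (value.arrayValue && Array.isArray(value.arrayValue.values)) {
    return { ...value, arrayValue: { values: value.arrayValue.values.map((v) => capValue(v, maxLen)) } };
  }
  if (value.kvlistValue && Array.isArray(value.kvlistValue.values)) {
    const values = value.kvlistValue.values.map((kv) => ({
      ...kv,
      value: kv.value ? capValue(kv.value, maxLen) : kv.value,
    }));
    return { ...value, kvlistValue: { values } };
  }
  return value;
}
```

Applying the cap at the single command every ingestion path goes through means the event is already small before any fold state is written.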

…ore cycle (#4199)

* fix(event-sourcing): coalesce backed-up fold groups into one apply/store cycle

Fold projections process one event at a time: read state, apply, write back. For a single aggregate with N queued events that is N load+store round-trips over state that grows with N (O(n^2)), and the GroupQueue's per-group active gate forces strict one-at-a-time draining. A 233-span trace backed its fold group up and diverged instead of draining, stalling the single-threaded Redis command path for every tenant.

Add opt-in per-group batch coalescing: while a worker holds a group's active slot (which already excludes the group from concurrent dispatch), drain the group's remaining due jobs in one pop-only Lua call and fold them in a single load/apply/store cycle, in occurredAt order. A backed-up group now drains in O(n) and self-heals. When the queue keeps up, batches are size 1 and the per-event path is byte-for-byte unchanged.

Coalescing is bounded by a per-fold max batch (default 100) and enabled for all fold projections; other queues are untouched. A failed batch re-stages its drained siblings so none are lost, and the drain decrements total-pending per job exactly as dispatch does.

Also raise the Redis fold-cache TTL from 30s to 300s so the accumulated state stays warm across an aggregate's event stream instead of expiring mid-stream and forcing a ClickHouse re-read of the full state on every event.

Tests: executeBatch equivalence + ordering + out-of-order re-fold (unit); drainGroupReady semantics and pending-counter conservation (integration, real Redis); end-to-end coalescing, max-batch cap, disabled-passthrough, and failed-batch re-stage (integration, real Redis).

* fix(event-sourcing): per-event reactor dispatch under coalescing + review fixes

Address review on the coalescing PR:

- BLOCKER (per-span reactors): processFoldProjectionBatch fired fold reactors once with only the last event, which silently dropped N-1 spans for per-span reactors (customEvaluationSync extracts evals from event.data.span, evaluationTrigger/scenarioExecution key on each event), and only under backlog (batch>=2), which is exactly the recovery path. Now dispatch reactors per event with the final fold state: event-sensitive reactors see every event; fold-state reactors dedup via makeJobId to one effective run; the O(n) win stays in the single fold load/store, not the cheap reactor enqueues.
- Dispatch reactors (and fold) in occurredAt order so reactor metadata and the final state are consistent regardless of drain/arrival order.
- Global processBatch: guard that every payload maps to the same registry entry, falling back to per-item processing on any mismatch (never misroute).
- Fold cache TTL is now env-overridable (LANGWATCH_FOLD_CACHE_TTL_SECONDS, read at call time) so residency can be dialed down without a redeploy; default 300s.
- redis-fold-cache.feature: correct the store ordering wording (ClickHouse first, then Redis cache) to match RedisCachedFoldStore.

Coalescing stays default-on for every fold: the final folded state is identical to per-event folding, per-event reactor dispatch preserves event-sensitive reactors, and out-of-order is handled identically (executeBatch sorts by occurredAt + same checkpoint re-fold). A fold can still opt out via options.coalesceMaxBatch = 1.

Tests: projectionRouter unit tests assert a coalesced batch loads/stores once but dispatches reactors per event, in occurredAt order; a deterministic ClickHouse integration test folds many spans via executeBatch and reads the exact span count back from the trace_summaries table.

* test(event-sourcing): poll ClickHouse read-back in coalescing integration test

The coalescing fold writes the trace summary once; reading it straight back raced ClickHouse insert visibility. Poll the read (as the other pipeline integration tests do) so the assertion is deterministic. Verified green against live ClickHouse: 40 spans coalesced -> spanCount 40 persisted.

* fix(test): use branded TenantId in coalescing CH test store context

ProjectionStoreContext.tenantId is a branded TenantId, not a plain string. Type-only fix (the store already coerces with String()); unblocks tsgo.
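The core idea of the coalescing change (one load, N applies in occurredAt order, one store, then reactors per event with the final state) can be shown in a small sketch. This is not the projectionRouter or executeBatch code: `foldBatch`, `FoldStore`, and the in-memory store are hypothetical. The sketch is synchronous for brevity, whereas the real stores (Redis, ClickHouse) would be asynchronous. The max-batch cap and failed-batch re-staging are not modeled.

```typescript
// Hypothetical sketch of a coalesced fold; all names are illustrative.
interface FoldEvent {
  id: string;
  occurredAt: number;
}

interface FoldStore<S> {
  load(aggregateId: string): S | undefined;
  save(aggregateId: string, state: S): void;
}

function foldBatch<S, E extends FoldEvent>(args: {
  aggregateId: string;
  events: E[];
  store: FoldStore<S>;
  init: () => S;
  apply: (state: S, event: E) => S;
  onEvent: (event: E, finalState: S) => void;
}): S {
  // Fold in occurredAt order regardless of drain/arrival order.
  const ordered = [...args.events].sort((a, b) => a.occurredAt - b.occurredAt);

  // One load and one store for the whole batch instead of one per event.
  let state = args.store.load(args.aggregateId) ?? args.init();
  for (const event of ordered) {
    state = args.apply(state, event);
  }
  args.store.save(args.aggregateId, state);

  // Reactors still run once per event, each seeing the final fold state.
  for (const event of ordered) {
    args.onEvent(event, state);
  }
  return state;
}

// Usage: three out-of-order span events folded in a single cycle.
const counters = { loads: 0, saves: 0 };
const memory = new Map<string, { spanCount: number }>();
const store: FoldStore<{ spanCount: number }> = {
  load: (id) => {
    counters.loads++;
    return memory.get(id);
  },
  save: (id, s) => {
    counters.saves++;
    memory.set(id, s);
  },
};
const seen: string[] = [];
const finalState = foldBatch({
  aggregateId: "trace-1",
  events: [
    { id: "b", occurredAt: 2 },
    { id: "a", occurredAt: 1 },
    { id: "c", occurredAt: 3 },
  ],
  store,
  init: () => ({ spanCount: 0 }),
  apply: (s) => ({ spanCount: s.spanCount + 1 }),
  onEvent: (e) => {
    seen.push(e.id);
  },
});
```

With a batch of size 1 the function degenerates to the per-event path: one load, one apply, one store, one reactor call.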

pull Bot locked and limited conversation to collaborators Dec 17, 2025

pull Bot added the ⤵️ pull and merge-conflict (Resolve conflicts manually) labels Dec 17, 2025

rogeriochaves force-pushed the main branch from 5471f23 to e7499ef Compare December 26, 2025 13:10

rogeriochaves force-pushed the main branch 5 times, most recently from 1e7b14c to 2209258 Compare January 21, 2026 01:15

rogeriochaves force-pushed the main branch from 42c1a03 to 7a48b22 Compare March 26, 2026 15:58

drewdrewthis and others added 20 commits May 5, 2026 08:12

drewdrewthis and others added 30 commits May 21, 2026 19:53

chore(main): release mcp-server 0.9.0 (#4029)

5f4b8e3

chore(main): release typescript-sdk 0.29.0 (#4032)

2b3cb5a

chore(ci): sign release images with cosign and publish CycloneDX SBOMs (#4155)

9b65850

Release docs (auto-release on merge to main) [skip ci]

c48bbf0

Release docs (auto-release on merge to main) [skip ci]

b144a8f

feat(presence): move sharing-presence toggle from main menu to avatar dropdown (#4186)

1b480d0

pull[bot] wants to merge 1741 commits into erickirt:main from langwatch:main

7 participants
