Skip to content

feat(mcp): first-class metric source support#2437

Open
dhable wants to merge 21 commits into
mainfrom
hdx-4347-mcp-tools-improved-metrics-support
Open

feat(mcp): first-class metric source support#2437
dhable wants to merge 21 commits into
mainfrom
hdx-4347-mcp-tools-improved-metrics-support

Conversation

@dhable

@dhable dhable commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Why: Metric sources were MVP-only in MCP — describe_source early-returned with just metricTables, builder tiles on metric sources saved but failed at render time, the query tools couldn't carry metricName / metricType, and the prompt told agents to fall back to raw SQL. Closes that gap.

What changed:

  • New tools
    • clickstack_list_metrics — paginated metric-name catalog with kind, namePattern (ILIKE), time-window filters, and opaque cursor pagination.
    • clickstack_describe_metric — per-metric unit, description, attribute keys, sampled values. kind is required (paired with metricName by every upstream discovery tool, so requiring it eliminates the multi-kind probe path).
  • Extended tools
    • clickstack_describe_source now picks a representative metric table (gauge → sum → histogram) and runs the existing column / map-key / value-sampling pipeline against it, plus a per-kind metric-name sample.
    • clickstack_timeseries / clickstack_table accept metricType, metricName, isDelta on each select item, plus aggFn:"increase" for Sum counters. valueExpression defaults to "Value". Emits a hint when the renderer's 20-group cap on increase + groupBy may apply. Reject up-front when metricType is paired with a non-metric source (or omitted on a metric source) instead of falling through to a cryptic ClickHouse error.
  • Prompt — replaces the "use raw SQL for metric tiles" workaround with positive discovery-workflow guidance and one worked example per supported kind.
  • Out of scope: summary and "exponential histogram" kinds (no renderer support).

Live MCP testing findings & fixes

Live testing against a populated metric source surfaced 5 issues. All are fixed in this PR.

# Symptom Root cause Fix
1 Cast to ObjectId failed for value "..." (type string) at path "_id" for model "Source" leaked through clickstack_describe_source / _list_metrics / _describe_metric on a bad sourceId getSource() didn't guard against malformed input Pre-check ObjectId.isValid + try/catch around findOne; all tools surface their existing "Source not found" branch (19bc4205)
2 clickstack_timeseries / clickstack_table accepted metricType+metricName on non-metric sources and failed at SQL render with Unknown expression or function identifier "Value" No source-kind ↔ select-item shape guardrail in runConfigTile New assertSourceKindMatchesSelect() helper called between source-not-found and the connection lookup — both directions of the mismatch return agent-actionable errors (7001a915)
3 metricTables map in clickstack_list_sources / _describe_source responses included a stray _id ObjectId alongside the valid kind keys Inline type: { … } Mongoose subdoc auto-adds _id Declare a child Schema(..., { _id: false }) for metricTables; defense-in-depth sanitizeMetricTables() at the MCP response boundary (7ef9d9ca)
4 clickstack_describe_metric consistently timed out at 10 s on multi-kind metrics (e.g. container.cpu.usage lives in both gauge and sum) Auto-detect path probed every populated kind serially, plus fetchAttributeKeys ran an unbounded WHERE MetricName = ? scan with no time filter Require kind (every upstream tool already pairs it with name); bound fetchAttributeKeys with a CTE sample LIMIT + server-side max_execution_time (30203f5e)
5 clickstack_describe_source on a populated metric source timed out on the cold call, then succeeded on retry Stages 2–5 fell through to the no-rollup main-table-scan path without timestampValueExpression, so the helpers' time filter was silently dropped Thread source.timestampValueExpression through to getMapKeys, both getAllKeyValues calls, and sampleMetricNamesForKind (31bc21b5)

Screenshots or video

N/A — backend-only.

How to test on Vercel preview

N/A — non-UI change.

References

  • Linear Issue: Closes HDX-4347
  • Parent: HDX-4079

@changeset-bot

changeset-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 3b0b570

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/api Minor
@hyperdx/app Minor
@hyperdx/otel-collector Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel

vercel Bot commented Jun 10, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment Jun 12, 2026 8:30pm
hyperdx-storybook Ready Ready Preview, Comment Jun 12, 2026 8:30pm

Request Review

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds first-class metric source support to the HyperDX MCP server, replacing the previous "use raw SQL" workaround. It introduces clickstack_list_metrics and clickstack_describe_metric tools, extends clickstack_describe_source to discover metric schemas, and wires metricType/metricName/isDelta/aggFn:"increase" into the existing timeseries and table query tools.

  • Five production issues fixed: ObjectId validation in getSource, source-kind/select-item mismatch guardrail, metricTables._id leak via Mongoose subdoc, fetchAttributeKeys unbounded scan, and describeSource cold-start timeout — all addressed with targeted fixes.
  • QUERYABLE_METRIC_KINDS and sanitizeMetricTables consolidated into a shared metricKinds.ts module imported by every metric-aware tool; previously duplicated in three files.
  • clickstack_list_metrics (new) is the only tool in this PR that lacks the AbortController + Promise.race timeout guard applied to describeMetric and describeSource; its ClickHouse queries also carry no max_execution_time setting.

Confidence Score: 4/5

Safe to merge with the understanding that clickstack_list_metrics may hang under ClickHouse load; all other production issues flagged in the PR description are correctly addressed.

The new clickstack_list_metrics tool omits the timeout/abort circuit-breaker that both sibling tools in this PR (describeMetric, describeSource) were explicitly given after production timeout incidents. The GROUP BY MetricName query in fetchMetricsForKind has no abort_signal, no clickhouse_settings max_execution_time, and the outer handler has no Promise.race wrapper — the same conditions that caused the cold-start and multi-kind timeouts described in the PR. All other changes are well-guarded and the broader refactoring (metricKinds.ts consolidation, ObjectId guard, Mongoose _id: false, assertSourceKindMatchesSelect) is solid.

packages/api/src/mcp/tools/sources/listMetrics.ts — the fetchMetricsForKind function and the tool handler both need timeout/abort plumbing consistent with describeMetric.ts and describeSource.ts.

Important Files Changed

Filename Overview
packages/api/src/mcp/tools/sources/listMetrics.ts New paginated metric-catalog tool. Cursor encoding/pagination logic is correct, but unlike the other metric tools introduced here it has no wall-clock timeout, no abort signal on ClickHouse queries, and no per-query max_execution_time setting.
packages/api/src/mcp/tools/sources/describeMetric.ts New per-metric drill-down tool. AbortController timeout is hoisted into a finally block (clearing the timer on success), sampleAttributeValues now mirrors fetchAttributeKeys's CTE LIMIT + max_execution_time pattern, and kind is required rather than auto-detected serially.
packages/api/src/mcp/tools/sources/describeSource.ts Extended with metric-source support: picks a representative metric table, runs the existing column/map/value pipeline against it, and samples MetricName per kind. timestampValueExpression is correctly threaded into all metadata helpers.
packages/api/src/mcp/tools/sources/metricKinds.ts New shared module consolidating QUERYABLE_METRIC_KINDS and sanitizeMetricTables that were previously duplicated across three files; compile-time assertion keeps the string literals in sync with the MetricsDataType enum.
packages/api/src/mcp/tools/query/helpers.ts Adds assertSourceKindMatchesSelect guardrail, metric select defaulting, and increase+groupBy top-N hint. Uses ?? for valueExpression defaulting. Logic is sound.
packages/api/src/mcp/tools/query/schemas.ts Adds metricType/metricName/isDelta/increase to select item schema; getMetricSelectIssues enforces cross-field rules (sum-only increase, histogram quantile+level, etc.); applyMetricSelectDefaults defaults valueExpression to "Value" for metric items.
packages/api/src/controllers/sources.ts getSource now validates ObjectId before calling findOne and wraps in try/catch to convert Mongoose CastError into null, preventing internal error leakage through MCP tools.
packages/api/src/models/source.ts MetricTablesSchema declared with _id: false so Mongoose subdoc no longer auto-generates an ObjectId that leaks into MCP responses.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant MCP as MCP Server
    participant CH as ClickHouse

    Agent->>MCP: clickstack_list_sources
    MCP-->>Agent: sources[] (with sanitized metricTables)

    Agent->>MCP: clickstack_describe_source(sourceId)
    MCP->>CH: getColumns(representativeMetricTable)
    MCP->>CH: getMapKeys / getAllKeyValues (timestampValueExpression scoped)
    MCP->>CH: sampleMetricNamesForKind (per populated kind)
    MCP-->>Agent: columns, mapKeys, metricNames sample

    Agent->>MCP: clickstack_list_metrics(sourceId, kind?, namePattern?)
    MCP->>CH: fetchMetricsForKind (GROUP BY MetricName, time-filtered)
    Note over MCP,CH: No AbortSignal / max_execution_time
    MCP-->>Agent: metrics[], nextCursor?

    Agent->>MCP: clickstack_describe_metric(sourceId, metricName, kind)
    MCP->>CH: fetchAttributeKeys (CTE LIMIT + max_execution_time:8s)
    MCP->>CH: sampleAttributeValues (CTE LIMIT + max_execution_time:8s)
    Note over MCP: Promise.race(impl, 10s timeout)
    MCP-->>Agent: attributeKeys, sampledValues, unit, description

    Agent->>MCP: clickstack_timeseries / clickstack_table
    MCP->>MCP: validateMetricSelectItems + applyMetricSelectDefaults
    MCP->>MCP: assertSourceKindMatchesSelect
    MCP->>CH: queryChartConfig (max_execution_time:30s)
    MCP-->>Agent: result + increase top-N hint?
Loading

Fix All in Claude Code Fix All in Conductor Fix All in Cursor Fix All in Codex

Reviews (9): Last reviewed commit: "[HDX-4347] feat(mcp): distinguish unsamp..." | Re-trigger Greptile

Comment thread packages/api/src/mcp/tools/sources/listMetrics.ts Outdated
Comment thread packages/api/src/mcp/tools/sources/describeMetric.ts
Comment thread packages/api/src/mcp/tools/sources/describeMetric.ts Outdated
Comment thread packages/api/src/mcp/tools/query/helpers.ts Outdated
@dhable dhable force-pushed the hdx-4347-mcp-tools-improved-metrics-support branch from 91f17a6 to 6f63d7f Compare June 10, 2026 18:26
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

E2E Test Results

All tests passed • 202 passed • 3 skipped • 1319s

Status Count
✅ Passed 202
❌ Failed 0
⚠️ Flaky 1
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

);
throw e;
} finally {
if (timeoutId !== undefined) clearTimeout(timeoutId);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by fix: this setTimeout-never-cleared pattern is pre-existing and was not flagged by greptile (the surrounding code is untouched by HDX-4347). Fixing here for consistency with the same fix in describeMetric.ts two files over — both files now clearTimeout in finally so a stale controller.abort() will no longer fire DESCRIBE_TIMEOUT_MS after every successful describe call, and the setTimeout closure is released immediately on resolution.

Comment thread packages/api/src/mcp/tools/sources/describeMetric.ts
dhable added 14 commits June 11, 2026 15:53
…emas

Extend the MCP select-item schemas with first-class metric fields so the
clickstack_timeseries, clickstack_table, and dashboard builder tile tools
can accept metric-source queries:

- Add 'increase' to MCP_AGG_FN_OPTIONS (sum-only counter-increase aggFn)
- Add optional metricType (gauge/sum/histogram), metricName, and isDelta
  fields to mcpSelectItemSchema and mcpTileSelectItemSchema
- Restrict metricType to the three kinds the renderer supports today;
  summary and exponential histogram are excluded
- Extract cross-field validation into getMetricSelectIssues so both
  schemas share the same metric rules: metricType<->metricName paired,
  'increase' is sum-only, histogram only allows quantile (with level) or
  count, isDelta is gauge-only, valueExpression is required for non-count
  aggFns unless metricType is set (defaults to 'Value' for metric sources)
- Apply validation imperatively in the timeseries/table handlers via
  validateMetricSelectItems (works around a Zod 3.x quirk where
  .superRefine widens optional-field inference and breaks the strict
  consumer types of mergeWhereIntoSelectItems and resolveOrderBy)
- Update timeseries/table tool descriptions with a METRIC SOURCES
  subsection covering the discovery workflow chain and per-kind aggFn
  guidance, plus the 20-group top-N cap on sum+increase+groupBy
- Exclude 'increase' from AGG_FN_NAMES so resolveOrderBy doesn't try to
  synthesize a non-existent SQL increase() function

Adds 20 unit tests covering the validator branches and the resolveOrderBy
handling of 'increase'. No new TypeScript errors above the pre-existing
baseline on main.
…er path

Make clickstack_timeseries and clickstack_table actually work against
metric sources. Before this change the builder branch of runConfigTile
fell back to the source's empty tableName and never threaded
source.metricTables onto the chart config, so any metric query failed
with 'Both table name and UUID are empty' at render time.

For metric sources, the builder branch now:
- Sets from.tableName = '' (renderer picks the per-kind ClickHouse table
  from metricTables at render time)
- Adds metricTables: source.metricTables to the chart config (mirrors
  packages/api/src/routers/external-api/v2/charts.ts:261-267)
- Defaults each select item's valueExpression to 'Value' when missing,
  matching the agent-friendly external-API behavior so metric queries
  can omit valueExpression and have it resolve correctly

Also surfaces the renderer's increase+groupBy top-N cap (20 groups) to
agents:
- Adds annotateIncreaseTopNHint helper that attaches a neutral hint to
  successful, non-empty tool results when any select item uses
  aggFn:'increase' alongside a non-empty groupBy
- Exposes INCREASE_TOP_N_CAP=20 as a named constant referencing the
  renderer's INCREASE_MAX_NUM_GROUPS in common-utils
- Wires the hint into both clickstack_timeseries and clickstack_table

Adds 9 unit tests covering the hint helper across the increase+groupBy
condition, empty results, error results, missing groupBy, missing
increase aggFn, and malformed/missing content envelopes. No new
TypeScript errors above the pre-existing baseline on main.
Replace the early-return for metric sources at describeSource.ts:97-115
with a metric-aware discovery flow so agents can use the existing
describe-then-query workflow against metric sources without falling back
to clickstack_sql.

For metric sources, describe_source now:
- Picks a representative metric table (gauge -> sum -> histogram) via
  pickRepresentativeMetricTable, exposed in meta.discoveryMetricKind so
  agents know which kind's schema they're seeing
- Reuses the existing stages 1-4 (columns, map keys, low-cardinality
  values, map-attribute values) against the representative table — the
  OTel Collector metric schemas share Attributes/ResourceAttributes/
  ScopeAttributes shape, so one sample is representative enough for the
  starter view
- Samples up to 20 distinct MetricName values per queryable kind in a
  new stage 5 (sampleMetricNamesForKind), reusing the discovery dateRange
  and respecting the existing wall-clock timeout / partial-result flag
- Defensively enriches each sampled metric with MetricUnit /
  MetricDescription when those columns exist on the kind's table
  (getColumns presence check); falls back gracefully on schemas that
  omit them
- Surfaces the new clickstack_list_metrics / clickstack_describe_metric
  discovery tools via nextSteps.discovery so agents can paginate or
  drill into per-metric attribute keys when the sample isn't enough

Updates the existing locking test in sources.test.ts that asserted
columns/lowCardinalityValues/etc. were undefined for metric sources;
the test now asserts the rich discovery payload (columns present,
discoveryMetricKind set to 'gauge', nextSteps mention the new
discovery tools).

No new TypeScript errors above the pre-existing baseline on main.
…ribe_metric

Add two new metric-specific discovery tools so agents can find and
characterise metrics on a metric source without falling back to
clickstack_sql or guessing OTel semantic-convention names.

clickstack_list_metrics
- Paginated metric-name catalog scoped to a metric source
- Optional kind filter (gauge/sum/histogram); omit to scan every
  populated kind in order
- Optional namePattern (ClickHouse ILIKE) and time window
- Defaults to last 24h to surface only metrics with recent data
- Opaque base64 cursor pagination over (kind, lastName) pairs so
  multi-kind scans resume cleanly mid-result
- Default limit 50, max 500
- Defensive MetricUnit/MetricDescription enrichment using getColumns
  presence checks (matches the OTel Collector default schema; falls
  back gracefully on custom schemas)
- Empty-result hint guides the agent to widen the filter or drop kind

clickstack_describe_metric
- Per-metric drill-down: kind(s), unit, description, attributeKeys per
  map column, sampled values (when sampleValues=true, default)
- kind auto-detection across all populated tables when omitted, via
  parallel SELECT 1 probes per kind
- Reuses metadata.getAllFields with metricName filter for attribute-key
  discovery scoped to the specific metric (existing helper at
  packages/common-utils/src/core/metadata.ts:736-816)
- Per-kind usage guidance baked into the response (aggFn hints, level
  requirement for histogram quantile, 20-group cap for sum + increase)
- Pre-built clickstack_timeseries call example in nextSteps.query with
  the right aggFn / level for the detected kind
- 10s wall-clock timeout matches clickstack_describe_source

Both tools reject non-Metric sources with a friendly hint pointing at
clickstack_list_sources. Tool descriptions document the full discovery
workflow chain: list_sources -> describe_source -> list_metrics ->
describe_metric -> timeseries|table.

Notable implementation detail: the queryable-kind tuple is declared as
plain string literals ('gauge'|'sum'|'histogram') rather than
MetricsDataType enum members because Zod 3.x's z.enum(...) infers an
unknown-widened type when fed an enum-member tuple, which breaks the MCP
SDK callback inference downstream. A compile-time assertion keeps the
string literals in sync with the enum.

No new TypeScript errors above the pre-existing baseline on main.
…idance in prompt

The clickstack_save_dashboard prompt previously told the model to fall
back to raw SQL tiles for any metric source because the MCP select-item
shape did not carry metricType / metricName and builder tiles failed at
render time with 'Both table name and UUID are empty'. Stages 1-3b
fixed both gaps. Update the prompt and its regression tests to teach the
new discovery + builder workflow.

prompts/dashboards/content.ts
- Replace the 'NOTE: Authoring builder tiles on a metric source is not
  reliable today...' paragraph with a dedicated '== METRIC SOURCES =='
  section that documents:
  * the required select-item fields (metricType, metricName) and the
    valueExpression='Value' default
  * per-kind aggregation guidance for gauge / sum / histogram including
    the histogram quantile + level requirement and the 20-group cap on
    sum + increase + groupBy
  * the five-step discovery workflow (list_sources -> describe_source
    -> list_metrics -> describe_metric -> timeseries|table)
  * one worked example per supported kind using real OTel metric names
- Drop 'metric tiles with multiple metricTables' and 'builder tiles on
  metric sources' from the validation-known-gaps paragraphs at lines
  127 and 1247 since both gaps are closed.

__tests__/dashboards/prompts.test.ts
- Rewrite the two regression tests that asserted the old workaround
  language. The new tests assert the positive guidance is present
  (metricType / metricName / isDelta fields are named, per-kind aggFn
  rules and 20-group cap are documented, the five-step workflow chain
  is enumerated, one worked example per kind appears) AND assert the
  old workaround language is gone (no 'Authoring builder tiles on a
  metric source is not reliable', no 'Both table name and UUID are
  empty').
…s, cursor

Adds unit and integration coverage for the metric pieces shipped in
Stages 1-3b.

listMetricsCursor.test.ts (NEW, unit)
- 12 tests for encodeCursor / decodeCursor round-trips and rejection
  cases. Exercises malformed base64, malformed JSON, missing fields,
  wrong types, non-queryable kinds (summary, exponential histogram),
  empty input, and unicode metric names. Exports cursor helpers from
  listMetrics.ts as @internal so they can be tested without spinning
  up ClickHouse.

sources.test.ts (integration; expanded)
- clickstack_list_metrics:
  * rejects non-metric sources and unknown source IDs
  * returns metrics across every populated kind when kind is omitted
  * narrows to a single kind when kind is set
  * applies namePattern as a server-side ILIKE filter
  * paginates via opaque nextCursor and resumes cleanly
  * rejects malformed cursors with an actionable error
  * surfaces the empty-result hint
- clickstack_describe_metric:
  * rejects non-metric sources
  * returns an actionable hint when the metric is unknown
  * auto-detects kind for a gauge metric, returns attribute keys +
    sampled values + matching usage example
  * respects sampleValues:false
  * returns the increase-shaped query example for sum kind
  * returns the quantile + level example for histogram kind
  * rejects an explicit kind when the source has no table for it

queryTool.test.ts (integration; expanded)
- clickstack_timeseries and clickstack_table:
  * gauge avg over time
  * sum + aggFn:'increase' table
  * histogram + aggFn:'quantile' + level:0.95 timeseries
- Schema rejection paths:
  * aggFn:'increase' on a gauge -> friendly schema error
  * aggFn:'quantile' on a histogram without level
  * aggFn:'avg' on a histogram (not supported)
  * metricType set without metricName
  * isDelta on a non-gauge metric

All unit tests pass (103 total: 63 query + 12 cursor + 28 prompts).
Integration tests require make dev-int with the standard Docker stack
to run end-to-end (they rely on bulkInsertMetricsGauge/Sum/Histogram
seeding). No new TypeScript errors above the pre-existing baseline on
main.
Update the user-facing MCP tool reference to reflect the new metric
discovery + querying capabilities shipped in Stages 1-3b.

- Add clickstack_list_metrics and clickstack_describe_metric rows to
  the tool table
- Mark clickstack_describe_source / clickstack_timeseries /
  clickstack_table as supporting metric sources alongside log/trace
- Add a new 'Metric Sources' section covering:
  * the required metricType + metricName fields on each select item
  * the valueExpression='Value' default
  * per-kind aggregation guidance (gauge/sum/histogram, the increase
    top-N cap on sum + groupBy, the level requirement for histogram
    quantile)
  * the unsupported metric kinds (summary, exponential histogram)
  * the five-step discovery workflow chaining list_sources ->
    describe_source -> list_metrics -> describe_metric ->
    timeseries|table

Closes the documentation acceptance criterion on HDX-4347.
- Extract QUERYABLE_METRIC_KINDS to a shared
  packages/api/src/mcp/tools/sources/metricKinds.ts so describeMetric,
  listMetrics, describeSource, and the query / dashboard tile schemas
  all reference a single source of truth. Adding a new queryable kind
  (e.g. when the renderer learns 'summary') is now a one-file change
  (greptile).
- Drop the unused 'ms' import in listMetrics.ts that would have failed
  CI lint (greptile).
- Clear the describe_metric wall-clock timeout in 'finally' so a stale
  controller.abort() does not fire DESCRIBE_TIMEOUT_MS after every
  successful call, and the setTimeout closure does not stay pinned
  for the same duration (greptile).
- Apply the same clearTimeout fix to describe_source as a drive-by:
  the bug pre-existed this PR but the pattern is now consistent across
  both tools (PR comment posted on the line).
- Use ?? instead of || for the metric-source valueExpression default
  in runConfigTile so an explicit empty string is not silently swapped
  for 'Value' (greptile).
- Remove the unused MCP_METRIC_TYPE_OPTIONS export; mcpMetricTypeSchema
  now builds directly off the shared QUERYABLE_METRIC_KINDS tuple
  (knip). The dashboard tile mcpTileMetricTypeSchema also switches to
  the shared constant for consistency.

Verified: yarn tsc --noEmit baseline diff = 0, yarn lint clean,
yarn knip clean, 105 unit tests pass.
Bug 1: timeseries / table metric queries failed with 'Value expression
is required for non-count aggregation functions'

The Stage 2 default of valueExpression='Value' for metric items ran
inside runConfigTile, AFTER buildTile had already called
externalDashboardTileSchemaWithId.parse(). That schema's superRefine
rejects non-count aggregations with an empty valueExpression, so any
metric query that omitted the field never reached the runtime default.

Add applyMetricSelectDefaults() to tools/query/schemas.ts and call it
in the timeseries / table handlers BEFORE buildTile, mirroring the
agent-friendly default the external API v2 charts path applies on the
REST surface. The runtime default in runConfigTile stays as defensive
coverage for saved dashboards.

Bug 2: clickstack_describe_metric returned an empty attributeKeys map
for metric scopes that DID have data

metadata.getAllFields({ metricName }) was returning empty arrays for
metric-scoped Map-column scans in the CI environment. Replace the call
with fetchAttributeKeys(), a direct query that issues
'arrayDistinct(arrayFlatten(groupArray(mapKeys(col))))' for each Map
column on the kind's table. This matches the proven pattern used by
the web app's useFetchMetricResourceAttrs hook and is more transparent
than the metadata helper for this single-purpose case.

Verified end-to-end against make dev-int:
- queryTool.test.ts: 53/53 pass (was 50/53 before this commit)
- sources.test.ts: 76/76 pass (was 75/76 before this commit)
- query.test.ts unit suite: 69/69 (added 4 new tests for
  applyMetricSelectDefaults)

No new TypeScript errors above the pre-existing baseline.
The new applyMetricSelectDefaults tests inferred narrow object shapes
from inline literals, so 'valueExpression' was absent from the inferred
generic type and TS errored on accessing 'out[0].valueExpression'.
Annotate each test input as 'SelectItem' so the optional field stays in
the inferred type. Tests behave identically at runtime; this is a
TypeScript-only fix.
Calling clickstack_describe_source / clickstack_list_metrics /
clickstack_describe_metric with a non-ObjectId sourceId surfaced a raw
Mongoose CastError to the MCP client:

  Cast to ObjectId failed for value "…" (type string) at path "_id"
  for model "Source"

The well-formed-but-missing path was already handled by each tool's
"Source X not found" branch, but malformed IDs threw before getSource
returned. Pre-check via mongoose.Types.ObjectId.isValid and wrap the
findOne in a defensive try/catch so getSource returns null on any
malformed input and the existing not-found branch fires cleanly.

Add a controller unit test asserting both the malformed and missing
ObjectId paths return null.
…ype mismatch

clickstack_timeseries and clickstack_table accepted metricType +
metricName on any source. When the source was not a metric source, the
renderer composed SQL that referenced the metric "Value" column and let
ClickHouse fail with:

  Unknown expression or function identifier `Value` in scope
  SELECT AVG(toFloat64OrDefault(toString(Value))) … FROM default.otel_traces

The reciprocal case (metric source with no metricType) reached the
renderer in the same broken state.

Add assertSourceKindMatchesSelect() in tools/query/helpers.ts and call
it inside runConfigTile right after getSource(). Both clickstack_timeseries
and clickstack_table inherit the guard, with agent-actionable error
messages that mirror the wording on clickstack_list_metrics /
clickstack_describe_metric.

Add unit tests covering both mismatch directions plus the no-op paths
(raw-string select, empty metricType, non-array select).
clickstack_list_sources and clickstack_describe_source emitted the
embedded metricTables subdoc with a stray Mongoose-generated _id key
mixed in with the legitimate kind keys:

  "metricTables": {
    "gauge": "otel_metrics_gauge",
    "sum": "otel_metrics_sum",
    "_id": "6a2ad3b4da94be764e2deed8"
  }

The metricTables field was declared as an inline `type: {...}` literal
which Mongoose interprets as an embedded subdoc schema with an
auto-generated _id field. Convert it to a real child Schema with
{ _id: false } so newly persisted documents stop carrying the stray id.

Add a sanitizeMetricTables() helper as belt-and-braces defense for
documents persisted before the schema fix, and use it at the MCP
response boundary in both list_sources and describe_source. The helper
looks up each valid MetricsDataType key via property access so it
handles both plain objects and Mongoose subdoc instances.

Tests: unit tests for sanitizeMetricTables covering the _id strip,
non-string-value drop, unknown-key drop, and Mongoose-subdoc shape.
Extend the existing describe_source integration test to assert the
response payload only contains valid kind keys.
dhable added 2 commits June 11, 2026 15:53
clickstack_describe_metric historically auto-detected the metric kind
when omitted, probing every populated metric table for the given
MetricName and returning one entry per kind that matched. This path
was the consistent source of 10s wall-clock timeouts on multi-kind
metrics (e.g. container.cpu.usage which lives in both gauge and sum
tables), even with sampleValues:false.

The auto-detect path was also redundant: the upstream discovery tools
both surface kind alongside name:
  - clickstack_list_metrics returns { name, kind, ... }
  - clickstack_describe_source returns metricNames grouped by kind

By the time an agent reaches clickstack_describe_metric it already
has the kind. Requiring it on the input drops the detectKindsForMetric
probe, eliminates the multi-kind discovery path, and removes the only
place the tool ran across more than one ClickHouse table.

Also bound the per-kind attribute-keys discovery so the single-kind
path stays fast on production-scale metric tables. The previous
inline SQL ran an unbounded `WHERE MetricName = ?` scan with no
time filter, which still exceeded the wall-clock timeout on tables
with millions of rows per metric. Wrap the scan in a CTE that
samples at most 100k matching rows (MetricName + time window) and
aggregate map keys from that sample, plus an 8s server-side
max_execution_time as a belt-and-braces safety net.

Changes:
- Make `kind` required in describeMetricSchema (was optional).
- Remove detectKindsForMetric — auto-detect path no longer exists.
- Collapse the if/else branch to the explicit-kind path.
- Always return exactly one kindDetail; emit a top-level hint when
  the (metric, kind) pair has no signal so the agent can widen the
  window or verify the kind via clickstack_list_metrics.
- Bound fetchAttributeKeys with a CTE-LIMIT sample + max_execution_time
  so the discovery query stays within the wall clock on hot metrics.
- Update the tool description and the timeout hint accordingly.

Tests: drop the auto-detect and multi-kind tests; existing tests that
already pinned `kind` are unchanged. Update the unknown-metric test
to reflect the new always-one-kind shape. Add a test asserting the
tool rejects calls with missing kind.
…e_source

clickstack_describe_source consistently timed out on the cold-cache
first call against a populated metric source, then succeeded on retry.
The bottleneck was the value-sampling stages (3 + 4) and the metric-
name sampling stage (5), which all delegate to metadata.getAllKeyValues
/ metadata.getMapKeys.

When the source has no metadataMaterializedViews configured (the case
for every metric source today), those helpers fall back to scanning
the raw metric table. The fallback path only applies a time filter
when timestampValueExpression is provided alongside dateRange — and
describe_source was passing dateRange but not timestampValueExpression.
The result was an unbounded scan per key against the full metric
table on cold cache, easily exceeding the 10s wall-clock timeout.

Pass source.timestampValueExpression through to:
  - metadata.getMapKeys              (stage 2)
  - metadata.getAllKeyValues × 2     (stages 3 + 4)
  - sampleMetricNamesForKind         (stage 5)
    and its inner metadata.getAllKeyValues call

For log / trace sources this is also a small win because the no-MVs
fallback now scopes its scan, but the practical impact is on metric
sources where the rollup MVs are not yet configured.

No new tests: existing sources.test.ts coverage already exercises
every stage on a metric source. The change is observably load-bearing
on the live MCP server — the first call now returns the full payload
in under the 10s budget instead of timing out.
@dhable dhable force-pushed the hdx-4347-mcp-tools-improved-metrics-support branch from 31bc21b to e27d683 Compare June 11, 2026 20:56
@dhable dhable changed the title [HDX-4347] feat(mcp): first-class metric source support feat(mcp): first-class metric source support Jun 11, 2026
…AttributeKeys

Greptile pointed out that sampleAttributeValues hit the same
(MetricName, time range) filter on the same metric table as
fetchAttributeKeys but lacked any of the scan bounds the earlier
commit added there:

  - no inner LIMIT subquery
  - no max_execution_time / timeout_overflow_mode

On a hot metric with millions of matching rows in the default 24-hour
window the query could run until the client-side AbortController
fires the wall-clock timeout, after which the ClickHouse server may
still be processing the work.

Apply the same pattern used in fetchAttributeKeys:
  - Wrap the FROM/WHERE clause in a subquery that LIMITs at
    METRIC_ATTR_KEYS_SAMPLE_SIZE matching rows. The inner SELECT
    projects only the distinct set of map columns referenced by the
    sampled keys.
  - Add clickhouse_settings: { max_execution_time:
    METRIC_ATTR_KEYS_MAX_EXEC_SECONDS, timeout_overflow_mode: 'break' }
    as a server-side ceiling.

Live retest against the metric source on staging confirms the
attributeValues payload still comes back rich (same k8s.* diversity
on container.cpu.usage) — the 100k-row sample is plenty for value
discovery.

Refs: #2437 (comment)
@dhable dhable marked this pull request as ready for review June 11, 2026 21:21
@dhable dhable requested a review from brandon-pereira June 11, 2026 21:21
@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jun 11, 2026
@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 2341 production lines changed (threshold: 1000)
  • Touches API routes or data models — hidden complexity risk

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 14
  • Production lines changed: 2341 (+ 1567 in test files, excluded from tier calculation)
  • Branch: hdx-4347-mcp-tools-improved-metrics-support
  • Author: dhable

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

Comment thread packages/api/src/controllers/sources.ts Outdated
return await Source.findOne({ _id: sourceId, team });
} catch {
// Defense-in-depth: if Mongoose still throws (e.g. a future cast
// path), treat it as "not found" so the MCP tools surface a clean

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs(non-blocking): we should make this comment more generic since this controller is used across the app (eg so the caller can surface a clean error)

@@ -92,14 +109,28 @@
}
const { startDate, endDate } = timeRange;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue(existing, non-blocking): Not related to this PR, but during testing this the top reason that LLM went from using timeseries to raw sql was the following:

timeseries.ts threads granularity into the tile config (lines 135–142), but buildTile in
helpers.ts (lines 187–202) parses the input through externalDashboardTileSchemaWithId. The
line/stacked_bar sub-schema in zod.ts (lines 246–261) has no granularity field, so Zod
silently strips it.

// packages/api/src/utils/zod.ts:246-261
// ❌ granularity is not declared — Zod strips any extra fields
const lineTileSchema = z.object({
  type: z.literal('line'),
  // ... other fields, but no granularity
});

Downstream, convertToInternalTileConfig never re-injects it — contrast with the REST charts path,
which explicitly defaults it:

// packages/api/src/routes/charts.ts:289
granularity: translatedGranularity ?? 'auto',

Impact

The renderer only includes the time bucket in SELECT/GROUP BY/ORDER BY when
chartConfig.granularity != null:

// packages/api/src/renderChartConfig.ts:109-128
if (chartConfig.granularity != null) {
  // adds __hdx_time_bucket to SELECT, GROUP BY, ORDER BY
}

Because granularity was stripped, every MCP timeseries call for gauge/sum metrics (and
log/trace sources) returns a single collapsed row per group — no timestamp column at all

regardless of what the agent passed.

The gauge/sum translation does bucket the inner CTE with granularity || 'auto', but then
spreads ...restChartConfig for the outer query, so the outer aggregation collapses. Histogram
only escapes this by hard-coding __hdx_time_bucket directly (line 2106).

// 1 row, the data likely collapsed into a single time bucket. Add a
// hint so the agent knows to adjust the time range.
if (
result.content?.[0]?.type === 'text' &&

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue(existing, non-blocking): Another existing but more clear one now with metrics:

Bug: Hint Heuristics Fire on Wrong Signal, Recommend Wrong Fix, and Clobber Each Other

1. Single-Bucket Hint Triggers on the Wrong Signal

The "only 1 time bucket" hint in timeseries.ts:170-189 triggers on data.length === 1, which conflates three distinct cases:

  • Genuine single-bucket collapse (granularity too coarse)
  • A single group when groupBy is set (working as intended)
  • The no-time-column collapse from previous comment — where the advice to widen range or use coarser granularity can't help and "coarser" is actually backwards

A more robust signal would be to inspect result.meta for the presence of the __hdx_time_bucket column rather than counting rows. This distinguishes the no-time-column case (issue #1) from a genuine single-bucket result, and avoids misfiring when groupBy produces exactly one group.

`aggFn:"increase" combined with groupBy is capped at the top ${INCREASE_TOP_N_CAP} ` +
`groups by max bucket sum at the renderer layer. ` +
`Results may not include every group present in the underlying data.`;
first.text = JSON.stringify(parsed, null, 2);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: annotateIncreaseTopNHint in helpers.ts:99-121 emits a "may be truncated" disclaimer whenever increase + groupBy returns any data — even with fewer than 20 groups — making it a false alarm rather than an actual detection.

Additionally, both this function and the single-bucket hint in timeseries.ts assign to parsed.hint directly. Since the single-bucket hint runs first (timeseries.ts:165) and annotateIncreaseTopNHint runs after (timeseries.ts:178), one silently overwrites the other with no indication either existed.

Suggested fix: change hint to hints: string[] and append at each site. Gate the truncation hint on an actual group count threshold (e.g. >= 20) rather than emitting unconditionally.

{ tableName, error: e instanceof Error ? e.message : String(e) },
'sampleAttributeValues failed',
);
return {};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Failures are silently discarded in several places — describeMetric.ts:301-307, describeMetric.ts:412-418, and per-kind skips in listMetrics.ts:385-395 — causing transient ClickHouse errors to surface as the noSignal hint (describeMetric.ts:584-604):

"No data found — try widening startTime/endTime"

Agents cannot distinguish "genuinely empty" from "fetch failed" and will follow incorrect recovery steps as a result.

Suggested fix: instead of swallowing errors into empty returns, surface a partialFailure indicator in the response shape — recording which kinds failed and why. This gives agents a recoverable, accurate signal ("some metric kinds failed") rather than a misleading empty-data path.

connectionId,
clickhouse_settings: {
max_execution_time: METRIC_ATTR_KEYS_MAX_EXEC_SECONDS,
timeout_overflow_mode: 'break',

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Callers can't distinguish "not sampled" from "no values". A per-key sampled: boolean/truncated: boolean would allow agents to know to adjust their query

@@ -92,14 +109,28 @@
}
const { startDate, endDate } = timeRange;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea(non-blocking): probably OOS but I waned a better historic graph

It would be helpful if the timeseries tool returned per-bucket rows with a timestamp column for metric sources using increase + groupBy, rather than collapsing to per-group aggregates with no ts column — that way the bucket breakdown is actually visible in the output.

@brandon-pereira

Copy link
Copy Markdown
Member

Overall - metrics support works very good! It fell back to sql a few times, everytime it did I asked my agent to explain why (findings above). I tested it on my local instance and a larger internal instance with tons of metrics, it worked 🥳

dhable added 4 commits June 12, 2026 15:15
getSource is called from REST routers, external-api v2, and MCP tools
alike — the inline comments referenced only the MCP surface. Reword to
caller-neutral language per review feedback.

Refs: #2437 (comment)
… to hints[]

Two problems with the query-tool hint plumbing:

1. annotateIncreaseTopNHint emitted the 'results may be truncated'
   disclaimer whenever increase + groupBy returned any data — even 3
   groups — making it a false alarm rather than a detection.
2. Both annotateIncreaseTopNHint and the single-bucket hint in
   timeseries.ts assigned to parsed.hint directly, so whichever ran
   second silently overwrote the other.

Fixes:
- Add countDistinctGroups(): resolves comma-separated groupBy segments
  against the result row keys (bracket-aware split) and counts distinct
  combinations. The truncation hint now only fires when the distinct
  group count reaches INCREASE_TOP_N_CAP. When no segment resolves to a
  row key (e.g. bracket-syntax map attrs render as arrayElement(...)
  aliases), skip the hint instead of crying wolf.
- Replace the single hint: string with hints: string[] across the
  query-tool response shape (annotateIncreaseTopNHint, the
  single-bucket hint, and formatQueryResult's empty-result hint) via a
  shared appendHint() helper, so multiple advisories coexist.

Tests: rewrite the annotateIncreaseTopNHint block — at/below cap,
distinct-groups-vs-raw-rows, multi-column groupBy combinations,
unresolvable groupBy, and append-not-overwrite coexistence.

Refs: #2437 (comment)
…ist_metrics

Discovery sub-query failures were silently swallowed into empty
returns (fetchAttributeKeys / sampleAttributeValues catch → {};
listMetrics per-kind catch → continue). Transient ClickHouse errors
then surfaced as the 'No data found — try widening startTime/endTime'
hint, so agents could not distinguish 'genuinely empty' from 'fetch
failed' and followed incorrect recovery steps.

- fetchAttributeKeys / sampleAttributeValues now return a typed
  { ok, data } | { ok: false, error } result. Errors are sanitised
  (single line, capped at 200 chars).
- describe_metric collects failures into partialFailure[{stage,error}]
  on the response and suppresses the no-data hint when any stage
  failed — empty attributeKeys can't be trusted as 'no data' then.
  A retry-oriented hint is emitted instead.
- list_metrics collects per-kind fetch failures into
  partialFailure[{kind,error}] and suppresses the 'No metrics matched'
  hint when failures explain the empty result.

Tests: new cases pointing a kind's metricTables entry at the logs
table (exists, but no MetricName/TimeUnix columns) to force the
discovery query to throw; assert partialFailure is present and the
misleading widen-the-window hints are absent. Happy-path test asserts
partialFailure stays undefined.

Refs: #2437 (comment)
sampleAttributeValues caps value sampling at MAX_ATTR_KEYS_TO_SAMPLE
(12) keys and only emits keys with non-empty samples — so a key absent
from attributeValues was indistinguishable between 'skipped by the
cap' and 'sampled but empty in the window'. Agents could not tell when
to re-query a key directly.

Add attributeValuesMeta to each kindDetail:
  - sampledKeys:   display names actually queried for values
  - truncatedKeys: display names skipped because the cap fired

A key in truncatedKeys was never queried; a key in sampledKeys but
missing from attributeValues is genuinely empty in the window. Tool
description documents the semantics so agents know to query truncated
keys directly.

Tests: 15-key metric → 12 sampled + 3 truncated, disjoint sets
covering all keys; small metric → truncatedKeys empty.

Refs: #2437 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants