Add TOOLS_DOMAIN perf metrics job and Prometheus verification#60
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Reviewer's GuideAdds an end-to-end performance metrics job that auto-resolves Pushgateway/Prometheus endpoints (including TOOLS_DOMAIN support), deduplicates stress metrics per scenario with a publish timestamp gauge, and introduces a Prometheus verification script plus CLI/Makefile wiring and unit tests for the new behavior. Sequence diagram for perf-test end-to-end stress metrics publication and verificationsequenceDiagram
actor Dev as Developer
participant Make as Makefile_perf-test
participant Cmd as scripts_index_js
participant Npm as npm_runner
participant Perf as run_perf_metrics_job_js
participant StressTest as npm_test_stress
participant StressMetrics as npm_stress_metrics
participant Pub as publish_stress_metrics_js
participant Push as Pushgateway
participant Ver as verify_prometheus_metrics_js
participant Prom as Prometheus
Dev->>Make: make perf-test
Make->>Cmd: node scripts/index.js perf-test
Cmd->>Npm: run npm script perf:test
Npm->>Perf: node scripts/run-perf-metrics-job.js
Perf->>Perf: resolveMonitoringEndpoints
Perf->>Perf: buildDefaultInstanceName
Perf->>Perf: set env PUSHGATEWAY_URL
Perf->>Perf: set env PROMETHEUS_URL
Perf->>Perf: set env PUSHGATEWAY_JOB
Perf->>Perf: set env PUSHGATEWAY_INSTANCE
Perf->>Perf: set env STRESS_METRICS_PUBLISH_TS_SECONDS
Perf->>Perf: set env PROMETHEUS_MIN_PUBLISH_TS_SECONDS
Perf->>StressTest: run "npm run test:stress" with env
StressTest-->>Perf: stress benchmarks completed
Perf->>StressMetrics: run "npm run stress:metrics" with env
StressMetrics->>Pub: node scripts/publish-stress-metrics.js
Pub->>Pub: normalizeBenchmarkRecord
Pub->>Pub: selectLatestRecordPerScenario
Pub->>Pub: buildPrometheusPayload
Pub->>Push: push metrics with job and instance labels
Pub-->>Perf: return success
Perf->>Ver: waitForStressMetrics(options)
loop until timeout or metrics present
Ver->>Prom: HTTP GET /api/v1/query
Prom-->>Ver: JSON samples for publish_timestamp_seconds
Ver->>Ver: check values against minPublishTimestampSeconds
end
Ver-->>Perf: verified metrics for job and instance
Perf-->>Npm: return job result
Npm-->>Cmd: perf:test success
Cmd-->>Dev: "Performance metrics job completed successfully"
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
Review Summary by QodoAdd TOOLS_DOMAIN perf metrics job with Prometheus verification
WalkthroughsDescription• Add TOOLS_DOMAIN-aware perf metrics job orchestration with auto-resolved endpoints • Implement Prometheus verification script to validate pushed metrics visibility • Add per-scenario deduplication to prevent Pushgateway duplicate-series rejections • Wire new commands into script runner, help text, and Makefile • Add comprehensive unit tests for metrics, orchestration, and verification Diagramflowchart LR
A["TOOLS_DOMAIN env var"] -->|normalizeToolsDomain| B["Monitoring endpoints"]
B -->|resolveMonitoringEndpoints| C["Pushgateway + Prometheus URLs"]
D["Stress benchmarks"] -->|normalizeBenchmarkRecord| E["Benchmark records"]
E -->|selectLatestRecordPerScenario| F["Deduplicated records"]
F -->|buildPrometheusPayload| G["Prometheus text format"]
G -->|publish| C
C -->|queryPrometheus| H["Verify metrics visible"]
H -->|waitForStressMetrics| I["Confirmation"]
File Changes2. scripts/lib/utils.js
|
📝 WalkthroughWalkthroughAdds a perf/stress metrics workflow: new Make/npm targets and CLI aliases, two new Node scripts to run and verify Prometheus-published metrics, enhancements to metrics publishing, and corresponding unit tests and docs. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant CLI as "Make / npm / scripts/index.js"
participant Orchestrator as "run-perf-metrics-job"
participant Stress as "npm run test:stress"
participant Publish as "npm run stress:metrics / publish-stress-metrics"
participant Pushgateway as "Pushgateway"
participant Prometheus as "Prometheus"
participant Verifier as "verify-prometheus-metrics"
User->>CLI: make perf-test / npm run perf:test
CLI->>Orchestrator: start runPerfMetricsJob
Orchestrator->>Stress: run test:stress (env)
Stress->>Pushgateway: push raw metrics
Orchestrator->>Publish: run stress:metrics (env)
Publish->>Pushgateway: publish aggregated/latest metrics
Prometheus->>Pushgateway: scrape metrics
Orchestrator->>Verifier: waitForStressMetrics(prometheusUrl, job, instance)
Verifier->>Prometheus: query /api/v1/query
Prometheus-->>Verifier: return metric values
Verifier-->>Orchestrator: verification result
Orchestrator-->>CLI: success / failure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Mehdi-Bl, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes a robust and automated pipeline for collecting, publishing, and verifying performance metrics. By introducing a unified Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- There are several small helpers that are now duplicated across scripts (e.g.,
toFiniteNumber,trimTrailingSlashes, numeric/time helpers); consider extracting them into a shared utility module for consistency and easier maintenance. - In
waitForStressMetrics, the timeout error only reports that no samples matched; including which queries were attempted (job vs exported_job variants) in the error message would make diagnosing label/selector issues in Prometheus much easier.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There are several small helpers that are now duplicated across scripts (e.g., `toFiniteNumber`, `trimTrailingSlashes`, numeric/time helpers); consider extracting them into a shared utility module for consistency and easier maintenance.
- In `waitForStressMetrics`, the timeout error only reports that no samples matched; including which queries were attempted (job vs exported_job variants) in the error message would make diagnosing label/selector issues in Prometheus much easier.
## Individual Comments
### Comment 1
<location> `scripts/verify-prometheus-metrics.js:52-61` </location>
<code_context>
+function requestJson(endpointUrl) {
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Add a request timeout when calling Prometheus to avoid hanging indefinitely on network issues.
Since this request doesn’t set a timeout, a stalled connection could cause the script to hang even if `waitForStressMetrics` times out. Please add a `request.setTimeout(...)` that aborts the request and rejects the promise so the higher-level timeout can take effect under bad network conditions.
Suggested implementation:
```javascript
const REQUEST_TIMEOUT_MS = 10000; // 10 seconds
function requestJson(endpointUrl) {
```
```javascript
const request = client.request(
endpointUrl,
{ method: 'GET' },
(response) => {
```
```javascript
(response) => {
const responseChunks = [];
```
```javascript
response.on('data', (chunk) => responseChunks.push(chunk));
```
```javascript
response.on('end', () => {
const responseBody = Buffer.concat(responseChunks).toString('utf8');
```
```javascript
const client = endpointUrl.protocol === 'https:' ? https : http;
const request = client.request(
endpointUrl,
{ method: 'GET' },
(response) => {
const responseChunks = [];
response.on('data', (chunk) => responseChunks.push(chunk));
response.on('end', () => {
const responseBody = Buffer.concat(responseChunks).toString('utf8');
});
}
);
request.setTimeout(REQUEST_TIMEOUT_MS, () => {
request.abort();
reject(
new Error(
`Request to ${endpointUrl.href} timed out after ${REQUEST_TIMEOUT_MS} ms`
)
);
});
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@scripts/verify-prometheus-metrics.js`:
- Around line 52-87: The requestJson function lacks a per-request timeout so a
hung HTTP connection can block the whole script; fix by adding a configurable
timeout (e.g., const REQUEST_TIMEOUT_MS = 30_000) and applying it to the created
request (use request.setTimeout(REQUEST_TIMEOUT_MS, () => { request.destroy(new
Error(`Request timeout after ${REQUEST_TIMEOUT_MS}ms`)) }) or pass the timeout
option) and also handle the request 'error'/'timeout' events to reject with a
clear error; update the rejection message produced by requestJson to include
that it was a timeout when applicable and ensure any response listeners are
cleaned up when aborting the request.
🧹 Nitpick comments (3)
scripts/run-perf-metrics-job.js (1)
11-14:toFiniteNumberis duplicated across three scripts.This helper now exists in
publish-stress-metrics.js,verify-prometheus-metrics.js, and this file. Consider extracting it into a shared module (e.g.,scripts/lib/metric-utils.js) to keep the tooling DRY.tests/unit/scripts/verify-prometheus-metrics.test.js (1)
20-43: Consider asserting the resolved metric value in the success test.The test verifies
result.querycontainsexported_job=and thatqueryFnwas called, but doesn't assert on the returned metric values (e.g., that the resolved timestamp1_700_000_005is captured in the result). Adding an assertion likeexpect(result.values).toContain(1_700_000_005)(or equivalent for the result shape) would strengthen confidence that the return contract is correct.tests/unit/scripts/perf-metrics-job.test.js (1)
71-75: Consider testing with a malformedTOOLS_DOMAIN(not just missing).The fast-fail test passes an empty
env, which validates the "no domain at all" path. You might also want a case whereTOOLS_DOMAINis set but invalid (e.g.,{ TOOLS_DOMAIN: 'http://' }) to confirm the orchestrator surfaces the normalization error rather than silently producing broken URLs. This aligns with thenormalizeToolsDomainrejection test on line 12 but at the orchestration level.
Code Review by Qodo
✅ 1.
|
| function requestJson(endpointUrl) { | ||
| return new Promise((resolve, reject) => { | ||
| const client = endpointUrl.protocol === 'https:' ? https : http; | ||
| const request = client.request( | ||
| endpointUrl, | ||
| { method: 'GET' }, | ||
| (response) => { | ||
| const responseChunks = []; | ||
| response.on('data', (chunk) => responseChunks.push(chunk)); | ||
| response.on('end', () => { | ||
| const responseBody = Buffer.concat(responseChunks).toString('utf8'); | ||
| if (!response.statusCode || response.statusCode < 200 || response.statusCode >= 300) { | ||
| reject( | ||
| new Error( | ||
| `Prometheus returned ${response.statusCode || 'unknown status'}${ | ||
| responseBody ? `: ${responseBody}` : '' | ||
| }` | ||
| ) | ||
| ); | ||
| return; | ||
| } | ||
|
|
||
| try { | ||
| resolve(JSON.parse(responseBody)); | ||
| } catch (error) { | ||
| const safeMessage = error instanceof Error ? error.message : String(error); | ||
| reject(new Error(`Failed to parse Prometheus response: ${safeMessage}`)); | ||
| } | ||
| }); | ||
| } | ||
| ); | ||
|
|
||
| request.on('error', (error) => reject(error)); | ||
| request.end(); | ||
| }); |
There was a problem hiding this comment.
1. requestjson() lacks timeout 📘 Rule violation ⛯ Reliability
The Prometheus HTTP request has no socket/request timeout, so a stalled connection can hang longer than the intended verification timeoutMs. This violates the requirement to handle external dependency failure modes robustly.
Agent Prompt
## Issue description
`scripts/verify-prometheus-metrics.js` makes HTTP(S) requests without any explicit timeout/abort handling. A stalled TCP connection or slow server can cause the verification flow to hang longer than `timeoutMs`.
## Issue Context
This script is used to verify that pushed metrics appear in Prometheus. It should degrade gracefully when Prometheus is unreachable or slow.
## Fix Focus Areas
- scripts/verify-prometheus-metrics.js[52-86]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| if (toolsDomain) { | ||
| logger.log(`Resolved monitoring endpoints from TOOLS_DOMAIN=${toolsDomain}`); | ||
| } | ||
| logger.log(`Pushgateway endpoint: ${pushgatewayUrl}`); | ||
| logger.log(`Prometheus endpoint: ${prometheusUrl}`); | ||
| logger.log(`Pushgateway job="${jobName}" instance="${instanceName}"`); | ||
|
|
There was a problem hiding this comment.
2. Logs print raw endpoints 📘 Rule violation ⛨ Security
The perf job logs full pushgatewayUrl/prometheusUrl, which may include credentials (userinfo) or tokens (query params) if configured that way in env. This violates secure logging requirements by potentially exposing secrets and also produces unstructured logs.
Agent Prompt
## Issue description
The perf metrics job logs raw endpoint URLs that come from environment variables; these URLs may contain secrets (e.g., basic auth userinfo or query tokens). This creates a risk of leaking credentials into logs.
## Issue Context
`pushgatewayUrl` / `prometheusUrl` can be provided via `PUSHGATEWAY_URL` / `PROMETHEUS_URL`. Some deployments embed credentials in URLs.
## Fix Focus Areas
- scripts/run-perf-metrics-job.js[95-101]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Code Review
This pull request introduces an automated performance metrics pipeline, including stress test execution, metric publishing to Pushgateway, and verification in Prometheus. The changes are well-structured, testable, and include comprehensive unit tests and documentation updates. I've identified a potential bug in endpoint resolution and a few areas where the code can be made more readable and maintainable. Overall, this is a solid contribution.
| } | ||
|
|
||
| return value.slice(startIndex, endIndex); | ||
| } |
There was a problem hiding this comment.
Using parsed.host can unintentionally include the port number from the TOOLS_DOMAIN URL. For example, if TOOLS_DOMAIN is example.com:8443, parsed.host would be example.com:8443. This would lead to incorrectly constructed URLs like https://pushgateway.example.com:8443, which is likely not the desired behavior. You should use parsed.hostname instead, which contains the host name without the port, to ensure correct endpoint resolution.
| } | |
| const normalizedHost = parsed.hostname.replace(/^\.+/, '').replace(/\.+$/, '').trim().toLowerCase(); |
| async function runPerfMetricsJob(options = {}) { | ||
| const env = options.env || process.env; | ||
| const nowFn = options.nowFn || Date.now; | ||
| const hostName = options.hostName || os.hostname(); |
There was a problem hiding this comment.
This line is functionally correct but a bit dense and hard to read. To improve readability and maintainability, you could rewrite this to more clearly express the intent, which is to default to true unless explicitly set to false.
| const hostName = options.hostName || os.hostname(); | |
| const strictMode = String((env.PUSHGATEWAY_STRICT || 'true').trim().toLowerCase() !== 'false'); |
| function trimTrailingSlashes(value) { | ||
| let lastIndex = value.length; | ||
| while (lastIndex > 0 && value[lastIndex - 1] === '/') { | ||
| lastIndex -= 1; | ||
| } | ||
|
|
||
| return value.slice(0, lastIndex); | ||
| } |
There was a problem hiding this comment.
This function can be simplified and made more concise by using a regular expression to remove trailing slashes. This improves readability and is a more idiomatic way to perform this kind of string manipulation in JavaScript.
function trimTrailingSlashes(value) {
return String(value).replace(/\/+$/, '');
}| function extractNumericValues(resultSet) { | ||
| if (!Array.isArray(resultSet)) { | ||
| return []; | ||
| } | ||
|
|
||
| const values = []; | ||
| for (const resultItem of resultSet) { | ||
| if (Array.isArray(resultItem?.value) && resultItem.value.length >= 2) { | ||
| const parsedValue = toFiniteNumber(resultItem.value[1]); | ||
| if (parsedValue !== null) { | ||
| values.push(parsedValue); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return values; | ||
| } |
There was a problem hiding this comment.
This function can be written more concisely using modern JavaScript functional array methods like flatMap. This would make the code more declarative and easier to read and maintain.
function extractNumericValues(resultSet) {
return (resultSet || []).flatMap((item) => {
const value = item?.value?.[1];
const num = toFiniteNumber(value);
return num === null ? [] : [num];
});
}
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@scripts/run-perf-metrics-job.js`:
- Around line 134-135: The timeout and poll interval variables (timeoutMs and
pollIntervalMs) accept negative values because toFiniteNumber returns negatives
and the `||` fallback only covers falsy zeros; change the assignment to clamp to
a positive number by validating the parsed value from
toFiniteNumber(env.PROMETHEUS_VERIFY_TIMEOUT_MS) and
toFiniteNumber(env.PROMETHEUS_VERIFY_POLL_INTERVAL_MS) and using the default
only if the parsed value is not a finite positive (> 0) number (e.g., replace
the current expressions with a small helper or inline check that uses the parsed
value if > 0, otherwise uses 60_000 and 5_000 respectively).
🧹 Nitpick comments (3)
scripts/run-perf-metrics-job.js (3)
11-14:toFiniteNumberis duplicated fromverify-prometheus-metrics.js.This identical function already exists in
scripts/verify-prometheus-metrics.js(line 11). Consider extracting it into a shared utility module to avoid drift between the two copies.#!/bin/bash # Verify the duplication ast-grep --pattern 'function toFiniteNumber($_) { $$$ }'
31-50:normalizeEndpointUrlandnormalizeToolsDomainshare similar URL-parsing boilerplate.Both functions trim input, prepend
https://if missing, and parse withnew URL(...). This is a minor duplication — could be consolidated with a shared parse step. Not urgent given the functions serve different purposes (one returns full URL, the other extracts hostname).Also applies to: 66-81
194-200: Stack trace is discarded on failure.Only
error.messageis logged. For CI debugging, the full stack can be valuable. Consider logging the stack at a debug/verbose level or when a--verboseflag is set.Optional: include stack trace
runPerfMetricsJob().catch((error) => { const safeMessage = error instanceof Error ? error.message : String(error); console.error(`Performance metrics job failed: ${safeMessage}`); + if (error instanceof Error && error.stack) { + console.error(error.stack); + } process.exit(1); });
| const timeoutMs = toFiniteNumber(env.PROMETHEUS_VERIFY_TIMEOUT_MS) || 60_000; | ||
| const pollIntervalMs = toFiniteNumber(env.PROMETHEUS_VERIFY_POLL_INTERVAL_MS) || 5_000; |
There was a problem hiding this comment.
Negative environment values would be accepted as valid timeouts.
toFiniteNumber converts any finite number (including negatives), and the || default fallback only triggers on 0 or null. If someone sets PROMETHEUS_VERIFY_TIMEOUT_MS=-1, it would be used as-is, which could cause unexpected behavior in the polling loop.
Proposed fix: clamp to positive values
- const timeoutMs = toFiniteNumber(env.PROMETHEUS_VERIFY_TIMEOUT_MS) || 60_000;
- const pollIntervalMs = toFiniteNumber(env.PROMETHEUS_VERIFY_POLL_INTERVAL_MS) || 5_000;
+ const rawTimeout = toFiniteNumber(env.PROMETHEUS_VERIFY_TIMEOUT_MS);
+ const timeoutMs = rawTimeout != null && rawTimeout > 0 ? rawTimeout : 60_000;
+ const rawPoll = toFiniteNumber(env.PROMETHEUS_VERIFY_POLL_INTERVAL_MS);
+ const pollIntervalMs = rawPoll != null && rawPoll > 0 ? rawPoll : 5_000;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const timeoutMs = toFiniteNumber(env.PROMETHEUS_VERIFY_TIMEOUT_MS) || 60_000; | |
| const pollIntervalMs = toFiniteNumber(env.PROMETHEUS_VERIFY_POLL_INTERVAL_MS) || 5_000; | |
| const rawTimeout = toFiniteNumber(env.PROMETHEUS_VERIFY_TIMEOUT_MS); | |
| const timeoutMs = rawTimeout != null && rawTimeout > 0 ? rawTimeout : 60_000; | |
| const rawPoll = toFiniteNumber(env.PROMETHEUS_VERIFY_POLL_INTERVAL_MS); | |
| const pollIntervalMs = rawPoll != null && rawPoll > 0 ? rawPoll : 5_000; |
🤖 Prompt for AI Agents
In `@scripts/run-perf-metrics-job.js` around lines 134 - 135, The timeout and poll
interval variables (timeoutMs and pollIntervalMs) accept negative values because
toFiniteNumber returns negatives and the `||` fallback only covers falsy zeros;
change the assignment to clamp to a positive number by validating the parsed
value from toFiniteNumber(env.PROMETHEUS_VERIFY_TIMEOUT_MS) and
toFiniteNumber(env.PROMETHEUS_VERIFY_POLL_INTERVAL_MS) and using the default
only if the parsed value is not a finite positive (> 0) number (e.g., replace
the current expressions with a small helper or inline check that uses the parsed
value if > 0, otherwise uses 60_000 and 5_000 respectively).



Summary
Validation
Summary by Sourcery
Introduce an automated performance metrics pipeline that publishes deduplicated stress benchmark metrics to Pushgateway and verifies their visibility in Prometheus.
New Features:
Enhancements:
Build:
Documentation:
Tests:
Summary by CodeRabbit
Release Notes
New Features
Tests
Documentation