Description
Environment
| Field | Value |
|---|---|
| PCP version | 7.0.3 (pmproxy is bundled, same version) |
| OS / distribution | Fedora Linux 43 (Container Image) |
| Kernel | 6.11.3-200.fc40.aarch64 |
| Redis version | 7.4.7 |
| Deployment | podman container — quay.io/performancecopilot/pcp:latest with systemd as PID 1 |
Summary
GET /series/values exhibits three distinct failure tiers when an unsupported time format is passed as start or finish. All are triggered by ordinary HTTP query parameters — no authentication is required.
| Tier | Severity | Formats that trigger it | Behaviour |
|---|---|---|---|
| 1 | Medium | -30s, -2m, -1h (small abbreviated units) | Malformed HTTP response (Content-Length mismatch); pmproxy process survives |
| 2 | High | -7d, -2w (large abbreviated units) | pmproxy process crashes; systemd restarts it within a few seconds |
| 3 | Critical | 2024-01-15T10:30:00Z (ISO-8601 + Z suffix) | pmproxy process crashes and does not self-recover; service is down until operator intervenes |
The Tier 3 crash is an unauthenticated denial-of-service: any HTTP client that can reach pmproxy can bring it down with a single request containing a Z-suffix ISO-8601 timestamp — a format that is valid RFC 3339 and expected by virtually every REST API client.
Discovered while building pmmcp, an MCP server wrapping pmproxy.
Steps to Reproduce
1. Start pmproxy with Redis backend
# docker-compose.yml or podman compose
podman compose up -d
# Wait ~30 s for pmproxy to ingest initial metrics, then verify:
curl -s "http://localhost:44322/series/sources?match=*"
2. Get a valid series ID
SERIES=$(curl -s "http://localhost:44322/series/query?expr=kernel.all.cpu.user" \
| python3 -c "import sys, json; d=json.load(sys.stdin); print(d[0] if d else '')")
echo "Series: $SERIES"
3. Trigger Tier 1 — malformed HTTP response (pmproxy survives)
# All three cause a Content-Length mismatch; pmproxy stays up
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-30s&finish=now&samples=10"
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-2m&finish=now&samples=10"
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-1h&finish=now&samples=10"
Expected error from curl: curl: (56) Recv failure or truncated body with Content-Length mismatch.
4. Trigger Tier 2 — crash + systemd recovery
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-7d&finish=now&samples=10"
# pmproxy crashes; watch systemd restart it:
# journalctl -u pmproxy -f (or: podman logs pmmcp-pcp-1 -f)
# Verify it came back:
sleep 5 && curl -s "http://localhost:44322/series/sources?match=*"
5. Trigger Tier 3 — hard crash, no recovery (⚠ takes pmproxy down)
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=2024-01-15T10:30:00Z&finish=now&samples=10"
# pmproxy crashes and does NOT recover within ~10 s:
sleep 10 && curl -s "http://localhost:44322/series/sources?match=*"
# → connection refused
6. Run the full automated test matrix
Save the reproduce script below as reproduce_series_values_crash.py, then:
pip install httpx
PMPROXY_URL=http://localhost:44322 python reproduce_series_values_crash.py
Actual Behaviour
Full output from reproduce_series_values_crash.py run against quay.io/performancecopilot/pcp:latest:
pmproxy URL: http://localhost:44322
Checking pmproxy health...
pmproxy is healthy.
Discovering a series ID to use for tests...
Discovered series for 'kernel.all.cpu.user': e9dc3ea00548a0f4...
Using series: e9dc3ea00548a0f4abd8d9fa8a71675a1f7a5513
Using finish: now
Running test matrix...
Category Format HTTP Result Error Type Alive?
------------------------------------------------------------------------------------------------------------------------------
Abbreviated -30s N/A BAD_RESPONSE RemoteProtocolError yes
Abbreviated -2m N/A BAD_RESPONSE RemoteProtocolError yes
Abbreviated -1h N/A BAD_RESPONSE RemoteProtocolError yes
[health] pmproxy appears DOWN after '-7d' — waiting for recovery...
Abbreviated -7d N/A *** CRASHED_AND_RECOVERED *** RemoteProtocolError yes
note: pmproxy crashed and was restarted by systemd
[health] pmproxy appears DOWN after '-2w' — waiting for recovery...
Abbreviated -2w N/A *** CRASHED_AND_RECOVERED *** RemoteProtocolError yes
note: pmproxy crashed and was restarted by systemd
Full PCP relative -30seconds 200 OK - yes
note: 2 data point(s) returned
Full PCP relative -2minutes 200 OK - yes
note: 10 data point(s) returned
Full PCP relative -1hours 200 OK - yes
note: 10 data point(s) returned
Full PCP relative -7days 200 OK - yes
note: 10 data point(s) returned
ISO-8601 no TZ 2024-01-15T10:30:00 200 OK - yes
note: 10 data point(s) returned
[health] pmproxy appears DOWN after '2024-01-15T10:30:00Z' — waiting for recovery...
[health] pmproxy down, retrying in 2.0s... (attempt 1/3)
[health] pmproxy down, retrying in 2.0s... (attempt 2/3)
ISO-8601 with Z 2024-01-15T10:30:00Z N/A *** CRASH *** RemoteProtocolError NO
note: pmproxy crashed and did not recover within ~8s
ISO-8601 with +11:00 2024-01-15T10:30:00+11:00 - SKIPPED - N/A
note: pmproxy not healthy before request
ISO-8601 with +00:00 2024-01-15T10:30:00+00:00 - SKIPPED - N/A
Special now (as start) - SKIPPED - N/A
Unix timestamp 1705310400 - SKIPPED - N/A
================================================================================
SUMMARY
================================================================================
OK: 5
BAD_RESPONSE: 3
CRASH: 1
CRASHED_AND_RECOVERED: 2
SKIPPED: 4
BAD_RESPONSE formats (pmproxy misbehaves but survives):
'-30s' — RemoteProtocolError: Server disconnected without sending a response.
'-2m' — RemoteProtocolError: Server disconnected without sending a response.
'-1h' — RemoteProtocolError: Server disconnected without sending a response.
CRASH / CRASH+RECOVERED formats (pmproxy process died):
'2024-01-15T10:30:00Z' — CRASH: RemoteProtocolError: Server disconnected without sending a response.
'-7d' — CRASHED_AND_RECOVERED: RemoteProtocolError: Server disconnected without sending a response.
'-2w' — CRASHED_AND_RECOVERED: RemoteProtocolError: Server disconnected without sending a response.
The four SKIPPED rows (+11:00, +00:00, now as start, Unix epoch) could not be tested because the Z-suffix crash left pmproxy unresponsive beyond the 8-second retry window. Their behaviour is unknown and should be verified separately.
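The result labels in the output above can be read as the outcome of polling a health probe after each failed request. The following is a minimal sketch of that classification, not the actual logic of reproduce_series_values_crash.py: the function name `classify_after_failure`, the `is_alive` probe, and the retry defaults are all hypothetical and chosen only to mirror the labels in the log.

```python
import time
from typing import Callable


def classify_after_failure(
    is_alive: Callable[[], bool],
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Classify a failed request by probing server health afterwards.

    `is_alive` is a hypothetical probe (for example, a GET against
    /series/sources that returns True on any well-formed response).
    """
    if is_alive():
        # The request failed but the server never went away.
        return "BAD_RESPONSE"
    for _ in range(attempts):
        sleep(delay)
        if is_alive():
            # The server was down and came back within the retry window.
            return "CRASHED_AND_RECOVERED"
    # Still down after every retry: later test rows cannot be run.
    return "CRASH"
```

With a classification like this, any row that follows a CRASH has to be skipped, which is why the remaining formats need a separate run against a freshly started pmproxy.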
Expected Behaviour
GET /series/values should:
- Either accept all common time formats (abbreviated units, ISO-8601 with TZ, epoch seconds), or
- Return a well-formed 400 Bad Request with a JSON error body for any format it does not support.
Under no circumstances should a query parameter value cause a malformed HTTP response, a process crash, or a denial of service.
Complete Format Test Matrix
| Category | start value | Result |
|---|---|---|
| Abbreviated | -30s | BAD_RESPONSE — Content-Length mismatch, pmproxy survives |
| Abbreviated | -2m | BAD_RESPONSE — Content-Length mismatch, pmproxy survives |
| Abbreviated | -1h | BAD_RESPONSE — Content-Length mismatch, pmproxy survives |
| Abbreviated | -7d | CRASHED_AND_RECOVERED — process crash, systemd restarts |
| Abbreviated | -2w | CRASHED_AND_RECOVERED — process crash, systemd restarts |
| Full PCP relative | -30seconds | ✅ OK — 2 data points |
| Full PCP relative | -2minutes | ✅ OK — 10 data points |
| Full PCP relative | -1hours | ✅ OK — 10 data points |
| Full PCP relative | -7days | ✅ OK — 10 data points |
| ISO-8601 no TZ | 2024-01-15T10:30:00 | ✅ OK — 10 data points |
| ISO-8601 with Z | 2024-01-15T10:30:00Z | CRASH — hard crash, no recovery |
| ISO-8601 with +11:00 | 2024-01-15T10:30:00+11:00 | ⚠ SKIPPED — untested |
| ISO-8601 with +00:00 | 2024-01-15T10:30:00+00:00 | ⚠ SKIPPED — untested |
| Special | now (as start) | ⚠ SKIPPED — untested |
| Unix epoch | 1705310400 | ⚠ SKIPPED — untested |
Observed pattern in abbreviated units: The threshold between Tier 1 (malformed response) and Tier 2 (crash) appears correlated with time window size. Short windows (-30s, -2m, -1h) cause a response encoding error but pmproxy survives; longer windows (-7d, -2w) crash the process. This suggests an integer overflow or allocation failure in time arithmetic when the resulting epoch delta is large.
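To make the window sizes concrete, the short sketch below converts each abbreviated value into seconds and lists it next to the observed result. The helper `window_seconds` and the `OBSERVED` mapping are illustrative names, and the results are copied from the test matrix above. The sketch says nothing about how pmproxy itself parses these values.

```python
# Seconds per abbreviated unit; weeks are taken as 7 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

# Results copied from the test matrix in this report.
OBSERVED = {
    "-30s": "BAD_RESPONSE",
    "-2m": "BAD_RESPONSE",
    "-1h": "BAD_RESPONSE",
    "-7d": "CRASHED_AND_RECOVERED",
    "-2w": "CRASHED_AND_RECOVERED",
}


def window_seconds(expr: str) -> int:
    """Return the size in seconds of a value such as '-7d'."""
    count, unit = int(expr[1:-1]), expr[-1]
    return count * UNIT_SECONDS[unit]


for expr in sorted(OBSERVED, key=window_seconds):
    print(f"{expr:>5}  {window_seconds(expr):>8} s  {OBSERVED[expr]}")
```

The largest surviving window is 3600 seconds and the smallest crashing one is 604800 seconds, so the tested values only place the threshold somewhere between one hour and seven days. Values inside that gap (for example -12h or -1d) were not part of the matrix.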
Client-Side Error Message
All failure tiers produce the same client-visible error:
RemoteProtocolError: Server disconnected without sending a response.
For Tier 1, this is a Content-Length mismatch — pmproxy writes a Content-Length header for N bytes, then closes the connection after sending fewer bytes. The HTTP framing is broken before the body is complete.
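The mismatch condition described above amounts to comparing the declared length with the bytes that actually arrived. A minimal sketch of that check, assuming the headers and the partial body have already been captured (for example from curl -v output); the function name `body_is_truncated` is hypothetical:

```python
from typing import Mapping


def body_is_truncated(headers: Mapping[str, str], body: bytes) -> bool:
    """Return True when fewer bytes arrived than Content-Length promised."""
    declared = None
    for name, value in headers.items():
        # Header names are case-insensitive in HTTP.
        if name.lower() == "content-length":
            declared = int(value)
    if declared is None:
        # No declared length (e.g. chunked encoding): nothing to compare.
        return False
    return len(body) < declared
```

Most HTTP client libraries perform an equivalent check internally and surface it as a protocol error, so application code rarely sees the partial body at all.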
Workaround (client-side)
Safe formats confirmed by testing: full PCP relative (-2minutes, -1hours, -7days) and ISO-8601 without TZ designator (2024-01-15T10:30:00).
Avoid: abbreviated units (-2m, -1h, etc.) and ISO-8601 with a Z suffix. ISO-8601 with a numeric offset (+00:00, +11:00) could not be tested, so treat it as unsafe until verified.
In pmmcp we added a pre-call expansion step:
import re

_SHORT_UNIT_MAP = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}

def _expand_time_units(expr: str) -> str:
    """Convert abbreviated units to full forms before passing to pmproxy."""
    if expr in ("now", ""):
        return expr
    # Match a negative integer followed by a single-letter unit, e.g. "-7d".
    match = re.fullmatch(r"(-\d+)\s*([smhdw])$", expr.strip())
    if match:
        n, unit = match.group(1), match.group(2)
        return f"{n}{_SHORT_UNIT_MAP[unit]}"
    return expr
See: pmmcp commit 41bb7ab
For absolute timestamps that carry a timezone, convert to UTC first and then strip the TZ designator before sending (e.g. send 2024-01-15T10:30:00 instead of 2024-01-15T10:30:00Z).
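The timestamp conversion can be sketched with the standard library alone. This assumes pmproxy interprets a timestamp without a TZ designator as UTC, which this report has not verified. The function name `to_naive_utc` is illustrative and is not part of pmmcp.

```python
from datetime import datetime, timezone


def to_naive_utc(timestamp: str) -> str:
    """Convert an ISO-8601 timestamp to a TZ-less string expressed in UTC.

    Assumption: the server treats TZ-less timestamps as UTC.
    """
    text = timestamp.strip()
    if text.endswith(("Z", "z")):
        # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is not None:
        # Shift to UTC, then drop the offset so no designator is emitted.
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return parsed.isoformat(timespec="seconds")


print(to_naive_utc("2024-01-15T10:30:00Z"))       # 2024-01-15T10:30:00
print(to_naive_utc("2024-01-15T10:30:00+11:00"))  # 2024-01-14T23:30:00
```

Timestamps that already lack a designator pass through unchanged, so the helper can be applied to every absolute start or finish value before the request is built.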
Suggested Fix Directions
- Fix the hard crash on Z-suffix ISO-8601 (highest priority / DoS): 2024-01-15T10:30:00Z is valid RFC 3339. This crash likely originates from an unguarded pointer dereference or uncaught exception in the time parser when it encounters the Z terminator.
- Fix the crash on large abbreviated units: -7d and -2w crash while -30s / -2m / -1h do not. The difference is the scale of the resulting epoch offset. Possible integer or buffer overflow in time arithmetic.
- Fix the Content-Length mismatch on small abbreviated units: pmproxy is committing to a Content-Length then failing to write the promised bytes. The failure path should produce a 400 error response with a JSON body, not a broken HTTP response.
- Add input validation at the HTTP handler boundary: Validate start / finish before touching the time parser. An unrecognised format should return {"success": false, "message": "invalid time format: ..."} with HTTP 400. This protects all three tiers regardless of the parser bug.
- Document accepted time formats: man pmproxy and the REST API reference do not specify which time format(s) /series/values accepts. ISO-8601 with timezone is a reasonable default expectation. If unsupported, it should be explicitly documented.
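The boundary validation suggested above could look roughly like the sketch below. It is written in Python to show the logic only and is not a patch for pmproxy. The allow-list covers just the formats this report confirmed as working, and the names `_ACCEPTED` and `check_time_param` are hypothetical.

```python
import json
import re
from typing import Optional, Tuple

# Hypothetical allow-list built only from formats confirmed OK in this report:
# "now", full-word relative units, and ISO-8601 without a TZ designator.
_ACCEPTED = re.compile(
    r"now"
    r"|-\d+(seconds|minutes|hours|days)"
    r"|\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
)


def check_time_param(name: str, value: str) -> Optional[Tuple[int, str]]:
    """Return None if the value is accepted, else (status, JSON body)."""
    if _ACCEPTED.fullmatch(value):
        return None
    body = json.dumps(
        {"success": False, "message": f"invalid time format: {name}={value}"}
    )
    return 400, body


print(check_time_param("start", "-7days"))  # None
print(check_time_param("start", "-7d"))     # (400, '{"success": false, ...}')
```

Rejecting unknown formats before the time parser runs means the client always receives a complete, well-formed response, even while the underlying parser bugs remain unfixed.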
Reproduce Script
See reproduce_series_values_crash.py
Related
- pmmcp client-side workaround commit: tallpsmith/pmmcp@41bb7ab
- pmproxy REST API reference: https://man7.org/linux/man-pages/man1/pmproxy.1.html
- PCP series time grammar: man pmseries