pmproxy /series/values crashes on certain time formats — 3 severity tiers including unauthenticated DoS #2503

@tallpsmith

Environment

| Field | Value |
|---|---|
| PCP version | 7.0.3 (pmproxy is bundled, same version) |
| OS / distribution | Fedora Linux 43 (Container Image) |
| Kernel | 6.11.3-200.fc40.aarch64 |
| Redis version | 7.4.7 |
| Deployment | podman container: quay.io/performancecopilot/pcp:latest with systemd as PID 1 |

Summary

GET /series/values exhibits three distinct failure tiers when an unsupported time format is passed as start or finish. All are triggered by ordinary HTTP query parameters — no authentication is required.

| Tier | Severity | Formats that trigger it | Behaviour |
|---|---|---|---|
| 1 | Medium | -30s, -2m, -1h (small abbreviated units) | Malformed HTTP response (Content-Length mismatch); pmproxy process survives |
| 2 | High | -7d, -2w (large abbreviated units) | pmproxy process crashes; systemd restarts it within a few seconds |
| 3 | Critical | 2024-01-15T10:30:00Z (ISO-8601 + Z suffix) | pmproxy process crashes and does not self-recover; service is down until operator intervenes |

The Tier 3 crash is an unauthenticated denial-of-service: any HTTP client that can reach pmproxy can bring it down with a single request containing a Z-suffix ISO-8601 timestamp. This format is valid RFC 3339 and is what many REST API clients emit by default.

Discovered while building pmmcp, an MCP server wrapping pmproxy.


Steps to Reproduce

1. Start pmproxy with Redis backend

# docker-compose.yml or podman compose
podman compose up -d

# Wait ~30 s for pmproxy to ingest initial metrics, then verify:
curl -s "http://localhost:44322/series/sources?match=*"

2. Get a valid series ID

SERIES=$(curl -s "http://localhost:44322/series/query?expr=kernel.all.cpu.user" \
  | python3 -c "import sys, json; d=json.load(sys.stdin); print(d[0] if d else '')")
echo "Series: $SERIES"

3. Trigger Tier 1 — malformed HTTP response (pmproxy survives)

# All three cause a Content-Length mismatch; pmproxy stays up
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-30s&finish=now&samples=10"
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-2m&finish=now&samples=10"
curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-1h&finish=now&samples=10"

Expected result: curl reports either curl: (56) Recv failure or a truncated body whose length does not match the Content-Length header.

4. Trigger Tier 2 — crash + systemd recovery

curl -v "http://localhost:44322/series/values?series=${SERIES}&start=-7d&finish=now&samples=10"
# pmproxy crashes; watch systemd restart it:
# journalctl -u pmproxy -f   (or: podman logs pmmcp-pcp-1 -f)

# Verify it came back:
sleep 5 && curl -s "http://localhost:44322/series/sources?match=*"

5. Trigger Tier 3 — hard crash, no recovery (⚠ takes pmproxy down)

curl -v "http://localhost:44322/series/values?series=${SERIES}&start=2024-01-15T10:30:00Z&finish=now&samples=10"
# pmproxy crashes and does NOT recover within ~10 s:
sleep 10 && curl -s "http://localhost:44322/series/sources?match=*"
# → connection refused

6. Run the full automated test matrix

Save the reproduce script below as reproduce_series_values_crash.py, then:

pip install httpx
PMPROXY_URL=http://localhost:44322 python reproduce_series_values_crash.py
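
The three tiers in steps 3 to 5 are told apart only by what a health probe sees after the failing request. A minimal sketch of that classification follows; it is not the code from reproduce_series_values_crash.py. The function name, the `is_alive` callable, and the retry defaults are all assumptions made for illustration.

```python
import time
from typing import Callable


def classify_failure(
    is_alive: Callable[[], bool],
    retries: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Classify a failed request by probing pmproxy health afterwards.

    is_alive is a hypothetical probe, e.g. a GET on /series/sources
    that returns True on any well-formed HTTP response.
    """
    # Tier 1: the response was broken but the process still answers.
    if is_alive():
        return "BAD_RESPONSE"
    # Tier 2: the process went away but comes back within the retry window.
    for _ in range(retries):
        sleep(delay)
        if is_alive():
            return "CRASHED_AND_RECOVERED"
    # Tier 3: still unreachable after all retries.
    return "CRASH"
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; in practice the default `time.sleep` would be used.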

Actual Behaviour

Full output from reproduce_series_values_crash.py run against quay.io/performancecopilot/pcp:latest:

pmproxy URL: http://localhost:44322

Checking pmproxy health...
  pmproxy is healthy.

Discovering a series ID to use for tests...
  Discovered series for 'kernel.all.cpu.user': e9dc3ea00548a0f4...

Using series: e9dc3ea00548a0f4abd8d9fa8a71675a1f7a5513
Using finish: now

Running test matrix...

  Category                  Format                        HTTP      Result                    Error Type              Alive?
  ------------------------------------------------------------------------------------------------------------------------------
  Abbreviated               -30s                          N/A       BAD_RESPONSE              RemoteProtocolError     yes
  Abbreviated               -2m                           N/A       BAD_RESPONSE              RemoteProtocolError     yes
  Abbreviated               -1h                           N/A       BAD_RESPONSE              RemoteProtocolError     yes
    [health] pmproxy appears DOWN after '-7d' — waiting for recovery...
  Abbreviated               -7d                           N/A       *** CRASHED_AND_RECOVERED ***  RemoteProtocolError     yes
                                                                note: pmproxy crashed and was restarted by systemd
    [health] pmproxy appears DOWN after '-2w' — waiting for recovery...
  Abbreviated               -2w                           N/A       *** CRASHED_AND_RECOVERED ***  RemoteProtocolError     yes
                                                                note: pmproxy crashed and was restarted by systemd
  Full PCP relative         -30seconds                    200       OK                        -                       yes
                                                                note: 2 data point(s) returned
  Full PCP relative         -2minutes                     200       OK                        -                       yes
                                                                note: 10 data point(s) returned
  Full PCP relative         -1hours                       200       OK                        -                       yes
                                                                note: 10 data point(s) returned
  Full PCP relative         -7days                        200       OK                        -                       yes
                                                                note: 10 data point(s) returned
  ISO-8601 no TZ            2024-01-15T10:30:00           200       OK                        -                       yes
                                                                note: 10 data point(s) returned
    [health] pmproxy appears DOWN after '2024-01-15T10:30:00Z' — waiting for recovery...
    [health] pmproxy down, retrying in 2.0s... (attempt 1/3)
    [health] pmproxy down, retrying in 2.0s... (attempt 2/3)
  ISO-8601 with Z           2024-01-15T10:30:00Z          N/A       *** CRASH ***             RemoteProtocolError     NO
                                                                note: pmproxy crashed and did not recover within ~8s
  ISO-8601 with +11:00      2024-01-15T10:30:00+11:00     -         SKIPPED                   -                       N/A
                                                                note: pmproxy not healthy before request
  ISO-8601 with +00:00      2024-01-15T10:30:00+00:00     -         SKIPPED                   -                       N/A
  Special                   now (as start)                -         SKIPPED                   -                       N/A
  Unix timestamp            1705310400                    -         SKIPPED                   -                       N/A

================================================================================
SUMMARY
================================================================================
  OK:                    5
  BAD_RESPONSE:          3
  CRASH:                 1
  CRASHED_AND_RECOVERED: 2
  SKIPPED:               4

BAD_RESPONSE formats (pmproxy misbehaves but survives):
  '-30s' — RemoteProtocolError: Server disconnected without sending a response.
  '-2m' — RemoteProtocolError: Server disconnected without sending a response.
  '-1h' — RemoteProtocolError: Server disconnected without sending a response.

CRASH / CRASH+RECOVERED formats (pmproxy process died):
  '2024-01-15T10:30:00Z' — CRASH: RemoteProtocolError: Server disconnected without sending a response.
  '-7d' — CRASHED_AND_RECOVERED: RemoteProtocolError: Server disconnected without sending a response.
  '-2w' — CRASHED_AND_RECOVERED: RemoteProtocolError: Server disconnected without sending a response.

The four SKIPPED rows (+11:00, +00:00, now as start, Unix epoch) could not be tested because the Z-suffix crash left pmproxy unresponsive beyond the 8-second retry window. Their behaviour is unknown and should be verified separately.
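
When retesting the skipped rows by hand, note that a literal + in a query string is commonly decoded as a space, so an offset such as +11:00 should be percent-encoded to make sure the server actually receives the plus sign. A small sketch of building the request URL that way, assuming the same endpoint and parameters as the curl commands above (the helper name is hypothetical):

```python
from urllib.parse import urlencode


def series_values_url(base: str, series: str, start: str,
                      finish: str = "now", samples: int = 10) -> str:
    """Build a /series/values URL with every parameter percent-encoded."""
    # urlencode escapes "+" as %2B and ":" as %3A, so a numeric UTC
    # offset survives transport instead of turning into a space.
    query = urlencode({
        "series": series,
        "start": start,
        "finish": finish,
        "samples": samples,
    })
    return f"{base}/series/values?{query}"


url = series_values_url(
    "http://localhost:44322",
    "e9dc3ea00548a0f4abd8d9fa8a71675a1f7a5513",
    "2024-01-15T10:30:00+11:00",
)
print(url)
```

Whether pmproxy itself treats an unencoded + as a space was not checked here; encoding it removes the ambiguity either way.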


Expected Behaviour

GET /series/values should:

  1. Either accept all common time formats (abbreviated units, ISO-8601 with TZ, epoch seconds), or
  2. Return a well-formed 400 Bad Request with a JSON error body for any format it does not support.

Under no circumstances should a query parameter value cause a malformed HTTP response, a process crash, or a denial of service.


Complete Format Test Matrix

| Category | start value | Result |
|---|---|---|
| Abbreviated | -30s | BAD_RESPONSE: Content-Length mismatch, pmproxy survives |
| Abbreviated | -2m | BAD_RESPONSE: Content-Length mismatch, pmproxy survives |
| Abbreviated | -1h | BAD_RESPONSE: Content-Length mismatch, pmproxy survives |
| Abbreviated | -7d | CRASHED_AND_RECOVERED: process crash, systemd restarts |
| Abbreviated | -2w | CRASHED_AND_RECOVERED: process crash, systemd restarts |
| Full PCP relative | -30seconds | ✅ OK: 2 data points |
| Full PCP relative | -2minutes | ✅ OK: 10 data points |
| Full PCP relative | -1hours | ✅ OK: 10 data points |
| Full PCP relative | -7days | ✅ OK: 10 data points |
| ISO-8601 no TZ | 2024-01-15T10:30:00 | ✅ OK: 10 data points |
| ISO-8601 with Z | 2024-01-15T10:30:00Z | CRASH: hard crash, no recovery |
| ISO-8601 with +11:00 | 2024-01-15T10:30:00+11:00 | ⚠ SKIPPED: untested |
| ISO-8601 with +00:00 | 2024-01-15T10:30:00+00:00 | ⚠ SKIPPED: untested |
| Special | now (as start) | ⚠ SKIPPED: untested |
| Unix epoch | 1705310400 | ⚠ SKIPPED: untested |

Observed pattern in abbreviated units: The threshold between Tier 1 (malformed response) and Tier 2 (crash) appears to correlate with the size of the time window. Short windows (-30s, -2m, -1h) cause a response encoding error but pmproxy survives; longer windows (-7d, -2w) crash the process. This suggests an integer overflow or allocation failure in time arithmetic when the resulting epoch delta is large. Note, however, that the equivalent full form -7days returns data normally, so the fault is probably specific to how abbreviated units are parsed rather than to the window size alone.


Client-Side Error Message

All failure tiers produce the same client-visible error:

RemoteProtocolError: Server disconnected without sending a response.

For Tier 1, this is a Content-Length mismatch — pmproxy writes a Content-Length header for N bytes, then closes the connection after sending fewer bytes. The HTTP framing is broken before the body is complete.
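
To confirm the mismatch independently of any HTTP client library, the raw bytes of a response (for example, everything read from a plain socket until the server closes it) can be checked against the declared length. A minimal sketch, assuming the complete raw response is already held in memory; the function name is hypothetical and chunked transfer encoding is not handled:

```python
from typing import Optional, Tuple


def content_length_mismatch(raw: bytes) -> Optional[Tuple[int, int]]:
    """Return (declared, received) when the body length differs from
    the Content-Length header, or None when they agree or no
    Content-Length header is present."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return None  # header block never completed; nothing to compare
    declared = None
    # Skip the status line, then scan the header lines.
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip().decode("ascii"))
    if declared is None or declared == len(body):
        return None
    return declared, len(body)


truncated = b"HTTP/1.1 200 OK\r\nContent-Length: 10\r\n\r\nabc"
print(content_length_mismatch(truncated))
```

A non-None result for a Tier 1 request would show exactly how many bytes pmproxy promised versus how many it delivered before closing the connection.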


Workaround (client-side)

Safe formats confirmed by testing: full PCP relative (-2minutes, -1hours, -7days) and ISO-8601 without TZ designator (2024-01-15T10:30:00).

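Avoid: abbreviated units (-2m, -1h, etc.) and ISO-8601 with a Z suffix. ISO-8601 with a numeric offset was not tested (see the SKIPPED rows) and should be treated as unsafe until verified.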

In pmmcp we added a pre-call expansion step:

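import re

# Note: "-Nweeks" was not part of the test matrix; only seconds, minutes,
# hours and days were confirmed to work in their full form.
_SHORT_UNIT_MAP = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}

def _expand_time_units(expr: str) -> str:
    """Convert abbreviated units to full forms before passing to pmproxy."""
    if expr in ("now", ""):
        return expr
    # fullmatch already anchors both ends, so no trailing "$" is needed
    match = re.fullmatch(r"(-\d+)\s*([smhdw])", expr.strip())
    if match:
        n, unit = match.group(1), match.group(2)
        return f"{n}{_SHORT_UNIT_MAP[unit]}"
    return expr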

See: pmmcp commit 41bb7ab

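For absolute timestamps with a timezone, convert to UTC first and then strip the TZ designator before sending (e.g. 2024-01-15T10:30:00Z becomes 2024-01-15T10:30:00). This assumes pmproxy interprets a designator-less timestamp as UTC, which was not verified here.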
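
A possible client-side helper for absolute timestamps, using only the Python standard library. The function name is hypothetical, and the sketch rests on the unverified assumption that pmproxy reads a timestamp without a TZ designator as UTC:

```python
from datetime import datetime, timezone


def to_naive_utc(ts: str) -> str:
    """Convert an ISO-8601 timestamp to UTC and drop the TZ designator."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11,
    # so rewrite it as an explicit offset for older interpreters.
    if ts.endswith(("Z", "z")):
        ts = ts[:-1] + "+00:00"
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        return ts  # already designator-less: pass through unchanged
    utc = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return utc.isoformat(timespec="seconds")


print(to_naive_utc("2024-01-15T10:30:00Z"))       # 2024-01-15T10:30:00
print(to_naive_utc("2024-01-15T10:30:00+11:00"))  # 2024-01-14T23:30:00
```

Sub-second precision is dropped by `timespec="seconds"`, matching the only designator-less format that was confirmed to work.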


Suggested Fix Directions

  1. Fix the hard crash on Z-suffix ISO-8601 (highest priority / DoS): 2024-01-15T10:30:00Z is valid RFC 3339. This crash likely originates from an unguarded pointer dereference or an unchecked parse failure in the time parser when it encounters the Z terminator.

  2. Fix the crash on large abbreviated units: -7d and -2w crash while -30s/-2m/-1h do not. The difference is the scale of the resulting epoch offset. Possible integer or buffer overflow in time arithmetic.

  3. Fix the Content-Length mismatch on small abbreviated units: pmproxy is committing to a Content-Length then failing to write the promised bytes. The failure path should produce a 400 error response with a JSON body, not a broken HTTP response.

  4. Add input validation at the HTTP handler boundary: Validate start/finish before touching the time parser. An unrecognised format should return {"success": false, "message": "invalid time format: ..."} with HTTP 400. This protects all three tiers regardless of the parser bug.

  5. Document accepted time formats: man pmproxy and the REST API reference do not specify which time format(s) /series/values accepts. ISO-8601 with timezone is a reasonable default expectation. If unsupported, it should be explicitly documented.
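
The boundary validation proposed in item 4 could behave roughly as follows. pmproxy is written in C, so this Python sketch only illustrates the intended logic and is not a patch. The accepted patterns are limited to the formats that returned data in the test matrix; the function name and the (status, body) return shape are assumptions.

```python
import json
import re
from typing import Optional, Tuple

# Formats confirmed to work in the test matrix above.
_RELATIVE = re.compile(r"-\d+(seconds|minutes|hours|days)")
_ISO_NO_TZ = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def validate_time_param(value: str) -> Tuple[int, Optional[str]]:
    """Return (200, None) for a recognised format, else (400, JSON body)."""
    # "now" is accepted because finish=now is used successfully throughout.
    if value == "now" or _RELATIVE.fullmatch(value) or _ISO_NO_TZ.fullmatch(value):
        return 200, None
    body = json.dumps({
        "success": False,
        "message": f"invalid time format: {value}",
    })
    return 400, body


print(validate_time_param("-7days"))
print(validate_time_param("2024-01-15T10:30:00Z"))
```

Rejecting anything outside an explicit allow-list means a request never reaches the time parser with input it has not been shown to handle, which is the property item 4 asks for.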


Reproduce Script

See reproduce_series_values_crash.py

docker-compose.yml

