
feat(qos): hot path optimization, shared state via Redis, and block consensus #507

Open
jorgecuesta wants to merge 13 commits into fix/header-mismatch-and-heuristic-retries from feat/hot-path-optimization

Conversation

@jorgecuesta (Contributor) commented Jan 30, 2026

Summary

This PR implements a comprehensive set of optimizations and reliability improvements for PATH:

  1. Hot Path Optimization - Raw byte passthrough eliminates JSON parsing from the critical request path
  2. Shared State via Redis - Archival status, perceived block height, and reputation scores are now shared across all replicas
  3. Block Height Consensus - Median-anchored algorithm protects against malicious endpoints reporting extreme block heights
  4. Health Check-Based Archival Detection - Archival capability is now determined by actual historical queries during health checks
  5. Enhanced Observability - /ready endpoint now provides detailed endpoint info including block consensus stats

Changes

Hot Path Optimization

  • Store and return raw response bytes without JSON parsing on the request path
  • Defer heavy JSON parsing to async observation queue
  • Convert all extractors (EVM, Cosmos, Solana) to use gjson for reduced allocations
  • Remove unused response parsing code (~1000 lines removed)
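The split between the request path and the observation path can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `observation`, `handleResponse`, and `extractBlockNumber` are hypothetical names, the queue is a plain buffered channel, and `encoding/json` stands in for gjson so the example stays within the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// observation is a hypothetical queue item holding the untouched response bytes.
type observation struct {
	raw []byte
}

// handleResponse is the hot path in this sketch: it enqueues the raw bytes for
// later inspection and returns them unchanged, without parsing any JSON.
// The send is non-blocking, so a full queue drops the observation instead of
// delaying the client response (an assumption, not confirmed by the PR).
func handleResponse(raw []byte, queue chan<- observation) []byte {
	select {
	case queue <- observation{raw: raw}:
	default:
	}
	return raw
}

// extractBlockNumber runs off the hot path. The PR uses gjson here;
// encoding/json is swapped in for this illustration.
func extractBlockNumber(raw []byte) (string, error) {
	var envelope struct {
		Result string `json:"result"`
	}
	if err := json.Unmarshal(raw, &envelope); err != nil {
		return "", err
	}
	return envelope.Result, nil
}

func main() {
	queue := make(chan observation, 16)
	body := []byte(`{"jsonrpc":"2.0","id":1,"result":"0x10"}`)

	// Request path: bytes go back to the client as-is.
	fmt.Println(string(handleResponse(body, queue)))

	// Observation path: parsing happens later, on the queued copy.
	obs := <-queue
	height, err := extractBlockNumber(obs.raw)
	fmt.Println(height, err)
}
```

Dropping observations when the queue is full trades some QoS signal for bounded latency; a real implementation might prefer sampling or a larger buffer.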

Shared State via Redis

  • Archival status: Stored with TTL, read-through on cache miss for non-leader replicas
  • Perceived block number: Atomic max semantics via Lua script for cross-replica consistency
  • Reputation scores: Extended with archival fields (IsArchival, ArchivalExpiresAt)
  • Background sync: Every 5 seconds with immediate sync on startup
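For reviewers unfamiliar with the "atomic max" pattern, here is a rough sketch of what such a Lua script could look like, together with an in-memory model of the same semantics. The script text, the key layout, and the `maxStore` type are assumptions made for illustration; the actual script in this PR may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// atomicMaxScript is a hypothetical Redis Lua script: it overwrites the key
// only when the candidate is greater than the stored value, and returns the
// value that is stored after the call. Redis runs a script atomically, so
// replicas cannot move the perceived block number backwards.
const atomicMaxScript = `
local current = tonumber(redis.call('GET', KEYS[1]))
local candidate = tonumber(ARGV[1])
if current == nil or candidate > current then
  redis.call('SET', KEYS[1], ARGV[1])
  return candidate
end
return current
`

// maxStore models the same semantics in memory, with a mutex standing in for
// Redis's single-threaded script execution.
type maxStore struct {
	mu     sync.Mutex
	values map[string]uint64
}

func newMaxStore() *maxStore {
	return &maxStore{values: make(map[string]uint64)}
}

// SetMax stores candidate only if it exceeds the current value and returns
// the value held after the call.
func (s *maxStore) SetMax(key string, candidate uint64) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if current, ok := s.values[key]; ok && current >= candidate {
		return current
	}
	s.values[key] = candidate
	return candidate
}

func main() {
	store := newMaxStore()
	fmt.Println(store.SetMax("perceived_block", 100))
	fmt.Println(store.SetMax("perceived_block", 95)) // a lagging replica cannot lower the value
	fmt.Println(store.SetMax("perceived_block", 101))
	fmt.Println(len(atomicMaxScript) > 0)
}
```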

Block Height Consensus Mechanism

  • Median-anchored algorithm protects against malicious/misconfigured endpoints
  • Outlier rejection: blocks > median + (syncAllowance × 3) are filtered
  • Self-adjusting using existing sync_allowance config (no new configuration needed)
  • 2-minute sliding window with up to 1000 observations
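The outlier rule above can be illustrated with a small sketch. It assumes the perceived height is the highest observation that survives the filter and that the median is the lower-middle element for even-sized windows; the function name `consensusHeight` is hypothetical, and the 2-minute window and 1000-observation cap are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// outlierMultiplier mirrors the "syncAllowance × 3" bound from the description.
const outlierMultiplier = 3

// consensusHeight returns the highest observed height that is not an outlier,
// along with the median used as the anchor. Observations greater than
// median + syncAllowance*outlierMultiplier are ignored.
func consensusHeight(observations []uint64, syncAllowance uint64) (perceived, median uint64) {
	if len(observations) == 0 {
		return 0, 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]uint64(nil), observations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Lower-middle element for even lengths (an assumption of this sketch).
	median = sorted[(len(sorted)-1)/2]
	limit := median + syncAllowance*outlierMultiplier

	for _, h := range sorted {
		if h <= limit && h > perceived {
			perceived = h
		}
	}
	return perceived, median
}

func main() {
	// One endpoint reports an absurdly high block; the rest agree closely.
	heights := []uint64{100, 101, 102, 103, 1_000_000}
	perceived, median := consensusHeight(heights, 5)
	fmt.Println(perceived, median) // prints: 103 102
}
```

Because the bound is derived from sync_allowance, a chain configured with a wider allowance automatically tolerates a wider spread before an observation is rejected.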

Health Check Improvements

  • Move archival detection from synthetic QoS checks to health checks
  • EVM extractor only evaluates archival-related methods (eth_getBalance, eth_getStorageAt)
  • Health check executor marks endpoints as archival after all validations pass
  • Add archival health checks for all EVM chains aligned with E2E config
  • Scale health check worker pool dynamically based on service/endpoint count

Enhanced /ready Endpoint

  • Add ?detailed=true query parameter for comprehensive endpoint info
  • Includes: reputation scores, archival status, tier classification, cooldown status
  • New fields: perceived_block_height, median_block_height, block_observations

Bug Fixes

  • Strip trailing newlines from JSON responses (hot path optimization artifact)
  • Return error response instead of empty body for failed requests
  • Preserve backend error responses through error handling chain
  • Handle endpoint unavailability race condition with proper re-selection
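For the first fix in this list, a minimal sketch of stripping trailing newlines from raw response bytes is shown below. The helper name `stripTrailingNewlines` is hypothetical, and trimming `\r` along with `\n` is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// stripTrailingNewlines removes any run of trailing "\n" or "\r" bytes.
// bytes.TrimRight returns a subslice, so no copy is made on the hot path.
func stripTrailingNewlines(body []byte) []byte {
	return bytes.TrimRight(body, "\r\n")
}

func main() {
	raw := []byte("{\"jsonrpc\":\"2.0\",\"id\":1,\"result\":\"0x1\"}\n")
	fmt.Printf("%q\n", stripTrailingNewlines(raw))
}
```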

Test Plan

  • Unit tests pass (make test_unit)
  • E2E tests pass for eth, pocket, xrplevm (make e2e_test eth,pocket,xrplevm)
  • Block consensus tests verify outlier rejection
  • Shared state test script verifies cross-replica consistency (scripts/test_shared_state.sh)
  • go fmt, go vet, golangci-lint pass

jorgecuesta and others added 4 commits January 30, 2026 01:00
This commit implements the hot path optimization plan:

EVM:
- Remove passthroughMode concept (was dead code)
- Store raw bytes in UpdateWithResponse without parsing
- Return raw bytes in GetHTTPResponse for low-latency client response
- Add archival detection during request validation
- Route archival requests to archival-capable endpoints only
- Add X-Archival-Request response header
- Use gjson for async observation creation

Cosmos:
- Add rawEndpointResponse struct for raw byte storage
- Store raw bytes in UpdateWithResponse without parsing
- Return raw bytes in GetHTTPResponse
- Use gjson for async observation creation

Solana:
- Add rawEndpointResponse struct for raw byte storage
- Store raw bytes in UpdateWithResponse without parsing
- Return raw bytes in GetHTTPResponse
- Use gjson for async observation creation

The heavy JSON parsing is now deferred to the async observation queue,
reducing hot path latency significantly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit converts all QoS extractors to use gjson for efficient field
extraction without full JSON unmarshalling:

EVM extractor:
- Use gjson for method detection (isMethod)
- Use gjson for result extraction (extractStringResult)
- Use gjson for error checking in IsSyncing, IsArchival, IsValidResponse

Cosmos extractor:
- Use gjson for block height extraction (CometBFT and REST formats)
- Use gjson for chain ID extraction
- Use gjson for sync status checking
- Use gjson for archival detection

Solana extractor:
- Use gjson for block height extraction from getEpochInfo
- Use gjson for chain ID extraction from getVersion
- Use gjson for health status checking
- Use gjson for archival detection

This reduces allocations in the async observation path where extractors run
via the ObservationQueue, complementing the raw byte passthrough optimization
on the hot path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit removes the old response parsing code that is no longer
needed after the raw byte passthrough optimization:

EVM:
- Remove response_blocknumber.go, response_chainid.go, response_getbalance.go
- Clean up response.go and response_generic.go to keep only used functions

Cosmos:
- Remove response_cometbft_health.go, response_cometbft_status.go,
  response_cosmos_status.go, response_evm_chainid.go, response_jsonrpc.go,
  response_rest.go, response_rest_unrecognized.go
- Keep response_jsonrpc_unrecognized.go (still used for batch error handling)
- Remove unused endpoint response validator functions

Solana:
- Remove response.go, response_generic.go, response_getepochinfo.go,
  response_gethealth.go
- Add error constants to context.go (still used by context_batch.go)
- Remove unused response interface

All tests pass with 0 linter issues.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ation

- Move archival endpoint detection from synthetic QoS checks to health checks
- EVM extractor now only evaluates archival-related methods (eth_getBalance, etc.)
- Health check executor marks endpoints as archival after all validations pass
- Remove unused serviceStateLock and archivalState consensus-based system
- Add BSC archival health check with verified expected balance (0x231f41a5cec600)
- Add IsArchivalCheck flag to QueuedObservation for future use

This ensures archival capability is determined by actual historical queries
during health checks rather than synthetic QoS requests, preventing false
positives from non-archival methods like eth_blockNumber.
@jorgecuesta jorgecuesta requested a review from oten91 January 30, 2026 07:24
@jorgecuesta jorgecuesta self-assigned this Jan 30, 2026
@jorgecuesta jorgecuesta added the e2e, bug, code health, gateway, evm, and performance labels Jan 30, 2026
Add ?detailed=true query parameter to /ready and /ready/<service> endpoints
to return comprehensive endpoint information including:

- Reputation scores (0-100), success/error counts, latency metrics
- Archival capability status (is_archival, expires_at)
- Tier classification (1=best, 2=good, 3=probation)
- Cooldown status and remaining time
- RPC types supported by each endpoint

Implementation details:
- Add EndpointDetails types to protocol package
- Add QoSArchivalReporter interface for archival status queries
- Add QoSServiceRegistry interface for cross-layer QoS access
- Wire QoS registry from request parser to protocol layer
- Update CLAUDE.md with operational endpoint documentation

Also includes bug fixes:
- Strip trailing newlines from JSON responses (hot path optimization)
- Return error response instead of empty body for failed requests
- Handle endpoint unavailability race condition with proper re-selection
…eck workers

- Scale health check worker pool dynamically based on services × endpoints × checks
- Add DefaultMinHealthCheckWorkers (100), DefaultEndpointsPerServiceEstimate (50)
- Use HealthCheckWorkerMultiplier (2x) for headroom on parallel execution
- Log worker pool configuration at startup for debugging
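A rough sketch of the sizing rule described above follows. The constant names and values come from this commit message, but the exact formula, the function name `healthCheckWorkers`, and its parameters are assumptions.

```go
package main

import "fmt"

const (
	// Values taken from the commit message above.
	DefaultMinHealthCheckWorkers       = 100
	DefaultEndpointsPerServiceEstimate = 50
	HealthCheckWorkerMultiplier        = 2
)

// healthCheckWorkers estimates the pool size as
// services × endpoints × checks × multiplier, never going below the minimum.
// Using the per-service endpoint estimate here is an assumption of this sketch.
func healthCheckWorkers(services, checksPerEndpoint int) int {
	estimated := services * DefaultEndpointsPerServiceEstimate * checksPerEndpoint * HealthCheckWorkerMultiplier
	if estimated < DefaultMinHealthCheckWorkers {
		return DefaultMinHealthCheckWorkers
	}
	return estimated
}

func main() {
	fmt.Println(healthCheckWorkers(0, 3)) // falls back to the minimum
	fmt.Println(healthCheckWorkers(3, 2))
}
```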

- Preserve response bytes through error handling chain so actual backend errors
  (e.g., "state is pruned", "header not found") are returned to users instead
  of generic wrapper messages
- Add errHeuristicDetectedBackendError for backend issues vs malformed payloads
- Update error classification to handle heuristic-detected backend errors
…ith E2E config

Added archival health checks using eth_getBalance at historical blocks for:
- Mainnet EVM: eth, poly, bsc, avax, bera, sonic, ink, moonbeam, gnosis, celo, fantom, oasys, kaia, xrplevm
- L2 chains: arb-one, op, base, scroll, linea, blast, boba, metis, taiko, opbnb

Contract addresses and block numbers aligned with E2E test configuration (services_shannon.yaml).
Expected balances verified via DRPC and operator archival endpoints.

Pending archival verification: zksync-era, iotex, fuse
…nsus

Implements Redis-backed shared state for PATH replicas to ensure consistent
QoS decisions across all instances. This addresses the 75% failure rate on
archival requests caused by non-leader replicas lacking archival status.

Key changes:

1. Shared State via Redis:
   - Archival status: stored with TTL, read-through on cache miss
   - Perceived block number: atomic max semantics via Lua script
   - Reputation scores: already shared, now include archival fields

2. Block Height Consensus Mechanism:
   - Median-anchored algorithm protects against malicious/misconfigured endpoints
   - Outlier rejection: blocks > median + (syncAllowance × 3) are filtered
   - Self-adjusting using existing sync_allowance config
   - 2-minute sliding window with up to 1000 observations

3. Cross-Replica Synchronization:
   - Background sync every 5 seconds for perceived block number
   - Immediate sync on startup before serving requests
   - Write-through to Redis on block updates

4. Enhanced /ready endpoint:
   - perceived_block_height: consensus result
   - median_block_height: anchor for outlier detection
   - block_observations: observation count in window

Files changed:
- qos/evm/block_consensus.go: New consensus implementation
- qos/evm/qos.go: Integration with consensus and Redis
- qos/evm/service_state.go: Added blockConsensus field
- reputation/: Added archival fields and perceived block methods
- router/operational_endpoints.go: Block stats in /ready response
- protocol/shannon/operational.go: New interface methods
- gateway/qos.go: QoSBlockConsensusReporter interface
- Add ArchivalRequirementChecker interface for proper type assertions
- Add SelectMultipleWithArchival method to EndpointSelector interface
- Implement archival-aware endpoint selection in EVM, Cosmos, Solana, NoOp
- Filter endpoints for archival capability in hedge racing path
- Filter endpoints for archival capability in batch request path
- Filter endpoints for archival capability in retry path
- Add filterArchivalEndpointsForFallback for fallback selection
- Respect archival requirements even when all endpoints fail validation

This ensures archival requests only go to archival-capable endpoints,
preventing failures from non-archival nodes returning "state is pruned".
@jorgecuesta jorgecuesta changed the title from "fix(evm): refactor archival detection to use health check-based validation" to "feat(qos): hot path optimization, shared state via Redis, and block consensus" Jan 31, 2026
- User requests no longer mark endpoints as archival (previously, a successful
  response to a recent-block request incorrectly marked the endpoint as archival)
- Only health checks via markEndpointArchival() can set archival=true
  (validates response contains expected value for known historical block)
- User requests can still clear archival status when they fail, catching
  false positives where health check passed but actual requests fail
- Added clearEndpointArchival() to explicitly clear status when health
  check validation fails (expected_response_contains mismatch)
- Added missing archival error indicators: "state histories", "not fully
  indexed", "historical data"
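The indicator matching can be sketched as a simple substring check. The indicator strings below are the ones named in this PR ("state is pruned" plus the three added here); the variable and function names are hypothetical, and case-insensitive matching is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// archivalErrorIndicators lists substrings that suggest the backend lacks the
// requested historical state. All four strings appear in this PR's text.
var archivalErrorIndicators = [][]byte{
	[]byte("state is pruned"),
	[]byte("state histories"),
	[]byte("not fully indexed"),
	[]byte("historical data"),
}

// looksLikeArchivalError reports whether the response body contains any
// indicator. In this sketch a true result on a user request would clear the
// endpoint's archival flag; only health checks would ever set it.
func looksLikeArchivalError(body []byte) bool {
	lowered := bytes.ToLower(body)
	for _, indicator := range archivalErrorIndicators {
		if bytes.Contains(lowered, indicator) {
			return true
		}
	}
	return false
}

func main() {
	pruned := []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"State is pruned"}}`)
	ok := []byte(`{"jsonrpc":"2.0","id":1,"result":"0x231f41a5cec600"}`)
	fmt.Println(looksLikeArchivalError(pruned), looksLikeArchivalError(ok))
}
```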
@jorgecuesta jorgecuesta force-pushed the feat/hot-path-optimization branch from c737ac5 to c9a298b January 31, 2026 04:15
…timization

Resolve false "no archival-capable endpoints" errors, fix cross-replica
archival state propagation via Redis, and optimize the hot path by
replacing synchronous Redis calls with in-memory cache lookups.

Archival filtering:
- Add structured endpoint filter errors with diagnostic context
- Filter fresh (unobserved) endpoints for archival requests using the
  archival cache, with cold-start fallback when cache is empty
- Add bi-directional archival state updates in endpoint observations
- Enforce RPCType_JSON_RPC consistency for all archival status keys

Cross-replica sync fixes (reputation/storage/redis.go):
- Fix parseKey for compound endpoint addresses containing "://" URLs
- Fix missing strings.ToUpper on rpcType before proto enum lookup
- Reduce archival cache refresh interval from 2h to 5s

Hot path optimization:
- Add TTL-based ArchivalCache for O(1) archival status lookups
- Replace synchronous Redis calls with in-memory cache in endpoint
  selection path
- Remove artificial concurrency semaphore from HTTP client
- Add background cache refresh worker for archival status sync
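A TTL-based in-memory cache of this kind might look roughly like the sketch below. The type name `ArchivalCache` comes from this commit message; its fields, constructor, and method signatures are assumptions, and the background refresh worker that repopulates entries from Redis is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// archivalEntry holds a cached archival flag and its expiry.
type archivalEntry struct {
	isArchival bool
	expiresAt  time.Time
}

// ArchivalCache is a map guarded by an RWMutex, so endpoint selection can do
// O(1) lookups without a synchronous Redis round trip.
type ArchivalCache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[string]archivalEntry
}

func NewArchivalCache(ttl time.Duration) *ArchivalCache {
	return &ArchivalCache{ttl: ttl, entries: make(map[string]archivalEntry)}
}

// Set records the archival status for an endpoint key with a fresh TTL.
func (c *ArchivalCache) Set(endpoint string, isArchival bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[endpoint] = archivalEntry{isArchival: isArchival, expiresAt: time.Now().Add(c.ttl)}
}

// Get returns the cached status. ok is false when the key is missing or the
// entry has expired, so the caller can treat it as a cache miss.
func (c *ArchivalCache) Get(endpoint string) (isArchival, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	entry, found := c.entries[endpoint]
	if !found || time.Now().After(entry.expiresAt) {
		return false, false
	}
	return entry.isArchival, true
}

func main() {
	cache := NewArchivalCache(5 * time.Second)
	cache.Set("supplier-1-https://node.example", true)

	isArchival, ok := cache.Get("supplier-1-https://node.example")
	fmt.Println(isArchival, ok)

	_, ok = cache.Get("unknown-endpoint")
	fmt.Println(ok)
}
```

Expired entries are treated as misses rather than as "not archival", which leaves room for the cold-start fallback mentioned under archival filtering.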

Observability:
- Add diagnostic headers (X-Archival-Request, X-Suppliers-Tried)
- Add CODE_PATH instrumentation markers for request flow tracing
- Add request ID propagation through endpoint selection
- Add structured errors for block height and chain ID filtering
…currency

Fixes nil pointer dereference panic in handleResult when processing batch
requests. The panic occurred because multiple goroutines in handleBatchRelayRequest
were concurrently reading/writing rc.suppliersTried without synchronization.

Root cause: handleBatchRelayRequest spawns parallel goroutines for each batch
item, all sharing the same requestContext. The hedgeRacer's handleResult and
recordLoser methods would concurrently append to rc.suppliersTried, causing
data race corruption that manifested as nil pointer dereference at offset 0x8.

Fix:
- Add sync.Mutex (suppliersMu) to requestContext struct
- Create thread-safe helper methods:
  - addSupplierTried(): atomic check-and-add with deduplication
  - addSuppliersTried(): batch add multiple suppliers
  - getSuppliersTried(): returns copy of slice
  - getSuppliersTriedCount(): returns count safely
- Update all access sites in hedge.go and http_request_context_handle_request.go
  to use these thread-safe methods
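The helper methods listed above could be shaped roughly like this. The method names and the `suppliersMu` field come from this commit message; the field types, signatures, and return values are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// requestContext is reduced here to the two fields relevant to the fix.
type requestContext struct {
	suppliersMu    sync.Mutex
	suppliersTried []string
}

// addLocked appends supplier if it is not already present.
// The caller must hold suppliersMu.
func (rc *requestContext) addLocked(supplier string) bool {
	for _, s := range rc.suppliersTried {
		if s == supplier {
			return false
		}
	}
	rc.suppliersTried = append(rc.suppliersTried, supplier)
	return true
}

// addSupplierTried performs the check-and-add under the lock and reports
// whether the supplier was newly added.
func (rc *requestContext) addSupplierTried(supplier string) bool {
	rc.suppliersMu.Lock()
	defer rc.suppliersMu.Unlock()
	return rc.addLocked(supplier)
}

// addSuppliersTried adds several suppliers while taking the lock once.
func (rc *requestContext) addSuppliersTried(suppliers []string) {
	rc.suppliersMu.Lock()
	defer rc.suppliersMu.Unlock()
	for _, s := range suppliers {
		rc.addLocked(s)
	}
}

// getSuppliersTried returns a copy so callers cannot race on the backing array.
func (rc *requestContext) getSuppliersTried() []string {
	rc.suppliersMu.Lock()
	defer rc.suppliersMu.Unlock()
	return append([]string(nil), rc.suppliersTried...)
}

// getSuppliersTriedCount returns the number of recorded suppliers.
func (rc *requestContext) getSuppliersTriedCount() int {
	rc.suppliersMu.Lock()
	defer rc.suppliersMu.Unlock()
	return len(rc.suppliersTried)
}

func main() {
	rc := &requestContext{}
	var wg sync.WaitGroup
	// Simulate parallel batch items sharing one request context.
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			rc.addSupplierTried(fmt.Sprintf("supplier-%d", i%5))
		}(i)
	}
	wg.Wait()
	fmt.Println(rc.getSuppliersTriedCount()) // prints: 5
}
```

Running a reproduction like this with `go run -race` is one way to confirm that access to the slice is fully synchronized.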