
feat: improved error messages + hedge racing + retry enhancements#505

Open
jorgecuesta wants to merge 3 commits into main from feat/response-heuristic-analysis

@jorgecuesta jorgecuesta commented Jan 15, 2026

This PR introduces several reliability and observability improvements to PATH:

1. Protocol Error Propagation

  • Problem: When requests failed due to no available endpoints, users received a generic error: "protocol-level error: no endpoint responses received"
  • Solution: Added SetProtocolError to the RequestQoSContext interface to propagate specific errors from the protocol layer to client responses
  • Result: Users now see specific errors such as:
      • "no valid endpoints available for service: service X"
      • "selected endpoint is not available: relay request will fail: service X endpoint pokt1..."

2. Hedge Racing (New Feature)

  • Spawn a parallel "hedge" request after a configurable delay if the primary hasn't responded
  • The first successful response wins; the other is canceled
  • Configurable via retry_config.hedge_delay and retry_config.connect_timeout
  • Track outcomes via X-Hedge-Result header

3. Retry Enhancements

  • Time Budget: max_retry_latency skips retries when the failed request already took too long
  • Endpoint Rotation: Each retry attempt uses a different endpoint
  • Heuristic Detection: Retry on JSON-RPC errors hidden in HTTP 200 responses
  • Observability: Track via X-Retry-Count and X-Suppliers-Tried headers

4. Heuristic Response Analysis

  • Detect errors in response payloads despite HTTP 200 status
  • Identify: JSON-RPC errors, HTML error pages, empty responses, malformed JSON
  • Record correcting reputation signals for detected failures
  • Support retry decisions based on payload analysis
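
For illustration, a payload check along these lines could look like the sketch below. The `verdict` type, its values, and `analyze` are hypothetical and do not reflect the actual `qos/heuristic` package API. The sketch treats any body starting with `<` as an HTML (or XML) page and does not inspect JSON-RPC batch arrays.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// verdict is a hypothetical classification of a response body.
type verdict string

const (
	verdictOK            verdict = "ok"
	verdictEmpty         verdict = "empty_response"
	verdictHTML          verdict = "html_error_page"
	verdictMalformedJSON verdict = "malformed_json"
	verdictJSONRPCError  verdict = "jsonrpc_error"
)

// analyze inspects a body that arrived with HTTP 200 and reports whether it
// still looks like a failure.
func analyze(body []byte) verdict {
	trimmed := bytes.TrimSpace(body)
	if len(trimmed) == 0 {
		return verdictEmpty
	}
	if trimmed[0] == '<' {
		return verdictHTML // markup instead of JSON, e.g. a proxy error page
	}
	if !json.Valid(trimmed) {
		return verdictMalformedJSON
	}

	var msg struct {
		Error json.RawMessage `json:"error"`
	}
	if err := json.Unmarshal(trimmed, &msg); err != nil {
		return verdictOK // valid JSON but not an object (e.g. a batch); not inspected here
	}
	if len(msg.Error) > 0 && !bytes.Equal(msg.Error, []byte("null")) {
		return verdictJSONRPCError
	}
	return verdictOK
}

func main() {
	samples := []string{
		``,
		`<html><body>502 Bad Gateway</body></html>`,
		`{"jsonrpc":"2.0","id":1,`,
		`{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"header not found"}}`,
		`{"jsonrpc":"2.0","id":1,"result":"0x10"}`,
	}
	for _, s := range samples {
		fmt.Printf("%q -> %s\n", s, analyze([]byte(s)))
	}
}
```

Running the cheap structural checks before parsing keeps the common failure cases (empty or HTML bodies) from paying for a full JSON decode.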

5. Response Metadata Headers

| Header | Description |
| --- | --- |
| X-Retry-Count | Number of retry attempts (0 = first attempt succeeded) |
| X-Suppliers-Tried | Comma-separated list of attempted supplier addresses |
| X-Hedge-Result | Hedge racing outcome: primary_only, primary_won, hedge_won, both_failed |
| X-App-Address | Application address used for the relay |
| X-Supplier-Address | Supplier address of the responding endpoint |
| X-Session-ID | Session ID for the relay |

6. Health Check Improvements

  • Add response heuristic analysis to health checks
  • Record heuristic-detected errors to reputation service
  • Better debug logging for health check validation

Configuration

services:
- service_id: eth
  retry_config:
    enabled: true
    max_retries: 2
    hedge_delay: 500ms        # Spawn hedge after 500ms
    connect_timeout: 200ms    # Fast-fail on connection issues
    max_retry_latency: 5s     # Skip retry if request took >5s
    retry_on_5xx: true
    retry_on_timeout: true
    retry_on_connection: true

Test Plan

  • Manual testing with ETH, Cosmos chains
  • Batch requests verified working (IDs preserved)
  • Protocol error messages propagate correctly
  • Build passes
  • E2E tests (some upstream endpoint flakiness noted)

- Add qos/heuristic package for tiered response analysis:
  - Tier 1: Structural checks (empty, HTML, XML, non-JSON, malformed)
  - Tier 2: Protocol-specific analysis (JSON-RPC result vs error fields)
  - Tier 3: Error pattern detection (blockchain-specific, rate limits, etc.)

- Fix backend 5xx reputation bug in protocol/shannon/context.go:
  - Backend 5xx responses now record CriticalErrorSignal (-25) instead of success
  - 4xx errors not penalized (often user's fault)

- Wire heuristic detection to reputation in gateway:
  - When heuristic detects payload error despite HTTP 200, record correcting signal
  - High confidence errors: MajorError (-10)
  - Lower confidence errors: MinorError (-3)

- Cleanup: Remove unused hydrateLogger and hydrateLoggerWithPayload functions
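
The signal mapping described in this commit message can be summarized in a short sketch. The constant names and the `scoreDelta` function are hypothetical; only the score values (-25, -10, -3) and the rule that 4xx responses are not penalized come from the text above. Returning 0 for a clean response is a simplification, since the value of the success signal is not stated here.

```go
package main

import "fmt"

// Score deltas taken from the commit message; the names are illustrative.
const (
	criticalErrorScore = -25 // backend 5xx
	majorErrorScore    = -10 // high-confidence payload error despite HTTP 200
	minorErrorScore    = -3  // lower-confidence payload error despite HTTP 200
)

// scoreDelta returns the reputation penalty for one backend response.
// payloadError and highConfidence stand in for the heuristic's verdict.
func scoreDelta(httpStatus int, payloadError, highConfidence bool) int {
	switch {
	case httpStatus >= 500:
		return criticalErrorScore
	case httpStatus >= 400:
		return 0 // 4xx is often the caller's fault, so no penalty
	case payloadError && highConfidence:
		return majorErrorScore
	case payloadError:
		return minorErrorScore
	default:
		return 0 // success signal omitted in this sketch
	}
}

func main() {
	fmt.Println(scoreDelta(503, false, false))
	fmt.Println(scoreDelta(404, false, false))
	fmt.Println(scoreDelta(200, true, true))
	fmt.Println(scoreDelta(200, true, false))
}
```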
@jorgecuesta jorgecuesta force-pushed the feat/response-heuristic-analysis branch from 5b4d352 to f8cb9ee Compare January 23, 2026 16:54
Protocol Error Propagation:
- Add SetProtocolError to RequestQoSContext interface for specific error messages
- Implement in all QoS contexts (EVM, Cosmos, Solana, NoOp, HealthCheck)
- Replace generic "no endpoint responses received" with specific errors like
  "no valid endpoints available for service" or endpoint-specific details

Hedge Racing (New Feature):
- Add hedge_delay config to spawn parallel request after initial delay
- Primary request races against hedge; first success wins
- Configurable connect_timeout for fast-fail on connection issues
- Track hedge results via X-Hedge-Result response header

Retry Enhancements:
- Add max_retry_latency to skip retries when request already took too long
- Add retry endpoint rotation to try different endpoints on each retry
- Track retry count and suppliers tried in response headers
- Add heuristic-based retry detection for JSON-RPC errors in HTTP 200

Response Metadata Headers:
- X-Retry-Count: number of retry attempts
- X-Suppliers-Tried: comma-separated list of attempted suppliers
- X-Hedge-Result: outcome of hedge racing (primary_won, hedge_won, etc.)
- X-App-Address, X-Supplier-Address, X-Session-ID for debugging

Heuristic Response Analysis:
- Detect errors in response payloads despite HTTP 200 status
- Identify JSON-RPC errors, HTML error pages, empty responses
- Record correcting reputation signals for heuristic-detected failures
- Support retry decisions based on payload analysis

Health Check Improvements:
- Add response heuristic analysis to health checks
- Record heuristic errors to reputation service
- Better logging for health check validation

Metrics:
- Add retry distribution metrics by reason (5xx, timeout, connection, heuristic)
- Add hedge racing outcome metrics
- Add batch size tracking metrics

Config:
- Add retry_config with hedge_delay, connect_timeout, max_retry_latency
- Update schema and example configs
@jorgecuesta jorgecuesta changed the title from "feat: heuristic response analysis + reputation fixes" to "feat: improved error messages + hedge racing + retry enhancements" Jan 24, 2026
@jorgecuesta jorgecuesta self-assigned this Jan 24, 2026
@jorgecuesta jorgecuesta added the bug, e2e, and config labels Jan 24, 2026
@jorgecuesta jorgecuesta marked this pull request as ready for review January 24, 2026 10:40
@oten91 oten91 changed the base branch from main to staging January 24, 2026 10:44
@oten91 oten91 changed the base branch from staging to main January 24, 2026 10:47