feat(qos): hot path optimization, shared state via Redis, and block consensus #507
Open
jorgecuesta wants to merge 13 commits into fix/header-mismatch-and-heuristic-retries
This commit implements the hot path optimization plan:

EVM:
- Remove passthroughMode concept (was dead code)
- Store raw bytes in UpdateWithResponse without parsing
- Return raw bytes in GetHTTPResponse for low-latency client response
- Add archival detection during request validation
- Route archival requests to archival-capable endpoints only
- Add X-Archival-Request response header
- Use gjson for async observation creation

Cosmos:
- Add rawEndpointResponse struct for raw byte storage
- Store raw bytes in UpdateWithResponse without parsing
- Return raw bytes in GetHTTPResponse
- Use gjson for async observation creation

Solana:
- Add rawEndpointResponse struct for raw byte storage
- Store raw bytes in UpdateWithResponse without parsing
- Return raw bytes in GetHTTPResponse
- Use gjson for async observation creation

The heavy JSON parsing is now deferred to the async observation queue, reducing hot path latency significantly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit converts all QoS extractors to use gjson for efficient field extraction without full JSON unmarshalling:

EVM extractor:
- Use gjson for method detection (isMethod)
- Use gjson for result extraction (extractStringResult)
- Use gjson for error checking in IsSyncing, IsArchival, IsValidResponse

Cosmos extractor:
- Use gjson for block height extraction (CometBFT and REST formats)
- Use gjson for chain ID extraction
- Use gjson for sync status checking
- Use gjson for archival detection

Solana extractor:
- Use gjson for block height extraction from getEpochInfo
- Use gjson for chain ID extraction from getVersion
- Use gjson for health status checking
- Use gjson for archival detection

This reduces allocations in the async observation path where extractors run via the ObservationQueue, complementing the raw byte passthrough optimization on the hot path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit removes the old response parsing code that is no longer needed after the raw byte passthrough optimization:

EVM:
- Remove response_blocknumber.go, response_chainid.go, response_getbalance.go
- Clean up response.go and response_generic.go to keep only used functions

Cosmos:
- Remove response_cometbft_health.go, response_cometbft_status.go, response_cosmos_status.go, response_evm_chainid.go, response_jsonrpc.go, response_rest.go, response_rest_unrecognized.go
- Keep response_jsonrpc_unrecognized.go (still used for batch error handling)
- Remove unused endpoint response validator functions

Solana:
- Remove response.go, response_generic.go, response_getepochinfo.go, response_gethealth.go
- Add error constants to context.go (still used by context_batch.go)
- Remove unused response interface

All tests pass with 0 linter issues.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ation

- Move archival endpoint detection from synthetic QoS checks to health checks
- EVM extractor now only evaluates archival-related methods (eth_getBalance, etc.)
- Health check executor marks endpoints as archival after all validations pass
- Remove unused serviceStateLock and archivalState consensus-based system
- Add BSC archival health check with verified expected balance (0x231f41a5cec600)
- Add IsArchivalCheck flag to QueuedObservation for future use

This ensures archival capability is determined by actual historical queries during health checks rather than synthetic QoS requests, preventing false positives from non-archival methods like eth_blockNumber.
Add ?detailed=true query parameter to /ready and /ready/<service> endpoints to return comprehensive endpoint information including:
- Reputation scores (0-100), success/error counts, latency metrics
- Archival capability status (is_archival, expires_at)
- Tier classification (1=best, 2=good, 3=probation)
- Cooldown status and remaining time
- RPC types supported by each endpoint

Implementation details:
- Add EndpointDetails types to protocol package
- Add QoSArchivalReporter interface for archival status queries
- Add QoSServiceRegistry interface for cross-layer QoS access
- Wire QoS registry from request parser to protocol layer
- Update CLAUDE.md with operational endpoint documentation

Also includes bug fixes:
- Strip trailing newlines from JSON responses (hot path optimization)
- Return error response instead of empty body for failed requests
- Handle endpoint unavailability race condition with proper re-selection
…eck workers

- Scale health check worker pool dynamically based on services × endpoints × checks
- Add DefaultMinHealthCheckWorkers (100), DefaultEndpointsPerServiceEstimate (50)
- Use HealthCheckWorkerMultiplier (2x) for headroom on parallel execution
- Log worker pool configuration at startup for debugging
- Preserve response bytes through error handling chain so actual backend errors (e.g., "state is pruned", "header not found") are returned to users instead of generic wrapper messages
- Add errHeuristicDetectedBackendError for backend issues vs malformed payloads
- Update error classification to handle heuristic-detected backend errors
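For reviewers, the pool sizing described above can be sketched roughly as follows. The constant values come from the commit message; the helper name `healthCheckWorkers` and the exact way the factors are combined are assumptions, not the code in this PR.

```go
package main

import "fmt"

// Values are taken from the commit message. How PATH actually combines
// them is an assumption made for this sketch.
const (
	defaultMinHealthCheckWorkers       = 100
	defaultEndpointsPerServiceEstimate = 50
	healthCheckWorkerMultiplier        = 2
)

// healthCheckWorkers is a hypothetical helper: it estimates the pool size as
// services × estimated endpoints × checks × multiplier, with a lower bound.
func healthCheckWorkers(services, checksPerEndpoint int) int {
	workers := services * defaultEndpointsPerServiceEstimate *
		checksPerEndpoint * healthCheckWorkerMultiplier
	if workers < defaultMinHealthCheckWorkers {
		return defaultMinHealthCheckWorkers
	}
	return workers
}

func main() {
	// No services configured: the minimum applies.
	fmt.Println(healthCheckWorkers(0, 1))
	// 3 services × 50 endpoints × 2 checks × 2 headroom.
	fmt.Println(healthCheckWorkers(3, 2))
}
```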
…ith E2E config

Added archival health checks using eth_getBalance at historical blocks for:
- Mainnet EVM: eth, poly, bsc, avax, bera, sonic, ink, moonbeam, gnosis, celo, fantom, oasys, kaia, xrplevm
- L2 chains: arb-one, op, base, scroll, linea, blast, boba, metis, taiko, opbnb

Contract addresses and block numbers aligned with E2E test configuration (services_shannon.yaml). Expected balances verified via DRPC and operator archival endpoints.

Pending archival verification: zksync-era, iotex, fuse
…nsus

Implements Redis-backed shared state for PATH replicas to ensure consistent QoS decisions across all instances. This addresses the 75% failure rate on archival requests caused by non-leader replicas lacking archival status.

Key changes:

1. Shared State via Redis:
- Archival status: stored with TTL, read-through on cache miss
- Perceived block number: atomic max semantics via Lua script
- Reputation scores: already shared, now includes archival fields

2. Block Height Consensus Mechanism:
- Median-anchored algorithm protects against malicious/misconfigured endpoints
- Outlier rejection: blocks > median + (syncAllowance × 3) are filtered
- Self-adjusting using existing sync_allowance config
- 2-minute sliding window with up to 1000 observations

3. Cross-Replica Synchronization:
- Background sync every 5 seconds for perceived block number
- Immediate sync on startup before serving requests
- Write-through to Redis on block updates

4. Enhanced /ready endpoint:
- perceived_block_height: consensus result
- median_block_height: anchor for outlier detection
- block_observations: observation count in window

Files changed:
- qos/evm/block_consensus.go: New consensus implementation
- qos/evm/qos.go: Integration with consensus and Redis
- qos/evm/service_state.go: Added blockConsensus field
- reputation/: Added archival fields and perceived block methods
- router/operational_endpoints.go: Block stats in /ready response
- protocol/shannon/operational.go: New interface methods
- gateway/qos.go: QoSBlockConsensusReporter interface
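The median-anchored outlier rejection described above might look something like this. It is a minimal sketch under assumptions: the function name `perceivedBlock` is hypothetical, the perceived height is taken as the highest surviving observation, the upper-middle element is used as the median for even-sized inputs, and the 2-minute sliding window is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// perceivedBlock is a hypothetical helper. It computes the median of the
// observed heights, drops anything above median + syncAllowance*3, and
// returns the highest remaining height together with the median.
func perceivedBlock(observations []uint64, syncAllowance uint64) (perceived, median uint64) {
	if len(observations) == 0 {
		return 0, 0
	}
	// Sort a copy so the caller's slice is not reordered.
	sorted := append([]uint64(nil), observations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	median = sorted[len(sorted)/2]
	limit := median + syncAllowance*3

	for _, h := range sorted {
		if h <= limit && h > perceived {
			perceived = h
		}
	}
	return perceived, median
}

func main() {
	// One endpoint reports an implausibly high block and is ignored.
	obs := []uint64{100, 101, 102, 103, 9999}
	perceived, median := perceivedBlock(obs, 5)
	fmt.Println(perceived, median)
}
```

With a sync allowance of 5, the cutoff in this example is 102 + 15 = 117, so the 9999 report cannot drag the perceived height forward.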
- Add ArchivalRequirementChecker interface for proper type assertions
- Add SelectMultipleWithArchival method to EndpointSelector interface
- Implement archival-aware endpoint selection in EVM, Cosmos, Solana, NoOp
- Filter endpoints for archival capability in hedge racing path
- Filter endpoints for archival capability in batch request path
- Filter endpoints for archival capability in retry path
- Add filterArchivalEndpointsForFallback for fallback selection
- Respect archival requirements even when all endpoints fail validation

This ensures archival requests only go to archival-capable endpoints, preventing failures from non-archival nodes returning "state is pruned".
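The filtering rule above can be illustrated with a small sketch. The `endpoint` type and `selectCandidates` function are hypothetical stand-ins, not the interfaces added in this commit; the point is only that an archival request never falls back to non-archival endpoints, even when that leaves no candidates.

```go
package main

import "fmt"

// endpoint is a hypothetical, simplified stand-in for PATH's endpoint type.
type endpoint struct {
	addr     string
	archival bool
}

// selectCandidates returns every endpoint for ordinary requests, and only
// archival-capable endpoints for archival requests. It may return an empty
// slice rather than relaxing the archival requirement.
func selectCandidates(endpoints []endpoint, requiresArchival bool) []endpoint {
	if !requiresArchival {
		return endpoints
	}
	var out []endpoint
	for _, ep := range endpoints {
		if ep.archival {
			out = append(out, ep)
		}
	}
	return out
}

func main() {
	eps := []endpoint{{"a", true}, {"b", false}}
	fmt.Println(len(selectCandidates(eps, true)), len(selectCandidates(eps, false)))
}
```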
- User requests no longer mark endpoints as archival (they were incorrectly marking endpoints when getting successful responses to recent blocks)
- Only health checks via markEndpointArchival() can set archival=true (validates response contains expected value for known historical block)
- User requests can still clear archival status when they fail, catching false positives where health check passed but actual requests fail
- Added clearEndpointArchival() to explicitly clear status when health check validation fails (expected_response_contains mismatch)
- Added missing archival error indicators: "state histories", "not fully indexed", "historical data"
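The one-way rule above (health checks set or clear, user requests can only clear) can be sketched as follows. The type and method names are hypothetical, and the indicator list contains only the strings quoted in this PR; the real matching logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Indicator strings quoted in this PR. The real list may be longer.
var archivalErrorIndicators = []string{
	"state is pruned",
	"state histories",
	"not fully indexed",
	"historical data",
}

// looksLikeArchivalError is a hypothetical helper doing a case-insensitive
// substring match against the indicators above.
func looksLikeArchivalError(msg string) bool {
	lower := strings.ToLower(msg)
	for _, indicator := range archivalErrorIndicators {
		if strings.Contains(lower, indicator) {
			return true
		}
	}
	return false
}

// endpointState is a hypothetical, simplified archival flag holder.
type endpointState struct{ archival bool }

// onHealthCheck is the only path that can set archival=true.
func (e *endpointState) onHealthCheck(passed bool) { e.archival = passed }

// onUserResponse can clear the flag on an archival-style error but never
// sets it, even when the request succeeds (empty error message).
func (e *endpointState) onUserResponse(errMsg string) {
	if looksLikeArchivalError(errMsg) {
		e.archival = false
	}
}

func main() {
	ep := &endpointState{}
	ep.onUserResponse("") // success on a user request: flag stays false
	fmt.Println(ep.archival)
	ep.onHealthCheck(true)
	fmt.Println(ep.archival)
	ep.onUserResponse("missing trie node: state is pruned")
	fmt.Println(ep.archival)
}
```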
c737ac5 to c9a298b
…timization

Resolve false "no archival-capable endpoints" errors, fix cross-replica archival state propagation via Redis, and optimize the hot path by replacing synchronous Redis calls with in-memory cache lookups.

Archival filtering:
- Add structured endpoint filter errors with diagnostic context
- Filter fresh (unobserved) endpoints for archival requests using the archival cache, with cold-start fallback when cache is empty
- Add bi-directional archival state updates in endpoint observations
- Enforce RPCType_JSON_RPC consistency for all archival status keys

Cross-replica sync fixes (reputation/storage/redis.go):
- Fix parseKey for compound endpoint addresses containing "://" URLs
- Fix missing strings.ToUpper on rpcType before proto enum lookup
- Reduce archival cache refresh interval from 2h to 5s

Hot path optimization:
- Add TTL-based ArchivalCache for O(1) archival status lookups
- Replace synchronous Redis calls with in-memory cache in endpoint selection path
- Remove artificial concurrency semaphore from HTTP client
- Add background cache refresh worker for archival status sync

Observability:
- Add diagnostic headers (X-Archival-Request, X-Suppliers-Tried)
- Add CODE_PATH instrumentation markers for request flow tracing
- Add request ID propagation through endpoint selection
- Add structured errors for block height and chain ID filtering
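A TTL-based in-memory cache of the kind described above could be sketched like this. Only the name `ArchivalCache` comes from the commit message; the fields, method signatures, and the injectable clock (used so the example does not need to sleep) are assumptions. The background refresh worker that repopulates entries from Redis is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type archivalEntry struct {
	isArchival bool
	expiresAt  time.Time
}

// ArchivalCache is a sketch of a TTL cache with O(1) map lookups.
type ArchivalCache struct {
	mu      sync.RWMutex
	entries map[string]archivalEntry
	now     func() time.Time // injectable clock (assumption, for testing)
}

func NewArchivalCache(now func() time.Time) *ArchivalCache {
	return &ArchivalCache{entries: make(map[string]archivalEntry), now: now}
}

// Set stores the archival status for key until ttl has elapsed.
func (c *ArchivalCache) Set(key string, isArchival bool, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = archivalEntry{isArchival: isArchival, expiresAt: c.now().Add(ttl)}
}

// Get reports the cached status. found is false for missing or expired keys.
func (c *ArchivalCache) Get(key string) (isArchival, found bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	entry, ok := c.entries[key]
	if !ok || !c.now().Before(entry.expiresAt) {
		return false, false
	}
	return entry.isArchival, true
}

// fakeClock lets the example advance time manually.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clock := &fakeClock{t: time.Now()}
	cache := NewArchivalCache(clock.Now)
	cache.Set("endpoint-1", true, 5*time.Second)
	fmt.Println(cache.Get("endpoint-1")) // still within the TTL
	clock.Advance(6 * time.Second)
	fmt.Println(cache.Get("endpoint-1")) // past the TTL: treated as a miss
}
```

Treating an expired entry as a miss, rather than as "not archival", is what lets a caller distinguish an unknown endpoint from a known non-archival one, which matters for the cold-start fallback mentioned above.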
…currency

Fixes nil pointer dereference panic in handleResult when processing batch requests. The panic occurred because multiple goroutines in handleBatchRelayRequest were concurrently reading/writing rc.suppliersTried without synchronization.

Root cause: handleBatchRelayRequest spawns parallel goroutines for each batch item, all sharing the same requestContext. The hedgeRacer's handleResult and recordLoser methods would concurrently append to rc.suppliersTried, causing data race corruption that manifested as nil pointer dereference at offset 0x8.

Fix:
- Add sync.Mutex (suppliersMu) to requestContext struct
- Create thread-safe helper methods:
  - addSupplierTried(): atomic check-and-add with deduplication
  - addSuppliersTried(): batch add multiple suppliers
  - getSuppliersTried(): returns copy of slice
  - getSuppliersTriedCount(): returns count safely
- Update all access sites in hedge.go and http_request_context_handle_request.go to use these thread-safe methods
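A minimal sketch of the mutex-guarded helpers described above, assuming suppliers are tracked as plain strings. The method names follow the commit message, but the signatures, the reduced `requestContext` struct, and the `addConcurrently` driver are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// requestContext is reduced to the two fields relevant to this fix.
type requestContext struct {
	suppliersMu    sync.Mutex
	suppliersTried []string
}

// addSupplierTried appends supplier if it is not already present. The check
// and the append happen under one lock, so concurrent callers cannot
// interleave between them. It reports whether the supplier was added.
func (rc *requestContext) addSupplierTried(supplier string) bool {
	rc.suppliersMu.Lock()
	defer rc.suppliersMu.Unlock()
	for _, existing := range rc.suppliersTried {
		if existing == supplier {
			return false
		}
	}
	rc.suppliersTried = append(rc.suppliersTried, supplier)
	return true
}

// getSuppliersTried returns a copy so callers cannot mutate shared state.
func (rc *requestContext) getSuppliersTried() []string {
	rc.suppliersMu.Lock()
	defer rc.suppliersMu.Unlock()
	return append([]string(nil), rc.suppliersTried...)
}

// getSuppliersTriedCount returns the number of recorded suppliers.
func (rc *requestContext) getSuppliersTriedCount() int {
	rc.suppliersMu.Lock()
	defer rc.suppliersMu.Unlock()
	return len(rc.suppliersTried)
}

// addConcurrently is a hypothetical driver that mimics parallel batch items
// all recording the same suppliers on one shared requestContext.
func addConcurrently(rc *requestContext, suppliers []string, workers int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, s := range suppliers {
				rc.addSupplierTried(s)
			}
		}()
	}
	wg.Wait()
}

func main() {
	rc := &requestContext{}
	addConcurrently(rc, []string{"supplier-a", "supplier-b", "supplier-c"}, 50)
	fmt.Println(rc.getSuppliersTriedCount())
}
```

Running a sketch like this with `go run -race` is a quick way to confirm that no unsynchronized access to the slice remains.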
Summary
This PR implements a comprehensive set of optimizations and reliability improvements for PATH:
- /ready endpoint now provides detailed endpoint info including block consensus stats

Changes

Hot Path Optimization

Shared State via Redis

Block Height Consensus Mechanism
- Outlier rejection: blocks > median + (syncAllowance × 3) are filtered
- Self-adjusting using existing sync_allowance config (no new configuration needed)

Health Check Improvements

Enhanced /ready Endpoint
- ?detailed=true query parameter for comprehensive endpoint info
- perceived_block_height, median_block_height, block_observations

Bug Fixes

Test Plan
- make test_unit
- make e2e_test eth,pocket,xrplevm
- scripts/test_shared_state.sh
- go fmt, go vet, golangci-lint pass