Skip to content

refactor: replace PartialSuccess with clear SUCCESS/FAILED/ERROR execution status#212

Merged
chrisli30 merged 14 commits into
mainfrom
staging
Apr 14, 2026
Merged

refactor: replace PartialSuccess with clear SUCCESS/FAILED/ERROR execution status#212
chrisli30 merged 14 commits into
mainfrom
staging

Conversation

@chrisli30
Copy link
Copy Markdown
Member

Summary

  • Replace the ambiguous PartialSuccess execution status with clear three-value semantics:
    • SUCCESS — all executed steps succeeded (branch skips are normal behavior)
    • FAILED — one or more node-level steps failed
    • ERROR — system-level failure (VM crash, compilation error)
  • Update proto (enum value 4 reserved, ERROR=5 added), regenerate bindings
  • Update protobuf-to-types conversion in SDK
  • Update all test assertions to match new semantics
  • Swap loopNode GraphQL test endpoint from gateway.thegraph.com (now requires auth) to free public countries API

Aligns with EigenLayer-AVS PR AvaProtocol/EigenLayer-AVS#521

Test plan

  • partialSuccess.test.ts — 4/4 pass (step failures → Failed, all succeed → Success)
  • errorCodeConsistency.test.ts — 2/2 pass
  • simulateWorkflow.test.ts — 7/7 pass
  • getExecution.test.ts — 6/6 pass
  • contractWrite.test.ts — all pass
  • ethTransfer.test.ts — all pass
  • graphqlQuery.test.ts — all pass
  • loopNode.test.ts — 1 pre-existing failure (GraphQL runner backend bug, tracked in bug: Loop node GraphQL runner never executes during iterations EigenLayer-AVS#523)
  • Build passes cleanly (npm run build)

chrisli30 and others added 4 commits April 8, 2026 23:47
…W gRPC (#210)

Co-authored-by: Wei Lin <wei@avaprotocol.org>
The aggregator's authWithAPIKey flow re-mints a short-lived JWT whose
`sub` claim is taken from the targetAddress passed to
getSignatureFormat, NOT from the admin API key's own subject. The SDK
previously used "0x0000000000000000000000000000000000000000" as a dummy
when no targetAddress was supplied, which produced a re-minted token
with sub: 0x0... and silently broke every downstream ownership check.

We hit this trying to recover funds from a stale Base smart wallet via
withdrawAll: the WithdrawFunds RPC kept reporting
"smart wallet not found for owner 0x0000..." even though the admin API
key had the correct subject. The SDK's dummy-zero-address bypass was
the culprit, and the in-tree withdrawAllFromWallet didn't pass
targetAddress to getAuthenticatedClient.

Two changes:

1. getAuthenticatedClient now refuses the API-key path when
   targetAddress is missing:
   - api-key-only:   throws a clear error explaining the requirement.
   - prefer-api-key: falls through to the private-key path so commands
                     that don't know the target owner ahead of time
                     (getWorkflow / getExecution by ID) still work
                     when both auth methods are configured.

2. withdrawAllFromWallet now passes ownerAddress as targetAddress so
   the re-minted JWT's sub matches the smart wallet's actual owner.

Tested end-to-end against aggregator-base on production: withdrawAll
now successfully recovered ETH and USDC from a stale wallet
(0x71c8f4D7...) for owner 0xc60e71bd... using an admin API key.
Align SDK with EigenLayer-AVS PR #521 which removes the ambiguous
PARTIAL_SUCCESS execution status.

New model:
  SUCCESS - all executed steps succeeded (branch skips are normal)
  FAILED  - one or more node-level steps failed
  ERROR   - system-level failure (VM crash, compilation error)

- Update proto, regenerate bindings (enum value 4 reserved, ERROR=5 added)
- Replace ExecutionStatus.PartialSuccess with .Error in types enum
- Update protobuf-to-types conversion in both execution model and index
- Update all test assertions: PartialSuccess -> Failed for step failures,
  [Success, PartialSuccess] -> [Success, Failed] for flaky assertions
- Swap loopNode GraphQL test to use free public endpoint (countries API)
  since gateway.thegraph.com now requires auth
@claude
Copy link
Copy Markdown

claude Bot commented Apr 13, 2026

Code Review

Overall this is a clean, well-motivated refactor. The three-value semantics (SUCCESS / FAILED / ERROR) are clearer than the old PartialSuccess, the proto reservation of value 4 is correct proto3 practice, and the TIMEOUT_DURATION_SLOW derivation from TimeoutPresets.SLOW is nicely future-proof. A few issues worth addressing before merge:


1. Breaking change needs a changeset [High]

Removing ExecutionStatus.PartialSuccess from @avaprotocol/types and adding ExecutionStatus.Error is a semver-major breaking change for any consumer of the types package. The repo uses Changesets for release management (yarn changeset), but no .changeset/*.md file is included in this PR. Without it, the breaking change won't be reflected in the version bump.


2. Stale JSDoc comment [Medium]

packages/sdk-js/src/index.ts:76 was not updated:

// Still says: 4=PARTIAL_SUCCESS
* @param protobufStatus - The numeric status from protobuf (0=UNSPECIFIED, 1=PENDING, 2=SUCCESS, 3=FAILED, 4=PARTIAL_SUCCESS)

Should be updated to document that value 4 is reserved and 5=ERROR.


3. Non-deterministic test assertions still present [Medium]

CLAUDE.md explicitly forbids: "never use conditional checks that allow tests to pass on errors". This PR converts [Success, PartialSuccess] to [Success, Failed] but the non-deterministic pattern itself remains in several files. Since these assertions are being touched anyway, this is a good opportunity to fix them:

Multi-value toContain pattern (gasTracking.test.ts x5, ethTransfer.test.ts x2, loopNode.test.ts, templates/*.test.ts x3):

// Masks regressions -- a genuine failure still passes this assertion
expect([ExecutionStatus.Success, ExecutionStatus.Failed]).toContain(triggerResult.status);

Conditional assertion pattern (getExecution.test.ts:70-75):

// Conditional assertions are explicitly forbidden in CLAUDE.md
if (hasWriteFailure) {
  expect(execution.status).toBe(ExecutionStatus.Failed);
} else {
  expect(execution.status).toBe(ExecutionStatus.Success);
}

If the outcome is genuinely environment-dependent, at minimum open a tracking issue and add a comment rather than silently accepting both outcomes.


4. External API dependency in loopNode.test.ts [Medium]

The test now calls https://countries.trevorblades.com/graphql (a third-party public service). If that endpoint is slow, rate-limited, or down, CI fails for infrastructure reasons unrelated to the code change. The PR description references issue #211 for a mock server plan -- consider skipping this specific test in CI until that lands (similar to how describeIfSepolia gates network-dependent tests).


5. test:core silently excludes withdraw tests [Low]

package.json adds --testPathIgnorePatterns=tests/core/withdraw.test.ts to test:core. A developer running yarn test:core for full core coverage will silently miss withdraw failures. The new test:withdraw command is good, but the exclusion isn't self-documenting in context -- a short comment (e.g. # withdraw is slow; run separately with yarn test:withdraw) would help.


6. Generic error message assertion [Low]

tests/executions/simulateWorkflow.test.ts:356:

expect((error as Error).message).toContain("failed");

"failed" is common enough to pass on unrelated errors. If the server returns a specific string for this case, assert on that. If the message is unpredictable, expect(error).toBeDefined() is more honest about what's being tested.


7. Duplicate conversion functions [Low / Pre-existing]

Two functions do the same work:

  • convertProtobufExecutionStatus in packages/sdk-js/src/index.ts:79
  • convertProtobufExecutionStatusToTypes in packages/sdk-js/src/models/execution.ts

This PR touches both, which is a good time to have one delegate to the other. Not a blocker.


What's done well

  • Proto reservation: reserved 4; reserved "EXECUTION_STATUS_PARTIAL_SUCCESS"; is correct proto3 practice to prevent accidental reuse of a retired wire value.
  • GitHub Actions -T fix: The TTY/\r corruption bug is subtle; the comment explains it clearly enough that the next person won't remove it.
  • TIMEOUT_DURATION_SLOW derivation: Deriving the jest timeout from TimeoutPresets.SLOW's own fields (timeout * (1 + retries) + retryDelay * retries + buffer) means it stays in sync automatically if the preset changes.
  • example.ts auth flow fix: The guard against a missing targetAddress with api-key-only is correct, and the inline comment explaining the re-minted JWT sub claim issue is genuinely useful context for future maintainers.
  • partialSuccess.test.ts cleanup: Collapsing expect([PartialSuccess, Failed, Success, Unspecified]).toContain(...) to expect(...).toBe(Failed) is the right direction and directly improves test determinism.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors execution status semantics by removing the ambiguous PartialSuccess status and standardizing on SUCCESS / FAILED / ERROR across the proto, SDK, and integration tests, while also addressing a few E2E reliability issues (timeouts, external GraphQL endpoint, CI API key capture).

Changes:

  • Update protobuf ExecutionStatus (reserve value 4, add ERROR=5) and regenerate gRPC bindings.
  • Update SDK protobuf→types execution-status conversions and adjust all test assertions to the new semantics.
  • Improve E2E stability: add a slow Jest timeout preset for bundler-backed tests, adjust withdraw RPC timeouts, and swap the LoopNode GraphQL test endpoint to a free public API.

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/workflows/triggerWorkflow.test.ts Use slow Jest timeout for bundler-backed trigger tests.
tests/utils/utils.ts Add TIMEOUT_DURATION_SLOW derived from TimeoutPresets.SLOW.
tests/templates/uniswapv3_stoploss.test.ts Update expected statuses to Success/Failed per new semantics.
tests/templates/recurring-payment-with-report.test.ts Update expected statuses to Success/Failed.
tests/templates/batch-recurring-payment-with-email.test.ts Update expected statuses to Success/Failed.
tests/nodes/loopNode.test.ts Use slow Jest timeout; swap GraphQL runner endpoint and inputs; update status expectations.
tests/nodes/graphqlQuery.test.ts Update network-error simulation expectation to Failed.
tests/nodes/ethTransfer.test.ts Use slow Jest timeout and update simulation status expectations.
tests/nodes/contractWrite.test.ts Use slow Jest timeout and update write-failure status mapping to Failed.
tests/executions/simulateWorkflow.test.ts Update failure expectation from PartialSuccess to Failed.
tests/executions/runNodeWithInputs.test.ts Use slow Jest timeout for bundler-backed calls.
tests/executions/partialSuccess.test.ts Rename suite and rewrite assertions to the new Failed semantics.
tests/executions/getExecution.test.ts Tighten write-failure assertion to Failed.
tests/executions/gasTracking.test.ts Update accepted statuses to Success/Failed.
tests/executions/errorCodeConsistency.test.ts Update expected status to Failed.
tests/core/withdraw.test.ts Increase timeouts for withdraw calls using TimeoutPresets.SLOW.
packages/types/src/enums.ts Replace PartialSuccess with Error in ExecutionStatus.
packages/sdk-js/src/models/execution.ts Map protobuf EXECUTION_STATUS_ERROR to types ExecutionStatus.Error.
packages/sdk-js/src/index.ts Map protobuf EXECUTION_STATUS_ERROR to types ExecutionStatus.Error.
package.json Split withdraw tests into test:withdraw; exclude them from test:core.
grpc_codegen/avs_pb.js Regenerated enum: remove PARTIAL_SUCCESS, add ERROR.
grpc_codegen/avs_pb.d.ts Regenerated TS enum: remove PARTIAL_SUCCESS, add ERROR.
grpc_codegen/avs.proto Update ExecutionStatus definition: reserve 4, add ERROR=5, clarify semantics.
examples/example.ts Harden API-key auth flow (require targetAddress); bind withdraw auth to owner address.
config/chains.ts Update Base avsEndpoint from localhost to the public aggregator endpoint.
.github/workflows/dev-test-on-pr.yml Fix API key capture by disabling TTY allocation (docker compose exec -T).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/sdk-js/src/index.ts Outdated
@@ -86,8 +86,8 @@ function convertProtobufExecutionStatus(
return ExecutionStatus.Success;
case ProtobufExecutionStatus.EXECUTION_STATUS_FAILED:
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protobuf enum previously used value 4 for PARTIAL_SUCCESS; if an older backend (or cached execution records) still returns numeric 4, this converter will currently fall through to Unspecified. Consider explicitly handling numeric value 4 (map to ExecutionStatus.Failed) to preserve backward compatibility while migrating.

Suggested change
case ProtobufExecutionStatus.EXECUTION_STATUS_FAILED:
case ProtobufExecutionStatus.EXECUTION_STATUS_FAILED:
case 4:

Copilot uses AI. Check for mistakes.
case avs_pb.ExecutionStatus.EXECUTION_STATUS_PARTIAL_SUCCESS:
return ExecutionStatus.PartialSuccess;
case avs_pb.ExecutionStatus.EXECUTION_STATUS_ERROR:
return ExecutionStatus.Error;
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in sdk-js/index.ts: protobuf can legally carry unknown enum numeric values. To avoid misreporting legacy status value 4 (former PARTIAL_SUCCESS) as Unspecified, handle numeric 4 explicitly (e.g., map to ExecutionStatus.Failed).

Suggested change
return ExecutionStatus.Error;
return ExecutionStatus.Error;
case 4:
return ExecutionStatus.Failed;

Copilot uses AI. Check for mistakes.
Comment thread package.json Outdated
"clean": "rm -rf node_modules yarn.lock && yarn workspaces run clean",
"test": "jest --config jest.config.cjs --detectOpenHandles --runInBand --silent",
"test:core": "dotenv -e .env.$TEST_ENV jest tests/core/ --runInBand --silent",
"test:core": "dotenv -e .env.$TEST_ENV jest tests/core/ --testPathIgnorePatterns=tests/core/withdraw.test.ts --runInBand --silent",
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test:core now excludes withdraw.test.ts, but the CI workflow still runs only yarn test:core for the core suite (no test:withdraw). This effectively removes withdraw coverage from PR checks; consider either keeping withdraw in test:core or updating CI to run the new test:withdraw script as its own suite.

Suggested change
"test:core": "dotenv -e .env.$TEST_ENV jest tests/core/ --testPathIgnorePatterns=tests/core/withdraw.test.ts --runInBand --silent",
"test:core": "dotenv -e .env.$TEST_ENV jest tests/core/ --runInBand --silent",

Copilot uses AI. Check for mistakes.
Comment on lines +207 to 208
API_KEY=$(docker compose exec -T aggregator /ava create-api-key --config /app/config/aggregator.runtime.yaml --role=admin --subject=0x804e49e8C4eDb560AE7c48B554f6d2e27Bb81557 2>/dev/null)
echo "AVS_API_KEY=$API_KEY" >> $GITHUB_ENV
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since test:core now skips withdraw.test.ts (package.json), CI currently won’t run withdraw tests unless test:withdraw is added to the matrix or executed as an extra step. Consider extending the workflow to include withdraw as its own suite (or run it in the core job) so coverage doesn’t regress.

Suggested change
API_KEY=$(docker compose exec -T aggregator /ava create-api-key --config /app/config/aggregator.runtime.yaml --role=admin --subject=0x804e49e8C4eDb560AE7c48B554f6d2e27Bb81557 2>/dev/null)
echo "AVS_API_KEY=$API_KEY" >> $GITHUB_ENV
API_KEY=$(docker compose exec -T aggregator /ava create-api-key --config /app/config/aggregator.runtime.yaml --role=admin --subject=0x804e49e8C4eDb560AE7c48B554f6d2e27Bb81557 2>/dev/null | tr -d '\r')
API_KEY=${API_KEY%$'\n'}
echo "::add-mask::$API_KEY"
{
echo "AVS_API_KEY<<EOF"
echo "$API_KEY"
echo "EOF"
} >> "$GITHUB_ENV"

Copilot uses AI. Check for mistakes.
Comment thread examples/example.ts
Comment on lines 619 to +624
withdrawClient = await getAuthenticatedClient({
strategy: avsApiKey ? "prefer-api-key" : "private-key-only",
commandName: "withdrawAll",
// Bind the re-minted JWT to the wallet's actual owner so
// validateSmartWalletOwnership hits w:<owner>:<wallet> on the server.
targetAddress: ownerAddress,
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withdrawAllFromWallet still advertises [owner-address] as optional, but when AVS_API_KEY is set and ownerAddress is omitted, getAuthenticatedClient will skip API-key auth and can fail before it has a chance to resolve the owner later in the function. Consider either (1) requiring ownerAddress when using API-key auth (and updating the usage/error), or (2) resolving the owner address before calling getAuthenticatedClient when AVS_API_KEY is present.

Copilot uses AI. Check for mistakes.
- Map proto enum value 4 (retired PARTIAL_SUCCESS) to Failed for
  backward compat with older aggregator responses or stored executions
- Update JSDoc to reflect new enum values (4=reserved, 5=ERROR)
- Tighten simulateWorkflow error assertion from "failed" to "steps failed"
Major bump for @avaprotocol/types and @avaprotocol/sdk-js —
ExecutionStatus.PartialSuccess removed, Error added.
Tests previously used mock-api.ap-aggregator.local -- a non-existent
host that produced DNS failures. Several tests asserted on the error
message as a way to verify the execution pipeline, which was fragile
and couldn't cover success paths.

Replace with a real HTTP mock server started in Jest globalSetup on
localhost:19876. The aggregator can reach it, responses are deterministic,
and tests can now verify both success and error behavior.

Also consolidates duplicate fee/status converters in sdk-js -- index.ts
now imports from models/execution.ts instead of reimplementing the
same conversion logic inline.

Add CLAUDE.md rule: always pipe e2e test output to ./logs/ so failures
can be inspected without rerunning.

Closes #211
Sepolia produces blocks every ~12s. Waiting for 5 blocks (~60s) was

flaky and often exceeded the 50s polling timeout. interval=1 makes

the test complete in ~12s.
The try/catch silently replaced real errors with an empty items

array, hiding auth/network/server failures as a 'no executions yet'

timeout. Let errors propagate so failures surface with the real cause.
Pull the Permit2 address from uniswapV3Contracts config so the test

stays correct across chains. Also remove stale 'Expected address' comment

on the smart wallet derivation since the derived address depends on

chainId/factory/salt and shouldn't be pinned in a comment.
Without '--', dotenv-cli strips any --flag args before invoking jest.
This meant --testPathIgnorePatterns=tests/core/withdraw.test.ts on
test:core was silently dropped, so CI was still running withdraw tests
and failing with insufficient-balance errors.

Add '--' to all test:* scripts so jest receives its flags intact.
Confirmed locally: yarn test:core now correctly excludes withdraw.test.ts.
Releases:
  @avaprotocol/types@2.13.0
  @avaprotocol/sdk-js@2.17.0

[skip ci]
@chrisli30 chrisli30 merged commit 3ff909b into main Apr 14, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants