Deterministic WASM trap in cpython!py_gl_call with no structured error (code=None) — causes consensus-worker retry loop #288

@MuncleUscles


Summary

Deterministic WASM trap in cpython!py_gl_call with code=None, causes=[] — no structured GenVM error is produced, only a Fingerprint { module_instances: {"cpython": ..., "softfloat": ...} } snapshot. Repros every time for one specific (contract, calldata) pair in production Studio.

Because the error carries no structured code, the consensus worker classifies it as retryable and loops forever — one poisoned tx produced ~6,500 identical errors before we cancelled it manually.
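For context, the worker's retry decision looks roughly like the sketch below (simplified, with illustrative names; this is not the actual consensus-worker source). With code=None there is nothing to match on, so the error falls through to the retry branch every time:

# Simplified sketch of the studio-side retry decision; names are
# illustrative, not the actual consensus-worker code.
NON_RETRYABLE = {"INVALID_CONTRACT", "OUT_OF_GAS", "USER_ERROR"}

def classify(error_code: str | None) -> str:
    if error_code in NON_RETRYABLE:
        return "abort"
    # code=None lands here: with nothing to match on, the failure is
    # assumed transient and the tx is re-queued indefinitely.
    return "retry"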

Environment

  • Executor: v0.2.12 (prd) — note: current release is v0.2.16, so this may already be fixed. First question: does this still repro on v0.2.16? If yes, details below. If no, we'll just upgrade.
  • Studio backend: yeagerai/simulator-jsonrpc@sha256:7217dfadd020… (prd, 2026-04-15)
  • Reference: Sentry GENLAYER-STUDIO-11X, stuck tx hash 0x451098a355fe114f89575d720595cf87ed34a065e425a0aa3c56adbd14e6b1d5 (now CANCELED in prd DB), contract 0xf074a62BBfd331e62221a159853D536EA2ca9733.

Error output

ERROR backend.consensus.worker:_transaction_context:633 GenVM internal error during transaction 0x451098…:
  code=None,
  causes=[],
  is_fatal=False,
  is_leader=True,
  message=GenVM internal error,
  detail=Fingerprint {
    module_instances: {
      "cpython":   ModuleFingerprint { memories: [MemoryFingerprint([17, 168, 94, 202, 147, 39, 85, 220, 86, 96, 211, 83, 178, 29, 159, 37, 93, 53, 82, 163, 229, 19, 40, 219, 75, 163, 203, 76, 203, 98, 94, 248])] },
      "softfloat": ModuleFingerprint { memories: [MemoryFingerprint([252, 158, 79, 163, 68, 141, 165, 13, 198, 255, 81, 74, 186, 5, 104, 186, 4, 82, 118, 245, 141, 60, 96, 253, 244, 197, 195, 210, 139, 22, 172, 200])] }
    }
  }

Caused by:
    0: error while executing at wasm backtrace:
           0: 0x1039b  - cpython!py_gl_call
           1: 0xa3f111 - cpython!cfunction_call
           2: 0x8fead3 - cpython!_PyObject_MakeTpCall
           3: 0x8ff81b - cpython!PyObject_Vectorcall
           4: 0x9172b6 - cpython!_PyEval_EvalFrameDefault
           5: 0x906093 - cpython!PyEval_EvalCode
           6: 0xad3d5b - cpython!run_eval_code_obj
           7: 0xad3bfc - cpython!run_mod.llvm.8421269000175133780
           8: 0xad1c83 - cpython!_PyRun_SimpleFileObject

The trap is inside py_gl_call (the cpython module's host-call dispatcher), before any Lua code in llm.lua runs — that's why no structured cause bubbles up.

Repro contract

# { "Depends": "py-genlayer:1jb45aa8ynh2a9c9xn3b7qqh8sm5q93hwfp7jqmwsfhh8jpz09h6" }
from genlayer import *
import base64
import json


class MarketSignalOracle(gl.Contract):
    market_name: str
    last_evaluation: str
    evaluations_by_candidate: str
    recent_candidate_ids: str

    def __init__(self, market_name: str):
        self.market_name = market_name
        self.last_evaluation = json.dumps({
            "candidateId": "",
            "verdict": "ignore",
            "confidence": 0.0,
            "reasoningSummary": "No candidate has been evaluated yet.",
            "alertDecision": False,
            "tags": [],
            "metadata": {
                "market": market_name,
                "symbol": "",
                "version": "v1",
                "dominantSignal": "none",
                "riskBias": "neutral",
            },
        }, sort_keys=True)
        self.evaluations_by_candidate = json.dumps({}, sort_keys=True)
        self.recent_candidate_ids = json.dumps([])

    @gl.public.write
    def evaluate_candidate(self, candidate_id: str, symbol: str, captured_at: str,
                           raw_signal_json: str, metrics_json: str, history_context_json: str) -> None:
        raw_signal      = json.loads(raw_signal_json)
        metrics         = json.loads(metrics_json)
        history_context = json.loads(history_context_json)

        prompt = f"""
You are a GenLayer market signal evaluator for crypto derivatives.
Evaluate one anomaly candidate for {self.market_name}.

Candidate ID: {candidate_id}
Symbol: {symbol}
Captured at: {captured_at}
Raw signal JSON: {json.dumps(raw_signal, sort_keys=True)}
Metrics JSON: {json.dumps(metrics, sort_keys=True)}
History context JSON: {json.dumps(history_context, sort_keys=True)}
"""

        def leader_fn():
            return gl.nondet.exec_prompt(prompt, response_format="json")

        def validator_fn(leader_result) -> bool:
            return isinstance(leader_result, gl.vm.Return)

        evaluation = gl.vm.run_nondet_unsafe(leader_fn, validator_fn)
        self.last_evaluation = json.dumps(evaluation, sort_keys=True)

(Full original contract is longer but the minimal repro above has the same two gl.* calls on the hot path.)

Constructor arg: market_name="BTCUSDT".

Repro calldata (for evaluate_candidate)

Arguments, decoded from the prd calldata:

Arg                   Value
candidate_id          "base64-test-20260415-01"
symbol                "BTCUSDT"
captured_at           "2026-04-15T14:35:00Z"
raw_signal_json       '{"score":84,"severity":"high","triggeredRules":["funding_dislocation","open_interest_spike"]}'
metrics_json          '{"price":75139.6,"priceChange1h":1.9,"priceChange24h":4.2,"volume24h":123456789,"volumeChangeShort":22.4,"openInterest":99887766,"openInterestChange1h":14.8,"fundingRate":0.00045,"fundingRateDelta":0.0002,"longLiquidations":2500000,"shortLiquidations":1200000}'
history_context_json  (base64 of a JSON object with snapshotWindow:"6h" and a recentEvents array)

Total calldata size: ~770 bytes. Not abnormally large — so this is unlikely to be a memory/stack blow-up from sheer size.

The candidate ID base64-test-20260415-01 suggests the user was specifically probing base64-encoded JSON string arguments. That may be a hint about what's upsetting the host.
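For anyone reproducing locally, here is a sketch that rebuilds the decoded arguments above in Python. The real recentEvents entries were not recovered from the prd calldata, so that part is a labeled placeholder:

import base64
import json

raw_signal_json = '{"score":84,"severity":"high","triggeredRules":["funding_dislocation","open_interest_spike"]}'
metrics_json = '{"price":75139.6,"priceChange1h":1.9,"priceChange24h":4.2,"volume24h":123456789,"volumeChangeShort":22.4,"openInterest":99887766,"openInterestChange1h":14.8,"fundingRate":0.00045,"fundingRateDelta":0.0002,"longLiquidations":2500000,"shortLiquidations":1200000}'

# Placeholder: the real recentEvents entries are not recoverable from the logs.
history_context = {"snapshotWindow": "6h", "recentEvents": []}
history_context_json = base64.b64encode(
    json.dumps(history_context, sort_keys=True).encode()
).decode()

args = (
    "base64-test-20260415-01",  # candidate_id
    "BTCUSDT",                  # symbol
    "2026-04-15T14:35:00Z",     # captured_at
    raw_signal_json,
    metrics_json,
    history_context_json,
)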

What we've confirmed

  1. Deterministic. Same tx ran ~6,500 times with the same fingerprint.
  2. Not a user-code error. The trap is in py_gl_call, before user Python code can throw. Python stack frames above it (PyEval_EvalFrameDefault, etc.) are just the dispatch path into the host call.
  3. No structured cause. Because of that, causes=[] and code=None, which defeats the consensus-worker's classification.
  4. Two gl.* calls on the hot path. We can't tell from the studio-side logs which of them traps; we're not capturing executor stderr in Sentry, only the consensus-worker wrapper. If you have a way to run this locally with executor-side logging, that will nail down which call (see the bisection sketch after this list).
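Bisection sketch (the contract name is hypothetical; it uses the same py-genlayer pin as the repro). Variant A stubs out the leader so only gl.vm.run_nondet_unsafe's dispatch path is exercised: if A runs clean while the full repro traps, gl.nondet.exec_prompt is the culprit; if A also traps, it's run_nondet_unsafe itself.

# { "Depends": "py-genlayer:1jb45aa8ynh2a9c9xn3b7qqh8sm5q93hwfp7jqmwsfhh8jpz09h6" }
from genlayer import *


class BisectVariantA(gl.Contract):
    @gl.public.write
    def probe(self) -> None:
        def leader_fn():
            # Constant result: the gl.nondet.exec_prompt host call is never made.
            return "{}"

        def validator_fn(leader_result) -> bool:
            return True  # accept anything; keeps the validator path trivial

        gl.vm.run_nondet_unsafe(leader_fn, validator_fn)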

Asks

  1. Is this fixed in v0.2.16? Please try the repro above. If yes, we'll just bump our prd executor.
  2. If it still repros on head: we need to know which of the two gl.* calls traps, and why no structured error is produced (with code=None the Python side has no way to classify it as non-retryable).
  3. Independent of this bug, a structured error for this class of trap would be very useful — even INTERNAL_ERROR with a one-line message is enough for the studio side to stop hammering poisoned txs.

Cross-ref: we'll also add a max-retry counter on our side (studio consensus/worker.py) so one poisoned tx can't produce thousands of events regardless of the GenVM-side root cause.
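A minimal sketch of that cap (hypothetical names and threshold, not the shipped consensus/worker.py code):

# Hypothetical sketch of the retry cap for studio consensus/worker.py;
# names and threshold are illustrative.
MAX_GENVM_RETRIES = 5
_retry_counts: dict[str, int] = {}

def should_retry(tx_hash: str, error_code: str | None) -> bool:
    # Count attempts per tx so an unclassifiable (code=None) error cannot
    # loop forever, regardless of the GenVM-side root cause.
    n = _retry_counts.get(tx_hash, 0) + 1
    _retry_counts[tx_hash] = n
    if n > MAX_GENVM_RETRIES:
        return False  # poisoned tx: cancel and surface to an operator
    # If GenVM starts emitting a structured code (ask 3), classify it as
    # non-retryable here instead of burning the retry budget.
    return error_code is None or error_code != "INTERNAL_ERROR"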
