Proposes a fourth repository in the WasmAgent ecosystem — erp-agent
— positioned as a sibling of bscode rather than a replacement.
The two reference apps share the same runtime (wasmagent-js) and
data factory (trace-pipeline); they differ only in their tools,
verifiers, and target deployment surface.
The thesis: the wasmagent flywheel is task-agnostic by design. bscode is the first reference app (coding); ERP / business-API
agents are a different vertical with materially different economics
and a much weaker public data baseline. Adding a sibling proves the
task-agnostic claim and opens a market with stronger willingness
to pay than coding tools.
This is a scoping RFC: it asks for agreement on the structure
and the boundary between "what lives in the new repo" vs. "what gets
upstreamed into wasmagent-js / aep / trace-pipeline". It is not yet
a build plan.
Why this is structurally possible
The current ecosystem already separates task-agnostic primitives
from task-specific glue, even though we've only ever instantiated
one task (coding via bscode). The split:
| Layer | Repo | Task-agnostic? |
|---|---|---|
| WASM runtime, model adapters, ranking, AEP emitter | wasmagent-js | yes |
| MCP firewall, gateway, taint, consent, lease | wasmagent-js | yes |
| Compliance verifier framework + repair planner | wasmagent-js | mostly: the verifier interface is task-agnostic; concrete verifiers are not |
Concretely, packages/core/src/agents/verifiers/types.ts:38-44 already declares the verify_method field as an open string union — built-ins are
listed for autocomplete, but applications can register custom kinds
via VerificationPipeline.register(). The docstring explicitly says:
This keeps the protocol product-agnostic — bscode's "build_passes"
or a CI's "lighthouse_score_min" verifier registers without
touching WasmAgent core.
ERP-specific verifiers (order_state_machine_valid, ledger_balanced, permission_boundary_respected, ...) fit the same registration
pattern. Nothing in the runtime needs to know they exist.
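To make the registration pattern concrete, here is a minimal sketch of how an ERP verifier could plug in. Everything in it is an assumption for illustration: the `VerificationPipeline` class, the `Verifier` and `VerifierResult` interfaces, and the `JournalEntry` shape are local stand-ins, not the real definitions in `packages/core/src/agents/verifiers/types.ts`. Only the open-string-union idea and the `ledger_balanced` name come from the text above.

```typescript
// Hypothetical stand-ins for illustration only. The real definitions live in
// packages/core/src/agents/verifiers/types.ts and may look different.
type BuiltInVerifyMethod = "build_passes";
// Open string union: built-ins autocomplete, any other string is accepted.
type VerifyMethod = BuiltInVerifyMethod | (string & {});

interface VerifierResult {
  passed: boolean;
  reason?: string;
}

interface Verifier<Ctx> {
  verify(ctx: Ctx): VerifierResult;
}

class VerificationPipeline<Ctx> {
  // Prototype-less object so inherited keys never look like registrations.
  private readonly verifiers: { [method: string]: Verifier<Ctx> } =
    Object.create(null);

  register(method: VerifyMethod, verifier: Verifier<Ctx>): void {
    this.verifiers[method] = verifier;
  }

  run(method: VerifyMethod, ctx: Ctx): VerifierResult {
    const verifier = this.verifiers[method];
    if (!verifier) {
      return { passed: false, reason: "no verifier registered for " + method };
    }
    return verifier.verify(ctx);
  }
}

// Assumed ERP-side context. Amounts are integer cents so sums are exact.
interface JournalLine {
  account: string;
  debitCents: number;
  creditCents: number;
}

interface JournalEntry {
  lines: JournalLine[];
}

// Post-condition check: total debits must equal total credits.
const ledgerBalanced: Verifier<JournalEntry> = {
  verify(entry: JournalEntry): VerifierResult {
    let debit = 0;
    let credit = 0;
    entry.lines.forEach((line) => {
      debit += line.debitCents;
      credit += line.creditCents;
    });
    if (debit === credit) {
      return { passed: true };
    }
    return { passed: false, reason: "debits " + debit + " != credits " + credit };
  },
};

// The pipeline itself contains no ERP knowledge; the app registers the kind.
const pipeline = new VerificationPipeline<JournalEntry>();
pipeline.register("ledger_balanced", ledgerBalanced);
```

The point of the sketch is that `"ledger_balanced"` type-checks without being added to the built-in union, which is the property the RFC relies on.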
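Proposed structure
The shared-with-bscode surface is roughly:
apps/worker/src/app.ts (Hono app skeleton)
apps/worker/src/middleware/auth.ts, rateLimit.ts
apps/worker/src/config/productionGuard.ts
apps/worker/src/build-results.ts → renamed to verifier-results.ts (same shape, different semantics)
apps/worker/src/rollout-adapter.ts (same pattern, adapted to ERP verifiers)
apps/worker/src/trajectoryExport.ts
apps/worker/scripts/test-aep-roundtrip.ts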
The diff-from-bscode surface is:
apps/worker/src/tools/ (all ERP API SDK wrappers + governance metadata)
apps/worker/src/verifiers/ (domain logic — the moat)
apps/worker/src/policies/ (lease shapes specific to financial/order side-effects)
Verifier taxonomy
Coding verifiers operate on deterministic build artifacts
(exitCode === 0). ERP verifiers operate on business-state
invariants which are mostly post-condition checks against the
target system's API. Proposed taxonomy:
| Verifier family | What it checks | Example |
|---|---|---|
| State machine | Did the entity transition through a legal state edge? | quote → sales_order requires customer.credit_status == "ok" |
| Permission boundary | Did the call respect the principal's role / segment / region? | Buyer in EMEA cannot approve PO over €5k without VP sign-off |
| Idempotency | Did a retry produce the same observable effect? | Two POSTs with the same idempotency-key → one row, not two |
| Audit-trail completeness | Did the underlying ERP create the expected audit records? | Approval action → audit_log row with principal + reason |
| Schema-drift detection | Did the response still match the contract we trained on? | NetSuite added a field, our prompt template now drifts |
The first four are the high-value moat; the last three are
defensive. All seven plug into VerificationPipeline.register()
exactly like BuildPassesVerifier does today.
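As an illustration of the state-machine family, the following sketch checks one transition against a table of legal edges, including the credit-status guard on quote → sales_order. The state names beyond those two, the `Transition` shape, and the `orderStateMachineValid` function are assumptions made up for this example; a real Odoo workflow would define its own edges.

```typescript
// Hypothetical states and shapes; a real ERP workflow defines its own.
type OrderState = "quote" | "sales_order" | "invoiced" | "cancelled";

interface Customer {
  creditStatus: "ok" | "hold";
}

interface Transition {
  from: OrderState;
  to: OrderState;
  customer: Customer;
}

interface Edge {
  from: OrderState;
  to: OrderState;
  // Optional business-rule guard evaluated on top of edge legality.
  guard?: (t: Transition) => boolean;
}

const LEGAL_EDGES: Edge[] = [
  {
    from: "quote",
    to: "sales_order",
    guard: (t) => t.customer.creditStatus === "ok",
  },
  { from: "quote", to: "cancelled" },
  { from: "sales_order", to: "invoiced" },
  { from: "sales_order", to: "cancelled" },
];

interface StateCheck {
  passed: boolean;
  reason?: string;
}

function orderStateMachineValid(t: Transition): StateCheck {
  // First: is there any edge at all between the two states?
  const edge = LEGAL_EDGES.filter(
    (e) => e.from === t.from && e.to === t.to
  )[0];
  if (!edge) {
    return { passed: false, reason: "illegal edge " + t.from + " -> " + t.to };
  }
  // Second: does the business guard on that edge hold?
  if (edge.guard && !edge.guard(t)) {
    return { passed: false, reason: "guard failed on " + t.from + " -> " + t.to };
  }
  return { passed: true };
}
```

Keeping the edges in a data table rather than in branching code means a domain expert can review the legal transitions without reading the verifier logic.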
Compatibility with AEP v0.2 / v0.3
ERP tool calls map cleanly onto the current AEP record shape and
benefit from the v0.3 additions in #7 more than coding does:
| AEP v0.3 field | Why it matters more for ERP than for coding |
|---|---|
| side_effect_class | Coding: read vs write vs network. ERP: read vs financial-mutate vs network-egress-to-third-party. The distinction is regulatory. |
| state_digest_kind: "db-rowset" + coverage descriptor | Pre/post digest over an explicit table + row predicate is exactly the shape an ERP post-condition verifier needs. The coverage descriptor (database, table, rows_predicate) was designed with this in mind even though the prompt was @armorer-labs. |
| argument_drift | High-stakes: a model that "approves PO #1234" then drifts to "approves PO #5678" is a real bug class in production. v0.3's one-record-per-event rule makes the audit trail explicit. |
| approval_mode: "bounded-lease" | The natural shape for "this agent can post journal entries in cost-centre X for the next 60 minutes, up to 10 entries, total ≤ $50k". |
| deny_reason_class: "missing-delegation" | Maps directly to SoD (segregation-of-duties) violations in financial controls. |
No new AEP schema needs to be invented for ERP. The v0.3 RFC fields
work as-is. This is the strongest argument that the architecture
is task-agnostic in practice, not just in slides.
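The bounded-lease example above ("cost-centre X, next 60 minutes, up to 10 entries, total ≤ $50k") can be sketched as a pure check over a lease and its running usage. The `BoundedLease`, `LeaseUsage`, and `JournalPost` shapes, the `checkLease` function, and the reason strings are all hypothetical; they are not the AEP v0.3 schema, only one way the policy side under apps/worker/src/policies/ might express it.

```typescript
// Hypothetical lease shape; field names are illustrative, not AEP v0.3.
interface BoundedLease {
  costCentre: string;
  expiresAtMs: number; // absolute expiry, epoch milliseconds
  maxEntries: number;
  maxTotalCents: number; // integer minor units to avoid float drift
}

interface LeaseUsage {
  entries: number;
  totalCents: number;
}

interface JournalPost {
  costCentre: string;
  amountCents: number;
  atMs: number;
}

interface LeaseDecision {
  allowed: boolean;
  reason?: string;
  // Usage after the decision; unchanged when the post is denied.
  usage: LeaseUsage;
}

function checkLease(
  lease: BoundedLease,
  usage: LeaseUsage,
  post: JournalPost
): LeaseDecision {
  const deny = (reason: string): LeaseDecision => ({
    allowed: false,
    reason: reason,
    usage: usage,
  });
  if (post.amountCents <= 0) {
    return deny("non-positive-amount");
  }
  if (post.costCentre !== lease.costCentre) {
    return deny("cost-centre-out-of-scope");
  }
  if (post.atMs >= lease.expiresAtMs) {
    return deny("lease-expired");
  }
  if (usage.entries + 1 > lease.maxEntries) {
    return deny("entry-count-exceeded");
  }
  if (usage.totalCents + post.amountCents > lease.maxTotalCents) {
    return deny("total-exceeded");
  }
  return {
    allowed: true,
    usage: {
      entries: usage.entries + 1,
      totalCents: usage.totalCents + post.amountCents,
    },
  };
}
```

Returning the updated usage rather than mutating it keeps the check replayable from the audit trail: feeding the same sequence of posts through `checkLease` reproduces the same decisions.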
Training-data strategy
Coding has abundant public training signal (SWE-bench, MBPP,
HumanEval, IFEval). ERP has near-zero — every customer's business
rules, fields, and permissions are different, and no one publishes
training data over real ledger data.
This is both the opportunity and the constraint.
Opportunity
Training records produced by an ERP agent operating under real
business constraints are scarce by definition. They are the moat
that bscode-derived coding data cannot be.
Constraint
You cannot fork-execute ERP API calls the way you can fork
sandbox builds. There is no "try 100 branches, see which one
posts the right journal entry" — every call has externally-visible
side effects (or audit-log entries even if "rolled back"). Three
implications:
Generation happens in production runs, not in synthetic
sweeps. A human operator + AI assistant produces one
trajectory per real task. Trust-score gating and AEP signature
verification become more important, not less.
Verifier-based reward, not fork ranking. RolloutForkRunner
doesn't fit. Instead, RolloutSingleRunner + verifier ensemble
produces a labelled record. This routes more like
RLHF-from-real-use than DPO-from-ranked-rollouts.
Training stays close to customer data boundary. Two
acceptable shapes:
Local training: customer runs trace-pipeline inside
their VPC, model weights never leave.
Federated contribution: customer opts in to share
redacted training records (using AEP's existing redaction_profile field at packages/aep/src/types.ts:55) in exchange for
improved model weights.
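The second implication (verifier-based reward instead of fork ranking) can be sketched as follows: a single trajectory is passed through every verifier independently and the verdicts become the labels on one record. The `Trajectory`, `Verdict`, and `LabelledRecord` shapes and the pass-fraction threshold are assumptions for illustration only. The RFC does not specify what RolloutSingleRunner emits or how trace-pipeline computes its trust score, so `passFraction` here is a placeholder aggregate, not that score.

```typescript
// Hypothetical shapes; the real runner output is not specified in this RFC.
interface Trajectory {
  taskId: string;
  toolCalls: string[];
}

interface NamedVerifier {
  name: string;
  check: (t: Trajectory) => boolean;
}

interface Verdict {
  verifier: string;
  passed: boolean;
}

interface LabelledRecord {
  taskId: string;
  verdicts: Verdict[]; // one entry per verifier, reported independently
  passFraction: number; // placeholder aggregate, not the real trust score
  accepted: boolean;
}

function labelTrajectory(
  t: Trajectory,
  verifiers: NamedVerifier[],
  minPassFraction: number
): LabelledRecord {
  const verdicts = verifiers.map((v) => ({
    verifier: v.name,
    passed: v.check(t),
  }));
  const passedCount = verdicts.filter((v) => v.passed).length;
  // With no verifiers there is no evidence, so the record is not accepted.
  const passFraction =
    verdicts.length === 0 ? 0 : passedCount / verdicts.length;
  return {
    taskId: t.taskId,
    verdicts: verdicts,
    passFraction: passFraction,
    accepted: verdicts.length > 0 && passFraction >= minPassFraction,
  };
}
```

There is no sibling trajectory to compare against: the label comes entirely from the verifiers, which is why their quality matters more here than in the fork-ranking setup.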
Provenance
trace-pipeline/evomerge/schemas/training.py already carries a Provenance.source: str field on every SftTrainingRecord / DpoTrainingRecord. bscode emits source = "bscode-trajectory";
erp-agent would emit source = "erp-agent-trajectory" (or finer,
e.g. "erp-agent-odoo"). Downstream training can either filter by
source (separate domain models) or pool them (general capability
SFT across both verticals).
Pooling decision matrix:
| Training stage | Pool bscode + erp-agent? |
|---|---|
| SFT: general capability (tool use, instruction following) | yes |
| SFT: domain reasoning | no (separate models) |
| DPO: "follow tool schema correctly" | yes |
| DPO: "right answer" | no |
| Router training (which task → which capability) | yes |
| Verifier ensemble for trust score | yes (each verifier reports independently) |
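One way to apply the pooling matrix downstream is a lookup keyed by training stage, combined with a filter on the provenance source. The sketch below is a TypeScript illustration under assumptions: the `TrainingRecord` shape is a simplified mirror of the Python record, the `Stage` names are made up for this example, and classifying a record by source prefix is a convention this sketch invents. It covers only the training-stage rows of the matrix, not the verifier-ensemble row.

```typescript
// Hypothetical, simplified mirror of the Python training record.
interface TrainingRecord {
  provenance: { source: string };
  payload: string;
}

type Stage =
  | "sft-general"
  | "sft-domain"
  | "dpo-tool-schema"
  | "dpo-right-answer"
  | "router";

type Vertical = "coding" | "erp";

// true = pool bscode + erp-agent records; false = keep verticals separate.
const POOL: { [S in Stage]: boolean } = {
  "sft-general": true,
  "sft-domain": false,
  "dpo-tool-schema": true,
  "dpo-right-answer": false,
  router: true,
};

// Assumed convention: the source string starts with the emitting app's name.
function verticalOf(source: string): Vertical | "unknown" {
  if (source.indexOf("bscode") === 0) {
    return "coding";
  }
  if (source.indexOf("erp-agent") === 0) {
    return "erp";
  }
  return "unknown";
}

function selectRecords(
  records: TrainingRecord[],
  stage: Stage,
  target: Vertical
): TrainingRecord[] {
  return records.filter((r) => {
    const v = verticalOf(r.provenance.source);
    if (v === "unknown") {
      return false; // never train on records of unrecognised origin
    }
    return POOL[stage] || v === target;
  });
}
```

Because finer-grained sources such as "erp-agent-odoo" share the "erp-agent" prefix, they fall into the same vertical without any extra configuration.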
Choice of first ERP target
PoC should target one ERP, not a portfolio. Three candidates:
| Target | Pro | Con |
|---|---|---|
| Odoo (open source, XML-RPC + REST) | Source-available; testable locally; large SME market; SDKs in many languages | Less brand presence with enterprise procurement |
| NetSuite SuiteQL | Strong mid-market; reasonable API; query-rich | Account access expensive; auth (TBA) tedious |
| SAP S/4 OData | Largest TAM; API is well-typed | Sandbox access locked behind partner agreements; long sales cycle |
Proposed first target: Odoo. Reasons:
Lowest friction to set up a real test environment (Docker compose).
Open source means the schema and the SDKs are public — we can write
reference verifiers without an NDA.
SME segment ≈ best fit for an MIT-licensed reference app: customers
willing to self-host AI tooling tend to also self-host their ERP.
Once Odoo proves the pattern works, the SAP/NetSuite adaptation is
mostly a different SDK call inside the same tools/ shape.
Open question: is there appetite for a parallel erp-agent-netsuite
branch / fork from day one, or pick-one-and-finish-it?
What needs to change in existing repos
Mostly nothing. The reference design is "drop a new sibling
repo, depend on the same npm packages, define your own tools and
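verifiers." Concrete required changes:
wasmagent-js: optionally extract apps/worker from bscode
into a reusable template package (@wasmagent/worker-template).
This is not blocking the erp-agent PoC; it's a quality-of-life
refactor that would let future siblings (#3, #4, ...) skip the
90% boilerplate copy.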
wasmagent-js/packages/aep — no schema change beyond what
v0.3 (#7) already proposes. ERP-specific fields stay in tool
payload, not in the AEP envelope.
trace-pipeline — no schema change. Add an entry to the
documented list of recognised Provenance.source values
(purely documentation; the field is already str).
docs/ecosystem.md — update the diagram to show two
reference apps under the same runtime + data factory. The
"How the loop closes" pseudocode becomes generic ("agent runs
tasks → …") with bscode and erp-agent as parallel instances.
docs/BRANCH_PROTECTION.md — extend the scope sentence to
include the new repo. Already a shared canonical doc per the
recent reorg, so this is a one-line edit.
Phased rollout
Three phases. Phase 1 commits to nothing concrete; phase 2 commits
to engineering work; phase 3 commits to customers.
Phase 1 — Architecture lock (1 week, this RFC's scope)
Agree on the structure proposed here (or its revisions in
comments).
Pick first ERP target.
Decide whether to extract @wasmagent/worker-template now or
later. (Recommendation: later — copy bscode first; extract once
the pattern is proven across two repos.)
No code changes.
Phase 2 — PoC (≤ 4 weeks)
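Stand up erp-agent repo, mirror bscode's worker structure.
Implement the first Odoo tools (create quote, update customer, read journal entry).
Implement the first ERP verifiers (order_state_machine_valid, customer_field_consistency, permission_boundary_respected).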
Run one end-to-end loop: operator says "create a quote for
customer ABC" → AI calls Odoo → AEP records the action →
trace-pipeline computes trust score → output the first ERP
training record (Provenance.source = "erp-agent-odoo").
Success criterion: produce one verified ERP training record.
That's enough to prove the pipeline; everything after is volume.
Phase 3 — 1–2 paying customers (3–6 months)
Recruit two design-partner customers (one Odoo, one larger ERP
for the upgrade path).
Local-training shape: customer runs trace-pipeline inside their
VPC; we ship updates as model-merge recipes, not weights.
Federated-contribution shape: customer opts in to share
redacted SFT/DPO records; in exchange they get improved adapter
weights.
Success criterion: a verified ERP-domain DPO record set that
outperforms a coding-only-trained baseline on the customer's own
held-out tasks.
Risks
Verifier-development cost is real. ERP verifiers need
domain experts who've actually implemented SAP / NetSuite /
Odoo flows. Hiring or partnering for this is a different
problem than hiring engineers. Mitigation: start with
permission-boundary and idempotency verifiers (cheap, general);
accept that state-machine and ledger verifiers come with a
learning curve.
Customers won't share ledger data even if anonymized.
This is the open-source-AI version of every healthcare data
problem. Mitigation: lead with local-training, treat
federated contribution as opt-in upside, not the default.
A bad PoC could damage the bscode story. If erp-agent
ships visibly buggy verifiers, anyone evaluating bscode will
wonder if the runtime is at fault. Mitigation: clear
labelling (erp-agent is experimental); separate maturity
tiers in the org README; don't cross-link until erp-agent is
beta.
Engineering bandwidth. With three repos and one
maintainer + one new contributor, adding a fourth is
ambitious. Mitigation: Phase 2 is ≤ 4 weeks of one
contributor's time, ~80% copy-paste from bscode. The
expensive parts are tools + verifiers, which parallelise
well if another contributor is added.
Non-goals
Not a generic "ERP integration platform" (Mulesoft / Boomi /
Tray.io territory). The point is agent training data, not
integration plumbing.
Not a hosted SaaS product. The reference app is MIT
self-host like bscode; commercial offering (if any) sits at
the trace-pipeline data-loop layer.
Not a Salesforce / Workday / SAP partner integration. Those
are sales channels, not engineering deliverables.
Not changing the AEP v0.3 design to accommodate ERP. The
v0.3 fields already fit; if they don't, we report back to #7 as a finding from a second runtime — which is exactly
the empirical bar #7 sets for promoting decision_envelope
to normative in v0.4.
Open questions
@wasmagent/worker-template — extract now or later? Extract
now means erp-agent and any future sibling share a real package;
later means we copy bscode and refactor after the pattern is
proven twice. (Recommendation in this RFC: later.)
First ERP target — Odoo (recommended) or NetSuite?
Verifier ownership: @wasmagent/erp-verifiers as an npm
package that lives in wasmagent-js (next to @wasmagent/core/agents/verifiers/), or kept inside erp-agent until a third consumer needs it?
Repo visibility — public from day one (matches bscode)
or private until PoC is presentable?
Domain expert recruiting — does this RFC's acceptance
imply a hiring commitment? (Not necessarily; could be
contractor / advisor for the verifier portion.)
AEP v0.3 ERP feedback loop — if erp-agent finds the v0.3
schema insufficient for some ERP scenario, that's a strong
signal for the v0.4 design. Should we instrument the PoC to
report back v0.3-coverage findings to #7?
Related
AEP v0.3 RFC: #7 — pre-condition for several
ERP-specific record shapes
Comments welcome from anyone who has built agents against a
production ERP, particularly on the verifier taxonomy (which
classes I'm missing, which ones are too general to be useful) and
on the federated-vs-local training-data shape.