AgentConductor API

This document describes the current stable Python API exposed by the repository root package.

The API is still in an early milestone. It currently provides:

a typed solve entrypoint with bounded multi-turn execution over explicit orchestrator modes
a typed multi-turn solve-state contract that records per-turn history
deterministic and learned-policy topology-planning entrypoints
a single-turn topology-execution entrypoint whose non-testing worker roles can run through a model-backed runtime seam and whose testing role is backed by a local subprocess judge adapter
typed topology schema objects for single-turn plans
validation rules for topology structure before execution

The repository now supports a local bounded multi-turn solve loop, but it still does not implement the paper's full benchmark-grade inference runtime.

Public Entry Points

Stable callable API:

agentconductor.solve_problem
agentconductor.plan_problem_topology
agentconductor.plan_problem_topology_candidate
agentconductor.revise_problem_topology_candidate
agentconductor.execute_topology_plan
agentconductor.serialize_topology_plan_to_yaml
agentconductor.parse_topology_plan_yaml
agentconductor.evaluate_candidate_against_benchmark
agentconductor.evaluate_candidate_against_benchmark_record
agentconductor.load_canonical_benchmark_dataset
agentconductor.evaluate_candidate_batch
agentconductor.run_benchmark_evaluation_entrypoint
agentconductor.run_batch_evaluation_entrypoint
agentconductor.build_reproduction_audit
agentconductor.write_reproduction_audit
agentconductor.write_reproduction_audit_entrypoint
agentconductor.generate_sft_dataset_entrypoint
agentconductor.load_sft_checkpoint_entrypoint
agentconductor.run_sft_baseline_entrypoint
agentconductor.compute_reward_breakdown_entrypoint
agentconductor.run_rl_baseline_entrypoint

Stable public topology contract:

agentconductor.TopologyPlan
agentconductor.TopologyStep
agentconductor.AgentInvocation
agentconductor.AgentReference
agentconductor.AgentRole
agentconductor.TopologyValidationError
agentconductor.TopologySchemaError
agentconductor.TopologyLogicError

Other public types:

agentconductor.AgentExecutionResult
agentconductor.StepExecutionResult
agentconductor.TopologyExecutionResult
agentconductor.BenchmarkAdapter
agentconductor.BenchmarkArtifactIdentifiers
agentconductor.BenchmarkDatasetFormat
agentconductor.BenchmarkDatasetSource
agentconductor.BenchmarkExecutionPhase
agentconductor.BenchmarkEvaluationResult
agentconductor.BenchmarkEvaluationStatus
agentconductor.BenchmarkExecutionSettings
agentconductor.BenchmarkInvocationMode
agentconductor.BenchmarkPhaseArtifactIdentifiers
agentconductor.BenchmarkPhaseExecutionSettings
agentconductor.BenchmarkPhaseResourceLimits
agentconductor.BenchmarkPhaseResult
agentconductor.BenchmarkPhaseStatus
agentconductor.BenchmarkProblemDefinition
agentconductor.BenchmarkRuntimeMode
agentconductor.BenchmarkTestCase
agentconductor.BenchmarkVerdictMapping
agentconductor.BenchmarkVendorPollSnapshot
agentconductor.BenchmarkVendorSubmissionReceipt
agentconductor.BenchmarkVendorSubmissionState
agentconductor.CanonicalBenchmarkDataset
agentconductor.CanonicalBenchmarkRecord
agentconductor.DistributedEvaluationBatch
agentconductor.DistributedEvaluationConfig
agentconductor.DistributedEvaluationResult
agentconductor.DistributedEvaluationStatus
agentconductor.DistributedEvaluationTask
agentconductor.EvaluationProblemDefinition
agentconductor.EvaluationProblemResult
agentconductor.EvaluationRunArtifact
agentconductor.EvaluationRunMetadata
agentconductor.EvaluationSummary
agentconductor.ReproductionAudit
agentconductor.ReproductionChecklistItem
agentconductor.ReproductionClaim
agentconductor.ReproductionStatus
agentconductor.ExecutionStatus
agentconductor.RewardBreakdown
agentconductor.TestingOutcome
agentconductor.CodeCandidate
agentconductor.JudgeTestCase
agentconductor.JudgeCaseResult
agentconductor.JudgeResourceLimits
agentconductor.SandboxCapabilityState
agentconductor.SandboxBindingState
agentconductor.SandboxTestSpec
agentconductor.SandboxExecutionResult
agentconductor.SandboxRuntimeCapabilities
agentconductor.PythonSubprocessJudgeAdapter
agentconductor.PythonSubprocessSandboxAdapter
agentconductor.NodeJsBenchmarkJudgeAdapter
agentconductor.CppBenchmarkJudgeAdapter
agentconductor.JavaBenchmarkJudgeAdapter
agentconductor.PythonBenchmarkJudgeAdapter
agentconductor.MultiLanguageBenchmarkJudgeAdapter
agentconductor.StubBenchmarkAdapter
agentconductor.StubBenchmarkSubmission
agentconductor.StubVendorNativeBenchmarkAdapter
agentconductor.StubVendorSubmissionScenario
agentconductor.TopologyExecutionError
agentconductor.SolveState
agentconductor.SolveTurnRecord
agentconductor.TestingFeedback
agentconductor.TopologyRevisionInput
agentconductor.StopReason
agentconductor.SolveStateTransitionError
agentconductor.ProblemInstance
agentconductor.DifficultyLevel
agentconductor.SolveRequest
agentconductor.SolveResult
agentconductor.SolveStatus
agentconductor.LearnedTopologyPlan
agentconductor.OrchestratorCheckpointMetadata
agentconductor.OrchestratorCheckpointError
agentconductor.OrchestratorCheckpointSelectionError
agentconductor.OrchestratorCheckpointLoadError
agentconductor.OrchestratorMode
agentconductor.OrchestratorPromptRequest
agentconductor.TopologyOrchestratorPolicy
agentconductor.TopologyPromptKind
agentconductor.TopologyCandidateExtractionError
agentconductor.SftTrainingArtifact
agentconductor.SftTrainingConfig
agentconductor.SyntheticTopologySample
agentconductor.RlTrainingArtifact
agentconductor.RlTrainingConfig
agentconductor.RepositoryFrozenOrchestratorBundle
agentconductor.RepositoryFrozenOrchestratorRuntime
agentconductor.RepositoryWorkerModelRuntime
agentconductor.WorkerGenerationRequest
agentconductor.WorkerGenerationResult
agentconductor.WorkerRoleRuntime
agentconductor.WorkerRuntimeError

Installation and Import

From the repository root:

uv sync

Then import from Python:

from agentconductor import (
    LearnedTopologyPlan,
    parse_topology_plan_yaml,
    ProblemInstance,
    TopologyOrchestratorPolicy,
    TopologyPlan,
    execute_topology_plan,
    plan_problem_topology,
    plan_problem_topology_candidate,
    serialize_topology_plan_to_yaml,
    solve_problem,
)

Solve API

`solve_problem(problem, *, max_turns=None, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1, worker_runtime=None) -> SolveResult`

Plan and execute a structured bounded multi-turn solve for a problem instance.

Parameters:

problem: ProblemInstance
max_turns: int | None = None
orchestrator_policy: TopologyOrchestratorPolicy | None = None
orchestrator_checkpoint: str | Path | None = None
orchestrator_checkpoint_id: str | None = None
orchestrator_device: str = "cpu"
orchestrator_max_attempts: int = 1
worker_runtime: WorkerRoleRuntime | None = None

Behavior:

uses the explicit difficulty from problem when present
defaults missing difficulty to DifficultyLevel.MEDIUM
validates the turn budget against the current baseline limit
uses deterministic planning when no orchestrator policy is provided
uses the learned YAML planning path when orchestrator_policy is provided
can also resolve the learned YAML planning path from explicit checkpoint metadata when orchestrator_checkpoint is provided
executes plan -> evaluate in a bounded loop up to the current turn budget
consumes typed prior-turn testing feedback when planning a later turn
routes non-testing worker roles through the provided worker_runtime or the default repository-local model-backed worker runtime
returns the final candidate code, role trace, and final testing outcome

Returned SolveResult fields:

status
selected_difficulty
planned_turns
max_nodes
available_roles
topology
execution
candidate_solution
testing_outcome
solve_state
notes

Implementation inference:

the medium-difficulty fallback is an engineering inference until the repository implements the paper's real difficulty inference mechanism
deterministic planning remains the repository-local fallback when no learned policy is configured
the current checkpoint-backed frozen path loads a repository-local runtime bundle from checkpoint metadata, with explicit load failures for missing runtime artifacts, unsupported devices, or incompatible prompt templates

The notes field records which orchestrator mode produced the final topology and, when relevant, which checkpoint-backed runtime was selected.

Multi-Turn Solve-State Contract

The repository now exposes a typed solve-state object so later multi-turn logic can compose around the current single-turn executor without changing earlier execution contracts.

`SolveState`

SolveState(
    problem: ProblemInstance,
    selected_difficulty: DifficultyLevel,
    max_turns: int,
    max_nodes: int,
    available_roles: tuple[str, ...],
    turns: tuple[SolveTurnRecord, ...] = (),
    stop_reason: StopReason | None = None,
)

Properties:

completed_turns
remaining_turns
can_continue
latest_turn

`SolveTurnRecord`

SolveTurnRecord(
    turn_index: int,
    topology: TopologyPlan,
    execution: TopologyExecutionResult,
    testing_feedback: TestingFeedback,
)

`TestingFeedback`

TestingFeedback(
    outcome: TestingOutcome | None,
    diagnostics: tuple[str, ...],
    candidate_code: str | None,
)

`TopologyRevisionInput`

TopologyRevisionInput(
    problem: ProblemInstance,
    selected_difficulty: DifficultyLevel,
    turn_index: int,
    prior_topology: TopologyPlan,
    prior_execution_status: ExecutionStatus,
    testing_feedback: TestingFeedback,
    remaining_turns: int,
)

`StopReason`

Enum values:

solved
turn_budget_exhausted

Implementation inference:

the paper describes global history and testing feedback, but not a repository-level typed state object
TopologyRevisionInput is a repository-local contract so later revision logic can consume structured prior-turn artifacts instead of raw strings only

Topology Execution API

`execute_topology_plan(problem, topology, *, sandbox=None, worker_runtime=None) -> TopologyExecutionResult`

Execute a validated single-turn topology plan layer by layer.

Behavior:

validates the topology before execution
executes steps in index order
resolves references only from prior executed steps
dispatches each non-testing agent through a model-backed worker runtime seam
extracts the last referenced candidate code through an explicit code-candidate contract
evaluates candidate code through a concrete judge adapter
returns structured per-agent outputs, final candidate code, judge diagnostics, and final testing outcome

Implementation inference:

the current default worker runtime is RepositoryWorkerModelRuntime, which records runtime and model identifiers per agent while remaining a repository-local substitute for real gpt-4o-mini execution
the testing role delegates to a repository-local Python subprocess judge
the local judge validates a Python solve() entrypoint against explicit test cases, expected outputs, and explicit resource limits until a fuller benchmark integration exists

Judge Contract

The current judge-facing types are:

JudgeTestCase Carries one named invocation, optional positional or keyword arguments, optional stdin text, and expected output or stdout.
JudgeCaseResult Carries the typed verdict for one executed case, including pass/fail outcome, diagnostics, and captured actual versus expected outputs.
JudgeResourceLimits Carries per-evaluation CPU, wall-clock, and memory limits.
SandboxTestSpec Bundles the target entrypoint, concrete test cases, and resource limits into the adapter request.
SandboxRuntimeCapabilities Reports the active worker platform, launcher strategy, and typed wall-clock, CPU, and memory enforcement status for the evaluation.

Current benchmark-aligned semantics:

the judge now returns structured per-case verdicts instead of only a single aggregate outcome
string comparison normalizes line endings and ignores trailing whitespace at line boundaries, which is closer to common benchmark judge behavior than the earlier full strip() comparison
aggregate outcomes still map into the repository's typed TestingOutcome contract
wall-clock limits are enforced per case at the subprocess boundary instead of only within one long-lived in-process harness
on Windows, the judge now routes worker launch through a Job Object binding seam and targets hard process-memory limits through memory_limit_bytes when the host runtime permits dedicated job assignment
the sandbox result now carries typed runtime-capability metadata so callers can inspect whether memory binding was attached, downgraded, skipped, or not applicable
Windows CPU enforcement is now reported explicitly as unsupported until the repository has a verified Job Object CPU strategy

Current fidelity limits:

the repository judge is still local and Python-only
entrypoint and invocation semantics are still repository-defined rather than imported from a real benchmark harness
wall-clock handling is enforced by the subprocess boundary
POSIX CPU and memory limits use OS-level resource controls only on supported platforms
Windows hard memory enforcement depends on whether the host runtime allows the worker to be rebound into a dedicated Job Object
when Windows Job Object binding is unavailable, the judge keeps hard wall-clock enforcement and returns explicit platform diagnostics instead of claiming hard memory isolation
Windows worker launch first attempts CREATE_BREAKAWAY_FROM_JOB and falls back to plain subprocess creation only when that strategy is unavailable
Windows CPU limits are not claimed to be hard-enforced; wall-clock timeout remains the only guaranteed timing control on Windows
on runtimes without usable OS-level memory controls, memory limits fall back to traced Python allocations and remain approximate
output normalization is still a repository-level inference rather than a benchmark-specific ruleset
exact benchmark-specific semantics, datasets, and multi-language support are still out of scope for this milestone

Benchmark Adapter API

`evaluate_candidate_against_benchmark(problem, candidate, settings, *, adapter) -> BenchmarkEvaluationResult`

Evaluate one extracted candidate through a typed external benchmark boundary.

Parameters:

problem: BenchmarkProblemDefinition
candidate: CodeCandidate
settings: BenchmarkExecutionSettings
adapter: BenchmarkAdapter

Behavior:

keeps benchmark problem metadata, execution settings, verdict mapping, and artifact identifiers in explicit typed contracts
keeps the external benchmark seam distinct from the repository-local subprocess judge
validates candidate language against the requested benchmark execution language before dispatch
returns a normalized repository TestingOutcome only through BenchmarkVerdictMapping, so core services do not consume benchmark-native payloads directly

Key benchmark-facing types:

BenchmarkProblemDefinition Canonical benchmark metadata including benchmark_name, dataset_name, source_problem_id, optional split_name, repository-facing identifier, prompt, and language.
BenchmarkExecutionSettings External-harness execution settings such as language, invocation mode, entrypoint, benchmark-owned resource limits, optional compile or run phase settings, and explicit runtime mode.
BenchmarkPhaseExecutionSettings One benchmark-owned compile or run phase with explicit source_layout, command, optional executable_target, and resource_limits.
BenchmarkPhaseResourceLimits Phase-specific time and memory limits for compile or run phases.
BenchmarkTestCase Benchmark-owned cases expressed independently from the repository-local judge payload types.
BenchmarkVerdictMapping Normalized mapping from a benchmark-native verdict string into repository TestingOutcome.
BenchmarkArtifactIdentifiers Typed identifiers for benchmark-side run artifacts such as run_id, submission_id, optional result or log URIs, and per-phase artifact ids.
BenchmarkPhaseArtifactIdentifiers Typed artifact ids for one compile or run phase.
BenchmarkPhaseResult Typed per-phase diagnostics that keep compile failures and run-time failures distinct.
BenchmarkVendorSubmissionReceipt Typed submission metadata for vendor-native runtimes.
BenchmarkVendorPollSnapshot One observed vendor-native poll state, including terminal verdict when known.
BenchmarkEvaluationResult Adapter result containing adapter status, runtime mode, normalized verdict mapping, typed diagnostics, optional artifact identifiers, optional per-phase results, and optional vendor submission lifecycle metadata.

Current scope and limits:

the repository now exposes a typed benchmark adapter seam plus one canonical dataset-ingestion path for APPS-style JSONL artifacts
the included StubBenchmarkAdapter is only for contract verification and fixture-driven tests
the local subprocess judge remains the explicit development fallback for current solve execution
the repository now includes concrete Python, Node.js, and Java benchmark execution adapters plus a multi-language dispatch adapter, and it also exposes a separate stubbed vendor-native runtime boundary for submission-lifecycle verification

`load_canonical_benchmark_dataset(dataset_path, *, source_format=BenchmarkDatasetFormat.APPS_JSONL) -> CanonicalBenchmarkDataset`

Load one supported external benchmark dataset artifact into canonical problem records.

Parameters:

dataset_path: str | Path
source_format: BenchmarkDatasetFormat = BenchmarkDatasetFormat.APPS_JSONL

Behavior:

keeps source-layout parsing behind a typed dataset format selector instead of leaking vendor-specific keys into solve services
normalizes supported source records into canonical BenchmarkProblemDefinition objects
normalizes benchmark execution payloads into CanonicalBenchmarkRecord entries with explicit BenchmarkExecutionSettings and benchmark-owned BenchmarkTestCase values when the source row contains executable metadata
preserves repository-facing identifier, source problem_id, benchmark split_name, prompt text, language, and optional difficulty
returns dataset-level provenance through BenchmarkDatasetSource
records normalization assumptions through CanonicalBenchmarkDataset.normalization_notes

Current supported dataset format:

BenchmarkDatasetFormat.APPS_JSONL Expects one JSON object per line with problem_id, question, and split. Optional fields: difficulty, language.

Current normalization rules:

canonical identifiers are built as apps/<split>/<problem_id>
split names are normalized to lowercase train or test
prompt text converts CRLF or CR line endings to LF and trims trailing whitespace at line boundaries
when input_output metadata is present, APPS fn_name selects function invocation and otherwise the loader normalizes the record as stdin-style execution
APPS difficulty labels are mapped into repository tiers as an implementation inference: introductory -> easy, interview -> medium, competition -> hard

Current scope and limits:

only APPS-style JSONL ingestion is wired in this milestone
the repository does not bundle benchmark payloads and assumes the caller has legitimate local access to the source artifact
some APPS rows may still load as metadata-only records when they do not include executable input_output payloads
the first compiled-language local harness is now Java-first and stdin-oriented; wider compiled-language coverage still depends on host toolchain availability

`evaluate_candidate_against_benchmark_record(record, candidate, *, adapter) -> BenchmarkEvaluationResult`

Evaluate one candidate against a canonical benchmark dataset record that already contains benchmark-owned invocation settings and test cases.

Parameters:

record: CanonicalBenchmarkRecord
candidate: CodeCandidate
adapter: BenchmarkAdapter

Behavior:

dispatches through the benchmark adapter boundary using the record's own BenchmarkExecutionSettings
preserves the distinction between metadata-only canonical records and executable records with benchmark-owned test cases
allows benchmark execution to consume the canonical dataset layer directly instead of rebuilding ad hoc test specs outside the adapter seam

Current concrete benchmark paths:

PythonBenchmarkJudgeAdapter Uses the repository's subprocess sandbox for function-style Python cases and a standalone script path for stdin-style Python cases.
NodeJsBenchmarkJudgeAdapter Evaluates JavaScript benchmark records through local Node.js execution and emits benchmark-style verdict strings such as accepted and compilation_error.
JavaBenchmarkJudgeAdapter Compiles and executes stdin-style Java benchmark records through local javac plus java, while preserving compile-phase and run-phase artifacts.
CppBenchmarkJudgeAdapter Uses the same compile-then-run benchmark seam for C++ records and reports an explicit adapter error when the required compiler is unavailable.
MultiLanguageBenchmarkJudgeAdapter Dispatches to the configured Python, Node.js, C++, or Java benchmark harness based on the canonical record's BenchmarkExecutionSettings.language.
StubVendorNativeBenchmarkAdapter Exercises a vendor-native benchmark lifecycle through typed submission receipts, poll history, terminal verdict mapping, and artifact provenance without requiring a live external service.

Current fidelity limits:

Python and JavaScript function-style invocation is closest to benchmark semantics when fn_name is available
stdin-style Python and JavaScript execution now runs the candidate as a standalone script with benchmark-owned stdin payloads
stdin-style Java execution now runs through a local compile-then-run harness when javac and java are available
the JavaScript function path expects a CommonJS export and adds a repository-local compatibility shim for top-level solve(...) definitions
local compiled-language coverage is still incomplete: Java is the first repository-local compiled harness, while C++ depends on host-local g++ availability and currently has no bundled fallback toolchain
local harness artifact capture is file-based and now also preserves typed per-phase compile or run artifact identifiers
real vendor-native integrations still depend on external authentication, licensing, and service availability; the repository currently verifies that boundary through a fixture-driven stub
benchmark-specific output normalization rules and compiled-language local harnesses remain later milestones

Topology Planning API

`plan_problem_topology(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> TopologyPlan`

Return a validated topology plan for a problem instance.

Behavior:

uses the explicit problem difficulty when present
defaults missing difficulty to DifficultyLevel.MEDIUM
infers a coarse local problem shape from prompt keywords
selects one of a small set of topology templates when no policy is provided
otherwise routes through the learned YAML planning boundary and validates the parsed result
can load that learned planning boundary from explicit checkpoint metadata
returns a validated single-turn TopologyPlan

Implementation inference:

prompt-shape inference is a repository-local heuristic, not a paper-defined mechanism
deterministic planning remains the local fallback so tests and offline callers do not require a model checkpoint

`plan_problem_topology_candidate(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan`

Return the raw learned-policy YAML candidate plus its parsed topology.

Behavior:

constructs an explicit first-turn prompt from the problem and selected difficulty
calls the provided policy, or a checkpoint-backed loaded policy, through the narrow TopologyOrchestratorPolicy boundary
extracts one repository YAML document from the raw response
parses the YAML through the existing transport and topology-validation path
retries failed extraction or validation attempts up to orchestrator_max_attempts

`revise_problem_topology_candidate(revision, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan`

Return the raw learned-policy revised YAML candidate plus its parsed topology.

Behavior:

consumes the existing TopologyRevisionInput contract rather than raw strings
includes prior topology YAML and testing feedback in the explicit revision prompt
uses the same extraction, parsing, and retry path as first-turn planning, including the same checkpoint-backed policy path when configured

Learned Orchestrator Contract

`TopologyOrchestratorPolicy`

Policies must implement:

def generate_topology_candidate(
    self,
    *,
    prompt: str,
    request: OrchestratorPromptRequest,
) -> str: ...

Repository-local mock policies are still supported for tests, but checkpoint loading now also supports a repository-local frozen runtime bundle that materializes topology YAML candidates from serialized checkpoint state.

Checkpoint-backed frozen inference keeps explicit failure boundaries:

invalid checkpoint source selection raises OrchestratorCheckpointSelectionError
incompatible checkpoint metadata or missing runtime artifacts raises OrchestratorCheckpointLoadError
there is no silent fallback to the deterministic planner after checkpoint selection or loading fails

`LearnedTopologyPlan`

LearnedTopologyPlan(
    topology: TopologyPlan,
    topology_yaml: str,
    prompt: str,
    raw_response: str,
    attempt_count: int,
    kind: TopologyPromptKind,
)

Failure behavior:

missing extractable YAML raises TopologyCandidateExtractionError
malformed YAML raises the existing parse-layer transport error
schema-invalid or logic-invalid topologies raise the existing topology validation errors
there is no silent fallback to deterministic planning after policy failure

Topology YAML API

`serialize_topology_plan_to_yaml(topology) -> str`

Serialize a validated typed topology plan into the repository YAML transport format.

Behavior:

validates the topology before serialization
emits YAML from the canonical TopologyPlan.to_mapping() transport shape
keeps YAML encoding details behind the repository transport helper rather than exposing the YAML library directly to callers

`parse_topology_plan_yaml(yaml_text) -> TopologyPlan`

Parse repository YAML text into a validated typed topology plan.

Behavior:

parses YAML text through the infrastructure adapter boundary
rejects malformed YAML with a parse-layer error
rejects schema-invalid transport payloads with TopologySchemaError
rejects graph-rule violations with TopologyLogicError
returns the same typed TopologyPlan contract used by the existing planning and execution APIs

Distributed Evaluation API

`evaluate_candidate_batch(tasks, *, config=None, orchestrator=None) -> DistributedEvaluationBatch`

Evaluate multiple candidate solutions through an orchestration boundary that keeps submission, worker execution, and collection separate from judge logic.

Behavior:

preserves task ordering in the collected batch result
supports explicit max_workers, max_retries, and collection_timeout_seconds
keeps max_workers=1 as the local single-worker fallback path
returns typed per-task statuses plus the underlying SandboxExecutionResult when available

Benchmark Evaluation API

`run_benchmark_evaluation_entrypoint(dataset_path, output_path, *, checkpoint_source, checkpoint_id=None, source_format=BenchmarkDatasetFormat.APPS_JSONL, samples_per_problem=1, pass_k=None, max_workers=1, max_turns=2, orchestrator_device="cpu", orchestrator_max_attempts=1) -> EvaluationRunArtifact`

Run checkpoint-backed frozen inference over a canonical benchmark dataset and write a structured evaluation artifact.

Behavior:

loads a canonical benchmark dataset such as APPS JSONL through the benchmark dataset seam
resolves one orchestrator checkpoint explicitly from a checkpoint directory, checkpoint metadata file, or training artifact JSON
runs the current solve loop for each benchmark problem and each configured attempt index
re-judges the emitted candidate through the benchmark adapter boundary rather than trusting repository-local solve diagnostics alone
writes per-attempt results that preserve solve status, benchmark verdict, latency, topology size, benchmark artifact identifiers, and checkpoint id
writes run metadata including dataset version, harness version, runtime mode, checkpoint provenance, reproduction claim, exact-reproduction readiness, blocking gap ids, and aggregate pass@1 / pass@k

Current fidelity note:

the default runtime still uses repository-local Python and JavaScript benchmark harness adapters, so the produced metrics are benchmark-aligned but not yet vendor-native leaderboard reproductions
callers can now intentionally choose a vendor-native adapter boundary, but the bundled verification path is still the fixture-driven StubVendorNativeBenchmarkAdapter rather than a live service integration

Reproduction Audit API

`build_reproduction_audit() -> ReproductionAudit`

Return the repository's current strict paper-reproduction checklist as a typed in-memory object.

Behavior:

records the current overall claim as exact or approximate
lists line-item fidelity items with explicit exact, approximate, or blocked status
returns the current blocking gap ids needed for a strict paper-level claim

`write_reproduction_audit(output_path) -> ReproductionAudit`

Write the same reproduction audit to a JSON artifact.

`write_reproduction_audit_entrypoint(output_path) -> ReproductionAudit`

Public path-normalizing wrapper for the same audit artifact.

`run_batch_evaluation_entrypoint(...) -> EvaluationRunArtifact`

Compatibility alias for run_benchmark_evaluation_entrypoint(...).

SFT Baseline API

`generate_sft_dataset_entrypoint(dataset_path, *, sample_count=4500, seed=0, prompt_template_version="orchestrator-sft-v2", source_recipe_name="paper-oriented-synthetic-yaml-v1") -> tuple[SyntheticTopologySample, ...]`

Generate a deterministic JSONL dataset of schema-valid topology targets derived from the current rule-based orchestrator.

Current transport note:

target_topology remains JSON-serializable in the dataset artifact
target_topology_yaml now carries the YAML-form target used by the SFT path
the stored topology mapping now comes from the canonical TopologyPlan.to_mapping() transport shape rather than a training-local serializer
the generator also writes a sidecar metadata file at <dataset>.metadata.json that records sample count, paper target size, difficulty breakdown, source recipe, prompt-template version, and reduced-scale status

`run_sft_baseline_entrypoint(dataset_path, artifact_path, *, epochs=1, learning_rate=1e-4, seed=0, backbone_name=\"Qwen2.5-3B-Instruct\", tokenizer_name=\"Qwen2.5-3B-Instruct\", prompt_template_version=\"orchestrator-sft-v2\", optimizer_name=\"adamw\") -> SftTrainingArtifact`

Validate the generated dataset and write a reproducible checkpoint-producing artifact for the repository-local SFT stage.

Behavior:

validates that target_topology and target_topology_yaml stay in sync
writes a YAML-target training manifest distinct from the source dataset
emits a lightweight checkpoint directory with loadable metadata plus a repository-local orchestrator-runtime.json bundle for frozen inference
records dataset provenance, source dataset metadata path, source recipe, sample count, reduced-scale label, optimizer name, backbone, tokenizer, prompt-template version, seed, and checkpoint location in the artifact

`load_sft_checkpoint_entrypoint(checkpoint_path) -> OrchestratorCheckpointMetadata`

Load repository-local checkpoint metadata from a checkpoint directory, metadata file, or training-artifact-derived checkpoint source.

Implementation inference:

this stage now emits checkpoint-shaped artifacts and loadable metadata, but it still does not claim paper-scale supervised fine-tuning fidelity by itself

RL Training API

`compute_reward_breakdown_entrypoint(topology, *, yaml_valid, execution_outcome) -> RewardBreakdown`

Compute a repository-local reward breakdown from YAML validity, execution outcome, and topology-density signals.

`run_rl_baseline_entrypoint(dataset_path, artifact_path, *, checkpoint_source, rollout_count=8, group_size=8, turn_budget=2, seed=0, optimizer_learning_rate=1e-5, optimizer_name="grpo-paper-oriented", checkpoint_device="cpu") -> RlTrainingArtifact`

Run the repository-local RL training path from one source checkpoint and write rollout artifacts plus an updated checkpoint.

Behavior:

resolves the source checkpoint from a checkpoint directory, metadata file, or training artifact JSON
collects rollout records through the current bounded solve loop
preserves per-rollout execution outcomes, YAML-derived topology artifacts, turn counts, reward breakdowns, and the resulting checkpoint identifier
computes group-normalized advantages plus grouped rollout summaries as a paper-oriented GRPO-style update stage
writes a new checkpoint directory with updated metadata, copied runtime bundle state when available, and a stubbed weight lineage marker
returns a typed artifact that points to the rollout manifest, grouped-update artifact, and updated checkpoint

Implementation inference:

this path now reproduces grouped rollout collection and normalized-advantage bookkeeping more explicitly, but it still does not claim full paper-scale GRPO optimizer fidelity or distributed training behavior

Topology Schema

`TopologyPlan`

TopologyPlan(
    difficulty: DifficultyLevel,
    steps: tuple[TopologyStep, ...],
)

Current scope:

single-turn topology only
layered DAG structure
dependency-free parsing from plain mappings via TopologyPlan.from_mapping(...)
canonical plain-mapping serialization via TopologyPlan.to_mapping()

Properties:

node_count
max_nodes

Methods:

to_mapping() -> dict[str, Any]
validate() -> None
from_mapping(raw_plan: Mapping[str, Any]) -> TopologyPlan

Current transport contract:

TopologyPlan remains the source of truth
plain mappings and YAML are transport formats around the typed contract
the repository YAML contract uses: difficulty -> steps -> index -> agents -> name/role/refs -> step_index/agent_name
these YAML field names are repository-level implementation inferences rather than paper-stated schema facts

`TopologyStep`

TopologyStep(
    index: int,
    agents: tuple[AgentInvocation, ...],
)

`AgentInvocation`

AgentInvocation(
    name: str,
    role: AgentRole,
    refs: tuple[AgentReference, ...] = (),
)

`AgentReference`

AgentReference(
    step_index: int,
    agent_name: str,
)

`AgentRole`

Enum values:

retrieval
planning
algorithmic
coding
debugging
testing

Validation Rules

The current topology validator enforces these constraints:

the plan must contain at least one step
step indices must be contiguous and zero-based
every step must contain at least one agent
agent names must be unique across the full topology
the first step must not contain references
references may target earlier steps only
references must point to known prior agent names
the final step must contain a testing agent
the total node count must stay within the paper-derived difficulty budget

Difficulty-specific node budgets:

easy: 4
medium: 7
hard: 10

Implementation inference:

the paper does not define a full concrete schema; from_mapping(...) remains a repository-local parsing contract around the typed model and now coexists with the YAML adapter boundary

Topology Parsing Example

from agentconductor import TopologyPlan

plan = TopologyPlan.from_mapping(
    {
        "difficulty": "easy",
        "steps": [
            {
                "index": 0,
                "agents": [
                    {"name": "planner_0", "role": "planning", "refs": []},
                ],
            },
            {
                "index": 1,
                "agents": [
                    {
                        "name": "coder_1",
                        "role": "coding",
                        "refs": [{"step_index": 0, "agent_name": "planner_0"}],
                    },
                ],
            },
            {
                "index": 2,
                "agents": [
                    {
                        "name": "tester_2",
                        "role": "testing",
                        "refs": [
                            {"step_index": 0, "agent_name": "planner_0"},
                            {"step_index": 1, "agent_name": "coder_1"},
                        ],
                    },
                ],
            },
        ],
    }
)

YAML Transport Contract

The paper's primary topology representation is YAML, and the repository now uses a stable YAML transport around the existing typed topology model.

Current repository YAML shape:

difficulty: easy
steps:
  - index: 0
    agents:
      - name: planner_0
        role: planning
        refs: []
  - index: 1
    agents:
      - name: coder_1
        role: coding
        refs:
          - step_index: 0
            agent_name: planner_0
  - index: 2
    agents:
      - name: tester_2
        role: testing
        refs:
          - step_index: 0
            agent_name: planner_0
          - step_index: 1
            agent_name: coder_1

Error categories preserved explicitly:

YAML syntax failure: the text cannot be parsed as YAML
schema failure: the YAML parses, but required repository fields or field types are wrong
topology logic failure: the YAML parses into the repository schema, but the resulting TopologyPlan violates validation rules such as first-step references or missing final testing agent

Backward-compatibility rule:

existing typed APIs such as plan_problem_topology(...) will keep returning TopologyPlan
YAML entrypoints are additive transport helpers, not replacements for the typed topology model

Current implementation status:

the repository now has an infrastructure-only YAML adapter boundary for mapping -> YAML text and YAML text -> mapping
YAML parser failures are translated into explicit topology-YAML transport errors instead of leaking raw library exceptions
the repository now also supports YAML text -> TopologyPlan through the existing typed topology parser and validator
topology validation failures are now split into: TopologySchemaError for field-contract violations and TopologyLogicError for graph-rule violations, both under the existing TopologyValidationError base class

Current Limits

It currently does:

expose typed topology contracts
serialize typed topologies into the repository YAML transport format
parse repository YAML text back into validated typed topologies
generate topology YAML candidates through an explicit learned-orchestrator policy boundary
expose typed multi-turn state and revision-input contracts
parse single-turn topologies from plain mappings
validate paper-aligned topology structure
emit deterministic topology plans for supported difficulty tiers
emit deterministic revised topologies from prior-turn feedback
emit learned-policy topology candidates for first-turn and later-turn planning
resolve lightweight orchestrator checkpoints into the online solve loop and learned planning entrypoints
execute non-testing worker roles through a typed worker-runtime adapter seam
execute single-turn topologies with a local judge-backed testing role
run a bounded multi-turn solve loop with early stop on pass
produce lightweight loadable SFT checkpoint artifacts with explicit metadata
return candidate code and structured execution traces from solve_problem(...)

The repository currently does not:

load benchmark-grade model weights into a production inference runtime
claim benchmark-exact frozen inference semantics

Source References

Implementation files:

FilesExpand file tree

API.md

Latest commit

History

API.md

File metadata and controls

AgentConductor API

Public Entry Points

Installation and Import

Solve API

solve_problem(problem, *, max_turns=None, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1, worker_runtime=None) -> SolveResult

Multi-Turn Solve-State Contract

SolveState

SolveTurnRecord

TestingFeedback

TopologyRevisionInput

StopReason

Topology Execution API

execute_topology_plan(problem, topology, *, sandbox=None, worker_runtime=None) -> TopologyExecutionResult

Judge Contract

Benchmark Adapter API

evaluate_candidate_against_benchmark(problem, candidate, settings, *, adapter) -> BenchmarkEvaluationResult

load_canonical_benchmark_dataset(dataset_path, *, source_format=BenchmarkDatasetFormat.APPS_JSONL) -> CanonicalBenchmarkDataset

evaluate_candidate_against_benchmark_record(record, candidate, *, adapter) -> BenchmarkEvaluationResult

Topology Planning API

plan_problem_topology(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> TopologyPlan

plan_problem_topology_candidate(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan

revise_problem_topology_candidate(revision, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan

Learned Orchestrator Contract

TopologyOrchestratorPolicy

LearnedTopologyPlan

Topology YAML API

serialize_topology_plan_to_yaml(topology) -> str

parse_topology_plan_yaml(yaml_text) -> TopologyPlan

Distributed Evaluation API

evaluate_candidate_batch(tasks, *, config=None, orchestrator=None) -> DistributedEvaluationBatch

Benchmark Evaluation API

run_benchmark_evaluation_entrypoint(dataset_path, output_path, *, checkpoint_source, checkpoint_id=None, source_format=BenchmarkDatasetFormat.APPS_JSONL, samples_per_problem=1, pass_k=None, max_workers=1, max_turns=2, orchestrator_device="cpu", orchestrator_max_attempts=1) -> EvaluationRunArtifact

Reproduction Audit API

build_reproduction_audit() -> ReproductionAudit

write_reproduction_audit(output_path) -> ReproductionAudit

write_reproduction_audit_entrypoint(output_path) -> ReproductionAudit

run_batch_evaluation_entrypoint(...) -> EvaluationRunArtifact

SFT Baseline API

generate_sft_dataset_entrypoint(dataset_path, *, sample_count=4500, seed=0, prompt_template_version="orchestrator-sft-v2", source_recipe_name="paper-oriented-synthetic-yaml-v1") -> tuple[SyntheticTopologySample, ...]

run_sft_baseline_entrypoint(dataset_path, artifact_path, *, epochs=1, learning_rate=1e-4, seed=0, backbone_name=\"Qwen2.5-3B-Instruct\", tokenizer_name=\"Qwen2.5-3B-Instruct\", prompt_template_version=\"orchestrator-sft-v2\", optimizer_name=\"adamw\") -> SftTrainingArtifact

load_sft_checkpoint_entrypoint(checkpoint_path) -> OrchestratorCheckpointMetadata

RL Training API

compute_reward_breakdown_entrypoint(topology, *, yaml_valid, execution_outcome) -> RewardBreakdown

run_rl_baseline_entrypoint(dataset_path, artifact_path, *, checkpoint_source, rollout_count=8, group_size=8, turn_budget=2, seed=0, optimizer_learning_rate=1e-5, optimizer_name="grpo-paper-oriented", checkpoint_device="cpu") -> RlTrainingArtifact

Topology Schema

TopologyPlan

TopologyStep

AgentInvocation

AgentReference

AgentRole

Validation Rules

Topology Parsing Example

YAML Transport Contract

Current Limits

Source References

`solve_problem(problem, *, max_turns=None, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1, worker_runtime=None) -> SolveResult`

`SolveState`

`SolveTurnRecord`

`TestingFeedback`

`TopologyRevisionInput`

`StopReason`

`execute_topology_plan(problem, topology, *, sandbox=None, worker_runtime=None) -> TopologyExecutionResult`

`evaluate_candidate_against_benchmark(problem, candidate, settings, *, adapter) -> BenchmarkEvaluationResult`

`load_canonical_benchmark_dataset(dataset_path, *, source_format=BenchmarkDatasetFormat.APPS_JSONL) -> CanonicalBenchmarkDataset`

`evaluate_candidate_against_benchmark_record(record, candidate, *, adapter) -> BenchmarkEvaluationResult`

`plan_problem_topology(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> TopologyPlan`

`plan_problem_topology_candidate(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan`

`revise_problem_topology_candidate(revision, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan`

`TopologyOrchestratorPolicy`

`LearnedTopologyPlan`

`serialize_topology_plan_to_yaml(topology) -> str`

`parse_topology_plan_yaml(yaml_text) -> TopologyPlan`

`evaluate_candidate_batch(tasks, *, config=None, orchestrator=None) -> DistributedEvaluationBatch`

`build_reproduction_audit() -> ReproductionAudit`

`write_reproduction_audit(output_path) -> ReproductionAudit`

`write_reproduction_audit_entrypoint(output_path) -> ReproductionAudit`

`run_batch_evaluation_entrypoint(...) -> EvaluationRunArtifact`

`generate_sft_dataset_entrypoint(dataset_path, *, sample_count=4500, seed=0, prompt_template_version="orchestrator-sft-v2", source_recipe_name="paper-oriented-synthetic-yaml-v1") -> tuple[SyntheticTopologySample, ...]`

`run_sft_baseline_entrypoint(dataset_path, artifact_path, *, epochs=1, learning_rate=1e-4, seed=0, backbone_name=\"Qwen2.5-3B-Instruct\", tokenizer_name=\"Qwen2.5-3B-Instruct\", prompt_template_version=\"orchestrator-sft-v2\", optimizer_name=\"adamw\") -> SftTrainingArtifact`

`load_sft_checkpoint_entrypoint(checkpoint_path) -> OrchestratorCheckpointMetadata`

`compute_reward_breakdown_entrypoint(topology, *, yaml_valid, execution_outcome) -> RewardBreakdown`

`run_rl_baseline_entrypoint(dataset_path, artifact_path, *, checkpoint_source, rollout_count=8, group_size=8, turn_budget=2, seed=0, optimizer_learning_rate=1e-5, optimizer_name="grpo-paper-oriented", checkpoint_device="cpu") -> RlTrainingArtifact`

`TopologyPlan`

`TopologyStep`

`AgentInvocation`

`AgentReference`

`AgentRole`