This document describes the current stable Python API exposed by the repository root package.
The API is still in an early milestone. It currently provides:
- a typed solve entrypoint with bounded multi-turn execution over explicit orchestrator modes
- a typed multi-turn solve-state contract that records per-turn history
- deterministic and learned-policy topology-planning entrypoints
- a single-turn topology-execution entrypoint whose non-testing worker roles can run through a model-backed runtime seam and whose testing role is backed by a local subprocess judge adapter
- typed topology schema objects for single-turn plans
- validation rules for topology structure before execution
The repository now supports a local bounded multi-turn solve loop, but it still does not implement the paper's full benchmark-grade inference runtime.
Stable callable API:
- agentconductor.solve_problem
- agentconductor.plan_problem_topology
- agentconductor.plan_problem_topology_candidate
- agentconductor.revise_problem_topology_candidate
- agentconductor.execute_topology_plan
- agentconductor.serialize_topology_plan_to_yaml
- agentconductor.parse_topology_plan_yaml
- agentconductor.evaluate_candidate_against_benchmark
- agentconductor.evaluate_candidate_against_benchmark_record
- agentconductor.load_canonical_benchmark_dataset
- agentconductor.evaluate_candidate_batch
- agentconductor.run_benchmark_evaluation_entrypoint
- agentconductor.run_batch_evaluation_entrypoint
- agentconductor.build_reproduction_audit
- agentconductor.write_reproduction_audit
- agentconductor.write_reproduction_audit_entrypoint
- agentconductor.generate_sft_dataset_entrypoint
- agentconductor.load_sft_checkpoint_entrypoint
- agentconductor.run_sft_baseline_entrypoint
- agentconductor.compute_reward_breakdown_entrypoint
- agentconductor.run_rl_baseline_entrypoint
Stable public topology contract:
- agentconductor.TopologyPlan
- agentconductor.TopologyStep
- agentconductor.AgentInvocation
- agentconductor.AgentReference
- agentconductor.AgentRole
- agentconductor.TopologyValidationError
- agentconductor.TopologySchemaError
- agentconductor.TopologyLogicError
Other public types:
- agentconductor.AgentExecutionResult
- agentconductor.StepExecutionResult
- agentconductor.TopologyExecutionResult
- agentconductor.BenchmarkAdapter
- agentconductor.BenchmarkArtifactIdentifiers
- agentconductor.BenchmarkDatasetFormat
- agentconductor.BenchmarkDatasetSource
- agentconductor.BenchmarkExecutionPhase
- agentconductor.BenchmarkEvaluationResult
- agentconductor.BenchmarkEvaluationStatus
- agentconductor.BenchmarkExecutionSettings
- agentconductor.BenchmarkInvocationMode
- agentconductor.BenchmarkPhaseArtifactIdentifiers
- agentconductor.BenchmarkPhaseExecutionSettings
- agentconductor.BenchmarkPhaseResourceLimits
- agentconductor.BenchmarkPhaseResult
- agentconductor.BenchmarkPhaseStatus
- agentconductor.BenchmarkProblemDefinition
- agentconductor.BenchmarkRuntimeMode
- agentconductor.BenchmarkTestCase
- agentconductor.BenchmarkVerdictMapping
- agentconductor.BenchmarkVendorPollSnapshot
- agentconductor.BenchmarkVendorSubmissionReceipt
- agentconductor.BenchmarkVendorSubmissionState
- agentconductor.CanonicalBenchmarkDataset
- agentconductor.CanonicalBenchmarkRecord
- agentconductor.DistributedEvaluationBatch
- agentconductor.DistributedEvaluationConfig
- agentconductor.DistributedEvaluationResult
- agentconductor.DistributedEvaluationStatus
- agentconductor.DistributedEvaluationTask
- agentconductor.EvaluationProblemDefinition
- agentconductor.EvaluationProblemResult
- agentconductor.EvaluationRunArtifact
- agentconductor.EvaluationRunMetadata
- agentconductor.EvaluationSummary
- agentconductor.ReproductionAudit
- agentconductor.ReproductionChecklistItem
- agentconductor.ReproductionClaim
- agentconductor.ReproductionStatus
- agentconductor.ExecutionStatus
- agentconductor.RewardBreakdown
- agentconductor.TestingOutcome
- agentconductor.CodeCandidate
- agentconductor.JudgeTestCase
- agentconductor.JudgeCaseResult
- agentconductor.JudgeResourceLimits
- agentconductor.SandboxCapabilityState
- agentconductor.SandboxBindingState
- agentconductor.SandboxTestSpec
- agentconductor.SandboxExecutionResult
- agentconductor.SandboxRuntimeCapabilities
- agentconductor.PythonSubprocessJudgeAdapter
- agentconductor.PythonSubprocessSandboxAdapter
- agentconductor.NodeJsBenchmarkJudgeAdapter
- agentconductor.CppBenchmarkJudgeAdapter
- agentconductor.JavaBenchmarkJudgeAdapter
- agentconductor.PythonBenchmarkJudgeAdapter
- agentconductor.MultiLanguageBenchmarkJudgeAdapter
- agentconductor.StubBenchmarkAdapter
- agentconductor.StubBenchmarkSubmission
- agentconductor.StubVendorNativeBenchmarkAdapter
- agentconductor.StubVendorSubmissionScenario
- agentconductor.TopologyExecutionError
- agentconductor.SolveState
- agentconductor.SolveTurnRecord
- agentconductor.TestingFeedback
- agentconductor.TopologyRevisionInput
- agentconductor.StopReason
- agentconductor.SolveStateTransitionError
- agentconductor.ProblemInstance
- agentconductor.DifficultyLevel
- agentconductor.SolveRequest
- agentconductor.SolveResult
- agentconductor.SolveStatus
- agentconductor.LearnedTopologyPlan
- agentconductor.OrchestratorCheckpointMetadata
- agentconductor.OrchestratorCheckpointError
- agentconductor.OrchestratorCheckpointSelectionError
- agentconductor.OrchestratorCheckpointLoadError
- agentconductor.OrchestratorMode
- agentconductor.OrchestratorPromptRequest
- agentconductor.TopologyOrchestratorPolicy
- agentconductor.TopologyPromptKind
- agentconductor.TopologyCandidateExtractionError
- agentconductor.SftTrainingArtifact
- agentconductor.SftTrainingConfig
- agentconductor.SyntheticTopologySample
- agentconductor.RlTrainingArtifact
- agentconductor.RlTrainingConfig
- agentconductor.RepositoryFrozenOrchestratorBundle
- agentconductor.RepositoryFrozenOrchestratorRuntime
- agentconductor.RepositoryWorkerModelRuntime
- agentconductor.WorkerGenerationRequest
- agentconductor.WorkerGenerationResult
- agentconductor.WorkerRoleRuntime
- agentconductor.WorkerRuntimeError
From the repository root:
uv sync
Then import from Python:
from agentconductor import (
LearnedTopologyPlan,
parse_topology_plan_yaml,
ProblemInstance,
TopologyOrchestratorPolicy,
TopologyPlan,
execute_topology_plan,
plan_problem_topology,
plan_problem_topology_candidate,
serialize_topology_plan_to_yaml,
solve_problem,
)
solve_problem(problem, *, max_turns=None, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1, worker_runtime=None) -> SolveResult
Plan and execute a structured bounded multi-turn solve for a problem instance.
Parameters:
- problem: ProblemInstance
- max_turns: int | None = None
- orchestrator_policy: TopologyOrchestratorPolicy | None = None
- orchestrator_checkpoint: str | Path | None = None
- orchestrator_checkpoint_id: str | None = None
- orchestrator_device: str = "cpu"
- orchestrator_max_attempts: int = 1
- worker_runtime: WorkerRoleRuntime | None = None
Behavior:
- uses the explicit difficulty from problem when present
- defaults missing difficulty to DifficultyLevel.MEDIUM
- validates the turn budget against the current baseline limit
- uses deterministic planning when no orchestrator policy is provided
- uses the learned YAML planning path when orchestrator_policy is provided
- can also resolve the learned YAML planning path from explicit checkpoint metadata when orchestrator_checkpoint is provided
- executes plan -> evaluate in a bounded loop up to the current turn budget
- consumes typed prior-turn testing feedback when planning a later turn
- routes non-testing worker roles through the provided worker_runtime or the default repository-local model-backed worker runtime
- returns the final candidate code, role trace, and final testing outcome
Returned SolveResult fields:
- status
- selected_difficulty
- planned_turns
- max_nodes
- available_roles
- topology
- execution
- candidate_solution
- testing_outcome
- solve_state
- notes
Implementation inference:
- the medium-difficulty fallback is an engineering inference until the repository implements the paper's real difficulty inference mechanism
- deterministic planning remains the repository-local fallback when no learned policy is configured
- the current checkpoint-backed frozen path loads a repository-local runtime bundle from checkpoint metadata, with explicit load failures for missing runtime artifacts, unsupported devices, or incompatible prompt templates
The notes field records which orchestrator mode produced the final topology
and, when relevant, which checkpoint-backed runtime was selected.
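The bounded plan -> evaluate loop described above can be pictured with a small standalone sketch. It does not import or reproduce agentconductor internals: run_bounded_loop, the plan_turn and evaluate_turn callables, and the tuple-based history are hypothetical scaffolding, and only the two stop reasons (solved, turn_budget_exhausted) come from this document.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins: a "plan" is a string, feedback is a string or None.
PlanFn = Callable[[Optional[str]], str]
EvalFn = Callable[[str], Tuple[bool, str]]


def run_bounded_loop(
    plan_turn: PlanFn, evaluate_turn: EvalFn, max_turns: int
) -> Tuple[str, List[Tuple[int, str, bool]]]:
    """Plan, evaluate, and feed testing feedback into the next turn."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    history: List[Tuple[int, str, bool]] = []
    feedback: Optional[str] = None
    for turn_index in range(max_turns):
        plan = plan_turn(feedback)  # later turns see the prior feedback
        passed, feedback = evaluate_turn(plan)
        history.append((turn_index, plan, passed))
        if passed:
            return "solved", history
    # The budget ran out without a passing evaluation.
    return "turn_budget_exhausted", history
```

The loop stops as soon as one evaluation passes, so the recorded history can be shorter than the turn budget.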
The repository now exposes a typed solve-state object so later multi-turn logic can compose around the current single-turn executor without changing earlier execution contracts.
SolveState(
problem: ProblemInstance,
selected_difficulty: DifficultyLevel,
max_turns: int,
max_nodes: int,
available_roles: tuple[str, ...],
turns: tuple[SolveTurnRecord, ...] = (),
stop_reason: StopReason | None = None,
)
Properties:
- completed_turns
- remaining_turns
- can_continue
- latest_turn
SolveTurnRecord(
turn_index: int,
topology: TopologyPlan,
execution: TopologyExecutionResult,
testing_feedback: TestingFeedback,
)
TestingFeedback(
outcome: TestingOutcome | None,
diagnostics: tuple[str, ...],
candidate_code: str | None,
)
TopologyRevisionInput(
problem: ProblemInstance,
selected_difficulty: DifficultyLevel,
turn_index: int,
prior_topology: TopologyPlan,
prior_execution_status: ExecutionStatus,
testing_feedback: TestingFeedback,
remaining_turns: int,
)
Enum values:
- solved
- turn_budget_exhausted
Implementation inference:
- the paper describes global history and testing feedback, but not a repository-level typed state object
- TopologyRevisionInput is a repository-local contract so later revision logic can consume structured prior-turn artifacts instead of raw strings only
execute_topology_plan(problem, topology, *, sandbox=None, worker_runtime=None) -> TopologyExecutionResult
Execute a validated single-turn topology plan layer by layer.
Behavior:
- validates the topology before execution
- executes steps in index order
- resolves references only from prior executed steps
- dispatches each non-testing agent through a model-backed worker runtime seam
- extracts the last referenced candidate code through an explicit code-candidate contract
- evaluates candidate code through a concrete judge adapter
- returns structured per-agent outputs, final candidate code, judge diagnostics, and final testing outcome
Implementation inference:
- the current default worker runtime is RepositoryWorkerModelRuntime, which records runtime and model identifiers per agent while remaining a repository-local substitute for real gpt-4o-mini execution
- the testing role delegates to a repository-local Python subprocess judge
- the local judge validates a Python solve() entrypoint against explicit test cases, expected outputs, and explicit resource limits until a fuller benchmark integration exists
The current judge-facing types are:
- JudgeTestCase: Carries one named invocation, optional positional or keyword arguments, optional stdin text, and expected output or stdout.
- JudgeCaseResult: Carries the typed verdict for one executed case, including pass/fail outcome, diagnostics, and captured actual versus expected outputs.
- JudgeResourceLimits: Carries per-evaluation CPU, wall-clock, and memory limits.
- SandboxTestSpec: Bundles the target entrypoint, concrete test cases, and resource limits into the adapter request.
- SandboxRuntimeCapabilities: Reports the active worker platform, launcher strategy, and typed wall-clock, CPU, and memory enforcement status for the evaluation.
Current benchmark-aligned semantics:
- the judge now returns structured per-case verdicts instead of only a single aggregate outcome
- string comparison normalizes line endings and ignores trailing whitespace at line boundaries, which is closer to common benchmark judge behavior than the earlier full strip() comparison
- aggregate outcomes still map into the repository's typed TestingOutcome contract
- wall-clock limits are enforced per case at the subprocess boundary instead of only within one long-lived in-process harness
- on Windows, the judge now routes worker launch through a Job Object binding seam and targets hard process-memory limits through memory_limit_bytes when the host runtime permits dedicated job assignment
- the sandbox result now carries typed runtime-capability metadata so callers can inspect whether memory binding was attached, downgraded, skipped, or not applicable
- Windows CPU enforcement is now reported explicitly as unsupported until the repository has a verified Job Object CPU strategy
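The string-comparison rule in the list above can be sketched as follows. This is a minimal illustration, not the judge's actual code: normalize_output and outputs_match are hypothetical names, and the document does not say whether a final trailing newline is significant, so this sketch assumes it is.

```python
def normalize_output(text: str) -> str:
    """Unify line endings to LF and drop trailing whitespace on each line."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in unified.split("\n"))


def outputs_match(actual: str, expected: str) -> bool:
    """Compare two outputs after line-level normalization."""
    # Leading whitespace is preserved, unlike a full strip() comparison.
    return normalize_output(actual) == normalize_output(expected)
```

Under these assumptions, "a \r\nb\t\r\n" matches "a\nb\n", while "  a" does not match "a" because leading whitespace still counts.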
Current fidelity limits:
- the repository judge is still local and Python-only
- entrypoint and invocation semantics are still repository-defined rather than imported from a real benchmark harness
- wall-clock handling is enforced by the subprocess boundary
- POSIX CPU and memory limits use OS-level resource controls only on supported platforms
- Windows hard memory enforcement depends on whether the host runtime allows the worker to be rebound into a dedicated Job Object
- when Windows Job Object binding is unavailable, the judge keeps hard wall-clock enforcement and returns explicit platform diagnostics instead of claiming hard memory isolation
- Windows worker launch first attempts CREATE_BREAKAWAY_FROM_JOB and falls back to plain subprocess creation only when that strategy is unavailable
- Windows CPU limits are not claimed to be hard-enforced; wall-clock timeout remains the only guaranteed timing control on Windows
- on runtimes without usable OS-level memory controls, memory limits fall back to traced Python allocations and remain approximate
- output normalization is still a repository-level inference rather than a benchmark-specific ruleset
- exact benchmark-specific semantics, datasets, and multi-language support are still out of scope for this milestone
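Per-case wall-clock enforcement at the subprocess boundary can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not PythonSubprocessJudgeAdapter: run_case and the verdict strings are hypothetical, and it applies no CPU or memory limits.

```python
import subprocess
import sys
from typing import Tuple


def run_case(source: str, stdin_text: str, timeout_seconds: float) -> Tuple[str, str]:
    """Run one case in a fresh interpreter and return (verdict, stdout)."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return "time_limit_exceeded", ""
    if completed.returncode != 0:
        return "runtime_error", completed.stdout
    return "ok", completed.stdout
```

Because every case gets its own process, one hanging case cannot consume the time budget of the cases that follow it.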
evaluate_candidate_against_benchmark(problem, candidate, settings, *, adapter) -> BenchmarkEvaluationResult
Evaluate one extracted candidate through a typed external benchmark boundary.
Parameters:
- problem: BenchmarkProblemDefinition
- candidate: CodeCandidate
- settings: BenchmarkExecutionSettings
- adapter: BenchmarkAdapter
Behavior:
- keeps benchmark problem metadata, execution settings, verdict mapping, and artifact identifiers in explicit typed contracts
- keeps the external benchmark seam distinct from the repository-local subprocess judge
- validates candidate language against the requested benchmark execution language before dispatch
- returns a normalized repository TestingOutcome only through BenchmarkVerdictMapping, so core services do not consume benchmark-native payloads directly
Key benchmark-facing types:
- BenchmarkProblemDefinition: Canonical benchmark metadata including benchmark_name, dataset_name, source_problem_id, optional split_name, repository-facing identifier, prompt, and language.
- BenchmarkExecutionSettings: External-harness execution settings such as language, invocation mode, entrypoint, benchmark-owned resource limits, optional compile or run phase settings, and explicit runtime mode.
- BenchmarkPhaseExecutionSettings: One benchmark-owned compile or run phase with explicit source_layout, command, optional executable_target, and resource_limits.
- BenchmarkPhaseResourceLimits: Phase-specific time and memory limits for compile or run phases.
- BenchmarkTestCase: Benchmark-owned cases expressed independently from the repository-local judge payload types.
- BenchmarkVerdictMapping: Normalized mapping from a benchmark-native verdict string into repository TestingOutcome.
- BenchmarkArtifactIdentifiers: Typed identifiers for benchmark-side run artifacts such as run_id, submission_id, optional result or log URIs, and per-phase artifact ids.
- BenchmarkPhaseArtifactIdentifiers: Typed artifact ids for one compile or run phase.
- BenchmarkPhaseResult: Typed per-phase diagnostics that keep compile failures and run-time failures distinct.
- BenchmarkVendorSubmissionReceipt: Typed submission metadata for vendor-native runtimes.
- BenchmarkVendorPollSnapshot: One observed vendor-native poll state, including terminal verdict when known.
- BenchmarkEvaluationResult: Adapter result containing adapter status, runtime mode, normalized verdict mapping, typed diagnostics, optional artifact identifiers, optional per-phase results, and optional vendor submission lifecycle metadata.
Current scope and limits:
- the repository now exposes a typed benchmark adapter seam plus one canonical dataset-ingestion path for APPS-style JSONL artifacts
- the included StubBenchmarkAdapter is only for contract verification and fixture-driven tests
- the local subprocess judge remains the explicit development fallback for current solve execution
- the repository now includes concrete Python, Node.js, and Java benchmark execution adapters plus a multi-language dispatch adapter, and it also exposes a separate stubbed vendor-native runtime boundary for submission-lifecycle verification
load_canonical_benchmark_dataset(dataset_path, *, source_format=BenchmarkDatasetFormat.APPS_JSONL) -> CanonicalBenchmarkDataset
Load one supported external benchmark dataset artifact into canonical problem records.
Parameters:
- dataset_path: str | Path
- source_format: BenchmarkDatasetFormat = BenchmarkDatasetFormat.APPS_JSONL
Behavior:
- keeps source-layout parsing behind a typed dataset format selector instead of leaking vendor-specific keys into solve services
- normalizes supported source records into canonical BenchmarkProblemDefinition objects
- normalizes benchmark execution payloads into CanonicalBenchmarkRecord entries with explicit BenchmarkExecutionSettings and benchmark-owned BenchmarkTestCase values when the source row contains executable metadata
- preserves repository-facing identifier, source problem_id, benchmark split_name, prompt text, language, and optional difficulty
- returns dataset-level provenance through BenchmarkDatasetSource
- records normalization assumptions through CanonicalBenchmarkDataset.normalization_notes
Current supported dataset format:
- BenchmarkDatasetFormat.APPS_JSONL: Expects one JSON object per line with problem_id, question, and split. Optional fields: difficulty, language.
Current normalization rules:
- canonical identifiers are built as apps/<split>/<problem_id>
- split names are normalized to lowercase train or test
- prompt text converts CRLF or CR line endings to LF and trims trailing whitespace at line boundaries
- when input_output metadata is present, APPS fn_name selects function invocation and otherwise the loader normalizes the record as stdin-style execution
- APPS difficulty labels are mapped into repository tiers as an implementation inference: introductory -> easy, interview -> medium, competition -> hard
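The normalization rules above can be sketched for a single JSONL row using only the standard library. This is a hedged illustration rather than the loader itself: normalize_apps_row and its plain-dict return shape are hypothetical, and rejecting unknown splits or mapping unknown difficulty labels to None are assumptions the document does not confirm.

```python
import json

# Difficulty mapping quoted from the normalization rules above.
APPS_DIFFICULTY_TO_TIER = {
    "introductory": "easy",
    "interview": "medium",
    "competition": "hard",
}


def normalize_prompt(text: str) -> str:
    """Convert CRLF or CR to LF and trim trailing whitespace per line."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in unified.split("\n"))


def normalize_apps_row(line: str) -> dict:
    """Normalize one APPS-style JSONL line into a plain dict."""
    row = json.loads(line)
    split = str(row["split"]).lower()
    if split not in ("train", "test"):
        # Assumption: anything else is rejected rather than coerced.
        raise ValueError("unsupported split: {!r}".format(row["split"]))
    difficulty = row.get("difficulty")
    return {
        "identifier": "apps/{}/{}".format(split, row["problem_id"]),
        "split_name": split,
        "prompt": normalize_prompt(row["question"]),
        # Assumption: unknown or missing labels become None.
        "difficulty": APPS_DIFFICULTY_TO_TIER.get(difficulty) if difficulty else None,
    }
```

A row with split "TEST" and problem_id 42 would receive the identifier apps/test/42 under these rules.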
Current scope and limits:
- only APPS-style JSONL ingestion is wired in this milestone
- the repository does not bundle benchmark payloads and assumes the caller has legitimate local access to the source artifact
- some APPS rows may still load as metadata-only records when they do not include executable input_output payloads
- the first compiled-language local harness is now Java-first and stdin-oriented; wider compiled-language coverage still depends on host toolchain availability
evaluate_candidate_against_benchmark_record(record, candidate, *, adapter) -> BenchmarkEvaluationResult
Evaluate one candidate against a canonical benchmark dataset record that already contains benchmark-owned invocation settings and test cases.
Parameters:
- record: CanonicalBenchmarkRecord
- candidate: CodeCandidate
- adapter: BenchmarkAdapter
Behavior:
- dispatches through the benchmark adapter boundary using the record's own BenchmarkExecutionSettings
- preserves the distinction between metadata-only canonical records and executable records with benchmark-owned test cases
- allows benchmark execution to consume the canonical dataset layer directly instead of rebuilding ad hoc test specs outside the adapter seam
Current concrete benchmark paths:
- PythonBenchmarkJudgeAdapter: Uses the repository's subprocess sandbox for function-style Python cases and a standalone script path for stdin-style Python cases.
- NodeJsBenchmarkJudgeAdapter: Evaluates JavaScript benchmark records through local Node.js execution and emits benchmark-style verdict strings such as accepted and compilation_error.
- JavaBenchmarkJudgeAdapter: Compiles and executes stdin-style Java benchmark records through local javac plus java, while preserving compile-phase and run-phase artifacts.
- CppBenchmarkJudgeAdapter: Uses the same compile-then-run benchmark seam for C++ records and reports an explicit adapter error when the required compiler is unavailable.
- MultiLanguageBenchmarkJudgeAdapter: Dispatches to the configured Python, Node.js, C++, or Java benchmark harness based on the canonical record's BenchmarkExecutionSettings.language.
- StubVendorNativeBenchmarkAdapter: Exercises a vendor-native benchmark lifecycle through typed submission receipts, poll history, terminal verdict mapping, and artifact provenance without requiring a live external service.
Current fidelity limits:
- Python and JavaScript function-style invocation is closest to benchmark semantics when fn_name is available
- stdin-style Python and JavaScript execution now runs the candidate as a standalone script with benchmark-owned stdin payloads
- stdin-style Java execution now runs through a local compile-then-run harness when javac and java are available
- the JavaScript function path expects a CommonJS export and adds a repository-local compatibility shim for top-level solve(...) definitions
- local compiled-language coverage is still incomplete: Java is the first repository-local compiled harness, while C++ depends on host-local g++ availability and currently has no bundled fallback toolchain
- local harness artifact capture is file-based and now also preserves typed per-phase compile or run artifact identifiers
- real vendor-native integrations still depend on external authentication, licensing, and service availability; the repository currently verifies that boundary through a fixture-driven stub
- benchmark-specific output normalization rules and compiled-language local harnesses remain later milestones
plan_problem_topology(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> TopologyPlan
Return a validated topology plan for a problem instance.
Behavior:
- uses the explicit problem difficulty when present
- defaults missing difficulty to DifficultyLevel.MEDIUM
- infers a coarse local problem shape from prompt keywords
- selects one of a small set of topology templates when no policy is provided
- otherwise routes through the learned YAML planning boundary and validates the parsed result
- can load that learned planning boundary from explicit checkpoint metadata
- returns a validated single-turn TopologyPlan
Implementation inference:
- prompt-shape inference is a repository-local heuristic, not a paper-defined mechanism
- deterministic planning remains the local fallback so tests and offline callers do not require a model checkpoint
plan_problem_topology_candidate(problem, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan
Return the raw learned-policy YAML candidate plus its parsed topology.
Behavior:
- constructs an explicit first-turn prompt from the problem and selected difficulty
- calls the provided policy, or a checkpoint-backed loaded policy, through the narrow TopologyOrchestratorPolicy boundary
- extracts one repository YAML document from the raw response
- parses the YAML through the existing transport and topology-validation path
- retries failed extraction or validation attempts up to orchestrator_max_attempts
revise_problem_topology_candidate(revision, *, orchestrator_policy=None, orchestrator_checkpoint=None, orchestrator_checkpoint_id=None, orchestrator_device="cpu", orchestrator_max_attempts=1) -> LearnedTopologyPlan
Return the raw learned-policy revised YAML candidate plus its parsed topology.
Behavior:
- consumes the existing TopologyRevisionInput contract rather than raw strings
- includes prior topology YAML and testing feedback in the explicit revision prompt
- uses the same extraction, parsing, and retry path as first-turn planning, including the same checkpoint-backed policy path when configured
Policies must implement:
def generate_topology_candidate(
self,
*,
prompt: str,
request: OrchestratorPromptRequest,
) -> str: ...
Repository-local mock policies are still supported for tests, but checkpoint loading now also supports a repository-local frozen runtime bundle that materializes topology YAML candidates from serialized checkpoint state.
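A mock policy satisfying the generate_topology_candidate signature above might look like the sketch below. It is self-contained on purpose: FakePromptRequest is a hypothetical stand-in for OrchestratorPromptRequest (whose real fields are not listed in this document), FixedYamlPolicy is an invented test helper, and the YAML text is only one reading of the field names the repository YAML contract mentions.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class FakePromptRequest:
    """Hypothetical stand-in for OrchestratorPromptRequest."""

    kind: str


class PolicyLike(Protocol):
    """Structural type mirroring the required policy method."""

    def generate_topology_candidate(
        self, *, prompt: str, request: FakePromptRequest
    ) -> str: ...


# Assumed YAML layout built from the documented field names.
EXAMPLE_YAML = """\
difficulty: easy
steps:
  - index: 0
    agents:
      - name: planner_0
        role: planning
        refs: []
  - index: 1
    agents:
      - name: tester_1
        role: testing
        refs:
          - step_index: 0
            agent_name: planner_0
"""


class FixedYamlPolicy:
    """Mock policy that always returns the same raw response text."""

    def __init__(self, raw_response: str) -> None:
        self._raw_response = raw_response
        self.prompts = []  # recorded so tests can inspect what was asked

    def generate_topology_candidate(
        self, *, prompt: str, request: FakePromptRequest
    ) -> str:
        self.prompts.append(prompt)
        return self._raw_response
```

Both arguments are keyword-only, matching the signature above, so a call looks like policy.generate_topology_candidate(prompt=..., request=...).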
Checkpoint-backed frozen inference keeps explicit failure boundaries:
- invalid checkpoint source selection raises OrchestratorCheckpointSelectionError
- incompatible checkpoint metadata or missing runtime artifacts raises OrchestratorCheckpointLoadError
- there is no silent fallback to the deterministic planner after checkpoint selection or loading fails
LearnedTopologyPlan(
topology: TopologyPlan,
topology_yaml: str,
prompt: str,
raw_response: str,
attempt_count: int,
kind: TopologyPromptKind,
)
Failure behavior:
- missing extractable YAML raises TopologyCandidateExtractionError
- malformed YAML raises the existing parse-layer transport error
- schema-invalid or logic-invalid topologies raise the existing topology validation errors
- there is no silent fallback to deterministic planning after policy failure
serialize_topology_plan_to_yaml(...)
Serialize a validated typed topology plan into the repository YAML transport format.
Behavior:
- validates the topology before serialization
- emits YAML from the canonical TopologyPlan.to_mapping() transport shape
- keeps YAML encoding details behind the repository transport helper rather than exposing the YAML library directly to callers
parse_topology_plan_yaml(...)
Parse repository YAML text into a validated typed topology plan.
Behavior:
- parses YAML text through the infrastructure adapter boundary
- rejects malformed YAML with a parse-layer error
- rejects schema-invalid transport payloads with TopologySchemaError
- rejects graph-rule violations with TopologyLogicError
- returns the same typed TopologyPlan contract used by the existing planning and execution APIs
evaluate_candidate_batch(...)
Evaluate multiple candidate solutions through an orchestration boundary that keeps submission, worker execution, and collection separate from judge logic.
Behavior:
- preserves task ordering in the collected batch result
- supports explicit max_workers, max_retries, and collection_timeout_seconds
- keeps max_workers=1 as the local single-worker fallback path
- returns typed per-task statuses plus the underlying SandboxExecutionResult when available
run_benchmark_evaluation_entrypoint(dataset_path, output_path, *, checkpoint_source, checkpoint_id=None, source_format=BenchmarkDatasetFormat.APPS_JSONL, samples_per_problem=1, pass_k=None, max_workers=1, max_turns=2, orchestrator_device="cpu", orchestrator_max_attempts=1) -> EvaluationRunArtifact
Run checkpoint-backed frozen inference over a canonical benchmark dataset and write a structured evaluation artifact.
Behavior:
- loads a canonical benchmark dataset such as APPS JSONL through the benchmark dataset seam
- resolves one orchestrator checkpoint explicitly from a checkpoint directory, checkpoint metadata file, or training artifact JSON
- runs the current solve loop for each benchmark problem and each configured attempt index
- re-judges the emitted candidate through the benchmark adapter boundary rather than trusting repository-local solve diagnostics alone
- writes per-attempt results that preserve solve status, benchmark verdict, latency, topology size, benchmark artifact identifiers, and checkpoint id
- writes run metadata including dataset version, harness version, runtime mode, checkpoint provenance, reproduction claim, exact-reproduction readiness, blocking gap ids, and aggregate pass@1 / pass@k
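The document does not state which formula the entrypoint uses to aggregate pass@1 / pass@k. As a point of reference only, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), for n samples of which c are correct; pass_at_k and mean_pass_at_k are hypothetical helper names.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over per-problem (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With samples_per_problem=1, pass@1 under this estimator reduces to the fraction of problems whose single attempt was accepted.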
Current fidelity note:
- the default runtime still uses repository-local Python and JavaScript benchmark harness adapters, so the produced metrics are benchmark-aligned but not yet vendor-native leaderboard reproductions
- callers can now intentionally choose a vendor-native adapter boundary, but the bundled verification path is still the fixture-driven StubVendorNativeBenchmarkAdapter rather than a live service integration
build_reproduction_audit(...)
Return the repository's current strict paper-reproduction checklist as a typed in-memory object.
Behavior:
- records the current overall claim as exact or approximate
- lists line-item fidelity items with explicit exact, approximate, or blocked status
- returns the current blocking gap ids needed for a strict paper-level claim
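write_reproduction_audit(...)
Write the same reproduction audit to a JSON artifact.
write_reproduction_audit_entrypoint(...)
Public path-normalizing wrapper for the same audit artifact.
run_batch_evaluation_entrypoint(...)
Compatibility alias for run_benchmark_evaluation_entrypoint(...).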
generate_sft_dataset_entrypoint(dataset_path, *, sample_count=4500, seed=0, prompt_template_version="orchestrator-sft-v2", source_recipe_name="paper-oriented-synthetic-yaml-v1") -> tuple[SyntheticTopologySample, ...]
Generate a deterministic JSONL dataset of schema-valid topology targets derived from the current rule-based orchestrator.
Current transport note:
- target_topology remains JSON-serializable in the dataset artifact
- target_topology_yaml now carries the YAML-form target used by the SFT path
- the stored topology mapping now comes from the canonical TopologyPlan.to_mapping() transport shape rather than a training-local serializer
- the generator also writes a sidecar metadata file at <dataset>.metadata.json that records sample count, paper target size, difficulty breakdown, source recipe, prompt-template version, and reduced-scale status
run_sft_baseline_entrypoint(dataset_path, artifact_path, *, epochs=1, learning_rate=1e-4, seed=0, backbone_name="Qwen2.5-3B-Instruct", tokenizer_name="Qwen2.5-3B-Instruct", prompt_template_version="orchestrator-sft-v2", optimizer_name="adamw") -> SftTrainingArtifact
Validate the generated dataset and write a reproducible checkpoint-producing artifact for the repository-local SFT stage.
Behavior:
- validates that target_topology and target_topology_yaml stay in sync
- writes a YAML-target training manifest distinct from the source dataset
- emits a lightweight checkpoint directory with loadable metadata plus a repository-local orchestrator-runtime.json bundle for frozen inference
- records dataset provenance, source dataset metadata path, source recipe, sample count, reduced-scale label, optimizer name, backbone, tokenizer, prompt-template version, seed, and checkpoint location in the artifact
load_sft_checkpoint_entrypoint(...)
Load repository-local checkpoint metadata from a checkpoint directory, metadata file, or training-artifact-derived checkpoint source.
Implementation inference:
- this stage now emits checkpoint-shaped artifacts and loadable metadata, but it still does not claim paper-scale supervised fine-tuning fidelity by itself
compute_reward_breakdown_entrypoint(...)
Compute a repository-local reward breakdown from YAML validity, execution outcome, and topology-density signals.
run_rl_baseline_entrypoint(dataset_path, artifact_path, *, checkpoint_source, rollout_count=8, group_size=8, turn_budget=2, seed=0, optimizer_learning_rate=1e-5, optimizer_name="grpo-paper-oriented", checkpoint_device="cpu") -> RlTrainingArtifact
Run the repository-local RL training path from one source checkpoint and write rollout artifacts plus an updated checkpoint.
Behavior:
- resolves the source checkpoint from a checkpoint directory, metadata file, or training artifact JSON
- collects rollout records through the current bounded solve loop
- preserves per-rollout execution outcomes, YAML-derived topology artifacts, turn counts, reward breakdowns, and the resulting checkpoint identifier
- computes group-normalized advantages plus grouped rollout summaries as a paper-oriented GRPO-style update stage
- writes a new checkpoint directory with updated metadata, copied runtime bundle state when available, and a stubbed weight lineage marker
- returns a typed artifact that points to the rollout manifest, grouped-update artifact, and updated checkpoint
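The group-normalized advantage step in the list above can be sketched in a few lines. This is an illustrative GRPO-style normalization under stated assumptions, not the repository's update code: the function name, the use of population standard deviation, and the eps stabilizer are all assumptions.

```python
from statistics import fmean, pstdev
from typing import List


def group_normalized_advantages(
    rewards: List[float], group_size: int, eps: float = 1e-8
) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    if group_size < 1 or len(rewards) % group_size != 0:
        raise ValueError("rewards must split evenly into groups")
    advantages: List[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        std = pstdev(group)  # population std; an assumption of this sketch
        # eps keeps the division defined when every reward in a group is equal.
        advantages.extend((reward - mean) / (std + eps) for reward in group)
    return advantages
```

A group whose rollouts all earn the same reward yields zero advantages, so it contributes no learning signal in this formulation.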
Implementation inference:
- this path now reproduces grouped rollout collection and normalized-advantage bookkeeping more explicitly, but it still does not claim full paper-scale GRPO optimizer fidelity or distributed training behavior
TopologyPlan(
    difficulty: DifficultyLevel,
    steps: tuple[TopologyStep, ...],
)

Current scope:
- single-turn topology only
- layered DAG structure
- dependency-free parsing from plain mappings via TopologyPlan.from_mapping(...)
- canonical plain-mapping serialization via TopologyPlan.to_mapping()
Properties:
- node_count
- max_nodes
Methods:
- to_mapping() -> dict[str, Any]
- validate() -> None
- from_mapping(raw_plan: Mapping[str, Any]) -> TopologyPlan
Current transport contract:
- TopologyPlan remains the source of truth
- plain mappings and YAML are transport formats around the typed contract
- the repository YAML contract uses: difficulty -> steps -> index -> agents -> name/role/refs -> step_index/agent_name
- these YAML field names are repository-level implementation inferences rather than paper-stated schema facts
TopologyStep(
    index: int,
    agents: tuple[AgentInvocation, ...],
)

AgentInvocation(
    name: str,
    role: AgentRole,
    refs: tuple[AgentReference, ...] = (),
)

AgentReference(
    step_index: int,
    agent_name: str,
)

Enum values:
- retrieval
- planning
- algorithmic
- coding
- debugging
- testing
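To make the shape of the typed contract concrete, here is a self-contained sketch of frozen dataclasses that mirror the signatures above. These are local stand-ins, not the classes shipped in agentconductor: the enum member names are assumptions, difficulty is simplified to a plain string instead of DifficultyLevel, and no validation is performed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping


class AgentRole(str, Enum):
    # Member names are assumed; only the string values come from the document.
    RETRIEVAL = "retrieval"
    PLANNING = "planning"
    ALGORITHMIC = "algorithmic"
    CODING = "coding"
    DEBUGGING = "debugging"
    TESTING = "testing"


@dataclass(frozen=True)
class AgentReference:
    step_index: int
    agent_name: str


@dataclass(frozen=True)
class AgentInvocation:
    name: str
    role: AgentRole
    refs: tuple[AgentReference, ...] = ()


@dataclass(frozen=True)
class TopologyStep:
    index: int
    agents: tuple[AgentInvocation, ...]


@dataclass(frozen=True)
class TopologyPlan:
    difficulty: str  # simplification: the real contract uses DifficultyLevel
    steps: tuple[TopologyStep, ...]

    @property
    def node_count(self) -> int:
        # One node per agent invocation across all steps.
        return sum(len(step.agents) for step in self.steps)

    def to_mapping(self) -> dict[str, Any]:
        return {
            "difficulty": self.difficulty,
            "steps": [
                {
                    "index": step.index,
                    "agents": [
                        {
                            "name": agent.name,
                            "role": agent.role.value,
                            "refs": [
                                {"step_index": r.step_index, "agent_name": r.agent_name}
                                for r in agent.refs
                            ],
                        }
                        for agent in step.agents
                    ],
                }
                for step in self.steps
            ],
        }

    @classmethod
    def from_mapping(cls, raw_plan: Mapping[str, Any]) -> "TopologyPlan":
        steps = tuple(
            TopologyStep(
                index=raw_step["index"],
                agents=tuple(
                    AgentInvocation(
                        name=raw_agent["name"],
                        role=AgentRole(raw_agent["role"]),
                        refs=tuple(
                            AgentReference(**ref) for ref in raw_agent.get("refs", [])
                        ),
                    )
                    for raw_agent in raw_step["agents"]
                ),
            )
            for raw_step in raw_plan["steps"]
        )
        return cls(difficulty=raw_plan["difficulty"], steps=steps)
```

Because every field is an immutable tuple or scalar, a plan built this way can round-trip through to_mapping() and from_mapping(...) without losing information.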
The current topology validator enforces these constraints:
- the plan must contain at least one step
- step indices must be contiguous and zero-based
- every step must contain at least one agent
- agent names must be unique across the full topology
- the first step must not contain references
- references may target earlier steps only
- references must point to known prior agent names
- the final step must contain a testing agent
- the total node count must stay within the paper-derived difficulty budget
Difficulty-specific node budgets:
- easy: 4
- medium: 7
- hard: 10
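The constraints and node budgets listed above can be sketched as a checker over plain mappings. This is an illustration, not the repository validator: find_topology_violations and NODE_BUDGETS are hypothetical names, the function returns messages instead of raising, and it assumes a reference must name an agent declared in exactly the step it points to.

```python
NODE_BUDGETS = {"easy": 4, "medium": 7, "hard": 10}


def find_topology_violations(plan: dict) -> list[str]:
    """Return human-readable rule violations; an empty list means the plan passes."""
    steps = plan.get("steps", [])
    if not steps:
        return ["plan must contain at least one step"]

    problems: list[str] = []
    declared: dict[str, int] = {}  # agent name -> index of the step that declared it
    for position, step in enumerate(steps):
        if step["index"] != position:
            problems.append(f"step at position {position} has index {step['index']}")
        if not step["agents"]:
            problems.append(f"step {position} has no agents")
        for agent in step["agents"]:
            for ref in agent.get("refs", []):
                if position == 0:
                    problems.append(f"first-step agent {agent['name']} has references")
                elif ref["step_index"] >= position:
                    problems.append(f"{agent['name']} references a non-earlier step")
                elif declared.get(ref["agent_name"]) != ref["step_index"]:
                    problems.append(
                        f"{agent['name']} references unknown agent {ref['agent_name']}"
                    )
        # Register names only after this step's references have been checked.
        for agent in step["agents"]:
            if agent["name"] in declared:
                problems.append(f"duplicate agent name {agent['name']}")
            else:
                declared[agent["name"]] = position

    if not any(agent["role"] == "testing" for agent in steps[-1]["agents"]):
        problems.append("final step has no testing agent")

    node_count = sum(len(step["agents"]) for step in steps)
    budget = NODE_BUDGETS[plan["difficulty"]]
    if node_count > budget:
        problems.append(f"{node_count} nodes exceed the {plan['difficulty']} budget of {budget}")
    return problems
```

Collecting all violations in one pass is a convenience for this sketch; the typed API instead exposes validate() -> None and reports failures through TopologyValidationError subclasses.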
Implementation inference:
- the paper does not define a full concrete schema; from_mapping(...) remains a repository-local parsing contract around the typed model and now coexists with the YAML adapter boundary
from agentconductor import TopologyPlan

plan = TopologyPlan.from_mapping(
    {
        "difficulty": "easy",
        "steps": [
            {
                "index": 0,
                "agents": [
                    {"name": "planner_0", "role": "planning", "refs": []},
                ],
            },
            {
                "index": 1,
                "agents": [
                    {
                        "name": "coder_1",
                        "role": "coding",
                        "refs": [{"step_index": 0, "agent_name": "planner_0"}],
                    },
                ],
            },
            {
                "index": 2,
                "agents": [
                    {
                        "name": "tester_2",
                        "role": "testing",
                        "refs": [
                            {"step_index": 0, "agent_name": "planner_0"},
                            {"step_index": 1, "agent_name": "coder_1"},
                        ],
                    },
                ],
            },
        ],
    }
)

The paper's primary topology representation is YAML, and the repository now uses a stable YAML transport around the existing typed topology model.
Current repository YAML shape:
difficulty: easy
steps:
  - index: 0
    agents:
      - name: planner_0
        role: planning
        refs: []
  - index: 1
    agents:
      - name: coder_1
        role: coding
        refs:
          - step_index: 0
            agent_name: planner_0
  - index: 2
    agents:
      - name: tester_2
        role: testing
        refs:
          - step_index: 0
            agent_name: planner_0
          - step_index: 1
            agent_name: coder_1

Error categories preserved explicitly:
- YAML syntax failure: the text cannot be parsed as YAML
- schema failure: the YAML parses, but required repository fields or field types are wrong
- topology logic failure: the YAML parses into the repository schema, but the resulting TopologyPlan violates validation rules such as first-step references or a missing final testing agent
Backward-compatibility rule:
- existing typed APIs such as plan_problem_topology(...) will keep returning TopologyPlan
- YAML entrypoints are additive transport helpers, not replacements for the typed topology model
Current implementation status:
- the repository now has an infrastructure-only YAML adapter boundary for mapping -> YAML text and YAML text -> mapping
- YAML parser failures are translated into explicit topology-YAML transport errors instead of leaking raw library exceptions
- the repository now also supports YAML text -> TopologyPlan through the existing typed topology parser and validator
- topology validation failures are now split into TopologySchemaError for field-contract violations and TopologyLogicError for graph-rule violations, both under the existing TopologyValidationError base class
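The schema-versus-logic split can be illustrated with a small local exception hierarchy. The class names match the public contract, but these definitions are stand-ins written for this example: the base class deriving from Exception, the check_plan_mapping helper, and the two checks it performs are assumptions, not the repository's code.

```python
class TopologyValidationError(Exception):
    """Stand-in base class; callers can catch this to handle both categories."""


class TopologySchemaError(TopologyValidationError):
    """Field-contract violation: a required field is missing or has the wrong type."""


class TopologyLogicError(TopologyValidationError):
    """Graph-rule violation: the fields are well-formed but the topology is invalid."""


def check_plan_mapping(raw_plan: dict) -> None:
    """Hypothetical helper that raises the matching error category."""
    steps = raw_plan.get("steps")
    if not isinstance(steps, list):
        raise TopologySchemaError("'steps' must be a list")
    first_step_agents = steps[0]["agents"] if steps else []
    for agent in first_step_agents:
        if agent.get("refs"):
            raise TopologyLogicError(f"first-step agent {agent['name']} has references")


def classify(raw_plan: dict) -> str:
    """Map a plan mapping to 'schema', 'logic', or 'ok'."""
    try:
        check_plan_mapping(raw_plan)
    except TopologySchemaError:
        return "schema"
    except TopologyLogicError:
        return "logic"
    return "ok"


print(classify({"steps": "not-a-list"}))
```

Sharing one base class lets existing callers that catch TopologyValidationError keep working while newer callers branch on the subclass.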
It currently does:
- expose typed topology contracts
- serialize typed topologies into the repository YAML transport format
- parse repository YAML text back into validated typed topologies
- generate topology YAML candidates through an explicit learned-orchestrator policy boundary
- expose typed multi-turn state and revision-input contracts
- parse single-turn topologies from plain mappings
- validate paper-aligned topology structure
- emit deterministic topology plans for supported difficulty tiers
- emit deterministic revised topologies from prior-turn feedback
- emit learned-policy topology candidates for first-turn and later-turn planning
- resolve lightweight orchestrator checkpoints into the online solve loop and learned planning entrypoints
- execute non-testing worker roles through a typed worker-runtime adapter seam
- execute single-turn topologies with a local judge-backed testing role
- run a bounded multi-turn solve loop with early stop on pass
- produce lightweight loadable SFT checkpoint artifacts with explicit metadata
- return candidate code and structured execution traces from solve_problem(...)
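The bounded multi-turn loop with early stop on pass, listed above, follows a simple control-flow pattern. The sketch below shows only that pattern: bounded_solve and the run_turn callback are hypothetical names, and the real solve_problem(...) entrypoint carries typed state rather than a list of feedback strings.

```python
from typing import Callable


def bounded_solve(
    run_turn: Callable[[int, list[str]], tuple[bool, str]],
    turn_budget: int = 2,
) -> tuple[bool, list[str]]:
    """Run up to turn_budget turns and stop as soon as one turn passes."""
    if turn_budget < 1:
        raise ValueError("turn_budget must be at least 1")
    history: list[str] = []
    for turn in range(turn_budget):
        # Each turn sees the feedback recorded by all earlier turns.
        passed, feedback = run_turn(turn, history)
        history.append(feedback)
        if passed:
            return True, history  # early stop on pass
    return False, history


def fake_turn(turn: int, history: list[str]) -> tuple[bool, str]:
    # Stub worker for demonstration: fails on the first turn, passes on the second.
    return turn == 1, f"turn {turn} saw {len(history)} prior entries"


print(bounded_solve(fake_turn, turn_budget=3))
```

The returned history gives a per-turn record even when the budget runs out without a pass.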
The repository currently does not:
- load benchmark-grade model weights into a production inference runtime
- claim benchmark-exact frozen inference semantics
Implementation files:
- src/agentconductor/domain/topology.py
- src/agentconductor/application/orchestrator.py
- src/agentconductor/interfaces/planning.py
- src/agentconductor/interfaces/api.py
- src/agentconductor/application/api.py
- src/agentconductor/application/execution.py
- src/agentconductor/application/history.py
- src/agentconductor/domain/history.py
- src/agentconductor/domain/models.py
- src/agentconductor/infrastructure/sandbox.py