[plan][feat]: SWE-bench Pro Evaluation Harness Integration #973

@ayazhankadessova

Description

Placeholder for multi-agent planning in progress. This will be updated with the consensus plan.

Feature: Build a SWE-bench Pro evaluation harness that tests the agentize lol impl pipeline against real-world software engineering tasks. The harness covers task ingestion, automated repository setup, isolated worktree execution, patch scoring, and metrics collection (tokens, accuracy, wall time).
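
As a rough illustration of the metrics-collection piece only, a per-task record and an aggregate summary might look like the sketch below. The plan is not finalized, so everything here is an assumption: `TaskResult`, its field names, and `summarize` are hypothetical and not part of agentize or SWE-bench Pro.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical record; field names are assumptions)."""
    task_id: str
    resolved: bool       # True if the generated patch passed the task's scoring
    tokens: int          # total tokens consumed by the run
    wall_time_s: float   # wall-clock seconds for the run


def summarize(results: List[TaskResult]) -> dict:
    """Aggregate accuracy, token usage, and wall time over a batch of results."""
    n = len(results)
    if n == 0:
        # Avoid dividing by zero when no tasks were run.
        return {"tasks": 0, "accuracy": 0.0, "total_tokens": 0, "total_wall_time_s": 0.0}
    resolved = sum(1 for r in results if r.resolved)
    return {
        "tasks": n,
        "accuracy": resolved / n,
        "total_tokens": sum(r.tokens for r in results),
        "total_wall_time_s": sum(r.wall_time_s for r in results),
    }
```

Keeping one flat record per task would make it straightforward to write results out as JSON lines and recompute the summary later, though the consensus plan may settle on a different shape.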

Proposed Solution

Planning in progress via ultra-planner...

Related PR

TBD - will be updated when PR is created
