This document describes all CLI arguments supported by `fb eval`.
```bash
fb eval \
  -p runs/2026-01-17__12-54-55/output.jsonl
```
- `--config-path`
  Path to `config.toml` (used for `HF_TOKEN`/`HF_ENDPOINT` when loading the dataset).
  If not provided, uses default discovery (searching upward from `featurebench/infer`).
- `--predictions-path`, `-p`
  Path to the predictions JSONL file (typically `runs/<timestamp>/output.jsonl`).
  Required. If `-p gold` is passed, gold patches are deployed and the results are saved under `./runs/gold/`. Note that currently only lv1 tasks have gold patches.
- `--task-id`
  Specific task IDs (instance IDs) to evaluate. Accepts space-separated values.
  Default: all tasks in the predictions file.
- `--n-concurrent`
  Number of parallel workers.
  Default: `4`.
- `--timeout`
  Override the timeout for test execution (seconds).
  Default: `None` (uses `timeout_run` from `repo_settings`).
- `--gpu-ids`
  Comma-separated GPU IDs (e.g., `0,1,2,3`).
  Default: all available.
- `--proxy-port`
  Proxy port for the container network (host gateway), e.g. `--proxy-port 7890`.
  Default: `None`.
- `--review-codes`
  Save agent-generated code for review.
  Accepts `true`/`false`, `1`/`0`, `yes`/`no`.
  Default: `false`.
- `--dataset`
  HuggingFace dataset repo name (e.g., `LiberCoders/FeatureBench`).
  Default: `LiberCoders/FeatureBench`.
- `--split`
  Dataset split name (e.g., `lite`, `full`).
  Default: `full`.
- `--include-failed`
  Include predictions with `success=false` from `output.jsonl`.
  Default: skip failed predictions.
- `--force-rerun`
  Force rerun of the specified task IDs even if `report.json` already exists.
  Accepts space-separated task IDs or a `.txt` file path (one task_id per line).
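As a sketch of how the `--force-rerun` argument above can be resolved (a hypothetical helper, not the actual `fb eval` implementation), each value is treated either as a task ID or, when it ends in `.txt`, as a file containing one task ID per line:

```python
from pathlib import Path


def resolve_force_rerun(values):
    """Expand a list of --force-rerun values into task IDs.

    Each value is either a task ID itself or a path to a .txt file
    containing one task_id per line.
    """
    task_ids = []
    for value in values:
        if value.endswith(".txt"):
            # One task_id per line; skip blank lines.
            lines = Path(value).read_text().splitlines()
            task_ids.extend(line.strip() for line in lines if line.strip())
        else:
            task_ids.append(value)
    return task_ids
```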
```text
runs/{timestamp}/
├── report.json                  # Evaluation summary
└── eval_outputs/
    └── {instance_id}/
        └── attempt-{n}/
            ├── run_instance.log # Evaluation log
            ├── test_output.txt  # Test execution output
            ├── patch.diff       # Applied patch
            └── report.json      # Instance evaluation result
```
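Given the layout above, the per-instance results can be collected by walking the `eval_outputs` directory. This is a hypothetical sketch, not part of `fb eval` itself:

```python
import json
from pathlib import Path


def collect_reports(eval_outputs_dir):
    """Map instance_id -> {attempt number -> parsed report.json}."""
    results = {}
    for report in Path(eval_outputs_dir).glob("*/attempt-*/report.json"):
        instance_id = report.parent.parent.name          # {instance_id}/
        attempt = int(report.parent.name.split("-")[1])  # attempt-{n}/
        results.setdefault(instance_id, {})[attempt] = json.loads(
            report.read_text()
        )
    return results
```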
Summarize `eval_outputs` and generate a CSV report:

```bash
python featurebench/scripts/cal_eval_outputs.py --path <eval_outputs_dir> --attempt-mode <attempt_mode>
```

`<attempt_mode>` can be `best`, `worst`, or a number (e.g., `1`, `2`, `3`). Default: `best`.
When `<attempt_mode>` is a number `k` and `--merge` is enabled, attempts `1..k` are merged:
- `pass_rate`: average over the first `k` attempts
- `resolved`: pass@k style (success if any of the first `k` succeeds)
- `prompt_tokens`/`completion_tokens`: sum over the first `k`, then take the mean
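The merge rules above can be sketched for a single instance as follows (a hypothetical helper with an assumed per-attempt record shape, not the actual `cal_eval_outputs.py` code; the mean over token sums is presumably taken later, across instances, when the CSV is built):

```python
def merge_attempts(attempts, k):
    """Merge the first k attempt records for a single instance.

    Each record is assumed to look like:
    {"pass_rate": float, "resolved": bool,
     "prompt_tokens": int, "completion_tokens": int}
    """
    first_k = attempts[:k]
    return {
        # pass_rate: average over the first k attempts
        "pass_rate": sum(a["pass_rate"] for a in first_k) / len(first_k),
        # resolved: pass@k style -- success if any of the first k succeeds
        "resolved": any(a["resolved"] for a in first_k),
        # token counts: summed over the first k attempts for this instance
        "prompt_tokens": sum(a["prompt_tokens"] for a in first_k),
        "completion_tokens": sum(a["completion_tokens"] for a in first_k),
    }
```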