This folder is for offline batch experiments. It is separate from the frontend / one-job wrap path so experimental runs do not overwrite existing UI or pipeline artifacts.
Frontend / one-job path:
python -m src.evaluate.wrap_card_query_eval --jobs-json <jobs.json> --job-id <job_id> --llm-mode iterUnified backend batch experiment path:
python src/batch_exp/run_batch_eval.py \
--queries-file data_251117/query/query_rewrite_polished.jsonlReuse an existing query2modelcard intermediate:
python src/batch_exp/run_batch_eval.py --jobs-json <batch_preset_queries.json>query2modelcard: either run through the backend from--queries-file, or loaded from--jobs-json.modelcard2nugget: batch-generate missing model-card nugget CSVs, reusing existingdata_251117/card2nugget/*.csv.query2nugget: batch-map all queries to selected headers and optional constraints.filter: for each query + method + top-k group, filter candidate nugget CSVs and record method scores/rankings.
Use the polished model recommendation query set here, not the old smoke / 15-example files:
python src/batch_exp/run_batch_eval.py \
--queries-file data_251117/query/query_rewrite_polished.jsonl \
--top-k 1 3 5 10 \
--query-llm-mode batch \
--card-llm-mode batch \
--match-build structuredrun_batch_eval.py will automatically request enough query2modelcard results to cover the largest requested --top-k, then reuse the deduped model-card list for batch modelcard2nugget extraction and per-query/per-top-k filtering.
Reuse an existing query2modelcard intermediate:
python src/batch_exp/run_batch_eval.py \
--jobs-json jobs_251117/batch_runs/batch_preset_queries_20260501_154026.json \
--top-k 1 3 5 10 \
--query-llm-mode batch \
--card-llm-mode batch \
--match-build structuredpython src/batch_exp/run_batch_eval.py \
--queries-file data_251117/query/query_rewrite_polished.jsonl \
--top-k 1 3 5 10 \
--query-llm-mode batch \
--card-llm-mode batch \
--match-build structuredOutputs are written under:
data_251117/evaluate/query2modelcard_nugget_batch/<run_id>/
In backend mode the saved intermediate is:
data_251117/evaluate/query2modelcard_nugget_batch/<run_id>/query2modelcard_backend_intermediate.json
If you want a separate figure that emphasizes per-method nugget distributions, use the copied local batch outputs:
python src/batch_exp/plot_distribution_eval.py \
--summary-json data_251117/evaluate/query2modelcard_nugget_batch/aggregate_summary.json \
--per-query-jsonl data_251117/evaluate/query2modelcard_nugget_batch/per_query_method_scores.jsonl \
--output data_251117/statistic.png \
--compare-top-k 10This alternative figure is organized as:
- two
top-kblocks per row, each block split into a left distribution panel and a right rank-share panel; - each
top-kpanel uses the correspondingtop_kslice fromper_query_method_scores.jsonl; - blue-toned histograms on the left and red-toned rank shares on the right;
- a separate
.pngoutput so it does not overwrite the current main figure.
If per_query_method_scores.jsonl sits next to the summary JSON, --per-query-jsonl can be omitted and the plot script will resolve it automatically.
By default the script plots every top-k present in the summary. Pass one or more values to restrict the figure, for example --compare-top-k 1 3 5 10.
If the ranking changes across top-k, keep the per-k panels in the figure and describe the sensitivity explicitly in the writeup instead of collapsing everything to top-10.