🏠 Homepage • 📄 Paper • 🤗 Dataset • 📖 README
Tip
ACPBench ❤️ lm-evaluation-harness ❤️ hugging-face ❤️ Inspect-AI!
ACPBench and ACPBench-Hard are integrated with two evaluation frameworks, so you can quickly evaluate existing pretrained models as well as custom finetuned models.
Both datasets can be evaluated with either framework:
- lm-evaluation-harness by EleutherAI
- Inspect AI by the UK AI Security Institute
Evaluate your model on ACPBench with lm-evaluation-harness using the following command:
lm_eval --model <your-model> \
    --model_args <model-args> \
    --tasks acp_bench \
    --output <output-folder> \
    --log_samples
With Inspect AI, evaluate your model on ACPBench using either of these commands:
Option A: Direct HuggingFace integration
inspect eval hf/ibm-research/acp_bench --model <your-model>
Option B: Local evaluation script
inspect eval evals/acpbench.py --model <your-model>
ACPBench-Hard includes 8 challenging tasks, with dev and test sets available both in this repository and on HuggingFace.
Evaluate your model on ACPBench-Hard with lm-evaluation-harness using the following command:
lm_eval --model <your-model> \
    --model_args <model-args> \
    --tasks acp_bench_hard \
    --output <output-folder> \
    --log_samples
With Inspect AI, evaluate your model on ACPBench-Hard using the local evaluation script:
inspect eval evals/acpbench_hard.py --model <your-model>
For custom implementations, you can either use the 'exact_match' metric from HuggingFace, or generate outputs in lm-evaluation-harness format and score them with the provided evaluation scripts.
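To make the exact-match option concrete, here is a minimal plain-Python sketch of the idea. It is a stand-in for illustration, not the HuggingFace 'exact_match' implementation: the helper names `normalize` and `exact_match_score` are hypothetical, and the normalization (trimming, lowercasing, dropping a trailing period) is an assumption about what counts as a match.

```python
def normalize(text: str) -> str:
    """Trim, lowercase, and drop trailing periods so 'Yes.' matches 'yes'.

    This normalization is an assumption made for this sketch only.
    """
    return text.strip().lower().rstrip(".")


def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    # Toy inputs for illustration, not real benchmark results.
    preds = ["Yes", "no", "Yes"]
    golds = ["yes", "no", "no"]
    print(exact_match_score(preds, golds))
```

A sketch like this only suits answers that are already short strings such as "yes" or "no"; free-form responses need an answer-extraction step first.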
Generate outputs for each example in the following lm-eval format:
[
  {
    "doc_id": 0,
    "doc": {
      "id": -8342636639526456067,
      "group": "applicable_actions_bool",
      "context": "This is a ferry domain, ...",
      "question": "Is the following action applicable in this state: travel by sea from location l1 to location l0?",
      "answer": "yes"
    },
    "resp": [["... Therefore, the answer is Yes",
              "... the answer is Yes",
              "Yes",
              "The answer is yes",
              "the action is applicable"]],
    "filtered_resps": [
      [
        "Yes",
        "Yes",
        "Yes",
        "Yes",
        "Yes"
      ]
    ]
  },
  ...
]
Once the JSON file is created, use the evaluation script to compute scores:
python evaluation_bool_mcq.py --results <results-json-filepath> --gt <ground-truth-json-filepath>
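As a rough sketch of how such a results file could be assembled from your own generation loop, the snippet below builds records with the same keys as the example above and writes them as JSON. The helpers `make_record` and `write_results` and the file name `results.json` are hypothetical, and which fields the evaluation script actually reads is an assumption here; check the script before relying on this layout.

```python
import json
from pathlib import Path


def make_record(doc_id: int, doc: dict, responses: list[str], filtered: list[str]) -> dict:
    """Build one entry that mirrors the keys shown in the example above.

    Raw and filtered responses are each wrapped in an outer list,
    matching the nesting used in the example.
    """
    return {
        "doc_id": doc_id,
        "doc": doc,
        "resp": [responses],
        "filtered_resps": [filtered],
    }


def write_results(records: list[dict], path: str) -> None:
    """Serialize all records as a single JSON array."""
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


if __name__ == "__main__":
    # Example document copied from the format description above.
    doc = {
        "id": -8342636639526456067,
        "group": "applicable_actions_bool",
        "context": "This is a ferry domain, ...",
        "question": "Is the following action applicable in this state: "
                    "travel by sea from location l1 to location l0?",
        "answer": "yes",
    }
    record = make_record(0, doc, ["... Therefore, the answer is Yes"], ["Yes"])
    write_results([record], "results.json")  # hypothetical output path
```

The resulting file can then be passed to the evaluation script through `--results`.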