I evaluated the released HuggingFace checkpoint agentrl/ReSearch-Qwen-7B-Instruct, but my results are far below those reported in the paper. My results on Bamboogle:
Bamboogle: {'em': 0.16, 'f1': 0.20129523809523808, 'acc': 0.176, 'precision': 0.21292307692307694, 'recall': 0.19866666666666669}
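(For context on how I read these numbers: a gap this large between `acc`/`f1` and the paper's EM made me wonder whether answer normalization or extraction could be part of the issue. Below is a minimal sketch of standard SQuAD/HotpotQA-style EM/F1 scoring that I used to spot-check individual predictions by hand; this is my own reimplementation, not necessarily the exact metric code in the eval script.)

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation and
    # articles (a/an/the), and collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_score(pred: str, gold: str) -> float:
    # Exact match after normalization.
    return float(normalize_answer(pred) == normalize_answer(gold))

def f1_score(pred: str, gold: str) -> float:
    # Token-level F1 over the bag of normalized tokens.
    p, g = normalize_answer(pred).split(), normalize_answer(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(em_score("The Eiffel Tower", "eiffel tower"))  # 1.0
print(f1_score("Tower of Eiffel", "eiffel tower"))   # 0.8
```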
Table 2 of the ReSearch paper reports:
Bamboogle (test set) EM: 42.40
Here is the command I ran (following the guidance in the README):
python run_eval.py \
--config_path eval_config.yaml \
--method_name re-call \
--data_dir /home/a14-hliu/hl542/ReCall/data/ \
--dataset_name bamboogle \
--split test \
--save_dir /home/a14-hliu/hl542/ReCall/eval_results/re-call_qwen3-7b-instruct \
--save_note re-call_qwen3-7b_ins \
--sgl_remote_url http://127.0.0.1:8083 \
--remote_retriever_url http://127.0.0.1:8082 \
--generator_model /home/a14-hliu/.cache/huggingface/hub/models--agentrl--ReSearch-Qwen-7B-Instruct/snapshots/f0787566dce64b1363746137aca5dd432ac48b9e \
--sandbox_url http://127.0.0.1:8081
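Before digging into the metrics I also sanity-checked that all three local services the command points at were actually up, since the eval can silently degrade if, say, the retriever is unreachable. A minimal probe I used (the ports match the command above; adjust if your deployment differs):

```python
import urllib.request
import urllib.error

def endpoint_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the service at `url` answers an HTTP request at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server responded (even with a 4xx/5xx), so it is up.
        return True
    except (urllib.error.URLError, OSError):
        return False

# Endpoints taken from the run command above.
for name, url in [
    ("sglang generator", "http://127.0.0.1:8083"),
    ("remote retriever", "http://127.0.0.1:8082"),
    ("sandbox", "http://127.0.0.1:8081"),
]:
    print(f"{name}: {'up' if endpoint_reachable(url) else 'DOWN'}")
```

In my case all three reported up, so the gap does not seem to be a dead service.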
I also downloaded the retrieval materials listed in the README:
E5-base-v2
wiki18_100w_e5_index.zip
wiki18_100w.zip
Could you please advise what I might be missing? Thank you.