
Released model performance lower than reported in the paper #91

@Hyfred

Description


I used the released Hugging Face checkpoint agentrl/ReSearch-Qwen-7B-Instruct, but my evaluation results are much lower than those reported in the paper. My results:
Bamboogle: {'em': 0.16, 'f1': 0.20129523809523808, 'acc': 0.176, 'precision': 0.21292307692307694, 'recall': 0.19866666666666669}

Table 2 of the ReSearch paper reports:
Bamboogle (test set) EM: 42.40
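For reference, EM on open-domain QA benchmarks is usually computed with SQuAD-style answer normalization, so a mismatch in normalization (or stray prompt/stop tokens left in the prediction) can make EM look far lower than the paper's number. A minimal sketch of the conventional metric (the function names here are illustrative, not the repo's actual eval code):

```python
import re
import string


def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and
    articles (a/an/the), and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))
```

If the checkpoint's raw output still contains reasoning or tool-call markup when it reaches a scorer like this, EM will be near zero even when the final answer is correct, which could explain a gap of this size.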

Here is my running command (following the guidance in the README):

python run_eval.py \
    --config_path eval_config.yaml \
    --method_name re-call \
    --data_dir /home/a14-hliu/hl542/ReCall/data/ \
    --dataset_name bamboogle \
    --split test \
    --save_dir /home/a14-hliu/hl542/ReCall/eval_results/re-call_qwen3-7b-instruct \
    --save_note re-call_qwen3-7b_ins \
    --sgl_remote_url http://127.0.0.1:8083 \
    --remote_retriever_url http://127.0.0.1:8082 \
    --generator_model /home/a14-hliu/.cache/huggingface/hub/models--agentrl--ReSearch-Qwen-7B-Instruct/snapshots/f0787566dce64b1363746137aca5dd432ac48b9e \
    --sandbox_url http://127.0.0.1:8081

I also followed the README instructions for the retrieval materials:

E5-base-v2
wiki18_100w_e5_index.zip
wiki18_100w.zip

Could you please advise what I might be missing? Thank you.
