ImpliRet (Implicit Fact Retrieval) flips the usual IR setup:
the query is intentionally surface-level (often just a *who / what / when*), while the evidence hides implicitly inside the document and must be inferred rather than string-matched (e.g., that LMU is a university in Germany, as in the GIF below).
| Dimension | Variants | Example in the document | User query asks for |
|---|---|---|---|
| Reasoning type (3) | Arithmetic, Temporal, World-Knowledge | “The 2024 model is 2.5 × last year’s price.” | “How much does the 2024 model cost?” |
| Discourse style (2) | multispeaker forum threads / unispeaker chat logs | Ten-turn chat by one speaker vs. two-post Q&A | same |
Corpus layout
- 6 document pools = 3 reasoning types × 2 discourse styles
- 6K documents + 6K queries (1 : 1), 1.5K in each pool
- Each query has exactly one positive passage; the rest of the pool are hard negatives.
- Dialogue/forum text is auto-generated by Gemma-3-27B and verified by a second LLM to ensure the implicit clue exists but is never stated explicitly.
Why it matters
The best baseline (ReasonIR-8B) reaches only ≈ 25 % nDCG@10, and even GPT-4.1 falters when asked to choose the right passage from 10 look-alikes—highlighting that document-side reasoning is still an open challenge.
🔬 Retrieval & RAG Results
The table below reports nDCG@10 (↑ higher is better) for our baseline retrievers.
| Retriever | W. Know. | Arithmetic | Temporal | Average |
|---|---|---|---|---|
| **Sparse** | | | | |
| BM25 | 14.69 | 11.06 | 10.98 | 12.24 |
| **Late Interaction** | | | | |
| ColBERT v2 | 15.79 | 14.96 | 11.99 | 14.25 |
| **Dense Encoders** | | | | |
| Contriever | 16.50 | 13.70 | 12.73 | 14.31 |
| Dragon+ | 17.46 | 14.61 | 12.66 | 14.91 |
| ReasonIR-8B | 18.88 | 10.78 | 11.25 | 13.64 |
| **Knowledge-Graph-Augmented** | | | | |
| HippoRAG 2 | 16.62 | 14.13 | 12.83 | 14.53 |
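For context on the sparse row, BM25 scoring can be sketched from scratch in a few lines. The toy corpus and query below are illustrations, not ImpliRet data, and the repo ships its own BM25s implementation; this sketch only shows why lexical matching finds the right document here while the asked-for fact (the price) is never stated:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with classic BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                      # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy pool: only doc 0 shares surface tokens ("2024", "model") with the query,
# but the actual price must still be inferred from the multiplier.
corpus = [
    "the 2024 model is 2.5 times last year's price",
    "we discussed the weather and the new cafe downtown",
    "the warranty covers two years of accidental damage",
]
docs = [c.split() for c in corpus]
query = "how much does the 2024 model cost".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best)  # → 0
```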
🧩 RAG‑style Evaluation
The table below shows ROUGE‑1 recall (R‑1@k) for two long‑context LLM readers when the top‑k retrieved documents (oracle setting) are supplied.
| Experiment | k | W. Know. | Arithmetic | Temporal | Average |
|---|---|---|---|---|---|
| Llama 3.3 70B | 1 | 73.79 | 90.13 | 81.85 | 81.92 |
| | 10 | 27.37 | 16.98 | 25.23 | 23.19 |
| | 30 | 17.43 | 4.42 | 10.29 | 10.71 |
| GPT-4.1 | 1 | 93.24 | 92.12 | 84.90 | 88.05 |
| | 10 | 62.21 | 23.86 | 15.59 | 35.06 |
| | 30 | 53.91 | 9.28 | 6.93 | 22.90 |
| GPT-o4-mini | 1 | 92.34 | 92.45 | 93.44 | 92.74 |
| | 10 | 88.11 | 76.61 | 73.94 | 79.55 |
| | 30 | 75.44 | 76.31 | 14.86 | 55.54 |
Table 3. ROUGE‑1 recall (R‑1@k), averaged over uni‑speaker and multi‑speaker documents.
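ROUGE‑1 recall is plain unigram recall of the gold answer against the model's output. A minimal sketch (whitespace tokenization is an assumption; the paper's exact implementation may differ):

```python
from collections import Counter

def rouge1_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference unigrams also present in the hypothesis,
    with counts clipped, i.e. ROUGE-1 recall."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum(min(c, hyp[t]) for t, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

# Hypothetical gold answer and model output, for illustration only.
gold = "the 2024 model costs 1000 dollars"
answer = "it costs about 1000 dollars"
print(round(rouge1_recall(gold, answer), 2))  # → 0.5
```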
You can load the ImpliRet dataset via 🤗 Hugging Face like this:
- Repository: `zeinabTaghavi/ImpliRet`
- Reasoning categories (`split`): `arithmetic`, `wknow`, `temporal`
- Discourse styles (`name`): `multispeaker`, `unispeaker`
```python
from datasets import load_dataset

ds = load_dataset(
    "zeinabTaghavi/ImpliRet",
    name="multispeaker",   # or "unispeaker"
    split="arithmetic"     # or "wknow" | "temporal"
)

print(ds.features)         # quick schema check
print(ds[0]["question"])   # sanity sample
```
```shell
# clone & install
$ git clone https://github.com/ZeinabTaghavi/ImpliRet.git
$ cd ImpliRet
$ python -m venv impliret_env && source impliret_env/bin/activate
$ pip install -r requirements.txt
```

Repository map
```
├── RAG_Style/
│   ├── experiment_configs   # configs for RAG with retrievers or the oracle retriever
│   ├── model_configs        # config for each LLM used in RAG_Style
│   ├── script               # asynchronous and synchronous experiment scripts
│   ├── results
│   └── reports
├── Retrieval/
│   ├── retrievals           # code for each retrieval baseline
│   ├── results
│   └── reports
└── README.md
```
Running the retrieval baselines (index creation). Available retrievers: BM25s, ColBertV2, Contriever, DragonPlus, HippoRagV2, ReasonIR.
Example: run the retriever and generate the report with `bash Retrieval/retrieve.sh`, which performs the following steps:
```shell
# run the retrieval indexing
python ./Retrieval/retrieve_indexing.py --output_folder ./Retrieval/results/ --category arithmetic --discourse multispeaker --retriever_name bm25

# reporting
python Retrieval/reporting.py
```

Indexing results are written to `Retrieval/results`.
Reports (MRR, nDCG@10 …) are stored in Retrieval/reports.
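Since each query has exactly one positive document, both reported metrics reduce to simple functions of the gold document's rank. A sketch of that reduction (our own helpers for illustration, not the repo's reporting code):

```python
import math

def mrr(rank: int) -> float:
    """Reciprocal rank of the single positive document (rank is 1-indexed)."""
    return 1.0 / rank

def ndcg_at_10(rank: int) -> float:
    """nDCG@10 with one relevant document: 1/log2(rank + 1), or 0 past the
    cutoff. The ideal DCG (positive at rank 1) is 1, so nothing else to
    normalize by."""
    return 1.0 / math.log2(rank + 1) if rank <= 10 else 0.0

# e.g. gold document retrieved at rank 3 for one query, rank 12 for another
print(ndcg_at_10(3), ndcg_at_10(12), mrr(3))  # → 0.5 0.0 0.333...
```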
Here we evaluate the long-context and RAG settings. The experiment configs live in the RAG_Style/experiment_configs folder, and the model configs in RAG_Style/model_configs.
You can choose among three setups for running this experiment:
Note: all examples below use the Arithmetic category (`A` in the file name) and the multi-speaker discourse style (`Multi` in the file name).
1- Simplest way: load the model locally with vLLM via `bash RAG_Style/s_run_tests.sh`, which does the following in detail:
```shell
# example with
#   LLM: Llama 3.3 70B, retriever: BM25s
#   number of documents given to the LLM: 10
#   hence the configuration file name is A_Multi_llama_bm_10.yaml
export HF_HOME=...
export HF_TOKEN=...

python ./RAG_Style/scripts/sync/sync_run_tests.py \
    --config ./RAG_Style/experiment_configs/bm/A_Multi_llama_bm_10.yaml
```
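For orientation, such an experiment config might look roughly like the YAML below. Every field name here is a guess for illustration; check the actual files in `RAG_Style/experiment_configs/` for the real schema:

```yaml
# Hypothetical shape of A_Multi_llama_bm_10.yaml -- all keys are assumptions.
model: meta-llama/Llama-3.3-70B-Instruct
retriever: bm25
category: arithmetic
discourse: multispeaker
top_k: 10
```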
2- Load vLLM on a server with `bash RAG_Style/async_run_multi_llama.sh`, which does the following in detail:
```shell
export HF_HOME=...
export HF_TOKEN=...

# ------------------------------------------------------------------
# Start vLLM server via helper script (background) and wait for load
# ------------------------------------------------------------------
# run_tests.sh (top of file)
PROJECT_ROOT=...   # adjust once
source "$PROJECT_ROOT/scripts/async/start_vllm.sh"

# example with
#   LLM: Llama 3.3 70B, retriever: oracle (positive document is in the context)
#   number of documents given to the LLM: 10 (1 positive, 9 negatives)
#   hence the configuration file name is A_Multi_llama_10.yaml
python ./RAG_Style/scripts/async/async_run_tests.py \
    --config ./RAG_Style/experiment_configs/oracle_retriever/A_Multi_llama_10.yaml

# ------------------------------------------------------------------
# Shut down the vLLM server
# ------------------------------------------------------------------
echo "Stopping vLLM server (PID=$VLLM_PID)"
kill $VLLM_PID
wait $VLLM_PID 2>/dev/null
```
3- Use models like GPT that do not need local server loading, with `RAG_Style/async_run_multi_GPT.sh`, or in detail:
The outputs will be hashed and stored in Experiments/evaluation/results.
```shell
# example with
#   LLM: GPT-4.1, retriever: oracle (positive document is in the context)
#   number of documents given to the LLM: 10 (1 positive, 9 negatives)
#   hence the configuration file name is A_Multi_GPT_10.yaml
python RAG_Style/scripts/async/async_run_tests.py \
    --config RAG_Style/experiment_configs/oracle_retriever/A_Multi_GPT_10.yaml
```

You can generate the RAG report with the following command:
```shell
# reporting the results
python RAG_Style/scripts/reporting.py
```

The results will be stored in the RAG_Style/results folder.
We welcome external baselines! The quickest path is through two companion notebooks:
| Notebook | Purpose |
|---|---|
| 📓 `notebook.ipynb` | End‑to‑end evaluation harness for all built‑in retrievers. Run this first to verify your setup. |
| 🚀 `contribute.ipynb` | Step‑by‑step template for creating a custom `MyRetriever`, indexing the corpus, and running the full metric suite. |
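For a feel of what a custom retriever boils down to — index a pool once, then return ranked document indices per query — here is a minimal sketch. The class name and method signatures are assumptions for illustration, not the notebook's actual API:

```python
from typing import List

class MyRetriever:
    """Hypothetical retriever interface: index a document pool once,
    then return document indices ranked by relevance for each query."""

    def index(self, documents: List[str]) -> None:
        self.documents = documents

    def retrieve(self, query: str, k: int = 10) -> List[int]:
        # Toy scoring: count shared lowercase tokens.
        # Replace with your actual scoring model.
        q = set(query.lower().split())
        scores = [len(q & set(d.lower().split())) for d in self.documents]
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

retriever = MyRetriever()
retriever.index(["alpha beta gamma", "delta beta", "epsilon"])
print(retriever.retrieve("beta gamma", k=2))  # → [0, 1]
```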
- Fork this repository (or clone it locally).
- Add code (optional). Use the 🚀 `contribute.ipynb` notebook to structure and export your custom retriever code.
- Submit results only (optional). Prefer to keep your code private? Run `contribute.ipynb`, generate the metrics, and verify the output format.
- Send it in. Open a pull request or email the artefacts (results ± code) plus a short description to zeinabtaghavi1377@gmail.com.
- We'll merge, trigger CI, and add your numbers to Table 2 and the badges. 🥳🎉
Questions? Open an issue or drop us an email at zeinabtaghavi1377@gmail.com. Happy to help! 😃
@inproceedings{taghavi-etal-2025-impliret,
author = {Zeinab Sadat Taghavi and Ali Modarressi and Yunpu Ma and Hinrich Sch{\"u}tze},
title = {ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge},
booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
year = {2025},
month = nov,
address = {Suzhou, China},
publisher = {Association for Computational Linguistics},
}