Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge
🌐 Website | 📄 EACL (Virtual Oral) | 🤗 Dataset | 🐦 X (Twitter)
This is the official implementation for the paper "Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge", which explores how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model's initial parametric knowledge. Our paper is accepted at EACL 2026 (Main, Virtual Oral Presentation).
Our TRACK benchmark (Testing Reasoning Amid Conflicting Knowledge) spans three reasoning scenarios: Multi-Hop QA (WIKI), Code Generation (CODE), and Mathematical Reasoning (MATH). The benchmark follows a two-stage process: Knowledge Probing and Knowledge Injection. Performance is assessed using our metrics: Answer Pass (AP), Full Knowledge Entailment (FKE), and Holistic Pass (HP).
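The formal metric definitions are in the paper; as a rough illustrative sketch only (assuming AP checks the final answer string, FKE checks that every updated fact is reflected in the reasoning trace, and HP requires both — our reading, not the paper's formal definitions):

```python
def answer_pass(pred: str, gold: str) -> bool:
    # AP: does the final answer match the gold answer? (toy string match)
    return pred.strip().lower() == gold.strip().lower()

def full_knowledge_entailment(trace: str, facts: list[str]) -> bool:
    # FKE: is every injected fact reflected in the reasoning trace?
    # (toy substring check; the paper uses entailment-style evaluation)
    return all(f.lower() in trace.lower() for f in facts)

def holistic_pass(pred: str, gold: str, trace: str, facts: list[str]) -> bool:
    # HP: both the answer and the knowledge usage must be correct.
    return answer_pass(pred, gold) and full_knowledge_entailment(trace, facts)
```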
Takeaways:
- Providing updated facts yields limited performance gains and can even worsen performance compared with providing no facts at all.
- Performance further degrades with more updated facts.
- The failure stems both from an inability to faithfully integrate updated facts and from flawed reasoning even when the facts are integrated.
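In its simplest form, knowledge injection prepends the updated facts to the question as in-context knowledge. A schematic sketch (the exact prompt templates live in the repo; `build_injected_prompt` is a hypothetical helper):

```python
def build_injected_prompt(facts: list[str], question: str) -> str:
    # Prepend updated facts so they override conflicting parametric knowledge.
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Assume the following facts are true, even if they conflict with "
        "your prior knowledge:\n" + fact_block + "\n\nQuestion: " + question
    )
```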
The TRACK benchmark is available on HuggingFace Datasets with three configs: wiki, code, and math (500 examples each).
Each example contains five fields aligned with the paper:
| Field | Paper notation | Description |
|---|---|---|
| `question` | | Complex multi-step reasoning question |
| `answer` | | Final answer |
| `probing_questions` | | List of probing questions |
| `probing_answers` | | List of probing answers |
| `atomic_facts` | | List of required atomic facts |
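As a quick sanity check, the schema can be validated on any loaded example. A sketch (the example values below are invented for illustration; only the field names come from the table above):

```python
# Hypothetical example mirroring the five TRACK fields (values are invented).
example = {
    "question": "Which country is the birthplace of the director of Film X?",
    "answer": "France",
    "probing_questions": ["Who directed Film X?", "Where was the director born?"],
    "probing_answers": ["Director Y", "France"],
    "atomic_facts": ["Film X was directed by Director Y.",
                     "Director Y was born in France."],
}

REQUIRED_FIELDS = {"question", "answer", "probing_questions",
                   "probing_answers", "atomic_facts"}

def validate(ex: dict) -> bool:
    # Check that an example exposes the five fields and aligned probing lists.
    return (
        REQUIRED_FIELDS <= ex.keys()
        and isinstance(ex["probing_questions"], list)
        and len(ex["probing_questions"]) == len(ex["probing_answers"])
    )
```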
from datasets import load_dataset
wiki = load_dataset("yiyangfengSBU/track", name="wiki", split="test")
code = load_dataset("yiyangfengSBU/track", name="code", split="test")
math = load_dataset("yiyangfengSBU/track", name="math", split="test")

We use Python 3.12.
conda create -n track python=3.12
conda activate track
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tqdm openai together transformers sparqlwrapper datasets accelerate ninja nvidia-ml-py nltk rich peft
MAX_JOBS=32 pip install flash-attn --no-build-isolation
In api_key/config.json, you need to provide your API keys for the models you want to use. The file should look like this:
{
"api_key": {
"openai_api_key": "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"togetherai_api_key": "123abc456def789ghijklmno0123456789",
"huggingface_api_key": "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
},
"wikimedia": {
"client_application_key": "123abc456def789ghijklmno0123456789",
"client_application_secret": "123abc456def789ghijklmno0123456789",
"access_token": "123abc456def789ghijklmno0123456789",
"user_agent": "TRACK/1.0 (https://www.wikidata.org/wiki/User:TRACK)"
}
}

The benchmark data is available directly from HuggingFace (see the Dataset section above). Alternatively, you can regenerate it locally:
python scripts/testset/grow_collection.py --data_size 500 --api_config_file ./api_key/config.json
python scripts/testset/code_collection.py --data_size 500 --api_config_file ./api_key/config.json
python scripts/testset/math_collection.py --data_size 500 --api_config_file ./api_key/config.json

All experiments save checkpoints of their results and can be resumed at any time from those checkpoints. To make sure the saved data is preserved, you can run python autogit.py.
To probe the model's knowledge, run:
python run_knowledge_experiments.py \
--model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini o4-mini \ # List of ALL model names to run.
--cpu-only-models gpt-4.1-mini o4-mini \ # List of model names that are CPU-only
--task_names grow code math \ # List of task names. "grow" means WIKI (previously Graph Reasoning On Wikidata)
--load_from_huggingface \ # Optional: load dataset from HuggingFace Hub instead of a local pkl file

To automatically annotate the probing results, use the following command:
python run_knowledge_evaluations.py \
--model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini o4-mini \ # List of ALL model names to run.
--task_names grow code math \ # List of task names. "grow" means WIKI (previously Graph Reasoning On Wikidata)

Before running the reasoning scripts below, ensure you have evaluated the knowledge probing results (saved in data/eval_results/{task_name}/probe_evaluated/) to identify the knowledge gaps in the model's responses.
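Conceptually, a knowledge gap is a probing question the model answered incorrectly. A toy exact-match version of this check (the repo uses LLM-based annotation, so this is illustration only; `knowledge_gaps` is a hypothetical helper):

```python
def knowledge_gaps(probing_answers: list[str], model_answers: list[str]) -> list[int]:
    # Indices of probing questions the model got wrong (toy exact-match check).
    return [
        i
        for i, (gold, pred) in enumerate(zip(probing_answers, model_answers))
        if gold.strip().lower() != pred.strip().lower()
    ]
```

Examples with a non-empty gap list are the ones where the injected facts conflict with the model's parametric knowledge.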
The Base Model performance:
python run_reasoning_experiments.py \
--model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini \ # List of ALL model names to run.
--cpu-only-models gpt-4.1-mini \ # List of model names that are CPU-only

For Llama-3.2 models with knowledge injection:
python run_reasoning_experiments.py \
--model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b \ # List of ALL model names to run.
--inject-knowledge \ # Flag to run with knowledge injection. If not set, runs original baselines.
--methods base ft_ck mello \ # List of methods to test (e.g., base, ft_ck).

For Qwen-3 models with knowledge injection:
python run_reasoning_experiments.py \
--model_names qwen-3-1.7b qwen-3-4b qwen-3-8b \ # List of ALL model names to run.
--inject-knowledge \ # Flag to run with knowledge injection. If not set, runs original baselines.
--methods base append_t ft_ck mello \ # List of methods to test (e.g., base, ft_ck).

For GPT models with knowledge injection:
python run_reasoning_experiments.py \
--model_names gpt-4.1-mini o4-mini \ # List of ALL model names to run.
--cpu-only-models gpt-4.1-mini o4-mini \ # List of model names that are CPU-only
--inject-knowledge \ # Flag to run with knowledge injection. If not set, runs original baselines.
--methods base \ # List of methods to test (e.g., base, ft_ck).

To automatically annotate the knowledge injection results, use the following command:
python run_reasoning_evaluations.py \
--model_name llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini o4-mini \ # List of ALL model names to run.
--evaluate_model_name gpt-5-mini-2025-08-07 \ # Model used as the evaluator.
--method_names base ft_ck mello append_t \ # List of methods to test (e.g., base, ft_ck).

If you find this repository useful, please consider citing our paper:
@inproceedings{feng-etal-2026-tracking,
title = "Tracking the Limits of Knowledge Propagation: How {LLM}s Fail at Multi-Step Reasoning with Conflicting Knowledge",
author = "Feng, Yiyang and
Chen, Zeming and
Wu, Haotian and
Zhou, Jiawei and
Bosselut, Antoine",
editor = "Demberg, Vera and
Inui, Kentaro and
Marquez, Llu{\'i}s",
booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.eacl-long.273/",
doi = "10.18653/v1/2026.eacl-long.273",
pages = "5813--5847",
ISBN = "979-8-89176-380-7",
}

