Skip to content

epfl-nlp/track

Β 
Β 

Repository files navigation

Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge

🌐 Website | πŸ“ƒ EACL (Virtual Oral) | πŸ€— Dataset | 𝕏 X (Twitter)

This is the official implementation for the paper "Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge", which explores how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model's initial parametric knowledge. Our paper is accepted at EACL 2026 (Main, Virtual Oral Presentation).

illustration

Our TΚ€α΄€α΄„α΄‹ benchmark (Testing Reasoning Amid Conflicting Knowledge) spans across three reasoning scenarios: Multi-Hop QA (WIKI), Code Generation (CODE), and Mathematical Reasoning (MATH). The benchmark follows a two-stage process: Knowledge Probing and Knowledge Injection. Performance is assessed using our metrics: Answer Pass (AP), Full Knowledge Entailment (FKE), and Holistic Pass (HP).

pipeline

Takeaways:

  1. Providing updated facts yields limited performance gains and even worsens performance, compared with no facts provided.
  2. Performance further degrades with more updated facts.
  3. The failure stems from both inability to faithfully integrate updated facts and flawed reasoning even when knowledge is integrated

Dataset

The TRACK benchmark is available on HuggingFace Datasets with three configs: wiki, code, and math (500 examples each).

Each example contains five fields aligned with the paper:

Field Paper notation Description
question $q$ Complex multi-step reasoning question
answer $a$ Final answer
probing_questions $q_i$ List of probing questions
probing_answers $a_i$ List of probing answers
atomic_facts $K_q$ List of required atomic facts

Load with the datasets library

from datasets import load_dataset

wiki = load_dataset("yiyangfengSBU/track", name="wiki", split="test")
code = load_dataset("yiyangfengSBU/track", name="code", split="test")
math = load_dataset("yiyangfengSBU/track", name="math", split="test")

Setup

We use Python 3.12.

conda create -n track python=3.12
conda activate track
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tqdm openai together transformers sparqlwrapper datasets accelerate ninja nvidia-ml-py nltk rich peft
MAX_JOBS=32 pip install flash-attn --no-build-isolation

In api_key/config.json, you need to provide your API keys for the models you want to use. The file should look like this:

{
    "api_key": {
        "openai_api_key": "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
        "togetherai_api_key": "123abc456def789ghijklmno0123456789",
        "huggingface_api_key": "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    },
    "wikimedia": {
        "client_application_key": "123abc456def789ghijklmno0123456789",
        "client_application_secret": "123abc456def789ghijklmno0123456789",
        "access_token": "123abc456def789ghijklmno0123456789",
        "user_agent": "TRACK/1.0 (https://www.wikidata.org/wiki/User:TRACK)"
    }
}

Data Collection

The benchmark data is available directly from HuggingFace (see Dataset section above). Alternatively, you can regenerate it locally:

python scripts/testset/grow_collection.py --data_size 500 --api_config_file ./api_key/config.json
python scripts/testset/code_collection.py --data_size 500 --api_config_file ./api_key/config.json
python scripts/testset/math_collection.py --data_size 500 --api_config_file ./api_key/config.json

Experiments

All of the experiments will save checkpoints of results and can be resumed anytime as long as the checkpoints are saved. To ensure the data is saved, you can run python autogit.py.

Stage 1: Knowledge Probing

To probe the model's knowledge, run:

python run_knowledge_experiments.py \
    --model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini o4-mini \    # List of ALL model names to run.
    --cpu-only-models gpt-4.1-mini o4-mini \                                                                        # List of model names that are CPU-only
    --task_names grow code math \                                                                                   # List of task names. "grow" means WIKI (previously Graph Reasoning On Wikidata)
    --load_from_huggingface \                                                                                       # Optional: load dataset from HuggingFace Hub instead of local pkl file

To automatically annotate the probing results, you can use the following command:

python run_knowledge_evaluations.py \
    --model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini o4-mini \    # List of ALL model names to run.
    --task_names grow code math \                                                                                   # List of task names. "grow" means WIKI (previously Graph Reasoning On Wikidata)

Stage 2: Knowledge Injection

Before running the script, ensure you have evaluated the knowledge probing results (saved in data/eval_results/{task_name}/probe_evaluated/) to identify the knowledge gaps in the model's responses.

The Base Model performance:

python run_reasoning_experiments.py \
    --model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini \    # List of ALL model names to run.
    --cpu-only-models gpt-4.1-mini \                                                                        # List of model names that are CPU-only

For Llama-3.2 models with knowledge injection:

python run_reasoning_experiments.py \
    --model_names llama-3.2-1b llama-3.2-3b llama-3.2-11b \ # List of ALL model names to run.
    --inject-knowledge \                                    # Flag to run with knowledge injection. If not set, runs original baselines.
    --methods base ft_ck mello \                            # List of methods to test (e.g., base, ft_ck).

For Qwen-3 models with knowledge injection:

python run_reasoning_experiments.py \
    --model_names qwen-3-1.7b qwen-3-4b qwen-3-8b \ # List of ALL model names to run.
    --inject-knowledge \                            # Flag to run with knowledge injection. If not set, runs original baselines.
    --methods base append_t ft_ck mello \           # List of methods to test (e.g., base, ft_ck).

For GPT models with knowledge injection:

python run_reasoning_experiments.py \
    --model_names gpt-4.1-mini o4-mini \        # List of ALL model names to run.
    --cpu-only-models gpt-4.1-mini o4 mini \    # List of model names that are CPU-only
    --inject-knowledge \                        # Flag to run with knowledge injection. If not set, runs original baselines.
    --methods base \                            # List of methods to test (e.g., base, ft_ck).

To automatically annotate the knowledge injection results, you can use the following command:

python run_reasoning_evaluations.py \
    --model_name llama-3.2-1b llama-3.2-3b llama-3.2-11b qwen-3-1.7b qwen-3-4b qwen-3-8b gpt-4.1-mini o4-mini \ # List of ALL model names to run.
    --evaluate_model_name gpt-5-mini-2025-08-07 \                                                               # List of model names for evaluation.
    --method_names base ft_ck mello append_t \                                                                       # List of methods to test (e.g., base, ft_ck).

Citation

If you find this repository useful, please consider citing our paper:

@inproceedings{feng-etal-2026-tracking,
    title = "Tracking the Limits of Knowledge Propagation: How {LLM}s Fail at Multi-Step Reasoning with Conflicting Knowledge",
    author = "Feng, Yiyang  and
      Chen, Zeming  and
      Wu, Haotian  and
      Zhou, Jiawei  and
      Bosselut, Antoine",
    editor = "Demberg, Vera  and
      Inui, Kentaro  and
      Marquez, Llu{\'i}s",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.eacl-long.273/",
    doi = "10.18653/v1/2026.eacl-long.273",
    pages = "5813--5847",
    ISBN = "979-8-89176-380-7",
}

About

Testing Reasoning Amid Conflicting Knowledge (TRACK)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 55.9%
  • Jupyter Notebook 44.1%