This repository contains the datasets and source code for the extension 'Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning' and the original OOPSLA-24 paper 'Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models'.
```
conda create -n <your_env_name> python=3.10
pip install -r requirements.txt
```
- Install OCaml and Dune on the Linux server.
- Run the following command under the `MTL2PL` folder:
```
dune exec ./bin/main.exe
```
- Java 8 or higher. The command `java -version` should complete successfully with a line like: `java version "1.8.0_92"`.
- To run the Llama2 and Mistral series models locally, we use at least two NVIDIA A100 PCIe 80GB GPUs.
- To interact with GPTs, we need an OpenAI API key (replace the `your_api_key` part in the `OpenAI Interact` code).
- We use Ollama to run local models.
- Start the model on the Linux server (default port: 11434):
```
ollama run [MODEL_NAME]
```
- Interact using the `requests` library:
```python
response = requests.post(
    "http://your_server_ip:11434/api/generate",
    json={"model": [MODEL_NAME], "prompt": [INPUT_PROMPT], "stream": False},
).json()
```
- There are also online services for open-source LLMs, such as DeepInfra. We can use the OpenAI Python library to generate responses from them.
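The same request can be issued with only the standard library. The sketch below wraps the Ollama `/api/generate` call from above in a small helper; the `host`, `model`, and `prompt` values are placeholders to fill in:

```python
import json
import urllib.request


def build_payload(model: str, prompt: str) -> dict:
    # Same JSON fields as the /api/generate request shown above;
    # "stream": False asks the server for a single JSON response.
    return {"model": model, "prompt": prompt, "stream": False}


def query_ollama(host: str, model: str, prompt: str) -> str:
    # host is e.g. "http://your_server_ip:11434" (placeholder).
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```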
- Download StanfordCoreNLP and unzip the file.
- Start the service on the Linux server:
```
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port your_port -timeout 15000
```
- Use the code in `Response Validation` to analyze LLM responses.
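Once the server is running, it can be queried over plain HTTP. This is a minimal sketch, not the project's own client: the annotator list and port are assumptions, so adjust them to match the `Response Validation` code. The CoreNLP server reads its configuration from a URL-encoded `properties` query parameter:

```python
import json
import urllib.parse
import urllib.request


def corenlp_url(host: str, annotators: str) -> str:
    # Encode the annotator pipeline and output format into the
    # "properties" query parameter expected by the CoreNLP server.
    props = {"annotators": annotators, "outputFormat": "json"}
    return f"{host}/?properties={urllib.parse.quote(json.dumps(props))}"


def annotate(host: str, text: str, annotators: str = "tokenize,ssplit,pos") -> dict:
    # host is e.g. "http://localhost:your_port" (placeholder).
    req = urllib.request.Request(
        corenlp_url(host, annotators),
        data=text.encode("utf-8"),
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```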
The non-temporal benchmark dataset is in `Drowzee Benchmark`; the temporal benchmark dataset is in `Temporal Benchmark`.
Each item in the dataset follows the structure below:
- `qid`: A unique identifier for each test case in the dataset.
- `category`: Knowledge domain of the entities used in this test case (culture/geography/history/people/health/math/nature/society/technology).
- `reasoning`: Reasoning strategy used when generating this test case (transitive/negation/inverse/composite).
- `subject`: The subject entity of the derived fact in the test case.
- `predicate`: The derived relation between `subject` and `object` in this test case.
- `object`: The secondary entity or concept linked to the subject by the predicate.
- `description`: The first sentence from the Wikipedia pages of `subject` and `object`.
- `question`: The question generated from the derived fact in this test case.
- `answer`: The ground-truth answer to the question, based on the derived fact and evidence.
- `evidence`: A list of supporting triples used to derive the facts in this test case.
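As a quick sanity check, the fields above can be validated when loading a benchmark file. This is an illustrative sketch: the file path is an assumption, and the sample item is made up for demonstration, not taken from the released data:

```python
import json

# The documented fields of each benchmark item.
REQUIRED_FIELDS = {
    "qid", "category", "reasoning", "subject", "predicate",
    "object", "description", "question", "answer", "evidence",
}


def validate_item(item: dict) -> bool:
    # An item is well-formed if every documented field is present
    # and the evidence is a list of supporting triples.
    return REQUIRED_FIELDS <= item.keys() and isinstance(item["evidence"], list)


def load_benchmark(path: str) -> list:
    # path points at a benchmark JSON file (placeholder).
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    return [item for item in items if validate_item(item)]


# Hypothetical item, for illustration only:
sample = {
    "qid": 0, "category": "geography", "reasoning": "transitive",
    "subject": "A", "predicate": "located in", "object": "B",
    "description": "...", "question": "Is A located in B?", "answer": "Yes.",
    "evidence": [["A", "located in", "C"], ["C", "located in", "B"]],
}
```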
The `MTL2PL` folder is organized as follows:
```
MTL2PL
├── bin
│   ├── ast_utility.ml
│   ├── dune
│   ├── lexer.mll
│   ├── main.ml
│   └── parser.mly
├── database.pl
├── dataset.json
├── dune-project
├── facts.pl
├── lib
│   └── dune
├── MTL2PL.opam
├── output.txt
├── paper_examples
├── pycode
│   ├── data
│   ├── qa
│   ├── facts_generator.py
│   ├── generate_question_json.py
│   └── mtl_generator.py
├── queries.txt
├── test
│   ├── dune
│   └── MTL2PL.ml
└── test.pl
```
The `non-temporal` pipeline is organized as follows:
```
non-temporal
├── 0.fact_extraction
│   ├── category
│   ├── get_wiki_cat_id.py
│   ├── transitive_entity_extract.py
│   ├── transitive_pl_build.py
│   └── wiki_pl_build.py
├── 1.logical_reasoning
│   ├── prolog_inference.py
│   ├── prolog_rules
│   │   ├── composite_problem_dict.json
│   │   ├── composite_rules.pl
│   │   ├── inverse_problem_dict.json
│   │   ├── inverse_rules.pl
│   │   ├── negation_problem_dict.json
│   │   ├── negation_rules.pl
│   │   ├── transitive_problem_dict.json
│   │   └── transitive_rules.pl
│   └── rule_generation.py
├── 2.benchmark_construction
│   ├── benchmark_generation.py
│   └── question_generation.py
├── 3.llm_interaction
│   ├── local_interaction.py
│   ├── openai_interaction.py
│   └── test_vllm.py
├── 4.response_evaluation
│   ├── answer_test.py
│   ├── extract_triple.py
│   ├── get_all_pred.py
│   ├── __init__.py
│   ├── sim_calc.py
│   └── utils
│       └── wordnet_synonyms.json
└── utils
    ├── file_ops.py
    └── wiki_property_cat_v1.xlsx
```
This project makes use of the following open-source projects:
- PySwip: A Python/SWI-Prolog bridge that enables querying SWI-Prolog from your Python programs.
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs.
- Ollama: Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
- StanfordCoreNLP: A suite of NLP tools.
- Phrase-BERT: A model to embed phrases.
```bibtex
@article{li2025detecting,
  title={Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning},
  author={Li, Ningke and Song, Yahui and Wang, Kailong and Li, Yuekang and Shi, Ling and Liu, Yi and Wang, Haoyu},
  journal={arXiv preprint arXiv:2502.13416},
  year={2025}
}

@article{10.1145/3689776,
  author = {Li, Ningke and Li, Yuekang and Liu, Yi and Shi, Ling and Wang, Kailong and Wang, Haoyu},
  title = {Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models},
  year = {2024},
  issue_date = {October 2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {8},
  number = {OOPSLA2},
  url = {https://doi.org/10.1145/3689776},
  doi = {10.1145/3689776},
  month = oct,
  articleno = {336},
  numpages = {30},
  keywords = {Hallucination, Large Language Model, Software Testing}
}
```

