
Drowzee

This repository contains the datasets and source code for the extension 'Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning' and the original OOPSLA 2024 paper 'Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models'.

Overview

[Overview figures]

Requirements

Python Prerequisites

conda create -n <your_env_name> python=3.10
conda activate <your_env_name>

pip install -r requirements.txt

Temporal Generation Prerequisites

Java Prerequisites

  • Java 8 or higher. The command java -version should complete successfully with a line like: java version "1.8.0_92".

Hardware Requirements

  • To run the Llama2-series and Mistral-series models locally, we use at least two NVIDIA A100 PCIe 80GB GPUs.

To Interact with LLMs

  • To interact with GPT models, we need an OpenAI API key (replace the your_api_key part in the OpenAI Interact script); see the sketch after this list.
  • We use Ollama to run local models.
    1. Start the model on the Linux server (default port: 11434).
    ollama run [MODEL_NAME]
    
    2. Interact using the requests library.
    import requests

    # Send the prompt to the local Ollama server and read the full JSON reply.
    response = requests.post("http://your_server_ip:11434/api/generate", json={
        "model": [MODEL_NAME],
        "prompt": [INPUT_PROMPT],
        "stream": False
    }).json()
    
  • There are also online services for open-source LLMs such as DeepInfra, which expose OpenAI-compatible endpoints, so we can use the OpenAI Python library to generate responses (sketched below).
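
A minimal sketch of generating a response through the OpenAI Python library is given below; the model name and the DeepInfra base_url are assumptions, so substitute the model and endpoint you actually use.

from openai import OpenAI

# Point api_key at your OpenAI (or DeepInfra) key; the same client works for
# both services because DeepInfra exposes an OpenAI-compatible endpoint.
client = OpenAI(
    api_key="your_api_key",
    # base_url="https://api.deepinfra.com/v1/openai",  # assumed DeepInfra endpoint; omit for the official OpenAI API
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model name; replace with the model under test
    messages=[{"role": "user", "content": [INPUT_PROMPT]}],
)
print(response.choices[0].message.content)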

To Analyze Responses

  • Download StanfordCoreNLP and unzip the file.
  • Start the service on the Linux server:
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port your_port -timeout 15000
    
  • Use the code in Response Validation to analyze LLM responses; a minimal example of querying the CoreNLP server is sketched below.
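
Below is a minimal sketch (not the repository's own validation code) of sending one LLM response to the running CoreNLP server and reading back OpenIE triples; the host, port, and annotator list are assumptions and should match the server command above.

import json
import requests

corenlp_url = "http://your_server_ip:your_port"
properties = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie",
    "outputFormat": "json",
}

llm_response = "Drowzee is a software testing tool for large language models."
resp = requests.post(
    corenlp_url,
    params={"properties": json.dumps(properties)},
    data=llm_response.encode("utf-8"),
)
doc = resp.json()

# Each sentence carries an "openie" list of (subject, relation, object) triples.
for sentence in doc["sentences"]:
    for triple in sentence.get("openie", []):
        print(triple["subject"], triple["relation"], triple["object"])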

Dataset Format

The non-temporal benchmark dataset is in Drowzee Benchmark. The temporal benchmark dataset is in Temporal Benchmark.

Each item in the dataset follows the structure below:

  • qid: A unique identifier for each test case in the dataset.

  • category: Knowledge domain for entities used in this test case (culture/geography/history/people/health/math/nature/society/technology).

  • reasoning: Reasoning strategy used when generating this test case (transitive/negation/inverse/composite).

  • subject: The subject entity of the derived fact in the test case.

  • predicate: The derived relation between subject and object in this test case.

  • object: The secondary entity or concept linked to the subject by the predicate.

  • description: The first sentence from the Wikipedia pages of the subject and object.

  • question: The question generated according to the derived fact in this test case.

  • answer: The ground truth answer to the question based on the derived fact and evidence.

  • evidence: A list of supporting triples that are used to derive facts in this test case.
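
As an illustration of how these fields can be consumed, the sketch below loads a benchmark file and checks that every test case carries the documented keys; the file name drowzee_benchmark.json is an assumption, so point it at the actual JSON file in the benchmark directories.

import json

# Assumed file name; the real benchmark JSON lives in the Drowzee Benchmark /
# Temporal Benchmark directories.
with open("drowzee_benchmark.json", encoding="utf-8") as f:
    dataset = json.load(f)  # assumed to be a JSON list of test cases

expected_keys = {
    "qid", "category", "reasoning", "subject", "predicate",
    "object", "description", "question", "answer", "evidence",
}

for item in dataset:
    missing = expected_keys - item.keys()
    assert not missing, f"test case {item.get('qid')} is missing {missing}"

print(f"Loaded {len(dataset)} test cases.")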

Code Structure

.MTL2PL
├── bin
│   ├── ast_utility.ml
│   ├── dune
│   ├── lexer.mll
│   ├── main.ml
│   └── parser.mly
├── database.pl
├── dataset.json
├── dune-project
├── facts.pl
├── lib
│   └── dune
├── MTL2PL.opam
├── output.txt
├── paper_examples
├── pycode
│   ├── data
│   ├── qa
│   ├── facts_generator.py
│   ├── generate_question_json.py
│   └── mtl_generator.py
├── queries.txt
├── test
│   ├── dune
│   └── MTL2PL.ml
└── test.pl
.non-temporal
├── 0.fact_extraction
│   ├── category
│   ├── get_wiki_cat_id.py
│   ├── transitive_entity_extract.py
│   ├── transitive_pl_build.py
│   └── wiki_pl_build.py
├── 1.logical_reasoning
│   ├── prolog_inference.py
│   ├── prolog_rules
│   │   ├── composite_problem_dict.json
│   │   ├── composite_rules.pl
│   │   ├── inverse_problem_dict.json
│   │   ├── inverse_rules.pl
│   │   ├── negation_problem_dict.json
│   │   ├── negation_rules.pl
│   │   ├── transitive_problem_dict.json
│   │   └── transitive_rules.pl
│   └── rule_generation.py
├── 2.benchmark_construction
│   ├── benchmark_generation.py
│   └── question_generation.py
├── 3.llm_interaction
│   ├── local_interaction.py
│   ├── openai_interaction.py
│   └── test_vllm.py
├── 4.response_evaluation
│   ├── answer_test.py
│   ├── extract_triple.py
│   ├── get_all_pred.py
│   ├── __init__.py
│   ├── sim_calc.py
│   └── utils
│       └── wordnet_synonyms.json
└── utils
    ├── file_ops.py
    └── wiki_property_cat_v1.xlsx

Acknowledgements

This project makes use of the following open-source projects:

  • PySwip: a Python to SWI-Prolog bridge that lets you query SWI-Prolog from your Python programs.
  • vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs.
  • Ollama: Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
  • StanfordCoreNLP: A suite of NLP tools.
  • Phrase-BERT: A model to embed phrases.

References

@article{li2025detecting,
  title={Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning},
  author={Li, Ningke and Song, Yahui and Wang, Kailong and Li, Yuekang and Shi, Ling and Liu, Yi and Wang, Haoyu},
  journal={arXiv preprint arXiv:2502.13416},
  year={2025}
}

@article{10.1145/3689776,
  author={Li, Ningke and Li, Yuekang and Liu, Yi and Shi, Ling and Wang, Kailong and Wang, Haoyu},
  title={Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models},
  year={2024},
  issue_date={October 2024},
  publisher={Association for Computing Machinery},
  address={New York, NY, USA},
  volume={8},
  number={OOPSLA2},
  url={https://doi.org/10.1145/3689776},
  doi={10.1145/3689776},
  month=oct,
  articleno={336},
  numpages={30},
  keywords={Hallucination, Large Language Model, Software Testing}
}
