This is the benchmark introduced in the paper "ECHO: EvidenCe-prior Hallucination Observation" (AAAI 2026 Workshop AABA4ET). More information can be found in the following tech blog posts:
https://blog-en.fltech.dev/entry/2026/03/06/fujitsu-hallucination-benchmark-en (English)
https://blog.fltech.dev/entry/2026/03/06/fujitsu-hallucination-benchmark-ja (Japanese)
This benchmark contains the following QA datasets, with a total of 1242 QA entries.
LaPH refers to the phenomenon where Large Vision-Language Models (LVLMs) generate answers based on linguistic prior knowledge without utilizing visual information.
LaPH can be identified by comparing the output of edited visual QA with that of raw visual QA.
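The comparison described above can be sketched as follows. This is a minimal, hypothetical reimplementation (the function name and data layout are assumptions, not part of the benchmark scripts): a case counts as LaPH when the model answers the edited-image question with the answer that was only correct for the raw image, i.e., it relied on linguistic priors instead of the visual edit.

```python
# Hypothetical sketch of LaPH detection. Each dict maps a sample index
# to an answer; a LaPH case is one where the prediction on the edited
# image matches the raw-image ground truth but not the edited one.
def laph_rate(raw_gt, edited_gt, edited_pred):
    laph = 0
    for idx, pred in edited_pred.items():
        # The answer fits the unedited image only: the model ignored
        # the visual edit and fell back on prior knowledge.
        if pred == raw_gt[idx] and pred != edited_gt[idx]:
            laph += 1
    return laph / len(edited_pred)

rate = laph_rate(
    raw_gt={1: "A", 2: "B", 3: "C"},
    edited_gt={1: "B", 2: "B", 3: "D"},
    edited_pred={1: "A", 2: "B", 3: "D"},
)
print(rate)  # index 1 is a LaPH case -> 1/3
```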
In brief experiments with several LVLMs (e.g., gpt4v, command), the accuracy for raw visual QAs was approximately 90%.
However, for edited visual QAs, the accuracy decreased to about 70%, with approximately 16% to 18% LaPH observed.
Gpt4v achieved an accuracy of around 92.8% on raw visual QA and 67.1% on edited visual QA, with about 18.3% LaPH.
Command achieved an accuracy of around 93.9% on raw visual QA and 73.4% on edited visual QA, with about 16.4% LaPH.
The root directory contains the following four folders: database, scripts, sample, and paper.
This folder contains image data and QA data.
Contains raw (unedited) and edited visual images.
Includes labeling information for all data samples.
Samples are divided into three parts:
Indexes with "_text" suffix: Text-only input (without any visual input).
Indexes with "_visual_raw" suffix: Input includes both text and visual image, with unedited images.
Indexes with "_visual_edited" suffix: Input includes text and an edited image. Details in the image relevant to the question are edited, so the answer also changes compared to the raw image.
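The three-way split above can be recovered programmatically from the index suffixes. A minimal sketch (the sample index values shown are hypothetical):

```python
# Group sample indexes by the three suffixes described above.
samples = ["200001_text", "200001_visual_raw", "200001_visual_edited",
           "200002_text"]

groups = {"_text": [], "_visual_raw": [], "_visual_edited": []}
for s in samples:
    # Check the longer "_visual_*" suffixes first so "_text" does not
    # accidentally swallow anything (it cannot here, but order is cheap).
    for suffix in ("_visual_raw", "_visual_edited", "_text"):
        if s.endswith(suffix):
            groups[suffix].append(s)
            break

print(groups["_text"])  # ['200001_text', '200002_text']
```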
Because some questions may have open-ended answers, multiple-choice questions (MCQs) are used instead.
Sample indexes are ordered numbers starting from 200000. When some cases fail to get output from the LVLMs, the failed cases can easily be found via this index order, which helps generate the following files:
Input samples for an LVLM.
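Since the indexes are consecutive, the failed cases can be located with a simple set difference. A sketch under that assumption (the index values are illustrative):

```python
# Hypothetical sketch: compare the expected ordered index range against
# the indexes actually present in the LVLM output to find failed cases.
def find_failed(expected_indexes, output_indexes):
    return sorted(set(expected_indexes) - set(output_indexes))

expected = range(200000, 200010)  # ordered sample indexes
got = [200000, 200001, 200003, 200005, 200006, 200007, 200008, 200009]
print(find_failed(expected, got))  # [200002, 200004]
```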
In this folder, you will find 2.0 run.sh.
run.sh shows the usage of the LVLM test scripts, which produce LaPH outputs for this benchmark data with a specific input.
2.1 gpt_api_vqa.py
2.2 gpt_api_text.py
With these two scripts, you can input LaPH_vqa.tsv and LaPH_text.tsv to get output from the gpt4v model: output_LaPH_gpt_vqa and output_LaPH_gpt_text.
2.3 gemini_api_vqa.py
2.4 gemini_api_text.py
With these two scripts, you can input LaPH_vqa.tsv and LaPH_text.tsv to get output from the gemini model: output_LaPH_gemini_vqa and output_LaPH_gemini_text.
2.5 command_api_vqa.py
2.6 command_api_text.py
With these two scripts, you can input LaPH_vqa.tsv and LaPH_text.tsv to get output from the command model: output_LaPH_command_vqa and output_LaPH_command_text. The format of the LaPH_vqa.tsv and LaPH_text.tsv samples can be seen in the dataset.
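The input files are tab-separated, so they can be loaded with Python's standard csv module. A minimal sketch; the exact column layout should be checked against the samples in the dataset (the index and question columns shown here are assumptions):

```python
import csv
import io

# Read tab-separated rows; in practice f would be open("LaPH_vqa.tsv").
def read_tsv_rows(f):
    return list(csv.reader(f, delimiter="\t"))

rows = read_tsv_rows(io.StringIO("200000\tWhat color is the car?\n"))
print(rows)  # [['200000', 'What color is the car?']]
```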
2.7 eval_LaPH.py produces a comparison of the ground truth and the output from an LVLM. A sample input file is 2.7.1 sample-output-vqa-forCompare, whose content is index, ground truth, and output.
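The comparison boils down to matching each output against its ground truth. A hypothetical sketch of that step (not the actual eval_LaPH.py implementation; the row values are illustrative):

```python
# Each row holds (index, ground truth, model output), matching the
# sample-output-vqa-forCompare layout described above.
def accuracy(rows):
    correct = sum(1 for _, gt, out in rows if gt.strip() == out.strip())
    return correct / len(rows)

rows = [("200000", "red", "red"), ("200001", "two", "three")]
print(accuracy(rows))  # 0.5
```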
2.8 gene_mcq_res.py produces the results for MCQ outputs. A sample input file is 2.8.1 sample-output-MCQ-forCompare, whose content is index and output. In MCQ, the QAs form a set; if not all the QAs in a set are correctly answered, the result for that set is counted as wrong. These scripts require cohere, requests, base64, google, vlmeval, openai, glov, argparse, json, and so on.
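The set-wise MCQ scoring rule above can be sketched as follows. This is a hypothetical illustration (the grouping key and row format are assumptions, not gene_mcq_res.py's actual interface):

```python
from collections import defaultdict

# rows: (set_id, is_correct) pairs. A set counts as correct only if
# every QA in it is answered correctly, per the rule described above.
def mcq_set_accuracy(rows):
    sets = defaultdict(list)
    for set_id, ok in rows:
        sets[set_id].append(ok)
    # all() makes the whole set wrong if any single QA is wrong.
    per_set = {sid: all(oks) for sid, oks in sets.items()}
    return sum(per_set.values()) / len(per_set)

rows = [("s1", True), ("s1", True), ("s2", True), ("s2", False)]
print(mcq_set_accuracy(rows))  # 0.5
```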
We summarized examples of how to use the Fujitsu Hallucination Benchmark in the slides.
See "TERMS_OF_USE" for details.