This is the benchmark introduced in the paper "ECHO: EvidenCe-prior Hallucination Observation" (AAAI 2026 Workshop AABA4ET). More information can be found in the following tech blog posts:
https://blog-en.fltech.dev/entry/2026/03/06/fujitsu-hallucination-benchmark-en (English)
https://blog.fltech.dev/entry/2026/03/06/fujitsu-hallucination-benchmark-ja (Japanese)
This benchmark contains the following QA datasets, with a total of 1242 QA entries.
LaPH refers to the phenomenon where Large Vision-Language Models (LVLMs) generate answers based on linguistic prior knowledge without utilizing visual information.
LaPH can be identified by comparing the output of edited visual QA with that of raw visual QA.
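The comparison described above can be sketched as follows. This is a minimal, hypothetical reimplementation (the function name and data layout are assumptions, not part of the benchmark scripts): a case counts as LaPH when the model answers the edited-image question with the answer that was only correct for the raw image, i.e., it relied on linguistic priors instead of the visual edit.

```python
# Hypothetical sketch of LaPH detection. Each dict maps a sample index
# to an answer; a LaPH case is one where the prediction on the edited
# image matches the raw-image ground truth but not the edited one.
def laph_rate(raw_gt, edited_gt, edited_pred):
    laph = 0
    for idx, pred in edited_pred.items():
        # The answer fits the unedited image only: the model ignored
        # the visual edit and fell back on prior knowledge.
        if pred == raw_gt[idx] and pred != edited_gt[idx]:
            laph += 1
    return laph / len(edited_pred)

rate = laph_rate(
    raw_gt={1: "A", 2: "B", 3: "C"},
    edited_gt={1: "B", 2: "B", 3: "D"},
    edited_pred={1: "A", 2: "B", 3: "D"},
)
print(rate)  # index 1 is a LaPH case -> 1/3
```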
In brief experiments with several LVLMs (e.g., gpt4v, command), the accuracy for raw visual QAs was approximately 90%.
However, for edited visual QAs, the accuracy decreased to about 70%, with approximately 16% to 18% LaPH observed.
Gpt4v achieved an accuracy of around 92.8% on raw visual QA and 67.1% on edited visual QA, with about 18.3% LaPH.
Command achieved an accuracy of around 93.9% on raw visual QA and 73.4% on edited visual QA, with about 16.4% LaPH.
The root directory contains the following four folders: database, scripts, sample, and paper.
This folder contains image data and QA data.
Contains raw (unedited) and edited visual images.
Includes labeling information for all data samples.
Samples are divided into three parts:
Indexes with "_text" suffix: Text-only input (without any visual input).
Indexes with "_visual_raw" suffix: Input includes both text and visual image, with unedited images.
Indexes with "_visual_edited" suffix: Input includes text and an edited image. Details in the image relevant to the question are edited, so the answer also changes compared to the raw image.
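The three-way split above can be recovered programmatically from the index suffixes. A minimal sketch (the sample index values shown are hypothetical):

```python
# Group sample indexes by the three suffixes described above.
samples = ["200001_text", "200001_visual_raw", "200001_visual_edited",
           "200002_text"]

groups = {"_text": [], "_visual_raw": [], "_visual_edited": []}
for s in samples:
    # Check the longer "_visual_*" suffixes first so "_text" does not
    # accidentally swallow anything (it cannot here, but order is cheap).
    for suffix in ("_visual_raw", "_visual_edited", "_text"):
        if s.endswith(suffix):
            groups[suffix].append(s)
            break

print(groups["_text"])  # ['200001_text', '200002_text']
```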
Because some questions may have open-ended answers, multiple-choice questions (MCQs) are used instead.
Sample indexes are ordered numbers starting from 200000. When some cases fail to get output from the LVLMs, the failed cases can easily be found via this index order, which helps generate the following files:
Input samples for an LVLM.
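Since the indexes are consecutive, the failed cases can be located with a simple set difference. A sketch under that assumption (the index values are illustrative):

```python
# Hypothetical sketch: compare the expected ordered index range against
# the indexes actually present in the LVLM output to find failed cases.
def find_failed(expected_indexes, output_indexes):
    return sorted(set(expected_indexes) - set(output_indexes))

expected = range(200000, 200010)  # ordered sample indexes
got = [200000, 200001, 200003, 200005, 200006, 200007, 200008, 200009]
print(find_failed(expected, got))  # [200002, 200004]
```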
In this folder, you will find 2.0 run.sh.
run.sh shows the usage of the LVLM test scripts, which produce LaPH outputs for this benchmark data with a specific input.
2.1 gpt_api_vqa.py
2.2 gpt_api_text.py
With these two scripts, you can input LaPH_vqa.tsv and LaPH_text.tsv to get output from the gpt4v model: output_LaPH_gpt_vqa and output_LaPH_gpt_text.
2.3 gemini_api_vqa.py
2.4 gemini_api_text.py
With these two scripts, you can input LaPH_vqa.tsv and LaPH_text.tsv to get output from the gemini model: output_LaPH_gemini_vqa and output_LaPH_gemini_text.
2.5 command_api_vqa.py
2.6 command_api_text.py
With these two scripts, you can input LaPH_vqa.tsv and LaPH_text.tsv to get output from the command model: output_LaPH_command_vqa and output_LaPH_command_text. The format of the LaPH_vqa.tsv and LaPH_text.tsv samples can be seen in the dataset.
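The input files are tab-separated, so they can be loaded with Python's standard csv module. A minimal sketch; the exact column layout should be checked against the samples in the dataset (the index and question columns shown here are assumptions):

```python
import csv
import io

# Read tab-separated rows; in practice f would be open("LaPH_vqa.tsv").
def read_tsv_rows(f):
    return list(csv.reader(f, delimiter="\t"))

rows = read_tsv_rows(io.StringIO("200000\tWhat color is the car?\n"))
print(rows)  # [['200000', 'What color is the car?']]
```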
2.7 eval_LaPH.py produces a comparison of the ground truth and the output from an LVLM. A sample input file is 2.7.1 sample-output-vqa-forCompare, whose content is index, ground truth, and output.
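The comparison boils down to matching each output against its ground truth. A hypothetical sketch of that step (not the actual eval_LaPH.py implementation; the row values are illustrative):

```python
# Each row holds (index, ground truth, model output), matching the
# sample-output-vqa-forCompare layout described above.
def accuracy(rows):
    correct = sum(1 for _, gt, out in rows if gt.strip() == out.strip())
    return correct / len(rows)

rows = [("200000", "red", "red"), ("200001", "two", "three")]
print(accuracy(rows))  # 0.5
```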
2.8 gene_mcq_res.py produces the results for MCQ outputs. A sample input file is 2.8.1 sample-output-MCQ-forCompare, whose content is index and output. In MCQ, the QAs form a set; if not all the QAs in a set are correctly answered, the result for that set is counted as wrong. These scripts require cohere, requests, base64, google, vlmeval, openai, glov, argparse, json, and so on.
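The set-wise MCQ scoring rule above can be sketched as follows. This is a hypothetical illustration (the grouping key and row format are assumptions, not gene_mcq_res.py's actual interface):

```python
from collections import defaultdict

# rows: (set_id, is_correct) pairs. A set counts as correct only if
# every QA in it is answered correctly, per the rule described above.
def mcq_set_accuracy(rows):
    sets = defaultdict(list)
    for set_id, ok in rows:
        sets[set_id].append(ok)
    # all() makes the whole set wrong if any single QA is wrong.
    per_set = {sid: all(oks) for sid, oks in sets.items()}
    return sum(per_set.values()) / len(per_set)

rows = [("s1", True), ("s1", True), ("s2", True), ("s2", False)]
print(mcq_set_accuracy(rows))  # 0.5
```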
We summarized examples of how to use the Fujitsu Hallucination Benchmark in the slides.
See "TERMS_OF_USE" for details.