Fujitsu RAG Hard Benchmark†

† RAG: Retrieval-Augmented Generation

This is the benchmark dataset introduced in the AAAI 2026 workshop paper "Overcoming the 'Impracticality' of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework."

For the blog about this benchmark dataset, click here.

Data layout

dataset/FJ_KGQA_Hard.yaml: annotation data (100 questions)
dataset/DL_URL.csv: list of PDF download sources (index, file name, URL)
dataset/PDFs: source PDFs (some are included; the user must download the files listed in dataset/DL_URL.csv and place them here)
evaluate/: evaluation scripts and samples
pyproject.toml, poetry.lock: dependencies for the evaluation scripts

Obtaining PDFs

For PDFs listed in dataset/DL_URL.csv, download them from each URL and place them in dataset/PDFs. If a file already exists, you do not need to download it again. Please follow the license and terms of use of each PDF distributor.
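The download step above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: it assumes each row of dataset/DL_URL.csv is `index, file name, URL`, that any header row has no URL starting with "http" in its third column, and that the file is UTF-8. The helper names `missing_pdfs` and `download_missing` are hypothetical.

```python
import csv
import io
import urllib.request
from pathlib import Path


def missing_pdfs(csv_text: str, pdf_dir: Path) -> list[tuple[str, str]]:
    """Return (file_name, url) pairs whose file is not yet present in pdf_dir.

    Assumption: columns are index, file name, URL, in that order.
    """
    todo = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        _index, file_name, url = (cell.strip() for cell in row[:3])
        if not url.lower().startswith("http"):
            continue  # probably a header row
        if not (pdf_dir / file_name).exists():
            todo.append((file_name, url))
    return todo


def download_missing(csv_path: Path, pdf_dir: Path) -> None:
    """Download every listed PDF that is not already in pdf_dir."""
    pdf_dir.mkdir(parents=True, exist_ok=True)
    csv_text = csv_path.read_text(encoding="utf-8-sig")  # tolerate a BOM
    for file_name, url in missing_pdfs(csv_text, pdf_dir):
        with urllib.request.urlopen(url) as response:
            (pdf_dir / file_name).write_bytes(response.read())


if __name__ == "__main__":
    download_missing(Path("dataset/DL_URL.csv"), Path("dataset/PDFs"))
```

Files that already exist are skipped, so the script can be re-run safely. Check each distributor's license and terms of use before downloading in bulk.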

Annotation format (`dataset/FJ_KGQA_Hard.yaml`)

The YAML file consists of a tasks array, and each element represents one QA pair and its evidence.

tasks:
- no.: "1"
  question: ...
  answer: ...
  question_type: Yes/No
  retrieval_level: Easy
  answer_level: Easy
  answer_skill: 記載通りに回答  # "answer as stated in the source"
  tag: テキスト  # "text"
  grading_criteria: ""
  rationales:
  - file_name: sample.pdf
    pages:
    - number: 2
      view:
        width: 100
        height: 100
      bounding_boxes:
      - top: 30.82
        left: 0.25
        width: 22.75
        height: 32.57
  Reasoning Complexity:
    Reasoning Depth (Multi-step Reasoning):
      value: multi

This YAML snippet is intentionally abbreviated. Other diagnostic metrics under Reasoning Complexity, Retrieval Difficulty, Source Structure & Modality, and Explainability Requirement are omitted here for readability.

Field meanings

no.: question ID (string)
question: question text
answer: correct answer (based on the evidence)
question_type: type of question (e.g., Yes/No, Factoid, Definition/Description)
retrieval_level: difficulty of evidence retrieval (Easy/Medium/Hard)
answer_level: difficulty of answering (Easy/Medium/Hard)
answer_skill: required skill or reading operation
tag: media type of evidence (figure/table/text, etc.; multiple possible)
grading_criteria: scoring criteria or additional constraints (may be empty)
rationales: array of evidence entries
- file_name: referenced PDF file name (under dataset/PDFs)
- pages: array of referenced pages
  - number: page number (1-based)
  - view: coordinate system size (typically normalized to 100x100)
  - bounding_boxes: rectangles for evidence spans (may be empty)
    - top, left, width, height: rectangle position and size in view coordinates (origin is top-left)
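Because bounding boxes are given in view coordinates, they must be rescaled before being drawn on a rendered page. The sketch below shows one way to do that, assuming `box` and `view` are the dictionaries parsed from the YAML and that the rendered page size in pixels is known. The function name `bbox_to_pixels` and the page size used in the example are hypothetical.

```python
def bbox_to_pixels(
    box: dict, view: dict, page_width_px: float, page_height_px: float
) -> tuple[float, float, float, float]:
    """Scale a view-coordinate box to (left, top, right, bottom) in pixels.

    The origin is the top-left corner in both coordinate systems.
    """
    scale_x = page_width_px / view["width"]
    scale_y = page_height_px / view["height"]
    left = box["left"] * scale_x
    top = box["top"] * scale_y
    right = left + box["width"] * scale_x
    bottom = top + box["height"] * scale_y
    return (left, top, right, bottom)


if __name__ == "__main__":
    # Values taken from the abbreviated YAML snippet above.
    box = {"top": 30.82, "left": 0.25, "width": 22.75, "height": 32.57}
    view = {"width": 100, "height": 100}
    # Hypothetical render size of 1000 x 2000 pixels.
    print(bbox_to_pixels(box, view, 1000, 2000))
```

Reading the scale from `view` instead of hard-coding 100 keeps the conversion correct for any entry whose view is not normalized to 100x100.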

Diagnostic metadata fields

Each task also contains diagnostic metadata blocks for multi-dimensional analysis:

Reasoning Complexity
Retrieval Difficulty
Source Structure & Modality
Explainability Requirement

Each metric under those blocks has the following schema:

<block_name>:
  <metric_name>:
    value: <label>

value: label assigned to that QA item for the metric
Allowed labels for each metric are documented below
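Given this schema, slicing the benchmark by a diagnostic label is a nested dictionary lookup. The following is a minimal sketch that assumes the YAML has already been parsed into Python dictionaries (for example with a YAML library of your choice). The helper names `metric_value` and `select_tasks` are hypothetical.

```python
from typing import Any, Optional


def metric_value(task: dict, block: str, metric: str) -> Optional[Any]:
    """Return task[block][metric]["value"], or None if any level is missing."""
    metrics = task.get(block) or {}
    entry = metrics.get(metric) or {}
    return entry.get("value")


def select_tasks(tasks: list[dict], block: str, metric: str, label: Any) -> list[dict]:
    """Keep only the tasks whose metric carries the given label."""
    return [t for t in tasks if metric_value(t, block, metric) == label]


if __name__ == "__main__":
    # Toy data shaped like the abbreviated snippet above.
    tasks = [
        {
            "no.": "1",
            "Reasoning Complexity": {
                "Reasoning Depth (Multi-step Reasoning)": {"value": "multi"}
            },
        },
        {"no.": "2"},  # diagnostic blocks omitted
    ]
    multi_step = select_tasks(
        tasks,
        "Reasoning Complexity",
        "Reasoning Depth (Multi-step Reasoning)",
        "multi",
    )
    print([t["no."] for t in multi_step])
```

Returning `None` for a missing block or metric lets the same helper work on abbreviated entries without raising an error.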

Label definitions

Reasoning Complexity

Metric	Allowed labels (`value`)
`Reasoning Depth (Multi-step Reasoning)`	`single`, `multi`
`Quantitative Operation`	`false`, `true`
`Negation Question`	`false`, `true`
`Cause and Effect`	`false`, `true`
`Comparison (and Conditional Judgment)`	`false`, `true`
`Temporal Specification`	`false`, `true`
`Type of Output Processing`	`summary`, `trans`, `list`

Retrieval Difficulty

Metric	Allowed labels (`value`)
`multi-document`	`false`, `true`
`multi-chunk`	`false`, `true`
`Low Locality`	`false`, `true`
`Remote Reference`	`false`, `true`
`Document Volume`	`<1000p`, `>=1000p`
`Chunk Size`	`<=511tok`, `>=512tok`
`Abstraction Discrepancy`	`false`, `true`
`Vocabulary Mismatch`	`false`, `true`

Source Structure & Modality

Metric	Allowed labels (`value`)
`Tables/Charts`	`false`, `true`
`Complex Layout`	`false`, `true`
`Specific Area Reference`	`false`, `true`
`Logical Nesting`	`false`, `true`
`Large Enumeration`	`false`, `true`
`Redundancy`	`false`, `true`

Explainability Requirement

Metric	Allowed labels (`value`)
`Strictness of Evidence Presentation`	`no-evidence`, `hier-ref`, `coord-ref`, `multi-ref`
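The tables above can be turned into a simple consistency check. The sketch below copies the allowed labels into a lookup table and reports any value outside its set. It is an illustration only: the function name `invalid_labels` is hypothetical, and because it is not stated here whether `true` and `false` are stored as strings or as YAML booleans, values are normalized with `str(...).lower()` before comparison.

```python
BOOL = {"false", "true"}

ALLOWED = {
    "Reasoning Complexity": {
        "Reasoning Depth (Multi-step Reasoning)": {"single", "multi"},
        "Quantitative Operation": BOOL,
        "Negation Question": BOOL,
        "Cause and Effect": BOOL,
        "Comparison (and Conditional Judgment)": BOOL,
        "Temporal Specification": BOOL,
        "Type of Output Processing": {"summary", "trans", "list"},
    },
    "Retrieval Difficulty": {
        "multi-document": BOOL,
        "multi-chunk": BOOL,
        "Low Locality": BOOL,
        "Remote Reference": BOOL,
        "Document Volume": {"<1000p", ">=1000p"},
        "Chunk Size": {"<=511tok", ">=512tok"},
        "Abstraction Discrepancy": BOOL,
        "Vocabulary Mismatch": BOOL,
    },
    "Source Structure & Modality": {
        "Tables/Charts": BOOL,
        "Complex Layout": BOOL,
        "Specific Area Reference": BOOL,
        "Logical Nesting": BOOL,
        "Large Enumeration": BOOL,
        "Redundancy": BOOL,
    },
    "Explainability Requirement": {
        "Strictness of Evidence Presentation": {
            "no-evidence", "hier-ref", "coord-ref", "multi-ref",
        },
    },
}


def invalid_labels(task: dict) -> list[tuple[str, str, str]]:
    """Return (block, metric, label) for every label outside its allowed set.

    Metrics that are absent from the task are skipped, not reported.
    """
    problems = []
    for block, metrics in ALLOWED.items():
        present = task.get(block) or {}
        for metric, allowed in metrics.items():
            entry = present.get(metric)
            if entry is None:
                continue
            label = str(entry.get("value")).lower()  # True -> "true"
            if label not in allowed:
                problems.append((block, metric, label))
    return problems
```

An empty result means every metric present in the task uses a documented label.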

Evaluation tools

evaluate/evaluate_qa.py: evaluation script for QA and references
evaluate/sample.json: sample input format
evaluate/.env.example: example OPENAI_API_KEY setting
results/: output directory for evaluation results (created on run)

Usage

Copy evaluate/.env.example to evaluate/.env and set OPENAI_API_KEY
Edit MODEL_SETTINGS in evaluate/evaluate_qa.py to select the model
Install dependencies: poetry install
Run evaluation: python evaluate/evaluate_qa.py --qa-results-file evaluate/sample.json --reference-eval-mode full-coverage

--reference-eval-mode accepts match-rate (scores references by match rate) or full-coverage (counts only exact matches).

Evaluation JSON format

Use the same structure as evaluate/sample.json. Only items with success set to true are evaluated.

question: question text
predicted_answer: generated answer
correct_answer: correct answer
predicted_references: predicted references (array; each element has pdf and page)
correct_references: correct references (array; each element has pdf and page)
success: evaluation flag

Make sure the page type is consistent between predicted_references and correct_references (mixing strings and numbers is treated as a mismatch).
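A quick pre-flight check can catch inconsistent page types before running the evaluation. The sketch below is not part of evaluate/evaluate_qa.py. It assumes the results are a list of items with the fields described above, and the helper names `reference_keys` and `page_type_mismatches` are hypothetical.

```python
def reference_keys(refs: list[dict]) -> set[tuple]:
    """Represent each reference as a (pdf, page) pair for comparison."""
    return {(ref["pdf"], ref["page"]) for ref in refs}


def page_type_mismatches(items: list[dict]) -> list[int]:
    """Return indices of evaluated items that mix page types (e.g. int and str)."""
    flagged = []
    for index, item in enumerate(items):
        if not item.get("success"):
            continue  # only items with success set to true are evaluated
        refs = item["predicted_references"] + item["correct_references"]
        if len({type(ref["page"]) for ref in refs}) > 1:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    # Hypothetical item: the prediction uses an int page, the reference a string.
    items = [
        {
            "question": "...",
            "predicted_answer": "...",
            "correct_answer": "...",
            "predicted_references": [{"pdf": "sample.pdf", "page": 2}],
            "correct_references": [{"pdf": "sample.pdf", "page": "2"}],
            "success": True,
        }
    ]
    print(page_type_mismatches(items))
    # The same page compares as different when the types differ:
    print(reference_keys(items[0]["predicted_references"])
          == reference_keys(items[0]["correct_references"]))
```

Converting every `page` to a single type (all integers or all strings) before writing the JSON avoids the mismatch.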

License

See TERMS_OF_USE.md for details. PDFs must be used according to the licenses of their distributors.
