A multi-document scientific-reasoning benchmark with verifiable ground truth across deduction, induction, and causal abduction, with parametric control over inference complexity and premise obfuscation.
SciR generates each task from a structured formal object (a deduction tree, an inductive rule hypothesis, or a causal graph) and renders it into multi-document scientific discourse via a domain-tuned, cross-validated rendering scheme. Two axes can be dialled independently: how hard the underlying inference is, and how hard it is to extract the relevant information from heterogeneous scientific text.
Paper | Dataset (Zenodo) | Dataset (Hugging Face)
# Clone
git clone https://github.com/idiap/scir.git
cd scir
# Environment
conda create -n scir python=3.12 -y
conda activate scir
pip install -e .
# Configure API keys
cp config.example.yaml config.yaml # then fill in placeholders
# Fetch the benchmark tasks (CC-BY-NC 4.0, hosted on Zenodo via the Idiap dataset page)
python scripts/download_data.py
# 1. Run a baseline cell (e.g., gpt-4o + neuro-symbolic on causal Easy)
python -m evaluation.causal.evaluate \
--dataset data/causal/tasks/main/tasks_5n1c_transformed_n200.json \
--llm-config gpt-4o --solver gies \
--modes nl_only 100_obfuscated \
--output data/causal/results/main/results_5n1c_4o_ns.json
The benchmark tasks are distributed via Zenodo through the Idiap dataset page (CC-BY-NC 4.0) and fetched on demand by scripts/download_data.py. They are also mirrored on Hugging Face for use with the datasets library. This repository ships only the per-cell baseline results under data/<track>/results/ and the small generation seeds under data/<track>/seeds/; the task files are not stored here to avoid duplicating the Zenodo release.
Replace --llm-config with one of 4o-azure, o3mini-azure, deepseek-r1, llama-70b, qwen30b, olmo32b. Replace --solver with prover9 (deduction), popper (induction), or gies (causal); omit for direct CoT; or use symbcot for the SymbCoT* baseline.
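To run the same cell across several models, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: build_command is a hypothetical helper, the output file names are made up for illustration, and only the flags shown in the example command are used. Passing solver=None drops the --solver flag, which corresponds to the direct CoT setting.

```python
import shlex

# Model configs listed in this README.
LLM_CONFIGS = ["4o-azure", "o3mini-azure", "deepseek-r1",
               "llama-70b", "qwen30b", "olmo32b"]


def build_command(llm_config, solver, dataset, output,
                  modes=("nl_only", "100_obfuscated")):
    """Assemble the argv list for one causal evaluation cell (hypothetical helper)."""
    cmd = ["python", "-m", "evaluation.causal.evaluate",
           "--dataset", dataset,
           "--llm-config", llm_config]
    if solver is not None:          # omit --solver for direct CoT
        cmd += ["--solver", solver]
    cmd += ["--modes", *modes, "--output", output]
    return cmd


dataset = "data/causal/tasks/main/tasks_5n1c_transformed_n200.json"
for llm in LLM_CONFIGS:
    # Output naming below is an assumption, chosen only for this example.
    output = f"data/causal/results/main/results_5n1c_{llm}_ns.json"
    print(shlex.join(build_command(llm, "gies", dataset, output)))
```

The printed lines can be pasted into a shell or handed to subprocess.run once the paths match your local setup.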
Some solvers rely on third-party tools that are not bundled with this repository for license-compatibility reasons. Install them separately:
- Prover9 (deduction, neuro-symbolic mode). Install per NLTK's Prover9 instructions, then either add the bin/ directory to your PATH or set the PROVER9 environment variable to point at it.
- Z3 (deduction, optional alternative to Prover9). Install Z3 from microsoft/z3 and either add z3 to your PATH or set Z3_BIN to the directory containing the binary.
- Popper (induction, neuro-symbolic mode). Install via logic-and-learning-lab/Popper and ensure popper-ilp (or your chosen entry point) is on PATH.
- GIES / pcalg (causal, neuro-symbolic mode). Requires R with the pcalg package, plus the Python gies and sempler packages (pip install scir[causal]).
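Before running a neuro-symbolic cell, it can help to confirm that the external tools are discoverable. The sketch below is not shipped with the repository: find_tool and report are hypothetical helpers, and the binary names (prover9, z3, popper-ilp, Rscript) are assumptions based on the list above. It checks the directory named by PROVER9 or Z3_BIN first and then falls back to PATH.

```python
import os
import shutil


def find_tool(binary, env_var=None):
    """Return a path to `binary`, or None if it cannot be found.

    If `env_var` is set, it is treated as a directory and checked first;
    otherwise (or if the binary is not there) PATH is searched.
    """
    if env_var:
        directory = os.environ.get(env_var)
        if directory:
            candidate = os.path.join(directory, binary)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return shutil.which(binary)


# binary name -> optional environment variable holding its directory
TOOLS = {
    "prover9": "PROVER9",
    "z3": "Z3_BIN",
    "popper-ilp": None,
    "Rscript": None,   # assumed entry point for R (needed for pcalg)
}


def report():
    """Map each tool name to its resolved path (or None)."""
    return {name: find_tool(name, env) for name, env in TOOLS.items()}


for name, path in report().items():
    print(f"{name:12s} {path if path else 'NOT FOUND'}")
```

A tool reported as NOT FOUND only matters for the track that uses it; for example, a missing popper-ilp does not affect causal runs.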
Trees of syllogisms are built by chained premise replacement: a premise of one syllogism is replaced by a sub-syllogism that derives it. A tree is labelled True, False, or Unknown by, respectively, keeping it intact, negating its conclusion, or deleting one of its premises. Each task pairs a base tree with one or more Unknown distractor trees, and predicates are instantiated with developmental-biology pathway data.
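The labelling scheme can be illustrated with a toy sketch. This is not the repository's generator: it assumes only universal statements of the form "All X are Y", uses placeholder terms A to D instead of pathway predicates, and all names (Node, expand, make_task, solve) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Statement 'All subj are pred'; children are the premises deriving it."""
    subj: str
    pred: str
    children: tuple = ()


def expand(node, middle):
    """Premise replacement: derive `node` from two premises via `middle`."""
    return Node(node.subj, node.pred,
                (Node(node.subj, middle), Node(middle, node.pred)))


def leaves(node):
    """Collect the leaf premises of a tree as (subj, pred) pairs."""
    if not node.children:
        return [(node.subj, node.pred)]
    return [leaf for child in node.children for leaf in leaves(child)]


def make_task(root, label, rng):
    """True: keep the tree. False: negate the conclusion. Unknown: drop a premise."""
    premises = leaves(root)
    if label == "Unknown":
        premises.pop(rng.randrange(len(premises)))
    return {"premises": premises, "conclusion": (root.subj, root.pred),
            "negated": label == "False", "label": label}


def solve(task):
    """Reference check: follow 'All X are Y' links from subject to predicate."""
    subj, pred = task["conclusion"]
    reachable, frontier = {subj}, [subj]
    while frontier:
        cur = frontier.pop()
        for a, b in task["premises"]:
            if a == cur and b not in reachable:
                reachable.add(b)
                frontier.append(b)
    if pred not in reachable:
        return "Unknown"
    return "False" if task["negated"] else "True"


# Chain two syllogisms: (A,B) and (B,D) give (A,D); (B,D) comes from (B,C), (C,D).
tree = Node("A", "D", (Node("A", "B"), expand(Node("B", "D"), "C")))
rng = random.Random(0)
for label in ("True", "False", "Unknown"):
    task = make_task(tree, label, rng)
    print(label, task["premises"], solve(task))
```

Because every premise in a chain is needed to reach the conclusion, deleting any one of them makes the conclusion underivable, which is what makes the Unknown label verifiable.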
A target rule and one or more distractor rules are sampled from a curated set of drug-interaction patterns (e.g., both inhibit the same enzyme vs. both bind the same target). Drug pairs are then selected so the positives support each rule, while one negative example per distractor invalidates that distractor without falsifying the target. The full task is rendered as background drug–protein facts plus observed interactions.
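The selection constraint can be sketched as follows. This is a toy illustration under stated assumptions, not the repository's code: the drug and protein names are placeholders, FACTS stands in for the background drug–protein facts, and same_enzyme, same_target, select_examples, and consistent are hypothetical names.

```python
from itertools import combinations

# Placeholder background facts: which enzymes each drug inhibits / targets it binds.
FACTS = {
    "drugA": {"inhibits": {"E1"}, "binds": {"T1"}},
    "drugB": {"inhibits": {"E1"}, "binds": {"T1"}},
    "drugC": {"inhibits": {"E2"}, "binds": {"T1"}},
    "drugD": {"inhibits": {"E1"}, "binds": {"T2"}},
}


def same_enzyme(a, b):
    """Rule: both drugs inhibit at least one common enzyme."""
    return bool(FACTS[a]["inhibits"] & FACTS[b]["inhibits"])


def same_target(a, b):
    """Rule: both drugs bind at least one common target."""
    return bool(FACTS[a]["binds"] & FACTS[b]["binds"])


def select_examples(target, distractors):
    """Positives satisfy every rule; each negative satisfies one distractor only."""
    pairs = list(combinations(sorted(FACTS), 2))
    positives = [p for p in pairs
                 if target(*p) and all(d(*p) for d in distractors)]
    negatives = []
    for d in distractors:
        neg = next((p for p in pairs if d(*p) and not target(*p)), None)
        if neg is None:
            raise ValueError("no pair separates this distractor from the target")
        negatives.append(neg)
    return positives, negatives


def consistent(rule, positives, negatives):
    """A rule survives if it covers all positives and no negatives."""
    return (all(rule(*p) for p in positives)
            and not any(rule(*n) for n in negatives))


positives, negatives = select_examples(same_enzyme, [same_target])
print("positives:", positives)
print("negatives:", negatives)
print("target consistent:", consistent(same_enzyme, positives, negatives))
print("distractor consistent:", consistent(same_target, positives, negatives))
```

With these facts the positives alone cannot distinguish the two rules; the single negative pair shares a target but not an enzyme, so it rules out the distractor while leaving the target rule intact.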
A connected subgraph is sampled from the Sachs protein-signalling network and a fictional protein XYZ is added with a random set of edges. Protein concentrations are simulated with a linear Gaussian SCM under observational and per-node do-interventions, yielding a table that matches the format of the original Sachs data. The task is to recover the edges of XYZ.
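A linear Gaussian SCM with do-interventions can be simulated in a few lines. The sketch below is illustrative only: the three-protein subgraph, the edge from XYZ, the edge weights, unit-variance noise, and the intervention value 2.0 are all assumptions, and the output format is not claimed to match the repository's tables.

```python
import numpy as np

# Nodes listed in topological order so every edge points left to right.
NODES = ["Raf", "XYZ", "Mek", "Erk"]
IDX = {name: i for i, name in enumerate(NODES)}

# W[i, j] is the weight of edge i -> j (values are arbitrary assumptions).
W = np.zeros((4, 4))
W[IDX["Raf"], IDX["Mek"]] = 0.8
W[IDX["Mek"], IDX["Erk"]] = 0.6
W[IDX["XYZ"], IDX["Mek"]] = 0.5   # hypothetical edge of the fictional protein


def simulate(W, n, rng, do=None):
    """Sample n rows from the SCM; `do` maps node index -> clamped value."""
    d = W.shape[0]
    X = np.zeros((n, d))
    for j in range(d):                      # relies on topological ordering
        if do is not None and j in do:
            X[:, j] = do[j]                 # intervention cuts incoming edges
        else:
            X[:, j] = X @ W[:, j] + rng.normal(size=n)
    return X


rng = np.random.default_rng(0)
regimes = {"observational": simulate(W, 500, rng)}
for name, j in IDX.items():
    regimes[f"do({name})"] = simulate(W, 500, rng, do={j: 2.0})

for regime, X in regimes.items():
    print(f"{regime:15s}", np.round(X.mean(axis=0), 2))
```

Comparing column means across regimes shows why interventions help: clamping XYZ shifts Mek and Erk, while clamping Erk leaves XYZ unchanged, which is the asymmetry a solver can use to orient the edges of XYZ.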
generation/ Parametric task generators (one package per domain)
evaluation/ Evaluation pipelines (CoT, NS, SymbCoT solvers + scoring)
prompts/ Prompt templates for transform + evaluation
config.example.yaml Placeholder API config — fill in to run anything
data/<domain>/seeds/ Generation seeds (deduction only)
data/<domain>/tasks/main/ Main-tier task files (n=200, both NL+OBF)
data/<domain>/tasks/difficulty_scaling/ Difficulty-scaling task files (n=50, NL-only)
data/<domain>/results/main/ Baseline results (216 cells: 6 models × 3 solvers × 2 modes × 6 tiers)
data/<domain>/results/difficulty_scaling/ Difficulty-scaling baseline results
assets/ Figures used by this README
Each domain has a generate.py:
python -m generation.deduction.generate --tier e4d1 --n 200
python -m generation.induction.generate --tier d2p2 --n 200
python -m generation.causal.generate --tier 5n1c --n 200
Then transform symbolic tasks to obfuscated scientific prose:
python -m generation.<domain>.transform --input <tasks.json> --output <out.json>
@misc{beckmann2026scir,
title = {SciR: A Controllable Benchmark for Scientific Reasoning in LLMs},
author = {Beckmann, Pierre and Valentino, Marco and Freitas, Andr{\'e}},
year = {2026},
eprint = {2606.13020},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
}
Source code is licensed under GPL-3.0-only (see LICENSE and LICENSES/). The dataset distributed via Zenodo (linked from the Idiap dataset page) is licensed under CC-BY-NC-4.0 (inheriting DrugBank's non-commercial restriction).
Questions? Open an issue or contact Pierre Beckmann.