SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

A multi-document scientific-reasoning benchmark with verifiable ground truth across deduction, induction, and causal abduction, with parametric control over inference complexity and premise obfuscation.

SciR generates each task from a structured formal object (a deduction tree, an inductive rule hypothesis, or a causal graph) and renders it into multi-document scientific discourse via a domain-tuned, cross-validated rendering scheme. Two axes can be dialled independently: how hard the underlying inference is, and how hard it is to extract the relevant information from heterogeneous scientific text.

Paper | Dataset (Zenodo) | Dataset (Hugging Face)

Quick Start

# Clone
git clone https://github.com/idiap/scir.git
cd scir

# Environment
conda create -n scir python=3.12 -y
conda activate scir
pip install -e .

# Configure API keys
cp config.example.yaml config.yaml   # then fill in placeholders

# Fetch the benchmark tasks (CC-BY-NC 4.0, hosted on Zenodo via the Idiap dataset page)
python scripts/download_data.py

# Run a baseline cell (e.g., gpt-4o + neuro-symbolic on the causal Easy tier)
python -m evaluation.causal.evaluate \
  --dataset data/causal/tasks/main/tasks_5n1c_transformed_n200.json \
  --llm-config gpt-4o --solver gies \
  --modes nl_only 100_obfuscated \
  --output data/causal/results/main/results_5n1c_4o_ns.json

The benchmark tasks are distributed via Zenodo through the Idiap dataset page (CC-BY-NC 4.0) and fetched on demand by scripts/download_data.py. They are also mirrored on Hugging Face for use with the datasets library. This repository ships only the per-cell baseline results under data/<track>/results/ and the small generation seeds under data/<track>/seeds/; the task files are not stored here to avoid duplicating the Zenodo release.

Set --llm-config to one of 4o-azure, o3mini-azure, deepseek-r1, llama-70b, qwen30b, or olmo32b. Set --solver to prover9 (deduction), popper (induction), or gies (causal); omit the flag for direct CoT, or pass symbcot for the SymbCoT* baseline.
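The flag combinations above can be assembled programmatically when sweeping several cells. The following is a minimal sketch, not part of the repository: build_eval_command is a hypothetical helper, and it assumes that evaluation.deduction.evaluate and evaluation.induction.evaluate accept the same flags as the causal entry point shown in the Quick Start.

```python
import shlex

# Neuro-symbolic solver per domain, as listed in this README.
DEFAULT_SOLVER = {"deduction": "prover9", "induction": "popper", "causal": "gies"}


def build_eval_command(domain, dataset, llm_config, output,
                       solver=None, modes=("nl_only", "100_obfuscated")):
    """Return the argv list for one baseline cell (hypothetical helper).

    solver=None leaves out --solver, which the README describes as direct CoT.
    Assumption: every domain exposes evaluation.<domain>.evaluate with the
    same flags as the causal example.
    """
    if domain not in DEFAULT_SOLVER:
        raise ValueError(f"unknown domain: {domain!r}")
    argv = ["python", "-m", f"evaluation.{domain}.evaluate",
            "--dataset", dataset, "--llm-config", llm_config]
    if solver is not None:
        argv += ["--solver", solver]
    argv += ["--modes", *modes, "--output", output]
    return argv


# Usage: rebuild the Quick Start command, then a direct-CoT variant.
ns_cmd = build_eval_command(
    "causal",
    "data/causal/tasks/main/tasks_5n1c_transformed_n200.json",
    "gpt-4o",
    "data/causal/results/main/results_5n1c_4o_ns.json",
    solver=DEFAULT_SOLVER["causal"],
)
cot_cmd = build_eval_command("causal", "tasks.json", "gpt-4o", "results_cot.json")
print(shlex.join(ns_cmd))
print(shlex.join(cot_cmd))
```

Returning an argv list rather than a string keeps the result safe to pass to subprocess.run without shell quoting.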

External tools

Some solvers rely on third-party tools that are not bundled with this repository for license-compatibility reasons. Install them separately:

Prover9 (deduction, neuro-symbolic mode). Install per NLTK's Prover9 instructions, then either add the bin/ directory to your PATH or set the PROVER9 environment variable to point at it.
Z3 (deduction, optional alternative to Prover9). Install Z3 from microsoft/z3 and either add z3 to your PATH or set Z3_BIN to the directory containing the binary.
Popper (induction, neuro-symbolic mode). Install via logic-and-learning-lab/Popper and ensure popper-ilp (or your chosen entry point) is on PATH.
GIES / pcalg (causal, neuro-symbolic mode). Requires R with the pcalg package, plus the Python gies and sempler packages (pip install scir[causal]).
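Before launching a neuro-symbolic run, it can help to confirm that the external binaries are discoverable. The sketch below is an illustration under assumptions: find_tool is a hypothetical helper, PROVER9 and Z3_BIN are treated as directories (as described above), and the binary names prover9, z3, popper-ilp, and Rscript are assumed to match your installation. It does not check that the R pcalg package is installed.

```python
import os
import shutil


def find_tool(binary, env_var=None, env=None, which=shutil.which):
    """Return a path to `binary`, or None if it cannot be found.

    If env_var is set in the environment, it is treated as a directory
    and checked first; otherwise the lookup falls back to PATH.
    (Hypothetical helper; ignores platform suffixes such as .exe.)
    """
    env = os.environ if env is None else env
    directory = env.get(env_var) if env_var else None
    if directory:
        candidate = os.path.join(directory, binary)
        if os.path.isfile(candidate):
            return candidate
    return which(binary)


# (binary name, optional environment variable holding its directory)
TOOLS = [("prover9", "PROVER9"), ("z3", "Z3_BIN"),
         ("popper-ilp", None), ("Rscript", None)]

for name, var in TOOLS:
    path = find_tool(name, var)
    print(f"{name:12s} {'found at ' + path if path else 'NOT FOUND'}")
```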

Domains

Deduction

Trees of syllogisms are built by chained premise replacement. A tree is labelled True, False, or Unknown by keeping it unchanged, negating its conclusion, or deleting a premise, respectively. Each task pairs a base tree with one or more Unknown distractor trees, and predicates are instantiated with developmental-biology pathway data.
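The three labels can be illustrated with a toy chain of "All X are Y" premises. This is a simplified sketch rather than the repository's generator: the predicate names A to D are placeholders, and entails and label are hypothetical helpers that only handle universal statements chained by transitivity.

```python
def entails(premises, subject, predicate):
    """True if 'All subject are predicate' follows by chaining premises.

    premises: iterable of (x, y) pairs, each read as 'All x are y'.
    """
    frontier, seen = [subject], {subject}
    while frontier:
        node = frontier.pop()
        if node == predicate:
            return True
        for x, y in premises:
            if x == node and y not in seen:
                seen.add(y)
                frontier.append(y)
    return False


def label(premises, conclusion):
    """conclusion = (subject, predicate, positive).

    positive=False stands for the negated claim 'Not all subject are predicate'.
    """
    subject, predicate, positive = conclusion
    if entails(premises, subject, predicate):
        return "True" if positive else "False"
    return "Unknown"  # the chain is broken, so nothing can be concluded


base = [("A", "B"), ("B", "C"), ("C", "D")]
kept = label(base, ("A", "D", True))                    # tree kept as-is
negated = label(base, ("A", "D", False))                # conclusion negated
deleted = label(base[:1] + base[2:], ("A", "D", True))  # premise B->C deleted
print(kept, negated, deleted)
```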

Induction

A target rule and one or more distractor rules are sampled from a curated set of drug-interaction patterns (e.g., both inhibit the same enzyme vs. both bind the same target). Drug pairs are then selected so the positives support each rule, while one negative example per distractor invalidates that distractor without falsifying the target. The full task is rendered as background drug–protein facts plus observed interactions.
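The role of the negative examples can be seen in a small consistency check. The sketch below uses invented placeholder drugs (d1 to d4), enzymes, and targets, not DrugBank data, and the rule functions are hypothetical stand-ins for the curated patterns.

```python
# Toy background facts (placeholders, not real drug data).
INHIBITS = {"d1": {"E1"}, "d2": {"E1"}, "d3": {"E2"}, "d4": {"E3"}}
BINDS = {"d1": {"T1"}, "d2": {"T1"}, "d3": {"T2"}, "d4": {"T2"}}


def same_enzyme(pair):
    """Candidate rule: both drugs inhibit at least one common enzyme."""
    a, b = pair
    return bool(INHIBITS[a] & INHIBITS[b])


def same_target(pair):
    """Candidate rule: both drugs bind at least one common target."""
    a, b = pair
    return bool(BINDS[a] & BINDS[b])


def consistent(rule, positives, negatives):
    """A rule survives if it covers every positive and no negative."""
    return all(rule(p) for p in positives) and not any(rule(n) for n in negatives)


positives = [("d1", "d2")]  # supports both rules
negatives = [("d3", "d4")]  # shares a target but no enzyme

target_ok = consistent(same_enzyme, positives, negatives)
distractor_ok = consistent(same_target, positives, negatives)
print(target_ok, distractor_ok)
```

Here the single negative pair is covered by the distractor rule, which rules it out, while the target rule does not fire on it and so remains consistent with all observations.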

Causal

A connected subgraph is sampled from the Sachs protein-signalling network, and a fictional protein XYZ is added with a random set of edges. Protein concentrations are simulated with a linear Gaussian structural causal model (SCM) under the observational regime and under per-node do-interventions, yielding a table that matches the format of the original Sachs data. The task is to recover the edges of XYZ.
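To make the simulation step concrete, here is a minimal sketch of sampling from a linear Gaussian SCM with and without a do-intervention. It is an illustration under assumptions, not the repository's generator: the graph P1 -> P2 -> XYZ, the edge weights, and the unit noise variance are invented, and simulate is a hypothetical helper that uses only the standard library.

```python
import random


def simulate(edges, order, n, rng, do=None):
    """Sample n rows from a linear Gaussian SCM.

    edges: {(parent, child): weight}
    order: nodes in topological order
    do:    {node: value}; clamped nodes ignore their parents and noise
    """
    do = do or {}
    rows = []
    for _ in range(n):
        row = {}
        for node in order:
            if node in do:
                row[node] = do[node]
            else:
                mean = sum(w * row[p] for (p, c), w in edges.items() if c == node)
                row[node] = mean + rng.gauss(0.0, 1.0)
        rows.append(row)
    return rows


def column_mean(rows, node):
    return sum(r[node] for r in rows) / len(rows)


edges = {("P1", "P2"): 2.0, ("P2", "XYZ"): -1.0}  # toy graph, assumed weights
order = ["P1", "P2", "XYZ"]
rng = random.Random(0)

observational = simulate(edges, order, 2000, rng)
do_p2 = simulate(edges, order, 2000, rng, do={"P2": 3.0})    # do(P2 = 3)
do_xyz = simulate(edges, order, 2000, rng, do={"XYZ": 10.0})  # do(XYZ = 10)

# Intervening on a parent shifts XYZ; intervening on XYZ leaves P2 unchanged.
print(column_mean(do_p2, "XYZ"), column_mean(do_xyz, "P2"))
```

The asymmetry between the two interventions is what lets a solver orient the edge between P2 and XYZ, which observational data alone would leave ambiguous in a linear Gaussian model.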

Repository layout

generation/                                Parametric task generators (one package per domain)
evaluation/                                Evaluation pipelines (CoT, NS, SymbCoT solvers + scoring)
prompts/                                   Prompt templates for transform + evaluation
config.example.yaml                        Placeholder API config; fill in before running anything
data/<domain>/seeds/                       Generation seeds (deduction only)
data/<domain>/tasks/main/                  Main-tier task files (n=200, both NL+OBF)
data/<domain>/tasks/difficulty_scaling/    Difficulty-scaling task files (n=50, NL-only)
data/<domain>/results/main/                Baseline results (216 cells: 6 models × 3 solvers × 2 modes × 6 tiers)
data/<domain>/results/difficulty_scaling/  Difficulty-scaling baseline results
assets/                                    Figures used by this README

Generating new tasks

Each domain has a generate.py:

python -m generation.deduction.generate --tier e4d1 --n 200
python -m generation.induction.generate --tier d2p2 --n 200
python -m generation.causal.generate    --tier 5n1c --n 200

Then transform symbolic tasks to obfuscated scientific prose:

python -m generation.<domain>.transform --input <tasks.json> --output <out.json>

Citation

@misc{beckmann2026scir,
  title  = {SciR: A Controllable Benchmark for Scientific Reasoning in LLMs},
  author = {Beckmann, Pierre and Valentino, Marco and Freitas, Andr{\'e}},
  year   = {2026},
  eprint = {2606.13020},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
}

License

Source code is licensed under GPL-3.0-only (see LICENSE and LICENSES/). The dataset distributed via Zenodo (linked from the Idiap dataset page) is licensed under CC-BY-NC-4.0 (inheriting DrugBank's non-commercial restriction).

Questions? Open an issue or contact Pierre Beckmann.
