SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

A multi-document scientific-reasoning benchmark with verifiable ground truth across deduction, induction, and causal abduction, with parametric control over inference complexity and premise obfuscation.

SciR generates each task from a structured formal object (a deduction tree, an inductive rule hypothesis, or a causal graph) and renders it into multi-document scientific discourse via a domain-tuned, cross-validated rendering scheme. Two axes can be dialled independently: how hard the underlying inference is, and how hard it is to extract the relevant information from heterogeneous scientific text.

Paper | Dataset (Zenodo) | Dataset (Hugging Face)

[Figure: pipeline]


Quick Start

# Clone
git clone https://github.com/idiap/scir.git
cd scir

# Environment
conda create -n scir python=3.12 -y
conda activate scir
pip install -e .

# Configure API keys
cp config.example.yaml config.yaml   # then fill in placeholders

# Fetch the benchmark tasks (CC-BY-NC 4.0, hosted on Zenodo via the Idiap dataset page)
python scripts/download_data.py

# Run a baseline cell (e.g., gpt-4o + neuro-symbolic on the causal Easy tier)
python -m evaluation.causal.evaluate \
  --dataset data/causal/tasks/main/tasks_5n1c_transformed_n200.json \
  --llm-config gpt-4o --solver gies \
  --modes nl_only 100_obfuscated \
  --output data/causal/results/main/results_5n1c_4o_ns.json

The benchmark tasks are distributed via Zenodo through the Idiap dataset page (CC-BY-NC 4.0) and fetched on demand by scripts/download_data.py. They are also mirrored on Hugging Face for use with the datasets library. This repository ships only the per-cell baseline results under data/<track>/results/ and the small generation seeds under data/<track>/seeds/; the task files are not stored here to avoid duplicating the Zenodo release.

Set --llm-config to one of 4o-azure, o3mini-azure, deepseek-r1, llama-70b, qwen30b, or olmo32b. For the neuro-symbolic baselines, set --solver to prover9 (deduction), popper (induction), or gies (causal). Omit --solver for direct CoT, or set it to symbcot for the SymbCoT* baseline.
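To run the same cell for several models, the command can be assembled programmatically. The following is a minimal sketch, assuming the flags behave as in the Quick Start command. The helper name build_eval_command and the output file naming are hypothetical, and the commands are only printed, not executed.

```python
import shlex

# Model config names as listed in this README.
LLM_CONFIGS = ["4o-azure", "o3mini-azure", "deepseek-r1",
               "llama-70b", "qwen30b", "olmo32b"]


def build_eval_command(llm_config, solver=None,
                       dataset="data/causal/tasks/main/tasks_5n1c_transformed_n200.json",
                       modes=("nl_only", "100_obfuscated")):
    """Return the argv list for one causal baseline cell (hypothetical helper)."""
    # The output file name is an assumption for illustration only.
    tag = solver if solver else "cot"
    output = f"data/causal/results/main/results_5n1c_{llm_config}_{tag}.json"
    cmd = ["python", "-m", "evaluation.causal.evaluate",
           "--dataset", dataset,
           "--llm-config", llm_config]
    if solver is not None:
        # Leaving --solver out selects direct CoT, as described above.
        cmd += ["--solver", solver]
    cmd += ["--modes", *modes, "--output", output]
    return cmd


if __name__ == "__main__":
    for name in LLM_CONFIGS:
        print(shlex.join(build_eval_command(name, solver="gies")))
```

Each printed line can be pasted into a shell, or the argv list can be passed to subprocess.run once config.yaml is filled in.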


External tools

Some solvers rely on third-party tools that are not bundled with this repository for license-compatibility reasons. Install them separately:

  • Prover9 (deduction, neuro-symbolic mode). Install per NLTK's Prover9 instructions, then either add the bin/ directory to your PATH or set the PROVER9 environment variable to point at it.
  • Z3 (deduction, optional alternative to Prover9). Install Z3 from microsoft/z3 and either add z3 to your PATH or set Z3_BIN to the directory containing the binary.
  • Popper (induction, neuro-symbolic mode). Install via logic-and-learning-lab/Popper and ensure popper-ilp (or your chosen entry point) is on PATH.
  • GIES / pcalg (causal, neuro-symbolic mode). Requires R with the pcalg package, plus the Python gies and sempler packages (pip install scir[causal]).
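Before launching a neuro-symbolic run, it can help to check that the tools above are discoverable. This is a rough sketch and not part of the repository: the executable names prover9, z3, and popper-ilp and the PROVER9 and Z3_BIN variables are taken from the list above, while the find_tool helper and the use of Rscript as a proxy for an R installation are assumptions.

```python
import os
import shutil


def find_tool(executable, env_var=None):
    """Return a path to `executable`, or None if it cannot be found.

    If `env_var` is set, it is treated as a directory and checked first;
    otherwise the lookup falls back to PATH.
    """
    if env_var and os.environ.get(env_var):
        candidate = os.path.join(os.environ[env_var], executable)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which(executable)


# (label, executable, optional env var holding the tool's directory)
TOOLS = [
    ("Prover9", "prover9", "PROVER9"),
    ("Z3", "z3", "Z3_BIN"),
    ("Popper", "popper-ilp", None),
    ("R (for pcalg)", "Rscript", None),  # assumption: Rscript ships with R
]

if __name__ == "__main__":
    for label, exe, env in TOOLS:
        path = find_tool(exe, env)
        print(f"{label:15s} {'found: ' + path if path else 'MISSING'}")
```

The check only confirms that a binary exists; it does not verify that the pcalg R package or the Python gies and sempler packages are installed.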

Domains

Deduction

[Figure: Deduction track generation]

Trees of syllogisms are built by chained premise replacement. A tree is labelled True, False, or Unknown by, respectively, keeping it unchanged, negating its conclusion, or deleting one of its premises. Each task pairs a base tree with one or more Unknown distractor trees, and predicates are instantiated with developmental-biology pathway data.
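The labelling scheme can be illustrated with a toy chain of "all X are Y" premises. This is a simplified sketch, not the repository's generator: the Tree class, the helper names, and the placeholder predicates A, B, C are all hypothetical, and real tasks use developmental-biology predicates.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tree:
    premises: tuple        # pairs (X, Y), each read as "all X are Y"
    conclusion: tuple      # pair (X, Y) claimed to follow from the premises
    negated: bool = False  # True if the stated conclusion has been negated


def entails(premises, start, goal):
    """True if chaining the premises leads from `start` to `goal`."""
    frontier, seen = [start], {start}
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        for x, y in premises:
            if x == node and y not in seen:
                seen.add(y)
                frontier.append(y)
    return False


def make_variant(base, label):
    """Derive a variant with the requested gold label from a valid base tree."""
    if label == "True":
        return base                                       # keep as is
    if label == "False":
        return replace(base, negated=True)                # negate the conclusion
    if label == "Unknown":
        return replace(base, premises=base.premises[1:])  # delete a premise
    raise ValueError(label)


def gold_label(tree):
    """Recompute the label from the tree itself."""
    if not entails(tree.premises, *tree.conclusion):
        return "Unknown"
    return "False" if tree.negated else "True"


base = Tree(premises=(("A", "B"), ("B", "C")), conclusion=("A", "C"))
for label in ("True", "False", "Unknown"):
    assert gold_label(make_variant(base, label)) == label
```

Deleting a premise breaks the chain, so neither the conclusion nor its negation follows, which is why those trees serve as Unknown distractors.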

Induction

[Figure: Induction track generation]

A target rule and one or more distractor rules are sampled from a curated set of drug-interaction patterns (e.g., both drugs inhibit the same enzyme vs. both bind the same target). Drug pairs are then selected so that the positive examples support every rule, target and distractors alike, while one negative example per distractor invalidates that distractor without falsifying the target. The full task is rendered as background drug–protein facts plus observed interactions.
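A toy instance shows how a single negative example separates the target from a distractor. All drug, enzyme, and target names below are invented for illustration, and the rule functions are hypothetical stand-ins for the curated patterns.

```python
# Background facts (invented): which enzymes each drug inhibits,
# and which targets each drug binds.
INHIBITS = {"d1": {"E1"}, "d2": {"E1"}, "d3": {"E2"},
            "d4": {"E2"}, "d5": {"E3"}, "d6": {"E4"}}
BINDS = {"d1": {"T1"}, "d2": {"T1"}, "d3": {"T2"},
         "d4": {"T2"}, "d5": {"T9"}, "d6": {"T9"}}


def same_enzyme(a, b):
    """Target rule: both drugs inhibit a common enzyme."""
    return bool(INHIBITS[a] & INHIBITS[b])


def same_target(a, b):
    """Distractor rule: both drugs bind a common target."""
    return bool(BINDS[a] & BINDS[b])


# Observed interactions: positives satisfy both rules.
positives = [("d1", "d2"), ("d3", "d4")]
# Observed non-interaction: satisfies the distractor but not the target.
negatives = [("d5", "d6")]


def consistent(rule):
    """A rule fits the data if it covers every positive and no negative."""
    return (all(rule(a, b) for a, b in positives)
            and not any(rule(a, b) for a, b in negatives))


print("same_enzyme consistent:", consistent(same_enzyme))
print("same_target consistent:", consistent(same_target))
```

Here the pair (d5, d6) shares a target but no enzyme, so the distractor wrongly predicts an interaction while the target rule stays consistent with every observation.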

Causal

[Figure: Causal track generation]

A connected subgraph is sampled from the Sachs protein-signalling network and a fictional protein XYZ is added with a random set of edges. Protein concentrations are simulated with a linear Gaussian SCM under observational and per-node do-interventions, yielding a table that matches the format of the original Sachs data. The task is to recover the edges of XYZ.
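The simulation step can be sketched with the standard library alone. This is an illustrative toy, not the repository's generator: Raf, Mek, and Erk are protein names from the Sachs data, but the edges attached to XYZ, all edge weights, the unit noise variance, and the function names are assumptions.

```python
import random

# Toy DAG: each node maps to its parents and (made-up) edge weights.
WEIGHTS = {
    "Raf": {},
    "Mek": {"Raf": 0.8},
    "XYZ": {"Mek": 0.5},
    "Erk": {"Mek": 0.7, "XYZ": -0.4},
}
ORDER = ["Raf", "Mek", "XYZ", "Erk"]  # a topological order of the DAG


def sample(rng, do=None):
    """Draw one row; `do` maps intervened nodes to their clamped values."""
    do = do or {}
    row = {}
    for node in ORDER:
        if node in do:
            # A do-intervention ignores the node's parents entirely.
            row[node] = do[node]
        else:
            mean = sum(w * row[p] for p, w in WEIGHTS[node].items())
            row[node] = mean + rng.gauss(0.0, 1.0)  # linear Gaussian mechanism
    return row


def simulate(n, seed=0, do=None):
    """Return n rows under the observational or an interventional regime."""
    rng = random.Random(seed)
    return [sample(rng, do) for _ in range(n)]


if __name__ == "__main__":
    observational = simulate(3, seed=1)
    interventional = simulate(3, seed=1, do={"Mek": 2.0})
    for row in observational + interventional:
        print({k: round(v, 2) for k, v in row.items()})
```

Comparing the XYZ column across per-node interventions is what lets a solver such as GIES orient the edges of XYZ: clamping a parent shifts XYZ, while clamping a child does not.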


Repository layout

generation/                     Parametric task generators (one package per domain)
evaluation/                     Evaluation pipelines (CoT, NS, SymbCoT solvers + scoring)
prompts/                        Prompt templates for transform + evaluation
config.example.yaml             Placeholder API config — fill in to run anything
data/<domain>/seeds/                       Generation seeds (deduction only)
data/<domain>/tasks/main/                  Main-tier task files (n=200, both NL+OBF)
data/<domain>/tasks/difficulty_scaling/    Difficulty-scaling task files (n=50, NL-only)
data/<domain>/results/main/                Baseline results (216 cells: 6 models × 3 solvers × 2 modes × 6 tiers)
data/<domain>/results/difficulty_scaling/  Difficulty-scaling baseline results
assets/                         Figures used by this README

Generating new tasks

Each domain has a generate.py:

python -m generation.deduction.generate --tier e4d1 --n 200
python -m generation.induction.generate --tier d2p2 --n 200
python -m generation.causal.generate    --tier 5n1c --n 200

Then transform symbolic tasks to obfuscated scientific prose:

python -m generation.<domain>.transform --input <tasks.json> --output <out.json>

Citation

@misc{beckmann2026scir,
  title  = {SciR: A Controllable Benchmark for Scientific Reasoning in LLMs},
  author = {Beckmann, Pierre and Valentino, Marco and Freitas, Andr{\'e}},
  year   = {2026},
  eprint = {2606.13020},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
}

License

Source code is licensed under GPL-3.0-only (see LICENSE and LICENSES/). The dataset distributed via Zenodo (linked from the Idiap dataset page) is licensed under CC-BY-NC-4.0 (inheriting DrugBank's non-commercial restriction).


Questions? Open an issue or contact Pierre Beckmann.
