EPiCs-group/ConforFormer

ConforFormer

This repository contains the code used for the ConforFormer project. It combines:

  • a research fork of Uni-Mol with the model, task, loss, and data pipeline changes used in this work
  • data-processing pipelines for the conformer and isomer datasets used by the paper
  • lightweight 2D fingerprint baselines for the Uni-Mol MoleculeNet benchmark

Model checkpoints are published on Hugging Face. Dataset artifacts generated by the pipelines in this repository can be reproduced from the provided scripts.

Repository layout

Path              Purpose
unimol/           Uni-Mol fork with the ConforFormer model code, training tasks, losses, and inference utilities
baselines/        2D molecular fingerprint baselines based on CatBoost and XGBoost
data_processing/  Pipelines for reducing Uni-Mol data, filtering OpenMolecules conformers, generating isomer annotations, and building the contrastive benchmark
analysis/         Analysis scripts for enantiomer/optisomer labeling and benchmark evaluation
example_scripts/  Example shell scripts for training, fine-tuning, inference, and 2D baseline runs
results/          Checked-in baseline tuning runs and benchmark summaries

Quick start

2D baselines

The repository root includes a lightweight Python environment for the 2D baselines defined in baselines/. These scripts use:

  • CatBoost with OpenBabel fingerprints (FP2, FP3, FP4, MACCS)
  • XGBoost with RDKit Morgan fingerprints (ECFP4_1024 in the paper setup)

Install the baseline dependencies with uv:

uv sync

Run the OpenBabel CatBoost benchmark:

uv run python baselines/catboost_fp2_baseline.py \
  --data-root data_downloads/unimol/molecular_property_prediction \
  --output-dir results/catboost_fp2 \
  --feature-mode tanimoto \
  --fingerprint FP2 \
  --n-anchors 256
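
The --feature-mode tanimoto and --n-anchors flags suggest that each molecule is described by its Tanimoto similarity to a fixed set of anchor fingerprints. The script's internals are not shown in this README, so the following is only a minimal sketch of that general idea, assuming fingerprints are represented as sets of on-bit indices. The helper names tanimoto and anchor_features are hypothetical and are not taken from baselines/.

```python
from typing import Iterable, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def anchor_features(fp: Set[int], anchors: Iterable[Set[int]]) -> List[float]:
    """Describe one fingerprint by its similarity to each anchor fingerprint."""
    return [tanimoto(fp, anchor) for anchor in anchors]


# Toy data: three anchors instead of the 256 used in the command above.
anchors = [{1, 2, 3}, {3, 4}, {7}]
features = anchor_features({2, 3, 4}, anchors)
print(features)
```

Under this reading, the feature vector has one entry per anchor, so --n-anchors 256 would give a 256-dimensional input to the gradient-boosted model regardless of the underlying fingerprint length.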

Run the RDKit ECFP4 XGBoost benchmark:

uv run python -m baselines.xgb_ecfp_baseline \
  --data-root data_downloads/unimol/molecular_property_prediction \
  --tasks all \
  --radius 2 \
  --fp-bits 1024 \
  --output-dir results/xgb_ecfp4_1024
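
The label ECFP4_1024 and the flags --radius 2 --fp-bits 1024 describe the same setting: the number in "ECFP4" is the fingerprint diameter, which is twice the Morgan radius. The short sketch below makes that mapping explicit. The function parse_ecfp_name and the MorganSettings type are hypothetical illustrations and are not part of the baseline scripts.

```python
import re
from typing import NamedTuple


class MorganSettings(NamedTuple):
    radius: int
    n_bits: int


def parse_ecfp_name(name: str) -> MorganSettings:
    """Translate a label such as 'ECFP4_1024' into Morgan radius and bit length."""
    match = re.fullmatch(r"ECFP(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized fingerprint label: {name!r}")
    diameter, n_bits = int(match.group(1)), int(match.group(2))
    if diameter % 2:
        raise ValueError("ECFP diameter must be even")
    # ECFP labels use the diameter; Morgan fingerprints are set by radius.
    return MorganSettings(radius=diameter // 2, n_bits=n_bits)


settings = parse_ecfp_name("ECFP4_1024")
print(settings.radius, settings.n_bits)
```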

Wrapper scripts are available in example_scripts/baselines/. See baselines/README.md for tuning workflows, reference configs, and output locations.

ConforFormer and Uni-Mol workflows

The model code lives under unimol/. Use the Uni-Mol-specific requirements and setup instructions in unimol/README.md for:

  • pretraining and fine-tuning
  • conformer embedding and inference
  • docking and pocket tasks

Example entry points are provided in example_scripts/.

Data pipelines

data_processing/ contains the pipelines used to reproduce the data assets referenced in the paper:

  • reduced Uni-Mol splits
  • OpenMolecules conformer filtering and grouping
  • isomer lookup generation and tailored Uni-Mol datasets
  • the contrastive benchmark generation workflow

Large pipelines are numbered in execution order inside their respective directories. See data_processing/README.md for dataset prerequisites.
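
Because the steps are numbered, the execution order can be recovered by sorting on the numeric prefix rather than alphabetically (so that step 10 follows step 2). The sketch below assumes the number appears at the start of each file or directory name; that naming pattern and the example file names are assumptions for illustration, not names taken from data_processing/.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def ordered_steps(directory: Path) -> List[Path]:
    """Return entries whose names start with digits, sorted by that number."""
    numbered = []
    for path in directory.iterdir():
        match = re.match(r"(\d+)", path.name)
        if match:
            numbered.append((int(match.group(1)), path.name, path))
    # Sort numerically first, then by name to break ties deterministically.
    numbered.sort(key=lambda item: (item[0], item[1]))
    return [path for _, _, path in numbered]


# Demonstration on a throwaway directory with hypothetical step names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("10_export.py", "2_filter.py", "1_download.sh", "README.md"):
        (root / name).touch()
    step_names = [p.name for p in ordered_steps(root)]

print(step_names)
```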

Results and references

The results/ directory includes checked-in tuning runs and benchmark summaries for the 2D baseline experiments.

These outputs serve as references for reproducing the reported 2D baseline numbers without rerunning every sweep from scratch.

Requirements

  • Python >=3.11 for the root baseline environment
  • uv for dependency management at the repository root
  • OpenBabel for the OpenBabel-based fingerprint baselines
  • RDKit for ECFP/Morgan fingerprint baselines
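
A quick way to check these requirements before running the baselines is sketched below. It assumes that RDKit and OpenBabel are importable as the Python modules rdkit and openbabel and that uv is on the PATH; the repository may rely on different bindings or a command-line install, so treat the module names as assumptions. The function check_environment is a hypothetical helper.

```python
import importlib.util
import shutil
import sys
from typing import Dict, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 11)) -> Dict[str, bool]:
    """Report which of the listed requirements appear to be satisfied."""
    return {
        "python": sys.version_info >= min_python,
        "uv": shutil.which("uv") is not None,
        # Module names below are assumptions about how the tools are installed.
        "rdkit": importlib.util.find_spec("rdkit") is not None,
        "openbabel": importlib.util.find_spec("openbabel") is not None,
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'not found'}")
```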

For the Uni-Mol fork and its additional training requirements, refer to the documentation in unimol/.

License

Original contributions in this repository are released under the MIT License. See LICENSE. For project questions, contact e.a.pidko@tudelft.nl.
