This repository contains the code used for the ConforFormer project. It combines:
- a research fork of Uni-Mol with the model, task, loss, and data pipeline changes used in this work
- data-processing pipelines for the conformer and isomer datasets used by the paper
- lightweight 2D fingerprint baselines for the Uni-Mol MoleculeNet benchmark
Model checkpoints are published on Hugging Face. Dataset artifacts generated by the pipelines in this repository can be reproduced from the provided scripts.
| Path | Purpose |
|---|---|
| `unimol/` | Uni-Mol fork with the ConforFormer model code, training tasks, losses, and inference utilities |
| `baselines/` | 2D molecular fingerprint baselines based on CatBoost and XGBoost |
| `data_processing/` | Pipelines for reducing Uni-Mol data, filtering OpenMolecules conformers, generating isomer annotations, and building the contrastive benchmark |
| `analysis/` | Analysis scripts for enantiomer/optisomer labeling and benchmark evaluation |
| `example_scripts/` | Example shell scripts for training, fine-tuning, inference, and 2D baseline runs |
| `results/` | Checked-in baseline tuning runs and benchmark summaries |
The repository root includes a lightweight Python environment for the 2D baselines defined in `baselines/`. These scripts use:
- CatBoost with OpenBabel fingerprints (`FP2`, `FP3`, `FP4`, `MACCS`)
- XGBoost with RDKit Morgan fingerprints (`ECFP4_1024` in the paper setup)
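All of these fingerprints are fixed-length bit vectors over hashed substructure keys. As a rough illustration only (the real FP2/MACCS/Morgan encodings live in OpenBabel and RDKit, and the substructure keys below are made up), a folded bit fingerprint can be sketched as:

```python
import zlib

def fold_fingerprint(keys, n_bits=1024):
    """Fold arbitrary substructure keys into a fixed-size bit vector.

    Each key is hashed to one of n_bits positions; collisions simply
    set the same bit twice. This mirrors the folding idea behind
    ECFP-style fingerprints, not any specific toolkit's algorithm.
    """
    bits = [0] * n_bits
    for key in keys:
        bits[zlib.crc32(key.encode()) % n_bits] = 1
    return bits

# Hypothetical keys; a real pipeline would enumerate atom environments.
fp = fold_fingerprint(["C-O", "C=O", "c:c"])
```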
Install the baseline dependencies with uv:

```shell
uv sync
```

Run the OpenBabel CatBoost benchmark:

```shell
uv run python baselines/catboost_fp2_baseline.py \
  --data-root data_downloads/unimol/molecular_property_prediction \
  --output-dir results/catboost_fp2 \
  --feature-mode tanimoto \
  --fingerprint FP2 \
  --n-anchors 256
```

Run the RDKit ECFP4 XGBoost benchmark:

```shell
uv run python -m baselines.xgb_ecfp_baseline \
  --data-root data_downloads/unimol/molecular_property_prediction \
  --tasks all \
  --radius 2 \
  --fp-bits 1024 \
  --output-dir results/xgb_ecfp4_1024
```

Wrapper scripts are available in `example_scripts/baselines/`.
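One plausible reading of `--feature-mode tanimoto` with `--n-anchors 256` is that each molecule is featurized by its Tanimoto similarity to a fixed set of anchor fingerprints; the function names below are illustrative, not the repository's API:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tanimoto_features(fp, anchor_fps):
    # One dense feature per anchor fingerprint; in the assumed setup,
    # these rows would be fed to the CatBoost model.
    return [tanimoto(fp, anchor) for anchor in anchor_fps]
```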
See baselines/README.md for tuning workflows,
reference configs, and output locations.
The model code lives under unimol/. Use the Uni-Mol-specific
requirements and setup instructions in unimol/README.md
for:
- pretraining and fine-tuning
- conformer embedding and inference
- docking and pocket tasks
Example entry points are provided in `example_scripts/`.
data_processing/ contains the pipelines used to
reproduce the data assets referenced in the paper:
- reduced Uni-Mol splits
- OpenMolecules conformer filtering and grouping
- isomer lookup generation and tailored Uni-Mol datasets
- the contrastive benchmark generation workflow
Large pipelines are numbered in execution order inside their respective
directories. See data_processing/README.md for
dataset prerequisites.
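Because the stage scripts are numbered, they sort lexicographically into execution order. A minimal sketch of discovering a pipeline's stages this way (the `01_*.py` naming and directory layout are assumptions; see `data_processing/README.md` for what actually applies):

```python
from pathlib import Path

def ordered_stages(pipeline_dir):
    """Return a pipeline's numbered stage scripts in execution order.

    Assumes stages are named like 01_foo.py, 02_bar.py, ... so that
    lexicographic sorting matches the intended run order.
    """
    return sorted(p for p in Path(pipeline_dir).glob("[0-9]*_*.py"))
```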
The repository includes checked-in summaries for the 2D baseline experiments, including:
- fingerprint sweep comparisons in `results/fingerprint_sweep/`
- tuned CatBoost runs in `results/catboost_tuning_runs/`
- repeated-seed ECFP4 tuning runs for XGBoost and CatBoost in `results/`
These outputs are useful as references for reproducing the reported 2D baseline numbers without rerunning every sweep from scratch.
- Python `>=3.11` for the root baseline environment
- `uv` for dependency management at the repository root
- OpenBabel for the OpenBabel-based fingerprint baselines
- RDKit for ECFP/Morgan fingerprint baselines
For the Uni-Mol fork and its additional training requirements, refer to the
documentation in unimol/.
Original contributions in this repository are released under the MIT License.
See LICENSE. For project questions, contact
e.a.pidko@tudelft.nl.