A machine learning framework for predicting environmental impacts from molecular SMILES strings using pre-trained Graph Neural Networks (GNNs). This repository provides ready-to-use models for Life Cycle Assessment (LCA) across 15 environmental impact categories.
- 🎯 Single Molecule Inference: Predict environmental impacts from SMILES strings
- 📊 15 Environmental Impact Categories: Climate change, acidification, eutrophication, toxicity, resource depletion, etc.
- 🧠 Multiple Model Types: Molecular GNNs, country-specific GNNs, and energy-aware GNNs
- ⚡ Ready-to-Use: Pre-trained models with automatic denormalization
- 🔄 Batch Processing: Process multiple molecules from Excel files
- 🌍 Country-Specific Predictions: Regional environmental modeling with 90+ countries
- 🧪 Comprehensive Testing: Full test suite with CI/CD pipeline
- 🔧 Development Ready: Automated code quality checks and type safety
- Quick Start
- Docker Deployment
- Development & Testing
- Research Functions
- Model Architecture
- Repository Structure
- Configuration
- Requirements
- Contributing
- License & Contact
- Contributors
```bash
# Clone the repository
git clone https://github.com/process-intelligence-research/LCA_GNNs.git
cd LCA_GNNs

# Install dependencies
pip install -r requirements.txt
```

```bash
# Single environmental impact (e.g., Global Warming Impact)
python main.py --workflow inference \
    --model_path trained_models/GNN_C_single/GNN_C_Gwi_final_lr_5.00e-05.pth \
    --smiles "CCO" \
    --target_task "Gwi" \
    --country_name "Germany"
```
```bash
# All 15 environmental impacts at once (multitask)
python main.py --workflow inference \
    --model_path trained_models/GNN_C_multi_best.pth \
    --smiles "CCO" \
    --dataset_type "GNN_C" \
    --country_name "Germany" \
    --multitask

# Energy-focused environmental assessment
python main.py --workflow inference \
    --model_path trained_models/GNN_E_multi_best.pth \
    --smiles "CCO" \
    --dataset_type "GNN_E" \
    --country_name "Germany" \
    --multitask

# Pure molecular prediction (no geographical factors)
python main.py --workflow inference \
    --model_path trained_models/GNN_M_multi_best.pth \
    --smiles "CCO" \
    --dataset_type "GNN_M" \
    --multitask
```

```bash
# Process multiple molecules from an Excel file
python main.py --workflow batch \
    --model_path trained_models/GNN_C_multi_best.pth \
    --data_path test_molecules.xlsx
```

Required Excel format for batch processing:
| SMILES | country_name |
|---|---|
| CCO | Germany |
| CC(=O)O | Japan |
| c1ccccc1 | China |
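A spreadsheet in this two-column layout can be generated with pandas (already a project dependency, as is the openpyxl backend it writes with); this is a small sketch, and the file name `test_molecules.xlsx` simply matches the batch command above:

```python
import pandas as pd

# Batch input in the two-column layout expected by `--workflow batch`
molecules = pd.DataFrame(
    {
        "SMILES": ["CCO", "CC(=O)O", "c1ccccc1"],
        "country_name": ["Germany", "Japan", "China"],
    }
)

# Written via the openpyxl backend (also a project dependency)
molecules.to_excel("test_molecules.xlsx", index=False)
```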
```python
from src.engines.predict_engines import predict_single_molecule

# GNN_C model (requires country_name)
results = predict_single_molecule(
    model_path="trained_models/GNN_C_single/GNN_C_Gwi_final_lr_5.00e-05.pth",
    smiles="CCO",
    country_name="Germany",  # Required for GNN_C
    dataset_type="GNN_C",
    multitask=True,
)

# GNN_E model (uses country_name for automatic energy mix lookup)
results = predict_single_molecule(
    model_path="trained_models/GNN_E_single/GNN_E_Gwi_final_lr_5.00e-04.pth",
    smiles="CCO",
    country_name="Germany",  # Automatically retrieves the energy mix
    dataset_type="GNN_E",
    multitask=True,
)

# GNN_M model (no additional parameters needed)
results = predict_single_molecule(
    model_path="trained_models/GNN_M_model.pth",
    smiles="CCO",
    dataset_type="GNN_M",
    multitask=True,
)

# Results format:
# {
#     "smiles": "CCO",
#     "country_name": "Germany",
#     "predictions": {
#         "Acid": 1.23e-4,
#         "Gwi": 5.67e-3,
#         "CTUe": 8.90e-6,
#         # ... 12 more categories
#     },
#     "denormalized": True
# }
```

Note: The following functions require access to proprietary training data and are intended for research/development use:
```bash
# Set up configuration templates
python main.py --workflow config

# Data preparation (requires proprietary dataset)
python main.py --workflow data --data_path training_data.xlsx

# Model training (requires proprietary dataset)
python main.py --workflow train

# Final model training (requires proprietary dataset)
python main.py --workflow final_train
```

Docker support is available for easy deployment and development. All Docker files are in the docker/ folder.
Quick Start:
```bash
# Build and run (from project root)
docker build -f docker/Dockerfile -t lca-gnn:latest .
docker run --rm \
    -v "$(pwd)/trained_models:/app/trained_models:ro" \
    -v "$(pwd)/data:/app/data:ro" \
    lca-gnn:latest \
    python main.py --workflow inference \
        --model_path trained_models/GNN_C_Gwi.pth \
        --smiles "CCO" \
        --target_task "Gwi" \
        --country_name "Germany"

# Using helper scripts
docker/docker-helper.sh build   # Linux/Mac
docker\docker-helper.bat build  # Windows
```

For comprehensive Docker documentation, deployment strategies, and troubleshooting, see docker/README.md.
This project maintains high code quality with automated CI/CD pipeline. For detailed development guidelines, see the Contributing section.
Quick development setup:
```bash
# Install dependencies including development tools
pip install -r requirements.txt

# Run all quality checks
python -m ruff check src/ tests/    # Linting
python -m ruff format src/ tests/   # Formatting
python -m pyright src/              # Type checking
python -m pytest tests/ -v          # Testing
```

CI Pipeline: Automated checks for linting, formatting, type checking, and testing on all pull requests.
| Model Type | Description | Use Case | Single Molecule Inference |
|---|---|---|---|
| QSPR | Traditional ML on molecular descriptors | Baseline molecular property prediction | ❌ Not supported* |
| GNN_M | Graph neural networks on molecular structure | Advanced molecular property prediction | ✅ Supported |
| GNN_C | GNNs with country-specific features | Regional environmental impact modeling | ✅ Supported |
| GNN_E | GNNs with energy system features | Energy-focused environmental assessment | ✅ Supported |
*QSPR models require pre-computed molecular descriptors and cannot be used for single molecule inference from SMILES strings. Use batch prediction with pre-computed descriptor data instead.
Environmental Impact Categories (target_task):
- Acid: Acidification potential
- Gwi: Global warming impact
- CTUe: Ecotoxicity potential
- ADP_f: Abiotic depletion potential (fossil fuels)
- Eutro_f: Eutrophication potential (freshwater)
- Eutro_m: Eutrophication potential (marine)
- Eutro_t: Eutrophication potential (terrestrial)
- CTUh: Human toxicity potential
- Ionising: Ionising radiation potential
- Soil: Land use potential
- ADP_e: Abiotic depletion potential (elements)
- ODP: Ozone depletion potential
- human_health: Particulate matter formation potential
- Photo: Photochemical ozone formation potential
- Water_use: Water use potential
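For labelling model output, the category codes can be kept in a small lookup table; the mapping below is transcribed directly from the list above, and the `label_predictions` helper is purely illustrative (it is not part of the repository's API):

```python
# target_task codes and their meanings, transcribed from the category list above
IMPACT_CATEGORIES = {
    "Acid": "Acidification potential",
    "Gwi": "Global warming impact",
    "CTUe": "Ecotoxicity potential",
    "ADP_f": "Abiotic depletion potential (fossil fuels)",
    "Eutro_f": "Eutrophication potential (freshwater)",
    "Eutro_m": "Eutrophication potential (marine)",
    "Eutro_t": "Eutrophication potential (terrestrial)",
    "CTUh": "Human toxicity potential",
    "Ionising": "Ionising radiation potential",
    "Soil": "Land use potential",
    "ADP_e": "Abiotic depletion potential (elements)",
    "ODP": "Ozone depletion potential",
    "human_health": "Particulate matter formation potential",
    "Photo": "Photochemical ozone formation potential",
    "Water_use": "Water use potential",
}


def label_predictions(predictions: dict) -> dict:
    """Replace category codes with human-readable names (illustrative helper)."""
    return {IMPACT_CATEGORIES.get(code, code): value for code, value in predictions.items()}
```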
Countries: Available countries for GNN_C and GNN_E models are listed in data/raw/energy_mapping.json. Examples include "Germany", "United States", "Japan", "China", and 85+ others.
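The available countries can be read straight from that mapping file; the sketch below assumes country names are the top-level keys of energy_mapping.json (check the actual file layout before relying on this):

```python
import json
from pathlib import Path


def list_countries(mapping_path: str = "data/raw/energy_mapping.json") -> list:
    """List countries available for GNN_C/GNN_E models.

    Assumes country names are the top-level keys of the mapping file;
    verify against data/raw/energy_mapping.json in the repository.
    """
    with Path(mapping_path).open(encoding="utf-8") as f:
        return sorted(json.load(f))
```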
- Single-Task: Separate models for each impact category
- Multi-Task: One model predicting all 15 categories simultaneously
```
LCA_GNNs/
├── .github/
│   └── workflows/           # CI/CD pipeline configuration
├── docker/                  # Docker deployment files
│   ├── Dockerfile           # Production Docker image
│   ├── Dockerfile.dev       # Development Docker image
│   ├── docker-compose.yml   # Docker Compose configuration
│   ├── docker-helper.sh     # Linux/Mac helper scripts
│   ├── docker-helper.bat    # Windows helper scripts
│   ├── .env.example         # Environment configuration template
│   └── README.md            # Comprehensive Docker documentation
├── src/
│   ├── config/              # Configuration management
│   ├── engines/             # Training, evaluation, and prediction engines
│   ├── models/              # GNN and QSPR model architectures
│   ├── data_processing/     # Dataset creation and preprocessing
│   ├── trainer/             # Core training loops
│   └── scripts.py           # Main pipeline interface
├── tests/                   # Test suite
│   ├── test_imports.py      # Import and structure validation
│   ├── test_inference.py    # Inference functionality tests
│   ├── test_fastapi.py      # API tests (when available)
│   └── conftest.py          # Pytest configuration
├── configs/                 # Configuration templates
├── examples/                # Usage examples and tests
├── data/                    # Data directory
├── trained_models/          # Model checkpoints
├── .dockerignore            # Docker build exclusions
├── pyproject.toml           # Python project configuration
├── main.py                  # CLI workflow interface
└── requirements.txt         # Project dependencies
```
The project uses pyproject.toml for Python tooling configuration:
- Pyright: Type checking configuration optimized for ML projects
- Build system: Standard Python packaging configuration
YAML-based configuration system with categories:
```yaml
# Example configuration
optimizer:
  learning_rate: 0.001
  weight_decay: 0.0001

training:
  epochs: 500
  batch_size: 20
  k_fold: 10
  task_mode: "single"  # or "multi"

data:
  dataset_type: "GNN_C"
  path: "./data"

model:
  model_type: "GNN_C_single"
  hidden_dim: 128
  num_layers: 3

experiment:
  enable_wandb: true
  project_name: "LCA_Environmental_Impact"
```

- Python: 3.9+
- Dependencies: PyTorch 2.0+, PyTorch Geometric, RDKit, scikit-learn
- Optional: CUDA-compatible GPU for faster training
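The version constraints above can be sanity-checked at runtime; this is a small sketch (the torch import is guarded, since CPU-only and CUDA builds both satisfy the requirement and torch may not be installed yet):

```python
import importlib.util
import sys

# Python 3.9+ is required by the project
if sys.version_info < (3, 9):
    raise RuntimeError("LCA_GNNs requires Python 3.9 or newer")

# Report the PyTorch build and CUDA availability if torch is installed
if importlib.util.find_spec("torch") is not None:
    import torch

    print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
else:
    print("PyTorch not found; install dependencies with `pip install -r requirements.txt`")
```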
Core dependencies:
```
torch==2.7.0
torch-geometric
rdkit
scikit-learn
pandas
wandb
numpy
pyyaml
openpyxl
tqdm
requests
```
Development dependencies:
```
pytest>=8.0.0
ruff>=0.1.0
pyright>=1.1.0
fastapi>=0.100.0
uvicorn>=0.20.0
```
For development setup:
```bash
# Install all dependencies including development tools
pip install -r requirements.txt

# Or install development dependencies separately
pip install pytest ruff pyright fastapi uvicorn
```

We welcome contributions! Please follow these guidelines:
- Fork the repository and create a feature branch
- Install development dependencies: `pip install -r requirements.txt`
- Make your changes following the code style guidelines
- Run quality checks:
  - `python -m ruff check src/ tests/` (linting)
  - `python -m ruff format src/ tests/` (formatting)
  - `python -m pyright src/` (type checking)
  - `python -m pytest tests/ -v` (testing)
- Submit a pull request with a clear description
- Formatting: Automatic formatting with Ruff
- Linting: Code quality checks with Ruff
- Type Hints: Encouraged but not strictly enforced (ML-friendly configuration)
- Testing: Add tests for new functionality in the tests/ directory
- Core functionality: Place in the src/ directory
- Tests: Place in the tests/ directory with descriptive names
- Examples: Place in the examples/ directory (excluded from CI)
- Documentation: Update README.md and docstrings
The CI pipeline will automatically run all quality checks on pull requests.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for details.
The AGPL-3.0 license allows commercial use while ensuring that any modifications or derivative works are also made available under the same license terms. As co-developers, contributors have co-ownership and user rights that are not limited by the publication license.
- Qinghe Gao
- Lukas Schulze Balhorn
- Alessandro Laera
- Raoul Meys
- Jonas Goßen
- Jana M. Weber
- Gregor Wernet
- Artur M. Schweidtmann







