
Dev #12

Merged

JeremieGince merged 35 commits into main from dev on Nov 3, 2025

Conversation

@JeremieGince (Contributor) commented Nov 3, 2025

Description

This pull request introduces several improvements and refactors to the data module and dataset handling in the codebase, along with dependency and documentation updates. The most significant changes are the creation of a new datamodules package (separating it from datasets), the addition of a new MaxcutDataModule, and updates to dependencies to support newer versions and additional packages. There are also several updates to example notebooks to use the new import paths and minor documentation enhancements.

Data module and dataset refactor:

  • Moved the DataModule class from src/matchcake_opt/datasets/datamodule.py to src/matchcake_opt/datamodules/datamodule.py, updated import paths accordingly, and exposed DataModule via src/matchcake_opt/datamodules/__init__.py and src/matchcake_opt/__init__.py. This improves code organization and modularity. [1] [2] [3]
  • Refactored the DataModule class to store the original training dataset, added a prepare_data method, and made train_dataset and val_dataset properties return Optional types. [1] [2]
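The refactor described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern (keep the original training dataset, defer the split to a `prepare_data` hook, expose `Optional`-typed properties); the attribute names and split logic are assumptions, not the actual implementation.

```python
from typing import List, Optional

class DataModule:
    """Toy sketch of a data module that defers its train/val split."""

    def __init__(self, dataset: List[int], val_fraction: float = 0.2) -> None:
        self.original_train_dataset = dataset  # kept for later re-splits
        self.val_fraction = val_fraction
        self._train_dataset: Optional[List[int]] = None
        self._val_dataset: Optional[List[int]] = None

    def prepare_data(self) -> None:
        # The split happens here, not in __init__.
        n_val = int(len(self.original_train_dataset) * self.val_fraction)
        self._val_dataset = self.original_train_dataset[:n_val]
        self._train_dataset = self.original_train_dataset[n_val:]

    @property
    def train_dataset(self) -> Optional[List[int]]:
        # None until prepare_data() has been called.
        return self._train_dataset

    @property
    def val_dataset(self) -> Optional[List[int]]:
        return self._val_dataset
```

Returning `Optional` from the properties makes the "not yet prepared" state explicit to callers and to Mypy.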

New features:

  • Added a new MaxcutDataModule in src/matchcake_opt/datamodules/maxcut_datamodule.py to handle Maxcut datasets with custom data loading logic using torch_geometric.
  • Updated the main package and pyproject.toml to include torch-geometric and its submodules as dependencies and in type checking. [1] [2]
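For context on what a Maxcut data module works with: the Max-Cut objective counts edges crossing a node partition. The sketch below is purely illustrative (plain edge lists instead of torch_geometric graphs, and the function name is an assumption, not part of MaxcutDataModule's API).

```python
from typing import List, Sequence, Tuple

def cut_value(edges: List[Tuple[int, int]], bits: Sequence[int]) -> int:
    """Count edges whose endpoints fall on opposite sides of the partition."""
    return sum(1 for u, v in edges if bits[u] != bits[v])

# A 4-node cycle, similar in spirit to the 'circular' graph type
# mentioned in the commit log below.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

On a cycle, the alternating partition cuts every edge, while a constant partition cuts none.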

Dependency and environment updates:

  • Relaxed version constraints for torchvision and matchcake to allow newer versions, added support for CUDA 13.0 (cu130) in dependency resolution, and included extra dev dependencies. [1] [2] [3] [4] [5]
  • Updated the Sphinx documentation workflow to use the new package structure.

Notebook and documentation updates:

  • Updated all example notebooks to import DataModule from the new location and adjusted max_time parameters for training. [1] [2] [3] [4] [5] [6] [7]
  • Added a new notebook notebooks/datasets_normalisation.ipynb demonstrating normalization statistics for several datasets.
  • Added a license link to the README.md.

Checklist

Please complete the following checklist when submitting a PR. The PR will not be reviewed until all items are checked.

  • All new features include a unit test.
    Make sure that the tests pass and the coverage is
    sufficient by running pytest tests --cov=src --cov-report=term-missing.
  • All new functions and code are clearly documented.
  • The code is formatted using Black.
    You can do this by running black src tests.
  • The imports are sorted using isort.
    You can do this by running isort src tests.
  • The code is type-checked using Mypy.
    You can do this by running mypy src tests.

github-actions bot and others added 30 commits October 16, 2025 00:23
Changed the Sphinx apidoc source directory from './src/bolightningpipeline' to './src/matchcake_opt' in the docs GitHub Actions workflow to reflect the new documentation source location.
Added a direct link to the Apache License 2.0 in the License section of the README for improved clarity and accessibility.
Introduces MaxcutDataset and MaxcutModel classes for Max-Cut graph optimization tasks using torch-geometric. Refactors datamodule structure, adds MaxcutDataModule, updates imports, and adds torch-geometric as a dependency. Also removes unnecessary softmax from BaseModel.predict.
Updated import statements to reference DataModule from the correct 'datamodules' package instead of 'datasets' in the notebook and pipeline modules. This resolves import errors after directory restructuring.
Replaced wildcard import from matchcake_opt.datasets with explicit import of DataModule from matchcake_opt.datamodules.datamodule in automl_pipeline_tutorial.ipynb and nif_deep_learning.ipynb for improved clarity and maintainability.
Replaces returning None in val_dataloader with raising MisconfigurationException, providing clearer error handling when the validation data loader is not configured.
Added training_step, validation_step, and test_step methods to MaxcutModel for handling model training and evaluation. Also updated val_dataloader in MaxcutDataModule to return an empty list instead of raising an exception.
Changed MaxcutModel.predict to return a tensor instead of a dict, simplifying its output. Updated LightningPipeline.run_validation to handle empty metrics and ensure validation time is added to the correct metrics dictionary.
The test_step method now returns a dictionary of computed metrics and energy instead of just the loss value. This change enables more detailed evaluation outputs during testing.
Introduces a static method to convert bitstring samples to a numpy array of integers, supporting string and 1D array inputs for improved flexibility in data handling.
Updated metric update methods to include inputs and outputs, removed unused static methods and sample-based metrics computation, and simplified test step to return loss only. This streamlines the MaxcutModel class and aligns metric updates with expected input signatures.
Introduces a prepare_data() method to BaseDataset and updates MaxcutDataset to use it for graph construction and label assignment. DataModule now calls prepare_data() on datasets and defers train/val split until preparation, improving modularity and consistency in dataset handling.
Introduces the 'circular' graph type to MaxcutDataset, updates type annotations, and implements the _build_circular_graph method using networkx.circulant_graph. This allows users to generate circular graphs for Max-Cut problem datasets.
The run_test method now accepts an optional ckpt_path argument, allowing callers to specify which checkpoint to use during testing. The default remains 'best' for backward compatibility.
Replaces the empty validation dataloader with a DataLoader instance in MaxcutDataModule. Adds type annotations for the batch parameter in validation_step and test_step methods of MaxcutModel for improved type safety and clarity.
Wrapped the validation call in a try-except block to attempt validation with the 'last' checkpoint if the 'best' checkpoint is not found, improving robustness when the 'best' checkpoint is missing.
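The fallback pattern from this commit can be sketched as a small wrapper. The `run_validation` callable and the use of `FileNotFoundError` are illustrative assumptions; the actual pipeline code and exception type may differ.

```python
def validate_with_fallback(run_validation):
    """Try validating with the 'best' checkpoint, fall back to 'last'."""
    try:
        return run_validation(ckpt_path="best")
    except FileNotFoundError:
        # The 'best' checkpoint may be missing (e.g. no improvement was
        # ever recorded); fall back to the most recent checkpoint.
        return run_validation(ckpt_path="last")
```

Catching only a narrow exception keeps genuine validation failures visible instead of silently retrying them.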
Catches the SearchSpaceExhausted exception when requesting new trials in the AutoMLPipeline, allowing the run loop to exit gracefully if the search space is exhausted.
When automl_overwrite_fit is True, the checkpoint folder is now removed before proceeding. This ensures a clean state for new AutoML runs and prevents issues from leftover checkpoints.
Expanded pyproject.toml and dependency resolution to support CUDA 13.0 (cu130) builds. This includes relaxing torchvision version constraints, adding cu130-specific dependency groups, and registering the new PyTorch cu130 index.
Moved DataModule imports from datasets to datamodules in test files for consistency. Added type hints and minor refactoring in datamodule and maxcut_datamodule. Added a new test suite for MaxcutDataset. Updated pyproject.toml to include types-networkx and torch_geometric modules for type checking.
Moved datamodule tests to a new test_datamodules directory and added test stubs for maxcut datamodule. Enhanced MaxcutDataset tests with parameterized graph types and parameters, improved test coverage, and added new tests for graph parameter validation and output shape. Minor code changes in maxcut_dataset.py to mark some error branches as uncovered for coverage tools. Updated .gitignore to exclude .tmp directory.
Introduces unit tests for MaxcutDataModule and MaxcutModel, covering their main methods and behaviors. Also adds pragma: no cover to NotImplementedError branches in both classes to improve test coverage reporting.
Introduces the RetinaMNISTDataset class for handling the RetinaMNIST dataset, including data loading, transformation, and output shape methods. Adds comprehensive unit tests to verify dataset initialization, item retrieval, tensor conversion, length, and output shape.
Added RetinaMNISTDataset to the datasets module import. Updated the RetinaMNIST dataset test to use the correct dataset name, mock class, and label shape, ensuring consistency with the actual dataset implementation.
Updated the 'max_time' parameter in automl_pipeline_tutorial.ipynb, ligthning_pipeline_tutorial.ipynb, and nif_deep_learning.ipynb to shorten training duration for quicker runs and testing.
JeremieGince and others added 5 commits November 2, 2025 17:33
Updated the matchcake dependency in pyproject.toml to remove the upper version limit, allowing versions >=0.0.4. This change increases compatibility with future matchcake releases.
Updated dataset classes for CIFAR10, MNIST, PathMNIST, and RetinaMNIST to use torchvision.transforms.v2 and dataset-specific normalization values. Removed redundant to_long_tensor methods and related tests. Added a notebook for dataset normalization statistics. Updated dev dependencies to include pip>=25.3.
Deleted the test_to_long_tensor test cases from both PathMNIST and RetinaMNIST dataset test files as they are no longer needed or relevant.
Refined the mean and std values used in v2.Normalize for CIFAR10, MNIST, PathMNIST, and RetinaMNIST datasets to higher precision, based on updated calculations. Also updated the datasets_normalisation.ipynb notebook to reflect these new statistics and added MNIST normalization analysis.
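The per-channel statistics fed to v2.Normalize can be computed along these lines. This is a toy, stdlib-only sketch assuming images as nested lists of pixel values in [0, 1]; the notebook's actual computation (and the resulting CIFAR10/MNIST values) is not reproduced here.

```python
from typing import List, Tuple

def channel_mean_std(images: List[List[List[float]]]) -> Tuple[List[float], List[float]]:
    """Compute per-channel mean and std over a dataset.

    images: list of images, each a [channels][pixels] nested list.
    """
    n_channels = len(images[0])
    means, stds = [], []
    for c in range(n_channels):
        # Flatten channel c across the whole dataset.
        pixels = [p for img in images for p in img[c]]
        mean = sum(pixels) / len(pixels)
        var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
        means.append(mean)
        stds.append(var ** 0.5)
    return means, stds
```

The returned `means` and `stds` lists are exactly the shape v2.Normalize expects (one value per channel).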
Refactor dataset normalization and update transforms
@JeremieGince JeremieGince linked an issue Nov 3, 2025 that may be closed by this pull request
@JeremieGince added the bug (Something isn't working) and enhancement (New feature or request) labels on Nov 3, 2025

github-actions bot commented Nov 3, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 882   | 859     | 97%      | 90%       | 🟢     |

New Files

| File                                                | Coverage | Status |
|-----------------------------------------------------|----------|--------|
| src/matchcake_opt/datamodules/__init__.py           | 100%     | 🟢     |
| src/matchcake_opt/datamodules/maxcut_datamodule.py  | 93%      | 🟢     |
| src/matchcake_opt/datasets/maxcut_dataset.py        | 99%      | 🟢     |
| src/matchcake_opt/datasets/retinamnist_dataset.py   | 100%     | 🟢     |
| src/matchcake_opt/modules/maxcut_model.py           | 100%     | 🟢     |
| TOTAL                                               | 98%      | 🟢     |

Modified Files

| File                                                 | Coverage | Status |
|------------------------------------------------------|----------|--------|
| src/matchcake_opt/__init__.py                        | 100%     | 🟢     |
| src/matchcake_opt/datasets/__init__.py               | 100%     | 🟢     |
| src/matchcake_opt/datasets/base_dataset.py           | 100%     | 🟢     |
| src/matchcake_opt/datasets/cifar10_dataset.py        | 100%     | 🟢     |
| src/matchcake_opt/datasets/mnist_dataset.py          | 100%     | 🟢     |
| src/matchcake_opt/datasets/pathmnist_dataset.py      | 100%     | 🟢     |
| src/matchcake_opt/modules/base_model.py              | 100%     | 🟢     |
| src/matchcake_opt/tr_pipeline/automl_pipeline.py     | 92%      | 🟢     |
| src/matchcake_opt/tr_pipeline/lightning_pipeline.py  | 97%      | 🟢     |
| TOTAL                                                | 99%      | 🟢     |

updated for commit: 16b6308 by action🐍

@JeremieGince JeremieGince merged commit 5108418 into main Nov 3, 2025
6 checks passed

Labels

bug (Something isn't working), enhancement (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add datasets

1 participant