
Dev #12

Merged

JeremieGince merged 35 commits into main from dev on Nov 3, 2025

Conversation

@JeremieGince (Contributor) commented Nov 3, 2025

Description

This pull request introduces several improvements and refactors to the data module and dataset handling in the codebase, along with dependency and documentation updates. The most significant changes are the creation of a new datamodules package (separating it from datasets), the addition of a new MaxcutDataModule, and updates to dependencies to support newer versions and additional packages. There are also several updates to example notebooks to use the new import paths and minor documentation enhancements.

Data module and dataset refactor:

  • Moved the DataModule class from src/matchcake_opt/datasets/datamodule.py to src/matchcake_opt/datamodules/datamodule.py, updated import paths accordingly, and exposed DataModule via src/matchcake_opt/datamodules/__init__.py and src/matchcake_opt/__init__.py. This improves code organization and modularity. [1] [2] [3]
  • Refactored the DataModule class to store the original training dataset, added a prepare_data method, and made train_dataset and val_dataset properties return Optional types. [1] [2]
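The refactor described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern (keep the original training dataset, defer the split to a `prepare_data` hook, expose `Optional`-typed properties); the attribute names and split logic are assumptions, not the actual implementation.

```python
from typing import List, Optional

class DataModule:
    """Toy sketch of a data module that defers its train/val split."""

    def __init__(self, dataset: List[int], val_fraction: float = 0.2) -> None:
        self.original_train_dataset = dataset  # kept for later re-splits
        self.val_fraction = val_fraction
        self._train_dataset: Optional[List[int]] = None
        self._val_dataset: Optional[List[int]] = None

    def prepare_data(self) -> None:
        # The split happens here, not in __init__.
        n_val = int(len(self.original_train_dataset) * self.val_fraction)
        self._val_dataset = self.original_train_dataset[:n_val]
        self._train_dataset = self.original_train_dataset[n_val:]

    @property
    def train_dataset(self) -> Optional[List[int]]:
        # None until prepare_data() has been called.
        return self._train_dataset

    @property
    def val_dataset(self) -> Optional[List[int]]:
        return self._val_dataset
```

Returning `Optional` from the properties makes the "not yet prepared" state explicit to callers and to Mypy.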

New features:

  • Added a new MaxcutDataModule in src/matchcake_opt/datamodules/maxcut_datamodule.py to handle Maxcut datasets with custom data loading logic using torch_geometric.
  • Updated the main package and pyproject.toml to include torch-geometric and its submodules as dependencies and in type checking. [1] [2]
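For context on what a Maxcut data module works with: the Max-Cut objective counts edges crossing a node partition. The sketch below is purely illustrative (plain edge lists instead of torch_geometric graphs, and the function name is an assumption, not part of MaxcutDataModule's API).

```python
from typing import List, Sequence, Tuple

def cut_value(edges: List[Tuple[int, int]], bits: Sequence[int]) -> int:
    """Count edges whose endpoints fall on opposite sides of the partition."""
    return sum(1 for u, v in edges if bits[u] != bits[v])

# A 4-node cycle, similar in spirit to the 'circular' graph type
# mentioned in the commit log below.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

On a cycle, the alternating partition cuts every edge, while a constant partition cuts none.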

Dependency and environment updates:

  • Relaxed version constraints for torchvision and matchcake to allow newer versions, added support for CUDA 13.0 (cu130) in dependency resolution, and included extra dev dependencies. [1] [2] [3] [4] [5]
  • Updated the Sphinx documentation workflow to use the new package structure.

Notebook and documentation updates:

  • Updated all example notebooks to import DataModule from the new location and adjusted max_time parameters for training. [1] [2] [3] [4] [5] [6] [7]
  • Added a new notebook notebooks/datasets_normalisation.ipynb demonstrating normalization statistics for several datasets.
  • Added a license link to the README.md.

Checklist

Please complete the following checklist when submitting a PR. The PR will not be reviewed until all items are checked.

  • All new features include a unit test.
    Make sure that the tests pass and the coverage is
    sufficient by running pytest tests --cov=src --cov-report=term-missing.
  • All new functions and code are clearly documented.
  • The code is formatted using Black.
    You can do this by running black src tests.
  • The imports are sorted using isort.
    You can do this by running isort src tests.
  • The code is type-checked using Mypy.
    You can do this by running mypy src tests.

github-actions bot and others added 30 commits October 16, 2025 00:23
Changed the Sphinx apidoc source directory from './src/bolightningpipeline' to './src/matchcake_opt' in the docs GitHub Actions workflow to reflect the new documentation source location.
Added a direct link to the Apache License 2.0 in the License section of the README for improved clarity and accessibility.
Introduces MaxcutDataset and MaxcutModel classes for Max-Cut graph optimization tasks using torch-geometric. Refactors datamodule structure, adds MaxcutDataModule, updates imports, and adds torch-geometric as a dependency. Also removes unnecessary softmax from BaseModel.predict.
Updated import statements to reference DataModule from the correct 'datamodules' package instead of 'datasets' in the notebook and pipeline modules. This resolves import errors after directory restructuring.
Replaced wildcard import from matchcake_opt.datasets with explicit import of DataModule from matchcake_opt.datamodules.datamodule in automl_pipeline_tutorial.ipynb and nif_deep_learning.ipynb for improved clarity and maintainability.
Replaces returning None in val_dataloader with raising MisconfigurationException, providing clearer error handling when the validation data loader is not configured.
Added training_step, validation_step, and test_step methods to MaxcutModel for handling model training and evaluation. Also updated val_dataloader in MaxcutDataModule to return an empty list instead of raising an exception.
Changed MaxcutModel.predict to return a tensor instead of a dict, simplifying its output. Updated LightningPipeline.run_validation to handle empty metrics and ensure validation time is added to the correct metrics dictionary.
The test_step method now returns a dictionary of computed metrics and energy instead of just the loss value. This change enables more detailed evaluation outputs during testing.
Introduces a static method to convert bitstring samples to a numpy array of integers, supporting string and 1D array inputs for improved flexibility in data handling.
Updated metric update methods to include inputs and outputs, removed unused static methods and sample-based metrics computation, and simplified test step to return loss only. This streamlines the MaxcutModel class and aligns metric updates with expected input signatures.
Introduces a prepare_data() method to BaseDataset and updates MaxcutDataset to use it for graph construction and label assignment. DataModule now calls prepare_data() on datasets and defers train/val split until preparation, improving modularity and consistency in dataset handling.
Introduces the 'circular' graph type to MaxcutDataset, updates type annotations, and implements the _build_circular_graph method using networkx.circulant_graph. This allows users to generate circular graphs for Max-Cut problem datasets.
The run_test method now accepts an optional ckpt_path argument, allowing callers to specify which checkpoint to use during testing. The default remains 'best' for backward compatibility.
Replaces the empty validation dataloader with a DataLoader instance in MaxcutDataModule. Adds type annotations for the batch parameter in validation_step and test_step methods of MaxcutModel for improved type safety and clarity.
Wrapped the validation call in a try-except block to attempt validation with the 'last' checkpoint if the 'best' checkpoint is not found, improving robustness when the 'best' checkpoint is missing.
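The fallback pattern from this commit can be sketched as a small wrapper. The `run_validation` callable and the use of `FileNotFoundError` are illustrative assumptions; the actual pipeline code and exception type may differ.

```python
def validate_with_fallback(run_validation):
    """Try validating with the 'best' checkpoint, fall back to 'last'."""
    try:
        return run_validation(ckpt_path="best")
    except FileNotFoundError:
        # The 'best' checkpoint may be missing (e.g. no improvement was
        # ever recorded); fall back to the most recent checkpoint.
        return run_validation(ckpt_path="last")
```

Catching only a narrow exception keeps genuine validation failures visible instead of silently retrying them.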
Catches the SearchSpaceExhausted exception when requesting new trials in the AutoMLPipeline, allowing the run loop to exit gracefully if the search space is exhausted.
When automl_overwrite_fit is True, the checkpoint folder is now removed before proceeding. This ensures a clean state for new AutoML runs and prevents issues from leftover checkpoints.
Expanded pyproject.toml and dependency resolution to support CUDA 13.0 (cu130) builds. This includes relaxing torchvision version constraints, adding cu130-specific dependency groups, and registering the new PyTorch cu130 index.
Moved DataModule imports from datasets to datamodules in test files for consistency. Added type hints and minor refactoring in datamodule and maxcut_datamodule. Added a new test suite for MaxcutDataset. Updated pyproject.toml to include types-networkx and torch_geometric modules for type checking.
Moved datamodule tests to a new test_datamodules directory and added test stubs for maxcut datamodule. Enhanced MaxcutDataset tests with parameterized graph types and parameters, improved test coverage, and added new tests for graph parameter validation and output shape. Minor code changes in maxcut_dataset.py to mark some error branches as uncovered for coverage tools. Updated .gitignore to exclude .tmp directory.
Introduces unit tests for MaxcutDataModule and MaxcutModel, covering their main methods and behaviors. Also adds pragma: no cover to NotImplementedError branches in both classes to improve test coverage reporting.
Introduces the RetinaMNISTDataset class for handling the RetinaMNIST dataset, including data loading, transformation, and output shape methods. Adds comprehensive unit tests to verify dataset initialization, item retrieval, tensor conversion, length, and output shape.
Added RetinaMNISTDataset to the datasets module import. Updated the RetinaMNIST dataset test to use the correct dataset name, mock class, and label shape, ensuring consistency with the actual dataset implementation.
Updated the 'max_time' parameter in automl_pipeline_tutorial.ipynb, ligthning_pipeline_tutorial.ipynb, and nif_deep_learning.ipynb to shorten training duration for quicker runs and testing.
JeremieGince and others added 5 commits November 2, 2025 17:33
Updated the matchcake dependency in pyproject.toml to remove the upper version limit, allowing versions >=0.0.4. This change increases compatibility with future matchcake releases.
Updated dataset classes for CIFAR10, MNIST, PathMNIST, and RetinaMNIST to use torchvision.transforms.v2 and dataset-specific normalization values. Removed redundant to_long_tensor methods and related tests. Added a notebook for dataset normalization statistics. Updated dev dependencies to include pip>=25.3.
Deleted the test_to_long_tensor test cases from both PathMNIST and RetinaMNIST dataset test files as they are no longer needed or relevant.
Refined the mean and std values used in v2.Normalize for CIFAR10, MNIST, PathMNIST, and RetinaMNIST datasets to higher precision, based on updated calculations. Also updated the datasets_normalisation.ipynb notebook to reflect these new statistics and added MNIST normalization analysis.
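The per-channel statistics fed to v2.Normalize can be computed along these lines. This is a toy, stdlib-only sketch assuming images as nested lists of pixel values in [0, 1]; the notebook's actual computation (and the resulting CIFAR10/MNIST values) is not reproduced here.

```python
from typing import List, Tuple

def channel_mean_std(images: List[List[List[float]]]) -> Tuple[List[float], List[float]]:
    """Compute per-channel mean and std over a dataset.

    images: list of images, each a [channels][pixels] nested list.
    """
    n_channels = len(images[0])
    means, stds = [], []
    for c in range(n_channels):
        # Flatten channel c across the whole dataset.
        pixels = [p for img in images for p in img[c]]
        mean = sum(pixels) / len(pixels)
        var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
        means.append(mean)
        stds.append(var ** 0.5)
    return means, stds
```

The returned `means` and `stds` lists are exactly the shape v2.Normalize expects (one value per channel).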
Refactor dataset normalization and update transforms
@JeremieGince JeremieGince linked an issue Nov 3, 2025 that may be closed by this pull request
@JeremieGince added the bug (Something isn't working) and enhancement (New feature or request) labels on Nov 3, 2025

github-actions bot commented Nov 3, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 882   | 859     | 97%      | 90%       | 🟢     |

New Files

| File                                                | Coverage | Status |
|-----------------------------------------------------|----------|--------|
| src/matchcake_opt/datamodules/__init__.py           | 100%     | 🟢     |
| src/matchcake_opt/datamodules/maxcut_datamodule.py  | 93%      | 🟢     |
| src/matchcake_opt/datasets/maxcut_dataset.py        | 99%      | 🟢     |
| src/matchcake_opt/datasets/retinamnist_dataset.py   | 100%     | 🟢     |
| src/matchcake_opt/modules/maxcut_model.py           | 100%     | 🟢     |
| TOTAL                                               | 98%      | 🟢     |

Modified Files

| File                                                 | Coverage | Status |
|------------------------------------------------------|----------|--------|
| src/matchcake_opt/__init__.py                        | 100%     | 🟢     |
| src/matchcake_opt/datasets/__init__.py               | 100%     | 🟢     |
| src/matchcake_opt/datasets/base_dataset.py           | 100%     | 🟢     |
| src/matchcake_opt/datasets/cifar10_dataset.py        | 100%     | 🟢     |
| src/matchcake_opt/datasets/mnist_dataset.py          | 100%     | 🟢     |
| src/matchcake_opt/datasets/pathmnist_dataset.py      | 100%     | 🟢     |
| src/matchcake_opt/modules/base_model.py              | 100%     | 🟢     |
| src/matchcake_opt/tr_pipeline/automl_pipeline.py     | 92%      | 🟢     |
| src/matchcake_opt/tr_pipeline/lightning_pipeline.py  | 97%      | 🟢     |
| TOTAL                                                | 99%      | 🟢     |

updated for commit: 16b6308 by action🐍

@JeremieGince JeremieGince merged commit 5108418 into main Nov 3, 2025
6 checks passed

Labels

bug (Something isn't working), enhancement (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add datasets

1 participant