Refactor dataset normalization and update transforms by JeremieGince · Pull Request #11 · MatchCake/MatchCake-Opt

JeremieGince · 2025-11-03T14:30:19Z

Description

This pull request standardizes and improves dataset normalization and transformation across several dataset classes, ensuring consistency and correctness in data preprocessing. The changes include updating normalization statistics based on empirical calculations, switching to the newer torchvision.transforms.v2 API, and cleaning up redundant code. Additionally, a new notebook is added to document normalization statistics computation for key datasets.

Dataset normalization and transformation improvements:

Updated all dataset classes (Cifar10Dataset, MNISTDataset, PathMNISTDataset, RetinaMNISTDataset) to use the torchvision.transforms.v2 API for image transformations, replacing the older torchvision.transforms usage for improved clarity and maintainability. [1] [2] [3] [4]
Set dataset-specific normalization means and standard deviations in the normalization transforms, using values empirically computed for each dataset (e.g., (0.328, 0.328, 0.328) and (0.278, 0.269, 0.268) for CIFAR10). [1] [2] [3] [4]
Updated target transforms to use v2.ToDtype(torch.long) and removed custom to_long_tensor static methods, simplifying code and reducing redundancy. [1] [2] [3] [4]

Documentation and reproducibility:

Added a new notebook notebooks/datasets_normalisation.ipynb that computes and documents normalization statistics for PathMNIST, RetinaMNIST, and CIFAR10, providing transparency and reproducibility for normalization choices.
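The notebook's actual code is not shown here, so the following is only a minimal sketch of how per-channel normalization statistics could be computed. It assumes the whole dataset fits in memory as a uint8 array of shape (N, H, W, C); the function name channel_stats and the toy data are hypothetical.

```python
from __future__ import annotations

import numpy as np


def channel_stats(images: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-channel mean and std of uint8 images shaped (N, H, W, C).

    Values are rescaled to [0, 1] first, so the results are on the scale
    expected by a Normalize step applied after a [0, 1] conversion.
    """
    scaled = images.astype(np.float64) / 255.0
    mean = scaled.mean(axis=(0, 1, 2))  # reduce over batch, height, width
    std = scaled.std(axis=(0, 1, 2))    # population std (ddof=0)
    return mean, std


# Toy data: channel 0 all black, channel 1 all white, channel 2 half and half.
toy = np.zeros((2, 4, 4, 3), dtype=np.uint8)
toy[..., 1] = 255
toy[0, ..., 2] = 255
mean, std = channel_stats(toy)
```

For datasets too large to hold in memory, the same quantities can be accumulated batch by batch from running sums of x and x squared.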

Testing and dependency updates:

Removed tests for the now-removed to_long_tensor methods in both CIFAR10 and MNIST dataset test files, reflecting the updated codebase. [1] [2]
Updated the development dependencies in pyproject.toml to require a newer version of pip.

Checklist

Please complete the following checklist when submitting a PR. The PR will not be reviewed until all items are checked.

All new features include a unit test.
Make sure that the tests pass and the coverage is sufficient by running pytest tests --cov=src --cov-report=term-missing.
All new functions and code are clearly documented.
The code is formatted using Black.
You can do this by running black src tests.
The imports are sorted using isort.
You can do this by running isort src tests.
The code is type-checked using Mypy.
You can do this by running mypy src tests.

Updated dataset classes for CIFAR10, MNIST, PathMNIST, and RetinaMNIST to use torchvision.transforms.v2 and dataset-specific normalization values. Removed redundant to_long_tensor methods and related tests. Added a notebook for dataset normalization statistics. Updated dev dependencies to include pip>=25.3.

Deleted the test_to_long_tensor test cases from both PathMNIST and RetinaMNIST dataset test files as they are no longer needed or relevant.

Refined the mean and std values used in v2.Normalize for CIFAR10, MNIST, PathMNIST, and RetinaMNIST datasets to higher precision, based on updated calculations. Also updated the datasets_normalisation.ipynb notebook to reflect these new statistics and added MNIST normalization analysis.

github-actions · 2025-11-03T14:56:52Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
882	859	97%	90%	🟢

New Files

No new covered files...

Modified Files

File	Coverage	Status
src/matchcake_opt/datasets/cifar10_dataset.py	100%	🟢
src/matchcake_opt/datasets/mnist_dataset.py	100%	🟢
src/matchcake_opt/datasets/pathmnist_dataset.py	100%	🟢
src/matchcake_opt/datasets/retinamnist_dataset.py	100%	🟢
TOTAL	100%	🟢

updated for commit: 635104f by action🐍

JeremieGince added 3 commits November 3, 2025 09:29

Remove redundant to_long_tensor tests from dataset tests

aa59488


JeremieGince merged commit 16b6308 into dev Nov 3, 2025
6 checks passed

JeremieGince deleted the normalization branch November 3, 2025 15:12

JeremieGince merged 3 commits into dev from normalization
