Refactor dataset normalization and update transforms #11
Merged
JeremieGince merged 3 commits into dev on Nov 3, 2025
Updated dataset classes for CIFAR10, MNIST, PathMNIST, and RetinaMNIST to use torchvision.transforms.v2 and dataset-specific normalization values. Removed redundant to_long_tensor methods and related tests. Added a notebook for dataset normalization statistics. Updated dev dependencies to include pip>=25.3.
Deleted the test_to_long_tensor test cases from both PathMNIST and RetinaMNIST dataset test files as they are no longer needed or relevant.
Refined the mean and std values used in v2.Normalize for CIFAR10, MNIST, PathMNIST, and RetinaMNIST datasets to higher precision, based on updated calculations. Also updated the datasets_normalisation.ipynb notebook to reflect these new statistics and added MNIST normalization analysis.
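For readers who want to see how per-channel statistics of this kind can be derived, the sketch below computes them with NumPy on synthetic data. The helper name channel_stats and the (N, H, W, C) uint8 layout are assumptions for illustration; the notebook's actual method is not shown in this PR page and may differ.

```python
import numpy as np


def channel_stats(images: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-channel mean and std of uint8 images shaped (N, H, W, C).

    Pixels are scaled to [0, 1] first, matching what a normalization
    step applied after float scaling would expect.
    """
    scaled = images.astype(np.float64) / 255.0
    # Reduce over batch, height, and width; keep the channel axis.
    mean = scaled.mean(axis=(0, 1, 2))
    std = scaled.std(axis=(0, 1, 2))
    return mean, std


# Synthetic stand-in for a real dataset (hypothetical data).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(16, 8, 8, 3), dtype=np.uint8)

mean, std = channel_stats(fake_images)

# Applying the statistics should give roughly zero mean and unit std per channel.
normalized = (fake_images / 255.0 - mean) / std
```

Computing the statistics on the training split only is the usual choice, so that the evaluation data does not influence the preprocessing.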
Description
This pull request standardizes and improves dataset normalization and transformation across several dataset classes, ensuring consistency and correctness in data preprocessing. The changes include updating normalization statistics based on empirical calculations, switching to the newer torchvision.transforms.v2 API, and cleaning up redundant code. Additionally, a new notebook is added to document normalization statistics computation for key datasets.

Dataset normalization and transformation improvements:
- Updated the dataset classes (Cifar10Dataset, MNISTDataset, PathMNISTDataset, RetinaMNISTDataset) to use the torchvision.transforms.v2 API for image transformations, replacing older transforms usage for improved clarity and maintainability.
- Updated normalization parameters to dataset-specific values (e.g. (0.328, 0.328, 0.328) and (0.278, 0.269, 0.268) for CIFAR10).
- Switched to v2.ToDtype(torch.long) and removed custom to_long_tensor static methods, simplifying code and reducing redundancy.

Documentation and reproducibility:
- Added notebooks/datasets_normalisation.ipynb, which computes and documents normalization statistics for PathMNIST, RetinaMNIST, and CIFAR10, providing transparency and reproducibility for normalization choices.

Testing and dependency updates:
- Removed tests for the to_long_tensor methods in both CIFAR10 and MNIST dataset test files, reflecting the updated codebase.
- Updated pyproject.toml to require a newer version of pip.

Checklist
Please complete the following checklist when submitting a PR. The PR will not be reviewed until all items are checked.
- Make sure that the tests passed and the coverage is sufficient by running pytest tests --cov=src --cov-report=term-missing.
- Format the code. You can do this by running black src tests.
- Sort the imports. You can do this by running isort src tests.
- Type-check the code. You can do this by running mypy src tests.