Refactor dataset normalization and update transforms #11

Merged
JeremieGince merged 3 commits into dev from normalization
Nov 3, 2025

@JeremieGince JeremieGince commented Nov 3, 2025

Description

This pull request standardizes and improves dataset normalization and transformation across several dataset classes, ensuring consistency and correctness in data preprocessing. The changes include updating normalization statistics based on empirical calculations, switching to the newer torchvision.transforms.v2 API, and cleaning up redundant code. Additionally, a new notebook is added to document normalization statistics computation for key datasets.

Dataset normalization and transformation improvements:

  • Updated all dataset classes (Cifar10Dataset, MNISTDataset, PathMNISTDataset, RetinaMNISTDataset) to use the torchvision.transforms.v2 API for image transformations, replacing older transforms usage for improved clarity and maintainability. [1] [2] [3] [4]
  • Set dataset-specific normalization means and standard deviations in the normalization transforms, using values empirically computed for each dataset (e.g., (0.328, 0.328, 0.328) and (0.278, 0.269, 0.268) for CIFAR10). [1] [2] [3] [4]
  • Updated target transforms to use v2.ToDtype(torch.long) and removed custom to_long_tensor static methods, simplifying code and reducing redundancy. [1] [2] [3] [4]

Documentation and reproducibility:

  • Added a new notebook notebooks/datasets_normalisation.ipynb that computes and documents normalization statistics for PathMNIST, RetinaMNIST, and CIFAR10, providing transparency and reproducibility for normalization choices.
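
For readers who want to reproduce statistics of this kind, one common approach is to scale the images to [0, 1] and take the per-channel mean and standard deviation over the whole training set. The sketch below assumes that approach; the notebook's actual method is not shown in this PR description, `channel_stats` is a hypothetical helper, and NumPy is used here only to keep the example small.

```python
import numpy as np


def channel_stats(images: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-channel mean and std for uint8 images shaped (N, H, W, C).

    Pixels are scaled to [0, 1] first so the results can be passed
    directly to a normalization transform that expects that range.
    """
    scaled = images.astype(np.float64) / 255.0
    # Reduce over the sample, height and width axes, keeping channels.
    mean = scaled.mean(axis=(0, 1, 2))
    std = scaled.std(axis=(0, 1, 2))
    return mean, std


# Synthetic data: channel 0 is all black, channel 1 is all white,
# channel 2 is black in the first two samples and white in the last two.
demo = np.zeros((4, 2, 2, 3), dtype=np.uint8)
demo[..., 1] = 255
demo[2:, :, :, 2] = 255

demo_mean, demo_std = channel_stats(demo)
```

On real data, `images` would be the stacked training split of the dataset, and the resulting tuples would be the `mean` and `std` arguments of the normalization transform.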

Testing and dependency updates:

  • Removed tests for the now-removed to_long_tensor methods in both CIFAR10 and MNIST dataset test files, reflecting the updated codebase. [1] [2]
  • Updated the development dependencies in pyproject.toml to require pip>=25.3.

Checklist

Please complete the following checklist when submitting a PR. The PR will not be reviewed until all items are checked.

  • All new features include a unit test.
    Make sure that the tests pass and the coverage is
    sufficient by running pytest tests --cov=src --cov-report=term-missing.
  • All new functions and code are clearly documented.
  • The code is formatted using Black.
    You can do this by running black src tests.
  • The imports are sorted using isort.
    You can do this by running isort src tests.
  • The code is type-checked using Mypy.
    You can do this by running mypy src tests.

Updated dataset classes for CIFAR10, MNIST, PathMNIST, and RetinaMNIST to use torchvision.transforms.v2 and dataset-specific normalization values. Removed redundant to_long_tensor methods and related tests. Added a notebook for dataset normalization statistics. Updated dev dependencies to include pip>=25.3.
Deleted the test_to_long_tensor test cases from both PathMNIST and RetinaMNIST dataset test files as they are no longer needed or relevant.
Refined the mean and std values used in v2.Normalize for CIFAR10, MNIST, PathMNIST, and RetinaMNIST datasets to higher precision, based on updated calculations. Also updated the datasets_normalisation.ipynb notebook to reflect these new statistics and added MNIST normalization analysis.

github-actions bot commented Nov 3, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 882   | 859     | 97%      | 90%       | 🟢     |

New Files

No new covered files...

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| src/matchcake_opt/datasets/cifar10_dataset.py | 100% | 🟢 |
| src/matchcake_opt/datasets/mnist_dataset.py | 100% | 🟢 |
| src/matchcake_opt/datasets/pathmnist_dataset.py | 100% | 🟢 |
| src/matchcake_opt/datasets/retinamnist_dataset.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

updated for commit: 635104f by action🐍

@JeremieGince JeremieGince merged commit 16b6308 into dev Nov 3, 2025
6 checks passed
@JeremieGince JeremieGince deleted the normalization branch November 3, 2025 15:12