This codebase contains:
- the data extracted from a meat-industry production line after packaging, preprocessed to contain only product windows in order to remove any brand identity;
- code in the `src` folder that uses vision foundation models (DINOv2, CLIP, and ViT-MAE) to extract visual features from this dataset and trains non-neural-network classifiers on top of these embeddings.
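As a minimal sketch of the extraction step (assuming DINOv2 is loaded via `torch.hub`; the actual script names and data-loading details in `src` may differ), a frozen backbone is run over the images and the embeddings are collected for the downstream classifier:

```python
import torch

def extract_embeddings(model, images, batch_size=32):
    """Run a frozen backbone over a stack of preprocessed images
    and return one embedding vector per image."""
    model.eval()
    feats = []
    with torch.no_grad():
        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size]
            feats.append(model(batch))  # shape (B, D)
    return torch.cat(feats)

# Hypothetical usage with the smallest DINOv2 backbone (requires network access):
# backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
# embeddings = extract_embeddings(backbone, preprocessed_images)
```

Keeping the backbone frozen means the foundation model is used purely as a feature extractor; only the lightweight classifier on top is trained.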
In total the data contains 30 products. We run one-shot, five-shot, ten-shot, and full-set experiments. In all cases, the test set is the same for a given seed: the few-shot experiments select N items per class from the training images and evaluate the resulting model on the same test set as the full-set experiments.
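The per-class sampling can be sketched as follows (a minimal illustration with an assumed helper name `sample_few_shot`; the repository's own sampling code may differ). Fixing the seed makes the selection reproducible across runs while leaving the test set untouched:

```python
import numpy as np

def sample_few_shot(labels, n_shot, seed):
    """Pick n_shot training indices per class, reproducibly for a given seed."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = []
    for c in np.unique(labels):
        class_idx = np.flatnonzero(labels == c)
        picked.extend(rng.choice(class_idx, size=n_shot, replace=False))
    return np.array(picked)
```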
Performance on this 30-class classification task is very high. The best models achieve:
- One-shot with augmentation, avg over 20 runs: 0.73 overall accuracy
- Five-shot with augmentation, avg over 20 runs: 0.895 overall accuracy
- Ten-shot with augmentation, avg over 20 runs: 0.929 overall accuracy
- Full-set (no augmentation), avg over 20 runs: 0.975 overall accuracy
While other model types were also tested, these results were obtained with logistic regression on embeddings from the smallest version of the DINOv2 model.
This work demonstrates that vision foundation models embed images of different meat products into sufficiently linearly separable regions of the embedding space, allowing a simple logistic regression to separate the classes with very high accuracy.
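The linear-separability argument can be illustrated on synthetic data: when class embeddings form well-separated clusters, a logistic-regression probe reaches near-perfect accuracy. This is a toy sketch, not the repository's pipeline; the cluster parameters below are arbitrary stand-ins for foundation-model embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_classes, dim, per_class = 30, 64, 40

# Well-separated class centroids stand in for the embedding clusters.
centroids = rng.normal(scale=5.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(scale=0.5, size=(per_class, dim)) for c in centroids])
y = np.repeat(np.arange(n_classes), per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
```

With tight, well-separated clusters the probe's test accuracy approaches 1.0, which mirrors why a simple linear classifier suffices on strong foundation-model embeddings.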