Author: Bijesh Singha
The primary objective of this assignment was to implement a YOLO (You Only Look Once) object detection model and evaluate its performance across diverse video datasets. The analysis focuses on:
- Identifying entities within the model's standard trained classes.
- Observing model behavior when encountering "out-of-distribution" objects not present in the training set.
- Testing the model's generalization capabilities across real-world, stylized (cartoon), and AI-generated content.
The model was tested on standard CCTV footage capturing road traffic.
- Findings: The model correctly identified a high volume of standard entities, including cars (2,580 detections), persons (749), and motorcycles (369).
- Interpretation: Accuracy remained high because these entities belong to the standard 80 COCO classes the model was originally trained on.
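Per-class totals like the ones above can be produced by tallying the labels the detector emits on each frame. The following is a minimal sketch, assuming the detector's output has already been reduced to one list of label strings per frame; the function name `tally_detections` and the sample frames are hypothetical and are not taken from the original experiment.

```python
from collections import Counter


def tally_detections(frames):
    """Count detections per class label across all frames.

    `frames` is an iterable of per-frame label lists (an assumed format).
    """
    counts = Counter()
    for labels in frames:
        counts.update(labels)
    return counts


# Hypothetical three-frame clip, not data from the report.
sample_frames = [
    ["car", "car", "person"],
    ["car", "motorcycle"],
    ["car", "person", "person"],
]

totals = tally_detections(sample_frames)
for label, n in totals.most_common():
    print(f"{label}: {n}")
```

Under this kind of tally, an object that stays in view is counted once per frame in which it is detected, so the totals measure detections rather than unique objects.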
Testing class limitations using footage of a person feeding lions.
- Findings: While the model accurately detected the person (102 detections), it completely failed to identify the lions.
- Feature Proximity Mapping: Because "lion" is not one of the 80 trained classes, the model forced a classification based on the closest visual features available, misidentifying them as dogs (86), bears (66), and cats (29).
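To see how the lion detections were split across the substitute classes, the reported counts can be turned into shares. This is a small arithmetic sketch; it assumes that every dog, bear, and cat detection in this video corresponds to a lion, which the counts alone do not confirm.

```python
# Counts reported above for the lion footage.
lion_labels = {"dog": 86, "bear": 66, "cat": 29}

# Assumption: all of these detections are mislabeled lions.
total = sum(lion_labels.values())
shares = {label: count / total for label, count in lion_labels.items()}

print(f"total substitute detections: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Under that assumption, "dog" accounts for a little under half of the 181 substitute detections, with "bear" and "cat" making up the rest.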
Evaluating how the model handles non-realistic, stylized representations.
- Findings:
  - A cartoon child was correctly generalized as a "person" (325 detections).
  - Abstractly drawn objects were harder for the model: a chocolate bar was misidentified as a "cell phone", and bees were classified as "bird" or "sports ball".
- Interpretation: The model generalizes human-like features well but defaults to the most similar trained feature set for small or abstract objects.
Determining if YOLO can distinguish between real and synthetic (AI) media.
- Findings: The model identifies objects the same way whether the source is real or synthetic; it produced no output marking the video as "AI-generated".
- Interpretation: YOLO is built for object localization and classification, not provenance or deepfake detection.
Based on these experiments, the following conclusions were drawn regarding YOLO's operational logic:
| Scenario | Primary Object | Top Detection Class | Accuracy Note |
|---|---|---|---|
| Road Video | Cars / Persons | Car / Person | High: Objects are within standard 80 classes. |
| Animal Video | Lions | Dog / Bear | Low: "Lion" class is missing from training. |
| Cartoon Video | Cartoon Child | Person | Moderate: Effectively generalizes features. |
| AI Video | Synthetic Objects | Standard Classes | N/A: Cannot detect "AI" as a class. |
- Fixed Class Constraint: The model is strictly limited to its 80 trained classes.
- No Null Results: When encountering an unknown object, the model has no "unknown" label to fall back on. In these experiments it did not return a null result; it assigned the class with the most similar visual features (e.g., mapping lion features to a dog or bear). A detection is dropped only when its confidence falls below the detection threshold.
- Generalization vs. Specificity: While it can generalize (treating cartoon humans as "people"), it lacks the nuance to identify specific items outside its training set, such as specific animal species or unique consumer goods.
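The fixed-class behavior summarized above can be illustrated with a simplified label-selection step. This is a sketch under stated assumptions, not YOLO's actual detection head: the four-label list stands in for the 80 COCO classes, and the scores, the `pick_label` helper, and the 0.25 threshold are hypothetical values chosen for illustration.

```python
# Illustrative subset of the trained labels; "lion" is not among them.
CLASS_NAMES = ["person", "dog", "bear", "cat"]


def pick_label(scores, conf_threshold=0.25):
    """Return (label, score) for the highest-scoring class.

    Returns None when even the best score is below the threshold,
    which is the only way this sketch produces an empty result.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < conf_threshold:
        return None
    return CLASS_NAMES[best], scores[best]


# Hypothetical scores for a lion: it resembles several trained classes.
lion_like = [0.05, 0.62, 0.48, 0.21]
lion_result = pick_label(lion_like)
print(lion_result)  # a trained label, never "lion"

# Hypothetical scores for an object that resembles nothing in the list.
unfamiliar = [0.03, 0.10, 0.08, 0.04]
unfamiliar_result = pick_label(unfamiliar)
print(unfamiliar_result)
```

Whatever the input, the returned label is always drawn from `CLASS_NAMES`, which is why a lion can only surface as a dog, bear, or cat in this setup.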