Fine-tuning state-of-the-art object detectors on the KITTI autonomous driving dataset with real-time bounding box visualization.
This project fine-tunes YOLOv8 and DETR (DEtection TRansformer) on the KITTI autonomous driving dataset to detect:
- Cars
- Pedestrians
- Cyclists
- Vans
- Trucks
It includes a visualization pipeline that overlays bounding boxes on video frames, suitable for real-time inference demos.
## Project Structure

```
kitti-detection/
├── configs/
│   ├── yolo_kitti.yaml       # YOLOv8 training config
│   └── detr_kitti.yaml       # DETR training config
├── data/
│   ├── download_kitti.sh     # Dataset download script
│   └── kitti_dataset.py      # PyTorch Dataset class for KITTI
├── models/
│   ├── yolo_trainer.py       # YOLOv8 fine-tuning script
│   └── detr_trainer.py       # DETR fine-tuning with HuggingFace
├── scripts/
│   ├── visualize_video.py    # Draw bounding boxes on video frames
│   ├── evaluate.py           # Compute mAP, precision, recall
│   └── inference.py          # Single image/video inference
├── utils/
│   ├── kitti_parser.py       # Parse KITTI annotation format
│   ├── bbox_utils.py         # Bounding box utilities
│   └── metrics.py            # Evaluation metrics (mAP, IoU)
├── notebooks/
│   └── EDA_and_Results.ipynb # Exploratory analysis + results
├── requirements.txt
└── README.md
```
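The `utils/kitti_parser.py` module above handles KITTI's label format, where each line of a label `.txt` file holds 15 space-separated fields (class, truncation, occlusion, viewing angle, 2D bbox, 3D dimensions, location, rotation). A minimal parsing sketch — the `parse_kitti_label_line` helper and `KittiObject` type are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass

@dataclass
class KittiObject:
    cls: str          # Car, Pedestrian, Cyclist, Van, Truck, DontCare, ...
    truncated: float  # 0.0 (fully visible) .. 1.0 (fully truncated)
    occluded: int     # 0=visible, 1=partly, 2=largely occluded, 3=unknown
    bbox: tuple       # (left, top, right, bottom) in pixel coordinates

def parse_kitti_label_line(line):
    """Parse one line of a KITTI object label file.

    Field order: type truncated occluded alpha bbox(4)
    dimensions(3) location(3) rotation_y
    """
    f = line.split()
    return KittiObject(
        cls=f[0],
        truncated=float(f[1]),
        occluded=int(float(f[2])),
        bbox=(float(f[4]), float(f[5]), float(f[6]), float(f[7])),
    )
```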
## Quick Start

Install dependencies:

```bash
pip install -r requirements.txt
```

Download the KITTI dataset:

```bash
bash data/download_kitti.sh
# OR manually: http://www.cvlibs.net/datasets/kitti/eval_object.php
```

Train YOLOv8:

```bash
python models/yolo_trainer.py --epochs 50 --batch 16 --imgsz 640
```

Train DETR:

```bash
python models/detr_trainer.py --epochs 30 --batch 8 --lr 1e-4
```

Visualize detections on video:

```bash
python scripts/visualize_video.py \
    --model yolo \
    --weights runs/yolo/best.pt \
    --input data/kitti_video.mp4 \
    --output results/output.mp4 \
    --conf 0.5
```

Evaluate:

```bash
python scripts/evaluate.py --model yolo --weights runs/yolo/best.pt
```

## Dataset

| Split | Images | Labels |
|---|---|---|
| Train | 6,481 | Car, Pedestrian, Cyclist, Van, Truck |
| Val | 1,000 | Same classes |
| Test | 7,518 | No labels (official submission) |
Class distribution:
- Car: ~28,000 instances
- Pedestrian: ~4,400 instances
- Cyclist: ~1,600 instances
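This imbalance (roughly 17× more cars than cyclists) is commonly countered with class-weighted losses or oversampling. An illustrative sketch computing inverse-frequency class weights from the counts above — a hypothetical helper, not part of this repo:

```python
def inverse_frequency_weights(counts):
    """Inverse-frequency class weights, normalized so the mean weight is 1.0."""
    total = sum(counts.values())
    raw = {c: total / n for c, n in counts.items()}  # rarer class -> larger weight
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# Approximate instance counts from the distribution above
weights = inverse_frequency_weights({"Car": 28000, "Pedestrian": 4400, "Cyclist": 1600})
```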
## Model Architectures

### YOLOv8

- Backbone: CSPDarknet + C2f modules
- Neck: PAN-FPN
- Head: Decoupled detection head
- Pre-trained: COCO → Fine-tuned on KITTI
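For reference, a hypothetical sketch of what `configs/yolo_kitti.yaml` might contain, assuming the Ultralytics data-config format; the paths and class indices are illustrative, not the repo's actual file:

```yaml
# Hypothetical sketch of configs/yolo_kitti.yaml
path: data/kitti        # dataset root
train: images/train     # 6,481 training images
val: images/val         # 1,000 validation images
names:
  0: Car
  1: Pedestrian
  2: Cyclist
  3: Van
  4: Truck
```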
### DETR

- Backbone: ResNet-50
- Transformer: 6 encoder + 6 decoder layers
- 100 learned object queries
- Pre-trained: COCO → Fine-tuned on KITTI
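Because DETR predicts a fixed set of queries, each of which outputs either a class or "no object", no NMS is needed at inference time. An illustrative post-processing sketch in plain Python — the function name and I/O shapes are assumptions, not the repo's code:

```python
import math

def postprocess_queries(logits, boxes, conf_thresh=0.5):
    """Turn DETR-style per-query outputs into detections.

    logits: per-query class scores with a trailing "no object" class
    boxes:  per-query (cx, cy, w, h) in [0, 1] normalized coordinates
    Returns (class_index, confidence, (x1, y1, x2, y2)) tuples.
    """
    detections = []
    for scores, (cx, cy, w, h) in zip(logits, boxes):
        exps = [math.exp(s - max(scores)) for s in scores]  # stable softmax
        probs = [e / sum(exps) for e in exps]
        best = max(range(len(probs) - 1), key=lambda i: probs[i])  # skip "no object"
        if probs[best] >= conf_thresh:
            detections.append(
                (best, probs[best], (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
            )
    return detections
```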
## Visualization Features

- Color-coded boxes by class (Car=blue, Pedestrian=green, Cyclist=yellow)
- Confidence scores displayed per detection
- FPS counter for real-time performance monitoring
- Track IDs (optional, using ByteTrack)
- Save to MP4 or display live with OpenCV
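The FPS counter can be as simple as a rolling window of frame timestamps; a minimal sketch (a hypothetical `FpsCounter`, not the repo's implementation):

```python
import time
from collections import deque

class FpsCounter:
    """Rolling FPS over the last `window` frames, for overlaying on video."""

    def __init__(self, window=30):
        self.stamps = deque(maxlen=window)

    def tick(self):
        """Call once per processed frame."""
        self.stamps.append(time.perf_counter())

    @property
    def fps(self):
        if len(self.stamps) < 2:
            return 0.0
        return (len(self.stamps) - 1) / (self.stamps[-1] - self.stamps[0])
```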
## Skills Demonstrated

- Fine-tuning pre-trained models on domain-specific data
- KITTI dataset parsing (unique `.txt` annotation format)
- Multi-class object detection evaluation (mAP, per-class AP)
- Video inference pipeline with OpenCV
- Transformer-based detection (DETR) vs. CNN-based (YOLO)
- Autonomous driving domain knowledge
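The IoU underlying these evaluation metrics (see `utils/metrics.py`) fits in a few lines; this is an illustrative version, not the repo's implementation:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (left, top, right, bottom) coordinates."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0
```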