Computer Vision is a field at the intersection of Artificial Intelligence (AI) and Computer Science. Its goal is to enable machines to interpret and understand visual data (images, videos) in a way similar to human perception.
Computer Vision focuses on:
- Recognizing objects in images or videos
- Interpreting scenes and making decisions based on visual input
- Understanding motion and spatial relationships
- Facial Recognition (e.g., smartphones)
- Autonomous Vehicles (detecting pedestrians, traffic signs)
- Medical Imaging (assisting doctors with scans)
- Robotics (navigation and interaction)
- Image Classification: Assign a label to an entire image (e.g., "cat", "car").
- Object Detection: Identify and locate objects using bounding boxes.
- Segmentation: Divide an image into regions and label each pixel.
- Recognition & Identification: Match faces or interpret handwritten digits.
- Motion Analysis: Track movement across video frames.
- 3D Scene Reconstruction: Infer depth and spatial relationships from 2D images.
Computer Vision is transforming industries by giving machines the ability to see, interpret, and respond to visual information. Its impact will continue to grow across countless domains.
- OpenCV Documentation
- [Computer Vision Basics](https://en.wikipedia.org/wiki/Computer_vision)
Computer Vision has evolved significantly over the decades. Below is a timeline of key milestones:
- Relied on manually crafted algorithms.
- Techniques like:
  - Edge Detection
  - Feature Extraction
- Worked well in controlled environments but struggled with real-world complexity.
- Introduction of Machine Learning models such as:
  - Support Vector Machines (SVMs)
- Provided modest improvements but still limited by data and computational power.
- Convolutional Neural Networks (CNNs) were introduced in the late 1980s but became practical much later.
- Two major breakthroughs between 2010–2012:
  - GPU Acceleration: Enabled faster training of deep neural networks.
  - Large-Scale Datasets: For example, ImageNet with millions of labeled images.
- AlexNet (2012):
  - Dramatically improved performance in the ImageNet Challenge.
  - Reduced classification error rates significantly.
  - Sparked the modern wave of Computer Vision innovation.
- CNN-based models dominate state-of-the-art systems.
- Achieve performance comparable to (and sometimes surpassing) human accuracy on benchmarks like ImageNet.
- ImageNet Project
- History of CNNs
ImageNet is one of the most influential projects in the history of Computer Vision and Artificial Intelligence. It provided the foundation for modern deep learning breakthroughs.
- A large-scale image dataset introduced in 2009.
- Contains millions of labeled images across 1,000+ categories.
- Designed to advance research in visual recognition.
Before ImageNet:
- Computer Vision models were limited by small datasets.
- Deep learning was impractical due to lack of data and computational power.
ImageNet changed this by:
- Offering massive labeled data for training.
- Enabling benchmark competitions like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
- A deep Convolutional Neural Network (CNN) trained on ImageNet.
- Achieved a dramatic reduction in error rates.
- Sparked the deep learning revolution in Computer Vision.
- CNN-based models became the standard for image recognition.
- Inspired architectures like VGG, ResNet, and EfficientNet.
- Performance now rivals or surpasses human-level accuracy on benchmarks.
- ImageNet Official Site
- [ILSVRC Challenge](https://image-net.org/challenges/LSVRC/)
A digital image is essentially a collection of numbers arranged in a grid. These numbers represent pixel intensity values.
- Grayscale images are represented as a 2D array: `height × width`.
  - Each pixel value indicates brightness: `0` → Black, `255` → White (for 8-bit encoding).
- Color images are represented as a 3D array: `height × width × channels`.
  - Channels: Red, Green, Blue.
  - Example: a `64 × 64` color image → `64 × 64 × 3`.
  - Pixel values range from `(0, 0, 0)` → Black to `(255, 255, 255)` → White.
- Bit depth sets the value range:
  - 8-bit → values from `0–255`
  - 16-bit → High Color
  - 24-bit → True Color
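The array shapes described above can be checked directly; a minimal NumPy sketch:

```python
import numpy as np

# Grayscale: 2D array (height × width), 8-bit values in 0–255
gray = np.zeros((64, 64), dtype=np.uint8)   # all-black image
gray[32, 32] = 255                          # one white pixel

# Color: 3D array (height × width × channels), RGB order
color = np.zeros((64, 64, 3), dtype=np.uint8)
color[:, :] = (255, 255, 255)               # all-white image

print(gray.shape)    # (64, 64)
print(color.shape)   # (64, 64, 3)
```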
- Normalization: Scale pixel values to a smaller range (e.g., `0–1`).
- Tensor Formats:
  - TensorFlow/Keras → Channels Last: `(height, width, channels)`, e.g., `256 × 256 × 3`.
  - PyTorch → Channels First: `(channels, height, width)`, e.g., `3 × 256 × 256`.
- Images are arrays of pixel values.
- Color images use multiple channels (RGB).
- Frameworks differ in tensor layout → configure CNN input accordingly.
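The layout difference is just an axis permutation; a minimal NumPy sketch (PyTorch's `Tensor.permute` does the same for tensors):

```python
import numpy as np

# Channels-last image (TensorFlow/Keras): height × width × channels
hwc = np.zeros((256, 256, 3), dtype=np.float32)

# Reorder axes to channels-first (PyTorch): channels × height × width
chw = np.transpose(hwc, (2, 0, 1))
print(hwc.shape, chw.shape)  # (256, 256, 3) (3, 256, 256)

# Normalization: map 8-bit values 0–255 into 0–1
pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = pixels.astype(np.float32) / 255.0   # ≈ [0.0, 0.502, 1.0]
```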
- TensorFlow Image Guide
- PyTorch Image Tensors
Before training deep learning models, raw images typically need to be resized, center-cropped (optional), batched, and normalized.
This guide shows how to do that in TensorFlow/Keras and PyTorch, plus how to verify the preprocessing results.
```bash
# One (or both) of these, depending on your stack
pip install tensorflow==2.*   # Keras included
pip install torch torchvision torchaudio
pip install matplotlib
```

Image Augmentation is a technique used to artificially increase the diversity of a training dataset by applying random transformations to existing images. This helps improve model generalization without collecting more data.
- Prevent overfitting by introducing variability.
- Improve robustness to real-world conditions.
- Simulate changes in orientation, lighting, and occlusion.
⚠️ Apply augmentation only to training data, not validation or test sets.
Image classification is the process of assigning a single label to an entire image from a predefined set of categories. For example, a photo might be classified as cat, dog, bird, or fish, depending on which category best matches its content.
This task focuses on identifying what is in the image, not where objects are located or their exact shape.
- Goal: Predict one class label for the entire image.
- Output:
- Predicted label
- Confidence score (model certainty)
- Examples:
- Labeling X-rays as healthy or diseased
- Identifying plant species from leaf photos
- Recognizing handwritten digits in postal codes
Convolutional Neural Networks (CNNs) excel at image classification because they learn hierarchical features:
- Low-level: edges, textures
- Mid-level: shapes, patterns
- High-level: object parts and full objects
This layered representation enables robust predictions across variations in scale, lighting, and viewpoint.
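As an illustration of this layered design, here is a toy CNN classifier in PyTorch. The layer sizes are arbitrary choices for a 3×32×32 input, not a reference architecture:

```python
import torch
import torch.nn as nn

# Stacked conv blocks: early layers respond to edges/textures, later layers
# combine them into shapes and object parts before classification.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # 10-class logits
)

x = torch.randn(1, 3, 32, 32)   # one fake RGB image
logits = model(x)
print(logits.shape)             # torch.Size([1, 10])
```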
- Accuracy: proportion of correctly classified images.
- Example: 960 correct out of 1200 → 80% accuracy.
- Additional metrics for imbalanced data:
- Precision
- Recall
- F1 Score
- Top-k accuracy: checks if the correct label is among the top-k predictions.
Accuracy alone can be misleading for skewed datasets; consider precision/recall and confusion matrices.
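The metrics above can be computed with plain NumPy; a minimal sketch (the labels and scores are made up for illustration):

```python
import numpy as np

# Ground-truth labels and model scores for 5 images, 3 classes
y_true = np.array([0, 1, 2, 2, 1])
scores = np.array([
    [0.7, 0.2, 0.1],   # predicted 0 (correct)
    [0.1, 0.8, 0.1],   # predicted 1 (correct)
    [0.5, 0.2, 0.3],   # predicted 0 (wrong, but class 2 is in the top-2)
    [0.1, 0.2, 0.7],   # predicted 2 (correct)
    [0.6, 0.3, 0.1],   # predicted 0 (wrong, but class 1 is in the top-2)
])

# Accuracy: fraction of top-1 predictions matching the label
y_pred = scores.argmax(axis=1)
accuracy = (y_pred == y_true).mean()

# Top-k accuracy: is the true label among the k highest-scoring classes?
k = 2
topk = np.argsort(scores, axis=1)[:, -k:]   # top-2 class IDs per row
topk_acc = np.mean([label in row for label, row in zip(y_true, topk)])

print(accuracy)   # 0.6
print(topk_acc)   # 1.0
```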
- Class imbalance: leads to biased predictions.
- Remedies: class-weighted loss, oversampling, augmentation.
- Overfitting: occurs with small datasets.
- Remedies: dropout, weight decay, transfer learning.
- Data leakage: ensure strict separation of train/validation/test sets.
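One of the remedies above, class-weighted loss, is a one-line change in PyTorch. The weights here are hypothetical (e.g., inverse class frequencies):

```python
import torch
import torch.nn as nn

# Suppose class 0 is ~9x more common than class 1 in the training set;
# up-weight the rare class so mistakes on it cost more.
weights = torch.tensor([1.0, 9.0])
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.tensor([[2.0, -1.0], [0.5, 0.5]])  # fake model outputs
labels = torch.tensor([0, 1])

loss = criterion(logits, labels)
plain = nn.CrossEntropyLoss()(logits, labels)
print(loss.item() > plain.item())  # True: the up-weighted rare-class error dominates
```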
- Object Detection: predicts bounding boxes and labels for multiple objects.
- Semantic Segmentation: assigns a label to each pixel.
- Instance Segmentation: segments and labels each object instance.
- Deep Learning by Goodfellow et al.
- ImageNet Challenge benchmarks
- PyTorch and TensorFlow documentation for model implementations
Purpose: Step-by-step, paraphrased tutorial to load a pretrained YOLO classification model, run inference, extract top‑1/top‑5 predictions, and overlay the best label on the image.
Audience: Python users (beginner–intermediate) working with computer vision.
- Python 3.8+
- A GPU is optional; CPU works for small models.
- Install packages:
```bash
pip install ultralytics pillow matplotlib
```

The `ultralytics` package provides YOLOv8 models, including classification variants (e.g., `yolov8n-cls`).
```python
from ultralytics import YOLO

# Load a small, pretrained classification model (ImageNet-trained)
model = YOLO("yolov8n-cls.pt")  # alternatives: yolov8s-cls.pt, yolov8m-cls.pt
```

```python
from PIL import Image
import matplotlib.pyplot as plt

img_path = "path/to/your/image.jpg"

# Preview the image
img = Image.open(img_path).convert("RGB")
plt.imshow(img)
plt.axis("off")
plt.show()
```

```python
# Run classification inference; returns a list of results (one per image)
results = model.predict(source=img_path)

# Inspect the raw result
res = results[0]
print("Classes (IDs):", res.probs.top5)
print("Top-5 confidences:", res.probs.top5conf)
print("Class names (model):", res.names)
```

Notes:

- `res.names` maps class IDs (e.g., 0..999 for ImageNet) to human-readable labels.
- `res.probs.top1` and `res.probs.top5` provide indices of the best classes.
```python
# Top-1
top1_id = int(res.probs.top1)
top1_conf = float(res.probs.top1conf)
top1_label = res.names[top1_id]
print(f"Top-1: {top1_label} ({top1_conf*100:.2f}%)")

# Top-5
top5_ids = list(map(int, res.probs.top5))
top5_confs = list(map(float, res.probs.top5conf))
top5_labels = [res.names[i] for i in top5_ids]
for rank, (lbl, conf) in enumerate(zip(top5_labels, top5_confs), start=1):
    print(f"Top-{rank}: {lbl} ({conf*100:.2f}%)")
```

```python
from PIL import ImageDraw, ImageFont

# Draw the label on a copy of the image; "RGBA" draw mode enables
# semi-transparent fills on the RGB image
overlay = img.copy()
draw = ImageDraw.Draw(overlay, "RGBA")

# Choose a font (system-dependent); fall back to default if unavailable
try:
    font = ImageFont.truetype("arial.ttf", 24)
except OSError:
    font = ImageFont.load_default()

text = f"{top1_label} ({top1_conf*100:.1f}%)"
text_color = (255, 255, 255)
box_color = (0, 0, 0, 160)  # semi-transparent black

# Compute text size (textbbox replaces textsize, removed in Pillow 10)
# and build a background rectangle
left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
text_w, text_h = right - left, bottom - top
padding = 8
rect = [10, 10, 10 + text_w + 2*padding, 10 + text_h + 2*padding]

# Draw rectangle and text
draw.rectangle(rect, fill=box_color)
draw.text((10 + padding, 10 + padding), text, fill=text_color, font=font)

plt.imshow(overlay)
plt.axis("off")
plt.show()
```
```python
import csv

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model.predict(source=images)

with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "top1_label", "top1_conf", "top5_labels", "top5_confs"])
    for img_path, res in zip(images, results):
        top1_id = int(res.probs.top1)
        top1_label = res.names[top1_id]
        top1_conf = float(res.probs.top1conf)
        top5_ids = list(map(int, res.probs.top5))
        top5_labels = [res.names[i] for i in top5_ids]
        top5_confs = list(map(float, res.probs.top5conf))
        writer.writerow([
            img_path,
            top1_label,
            f"{top1_conf:.6f}",
            ";".join(top5_labels),
            ";".join(f"{c:.6f}" for c in top5_confs),
        ])
```

- Model choice: `yolov8n-cls.pt` is fast on CPU; use `yolov8s/m/l-cls.pt` for better accuracy on GPU.
- Image preprocessing: The `ultralytics` pipeline handles resizing/normalization automatically when using `model.predict`.
- Reproducibility: Set seeds (e.g., `torch.manual_seed`) if you retrain or fine-tune models.
- Performance: On CPU, expect modest latency; for production, prefer GPU or export to ONNX/TFLite.
- Classify a different image.
- Automate a folder pipeline and log predictions.
- Integrate into a web API (FastAPI/Flask) or a mobile app.
Purpose: Paraphrased explanation suitable for documentation, READMEs, or educational notes.
Object detection is a computer vision task that identifies what objects are present in an image and where they are located. Unlike image classification, which assigns a single label to an entire image, object detection predicts:
- Class labels for each object
- Bounding boxes indicating object positions
- Confidence scores for predictions
- Bounding Box: A rectangle that encloses the detected object.
- Class Label: The category assigned to the object (e.g., dog, car).
- Confidence Score: Indicates the model’s certainty about the prediction.
- Detecting faces in smartphone cameras for autofocus and facial recognition.
- Identifying pedestrians and vehicles in traffic surveillance footage.
- Monitoring wildlife by detecting animals in images or videos captured by camera traps.
- Process:
- Generate region proposals.
- Classify each region and refine bounding boxes.
- Examples: Fast R-CNN, Faster R-CNN.
- Pros: High accuracy.
- Cons: Slower due to two-step process.
- Process: Detect objects in a single pass without proposal generation.
- Examples: YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector).
- Pros: Faster, suitable for real-time applications.
- Cons: May sacrifice some accuracy compared to two-stage methods.
- IoU (Intersection over Union): Measures overlap between predicted and ground truth boxes.
- IoU = (Area of Overlap) / (Area of Union).
- Common threshold: IoU ≥ 0.5 for a correct detection.
- mAP (Mean Average Precision): Summarizes precision-recall performance across all classes.
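The IoU formula above is straightforward to implement for axis-aligned boxes in `(x1, y1, x2, y2)` format; a minimal sketch:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih

    # Union = area(a) + area(b) - intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping half of a 10x10 ground-truth box
pred = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
print(box_iou(pred, truth))  # 50 / 150 ≈ 0.333 -> below the 0.5 threshold
```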
- Class imbalance: Leads to biased detection.
- Remedies: Use balanced datasets, augmentation, or focal loss.
- Small object detection: Harder for models; consider higher resolution or specialized architectures.
- Overfitting: Apply regularization and strong augmentation.
- Image Classification: Assigns one label to the entire image.
- Semantic Segmentation: Labels each pixel in the image.
- Instance Segmentation: Combines detection and segmentation for individual objects.
- Girshick et al. — Fast R-CNN, Faster R-CNN papers.
- YOLO: Redmon et al. — https://pjreddie.com/darknet/yolo/
- SSD: Liu et al. — https://arxiv.org/abs/1512.02325
Purpose: Step-by-step, paraphrased tutorial to load a pretrained YOLO detection model, run inference, parse results, and render bounding boxes with labels and confidence scores.
Audience: Python users (beginner–intermediate) working with computer vision.
- Python 3.8+
- A GPU is optional; CPU works for small models.
- Install packages:
```bash
pip install ultralytics pillow matplotlib
```

The `ultralytics` package provides YOLOv8 detection models (e.g., `yolov8n.pt`, `yolov8s.pt`).
```python
from ultralytics import YOLO

# Load a small, pretrained YOLOv8 detection model
model = YOLO("yolov8n.pt")  # alternatives: yolov8s.pt, yolov8m.pt, yolov8l.pt
```

```python
from PIL import Image
import matplotlib.pyplot as plt

img_path = "path/to/your/image.jpg"

# Preview the image
img = Image.open(img_path).convert("RGB")
plt.imshow(img)
plt.axis("off")
plt.show()
```

```python
# Run object detection; returns a list of results (one per image)
results = model.predict(source=img_path)

# Inspect the raw result
res = results[0]
print(res)  # summary

# Boxes, class IDs, confidences
boxes = res.boxes  # XYXY boxes, confidences, class indices
print("Number of detections:", len(boxes))

# Example: print the first detection
if len(boxes) > 0:
    b = boxes[0]
    print("xyxy:", b.xyxy.tolist())
    print("conf:", float(b.conf))
    print("cls:", int(b.cls))
    print("class name:", res.names[int(b.cls)])
```
```python
# Extract all detections as dictionaries
parsed = []
for b in boxes:
    xyxy = b.xyxy.squeeze().tolist()  # [x1, y1, x2, y2]
    conf = float(b.conf)
    cls_id = int(b.cls)
    label = res.names[cls_id]
    parsed.append({"label": label, "conf": conf, "xyxy": xyxy})

# Sort by confidence (descending)
parsed.sort(key=lambda d: d["conf"], reverse=True)

for i, det in enumerate(parsed, start=1):
    x1, y1, x2, y2 = det["xyxy"]
    print(f"{i}. {det['label']} ({det['conf']*100:.2f}%) box=({x1:.1f},{y1:.1f},{x2:.1f},{y2:.1f})")
```
```python
from PIL import ImageDraw, ImageFont

# "RGBA" draw mode enables semi-transparent fills on the RGB image
overlay = img.copy()
draw = ImageDraw.Draw(overlay, "RGBA")

# Font setup (fallback to default)
try:
    font = ImageFont.truetype("arial.ttf", 18)
except OSError:
    font = ImageFont.load_default()

for det in parsed:
    x1, y1, x2, y2 = det["xyxy"]
    label = det["label"]
    conf = det["conf"]
    caption = f"{label} {conf*100:.1f}%"

    # Draw the bounding box
    draw.rectangle([x1, y1, x2, y2], outline=(0, 255, 0), width=2)

    # Draw label background and text
    # (textbbox replaces textsize, removed in Pillow 10)
    left, top, right, bottom = draw.textbbox((0, 0), caption, font=font)
    text_w, text_h = right - left, bottom - top
    pad = 4
    bg = [x1, max(0, y1 - text_h - 2*pad), x1 + text_w + 2*pad, y1]
    draw.rectangle(bg, fill=(0, 0, 0, 160))
    draw.text((x1 + pad, y1 - text_h - pad), caption, fill=(255, 255, 255), font=font)

plt.imshow(overlay)
plt.axis("off")
plt.show()
```
```python
import csv

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model.predict(source=images)

with open("detections.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label", "conf", "x1", "y1", "x2", "y2"])
    for img_path, res in zip(images, results):
        for b in res.boxes:
            xyxy = b.xyxy.squeeze().tolist()
            conf = float(b.conf)
            cls_id = int(b.cls)
            label = res.names[cls_id]
            writer.writerow([img_path, label, f"{conf:.6f}", *[f"{v:.2f}" for v in xyxy]])
```

- Model choice: `yolov8n.pt` is fast on CPU; use `yolov8s/m/l.pt` for better accuracy (prefer GPU).
- Coordinate formats: YOLO returns XYXY by default via `boxes.xyxy`; other formats (XYWH) are available.
- Confidence & NMS: Built-in post-processing filters boxes using confidence and non-max suppression.
- Video/Webcam: Use `model.predict(source=0)` for webcam or pass a video path; iterate over frames to save annotated outputs.
- Run detection on a directory of images.
- Process videos or live streams.
- Fine‑tune the pretrained model on your custom dataset.
Purpose: Paraphrased explanation suitable for documentation, READMEs, or educational notes.
Image segmentation is the process of dividing an image into multiple regions or segments, where each region corresponds to a specific object or part of an object. Unlike object detection, which draws bounding boxes, segmentation provides pixel-level classification, enabling precise understanding of object shapes and boundaries.
Example: Given an image of a cat and a dog, segmentation identifies exactly which pixels belong to the cat, which belong to the dog, and which belong to the background.
- Assigns each pixel to a category without distinguishing individual instances.
- Example: Two cats in an image → all cat pixels labeled as “cat.”
- Differentiates between individual objects of the same category.
- Example: Two cats → one set of pixels for Cat 1, another for Cat 2.
- Video conferencing: Separate participant from background for virtual backgrounds.
- Medical imaging: Outline tumors or organs in scans.
- Autonomous driving: Identify drivable areas, lanes, pedestrians, and vehicles.
- IoU (Intersection over Union): Measures overlap between predicted and ground truth masks.
- IoU = (Area of Overlap) / (Area of Union).
- Higher IoU indicates better segmentation accuracy.
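For segmentation, the same IoU formula applies to boolean masks rather than boxes; a minimal NumPy sketch:

```python
import numpy as np

def mask_iou(pred, truth):
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union > 0 else 0.0

# Toy 4x4 masks: the prediction covers the left half, the ground truth the top half
pred = np.zeros((4, 4), dtype=bool)
pred[:, :2] = True
truth = np.zeros((4, 4), dtype=bool)
truth[:2, :] = True

print(mask_iou(pred, truth))  # 4 / 12 ≈ 0.333
```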
Segmentation offers a finer-grained understanding of visual data than detection by labeling every pixel. This precision supports advanced tasks like:
- Panoptic Segmentation: Combines semantic and instance segmentation.
- Scene Understanding: Critical for robotics, AR/VR, and autonomous systems.
- Long et al. — Fully Convolutional Networks for Semantic Segmentation.
- Mask R-CNN: He et al. — https://arxiv.org/abs/1703.06870
- YOLO Segmentation: Ultralytics Docs — https://docs.ultralytics.com