
📌 Computer Vision Overview

Computer Vision is a field at the intersection of Artificial Intelligence (AI) and Computer Science. Its goal is to enable machines to interpret and understand visual data (images, videos) in a way similar to human perception.


✅ What is Computer Vision?

Computer Vision focuses on:

  • Recognizing objects in images or videos
  • Interpreting scenes and making decisions based on visual input
  • Understanding motion and spatial relationships

🔍 Real-World Applications

  • Facial Recognition (e.g., smartphones)
  • Autonomous Vehicles (detecting pedestrians, traffic signs)
  • Medical Imaging (assisting doctors with scans)
  • Robotics (navigation and interaction)

🛠 Core Tasks in Computer Vision

  1. Image Classification
    Assign a label to an image (e.g., "cat", "car").

  2. Object Detection
    Identify and locate objects using bounding boxes.

  3. Segmentation
    Divide an image into regions and label each pixel.

  4. Recognition & Identification
    Match faces or interpret handwritten digits.

  5. Motion Analysis
    Track movement across video frames.

  6. 3D Scene Reconstruction
    Infer depth and spatial relationships from 2D images.


🌟 Why It Matters

Computer Vision is transforming industries by giving machines the ability to see, interpret, and respond to visual information. Its impact will continue to grow across countless domains.



📜 A Brief History of Computer Vision

Computer Vision has evolved significantly over the decades. Below is a timeline of key milestones:


🕰 Early Years (1960s–1990s)

  • Relied on manually crafted algorithms.
  • Techniques like:
    • Edge Detection
    • Feature Extraction
  • Worked well in controlled environments but struggled with real-world complexity.

📈 Machine Learning Era (2000s)

  • Introduction of Machine Learning models such as:
    • Support Vector Machines (SVMs)
  • Provided modest improvements but still limited by data and computational power.

🚀 Deep Learning Revolution (2010s)

  • Convolutional Neural Networks (CNNs) were introduced in the late 1980s but only became practical much later.

  • Two major breakthroughs arrived between 2010 and 2012:

    1. GPU Acceleration
      Enabled faster training of deep neural networks.
    2. Large-Scale Datasets
      Example: ImageNet with millions of labeled images.
  • AlexNet (2012)

    • Dramatically improved performance in the ImageNet Challenge.
    • Reduced classification error rates significantly.
    • Sparked the modern wave of Computer Vision innovation.

🌟 Today

  • CNN-based models dominate state-of-the-art systems.
  • Achieve performance comparable to (and sometimes surpassing) human accuracy on benchmarks like ImageNet.


🖼 ImageNet: A Pioneering Vision for Computers

ImageNet is one of the most influential projects in the history of Computer Vision and Artificial Intelligence. It provided the foundation for modern deep learning breakthroughs.


✅ What is ImageNet?

  • A large-scale image dataset introduced in 2009.
  • Contains millions of labeled images across 1,000+ categories.
  • Designed to advance research in visual recognition.

🔍 Why Was It Revolutionary?

Before ImageNet:

  • Computer Vision models were limited by small datasets.
  • Deep learning was impractical due to lack of data and computational power.

ImageNet changed this by:

  • Offering massive labeled data for training.
  • Enabling benchmark competitions like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

🚀 The Turning Point: AlexNet (2012)

  • A deep Convolutional Neural Network (CNN) trained on ImageNet.
  • Achieved a dramatic reduction in error rates.
  • Sparked the deep learning revolution in Computer Vision.

🌟 Impact on AI

  • CNN-based models became the standard for image recognition.
  • Inspired architectures like VGG, ResNet, and EfficientNet.
  • Performance now rivals or surpasses human-level accuracy on benchmarks.


🖼 Understanding Image Data

A digital image is essentially a collection of numbers arranged in a grid. These numbers represent pixel intensity values.


✅ Grayscale Images

  • Represented as a 2D array:
    height × width
  • Each pixel value indicates brightness:
    • 0 → Black
    • 255 → White (for 8-bit encoding)

🎨 Color Images (RGB)

  • Represented as a 3D array:
    height × width × channels
  • Channels: Red, Green, Blue
  • Example:
    A 64 × 64 color image → 64 × 64 × 3
  • Pixel values range:
    • 0,0,0 → Black
    • 255,255,255 → White

🔢 Color Depth

  • 8-bit → Values from 0–255
  • 16-bit → High Color
  • 24-bit → True Color

⚙️ Preprocessing for CNNs

  • Normalization: Scale pixel values to a smaller range (e.g., 0–1).
  • Tensor Formats:
    • TensorFlow/Keras → Channels Last: (height, width, channels)
      • Example: 256 × 256 × 3
    • PyTorch → Channels First: (channels, height, width)
      • Example: 3 × 256 × 256
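
These layout and normalization conventions can be checked with a short NumPy sketch (random pixel data stands in for a real image):

```python
import numpy as np

# A 64x64 grayscale image: 2D array of 8-bit brightness values
gray = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
print(gray.shape)  # (64, 64)

# A 64x64 RGB image: 3D array, channels last (TensorFlow/Keras layout)
rgb = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(rgb.shape)  # (64, 64, 3)

# Normalization: scale 0-255 down to 0-1 before feeding a CNN
rgb_norm = rgb.astype(np.float32) / 255.0

# Reorder to channels first (PyTorch layout)
chw = np.transpose(rgb_norm, (2, 0, 1))
print(chw.shape)  # (3, 64, 64)
```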

📚 Key Takeaways

  • Images are arrays of pixel values.
  • Color images use multiple channels (RGB).
  • Frameworks differ in tensor layout → configure CNN input accordingly.

🔗 Learn More

  • TensorFlow Image Guide
  • PyTorch Image Tensors

🧰 Preprocessing Image Data in Python

Before training deep learning models, raw images typically need to be resized, center-cropped (optional), batched, and normalized.
This guide shows how to do that in TensorFlow/Keras and PyTorch, plus how to verify the preprocessing results.


📦 Environment

# One (or both) of these depending on your stack
pip install tensorflow==2.*  # Keras included
pip install torch torchvision torchaudio
pip install matplotlib

The full code for these steps is available here:

preprocessing_image_dataset
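
The cropping, normalization, and batching steps can also be sketched framework-free in NumPy (resizing is usually delegated to PIL or the framework's own loaders; the 224-pixel crop size below is just an example):

```python
import numpy as np

def center_crop(img, size):
    """Center-crop an HxWxC array to (size, size)."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# Two synthetic 256x256 RGB images as stand-ins for loaded files
imgs = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8) for _ in range(2)]

# Crop, scale to [0, 1], and stack into a batch: (N, H, W, C)
batch = np.stack([center_crop(im, 224).astype(np.float32) / 255.0 for im in imgs])
print(batch.shape)  # (2, 224, 224, 3)
```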

🔄 Augmenting Image Data in Python

Image Augmentation is a technique used to artificially increase the diversity of a training dataset by applying random transformations to existing images. This helps improve model generalization without collecting more data.


✅ Why Augment Images?

  • Prevent overfitting by introducing variability.
  • Improve robustness to real-world conditions.
  • Simulate changes in orientation, lighting, and occlusion.

⚠️ Apply augmentation only to training data, not validation or test sets.


The full image augmentation code is available here:

image_augmentation
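
To give a feel for the technique, here is a minimal NumPy sketch of two common transformations, random horizontal flip and brightness jitter. Real pipelines would normally use Keras preprocessing layers or torchvision transforms instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Randomly flip horizontally and jitter brightness (illustrative only)."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]         # horizontal flip
    out *= rng.uniform(0.8, 1.2)      # brightness jitter
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.full((32, 32, 3), 128, dtype=np.uint8)  # a flat gray test image
aug = augment(img)
print(aug.shape)  # same shape as the input, randomly transformed content
```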

Computer Vision Image Classification


Overview

Image classification is the process of assigning a single label to an entire image from a predefined set of categories. For example, a photo might be classified as cat, dog, bird, or fish, depending on which category best matches its content.

This task focuses on identifying what is in the image, not where objects are located or their exact shape.


Key Points

  • Goal: Predict one class label for the entire image.
  • Output:
    • Predicted label
    • Confidence score (model certainty)
  • Examples:
    • Labeling X-rays as healthy or diseased
    • Identifying plant species from leaf photos
    • Recognizing handwritten digits in postal codes

Why CNNs Work Well

Convolutional Neural Networks (CNNs) excel at image classification because they learn hierarchical features:

  • Low-level: edges, textures
  • Mid-level: shapes, patterns
  • High-level: object parts and full objects

This layered representation enables robust predictions across variations in scale, lighting, and viewpoint.
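
The low-level stage can be made concrete with a hand-written edge filter. The naive NumPy convolution below applies a Sobel-style kernel to a synthetic image containing one vertical edge; a CNN learns kernels like this from data rather than having them specified by hand:

```python
import numpy as np

def conv2d(img, kernel):
    """Naive valid-mode 2D convolution (for illustration, not speed)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic image: dark left half, bright right half (a vertical edge)
img = np.zeros((8, 8), dtype=np.float32)
img[:, 4:] = 1.0

# Sobel-style kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
response = conv2d(img, sobel_x)
print(response.max())  # 4.0 -- the strongest response sits at the edge
```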


Metrics

  • Accuracy: proportion of correctly classified images.
    • Example: 960 correct out of 1200 → 80% accuracy.
  • Additional metrics for imbalanced data:
    • Precision
    • Recall
    • F1 Score
  • Top-k accuracy: checks if the correct label is among the top-k predictions.

Accuracy alone can be misleading for skewed datasets; consider precision/recall and confusion matrices.
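
All of these metrics reduce to a few array operations; a NumPy sketch with made-up binary predictions and per-class scores:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

accuracy = np.mean(y_true == y_pred)            # 4 of 6 correct
tp = np.sum((y_pred == 1) & (y_true == 1))      # true positives
precision = tp / np.sum(y_pred == 1)            # 3/4 = 0.75
recall = tp / np.sum(y_true == 1)               # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

# Top-2 accuracy from per-class scores (rows: samples, cols: classes)
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
labels = np.array([2, 0])
top2 = np.argsort(scores, axis=1)[:, -2:]       # indices of the 2 best classes
top2_acc = np.mean([labels[i] in top2[i] for i in range(len(labels))])
print(accuracy, precision, recall, f1, top2_acc)
```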


Common Pitfalls

  • Class imbalance: leads to biased predictions.
    • Remedies: class-weighted loss, oversampling, augmentation.
  • Overfitting: occurs with small datasets.
    • Remedies: dropout, weight decay, transfer learning.
  • Data leakage: ensure strict separation of train/validation/test sets.

Related Tasks

  • Object Detection: predicts bounding boxes and labels for multiple objects.
  • Semantic Segmentation: assigns a label to each pixel.
  • Instance Segmentation: segments and labels each object instance.

References

  • Deep Learning by Goodfellow et al.
  • ImageNet Challenge benchmarks
  • PyTorch and TensorFlow documentation for model implementations

Using a Pretrained YOLO Model for Image Classification in Python

Purpose: Step-by-step, paraphrased tutorial to load a pretrained YOLO classification model, run inference, extract top‑1/top‑5 predictions, and overlay the best label on the image.
Audience: Python users (beginner–intermediate) working with computer vision.


Prerequisites

  • Python 3.8+
  • A GPU is optional; CPU works for small models.
  • Install packages:
pip install ultralytics pillow matplotlib

The ultralytics package provides YOLOv8 models, including classification variants (e.g., yolov8n-cls).


1) Load a Pretrained YOLO Classification Model

from ultralytics import YOLO

# Load a small, pretrained classification model (ImageNet-trained)
model = YOLO("yolov8n-cls.pt")  # alternatives: yolov8s-cls.pt, yolov8m-cls.pt

2) Pick an Image and (Optionally) Preview It

from PIL import Image
import matplotlib.pyplot as plt

img_path = "path/to/your/image.jpg"

# Preview the image
img = Image.open(img_path).convert("RGB")
plt.imshow(img)
plt.axis("off")
plt.show()

3) Run Inference

# Run classification inference; returns a list of results (one per image)
results = model.predict(source=img_path)

# Inspect raw result
res = results[0]
print("Classes (IDs):", res.probs.top5)
print("Top-5 confidences:", res.probs.top5conf)
print("Class names (model):", res.names)

Notes:

  • res.names maps class IDs (e.g., 0..999 for ImageNet) to human-readable labels.
  • res.probs.top1 and res.probs.top5 provide indices of the best classes.

4) Format Top‑1 and Top‑5 Predictions

# Top-1
top1_id = int(res.probs.top1)
top1_conf = float(res.probs.top1conf)
top1_label = res.names[top1_id]

print(f"Top-1: {top1_label} ({top1_conf*100:.2f}%)")

# Top-5
top5_ids = list(map(int, res.probs.top5))
top5_confs = list(map(float, res.probs.top5conf))
top5_labels = [res.names[i] for i in top5_ids]

for rank, (lbl, conf) in enumerate(zip(top5_labels, top5_confs), start=1):
    print(f"Top-{rank}: {lbl} ({conf*100:.2f}%)")

5) Overlay the Top‑1 Label on the Image

from PIL import ImageDraw, ImageFont

# Draw label on a copy of the image
overlay = img.copy()
draw = ImageDraw.Draw(overlay)

# Choose a font (system-dependent); fall back to default if unavailable
try:
    font = ImageFont.truetype("arial.ttf", 24)
except OSError:
    font = ImageFont.load_default()

text = f"{top1_label} ({top1_conf*100:.1f}%)"
text_color = (255, 255, 255)
box_color = (0, 0, 0)  # solid black (RGBA fills require an RGBA-mode image)

# Compute text size via textbbox (textsize was removed in Pillow 10)
left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
text_w, text_h = right - left, bottom - top
padding = 8
rect = [10, 10, 10 + text_w + 2*padding, 10 + text_h + 2*padding]

# Draw rectangle and text
draw.rectangle(rect, fill=box_color)
draw.text((10 + padding, 10 + padding), text, fill=text_color, font=font)

plt.imshow(overlay)
plt.axis("off")
plt.show()

6) Batch Classification & Saving Results (Optional)

import csv

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model.predict(source=images)

with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "top1_label", "top1_conf", "top5_labels", "top5_confs"])

    for img_path, res in zip(images, results):
        top1_id = int(res.probs.top1)
        top1_label = res.names[top1_id]
        top1_conf = float(res.probs.top1conf)

        top5_ids = list(map(int, res.probs.top5))
        top5_labels = [res.names[i] for i in top5_ids]
        top5_confs = list(map(float, res.probs.top5conf))

        writer.writerow([
            img_path,
            top1_label,
            f"{top1_conf:.6f}",
            ";".join(top5_labels),
            ";".join(f"{c:.6f}" for c in top5_confs)
        ])

Tips & Troubleshooting

  • Model choice: yolov8n-cls.pt is fast on CPU; use yolov8s/m/l-cls.pt for better accuracy on GPU.
  • Image preprocessing: The ultralytics pipeline handles resizing/normalization automatically when using model.predict.
  • Reproducibility: Set seeds (e.g., torch.manual_seed) if you retrain or fine‑tune models.
  • Performance: On CPU, expect modest latency; for production, prefer GPU or export to ONNX/TFLite.

Next Steps

  • Classify a different image.
  • Automate a folder pipeline and log predictions.
  • Integrate into a web API (FastAPI/Flask) or a mobile app.

Object Detection

Purpose: Paraphrased explanation suitable for documentation, READMEs, or educational notes.


Overview

Object detection is a computer vision task that identifies what objects are present in an image and where they are located. Unlike image classification, which assigns a single label to an entire image, object detection predicts:

  • Class labels for each object
  • Bounding boxes indicating object positions
  • Confidence scores for predictions

Key Concepts

  • Bounding Box: A rectangle that encloses the detected object.
  • Class Label: The category assigned to the object (e.g., dog, car).
  • Confidence Score: Indicates the model’s certainty about the prediction.

Real-World Applications

  • Detecting faces in smartphone cameras for autofocus and facial recognition.
  • Identifying pedestrians and vehicles in traffic surveillance footage.
  • Monitoring wildlife by detecting animals in images or videos captured by camera traps.

Approaches

Two-Stage Detectors

  • Process:
    1. Generate region proposals.
    2. Classify each region and refine bounding boxes.
  • Examples: Fast R-CNN, Faster R-CNN.
  • Pros: High accuracy.
  • Cons: Slower due to two-step process.

One-Stage Detectors

  • Process: Detect objects in a single pass without proposal generation.
  • Examples: YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector).
  • Pros: Faster, suitable for real-time applications.
  • Cons: May sacrifice some accuracy compared to two-stage methods.

Evaluation Metrics

  • IoU (Intersection over Union): Measures overlap between predicted and ground truth boxes.
    • IoU = (Area of Overlap) / (Area of Union).
    • Common threshold: IoU ≥ 0.5 for a correct detection.
  • mAP (Mean Average Precision): Summarizes precision-recall performance across all classes.
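
The IoU formula translates directly into code; a minimal sketch for boxes in (x1, y1, x2, y2) format:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping by half: overlap area 50, union 150
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```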

Common Pitfalls

  • Class imbalance: Leads to biased detection.
    • Remedies: Use balanced datasets, augmentation, or focal loss.
  • Small object detection: Harder for models; consider higher resolution or specialized architectures.
  • Overfitting: Apply regularization and strong augmentation.

Related Tasks

  • Image Classification: Assigns one label to the entire image.
  • Semantic Segmentation: Labels each pixel in the image.
  • Instance Segmentation: Combines detection and segmentation for individual objects.


Using a Pretrained YOLO Model for Object Detection in Python

Purpose: Step-by-step, paraphrased tutorial to load a pretrained YOLO detection model, run inference, parse results, and render bounding boxes with labels and confidence scores.
Audience: Python users (beginner–intermediate) working with computer vision.


Prerequisites

  • Python 3.8+
  • A GPU is optional; CPU works for small models.
  • Install packages:
pip install ultralytics pillow matplotlib

The ultralytics package provides YOLOv8 detection models (e.g., yolov8n.pt, yolov8s.pt).


1) Load a Pretrained YOLO Detection Model

from ultralytics import YOLO

# Load a small, pretrained YOLOv8 detection model
model = YOLO("yolov8n.pt")  # alternatives: yolov8s.pt, yolov8m.pt, yolov8l.pt

2) Pick an Image and Preview It

from PIL import Image
import matplotlib.pyplot as plt

img_path = "path/to/your/image.jpg"

# Preview the image
img = Image.open(img_path).convert("RGB")
plt.imshow(img)
plt.axis("off")
plt.show()

3) Run Detection Inference

# Run object detection; returns a list of results (one per image)
results = model.predict(source=img_path)

# Inspect raw result
res = results[0]
print(res)  # summary

# Boxes, class IDs, confidences
boxes = res.boxes  # XYXY boxes, confidences, class indices
print("Number of detections:", len(boxes))

# Example: print first detection
if len(boxes) > 0:
    b = boxes[0]
    print("xyxy:", b.xyxy.tolist())
    print("conf:", float(b.conf))
    print("cls:", int(b.cls))
    print("class name:", res.names[int(b.cls)])

4) Parse and Format Detections

# Extract all detections as dictionaries
parsed = []
for b in boxes:
    xyxy = b.xyxy.squeeze().tolist()  # [x1, y1, x2, y2]
    conf = float(b.conf)
    cls_id = int(b.cls)
    label = res.names[cls_id]
    parsed.append({"label": label, "conf": conf, "xyxy": xyxy})

# Sort by confidence (descending)
parsed.sort(key=lambda d: d["conf"], reverse=True)

for i, det in enumerate(parsed, start=1):
    x1, y1, x2, y2 = det["xyxy"]
    print(f"{i}. {det['label']} ({det['conf']*100:.2f}%) box=({x1:.1f},{y1:.1f},{x2:.1f},{y2:.1f})")

5) Render Bounding Boxes on the Image

from PIL import ImageDraw, ImageFont

overlay = img.copy()
draw = ImageDraw.Draw(overlay)

# Font setup (fallback to default)
try:
    font = ImageFont.truetype("arial.ttf", 18)
except OSError:
    font = ImageFont.load_default()

for det in parsed:
    x1, y1, x2, y2 = det["xyxy"]
    label = det["label"]
    conf = det["conf"]
    caption = f"{label} {conf*100:.1f}%"

    # Draw rectangle
    draw.rectangle([x1, y1, x2, y2], outline=(0, 255, 0), width=2)

    # Compute label size via textbbox (textsize was removed in Pillow 10)
    left, top, right, bottom = draw.textbbox((0, 0), caption, font=font)
    text_w, text_h = right - left, bottom - top
    pad = 4
    bg = [x1, max(0, y1 - text_h - 2*pad), x1 + text_w + 2*pad, y1]
    draw.rectangle(bg, fill=(0, 0, 0))  # solid fill; RGBA fills require an RGBA-mode image
    draw.text((x1 + pad, y1 - text_h - pad), caption, fill=(255, 255, 255), font=font)

plt.imshow(overlay)
plt.axis("off")
plt.show()

6) Batch Detection & Save CSV (Optional)

import csv

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model.predict(source=images)

with open("detections.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label", "conf", "x1", "y1", "x2", "y2"])

    for img_path, res in zip(images, results):
        for b in res.boxes:
            xyxy = b.xyxy.squeeze().tolist()
            conf = float(b.conf)
            cls_id = int(b.cls)
            label = res.names[cls_id]
            writer.writerow([img_path, label, f"{conf:.6f}", *[f"{v:.2f}" for v in xyxy]])

Tips & Troubleshooting

  • Model choice: yolov8n.pt is fast on CPU; use yolov8s/m/l.pt for better accuracy (prefer GPU).
  • Coordinate formats: YOLO returns XYXY by default via boxes.xyxy; other formats (XYWH) are available.
  • Confidence & NMS: Built-in post-processing filters boxes using confidence and non‑max suppression.
  • Video/Webcam: Use model.predict(source=0) for webcam or pass a video path; iterate over frames to save annotated outputs.
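
For intuition, the greedy non-max suppression step mentioned above can be sketched in plain NumPy. The ultralytics pipeline runs its own optimized version internally; this is purely illustrative:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the kept box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]        # drop boxes that overlap too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- the overlapping lower-score box is dropped
```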

Next Steps

  • Run detection on a directory of images.
  • Process videos or live streams.
  • Fine‑tune the pretrained model on your custom dataset.

Image Segmentation

Purpose: Paraphrased explanation suitable for documentation, READMEs, or educational notes.


Overview

Image segmentation is the process of dividing an image into multiple regions or segments, where each region corresponds to a specific object or part of an object. Unlike object detection, which draws bounding boxes, segmentation provides pixel-level classification, enabling precise understanding of object shapes and boundaries.

Example: Given an image of a cat and a dog, segmentation identifies exactly which pixels belong to the cat, which belong to the dog, and which belong to the background.


Types of Segmentation

Semantic Segmentation

  • Assigns each pixel to a category without distinguishing individual instances.
  • Example: Two cats in an image → all cat pixels labeled as “cat.”

Instance Segmentation

  • Differentiates between individual objects of the same category.
  • Example: Two cats → one set of pixels for Cat 1, another for Cat 2.

Real-World Applications

  • Video conferencing: Separate participant from background for virtual backgrounds.
  • Medical imaging: Outline tumors or organs in scans.
  • Autonomous driving: Identify drivable areas, lanes, pedestrians, and vehicles.

Evaluation Metrics

  • IoU (Intersection over Union): Measures overlap between predicted and ground truth masks.
    • IoU = (Area of Overlap) / (Area of Union).
    • Higher IoU indicates better segmentation accuracy.
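
For masks, the same formula applies at the pixel level; a minimal NumPy sketch with two toy 4×4 masks:

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

pred = np.zeros((4, 4), dtype=bool); pred[:, :2] = True  # left two columns
gt = np.zeros((4, 4), dtype=bool); gt[:, 1:3] = True     # middle two columns
print(mask_iou(pred, gt))  # 4 overlapping pixels / 12 in the union = 0.333...
```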

Why It Matters

Segmentation offers a finer-grained understanding of visual data than detection by labeling every pixel. This precision supports advanced tasks like:

  • Panoptic Segmentation: Combines semantic and instance segmentation.
  • Scene Understanding: Critical for robotics, AR/VR, and autonomous systems.
