Computer Vision is a field at the intersection of Artificial Intelligence (AI) and Computer Science. Its goal is to enable machines to interpret and understand visual data (images, videos) in a way similar to human perception.
Computer Vision focuses on:
- Recognizing objects in images or videos
- Interpreting scenes and making decisions based on visual input
- Understanding motion and spatial relationships
- Facial Recognition (e.g., smartphones)
- Autonomous Vehicles (detecting pedestrians, traffic signs)
- Medical Imaging (assisting doctors with scans)
- Robotics (navigation and interaction)
- Image Classification: Assign a label to an entire image (e.g., "cat", "car").
- Object Detection: Identify and locate objects using bounding boxes.
- Segmentation: Divide an image into regions and label each pixel.
- Recognition & Identification: Match faces or interpret handwritten digits.
- Motion Analysis: Track movement across video frames.
- 3D Scene Reconstruction: Infer depth and spatial relationships from 2D images.
Computer Vision is transforming industries by giving machines the ability to see, interpret, and respond to visual information. Its impact will continue to grow across countless domains.
- OpenCV Documentation
- [Computer Vision Basics](https://en.wikipedia.org/wiki/Computer_vision)
Computer Vision has evolved significantly over the decades. Below is a timeline of key milestones:
- Relied on manually crafted algorithms.
- Techniques like:
  - Edge Detection
  - Feature Extraction
- Worked well in controlled environments but struggled with real-world complexity.
- Introduction of Machine Learning models such as:
  - Support Vector Machines (SVMs)
- Provided modest improvements but still limited by data and computational power.
- Convolutional Neural Networks (CNNs) were introduced in the late 1980s but became practical much later.
- Two major breakthroughs between 2010–2012:
  - GPU Acceleration: Enabled faster training of deep neural networks.
  - Large-Scale Datasets: For example, ImageNet with millions of labeled images.
- AlexNet (2012):
  - Dramatically improved performance in the ImageNet Challenge.
  - Reduced classification error rates significantly.
  - Sparked the modern wave of Computer Vision innovation.
- CNN-based models dominate state-of-the-art systems.
- Achieve performance comparable to (and sometimes surpassing) human accuracy on benchmarks like ImageNet.
- ImageNet Project
- History of CNNs
ImageNet is one of the most influential projects in the history of Computer Vision and Artificial Intelligence. It provided the foundation for modern deep learning breakthroughs.
- A large-scale image dataset introduced in 2009.
- Contains millions of labeled images across 1,000+ categories.
- Designed to advance research in visual recognition.
Before ImageNet:
- Computer Vision models were limited by small datasets.
- Deep learning was impractical due to lack of data and computational power.
ImageNet changed this by:
- Offering massive labeled data for training.
- Enabling benchmark competitions like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
- A deep Convolutional Neural Network (CNN) trained on ImageNet.
- Achieved a dramatic reduction in error rates.
- Sparked the deep learning revolution in Computer Vision.
- CNN-based models became the standard for image recognition.
- Inspired architectures like VGG, ResNet, and EfficientNet.
- Performance now rivals or surpasses human-level accuracy on benchmarks.
- ImageNet Official Site
- [ILSVRC Challenge](https://image-net.org/challenges/LSVRC/)
A digital image is essentially a collection of numbers arranged in a grid. These numbers represent pixel intensity values.
- Grayscale images are represented as a 2D array: `height × width`.
  - Each pixel value indicates brightness: `0` → Black, `255` → White (for 8-bit encoding).
- Color images are represented as a 3D array: `height × width × channels`.
  - Channels: Red, Green, Blue.
  - Example: a `64 × 64` color image → `64 × 64 × 3`.
  - Pixel values range from `(0, 0, 0)` → Black to `(255, 255, 255)` → White.
- Bit depth sets the value range:
  - 8-bit → values from `0–255`
  - 16-bit → High Color
  - 24-bit → True Color
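The array shapes described above can be checked directly; a minimal NumPy sketch:

```python
import numpy as np

# Grayscale: 2D array (height × width), 8-bit values in 0–255
gray = np.zeros((64, 64), dtype=np.uint8)   # all-black image
gray[32, 32] = 255                          # one white pixel

# Color: 3D array (height × width × channels), RGB order
color = np.zeros((64, 64, 3), dtype=np.uint8)
color[:, :] = (255, 255, 255)               # all-white image

print(gray.shape)    # (64, 64)
print(color.shape)   # (64, 64, 3)
```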
- Normalization: Scale pixel values to a smaller range (e.g., `0–1`).
- Tensor Formats:
  - TensorFlow/Keras → Channels Last: `(height, width, channels)`, e.g., `256 × 256 × 3`.
  - PyTorch → Channels First: `(channels, height, width)`, e.g., `3 × 256 × 256`.
- Images are arrays of pixel values.
- Color images use multiple channels (RGB).
- Frameworks differ in tensor layout → configure CNN input accordingly.
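The layout difference is just an axis permutation; a minimal NumPy sketch (PyTorch's `Tensor.permute` does the same for tensors):

```python
import numpy as np

# Channels-last image (TensorFlow/Keras): height × width × channels
hwc = np.zeros((256, 256, 3), dtype=np.float32)

# Reorder axes to channels-first (PyTorch): channels × height × width
chw = np.transpose(hwc, (2, 0, 1))
print(hwc.shape, chw.shape)  # (256, 256, 3) (3, 256, 256)

# Normalization: map 8-bit values 0–255 into 0–1
pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = pixels.astype(np.float32) / 255.0   # ≈ [0.0, 0.502, 1.0]
```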
- TensorFlow Image Guide
- PyTorch Image Tensors
Before training deep learning models, raw images typically need to be resized, center-cropped (optional), batched, and normalized.
This guide shows how to do that in TensorFlow/Keras and PyTorch, plus how to verify the preprocessing results.
```bash
# One (or both) of these, depending on your stack
pip install tensorflow==2.*   # Keras included
pip install torch torchvision torchaudio
pip install matplotlib
```

Image Augmentation is a technique used to artificially increase the diversity of a training dataset by applying random transformations to existing images. This helps improve model generalization without collecting more data.
- Prevent overfitting by introducing variability.
- Improve robustness to real-world conditions.
- Simulate changes in orientation, lighting, and occlusion.
⚠️ Apply augmentation only to training data, not validation or test sets.
Image classification is the process of assigning a single label to an entire image from a predefined set of categories. For example, a photo might be classified as cat, dog, bird, or fish, depending on which category best matches its content.
This task focuses on identifying what is in the image, not where objects are located or their exact shape.
- Goal: Predict one class label for the entire image.
- Output:
- Predicted label
- Confidence score (model certainty)
- Examples:
- Labeling X-rays as healthy or diseased
- Identifying plant species from leaf photos
- Recognizing handwritten digits in postal codes
Convolutional Neural Networks (CNNs) excel at image classification because they learn hierarchical features:
- Low-level: edges, textures
- Mid-level: shapes, patterns
- High-level: object parts and full objects
This layered representation enables robust predictions across variations in scale, lighting, and viewpoint.
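As an illustration of this layered design, here is a toy CNN classifier in PyTorch. The layer sizes are arbitrary choices for a 3×32×32 input, not a reference architecture:

```python
import torch
import torch.nn as nn

# Stacked conv blocks: early layers respond to edges/textures, later layers
# combine them into shapes and object parts before classification.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # 10-class logits
)

x = torch.randn(1, 3, 32, 32)   # one fake RGB image
logits = model(x)
print(logits.shape)             # torch.Size([1, 10])
```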
- Accuracy: proportion of correctly classified images.
- Example: 960 correct out of 1200 → 80% accuracy.
- Additional metrics for imbalanced data:
- Precision
- Recall
- F1 Score
- Top-k accuracy: checks if the correct label is among the top-k predictions.
Accuracy alone can be misleading for skewed datasets; consider precision/recall and confusion matrices.
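The metrics above can be computed with plain NumPy; a minimal sketch (the labels and scores are made up for illustration):

```python
import numpy as np

# Ground-truth labels and model scores for 5 images, 3 classes
y_true = np.array([0, 1, 2, 2, 1])
scores = np.array([
    [0.7, 0.2, 0.1],   # predicted 0 (correct)
    [0.1, 0.8, 0.1],   # predicted 1 (correct)
    [0.5, 0.2, 0.3],   # predicted 0 (wrong, but class 2 is in the top-2)
    [0.1, 0.2, 0.7],   # predicted 2 (correct)
    [0.6, 0.3, 0.1],   # predicted 0 (wrong, but class 1 is in the top-2)
])

# Accuracy: fraction of top-1 predictions matching the label
y_pred = scores.argmax(axis=1)
accuracy = (y_pred == y_true).mean()

# Top-k accuracy: is the true label among the k highest-scoring classes?
k = 2
topk = np.argsort(scores, axis=1)[:, -k:]   # top-2 class IDs per row
topk_acc = np.mean([label in row for label, row in zip(y_true, topk)])

print(accuracy)   # 0.6
print(topk_acc)   # 1.0
```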
- Class imbalance: leads to biased predictions.
- Remedies: class-weighted loss, oversampling, augmentation.
- Overfitting: occurs with small datasets.
- Remedies: dropout, weight decay, transfer learning.
- Data leakage: ensure strict separation of train/validation/test sets.
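One of the remedies above, class-weighted loss, is a one-line change in PyTorch. The weights here are hypothetical (e.g., inverse class frequencies):

```python
import torch
import torch.nn as nn

# Suppose class 0 is ~9x more common than class 1 in the training set;
# up-weight the rare class so mistakes on it cost more.
weights = torch.tensor([1.0, 9.0])
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.tensor([[2.0, -1.0], [0.5, 0.5]])  # fake model outputs
labels = torch.tensor([0, 1])

loss = criterion(logits, labels)
plain = nn.CrossEntropyLoss()(logits, labels)
print(loss.item() > plain.item())  # True: the up-weighted rare-class error dominates
```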
- Object Detection: predicts bounding boxes and labels for multiple objects.
- Semantic Segmentation: assigns a label to each pixel.
- Instance Segmentation: segments and labels each object instance.
- Deep Learning by Goodfellow et al.
- ImageNet Challenge benchmarks
- PyTorch and TensorFlow documentation for model implementations
Purpose: Step-by-step, paraphrased tutorial to load a pretrained YOLO classification model, run inference, extract top‑1/top‑5 predictions, and overlay the best label on the image.
Audience: Python users (beginner–intermediate) working with computer vision.
- Python 3.8+
- A GPU is optional; CPU works for small models.
- Install packages:
```bash
pip install ultralytics pillow matplotlib
```

The `ultralytics` package provides YOLOv8 models, including classification variants (e.g., `yolov8n-cls`).
```python
from ultralytics import YOLO

# Load a small, pretrained classification model (ImageNet-trained)
model = YOLO("yolov8n-cls.pt")  # alternatives: yolov8s-cls.pt, yolov8m-cls.pt
```

```python
from PIL import Image
import matplotlib.pyplot as plt

img_path = "path/to/your/image.jpg"

# Preview the image
img = Image.open(img_path).convert("RGB")
plt.imshow(img)
plt.axis("off")
plt.show()
```

```python
# Run classification inference; returns a list of results (one per image)
results = model.predict(source=img_path)

# Inspect the raw result
res = results[0]
print("Classes (IDs):", res.probs.top5)
print("Top-5 confidences:", res.probs.top5conf)
print("Class names (model):", res.names)
```

Notes:

- `res.names` maps class IDs (e.g., 0..999 for ImageNet) to human-readable labels.
- `res.probs.top1` and `res.probs.top5` provide indices of the best classes.
```python
# Top-1
top1_id = int(res.probs.top1)
top1_conf = float(res.probs.top1conf)
top1_label = res.names[top1_id]
print(f"Top-1: {top1_label} ({top1_conf*100:.2f}%)")

# Top-5
top5_ids = list(map(int, res.probs.top5))
top5_confs = list(map(float, res.probs.top5conf))
top5_labels = [res.names[i] for i in top5_ids]
for rank, (lbl, conf) in enumerate(zip(top5_labels, top5_confs), start=1):
    print(f"Top-{rank}: {lbl} ({conf*100:.2f}%)")
```

```python
from PIL import ImageDraw, ImageFont

# Draw the label on a copy of the image; "RGBA" draw mode enables
# semi-transparent fills on the RGB image
overlay = img.copy()
draw = ImageDraw.Draw(overlay, "RGBA")

# Choose a font (system-dependent); fall back to default if unavailable
try:
    font = ImageFont.truetype("arial.ttf", 24)
except OSError:
    font = ImageFont.load_default()

text = f"{top1_label} ({top1_conf*100:.1f}%)"
text_color = (255, 255, 255)
box_color = (0, 0, 0, 160)  # semi-transparent black

# Compute text size (textbbox replaces textsize, removed in Pillow 10)
# and build a background rectangle
left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
text_w, text_h = right - left, bottom - top
padding = 8
rect = [10, 10, 10 + text_w + 2*padding, 10 + text_h + 2*padding]

# Draw rectangle and text
draw.rectangle(rect, fill=box_color)
draw.text((10 + padding, 10 + padding), text, fill=text_color, font=font)

plt.imshow(overlay)
plt.axis("off")
plt.show()
```
```python
import csv

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model.predict(source=images)

with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "top1_label", "top1_conf", "top5_labels", "top5_confs"])
    for img_path, res in zip(images, results):
        top1_id = int(res.probs.top1)
        top1_label = res.names[top1_id]
        top1_conf = float(res.probs.top1conf)
        top5_ids = list(map(int, res.probs.top5))
        top5_labels = [res.names[i] for i in top5_ids]
        top5_confs = list(map(float, res.probs.top5conf))
        writer.writerow([
            img_path,
            top1_label,
            f"{top1_conf:.6f}",
            ";".join(top5_labels),
            ";".join(f"{c:.6f}" for c in top5_confs),
        ])
```

- Model choice: `yolov8n-cls.pt` is fast on CPU; use `yolov8s/m/l-cls.pt` for better accuracy on GPU.
- Image preprocessing: The `ultralytics` pipeline handles resizing/normalization automatically when using `model.predict`.
- Reproducibility: Set seeds (e.g., `torch.manual_seed`) if you retrain or fine-tune models.
- Performance: On CPU, expect modest latency; for production, prefer GPU or export to ONNX/TFLite.
- Classify a different image.
- Automate a folder pipeline and log predictions.
- Integrate into a web API (FastAPI/Flask) or a mobile app.
Purpose: Paraphrased explanation suitable for documentation, READMEs, or educational notes.
Object detection is a computer vision task that identifies what objects are present in an image and where they are located. Unlike image classification, which assigns a single label to an entire image, object detection predicts:
- Class labels for each object
- Bounding boxes indicating object positions
- Confidence scores for predictions
- Bounding Box: A rectangle that encloses the detected object.
- Class Label: The category assigned to the object (e.g., dog, car).
- Confidence Score: Indicates the model’s certainty about the prediction.
- Detecting faces in smartphone cameras for autofocus and facial recognition.
- Identifying pedestrians and vehicles in traffic surveillance footage.
- Monitoring wildlife by detecting animals in images or videos captured by camera traps.
- Process:
- Generate region proposals.
- Classify each region and refine bounding boxes.
- Examples: Fast R-CNN, Faster R-CNN.
- Pros: High accuracy.
- Cons: Slower due to two-step process.
- Process: Detect objects in a single pass without proposal generation.
- Examples: YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector).
- Pros: Faster, suitable for real-time applications.
- Cons: May sacrifice some accuracy compared to two-stage methods.
- IoU (Intersection over Union): Measures overlap between predicted and ground truth boxes.
- IoU = (Area of Overlap) / (Area of Union).
- Common threshold: IoU ≥ 0.5 for a correct detection.
- mAP (Mean Average Precision): Summarizes precision-recall performance across all classes.
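The IoU formula above is straightforward to implement for axis-aligned boxes in `(x1, y1, x2, y2)` format; a minimal sketch:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih

    # Union = area(a) + area(b) - intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping half of a 10x10 ground-truth box
pred = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
print(box_iou(pred, truth))  # 50 / 150 ≈ 0.333 -> below the 0.5 threshold
```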
- Class imbalance: Leads to biased detection.
- Remedies: Use balanced datasets, augmentation, or focal loss.
- Small object detection: Harder for models; consider higher resolution or specialized architectures.
- Overfitting: Apply regularization and strong augmentation.
- Image Classification: Assigns one label to the entire image.
- Semantic Segmentation: Labels each pixel in the image.
- Instance Segmentation: Combines detection and segmentation for individual objects.
- Girshick et al. — Fast R-CNN, Faster R-CNN papers.
- YOLO: Redmon et al. — https://pjreddie.com/darknet/yolo/
- SSD: Liu et al. — https://arxiv.org/abs/1512.02325
Purpose: Step-by-step, paraphrased tutorial to load a pretrained YOLO detection model, run inference, parse results, and render bounding boxes with labels and confidence scores.
Audience: Python users (beginner–intermediate) working with computer vision.
- Python 3.8+
- A GPU is optional; CPU works for small models.
- Install packages:
```bash
pip install ultralytics pillow matplotlib
```

The `ultralytics` package provides YOLOv8 detection models (e.g., `yolov8n.pt`, `yolov8s.pt`).
```python
from ultralytics import YOLO

# Load a small, pretrained YOLOv8 detection model
model = YOLO("yolov8n.pt")  # alternatives: yolov8s.pt, yolov8m.pt, yolov8l.pt
```

```python
from PIL import Image
import matplotlib.pyplot as plt

img_path = "path/to/your/image.jpg"

# Preview the image
img = Image.open(img_path).convert("RGB")
plt.imshow(img)
plt.axis("off")
plt.show()
```

```python
# Run object detection; returns a list of results (one per image)
results = model.predict(source=img_path)

# Inspect the raw result
res = results[0]
print(res)  # summary

# Boxes, class IDs, confidences
boxes = res.boxes  # XYXY boxes, confidences, class indices
print("Number of detections:", len(boxes))

# Example: print the first detection
if len(boxes) > 0:
    b = boxes[0]
    print("xyxy:", b.xyxy.tolist())
    print("conf:", float(b.conf))
    print("cls:", int(b.cls))
    print("class name:", res.names[int(b.cls)])
```
```python
# Extract all detections as dictionaries
parsed = []
for b in boxes:
    xyxy = b.xyxy.squeeze().tolist()  # [x1, y1, x2, y2]
    conf = float(b.conf)
    cls_id = int(b.cls)
    label = res.names[cls_id]
    parsed.append({"label": label, "conf": conf, "xyxy": xyxy})

# Sort by confidence (descending)
parsed.sort(key=lambda d: d["conf"], reverse=True)

for i, det in enumerate(parsed, start=1):
    x1, y1, x2, y2 = det["xyxy"]
    print(f"{i}. {det['label']} ({det['conf']*100:.2f}%) box=({x1:.1f},{y1:.1f},{x2:.1f},{y2:.1f})")
```
```python
from PIL import ImageDraw, ImageFont

# "RGBA" draw mode enables semi-transparent fills on the RGB image
overlay = img.copy()
draw = ImageDraw.Draw(overlay, "RGBA")

# Font setup (fallback to default)
try:
    font = ImageFont.truetype("arial.ttf", 18)
except OSError:
    font = ImageFont.load_default()

for det in parsed:
    x1, y1, x2, y2 = det["xyxy"]
    label = det["label"]
    conf = det["conf"]
    caption = f"{label} {conf*100:.1f}%"

    # Draw the bounding box
    draw.rectangle([x1, y1, x2, y2], outline=(0, 255, 0), width=2)

    # Draw label background and text
    # (textbbox replaces textsize, removed in Pillow 10)
    left, top, right, bottom = draw.textbbox((0, 0), caption, font=font)
    text_w, text_h = right - left, bottom - top
    pad = 4
    bg = [x1, max(0, y1 - text_h - 2*pad), x1 + text_w + 2*pad, y1]
    draw.rectangle(bg, fill=(0, 0, 0, 160))
    draw.text((x1 + pad, y1 - text_h - pad), caption, fill=(255, 255, 255), font=font)

plt.imshow(overlay)
plt.axis("off")
plt.show()
```
```python
import csv

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model.predict(source=images)

with open("detections.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label", "conf", "x1", "y1", "x2", "y2"])
    for img_path, res in zip(images, results):
        for b in res.boxes:
            xyxy = b.xyxy.squeeze().tolist()
            conf = float(b.conf)
            cls_id = int(b.cls)
            label = res.names[cls_id]
            writer.writerow([img_path, label, f"{conf:.6f}", *[f"{v:.2f}" for v in xyxy]])
```

- Model choice: `yolov8n.pt` is fast on CPU; use `yolov8s/m/l.pt` for better accuracy (prefer GPU).
- Coordinate formats: YOLO returns XYXY by default via `boxes.xyxy`; other formats (XYWH) are available.
- Confidence & NMS: Built-in post-processing filters boxes using confidence and non-max suppression.
- Video/Webcam: Use `model.predict(source=0)` for webcam or pass a video path; iterate over frames to save annotated outputs.
- Run detection on a directory of images.
- Process videos or live streams.
- Fine‑tune the pretrained model on your custom dataset.
Purpose: Paraphrased explanation suitable for documentation, READMEs, or educational notes.
Image segmentation is the process of dividing an image into multiple regions or segments, where each region corresponds to a specific object or part of an object. Unlike object detection, which draws bounding boxes, segmentation provides pixel-level classification, enabling precise understanding of object shapes and boundaries.
Example: Given an image of a cat and a dog, segmentation identifies exactly which pixels belong to the cat, which belong to the dog, and which belong to the background.
- Assigns each pixel to a category without distinguishing individual instances.
- Example: Two cats in an image → all cat pixels labeled as “cat.”
- Differentiates between individual objects of the same category.
- Example: Two cats → one set of pixels for Cat 1, another for Cat 2.
- Video conferencing: Separate participant from background for virtual backgrounds.
- Medical imaging: Outline tumors or organs in scans.
- Autonomous driving: Identify drivable areas, lanes, pedestrians, and vehicles.
- IoU (Intersection over Union): Measures overlap between predicted and ground truth masks.
- IoU = (Area of Overlap) / (Area of Union).
- Higher IoU indicates better segmentation accuracy.
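For segmentation, the same IoU formula applies to boolean masks rather than boxes; a minimal NumPy sketch:

```python
import numpy as np

def mask_iou(pred, truth):
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union > 0 else 0.0

# Toy 4x4 masks: the prediction covers the left half, the ground truth the top half
pred = np.zeros((4, 4), dtype=bool)
pred[:, :2] = True
truth = np.zeros((4, 4), dtype=bool)
truth[:2, :] = True

print(mask_iou(pred, truth))  # 4 / 12 ≈ 0.333
```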
Segmentation offers a finer-grained understanding of visual data than detection by labeling every pixel. This precision supports advanced tasks like:
- Panoptic Segmentation: Combines semantic and instance segmentation.
- Scene Understanding: Critical for robotics, AR/VR, and autonomous systems.
- Long et al. — Fully Convolutional Networks for Semantic Segmentation.
- Mask R-CNN: He et al. — https://arxiv.org/abs/1703.06870
- YOLO Segmentation: Ultralytics Docs — https://docs.ultralytics.com