This document explains how to use the image processing functionality in SmartBlogger, including both OCR and general image understanding capabilities.
SmartBlogger now supports comprehensive image processing capabilities:
- Optical Character Recognition (OCR): Extract text from images using the DeepSeek-OCR model with vLLM acceleration.
- General Image Understanding: Describe and analyze images using multimodal vision models.
This allows you to upload images and have both the text content extracted and the visual content described and analyzed.
- When you upload an image file (PNG, JPEG, BMP, or TIFF), the system automatically detects it as an image.
- The DeepSeek-OCR model processes the image and extracts text content using vLLM for acceleration.
- The extracted text is then treated like any other text input and processed through the normal workflow.
- The system generates summaries and uses the content for blog post generation.
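To make this flow concrete, here is a minimal sketch of how an upload might be routed. This is illustrative only: `extract_text_from_image` is a hypothetical stand-in for SmartBlogger's internal OCR pipeline, not its actual API.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}

def is_image_file(path: str) -> bool:
    """Detect image uploads by file extension."""
    return Path(path).suffix.lower() in IMAGE_EXTENSIONS

def extract_text_from_image(path: str) -> str:
    """Hypothetical wrapper around the DeepSeek-OCR + vLLM pipeline."""
    raise NotImplementedError("wire this to the OCR engine")

def ingest(path: str) -> str:
    """Route an upload: OCR for images, a plain read for text files."""
    if is_image_file(path):
        text = extract_text_from_image(path)
    else:
        text = Path(path).read_text()
    # From here the text follows the normal summarize/generate workflow.
    return text
```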
To use the image processing functionality, you need to have the following dependencies installed:
- vllm
- transformers
- torch
- einops
- addict
- easydict
- flash-attn
- Pillow
Install these dependencies using:
```bash
pip install vllm transformers torch einops addict easydict flash-attn Pillow
```
- Start the SmartBlogger application
- In the Content Input section, use the file uploader to upload image files
- The system will automatically process the images and extract text
- Continue with your normal workflow to generate blog posts
The image processor uses the following default settings:
- OCR Model: deepseek-ai/DeepSeek-OCR
- Vision Model: llava-hf/llava-1.5-7b-hf (example)
- Processing mode: Markdown (preserves document structure)
- vLLM acceleration with 50% GPU memory utilization for OCR
- vLLM acceleration with 30% GPU memory utilization for vision model
- Max model length: 4096 tokens for OCR, 2048 for vision
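As a rough illustration, these defaults map onto vLLM engine arguments as in the sketch below. This is not SmartBlogger's actual initialization code; in particular, `trust_remote_code=True` is an assumption (DeepSeek-OCR ships custom model code on the Hub), and whether both engines live in one process is an implementation detail.

```python
from vllm import LLM

# OCR engine: half the GPU, longer context for dense documents.
ocr_llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    gpu_memory_utilization=0.5,  # 50% of GPU memory
    max_model_len=4096,
    trust_remote_code=True,      # assumption: model uses custom Hub code
)

# Vision engine: smaller share of the GPU and a shorter context.
vision_llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    gpu_memory_utilization=0.3,  # 30% of GPU memory
    max_model_len=2048,
)
```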
If image processing is not working:
- Check that all required dependencies are installed (a quick check is sketched after this list)
- Verify that the models can be downloaded (requires internet connection)
- Check the application logs for error messages
- Ensure sufficient system resources (the models require significant memory)
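For the dependency check, one quick approach is to probe each module by its import name. Note that `flash-attn` imports as `flash_attn` and `Pillow` imports as `PIL`:

```python
import importlib.util

REQUIRED = ["vllm", "transformers", "torch", "einops",
            "addict", "easydict", "flash_attn", "PIL"]

missing = [name for name in REQUIRED
           if importlib.util.find_spec(name) is None]

if missing:
    print("missing dependencies:", ", ".join(missing))
else:
    print("all image-processing dependencies found")
```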
Using vLLM for image processing provides several performance benefits:
- Faster inference times
- Better memory management
- Support for continuous batching
- Optimized CUDA kernels
- Reduced latency for image processing
The image processor includes optimizations for Apple Silicon Macs:
- Automatic detection of Apple Silicon hardware (see the sketch after this list)
- GPU-based processing when sufficient GPU memory is available
- Conservative GPU memory utilization (50-60%)
- Full model sequence lengths for better quality
- Support for MLX acceleration (when available)
- CPU fallback when necessary
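The detection step might look like the sketch below. SmartBlogger's actual logic may differ, but the building blocks (the standard `platform` module plus PyTorch's MPS and CUDA checks) are the usual ones:

```python
import platform
import torch

def pick_device() -> str:
    """Prefer Apple's GPU (MPS) on arm64 Macs, then CUDA, then CPU."""
    is_apple_silicon = (platform.system() == "Darwin"
                        and platform.machine() == "arm64")
    if is_apple_silicon and torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print("selected device:", pick_device())
```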
On Apple Silicon machines with sufficient GPU memory (such as an M2), the system will automatically:
- Use GPU processing for better performance
- Maintain full model capabilities
- Utilize MLX acceleration when available
For NVIDIA GPU systems:
- The image processor is configured to use only 50% of GPU memory for OCR and 30% for vision to avoid conflicts with Ollama
- Ollama is given priority for GPU resources as it's the primary LLM engine
- Full model capabilities are available
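To sanity-check that the 50% + 30% split leaves headroom for Ollama, you can inspect free GPU memory before the engines start. A small sketch using PyTorch's `torch.cuda.mem_get_info`:

```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
    # OCR (50%) + vision (30%) target 80% of the card; if Ollama is
    # already resident, the free figure will be correspondingly lower.
```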
If you experience memory issues on any platform, consider:
- Running image processing on CPU by setting `gpu_memory_utilization=0.0`
- Processing images sequentially rather than concurrently
- Reducing the `max_model_len` parameter further (see the sketch after this list)
- Using smaller images when possible
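As an illustration of reducing `max_model_len` and processing sequentially, a lower-footprint configuration could look like this sketch (the numbers are placeholders, not recommendations, and `run_ocr` is a hypothetical helper, not a real API):

```python
from vllm import LLM

# A smaller context window and a smaller GPU share reduce peak memory.
ocr_llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    max_model_len=1024,          # reduced from the 4096 default
    gpu_memory_utilization=0.3,  # reduced from 0.5
    trust_remote_code=True,      # assumption, as above
)

# Process images one at a time rather than batching them:
# for image_path in image_paths:
#     text = run_ocr(ocr_llm, image_path)  # hypothetical helper
```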
- Image processing may take some time depending on image size and system resources
- Accuracy may vary based on image quality and text complexity
- The models require significant system resources (GPU recommended)