
Cleaning Service

A FastAPI service that processes and cleans scraped content to make it suitable for use as a knowledge base for large language models. The service removes HTML tags, extracts text content, and normalizes various document formats.

Overview

The cleaning service is responsible for:

  • Cleaning HTML content by removing tags and extracting readable text
  • Processing various document formats (PDF, DOCX, DOC, etc.)
  • Generating cleaned text files for LLM consumption
  • Updating metadata to track cleaning status
  • Logging cleaning operations for monitoring

Architecture

cleaning/
├── api/                   # FastAPI application
│   ├── app.py            # Main API endpoints
│   ├── models.py         # Pydantic models
│   ├── config.py         # Configuration settings
│   └── __init__.py
├── worker/               # Processing workers
│   ├── tasks.py          # Cleaning task implementations
│   ├── utils.py          # Utility functions
│   └── __init__.py
├── debug.py             # Debug utilities
├── requirements.txt     # Python dependencies
└── Dockerfile          # Container configuration

Features

Content Cleaning

  1. HTML Cleaning: Removes HTML tags, CSS styles, and JavaScript while preserving text content
  2. Document Processing: Handles PDF, DOCX, DOC, and other document formats using the Unstructured library
  3. Text Extraction: Converts complex documents into clean, readable text
  4. Metadata Preservation: Maintains original metadata while adding cleaning status
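The HTML cleaning step can be sketched as follows. The service itself uses BeautifulSoup; this illustration relies only on the stdlib `html.parser` so it stands alone, but the behavior shown (drop tags, CSS, and JavaScript, keep readable text) matches the description above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only outside skipped elements, trimmed of stray whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

For example, `clean_html("<p>Hi <b>there</b></p><script>alert(1)</script>")` keeps the paragraph text and drops the script body entirely.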

Processing Capabilities

  • Multi-format Support: HTML, PDF, DOCX, DOC, and other document types
  • Language Support: Configurable language processing for international content
  • Error Handling: Comprehensive error logging and recovery
  • File Management: Temporary file handling and cleanup

API Endpoints

POST /clean_file

Process and clean a file, extracting readable text content.

Request Body:

{
  "file_path": "/path/to/file.html",
  "meta_data_path": "/path/to/file.meta.json",
  "directory_path": "/path/to/directory",
  "source_file_id": "uuid",
  "url": "https://source-url.com",
  "logs_path": "/path/to/logfile.log"
}

Response:

  • HTTP 200 on successful processing
  • Generates cleaned.txt and cleaned.meta.json files
  • Updates the database via the Ruuter Internal API
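A request to the endpoint can be assembled as below. All paths, the UUID, and the URL are illustrative values, and the commented-out HTTP call is only one way to send the body.

```python
import json

# Hypothetical payload for POST /clean_file; every value here is illustrative.
payload = {
    "file_path": "/data/site/index.html",
    "meta_data_path": "/data/site/index.meta.json",
    "directory_path": "/data/site",
    "source_file_id": "123e4567-e89b-12d3-a456-426614174000",
    "url": "https://example.com/page",
    "logs_path": "/data/site/cleaning.log",
}

body = json.dumps(payload)
# Sent with any HTTP client, e.g.:
#   requests.post("http://localhost:8001/clean_file", data=body,
#                 headers={"Content-Type": "application/json"})
```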

Data Models

EntityToClean

{
  "file_path": "FilePath",           # Path to original file
  "meta_data_path": "FilePath",      # Path to metadata file
  "directory_path": "DirectoryPath", # Working directory
  "source_file_id": "str",           # Unique file identifier
  "url": "str",                      # Original source URL
  "logs_path": "FilePath"            # Path to log file
}
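In code, the model above maps to something like the following. The service defines it with Pydantic (where `FilePath` and `DirectoryPath` also validate that the paths exist); a stdlib dataclass is used here as a self-contained approximation.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EntityToClean:
    """Request body for /clean_file (stdlib approximation of the Pydantic model)."""

    file_path: Path        # Path to the original file
    meta_data_path: Path   # Path to the metadata file
    directory_path: Path   # Working directory
    source_file_id: str    # Unique file identifier
    url: str               # Original source URL
    logs_path: Path        # Path to the log file
```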

Environment Variables

  • RUUTER_INTERNAL: Internal Ruuter service URL for database operations
  • LANGUAGES: Supported languages for document processing (configurable in settings)
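Reading these variables might look like the sketch below. The service actually uses pydantic-settings in api/config.py; the default values and the comma-separated LANGUAGES format shown here are assumptions for illustration.

```python
import os


def load_settings(env=os.environ):
    """Illustrative reading of the two documented variables.

    Defaults and the comma-separated LANGUAGES format are assumptions;
    the service itself uses pydantic-settings in api/config.py.
    """
    return {
        "ruuter_internal": env.get("RUUTER_INTERNAL", "http://ruuter-internal:8080"),
        "languages": env.get("LANGUAGES", "est,eng").split(","),
    }
```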

Dependencies

  • FastAPI 0.115.12: Web framework for API endpoints
  • Unstructured 0.18.2: Document processing library with PDF/DOCX support
  • BeautifulSoup4 4.13.4: HTML parsing and text extraction
  • Pydantic Settings 2.10.1: Configuration management

Running the Service

Development

# Install dependencies
pip install -r requirements.txt

# Start the FastAPI server
uvicorn api.app:app --host 0.0.0.0 --port 8001 --reload

Docker

# Build the image
docker build -t cleaning .

# Run the container
docker run -p 8001:8001 \
  -e RUUTER_INTERNAL="http://ruuter-internal:8080" \
  -v /data:/data \
  cleaning

Processing Flow

  1. File Reception: API receives cleaning request with file paths and metadata
  2. Content Type Detection: Determines processing strategy based on file type
  3. Text Extraction:
    • HTML files: BeautifulSoup for tag removal and text extraction
    • Other files: Unstructured library for document parsing
  4. Text Cleaning: Removes excess whitespace and formatting artifacts
  5. File Generation: Creates cleaned.txt with processed content
  6. Metadata Update: Updates metadata with cleaning status and timestamps
  7. Storage: Uploads cleaned files via the file processing service
  8. Database Update: Updates source file records via Ruuter Internal
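Steps 2 through 4 of the flow can be sketched as below. The strategy names and suffix set are illustrative; the real selection logic lives in worker/tasks.py.

```python
from pathlib import Path

# Suffixes routed to the HTML path; everything else goes to Unstructured.
HTML_SUFFIXES = {".html", ".htm"}


def pick_strategy(file_path: str) -> str:
    """Step 2: choose a processing strategy from the file extension."""
    suffix = Path(file_path).suffix.lower()
    return "beautifulsoup" if suffix in HTML_SUFFIXES else "unstructured"


def normalize_whitespace(text: str) -> str:
    """Step 4: strip excess whitespace and blank lines from extracted text."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```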

Integration

The cleaning service integrates with:

  • Scrapper Service: Receives files to be cleaned after scraping
  • File Processing: Uploads cleaned content to blob storage
  • Ruuter Internal: Updates database records with cleaning status
  • Scheduler: Triggered as part of the automated data processing pipeline

Error Handling

  • Logging: Comprehensive logging to both files and console
  • Error Recovery: Graceful handling of file processing errors
  • Status Tracking: Updates database with error states when processing fails
  • Resource Cleanup: Proper cleanup of temporary files and resources
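The pattern described above (log the failure, report an error status, always release resources) can be sketched like this; the function name and return shape are illustrative, and the real implementation lives in worker/tasks.py.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("cleaning")


def clean_with_recovery(clean_fn, file_path: str) -> dict:
    """Run a cleaning callable with logging, error recovery, and cleanup.

    Illustrative sketch: clean_fn(file_path, tmp_dir) is any callable that
    returns extracted text or raises on failure.
    """
    tmp_dir = tempfile.mkdtemp(prefix="cleaning-")
    try:
        text = clean_fn(file_path, tmp_dir)
        return {"status": "cleaned", "text": text}
    except Exception as exc:
        # Error recovery: log with traceback, surface an error state instead
        # of crashing the worker.
        logger.exception("Cleaning failed for %s", file_path)
        return {"status": "error", "error": str(exc)}
    finally:
        # Resource cleanup: always remove the temporary working directory.
        for child in Path(tmp_dir).iterdir():
            child.unlink()
        Path(tmp_dir).rmdir()
```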

Configuration

Key settings in api/config.py:

  • Language configuration for document processing
  • File path handling
  • Database connection settings
  • Logging configuration

Output

For each cleaned file, the service generates:

  • cleaned.txt: Plain text content suitable for LLM processing
  • cleaned.meta.json: Updated metadata with cleaning status
  • Log entries for monitoring and debugging
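Producing the two output files can be sketched as below; the status field names written into the metadata are assumptions, not the service's actual schema.

```python
import json
import time
from pathlib import Path


def write_outputs(directory: Path, text: str, meta: dict) -> None:
    """Write cleaned.txt and cleaned.meta.json into the working directory.

    Illustrative sketch: "cleaning_status" and "cleaned_at" are assumed
    field names, not the service's documented schema.
    """
    (directory / "cleaned.txt").write_text(text, encoding="utf-8")
    stamped = {**meta, "cleaning_status": "cleaned", "cleaned_at": int(time.time())}
    (directory / "cleaned.meta.json").write_text(
        json.dumps(stamped, indent=2), encoding="utf-8"
    )
```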