A FastAPI service that processes and cleans scraped content to make it suitable for use as a knowledge base for large language models. The service removes HTML tags, extracts text content, and normalizes various document formats.
The cleaning service is responsible for:
- Cleaning HTML content by removing tags and extracting readable text
- Processing various document formats (PDF, DOCX, DOC, etc.)
- Generating cleaned text files for LLM consumption
- Updating metadata to track cleaning status
- Logging cleaning operations for monitoring
cleaning/
├── api/ # FastAPI application
│ ├── app.py # Main API endpoints
│ ├── models.py # Pydantic models
│ ├── config.py # Configuration settings
│ └── __init__.py
├── worker/ # Processing workers
│ ├── tasks.py # Cleaning task implementations
│ ├── utils.py # Utility functions
│ └── __init__.py
├── debug.py # Debug utilities
├── requirements.txt # Python dependencies
└── Dockerfile # Container configuration
- HTML Cleaning: Removes HTML tags, CSS styles, and JavaScript while preserving text content
- Document Processing: Handles PDF, DOCX, DOC, and other document formats using the Unstructured library
- Text Extraction: Converts complex documents into clean, readable text
- Metadata Preservation: Maintains original metadata while adding cleaning status
- Multi-format Support: HTML, PDF, DOCX, DOC, and other document types
- Language Support: Configurable language processing for international content
- Error Handling: Comprehensive error logging and recovery
- File Management: Temporary file handling and cleanup
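The HTML-cleaning feature can be sketched with BeautifulSoup. This is a minimal illustration only; the service's actual implementation lives in worker/tasks.py and may differ in detail:

```python
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Strip tags, CSS, and JavaScript, keeping readable text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-content elements entirely
    for element in soup(["script", "style", "noscript"]):
        element.decompose()
    # Collapse whitespace into single spaces between text fragments
    return " ".join(soup.get_text(separator=" ").split())
```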
Process and clean a file, extracting readable text content.
Request Body:
{
"file_path": "/path/to/file.html",
"meta_data_path": "/path/to/file.meta.json",
"directory_path": "/path/to/directory",
"source_file_id": "uuid",
"url": "https://source-url.com",
"logs_path": "/path/to/logfile.log"
}

Response:
- HTTP 200 on successful processing
- Generates cleaned.txt and cleaned.meta.json files
- Updates database via Ruuter Internal API
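A request can be assembled as shown below. All paths and the UUID are placeholders, and the endpoint route is not reproduced here; check api/app.py for the actual route:

```python
import json

# Example request body; every value here is a placeholder.
payload = {
    "file_path": "/data/site/page.html",
    "meta_data_path": "/data/site/page.meta.json",
    "directory_path": "/data/site",
    "source_file_id": "123e4567-e89b-12d3-a456-426614174000",
    "url": "https://source-url.com",
    "logs_path": "/data/logs/cleaning.log",
}
body = json.dumps(payload)
# POST this body to the cleaning endpoint, e.g.:
# requests.post("http://localhost:8001/<endpoint>", json=payload)
```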
{
"file_path": "FilePath", # Path to original file
"meta_data_path": "FilePath", # Path to metadata file
"directory_path": "DirectoryPath", # Working directory
"source_file_id": "str", # Unique file identifier
"url": "str", # Original source URL
"logs_path": "FilePath" # Path to log file
}

Environment variables:
- RUUTER_INTERNAL: Internal Ruuter service URL for database operations
- LANGUAGES: Supported languages for document processing (configurable in settings)
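In Pydantic terms, the request model above might look like the following sketch. Field types are illustrative; the real models are defined in api/models.py, where the path fields may use Pydantic's validating FilePath/DirectoryPath types:

```python
from pathlib import Path
from pydantic import BaseModel

class CleaningRequest(BaseModel):
    # Plain Path is used here for illustration; the service's models
    # may validate that paths exist (FilePath/DirectoryPath).
    file_path: Path
    meta_data_path: Path
    directory_path: Path
    source_file_id: str
    url: str
    logs_path: Path
```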
- FastAPI 0.115.12: Web framework for API endpoints
- Unstructured 0.18.2: Document processing library with PDF/DOCX support
- BeautifulSoup4 4.13.4: HTML parsing and text extraction
- Pydantic Settings 2.10.1: Configuration management
# Install dependencies
pip install -r requirements.txt
# Start the FastAPI server
uvicorn api.app:app --host 0.0.0.0 --port 8001 --reload

# Build the image
docker build -t cleaning .
# Run the container
docker run -p 8001:8001 \
-e RUUTER_INTERNAL="http://ruuter-internal:8080" \
-v /data:/data \
cleaning

Processing steps:
- File Reception: API receives cleaning request with file paths and metadata
- Content Type Detection: Determines processing strategy based on file type
- Text Extraction:
- HTML files: BeautifulSoup for tag removal and text extraction
- Other files: Unstructured library for document parsing
- Text Cleaning: Removes excess whitespace and formatting artifacts
- File Generation: Creates cleaned.txt with processed content
- Metadata Update: Updates metadata with cleaning status and timestamps
- Storage: Uploads cleaned files via file processing service
- Database Update: Updates source file records via Ruuter Internal
The cleaning service integrates with:
- Scrapper Service: Receives files to be cleaned after scraping
- File Processing: Uploads cleaned content to blob storage
- Ruuter Internal: Updates database records with cleaning status
- Scheduler: Triggered as part of automated data processing pipeline
- Logging: Comprehensive logging to both files and console
- Error Recovery: Graceful handling of file processing errors
- Status Tracking: Updates database with error states when processing fails
- Resource Cleanup: Proper cleanup of temporary files and resources
Key settings in api/config.py:
- Language configuration for document processing
- File path handling
- Database connection settings
- Logging configuration
For each cleaned file, the service generates:
- cleaned.txt: Plain text content suitable for LLM processing
- cleaned.meta.json: Updated metadata with cleaning status
- Log entries for monitoring and debugging