███████╗██╗ █████╗ ████████╗███████╗ ██████╗ ██╗ ██╗██╗██╗ ██╗
██╔════╝██║ ██╔══██╗╚══██╔══╝██╔════╝██╔═══██╗██║ ██║██║██║ ██║
███████╗██║ ███████║ ██║ █████╗ ██║ ██║██║ ██║██║██║ ██║
╚════██║██║ ██╔══██║ ██║ ██╔══╝ ██║▄▄ ██║██║ ██║██║██║ ██║
███████║███████╗██║ ██║ ██║ ███████╗╚██████╔╝╚██████╔╝██║███████╗███████╗
╚══════╝╚══════╝╚═╝ ╚═╝ ╚═╝ ╚══════╝ ╚══▀▀═╝ ╚═════╝ ╚═╝╚══════╝╚══════╝
A robust Python CLI tool for converting HTML documents to clean, standards-compliant Markdown
Installation • Quick Start • Features • Documentation • Contributing
SlateQuill is a Python-based CLI tool that reliably transforms HTML documents into clean, standards-compliant Markdown with full support for:
- Complex tables with proper alignment and formatting
- Nested lists with accurate indentation
- Footnotes and reference links
- Embedded content (images, videos, code blocks)
- Extensible plugin system for future format support (PDF, DOCX, EPUB, etc.)
- 🔒 Security-First: Built-in input validation and sanitization
- ⚡ Performance: Async I/O and streaming support for large files
- 🔧 Extensible: Plugin architecture for custom converters
- 🎨 Configurable: Flexible output formatting options
- 📱 User-Friendly: Beautiful CLI with progress bars and clear error messages
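To make the "nested lists with accurate indentation" item concrete, here is a toy sketch that uses only the standard library's `html.parser`. It is not SlateQuill's implementation (the project builds on BeautifulSoup and markdownify); the class name `ListToMarkdown` and the four-space indent per level are assumptions made for illustration, and text that follows a nested list inside the same item is ignored.

```python
from html.parser import HTMLParser


class ListToMarkdown(HTMLParser):
    """Toy converter: renders nested <ul>/<ol> lists as indented Markdown."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[list] = []  # one [tag, item_counter] entry per open list
        self.lines: list[str] = []
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.stack.append([tag, 0])
            self.in_item = False
        elif tag == "li" and self.stack:
            current = self.stack[-1]
            current[1] += 1
            marker = "-" if current[0] == "ul" else f"{current[1]}."
            indent = "    " * (len(self.stack) - 1)  # four spaces per nesting level
            self.lines.append(f"{indent}{marker} ")
            self.in_item = True

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.stack:
            self.stack.pop()
            self.in_item = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        text = data.strip()
        if text and self.in_item:
            self.lines[-1] += text


def lists_to_markdown(html: str) -> str:
    parser = ListToMarkdown()
    parser.feed(html)
    return "\n".join(parser.lines)
```

Feeding it `<ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>` yields three lines: `- a`, an indented `- b`, and `- c`.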
Since SlateQuill is still in development and not yet published to PyPI, install from source:
# Clone the repository
git clone https://github.com/NiklasSkulll/SlateQuill.git
cd SlateQuill
# Option 1: Using Poetry (Recommended)
poetry install
poetry shell
SlateQuill input.html output.md
# Option 2: Using pip in virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -e .
SlateQuill input.html output.md
# Test with included sample files
SlateQuill tests/fixtures/test_input.html test_output.md
# PyPI - planned (not yet published)
pip install SlateQuill
# Homebrew (macOS) - planned
brew install SlateQuill
# Conda - planned
conda install -c conda-forge SlateQuill
For users who prefer containerized environments or want to avoid Python installation:
docker run -v $(pwd):/workspace SlateQuill:latest convert input.html -o output.md
# Convert HTML file to Markdown
SlateQuill input.html output.md
# Convert with auto-generated output name
SlateQuill input.html
# Show version
SlateQuill --version
# Show help
SlateQuill --help
from SlateQuill import convert_file
from pathlib import Path
import asyncio
# Simple conversion
async def convert():
    await convert_file(Path("input.html"), Path("output.md"))
# Run the conversion
asyncio.run(convert())
# With custom configuration
from SlateQuill.config import Config, ConversionConfig, load_config
async def convert_with_config():
    config = load_config()  # Load from .slateQuill.toml or use defaults
    await convert_file(Path("input.html"), Path("output.md"), config)
# await is only valid inside a coroutine, so run it through asyncio as well
asyncio.run(convert_with_config())
- HTML to Markdown Conversion: Clean, standards-compliant output
- Security-First: Input validation and HTML sanitization
- Async Processing: Efficient handling of files
- Configuration System: TOML-based configuration
- Plugin Architecture: Extensible converter system (planned)
- ✅ Basic HTML to Markdown conversion
- ✅ Configuration system with TOML support
- ✅ Security validation and HTML sanitization
- ✅ Simple CLI interface
- ✅ Async core functionality
- 🚧 Advanced CLI features (planned)
- 🚧 Batch processing (planned)
- 🚧 Multiple output formats (planned)
- Input Validation: Comprehensive security checks
- HTML Sanitization: XSS prevention and content cleaning
- Error Handling: Detailed error messages and recovery
- Memory Management: Efficient processing of large documents
- Plugin Sandboxing: Safe execution of third-party plugins
- Streaming Support: Memory-efficient processing (planned)
- Async I/O: Non-blocking file operations
- Memory Management: Efficient processing of documents
- Current Performance:
- Small files (<1MB): Fast conversion
- Medium files (1-10MB): Efficient processing
- Large files (>10MB): Planned optimizations
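The "Async I/O: Non-blocking file operations" item can be illustrated with a minimal sketch. It uses only the standard library (`asyncio.to_thread`) as a stand-in for the `aiofiles` dependency; `read_html` and `read_many` are hypothetical names and are not part of SlateQuill's API.

```python
import asyncio
from pathlib import Path


async def read_html(path: Path) -> str:
    """Read a file in a worker thread so the event loop is not blocked."""
    # Hypothetical helper; SlateQuill itself lists aiofiles as a dependency.
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def read_many(paths: list[Path]) -> list[str]:
    """Read several files concurrently; results keep the order of `paths`."""
    return await asyncio.gather(*(read_html(p) for p in paths))
```

A caller would run it with `asyncio.run(read_many([Path("a.html"), Path("b.html")]))`. Offloading blocking reads to threads keeps the loop responsive, but it does not by itself reduce memory use for large files.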
Create a .slateQuill.toml file in your project root:
[conversion]
markdown_flavor = "github"
preserve_html = false
strip_comments = true
[security]
max_file_size = 104857600 # 100MB
sanitize_html = true
validate_encoding = true
[logging]
level = "INFO"
Note: Advanced configuration options are planned for future versions.
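As a rough illustration of how a file like this could be read, the sketch below parses TOML text and overlays it on defaults that mirror the sample above. `parse_config` and `DEFAULTS` are hypothetical names; SlateQuill's own `load_config` may behave differently.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # Python 3.10
    import tomli as tomllib  # listed in the project's dependencies

# Defaults mirror the sample file; the merge logic is illustrative only.
DEFAULTS = {
    "conversion": {"markdown_flavor": "github", "preserve_html": False, "strip_comments": True},
    "security": {"max_file_size": 104857600, "sanitize_html": True, "validate_encoding": True},
    "logging": {"level": "INFO"},
}


def parse_config(text: str) -> dict:
    """Parse TOML text and overlay it on the defaults, section by section."""
    loaded = tomllib.loads(text)
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in loaded.items():
        merged.setdefault(section, {}).update(values)
    return merged


config = parse_config('[conversion]\nmarkdown_flavor = "commonmark"\n')
```

Here `config["conversion"]["markdown_flavor"]` is overridden by the file, while every key the file omits keeps its default.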
- User Guide: Installation, configuration, and usage
- API Reference: Complete API documentation
- Plugin Development: Creating custom converters
- Contributing Guide: Development setup and workflow
SlateQuill includes a plugin architecture designed to support additional input formats in future versions:
from SlateQuill.plugins.base import BaseConverter
from pathlib import Path
class CustomConverter(BaseConverter):
    def can_handle(self, file_path: Path) -> bool:
        return file_path.suffix.lower() == '.custom'
    async def convert(self, content: bytes, options=None) -> str:
        # Your conversion logic here; decoding is only a placeholder
        markdown_content = content.decode("utf-8")
        return markdown_content
    def validate_input(self, content: bytes) -> bool:
        # Input validation logic
        return True
    @property
    def supported_formats(self) -> list[str]:
        return ['.custom']
Note: The plugin system is currently in development. The current version focuses on HTML to Markdown conversion.
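One way a host application might choose among such converters is to ask each one in turn whether it accepts the file. The sketch below assumes that approach; `pick_converter`, `SuffixConverter`, and the `Converter` protocol are hypothetical and only mimic the `can_handle` part of the interface.

```python
from pathlib import Path
from typing import Optional, Protocol


class Converter(Protocol):
    """Minimal stand-in for the can_handle part of BaseConverter."""

    def can_handle(self, file_path: Path) -> bool: ...


class SuffixConverter:
    """Hypothetical converter that claims files by extension."""

    def __init__(self, suffix: str) -> None:
        self.suffix = suffix

    def can_handle(self, file_path: Path) -> bool:
        return file_path.suffix.lower() == self.suffix


def pick_converter(file_path: Path, converters: list[Converter]) -> Optional[Converter]:
    """Return the first registered converter that accepts the file, or None."""
    for converter in converters:
        if converter.can_handle(file_path):
            return converter
    return None


registry = [SuffixConverter(".html"), SuffixConverter(".custom")]
chosen = pick_converter(Path("notes.CUSTOM"), registry)
```

Because the suffix is lowercased before comparison, `notes.CUSTOM` is claimed by the `.custom` converter; a file no converter accepts yields `None`.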
SlateQuill is built with a modular architecture that separates concerns and enables extensibility:
# Core conversion pipeline
from SlateQuill.core import convert_file
from SlateQuill.config import Config
from SlateQuill.security import validate_input
from SlateQuill.plugins import load_plugins
# Exception hierarchy for robust error handling
class SlateQuillError(Exception):
    """Base exception for SlateQuill"""
    pass
class ConversionError(SlateQuillError):
    """Raised when conversion fails"""
    pass
class InvalidInputError(ConversionError):
    """Raised when input is malformed"""
    pass
class SecurityError(SlateQuillError):
    """Raised when security validation fails"""
    pass
The plugin system uses Python's entry-points mechanism for extensibility:
# plugins/base.py
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Any, Optional
class BaseConverter(ABC):
    """Abstract base class for all converters."""
    @abstractmethod
    def can_handle(self, file_path: Path) -> bool:
        """Check if this converter can handle the given file."""
        pass
    @abstractmethod
    async def convert(self, content: bytes, options: Optional[Dict[str, Any]] = None) -> str:
        """Convert content to Markdown."""
        pass
    @abstractmethod
    def validate_input(self, content: bytes) -> bool:
        """Validate input for security and format correctness."""
        pass
    @property
    @abstractmethod
    def supported_formats(self) -> list[str]:
        """List of supported file formats."""
        pass
Configuration is handled through Pydantic models for type safety:
# config.py
from pydantic import BaseModel, Field
from typing import Dict, Any
class ConversionConfig(BaseModel):
    """Configuration for conversion process."""
    markdown_flavor: str = Field(default="github")
    preserve_html: bool = Field(default=False)
    strip_comments: bool = Field(default=True)
    max_file_size: int = Field(default=100_000_000)
    line_length: int = Field(default=80)
class Config(BaseModel):
    """Main configuration class."""
    conversion: ConversionConfig = ConversionConfig()
    plugins: Dict[str, Any] = Field(default_factory=dict)
    logging_level: str = Field(default="INFO")
- Python 3.10 or higher
- Poetry for dependency management
- Git for version control
# Create new project (for contributors)
poetry new --src SlateQuill
cd SlateQuill
# Clone existing project
git clone https://github.com/NiklasSkulll/SlateQuill.git
cd SlateQuill
# Install dependencies
poetry install
# Set up development tools
poetry run pre-commit install
- Formatter: Black (line length 88, PEP 8 compliant)
- Import Sorter: isort (compatible with Black)
- Linter: Ruff (faster than flake8, comprehensive rules)
- Type Checker: mypy with strict settings
- Security: bandit for security linting, safety for dependencies
# Run all quality checks
poetry run pre-commit run --all-files
# Individual checks
poetry run black --check src/
poetry run ruff check src/
poetry run mypy src/
poetry run bandit -r src/
poetry run safety check
tests/
├── fixtures/ # Test data
│ ├── test_input.html # Sample HTML input
│ └── test_output.md # Expected output
└── (unit tests planned) # Test modules for v0.2.0
# Install development dependencies
poetry install
# Run tests (when implemented)
poetry run pytest
# Run with coverage (planned)
poetry run pytest --cov=SlateQuill
Comprehensive testing suite will be implemented in v0.2.0
- File Size Limits: Configurable maximum file size (default 100MB)
- Content Sanitization: HTML sanitization to prevent XSS
- Encoding Detection: Proper handling of various text encodings
- Path Validation: Directory traversal attack prevention
- Sandboxing: Isolated execution environment for plugins
- API Restrictions: Limited system resource access
- Dependency Scanning: Automated vulnerability checks
- Code Signing: Optional plugin authenticity verification
- Secure temporary file cleanup
- Memory leak prevention
- Sensitive data logging avoidance
- Internal system detail protection in error messages
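The file size limit and path validation items listed above can be sketched as follows. This is an assumption about how such checks might look, not SlateQuill's actual `security` module; `validate_input_path` and `PathValidationError` are hypothetical names.

```python
from pathlib import Path

MAX_FILE_SIZE = 104_857_600  # 100 MiB, matching the sample configuration


class PathValidationError(Exception):
    """Hypothetical error type used only in this sketch."""


def validate_input_path(candidate: Path, base_dir: Path, max_size: int = MAX_FILE_SIZE) -> Path:
    """Resolve the path, reject traversal outside base_dir, and enforce a size limit."""
    base = base_dir.resolve()
    resolved = (base / candidate).resolve()  # also follows symlinks
    if not resolved.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise PathValidationError(f"{candidate} escapes the allowed directory")
    if not resolved.is_file():
        raise PathValidationError(f"{candidate} is not a regular file")
    if resolved.stat().st_size > max_size:
        raise PathValidationError(f"{candidate} exceeds {max_size} bytes")
    return resolved
```

Resolving before comparing means inputs such as `../secret.html` are rejected, because the resolved location falls outside `base_dir`.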
- Triggers: push, pull_request
- Matrix: Python 3.10, 3.11, 3.12 on Ubuntu, macOS, Windows
- Steps:
- Install Poetry and cache dependencies
- Run security checks (bandit, safety)
- Run quality checks (pre-commit)
- Execute full test suite with coverage
- Upload coverage to Codecov
- Run performance benchmarks
- Triggers: Git tag v*
- Steps:
- Validate tag format and changelog
- Build and test package
- Publish to PyPI (trusted publishing)
- Build Docker image (optional)
- Generate GitHub release
- Deploy documentation
- Schedule: Weekly
- Scans: Dependencies, CodeQL, container images
[conversion]
markdown_flavor = "github" # "github", "commonmark", "strict"
line_length = 80
heading_style = "atx" # "atx" (#) or "setext" (===)
emphasis_style = "asterisk" # "asterisk" (*) or "underscore" (_)
preserve_html = false
strip_comments = true
clean_whitespace = true
[security]
max_file_size = 104_857_600 # 100MB in bytes
allow_external_links = true
sanitize_html = true
[plugins]
pdf.ocr_enabled = true
pdf.language = "en"
docx.preserve_styles = false
[output]
create_backup = false
overwrite_existing = true
output_directory = "./output"
[logging]
level = "INFO"
format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
file = "./SlateQuill.log"
[performance]
use_streaming = true
max_workers = 4
cache_results = true
cache_ttl = 3600 # seconds
[tool.poetry.dependencies]
python = "^3.10"
beautifulsoup4 = "^4.12.0"
lxml = "^4.9.0"
markdownify = "^0.11.0"
typer = "^0.9.0"
rich = "^13.0.0"
pydantic = "^2.0.0"
aiofiles = "^23.0.0"
click-completion = "^0.5.0"
bleach = "^6.0.0"
tomli = "^2.0.0"
tomli-w = "^1.0.0"
[tool.poetry.group.dev.dependencies]
pytest = "^7.0.0"
pytest-asyncio = "^0.21.0"
pytest-cov = "^4.0.0"
pytest-benchmark = "^4.0.0"
hypothesis = "^6.0.0"
black = "^23.0.0"
ruff = "^0.1.0"
mypy = "^1.0.0"
pre-commit = "^3.0.0"
bandit = "^1.7.0"
safety = "^2.0.0"
mkdocs = "^1.5.0"
mkdocs-material = "^9.0.0"
mkdocstrings = {extras = ["python"], version = "^0.23.0"}
- Python: 3.10 or higher
- RAM: 512MB minimum (2GB recommended for large files)
- Storage: 100MB for installation
- Network: Optional for plugin downloads
- Python 3.10 or higher
- Poetry for dependency management
- Git for version control
# Clone the repository
git clone https://github.com/NiklasSkulll/SlateQuill.git
cd SlateQuill
# Install dependencies
poetry install
# Set up pre-commit hooks
poetry run pre-commit install
# Run tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=SlateQuill
SlateQuill/
├── src/SlateQuill/             # Main package
│   ├── __init__.py             # Package initialization
│   ├── simple_cli.py           # Simple CLI interface
│   ├── cli.py                  # Advanced CLI (planned)
│   ├── core.py                 # Core conversion logic
│   ├── html2md.py              # HTML to Markdown converter
│   ├── config.py               # Configuration management
│   ├── security.py             # Security validation
│   ├── exceptions.py           # Exception hierarchy
│   └── plugins/                # Plugin system
│       └── base.py             # Base plugin class
├── tests/                      # Test suite
│   └── fixtures/               # Test data
│       ├── test_input.html     # Sample input
│       └── test_output.md      # Sample output
├── docs/                       # Documentation (planned)
├── pyproject.toml              # Poetry configuration
├── .gitignore                  # Git ignore rules
├── .pre-commit-config.yaml     # Pre-commit hooks
├── CONTRIBUTING.md             # Contributing guidelines
├── SECURITY.md                 # Security policy
└── CHANGELOG.md                # Version history
Performance benchmarks will be added as the project matures.
Current status:
- Basic HTML to Markdown conversion: Working
- Small files (<1MB): Fast conversion
- Async processing: Implemented
- Memory efficiency: Under development
Detailed benchmarks will be provided in v0.2.0
SlateQuill takes security seriously:
- Input Validation: All inputs are validated before processing
- HTML Sanitization: Dangerous HTML elements are removed or escaped
- Plugin Sandboxing: Third-party plugins run in isolated environments
- Dependency Scanning: Regular security audits of dependencies
To report security vulnerabilities, please see our Security Policy.
- ✅ Basic HTML to Markdown conversion
- ✅ Configuration system with TOML support
- ✅ Security validation and HTML sanitization
- ✅ Simple CLI interface
- ✅ Async core functionality
- ✅ Exception hierarchy
- ✅ Plugin base architecture
- 🚧 Advanced CLI with Typer framework
- 🚧 Batch processing support
- 🚧 Multiple output formats
- 🚧 Configuration validation
- 🚧 Comprehensive test suite
- 📋 Streaming support for large files
- 📋 First official plugin (PDF support)
- 📋 Performance optimizations
- 📋 Memory usage improvements
- 📋 Docker image (optional deployment method)
- 📋 Stable API
- 📋 Enterprise features
- 📋 Comprehensive documentation
- 📋 Performance guarantees
- Advanced Formats: EPUB, DOCX, LaTeX, AsciiDoc → Markdown
- AI Integration: Smart content extraction and formatting
- Language Bindings: Node.js, Go, Rust libraries
- Cloud Integration: AWS Lambda, Google Cloud Functions
- Enterprise Features: Bulk processing, monitoring, collaboration
SlateQuill follows Semantic Versioning (MAJOR.MINOR.PATCH):
- MAJOR: Incompatible API changes
- MINOR: New functionality in backward-compatible manner
- PATCH: Backward-compatible bug fixes
Releases are automated through GitHub Actions with conventional commits generating changelogs automatically.
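The MAJOR.MINOR.PATCH rules above can be expressed as a small sketch. The `Version` class is hypothetical and is not part of SlateQuill or its release tooling; the version string used in the usage note is for illustration only.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        """Parse 'MAJOR.MINOR.PATCH', tolerating a leading 'v' as in release tags."""
        major, minor, patch = (int(part) for part in text.removeprefix("v").split("."))
        return cls(major, minor, patch)

    def bump(self, part: str) -> "Version":
        """Return the next version; lower-order fields reset to zero."""
        if part == "major":
            return Version(self.major + 1, 0, 0)
        if part == "minor":
            return Version(self.major, self.minor + 1, 0)
        if part == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown part: {part!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"
```

For example, `str(Version.parse("v0.1.0").bump("minor"))` gives `0.2.0`, and bumping `major` on `1.4.2` gives `2.0.0`.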
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes and add tests
- Run the test suite: poetry run pytest
- Submit a pull request
- Follow PEP 8 style guide (enforced by Black)
- Write tests for new features
- Update documentation as needed
- Use conventional commits for commit messages
This project is licensed under the MIT License - see the LICENSE file for details.
- BeautifulSoup for HTML parsing
- markdownify for base conversion
- Typer for the CLI framework
- Rich for beautiful terminal output
Made by Niklas Skulll