A FastAPI-based server that provides REST API endpoints for converting various file formats to Markdown using Microsoft's MarkItDown library.
- Multiple File Format Support: Convert PDF, DOCX, XLSX, PPTX, and HTML files to Markdown
- Async & Thread-Safe: Built with FastAPI for high performance and concurrent request handling
- Docker Ready: Fully containerized with Docker and Docker Compose support
- REST API: Clean RESTful endpoints for easy integration
- Error Handling: Comprehensive error handling and logging
- Health Checks: Built-in health check endpoints
- CORS Enabled: Cross-origin resource sharing support
| Format | Endpoint | File Extensions |
|---|---|---|
| PDF | /parse_pdf | .pdf |
| Word Documents | /parse_docx | .docx |
| Excel Spreadsheets | /parse_xlsx | .xlsx, .xls |
| PowerPoint Presentations | /parse_pptx | .pptx, .ppt |
| HTML Files | /parse_html | .html, .htm |
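
As an illustration of how a client might pick the right endpoint from the table above, here is a minimal sketch. The `ENDPOINTS` dictionary and the `endpoint_for` helper are hypothetical names introduced for this example and are not part of the server; only the extension-to-endpoint pairs come from the table.

```python
from pathlib import Path

# Extension-to-endpoint pairs copied from the table above.
ENDPOINTS = {
    ".pdf": "/parse_pdf",
    ".docx": "/parse_docx",
    ".xlsx": "/parse_xlsx",
    ".xls": "/parse_xlsx",
    ".pptx": "/parse_pptx",
    ".ppt": "/parse_pptx",
    ".html": "/parse_html",
    ".htm": "/parse_html",
}


def endpoint_for(filename: str) -> str:
    """Return the endpoint path for a file based on its extension (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match, e.g. ".PDF"
    try:
        return ENDPOINTS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
```

A client could then join the result with the server's base URL, for example `"http://localhost:8000" + endpoint_for("report.pdf")`.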
- Clone the repository:
  git clone <repository-url>
  cd Kolosal-RMS-MarkItDown
- Create a virtual environment:
  python -m venv venv
  source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Run the server:
  python main.py
  The API will be available at http://localhost:8000
- Build and run with Docker:
  docker build -t markitdown-api .
  docker run -p 8000:8000 markitdown-api
- Or use Docker Compose:
  docker-compose up -d
Once the server is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
curl -X POST "http://localhost:8000/parse_pdf" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@document.pdf"

curl -X POST "http://localhost:8000/parse_docx" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@document.docx"

curl -X POST "http://localhost:8000/parse_xlsx" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@spreadsheet.xlsx"

curl -X POST "http://localhost:8000/parse_pptx" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@presentation.pptx"

curl -X POST "http://localhost:8000/parse_html" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@webpage.html"

All endpoints return a JSON response with the following structure:
{
  "success": true,
  "filename": "document.pdf",
  "markdown_content": "# Document Title\n\nDocument content in markdown format...",
  "title": "Document Title",
  "metadata": {
    "original_filename": "document.pdf",
    "file_size": 1024576,
    "mime_type": "application/pdf"
  }
}

The server provides a health check endpoint:
curl http://localhost:8000/health

Response:
{
  "status": "healthy",
  "service": "markitdown-api"
}

- Async Processing: All file operations are handled asynchronously
- Thread Pool: CPU-intensive conversions run in a dedicated thread pool
- Concurrent Requests: Supports multiple simultaneous file conversions
- Memory Efficient: Uses streaming for file processing
- Error Recovery: Graceful error handling without server crashes
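
The thread-pool pattern described above can be sketched as follows. This is a generic illustration under stated assumptions, not the server's actual code: `convert_blocking` is a hypothetical stand-in for a real conversion, and the pool size of 4 is an arbitrary choice.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for blocking conversion work; the size is an arbitrary assumption.
executor = ThreadPoolExecutor(max_workers=4)


def convert_blocking(data: bytes) -> str:
    """Hypothetical stand-in for a blocking file-to-Markdown conversion."""
    time.sleep(0.05)  # simulate slow work
    return f"# Converted {len(data)} bytes"


async def convert(data: bytes) -> str:
    """Run the blocking conversion in the pool so the event loop stays responsive."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, convert_blocking, data)


async def convert_many(payloads: list[bytes]) -> list[str]:
    # gather() lets the conversions overlap and returns results in input order.
    return await asyncio.gather(*(convert(p) for p in payloads))


if __name__ == "__main__":
    print(asyncio.run(convert_many([b"abc", b"hello"])))
```

Offloading to an executor keeps a slow conversion from blocking other requests on the event loop, which is the usual reason to pair an async framework with a thread pool.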
The server can be configured through environment variables:
- HOST: Server host (default: 0.0.0.0)
- PORT: Server port (default: 8000)
- LOG_LEVEL: Logging level (default: info)
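
One way such variables could be read at startup is sketched below. The `Settings` class and `load_settings` function are hypothetical names for this example; only the variable names and their defaults come from the list above, and how the project's own `main.py` reads them may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read HOST, PORT and LOG_LEVEL, falling back to the documented defaults."""
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),  # raises ValueError if PORT is not a number
        log_level=env.get("LOG_LEVEL", "info").lower(),
    )
```

Assuming the application is served with Uvicorn, the resulting values could be passed along as `uvicorn.run(app, host=settings.host, port=settings.port, log_level=settings.log_level)`.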
docker build -t kolosal-markitdown-api .

docker run -d \
--name markitdown-api \
-p 8000:8000 \
kolosal-markitdown-api

# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down

- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built using Microsoft's MarkItDown library, which provides the core functionality for converting various file formats to Markdown.
- MarkItDown: https://github.com/microsoft/markitdown
- Microsoft: For creating and maintaining the MarkItDown library
- FastAPI: For the excellent async web framework
- Uvicorn: For the ASGI server implementation
For support and questions, please:
- Check the documentation
- Search existing issues in the repository
- Create a new issue if needed
Kolosal Inc - Retrieval Management Service Implementation