This project is somewhat experimental; I can only provide minimal support.
A high-performance speech-to-text API server using NVIDIA's Parakeet TDT 0.6B v3 model, with CPU and ONNX INT8 support.
- Flexible Precision: FP32 (Best quality, GPU/CPU) or INT8 (Slightly worse quality, CPU-only)
- Easy Setup: Clone and run with a single command
- Auto-Download: Models download automatically on first run
- OpenAI Compatible: Drop-in replacement for OpenAI's Whisper API
- Multilingual: Supports 25 European languages (auto-detected)
```bash
# Clone the repository
git clone https://github.com/Dolyfin/parakeet-api.git
cd parakeet-api

# Run installation script (auto-detects GPU)
# Linux/macOS:
./install.sh

# Windows:
install.bat
```

```bash
# Linux/macOS
./start.sh

# Windows
start.bat

# Or run directly with options
python server.py --precision fp32  # GPU/CPU, best quality (default)
python server.py --precision int8  # CPU-only, fastest
```

```bash
curl -X POST "http://localhost:8022/v1/transcribe" \
  -F "file=@audio.wav"
```

```bash
curl -X POST "http://localhost:8022/v1/audio/transcriptions" \
  -F "file=@audio.mp3" \
  -F "model=parakeet-tdt-0.6b-v3"
```

```python
import requests

with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8022/v1/audio/transcriptions",
        files={"file": f},
        data={"model": "parakeet-tdt-0.6b-v3"},
    )

print(response.json()["text"])
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8022/v1",
    api_key="not-needed",
)

with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="parakeet-tdt-0.6b-v3",
        file=audio_file,
    )

print(transcript.text)
```

- `POST /v1/transcribe` - Standard transcription endpoint
- `POST /v1/audio/transcriptions` - OpenAI-compatible endpoint
- `GET /health` - Health check
- `GET /docs` - Interactive API documentation
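The `/health` endpoint makes a convenient liveness probe. A minimal sketch using only the standard library, assuming the default host and port from this README:

```python
import urllib.request

def server_healthy(base_url: str = "http://localhost:8022", timeout: float = 5.0) -> bool:
    """Return True if GET /health responds with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return False
```

Useful in startup scripts that need to wait for the model to finish loading before sending audio.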
Supports `json`, `text`, `srt`, and `vtt` formats. Add timestamps with `timestamps=true`.
```bash
# Precision selection
python server.py --precision fp32  # Best quality, GPU/CPU (default)
python server.py --precision int8  # Fastest, CPU-only

# Server configuration
python server.py --host 0.0.0.0 --port 9000

# CPU threading (applies to both FP32 and INT8)
python server.py --threads 8  # Use 8 threads
python server.py --threads 0  # Auto-detect (default, uses all CPU cores)

# Force CPU for FP32 with optimized threading
python server.py --precision fp32 --cpu --threads 0
```

The install script auto-detects your GPU. To check GPU support manually:
```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```

If the GPU is not detected, reinstall PyTorch with CUDA support:
```bash
# For CUDA 12.6 (RTX 3000/4000 series and newer)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# For CUDA 11.8 (older GPUs)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# CPU-only
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

NVIDIA Parakeet TDT 0.6B v3
- Size: 600M parameters
- Languages: 25 European languages (auto-detected)
- Architecture: FastConformer-TDT
| Precision | Memory | Backend | Device Support |
|---|---|---|---|
| FP32 | ~2.5GB | NeMo (PyTorch) | GPU/CPU |
| INT8 | ~640MB | ONNX | CPU-only |
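The table above can be read as a simple decision rule (a sketch, not part of the project's CLI): prefer FP32 whenever a GPU is available; on CPU, INT8 cuts memory to roughly a quarter and runs about twice as fast, at the cost of a few points of WER (see the benchmarks below):

```python
def choose_precision(has_gpu: bool, prefer_speed_on_cpu: bool = True) -> str:
    """Pick a --precision value based on the table above."""
    if has_gpu:
        return "fp32"  # best quality, and fastest when a GPU is present
    # On CPU, INT8 needs ~640 MB instead of ~2.5 GB and is roughly
    # twice as fast, at the cost of a higher word error rate.
    return "int8" if prefer_speed_on_cpu else "fp32"
```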
Test Word Error Rate (WER) and performance:
```bash
# Linux/macOS
./benchmark.sh ./test_data 3 results.csv

# Windows
benchmark.bat .\test_data 3 results.csv

# Manual
python benchmark.py --dataset ./test_data --iterations 3
```

Dataset format: pairs of audio files and `.txt` transcriptions in the same folder.
Tested on an RTX 3060 and an i5-11600K with a 2.4-hour dataset:
| Precision | Device | WER (%) | RTF | Time (s) | VRAM (MB) | Delta from FP32 GPU |
|---|---|---|---|---|---|---|
| FP32 | GPU | 11.80 | 0.01 | 77.3 | 2434 | - |
| FP32 | CPU | 11.79 | 0.10 | 779.1 | - | - |
| INT8 | CPU | 15.49 | 0.04 | 362.1 | - | +3.69% |
Note: The custom test dataset was not perfectly transcribed, which may inflate absolute WER values; the results are best read as relative differences between configurations.
Setting `--threads` higher than the default may improve FP32 CPU performance by a further 10-20%, though this was not benchmarked.
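For reference, the RTF column is the real-time factor: processing time divided by audio duration, where values below 1.0 mean faster than real time. A one-line sketch:

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / audio duration (lower is better)."""
    return processing_seconds / audio_seconds

# ~2.4 h of audio is 8640 s; FP32 on GPU took 77.3 s in the table above,
# giving an RTF of roughly 0.009, i.e. about 100x faster than real time.
```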
- Check NVIDIA drivers: `nvidia-smi`
- Verify PyTorch CUDA: `python -c "import torch; print(torch.cuda.is_available())"`
- Reinstall PyTorch with CUDA support (see GPU Setup above)
- Use INT8: `python server.py --precision int8`
- Force CPU for FP32: `python server.py --precision fp32 --cpu`
Upgrade to Python 3.10+ or install pre-built wheels:
```bash
pip install numpy==1.23.5 scipy==1.11.4
```

WAV, MP3, FLAC, OGG, M4A, and more. Audio is automatically resampled to 16 kHz and converted to mono.
*not tested
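The server handles conversion itself, but pre-converting locally can shrink uploads. A sketch that shells out to ffmpeg (assumes `ffmpeg` is on your PATH) to match the server's target format:

```python
import subprocess

def to_16k_mono_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command matching the server's target format:
    16 kHz sample rate (-ar), single channel (-ac), overwrite output (-y)."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def to_16k_mono(src: str, dst: str) -> None:
    """Convert src to 16 kHz mono WAV at dst; raises on ffmpeg failure."""
    subprocess.run(to_16k_mono_cmd(src, dst), check=True)
```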
```bash
# Build
docker build -t parakeet-api .

# Run with GPU
docker-compose up

# Run CPU-only
docker run -p 8022:8022 parakeet-api --precision int8
```

```
parakeet-api/
├── server.py            # FastAPI server
├── backend.py           # Backend abstraction layer
├── inference_nemo.py    # NeMo backend (FP32)
├── inference_onnx.py    # ONNX backend (INT8)
├── model_downloader.py  # Model auto-download
├── benchmark.py         # WER & performance testing
├── config.py            # Configuration
├── requirements.txt     # Dependencies
├── install.sh/bat       # Installation scripts
├── start.sh/bat         # Startup scripts
└── benchmark.sh/bat     # Benchmark launchers
```
This project uses the NVIDIA Parakeet model, licensed under CC-BY-4.0; by downloading the model you agree to that license.
This project is licensed under GNU General Public License v3.0.
- Model: NVIDIA NeMo nvidia/parakeet-tdt-0.6b-v3
- ONNX Runtime: sherpa-onnx
For issues and questions:
- Check the Troubleshooting section
- Review the API docs at `/docs`
- Open an issue on GitHub