This repo provides a fast, high‑accuracy, fully local pipeline to extract Arabic text from PDFs on Windows using:
- Tesseract 5 with the Arabic model from tessdata_best (highest free accuracy).
- OCRmyPDF as the smart wrapper that auto‑rotates, deskews, cleans pages, and produces a searchable PDF plus a sidecar TXT file.
ℹ️ Reality check: Zero OCR errors is not realistic, especially with Arabic (fonts, scan quality, diacritics). The settings here balance accuracy, speed, and reasonable file sizes for long PDFs.
- Searchable PDF:
*_OCR.pdf - Plain text sidecar:
*_OCR.txt— strongly recommended for search/indexing and to bypass some RTL quirks in certain PDF viewers.
System (install via Chocolatey):
- Python 3.10+
- Tesseract OCR
- Ghostscript
- (Optional) pngquant (better compression for color images)
Python:
ocrmypdf(installed viapip)
Ensure TESSDATA_PREFIX points to your Tesseract
tessdatadirectory. You must install Arabic model (ara) from tessdata_best.
If you don’t have Chocolatey, install it from its official site first. Then run:
# 1) System tools
choco install -y python3 tesseract ghostscript pngquant
# 2) Python packages
pip install --upgrade pip
pip install -r requirements.txt
# 3) Arabic model from tessdata_best
Invoke-WebRequest `
https://github.com/tesseract-ocr/tessdata_best/raw/main/ara.traineddata `
-OutFile "C:\Program Files\Tesseract-OCR\tessdata\ara.traineddata"
# 4) Make sure Tesseract knows where models live
[Environment]::SetEnvironmentVariable('TESSDATA_PREFIX', 'C:\Program Files\Tesseract-OCR\tessdata', 'Machine')
$env:TESSDATA_PREFIX='C:\Program Files\Tesseract-OCR\tessdata'
# 5) Sanity checks
tesseract --version
tesseract --list-langs # "ara" should be listed
ocrmypdf --versionIf you see permission issues, reopen PowerShell as Administrator. If
arais missing, confirmara.traineddatais in the correcttessdatadirectory and that TESSDATA_PREFIX is set.
OCR/
├─ app.py # main script
├─ requirements.txt # Python deps
└─ README.md
Place your PDF somewhere (example):
C:\Users\waok\Downloads\ARABIC PDF PATH .pdf
From the project folder:
python app.py "C:\Users\waok\Downloads\ARABIC PDF PATH .pdf" "C:\Users\waok\Downloads\ARABIC PDF PATH _OCR.pdf"If you omit arguments, app.py uses a default input path inside the script. The script also writes a sidecar TXT next to your output PDF: ..._OCR.txt.
ocrmypdf -l ara --jobs 8 `
--rotate-pages --deskew --clean --remove-background `
--optimize 3 --output-type pdf `
--pdf-renderer hocr `
--sidecar "C:\Users\waok\Downloads\ARABIC PDF PATH _OCR.txt" `
"C:\Users\waok\Downloads\ARABIC PDF PATH .pdf" `
"C:\Users\waok\Downloads\ARABIC PDF PATH _OCR.pdf"If you see “page already has text” and still want to force OCR, add
--force-ocr.
- Language:
ara(add+engif your content is mixed). - Preprocessing: Auto‑rotate, deskew, and background clean before OCR for better accuracy.
- Tesseract engine: LSTM‑only (most accurate) with automatic page segmentation (PSM 3) — good default for diverse document layouts.
- PDF rendering:
pdf_renderer="hocr"works better with RTL languages in many viewers. - Output type:
output_type="pdf"(smaller than PDF/A). - Sidecar:
*_OCR.txtis reliable for search/indexing and downstream NLP. - Parallelism:
jobsleverages CPU cores to speed up long PDFs.
Tuned for highest free accuracy with reasonable output size on Windows.
import os, sys, pathlib
import ocrmypdf
def main(input_path: str, output_path: str | None = None):
# Ensure Tesseract models path (tessdata_best) is known
os.environ.setdefault("TESSDATA_PREFIX", r"C:\Program Files\Tesseract-OCR\tessdata")
p_in = pathlib.Path(input_path)
if output_path is None:
output_path = str(p_in.with_name(p_in.stem + "_OCR.pdf"))
sidecar = str(pathlib.Path(output_path).with_suffix(".txt"))
# NOTE: If your PDF is truly 300 dpi or higher, remove oversample=300
ocrmypdf.ocr(
str(p_in),
output_path,
language="ara", # add +eng if you have mixed Arabic/English
rotate_pages=True,
deskew=True,
clean=True,
remove_background=True,
optimize=3,
output_type="pdf",
pdf_renderer="hocr",
tesseract_oem=1, # LSTM only
tesseract_pagesegmode=3, # automatic page segmentation
oversample=300, # drop this if input is already >=300 dpi
sidecar=sidecar,
jobs=max(1, (os.cpu_count() or 8) - 1),
)
print(f"✅ Done:\nPDF : {output_path}\nTXT : {sidecar}")
if __name__ == "__main__":
default_input = r"C:\Users\waok\Downloads\ARABIC PDF PATH .pdf"
in_path = sys.argv[1] if len(sys.argv) > 1 else default_input
out_path = sys.argv[2] if len(sys.argv) > 2 else None
try:
main(in_path, out_path)
except Exception as exc:
print("Failed to run ocrmypdf:", exc)
sys.exit(2)- PSM (page segmentation):
tesseract_pagesegmode=3is a strong default for varied layouts.=6is great for uniform text blocks (paragraph pages).
- OEM (engine):
tesseract_oem=1(LSTM‑only) is typically the most accurate.
- Languages:
- Use
ara+engonly if needed; adding more languages can sometimes degrade accuracy.
- Use
- Oversample:
- Keep
oversample=300only when scans are below 300 dpi. Otherwise remove it to reduce output size.
- Keep
WinError 2/ “command not found”: Add Tesseract and Ghostscript to your PATH or reopen PowerShell as Admin after install.- “Tesseract couldn’t load any languages”: Confirm
TESSDATA_PREFIXand thatara.traineddataexists in thetessdatafolder. - Large output size:
- Prefer
output_type="pdf"(smaller than default PDF/A). - Use
optimize=3. - Remove
oversampleif scans are already 300+ dpi. jbig2is commonly missing on Windows; it’s safe to ignore (only affects B/W compression).
- Prefer
- Warning
lots of diacritics: Informational; common with Arabic. Improve source quality and keep effective dpi around 300. - RTL quirks in PDF viewers: Some viewers struggle with Arabic text selection/search. Rely on the sidecar TXT or try another viewer.
import re, io
in_txt = r"C:\path\to\file_OCR.txt"
out_txt = r"C:\path\to\file_OCR_no_diac.txt"
with io.open(in_txt, "r", encoding="utf-8") as f:
txt = f.read()
# Remove common Arabic diacritics
txt_no_diac = re.sub(r"[\u064B-\u065F\u0670\u06D6-\u06ED]", "", txt)
with io.open(out_txt, "w", encoding="utf-8") as f:
f.write(txt_no_diac)
print("Saved:", out_txt)# Replace the path with your PDF
python app.py "C:\Users\...\ARABIC PDF PATH .pdf"
# Check outputs:
# ...\ARABIC PDF PATH _OCR.pdf
# ...\ARABIC PDF PATH _OCR.txtFree to use internally. Respect the upstream licenses (Tesseract, OCRmyPDF).
I can include a Windows Batch (.bat) that asks for a PDF path and runs the same settings automatically (and writes the TXT next to it). Just say the word.