govtechmy/ocr-testing

OCR HTTP API (Alpine + Tesseract + pytesseract)

Single-container OCR service running on Alpine Linux. Exposes an HTTP API to extract text from uploaded images using Tesseract via pytesseract.

  • Languages: English (eng) and Malay (msa)
  • Default per-request language: eng+msa

Project layout

Build and run

From the project root:

docker compose up --build ocr-api

The API will listen on:

  • http://localhost:8000

Endpoints

GET /health

Returns service health and configuration info.

Example:

curl http://localhost:8000/health

Expected JSON (shape):

{
  "status": "ok",
  "default_language": "eng+msa",
  "supported_languages": ["eng", "eng+msa", "msa"],
  "tesseract_languages": ["eng", "msa", "..."]
}
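The health response can also be checked from a script. The following is a minimal sketch using only the Python standard library; the helper names `fetch_health` and `missing_languages` are hypothetical and not part of this repository, and the field names are taken from the response shape shown above.

```python
import json
from urllib.request import urlopen

# Languages this README says the service should provide.
REQUIRED_LANGS = {"eng", "msa"}


def missing_languages(health: dict, required=REQUIRED_LANGS) -> set:
    """Return the required languages absent from "tesseract_languages"."""
    installed = set(health.get("tesseract_languages", []))
    return set(required) - installed


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """GET /health and decode the JSON body (needs a running container)."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)
```

With the container running, `missing_languages(fetch_health())` should return an empty set if both `eng` and `msa` are installed.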

POST /ocr

Upload an image and get back the extracted text.

  • Request: multipart/form-data
    • Field file: image file (PNG/JPEG/etc.)
    • Optional field or query param lang: eng, msa, or eng+msa (default)
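A client can check the `lang` value before sending a request. The sketch below is a hypothetical client-side helper, not the service's own validation code. It encodes only what is listed above: three accepted values, with `eng+msa` as the default. Trimming whitespace and lowercasing the value are assumptions made for convenience.

```python
from typing import Optional

DEFAULT_LANG = "eng+msa"
SUPPORTED_LANGS = {"eng", "msa", "eng+msa"}


def resolve_lang(lang: Optional[str]) -> str:
    """Return a supported language string, or the default if none is given."""
    if lang is None or not lang.strip():
        return DEFAULT_LANG
    value = lang.strip().lower()  # assumption: tolerate case and whitespace
    if value not in SUPPORTED_LANGS:
        raise ValueError(
            f"unsupported lang {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}"
        )
    return value
```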

Examples:

# Default languages (eng+msa)
curl -F "file=@/path/to/image.png" \
  http://localhost:8000/ocr

# English only
curl -F "file=@/path/to/english.png" \
  -F "lang=eng" \
  http://localhost:8000/ocr

# Malay only
curl -F "file=@/path/to/malay.png" \
  -F "lang=msa" \
  http://localhost:8000/ocr

Response (shape):

{
  "filename": "image.png",
  "language": "eng+msa",
  "text": "... OCR result ..."
}
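The same upload can be made from Python without third-party packages. This is a sketch, not an official client: `encode_multipart` and `ocr_image` are hypothetical names, and the code assumes the service accepts a standard multipart/form-data body with the `file` and `lang` fields described above.

```python
import json
import mimetypes
import uuid
from pathlib import Path
from typing import Optional, Tuple
from urllib.request import Request, urlopen


def encode_multipart(
    filename: str, data: bytes, lang: Optional[str] = None
) -> Tuple[bytes, str]:
    """Build a multipart/form-data body; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    if lang is not None:
        # Optional text field selecting the OCR language.
        field = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="lang"\r\n\r\n'
            f"{lang}\r\n"
        )
        parts.append(field.encode("utf-8"))
    # File part: headers, raw bytes, then the CRLF that ends the part.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    )
    parts.append(head.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def ocr_image(
    path: str, lang: Optional[str] = None, base_url: str = "http://localhost:8000"
) -> dict:
    """POST an image to /ocr and return the decoded JSON response."""
    file_path = Path(path)
    body, content_type = encode_multipart(
        file_path.name, file_path.read_bytes(), lang
    )
    req = Request(
        f"{base_url}/ocr",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

With the container running, `ocr_image("/path/to/image.png", lang="eng")["text"]` should hold the extracted text.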

Alpine and Tesseract notes

  • The Dockerfile currently installs:
    • tesseract-ocr
    • tesseract-ocr-data-eng
    • tesseract-ocr-data-msa
  • If the build fails because a package name is not found:
    1. Start a temporary Alpine container:

      docker run --rm -it python:3.11-alpine sh
    2. Inside it, inspect available Tesseract packages:

      apk update
      apk search 'tesseract-ocr*'
    3. Adjust the package names in the Dockerfile to match what your Alpine repo provides.

After the image builds successfully, verify inside a running container:

docker compose run --rm ocr-api sh
# inside the container
tesseract --version
tesseract --list-langs

You should see eng and msa in the language list.
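This check can be scripted, for example as a smoke test run inside the container. The sketch below uses hypothetical helper names and makes two assumptions about `tesseract --list-langs`: the output begins with a "List of available languages" header line followed by one language code per line, and some older Tesseract versions write the list to stderr rather than stdout.

```python
import subprocess


def parse_list_langs(output: str) -> list:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages() -> list:
    """Run tesseract and return its installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: combine both streams in case the list is on stderr.
    return parse_list_langs(result.stdout + result.stderr)
```

Inside the container, `{"eng", "msa"} <= set(installed_languages())` should evaluate to `True`.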
