LittleReader

A lightweight hybrid system for parsing and digitizing historical newspaper pages.

Introduction

Historical newspapers represent an important source of cultural, social, and historical information. The ongoing digitization of newspaper archives has created the opportunity to automatically extract and structure their content through modern computer vision and machine learning techniques. However, historical newspapers remain particularly challenging due to their complex and irregular layouts, degraded scans, and heterogeneous typography.

This repository contains the implementation of LittleReader, a lightweight hybrid system for parsing and digitizing historical newspaper pages. The proposed architecture combines a multi-stage pipeline with multimodal approaches in order to detect page layouts, reconstruct reading order, and extract textual content in a structured JSON format. LittleReader is composed of three main modules: a layout parser based on RT-DETR, a heuristic-based layout handler designed for occidental newspapers, and an OCR module that integrates classical OCR methods with Vision-Language Models (VLMs).
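The hand-off between the three modules can be illustrated with a minimal sketch. It is not the LittleReader implementation: the `Block` class, the `reading_order` and `to_json` functions, the `min_overlap` threshold, and the label names are all hypothetical, and the column-grouping rule is only one plausible heuristic for left-to-right, top-to-bottom layouts. It assumes the layout parser has already produced bounding boxes and that the OCR stage fills in `text`.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Block:
    """One detected layout region (hypothetical structure)."""
    label: str      # e.g. "title" or "paragraph"; class names are assumptions
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""  # would be filled in by the OCR stage


def reading_order(blocks, min_overlap=0.5):
    """Group blocks into columns by horizontal overlap, then read columns
    left to right and blocks top to bottom within each column."""
    columns = []  # each column is a dict with keys "x0", "x1", "blocks"
    for b in sorted(blocks, key=lambda blk: blk.x0):
        for col in columns:
            overlap = min(b.x1, col["x1"]) - max(b.x0, col["x0"])
            narrower = min(b.x1 - b.x0, col["x1"] - col["x0"])
            if narrower > 0 and overlap / narrower >= min_overlap:
                col["blocks"].append(b)
                col["x0"] = min(col["x0"], b.x0)
                col["x1"] = max(col["x1"], b.x1)
                break
        else:
            # No existing column overlaps enough: start a new one.
            columns.append({"x0": b.x0, "x1": b.x1, "blocks": [b]})
    columns.sort(key=lambda col: col["x0"])
    return [
        b
        for col in columns
        for b in sorted(col["blocks"], key=lambda blk: blk.y0)
    ]


def to_json(blocks):
    """Serialize blocks in reading order as a JSON array."""
    ordered = reading_order(blocks)
    return json.dumps(
        [{"order": i, **asdict(b)} for i, b in enumerate(ordered)], indent=2
    )
```

Calling `to_json` on a list of `Block` objects returns a JSON array with one entry per region, each carrying its position in the reading order. This sketch does not handle headlines that span several columns; a wide block would merge the columns beneath it, so a real handler needs extra rules for such cases.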

Document Layout Detection Module

The document layout detection module is based on a fine-tuned RT-DETR model, available in this Hugging Face repository.

