fix: improve font encoding handling for better text extraction #61
Open
ajroetker wants to merge 2 commits into ledongthuc:master
Conversation
The PDF specification states that the ToUnicode CMap is the authoritative source for character-to-Unicode mapping. Previously, the library only checked ToUnicode for fonts with "Identity-H" encoding or a null encoding, causing incorrect text extraction for many PDFs. This change:

- Checks the ToUnicode CMap first before falling back to Encoding
- Falls back to pdfDocEncoding instead of nopEncoder for better compatibility with unknown encodings
- Removes the now-redundant charmapEncoding() method

This fixes text extraction issues where characters were incorrectly decoded (e.g., '0' appearing as 'M') because ToUnicode was ignored whenever an Encoding entry was present.
The PDF spec (section 9.6.6) requires that when an Encoding dictionary is present, the BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding) is applied first, and the Differences array then overlays specific character-code mappings on top. Previously, dictEncoder only looked at the Differences array and matched character codes one by one, which was both slow and incorrect for fonts that rely on BaseEncoding for most characters. This fix:

- Builds a complete 256-entry lookup table at initialization time
- Copies the BaseEncoding table first (defaulting to PDFDocEncoding)
- Applies Differences array entries on top
- Uses O(1) lookup instead of O(n) scanning during decoding

Fixes font encoding corruption in PDFs where fonts use custom Encoding dictionaries with BaseEncoding + Differences (common in legal documents).
Summary
This PR improves font encoding handling to fix text extraction issues in PDFs with complex font configurations. It includes two key fixes:
Problem
When extracting text from certain PDFs (e.g., scanned legal documents), characters were being incorrectly decoded:
- Digits were decoded incorrectly (e.g., '0' appearing as 'M') because ToUnicode was ignored whenever an Encoding entry was present
- Case numbers like 1:15-cv-07433-LAP were corrupted because Differences were applied without the BaseEncoding foundation

Solution
Fix 1: ToUnicode Priority
Per the PDF specification (section 9.10.2), ToUnicode CMap should be the primary source for mapping character codes to Unicode. This change checks ToUnicode first:
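A minimal sketch of that lookup order, using stand-in types and fields (Font, TextEncoder, and the ToUnicode/Encoding fields here are illustrative, not the library's exact internals):

```go
package main

import "fmt"

// TextEncoder maps raw character codes from a content stream to Unicode text.
type TextEncoder interface{ Decode(raw string) string }

// toUnicodeEncoder decodes through a parsed ToUnicode CMap (code -> rune),
// the authoritative mapping per PDF spec section 9.10.2.
type toUnicodeEncoder struct{ cmap map[byte]rune }

func (e toUnicodeEncoder) Decode(raw string) string {
	out := make([]rune, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		if r, ok := e.cmap[raw[i]]; ok {
			out = append(out, r)
		} else {
			out = append(out, rune(raw[i])) // pass unmapped codes through
		}
	}
	return string(out)
}

// pdfDocEncoder stands in for the PDFDocEncoding fallback table.
type pdfDocEncoder struct{}

func (pdfDocEncoder) Decode(raw string) string { return raw } // identity over ASCII

// Font models only what the lookup order needs.
type Font struct {
	ToUnicode map[byte]rune // parsed ToUnicode CMap, nil when absent
	Encoding  TextEncoder   // encoder built from the Encoding entry, nil when absent
}

// Encoder returns the decoder for this font. The fix: ToUnicode is consulted
// first even when an Encoding entry exists, and PDFDocEncoding (not a no-op
// encoder) is the last resort.
func (f Font) Encoder() TextEncoder {
	if f.ToUnicode != nil {
		return toUnicodeEncoder{f.ToUnicode}
	}
	if f.Encoding != nil {
		return f.Encoding
	}
	return pdfDocEncoder{}
}

func main() {
	// A font whose ToUnicode CMap says code 0x4D (the 'M' slot) is really '0'.
	f := Font{ToUnicode: map[byte]rune{0x4D: '0'}}
	fmt.Println(f.Encoder().Decode("M")) // prints "0", not "M"
}
```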
Fix 2: BaseEncoding + Differences
Per PDF spec section 9.6.6, when an Encoding dictionary is present:

- The BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding) is applied first
- The Differences array then overlays specific character-code mappings on top
The new dictEncoder builds a complete 256-entry lookup table at initialization: the BaseEncoding table is copied first (defaulting to PDFDocEncoding), Differences entries are applied on top, and decoding becomes an O(1) table lookup instead of an O(n) scan, as sketched below.
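A minimal sketch of that table build, with hypothetical names (newDictEncoder and the identity base table are placeholders for the real BaseEncoding tables):

```go
package main

import "fmt"

// dictEncoder decodes single-byte codes through a precomputed 256-entry table.
type dictEncoder struct{ table [256]rune }

// newDictEncoder builds the table once, per PDF spec 9.6.6: copy the
// BaseEncoding table first, then overlay the Differences entries.
func newDictEncoder(base [256]rune, differences map[byte]rune) dictEncoder {
	e := dictEncoder{table: base} // step 1: BaseEncoding foundation
	for code, r := range differences {
		e.table[code] = r // step 2: Differences overlay wins for its codes
	}
	return e
}

// Decode is an O(1) table lookup per byte, replacing the old O(n) scan of
// the Differences array.
func (e dictEncoder) Decode(raw string) string {
	out := make([]rune, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		out = append(out, e.table[raw[i]])
	}
	return string(out)
}

func main() {
	// Identity base table as a stand-in for WinAnsi/MacRoman/PDFDocEncoding.
	var base [256]rune
	for i := range base {
		base[i] = rune(i)
	}
	// Equivalent of /Differences [ 128 /bullet ]: remap code 0x80 to U+2022.
	enc := newDictEncoder(base, map[byte]rune{0x80: '\u2022'})
	fmt.Println(enc.Decode("a\x80b")) // prints "a•b"
}
```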
Testing
Tested with legal document PDFs (court documents) that previously had incorrect text extraction:
All characters including numbers (0-9), punctuation, and case numbers are now correctly extracted.
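For anyone reproducing the check, a small driver against the library's public Open/GetPlainText API (as shown in the project README; the file path and expected string are placeholders drawn from this PR's example):

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/ledongthuc/pdf"
)

func main() {
	// Placeholder path: substitute any previously affected court-document PDF.
	f, r, err := pdf.Open("testdata/court-document.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// GetPlainText decodes every page's text through the font encoders,
	// exercising the code paths this PR changes.
	plain, err := r.GetPlainText()
	if err != nil {
		log.Fatal(err)
	}
	var buf bytes.Buffer
	buf.ReadFrom(plain)

	// With the fix, a case number like "1:15-cv-07433-LAP" survives intact.
	fmt.Println(bytes.Contains(buf.Bytes(), []byte("1:15-cv-07433-LAP")))
}
```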