fix: improve font encoding handling for better text extraction #61

Open
ajroetker wants to merge 2 commits into ledongthuc:master from ajroetker:fix/tounicode-priority

Conversation


ajroetker commented Jan 3, 2026

Summary

This PR improves font encoding handling to fix text extraction issues in PDFs with complex font configurations. It includes two key fixes:

  1. Prioritize ToUnicode CMap - Check ToUnicode CMap first before falling back to Encoding dictionaries
  2. Properly apply BaseEncoding + Differences - When using Encoding dictionaries, apply BaseEncoding first then overlay Differences array

Problem

When extracting text from certain PDFs (e.g., scanned legal documents), characters were being incorrectly decoded:

  • '0' was appearing as 'M' because ToUnicode CMap was being ignored
  • Case numbers like 1:15-cv-07433-LAP were corrupted because Differences were applied without the BaseEncoding foundation

Solution

Fix 1: ToUnicode Priority

Per the PDF specification (section 9.10.2), ToUnicode CMap should be the primary source for mapping character codes to Unicode. This change checks ToUnicode first:

func (f Font) getEncoder() TextEncoding {
    // Prefer the ToUnicode CMap whenever the font provides one.
    toUnicode := f.V.Key("ToUnicode")
    if toUnicode.Kind() == Stream {
        if m := readCmap(toUnicode); m != nil {
            return m
        }
    }
    // Fall back to Encoding-based decoding
    ...
}
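For context, a ToUnicode CMap is essentially a table of character-code ranges mapped to Unicode code points. The sketch below is not the library's actual readCmap implementation; bfRange, cmap, and Decode are illustrative names, and it is simplified to single-byte codes. It shows how a bfrange-style lookup resolves the '0'-appearing-as-'M' case described above:

```go
package main

import "fmt"

// bfRange is one mapping from a ToUnicode CMap: character codes in
// [lo, hi] map to consecutive Unicode code points starting at dst.
// (Hypothetical simplified form; real CMaps also use multi-byte codes.)
type bfRange struct {
	lo, hi byte
	dst    rune
}

type cmap struct{ ranges []bfRange }

// Decode maps a single character code through the CMap, returning
// U+FFFD for codes the CMap does not cover.
func (m cmap) Decode(code byte) rune {
	for _, r := range m.ranges {
		if r.lo <= code && code <= r.hi {
			return r.dst + rune(code-r.lo)
		}
	}
	return '\uFFFD'
}

func main() {
	// A font whose raw byte 0x4D (which reads as 'M' in ASCII)
	// actually maps to the digit '0' via its ToUnicode CMap.
	m := cmap{ranges: []bfRange{{lo: 0x4D, hi: 0x4D, dst: '0'}}}
	fmt.Printf("%c\n", m.Decode(0x4D)) // prints "0", not "M"
}
```

Ignoring the CMap and decoding 0x4D through an Encoding table is exactly how the 'M'-for-'0' corruption arises.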

Fix 2: BaseEncoding + Differences

Per PDF spec section 9.6.6, when an Encoding dictionary is present:

  1. Start with BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding)
  2. Apply Differences array on top to override specific character codes

The new dictEncoder:

  • Builds a complete 256-entry lookup table at initialization
  • Copies BaseEncoding first (defaulting to PDFDocEncoding)
  • Applies Differences entries on top
  • Uses O(1) lookup instead of O(n) scanning during decoding

func newDictEncoder(enc Value) *dictEncoder {
    e := &dictEncoder{}

    // Start from the base encoding table (default to PDFDocEncoding).
    baseTable := &pdfDocEncoding
    switch enc.Key("BaseEncoding").Name() {
    case "WinAnsiEncoding":
        baseTable = &winAnsiEncoding
    // ... other encodings
    }
    copy(e.table[:], baseTable[:])

    // Apply Differences entries on top of the base table.
    diff := enc.Key("Differences")
    // ... apply differences

    return e
}
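The elided "apply differences" step can be sketched as follows. A Differences array alternates starting character codes with glyph names, e.g. [1 /zero /one]: each integer sets the current code, and each following name is assigned to consecutive codes. This is a standalone illustrative sketch, not the library's code; applyDifferences and glyphToRune are hypothetical names:

```go
package main

import "fmt"

// applyDifferences overlays a PDF Differences array onto a 256-entry
// code-to-rune table. The array alternates starting codes with glyph
// names: [code1 name1 name2 ... code2 name1 ...]. Glyph names are
// resolved through a (hypothetical) glyphToRune lookup.
func applyDifferences(table *[256]rune, diffs []interface{}, glyphToRune map[string]rune) {
	code := 0
	for _, d := range diffs {
		switch v := d.(type) {
		case int64: // an integer sets the current character code
			code = int(v)
		case string: // a glyph name is assigned to the current code
			if r, ok := glyphToRune[v]; ok && code < 256 {
				table[code] = r
			}
			code++
		}
	}
}

func main() {
	var table [256]rune
	for i := range table {
		table[i] = rune(i) // stand-in for a copied BaseEncoding table
	}
	glyphs := map[string]rune{"zero": '0', "one": '1'}
	// Differences array [1 /zero /one]: codes 1 and 2 now decode as digits.
	applyDifferences(&table, []interface{}{int64(1), "zero", "one"}, glyphs)
	fmt.Printf("%c %c\n", table[1], table[2]) // prints "0 1"
}
```

Because the overlay happens once at construction, decoding later is a single table index, which is where the O(1) lookup comes from.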

Testing

Tested with legal document PDFs (court documents) that previously had incorrect text extraction:

  • Before: 104 pages with font encoding corruption
  • After: 2 pages with font encoding corruption (98% reduction)

All characters including numbers (0-9), punctuation, and case numbers are now correctly extracted.

The PDF specification states that ToUnicode CMap is the authoritative
source for character-to-Unicode mapping. Previously, the library only
checked ToUnicode for fonts with "Identity-H" encoding or null encoding,
causing incorrect text extraction for many PDFs.

This change:
- Checks ToUnicode CMap first before falling back to Encoding
- Falls back to pdfDocEncoding instead of nopEncoder for better
  compatibility with unknown encodings
- Removes the now-redundant charmapEncoding() method

This fixes text extraction issues where characters were being
incorrectly decoded (e.g., '0' appearing as 'M') due to ToUnicode
being ignored when an Encoding entry was present.

The PDF spec (section 9.6.6) requires that when an Encoding dictionary
is present, the BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding)
should be applied first, then the Differences array overlays specific
character code mappings on top.

Previously, dictEncoder only looked at the Differences array and matched
character codes one by one, which was both slow and incorrect for fonts
that rely on BaseEncoding for most characters.

This fix:
- Builds a complete 256-entry lookup table at initialization time
- Copies the BaseEncoding table first (defaulting to PDFDocEncoding)
- Applies Differences array entries on top
- Uses O(1) lookup instead of O(n) scanning during decoding

Fixes font encoding corruption in PDFs where fonts use custom Encoding
dictionaries with BaseEncoding + Differences (common in legal documents).
ajroetker changed the title from "fix: prioritize ToUnicode CMap over Encoding for text extraction" to "fix: improve font encoding handling for better text extraction" on Jan 3, 2026
ajroetker force-pushed the fix/tounicode-priority branch from e8ef98b to e6b61eb on January 3, 2026 at 20:20