fix: improve font encoding handling for better text extraction #61
Open
ajroetker wants to merge 2 commits into ledongthuc:master
Conversation
The PDF specification states that the ToUnicode CMap is the authoritative source for character-to-Unicode mapping. Previously, the library only checked ToUnicode for fonts with "Identity-H" encoding or a null encoding, causing incorrect text extraction for many PDFs. This change:

- Checks the ToUnicode CMap first before falling back to Encoding
- Falls back to pdfDocEncoding instead of nopEncoder for better compatibility with unknown encodings
- Removes the now-redundant charmapEncoding() method

This fixes text extraction issues where characters were incorrectly decoded (e.g., '0' appearing as 'M') because ToUnicode was ignored whenever an Encoding entry was present.
The PDF spec (section 9.6.6) requires that when an Encoding dictionary is present, the BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding) is applied first, and the Differences array then overlays specific character-code mappings on top. Previously, dictEncoder only looked at the Differences array and matched character codes one by one, which was both slow and incorrect for fonts that rely on BaseEncoding for most characters. This fix:

- Builds a complete 256-entry lookup table at initialization time
- Copies the BaseEncoding table first (defaulting to PDFDocEncoding)
- Applies Differences array entries on top
- Uses O(1) lookup instead of O(n) scanning during decoding

Fixes font encoding corruption in PDFs where fonts use custom Encoding dictionaries with BaseEncoding + Differences (common in legal documents).
Summary
This PR improves font encoding handling to fix text extraction issues in PDFs with complex font configurations. It includes two key fixes:
Problem
When extracting text from certain PDFs (e.g., scanned legal documents), characters were being incorrectly decoded:
- Digits were decoded incorrectly (e.g., '0' appearing as 'M') because ToUnicode was ignored whenever an Encoding entry was present
- Case numbers like 1:15-cv-07433-LAP were corrupted because Differences were applied without the BaseEncoding foundation

Solution
Fix 1: ToUnicode Priority
Per the PDF specification (section 9.10.2), ToUnicode CMap should be the primary source for mapping character codes to Unicode. This change checks ToUnicode first:
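A minimal sketch of that lookup order, using stand-in types and fields (Font, TextEncoder, and the ToUnicode/Encoding fields here are illustrative, not the library's exact internals):

```go
package main

import "fmt"

// TextEncoder maps raw character codes from a content stream to Unicode text.
type TextEncoder interface{ Decode(raw string) string }

// toUnicodeEncoder decodes through a parsed ToUnicode CMap (code -> rune),
// the authoritative mapping per PDF spec section 9.10.2.
type toUnicodeEncoder struct{ cmap map[byte]rune }

func (e toUnicodeEncoder) Decode(raw string) string {
	out := make([]rune, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		if r, ok := e.cmap[raw[i]]; ok {
			out = append(out, r)
		} else {
			out = append(out, rune(raw[i])) // pass unmapped codes through
		}
	}
	return string(out)
}

// pdfDocEncoder stands in for the PDFDocEncoding fallback table.
type pdfDocEncoder struct{}

func (pdfDocEncoder) Decode(raw string) string { return raw } // identity over ASCII

// Font models only what the lookup order needs.
type Font struct {
	ToUnicode map[byte]rune // parsed ToUnicode CMap, nil when absent
	Encoding  TextEncoder   // encoder built from the Encoding entry, nil when absent
}

// Encoder returns the decoder for this font. The fix: ToUnicode is consulted
// first even when an Encoding entry exists, and PDFDocEncoding (not a no-op
// encoder) is the last resort.
func (f Font) Encoder() TextEncoder {
	if f.ToUnicode != nil {
		return toUnicodeEncoder{f.ToUnicode}
	}
	if f.Encoding != nil {
		return f.Encoding
	}
	return pdfDocEncoder{}
}

func main() {
	// A font whose ToUnicode CMap says code 0x4D (the 'M' slot) is really '0'.
	f := Font{ToUnicode: map[byte]rune{0x4D: '0'}}
	fmt.Println(f.Encoder().Decode("M")) // prints "0", not "M"
}
```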
Fix 2: BaseEncoding + Differences
Per PDF spec section 9.6.6, when an Encoding dictionary is present:

- The BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding) is applied first
- The Differences array then overlays specific character-code mappings on top
The new dictEncoder builds a complete 256-entry lookup table at initialization: the BaseEncoding table is copied first (defaulting to PDFDocEncoding), Differences entries are applied on top, and decoding becomes an O(1) table lookup instead of an O(n) scan, as sketched below.
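A minimal sketch of that table build, with hypothetical names (newDictEncoder and the identity base table are placeholders for the real BaseEncoding tables):

```go
package main

import "fmt"

// dictEncoder decodes single-byte codes through a precomputed 256-entry table.
type dictEncoder struct{ table [256]rune }

// newDictEncoder builds the table once, per PDF spec 9.6.6: copy the
// BaseEncoding table first, then overlay the Differences entries.
func newDictEncoder(base [256]rune, differences map[byte]rune) dictEncoder {
	e := dictEncoder{table: base} // step 1: BaseEncoding foundation
	for code, r := range differences {
		e.table[code] = r // step 2: Differences overlay wins for its codes
	}
	return e
}

// Decode is an O(1) table lookup per byte, replacing the old O(n) scan of
// the Differences array.
func (e dictEncoder) Decode(raw string) string {
	out := make([]rune, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		out = append(out, e.table[raw[i]])
	}
	return string(out)
}

func main() {
	// Identity base table as a stand-in for WinAnsi/MacRoman/PDFDocEncoding.
	var base [256]rune
	for i := range base {
		base[i] = rune(i)
	}
	// Equivalent of /Differences [ 128 /bullet ]: remap code 0x80 to U+2022.
	enc := newDictEncoder(base, map[byte]rune{0x80: '\u2022'})
	fmt.Println(enc.Decode("a\x80b")) // prints "a•b"
}
```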
Testing
Tested with legal document PDFs (court documents) that previously had incorrect text extraction:
All characters including numbers (0-9), punctuation, and case numbers are now correctly extracted.
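For anyone reproducing the check, a small driver against the library's public Open/GetPlainText API (as shown in the project README; the file path and expected string are placeholders drawn from this PR's example):

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/ledongthuc/pdf"
)

func main() {
	// Placeholder path: substitute any previously affected court-document PDF.
	f, r, err := pdf.Open("testdata/court-document.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// GetPlainText decodes every page's text through the font encoders,
	// exercising the code paths this PR changes.
	plain, err := r.GetPlainText()
	if err != nil {
		log.Fatal(err)
	}
	var buf bytes.Buffer
	buf.ReadFrom(plain)

	// With the fix, a case number like "1:15-cv-07433-LAP" survives intact.
	fmt.Println(bytes.Contains(buf.Bytes(), []byte("1:15-cv-07433-LAP")))
}
```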