gh-146311: Reject non-canonical padding bits in base32, 64, & 85 decoding#146312
Draft
gpshead wants to merge 10 commits intopython:mainfrom
Draft
gh-146311: Reject non-canonical padding bits in base32, 64, & 85 decoding#146312gpshead wants to merge 10 commits intopython:mainfrom
gpshead wants to merge 10 commits intopython:mainfrom
Conversation
RFC 4648 section 3.5 allows decoders to reject encoded data containing non-zero pad bits. Both a2b_base64 (strict_mode=True) and a2b_base32 currently silently discard non-zero trailing bits instead of raising binascii.Error. These tests document the expected behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add leftchar validation after the main decode loop in a2b_base64 (strict_mode only) and a2b_base32 (always). Fix existing test data that incidentally had non-zero padding bits to use characters with zero trailing bits while preserving the same decoded output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8451e22 to
0ca2563
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member
Author
|
discussing if base32 needs strict_mode on the issue. not adding a NEWS entry until that is decided. |
…ro-padding-bits
Gate non-zero padding bits rejection behind a new canonical= keyword argument independent of strict_mode, per discussion on pythongh-146311. Per RFC 4648 section 3.5 ("Canonical Encoding"), decoders MAY reject encodings where pad bits are not zero. The new canonical=True flag enables this check for a2b_base64, a2b_base32, a2b_base85, and a2b_ascii85. For base85/ascii85, the canonical check also rejects single-character final groups (never produced by a conforming encoder) and verifies that partial group encodings match what the encoder would produce. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The _Py_ID(canonical) identifier used by the clinic-generated argument parsing code needs to be registered in the global strings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RFC 4648 only covers base16, base32, and base64. The canonical encoding concept applies to base85 but is not defined by that RFC. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the re-encode-and-compare loops with a quotient comparison: two divisions by 85**n_pad tell us whether the decoded uint32 and the zero-padded output bytes share the same leading base-85 digits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test non-canonical rejection for all partial group sizes (2/3/4 chars) - Test digit-0 1-char group for ascii85 (exercises chunk_len==0 guard) - Test boundary byte values (\x00, \xff) at each group size Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Round-trip tests: encoder always produces canonical output (base64, base32, base85, ascii85) - Uniqueness tests: for base85/ascii85 partial groups, sweep all 85 last-digit values and verify exactly one decodes to the original payload with canonical=True Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member
serhiy-storchaka
left a comment
There was a problem hiding this comment.
LGTM technically, but I have two notes:
- The check for single-character final group should be unconditional, like in other codecs. This is the part of the specification, without MAY.
- "Canonical" Ascii85/Base85 encoding is not defined, this is a projection. And in case of Ascii85 many other deviations of "canonical" encoding are accepted by default and not checked. As minimum, we should use "canonical" in quotes here, or don't use that word for these encodings.
| } | ||
| goto error; | ||
| } | ||
| int n_pad = 4 - chunk_len; |
Member
There was a problem hiding this comment.
I think that it is better to make chunk_len and i int.
| * quotients. A 1-char group (chunk_len==0) is always | ||
| * non-canonical since no conforming encoder produces it. */ | ||
| if (canonical && chunk_len < 4) { | ||
| if (chunk_len == 0) { |
Member
There was a problem hiding this comment.
I think this should be checked unconditionally. See https://www.adobe.com/jp/print/postscript/pdfs/PLRM.pdf, section 3.13.3.
The following conditions constitute encoding violations:
• The value represented by a 5-tuple is greater than 232 − 1.
• A z character occurs in the middle of a 5-tuple.
• A final partial 5-tuple contains only one character.
The first two checks do not depend on canonical.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add
canonical=Falsekeyword argument toa2b_base64,a2b_base32,a2b_base85, anda2b_ascii85(and theirbase64module wrappers). Whencanonical=True, non-canonical encodings are rejected per RFC 4648 section 3.5.This is independent of
strict_mode.For base85/ascii85, the check also rejects single-character final groups (never produced by a conforming encoder) and verifies partial group padding matches what the encoder would produce.