gh-146311: Reject non-canonical padding bits in base32, 64, & 85 decoding#146312

Draft
gpshead wants to merge 10 commits into python:main from gpshead:gh-146311-nonzero-padding-bits

Conversation

@gpshead
Member

@gpshead gpshead commented Mar 22, 2026

Summary

Add canonical=False keyword argument to a2b_base64, a2b_base32, a2b_base85, and a2b_ascii85 (and their base64 module wrappers). When canonical=True, non-canonical encodings are rejected per RFC 4648 section 3.5.

This is independent of strict_mode.

For base85/ascii85, the check also rejects single-character final groups (never produced by a conforming encoder) and verifies partial group padding matches what the encoder would produce.
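
To illustrate what "non-canonical" means for base64, here is a minimal sketch that uses only the existing `base64` module, not the proposed `canonical=` keyword. The helper name `is_canonical_base64` is hypothetical. It approximates the check by re-encoding the decoded bytes and comparing, which is not how the C implementation in this PR works.

```python
import base64


def is_canonical_base64(encoded: bytes) -> bool:
    """Hypothetical helper: True if `encoded` is what b64encode would emit.

    Decoding then re-encoding normalizes the pad bits to zero, so any
    difference from the input means the input was non-canonical (or
    contained characters that the lenient decoder skipped).
    """
    decoded = base64.b64decode(encoded)
    return base64.b64encode(decoded) == encoded


# b"QQ==" and b"QR==" both decode to b"A" with the default decoder:
# 'Q' is 16 and 'R' is 17, so they differ only in the 4 discarded pad bits.
print(base64.b64decode(b"QQ=="), base64.b64decode(b"QR=="))
print(is_canonical_base64(b"QQ=="), is_canonical_base64(b"QR=="))
```

With `canonical=True` as described above, the second input would presumably raise `binascii.Error` instead of decoding.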

gpshead and others added 2 commits March 22, 2026 15:05
RFC 4648 section 3.5 allows decoders to reject encoded data containing
non-zero pad bits. Both a2b_base64 (strict_mode=True) and a2b_base32
currently silently discard non-zero trailing bits instead of raising
binascii.Error. These tests document the expected behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add leftchar validation after the main decode loop in a2b_base64
(strict_mode only) and a2b_base32 (always). Fix existing test data
that incidentally had non-zero padding bits to use characters with
zero trailing bits while preserving the same decoded output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
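
The leftover-bits check described in this commit can be sketched in Python roughly as follows. This is an illustration under assumptions, not the C code from the PR: the name `leftchar` comes from the commit message, while `leftbits`, `B64_ALPHABET`, and `trailing_bits_are_zero` are made up for the example, and only the standard base64 alphabet is handled.

```python
import string

# Standard base64 alphabet (RFC 4648, table 1).
B64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def trailing_bits_are_zero(encoded: str) -> bool:
    """Return True if the bits left over after the last full byte are zero.

    Raises ValueError for characters outside the alphabet.
    """
    leftchar = 0  # bits read but not yet emitted as a full byte
    leftbits = 0  # how many bits of leftchar are meaningful
    for ch in encoded.rstrip("="):
        leftchar = (leftchar << 6) | B64_ALPHABET.index(ch)
        leftbits += 6
        if leftbits >= 8:
            leftbits -= 8
            # A full byte would be emitted here; keep only the remainder.
            leftchar &= (1 << leftbits) - 1
    return leftchar == 0


print(trailing_bits_are_zero("QQ=="))  # canonical encoding of b"A"
print(trailing_bits_are_zero("QR=="))  # same byte, non-zero pad bits
```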
@gpshead gpshead force-pushed the gh-146311-nonzero-padding-bits branch from 8451e22 to 0ca2563 Compare March 22, 2026 22:07
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gpshead
Member Author

gpshead commented Mar 22, 2026

discussing if base32 needs strict_mode on the issue. not adding a NEWS entry until that is decided.

@gpshead gpshead self-assigned this Mar 22, 2026
gpshead and others added 3 commits April 4, 2026 22:40
Gate non-zero padding bits rejection behind a new canonical= keyword
argument independent of strict_mode, per discussion on pythongh-146311.

Per RFC 4648 section 3.5 ("Canonical Encoding"), decoders MAY reject
encodings where pad bits are not zero. The new canonical=True flag
enables this check for a2b_base64, a2b_base32, a2b_base85, and
a2b_ascii85.

For base85/ascii85, the canonical check also rejects single-character
final groups (never produced by a conforming encoder) and verifies
that partial group encodings match what the encoder would produce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The _Py_ID(canonical) identifier used by the clinic-generated
argument parsing code needs to be registered in the global strings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gpshead gpshead changed the title gh-146311: Reject non-zero padding bits in base64/base32 decoding gh-146311: Reject non-canonical padding bits in base32, 64, & 85 decoding Apr 4, 2026
gpshead and others added 4 commits April 4, 2026 23:51
RFC 4648 only covers base16, base32, and base64. The canonical
encoding concept applies to base85 but is not defined by that RFC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the re-encode-and-compare loops with a quotient comparison:
two divisions by 85**n_pad tell us whether the decoded uint32 and
the zero-padded output bytes share the same leading base-85 digits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
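
A rough Python model of the quotient comparison described in this commit, written as a sketch under assumptions rather than a transcription of the C code. All function names are hypothetical, digits are base-85 values in 0..84, and a partial group of k characters is assumed to be padded with digit 84 (`u` in Ascii85) before decoding, as the stdlib decoders do.

```python
import base64


def group_value(digits):
    """Integer value of a group after padding it to 5 digits with 84."""
    value = 0
    for d in list(digits) + [84] * (5 - len(digits)):
        value = value * 85 + d
    return value


def decode_partial_group(digits):
    """Bytes carried by a partial group: k digits yield k - 1 bytes."""
    return group_value(digits).to_bytes(4, "big")[: len(digits) - 1]


def is_canonical_partial_group(digits):
    """True if an encoder that zero-pads the payload would emit `digits`."""
    k = len(digits)
    if k == 1:
        return False  # a 1-character group carries no bytes at all
    if not 2 <= k <= 4:
        raise ValueError("expected a partial group of 1 to 4 digits")
    decoded = group_value(digits)
    if decoded > 0xFFFFFFFF:
        return False  # does not fit in a uint32
    n_pad = 5 - k
    # Zero the bytes that are not part of the output (4 - (k - 1) = n_pad).
    zero_padded = (decoded >> (8 * n_pad)) << (8 * n_pad)
    # Same leading k base-85 digits <=> same quotient by 85**n_pad.
    scale = 85 ** n_pad
    return decoded // scale == zero_padded // scale


def ascii85_digits(encoded: bytes):
    """Map Ascii85 characters '!'..'u' to digit values 0..84."""
    return [c - 33 for c in encoded]


print(is_canonical_partial_group(ascii85_digits(base64.a85encode(b"AB"))))
```

The comparison works because the padding digits contribute less than `85**n_pad` to the value, so the quotient recovers exactly the digits that were present in the input.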
- Test non-canonical rejection for all partial group sizes (2/3/4 chars)
- Test digit-0 1-char group for ascii85 (exercises chunk_len==0 guard)
- Test boundary byte values (\x00, \xff) at each group size

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Round-trip tests: encoder always produces canonical output (base64,
  base32, base85, ascii85)
- Uniqueness tests: for base85/ascii85 partial groups, sweep all 85
  last-digit values and verify exactly one decodes to the original
  payload with canonical=True

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member

@serhiy-storchaka serhiy-storchaka left a comment

LGTM technically, but I have two notes:

  • The check for a single-character final group should be unconditional, as in other codecs. This is part of the specification, with no MAY.
  • "Canonical" Ascii85/Base85 encoding is not defined; this is a projection. And in the case of Ascii85, many other deviations from "canonical" encoding are accepted by default and not checked. At a minimum, we should put "canonical" in quotes here, or not use that word for these encodings.

}
goto error;
}
int n_pad = 4 - chunk_len;

I think that it is better to make chunk_len and i int.

* quotients. A 1-char group (chunk_len==0) is always
* non-canonical since no conforming encoder produces it. */
if (canonical && chunk_len < 4) {
if (chunk_len == 0) {

I think this should be checked unconditionally. See https://www.adobe.com/jp/print/postscript/pdfs/PLRM.pdf, section 3.13.3.

The following conditions constitute encoding violations:
• The value represented by a 5-tuple is greater than 2^32 − 1.
• A z character occurs in the middle of a 5-tuple.
• A final partial 5-tuple contains only one character.

The first two checks do not depend on canonical.
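
As an illustration of the three violations quoted above, here is a small Python sketch. The function name and the returned messages are hypothetical, and it assumes the input has already had whitespace and any `<~ ~>` framing removed. It is not the `binascii` implementation.

```python
def ascii85_violation(body: bytes):
    """Return a description of the first encoding violation, or None."""
    group = []
    for ch in body:
        if ch == ord("z"):
            if group:
                return "z inside a 5-tuple"
            continue  # 'z' alone stands for four zero bytes
        if not 33 <= ch <= 117:
            return "character outside '!'..'u'"
        group.append(ch - 33)
        if len(group) == 5:
            value = 0
            for d in group:
                value = value * 85 + d
            if value > 2**32 - 1:
                return "5-tuple value exceeds 2**32 - 1"
            group = []
    if len(group) == 1:
        return "final partial 5-tuple has only one character"
    if group:
        # Partial groups are padded with 'u' (84) before the range check.
        value = 0
        for d in group + [84] * (5 - len(group)):
            value = value * 85 + d
        if value > 2**32 - 1:
            return "5-tuple value exceeds 2**32 - 1"
    return None


for sample in (b"5l", b"5", b"5zl", b"uuuuu", b"z"):
    print(sample, ascii85_violation(sample))
```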

2 participants