gh-146311: Reject non-canonical padding bits in base32, 64, & 85 decoding by gpshead · Pull Request #146312 · python/cpython

gpshead · 2026-03-22T21:58:31Z

Summary

Add a canonical=False keyword argument to a2b_base64, a2b_base32, a2b_base85, and a2b_ascii85 (and their base64 module wrappers). When canonical=True, non-canonical encodings are rejected per RFC 4648 section 3.5.

This is independent of strict_mode.

For base85/ascii85, the check also rejects single-character final groups (never produced by a conforming encoder) and verifies partial group padding matches what the encoder would produce.

Issue: binascii base64 & base32 decode ignore excess padding bits in their input #146311
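The pad bits in question can be illustrated without touching the decoder. The following is a minimal sketch, not the PR's implementation: the helper name pad_bits_are_zero and the alphabet constants are hypothetical, and the function assumes the input has a valid length and only contains alphabet characters plus "=" padding.

```python
import string

# Standard RFC 4648 alphabets (not the URL-safe or hex variants).
B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/").encode()
B32_ALPHABET = (string.ascii_uppercase + "234567").encode()


def pad_bits_are_zero(encoded: bytes, alphabet: bytes, bits_per_char: int) -> bool:
    """Return True if the unused trailing bits of the last character are zero.

    Hypothetical helper for illustration only. It does not validate the
    length of the input or the amount of "=" padding.
    """
    data = encoded.rstrip(b"=")
    if not data:
        return True
    # Bits carried by the characters that do not fill a whole output byte.
    leftover = (len(data) * bits_per_char) % 8
    if leftover == 0:
        return True
    last_value = alphabet.index(data[-1])
    return last_value & ((1 << leftover) - 1) == 0


# "YQ==" and "YR==" differ only in the four discarded bits of the last character.
print(pad_bits_are_zero(b"YQ==", B64_ALPHABET, 6))
print(pad_bits_are_zero(b"YR==", B64_ALPHABET, 6))
```

The same helper covers base32 by passing B32_ALPHABET and 5 bits per character, for example with b"ME======" (zero pad bits) and b"MF======" (non-zero pad bits).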

RFC 4648 section 3.5 allows decoders to reject encoded data containing non-zero pad bits. Both a2b_base64 (strict_mode=True) and a2b_base32 currently silently discard non-zero trailing bits instead of raising binascii.Error. These tests document the expected behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add leftchar validation after the main decode loop in a2b_base64 (strict_mode only) and a2b_base32 (always). Fix existing test data that incidentally had non-zero padding bits to use characters with zero trailing bits while preserving the same decoded output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gpshead · 2026-03-22T22:20:26Z

Discussing on the issue whether base32 needs strict_mode. Not adding a NEWS entry until that is decided.

…ro-padding-bits

Gate non-zero padding bits rejection behind a new canonical= keyword argument independent of strict_mode, per discussion on pythongh-146311.

Per RFC 4648 section 3.5 ("Canonical Encoding"), decoders MAY reject encodings where pad bits are not zero. The new canonical=True flag enables this check for a2b_base64, a2b_base32, a2b_base85, and a2b_ascii85.

For base85/ascii85, the canonical check also rejects single-character final groups (never produced by a conforming encoder) and verifies that partial group encodings match what the encoder would produce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The _Py_ID(canonical) identifier used by the clinic-generated argument parsing code needs to be registered in the global strings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

RFC 4648 only covers base16, base32, and base64. The canonical encoding concept applies to base85 but is not defined by that RFC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the re-encode-and-compare loops with a quotient comparison: two divisions by 85**n_pad tell us whether the decoded uint32 and the zero-padded output bytes share the same leading base-85 digits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
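The quotient comparison described in this commit can be sketched in Python. This is an illustration under stated assumptions, not the C code from Modules/binascii.c: the function name check_partial_group is hypothetical, digits are passed as plain base-85 values (so the sketch is independent of the base85 or ascii85 alphabet), n_pad is taken to be the number of digits missing from a 5-digit group, and the decoder is assumed to pad missing digits with the highest digit value, 84.

```python
def check_partial_group(digits):
    """Decode a final partial group of 2 to 4 base-85 digit values.

    Returns (payload, is_canonical), or None if the padded group does
    not fit in 32 bits. Hypothetical helper for illustration only.
    """
    k = len(digits)
    if not 2 <= k <= 4:
        raise ValueError("a partial group has 2, 3 or 4 digits")
    n_pad = 5 - k                      # digits missing from the 5-digit group
    value = 0
    for d in list(digits) + [84] * n_pad:
        value = value * 85 + d         # decoded uint32, padded with digit 84
    if value > 0xFFFFFFFF:
        return None
    n_bytes = k - 1                    # payload bytes carried by k digits
    shift = 8 * (4 - n_bytes)
    payload = (value >> shift).to_bytes(n_bytes, "big")
    zero_padded = (value >> shift) << shift   # what the encoder would encode
    scale = 85 ** n_pad
    # Two divisions: same leading k digits means the group is canonical.
    is_canonical = value // scale == zero_padded // scale
    return payload, is_canonical


# Sweep the last digit of a 2-digit group whose first digit is 31.
accepted = [d for d in range(85)
            if check_partial_group([31, d]) == (b"a", True)]
print(accepted)
```

Sweeping the last digit this way mirrors the uniqueness idea from the tests in this PR: several digit pairs decode to the same payload, but only one of them has the leading digits that an encoder would emit for that payload.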

- Test non-canonical rejection for all partial group sizes (2/3/4 chars)
- Test digit-0 1-char group for ascii85 (exercises chunk_len==0 guard)
- Test boundary byte values (\x00, \xff) at each group size

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Round-trip tests: encoder always produces canonical output (base64, base32, base85, ascii85)
- Uniqueness tests: for base85/ascii85 partial groups, sweep all 85 last-digit values and verify exactly one decodes to the original payload with canonical=True

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
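The round-trip property listed above can be approximated with the existing base64 module API, without the canonical= argument proposed here. This is a rough sketch: the names CODECS, reencodes_to_itself and check_round_trips are hypothetical, and decode-then-re-encode comparison stands in for the canonical=True check, which is an assumption about equivalence rather than the PR's test code.

```python
import base64

# Encoder/decoder pairs from the base64 module, using default arguments.
CODECS = {
    "base64": (base64.b64encode, base64.b64decode),
    "base32": (base64.b32encode, base64.b32decode),
    "base85": (base64.b85encode, base64.b85decode),
    "ascii85": (base64.a85encode, base64.a85decode),
}


def reencodes_to_itself(encoded, encode, decode):
    """Emulate a canonical check: decoding then re-encoding is the identity."""
    return encode(decode(encoded)) == encoded


def check_round_trips(max_len=9):
    """Check that encoder output survives decode and re-encode unchanged."""
    for name, (encode, decode) in CODECS.items():
        for n in range(max_len):
            for fill in (0x00, 0x5A, 0xFF):   # includes boundary byte values
                payload = bytes([fill]) * n
                encoded = encode(payload)
                assert decode(encoded) == payload, (name, payload)
                assert reencodes_to_itself(encoded, encode, decode), (name, payload)
    return True


print(check_round_trips())
# A base64 input with non-zero pad bits does not re-encode to itself.
print(reencodes_to_itself(b"YR==", base64.b64encode, base64.b64decode))
```

Re-encoding is simpler to write than a bit-level check but costs an extra encode pass, which is why it is shown here only as a test-side approximation.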

serhiy-storchaka

LGTM technically, but I have two notes:

The check for a single-character final group should be unconditional, as in other codecs. This is part of the specification, with no MAY.
"Canonical" Ascii85/Base85 encoding is not defined; this is a projection. And in the case of Ascii85, many other deviations from "canonical" encoding are accepted by default and not checked. At a minimum, we should put "canonical" in quotes here, or not use that word for these encodings.

serhiy-storchaka · 2026-04-05T09:09:34Z

Modules/binascii.c

+                }
+                goto error;
+            }
+            int n_pad = 4 - chunk_len;


I think that it is better to make chunk_len and i int.

serhiy-storchaka · 2026-04-05T09:29:12Z

Modules/binascii.c

+         * quotients.  A 1-char group (chunk_len==0) is always
+         * non-canonical since no conforming encoder produces it. */
+        if (canonical && chunk_len < 4) {
+            if (chunk_len == 0) {


I think this should be checked unconditionally. See https://www.adobe.com/jp/print/postscript/pdfs/PLRM.pdf, section 3.13.3.

The following conditions constitute encoding violations:
• The value represented by a 5-tuple is greater than 2^32 − 1.
• A z character occurs in the middle of a 5-tuple.
• A final partial 5-tuple contains only one character.

The first two checks do not depend on canonical.
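The three encoding violations quoted above can be sketched as a small checker. This is an illustrative Python sketch only, not the binascii implementation: the function name ascii85_violations is hypothetical, and the input is assumed to have its <~ and ~> delimiters and whitespace already removed and to contain only the characters "!" through "u" plus "z".

```python
def ascii85_violations(body: bytes) -> list:
    """Return descriptions of encoding violations found in ASCII85 data.

    Hypothetical helper for illustration only. Partial final groups are
    only checked for the single-character case.
    """
    problems = []
    group = []                          # digit values of the current 5-tuple
    for ch in body:
        if ch == ord("z"):
            if group:
                problems.append("z inside a 5-tuple")
            continue                    # z alone stands for four zero bytes
        group.append(ch - 33)           # "!" is digit 0, "u" is digit 84
        if len(group) == 5:
            value = 0
            for d in group:
                value = value * 85 + d
            if value > 2**32 - 1:
                problems.append("5-tuple value exceeds 2**32 - 1")
            group = []
    if len(group) == 1:
        problems.append("final partial 5-tuple has only one character")
    return problems


print(ascii85_violations(b"s8W-!"))     # largest valid 5-tuple, 0xFFFFFFFF
print(ascii85_violations(b's8W-"'))     # one more than the largest value
print(ascii85_violations(b"@/z"))       # z in the middle of a group
print(ascii85_violations(b"@"))         # lone final character
```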

bedevere-app bot mentioned this pull request Mar 22, 2026

binascii base64 & base32 decode ignore excess padding bits in their input #146311

Open

gpshead and others added 2 commits March 22, 2026 15:05

gpshead force-pushed the gh-146311-nonzero-padding-bits branch from 8451e22 to 0ca2563 on March 22, 2026 22:07

Fix test_base64 test data with non-zero padding bits

615b227

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gpshead self-assigned this Mar 22, 2026

gpshead and others added 3 commits April 4, 2026 22:40

Merge remote-tracking branch 'origin/main' into pythongh-146311-nonze…

1bf5c75

…ro-padding-bits

Add 'canonical' to global strings tables

4b7c6ae


gpshead changed the title ~~gh-146311: Reject non-zero padding bits in base64/base32 decoding~~ gh-146311: Reject non-canonical padding bits in base32, 64, & 85 decoding Apr 4, 2026

gpshead and others added 4 commits April 4, 2026 23:51

Remove incorrect RFC 4648 references from base85/ascii85

308433a


serhiy-storchaka reviewed Apr 5, 2026

gpshead wants to merge 10 commits into python:main from gpshead:gh-146311-nonzero-padding-bits
