Various safety fixes and defense-in-depth by Manishearth · Pull Request #7887 · unicode-org/icu4x

Manishearth · 2026-04-20T20:32:44Z

This fixes a bunch of issues found by @sayrer by running Claude on ICU4X to look for vulnerabilities (reported in https://github.com/unicode-org/icu4x/security/advisories/GHSA-c8m9-943c-5xmw).

Most of the fixes here are defense in depth: the code was already safe, but we can add assertions/etc to make it safer. In other cases the issues found were potential correctness issues, not safety issues.

The only real safety issue found was the multiplication overflows in VarZeroVec validation code: in theory, a 64 bit platform might validate a VZV that is invalid on a 32 bit platform. These have been fixed. They aren't severe enough to warrant a patch release / advisory: the only time this can cause a security issue is if you are trusting VZV data blobs constructed on a 64 bit platform on a 32 bit platform, which only happens with baked data, and 4GB-size VZVs aren't really a thing there.

Thanks, @sayrer !

The first commit was agent assisted but in the end mostly rewritten by me.

Changelog

zerovec: Fix minor soundness issue around unchecked multiplication, add defense in depth against other overflow situations.

resb: Add defense-in-depth around checked multiplication.

icu_provider_fs: Add defense-in-depth against path traversal.

icu_list: Prevent correctness errors from out-of-range indices in list patterns.

sffc · 2026-04-21T03:34:33Z

+                    // Data marker attributes are already validated to be
+                    // alphanumeric + underscore/dash. This is defense-in-depth
+                    // against potential path traversal attacks.


Where do we do the claimed validation? I think we allow attributes to contain /, such as FR/long to be the long French display name, which should by design create a directory in FsDataProvider.

Consider forbidding other characters like . and : to prevent the path traversal

FR/long is not an attribute. We have the key long and the attribute fr.

My initial code here forbade . and string-initial / and \, but then I checked the attributes docs and realized they forbid slashes anyway.

https://unicode-org.github.io/icu4x/rustdoc/icu_provider/struct.DataMarkerAttributes.html#method.try_from_str

oof, this breaks part of my DisplayNames design, I guess a part I haven't implemented yet but assumed would be implementable. We will want to go back and revisit this.

I have a slight preference for you to pull this assumption out to a separate PR because we should discuss this on the DataMarkerAttributes level, but if you want to land this to assert on the status quo, I guess that's okay.

sffc · 2026-04-21T03:38:05Z


    #[inline]
    unsafe fn from_bytes_unchecked(bytes: &[u8]) -> &Self {
+        debug_assert!(!bytes.is_empty());


Thought: this one is very 🤷 because the trait's safety invariants have us handled via the check on line 179

I'm basically always in favor of adding cheap debug assertions that double-validate things unless it's going to be a big mess.

I guess I wouldn't want this everywhere.

I'm going to lean slightly towards keeping this since the non-emptiness is relevant to type-specific invariants.

@sffc is correct that "this one is very 🤷". However, it requires constant vigilance. Maybe this:

#[cfg(test)] mod misuse_tests { use super::OptionULE; use crate::ule::ULE; /// Direct misuse of the unsafe contract: `from_bytes_unchecked` documents /// that callers must pass a slice that has already passed `validate_bytes`, /// which rejects empty input. Calling it with `&[]` violates that contract. /// /// In debug builds the `debug_assert!(!bytes.is_empty())` catches this at /// the construction site. In release builds this same call is undefined /// behavior (the returned `&OptionULE<u8>` would read an out-of-bounds /// discriminant byte on its next use), which is why the test is gated to /// debug assertions. #[test] #[should_panic] #[cfg(debug_assertions)] fn from_bytes_unchecked_empty_slice_trips_debug_assert() { // SAFETY: intentionally unsafe — this test exists to verify that the // defense-in-depth assertion fires when the caller violates the // documented precondition. let _ = unsafe { <OptionULE<u8> as ULE>::from_bytes_unchecked(&[]) }; } /// Same shape, but routed through a "helpful wrapper" — the realistic /// refactor hazard. A future author writes a convenience function that /// forwards a slice to `from_bytes_unchecked` and forgets the length /// check that lived in the original call site. The assertion catches it. #[test] #[should_panic] #[cfg(debug_assertions)] fn wrapper_forwarding_empty_slice_trips_debug_assert() { unsafe fn construct_without_validating(bytes: &[u8]) -> &OptionULE<u8> { // SAFETY: (intentionally unsound — simulates a refactor that drops // the validate_bytes call upstream.) unsafe { <OptionULE<u8> as ULE>::from_bytes_unchecked(bytes) } } // SAFETY: see above — deliberately violating the contract. let _ = unsafe { construct_without_validating(&[]) }; } /// Sanity check: the *contract-respecting* path does not trip the /// assertion. A single-byte slice `[0]` encodes `None` for `OptionULE<u8>` /// (discriminant 0, inner byte ignored — here we supply two bytes so the /// layout is well-formed regardless of U::SIZE assumptions). #[test] fn from_bytes_unchecked_valid_slice_is_fine() { let bytes = [0u8, 0u8]; // SAFETY: validate_bytes would accept this slice (length == 1 + u8::SIZE, // discriminant is 0 meaning None, inner byte is a valid u8 ULE). let opt = unsafe { <OptionULE<u8> as ULE>::from_bytes_unchecked(&bytes) }; // Just prove we can touch it without UB in debug. let _ = opt; } }

debug_assert!(!bytes.is_empty()); could be added to most from_bytes_unchecked implementations then

I know, this is a case where there's some relevant arithmetic just after it.

From an unsafe review perspective this is nicer, though I probably should have added an assertion message.

sffc · 2026-04-21T03:39:35Z

+        let indices_len = F::Index::SIZE.wrapping_mul(len_minus_one as usize);
+        debug_assert!(F::Index::SIZE.checked_mul(len_minus_one as usize).is_some());


Observation: this is equivalent to the * operator you had before I think

Correct, but now it's intentional 😄

I don't think I like doing this in general in the icu4x codebase, but since this is in one our two most sensitive files (the other being yoke)... ok to be extra explicit I guess

Right. Also I think this is not the first time someone will run AI tooling to search for vulns, so some defense in depth is good in that it prevents duplicate reports.

sffc · 2026-04-21T03:42:25Z

+        let indices_size = F::Index::SIZE
+            .checked_mul(len - 1)
+            .expect(F::Index::TOO_LARGE_ERROR);
+        debug_assert!(out + F::Len::SIZE + indices_size <= self.entire_slice.len());


Note, here and elsewhere: I'm not thinking too hard in this review about why you are doing a checked_mul on some and a debug assertion on others. I assume that when you assert, it is a safety thing that is technically reachable, and debug assert is an invariant that is not reachable. I can go back through this again if you want me to.

Yes. There are three ways I've done things in this PR:

Assertions for safety things that are reachable in an API that is allowed to panic (like VZV construction here). I consider FS data provider to be "allowed to panic" in case of a path traversal; that's pretty serious. Happy to downgrade to debug assert if you don't want panics there, though. I view FS code as very fragile and panicky anyway.

A debug assertion for safety things that are not reachable but I'm being cautious.

An error for actual reachable safety issues

We only have one real reachable safety issue here: the one with 32 bit overflow in VZV components.

sffc · 2026-04-21T04:29:06Z

/gemini review

sffc

This is landable but I requested an LLM review to double check that the changes you're making to math expressions are no-ops, which is easier for an LLM to verify than a human.

gemini-code-assist

Code Review

This pull request introduces several safety enhancements to prevent integer overflows, underflows, and path traversal vulnerabilities across the codebase, specifically within list patterns, binary deserialization, and VarZeroVec components. The changes primarily involve replacing direct arithmetic with checked operations and adding explicit bounds checks. The review feedback identifies a potential denial-of-service vulnerability in the file system provider where an assert! is used for input validation instead of returning an error. Additionally, it points out potential underflow panics when calculating index sizes for empty element collections and suggests using saturating_sub to ensure robustness.

gemini-code-assist · 2026-04-21T04:34:09Z

+                    // Data marker attributes are already validated to be
+                    // alphanumeric + underscore/dash. This is defense-in-depth
+                    // against potential path traversal attacks.
+                    assert!(!attr_path.contains(['/', '\\']));


Using assert! for input validation can lead to denial-of-service vulnerabilities if the input can be controlled by an untrusted source (e.g., via a DataRequest). It is better to return a DataError instead of panicking, especially in a library context where robustness is expected.

if attr_path.contains(['/', '\\']) { return Err(DataError::custom("Invalid marker attributes").with_path_context(&path)); }

This should not be hit, and DOS is better than path traversal.

gemini-code-assist · 2026-04-21T04:34:09Z

+        reason = "Function contract allows panicky behavior"
+    )]
+    let indices_size = F::Index::SIZE
+        .checked_mul(elements.len() - 1)


elements.len() - 1 will underflow and panic in debug mode if elements is empty. Using saturating_sub(1) is safer and prevents this panic, which aligns with the defense-in-depth goals of this pull request.

Suggested change

.checked_mul(elements.len() - 1)

.checked_mul(elements.len().saturating_sub(1))

We have checked that elements is nonempty already

gemini-code-assist · 2026-04-21T04:34:10Z

+        reason = "Function contract allows panicky behavior"
+    )]
+    let indices_size = F::Index::SIZE
+        .checked_mul(elements.len() - 1)


elements.len() - 1 will underflow and panic in debug mode if elements is empty. Using saturating_sub(1) is safer and prevents this panic.

Suggested change

.checked_mul(elements.len() - 1)

.checked_mul(elements.len().saturating_sub(1))

robertbastian · 2026-04-21T06:55:20Z

                    && (allow_suffix || index_1 == pattern.len() - 3) =>
            {
-                if (index_0 > 0 && !cfg!(test)) || index_1 - 3 >= 256 {
+                let index1_minus_three = index_1.wrapping_sub(3);


this is a needless change, if the string contains both {0} and {1}, and index_0 < index_1, then index_1 is at least 3. I can see that when reading the code, we should not complicate our code just because AIs are too stupid to understand such simple dependencies

No, I made this change because it's pretty non-obvious to humans too. It took me a while to realize this was checked. An alternate way to do this is to have comments about the invariants. I'm happy to switch to that if you prefer.

This isn't unsafe code, but I was reviewing all of this with my unsafe review hat, where you really want things to be dead obvious.

robertbastian · 2026-04-21T06:58:52Z


    #[inline]
    unsafe fn from_bytes_unchecked(bytes: &[u8]) -> &Self {
+        debug_assert!(!bytes.is_empty());


debug_assert!(!bytes.is_empty()); could be added to most from_bytes_unchecked implementations then

robertbastian · 2026-04-21T07:00:15Z

 ///
 /// * `key_hashes` - [`ExactSizeIterator`] over the hashed key values
-#[expect(clippy::indexing_slicing, clippy::unwrap_used)]
+#[expect(clippy::indexing_slicing, clippy::unwrap_used, clippy::expect_used)]


justification?

Same justification as the preexisting ones? This is a construction API allowed to panic. I could add justification there just wasn't any already so I didn't.

Followup for Robert's comments on #7887 ## Changelog (N/A)

Manishearth requested review from a team, robertbastian and sffc as code owners April 20, 2026 20:32

Manishearth added 2 commits April 20, 2026 13:41

Add some checked casts/multiplication to zerovec

59109ac

Use checked multiplication in resb, add safety comments

0d0583f

Manishearth force-pushed the safety-fixes branch from 39d3ca6 to 912867e Compare April 20, 2026 20:41

Manishearth added 2 commits April 20, 2026 19:14

Defend against path traversal attacks in FSDataProvider

b497477

Check index0 size in ListJoinerPattern::try_from_str

2c1e23d

Manishearth force-pushed the safety-fixes branch from 912867e to 2c1e23d Compare April 21, 2026 02:14

sffc reviewed Apr 21, 2026

View reviewed changes

Manishearth requested a review from sffc April 21, 2026 04:03

sffc approved these changes Apr 21, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 21, 2026

View reviewed changes

Manishearth merged commit 27f15b2 into unicode-org:main Apr 21, 2026
34 checks passed

robertbastian reviewed Apr 21, 2026

View reviewed changes

Manishearth mentioned this pull request Apr 21, 2026

Followup fixes to safety PR #7897

Merged

Manishearth added a commit that referenced this pull request Apr 21, 2026

Followup fixes to safety PR (#7897)

0ab9635

Followup for Robert's comments on #7887 ## Changelog (N/A)

		let indices_len = F::Index::SIZE.wrapping_mul(len_minus_one as usize);
		debug_assert!(F::Index::SIZE.checked_mul(len_minus_one as usize).is_some());

	.checked_mul(elements.len() - 1)
	.checked_mul(elements.len().saturating_sub(1))

Conversation

Manishearth commented Apr 20, 2026

Changelog

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sffc commented Apr 21, 2026

Uh oh!

sffc left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants