I don't think it makes sense to add that to the library; it's better added as a custom validator.
@Keats The need for a UTF-16 code unit length validator is very common (consider all web form handling), since the maxlength of HTML form fields counts UTF-16 code units. If the frontend counts UTF-16 code units and the backend counts UTF-8 code units, inconsistencies arise whenever a value contains characters that take a different number of code units in UTF-16 than in UTF-8. As a result, the server rejects values that passed client-side validation whenever the server's UTF-8 representation is longer than the browser's UTF-16 representation.
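The mismatch described in this comment can be reproduced with the standard library alone. The following is a minimal sketch, not code from this PR: the helper names `fits_utf16` and `fits_utf8` are hypothetical, and the limit of 5 is an arbitrary stand-in for an HTML maxlength value.

```rust
/// Mirrors a browser-style `maxlength` check, which counts UTF-16 code units.
/// (Hypothetical helper, for illustration only.)
fn fits_utf16(value: &str, max: usize) -> bool {
    value.encode_utf16().count() <= max
}

/// A backend check that counts UTF-8 code units (bytes) instead.
/// (Hypothetical helper, for illustration only.)
fn fits_utf8(value: &str, max: usize) -> bool {
    value.len() <= max
}

fn main() {
    // Five euro signs (U+20AC): 1 UTF-16 code unit each, 3 UTF-8 bytes each.
    let value = "\u{20ac}".repeat(5);
    let max = 5; // assumed maxlength of the form field

    // The client-side style check accepts the value...
    println!("fits in UTF-16 code units: {}", fits_utf16(&value, max));
    // ...while a byte-counting server check rejects it.
    println!("fits in UTF-8 code units:  {}", fits_utf8(&value, max));
}
```

With pure ASCII input both checks agree, which is why this kind of inconsistency tends to surface only once users enter non-ASCII text.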
Validator::Regex(_) => "regex",
Validator::Range { .. } => "range",
Validator::Length { .. } => "length",
Validator::LengthUTF16 { .. } => "length_utf16",
- Validator::LengthUTF16 { .. } => "length_utf16",
+ Validator::LengthUtf16 { .. } => "length_utf16",
For consistency with Rust code style, you might want to use Utf in identifiers.
E.g. https://doc.rust-lang.org/std/str/struct.EncodeUtf16.html
I'll address the code style suggestions if crate users want the approach with an extra built-in validator type.
We could gather more comments and thumbs-up on the MR description to get more feedback.
Yeah, I think it makes more sense to go with a parameter approach to the length validator, as mentioned in #250; otherwise we just duplicate things that are 99% the same.
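One way to picture the parameter approach is a single length check that takes the counting unit as an argument. This is only a sketch under assumptions: `LengthUnit`, `measure`, and `validate_length` are hypothetical names made up for illustration and are not the crate's API or the design proposed in #250.

```rust
/// Hypothetical unit selector; not part of the validator crate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LengthUnit {
    /// Count Unicode scalar values (`char`s).
    CodePoints,
    /// Count UTF-16 code units, as HTML `maxlength` does.
    Utf16CodeUnits,
}

/// Measures `value` in the requested unit.
fn measure(value: &str, unit: LengthUnit) -> usize {
    match unit {
        LengthUnit::CodePoints => value.chars().count(),
        LengthUnit::Utf16CodeUnits => value.encode_utf16().count(),
    }
}

/// Shared min/max logic; only the measuring step depends on the unit.
fn validate_length(
    value: &str,
    unit: LengthUnit,
    min: Option<usize>,
    max: Option<usize>,
) -> bool {
    let len = measure(value, unit);
    min.map_or(true, |m| len >= m) && max.map_or(true, |m| len <= m)
}

fn main() {
    // U+1D520 is one code point but two UTF-16 code units.
    let value = "\u{1d520}";
    println!(
        "code points, max 1: {}",
        validate_length(value, LengthUnit::CodePoints, None, Some(1))
    );
    println!(
        "UTF-16 units, max 1: {}",
        validate_length(value, LengthUnit::Utf16CodeUnits, None, Some(1))
    );
}
```

The bounds check is written once, so the two variants differ only in how the length is measured, which is the duplication concern raised above.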
This PR adds a length_utf16 validator.
My project exposes data from Salesforce via a JSON Schema based API. I want to validate field lengths the same way Salesforce does: by counting UTF-16 code units.
UTF-16 is used for Unicode string representation in JavaScript, Java and Salesforce Apex. I think this validator could be useful to others as well. A good use case is aligning backend and frontend length validators.
An example of the mismatch between UTF-16 code units and Unicode code points: the '𝔠' symbol takes 2 UTF-16 code units but is still 1 Unicode code point.
Should I wrap the implementation in an optional feature, length_utf16?
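The '𝔠' example from the description can be checked with the standard library. This is a small sketch rather than the PR's implementation; `utf16_len` and `code_point_len` are hypothetical helper names.

```rust
/// Length in UTF-16 code units (hypothetical helper).
fn utf16_len(value: &str) -> usize {
    value.encode_utf16().count()
}

/// Length in Unicode code points, counted as Rust `char`s (hypothetical helper).
fn code_point_len(value: &str) -> usize {
    value.chars().count()
}

fn main() {
    // U+1D520 MATHEMATICAL FRAKTUR SMALL C lies outside the Basic
    // Multilingual Plane, so UTF-16 encodes it as a surrogate pair.
    let s = "\u{1d520}";
    println!("UTF-16 code units: {}", utf16_len(s));
    println!("code points:       {}", code_point_len(s));
}
```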