Make sure the regex validation is anchored #236

Merged
Bergmann89 merged 2 commits into Bergmann89:master from p32blo:anchored-patterns on Feb 13, 2026

@p32blo
Contributor

@p32blo p32blo commented Feb 11, 2026

While experimenting with xs:pattern validation I noticed that some values were passing validation when they should fail.

This PR makes sure the regex pattern match is anchored, in accordance with the XSD standard. Otherwise a pattern could match only part of the string. For example:

<xs:simpleType name="MyString">
    <xs:restriction base="xs:string">
        <xs:pattern value="\d" />
    </xs:restriction>
</xs:simpleType>

Without anchoring, \d would match not only 1 or 2 but also 11, 12 and 22.

Besides adding the anchors (^ and $), I also used a non-capturing group, (?:...), so that a pattern like dog|cat is treated as ^(dog|cat)$ rather than as ^dog or cat$. Because the group is non-capturing, it does not change the index of any capture group inside the original pattern.
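
As a rough illustration of the wrapping described above, here is a minimal sketch using only the standard library. The helper name `anchor_pattern` is hypothetical and not taken from the PR; the actual change lives in the code generator and may be structured differently. The resulting string is what would then be handed to the regex engine.

```rust
/// Wraps an XSD pattern so that an engine with unanchored matching
/// has to match the whole input.
/// NOTE: `anchor_pattern` is a hypothetical name used for illustration only.
fn anchor_pattern(pattern: &str) -> String {
    // The non-capturing group keeps alternations such as `dog|cat` together.
    // Without it, `^dog|cat$` would mean "starts with dog" OR "ends with cat".
    format!("^(?:{})$", pattern)
}

fn main() {
    // Alternation stays grouped between the anchors.
    println!("{}", anchor_pattern("dog|cat"));
    // A single-digit pattern no longer accepts inputs such as "11".
    println!("{}", anchor_pattern(r"\d"));
}
```

Using a plain group, `^(dog|cat)$`, would also anchor correctly, but it would add a capture group of its own, which is why the non-capturing form is the safer choice here.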

XSD Standard

Here is the relevant section of the standard: https://www.w3.org/TR/xmlschema11-2/

G Regular Expressions

A ·regular expression· R is a sequence of characters that denote a set of strings L(R). When used to constrain a ·lexical space·, a regular expression R asserts that only strings in L(R) are valid ·literals· for values of that type.
Note: Unlike some popular regular expression languages (including those defined by Perl and standard Unix utilities), the regular expression language defined here implicitly anchors all regular expressions at the head and tail, as the most common use of regular expressions in ·pattern· is to match entire ·literals·. For example, a datatype ·derived· from string such that all values must begin with the character 'A' (#x41) and end with the character 'Z' (#x5a) would be defined as follows:

In regular expression languages that are not implicitly anchored at the head and tail, it is customary to write the equivalent regular expression as:

    ^A.*Z$

where '^' anchors the pattern at the head and '$' anchors at the tail.
In those rare cases where an unanchored match is desired, including '.*' at the beginning and ending of the regular expression will achieve the desired results. For example, a datatype ·derived· from string such that all values must contain at least 3 consecutive 'A' (#x41) characters somewhere within the value could be defined as follows:

Feel free to change, correct or otherwise comment on the PR
(or delete it if it doesn't make sense 😢)

Owner

@Bergmann89 Bergmann89 left a comment

Hey @p32blo,

Thanks for the fix, I really appreciate that 👍

Just one small remark: could you please use the following code to resolve the str type instead of hard-coding it? Otherwise code generation might break for configurations that expect absolute paths for built-in types. Thanks 😊

let str_ = resolve_build_in!(ctx, "::core::primitive::str");

static PATTERNS: #lazy_lock<[(&#str_, #regex); #sz]> = #lazy_lock::new(|| [ #( #patterns )* ]);

@p32blo p32blo requested a review from Bergmann89 February 12, 2026 23:44
Owner

@Bergmann89 Bergmann89 left a comment

Thanks. Approved 👍

@Bergmann89 Bergmann89 merged commit 40e5d92 into Bergmann89:master Feb 13, 2026
2 checks passed
@p32blo
Contributor Author

p32blo commented Feb 13, 2026

Thanks for the quick turnaround, you're awesome!!

@Bergmann89
Owner

You are welcome and thanks for your contribution 😊
