Morphology-aware matching (inflection / lemmatization support) #1091

@moritzfl

Check for existing issues

  • Completed

Describe the feature

Summary
Vale currently relies on regex and dictionary-based matching, which makes it difficult to handle inflected word forms (e.g., pluralization, conjugation, declension). This can lead to incomplete rule coverage or complex and hard-to-maintain patterns.

Problem
When defining rules, users often need to account for multiple inflected forms of the same word manually. For example, a rule targeting a base form like run will not match:

  • runs
  • ran
  • running

To compensate, users must either:

  • enumerate all variants explicitly, or
  • write complex regex patterns (e.g., run(s|ning)?, which still misses the irregular form ran)

This approach is error-prone and reduces the readability and maintainability of rules.
This limitation becomes significantly more severe in languages with richer morphology, such as German.
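
The coverage of the regex workaround can be checked against the four forms listed above. This is only a sketch using Python's re module, not Vale's own matching engine; the variable names are illustrative.

```python
import re

# The pattern from the workaround above, tested against each form as a whole word.
pattern = re.compile(r"run(s|ning)?")

forms = ["run", "runs", "ran", "running"]
matched = {form: bool(pattern.fullmatch(form)) for form in forms}
print(matched)
# {'run': True, 'runs': True, 'ran': False, 'running': True}
```

Suffix-style patterns only help with regular inflection; irregular forms such as ran have to be added by hand.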

This video gives a short overview of what to expect for most languages other than English: https://youtu.be/ettP9Ayrho8?is=g0hMRMZqVJE74K8p

Minimal example (in German)

Rule:

extends: substitution
message: Consider using '%s' instead of '%s'
level: warning
ignorecase: false
swap:
  gut: hervorragend

Text:

Das ist eine gute Lösung.

Current behavior:

  • No match

Expected behavior:

  • Match based on the shared lemma (“gute” is an inflected form of “gut”)

For example, the adjective „gut“ can appear in many forms depending on case, gender, and number:

  • gut
  • gute
  • guten
  • gutem
  • guter

Similarly, verbs like „gehen“ produce forms such as:

  • gehe, gehst, geht
  • ging, gegangen

Covering these via regex or explicit lists quickly becomes impractical. As a result:

  • rule definitions become bloated
  • important variants are easily missed
  • false negatives increase significantly

This makes Vale harder to use effectively for non-English content and limits its usefulness in multilingual environments.
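
The German example can be reproduced outside Vale with a small sketch. It assumes the swap key is matched as a whole word, approximated here with \b in Python's re module; Vale's actual engine may differ in detail. The second pattern shows the manual workaround of enumerating the adjective forms listed above.

```python
import re

text = "Das ist eine gute Lösung."

# Assumption: the swap key "gut" is matched as a whole word.
literal = re.compile(r"\bgut\b")

# Workaround: enumerate the adjective forms listed in this issue by hand.
enumerated = re.compile(r"\b(?:gut|gute|guten|gutem|guter)\b")

literal_hit = literal.search(text)
enumerated_hit = enumerated.search(text)
print(literal_hit)             # None
print(enumerated_hit.group())  # gute
```

Every further adjective or verb needs its own hand-written alternation like this, which is where the bloat comes from.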

Discussion / possible approaches
I understand that Vale is intentionally lightweight and primarily regex/dictionary-based, and that performance and simplicity are key design goals.

With that in mind, a possible direction could be:

  • Add a rule syntax that triggers macro expansion: a marked word is expanded, through a dictionary, to all of its valid forms. "Word A" in a rule becomes "Word A - Variant 1|Word A - Variant 2| ... |Word A - Variant n".
  • All of this happens before the actual linting, so it is only done once per linting run. Potentially, Vale could even cache the results for performance.
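
To make the idea concrete, here is a minimal sketch of such an expansion step in Python. Everything in it is an assumption for illustration: the ~ marker syntax, the FORMS table (a hand-written stand-in for a real morphological dictionary, containing only the forms mentioned in this issue), and the function names. None of it is existing Vale behavior.

```python
import re
from functools import lru_cache

# Hypothetical stand-in for a real morphological dictionary; the forms are
# the ones listed in this issue, not complete paradigms.
FORMS = {
    "gut": ("gut", "gute", "guten", "gutem", "guter"),
    "gehen": ("gehen", "gehe", "gehst", "geht", "ging", "gegangen"),
}

# Hypothetical marker syntax: "~gut" in a rule requests expansion of "gut".
MARKER = re.compile(r"~(\w+)")


@lru_cache(maxsize=None)
def expand_lemma(lemma):
    """Return a regex alternation of all known forms of the lemma (cached)."""
    forms = FORMS.get(lemma, (lemma,))  # unknown lemmas expand to themselves
    # Longest forms first, so the alternation is also safe without \b.
    ordered = sorted(set(forms), key=lambda f: (-len(f), f))
    return "(?:" + "|".join(re.escape(f) for f in ordered) + ")"


def expand_rule(pattern):
    """Replace every marker in a rule pattern with its expanded alternation."""
    return MARKER.sub(lambda m: expand_lemma(m.group(1)), pattern)


def compile_swap(swap):
    """Expand and compile each swap key once, before any text is linted."""
    return [(re.compile(r"\b" + expand_rule(key) + r"\b"), repl)
            for key, repl in swap.items()]


rules = compile_swap({"~gut": "hervorragend"})
text = "Das ist eine gute Lösung."
hits = [(m.group(0), repl) for rx, repl in rules for m in rx.finditer(text)]
print(hits)  # [('gute', 'hervorragend')]
```

Because the expansion produces an ordinary regex, the matching step itself stays unchanged; only rule loading gains a preprocessing pass, and the cache keeps repeated lemmas cheap.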

Benefits

  • Simpler and more maintainable rules
  • Better coverage with fewer false negatives
  • Improved support for morphologically rich languages (German, Finnish, Slavic languages, etc.)
  • Better usability in multilingual teams

Question
Would morphology-aware matching be considered within Vale’s scope? I could also try working on it if we agree on an approach that would be accepted as a PR.
