
fix: sanitize SVG output to remove invalid XML characters #46

Merged
Zir0-93 merged 8 commits into master from fix/svg-xml-sanitization on May 3, 2026

@Zir0-93 Zir0-93 commented Apr 30, 2026

Summary

  • SVG diagrams generated by PlantUML can contain invalid XML 1.0 character references (e.g. &#8;, backspace) originating from {@link} / {@code} inline-code delimiters in source Javadoc
  • These cause XML parse failures in downstream consumers
  • Adds sanitizeXml() in PUMLDiagram that strips both invalid &#N;/&#xN; character references and raw control characters in a single pass
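The single-pass approach described in the summary could look roughly like the sketch below. It is not the code merged into PUMLDiagram: the class name XmlSanitizerSketch, the regular expression, and the helper bodies are assumptions made for illustration. Only the names sanitizeXml, appendValidChars, and parseCharRef and the two reference forms (&#N; and &#xN;) come from this PR. The allowlist follows the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real PUMLDiagram implementation may differ.
final class XmlSanitizerSketch {

    // Decimal (&#N;) or hexadecimal (&#xN;) numeric character reference.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:(\\d+)|x([0-9a-fA-F]+));");

    private XmlSanitizerSketch() { }

    /** Drops invalid character references and raw invalid characters in one pass. */
    static String sanitizeXml(String svg) {
        StringBuilder sb = new StringBuilder(svg.length());
        Matcher m = CHAR_REF.matcher(svg);
        int last = 0;
        while (m.find()) {
            // Copy the text before this reference, filtering raw characters.
            appendValidChars(sb, svg, last, m.start());
            // Keep the reference only if it resolves to a legal XML 1.0 character.
            if (isValidXmlChar(parseCharRef(m.group(1), m.group(2)))) {
                sb.append(m.group());
            }
            last = m.end();
        }
        appendValidChars(sb, svg, last, svg.length());
        return sb.toString();
    }

    /** Appends the code points of s in [start, end) that XML 1.0 allows. */
    static void appendValidChars(StringBuilder sb, String s, int start, int end) {
        int i = start;
        while (i < end) {
            int cp = s.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
    }

    /** Returns the referenced code point, or -1 if the number does not fit in an int. */
    static int parseCharRef(String decimal, String hex) {
        try {
            if (decimal != null) {
                return Integer.parseInt(decimal);
            }
            return Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Under these assumptions, `XmlSanitizerSketch.sanitizeXml("a&#8;b")` yields `ab`, while a legal reference such as `&#65;` and named entities such as `&amp;` pass through unchanged.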

Test plan

  • All 159 existing tests pass
  • Verified the generated SVG is valid XML after the fix

🤖 Generated with Claude Code

PlantUML may emit character references like &#8; (backspace) in SVG
output when source Javadoc contains inline-code delimiters or {@link}
tags. These resolve to control characters that are illegal in XML 1.0,
causing XML parse errors in downstream consumers.

Add a sanitizeXml() step in PUMLDiagram that strips both invalid
character references and raw control characters in a single pass,
using the XML 1.0 valid character spec as the allowlist.

Bumps version to 3.8.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dabc60a538


int last = 0;
while (m.find()) {
    appendValidChars(sb, svg, last, m.start());
    int val = m.group(1) != null ? Integer.parseInt(m.group(1)) : Integer.parseInt(m.group(2), 16);

P2: Guard numeric char ref parsing against overflow

sanitizeXml assumes every matched &#...;/&#x...; fits in a Java int, but Integer.parseInt(...) throws NumberFormatException for oversized values (for example &#9999999999;). In that case diagram generation now fails with an uncaught runtime exception instead of sanitizing and dropping the invalid reference, which is a regression for any SVG text that contains out-of-range numeric character references.


Reply from the PR author:

Good catch. Addressed in commit 30d7326 — extracted parsing into a parseCharRef helper that catches NumberFormatException and returns -1, which causes the invalid reference to be dropped silently.
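A helper of the kind described in this reply might look like the sketch below. The signature (two nullable strings for the decimal and hexadecimal capture groups) and the class name CharRefParseSketch are assumptions. Only the name parseCharRef, the caught NumberFormatException, and the -1 sentinel come from the reply.

```java
// Hypothetical sketch of the guarded parse; not the code from commit 30d7326.
final class CharRefParseSketch {

    private CharRefParseSketch() { }

    /**
     * Parses the numeric part of a character reference.
     * Exactly one of decimal/hex is expected to be non-null.
     * Returns -1 when the value does not fit in an int, so the
     * caller can treat the reference as invalid and drop it.
     */
    static int parseCharRef(String decimal, String hex) {
        try {
            if (decimal != null) {
                return Integer.parseInt(decimal);
            }
            return Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            // e.g. "9999999999" exceeds Integer.MAX_VALUE (2147483647).
            return -1;
        }
    }
}
```

Because -1 is outside every range in the XML 1.0 allowlist, the sentinel needs no special case: the same validity check that rejects control characters also rejects it.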

Zir0-93 and others added 7 commits April 30, 2026 09:25
Remove trailing whitespace after semicolon in for-loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace ternary with helper method to satisfy AvoidInlineConditionals
- Guard parseInt against NumberFormatException from oversized char refs
  (e.g. &#9999999999;) per codex review feedback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip generic type parameters (<T>) and parentheses from component
names before passing to PlantUML to avoid syntax errors when rendering
C# code diagrams. Add C# integration and relationship extraction tests.
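As an illustration of the name cleanup this commit describes, here is a minimal sketch. The class and method names are hypothetical, and the exact rule is an assumption: everything between angle brackets (including nested ones) is removed, and parenthesis characters are dropped while the text around them is kept. The project's real rule may differ.

```java
// Hypothetical sketch; names and exact stripping rules are assumptions.
final class ComponentNameSketch {

    private ComponentNameSketch() { }

    /** Removes generic type parameter lists and parenthesis characters. */
    static String toPlantUmlSafeName(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        int depth = 0; // nesting depth inside <...>
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '<') {
                depth++;
            } else if (c == '>') {
                if (depth > 0) {
                    depth--;
                }
            } else if (depth == 0 && c != '(' && c != ')') {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this rule, `Repository<T>` becomes `Repository` and `Dictionary<string, List<int>>` becomes `Dictionary`.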

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…text

When filesFilter is set, Clarpse only parses filtered files. Relations
to non-filtered components (e.g., extends ClassB where ClassB.java is
not in the filter) are silently dropped because Clarpse classifies
unresolved references as external dependencies.

This adds an opt-in config toggle that, after the initial filtered parse:
1. Scans component references for unresolved targets
2. Derives source file names from component unique names
3. Parses just those files from the full ProjectFiles
4. Merges them into the models before CodeDiff creation
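Steps 1 to 3 above can be sketched with plain collections. Everything here is hypothetical: the class and method names, the map-of-references input, and the rule that a unique name such as com.a.ClassB maps to a path ending in com/a/ClassB.java are assumptions for illustration and do not reflect Clarpse's actual model types. Step 4 (merging into the models) is omitted because it depends on those types.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not the project's actual resolution code.
final class ContextResolutionSketch {

    private ContextResolutionSketch() { }

    /** Step 1: referenced targets that were not part of the filtered parse. */
    static Set<String> unresolvedTargets(Map<String, Set<String>> refsByComponent) {
        Set<String> unresolved = new TreeSet<>();
        for (Set<String> targets : refsByComponent.values()) {
            for (String target : targets) {
                if (!refsByComponent.containsKey(target)) {
                    unresolved.add(target);
                }
            }
        }
        return unresolved;
    }

    /** Step 2: assumed mapping from a unique name to a relative source path. */
    static String sourceFileName(String uniqueName) {
        return uniqueName.replace('.', '/') + ".java";
    }

    /** Step 3: project files whose path matches a derived source path. */
    static List<String> filesToParse(Set<String> unresolved, Collection<String> allPaths) {
        List<String> result = new ArrayList<>();
        for (String name : unresolved) {
            String suffix = sourceFileName(name);
            for (String path : allPaths) {
                if (path.equals(suffix) || path.endsWith("/" + suffix)) {
                    result.add(path);
                }
            }
        }
        return result;
    }
}
```

Targets with no matching project file (for example JDK types) produce no entry in the result, so they remain external dependencies.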

Also updates ExtractedRelationships to check externalDependencies for
specialization and realization references, since references to initially
unparsed components are classified as external.

Bumps version to 3.9.0. Adds ADR-005 documenting the decision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ening style attrs

GitHub's SVG sanitizer strips <image> elements and style attributes,
causing metric badges and all lines/borders to disappear when diagrams
are embedded in GitHub. This adds SvgImageInliner post-processing that:

1. Converts <image> elements with data:image/svg+xml;base64 data URIs
   into inline <g> elements with de-duplicated IDs
2. Flattens CSS style="stroke:X;stroke-width:Y;" into individual SVG
   presentation attributes (stroke, stroke-width)
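The style flattening in item 2 can be illustrated with a small regex-based sketch. This is not SvgImageInliner itself: the class name, the pattern, and the simplifying assumptions (double-quoted style attributes, no quotes or semicolons inside values, every declared property treated as a presentation attribute) are mine. Item 1, the image inlining, is not covered here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real SvgImageInliner may work differently.
final class StyleFlattenSketch {

    // A double-quoted style attribute together with its leading whitespace.
    private static final Pattern STYLE_ATTR =
            Pattern.compile("\\sstyle=\"([^\"]*)\"");

    private StyleFlattenSketch() { }

    /** Rewrites style="a:b;c:d;" as a="b" c="d" on each element. */
    static String flattenStyles(String svg) {
        Matcher m = STYLE_ATTR.matcher(svg);
        StringBuilder out = new StringBuilder(svg.length());
        int last = 0;
        while (m.find()) {
            // Copy everything up to the style attribute unchanged.
            out.append(svg, last, m.start());
            for (String decl : m.group(1).split(";")) {
                int colon = decl.indexOf(':');
                if (colon < 0) {
                    continue; // not a "name:value" declaration
                }
                String name = decl.substring(0, colon).trim();
                String value = decl.substring(colon + 1).trim();
                if (!name.isEmpty() && !value.isEmpty()) {
                    out.append(' ').append(name)
                       .append("=\"").append(value).append('"');
                }
            }
            last = m.end();
        }
        out.append(svg, last, svg.length());
        return out.toString();
    }
}
```

For example, `<line style="stroke:#181818;stroke-width:1.0;" x1="0"/>` becomes `<line stroke="#181818" stroke-width="1.0" x1="0"/>`, which no longer relies on the style attribute that GitHub's sanitizer strips.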

Also includes improvements to contextual component resolution, relation
extraction, and change set tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Zir0-93 Zir0-93 merged commit 4533f24 into master May 3, 2026
5 checks passed
@Zir0-93 Zir0-93 deleted the fix/svg-xml-sanitization branch May 3, 2026 11:08