Conversation
We should either restrict all invalid characters both in literal form and as character references, or none of them. Disallowing only the one character is inconsistently. Because checking literal forms means that we should decode and check all the input, this will influence performance. We are not ready to get that performance lost for now. Users of the Reader API could do their own checks themselves.
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #750 +/- ##
==========================================
+ Coverage 61.24% 61.67% +0.42%
==========================================
Files 39 39
Lines 16277 16626 +349
==========================================
+ Hits 9969 10254 +285
- Misses 6308 6372 +64
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Some characters (like control characters, at least in XML 1.1), are legal when escaped but illegal otherwise, while others (like NUL) are illegal in both cases. While we definitely are not 100% conformant with the standard, I disagree that this particular "inconsistency" is necessarily wrong. As far as I can tell, it's one we're going to have even once we do conform, because the XML standard doesn't treat all invalid characters as equally invalid. |
Yes, it is. My point is that:
So I think that removing this check should not make things worse and I propose to accept a short-term solution (yes, I know what you thought now) of removing this check and add it later with understanding and formal tests (which we should get after resolving #749). |
I'm still not certain that disabling them will help this user. I have a vague recollection (though I cannot find where it was written) that whatever weird nonstandard format they were using was dumping raw binary data into the space between tags (i.e. text event in our parlance) and using an attribute to denote the length so that the parser could entirely skip over that region. I could be mixing it up with another issue though. |
Ah, correct. |
We should either restrict all invalid characters both in literal form and as character references, or none of them. Disallowing only the one character is inconsistently. Because checking literal forms means that we should decode and check all the input, this will influence performance. We are not ready to get that performance lost for now. Users of the Reader API could do their own checks themselves.
Ironically, this is what firstly was proposed in #496. After trying to finish that PR I found that we have some unanswered questions (above) which we should work out before we will be create a consistent solution. I think, it will be tight linked with #749.
I think, the best what we can do now is to not check validity of any characters and allow users to that themselves.
cc @sashka