Skip to content

Implement unicode unescaping#438

Open
krishnangovindraj wants to merge 10 commits intotypedb:masterfrom
krishnangovindraj:unicode-escape
Open

Implement unicode unescaping#438
krishnangovindraj wants to merge 10 commits intotypedb:masterfrom
krishnangovindraj:unicode-escape

Conversation

@krishnangovindraj
Copy link
Copy Markdown
Member

Product change and motivation

Escaped unicode characters can be used in TypeQL string literals. These must be of the form '\uXXXXwhere X` is any hex digit.

rust/value.rs Outdated

impl StringLiteral {
fn unescape_unicode<'a>(bytes: &'a [u8]) -> std::result::Result<char, &'a str> {
let as_hex = std::str::from_utf8(bytes).expect("Should still be utf8");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You sliced the bytes, and you could have ended up slicing a multibyte character in the middle. You can't expect valid UTF-8.

rust/value.rs Outdated
}

impl StringLiteral {
fn unescape_unicode<'a>(bytes: &'a [u8]) -> std::result::Result<char, &'a str> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason for this to be an associated function

rust/value.rs Outdated
Comment on lines +363 to +369
b'u' => Self::unescape_unicode(&bytes[2..std::cmp::min(6, bytes.len())]).map(|c| (c, 6)).map_err(|hex| {
TypeQLError::InvalidUnicodeEscapeInString {
full_string: rest.to_owned(),
escape: format!(r"\u{}", hex),
}
.into()
}),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.map(..).map_err(...) is more readable as a match

rust/value.rs Outdated
Comment on lines +348 to +349
let as_u32 = u32::from_str_radix(as_hex, 16).map_err(|_| as_hex)?;
char::from_u32(as_u32).ok_or(as_hex)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier and more performant to iterate through the bytes and accumulate rather than going through UTF-8 decoding and handling decoding and parsing errors.

rust/value.rs Outdated
return Err(TypeQLError::InvalidStringEscape {
escape_handler(rest.as_bytes()).map_err(|expected_escaped_len| {
let safe_len = std::cmp::min(rest.len(), expected_escaped_len);
Into::<crate::common::error::Error>::into(TypeQLError::InvalidStringEscape {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa, what is going on here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't infer for some reason. I can try again.

rust/value.rs Outdated
if bytes.len() < 2 {
return Err(TypeQLError::InvalidStringEscape {
escape_handler(rest.as_bytes()).map_err(|expected_escaped_len| {
let safe_len = std::cmp::min(rest.len(), expected_escaped_len);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's still not safe, we need to make sure we slice on the character boundary. Something like \て would cause a panic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest is still chars though. Would this happen?

Err(considered_escape_seq_length) => {
let offset = escaped_string.len() - rest.len();
let considered_escape_sequence =
escaped_string[offset..].chars().take(considered_escape_seq_length).collect();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not changed this, but I also don't get why it's semantically wrong.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, ceil_char_boundary doesn't look stable in the version of rust we're using.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants