RW for char arrays with unicode support #234

Open
foreverallama wants to merge 7 commits into JuliaIO:master from foreverallama:char_arrays

@foreverallama
Contributor

Extension of #222 that attempts to fix several bugs in handling char arrays. I've added a common decoding method, decode_char_array, in MAT_types.jl. The MAT_v5 and MAT_HDF5 readers now just read the raw integer data and pass it to decode_char_array, which does the decoding. It seems to work with N-D arrays without issue.
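
To illustrate the "read raw integers, decode later" split, here is a minimal sketch. It is not the PR's decode_char_array: the name decode_utf16_rows, the choice of one String per row, and the assumption that dimension 2 holds the characters are all hypothetical, and it only covers UInt16 input.

```julia
# Hypothetical helper, not the PR's decode_char_array: decode raw UTF-16 code
# units into one String per row. Dimension 2 is assumed to hold the characters
# of each row; any trailing dimensions are preserved in the output.
function decode_utf16_rows(data::AbstractArray{UInt16})
    m, n = size(data, 1), size(data, 2)
    rest = size(data)[3:end]              # trailing dimensions, () for a matrix
    flat = reshape(data, m, n, :)         # collapse trailing dimensions into one
    out = [transcode(String, Vector{UInt16}(flat[i, :, k]))
           for i in 1:m, k in axes(flat, 3)]
    return reshape(out, m, rest...)
end

# A reader would hand over raw code units such as these (rows "hé" and "ok").
raw = UInt16[0x0068 0x00e9; 0x006f 0x006b]
decoded = decode_utf16_rows(raw)
```

Decoding whole rows through transcode, rather than converting unit by unit, keeps surrogate pairs intact; the actual return type used in the PR may well differ.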

So far all the read test cases pass. I'll add more test cases, particularly for unicode characters and N-D arrays. I still need to see how to incorporate the same changes in MAT_v4.jl.

CC @matthijscox

@matthijscox
Member

Looks like a great clean-up!

I'm only slightly worried about performance, because I know Val-based dispatch can be slower than normal dispatch. See e.g. https://github.com/ztangent/ValSplit.jl

@foreverallama
Contributor Author

The reference you linked mentions this:

Note that dynamic dispatch does not always occur: When there are a small number of values to split on (less than 4, as of Julia 1.6), the Julia compiler automatically generates a switch statement

Here, the Val argument takes 3 values (UTF-8, UTF-16, UTF-32), so I don't think the dispatch impacts performance. In any case, an alternative would be a manual if-else with everything inline. Either way, only the uint16 arrays would need the codec.
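
For comparison, the two options could look roughly like this. It is a sketch only: the function names, the encoding symbols and the per-encoding bodies are assumptions made for illustration, not the code in this PR, and whether the two differ measurably would need benchmarking.

```julia
# Option 1: Val-based dispatch, one method per encoding (hypothetical names).
decode_units(::Val{:utf8}, data::AbstractVector{UInt8}) =
    String(Vector{UInt8}(data))                 # copy, then reinterpret as UTF-8
decode_units(::Val{:utf16}, data::AbstractVector{UInt16}) =
    transcode(String, Vector{UInt16}(data))     # UTF-16 code units to String
decode_units(::Val{:utf32}, data::AbstractVector{UInt32}) =
    join(Char.(data))                           # one code point per element

# Option 2: a plain runtime branch on a Symbol, no Val involved.
function decode_units_branch(encoding::Symbol, data::AbstractVector{<:Unsigned})
    if encoding === :utf8
        return String(Vector{UInt8}(data))
    elseif encoding === :utf16
        return transcode(String, Vector{UInt16}(data))
    elseif encoding === :utf32
        return join(Char.(data))
    else
        throw(ArgumentError("unsupported encoding: $encoding"))
    end
end

units16 = transcode(UInt16, "héllo")
via_val = decode_units(Val(:utf16), units16)
via_branch = decode_units_branch(:utf16, units16)
```

With a literal Val(:utf16) the method is selected at compile time; if the encoding is only known at runtime, Val(encoding) has to be constructed dynamically, which is the case the ValSplit note is about.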

What do you think?

@foreverallama
Contributor Author

I don't think we need to update MAT_v4. It doesn't support unicode, so it doesn't need any special handling. The output type and formatting of the v4 reader are consistent with what the other formats produce.

@foreverallama
Contributor Author

Added write support for both Julia string and char types, along with some tests using unicode characters. The gist is this:

  • For string arrays, the length of the longest string becomes the 2nd dimension. Shorter rows are padded with spaces to a uniform width (which MATLAB requires).
  • For char arrays, each character is transcoded to UTF-16 code units and the units are merged along rows. Again, rows are padded if necessary.
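
The two bullets above can be sketched as follows. The helper names (pad_rows, strings_to_utf16_matrix, chars_to_utf16_matrix) are hypothetical and the PR's actual implementation may differ; in particular, this sketch measures width in UTF-16 code units, which is an assumption.

```julia
# Hypothetical helper: stack rows of UTF-16 code units into a matrix,
# padding shorter rows on the right with spaces (0x0020).
function pad_rows(rows::AbstractVector{<:AbstractVector{UInt16}})
    width = maximum(length, rows; init=0)
    out = fill(UInt16(' '), length(rows), width)
    for (i, r) in enumerate(rows)
        out[i, 1:length(r)] = r
    end
    return out
end

# String arrays: one row per string; the longest string sets the 2nd dimension.
strings_to_utf16_matrix(strs::AbstractVector{<:AbstractString}) =
    pad_rows([transcode(UInt16, String(s)) for s in strs])

# Char arrays: transcode each row of characters to UTF-16 code units.
# A character outside the BMP becomes two units, so rows can end up with
# different lengths and need padding as well.
chars_to_utf16_matrix(chars::AbstractMatrix{Char}) =
    pad_rows([transcode(UInt16, join(chars[i, :])) for i in axes(chars, 1)])

from_strings = strings_to_utf16_matrix(["ab", "c"])
from_chars = chars_to_utf16_matrix(['a' '😀'; 'b' 'c'])
```

In the second call the emoji occupies a surrogate pair, so the first row has three code units and the second row gets one space of padding.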

I've added a bunch of tests for both read and write, and they all work with my version of MATLAB (2025b). I think I've also handled some edge cases, such as 1D vectors.

Honestly, there's probably a better way to do this, and I could use some pointers there. Until then, I believe this is a reasonable solution.
