RW for char arrays with unicode support by foreverallama · Pull Request #234 · JuliaIO/MAT.jl

foreverallama · 2026-03-06T18:20:02Z

Extension of #222 that attempts to fix several bugs in handling char arrays. I've added a common decoding method, decode_char_array, in MAT_types.jl. The MAT_v5 and MAT_HDF5 readers now just read the raw integer data and pass it to decode_char_array, which does the decoding. This seems to work with N-D arrays without issue.
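To illustrate the idea of reading raw integers first and decoding afterwards, here is a minimal sketch. It is not the PR's decode_char_array; the name decode_units and the rule that the element type selects the encoding are assumptions made for this example only.

```julia
# Hypothetical sketch, not the actual decode_char_array from this PR.
# Assumption: the integer element type tells us the encoding of the raw data.

# UTF-8 bytes: copy first, because String(::Vector{UInt8}) takes ownership of its input.
decode_units(units::AbstractVector{UInt8}) = String(Vector{UInt8}(units))

# UTF-16 code units: Base.transcode handles surrogate pairs.
decode_units(units::AbstractVector{UInt16}) = transcode(String, Vector{UInt16}(units))

# UTF-32: every element is already a full code point.
decode_units(units::AbstractVector{UInt32}) = join(Char.(units))

# A 2-D char matrix can then be decoded row by row.
decode_rows(m::AbstractMatrix) = [decode_units(row) for row in eachrow(m)]
```

For example, decode_units(transcode(UInt16, "héllo")) gives back "héllo", and decode_rows on a 2×2 UInt16 matrix returns a vector of two strings.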

So far all the test cases for read pass. I'll add some more test cases to check this, particularly with Unicode characters and N-D arrays. I will also need to see how to incorporate the same changes in MAT_v4.jl.

CC @matthijscox

matthijscox · 2026-03-07T07:06:32Z

Looks like a great clean-up!

I'm only slightly worried about performance, because I know Val-based dispatch can be slower than normal dispatch. See e.g. https://github.com/ztangent/ValSplit.jl

foreverallama · 2026-03-08T14:58:25Z

The reference you linked mentions this:

Note that dynamic dispatch does not always occur: When there are a small number of values to split on (less than 4, as of Julia 1.6), the Julia compiler automatically generates a switch statement

Here, the Val argument takes 3 values (UTF-8, UTF-16, UTF-32), so I don't think the dispatch impacts performance. In any case, an alternative would be a manual if-else with everything done inline. Either way, only the UInt16 array would need the codec.
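For comparison, the two options could look roughly like the sketch below. The function names and the :utf8/:utf16/:utf32 tags are hypothetical and only serve the illustration; they are not taken from the PR.

```julia
# Hypothetical names for illustration only.

# Option 1: Val-based dispatch, one method per encoding tag.
decode_val(::Val{:utf8},  units) = String(Vector{UInt8}(units))
decode_val(::Val{:utf16}, units) = transcode(String, Vector{UInt16}(units))
decode_val(::Val{:utf32}, units) = join(Char.(units))

# Option 2: a manual branch with the same logic inline.
function decode_branch(encoding::Symbol, units)
    if encoding === :utf8
        return String(Vector{UInt8}(units))
    elseif encoding === :utf16
        return transcode(String, Vector{UInt16}(units))
    elseif encoding === :utf32
        return join(Char.(units))
    else
        throw(ArgumentError("unsupported encoding: $encoding"))
    end
end
```

Both variants return a String in every branch, so the result type is the same either way. Timing the two on real data (for instance with BenchmarkTools) would show whether the difference matters in practice.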

What do you think?

foreverallama · 2026-03-08T18:28:43Z

I don't think we need to update MAT_v4. It doesn't support Unicode, so we don't need any special handling. The output type and formatting of the v4 reader are consistent with those of the other formats.

foreverallama · 2026-03-09T13:26:46Z

Added write support for both Julia string and char types. Added some tests with unicode characters as well. The basic gist is this:

For string arrays, the maximum string length becomes the 2nd dimension, and shorter rows are padded with spaces to a uniform width (which MATLAB requires).
For char arrays, each character is transcoded to UTF-16 code units and merged along rows. Again, rows are padded if necessary.
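A minimal sketch of the padding step for a vector of strings is shown here. The helper name strings_to_utf16_matrix is hypothetical, and measuring the row width in UTF-16 code units rather than characters is an assumption of this sketch, not necessarily what the PR does.

```julia
# Hypothetical helper, not the PR's implementation.
# Assumption: row width is counted in UTF-16 code units.
function strings_to_utf16_matrix(strs::AbstractVector{<:AbstractString})
    # Transcode every string to its UTF-16 code units.
    rows = [transcode(UInt16, String(s)) for s in strs]
    # The longest row determines the 2nd dimension.
    width = maximum(length, rows; init=0)
    # Start from a matrix filled with spaces so shorter rows end up padded.
    out = fill(UInt16(' '), length(rows), width)
    for (i, row) in enumerate(rows)
        out[i, 1:length(row)] .= row
    end
    return out
end
```

For ["ab", "c"] this yields a 2×2 matrix whose second row is 'c' followed by a space. Characters outside the Basic Multilingual Plane occupy two columns, since they need a surrogate pair.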

I've added a bunch of tests for both read and write, and they all work with my version of MATLAB (2025b). I think I've handled some edge cases like 1D vectors and stuff as well.

Honestly there's probably a better way to do this, could use some pointers there. Till then I believe this is a reasonable solution.

matthijscox and others added 3 commits February 7, 2026 13:22

fixing HDF5 char arrays

6c9c5bb

Revert "fixing HDF5 char arrays"

a41abf0

This reverts commit 6c9c5bb.

Read char arrays with unicode characters in v5 and HDF

00d61b8

New test with some more unicode characters in new MATLAB versions

a6b84e6

foreverallama added 3 commits March 8, 2026 23:59

Improve readability

d46dc40

Write support for char arrays with unicode

8960eca

Write to char arrays; both Julia strings and chars

1086e9c

foreverallama wants to merge 7 commits into JuliaIO:master from foreverallama:char_arrays
