What Unicode characters can be rendered in the Command Window?

Question

Russell Carpenter on 4 Aug 2023

Edited: Stephen23 on 2 Sep 2023

I would like to display "blackboard/double-struck/open-face" characters in the Command Window. The following examples are failed attempts to show the double-struck digit "1", which should be U+1D7D9. Is there a reference or listing of which Unicode characters can be rendered in the Command Window?

>> sprintf('\x1D7D9')

Warning: The hex value specified is outside the range of the character set.

ans =

1×0 empty char array

>> char(hex2dec('1D7D9'))

ans =

'[]'

Answer 1

Stephen23 on 4 Aug 2023

Edited: Stephen23 on 4 Aug 2023

The maximum character code that (currently) can be used in MATLAB is:

+char(Inf)
ans = 65535

The value you are attempting to convert is above this limit, and we can see that it gets mapped to the highest code:

num = hex2dec('1D7D9')
num = 120793
+char(num)
ans = 65535

The MATLAB documentation states that it uses UTF-16 for storing characters:

https://www.mathworks.com/help/matlab/matlab_prog/unicode-and-ascii-values.html

This was written by someone who does not realize that UTF-16 is by definition a variable-length encoding and can therefore use one or two 16-bit code units per character. MATLAB uses exactly one 16-bit code unit per char element, so the documentation is incorrect or at least misleading.
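For illustration, the two 16-bit code units that UTF-16 uses for U+1D7D9 can be computed by hand and stored as two elements of a char array. This is only a sketch of the standard surrogate-pair arithmetic; the variable names are my own, and whether the Command Window then renders the pair as a single glyph depends on the release and font, which I have not verified.

```matlab
% Sketch: split a code point above 65535 into a UTF-16 surrogate pair.
cp  = hex2dec('1D7D9');                   % MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
v   = cp - hex2dec('10000');              % 20-bit offset above the 16-bit range
hi  = hex2dec('D800') + floor(v / 1024);  % high (lead) surrogate
lo  = hex2dec('DC00') + mod(v, 1024);     % low (trail) surrogate
str = char([hi, lo]);                     % two char elements for one character
dec2hex([hi, lo])                         % rows 'D835' and 'DFD9'
numel(str)                                % 2, because MATLAB counts code units
```

Note that numel reports two elements here, which is exactly the mismatch between "character" and "code unit" described above.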

I recall that the documentation used to state the maximum supported char code somewhere, but I cannot find it right now.

15 Comments

Stephen23 on 8 Aug 2023

Edited: Stephen23 on 8 Aug 2023

"I speculate that Mathworks wanted to preserve the semantics that char() of a scalar numeric value always returns a scalar character"

But the Unicode character U+1D7D9 that the OP asked about is one scalar character, not two as you wrote (you are confusing the encoding with the number of characters). How it is stored in memory (normalization form, encoding, etc.) is an implementation detail for the computer to manage; it is not for me, the user, to have to fight with bytes or to call the obscure NATIVE2UNICODE. Sure, Unicode is not simple: it involves checking canonical equivalence, normalization forms, and so forth. But computers are good at this kind of thing (other languages can manage this, just MATLAB can't).

I thank my lucky stars for the developers of Python 3, who finally realized that a Unicode character is a Unicode character is a Unicode character, and abandoned the half-baked approach that TMW follows ("we still write our code with fond memories of when characters had exactly seven bits ... what do you mean, there are users who speak other languages who might like to use their computers?")

Unicode has existed for more than thirty years. Catch up, TMW.
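For readers who want to see what the NATIVE2UNICODE route mentioned above looks like, here is a minimal sketch. It assumes the UTF-8 bytes of U+1D7D9 (F0 9D 9F 99) are typed in by hand, and the variable names are only for illustration. As far as I know the result is a char vector of UTF-16 code units, so the single character occupies two elements.

```matlab
% Sketch: decode the UTF-8 bytes of U+1D7D9 into a MATLAB char vector.
bytes     = uint8([240 157 159 153]);        % F0 9D 9F 99 in decimal
str       = native2unicode(bytes, 'UTF-8');  % decode bytes to char
codeUnits = double(str);                     % inspect the stored 16-bit values
back      = unicode2native(str, 'UTF-8');    % re-encode to check the round trip
roundTripOk = isequal(back, bytes);
```

This is the "fighting with bytes" that a user should not need just to obtain one character.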

Stephen23 on 8 Aug 2023

Edited: Stephen23 on 2 Sep 2023

"I speculate that Mathworks wanted to preserve the semantics that char() of a scalar numeric value always returns a scalar result while using a fixed size storage per character and avoiding the overkill of assigning 4 bytes per location."

That closely matches my speculation.

The problem is that this matches neither user expectations (mine or the OP's) nor the documentation. One of two things needs to happen:

Either the documentation is updated to make it clear that CHAR is limited to 16 bits** and that any larger character code requires special magic incantations. A warning would help too, rather than just silently mapping larger values to 65535.

Or the character type is updated so that it works as younger users expect (whether that means fixed bytes or a "marginal index" is irrelevant to me, the user).

** Currently the closest it gets is the statement "However, the integers from 0 to 65535 also correspond to Unicode® characters". That is an odd statement to make about Unicode, being just as true as stating "the integers from 0 to N also correspond to Unicode® characters" for any 0<=N<=MaxUnicodeCharCode: factually true but not helpful (and not actually a statement about CHAR, the function).
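To make the "warning rather than silent mapping" idea concrete, here is a sketch of a user-side wrapper. The function name codepoint2char and its error identifier are hypothetical (this is not a MathWorks function); it simply refuses values beyond U+10FFFF and returns a surrogate pair for code points above 65535 instead of saturating.

```matlab
function str = codepoint2char(cp)
% CODEPOINT2CHAR  Hypothetical helper: convert one Unicode code point to a
% char vector, raising an error instead of silently saturating at 65535.
    validateattributes(cp, {'numeric'}, {'scalar', 'integer', 'nonnegative'});
    cp = double(cp);
    if cp > hex2dec('10FFFF')
        error('codepoint2char:outOfRange', ...
              'Code point %d is beyond U+10FFFF.', cp);
    elseif cp <= 65535
        str = char(cp);                   % fits in a single 16-bit element
    else
        v   = cp - 65536;                 % offset above the 16-bit range
        str = char([55296 + floor(v / 1024), 56320 + mod(v, 1024)]);
    end
end
```

With this saved as codepoint2char.m, calling +codepoint2char(hex2dec('1D7D9')) returns the two code units 55349 and 57305 rather than 65535.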

Stephen23 on 2 Sep 2023

Edited: Stephen23 on 2 Sep 2023

@Walter Roberson: perhaps a fixed-width encoding could be used rather than a variable-width one. This would give any one char array a constant stride (and so the efficient linear indexing that you describe) and would implement the entire code point range without special incantations. Here are two ways this could work invisibly to the user:

First, use three bytes per character. Apparently Unicode does not even use all 32 bits of UTF-32: only 21 bits per character are required to cover all characters, so three bytes would suffice, which is not a big step up from the two bytes currently used. Benefit: fixed, constant everything, linear access. Cost: a 50% increase in memory (given modern computer memory, would anyone even notice? Who uses 25 GiB char arrays?).

Second, store the stride in the array header (e.g. in 2 bits), then store the char array using 8, 16, or 32 bits per character depending on the array content. This would require checking the content (slow) and some kind of process, similar to the existing floating-point write-on-real-to-complex, to copy the array when "larger" characters are written to an existing array. Benefit: minimal memory for any one array (it could even decrease memory, as most(?) text uses Basic Latin). Cost: checking the array content.
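The arithmetic behind both ideas can be sketched in a few lines. This is only an illustration under my own assumptions: codePoints stands in for the content of some array, and bytesPerChar is a hypothetical name for the stride that would be stored in the header.

```matlab
% Sketch: bits needed for all of Unicode, and the narrowest stride for one array.
bitsNeeded = ceil(log2(hex2dec('10FFFF') + 1));   % bits to cover U+0000..U+10FFFF
codePoints = [72, 105, hex2dec('45E'), hex2dec('1D7D9')];   % example content
maxCode    = max(codePoints);
if maxCode <= 255
    bytesPerChar = 1;      % 8 bits per character is enough
elseif maxCode <= 65535
    bytesPerChar = 2;      % 16 bits per character, as CHAR uses today
else
    bytesPerChar = 4;      % 32 bits per character for anything larger
end
```

Here bitsNeeded evaluates to 21, and the single code point above 65535 forces the whole example array to the widest stride, which is the content check counted as the cost above.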

However, this very interesting manifesto:

https://utf8everywhere.org/

makes the case that it is in any case a mistake to think of Unicode characters as simply an extension of the ASCII set, because of e.g. combining characters, which by definition must be considered equivalent to the combined character. This necessarily breaks the correspondence between the concept "character" and something that can be simply linearly indexed, making this requirement rather moot. Perhaps it is not even meaningful to require a constant indexing stride: if we loop over two character vectors, one of which has a "latin letter with diacritic" composite character while the other has a "latin letter" character followed by a combining diacritic character, what is the expected behavior from MATLAB? From both the user and Unicode perspectives, they are equivalent.

One major difference MATLAB has to the supporting examples given in that manifesto is that MATLAB tends to be used to analyze file content, less often to process document/file text itself. So the argument based on meta-data (e.g. XML tags) used throughout that manifesto is perhaps less relevant for MATLAB (than for a general programming language which is used e.g. to write an XML parser).

Although fixed-width solves some things (linear indexing) it only gives the illusion of Unicode support: perhaps variable-width encoding backed by an appropriate Unicode library (for equivalences, combining characters, case conversions etc) is the only robust solution to full Unicode support. Currently MATLAB fails my basic Unicode compliance tests, e.g. for canonical equivalence:

one = sprintf('\x45E') % CYRILLIC SMALL LETTER SHORT U
one = 'ў'
numel(one)
ans = 1
two = sprintf('\x443\x306') % CYRILLIC SMALL LETTER U & COMBINING BREVE
two = 'ў'
numel(two)
ans = 2
strcmp(one,two) % not Unicode compliant
ans = logical
   0

Walter Roberson on 2 Sep 2023

@Stephen23

I wonder to what extent customers expect the Text Analytics Toolbox to cope with such things, and, if they do, what MathWorks has implemented.
