What Unicode characters can be rendered in the Command Window?

I would like to display "blackboard/double-struck/open-face" characters in the Command Window. The following examples are failed attempts to show the double-struck digit "1", which should be U+1D7D9. Is there a reference or listing of which Unicode characters can be rendered in the Command Window?
>> sprintf('\x1D7D9')
Warning: The hex value specified is outside the range of the character set.
ans =
1×0 empty char array
>> char(hex2dec('1D7D9'))
ans =
'[]'

Accepted Answer

Stephen23 on 4 Aug 2023
Edited: Stephen23 on 4 Aug 2023
The maximum character code that (currently) can be used in MATLAB is:
+char(Inf)
ans = 65535
The value you are attempting to convert is above this limit, so it gets saturated to the highest code:
num = hex2dec('1D7D9')
num = 120793
+char(num)
ans = 65535
The MATLAB documentation states that it uses UTF-16 for storing characters:
This was written by someone who does not realize that UTF-16 is by definition a variable-length encoding, and can therefore use one or two 16-bit code units per character. Each MATLAB char element holds exactly one 16-bit code unit, so the documentation is incorrect and/or misleading.
I remember that the documentation used to state the maximum supported char code somewhere, but I cannot find it right now.
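To illustrate the "one or two code units" point, here is a minimal sketch of how UTF-16 would split U+1D7D9 into a surrogate pair, using only basic arithmetic. The variable names are illustrative. Whether the Command Window renders the resulting two-element char vector as a single glyph is an assumption that depends on the release, platform and font, so no rendered output is shown.

```matlab
% Sketch: split a supplementary-plane code point into a UTF-16 surrogate pair.
cp = hex2dec('1D7D9');          % 120793, above the 65535 limit of one char
v  = cp - 65536;                % offset into the supplementary planes (20 bits)
hi = 55296 + floor(v / 1024);   % high surrogate: 0xD800 + top 10 bits
lo = 56320 + mod(v, 1024);      % low surrogate:  0xDC00 + bottom 10 bits
pair  = char([hi, lo]);         % two 16-bit code units stored as two chars
codes = double(pair)            % 55349 57305, i.e. 0xD835 0xDFD9
n     = numel(pair)             % 2: numel counts code units, not characters
```

Note that numel, indexing and strcmp all operate on the two code units separately, which is the mismatch between "char element" and "character" described above.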
  15 Comments
Stephen23 on 2 Sep 2023
Edited: Stephen23 on 2 Sep 2023
@Walter Roberson: perhaps rather than variable width, it might be possible to use fixed width. This would allow any one char array to have a constant stride (giving efficient linear indexing, as you state) and cover the entire code point range without special incantations. Here are two ways this could work invisibly to the user:
  1. Apparently Unicode does not even use all 32 bits of UTF-32: only 21 bits per character are required to cover all characters. Perhaps three bytes could be used, which is not a big step up from the two bytes currently used. Benefit: fixed, constant everything, linear access. Cost: 50% increase in memory (given modern computer memory, would anyone even notice? Who uses 25 GiB char arrays?)
  2. Store the stride in the array header (e.g. 2 bits), then store the char array using 8, 16 or 32 bits per character depending on the array content. This would require checking the content (slow) and some kind of process, similar to the copy that already happens when a complex value is written into a real floating-point array, to copy the array when "larger" characters are written to an existing array. Benefit: minimal memory for any one array (it could even decrease memory use, as most(?) text uses Basic Latin). Cost: checking array content.
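The two options above can be compared with a small sketch. This is only an illustration of the arithmetic, not how MATLAB stores anything: `codes` is a hypothetical vector of code points, and the width table is an assumption based on the 8/16/32-bit choice in option 2.

```matlab
% Sketch: pick the narrowest fixed width that fits every code point (option 2)
% and compare it with a constant three bytes per character (option 1).
codes  = [72, 105, 32, 120793];        % hypothetical input: 'H', 'i', space, U+1D7D9
widths = [8, 16, 32];                  % candidate bits per character
limits = [255, 65535, 1114111];        % largest code point each width has to hold
idx = find(max(codes) <= limits, 1);   % the content check: first width that fits
bitsPerChar  = widths(idx);            % stride that would go into the array header
bytesOption1 = 3 * numel(codes);                % always three bytes per character
bytesOption2 = bitsPerChar / 8 * numel(codes);  % stride chosen from the content
```

With the single supplementary-plane character removed from `codes`, the same check selects 8 bits per character, which is where the potential memory saving for Basic Latin text would come from.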
However, this very interesting manifesto:
makes the case that it is in any case a mistake to think of Unicode characters as simply an extension of the ASCII set, because of e.g. combining characters, which by definition must be considered equivalent to the precomposed character. This necessarily breaks the correspondence between the concept "character" and something that can be simply linearly indexed, making the requirement rather moot. Perhaps it is not even meaningful to require a constant indexing stride: if we loop over two character vectors, one of which has a precomposed "latin letter with diacritic" character and the other a "latin letter" character followed by a combining diacritic character, what is the expected behavior from MATLAB? From both the user and Unicode perspectives, they are equivalent.
One major difference between MATLAB and the supporting examples given in that manifesto is that MATLAB tends to be used to analyze file content, and less often to process the document/file text itself. So the argument based on meta-data (e.g. XML tags) used throughout that manifesto is perhaps less relevant for MATLAB than for a general programming language which is used e.g. to write an XML parser.
Although fixed width solves some things (linear indexing), it only gives the illusion of Unicode support: perhaps a variable-width encoding backed by an appropriate Unicode library (for equivalences, combining characters, case conversions etc.) is the only robust solution for full Unicode support. Currently MATLAB fails my basic Unicode compliance tests, e.g. for canonical equivalence:
one = sprintf('\x45E') % CYRILLIC SMALL LETTER SHORT U
one = 'ў'
numel(one)
ans = 1
two = sprintf('\x443\x306') % CYRILLIC SMALL LETTER U & COMBINING BREVE
two = 'ў'
numel(two)
ans = 2
strcmp(one,two) % not Unicode compliant
ans = logical
0
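For comparison, a possible workaround sketch: normalize both inputs to NFC before comparing. This is not a built-in MATLAB text feature. It assumes MATLAB is running with its JVM and borrows the Java standard library class `java.text.Normalizer`; the variable names are illustrative.

```matlab
% Sketch: compare after NFC normalization (assumes MATLAB was started with the JVM).
one = char(1118);          % U+045E CYRILLIC SMALL LETTER SHORT U
two = char([1091, 774]);   % U+0443 followed by U+0306 COMBINING BREVE
% Look up the nested Java enum constant java.text.Normalizer.Form.NFC.
nfc = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFC');
% Normalize both inputs, then convert the Java strings back to MATLAB char.
oneN = char(java.text.Normalizer.normalize(java.lang.String(one), nfc));
twoN = char(java.text.Normalizer.normalize(java.lang.String(two), nfc));
sameRaw  = strcmp(one, two);     % code-unit comparison, as in the test above
sameNorm = strcmp(oneN, twoN);   % comparison of the normalized forms
```

Normalizing on every comparison has a cost, and it only addresses canonical equivalence, not the other items (case conversions, indexing by character) mentioned above.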
Walter Roberson on 2 Sep 2023
I wonder to what extent customers expect the Text Analytics Toolbox to cope with such things... and if they do, what MathWorks has implemented.

Release

R2023a
