What are the internal differences between Matlab strings and character arrays?

Question

Jim Hokanson on 17 Oct 2017


https://ch.mathworks.com/matlabcentral/answers/361783-what-are-the-internal-differences-between-matlab-strings-and-character-arrays

Commented: Steven Lord on 6 May 2021

Matlab strings were introduced in 2016b presumably to make working with a string more similar to other languages, as compared to character arrays.

Although the documentation clearly states most details that people will want to know about a string, I'm a bit unclear as to how a string and character array are different, other than having different methods.

String documentation: https://www.mathworks.com/help/matlab/ref/string.html

Presumably strings are still encoded using UTF-16. Although I haven't tried it, I wouldn't expect that mex files support strings.

Also in this post, https://blogs.mathworks.com/loren/2016/09/15/introducing-string-arrays/, Loren mentions that "string arrays" are "more efficient" for storing text data. Why? (This might be a string array question and not something specific to strings vs chars).

6 Comments

James Tursa on 6 May 2021

Edited: James Tursa on 6 May 2021

@Steven Lord The robust functionality for users I understand. What I can't understand is why cellfun would report the class of the elements as char.

Steven Lord on 6 May 2021


[This may seem like a non sequitur at first, but bear with me for a moment.] Technically, the matrix product A*B is only mathematically defined if the number of columns in A matches the number of rows in B. But MATLAB breaks that rule slightly by allowing either A or B to be a scalar even if the other is not a vector. It does so in order to not force users to create (potentially large) temporary arrays.

n = 6; % Imagine if n were 600 or 6000 instead of 6
A1 = 5;
A2 = diag(repmat(A1, 1, n));
B = magic(n);
C1 = A1*B;
C2 = A2*B;
isequal(C1, C2) % true
ans = logical
   1
whos % A2 is larger than A1
  Name      Size            Bytes  Class      Attributes

  A1        1x1                 8  double               
  A2        6x6               288  double               
  B         6x6               288  double               
  C1        6x6               288  double               
  C2        6x6               288  double               
  ans       1x1                 1  logical              
  n         1x1                 8  double               

Technically, cellfun should only operate on cell arrays (including cellstrs) and should throw an error if you pass in a string array. However, in order to not force users to convert string arrays into cellstrs manually (the equivalent of creating A2 from A1), which could also be expensive in terms of time and/or memory, cellfun treats those strings as if they were cellstrs. This includes treating them like cellstrs for the purposes of calling class on the contents.

C = {'abc', 'def'};
cellResult = cellfun(@class, C, 'UniformOutput', false)
cellResult = 1×2 cell array
    {'char'}    {'char'}
Cs = string(C);
stringResult = cellfun(@class, Cs, 'UniformOutput', false) % Treats Cs like it were C
stringResult = 1×2 cell array
    {'char'}    {'char'}
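
For contrast, here is a small sketch (not part of the original comment) of two alternatives: arrayfun, which iterates over the elements of the string array itself, and an explicit cellstr conversion followed by cellfun. It assumes R2016b or later, where the string class exists; the variable names viaArrayfun and viaCellstr are made up for illustration.

```matlab
C  = {'abc', 'def'};
Cs = string(C);                      % 1x2 string array

% arrayfun hands each 1x1 string element to the function unchanged.
viaArrayfun = arrayfun(@class, Cs, 'UniformOutput', false);

% Converting explicitly to a cellstr first yields char vectors.
viaCellstr = cellfun(@class, cellstr(Cs), 'UniformOutput', false);

disp(viaArrayfun)   % each element reports 'string'
disp(viaCellstr)    % each element reports 'char'
```

So the 'char' result comes from the cellstr-like treatment inside cellfun, not from the elements of the string array themselves.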


Answer 1

Walter Roberson on 23 Oct 2017


When storing multiple items of text, storing them as a cell array of character vectors requires 112 bytes of overhead per item, because that is the overhead for non-empty cell array entries: a cell array does not know ahead of time that each entry will be the same type, and so it has to store the type and full size information for each entry.

string arrays, on the other hand, need an overall size, and an overall type that applies for the entire array, but after that need only a length (not full array dimensions) and data pointer per entry.

The size also changes in ways that indicate some internal chunking:

strings of length 0 through 10 take 132 bytes
strings of length 11 through 15 take 142 bytes
longer strings take an additional 16 bytes for each 8 characters or fewer

For unshared strings, this would allow small numbers of characters to be appended without reallocating, which could help performance.
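
The byte counts above can be checked with whos. The following is a rough sketch, assuming R2016b or later; the variable names are illustrative, and the numbers printed will depend on the MATLAB release and platform, so no expected output is given.

```matlab
% Store the same 1000 short pieces of text three different ways.
words    = repmat({'test'}, 1, 1000);
asCell   = words;            % cell array of char vectors
asString = string(words);    % string array
asChar   = char(words);      % padded char matrix (1000x4)

% whos reports the bytes each variable occupies.
info = whos('asCell', 'asString', 'asChar');
for k = 1:numel(info)
    fprintf('%-10s %10d bytes (%.1f bytes per item)\n', ...
        info(k).name, info(k).bytes, info(k).bytes / numel(words));
end

% Size of a scalar string as its length grows, to look for chunking.
for n = [1 5 10 11 15 16 24 32]
    s = string(repmat('a', 1, n));
    w = whos('s');
    fprintf('length %2d: %d bytes\n', n, w.bytes);
end
```

Plateaus in the second loop's output would be consistent with the chunked allocation described above.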

5 Comments

Jim Hokanson on 23 Oct 2017

Edited: Jim Hokanson on 23 Oct 2017


Very interesting. I'm finishing up a JSON parser, and allocating memory for character arrays, which make up something like 15% of the objects in the file, took half of the total parsing time. What I really wanted was a single read-only string that consisted of all the strings concatenated together, with pointers to the start of each string. A similar abstraction for varying-length 1-d vectors would be nice ....

I'd be really surprised if they changed their mxArray header setup just to handle strings. Since 0 through 10 takes 132 bytes, this fits in nicely with the idea of the same 112 byte header and an allocation of 10 characters at 2 bytes per character.

Based on your answer I could see one possible slight speedup being that code can check for a string array, and if the value is a string array, they can avoid checks on type for every cell. This would presumably speed up method calls slightly since strings can start processing without much error checking whereas cell arrays of strings need to check every cell first before doing the processing (or during processing).

Here's some very informal testing with rough times listed as comments.

a = cell(1,1e6);
a(:) = {'test'};
b = string(a);
tic; iscellstr(a); toc;  %0.005 seconds
tic; isstring(b); toc;   %0.000010 seconds

Jim Hokanson on 23 Oct 2017

:)

Forgive me I was speaking a bit too loosely. I really meant unicode-like encoding with 2 bytes per character. String encoding/formatting in Matlab is something that continues to confuse and disappoint me. I believe UTF-16 technically has variable length encoding, which Matlab does not use. I've also heard it described as the first 2 bytes of UTF-32. Presumably someone that really understood the finer points of unicode would be able to answer this question. I'm also unclear whether it is possible to change the representation of characters based on locale settings or other "undocumented" options/features.

When writing the aforementioned JSON parser I decided to parse from the JSON strings (UTF-8) into 2 byte characters. Anything that exceeded 2 bytes (in the output) I replaced with a special "unrepresentable" character. That seemed to be the correct behavior based on my understanding of Matlab's character encodings.

Ideally we could get more clarification on this, i.e. a link to an official Mathworks document, but I have yet to see such documentation.

Walter Roberson on 23 Oct 2017


We would need to come up with a test to distinguish the two cases, of "first 65536 code points" or "UTF16 encoded".

%http://www.fileformat.info/info/unicode/char/10001/index.htm
LINEAR_B_SYLLABLE_B038_E = native2unicode(uint8(sscanf('00010001','%2x')),'utf-32')
LINEAR_B_SYLLABLE_B038_E + 0

LINEAR_B_SYLLABLE_B038_E = 2×1 char array
    '?'
    '?'
ans =
       55296
       56321

This is in fact the correct UTF16 encoding of that symbol. This tells us that at least native2unicode translates into UTF16.

But how do we then determine whether this is just a feature of native2unicode, or if MATLAB is UTF16 internally?

I propose the test that if we can take LINEAR_B_SYLLABLE_B038_E and get MATLAB to display a glyph for Linear B Symbol B038 E, then that would imply that MATLAB is UTF16 internally.

MATLAB fails this test if one uses text():

text(0.5, 0.5, LINEAR_B_SYLLABLE_B038_E)

displays a pair of rectangular boxes, just as would be expected for the case of two independent code points instead of it being UTF16.
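
As a side check (a sketch added for illustration, not part of the original test), the two values printed above can be decoded by hand with the standard UTF-16 surrogate-pair arithmetic to confirm that they do encode U+10001. The variable names are made up here.

```matlab
% UTF-16 code units reported by MATLAB for U+10001 in the output above.
units = [55296 56321];
hi = units(1);   % high (lead) surrogate, range 0xD800-0xDBFF
lo = units(2);   % low (trail) surrogate, range 0xDC00-0xDFFF

% Standard UTF-16 decoding of a surrogate pair into one code point.
codePoint = (hi - 55296) * 1024 + (lo - 56320) + 65536;
fprintf('U+%s\n', dec2hex(codePoint));   % U+10001

% A char array counts 16-bit code units, not code points.
c = char(units);
fprintf('numel(c) = %d\n', numel(c));    % 2
```

This only shows that the stored values form a valid surrogate pair; whether a given MATLAB function treats the pair as one character is a separate question, as the text() test shows.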

Jim Hokanson on 24 Oct 2017

Edited: Jim Hokanson on 24 Oct 2017

As suggested by Yair check out this comment and its responses: https://undocumentedmatlab.com/blog/couple-of-matlab-bugs-and-workarounds#comment-340803


Answer 2

Steven Lord on 23 Oct 2017


For purposes of this answer, I'm going to use the word "phrase" kind of liberally to mean a chunk of textual data. That could be a character, a word, a sentence or phrase, a book, etc. A couple of differences that make string arrays more efficient to work with than char arrays:

A string array treats each phrase as a unit, whereas a char array treats each character as a unit. In the past we've seen plenty of people do something like this:

c = 'apple';
f = c(1) % expecting f to be 'apple', but it is 'a'

With a string:

s = "apple";
f = s(1) % expecting f to be "apple", which it is

Storing phrases of different lengths in a char matrix requires padding with blanks. This means you need to remove the padding when you want to use each phrase later on. A string array doesn't require this padding.

c = ['apple '; 'banana'; 'cherry'];
c = strvcat(c, 'watermelon');
size(c)
f = ['{' c(1, :) '}'] % Note the extra spaces between the {}
s = ["apple"; "banana"; "cherry"];
s = [s; "watermelon"];
size(s)
f1 = ['{' s(1) '}'] % Note that f1 is now a 1x3 array; each of the braces is a separate string
f2 = '{' + s(1) + '}' % Note no extra spaces between the braces and the phrase apple

In the past one way to store multiple char arrays of different lengths without padding was to store them in a cell array. But MATLAB functions that needed to process the textual data would need to check (using something like iscellstr) whether or not every element of the cell contained a char vector. That checking takes time. A string array can only contain string data, so it doesn't need to check each element in the array for "string-ness". That extra validation probably doesn't take a lot of time, unless you need to do it often and/or on a large cell of char data.
Regarding MEX-file support, I'm not certain. If you want to request MEX-file support for string arrays, or learn what support there is (nothing's listed in the documentation as far as I could find) I recommend contacting Technical Support directly using the Contact Us link in the upper-right corner of this page.


Answer 3

Yair Altman on 24 Oct 2017


Edited: Yair Altman on 24 Oct 2017

The new strings are simply Matlab classes (MCOS objects) that extend 3 superclasses (matlab.mixin.internal.MatrixDisplay, matlab.mixin.internal.indexing.Paren, matlab.mixin.internal.indexing.ParenAssign). The ability to use double quotes (") to signify strings (as in s="apple") is simply syntactic sugar for the new string class. As a class object, the new string class defines 3 dozen internal class methods, such as cellstr(), char(), split() etc.

The string class is defined with class attributes Sealed and RestrictsSubclassing, to ensure that nobody can override its behavior. Moreover, TMW was extra-careful (way more than it usually is) to close most of the doors that can be used to access the internals. It's no wonder that MathWorkers on this page ignore the explicit repeated requests for information about the internals.

The internal string data is stored inside a class property called "data", which is private and hidden and so is not regularly accessible except via the class methods. If you want to access it, you can't simply use struct(), but you could try using James Tursa's mxGetPropertyPtr, as explained here: https://undocumentedmatlab.com/blog/accessing-private-object-properties

As for the discussion above regarding the specific UTF representation, I think that you will find the following discussion interesting, especially in the comments thread: https://undocumentedmatlab.com/blog/couple-of-matlab-bugs-and-workarounds

2 Comments

Stephen23 on 24 Oct 2017

+1 for the breakdown. Hopefully this will prompt more investigation.

Jim Hokanson on 24 Oct 2017

Edited: Jim Hokanson on 24 Oct 2017


I think Walter's answer and this one together answer my question completely.

For reference, this info can be found by digging through:

mc = ?string
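
A slightly longer sketch of that digging, using documented meta.class properties (Sealed, SuperclassList, MethodList, PropertyList). What it prints is an assumption-free readout of whatever your release reports; the lists may well differ from the 2017 description above, since the internals are undocumented and can change.

```matlab
mc = ?string;                         % meta.class object describing string

fprintf('Class name: %s\n', mc.Name);
fprintf('Sealed:     %d\n', mc.Sealed);

% Names of direct superclasses (may be empty in some releases).
superNames = arrayfun(@(m) m.Name, mc.SuperclassList, 'UniformOutput', false);
fprintf('Superclasses: %d\n', numel(superNames));
disp(superNames)

% Counts of methods and properties visible through the metaclass.
fprintf('Methods:    %d\n', numel(mc.MethodList));
fprintf('Properties: %d\n', numel(mc.PropertyList));
```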


Answer 4

Sruthi Geetha on 23 Oct 2017


First of all Strings in MATLAB are introduced in R2017a.

The main difference between strings and character arrays is that a string can be considered a complete object, whereas a character array is a vector of chars. Therefore, with the latter you can access individual characters via indexing, whereas with the former you cannot. Example:

>> s = "hi"

s = "hi"

>> sc = 'hi'

sc = 'hi'

>> sc(1)

ans = 'h'

>> s(1)

ans = "hi"

>> s(2)

Error: Index exceeds matrix dimensions.
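
Individual characters of a string are still reachable, just not through parenthesis indexing. A minimal sketch, assuming R2017a or later for the double-quote syntax (the variable names are illustrative):

```matlab
s = "hi";                         % 1x1 string array

nElements = numel(s);             % number of strings in the array: 1
nChars    = strlength(s);         % number of characters in the string: 2

c      = char(s);                 % convert to a 1x2 char vector
first1 = c(1);                    % 'h'
first2 = s{1}(1);                 % curly braces return the char vector: 'h'
first3 = extractBefore(s, 2);     % stays a string: "h"
```

Parentheses index into the array of strings, curly braces reach the char vector stored in one element, and functions such as extractBefore work on the text while keeping the string type.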

3 Comments

Stephen23 on 23 Oct 2017

Edited: Stephen23 on 23 Oct 2017

@Sruthi Geetha: this does not answer the question in any way whatsoever, as you do not clarify the "internal differences" that the title requests. Jim Hokanson already quotes the documentation in the question, so repeating information that can be gleaned from the help is hardly telling us what we all want to know: what are strings like inside, how are they stored in memory, with what encoding, and how are they related to any other data types? Rather tantalizingly you wrote that "strings can be considered a complete object": sure, we already know that. But what kind of object?

Perhaps staff are not permitted to answer this question?

Steven Lord on 23 Oct 2017

The string class was introduced in release R2016b as Walter noted.

The ability to define a string using double quotes like "apple" was introduced in release R2017a. Perhaps that's what Sruthi had in mind.
