Performance considerations for using strings or cellstr in table?

Hello,
I'm looking for some information related to performance (speed, memory consumption) of using a cell array (cellstr) or string array when storing "strings" in a table variable.
More concretely, is there a performance cost to using string arrays instead of cellstr?
The strings aren't so long, perhaps between five to fifty characters typically.
Note: I did a little test related to memory usage, and strings seem to consume less memory than cellstr. But in my test case, a char array was even smaller:
% Make a cellstr of random char arrays of different lengths
D_cellstr = arrayfun(@(unused) char(randi([65 90], 1, randi([10 30]))), 1:100, 'uni', 0)';
% Make into a char array, padded with spaces
D_char_array = char(D_cellstr{:});
% Put some non-char values in there, making it not a cellstr
D_cell_array = D_cellstr; D_cell_array(10:20:end) = { NaN };
% Make some tables
TS_cellstr = table(D_cellstr(1:end)); % (1:end) forces the default 'Var1' variable name
TS_char_array = table(D_char_array(:,:));
TS_cell_array = table(D_cell_array(1:end));
TS_string = table(string(D_cellstr));
TS_string2 = table(string(D_cell_array(1:end)));
% Results
who TS_* D_*
% Name Size Bytes Class Attributes
%
% D_cell_array 100x1 14260 cell
% D_cellstr 100x1 14394 cell
% D_char_array 100x30 6000 char
% TS_cell_array 100x1 15244 table
% TS_cellstr 100x1 15378 table
% TS_char_array 100x1 6984 table
% TS_string 100x1 9712 table
% TS_string2 100x1 9338 table

10 Comments

The only drawback with char arrays is that they must be rectangular, i.e. you have to handle the padding at the end of rows where a string is shorter than the longest string in the array. That is tedious, and in my experience it makes string arrays worth the price.
I'm mainly interested in comparing cellstr to string arrays, since logically what I have here is an array of string-like objects. It is also annoying to have to write e.g. D_char_array(n, :) to get the nth string, compared to D_cellstr{n} or D_string(n).
One thought I had was whether there's a difference in the internal representation. I'm guessing a string array is basically an array of pointers to each string, but I'd guess the same holds for a cell array. So maybe the performance is the same for a string array and a cell array...?
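For illustration, a minimal sketch (made-up names) of the per-element access syntax being compared across the three storage types:

```matlab
% Hypothetical example data in all three representations
names_cellstr = {'alpha'; 'bravo'; 'charlie'};
names_string  = string(names_cellstr);
names_char    = char(names_cellstr{:});   % rectangular, space-padded

n = 3;
s1 = names_cellstr{n};            % cellstr: curly-brace indexing
s2 = names_string(n);             % string: plain parenthesis indexing
s3 = strtrim(names_char(n, :));   % char: row indexing, padding stripped manually
```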
From another angle, the only benefit of a char array is that it is rectangular and can be stored as an actual character array rather than a "pointer" array. That makes it more efficient to access, and sometimes more efficient to store (no per-element pointer and associated overhead, at the cost of space wasted on padding).
A string array may be more efficient than a cellstr. A cell array is a more general data structure: each cell can hold almost anything, and internally that type information must be stored for every cell. A string array is more specialized, since every element is known to be a string.
I concur with @Bjorn Gustavsson on the above observation vis a vis char() variables--they are not at all convenient for string manipulation.
For memory footprint, you give up something for the convenience, but gain quite a fair amount relative to the cellstr().
I don't have any firm data on the performance aspect from the standpoint of timings; that probably depends heavily on just what you are going to be doing with these, or if they're only decoration going along with the other data.
One might, depending on that usage, want to consider whether the variable is a likely candidate to be categorical instead.
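A quick sketch of why categorical can pay off when a text column has only a few distinct values (made-up data roughly matching the table size discussed below): categorical stores one copy of each category plus a small integer code per row.

```matlab
% Hypothetical row-type column: 15000 rows, only three distinct labels
rowType       = repmat({'measurement'; 'comment'; 'header'}, 5000, 1);
asCellstr     = rowType;
asCategorical = categorical(rowType);
whos asCellstr asCategorical   % the categorical is typically far smaller
```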
Regarding my use, if of interest: I'm importing mixed data from about a hundred Excel worksheets that have been manually created by a number of engineers over a decade or so. The aggregated table has about 15000 rows and 70 columns. That table only occupies about 70 MB of RAM, so not a problem per se. The quality/consistency of the worksheets is so-so, e.g. differently labelled columns, missing columns, missing values, or the wrong kind of value (string instead of number). However, the bulk of the data is plain text. I do use categoricals for some of it, e.g. for a column that identifies the type of row.
Part of what I'm doing is trying to help clean up the Excel mess with some algorithms and in the end export more useful worksheets.
Performance: The whole import/transform process takes 10-20 minutes, but most of that is likely the interaction of MathWorks' MATLAB functions with the Excel API. However, I have noticed that some of the code I'm writing is starting to take a little time. So I'm slightly concerned that maybe I'm using the wrong kind of data structure (perhaps tables are sluggish, for instance, or maybe my use of a cell array to store a bunch of text). It's unlikely to be an actual issue in my case, just annoying while developing code.
Mainly I just got curious.
OK, that gives at least one extra clue for the choice of text data type -- if the text fields are used as keywords or searched for string matches or the like, then the string class is probably the better choice and performance is likely to be adequate. But if you need to extract substrings from within a string, then the syntax for strings is as clunky as for cellstr -- an array reference using "the curlies" followed by indexing into the string content with normal parens. That's where the old char() array's 2-D indexing has its strength -- it is already just an array of bytes, so you don't need the extra dereferencing.
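To make the syntax comparison concrete, a small sketch with made-up data showing the three substring-extraction idioms:

```matlab
c = {'MATLAB-2021a'};          % cellstr
s = "MATLAB-2021a";            % string scalar
m = char(c{:});                % char row vector

sub1 = c{1}(1:6);              % cellstr: curlies, then parens -> 'MATLAB'
sub2 = extractBefore(s, '-');  % string: helper functions (or s{1}(1:6))
sub3 = m(1:6);                 % char: plain 2-D/linear indexing
```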
As for the overhead of reading the files, that's where it's hard to judge without knowing what you're using -- depending upon how different the various files are, it might be possible to build a set of import options objects to use, but if it is a more-or-less random mixture of formats, it may not be easy to determine which of those to use, so you end up having to scan each file individually.
That's where, as always, if you can attach a couple of the actual files that illustrate the issues you're running into, along with your code, folks can critique and offer advice on performance enhancements.
I appreciate the feedback. Unfortunately I can't share actual data or code :-( If I can figure out a reasonable way to create example data I'll do so.
I might also give using 'readtable' a go over 'xlsread' and see how it affects performance. If I do so I'll add a comment here with the results.
Otherwise I guess I was just asking for general experience, e.g. whether there are caveats related to switching from cellstr to string arrays for table variables. Or perhaps someone would advise staying away from tables for big data sets and instead just using e.g. a cell array.
Thanks!
My only experience with a similar problem was a project where we updated the header format on a yearly basis (until we wised up and made more sensible specifications) and I had to make sure the meta-data extraction handled all formats -- but there I was along for the ride. Here you seem to be put in a position where you have to detective out what to do. This seems like a task that's "better done by someone else" -- for example, have a couple of summer interns re-edit and proof the files into a single proper data format; it is just 15000x70...
Good luck.
There's not a real hard and fast way to decide when tables are too large -- but there is certainly an overhead to be paid for the convenience of the object.
Again, it mostly depends on just what is going to be done with the resulting file -- and, while it's annoying, time-consuming and tedious to bring in all the disparate files, that should be a one-time-only exercise -- once done, you can save the new format and no longer have to deal with the old. Hence, I'm not sure it's worth spending too much time trying to optimize the performance of the input side of things.
But if there is a need to process the resulting data by some set of these various variables, like the type you mentioned above, then the builtin rowfun function with grouping variables is just the cat's meow for such.
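A minimal rowfun sketch with a grouping variable (table and column names made up for illustration):

```matlab
% Group rows by a categorical 'Type' column and compute a per-group mean
T = table(categorical({'A'; 'A'; 'B'; 'B'}), [1; 2; 3; 4], ...
          'VariableNames', {'Type', 'Value'});
G = rowfun(@mean, T, ...
           'InputVariables',      'Value', ...
           'GroupingVariables',   'Type', ...
           'OutputVariableNames', 'MeanValue');
% G has one row per group, plus a GroupCount column
```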
You might be able to speed up the input with the import options object as noted above. If the files vary from one to another it may not help speed, but by spending time on the import options instead, you might save on post-read corrections in the code, making the reads more consistent in returning expected data types, handling missing columns, etc. That might be less time-consuming than trying to write the fix-up code after a dumb read via xlsread.
For example, that way you can pre-determine that a column is to be interpreted as numeric, string, cellstr, etc., which will at least minimize the difficulty of cleaning up after inconsistent cell input where various cells within the same column/variable are of different types.
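As a sketch of that approach (the file name and column names here are made up for illustration), an import options object can force column types before reading:

```matlab
% Detect a baseline, then pin down the types you expect per column
opts = detectImportOptions('example.xlsx');
opts = setvartype(opts, {'PartName', 'Notes'}, 'string');
opts = setvartype(opts, 'Quantity', 'double');
opts = setvaropts(opts, 'Quantity', 'FillValue', NaN);  % what missing cells become
T = readtable('example.xlsx', opts);
```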


Answers (1)

nrow = 15000;
ncol = 70;
Chars = [' ', '0':'9', 'a':'z'];
NChars = length(Chars);
[R,C] = ndgrid(1:nrow, 1:ncol);
data_cell = arrayfun(@(r,c) Chars(randi(NChars,1,randi([5 50],1,1))), R, C, 'uniform', 0);
T1_cell = cell2table(data_cell);
T2_string = array2table(string(data_cell));
whos T1_cell T2_string
Name             Size           Bytes      Class   Attributes

T1_cell        15000x70     166984847      table
T2_string      15000x70      99927523      table
N = 10000;
rrow = randi(nrow, N, 1);
rcol = randi(ncol, N, 1);
tic; for K = 1 : N; this = T1_cell{rrow(K), rcol(K)}; end; r1_time = toc;
tic; for K = 1 : N; this = T2_string{rrow(K), rcol(K)}; end; r2_time = toc;
r1_time
r1_time = 1.0322
r2_time
r2_time = 0.7974
This tells us that using cellstr instead of string takes about 2/3 more storage (in this test), and is about 20% slower for random element access (in this test).

Answered on 12 Aug 2021
