Performance considerations for using strings or cellstr in table?

Hello,
I'm looking for some information related to performance (speed, memory consumption) of using a cell array (cellstr) or string array when storing "strings" in a table variable.
More concretely, is there a performance cost to using string arrays instead of cellstr?
The strings aren't so long, perhaps between five to fifty characters typically.
Note: I did a little test related to memory usage, and strings seem to consume less memory than cellstr. But in my test case, a char array was even smaller:
% Make a cellstr of random char arrays of different lengths
D_cellstr = arrayfun(@(unused) char(randi([65 90], 1, randi([10 30]))), 1:100, 'uni', 0)';
% Make into a char array, padded with spaces
D_char_array = char(D_cellstr{:});
% Put some non-char values in there, making it not a cellstr
D_cell_array = D_cellstr; D_cell_array(10:20:end) = { NaN };
% Make some tables
TS_cellstr = table(D_cellstr(1:end)); % (1:end) forces the default 'Var1' variable name
TS_char_array = table(D_char_array(:,:));
TS_cell_array = table(D_cell_array(1:end));
TS_string = table(string(D_cellstr));
TS_string2 = table(string(D_cell_array(1:end)));
% Results
who TS_* D_*
% Name Size Bytes Class Attributes
%
% D_cell_array 100x1 14260 cell
% D_cellstr 100x1 14394 cell
% D_char_array 100x30 6000 char
% TS_cell_array 100x1 15244 table
% TS_cellstr 100x1 15378 table
% TS_char_array 100x1 6984 table
% TS_string 100x1 9712 table
% TS_string2 100x1 9338 table

10 Comments

The only drawback with char arrays is that they must be rectangular, i.e. you have to handle the padding at the end of rows where a string is shorter than the longest string in the array. That is tedious, and in my experience it makes string arrays worth the price.
I'm mainly interested in comparing cellstr to string arrays, since logically what I have here is an array of string-like objects. It is also annoying to have to write e.g. D_char_array(n, :) to get the nth string, compared to D_cellstr{n} or D_string(n).
One thought I had was whether there's a difference in the internal representation. I'm guessing a string array is basically an array of pointers to each string, but I'd guess the same holds for a cell array. So maybe the performance is the same for a string array and a cell array...?
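For illustration, a minimal sketch (made-up names) of the per-element access syntax being compared across the three storage types:

```matlab
% Hypothetical example data in all three representations
names_cellstr = {'alpha'; 'bravo'; 'charlie'};
names_string  = string(names_cellstr);
names_char    = char(names_cellstr{:});   % rectangular, space-padded

n = 3;
s1 = names_cellstr{n};            % cellstr: curly-brace indexing
s2 = names_string(n);             % string: plain parenthesis indexing
s3 = strtrim(names_char(n, :));   % char: row indexing, padding stripped manually
```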
From another angle, the only benefit of a char array is that it is rectangular and can be stored as an actual character array rather than a "pointer" array. That makes it more efficient to access, and sometimes more efficient to store (no per-element pointer and associated overhead, at the cost of space wasted on padding).
A string array may be more efficient than a cellstr. A cell array is a more general data structure: each cell can hold almost anything, and internally that type information must be stored for every cell. A string array is more specialized, since every element is known to be a string.
I concur with @Bjorn Gustavsson on the above observation vis a vis char() variables--they are not at all convenient for string manipulation.
For memory footprint, you give up something for the convenience, but gain quite a fair amount relative to the cellstr().
I don't have any firm data on the performance aspect from the standpoint of timings; that probably depends heavily on just what you are going to be doing with these, or if they're only decoration going along with the other data.
One might, depending on that usage, want to consider whether the variable is a likely candidate to be categorical instead.
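A quick sketch of why categorical can pay off when a text column has only a few distinct values (made-up data roughly matching the table size discussed below): categorical stores one copy of each category plus a small integer code per row.

```matlab
% Hypothetical row-type column: 15000 rows, only three distinct labels
rowType       = repmat({'measurement'; 'comment'; 'header'}, 5000, 1);
asCellstr     = rowType;
asCategorical = categorical(rowType);
whos asCellstr asCategorical   % the categorical is typically far smaller
```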
Regarding my use, if of interest: I'm importing mixed data from about a hundred Excel worksheets that have been manually created by a number of engineers over a decade or so. The aggregated table has about 15000 rows and 70 columns. That table only occupies about 70 MB of RAM, so not a problem per se. The quality/consistency of the worksheets is so-so, e.g. differently labelled columns, missing columns, missing values, or the wrong kind of value (string instead of number). However, the bulk of the data is plain text. I do use categoricals for some of it, e.g. for a column that identifies the type of row.
Part of what I'm doing is trying to help clean up the Excel mess with some algorithms and in the end export more useful worksheets.
Performance: The whole import/transform process takes 10-20 minutes, but most of that is likely the interaction of MathWorks' MATLAB functions with the Excel API. However, I have noticed that some of the code I'm writing is starting to take a little time. So I'm slightly concerned that maybe I'm using the wrong kind of data structure (perhaps tables are sluggish, for instance, or maybe my use of a cell array to store a bunch of text). It's unlikely to be an actual issue in my case, just annoying while developing code.
Mainly I just got curious.
OK, that gives at least one extra clue for the choice of text data type -- if the text fields are used as keywords or searched for string matches or the like, then the string class is probably the better choice and performance is likely to be adequate. But if you need to extract substrings from within a string, then the syntax for strings is as clunky as for cellstr -- an array reference using "the curlies" followed by indexing into the string content with normal parens. That's where the old char() array's 2-D indexing has its strength -- it is already just an array of bytes, so you don't need the extra dereferencing.
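To make the syntax comparison concrete, a small sketch with made-up data showing the three substring-extraction idioms:

```matlab
c = {'MATLAB-2021a'};          % cellstr
s = "MATLAB-2021a";            % string scalar
m = char(c{:});                % char row vector

sub1 = c{1}(1:6);              % cellstr: curlies, then parens -> 'MATLAB'
sub2 = extractBefore(s, '-');  % string: helper functions (or s{1}(1:6))
sub3 = m(1:6);                 % char: plain 2-D/linear indexing
```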
As for the overhead of reading the files, that's where it's hard to judge without knowing what you're using -- depending upon how different the various files are, it might be possible to build a set of import options objects to use, but if it is a more-or-less random mixture of formats, it may not be easy to determine which of those to use, so you end up having to scan each file individually.
That's where, as always, if you can attach a couple of the actual files that illustrate the issues you're running into, along with your code, folks can critique and offer advice on performance enhancements.
I appreciate the feedback. Unfortunately I can't share actual data or code :-( If I can figure out a reasonable way to create example data I'll do so.
I might also give using 'readtable' a go over 'xlsread' and see how it affects performance. If I do so I'll add a comment here with the results.
Otherwise I guess I was just asking for general experience, e.g. whether there are caveats related to switching from cellstr to string arrays for table variables. Or perhaps someone would advise staying away from tables for big data sets and instead just using e.g. a cell array.
Thanks!
My only experience with a similar problem was a project where we updated the header format on a yearly basis (until we wised up and made more sensible specifications) and I had to make sure the meta-data extraction handled all formats -- but there I was along for the ride. Here you seem to be put in a position where you have to detective out what to do. This seems like a task that's "better done by someone else" -- for example, have a couple of summer interns re-edit and proof the files into a single proper data format; it is just 15000x70...
Good luck.
There's not a real hard and fast way to decide when tables are too large -- but there is certainly an overhead to be paid for the convenience of the object.
Again, it mostly depends on just what is going to be done with the resulting file -- and, while it's annoying, time-consuming and tedious to bring in all the disparate files, that should be a one-time-only exercise -- once done, you can save the new format and no longer have to deal with the old. Hence, I'm not sure it's worth spending too much time trying to optimize the performance of the input side of things.
But if there is a need to process the resulting data by some set of these various variables, like the type you mentioned above, then the builtin rowfun function with grouping variables is just the cat's meow for such.
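A minimal rowfun sketch with a grouping variable (table and column names made up for illustration):

```matlab
% Group rows by a categorical 'Type' column and compute a per-group mean
T = table(categorical({'A'; 'A'; 'B'; 'B'}), [1; 2; 3; 4], ...
          'VariableNames', {'Type', 'Value'});
G = rowfun(@mean, T, ...
           'InputVariables',      'Value', ...
           'GroupingVariables',   'Type', ...
           'OutputVariableNames', 'MeanValue');
% G has one row per group, plus a GroupCount column
```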
You might be able to speed up the input with the import options object as noted above. If the files vary from one to another it may not help speed, but by spending time on the import options instead, you might save on post-read corrections in the code, making the reads more consistent in returning expected data types, handling missing columns, etc. That might be less time-consuming than trying to write the fix-up code after a dumb read via xlsread.
For example, that way you can pre-determine that a column is to be interpreted as numeric, string, cellstr, etc., which will at least minimize the difficulty of cleaning up after inconsistent cell input where various cells within the same column/variable are of different types.
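As a sketch of that approach (the file name and column names here are made up for illustration), an import options object can force column types before reading:

```matlab
% Detect a baseline, then pin down the types you expect per column
opts = detectImportOptions('example.xlsx');
opts = setvartype(opts, {'PartName', 'Notes'}, 'string');
opts = setvartype(opts, 'Quantity', 'double');
opts = setvaropts(opts, 'Quantity', 'FillValue', NaN);  % what missing cells become
T = readtable('example.xlsx', opts);
```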


Answers (1)

nrow = 15000;
ncol = 70;
Chars = [' ', '0':'9', 'a':'z'];
NChars = length(Chars);
[R,C] = ndgrid(1:nrow, 1:ncol);
data_cell = arrayfun(@(r,c) Chars(randi(NChars,1,randi([5 50],1,1))), R, C, 'uniform', 0);
T1_cell = cell2table(data_cell);
T2_string = array2table(string(data_cell));
whos T1_cell T2_string
Name             Size           Bytes      Class   Attributes

T1_cell        15000x70     166984847      table
T2_string      15000x70      99927523      table
N = 10000;
rrow = randi(nrow, N, 1);
rcol = randi(ncol, N, 1);
tic; for K = 1 : N; this = T1_cell{rrow(K), rcol(K)}; end; r1_time = toc;
tic; for K = 1 : N; this = T2_string{rrow(K), rcol(K)}; end; r2_time = toc;
r1_time
r1_time = 1.0322
r2_time
r2_time = 0.7974
This tells us that using cellstr instead of string takes about 2/3 more storage (in this test), and is about 20% slower for random element access (in this test).

Answered on 12 Aug 2021
