Performance considerations for using strings or cellstr in table?
Hello,
I'm looking for some information related to performance (speed, memory consumption) of using a cell array (cellstr) or string array when storing "strings" in a table variable.
More concretely, are there costs in terms of lost performance using string arrays over cellstr?
The strings aren't so long, perhaps between five to fifty characters typically.
Note: I did a little test related to memory usage, and strings seem to consume less memory than cellstr. But in my test case, a char array consumed even less:
% Make some random char arrays of different lengths
D_cellstr = arrayfun(@(unused) char(randi([65 90], 1, randi([10 30]))), 1:100, 'uni', 0)';
% Make into char array, padded with spaces
D_char_array = char(D_cellstr{:});
% Put some non-char values in there, making it not a cellstr
D_cell_array = D_cellstr; D_cell_array([10:20:end]) = { NaN };
% Make some tables
TS_cellstr = table(D_cellstr(1:end)); % Extra code to use 'Var1' as variable name
TS_char_array = table(D_char_array(:,:));
TS_cell_array = table(D_cell_array(1:end));
TS_string = table(string(D_cellstr));
TS_string2 = table(string(D_cell_array(1:end)));
% Results
who TS_* D_*
% Name Size Bytes Class Attributes
%
% D_cell_array 100x1 14260 cell
% D_cellstr 100x1 14394 cell
% D_char_array 100x30 6000 char
% TS_cell_array 100x1 15244 table
% TS_cellstr 100x1 15378 table
% TS_char_array 100x1 6984 table
% TS_string 100x1 9712 table
% TS_string2 100x1 9338 table
10 Comments
Bjorn Gustavsson
on 11 Aug 2021
The only drawback with char arrays is that they have to be rectangular - i.e. one has to handle the blank padding at the end of rows where the string is shorter than the longest string in the array. That is tedious, and in my experience it is worth the price to use string arrays instead...
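To illustrate the padding issue described above, here is a minimal sketch (the values are made up for the example):

```matlab
% char() pads shorter rows with trailing blanks to keep the array rectangular
C = char('ab', 'abcdef');   % 2x6 char array; row 1 is 'ab' plus four blanks
C(1,:)                      % returns 'ab    ' -- padding included
deblank(C(1,:))             % strip the trailing blanks to recover 'ab'
cellstr(C)                  % converting back to cellstr also trims the padding
```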
Christian Ridderström
on 11 Aug 2021
Edited: Christian Ridderström
on 11 Aug 2021
Chunru
on 11 Aug 2021
From another angle, the only benefit of char arrays is that they are rectangular and can be stored as an actual character array rather than a "pointer" array. That makes them more efficient to access, and sometimes more efficient to store (trading the space wasted on padding against the overhead of the "pointers" and their associated bookkeeping).
Chunru
on 11 Aug 2021
A string array may be more efficient than a cellstr. A cell array is a more general data structure: each cell can hold almost anything, and internally that type information must be stored somewhere for every cell. A string array is more specialized - every element is a string.
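As a rough illustration of that per-cell overhead (a sketch only; the exact byte counts depend on platform and release):

```matlab
% Each cell carries its own header; a string array stores the elements
% in one container, so whos typically reports far fewer bytes for it
c = {'alpha'; 'beta'; 'gamma'};   % cellstr
s = string(c);                    % string array with the same text
whos c s
```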
dpb
on 11 Aug 2021
I concur with @Bjorn Gustavsson on the above observation vis a vis char() variables--they are not at all convenient for string manipulation.
For memory footprint, you give up something for the convenience, but gain quite a fair amount relative to the cellstr().
I don't have any firm data on the performance aspect from the standpoint of timings; that probably depends heavily on just what you're going to be doing with these, or whether they're only decoration going along with the other data.
One might, depending on that usage, want to consider whether the variable is a likely candidate to be categorical instead.
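If the text column takes its values from a limited set (e.g. a label repeated over many rows), categorical can shrink it considerably, since it stores integer codes plus one small category list. A sketch with made-up data:

```matlab
% 15000 rows drawn from only three distinct labels (illustrative data)
labels = repmat({'red'; 'green'; 'blue'}, 5000, 1);
as_cellstr     = labels;
as_string      = string(labels);
as_categorical = categorical(labels);   % codes + small category list
whos as_cellstr as_string as_categorical
```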
Christian Ridderström
on 11 Aug 2021
dpb
on 12 Aug 2021
OK, that gives at least one extra clue to the choice for the text data -- if the text fields are used as keywords, or searched for string matches or the like, then the string class is probably the better choice and performance is likely to be adequate. But if you need to extract substrings from within a string, then the syntax for strings is as clunky as for cellstr -- an array reference using "the curlies", followed by indexing into the string content with normal parens. That's where the old char() array's 2-D indexing has its strength -- it is already just an array of bytes, so you don't need the extra dereferencing.
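A small sketch of the indexing styles described above (illustrative values only):

```matlab
cs = {'abcdef'; 'ghijkl'};     % cellstr
st = string(cs);               % string array
ch = char(cs{:});              % 2x6 char array

cs{1}(2:4)                     % 'bcd' -- curlies first, then parens
st{1}(2:4)                     % same two-step syntax for string arrays
extractBetween(st(1), 2, 4)    % string-class alternative, positions inclusive
ch(1, 2:4)                     % plain 2-D indexing on the char array
```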
As for the overhead of reading the files, that's where it's hard to judge without knowing what you're using -- depending upon how different the various files are, it might be possible to build a set of import options objects to use, but if it is a more-or-less random mixture of formats, it may not be easy to determine which of those to use, so you end up having to scan each file individually.
That's where, as always, if you can attach a couple of the actual files that illustrate the issues you're running into, along with your code, folks can critique it and offer advice on performance enhancements.
Christian Ridderström
on 12 Aug 2021
Bjorn Gustavsson
on 12 Aug 2021
My only experience with a similar problem was in a project where we updated the header format on a yearly basis (until we wised up and made more sensible specifications) and I had to make sure to adapt the meta-data extraction to handle all formats - but there I was along for the ride. Here you seem to be put in a position where you have to puzzle out what to do. This seems like a task that's "better done by someone else" - for example, have a (couple of) summer interns re-edit and proof the files into a single proper data format; it is just 15000x70...
Good luck.
There's not a real hard and fast way to decide when tables are too large -- but there is certainly an overhead to be paid for the convenience of the object.
Again, it mostly depends on just what is going to be done with the resulting file -- and, while it's annoying, time-consuming and tedious to bring in all the disparate files, that should be a one-time-only exercise -- once done, you can save the new format and no longer have to deal with the old. Hence, I'm not sure it's worth spending too much time trying to optimize the performance of the input side of things.
But, if there is a need to process the resulting data by some set of these various variables like the type you mentioned above, then the builtin rowfun function with grouping variables is just the cat's meow for such.
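A sketch of rowfun with grouping variables on a toy table (the column names are invented for the example):

```matlab
% Toy table: a grouping variable and a value to aggregate within each group
T = table(categorical({'a'; 'a'; 'b'; 'b'}), [1; 2; 3; 4], ...
          'VariableNames', {'Group', 'Value'});
% rowfun with 'GroupingVariables' passes each group's Value vector to @mean
G = rowfun(@mean, T, 'InputVariables', 'Value', ...
           'GroupingVariables', 'Group', ...
           'OutputVariableNames', 'MeanValue')
```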
You might be able to speed up the input with the import options object as noted above. If the files vary from one to another it may not help the speed, but you might be able to save on the corrections in the code after reading by spending the time with the import options instead, making them more consistent in returning the expected data types and in handling missing columns, etc. That might be less time-consuming than trying to write the fix-up code after a dumb read via xlsread.
For example, that way you can pre-determine that a column is to be interpreted as numeric, string, cellstr, etc., which will at least minimize the difficulty in cleaning up after inconsistent cell input where various cells within the same column/variable are of different types.
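A sketch of steering the import toward consistent types with an import options object (the file name and variable names here are hypothetical):

```matlab
% Detect a baseline options object from one file, then override the
% variable types so every file comes back with the same column classes
opts = detectImportOptions('mydata.xlsx');            % hypothetical file
opts = setvartype(opts, {'ID', 'Comment'}, 'string'); % force text columns
opts = setvartype(opts, {'Weight'}, 'double');        % force a numeric column
T = readtable('mydata.xlsx', opts);
```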
Answers (1)
nrow = 15000;
ncol = 70;
Chars = [' ', '0':'9', 'a':'z'];
NChars = length(Chars);
[R,C] = ndgrid(1:nrow, 1:ncol);
data_cell = arrayfun(@(r,c) Chars(randi(NChars,1,randi([5 50],1,1))), R, C, 'uniform', 0);
T1_cell = cell2table(data_cell);
T2_string = array2table(string(data_cell));
whos T1_cell T2_string
N = 10000;
rrow = randi(nrow, N, 1);
rcol = randi(ncol, N, 1);
tic; for K = 1 : N; this = T1_cell{rrow(K), rcol(K)}; end; r1_time = toc;
tic; for K = 1 : N; this = T2_string{rrow(K), rcol(K)}; end; r2_time = toc;
r1_time
r2_time
This tells us that using cellstr instead of string takes about 2/3 more storage and is about 20% slower (in this test).