Why Does Format of Data Returned from readtable() Depend on Import Options that are Not VariableTypes?

4 views (last 30 days)
Here is the file
type text1.csv
XX,YY,ZZ 1,2,3 4,5,6
When read without options, the table output is as expected.
t1 = readtable('text1.csv')
t1 = 2×3 table
XX YY ZZ __ __ __ 1 2 3 4 5 6
But explicitly specifying some import options, seemingly unrelated to the type of data, brings the variables in as cellstr rather than numeric.
t2 = readtable('text1.csv',delimitedTextImportOptions('Delimiter',',','VariableNamesLine',1,'DataLines',[2 3]))
t2 = 2×3 table
XX YY ZZ _____ _____ _____ {'1'} {'2'} {'3'} {'4'} {'5'} {'6'}
Is that expected behavior?
Side note: I was also having problems with t2 where the first VariableName would be read in correctly as 'XX' , but the second and third would show up as 'extraVar1' and 'extraVar2' (or something like that). But I was using an earlier version of Matlab, so I guess that problem has been fixed.

Accepted Answer

Stephen23
Stephen23 on 18 Jan 2024
Edited: Stephen23 on 18 Jan 2024
"Is that expected behavior?"
Yes.
Or at least, not really very surprising.
"Why Does Format of Data Returned from readtable() Depend on Import Options that are Not VariableTypes?"
They are exactly the variable types that you told READTABLE to use!
Note that by design DELIMITEDTEXTIMPORTOPTIONS et al do not check the content of any file, hence they do not actually know anything about the data format of your file. Only DETECTIMPORTOPTIONS actually goes and takes a peek inside your file to decide how to try and import it.
This means with DELIMITEDTEXTIMPORTOPTIONS that all unspecified options are simply set to some default values that (potentially) have nothing to do with your data file. Lets look at the relevant default value:
obj = delimitedTextImportOptions('Delimiter',',','VariableNamesLine',1,'DataLines',[2 3]);
obj.VariableTypes
ans = 1×1 cell array
{'char'}
Remember also that the READTABLE documentation states that by default it uses DETECTIMPORTOPTIONS to get the required format settings. But when you provide READTABLE with the import options object yourself, you are specifically telling READTABLE to not check the file content, you are telling READTABLE "I know what the file format is, do exactly what I tell you". So it does:
tbl = readtable('text1.csv',obj)
tbl = 2×3 table
XX YY ZZ _____ _____ _____ {'1'} {'2'} {'3'} {'4'} {'5'} {'6'}
So you told READTABLE that you want it to import the file data as char and you also told it to not check the file content itself. It would be extremely unhelpful if READTABLE would override such user-specified settings.
My guess is that "char" is the default type on the basis that any data may be imported as text without loss of information.
Also note the default EXTRACOLUMNSRULE is "addvars", which explains why all columns are returned and not one.
Emphasis
This is why the READTABLE documentation emphasizes the use of DETECTIMPORTOPTIONS, to avoid surprises like this. Even the DELIMITEDTEXTIMPORTOPTIONS et al documentation emphasize at the very top of the page "You can create a DelimitedTextImportOptions object using either the detectImportOptions function... "
So on every page related to this topic, the emphasis is on using DETECTIMPORTOPTIONS because it actually takes a peek into the data file before deciding what the import settings are. In general the xxxIMPORTOPTIONS family are not very useful to call directly... perhaps only in the case that you have a lot of identically-formatted files to import and you need to specify the format before accessing any file and during development you used DETECTIMPORTOPTIONS to get all non-default settings which you then specify when creating the DELIMITEDTEXTIMPORTOPTIONS object (as inputs or by changing the object properties). And I mean all.
As the example here shows, setting up an xxxIMPORTOPTIONS object generally requires more effort because you really need to specify everything about the file format (not just three inputs as you tried):
Simply specifying one or two inputs and expecting the remaining defaults to work for every file ... is extremely optimistic. In most cases it is much easier to let DETECTIMPORTOPTIONS do that work for you.
Perspective
As rule of thumb: use DETECTIMPORTOPTIONS when you have a specific data file to import or explore the format of. Think of the xxxIMPORTOPTIONS objects as a store of the required settings, once you have them
  1 Comment
Stephen23
Stephen23 on 18 Jan 2024
Of course this applies to all settings, not just the data types. For example, if you use the default delimiter (or specify the wrong delimiter yourself) by supplying the import options object then you are telling READTABLE to use exactly that delimiter:
writematrix(randi(9,3,5),'test.txt','Delimiter','\t')
obj = delimitedTextImportOptions();
obj.Delimiter % seems like a reasonable default value
ans = 1×1 cell array
{','}
tbl = readtable('test.txt',obj) % oops, does not suit my TSV file!
tbl = 3×1 table
Var1 _____________ {'7→2→9→4→4'} {'6→7→8→5→4'} {'9→4→4→4→6'}

Sign in to comment.

More Answers (1)

Hassaan
Hassaan on 17 Jan 2024
Edited: Hassaan on 17 Jan 2024
  1. Data Type Inference: MATLAB infers the data type based on the content of each column. If you don't specify a data type in delimitedTextImportOptions, it tries to guess the type, which can lead to inconsistencies if the guess is incorrect.
  2. Explicit Data Type Specification: You can specify the data type for each column in delimitedTextImportOptions. This is a more reliable way to ensure that your data is imported in the format you expect. For example, you can specify that all columns are numeric.
  3. Issues with Variable Names: If MATLAB cannot correctly interpret the variable names from the first line of the file (as specified by VariableNamesLine), it may default to generic names like extraVar1, extraVar2, etc. This usually happens if there is a mismatch in the number of columns or an issue with the delimiter.
I recommend specifying the data types explicitly in the delimitedTextImportOptions.
% Automatically detect import options for the file
opts = detectImportOptions('text1.csv');
% Set the data type for each variable
opts = setvartype(opts, {'XX', 'YY', 'ZZ'}, 'double');
% Specify the data lines (optional if the entire file should be read)
opts.DataLines = [2 3];
% Read the table
t2 = readtable('text1.csv', opts);
disp(t2)
-----------------------------------------------------------------------------------------------------------------------------------------------------
If you find the solution helpful and it resolves your issue, it would be greatly appreciated if you could accept the answer. Also, leaving an upvote and a comment are also wonderful ways to provide feedback.
Professional Interests
  • Technical Services and Consulting
  • Embedded Systems | Firmware Developement | Simulations
  • Electrical and Electronics Engineering
It's important to note that the advice and code are based on limited information and meant for educational purposes. Users should verify and adapt the code to their specific needs, ensuring compatibility and adherence to ethical standards.
Feel free to contact me.
  1 Comment
Paul
Paul on 17 Jan 2024
yes, that would be a workaround. I'd like to know if the behavior as shown is expected, and if there is a way to make readtable return the same table for t2 as for t1 without me having to specify the data types. If it can figure that out for t1, why can't it figure it out the same way for t2?

Sign in to comment.

Products


Release

R2023b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!