How to use "str2num" and "strrep" on array of cells?

Hello, converting strings to numbers using str2num takes "forever": processing ~300 MB of log data (stored in a txt file) took ~15 hours! How can one use "strrep" and "str2num" on cell contents?
The data is stored in an array of cells (every cell contains a line from the source file as string, tabulator-separated values).
log_data_text{:} % contains a cell for each line of the source file
For pre-processing, I need to perform some substitutions (e.g. replace spaces " " with "nan"). This needs to be done individually for each cell: one cell may have a space at position 10 (and/or elsewhere), while another cell may have spaces at different positions or no spaces at all.
Finally, I need to convert the string of each cell to an array of numbers.
My first approach was to create one long string out of all cells, convert it, and reshape the resulting 1-dimensional vector back to the original size:
complete_string = convertCharsToStrings([log_data_text_end_delimiter{:}]); % create one string containing all data
complete_string = strrep(complete_string, '\t\t', '\t \t'); % substitution of empty value to space
complete_string = strrep(complete_string, ' ', 'nan'); % substitute spaces (=empty values) with nan
number_array_b = str2num(complete_string);
number_of_lines = length(log_data_text_end_delimiter);
number_array_b = transpose(reshape(number_array_b, 6, number_of_lines)); % reshape to original size (6 columns)
This approach takes less than a minute, which would be fine. Unfortunately, it returns an empty array most of the time when used on different log data txt files, even though their structure is identical.
Therefore, I have to use a cell-by-cell routine, looping over all cells.
for yy = 1 : number_of_lines
    modified_string = log_data_text_end_delimiter{yy};
    modified_string = strrep(modified_string, '\t\t', '\t \t');
    modified_string = strrep(modified_string, ' ', 'nan');
    temp_number_array = str2num(modified_string);
    number_array(yy, 2:7) = temp_number_array; % the first column already contains data
end
Since there are many lines (=cells), str2num is called millions of times and takes literally hours. How can I optimize this conversion?
Thank you very much for advice,
Dan
#### Update ####
- I uploaded a sample log file.
- I previously also tried the textscan command, but the time conversion did not work:
formatSpec = '%{yyyy-MM-dd HH:mm:ss.S}D\t%f\t%f\t%f\t%f\t%f\t%f';
result_array = textscan(fileID,formatSpec)
- Therefore, I analyzed the time string separately from the rest of the string.
  2 Comments
Stephen23 on 6 Sep 2018
Edited: Stephen23 on 6 Sep 2018
@Daniel Huzel: str2num is not efficient, so it is best avoided. To process your data efficiently, consider faster parsing methods such as sscanf, or read the file data directly using MATLAB's file-reading functions.
Please upload a sample file by clicking the paperclip button. This does not have to be the whole file, just enough to be representative of the exact file structure.
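As a rough sketch of the sscanf route mentioned above: all lines can be joined and parsed with one call instead of one str2num call per cell. The sample lines and the variable names (lines, nCols, M) are invented for illustration and are not taken from the uploaded log file. The sketch assumes every line holds the same number of values and uses decimal points rather than commas.

```matlab
% Hypothetical sample: each cell holds one tab-separated line of three numbers.
lines = {sprintf('1.5\t2\t3'), sprintf('4\t5.25\t6')};
nCols = 3;                                  % assumed number of values per line

% Join all lines into one char vector and parse it with a single sscanf call.
joined = strjoin(lines, sprintf('\n'));
vals = sscanf(joined, '%f');                % column vector of all values, in file order

% Reshape to one row per line (reshape fills columns first, hence the transpose).
M = reshape(vals, nCols, []).';
```

Note that sscanf treats consecutive tabs as plain whitespace, so empty fields would have to be filled in beforehand; otherwise the values shift and the reshape misaligns or fails.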
Guillaume on 6 Sep 2018
str2double is much safer than str2num. No idea if it is any faster.
In my opinion, the proper approach would be to read the file directly in the form you want using MATLAB's own parsing functions rather than doing your own parsing. textscan is fairly powerful, and readtable is even more powerful, being able to read fairly complex formats (including replacing missing values with NaN).
Details of the files you want to parse would be required.
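To illustrate the str2double suggestion, here is a minimal sketch that converts one line without any strrep step. The sample line is made up and is not from the actual log file. Empty fields and fields containing only a space both come back as NaN, and the 'CollapseDelimiters' option is switched off so that two consecutive tabs still count as an empty field.

```matlab
% Hypothetical line: four tab-separated fields, the 2nd empty and the 3rd a single space.
line = sprintf('450.0\t\t \t16.1');

% Split on tabs; keep empty fields instead of collapsing repeated delimiters.
fields = strsplit(line, sprintf('\t'), 'CollapseDelimiters', false);

% str2double accepts a cell array and returns NaN for anything it cannot parse.
vals = str2double(fields);
```

Unlike str2num, str2double does not evaluate its input as MATLAB code, which is why it is the safer choice. Whether it is fast enough for millions of lines would need to be measured.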


Accepted Answer

Stephen23 on 6 Sep 2018
Edited: Stephen23 on 6 Sep 2018
The biggest problem with your file is not the formatting, but the fact that it uses commas as the decimal separator: MATLAB only supports a decimal point. To use MATLAB efficiently we can first convert the commas to periods (search this forum for various ways to approach decimal commas). textscan worked well for your sample file (attached):
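str = fileread('example_log_file.txt'); % read the whole file into one char vector
str = strrep(str,',','.'); % convert decimal commas to decimal points
fmt = '%s%f%f%f%f%f%f'; % timestamp as text, followed by six numeric columns
opt = {'Delimiter','\t','TreatAsEmpty',' ','CollectOutput',true};
C = textscan(str,fmt,opt{:}); % C{1}: timestamps, C{2}: numeric matrix with six columns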
Giving:
>> C{1}
ans =
'2014-08-24 04:54:13.804'
'2014-08-24 04:54:14.026'
'2014-08-24 04:54:14.607'
'2014-08-24 04:54:15.604'
'2014-08-24 04:54:15.804'
'2014-08-24 04:54:16.026'
'2014-08-24 04:54:16.607'
'2014-08-24 04:54:17.604'
'2014-08-24 04:54:17.804'
'2014-08-24 04:54:18.026'
'2014-08-24 04:54:18.607'
'2014-08-24 04:54:19.604'
... lots more here
>> C{2}
ans =
1.0e+03 *
0.4500 NaN NaN NaN NaN 0.0161
NaN NaN 0.0004 NaN NaN NaN
0.4500 0.0003 NaN -0.0165 4.8688 0.0161
NaN NaN NaN -0.0165 NaN NaN
0.4500 NaN NaN NaN NaN 0.0160
NaN NaN 0.0004 NaN NaN NaN
0.4498 0.0003 NaN -0.0167 4.8688 0.0204
NaN NaN NaN -0.0167 NaN NaN
0.4500 NaN NaN NaN NaN 0.0164
NaN NaN 0.0004 NaN NaN NaN
0.4500 0.0004 NaN -0.0167 4.8688 0.0164
NaN NaN NaN -0.0167 NaN NaN
... lots more here
Obviously you can use isnan and indexing to efficiently get the numeric values that you require. Note that if you use a MATLAB version R2014b+ then you can also import those dates directly as datetime objects.
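A small sketch of both points, using a few illustrative values in the same layout as C{1} and C{2} above. The variable names (stamps, data, col1, valid) are placeholders, and the datetime conversion assumes R2014b or later.

```matlab
% Illustrative values in the layout of C{1} (timestamps) and C{2} (numeric matrix).
stamps = {'2014-08-24 04:54:13.804'; '2014-08-24 04:54:14.026'; '2014-08-24 04:54:14.607'};
data = [450.0 NaN; NaN NaN; 450.0 0.3];

% Logical indexing: keep only the rows where the first column holds a value.
col1 = data(:,1);
valid = ~isnan(col1);
vals = col1(valid);

% Convert the timestamp text to datetime, then pick the matching timestamps.
t = datetime(stamps, 'InputFormat', 'yyyy-MM-dd HH:mm:ss.SSS');
tValid = t(valid);
```

Each numeric column can be paired with its own timestamps this way, since the NaN positions differ from column to column.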
If the file itself does not fit into memory, then you should consider using a tall array.


Release

R2018a
