How to use "str2num" and "strrep" on array of cells?

Question

Dan H on 6 Sep 2018


https://ch.mathworks.com/matlabcentral/answers/417810-how-to-use-str2num-ans-strrep-on-array-of-cells

Edited: Stephen23 on 6 Sep 2018

example_log_file.txt

Hello, when converting strings to numbers using str2num, this command takes "forever": processing ~300 MB of log data (stored in a txt file) took ~15 hours! How can one use "strrep" and "str2num" on cell contents?

The data is stored in an array of cells (every cell contains one line from the source file as a string of tab-separated values).

log_data_text{:} % contains a cell for each line of the source file

For pre-processing, I need to perform some substitutions (e.g. replace spaces " " with "nan"). This needs to be done individually for each cell: one cell may have a space at position 10 (and/or elsewhere), while another cell may have spaces at different positions or no spaces at all.

Finally, I need to convert the string of each cell to an array of numbers.

My first approach was to create one long string out of all cells, convert it, and reshape the resulting 1-dimensional vector back to the original size:

    complete_string = convertCharsToStrings([log_data_text_end_delimiter{:}]); % create one string containing all data
    complete_string = strrep(complete_string, '\t\t', '\t \t'); % substitution of empty value to space
    complete_string = strrep(complete_string, ' ', 'nan'); % substitute spaces (=empty values) with nan
    number_array_b = str2num(complete_string);
    number_of_lines = length(log_data_text_end_delimiter);
    number_array_b = transpose(reshape(number_array_b, 6, number_of_lines)); % reshape to original size (6 columns)

This approach takes less than a minute, which would be fine. Unfortunately, it returns an empty array most of the time when used on different log data txt files, even though their structure is identical.

Therefore, I have to use a cell-by-cell routine, looping over all cells.

    for yy = 1 : number_of_lines
        modified_string = log_data_text_end_delimiter{yy};
        modified_string = strrep(modified_string, '\t\t', '\t \t');
        modified_string = strrep(modified_string, ' ', 'nan');
        temp_number_array = str2num(modified_string);
        number_array(yy, 2:7) = temp_number_array; % the first column already contains data
    end

Since there are many lines (=cells), str2num is called millions of times and takes literally hours. How can I optimize this conversion?
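A side note on the substitutions above, which is not raised in the thread: strrep matches its arguments literally, so '\t\t' is the four characters backslash, t, backslash, t rather than two tab characters. The sketch below only illustrates that difference; the sample line and the variable names are hypothetical, and whether this explains the empty results is not confirmed here.

```matlab
tab = sprintf('\t');              % a real tab character, same as char(9)
line_text = ['1' tab tab '3'];    % hypothetical line with one empty field

% The pattern '\t\t' contains literal backslashes, so nothing matches:
unchanged = strrep(line_text, '\t\t', '\t \t');

% Building the pattern from real tab characters does match:
replaced = strrep(line_text, [tab tab], [tab 'nan' tab]);
```

Here unchanged is identical to line_text, while replaced has 'nan' between the two tabs.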

Thank you very much for advice,

Dan

#### Update ####

- I uploaded a sample log file.
- I previously also tried the textscan command, but the time conversion did not work:

formatSpec = '%{yyyy-MM-dd HH:mm:ss.S}D\t%f\t%f\t%f\t%f\t%f\t%f';
result_array = textscan(fileID,formatSpec)

- Therefore, I analyzed the time string separately from the rest of the string.

2 Comments

Stephen23 on 6 Sep 2018

Edited: Stephen23 on 6 Sep 2018

@Daniel Huzel: str2num is not efficient, so it would be best to avoid it. If you want to process your data efficiently, consider more efficient parsing methods such as sscanf, or read the file data directly using MATLAB's file-reading functions:

https://www.mathworks.com/help/matlab/data-import-and-export.html

Please upload a sample file by clicking the paperclip button. This does not have to be the whole file, just enough to be representative of the exact file structure.
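As a rough illustration of the sscanf suggestion above, the following sketch parses several lines in one call instead of calling str2num per line. It assumes the lines already use decimal points and that empty fields have been replaced by 'nan'; the sample lines and the column count of 3 are hypothetical.

```matlab
tab = sprintf('\t');
% Hypothetical pre-processed lines: tab-separated, decimal points, 'nan' for gaps
lines = {['450.0' tab 'nan' tab '16.1'], ['nan' tab '0.4' tab 'nan']};

all_text = strjoin(lines, sprintf('\n'));   % one char vector holding every line
values = sscanf(all_text, '%f');            % column vector, numbers in reading order
values = reshape(values, 3, []).';          % one row per line, 3 columns
```

sscanf skips the tabs and newlines as whitespace and reads 'nan' as NaN, so values becomes a 2-by-3 matrix with one row per input line.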

Guillaume on 6 Sep 2018

str2double is much safer than str2num. No idea if it is any faster.

In my opinion, the proper approach would be to read the file directly in the form you want using MATLAB's own parsing functions rather than doing your own parsing. textscan is fairly powerful, and readtable is even more powerful: it can read fairly complex formats (including replacing missing values with NaN).

Details of the files you want to parse would be required.
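To illustrate the str2double remark above: str2double accepts a cell array of character vectors and returns NaN for anything it cannot parse, whereas str2num evaluates its input as MATLAB code. A minimal sketch with hypothetical sample fields:

```matlab
% Hypothetical fields taken from one line, already split at the tabs
fields = {'12.5', '', 'abc', '1e3'};

% One call converts the whole cell array; unparseable entries become NaN
numbers = str2double(fields);
```

This gives numbers equal to [12.5 NaN NaN 1000], so empty fields need no separate substitution step.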


Answer 1

Stephen23 on 6 Sep 2018


https://ch.mathworks.com/matlabcentral/answers/417810-how-to-use-str2num-ans-strrep-on-array-of-cells#answer_335703

Edited: Stephen23 on 6 Sep 2018

example_log_file.txt

The biggest problem with your file is not the formatting, but the fact that it uses commas as the decimal separator: MATLAB only supports the decimal point. To use MATLAB efficiently we can first convert the commas to periods (search this forum for various ways to approach decimal commas). textscan worked well for your sample file (attached):

str = fileread('example_log_file.txt');
str = strrep(str,',','.');
fmt = '%s%f%f%f%f%f%f';
opt = {'Delimiter','\t','TreatAsEmpty',' ','CollectOutput',true};
C = textscan(str,fmt,opt{:});

Giving:

>> C{1}
ans = 
    '2014-08-24 04:54:13.804'
    '2014-08-24 04:54:14.026'
    '2014-08-24 04:54:14.607'
    '2014-08-24 04:54:15.604'
    '2014-08-24 04:54:15.804'
    '2014-08-24 04:54:16.026'
    '2014-08-24 04:54:16.607'
    '2014-08-24 04:54:17.604'
    '2014-08-24 04:54:17.804'
    '2014-08-24 04:54:18.026'
    '2014-08-24 04:54:18.607'
    '2014-08-24 04:54:19.604'
   ... lots more here
>> C{2}
ans =
   1.0e+03 *
    0.4500       NaN       NaN       NaN       NaN    0.0161
       NaN       NaN    0.0004       NaN       NaN       NaN
    0.4500    0.0003       NaN   -0.0165    4.8688    0.0161
       NaN       NaN       NaN   -0.0165       NaN       NaN
    0.4500       NaN       NaN       NaN       NaN    0.0160
       NaN       NaN    0.0004       NaN       NaN       NaN
    0.4498    0.0003       NaN   -0.0167    4.8688    0.0204
       NaN       NaN       NaN   -0.0167       NaN       NaN
    0.4500       NaN       NaN       NaN       NaN    0.0164
       NaN       NaN    0.0004       NaN       NaN       NaN
    0.4500    0.0004       NaN   -0.0167    4.8688    0.0164
       NaN       NaN       NaN   -0.0167       NaN       NaN
  ... lots more here

You can use isnan and indexing to efficiently get the numeric values that you require. Note that if you use MATLAB R2014b or later, you can also import those dates directly as datetime objects.
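A minimal sketch of both points, using small hypothetical stand-ins for C{1} and C{2} rather than the real file contents. It converts the timestamp strings after reading (instead of importing them directly) and selects the rows where one column holds a value:

```matlab
% Hypothetical stand-ins for C{1} (timestamps) and C{2} (numeric columns)
timestamps = {'2014-08-24 04:54:13.804'; '2014-08-24 04:54:14.026'; '2014-08-24 04:54:14.607'};
values = [450.0 NaN; NaN NaN; 450.0 0.3];

% Convert the strings to datetime, keeping the milliseconds (needs R2014b or later)
t = datetime(timestamps, 'InputFormat', 'yyyy-MM-dd HH:mm:ss.SSS');

% Logical mask of the rows where column 1 contains a number
has_value = ~isnan(values(:, 1));
col1 = values(has_value, 1);     % the non-NaN samples of column 1
t_col1 = t(has_value);           % the matching timestamps
```

The same mask pattern can be applied to each column in turn, which gives every signal its own time vector.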

If the file itself does not fit into memory, then you should consider using a tall array:

https://www.mathworks.com/help/matlab/import_export/tall-arrays.html
