Parsing a text file in MATLAB and accessing the contents of each section
Hi, I want to separate a text file, which is quite big, into different sections in MATLAB. I need to:
- Ignore first set of lines
- Then the data set is repeated
- Access its content for a particular set of conditions
For example, for a drag factor of 1.0 and fuel factor of 1.2, I want to find the corresponding alt for a particular weight.
Find attached the text file.
Thanks Yashvin
2 Comments
per isakson
on 10 Jun 2015
Edited: per isakson
on 10 Jun 2015
- "quite big"   how big compared to available memory?
- "different sections"   what defines the beginning of a section? Is "V2500_A5" a fixed string that defines the beginning of a new section?
yashvin
on 10 Jun 2015
Accepted Answer
More Answers (1)
Guillaume
on 10 Jun 2015
Your text file is not really designed to be read by a computer. It's not very consistent (variable number of blank lines, variable number of spaces, inconsistent number format, etc.) which makes it difficult to parse efficiently.
So the first thing to look at is whether you can get the same data in a format designed to be parsed by a computer: binary, JSON, XML, etc.
Failing that, the following works on the attached file, but because of the inconsistencies may not work on a larger file:
dragwanted = 1.0;
fuelwanted = 1.2;
content = fileread('question.txt'); %get whole content of file
sections = regexp(content, 'DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)\s+([A-Z .]+\r\n[A-Z() ]+\r\n\s*\r\n([0-9. ]+\r\n)+)', 'tokens');
%sections is a cell array of 1x3 cell arrays of {drag factor, fuel factor, table}
dragfactors = cellfun(@(s) str2double(s{1}), sections);
fuelfactors = cellfun(@(s) str2double(s{2}), sections);
wanted = dragfactors == dragwanted & fuelfactors == fuelwanted;
assert(sum(wanted) > 0, 'No section match criteria');
assert(sum(wanted) == 1, 'More than one section match criteria');
section = sections{wanted}{3};
%parse the section:
sectionlines = strsplit(section, {'\n', '\r'});
sectionheader = strsplit(strtrim(sectionlines{1}))
sectionunits = strtrim(regexp(sectionlines{2}, '(?<=\().*?(?=\))', 'match'))
sectiontable = str2num(strjoin(sectionlines(4:end-1), '\n'))
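Once the table is numeric, the lookup asked for in the question (the alt for a particular weight) could be sketched as follows. This is only an illustration: the header name 'ALT', the column order and all the numbers are made up, so check sectionheader from your own file before relying on it.

```matlab
% Hypothetical stand-ins for sectionheader and sectiontable from the code above.
% 'ALT' and every value here are assumptions, not data from the real file.
sectionheader = {'WGHT', 'ALT'};
sectiontable  = [60 39000; 65 37000; 70 35000];
weightwanted  = 65;

% Locate the columns by header name rather than by hard-coded position.
wcol = find(strcmp(sectionheader, 'WGHT'));
acol = find(strcmp(sectionheader, 'ALT'));

% Exact match: pick the row whose weight equals the requested one.
row = sectiontable(:, wcol) == weightwanted;
altexact = sectiontable(row, acol);

% Weight between two tabulated rows: linear interpolation over the weight column.
altinterp = interp1(sectiontable(:, wcol), sectiontable(:, acol), 67.5);
```

Whether interpolating between rows is physically meaningful for your data is for you to decide; interp1 also returns NaN for weights outside the tabulated range.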
6 Comments
Yashvin comment moved here:
Hi, thanks! It works for smaller files.
Unfortunately, the program I am using can only output this format, with lots of empty lines and blank spaces! I tried it on a larger file but it did not work. It said "No section match criteria". From the sections part onwards, it starts generating empty arrays.
Can you please explain the sections part and how we can generalize it with more rows and with more input choices?
I have attached a new file. Thanks
The file cannot be attached. Can I email you the file?
Guillaume
on 10 Jun 2015
No, you cannot email me the file. If for some reason the file is too big to post, just post a section of the file.
I used a regular expression to detect each section. This particular regular expression matches the sections of the file that:
- start with 'DRAG FACTOR' followed by one or more blank characters (the DRAG FACTOR\s+ part)
- followed by one or more digits or dot characters. That part is captured as a token (the ([0-9.]+))
- followed by one or more blank characters, followed by 'FUEL FACTOR' followed by one or more blank characters (the \s+FUEL FACTOR\s+)
- followed by one or more digits or dot characters. That part is captured as the second token (the second ([0-9.]+))
- followed by one or more blank characters; this includes spaces, tabs, and line returns (the \s+)
- followed by a third capture (starts at the third (, finishes at the final )):
- a string made exclusively of uppercase alphabetic characters, dots or spaces (no tabs), terminated by the Windows line return characters \r\n (the [A-Z .]+\r\n part)
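To see what the tokens look like, here is a small sketch that runs the same regular expression on a made-up section. The header name 'ALT', the units and the numbers are assumptions for illustration only; the real file will differ.

```matlab
% Hypothetical sample laid out the way the regular expression expects:
% factor line, header line, unit line, blank line, then numeric rows,
% all terminated by Windows line endings (\r\n).
sample = sprintf(['DRAG FACTOR 1.0 FUEL FACTOR 1.2\r\n' ...
                  '  WGHT ALT\r\n' ...
                  ' (KG) (FT)\r\n' ...
                  ' \r\n' ...
                  ' 60000 39000\r\n' ...
                  ' 65000 37000\r\n']);

expr = ['DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)\s+' ...
        '([A-Z .]+\r\n[A-Z() ]+\r\n\s*\r\n([0-9. ]+\r\n)+)'];
sections = regexp(sample, expr, 'tokens');

% One cell per matched section; each holds {drag factor, fuel factor, table}.
nsections = numel(sections);
drag = str2double(sections{1}{1});
fuel = str2double(sections{1}{2});
tabletext = sections{1}{3};   % header, units, blank line and numeric rows
```

If a section has anything the expression does not allow (a lowercase letter in the header, a '/' in the units, a line between the factor line and the header), that section simply does not match, which is how you end up with empty results.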
yashvin
on 10 Jun 2015
Well, your new file breaks part 5 (there are non-blank lines between the table and the drag factor/fuel factor line), and part 8 (units can also include '.', '%', '/' and probably other symbols) of the regular expression. So of course, there is no match.
To build a regular expression you need to establish what is constant in the file and what is not. It seems that the list of conditions before the table is variable. Is that the case?
Also, are the criteria always drag factors and fuel factors, or can they be something else (like altitude)? That is, what needs to be parsed and what can be discarded?
Finally, does the table always start with WGHT or can it have a different header? Do you actually care about the header and units or can they be discarded?
yashvin
on 10 Jun 2015
Guillaume
on 10 Jun 2015
Your file is a real mess: sometimes empty lines contain just one space, sometimes no space at all; the header line starts with 3 spaces, the unit line with only two; the parameter section sometimes has one parameter on a line, sometimes two. You may be better off parsing the file line by line.
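A minimal sketch of such a line-by-line parse, under the assumption that a section starts at a line holding both factors and that every line made only of digits, dots and spaces is a table row. The sample text and the names filelines and wantedtable are made up for illustration; in practice content would come from fileread.

```matlab
% Hypothetical sample standing in for content = fileread('question.txt').
content = sprintf(['DRAG FACTOR 1.0 FUEL FACTOR 1.2\r\n' ...
                   '   WGHT ALT\r\n' ...
                   '  (KG) (FT)\r\n' ...
                   '\r\n' ...
                   ' 60000 39000\r\n' ...
                   ' \r\n' ...
                   ' 65000 37000\r\n' ...
                   'DRAG FACTOR 1.0 FUEL FACTOR 1.4\r\n' ...
                   '   WGHT ALT\r\n' ...
                   ' 60000 38000\r\n']);

filelines = regexp(content, '\r?\n', 'split');   % tolerate \r\n and \n
sections = struct('drag', {}, 'fuel', {}, 'table', {});
for k = 1:numel(filelines)
    thisline = strtrim(filelines{k});
    if isempty(thisline)
        continue;   % empty or whitespace-only line
    end
    tok = regexp(thisline, ...
        'DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)', 'tokens', 'once');
    if ~isempty(tok)
        % A factor line opens a new section with an empty table.
        sections(end+1) = struct('drag', str2double(tok{1}), ...
                                 'fuel', str2double(tok{2}), 'table', []);
    elseif ~isempty(sections) && isempty(regexp(thisline, '[^0-9. ]', 'once'))
        % Only digits, dots and spaces: append the line as a table row.
        sections(end).table(end+1, :) = sscanf(thisline, '%f')';
    end
    % Anything else (header, units, other text) is ignored.
end

% Select the section for the wanted factors, with a small tolerance.
wanted = abs([sections.drag] - 1.0) < 1e-9 & abs([sections.fuel] - 1.2) < 1e-9;
wantedtable = sections(wanted).table;
```

This tolerates the variable blank lines and leading spaces, but it errors if rows within one section have different numbers of columns, and it would need extending if the two factors appear on separate lines.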
Otherwise, the following will get you the table and the criteria section, but will not parse the criteria:
sections = regexp(content, ...
'CLEAN CONFIGURATION\r\n((.*\r\n)+?)(\s+WGHT.*\r\n.*\r\n.*\r\n([0-9. ]+\r\n)+)', ...
'tokens', 'dotexceptnewline');
sections is a 1 x n (n = number of sections) cell array of cell arrays whose first elements are the criteria part and second elements the table part. You can parse the table with the same code as before. For reference, the above regular expression can be decoded as:
- match 'CLEAN CONFIGURATION' followed by '\r\n' (newline)
- start the first token (at the first opening parenthesis)
- match any number of characters other than a newline, followed by '\r\n' (the (.*\r\n) part)
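As a rough illustration of what this expression returns, and of one way the criteria part could then be parsed, here is a sketch on a made-up section. The sample text is an assumption, and the second regexp call that pulls out the drag factor is an addition for illustration, not part of the expression above.

```matlab
% Hypothetical sample: one parameter per line, then header, units,
% a blank line and the numeric rows, with Windows line endings.
content = sprintf(['CLEAN CONFIGURATION\r\n' ...
                   ' DRAG FACTOR 1.0\r\n' ...
                   ' FUEL FACTOR 1.2\r\n' ...
                   '   WGHT ALT\r\n' ...
                   '  (KG) (FT)\r\n' ...
                   ' \r\n' ...
                   ' 60000 39000\r\n' ...
                   ' 65000 37000\r\n']);

sections = regexp(content, ...
    'CLEAN CONFIGURATION\r\n((.*\r\n)+?)(\s+WGHT.*\r\n.*\r\n.*\r\n([0-9. ]+\r\n)+)', ...
    'tokens', 'dotexceptnewline');

criteria  = sections{1}{1};   % every line between the marker and the header
tabletext = sections{1}{2};   % header, units, blank line and numeric rows

% Assumed follow-up step: extract one criterion from the criteria text.
dragtok = regexp(criteria, 'DRAG FACTOR\s+([0-9.]+)', 'tokens', 'once');
drag = str2double(dragtok{1});
```

Searching the criteria text for each parameter separately like this does not care whether a line holds one parameter or two.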