Parsing a text file in MATLAB and accessing the contents of each section

Hi, I want to split a fairly large text file into separate sections in MATLAB. I need to:
- Ignore the first set of lines
- Read the data set, which is then repeated
- Access its contents for a particular set of conditions
For example, for a drag factor of 1.0 and a fuel factor of 1.2, I want to find the corresponding alt for a particular weight.
The text file is attached.
Thanks, Yashvin

2 Comments

  • "quite big": how big compared to available memory?
  • "different sections": what defines the beginning of a section? Is "V2500_A5" a fixed string that defines the beginning of a new section?
The text file is 60 MB. As an example, I am attaching one full section of the file. The initial part, up to "Cruise at a given cost index", is unimportant.
Each section begins with "CLEAN CONFIGURATION" followed by a table.
For example, for drag factor = 1, fuel factor = 1.2 and ISA = 13, I want to access the table and get the corresponding weight.
I want to treat all the parameters under 'CLEAN CONFIGURATION' as fields so that I can select on different conditions.


 Accepted Answer

Here is a function, which reads question2.txt and returns a struct vector. It might serve as a starting point.
>> out = cssm()
out =
1x2 struct array with fields:
DRAG_FACTOR
FUEL_FACTOR
Table
>> out(abs([out.DRAG_FACTOR]-1)<1e-6 & abs([out.FUEL_FACTOR]-1)<1e-6).Table(1:5,1:3)
ans =
1.0e+04 *
4.0000 0.0000 0.0211
4.0500 0.0000 0.0212
4.1000 0.0000 0.0213
4.1500 0.0000 0.0214
4.2000 0.0000 0.0215
where
function out = cssm()
    str = fileread( 'question2.txt' );
    section_separator = 'CLEAN CONFIGURATION';
    cac = strsplit( str, section_separator );
    len = length( cac );
    out = struct( 'DRAG_FACTOR',nan(1,len-1), 'FUEL_FACTOR',[], 'Table',[] );
    for jj = 2 : len
        out(jj-1) = handle_one_section_( cac{jj} );
    end
end

function sas = handle_one_section_( str )
    sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] );
    sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
    sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
    sas.Table = excerpt_table_( str );
end

function val = excerpt_num_( str, name )
    buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
    val = str2double( buf );
end

function val = excerpt_table_( str )
    % Q&D, quick and dirty: search for a numeric sequence that is at least 100
    % characters long. PROBLEM: requires that the preceding line ends with a
    % "non-numeric" character and that the following line begins with a
    % "non-numeric" character.
    buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
    val = str2num( buf );
end
Modified function based on comment
>> cssm
ans =
1x2 struct array with fields:
DRAG_FACTOR
FUEL_FACTOR
Table
COST_INDEX
ALTITUDE
ISA
where
function out = cssm()
    str = fileread( 'question2.txt' );
    section_separator = 'CLEAN CONFIGURATION';
    cac = strsplit( str, section_separator );
    len = length( cac );
    out = struct( 'DRAG_FACTOR',nan(1,len-1), 'FUEL_FACTOR',[], 'Table',[] ...
                , 'COST_INDEX' ,[] , 'ALTITUDE' ,[], 'ISA' ,[] );
    for jj = 2 : len
        out(jj-1) = handle_one_section_( cac{jj} );
    end
end

function sas = handle_one_section_( str )
    sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] ...
                , 'COST_INDEX' ,[], 'ALTITUDE' ,[], 'ISA' ,[] );
    sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
    sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
    sas.COST_INDEX = excerpt_colon_separated_num_( str, 'COST INDEX' );
    sas.ALTITUDE = excerpt_colon_separated_num_( str, 'ALTITUDE' );
    sas.ISA = excerpt_colon_separated_num_( str, 'ISA' );
    sas.Table = excerpt_table_( str );
end

function val = excerpt_num_( str, name )
    buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
    val = str2double( buf );
end

function val = excerpt_table_( str )
    % Q&D, quick and dirty: search for a numeric sequence that is at least 100
    % characters long. PROBLEM: requires that the preceding line ends with a
    % "non-numeric" character and that the following line begins with a
    % "non-numeric" character.
    buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
    val = str2num( buf );
end

function val = excerpt_colon_separated_num_( str, name )
    buf = regexp( str, [ '(?<=', name, ')', '(?:[ \:\-]+)([\d\.])+' ], 'tokens', 'once' );
    val = str2double( buf{:} );
end
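To show how the struct array can then be queried, here is a minimal sketch. The variable out below is a hand-made stand-in for the result of cssm(), with invented numbers, and it assumes that the first table column is the weight (WGHT). The names tol, is_match, weight_wanted and selected_row are only for illustration.

```matlab
% Hypothetical stand-in for the struct array returned by cssm(); the numbers
% are invented for illustration and do not come from the original file.
out(1) = struct('DRAG_FACTOR',1.0, 'FUEL_FACTOR',1.0, 'ISA',13, ...
                'Table',[40000 0.45 211; 40500 0.45 212; 41000 0.46 213]);
out(2) = struct('DRAG_FACTOR',1.0, 'FUEL_FACTOR',1.2, 'ISA',13, ...
                'Table',[40000 0.45 253; 40500 0.45 254; 41000 0.46 256]);

% Select the section by its conditions, comparing with a tolerance
% instead of == because the values are floating point.
tol = 1e-6;
is_match = abs([out.DRAG_FACTOR] - 1.0) < tol ...
         & abs([out.FUEL_FACTOR] - 1.2) < tol ...
         & abs([out.ISA] - 13) < tol;
assert(nnz(is_match) == 1, 'Expected exactly one matching section');

% Assumption: column 1 of the table is the weight. Pick the nearest row.
tbl = out(is_match).Table;
weight_wanted = 40500;
[~, row] = min(abs(tbl(:,1) - weight_wanted));
selected_row = tbl(row, :);
```

If a condition is missing from a section, str2double returns NaN for it, so that section simply never matches.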

9 Comments

Hi, thanks! I am trying to apply it to more input conditions. Can you please explain these two lines? I will try to apply them to other cases.
buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
buf = regexp( str, ['(?<=', name, ')', '[ ]+[\d\.]+'], 'match', 'once' );
  • Read the documentation on regular expressions, especially Lookaround Assertions
  • It returns, in buf, the numeric string, including the preceding spaces, that follows the text held in the variable name
buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
  • It returns, in buf, the first sub-string that is at least 100 characters long and consists only of digits, dots ("."), and white space (any white-space character; equivalent to [ \f\n\r\t\v])
Do you use the debug features to analyze the code? Set a breakpoint and step one line at a time. Inspect how the values of variables change.
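To try both expressions outside the full function, here is a small sketch on an invented sample string. The sample text and the lowered minimum length of 20 (instead of 100) are assumptions made only so that the example is short.

```matlab
% Invented sample text that mimics one section header and a short table.
str = sprintf(['DRAG FACTOR   1.000   FUEL FACTOR   1.200\n' ...
               'WGHT  MACH\n' ...
               '40000 0.45\n40500 0.46\n41000 0.47\n']);

name = 'FUEL FACTOR';
% (?<=FUEL FACTOR) is a lookbehind: the match must be preceded by the name,
% but the name itself is not part of the match.
buf = regexp(str, ['(?<=', name, ')', '[ ]+[\d\.]+'], 'match', 'once');
val = str2double(buf);   % leading blanks are ignored by str2double

% Same idea as the table pattern, with the minimum run length lowered from
% 100 to 20 characters so that it works on this short sample.
tblstr = regexp(str, '[\d\.\s]{20,}', 'match', 'once');
tbl = str2num(tblstr);   % each line of the matched text becomes one row
```

The short numeric runs in the header line are skipped because they are shorter than the minimum length, which is the reason for the {100,} in the original.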
Hi, yes, sometimes I use it. I understand the program, except that I am still learning how to extract the input conditions as you did in the "buf" line. I want to do likewise with altitude, ISA and cost index. I am trying to understand the grammar to extract the corresponding values.
It takes some practicing to learn regular expressions.
I added the function excerpt_colon_separated_num_, which despite the name also handles the dash-separated and the space-separated cases.
The dash separator, "-", is unfortunate because a minus sign, "-", will be "shadowed" by it. Could ISA be negative? Replacing
([\d\.])+
by
( [\d\.])+
should solve that.
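For reference, here is a self-contained sketch of a sign-aware variant. It is not the function from the answer: the helper name excerpt and the sample text are invented, and it assumes that a dash followed by a blank is a separator while a dash directly in front of a digit is a minus sign. The quantifier is placed inside the parentheses so that the token holds the whole number.

```matlab
% Invented sample; the real file layout may differ.
str = sprintf('COST INDEX : 30\nALTITUDE - 35000\nISA : -13\n');

% Hypothetical helper:
%   [ :]*          blanks and colons are separators
%   (?:-[ ]+)?     a dash followed by blanks is also a separator (assumption)
%   ([+-]?[\d\.]+) the token: optional sign, then digits and dots
excerpt = @(s, name) str2double(regexp(s, ...
    ['(?<=', name, ')[ :]*(?:-[ ]+)?([+-]?[\d\.]+)'], 'tokens', 'once'));

cost_index = excerpt(str, 'COST INDEX');
altitude   = excerpt(str, 'ALTITUDE');
isa_dev    = excerpt(str, 'ISA');
```

Whether the dash rule holds for the real file has to be checked against the actual header lines.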
Thanks! I will check and get back to you!
It looks like the attachments have been removed, so I can't check anymore, but it was my impression that the header part of each section (the conditions) could change from file to file. Certainly, it wasn't the same in the two files that were posted.
If it is always the same format, then my original answer just needs some tweaking to work. It's up to you whether you prefer one big complex regular expression that does most of the parsing in one go (as in my answer) or splitting the parsing into several functions (as in per's answer). Per's answer has the advantage of being easier to understand if you're not deeply familiar with regular expressions, and is possibly also easier to adapt to other file structures.
Note that if you're trying to learn regular expressions, I've explained the two I wrote in great detail in my answer, so go over them (with the help of the documentation), so I don't feel like I've wasted my time.
@Guillaume, yes the two text files differed. The first is a stripped down version of the second. I attach the copies I used.
Hi! Do you still have the file? Yes! Now it's clearer to me! Thanks so much! Yes, both your answers were very helpful! I am getting used to it now. The first answer was at a higher level! Thank you both for your contributions!


More Answers (1)

Your text file is not really designed to be read by a computer. It's not very consistent (variable number of blank lines, variable number of spaces, inconsistent number format, etc.) which makes it difficult to parse efficiently.
So the first thing to look at is if you can get the same data in a format designed to be parsed by a computer: binary, json, xml, etc.
Failing that, the following works on the attached file, but because of the inconsistencies may not work on a larger file:
dragwanted = 1.0;
fuelwanted = 1.2;
content = fileread('question.txt'); %get whole content of file
sections = regexp(content, 'DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)\s+([A-Z .]+\r\n[A-Z() ]+\r\n\s*\r\n([0-9. ]+\r\n)+)', 'tokens');
%sections is a cell array of 1x3 cell arrays of {drag factor, fuel factor, table}
dragfactors = cellfun(@(s) str2double(s{1}), sections);
fuelfactors = cellfun(@(s) str2double(s{2}), sections);
wanted = dragfactors == dragwanted & fuelfactors == fuelwanted;
assert(sum(wanted) > 0, 'No section match criteria');
assert(sum(wanted) == 1, 'More than one section match criteria');
section = sections{wanted}{3};
%parse the section:
sectionlines = strsplit(section, {'\n', '\r'});
sectionheader = strsplit(strtrim(sectionlines{1}))
sectionunits = strtrim(regexp(sectionlines{2}, '(?<=\().*?(?=\))', 'match'))
sectiontable = str2num(strjoin(sectionlines(4:end-1), '\n'))

6 Comments

Yashvin comment moved here:
Hi Thanks! It works for smaller files.
Unfortunately, the program I am using only outputs the data in this format, with lots of empty lines and blank spaces! I tried it on a larger file, but it did not work. It said "No section match criteria". The sections part starts generating empty arrays.
Can you please explain the sections part and how we can generalize it to more rows and more input choices?
I have attached a new file. Thanks
The file cannot be attached. Can I mail you the file?
No you cannot email me the file. If for some reason the file is too big to post, just post a section of the file.
I used a regular expression to detect each section. This particular regular expression matches the sections of the file that:
  1. start with 'DRAG FACTOR' followed by one or more blank characters (the DRAG FACTOR\s+ part)
  2. followed by one or more digits or dot characters. That part is captured as a token (the ([0-9.]+))
  3. followed by one or more blank characters, followed by 'FUEL FACTOR' followed by one or more blank characters (the \s+FUEL FACTOR\s+)
  4. followed by one or more digits or dot characters. That part is captured as the second token (the second ([0-9.]+))
  5. followed by one or more blank characters; this includes spaces, tabs, and line returns (the \s+)
  6. followed by a third capture (starts at the next "(", finishes at the final ")")
  7. a string made exclusively of uppercase alphabetic characters, dots or spaces (no tabs), finished by the Windows line return characters \r\n
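The pieces of the pattern can be checked one at a time on a small sample. The sketch below builds the same regular expression as in the answer from commented fragments; the sample text (with Windows line endings) is invented, and the variable names content, pat, drag, fuel and block are only for illustration.

```matlab
% Invented two-row sample with Windows line endings (\r\n), as the pattern expects.
content = sprintf(['DRAG FACTOR  1.000  FUEL FACTOR  1.200\r\n' ...
                   ' WGHT MACH\r\n' ...
                   ' (KG) (KT)\r\n' ...
                   '\r\n' ...
                   '40000 0.45\r\n' ...
                   '40500 0.46\r\n']);

pat = ['DRAG FACTOR\s+', ...       % literal label, then blanks
       '([0-9.]+)', ...            % token 1: drag factor
       '\s+FUEL FACTOR\s+', ...    % blanks, literal label, blanks
       '([0-9.]+)', ...            % token 2: fuel factor
       '\s+', ...                  % blanks, including line breaks
       '(', ...                    % token 3 starts here
       '[A-Z .]+\r\n', ...         % header line: uppercase letters, dots, spaces
       '[A-Z() ]+\r\n', ...        % units line: uppercase letters, parentheses, spaces
       '\s*\r\n', ...              % blank line
       '([0-9. ]+\r\n)+', ...      % one or more table rows
       ')'];                       % token 3 ends here

sections = regexp(content, pat, 'tokens');
drag  = str2double(sections{1}{1});
fuel  = str2double(sections{1}{2});
block = sections{1}{3};            % header, units, blank line and table rows
```

Removing fragments from the end of pat until a match appears is a quick way to find which part of a real file breaks the expression.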
Thanks for your extensive explanation. I am checking the syntax. Please find attached 2 sections of the complete file for your perusal. I am getting output as "No section match criteria".
Well, your new file breaks part 5 (there are non-blank lines between the table and the drag factor/fuel factor line), and part 8 (units can also include '.', '%', '/' and probably other symbols) of the regular expression. So of course, there is no match.
To build a regular expression, you need to establish what is constant in the file and what is not. It seems that the list of conditions before the table is variable. Is that the case?
Also, are the criteria always drag factor and fuel factor, or can they be something else (like altitude)? That is, what needs to be parsed and what can be discarded?
Finally, does the table always start with WGHT, or can it have a different header? Do you actually care about the header and units, or can they be discarded?
Now I understand it better, thanks to you! So, in fact, the list of conditions before the table can contain any of them. It can also include CG location percentage, altitude value, ISA number (positive or negative), cost index value or % of MCR thrust.
In each section of the file, we care only about the part from CLEAN CONFIGURATION to the last value of the table. The rest can be discarded.
The table always starts with WGHT and the header stays the same. Yes, the units should be kept.
Thanks Yashvin
Your file is a real mess, sometimes you have empty lines with just one space, sometimes with no spaces, the header line starts with 3 spaces, the unit line only two, the parameter section sometimes has one parameter on a line, sometimes two. You may be better off parsing the file line by line.
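As a rough illustration of the line-by-line approach, here is a minimal sketch. It writes a small invented sample to a temporary file so that it can be run as is. It assumes that a section starts at a line reading CLEAN CONFIGURATION, that the column header starts with WGHT, that the units line starts with "(", and that table rows contain only numbers. The field names Criteria and Table are hypothetical.

```matlab
% Write a small invented sample to a temporary file so the sketch is runnable.
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'header to ignore\nCLEAN CONFIGURATION\n');
fprintf(fid, 'DRAG FACTOR  1.000   FUEL FACTOR  1.200\n \n');
fprintf(fid, '   WGHT  MACH\n  (KG)  (-)\n\n40000 0.45\n40500 0.46\n');
fprintf(fid, 'CLEAN CONFIGURATION\nDRAG FACTOR  1.050\n');
fprintf(fid, '   WGHT  MACH\n  (KG)  (-)\n41000 0.47\n');
fclose(fid);

sections = struct('Criteria', {}, 'Table', {});   % empty struct array
fid = fopen(fname, 'r');
tline = fgetl(fid);
while ischar(tline)                               % fgetl returns -1 at end of file
    txt = strtrim(tline);
    if strcmp(txt, 'CLEAN CONFIGURATION')
        % Start a new section.
        sections(end+1) = struct('Criteria', {{}}, 'Table', []); %#ok<SAGROW>
    elseif isempty(sections) || isempty(txt)
        % Skip the file header and empty or blank-only lines.
    elseif ~isempty(regexp(txt, '^[\d\.\s+-]+$', 'once'))
        % Purely numeric line: one table row.
        sections(end).Table(end+1, :) = sscanf(txt, '%f').';
    elseif isempty(sections(end).Table) && isempty(regexp(txt, '^(WGHT|\()', 'once'))
        % Text before the table that is neither the column header nor the units.
        sections(end).Criteria{end+1} = txt;
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Because only one line is held at a time, this also avoids loading a 60 MB file into memory in one go; the criteria lines collected in Criteria still have to be parsed separately.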
Otherwise, the following will get you the table and the criteria section, but will not parse the criteria:
sections = regexp(content, ...
'CLEAN CONFIGURATION\r\n((.*\r\n)+?)(\s+WGHT.*\r\n.*\r\n.*\r\n([0-9. ]+\r\n)+)', ...
'tokens', 'dotexceptnewline');
sections is a 1 x n (n = number of sections) cell array of cell arrays whose first elements are the criteria part and second elements the table part. You can parse the table with the same code as before. For reference, the above regular expression can be decoded as:
  1. match 'CLEAN CONFIGURATION' followed by '\r\n' (newline)
  2. start the first token (at the first "(")
  3. match any character but a newline followed by '\r' (the |(.*\r


Asked:

on 10 Jun 2015

Commented:

on 12 Jun 2015
