Remove elements appearing sequentially in a larger text.

Hello.
I just started working as an engineer and was recently given the tedious task of editing data files. I figure this is something that can be done in MATLAB, but the brief course I took during my studies left me with only the most basic knowledge (if that).
The reason I believe this should be rather easy is that the data is sequentially arranged, with each sequence being about two pages long and identical in form. No rows are to be partially edited, so this would be the simplified case, with the data on the left and the edited output on the right:
1 a 1 a
2 b 2 c
3 c 3 a
4 a 4 c
5 b
6 c
[...]
This pattern repeats itself a couple of hundred times, so some kind of loop has to be implemented if this is going to be quicker than just cut and paste.
The data contains numbers, characters, and tables.
Thanks,
Tord

4 Comments

Tord on 10 Jun 2014
Edited: Tord on 10 Jun 2014
My rows and columns collapsed above. The point is that I want to remove, e.g., every third and seventh row in these datasets. No calculations, functions, etc. are to be done.
Could you attach one of these files or copy/paste the a chunk of its content, and indicate clearly what you want to eliminate?
This being classified data and me being a new employee, no. But I can illustrate for you:
[start] Analysis nr: 1234
Name: Example
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Circle index: 1111
[end] - then repeat hundreds of times.
And in every one of them I want to remove, e.g., "Name: Example" and "Circle index: 1111" (random selection).
With dpb's answer I guess it would read: nr=length(file); ix=unique([[1:2:nr] [1:6:nr]]); file(ix)=[];
I suddenly started to wonder if this is all I need. I will try.
If all blocks have same length, same number of characters, etc, you can remove periodically lines with a fixed period. If blocks can vary a bit in length, you either have to analyze line by line and take a decision or perform pattern matching and replacement.
For pattern matching, see my answer.


 Accepted Answer

dpb on 10 Jun 2014
Edited: dpb on 10 Jun 2014
...Point is that I want to remove i.e every third and seventh row in these datasets...
If the data are regular in line location(s) (that is, don't have to search for a pattern to locate sections), then it's pretty simple --
A) read the file into a cell array of character data--
file = textread('yourfile', '%s', 'delimiter', '\n', 'whitespace', '');
B) delete the lines not wanted... I'm not positive of the precise definition of "every third and seventh row", but assuming it's the joint combination of [1:2:end] and [1:6:end], then
nr=length(file); % number rows in file
ix=unique([[1:2:nr] [1:6:nr]]); % selected rows to delete
file(ix)=[]; % remove the rows unwanted
C) rewrite to a file -- NB: either create a backup first or be sure to create a new copy on writing while debugging!!!
You can do it in a single step if you can define a rule for any arbitrary set of lines to be deleted that is fixed in relationship to the beginning of the file, no matter how complex that rule might be.
ADDENDUM
Following your example of a file, I made a local file of some number of repetitions of same...
>> file=textread('file.txt', '%s', 'delimiter', '\n', 'whitespace','')
file =
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
>> nr=length(file);
>> ix=sort([[2:5:nr] [5:5:nr]]); % no unique; this pattern has no overlap
>> file(ix)=[];
>> fid=fopen('file1.txt','w');
>> for i=1:length(file),fprintf(fid,'%s\n',file{i});end
>> fid=fclose(fid);
>> type file1.txt
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
>>
Voila! Joy ensues... :)

4 Comments

Thank you, I will soon return.
Thank you once again!
In fact I had already started to worry about both overlap and the output itself. I am truly impressed with you and Cedric's knowledge - not to mention the will to share it.
I would not have asked if I hadn't googled what I could beforehand, and I really would have had to do this the stupid way (mark-delete-repeat) for almost a day without the help.
THANKS
There's nothing about the overlap that's a problem as long as the indices aren't duplicated, which would erroneously remove a row at the wrong location. The invocation of unique was mostly a nicety to remove duplicates and, as a corollary, to sort the index array, which should help runtime. In effect the net result is the same either way; it just looks cleaner with it than without.
If you do need the pattern-matching solution, Cedric's the undoubted wizard on regular expressions while I'm a novice there... but for the "deadahead" case that you seem to have, this is by far the quicker.
If it does solve the problem, please go ahead and Accept the answer so we know to close the issue.
Yes, overlap could become an issue because I noticed at least one line that did not match exactly the template I used (small construction difference).
I just tried, and failed, with the unique function implemented. I will try some more and then accept it regardless of outcome, I understand that this is below what you guys want to use your time on.
Once again, thanks.


More Answers (2)

Here is an example using pattern matching and replacement..
% - Get and modify content.
fName = 'tord_1.txt' ;
content = fileread( fName ) ;
content = regexprep( content, 'Name:[^\n]*\n', '' ) ;
content = regexprep( content, 'Circle index:[^\n]*\n', '' ) ;
% - Output modified version to file.
[fPath, fBase, fExt] = fileparts( fName ) ;
fId = fopen( fullfile( fPath, [fBase, '_modified', fExt] ), 'w' ) ;
fwrite( fId, content ) ;
fclose( fId ) ;

12 Comments

I think modifying by hand will take very long. The example is kind of misleading with regard to size. There are 170 rows in each sequence originally, and 100 when edited. About 40k rows total.
The fact that no rows are partially edited makes deleting by rows the obvious choice.
By listing all 70 rows like dpb described, I believe the result will be correct?
You'll have to figure out on your own if dpb's solution works. The only thing that I can say is that solutions along these lines assume regularity in the length (in number of lines) of the blocks that you want to remove. If this regularity is present, this is the best solution. If not, you'll either have to process line by line and implement some logic to determine whether each line has to be kept, or use pattern matching. If you don't know regular expressions, I can help you with the pattern, but I'll need to know way more about the content.
Tord on 10 Jun 2014
Edited: Tord on 10 Jun 2014
As far as I can see I would be there if I could remove every: [1-49]th, [77-85]th, [106-114]th, [141-147]th. Total 42000. I'm about to write them all one by one, but if you have a smarter move, please share, hehe.
..if I could remove every: [1-49]th, [77-85]th, [106-114]th, [141-147]th.
The difference between the first of each group is 76, 29, and 35. What's the difference between this last group of four and the first of the next group or is the beginning of the next set 148?
Knowing that, one can write a recursion relation...
OK, so there is actually no regularity (if I understand correctly). If it varies within a file but not across files (which means that in each file corresponding blocks have matching lengths), dpb's solution can be updated a little with your ranges entered by hand, and then applied very efficiently to all files. If you don't have matching lengths across files, I would advise you to try a regexp-based approach, unless you want to code the logic required for testing line by line whether to keep the line or eliminate it.
If you want to try the regexp, I need to know exactly, say, at least 10 to 20 characters around the start and the end of a typical block (the more the better), which means e.g. 10 before and 10 after the start, and 10 before and 10 after the end. I also need to know the type of data present in each block: numbers only? Mixed numbers and text? Comma separated? I don't need the data or the rest of the file if the beginning and the end of the blocks that you want to eliminate are specific enough.
If you cannot share the file on a public forum but could share it with me, don't hesitate to email it to me by the way.
I don't know if I expressed myself unambiguously. What I mean is that the text consists of blocks of identical structure, each 170 lines in length. Of these, rows [50-76], [86-105], [115-140] are needed, thus making the aforementioned rows redundant. These 170 rows repeat themselves with only different numbers, totalling a little more than 42000.
As you have noticed, this is not something I am experienced with, so I am not sure if you mean block as in every local block (i.e. [50-76] and [86-105]) or the "global" blocks consisting of all 170 repetitive lines? I will give you the characters you need as soon as I know which one it is. Thanks again for your help; I am truly impressed with the level of help you are giving me. This is beyond my highest expectations, and you, my good man, are giving me faith in humanity. Nothing less.
OK, I understood (and presumed) the file was regular; the additional piece of information needed was that the size of the repeating section is 170 lines. I've got a meeting to get ready for first; I'll try to get back before then, but it may be afternoon before I can actually do anything.
But the idea is to set a counter from these start/stop positions and the block size and then just build the groups with that multiple of 170 lines as an offset for each succeeding block, until the next block would be past the total length of the file.
EDIT: I hadn't refreshed the page I guess since earlier today, and I just see that dpb wrote another solution, so you have two ways for building indices now ;-)
Well, I'm glad it helps! I understand better after reading your last comment. dpb's solution will work well in this case. As it is generating indices of lines to keep or to remove, you cannot have content which varies in length. For example, it wouldn't work in the following situation, where I indicate line numbers on the left:
1 Block1
2 Name = Circle
3 Data: 1 2 3
4 4 5 6
5 Block2
6 Name = Rectangle
7 Data: 9 8 7
8 6 5 4
9 3 2 1
10 Block3
11 Name = Oval
12 Data: 1 2 3
if shapes could be listed in any order or have data with random length, because, as you can see, the data have a varying length in terms of number of lines. In such a case, you cannot know a priori the order and you cannot define where relevant lines will be. This would leave us with scanning line by line and taking a decision (keep or discard) for each line, or with pattern matching.
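If line-by-line scanning were needed, a minimal sketch could look like the following (the 'Name:' and 'Circle index:' prefixes are assumptions taken from the example block shown earlier in the thread; adapt them to whatever actually marks an unwanted line):

```matlab
% Sketch only: scan line by line, keeping lines that do not start
% with one of the unwanted prefixes. The two prefixes below are
% assumptions based on the illustration posted earlier.
file = textread('yourfile.txt', '%s', 'delimiter', '\n', 'whitespace', '');
keep = true(size(file)) ;             % logical mask of lines to keep
for k = 1 : length(file)
    line = file{k} ;
    if strncmp(line, 'Name:', 5) || strncmp(line, 'Circle index:', 13)
        keep(k) = false ;             % discard this line
    end
end
file = file(keep) ;                   % only the wanted lines remain
```

The advantage of this approach is that it does not care how long each block is; the cost is that every unwanted line must be recognizable from its content alone.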
Now your situation appears to be that you know exactly that you will have consecutive blocks with exactly 170 lines each, and that within each block you need the 50th to the 76th lines, etc. So there is no variation in the length of e.g. numeric arrays internal to each block, and you have regularity among blocks.
So we can generate the IDs of relevant lines for block 1 as follows:
lineIDs = [50:76, 86:105, 115:140] ;
and the question that remains is how to repeat that with a 170-line interval until the end of the file. dpb did the following to get lines/rows:
file = textread('yourfile', '%s', 'delimiter', '\n', 'whitespace', '');
nr = length(file);
so we can use nr to build a vector of increments of 170 until the end of the file, as follows (note that it may be 171 in your case; you'll have to check):
steps = 0 : 170 : nr ;
then we can use it to create line IDs for the whole file:
lineIDs = repmat( lineIDs, length( steps ), 1 ) + ...
repmat( steps(:), 1, length( lineIDs ) ) ;
I'll let you run that on a smaller example to see how it works, e.g. you want to repeat lines 2, 3, 5 with a 10-line interval and you have 29 lines total:
nr = 29 ;
lineIDs = [2,3,5] ;
steps = 0 : 10 : 29 ;
lineIDs = repmat( lineIDs, length( steps ), 1 ) + ...
repmat( steps(:), 1, length( lineIDs ) ) ;
Running this gives
>> lineIDs
lineIDs =
2 3 5
12 13 15
22 23 25
Once we have this array, we can transform it to get a vector of line IDs:
lineIDs = reshape( lineIDs.', 1, [] ) ;
Applied to our previous small example we get
>> lineIDs
lineIDs =
2 3 5 12 13 15 22 23 25
And we finish as explained by dpb (but in our case we keep the lines instead of removing them):
file = file(lineIDs) ;
fid=fopen('file1.txt','w');
for i=1:length(file),fprintf(fid,'%s\n',file{i});end
fid=fclose(fid);
Or we can reuse part of my solution to build a new file name based on the original..
fName = 'tord_1.txt' ;
.. read/process ..
[fPath, fBase, fExt] = fileparts( fName ) ;
fId = fopen( fullfile( fPath, [fBase, '_modified', fExt] ), 'w' ) ;
for k = 1 : length(file), fprintf( fId , '%s\n', file{k} ); end
fclose( fId ) ;
I am writing here because you have both commented and will receive a notification on this. Most of all I wanted to sink into the ground and never enter this forum again, but that would make me an even bigger jackass...
What I found out last night was that there is a small variation in the number of lines after a couple of thousand. This variation is pretty much symmetrical, like 169 - 170 - 171 - 170, and thus the sum always added up. I actually managed to get the script working with the most basic approach; that was when I saw the domino effect the odd lines here and there had on the total. In other words, the fixed-length approach is useless. I have been copy-pasting all day, and will be doing it for the rest of the night..
I cannot describe how sorry I am and how stupid I feel. You guys have done so much to help out, and I screwed it up by defining the premises sloppily. I wish there was something I could do to give back the priceless contribution you guys have given, both solving and explaining to me. The latter is still valid though, and once again - thanks. At least you recruited a MATLAB fan..
This task needs to be done with every new project, so I will use what I've learned so far for what it's worth and see where it goes. Maybe I dare ask a question again some time in the future when my competence is at a decent level..
I really don't know what to say other than that I am truly very, very sorry and even more thankful.
I'm not sure I'd give up entirely (yet, anyway). And don't feel bad; we've all had such an experience (or worse), and particularly early in one's career it's not unusual to be humbled. I'll quote the tag line from a longtime poster/world-class expert on the Fortran newsgroup I also frequent --
"Good judgement comes from experience. Experience comes from bad judgement."
~ Mark Twain
Anyway, I'd suggest looking to see if perhaps there's a way to sort of combine the two ideas -- find out if there is a way to discover when this "off by one" count occurs so that you can then compensate for it. That might take doing some parsing of the content in that section, or perhaps, when you get to that section, doing some line-counting to find the next location and then fixing up the indices on that basis.
I was also going to point out that another way to handle the I/O would be to use textscan or similar with the 'headerlines' parameter to skip a counted group of lines, then read (and copy) a group and repeat, instead of holding the whole file in memory. If you need to do a search-and-destroy mission like this, that may be a better approach: read the first group, check all is still well, then read the next and repeat. When/if you find that "off by one" issue, it's simpler, perhaps, to fix it there than globally.
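One way to sketch that resynchronization idea: instead of assuming a fixed 170-line period, locate the start of each block first and build the deletion indices relative to each start, so an off-by-one block doesn't shift everything after it. The 'Analysis nr:' marker below is an assumption taken from the example posted earlier; use whatever line reliably begins each block in the real files:

```matlab
% Sketch only: resynchronize on block starts rather than a fixed period.
% 'Analysis nr:' as the block marker is an assumption from the earlier example.
file   = textread('yourfile.txt', '%s', 'delimiter', '\n', 'whitespace', '');
starts = find(strncmp(file, 'Analysis nr:', 12)) ;   % first line of each block
ig     = [1:49, 77:85, 106:114, 141:147] ;           % per-block offsets to delete
ix     = [] ;
for k = 1 : length(starts)
    ix = [ix, starts(k) - 1 + ig] ;                  % indices relative to block start
end
ix(ix > length(file)) = [] ;                         % guard against a short last block
file(ix) = [] ;                                      % remove the unwanted rows
```

Note this still assumes the offsets to delete are the same within every block; only the block spacing is allowed to drift.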
If, as it sounds like, this is going to be a recurring issue for your employer, you could well be doing a big service by figuring out a way to automate this as you do the mundane part. Or, depending on how these files are generated, perhaps you can make some changes upstream in the process so the files stop being created in a form that can't be automatically parsed, by showing how they could be formatted instead to allow the script to work reliably.
Hey Tord, don't be sorry, we most likely would have done the same thing (assuming regularity until we observe a shift), and that is how we learn after all ;-)
I'll go on with pattern matching by email.
You both being so understanding means a great deal to me, thank you both.
As I told Cedric by mail, I have not been connected during the weekend and now I have to prioritize other tasks at work. I sat up the whole night before Friday editing the text by hand.
But this task needs to be done every now and then, thus I will continue working on this so it will be ready by then.
I will look into this later today and keep you all posted on the progress - both what is being done and to what degree I actually understand it.


dpb on 11 Jun 2014
Edited: dpb on 11 Jun 2014
OK, try this... this is a "deadahead" looping solution to build the vector from the information provided -- it can be made to look "more Matlaby", but this is what I could do before my meeting...
Starting with your block definitions and the overall length of the repetitive section...
>> ix=[1 49; 77 85; 106 114; 141 147] % the sections to remove
ix =
1 49
77 85
106 114
141 147
>> N=170; % the overall block length
>> L=42000;
Following is a sanity check to compare lengths to your given ...
>> ceil(L/N)
ans =
248
>> 248*N
ans =
42160
>> L=ans; % sanity check I did on overall lengths
The above look right I presume???
Anyway, back to the building of an overall deletion index...
>> ig=[];for i=1:size(ix,1),ig=[ig; [ix(i,1):ix(i,2)].'];end % One block
Then build the whole thing from repeating the above for the number of blocks in a file
>> ix=ig; % initialize to the first group
>> for i=1:L/N-1 % loop count from 2:L/N
ix=[ix; (i*170)+ig]; % 1:L/N-1 instead of (i-1) as multiplier
end % add the group plus offset and concatenate
Now use ix as the index vector to delete those lines as shown previously. Again, be sure to have a backup while you double-check your counts, etc., before you overwrite the raw data files!!! :)
Another sanity check...
>> L-ix(end)
ans =
23
>> 170-147
ans =
23
Lookin' good... :)
I gotta' run...good luck!
ADDENDUM:
L as above should match length(file), btw as the verification of the counting...
ADDENDUM 2:
Just as a sidepoint, the multiplications can be done away with, also...
for i=2:L/N % loop count from 2:L/N
ig=ig+170; % add the offset
ix=[ix; ig];
end
To make the script simpler to adapt to other files, move the 170 constant also to a variable that you can set at the top--then you change only those constants that define the file structure and you're done for any other similarly-constructed files.
And to look ahead a little, next you'll be looking for the answer at the FAQ --
:)
ADDENDUM 3 and (hopefully) final:
Not to be outdone by Cedric ( :) ), the vectorized solution for building the deletion index array --
Given the above ix array of unwanted lines and the block size N and file length L--
ig=cell2mat(arrayfun(@colon,ix(:,1),ix(:,2),'uniformoutput',false).').';
ix= bsxfun(@plus,N*[0:L/N-1],repmat(ig,1,L/N)); ix=ix(:);
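As a quick sanity check of those two lines, one can run them on a toy case, e.g. blocks of N = 5 lines in a 15-line file, deleting rows 2 and 4 of each block:

```matlab
ix = [2 2; 4 4] ;   % toy deletion ranges: rows 2-2 and 4-4 of each block
N  = 5 ;            % block length
L  = 15 ;           % file length (3 blocks)
ig = cell2mat(arrayfun(@colon,ix(:,1),ix(:,2),'uniformoutput',false).').' ;
ix = bsxfun(@plus,N*[0:L/N-1],repmat(ig,1,L/N)) ; ix = ix(:) ;
% ix is now [2 4 7 9 12 14].' -- rows 2 and 4 of each 5-line block
```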

4 Comments

I am absolutely amazed by the obviously unlimited possibilities real matlab-competence offers. This is great motivation for learning, and great help for working.
I used to hate matlab, now I just wanna learn it - and hug you guys. Next time I'm asking a question here it's going to be an educated one.
I am running out of thanks soon.
THANKS.
No problem, it's generally entertainment for me... my "day job" is now back on the family farm, having left the consulting gig behind, so my "keeping a hand in" is here...
Particularly when someone gets some good out of it and is appreciative, it makes it worthwhile...
I'd be interested to know if you've found a use for MATLAB on your farm. For example, to control a weather station, or to see if the animals are back in the barn yet, or something. Maybe interfacing an Arduino...
I've not to date, other than somewhat superficially, although I had some ideas of it when TMW generously comp'ed the upgraded version; I've not actually done anything along those lines.
There's an opportunity there, I think, for even more integration of the various data sources in the future. Besides the increased size of the typical operation, which is simply scaling, the biggest difference between when I left for college and the off-farm career in the mid-60s and when I returned is the amazing use of technology in everything: from GPS auto-steer and tracking, to yield monitors, to planters that can place an individual seed to within 1/8" spacing for precise planting rates, as well as controlling side dressings and fertilizers/pesticides/herbicides at rates tied to soil conditions and other field topographical features. I've just not taken the time to do it outside the available features in the vendor-supplied software/firmware interfaces.


Asked on 10 Jun 2014
Commented on 16 Jun 2014
