Remove elements appearing sequentially in a larger text.

Hello.
I just started working as an engineer and was recently given the tedious task of editing data files. I figure this is something that can be done in MATLAB, but the brief course I took during my studies left me with only the most basic knowledge (if that).
The reason I believe this should be rather easy is that the data is sequentially arranged, with each sequence being about two pages long and identical in form. No rows are to be partially edited, so this would be the simplified case, with the data on the left and the edited output on the right:
1 a 1 a
2 b 2 c
3 c 3 a
4 a 4 c
5 b
6 c
[...]
This pattern repeats itself a couple of hundred times, so some kind of loop has to be implemented if this is going to be quicker than just cut and paste.
The data contains numbers, characters, and tables.
Thanks,
Tord

4 Comments

Tord on 10 Jun 2014
Edited: Tord on 10 Jun 2014
My rows and columns collapsed above. The point is that I want to remove, e.g., every third and seventh row in these datasets. No calculations, functions, etc. are to be done.
Could you attach one of these files or copy/paste the a chunk of its content, and indicate clearly what you want to eliminate?
This being classified data and me being a new employee, no. But I can illustrate for you:
[start] Analysis nr: 1234
Name: Example
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Circle index: 1111
[end] - then repeat hundreds of times.
And in every one of them I want to remove, e.g., "Name: Example" and "Circle index: 1111" (random selection).
With dpb's answer I guess it would read: nr=length(file); ix=unique([[1:2:nr] [1:6:nr]]); file(ix)=[];
I suddenly started to wonder if this is all I need. I will try.
If all blocks have same length, same number of characters, etc, you can remove periodically lines with a fixed period. If blocks can vary a bit in length, you either have to analyze line by line and take a decision or perform pattern matching and replacement.
For pattern matching, see my answer.


 Accepted Answer

dpb on 10 Jun 2014
Edited: dpb on 10 Jun 2014
...Point is that I want to remove i.e every third and seventh row in these datasets...
If the data are regular in line location(s) (that is, don't have to search for a pattern to locate sections), then it's pretty simple --
A) read the file into a cell array of character data--
file = textread('yourfile', '%s', 'delimiter', '\n', 'whitespace', '');
B) delete the lines not wanted... I'm not positive of the precise definition of "every third and seventh row", but assuming it's the joint combination of [1:2:end] and [1:6:end], then
nr=length(file); % number rows in file
ix=unique([[1:2:nr] [1:6:nr]]); % selected rows to delete
file(ix)=[]; % remove the rows unwanted
C) rewrite to a file -- NB: either create a backup first or be sure to create a new copy on writing while debugging!!!
You can do it in a single step if you can define a rule for any arbitrary set of lines to be deleted that is fixed in relationship to the beginning of the file, no matter how complex that rule might be.
ADDENDUM
Following your example of a file, I made a local file of some number of repetitions of same...
>> file=textread('file.txt', '%s', 'delimiter', '\n', 'whitespace','')
file =
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
'Analysis nr: 1234'
'Name: Example'
'Center of circle is blue. Radius of circle = 10'
'Curve; [table of numbers]'
'Circle index: 1111'
>> nr=length(file);
>> ix=sort([[2:5:nr] [5:5:nr]]); % no unique; this pattern has no overlap
>> file(ix)=[];
>> fid=fopen('file1.txt','w');
>> for i=1:length(file),fprintf(fid,'%s\n',file{i});end
>> fid=fclose(fid);
>> type file1.txt
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
Analysis nr: 1234
Center of circle is blue. Radius of circle = 10
Curve; [table of numbers]
>>
Voila! Joy ensues... :)

4 Comments

Thank you, I will soon return.
Thank you once again!
In fact I had already started to worry about both overlap and the output itself. I am truly impressed with you and Cedric's knowledge - not to mention the will to share it.
I would not have asked if I hadn't googled what I could beforehand, and I really would have had to do this the stupid way (mark-delete-repeat) for almost a day without the help.
THANKS
There's nothing about the overlap that's a problem as long as the indices aren't duplicated, which would erroneously remove a row at the wrong location. The invocation of unique was mostly a nicety to remove duplicates and, as a corollary, to sort the index array, which should help runtime. In effect the net result is the same either way; it just looks cleaner with it than without.
If you do need the pattern-matching solution, Cedric's the undoubted wizard on regular expressions while I'm a novice there... but for the "deadahead" case that you seem to have, this is by far the quicker.
If it does solve the problem, please go ahead and Accept the answer so we know to close the issue.
Yes, overlap could become an issue because I noticed at least one line that did not match exactly the template I used (small construction difference).
I just tried, and failed, with the unique function implemented. I will try some more and then accept it regardless of outcome, I understand that this is below what you guys want to use your time on.
Once again, thanks.


More Answers (2)

Here is an example using pattern matching and replacement..
% - Get and modify content.
fName = 'tord_1.txt' ;
content = fileread( fName ) ;
content = regexprep( content, 'Name:[^\n]*\n', '' ) ;
content = regexprep( content, 'Circle index:[^\n]*\n', '' ) ;
% - Output modified version to file.
[fPath, fBase, fExt] = fileparts( fName ) ;
fId = fopen( fullfile( fPath, [fBase, '_modified', fExt] ), 'w' ) ;
fwrite( fId, content ) ;
fclose( fId ) ;

12 Comments

I think modifying by hand will take very long. The example is kind of misleading with regard to size. There are 170 rows in each sequence originally, and 100 when edited. About 40k rows total.
The fact that no rows are partially edited makes deleting by rows the obvious choice.
By listing all 70 rows like dpb described, I believe the result will be correct?
You'll have to figure out on your own if dpb's solution works. The only thing that I can say is that solutions along these lines assume regularity in the length (in number of lines) of the blocks that you want to remove. If this regularity is present, this is the best solution. If not, you'll either have to process line by line and implement some logic to determine whether each line has to be kept, or use pattern matching. If you don't know regular expressions, I can help you with the pattern, but I'll need to know way more about the content.
Tord on 10 Jun 2014
Edited: Tord on 10 Jun 2014
As far as I can see I would be there if I could remove every: [1-49]th, [77-85]th, [106-114]th, [141-147]th. Total 42000. I'm about to write them all one by one, but if you have a smarter move, please share, hehe.
..if I could remove every: [1-49]th, [77-85]th, [106-114]th, [141-147]th.
The difference between the first of each group is 76, 29, and 35. What's the difference between this last group of four and the first of the next group or is the beginning of the next set 148?
Knowing that, one can write a recursion relation...
OK, so there is actually no regularity (if I understand correctly). If it varies within a file but not across files (which means that in each file corresponding blocks have matching lengths), dpb's solution can be updated a little with your ranges entered by hand, and then applied very efficiently to all files. If you don't have matching lengths across files, I would advise you to try a regexp-based approach, unless you want to code the logic required for testing line by line whether to keep the line or eliminate it.
If you want to try the regexp, I need to know exactly, say, at least 10 to 20 characters around the start and the end of a typical block (the more the better), which means e.g. 10 before and 10 after the start, and 10 before and 10 after the end. I also need to know the type of data present in each block: numbers only? Mixed numbers and text? Comma separated? I don't need the data or the rest of the file if the beginning and the end of the blocks that you want to eliminate are specific enough.
If you cannot share the file on a public forum but could share it with me, don't hesitate to email it to me by the way.
I don't know if I expressed myself unambiguously. What I mean is that the text consists of blocks of identical structure, each 170 lines in length. Of these, rows [50-76], [86-105], [115-140] are needed, thus making the aforementioned rows redundant. These 170 rows repeat themselves with only different numbers, totalling a little more than 42000.
As you have noticed, this is not something I am experienced with, so I am not sure if you mean block as in every local block (i.e. [50-76] and [86-105]) or the "global" blocks consisting of all 170 repetitive lines? I will give you the characters you need as soon as I know which one it is. Thanks again for your help; I am truly impressed with the level of help you are giving me. This is beyond my highest expectations, and you, my good man, are giving me faith in humanity. Nothing less.
OK, I understood (and presumed) the file was regular; the additional piece of information needed was that the size of the repeating section is 170 lines. I've got a meeting to get ready for first; I'll try to get back before then, but it may be afternoon before I can actually do anything.
But the idea is to set a counter from these start/stop positions and the block size and then just build the groups with that multiple of 170 lines as an offset for each succeeding block, until the next block would be past the total length of the file.
EDIT: I hadn't refreshed the page I guess since earlier today, and I just see that dpb wrote another solution, so you have two ways for building indices now ;-)
Well, I'm glad it helps! I understand better after reading your last comment. dpb's solution will work well in this case. As it is generating indices of lines to keep or to remove, you cannot have content which varies in length. For example, it wouldn't work in the following situation, where I indicate line numbers on the left:
1 Block1
2 Name = Circle
3 Data: 1 2 3
4 4 5 6
5 Block2
6 Name = Rectangle
7 Data: 9 8 7
8 6 5 4
9 3 2 1
10 Block3
11 Name = Oval
12 Data: 1 2 3
if shapes could be listed in any order or have data with random length, because, as you can see, the data have a varying length in terms of number of lines. In such a case, you cannot know a priori the order and you cannot define where relevant lines will be. This would leave us with scanning line by line and taking a decision (keep or discard) for each line, or with pattern matching.
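If line-by-line scanning were needed, a minimal sketch could look like the following (the 'Name:' and 'Circle index:' prefixes are assumptions taken from the example block shown earlier in the thread; adapt them to whatever actually marks an unwanted line):

```matlab
% Sketch only: scan line by line, keeping lines that do not start
% with one of the unwanted prefixes. The two prefixes below are
% assumptions based on the illustration posted earlier.
file = textread('yourfile.txt', '%s', 'delimiter', '\n', 'whitespace', '');
keep = true(size(file)) ;             % logical mask of lines to keep
for k = 1 : length(file)
    line = file{k} ;
    if strncmp(line, 'Name:', 5) || strncmp(line, 'Circle index:', 13)
        keep(k) = false ;             % discard this line
    end
end
file = file(keep) ;                   % only the wanted lines remain
```

The advantage of this approach is that it does not care how long each block is; the cost is that every unwanted line must be recognizable from its content alone.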
Now your situation appears to be that you know exactly that you will have consecutive blocks with exactly 170 lines each, and that within each block you need the 50th to the 76th lines, etc. So there is no variation in the length of e.g. numeric arrays internal to each block, and you have regularity among blocks.
So we can generate the IDs of relevant lines for block 1 as follows:
lineIDs = [50:76, 86:105, 115:140] ;
and the question that remains is how to repeat that with a 170-line interval until the end of the file. dpb did the following to get lines/rows:
file = textread('yourfile', '%s', 'delimiter', '\n', 'whitespace', '');
nr = length(file);
so we can use nr to build a vector of increments of 170 until the end of the file, as follows (note that it may be 171 in your case; you'll have to check):
steps = 0 : 170 : nr ;
then we can use it to create line IDs for the whole file:
lineIDs = repmat( lineIDs, length( steps ), 1 ) + ...
repmat( steps(:), 1, length( lineIDs ) ) ;
I'll let you run that on a smaller example to see how it works, e.g. you want to repeat lines 2, 3, 5 with a 10-line interval and you have 29 lines total:
nr = 29 ;
lineIDs = [2,3,5] ;
steps = 0 : 10 : 29 ;
lineIDs = repmat( lineIDs, length( steps ), 1 ) + ...
repmat( steps(:), 1, length( lineIDs ) ) ;
Running this gives
>> lineIDs
lineIDs =
2 3 5
12 13 15
22 23 25
Once we have this array, we can transform it to get a vector of line IDs:
lineIDs = reshape( lineIDs.', 1, [] ) ;
Applied to our previous small example we get
>> lineIDs
lineIDs =
2 3 5 12 13 15 22 23 25
And we finish as explained by dpb (but in our case we keep the lines instead of removing them):
file = file(lineIDs) ;
fid=fopen('file1.txt','w');
for i=1:length(file),fprintf(fid,'%s\n',file{i});end
fid=fclose(fid);
Or we can reuse part of my solution to build a new file name based on the original..
fName = 'tord_1.txt' ;
.. read/process ..
[fPath, fBase, fExt] = fileparts( fName ) ;
fId = fopen( fullfile( fPath, [fBase, '_modified', fExt] ), 'w' ) ;
for k = 1 : length(file), fprintf( fId , '%s\n', file{k} ); end
fclose( fId ) ;
I am writing here because you have both commented and will receive a notification on this. Most of all I wanted to sink into the ground and never enter this forum again, but that would make me an even bigger jackass...
What I found out last night was that there is a small variation in the number of lines after a couple of thousand. This variation is pretty much symmetrical, like 169 - 170 - 171 - 170, and thus the sum always added up. I actually managed to get the script working with the most basic approach; that was when I saw the domino effect the odd lines here and there had on the total. In other words, the fixed-length approach is useless. I have been copy-pasting all day, and will be doing it for the rest of the night..
I cannot describe how sorry I am and how stupid I feel. You guys have done so much to help out, and I screwed it up by defining the premises sloppily. I wish there was something I could do to give back the priceless contribution you guys have given, both solving and explaining to me. The latter is still valid though, and once again - thanks. At least you recruited a MATLAB fan..
This task needs to be done with every new project, so I will use what I've learned so far for what it's worth and see where it goes. Maybe I dare ask a question again some time in the future when my competence is at a decent level..
I really don't know what to say other than that I am truly very, very sorry and even more thankful.
I'm not sure I'd give up entirely (yet, anyway). And don't feel bad; we've all had such an experience (or worse), and particularly early in one's career it's not unusual to be humbled. I'll quote the tag line from a longtime poster/world-class expert on the Fortran newsgroup I also frequent --
"Good judgement comes from experience. Experience comes from bad judgement."
~ Mark Twain
Anyway, I'd suggest looking to see if perhaps there's a way to sort of combine the two ideas -- find out if there is a way to discover when this "off by one" count occurs so that you can then compensate for it. That might take doing some parsing of the content in that section, or perhaps, when you get to that section, doing some line-counting to find the next location and then fixing up the indices on that basis.
I was also going to point out that another way to handle the I/O would be to use textscan or similar with the 'headerlines' parameter to skip a counted group of lines, then read (and copy) a group and repeat, instead of holding the whole file in memory. If you need to do a search-and-destroy mission like this, that may be a better approach: read the first group, check all is still well, then read the next and repeat. When/if you find that "off by one" issue, it's simpler, perhaps, to fix it there than globally.
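One way to sketch that resynchronization idea: instead of assuming a fixed 170-line period, locate the start of each block first and build the deletion indices relative to each start, so an off-by-one block doesn't shift everything after it. The 'Analysis nr:' marker below is an assumption taken from the example posted earlier; use whatever line reliably begins each block in the real files:

```matlab
% Sketch only: resynchronize on block starts rather than a fixed period.
% 'Analysis nr:' as the block marker is an assumption from the earlier example.
file   = textread('yourfile.txt', '%s', 'delimiter', '\n', 'whitespace', '');
starts = find(strncmp(file, 'Analysis nr:', 12)) ;   % first line of each block
ig     = [1:49, 77:85, 106:114, 141:147] ;           % per-block offsets to delete
ix     = [] ;
for k = 1 : length(starts)
    ix = [ix, starts(k) - 1 + ig] ;                  % indices relative to block start
end
ix(ix > length(file)) = [] ;                         % guard against a short last block
file(ix) = [] ;                                      % remove the unwanted rows
```

Note this still assumes the offsets to delete are the same within every block; only the block spacing is allowed to drift.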
If, as it sounds like, this is going to be a recurring issue for your employer, you could well be doing a big service by figuring out a way to automate this as you do the mundane part. Or, depending on how these files are generated, perhaps you can make some changes upstream in the process so the files stop being created in a form that can't be automatically parsed, by showing how they could be formatted instead to allow the script to work reliably.
Hey Tord, don't be sorry, we most likely would have done the same thing (assuming regularity until we observe a shift), and that is how we learn after all ;-)
I'll go on with pattern matching by email.
You both being so understanding means a great deal to me, thank you both.
As I told Cedric by mail, I have not been connected during the weekend and now I have to prioritize other tasks at work. I sat up the whole night before Friday editing the text by hand.
But this task needs to be done every now and then, thus I will continue working on this so it will be ready by then.
I will look into this later today and keep you all posted on the progress - both what is being done and to what degree I actually understand it.


dpb on 11 Jun 2014
Edited: dpb on 11 Jun 2014
OK, try this... this is a "deadahead" looping solution to build the vector from the information provided -- it can be made to look "more Matlaby", but this is what I could do before my meeting...
Starting with your block definitions and the overall length of the repetitive section...
>> ix=[1 49; 77 85; 106 114; 141 147] % the sections to remove
ix =
1 49
77 85
106 114
141 147
>> N=170; % the overall block length
>> L=42000;
Following is a sanity check to compare lengths to your given ...
>> ceil(L/N)
ans =
248
>> 248*N
ans =
42160
>> L=ans; % sanity check I did on overall lengths
The above look right I presume???
Anyway, back to the building of an overall deletion index...
>> ig=[];for i=1:size(ix,1),ig=[ig; [ix(i,1):ix(i,2)].'];end % One block
Then build the whole thing from repeating the above for the number of blocks in a file
>> ix=ig; % initialize to the first group
>> for i=1:L/N-1 % loop count from 2:L/N
ix=[ix; (i*170)+ig]; % 1:L/N-1 instead of (i-1) as multiplier
end % add the group plus offset and concatenate
Now use ix as the index vector to delete those lines as shown previously. Again, be sure to have a backup while you double-check your counts, etc., before you overwrite the raw data files!!! :)
Another sanity check...
>> L-ix(end)
ans =
23
>> 170-147
ans =
23
Lookin' good... :)
I gotta' run...good luck!
ADDENDUM:
L as above should match length(file), btw as the verification of the counting...
ADDENDUM 2:
Just as a sidepoint, the multiplications can be done away with, also...
for i=2:L/N % loop count from 2:L/N
ig=ig+170; % add the offset
ix=[ix; ig];
end
To make the script simpler to adapt to other files, move the 170 constant also to a variable that you can set at the top--then you change only those constants that define the file structure and you're done for any other similarly-constructed files.
And to look ahead a little, next you'll be looking for the answer at the FAQ --
:)
ADDENDUM 3 and (hopefully) final:
Not to be outdone by Cedric ( :) ), the vectorized solution for building the deletion index array --
Given the above ix array of unwanted lines and the block size N and file length L--
ig=cell2mat(arrayfun(@colon,ix(:,1),ix(:,2),'uniformoutput',false).').';
ix= bsxfun(@plus,N*[0:L/N-1],repmat(ig,1,L/N)); ix=ix(:);
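As a quick sanity check of those two lines, one can run them on a toy case, e.g. blocks of N = 5 lines in a 15-line file, deleting rows 2 and 4 of each block:

```matlab
ix = [2 2; 4 4] ;   % toy deletion ranges: rows 2-2 and 4-4 of each block
N  = 5 ;            % block length
L  = 15 ;           % file length (3 blocks)
ig = cell2mat(arrayfun(@colon,ix(:,1),ix(:,2),'uniformoutput',false).').' ;
ix = bsxfun(@plus,N*[0:L/N-1],repmat(ig,1,L/N)) ; ix = ix(:) ;
% ix is now [2 4 7 9 12 14].' -- rows 2 and 4 of each 5-line block
```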

4 Comments

I am absolutely amazed by the obviously unlimited possibilities real matlab-competence offers. This is great motivation for learning, and great help for working.
I used to hate matlab, now I just wanna learn it - and hug you guys. Next time I'm asking a question here it's going to be an educated one.
I am running out of thanks soon.
THANKS.
No problem, it's generally entertainment for me... my "day job" is now back on the family farm, having left the consulting gig behind, so my "keeping a hand in" is here...
Particularly when someone gets some good out of it and is appreciative, it makes it worthwhile...
I'd be interested to know if you've found a use for MATLAB on your farm. For example, to control a weather station, or to see if the animals are back in the barn yet, or something. Maybe interfacing an Arduino...
I've not to date, other than somewhat superficially, although I had some ideas of it when TMW generously comp'ed the upgraded version; I've not actually done anything along those lines.
There's an opportunity there, I think, for even more integration of the various data sources in the future. Besides the increased size of the typical operation, which is simply scaling, the biggest difference between when I left for college and the off-farm career in the mid-60s and when I returned is the amazing use of technology in everything: from GPS auto-steer and tracking, to yield monitors, to planters that can place an individual seed to within 1/8" spacing for precise planting rates, as well as controlling side dressings and fertilizers/pesticides/herbicides at rates tied to soil conditions and other field topographical features. I've just not taken the time to do it outside the available features in the vendor-supplied software/firmware interfaces.


Asked on 10 Jun 2014
Commented on 16 Jun 2014
