Remove elements appearing sequentially in a larger text.
Hello.
I just started working as an engineer and was recently given the tedious task of editing these files by hand. I figure this is something that can be done in MATLAB, but the brief course I took during my studies left me with only the most basic skills (if that).
The reason I believe this should be rather easy is that the data is sequentially arranged, with each sequence being about two pages long and identical in form. No rows need to be partially edited, so this is the simplified case, with the data on the left and the edited output on the right:
1 a 1 a
2 b 2 c
3 c 3 a
4 a 4 c
5 b
6 c
[...]
This pattern repeats a couple of hundred times, so some kind of loop has to be implemented if this is going to be quicker than just cutting and pasting.
The data contains numbers, characters, and tables.
Thanks,
Tord
4 Comments
Cedric
on 10 Jun 2014
Could you attach one of these files or copy/paste a chunk of its content, and indicate clearly what you want to eliminate?
Tord
on 10 Jun 2014
If all blocks have the same length, same number of characters, etc., you can remove lines periodically with a fixed period. If blocks can vary a bit in length, you either have to analyze the file line by line and make a decision for each line, or perform pattern matching and replacement.
For pattern matching, see my answer.
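The fixed-period case can be sketched as follows; the block length and unwanted positions below are made-up placeholders to be adjusted to the real file:

```matlab
% Read the file into a cell array of lines.
lines = textread( 'tord_1.txt', '%s', 'delimiter', '\n', 'whitespace', '' ) ;
blockLen = 6 ;           % length of one repeating block (placeholder)
unwanted = [2, 5, 6] ;   % positions to drop within each block (placeholder)
% Build absolute line numbers of unwanted lines, for every block.
offsets = 0 : blockLen : numel( lines ) - 1 ;
dropIDs = bsxfun( @plus, unwanted(:), offsets ) ;
lines( dropIDs(dropIDs <= numel( lines )) ) = [] ;
```

After this, `lines` contains only the kept rows and can be written back out with fprintf.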
Accepted Answer
More Answers (2)
Cedric
on 10 Jun 2014
Here is an example using pattern matching and replacement.
% - Get and modify content.
fName = 'tord_1.txt' ;
content = fileread( fName ) ;
content = regexprep( content, 'Name:[^\n]*\n', '' ) ;
content = regexprep( content, 'Circle index:[^\n]*\n', '' ) ;
% - Output modified version to file.
[fPath, fBase, fExt] = fileparts( fName ) ;
fId = fopen( fullfile( fPath, [fBase, '_modified', fExt] ), 'w' ) ;
fwrite( fId, content ) ;
fclose( fId ) ;
12 Comments
Tord
on 10 Jun 2014
You'll have to figure out on your own whether dpb's solution works. The only thing I can say is that solutions along this line assume regularity in the length (in number of lines) of the blocks that you want to remove. If this regularity is present, it is the best solution. If not, you'll either have to process the file line by line and implement some logic to determine whether each line has to be kept, or use pattern matching. If you don't know regular expressions, I can help you with the pattern, but I'll need to know much more about the content.
dpb
on 11 Jun 2014
...if I could remove every [1-49]th, [77-85]th, [106-114]th, and [141-147]th line.
The difference between the first line of each group is 76, 29, and 35. What's the difference between this last group and the first of the next group, or does the next set begin at line 148?
Knowing that, one can write a recurrence relation...
OK, so there is actually no regularity (if I understand correctly). If it varies within a file but not across files (meaning that in each file the corresponding blocks have matching lengths), dpb's solution can be updated a little with your ranges entered by hand, and then applied very efficiently to all files. If lengths don't match across files, I would advise trying a regexp-based approach, unless you want to code the logic required for deciding line by line whether to keep or eliminate each line.
If you want to try the regexp, I need to know, say, at least 10 to 20 characters around the start and the end of a typical block (the more the better), meaning e.g. 10 before and 10 after the start, and 10 before and 10 after the end. I also need to know the type of data present in each block: numbers only? mixed numbers and text? comma separated? I don't need the data or the rest of the file if the beginning and end of the blocks that you want to eliminate are specific enough.
If you cannot share the file on a public forum but could share it with me, don't hesitate to email it to me by the way.
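To make the idea concrete, a block bounded by known text can be removed in one pass; the markers 'BEGIN SHAPE' and 'END SHAPE' below are purely hypothetical stand-ins for whatever actually delimits the blocks in the real file:

```matlab
content = fileread( 'tord_1.txt' ) ;
% '.*?' matches non-greedily, so each block is removed separately
% rather than everything between the first start and the last end.
content = regexprep( content, 'BEGIN SHAPE.*?END SHAPE\n', '' ) ;
```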
Tord
on 11 Jun 2014
dpb
on 11 Jun 2014
OK, I understood (and presumed) the file was regular; the additional piece of information needed was that the size of the repeating section is 170 lines. I've got a meeting to get ready for first; I'll try to get back before then, but it may be afternoon before I can actually do anything.
But the idea is to set a counter from these start/stop positions and the block size, and then just build the groups with that multiple of 170 lines as an offset for each succeeding block, until the next block would be past the total length of the file.
EDIT: I hadn't refreshed the page since earlier today, I guess, and I just saw that dpb wrote another solution, so you now have two ways of building indices ;-)
Well, I'm glad it helps! I understand better after reading your last comment. dpb's solution will work well in this case. As it generates indices of lines to keep or to remove, you cannot have content that varies in length. For example, it wouldn't work in the following situation, where I indicate line numbers on the left:
1 Block1
2 Name = Circle
3 Data: 1 2 3
4 4 5 6
5 Block2
6 Name = Rectangle
7 Data: 9 8 7
8 6 5 4
9 3 2 1
10 Block3
11 Name = Oval
12 Data: 1 2 3
if shapes could be listed in any order or have data of random length, because, as you can see, the data blocks vary in length in terms of number of lines. In such a case, you cannot know the order a priori and you cannot define where the relevant lines will be. This would leave us with scanning line by line and making a decision (keep or discard) for each line, or with pattern matching.
Now your situation appears to be that you know you will have consecutive blocks of exactly 170 lines each, and that within each block you need lines 50 to 76, etc. So there is no variation in the length of, e.g., numeric arrays internal to each block, and you have regularity among blocks.
So we can generate the IDs of relevant lines for block 1 as follows:
lineIDs = [50:76, 86:105, 115:140] ;
and the question that remains is how to repeat that at a 170-line interval until the end of the file. dpb did the following to get the lines/rows:
file = textread('yourfile', '%s', 'delimiter', '\n', 'whitespace', '');
nr = length(file);
so we can use nr to build a vector of offsets in increments of 170 until the end of the file, as follows (note that it may be 171 in your case; you'll have to check):
steps = 0 : 170 : nr ;
then we can use it to create line IDs for the whole file:
lineIDs = repmat( lineIDs, length( steps ), 1 ) + ...
repmat( steps(:), 1, length( lineIDs ) ) ;
I'll let you run that on a smaller example to see how it works. Say you want to repeat 2, 3, 5 at a 10-line interval and you have 29 lines total:
nr = 29 ;
lineIDs = [2,3,5] ;
steps = 0 : 10 : 29 ;
lineIDs = repmat( lineIDs, length( steps ), 1 ) + ...
repmat( steps(:), 1, length( lineIDs ) ) ;
Running this gives
>> lineIDs
lineIDs =
2 3 5
12 13 15
22 23 25
Once we have this array, we can transform it to get a vector of line IDs:
lineIDs = reshape( lineIDs.', 1, [] ) ;
Applied to our previous small example we get
>> lineIDs
lineIDs =
2 3 5 12 13 15 22 23 25
And we finish as explained by dpb (but in our case we keep the lines instead of removing them):
file = file(lineIDs) ;
fid = fopen('file1.txt','w');
for i = 1:length(file), fprintf(fid,'%s\n',file{i}); end
fclose(fid);
Or we can reuse part of my solution to build a new file name based on the original:
fName = 'tord_1.txt' ;
.. read/process ..
[fPath, fBase, fExt] = fileparts( fName ) ;
fId = fopen( fullfile( fPath, [fBase, '_modified', fExt] ), 'w' ) ;
for k = 1 : length(file), fprintf( fId , '%s\n', file{k} ); end
fclose( fId ) ;
Tord
on 12 Jun 2014
dpb
on 12 Jun 2014
I'm not sure I'd give up entirely (yet, anyway). And don't feel bad; we've all had such an experience (or worse), and particularly early in one's career it's not unusual to be humbled. I'll quote the tag line from a longtime poster and world-class expert on the Fortran newsgroup I also frequent:
"Good judgement comes from experience. Experience comes from bad judgement."
~ Mark Twain
Anyway, I'd suggest looking to see if perhaps there's a way to sorta' combine the two ideas: find out whether there is a way to detect when this "off by one" count occurs so that you can compensate for it. That might take parsing some of the content in that section, or perhaps, when you get to that section, doing some line-counting to find the next location and then fixing up the indices on that basis.
I was going to point out that another way to handle the I/O would be to use textscan or similar with the 'HeaderLines' parameter to skip a counted group of lines, then read (and copy) a group, and repeat, instead of holding the whole file in memory. If you need to do a search-and-destroy mission like this, that may be a better approach: read the first group, check all is still well, then read the next and repeat. When/if you find that "off by one" issue, it's simpler, perhaps, to fix it there than globally.
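That chunked approach can be sketched using the block layout from this thread (170-line blocks, keeping lines 50-76, 86-105, and 115-140 of each); the counts should be verified against the real file:

```matlab
fIn  = fopen( 'tord_1.txt', 'r' ) ;
fOut = fopen( 'tord_1_modified.txt', 'w' ) ;
keepCounts = [27, 20, 26] ;   % lengths of kept ranges 50:76, 86:105, 115:140
skipCounts = [49,  9,  9] ;   % unwanted lines preceding each kept range
while ~feof( fIn )
    for k = 1 : numel( keepCounts )
        % Skip the unwanted lines, then read the next kept group.
        chunk = textscan( fIn, '%s', keepCounts(k), 'Delimiter', '\n', ...
                          'Whitespace', '', 'HeaderLines', skipCounts(k) ) ;
        fprintf( fOut, '%s\n', chunk{1}{:} ) ;
        % A sanity check on chunk{1} here is where an "off by one"
        % shift could be detected and compensated for.
    end
    textscan( fIn, '%s', 30, 'Delimiter', '\n' ) ;   % discard block tail 141:170
end
fclose( fIn ) ; fclose( fOut ) ;
```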
If, as it sounds, this is going to be a recurring issue for your employer, you could be doing a big service by figuring out a way to automate this while you do the mundane part. Or, depending on how these files are generated, perhaps you can change the upstream process so the files are no longer created in a form that can't be parsed automatically, by showing how they could be formatted instead so that the script works reliably.
Cedric
on 12 Jun 2014
Hey Tord, don't be sorry, we most likely would have done the same thing (assuming regularity until we observe a shift), and that is how we learn after all ;-)
I'll go on with pattern matching by email.
Tord
on 16 Jun 2014
OK, try this... This is a "dead-ahead" looping solution to build the vector from the information provided. It can be made to look "more MATLAB-y", but this is what I could do before my meeting...
Starting with your block definitions and the overall length of the repetitive section...
>> ix=[1 49; 77 85; 106 114; 141 147] % the sections to remove
ix =
1 49
77 85
106 114
141 147
>> N=170; % the overall block length
>> L=42000;
Following is a sanity check to compare lengths against your given total...
>> ceil(L/N)
ans =
248
>> 248*N
ans =
42160
>> L=ans; % sanity check I did on overall lengths
The above looks right, I presume???
Anyway, back to the building of an overall deletion index...
>> ig=[];for i=1:size(ix,1),ig=[ig; [ix(i,1):ix(i,2)].'];end % One block
Then build the whole thing by repeating the above for the number of blocks in a file:
>> ix=ig; % initialize to the first group
>> for i=1:L/N-1 % one iteration per remaining block
ix=[ix; (i*170)+ig]; % add the group plus its offset and concatenate
end
Now use ix as the index vector to delete those lines as shown previously. Again, be sure to have a backup while you double-check your counts, etc., before you overwrite the raw data files!!! :)
Another sanity check...
>> L-ix(end)
ans =
23
>> 170-147
ans =
23
Lookin' good... :)
I gotta' run...good luck!
ADDENDUM:
L as above should match length(file), btw, as a verification of the counting...
ADDENDUM 2:
Just as a sidepoint, the multiplications can be done away with, also...
for i=2:L/N % loop count from 2:L/N
ig=ig+170; % add the offset
ix=[ix; ig];
end
To make the script simpler to adapt to other files, move the 170 constant also to a variable that you can set at the top--then you change only those constants that define the file structure and you're done for any other similarly-constructed files.
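Following that advice, the loop above can be parameterized so that only the constants at the top change for a new file layout (a sketch; L comes from length(file) as read earlier in the thread):

```matlab
% File-structure parameters: the only lines to edit for a new layout.
N  = 170 ;                               % lines per repeating block
ix = [1 49; 77 85; 106 114; 141 147] ;   % ranges to remove within one block
L  = length(file) ;                      % total number of lines read earlier

% Expand the ranges for the first block ...
ig = [] ;
for i = 1:size(ix,1), ig = [ig; (ix(i,1):ix(i,2)).'] ; end
% ... then shift by N for each succeeding block.
del = ig ;
for i = 1:L/N-1
    ig  = ig + N ;        % N replaces the hard-coded 170
    del = [del; ig] ;
end
```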
And to look ahead a little, next you'll be looking for the answer at the FAQ --
:)
ADDENDUM 3 and (hopefully) final:
Not to be outdone by Cedric ( :) ), here is the vectorized solution for building the deletion index array --
Given the above ix array of unwanted lines and the block size N and file length L--
ig=cell2mat(arrayfun(@colon,ix(:,1),ix(:,2),'uniformoutput',false).').';
ix= bsxfun(@plus,N*[0:L/N-1],repmat(ig,1,L/N)); ix=ix(:);
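On newer MATLAB releases (R2016b and later), implicit expansion lets you drop both the bsxfun and the repmat from that last line:

```matlab
ix = ig + N*(0:L/N-1) ;   % column vector + row vector expands to a matrix
ix = ix(:) ;              % flatten column-major so blocks stay in order
```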
4 Comments
Tord
on 11 Jun 2014
dpb
on 11 Jun 2014
No problem, it's generally entertainment for me...my "day job" is now back on the family farm having left the consulting gig behind so my "keeping a hand in" is here...
Particularly when someone does get some good and is appreciative makes it worthwhile...
Image Analyst
on 14 Jun 2014
I'd be interested to know if you've found a use for MATLAB on your farm, for example to control a weather station or to see if the animals are back in the barn yet or something. Maybe interfaced with an Arduino...
dpb
on 14 Jun 2014
I've not to date, other than somewhat superficially, although I had some ideas of it when TMW generously comped the upgraded version; but I've not actually done anything along those lines.
There's an opportunity there, I think, for the future: even more integration of the various data sources. The biggest difference between when I left for college and the off-farm career in the mid-60s and when I returned, besides just the increased size of the typical operation (which is simply scaling), is the amazing use of technology in everything from GPS auto-steer and tracking to yield monitors and planters that can place individual seeds to within 1/8" for precise planting rates, as well as control side dressings and fertilizers/pesticides/herbicides at a rate tied to soil conditions and other field topographical features. I've just not taken the time to do it outside the available features in the vendor-supplied software/firmware interfaces.