Find and Replace Overlapping Substrings

Hello,
I want to find a set of substrings (19 to 24 characters long, made up of 'ACGT', i.e. DNA sequences) in a longer string (template DNA) and replace each of them with '*' characters, one '*' per character of the substring. I have the following code.
%"template" is an 8x1 cell array with the original DNA sequence data (around 1800 chars each). To keep the example small, I only go through the first cell.
%"substring" is e.g. a 50x2 cell array, with column 1 = substring and column 2 = length of the substring.
%"substituted_seq" is an 8x1 cell array with the replaced sequence (substrings substituted by '*')
%
substituted_seq{1,1} = strrep(template{1,1},substring{1,1},'*');
for j=1:size(substring,1)
substituted_seq{1,1} = strrep(substituted_seq{1,1},substring{j,1},'*');
end
The first problem is that these substrings overlap with each other. Once I have replaced the first substring with '*', the next substring (which overlaps the first) no longer occurs intact in the modified sequence, so this code does not replace it.
Second: I also couldn't figure out how to replace a substring such as 'ACGTCG' with the same number of '*' (in this example '******').
I would be very grateful for any help. Thanks!

Accepted Answer

Robert Cumming on 30 Aug 2012
I would create a logical flag vector with the same length as your string. Then run through all your substrings, always searching the original (unmodified) string, and set the flag to true for the characters that are to be replaced with *. This eliminates the problem of overlapping.
Once that is all done, you replace every character in your string whose flag is true.
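The flag idea above could be sketched roughly as follows. This is only a minimal illustration, not tested against the real data: the values in `template` and `substring` are made up for the example, and the variable name `mask` is my own. It relies on `strfind` reporting every start position of a match, including overlapping ones.

```matlab
% Hypothetical example data (not from the original post).
template  = {'TTACGTCGATT'};
substring = {'ACGTCG', 6; 'TCGAT', 5};   % these two overlap inside the template

seq  = template{1,1};
mask = false(1, numel(seq));             % one logical flag per character

for j = 1:size(substring,1)
    sub    = substring{j,1};
    starts = strfind(seq, sub);          % always search the unmodified sequence
    for s = starts                       % zero iterations if nothing was found
        mask(s : s + numel(sub) - 1) = true;
    end
end

seq(mask) = '*';                         % replace all flagged characters at once
substituted_seq = template;
substituted_seq{1,1} = seq;              % 'TT********T' for the data above
```

Because the sequence is only modified after all substrings have been located, the order of the substrings does not matter, and each flagged character becomes exactly one '*', so the length of the sequence is preserved.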
For your second issue, something like:
flag = 'CC';
key = regexprep ( 'CC', '.', '*' );
regexprep ( 'ABCCCCCDEFG', flag, key )
ans =
AB****CDEFG
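As a possible alternative for building the replacement key, `repmat` can produce a run of '*' of the required length without going through a regular expression. A small sketch, with made-up values for `str` and `sub` and a single, non-overlapping occurrence:

```matlab
str = 'TTACGTCGAA';                    % hypothetical template fragment
sub = 'ACGTCG';                        % hypothetical substring to mask
key = repmat('*', 1, numel(sub));      % '******', same length as sub
masked = strrep(str, sub, key);        % 'TT******AA'
```

This only covers the equal-length replacement. When substrings overlap, the flag approach is still needed.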
