Delete rows - Tanimoto matrix
Hi everyone,
With help from some of you, we wrote two scripts that delete rows from a Tanimoto matrix which we build from RDKit fingerprints.
The idea is to delete a row (= molecule) that is similar to another molecule if:
1. the Tanimoto index is between 0.7 and 1 (the scripts test >= 0.7 and < 1), and
2. the sum of all Tanimoto indexes in its row is larger than the row sum of the molecule it is compared with.
In the first solution we used a for loop with a vector called "good_ones"; rows whose entry is set to zero are deleted and are not compared again:
[M,text,alldata] = xlsread('test2.csv');
[r, c] = size(M);
S = sum(M,2); % sum of each row
good_ones = ones(1,size(M,1)); % 1 = keep, 0 = marked for deletion
% loop over rows
for row = 1:r
    for col = row+1:c
        if good_ones(row)==1 % skip rows already marked for deletion
            if (M(row,col) >= 0.7) && (M(row,col) < 1.0) % value in [0.7, 1): compare the row sums
                if S(row) > S(col) % row has the larger sum, so mark it
                    good_ones(row) = 0;
                else % col has the larger (or equal) sum, so mark that one instead
                    good_ones(col) = 0;
                end
            end
        end
    end
end
% delete the marked rows
new_M = M(find(good_ones),:);
In the second script we used the vectorized approach, but here rows that have already been compared and marked for deletion are still compared against other molecules, so we lose more molecules. So the question is: how can I add a condition here, as I did in the first script, so that only rows not already marked for deletion are checked?
An example file of the matrix is attached.
[M,text,alldata]=xlsread('test2.csv.csv'); % M is the numeric data
% text is the text :-)
% alldata is both M + text
[i j]=size(M);
S=sum(M,2);
triM = tril(M); %lower diagonal
[r, c] = find(triM >= 0.7 & triM < 1.0); %find position of all values in range
deleterow = S(r) > S(c); %compare row r with respective row c
%a true in deleterow means delete respective r, otherwise delete respective c
todelete = unique([r(deleterow) ;c(~deleterow)]);
nM = M;
nM(todelete, :) = [];
text2=text(2:i+1,1);
text2(todelete, :) = [];
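The difference between the two scripts can be seen on a tiny example. This is only a sketch: the 4-by-4 matrix below is invented for illustration (it is not taken from test2.csv), and `keptLoop` and `keptVec` are hypothetical names. Molecule 1 is similar to both 2 and 3, with S(2) < S(1) < S(3).

```matlab
% Hypothetical 4-molecule Tanimoto matrix (values invented for illustration)
M = [1.0 0.9 0.7 0.0
     0.9 1.0 0.6 0.0
     0.7 0.6 1.0 0.6
     0.0 0.0 0.6 1.0];
S = sum(M,2);                       % full row sums

% Sequential rule, as in the first script
good_ones = true(1,size(M,1));
for row = 1:size(M,1)
    for col = row+1:size(M,2)
        if good_ones(row) && M(row,col) >= 0.7 && M(row,col) < 1.0
            if S(row) > S(col)
                good_ones(row) = false;   % row is dropped and never compared again
            else
                good_ones(col) = false;
            end
        end
    end
end
keptLoop = find(good_ones);

% Vectorized rule, as in the second script: every pair is evaluated
triM = tril(M);
[r, c] = find(triM >= 0.7 & triM < 1.0);
deleterow = S(r) > S(c);
todelete = unique([r(deleterow); c(~deleterow)]);
keptVec = setdiff(1:size(M,1), todelete);
```

On this matrix the loop keeps molecules 2, 3 and 4, while the vectorized version keeps only 2 and 4: molecule 1 is already marked after meeting molecule 2, yet the pair (1,3) is still evaluated and removes molecule 3.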
7 Comments
I've not seen any previous thread that this looks related to, so I'm coming in cold, and I have absolutely no clue what a "Tanimoto" matrix is, altho I gather it's a classification statistic of some sort, so these are perhaps naive queries, but--
At least with the file test2.csv you linked to, the sum already is in the file as the last column so doesn't the
S=sum(M,2);
double it over what it should be? Or does your actual data file not include the sum column?
Also, should the sum actually be the full row sum since it is a symmetric array or should it not be the sum of the elements in the upper (or lower) triangular array? And, should the sums actually include the diagonal 1 value?
With those questions resolved, does the first script produce the desired output?
I'm sure a vectorized version can be written; I think I'd do it just slightly differently than the author of the above but similar idea; just need the actual definition of right/wrong calculation clarified before spending any time writing code. Again, don't know if I'm right or totally off-base, but seems to me you'd want
triM=tril(M,-1); % the lower triang excluding diagonal
S=sum(triM,2); % sums w/o the diagonal and not double-counting
The above assumes also that M doesn't already include a summation column.
dpb
on 8 Dec 2014
That's a start, altho I'm not sure what you mean precisely by "we want the sum of the whole row -1". If, as suggested, you pick up only the upper/lower triangle and sum that, there is no 1 included in the summation to begin with, as the diagonal elements aren't included. Or is there some other reason that one wants S-1, like it's a DOF adjustment or the like?
Altho, on reflection, it looks like the magnitude of the sums doesn't matter as it's only relative so including the 1 or not makes no difference. The doubling of the symmetry could affect the ordering, however.
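A quick way to check this is to compare the rankings directly. The sketch below uses an invented 4-by-4 matrix (not the poster's data), and the variable names are hypothetical; it only illustrates that removing the diagonal 1 leaves the ordering alone, while triangular sums are a different quantity and can reorder the rows.

```matlab
% Hypothetical symmetric Tanimoto matrix (values invented for illustration)
M = [1.0 0.9 0.7 0.0
     0.9 1.0 0.6 0.0
     0.7 0.6 1.0 0.6
     0.0 0.0 0.6 1.0];

Sfull   = sum(M,2);              % full row sums, diagonal included
SnoDiag = Sfull - diag(M);       % same sums without the self-similarity of 1
Stri    = sum(tril(M,-1),2);     % lower triangle only, diagonal excluded

% Rank the molecules by each kind of sum (ascending)
[~, orderFull]   = sort(Sfull);
[~, orderNoDiag] = sort(SnoDiag);
[~, orderTri]    = sort(Stri);

sameWithoutDiag = isequal(orderFull, orderNoDiag);
sameTriangular  = isequal(orderFull, orderTri);
```

For this matrix `sameWithoutDiag` is true and `sameTriangular` is false, so which sum is used does matter for deciding which of two similar molecules goes.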
Also, please explain more precisely the logic that produces the desired results -- I don't quite follow the reasoning behind marking the column vs the row in the looping solution. The block
if S(row)>S(col) % delete this line
    good_ones(row) = 0;
else % the sum of the j line is larger so we delete the other line
    good_ones(col) = 0;
end
doesn't make sense to me. The else clause is executed for S(row)<=S(col), not just S(row)<S(col). While it may not matter that much, there is an equality condition at least theoretically possible. What should happen then?
Actually, on yet further consideration, I think this is a case that the sequential solution via the loop may well be the optimal as you are eliminating rows from further consideration whereas the vector solution will operate over the entire array.
I'd have to think about the equivalent vectorized solution and then one would have to time the two for representative datasets to see about performance. How large a set would be typical and what kind of run times do you see?
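One possible middle ground, sketched here under assumptions rather than offered as the finished answer: find all candidate pairs in one vectorized step, then walk through them in the same order as the nested loops, skipping any pair whose row is already marked. The matrix is invented for illustration, and `pairs`, `good` and `kept` are hypothetical names.

```matlab
% Hypothetical symmetric Tanimoto matrix (values invented for illustration)
M = [1.0 0.9 0.7 0.0
     0.9 1.0 0.6 0.0
     0.7 0.6 1.0 0.6
     0.0 0.0 0.6 1.0];
S = sum(M,2);                         % full row sums, as in the original scripts

U = triu(M,1);                        % upper triangle, diagonal excluded
[row, col] = find(U >= 0.7 & U < 1.0);
pairs = sortrows([row col]);          % row-major order, like the nested loops

good = true(size(M,1),1);
for k = 1:size(pairs,1)
    i = pairs(k,1);
    j = pairs(k,2);
    if ~good(i)
        continue                      % row i is already marked: do not compare again
    end
    if S(i) > S(j)
        good(i) = false;
    else
        good(j) = false;
    end
end
kept = find(good).';
```

The loop now runs only over the pairs inside the similarity range instead of over every element, and on this example it keeps molecules 2, 3 and 4, the same as the first script. Whether it is actually faster would still need timing on representative data.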
dpb
on 9 Dec 2014
Rewriting your loop solution a little...
[r, c] = size(M);
U = triu(M,1); % upper triangle only, diagonal excluded; M itself is kept intact
S = sum(U,2); % one sum per row, so S(col) exists for every col of a square M
good = true(1,r); % use a logical for a logic test
for row=1:r-1 % the last row has nothing to its right
    for col=row+1:c
        if good(row) % test this row's flag, not the whole vector
            if U(row,col)>=0.7 % no need for range test
                if S(row)>S(col)
                    good(row)=false;
                    break % go to next row; done all need here
                else
                    good(col)=false;
                end
            end
        end
    end
end
M = M(good,:);
Shayma
on 9 Dec 2014
dpb
on 9 Dec 2014
OK, I do now grok the summation and yes, the "1" is immaterial as long as it's simply a ranking and not an absolute value.
"...them, in this case it will be the second molecule, maybe here we can add an additional condition to decide..."
In that case, change the > to >= and kill the row that lets you abort the loop earlier. It seems that one still needs to finish out the row otherwise.
I won't have time until perhaps the weekend (and can't promise that, even) to devote to trying the vector solution but given the condition of the decision process I don't think it's likely to be a case where it will save a lot of time. With the JIT compiler, loops aren't always the overhead they used to be. Unless you actually run into severe computational bottlenecks I'd suspect your best utilization of time/resources is to just "get on with it" at this point.
dpb
on 9 Dec 2014
Another possible modification that might save some time would be to test whether there are any more on the row greater than the limit on the alternative test instead of continuing the loop on a "by one" step.
Oh, alternatively, instead of the inner loop, use a logical array test on the row at once...
if ~any(M(row,:)>=0.7) % this row passes
    continue % move on to the next row (a break here would leave the row loop entirely)
else
    % deal with them here...
end
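One way the "deal with them here" part might be filled in is sketched below. It is an assumption about the intent, not the commenter's code: it reproduces the marking order of the original nested loop (columns are marked until the first one with a smaller sum, which marks the row instead), uses full row sums, and runs on an invented matrix. The names `hits` and `firstSmaller` are hypothetical.

```matlab
% Hypothetical symmetric Tanimoto matrix (values invented for illustration)
M = [1.0 0.9 0.7 0.0
     0.9 1.0 0.6 0.0
     0.7 0.6 1.0 0.6
     0.0 0.0 0.6 1.0];
S = sum(M,2);
U = triu(M,1);                          % only look to the right of the diagonal

good = true(size(M,1),1);
for row = 1:size(M,1)
    if ~good(row)
        continue                        % already marked for deletion
    end
    hits = find(U(row,:) >= 0.7 & U(row,:) < 1.0);
    if isempty(hits)
        continue                        % this row passes
    end
    % First similar column whose sum is smaller than this row's sum
    firstSmaller = find(S(hits) < S(row), 1);
    if isempty(firstSmaller)
        good(hits) = false;             % every similar column loses to this row
    else
        good(hits(1:firstSmaller-1)) = false;  % columns met before that one
        good(row) = false;              % then the row itself is dropped
    end
end
kept = find(good).';
```

On this example `kept` comes out as molecules 2, 3 and 4, matching the first script in the question.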