delete rows- tanimoto matrix

Hi everyone;
with help of some of you, we could write two scripts to delete rows, in tanimoto matrix which we create based on RDKit fingerprints.
the idea was to delete those rows (= molecules) which are similar to another molecules if :
1.tanimoto index > 0.7 but less than 1.
2. the sum of all tanimoto indexes in a row is larger than the compared molecule.
*in the first solution we used for loop, with a vector called "good_ones"; the rows that got zero we delete and we don't compare them again :
[M,text,alldata]=xlsread('test2.csv');
[r c]=size(M);
S=sum(M,2); %sum rows
good_ones = ones(1,size(M,1));
% loop over rows
for row=1:r;
for col=row+1:c;
if good_ones(row)==1
if (M(row,col) >= 0.7) & (M(row,col) <1.0 ); %if the value between 0.7 to 1 then we compare the sum column
if S(row)>S(col) % if the sum of the i line is larger we delete this line
good_ones(row) = 0;
else % the sum of the j line is larger so we delete the other line
good_ones(col) = 0;
end
end
% mark lines for deletion afterwards
end
end
end
new_M = M(find(good_ones),:);
  • in the second script we used the vector way, but in this solution, rows that we compare and mark to delete, are compared again to another molecules, and in this way we loose more molecules,so the question how i can add a condition here like i did previous and check only the rows that i didn't have check before???:*
an example file of the matrix attached
[M,text,alldata]=xlsread('test2.csv.csv') ;% M is numeric date
% text is the text :-)
% all data is both M + text
[i j]=size(M);
S=sum(M,2);
triM = tril(M); %lower diagonal
[r, c] = find(triM >= 0.7 & triM < 1.0); %find position of all values in range
deleterow = S(r) > S(c); %compare row r with respective row c
%a true in deleterow means delete respective r, otherwise delete respective c
todelete = unique([r(deleterow) ;c(~deleterow)]);
nM = M;
nM(todelete, :) = [];
text2=text(2:i+1,1);
text2(todelete, :) = [];

7 Comments

dpb
dpb on 7 Dec 2014
Edited: dpb on 7 Dec 2014
I've not seen any previous thread that this looks familiar to so I'm coming in cold and I have absolutely no klew what a "tanimoto" matrix is altho I gather it's a classification statistic of some sort so these are perhaps naive queries, but--
At least with the file test2.csv you linked to, the sum already is in the file as the last column so doesn't the
S=sum(M,2);
double it over what should be? Or, does your actual data file not include the sum column?
Also, should the sum actually be the full row sum since it is a symmetric array or should it not be the sum of the elements in the upper (or lower) triangular array? And, should the sums actually include the diagonal 1 value?
With those questions resolved, does the first script produce the desired output?
I'm sure a vectorized version can be written; I think I'd do it just slightly differently than the author of the above but similar idea; just need the actual definition of right/wrong calculation clarified before spending any time writing code. Again, don't know if I'm right or totally off-base, but seems to me you'd want
triM=tril(M,-1); % the lower triang excluding diagonal
S=sum(triM,2); % sums w/o the diagonal and not double-counting
The above assumes also that M doesn't already include a summation column.
Shayma
Shayma on 8 Dec 2014
Edited: Shayma on 8 Dec 2014
tanimoto matrix is a statistic used for comparing the similarity and diversity of sample sets; larger the number between 2 compounds means more similarity.
I uploaded a new file - test2.csv that shows how it looks like ( without a sum column).
it's a good point to sum only one triangular because of the symmetric, thanks for raising that !
the first script doing what we want, but i want a faster way to do it over large data set, so i want to write it in the vector-version.
triM=tril(M,-1);
this is also a good idea! in this way i skip the ones in the matrix. about the sum, actually we want the sum of the whole row -1 !
what i want to get is not to compare the molecule that i'm going to delete again with another molecules ! so if in first run compound x will be deleted, i won't compare this compound again, even if it has a tanimoto coefficient >0.7 with others compounds!
That's a start altho I'm not sure what you mean precisely by "we want the sum of the whole row -1". If as suggested you pick up only the upper/lower triangular and sum that there is no one included in the summation to begin with as the diagonal elements aren't included. Or is there some other reason that one wants S-1 "like it's a a DOF adjustment or the like?"
Altho, on reflection, it looks like the magnitude of the sums doesn't matter as it's only relative so including the 1 or not makes no difference. The doubling of the symmetry could affect the ordering, however.
Also, explain more precisely the desired logic of which are the desired results -- I don't quite follow the reasoning behind the column vs the row marking in the looping solution. The block
if S(row)>S(col) % delete this line
good_ones(row) = 0;
else % the sum of the j line is larger so we delete the other line
good_ones(col) = 0;
end
doesn't make sense to me. The else clause is executed for S(row)<=S(col), not just S(row)<S(col). While it may not matter that much, there is an equality condition at least theoretically possible. What should happen then?
Actually, on yet further consideration, I think this is a case that the sequential solution via the loop may well be the optimal as you are eliminating rows from further consideration whereas the vector solution will operate over the entire array.
I'd have to think about the equivalent vectorized solution and then one would have to time the two for representative datasets to see about performance. How large a set would be typical and what kind of run times do you see?
Rewriting your loop solution a little...
[r c]=size(M);
r=r-1; % last row is going to be superfluous
M=triu(M(1:r,:),1); % get upper triang only
S=sum(M,2);
good = true(1,size(M,1)); % use a logical for a logic test
for row=1:r
for col=row+1:c
if good
if M(row,col)>=0.7 % no need for range test
if S(row)>S(col)
good(row)=false;
break % go to next row; done all need here
else
good(col)=false;
end
end
end
end
end
M = M(good,:);
Hi, each row contains a compare for one molecule with the whole data, so we need the sum of the whole row, minus 1 its the compare to the molecule itself = diagonal, so we can - 1 or not it won't change a lot! but if we take only the upper part- for the last molecule it will appear only one number- one compare, that's why we take the whole row! hope now it's clear.
**in the case of S(row)=S(col), we have to delete one of them, in this case it will be the second molecule, maybe here we can add an additional condition to decide, yet that's how we do it!
we change also a little bit in our script :) and now it double check the molecules , before continue in the loop:
% loop over rows
for row=1:r-1; %last row will be=1 only
if good_ones(row)==1;
for col=row+1:c; % go over upper triangle only !
if good_ones(col)==1;
if (M(row,col) >= 0.7); %if the value >= 0.7 then we compare the sum vector
if S(row)>S(col); % if the sum of the row line is larger we delete this line
good_ones(row) = 0;
break % & we don't continue comparing this line
else % the sum of the col line is larger so we delete it
good_ones(col) = 0;
end
end
end
end
end
end
maybe it won't matter in time if i have only hundreds of molecules, but if you have time i would like to see your solution :)
many thanks for your time
OK, I do now grok the summation and yes, the "1" is immaterial as long as it's simply a ranking and not an absolute value.
"...them, in this case it will be the second molecule, maybe here we can add an additional condition to decide..."
In that case, change the > to >= and kill the row that lets you abort the loop earlier. It seems that one still needs to finish out the row otherwise.
I won't have time until perhaps the weekend (and can't promise that, even) to devote to trying the vector solution but given the condition of the decision process I don't think it's likely to be a case where it will save a lot of time. With the JIT compiler, loops aren't always the overhead they used to be. Unless you actually run into severe computational bottlenecks I'd suspect your best utilization of time/resources is to just "get on with it" at this point.
Another possible modification that might save some time would be to test whether there are any more on the row greater than the limit on the alternative test instead of continuing the loop on a "by one" step.
Oh, alternatively, instead of the inner loop, use a logical array test on the row at once...
if ~any(M(row,:)>0.7) % this row passes
break
else
% deal with them here...
end

Sign in to comment.

 Accepted Answer

dpb
dpb on 10 Dec 2014
Edited: dpb on 11 Dec 2014
"...but if you have time i would like to see your solution :)"
Well, I've not tested this extensively but it does reproduce your script results for the sample dataset --
Presuming
a) S is the summation column vector, and
b) M is the tridiagonal form w/o the diagonal
>> M=M>0.7; % logical array
>> R=squareform(pdist(S,@(x,y) rdivide(x,y)))>1; % sum ratios to match
>> good = ~any(M&R,2) & ~any(M&~R,1).';
>> all(good==Good)
ans =
1

1 Comment

dpb
dpb on 10 Dec 2014
Edited: dpb on 11 Dec 2014
From the command line session I used to debug...
>> [~any(MR&R,2) ~any(MR&~R,1).' [~any(MR&R,2) & ~any(MR&~R,1).'] Good]
ans =
1 1 1 1
0 1 0 0
1 1 1 1
0 0 0 0
1 1 1 1
1 1 1 1
1 0 0 0
1 1 1 1
1 1 1 1
>> all(ans(:,3)==ans(:,4))
ans =
1
>>
Good is a local variable from the output from your script that I saved for comparison purposes. MR is MR=M>0.7 logical array; while testing I kept the second variable but there's no need in the actual code it would seem.
As noted above, I have no extensive debugging here; it just seems like the correct logic to match up the rows/columns that are the result of the if...else...end clause but that's as far as I went in depth.
You might want to reverse some of the logic so it's not a negative in the end...that is particularly M<0.7 instead of M>0.7; the sum ratio is a wash. Altho it occurred to me that perhaps if kept the full array M (excluding the diagonal, of course) that perhaps R and 1/R for the two triangulars just might simplify it but again I didn't pursue that thought for lack of time to devote to the problem.
One last thought if you have a problem with this back on the looping solution. One could potentially do something like
pool=find(any(M>0.7,2)).';
for r=pool % only rows with high magnitude, not all
...
How much difference, if any, one might see would be highly dependent upon the dataset I think.
Also, however, one optimization that should make some difference is to convert the M array to the ratio before the loop instead of doing the calculation every pass...
M=M>0.7;
for r=1:nR
...
for c=r+1:nC
if M(r,c) % logical flag check instead of computation
...
I don't know if it would make any real difference in timing but I'd think the conversion I showed earlier of using a logical array instead of the 0,1 numeric and explicit comparison might save a few cycles...if nothing else it saves a find when you're done because can use the logic array directly as in indexing expression instead of finding the positions.

Sign in to comment.

More Answers (0)

Categories

Asked:

on 7 Dec 2014

Edited:

dpb
on 11 Dec 2014

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!