# Find duplicate elements and remove the rows that have similar values in one column

Dear Matlab experts,

I am using the following function to find the rows that have similar values in their 9th column. The calculation is very slow because the data set is large. Any suggestions for modifying my code to speed it up, or any other way to achieve this?

Thank you in advance.

    function in1=dup_remove(out2)
    b=[];
    for i=1:size(out2,1)
        [r,c]=find(out2(:,9)==out2(i,9));
        if(length(r)==1)
            b=[b;out2(i,:)];
        end
    end
    if (~isempty(b))
        in1=b;
    end
    end

Jan on 18 Oct 2022 (edited 19 Oct 2022)

Avoid iteratively growing arrays, because they are extremely expensive. See:

    x = [];
    for k = 1:1e6
        x(k) = rand;
    end

In each iteration this creates a new vector x and copies the former contents into it, so in total Matlab reserves and copies sum(1:1e6)*8 bytes, which is about 4 TB!

Pre-allocation solves the problem:

    x = zeros(1, 1e6);
    for k = 1:1e6
        x(k) = rand;
    end

This reserves only 8 MB and copies just the scalar elements.
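The figure of roughly 4 TB can be checked with a few lines. The sketch below also times both loops on a smaller vector. The size n = 1e5 and the variable names are assumptions chosen for illustration, and the measured times depend on the machine and the Matlab release.

```matlab
% Total bytes copied if every iteration reallocates the whole vector:
% 8 bytes per double, 1 + 2 + ... + 1e6 elements.
bytesCopied = sum(1:1e6) * 8;
fprintf('%.6f TB\n', bytesCopied / 1e12);   % prints 4.000004 TB

n = 1e5;   % hypothetical size, kept small so the comparison runs quickly

% Variant 1: the vector grows in each iteration
tic
a = [];
for k = 1:n
    a(k) = rand; %#ok<SAGROW>
end
tGrow = toc;

% Variant 2: the vector is pre-allocated once
tic
b = zeros(1, n);
for k = 1:n
    b(k) = rand;
end
tPre = toc;

fprintf('growing: %.4f s, pre-allocated: %.4f s\n', tGrow, tPre);
```

Both loops produce a vector of the same size; only the allocation pattern differs.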

In your case:

    function y = dup_remove(x)
    x9 = x(:, 9);  % Slightly faster than indexing each time
    n = size(x,1);
    match = false(n, 1);
    for i = 1:n
        [r, c] = find(x9 == x9(i));
        match(i) = (numel(r) == 1);
    end
    y = x(match, :);
    end

Calling the input "out2" and the output "in1" is confusing.

A smarter method:

    function y = dup_remove(x)
    x9 = x(:, 9);                     % Slightly faster than indexing each time
    T = true(numel(x9), 1);
    [S, idx] = sort(x9(:).');         % Equal values become neighbors
    m = [true, diff(S) ~= 0];         % True at the first element of each run
    ini = strfind(m, [true, false]);  % First elements that are followed by a duplicate
    m(ini) = false;                   % Unmark the 1st occurrence of duplicated values too
    T(idx) = m;                       % Restore original order
    y = x(T, :);
    end

Sorting avoids comparing each element with all the others: once the values are sorted, each element only needs to be compared with its neighbor.
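To see the neighbor comparison at work, here is a minimal sketch on a small key vector. It is a variant of the idea, not the code from the answer: instead of strfind it marks an element as unique when it differs from both its left and its right neighbor in the sorted order. The sample data and the names keys, isFirst, isLast and keep are assumptions for illustration.

```matlab
% Sample keys, standing in for the 9th column of the data
keys = [3; 7; 3; 5; 7; 9];

% Sort so that equal values become neighbors
[S, idx] = sort(keys(:).');          % S = [3 3 5 7 7 9]

% Compare each element with its neighbors in the sorted order
d       = diff(S) ~= 0;              % true where the value changes
isFirst = [true, d];                 % differs from the left neighbor
isLast  = [d, true];                 % differs from the right neighbor

% A value occurs exactly once if it differs from both neighbors
keep      = false(numel(keys), 1);
keep(idx) = isFirst & isLast;        % restore the original order

% Cross-check against the brute-force comparison of every pair
brute = false(numel(keys), 1);
for i = 1:numel(keys)
    brute(i) = (sum(keys == keys(i)) == 1);
end
assert(isequal(keep, brute));

disp(find(keep).')                   % rows 4 and 6 survive
```

With a full matrix x, the surviving rows would then be x(keep, :), as in the functions above.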

