speed up ismember with two-dim data
Show older comments
I have two large cell-arrays (X and Y) where each field is a two dimensional array representing pixel positions. The cell arrays are not the same size (can be in the range of thousands) and each entry can be be a different length (ranging from length 1 to length >1000). Maybe relevant: Over X and Y individually, no point can be included twice, i.e. all entries in X have no points in common (same for Y).
Now I want to find the overlap between all entries in X with all entries in Y. (in reality I filter the entries in Y for a criterion and compute only roughly 10-20 percent of the comparisons)
Is there a way to speed up this process? I have read about ismembc which requires sorted data (and thus I thought it cannot be used in my case).
Minimal working example:
X{1} = [1 1; 1 2; 1 3];
X{2} = [3 1; 3 2; 3 3];
Y{1} = [1 1; 1 2];
Y{2} = [2 3; 3 1; 3 2; 3 3];
Y{3} = [5 5];
mat_overlap = NaN(length(X), length(Y));
for ii = 1:length(X)
for jj = 1:length(Y)
mat_overlap(ii,jj) = sum(ismember(X{ii}, Y{jj}, 'rows'));
end
end
6 Comments
Are the values integers or floating point numbers? How large are the values?
The current set of data can be converted to 1D data by:
m = 10;
XX{i} = X{i}(:, 1) * m + X{i}(:, 2);
Afterwards ismembc can acclerate the search again. This works only if there is an m > max(Data(:, 2)) which does not cause values > 2^53, where rounding effects would invalidate the integrity.
An alternative is to calculate the MD5 hash of each row, but this takes some time and I guess, that this is slower in total.
Optimizing the runtime depends on the specific input: How large are the data in average? How many elements overlap? Would this be a realistic input:
X = cell(1, 1000);
for k = 1:1000
n = randi([1, 100]);
X{k} = rand(n, 2); % Or: randi([1,50], n, 2) ?
% Or: unique(randi([1,50], n, 2), 'rows')?
end
% And the same for Y
Are the rows of the elements of X and Y guaranteed to be unique? Or is this possible:
X{1} = [1 1; 1 2; 1 1];
Felix Müller
on 14 Jul 2021
Edited: Felix Müller
on 14 Jul 2021
Bjorn Gustavsson
on 14 Jul 2021
Generate arrays groupidxX and groupidxY arrays for all points in X and Y then put all of X and Y into two n-by-2 arrays, pass those 2 arrays to sortrows, those row-sorted arrays should be "reasonably" efficient to find matching pairs in by looping through only small subsets of the second array, then keeping the proper group-indices and whatnot of the matches?
Felix Müller
on 14 Jul 2021
Edited: Felix Müller
on 15 Jul 2021
Jan
on 14 Jul 2021
@Felix Müller: I assume, there is a typo in the code to produce the test data:
X{k} = random_vector(idx_first:idx_last, :);
% ^^^ added to get 2 columns
I've posted a comparison with ismembc and get a speedup of a factor 100 - much more then 60%.
Felix Müller
on 15 Jul 2021
Accepted Answer
More Answers (0)
Categories
Find more on Loops and Conditional Statements in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!