How to improve efficiency when checking for unique combinations of table values?

1 view (last 30 days)
Hello, I am trying to check each row in a sheet of data to see if it contains a combination of 2 values. For example, there might be a table like
Fruit | Color | related value
apple | blue | 3
apple | red | 8
banana | blue | 5
banana | yellow | 6
apple | blue | 16
pear | yellow | 7
I want to create a table with the first column as the top row and the second column as the first column, and with the values indicated by a corrdinate pair being the number of times that that particular pair appears.
For example:
apple | banana | pear
blue 2 1 0
red 1 0 1
yellow 0 1 1
I'm currently using code that looks like this
fileContents = zeros(size((colorNames,1),size(fruitNames,1));
for p = 1:size(uniqueFruitNames,1)
for q = 1:size(uniqueColorNames,1)
for k = 1:size(rawData,1)
if ((uniqueFruitNames(p) == rawData(k,1)) && (all((uniqueColorNames(q,:) == rawData(k,2))))
fileContents(q,p) = fileContents(q,p) + 1;
end
end
end
end
My problem is that the table that is referenced by my code contains upwards of 10 thousand values, and I have found that the above code take approximately 15 minutes to complete. The rawData sheet contains around 12 thousand rows and 70 columns. The uniqueFruitNames array is 8 x 1, and the uniqueColorNames array is aproximately 2000 x 1.
What can I do to make this more efficient?

Accepted Answer

dpb
dpb on 15 Jun 2021
tFruit=readtable('fruit.txt');
tFruit=tFruit(:,[1:2:end]);
tFruit.Properties.VariableNames={'Fruit','Color','Data'};
tFruit.Fruit=categorical(tFruit.Fruit);
tFruit.Color=categorical(tFruit.Color);
tFruit =
6×3 table
Fruit Color Data
______ ______ ____
apple blue 3
apple red 8
banana blue 5
banana yellow 6
apple blue 16
pear yellow 7
% The engine
>> histcounts2(double(tFruit.Color),double(tFruit.Fruit))
ans =
2 1 0
1 0 0
0 1 1
>>
  1 Comment
dpb
dpb on 15 Jun 2021
Essentially Walter's solution packaged by TMW.
Which will be quicker for a largeish dataset I don't know...

Sign in to comment.

More Answers (1)

Walter Roberson
Walter Roberson on 14 Jun 2021
findgroups the two columns individually. Concatenate the groups into one array of two columns. Use those as the first parameter to accumarray and use 1 as the second parameter. The result will be a count array.

Products


Release

R2020b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!