How do I reduce categorical array containing alternate spellings and formats?
7 views (last 30 days)
Show older comments
I bet this problem comes up all the time:
I have a categorical array of length ~ 10,000 and typical string length ~ 10 to 20. Accounting for alternative spellings and formatting (etc.), I believe it actually contains only ~ 100 unique categories (a reduction on the order of ~100). What is the best way to achieve this reduction?
I am considering using the file exchange function EditDist with a distance of around 1 to 5 but I am open to other ideas. One issue I am considering is how to programmatically pick a "version" of each categorical with the least distance to all its other versions (the "average" version of each categorical).
What do you all recommend? Can I get it done in a single for loop or are nested forloops required? Is there a pre-existing solution? Thanks!
5 Comments
Walter Roberson
on 18 Sep 2024
The edit distance between pork and fork is 1. There is no smaller edit distance available.
It is true that you could measure the distance shifted in the alphabet... but that would tell you that fork is pretty similar to cork. You need to decide whether it is reasonable to replace
Eat your pork with a fork.
with
Eat your cork with a cork.
Answers (2)
Peter Perkins
on 18 Sep 2024
Chris, something puzzles me. You say,
"I have a categorical array of length ~ 10,000 and typical string length ~ 10 to 20. Accounting for alternative spellings and formatting (etc.), I believe it actually contains only ~ 100 unique categories (a reduction on the order of ~100)."
100*100 is 10000, so are you saying you have a categorical array whose length is 10K and that has 10K categories? That can't be right. Maybe you have a string array whose length is 10K and that will have 100 categories after cleaning it up?
In any case, I can't help much with identifying which misspelled/whatever categories are equivalent to each other, but cleaning up this sort of thing is what renamecats, mergecats, etc. are for. You probably already know that.
Rahul
on 30 Sep 2024
Hi,
I believe that you’re trying to cluster a categorical array, based on the “EditDist” function metric as distance between two categories, and further trying to get the cluster centroid for each of the clusters, which are at minimum average distance to the other strings in the same cluster.
The problem seems to resemble a k-means clustering problem, function "clusterdata" can be extended to be used for categorical elements as strings, where the strings need to be encoded to numeric values before being passed as observation vectors.
Data Generation: can be done by initially having 100 unique categorical strings of length (say 15), and then performing random 1-5 ‘editDist’ operations 100 times to create variations in the respective cluster, which can be achieved as:
% Parameters
num_unique_categories = 100;
num_variations_per_category = 100;
total_size = num_unique_categories * num_variations_per_category;
string_length = 15; % Length of the base strings
% Function to apply random edit operations (substitution, insertion, deletion)
apply_random_edit = @(str) edit_string_randomly(str);
% Function to randomly edit a string (simulate substitution, insertion, and deletion)
function new_str = edit_string_randomly(str)
operations = {'substitute', 'insert', 'delete'}; % Possible edit operations
num_operations = randi([1, 3]); % Random number of operations to apply (1-3)
for i = 1:num_operations
operation = operations{randi([1, 3])}; % Randomly pick an operation
switch operation
case 'substitute'
% Substitution: replace a random character
pos = randi([1, length(str)]); % Random position
str(pos) = char(randi([97, 122])); % Random lowercase letter
case 'insert'
% Insertion: insert a random character at a random position
pos = randi([0, length(str)]); % Random position (0 means insert at the beginning)
new_char = char(randi([97, 122])); % Random lowercase letter
str = [str(1:pos), new_char, str(pos+1:end)];
case 'delete'
% Deletion: remove a random character
if length(str) > 1 % Ensure the string has more than one character
pos = randi([1, length(str)]); % Random position
str(pos) = []; % Remove character at the chosen position
end
end
end
new_str = str;
end
% Create base categories
base_categories = cell(1, num_unique_categories);
for i = 1:num_unique_categories
base_categories{i} = generate_random_string(string_length); % Generate 5-character base strings
end
% Generate the full categorical array with variations
categorical_array = cell(1, total_size);
idx = 1;
for i = 1:num_unique_categories
base_str = base_categories{i};
for j = 1:num_variations_per_category
num_modifications = randi([0, 5]); % Modifications in the range [0, 5]
modified_str = base_str;
for k = 1:num_modifications
% Apply random edit (substitute, insert, delete)
modified_str = apply_random_edit(modified_str);
end
categorical_array{idx} = modified_str;
idx = idx + 1;
end
end
Encoding & Clustering: Convert each string into a numerical form, such as using word embeddings (like, word2vec), or creating a bag-of-words. This representation gives a spare symmetrical distance matrix with distance between each element, to be clustered using the “kmeans” or “kmedoids” function:
% Number of categories
numCategories = length(categoriesStr);
% Initialize a distance matrix
distanceMatrix = zeros(numCategories);
% Calculate pairwise edit distances
for i = 1:numCategories
for j = i+1:numCategories
distanceMatrix(i, j) = editDistance(categoriesStr{i}, categoriesStr{j});
distanceMatrix(j, i) = distanceMatrix(i, j); % Symmetric matrix
end
end
% Threshold for unique categoricals
threshold = 5;
% Cluster categories based on the threshold
clusters = clusterdata(distanceMatrix, 'Criterion', 'distance', 'Cutoff', threshold)
A threshold criterion of 5, determines the max distance within a cluster along with the cluster centroid indices.
You can also use “kmeans” or “kmedoids” functions to group categoricals, while providing a custom distance parameter to the function for measuring distance between two elements:
[I,c] = kmeans(categoriesStr, 'distance', @distFun);
function d = distFun(s1, s2)
s = size(s2);
d = zeros(s(1), 1);
for i=1:size(s2, 1)
d(i, 1) = editDistance(s1, s2(i));
end
end
For more information regarding usage of functions and parameters mentioned above, refer to the documentations below:
- Create Custom Distance Metric using “kmedoids”: https://in.mathworks.com/help/stats/kmedoids.html#buioqlz-X:~:text=Custom%20distance%20function%20handle.%20A%20distance%20function%20has%20the%20form
- Create Clusters using “clusterData”: https://in.mathworks.com/help/stats/clusterdata.html
0 Comments
See Also
Categories
Find more on Characters and Strings in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!