How do I reduce categorical array containing alternate spellings and formats?

Question

Chris Hooper on 17 Sep 2024

0
Link

Direct link to this question

https://ch.mathworks.com/matlabcentral/answers/2153380-how-do-i-reduce-categorical-array-containing-alternate-spellings-and-formats

Answered: Rahul on 30 Sep 2024

I bet this problem comes up all the time:

I have a categorical array of length ~ 10,000 and typical string length ~ 10 to 20. Accounting for alternative spellings and formatting (etc.), I believe it actually contains only ~ 100 unique categories (a reduction on the order of ~100). What is the best way to achieve this reduction?

I am considering using the file exchange function EditDist with a distance of around 1 to 5 but I am open to other ideas. One issue I am considering is how to programmatically pick a "version" of each categorical with the least distance to all its other versions (the "average" version of each categorical).

What do you all recommend? Can I get it done in a single for loop or are nested forloops required? Is there a pre-existing solution? Thanks!

5 Comments
Show 3 older commentsHide 3 older comments

Chris Hooper on 18 Sep 2024

@Walter Roberson it's a good point about edit distance. Wonder if there is a better alternative, or if keeping the edit distance small is good enough in my case.

I'll have a look at the toolbox you suggested, thanks!

Walter Roberson on 18 Sep 2024

The edit distance between pork and fork is 1. There is no smaller edit distance available.

It is true that you could measure the distance shifted in the alphabet... but that would tell you that fork is pretty similar to cork. You need to decide whether it is reasonable to replace

Eat your pork with a fork.

with

Eat your cork with a cork.

Sign in to comment.

Sign in to answer this question.

Answer 1

Peter Perkins on 18 Sep 2024

0
Link

Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/2153380-how-do-i-reduce-categorical-array-containing-alternate-spellings-and-formats#answer_1519000

Chris, something puzzles me. You say,

"I have a categorical array of length ~ 10,000 and typical string length ~ 10 to 20. Accounting for alternative spellings and formatting (etc.), I believe it actually contains only ~ 100 unique categories (a reduction on the order of ~100)."

100*100 is 10000, so are you saying you have a categorical array whose length is 10K and that has 10K categories? That can't be right. Maybe you have a string array whose length is 10K and that will have 100 categories after cleaning it up?

In any case, I can't help much with identifying which misspelled/whatever categories are equivalent to each other, but cleaning up this sort of thing is what renamecats, mergecats, etc. are for. You probably already know that.

2 Comments
Show NoneHide None

Chris Hooper on 18 Sep 2024

Edited: Chris Hooper on 18 Sep 2024

Hey Peter, thanks for the question hopefully I can clarify:

In the example I have an array of strings longer than 10,000 that reduces ~10,000 unique strings (or categoricals) but I believe it can be reduced to ~100 categories.

Some suggestions I've recieved are to consider text normalization (I am doing this already to an extent by changing to lower case without punctuation), fuzzy search, and “similarity search” (described to me as relating to natural language processing – beyond what I was hoping to tackle unless it’s a packaged matlab solution).

My own naive notion was to measure edit distance between each of the n=10,000 categories to yield a n x n score matrix, then look for clusters of categories sharing low mutual edit distance. @ Walter Roberson pointed out edit distance is a potentially weak approach. I was hoping a better solution might already exist in some neatly packaged form.

Walter Roberson on 19 Sep 2024

The editDistanceSearcher function would seem appropriate for this situation.

Sign in to comment.

Answer 2

Rahul on 30 Sep 2024

0
Link

Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/2153380-how-do-i-reduce-categorical-array-containing-alternate-spellings-and-formats#answer_1524600

Open in MATLAB Online

Hi,

I believe that you’re trying to cluster a categorical array, based on the “EditDist” function metric as distance between two categories, and further trying to get the cluster centroid for each of the clusters, which are at minimum average distance to the other strings in the same cluster.

The problem seems to resemble a k-means clustering problem, function "clusterdata" can be extended to be used for categorical elements as strings, where the strings need to be encoded to numeric values before being passed as observation vectors.

Data Generation: can be done by initially having 100 unique categorical strings of length (say 15), and then performing random 1-5 ‘editDist’ operations 100 times to create variations in the respective cluster, which can be achieved as:

% Parameters 
num_unique_categories = 100; 
num_variations_per_category = 100; 
total_size = num_unique_categories * num_variations_per_category; 
string_length = 15;  % Length of the base strings 
% Function to apply random edit operations (substitution, insertion, deletion) 
apply_random_edit = @(str) edit_string_randomly(str); 
% Function to randomly edit a string (simulate substitution, insertion, and deletion) 
function new_str = edit_string_randomly(str) 
operations = {'substitute', 'insert', 'delete'};  % Possible edit operations 
num_operations = randi([1, 3]);  % Random number of operations to apply (1-3) 
for i = 1:num_operations 
    operation = operations{randi([1, 3])};  % Randomly pick an operation 
    
    switch operation 
        case 'substitute' 
            % Substitution: replace a random character 
            pos = randi([1, length(str)]);  % Random position 
            str(pos) = char(randi([97, 122]));  % Random lowercase letter 
            
        case 'insert' 
            % Insertion: insert a random character at a random position 
            pos = randi([0, length(str)]);  % Random position (0 means insert at the beginning) 
            new_char = char(randi([97, 122]));  % Random lowercase letter 
            str = [str(1:pos), new_char, str(pos+1:end)]; 
            
        case 'delete' 
            % Deletion: remove a random character 
            if length(str) > 1  % Ensure the string has more than one character 
                pos = randi([1, length(str)]);  % Random position 
                str(pos) = [];  % Remove character at the chosen position 
            end 
    end 
end 
new_str = str; 
end 
% Create base categories 
base_categories = cell(1, num_unique_categories); 
for i = 1:num_unique_categories 
    base_categories{i} = generate_random_string(string_length);  % Generate 5-character base strings 
end 
% Generate the full categorical array with variations 
categorical_array = cell(1, total_size); 
idx = 1; 
for i = 1:num_unique_categories 
    base_str = base_categories{i}; 
    for j = 1:num_variations_per_category 
        num_modifications = randi([0, 5]);  % Modifications in the range [0, 5] 
        modified_str = base_str; 
        for k = 1:num_modifications 
            % Apply random edit (substitute, insert, delete) 
            modified_str = apply_random_edit(modified_str); 
        end 
        categorical_array{idx} = modified_str; 
        idx = idx + 1; 
    end 
end 

Encoding & Clustering: Convert each string into a numerical form, such as using word embeddings (like, word2vec), or creating a bag-of-words. This representation gives a spare symmetrical distance matrix with distance between each element, to be clustered using the “kmeans” or “kmedoids” function:

% Number of categories 
numCategories = length(categoriesStr); 
% Initialize a distance matrix 
distanceMatrix = zeros(numCategories); 
% Calculate pairwise edit distances 
for i = 1:numCategories 
    for j = i+1:numCategories 
        distanceMatrix(i, j) = editDistance(categoriesStr{i}, categoriesStr{j}); 
        distanceMatrix(j, i) = distanceMatrix(i, j); % Symmetric matrix 
    end 
end 
% Threshold for unique categoricals 
threshold = 5; 
% Cluster categories based on the threshold 
clusters = clusterdata(distanceMatrix, 'Criterion', 'distance', 'Cutoff', threshold) 

A threshold criterion of 5, determines the max distance within a cluster along with the cluster centroid indices.

You can also use “kmeans” or “kmedoids” functions to group categoricals, while providing a custom distance parameter to the function for measuring distance between two elements:

[I,c] = kmeans(categoriesStr, 'distance', @distFun); 
function d = distFun(s1, s2)  
    s = size(s2); 
    d = zeros(s(1), 1); 
    for i=1:size(s2, 1) 
        d(i, 1) = editDistance(s1, s2(i)); 
    end 
end 

For more information regarding usage of functions and parameters mentioned above, refer to the documentations below:

Create Custom Distance Metric using “kmedoids”: https://in.mathworks.com/help/stats/kmedoids.html#buioqlz-X:~:text=Custom%20distance%20function%20handle.%20A%20distance%20function%20has%20the%20form
Create Clusters using “clusterData”: https://in.mathworks.com/help/stats/clusterdata.html

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

How do I reduce categorical array containing alternate spellings and formats?

5 Comments
Show 3 older commentsHide 3 older comments

Answers (2)

2 Comments
Show NoneHide None

0 Comments
Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

How do I reduce categorical array containing alternate spellings and formats?

5 Comments Show 3 older commentsHide 3 older comments

Answers (2)

2 Comments Show NoneHide None

0 Comments Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

5 Comments
Show 3 older commentsHide 3 older comments

2 Comments
Show NoneHide None

0 Comments
Show -2 older commentsHide -2 older comments