How to improve K-means clustering with TF-IDF?
Geovane Gomes
on 7 Oct 2024
Commented: Christopher Creutzig
on 22 Oct 2024 at 5:57
Hi all,
I’m currently working on a project where I need to classify company segments based on their activity descriptions.
I’ve implemented K-means clustering using TF-IDF for feature extraction from text data. However, the current clustering results aren’t entirely accurate, especially when it comes to grouping semantically similar segments (e.g., "cars" and "vehicles" are placed into separate clusters). Is it possible to optimize this, or should I use a different approach instead of TF-IDF?
See cluster 13. More than 50% of the items were assigned to this cluster. I also tried using other distance parameters, but the results didn't improve.
Here is my code:
clear
close
% load and preprocess
d = readtable("segmentos95Translated.xlsx");
t = d.TRANSLATED;
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'EXCEPT');
    t{i} = strtrim(splitStr{1});
end
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'WITHOUT PREDOMINANCE');
    t{i} = strtrim(splitStr{1});
end
% tokenization
t = lower(t);
t = tokenizedDocument(t);
t = removeStopWords(t);
t = normalizeWords(t);
customStopWords = ["manufactur","activ",",","rental","(",")","*","exempt", ...
    "commerci","repres","agent","trade","product","retail","sale","waiv","special","wholesal"];
t = removeWords(t,customStopWords);
% bag of words and TF-IDF
bag = bagOfWords(t);
tfidfMatrix = tfidf(bag);
X = full(tfidfMatrix);
% kmeans
rng(1)
numClusters = 25; % about 10%
[idx, C, sumd, D] = kmeans(X, numClusters);
d.clusters = idx;
% display results
for i = 1:numClusters
    fprintf('Cluster %d:\n', i);
    disp(d.TRANSLATED(idx == i));
end
sortrows(groupcounts(d,"clusters"),"Percent","descend")
Accepted Answer
Sandeep Mishra
on 8 Oct 2024
Hi Geovane,
I can observe that you are trying to enhance the accuracy of your K-means clustering implementation.
The current implementation relies on 'TF-IDF', which treats every word as an independent feature and therefore cannot capture semantic similarity between words. As a result, synonyms and related terms (such as "cars" and "vehicles") are treated as unrelated.
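To make this concrete, here is a minimal sketch with three made-up descriptions (they are illustrative only, not taken from your spreadsheet). Because "car repair" and "vehicle maintenance" share no tokens, their TF-IDF rows have no overlapping non-zero columns and the cosine similarity is exactly zero, the same as for a completely unrelated description.

```matlab
% Toy documents (hypothetical): the first two are related in meaning
% but share no words; the third is unrelated to both.
docs = tokenizedDocument([
    "car repair"
    "vehicle maintenance"
    "bread bakery"]);

bag = bagOfWords(docs);
X = full(tfidf(bag));

% Cosine similarity between the two related documents.
cosSim = (X(1,:) * X(2,:)') / (norm(X(1,:)) * norm(X(2,:)));
disp(cosSim)   % 0, because the documents have no token in common
```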
To resolve this, you can use word embeddings such as 'fastText', which represent words in a continuous vector space so that words with similar meanings end up close to each other.
You can leverage the 'Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding' add-on in MATLAB to implement 'fastText' word embedding.
Consider the following implementation:
% Converting tokenized documents to cell array
textData = arrayfun(@(doc) joinWords(doc), t, 'UniformOutput', false);
% Loading fastText word embedding
emb = fastTextWordEmbedding;
% Converting text to embedding
X = zeros(numel(textData), emb.Dimension);
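for i = 1:numel(textData)
    words = split(textData{i});
    validWords = words(isVocabularyWord(emb, words));
    if ~isempty(validWords)
        % Document vector = mean of its word vectors
        vecs = word2vec(emb, validWords);
        X(i, :) = mean(vecs, 1);
    end
    % Documents with no in-vocabulary words keep an all-zero row
end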
[idx, C] = kmeans(X, numClusters);
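One thing to watch for with the loop above: documents without any in-vocabulary words end up as all-zero rows, and they will all fall into the same cluster. The following is only a sketch of one way to handle that, using a random stand-in for X (not your data) and assuming you want cosine distance, which is a common choice for averaged word vectors. kmeans with 'cosine' does not accept zero rows, so they are set aside first.

```matlab
% Stand-in for the document embedding matrix X (hypothetical data):
% two well-separated groups plus two all-zero rows.
rng(1)   % reproducible initialization
X = [randn(20, 5) + 3; randn(20, 5) - 3; zeros(2, 5)];

% Rows that actually received an embedding.
hasVec = any(X ~= 0, 2);

% Cluster only the non-zero rows; the rest keep NaN as "unassigned".
numClusters = 2;
idx = nan(size(X, 1), 1);
idx(hasVec) = kmeans(X(hasVec, :), numClusters, ...
    'Distance', 'cosine', 'Replicates', 5);

% Cluster sizes, to check whether one cluster dominates.
clusterSizes = accumarray(idx(hasVec), 1);
disp(clusterSizes')
```

The 'Replicates' option reruns kmeans from several starting points and keeps the best solution, which usually makes the result less sensitive to initialization.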
For more information, see the File Exchange page for the 'Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding' add-on: https://www.mathworks.com/matlabcentral/fileexchange/66229-text-analytics-toolbox-model-for-fasttext-english-16-billion-token-word-embedding
I hope this helps.
4 Comments
Christopher Creutzig
on 22 Oct 2024 at 5:57
Also worth checking out are documentEmbedding and, for a different workflow with “soft clustering,” fitlda.
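For the fitlda route, a rough sketch on made-up documents (the texts and the number of topics are assumptions for illustration). Instead of one hard label per document, LDA gives each document a probability for every topic, and you can still take the most probable topic if you need a single label.

```matlab
% Toy documents (hypothetical), two obvious themes.
docs = tokenizedDocument([
    "car vehicle engine repair"
    "vehicle car tire engine"
    "bread bakery flour cake"
    "cake flour bakery bread"]);
bag = bagOfWords(docs);

% Fit an LDA model; numTopics = 2 is an assumption for this toy data.
rng(1)
numTopics = 2;
mdl = fitlda(bag, numTopics, 'Verbose', 0);

% Soft assignment: one row per document, one column per topic.
P = mdl.DocumentTopicProbabilities;

% Optional hard label: the most probable topic for each document.
[~, hardLabel] = max(P, [], 2);
disp(P)
```

On real data you would pass the bag built from your preprocessed descriptions and inspect each topic with topkwords(mdl, 10, topicIdx) to see whether the topics line up with meaningful segments.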