Using TF-IDF with naive Bayes
I'm building a sentiment classification model using TF-IDF and naive Bayes, but the model keeps misclassifying the second class. I have used TF-IDF with other models, such as SVM and random forest, and it worked fine. Below I describe my data and the steps I used.
I have 2000 comments (1000 positive, 1000 negative). I did the following steps:
1) Data preprocessing
cleanTextData = erasePunctuation(textData);
cleanTextData = lower(cleanTextData);
words = stopWords;
cleanDocuments = tokenizedDocument(cleanTextData);
cleanDocuments = removeWords(cleanDocuments,words);
cleanDocuments = normalizeWords(cleanDocuments);
cleanDocuments(1:10)
%% Bag of Words
cleanBag = bagOfWords(cleanDocuments)
cleanBag = removeInfrequentWords(cleanBag,2) % remove words with frequency less than or equal to 2
%% Remove empty documents caused by preprocessing
[cleanBag,idx] = removeEmptyDocuments(cleanBag);
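One thing that may be worth checking at this step: `removeEmptyDocuments` returns the indices of the documents it dropped, and the label vector has to lose the same rows or the predictors and labels no longer line up. The post does not show whether this was done, so the following is only a minimal sketch on made-up toy data (the strings and labels are hypothetical; `response` is used as in the call to `fitcecoc` in the post). It assumes Text Analytics Toolbox.

```matlab
% Toy data (hypothetical): the second comment is punctuation only,
% so it becomes an empty document after erasePunctuation.
textData = ["good movie"; "!!!"; "bad movie"];
response = categorical(["pos"; "pos"; "neg"]);

docs = tokenizedDocument(erasePunctuation(textData));
bag  = bagOfWords(docs);

% idx lists the documents that were removed from the bag.
[bag,idx] = removeEmptyDocuments(bag);

% Remove the matching labels so rows of bag.Counts and response agree.
response(idx) = [];
```

After this, `bag.NumDocuments` and `numel(response)` should be equal; if they differ in the real pipeline, the labels are misaligned with the TF-IDF rows.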
Then I used TF-IDF
predictors = tfidf(cleanBag,'Normalized',true,'TFWeight','log','IDFWeight','smooth');
Then I passed the results to my naive bayes model
t = templateNaiveBayes('DistributionNames','mvmn');
CVMdl = fitcecoc(predictors,response,'KFold',10,'Learners',t,'FitPosterior',true,'Coding','onevsone','ResponseName','response');
But the confusion matrix gives the following results:
     C1     C2
    ____    __
     990    10
    1000     0
It seems to be assigning almost all of the 2000 observations to a single class. Please advise; I have tried almost everything I know and whatever others have suggested. This is related to my master's thesis, and I only have a few weeks left to submit it.
4 Comments
Jim David
on 27 Jul 2018
On running the code with a dataset consisting of 5000 documents from which 2500 features (unique words) were extracted, I was able to obtain accuracies exceeding 95%. Repeating the same with a dataset consisting of 2000 documents from which about the same number of features was extracted yielded an accuracy of 60%. I would expect this issue to largely be resolved by increasing the size of the dataset.
Here are certain considerations which might help achieve your objectives.
1) Increasing the size of your dataset. Dimensionality reduction may also help.
2) While running the code on my end, I encountered a warning regarding the use of 'mvmn' as the Distribution parameter. This was due to the continuous nature of the tf-idf values as opposed to the categorical values which 'mvmn' is best suited for. I would consider changing the Distribution parameter to 'normal', while making sure to handle zero-variance features appropriately. You may find this discussion helpful:
This could be done by removing all zero-variance features and training on all the data at once, without folding. The folding option built into the fitting functions does not handle cases where a partition of the dataset has zero-variance features. Folding could be done instead with a user-defined function.
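The zero-variance handling suggested above could look roughly like the sketch below. It uses a small synthetic matrix as a stand-in for the TF-IDF predictors (all data and variable names here are made up for illustration), drops every feature whose variance is zero within any class, and then fits a Gaussian naive Bayes model on all the data without folding. It assumes Statistics and Machine Learning Toolbox; a sparse TF-IDF matrix may need converting with `full` first.

```matlab
% Synthetic stand-in for a dense TF-IDF matrix: 200 documents, 6 features.
rng(0);                                  % reproducible toy data
X = rand(200,6);
y = [ones(100,1); 2*ones(100,1)];
X(y==2,1) = X(y==2,1) + 1;               % feature 1 separates the classes
X(:,4)    = 0;                           % zero variance overall
X(y==1,5) = 0;                           % zero variance within class 1 only

% Keep only features with nonzero variance inside every class,
% since a Gaussian fit needs a positive variance per class and feature.
classes = unique(y);
keep = true(1,size(X,2));
for k = 1:numel(classes)
    keep = keep & (var(X(y==classes(k),:),0,1) > 0);
end
Xkept = X(:,keep);

% Train on all the data at once (no folding) with normal distributions.
Mdl = fitcnb(Xkept,y,'DistributionNames','normal');
C = confusionmat(y,predict(Mdl,Xkept));  % resubstitution confusion matrix
```

Because the model is evaluated on its own training data here, `C` is optimistic; a manual fold loop would need to repeat the variance check inside each training partition.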
Sarah Alduayj
on 25 Aug 2018
Edited: Sarah Alduayj
on 25 Aug 2018
Christopher Creutzig
on 26 Nov 2018
Edited: Christopher Creutzig
on 26 Nov 2018
Do you have to use naïve Bayes, or did you try other models and get even worse results?
With only two classes, I do not see why you use fitcecoc, which is an interface to use multiple binary classifiers to build a multi-class one. You could use fitclinear instead, which in my experience is pretty good at the kind of high-dimensional fitting required in text analytics.
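As a rough illustration of the `fitclinear` route, the sketch below cross-validates a linear model on a sparse synthetic matrix standing in for the TF-IDF predictors. The data, the sizes, and the choice of a logistic learner are assumptions made for the example, not something taken from the question. It assumes Statistics and Machine Learning Toolbox.

```matlab
% Sparse synthetic stand-in for TF-IDF predictors: 400 documents, 50 features.
rng(0);                                   % reproducible toy data
n = 400; p = 50;
X = sprand(n,p,0.1);                      % about 10% nonzero entries
y = [ones(n/2,1); 2*ones(n/2,1)];
X(y==2,1:5) = X(y==2,1:5) + 1;            % make the first 5 features informative

% 10-fold cross-validated linear classifier; rows are observations.
CVMdl = fitclinear(X,y,'KFold',10,'Learner','logistic');

yhat = kfoldPredict(CVMdl);               % out-of-fold predicted labels
C    = confusionmat(y,yhat);              % rows: true class, columns: predicted
err  = kfoldLoss(CVMdl);                  % out-of-fold classification error
```

With the real data, `X` would be the output of `tfidf` and `y` the label vector, and `fitclinear` accepts the sparse matrix directly.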
Oscar Green
on 10 May 2019
One thing I've done in the past is to aggregate/discretize into log-frequency buckets and treat those as features. It's a bit of a hack, but so is naive Bayes, and it ends up working pretty well.
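A minimal sketch of that bucketing idea, under assumptions of my own: the comment does not say how the buckets are defined, so the `floor(log2(1+count))` rule and the synthetic count matrix below are only one possible choice. The bucket indices are discrete, which is the kind of input `'mvmn'` expects. It assumes Statistics and Machine Learning Toolbox.

```matlab
% Synthetic word-count matrix: 200 documents, 8 words (made-up data).
rng(0);                                    % reproducible toy data
y = [ones(100,1); 2*ones(100,1)];
counts = poissrnd(1,200,8);
counts(y==2,1:2) = counts(y==2,1:2) + poissrnd(4,100,2);  % informative words

% Log-frequency buckets (assumed rule): 0 -> 0, 1-2 -> 1, 3-6 -> 2, 7-14 -> 3, ...
buckets = floor(log2(1 + counts));

% Each bucket index is treated as a category level by 'mvmn'.
Mdl = fitcnb(buckets,y,'DistributionNames','mvmn');
C = confusionmat(y,predict(Mdl,buckets));  % resubstitution confusion matrix
```

With the real data, `counts` would come from `cleanBag.Counts` rather than from the TF-IDF values.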
Answers (0)