Using TFIDF with Naive bayes

Question

I'm building a sentiment classification model using TF-IDF and naive Bayes, but the model keeps misclassifying the second class. I have used TF-IDF with other models, such as SVM and random forest, and it worked fine. Below I describe my data and the steps I used. I have 2000 comments (1000 positive, 1000 negative). I did the following steps:

1) Data preprocessing

cleanTextData = erasePunctuation(textData);
cleanTextData = lower(cleanTextData);
words = stopWords;
cleanDocuments = tokenizedDocument(cleanTextData);
cleanDocuments = removeWords(cleanDocuments,words);
cleanDocuments = normalizeWords(cleanDocuments);
cleanDocuments(1:10)
%% Bag of words
cleanBag = bagOfWords(cleanDocuments)
cleanBag = removeInfrequentWords(cleanBag,2) % remove words with frequency less than or equal to 2
%% Remove empty documents caused by preprocessing
[cleanBag,idx] = removeEmptyDocuments(cleanBag);

Then I used TFIDF

 predictors = tfidf(cleanBag,'Normalized',true,'TFWeight','log','IDFWeight','smooth');

Then I passed the results to my naive bayes model

t = templateNaiveBayes('DistributionNames','mvmn');
CVMdl = fitcecoc(predictors,response,'KFold',10,'Learners',t,'FitPosterior',true,'Coding','onevsone','ResponseName','response');

But the confusion matrix gives the following results:

It seems to classify almost all of the 2000 observations into one class only. Please advise; I have tried almost everything I know and whatever others have suggested. This is related to my master's thesis and I only have a few weeks left to submit it.

4 Comments

Jim David on 27 Jul 2018

On running the code with a dataset consisting of 5000 documents from which 2500 features (unique words) were extracted, I was able to obtain accuracies exceeding 95%. Repeating the same with a dataset consisting of 2000 documents from which about the same number of features were extracted yielded an accuracy of 60%. I would expect this issue to be largely resolved by increasing the size of the dataset.

Here are certain considerations which might help achieve your objectives.

1) Increasing the size of your dataset. Dimensionality reduction may also help.

2) While running the code on my end, I encountered a warning regarding the use of 'mvmn' as the Distribution parameter. This was due to the continuous nature of the tf-idf values as opposed to the categorical values which 'mvmn' is best suited for. I would consider changing the Distribution parameter to 'normal', while making sure to handle zero-variance features appropriately. You may find this discussion helpful:

https://datascience.stackexchange.com/questions/15526/how-to-handle-a-zero-factor-in-naive-bayes-classifier-calculation

This could be done by removing all zero-variance features and training on all the data at once, without folding. The folding option of the built-in functions doesn't handle cases where a feature has zero variance within a partition of the dataset. The folding could be done instead with a user-defined function.
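
As a rough illustration of the suggestion above, the following sketch runs the folds by hand and, inside each fold, drops every feature that has zero variance within any class of the training part before fitting a 'normal' naive Bayes model. The placeholder data and the variable names (X, y, keep, predicted) are assumptions made for illustration; with the variables from the question, X would be full(predictors) and y the labels in response. It assumes Statistics and Machine Learning Toolbox is available.

```matlab
% Placeholder data so the sketch runs on its own (an assumption, not the
% original data): 200 documents, 20 nonnegative "tf-idf-like" features.
rng(0);
X = abs(randn(200,20)) .* (rand(200,20) > 0.5);
X(:,end) = 0;                          % a deliberately constant feature
y = categorical([repmat("positive",100,1); repmat("negative",100,1)]);

classes = categories(y);
cvp = cvpartition(y,'KFold',10);       % stratified 10-fold partition
predicted = y;                         % right type and size; overwritten below

for k = 1:cvp.NumTestSets
    trainIdx = training(cvp,k);
    testIdx  = test(cvp,k);

    % Keep only features whose variance is nonzero in every class
    % of this fold's training data.
    keep = true(1,size(X,2));
    for c = 1:numel(classes)
        rows = trainIdx & (y == classes{c});
        keep = keep & (var(X(rows,:),0,1) > 0);
    end

    mdl = fitcnb(X(trainIdx,keep),y(trainIdx),'DistributionNames','normal');
    predicted(testIdx) = predict(mdl,X(testIdx,keep));
end

accuracy = mean(predicted == y)
C = confusionmat(y,predicted)
```

Selecting the features inside the loop, rather than once on the whole dataset, means a feature that happens to be constant only within one training fold is also caught.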

Sarah Alduayj on 25 Aug 2018

Edited: Sarah Alduayj on 25 Aug 2018

Thanks for your support. I have been trying to solve the problem over the last few weeks, and it has partially worked. I have just one problem left, with the distribution parameter. I have tried both 'normal' and 'mn', and both generate the same error shown below. The only thing that works is 'mvmn', but I want 'mn' to work since my project is about it. It would be great to know why this happens and how I can fix it.
Thank you.

Warning: When DistributionNames is 'mn', the input data must be nonnegative integers.
Warning: When DistributionNames is 'mn', the input data must be nonnegative integers.
Error using classreg.learning.partition.PartitionedModel/checkFoldArgs (line 327)
Indices of folds must be a vector with numbers between 1 and 0.

Error in classreg.learning.partition.PartitionedModel/kfoldPredict (line 212)
[mode,~,args] = checkFoldArgs(this,varargin{:});

Error in classreg.learning.partition.ClassificationPartitionedModel/kfoldPredict (line 223)
[~,score] = kfoldPredict@classreg.learning.partition.PartitionedModel(this,varargin{:});

Error in F_NaiveBayes1_custom_stopwords_1_2_3gram_TFIDF (line 56)
Predict = kfoldPredict(CVMdl);
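
The warning above states that 'mn' expects nonnegative integer inputs, whereas tf-idf weights are real-valued. One possible way around this, sketched below, is to train the multinomial model on raw term counts instead of tf-idf values. The counts matrix here is made-up placeholder data so that the sketch runs on its own; with the variables from the question it would presumably be full(cleanBag.Counts), with the labels restricted to the documents kept by removeEmptyDocuments. It assumes Statistics and Machine Learning Toolbox is available.

```matlab
% Placeholder term counts (an assumption, not the original data):
% 200 documents, 20 words, with class 1 using words 1-5 more often.
rng(0);
counts = randi([0 3],200,20);
counts(1:100,1:5) = counts(1:100,1:5) + 2;
y = categorical([repmat("positive",100,1); repmat("negative",100,1)]);

% 'mn' requires every entry to be a nonnegative integer.
assert(all(counts(:) >= 0 & counts(:) == round(counts(:))));

% Cross-validated multinomial naive Bayes on the counts.
cvMdl = fitcnb(counts,y,'DistributionNames','mn','KFold',10);
predicted = kfoldPredict(cvMdl);

C = confusionmat(y,predicted)
err = kfoldLoss(cvMdl)
```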

Christopher Creutzig on 26 Nov 2018

Edited: Christopher Creutzig on 26 Nov 2018

Do you have to use naïve Bayes, or did you try other models and get even worse results?

With only two classes, I do not see why you use fitcecoc, which is an interface to use multiple binary classifiers to build a multi-class one. You could use fitclinear instead, which in my experience is pretty good at the kind of high-dimensional fitting required in text analytics.
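
A minimal sketch of that suggestion, assuming Statistics and Machine Learning Toolbox: fitclinear accepts a sparse predictor matrix directly and can cross-validate through its 'KFold' option. The data below is a made-up placeholder; with the variables from the question the call would be along the lines of fitclinear(predictors,response,'KFold',10). The choice of 'logistic' as the learner is an assumption, and 'svm' is the other available option.

```matlab
% Placeholder sparse "tf-idf-like" data (an assumption, not the original
% data): 200 documents, 20 features, class 1 weighted toward features 1-5.
rng(0);
dense = abs(randn(200,20)) .* (rand(200,20) > 0.5);
dense(1:100,1:5) = dense(1:100,1:5) + 1;
X = sparse(dense);
y = categorical([repmat("positive",100,1); repmat("negative",100,1)]);

% Cross-validated linear classifier for the two-class problem.
cvMdl = fitclinear(X,y,'Learner','logistic','KFold',10);
predicted = kfoldPredict(cvMdl);

C = confusionmat(y,predicted)
err = kfoldLoss(cvMdl)    % out-of-fold classification error
```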

Oscar Green on 10 May 2019

One thing I've done in the past is to aggregate/discretize into log-frequency buckets and treat those as features. It's a bit of a hack, but so is naive bayes, and it ends up working pretty well.
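
One way to read this suggestion is sketched below: raw term counts are mapped to integer buckets on a log scale, so that small counts stay distinguishable and large counts are grouped together. The base-2 logarithm, the floor rounding, and the toy counts are all assumptions for illustration; the comment above does not specify the exact bucketing.

```matlab
% Toy raw term counts (rows = documents, columns = words); made-up values.
counts = [0 1 2 5   9 40;
          4 0 0 1 100 10];

% Bucket b holds the counts c with 2^b <= c + 1 < 2^(b+1),
% so a count of 0 lands in bucket 0.
buckets = floor(log2(1 + counts));

disp(buckets)
```

Because the buckets are nonnegative integers, they would also satisfy the input requirement for 'mn' quoted in the warning earlier in the thread.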
