How to use SVM in MATLAB for my binary feature vectors

ai ping Ng on 5 Apr 2017
Commented: Ilya on 15 Apr 2017
Let's say I have a main feature set that combines six binary feature vectors. These six binary feature vectors are stored as a 105x6 logical array. E.g.:
1. 10100001000001111111100000000001..
2. 00001010101111000010101010110001..
3. 00101011101111111100001000000000..
4. 11111111110000101010101001010111..
5. 0000011110000101010101001010111..
6. 11111111110000101010101001010110..
Three of the feature vectors are benign and the other three are malware. How can I train on my feature vectors using svmtrain and svmclassify? I have no idea how to start; please guide me.

Answers (2)

Walter Roberson on 8 Apr 2017
Do you mean you have 105 samples, each of which has a feature vector totaling 6 bits, or do you mean you have 6 samples, each of which has a total of 105 bits of features?
If you only have 6 samples with 105 bits of features per sample, then you do not have enough data to do classification.
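If the first reading applies (105 samples with 6 binary features each), a minimal training sketch could look like the following. It uses fitcsvm and predict from the Statistics and Machine Learning Toolbox, which are the newer replacements for svmtrain and svmclassify. The data and the 53/52 label split are random placeholders, not the poster's data, and the linear kernel and 5 folds are arbitrary choices.

```matlab
% Sketch only: placeholder data standing in for a 105x6 logical feature matrix.
% Requires the Statistics and Machine Learning Toolbox.
rng(0);                                   % reproducible placeholder data
X = rand(105, 6) > 0.5;                   % hypothetical: one sample per row
Y = [repmat({'benign'}, 53, 1); ...       % hypothetical class labels
     repmat({'malware'}, 52, 1)];

% fitcsvm expects numeric predictors, so convert the logical matrix.
mdl = fitcsvm(double(X), Y, 'KernelFunction', 'linear');

% Estimate the misclassification rate with 5-fold cross-validation.
cv  = crossval(mdl, 'KFold', 5);
err = kfoldLoss(cv);

% Classify one (here: already seen) sample.
label = predict(mdl, double(X(1, :)));
```

With random placeholder labels the cross-validated error should be near chance; the point is only the shape of the inputs (rows are samples, columns are features, one label per row).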
  2 Comments
Walter Roberson on 9 Apr 2017
To do the calculations for classification, you need at least as many samples as you have bits of features. More than that, actually.
user2030669, @cbeleites answer below is superb but as a rough rule of thumb: you need at least 6 times the number of cases (samples) as features. – BGreene Mar 7 '13 at 14:48
... in each class. I've also seen recommendations of 5p and 3p / class. – cbeleites Mar 7 '13 at 20:02
[...] but you need a minimum of 96 observations to accurately predict the probability of a binary outcome even if there are no features to be examined [this is to achieve a 0.95 confidence margin of error of 0.1 in estimating the actual marginal probability that Y=1].
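The figure of 96 in the quote appears consistent with the usual normal-approximation sample size for estimating a proportion, n = z^2 p(1-p) / e^2. The sketch below assumes the worst case p = 0.5 and z = 1.96 for 0.95 confidence; neither value is stated in the quote.

```matlab
% Sample size for estimating a proportion to within a margin of error e.
z = 1.96;   % assumed: two-sided 0.95 confidence, normal approximation
p = 0.5;    % assumed: worst-case proportion, maximizes p*(1-p)
e = 0.1;    % margin of error from the quote

n = z^2 * p * (1 - p) / e^2;   % about 96.04, which the quote gives as 96
fprintf('n = %.2f\n', n);
```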



Ilya on 11 Apr 2017
You most certainly do not need as many samples as you have features. Statements like "you need at least 6 times the number of cases (samples) as features" are sheer nonsense.
However, with so few observations (6) you will likely find that several, perhaps many, features individually give perfect separation between the two classes. For example, staring at the posted patterns, I observe that the 6th bit is 0 for the first three samples and 1 for the last three samples. So if the first three are benign and the last three are malware, the 6th bit is a perfect predictor. And there may be more.
You do not need SVM or any clever classifier for this problem. Just find all such perfect predictors and see if they make sense. Passing data to smart black boxes shouldn't be the first step in your analysis. Think about what your data means first. See if you can get a simple classification model by hand. If you fail, proceed with sophisticated algorithms.
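The search for perfect predictors can be sketched in a few lines of MATLAB. This uses only the first 8 bits of each posted pattern, since the full vectors are elided in the question, and it assumes the first three rows are benign and the last three are malware. The variable names are illustrative.

```matlab
% First 8 bits of each posted pattern (truncated for illustration only).
bits = ['10100001'; '00001010'; '00101011'; ...
        '11111111'; '00000111'; '11111111'];
X = bits == '1';                                      % 6x8 logical, one sample per row
isMalware = [false; false; false; true; true; true];  % assumed labelling

% A bit is a perfect predictor if it is constant within each class
% and takes opposite values in the two classes.
benignAll0  = all(~X(~isMalware, :), 1);
benignAll1  = all( X(~isMalware, :), 1);
malwareAll0 = all(~X( isMalware, :), 1);
malwareAll1 = all( X( isMalware, :), 1);

perfect = find((benignAll0 & malwareAll1) | (benignAll1 & malwareAll0));
disp(perfect)   % displays 6 for these 8 bits
```

On the full 105-bit vectors the same test would run over all columns and may return more than one index.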
  11 Comments
Ilya on 15 Apr 2017
Walter, I appreciate this explanation.
I agree that this resource is not an academic journal, and the threshold for posting an answer is much lower than that for a publication. I also note that there are no consistent rules for people who answer on Answers (at least I am not aware of any), and for that reason you can choose any philosophy you like with respect to the quality/thoughtfulness of your answers. Yet I believe that answering questions outside your expertise without doing some verification first is dangerous and often produces plain wrong (not just somewhat incorrect) answers, which is worse than not giving any answer at all. Just like you said - "those people simply are not available", where "those people" means "experts". Because experts are not available, no one is there to refute a wrong answer, and the wrong answer stays on this site forever, serving as a source of confusion and support for similarly misguided future answers.
On my part, I choose to answer only questions for which I consider myself an expert. I doubt that by doing so I fail to provide critical help to people out there. Many people asking questions on this site are students, and they can certainly find other sources of help such as, for instance, their professors. This is especially true for questions such as this one, where the entire discussion revolves around theory and has nothing to do with MATLAB. It's just that submitting a question to Answers takes less effort than scheduling an appointment with faculty, and they resort to this easy way. If they knew the likelihood of getting a plain wrong answer was high, they would likely not resort to this easy way.
I appreciate your desire to help and am not asking you to apply the same level of scrutiny as that for an academic publication. I think though that raising the bar a bit higher would be a positive change toward improving quality, perhaps at the expense of reducing the overall number of answers; I think such a reduction would be acceptable since it would also lead to reduction of plain wrong answers. Also, doing more verification would allow you to learn the material at a deeper level and develop knowledge of new areas. I do not know to what extent you are interested in learning, of course.
