Thanks for creating NaturalNode!
I am using your Bayes classifier in my project. While looking into the implementation, I found that it applies smoothing when calculating the probabilities.
Smoothing words that appear only in the test set skews the probability towards whichever class has the fewest features. For instance:
say smoothing === 1, class A has 2 features, and class B has 3. For an unknown word, (0 + 1) / 2 is bigger than (0 + 1) / 3, so A wins.
I understand that smoothing may be useful for the training set, but is it really necessary for the test set? Why not just discard the tokens that are not in classFeatures[label]?
while (i--) {
  if (observation[i]) {
    var count = this.classFeatures[label][i] || this.smoothing;
    // numbers are tiny, add logs rather than take product
    prob += Math.log(count / this.classTotals[label]);
  }
}
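To make the skew concrete, here is a minimal sketch (hypothetical data and function names, not the natural API) of what the quoted loop does with a single unknown token, alongside the proposed alternative of skipping tokens missing from classFeatures[label]:

```javascript
var smoothing = 1;
var classTotals = { A: 2, B: 3 };       // class A has 2 features, class B has 3
var classFeatures = { A: {}, B: {} };   // the test token is unknown to both classes

// Mirrors the quoted loop: an unknown token falls back to `smoothing`.
function smoothedLogProb(label) {
  var count = classFeatures[label].unknown || smoothing;
  return Math.log(count / classTotals[label]);
}

// Proposed alternative: unknown tokens contribute nothing to either class.
function skippingLogProb(label) {
  var count = classFeatures[label].unknown;
  return count ? Math.log(count / classTotals[label]) : 0;
}

// log(1/2) > log(1/3), so A scores higher purely because it has fewer features.
console.log(smoothedLogProb('A') > smoothedLogProb('B')); // true
// With skipping, the unknown token is neutral for both classes.
console.log(skippingLogProb('A') === skippingLogProb('B')); // true
```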