Most introductory textbooks on supervised learning examine algorithms from a binary classification point of view, meaning we only need to decide whether an item belongs to a certain class or not. Multi-class classification, an instance of the same problem, is typically handled by training one such binary classifier for each pair of classes. There are various improvements over these techniques, but very few papers examine how to get meaningful probability estimates out of a learning algorithm.
At Sidelines, we use the probability output of some of our classifiers to predict how relevant a sports news article is to a specific team, and we feed that prediction into our ranking algorithm. For example, this article is mostly about the Boston Celtics and should rank high in a Celtics feed; Mavericks fans would also be somewhat interested, but it should not rank as high in their feed.
To get good probability estimates, though, some offline experimentation is needed. An untuned instance of Naive Bayes tends to give estimates either near 0 or near 1, with very few articles in between. I’m speculating that this is because of the large number of words repeated very frequently across all documents. Niculescu-Mizil and Caruana examine methods to tune a variety of algorithms in their 2005 paper “Predicting Good Probabilities With Supervised Learning”. The authors test two calibration methods, Platt Scaling and Isotonic Regression, on SVMs, Logistic Regression, Decision Trees, KNN, Naive Bayes, etc. They conclude that a) Platt Scaling works best when the calibration set is small and suits max-margin algorithms such as SVMs, and b) Isotonic Regression usually outperforms Platt Scaling when the calibration set is large and is the calibration method of choice for Naive Bayes. How to use Isotonic Regression for calibration is described in detail by Zadrozny and Elkan in “Transforming Classifier Scores into Accurate Multiclass Probability Estimates”. The same paper also goes into detail about the multiclass case, but we’ll leave that for another time.
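To make Platt Scaling concrete, here is a minimal, illustrative sketch (not our production code, and the function name and defaults are my own choices): it fits the sigmoid P(y=1|s) = 1 / (1 + exp(A·s + B)) to held-out (score, label) pairs by gradient descent on the log loss.

```python
import math

def platt_scale(scores, labels, lr=0.1, epochs=2000):
    """Fit P(y=1 | s) = 1 / (1 + exp(A*s + B)) to held-out
    (score, label) pairs by gradient descent on the log loss.
    Returns a function mapping a raw score to a calibrated probability."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (y - p) * s  # d(log loss)/dA, summed over samples
            gB += (y - p)      # d(log loss)/dB
        A -= lr * gA / n
        B -= lr * gB / n
    return lambda s: 1.0 / (1.0 + math.exp(A * s + B))
```

Platt’s original method also smooths the 0/1 targets and uses a more robust optimizer; this sketch keeps only the core idea of fitting a sigmoid on a held-out calibration set.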
Here is the methodology one can use to calibrate NB:
- A new article comes into the system and is put through a Naive Bayes classifier, which outputs a probability. Our goal is to decide, with a certain confidence, how relevant the article is to our sports community. If it is relevant, we keep it and put it through further processing; otherwise we discard it.
- To make it easier to reason about the results, discretize the probability space. We use five bins: [0, 0.35) is a no-go, [0.35, 0.5) is mildly relevant, [0.5, 0.7) is relevant, [0.7, 0.9) is very relevant, and [0.9, 1.0] is an absolute must-have. The bins are narrower toward the right edge because Naive Bayes shows the opposite bias after calibration. We are also highly unlikely to display anything with probability < 0.5 to the user (depending on other inputs), because false positives create a much worse user experience than false negatives.
- Label a large set of previously classified news articles with carefully curated probability estimates; about 2,000 should suffice. These labeled samples should NOT be the ones used to train the model initially, and make sure there are enough samples for each bin.
- Put the labeled samples through isotonic regression. R has a package that does this using a generalized form of the Pool Adjacent Violators Algorithm (PAVA). As I’m not proficient with R, this step was a total nightmare. Once I have a better understanding of PAVA, my goal is to rewrite this part in Python.
- Take the result and encode it as a function in your favorite language, then calibrate subsequent/test data with that function.
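Since I eventually want this in Python anyway, here is a rough sketch of what the last three steps could look like: PAVA on the sorted calibration set, a step-function calibrator, and a mapping into the five bins above. The function names and the step-function lookup are my own simplifications; a production version would interpolate between points and handle tied scores.

```python
import bisect

def pava(values):
    """Pool Adjacent Violators: return the non-decreasing sequence
    closest (in squared error) to `values`."""
    # Each block holds (mean, count); merge backwards while the
    # monotonicity constraint is violated.
    blocks = []
    for v in values:
        blocks.append((v, 1))
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            c = c1 + c2
            blocks.append(((m1 * c1 + m2 * c2) / c, c))
    out = []
    for m, c in blocks:
        out.extend([m] * c)
    return out

def fit_isotonic_calibration(scores, labels):
    """Sort the labeled calibration set by raw classifier score, run PAVA
    on the labels, and return a step-function calibrator."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = pava([labels[i] for i in order])
    def calibrate(s):
        # Calibrated value of the nearest training score at or below s.
        i = bisect.bisect_right(xs, s) - 1
        return ys[max(i, 0)]
    return calibrate

# The five bins from the methodology above: (upper bound, label).
BINS = [(0.35, "no-go"), (0.5, "mildly relevant"), (0.7, "relevant"),
        (0.9, "very relevant"), (1.01, "must have")]

def bin_label(p):
    for upper, label in BINS:
        if p < upper:
            return label
```

For example, `pava([0, 1, 0, 1, 1])` pools the violating middle pair into `[0, 0.5, 0.5, 1, 1]`, which is exactly the averaging behavior that pulls Naive Bayes estimates away from the 0/1 extremes.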
You may notice that the probabilities output in step 5 do not necessarily sum to 1. I’m convinced that this is wrong and some normalization is needed, but it has not affected us in practice (yet).
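The simplest normalization I can think of is to divide each per-class calibrated probability by their sum; a minimal sketch (the function name is my own):

```python
def normalize(class_probs):
    """Rescale per-class calibrated probabilities so they sum to 1.
    `class_probs` maps class name -> calibrated probability."""
    total = sum(class_probs.values())
    return {c: p / total for c, p in class_probs.items()}
```

So if two teams both come out at 0.6, each would be rescaled to 0.5. Whether this is the right way to combine per-class calibrators is part of the multiclass discussion we deferred above.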