If you are doing any sort of classification with thousands of features (as is common in text classification) and need to update your bag-of-words models online, then you’ll find this rather simple technique very handy. At some point while developing Sidelines, we were using LDA to cluster tens of thousands of articles into topics (this is no longer the case, however). After extracting the text we would: 1) remove stop words, 2) construct a dictionary mapping each word to a (unique id, frequency) tuple, and 3) prune that dictionary to remove words that are used very rarely or very often. The biggest problem we faced was running out of memory, as we had to make a pass over all our data to construct this dictionary. Even when we tried to segment our corpus and split it across different machines, keeping these huge dictionaries in memory was a pain, while swapping to disk was extremely slow.
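The three preprocessing steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the code used at Sidelines: the stop-word list, the whitespace tokenizer, the choice of document frequency as the "frequency", and the `min_docs` / `max_doc_fraction` thresholds are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def build_dictionary(documents, min_docs=2, max_doc_fraction=0.75):
    """Map each kept word to an (id, document_frequency) tuple.

    Assumes the 'frequency' is the number of documents containing the word.
    """
    doc_freq = Counter()
    for text in documents:
        # Step 1: drop stop words (naive whitespace tokenization).
        tokens = {t for t in text.lower().split() if t not in STOP_WORDS}
        # Count each word at most once per document.
        doc_freq.update(tokens)

    # Step 3: prune words that are too rare or too common.
    max_docs = max_doc_fraction * len(documents)
    kept = sorted(w for w, df in doc_freq.items() if min_docs <= df <= max_docs)

    # Step 2: assign a unique id to every surviving word.
    return {word: (idx, doc_freq[word]) for idx, word in enumerate(kept)}


docs = [
    "nba celtics beat the mavericks",
    "nba mavericks trade rumors",
    "nba celtics sign a guard",
    "nba draft rumors",
]
dictionary = build_dictionary(docs)
print(dictionary)
```

Note that `doc_freq` has to hold every distinct word in the corpus before anything can be pruned, which is exactly where the memory pressure described above comes from.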
Most introductory textbooks on supervised learning algorithms present them from a binary classification point of view, meaning we only need to decide whether an item belongs to a certain class or not. Multi-class classification, an instance of the same problem, is typically done by training one such binary classifier for each pair of classes. There are various improvements over these techniques, but very few papers examine how to get meaningful probability estimates out of a learning algorithm.
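To make the pairwise approach concrete, here is a minimal sketch of combining binary classifiers by majority vote. The toy one-dimensional "centroid" classifiers, the class names, and the helper functions are all hypothetical stand-ins for whatever trained binary models you actually have.

```python
from collections import Counter
from itertools import combinations

# Hypothetical 1-D class centroids standing in for trained binary models.
CENTROIDS = {"celtics": 0.0, "mavericks": 5.0, "lakers": 10.0}


def make_pair_classifier(a, b):
    """Return a toy binary classifier that picks the closer of two centroids."""
    def classify(x):
        return a if abs(x - CENTROIDS[a]) <= abs(x - CENTROIDS[b]) else b
    return classify


def one_vs_one_predict(x, classes, pairwise):
    """Let every pairwise classifier vote and return the most voted class."""
    votes = Counter(pairwise[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0], votes


classes = ["celtics", "mavericks", "lakers"]
pairwise = {pair: make_pair_classifier(*pair) for pair in combinations(classes, 2)}

label, votes = one_vs_one_predict(4.0, classes, pairwise)
print(label, dict(votes))
```

The vote counts give a label but are not calibrated probabilities, which is the gap the paragraph above points at.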
At Sidelines, we use the probability output of some of our classifiers to predict how relevant a sports news article is to a specific team. We then use that probability as one of the inputs to our ranking algorithm. For example, this article is mostly about the Boston Celtics and should rank high in a Celtics feed, but Mavericks fans would also be somewhat interested, although it should not rank as high in their feed.