If you are doing any sort of classification with thousands of features (as is common in text classification) and need to update your Bag of Words models online, you'll find this rather simple technique very handy. At some point while developing Sidelines, we were using LDA to cluster tens of thousands of articles into topics (this is no longer the case, however). After extracting the text we would: 1) remove stop words, 2) construct a dictionary mapping each word to a (unique id, frequency) tuple and 3) prune that dictionary to remove very rarely and very frequently used words. The biggest problem we faced was running out of memory, as we had to make a pass over all our data to construct this dictionary. Even when we segmented our corpus and split it across different machines, keeping these huge dictionaries in memory was a pain, while swapping to disk was extremely slow.
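To make the memory problem concrete, here is a minimal sketch of that pre-hashing pipeline. The stop word list, pruning thresholds, and function names are illustrative, not what we actually ran; the point is that the whole corpus must be counted before the dictionary can be finalized.

```python
from collections import Counter

# Illustrative stop word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def build_dictionary(documents, min_count=2, max_doc_frac=0.9):
    """Map each surviving word to a (unique id, frequency) tuple.

    Requires a full pass over `documents` -- this is the step that
    blew up our memory.
    """
    counts = Counter()    # total occurrences per word
    doc_freq = Counter()  # number of documents containing the word
    for doc in documents:
        tokens = [w for w in doc.lower().split() if w not in STOP_WORDS]
        counts.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    dictionary = {}
    for word, freq in counts.items():
        # Prune very rare words and words present in almost every document.
        if freq < min_count or doc_freq[word] / n_docs > max_doc_frac:
            continue
        dictionary[word] = (len(dictionary), freq)
    return dictionary

docs = [
    "the patriots won the game",
    "patriots fans love the game",
    "a rocket launch",
]
d = build_dictionary(docs)
```

Note that nothing can be emitted until every document has been seen, which is exactly what breaks down for an online workload.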
The way we solved it was to use the so-called "hashing trick", first described in "Feature Hashing for Large Scale Multitask Learning" by Weinberger et al. but almost certainly used in various ML package implementations long before. The hashing trick can be summarized as follows: "Instead of storing a mapping from each word to its frequency count, store (key, val) -> (h(word), freq)". This has two immediate advantages. The first is the obvious space savings. The second and more important advantage is that we don't have to make a pass over all our data first to figure out how big this dictionary needs to be. That meant we could update our data structures dynamically as new documents arrived.
dictionary = dict()

def add_document(text):
    """
    @param text: an iterable of tokens; perform stemming, tokenizing and
    stop word removal before passing this in
    """
    for word in text:
        hashed_token = hash_fcn(word)  # any stable hash function
        dictionary[hashed_token] = dictionary.get(hashed_token, 0) + 1
I should mention that in our actual implementation we assumed a massive hash table to limit collisions and used a better hash function than the default one provided by Python.
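One reason not to rely on Python's built-in `hash()` is that it is randomized per process for strings (since Python 3.3), so hashes aren't stable across runs. Below is a sketch of one reasonable choice, not the function we actually used; the 2**24-bucket table size is likewise illustrative.

```python
import hashlib

# Illustrative table size: a large table keeps the collision rate low.
TABLE_SIZE = 2 ** 24

def hash_fcn(word):
    """Stable, well-distributed hash of a token into [0, TABLE_SIZE)."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE
```

Because the function is deterministic, counts accumulated on different machines or different days land in the same buckets, which is what makes sharding the corpus workable.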
There was one remaining problem to be solved. For some of our post-classification applications we actually needed to know which words comprised a topic, but that information was lost during hashing. For example, we wanted to display to the user that a cluster of documents is about "NFL, Patriots, Gronkowski, Astronaut". Storing a full inverse mapping of (k, v) -> (id, word) would have negated any savings we made. To work around this, we kept a static mapping of the most used words (n-grams) in sports to their hashes. This mapping took very little space (<25 MB) and we could serialize it to disk whenever we did not need it.
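The idea can be sketched as follows. The word list, `label_topic` helper, and use of `pickle` are hypothetical stand-ins for our actual implementation; the only requirement is that the static map is built with the same hash function used when counting (Python's `hash()` stands in for it here).

```python
import pickle

# Hypothetical static list of the most frequent sports n-grams; the
# real list was much larger but still small enough to keep in memory.
TOP_SPORTS_TERMS = ["nfl", "patriots", "gronkowski", "touchdown"]

# hash -> word, built once from the static list.
inverse_map = {hash(w): w for w in TOP_SPORTS_TERMS}

def label_topic(top_hashes):
    """Translate a topic's top hashed tokens back into readable words,
    silently dropping hashes that fall outside the static list."""
    return [inverse_map[h] for h in top_hashes if h in inverse_map]

# Serialize the map when it isn't needed, restore it on demand.
blob = pickle.dumps(inverse_map)
restored = pickle.loads(blob)
```

Words outside the static list simply go unlabelled, which was an acceptable trade-off for us: topic labels only need the most recognizable terms anyway.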