At Sidelines, we frequently need to quickly identify which athletes a news article may be talking about. Most of the time, a number of machine learning classifiers make sure that the article gets correctly labeled, but sometimes we need to run the title or even the entire text of the article against our database of athlete names. Using pre-compiled regular expressions does not scale (we have thousands of athletes and staff names in our database) and running each keyword against the target text would be too slow.
Today we talk about an algorithm that is O(n + m), where n is the length of the target text and m is the number of matches. This algorithm is ideal if you have a large number of keywords and you want to search for all of them against a corpus of documents. The algorithm, called Aho-Corasick after its authors, first appeared in a 1975 paper titled Efficient string matching: an aid to bibliographic search. The downside is that you have to pay a one-time hit to construct a trie-like data structure which is then used to efficiently search. Continue reading