Category Archives: python

Being truly asynchronous with Tornado

In the past few days I’ve had the chance to play a bit with Tornado, a non-blocking web server for Python similar to node.js. Tornado works by using epoll or whatever I/O event notification facility the system provides. It is single threaded and works like a fancy while True loop. As a consequence, whenever we do some blocking I/O (e.g. a database call), the web server cannot process other requests. While that may not be a big deal on very low traffic sites, or if your I/O subsystem is extremely fast, it starts becoming an issue as your traffic grows or if you’re hosted on something like Amazon AWS, which is notorious for its slow and inconsistent disk I/O performance. The biggest problems I encountered with Tornado are: 1) the lack of robust asynchronous libraries for things like DB access, and 2) Python’s own awkwardness when dealing with async code. Node.js programmers don’t have either of these problems because the community writes software with async in mind, and JavaScript as a language offers anonymous functions, which make callback-style code tolerable. The purpose of this post is to encourage Python programmers using Tornado to pay attention to their blocking calls or, if they don’t need any of the features Tornado offers, to stick to a WSGI server such as uWSGI or Gunicorn. Continue reading
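
To make the blocking problem concrete, here is a minimal sketch using current Tornado APIs (the example.com URL, the port, and the two-second sleep are placeholders for illustration). The first handler stalls the entire IOLoop while it waits, so no other request is served; the second awaits an async client and yields control back to the loop.

import time

import tornado.ioloop
import tornado.web
from tornado.httpclient import AsyncHTTPClient


class BlockingHandler(tornado.web.RequestHandler):
    def get(self):
        # A synchronous call like this freezes the single-threaded IOLoop:
        # every other request waits until it returns.
        time.sleep(2)  # stand-in for a blocking DB or disk call
        self.write("done, but everyone else waited 2 seconds")


class NonBlockingHandler(tornado.web.RequestHandler):
    async def get(self):
        # Awaiting an async client hands control back to the IOLoop,
        # so other requests keep being processed while we wait.
        client = AsyncHTTPClient()
        response = await client.fetch("http://example.com/")
        self.write("fetched %d bytes" % len(response.body))


def make_app():
    return tornado.web.Application([
        (r"/blocking", BlockingHandler),
        (r"/nonblocking", NonBlockingHandler),
    ])


if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()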


String searching using Aho-Corasick

At Sidelines, we frequently need to quickly identify which athletes a news article may be talking about. Most of the time, a number of machine learning classifiers make sure that the article gets correctly labeled, but sometimes we need to run the title or even the entire text of the article against our database of athlete names. Using pre-compiled regular expressions does not scale (we have thousands of athletes and staff names in our database) and running each keyword against the target text would be too slow.

Today we talk about an algorithm whose search phase runs in O(n + m) time, where n is the length of the target text and m is the number of matches. This algorithm is ideal if you have a large number of keywords and you want to search for all of them against a corpus of documents. The algorithm, called Aho-Corasick after its authors, first appeared in a 1975 paper titled Efficient string matching: an aid to bibliographic search. The downside is that you pay a one-time cost to construct a trie-like data structure, which is then used to search efficiently. Continue reading
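
To give a feel for how the pieces fit together, here is a small pure-Python sketch of the idea (the athlete names below are made up, not from our database): build a trie from the keywords, wire up failure links with a breadth-first pass, then scan the text once, reporting every match as it completes.

from collections import deque


def build_automaton(keywords):
    # One-time construction cost: a trie of the keywords plus failure links.
    goto = [{}]      # goto[s] maps a character to the next state
    fail = [0]       # fail[s] is the state to fall back to on a mismatch
    output = [[]]    # output[s] lists the keywords that end at state s

    # Phase 1: insert every keyword into the trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Phase 2: compute failure links breadth-first; a state's failure link
    # points to the longest proper suffix of its path that is also in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Matches that end at the failure state also end here.
            output[nxt].extend(output[fail[nxt]])

    return goto, fail, output


def search(text, automaton):
    # Single pass over the text; yields (end_index, keyword) for every match.
    goto, fail, output = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            yield i, word


if __name__ == "__main__":
    names = ["tom brady", "brady", "aaron rodgers"]   # illustrative keyword list
    automaton = build_automaton(names)
    title = "why tom brady and aaron rodgers keep winning"
    for end, name in search(title, automaton):
        print(name, "ends at index", end)

Note that "tom brady" and "brady" are both reported at the same end position: the failure links let overlapping keywords share the single pass over the text instead of requiring one scan per keyword.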