Continuing on the same note as the Greenwald-Khanna post, we discuss another novel data structure for order statistics called Q-Digest. Compared to GK, Q-Digest is much simpler, easier to implement and extends to a distributed solution very easily. The data structure was designed with sensor networks in mind, where minimizing radio transmission overhead is of paramount importance. It originally appeared in this paper by Shrivastava et al
Q-Digest is simply a complete binary tree over the range of values, where the leaf nodes are the values themselves. In the example below, we have a digest over the values 1 to 4 inclusive, and we have seen the value 1 once and the value 4 seven times.
The novelty is the compression algorithm which allows values with low frequencies to propagate up the tree and be combined. In the example below, we have some information loss. We know that there are 10 values between 1 and 2 inclusive, but we don’t know the exact counts of each
Assume you have a web farm and you are collecting response times for user requests. You could have hundreds of millions or even billions of data points and you want to know how your service is doing. You could maintain a running average, but averages are sensitive to outliers, not to mention that you’d need to also look at the standard deviation. Quantiles are a better way to summarize your data. They allow you to answer questions such as “what percent of requests finished faster than x milliseconds” or “what is the maximum response time for 90% of my requests”. Problem with quantiles is that you normally need to keep all the data in memory. If you lets say have 1 billion data points, you’d need about 3.5GB of RAM to store each as an integer.
Here we examine an algorithm that continuously maintains summary information and can answer queries within a small bounded error using only a fraction of the memory that would normally be required. There are numerous algorithms that solve this problem, but the one we will examine is due to Greenwald and Khanna, in their 2001 paper “Space efficient online computation of quantile summaries“. It is more complex to understand than a lot of the other algorithms but it is, as far as I know, the one with the best space usage. Also, a lot of other papers we will examine later use it to solve interesting problems.