Stream Algorithms: Order Statistics

Assume you have a web farm and you are collecting response times for user requests. You could have hundreds of millions or even billions of data points and you want to know how your service is doing. You could maintain a running average, but averages are sensitive to outliers, not to mention that you’d need to also look at the standard deviation. Quantiles are a better way to summarize your data. They allow you to answer questions such as “what percent of requests finished faster than x milliseconds” or “what is the maximum response time for 90% of my requests”. Problem with quantiles is that you normally need to keep all the data in memory. If you lets say have 1 billion data points, you’d need about 3.5GB of RAM to store each as an integer.

Here we examine an algorithm that continuously maintains summary information and can answer queries within a small bounded error using only a fraction of the memory that would normally be required. There are numerous algorithms that solve this problem, but the one we will examine is due to Greenwald and Khanna, in their 2001 paper “Space efficient online computation of quantile summaries“. It is more complex to understand than a lot of the other algorithms but it is, as far as I know, the one with the best space usage. Also, a lot of other papers we will examine later use it to solve interesting problems.

Greenwald-Khanna (GK from now on), is based on a simple data structure. We simply maintain tuples of (v, g, Δ)  and use those tuples when querying to find an estimate. In the tuple, v is our value from the data stream. The value for g is such that adding all values of g from the first tuple to any other tuple would give the minimum rank of the value for that tuple. Then adding  Δ to that sum would give the maximum rank for that tuple. The algorithm does insertions and compressions carefully to keep as few tuples around as possible while always maintaining that invariant that for any tuple we have g+Δ <= 2 * ε * n. As long as we can keep that in mind, the algorithm will always give correct  ε-approximate answers. Valid values for  ε largely depend on what kind of queries you want to do. For example, if you want the 90th percentile then  ε=0.1 is ok, for the 99th percentile  ε=0.01 is ok and so on.

The GK Insert() algorithm is straightforward. It scans the list of tuples we maintain to find the right place where the new tuple should be inserted. That tuple is always {value, 1, floor(2 * ε * n) }, where ε is the precision we are aiming for and n is the number of elements seen so far. An exception to the above is when inserting the new maximum or minimum. In that case, the tuple is {value, 1, 0}.

Insert(double v) {
Tuple newValue = new Tuple(v, 1);
int idx = 0;
foreach (Tuple t in S) {
    if (v < t.value)
break;
idx++;
  }

if (idx == 0 || idx == S.Count) // the new element is the new min or max
newValue.delta = 0;
else
newValue.delta = Floor(2 * epsilon * N);

S.Insert(idx, newValue);
++N;
  Compress();
}

The heart of the GK algorithm though lies with the Compress() function. This is where we get rid of unnecessary tuples while maintaining the invariants necessary to ensure that we can give correct answers.

for (int i = 0; i < S.Count - 1; ++i) {
if (S[i].g + S[i + 1].g + S[i + 1].delta <= Floor(2 * epsilon * N)) {
S[i + 1].g += S[i].g;
S.RemoveAt(i);
}
}

This simplified compress code walks the list and merges pairs as long as the value of g for the new pair plus Δ does not violate our invariant. Finally, we can now easily query our data structure. Recall that for any tuple, the minimum rank for that value is the sum of g up to that tuple and the maximum rank is found by adding Δ to that sum. So simply iterate over the list of tuples and stop once you find a tuple having a value greater than the element you are searching followed by a tuple having a value less than the element. To algorithm to compute the quantile is given below:

for (int i = 0; i < S.Count - 1; ++i) {
rMin += S[i].g;
rMax = rMin + S[i].delta;

if (r - rMin <= epsilon*N && rMax - r <= epsilon*N) {
return S[i].value;
}
}

The variable r above is the rank we are looking for. The above works by continuously calculating the minimum and maximum possible ranks for each tuple as per our initial definitions. If our wanted rank falls in that range, then we return the value for that tuple.
For more information on GK, I encourage you to read the original paper.

Advertisements

3 thoughts on “Stream Algorithms: Order Statistics

  1. http://www.miragehairfibers.pl/

    In a similar sense, business owners would be wise to offer their workers something to unite under.
    An HVAC repair clients are most often started
    with a technician who’s learned the trade through previous employment.
    Birthdays, Anniversaries, Housewarmings, Weddings, Baby Showers, Christmas and Valentines Day.

  2. www.ablotech.pl

    Often we hear experts on television that report a certain stock
    is likely to soar and now is the time to purchase. She invites
    one to visit her site where she is going to share a proven approach to start an business online.

    Donnie Jonston may be the author of this article about the best
    way to make money on Ebay Donnie has a lot of work
    experience like a writer as well as working with drop shippers in the variety of entrepreneurial ventures.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s