Wednesday, August 4, 2010

A taxonomy of range query problems

In this post, I will try to enumerate the range query problems that I know about. Let me know if I'm missing anything.

The query. Say you have n points in the plane, and you query for the points in an axis-parallel rectangle. What could we mean by "query"?
  • existential range queries: Is there any point in the rectangle?
  • counting queries: How many points are there in the rectangle?
  • reporting queries: List the points in the rectangle. Unlike the previous cases, the query time is now broken into two components: it is usually given as f(n) + k*g(n), where k is the number of output points.
Now let's assume the points have some number associated to them (a weight or a priority). Then one could ask the following natural queries:
  • weighted counting: What is the total weight of the points inside?
  • range min (max) query
  • range median query. (Possible generalizations: selection or quantiles.)
  • top-k reporting: Report just the top k guys, by priority (for k given). One may demand the output to be sorted. More stringently, one may ask the query algorithm to enumerate points sorted by priority, in time g(n) per point, until the user says "stop."
The number associated to a point can also be a "color". For instance, points could represent Irish pubs / Belgian bars / etc, and we may only want one point of each type in our output. Then the queries become:
  • colored counting: How many distinct colors are in the rectangle?
  • colored reporting: List the distinct colors in the rectangle (possibly with one example point from each color).
  • top-k colored reporting: If the colors are sorted by priorities (e.g. I prefer points of color 1 over points of color 2), one can then ask for the top-k distinct colors inside the rectangle.

Dynamism. The problem could be:
  • static: Preprocess the point set and then answer queries.
  • dynamic: Insert and delete from the point set.
  • incremental / decremental: We only insert or delete.
  • offline: The sequence of operations is known in advance. This is enough for many applications to algorithms.
  • parametric / kinetic. I confess ignorance with respect to these.

Orthogonal range queries. The setting from above works in any number of dimensions d≥1: the data set consists of n points in d-dimensional space and the query is a box [a1, b1]×···×[ad, bd]. This setup is usually called "orthogonal range queries".

We can consider the following important restrictions on the query:
  • dominance queries: the box is [0, b1]×···×[0, bd]. In other words, we are asking for the points dominated, coordinate-wise, by a point (b1, ..., bd).

  • k-sided queries: exactly 2d-k values in (a1, a2, ..., ad) are zero. For instance, a 3-sided query in 2D is a rectangle with one side on the x axis. Dominance queries are the special case of d-sided queries.

The universe. The coordinates of the points and queries can come from the following sets:
  • general universe. In the standard RAM model, we assume that the coordinates are integers that fit in a machine word.

  • rank space: the coordinates are from {1, 2, ..., n}. One can reduce any static problem to rank space by running 2d predecessor queries. Most problems can be shown to be at least as hard as predecessor search, so their complexity is precisely: "optimal cost of predecessor search" + "optimal cost for the problem in rank space". In other words, for most problems it is sufficient to solve them in rank space.

  • dense universe: the points are exactly the points of the grid [1, n1]×···×[1, nd] where n1n2···nd = n. In 1D this is the same as rank space, but in 2 or more dimensions the problems are very different. (To my knowledgeable readers: Is there a standard name for this case? For counting queries people call this "the partial sums problem", but how about e.g. min queries?)
For dynamic problems, the "rank space" changes when a new coordinate value is inserted. Thus, a rank-space solution must support a "insert value" operation that increments all coordinate values after a given one, creating space for a newly inserted point. (I have heard the term "list space" for this. Should we just use "rank space"?)


Stabbing. So far, our data set consisted of points and the queries asked for points in a given rectangle. Conversely, one can consider a data set of rectangles; the query is a point and asks about the rectangles containing that point ("stabbed" by it). This problem is important, among others, in routing: we can have rules for packets coming from some range of IP addresses and going to some other range of IP addresses.

The notion of rank space, and all query combinations still make sense. For instance, interval max stabbing is the following problem: given a set of interval (in 1D) with priorities, return the interval of highest priority stabbed by a query point.

Note that the rectangles in the input can intersect! If we ask that the rectangles be disjoint, or more stringently, be a partition of the space, we obtain the point location problem.


Rectangle-rectangle queries. So far we looked at containment relations between rectangles and points. More generally, the data set can consist of rectangles, and the query can also be a rectangle. Then one can ask:
  • intersection queries: analyze the set of input rectangles that intersect the query rectangle.
  • containment queries: analyze the set of rectangles that contain / are-contained-by the query.
Two important special cases arise when the rectangles degenerate to line segments:
  • orthogonal segment intersection: Given a set of horizontal segments, find the ones that intersect a vertical query segment.

  • orthogonal ray shooting: Given a set of horizontal segments, find the lowest one immediately above a query point. In other words, consider a vertical ray leaving from the point and find the first segment it intersects. (This is the min segment intersection query, where the priority of each horizontal segment is its y coordinate.)

More complex geometry. Of course, our ranges need not be orthogonal. One can consider:
  • balls
  • arbitrary lines
  • half spaces
  • simplices (e.g 2D triangles).
In non-orthogonal geometry, the concept of rank space disappears. However, most other combinations are possible. For instance, one could ask about the points in a query range; the ranges stabbed by a query point; the ranges intersecting a query range; the first rangeintersected by a ray; etc. We can ask existential, counting, or reporting questions, on ranges that can have weights or priorities.

9 comments:

JeffE said...

Under dynamism, add parametric/kinetic. In its simplest form, a kinetic data structure simply requires that the queries arrive in order along the 'time' coordinate, but the formulation allows for different interesting tradeoffs.

Under universes, add R^d — you know, real geometry.

Nearest neighbor and ray-shooting queries are special cases of range-min queries, where the weight of a point/object depends on the query.

All the special types of queries are special cases of generic range spaces, which can be defined by a set of objects X, a set of queries Q, and a function over pairs in X × Q. For example: (points, rectangles, [p∈r]) is 2d orthogonal range searching; (rectangles, points, [p∈r]) is 2d rectangle stabbing; (points, points, |pq|) is nearest-neighbor searching; (strings, strings, [x is a substring of y]) is Google; (Turing machines, input strings, running time) is all of complexity theory; and so on.

Finally, there are umpty-dozen kinds of approximation to consider.

Mihai said...

Thanks Jeff. I added the parametric/kinetic variant.

R^d is completely besides the point for orthogonal problems, since you should start by converting to rank space. (After that, what does the real model allow me to do?) For non-orthogonal queries, most data structures do use the Real RAM, since they don't want to worry about precision.

I have a hard time seeing nearest neighbor as a range query. In any case there's enough material on it that it's now a separate topic. :)

Orthogonal ray shooting is a range query -- I added a clarification. (Max segment intersection with priorities = y-coordinates.) But then I should probably not consider general ray shooting as a range query.

Another hyper-generic view you can take is semigroup range sum. For instance, counting works in the (N,+) semigroup, range min works in the (N,min) semigroup, etc.

About approximation, I have yet to be convinced about its value in range queries (as opposed to other things, like ANN). In any case, Sariel promised to do all approximation in the upcoming summer school :)

D. Eppstein said...

Did you forget, or intentionally leave out, recursive queries? Where what you want to do to the data within the range is apply some other range query on some unrelated dimension of the same data.

Of course one can always flatten the recursion and get some kind of range space in which you're just doing a counting or reporting query or whatever. But the ranges for this flattened space might be harder to describe.

Mihai said...

You're right, David, that concept deserves a mention. But it's more general than range queries. Eg, I can have records with 3D points and some string inside, and make range queries on the points and pattern matching queries on the string.

By the way, do you think it would be useful to upload this list to Wikipedia? Perhaps here: http://en.wikipedia.org/wiki/Range_query

What should one do about topics at the intersection of two fields (here, CG and databases), which nonetheless mean sufficiently different things inside those fields?

Guilherme said...

There are some variations for the weighted version based on what kind of sum is used. For example, if the sum is taken over a group, then static partial sums are trivial, but if you are in a commutative semigroup, then you cannot use subtraction and the problem gets much more complicated. Idempotence (x+x=x) is another property worth mentioning.

Data structures may be exact or approximate. In the approximate version, you can approximate the count, the fraction of the points within the range, or allow points near the boundary to be misclassified.

One may or may not consider problems where the objects are not points as range searching. For example, given a set of triangles, find the triangles intersected or contained in a rectangle.

Jack said...

Decomposability is a key property identified by Bentley.

Anonymous said...

The colored version (especially counting) is only decomposable w.r.t. colors, causing a substantial gap between best-known colored/non-colored bounds.

Anonymous said...

In your MADALGO's summer school tutorial on orthogonal range queries you claim that you can do better 2D orthogonal range queries but you do not give the output sensitive term. Do you mean that you are able to get O(n log log n) space with O(log log n +k ) or O(log log n(1+k)) query time (the latter would be only a slight improvement over FOCS00 paper).

Mihai said...

It's space O(n lglg n) and time O(k lglg n). I strongly suspect a lower bound saying that O(k)-time reporting requires Omega(n lg^eps n) time -- but I don't have a formal proof yet.