Spatial data structures
I work for a company that is the leading supplier of automotive maps, and wants to be the leading supplier of online maps. So it was only a matter of time that I needed to learn more about how spatial extensions work in different open source databases. Let's start from the beginning, understanding various spatial data structures that are used in implementations...
Links are provided to Wikipedia articles - which are both comprehensive, yet easy to understand - for those who want to get a deeper understanding of each structure. All Wikipedia articles on spatial indexes are listed here: http://en.wikipedia.org/wiki/Spatial_index#Spatial_index
The B-tree (and its variations B+tree and B*tree) is the classic index structure used in pretty much all database products from Oracle to PostgreSQL and MySQL to MongoDB. The B-tree is not a spatial index. However, we mention it here since it would typically be used underlying a quad-tree index, and also just to provide a base level to which the other indexes can be compared.
Figure 1 A B-tree
A B-tree is a generalized form of a binary tree. The key feature making it efficient for disk based database usage is the ability to store more than one record per node, and to have more than 2 children per node. This makes the tree flatter than a binary tree would be. Finally, allowing each node to be partially empty, allows for more efficient update operations, as records can be inserted and deleted into the empty space, and only infrequently is there a need to split or combine nodes.
In a B+tree the leaf node will contain copies of the records that in a simple B-tree would be stored only at the higher levels of the trees. This allows a sequential scan to complete purely by reading the leaf nodes without the need to jump "upwards" to fetch a single missing record. The other optimization is to make the leaf nodes a linked list, again allowing a scan to proceed directly to the next leaf node, without having to go "upwards" via the common parent node.
The R-tree is the most common index structure for organizing spatial data / geometric objects. It can be thought of as "B-tree for rectangles". In fact, several basic features of a B-tree are retained, such as the tree structure itself, and splitting and combining of nodes.
For other shapes than rectangles the concept of Minimum Bounding Rectangle is introduced - this is a rectangle within which the shape can fit completely.
R-trees can be used for shapes of any dimensions, but currently 2-dimensional maps are certainly most common. The problem of how to map the Earth's sphere into a 2-dimensional surface is outside the scope of R-trees: If a 2-dimensional R-tree is used, then the 2-dimensional surface is already assumed to exist.
Figure 2 An R-tree is a hierarchy of Minimum Bounding Rectangles (here a 3D tree)
The idea of an R-tree is to group nearby objects and enclose them inside yet another Minimum Bounding Rectangle. These rectangles are again enclosed in larger rectangles, eventually leading to a root node: the largest rectangle enclosing all objects in the database.
The input to a search operation is likewise a search rectangle. The R-tree index can now be traversed such that rectangles that do not overlap with the search rectangle are immediately pruned, while nodes for overlapping rectangles are traversed and overlapping leaf nodes included in the search result.
Descriptions of R-tree algorithms - including the linked Wikipedia article - are usually concerned only with finding the right set of Minimum Bounding Rectangles. However, when the actual shapes are not all rectangles, it doesn't mean that all shapes in this result set would actually intersect or be enclosed by the search rectangle. (Nor does it have to be the case that the search input is a rectangle either.) The basic idea with R-tree is still to efficiently find a superset of matching shapes, after which this set can be checked one-by-one with some other algorithm, which is different for each shape (lines, spheres, polygons). For instance, to check if two lines intersect, the equation
a1 x1 +b1 y1 + c1 = a2 x2 +b2 y2 + c2 (for valid ranges of x and y)
needs to be solved.
Insert / update
The different variants of R-tree structures differ mainly in how they build the index during inserts, and this can have significant impact on the structure of the tree and resulting efficiency. The differences arise in particular when nodes are full and need to be split. There is no single correct way to split the objects into 2 separate bounding rectangles, and furthermore computation to make the decision is expensive, leading to the use of simplifying heuristics.
The following picture show bounding rectangles in red, resulting from indexing the same 2D points with different algorithms:
Figure 3 Effect of building an R-tree with different algorithms
The picture illustrates how the 2 first algorithms result in very long and narrow rectangles. This is clearly sub-optimal: A typical application would be searching for some small area in a particular location, say Berlin. With the first two indexes, one would a) have to retrieve several pages to get all those points in Berlin, and b) one would have to read a lot of unneeded data that happens to reside on the same pages even if its quite far from Berlin. The third R*tree index on the other hand seems to have grouped the points into rectangles which minimize the radius of each rectangle, usually exactly what you want.
To mention a few variants:
The R+tree modifies the R-tree algorithm such that the bounding rectangles for each node are non-overlapping. To achieve this, if an object is inserted that intersects the rectangles of 2 or more nodes, it is copied to both/all nodes.
As seen above, the R*tree uses an algorithm which results in relatively quadratic rectangles. In addition, as a different approach to re-balancing the tree it also removes already existing objects from the tree and re-inserts them, which may lead to objects becoming stored in different nodes than their original location.
The quad tree is a simple index for a 2D space, which is based on the idea of dividing a rectangle into 4 parts of equal size, then dividing the resulting rectangles again, and again, for better granularity. This creates a search tree where each node has exactly 4 children.
The 3D counterpart to a quad-tree is called octree, it divides a 3D space into 8 parts.
A key benefit of the quadtree is that by labeling each sub-rectangle with the numbers 0 ... 3, it is possible to uniquely represent each area with a string of numbers. Whether stored as a string, integer1 or just bytes, this string of numbers is straightforward to store into a normal B-tree and rectangular areas can be searched with less-than or greater-than operations.
As a result, if a quadtree encoding is done on the application side, it becomes possible to use any common database such as MySQL or MongoDB, without the need to use any spatial features on the database level. With the increase in use of Amazon and other public clouds, and in particular PaaS services for data storage, the need for this kind of architecture is increasing.
Figure 4 A quadtree dividing a 2D surface. The blue area is indexed as 023.
For instance, the blue rectangle in the picture above is indexed as 023. It can naturally be searched from a database simply as "x = 023". However, if you also wanted to search for the sub-rectangles inside that area, such as 0231 or 0233, you could search for "x >= 023 AND x < 03".
A quadtree can of course be infinitely granular. Note that in practice the Earth is mostly empty of objects as most of the surface is water, deserts or just not mapped or connected to the internet. In practice therefore, it is yet another benefit of the quad-tree that it allows using different granularity in different areas of the total surface.
For more info: http://en.wikipedia.org/wiki/Quadtree
Quad-trees are a form of a grid-based spatial-index, but there can of course be other grids too, starting from a naive grid of rectangles refered to with (x,y) coordinates. Using such grids is straightforward, but the discussion then usually revolves around the problem of projecting the 3D sphere that is the world onto a 2D, often rectangular, surface. This gives rise to more elaborate forms of lattice, such as a Geodesic grid.
Apparently SQL Server spatial features are based on a Grid based index. I don't know how that becomes different than R-tree indexes.
This is what I learned by reading up on spatial data structures. If you have comments, or there's something more I should know, I'd be interested to hear.
- 1. If storing as an integer, one would have to use numbers from 1 to 4, to avoid problems with leading zeros.