Comparing open source GIS implementations
In my quest to understand spatial GIS functionality, I have come to the ultimate goal: evaluation the actual database products themselves.
PostgreSQL / PostGIS
PostGIS is a variant of PostgreSQL with spatial extensions. The main reason for maintaining the GIS feature set outside of PostgreSQL proper seems to be licensing: the spatial extensions are
LGPL GPL licensed.
PostGIS is widely recognized as the most mature and feature-rich GIS implementation for SQL databases (and perhaps any database), matched only by the costly Spatial extension for the Oracle Enterprise database. (See comparisons in the links below.)
For instance PostGIS supports both Geometry (x, y) and Geography (lat, long) types and functions. The functions take into account the specified Spatial Referencing System Identifier (SRID) and calculate distances using the correct projections.
Updated 2012-08-13: See comments section, especially from reader "Regina", with information about PostgreSQL 9.1 improvements wrt the criticism below, which is based on PostgreSQL 8.4 / PostGIS 1.5.
The biggest disappointment in PostGIS turns out to be that the GiST R-tree index only supports bounding rectangle operations. 1 This is a bit annoying, but in practice a developer would then have to construct a WHERE clause that first contains a ST_DWithin() against a bounding rectangle and then add a logical AND using the appropriate GIS function against the actual object, which may be something else than a rectangle. I.e:
SELECT ... WHERE ST_DWithin(shape_column, ST_MakeEnvelope(90900, 190900, 100100, 200100,312), 100) AND ST_Distance(shape_column, ST_GeomFromText('POINT(100000 200000)', 312)) < 100
In the above, the first part of the WHERE clause (Within) would select anything inside the given rectangle, and this can be done with an index scan. The second part (Distance) would then be used to filter the result set further (in this case, including objects within a circular radius instead of a rectangle).
The manual also mentions that the PostgreSQL query planner will often fail to actually use the index at all, why it may be necessary to force the using of an index or alternatively modify the cost parameters of the query planner.
Finally, the use of PostGIS extensions seems to come with various SQL overhead that may be the norm in PostgreSQL and Oracle land, but appears as quite bad usability when coming from a MySQL or NoSQL background. Consider for instance that only in recent development PostGIS versions is the creation of Geometry columns actually done in CREATE TABLE syntax, but instead a separate AddGeometryColumn() procedure has until now been called.2
Similarly, since a lot of the GIS functionality is implemented using PL/pgSQL it must be taken into use by executing various *.sql files that are supplied. This again leads to maintenance problems when upgrading to newer versions: after upgrading the PostGIS binary, one must again execute additional *.sql as part of the upgrade, or worse, perform a full DUMP and RESTORE. (In PostGIS defense, it should be noted that doing a full DUMP and RESTORE is common practice with normal PostgreSQL upgrades too.)
To round it all off, I also failed to find Postgis in the common Centos repositories. The manual only instructs you to compile it from source - big minus in user friendliness! I eventually found the OpenGeo project which provides a yum repository that includes opengeo-postgis rpm packages: http://suite.opengeo.org/docs/usermanual/installation/linux/postgis-centos.html
(Update: Again, commenter "Regina" informs that the official PostgreSQL yum reporitories do in fact contain also PostGIS binaries.)
All this makes PostGIS appear as a bolted on hack on top of standard PostgreSQL, which perhaps is not an entirely wrong perception.
Compared with MySQL, then, PostGIS is no doubt more featureful but still a bit of a disappointment due to complexity of use. The lack of native support in the GiST R-tree index for other than bounding rectangle searches seems annoying and the resulting workarounds clearly are against the declarative spirit of SQL. Since it is possible for anyone to do simple bounding box operations with any RDBMS simply by defining and indexing 4 columns [xmin, ymin, xmax, ymax] (or, perhaps using a quad-tree approach) and querying these with normal SQL, a compelling reason to use PostGIS is certainly lost there.
On the other hand the support for using correct projections with supplying the SRID, and support for Geography types, are unique PostgreSQL strengths not provided in other open source alternatives (except SQLite, which is based on the same libraries).
For more information:
MySQL implements a subset of the OpenGIS "Simple feature access" specification. A multitude of Geometry types is supported, but only on a 2D plane. The Geography type systems (lat, long) are not implemented.
Also within the Geometry implementation, MySQL has only supported functionality based on Minimum Bounding Rectangle operations. Thus, two objects that do not actually interstect, might still be returned as intersecting by the MBRIntersects() function if their bounding rectangles would intersect. This has been the primary reason for criticizing MySQL's spatial implementation as naive.
It seems that as of 5.6.1 (beta) MySQL implements proper OpenGIS functions in this area. (http://dev.mysql.com/doc/refman/5.6/en/functions-for-testing-spatial-relations-between-geometric-objects.html#functions-that-test-spatial-relationships-between-geometries) MariaDB also implemented these functions independently a while ago in version 5.3.3. A nice webcast of using these functions is available at http://blog.mariadb.org/screencast-mariadb-gis-demo/ This is a significant improvement in terms of MySQL spatial functionality.
The other key drawback in MySQL's spatial functionality is the lack of spatial indexes for InnoDB tables. While all storage engines support Geometry column types, and the spatial functions can of course operate on those types regardless of the underlying engine, R-tree indexing capability was only developed for MyISAM. This is unfortunately still the case for MySQL 5.6 and MariaDB 5.3.
Finally, MySQL ignores the SRID if specified and performs all calclulations using Euclidean geometry. For many applications this can be considered a minor issue, but it should be noted the error compared to real world distances does get quite large towards the poles. For some applications then, this inaccuracy might be a showstopper. In practical terms, while you might not care about the accurace as such, it might be a showstopper if your application needs to interact or consume data with another system that is already in a more accurate projections (such as the WGS84 Mercator projection) and you need to use the same projection to remain compatible. (OpenStreetMaps and our Nokia Maps use this projection, and I think Google Maps uses something similar but more complicated.)
As such, the lack of R-tree support for InnoDB tables could still be considered a severe weakness for most GIS usage with MySQL. Even if an organization were to exceptionally allow use of MyISAM tables for GIS, the mixing of storage engines at minimum adds cognitive overhead, and for anything but a read-only workload the lack of transactions and crash-safety is very problematic indeed.
A variation - still more complex, but safer - of the above would be to use InnoDB tables as primary storage for all data, including geometric types, and then maintain a copy of that data in a MyISAM table. This could possibly be automated with database triggers, but one still has to be mindful about MyISAM's poor read-write performance. Another variation would be to store the geometry types into MyISAM tables (with indexes) and link them to the main InnoDB tables with a foreign key relationship.
Yet another approach to use GIS with MySQL would be to keep using InnoDB tables, and add some application level workaround to compensate for lack of R-tree indexes. An obvious idea then is to encode the bounding rectancle of geometric objects as a quad-tree, which could be stored in a normal B-tree index. This could to some extent even be done with database triggers. In the searching phase the application would then have to first search the B-tree index and then filter the result set further by comparing shapes one-by-one with functions like st_intersects(). Given that the PostGIS R-tree index only works on bounding rectangles too, this solution is comparably equal, even if more "hand made".
The one thing that clearly stands out in favor of MySQL is the user friendliness. The Geometric types and functions are just there, ready to be used. They don't need to be downloaded from a separate place and manually installed, and there's no additional gymnastics to do at every upgrade. Given that PostGIS implementation isn't perfect either, this benefit can be considered MySQL's strong point.
In conclusion then, it seems that if accurate positioning calculations are needed, then PostGIS is without competition in the open source world. On the other hand, with the recent addition to MySQL of true spatial functions other than just bounding rectangles, the user friendliness of MySQL is compelling when Euclidean 2D geometrics is sufficient.
For the future, I spoke with Holyfoot from the MariaDB team who told me MariaDB is working on supporting different projections. I do not have information on anyone working on an InnoDB implementation of the R-tree index, but I wouldn't be surprised if Oracle is secretly doing just that. (This is based on knowing they recently released another MyISAM-only feature, the full text indexing, for InnoDB and the general trend to discontinue MyISAM support and focus on InnoDB.)
SQLite / Spatialite
Spatialite is a variant of the SQLite database that provides GIS features. It uses the GEOS, PROJ.4, etc... libraries that are also used in PostGIS, and hence the resulting featureset is similar. In particular, spatialite supports different geometric reference systems (the SRID values) and correctly handles projections into each reference system. Only Geometry types (2D x,y operations) are supported, the Geography type operations are not.
The R-tree index only stores bounding rectangles. What's more, while the R-tree is maintained automatically, it is not actually included in query plans by the SQLite optimizer. Instead, the user has to explicitly query a table that is the R-tree index, and include this into the actual query as a sub-query or join. Examples of rewriting a normal SQL query to one explicitly using the R-tree index are given in the tutorial below.
While it is not useful to compare SQLite to client-server databases like PostgreSQL and MySQL, the advanced features in SQLite can indirectly impact the choice of Postgres vs MySQL: Imagine an application that uses SQLite/spatialite on the client side, for instance to provide offline access to a subset of a GIS database.
In this case it would be wise to use PostGIS for the server side implementation, since it would be guaranteed to produce identical results as spatialite. Otoh MySQL's Euclidean-only geometrics would introduce a risk of producing different results for spatial operations compared to the client side spatialite operations.
The lack of support for Geography projections (lat, long) is a clear minus, since in practice the WGS84 projection is used for significant data sets.
Extensive tutorial: http://www.gaia-gis.it/gaia-sins/spatialite-tutorial-2.3.1.html
Quick tutorial: http://underdark.wordpress.com/2010/06/23/spatialite-quick-start/
As it is possible to use SQLite as a frontend to BerkeleyDB, it is also possible to use the Spatialite extensions as described above. There is no further GIS support beyond this.
MongoDB markets itself as having spatial extensions, which were sponsored by Foursquare, no less. It turns out this feature boils down to the ability to index 2D points, and querying records within a distance of a given point. (Ie. find friends or cafees within a 5km radius.)
It seems MongoDB's implementation is similar to using quad-tree encoding over a B-tree, but instead they encode locations as GeoHashes. For instance the hash "ud9wrc8" is the location of Helsinki: http://geohash.org/ud9wrc8
There is no support for other geometric objects than points, and thus also no support for related GIS functions such as intersects(). As a curiosity, the implementation also cannot wrap around the +180 longitude to -180.
As pointed out in the PostGIS section above, this feature is essentially doable with any SQL database without using any GIS functions. Otoh for a NoSQL database it is no doubt a helpful enabler.
For more information:
The PostGIS documentation explains GeoHash in more detail: http://postgis.refractions.net/documentation/manual-2.0/ST_GeoHash.html
Solr / Lucene
Solr also provides functionality to filter search results by distance from a given point. This can also be combined with other filters, for instance proximity can be used to boost search results, while not absolutely excluding remotely located points if they match the other search criteria.
There is no support for other geometric types than 2D points, nor the related GIS functions.
The feature provided by Solr seems like a perfect fit for Point-of-Interest type of applications, since it combines the ability to find a location - in particular, a point - with the ability to do text searches.
A nice and short tutorial of Solr capabilities are given at: http://derickrethans.nl/spatial-indexes-solr.html
For more info: http://wiki.apache.org/solr/SpatialSearch/
GEOS is a well known C++ library that implements the OpenGIS Simple Features Specification, ie it provides both the Geometry types and related functions. It is a C++ port of the pre-existing Java Topology Suite. GEOS is used in PostGIS, whereas MySQL implements its own functionality.
There is a family of related libraries provided by the OSGEO project, such as PROJ.4 used to transform between different projections.
For more info: http://trac.osgeo.org/geos/
Java Topology Suite is a Java Library that implements the OpenGIS Simple Features Specification.
For more info: http://tsusiatsoftware.net/jts/main.html