© 2006-2025 Henrik Ingo.
The content on this site is published with the Creative Commons Attribution License.
That means you are free to copy and reuse and redistribute the book, blog posts and other original content you find on this site.
Non-original content will be clearly attributed with their respective copyright terms.
Designed by: Golems G.A.B.B. OÜ
Hi Henrik, Very
Hi Henrik,
Very interesting post! I have one question: what tool did you use to do the benchmarking?
Carl
Hi Carl We have an
Hi Carl
We have an self-developed C++ based tool that benchmarks http web services. To benchmark these databases I created a mod_python script and ran it under apache. The actual database queries are given in the slides.
As both MongoDB and Postgres scaled less than I had expected (MySQL was tested last) I have also re-run the same queries with sysbench (C++) for Postgres, to ensure that Apache and the related TCP/IP traffic wasn't the bottleneck. (It wasn't.) Eventually I also found a Java based benchmark tool for MongoDB, created by Tim Callaghan, which I used to also verify the same for Mongo (also there it wasn't the tool that was bottlenecking.)
The published numbers are sysbench for MySQL and Postgres and the mod_python based benchmark for MongoDB.
Benchmark tests
Henrik,
Do you have a script of how you prepped your data and what queries you ran in your benchmarks? I am pretty puzzled that PostGIS came out last in your tests. And of course having worked with it so long, I'm a bit suspicious of your conclusions and would like to test these out myself.
For example did you have spatial indexes on your geometry columns. I would assume you did. Did you vacuum analyze prior to running your benchmarks? If it is indeed an issue in PostGIS, we would like to fix that.
See above: For PostGIS and
See above: For PostGIS and MySQL it was sysbench 0.5, the queries given in the slides were used in simple sysbench lua scripts. (No, I don't have access anymore to the files.)
Yes, I'm sure I had indexes. The data set was tens of gigabytes so it would have shown if not :-) (Only a subset was accessed to make it fit in memory.)
I did not do vacuum analyze or any other "expert tuning". This is also the case for MySQL, even if I happen to be an expert on that one and MySQL also takes a hit on the second test case which I'm sure could be solved by tuning. I do point this out in the talk, especially on the slide about data size.
These are "out of the box" tests, the only thing I made sure is that indexes exist and data fits in RAM (requires configuration in PostgreSQL and MySQL).
More thoughts on PostGIS performance
I asked around about this and our key PostGIS architect responded.
http://lists.osgeo.org/pipermail/postgis-devel/2013-July/023641.html
I've repeated his comments here for completeness.
"I had some thoughts about it this morning:
- One very odd finding was the overall size finding. PostGIS 10x larger than mySQL. There's one of two things going on here: he's just made a mistake in reporting one of the other sizes; the GIST index has way more overhead than a purpose-built spatial index might have in terms of space. Although, actually, he got the same 10x result for both the polygon and point layer, which argues against the second point, so that leaves me with wondering if he just made an error.
- The write-up was strange, in that it on the one hand noted the PostGIS linear scaling line as load went up and didn't note that MySQL started to choke under multiuser load, and concluded with "MySQL is faster", without adding "under a single user load scenario"
- The actual test was a bunch of singular point-in-polygon queries. We get no leverage from caching here, because each call just does a single point/polygon pairing. My ancient mysql-vs-postgis benchmark also showed mysql going faster (though only somewhat, not 5x) on simple single-statement queries (bbox queries in my case). The actual code path for the point/polygon test doesn't call into GEOS (good) and doesn't populate a cache until the second call (good) so there's no obvious postgis-level inefficiency on this workload
- The whole database is "in memory" he says. That leaves one item which could be problematic, particularly for bigger polygon geometries: external storage. We set our storage type such that geometries get compressed when they get too big, which means it's possible the system is spending time unzipping geometries before evaluating the call.
- Regina did ask if he put on indexes, hard to imagine he didn't.
- As mentioned, there's not quite enough info here to make a solid diagnosis or critique.
P. "
On a separate note -- it would be nice one of these days to have a spatial database shootout at FOSS4G - http://www.foss4g.org/
I've always been dismissive of these kind of benchmarks - not just yours but things like anything comparing Oracle Spatial and PostGIS and even saying PostGIS is faster. Just because I know how hard it is to optimize these things and the benchmarks always seem more a reflection of the knowledge of the person who is doing it (what they know about the databases they are testing) than the actual database platform and thus end up being unintentionally biased.
Thanks Regina! I think the
Thanks Regina!
I think the compression is the culprit. In the case of PostGIS it was really saturated on CPU. With 8 client threads I had 8 cores at 99% utilization. Compression is a logical explanation. It also made latency per query quite poor.
The objects really were large, polygons may have had hundreds of point, some maybe more.
MySQL did scale under multiple user load, for the polygon based model. Those results were in the range that I expected every database to achieve.
It's true that taking these kind of "shoot outs" too seriously is problematic. Fact is, for my work at the time I did these tests, it made sense to compare alternatives, and since I have results it makes sense to share them. But like I already said above, these results are not from a perfectly tuned database, which is also the case for MySQL.
Note that I don't have access to the data set or testing files anymore. Other than that, feel free to ask more questions if you want.
It's worth noting that
It's worth noting that "PostGIS with points" results were the second best of any of the databases/reverse geocoders I tested, including some that are not published. Those results were also better than MySQL with points, but not better than MySQL with Polygon based approach.
Suggest use PostGIS 2.0 near-neighbor operators <-> and <#>
PostGIS 2.0 introduced new operators <-> and <#> which efficiently find nearest neighbors by point and bounding box, respectively. They use PostGIS' geometry indexes so are fast on large datasets. Rather than "Creating areas out of points" on slide 11 (which seems to be a discretized Voronoi approximation), you would get better accuracy and I think you would get better performance, just building an index on the original points and querying them for nearest neighbor.
That would also avoid the arbitrary choice of "1" as the amount by which to expand the bounding box in the second query on slide 16. If point density is high then many points are retrieved and much time is spent sorting.
Some references:
http://blog.opengeo.org/2011/09/28/indexed-nearest-neighbour-search-in-…
http://blog.light42.com/wordpress/?p=102
http://postgis.refractions.net/support/nearestneighbor.pdf (for technical underpinnings)
Hi David Actually, the "SQL
Hi David
Actually, the "SQL with points" query from slide 16 is for MySQL. The WHERE clause is needed as MySQL Distance() alone didn't utilize the RTree index. In the case of PostGIS, I actually used just ORDER BY Distance() without WHERE clause.
(Really sorry, I never had time to properly publish a writeup on these results, I realize there is information loss with publishing just the slides. But just keep asking question re any missing information, I'll do my best to make up for it in the comments.)
Using ST_Distance() should be the same as <->, no? To me using a function somehow felt more natural than a special operator, but I was assuming they have the same performance.
Efficiently ordering by distance
Hi Henrik, Unfortunately, just like with MySQL, in PostGIS using ST_Distance() alone doesn't use the RTree index. The below example on 9.6m points shows that ordering by ST_Distance() does a sequential scan and sorts all the points taking 19sec, while ordering by <-> uses an index scan and takes 6ms.
=> create table tt as select st_makepoint(random(), random()) as geom from generate_series(1, 9600000); analyze tt; create index tt_geom_sp_idx on tt using gist (geom);
...
=> explain analyze select geom from tt order by st_distance(geom, st_makepoint(0.5, 0.5)) limit 1;
Limit (cost=2614589.00..2614589.00 rows=1 width=32) (actual time=18832.367..18832.367 rows=1 loops=1)
-> Sort (cost=2614589.00..2638589.00 rows=9600000 width=32) (actual time=18832.366..18832.366 rows=1 loops=1)
Sort Key: (st_distance(geom, '0101000000000000000000E03F000000000000E03F'::geometry))
Sort Method: top-N heapsort Memory: 17kB
-> Seq Scan on tt (cost=0.00..2566589.00 rows=9600000 width=32) (actual time=0.033..14590.472 rows=9600000 loops=1)
Total runtime: 18832.390 ms
=> explain analyze select geom from tt order by geom <-> st_makepoint(0.5, 0.5) limit 1;
Limit (cost=0.00..3.52 rows=1 width=32) (actual time=6.413..6.414 rows=1 loops=1)
-> Index Scan using tt_geom_sp_idx on tt (cost=0.00..33775681.99 rows=9600000 width=32) (actual time=6.412..6.412 rows=1 loops=1)
Order By: (geom <-> '0101000000000000000000E03F000000000000E03F'::geometry)
Total runtime: 6.447 ms
Another reference article along the same lines: http://workshops.opengeo.org/postgis-intro/knn.html
Weird. Before presenting I
Weird. Before presenting I specifically checked and it seems I got good results with ST_Distance() without any WHERE clause. I think we can rule out that it would have been a full table scan - the data set was about 30GB. There is no way of scanning through that 3000 per second.
Unfortunately I can't go back to verify once more, as I no longer work at Nokia.