Working around the performance problem of STUnion

When life gives you an O(n^k) algorithm, it's time to get creative and respond with your own logarithmic algorithm.

At DevResults we do a lot of work with maps. One such activity involves kml shape files of things like countries, states / provinces, cities, etc. While learning how to set up a new site, we grabbed the shape files for the United States off of GADM and started the import process.

An hour later, the process still hadn’t finished the first (and largest) shape file. Something was definitely wrong and so I dug in and found out where the code was stuck at and found that the code was iterating through all the points and combining them together into one big SqlGeography instance via the STUnion method. Below is an example of the code that was doing this

SqlGeography shape = polygons[0];
for (int i = 1; i < polygons.Length; i++) {
	shape = shape.STUnion(polygons[i]);
}

return shape;

When I started searching around for some alternatives to STUnion, I found this graph:

Classic O(n^k) problem from the look of it.

Taming the beast

From the graph, we know that we can combine small shapes very quickly, and large shapes take a very long time. This means for our algorithm we want to minimize the number of operations we do on large shapes. With this optimization goal in mind, we can make a divide and conquer algorithm that will combine shapes into increasingly larger shapes where finally at the end we only have two shapes to merge.

First, lets get the implementation out of the way.

public static SqlGeography CombinePolygons(List<SqlGeography> polygons, int start, int end)
{
    if (start > end) return null;
    if (start == end) return polygons[start];

    int midpoint = (start + end) / 2;

    SqlGeography left = CombinePolygons(polygons, start, midpoint);
    SqlGeography right = CombinePolygons(polygons, midpoint + 1, end);

    // if both values have shapes, combine
    if (left != null && right != null)
        return left.STUnion(right);

    // if only one has a shape, pick whichever has a value
    return left ?? right;
}

So, the way this works is it recursively splits the list of polygons until it gets down to two single polygons, unions them together, then returns that shape, which then gets unioned with another, until in the end we have one shape. This means that most of our operations are done on small objects, thus minimizing the large objects we have to merge.

For example, consider 512 polygons.

Operation Count	Left Size	Right Size
256	1	1
128	2	2
64	4	4
32	8	8
16	16	16
8	32	32
4	64	64
2	128	128
1	256	256

We can easily see that we are now minimizing the number of operations that we are doing on large shapes.

Reviewing the solution

So how well does this work out? Well the first time I ran this, it never finished the first shape file in over and hour and a half while this algorithm lets me process this shape file in just under 4 minutes. A great result from a simple change in how we process the polygon shapes.

Darren Kopp

Navigating through software, line by line