I was recently tasked with recreating an existing supplier search for a client; I was provided with a database of suppliers, most of which had been geocoded, and not much else. This scenario is fairly standard when dealing with mapping applications: a user enters in a postcode and the system will return a list of the closest suppliers to that location. The postcode part of this equation is well travelled - the Post Office in the UK will not relinquish the mapping from a postcode to a latitude, longitude tuple without a large outlay of cash (and numerous non-disclosure agreements), the easiest option is to use an external service for this. I opted for PostcodeAnywhere as I had used them before with great success. The latter part of this challenge - the return of the closest database entries - was something that I wanted to try myself as I didn't known when I would get such an opportunity again.
To say there are many different ways of calculating the distance between two points would be an understatement. One which I had used before involved northing and easting co-ordinates from a known point within the UK (usually the centroid or London). Using this meant a smattering of trigonometry would be enough to return a decent list of matches; this always struck me as crude, despite it's usefulness, using an antiquated and subjective co-ordinate system seemed the wrong way to approach the problem. Latitude and longitude are globally recognised and provide a precise way of defining points on the globe - reading up on how they are calculated was the step one. Step two was finding an algorithm that calculated the distance between two arbitrary points. The first one I found was the Haversine formula: simple, easy to follow and easy to implement. Knowing that this formula was based upon the assumption that the Earth was perfectly spherical grated slightly with me - I reasoned there must be a more accurate algorithm. I found this precision in Vinencty's algorithm, it was then I decided to enact a contrived but deliciously fun maxim: if something is worth doing, then it's worth overdoing.
Vincenty's formula is accurate to within half a millimetre if given the correct variables for your location - as the earth is an ellipsoid, and a non-uniform one at that (pesky hills), your location will determine just how accurate the formula is. Most systems use the generic and perfectly suitable WGS-84 variables which are usually accurate to within 2 metres - this is the system GPS uses. As all the suppliers I would be searching for and all the postcodes would be within the UK, I could use the more precise Airy (1830) set - likely named after George Biddell Airy for his work on planetary densities. The maths involved in the formula is dense to say the least and I would be lying if I said I grokked it but the implementation is straightforward.
I had originally envisaged doing some sort of segmenting of the suppliers prior to working out distances in a script; as my brain mulled over the possibilities (closest postcode, caching of major city locations, co-ordinate system conversions) I realised that setting it up as a database function would solve all of these problems. Firing up phpMyAdmin I bashed out an attempt and after a couple of fixes (mostly syntactic foibles of MySQL) it was working a treat:
DELIMITER // DROP FUNCTION IF EXISTS distanceVincenty// CREATE FUNCTION distanceVincenty(lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT) RETURNS INT BEGIN DECLARE a, b, f, L, U1, U2, sinU1, cosU1, sinU2, cosU2 DOUBLE; DECLARE lambda, lambdaP, sinLambda, cosLambda DOUBLE; DECLARE sinSigma, cosSigma, sigma, sinAlpha, cosSqAlpha, cos2SigmaM, C DOUBLE; DECLARE iterLimit INT; DECLARE uSq, A1, B1, deltaSigma, s DOUBLE; SET a = 6377563.396, b = 6356256.909, f = (1 / 299.3249646); SET L = RADIANS(lon2 - lon1); SET U1 = ATAN((1 - f) * TAN(RADIANS(lat1))); SET U2 = ATAN((1 - f) * TAN(RADIANS(lat2))); SET sinU1 = SIN(U1), cosU1 = COS(U1); SET sinU2 = SIN(U2), cosU2 = COS(U2); SET lambda = L, lambdaP = 0, iterLimit = 100; mainLoop: REPEAT SET sinLambda = SIN(lambda), cosLambda = COS(lambda); SET sinSigma = SQRT((cosU2 * sinLambda) * (cosU2 * sinLambda) + (cosU1 * sinU2 - sinU1 * cosU2 * cosLambda) * (cosU1 * sinU2 - sinU1 * cosU2 * cosLambda)); SET cosSigma = sinU1 * sinU2 + cosU1 * cosU2 * cosLambda; IF sinSigma = 0 THEN RETURN 0.0; END IF; SET sigma = ATAN2(sinSigma, cosSigma); SET sinAlpha = cosU1 * cosU2 * sinLambda / sinSigma; SET cosSqAlpha = 1 - sinAlpha * sinAlpha; SET cos2SigmaM = cosSigma - 2 * sinU1 * sinU2 / cosSqAlpha; IF cos2SigmaM IS NULL THEN SET cos2SigmaM = 0; END IF; SET C = f / 16 * cosSqAlpha * (4 + f * (4 - 3 * cosSqAlpha)); SET lambdaP = lambda; SET lambda = L + (1 - C) * f * sinAlpha * (sigma + C * sinSigma * (cos2SigmaM + C * cosSigma * (-1 + 2 * cos2SigmaM * cos2SigmaM))); SET iterLimit = iterLimit - 1; UNTIL ((ABS(lambda - lambdaP) > 1E-12) AND (iterLimit > 0)) END REPEAT mainLoop; SET uSq = cosSqAlpha * (a * a - b * b) / (b * b); SET A1 = 1 + uSq / 16384 * (4096 + uSq * (-768 + uSq * (320 - 175 * uSq))); SET B1 = uSq / 1024 * (256 + uSq * (-128 + uSq * (74 - 47 * uSq))); SET deltaSigma = B1 * sinSigma * (cos2SigmaM + B1 / 4 * (cosSigma * (-1 + 2 * cos2SigmaM * cos2SigmaM) - B1 / 6 * cos2SigmaM * (-3 + 4 * sinSigma * sinSigma) * (-3 + 4 * cos2SigmaM * cos2SigmaM))); SET s = b * A1 * (sigma - deltaSigma); RETURN ROUND(s); END; //
All of the trigonometry functions are built into MySQL, even helpful ones like ATAN2 and its ilk. The function returns the distance between the points in millimetres which can then be easily transformed into your chosen unit of choice.
As you can no doubt guess from the code above, this isn't exactly computationally cheap. Potentially for nearly antipodal points the main loop will repeat up to a iterLimit times (100 above) before continuing. As well as this, depending on the construction of your database and SQL statement, you could end up doing this calculation for every record in your table, e.g.:
SELECT id, title, address, distanceVincenty(latitude, longitude, @lat, @lon) AS distance FROM `suppliers` ORDER BY distance DESC
will force MySQL to calculate the distance for every row and then order accordingly. I've yet to do any benchmarks as it would be lunacy to put into production, however queries which ordinarily took a few hundred milliseconds started taking up to two seconds to complete. Thankfully the system I was working on solved this problem itself by having each supplier categorised and rated which meant searches rarely returned more than ten to twenty results before distance calculation; the possibility still exists of doing segmentation based on addresses prior to the calculation but that will be dependant on your specific requirements.
So while I doubt any users of the new supplier search realise it, their results are accurate to within a few millimetres, potentially saving them microns of shoe leather in walking or giving them true results for the case when rival suppliers are mere millimetres apart. For anyone who isn't interested in meridians and arc tangents then the Vincenty function is most certainly overkill and sticking with Haversine is likely the smarter move, but for those valuing absolute precision, this certainly provides a great little exercise.