Posts with the “database” tag

An open letter to (anime) database owners

Dear database owner,

I’m not a huge fan of the open letter format - its use for pithy snark has long since diluted whatever potency it once had - so I’ll get right to the point: I want access to your data.

Not in a creepy stalker way: I don’t want to know your three sizes, nor do I want whatever user data you choose to collect. No, I’m talking about the data you have on anime and its minutiae - characters, staff, companies and all the tidbits in between. Whether you call it a database, an encyclopaedia, a list, a planet or otherwise, I’m interested.

But why?


Tidbits from gallery.chaostangent.com

These are some of the neater parts of gallery.chaostangent.com that don't warrant a full exploration on their own but serve the goal of making the application more streamlined. I've crafted these examples to be focused so they don't contain superfluous details like error checking, timestamp columns and the like.

Database

The gallery schema is as follows:

CREATE TABLE IF NOT EXISTS `galleries` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `left` int(10) unsigned NOT NULL default '0',
  `right` int(10) unsigned NOT NULL default '0',
  `parent` int(10) unsigned NOT NULL default '0',
  `title` tinytext NOT NULL,
  `directory` tinytext NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `parent` (`parent`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 ;

This covers both the Modified Preorder Tree Traversal (`left` and `right` columns) model as well as the more standard hierarchical model (`parent` column). I'm still undecided as to whether indexing the `left` and `right` columns provides any benefits. Most of the queries on the gallery table involve getting the direct children of a particular node; the breadcrumb trail at the top of the page however is built using the `left` and `right` columns:

SELECT * FROM `galleries` WHERE (`left` <= ?) AND (`right` >= ?) ORDER BY `left`

A multi-column index in MySQL can only be used from its leftmost column onwards, so for the above query a composite index on `left` and `right` would be a benefit. However, when inserting and deleting nodes the updates are done singularly, e.g. one for `left` and one for `right`, so an index that leads with one column would not help the query on the other, and that may turn out to be detrimental in terms of update times. I could always do two indexes:

ALTER TABLE `galleries` ADD INDEX ( `left` , `right` ) ;
ALTER TABLE `galleries` ADD INDEX ( `right` , `left` ) ;

This runs the risk though of having a table that's more index than data. I haven't done a full benchmark of the different queries for each scenario but I would imagine only for large trees would indexing provide any tangible benefit.
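As a rough illustration of the two access patterns discussed above (reading a breadcrumb trail and shifting `left` / `right` values on insert), here is a minimal sketch using Python's sqlite3 module rather than MySQL and PHP. The cut-down table, the sample rows and the helper names build_demo, breadcrumb and add_child are all assumptions made for the example, not code from the gallery itself. It treats a node's ancestors as the rows whose `left` is no greater, and whose `right` is no smaller, than the node's own values.

```python
import sqlite3


def build_demo(conn):
    """Create a cut-down `galleries` table and load a small sample tree (hypothetical data)."""
    conn.execute(
        'CREATE TABLE galleries ('
        ' id INTEGER PRIMARY KEY,'
        ' "left" INTEGER NOT NULL,'
        ' "right" INTEGER NOT NULL,'
        ' parent INTEGER NOT NULL DEFAULT 0,'
        ' title TEXT NOT NULL)'
    )
    rows = [
        # id, left, right, parent, title
        (1, 1, 8, 0, "Root"),
        (2, 2, 5, 1, "Anime"),
        (3, 3, 4, 2, "Screenshots"),
        (4, 6, 7, 1, "Photos"),
    ]
    conn.executemany("INSERT INTO galleries VALUES (?, ?, ?, ?, ?)", rows)


def breadcrumb(conn, node_id):
    """Return the titles from the root down to the node, using only left/right."""
    lft, rgt = conn.execute(
        'SELECT "left", "right" FROM galleries WHERE id = ?', (node_id,)
    ).fetchone()
    cur = conn.execute(
        'SELECT title FROM galleries'
        ' WHERE "left" <= ? AND "right" >= ? ORDER BY "left"',
        (lft, rgt),
    )
    return [title for (title,) in cur]


def add_child(conn, parent_id, title):
    """Append a child as the last child of parent_id; returns the new row id."""
    (parent_right,) = conn.execute(
        'SELECT "right" FROM galleries WHERE id = ?', (parent_id,)
    ).fetchone()
    # Two separate updates: one touches only `right`, the other only `left`.
    conn.execute(
        'UPDATE galleries SET "right" = "right" + 2 WHERE "right" >= ?',
        (parent_right,),
    )
    conn.execute(
        'UPDATE galleries SET "left" = "left" + 2 WHERE "left" > ?',
        (parent_right,),
    )
    cur = conn.execute(
        'INSERT INTO galleries ("left", "right", parent, title)'
        ' VALUES (?, ?, ?, ?)',
        (parent_right, parent_right + 1, parent_id, title),
    )
    return cur.lastrowid


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo(conn)
    print(breadcrumb(conn, 3))
    print(breadcrumb(conn, add_child(conn, 4, "Holiday")))
```

Each of the two UPDATE statements filters on a single column, which is why a composite index leading with the other column would not help it.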


Calculating the geodesic distance between two points

I was recently tasked with recreating an existing supplier search for a client; I was provided with a database of suppliers, most of which had been geocoded, and not much else. This scenario is fairly standard when dealing with mapping applications: a user enters a postcode and the system returns a list of the closest suppliers to that location. The postcode part of this equation is well travelled - the Post Office in the UK will not relinquish the mapping from a postcode to a latitude, longitude tuple without a large outlay of cash (and numerous non-disclosure agreements), so the easiest option is to use an external service for this. I opted for PostcodeAnywhere as I had used them before with great success. The latter part of this challenge - the return of the closest database entries - was something that I wanted to try myself as I didn't know when I would get such an opportunity again.


To say there are many different ways of calculating the distance between two points would be an understatement. One which I had used before involved northing and easting co-ordinates from a known point within the UK (usually the centroid or London). Using this meant a smattering of trigonometry would be enough to return a decent list of matches. This always struck me as crude: despite its usefulness, using an antiquated and subjective co-ordinate system seemed the wrong way to approach the problem. Latitude and longitude are globally recognised and provide a precise way of defining points on the globe - reading up on how they are calculated was step one. Step two was finding an algorithm that calculated the distance between two arbitrary points. The first one I found was the Haversine formula: simple, easy to follow and easy to implement. Knowing that this formula was based upon the assumption that the Earth was perfectly spherical grated slightly with me - I reasoned there must be a more accurate algorithm. I found this precision in Vincenty's algorithm, and it was then I decided to enact a contrived but deliciously fun maxim: if something is worth doing, then it's worth overdoing.
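For comparison, the Haversine formula mentioned above fits in a few lines. This is a minimal sketch in Python, not the code used for the client project: the function names haversine_km and closest, the tuple layout of a supplier and the mean Earth radius of 6371.0088 km are all assumptions chosen for illustration. It uses the spherical-Earth simplification, so it is the less accurate of the two approaches.

```python
import math

# Mean Earth radius in kilometres; an assumption of the spherical model.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the domain of asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def closest(origin, suppliers, limit=5):
    """Sort (name, lat, lon) tuples by distance from origin (lat, lon); keep the nearest."""
    lat, lon = origin
    ranked = sorted(suppliers, key=lambda s: haversine_km(lat, lon, s[1], s[2]))
    return ranked[:limit]


if __name__ == "__main__":
    london = (51.5074, -0.1278)
    suppliers = [  # hypothetical sample data
        ("Paris", 48.8566, 2.3522),
        ("Edinburgh", 55.9533, -3.1883),
        ("Brighton", 50.8225, -0.1372),
    ]
    for name, lat, lon in closest(london, suppliers, limit=2):
        print(name, round(haversine_km(london[0], london[1], lat, lon), 1))
```

Sorting in application code like this is only sensible for a modest number of rows; for a large supplier table the distance calculation would more likely be pushed into the database query.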


Rebuilding gallery.chaostangent.com

gallery.chaostangent.com is an application for storing and organising images – ostensibly a very simple desire but one I found not catered for by existing web applications when it was first conceived in 2005. The concept was an application that was simple and easy to use while still allowing for a degree of organisation to ensure images weren’t stored in a single “pool”.

With a small, well-defined feature set it seemed like a good time to address some of the issues which had crept in

Background

When I first started developing the application, PHP 5 hadn’t been released for very long and was receiving a mixed reception. Regardless, I started developing using a custom built framework I had cobbled together from scratch – one that would eventually go on to be refined and used in some of my work projects. With the lack of other mature frameworks to compare with, it was rough round the edges and did little more than segment out code into the MVC pattern and even then it wasn’t an entirely clean encapsulation; it was however useful.


Replacing a server – part 1: the audit

Windows Servers - Data Center by jaxmac used under Creative Commons Attribution-Non-Commercial-No Derivative Works license

For a small digital agency, running an off-site server is as important as it is unglamorous. You don’t get any of the desirable super-tech of running a cluster but all of the headaches of running a constantly used, high-availability external computer. My workplace’s existing dedicated server (which I championed, configured and maintain) is used to provide web hosting to a variety of clients – both large and small – and for the past three and a half years has provided a flawless service. Upgrading is not to be taken lightly and the reasons for doing so must always result in a better service to clients – whether that’s decreased work load for you or improved site responsiveness. For me it boiled down to entropy: three and a half years is a long time for hardware to run and it will eventually fail and make my day/week/month hell on toast.

First step on this crazy adventure: audit.

Cleaning house

the more you know and the better prepared you are, the easier it’s going to be

Audit is a filthy word round most parts and conjures up images of bespectacled pencil-pushers or greasy tax collectors. Despite this, documenting what you have is the first step to getting something better. When I began this process however I found that, like a house, over time a server accumulates clutter: old domains, long since defunct sites, errant processes; automation only goes so far before a cleaner has to step in.

Spending a day archiving and removing cruft is tantamount to dusting the shelves and throwing away old books and furniture before moving house – it reduces the effort required later in the process. My removals included:

Domain name end-points – for ones which had either expired or the persons / companies had moved on
E-mail accounts – accounts for expired domains are useless, just as accounts for long since lapsed campaigns are
Test folders – a separate test environment means accumulation of in-progress sites was inevitable. I found a year without modification is a good metric for when to cull
Errant services – automated / scheduled processes such as log parsers; awstats was set to run on Apache’s log files – no longer necessary when every site we host uses Google Analytics
Old databases – very few of these but the odd one sometimes slips through
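The "year without modification" rule for test folders is easy to automate as a first pass. The following is a minimal sketch in Python, assuming each test site lives in its own directory directly under a single root; the function name stale_dirs and its parameters are hypothetical, and it only reports candidates rather than deleting anything.

```python
import os
import time


def stale_dirs(root, max_age_days=365, now=None):
    """List immediate subdirectories of root with nothing modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir(follow_symlinks=False):
            continue
        # Start from the directory's own mtime, then check every file beneath it.
        newest = entry.stat(follow_symlinks=False).st_mtime
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                mtime = os.lstat(os.path.join(dirpath, name)).st_mtime
                newest = max(newest, mtime)
        if newest < cutoff:
            stale.append(entry.name)
    return stale


if __name__ == "__main__":
    # Report candidates under the current directory for manual review.
    for name in stale_dirs("."):
        print(name)
```

Modification times are only a hint (a restore or a recursive permissions change can reset them), so the output is best treated as a list to review before archiving.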

After archiving, it was time for the document itself.
