The Cost-Benefit Paradigm

There’s a pattern I keep seeing where organisational structure or segregation of duties makes simple tasks complicated. For a bit of fun I tried to define it:

The Cohen Cost-Benefit Paradigm

When there are two parties involved:

  • Party One understands the benefit of some resource but not the cost.
  • Party Two controls the cost of the same resource but not the benefit of it.

The result is that neither can decide if the usage of the resource is appropriate or not.

This happens a lot in cases where an infrastructure team manages and is in control of costs. Another team comes to them and requests infrastructure components, say some servers.

One: Hi, we’d like some shiny new servers with a load of RAM

Two: Why?! They’re really expensive.

One: We need it to solve problem X.

Two: There has to be a cheaper way. Why can’t you use solution Y instead?

One: No, we solved the problem already, this makes the best sense. Can you just provision it please?

Two: Nope. It’s too expensive.

Then follows debate, redesign and discussion that could have been avoided.

Note that neither party is being malicious. One is trying to deliver benefits to their customers, the other to keep their costs under control.

Possible solutions to the problem?

  • Improve Communication. This is more a fix for the symptom than the problem but would make the best of the situation and allow everyone’s concerns to be factored into the design.
  • Make one person/team responsible for both the benefit and the cost.

Are recruiters creating “appealing” faux profiles?

I received this LinkedIn invite today:


The profile page on LinkedIn looks legit but the suspicious side of me didn’t think that photo was genuine.


TinEye is an awesome service that allows you to search for copied images on the internet.

Searching for the profile image turned up 25 other miscellaneous sites that include this image. Either this recruiter is a bit of an “internet sensation” or someone’s just copied a picture of someone attractive to create a faux profile.

If it’s the latter their strategy is definitely sound!


Putting a GPS Tracker in the mail

I lent my SPOT Personal Tracker to a friend recently. The SPOT is a neat piece of kit that relays your GPS location via the Globalstar satellite constellation in various different ways. There are two buttons that allow you to send pre-configured messages to a given set of SMS/email users, an optional “Track Progress” Google Maps overlay and a 911 emergency button that alerts an international emergency centre. It’s been a great reassurance to have this with me on a couple of foreign motorbike trips especially those that get out of mobile phone range.

I posted the SPOT Tracker in a padded envelope and, just before I put it in the post, I enabled the “Track Progress” function for a laugh. The GPS needs a pretty good, clear view of the sky and the satellite transmissions are low power. The low power allows it to track for an amazing 14 days or send 911 signals for approx. 7 days, which is impressive off just a pair of lithium AA batteries. Given that the conditions it would be in were far from ideal, I didn’t have high hopes of it tracking at all.

I took it to the post office near work and posted it by Royal Mail Special Delivery at lunchtime. I expected the parcel to be put on a truck/train for the 250 mile (400km) journey.

That evening I checked in and found that it had sent out some signals!

First from the large Mount Pleasant sorting office, note the red coloured Royal Mail vehicles.
Location 1

It then spent an hour on the road, I’m guessing in a fibreglass or soft topped truck heading out of London and up the M11.

All of the updates while it was on the move are accurate enough that you can see the correct side of the motorway that it was on.
Motorway position

It stopped sending messages an hour after it left the sorting office; you can see clearly that it’s in a loading bay.
Loading Bay, Stansted

Turns out it wasn’t going on its journey in a train or truck, it was going on a plane from Stansted airport. Perhaps I shouldn’t have left it turned on and tracking!

It all went a bit quiet from here until early the next morning, when it showed up 5 miles or so away from my friend’s house.
Prudhoe Depot
Red vans on an industrial estate? Looks like another Royal Mail depot to me.

Finally we watched it on the driver’s delivery round and spotted it a couple of streets away from his house
Close to delivery

It arrived shortly afterwards about 24 hours after being posted.

It was quite cool watching the progress of the parcel in real-time without using the carrier’s own tracking information and would have been incredibly useful if the package had gone missing at all. This was a fun demonstration of the SPOT Tracker which I’d recommend to anyone who does outdoor stuff away from mobile phone reception.

All we need next for parcels is some kind of cheap data-logging accelerometer so you can prove when the courier dropped it…

Big datasets for full-text search benchmarking

A few times recently I’ve looked for large datasets to experiment/benchmark against and I usually manage to come up blank.

I managed to spend longer than usual on this problem yesterday and came up with some which I’ll share with you.

Project Gutenberg

This project hosts the content of over 33,000 books. You can download the data as one book per file and there are full instructions for downloading/mirroring here. It seems that they’ve blocked Amazon AWS IP ranges from mirroring content from their site which is a shame.

The Westbury Lab USENET Corpus

The contents of some USENET groups have been anonymised and cleaned up to form a set containing >28M documents and >25B words. Each week’s data is stored as a single text file and each post simply delimited which makes parsing a breeze. You can easily load this data into a MySQL database with a command similar to:
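The posts can also be split in Python before loading them anywhere. This is only a minimal sketch: it assumes each post is followed by a delimiter line on its own, and the `DELIMITER` value and the `iter_documents` name are assumptions for illustration, so check the corpus documentation for the real marker.

```python
import io
from typing import Iterator, TextIO

# Assumed delimiter line; replace with the marker the corpus actually uses.
DELIMITER = "---END.OF.DOCUMENT---"


def iter_documents(stream: TextIO, delimiter: str = DELIMITER) -> Iterator[str]:
    """Yield one post at a time from a delimiter-separated text stream."""
    buffer = []
    for line in stream:
        if line.strip() == delimiter:
            doc = "".join(buffer).strip()
            if doc:
                yield doc
            buffer = []
        else:
            buffer.append(line)
    # Emit a trailing post that was not followed by a delimiter line.
    doc = "".join(buffer).strip()
    if doc:
        yield doc


# Tiny in-memory example standing in for one week's file.
sample = io.StringIO(
    "first post\nline two\n" + DELIMITER + "\nsecond post\n" + DELIMITER + "\n"
)
for number, post in enumerate(iter_documents(sample), start=1):
    print(number, len(post))
```

Reading the file as a stream keeps memory use flat, which matters when a single week's file holds many thousands of posts.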


It’s also available as part of Amazon’s AWS Public Datasets offering as EBS snapshot snap-c1d156aa in the US West AWS region. Using AWS is a really quick way of getting hold of this dataset without the need to wait for any downloading to complete.

I found this a really nice dataset. With each document at around 5kB, it seemed to be a sensible size for benchmarking email body text, blog/publishing posts, etc.

Wikipedia Text

Wikipedia provide huge database dumps. It seems that there’s an AWS Snapshot snap-8041f2e9 which contains this data too but it’s a couple of years old. There’s also a “WEX” extract (snap-1781757e on AWS) created by the Freebase team which is provided as XML markup ready to be easily imported into a db table, one row per article.

In doing this research I came across a couple of very interesting projects that extract/compile metadata from various sources including Wikipedia. They’re Freebase and DBpedia. I hope to play with some of their datasets and write a post on that in the future.

Using DNS TTL to control migrations

Often when you’re moving services from one piece of hardware/location to another it will involve a DNS change. From my experience the DNS change is usually the final change that’s used to move the traffic.

DNS entries can have TTLs. TTL means “time to live” and is the expiry time of the record. For a normal running website you could expect a TTL of 86400 (seconds) or one day. This means that once a DNS server or other DNS client has requested the record it’ll hold onto a cached copy for up to a day before re-requesting it.

If you were to leave your TTL at 86400 and change the DNS entry to point to your new server it could take up to a day for the changeover to happen.

Let’s consider two common use cases:

You want the DNS switchover to happen quickly

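Assuming you have a TTL of 86400, make sure you reduce the TTL at least one full TTL period (a day, in this case) before you want to perform the migration, so that every cached copy carrying the old TTL has expired by then.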

I would normally change the TTL to 3600 (1 hour) the day before the planned migration. Then at least one hour before the migration time reduce the TTL down to 600 secs (10 mins). Then at least 10 mins before down to 60 (1 min).

Then make the change and your traffic should flip over to the new IP address pretty quickly and if anything goes wrong you can change the entry back and all traffic should fail back within a minute.

After you’re confident all is working well from the new host you should increase the TTL back up to its normal setting (in steps if you want).
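The staged reduction described above comes down to simple arithmetic: each TTL value has to be replaced at least that many seconds before the migration. Here is a small sketch of that calculation. The `ttl_schedule` helper and the cutover date are made up for illustration and are not part of any DNS tool.

```python
from datetime import datetime, timedelta


def ttl_schedule(migration, current_ttl, steps):
    """Return (latest_change_time, new_ttl) pairs, earliest first.

    A record served with TTL p can sit in a cache for up to p seconds,
    so p must be replaced no later than p seconds before the migration.
    """
    ttls = [current_ttl, *steps]
    return [
        (migration - timedelta(seconds=previous), new)
        for previous, new in zip(ttls, ttls[1:])
    ]


migration = datetime(2024, 1, 10, 12, 0)  # hypothetical cutover time
for deadline, ttl in ttl_schedule(migration, 86400, [3600, 600, 60]):
    print(f"by {deadline:%Y-%m-%d %H:%M} set TTL to {ttl}")
# by 2024-01-09 12:00 set TTL to 3600
# by 2024-01-10 11:00 set TTL to 600
# by 2024-01-10 11:50 set TTL to 60
```

These are the latest safe times; making each change a little earlier gives some margin for resolvers that are slow to refresh.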

You want to move the traffic gradually over to a new service

If you’re not too fussed about which server the traffic hits (e.g. serving static content from file-synced servers) then you might want the traffic to move over gradually. This is a nice approach if you want more of a “soft launch” and don’t want to risk something bad happening to 100% of your traffic if there are problems on the new hardware.

In this case a larger TTL might be desirable. I’d probably go for an hour but it really depends on the situation.

You’d follow similar steps to the ones described above, gradually reducing the TTL until it’s at the value you want. Now modify the DNS record to point to your new service, but as you do this set a low TTL on the new record.

The low TTL on the new record won’t affect the speed at which the new record rolls out, but it does mean that if you need to fail back then entries should be re-cached quickly.

When everything’s failed over nicely increase the TTL again.
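To get a feel for how gradual the move is, here is a rough estimate of the share of caches that have picked up the new record. It is a simplified model, not a measurement: it assumes cache expiry times are spread evenly across one TTL window and that every cache honours the TTL. The `fraction_moved` name is made up for illustration.

```python
def fraction_moved(seconds_since_change: float, old_ttl: float) -> float:
    """Estimate the share of caches now holding the new record.

    Assumes cached copies of the old record expire at times spread
    uniformly over one TTL window after the change is published.
    """
    if old_ttl <= 0:
        return 1.0
    return max(0.0, min(1.0, seconds_since_change / old_ttl))


# With a one-hour TTL on the old record:
for minutes in (0, 15, 30, 60):
    share = fraction_moved(minutes * 60, 3600)
    print(f"{minutes:>2} min after the change: ~{share:.0%} moved")
```

Under this model a longer TTL on the old record simply stretches the same ramp over a longer period, which is why a larger value suits a soft launch.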

Why not keep my TTL low all the time then?

You can, but it’s generally not considered good practice. It’ll also generate much more DNS traffic to your authoritative DNS servers, as other resolvers will need to re-cache entries more often.


These might catch you out:

  • Lots of OSes/browsers will cache a DNS entry for a minimum of 30 minutes so TTLs less than this might not be respected
  • Some caching name servers ignore the published TTL and will apply their own minimum (this goes against the RFCs and is really frustrating)