Beware the NTP “false-ticker” – or do the time warp again…

For the uninitiated, it’s easy to keep the clocks of computers on the Internet in synch using a protocol called NTP (Network Time Protocol).

Why might you want to do this? It’s very helpful in a large network to know that all your gear is synchronised to the same time, so that things such as transactional and logging information have the correct timestamps. It’s a must-have when you’re debugging and trying to get to the bottom of a problem.

There was an incident earlier this week where the two open NTP servers run by the US Naval Observatory (the “authority” for time within the US) both managed to give out incorrect time. There are reports that computers which synchronised against these (and, more importantly, against only these, or these plus one or two other systems) had their clocks reset to the year 2000. The error was then corrected, and the clocks were put back.

Because the affected systems were chiming either against only the affected master clocks, or against those plus a limited number of others, the two incorrect times outnumbered the correct ones. Coming from stratum 1 sources, they were taken as being correct, and the affected systems had their local clocks reset.
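To make the arithmetic of that failure concrete, here is a minimal sketch in Python. It is not the real NTP selection algorithm (which works on correctness intervals and is considerably more careful); it simply picks the largest group of sources whose offsets agree within a tolerance. The function name, the tolerance and all the offset values are made up for illustration.

```python
from typing import Dict, List


def largest_agreeing_group(offsets: Dict[str, float], tolerance: float = 0.1) -> List[str]:
    """Return the largest set of sources whose offsets (in seconds) all lie
    within `tolerance` of each other. A simplified stand-in for NTP's
    selection step, not the real algorithm."""
    ranked = sorted(offsets.items(), key=lambda item: item[1])
    best: List[str] = []
    for i, (_, lowest) in enumerate(ranked):
        # Everything from this source upwards that is still within tolerance.
        group = [name for name, offset in ranked[i:] if offset - lowest <= tolerance]
        if len(group) > len(best):
            best = group
    return best


BAD = -3.0e8  # illustrative: a clock that is years in the past

# Two wrong clocks and one right one: the wrong pair wins.
few = {"tick": BAD, "tock": BAD, "other": 0.004}
# The same two wrong clocks alongside three right ones: they are outvoted.
many = {"tick": BAD, "tock": BAD, "a": 0.004, "b": -0.002, "c": 0.001}

print(largest_agreeing_group(few))   # ['tick', 'tock']
print(largest_agreeing_group(many))  # ['b', 'c', 'a']
```

With only three sources, two clocks telling the same lie form the majority. Add a few more independent sources and the same two clocks are simply discarded.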

There’s been discussion about the incident on the NANOG list…

Now, this can cause particular problems with things that rely on timestamping, such as some databases.

At one point in my career, the team I ran was responsible for providing a time service. Responsibly, we ran three separate NTP servers, in multiple locations, and we used three different source technologies: GPS and two longwave sources, Germany’s DCF77 and the British MSF.

The three servers (by the German time company Meinberg) used a “hybrid” approach: they all took a GPS source, and either a DCF77 or an MSF source. Each hybrid system peered over NTP with the other two servers in our cluster and, by comparing NTP and the two radio sources, could choose which of the source technologies was “reliable”. If one of the source technologies was in error, it could be disregarded. The devices could also be configured not to serve time if the synchronisation was not confirmed.
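I don’t know how the Meinberg firmware actually makes this decision, so the following is only a rough Python sketch of the idea: trust a source only if at least one other source agrees with it, and refuse to serve time at all if nothing is confirmed. The function name, the source labels, the tolerance and the offsets are all assumptions for illustration.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Optional


def confirmed_offset(readings: Dict[str, float], tolerance: float = 0.005) -> Optional[float]:
    """Return the mean offset (seconds) of the sources that are confirmed by
    at least one other source, or None if no two sources agree."""
    trusted = set()
    for (a, off_a), (b, off_b) in combinations(readings.items(), 2):
        if abs(off_a - off_b) <= tolerance:
            trusted.update((a, b))
    if not trusted:
        return None  # nothing confirmed: better to serve no time than wrong time
    return mean(readings[name] for name in trusted)


# GPS and longwave agree, the NTP peers are wildly off: the peers are ignored.
print(confirmed_offset({"gps": 0.001, "dcf77": 0.002, "ntp_peers": 250.0}))

# No two sources agree: withhold service.
print(confirmed_offset({"gps": 0.0, "dcf77": 5.0, "ntp_peers": 250.0}))  # None
```

The important design choice is the second case: a server that goes quiet is a nuisance, but a server that confidently serves the wrong time is a hazard.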

This is the same rationale and approach advocated in the NANOG thread for configuring your own NTP server so that it isn’t led astray by multiple incorrect sources telling the same “lie” about what time it is.

Part of the reason we chose to run this “hybrid” technology when we upgraded our NTP platform was that we had been burned in the past by the previous Caesium (Cs) standards that we had inherited. This older technology was prone to giving out incorrect time if the system was restarted, because the system contained two elements: the Caesium standard and an NTP appliance which derived its time signal from GPS.

When the system is restarted, the Cs standard has to stabilise, which can take some time. GPS sync must also be obtained. The NTP appliances were very “black boxy” and barely configurable, and they were prone to start serving time before GPS sync had been obtained, which meant serving the wrong time. Occasionally, they even needed manual intervention to come up properly.

(It seems, from this report from a SANS ISC handler, that this is likely what happened to USNO’s tick and tock yesterday.)

Now, in theory, this shouldn’t have mattered too much, as we had multiple NTP servers, in multiple places, and we had a stratum 2 service running off the back of the stratum 1 appliances connected to the Cs standards. If only one clock is giving out a wrong signal, your local NTP instance disregards it. We thought we were good.

However, some people had decided to chime devices which only supported a single NTP server against the clock which one day lost power, and the law of sod meant that it gave out incorrect time for a few minutes when it came back up.

This caused mayhem for those systems which chimed against just the one server. Lost logging information, database corruption and all manner of evil. I recall MS Active Directory was a particularly unforgiving casualty.

The complaints we received were plentiful and “interesting” to say the least, but it sometimes felt difficult to be sympathetic to the person cursing at me down the phone when a) the service was provided with no warranty in the first place, b) those doing the complaining had exposed themselves to an incident such as this through their configuration, and c) they were running gear that was still lame enough only to support chiming against a single time server (which it therefore had to trust absolutely).

If they could only chime against one server, why this one? One which depended on a single source tech for its time, rather than a different server which compared time from lots of different places before picking the best one? Apparently, the rationale was that these were “stratum 1” sources, which made them “the best”, and therefore they “had to be correct”.

So, this definitely influenced us choosing the Meinberg systems at our next upgrade. Their hybrid approach and integrated nature protected against such problems, plus we didn’t have to worry about Cs beam tube lifespan – it’s best leaving this sort of stuff to the real timelords, and we could get on with running our network.

Really, any device which can only support chiming against a single time server should be chiming against a service which is collecting and comparing time from a range of sources and technologies.

Of course, an irony here is that even “up-to-date” software such as Apple’s Mac OS looks as though it only supports a single time server when configuring via the UI. By default, it asks you to choose one of three public time servers they provide, or you can enter your own. This probably doesn’t matter for your average laptop or desktop computer, but does become more of a concern for a Mac server.

There are reports that you can add multiple time servers in the UI by separating them with commas or spaces, but this doesn’t seem to work as far as I can tell. A quick ‘ntpq -p’ reveals that it’s only ever using the first one in the list.
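If you want to make that check repeatable, a small Python sketch along these lines can pull the peer list out of ‘ntpq -p’ output. The first character of each peer line is ntpq’s tally code (‘*’ for the selected system peer, ‘+’ for a candidate, ‘x’ for a falseticker, and so on). The sample output below is invented for illustration, using placeholder hostnames and documentation addresses; on a real machine you would feed in the text from running ‘ntpq -p’ yourself.

```python
from typing import List, Tuple

# Invented sample in the usual `ntpq -p` layout; not output from a real host.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1.example.net 192.0.2.10      2 u   41   64  377   11.902   -0.431   0.207
+ntp2.example.net 192.0.2.20      2 u   37   64  377   14.310    0.118   0.352
xntp3.example.net 192.0.2.30      2 u   12   64  377   13.004 9000.000   0.500
"""


def parse_ntpq_peers(output: str) -> List[Tuple[str, str]]:
    """Return (tally_code, remote) pairs from `ntpq -p` style text."""
    peers: List[Tuple[str, str]] = []
    seen_rule = False
    for line in output.splitlines():
        if line.startswith("="):
            seen_rule = True  # peer lines start after the row of '=' signs
            continue
        if not seen_rule or not line.strip():
            continue
        tally, rest = line[0], line[1:]
        peers.append((tally, rest.split()[0]))
    return peers


peers = parse_ntpq_peers(SAMPLE)
print(len(peers), "peers configured")
for tally, remote in peers:
    print(repr(tally), remote)
```

If the list only ever has one entry, whatever you typed into the UI after the first server is being ignored.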

Apple seem to work around this in the public servers they provide, such as time.apple.com, by making them downstream of multiple diverse time sources, so they are less likely to give out the wrong time themselves, barring a local configuration failure.

There is a way of making Mac OS use multiple time servers, which is by editing the ntp.conf file, if you know what you’re doing. Just be prepared for what you did to be overwritten next time you configure using the UI!

The NANOG thread is full of good examples of best practice when configuring your own local NTP server on Unix systems, in particular this message.

My own config looks something like this:

server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst
server 2.debian.pool.ntp.org iburst
server 3.debian.pool.ntp.org iburst
server 0.uk.pool.ntp.org
server 1.uk.pool.ntp.org
server 2.uk.pool.ntp.org
server 3.uk.pool.ntp.org

This has the four default servers from the Debian pool (these were set up automatically in the original install), plus four that I added from the UK NTP pool. You can read more about the NTP pool project.

Anyway, the moral of this story is two-fold: a) diversity is almost always good, especially when routing around failure (which we like to do on the Internet), b) sometimes it’s better not to go straight to the top.