Cloudflare Leapsecond Lurch and time sync engineering
Many clock synchronization systems lurch the clock on a leap second by default. TimeKeeper's default is instead to speed up or slow down the clock over a few minutes (slewing) in order to prevent failures. Lurching, or “stepping”, on most leap seconds involves repeating a second on the clock: a millisecond clock ticks to 23:59:59.999 (one millisecond before midnight) and then rolls back to 23:59:59.000. The backward lurch immediately resets the clock to follow the change in official UTC time, but it can wreak havoc on both system software and the integrity of timestamped data. A second can be a long time in a modern transaction system, and lurching the clock back means that transactions ordered one way in real time may be ordered the other way by timestamp. For example, imagine a system sends a bid to a stock exchange and timestamps it 23:59:59.999, and the confirmation comes back 1 millisecond later, when the lurched clock says the time is 23:59:59.000, nearly one second earlier.
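The bid example can be reproduced with a toy model of a stepped clock. This is only a sketch: `stepped_clock` is a hypothetical function written for illustration (it is not part of TimeKeeper or any NTP client), and the date is an arbitrary placeholder.

```python
from datetime import datetime, timedelta

# Illustrative date only; what matters is the time of day.
LAST_SECOND = datetime(2016, 12, 31, 23, 59, 59)

def stepped_clock(elapsed_ms: int) -> datetime:
    """Wall-clock reading `elapsed_ms` real milliseconds after 23:59:59.000.

    Hypothetical model of a step: after showing 23:59:59.999 the clock
    rolls back to 23:59:59.000 and plays the same second again.
    """
    if elapsed_ms >= 1000:
        elapsed_ms -= 1000  # the repeated second
    return LAST_SECOND + timedelta(milliseconds=elapsed_ms)

bid_sent = stepped_clock(999)     # stamped 23:59:59.999
confirmed = stepped_clock(1000)   # 1 ms later in real time, stamped 23:59:59.000

events = [("bid", bid_sent), ("confirmation", confirmed)]
by_timestamp = [name for name, ts in sorted(events, key=lambda e: e[1])]

print(by_timestamp)               # confirmation sorts before the bid
print(bid_sent - confirmed)       # apparent gap of 999 ms in the wrong direction
```

Sorting by timestamp puts the confirmation ahead of the bid that caused it, which is exactly the ordering inversion described above.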
Application software that depends on time never going backward can also fail, as Cloudflare’s did. The software requests the time from the system, does something, requests the time again, and gets an earlier time, which is something that should never happen. During an earlier leap second, the thread scheduler for some Java-based systems failed for similar reasons. We don’t see any reason why application software should have to handle such a peculiar event.
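To make that failure mode concrete, here is a small sketch of the "request time, do something, request time again" pattern. The `SteppedWallClock` class is a hypothetical stand-in that replays two canned readings around a step; `time.monotonic()` is the standard Python call for a clock that is guaranteed not to go backward, shown only for contrast.

```python
import time

class SteppedWallClock:
    """Hypothetical stand-in for a wall clock that is stepped back at a leap."""

    def __init__(self, readings):
        self._readings = iter(readings)

    def now(self) -> float:
        return next(self._readings)

def timed_work(clock_fn) -> float:
    """Read the clock, do something, read it again, return the difference."""
    start = clock_fn()
    # ... the application's real work would happen here ...
    end = clock_fn()
    return end - start

# Seconds since midnight: 23:59:59.999, then 23:59:59.000 after the step.
wall = SteppedWallClock([86399.999, 86399.000])

wall_elapsed = timed_work(wall.now)        # negative: time appears to run backward
mono_elapsed = timed_work(time.monotonic)  # never negative

print(f"wall clock elapsed:      {wall_elapsed:+.3f} s")
print(f"monotonic clock elapsed: {mono_elapsed:+.6f} s")
```

Code that uses the wall-clock difference as a duration, a timeout, or a divisor sees a negative value here, which is typically where the crash or hang comes from.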
TimeKeeper’s default leap second behavior is a “slew” lasting a couple of minutes, which keeps time monotonic (no backward lurches) and preserves timestamp ordering. There is a step option, but it is generally used only by customers who suspend transaction processing for a couple of seconds around the leap or who have otherwise compensated in application software. Google’s solution is to slew over 20 or 24 hours for a much gentler slope. We don’t do that because it means living with clocks that are nearly a second off for most of a day, and because we haven’t seen any application or OS software that can’t handle the shorter slew interval. You can see both the Google slew and the lurch of most public NTP servers in the TimeKeeper graph below. TimeKeeper’s own slew is there too, but you have to look very closely to see it.
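The trade-off between a short and a long slew comes down to simple arithmetic. The sketch below assumes a linear slew (the actual profiles used by TimeKeeper and Google are not specified here) and takes "a couple of minutes" to mean 120 seconds; both function names are hypothetical.

```python
def slew_rate_ppm(window_s: float, correction_s: float = 1.0) -> float:
    """Clock-rate adjustment, in parts per million, needed to absorb
    `correction_s` seconds evenly over `window_s` seconds (linear slew assumed)."""
    return correction_s / window_s * 1e6

def remaining_offset(t_s: float, window_s: float, correction_s: float = 1.0) -> float:
    """Seconds of correction still outstanding `t_s` seconds into a linear slew."""
    fraction_done = min(max(t_s / window_s, 0.0), 1.0)
    return correction_s * (1.0 - fraction_done)

windows = {
    "2 minutes (assumed)": 2 * 60,
    "20 hours": 20 * 3600,
    "24 hours": 24 * 3600,
}

for label, window in windows.items():
    print(f"{label:>20}: {slew_rate_ppm(window):8.1f} ppm")

# How much of the one-second correction is still outstanding an hour in?
print(remaining_offset(3600, 24 * 3600))  # 24-hour slew
print(remaining_offset(3600, 2 * 60))     # 2-minute slew has long since finished
```

Under these assumptions the short slew runs the clock several thousand ppm fast or slow for two minutes, while the day-long slew needs only about 12 to 14 ppm but is still carrying most of the one-second correction hours later.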
The leap second has many possible failure points. Months before this one, some defective GPS clocks were jumping the gun (something that TimeKeeper detects and rejects), and I’m sure many additional reports of failures will come in over the next few weeks. Engineering to cushion the effect requires careful design and implementation, and consideration of all the ramifications for systems and application software.