fasttime internal documentation

Overview

This document is written for application developers who would like knowledge of how fasttime derives time, and for developers wishing to modify the fasttime library. A brief description of computer time is followed by implementation details of fasttime's algorithms and data structures.

Computer timekeeping

Current desktop computers have three onboard clock mechanisms for determining the time:

A battery-operated real-time clock keeps time even when the power is off, and is as accurate as any quartz wristwatch. Obtaining the time from this clock is a relatively expensive operation, and typically the operating system will query it during the bootup sequence only.
The operating system sets up a timer to interrupt the CPU at a regular interval. At this interrupt the kernel increments a counter. Windows and most Linux kernels set this interval at 10 milliseconds. This clock has reasonably low frequency drift and jitter, but has a resolution of only 10 ms (depending on the kernel).
All modern CPUs contain a register which is incremented on every clock cycle (i.e., if your processor is clocked at 1.0 GHz, it will be incremented 1 billion times every second). Different architectures give this register a different name; in this document we refer to it as the TSC register.
This clock has very high resolution, but has relatively high frequency wander due to crystal instability, temperature and power fluctuations (which can be caused by system load variation) and explicit power management (speed throttling).

Timestamps from gettimeofday

The gettimeofday system call is able to give a timestamp with up to 1 nanosecond resolution. The time is obtained from the interrupt timer's count (which was initialised to the real-time clock during startup). Any resolution below 10ms is obtained by interpolating the TSC register; since the kernel does not have accurate information about the frequency of the TSC, this is typically done using a simple calibration during startup. For example:

// startup calibration
t1 = read_tsc();
t2 = read_interrupt_count();
sleep(1);
t3 = read_tsc();
t4 = read_interrupt_count();
tsc_rate = (t4 - t2) / (t3 - t1);

// gettimeofday evaluating sub 10ms time
time = read_tsc() * tsc_rate;

The consequences of this are that despite the apparent resolution of 1 nanosecond, the system time can only accurately report times up to a resolution of 10ms, and any lower resolution is based on the boot-time calibration of the TSC register.

In many operating systems gettimeofday is implemented as a system call, requiring a context switch to the kernel and another context switch back again in order to read the time. Besides the obvious hit in performance (which is substantial on platforms such as Pentium where context switching is expensive), applications also risk losing heir time-slice if the kernel takes advantage of the switch. This can be a problem for example in network applications, which want to obtain a timeslice immediately before sending the packet.

Goals of fasttime

fasttime implements algorithms based on Network Time Protocol v4 to provide an accurate estimate of time using the TSC register.

The calibration used for converting the TSC count into a time is continually updated and refined to account for frequency drift.
This repeated calibration must not be affected by system load, and should place minimal burden on the system.
The calibration can be done by a daemon process, and its most recent calibration tables made available to all user processes that implement the fasttime library.
Calibration must adapt quickly to changes in system time; e.g. brought about by an NTP daemon or manual user intervention.
The library should run in user-space, so performance is not lost on context switches.

Derived time

fasttime models the relationship between the TSC value and the current time as a linear function. The calibration maintains an intercept and gradient which can be applied to the TSC value to yield the current time. Initially these values are calculated using a very simple calibration loop similar to that used by the operating system; this is then tuned iteratively with the phase-locked loop.

The phase-locked loop (or PLL) algorithm takes as input the current offset and returns an adjustment to apply to the gradient and intercept of the fasttime conversion. The implementation is loosely based on NTP4's PLL:

prediction_k = offset_k-1 + mu_k * (gradient_k - gradient_k-1);
correction_k = offset_k - prediction_k/2;
intercept_k = correction_k;
gradient_k = gain * correction_k / mu_k

Where k is the iteration, gain is some constant and mu is the difference in TSC value from the previous iteration.

Two factors govern the behaviour of the PLL: the value of the gain and the time between iterations, known as the loop delay. Shorter loop delays allow the derived clock to converge on the system time faster, but are more prone to oscillation and errors in offset measurement. Longer loop delays generally give more stable performance but any error in the derived clock will take longer to correct. The PLL gain plays a similar role: higher gains cause oscillation while lower gains take a longer time to settle.

In fasttime, the loop delay is periodically lengthened when the clock is stable, and shortened when the clock is unsettled. The settings that control how much to lengthen or shorten the delay by, as well as the PLL gain, can be specified as command line arguments to fasttimed.

Clock sampling

A delay exists from when fasttime calls gettimeofday to when the time is evaluated, and then again before the call returns. A naive way to estimate the offset of the derived clock versus system time is to average the derived time over a system call:

t1 = get_derived_time();
t2 = gettimeofday();
t3 = get_derived_time();
offset = t2 - (t1 + t3) / 2;

This will give an accurate result if the (t2 - t1) is equal to (t3 - t2); that is, the delays are symmetric. There are many reasons why this might not be the case, for example if the time slice is lost during one of the context switches (this has been measured and confirmed to occur at least on Pentium 4's running Linux).

A better approach is described in [Shalunov 2000], where multiple samples are taken and combined in a way that only needs to assume the delays are randomly distributed symmetrically. This is implemented both in fasttime to sample the system time, and in the demo applications when testing the accuracy of fasttime.

Rate change amortisation

One of the main applications for getting the current time is to time the length of some process. For these applications it is less important to have the time absolutely correct, so long as it is at a stable and correct rate. fasttime accomodates these programs by avoiding introducing drastic rate changes where possible.

Where the change in rate is relatively small (due to normal PLL procedure), the change is made gradually over several seconds. A large error will still be corrected immediately (for example, a change in system time).

Clock filtering

Due to the poor calibration of the system clock by the operating system, there can be occasional glitches in its return value on the order of 30-100 microseconds. This sort of jitter would not normally be noticed by most applications, but is significant to fasttime as it will introduce an oscillation into the clock which can take some time to correct.

To combat this, and expected hardware jitter, a filter is applied to the offset before passing it to the PLL. The offset is compared with the median of the 10 most recent samples; if it exceeds it by some amount (say, 5 times) the clock is not adjusted. In effect the sample is thrown away, though it does contribute to later filtering, so when a genuine change in system time is made it will make its way through the filter after a short time.

Shared memory protection

fasttime operates in a client/server model, where the calibration table is continually updated in a separate process or thread to the client application. In the case where fasttime is in a separate process (the fasttimed daemon) POSIX shared memory is used to allow the client(s) to access the calibration.

The standard means of protecting shared memory are to use mutexes, semaphores or message passing. None of these can be applied in this case as it would require the client application to make system calls, which is one of the main motivators for not using gettimeofday.

Instead, fasttime maintains a circular array of calibration tables, only one of which is active at any time (meaning it is used to calculate derived time by a client). The daemon or calibration thread updates an unused calibration table and then atomically updates the index pointing to the active table. This atomic update is done with CPU instructions rather than a system call.

Terminology

System time: The current time, as returned by gettimeofday
Derived time: Current time, calculated by fasttime
Offset: The difference between system and derived time
Rate: The frequency, or speed of a clock. For system time this is ideally 1 sec/sec, but may vary due to wander or NTP adjustment.
Loop delay: Time between iterations of the PLL

References

Clock Discipline Algorithms for the Network Time Protocol Version 4, D. Mills 1997

Adaptive Hybrid Clock Discipline Algorithm for the Network Time Protocol, D. Mills 1998

PC Based Precision Timing Without GPS, A. Pasztor & D. Veitch 2002

NTP Implementation and Assumptions about the Network, S. Shalunov 2000

Source code for NTP4 is also an excellent reference as it differs from the descriptions above.