The Limits of NTP Accuracy on Linux
Lately I’ve been trying to find (and understand) the limits of time
syncing between Linux systems. How accurate can you get? What does it
take to get that? And what things can easily add measurable amounts of
time error?
After most of a month (!), I’m starting to understand things. This is
kind of a follow-on to a
previous post, where I walked through my setup and goals, plus another
post where I discussed time syncing in general. I’m trying to get
the clocks on a bunch of Linux systems on my network synced as closely
as possible so I can trust the timestamps on distributed
tracing records that occur on different systems. My local network
round-trip times are in the 20–30 microsecond (μs) range, and I’d like clocks to be less than 1 RTT apart from each other. Ideally, they’d be within 1 μs, but 10 μs is fine.
It’s easy to fire up Chrony against a local GPS-backed time source (technically GNSS, which covers multiple satellite-backed navigation systems, not just the US GPS system, but I’m going to keep saying “GPS” for short) and see it claim to be within X nanoseconds of GPS, but it’s tricky to figure out if Chrony is right or not. Especially once it’s claiming to be more accurate than the network’s round-trip time (20 μs or so), the amount of time needed for a single CPU cache miss (50-ish nanoseconds), or even the amount of time that light would take to span the gap between the server and the time source (about 5 ns per meter).
I’ve spent way too much time over the past month digging
into time, and specifically the limits of what you can accomplish with
Linux, Chrony, and GPS. I’ll walk through all of that here eventually,
but let me spoil the conclusion and give some limits:
- GPSes don’t return perfect time. I routinely see up to 200 ns differences between the 3 GPSes on my desk when viewing their output on an oscilloscope. The time gap between the 3 sources varies every second, and it’s rare to see all three within 20 ns of each other. Even the best GPS timing modules that I’ve seen list ~5 ns of jitter on their datasheets. I’d be surprised if you could get 3-5 GPS receivers to agree within 50 ns or so without careful management of consistent antenna cable length, etc.
- Even small amounts of network complexity can easily add 200-300 ns of systemic error to your measurements.
- Different NICs and their drivers vary widely on how good they are for sub-microsecond timing. From what I’ve seen, Intel E810 NICs are great, Intel X710s are very good, Mellanox ConnectX-5s are okay, Mellanox ConnectX-3s and ConnectX-4s are borderline, and everything from Realtek is questionable.
- A lot of Linux systems are terrible at low-latency work. There are a lot of causes for this, but one of the biggest is random “stalls” due to the system’s firmware entering SMM (System Management Mode) to handle power management or other activities, “pausing” the observable computer for hundreds of microseconds or longer. In general, there’s no good way to know if a given system (especially cheap systems) will be good or bad for timing without testing it. I have two cheap mini PC systems that have inexplicably bad time syncing behavior (1300-2000 ns) and two others with inexplicably good time syncing (20-50 ns). Dedicated server hardware is generally more consistent. (There’s a minimal stall-detector sketch just after this list.)
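To make those stalls concrete, here’s a minimal detector of my own (an illustrative sketch, not tooling from this setup): it spins on Linux’s monotonic clock and reports any gap longer than an arbitrary 100 μs threshold. It can’t distinguish SMM pauses from scheduler preemption or other interruptions, but on a “bad” machine the large gaps show up within seconds.

```python
# Minimal stall detector (illustrative sketch; the threshold and runtime are
# arbitrary). Spin on the monotonic clock and report any gap over 100 μs, the
# kind of invisible pause (SMM, power management, preemption) that ruins
# sub-microsecond timing. Pinning it to a core with taskset gives cleaner results.
import time

THRESHOLD_NS = 100_000            # 100 μs
RUN_NS = 10 * 1_000_000_000       # run for ~10 seconds

last = time.monotonic_ns()
deadline = last + RUN_NS
while last < deadline:
    now = time.monotonic_ns()
    if now - last > THRESHOLD_NS:
        print(f"stall: {(now - last) / 1000:.1f} μs")
    last = now
```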
All in all, I’m able to sync clocks to within 500 ns or so on the
bulk of the systems on my network. That’s good enough for my purposes,
but it’s not as good as I’d expected to see.
Now, it’s certainly possible to do better than this in
specific cases. For examples, see Chrony’s examples
page, where they get <100 ns of error over the network for a
specific test case. In general, though, it’s going to be hard to
do much better than 200 ns consistently across a real network without a
lot of careful engineering.
I’ll explain my conclusions in a bit, but first some background and
context.
My Setup
For the sake of testing time, I’m using 8 different (but identical)
servers as time clients and 5 different GPS-backed time sources, all
local.
Relevant bits of my network for time testing. NTP sources are in blue
circles, the servers tested are the purple rectangle, and network
switches are orange or yellow rectangles.
Time Sources
- ntp1: an older LeoNTP GPS-backed NTP server. In the garage, connected to its own outdoor GPS antenna. Only has a 10/100 Mbps Ethernet connection, but this hasn’t mattered in practice.
- ntp2: identical hardware to ntp1. Sitting on my desk and connected to a different Ethernet switch. Connected to a GPS antenna splitter and an outdoor antenna.
- My desktop: a 32-core AMD Threadripper 5975WX with a ConnectX-6 NIC (2×40 Gbps) for network traffic and an Intel E810-XXVDA4T (using 2×10 Gbps, one to each switch) with a GPS receiver and hardware timing support. Shares the antenna with ntp2, ntp4, and ntp5.
- ntp4: a Raspberry Pi CM5 with a Timebeat GPS module including PPS timing straight to the NIC. Connected via 1 GbE. (Where is ntp3, you ask? I ran out of antenna ports, and anyway the system that I dubbed ntp3 only supports PTP, not NTP.)
- ntp5: a Raspberry Pi CM5 with a Waveshare GPS module with GPIO PPS but no working Ethernet PPS. Connected via 1 GbE.
Test Devices
- Eight identical servers (d1 through d8) running Ubuntu 24.04 with identical Chrony configs. The servers are HPE M510 blades with 16 Xeon-D cores in a pair of HPE EL4000 enclosures. Each enclosure is connected to both of the core switches, giving each of the 8 servers 2 dedicated 10 GbE links via a built-in Mellanox ConnectX-3 NIC. Chrony metrics are collected every 10 seconds and stored in Prometheus for analysis.
- A Siglent SDS1204X-E oscilloscope connected to the PPS outputs from ntp2, ntp4, and my desktop. It can show relative differences in PPS times to within about a nanosecond. (The oscilloscope only has 200 MHz bandwidth but captures 1 billion samples per second, so I’d expect it to be able to show differences between PPS sources to somewhere between 1 and 5 nanoseconds. In any case, the observed differences are much larger than this; see below.)
Network
The core of the network is a pair of Arista
7050QX-32S switches. These are 32-port 40 GbE switches with hardware
support for PTP. They’re older, but very solid.
Linux systems with multiple network connections (the 8 test servers and my desktop) are connected to each core switch with a /30 link per interface and
then run OSPF
with ECMP
to provide redundancy. Devices with a single network interface
(Raspberry Pis and LeoNTP devices in this example) are connected to
layer 2 switches which are then connected to the core switches via MLAG
links. This means that there are multiple possible paths between any two
devices through the network, as both ECMP and LAG use a mix of source
and destination addresses to decide which link to use. So the path
between d1 and ntp1 may be almost completely
different from the path between d2 and ntp1,
even though d1 and d2 are sitting less than an
inch from each other and share all of the same physical network links.
Even more entertaining, the path back from ntp1 to
d1 and d2 may or may not be the same as the
forward path. This only matters when nanosecond-level timings are
involved, as we’ll see in a bit.
Sources of Error
So — finally — I have multiple NTP servers, presumably synced to GPS
satellites as accurately as possible, and multiple servers, all synced
to the NTP servers over a relatively low-latency network. How accurately
are my servers syncing to GPS time? And where is that going wrong?
Chrony’s claims
So, if you’re trying to see how accurate Chrony’s time syncing is,
the easiest place to start is with Chrony’s own metrics. In this case,
Chrony claims that it’s had a median offset of 25–110 ns over the past
day:
Chrony’s median offset over the past day.
Now, this isn’t the best metric for a number of reasons, but it’s a
start. It says that Chrony thinks that it’s synced to within 110 ns of
something, but it doesn’t really tell us anything about what
it’s synced to or how accurate it actually is. So, let’s dig in a bit
deeper.
GPS error and drift
First, the GPS receivers in my NTP time servers aren’t perfectly
accurate. Even top-tier GPS receivers will still have ~5 ns of timing
noise, and lower-tier ones will be 20–60 ns (or possibly higher). Datasheet links: the ublox ZED-F9T in my desktop claims 5 ns of accuracy and 4 ns of jitter, the ublox NEO-M8T in ntp5 (not graphed here) claims 20-500 ns of accuracy depending on the antenna, and LeoNTP claims 30 ns of RMS accuracy and 60 ns of 99th-percentile accuracy.
Fortunately, this is relatively easy to measure, at least when the
devices are within a few feet of each other. You can connect an
oscilloscope to their PPS outputs and directly view the differences
between them. Here’s the result for ntp2,
ntp4, and my desktop:
Oscilloscope output. The Raspberry Pi/Timebeat
Timecard Mini Essential is on top in yellow, then the LeoNTP in
purple, and an Intel E810 on the bottom in blue. Animated; each update
covers 1 second of real time.
Notice that (a) they don’t all agree and (b) they move around relative to each other. In this sample, there’s about a 200 ns difference between ntp4 (top, yellow) and my desktop (bottom, blue). Some of this is due to cable length differences (my antenna and PPS leads aren’t all identical lengths, so there’s probably ~20 ns of difference there alone), but that doesn’t explain all of it.
Even ignoring ntp4, there’s ~25 ns of variance between ntp2 (purple, middle) and my desktop (blue, bottom). Notice that they move relative to each other over time in a bit of a pattern.
In general, offsets can mostly be compensated for, either in
Chrony or directly on the GPS device, but jitter is trickier.
Depending on how you look at things, I’m seeing a minimum of
25 ns of error at this level, and potentially up to 200 ns.
When you give Chrony multiple time sources that are all equivalently good, it’ll generally average its time across the whole set of sources. So adding one time source with 200 ns of offset to 2 other mostly-identical time sources should only add ~67 ns of error at most (200 ns averaged across 3 sources), and possibly no error at all, if Chrony decides that the 200 ns source is too far off to be used.
Network error
Chrony tries to compensate for network delays when it syncs to NTP
sources over the network, but it has to make some assumptions that
aren’t always true. It assumes that network delays are symmetrical (that
is, if it takes 30 μs for network traffic to get from the client to the server and back, then it takes 15 μs each way). This isn’t generally
true, but for a lot of networks it’s close enough.
Apparently it’s not particularly true for my network.
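To see how that assumption feeds into the math, here’s a minimal sketch of the standard NTP client offset/delay calculation (the four timestamps below are invented for illustration, not measurements from this network). With a 30 μs round trip split 25 μs out and 5 μs back instead of 15/15, the computed offset is wrong by 10 μs, half the asymmetry, and nothing in the protocol lets the client detect it.

```python
# Sketch of the NTP offset/delay math (timestamps in seconds; numbers invented).
def ntp_offset_delay(t1, t2, t3, t4):
    """t1: client send, t2: server receive, t3: server send, t4: client receive."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # assumes the path delay is symmetric
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing time
    return offset, delay

# The clocks actually agree, but the path is asymmetric: 25 μs out, 5 μs back.
t1 = 0.0
t2 = t1 + 25e-6      # request arrives after 25 μs
t3 = t2              # server replies immediately
t4 = t3 + 5e-6       # reply arrives 5 μs later

offset, delay = ntp_offset_delay(t1, t2, t3, t4)
print(f"offset: {offset * 1e6:+.1f} μs, delay: {delay * 1e6:.1f} μs")
# -> offset: +10.0 μs, delay: 30.0 μs: a 10 μs error from asymmetry alone.
```

The same math applies at the nanosecond scale: a few hundred nanoseconds of path asymmetry turns directly into roughly half that much systematic offset error.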
One of the things that I’m monitoring with Chrony and Prometheus is
the current offset for each time source on each Chrony client. I have
data for my 8 test servers (d1 through d8)
tracking the relative offsets for ntp1 and
ntp2. I was expecting to see that either ntp1
or ntp2 was consistently ahead of the other one, given
cable lengths, network delays, antenna differences, and so forth.
Instead, half of the servers see ntp1 as running faster,
while half show ntp2 as running faster:
The relative time offsets for ntp1 vs ntp2
across d1 through d8. Each line is one of the
d* servers. Note that half of the servers see
ntp1 as being ahead of ntp2 and half see the
opposite.
Prometheus query for graph
quantile_over_time(
0.5,
chrony_sources_last_sample_offset_seconds{instance=~"${client}",source_address="10.1.0.238"}[1h]
)
- on (instance)
quantile_over_time(
0.5,
chrony_sources_last_sample_offset_seconds{instance=~"${client}",source_address="10.1.0.239"}[1h]
)
The servers can’t agree on whether ntp1 runs faster than ntp2 or not — 4 of the 8 see ntp1 as faster, while 4 see ntp2 as faster, with the servers in two bands around +100 ns and -300 ns. This has been consistent for weeks. (To be clear, since half of the d* servers are in one enclosure and half are in another: the timing differences are basically random, and don’t follow which chassis they’re in or which network cables they use. Of the 4 physical servers in each enclosure, 2 think ntp1 is faster and 2 think ntp2 is faster, but which two aren’t even consistent between enclosures.)
Presumably this is caused by asymmetric traffic paths in my network.
If you look back to the network diagram above, you’ll see that the test
servers each have a link to each core switch, and that the L2 switches
that the NTP servers use are each connected to both core switches. Any
time you have redundant links like this, something has to
decide which path any given packet is going to take over the network. In general, network people really dislike just picking paths at random, largely because that’d mean that packets could arrive out of order, and a lot of TCP stacks hate out-of-order traffic. So, generally, traffic is assigned to a path using a hash of source and destination addresses. (The exact implementation varies widely and is frequently configurable on higher-end devices. For L2 links most devices just hash the source and destination MACs, while for L3 links the hash usually includes the source and destination IPs and may include the TCP/UDP port numbers or other easy-to-locate data.)
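As a rough illustration of why two neighboring servers can end up on different paths, here’s a toy model of flow hashing (my own sketch; the addresses are hypothetical and the CRC-based hash stands in for whatever vendor-specific hardware hash a real switch uses):

```python
# Toy ECMP/LAG path selection: hash the flow tuple onto one of N equal-cost
# links. Illustrative only; addresses are hypothetical and no real switch
# uses this exact hash.
import zlib

def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              n_links: int = 2) -> int:
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_links

# Two clients talking to the same NTP server can hash onto different uplinks,
# and each direction hashes independently, so the return path may differ too.
print(pick_link("198.51.100.11", "203.0.113.1", 40123, 123))  # "d1" -> server
print(pick_link("198.51.100.12", "203.0.113.1", 40123, 123))  # "d2" -> server
print(pick_link("203.0.113.1", "198.51.100.11", 123, 40123))  # server -> "d1"
```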
Presumably one of the possible paths between servers and time sources on the network is faster than the others, and flows that hash onto the faster path consistently skew the results in one direction or the other.
A less complex (and less redundant!) network would have less of this
sort of error, but asymmetric round trip times show up
everywhere in networking when you’re counting nanoseconds. At
some level, this isn’t avoidable.
So, on my network, this seems to cause a minimum of 200 ns of potential error, as various paths take different amounts of time, and Chrony isn’t able to compensate automatically. (Chrony has a per-source setting for adjusting latency asymmetry, so I could probably hand-adjust all 16 d* -> ntp* config lines to minimize the error if I really cared about ~200 ns of error, but it’s unlikely that it’d buy me much useful accuracy.)
Cross-server synchronization
As an experiment, I told all 8 of my test servers to use each other
as time sources. I added them using Chrony’s noselect flag,
so they wouldn’t try to use each other as authoritative; they’d just
monitor the relative offsets between servers and record them over
time. (I’m actually measuring time really aggressively between the d* servers: polling every 250 ms and averaging across 5 samples to try to minimize noise.) The flags from chrony.conf:
server xxx noselect xleave presend 9 minpoll -2 maxpoll -2 filter 5 extfield F323
Here’s the median offset between servers, in nanoseconds, over 4 hours (note that the median isn’t really the best way to look at offsets in general, but since Chrony maintains its own view of time and slowly adjusts it relative to its sources, a few wildly inaccurate responses won’t really change Chrony’s time much, if at all):
| | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|
| d1 | | 83 | -70 | -138 | -18 | -161 | -207 | -132 |
| d2 | -145 | | -29 | -75 | -4 | -138 | -158 | -29 |
| d3 | 51 | -23 | | -33 | 28 | -31 | -65 | -74 |
| d4 | 74 | 106 | -42 | | 106 | -23 | 27 | 91 |
| d5 | 5 | -40 | -66 | -89 | | -49 | -48 | 0 |
| d6 | 153 | 173 | 0 | 62 | 63 | | -28 | 47 |
| d7 | 190 | 173 | 36 | -32 | 43 | 19 | | 58 |
| d8 | 131 | -6 | 58 | -64 | 5 | -52 | -47 | |
Prometheus query for chart
quantile_over_time(
0.5,
chrony_sources_last_sample_offset_seconds{instance=~"d[1-8].*",source_address=~"10.0.0.10[0-9]"}[5m]
)
Plus some work in Grafana to turn this into a table.
Notice that they’re all within 207 ns of each other, but the timings aren’t particularly consistent. For instance, looking at the timings between d2 and d3 shows that they’re 29 ns apart when you query in one direction and 23 ns apart when you query in the other direction, but they’re both off in the same direction. If network error wasn’t a factor, then I’d expect to see one number be positive and the other be negative; that’s not always the case here.
In general, this aligns nicely with the 200-300 ns of error seen in the previous section, but it shows that there’s a serious limit to how accurately Chrony can measure nanoseconds on this hardware.
Observed offsets across all
sources
Earlier, I discussed the difference between ntp1 and
ntp2, and how each server had a different view of the
difference between them. On average, ntp1 seems to run
50–150 ns ahead of ntp2.
Remember that my big goal here is less accurate time and
more consistent time. This 50–150 ns of inconsistency isn’t a
big deal, but when I started adding additional time sources, I
discovered that some of them were even further away from
ntp1 and ntp2, and I wanted to minimize the
total time spread. I’d really like it if adding additional NTP sources
to the mix didn’t make things even less consistent.
There are a lot of things to like about the LeoNTP time servers, but
configurability isn’t one of them. There’s no way that I can
see to add an offset between GPS time and the NTP time that they export.
On the other hand, the 3 Chrony-based time servers (my desktop,
ntp4, and ntp5) can be adjusted to
control the offset between GPS time and NTP time. And, in fact, you
can’t really run with a 0-second offset because GPS time is
based on TAI
and NTP time is usually UTC (strictly speaking, it’s usually a weird mutant that mostly ignores leap seconds; look up “leap smear” for the ugly mess), and the two are currently 37 seconds apart. Leap seconds make life hard.
Originally, I discovered that time from my desktop was around 1 μs off when compared with ntp1 and ntp2 by the d* servers, and time from the Raspberry Pi-based ntp4 was almost 38 μs off! To mitigate this, I graphed the
average offset between each time source across all 8 servers and then
adjusted offsets on my desktop and ntp4 to be as close as
possible to the median of ntp1 and ntp2. To do
this, I changed my desktop’s TAI offset from -37 to
-36.999999160 and the offset of ntp4 to
-37.000033910.
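For the desktop, the arithmetic behind that value is just the nominal TAI-UTC offset nudged by the measured error. A quick sketch with the numbers from above (the exact sign depends on where and how the GPS/refclock correction is applied in a given setup):

```python
# Offset arithmetic only (illustrative; the sign convention depends on the setup).
nominal_tai_utc = -37.0     # seconds: nominal TAI-UTC offset applied to GPS time
measured_error = 840e-9     # desktop looked ~840 ns off vs. the ntp1/ntp2 median
adjusted = nominal_tai_utc + measured_error
print(f"{adjusted:.9f}")    # -36.999999160
```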
Now, all 4 sources are basically in unison:
Observed offsets over the past day.
Prometheus query for graph
avg by (source_address) (
quantile_over_time(
0.5,
chrony_sources_last_sample_offset_seconds{
instance=~"d[1-8].*",
source_address=~"10[.].*",
source_address=~"${ntpsource}"
}[1h]
)
)
Why were times so far off? For my desktop, it’s probably a mix of
multipath weirdness and delay in the network stack. 840 ns isn’t a huge
amount of time, although it’s bigger than what I’ve seen elsewhere.
I’m less sure what’s going on with ntp4. It was originally seeing over 50 μs of error, but reducing the Ethernet coalescing limits on eth0 (ethtool -C eth0 tx-usecs 0 rx-usecs 0) helped quite a bit. I’m going to have to keep poking at this for a while.
Observed NTP jitter
across all sources
I can compare the jitter of 4 of my GPS time sources across all 8
d* servers. To calculate jitter in this case, I’m looking
at the difference between the 1st and 99th percentile of each source’s
offset from Chrony’s best estimate of the current time. I’m calculating
the percentiles over 15-minute windows, subtracting the 1st percentile from the 99th, and then averaging those results across all 8 servers. (It’s not the best way to do this statistically, but there’s a limit to what you can do with Prometheus easily.)
Graph of jitter by source across all 8 d* servers.
Prometheus query for graph
(
avg by (source_address) (
quantile_over_time(
0.99,
chrony_sources_last_sample_offset_seconds{
instance=~"d[1-8].*",
source_address=~"10[.].*",
source_address=~"${ntpsource}"
}[15m]
)
)
) - (
avg by (source_address) (
quantile_over_time(
0.01,
chrony_sources_last_sample_offset_seconds{
instance=~"d[1-8].*",
source_address=~"10[.].*"
}[15m]
)
)
)
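For reference, here’s the same p99 minus p1 calculation in plain Python for a single source (a sketch; samples is a hypothetical list of (unix_timestamp, offset_in_seconds) pairs like the ones Chrony exports to Prometheus):

```python
# Windowed jitter: split one source's offset samples into 15-minute buckets
# and report p99 - p1 per bucket. "samples" is hypothetical input data.
from statistics import quantiles

def windowed_jitter(samples, window_s=900):
    buckets = {}
    for ts, offset in samples:
        buckets.setdefault(int(ts // window_s), []).append(offset)
    jitter = {}
    for window, offsets in sorted(buckets.items()):
        if len(offsets) < 2:
            continue                       # not enough samples for percentiles
        qs = quantiles(offsets, n=100)     # qs[0] ~ 1st percentile, qs[98] ~ 99th
        jitter[window * window_s] = qs[98] - qs[0]
    return jitter
```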
Over the past hour, that works out to:
| Time Source | Jitter |
|---|---|
| desktop | 1.01 μs |
| ntp1 | 1.28 μs |
| ntp2 | 1.40 μs |
| ntp4 | 2.02 μs |
So, my desktop (with a fast NIC and a very good GNSS module) has the
least jitter. The two LeoNTP boxes are next, with a bit more, and the
Raspberry Pi has 2x the jitter of my desktop. Since Chrony averages out
offsets across sources and over time, jitter isn’t necessarily
a big deal as long as it’s under control.
Which brings up ntp5, which I’d excluded from the
previous graph. Here’s why:
Graph of jitter by source across all 8 d* servers
including ntp5, which has accuracy issues every 2
hours.
I still haven’t figured out why this loses accuracy every 2 hours,
but there are other weird things about ntp5, so I’m not all
that worried about it overall.
Things that hurt syncing
Along the way, I’ve found a bunch of things that hurt time syncing. A
short list:
- Network cards without hardware timestamps. Realtek, for instance. (There’s a quick way to check what a NIC reports just after this list.)
- Tunnels. I had 3 servers that were originally sending traffic to the network with ntp1 and ntp2 over VxLAN, and their time accuracy was terrible. I suspect that the NICs’ hardware timestamps weren’t propagated correctly through the tunnel decapsulation. Plus, it made network times even less symmetrical.
- NIC packet coalescing. On Raspberry Pi CM5s especially, I had to disable NIC coalescing via ethtool -C or I had terrible accuracy.
- Software in general. I get the best results on NTP servers where the GPS’s PPS signal goes directly into the NIC’s hardware, completely bypassing as much software as possible.
- Running ptp4l and Chrony on the same ConnectX-4 NIC, or potentially ConnectX-3 or -5 NICs. Intel seems perfectly happy in the same situation.
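On the hardware-timestamp point above: ethtool can report what a NIC and its driver actually support before you trust it for sub-microsecond NTP. Here’s a small wrapper (it assumes the ethtool CLI is installed, and the interface name is just a placeholder):

```python
# Print the time stamping capabilities the driver reports for a NIC
# (equivalent to running "ethtool -T eth0" by hand).
import subprocess
import sys

iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"   # placeholder interface name
result = subprocess.run(["ethtool", "-T", iface],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```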
Summary
So, in all, I’m seeing time syncing somewhere in the 200–500 ns range
across my network. The GPS time sources themselves are sometimes as far
as 150 ns apart, even after compensating for systemic differences, and
the network itself adds another 200–300 ns of noise.
In an ideal world, it’d be cool to see ~10 ns accuracy, but it’s not
really possible at any level with this hardware. My time sources aren’t
that good, my network adds more systemic error than that, and when I try
to measure the difference between test servers I see a couple hundred
nanoseconds of noise. So 10 ns isn’t going to happen.
On the other hand, I’m almost certainly accurate to within 1 μs across the set of 8 test servers most of the time, and I’m absolutely more accurate than my original goal of 10 μs.
