[ntp:questions] NTP Sync Issues

Discussion:

Adam Johnson

2008-06-06 10:41:39 UTC

We have 3 sites and we are experiencing some strange problems at one of
them. We use NTP to keep the servers in time and this works fine
for 2 of the sites, but at the third site we get these errors in the log:

ntpd[30062]: synchronized to server, stratum 3
ntpd[30062]: no servers reachable
ntpd[30062]: synchronized to server, stratum 3
ntpd[30062]: time reset +2.119167 s
ntpd[30062]: synchronized to server, stratum 3

The servers at that site seem to drop back between 2 and 3 seconds
behind the other sites for no apparent reason. Both the
other sites work without any problem. We have run a packet capture at a
working site and at the site with the problems and we don't see any
differences other than the server becoming unsynchronized frequently. We
have checked the main firewall and that is not blocking access, and the
local firewalls are disabled. All our sites are connected via a
dedicated link and I have tried connecting to NTP servers in the other
sites and the problem persists. It looks like something local keeps
changing the time but I can't figure out what.

The ntp.conf is the same for all sites except that the servers are different.
I have tried using burst and iburst but that hasn't helped.

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

server ipAddres iburst
restrict networkAddress mask networkMask
server ipAddres iburst
restrict networkAddress mask networkMask

driftfile /var/lib/ntp/drift

Any help would be greatly appreciated.

Thanks

David Woolley

2008-06-08 09:20:00 UTC

This happens either because of two conflicting time synchronisation
mechanisms or lost interrupts.

I guess you are running some Unix-like system, from the process numbers
in the messages. People failing to identify such systems are usually
running Red Hat Linux. Linux (and Windows) are vulnerable to losing
clock interrupts, especially when using IDE devices in non-DMA modes.
Red Hat, in particular, tends to set the kernel interrupt rate to 1000
Hz, which tends to exacerbate this.
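
Why a higher interrupt rate makes things worse can be sketched with some simple arithmetic. The Python below is only an illustrative model, not kernel code: it assumes that one timer interrupt can stay pending while interrupts are blocked and that any further ticks in that window are lost. The function name and the 5 ms blocking window are made up for the example.

```python
def lost_time_us(blocked_us: int, hz: int) -> int:
    """Microseconds of clock time lost while interrupts are blocked.

    Simplified model (an assumption, not kernel code): one timer
    interrupt can stay pending during the blocked window and is
    serviced afterwards; every further tick in the window is lost.
    """
    period_us = 1_000_000 // hz               # length of one tick
    ticks_in_window = blocked_us // period_us  # ticks that fire while blocked
    lost_ticks = max(0, ticks_in_window - 1)   # all but the pending one
    return lost_ticks * period_us


if __name__ == "__main__":
    # Hypothetical 5 ms window with interrupts blocked, e.g. slow non-DMA I/O.
    for hz in (100, 1000):
        print(f"HZ={hz}: {lost_time_us(5_000, hz)} us lost")
```

Under this model a 5 ms stall loses nothing at HZ=100, because no second tick arrives inside the window, but loses several ticks at HZ=1000. Repeated often enough, small losses like that could add up to the multi-second offsets reported here.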

The typical cause on other Unix systems, e.g. SunOS and at least some
versions of SCO Unix, is software that resets the software clock from
the real time clock.

Tickless Linux systems are too new for much experience of failure modes
to have been gathered.

What one can reasonably say is that it is an OS or hardware issue, not
an NTP one.

Adam Johnson

2008-06-09 07:41:15 UTC

Yes you are right we are running RHEL5. The strange thing is that when
we try to sync the servers from a location that syncs correctly normally
to the location with the issues then we get the same issues as the local
servers are experiencing. You say that it could be conflicting time
synchronisation mechanisms, do you mean that the 2 upstream servers are
conflicting or that something other than NTP is causing this? Thank you
for your help!

Thanks

Adam

David J Taylor

2008-06-10 06:16:46 UTC

Post by Adam Johnson
You say that it could be
conflicting time synchronisation mechanisms, do you mean that the 2
upstream servers are conflicting or that something other than NTP is
causing this? Thank you for your help!
Thanks
Adam

Adam,

Having only two servers prevents NTP from working as designed, because it
doesn't have enough information to choose between them when they disagree.
It is best to add another two or three, to make four or five servers in total.

Cheers,
David

David Woolley

2008-06-10 06:47:55 UTC

Post by Adam Johnson
Yes you are right we are running RHEL5. The strange thing is that when
we try to sync the servers from a location that syncs correctly normally
to the location with the issues then we get the same issues as the local

I didn't understand that.

Post by Adam Johnson
servers are experiencing. You say that it could be conflicting time
synchronisation mechanisms, do you mean that the 2 upstream servers are
conflicting or that something other than NTP is causing this? Thank you
for your help!

I was actually referring to something other than NTP, although that is
not generally an issue on Red Hat. Two conflicting servers only happens
if the server configuration is broken, although, given the number of
people who use the local clock without understanding the risks, that's
possibly not that unlikely.

In a correctly operating NTP system, you can rely on all servers and the
client being within 1 second of some concept of true time. For public
servers, and ones based on radio reference clocks, that time is UTC.

The fix for this is to choose servers which are traceable to the same
time source and to have enough independent ones that any rogue one is
outvoted by the good ones.

However, the result of having two servers on different times is either
that both get ignored, or that times hop backwards and forwards. As
your time was always hopping backwards, the indications are for
something other than ntpd forcibly changing the clock. This will give
steps at roughly equal intervals.

More likely on Red Hat, given that your steps are always positive, is
lost timer interrupts. Lost interrupts tend to be activity related, so
the interval between steps and size of steps will be more variable. To
fix that, make sure that IDE drivers use DMA and, if possible, rebuild
the kernel with HZ set to 100.
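
The rule of thumb above (mixed-sign steps point at disagreeing servers, same-sign steps at regular intervals point at another program setting the clock, same-sign steps at irregular intervals point at lost interrupts) can be turned into a rough log check. The following Python is a sketch only. The regular expression matches the "time reset" lines shown earlier in the thread, but the function names and the `regular_cv` threshold are assumptions made for illustration, and the step timestamps are assumed to come from the syslog prefix, which is not shown in the excerpt.

```python
import re
from statistics import mean, pstdev

# Matches lines such as "ntpd[30062]: time reset +2.119167 s".
STEP_RE = re.compile(r"time reset ([+-]\d+\.\d+) s")


def parse_steps(lines):
    """Extract step sizes in seconds from ntpd 'time reset' log lines."""
    steps = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            steps.append(float(match.group(1)))
    return steps


def classify(step_times, step_sizes, regular_cv=0.2):
    """Apply the rule of thumb to a series of clock steps.

    step_times: when each step happened, in seconds, strictly ascending.
    step_sizes: the signed step sizes in seconds.
    regular_cv: hypothetical cut-off on the coefficient of variation of
                the intervals below which they count as "roughly equal".
    """
    if any(s > 0 for s in step_sizes) and any(s < 0 for s in step_sizes):
        return "mixed signs: suspect disagreeing servers"
    intervals = [b - a for a, b in zip(step_times, step_times[1:])]
    if len(intervals) < 2:
        return "too few steps to judge"
    cv = pstdev(intervals) / mean(intervals)
    if cv <= regular_cv:
        return "regular intervals: suspect another program setting the clock"
    return "irregular intervals: suspect lost timer interrupts"
```

The result is only a hint about where to look first; a handful of steps is not enough to rule any cause in or out.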

Adam Johnson

2008-06-12 15:45:23 UTC

We have figured out what the issue was. It turns out that it was because
of an Altiris deployment server that for some reason was conflicting
with NTP and adjusting the time on the servers. Thanks to all that
responded to my question.

Thanks

Adam

Richard B. Gilbert

2008-06-08 12:15:08 UTC

Post by Adam Johnson
We have 3 sites and we are experiencing some strange problems at one of
them. We use NTP to keep the servers in time and this works fine
for 2 of the sites, but at the third site we get these errors in the log:
ntpd[30062]: synchronized to server, stratum 3
ntpd[30062]: no servers reachable
ntpd[30062]: synchronized to server, stratum 3
ntpd[30062]: time reset +2.119167 s
ntpd[30062]: synchronized to server, stratum 3
The servers at that site seem to drop back between 2 and 3 seconds
behind the other sites for no apparent reason. Both the
other sites work without any problem. We have run a packet capture at a
working site and at the site with the problems and we don't see any
differences other than the server becoming unsynchronized frequently. We
have checked the main firewall and that is not blocking access, and the
local firewalls are disabled. All our sites are connected via a
dedicated link and I have tried connecting to NTP servers in the other
sites and the problem persists. It looks like something local keeps
changing the time but I can't figure out what.
The ntp.conf is the same for all sites except that the servers are different.
I have tried using burst and iburst but that hasn't helped.
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1
server ipAddres iburst
restrict networkAddress mask networkMask
server ipAddres iburst
restrict networkAddress mask networkMask
driftfile /var/lib/ntp/drift
Any help would be greatly appreciated.
Thanks

If the above accurately describes the REAL configuration, you have
exactly two upstream servers which is the worst possible configuration!
When the two disagree, which one should NTPD believe?

Four servers are the minimum for a robust configuration. Five, seven,
and nine are the remaining "magic" numbers. Few sites actually need
more than four or five upstream servers.
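
The "which one should NTPD believe?" problem can be shown with a toy majority vote. The Python below is a deliberately simplified sketch and is not ntpd's actual selection algorithm: it just looks for the largest group of server offsets that agree with one another and accepts it only if it is a strict majority. The function name and the 0.1 s tolerance are assumptions chosen for illustration.

```python
def majority_group(offsets, tolerance=0.1):
    """Return the largest group of offsets that lie within `tolerance`
    seconds of one another, or None if that group is not a strict
    majority of all servers.

    Toy model only: the tolerance value is hypothetical and the real
    ntpd selection logic is more involved than a simple vote.
    """
    ordered = sorted(offsets)
    best = []
    start = 0
    # Sliding window over the sorted offsets.
    for end in range(len(ordered)):
        while ordered[end] - ordered[start] > tolerance:
            start += 1
        if end - start + 1 > len(best):
            best = ordered[start:end + 1]
    if len(best) * 2 > len(ordered):
        return best
    return None


if __name__ == "__main__":
    print(majority_group([0.0, 2.1]))                 # two servers that disagree
    print(majority_group([0.001, 0.003, 2.1]))        # three, one rogue
    print(majority_group([0.0, 0.002, 0.004, 2.1]))   # four, one rogue
```

With two disagreeing servers there is no majority at all. Three servers can outvote one rogue, and four leave room for one server to be unreachable while the remaining three still outvote a rogue.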

DO NOT use burst! Burst was a special purpose hack intended for sites
that connect to a server by telephone two or three times a day. Iburst
is good. Burst, except in the special circumstances it was designed for,
places a heavy and unwarranted load on its servers!

David Woolley

2008-06-08 12:36:46 UTC

Post by Richard B. Gilbert
If the above accurately describes the REAL configuration, you have
exactly two upstream servers which is the worst possible configuration!
When the two disagree, which one should NTPD believe?
Four servers are the minimum for a robust configuration. Five, seven,
and nine are the remaining "magic" numbers. Few sites actually need
more than four or five upstream servers.
DO NOT use burst! Burst was a special purpose hack intended for sites
that connect to a server by telephone two or three times a day. Iburst
is good. Burst, except in the special circumstances it was designed for,
places a heavy and unwarranted load on its servers!

Note that none of these are relevant to this issue, unless you are
getting both positive and negative steps and at least one of the servers
is not really synchronised.

As you say that it only drops back, none of these issues are relevant.