This was one of those network bugs which took some time to fix. To give you a short background: In the last weeks we had random time NTP drifts in our Linux infrastructure, nothing serious but it was odd as they disappeared after restarting the ntpd or chrony service. During the weekend my co-worker and I were upgrading some parts of the network infrastructure (new cables, new cabling between server and switches, upgrade to the latest Sophos UTM firmware, upgrading to XenServer 7.0 and so on). After getting everything up our OMD showed a lot of NTP errors which led to problems with FreeIPA, Kerberos and SMB. Besides that, on some of the virtual machines we had NFS problems. My co-worker dealt with this problems. Parallel to that and completely unrelated I was preparing a new Active Directory domain and wanted to setup the w32tm service on Windows Server 2012 R2.
To my suprise I was not able to synchronize the time with our firewall. Using
w32tm /stripchart /computer:$FW_IP /samples:10 /dataonly
worked flawlessly but when i tried
w32tm /resync /discover
I received the error
Der Computer wurde nicht synchronisiert, da keine Zeitdaten verfügbar waren. (The computer did not resync because no time data was available)
With w32tm debug logging enabled I got more information:
151915 13:50:46.8092075s - Logging information: NtpClient has not received response from server $FW_IP (ntp.m|0x0|0.0.0.0:123->$FW_IP:123). 151915 13:50:46.8092075s - Logging information: NtpClient: No response has been received from manual peer $FW_IP (ntp.m|0x0|0.0.0.0:123->$FW_IP:123) after 8 attempts to contact it. This peer will be discarded as a time source and NtpClient will attempt to discover a new peer from which to synchronize. Error code: 0x0000005C 151915 13:50:46.8092075s - Reachability: removing peer $FW_IP (ntp.m|0x0|0.0.0.0:123->$FW_IP:123). LAST PEER IN GROUP! 151915 13:50:46.8092075s - AddNewPendingPeer: manual 151915 13:50:46.8092075s - PeerPollingThread: PeerListUpdated 151915 13:50:46.8092075s - Logging error: NtpClient has been configured to acquire time from one or more time sources, however none of the sources are currently accessible and no attempt to contact a source will be made for 15 minutes. NTPCLIENT HAS NO SOURCE OF ACCURATE TIME.
At first I suspected a bug in the latest Sophos UTM firmware, so I checked the incoming packets on the firewall:
# NTP runs on UDP/123 tcpdump -vvv -i eth5 udp and port 123
Running w32tm /stripchart again I received the expected packets. Running w32tm /resync /discover showed nothing. My guess was that the UTM was not responsible for the error. I took a look into the changelog of XenServer 7.0 if anything had changed with the virtual interfaces of the virtual machines. The changelog itself did not have any interesting points, so it had to be an error on the lower network layers. Do you remember that I mentioned the recabling of server and switches? Well, some months ago one of our Netgear switches literally vaporized (btw, I wlll never buy a Netgear again) and had been replaced with a Cisco SG220-50. The SG220 is a nice piece of hardware and does not only work on layer 2 as the old Netgear but also on layer 3. After checking all available options we stumbled upon the settings in Security > Denial of Service > Security Suite Settings. The settings UDP Blat and TCP Blat caught our eyes: if enabled, the Cisco drops any UDP or TCP traffic where source port equals the destination port. Disabling both solved our NTP and NFS errors immediatly. By the way, w32tm /stripchart (and according to this ntpq in debug mode and with an unpriviliged source UDP port) ran flawlessly because it does not use source port 123 but another one.