UPS fiasco and mcrover to the rescue

I installed a new Eaton 5PX1500RT in my basement rack this week. I’d call it “planned, sort of…”. My last Powerware 5115 1U UPS went into an odd state which precipitated the new purchase. However, it was on my todo list to make this change.

I already own an Eaton 5PX1500RT, which I bought in 2019. I’ve been very happy with it. It’s in the basement rack, servicing a server, my gateway, ethernet switches and broadband modem. As is my desire, it is under 35% load.

The Powerware 5115 was servicing my storage server, and also under 35% load. This server has dual redundant 900W power supplies.

Installation of the new UPS… no big deal. install the ears, install the rack rails, rack the UPS.

Shut down the devices plugged into the old UPS, plug them in to the new UPS. Boot up, check each device.

Install the USB cable from the UPS to the computer that will monitor the state of the UPS. Install Network UPS Tools (nut) on that computer. Configure it, start it, check it.

This week, at this step things got… interesting.

I was monitoring the old Powerware 5115 from ‘ria’. ‘ria’ is a 1U SuperMicro server with a single Xeon E3-1270 V2. It has four 1G ethernet ports and a Mellanox 10G SFP+ card. Two USB ports. And a serial port which has been connected to the Powerware 5115 for… I don’t know, 8 years?

I can monitor the Eaton 5PX1500RT via a serial connection. However, USB is more modern, right? And the cables are less unwieldy (more wieldy). So I used the USB cable.

Trouble started here. The usbhid-ups driver did not reliably connect to the UPS. When it did, it took a long time (in excess of 5 seconds, an eternity in computing time). ‘ria’ is running FreeBSD 12.3-STABLE on bare metal.

I initially decided that I’d deal with it this weekend. Either go back to using a serial connection or try using a host other than ‘ria’. However…

I soon noticed long periods where mcrover was displaying alerts for many services on many hosts. Including alerts for local services, whose test traffic does not traverse the machine I touched (‘ria’). And big delays when using my web browser. Hmm…

Poking around, I seemed to only be able to reliably reproduce a network problem by pinging certain hosts with ICMPv4 from ria and observing periods where the round trip time would go from .05 milliseconds to 15 or 20 seconds. No packets lost, just periods with huge delays. These were all hosts on the same 10G ethernet network. ICMPv6 to the same hosts: no issues. Hmm…

I was eventually able to correlate (in my head) what I was seeing in the many mcrover alerts. On the surface, many didn’t involve ‘ria’. But under the hood they DO involve ‘ria’ simply because ‘ria’ is my primary name server. So, for example, tests that probe via both IPv6 and IPv4 might get the AAAA record but not the A record for the destination, or vice versa, or neither, or both. ‘ria’ is also the default route for these hosts. I honed in on the 10G ethernet interface on ‘ria’.

What did IPV4 versus IPv6 have to do with the problem? I don’t know without digging through kernel source. What was happening: essentially a network ‘pause’. Packets destined for ‘ria’ were not dropped, but queued for later delivery. As many as 20 seconds later! The solution? Unplug the USB cable for the UPS and kill usbhid-ups. In the FreeBSD kernel, is USB hoarding a lock shared with part of the network stack?

usbhid-ups works from another Supermicro server running the same version of FreeBSD. Different hardware (dual Xeon L5640). Same model of UPS with the same firmware.

This leads me to believe this isn’t really a lock issue. It’s more likely an interrupt routing issue. And I do remember that I had to add hw.acpi.sci.polarity="low" to /boot/loader.conf on ‘ria’ a while ago to avoid acpi0 interrupt storms (commented out recently with no observed consequence). What I don’t remember: what were all the issues I found that prompted me to add that line way back when?

Anyway… today’s lesson. Assume the last thing you changed has high probability of cause, even if there seems to be no sensible correlation. My experience this week: “Unplug the USB connection to the UPS and the 10G ethernet starts working again. Wait, what?!”.

And today’s thanks goes to mcrover. I might not have figured this out for considerably longer if I did not have alert information in my view. Being a comes-and-goes problem that only seemed to be reproducible between particular hosts using particular protocols might have made this a much more painful problem to troubleshoot without reliable status information on a dedicated display. Yes, it took some thinking and observing, and then some manual investigation and backtracking. But the whole time, I had a status display showing me what was observable. Nice!

mcrover updates

I spent some weekend time updating mcrover this month. I had two motivations:

  1. I wanted deeper tests for two web applications: WordPress and piwigo. Plex will be on the list soon, I’m just not thrilled that their API is XML-based (Xerces is a significant dependency, and other XML libraries are not as robust or complete).
  2. I wanted a test for servers using libDwmCredence. Today that’s mcroverd and mcweatherd. dwmrdapd and mcblockd will soon be transitioned from libDwmAuth to libDwmCredence.

I created a web application alert, which allows me to create code to test web applications fairly easily and generically. Obviously, the test for a given application is application-specific. Applications with REST interfaces that emit JSON are pretty easy to add.

The alerts and code to check servers using libDwmCredence allow me to attempt connection and authentication to any service using Dwm::Credence::Peer. These are the steps that establish a connection; they are the prologue to application messaging. This allows me to check the status of a service using Dwm::Credence::Peer in lieu of a deeper (application-specific) check, so I can monitor new services very early in the deployment cycle.