Modernizing mcpigdo

My custom garage door opener appliance is running FreeBSD 11.0-ALPHA5 on a Raspberry Pi 2B. It has worked fine for about 8 years now. However, I want to migrate it to FreeBSD 13.2-STABLE, move from libDwmAuth to libDwmCredence, and generally bring the code current.

The tricky part is that I never quite finished packaging up my device driver for the rotary encoders, and it was somewhat experimental (hence the alpha release of FreeBSD). But as of today it appears I have the rotary encoder device drivers working fine on FreeBSD 13.2-STABLE on a Raspberry Pi 4B. The unit tests for libDwmPi are passing, and I’m adding to them and doing a little cleanup so I’ll be able to maintain it longer-term.

I should note that the reason I went with FreeBSD at the time was pretty simple: the kernel infrastructure for what I needed to do was significantly better than what Linux offered. That may or may not be true today, but for the moment I have no need to look at doing this on Linux. The only non-portable code here is my device driver, and it’s relatively tiny (including the boilerplate).

Looking back at this project, I should have built a few more of these. The Raspberry Pi 2B is more than powerful enough for the job, and given that I put it inside a sealed enclosure, the lower power consumption versus a 4B is nice. I’m pretty sure my mom would appreciate one, if only by virtue of being able to open her garage doors with her phone or watch. The hardware (the Pi and the HAT I created) has been flawless, and I’ve had literally zero issues despite it being in a garage with no climate control (so it’s seen plenty of -10F and 95F days). It just works.

However, today I could likely do this in a smaller enclosure, thanks to PoE HATs. Unfortunately not with the latest official Raspberry Pi PoE+ HAT, because its efficiency is abysmal (it generates far too much heat). If I bump the Pi to a 4B, I’ll probably stick with a separate fanless PoE splitter. I’ll need a new one, since the power connector has changed.

The arguments for moving to a Pi 4B:

  • future-proofing. If I want to build another one, I’m steered toward the Pi 4B simply because it’s what I can buy and what’s current.
  • faster networking (1G versus 100M)
  • more oomph for compiling C and C++ code locally
  • Some day, the Pi 2B is going to stop working. I’ve no idea when that day might be, but 8 years in Michigan weather seems like it has probably taken a significant toll. On the other hand it could last another 20 years. There are no electrolytic capacitors, I’m using it headless, and none of the USB ports are in use.

The arguments against it:

  • higher power consumption, hence more heat
  • the Pi 2B isn’t dead yet

I think it’s pretty clear that during this process, I should try a Pi 4B. The day will come when I’ll have to abandon the 2B, and I’d rather do it on my timeline. There’s no harm in keeping the 2B in a box while I try a 4B. Other than the PoE splitter, it should be a simple swap. Toward that end, I ordered a 4B with 4G of RAM (I don’t need 8G here). I still need to order a PoE splitter, but I can probably scavenge an original V2 PoE HAT from one of my other Pis and stack it with stacking headers.

Over the weekend I started building FreeBSD 13.2-STABLE (buildworld) on the Pi 2B and, as usual, hit its limits. The problem is that 1G of RAM isn’t sufficient to utilize the 4 cores. It’s terribly slow even when you can use all 4 cores, but once you start swapping to a microSD card, ‘make buildworld’ takes days to finish. And since I maintain a device driver for this device, I expect to rebuild the kernel somewhat regularly and the world occasionally. This is the main motivation for bumping to a Raspberry Pi 4B with 4G of RAM. It’s possible it’ll still occasionally swap during a ‘make -j4 buildworld’, but the cores are faster, and I rarely see a single instance of the compiler or llvm-tblgen go over 500M (though it does happen). Four jobs at roughly 500M each is about 2G, which leaves room for the kernel, buffer cache and everything else, so I think 4G is sufficient to avoid swapping during a full build.

Update Aug 26, 2023: duh. A while after I first created mcpigdo, it became possible to do what I need to do with the rotary encoders from user space. With FreeBSD 13.2, I can configure interrupts on the GPIO pins and be notified via a number of means. I’m going to work on changing my code to not need my device driver. This is good news, since I’ve had some problems with my very old device driver despite refactoring, and I don’t have time to keep maintaining it. Moving my code to user space will make it easier to carry forward (though it’ll still be FreeBSD-only), and it will allow for more flexibility.
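
To make the user-space approach concrete, here’s a minimal sketch of a quadrature decoder. It’s not the mcpigdo code; it’s a generic state-table decoder that assumes the A and B pin states are handed to it on each pin-change notification (however those notifications are delivered), and which direction counts as positive depends on how the encoder is wired.

    // Minimal illustrative sketch, not the mcpigdo implementation.
    // Feed it the A and B pin states each time either pin changes
    // (however those change notifications arrive), and it accumulates
    // the encoder position.
    #include <cstdint>

    class QuadratureDecoder {
    public:
        // Returns -1, 0 or +1 for this transition and updates the
        // position.  Which direction is 'positive' depends on how the
        // encoder's A and B pins are wired.
        int Update(bool a, bool b)
        {
            // Indexed by (previous state << 2) | current state, where a
            // state is (A << 1) | B.  Transitions where both pins appear
            // to have changed at once are treated as 0 (noise).
            static const int8_t table[16] = {
                 0, +1, -1,  0,
                -1,  0,  0, +1,
                +1,  0,  0, -1,
                 0, -1, +1,  0
            };
            uint8_t curr = (a << 1) | b;
            int dir = table[(_prev << 2) | curr];
            _prev = curr;
            _position += dir;
            return dir;
        }

        long Position() const { return _position; }

    private:
        uint8_t _prev = 0;
        long _position = 0;
    };

With something like this in user space, the kernel’s only job is delivering pin-change events; everything else is ordinary application code.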

Raspberry Pi PoE+ is a typo (should be PoS)

Seriously. ‘S’ is adjacent to ‘E’ on a QWERTY keyboard.

I knew the official PoE+ HATs were pieces of poop before I bought them. This isn’t news. You don’t have to look hard to find Jeff Geerling’s comments, Martin Rowan’s comments, or those of the many others who’ve complained. I had read them literally years before I purchased.

I decided to buy 4 of them despite the problems, for a specific purpose: 4 rack-mounted Pi 4B 8G machines, all powered via PoE alone. I’ve had those running for a few days and they’re working. They’re inefficient, but so far they work.

I also ordered 2 more, with the intent of using one of them on a Raspberry Pi 4B 8G in an Anidees Pro case and keeping the other as a spare. Well, within literally 36 hours, one of them was dead. I believe it destroyed itself via heat. And therein lies part of the problem. I’ll explain what I casually observed, since I wasn’t taking measurements.

I ran the Pi from USB-C for about a day, without the PoE HAT installed. It was in the Anidees Pro case, fully assembled. It was fine, idling around 37.4C and not seeming to go above 44C when running high loads (make -j4 on some medium-sized C++ projects; a prominent workload for me). Solid proof that for my use, the Anidees Pro case works exactly as intended. The case is a big heatsink. Note that I have the 5mm spacers installed for the lid, so it’s open around the entire top perimeter.

I then installed the PoE+ HAT, with extension headers and the correct length standoffs that are needed in the Anidees Pro case. Note that this activity isn’t trivial; the standoffs occupy the same screw holes as the bottom of the case (from opposite directions), and an unmodified standoff is likely to bottom out as it collides with the end of the case bottom screw. You can shorten the threaded end of the standoff, or do as I did and use shorter standoffs and add a nut and washers to take up some of the thread. I don’t advise shortening the screws for the bottom of the case.

I plugged in the PoE ethernet from my office lab 8-port PoE switch, which has been powering the 4 racked Pis for a few days, and observed the expected horrible noise noted by others. Since I expected it, I immediately unplugged the USB-C power. I continued installing software and started compiling and installing my own software (libDwm, libCredence, mcweather, DwmDns, libDwmWebUtils, mcloc, mcrover, etc.). It was late, so I stopped there. On my way out of the home office, I put my hand on the Pi. It was much warmer than when running from USB-C; in fact, uncomfortably warm. I checked the CPU temperature with vcgencmd; it was under 40C. Hmm. I was tired, so I decided to leave it until the next day and see what happened.

In the morning the Pi had no power. I unplugged and plugged both ends of the 12″ PoE cable. Nothing.

It turns out that the PoE+ HAT is dead. Less than 48 hours of runtime. As near as I can tell, it cooked itself. The PoE port on the ethernet switch still works great. The Pi still works great (powered from USB-C after the dead PoE+ HAT was removed).

I find this saddening and unacceptable. “If it’s not tested, it’s broken.” Hey Eben: it’s broken. No, literally, it’s broken. And it looks like it wasn’t even smoke tested. In fact I’d say it’s worse than the issues with Rev. 1 of the original PoE HAT. This is a design problem, not a testing problem. In other words, the problem occurred at the very beginning of the process, which means it passed through all of engineering. And this issue lies with leadership, not the engineers.

So not only have you gone backward, you’ve gone further back than you were with Rev. 1 of the PoE HAT. And you discontinued the only good PoE HAT you had? Now I’m just left with, “Don’t believe anything, ANYTHING, from the mouth of Eben Upton.”

I’m angry because my trust of Raspberry Pi has been eroding for years, and this is just another lump of coal. We hobbyists were basically screwed for 3 years on availability of all things Pi 4, and you’re still selling a PoE HAT that no one should use.

I’ve been saying this for a couple of years now: there is opportunity for disruption here. While I appreciate the things the Raspberry Pi Foundation has done, I’m starting to feel like I can’t tell anyone that the Pi hardware landscape is great. In fact for many things, it has stagnated.

For anyone for whom 20 US dollars matters, do NOT buy an official PoE+ HAT. Not that it matters much… it’s June 2023 and it’s still not trivial to find a board to use it with (a Raspberry Pi 3 or 4).

There comes a time when a platform dies. Platforms get lazy after an ecosystem builds around them. I’m wondering if I am seeing that on the horizon for the Raspberry Pi.

More Raspberry Pi? Thanks!

Years ago I bought a fairly simple 1U rack mount for four Raspberry Pi Model 4B computers. Then the COVID-19 pandemic happened, and for years it wasn’t possible to find Model 4Bs with 4G or 8G of RAM at anything less than scandalous scalper prices. So the inexpensive rack mount sat for years collecting dust.

This month, June 2023, I was finally able to buy four Model 4B Raspberry Pis with 8G of RAM, at retail price ($75 each). Hallelujah.

I also bought four of the PoE+ HATs, which IMHO suck compared to the v2 version of the original PoE HAT: the efficiency is terrible at the tiny loads I have on them (no peripherals), so they consume a lot more power and waste it as heat. I don’t need to repeat what’s been written elsewhere by those who’ve published measurements. There also appears to be a PoE to USB-C isolation issue, but fortunately for me I won’t have anything plugged into the USB-C on these Pis.

The plan is to put these four Pis in the wall-mounted switch rack in the basement. They’re mostly going to provide physical redundancy for services I run that don’t require much CPU or network and storage bandwidth. DNS, DHCP, mcrover and mcweather, for example.

I am using Samsung Pro Endurance 128G microSD cards for longevity. If I needed more and faster I/O, I’d be using a rack with space for M.2 SATA per Pi, but I don’t need it for these.

I’ve loaded the latest Raspberry Pi OS Lite 64-bit on them, configured DHCP and DNS for them (later I’ll configure static IPs on them), and started installing the things I know I want/need. They all have their PoE+ HATs on, and are installed in the rack mount. I’ll put the mount into the rack this weekend. The Pis are named grover, nomnom, snoopy and lassie.

Separately, I ordered 2 more Raspberry Pis (same model 4B with 8G of RAM), two more PoE+ HATs and 2 cases: an Argon ONE v2 and an Anidees AI-PI4-SG-PRO. Both of these turn a significant part of the case into a heatsink.

The Argon ONE v2 comes with a fan and can’t use the PoE+ HAT, but it can accept an M.2 SATA add-on. I’m planning to play with using this one in the master bedroom, connected to the TV. It’s nice that it routes everything to the rear of the case; that makes it much easier to use in an entertainment center context.

I believe the Anidees AI-PI4-SG-PRO will allow me to use a PoE+ HAT, but I’ll need extension headers which I’ll order soon. I’ve liked my other Anidees cases, and I think this latest one should be the best I’ve had from them. They’re pricey but premium.

It’s nice that I can finally do some of the work I planned years ago. Despite hoping that I’d see RISC-V equivalents by now, the reality is that the Pi has a much larger ecosystem than any of the alternatives. It’s still the go-to for several things, and I’m happy.

Jan 16, 2023 Unacknowledged SYNs by country

Jan 14, 2023 Unacknowledged SYNs by country

It’s sometimes interesting to look at how different a single day might be versus the longer-term trends. And to see what happens when you make changes to your pf rules.

I added all RU networks I was blocking from ssh to the list blocked for everything. I also fired up a torrent client on my desktop.

RU moving up the list versus the previous 5 days is no surprise; a good portion of traffic I receive from RU is port scanning. But I’ll have to look to see what caused the CZ numbers to climb.

I think the only interesting thing about the torrent client is that I should do something to track UDP in a manner similar to how I track TCP. If I have a torrent client running, I will wind up with a lot of UDP traffic (much of it directed to port 6881 on this day), and will respond with ICMP port unreachable. To some extent this is a burden on my outbound bandwidth, but on the other hand it would allow me to add an easy new tracker to mcflowd: “to whom am I sending ICMP port unreachables?”. Of course, UDP is trivially spoofed, so I don’t truly know the source of that traffic.

mcblockd 5 years on

mcblockd, the firewall automation I created 5 years ago, continues to work.

However, it’s interesting to note how things have changed. Looking at just the addresses I block from accessing port 22…

While China remains at the top of my list of total number of blocked IP addresses, the US is now in 2nd place. In 2017, the US wasn’t even in the top 20. What has changed?

Most of the change here is driven by my automation seeing more and more attacks originating from cloud-hosted services: Amazon EC2, Google, Microsoft, DigitalOcean, Linode, Oracle, et al. While my automation policy won’t go wider than a /24 for a probe from a known US entity, over time I see probes from entire swaths of contiguous /24 networks in the same address space allocation, and those get coalesced to reduce firewall table size. Two adjacent /24 networks become a single /23. Two adjacent /23 networks become a single /22. And so on, up to a possible /8 (the automation stops there).
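
As a rough illustration of that coalescing (a sketch only, not mcblockd’s implementation), merging adjacent IPv4 prefixes into wider ones looks something like this:

    // Sketch only, not mcblockd's implementation.  Repeatedly merge
    // adjacent IPv4 prefixes: two adjacent /24s become a /23, two
    // adjacent /23s become a /22, and so on, down to a /8 at most.
    #include <cstdint>
    #include <set>
    #include <utility>

    // (network address in host byte order, prefix length)
    using Prefix = std::pair<uint32_t,uint8_t>;

    void Coalesce(std::set<Prefix> & prefixes, uint8_t shortestLen = 8)
    {
        bool merged = true;
        while (merged) {
            merged = false;
            for (auto it = prefixes.begin(); it != prefixes.end(); ++it) {
                auto [net, len] = *it;
                if (len <= shortestLen)  { continue; }
                uint32_t half = 1u << (32 - len);
                if (net & half)          { continue; }  // not the lower half of a wider prefix
                auto sib = prefixes.find({net + half, len});  // the adjacent upper half
                if (sib != prefixes.end()) {
                    prefixes.erase(sib);
                    prefixes.erase(it);
                    prefixes.insert({net, uint8_t(len - 1)});
                    merged = true;
                    break;  // iterators invalidated; rescan
                }
            }
        }
    }

    // e.g. { 10.1.2.0/24, 10.1.3.0/24 } becomes { 10.1.2.0/23 }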

So today, the last day of 2022, I see some very large blocks owned by our cloud providers being blocked by my automation, due to ssh probes arriving from large contiguous swaths of their address space.

I am very appreciative of the good things from big tech. But I’m starting to see the current cloud computing companies as the arms dealers of cyberspace.

My top 2 countries:

    CN 131,560,960 addresses
       /9 networks:    1 (8,388,608 addresses)
      /10 networks:   10 (41,943,040 addresses)
      /11 networks:   12 (25,165,824 addresses)
      /12 networks:   18 (18,874,368 addresses)
      /13 networks:   29 (15,204,352 addresses)
      /14 networks:   48 (12,582,912 addresses)
      /15 networks:   48 (6,291,456 addresses)
      /16 networks:   37 (2,424,832 addresses)
      /17 networks:   14 (458,752 addresses)
      /18 networks:    7 (114,688 addresses)
      /19 networks:   10 (81,920 addresses)
      /20 networks:    5 (20,480 addresses)
      /21 networks:    3 (6,144 addresses)
      /22 networks:    3 (3,072 addresses)
      /23 networks:    1 (512 addresses)

    US 92,199,996 addresses
       /9 networks:    3 (25,165,824 addresses)
      /10 networks:    5 (20,971,520 addresses)
      /11 networks:   10 (20,971,520 addresses)
      /12 networks:    9 (9,437,184 addresses)
      /13 networks:   16 (8,388,608 addresses)
      /14 networks:   10 (2,621,440 addresses)
      /15 networks:    8 (1,048,576 addresses)
      /16 networks:   42 (2,752,512 addresses)
      /17 networks:   10 (327,680 addresses)
      /18 networks:   11 (180,224 addresses)
      /19 networks:    8 (65,536 addresses)
      /20 networks:   10 (40,960 addresses)
      /21 networks:    2 (4,096 addresses)
      /22 networks:    9 (9,216 addresses)
      /23 networks:    9 (4,608 addresses)
      /24 networks:  818 (209,408 addresses)
      /25 networks:    4 (512 addresses)
      /26 networks:    5 (320 addresses)
      /27 networks:    5 (160 addresses)
      /28 networks:    2 (32 addresses)
      /29 networks:    7 (56 addresses)
      /30 networks:    1 (4 addresses)

You can clearly see the effect of my automation policy for the US. Lots of /24 networks get added, most of them with a 30 to 35 day expiration. Note that expirations increase for repeat offenses. But over time, as contiguous /24 networks are added for sending probes at my firewall, aggregation leads to wider net masks (shorter prefix lengths). Since I’m sorting countries by the total number of addresses I’m blocking, shorter prefixes obviously have a much more profound effect than longer ones.

mcrover now monitors Plex server

mcrover is now monitoring my Plex server. This was more work than expected. A big part of the issue is that the REST API uses XML. I’ve always disliked using XML. It’s a nice technology, but when it comes to open source libraries for C++, the options have always been lacking.

Long ago, I used Xerces. Not because it was the best, but because it was the only liberally licensed C++ library with support for DTD and Schema validation. That is still the case today. Unfortunately, it’s very cumbersome to use and is written in old C++ (as in C++98). There’s a lot of boilerplate, a considerable amount of global state (very bad for multithreaded applications), and a lot of the memory management is left to the application. I can’t imagine anyone is choosing it for new production code today.

But I plunged ahead anyway. Sadly, it was a mistake. Somewhere it was stomping on the stack, often in ways that caused problems deep inside openssl (which I don’t use directly; I use boost::beast and boost::certify). The stack corruption made debugging difficult, and I didn’t have the time to figure it out. And of course I’m always suspicious of openssl, given that it’s written in C and many of us lived through Heartbleed and other critical openssl vulnerabilities. To be honest, we’ve been in desperate need of a really good, modern (written in at least C++11) C++ implementation of TLS for more than a decade. Of course I could rant about our whole TLS mess for hours, but I’ll spare you.

Time being of the essence, I switched to pugixml. I don’t need DTD/Schema validation. Problem gone, a lot less code, and a much more modern API (much harder to shoot yourself in the foot).

Inside mcrover, I’m using XPath on the XML returned by Plex. The internal code is generic; it would not be much work to support other Web applications with XML interfaces. The XPath expression I look for in the XML is a configuration item, and really the only reason I have something specific for Plex is its use of an API token. But the configuration is generic enough that supporting other XML Web applications shouldn’t be difficult.
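
For a sense of how little code this takes, here’s a sketch (not the mcrover implementation, and with a made-up XML body and XPath expression) of evaluating a configured XPath against a response with pugixml:

    // Illustrative sketch only; the XML and XPath here are hypothetical.
    #include <iostream>
    #include <pugixml.hpp>

    int main()
    {
        // In the real case this would be the body of the HTTP response.
        const char * xml =
            "<MediaContainer size=\"1\"><Video title=\"Example\"/></MediaContainer>";

        pugi::xml_document      doc;
        pugi::xml_parse_result  rc = doc.load_string(xml);
        if (! rc) {
            std::cerr << "XML parse failed: " << rc.description() << '\n';
            return 1;
        }
        // The XPath expression would come from the alert configuration.
        pugi::xpath_node  found = doc.select_node("/MediaContainer[@size]");
        if (! found) {
            std::cerr << "expected XPath not found\n";
            return 1;
        }
        std::cout << "OK, size = "
                  << found.node().attribute("size").value() << '\n';
        return 0;
    }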

At any rate, what I have now works. So now I don’t get blackholed fixing a Plex issue when I haven’t used it for months and something has gone wrong. I know ahead of time.

UPS fiasco and mcrover to the rescue

I installed a new Eaton 5PX1500RT in my basement rack this week. I’d call it “planned, sort of…”. My last Powerware 5115 1U UPS went into an odd state which precipitated the new purchase. However, it was on my todo list to make this change.

I already own an Eaton 5PX1500RT, which I bought in 2019. I’ve been very happy with it. It’s in the basement rack, servicing a server, my gateway, ethernet switches and broadband modem. As is my desire, it is under 35% load.

The Powerware 5115 was servicing my storage server, and also under 35% load. This server has dual redundant 900W power supplies.

Installation of the new UPS… no big deal. Install the ears, install the rack rails, rack the UPS.

Shut down the devices plugged into the old UPS, plug them into the new UPS. Boot up, check each device.

Install the USB cable from the UPS to the computer that will monitor the state of the UPS. Install Network UPS Tools (nut) on that computer. Configure it, start it, check it.

This week, at this step things got… interesting.

I was monitoring the old Powerware 5115 from ‘ria’. ‘ria’ is a 1U SuperMicro server with a single Xeon E3-1270 V2. It has four 1G ethernet ports and a Mellanox 10G SFP+ card. Two USB ports. And a serial port which has been connected to the Powerware 5115 for… I don’t know, 8 years?

I can monitor the Eaton 5PX1500RT via a serial connection. However, USB is more modern, right? And the cables are less unwieldy (more wieldy). So I used the USB cable.

Trouble started here. The usbhid-ups driver did not reliably connect to the UPS. When it did, it took a long time (in excess of 5 seconds, an eternity in computing time). ‘ria’ is running FreeBSD 12.3-STABLE on bare metal.

I initially decided that I’d deal with it this weekend. Either go back to using a serial connection or try using a host other than ‘ria’. However…

I soon noticed long periods where mcrover was displaying alerts for many services on many hosts. Including alerts for local services, whose test traffic does not traverse the machine I touched (‘ria’). And big delays when using my web browser. Hmm…

Poking around, I found I could only reliably reproduce a network problem by pinging certain hosts with ICMPv4 from ‘ria’ and observing periods where the round-trip time would go from 0.05 milliseconds to 15 or 20 seconds. No packets lost, just periods with huge delays. These were all hosts on the same 10G ethernet network. ICMPv6 to the same hosts: no issues. Hmm…

I was eventually able to correlate (in my head) what I was seeing in the many mcrover alerts. On the surface, many didn’t involve ‘ria’. But under the hood they DO involve ‘ria’, simply because ‘ria’ is my primary name server. So, for example, tests that probe via both IPv6 and IPv4 might get the AAAA record but not the A record for the destination, or vice versa, or neither, or both. ‘ria’ is also the default route for these hosts. I homed in on the 10G ethernet interface on ‘ria’.

What did IPv4 versus IPv6 have to do with the problem? I don’t know without digging through kernel source. What was happening: essentially a network ‘pause’. Packets destined for ‘ria’ were not dropped, but queued for later delivery. As much as 20 seconds later! The solution? Unplug the USB cable for the UPS and kill usbhid-ups. In the FreeBSD kernel, is USB hoarding a lock shared with part of the network stack?

usbhid-ups works from another Supermicro server running the same version of FreeBSD. Different hardware (dual Xeon L5640). Same model of UPS with the same firmware.

This leads me to believe this isn’t really a lock issue. It’s more likely an interrupt routing issue. And I do remember that I had to add hw.acpi.sci.polarity="low" to /boot/loader.conf on ‘ria’ a while ago to avoid acpi0 interrupt storms (commented out recently with no observed consequence). What I don’t remember: what were all the issues I found that prompted me to add that line way back when?

Anyway… today’s lesson. Assume the last thing you changed has a high probability of being the cause, even if there seems to be no sensible correlation. My experience this week: “Unplug the USB connection to the UPS and the 10G ethernet starts working again. Wait, what?!”

And today’s thanks goes to mcrover. I might not have figured this out for considerably longer if I did not have alert information in my view. A comes-and-goes problem that only seemed reproducible between particular hosts using particular protocols could have been much more painful to troubleshoot without reliable status information on a dedicated display. Yes, it took some thinking and observing, and then some manual investigation and backtracking. But the whole time, I had a status display showing me what was observable. Nice!

mcrover updates

I spent some weekend time updating mcrover this month. I had two motivations:

  1. I wanted deeper tests for two web applications: WordPress and piwigo. Plex will be on the list soon, I’m just not thrilled that their API is XML-based (Xerces is a significant dependency, and other XML libraries are not as robust or complete).
  2. I wanted a test for servers using libDwmCredence. Today that’s mcroverd and mcweatherd. dwmrdapd and mcblockd will soon be transitioned from libDwmAuth to libDwmCredence.

I created a web application alert, which makes it fairly easy to add generic checks for web applications. Obviously, the test for a given application is application-specific. Applications with REST interfaces that emit JSON are pretty easy to add.
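
To illustrate the general shape of such a check (a sketch under my own simplifying assumptions, not the mcrover implementation), a generic HTTP probe with boost::beast looks roughly like this; a real check would go on to parse the JSON body and apply application-specific tests.

    // Sketch only: fetch a target over plain HTTP and verify the status.
    // Host, port and target are hypothetical; real checks are
    // application-specific and typically use TLS.
    #include <iostream>
    #include <string>
    #include <boost/asio/connect.hpp>
    #include <boost/asio/ip/tcp.hpp>
    #include <boost/beast/core.hpp>
    #include <boost/beast/http.hpp>

    namespace beast = boost::beast;
    namespace http  = beast::http;
    namespace net   = boost::asio;
    using tcp = net::ip::tcp;

    bool CheckWebApp(const std::string & host, const std::string & port,
                     const std::string & target)
    {
        try {
            net::io_context    ioc;
            tcp::resolver      resolver(ioc);
            beast::tcp_stream  stream(ioc);
            stream.connect(resolver.resolve(host, port));

            http::request<http::empty_body>  req{http::verb::get, target, 11};
            req.set(http::field::host, host);
            req.set(http::field::user_agent, "webapp-check");
            http::write(stream, req);

            beast::flat_buffer                 buffer;
            http::response<http::string_body>  res;
            http::read(stream, buffer, res);

            beast::error_code  ec;
            stream.socket().shutdown(tcp::socket::shutdown_both, ec);

            // A deeper, application-specific check would parse res.body()
            // (JSON for many web apps) here.
            return (res.result() == http::status::ok);
        }
        catch (const std::exception & ex) {
            std::cerr << ex.what() << '\n';
            return false;
        }
    }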

The alerts and code to check servers using libDwmCredence allow me to attempt connection and authentication to any service using Dwm::Credence::Peer. These are the steps that establish a connection; they are the prologue to application messaging. This allows me to check the status of a service using Dwm::Credence::Peer in lieu of a deeper (application-specific) check, so I can monitor new services very early in the deployment cycle.

libDwmCredence integrated into mcrover

I’ve completed my first pass at libDwmCredence (a replacement for libDwmAuth). As previously mentioned, I did this work in order to drop my use of Crypto++ in some of my infrastructure. Namely mcrover, my personal host and network monitoring software.

I’ve integrated it into mcrover, and deployed the new mcrover on 6 of my internal hosts. So far, so good.