mcblockd 5 years on

mcblockd, the firewall automation I created 5 years ago, continues to work.

However, it’s interesting to note how things have changed. Looking at just the addresses I block from accessing port 22…

While China remains at the top of my list of total number of blocked IP addresses, the US is now in 2nd place. In 2017, the US wasn’t even in the top 20. What has changed?

Most of the change here is driven by my automation seeing more and more attacks originating from cloud-hosted services: Amazon EC2, Google, Microsoft, DigitalOcean, Linode, Oracle, et al. While my automation policy won’t go wider than a /24 for a probe from a known US entity, over time I see probes from entire swaths of contiguous /24 networks from the same address space allocation, which will be coalesced to reduce firewall table size. Two adjacent /24 networks become a single /23. Two adjacent /23 networks become a single /22. And so on, all the way up to a possible /8 (the automation stops there).
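The coalescing described above can be sketched in a few lines. This is only an illustration using Python's standard ipaddress module, not mcblockd's own code; the function name coalesce and the widest parameter are made up for the example. Note that two neighboring /24 networks only merge when they share the same /23 parent.

```python
import ipaddress

def coalesce(prefixes, widest=8):
    """Merge adjacent IPv4 prefixes, never producing anything wider than /widest.

    Hypothetical helper for illustration; mcblockd's real logic is not shown here.
    """
    nets = [ipaddress.ip_network(p) for p in prefixes]
    merged = []
    for net in ipaddress.collapse_addresses(nets):
        if net.prefixlen < widest:
            # Too wide: split back down to the widest prefix we allow.
            merged.extend(net.subnets(new_prefix=widest))
        else:
            merged.append(net)
    return merged

# Two aligned /24 networks collapse into one /23.
print(coalesce(["10.0.0.0/24", "10.0.1.0/24"]))
# These are neighbors but do not share a /23 parent, so they stay separate.
print(coalesce(["10.0.1.0/24", "10.0.2.0/24"]))
```

Four contiguous, aligned /24 networks would collapse to a single /22 the same way, which is how table size shrinks as more of an allocation gets blocked.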

So today, the last day of 2022, I see some very large blocks owned by our cloud providers being blocked by my automation due to receiving ssh probes from large contiguous swaths of their address space.

I am very appreciative of the good things from big tech. But I’m starting to see the current cloud computing companies as the arms dealers of cyberspace.

My top 2 countries:

    CN 131,560,960 addresses
       /9 networks:    1 (8,388,608 addresses)
      /10 networks:   10 (41,943,040 addresses)
      /11 networks:   12 (25,165,824 addresses)
      /12 networks:   18 (18,874,368 addresses)
      /13 networks:   29 (15,204,352 addresses)
      /14 networks:   48 (12,582,912 addresses)
      /15 networks:   48 (6,291,456 addresses)
      /16 networks:   37 (2,424,832 addresses)
      /17 networks:   14 (458,752 addresses)
      /18 networks:    7 (114,688 addresses)
      /19 networks:   10 (81,920 addresses)
      /20 networks:    5 (20,480 addresses)
      /21 networks:    3 (6,144 addresses)
      /22 networks:    3 (3,072 addresses)
      /23 networks:    1 (512 addresses)

    US 92,199,996 addresses
       /9 networks:    3 (25,165,824 addresses)
      /10 networks:    5 (20,971,520 addresses)
      /11 networks:   10 (20,971,520 addresses)
      /12 networks:    9 (9,437,184 addresses)
      /13 networks:   16 (8,388,608 addresses)
      /14 networks:   10 (2,621,440 addresses)
      /15 networks:    8 (1,048,576 addresses)
      /16 networks:   42 (2,752,512 addresses)
      /17 networks:   10 (327,680 addresses)
      /18 networks:   11 (180,224 addresses)
      /19 networks:    8 (65,536 addresses)
      /20 networks:   10 (40,960 addresses)
      /21 networks:    2 (4,096 addresses)
      /22 networks:    9 (9,216 addresses)
      /23 networks:    9 (4,608 addresses)
      /24 networks:  818 (209,408 addresses)
      /25 networks:    4 (512 addresses)
      /26 networks:    5 (320 addresses)
      /27 networks:    5 (160 addresses)
      /28 networks:    2 (32 addresses)
      /29 networks:    7 (56 addresses)
      /30 networks:    1 (4 addresses)

You can clearly see the effect of my automation policy for the US. Lots of /24 networks get added, most of them with a 30 to 35 day expiration. Note that expirations increase for repeat offenses. But over time, as contiguous /24 networks are added due to sending probes at my firewall, aggregation will lead to wider net masks (shorter prefix lengths). Since I’m sorting countries based on the total number of addresses I’m blocking, obviously shorter prefixes have a much more profound effect than long prefixes.
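To make the arithmetic behind these totals concrete: an IPv4 prefix of length n covers 2^(32-n) addresses, so each row above is just count times 2^(32-n). A small sketch (the function name is made up for illustration; the counts are copied from the CN table above):

```python
def addresses_blocked(counts_by_prefix_len):
    """Total IPv4 addresses covered, given {prefix length: number of networks}."""
    return sum(count * 2 ** (32 - plen)
               for plen, count in counts_by_prefix_len.items())

# Network counts per prefix length, copied from the CN table above.
cn = {9: 1, 10: 10, 11: 12, 12: 18, 13: 29, 14: 48, 15: 48, 16: 37,
      17: 14, 18: 7, 19: 10, 20: 5, 21: 3, 22: 3, 23: 1}

print(f"{addresses_blocked(cn):,}")         # CN total
print(f"{addresses_blocked({24: 818}):,}")  # the 818 US /24 networks
print(f"{addresses_blocked({9: 3}):,}")     # the 3 US /9 networks
```

The 818 US /24 networks add up to 209,408 addresses, while just 3 /9 networks cover 25,165,824.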

mcrover now monitors Plex server

mcrover is now monitoring my Plex server.
This was more work than expected. A big part of the issue here is that the REST API uses XML. I’ve always disliked using XML. It’s a nice technology, but when it comes to open source libraries for C++, it’s always been lacking.

Long ago, I used Xerces. Not because it’s the best, but because it was the only liberally licensed library with support for DTD and Schema validation. That is still the case today. Unfortunately, it’s very cumbersome to use and is written in old C++ (as in C++ 1998). There’s a lot of boilerplate, a considerable amount of global state (very bad for multithreaded applications), and a lot of the memory management is left to the application. I can’t imagine anyone today is using it in production.

But I plunged ahead anyway. Sadly, it was a mistake. Somewhere it was stomping on the stack, often in ways that caused problems deep inside openssl (which I don’t use directly, instead using boost::beast and boost::certify). The stack corruption made debugging difficult, and I didn’t have the time to figure it out. And of course I’m always suspicious of openssl, given the fact that it’s written in C and many of us lived through Heartbleed and many other critical openssl vulnerabilities. To be honest, we’ve been in desperate need of a really good, modern C++ (at least C++11) implementation of TLS for more than a decade. Of course I could rant about our whole TLS mess for hours, but I’ll spare you.

Time being of the essence, I switched to pugixml. I don’t need DTD/Schema validation. Problem gone, a lot less code, and a much more modern API (much harder to shoot yourself in the foot).

Inside mcrover, I’m using XPath on the XML returned by Plex. The internal code is generic; it would not be much work to support other Web applications with XML interfaces. The XPath I look for in the XML is a configuration item, and really the only reason I have something specific for Plex is their use of an API token. But the configuration is generic enough that supporting other XML Web applications shouldn’t be difficult.
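As a rough illustration of the idea of a configurable XPath check, here is a sketch in Python using the standard library's ElementTree rather than pugixml. Everything in it is an assumption for the example: the XML document is invented (it is not Plex's actual response format), and the function name is hypothetical. ElementTree only supports a subset of XPath, whereas pugixml implements XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Invented sample document; NOT the real Plex response format.
SAMPLE = """<Server name="example">
  <Library id="1" state="ok" title="Movies"/>
  <Library id="2" state="ok" title="Music"/>
  <Library id="3" state="scanning" title="Photos"/>
</Server>"""

def xpath_matches(xml_text, xpath):
    """Return the elements matching a configured XPath expression."""
    root = ET.fromstring(xml_text)
    return root.findall(xpath)

# The expression would come from configuration; a check passes if it matches.
configured_xpath = ".//Library[@state='ok']"
matches = xpath_matches(SAMPLE, configured_xpath)
print([m.get("title") for m in matches])
```

Because the expression is just a configuration string, pointing the same check at a different XML web application only means changing the URL and the XPath.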

At any rate, what I have now works. So now I don’t get blackholed fixing a Plex issue when I haven’t used it for months and something has gone wrong. I know ahead of time.

UPS fiasco and mcrover to the rescue

I installed a new Eaton 5PX1500RT in my basement rack this week. I’d call it “planned, sort of…”. My last Powerware 5115 1U UPS went into an odd state which precipitated the new purchase. However, it was on my todo list to make this change.

I already own an Eaton 5PX1500RT, which I bought in 2019. I’ve been very happy with it. It’s in the basement rack, servicing a server, my gateway, ethernet switches and broadband modem. As is my desire, it is under 35% load.

The Powerware 5115 was servicing my storage server, and also under 35% load. This server has dual redundant 900W power supplies.

Installation of the new UPS… no big deal. Install the ears, install the rack rails, rack the UPS.

Shut down the devices plugged into the old UPS, plug them into the new UPS. Boot up, check each device.

Install the USB cable from the UPS to the computer that will monitor the state of the UPS. Install Network UPS Tools (nut) on that computer. Configure it, start it, check it.

This week, at this step things got… interesting.

I was monitoring the old Powerware 5115 from ‘ria’. ‘ria’ is a 1U SuperMicro server with a single Xeon E3-1270 V2. It has four 1G ethernet ports and a Mellanox 10G SFP+ card. Two USB ports. And a serial port which has been connected to the Powerware 5115 for… I don’t know, 8 years?

I can monitor the Eaton 5PX1500RT via a serial connection. However, USB is more modern, right? And the cables are less unwieldy (more wieldy). So I used the USB cable.

Trouble started here. The usbhid-ups driver did not reliably connect to the UPS. When it did, it took a long time (in excess of 5 seconds, an eternity in computing time). ‘ria’ is running FreeBSD 12.3-STABLE on bare metal.

I initially decided that I’d deal with it this weekend. Either go back to using a serial connection or try using a host other than ‘ria’. However…

I soon noticed long periods where mcrover was displaying alerts for many services on many hosts. Including alerts for local services, whose test traffic does not traverse the machine I touched (‘ria’). And big delays when using my web browser. Hmm…

Poking around, I seemed to only be able to reliably reproduce a network problem by pinging certain hosts with ICMPv4 from ria and observing periods where the round trip time would go from .05 milliseconds to 15 or 20 seconds. No packets lost, just periods with huge delays. These were all hosts on the same 10G ethernet network. ICMPv6 to the same hosts: no issues. Hmm…

I was eventually able to correlate (in my head) what I was seeing in the many mcrover alerts. On the surface, many didn’t involve ‘ria’. But under the hood they DO involve ‘ria’ simply because ‘ria’ is my primary name server. So, for example, tests that probe via both IPv6 and IPv4 might get the AAAA record but not the A record for the destination, or vice versa, or neither, or both. ‘ria’ is also the default route for these hosts. I homed in on the 10G ethernet interface on ‘ria’.

What did IPv4 versus IPv6 have to do with the problem? I don’t know without digging through kernel source. What was happening: essentially a network ‘pause’. Packets destined for ‘ria’ were not dropped, but queued for later delivery. As many as 20 seconds later! The solution? Unplug the USB cable for the UPS and kill usbhid-ups. In the FreeBSD kernel, is USB hoarding a lock shared with part of the network stack?

usbhid-ups works from another Supermicro server running the same version of FreeBSD. Different hardware (dual Xeon L5640). Same model of UPS with the same firmware.

This leads me to believe this isn’t really a lock issue. It’s more likely an interrupt routing issue. And I do remember that I had to add hw.acpi.sci.polarity="low" to /boot/loader.conf on ‘ria’ a while ago to avoid acpi0 interrupt storms (commented out recently with no observed consequence). What I don’t remember: what were all the issues I found that prompted me to add that line way back when?

Anyway… today’s lesson. Assume the last thing you changed has a high probability of being the cause, even if there seems to be no sensible correlation. My experience this week: “Unplug the USB connection to the UPS and the 10G ethernet starts working again. Wait, what?!”.

And today’s thanks goes to mcrover. I might not have figured this out for considerably longer if I did not have alert information in my view. Being a comes-and-goes problem that only seemed to be reproducible between particular hosts using particular protocols might have made this a much more painful problem to troubleshoot without reliable status information on a dedicated display. Yes, it took some thinking and observing, and then some manual investigation and backtracking. But the whole time, I had a status display showing me what was observable. Nice!

An ode to NSFNET and ANSnet: a simple NMS for home

A bit of history…

I started my computing career at NSFNET at the end of 1991, which then became ANSnet. In those days, we had a home-brewed network monitoring system. I believe most/all of it was originally the brainchild of Bill Norton. Later there were several contributors: Linda Liebengood, myself, others. The important thing for today’s thoughts: it was named “rover”, and its user interface philosophy was simple but important: “Only show me actionable problems, and do it as quickly as possible.”

To understand this philosophy, you have to know something about the primary users: the network operators in the Network Operations Center (NOC). One of their many jobs was to observe problems, perform initial triage, and document their observations in a trouble ticket. From there they might fix the problem, escalate to network engineering, etc. But it wasn’t expected that we’d have some omniscient tool that could give them all of the data they (or anyone else) needed to resolve the problem. We expected everyone to use their brains, and we wanted our primary problem reporter to be fast and as clutter-free as possible.

For decades now, I’ve spent a considerable amount of time working at home. Sometimes because I was officially telecommuting, at other times just because I love my work and burn midnight hours doing it. As a result, my home setup has become more complex over time. I have 10 gigabit ethernet throughout the house (some fiber, some Cat6A).  I have multiple 10 gigabit ethernet switches, all managed.  I have three rackmount computers in the basement that run 7×24.  I have ZFS pools on two of them, used for nightly backups of all networked machines, source code repository redundancy, Time Machine for my macOS machines, etc.  I run my own DHCP service, an internal DNS server, web servers, an internal mail server, my own automated security software to keep my pf tables current, Unifi, etc.  I have a handful of Raspberry Pis doing various things.  Then there’s all the other devices: desktop computers in my office, a networked laser printer, Roku, AppleTV, Android TV, Nest thermostat, Nest Protects, WiFi access points, laptops, tablet, phone, watch, Ooma, etc.  And the list grows over time.

Essentially, my home has become somewhat complex.  Without automation, I spend too much time checking the state of things or just being anxious about not having time to check everything at a reasonable frequency.  Are my ZFS pools all healthy?  Are all of my storage devices healthy?  Am I running out of storage space anywhere?  Is my DNS service working?  Is my DHCP server working?  My web server?  NFS working where I need it?  Is my Raspberry Pi garage door opener working?  Are my domains resolvable from the outside world?  Are the cloud services I use working?  Is my Internet connection down?  Is there a guest on my network?  A bandit on my network?  Is my printer alive?  Is my internal mail service working?  Are any of my UPS units running on battery?  Are there network services running that should not be?  What about the ones that should be, like sshd?

I needed a monitoring system that worked like rover; only show me actionable issues.  So I wrote my own, and named it “mcrover”.  It’s more of a host and service monitoring system than a network monitoring system, but it’s distributed and secure (using ed25519 stuff in libDwmAuth).  It’s modern C++, relatively easy to extend, and has some fun bits (ASCII art in the curses client when there are no alerts, for example).  Like the old Network Operations Center, I have a dedicated display in my office that only displays the mcrover Qt client, 24 hours a day.  Since most of the time there are no alerts to display, the Qt client toggles between a display of the next week’s forecast and a weather radar image when there are no alerts.  If there are alerts, the alert display will be shown instead, and will not go away until there are no alerts (or I click on the page switch in the UI).  The dedicated display is driven by a Raspberry Pi 4B running the Qt client from boot, using EGLFS (no X11).  The Raspberry Pi 4 is powered via PoE.  It is also running the mcrover service, to monitor local services on the Pi as well as many network services.  In fact the mcrover service is running on every 7×24 general purpose computing device.  mcrover instances can exchange alerts, hence I only need to look at one instance to see what’s being reported by all instances.

This has relieved me of a lot of sys admin and network admin drudgery.  It wasn’t trivial to implement, mostly due to the variety (not the quantity) of things it’s monitoring.  But it has proven itself very worthwhile.  I’ve been running it for many months now, and I no longer get anxious about not always keeping up with things like daily/weekly/monthly mail from cron and manually checking things.  All critical (and some non-critical) things are now being checked every 60 seconds, and I only have my attention stolen when there is an actionable issue found by mcrover.

So… an ode to the philosophy of an old system.  Don’t make me plow through a bunch of data to find the things I need to address.  I’ll do that when there’s a problem, not when there isn’t a problem.  For 7×24 general purpose computing devices running Linux, macOS or FreeBSD, I install and run the mcrover service and connect it to the mesh.  And it requires very little oomph; it runs just fine on a Raspberry Pi 3 or 4.

So why the weather display?  It’s just useful to me, particularly in the mowing season where I need to plan ahead for yard work.  And I’ve just grown tired of the weather websites.  Most are loaded with ads and clutter.  All of them are tracking us.  Why not just pull the data from tax-funded sources in JSON form and do it myself?  I’ve got a dedicated display which doesn’t have any alerts to display most of the time, so it made sense to put it there.

The Qt client using X11, showing the weather forecast.

The Qt client using EGLFS, showing the weather radar.

The curses client, showing ASCII art since there are no alerts to be shown.

mcblockd’s latest trick works: drop TCP connections

Evidence in the logs of mcblockd’s latest feature working. It’s successfully killing TCP connections when it adds a prefix to one of the pf tables.

Apr 29 03:42:40 ria mcblockd: [I] Dropped TCP connection from
Apr 29 03:42:40 ria mcblockd: [I] Added 221.144/12 (KR) to ssh_losers for 180 days
Apr 29 05:02:02 ria mcblockd: [I] Dropped TCP connection from
Apr 29 05:02:02 ria mcblockd: [I] Added 46.118/15 (UA) to ssh_losers for 180 days
Apr 29 07:07:42 ria mcblockd: [I] Dropped TCP connection from
Apr 29 07:07:42 ria mcblockd: [I] Added 120.128/13 (CN) to ssh_losers for 180 days
Apr 29 10:04:23 ria mcblockd: [I] Dropped TCP connection from
Apr 29 10:04:23 ria mcblockd: [I] Added 95.215.0/22 (RU) to ssh_losers for 180 days
Apr 29 11:51:34 ria mcblockd: [I] Dropped TCP connection from
Apr 29 11:51:34 ria mcblockd: [I] Added 110.240/12 (CN) to ssh_losers for 180 days
Apr 29 12:22:42 ria mcblockd: [I] Dropped TCP connection from
Apr 29 12:22:42 ria mcblockd: [I] Added 183.184/13 (CN) to ssh_losers for 180 days
Apr 29 13:13:54 ria mcblockd: [I] Dropped TCP connection from
Apr 29 13:13:54 ria mcblockd: [I] Dropped TCP connection from
Apr 29 13:13:54 ria mcblockd: [I] Added 120.144/12 (AU) to ssh_losers for 180 days
Apr 29 14:42:30 ria mcblockd: [I] Dropped TCP connection from
Apr 29 14:42:30 ria mcblockd: [I] Added 113.209/16 (CN) to ssh_losers for 180 days

mcblockd’s latest tricks: kill pf state, walk PCB list and kill TCP connections

Today I added a new feature to mcblockd to kill pf state for all hosts in a prefix when the prefix is added to one of my pf tables. This isn’t exactly what I want, but it’ll do for now.

mcblockd also now walks the PCB (protocol control block) list and drops TCP connections for hosts in a prefix I’ve just added to a pf table. Fortunately there was sample code in /usr/src/usr.sbin/tcpdrop/tcpdrop.c. The trick here is that I don’t currently have a means of mapping a pf table to where it’s applied (which ports, which interfaces). In the long term I might add code to figure that out, but in the interim I can configure ports and interfaces in mcblockd’s configuration file that will allow me to drop specific connections. For this first pass, I just toast all PCBs for a prefix.

The reason I added this feature: I occasionally see simultaneous login attempts from different IP addresses in the same prefix. If I’m going to block the prefix automatically, I want to cut off all of their connections right now, not after all of their connections have ended. Blowing away their pf state works, but leaves a hanging TCP connection in the ESTABLISHED state for a while. I want the PCBs to be cleaned up.
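The selection step (which connections belong to a prefix that was just blocked) can be sketched as follows. This is only the matching logic in Python, with made-up names and documentation addresses; it does not walk the kernel's PCB list or drop anything, which is what the tcpdrop sample code is for. The optional local_ports argument is an assumption standing in for the port configuration mentioned above.

```python
import ipaddress

def connections_to_drop(connections, prefix, local_ports=None):
    """Pick connections whose remote address lies inside the blocked prefix.

    connections: iterable of (remote_addr, remote_port, local_addr, local_port).
    local_ports: optional set of local ports to restrict to (hypothetical config).
    """
    net = ipaddress.ip_network(prefix)
    selected = []
    for conn in connections:
        remote_addr, _remote_port, _local_addr, local_port = conn
        if ipaddress.ip_address(remote_addr) not in net:
            continue
        if local_ports is not None and local_port not in local_ports:
            continue
        selected.append(conn)
    return selected

# Made-up connection list using documentation address ranges.
conns = [
    ("203.0.113.5", 51514, "192.0.2.1", 22),
    ("203.0.113.77", 40022, "192.0.2.1", 80),
    ("198.51.100.9", 33000, "192.0.2.1", 22),
]
print(connections_to_drop(conns, "203.0.113.0/24"))
print(connections_to_drop(conns, "203.0.113.0/24", local_ports={22}))
```

With no port restriction this matches the first-pass behavior described above: every connection from the prefix is selected.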

more on mcblockd automation progress

Similar to what I have for sshd, I have real time log processing on my web server. The secure remote communication with mcblockd is very nice to have, since my web server is a separate machine from my gateway/firewall. Below you can see offending web server log entries followed immediately by an action from mcblockd. Instant blocking, without my involvement.

- [20/Apr/2017:19:08:08] "GET /blog/xmlrpc.php HTTP/1.0" 200 42
Apr 20 19:08:09 ria mcblockd: [I] Added 185.36.100/22 (CZ) to www_losers for 30 days
- [20/Apr/2017:19:57:52] "POST /blog/xmlrpc.php HTTP/1.1" 500 -
Apr 20 19:57:52 ria mcblockd: [I] Added 191.101/16 (CL) to www_losers for 90 days
- [20/Apr/2017:20:12:07] "GET /blog/xmlrpc.php HTTP/1.0" 200 42
Apr 20 20:12:08 ria mcblockd: [I] Added 5.164/14 (RU) to www_losers for 90 days
- [22/Apr/2017:21:59:24] "GET /wp-login.php HTTP/1.1" 404 210
Apr 22 21:59:24 ria mcblockd: [I] Added 160.202.160/22 (KR) to www_losers for 90 days
- [23/Apr/2017:00:58:00] "GET /wp-login.php HTTP/1.1" 404 210
Apr 23 00:58:00 ria mcblockd: [I] Added 104.173.193/24 (US) to www_losers for 30 days
- [23/Apr/2017:04:18:19] "GET /wp-login.php HTTP/1.1" 404 210
Apr 23 04:18:19 ria mcblockd: [I] Added 191.37.0/17 (BR) to www_losers for 90 days
- [23/Apr/2017:07:50:15] "GET /xmlrpc.php HTTP/1.1" 404 208
Apr 23 07:50:15 ria mcblockd: [I] Added 103.229.124/22 (TW) to www_losers for 30 days
- [23/Apr/2017:09:40:35] "GET /wp-login.php HTTP/1.1" 404 210
Apr 23 09:40:36 ria mcblockd: [I] Added 61.72/13 (KR) to www_losers for 90 days
- [23/Apr/2017:10:30:24] "GET /blog/xmlrpc.php HTTP/1.0" 405 42
Apr 23 10:30:24 ria mcblockd: [I] Added 46.161.0/18 (RU) to www_losers for 90 days
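A minimal sketch of the log-matching side, in Python for illustration only. The log line layout (client address, two placeholder fields, bracketed timestamp, quoted request, status, size) and the list of watched paths are assumptions based on the requests shown above; the names are hypothetical and the address is from a documentation range.

```python
import re

# Assumed line layout: client - - [timestamp] "METHOD path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Paths seen in the offending requests above.
WATCHED_PATHS = ("/xmlrpc.php", "/wp-login.php")

def offending_client(line):
    """Return the client address if the request targets a watched path, else None."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    path = m.group("path").split("?", 1)[0]  # ignore any query string
    if path.endswith(WATCHED_PATHS):
        return m.group("client")
    return None

line = '203.0.113.7 - - [20/Apr/2017:19:08:08] "GET /blog/xmlrpc.php HTTP/1.0" 200 42'
print(offending_client(line))
```

A log follower would call something like this for each new line and hand any returned address to the blocking service.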

And yes, the threshold policy code works fine. Below is the result of someone trying to log in 5 times over a period of about 26 minutes. Since I have the threshold set to 5 times in 30 days, they were way above the threshold, but this would be considered a ‘slow’ attempt by some measures.

Apr 21 17:08:59 ria mcblockd: [I] Pending 69.162.73/24 (US) for ssh_losers, 1/5
Apr 21 17:08:59 ria mcblockd: [I] Pending 69.162.73/24 (US) for ssh_losers, 2/5
Apr 21 17:22:14 ria mcblockd: [I] Pending 69.162.73/24 (US) for ssh_losers, 3/5
Apr 21 17:22:14 ria mcblockd: [I] Pending 69.162.73/24 (US) for ssh_losers, 4/5
Apr 21 17:35:21 ria mcblockd: [I] Added 69.162.73/24 (US) to ssh_losers for 30 days

And another over a period of about 91 minutes:

Apr 23 01:39:43 ria mcblockd: [I] Pending 64.179.211/24 (CA) for ssh_losers, 1/5
Apr 23 01:39:43 ria mcblockd: [I] Pending 64.179.211/24 (CA) for ssh_losers, 2/5
Apr 23 02:25:48 ria mcblockd: [I] Pending 64.179.211/24 (CA) for ssh_losers, 3/5
Apr 23 02:25:48 ria mcblockd: [I] Pending 64.179.211/24 (CA) for ssh_losers, 4/5
Apr 23 03:11:15 ria mcblockd: [I] Added 64.179.211/24 (CA) to ssh_losers for 30 days
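The threshold behavior in these logs can be sketched as a sliding-window counter per prefix. This is a Python illustration under the assumption that the policy is "N hits within a time window"; the class and method names are made up and this is not mcblockd's implementation.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

class ThresholdPolicy:
    """Count hits per prefix; report 'Added' once the threshold is reached."""

    def __init__(self, threshold=5, window=timedelta(days=30)):
        self.threshold = threshold
        self.window = window
        self._hits = defaultdict(deque)

    def record(self, prefix, when):
        hits = self._hits[prefix]
        hits.append(when)
        # Forget hits that have aged out of the window.
        while when - hits[0] > self.window:
            hits.popleft()
        if len(hits) >= self.threshold:
            del self._hits[prefix]
            return "Added"
        return f"Pending {len(hits)}/{self.threshold}"

# Replay the timestamps from the first log excerpt above.
policy = ThresholdPolicy()
times = ["17:08:59", "17:08:59", "17:22:14", "17:22:14", "17:35:21"]
for t in times:
    when = datetime.strptime(f"2017-04-21 {t}", "%Y-%m-%d %H:%M:%S")
    print(t, policy.record("69.162.73.0/24", when))
```

Replaying those five timestamps yields four pending results followed by an add, the same sequence as the log.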

Raspberry Pi garage door opener: part 9 (done)

Not much to say here. I’ve been using the garage door opener for many months and it just works and is very stable.

dwm@pi1:/home/dwm% uptime
 3:33AM  up 123 days,  4:25, 1 users, load averages: 0.40, 0.15, 0.10

dwm@pi1:/home/dwm% psg mcpigdod
dwm   930   0.0  1.2 46452 11372  0- S    22Aug16  1748:10.15 mcpigdod

Raspberry Pi garage door opener: part 8

On Wednesday night I stuffed the enclosure with the Raspberry Pi, buttons, indicators and POE splitter after making all of the internal connections. I assembled the second Neutrik dataCON on the second rotary encoder. I temporarily taped my enclosure to the garage wall for testing, and connected the rotary encoders, door activation wires and the POE connection. I also attached the second magnetic door switch to the wall above the south door, and attached the magnet to the top of the door. I then did some basic testing. Both doors work correctly via the web app from my iPhone, and the rotary encoder connections work correctly.

On Thursday night I extended the wiring for the magnetic door switches (soldered joints and heat shrink), then sleeved the extensions with gray braided sleeve. Since I’m still waiting for a Neutrik jack for these, I’m temporarily using a dual row barrier strip to connect them to my PCB inside my enclosure.

Raspberry Pi garage door opener: part 7

I received my HAT PCBs that I designed for the garage door opener. I populated one of them and tested all of the outputs as well as the door closed switch inputs. Everything works. Yay! I will continue assembly tomorrow, and possibly test it wired into the garage doors.