Threadripper 3960X: the birth of ‘thrip’

I recently assembled a new workstation for home. My primary need was a machine for software development, including deep learning. This machine is named “thrip”.

Having looked hard at my options, I decided on AMD Threadripper 3960X as my CPU. A primary driver was of course bang for the buck. I wanted PCIe 4.0, at least 18 cores, at least 4-channel RAM, the ability to utilize 256G or more of RAM, and to stay in budget.

By CPU core count alone, the 3960X is over what I needed. On the flip side, it’s constrained to 256G of RAM, and it’s also more difficult to keep cool than most CPUs (280W TDP). But on price-per-core, and overall performance per dollar, it was the clear winner for my needs.

Motherboard-wise, I wanted 10G ethernet, some USB-C, a reasonable number of USB-A ports, room for 2 large GPUs, robust VRM, and space for at least three NVMe M.2 drives. Thunderbolt 3 would have been nice, but none of the handful of TRX40 boards seem to officially support it (I don’t know if this is an Intel licensing issue or something else). The Gigabyte board has the header and Wendell@Level1Techs seems to have gotten it working, but I didn’t like other aspects of the Gigabyte TRX40 AORUS EXTREME board (the XL-ATX form factor, for example, is still limiting in terms of case options).

I prefer to build my own workstations. It’s not due to being particularly good at it, or winding up with something better than I could get pre-built. It’s that I enjoy the creative process of selecting parts and putting it all together.

I had not assembled a workstation in quite some time. My old i7-2700K machine has met my needs for most of the last 8 years. And due to a global pandemic, it wasn’t a great time to build a new computer. The supply chain has been troublesome for over 6 months now, especially for some specific parts (1000W and above 80+ titanium PSUs, for example). We’ve also had a huge availability problem for the current GPUs from NVIDIA (RTX 3000 series) and AMD (Radeon 6000 series). And I wasn’t thrilled about doing a custom water-cooling loop again, but I couldn’t find a worthy quiet cooling solution for Threadripper and 2080ti without going custom loop. Given the constraints, I wound up with these parts as the guts:

  • Asus TRX40 ROG Zenith II Extreme Alpha motherboard
  • AMD Threadripper 3960X CPU (24 cores)
  • 256 gigabytes G.Skill Trident Z Neo Series RGB DDR4-3200 CL16 RAM (8 x 32G)
  • EVGA RTX 2080 Ti FTW3 Ultra GPU with EK Quantum Vector FTW3 waterblock
  • Sabrent 1TB Rocket NVMe 4.0 Gen4 PCIe M.2 Internal SSD
  • Seasonic PRIME TX-850, 850W 80+ Titanium power supply
  • Watercool HEATKILLER IV PRO for Threadripper, pure copper CPU waterblock

It’s all in a Lian Li PC-O11D XL case. I have three 360mm radiators, ten Noctua 120mm PWM fans, an EK Quantum Kinetic TBE 200 D5 PWM pump, PETG tubing and a whole bunch of Bitspower fittings.

My impressions thus far: it’s fantastic for Linux software development. It’s so nice to be able to run ‘make -j40‘ on large C++ projects and have them complete in a timely manner. And thus far, it runs cool and very quiet.

An ode to NSFNET and ANSnet: a simple NMS for home

A bit of history…

I started my computing career at NSFNET at the end of 1991. Which then became ANSnet. In those days, we had a home-brewed network monitoring system. I believe most/all of it was originally the brainchild of Bill Norton. Later there were several contributors; Linda Liebengood, myself, others. The important thing for today’s thoughts: it was named “rover”, and its user interface philosophy was simple but important: “Only show me actionable problems, and do it as quickly as possible.”

To understand this philosophy, you have to know something about the primary users: the network operators in the Network Operations Center (NOC). One of their many jobs was to observe problems, perform initial triage, and document their observations in a trouble ticket. From there they might fix the problem, escalate to network engineering, etc. But it wasn’t expected that we’d have some omniscient tool that could give them all of the data they (or anyone else) needed to resolve the problem. We expected everyone to use their brains, and we wanted our primary problem reporter to be fast and as clutter-free as possible.

For decades now, I’m spent a considerable amount of time working at home. Sometimes because I was officially telecommuting, at other times just because I love my work and burn midnight hours doing it. As a result, my home setup has become more complex over time. I have 10 gigabit ethernet throughout the house (some fiber, some Cat6A).  I have multiple 10 gigabit ethernet switches, all managed.  I have three rackmount computers in the basement that run 7×24.  I have ZFS pools on two of them, used for nightly backups of all networked machines, source code repository redundancy, Time Machine for my macOS machines, etc.  I run my own DHCP service, an internal DNS server, web servers, an internal mail server, my own automated security software to keep my pf tables current, Unifi, etc.  I have a handful of Raspberry Pis doing various things.  Then there’s all the other devices: desktop computers in my office, a networked laser printer, Roku, AppleTV, Android TV, Nest thermostat, Nest Protects, WiFi access points, laptops, tablet, phone, watch, Ooma, etc.  And the list grows over time.

Essentially, my home has become somewhat complex.  Without automation, I spend too much time checking the state of things or just being anxious about not having time to check everything at a reasonable frequency.  Are my ZFS pools all healthy?  Are all of my storage devices healthy?  Am I running out of storage space anywhere?  Is my DNS service working?  Is my DHCP server working?  My web server?  NFS working where I need it?  Is my Raspberry Pi garage door opener working?  Are my domains resolvable from the outside world?  Are the cloud services I use working?  Is my Internet connection down?  Is there a guest on my network?  A bandit on my network?  Is my printer alive?  Is my internal mail service working?  Are any of my UPS units running on battery?  Are there network services running that should not be?  What about the ones that should be, like sshd?

I needed a monitoring system that worked like rover; only show me actionable issues.  So I wrote my own, and named it “mcrover”.  It’s more of a host and service monitoring system than a network monitoring system, but it’s distributed and secure (using ed25519 stuff in libDwmAuth).  It’s modern C++, relatively easy to extend, and has some fun bits (ASCII art in the curses client when there are no alerts, for example).  Like the old Network Operations Center, I have a dedicated display in my office that only displays the mcrover Qt client, 24 hours a day.  Since most of the time there are no alerts to display, the Qt client toggles between a display of the next week’s forecast and a weather radar image when there are no alerts.  If there are alerts, the alert display will be shown instead, and will not go away until there are no alerts (or I click on the page switch in the UI).  The dedicated display is driven by a Raspberry Pi 4B running the Qt client from boot, using EGLFS (no X11).  The Raspberry Pi4 is powered via PoE.  It is also running the mcrover service, to monitor local services on the Pi as well as many network services.  In fact the mcrover service is running on every 7×24 general purpose computing device.  mcrover instances can exchange alerts, hence I only need to look at one instance to see what’s being reported by all instances.

This has alleviated me of a lot of sys admin and network admin drudgery.  It wasn’t trivial to implement, mostly due to the variety (not the quantity) of things it’s monitoring.  But it has proven itself very worthwhile.  I’ve been running it for many months now, and I no longer get anxious about not always keeping up with things like daily/weekly/monthly mail from cron and manually checking things.  All critical (and some non-critical) things are now being checked every 60 seconds, and I only have my attention stolen when there is an actionable issue found by mcrover.

So… an ode to the philosophy of an old system.  Don’t make me plow through a bunch of data to find the things I need to address.  I’ll do that when there’s a problem, not when there isn’t a problem.  For 7×24 general purpose computing devices running Linux, macOS or FreeBSD, I install and run the mcrover service and connect it to the mesh.  And it requires very little oomph; it runs just fine on a Raspberry Pi 3 or 4.

So why the weather display?  It’s just useful to me, particularly in the mowing season where I need to plan ahead for yard work.  And I’ve just grown tired of the weather websites.  Most are loaded with ads and clutter.  All of them are tracking us.  Why not just pull the data from tax-funded sources in JSON form and do it myself?  I’ve got a dedicated display which doesn’t have any alerts to display most of the time, so it made sense to put it there.

The Qt client using X11, showing the weather forecast.

mcrover Qt client using X11, showing the weather forecast

The Qt client using X11, showing the weather radar.

mcrover Qt client using X11, showing the weather radar

The curses client, showing ASCII art since there are no alerts to be shown.

mcrover curses client with no alerts.