Exploring reflection in C++26 (P2996)

I’ve been exploring C++ reflection a bit, using the Bloomberg fork of clang. I’ve yet to get my head fully around the syntax and the implications, but I have an obvious use case: more serialization functionality in libDwm.

To play around in ‘production’ code, I needed some feature test macros that aren’t in the Bloomberg fork, so that I can conditionally include code only when the C++26 features I need are present. P2996 tentatively proposed __cpp_impl_reflection and __cpp_lib_reflection, so I added those. I also need features from P3491, whose proposed test macro is __cpp_lib_define_static, which I also added. Finally, I added __cpp_expansion_statements (the feature test macro for P1306).

Technically, __cpp_lib_reflection and __cpp_lib_define_static should be in <meta>, but I added them to the compiler built-ins just because it’s convenient for now.
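
With those macros in place, conditional inclusion is just an ordinary preprocessor guard; a minimal sketch of what I’m doing in libDwm (the guard can go away once compilers catch up):

  #if defined(__cpp_impl_reflection) && defined(__cpp_lib_reflection) \
      && defined(__cpp_lib_define_static) \
      && defined(__cpp_expansion_statements)
    //  ... reflection-based Read()/Write() implementations ...
  #endif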

I’ve run into some minor gotchas when implementing some generic reflection for types not already covered by serialization facilities in libDwm.

As an example… what to do about deserializing instances of structs and classes with const data members? Obviously the const members can’t be cleanly written without going through a constructor. I haven’t given it much thought yet, but at first glance it’s a bit of a sore spot.

Another is what to do about members with types such as std::mutex, whose presence inside a class generally implies mutual exclusion logic within the class. That logic can’t be ascertained by reflection. Do I need to lock a discovered mutex during serialization? During deserialization too? Do I serialize the mutex as a boolean so that deserialization can lock or unlock it, or just skip it?

For now, since we have P3394 in the Bloomberg clang fork, I’ve decided that using annotations that allow members of a structure or class to be skipped is a good idea. So in my libDwm experiment, I now have a Dwm::skip_io annotation that will cause Read() and Write() functions to skip over data that is marked with this annotation. For example:

  #include <iostream>
  #include <sstream>
  #include "DwmStreamIO.hh"

  struct Foo {
    [[=Dwm::skip_io]] int  i;
    std::string            s;
  };

  int main(int argc, char *argv[]) {
    Foo  foo1 { 42, "hello" };
    Foo  foo2 { 99, "goodbye" };

    std::stringstream  ss;
    if (Dwm::StreamIO::Write(ss, foo1)) {
      if (Dwm::StreamIO::Read(ss, foo2)) {
        std::cout << foo2.i << ' ' << foo2.s << '\n';
      }
    }
    return 0;
  }

Would produce:

  99 hello

The i member of Foo is neither written nor read, while the s member is written and read. Hence we see that foo2.i remains unchanged after Read(), while foo2.s is changed.
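
For context, a P3394 annotation like Dwm::skip_io is just a constexpr value attached to a member with [[=...]] and discoverable via std::meta::annotations_of(). A minimal sketch of how such an annotation might be declared (the actual libDwm definition may differ):

  namespace Dwm {
    //  Tag type whose only purpose is to mark members that Read()
    //  and Write() should skip.
    struct SkipIOTag {};
    inline constexpr SkipIOTag  skip_io;
  }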

Chasing pointers

A significant issue with trying to use reflection for generic serialization/deserialization is what to do with pointers (raw, std::unique_ptr, std::shared_ptr, et al.). One of the big issues here is rooted in the fact that for my own use, reflection for serialization/deserialization is desired for structures that come from the operating system environment (POSIX, etc.). Those structures were created for C APIs, not C++ APIs, and pointers within them are always raw. Generically, there’s no way to know what they point to. A single element on the heap? An array on the heap? A static array? In other words, I don’t know whether the pointer points to one object or an unbounded number of objects, and hence I don’t know how many objects to write, allocate or read.

Even with allocations performed via operator new[], we don’t have a good means of determining how many entries are in the array.

For now, I deny serialization / deserialization of pointers, with one experimental exception: std::unique_ptr whose deleter is std::default_delete<T> (implying a single object).
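
That gate doesn’t need reflection at all; a minimal sketch of the kind of trait involved (the name is hypothetical):

  #include <memory>
  #include <type_traits>

  //  True only for std::unique_ptr<T> with the default deleter and a
  //  non-array T, i.e. a pointer we can assume owns exactly one object.
  template <typename T>
  struct IsSerializableUniquePtr : std::false_type {};

  template <typename T>
  struct IsSerializableUniquePtr<std::unique_ptr<T,std::default_delete<T>>>
    : std::bool_constant<! std::is_array_v<T>> {};

  static_assert(IsSerializableUniquePtr<std::unique_ptr<int>>::value);
  static_assert(! IsSerializableUniquePtr<std::unique_ptr<int[]>>::value);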

Reflection-based code is easier to read

Let’s say we have a need to check that all the types in a std::tuple satisfy a boolean predicate named IsStreamable. Before C++26 reflection, I’d wind up writing something like this:

    template <typename TupleType>
    consteval bool TupleIsStreamable()
    {
      auto  l = []<typename ...ElementType>(ElementType && ...args)
        { return (IsStreamable<ElementType>() && ...); };
      return std::apply(l, std::forward<TupleType>(TupleType()));
    }

While this is far from terrible, it’s not much fun to read (especially for a novice), and it still uses relatively modern C++ features (lambda expressions with an explicit template parameter list, and consteval). Not to mention that std::apply is only applicable to tuple-like types (std::tuple, std::pair and std::array). And critically, the above requires default constructibility (note the TupleType() constructor call). Because of this final requirement, I often have to resort to the old school technique (add a size_t template parameter and use recursion to check all element types) or the almost-as-old-school technique of using make_index_sequence. Also note that real code would have a requires clause on this function template to verify that TupleType is a tuple-like type; I’ve left it out here for brevity.
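
For reference, the make_index_sequence variant (which at least avoids the default constructibility requirement) looks something like this:

    template <typename TupleType, std::size_t... Is>
    consteval bool TupleIsStreamableImpl(std::index_sequence<Is...>)
    {
      return (IsStreamable<std::tuple_element_t<Is,TupleType>>() && ...);
    }

    template <typename TupleType>
    consteval bool TupleIsStreamable()
    {
      return TupleIsStreamableImpl<TupleType>(
        std::make_index_sequence<std::tuple_size_v<TupleType>>{});
    }

It works, but the reader still has to untangle the pack expansion and the index_sequence plumbing.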

The equivalent with C++26 reflection:

    template <typename TupleType>    
    consteval bool TupleIsStreamable()
    {
      constexpr const auto tmpl_args =
        define_static_array(template_arguments_of(^^TupleType));
      template for (constexpr auto tmpl_arg : tmpl_args) {
        if constexpr (! IsStreamable<typename[:tmpl_arg:]>()) {
          return false;
        }
      }
      return true;
    }

While this is more lines of code, it’s significantly easier to reason about once you understand the basics of P2996. And without changes (other than renaming the function), it works with some other standard class templates such as std::variant. With some minor additions, it will also handle std::vector, std::set, std::multiset, std::map, std::multimap, std::unordered_set, std::unordered_multiset, std::unordered_map and std::unordered_multimap. In the example below, we only check type template parameters, and if ParamCount is non-zero, we only look at the first ParamCount template parameters:

    template <typename TemplateType, size_t ParamCount = 0>    
    consteval bool IsStreamableStdTemplate()
    {
      constexpr const auto tmpl_args =
        define_static_array(template_arguments_of(^^TemplateType));
      size_t  numParams = 0, numTypes = 0, numStreamable = 0;
      template for (constexpr auto tmpl_arg : tmpl_args) {
        ++numParams;
        if (ParamCount && (numParams > ParamCount)) {
          break;
        }
        if (std::meta::is_type(tmpl_arg)) {
          ++numTypes;
          if constexpr (! IsStreamable<typename[:tmpl_arg:]>()) {
            break;
          }
          ++numStreamable;
        }
      }
      return (numTypes == numStreamable);
    }

We can use this with std::vector, std::deque, std::list, std::set, std::multiset, std::unordered_set and std::unordered_multiset by using a ParamCount value of 1. We can use this with std::map, std::multimap, std::unordered_map and std::unordered_multimap by using a ParamCount value of 2. Hence the following would be valid calls to this function (at compile time), just as a list of examples:

    IsStreamableStdTemplate<std::array<int,42>>()
    IsStreamableStdTemplate<std::pair<std::string,int>>()
    IsStreamableStdTemplate<std::set<std::string>>()
    IsStreamableStdTemplate<std::vector<int>,1>()
    IsStreamableStdTemplate<std::deque<int>,1>()
    IsStreamableStdTemplate<std::list<int>,1>()
    IsStreamableStdTemplate<std::map<std::string,int>,2>()
    IsStreamableStdTemplate<std::tuple<int,std::string,bool,char>>()
    IsStreamableStdTemplate<std::variant<int,bool,char,std::string>>()
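
Since these functions are consteval, the natural home for such calls is a static_assert or a requires clause, for example:

    static_assert(IsStreamableStdTemplate<std::map<std::string,int>,2>());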

Early thoughts

I have some initial thoughts about what we’re getting with reflection in C++26.

I think the decision to approve P2996 for C++26 was a good one. Having std::meta::info be an opaque type to the user, with a bunch of functions (more can be added later) to access it, is a good thing. From just the simple examples here, it’s pretty easy to see how it’s going to make some code much easier to understand, even without doing anything fancy (note that I only used a single splice in each example).

I’ve done my fair share of template metaprogramming over the years. While it’s satisfying to be able to do it right when it’s the best (or only) option, it’s generally much more work than I’d like. And more than once I’ve found the cognitive load to be much higher than I’d like. I can’t count the number of times I’ve gone to modify some template metacode 3 years after writing it, only to find that I underestimated the time to make a change just due to the difficulty of comprehending the code (regardless of the quality of the comments and names). This is especially true when I haven’t recently been deep in this kind of code. And it has only gotten worse over time. Standard C++ has never been a small, simple language, but despite many features in C++11 and beyond being very useful, the scope of the language is now so big that there are probably no programmers on the planet who know the whole language and standard library. It’s too much for any one person to hold in their head.

One other thought is a note of caution about what some people seem to expect without having dug into the details. What we’re getting is not a panacea for serialization and deserialization. Yes, it will indeed be useful for such activities. But it doesn’t magically solve concurrency issues, nor the legacy issues with pointers, unbounded arrays, arrays decaying to pointers, etc. Even our smart pointers introduce unsolved issues for serialization and deserialization. And it’s the legacy structures we often need from our operating system that will remain largely unsolved for the foreseeable future, along with any other place where we’re forced to interact with C APIs that were not designed for security, safety, and concurrency.

I’m optimistic that what we’re getting is going to be incredibly useful. While I’ve only scratched the surface in this post, I’m already at the point in my own code changes where I wish we had all the voted-in papers in our compilers today. The number of features I’ll be able to cleanly add to my libraries is significant, and so is the refactoring I’ll be able to do (already underway, with conditional compilation via feature macros I can later remove). I’m anxiously awaiting official compiler support!

New home network monitor using ICMP

I finally got around to finding a real purpose for a monitor in my home office that’s been idle for a long time. I put together a quick hack to display the round trip time and packet loss to some of the devices in my home. It uses ICMP, is multi-threaded (as is nearly everything I’ve written in C++ in the last 20 years), and uses QCustomPlot for display.

I’m calling this ‘qmcping’. I run this on an ancient 1920×1200 display in my home office, next to the qmcrover display. It runs fine on a Raspberry Pi 4, but I’m running it on my Threadripper workstation since that’s the machine connected to this monitor. I only use the workstation via remote access, so I run ‘qmcping’ full screen via a direct frame buffer (no X11 or Wayland). It runs 7×24.

The statistical boxes show the minimum RTT, 25th percentile RTT, median RTT, 75th percentile RTT and 95th percentile RTT for the last 100 echo requests. The red bars show packet loss for the last 100 echo requests. At a glance I can see if things are fairly normal on my home network.
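
The percentile computation itself is simple; a minimal sketch of the sort of thing qmcping does over its window of recent samples (names are hypothetical, and this uses the nearest-rank method):

    #include <algorithm>
    #include <vector>

    //  Returns the RTT at the given percentile of the most recent
    //  samples.  Takes @c rtts by value since we reorder it.
    double RttPercentile(std::vector<double> rtts, double pct)
    {
      if (rtts.empty()) { return 0.0; }
      size_t  idx = (pct / 100.0) * (rtts.size() - 1);
      std::nth_element(rtts.begin(), rtts.begin() + idx, rtts.end());
      return rtts[idx];
    }

The minimum is just the 0th percentile and the median the 50th, so one function covers all five statistics.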

I also have a curses version called ‘mcping’ that I can run in a terminal.

There’s some humor in the fact that I didn’t do this a long time ago. In the early 1990’s, I was the author of an ICMP monitor for a large network service provider, with a list of targets in the tens of thousands (eventually more than 100,000). It was written in C, but wasn’t all that much different. This isn’t difficult code to write.

What has changed? Well, I don’t trust the Internet and haven’t for a long time. The late 1980’s and early 1990’s were a naive time for TCP/IP. That time has passed. So today, my ICMP monitor puts a cryptographically strong random 32-byte sequence in each echo request. When I receive an echo reply, I verify that this sequence is one I recently sent to the destination. This helps prevent someone from spoofing an ICMP echo reply.
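
A sketch of the bookkeeping (names are hypothetical; the real code keeps per-target state and expires stale entries):

    #include <array>
    #include <chrono>
    #include <cstdint>
    #include <map>

    using Nonce = std::array<uint8_t,32>;  // filled from a CSPRNG

    //  Outstanding echo requests for one target: nonce -> time sent.
    std::map<Nonce,std::chrono::steady_clock::time_point>  outstanding;

    //  Only accept a reply whose payload matches a nonce we recently
    //  sent, and erase it so a duplicate or replay won't match again.
    bool VerifyReply(const Nonce & payload)
    {
      auto  it = outstanding.find(payload);
      if (it == outstanding.end()) { return false; }
      outstanding.erase(it);
      return true;
    }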

Using GitHub for repository backups

Just a note to myself since it’s convenient to have another set of off-site backups of some of my repositories…

I’ll use libDwm as an example here.

gh repo create libDwm --public
cd ~/tmp
git clone --mirror git@depot:libDwm
cd libDwm.git
git push --mirror git@github.com:dwmcrobb/libDwm
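
To refresh the backup later, something like this should work from the same mirror clone:

cd ~/tmp/libDwm.git
git remote update
git push --mirror git@github.com:dwmcrobb/libDwm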

Tracking my “1,000 miles by Sept. 1” cycling goal with software

I decided I wanted a means of tracking progress toward my 1,000 miles of cycling by September 1st. So this weekend I threw together a C++ library and tool to read a fairly simple JSON data file and emit Chart.js plots and a table.

The JSON file is pretty straightforward. All I need to do after each ride is add a new entry to the “rides” array. I split the lines in the display below, but in reality I have a single line per “ride” entry. It’s easy to add an entry since I just copy one and edit as needed. The order of the “rides” entries doesn’t matter, since my library will sort them by the “odometer” field as needed. So I always add my latest ride as the first “rides” entry just because it’s the fastest (no scrolling).

Note that the “odometer” field for each ride is my bike odometer reading at the end of the given ride. This lets me compute the miles for the given ride by subtracting the odometer reading for the previous ride. In the case of the first ride, I subtract the “startMiles” which represents the odometer reading when the goal is created.

{
    "rides": [
	{ "time": "6/29/2024 21:30", "odometer": 538.6,
          "minutes": 42, "calories": 481 },
	{ "time": "6/28/2024 15:45", "odometer": 529.1,
          "minutes": 40, "calories": 476 },
	{ "time": "6/27/2024 20:30", "odometer": 519.5,
          "minutes": 40, "calories": 168 },
	{ "time": "6/27/2024 15:00", "odometer": 515.0,
          "minutes": 53, "calories": 578 },
	{ "time": "6/26/2024 19:25", "odometer": 504.6,
          "minutes": 15, "calories": 121 },
	{ "time": "6/26/2024 15:22", "odometer": 500.6,
          "minutes": 34, "calories": 317 },
	{ "time": "6/25/2024 20:40", "odometer": 491.4,
          "minutes": 35, "calories": 430 },
	{ "time": "6/24/2024 23:50", "odometer": 482.6,
          "minutes": 35, "calories": 410 },
	{ "time": "6/24/2024 14:00", "odometer": 473.8,
          "minutes": 60, "calories": 830 }
    ],
    "startTime": "6/24/2024 00:00",
    "goalTime": "9/1/2024 00:00",
    "startMiles": 461.0,
    "goalMiles": 1000
}
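
The per-ride arithmetic described above is just successive differences over the sorted odometer readings; a minimal sketch (the struct and function are hypothetical, but the field names match the JSON):

    #include <algorithm>
    #include <vector>

    struct Ride { double odometer; /* time, minutes, calories, ... */ };

    //  Returns the miles for each ride: its odometer reading minus the
    //  previous reading, seeded with the goal's startMiles.
    std::vector<double> RideMiles(std::vector<Ride> rides, double startMiles)
    {
      std::sort(rides.begin(), rides.end(),
                [](const Ride & a, const Ride & b)
                { return (a.odometer < b.odometer); });
      std::vector<double>  miles;
      double  prev = startMiles;
      for (const auto & ride : rides) {
        miles.push_back(ride.odometer - prev);
        prev = ride.odometer;
      }
      return miles;
    }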

From this simple file I can generate useful charts and a table. I wrote a CGI program that utilizes the C++ library and can emit Chart.js graphs and an HTML table. I can embed these with iframes in a web page. That’s exactly what I’m doing to track my progress. You can see these at https://www.rfdm.com/Daniel/Cycling/Goals/20240901.php. And I can of course embed them here too!

One of the things that’s cool about this is that it’s really easy to create a new goal file and just update it after each ride. I can create shorter goals in new files. I can easily change the goal miles or goal time as desired.

I’ll probably create a command line utility to let me add rides without editing the file directly, mainly because I can use it to sanity-check input (say, an incorrect date/time), and can of course make it a secure client/server application. And at some point I might make an iOS app that grabs the relevant data via HealthKit APIs so I don’t have to type at all.

Unacknowledged SYNs June 11 through June 13, 2024

The AI race continues unabated. It’s interesting to note that Facebook appears to be ramping up their web scraping. Is this their plan B, in response to the EU not being happy about their plans to scrape their own users’ content without their consent?[1] They’ve been banging on my web server’s door harder than anyone else for the last few days. I’ve had them blocked for some time, so I can only assume their recent escalation of probing is their desperation for LLM training data.

I wonder when legislators will figure out that big tech is basically aiming tractor beams at the rest of the Internet. I can’t help but think of the giant electromagnet on Lockdown’s “Knight Ship” from the “Transformers: Age of Extinction” movie, indiscriminately vacuuming up every magnetic object in its reach.

I have more than a decade’s worth of data for packets that have entered or exited my home network. I recently archived all but the last 2 years or so in cold storage. But I can say with confidence that the traffic to my web site has completely transformed in the last few years. It used to be mostly what appeared to be ordinary people (coming from residential broadband address space), most referred by a search engine or a message board. Today, the legitimate traffic looks the same as old, but it is dwarfed (by more than 4 decimal orders of magnitude) by big tech scraping and ne’er-do-wells using cloud infrastructure for nefarious purposes.

The Internet is quickly becoming another casualty of corporate greed. My self-written firewall automation is currently blocking 1,098,452,334 IPv4 addresses from port 80 and 443. That’s 27.54% of the publicly routable IPv4 unicast address space according to my bar napkin arithmetic. Yikes. Would you swim in a river where more than 1 in 4 of the molecules is toxic?

  1. Meta halts plans to train AI on Facebook, Instagram posts in EU

Unacknowledged SYNs by AS (autonomous system) May 17, 2024 to May 23, 2024

I still have just two of Google’s /24’s blocked in AS 15169: 66.249.66/24 and 66.249.72/24. But connections from these 2 networks accounted for 58,935 unacknowledged SYNs sent to my home network this week. That’s more than one every 11 seconds on average (58,935 SYNs over a week’s 604,800 seconds works out to one every ~10.3 seconds). Should I assume this is all due to Google looking for more training data to prevent their LLM from telling me to put glue on my pizza? 🙂

Probably worth noting that as of today, I’m blocking more than 20% of the publicly routable IPv4 address space from my web server.

% dwmnet -f /etc/pf.www_losers
Addresses: 831,455,990 (19.36% of 2^32, 20.85% of publicly routable unicast)

Unacknowledged SYNs by AS (autonomous system) May 20, 2024

Congratulations to Hetzner Online for hosting douchebag of the day (144.76.72.24, i.e. static.24.72.76.144.clients.your-server.de). I’ve had all of Hetzner Online’s address space blocked from my web server for a long time due to this sort of thing (their customers running braindead bots). They’re another AS I’d recommend completely blocking from your web site.

If it helps anyone, here’s a list of the routes currently originating from Hetzner’s AS 24940.

5.9/16
5.75.128/17
23.88.0/17
37.27/16
45.136.70/23
45.145.227/24
45.148.28/22
46.4/16
49.12/15
65.21/16
65.108/15
78.46/15
78.138.62/24
85.10.192/18
88.99/16
88.198/16
89.42.83/24
91.107.128/17
91.190.240/21
91.233.8/22
94.130/16
95.216/15
116.202/15
128.140.0/17
135.181/16
136.243/16
138.201/16
142.132.128/17
144.76/16
148.251/16
157.90/16
159.69/16
162.55/16
167.233/16
167.235/16
168.119/16
171.25.225/24
176.9/16
178.63/16
178.212.75/24
185.50.120/23
185.107.52/22
185.126.28/22
185.157.83/24
185.157.176/22
185.171.224/22
185.189.228/22
185.213.45/24
185.216.237/24
185.226.99/24
185.228.8/23
185.242.76/24
185.253.111/24
188.34.128/17
188.40/16
188.245/16
193.25.170/23
193.110.6/23
193.163.198/24
194.42.180/22
194.42.184/22
194.62.106/24
195.60.226/24
195.201/16
195.248.224/24
197.242.84/22
201.131.3/24
204.29.146/24
213.133.96/19
213.232.193/24
213.239.192/18
216.55.108/22
217.78.237/24

Unacknowledged SYNs by AS (autonomous system) May 15, 2024

I still have two /24 Google networks blocked by automation: 66.249.66.0/24 and 66.249.72.0/24. These are used by Googlebot crawlers. I don’t intend to leave them blocked indefinitely. But with just these two /24 networks, Google sits at the top of the list of TCP SYN senders I don’t acknowledge.

I don’t log User-agent on my web server, nor use it to make any decisions. Some would argue that I could use it to differentiate Google Gemini / Vertex AI scraping from other Google bots. However, User-agent isn’t reliable. I’ve seen rogue crawlers with User-agent set to Googlebot or some other bot they want to impersonate for some reason. My favorite values are the ones with typos. 🙂 In the age of LLM training, the longstanding handshake agreement is breaking down. For the same reason, I don’t use robots.txt to try to prevent crawlers from crawling particular content. I can’t trust a web client to abide by a handshake agreement that’s trivially betrayed and unenforceable. I don’t expect Google to violate the handshake agreement. But if I want to block or not block Google bots, I’m going to do it via more reliable means than User-agent.

For this reason, I wish Google would explicitly use different address space for different bots. It would make blocking or shaping traffic to/from their bots easier and more reliable.

Unacknowledged SYNs by AS (autonomous system) May 8, 2024

The only interesting thing about this day… one of Google’s crawlers tripped one of my automatic blockers (by loading my blog login page). So 66.249.66/24 got blocked. And as expected (and as I’ve blogged about before), Google is far and away the leading consumer of my web site. Just their crawlers in 66.249.66/24 account for 9,177 connection attempts per day. Yes, some of the SYNs are retries. But just the same… if your web site is around for a while, it’ll get hammered by Google day in and day out.

I already know their crawlers are not particularly smart. For example, I’ve blocked them from pulling PDF files because they grab the same ones over and over (often in the same week), including PDFs that haven’t changed in over 10 years. So I’m going to leave the block in place for a while to see what happens. I suspect nothing; I don’t think the crawlers are smart enough to stop trying, despite getting nothing in response.

Unacknowledged SYNs by AS (autonomous system) May 6, 2024

Another day, mostly the same stuff. The only significant difference from recent days: one douchebag on Comcast in Illinois (67.163.12.10).

The douchebag in Illinois…

  Source Addr  Destination Port     Pkts  Bytes
  ------------ -----------------  ------ ------
  67.163.12.10 http               1.448K 57.92K
  67.163.12.10 imaps              1.447K 57.88K
  67.163.12.10 https              1.443K 57.72K
  67.163.12.10 submission         1.433K 57.32K
  67.163.12.10 imap                1.43K  57.2K
  67.163.12.10 ssh                1.418K 56.72K
  67.163.12.10 2222               1.417K 56.68K