The AI race continues unabated. It’s interesting to note that Facebook appears to be ramping up their web scraping. Is this their plan B, in response to the EU not being happy about their plan to scrape their own users’ content without consent? They’ve been banging on my web server’s door harder than anyone else for the last few days. I’ve had them blocked for some time, so I can only assume their recent escalation of probing reflects desperation for LLM training data.
I wonder when legislators will figure out that big tech is basically aiming tractor beams at the rest of the Internet. I can’t help but think of the giant electromagnet on Lockdown’s “Knight Ship” from the “Transformers: Age of Extinction” movie, indiscriminately vacuuming up every magnetic object in its reach.
I have more than a decade’s worth of data for packets that have entered or exited my home network. I recently archived all but the last 2 years or so in cold storage. But I can say with confidence that the traffic to my web site has completely transformed in the last few years. It used to be mostly what appeared to be ordinary people (coming from residential broadband address space), most referred by a search engine or a message board. Today, the legitimate traffic looks the same as old, but it is dwarfed (by more than 4 decimal orders of magnitude) by big tech scraping and ne’er-do-wells using cloud infrastructure for nefarious purposes.
The Internet is quickly becoming another casualty of corporate greed. My self-written firewall automation is currently blocking 1,098,452,334 IPv4 addresses from port 80 and 443. That’s 27.54% of the publicly routable IPv4 unicast address space according to my bar napkin arithmetic. Yikes. Would you swim in a river where more than 1 in 4 of the molecules is toxic?
I still have just two of Google’s /24s blocked in AS 15169: 66.249.66.0/24 and 66.249.72.0/24. But connections from these two networks accounted for 58,935 unacknowledged SYNs sent to my home network this week. That’s more than one every 11 seconds on average. Should I assume this is all due to Google looking for more training data to prevent their LLM from telling me to put glue on my pizza?
Probably worth noting that as of today, I’m blocking more than 20% of the publicly routable IPv4 address space from my web server.
% dwmnet -f /etc/pf.www_losers
Addresses: 831,455,990 (19.36% of 2^32, 20.85% of publicly routable unicast)
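For anyone who wants to reproduce the bar napkin arithmetic, here’s a minimal sketch. The size of the publicly routable unicast pool depends on exactly which reserved blocks you subtract from 2^32, so the constant below is an approximation and the result will differ slightly from dwmnet’s:

# Bar napkin arithmetic: what fraction of the IPv4 address space a block list covers.
# ROUTABLE_UNICAST is approximate; the exact value depends on which reserved
# blocks (multicast, RFC 1918, loopback, etc.) you choose to exclude.
TOTAL_IPV4 = 2**32
ROUTABLE_UNICAST = 3_990_000_000  # rough size of the publicly routable unicast pool

def coverage(blocked: int) -> str:
    return (f"{blocked:,} addresses: "
            f"{100 * blocked / TOTAL_IPV4:.2f}% of 2^32, "
            f"{100 * blocked / ROUTABLE_UNICAST:.2f}% of publicly routable unicast")

print(coverage(831_455_990))     # the dwmnet figure above
print(coverage(1_098_452_334))   # roughly 27.5% of publicly routable unicast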
Congratulations to Hetzner Online for hosting douchebag of the day (144.76.72.24, i.e. static.24.72.76.144.clients.your-server.de). I’ve had all of Hetzner Online’s address space blocked from my web server for a long time due to this sort of thing (their customers running braindead bots). They’re another AS I’d recommend completely blocking from your web site.
If it helps anyone, here’s a list of the routes currently originating from Hetzner’s AS 24940.
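If you’d rather generate the list yourself, here’s a rough sketch using RIPEstat’s public “announced-prefixes” endpoint; the output path and the pf table name are placeholders, so adapt them to your own setup:

# Sketch: dump the prefixes currently announced by Hetzner (AS 24940) into a
# one-prefix-per-line file that pf can load as a table. Assumes RIPEstat's
# "announced-prefixes" data API; response shape per their documentation.
import json
import urllib.request

ASN = "AS24940"
URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

prefixes = sorted({p["prefix"] for p in data["data"]["prefixes"]
                   if ":" not in p["prefix"]})      # IPv4 only, for this example

with open("/etc/pf.hetzner", "w") as f:             # hypothetical path
    f.write("\n".join(prefixes) + "\n")
print(f"wrote {len(prefixes)} prefixes")

# pf side (in pf.conf), roughly:
#   table <hetzner> persist file "/etc/pf.hetzner"
#   block in quick proto tcp from <hetzner> to any port { 80 443 }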
I still have two /24 Google networks blocked by automation: 66.249.66.0/24 and 66.249.72.0/24. These are used by Googlebot crawlers. I don’t intend to leave them blocked indefinitely. But with just these two /24 networks, Google sits at the top of the list of TCP SYN senders I don’t acknowledge.
I don’t log User-agent on my web server, nor use it to make any decisions. Some would argue that I could use it to differentiate Google Gemini / Vertex AI scraping from other Google bots. However, User-agent isn’t reliable. I’ve seen rogue crawlers with User-agent set to Googlebot or some other bot they want to impersonate for some reason. My favorite values are the ones with typos. In the age of LLM training, the longstanding handshake agreement is breaking down. For the same reason, I don’t use robots.txt to try to prevent crawlers from crawling particular content. I can’t trust a web client to abide by a handshake agreement that’s trivially betrayed and unenforceable. I don’t expect Google to violate the handshake agreement. But if I want to block or not block Google bots, I’m going to do it via more reliable means than User-agent.
For this reason, I wish Google would explicitly use different address space for different bots. It would make blocking or shaping traffic to/from their bots easier and more reliable.
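For what it’s worth, if you do want to separate real Google crawlers from impostors without trusting User-agent, the usual trick is a reverse lookup plus a forward confirmation (Google documents the googlebot.com / google.com suffixes for this). A minimal sketch, standard library only:

# Sketch: verify a claimed Googlebot by source address instead of User-agent.
# Reverse-resolve the IP, check the domain suffix, then forward-resolve the
# name and confirm it maps back to the same address.
import socket

def is_google_crawler(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False

print(is_google_crawler("66.249.66.1"))   # an address inside one of the /24s above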
The only interesting thing about this day… one of Google’s crawlers tripped one of my automatic blockers (by loading my blog login page). So 66.249.66.0/24 got blocked. And as I expected (and have blogged about before), Google is far and away the leading consumer of my web site. Just their crawlers in 66.249.66.0/24 account for 9,177 connection attempts per day. Yes, some of the SYNs are retries. But just the same… if your web site is around for a while, it’ll get hammered by Google day in and day out.
I already know their crawlers are not particularly smart. For example, I’ve blocked them from pulling PDF files because they grab the same ones over and over (often in the same week), including PDFs that haven’t changed in over 10 years. So I’m going to leave the block in place for a while to see what happens. I suspect nothing; I don’t think the crawlers are smart enough to stop trying, despite getting nothing in response.
Different day, similar stuff. One thing that has changed here is that I’ve disabled port 587 (submission) on my gateway, since the only mail I care about receiving on my local mail server is generated locally. I also added to the list of those I block from IMAPS. So now the spammers and those attempting to relay show up, as well as those trying to access my IMAPS service for my local mail. Hurricane Electric is a primary offender here, as is G-Core Labs. All of the traffic from Redoubt Networks is for port 587; they’re apparently happy to host spammers.
The total count of unacknowledged SYNs received on this day: 37,611. On average, that’s one every 2.3 seconds. Put another way, about 26 per minute. The disheartening part: this is more than 8X the number of SYNs I accepted. 89.5% of the TCP connection attempts directed at my home network from the public Internet are rejected.
Think about that for a minute… almost 90% of attempted incoming TCP connections are unwanted garbage. This is the modern Internet.
If you’re a webmaster or content creator, it’s worth noting that my two leading offenders, Microsoft and Amazon, are both defendants in lawsuits involving copyright. The problem here is that big tech is scraping every web site they can for LLM and other AI training data, seemingly with little to no regard for copyright or attribution. See https://www.theregister.com/2024/03/13/nyt_hacking_response/ for The New York Times’ claim against Microsoft and OpenAI, and https://www.theregister.com/2024/04/22/ghaderi_v_amazon/ for a claim against Amazon. In today’s landscape, blocking just their address space isn’t going to stop them, since they can buy hosting elsewhere just like the rest of us. But I’m coming to the conclusion that I shouldn’t make it easy for any of them.
Same stuff, different day. The top 25 ASes probing my home network despite the fact that they get nothing in response to the TCP SYNs they’re sending. Per usual, the top 3 spots are occupied by U.S. cloud providers.
On the topic of undesired traffic directed at my home network from the public internet, let’s look at unacknowledged SYN packets for the top offending autonomous systems (ASes) on a per-port basis, for just one day.
It’s worth noting a distinction between two classes of SYNs I don’t acknowledge: those directed at ports on which nothing is listening, and those directed at ports on which something is listening but coming from a source I don’t acknowledge due to a history of nefarious activity. Nefarious activity against my web server, for example, might include looking for vulnerabilities I’ve never had (say ‘/login.php’ or ‘phpmyadminwhatever…’).
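To make the distinction concrete, here’s a rough sketch of how one might bucket incoming SYNs, assuming one record per SYN with a source address and destination port (the port list, blocked networks, and sample records are illustrative, not my actual configuration):

# Sketch: classify SYNs into "nothing listening on that port" versus
# "listener exists, but the source is on a block list".
from collections import Counter
from ipaddress import ip_address, ip_network

LISTENING_PORTS = {80, 443, 993}
BLOCKED_NETS = [ip_network("66.249.66.0/24"),    # e.g. a blocked crawler /24
                ip_network("144.76.0.0/16")]     # e.g. a chunk of Hetzner

def classify(src: str, dport: int) -> str:
    if dport not in LISTENING_PORTS:
        return "closed-port probe"
    if any(ip_address(src) in net for net in BLOCKED_NETS):
        return "blocked source"
    return "would be accepted"

syns = [("66.249.66.12", 443), ("144.76.72.24", 80), ("203.0.113.5", 3306)]
print(Counter(classify(src, port) for src, port in syns))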
What’s interesting about looking at things from this perspective is the different nature of the probes from various host/cloud providers. For example, compare what comes from Cloudflare (just web traffic) to the skulduggery-only traffic that comes from G-Core Labs S.A. The difference is fairly astounding to find at this level of granularity (entire autonomous systems). And of course we see what we’d expect from Amazon and Google… mostly port 443 and 80 probes, but also scans of the entire 16-bit port range.
So let’s take a look…
First up, today’s leading offender: Microsoft. Most of the probing is directed at my web server.
8075 Microsoft Corporation (US)
  Port                    Packets
  443 (https)             5372
  80 (http)               936
  587 (submission)        7
  2375                    3
  102 (iso-tsap)          2
  1521                    1
  2000                    1
  21 (ftp)                1
  3306 (mysql)            1
  5432 (postgresql)       1
  5985                    1
  6379                    1
  9000                    1
In second place we have Huawei. All of it directed at my web server.
136907 Huawei (HK)
  Port                    Packets
  443 (https)             3793
  80 (http)               600
In third place we have G-Core Labs S.A. Most of it is login attempts. I recommend blocking this AS entirely. All of the traffic I’ve received from them is nefarious in nature. I only recently added all of their address space to my ‘deny’ lists, but parts of it had been blocked by automation before.
199524 G-Core Labs S.A. (LU)
  Port                    Packets
  22 (ssh)                936
  143 (imap)              930
  993 (imaps)             923
  587 (submission)        916
  2222                    166
  443 (https)             165
  80 (http)               162
In fourth place we have Amazon. Most of it is directed at my web server, but there are probes to every port.
16509 Amazon.com, Inc. (US)
  Port                    Packets
  80 (http)               2185
  443 (https)             1761
  8414                    2
  3524                    2
  3092                    2
  1022 (exp2)             2
  96 (dixie)              1
  98 (tacnews)            1
  100 (newacct)           1
  106 (pop3pw)            1
  123 (ntp)               1
  263 (hdap)              1
  95 (supdup)             1
  448 (ddm-ssl)           1
  450 (tserver)           1
  502 (mbap)              1
  541 (uucp-rlogin)       1
  646 (ldp)               1
  88 (kerberos-sec)       1
  … port scans of ALL ports …
In fifth place we have Cloudflare. All of their probes were directed at my web server.
13335 Cloudflare, Inc. (US)
  Port                    Packets
  80 (http)               1460
  443 (https)             406
In sixth place we have Brightspeed. All of their probes were directed at my web server.
19901 Brightspeed (US)
  Port                    Packets
  443 (https)             1787
  80 (http)               74
In seventh place we have SEMrush. All of their probes were directed at my web server.
209366 SEMrush CY LTD (CY)
  Port                    Packets
  443 (https)             1100
  80 (http)               425
In eighth place we have Google Cloud.
396982 Google Cloud (US)
  Port                    Packets
  80 (http)               243
  443 (https)             125
  8088                    7
  20257                   7
  20256                   6
  22 (ssh)                6
  3389 (ms-wbt-server)    5
  3000                    5
  10001                   5
  5000                    5
  8080 (http-alt)         4
  8888                    4
  … scans of all ports …
In ninth place we have Facebook. All of their probes were directed at my web server.
32934 Facebook, Inc. (US)
  Port                    Packets
  443 (https)             695
  80 (http)               340
In 10th place we have PT Batanghari Baik Net. Pure skulduggery: ssh only.
141069 PT Batanghari Baik Net (ID)
  Port                    Packets
  22 (ssh)                964
I have this data, every day, for every AS, in 5 minute intervals. And then some, actually, since the data is at the IP address level and I can quickly and easily roll it up into per-AS data. And of course I can use it to make decisions about who I should block from my home network.
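As an illustration of that rollup (not my actual tooling), here’s a minimal sketch that aggregates per-IP SYN counts into per-AS totals using a longest-prefix match against a prefix-to-origin-AS table; the table and the counts below are placeholders:

# Sketch: roll per-IP SYN counts up into per-AS totals.
# PREFIX_TO_AS stands in for a real prefix-to-origin-AS table (e.g. built from
# a BGP or RIR dump); the per-IP counts below are made up.
from collections import Counter
from ipaddress import ip_address, ip_network

PREFIX_TO_AS = {
    ip_network("66.249.64.0/19"): "15169 Google LLC",
    ip_network("144.76.0.0/16"):  "24940 Hetzner Online GmbH",
    ip_network("157.55.0.0/16"):  "8075 Microsoft Corporation",
}

def origin_as(ip: str) -> str:
    addr = ip_address(ip)
    matches = [net for net in PREFIX_TO_AS if addr in net]
    if not matches:
        return "unknown"
    return PREFIX_TO_AS[max(matches, key=lambda net: net.prefixlen)]  # longest prefix wins

per_ip_syns = {"66.249.66.12": 6212, "66.249.72.9": 2965, "144.76.72.24": 250}
per_as = Counter()
for ip, count in per_ip_syns.items():
    per_as[origin_as(ip)] += count

for asn, total in per_as.most_common():
    print(f"{total:8d}  {asn}")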
This week’s SYNners. The U.S. cloud providers continue to be the worst offenders. I don’t see this changing without some type of legislation. The AI arms race continues unabated.
Note that Apple and Google will show up high in this list as soon as I add their crawler address space to my blocked address space. Google likely at the top, and Apple likely in second.