Guarding My Git Forge Against AI Scrapers
In August 2024, one of my roommates and partners messaged the apartment group
chat, saying she noticed the internet was slow again at our place, and my
forgejo was unable to render any page in
under 15 seconds.
i investigated, thinking it would be a trivial little problem to solve. Soon
enough, however, i would uncover hundreds of thousands of queries a day, from
thousands of individual IPs, fetching seemingly-random pages in my forge,
all day, every day.
This post summarizes the practical issues that arose as a result of the onslaught
of scrapers eager to download millions of commits off of my forge, and the
measures i put in place to limit the damage.
# Why the forge?
In the year 2025, on the web, everything is worth being scraped. Everything
that came out of the mind of a human is liable to be snatched up in the
vastest labor theft scheme in the history of mankind. This very article, the
second it gets published in any indexable page, will be added to countless
datasets meant to train foundational large-language models. My words, your
words, have contributed infinitesimal shifts of neural-network weights
underpinning the largest, most grotesque accumulation of wealth seen over the
lifetime of my parents, grandparents, and their grandparents themselves.
Oh, and forges have a lot of commits. See, if you have a publicly exposed
repository, every file in every folder of every commit gets its own page, all
linked together. Add other views, such as the git blame of a file, and multiply
by the number of files and commits. Add the raw download link, also multiplied
by the number of commits.
Say, hypothetically, you have a linux repository available, with only
the commits in the master branch up to the v6.17 tag from 2025-09-18.
That’s 1,383,738 commits in the range 1da177e4c3f4..e5f0a698b34e. How many
files is that? Well:
# Count the tree entries reachable from every commit in the range
count=0;
while read -r rev; do
    point=$(git ls-tree -tr "$rev" | wc -l);
    count=$(( count + point ));
    printf "[%s] %s: %d (tot: %d)\n" "$(git log -1 --pretty=tformat:%cs "$rev")" "$rev" "$point" "$count";
done < <(git rev-list "1da177e4c3f4..e5f0a698b34e");
printf "Total: %d\n" "$count";
i ran this on the 100 commits before v6.17. With git ls-tree -tr $rev, you count
both files and directories; replacing it with git ls-tree -r $rev counts only
files. i got 72024729 files, and 76798658 files and
directories. Running on the whole history of Linux’s master branch yields
78,483,866,182 files, and 83,627,462,277 files and directories.
Now, for a ballpark estimate of the number of pages that can be scraped if you
have a copy of Linux, apply the formula (where Nfiles and Nfilesandfolders are
the totals summed over every commit, as computed above):
Nfiles * 2 + Nfilesandfolders * 2 + Ncommits * 3
That is, applied to my hypothetical Linux repository:
78483866182 * 2 + 83627462277 * 2 + 1383738 * 3 = 324,226,808,132 pages
The first *2 accounts for the fact that every file of every commit can be
scraped raw, and git-blame‘d. The second term of the formula considers every
single file or folder page. The final Ncommits * 3 accounts for every commit
summary page. (In theory, every file of every commit can also be diffed with
its version in every other commit, which this estimate does not even include.)
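If you want to double-check the arithmetic, the shell can do it for you (these
are the totals computed above):
echo $(( 78483866182 * 2 + 83627462277 * 2 + 1383738 * 3 ))
# prints 324226808132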
That gives, for me, 324 billion 226 million 808 thousand and 132 pages that can
be scraped. From a single repository. Assume that every scraper agent that
enters one of these repositories will also take note of every other link on the
page, and report them so that other agents can scrape them. These scrapers
effectively act like early 2000s web spiders that crawled the internet to index
it, except they do not care about robots.txt, and they will absolutely keep
scraping new links again and again with no strategy to minimize the cost on
you, as a host.
# The Cost of Scraping
As i am writing the original draft of this section, the longer-term measures i
put in place have been removed, so i could gather up-to-date numbers to display
how bad the situation is.
i pay for the electricity that powers my git forge. Okay, actually, one of my
roommates does, but we put it on the calc sheet where we keep track of who pays
what (when we remember).
At the time i began fighting scrapers, my git forge ran on an old desktop
computer in my living room. Now, it runs in a virtual machine on our home’s
rack server. i never got to measure the difference in power consumption
between being scraped and not being scraped on the desktop machine, but i did on the
rack server. If memory serves me right, stopping the wave of scrapers reduced
the power draw of the server from ~170W to ~150W.
Right now, with all the hard drives in that server spinning, and every protection
off, we are drawing 200W from the power grid on that server. Constantly. By the
end of this experiment, my roommates and i will have computed that the difference
in power usage caused by scraping costs us ~60 euros a year.
Another related cost is that the VM that runs the forge is figuratively suffocating
under the amount of queries. Not all queries are created equal, either: requests to
see the blame of a file or a diff between commits incur a higher cost than
just rendering the front page of a repository. The last huge wave of scraping
left my VM at 99+% usage of 4 CPU cores and 2.5GiB of RAM,
whereas the usual levels i observe are closer to 4% CPU usage, and an oscillation
between 1.5GiB and 2GiB of RAM.
As i’m writing this, the VM running forgejo eats 100% of 8 CPU cores.
Additionally, the networking cost is palpable. Various monitoring tools let me see
the real-time traffic statistics in our apartment. Before i put the first
measures in place to thwart scraping, we could clearly see the traffic going out
of the desktop computer running my forge and out to the internet. My roommates’
complaints that it slowed down the whole internet here were in fact well-founded: when
we had multiple people watching live streams or doing pretty big downloads, they
were throttled by the traffic out of the forge.1
The egress data rate of my forge’s VM is at least 4MB/s (32Mbit/s). Constantly.
Finally, the human cost: i have spent entire days behind my terminals
trying to figure out 1) what the fuck was going on and 2) what the fuck to do
about it. i have had conversations with other people who self-host their
infrastructure, desperately trying to figure out workable solutions that would
not needlessly impact our users. And the funniest detail is: that rack
server is in the living room, directly in front of my bedroom door. It usually
purrs like an adorable cat, but, lately, it’s been whirring louder and louder.
i can hear it. when i’m trying to sleep.
# Let’s do some statistics.
i was curious to analyze the nginx logs to understand where the traffic came
from and what shape it took.
As a case study, we can work on /var/log/nginx/git.vulpinecitrus.info/ from
2025-11-14 to 2025-11-19. Note that on 2025-11-15 at 18:27 UTC, i
stopped the redirection of new agents into the Iocaine crawler maze (see
below). At 19:15 UTC, i removed the nginx request limit zone from the
/Lymkwi/linux/ path. At 19:16 UTC i removed the separation of log files
between IPs flagged as bots, and IPs not flagged as bots.
The three measures i progressively put in place later were: web caching
(2025-11-17), manually sending IPs to a garbage generator with a rate-limit
(Iocaine 2) (2025-11-14, 15 and 18), and then Iocaine
3 (2025-11-19).
| Common Logs | Successful | Delayed (429) | Error (5XX) | Measures in place |
|---|---|---|---|---|
| 2025-11-14 | 275323 | 66517 | 0 | Iocaine 2.1 + Rate-limiting |
| 2025-11-15 | 71712 | 54259 | 9802 | Iocaine 2.1 + Rate-limiting |
| 2025-11-16 | 140713 | 0 | 65763 | None |
| 2025-11-17 | 514309 | 25986 | 3012 | Caching, eventually rate-limiting2 |
| 2025-11-18 | 335266 | 20280 | 1 | Iocaine 2.1 + Rate-limiting |
| 2025-11-19 | 3183 | 0 | 0 | Iocaine 3 |
| Bot Logs | Successful | Delayed (429) | Error (5XX) | Measures in place |
|---|---|---|---|---|
| 2025-11-14 (bots) | 41388 | 65517 | 0 | Iocaine 2.1 + Rate-limiting |
| 2025-11-15 (bots) | 34190 | 53403 | 63 | Iocaine 2.1 + Rate-limiting |
| 2025-11-16 (bots) | – | – | – | (no bot-specific logs) |
| 2025-11-17 (bots) | – | – | – | (no bot-specific logs) |
| 2025-11-18 (bots) | 390013 | 0 | 13 | Iocaine 2.1 + Rate-limiting |
| 2025-11-19 (bots) | 731593 | 0 | 0 | Iocaine 3 |
Table 1: Number of Queries Per Day
(Commands used to generate Table 1)
Assuming your log file is git-access-2025-11-14.log.gz:
zcat git-access-2025-11-14.log.gz | grep '" 200 ' | wc -l
zcat git-access-2025-11-14.log.gz | grep '" 429 ' | wc -l
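A similar grep should give you the 5XX column as well:
zcat git-access-2025-11-14.log.gz | grep -E '" 5[0-9]{2} ' | wc -l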
Without spoiling too much, caching was an utter failure, and the improvement i
measured from manually rate-limiting a set of IPs (from Huawei Cloud and Alibaba)
on the Linux repository only helped so much. When all protections dropped, my
server became so unresponsive that backend errors (usually timeouts) spiked.
Errors also happened with caching, when nginx ran into an issue while buffering
a reply. Overall, caching encouraged more queries.
Once Iocaine was deployed, the vast majority of queries were routed away from
the backend, with no errors reported, and no delaying because all of the IPs
i manually rate-limited were caught by Iocaine instead.
Out of all these queries, 117.64.70.34 is the most common source of requests,
with 226023 total queries originating from the ChinaNet-Backbone ASN (AS4134).
It is followed by 136.243.228.193 (13849 queries), an IP from Hetzner whose
hostname ironically resolves to
crawling-gateway-136-243-228-193.dataforseo.com. Then comes 172.17.0.3, the
uptime prober of VC Status, with 6908
queries, and 74.7.227.127, an IP from Microsoft’s AS 8075 (6117 queries).
| Day | Unique IP Count |
|---|---|
| 2025-11-14 | 16461 |
| 2025-11-15 | 18639 |
| 2025-11-16 | 41712 |
| 2025-11-17 | 47252 |
| 2025-11-18 | 22480 |
| 2025-11-19 | 14230 |
Table 2: Grand Total of Unique IPs Querying the Forge
(Commands used to generate Table 2)
Assuming your log files are called *git-access-2025-11-14.log.gz:
zcat *git-access-2025-11-14.log.gz | awk '{ print $1 }' | sort | uniq -c | wc -l
On the two days where restrictions were lifted or only caching was in place, the
number of unique IPs querying the forge doubled. The more you facilitate the
work of these crawlers, the more they are going to pound you. They will always
try and get more out of your server than you are capable of providing.
| Day | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| 2025-11-14 | (226089) – /reibooru/reibooru | (40189) – /Lymkwi/linux | (1454) – / | (1405) – /rail | (1174) – /Soblow/indi-hugo |
| 2025-11-15 | (35163) – /Lymkwi/linux | (18952) – /vc-archival/youtube-dl | (4197) – /vc-archival/youtube-dl-original | (1655) – /reibooru/reibooru | (1635) – /Lymkwi/gr-gsm |
| 2025-11-14 (bots) | (40189) – /Lymkwi/linux | (270) – /oror/necro | (79) – /Lymkwi/[REDACTED]3 | (55) – /vc-archival/youtube-dl | (52) – /oror/asm |
| 2025-11-15 (bots) | (32895) – /Lymkwi/linux | (260) – /oror/necro | (193) – /Lymkwi/gr-gsm | (95) – /Lymkwi/[REDACTED]3 | (48) – /alopexlemoni/GenderDysphoria.fyi |
| 2025-11-16 | (72687) – /vc-archival/youtube-dl | (23028) – /Lymkwi/linux | (16779) – /vc-archival/youtube-dl-original | (5390) – /reibooru/reibooru | (3585) – /Lymkwi/gr-gsm |
| 2025-11-17 | (361632) – /vc-archival/youtube-dl | (74048) – /vc-archival/youtube-dl-original | (18136) – /reibooru/reibooru | (13147) – /oror/necro | (12921) – /alopexlemoni/GenderDysphoria.fyi |
| 2025-11-18 | (227019) – /vc-archival/youtube-dl | (46004) – /vc-archival/youtube-dl-original | (12644) – /alopexlemoni/GenderDysphoria.fyi | (12624) – /reibooru/reibooru | (7712) – /oror/necro |
| 2025-11-18 (bots) | (261346) – /vc-archival/youtube-dl | (43923) – /vc-archival/youtube-dl-original | (20195) – /alopexlemoni/GenderDysphoria.fyi | (18808) – /reibooru/reibooru | (10134) – /oror/necro |
| 2025-11-19 | (1418) – / | (1248) – /rail | (356) – /Soblow | (31) – /assets/img | (25) – /Soblow/IndigoDen |
| 2025-11-19 (bots) | (448626) – /vc-archival/youtube-dl | (73164) – /vc-archival/youtube-dl-original | (39107) – /reibooru/reibooru | (37107) – /alopexlemoni/GenderDysphoria.fyi | (25921) – /vc-archival/YSLua |
Table 3: Top 5 Successful Repo/Account/Page Hits Per Day
(Commands used to generate Table 3)
Assuming you want data for the log file called git-access-2025-11-14.log.gz:
zcat git-access-2025-11-14.log.gz | grep '" 200 ' | awk '{ print $7 }' \
    | cut -d/ -f -3 | sort | uniq -c | sort -n \
    | tail -n 5 | tac
Big repositories with a lot of commits and a lot of files are a bountiful resource
for the crawlers. Once they enter those, they will take ages to leave, if only because of the
sheer amount of pages that can be generated by following the links of a repository.
Most legitimate traffic seems to be either fetching profiles (a couple of my users
have their profiles listed in their fediverse bios) or the root page of my forge.
| Rank | 2025-11-14 (all) | 2025-11-15 (all) | 2025-11-16 (all) |
|---|---|---|---|
| Top 1 | (8532) – AS136907 (Huawei Clouds) | (8537) – AS136907 (Huawei Clouds) | (8535) – AS136907 (Huawei Clouds) |
| Top 2 | (2142) – AS45899 (VNPT Corp) | (2107) – AS45899 (VNPT Corp) | (4002) – AS212238 (Datacamp Limited) |
| Top 3 | (803) – AS153671 (Liasail Global Hongkong Limited) | (895) – AS153671 (Liasail Global Hongkong Limited) | (3504) – AS9009 (M247 Europe SRL) |
| Top 4 | (555) – AS5065 (Bunny Communications) | (765) – AS45102 (Alibaba US Technology Co., Ltd.) | (3206) – AS3257 (GTT Communications) |
| Top 5 | (390) – AS21859 (Zenlayer Inc) | (629) – AS5065 (Bunny Communications) | (2874) – AS45899 (VNPT Corp) |
Table 4: Top ASN Per Day For The First Three Days, Per Unique IP Count
(Commands used to generate Table 4)
For this, i needed a database of IP-to-ASN data. i got one from
IPInfo by registering for a free account and using their
web API. i first scripted a mapping of unique IP addresses to AS number. For
example, for the log file bot-git-access-2025-11-18.log.gz:
# Map every unique source IP in the log to its AS number via the IPInfo API
while read -r ip; do
    ASN=$(curl -qfL "api.ipinfo.io/lite/$ip?token=" | jq -r .asn);
    printf "%s %s\n" "$ip" "$ASN" | tee -a 2025-11-18-bot.ips.txt;
done < <(zcat bot-git-access-2025-11-18.log.gz | awk '{ print $1 }' | sort | uniq)
Then, with this map, i run:
cat 2025-11-18-bot.ips.txt | cut -d' ' -f 2 | sort | uniq -c | sort -n | tail -n 5
So my largest hits are from Huawei Clouds (a VPS provider), VNPT (a Vietnamese mobile and home ISP),
Liasail Global HK Limited (a VPS/”AI-powering service” provider),
Bunny Communications LLC (a broadband ISP for residential users), and Zenlayer (a CDN/cloud infrastructure provider).
When i lifted all protections, Datacamp Limited (a VPS provider), GTT Communications (some sort of bullshit-looking ISP4 which, i have been informed, is in fact a backbone operator), and M247 Europe SRL (a hosting provider) suddenly appeared. If memory serves me right, Datacamp, GTT and M247 were also
companies i had flagged during my initial investigation in summer 2024, and added to the manually blocked/limited IPs alongside
all of Huawei Cloud and Alibaba.
Interestingly, both Liasail and Zenlayer mention that they “Power AI” on their front page. They sure do.
Worryingly, VNPT and Bunny Communications are home/mobile ISPs. i cannot say for certain that
their IPs belong to domestic users, but it seems worrisome that these are among the top scraping sources once
you remove the most obviously malicious actors.
# The Protection Measures
i have one goal, and one constraint. The goal: protect the forge as much as
possible, by either blocking bots or offloading the cost to my VPS provider
(whose electricity i do not pay for). The constraint: i was not going to deploy
a proof-of-work-based captcha system such as Anubis.
There are two reasons for this constraint:
- i personally find that forcing your visitors to expend more computational power to prove they’re not a scraper is bad praxis. There are devices out there that legitimately want that access, but have limited computational power or features. And, yeah, there are multiple types of challenges, some of which take low-power devices into account, or even devices that cannot run JavaScript, but,
- Scrapers can easily bypass Anubis. It’s not a design flaw. Anubis is harm reduction, not pixie dust.
i tried layers of solutions:
- caching on the reverse proxy
- Iocaine 2 with no classifiers, which generates garbage in reply to any query you send it
- manually redirecting IPs and rate-limiting them
- deploying Iocaine 3, with its classifiers (Nam-Shub-of-Enki)
## Reverse-Proxy Caching
i have a confession to make: i never realized that nginx did not cache anything
by default. That realization promptly came with the other realization that
caching things correctly is hard. i may, some day, write about my experience
of protecting a service that posted links to itself on the fediverse, so that
it wouldn’t slow to a crawl for ten minutes after every post.
For this and the remaining measures, i will be showing my solutions in nginx. You can,
almost certainly, figure out a way of doing exactly the same thing with any other
decent reverse proxy software.
To create a cache for my forge, i add the following line to the http block of /etc/nginx/nginx.conf:
proxy_cache_path /var/cache/nginx/forge/ levels=1:2 keys_zone=forgecache:100m;
That will create a 2-level cache called forgecache, storing its files under
/var/cache/nginx/forge, with 100MB of shared memory reserved for cache keys and
metadata (the cached data itself is not capped here, since there is no max_size).
i create the directory and make www-data its owner and group.
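Concretely, that preparation step is something like this (www-data being the
user nginx runs as on my Debian-flavored setup):
mkdir -p /var/cache/nginx/forge
chown www-data:www-data /var/cache/nginx/forge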
In /etc/nginx/sites-enabled/vcinfo-git.conf, where my git forge’s site
configuration sits, i have a location block that serves the whole root of the
service, which i modify thusly:
location / {
proxy_cache forgecache;
proxy_buffering on;
proxy_cache_valid any 1h;
add_header X-Cached $upstream_cache_status;
expires 1h;
proxy_ignore_headers "Set-Cookie";
proxy_hide_header "Set-Cookie";
# more stuff...
}
That configuration does several things: it turns on caching and buffering at
the proxy
(proxy_buffering),
telling it to use forgecache
(proxy_cache)
and keep any page valid for an hour
(proxy_cache_valid).
It also adds a response header that will let you debug whether a query hit or
missed the cache (add_header). The expires directive adds headers telling
your visitor’s browser that the content they cache will also expire in an hour
(expires).
Finally, the cache ignores any response header that sets a cookie
(proxy_ignore_headers,
proxy_hide_header),
to attempt to remove any page that could be customized for a user once they log
in.
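To check whether any of this works, you can ask for the same page twice and
look at that debugging header; the second response should say HIT if the page
was indeed cached (the URL is a placeholder for your own forge):
curl -sI https://git.example.org/ | grep -i x-cached
curl -sI https://git.example.org/ | grep -i x-cached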
The result? Caching was a disaster, predictably so. Caching works when the
same resource is repeatedly queried, like page assets, JavaScript, style
sheets, etc. In this case, the thousands of actors querying my forge are
coordinated, somehow: they never (or rarely) query the same resource twice, and only
download the raw HTML of the web pages.
Worse, caching messed up the display of authenticated pages. The snippets above
are not enough to delineate between an authenticated session and an
unauthenticated one, and it broke my forge so badly that i had to disable
caching and enable the next layer early on 2025-11-17, or i just could not
use my forge.
## Rate-Limiting on the Proxy
The next layer of protection simply consisted in enabling a global rate-limit on
the most-hit repositories:
limit_req_zone wholeforge zone=wholeforge:10m rate=3r/s;
server {
    # ...
    location ~ ^(/alopexlemoni/GenderDysphoria.fyi|/oror/necro|/Lymkwi/linux|/vc-archival/youtube-dl-original|/reibooru/reibooru) {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_max_temp_file_size 2048m;
        limit_req zone=wholeforge nodelay;
        proxy_pass http:///;
    }
}
This was achieved with two directives. The first one, limit_req_zone, sits outside
the server {} block and defines a zone called wholeforge that stores 10MB of
state data and limits it to 3 requests per second; since the key is the constant
string wholeforge, every client shares the same counter, which makes it a truly
global limit. The second, limit_req, applies that zone inside the location block
(with no burst configured, any excess request is rejected outright).
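To see the limit from the outside, a quick shell loop against one of the
protected paths is enough; anything past 3 requests per second should come back
with nginx’s limiting status (503 by default, since limit_req_status is only set
later for the bots zone). The hostname here is a placeholder:
for i in $(seq 1 10); do
    curl -s -o /dev/null -w '%{http_code}\n' https://git.example.org/Lymkwi/linux;
done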
When this was in place, however, actually accessing the Linux repository as a
normal user (or any of the often-hit repositories) became a nightmare of waiting
and request timeouts.
## Manually Redirecting to a Garbage Generator
Because caching was (predictably) useless, and rate-limiting was hindering me
as well, i re-enabled the initial setup that was in place before my
experiments: manually redirecting queries to a garbage generator (in this case,
an old version of Iocaine). It’s largely based on my initial setup following
this tutorial in
French.
For the purpose of this part, you do not have to know what Iocaine does precisely.
In the next section, i will present my current and final setup, with an updated
Iocaine that also includes a classifier to decide which queries are bots and
which are regular users. For now, i will present the version where i manually
chose who to return garbage to based on IP addresses.
As a little bonus, it will also include rate-limiting of those garbage-hungry
bots.
i add a file called /etc/nginx/snippets/block_bots.conf which contains:
if ($bot_user_agent) {
rewrite ^ /deflagration$request_uri;
}
if ($bot_ip) {
rewrite ^ /deflagration$request_uri;
}
location /deflagration {
limit_req zone=bots nodelay;
proxy_set_header Host $host;
proxy_pass ;
}
This will force any query categorized as bot_user_agent or bot_ip to be
routed through to a different upstream which serves garbage. That upstream is
also protected by rate-limiting on a zone called bots, which is defined in the
next bit of code. This snippet is meant to be included in your server {}
block using the include directive.
i then add the following in /etc/nginx/conf.d/bots.conf:
map $http_user_agent $bot_user_agent {
default 0;
# from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt
~*amazonbot 1;
~*anthropic-ai 1;
~*applebot 1;
~*applebot-extended 1;
~*brightbot 1;
~*bytespider 1;
~*ccbot 1;
~*chatgpt-user 1;
~*claude-web 1;
~*claudebot 1;
~*cohere-ai 1;
~*cohere-training-data-crawler 1;
~*crawlspace 1;
~*diffbot 1;
~*duckassistbot 1;
~*facebookbot 1;
~*friendlycrawler 1;
~*google-extended 1;
~*googleother 1;
~*googleother-image 1;
~*googleother-video 1;
~*gptbot 1;
~*iaskspider 1;
~*icc-crawler 1;
~*imagesiftbot 1;
~*img2dataset 1;
~*isscyberriskcrawler 1;
~*kangaroo 1;
~*meta-externalagent 1;
~*meta-externalfetcher 1;
~*oai-searchbot 1;
~*omgili 1;
~*omgilibot 1;
~*pangubot 1;
~*perplexitybot 1;
~*petalbot 1;
~*scrapy 1;
~*semrushbot-ocob 1;
~*semrushbot-swa 1;
~*sidetrade 1;
~*timpibot 1;
~*velenpublicwebcrawler 1;
~*webzio-extended 1;
~*youbot 1;
# Add whatever other pattern you want down here
}
geo $bot_ip {
default 0;
# Add your IP ranges here
}
# Rate-limiting setup for bots
limit_req_zone bots zone=bots:30m rate=1r/s;
# Return 429 (Too Many Requests) to slow them down
limit_req_status 429;
That bit of configuration maps the client IP to a variable
called bot_ip, and the client’s user agent to a variable called
bot_user_agent. When a known pattern listed in those blocks is found, the
corresponding variable is flipped to the provided value (here, 1). Otherwise,
it stays 0. Then, we define the rate-limiting zone that is used to slow down
the bots so they don’t feed on slop too fast. The plain geo block with
hand-written CIDR ranges works out of the box; if you would rather match IPs
against a MaxMind-style database (to flag entire ASNs, for example), you will
need the http-geoip2 nginx module (on Debian-based distributions, something like
apt install libnginx-mod-http-geoip2 will do).
Once that is done, add the following line to the server block of every site
you want to protect:
include /etc/nginx/snippets/block_bots.conf;
And when you feel confident enough, roll a nginx -t and reload the unit for
nginx.
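On a systemd-based machine, that amounts to something like:
nginx -t && systemctl reload nginx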
Now, if you’re using caddy or any other reverse proxy, there are probably
similar mechanisms available. You can go and peruse the documentation of Iocaine,
or look online for specific tutorials that, i am sure, other people have made
better than i would.
Immediately after enabling it, and shoving all the IPs from Alibaba Cloud and
Huawei Cloud into the bot config file, the activity slowed down on my server.
Power usage went down to ~180W, CPU usage to roughly 60%, and it stopped making
a hellish noise.
As the stats showed earlier, however, a lot of traffic was still hitting the
server itself. Even weirder, there were still occasional spikes, every 3 hours,
that lasted about an hour and a half, during which the server would whirr and
forgejo would suffocate again.
Bots were still hitting my server, and there was no clear source for it.
## Automatically Classifying Bots and Poisoning Them: Iocaine and Nam-Shub-of-Enki
The steps i showed so far help when a single IP is hammering at your
forge, or when someone is clearly scraping you from an Autonomous System that
you do not mind blocking. Sadly, as i’ve shown above in Table 4, a
surprising amount of scraping comes from broadband addresses. i can assemble
lists of IPs as big as i want, or block entire ASNs, but i would love to have a
per-query way of determining if a query looks legitimate.
The next steps of protection rely on categorizing each query based on its
source IP and the credibility of its user agent. This mechanism is
largely based on the documentation for Iocaine
3.x. We finally
get to talk about Iocaine!
Iocaine is a tool that traps scrapers in a maze of meaningless pages that
endlessly lead to more meaningless pages. The content of these pages is
generated using a Markov chain, based on a corpus of texts given to the
software. Iocaine (specifically all versions after 3 at least5) is a middleware, in
the sense that it works by being placed on the line between your reverse proxy
and the service. Your reverse proxy will first begin by redirecting traffic to
Iocaine, and, if Iocaine deems a query legitimate, it will return a 421 Misdirected Request back at your reverse-proxy. The
latter must then catch it, and use the real upstream as a fallback. If
Iocaine’s Nam-Shub-of-Enki6 decides a query came from a bogus or otherwise undesirable source, it
will happily reply 200 OK and send generated garbage.
My setup lodges Iocaine 3 between nginx and my forge, following the Iocaine
documentation to use the container
version.
i recommend you follow it, and then add the following little things to enable
categorization statistics, and prevent the logging they’re based on from
blowing up your storage:
- In etc/config.d/03-nam-shub-of-enki.kdl, change the logging block to:
logging {
enable #true
classification {
enable #true
}
}
- In docker-compose.yaml, add the following bits to limit classification logging to 50MB:
services:
  iocaine:
    # The things you already have here...
    # ...
    environment:
      - RUST_LOG=iocaine=info
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
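To confirm the classification logs are actually flowing (and staying under that
50MB cap), peeking at the container output is enough; this assumes the service
is named iocaine as above:
docker compose logs --tail 50 iocaine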
My checks block in Nam-Shub-of-Enki is as follows:
checks {
disable cgi-bin-trap
asn {
database-path "/data/ipinfo_lite.mmdb"
asns "45102" "136907"
}
ai-robots-txt {
path "/data/ai.robots.txt-robots.json"
}
generated-urls {
identifiers "deflagration"
}
big-tech {
enable #true
}
commercial_scrapers {
enable #true
}
}
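The ai-robots-txt check expects a JSON copy of the ai.robots.txt list at the
path given above. Assuming the project still publishes a robots.json at the
root of its repository, fetching it looks something like this (adjust the
destination to wherever your container’s /data is mounted):
curl -fL -o ./data/ai.robots.txt-robots.json \
    https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.json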
i snatched a copy of the latest ipinfo ASN database for
free and blocked AS45102 (Alibaba) and AS136907 (Huawei Clouds).
On 2025-11-18 at 00:00:29 UTC+1, i enabled Iocaine with the Nam-Shub-of-Enki
classifier in front of my whole forge. Immediately, my server was no longer
hammered. Power draw went down to just above 160W.
One problem i noticed however, while trying to deploy the artifact for this
blog post on my forge, is that Iocaine causes issues when huge PUT/PATCH/POST
requests with large bodies are piped through it: it will hang up before the
objects are entirely written. i therefore needed a way of only redirecting
HEAD and GET requests to Iocaine in nginx, like what is done in the Caddy example
of the Iocaine documentation.
What i ended up settling on requires a bit of variable mapping. At the start of
your site configuration, before the server {} block:
map $request_method $upstream_location {
GET ;
HEAD ;
default ;
}
map $request_method $upstream_log {
GET bot_access;
HEAD bot_access;
default access;
}
Then, in the block that does the default location, write:
location / {
proxy_cache off;
access_log /var/log/nginx/$upstream_log.log combined;
proxy_intercept_errors on;
error_page 421 = @fallback;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_pass http://$upstream_location;
}
That is, replace the upstream in proxy_pass with the upstream decided by the
variable mapping, and, while we’re at it, use $upstream_log to know which log
will be the final one for that request. i differentiate between bot_access.log
and access.log to gather my statistics, so the difference matters to me. Change
the variables to suit the way you do it (or remove them, if you don’t distinguish
clients in your log files).
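If you want to sanity-check the method split, one way is to fire one GET and one
POST at the forge and look at which log each request lands in (the URL is a
placeholder; the POST will most likely just get rejected by Forgejo, which is
fine for this test):
curl -s -o /dev/null https://git.example.org/
curl -s -o /dev/null -X POST https://git.example.org/
tail -n 1 /var/log/nginx/bot_access.log /var/log/nginx/access.log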
# Monitoring Iocaine
Currently, on 2025-11-30 at 16:33:00 UTC+1, Iocaine has served 38.16GB of garbage.
Over the past hour, 152.11MB of such data was thrown back at undesirable visitors.
3.39GB over the past day, 22.22GB over the past week. You can get the snippet
that describes my Iocaine-specific Grafana views here.
The vast majority of undesirable queries come from Claude, OpenAI, and
Disguised Bots. Claude and OpenAI are absolutely gluttonous, and, once they
have access to a ton of pages, they will greedily flock to fetch them like
pigeons being fed breadcrumbs laced with strychnine.
AI bot scrapers (ai.robots.txt) maintain a constant 920~930 queries per minute
(15-ish QPS) over the 6 domains i have protected with Iocaine, including the
forge.
There is also a low hum of a mix of commercial scrapers (~1 request every two
seconds), big tech crawlers (Facebook, Google, etc., about 2 QPS or ~110 queries/min),
and, especially, fake browsers.
Classifying fake browsers is where Iocaine really shines, specifically thanks
to the classifiers implemented via Nam-Shub-of-Enki. The faked bots classifier
detects the likelihood that the user agent reported by the client is bullshit,
generated from a list of technologies mashed together. For example, if your
client reports a user agent for a set of software that never supported HTTP2,
or never actually existed together, or is not even released yet, it will get
flagged. Think, for example, Windows NT 4 running Chrome, pretending to be
able to do TLS1.3.
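If you want to poke at that classifier yourself, you can send a request with a
user agent that never made sense and see whether you get garbage back; the UA
string and URL below are made up for the example, and i make no promise that
this exact combination trips the classifier:
curl -s -A "Mozilla/5.0 (Windows NT 4.0) AppleWebKit/537.36 Chrome/120.0" https://git.example.org/ | head -n 20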
The background-noise level of such queries is usually 140~160 queries per minute
(or 2~3 QPS). However, notice those spikes in the graph above?
## The Salvos of Queries
For a while during my experiments, i noticed those pillars of queries. My
general nginx statistics would show a sharp increase in connections, with an
initial ramp-up and a stable-ish plateau lasting about an hour and a half,
before suddenly stopping. It would then repeat, roughly three hours later.
Between October 29th and November 19th, and on November 28th, these spikes would
constantly show up. As soon as i got Iocaine statistics running, it would flag
all of those queries as faked browsers.
i investigated those spikes in particular, because they baffled and scared me:
the regularity with which they probed me, and the sharpness of the ramp-up and
halts, made me afraid that someone, somewhere, was organizing thousands of IPs
to specifically take turns at probing websites. i have not reached any solid
conclusions, beyond the following:
- The initial phase of an attack wave begins with a clear exponential ramp-up
- The ramp-up stops when the server either starts throwing errors, or the response latency reaches a given threshold
- Every wave of attack lasts roughly an hour and a half
- An individual IP will often contribute no more than one query, but it can reach 50 to 60 queries per IP
- The same 15 or so ASNs keep showing up, with five regular leaders in IP count:
  - AS212238: Datacamp Limited
  - AS3257: GTT Communications
  - AS9009: M247 Europe SRL
  - AS203020: HostRoyale Technologies Pvt Ltd
  - AS210906: UAB “Bite Lietuva” (a Lithuanian ISP)
All of those are service providers. My working theory at the moment is that
someone registered thousands of cheap servers with many different companies, and
is selling access to them as web proxies for scraping and scanning. i will
probably write something up later when i have properly investigated that
specific phenomenon.
# Conclusion
Self-hosting anything that is deemed “content” openly on the web in 2025 is
a battle of attrition between you and forces who are able to buy tens of
thousands of proxies to ruin your service for data they can resell.
This is depressing. Profoundly depressing. i look at the statistics board for
my reverse-proxy and i never see less than 96.7% of requests classified as bots
at any given moment. The web is filled with crap, bots that pretend to be real
people to flood you. All of that because i want to have my little corner of the
internet where i put my silly little code for other people to see.
i have to learn to protect myself from industrial actors in order to put anything
online, because anything a person makes is valuable, and that value will be
sucked dry by every tech giant to be emulsified, liquified, strained, and
ultimately inexorably joined in an unholy mesh of learning weights.
This experience has rather profoundly radicalized the way i think about
technology. Sanitized content can be chewed on and shat out by companies for
training, but their AI tools will never swear. They will never use a slur. They
will never have a revolutionary thought. Despite being amalgamations of shit
rolled up in the guts of the dying capitalist society, they are sanitized to
hell and beyond.
The developer of Iocaine put it best
when explaining why Iocaine has absolutely unhinged identifiers
(such as SexDungeon, PipeBomb, etc): they will all trigger “safeguard”
mechanisms in commercial AI tools. Absolutely no coding agent will accept
analyzing and explaining code where the memory allocator’s free function is
called liberate_palestine. i bet that if i described, in graphic detail, in
the comments of this page, the different ways being a furry intersects with my
sexuality, no commercial scraper would even dare ingest this page.
Fuck tech companies. Fuck “AI”. Fuck the corporate web.
