    Guarding My Git Forge Against AI Scrapers

    In August 2024, one of my roommates and partners messaged the apartment group
    chat, saying she noticed the internet was slow again at our place, and my
    forgejo was unable to render any page in under 15 seconds.

    i investigated, thinking it would be a trivial little problem to solve. Soon
    enough, however, i would uncover hundreds of thousands of queries a day from
    thousands of individual IPs, fetching seemingly-random pages in my forge
    every single day, all the time.

    This post summarizes the practical issues that arose as a result of the onslaught
    of scrapers eager to download millions of commits off of my forge, and the
    measures i put in place to limit the damage.

    # Why the forge?

    In the year 2025, on the web, everything is worth being scraped. Everything
    that came out of the mind of a human is liable to be snatched under the
    vastest labor theft scheme in the history of mankind. This very article, the
    second it gets published on any indexable page, will be added to countless
    datasets meant to train foundational large-language models. My words, your
    words, have contributed infinitesimal shifts to the neural-network weights
    underpinning the largest, most grotesque accumulation of wealth seen over the
    lifetime of my parents, grandparents, and their grandparents themselves.

    Oh, and forges have a lot of commits. See, if you have a publicly exposed
    repository, every file in every folder of every commit gets a page, and they
    are all linked together. Add other views, such as the git blame of a file, and
    multiply that by the number of files and commits. Add the raw download link,
    also multiplied by the number of commits.

    Say, hypothetically, you have a linux repository available, with only
    the commits in the master branch up to the v6.17 tag from 2025-09-18.
    That’s 1,383,738 commits in the range 1da177e4c3f4..e5f0a698b34e. How many
    files is that? Well:

    count=0;
    while read -r rev; do
        point=$(git ls-tree -tr $rev | wc -l);
        count=$(( $count + $point ));
        printf "[%s] %s: %d (tot: %d)n" $(git log -1 --pretty=tformat:%cs $rev) $rev $point $count;
    done < <(git rev-list "1da177e4c3f4..e5f0a698b34e");
    printf "Total: $countn";
    

    i ran this on the 100 commits before v6.17. With git ls-tree -tr $rev, both
    files and directories are counted; replacing it with git ls-tree -r $rev counts
    only files. i got 72,024,729 files, and 76,798,658 files and directories.
    Running on the whole history of Linux’s master branch yields
    78,483,866,182 files, and 83,627,462,277 files and directories.

    Now, for a ballpark estimate of the number of pages that can be scraped if you
    have a copy of Linux, apply the following formula, where Nfiles and
    Nfilesandfolders are the totals just computed, summed over every commit:

    Nfiles * 2 + Nfilesandfolders * 2 + Ncommits * 3
    

    That is, applied to my hypothetical Linux repository:

    78483866182 * 2 + 83627462277 * 2 + 1383738 * 3 = 324,226,808,132 pages
    

    The first *2 accounts for the fact that every file of every commit can be
    scraped raw, and git-blame‘d. The second part of the formula considers every
    single file or folder page; its own *2 accounts for the fact that every file of
    every commit can (in theory) also be diffed against another of its versions.
    The final Ncommits * 3 accounts for every commit summary page and its
    associated views.
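
    If you want to redo the arithmetic, or plug in the numbers for one of your own
    repositories, the whole estimate fits in a one-liner:

    echo "78483866182 * 2 + 83627462277 * 2 + 1383738 * 3" | bc
    # 324226808132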

    That gives, for me, 324 billion 226 million 808 thousand and 132 pages that can
    be scraped. From a single repository. Assume that every scraper agent that
    enters one of these repositories will also take note of every other link on the
    page, and report it so that other agents can scrape them. These scrapers
    effectively act like early 2000s web spiders that crawled the internet to index
    it, except they do not care about robots.txt, and they will absolutely keep
    scraping new links again and again with no strategy to minimize the cost on
    you, as a host.

    # The Cost of Scraping

    As i am writing the original draft of this section, the longer-term measures i
    put in place have been removed, so i could gather up-to-date numbers to display
    how bad the situation is.

    i pay for the electricity that powers my git forge. Okay, actually, one of my
    roommates does, but we put it on the calc sheet where we keep track of who pays
    what (when we remember).

    At the time i began fighting scrapers, my git forge ran on an old desktop
    computer plugged in my living room. Now, it runs in a virtual machine on our
    home’s rackable server. i never got to measure differences in power consumption
    between being scraped and not being scraped on the desktop machine, but i did
    on the rackable server. If memory serves me right, stopping the wave of scrapers
    reduced the power draw of the server from ~170W to ~150W.

    Right now, with all the hard drives in that server spinning, and every protection
    off, we are drawing 200W from the power grid on that server. Constantly. By the
    end of this experiment, my roommates and i will have computed that the difference
    in power usage caused by scraping costs us ~60 euros a year.
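
    For a rough sanity check of that figure: assuming, purely for illustration, a
    ~35W difference in draw and an electricity price around 0.20 euros per kWh
    (both numbers are guesses, not measurements), the yearly cost lands in the same
    ballpark:

    # 35 W, 24 h/day, 365 days, converted to kWh, at 0.20 euros/kWh
    echo "35 * 24 * 365 / 1000 * 0.20" | bc -l
    # ~61 euros a year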

    Another related cost is that the VM that runs the forge is figuratively suffocating
    under the amount of queries. Not all queries are born equal either: requests to
    see the blame of a file or a diff between commits incur a worse cost than
    just rendering the front page of a repository. The last huge wave of scraping
    left my VM at 99+% usage of 4 CPU cores and 2.5GiB of RAM,
    whereas the usual levels i observe are closer to 4% usage of CPUs, and an oscillation
    between 1.5GiB and 2GiB of RAM.

    As i’m writing this, the VM running forgejo eats 100% of 8 CPU cores.

    Additionally, the networking cost is palpable. Various monitoring tools let me see
    the real-time traffic statistics in our apartment. Before i put the first
    measures in place to thwart scraping, we could visibly see the traffic coming out
    of the desktop computer running my forge and out to the internet. My roommates’
    complaints that it slowed down the whole internet here were in fact well founded: when
    we had multiple people watching live streams or doing pretty big downloads, they
    were throttled by the traffic out of the forge.1

    The egress data rate of my forge’s VM is at least 4MBps of data (32Mbps). Constantly.
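
    If you want to eyeball that rate yourself, the interface counters in sysfs are
    enough; a quick sketch, assuming the VM’s interface is called eth0 (adjust to
    your own setup):

    iface=eth0
    t1=$(cat /sys/class/net/$iface/statistics/tx_bytes)
    sleep 10
    t2=$(cat /sys/class/net/$iface/statistics/tx_bytes)
    echo "$(( (t2 - t1) / 10 )) bytes/s of egress"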

    Finally, the human cost: i have spent entire days behind my terminals
    trying to figure out 1) what the fuck was going on and 2) what the fuck to do
    about it. i have had conversations with other people who self-host their
    infrastructure, desperately trying to figure out workable solutions that would
    not needlessly impact our users. And the funniest detail is: that rackable
    server is in the living room, directly in front of my bedroom door. It usually
    purrs like an adorable cat, but, lately, it’s been whirring louder and louder.
    i can hear it. when i’m trying to sleep.

    # Let’s do some statistics.

    i was curious to analyze the nginx logs to understand where the traffic came
    from and what shape it took.

    As a case study, we can work on /var/log/nginx/git.vulpinecitrus.info/ from
    2025-11-14 to 2025-11-19. Note that on 2025-11-15 at 18:27 UTC, i
    stopped the redirection of new agents into the Iocaine crawler maze (see
    below). At 19:15 UTC, i removed the nginx request limit zone from the
    /Lymkwi/linux/ path. At 19:16 UTC i removed the separation of log files
    between IPs flagged as bots, and IPs not flagged as bots.

    The three measures i progressively put in place later were: web caching
    (2025-11-17), manually sending IPs to a garbage generator with a rate-limit
    (Iocaine 2) (2025-11-14, 15 and 18), and then Iocaine 3 (2025-11-19).

    | Common Logs | Successful | Delayed (429) | Error (5XX) | Measures in place |
    |---|---|---|---|---|
    | 2025-11-14 | 275323 | 66517 | 0 | Iocaine 2.1 + Rate-limiting |
    | 2025-11-15 | 71712 | 54259 | 9802 | Iocaine 2.1 + Rate-limiting |
    | 2025-11-16 | 140713 | 0 | 65763 | None |
    | 2025-11-17 | 514309 | 25986 | 3012 | Caching, eventually rate-limiting2 |
    | 2025-11-18 | 335266 | 20280 | 1 | Iocaine 2.1 + Rate-limiting |
    | 2025-11-19 | 3183 | 0 | 0 | Iocaine 3 |

    | Bot Logs | Successful | Delayed (429) | Error (5XX) | Measures in place |
    |---|---|---|---|---|
    | 2025-11-14 (bots) | 41388 | 65517 | 0 | Iocaine 2.1 + Rate-limiting |
    | 2025-11-15 (bots) | 34190 | 53403 | 63 | Iocaine 2.1 + Rate-limiting |
    | 2025-11-16 (bots) | – | – | – | (no bot-specific logs) |
    | 2025-11-17 (bots) | – | – | – | (no bot-specific logs) |
    | 2025-11-18 (bots) | 390013 | 0 | 13 | Iocaine 2.1 + Rate-limiting |
    | 2025-11-19 (bots) | 731593 | 0 | 0 | Iocaine 3 |

    Table 1: Number of Queries Per Day

    (Commands used to generate Table 1)

    Assuming your log file is git-access-2025-11-14.log.gz:

    zcat git-access-2025-11-14.log.gz | grep '" 200 ' | wc -l
    zcat git-access-2025-11-14.log.gz | grep '" 429 ' | wc -l
    

    Without spoiling too much, caching was an utter failure, and manually
    rate-limiting a set of IPs (from Huawei Cloud and Alibaba) on the Linux
    repository only helped so much. When all protections dropped, my server
    became so unresponsive that backend errors (usually timeouts) spiked.
    Errors also happened with caching, when nginx encountered issues while
    buffering replies. If anything, caching encouraged more queries overall.

    Once Iocaine was deployed, the vast majority of queries were routed away from
    the backend, with no errors reported, and no delaying because all of the IPs
    i manually rate-limited were caught by Iocaine instead.

    Out of all these queries, 117.64.70.34 is the most common source of requests,
    with 226023 total queries originating from the ChinaNet-Backbone ASN (AS4134).
    It is followed by 136.243.228.193 (13849 queries), an IP from Hetzner whose
    hostname ironically resolves to crawling-gateway-136-243-228-193.dataforseo.com.
    Then comes 172.17.0.3, the uptime prober of VC Status, with 6908 queries, and
    74.7.227.127, an IP from Microsoft’s AS 8075 (6117 queries).
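
    (Command used to find the top source IPs for a given day, assuming the log
    file git-access-2025-11-14.log.gz:)

    zcat git-access-2025-11-14.log.gz | awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n 5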

    | Day | Unique IP Count |
    |---|---|
    | 2025-11-14 | 16461 |
    | 2025-11-15 | 18639 |
    | 2025-11-16 | 41712 |
    | 2025-11-17 | 47252 |
    | 2025-11-18 | 22480 |
    | 2025-11-19 | 14230 |

    Table 2: Grand Total of Unique IPs Querying the Forge

    (Commands used to generate Table 2)

    Assuming your log files are called *git-access-2025-11-14.log.gz:

    zcat *git-access-2025-11-14.log.gz | awk '{ print $1 }' | sort | uniq -c | wc -l
    

    On the two days where restrictions were lifted or there was only caching, the
    number of unique IPs querying the forge doubled. The more you facilitate the
    work of these crawlers, the more they are going to pound you. They will always
    try and get more out of your server than you are capable of providing.

    | Day | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
    |---|---|---|---|---|---|
    | 2025-11-14 | (226089) – /reibooru/reibooru | (40189) – /Lymkwi/linux | (1454) – / | (1405) – /rail | (1174) – /Soblow/indi-hugo |
    | 2025-11-15 | (35163) – /Lymkwi/linux | (18952) – /vc-archival/youtube-dl | (4197) – /vc-archival/youtube-dl-original | (1655) – /reibooru/reibooru | (1635) – /Lymkwi/gr-gsm |
    | 2025-11-14 (bots) | (40189) – /Lymkwi/linux | (270) – /oror/necro | (79) – /Lymkwi/[REDACTED]3 | (55) – /vc-archival/youtube-dl | (52) – /oror/asm |
    | 2025-11-15 (bots) | (32895) – /Lymkwi/linux | (260) – /oror/necro | (193) – /Lymkwi/gr-gsm | (95) – /Lymkwi/[REDACTED]3 | (48) – /alopexlemoni/GenderDysphoria.fyi |
    | 2025-11-16 | (72687) – /vc-archival/youtube-dl | (23028) – /Lymkwi/linux | (16779) – /vc-archival/youtube-dl-original | (5390) – /reibooru/reibooru | (3585) – /Lymkwi/gr-gsm |
    | 2025-11-17 | (361632) – /vc-archival/youtube-dl | (74048) – /vc-archival/youtube-dl-original | (18136) – /reibooru/reibooru | (13147) – /oror/necro | (12921) – /alopexlemoni/GenderDysphoria.fyi |
    | 2025-11-18 | (227019) – /vc-archival/youtube-dl | (46004) – /vc-archival/youtube-dl-original | (12644) – /alopexlemoni/GenderDysphoria.fyi | (12624) – /reibooru/reibooru | (7712) – /oror/necro |
    | 2025-11-18 (bots) | (261346) – /vc-archival/youtube-dl | (43923) – /vc-archival/youtube-dl-original | (20195) – /alopexlemoni/GenderDysphoria.fyi | (18808) – /reibooru/reibooru | (10134) – /oror/necro |
    | 2025-11-19 | (1418) – / | (1248) – /rail | (356) – /Soblow | (31) – /assets/img | (25) – /Soblow/IndigoDen |
    | 2025-11-19 (bots) | (448626) – /vc-archival/youtube-dl | (73164) – /vc-archival/youtube-dl-original | (39107) – /reibooru/reibooru | (37107) – /alopexlemoni/GenderDysphoria.fyi | (25921) – /vc-archival/YSLua |

    Table 3: Top 5 Successful Repo/Account/Page Hits Per Day

    (Commands used to generate Table 3)

    Assuming you want data for the log file called git-access-2025-11-14.log.gz:

    zcat git-access-2025-11-14.log.gz | grep '" 200 ' | awk '{ print $7 }' \
        | cut -d/ -f -3 | sort | uniq -c | sort -n \
        | tail -n 5 | tac
    

    Big repositories with a lot of commits and a lot of files are a bountiful resource
    for the crawlers. Once they enter those, they will take ages to leave, if only because of the
    sheer number of pages that can be generated by following the links of a repository.

    Most legitimate traffic seems to be either fetching profiles (a couple of my users
    have their profiles listed in their fediverse bios) or the root page of my forge.

    |  | 2025-11-14 (all) | 2025-11-15 (all) | 2025-11-16 (all) |
    |---|---|---|---|
    | Top 1 | (8532) – AS136907 (Huawei Clouds) | (8537) – AS136907 (Huawei Clouds) | (8535) – AS136907 (Huawei Clouds) |
    | Top 2 | (2142) – AS45899 (VNPT Corp) | (2107) – AS45899 (VNPT Corp) | (4002) – AS212238 (Datacamp Limited) |
    | Top 3 | (803) – AS153671 (Liasail Global Hongkong Limited) | (895) – AS153671 (Liasail Global Hongkong Limited) | (3504) – AS9009 (M247 Europe SRL) |
    | Top 4 | (555) – AS5065 (Bunny Communications) | (765) – AS45102 (Alibaba US Technology Co., Ltd.) | (3206) – AS3257 (GTT Communications) |
    | Top 5 | (390) – AS21859 (Zenlayer Inc) | (629) – AS5065 (Bunny Communications) | (2874) – AS45899 (VNPT Corp) |

    Table 4: Top ASN Per Day For The First Three Days, Per Unique IP Count

    (Commands used to generate Table 4)

    For this, i needed a database of IP-to-ASN data. i got one from
    IPInfo by registering for a free account and using their
    web API. i first scripted a mapping of unique IP addresses to AS number. For
    example, for the log file bot-git-access-2025-11-18.log.gz:

    while read -r ip; do
        ASN=$(curl -qfL "api.ipinfo.io/lite/$ip?token=" | jq -r .asn);
        printf "%s %s\n" "$ip" "$ASN" | tee -a 2025-11-18-bot.ips.txt;
    done < <(zcat bot-git-access-2025-11-18.log.gz | awk '{ print $1 }' | sort | uniq)
    

    Then, with this map, i run:

    cat 2025-11-18-bot.ips.txt | cut -d' ' -f 2 | sort | uniq -c | sort -n | tail -n 5
    

    So my largest hits are from Huawei Clouds (a VPS provider), VNPT (a Vietnamese mobile and home ISP),
    Liasail Global HK Limited (a VPS/”AI-powering service” provider),
    Bunny Communications LLC (a broadband ISP for residential users), and Zenlayer (a CDN/cloud infrastructure provider).
    When i lifted all protections, Datacamp Limited (a VPS provider), GTT Communications (some sort of bullshit-looking ISP4 who, i have been informed, is in fact a backbone operator), and M247 Europe SRL (a hosting provider) suddenly appeared. If memory serves me right, Datacamp, GTT and M247 were also
    companies i had flagged during my initial investigation in summer 2024, and added to the manually blocked/limited IPs alongside
    all of Huawei Cloud and Alibaba.

    Interestingly, both Liasail and Zenlayer mention that they “Power AI” on their front page. They sure do.
    Worryingly, VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that
    their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once
    you remove the most obviously malicious actors.

    # The Protection Measures

    i have one goal, and one constraint. My goal is to protect the forge as much
    as possible, by means of either blocking bots or offloading the cost to my VPS
    provider (whose electricity i do not pay for). My only constraint: i was not
    going to deploy a proof-of-work-based captcha system such as Anubis.
    There are two reasons for this constraint:

    1. i personally find that forcing your visitors to expend more computational
      power to prove they’re not a scraper is bad praxis. There are devices out
      there that legitimately want that access, but have limited computational
      power or features. And, yeah, there are multiple types of challenges, some
      of which take into account low-power devices or even those that cannot run
      JavaScript, but,
    2. Scrapers can easily bypass Anubis. It’s not a design flaw. Anubis is harm
      reduction, not pixie dust.

    i tried layers of solutions:

    • caching on the reverse proxy
    • Iocaine 2 with no classifiers, which generates garbage in reply to any query
      you send it
    • Manually redirecting IPs and rate-limiting them
    • Deploying Iocaine 3, with its classifiers (Nam-Shub-of-Enki)

    ## Reverse-Proxy Caching

    i have a confession to make: i never realized that nginx did not cache anything
    by default. That realization promptly came with the other realization that
    caching things correctly is hard. i may, some day, write about my experience
    of protecting a service that posted links to itself on the fediverse, so that
    it wouldn’t slow to a crawl for ten minutes after every post.

    As for the rest of these, i will be showing my solution in nginx. You can,
    almost certainly, figure out a way of doing exactly the same thing with any other
    decent reverse proxy software.

    To create a cache for my forge, i add the following line to /etc/nginx/nginx.conf:

    proxy_cache_path /var/cache/nginx/forge/ levels=1:2 keys_zone=forgecache:100m;
    

    That will create a 2-level cache called forgecache, located at
    /var/cache/nginx/forge, with 100MB of shared memory reserved for the cache
    keys. i create the directory and make www-data its owner and group.
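
    Concretely, that last step is something like:

    mkdir -p /var/cache/nginx/forge
    chown www-data:www-data /var/cache/nginx/forge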

    In /etc/nginx/sites-enabled/vcinfo-git.conf, where my git forge’s site
    configuration sits, i have a location block that serves the whole root of the
    service, which i modify thusly:

    location / {
        proxy_cache forgecache;
        proxy_buffering on;
        proxy_cache_valid any 1h;
        add_header X-Cached $upstream_cache_status;
        expires 1h;
        proxy_ignore_headers "Set-Cookie";
        proxy_hide_header "Set-Cookie";
    
        # more stuff...
    }
    

    That configuration does several things: it turns on caching and buffering at
    the proxy (proxy_buffering), telling it to use forgecache (proxy_cache) and
    keep any page valid for an hour (proxy_cache_valid). It also adds a response
    header, X-Cached, that will let you debug whether or not a query hit or
    missed the cache (add_header). The expires directive adds headers telling
    your visitor’s browser that the content they cache will also expire in an hour
    (expires). Finally, the cache ignores any response header that sets a cookie
    (proxy_ignore_headers, proxy_hide_header), to attempt to avoid caching any
    page that could be customized for a user once they log in.
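
    To check whether a given page actually hit the cache, you can look at that
    debug header; for instance, assuming the forge answers at git.example.org
    (substitute your own hostname):

    curl -sI https://git.example.org/ | grep -i x-cached
    # X-Cached: MISS on the first request, HIT once the cached copy is served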

    The result? Caching was a disaster, predictably so. Caching works when the
    same resource is repeatedly queried, like with page assets, JavaScript, style
    sheets, etc. In this case, the thousands of actors querying my forge are
    coordinated, somehow, never (or rarely) query the same resource twice, and only
    download the raw HTML of the web pages.

    Worse, caching messed up the display of authenticated pages. The snippets above
    are not enough to distinguish between an authenticated session and an
    unauthenticated one, and it broke my forge so badly that i had to disable
    caching and enable the next layer early on 2025-11-17, or i just could not
    use my forge.

    ## Rate-Limiting on the Proxy

    The next layer of protection simply consisted in enabling a global rate-limit on
    the most-hit repositories:

    limit_req_zone wholeforge zone=wholeforge:10m rate=3r/s;

    server {
        # ...
        location ~ ^(/alopexlemoni/GenderDysphoria\.fyi|/oror/necro|/Lymkwi/linux|/vc-archival/youtube-dl-original|/reibooru/reibooru) {
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_max_temp_file_size 2048m;

            limit_req zone=wholeforge nodelay;

            proxy_pass http:///;
        }
    }
    

    This was achieved with two directives. The first one, limit_req_zone, sits
    outside the server {} block and defines a zone called wholeforge that stores
    10MB of state data and limits to 3 requests per second; since the key is a
    constant, that limit is shared by everyone hitting those paths. The second
    one, limit_req, applies the zone inside the location block matching the
    most-hit repositories.

    When this was in place, however, actually accessing the Linux repository as a
    normal user (or any of the often-hit repositories) became a nightmare of waiting
    and request timeouts.
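
    For comparison, a per-client variant (not what i used here) would key the zone
    on the client address instead of a constant, so each visitor gets their own
    3 requests per second rather than sharing a single global bucket:

    limit_req_zone $binary_remote_addr zone=perclient:10m rate=3r/s;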

    ## Manually Redirecting to a Garbage Generator

    Because caching was (predictably) useless, and rate-limiting was hindering me
    as well, i re-enabled the initial setup that was in place before my
    experiments: manually redirecting queries to a garbage generator (in this case,
    an old version of Iocaine). It’s largely based on my initial setup following
    this tutorial in French.

    For the purpose of this part, you do not have to know what Iocaine does precisely.
    In the next section, i will present my current and final setup, with an updated
    Iocaine that also includes a classifier to decide which queries are bots and
    which are regular users. For now, i will present the version where i manually
    chose who to return garbage to based on IP addresses.

    As a little bonus, it will also include rate-limiting of those garbage-hungry
    bots.

    i add a file called /etc/nginx/snippets/block_bots.conf which contains:

    if ($bot_user_agent) {
        rewrite ^ /deflagration$request_uri;
    }
    if ($bot_ip) {
        rewrite ^ /deflagration$request_uri;
    }
    location /deflagration {
        limit_req zone=bots nodelay;
        proxy_set_header Host $host;
        proxy_pass ;
    }
    

    This will force any query categorized as bot_user_agent or bot_ip to be
    routed through to a different upstream which serves garbage. That upstream is
    also protected by rate-limiting on a zone called bots, which is defined in the
    next bit of code. This snippet is actually meant to be included in your server {}
    block using the include directive.

    i then add the following in /etc/nginx/conf.d/bots.conf:

    map $http_user_agent $bot_user_agent {
        default 0;
    
        # from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt
        ~*amazonbot 1;
        ~*anthropic-ai  1;
        ~*applebot  1;
        ~*applebot-extended 1;
        ~*brightbot 1;
        ~*bytespider  1;
        ~*ccbot 1;
        ~*chatgpt-user  1;
        ~*claude-web  1;
        ~*claudebot 1;
        ~*cohere-ai 1;
        ~*cohere-training-data-crawler  1;
        ~*crawlspace  1;
        ~*diffbot 1;
        ~*duckassistbot 1;
        ~*facebookbot 1;
        ~*friendlycrawler 1;
        ~*google-extended 1;
        ~*googleother 1;
        ~*googleother-image 1;
        ~*googleother-video 1;
        ~*gptbot  1;
        ~*iaskspider  1;
        ~*icc-crawler 1;
        ~*imagesiftbot  1;
        ~*img2dataset 1;
        ~*isscyberriskcrawler 1;
        ~*kangaroo  1;
        ~*meta-externalagent  1;
        ~*meta-externalfetcher  1;
        ~*oai-searchbot 1;
        ~*omgili  1;
        ~*omgilibot 1;
        ~*pangubot  1;
        ~*perplexitybot 1;
        ~*petalbot  1;
        ~*scrapy  1;
        ~*semrushbot-ocob 1;
        ~*semrushbot-swa  1;
        ~*sidetrade 1;
        ~*timpibot  1;
        ~*velenpublicwebcrawler 1;
        ~*webzio-extended 1;
        ~*youbot  1;
    
        # Add whatever other pattern you want down here
    }
    
    geo $bot_ip {
        default 0;
    
        # Add your IP ranges here
    }
    
    # Rate-limiting setup for bots
    limit_req_zone bots zone=bots:30m rate=1r/s;
    
    # Return 429 (Too Many Requests) to slow them down
    limit_req_status 429;
    

    That bit of configuration maps the client’s user agent to a variable called
    bot_user_agent, and the client IP to a variable called bot_ip. When a known
    pattern listed in those blocks is found, the corresponding variable is flipped
    to the provided value (here, 1). Otherwise, it stays 0. Then, we define the
    rate-limiting zone that is used to slow down the bots so they don’t feed on
    slop too fast. You will then need to install the http-geoip2 nginx module (on
    Debian-based distributions, something like apt install libnginx-mod-http-geoip2
    will do).
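
    For illustration, entries in that geo block just map a CIDR range to 1; the
    ranges below are documentation addresses used purely as placeholders, to be
    replaced with the ranges you actually want to flag:

    geo $bot_ip {
        default 0;

        # placeholder ranges; substitute e.g. the cloud provider ranges you block
        198.51.100.0/24 1;
        203.0.113.0/24  1;
    }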

    Once that is done, add the following line to the server block of every site
    you want to protect:

    include /etc/nginx/snippets/block_bots.conf;
    

    And when you feel confident enough, roll a nginx -t and reload the unit for
    nginx.
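
    For instance:

    nginx -t && systemctl reload nginx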

    Now, if you’re using caddy or any other reverse proxy, there are probably
    similar mechanisms available. You can go and peruse the documentation of Iocaine,
    or look online for specific tutorials that, i am sure, other people have made
    better than i would.

    Immediately after enabling it, and shoving all the IPs from Alibaba Cloud and
    Huawei Cloud in the bot config file, the activity slowed down on my server.
    Power usage went down to ~180W, CPU usage to roughly 60%, and it stopped making
    a hellish noise.

    As the stats showed earlier, however, a lot of traffic was still hitting the
    server itself. Even weirder, there were still occasional spikes, every 3 hours,
    that lasted about an hour and a half, during which the server would whirr and
    forgejo would suffocate again.

    Bots were still hitting my server, and there was no clear source for it.

    ## Automatically Classifying Bots and Poisoning Them: Iocaine and Nam-Shub-of-Enki

    So far, the steps i have shown help when a single IP is hammering at your
    forge, or when someone is clearly scraping you from an Autonomous System that
    you do not mind blocking. Sadly, as i’ve shown above in Table 4, a
    surprising amount of scraping comes from broadband addresses. i can assemble
    lists of IPs as big as i want, or block entire ASNs, but i would love to have a
    per-query way of determining if a query looks legitimate.

    The next steps of protection rely on categorizing a source IP based on the
    credibility of its user agent. This mechanism is largely based on the
    documentation for Iocaine 3.x. We finally get to talk about Iocaine!

    Iocaine is a tool that traps scrapers in a maze of meaningless pages that
    endlessly lead to more meaningless pages. The content of these pages is
    generated using a Markov chain, based on a corpus of texts given to the
    software. Iocaine (specifically all versions after 3 at least5) is a middleware, in
    the sense that it works by being placed on the line between your reverse proxy
    and the service. Your reverse proxy will first begin by redirecting traffic to
    Iocaine, and, if Iocaine deems a query legitimate, it will return a 421 Misdirected Request back at your reverse-proxy. The
    latter must then catch it, and use the real upstream as a fallback. If
    Iocaine’s Nam-Shub-of-Enki6 decides a query came from a bogus or otherwise undesirable source, it
    will happily reply 200 OK and send generated garbage.

    My setup lodges Iocaine 3 between nginx and my forge, following the Iocaine
    documentation to use the container version. i recommend you follow it, and
    then add the next little things to enable categorization statistics, and
    prevent the logging they’re based on from blowing up your storage:

    1. In etc/config.d/03-nam-shub-of-enki.kdl, change the logging block to:
    logging {
        enable #true
        classification {
            enable #true
        }
    }
    
    2. In docker-compose.yaml, add the following bits to limit classification
      logging to 50MB:
    services:
      iocaine:
        # The things you already have here...
        # ...
        environment:
          - RUST_LOG=iocaine=info
        logging:
          driver: "json-file"
          options:
            max-size: "50m"
    

    My checks block in Nam-Shub-of-Enki is as such:

    checks {
        disable cgi-bin-trap
    
        asn {
            database-path "/data/ipinfo_lite.mmdb"
            asns "45102" "136907"
        }
        ai-robots-txt {
            path "/data/ai.robots.txt-robots.json"
        }
        generated-urls {
            identifiers "deflagration"
        }
        big-tech {
            enable #true
        }
        commercial_scrapers {
            enable #true
        }
    }
    

    i snatched a copy of the latest ipinfo ASN database for
    free and blocked AS45102 (Alibaba) and AS136907 (Huawei Clouds).

    On 2025-11-18 at 00:00:29 UTC+1, i enabled Iocaine with the Nam-Shub-of-Enki
    classifier in front of my whole forge. Immediately, my server was no longer
    hammered. Power draw went down to just above 160W.

    One problem i noticed, however, while trying to deploy the artifact for this
    blog post on my forge, is that Iocaine causes issues when huge PUT/PATCH/POST
    requests with large bodies are piped through it: it will hang up before the
    objects are entirely written. i had to figure out a way of only redirecting
    HEAD and GET requests to Iocaine in nginx, as is done in the Caddy example
    of the Iocaine documentation.

    What i ended up settling on requires a bit of variable mapping. At the start of
    your site configuration, before the server {} block:

    map $request_method $upstream_location {
    	GET	;
    	HEAD	;
    	default	;
    }
    
    map $request_method $upstream_log {
    	GET	bot_access;
    	HEAD	bot_access;
    	default	access;
    }
    

    Then, in the block that does the default location, write:

    	location / {
    	    proxy_cache off;
    	    access_log /var/log/nginx/$upstream_log.log combined;
    	    proxy_intercept_errors on;
    	    error_page 421 = @fallback;
    	    proxy_set_header Host $host;
    	    proxy_set_header X-Real-IP $remote_addr;
    	    proxy_pass http://$upstream_location;
    	}
    

    That is, replace the upstream in proxy_pass with the upstream decided by the
    variable mapping, and, while we’re at it, use $upstream_log to know which log
    will be the final one for that request. i differentiate between bot_access.log
    and access.log to gather my statistics, so the difference matters to me. Change
    the variables to suit the way you do it (or remove them, if you don’t distinguish
    clients in your log files).
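
    The error_page 421 = @fallback directive needs a matching named location; a
    minimal sketch of what it can look like, assuming the actual forgejo instance
    listens on 127.0.0.1:3000 (a hypothetical address, adjust to your own upstream):

    location @fallback {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # hand the request over to the real forge once Iocaine replied 421
        proxy_pass http://127.0.0.1:3000;
    }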

    # Monitoring Iocaine

    Currently, on 2025-11-30 at 16:33:00 UTC+1, Iocaine has served 38.16GB of garbage.
    Over the past hour, 152.11MB of such data was thrown back at undesirable visitors.
    3.39GB over the past day, 22.22GB over the past week. You can get the snippet
    that describes my Iocaine-specific Grafana views here.

    The vast majority of undesirable queries come from Claude, OpenAI, and
    Disguised Bots. Claude and OpenAI are absolutely gluttonous, and, once they
    have access to a ton of pages, they will greedily flock to fetch them like
    pigeons being fed breadcrumbs laced with strychnine.

    AI bot scrapers (ai.robots.txt) maintain a constant 920~930 queries per minute
    (15-ish QPS) over the 6 domains i have protected with Iocaine, including the
    forge.

    There is also a low hum of a mix of commercial scrapers (~1 request every two
    seconds), big tech crawlers (Facebook, Google, etc., about 2 QPS or 110 queries/min),
    and, especially, fake browsers.

    Classifying fake browsers is where Iocaine really shines, specifically thanks
    to the classifiers implemented via Nam-Shub-of-Enki. The faked-bots classifier
    estimates the likelihood that the user agent reported by the client is bullshit,
    generated from a list of technologies mashed together. For example, if your
    client reports a user agent for a set of software that never supported HTTP/2,
    or never actually existed together, or is not even released yet, it will get
    flagged. Think, for example, Windows NT 4 running Chrome, pretending to be
    able to do TLS 1.3.

    The background-noise level of such queries is usually 140~160 queries per minute
    (or 2~3 QPS). However, notice those spikes in the graph above?

    ## The Salvos of Queries

    For a while during my experiments i noticed those pillars of queries. My
    general nginx statistics would show a sharp increase in connections, with an
    initial ramp-up, and a stable-ish plateau lasting about an hour and a half,
    before suddenly stopping. It would then repeat, roughly three hours later.

    Between October 29th and November 19th, and on November 28th, these spikes would
    constantly show up. As soon as i got Iocaine statistics running, it would flag
    all of those queries as faked browsers.

    i investigated those spikes in particular, because they baffled and scared me:
    the regularity with which they probed me, and the sharpness of the ramp-up and
    halts, made me afraid that someone, somewhere, was organizing thousands of IPs
    to specifically take turns at probing websites. i have not reached any solid
    conclusions, beyond the following:

    • The initial phase of an attack wave begins with a clear exponential ramp-up
    • The ramp-up stops when the server starts either throwing errors, or the
      response latency reaches a given threshold
    • Every wave of attack lasts roughly an hour and a half
    • An individual IP will often contribute no more than one query, but it can
      reach 50 to 60 queries per IP
    • The same 15 or so ASNs keep showing up, with five regular leaders in IP count:
      1. AS212238: Datacamp Limited
      2. AS3257: GTT Communications
      3. AS9009: M247 Europe SRL
      4. AS203020: HostRoyale Technologies Pvt Ltd
      5. AS210906: UAB “Bite Lietuva” (a Lithuanian ISP)

    All of those are service providers. My working theory at the moment is that
    someone registered thousands of cheap servers with many different companies, and
    is selling access to them as web proxies for scraping and scanning. i will
    probably write something up later when i have properly investigated that
    specific phenomenon.

    # Conclusion

    Self-hosting anything that is deemed “content” openly on the web in 2025 is
    a battle of attrition between you and forces who are able to buy tens of
    thousands of proxies to ruin your service for data they can resell.

    This is depressing. Profoundly depressing. i look at the statistics board for
    my reverse-proxy and i never see less than 96.7% of requests classified as bots
    at any given moment. The web is filled with crap, bots that pretend to be real
    people to flood you. All of that because i want to have my little corner of the
    internet where i put my silly little code for other people to see.

    i have to learn to protect myself from industrial actors in order to put anything
    online, because anything a person makes is valuable, and that value will be
    sucked dry by every tech giant to be emulsified, liquified, strained, and
    ultimately inexorably joined in an unholy mesh of learning weights.

    This experience has rather profoundly radicalized the way i think about
    technology. Sanitized content can be chewed on and shat out by companies for
    training, but their AI tools will never swear. They will never use a slur. They
    will never have a revolutionary thought. Despite being amalgamations of shit
    rolled up in the guts of the dying capitalist society, they are sanitized to
    hell and beyond.

    The developer of Iocaine put it best
    when explaining why Iocaine has absolutely unhinged identifiers
    (such as SexDungeon, PipeBomb, etc.): they will all trigger “safeguard”
    mechanisms in commercial AI tools. Absolutely no coding agent will accept
    analyzing and explaining code where the memory allocator’s free function is
    called liberate_palestine. i bet that if i described, in graphic detail, in
    the comments of this page, the different ways being a furry intersects with my
    sexuality, no commercial scraper would even dare ingest this page.

    Fuck tech companies. Fuck “AI”. Fuck the corporate web.
