I read Yann Esposito’s blog post, How I protect my forgejo instance from AI Web Crawlers, and think that’s a great idea. My main concern with the crawlers is that they’re horribly written and behave poorly. My own Forgejo server was getting slammed with about 600,000 crawler requests per day. This little server is where I share tiny personal projects like my Advent of Code solutions. I wouldn’t expect any project there to get more than a handful of queries per day, but suddenly I was serving 10 requests per second. That’s not a lot compared to any popular website, but that’s a lot for this service, on this tiny VPS, on my shoestring budget.
Worse, the traffic patterns were flat-out abusive. All the content on this site consists of nearly static Git repositories. The scrapers try things like:
- For every Git commit, fetch the version of every file in the repository at that commit.
- See git blame for every file at every commit.
- Attempt to download the archive of each repo at every commit.
- Run every possible pull request search filter combination.
- Run every possible issue search filter combination.
- Fetch each of those URLs at random from some residential IP in Brazil that had never accessed my server before.
My first huge success at cutting through the flurry of bad traffic was with deploying Anubis. You know those anime girl pictures you see before accessing lots of web pages now? Well, those are part of a highly effective bot blocker. There’s a reason you’re seeing more and more of them.
And this morning, I also adapted Yann’s idea for my server, which runs behind Caddy instead of Nginx. I made a file named /etc/caddy/shibboleth like this (but with the cookie name suitably altered to a random local value):
@needs_cookie {
    not {
        header User-Agent *git/*
    }
    not {
        header User-Agent *git-lfs/*
    }
    not {
        header X-Runner-Uuid *
    }
    not {
        header Cookie *Yogsototh_opens_the_door=1*
    }
}
handle @needs_cookie {
    header Content-Type text/html
    respond 418 {
        body `<script>document.cookie = "Yogsototh_opens_the_door=1; Path=/;"; window.location.reload();</script>`
    }
}
Note the extra X-Runner-Uuid line that Yann didn’t have. This allows my Forgejo Action Runners to connect without going through the cookie handshake.
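If the stacked not blocks are hard to read, here’s a rough Python model of the decision the @needs_cookie matcher makes. This is only a sketch for reasoning about the rules, not how Caddy implements them: the needs_cookie function and COOKIE constant are names I made up for illustration, and it assumes a *value* pattern is a plain substring match while a lone * means “the header is present with any value”.

```python
# Hypothetical model of the @needs_cookie matcher; not Caddy's actual code.
COOKIE = "Yogsototh_opens_the_door=1"  # swap in your own random local name


def needs_cookie(headers: dict[str, str]) -> bool:
    """Return True when a request would get the 418 cookie-setting page."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    h = {name.lower(): value for name, value in headers.items()}
    user_agent = h.get("user-agent", "")

    if "git/" in user_agent:          # header User-Agent *git/*
        return False
    if "git-lfs/" in user_agent:      # header User-Agent *git-lfs/*
        return False
    if "x-runner-uuid" in h:          # header X-Runner-Uuid *
        return False
    if COOKIE in h.get("cookie", ""):  # header Cookie *...=1*
        return False
    # None of the exemptions applied: serve the challenge.
    return True


if __name__ == "__main__":
    print(needs_cookie({"User-Agent": "Mozilla/5.0"}))
    print(needs_cookie({"User-Agent": "git/2.43.0"}))
```

All four not conditions have to hold before the challenge is served, so any single exemption lets the request through. A browser’s first visit gets the 418 page, runs the script, sets the cookie, reloads, and passes from then on. A client that doesn’t run JavaScript or keep cookies just keeps getting the 418.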
Then I added a line to the configurations for services I wanted to protect, like:
myserver.example.com {
    root * /path/to/files
    ...
    import shibboleth
}
This way I can easily reuse the snippet for any of those services.
Thanks for the great idea, Yann!