You Don't Need Anubis

(fxgn.dev)

166 points | by flexagoon 18 hours ago

28 comments

uqers 16 hours ago
> Unfortunately, the price LLM companies would have to pay to scrape every single Anubis deployment out there is approximately $0.00.
The math on the site linked here as a source for this claim is incorrect. The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so. The whole purpose of Anubis is to be expensive for bots that repeatedly request the same site multiple times a second.
- drum55 16 hours ago
  The "cost" of executing the JavaScript proof of work is fairly irrelevant; the whole concept just doesn't make sense under pessimistic inspection. Anubis requires users to do an irrelevant amount of SHA-256 hashing in slow JavaScript, where a scraper can do it much faster in native code; simply game over. It's the same reason we don't use hashcash for email: the amount of proof of work a user will tolerate is much lower than the amount a professional can apply. If this tool provides any benefit, it's due to it being obscure and non-standard.
  When reviewing it I noticed that the author carried the common misunderstanding that "difficulty" in proof of work is simply the number of leading zero bytes in a hash, which limits the granularity to coarse multiplicative steps (each additional zero byte multiplies the expected work by 256). I realize that some of this is the cost of working in JavaScript, but the hottest code path seems to be written extremely inefficiently.
```
    for (; ;) {
        const hashBuffer = await calculateSHA256(data + nonce);
        const hashArray = new Uint8Array(hashBuffer);

        let isValid = true;
        for (let i = 0; i < requiredZeroBytes; i++) {
          if (hashArray[i] !== 0) {
            isValid = false;
            break;
          }
        }
```
  It wouldn’t be exaggerating to say that a native implementation of this with even a hair of optimization could reduce the “proof of work” to being less time intensive than the ssl handshake.
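  To make the granularity point concrete, here is a minimal sketch (not Anubis's code) that reads the whole hash as a 256-bit integer and compares it against a target, as hashcash-style schemes generally do. `targetFor`, `meetsTarget`, and `solve` are hypothetical names, and Node's `crypto` module stands in for the browser. Requiring k leading zero bytes is the special case where the difficulty is 256^k.
```javascript
const crypto = require('node:crypto');

// Target such that a uniformly random 256-bit hash passes with probability ~1/difficulty.
function targetFor(difficulty) {
  return (1n << 256n) / BigInt(difficulty);
}

// Interpret the SHA-256 digest of `input` as one big-endian integer.
function hashToBigInt(input) {
  const hex = crypto.createHash('sha256').update(input).digest('hex');
  return BigInt('0x' + hex);
}

function meetsTarget(input, difficulty) {
  return hashToBigInt(input) < targetFor(difficulty);
}

// Brute-force the first nonce whose hash falls below the target.
function solve(data, difficulty) {
  for (let nonce = 0; ; nonce++) {
    if (meetsTarget(data + nonce, difficulty)) return nonce;
  }
}
```
  With this shape, a difficulty of 300 or 1000 is as easy to express as 256 or 65536, so the expected work can be tuned in small steps.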
  - jsnell 15 hours ago
    That is not a productive way of thinking about it, because it will lead you to the conclusion that all you need is a smarter proof of work algorithm. One that's GPU-resistant, ASIC-resistant, and native code resistant. That's not the case.
    Proof of work can't function as a counter-abuse challenge even if you assume that the attackers have no advantage over the legitimate users (e.g. both are running exactly the same JS implementation of the challenge). The economics just can't work. The core problem is that the attackers pay in CPU time, which is fungible and incredibly cheap, while the real users pay in user-observable latency which is hellishly expensive.
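    A back-of-the-envelope version of that asymmetry, with numbers that are purely illustrative assumptions (the CPU price and the one-second challenge are made up for the example, not measured):
```javascript
// All numbers below are illustrative assumptions, not measurements.
const USD_PER_CPU_HOUR = 0.04; // assumed price of one rented vCPU-hour
const CHALLENGE_SECONDS = 1;   // assumed CPU time to solve one challenge

// Money an attacker spends solving `solves` challenges on rented CPUs.
function attackerCostUsd(solves, cpuSeconds = CHALLENGE_SECONDS, usdPerCpuHour = USD_PER_CPU_HOUR) {
  return (solves * cpuSeconds * usdPerCpuHour) / 3600;
}

// Total waiting time imposed on human visitors, in hours.
function humanWaitHours(visits, cpuSeconds = CHALLENGE_SECONDS) {
  return (visits * cpuSeconds) / 3600;
}

console.log(attackerCostUsd(1_000_000).toFixed(2)); // dollars for a million solves
console.log(humanWaitHours(1_000_000).toFixed(1));  // hours of waiting for a million visits
```
    Under these assumed numbers, a million solves cost the attacker about $11, while a million human visits add up to roughly 278 hours of people staring at a loading screen.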
  - aniviacat 14 hours ago
    They do use SubtleCrypto digest [0] in secure contexts, which does the hashing natively.
    Specifically for Firefox [1] they switch to the JavaScript fallback because that's actually faster [2] (because of overhead probably):
    > One of the biggest sources of lag in Firefox has been eliminated: the use of WebCrypto. Now whenever Anubis detects the client is using Firefox (or Pale Moon), it will swap over to a pure-JS implementation of SHA-256 for speed.
    [0] https://developer.mozilla.org/en-US/docs/Web/API/SubtleCrypt...
    [1] https://github.com/TecharoHQ/anubis/blob/main/web/js/algorit...
    [2] https://github.com/TecharoHQ/anubis/releases/tag/v1.22.0
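    A rough sketch of that kind of backend selection, under my own assumptions: `prefersPureJs` and `pickSha256Backend` are hypothetical names, and the User-Agent regex is a guess for illustration, not the check Anubis actually ships.
```javascript
// Hypothetical UA test; the real detection logic in Anubis may differ.
function prefersPureJs(userAgent) {
  return /Firefox\/|PaleMoon\//.test(userAgent);
}

// Pick a hashing backend: native WebCrypto unless it is unavailable or known to be slow.
function pickSha256Backend(userAgent, hasSubtleCrypto) {
  if (!hasSubtleCrypto) return 'pure-js';        // e.g. insecure context without crypto.subtle
  if (prefersPureJs(userAgent)) return 'pure-js'; // per-call overhead outweighs native speed
  return 'webcrypto';
}
```
    In a browser, `hasSubtleCrypto` would typically be something like `Boolean(globalThis.crypto && globalThis.crypto.subtle)`.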
  - xena 10 hours ago
    If you can optimize it, I would love that as a pull request! I am not a JS expert.
  - gruez 9 hours ago
    >but the hottest code path seems to be written extremely inefficiently.
    Why is this inefficient?
- tptacek 16 hours ago
  Right, but that's the point. It's not that the idea is bad. It's that PoW is the wrong fit for it. Internet-wide scrapers don't keep state? Ok, then force clients to do something that requires keeping state. You don't need to grind SHA2 puzzles to do that; you don't need to grind anything at all.
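  One possible reading of "something that requires keeping state" is a signed, expiring cookie that the client has to hold on to and send back. The sketch below is an assumption about how that could look, not anything Anubis or the article does; `issueToken`, `verifyToken`, and the one-week lifetime are made up for the example.
```javascript
const crypto = require('node:crypto');

const TOKEN_TTL_SECONDS = 7 * 24 * 3600; // assumed one-week lifetime

function sign(secret, payload) {
  return crypto.createHmac('sha256', secret).update(payload).digest('hex');
}

// Issue "expiry.signature"; the server stores nothing per client.
function issueToken(secret, nowSeconds) {
  const expiry = String(nowSeconds + TOKEN_TTL_SECONDS);
  return `${expiry}.${sign(secret, expiry)}`;
}

// Accept only tokens that we signed and that have not expired.
function verifyToken(secret, token, nowSeconds) {
  const [expiry, mac] = String(token).split('.');
  if (!expiry || !mac) return false;
  const expected = Buffer.from(sign(secret, expiry));
  const given = Buffer.from(mac);
  if (given.length !== expected.length) return false;
  if (!crypto.timingSafeEqual(given, expected)) return false;
  return Number(expiry) > nowSeconds;
}
```
  A client that throws the cookie away gets sent back through the challenge on every request, with no hashing involved on either side beyond one HMAC.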
- valicord 16 hours ago
  The point is that the scrapers can easily bypass this if they cared to do so
  - uqers 16 hours ago
    How so?
    - valicord 5 hours ago
      The parent comment was "The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so." There's no technical reason why they wouldn't reuse those tokens; they don't do that today because they don't care. If Anubis gets enough adoption to cause meaningful inconvenience, the scrapers would just start caching the tokens to amortize the cost.
      The point of the article is that if the scraper is sufficiently motivated, Anubis is not going to do much anyway, and if the scraper doesn't care, the same result can be achieved without annoying your actual users.
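      For a sense of how little "caching the tokens" takes, here is a hypothetical scraper-side sketch. `TokenCache` and the `solve` callback are invented names, and the one-week lifetime in the usage note is taken from the claim being discussed rather than verified.
```javascript
// Hypothetical scraper-side cache: solve once per host, reuse until expiry.
class TokenCache {
  constructor(ttlSeconds) {
    this.ttlSeconds = ttlSeconds;
    this.tokens = new Map(); // host -> { token, expiresAt }
    this.solves = 0;         // how many times the challenge was actually run
  }

  // `solve` stands in for running the proof-of-work challenge for `host`.
  get(host, nowSeconds, solve) {
    const cached = this.tokens.get(host);
    if (cached && cached.expiresAt > nowSeconds) return cached.token;
    this.solves += 1;
    const token = solve(host);
    this.tokens.set(host, { token, expiresAt: nowSeconds + this.ttlSeconds });
    return token;
  }
}
```
      With `new TokenCache(7 * 24 * 3600)`, every request to the same host within a week reuses one solved challenge, so the per-request cost shrinks toward zero as the request count grows.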
    - tecoholic 16 hours ago
      Hmm… by setting the verified=1 cookie on every request to the website?
      Am I missing something here? All this does is set an unencrypted cookie and reload the page right?
      - notpushkin 16 hours ago
        They could, but if this is slightly different from site to site, they’ll have to either do this for every site (annoying but possible if your site is important enough), or go ahead and run JS (which... I thought they do already, with plenty of sites still being SPAs?)
        - rezonant 15 hours ago
          I would be highly surprised if most of these bots aren't already running JavaScript; I'm confused by this unquestioned notion that they don't.
iamnothere 9 hours ago
All the critics here miss the point. Anubis has worked to stop DDoS-level scraping against a number of production sites, especially self-hosted source repos and forums. If it stops working, then either Anubis contributors will come up with a fix, site devs will find their own fix, or the sites under attack will be shut down. It’s an arms race in which there is no permanent solution; each escalation will of course be easily bypassed (in theory) until the majority of the attackers find that further adaptations are not worth the additional revenue, or until there is no further defense possible.
Anubis isn’t some conspiracy to show you pictures of anime catgirls, it’s a desperate attempt to stave off bot-driven downtime. Many admins who install it do so reluctantly, because obviously it is annoying to have a delay when you access a website. Nobody is doing that for fun.
(There are probably a few people who install it not to protect against scraper DDoS, but due to ideological opposition to AI scrapers. IMHO this is fruitless, as the more intelligent scrapers will find ways around it without calling attention to themselves. Anubis makes almost no sense on a static personal blog.)
notpushkin 16 hours ago
My favourite thing about Anubis is that (in default configuration) it completely bypasses the actual challenge altogether if you set User-Agent header to curl.
E.g. if you open this in browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...
But if you run this, you get the page content straight away:
```
  curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
```
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better, put together something that’s less annoying for your visitors, like the OP did.
- xena 10 hours ago
  This was a tactical decision I made in order to avoid breaking well-behaved automation that properly identifies itself. I have been mocked endlessly for it. There is no winning.
  - seba_dos1 8 hours ago
    The winning condition does not need to consider people who write before they think.
- rezonant 15 hours ago
  It only challenges user agents with Mozilla in their name by design, because user agents that do otherwise are already identifiable. If Anubis makes the bots change their user agents, it has done its job, as that traffic can now be addressed directly.
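  As a sketch of the policy described here (not Anubis's real rule engine, which is configurable), the decision boils down to a substring check. `shouldChallenge` is a hypothetical name.
```javascript
// Challenge only clients that claim to be a browser, i.e. whose UA contains "Mozilla".
function shouldChallenge(userAgent) {
  return typeof userAgent === 'string' && userAgent.includes('Mozilla');
}
```
  A default curl User-Agent such as `curl/8.5.0` passes straight through, while anything pretending to be a browser has to solve the challenge.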
  - samlinnfer 12 hours ago
    This has basically been Wikipedia's bot policy for a long, long time. If you run a bot, you should identify it via the User-Agent header.
    https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...
  - hshdhdhehd 15 hours ago
    What if every request from the bot has a different UA?
    - skylurk 13 hours ago
      Success. The goal is to differentiate users and bots who are pretending to be users.
    - trenchpilgrim 12 hours ago
      Then you can tell the bots apart from legitimate users through normal WAF rules, because browsers froze the UA a while back.
  - hsbauauvhabzb 13 hours ago
    Can you explain what you mean by this? Why Mozilla specifically and not WebKit or similar?
    - gucci-on-fleek 13 hours ago
      Due to weird historical reasons [0] [1], every modern browser's User-Agent starts with "Mozilla/5.0", even if they have nothing to do with Firefox.
      [0]: https://en.wikipedia.org/wiki/User-Agent_header#Format_for_h...
      [1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
- seba_dos1 8 hours ago
  > I’m pretty sure this gets abused by AI scrapers a lot.
  In practice, it hasn't been an issue for many months now, so I'm not sure why you're so sure. Disabling Anubis takes servers down; allowing curl bypass does not. What makes you assume that aggressive scrapers that don't want to identify themselves as bots will willingly identify themselves as bots in the first place?
katdork 10 hours ago
I don't like this solution because it is hostile to those who use solutions such as uMatrix / NoScript in their browser, who use TUI browsers (e.g. chawan, lynx, w3m, ...) or who have disabled JavaScript outright.
Admittedly, this is no different than the kinds of ways Anubis is hostile to those same users, truly a tragedy of the commons.
weinzierl 14 hours ago
"Unfortunately, Cloudflare is pretty much the only reliable way to protect against bots."
With footnote:
"I don’t know if they have any good competition, but “Cloudflare” here refers to all similar bot protection services."
That's the crux. Cloudflare is the default, no one seems to bother to take the risk with a competitor for some reason. They seem to exist but when asked people can't even name them.
(For what it's worth, I've been using AWS CloudFront, but I had to think for a moment to remember its name.)
- Avamander 7 hours ago
  It's actually not that reliable either given a bit of effort. Only their paid offerings actually give you tools to properly defend against intentional attacks.
indrora 16 hours ago
The problem is that increasingly, they are running JS.
In the ongoing arms race, we're likely to see simple things like this sort of check result in a handful of detection systems that look for "set a cookie" or at least "open the page in headless chrome and measure the cookies."
- moebrowne 13 hours ago
  > increasingly, they are running JS.
  Does anyone have any proof of this?
  - xena 10 hours ago
    I'm seeing more big botnets hosted on Alibaba Cloud, Huawei Cloud, and one on Tencent Cloud that run Headless Chrome. IP space blocks have been the solution there. I currently have a thread open with Tencent Cloud abuse where they've been begging me to not block them by default.
- utopiah 15 hours ago
  > increasingly, they are running JS.
  I mean, they have access to a mind-blowing amount of computing resources, so they can spend a fraction of that on improving the quality of the data. Since they have this fundamental belief (because it's convenient for their situation) that scale is everything, why not run JS too? Heck, if they have to run a full browser in a container, not even headless, they will.
  - typpilol 13 hours ago
    Chrome even released a DevTools MCP that gives any LLM full tool access to do anything in the browser.
    Navigation, screenshots, etc. It has like 30 tools in it alone.
    Now we can just run real browsers with LLMs attached. Idk how you even think about defeating that.
praptak 16 hours ago
There are reasons to choose the slightly annoying solution on purpose though. I'm thinking of a political statement along the lines "We have a problem with asshole AI companies and here's how they make everyone's life slightly worse."
yellow_lead 16 hours ago
Anubis should be something that doesn't inconvenience all the real humans that visit your site.
I work with ffmpeg so I have to access their bugtracker and mailing list site sometimes. Every few days, I'm hit with the Anubis block. And 1/3 - 1/5 of the time, it fails completely. The other times, it delays me by a few seconds. Over time, this has turned me sour on the Anubis project, which was initially something I supported.
- xena 10 hours ago
  I've finally found a ruleset that works for that fwiw. The newest release has that fix.
  - yellow_lead 9 hours ago
    Thank you!
    - xena 9 hours ago
      No problem. I wish I had found it sooner, but between doing this nights and weekends while working a full time job, trying to help my husband find a new job, navigating the byzantine nightmare that is sales to education institutions, and other things I have found out that I hate, I have not had a lot of time to actually code things. I wish I could afford to work on this full time. Government grants have not gone through because I don't have the metrics they need. Probably gonna have to piss people off to get the bare minimum of metrics that I need in order to justify why I should get those grants.
- opan 13 hours ago
  I only had issues with it on GNOME's bug tracker and could work around it with a UA change, meanwhile Cloudflare challenges are often unpassable in qutebrowser no matter what I do.
- mariusor 14 hours ago
  I don't understand the hate when people look at a countermeasure against unethical shit and complain about it instead of being upset at the unethical shit. And it's funny when it's the other way around, like cookie banners being blamed on GDPR not on the scumminess of some web operators.
  - elashri 12 hours ago
    I don't understand how some people don't realize that you can be upset about a status quo in which both sides of the equation suck. You can hate a thing and also the countermeasure that someone deploys against it. These are not mutually exclusive.
    - mariusor 12 hours ago
      I didn't see parent be upset about both sides on this one. I don't see it implied anywhere that they even considered it.
      - elashri 12 hours ago
        >which was initially something I supported.
        That quote is a strong indication that he sees it this way.
        - yellow_lead 9 hours ago
          Yup, I'm against the AI scraping. But personally for me, the equation breaks when I'm getting delays and errors when just visiting a bug tracker.
          Sounds like maybe it'll be fixed soon though.
  - m4rtink 11 hours ago
    Also the Anubis mascot is very cute! ;-)
- throwaway290 14 hours ago
  I understand why ffmpeg does it. No one is expected to pay for it. Until this age of LLMs, when bot traffic became dominant on the web, the ffmpeg site was probably an acceptable expense. But they probably don't want to be an unpaid data provider for big LLM operators who get to extract a few bucks from their users.
  It's like airplane checkin. Are we inconvenienced? Yes. Who is there to blame? Probably not the airline or the company who provides the services. Probably people who want to fly without a ticket or bring in explosives.
  As long as the Anubis project and the people on it don't try to play both sides and don't make the LLM situation worse (mafia racket style), I think if it works it works.
  - TJSomething 12 hours ago
    I know it's beside the point, but I think a chunk of the reason for many of the security measures in airports is because creating the appearance of security increases people's willingness to fly.
- bakql 12 hours ago
  [flagged]
  - trenchpilgrim 12 hours ago
    Unfortunately in countries like Brazil and India, where a majority of humans collectively live, better computers are taxed at extremely high rates and are practically unaffordable.
    - bakql 12 hours ago
      [flagged]
paweladamczuk 14 hours ago
Internet in its current form, where I can theoretically ping any web server on earth from my bedroom, doesn't seem sustainable. I think it will have to end at some point.
I can't fully articulate it but I feel like there is some game theory aspect of the current design that's just not compatible with the reality.
- noAnswer 10 hours ago
  Years ago, wasn't there a proposal from Google or the like to have push notifications for search engines? Instead of the bots checking over and over again whether there is something new, you would inform them about it. I think that would be a fair middle ground. You don't DDoS us, and in exchange we inform you in a timely manner if there is something new. (Bots would need a way to subscribe themselves.)
  I have a personal website that sometimes doesn't get an update for a year. Still the bots are in the majority of visitors. (Not so much that I would need counter measures but still.) Most bot visits could be avoided with such a scheme.
  - redwall_hp 3 hours ago
    Ah, so blog pingbacks are new again. https://en.wikipedia.org/wiki/Pingback
    That's how Technorati worked.
geokon 15 hours ago
Big picture, why does everyone scrape the web?
Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape it's a legal grey area, but if you sell what you scrape it's clearly copyright infringement?
- utopiah 15 hours ago
  My bet is that they believe https://commoncrawl.org isn't good enough and, precisely as you are suggesting, the "rest" is where their competitive advantage might stem from.
  - Jackson__ 13 hours ago
    Thinking that there is anything worth scraping past the llm-apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI written.
    E: In fact this whole idea is so stupid that I am forced to consider if it is just a DDoS in the original sense. Scrape everything so hard it goes down, just so that your competitors can't.
agnishom 12 hours ago
Exactly. I don't understand what computation you can afford to do in 10 seconds on a small number of cores that bots running on large data centers cannot
- juliangmp 11 hours ago
  The point of anubis isn't to make the scraping impossible, but make it more expensive.
  - agnishom 10 hours ago
    By how much? I don't understand the cost model here at all.
    - eqvinox 8 hours ago
      AIUI the idea is to rate-limit each "solution". A normal human's browser only needs to "solve" once. An LLM crawler either needs to slow down (= objective achieved) or solve the puzzle n times to get n × the request rate.
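      A sketch of that reading, which is my understanding of the idea rather than a description of what Anubis actually implements. `PerTokenLimiter`, the fixed-window scheme, and its parameters are all assumptions.
```javascript
// Fixed-window limiter keyed by challenge token (all parameters are assumptions).
class PerTokenLimiter {
  constructor(maxRequests, windowSeconds) {
    this.maxRequests = maxRequests;
    this.windowSeconds = windowSeconds;
    this.windows = new Map(); // token -> { start, count }
  }

  // Returns true if this request fits in the token's budget for the current window.
  allow(token, nowSeconds) {
    let w = this.windows.get(token);
    if (!w || nowSeconds - w.start >= this.windowSeconds) {
      w = { start: nowSeconds, count: 0 };
      this.windows.set(token, w);
    }
    if (w.count >= this.maxRequests) return false;
    w.count += 1;
    return true;
  }
}
```
      Each solved challenge buys one budget, so a crawler that wants n times the allowed rate has to hold n tokens, which means solving n puzzles.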
gucci-on-fleek 16 hours ago
> But it still works, right? People use Anubis because it actually stops LLM bots from scraping their site, so it must work, right?
> Yeah, but only because the LLM bots simply don’t run JavaScript.
I don't think that this is the case, because when Anubis itself switched from a proof-of-work to a different JavaScript-based challenge, my server got overloaded, but switching back to the PoW solution fixed it [0].
I also semi-hate Anubis since it required me to add JS to a website that used none before, but (1) it's the only thing that stopped the bot problem for me, (2) it's really easy to deploy, and (3) very few human visitors are incorrectly blocked by it (unlike Captchas or IP/ASN bans that have really high false-positive rates).
[0]: https://github.com/TecharoHQ/anubis/issues/1121
gbuk2013 13 hours ago
The Caddy config in the parent article uses status code 418. This is cute, but wouldn’t this break search engine indexing? Why not use 307 code?
- flexagoon 10 hours ago
  I use this for a personal Redlib instance, so search indexing is not important. I don't know if this will allow indexing even with a 307 status code - maybe you just need to add an exception for Googlebot.
defraudbah 15 hours ago
[flagged]
- m4rtink 11 hours ago
  Working as intended! ;-)
- viaoktavia 15 hours ago
  [dead]
jchw 15 hours ago
I was briefly messing around with Pangolin, which is supposed to be a self-hosted Cloudflare Tunnels sort of thing. Pretty cool.
One thing I noticed though was that the Digital Ocean Marketplace image asks you if you want to install something called Crowdsec, which is described as a "multiplayer firewall", and while it is a paid service, it appears there is a community offering that is well-liked enough. I actually was really wondering what downsides it has (except for the obvious, which is that you are definitely trading some user privacy in service of security) but at least in principle the idea seems kind of a nice middleground between Cloudflare and nothing if it works and the business model holds up.
- bootsmann 15 hours ago
  Not sure crowdsec is fit for this purpose. It's more of a fail2ban replacement than a DDoS challenge.
  - jchw 12 hours ago
    One of the main ways that Cloudflare is able to avoid presenting CAPTCHAs to a lot of people while still filtering tons of non-human traffic is exactly that, though: just having a boatload of data across the Internet.
andersmurphy 11 hours ago
So I don't use Cloudflare. But I only serve clients that support brotli and have a valid cookie. All the actual content comes down an SSE connection. Haven't had any problems with bots on my $5 VPS.
What I realised recently is that for anything other than a user's browser, my demos are effectively zip bombs.
Why?
Because I stream each frame and each frame is around 180 kB uncompressed (compressed frames can be as small as 13 bytes). This is fine as the user's browser doesn't hold onto the frames.
But, a crawler will hold onto those frames. Very quickly this ends up being a very bad time for them.
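To illustrate the size gap with a toy example (not my actual frames: the markup and the repeat count below are made up, and a single standalone brotli call is only an approximation of a streamed connection):
```javascript
const zlib = require('node:zlib');

// A hypothetical frame of roughly 180 kB made of highly repetitive markup.
const frame = Buffer.from('<input type="checkbox">'.repeat(8000));
const compressed = zlib.brotliCompressSync(frame);

// Compare what travels over the wire with what the client has to keep in memory.
console.log(`uncompressed: ${frame.length} bytes, compressed: ${compressed.length} bytes`);
```
Repetitive content like this shrinks to a tiny fraction of its size on the wire, so a client that keeps every decompressed frame pays far more in memory than the server pays in bandwidth.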
Of course there's nothing of value to scrape so mostly pointless. But, I found it entertaining that some scummy crawler is getting nuked by checkboxes [1].
[1] https://checkboxes.andersmurphy.com
greatgib 11 hours ago
Just a personal fact: when I want to see a page and instead have to face a stupid 3s nag screen like Anubis's, I'm very pissed off and pushed even more to bypass the website when possible and get the info I want directly from an LLM or search engine.
It's kind of a self-fulfilling prophecy: you make the visitor experience worse, which gives a self-justification for why an LLM serving the content is wanted and needed.
All of that because in the current lambda/cloud computing world, it became very expensive to process only a few requests.
- robinsonb5 10 hours ago
  Unfortunately the choice isn't between sites with something like Anubis and sites with free and unencumbered access. The choice is between putting up with Anubis and the sites simply going away.
  A web forum I read regularly has been playing whack-a-mole with LLM scrapers for much of this year, with multiple weeks-long periods where the swarm-of-locusts would make the site inaccessible to actual users.
  The admins tried all manner of blocks, including ultimately banning entire countries' IP ranges, all to no avail.
  The forum's continued existence depends on being able to hold off abusive crawlers. Having to see half-a-second of the Anubis splashscreen occasionally is a small price to pay for keeping it alive.
  - greatgib 8 hours ago
    [flagged]
    - pushcx 7 hours ago
      The scrapers will not attempt to discover and use an efficient representation. They will attempt to hit every URL they can discover on a site, and they'll do it at a rate of hundreds of hits per second, from enough IPs that each only requests at a rate of 1/minute. It's rude to talk down to people for not implementing a technique that you can't get scrapers to adopt, and for matching their investment in performance to their needs instead of accurately predicting years beforehand that traffic would dramatically change.
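      To put rough numbers on why per-IP limits don't help here, take 200 hits per second as an assumed stand-in for "hundreds". `ipsNeeded` is a hypothetical helper.
```javascript
// How many distinct IPs are needed to sustain a total rate
// while each single address stays at or below a per-IP limit?
function ipsNeeded(totalHitsPerSecond, perIpHitsPerMinute) {
  return Math.ceil((totalHitsPerSecond * 60) / perIpHitsPerMinute);
}

console.log(ipsNeeded(200, 1)); // IPs needed for 200 hits/s at 1 request per minute each
```
      At 1 request per minute per address, 200 hits per second takes 12,000 IPs, and each one looks slower than a casual human reader.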
    - xena 8 hours ago
      I challenge you to take a critical look at the performance of things like PHPBB and see how even naive scraping brings commonly deployed server CPUs to their knees.
- eqvinox 8 hours ago
  If you don't feel like understanding that the thing to be pissed off about here is the AI crawlers, we don't feel like understanding your displeasure about the Anubis wall either. The choices are either the Anubis wall or nothing. This isn't theoretical, I've been involved in this decision: we had to either close off the service entirely, or put [something like] Anubis in front of it.
  > have to face a stupid 3s nag screen like Anubis's, I'm very pissed off and pushed even more to bypass the website when possible and get the info I want directly from an LLM or search engine.
  Most (freely accessible) LLMs will take more than 3s to "think". Why are you pissed off about Anubis, but not the slow LLM? And then you have to double check the LLM anyway...
  > All of that because in the current lambda/cloud computing world, it became very expensive to process only a few requests.
  You're making some very arrogant assumptions here. FOSS repos and bugtrackers are generally not lambda/cloud hosted.
  - redwall_hp 3 hours ago
    There are a lot of phpBB/XenForo/Discourse/etc. forums out there too that get slammed hard by those, and many cases of them just shutting down rather than eating much higher hosting costs. Which, of course, further pushes online communities into the hands of corporations like Reddit and Facebook.
    Most of them are simply throwing one of those tools on a VPS or such, which is perfect for their community size, and then it falls over when LLM companies' botnets DDoS it.
- DanOpcode 9 hours ago
  I agree, I think it gives a bad impression when I need to see the anime Anubis girl before the page loads. Codeberg.org often shows me the nag screen, and it has worsened my impression of their service.
tptacek 16 hours ago
This came up before (and this post links to the Tavis Ormandy post that kicked up the last firestorm about Anubis) and without myself shading the intent or the execution on Anubis, just from a CS perspective, I want to say again that the PoW thing Anubis uses doesn't make sense.
Work functions make sense in password hashes because they exploit an asymmetry: attackers will guess millions of invalid passwords for every validated guess, so the attacker bears most (really almost all) of the cost.
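A minimal sketch of that asymmetry using Node's built-in scrypt (the helper names are hypothetical and the default scrypt cost parameters are used as-is): the server pays for one expensive derivation per login, while an attacker with a stolen hash pays the same price for every single guess.
```javascript
const crypto = require('node:crypto');

// Derive a 32-byte key with scrypt; this is the deliberately expensive step.
function hashPassword(password, salt = crypto.randomBytes(16)) {
  const key = crypto.scryptSync(password, salt, 32);
  return { salt, key };
}

// One derivation per attempt, whether the attempt is a real login or a guess.
function verifyPassword(password, stored) {
  const candidate = crypto.scryptSync(password, stored.salt, 32);
  return crypto.timingSafeEqual(candidate, stored.key);
}
```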
Work functions make sense in antispam systems for the same reason: spam "attacks" rely on the cost of an attempt being so low that it's efficient to target millions of victims in the expectation of just one hit.
Work functions make sense in Bitcoin because they function as a synchronization mechanism. There's nothing actually valorous about solving a SHA2 puzzle, but the puzzles give the whole protocol a clock.
Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.
None of this is to say that a serious anti-scraping firewall can't be built! I'm fond of pointing to how YouTube addressed this very similar problem, with a content protection system built in JavaScript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new YouTube account was using.
The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
- mariusor 14 hours ago
  With all due respect, but almost all I see in this thread is people looking down their nose at a proven solution, and giving advice instead of doing the work. I can see how you are a _very important person_ with bills to pay and money to make, but at least have the humility of understanding that the solution we got is better than the solution that could be better if only there was someone else to think of it and build it.
  - tptacek 7 hours ago
    You can't moralize a flawed design into being a good one.
    - mariusor 7 hours ago
      How about into a "good enough one"?
      - tptacek 7 hours ago
        Look, I don't care if you run Anubis. I'm not against "Anubis". I'm interested in the computer science of the current Anubis implementation. It's not great. It doesn't make sense. Those are descriptive observations, and you can't moralize them into being false; you need to present an actual argument.
        - mariusor 5 hours ago
          This is not me being aggro because you're picking on my favourite project, I dislike Anubis for more or less the same complaints you see in this thread. I don't want JavaScript on otherwise static sites, I don't like the anime girl, etc. What I don't agree with is people like you pontificating about what an inferior solution it is, and *how* obvious that should be for everybody, but you fail to provide any better alternatives. So I guess what I'm trying to say is: put up or shut up.
          - tptacek 4 hours ago
            Sorry, but I really can't think of anything less interesting to debate than how a computer science argument makes you feel about how it might make someone else feel.
  - yumechii 10 hours ago
    [dead]
    - mariusor 9 hours ago
      It's weird that you get offended by something which was not directed at you.
      "The work" is providing those better alternatives to anubis, that everyone in this thread except for Xe seem to know all about.
      The humility is about accepting the fact that the solution works for some people, the small site operators that get hammered by DDoSes and unethical LLM over crawling, despite not being perfect. And if that inconveniences you as a user of those sites - which I imagine is what you mean by "user backlash", the solution for you is to stop going there, not talk down at them for doing something about an issue that impacts them.
      - yumechii 9 hours ago
        How am I offended? Did I accuse you of anything? I didn't even accuse Anubis of anything. You asked for the work, I post the work and evidence to ground the discussion in "work", as you demanded.
        - mariusor 8 hours ago
          I repeat "the work" is to make a better thing than Anubis, not provide proof of concept that it can be beaten. :)
          - yumechii 8 hours ago
            You criticized what you identified as "advice" for not providing work within your scope (which you clarified as "make a better thing than Anubis"). Why should I suddenly have to meet your scope of "work" for this to be a valid criticism of your advice? Showing a negative result is also work.
            - mariusor 7 hours ago
              If you're operating your reasoning in a moral framework where helping the bad agents is a good outcome, then you'd be right. I personally do not, however.
              - yumechii 6 hours ago
                If, in your moral framework, supporting a nominally good "solution" with no evidence (where is your evidence for your assertion that the solution is "proven"?) is "a good outcome", while pointing out that the solution is flawed, with evidence, is somehow not, then you'd be right. I personally do not share your nominal goodness compass, however.
                - mariusor 5 hours ago
                  Codeberg and sourcehut[1] have both blogged about Anubis decreasing loads on their servers at the beginning of the year when this saga started. Since then, one or both have moved to different solutions, but that was not due to ineffectiveness but rather to Anubis requiring JavaScript.
                  [1] https://sourcehut.org/blog/2025-04-15-you-cannot-have-our-us...
- gucci-on-fleek 16 hours ago
  > Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.
  Agreed, residential proxies are far more expensive than compute, yet the bots seem to have no problem obtaining millions of residential IPs. So I'm not really sure why Anubis works—my best guess is that the bots have some sort of time limit for each page, and they haven't bothered to increase it for pages that use Anubis.
  > with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.
  > The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
  They did [0], but it doesn't work [1]. Of course, the Anubis implementation is much simpler than YouTube's, but (1) Anubis doesn't have dozens of employees who can test hundreds of browser/OS/version combinations to make sure that it doesn't inadvertently block human users, and (2) it's much trickier to design an open-source program that resists reverse-engineering than a closed-source program, and I wouldn't want to use Anubis if it went closed-source.
  [0]: https://anubis.techaro.lol/docs/admin/configuration/challeng...
  [1]: https://github.com/TecharoHQ/anubis/issues/1121
  [-]
  - tptacek 16 hours ago
    Google's content-protection system didn't simply make sure you could run client-side Javascript. It implemented an obfuscating virtual machine that, if I'm remembering right (I may be getting some of the details blurred with Blu Ray's BD+ scheme), built up a hash input of runtime artifacts. As I understand it, it was one person's work, not the work of a big team. The "source code" we're talking about here is clientside Javascript.
    Either way: what Anubis does now --- just from a CS perspective, that's all --- doesn't make sense.
- Gander5739 11 hours ago
  But youtube can still be scraped with yt-dlp, so apparently it wasn't enough.
  [-]
  - tptacek 7 hours ago
    Preventing that wasn't the objective of the content-protection system. You'll have to go read up on it.
Borg3 13 hours ago
It seems that people do NOT understand it's already game over.. Lost.. When stuff was small, and we had abusive actors, nobody cared.. oh just a few bad actors, nothing to worry about, they will get bored and go away. No, they won't, they will grow and grow, and now even most of the good guys have turned bad because there is no punishment for it.. So as I said, game over.
It's time to start building our own walled gardens, build overlay VPN networks for humans. Put services there, and if someone misbehaves? BAN his IP. Came back? BAN again. Came back? wtf? BAN the VPN provider.. Just clean up the mess.. different networks can peer and exchange. Look, the Internet is just a network of networks, it's not that hard.
[-]
- timeon 11 hours ago
  Good idea. Another solution is to move our things to p2p. These corporations need expensive servers to run huge models on or just collect data. Sometimes the winning move is not to play the game: true server-less.
yumechii 10 hours ago
Here are some benchmarks. TL;DR: Anubis is not as performant as an optimized client prover running on the same HEDT CPU.
So the "PoW tax" essentially only applies to low-volume requesters who have no incentive to optimize, or to bespoke solutions too diverse to optimize at scale.
https://yumechi.jp/en/blog/2025/proof-of-mutex-outspeeding-a...
https://github.com/eternal-flame-AD/pow-buster
The problem was "fixed" but then reverted because the fix had a deadlock bug. (Changelog entry: "Remove bbolt actorify implementation due to causing production issues.")
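For anyone who wants to see the shape of the thing being benchmarked, here is a minimal sketch of a leading-zero-bytes proof of work, assuming the same `data + nonce` concatenation as the loop quoted upthread. The names `solve` and `verify` and the challenge string are made up for illustration; this is neither Anubis's nor pow-buster's actual code.

```python
import hashlib
from itertools import count


def verify(challenge: str, nonce: int, zero_bytes: int) -> bool:
    """Server side: a single SHA-256 call, then a check of the leading bytes."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    # bytes(n) is n zero bytes, so this compares the prefix against all zeros.
    return digest[:zero_bytes] == bytes(zero_bytes)


def solve(challenge: str, zero_bytes: int) -> int:
    """Client side: brute-force nonces until one passes verification."""
    for nonce in count():
        if verify(challenge, nonce, zero_bytes):
            return nonce


# Hypothetical challenge string; one zero byte keeps the demo fast.
nonce = solve("example-challenge", 1)
print(nonce, verify("example-challenge", nonce, 1))
```

The solver is nothing but a tight hash loop, which is why a native, optimized prover can outrun the in-browser one by a wide margin.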
hubraumhugo 14 hours ago
What's the endgame of this increasing arms race? A gated web where you need to log in everywhere? Even more captchas and Cloudflare becoming the gateway to the internet? There must be a better way.
We're somehow still stuck with CAPTCHAs (and other challenges), a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
[0] https://arxiv.org/abs/2311.10911
[-]
- DecoySalamander 11 hours ago
  Maybe a web where you provide your credit card number upfront and pay for each outgoing request.
utopiah 15 hours ago
"Yes, it works, and does so as effectively as Anubis, while not bothering your visitors with a 10-second page load time."
Cool... but I guess now we need a benchmark for such solutions. I don't know the author, I roughly know the problem (as I self host and most of my traffic now comes from AI scraper bots, not the usual indexing bots or, mind you, humans) but when there are numerous solutions to a multi-dimensional problem I need a common way to compare them.
Yet another solution is always welcome, but without being able to compare them efficiently it doesn't help me pick the right one for me.
Razengan 16 hours ago
How else would I inter my dead and make sure they get to the afterlife?
echelon 16 hours ago
This whole thing is pointless.
OpenAI Atlas defeats all of this by being a user's web browser. They got between you and the user you're trying to serve content, and they slurp up everything the user browses to return it back for training.
The firewall is now moot.
The bigger AI company, Google, has already been doing this for decades. They were the middlemen between your reader and you, and that position is unassailable. Without them, you don't have readers.
At this point, the only people you're keeping out with LLM firewalls are the smaller players, which further entrenches the leaders.
OpenAI and Google want you to block everybody else.
[-]
- happyopossum 16 hours ago
  > Google, has already been doing this for decades
  Do you have any proof, or even circumstantial evidence to point to this being the case?
  If chrome actually scraped every site ever you visited and sent it off to Google, it’d be trivially simple to find some indication of that in network traffic, or heck - even chromium code.
  [-]
  - echelon 16 hours ago
    Sorry, I mean they're between the customer relationship.
    Who would dare block Google Search from indexing their site?
    The relationship is adversarial, but necessary.
    [-]
- Dylan16807 16 hours ago
  Is it confirmed that site loads go into the training database?
  But for anyone whose main concern is their server staying up, Atlas isn't a problem. It's not doing a million extra loads.
  [-]
  - heavyset_go 15 hours ago
    > Is it confirmed that site loads go into the training database?
    Would you trust OpenAI if they told you it doesn't?
    If you would, would you also trust Meta to tell you if its multibillion dollar investment was trained on terabytes of pirated media the company downloaded over BitTorrent?
    [-]
    - viraptor 12 hours ago
      We don't have to trust it or not. If there's such a claim, surely someone can point at least at a pcap file with an unknown connection. Or at some decompiled code. Otherwise it's just a conspiracy theory.
      [-]
      - _flux 10 hours ago
        Surely the data must go to the OpenAI servers, how else would they use LLMs on it? We cannot see if that data ends up in the training data.
        Personally I would just believe what they say for the time being; there would be backlash for doing something else, possibly a legal one.
        [-]
        viraptor 10 hours ago
        I think the original claim was about something different. "Is it confirmed that site loads..." - I read it as the author talking about general browsing, not just explicit questions, with the context of the page.
      - heavyset_go 10 hours ago
        Whatever is included in context is in OpenAI's control from that point forward, and you just have to trust them not to do anything with it.
        That isn't a conspiracy theory, it's fundamentally how interfacing with 3rd party hosted LLMs works.
- seba_dos1 7 hours ago
  The "LLM firewall" is usually there so AI companies don't take the server down, not to prevent model training (that's just an acceptable side effect).
- _flux 11 hours ago
  As I understand it, the main point of Anubis is to reduce the costs caused by (AI company) bots, and agent-generated load is still a lot less than simply spidering the complete web site; it might actually be quite close to what a user would manually browse.
  Unless the user asked something that just needs visiting many pages, I suppose. For example, Google Gemini was pretty helpful in finding out the typical price ranges and dishes a local shopping centre's coffee shops have, as the information was far from being on just a single page.
- masklinn 11 hours ago
  > This whole thing is pointless.
  It's definitely pointless if you completely miss the point of it.
  > OpenAI Atlas defeats all of this by being a user's web browser. They got between you and the user you're trying to serve content, and they slurp up everything the user browses to return it back for training.
  Cool. Anubis' fundamental purpose is not to prevent all bot access tho, as clearly spelled in its overview:
  > This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies.
  OpenAI Atlas piggybacking on the user's normal browsing is not within the remit of Anubis, because it's not going to take a small site down or dramatically increase hosting costs.
  > At this point, the only people you're keeping out with LLM firewalls are the smaller players
  Oh no, who will think of the small assholes?
GaryBluto 15 hours ago
[flagged]
herpessimplex10 15 hours ago
[flagged]
[-]
- viraptor 15 hours ago
  Or you know, they may be serious and grown up enough to not be bothered by an image of an anime cat girl. There's really nothing there to be offended by.
  [-]
  - pkal 12 hours ago
    I am also not a fan, but since a recent discussion on HN I have been thinking about what I don't like about it.
    The conclusion I have come to is more general: I just personally don't like nerd-culture. Having an anime girl (but the same would be the case for a Star Trek, My Little Pony/furry, etc.-themed site) signifies a kind of personality that I don't feel comfortable about, mainly due to the oblivious social awkwardness, but also due to "personal" habits of some people you meet in nerdy spaces. I guess there is something about the fact of not distinguishing between a public presentation and personal interests that this is reminiscent of. For instance: A guy can enjoy model trains, sure, but if he is your colleague at work and always just goes on about model trains (without considering if this interests you or not!), then the fact that this subsumes his personality becomes a bore or even just plain unpleasant. This is not to generalize that this is the case for everyone in these spaces, I am friends with nerdy people on an individual basis, but I am painfully aware that I don't fit in perfectly like the last piece of a jigsaw puzzle -- and increasingly have less of a desire to do so.
    So for me at least this is not offence, but in addition to the above also some kind of reminder that there is a fundamental rift in how decency and intersocial relations are imagined between the people who share my interests and me, which does bother me. Having that cat-girl appear every time I open some site reminds me of this fact.
    Does any of this make sense? The way you and others phrase objections to the objections makes it seem like anyone who dislikes this is some obsessive or bigoted weirdo, which I hope is not the impression I give. (Hit me up, even privately off-HN if anyone wants to chat about this, especially if you disagree with me, this is a topic that I find interesting and want to better understand!)
    [-]
    - GaryBluto 34 minutes ago
      Thank you for putting into words what I could not. "Nerd culture" has fallen a long way since the early 2000s and your quote about decency and intersocial relations spoke to me.
      It is really bizarre how everybody tries to make it about politics. While I may or may not disagree with a developer's politics, it's their conduct that I care about, and I associate those who express appreciation for anime at every possible opportunity with especially poor conduct and have yet to encounter an exception to the rule.
      The amount of flags I'm seeing for posts simply expressing disagreement on the matter is quite worrying.
    - adlinb 9 hours ago
      [flagged]
  - GaryBluto 15 hours ago
    > grown up enough to not be bothered by an image of an anime cat girl
    This is some real four-dimensional chess. "You're the childish one for not wanting Japanese cartoons on software projects!"
    [-]
    - tavavex 14 hours ago
      It's not just "not wanting" something, the original comment wasn't nearly that mild. It's being enraged by it to the extent of making petty, low, personal attacks on someone who steps just barely out of line of their preferred behaviors.
      This whole comment chain solidifies my opinion that disgust is one of the driving human emotions. People feel initial, momentary disgust and only then explain it using the most solid justification that comes to mind, but the core disgust is unshakable and precedes all explanations. No one here has managed to procure any argument for why seeing a basic sketch in a certain style is objectively bad or harmful to someone, only that it's "weird" in some vague way. Basically, it goes against the primal instinct of how the person thinks the world "ought to work", therefore it's bad, end of story.
      To me it seems obvious. The anime art style is in, especially in Western countries, especially^2 among younger people, and especially^3 among techy people. Ergo, you may see a mascot in that style once in a while in hobbyist projects. Doesn't seem like anything particularly objectionable to me.
      [-]
      - krapp 12 hours ago
        It isn't a driving human emotion. The world is full of serious businesses that use "cute" icons or employ anime-styled elements, and most people don't care. It's just a subset of tech and CS people who feel compelled to register their disdain at every opportunity.
        And yet if you bring up that "Gimp" is an unserious name, or anything about RMS that's far more problematic than a cute cartoon, that same subset will defend it to the death.
        [-]
        GaryBluto 54 minutes ago
        I'd argue there's a difference between a funny mascot or punny name (that have been used in professional environments) and a mascot that looks like a child and is designed to look "cute" or "silly" to a fandom mostly consisting of eccentric (in a bad way) grown men. I don't think I've ever seen a developer on the internet who publicly enjoys anime who didn't act neurotically or childishly, although that's just anecdotal.
        > The world is full of serious businesses that use "cute" icons or employ anime-styled elements
        I can't think of any outside of Japan.
    - pyrale 11 hours ago
      > This is some real four-dimensional chess. "You're the childish one for not wanting Japanese cartoons on software projects!"
      I would be OK with the sentence if it was "You're the childish one for not wanting Japanese cartoons on your software projects!".
      As you wrote it, well, that's none of your business.
  - herpessimplex10 15 hours ago
    [flagged]
    [-]
    - viraptor 12 hours ago
      If those 2 are in any way close in your mind... maybe think about why a cartoon character and explicit sexual image seem related? There's nothing weird about an anime cat girl for me, but also I don't see any relation to anything sexualised in it.
    - mariusor 14 hours ago
      Mister simplex, preferring that the small internet crumble under the shit LLM companies are piling upon it over being exposed once in a while to an inoffensive image of an anime character would make for a very worrisome direction of my priorities if I were you.
    - petesergeant 15 hours ago
      > an anime cat-girl
      > an animated cock and balls
      You don't see a difference between these things?
      [-]
      - GaryBluto 15 hours ago
        While it's a bit of an extreme comparison, they're both weird, unprofessional imagery associated with things you wouldn't want to associate with a software project.
        [-]
        lakecresva 15 hours ago
        If it were just the imagery I don't think this would be such a huge flashpoint relative to something like tux or octocat (the github mascot).
        [-]
        squigz 14 hours ago
        Because it's really about the type of people (they think) watch anime, and their inability to separate this preconception from reality.
        petesergeant 14 hours ago
        I guess I'm not seeing the special category that an anime cat girl sits in. Is there some kind of sex implication I'm just not aware of? Linux has a penguin, FreeBSD has a devil(!), OpenBSD has a blowfish, Go has the weird Gopher thing, Gnome has a foot...
        Wikipedia suggests that there's an association with queer and trans youth, is that what's meant to make the cock-and-balls comparison work? But it also says it has a history back to 17th century Japan...
    - spinf97 15 hours ago
      I don't, actually. Why is it weird?
- spinf97 15 hours ago
  Only man-children can be bothered by anime catgirls enough to post about it on Hacker News, so it says more about you tbh
  [-]
  - GaryBluto 15 hours ago
    When did this notion arise that caring about things and wanting things to be professional is bad, or makes you a "man-child"? That would mean that practically everybody in human history has been a man-child. It feels like the whole world (even formerly professional areas) has decided to be casual and it's frustrating to those who think things matter.
    [-]
    - DecoySalamander 11 hours ago
      Adhering to a narrow definition of a "professional" look signifies immaturity, stemming from a desire for approval from a stereotypically "adult" third party. Personally, I wouldn't take seriously anyone who has a problem with Anubis but doesn't blink when presented with people drawn in the corporate Memphis style.
      [-]
      - GaryBluto 42 minutes ago
        >Adhering to a narrow definition of a "professional" look signifies immaturity, stemming from a desire for approval from a stereotypically "adult" third party.
        I'm not looking for anyone's approval. If I was, I wouldn't be publicly disagreeing with people on an internet forum, would I? Relax with your armchair psychology.
        > Personally, I wouldn't take seriously anyone who has a problem with Anubis but doesn't blink when presented with people drawn in the corporate Memphis style.
        I don't like either and find them both ugly.
      - redwall_hp 2 hours ago
        Whole countries of comparable size to the US happily put similar mascots all over their products, and pay other companies big money to use their characters. They're all over busses and billboards. The Korean ramen brand I buy has Kpop Demon Hunters on it now. (And Buldak usually has their little chicken dude.) Casio and Fender have expensive products with Hatsune Miku on them...which has been used in ad campaigns by petroleum and rail companies in Japan.
        American corporate culture is dehumanizing and dystopian, not a standard for professionalism.
    - pyrale 11 hours ago
      > wanting things to be professional
      Nothing says "professional" like starting a debate on HN about the weirdness of the mascot of a free software project, likely for political reasons.
      [-]
      - GaryBluto 46 minutes ago
        > likely for political reasons.
        You're engaging in bad faith here. Nobody has brought up politics at all. If an almost identical clone of myself (with the same opinions on everything but mascots) developed a software project with an anime mascot I'd still disapprove.
      - adlinb 9 hours ago
        [flagged]
    - brendoelfrendo 14 hours ago
      I want less professionalism, thanks. I think the idea that everything needs to be an emotionless product has been largely harmful to the internet as a place of community and expression.
      [-]
      - GaryBluto 14 hours ago
        Professional =/= emotionless or product. I'd argue that early Linux, with all of Linus' rants, was more professional than most companies today.
        I suppose it all comes down to what your definition of "professional" is.
    - ncruces 9 hours ago
      Why is this worse than the octocat?
    - watwut 13 hours ago
      > That would mean that practically everybody in human history has been a man-child.
      I would argue that this statement is blatantly false. Currently, most people really do not care about the Anubis anime cat girl icon, which is actually a fairly tame and boring picture.
      In history, people used all kinds of images for professional things, including stuff they found funny or cute.
  - adlinb 10 hours ago
    [flagged]
GauntletWizard 16 hours ago
Anubis's design is copied from a great botnet protection mechanism - You serve the Javascript cheaply from memory, and then the client is forced to do expensive compute in order to use your expensive compute. This works great at keeping attackers from attempting to waste your time; It turns a 1:1000 amplification in compute costs into a 1000:1.
It is a shitty, and obviously bad solution for preventing scraping traffic. The goal of scraping traffic isn't to overwhelm your site, it's to read it once. If you make it prohibitively expensive to read your site even once, nobody comes to it. If you make it only mildly expensive, nobody scraping cares.
Anubis is specifically DDoS protection, not generally anti-bot, aside from defeating basic bots that don't emulate a full browser. It's been cargo-culted in front of a bunch of websites because of the latter, but it was obviously not going to work for long.
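To put rough numbers on that asymmetry, here is a back-of-the-envelope sketch. It assumes difficulty is counted in leading zero bytes of a SHA-256 digest (as in the loop quoted upthread) and that the server spends one hash per verification; both the function name and the constant are illustrative, not taken from Anubis.

```python
def expected_client_hashes(zero_bytes: int) -> int:
    """Average number of SHA-256 attempts before a digest starts with
    `zero_bytes` zero bytes: each attempt succeeds with probability
    256**-zero_bytes, so the mean of the geometric distribution is 256**zero_bytes."""
    return 256 ** zero_bytes


# Assumption: verifying a submitted nonce costs the server a single hash.
SERVER_HASHES_PER_CHECK = 1

for zero_bytes in (1, 2, 3):
    ratio = expected_client_hashes(zero_bytes) // SERVER_HASHES_PER_CHECK
    print(f"{zero_bytes} zero byte(s): about {ratio}:1 client-to-server hashes")
```

Under those assumptions the ratio is 256:1, 65536:1 and 16777216:1. That is a real cost for something that has to re-solve the challenge constantly, and close to nothing for a scraper that solves it once and keeps the token.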
[-]
- viraptor 15 hours ago
  > The goal of scraping traffic isn't to overwhelm your site, it's to read it once.
  If the authors of the scrapers actually cared about it, we wouldn't have this problem in the first place. But today the more appropriate description is: the goal is to scrape as much data as possible as quickly as possible, preferably before your site falls over. They really don't care about side effects beyond that. Search engines have an incentive to leave your site running. AI companies don't. (Maybe apart from Perplexity)
- reppap 13 hours ago
  First of all, Anubis isn't meant to protect simple websites that get read once. It's meant for things like a GitLab instance where AI bots are indexing every single commit of every single file, resulting in thousands if not millions of reads. And reading an Anubis page once isn't expensive either. So I don't really understand what point you are trying to make, as the premise seems completely wrong.
- purple_turtle 13 hours ago
  Some people deployed Anubis not to stop scraping, but to stop scraping the same page multiple times per second.