16 comments

  • tasuki 40 minutes ago
    > If you have a public website, they are already stealing your work.

    I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!

    • spiderfarmer 18 minutes ago
      If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?
      • falcor84 6 minutes ago
        That really depends, but the quick answer is that under our human social contract, we'd just ask "how many can I take?". Until now, the only real tool to limit scrapers has been throttling, but I don't see any reason there couldn't be a similar conversational social contract between machines.
      • GaggiX 16 minutes ago
        I will copy the supermarket and paste it somewhere else.

        I'm also going to download a car.

  • aldousd666 16 minutes ago
    This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped. The bottom has always been threatening to fall out of the ads-for-eyeballs market, and nobody could anticipate the trigger for the downfall. Looks like we found it.
    • aldousd666 10 minutes ago
      To be clear, I mean AI is going to be the downfall of ad-supported content. But let's face it: we have link farms and spam factories as a result of the ad-supported content market. I think this will eventually do justice for users, because it puts a premium on content quality. Someone will want to pay a direct licensing fee to scrape quality content for their AI bots, as opposed to tricking somebody into clicking a link and serving an impression for something they won't buy.
  • madeofpalk 1 hour ago
    Is there any evidence or hints that these actually work?

    It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.

    • raincole 6 minutes ago
      It might work against people who just use their Mac Mini with OpenClaw to summarize news every morning, but it certainly won't work against Google.

      More centralized web ftw.

    • sd9 1 hour ago
      Even if it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make any material difference. I'm tired.
      • 20k 1 hour ago
        I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and it's very difficult to filter it by hand.
    • spiderfarmer 17 minutes ago
      There are hundreds of bots using residential proxies. That is not free. Make them pay.
    • m00dy 31 minutes ago
      It won't work, especially on Gemini. Googlebot is very experienced when it comes to crawling. It might work on OpenAI and the others, maybe.
    • nubg 1 hour ago
        What kind of mitigations? How would you detect the poison fountain?
      • avereveard 1 hour ago
        style="display: none;" aria-hidden="true" tabindex="-1"

        Many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soup.

        • m00dy 30 minutes ago
          Google will give your website a penalty for doing this.
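The hidden-attribute check described above can be sketched with Python's stdlib HTML parser. The attribute names (`style`, `aria-hidden`, `tabindex`) are real HTML; the heuristic itself is an assumption about how a scraper might filter trap links, not anything from the project under discussion.

```python
from html.parser import HTMLParser

class VisibleLinkExtractor(HTMLParser):
    """Collect hrefs, skipping links hidden via CSS, ARIA, or tab order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = attrs.get("style", "").replace(" ", "").lower()
        hidden = (
            "display:none" in style
            or "visibility:hidden" in style
            or attrs.get("aria-hidden") == "true"
            or attrs.get("tabindex") == "-1"  # removed from tab order
        )
        if tag == "a" and "href" in attrs and not hidden:
            self.links.append(attrs["href"])

html = """
<a href="/real-page">ok</a>
<a href="/trap" style="display: none;" aria-hidden="true" tabindex="-1">trap</a>
"""
parser = VisibleLinkExtractor()
parser.feed(html)
print(parser.links)  # only the visible link survives
```

A real crawler would also need to resolve CSS classes and inherited visibility, which is exactly why this cat-and-mouse game favors well-funded scrapers.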
      • GaggiX 1 hour ago
        Because the internet is noisy and not up to date, all recent LLMs are trained using Reinforcement Learning with Verifiable Rewards. If a model has learned the wrong signature of a function, for example, it would become apparent when executing the code.
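The "verifiable rewards" idea above can be sketched in a few lines: execute model-generated code against a known check and reward only code that actually behaves correctly. The function name `add` and the reward values are made up for illustration; real RLVR pipelines sandbox the execution.

```python
def reward(candidate_src: str) -> float:
    """Return 1.0 if the generated code passes a verifiable check, else 0.0."""
    scope = {}
    try:
        exec(candidate_src, scope)      # run the generated code (unsandboxed sketch)
        assert scope["add"](2, 3) == 5  # the verifiable check
        return 1.0
    except Exception:
        return 0.0                      # wrong signature or body -> no reward

good = "def add(a, b):\n    return a + b"
bad = "def add(a):\n    return a"       # poisoned/wrong signature
print(reward(good), reward(bad))        # 1.0 0.0
```

This is why poisoned code snippets tend to wash out of RLVR training: wrong signatures simply fail the executed check.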
    • phoronixrly 1 hour ago
      It does work, on two levels:

      1. Simple, cheap, easy-to-detect, badly-behaved bots will scrape the poison and feed links to the expensive-to-run browser-based bots that you can't detect in any other way.

      2. Once you see a browser visit a bullshit link, you insta-ban it: you now know it's a bot, because it has been poisoned with the bullshit data.

      My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.
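The two-level scheme above can be sketched as follows. The trap URLs and function names here are invented for illustration; the point is that a trap path is unguessable, so no human could ever request it, and any client that does is provably a bot.

```python
import secrets

trap_paths = set()
banned_ips = set()

def make_trap_link() -> str:
    """Generate an unguessable trap URL to embed in poison pages."""
    path = f"/articles/{secrets.token_hex(8)}"
    trap_paths.add(path)
    return path

def handle_request(ip: str, path: str) -> str:
    """Serve a request, insta-banning any IP that touches a trap link."""
    if ip in banned_ips:
        return "403 banned"
    if path in trap_paths:
        banned_ips.add(ip)  # visiting a trap proves this client is a bot
        return "403 banned"
    return "200 ok"

trap = make_trap_link()
print(handle_request("10.0.0.9", trap))     # bot hits the trap -> banned
print(handle_request("10.0.0.9", "/home"))  # all later requests blocked
print(handle_request("10.0.0.2", "/home"))  # clean visitors unaffected
```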

  • snehesht 1 hour ago
    Why not simply blacklist or rate limit those bot IPs?
    • xprnio 47 minutes ago
      If you have real traffic and bot traffic, you still need to identify which is which. On top of that, bots very likely don't reuse the same IPs over and over again. If we knew ahead of time all the IPs used only by bots, then yeah, it would be simple to blacklist them. But although it's simple in theory, identifying what to blacklist in the first place is the part that isn't as simple.
      • snehesht 24 minutes ago
        You wouldn’t permanently block them, it’s more like a rolling window.

        You can use security challenges as a mechanism to identify false positives.

        Sure, bots can get tons of proxies for cheap, but that doesn't mean you can't block them, similar to how SSH honeypots or the Spamhaus SBL work, albeit temporarily.
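The rolling-window idea above can be sketched like this. The window size and limit are arbitrary assumptions; the point is that an IP is only rejected while it keeps exceeding the limit, never permanently.

```python
import time
from collections import defaultdict, deque

WINDOW = 60.0  # seconds of history to keep per IP (assumed)
LIMIT = 100    # requests allowed per window (assumed)

hits = defaultdict(deque)

def allow(ip, now=None):
    """Rolling-window rate limit: True if this request is within the limit."""
    now = time.monotonic() if now is None else now
    q = hits[ip]
    while q and now - q[0] > WINDOW:  # drop hits that rolled out of the window
        q.popleft()
    if len(q) >= LIMIT:
        return False                  # temporarily blocked, not blacklisted
    q.append(now)
    return True

# 100 requests pass; the 101st inside the window is rejected;
# the same IP is allowed again once the window rolls past.
results = [allow("1.2.3.4", now=float(i) * 0.1) for i in range(101)]
print(results[-2], results[-1])      # True False
print(allow("1.2.3.4", now=1000.0))  # True
```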

    • aduwah 59 minutes ago
      There are way too many to do that
      • snehesht 9 minutes ago
        True, most blacklist systems today aren't realtime, unlike AWS WAF or Cloudflare.

        We need a crawler blacklist that streams list deltas in realtime to a centralized list, from which local DBs can pull changes.

        Verified domains could push suspected bot IPs, and this engine would run heuristics to see whether there is a pattern across data sources, then issue a temporary block with an exponential TTL.

        There are many problems to solve here, but as with any OSS it will evolve over time if there is enough interest in it.

        Running this system will be hugely expensive, though. Corporate sponsors may not work out, but individual sponsors may be incentivized, since it helps them cut the bandwidth and compute costs of bot traffic.
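The "temporary block with exponential TTL" piece of the proposal above can be sketched as follows. The base TTL, the doubling policy, and all names are assumptions for illustration, not part of any existing system.

```python
BASE_TTL = 60.0  # seconds for a first offense (assumed)

strikes = {}        # ip -> number of offenses reported so far
blocked_until = {}  # ip -> time when the current block expires

def report(ip: str, now: float) -> float:
    """Record an offense; each repeat doubles the block duration."""
    strikes[ip] = strikes.get(ip, 0) + 1
    ttl = BASE_TTL * 2 ** (strikes[ip] - 1)  # 60s, 120s, 240s, ...
    blocked_until[ip] = now + ttl
    return ttl

def is_blocked(ip: str, now: float) -> bool:
    """Check whether an IP is still inside its temporary block."""
    return blocked_until.get(ip, 0.0) > now

print(report("5.6.7.8", now=0.0))        # 60.0
print(report("5.6.7.8", now=100.0))      # 120.0 (second offense doubles)
print(is_blocked("5.6.7.8", now=150.0))  # True
print(is_blocked("5.6.7.8", now=300.0))  # False, block expired
```

Exponential TTLs keep a false positive cheap (one short block) while making persistent abusers effectively permanent.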

    • phyzome 54 minutes ago
      Because punishment for breaking the robots.txt rules is a social good.
  • foxes 13 minutes ago
    Wonder if you can just avoid hiding it, to make it more believable.

    Why not have a Library of Babel-esque labyrinth visible to normal users on your website?

    Like anti-surveillance clothing, or something they have to sift through.

  • meta-level 1 hour ago
    Isn't posting projects like this the most visible way to report a bug and have it fixed as soon as possible?
    • suprfsat 1 hour ago
      "disobeys robots.txt" is more of a feature
  • Imustaskforhelp 2 hours ago
    I wish there were some regulation that could force companies who scrape for profit to reveal who they are to the websites they scrape. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.
  • imdsm 1 hour ago
    Applied model collapse
  • rvz 1 hour ago
    > Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

    Can't the LLM crawlers just ignore robots.txt or spoof their user agents anyway?

    • phoronixrly 1 hour ago
      Well-behaved agents will obey robots.txt and not fall into the trap.
  • GaggiX 1 hour ago
    These projects are the new "To-Do List" app.
  • splitbrainhack 2 hours ago
    -1 for the name
  • obsidianbases1 1 hour ago
    Why do this though?

    It's like if someone was trying to "trap" search crawlers back in the early 2000s.

    Seems counterproductive

    • bilekas 55 minutes ago
      Because of bots that don't respect robots.txt.

      If you want an AI bot to crawl your website while you pay for that bandwidth, then you won't use the tool.

    • Forgeties79 54 minutes ago
      Web crawlers didn’t routinely take down public resources or use the scraped info to generate facsimiles that people are still having ethical debates over. Their presence didn’t even register, and it was the indexing that helped sites. It isn’t remotely the same thing.

      https://www.libraryjournal.com/story/ai-bots-swarm-library-c...