Nuxt HN | OpenAI crawler burning money for nothing

I have a bunch of blog posts, with URLs like these:

  https://mywebsite/1-post-title
  https://mywebsite/2-post-title-second
  https://mywebsite/3-post-title-third
  https://mywebsite/4-etc

For some reason, it tries every combination of numbers, so the requests look like this:

  https://mywebsite/1-post-title/2-post-title-second
  https://mywebsite/1-post-title/3-post-title-third

etc.

Since the blog engine simply discards everything after the number (1, 2, 3, ...) and serves the content for blog post #1, #2, #3, ..., the web server returns a valid page. However, every one of those pages is a duplicate of an existing post.
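
A minimal sketch of the routing behavior described above, assuming the engine looks only at the leading number. The names `resolve_post_id`, `LENIENT`, and `STRICT` are hypothetical and not taken from the actual blog engine; the strict variant is one possible way to make the compound URLs return a 404 instead of a duplicate page.

```python
import re
from typing import Optional

# Hypothetical lenient router: only the leading number matters,
# everything after it (slug, extra path segments) is ignored.
LENIENT = re.compile(r"^/(\d+)")

# Hypothetical strict alternative: exactly one "<number>-<slug>" segment.
STRICT = re.compile(r"^/(\d+)-[^/]+/?$")


def resolve_post_id(path: str, strict: bool = False) -> Optional[int]:
    """Return the post id for a request path, or None (treated as a 404)."""
    pattern = STRICT if strict else LENIENT
    match = pattern.match(path)
    return int(match.group(1)) if match else None


print(resolve_post_id("/1-post-title"))                                   # 1
print(resolve_post_id("/1-post-title/2-post-title-second"))               # 1
print(resolve_post_id("/1-post-title/2-post-title-second", strict=True))  # None
```

With the lenient pattern, both the real URL and the compound URL resolve to post 1, which is why the crawler keeps getting valid pages.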

The main problem is that no page on the website contains compound links like https://mywebsite/1-post-title/2-post-title-second

So it's clearly some bug in the crawler.

Maybe OpenAI is using AI-written code for their crawler, because it has bugs so dumb you cannot believe any human would write them.

They will make 90000 requests to load my small blog with 300 posts.

Cannot imagine what happens with larger websites that have thousands of blog posts.
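
The 90000 figure is consistent with the crawler requesting every post URL with every post slug appended as a second path segment (300 × 300). A rough sketch of that count, assuming placeholder slugs in place of the real post titles and counting a post paired with itself:

```python
from itertools import product

# Hypothetical slugs standing in for the 300 real posts.
slugs = [f"{n}-post-title" for n in range(1, 301)]

# Every post URL combined with every slug appended as a second segment.
compound_urls = [
    f"https://mywebsite/{first}/{second}"
    for first, second in product(slugs, repeat=2)
]

print(len(slugs))          # 300
print(len(compound_urls))  # 90000
```

The count grows quadratically, so under the same assumption a site with 3000 posts would see 9000000 such URLs.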

12 points | by babuskov 525 days ago

6 comments

readyplayernull 525 days ago
They have decided to set the web on fire:
https://news.ycombinator.com/item?id=42660377
markus_zhang 524 days ago
I wonder if one can build maze webpages to trap these AI crawlers. A human visitor would never be bothered, but once a visitor is identified as a crawler, the site dynamically generates page after page of garbage. The site doesn't need to store all that garbage, but the crawler has to.
- KomoD 523 days ago
  I did see someone here on HN that did that, but not specifically for AI crawlers. Every page had links to more pages and those pages had links to even more pages (and it just goes on forever)
codemusings 524 days ago
For what it's worth: they do honor the robots.txt file. I had the same problem with a client's CMS and denying all AI crawler user agents did the trick.
It's clear they've all gone mad. The traffic spiked 400% overnight and made the CMS unresponsive a few times a day.
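
A minimal sketch of such a robots.txt rule, checked with Python's standard `urllib.robotparser`. `GPTBot` is the user agent OpenAI documents for its crawler; the rule set here is an illustrative assumption, not the commenter's actual file, and other AI crawlers would each need their own entry.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content: deny the whole site to GPTBot only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot is denied; agents without a matching entry remain allowed.
print(parser.can_fetch("GPTBot", "https://mywebsite/1-post-title"))     # False
print(parser.can_fetch("Googlebot", "https://mywebsite/1-post-title"))  # True
```

This only verifies how a compliant parser reads the rules; it relies on the crawler actually honoring robots.txt, as the comment above reports.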
gbertb 525 days ago
How are the links structured in the a href attribute? Are they relative or absolute? If relative, then that's probably why.
- babuskov 525 days ago
  Relative.
  For example, the page:
  https://website/blog/1-post
  contains:
  href="2-post"
  Browsers and other bots like Google Bot correctly interpret this as a link to
  https://website/blog/2-post
  While OpenAI crawler goes to:
  https://website/blog/1-post/2-post
  I wonder if there is some way to report this bug to them?
  - fullstackwife 524 days ago
    Google recommends absolute urls:
    https://web.archive.org/web/20221208150134/https://www.webma...
  - AznHisoka 524 days ago
    I actually think OpenAI is right, unless you have a base URL tag? That’s a relative URL, and it’s relative to the current URL you are on, not the root domain.
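
For reference, the relative-link resolution discussed in this sub-thread can be checked with Python's standard `urllib.parse.urljoin`, which follows the RFC 3986 rules browsers use. A small sketch using the URLs from the example above, assuming no base tag is present on the page:

```python
from urllib.parse import urljoin

page = "https://website/blog/1-post"

# Standard resolution: the last path segment of the base is replaced.
print(urljoin(page, "2-post"))        # https://website/blog/2-post

# With a trailing slash the base is treated as a directory,
# which reproduces the compound URL seen in the crawler's requests.
print(urljoin(page + "/", "2-post"))  # https://website/blog/1-post/2-post

# A root-relative link resolves the same way from either base.
print(urljoin(page, "/blog/2-post"))  # https://website/blog/2-post
```

The second result is consistent with, though not proof of, a crawler that treats the current page URL as a directory before resolving the link. Root-relative or absolute links avoid the ambiguity either way.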
thiago_fm 524 days ago
They believe they can take market share from Google, which currently has a market cap of over $2T, so with that amount of money on the line, they don't care if they hammer down the internet or how much hate and how many lawsuits they will get.
The issue is that they don't understand that the search business took decades to develop into what it currently is. And it is only so profitable to Google because they hold a monopoly because the US is an oligarchy.
The stuff OpenAI is building has been proven to be easy (and expensive) to replicate, with many competitors having posted similar results, even while starting later.
Whatever new iteration of the search business they will develop will likely mean profits will be smaller, but nobody cares as long as there are billions being invested in this space.
Not to mention their AGI goals, when you can't reliably trust their software to answer basic questions.
So, currently we are at the internet of trash age. We now have trash content being generated, trash bots hammering your tiny website and trash ambitions.
I doubt this CAPEX will go on for more than 2 years. Once the bubble bursts, companies will start reviewing what they built and will fix the crawler bug you've just mentioned.
101008 524 days ago
Cloudflare should provide a service (paid or free) to block AI crawlers.
- CallMeMarc 524 days ago
  They actually already have that!
  You can find it at Security > Bots > Block AI Bots