News publishers limit Internet Archive access due to AI scraping concerns

(niemanlab.org)

212 points | by ninjagoo 2 hours ago

29 comments

kevincloudsec 1 hour ago
There's a compliance angle to this that nobody's talking about. Regulatory frameworks like SOC 2 and HIPAA require audit trails and evidence retention. A lot of that evidence lives at URLs. When a vendor's security documentation, a published incident response, or a compliance attestation disappears from the web and can't be archived, you've got a gap in your audit trail that no auditor is going to be happy about.
I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.
- alexpotato 1 hour ago
  > Regulatory frameworks like SOC 2 and HIPAA require audit trails and evidence retention
  Sidebar:
  Having been part of multiple SOC audits at large financial firms, I can say that nothing brings adults closer to physical altercations in a corporate setting than trying to define which jobs are "critical".
  - The job that calculates the profit and loss for the firm, definitely critical
  - The job that cleans up the logs for the job above, is that critical?
  - The job that monitors the cleaning up of the logs, is that critical too?
  These are simple examples but it gets complex very quickly and engineering, compliance and legal don't always agree.
  - Ucalegon 6 minutes ago
    That's when you reach out to your insurer and ask them their requirements as per the policy and/or if there are any contractual obligations associated with the requirements which might touch indemnity/SLAs. If it does, then it is critical; if not, then it's the classic conversation of cost vs. risk mitigation/tolerance.
  - a13n 53 minutes ago
    depends, if you don’t clean up the logs and monitor that cleanup will it eventually hit the p&l? eg if you fail compliance audits and lose customers over it? then yes. it still eventually comes back to the p&l.
  - hsbauauvhabzb 12 minutes ago
    And in the big scheme of things, none of those things are even important, your family, your health and your happiness are :-)
- ninjagoo 1 hour ago
  At some point Insurance is going to require companies to obtain paper copies of any documentation/policies, precisely to avoid this kind of situation. It may take a while to get there though. It'll probably take a couple of big insurance losses before that happens.
  - kevincloudsec 1 hour ago
    Insurance is already moving that direction for cyber policies. Some underwriters now require screenshots or PDF exports of third-party vendor security attestations as part of the application process, not just URLs. The carriers learned the hard way that 'we linked to their SOC 2 landing page' doesn't hold up when that page disappears after an acquisition or rebrand.
  - layer8 1 hour ago
    More likely, there will be trustee services taking care of document preservation, themselves insured in case of data loss.
    - ninjagoo 1 hour ago
      Isn't the Internet Archive such a trustee service?
      Or are you thinking of companies like Iron Mountain that provide such a service for paper? But even within corporations, not everything goes to a service like Iron Mountain, only paper that is legally required to be preserved.
      A society that doesn't preserve its history is a society that loses its culture over time.
      - layer8 1 hour ago
        The context was regulatory requirements for companies. I mean that as a business you pay someone to take care of your legal document preservation duties, and in case data gets lost, they will be liable for the financial damage this incurs to you. Outsourcing of risk against money.
        - ninjagoo 1 hour ago
          Whether or not the Internet Archive counts as a legally acceptable trustee service is being litigated in the court systems [1]. The link is a bit dated so unsure what the current situation is. There's also this discussion [2].
          [1] https://www.mololamken.com/assets/htmldocuments/NLJ_5th%20Ci...
          [2] https://www.nortonrosefulbright.com/en-au/knowledge/publicat...
  - mycall 1 hour ago
    Also, getting insurance to pay for cybercrimes is hard and sometimes doesn't justify their costs.
  - seanmcdirmid 1 hour ago
    Digital copies will also work. I don’t understand why they don’t just save both the URL and the content at the URL when last checked.
    - ninjagoo 1 hour ago
      I think maybe because the contents of the URL archived locally aren't legally certifiable as genuine - the URL is the canonical source.
      That's actually a potentially good business idea - a legally certifiable archiving software that captures the content at a URL and signs it digitally at the moment of capture. Such a service may become a business requirement as Internet archivability continues to decline.
    - trollbridge 1 hour ago
      What if the TOS expressly prohibits archiving it, and it's also copyrighted?
      - pixl97 1 hour ago
        Then said writers of TOS should be dragged in front of a judge to be berated, then tarred and feathered, and run out of the courtroom on a rail.
        Having your cake and eating it too should never be valid law.
        - croes 32 minutes ago
          Maybe we should start with those who made such copyright claims a possibility in the first place
          - wizzwizz4 30 minutes ago
            They're long, long dead.
- mycall 1 hour ago
  Perhaps those companies should have performed verified backups of third-party vendor's published security policies into a secure enclave with paired keys with the auditor, to keep a trail of custody.
- riddlemethat 1 hour ago
  https://www.page-vault.com/ These guys exist to solve that problem.
- staticassertion 1 hour ago
  > I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited.
  Seriously? What kind of auditor would "fail" you over this? That doesn't sound right. That would typically be a finding and you would scramble to go appease your auditor through one process or another, or reach out to the vendor, etc, but "fail"? Definitely doesn't sound like a SOC2 audit, at least.
  Also, this has never been particularly hard to solve for me (obviously biased experience, so I wonder if this is just a bubble thing). Just ask companies for actual docs, don't reference URLs. That's what I've typically seen: you get a copy of their SOC2, pentest report, and controls, and you archive them yourself. Why would you point at a URL? I've actually never seen that tbh and if a company does that it's not surprising that they're "failing" their compliance reviews. I mean, even if the web were more archivable, how would reliance on a URL be valid? You'd obviously still need to archive that content anyway?
  Maybe if you use a tool that you don't have a contract with or something? I feel like I'm missing something, or this is something that happens in fields like medical that I have no insight into.
  This doesn't seem like it would impact compliance at all tbh. Or if it does, it's impacting people who could have easily been impacted by a million other issues.
f33d5173 1 hour ago
So instead of scraping IA once, the AI companies will use residential proxies and each scrape the site themselves, costing the news sites even more money. The only real loser is the common man who doesn't have the resources to scrape the entire web himself.
I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
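A minimal sketch of the "every resource is tied to a hash" idea, for illustration only. It assumes SHA-256 as the address and uses hypothetical names (`ContentStore`, `fetch_verified`); it is not how IPFS actually encodes its content identifiers. The point it shows is that a reader can take a copy from any untrusted rehoster and still check that it is the right content.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: a resource's address is the SHA-256 of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, not chosen by the host.
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        # Raises KeyError if this store does not hold the resource.
        return self._blobs[address]


def fetch_verified(address: str, mirrors: list) -> bytes:
    """Try each mirror in turn; accept the first copy whose hash matches the address."""
    for mirror in mirrors:
        try:
            data = mirror.get(address)
        except KeyError:
            continue  # this mirror does not have it, try the next one
        if hashlib.sha256(data).hexdigest() == address:
            return data  # content matches its address, so the mirror need not be trusted
    raise LookupError(f"no mirror returned a valid copy of {address}")
```

Because the address commits to the bytes, a mirror that serves altered content is simply skipped, which is what would make third-party rehosting safe in such a scheme.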
- CqtGLRGcukpy 1 hour ago
  The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.
  This is from my experience having a personal website. AI companies keep coming back even if everything is the same.
  - giancarlostoro 4 minutes ago
    Weird. Considering IA has most of its content in a form you could rehost, idk why nobody’s just hosting an IA carbon copy that AI companies can hit endlessly, and cutting IA a nice little check in the process. But I guess some of the wealthiest AI startups are very frugal about training data?
    This also goes back to something I said long ago: AI companies are relearning software engineering poorly. I can think of so many ways to speed up AI crawlers, I'm surprised someone being paid 5x my salary cannot.
- fartfeatures 1 hour ago
  IPFS was an attempt at this: https://en.wikipedia.org/wiki/InterPlanetary_File_System
  - Seattle3503 25 minutes ago
    Is there a good post-mortem of IPFS out there?
  - lukeasch21 1 hour ago
    Coincidentally most of the funding towards IPFS development dried up because the VC money moved onto the very technology enabling these problems...
- toomuchtodo 1 minute ago
  AI browsers will be the scrapers, shipping content back to the mothership for processing and storage.
- demetris 28 minutes ago
  I don’t believe resips will be with us for long, at least not to the extent they are now. There is pressure and there are strong commercial interests against the whole thing. I think the problem will solve itself in some part.
  Also, I always wonder about Common Crawl:
  Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there so they need to crawl our sites over and over again for the exact same stuff, each on its own?
- Operyl 1 hour ago
  They already are, I've been dealing with Vietnam and Korea residential proxies destroying my systems for weeks, I'm growing tired. I cannot survive 3500 RPS 24/7.
- raincole 1 hour ago
  Even if the site is archived on IA, AI companies will still do the same.
daniel31x13 9 minutes ago
I maintain an open-source project called Linkwarden and this exact discussion is one of the reasons why it exists, teams needed a way to preserve referenced URLs reliably without having to depend on external services.
It stores webpages in multiple formats (HTML snapshot, screenshot, PDF snapshot, and a fully dedicated reader view) so you’re not relying on a single fragile archive method.
There’s both a hosted cloud plan [1] which directly supports the project, and a fully self-hosted option [2], depending on how much control you need over storage and retention.
[1]: https://linkwarden.app
[2]: https://github.com/linkwarden/linkwarden
jruohonen 1 hour ago
It affects science too (and there you'd want solid archiving as much as possible). Increasingly, meta-data is full of errors and general purpose search engines for science are breaking down, including even things like Google Scholar. I suppose some big science publishers are blocking AI bots too.
- shevy-java 1 hour ago
  Google ruined its own search engine on top of that as well though.
  We are increasingly becoming blind. To me it looks as if this is done on purpose actually.
  - salawat 1 hour ago
    It was. Advertising is incompatible with accurate data retrieval/routing. We've also implemented "obligation to deindex". So providing an unbiased index of the web as she is is essentially (in the U.S.) verboten.
- ninjagoo 1 hour ago
  > I suppose some big science publishers are blocking AI bots too.
  That's a travesty, considering that a huge chunk of science is public-funded; the public is being denied the benefits of what they're paying for, essentially.
  - galleywest200 1 hour ago
    The public can still access the sites themselves.
    - ninjagoo 1 hour ago
      > The public can still access the sites themselves.
      Indefinitely? Probably not.
      What about when a regime wants to make the science disappear?
      - thwarted 1 hour ago
        So the solution is to allow the AI scraping and hide the content, with significantly reduced fidelity and accuracy and not in the original representation, in some language model?
      - pa7ch 1 hour ago
        What has that got to do with blocking AI crawlers?
        - ninjagoo 1 hour ago
          If it's publicly funded, why shouldn't AI crawlers have access to that data? Presumably those creating the AI crawlers paid taxes that paid for the science.
          - JumpCrisscross 31 minutes ago
            > If it's publicly funded, why shouldn't AI crawlers have access to that data?
            Because it costs money to serve them the content.
ninjagoo 2 hours ago
Publishers like The Guardian and NYT are blocking the IA/Wayback Machine. 20% of news websites are blocking both IA and Common Crawl. As an example, https://www.realtor.com/news/celebrity-real-estate/james-van... is unarchivable, with IA being 429ed while the site is accessible otherwise.
- trollbridge 1 hour ago
  And whilst the IA will honour requests not to archive/index, more aggressive scrapers won't, and will disguise their traffic as normal human browser traffic.
  So we've basically decided we only want bad actors to be able to scrape, archive, and index.
  - JumpCrisscross 30 minutes ago
    > we've basically decided we only want bad actors to be able to scrape, archive, and index
    AI training will be hard to police. But a lot of these sites inject ads in exchange for paywall circumvention. Just scanning Reddit for the newest archive.is or whatever should cut off most of the traffic.
- fc417fc802 1 hour ago
  Presumably someone has already built this and I'm just unaware of it, but I've long thought some sort of crowd sourced archival effort via browser extension should exist. I'm not sure how such an extension would avoid archiving privileged data though.
  - ajb 7 minutes ago
    That exists for court documents (RECAP) but I think they didn't have to solve the issue of privilege as PACER publishes unprivileged docs.
Brian_K_White 1 hour ago
Time for a crowd source plugin that relays copies of what individuals view right from the browser.
Users control what sites they want to allow it to record so no privacy worries, especially assuming the plugin is open source.
No automated crawling. The plugin does not drive the users browser to fetch things. Just whatever a user happens to actually view on their own, some percentage of those views from the activated domains gets submitted up to some archive.
Not every view, just like maybe 100 people each submit 1% of views, and maybe it's a random selection or maybe it's weighted by some feedback mechanism where the archive destination can say "Hey if the user views this particular url, I still don't have that one yet so definitely send that one if you see it rather than just applying the normal random chance"
Not sure how to protect the archive itself or its operators.
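The submission rule described above could look roughly like this. It is only a sketch under assumptions: `should_submit`, `allowed_domains`, `wanted_urls` and the 1% default are hypothetical names and values, and a real browser extension would be written against the browser's extension APIs rather than in Python. It only decides whether a view the user already made gets relayed; it never fetches anything itself.

```python
import random
from urllib.parse import urlsplit


def should_submit(url, allowed_domains, wanted_urls, sample_rate=0.01, rng=random.random):
    """Decide whether one page view is relayed to the archive.

    allowed_domains: domains the user has opted in to recording (assumed input).
    wanted_urls: URLs the archive has said it still lacks (assumed feedback channel).
    sample_rate: fraction of ordinary views to submit, e.g. 0.01 for 1%.
    """
    host = urlsplit(url).hostname or ""
    # Only consider views on domains the user explicitly activated.
    if not any(host == d or host.endswith("." + d) for d in allowed_domains):
        return False
    # The archive asked for this exact URL, so always send it.
    if url in wanted_urls:
        return True
    # Otherwise submit a random fraction of views.
    return rng() < sample_rate
```

With 100 users each sampling at 1%, a page viewed once by each of them is submitted about once on average, which is the kind of spread the proposal is after.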
- nerdsniper 9 minutes ago
  For a historical archive, the issue with this is that it could be difficult to ensure that the data being sent from users' devices wasn't modified in some way, leading to an inaccurate archival copy.
- digiown 59 minutes ago
  SingleFile does the archiving fairly well.
  > no privacy worries
  This is harder than you might expect. Publishing these files is always risky because sites can serve you fingerprinting data, like some hidden HTML tag containing your IP and other identifiers.
  - Brian_K_White 25 minutes ago
    oof good point
upboundspiral 1 hour ago
I feel like a government funded search engine would resolve a lot of the issues with the monetized web.
The purpose of a search engine is to display links to web pages, not the entire content. As such, it can be argued it falls under fair use. It provides value to the people searching for content and those providing it.
However, we left such a crucially important public utility in the hands of private companies that changed their algorithms many times in order to maximize their profits and not the public good.
I think there needs to be real competition, and I am increasingly becoming certain that the government should be part of that competition. Both "private" companies and "public" government are biased, but are biased in different ways, and I think there is real value to be created in this clash. It makes it easier for individuals to pick and choose the best option for themselves, and for third independent options to be developed.
The current cycle of knowledge generation is academia doing foundational research -> private companies expanding this research and monetizing it -> nothing. If the last step was expanded to the government providing a barebones but usable service to commoditize it, years after private companies have been able to reap immense profits, then the capabilities of the entire society are increased. If the last step is prevented, then the ruling companies turn to rent-seeking and sitting on their laurels, turn from innovating to extracting.
- LPisGood 51 minutes ago
  The government having the power to curate access to information seems bad. You could try to separate it as an independent agency, but as the current US administration is showing, that’s not really a thing.
- digiown 36 minutes ago
  We can start by forcing sites to treat crawlers equally. Google's main moat is less physical infrastructure or the algorithms, and more that sites allow only Google to scrape and index them.
  They can charge money for access or disallow all scrapers, but it should not be allowed to selectively allow only Google.
  - charcircuit 25 minutes ago
    It's not like only allowing Google actually means that only Google is allowed forever. Crawlers are free to make agreements with sites to allow themselves to crawl easier or pretend they are a regular user to bypass whatever block they are trying to do.
- underlipton 34 minutes ago
  I'm feeling it. Addressing the other reply: zero moderation or curation, and zero shielding from the crawler, if what you've posted is on a public network. Yes, users will be able to access anything they can think of. And the government will know. I think you don't have to worry about them censoring content; they'll be perfectly happy to know who's searching for CSAM or bomb-making materials. And if people have an issue with what the government does with this information (for example, charging people who search for things the Tangerine-in-Chief doesn't want you to see), you stop it at the point of prosecution, not data access. (This does only work in a society with a functioning democracy... but free information access is also what enables that. As Americans, with our red-hot American blood, do we dare?)
derefr 2 hours ago
I wonder if these publishers would be more amenable to a private archiver that only serves registered academic / journalistic research projects (the way most physical private archives do), with a specific provision to never provide data to companies that would resell it or use it for training of generative models.
- coffeefirst 4 minutes ago
  Yes. Most publishers already do syndication deals. This is a fine idea.
  The problem with the LLMs is they capture the value chain and give back nothing. It didn’t have to be this way. It still doesn’t.
- eternauta3k 1 hour ago
  They already have archives with online and printed articles which they license to libraries, because the libraries take care of rate limiting and limiting abuse.
- ninjagoo 2 hours ago
  They probably have internal archives if they're smart; but that isn't accessible to the public. I think the issue isn't whether the data is archived, but whether that information is available to the public for the foreseeable future.
  - g-b-r 1 hour ago
    They sure have archives of the newspapers, they're much less likely to have archives of what they publish online.
    And a local archive is one fire, business decision, poor technical choice etc away from getting permanently lost
nananana9 1 hour ago
The silver lining is that it's increasingly not worth being archived as well.
- idiotsecant 1 hour ago
  We really lucked out existing at a time when the internet was a place for weirdos and enthusiasts. I think those days are well and done.
- Flavius 1 hour ago
  Agreed. It’s mostly just disposable clickbait masquerading as journalism at this point. Outside of feeding people's FOMO, there's little content worth preserving for history.
bmiekre 17 minutes ago
Explain it to me like I’m 5: why is AI scraping the Wayback Machine bad?
RajT88 1 hour ago
Proposed solution:
Sell a "truck full of DAT tapes" type service to AI scrapers with snapshots of the IA. Sort of like the cloud providers have with "Data Boxes".
It will fund IA, be cheaper than building and maintaining so many scrapers, and may relieve the pressure on these news sites.
- atrus 1 hour ago
  Even sites that already have that option (like Wikipedia) still report being hammered by scrapers. It's the fully funded aligned with the incompetent at work here.
- digiown 1 hour ago
  IA has always been in legal jeopardy without offering paid access. For that to work we need to get rid of copyright first.
cdrnsf 1 hour ago
This is a natural response to AI companies plundering the web to enrich themselves and provide no benefit to the sites being scraped.
holoduke 7 minutes ago
The end of traditional news sites is coming, at least for the newspaper websites. Future MCP-like systems will generate news sites on the fly in your desired style and content. Journalists will have some kind of pay-per-view model provided by these GPT-like platforms, which of course will take too big a chunk. I can't imagine a WSJ is able to survive.
jackfranklyn 39 minutes ago
There's a mundane version of this that hits small businesses every day. Platform terms of service pages, API documentation, pricing policies, even the terms you agreed to when you signed up for a SaaS product - these all live at URLs that change or vanish.
I've been building tools that integrate with accounting platforms and the number of times a platform's API docs or published rate limits have simply disappeared between when I built something and when a user reports it broken is genuinely frustrating. You can't file a support ticket saying "your docs said X" when the docs no longer say anything because they've been restructured.
For compliance specifically - HMRC guidance in the UK changes constantly, and the old versions are often just gone. If you made a business decision based on published guidance that later changes, good luck proving what the guidance actually said at the time. The Wayback Machine has saved me more than once trying to verify what a platform's published API behaviour was supposed to be versus what it actually does.
The SOC 2 / audit trail point upthread is spot on. I'd add that for smaller businesses, it's not just formal compliance frameworks - it's basic record keeping. When your payment processor's fee schedule was a webpage instead of a PDF and that webpage no longer exists, you can't reconcile why your fees changed.
notepad0x90 34 minutes ago
The internet isn't so simple anymore. I think it's important to separate commercial websites from non-commercial ones. Commercial sites shouldn't be expected to be archivable to begin with, unless it's part of their business model. A lot of sites (like Reddit) started off as ad-supported sites, but now they're commercial (not just post-IPO, but accepting payments and selling things to/from consumers). Even for ad-supported sites, there is a difference between ad-supported non-profits and sites that exist to generate revenue from ads. As in, the primary purpose of the site is to generate ad revenue, and the content is just a means to that end.
I've said it before, and I'll say it again: The main issue is not design patterns, but lack of acceptable payment systems. The EU with their dismantling of Visa and Mastercard now have the perfect opportunity to solve this, but I doubt they will. They'll probably just create a European WeChat.
yellowapple 30 minutes ago
Framing this as some anti-AI thing is wild. The simpler, more obvious, and more evidenced reason for this is that these sites want to make money with ads and paywalls that an archived copy tends to omit by design. Scapegoating AI lets them pretend that they're not the greedy bad guys here — just like how the agricultural sector is hell-bent on scapegoating AI (and lawns, and golf courses, and long showers, and free water at restaurants) for excess water consumption when even the worst-offending datacenters consume infinitesimally-tiny fractions of the water farms in their areas consume.
Havoc 2 hours ago
Yup. Recently built something that needs to do low volume scraping. About 40% success rate - rest hits bot detection even on first try
- ninjagoo 2 hours ago
  Did you have rate limits built in? Ultimately scraping tools will need to mimic humans. Ironic.
  I wonder if bots/ai will need to build their own specialized internet for faster sharing of data, with human centered interfaces to human spaces.
  - fc417fc802 1 hour ago
    IPFS and IPNS already exist.
mellosouls 36 minutes ago
editorialised. Original title (submitted previously a few times correctly by others):
News publishers limit Internet Archive access due to AI scraping concerns
shevy-java 1 hour ago
> The Financial Times, for example, blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive
But then it was not really open content anyway.
> When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
Well - we need something like wikipedia for news content. Perhaps not 100% wikipedia; instead, wikipedia to store the hard facts, with tons of verification; and a news editorial that focuses on free content but in a newspaper-style, e. g. with professional (or good) writers. I don't know how the model could work, but IF we could come up with this then newspapers who have gatewalls to information would become less relevant automatically. That way we win long-term, as the paid gatewalls aren't really part of the open web anyway.
- ninjagoo 1 hour ago
  Wikipedia relies on the institutional structure of journalism, with newsroom independence, journalistic standards, educational system and probably a ton of other dependencies.
  Journalism as an institution is under attack because the traditional source of funding - reader subscriptions to papers - no longer works.
  To replicate the Wikipedia model would need to replicate the structure of Journalism for it to be reliable. Where would the funding for that come from? It's a tough situation.
- riquito 1 hour ago
  > we need something like wikipedia for news content
  Interesting idea. It could be something that archives first and releases at a later date, when the news isn't as new.
- fc417fc802 1 hour ago
  > a news editorial that focuses on free content but in a newspaper-style
  Isn't that what state funded news outlets are?
- JumpCrisscross 1 hour ago
  > it was not really open content anyway
  Practically no quality journalism is.
  > we need something like wikipedia for news
  Wikipedia editors aren’t flying into war zones.
  - fc417fc802 1 hour ago
    Statistically, at least a few of them live in war zones. And I'm sure some of them would fly in to collect data if you paid them for it.
    - JumpCrisscross 28 minutes ago
      > at least a few of them live in war zones
      Which is a valuable perspective. But it's not a substitute for a seasoned war journalist who can draw on global experience. (And relating that perspective to a particular home market.)
      > I'm sure some of them would fly in to collect data if you paid them for it
      Sure. That isn't "a news editorial that focuses on free content but in a newspaper-style, e. g. with professional (or good) writers."
      One part of the population imagines journalists as writers. They're fine on free, ad-supported content. The other part understands that investigation is not only resource intensive, but also requires rare talent and courage. That part generally pays for its news.
      Between the two, a Wikipedia-style journalistic resource is not entertaining enough for the former and not informative enough for the latter. (Importantly, compiling an encyclopedia is principally the work of research and writing. You can be a fine Wikipedia–or scientific journal or newspaper–editor without leaving your room.)
  - ghaff 1 hour ago
    Well, and it would be considered "original research" anyway which some admin would revert.
colesantiago 17 minutes ago
I fear that these news publishers will come after RSS next, as I see hundreds of AI companies misusing the terms of the news publishers' RSS feeds for profit through mass scraping.
They do not care, and we will all be worse off for it if these AI companies keep bombarding news publishers' RSS feeds.
It is a shame that the open web as we know it is closing down because of these AI companies.
WesBrownSQL 1 hour ago
As someone who has been dealing with SOC 2, HIPAA, ISO 9001, etc., for years, I have always maintained copies of the third-party agreements for all of our downstream providers for compliance purposes. This documentation is collected at the time of certification, and our policies always include a provision for its retrieval on schedule. The problem comes when you certify that their policy said X and that they were in compliance, they quietly change it without sending proper notification downstream to us, and captain lawsuit comes by: we have to be able to prove that they did claim they were in compliance at the time we certified. We don't want to rely on their ability to produce that documentation. We can't prove that it wasn't tampered with, or that there is a chain of custody for their documentation and policies. If I wanted to use a vendor that wouldn't provide that information, then I didn't use them. Welcome to the world of highly regulated industries.
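One way to make a locally kept copy tamper-evident is a hash-chained log of captures, sketched below. This is an illustration under assumptions, not a description of any particular compliance tool: `append_record`, `verify_log` and the record fields are hypothetical, and the timestamp is simply supplied by the caller. A chain like this only shows that earlier records were not altered after later ones were added; proving when a capture happened to an outside party would likely also need a signature or a trusted timestamp from someone independent.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _hash_record(record):
    # Hash every field except the record's own hash, with stable key ordering.
    body = {k: v for k, v in record.items() if k != "record_hash"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log, url, content: bytes, captured_at: str):
    """Append a capture of `content` for `url`, linked to the previous record."""
    record = {
        "url": url,
        "captured_at": captured_at,  # caller-supplied timestamp (assumption)
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "prev_hash": log[-1]["record_hash"] if log else GENESIS,
    }
    record["record_hash"] = _hash_record(record)
    log.append(record)
    return record


def verify_log(log):
    """Return True if no record was altered, removed, or reordered."""
    prev_hash = GENESIS
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False  # chain link broken
        if record["record_hash"] != _hash_record(record):
            return False  # record contents changed after the fact
        prev_hash = record["record_hash"]
    return True
```

Changing any stored field of an earlier record, such as its `content_sha256`, makes `verify_log` return False, so a quiet edit to the archived policy record would show up at review time.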
zeagle 2 hours ago
I mean why wouldn’t they? All their IP was scraped for AI training, at their own cost of hosting it. It further pulls away from their own business models as people ask the AI models the questions instead of reading primary sources. Plus it doesn’t seem likely they’ll ever be compensated for that loss given the economy is all in on AI. At least search engines would link back.
- szmarczak 1 hour ago
  Those countermeasures don't really have an effect in terms of scraping. Anyone skilled can overcome any protection within a week or two. By officially blocking IA, IA can't archive those websites in a legal way, while all major AI companies use copyrighted content without permission.
  - zeagle 1 hour ago
    For sure. There are many billions and brilliant engineers propping up AI so they will win any cat and mouse game of blocking. It would be ideal if sites gave their data to IA and IA protected it from exactly what you describe. But as someone who intentionally uses AI tools almost daily (mainly open evidence), IMO blame the abuser, not the victim, that it has come to this.
    - szmarczak 1 hour ago
      I'm not blaming the victim, but don't play the 'look what you made me do' game. Making content accessible to anyone (even behind a paywall) is a risk they need to take nevertheless. It's impossible to know upfront if the content is used for consumption or to create derived products (e.g. write an article in NYT style etc.). If this was a newspaper, this would be equivalent to scanning paper and then training AI. You can't prevent scanning, as the process is based on exactly the same phenomenon that makes your eyes see, iow information being sent and received. The game was lost before it even started.
- ninjagoo 1 hour ago
  That is a good question. However, copyright exists (for a limited time) to allow for them to be compensated. AI doesn't change that. It feels like blocking AI-use is a ploy to extract additional revenue. If their content is regurgitated within copyright terms, yes, they should be compensated.
  [-]
  - fc417fc802 1 hour ago
    The problem is that producing a mix of personalized content that doesn't appear (at least on its face) to violate copyright still completely destroys their business model. So either copyright law needs to be updated or their business model does.
    Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.
    [-]
    - ninjagoo 1 hour ago
      > Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.
      Great point. If my personal AI assistant cannot find your product/website/content, it effectively may no longer exist! For me. Ain't nobody got the time to go searching that stuff up and sifting through the AI slop. The pendulum may even swing the other way and the publishers may need to start paying me (or whoever my gatekeeper is) for access to my space...
zachlatta 1 hour ago
The death of trust on the cloud.
g-b-r 1 hour ago
This is awful, they need to at the very least allow private archivals.
Maybe the Internet Archive would be OK with keeping some things private until x time passes, or they could require an account to access them.
JumpCrisscross 1 hour ago
Let’s be honest, one of the most-common uses of these archive sites has been paywall circumvention. An academics-only archive might make sense, or one that is mutually-owned and charges a fee for lookup. But a public archive for content that costs money to make obviously doesn’t work.
[-]
- lurking_swe 1 hour ago
  if that’s the real motive, why don’t they allow access to scrape content after some period? when that news is not as relevant. For example after 6 months.
  [-]
  - JumpCrisscross 32 minutes ago
    > why don’t they allow access to scrape content after some period? when that news is not as relevant. For example after 6 months
    I believe many publications used to do this. The novel threat is AI training. It doesn't make sense to make your back catalog de facto public for free like that. There used to be an element of goodwill in permitting your content to be archived. But if the main uses are circumventing compensation and circumventing licensing requirements, that goodwill isn't worth much.
  - otterley 26 minutes ago
    Enabling research is a business model for many publications. Libraries pay money for access to the publishers’ historical archives. They don’t want to cannibalize any more revenue streams; they’re already barely still operating as it is.
macinjosh 2 hours ago
We need something like SETI@home/Folding@home but for crawling and archiving the web or maybe something as simple as a browser extension that can (with permission) archive pages you view.
[-]
- dunder_cat 2 hours ago
  This exists, although not in the traditional BOINC space: it's Archiveteam^1. I run two of their warrior^2 instances in my home k3s instance via the docker images. One of them is set to "Team's choice", where it spends most of its time downloading Telegram chats. However, when they need the firepower for sites at imminent risk of closure, it will switch itself to those. The other one is set to their URL shortener project, "Terror of Tiny Town"^3.
  Their big requirement is you need to not be doing any DNS filtering or blocking of access to what it wants, so I've got the pod DNS pointed to the unfiltered quad9 endpoint and rules in my router to allow the machine it's running on to bypass my PiHole enforcement+outside DNS blocks.
  ^1 https://wiki.archiveteam.org/
  ^2 https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
  ^3 https://wiki.archiveteam.org/index.php/URLTeam
- ninjagoo 2 hours ago
  In the US at least, there is no expectation of privacy in public. Why should these websites that are public-facing get an exemption from that? Serving up content to the public should imply archivability.
  Sometimes it feels like AI-use concerns are a guise to diminish the public record, while on the other hand services like Ring or Flock are archiving the public forever.
  [-]
  - sejje 2 hours ago
    Ring and Flock are not a standard we should be striving towards. Their massive databases tracking citizens need to go.
- pclmulqdq 2 hours ago
  Your TV probably does that, and you definitely gave it permission when you clicked "accept" on the terms.
- ryoshu 2 hours ago
  This is a good idea. Not sure what ToS it would violate. But a good idea.
blell 1 hour ago
That’s good. I don’t like archival sites. Let things disappear.
[-]
- braebo 51 minutes ago
  Yea.. I’ve noticed data hoarding largely resembles yet-another form of death denialism.
OGEnthusiast 2 hours ago
If most of the Internet is AI-generated slop (as is already the case), is there really any value in expending so much bandwidth and storage to preserve it? And on the flip side, I'd imagine the value of a pre-2022 (ChatGPT launch) Internet snapshot on physical media will probably increase astronomically.
[-]
- nicole_express 2 hours ago
  The sites that are most valuable to preserve are likely the same ones that are most likely to put up barriers to archiving
- ninjagoo 2 hours ago
  Perhaps the AI slop isn't worth preserving, but the unarchivability of news and other useful content has implications for future public discourse, historians, legal matters and who knows what else.
  In the past libraries used to preserve copies of various newspapers, including on microfiche, so it was not quite feasible to make history vanish. With print no longer out there, the modern historical record becomes spotty if websites cannot be archived.
  Perhaps there needs to be a fair-use exception or even a (god forbid!) legal requirement to allow archivability? If a website is open to the public, shouldn't it be archivable?
sejje 2 hours ago
This is a good thing, IMO.
I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.
[-]
- GaryBluto 1 hour ago
  > I am sad about link rot and old content disappearing, but it's better than everything be saved for all time, to be used against folks in the future.
  I don't understand this line of thinking. I see it a lot on HN these days, and every time I do I think to myself "Can't you realize that if things kept on being erased we'd learn nothing from anything, ever?"
  I've started archiving every site I have bookmarked in case of such an eventuality when they go down. The majority of websites don't have anything to be used against the "folks" who made them. (I don't think there's anything particularly scandalous about caring for doves or building model planes)
- otterley 2 hours ago
  Consider the impact, though, on our ability to learn and benefit from history. If the records of people’s activities cannot be preserved, are we doomed to live in ignorance?
  [-]
  - sejje 1 hour ago
    I don't think so. Most of my original creations were before the archiving started, and those things are lost. But they weren't the kind of history you learn and benefit from--nor is most of the internet.
    The truly important stuff exists in many forms, not just online/digital. Or will be archived with increased effort, because it's worth it.
    [-]
    - otterley 1 hour ago
      Like it or not, the Internet is today’s store of record for a significant proportion—if not the majority—of the world’s activities.
      If you don’t want your bad behavior preserved for the historical record, perhaps a better answer is to not engage in bad behavior instead of relying on some sort of historical eraser.
      [-]
      - sejje 18 minutes ago
        Behavior that isn't bad becomes bad retrospectively after a regime change.
    - nine_k 1 hour ago
      Think about the stuff archeologists get to work with.
  - ninjagoo 1 hour ago
    What's that famous quote - those who do not learn from history ...
    BUT, it's hard to learn from history if there's no history to learn...
- TheRealPomax 1 hour ago
  Kind of the "think of the children" argument: most things that are worth archiving have nothing to do with content that can be used against someone in the future. But the raw volume is making it impossible to filter out the worthwhile stuff from the slop (all forms of it, not just AI), even with automation (again, not AI, we've been doing NLP using regular old ML for decades now).
- UltraSane 1 hour ago
  Man I cannot disagree more. This is a terrible thing.