Tell HN: We should snapshot a mostly AI output free version of the web

While we can, and if it isn't too late already. The web is overrun with AI-generated drivel; I've been searching for information on some widely varying subjects and I keep landing on recently auto-generated junk. Unfortunately, most search engines equate 'recency' with 'quality' or 'relevance', and that is very much no longer true.

While there is still a chance, I think we should snapshot a version of the web and make it publicly available. That can serve as a baseline to calibrate various information sources against, to get an idea of whether or not they should be used. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low-background steel'.

https://en.wikipedia.org/wiki/Low-background_steel

136 points | by jacquesm 15 days ago

31 comments

  • simonw 15 days ago
    Sounds like you want Common Crawl - they have snapshots going back to 2013, take your pick: https://data.commoncrawl.org/crawl-data/index.html

    (A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs)

    • echelon 15 days ago
      > such data will rapidly become as precious as 'low background steel'.

      I'm also totally not convinced by this argument.

      Synthetic data as an input to a careful training regimen will result in better outputs, not worse, because you're still subjecting the model to optimization and new information. Over time you can pull out the worse-performing (original and synthetic) training data. That careful curation is the part that makes the difference.

      It's like DNA in the chemical soup. It's been replicating polymers since the beginning, but in the end intelligence arises. It didn't need magical ingredients. When you climb a gradient, it typically takes you somewhere better.

      • lioeters 15 days ago
        > in the end intelligence arises. It didn't need magical ingredients.

        That's the current prevailing hypothesis, but we don't yet fully understand the phenomenon of intelligence enough to definitively rule out any magical ingredients: unknown variables or characteristics of the system/inputs/data that made it possible for intelligence to emerge.

        This proposed snapshot of the web, before it gets further "contaminated" by synthetic AI/LLM-generated data, might prove to be valuable or it might not. The premise could be wrong. Maybe we learn that there's nothing fundamentally special about human-generated data, compared to synthetic data derived from it.

        It seems worthwhile to consider though, in case it turns out that there is some yet unknown quality of the more or less "pure" human data. In the metaphor of low-background steel, we could be entering a period of unregulated nuclear testing without being fully aware of the consequences.

        • nh23423fefe 14 days ago
          I don't buy this at all. AI data is a real part of the environment. The thing to modify is the loss function, not the training data. You need to be able to evaluate text on the internet, and so do models.

          This idea of contamination by AI vs pristine human data isn't persuasive to me at all. It feels like a continuation of the wrong idea that LLMs are parrots.

      • eviks 15 days ago
        "Careful curation" is the part you lose when you use synthetic data. Subjecting models to "new information" isn't useful on its own; otherwise you could just feed them random 0s and 1s and hope to carefully curate it later.

        (also, how much time did the soup take? Can you wait that long?)

      • xs83 15 days ago
        Training AI on AI-generated data produces some increasingly weird outputs. I am sure we are already seeing the results of this in some models, but the level of hallucination is only going to increase unless some kind of checks and balances is implemented.
      • cyanydeez 14 days ago
        I'm convinced just cleaning existing datasets would be more effective.
  • vitovito 15 days ago
    2024 might already be too late, since this sentiment has been shared since at least 2021:

    2021: https://twitter.com/jackclarkSF/status/1376304266667651078

    2022: https://twitter.com/william_g_ray/status/1583574265513017344

    2022: https://twitter.com/mtrc/status/1599725875280257024

    Common Crawl and the Internet Archive crawls are probably the two most ready sources for this, you just have to define where you want to draw the line.

    Common Crawl's first crawl of 2020 contains 3.1B pages, and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview

    Internet Archive's crawls are here: https://archive.org/details/web organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
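    For concreteness, grabbing the file list for one of those Common Crawl snapshots can be sketched in a few lines of Python. This is a minimal sketch assuming Common Crawl's documented layout (each crawl publishes a gzipped `warc.paths.gz` manifest listing its WARC files); the function names are illustrative:

```python
# Minimal sketch: build download URLs for one Common Crawl snapshot.
# Assumes the documented layout crawl-data/<CRAWL-ID>/warc.paths.gz.
import gzip
import urllib.request

BASE = "https://data.commoncrawl.org/"

def manifest_url(crawl_id: str) -> str:
    """URL of the gzipped manifest listing every WARC path in a crawl."""
    return f"{BASE}crawl-data/{crawl_id}/warc.paths.gz"

def list_warc_urls(crawl_id: str, limit: int = 3) -> list:
    """Fetch the manifest and return full URLs for the first few WARC files.

    Requires network access; mirroring a whole snapshot at the sizes
    quoted above is a serious storage commitment.
    """
    with urllib.request.urlopen(manifest_url(crawl_id)) as resp:
        paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()
    return [BASE + p for p in paths[:limit]]

# Example (needs network): list_warc_urls("CC-MAIN-2020-05")
```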

    • Atotalnoob 15 days ago
      Why is wide crawl 18 smaller than 17?
      • el_duderino_ 15 days ago
        The Tumblr purge was worse than I expected…
  • talldayo 15 days ago
    > I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on

    They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.

    Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.

    • dangerwill 15 days ago
      > Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff.

      Which studies show this? https://arxiv.org/abs/2305.17493 shows the exact opposite and my (layman's) understanding of statistics and epistemology lines up entirely with this finding.

      Like, how could this even theoretically work? In the best case scenario wouldn't training on synthetic training data make LLMs overconfident / overfit the data once faced with new (human) input to respond to?
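      A toy illustration of the collapse mechanism that paper describes: repeatedly fitting a distribution to samples drawn from the previous generation's fit loses variance through finite sampling, so the tails get forgotten. A minimal sketch (the Gaussian "model" and all parameters are arbitrary stand-ins, not the paper's setup):

```python
# Toy model collapse: each generation "trains" (fits a Gaussian) on the
# previous generation's output, then publishes fresh samples from the fit.
import random
import statistics

def next_generation(data, n):
    # Fit a 1-D Gaussian to the previous generation, then sample from it.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10)]  # "human" data, stdev ~1
for _ in range(500):
    data = next_generation(data, 10)

# The fitted spread collapses toward zero over generations: the original
# distribution's tails are progressively lost.
print(statistics.stdev(data))
```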

      • talldayo 15 days ago
        I don't have any exact references, but multiple finetuning datasets have used curated GPT-3/4 conversations as training data. It's less that they're overtly superior to human data, and more that they're less-bad and more abundantly available.

        > Like, how could this even theoretically work?

        I'm not really an expert on it either, but my understanding is that it works the same way curating human data works. You sift through the garbage, nonsense, impolite and incoherent AI responses and only include the exemplary conversations in your training set.

        It feels kinda like the "monkeys on typewriters writing Shakespeare" parable. If you have enough well-trained AIs generate enough conversations, eventually some of them will be indistinguishable enough from human data to be usable for training.

    • Havoc 15 days ago
      > They probably just use publicly-available resources like The Pile

      I’d be very surprised if the big orgs don’t have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.

      • vagabund 15 days ago
        Yeah they absolutely do not use the pile.
        • talldayo 15 days ago
          GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.

          It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.

    • skybrian 15 days ago
      To index the web, you generally do make a copy of it.

      Google has a huge number of books scanned, too.

      • TowerTall 15 days ago
        “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

        https://www.theatlantic.com/technology/archive/2017/04/the-t...

      • pinko 15 days ago
        Yeah, I hadn't thought about their abandoned effort to scan every book and archived newspaper in the world in a while, but I bet they're regretting now that they didn't finish. A non-trivial amount of that physical media has been tossed or degraded by underfunded libraries since then. And it's more valuable to them now than it ever was.
    • bigyikes 15 days ago
      I learned Rust, with great help from ChatGPT-4.

      If I can learn from AI-generated content, then I totally believe that AI can too.

      • coder-3 15 days ago
        The problem with AI-generated content is not necessarily that it's bad, rather, it's not novel information. To learn something, you must not already know it. If it's AI-generated, the AI already knows it.
        • HeatrayEnjoyer 15 days ago
          How much work do individual humans do that could be considered genuinely truly novel? I measure the answer to be "almost none."
        • skybrian 15 days ago
          That's true to some extent, but training on synthetic content is big these days:

          https://importai.substack.com/p/import-ai-369-conscious-mach...

        • fdr 15 days ago
          We might also say the same thing about spelling and grammar checkers. The difference will be in the quality of oversight of the tool. The "AI generated drivel" has minimal oversight.

          Example: I have a huge number of perplexity.ai search/research threads, but the ones I share with my colleagues are a product of selection bias. Some of my threads are quite useless, much like a web search that was a dud. Those do not get shared.

          Likewise, if I use an LLM to draft passages or even act as something like an overgrown thesaurus, I do find I have to make large changes. But some of the material stays intact. Is it AI, or not AI? It's a bit of both. Sometimes my editing is heavy-handed, other times less so, but in all cases I checked the output.

      • gorjusborg 15 days ago
        You are assuming that you and AI are the same sort of thing.

        I do not think we are at that point yet. In the meantime, the idea that we might get to intelligence by feeding in more data might get choked out by poisoned data.

        I have a suspicion that there's a bit more to it than just more data though.

      • janice1999 15 days ago
        AI does not 'learn' like a human.
      • mr90210 15 days ago
        I learned.. If I can… then I totally…
  • uyzstvqs 15 days ago
    I've posted this recently on another post as well, but before AI-generated spam there was content farm spam. This has been increasing in search results and on social networking sites for years now.

    The solution is sticking to the websites you trust. And LLMs and RAG can actually make for a really good, very relevant search engine.

  • potatoman22 15 days ago
    I feel like archive.org and The Pile have this covered, no?
    • pixl97 15 days ago
      Until some lawyers force us to get rid of it.
  • Zenzero 15 days ago
    This implies that the pre-AI internet wasn't already overrun with SEO optimized junk. Much of the internet is not worth preserving.
    • Lammy 15 days ago
      ROSE : We've always kept records of our lives. Through words, pictures, symbols... from tablets to books…

      COLONEL : But not all the information was inherited by later generations. A small percentage of the whole was selected and processed, then passed on. Not unlike genes, really.

      ROSE : That's what history is, Jack.

      COLONEL : But in the current, digitized world, trivial information is accumulating every second, preserved in all its triteness. Never fading, always accessible.

      ROSE : Rumors about petty issues, misinterpretations, slander…

      COLONEL : All this junk data preserved in an unfiltered state, growing at an alarming rate.

      ROSE : It will only slow down social progress, reduce the rate of evolution.

      COLONEL : Raiden, you seem to think that our plan is one of censorship.

  • skybrian 15 days ago
    SEO content farms have been publishing for decades now.
  • signaru 15 days ago
    Alternatively, searching has to change. The non-AI content doesn't necessarily disappear, but it is gradually becoming "hidden gems". Something like Marginalia, which does this for SEO noise, would be nice.
  • jdswain 15 days ago
    At least I think I can tell when I am reading AI generated content, and stop reading and go somewhere else. Eventually though it'll get better to the point where it'll be hard to tell, but maybe then it's also good enough to be worth reading?
    • DanHulton 15 days ago
      I mean, that is one assumption you could make.
  • anigbrowl 15 days ago
    I don't really have this problem because I habitually use the Tools option on Google (or equivalent on other search engines like DDG) to only return information from before a certain date. It's not flawless, as some media companies use a more or less static URL that they update frequently, but SEO-optimizers like this are generally pretty easy to screen out.

    That said it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer and that the tech community's general indifference to spam and scams will be its downfall.

  • aaronblohowiak 15 days ago
    Internet archive?
    • metadat 15 days ago
      Not sure if they have a thorough snapshot, but it's a good idea for sure. IA is probably the only entity on earth that might share this dataset instead of hoarding it.
  • neilk 15 days ago
    Using "before:2023" in your Google query helps. For now.

    A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links then spam is harder to monetize.

    https://udongein.xyz/notice/AcwmRcIzxOLmrSamum

    There are some obvious problems with it, but I think I'd still like to see what that would look like.
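    As a rough sketch of how such a filter might start: flag outbound links that carry common affiliate or tracking query parameters, and down-rank pages where most links are monetized. The parameter list and function names below are illustrative assumptions, not any real index's rules:

```python
# Sketch: score a page's outbound links for affiliate/tracking markers.
from urllib.parse import parse_qs, urlparse

# Query parameters commonly used for affiliate/click tracking (illustrative).
AFFILIATE_PARAMS = {"tag", "affid", "aff_id", "clickid", "utm_campaign"}

def is_monetized_link(url: str) -> bool:
    """True if the URL carries any of the marker query parameters."""
    query_keys = set(parse_qs(urlparse(url).query))
    return bool(query_keys & AFFILIATE_PARAMS)

def monetization_ratio(links) -> float:
    """Fraction of a page's outbound links that look monetized."""
    if not links:
        return 0.0
    return sum(map(is_monetized_link, links)) / len(links)
```

    A real filter would need far more than query-string matching (redirectors, cloaked links, ad scripts), but the ranking idea is the same: make heavily monetized pages easy to exclude.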

    • ikt 15 days ago
      good lord that is a horribly designed website. The OP that person is linking to:

      https://infosec.exchange/@bhawthorne/111601578642616056

      How bad are the thousands of new stochastically-generated websites?

      Last night I wanted to roast some hazelnuts, and I could not remember the temperature I used last time. So I searched on DuckDuckGo. Every website that I could find was machine-generated with different temps listed. One site had three separate methods listed that were essentially differently worded versions of the same thing. With different temperatures.

      So I pulled my copy of Rodale’s Basic Natural Foods Cookbook off the shelf and looked it up there.

      I think it may be time to download an archive copy of the 2022 Wikipedia before we lose all of our reference material. It was nice having all the world’s knowledge at my fingertips for a couple of decades, but that time seems to be past.

  • giantg2 15 days ago
    Sure, we can take a snapshot of our bot-filled web today before it goes full AI. Not sure what the real benefit would be.
  • dudus 15 days ago
    I have a sliver of hope that AI-generated content will actually be good one day, just like I believe automated cars will be better than humans. I have nothing against some of my reading being written by AI.
  • ccgreg 13 days ago
    I've been giving talks about Common Crawl for the last year with a slide about exactly this, using low background steel as an example.
  • greyzor7 14 days ago
    that's what archive.org already does, but if you want to re-implement it, you would have to crawl the whole web and possibly save thumbnails of pages with ScreenshotOne (https://microlaunch.net/p/screenshotone)
  • wseqyrku 15 days ago
    > recently auto-generated junk

    this would only apply to the pre-AGI era though

  • MattGaiser 15 days ago
    Is this really all that different from the procedurally generated drivel or the offshore freelance copy/paste generated drivel?

    I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.

  • metadat 15 days ago
    Reality is a mess in a lot of ways. Unfortunately in this case, it's a bit late.

    Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it and they caused this issue?

    I'm not holding my breath.

  • jamesy0ung 15 days ago
    Internet Archive exists for webpages
  • acheron 15 days ago
    The web has been overrun by drivel for over two decades now.
  • mceoin 15 days ago
    Isn’t this common crawl?
  • RecycledEle 15 days ago
    It's way too late.
  • LorenDB 15 days ago
    r/Datahoarder probably already has you covered.
  • fuzztester 15 days ago
    The same seems to have been happening on HN for the last several months.

    I had actually posted a question about this around that time, but the only reply I got was from a guy saying it was not likely, because the HN hive mind would drive down such posts.

    Not sure if he was right, because I still see evidence of such stuff.

  • alpenbazi 15 days ago
    yes
  • aaron695 15 days ago
    [dead]
  • keepamovin 15 days ago
    Embrace it. Stop living in the past, Gatsby. Just ask ChatGPT for the answers you seek. Hahaha! :)

    What are you searching for anyway??