While there is still a chance, I think we should snapshot a version of the web and make it publicly available. That could serve as something to calibrate various information sources against, to get an idea of whether or not they should be trusted. I'm pretty sure Google, OpenAI, and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low-background steel'.
https://en.wikipedia.org/wiki/Low-background_steel
(A semi-ironic detail: Common Crawl is one of the most common sources of training data for LLMs.)
I'm also totally not convinced by this argument.
Synthetic data as an input to a careful training regimen will result in better outputs, not worse, because you're still subjecting the model to optimization and new information. Over time you can pull out the worse-performing training data, both original and synthetic. That careful curation is the part that makes the difference.
It's like DNA in the chemical soup. Polymers have been replicating since the beginning, but in the end intelligence arose. It didn't need magical ingredients. When you climb a gradient, it typically takes you somewhere better.
That's the current prevailing hypothesis, but we don't yet understand the phenomenon of intelligence well enough to definitively rule out magical ingredients: unknown variables or characteristics of the system/inputs/data that made it possible for intelligence to emerge.
This proposed snapshot of the web, before it gets further "contaminated" by synthetic AI/LLM-generated data, might prove to be valuable or it might not. The premise could be wrong. Maybe we learn that there's nothing fundamentally special about human-generated data, compared to synthetic data derived from it.
It seems worthwhile to consider, though, in case it turns out that there is some as-yet-unknown quality to the more or less "pure" human data. In the metaphor of low-background steel, we could be entering a period of unregulated nuclear testing without being fully aware of the consequences.
This idea of contamination by AI vs pristine human data isn't persuasive to me at all. It feels like a continuation of the wrong idea that LLMs are parrots.
(also, how much time did the soup take? Can you wait that long?)
2021: https://twitter.com/jackclarkSF/status/1376304266667651078
2022: https://twitter.com/william_g_ray/status/1583574265513017344
2022: https://twitter.com/mtrc/status/1599725875280257024
Common Crawl and the Internet Archive crawls are probably the two most ready sources for this; you just have to define where you want to draw the line.
Common Crawl's first crawl of 2020 contains 3.1B pages, and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview
Internet Archive's crawls are here: https://archive.org/details/web organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
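If you wanted to grab a slice of one of these snapshots yourself, Common Crawl publishes each crawl's file listing as a gzipped text file of WARC paths. A minimal sketch, assuming the current `crawl-data/<ID>/warc.paths.gz` layout (check the crawl's page if that ever changes):

```python
# Sketch: listing the WARC archive files for one Common Crawl snapshot.
# Assumes the current publishing layout (crawl-data/<ID>/warc.paths.gz);
# crawl IDs like CC-MAIN-2020-05 are listed at https://commoncrawl.org/overview
import gzip
import urllib.request

def warc_paths_url(crawl_id: str) -> str:
    """URL of the gzipped list of WARC file paths for one crawl."""
    return f"https://data.commoncrawl.org/crawl-data/{crawl_id}/warc.paths.gz"

def list_warc_paths(crawl_id: str, limit: int = 5) -> list[str]:
    """Download the listing and return the first `limit` WARC paths."""
    with urllib.request.urlopen(warc_paths_url(crawl_id)) as resp:
        paths = gzip.decompress(resp.read()).decode().splitlines()
    return paths[:limit]
```

Each returned path is relative to https://data.commoncrawl.org/, so you can fetch individual, roughly gigabyte-sized WARC segments instead of the whole ~100TB crawl.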
They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.
Paradoxically, I think a lot of research is showing that synthetic training data can be just as good as the real stuff. We may stumble into an even stranger scenario where AI-generated content is more conducive to training than human content is.
Which studies show this? https://arxiv.org/abs/2305.17493 shows the exact opposite and my (layman's) understanding of statistics and epistemology lines up entirely with this finding.
Like, how could this even theoretically work? In the best case scenario wouldn't training on synthetic training data make LLMs overconfident / overfit the data once faced with new (human) input to respond to?
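For what it's worth, the failure mode that paper describes is easy to illustrate with a toy model: repeatedly refit a distribution to samples drawn from the previous fit, and the tails disappear. This is just an illustrative Gaussian sketch, not anything taken from the paper itself:

```python
# Toy illustration of the "model collapse" failure mode: fit a Gaussian,
# sample from the fit, refit to those samples, repeat. Each refit loses a
# little tail mass, so the fitted spread drifts toward zero over generations.
import random
import statistics

def one_generation(mu, sigma, n, rng):
    """Draw n points from N(mu, sigma^2) and refit mean and stddev."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(sample), statistics.pstdev(sample)

def run_generations(generations=500, n=50, seed=0):
    """Start from N(0, 1) and recursively 'train' on synthetic samples."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    for _ in range(generations):
        mu, sigma = one_generation(mu, sigma, n, rng)
    return mu, sigma
```

After a few hundred generations the fitted sigma collapses well below the original 1.0: the model has "forgotten" its own rare events, which is roughly the overconfidence you're describing.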
> Like, how could this even theoretically work?
I'm not really an expert on it either, but my understanding is that it works the same way curating human data works. You sift through the garbage, nonsense, impolite and incoherent AI responses and only include the exemplary conversations in your training set.
It feels kinda like the "monkeys on typewriters writing Shakespeare" parable. If you have enough well-trained AIs generate enough conversations, eventually enough of them will be indistinguishable enough from human data to be usable for training.
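A sketch of that sifting step, with `generate` and `score` as hypothetical stand-ins for an LLM sampler and a quality/reward model (neither is a real API):

```python
# Hypothetical curation loop: oversample synthetic conversations, score
# each one, and keep only the top-scoring slice for the training set.
from typing import Callable

def curate(generate: Callable[[], str],
           score: Callable[[str], float],
           n_samples: int = 1000,
           keep_fraction: float = 0.1) -> list[str]:
    """Sample a lot, rank by quality, keep the best few."""
    samples = [generate() for _ in range(n_samples)]
    ranked = sorted(samples, key=score, reverse=True)
    return ranked[:max(1, int(n_samples * keep_fraction))]
```

The interesting part is entirely inside `score`: if it can't tell exemplary from merely fluent, the filter passes garbage straight through into the training set.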
I'd be very surprised if the big orgs don't have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.
It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.
Google has a huge number of books scanned, too.
https://www.theatlantic.com/technology/archive/2017/04/the-t...
If I can learn from AI-generated content, then I totally believe that AI can too.
https://importai.substack.com/p/import-ai-369-conscious-mach...
Example: I have a huge number of perplexity.ai search/research threads, but the ones I share with my colleagues are a product of selection bias. Some of my threads are quite useless, much like a web search that was a dud. Those do not get shared.
Likewise, if I use an LLM to draft passages, or even act as something like an overgrown thesaurus, I do find I have to make large changes. But some of the material stays intact. Is it AI, or not AI? It's a bit of both. Sometimes my editing is heavy-handed, other times less so, but in all cases I checked the output.
I do not think we are at that point yet. In the meantime, the idea that we might get to intelligence by feeding in more data might get choked out by poisoned data.
I have a suspicion that there's a bit more to it than just more data though.
The solution is sticking to the websites you trust. And LLMs and RAG can actually make for a really good, very relevant search engine.
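A minimal sketch of the retrieval half of that idea over a hand-picked corpus of trusted pages. Bag-of-words cosine similarity stands in for a real embedding model; in practice an LLM would then answer from the retrieved passages:

```python
# Toy retrieval over a trusted corpus: rank documents by cosine similarity
# of word-count vectors against the query.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    qv = Counter(query.lower().split())
    return sorted(docs,
                  key=lambda d: cosine(qv, Counter(d.lower().split())),
                  reverse=True)[:k]
```

The point of the "trusted sites" restriction is that the corpus itself is curated, so the generator can only be grounded in pages you've already vetted.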
COLONEL : But not all the information was inherited by later generations. A small percentage of the whole was selected and processed, then passed on. Not unlike genes, really.
ROSE : That's what history is, Jack.
COLONEL : But in the current, digitized world, trivial information is accumulating every second, preserved in all its triteness. Never fading, always accessible.
ROSE : Rumors about petty issues, misinterpretations, slander…
COLONEL : All this junk data preserved in an unfiltered state, growing at an alarming rate.
ROSE : It will only slow down social progress, reduce the rate of evolution.
COLONEL : Raiden, you seem to think that our plan is one of censorship.
That said, it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners, and so on. I've said for years that spam is the ultimate cancer, and that the tech community's general indifference to spam and scams will be its downfall.
A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links then spam is harder to monetize.
https://udongein.xyz/notice/AcwmRcIzxOLmrSamum
There are some obvious problems with it, but I think I'd still like to see what that would look like.
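As a toy version of what such an index might do, here's one heuristic: flag outbound links whose query strings carry common affiliate/tracking parameters. The parameter names below are illustrative guesses, not a real blocklist:

```python
# Toy heuristic for an "ad-free index": drop links that look monetized
# based on well-known affiliate/tracking query parameters.
from urllib.parse import parse_qs, urlparse

# Illustrative parameter names only; a real system would need curated
# filter-list data (EasyList-style rules) and much more context.
AFFILIATE_PARAMS = {"tag", "affid", "aff_id", "ref", "utm_campaign"}

def looks_monetized(url: str) -> bool:
    """True if the URL carries one of the listed affiliate/tracking params."""
    return any(p in AFFILIATE_PARAMS for p in parse_qs(urlparse(url).query))

def filter_links(urls: list[str]) -> list[str]:
    """Keep only links with none of the listed parameters."""
    return [u for u in urls if not looks_monetized(u)]
```

One of those obvious problems shows up immediately: `ref=` is used for plenty of non-affiliate purposes, so a naive filter like this has real false positives.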
From https://infosec.exchange/@bhawthorne/111601578642616056 :
How bad are the thousands of new stochastically-generated websites?
Last night I wanted to roast some hazelnuts, and I could not remember the temperature I used last time. So I searched on DuckDuckGo. Every website that I could find was machine-generated with different temps listed. One site had three separate methods listed that were essentially differently worded versions of the same thing. With different temperatures.
So I pulled my copy of Rodale’s Basic Natural Foods Cookbook off the shelf and looked it up there.
I think it may be time to download an archive copy of the 2022 Wikipedia before we lose all of our reference material. It was nice having all the world’s knowledge at my fingertips for a couple of decades, but that time seems to be past.
This would only apply to the pre-AGI era, though.
I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.
Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it, and they caused this issue in the first place?
I'm not holding my breath.
I had actually posted a question about this around that time, but the only reply I got was from a guy saying it was not likely, because the HN hive mind would drive down such posts.
Not sure if he was right, because I still see evidence of such stuff.
What are you searching for anyway??