2 comments

  • sans_souse 3 hours ago
    I think a point often missed is that it's not just the substance and quality of those sources and their decline, but also the overall decline in the number of sources, period. The first phases involved training the models on a massive backlog of raw knowledge, accumulated over thousands of years, and for the majority of that span the world was very different from today's; in short, all of our knowledge was "boots on the ground" type, all of it served to aid our growth, and our record of it tells such a story.

    But our knowledge and growth today is so narrow in scope (in a sense), and there's an ever-looming scenario ready to present itself where our perceived growth is actually a recursion, and the answer to "what is the purpose?" becomes "there is none."

  • ipython 3 hours ago
    So I’ve heard of this model collapse theory. But I’ve also heard of model providers who are intentionally training with synthetically generated data (as a result of insufficient “real” data).

    So I’m curious: where is the line? Are there phases in the pre-training/continued pre-training/alignment/RLHF pipeline where synthetic data isn’t just harmless but actually beneficial? Is it a question of quantity, or a question of how much novelty is in the training data?
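    A toy simulation makes the collapse question concrete: repeatedly fit a distribution to samples drawn from the previous generation's fit, with no fresh "real" data mixed in. Each generation's parameter estimates are noisy (and the population-std estimator is biased low), so the errors compound and the distribution narrows. The sample size and generation count below are illustrative assumptions, not numbers from any actual training pipeline.

    ```python
    import random
    import statistics

    random.seed(0)
    mu, sigma = 0.0, 1.0   # the "real" data distribution
    n = 50                 # samples per generation (illustrative)

    history = [sigma]
    for gen in range(200):
        # each generation trains only on the previous generation's output
        data = [random.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(data)
        # pstdev (the MLE) is biased low; the bias compounds across generations
        sigma = statistics.pstdev(data)
        history.append(sigma)

    print(f"std at generation 0:   {history[0]:.3f}")
    print(f"std at generation 200: {history[-1]:.3f}")
    ```

    Mixing fresh real samples into each generation largely removes the drift, which is one intuition for why the ratio of synthetic to novel data, not just the raw quantity, might be where the line sits.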