How to scale RL to 10^26 FLOPs

(blog.jxmo.io)

80 points | by jxmorris12 4 days ago

6 comments

  • moconnor 22 hours ago
    A very long way of saying "during pretraining let the models think before continuing next-token prediction and then apply those losses to the thinking token gradients too."

    It seems like an interesting idea. You could apply some small regularisation penalty to the number of thinking tokens the model uses. You might have to break up the pretraining data into meaningfully-partitioned chunks. I'd be curious whether at large enough scale models learn to make use of this thinking budget to improve their next-token prediction, and what that looks like.
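    The regularised objective described above could be sketched roughly as follows. This is a hypothetical illustration, not anything from the article: `thinking_regularized_loss`, its arguments, and the coefficient `lam` are all made-up names, and the penalty is taken to be linear in the thinking-token count.

    ```python
    import math

    def thinking_regularized_loss(token_log_probs, num_thinking_tokens, lam=0.01):
        """Hypothetical objective: next-token prediction loss plus a small
        penalty proportional to the number of thinking tokens emitted.

        token_log_probs: log-probs of the ground-truth next tokens
                         (thinking tokens excluded from the prediction loss)
        num_thinking_tokens: how many thinking tokens the model used
        lam: regularisation strength on the thinking budget
        """
        nll = -sum(token_log_probs) / len(token_log_probs)  # mean cross-entropy
        return nll + lam * num_thinking_tokens

    # Example: two target tokens each predicted with probability 0.5,
    # after 10 thinking tokens: nll = ln 2 ~ 0.693, penalty = 0.1.
    loss = thinking_regularized_loss([math.log(0.5), math.log(0.5)], 10, lam=0.01)
    ```

    The idea is that the penalty makes thinking tokens "cost" something, so the model only spends them where they actually buy better next-token predictions.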

  • tekacs 1 day ago
    Besides the great subject matter, I love how densely packed this article is with links to relevant papers and materials!

    (Because I almost missed this) In the comments on the post someone linked to this paper: https://arxiv.org/html/2408.15240v1

  • mvkel 1 day ago
    Grok 4 is effectively Grok 3 with massively scaled RL, and the improvements on the benchmarks (and experientially) are minimal.

    Is this a flaw in theory, or application?

    • k__ 22 hours ago
      Half-OT:

      Is there any info out there how the major models differ?

  • Iwan-Zotow 1 day ago
    More data? Where is it supposed to come from?
  • childintime 19 hours ago
    The chemistry of RL dictates that 10^26 FLOPS is about 166 FLOP-mols. But how much weight is this? An electron/FLOP or 1eV/FLOP? That's 0.55mg or just 1ng. Regardless, I'd say it's close to 7 brainfucks, as it's common knowledge they exert a logarithmic force on the intelligence weighing apparatus, F = m * log2(a).