Training mRNA Language Models Across 25 Species for $165

We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, we found CodonRoBERTa-large-v2 to be the clear winner, with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that, as far as we know, no other open-source project offers. Complete results, architectural decisions, and runnable code below.
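The two headline metrics can be made concrete with a small sketch: perplexity is the exponential of the mean per-token negative log-likelihood, and the CAI correlation is a Spearman rank correlation between per-sequence model scores and Codon Adaptation Index values. The functions below are illustrative stand-ins, not code from the project (and this Spearman implementation ignores tie correction for brevity):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks.

    No tie correction -- fine for illustration, use scipy.stats.spearmanr
    for real evaluation.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy usage: per-token NLLs for one sequence, and hypothetical
# (model score, CAI) pairs across a few sequences.
print(perplexity([math.log(4.0)] * 10))          # a flat NLL of ln(4) -> perplexity 4.0
print(spearman([0.1, 0.5, 0.3], [0.2, 0.9, 0.4]))  # monotone pairs -> 1.0
```

A perplexity of 4.10 over a 64-codon vocabulary means the model is, on average, about as uncertain as a uniform choice among ~4 codons, i.e. roughly the synonymous-codon set for a typical amino acid.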

56 points | by maziyar 2 days ago

7 comments

  • simianwords 2 hours ago
What makes these domain-specific models work when we don’t have good domain models for health care, chemistry, economics, and so on?
    • colechristensen 1 hour ago
      >we don’t have good domain models for health care, chemistry, economics and so on

      Who says we don't?

      • simianwords 1 hour ago
        Examples please?
        • colechristensen 52 minutes ago
No; it's really simple to search for domain-specific models being used "in production" all over the place
  • rubicon33 1 hour ago
Can someone explain what one might use this model for? As a developer with a casual interest in biology, it would be fun to play with, but honestly I'm not sure what I would do with it.
    • colechristensen 1 hour ago
      You can get your feet wet with genetic engineering for surprisingly little money.

      This guy shows a lot of how it's done: https://www.youtube.com/@thethoughtemporium

      Basically you can design/edit/inject custom genes into things and see real results spending on the scale of $100-$1000.

      • someuser54541 1 hour ago
        Is there something like this in text/readable format?
  • khalic 2 hours ago
    > In Progress: CodonJEPA

    JEPA is going to break the whole industry :D

    • digdugdirk 2 hours ago
      Can you explain this? I haven't heard of JEPA, and from a quick search it seems to be vision/robotics based?
      • khalic 1 hour ago
It’s a self-supervised learning architecture, and it’s pretty much universal. The loss function runs on embeddings, plus some other smart architectural choices all over. Worth diving into for a few hours; Yann LeCun gives some interesting talks about it
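The key point in the comment above — that JEPA's loss is computed between embeddings rather than reconstructed tokens — can be sketched as a toy. Everything here (the linear "encoders", the predictor, the function names) is a hypothetical stand-in for illustration, not code from any JEPA implementation; in real JEPA systems the target encoder is an EMA copy of the online encoder with gradients stopped:

```python
def mse(a, b):
    """The JEPA-style objective: distance between predicted and actual target embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def encode(weights, tokens):
    """Toy 'encoder': one scalar weight per position (stand-in for a transformer)."""
    return [w * t for w, t in zip(weights, tokens)]

def jepa_loss(online_w, target_w, predictor_w, seq, mask_start, mask_end):
    # Context view: masked span zeroed out before encoding.
    context = [0.0 if mask_start <= i < mask_end else t for i, t in enumerate(seq)]
    ctx_emb = encode(online_w, context)
    # Target view: the full sequence through the (EMA, stop-gradient) target encoder.
    tgt_emb = encode(target_w, seq)
    # Predictor maps the context embedding toward the target embedding.
    pred = [p * c for p, c in zip(predictor_w, ctx_emb)]
    # Loss lives in embedding space -- no token-level reconstruction head.
    return mse(pred, tgt_emb)

# With nothing masked and identical encoders, the predictor has nothing to do:
print(jepa_loss([1.0] * 4, [1.0] * 4, [1.0] * 4, [1.0, 2.0, 3.0, 4.0], 0, 0))  # 0.0
# Masking positions 1..2 forces the predictor to account for the missing span:
print(jepa_loss([1.0] * 4, [1.0] * 4, [1.0] * 4, [1.0, 2.0, 3.0, 4.0], 1, 3))  # 3.25
```

The design choice this illustrates is that the model is rewarded for predicting a *representation* of the masked content, not its exact tokens — which is why the approach transfers across vision, robotics, and potentially sequence domains like codons.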
  • maziyar 2 days ago
    • xyz100 2 hours ago
      What makes this dataset or problem worth solving compared to other health datasets? Would the results on this task be broadly useful to health?
      • CyberDildonics 1 hour ago
        What other "datasets" are you talking about? How do you "solve a dataset" ?
  • yieldcrv 1 hour ago
Distributing the load on this will probably be infinitely more useful than Folding@home
  • HocusLocus 2 hours ago
    gray goo of the future