Training mRNA Language Models Across 25 Species for $165

We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, we found CodonRoBERTa-large-v2 to be the clear winner, with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that, as far as we know, no other open-source project offers. Complete results, architectural decisions, and runnable code below.
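The two headline metrics can be made concrete with a small sketch: perplexity is the exponential of the mean per-token negative log-likelihood, and the CAI correlation is a Spearman rank correlation between per-sequence model scores and Codon Adaptation Index values. The functions below are illustrative stand-ins, not code from the project (and this Spearman implementation ignores tie correction for brevity):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks.

    No tie correction -- fine for illustration, use scipy.stats.spearmanr
    for real evaluation.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy usage: per-token NLLs for one sequence, and hypothetical
# (model score, CAI) pairs across a few sequences.
print(perplexity([math.log(4.0)] * 10))          # a flat NLL of ln(4) -> perplexity 4.0
print(spearman([0.1, 0.5, 0.3], [0.2, 0.9, 0.4]))  # monotone pairs -> 1.0
```

A perplexity of 4.10 over a 64-codon vocabulary means the model is, on average, about as uncertain as a uniform choice among ~4 codons, i.e. roughly the synonymous-codon set for a typical amino acid.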

56 points | by maziyar 2 days ago

7 comments

  • simianwords 2 hours ago
What makes these domain-specific models work when we don’t have good domain models for health care, chemistry, economics, and so on?
    • colechristensen 1 hour ago
      >we don’t have good domain models for health care, chemistry, economics and so on

      Who says we don't?

      • simianwords 1 hour ago
        Examples please?
        • colechristensen 52 minutes ago
No; it's really simple to search for domain-specific models being used "in production" all over the place
  • rubicon33 1 hour ago
Can someone explain what one might use this model for? As a developer with a casual interest in biology, it would be fun to play with, but honestly I'm not sure what I would do with it.
    • colechristensen 1 hour ago
      You can get your feet wet with genetic engineering for surprisingly little money.

      This guy shows a lot of how it's done: https://www.youtube.com/@thethoughtemporium

      Basically you can design/edit/inject custom genes into things and see real results spending on the scale of $100-$1000.

      • someuser54541 1 hour ago
        Is there something like this in text/readable format?
  • khalic 2 hours ago
    > In Progress: CodonJEPA

    JEPA is going to break the whole industry :D

    • digdugdirk 2 hours ago
      Can you explain this? I haven't heard of JEPA, and from a quick search it seems to be vision/robotics based?
      • khalic 1 hour ago
It’s a self-supervised learning architecture, and it’s pretty much universal. The loss function runs on embeddings, plus some other smart architectural choices all over. Worth diving into for a few hours; Yann LeCun gives some interesting talks about it
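The key point in the comment above — that JEPA's loss is computed between embeddings rather than reconstructed tokens — can be sketched as a toy. Everything here (the linear "encoders", the predictor, the function names) is a hypothetical stand-in for illustration, not code from any JEPA implementation; in real JEPA systems the target encoder is an EMA copy of the online encoder with gradients stopped:

```python
def mse(a, b):
    """The JEPA-style objective: distance between predicted and actual target embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def encode(weights, tokens):
    """Toy 'encoder': one scalar weight per position (stand-in for a transformer)."""
    return [w * t for w, t in zip(weights, tokens)]

def jepa_loss(online_w, target_w, predictor_w, seq, mask_start, mask_end):
    # Context view: masked span zeroed out before encoding.
    context = [0.0 if mask_start <= i < mask_end else t for i, t in enumerate(seq)]
    ctx_emb = encode(online_w, context)
    # Target view: the full sequence through the (EMA, stop-gradient) target encoder.
    tgt_emb = encode(target_w, seq)
    # Predictor maps the context embedding toward the target embedding.
    pred = [p * c for p, c in zip(predictor_w, ctx_emb)]
    # Loss lives in embedding space -- no token-level reconstruction head.
    return mse(pred, tgt_emb)

# With nothing masked and identical encoders, the predictor has nothing to do:
print(jepa_loss([1.0] * 4, [1.0] * 4, [1.0] * 4, [1.0, 2.0, 3.0, 4.0], 0, 0))  # 0.0
# Masking positions 1..2 forces the predictor to account for the missing span:
print(jepa_loss([1.0] * 4, [1.0] * 4, [1.0] * 4, [1.0, 2.0, 3.0, 4.0], 1, 3))  # 3.25
```

The design choice this illustrates is that the model is rewarded for predicting a *representation* of the masked content, not its exact tokens — which is why the approach transfers across vision, robotics, and potentially sequence domains like codons.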
  • maziyar 2 days ago
    • xyz100 2 hours ago
      What makes this dataset or problem worth solving compared to other health datasets? Would the results on this task be broadly useful to health?
      • CyberDildonics 1 hour ago
        What other "datasets" are you talking about? How do you "solve a dataset" ?
  • yieldcrv 1 hour ago
Distributing the load on this will probably be infinitely more useful than Folding@home
  • HocusLocus 2 hours ago
    gray goo of the future