Smallest transformer that can add two 10-digit numbers

(github.com)

71 points | by ks2048 1 day ago

11 comments

alexlitz 17 minutes ago
I wrote a blog post on my submission (currently the top handwritten one at 36 parameters): https://alexlitzenberger.com/blog/building_a_minimal_transfo...
Sophira 10 minutes ago
I get that this is technically interesting, for certain, but the sheer amount of energy and associated global warming risk needed to do something with >=99% accuracy that we've been able to do easily for decades with a guaranteed 100% accuracy seems to me to be wasteful to the extreme.
- coolsunglasses 7 minutes ago
  >Hacker News
  not any more, eh?
- thereisnospork 7 minutes ago
  You need to recalibrate your sense of scale if you think that this is a geologically relevant usage of energy.
E-Reverance 1 hour ago
Not sure how well this fits the rules, but I saw someone on Twitter claim 28 params: https://gist.github.com/SeuperHakkerJa/da3050739bea97aabd86e...
amelius 2 hours ago
> In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
I wonder why they don't just write the code themselves, so by design the focus can be on the model.
i000 50 minutes ago
Would it make sense to embed such a single-purpose network with fixed weights within an LLM before pre-training?
medi8r 2 hours ago
You can do that in a single matmul of course.
- hyperhello 1 hour ago
  So can you take an arbitrary transformer and somehow turn it into a compact set of low-power fast gates by some algorithm?
  - measurablefunc 1 hour ago
    I think you're misunderstanding the joke.
    - medi8r 1 hour ago
      Yes, the joke is:
      [A B] × [1 1]^T = [A+B]
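For anyone who wants to check the joke, here is a minimal sketch in plain Python. The helper names `matmul` and `add_via_matmul` are made up for illustration and are not from the linked repo. It multiplies the row vector [A B] by a column of ones, which is addition on the raw numbers and involves no tokens or transformer at all.

```python
def matmul(x, w):
    """Multiply matrix x (m x k) by matrix w (k x n), both given as lists of rows."""
    inner = len(w)
    cols = len(w[0])
    return [
        [sum(row[t] * w[t][j] for t in range(inner)) for j in range(cols)]
        for row in x
    ]


def add_via_matmul(a, b):
    """Add two numbers with a single 1x2 by 2x1 matrix product."""
    ones_column = [[1], [1]]  # the "weights": a column vector of ones
    return matmul([[a, b]], ones_column)[0][0]
```

Python integers are arbitrary precision, so this is exact for 10-digit inputs. The catch, as the replies point out, is that the challenge feeds the model tokens rather than the numbers themselves.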
      - hyperhello 1 hour ago
        From context then, I infer that a transformer is not comprised of matrix multiplications, because it would simply be one that adds two 10-digit numbers.
        - medi8r 59 minutes ago
          A transformer tokenizes its input, then does a bunch of matmuls and ReLUs set up in a certain way. It doesn't get to see the raw number (just like you don't when you look at 1+1: you need the visual cortex etc. first).
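To make the "doesn't see the raw number" point concrete, here is a rough sketch of what a digit-level tokenizer might hand to the model. The vocabulary, the zero-padding, and the `a+b=` prompt layout are all assumptions for illustration; the thread does not say how the actual challenge encodes its inputs.

```python
# Hypothetical vocabulary: one token id per character.
VOCAB = {ch: i for i, ch in enumerate("0123456789+=")}


def tokenize(a, b, width=10):
    """Turn two non-negative integers into token ids for the prompt 'a+b='.

    Each operand is zero-padded to `width` digits, so the model only ever
    receives a fixed-length sequence of small integer ids.
    """
    text = f"{a:0{width}d}+{b:0{width}d}="
    return [VOCAB[ch] for ch in text]
```

With `width=1`, the prompt `1+1=` becomes the ids `[1, 10, 1, 11]`. The network has to recover place value and carries from ids like these, which is why a single matmul on the raw operands does not count.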
ks2048 1 hour ago
So hand-coded weights can do it with 36 params, versus 311 for trained weights. Did anyone try the former architecture, but starting from random weights and training?
- alexlitz 12 minutes ago
  For one, the specific 36-parameter version is impossible without float64, so you might guess the corollary that it is not exactly amenable to being found by gradient descent. I think the question is how you can structure transformers and neural nets in general so that they can both represent things like this very parsimoniously and be amenable to learning by gradient descent.
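The comment does not say why float64 is needed. One general fact that may be relevant is that float32 cannot even hold every 10-digit integer exactly, while float64 can. A small stdlib-only sketch follows; `to_float32` is a helper name invented here, and this is not a claim about how the 36-parameter model actually uses its precision.

```python
import struct


def to_float32(x):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


largest_operand = 9_999_999_999  # biggest 10-digit number

# float64 has a 53-bit significand, so integers of this size are exact.
as_float64 = float(largest_operand)

# float32 has a 24-bit significand; near 1e10 adjacent values are 1024 apart.
as_float32 = to_float32(largest_operand)
```

Here `as_float64` equals the original integer, while `as_float32` comes out as `10000000000.0`, so the last digits are already lost before any arithmetic happens.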
munro 51 minutes ago
>=99% accuracy wtf?!?
I was initially excited until I saw that, because it would reveal some sort of required local min capacity. The further revelation that this was all vibe coded, with no arXiv paper, makes me feel I should save my attn for another article.
1over137 57 minutes ago
Now wrap it all in an Electron app!
MarcLore 51 minutes ago
The gap between 36 hand-coded params and 311 trained params is fascinating and honestly underappreciated. It mirrors something we see repeatedly in ML: gradient descent finds solutions in a fundamentally different region of parameter space than a human engineer would design.
When you hand-code the weights, you're essentially implementing a known algorithm (carry-propagation) directly into the network topology. But trained networks often discover distributed representations that spread the computation across more parameters in ways that are harder to interpret but more robust to input distribution shifts.
I'd be curious whether the 311-param trained model generalizes better to bases other than 10, or to addition with different digit counts than it was trained on. In my experience, the 'messier' learned solutions sometimes capture more structural regularity than the clean engineered ones, precisely because they aren't locked into a single algorithmic strategy.
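For reference, the carry-propagation algorithm mentioned above looks roughly like this as ordinary code rather than as network weights. This is a sketch of the schoolbook method with hypothetical helper names; it is not taken from either the 36-parameter or the 311-parameter submission. The `base` parameter is included only because the comment raises bases other than 10.

```python
def to_digits(n, width=10, base=10):
    """Split a non-negative integer into `width` digits, least significant first."""
    return [(n // base**i) % base for i in range(width)]


def from_digits(digits, base=10):
    """Inverse of to_digits: rebuild the integer from its digits."""
    return sum(d * base**i for i, d in enumerate(digits))


def add_digits(a_digits, b_digits, base=10):
    """Add two equal-length digit lists, propagating the carry left to right."""
    result, carry = [], 0
    for da, db in zip(a_digits, b_digits):
        total = da + db + carry
        result.append(total % base)  # digit that stays in this position
        carry = total // base        # 0 or 1, passed to the next position
    result.append(carry)             # final carry becomes the extra top digit
    return result
```

For example, `from_digits(add_digits(to_digits(9999999999), to_digits(1)))` carries through all ten positions and yields `10000000000`. The same functions work unchanged in other bases when `base` is passed consistently.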
jaunt7632 40 minutes ago
[dead]