Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

(cocoawithlove.com)

151 points | by zdw 1 day ago

4 comments

dagmx 3 hours ago
This is a pretty phenomenal article.
Even for those who don’t care about LLM use, this is just a great article on optimizing Swift performance, which is sadly something there isn’t a lot of written material on.
I’m curious if the AMX instructions are truly secret. In theory you could use an M4 or above and get them via SME, I think, but I’m just guessing as I’ve never tried intrinsics from Swift myself.
- mathisfun123 2 hours ago
  > get them via SME
  I have no idea what this means - AMX was replaced by SME on M4. It's a new unit not just an "abstract intrinsic" (which would make zero sense).
  - dagmx 2 hours ago
    I’m not sure what part is confusing you or how to word it another way to make more sense to you.
    What I’m saying is that instead of using the secret AMX instructions, just use SME, assuming they have the hardware available to them.
    AMX isn’t truly gone afaik, at least according to the folks who have been looking at it. It’s just deprecated and it seems like the architecture treats them somewhat like aliases, preventing concurrent use within a process.
oflannabhra 2 hours ago
Matt Gallagher and CocoaWithLove are major highlights from the early days of my journey in learning iOS development. Awesome to see he is still publishing such high quality information!
- thomas_viaelo 1 hour ago
  Same. Matt Gallagher's posts on Swift concurrency and Cocoa memory ownership are still the clearest writing on those topics anywhere. This post is still the right answer to a Stack Overflow question I haven't asked yet.
adrian_b 2 hours ago
In TFA it is said that the C version used "-ffast-math" for the compiler to generate fused multiply-add operations.
ML/AI is one of the few applications where the use of "-ffast-math" may be acceptable, but in general one must not use "-ffast-math" to get FMA.
To enable FMA generation by the compiler, the right flag for both gcc and clang is "-ffp-contract=fast".
"-ffast-math" enables "-ffp-contract=fast", but it also enables a bunch of other code transformations that are very undesirable in any application where numerical accuracy matters, and which seldom bring any noticeable performance improvement.
Outside of ML/AI and graphics/games, "-ffast-math" should be used only by experts who fully understand the implications. Actually, even for experts, "-ffast-math" is unlikely to be more useful than selectively enabling only some of the many options that are aggregated into it.
The fact that most compilers still do not generate fused multiply-add operations by default in 2026, 36 years after the invention of this operation at IBM, is quite dumb.
In the overwhelming majority of cases, using FMA produces more accurate results than not using FMA. (The only cases when this is not true are encountered in certain expressions computed without FMA where some roundings happen to cancel each other.)
The reason why it has not been the default option is that the numeric results are different from those obtained on legacy computers without FMA, which was surprising for naive users. So FMA was disabled to ensure the same results as before, even if the old results were less correct.
This policy of mimicking legacy systems, just to avoid user confusion, should have become obsolete a long time ago.
nromiun 2 hours ago
> Is 1.1 Tflop/s good? Theoretically, the GPU on my M3 Max is capable of around 15 Tflop/s. But the real ceiling for this kind of task is going to be 3-5 Tflop/s
This is so true. And also why people should not take basic GPU benchmarks so seriously. Getting peak performance out of a GPU is much more complex than it is with a CPU.
And it is one of the reasons why Nvidia still has a software moat compared to other GPU companies. CUDA has so many small kernels tuned for getting peak performance for your dataset.
- billti 2 hours ago
  I keep this link in my favorites and refer to it every now and again. Still one of the best write-ups I've seen on just how vast the difference is between a naive and a well-tuned kernel:
  https://siboehm.com/articles/22/CUDA-MMM